
# Machine learning terms and metrics

FMML Module 1, Lab 2

In this lab, we will walk through part of the ML pipeline: extracting features, training, and testing.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# seed the random number generator so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes such as the income of the block and the age of the houses in the district. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
# truncate the targets to integers so that we can classify
# (np.int was deprecated in NumPy 1.20 and later removed; the built-in int is equivalent)
dataset.target = dataset.target.astype(int)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that predicts a label with the 1-nearest-neighbour rule, and a wrapper that applies it to every test sample:

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
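
To make the distance computation concrete, here is a minimal sketch that restates `NN1` on a tiny hand-made dataset. The name `nn1` and the toy arrays are assumptions made for illustration only; they are not part of the lab data.

```python
import numpy as np

def nn1(traindata, trainlabel, query):
    # Squared Euclidean distance from the query to every training row.
    # Broadcasting subtracts the (d,) query from each row of the (n, d) array.
    dist = ((traindata - query) ** 2).sum(axis=1)
    # The square root is skipped: it does not change which row is closest.
    return trainlabel[np.argmin(dist)]

# Hypothetical toy data: three 2-D training points with labels 0, 1, 2.
toy_train = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
toy_label = np.array([0, 1, 2])

pred_a = nn1(toy_train, toy_label, np.array([4.0, 4.5]))  # nearest point is [5, 5]
pred_b = nn1(toy_train, toy_label, np.array([9.0, 1.0]))  # nearest point is [10, 0]
print(pred_a, pred_b)
```

Because the features are compared on their raw scales, an attribute with a large numeric range dominates the distance. Whether that matters for this dataset is something to check experimentally.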

We will also define a 'random classifier', which assigns a label chosen uniformly at random to each sample.

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used; trainlabel is only needed to find the set of classes

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
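
As a quick check of the definition, the following sketch computes the accuracy by hand on a small pair of label arrays. The arrays are made up for illustration.

```python
import numpy as np

# Hypothetical ground-truth and predicted labels for 8 samples.
gt = np.array([0, 1, 2, 2, 1, 0, 3, 3])
pred = np.array([0, 1, 2, 1, 1, 0, 3, 0])

matches = gt == pred                 # element-wise comparison, one boolean per sample
accuracy = matches.sum() / len(gt)   # correct predictions divided by all predictions
print(accuracy)                      # 6 of the 8 labels agree
```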

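Let us make a function that splits the dataset at random. Each sample goes into the first split with the given probability (`percent`, a fraction between 0 and 1), so the split sizes are only approximately the requested fraction.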

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
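
The sketch below illustrates why a split made this way does not have an exact size: every sample is assigned independently, so the count fluctuates around the requested fraction. The sample count and the seed are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for this illustration

n_samples = 10_000
percent = 0.2
rnd = rng.random(n_samples)      # one uniform draw in [0, 1) per sample
in_split1 = rnd < percent        # True with probability `percent`
in_split2 = ~in_split1           # the complement, same as rnd >= percent

n1 = int(in_split1.sum())
n2 = int(in_split2.sum())
print(n1, n2, n1 / n_samples)    # n1 is close to, but rarely exactly, 2000
```

If an exact split size is needed, one alternative is to shuffle the indices with `rng.permutation` and cut them at a fixed position.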

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set.

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


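For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour at distance zero (this can only fail if two identical samples carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.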
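
The 1/(number of classes) figure holds for uniform random guessing even when the classes are imbalanced. The following sketch checks this by simulation; the class proportions are invented for the example and are not those of the housing data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000
n_classes = 6

# Hypothetical, deliberately imbalanced ground truth over 6 classes.
probs = [0.4, 0.3, 0.15, 0.1, 0.04, 0.01]
gt = rng.choice(n_classes, size=n, p=probs)

# Uniform random guesses, as in RandomClassifier.
pred = rng.integers(low=0, high=n_classes, size=n)

acc = (gt == pred).mean()
print(acc, 1 / n_classes)  # the two values should be close
```

Each guess matches the true label with probability 1/6 no matter what the true label is, which is why the class proportions do not matter here.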

Let us predict the labels for our validation set and get the accuracy.

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour (about 0.34) is considerably lower than its train accuracy (1.0), while the random classifier stays at roughly the same level (about 0.17). Even so, nearest neighbour is about twice as accurate as the random classifier on the validation set.

Now let us try another random split and check the validation accuracy.

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try different random splits.
The accuracy differs from run to run, but the values stay close together.

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values from your experiments and plot a graph using <a href="https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py">plt.plot</a>. Also check extreme values for the split, like 99.9% or 0.1%.
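
One possible way to organise the experiment suggested above is sketched below. To stay self-contained and fast it uses a small synthetic dataset rather than the housing data, and the helper `nn_predict` is a vectorised stand-in for `NN` introduced only for this example. The numbers it prints are therefore not the lab's results.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def nn_predict(traindata, trainlabel, testdata):
    # Pairwise squared distances via broadcasting: shape (n_test, n_train).
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

# Hypothetical stand-in data: 3 overlapping Gaussian classes in 2-D.
n_per_class = 200
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
data = np.concatenate([c + rng.normal(size=(n_per_class, 2)) for c in centers])
label = np.repeat(np.arange(3), n_per_class)

fractions = [0.01, 0.1, 0.5, 0.75, 0.9, 0.99]  # fraction kept for training
mean_acc, std_acc = [], []
for frac in fractions:
    accs = []
    for _ in range(20):
        mask = rng.random(len(label)) < frac
        if mask.sum() == 0 or (~mask).sum() == 0:
            continue  # skip degenerate splits with an empty side
        pred = nn_predict(data[mask], label[mask], data[~mask])
        accs.append((pred == label[~mask]).mean())
    mean_acc.append(np.mean(accs))  # average validation accuracy at this fraction
    std_acc.append(np.std(accs))    # how much it varies from split to split

for frac, m, s in zip(fractions, mean_acc, std_acc):
    print(f"train fraction {frac:.2f}: mean val acc {m:.3f}, std {s:.3f}")
```

The lists `fractions`, `mean_acc` and `std_acc` can be passed to `plt.plot` or `plt.errorbar` to draw the graph. To repeat this on the lab data, swap in `alltraindata`, `alltrainlabel` and the lab's `NN`, keeping in mind that `NN` is much slower.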

### Question 1

The size of the validation set in a machine learning experiment can have an impact on the accuracy and generalization performance of your model. Here's how it generally works:

### Increase the Percentage of Validation Set:

1. **Pros:**
   - **Better Generalization Evaluation:** A larger validation set provides a more reliable estimate of how well your model will generalize to unseen data. It helps ensure that the performance metrics are representative of the model's true ability to generalize.
   - **Reduced Variance:** With a larger validation set, the performance evaluation is less sensitive to the specific examples chosen for validation, reducing the variability in performance metrics.

2. **Cons:**
   - **Reduced Training Data:** The downside is that as you increase the size of the validation set, you have less data available for training. This could potentially lead to a less effective model if the training set is too small.

### Reduce the Percentage of Validation Set:

1. **Pros:**
   - **More Data for Training:** A smaller validation set means more data is available for training. This can be beneficial if your dataset is limited, and you want to maximize the amount of data used for model training.

2. **Cons:**
   - **Less Reliable Generalization Estimate:** A smaller validation set might provide a less reliable estimate of your model's ability to generalize. The performance metrics on a small validation set may be more prone to randomness or outliers.

### Finding the Right Balance:

The choice of the validation set size often involves a trade-off between having a reliable estimate of generalization performance and having sufficient data for training. It's common to use techniques like cross-validation (e.g., k-fold cross-validation) to mitigate the impact of the validation set size on model evaluation.

In practice, there is no one-size-fits-all answer, and the optimal size of the validation set depends on the specifics of your dataset, the complexity of your model, and other factors. Experimenting with different validation set sizes and using cross-validation can help you find the right balance for your particular case.
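
The reliability argument above can be made roughly quantitative with the binomial standard error of an accuracy estimate. This is a sketch under simplifying assumptions: the validation samples are treated as independent and the classifier as fixed. The function name is hypothetical, and 0.34 is taken from the nearest-neighbour validation accuracy reported earlier.

```python
import math

def accuracy_standard_error(p, n_val):
    # Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n).
    return math.sqrt(p * (1.0 - p) / n_val)

p = 0.34  # roughly the nearest-neighbour validation accuracy seen above
sizes = (10, 100, 1000, 4000)
errors = [accuracy_standard_error(p, n) for n in sizes]
for n, se in zip(sizes, errors):
    print(f"{n:5d} validation samples: standard error about {se:.4f}")
```

The error falls with the square root of the validation set size, so quadrupling the validation set only halves the uncertainty.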




### Question 2
The size of the train and validation sets can impact how well the model generalizes to unseen data (like the test set). Here's how the size of these sets can affect the prediction of accuracy on the test set using the validation set:

1. **Large Training Set:**
   - **Pros:**
     - A larger training set generally allows the model to learn more complex patterns in the data. This can be particularly beneficial for complex models.
     - The model might have a better understanding of the underlying data distribution.

   - **Cons:**
     - If the validation set is too small, the accuracy estimate on the validation set may be less reliable. A small validation set might not capture the diversity of the data, leading to an inaccurate assessment of the model's performance.

2. **Large Validation Set:**
   - **Pros:**
     - A larger validation set provides a more reliable estimate of how well the model generalizes to unseen data. It reduces the variability in performance metrics and makes the evaluation more robust.

   - **Cons:**
     - A large validation set means less data is available for training, which might limit the model's ability to capture complex patterns in the data.

3. **Balanced Split:**
   - **Pros:**
     - A balanced split between training and validation sets attempts to strike a balance between having enough data for training and having a reliable estimate of generalization performance.

   - **Cons:**
     - It may not be optimal for all scenarios. The ideal split often depends on the size and nature of the dataset.

4. **Overfitting to Validation Set:**
   - **Cons:**
     - If the model is tuned too much based on the validation set (e.g., hyperparameter tuning), it might overfit to the validation set and not generalize well to the test set.

In summary, finding the right balance between the size of the training and validation sets is crucial. It involves considering the complexity of the model, the size of the dataset, and the need for a reliable estimate of generalization performance. Techniques like cross-validation can also be used to mitigate the impact of a fixed train-validation split and provide a more robust estimate of model performance.




### Question 3
There is no one-size-fits-all answer to the question of what percentage to reserve for the validation set, as the optimal split depends on various factors including the size and nature of your dataset, the complexity of your model, and the specific goals of your machine learning task. However, some common practices and guidelines can be considered:

1. **Rule of Thumb:**
   - A common practice is to use a split like 80-20 or 70-30 for training-validation. For example, 80% of the data for training and 20% for validation. This split often strikes a reasonable balance between having enough data for training and a sufficiently large validation set for reliable performance estimation.

2. **Cross-Validation:**
   - Instead of a fixed train-validation split, you might consider using cross-validation (e.g., k-fold cross-validation). This involves dividing the dataset into k folds and using each fold as a validation set while training on the remaining data. This helps in obtaining a more robust estimate of model performance.

3. **Data Size Consideration:**
   - If you have a large dataset, you might be able to allocate a smaller percentage to the validation set and still obtain reliable estimates. Conversely, with a smaller dataset, you might need a larger validation set to get a good estimate.

4. **Model Complexity:**
   - If your model is relatively simple, it might require less data for training and a smaller validation set could be sufficient. For complex models, you might want a larger validation set to ensure a reliable estimate of generalization performance.

5. **Experimentation:**
   - It's often a good idea to experiment with different splits and evaluate the impact on your model's performance. This can involve trying different percentages for validation and observing how it affects the model's ability to generalize to unseen data.

In summary, there's no magic percentage that works for all situations. The best approach is to consider the specific characteristics of your dataset and model, and possibly experiment with different splits or use cross-validation to find a suitable balance between training and validation sets.
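
The k-fold procedure mentioned above can be sketched as follows. This is a minimal index generator written for illustration, not a library API; `kfold_indices` is a hypothetical name.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # Shuffle once, then cut the shuffled indices into k nearly equal folds.
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, k)
    for i in range(k):
        val_idx = folds[i]  # fold i is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

rng = np.random.default_rng(seed=0)
splits = list(kfold_indices(n_samples=10, k=5, rng=rng))
for train_idx, val_idx in splits:
    print("train:", np.sort(train_idx), "val:", np.sort(val_idx))
```

Unlike the random splits used in this lab, every sample lands in the validation set exactly once, and each model trains on a fraction (k-1)/k of the data.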

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [13]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [14]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666
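
`AverageAccuracy` reports only the mean. A variant that also returns the standard deviation across splits shows how far a single split could be from that mean. The sketch below assumes a small synthetic dataset and a vectorised nearest-neighbour helper, both introduced only for this example, so that it runs quickly on its own.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def nearest_neighbour(traindata, trainlabel, testdata):
    # Pairwise squared distances, shape (n_test, n_train), then closest train row.
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

def accuracy_mean_std(alldata, alllabel, splitpercent, iterations, classifier):
    accs = []
    for _ in range(iterations):
        mask = rng.random(len(alllabel)) < splitpercent  # True -> train side
        pred = classifier(alldata[mask], alllabel[mask], alldata[~mask])
        accs.append((pred == alllabel[~mask]).mean())
    return float(np.mean(accs)), float(np.std(accs))

# Hypothetical stand-in data: two overlapping Gaussian classes in 2-D.
data = np.concatenate([rng.normal(0.0, 1.0, size=(150, 2)),
                       rng.normal(1.5, 1.0, size=(150, 2))])
label = np.repeat([0, 1], 150)

mean_acc, std_acc = accuracy_mean_std(data, label, 0.75, 10, nearest_neighbour)
print(f"validation accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```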


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross-validation (Wikipedia)</a>.

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


### Question 1
Yes, averaging the validation accuracy across multiple splits, especially when using techniques like cross-validation, generally provides more consistent and reliable results compared to relying on a single train-validation split. Here's why:

1. **Reduces Variability:**
   - Cross-validation involves splitting the dataset into multiple folds, training the model on different subsets, and validating it on different subsets. Averaging the results over these multiple splits helps reduce the impact of randomness or the specific choice of data in a single split. This, in turn, provides a more stable and reliable estimate of model performance.

2. **Better Generalization Estimate:**
   - By using multiple splits, you get a more comprehensive view of how well your model generalizes to different parts of the dataset. This helps in obtaining a more robust estimate of the model's true ability to perform on unseen data.

3. **Mitigates Overfitting to a Specific Split:**
   - Averaging over multiple splits also helps mitigate the risk of overfitting to a specific train-validation split. If the model performs exceptionally well on a particular split due to chance, the average performance over multiple splits is likely to provide a more realistic assessment of the model's capabilities.

4. **Provides Confidence Intervals:**
   - The variability observed across different splits in cross-validation can be used to estimate confidence intervals for your performance metrics. This can give you a sense of the stability and reliability of your model's performance estimates.

5. **More Representative Evaluation:**
   - Averaging over multiple splits ensures that the model is evaluated on a diverse set of training and validation data, providing a more representative evaluation of its generalization performance.

In summary, while a single train-validation split can be susceptible to the specific characteristics of that split, averaging validation accuracy across multiple splits, especially with techniques like cross-validation, enhances the reliability and stability of your performance estimates. This is particularly important when making decisions about model selection, hyperparameter tuning, or assessing the overall performance of your machine learning model.


### Question 2
Averaging validation accuracy across multiple splits, particularly when using techniques like cross-validation, can provide a more reliable and stable estimate of how well your model is likely to perform on unseen data, such as a test set. However, it's essential to clarify the terminology here:

1. **Validation Accuracy vs. Test Accuracy:**
   - **Validation Accuracy:** This is the accuracy metric calculated on the validation set during the training process. It helps you assess how well your model is performing on data it hasn't seen during training.
   - **Test Accuracy:** This is the accuracy metric calculated on a completely separate test set that the model has not encountered during training or validation. The test set is used to evaluate the final performance of the model.

2. **Cross-Validation and Generalization Estimate:**
   - Cross-validation provides a more robust estimate of the model's generalization performance, which is its ability to perform well on unseen data. The average performance across multiple folds in cross-validation is considered a more reliable indicator of how well the model might perform on a truly unseen test set.

3. **Correlation with Test Accuracy:**
   - While a good performance on cross-validation is indicative of a model's ability to generalize, it doesn't guarantee an identical performance on a test set. However, there is often a positive correlation between good cross-validation performance and good test set performance.

4. **Limitations:**
   - The test set is crucial for obtaining a final, unbiased evaluation of your model. Cross-validation helps you make more informed decisions during development, but the ultimate test is how well the model performs on data it has never seen before.

In summary, while averaging validation accuracy across multiple splits with cross-validation provides a more accurate and stable estimate of a model's generalization performance, it does not directly give you the test accuracy. The test accuracy needs to be evaluated separately on a dedicated test set that has not been used during training or cross-validation. The cross-validation results can guide your decisions and provide confidence in the model's performance, but the final assessment should always be based on the test set.


### Question 3
In `AverageAccuracy`, the number of iterations is the number of random train/validation splits whose validation accuracies are averaged. It is not the number of training epochs: the nearest-neighbour classifier has no iterative training at all. Here's how the number of iterations affects the estimate:

1. **Less Run-to-Run Variation:**
   - Each split gives a noisy validation accuracy. Because the splits are drawn independently from the same data, the spread of their average shrinks roughly in proportion to 1/sqrt(iterations), so repeated calls to `AverageAccuracy` agree more closely when more iterations are used.

2. **Diminishing Returns:**
   - Going from 1 to 10 iterations removes most of the noise. Going from 100 to 1000 changes the average very little, while the running time grows linearly with the number of iterations.

3. **No Effect on Bias:**
   - Every split trains on only part of the available data (75% here), so each validation accuracy describes a model trained on fewer samples than the one evaluated on the test set. Averaging more splits does not remove this gap. This is consistent with the results above, where the average validation accuracy (0.336) sits slightly below the test accuracy (0.349).

4. **Limited by the Data Itself:**
   - All splits reuse the same samples, so the average converges to a value that still depends on this particular dataset. More iterations cannot add information that the data does not contain.

In summary, more iterations give a more stable estimate, with rapidly diminishing returns, but not an unbiased one. A moderate number of iterations is usually enough to make the split-to-split noise small compared with the other sources of error.
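
For the averaging procedure in `AverageAccuracy`, where an iteration is one random train/validation split, the effect of the iteration count can be measured directly: repeat the whole averaging several times and look at how much the averaged estimate moves. The sketch below does this on a small synthetic dataset. The data, the helper names and the repeat count of 30 are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def nearest_neighbour(traindata, trainlabel, testdata):
    # Pairwise squared distances, shape (n_test, n_train), then closest train row.
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

def average_accuracy(data, label, splitpercent, iterations):
    total = 0.0
    for _ in range(iterations):
        mask = rng.random(len(label)) < splitpercent  # True -> train side
        pred = nearest_neighbour(data[mask], label[mask], data[~mask])
        total += (pred == label[~mask]).mean()
    return total / iterations

# Hypothetical stand-in data: two overlapping Gaussian classes in 2-D.
data = np.concatenate([rng.normal(0.0, 1.0, size=(150, 2)),
                       rng.normal(1.5, 1.0, size=(150, 2))])
label = np.repeat([0, 1], 150)

spread = {}
for iterations in (1, 5, 25):
    # Repeat the full averaging 30 times to see how stable its result is.
    estimates = [average_accuracy(data, label, 0.75, iterations) for _ in range(30)]
    spread[iterations] = float(np.std(estimates))
    print(f"{iterations:2d} iterations: spread of the averaged estimate {spread[iterations]:.4f}")
```

The spread shrinks as the number of iterations grows, but the value the estimates settle around is still tied to this dataset and to training on 75% of it.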



### Question 4
Only partly. Increasing the number of iterations (random splits) helps with a very small validation set, but it cannot make up for a very small train set. Here are some considerations:

### Very Small Validation Set:

1. **Noisy Single Estimates:**
   - With only a handful of validation samples, the accuracy from one split can take just a few distinct values and varies a lot from split to split.

2. **Averaging Helps:**
   - Averaging over many splits smooths out this noise. Since the train set is then almost the full dataset, the average is close to the accuracy of a model trained on all the data. Leave-one-out cross-validation is the extreme case of this idea.

3. **Higher Cost:**
   - Many more iterations are needed for a stable average, and each one runs the classifier again, so the running time grows accordingly.

### Very Small Train Set:

1. **Weaker Models:**
   - Each model sees very little data, so its accuracy is typically lower than that of a model trained on the full train set.

2. **Averaging Does Not Help:**
   - More iterations only make the average of these lower accuracies more stable. The estimate converges to the accuracy of a model trained on a tiny dataset, which typically underestimates the test accuracy of the final model.

### Alternatives:

1. **k-Fold Cross-Validation:**
   - Every sample is used for validation exactly once, and each model still trains on most of the data.

2. **A Moderate Split with Several Iterations:**
   - Keeping most of the data for training and averaging over a modest number of splits, as done above with 75% and 10 iterations, balances both concerns.

In conclusion, iterations reduce the randomness of the estimate but not its bias: they can compensate for a small validation set, not for a small train set.