# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through part of the ML pipeline: extracting features, training, and testing.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# set the random seed so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes like income of the block, age of the houses per district etc. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
dataset.target = dataset.target.astype(int)  # truncate to integers so that we can classify (np.int was removed in NumPy 1.24)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that implements the 1-nearest-neighbour classifier.

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
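
To see what the nearest-neighbour rule does on a small case, here is a minimal sketch using made-up 2-D points. The points, labels, and the name `nn1` are illustrative assumptions and are not part of the lab code; the distance computation mirrors `NN1` above.

```python
import numpy as np

def nn1(traindata, trainlabel, query):
    # squared Euclidean distance from the query to every training sample
    dist = ((traindata - query) ** 2).sum(axis=1)
    # the prediction is the label of the closest training sample
    return trainlabel[np.argmin(dist)]

# three made-up training points with two classes
toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 1])

# two made-up queries: one near the origin, one near (5, 5)
toy_test = np.array([[0.2, 0.1], [4.0, 4.5]])
toy_pred = np.array([nn1(toy_train, toy_label, q) for q in toy_test])
print(toy_pred)  # [0 1]
```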

We will also define a 'random classifier', which assigns a random label to each sample.

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used; it is kept so that the signature matches NN

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
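
As a quick worked example of the accuracy formula, here is a minimal sketch on five made-up labels (the arrays `gt` and `pred` are illustrative, not lab data). Three of the five predictions match, so the accuracy is 3/5.

```python
import numpy as np

gt = np.array([1, 0, 2, 2, 1])    # made-up ground-truth labels
pred = np.array([1, 0, 1, 2, 0])  # made-up predictions

# element-wise comparison gives [True, True, False, True, False]
correct = (gt == pred).sum()
acc = correct / len(gt)
print(acc)  # 0.6
```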

Let us make a function to split the dataset with the desired probability.

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
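
Because `split` draws an independent random number for every sample, the requested percentage is only met approximately. The sketch below illustrates this on 10,000 dummy samples; `demo_rng`, `n`, and the binomial standard deviation used as a yardstick are assumptions added for illustration.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)  # separate generator, for illustration only
n = 10_000
percent = 0.2

# same rule as in split(): a sample goes to the first part if its random number is below percent
mask = demo_rng.random(n) < percent
frac = mask.mean()

# each sample is an independent coin flip, so the fraction follows a binomial distribution
sd = np.sqrt(percent * (1 - percent) / n)
print(f"requested {percent:.3f}, got {frac:.4f} (expected spread about {sd:.4f})")
```

The two parts always cover the whole dataset without overlap; only their relative sizes vary from call to call.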

We will reserve about 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data (about 25%) as a validation set.

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


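For nearest neighbour, the train accuracy is always 1, because every training sample is its own nearest neighbour at distance zero (unless two identical samples carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6, or about 0.1667, in our case.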
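
The 1/(number of classes) figure holds for uniform random guessing even when the true classes are imbalanced. The following minimal sketch checks this by simulation; the class proportions, the sample count, and the name `demo_rng` are made-up assumptions, not values from the housing data.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=1)  # separate generator, for illustration only
k = 6        # number of classes, as in the lab
n = 60_000   # number of simulated samples

# made-up, deliberately imbalanced ground-truth labels
true = demo_rng.choice(k, size=n, p=[0.4, 0.3, 0.15, 0.1, 0.04, 0.01])

# uniform random guesses, drawn the same way as in RandomClassifier
guess = demo_rng.integers(low=0, high=k, size=n)

acc = (true == guess).mean()
print(f"simulated accuracy {acc:.4f}, 1/k = {1 / k:.4f}")
```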

Let us predict the labels for our validation set and get the accuracy

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
The accuracy differs slightly from run to run, but the values stay close together.

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the splits, like 99.9% or 0.1%.
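
One way to run this experiment is sketched below. It is a minimal, self-contained example that uses a small synthetic dataset as a stand-in for the housing data (an assumption, so that it runs in a moment); `nn_predict`, `sweep`, and `demo_rng` are illustrative names that do not appear in the lab code.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)  # separate generator, for illustration only

def nn_predict(traindata, trainlabel, testdata):
    # squared Euclidean distance between every test sample and every train sample
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

def sweep(data, label, train_fractions, repeats=5):
    """For each train fraction, return (mean, std) of validation accuracy over random splits."""
    results = {}
    for frac in train_fractions:
        accs = []
        for _ in range(repeats):
            mask = demo_rng.random(len(label)) < frac
            if mask.sum() == 0 or (~mask).sum() == 0:
                continue  # skip degenerate splits with an empty part
            pred = nn_predict(data[mask], label[mask], data[~mask])
            accs.append((pred == label[~mask]).mean())
        if accs:
            results[frac] = (np.mean(accs), np.std(accs))
    return results

# synthetic stand-in: three Gaussian blobs in 2-D, one per class
centers = np.array([[0, 0], [2, 2], [4, 0]])
toy_label = demo_rng.integers(0, 3, size=600)
toy_data = centers[toy_label] + demo_rng.normal(scale=1.0, size=(600, 2))

results = sweep(toy_data, toy_label, [0.02, 0.1, 0.5, 0.9, 0.98])
for frac, (mean, std) in results.items():
    print(f"train fraction {frac:.2f}: mean val accuracy {mean:.3f}, std {std:.3f}")
```

The mean and standard deviation for each fraction can then be passed to `plt.plot` or `plt.errorbar` to draw the graph. To repeat the experiment on the lab data, the same loop can be written with `split`, `NN`, and `Accuracy`, at the cost of a much longer run time.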

#1 How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?

The size of the validation set affects how reliable the measured accuracy is. Because every validation sample is taken away from training, it also affects the model being measured.

Increasing the percentage of the validation set:

More reliable evaluation: The validation accuracy is averaged over more samples, so it fluctuates less from one random split to the next.

Less training data: Every sample moved into the validation set is removed from the train set. For nearest neighbour, fewer stored training samples means the closest neighbour is typically farther away, so the validation accuracy tends to drop. At an extreme such as 99.9% validation, only around 16 of the 16496 samples are left for training.

Decreasing the percentage of the validation set:

More training data: The nearest neighbour classifier has more samples to match against, so its accuracy tends to be higher.

Less reliable evaluation: With few validation samples, the measured accuracy is noisy. At an extreme such as 0.1% validation, it is computed on only a handful of samples and can swing widely between splits.

Risk of overfitting to the validation set: When a small validation set is used repeatedly to choose between models or hyperparameters, the choice can end up fitted to the quirks of those few samples, and the chosen model may do worse on new data than its validation accuracy suggests.

The random classifier ignores the training features, so its expected validation accuracy stays near 1/(number of classes) for any split. Only the spread of the measured value changes with the validation set size.

#2 How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?



The sizes of the train and validation sets determine how closely the validation accuracy tracks the test accuracy. Several factors are involved:

Representativeness of the validation set: If the validation set is very small or lacks diversity, it may not cover the range of cases in the test set, and its accuracy is then a poor indicator of test performance.

Sample variability: A larger validation set gives a more stable estimate because it is less sensitive to which samples happened to be selected.

Size of the train set: The model scored on the test set is trained on all the training data, but the model scored on the validation set is trained on only part of it. If that part is small, the validated model is weaker than the final one, and the validation accuracy underestimates the test accuracy. In the extreme, a model trained on very few samples has a high train accuracy (1.0 for nearest neighbour) but a low validation accuracy.

Statistical confidence: With a small validation set, a difference of a few correctly classified samples changes the accuracy noticeably, so apparent differences between models may be due to chance. A larger validation set reduces this risk.

Hyperparameter tuning: When tuning hyperparameters (e.g., learning rate, regularization strength), a larger validation set shows more clearly how each setting affects performance. A small validation set may not provide enough samples for robust tuning.

Computational resources: A smaller validation set is faster to evaluate. For nearest neighbour there is no training step, and the prediction cost grows with the number of train samples times the number of validation samples.
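
The link between validation set size and reliability can be made concrete with the binomial standard error, sqrt(p(1-p)/n), for an accuracy p measured on n samples. The sketch below is an approximation that assumes the validation samples are independent; the function name `accuracy_std_error` is hypothetical, the accuracy 0.34 is the rough nearest-neighbour value seen above, and the sizes correspond to about 0.1%, 1%, and 25% of the 16496 non-test samples.

```python
import math

def accuracy_std_error(p, n):
    """Binomial standard error of an accuracy p measured on n independent samples."""
    return math.sqrt(p * (1 - p) / n)

# validation set sizes for roughly 0.1%, 1% and 25% of 16496 samples
for n in (16, 165, 4124):
    se = accuracy_std_error(0.34, n)
    print(f"n = {n:5d}: accuracy 0.34 +/- {se:.3f}")
```

With 16 validation samples the uncertainty is about 0.12, which is a third of the accuracy itself; with 4124 samples it falls below 0.01.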

#3 What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

The percentage of data to reserve for the validation set should be chosen carefully to strike a balance between having a representative sample for model evaluation and having enough data for model training. While there is no one-size-fits-all answer, a common practice is to allocate around 10-20% of your total dataset for validation when you have a reasonably sized dataset.

Here are some considerations for choosing the percentage of data for the validation set:

Dataset Size: The total size of your dataset is a crucial factor. If you have a large dataset (tens of thousands of samples or more), you can afford to allocate a smaller percentage to the validation set (e.g., 10%). In contrast, with a smaller dataset, you may want to allocate a larger percentage (e.g., 20%) to ensure you have enough data for reliable validation.

Complexity of the Problem: More complex problems might require larger validation sets. If your problem involves a high degree of variability or if it's challenging for your model to generalize, a larger validation set can help capture this complexity.

Resource Constraints: Consider your available computational resources. A smaller validation set is faster to evaluate, which can be advantageous when you have limited resources or need to iterate quickly.

Cross-Validation: If you are concerned about the balance between validation and training data, you can employ cross-validation techniques. For example, k-fold cross-validation divides your dataset into k subsets, using each as a validation set while training on the remaining data. This helps you use your data more efficiently and provides multiple estimates of model performance.

Domain Knowledge: Your domain expertise can guide your decision. Some domains may require specific validation set sizes based on known characteristics of the data or problem.

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [13]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [14]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.
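
For comparison with the random splits used above, here is a minimal sketch of how k-fold cross-validation partitions the data. It only generates index sets; `kfold_indices` and `demo_rng` are hypothetical names, and this is not the implementation used by any particular library.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    """Yield (train_idx, val_idx) pairs so that each sample is validated exactly once."""
    perm = rng.permutation(n_samples)  # shuffle the sample indices once
    folds = np.array_split(perm, k)    # k nearly equal, non-overlapping folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

demo_rng = np.random.default_rng(seed=0)  # separate generator, for illustration only
splits = list(kfold_indices(10, 5, demo_rng))
for train_idx, val_idx in splits:
    print("train:", np.sort(train_idx), "val:", np.sort(val_idx))
```

Unlike repeated random splits, the validation folds here never overlap, so every sample contributes to the validation accuracy exactly once.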

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


#1 Does averaging the validation accuracy across multiple splits give more consistent results?

Yes. Averaging the validation accuracy across multiple splits, whether by the repeated random splits used here or by k-fold cross-validation, generally gives more consistent results than a single validation split. Here is why:

Reduced Variability: Averaging over multiple splits reduces the impact of random variations in the data that can affect the performance estimate. In a single validation split, the choice of which data points end up in the validation set can influence the result significantly. With multiple splits, these variations tend to balance out, resulting in a more stable and reliable estimate of model performance.

Better Generalization Assessment: By repeatedly splitting the data into different subsets, cross-validation provides a more comprehensive assessment of how well your model generalizes to various parts of the dataset. This helps ensure that your performance estimate is representative of the model's ability to perform well on unseen data.

Mitigating Data Imbalance: When some classes are much rarer than others, a single split may leave very few of their samples in the validation set. Over multiple splits, each class is more likely to be represented in both the train and validation sets, although only a stratified split guarantees this.

More Efficient Data Utilization: Cross-validation makes more efficient use of your dataset by repeatedly using each data point for both training and validation, reducing the risk of wasting valuable data.

Hyperparameter Tuning: When tuning hyperparameters or selecting the best model among several candidates, cross-validation provides a more robust basis for comparison. It allows you to assess how well each model configuration generalizes across multiple data subsets.

Statistical Significance: With several accuracy values instead of one, you can report their spread (for example, the standard deviation) alongside the mean. Because the splits share samples, the values are not independent, so confidence intervals and hypothesis tests computed from them should be treated as approximate.

#2 Does it give a more accurate estimate of test accuracy?

Averaging the validation accuracy across multiple splits in techniques like k-fold cross-validation provides a more accurate estimate of the model's generalization performance compared to a single validation split. However, it's important to note that the test accuracy and the cross-validation accuracy are not the same thing.

Here's a breakdown of how these concepts relate:

Validation Accuracy (Cross-Validation Accuracy): This is an estimate of how well your model is expected to perform on new, unseen data based on the validation splits within your training data. When you perform k-fold cross-validation, you compute multiple validation accuracies, one for each fold. Averaging these validation accuracies provides a more robust estimate of your model's performance on unseen data compared to a single validation split.

Test Accuracy: This is the performance metric you obtain when you evaluate your trained model on a completely independent and previously unseen dataset, often referred to as the "test set." The test set is distinct from your training and validation data and serves as a final and realistic assessment of how well your model generalizes to new, real-world examples.

While cross-validation helps you estimate how well your model might perform on new data based on your training data, the test accuracy provides a more accurate and final evaluation because it evaluates the model on data it has never seen during training or validation. The test accuracy is often considered the gold standard for assessing model performance because it simulates how the model will perform in real-world scenarios.

In summary, averaging validation accuracies across multiple splits gives a more reliable estimate than a single split, but not necessarily an unbiased one. In this lab, the average validation accuracy (about 0.336) came out slightly below the test accuracy (about 0.349), which is consistent with the validation models being trained on only 75% of the training data while the final model used all of it. The final measure of a model's performance on new data remains the test accuracy, evaluated on an independent test set that the model has not been exposed to during training or validation.

#3 What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?

The number of iterations affects how stable the averaged estimate is, but not what it is an estimate of.

More iterations: Averaging over more random splits reduces the variance of the estimate, so repeated runs of the whole procedure agree more closely. The improvement has diminishing returns: if the splits were independent, the spread of the average would shrink roughly as 1/sqrt(iterations), and since the splits share samples, the actual improvement is somewhat smaller.

Fewer iterations: The estimate is more sensitive to the particular random splits drawn, especially when the dataset or the validation set is small.

Bias: Increasing the number of iterations does not remove bias. Each validation model is still trained on only a fraction of the available data, so the average tends to stay slightly below the accuracy of a model trained on all of it, however many splits are averaged. In k-fold cross-validation the number of folds also sets the training fraction, so more folds do reduce this bias, but that is a different setting from the number of repetitions used here.

Computational cost: The cost grows linearly with the number of iterations, since the classifier is run once per split. You should consider your available computational resources when choosing the number of iterations.

Choosing the number of iterations: There is no one-size-fits-all answer. The choice depends on the size of your dataset, the computational resources available, and how stable the estimate needs to be. In k-fold cross-validation, 5 and 10 folds are common choices, and leave-one-out cross-validation (where each sample is its own fold) is sometimes used for small datasets.
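
The effect of the number of iterations on the stability of the average can be illustrated with a small simulation. This is a simplified sketch: it assumes each split's validation accuracy is an independent draw around a fixed value, which real overlapping splits only approximate. The values of `true_acc` and `per_split_sd`, and the names `demo_rng` and `spread_of_average`, are hypothetical.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)  # separate generator, for illustration only
true_acc = 0.34      # hypothetical accuracy that each split estimates
per_split_sd = 0.01  # hypothetical spread of a single split's validation accuracy

def spread_of_average(iterations, trials=2000):
    """Standard deviation of the averaged accuracy over many simulated experiments."""
    draws = demo_rng.normal(true_acc, per_split_sd, size=(trials, iterations))
    return draws.mean(axis=1).std()

spreads = {n: spread_of_average(n) for n in (1, 10, 100)}
for n, s in spreads.items():
    print(f"{n:3d} iterations: spread of the average about {s:.4f}")
```

Under these assumptions, going from 1 to 100 iterations narrows the spread by about a factor of 10, while the value being estimated stays the same. More iterations make the estimate steadier, not less biased.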

#4 Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?

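Increasing the number of iterations can mitigate the impact of a very small train or validation dataset to some extent, but it does not solve the underlying problem of limited data. It helps most when the validation set is small, because averaging reduces the noise of each individual estimate. It helps little when the train set is small, because every split still trains on too few samples and the average inherits that weakness. Here is how increasing the iterations helps, and where it falls short: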

Advantages of Increasing Iterations:

Reduced Variability: When you have a very small training or validation dataset, there's a higher risk of getting unrepresentative splits or experiencing greater variability in performance estimates. Increasing the number of iterations can help reduce this variability by considering multiple different data partitions.

More Stable Estimates: With more iterations, you get a more stable and reliable estimate of model performance. This can be especially useful when dealing with limited data because it allows you to assess how well your model generalizes across different subsets of the small dataset.

Limitations and Considerations:

Limited Data: Increasing the number of iterations does not magically create more data. It only provides more ways to sample and assess the same limited data. If your training dataset is very small, your model's ability to generalize effectively might still be compromised.

Computation Cost: While increasing iterations helps with performance estimation, it also increases computational cost. If you have extremely limited computational resources, this approach may not be feasible.

Risk of Overfitting: When dealing with a very small dataset, there's an increased risk of overfitting. Cross-validation can help in estimating performance, but it may not fully address the underlying issue of model generalization if the dataset size is severely limited. You may need to consider strategies like simplifying the model or using regularization techniques.

Data Imbalance: If the small dataset is imbalanced (i.e., some classes have very few examples), increasing iterations makes it more likely that each class appears in both the train and validation sets in at least some splits, but individual splits may still miss a rare class entirely.

In situations where you have an extremely small dataset, it's essential to be cautious and consider other techniques beyond just increasing iterations. You might explore data augmentation (creating synthetic data), transfer learning (leveraging pre-trained models), or using simpler model architectures to combat overfitting. Additionally, collecting more data, if possible, remains one of the most effective ways to address the challenges associated with very small datasets.
