<a href="https://colab.research.google.com/github/PavanBorigi/FMML-22B21A42C2/blob/main/Module_01_Lab_02_MLPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through part of the ML pipeline: extracting features, training, and testing.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# set the random seed so that results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes like income of the block, age of the houses per district etc. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
# np.int was deprecated in NumPy 1.20 and removed in later releases; the builtin int does the same job
dataset.target = dataset.target.astype(int)  # truncate the targets to integers so that we can classify
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that implements the 1-nearest-neighbour classifier:

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # difference between features; NumPy broadcasting subtracts the query from every training row
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
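
To make the distance computation concrete, here is a small sketch on made-up 2-D points. The toy arrays and the name `nn1` are illustrative assumptions, not part of the housing data or the lab code; the function repeats the same squared-Euclidean rule as `NN1` so that the example runs on its own.

```python
import numpy as np

def nn1(traindata, trainlabel, query):
    # squared Euclidean distance from the query to every training row (via broadcasting)
    dist = ((traindata - query) ** 2).sum(axis=1)
    # label of the closest training row
    return trainlabel[np.argmin(dist)]

# toy training set: three points with made-up labels
toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 1, 2])

# squared distances from (0.9, 1.2) to the three rows: 2.25, 0.05, 31.25
print(nn1(toy_train, toy_label, np.array([0.9, 1.2])))  # closest row is [1, 1], so the label is 1
```

Note that the square root is never taken: it would not change which row is closest, so comparing squared distances is enough.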

We will also define a 'random classifier', which randomly allots labels to each sample

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used; it is kept only so that the signature matches NN

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
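
As a quick sanity check of the definition, the sketch below computes accuracy on two short made-up label arrays. The lowercase `accuracy` helper is a hypothetical stand-in that uses the mean of the boolean matches, which is equivalent to dividing the number of matches by the length.

```python
import numpy as np

def accuracy(gtlabel, predlabel):
    # fraction of positions where the prediction equals the ground truth
    return (gtlabel == predlabel).mean()

gt = np.array([0, 1, 2, 2, 1])    # made-up ground truth
pred = np.array([0, 1, 1, 2, 0])  # made-up predictions; positions 2 and 4 are wrong
print(accuracy(gt, pred))         # 3 of 5 predictions match
```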

Let us make a function to split the dataset with the desired probability.

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label

We will reserve about 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %
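
The test portion comes out as 20.08% rather than exactly 20% because `split` draws an independent random number for every sample, so the size of each part is itself random. A minimal sketch of this effect, assuming a hypothetical seed and the same sample count as the dataset:

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # hypothetical seed, for illustration only
n_samples = 20640
fraction = 0.2

# each sample lands in the first part independently with probability `fraction`
mask = rng_demo.random(n_samples) < fraction
observed = mask.mean()

# under this scheme the observed fraction has standard deviation sqrt(p * (1 - p) / n)
expected_std = np.sqrt(fraction * (1 - fraction) / n_samples)
print(observed, expected_std)
```

For this sample count the standard deviation is below 0.3 percentage points, so a result such as 20.08% is entirely expected. If an exact split size is required, shuffling the indices and slicing them is a common alternative.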


## Experiments with splits

Let us reserve some of our train data as a validation set

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


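For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour (this can fail only if two identical feature vectors carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.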
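
The 1/(number of classes) figure holds even when the classes are imbalanced: a uniform guess matches the true class with probability 1/K whatever that class is. The sketch below checks this by simulation, using made-up class proportions and a hypothetical seed.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # hypothetical seed
n_classes = 6
n_samples = 100_000

# deliberately imbalanced, made-up ground-truth labels
true_labels = rng_demo.choice(n_classes, size=n_samples,
                              p=[0.4, 0.3, 0.15, 0.1, 0.04, 0.01])
# uniform random guesses, the same idea as RandomClassifier
guesses = rng_demo.integers(low=0, high=n_classes, size=n_samples)

acc = (true_labels == guesses).mean()
print(acc, 1 / n_classes)  # the two numbers should be close
```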

Let us predict the labels for our validation set and get the accuracy

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the split, like 99.9% or 0.1%.
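
One possible way to organise this experiment is sketched below. It assumes a small synthetic dataset (two made-up Gaussian blobs) so that it runs quickly and on its own; the names `nn_predict` and `val_accuracy` are hypothetical helpers. To answer the questions for the housing data, the same loop can be run with `alltraindata`, `alltrainlabel` and the `NN`, `split` and `Accuracy` functions defined above.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # hypothetical seed

# synthetic stand-in data: two overlapping Gaussian blobs with labels 0 and 1
n_per_class = 200
data = np.vstack([rng_demo.normal(0.0, 1.0, size=(n_per_class, 2)),
                  rng_demo.normal(1.5, 1.0, size=(n_per_class, 2))])
label = np.repeat([0, 1], n_per_class)

def nn_predict(traindata, trainlabel, testdata):
    # pairwise squared distances, shape (n_test, n_train)
    dist = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[dist.argmin(axis=1)]

def val_accuracy(train_fraction):
    # random mask split, the same idea as split() in the lab
    mask = rng_demo.random(len(label)) < train_fraction
    pred = nn_predict(data[mask], label[mask], data[~mask])
    return (pred == label[~mask]).mean()

train_fractions = [0.1, 0.25, 0.5, 0.75, 0.9]
results = {}
for frac in train_fractions:
    accs = np.array([val_accuracy(frac) for _ in range(20)])
    results[frac] = (accs.mean(), accs.std())  # mean and spread over 20 random splits
    print(f"train fraction {frac:.2f}: mean {accs.mean():.3f}, std {accs.std():.3f}")
```

The means and standard deviations collected in `results` can then be passed to `plt.plot` or `plt.errorbar` to see both the trend and the noise at each split size.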

1. Changing the percentage of data allocated to the validation set affects the validation accuracy in two ways: it changes how much data is left for training, and it changes how reliable the accuracy estimate is.

Increasing the percentage of the validation set:

Pros:

A larger validation set gives a more reliable estimate of the model's performance, because the accuracy is averaged over more samples.

A larger validation set makes overfitting easier to detect. If the model performs well on a large validation set, the result is less likely to be due to chance.

Cons:

A larger validation set leaves less data for training. The model learns from fewer examples, so the validation accuracy itself tends to drop.

If the dataset is small to begin with, the remaining training set may be too small to train a useful model.

Decreasing the percentage of the validation set:

Pros:

More data in the training set lets the model capture finer patterns in the data, which usually raises the accuracy.

If the dataset is very large, even a small percentage may still give a validation set large enough for reliable evaluation.

Cons:

Smaller validation sets give noisier estimates of model performance. The validation score may vary significantly between different random splits of the data.

If the validation set is too small, overfitting may go undetected, because the model can perform well on a handful of samples by chance.

The right percentage depends on several factors, including the size of the dataset and the complexity of the model. A common practice is a 70-30 or 80-20 split between training and validation. If data is limited, or if you want to reduce the impact of a single random split, consider k-fold cross-validation for a more robust estimate.

In summary, increasing the validation set percentage gives a more reliable estimate of model performance but may starve the model of training data, while reducing it helps training but gives noisier estimates and a higher risk that overfitting goes unnoticed. The choice should be based on the characteristics of the dataset and the modelling goals.
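
How much noisier a small validation set is can be estimated under a simplifying assumption: if each validation sample is classified correctly with probability p, independently of the others, then the measured accuracy has standard deviation sqrt(p(1-p)/n) for n validation samples. The sketch below evaluates this for p = 0.34 (roughly the nearest-neighbour accuracy seen above) and a few hypothetical validation set sizes.

```python
import math

def accuracy_std(p, n):
    # standard deviation of a binomial proportion: success probability p, n trials
    return math.sqrt(p * (1 - p) / n)

p = 0.34  # approximate nearest-neighbour validation accuracy from the runs above
for n in [40, 400, 4000]:
    print(n, round(accuracy_std(p, n), 4))
```

Under this assumption, a validation set of 40 samples gives a spread of roughly 7.5 percentage points, while 4000 samples bring it down to about 0.75, since the spread shrinks with the square root of the validation set size.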



2. The sizes of the training and validation sets strongly affect how well the validation accuracy predicts the accuracy on the test set. The validation accuracy is an estimate of the test accuracy, and the split sizes control both how biased and how noisy that estimate is.

Here is how the two sizes interact:

Large Training Set, Small Validation Set:

Pros:

With a large training set, the model has more data to learn from and can capture more complex patterns in the data.

A smaller validation set can still give a reasonable estimate of the model's performance, especially if it is randomly sampled from a sufficiently large dataset.

Cons:

A small validation set gives noisier performance estimates. The validation score can vary significantly depending on which samples happen to fall in the validation set.

Prediction of Test Set Accuracy:

The validation accuracy may not be a reliable indicator of the test accuracy because it is computed on few samples. If many models or settings are compared on such a small validation set, there is also a higher risk of overfitting to it.

Small Training Set, Large Validation Set:

Pros:

A larger validation set gives a more stable estimate of model performance.

Cons:

With a small training set, the model may not capture the patterns in the data, so it generalises worse.

Prediction of Test Set Accuracy:

The validation estimate is stable, but it describes a model trained on little data. A model trained on all the available training data usually does better on the test set, so the validation accuracy tends to underestimate the test accuracy. The runs above show a mild version of this effect: validation accuracy was about 0.34 with 75% of the training data, while the test accuracy using all of it was about 0.35.

In summary, the key is finding a balance between the sizes of the training and validation sets. A larger training set helps the model learn better but leaves a smaller, noisier validation set. A larger validation set gives a more stable estimate, but of a model that has seen less data.

To improve your ability to predict test set accuracy using the validation set, consider techniques like cross-validation, where you split your data into multiple folds and iteratively use different subsets for training and validation. This can provide a more robust estimate of your model's generalization performance and reduce the impact of the initial random split.







3. The percentage of data to reserve for a validation set depends on various factors, including the size of your overall dataset, the complexity of your machine learning model, and the nature of the problem you're trying to solve. However, there are some general guidelines that can help you strike a balance between having enough data for training and having enough for validation and testing:

70/30 or 80/20 Split: A common starting point is to split your dataset into a training set (70% or 80% of the data) and a validation/test set (30% or 20% of the data). This is a good rule of thumb for medium-sized datasets.

Cross-Validation: If you have a relatively small dataset, you can use techniques like k-fold cross-validation. In k-fold cross-validation, you divide your dataset into k subsets and train your model k times, each time using a different subset for validation and the remaining data for training. This helps ensure that you get a more robust estimate of your model's performance.

Stratified Sampling: If your dataset is imbalanced (i.e., some classes or categories have significantly fewer samples than others), you should consider using stratified sampling when creating your validation/test set. This ensures that the distribution of classes in the validation/test set is representative of the overall dataset.

Leave-One-Out Cross-Validation: For very small datasets, you can use leave-one-out cross-validation, where you leave one data point as the validation set and use the rest for training. You repeat this process for each data point, which can provide a good estimate of your model's performance but can be computationally expensive.

Holdout Validation: If you have a very large dataset, you might be able to afford a smaller percentage for validation and testing, such as a 90/10 or 95/5 split.

Domain Knowledge: Your knowledge of the problem domain can also influence the choice of validation split. If you know that the data is noisy or that the model is particularly sensitive to variations in the training data, you may want to allocate a larger percentage to the validation set.

Ultimately, the best percentage for the validation set depends on the specifics of your project. It's often a matter of experimentation and tuning. You can start with one of the common splits mentioned above and then adjust as needed based on your model's performance on the validation set. The goal is to have enough data for training while still having a representative sample for validation and testing to accurately assess your model's generalization performance.
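
The stratified sampling idea mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: `stratified_split` is a hypothetical helper, and the labels and seed are made up. It picks the validation samples class by class so that both parts keep roughly the same class mix.

```python
import numpy as np

def stratified_split(label, val_fraction, rng):
    # returns boolean masks (train, val) with roughly the same class mix in each part
    val_mask = np.zeros(len(label), dtype=bool)
    for cls in np.unique(label):
        idx = np.flatnonzero(label == cls)  # positions of this class
        rng.shuffle(idx)                    # shuffle the positions in place
        n_val = int(round(val_fraction * len(idx)))
        val_mask[idx[:n_val]] = True        # first n_val shuffled positions go to validation
    return ~val_mask, val_mask

rng_demo = np.random.default_rng(seed=0)   # hypothetical seed
toy_label = np.array([0] * 90 + [1] * 10)  # made-up imbalanced labels
train_mask, val_mask = stratified_split(toy_label, 0.2, rng_demo)
print(np.bincount(toy_label[val_mask]))    # class counts in the validation part: [18  2]
```

With a purely random mask, the rare class could easily end up with zero or four validation samples instead of two; stratifying removes that source of variation.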








## Multiple Splits

One way to get a more accurate estimate of the test accuracy is <b>cross-validation</b>. Here, we will try a simple version: we make multiple random train/validation splits and take the average of the validation accuracies as our estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [13]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [14]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known cross-validation schemes, like k-fold cross-validation and leave-one-out. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


1. Yes, averaging the validation accuracy across multiple splits gives more consistent and robust results when evaluating a machine learning model. Averaging over repeated random splits, as done above, is a simple form of cross-validation: it reduces the impact of randomness in the data split and gives a more reliable estimate of the model's performance.

Here are a few reasons why averaging validation accuracy across multiple splits is beneficial:

Reduced Variance: When you split your data into a single training set and a single validation/test set, the performance evaluation can be sensitive to the specific data points in each set. A single split might result in unusually high or low accuracy due to the luck of the draw. Cross-validation averages out these variations across multiple splits, reducing the variance in the evaluation.

Better Generalization Estimate: Cross-validation provides a more accurate estimate of how well your model is likely to perform on unseen data. By training and validating the model multiple times on different subsets of the data, you obtain a more comprehensive view of its generalization performance.

Utilizing the Entire Dataset: In k-fold cross-validation, you use the entire dataset for both training and validation, ensuring that all data points are used for assessment. This is particularly important when working with limited data.

Model Tuning: Cross-validation is often used in the model selection and hyperparameter tuning process. It allows you to compare different models or hyperparameter settings more fairly by providing an average performance score across multiple validation sets.

Common cross-validation techniques include k-fold cross-validation, stratified k-fold cross-validation, and leave-one-out cross-validation. The choice of which technique to use depends on factors such as your dataset size, the nature of your data, and the computational resources available.

In summary, averaging validation accuracy across multiple splits using cross-validation is a valuable practice for obtaining a more stable and reliable assessment of your machine learning model's performance. It helps you make more informed decisions about model selection, hyperparameter tuning, and overall model quality.








2. Cross-validation, specifically k-fold cross-validation, provides a more accurate estimate of how well your machine learning model is likely to perform on unseen data compared to a single train/validation split. However, it's important to clarify that cross-validation estimates the model's performance on validation data, not on the actual test data that you should set aside for final evaluation.

Here's how it works:

Training and Validation: In k-fold cross-validation, you divide your dataset into k equally sized subsets or "folds." You train and validate your model k times, each time using a different fold as the validation set and the remaining k-1 folds as the training set. This process gives you k different estimates of your model's performance on validation data.

Performance Averaging: You then calculate the average (or sometimes other statistics like standard deviation) of the performance metrics (e.g., accuracy) from these k rounds of validation. This average provides a more stable and reliable estimate of how well your model generalizes to unseen data compared to a single validation split.

Test Data: After you've chosen your final model and tuned its hyperparameters based on cross-validation, you should have a good estimate of how well your model is expected to perform on new, unseen data. However, to get a true estimate of the model's performance, you should evaluate it on a separate and previously untouched test dataset that has not been used during model development or hyperparameter tuning. This test dataset serves as a final, unbiased assessment of your model's generalization ability.

In summary, while cross-validation provides a more accurate estimate of how well your model performs on validation data, it doesn't directly estimate test accuracy. Test accuracy can only be determined by evaluating your model on a dedicated, independent test dataset. Cross-validation helps you select and tune your model, giving you confidence that it is likely to perform well on unseen data, but the true test accuracy is assessed separately on the test dataset.
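
The k-fold procedure described above can be sketched as an index generator. This is an illustrative sketch, not the method used in this lab: `kfold_indices` is a hypothetical helper, and the seed and sizes are made up. Each sample appears in exactly one validation fold, and in the training set of every other fold.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # shuffle once, then cut the shuffled indices into k nearly equal folds
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        # the remaining k-1 folds together form the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

rng_demo = np.random.default_rng(seed=0)  # hypothetical seed
for train_idx, val_idx in kfold_indices(10, 5, rng_demo):
    print(len(train_idx), len(val_idx))   # 8 training and 2 validation indices per fold
```

To evaluate a classifier with it, one would index the data and labels with `train_idx` and `val_idx` in each round and average the k validation accuracies.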







3. The number of iterations affects how stable the averaged estimate is. In the AverageAccuracy function above, the number of iterations is independent of the split percentage. In k-fold cross-validation, the analogous quantity is the number of folds k, which also fixes the split sizes. The gains are not linear: each additional iteration helps less than the previous one, and there are trade-offs to consider.

More Iterations (Higher k):

Pros: Averaging over more validation sets gives a more stable and robust estimate, because the effect of any single random split is averaged out. In k-fold cross-validation, a higher k (e.g., 10-fold instead of 5-fold) also means each model is trained on a larger share of the data, (k-1)/k, so the estimate is less pessimistic.

Cons: More iterations cost more computation. In k-fold cross-validation, a higher k makes each validation fold smaller, so the individual fold accuracies are noisier, and the training sets overlap heavily, so the k results are strongly correlated.

Fewer Iterations (Lower k):

Pros: Fewer iterations are cheaper to run. In k-fold cross-validation, a lower k (e.g., 2-fold or 3-fold) gives larger validation folds.

Cons: With fewer iterations, the estimate is more sensitive to the specific random split of the data, making it less stable and reliable. In k-fold cross-validation, a lower k also means each model is trained on a smaller share of the data, which tends to make the estimate pessimistic.

In practice, the choice of the number of iterations (k) depends on several factors, including the size of your dataset, the nature of your data, and your computational resources. Here are some general guidelines:

For medium-sized datasets, 5-fold or 10-fold cross-validation is often a good starting point because they strike a balance between stability and computational efficiency.

For small datasets, you might opt for leave-one-out cross-validation (k equal to the number of data points) to maximize the use of your limited data, but this can be computationally expensive.

For very large datasets, you can use fewer folds, like 3-fold or 2-fold cross-validation, to speed up the process while still obtaining a reasonable estimate.

Ultimately, the goal is to achieve a good trade-off between a stable estimate (low variance) and a representative training set in each fold (low bias). You may need to experiment with different values of k to find the best balance for your specific problem. It's also important to remember that cross-validation is just one part of model evaluation; the final test on an independent test dataset is crucial for obtaining an unbiased estimate of your model's performance.
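
The effect of the number of iterations can be checked empirically. The sketch below assumes a small synthetic dataset (two made-up Gaussian blobs) and hypothetical helper names; it repeats the whole averaging procedure several times for each iteration count and measures how much the averaged accuracy moves between repetitions.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # hypothetical seed

# synthetic stand-in data: two overlapping Gaussian blobs with labels 0 and 1
n_per_class = 150
data = np.vstack([rng_demo.normal(0.0, 1.0, size=(n_per_class, 2)),
                  rng_demo.normal(1.5, 1.0, size=(n_per_class, 2))])
label = np.repeat([0, 1], n_per_class)

def one_split_accuracy(train_fraction=0.75):
    # random mask split, then 1-nearest-neighbour via pairwise squared distances
    mask = rng_demo.random(len(label)) < train_fraction
    dist = ((data[~mask][:, None, :] - data[mask][None, :, :]) ** 2).sum(axis=2)
    pred = label[mask][dist.argmin(axis=1)]
    return (pred == label[~mask]).mean()

def average_accuracy(iterations):
    # average validation accuracy over several random splits
    return np.mean([one_split_accuracy() for _ in range(iterations)])

spread = {}
for iterations in [1, 5, 25]:
    # repeat the averaging 30 times and measure how much the result varies
    estimates = [average_accuracy(iterations) for _ in range(30)]
    spread[iterations] = np.std(estimates)
    print(iterations, round(spread[iterations], 4))
```

The spread should shrink as the iteration count grows, but with diminishing returns. Averaging removes only the noise caused by the random choice of split; it does not compensate for a model that was trained on too little data.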








4. Increasing the number of iterations (folds) in a cross-validation procedure can help mitigate the issues associated with having a very small train dataset or validation dataset to some extent. However, it may not completely overcome the limitations imposed by extremely small datasets.

Here's how increasing the iterations can help when dealing with small datasets:

Training Set Size: In k-fold cross-validation, increasing k means each model is trained on a larger portion of the data, (k-1)/k, which helps when data is scarce. The price is that each validation fold becomes smaller. In the repeated random splits used in this lab, by contrast, the number of iterations does not change the split sizes at all.

Robustness: Increasing k can provide a more robust estimate of your model's performance because you are averaging over more validation sets. This can help reduce the impact of random variations in the data splits.

However, there are limitations and considerations:

Data Quality: Increasing the number of iterations won't create more data or improve the quality of your data. If your dataset is very small and lacks diversity or is noisy, increasing the number of iterations won't address these issues.

Computational Cost: Using a large value of k can be computationally expensive, especially if your model is complex or training takes a long time. You should consider your available computational resources when deciding on the number of iterations.

Bias-Variance Trade-Off: More iterations reduce the variance that comes from the random choice of split, but they do not remove bias. If the training set in each split is very small, every model in the average is weak, so the averaged accuracy is a stable but pessimistic estimate of what a model trained on all the data would achieve. If the validation set in each split is very small, each individual accuracy is very noisy, and averaging helps only with that noise.

Limited Data: If your dataset is extremely small, repeated random splits may not be enough. In such cases, you might consider leave-one-out cross-validation or bootstrapping, which reuse the available samples more thoroughly. Keep in mind that resampling does not create new information.

In summary, increasing the number of iterations in cross-validation can be a reasonable strategy to deal with very small training or validation datasets, as it provides better stability and a more robust performance estimate. However, it cannot fundamentally address limitations related to the quantity and quality of the data. When dealing with extremely small datasets, it's essential to consider the broader context and explore other techniques to improve your model's performance and reliability.





