<a href="https://colab.research.google.com/github/Mekavenkatakishore/FMML-22B21A4483/blob/main/Fmml_of_Module_01_Lab_02_MLPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through a part of the ML pipeline: extracting features, training, and testing.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# set the random seed so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes such as the median income and the median age of the houses in a district. The task is to predict the median house value per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
dataset.target = dataset.target.astype(int)  # truncate the target to integers so that we can classify (np.int was deprecated in NumPy 1.20 and removed in 1.24)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that predicts a label with the 1-nearest-neighbour rule.

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
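
As a quick sanity check, the sketch below applies the same squared-distance logic to a tiny hand-made dataset. The toy arrays and the name `nn1_demo` are illustrative assumptions and are not part of the lab.

```python
import numpy as np

def nn1_demo(traindata, trainlabel, query):
    # same logic as NN1 above: squared Euclidean distance to every training row
    diff = traindata - query          # broadcasting: (n, d) - (d,) -> (n, d)
    dist = (diff * diff).sum(axis=1)  # one squared distance per training sample
    return trainlabel[np.argmin(dist)]

# three made-up 2-D training points with labels 0, 1, 2
toy_train = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
toy_label = np.array([0, 1, 2])

print(nn1_demo(toy_train, toy_label, np.array([4.0, 4.5])))  # nearest point is [5, 5], label 1
print(nn1_demo(toy_train, toy_label, np.array([9.0, 1.0])))  # nearest point is [10, 0], label 2
```

The square root is skipped because it does not change which training sample is closest.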

We will also define a 'random classifier', which assigns a random label to each sample.

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used; it is kept so that the signature matches NN

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
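
A small worked example of the same ratio, using made-up label arrays (the names `gt` and `pred` are illustrative only):

```python
import numpy as np

gt = np.array([1, 0, 2, 2, 1])    # made-up ground-truth labels
pred = np.array([1, 0, 1, 2, 0])  # made-up predictions; positions 0, 1 and 3 match

accuracy = (gt == pred).sum() / len(gt)
print(accuracy)  # 3 correct out of 5 -> 0.6
```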

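Let us write a function that randomly splits the dataset in two. Each sample goes into the first part with the given probability, so `percent` is a fraction between 0 and 1 and the resulting sizes are only approximately in that proportion.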

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %
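
The test portion comes out at about 20.08% rather than exactly 20% because `split` decides the side of each sample independently. If an exact size were needed, one possible alternative is to shuffle the indices once and cut at a fixed position. The helper `split_exact` below is a hypothetical sketch of that idea, not something the lab uses.

```python
import numpy as np

def split_exact(data, label, fraction, rng):
    # hypothetical variant of split(): shuffle the indices once, then cut at a fixed position
    n = len(label)
    order = rng.permutation(n)
    cut = int(round(fraction * n))
    first, second = order[:cut], order[cut:]
    return data[first], label[first], data[second], label[second]

demo_rng = np.random.default_rng(0)
X = np.arange(40, dtype=float).reshape(20, 2)  # 20 made-up samples with 2 features
y = np.arange(20)
d1, l1, d2, l2 = split_exact(X, y, 0.2, demo_rng)
print(len(l1), len(l2))  # 4 16
```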


## Experiments with splits

Let us reserve some of our train data as a validation set

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


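For nearest neighbour, the train accuracy is 1, because each training sample is its own nearest neighbour at distance 0 (this can only fail if identical samples carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.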
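
The 1/(number of classes) figure holds even when the classes are imbalanced, since a uniform guess matches any given true label with probability 1/6. The simulation below illustrates this with made-up class proportions; it is a sketch and does not use the housing data.

```python
import numpy as np

sim_rng = np.random.default_rng(0)
num_classes = 6
n = 100_000

# made-up, deliberately imbalanced ground-truth labels
true_labels = sim_rng.choice(num_classes, size=n, p=[0.4, 0.3, 0.15, 0.1, 0.04, 0.01])
# uniform random guesses, in the same way as RandomClassifier
guesses = sim_rng.integers(low=0, high=num_classes, size=n)

sim_accuracy = (true_labels == guesses).mean()
print(sim_accuracy, 1 / num_classes)  # the two values should be close
```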

Let us predict the labels for our validation set and get the accuracy

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but the values are close together.

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href="https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py">plt.plot</a>. Also check extreme values for the splits, like 99.9% or 0.1%.
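
One way to organise these experiments is to sweep the split fraction and record the mean and spread of the validation accuracy. The sketch below does this on small synthetic data so that it runs quickly. The data, the helper names `nn_predict` and `val_accuracies`, and the chosen fractions are all assumptions for illustration; in the lab you would use `split`, `NN` and `Accuracy` on `alltraindata` instead.

```python
import numpy as np

exp_rng = np.random.default_rng(0)

# made-up stand-in data: 600 samples, 2 features, 3 classes around distinct centres
centres = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = exp_rng.integers(0, 3, size=600)
data = centres[labels] + exp_rng.normal(size=(600, 2))

def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest neighbour with squared Euclidean distance, as in NN above
    return np.array([trainlabel[np.argmin(((traindata - q) ** 2).sum(1))] for q in testdata])

def val_accuracies(train_fraction, repeats=20):
    accs = []
    for _ in range(repeats):
        mask = exp_rng.random(len(labels)) < train_fraction  # same idea as split()
        if mask.sum() == 0 or (~mask).sum() == 0:
            continue  # skip degenerate splits where one side is empty
        pred = nn_predict(data[mask], labels[mask], data[~mask])
        accs.append((pred == labels[~mask]).mean())
    return np.array(accs)

train_fractions = [0.01, 0.1, 0.5, 0.75, 0.9, 0.99]
results = {f: val_accuracies(f) for f in train_fractions}
for f, accs in results.items():
    print(f"train fraction {f:4.2f}: mean {accs.mean():.3f}, std {accs.std():.3f}")
```

The means and standard deviations can then be passed to `plt.plot` (or `plt.errorbar`) with `train_fractions` on the x-axis to draw the graph.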

3. Determining the appropriate percentage to reserve for the validation set involves finding a balance between having sufficient data for training the model and having a representative sample to assess its performance.

Here are some general guidelines. Keep in mind that the optimal split can vary depending on the size of the overall dataset, the complexity of the task, and other specific considerations.

Typical splits:

- Training set: a common starting point is to allocate 70-80% of the data to the training set. This allows the model to learn the underlying patterns and relationships present in the data.
- Validation set: allocating 10-15% of the data to the validation set is a good starting point. This provides enough data to evaluate the model's performance and make necessary adjustments, such as tuning hyperparameters or selecting the best model.
- Test set: the remaining 10-15% of the data can be reserved for the test set, which is used for the final evaluation and for assessing the model's generalization to unseen data.

Cross-validation: another effective approach, especially with limited data, is to use techniques like k-fold cross-validation on the training set. Different portions of the data are used in turn for training and validation, so that each data point is used for validation at least once.

Stratified split: if the dataset has class imbalance, or specific patterns that should be adequately represented in both the training and validation sets, consider a stratified split. This ensures that each subset (training, validation, test) maintains a class distribution similar to the original dataset.

Dataset size: for very large datasets, a smaller percentage can go to the validation and test sets while still leaving a sufficient number of samples for evaluation. Conversely, for small datasets, larger validation and test sets should be prioritized to ensure a representative evaluation.

In summary, there is no one-size-fits-all answer, and the split should be chosen based on the specific characteristics of the dataset and problem. Experimentation and iterative evaluation are key to finding a split that balances the need for training data against accurate model evaluation.
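
To make the stratified split mentioned above concrete, here is a minimal sketch that splits each class separately so that the class proportions are preserved. The helper `stratified_split` and the label array are hypothetical; the lab's own `split` function is not stratified.

```python
import numpy as np

def stratified_split(label, fraction, rng):
    # hypothetical helper: boolean mask that puts roughly `fraction`
    # of every class into the first part
    mask = np.zeros(len(label), dtype=bool)
    for cls in np.unique(label):
        idx = np.flatnonzero(label == cls)  # positions of this class
        rng.shuffle(idx)                    # shuffle them in place
        cut = int(round(fraction * len(idx)))
        mask[idx[:cut]] = True
    return mask

strat_rng = np.random.default_rng(0)
y = np.array([0] * 80 + [1] * 16 + [2] * 4)  # made-up imbalanced labels
in_train = stratified_split(y, 0.75, strat_rng)

print(np.bincount(y[in_train]))   # per-class counts in the first part: 60, 12, 3
print(np.bincount(y[~in_train]))  # per-class counts in the second part: 20, 4, 1
```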







2. The sizes of the training and validation sets both influence how well we can predict the accuracy on the test set using the validation set.

Larger training set:

- Effect: a larger training set usually lets the model learn more diverse patterns and relationships in the data, so it generalizes better.
- Predicting test set accuracy: in this lab the final classifier is built from all of the non-test data (`alltraindata`). The closer the training split is to that size, the more closely the validation accuracy reflects the accuracy on the test set.

Smaller training set:

- Effect: a smaller training set makes the model more sensitive to the specific samples it was given, so it may not generalize well to unseen data.
- Predicting test set accuracy: the validation accuracy becomes a less reliable predictor. A classifier built from a small subset tends to score lower than the one built from all of the non-test data, so the validation accuracy tends to underestimate the test accuracy.

Larger validation set:

- Effect: a larger validation set gives a more robust assessment of the model's performance on unseen data and reduces the variability of the estimate.
- Predicting test set accuracy: the prediction becomes more reliable, because the estimate of the model's generalization performance is more accurate.

Smaller validation set:

- Effect: a smaller validation set gives an estimate with higher variability, and may not adequately represent the true generalization capability of the model.
- Predicting test set accuracy: the prediction is less accurate, because the estimate is more likely to be affected by chance or to miss critical patterns.

In summary, a well-sized training set and a representative validation set are both needed to predict the model's performance on unseen data (the test set). A larger, diverse training set and a reasonably sized validation set improve the reliability of that prediction.
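
The variability described above can be quantified roughly. If the validation samples are treated as independent (an assumption), the measured accuracy behaves like a binomial proportion, and its standard error shrinks with the square root of the validation set size. The value `p = 0.34` below is an assumed true accuracy, chosen to be close to the nearest-neighbour results reported earlier.

```python
import numpy as np

p = 0.34  # assumed true accuracy, close to the nearest-neighbour values reported above
std_errs = {}
for n_val in [10, 100, 1000, 4000]:
    # binomial standard error of a proportion estimated from n_val independent samples
    std_errs[n_val] = np.sqrt(p * (1 - p) / n_val)
    print(f"n_val = {n_val:4d}: accuracy is roughly {p:.2f} +/- {std_errs[n_val]:.3f}")
```

Under this assumption, a validation set of 100 samples leaves an uncertainty of almost 5 percentage points, while one of 4000 samples brings it below 1 point.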







1. Changing the percentage of the validation set changes two things at once: how many samples are left for training, and how many samples the validation accuracy is computed from.

Increasing the percentage of the validation set:

- Effect on accuracy: fewer samples remain for training, so the validation accuracy of nearest neighbour tends to go down. At the same time the estimate becomes more reliable, because a larger validation set is a better representation of unseen data.
- Advantages: a better estimate of the true model performance, and less risk of being misled by a lucky or unlucky validation sample.
- Disadvantages: less data for training the model, potentially resulting in a weaker model, and validation is computationally more expensive.

Reducing the percentage of the validation set:

- Effect on accuracy: more samples are available for training, so the classifier tends to improve, but the estimate of its performance becomes less reliable. The smaller the validation set, the more susceptible the estimate is to variability.
- Advantages: more data for training the model, potentially leading to a better model, and validation is computationally less expensive.
- Disadvantages: a small validation set might not represent the model's generalization to unseen data accurately, and the estimate may vary a lot from one split to the next.

For the random classifier, the expected accuracy stays close to 1/(number of classes) whatever the split; only the variability of the measured value changes.

In practice, choosing the size of the validation set is a trade-off between having a representative validation set and maximizing the amount of data available for training. Cross-validation techniques (e.g., k-fold cross-validation) can reduce the impact of a fixed validation set size and give a more robust evaluation. The balance depends on the specific problem, the dataset size, and the computational resources available.



## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [13]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [14]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross-validation (Wikipedia)</a>.
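
For comparison with the random splits used in `AverageAccuracy`, the sketch below shows how k-fold cross-validation partitions the sample indices: the data is shuffled once and cut into k folds, and each fold serves as the validation set exactly once. The helper `kfold_indices` is a hypothetical illustration, not the implementation used in a later module or in any library.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # hypothetical helper: shuffle once, then cut into k nearly equal folds
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

kf_rng = np.random.default_rng(0)
fold_sizes = []
for train_idx, val_idx in kfold_indices(10, 5, kf_rng):
    fold_sizes.append((len(train_idx), len(val_idx)))
    print("train:", np.sort(train_idx), "val:", np.sort(val_idx))
```

Each pair of index arrays could be used to slice `alltraindata` and `alltrainlabel` before calling `NN` and `Accuracy`.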

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?



4. Only partly. In `AverageAccuracy`, an iteration is one more random train/validation split, so increasing the number of iterations helps with a very small validation dataset but not with a very small training dataset.

Small validation dataset: each single split gives a noisy accuracy, because it is computed from few samples. Averaging over more splits reduces the impact of that randomness and gives a more stable estimate of model performance. If the validation dataset is extremely small, however, there is a limit to how much this can compensate for its size.

Small training dataset: more iterations do not help here. Every split still trains on the same small number of samples, so the model is more prone to fitting the noise in those samples than to capturing the true underlying patterns. No matter how many iterations are used, a very small dataset might not capture the true complexity of the problem. Averaging only gives a more precise estimate of that lower accuracy.

In summary, increasing the number of iterations can aid in dealing with very small datasets, but it is not a complete solution. A sufficiently large and diverse training dataset cannot be replaced by iteration count alone, so the split still has to balance data size against the number of iterations and the time they take.



3. "Iterations" can refer to different things depending on the context: parameter updates during training, steps of an optimization algorithm such as gradient descent, or repetitions of a cross-validation procedure. In this lab it means the last of these: the number of random train/validation splits that `AverageAccuracy` averages over. Nearest neighbour itself has no training iterations.

In this sense, more iterations generally lead to a better estimate of the model's performance:

- Higher stability: averaging performance over more splits gives a more stable estimate and reduces the impact of the randomness of any particular split.
- Better representation: with more splits, more of the dataset ends up in some validation set, which gives a more comprehensive picture of the model's generalization.
- Less dependence on a single split: a larger number of iterations reduces the distortion that one specific training-validation split might introduce.

The improvement levels off, though. Averaging removes the split-to-split noise, but it cannot remove the systematic difference that comes from training on only part of the data.

In summary, increasing the number of iterations usually leads to a better estimate of the model's performance, at the price of more computation. The number of iterations should be chosen with that trade-off, the dataset size, and the required reliability of the estimate in mind.
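
The stabilising effect of averaging can be illustrated with a small simulation. Each simulated validation accuracy is drawn as a binomial proportion with an assumed true accuracy `p_true` and validation size `n_val`, and the draws are treated as independent. Real repeated splits reuse the same samples, so the actual gain is smaller than this sketch suggests.

```python
import numpy as np

it_rng = np.random.default_rng(0)
p_true, n_val, trials = 0.34, 200, 2000  # assumed accuracy, validation size, repetitions

spread = {}
for iterations in [1, 10, 100]:
    # each row holds `iterations` simulated validation accuracies; average across the row
    accs = it_rng.binomial(n_val, p_true, size=(trials, iterations)) / n_val
    spread[iterations] = accs.mean(axis=1).std()
    print(f"iterations = {iterations:3d}: std of the averaged accuracy = {spread[iterations]:.4f}")
```

Under these assumptions the spread falls roughly with the square root of the number of iterations, so going from 10 to 100 iterations helps much less in absolute terms than going from 1 to 10.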







2. Averaging the validation accuracy across multiple splits gives a more stable estimate of how the model performs on unseen data, and therefore usually a more accurate estimate of the test accuracy than a single split. It is still an indirect estimate, not a measurement of the test accuracy itself.

- Estimating model performance: cross-validation estimates how well the model is likely to perform on unseen data by evaluating it on multiple validation sets created from the training data.
- More accurate performance estimate: averaging the validation performance over multiple splits gives a more stable and reliable estimate than a single validation split.
- Indirect estimation of test accuracy: the averaged validation accuracy is a reasonable proxy for the test accuracy, not a direct estimate. In the run above, the average validation accuracy (about 0.336) is close to, but slightly below, the test accuracy (about 0.349).
- Final evaluation on the test set: the actual test accuracy comes from a separate, untouched test set that was not used during training or validation. This gives an unbiased and direct evaluation of the model's generalization.

In summary, averaging validation accuracy over multiple splits gives more confidence in how well the model is expected to perform on unseen data, while the actual test accuracy is determined only by evaluating the model on the separate test set.







1. Yes. Averaging the validation accuracy across multiple splits, as in k-fold cross-validation, gives more consistent and reliable estimates of a model's performance. This is one of the primary advantages of cross-validation.

- Reduces variance: averaging over multiple splits smooths out the variability that comes from any one random split, so the evaluation is less sensitive to a specific subset of the data.
- Better coverage of the data: evaluating the model on several different validation subsets (non-overlapping in k-fold, randomly drawn and possibly overlapping in the `AverageAccuracy` function above) gives a more comprehensive picture of how the model generalizes across the dataset.
- Less dependence on a single split: if a model performs exceptionally well on one split but poorly on another, averaging balances these extremes and reflects the model's actual performance more accurately.
- Robustness to data imbalance: with imbalanced classes, a single split may by chance contain very few samples of a rare class. Averaging over multiple splits reduces that effect, although a stratified split is the more direct remedy.

In summary, averaging validation accuracy across multiple splits (such as in k-fold cross-validation) is a standard practice in model evaluation. It gives a more stable and reliable estimate of the model's performance and supports informed decisions about the model's effectiveness across different subsets of the data.






