<a href="https://colab.research.google.com/github/Amrutha4561/FMML-lab/blob/main/Module_01_Lab_02_MLPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through part of the ML pipeline: extracting features, training, and testing.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# seed the random number generator so that results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes like income of the block, age of the houses per district etc. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
dataset.target = dataset.target.astype(int)  # cast to integers so that we can classify (np.int was removed from recent NumPy)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that implements the 1-nearest-neighbour classifier:

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
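
As a rough illustration of the same idea, the loop over test samples can also be written with NumPy broadcasting. This is only a sketch: the function name `nn_predict_vectorized` and the toy arrays are made up for this example and are not part of the lab code.

```python
import numpy as np

def nn_predict_vectorized(traindata, trainlabel, testdata):
    # Pairwise differences via broadcasting:
    # (n_test, 1, d) - (1, n_train, d) -> (n_test, n_train, d)
    diff = testdata[:, None, :] - traindata[None, :, :]
    dist = (diff * diff).sum(axis=2)  # squared Euclidean distances
    nearest = dist.argmin(axis=1)     # closest training sample per test sample
    return trainlabel[nearest]

# Hypothetical toy data: two points near the origin (class 0), one far away (class 1)
toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 1])
toy_test = np.array([[0.2, 0.1], [4.0, 4.5]])
toy_pred = nn_predict_vectorized(toy_train, toy_label, toy_test)
print(toy_pred)  # [0 1]
```

Note that the intermediate array has one entry per (test sample, train sample, feature) triple, so on the full housing data this version would need to be applied in batches to fit in memory.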

We will also define a 'random classifier', which assigns a random label to each sample.

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used; trainlabel only supplies the set of classes and testdata the number of predictions

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
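
To make the definition concrete, here is a small worked example on made-up labels (the arrays `gt` and `pred` are hypothetical and only serve to illustrate the computation):

```python
import numpy as np

gt = np.array([0, 1, 2, 2, 1])    # hypothetical ground-truth labels
pred = np.array([0, 2, 2, 2, 0])  # hypothetical predictions
matches = (gt == pred)            # [True, False, True, True, False]
toy_accuracy = matches.sum() / len(gt)
print(toy_accuracy)  # 0.6
```

Three of the five predictions match the ground truth, so the accuracy is 3/5.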

Let us write a function that splits the dataset randomly: each sample goes into the first split with the given probability and into the second split otherwise.

In [6]:
def split(data, label, percent):
  # draw a uniform random number in [0, 1) for each sample; `percent` is a fraction such as 0.2
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label

We will reserve about 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %
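
The test share is close to 20% but not exact because each sample is assigned to a split independently. The sketch below illustrates this with a separate generator (`demo_rng` and its seed are assumptions for the demo), and contrasts it with an index-shuffling alternative that is not used in this lab but yields an exact split size.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)  # hypothetical seed for this demo
n_samples = 20640

# Mask-based split, as in split(): each sample is assigned independently
mask = demo_rng.random(n_samples) < 0.2
print(mask.sum(), mask.mean())  # near 4128 and 0.2, but rarely exact

# Alternative (not used in this lab): shuffle indices and cut at a fixed position
perm = demo_rng.permutation(n_samples)
n_test = n_samples // 5  # exactly 20% of the samples
test_idx, rest_idx = perm[:n_test], perm[n_test:]
print(len(test_idx), len(rest_idx))  # 4128 16512
```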


## Experiments with splits

Let us reserve part of our train data as a validation set.

In [9]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [10]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


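For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour at distance 0 (barring duplicate points that carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is about 0.1667 for our 6 classes.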
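
The 1/(number of classes) figure holds for uniform random guessing even when the classes are imbalanced. A minimal simulation sketch, assuming made-up class frequencies (`class_probs` is hypothetical and not measured from the housing data):

```python
import numpy as np

sim_rng = np.random.default_rng(seed=0)  # hypothetical seed for this demo
num_classes = 6
# Deliberately imbalanced, made-up class frequencies
class_probs = np.array([0.05, 0.35, 0.30, 0.15, 0.10, 0.05])
true_labels = sim_rng.choice(num_classes, size=100_000, p=class_probs)
# Uniform random guesses, as in RandomClassifier
random_guesses = sim_rng.integers(low=0, high=num_classes, size=true_labels.size)
sim_accuracy = (true_labels == random_guesses).mean()
print(sim_accuracy, 1 / num_classes)  # the two values should be close
```

Each guess matches the true label with probability 1/6 regardless of which class the sample belongs to, so the class distribution does not matter.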

Let us predict the labels for our validation set and get the accuracy

In [11]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy

In [12]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [13]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the splits, like 99.9% or 0.1%.
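
One possible way to organise such an experiment is sketched below. To stay fast and self-contained it uses synthetic stand-in data (three noisy 2-D clusters) rather than the housing data, and the helper `nn_accuracy` and all variable names are assumptions made for this sketch.

```python
import numpy as np

sweep_rng = np.random.default_rng(seed=0)  # hypothetical seed for this demo

# Synthetic stand-in data: three noisy 2-D clusters (not the housing data)
centers = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
labels = sweep_rng.integers(0, 3, size=600)
points = centers[labels] + sweep_rng.normal(scale=1.0, size=(600, 2))

def nn_accuracy(train_x, train_y, val_x, val_y):
    # 1-nearest-neighbour prediction followed by accuracy
    diff = val_x[:, None, :] - train_x[None, :, :]
    nearest = (diff * diff).sum(axis=2).argmin(axis=1)
    return (train_y[nearest] == val_y).mean()

train_fractions = [0.01, 0.1, 0.5, 0.75, 0.9, 0.99]
mean_acc, std_acc = [], []
for frac in train_fractions:
    accs = []
    for _ in range(20):
        in_train = sweep_rng.random(len(labels)) < frac
        if in_train.sum() == 0 or (~in_train).sum() == 0:
            continue  # skip degenerate splits at extreme fractions
        accs.append(nn_accuracy(points[in_train], labels[in_train],
                                points[~in_train], labels[~in_train]))
    mean_acc.append(np.mean(accs))
    std_acc.append(np.std(accs))

for frac, m, s in zip(train_fractions, mean_acc, std_acc):
    print(f"train fraction {frac:.2f}: mean {m:.3f}, std {s:.3f}")
```

The lists `train_fractions`, `mean_acc`, and `std_acc` can then be passed to a plotting call; the same loop structure applies to the lab's own split, NN, and RandomClassifier functions.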

1. Changing the percentage of data allocated to the validation set affects the validation accuracy in the following ways:

i. Increase in Validation Set Percentage:

Pros:
A larger validation set typically gives a more reliable estimate of the model's performance, because the accuracy is averaged over more unseen samples. This matters for judging generalization.

It also helps in detecting overfitting: a gap between train and validation accuracy is more trustworthy when the validation set is large.

Cons:
Fewer data points are left for training. This makes it harder for the model to learn effectively, especially when data is limited, so for a learned classifier such as nearest neighbour the validation accuracy tends to drop.

ii. Decrease in Validation Set Percentage:

Pros:
More data is available for training, which can help the model learn better.
This works well when the dataset is large enough that a smaller validation set still gives a reasonable estimate of model performance.

Cons:
A smaller validation set gives a less reliable estimate of the model's performance. The validation accuracy fluctuates more from split to split, making it harder to draw conclusions about generalization.
Overfitting can be harder to detect because the validation set is less representative.

2. The sizes of the train and validation sets affect how well the validation accuracy predicts the accuracy on the test set:

i. Large Training Set:

With a large training set and a relatively small validation set, the model has more data to learn from. This usually gives a better-trained model, which is likely to perform well on the validation set if the validation data follows the same distribution as the training data.
However, a small validation set may not give a reliable estimate of performance on unseen data (the test set). It may not be representative enough, and any choices tuned against it may overfit to it.

ii. Large Validation Set:

Allocating a significant portion of the data to the validation set gives a better estimate of how the model is likely to perform on unseen data.
However, this leaves less data for training, so the model may not generalize as well because it has less data to learn from.

iii. Balancing Trade-offs:

There is a trade-off between training and validation set sizes. A larger training set generally helps the model learn better, while a larger validation set gives a more accurate estimate of the model's performance.
The right balance depends on the problem. Cross-validation techniques, which split the data into multiple train/validation sets and average the results, can help mitigate the trade-off.

3. Common choices for the validation split include:

i. 70/30 or 80/20 Split:
A 70% training and 30% validation split, or an 80% training and 20% validation split, is a common starting point. It is a reasonable choice for many datasets and models: the training set is large enough to learn from, and a substantial portion is left for validation.

ii. Cross-Validation:
When the dataset is relatively small, k-fold cross-validation is often used. The data is split into 'k' subsets (folds) and the model is trained 'k' times, each time using a different fold for validation. This gives a more robust estimate of the model's performance and ensures that every sample is used for validation at some point.

iii. Leave-One-Out Cross-Validation (LOOCV):
For very small datasets, you can use LOOCV, where each data point is used as a separate validation set in turn. While this is highly informative, it can be computationally expensive.

iv. Stratified Sampling:
If the dataset is imbalanced (e.g., significantly more samples from one class than another), stratified sampling ensures that each class is represented proportionally in both the training and validation sets.

v. Time Series Data:
With time series data, it is often best to reserve a fixed time period for validation. For example, with daily stock prices, you might use the most recent year's data for validation.

vi. Domain-Specific Considerations:
The nature of the problem, the amount of available data, and the specific requirements of the application also influence the choice of validation set size.
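
The k-fold procedure mentioned above can be sketched with plain NumPy as follows. The generator `kfold_indices` is a hypothetical helper written for this illustration; it is not part of the lab code or of any library.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # Shuffle once, then cut the shuffled indices into k nearly equal folds
    perm = rng.permutation(n_samples)
    folds = np.array_split(perm, k)
    for i in range(k):
        val_idx = folds[i]  # fold i is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

fold_rng = np.random.default_rng(seed=0)  # hypothetical seed for this demo
seen = []
for train_idx, val_idx in kfold_indices(n_samples=10, k=5, rng=fold_rng):
    print(len(train_idx), len(val_idx))  # 8 2 on every fold
    seen.extend(val_idx.tolist())
```

Every index appears in exactly one validation fold. Setting `k` equal to the number of samples gives leave-one-out cross-validation.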










## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [14]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [15]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ', Accuracy(testlabel, testpred))

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. This will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?

1. Yes. Averaging the validation accuracy across multiple splits gives more consistent and reliable results than a single split. This is the idea behind cross-validation: split the data into training and validation subsets several times, evaluate the model on each split, and average the results. The most common form is k-fold cross-validation; the AverageAccuracy function above uses repeated random splits instead.

Here's why averaging across multiple splits can be beneficial:

i. Reduced Variance:

With a single train/validation split, the result can be heavily influenced by which data points happen to land in each part. Averaging over several splits smooths out this variability.

ii. Better Estimation:

Repeatedly partitioning the data into different training and validation sets gives a more robust estimate of the model's performance, and hence a more realistic picture of how it is likely to perform on unseen data.

iii. Minimizing Bias:

A single random split may, by chance, produce a particularly easy or difficult validation set. Averaging over multiple splits reduces the effect of such a split.

iv. Model Robustness Assessment:

Cross-validation also shows how robust the model is to different subsets of data. If performance varies significantly across splits, the model is sensitive to the choice of training data, which can be a concern.

v. More Data Utilization:

Cross-validation lets more of the data take part in both training and validation. In k-fold cross-validation, each data point is used for validation exactly once, which can be important for small datasets.

The choice of 'k' (the number of folds) also affects the results. Common choices are 5 or 10, depending on the size of the dataset and the computational resources available. Larger values of 'k' leave more data for training in each fold, but they make each validation fold smaller and cost more computation.
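
The consistency claim can be checked empirically. The sketch below compares the spread of single-split accuracies with the spread of 10-split averages, using synthetic stand-in data and a compact 1-nearest-neighbour so that it runs quickly; all names (`avg_rng`, `one_split_accuracy`, and so on) are assumptions made for this illustration.

```python
import numpy as np

avg_rng = np.random.default_rng(seed=0)  # hypothetical seed for this demo

# Synthetic stand-in data: three noisy 2-D clusters (not the housing data)
centers = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
y = avg_rng.integers(0, 3, size=300)
X = centers[y] + avg_rng.normal(scale=1.0, size=(300, 2))

def one_split_accuracy(frac=0.75):
    # One random train/validation split followed by 1-NN validation accuracy
    in_train = avg_rng.random(len(y)) < frac
    diff = X[~in_train][:, None, :] - X[in_train][None, :, :]
    nearest = (diff * diff).sum(axis=2).argmin(axis=1)
    return (y[in_train][nearest] == y[~in_train]).mean()

# 40 estimates from a single split each, and 40 estimates that each average 10 splits
single = [one_split_accuracy() for _ in range(40)]
averaged = [np.mean([one_split_accuracy() for _ in range(10)]) for _ in range(40)]
print("std of single-split estimates:", np.std(single))
print("std of 10-split averages:     ", np.std(averaged))
```

The averaged estimates should show a smaller standard deviation, which is what "more consistent" means here.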

2. Cross-validation, including k-fold and leave-one-out cross-validation, can give a more accurate estimate of a model's test accuracy than a single train/validation split, especially when the dataset is limited. In the experiment above, the average validation accuracy (about 0.336) is close to the test accuracy (about 0.349), though slightly lower; this is plausible because each validation run trains on only about 75% of the data that the final model uses. How accurate the estimate needs to be depends on the problem and the goals of the analysis. Here are some key points to consider:

i. Reduced Variance:

With a single split, the performance metric (e.g., accuracy) can vary significantly depending on the random choice of data in that split. Cross-validation averages the results over multiple splits, providing a more stable estimate.

ii. Bias-Variance Trade-off:

With multiple splits, you can observe how the model's performance varies across different subsets of the data. This helps you judge how well the model generalizes and whether it is overfitting or underfitting.

iii. Better Utilization of Data:

In k-fold cross-validation, each data point is used for validation exactly once, which is particularly useful when the dataset is limited.

iv. Robustness:

Cross-validation gives a more robust assessment because it covers the entire dataset and multiple evaluation scenarios. This is valuable for model selection, hyperparameter tuning, and evaluating model stability.

v. Avoiding Data Leakage:

Data leakage occurs if preprocessing steps (such as scaling) are computed on the entire dataset before splitting. Cross-validation does not prevent this by itself; the preprocessing has to be fitted on the training portion of each fold and then applied to the validation portion.

vi. Validation for Small Datasets:

With limited data, cross-validation is almost essential. Leave-one-out cross-validation (LOOCV) is often used for very small datasets because each data point serves as a validation point exactly once.
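
On the data-leakage point, one common precaution is to compute preprocessing statistics on the training portion only and reuse them for the validation portion. A minimal sketch for standardization, assuming made-up features (`features`, `leak_rng`, and the 75% split are hypothetical choices for this demo):

```python
import numpy as np

leak_rng = np.random.default_rng(seed=0)  # hypothetical seed for this demo
features = leak_rng.normal(loc=5.0, scale=3.0, size=(100, 2))  # made-up data
in_train = leak_rng.random(100) < 0.75

# Fit the scaling statistics on the training portion only
train_mean = features[in_train].mean(axis=0)
train_std = features[in_train].std(axis=0)

train_scaled = (features[in_train] - train_mean) / train_std
val_scaled = (features[~in_train] - train_mean) / train_std  # reuse train statistics
print(train_scaled.mean(axis=0))  # approximately 0 by construction
print(val_scaled.mean(axis=0))    # close to 0, but generally not exactly 0
```

In a cross-validation loop, the same two steps would be repeated inside each split rather than once on the whole dataset.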

3. The number of iterations affects the estimate of a model's performance. In the AverageAccuracy function, an iteration is one random train/validation split; in k-fold cross-validation, the analogous setting is the number of folds 'k'. The goal is a reliable estimate without excessive computation. Here's how the number of iterations can affect the estimate:

i. Fewer Iterations (Smaller 'k'):

Pros:

Computationally less expensive: Using fewer iterations is quicker because the model is fitted and evaluated fewer times.
Useful for larger datasets: With large datasets, you can still obtain a reasonable estimate of model performance with fewer iterations.

Cons:

Higher variance: The estimate is more variable with fewer iterations, because it depends more on the specific random splits.

ii. More Iterations (Larger 'k'):

Pros:

Lower variance: With more iterations, the averaged estimate is less sensitive to the specific data splits, so it is more stable. The improvement levels off, since every split is drawn from the same data.
Useful for smaller datasets: With small datasets, more iterations help make the most of the available data.

Cons:

Increased computational cost: The cost grows with the number of iterations. This can become a concern with very large datasets or complex models.
Small validation folds: In k-fold cross-validation, a very large 'k' makes each validation fold small, so the per-fold accuracies become noisy and may not be representative of the overall dataset.
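
For k-fold cross-validation specifically, the effect of 'k' on the subset sizes is simple arithmetic: each validation fold holds roughly n/k samples and the remaining samples are used for training. A short sketch, using the size of the non-test portion reported earlier in this lab (the dictionary `fold_sizes` is just a name chosen for this example):

```python
n_samples = 16496  # size of the non-test portion reported earlier in this lab

fold_sizes = {}
for k in (2, 5, 10, 100):
    val_size = n_samples // k          # approximate size of one validation fold
    train_size = n_samples - val_size  # the remaining samples are used for training
    fold_sizes[k] = (train_size, val_size)
    print(f"k={k:>3}: train ~{train_size}, validation ~{val_size}")
```

As 'k' grows, the training portion approaches the full dataset while each validation fold shrinks.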


4. Increasing the number of iterations (or using a larger 'k' in k-fold cross-validation) helps make the most of a small dataset to some extent, but it does not solve the problem of a very small training or validation dataset. More iterations improve the stability of the performance estimate, but there are practical limitations:

i. Advantages of Increasing Iterations:

Improved Stability: With more iterations, the performance estimate becomes more stable. Each split gives a different view of how the model performs on different subsets of data, reducing the influence of a single random split.

ii. Limitations and Considerations:

Data Size: Increasing iterations does not create more data. If the training dataset is very small, it remains small, which limits the model's ability to learn and generalize. Averaging removes noise from the estimate, not the error caused by training on too little data.

Overfitting: With an extremely small training set, the model may fit the noise in the data rather than the underlying patterns. Repeating the split does not fix this. Note that a larger 'k' makes the validation folds smaller, not the training sets.

Computational Cost: More iterations mean more model training and evaluation, which can become computationally expensive. This matters if computational resources are limited.

To mitigate the challenges of working with a very small dataset, you may consider the following approaches:

Use a Simpler Model: With limited data, it's often advisable to use a simpler model with fewer parameters to avoid overfitting.

Data Augmentation: If applicable to your problem, you can use data augmentation techniques to artificially increase the size of your dataset by creating variations of the existing data points.

Regularization: Apply regularization techniques like L1 or L2 regularization to prevent the model from fitting the training data too closely.

Transfer Learning: If relevant, consider using transfer learning by fine-tuning a pre-trained model on your small dataset.

Feature Engineering: Carefully engineer features to make the most of the limited information available.

Stratified Sampling: Ensure that your cross-validation splits maintain class balance if you're dealing with a classification problem.

Use Additional Data: If possible, acquire more data to increase the size of your dataset. This is often the most effective way to improve model performance.
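
As an illustration of the stratified sampling item above, the sketch below builds a train mask that keeps roughly the same class proportions on both sides of the split. The helper `stratified_split` and the toy labels are hypothetical; the lab's own split function does not stratify.

```python
import numpy as np

def stratified_split(label, train_fraction, rng):
    # Build a boolean train mask that keeps roughly the same class mix on both sides
    in_train = np.zeros(len(label), dtype=bool)
    for cls in np.unique(label):
        idx = rng.permutation(np.flatnonzero(label == cls))  # shuffle this class's indices
        n_train = int(round(train_fraction * len(idx)))
        in_train[idx[:n_train]] = True
    return in_train

strat_rng = np.random.default_rng(seed=0)   # hypothetical seed for this demo
toy_label = np.array([0] * 80 + [1] * 20)   # imbalanced toy labels
in_train = stratified_split(toy_label, 0.75, strat_rng)
print(np.bincount(toy_label[in_train]))   # [60 15]
print(np.bincount(toy_label[~in_train]))  # [20  5]
```

Both parts keep the original 80/20 class ratio, which a purely random split only achieves approximately.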
