
# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through a part of the ML pipeline: extracting features, training a classifier, and testing it.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# fix the random seed so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes such as the median income and the median house age of a district. The task is to predict the median house value of each district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
dataset.target = dataset.target.astype(int)  # truncate to integer classes so that we can classify (np.int was removed from NumPy)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that predicts a label with the 1-nearest-neighbour rule, followed by a wrapper that applies it to every test sample.

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
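
As a quick sanity check of the nearest-neighbour rule, the sketch below applies the same squared-distance logic to a tiny hand-made dataset. The function name `nearest_label` and the toy points are made up for this illustration and are not part of the lab's code.

```python
import numpy as np

def nearest_label(traindata, trainlabel, query):
    # squared Euclidean distance from the query to every training row
    dist = ((traindata - query) ** 2).sum(axis=1)
    # the prediction is the label of the closest training row
    return trainlabel[np.argmin(dist)]

# hypothetical toy data: three 2-D points with labels 0, 1 and 2
toy_train = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
toy_label = np.array([0, 1, 2])

pred_a = nearest_label(toy_train, toy_label, np.array([1.0, 1.0]))  # closest to (0, 0)
pred_b = nearest_label(toy_train, toy_label, np.array([9.0, 1.0]))  # closest to (10, 0)
print(pred_a, pred_b)  # prints: 0 2
```

Taking the square root of the distances is unnecessary here, because it does not change which training point is closest.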

We will also define a 'random classifier', which assigns each sample a label drawn uniformly at random from the classes seen in the training labels.

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used; it is kept so that the signature matches NN

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
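
A small worked example of the accuracy computation, using made-up labels (both arrays are hypothetical and only serve to illustrate the ratio):

```python
import numpy as np

gt = np.array([0, 1, 2, 2, 1])    # hypothetical ground-truth labels
pred = np.array([0, 1, 1, 2, 0])  # hypothetical predictions

# positions 0, 1 and 3 match, so 3 of the 5 predictions are correct
toy_accuracy = (gt == pred).sum() / len(gt)
print(toy_accuracy)  # prints: 0.6
```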

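Let us make a function that splits the dataset at random: each sample goes into the first part with probability `percent` (a fraction between 0 and 1, despite the name) and into the second part otherwise.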

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
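
Because every sample is assigned independently, the realised split sizes only approximate the requested fraction. The sketch below illustrates this with a boolean mask; `random_mask_split`, `demo_rng` and the seed are assumptions made for this example and are separate from the lab's `split` and `rng`.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)  # separate generator, seed chosen arbitrarily

def random_mask_split(n_samples, fraction, generator):
    # each sample independently lands in the first part with probability `fraction`
    mask = generator.random(n_samples) < fraction
    return mask, ~mask

first, second = random_mask_split(20640, 0.2, demo_rng)
# the two parts cover every sample exactly once,
# but the first part is only approximately 20% of the data
print(first.sum(), second.sum(), first.mean())
```

If an exact split size is required, shuffling the indices and slicing them at a fixed position is a common alternative.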

We will reserve about 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve about 25% of the remaining data as a validation set and train on the other 75%.

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


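For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour at distance 0 (this holds as long as no two identical feature vectors carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.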
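
A short derivation shows why uniform guessing lands near 1/(number of classes) no matter how the true labels are distributed. The symbols are introduced for this note only: $K$ is the number of classes, $p_c$ is the fraction of samples whose true label is $c$, and the guess $\hat{y}$ is assumed to be independent of the true label $y$.

```latex
\mathbb{E}[\text{accuracy}]
  = \sum_{c=1}^{K} P(y = c)\, P(\hat{y} = c)
  = \sum_{c=1}^{K} p_c \cdot \frac{1}{K}
  = \frac{1}{K}
```

With $K = 6$ this gives $1/6 \approx 0.1667$, which agrees with the measured value up to sampling noise.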

Let us predict the labels for our validation set and get the accuracy

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour (about 0.34) is considerably lower than its train accuracy (1.0), while the random classifier scores about the same on both sets (roughly 0.16 to 0.17). Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try different random splits.
You will notice that the accuracy differs from run to run, but the values stay close together.

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href="https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py">plt.plot</a>. Also check extreme values for the split, like 99.9% or 0.1%.
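
One possible way to organise such an experiment is sketched below. To stay self-contained and fast, it uses a small synthetic dataset and its own helper `one_nn_predict` instead of the housing data and the lab's functions; the data, the seed and all names in the block are assumptions made for illustration.

```python
import numpy as np

sweep_rng = np.random.default_rng(seed=1)  # arbitrary seed for this sketch

# hypothetical stand-in data: 600 samples, 2 features, 3 classes around fixed centres
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
toy_labels = sweep_rng.integers(0, 3, size=600)
toy_data = centers[toy_labels] + sweep_rng.normal(size=(600, 2))

def one_nn_predict(train_x, train_y, test_x):
    # pairwise squared distances, shape (n_test, n_train)
    d = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    # label of the closest training sample for every test sample
    return train_y[d.argmin(axis=1)]

fractions = [0.5, 0.6, 0.7, 0.8, 0.9]  # share of samples kept for training
val_accuracies = []
for frac in fractions:
    # same idea as the lab's split: a random mask decides train vs. validation
    in_train = sweep_rng.random(len(toy_labels)) < frac
    pred = one_nn_predict(toy_data[in_train], toy_labels[in_train], toy_data[~in_train])
    val_accuracies.append((pred == toy_labels[~in_train]).mean())

for frac, acc in zip(fractions, val_accuracies):
    print(f"train fraction {frac:.1f}: validation accuracy {acc:.3f}")
```

The two lists can then be passed to `plt.plot(fractions, val_accuracies)`. Repeating the loop with several seeds shows how much the curve moves from run to run.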

**ANSWERS**


**1ST ANSWER**
The size of the validation set, relative to the training set, can have a significant impact on the accuracy of the validation set, as well as on the overall model development process. The percentage of the dataset allocated to the validation set affects various aspects of model training and evaluation:

**Increasing the Percentage of the Validation Set:**

1. **Pros:**
   - **Better Estimate of Generalization:** A larger validation set provides a more reliable estimate of how well your model generalizes to unseen data. It reduces the chance of random fluctuations affecting the accuracy estimate.
   - **More Conservative Model Selection:** With a larger validation set, it's less likely that a model's performance is overestimated. This can lead to more conservative model selection, reducing the risk of choosing a model that doesn't generalize well.

2. **Cons:**
   - **Reduced Training Data:** A larger validation set means a smaller training set. With less data for training, your model may not be able to learn complex patterns as effectively. This can be especially problematic for deep learning models or when you have limited data to begin with.
   - **Increased Variability:** While a larger validation set provides a more stable accuracy estimate, it also means that each fold of cross-validation has a smaller training set, potentially leading to more variability in model performance across different folds.

**Reducing the Percentage of the Validation Set:**

1. **Pros:**
   - **More Training Data:** A smaller validation set leaves a larger portion of the data for training. This can be beneficial for models that require more data to generalize effectively, such as deep learning models.
   - **Reduced Variability:** With a larger training set, individual fold performance may be less variable, leading to more consistent results in cross-validation.

2. **Cons:**
   - **Less Reliable Estimate:** A smaller validation set may produce less reliable accuracy estimates because the estimate is more sensitive to random fluctuations in the data. It may not provide a robust assessment of model generalization.
   - **Risk of Overfitting:** With a smaller validation set, there's a greater risk of overfitting to the validation data, as the model may learn to perform well specifically on that subset rather than on unseen data.

The choice of the percentage of the validation set should be made based on the characteristics of your dataset, the amount of data available, and the specific goals of your modeling project. There is often a trade-off between obtaining a more reliable accuracy estimate and providing sufficient data for training. In practice, a typical split might allocate around 70-80% of the data to training and the remaining 20-30% to validation. However, these percentages can vary depending on the size and nature of the dataset. Cross-validation techniques can also help address some of the trade-offs by repeatedly splitting the data into training and validation subsets.

**2ND ANSWER**
The sizes of the training and validation sets can indeed affect how well the performance on the validation set predicts the accuracy on the test set. This relationship is influenced by several factors:

1. **Representativeness of Validation Set:** The validation set should be a representative sample of the overall dataset. If the validation set is too small, it may not capture the diversity and variability of the data adequately. In such cases, the performance on the validation set may not accurately reflect how well the model will generalize to the test set.

2. **Statistical Reliability:** The reliability of the validation set as a predictor of test set performance depends on the sample size. Larger validation sets tend to provide more statistically reliable estimates of model performance. Smaller validation sets are more susceptible to random fluctuations, and their accuracy may not be a robust predictor of test set accuracy.

3. **Model Complexity:** The complexity of the model you're training can also impact the relationship between validation and test set performance. Complex models with many parameters may require larger training datasets to generalize effectively. In such cases, a small validation set may not reveal overfitting issues, and the model may perform significantly worse on the test set.

4. **Cross-Validation:** Techniques like k-fold cross-validation can help mitigate the impact of validation set size. With cross-validation, the dataset is divided into multiple folds, and the model is trained and validated multiple times on different subsets. This provides a more comprehensive assessment of model performance and reduces the dependency on a single validation set size.

5. **Overfitting to the Validation Set:** If the validation set is very small, there's a risk that the model may overfit to it, essentially learning to perform well on that specific subset of the data but not on unseen data (the test set). This is more likely when the validation set size is extremely limited.

In summary, the size of the training and validation sets can affect how well the validation set predicts the accuracy on the test set. A larger, more representative validation set tends to provide a more reliable estimate of generalization performance. However, it's crucial to strike a balance between the sizes of the training and validation sets, as very small validation sets may not provide robust predictions and may not effectively identify overfitting issues. Cross-validation can be a useful approach to address these challenges and provide a more comprehensive assessment of model performance.
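
The point about statistical reliability can be made concrete with the binomial standard error. The sketch below assumes that each validation prediction is an independent trial that is correct with probability equal to the true accuracy; the helper name is hypothetical, and 0.34 is roughly the nearest-neighbour validation accuracy measured above.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    # binomial standard error of an accuracy measured on n_samples independent samples
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

for n in (50, 500, 5000):
    se = accuracy_standard_error(0.34, n)
    print(f"validation size {n}: standard error {se:.4f}")
```

Under this assumption the error shrinks with the square root of the validation size, so a set 100 times larger gives an estimate that is about 10 times tighter.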

**3RD ANSWER**
A good percentage depends on several factors:


1. **Dataset Size**: For larger datasets, you can often afford to allocate a smaller percentage to the validation set, as there's still a substantial amount of data for training. Conversely, with smaller datasets, you may need to allocate a larger portion to validation to ensure a representative sample.

2. **Model Complexity**: More complex models (e.g., deep learning models with many parameters) may require larger validation sets to detect overfitting. In such cases, you might opt for a larger validation set to ensure that the model's performance on the validation set is a reliable indicator of generalization.

3. **Cross-Validation**: If you're using k-fold cross-validation, the choice of k (the number of folds) can also impact the size of each fold. Smaller values of k result in larger validation sets, while larger values of k lead to smaller validation sets within each fold. Adjusting the value of k can help you balance the need for reliable performance estimation and ample training data.

4. **Data Quality**: If your dataset contains noisy or low-quality data, you might consider a larger validation set to mitigate the impact of noise on performance estimation.

5. **Computational Resources**: The available computational resources can also influence your choice. Training deep learning models or running complex algorithms may require more computational power, so you might allocate a larger portion of the data to training if resources are limited.

6. **Domain Knowledge**: Domain expertise can play a role. If you have prior knowledge about your problem and dataset, it can guide your decision on validation set size. For instance, if you know that specific subsets of your data are more challenging, you may allocate more data to validation from those subsets.

In practice, there's no one-size-fits-all answer. A good approach is to start with a reasonable initial split (e.g., 70-30 or 80-20 for training-validation), and then iteratively adjust it based on your observations during model development and validation. The key is to strike a balance that allows you to obtain reliable estimates of model performance while ensuring that your model has sufficient data for training. Experimentation and iteration are often necessary to find the right balance for your specific project.

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function takes a long time to execute, because the classifier is rerun for every split.

In [13]:
# this function also works for the random classifier: pass classifier=RandomClassifier
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [14]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation and leave-one-out. These will be covered in detail in a later module. For more information about cross-validation, check <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross-validation (Wikipedia)</a>.
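
For comparison with the repeated random splits used above, here is a minimal sketch of how k-fold cross-validation partitions the sample indices. `kfold_indices`, `fold_rng` and the seed are hypothetical names for this illustration; scikit-learn offers a `KFold` class for real use.

```python
import numpy as np

def kfold_indices(n_samples, k, generator):
    # shuffle the sample indices once, then cut them into k nearly equal folds
    order = generator.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        # fold i is the validation set, the other k-1 folds form the training set
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

fold_rng = np.random.default_rng(seed=0)  # arbitrary seed for this sketch
splits = list(kfold_indices(10, 5, fold_rng))
for train_idx, val_idx in splits:
    print(sorted(val_idx.tolist()), len(train_idx))
```

Unlike repeated random splits, every sample appears in exactly one validation fold, so no sample is validated twice or left out.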

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?
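
To explore these questions, it helps to keep the accuracy of every split and look at the spread as well as the mean. The sketch below is an illustration, not part of the lab's code: `summarize` is a hypothetical helper, and its inputs are the two nearest-neighbour validation accuracies printed earlier in this lab.

```python
import numpy as np

def summarize(accuracies):
    # mean and sample standard deviation of a list of per-split accuracies
    acc = np.asarray(accuracies, dtype=float)
    return acc.mean(), acc.std(ddof=1)

# validation accuracies from the two random splits shown earlier
mean_acc, std_acc = summarize([0.34108527131782945, 0.34048257372654156])
print(f"mean = {mean_acc:.4f}, std = {std_acc:.5f}")
```

Collecting the per-split values from a loop like the one in `AverageAccuracy` and passing them to such a helper shows directly how the spread changes with the number of iterations and with the split percentage.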


**ANSWERS**


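**1ST ANSWER**
Yes, averaging the validation accuracy across multiple splits gives more consistent results, for the following reasons: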


1. **Reduces Variance**: By splitting your dataset into multiple subsets (folds) and training and evaluating the model on each of these folds separately, you get a more comprehensive view of how your model performs across different subsets of data. This helps reduce the impact of random variability in the data, leading to more stable and consistent results.

2. **Better Generalization**: Averaging the accuracy over multiple folds provides a more robust estimate of your model's generalization performance. It helps ensure that your model isn't overfitting to a specific subset of the data and is capable of making accurate predictions on unseen data.

3. **Detects Overfitting**: If your model's performance varies significantly across different folds, it may indicate that your model is overfitting to the training data. Consistent accuracy scores across folds are a good sign that your model is not overly specialized to the training data.

4. **More Informative**: Instead of relying on a single train-test split, cross-validation provides multiple accuracy scores, which can give you a better sense of the variability in model performance. This information is valuable when assessing the model's stability and robustness.

5. **Model Selection**: Cross-validation is often used for model selection and hyperparameter tuning. By comparing the average performance of different models or parameter configurations across folds, you can make more informed decisions about which model or settings to choose.

6. **Small Dataset Handling**: In situations where you have a relatively small dataset, cross-validation is particularly useful. It allows you to maximize the use of your limited data by repeatedly partitioning it into training and validation sets.

However, it's important to note that while cross-validation provides more reliable estimates of model performance, it also comes at the cost of increased computational time, as you need to train and evaluate the model multiple times (k times for k-fold cross-validation). Additionally, the choice of the number of folds (k) can impact the results, with larger values of k leading to more stable but computationally expensive evaluations.

In summary, averaging validation accuracy across multiple splits using k-fold cross-validation is a valuable technique for assessing and comparing machine learning models. It helps you obtain more consistent and trustworthy estimates of a model's performance, especially in situations where data variability or overfitting may be concerns.

**2ND ANSWER**


Here's how k-fold cross-validation works and why it's a more accurate estimate than a single train-test split:

1. **K-Fold Cross-Validation**: In k-fold cross-validation, your dataset is divided into k subsets (or "folds"). The model is trained and evaluated k times, each time using a different fold as the validation set and the remaining k-1 folds as the training set. The results (e.g., accuracy scores) from each fold are averaged to provide an overall estimate of model performance.

2. **Benefits of Cross-Validation**:
   - It reduces the impact of data variability and randomness. A single train-test split can lead to highly variable results depending on which data points end up in the training and test sets. Cross-validation averages out this variability.
   - It helps detect and prevent overfitting. If a model performs exceptionally well on a single train-test split but poorly on another, it may be overfitting to the specific training data. Cross-validation provides a more robust assessment of a model's ability to generalize.

3. **Estimating Test Accuracy**: While k-fold cross-validation provides a more accurate estimate of how your model is likely to perform on new, unseen data, it's essential to understand that it's still an estimate. It reflects the model's performance on a representative subset of your data but doesn't guarantee identical performance on a different dataset.

4. **Final Test Set**: After model development and validation using cross-validation, it's a good practice to set aside a separate test dataset that has not been used during model training or validation. This final test set serves as a more direct estimate of how your model will perform on entirely new and unseen data.

In summary, k-fold cross-validation gives you a more accurate and reliable estimate of your model's generalization performance compared to a single train-test split. However, it is not a direct estimate of the test accuracy on entirely new data, which is the role of a final, untouched test set. Cross-validation is a valuable tool for model development, selection, and evaluation, but it should be complemented by a final test set evaluation for a complete assessment of model performance.

**3RD ANSWER**


**Effect of Number of Iterations on the Estimate:**

1. **Low Number of Iterations (e.g., 2 or 3 splits):** With only a few iterations, the averaged estimate of model performance is subject to higher variability. In other words, the result is more sensitive to the specific random splits of the data, so the estimate is less reliable.

2. **Moderate Number of Iterations (e.g., 5 or 10 splits):** A moderate number of iterations strikes a balance between computational efficiency and reliable estimates. It provides reasonably stable estimates of model performance and is often a practical choice for most machine learning tasks.

3. **High Number of Iterations (e.g., 20 or 50 splits):** Increasing the number of iterations can further reduce variability in the estimate, resulting in more stable and reliable performance metrics. However, it comes at the cost of increased computational time, as the model needs to be trained and evaluated once per iteration.

**Considerations When Choosing the Number of Iterations:**

1. **Data Size:** The size of your dataset plays a role. With a very large dataset, you may be able to achieve reliable estimates with fewer iterations. Conversely, with a small dataset, more iterations can help stabilize the estimate.

2. **Computational Resources:** A higher number of iterations requires more computational resources and time. You should consider the available resources and the trade-off between computational cost and the accuracy of the estimate.

3. **Stability:** If you notice that your model's performance estimates are highly variable with a low number of iterations, increasing the number of iterations can lead to more consistent results.

4. **Cross-Validation Strategy:** The choice of cross-validation strategy can also impact the effect of the number of iterations. For example, stratified k-fold cross-validation ensures that each fold has a balanced representation of classes, which can be important in imbalanced datasets.

In general, there's a diminishing return on the improvement in estimate stability as you increase the number of iterations. After a certain point, the additional computational cost may not be justified by the marginal improvement in estimate reliability. A moderate number of iterations (e.g., 5 or 10) is often a reasonable choice for many machine learning tasks, but it's essential to consider the specific characteristics of your data and your computational constraints when making this decision.
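
The diminishing return can be illustrated with the standard error of a mean. The sketch below makes an idealised assumption that the per-split accuracies are independent with a common standard deviation `sigma`. Repeated splits of the same data overlap, so the real improvement is likely smaller. The value 0.01 and the function name are hypothetical.

```python
import math

def spread_of_mean(sigma, iterations):
    # standard error of the average of `iterations` independent values
    return sigma / math.sqrt(iterations)

sigma = 0.01  # hypothetical standard deviation of a single split's accuracy
for iterations in (1, 5, 10, 50, 100):
    print(iterations, round(spread_of_mean(sigma, iterations), 5))
```

Going from 1 to 10 iterations cuts the spread by roughly a factor of 3, while going from 10 to 100 costs ten times more computation for the same relative gain.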

**4TH ANSWER**

**Advantages of Increasing Iterations with Small Datasets:**

1. **Better Utilization of Data:** With a larger number of iterations, you can make more effective use of the available data. Each fold represents a different split of the data, allowing your model to be trained on and evaluated against various subsets of the dataset.

2. **Reduced Variability:** Small datasets tend to produce more variable results due to the limited number of data points. Increasing iterations can help reduce this variability by averaging the performance metrics over multiple splits, providing a more stable estimate of model performance.

3. **Robustness Assessment:** More iterations allow you to assess the robustness of your model by observing how its performance varies across different subsets of the data. This can help you identify whether your model is overly sensitive to specific training/validation splits.

**Considerations and Limitations:**

1. **Computational Cost:** Increasing the number of iterations also increases the computational cost, since the model is trained and evaluated once per iteration. With a very small dataset each iteration is cheap, but the total time still grows in proportion to the number of iterations.

2. **Sample Size per Fold:** With a very small dataset, increasing the number of iterations might result in extremely small training or validation sets in some folds. This can lead to poor model training or evaluation, as the model may struggle to learn from or generalize to such small subsets.

3. **Bias-Variance Trade-off:** While more iterations can reduce variability, they may not address fundamental issues related to dataset size. Very small datasets inherently have limitations in terms of their representativeness, and increasing iterations can't create more data.

4. **Risk of Overfitting to Validation Set:** With very small validation sets, there's a higher risk that the model might overfit to those subsets, as it may have more room to "memorize" rather than "learn" patterns. Cross-validation can help, but it doesn't fundamentally resolve the small dataset problem.

In summary, increasing the number of iterations in cross-validation is a valuable technique to extract as much information as possible from a small dataset and to obtain more stable performance estimates. However, it doesn't fully compensate for the limitations of small datasets, such as the risk of overfitting and the inherent lack of data diversity. If possible, collecting more data or using data augmentation techniques can be more effective solutions to address the challenges associated with very small datasets.