
# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through part of the ML pipeline: extracting features, training a classifier and testing it.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# set randomseed
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes like income of the block, age of the houses per district etc. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
# truncate the targets to integers so that we can classify
# (the np.int alias was deprecated in NumPy 1.20 and removed in 1.24; the builtin int works everywhere)
dataset.target = dataset.target.astype(int)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function for calculating the 1-nearest neighbours

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
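To make the distance computation concrete, here is a minimal sketch on made-up data. The three training points, their labels and the query are hypothetical values chosen for illustration; the arithmetic mirrors what `NN1` does, including the broadcasting of the query against every training row.

```python
import numpy as np

# Toy training set: three 2-D points with hypothetical labels.
toy_train = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
toy_labels = np.array([0, 1, 2])
query = np.array([4.0, 4.5])

# Broadcasting subtracts the (2,) query from every row of the (3, 2) array.
diff = toy_train - query
dist = (diff * diff).sum(axis=1)  # squared Euclidean distance to each row
nearest = np.argmin(dist)         # index of the closest training point

print(dist)                 # squared distances to the three points
print(toy_labels[nearest])  # label of the closest training point
```

The square root is skipped because it does not change which distance is smallest.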

We will also define a 'random classifier', which randomly allots labels to each sample

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used; trainlabel only supplies the set of possible classes

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)

Let us make a function to split the dataset randomly. Each sample goes into the first part with the desired probability, so `percent` is a fraction between 0 and 1.

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
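Because every sample is assigned independently, the size of each part is only approximately the requested fraction. The sketch below illustrates this with a separate generator and a hypothetical seed, and also shows an exact-size alternative based on shuffled indices. The alternative is not what the lab's `split` does; it is included only for comparison.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # separate generator, hypothetical seed
n_samples = 20640
fraction = 0.2

# Each sample independently lands in the first part with probability `fraction`.
mask = rng_demo.random(n_samples) < fraction
n_first = int(mask.sum())

expected = n_samples * fraction
std = np.sqrt(n_samples * fraction * (1 - fraction))  # binomial standard deviation
print(n_first, expected, round(float(std), 1))

# Exact-size alternative: shuffle the indices and cut at a fixed position.
perm = rng_demo.permutation(n_samples)
n_exact = int(round(n_samples * fraction))
first_idx, second_idx = perm[:n_exact], perm[n_exact:]
print(len(first_idx), len(second_idx))
```

For this dataset size the random split is typically within a hundred or so samples of the target, which is why the reported percentages are close to, but not exactly, the requested value.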

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


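For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour at distance zero (barring duplicate points that carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.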
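The 1/(number of classes) figure holds no matter how the true labels are distributed: a uniform guess matches any given label with probability 1/k. The following sketch checks this on simulated labels. The class probabilities are hypothetical and deliberately imbalanced; they are not the actual distribution of the housing targets.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)
n_classes = 6

# Hypothetical, deliberately imbalanced label distribution.
probs = np.array([0.05, 0.35, 0.30, 0.15, 0.10, 0.05])
labels = rng_demo.choice(n_classes, size=100_000, p=probs)

# Uniform random guesses, as in RandomClassifier.
guesses = rng_demo.integers(low=0, high=n_classes, size=labels.size)
acc = (labels == guesses).mean()

print(acc, 1 / n_classes)  # the two values should be close
```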

Let us predict the labels for our validation set and get the accuracy

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try different random splits.
The accuracy differs from run to run, but the values stay close together.

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Check also for extreme values for splits, like 99.9% or 0.1%.

- Increase validation set percentage:
  - When you increase the percentage of data in the validation set, the model is evaluated on more samples.
  - This gives a more reliable estimate of how well the model generalizes to unseen data.
  - The validation accuracy becomes more stable from split to split. For nearest neighbour it may also drop somewhat, because fewer samples are left for training; for the random classifier it stays near 1/(number of classes).

- Decrease validation set percentage:
  - When you decrease the percentage of data in the validation set, you use fewer samples to evaluate the model.
  - This makes the validation accuracy less reliable because it is based on a smaller sample.
  - The accuracy fluctuates more from one split to another, and it might not represent how well the model performs on new, unseen data.

- Larger train set:
  - With a larger training set, the model learns from more data.
  - This can give a better picture of the underlying patterns in the data, potentially resulting in a more accurate model.
  - The model trained on the split is then closer to the model trained on all the training data, so the validation accuracy is a better indicator of what to expect on the test set.

- Larger validation set:
  - A larger validation set gives a more reliable estimate of how well the model generalizes to unseen data.
  - This makes predictions about test set accuracy less noisy.
  - However, it also leaves less data for training, which can lower the quality of the model being evaluated.

- Balance between train and validation set:
  - Finding the right balance between the size of the training and validation sets is crucial.
  - If the training set is too small, the model may not learn enough, and the validation accuracy may not predict test set performance well.

In simple words, finding a good percentage to reserve for the validation set to balance the two factors (model learning and accurate evaluation) depends on your specific dataset and the problem you're solving. However, a common rule of thumb is to use around 20% to 30% of your data for the validation set.

Here's a bit more detail:

- If you allocate too much data to the validation set (e.g., more than 30%), you might not have enough data left for the model to learn effectively during training. This can lead to an underfit model that doesn't capture the data's patterns well.

- On the other hand, if you allocate too little data to the validation set (e.g., less than 20%), your evaluation may not be very reliable, and the validation set's accuracy might not accurately predict how well your model will perform on unseen test data.
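The trade-off described above can be explored with a small experiment. The sketch below uses synthetic data (three overlapping Gaussian blobs) instead of the housing dataset so that it runs quickly; the data, the seed, the helper names `nn_predict` and `val_accuracy`, and the chosen fractions are all assumptions made for illustration. It repeats the random split several times for each training fraction and reports the mean and the spread of the validation accuracy.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)

# Synthetic stand-in data: three overlapping 2-D Gaussian blobs.
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
y = rng_demo.integers(0, 3, size=600)
X = centers[y] + rng_demo.normal(size=(600, 2))

def nn_predict(train_x, train_y, test_x):
    # Pairwise squared distances, shape (n_test, n_train).
    d = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[d.argmin(axis=1)]

def val_accuracy(train_fraction):
    # Same random assignment as the lab's split(): True means "training sample".
    mask = rng_demo.random(len(y)) < train_fraction
    pred = nn_predict(X[mask], y[mask], X[~mask])
    return (pred == y[~mask]).mean()

results = {}
for train_fraction in (0.05, 0.5, 0.95):
    accs = np.array([val_accuracy(train_fraction) for _ in range(20)])
    results[train_fraction] = (accs.mean(), accs.std())
    print(f"train fraction {train_fraction:.2f}: "
          f"mean {accs.mean():.3f}, std {accs.std():.3f}")
```

With a very large training fraction the validation set is tiny, so the spread across repeats should be visibly larger than at an even split. The means can be passed to `plt.plot` to draw the graph suggested in the exercise.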


## Multiple Splits

One way to get a more reliable estimate of the test accuracy is <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [None]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [None]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666
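Much of the running time comes from the Python-level loop over test samples in `NN`. A possible speed-up, sketched below under the assumption that memory allows a chunk of the distance matrix at a time, expands the squared distance as ||q||² − 2 q·t + ||t||² and computes the dot products with one matrix multiplication per chunk. The function name `nn_chunked` and the `chunk` parameter are hypothetical and not part of the lab. Because the arithmetic is ordered differently, predictions could differ from `NN` when two training points are almost exactly equidistant from a query.

```python
import numpy as np

def nn_chunked(traindata, trainlabel, testdata, chunk=256):
    """1-nearest-neighbour prediction, processing test rows in chunks."""
    train_sq = (traindata ** 2).sum(axis=1)  # ||t||^2 for each training row
    preds = [trainlabel[:0]]  # empty start so concatenate works with no test rows
    for start in range(0, len(testdata), chunk):
        q = testdata[start:start + chunk]
        # ||q||^2 is constant within a row, so it cannot change the argmin
        # and is left out of the expansion.
        d = train_sq[None, :] - 2.0 * (q @ traindata.T)
        preds.append(trainlabel[d.argmin(axis=1)])
    return np.concatenate(preds)

# Check against the straightforward per-query loop on random toy data.
rng_demo = np.random.default_rng(seed=0)
toy_train = rng_demo.normal(size=(50, 8))
toy_labels = rng_demo.integers(0, 6, size=50)
toy_test = rng_demo.normal(size=(20, 8))

fast = nn_chunked(toy_train, toy_labels, toy_test, chunk=7)
slow = np.array([toy_labels[((toy_train - q) ** 2).sum(axis=1).argmin()]
                 for q in toy_test])
print((fast == slow).all())
```

A function with this signature could be passed as the `classifier` argument of `AverageAccuracy`.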


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>
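For comparison with the repeated random splits used here, the sketch below shows how k-fold cross-validation partitions the sample indices. It is a minimal illustration, not the implementation used in a later module; the helper name `kfold_indices` is hypothetical. The key difference is that every sample appears in a validation fold exactly once.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds."""
    perm = rng.permutation(n_samples)  # shuffle once
    folds = np.array_split(perm, k)    # k disjoint chunks of indices
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

rng_demo = np.random.default_rng(seed=0)
splits = list(kfold_indices(10, 5, rng_demo))
for train_idx, val_idx in splits:
    print(np.sort(val_idx), len(train_idx))
```

Each pair of index arrays can be used to slice the data and labels, after which the classifier and `Accuracy` are applied as in `AverageAccuracy`.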

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


- Single split: When you use just one split (train/validation/test), the results can be influenced by the specific data in that split. This means that your model's accuracy might vary depending on the luck of the draw when creating the split. It might not reflect the model's true performance on average.

- Multiple splits (cross-validation): Instead of relying on one split, you can perform multiple splits of your data into training and validation sets. Each time, you train your model on one part and validate it on the other. By doing this multiple times (e.g., 5 or 10 splits), you get several accuracy scores.

- Averaging: You can then calculate the average of these accuracy scores. This average provides a more stable and robust estimate of your model's performance because it accounts for variations in the data splits. It reduces the impact of random factors that might affect a single split's accuracy.

Using cross-validation, which involves averaging validation accuracy across multiple data splits, provides a more accurate estimate of test accuracy compared to a single validation split. However, it's important to clarify what we mean by "test accuracy" in this context.

- Validation accuracy: Validation accuracy is used to assess how well your model is likely to perform on unseen data (like the test set). When you perform cross-validation and average the validation accuracy scores over multiple splits, you get a more reliable estimate of how well your model generalizes to unseen data. This estimate is more accurate than relying on a single validation split.

- Test accuracy: Test accuracy, on the other hand, is the performance metric you compute on a completely separate dataset that the model has never seen before. This is the final evaluation to understand how well your model truly performs in a real-world scenario.

While cross-validation gives you a better estimate of how your model will perform on unseen data, it doesn't replace the need for a dedicated test set evaluation. The test set accuracy remains the most accurate estimate of how your model will perform in practice.

In simple words, the number of iterations here is the number of random train/validation splits that are averaged (the nearest-neighbour classifier has no training epochs). More iterations make the estimate more stable, but they do not guarantee a more accurate one.


1. Few iterations:
   - With only one or two splits, the average is dominated by the luck of those particular splits.
   - The estimate can land noticeably above or below the value other splits would give.

2. Moderate number of iterations:
   - Averaging over more splits cancels much of the split-to-split noise, so repeated runs give closer values.
   - The gains shrink as iterations are added: going from 1 to 10 splits helps far more than going from 10 to 20.

3. Many iterations:
   - Beyond some point, extra iterations mostly add running time, since each one reruns nearest neighbour on a fresh split.
   - Averaging does not remove systematic error. Each model is trained on only part of the available training data, so the average can still sit below the test accuracy of the model trained on all of it, as in the results above (about 0.336 versus 0.349).
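Taking `iterations` in `AverageAccuracy` to mean the number of random splits being averaged, its effect on the stability of the estimate can be simulated. The sketch below is a simplification: it models each split's validation accuracy as an independent binomial proportion with a hypothetical true accuracy and validation size, and it ignores the fact that real splits share the same underlying data.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)
true_acc = 0.34  # hypothetical underlying accuracy
n_val = 4000     # hypothetical validation-set size per split

def averaged_estimate(iterations):
    # Each split's accuracy is modelled as a binomial proportion (a simplification).
    accs = rng_demo.binomial(n_val, true_acc, size=iterations) / n_val
    return accs.mean()

spread = {}
for iterations in (1, 10, 100):
    # Repeat the whole averaging procedure many times and measure its spread.
    estimates = np.array([averaged_estimate(iterations) for _ in range(500)])
    spread[iterations] = estimates.std()
    print(iterations, round(float(spread[iterations]), 5))
```

Under this model the spread shrinks roughly with the square root of the number of iterations, while the value the average converges to stays the same. More iterations therefore reduce noise but cannot correct a biased estimate.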


1. Small training dataset:
   - When the training split is very small, nearest neighbour has few samples to match against, so its validation accuracy tends to be lower.
   - More iterations average out the noise in that number, but every iteration suffers from the same shortage of training data.
   - The averaged estimate therefore stays pessimistic compared with a model trained on more data.

2. Small validation dataset:
   - When the validation split is very small, each single accuracy value is unreliable because it is based on few samples.
   - Here more iterations do help: averaging over many small validation splits smooths out much of that variability.
   - All splits are drawn from the same pool of data, so the average cannot be more informative than that pool allows.

In summary, increasing the number of iterations reduces the noise in the estimate, which can partly make up for a small validation set, but it cannot make up for a training set that is too small.