<a href="https://colab.research.google.com/github/BhuvaneswariGudivaka/FMML-LAB-1/blob/main/Module_01_Lab_02_MLPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through a part of the ML pipeline: extracting features, training a classifier, and testing it.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# seed the random number generator so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes such as the income of the block and the age of the houses in each district. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
# np.int was removed in NumPy 1.24; the builtin int is the drop-in replacement
dataset.target = dataset.target.astype(int)  # truncate the prices to integers so that we can classify
print(dataset.data.shape)
print(dataset.target.shape)

Here is a function that implements a 1-nearest-neighbour classifier.

In [None]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
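
To see what the distance computation does, here is a minimal sketch on a hand-made toy dataset. The function name `nn1_predict` and the toy arrays are made up for illustration; the logic mirrors `NN1` above.

```python
import numpy as np

def nn1_predict(traindata, trainlabel, query):
    # Broadcasting subtracts the query (shape (d,)) from every row of traindata (shape (n, d))
    diff = traindata - query
    # Squared Euclidean distance from the query to each training point
    dist = (diff * diff).sum(axis=1)
    # The prediction is the label of the closest training point
    return trainlabel[np.argmin(dist)]

# Hypothetical toy data: two points of class 0 near the origin, one point of class 1 far away
toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 1])
toy_query = np.array([4.0, 4.5])

toy_pred = nn1_predict(toy_train, toy_label, toy_query)
print(toy_pred)  # squared distances are 36.25, 21.25 and 1.25, so the third point wins: 1
```

Squared distances are enough here because taking the square root does not change which point is closest.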

We will also define a 'random classifier', which randomly allots a label to each sample.

In [None]:
def RandomClassifier(traindata, trainlabel, testdata):
  # in reality, we don't need these arguments

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [None]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
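
As a quick worked example of the metric, the sketch below computes accuracy on five hand-made labels. The arrays are hypothetical and only serve to show the arithmetic.

```python
import numpy as np

# Hypothetical ground-truth and predicted labels for five samples
toy_gt = np.array([0, 1, 2, 2, 1])
toy_guess = np.array([0, 2, 2, 2, 0])

# Element-wise comparison gives [True, False, True, True, False]
matches = toy_gt == toy_guess
toy_accuracy = matches.sum() / len(toy_gt)
print(toy_accuracy)  # 3 correct out of 5 -> 0.6
```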

Let us make a function to split the dataset with the desired probability.

In [None]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
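
Because `split` draws one random number per sample, the size of each part is itself random and only approximately matches the requested fraction. The sketch below illustrates this on plain indices and contrasts it with an exact-size split obtained by shuffling. The generator `rng_demo`, its seed, and the variable names are assumptions for illustration, not part of the lab code.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # hypothetical seed, for illustration only
n = 10000
fraction = 0.2

# Probabilistic split: each sample lands in the first part with probability `fraction`
rnd = rng_demo.random(n)
first_mask = rnd < fraction
second_mask = ~first_mask  # complement, equivalent to rnd >= fraction
n_first = int(first_mask.sum())
n_second = int(second_mask.sum())
print(n_first, n_second, n_first / n)  # the ratio is close to 0.2 but rarely exactly 0.2

# Exact-size alternative: shuffle the indices and cut at a fixed position
perm = rng_demo.permutation(n)
cut = int(fraction * n)
exact_first, exact_second = perm[:cut], perm[cut:]
print(len(exact_first), len(exact_second))  # 2000 8000
```

The probabilistic version is simpler and good enough for large datasets; the shuffled version is useful when the part sizes must be fixed.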

We will reserve roughly 20% of our dataset as the test set (the split is random, so the exact fraction varies slightly). We will not change this portion throughout our experiments.

In [None]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

## Experiments with splits

Let us reserve some of our train data as a validation set.

In [None]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [None]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

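For nearest neighbour, the train accuracy is 1, because every training sample is its own nearest neighbour at distance zero (the only exception would be identical feature rows that carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is about 0.1667 in our case since the truncated prices form 6 classes.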
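
The 1/(number of classes) figure for uniform random guessing holds even when the true labels are imbalanced. The simulation below is a sketch of that claim; the class proportions, the seed, and the sample size are made-up values, not statistics of the housing data.

```python
import numpy as np

rng_sim = np.random.default_rng(seed=1)  # hypothetical seed, for illustration only
n_classes = 6
n_samples = 60000

# Hypothetical, deliberately imbalanced class proportions (they sum to 1)
proportions = [0.05, 0.35, 0.30, 0.15, 0.10, 0.05]
true_labels = rng_sim.choice(n_classes, size=n_samples, p=proportions)

# Uniform random guesses, as in RandomClassifier
random_guesses = rng_sim.integers(low=0, high=n_classes, size=n_samples)

random_accuracy = (true_labels == random_guesses).mean()
print(random_accuracy, 1 / n_classes)  # the two numbers should be close
```

Each guess matches the true label with probability 1/6 whatever that label is, which is why the imbalance does not matter.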

Let us predict the labels for our validation set and get the accuracy.

In [None]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier stays about the same. Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy.

In [None]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [None]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href="https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py">plt.plot</a>. Also check extreme values for the splits, like 99.9% or 0.1%.

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [None]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [None]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross-validation (Wikipedia)</a>.
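
For comparison with the repeated random splits above, here is a minimal sketch of k-fold cross-validation written with NumPy only. The names `nn_predict` and `kfold_accuracy`, the toy clusters, and the seed are assumptions for illustration; this is not the implementation a later module will use.

```python
import numpy as np

def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest-neighbour prediction for each row of testdata
    preds = []
    for query in testdata:
        dist = ((traindata - query) ** 2).sum(axis=1)
        preds.append(trainlabel[np.argmin(dist)])
    return np.array(preds)

def kfold_accuracy(data, label, k, classifier, rng):
    # Shuffle the indices once, then cut them into k nearly equal folds
    folds = np.array_split(rng.permutation(len(label)), k)
    accuracies = []
    for i in range(k):
        val_idx = folds[i]  # fold i is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = classifier(data[train_idx], label[train_idx], data[val_idx])
        accuracies.append((pred == label[val_idx]).mean())
    return float(np.mean(accuracies))

# Hypothetical toy data: two well-separated clusters of 20 points each
rng_cv = np.random.default_rng(seed=2)
cluster0 = rng_cv.uniform(-1, 1, size=(20, 2))
cluster1 = rng_cv.uniform(-1, 1, size=(20, 2)) + 10.0
toy_data = np.vstack([cluster0, cluster1])
toy_labels = np.array([0] * 20 + [1] * 20)

cv_accuracy = kfold_accuracy(toy_data, toy_labels, k=5, classifier=nn_predict, rng=rng_cv)
print(cv_accuracy)  # the clusters do not overlap, so every held-out point is classified correctly
```

Unlike repeated random splits, k-fold uses every sample for validation exactly once, so no sample is over- or under-represented in the average.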

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


#### Question 1
Yes, averaging the validation accuracy across multiple splits of the dataset (e.g., using cross-validation) generally provides more consistent and reliable results compared to a single split. Here's why:

Reduced Variance: By performing multiple splits and averaging the results, you reduce the impact of variability in the data partitioning. Each split represents a different random selection of data for training and validation, which helps to mitigate the influence of any particular data configuration on the model's performance.

Better Representation of Model Performance: Averaging the validation accuracy across multiple splits provides a more representative estimate of the model's generalization performance. It accounts for variations in the data distribution and ensures that the reported accuracy is less likely to be biased by the specific data partition used.

Robustness to Data Imbalance or Variability: If the dataset is imbalanced or exhibits variability in the distribution of classes or features, averaging across multiple splits can help to ensure that the model's performance is robust and not overly influenced by specific instances of class imbalance or data variability.

Smaller Standard Error: Averaging validation accuracy across multiple splits lowers the standard error of the estimate, so the reported accuracy can be quoted with more confidence than a figure from a single split.
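
The variance argument can be checked with a small simulation. The sketch below assumes a simplified model in which each split's validation accuracy is an independent binomial proportion; the true accuracy `p_true`, the validation size, and the seed are made-up values. Real splits reuse the same samples and are therefore correlated, so the improvement in practice is smaller than this idealised picture suggests.

```python
import numpy as np

rng_var = np.random.default_rng(seed=3)  # hypothetical seed, for illustration only
p_true = 0.34        # assumed true accuracy of the classifier
n_val = 200          # assumed number of validation samples per split
n_splits = 10        # number of splits averaged together
n_trials = 5000      # how many times the whole experiment is repeated

# Each entry is the validation accuracy measured on one simulated split
split_accuracies = rng_var.binomial(n_val, p_true, size=(n_trials, n_splits)) / n_val

std_single = split_accuracies[:, 0].std()           # spread of a single-split estimate
std_averaged = split_accuracies.mean(axis=1).std()  # spread of the 10-split average
print(std_single, std_averaged, std_single / std_averaged)
```

Under this independence assumption the ratio comes out near the square root of the number of splits, which is about 3.16 for ten splits.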

#### Question 2
Averaging the validation accuracy across multiple splits can provide a more accurate estimate of the model's generalization performance compared to a single split. However, it's essential to note that the validation accuracy estimated through cross-validation (averaging across multiple splits) is still an estimate and may not perfectly reflect the model's performance on unseen data (test accuracy).

Here's why:

Closer to True Generalization Performance: Averaging validation accuracy across multiple splits smooths out the luck of any single partition, so the average tends to sit closer to the true generalization performance than a single-split figure does. It does not remove systematic bias: each model is trained on only part of the available training data, so the average can slightly underestimate the accuracy of a model trained on all of it.

Reduced Variance: By averaging across multiple splits, the estimate of the model's performance becomes less sensitive to the randomness introduced by a single data split. This reduction in variance increases the reliability of the estimated performance.

Not Equivalent to Test Accuracy: While cross-validated validation accuracy provides a more reliable estimate of generalization performance, it's important to distinguish it from test accuracy. Test accuracy is obtained by evaluating the model on a completely unseen dataset, which serves as an independent validation of the model's performance. Cross-validated validation accuracy estimates how well the model generalizes to different subsets of the training data, but it doesn't replace the need for evaluation on a separate test set.

Limited Data Variability: Despite its advantages, cross-validation may still not capture all sources of variability present in the data. It relies on random sampling of data splits, which may not fully represent the diversity of the dataset or the potential variability in real-world data.

In summary, while averaging validation accuracy across multiple splits improves the reliability and stability of the estimated generalization performance, it's still an estimate and may not perfectly reflect the model's performance on unseen data. Evaluation on a separate test set is necessary to obtain a more definitive assessment of the model's performance.






#### Question 3
In this lab, "iterations" is the number of random train/validation splits that `AverageAccuracy` averages over, not the number of training steps (nearest neighbour has no iterative training). The number of iterations affects how stable the estimate is, but whether increasing it leads to a better estimate depends on several factors:

Reduced Variance: Each split yields a slightly different validation accuracy. Averaging over more splits reduces the spread of the mean, so repeated runs of `AverageAccuracy` agree more closely with one another.

Diminishing Returns: The gain shrinks as the number of iterations grows. For independent estimates the standard error falls roughly as 1/sqrt(iterations), and because the splits here are drawn from the same pool of samples and overlap heavily, the improvement flattens out even sooner.

Computational Resources: Every iteration runs the classifier on a fresh split, so the running time grows linearly with the number of iterations. With nearest neighbour on a dataset of this size, that cost is noticeable.

No Effect on Bias: More iterations do not change what is being estimated. Each model is still trained on only `splitpercent` of the data, so any systematic gap between validation accuracy and test accuracy remains however many splits are averaged.

In summary, increasing the number of iterations gives a more stable estimate, up to a point. Beyond a moderate number of splits the extra computation buys little, and no number of iterations corrects a bias that comes from the split sizes themselves.






#### Question 4
Increasing the number of iterations helps to some extent with a very small validation set, but it is not a complete solution, and it does not help with a very small train set. Here's why:

Small Validation Set: With only a few validation samples, the accuracy from a single split is very noisy, since one sample more or less changes the figure noticeably. Averaging over many splits reduces this noise, so here more iterations do help.

Small Train Set: A classifier trained on very few samples is simply a weaker classifier, and every split shares that weakness. Averaging then gives a stable estimate of the weak model's accuracy, which underestimates what the classifier would achieve when trained on all of the data.

Correlated Splits: All splits reuse the same pool of samples, so the averaged estimate cannot carry more information than the data contain. Past a certain number of iterations the estimate stops improving.

Extreme Splits: At splits like 99.9% or 0.1%, one of the two sets holds only a handful of samples. More iterations can tame the noise of a tiny validation set, but they cannot repair a model trained on a handful of samples.

Alternatives: Schemes such as k-fold or leave-one-out cross-validation use every sample for validation exactly once while keeping the train set large, which makes better use of limited data than repeating random splits.

In summary, increasing the number of iterations can compensate for the noise of a very small validation set, but it cannot compensate for the bias caused by a very small train set. Choosing sensible split sizes, or using k-fold cross-validation, matters more than adding iterations.




