<a href="https://colab.research.google.com/github/Bandi-Lavanya/FMML_2023_ASSIGNMENTS/blob/main/Module_01_Lab_02_MLPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through part of the ML pipeline: extracting features, training, and testing.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# set the random seed so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes such as the median income and the median house age of each district. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [None]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
# truncate the targets to integers so that we can classify
# (np.int was deprecated in NumPy 1.20 and removed in 1.24; the builtin int is equivalent)
dataset.target = dataset.target.astype(int)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that implements a 1-nearest-neighbour classifier.

In [None]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
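To make the distance computation concrete, here is a minimal sketch on three hand-made training points. The helper `nearest_label` and the `toy_*` arrays are hypothetical names introduced only for this illustration; the logic mirrors `NN1` above.

```python
import numpy as np

def nearest_label(traindata, trainlabel, query):
    # same steps as NN1: broadcast the query against every training row,
    # square the differences and sum them per row
    diff = traindata - query
    dist = (diff * diff).sum(axis=1)
    return trainlabel[np.argmin(dist)]

toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 1])
toy_query = np.array([4.0, 4.5])

# squared Euclidean distance from the query to each training point
toy_dist = ((toy_train - toy_query) ** 2).sum(axis=1)
toy_pred = nearest_label(toy_train, toy_label, toy_query)
print(toy_dist)  # [36.25 21.25  1.25]
print(toy_pred)  # 1, the label of the closest point (5, 5)
```

The square root is skipped because it does not change which point is closest.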

We will also define a 'random classifier', which assigns a random label to each sample.

In [None]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used; it is kept so that the signature matches NN

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [None]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
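As a quick worked example of the ratio, consider five samples of which three are classified correctly. The `toy_*` arrays are made up for this illustration.

```python
import numpy as np

toy_gt = np.array([0, 1, 2, 2, 1])    # ground-truth labels
toy_pred = np.array([0, 1, 1, 2, 0])  # predicted labels

# positions 0, 1 and 3 match, so 3 of the 5 predictions are correct
toy_correct = (toy_gt == toy_pred).sum()
toy_accuracy = toy_correct / len(toy_gt)
print(toy_accuracy)  # 0.6
```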

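Let us make a function that splits the dataset randomly: each sample goes into the first split with probability `percent` (a fraction between 0 and 1) and into the second split otherwise.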

In [None]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
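Because `split` draws an independent random number for each sample, the split sizes are only approximately the requested fraction. If exact sizes are wanted, one possible alternative is to shuffle the indices and cut at a fixed position. The sketch below assumes a hypothetical helper named `split_exact`; it is not part of the lab code.

```python
import numpy as np

def split_exact(data, label, fraction, rng):
    # hypothetical alternative to split(): shuffle the indices once,
    # then cut at a fixed position so the first part has exactly round(fraction * n) samples
    n = len(label)
    order = rng.permutation(n)
    cut = int(round(fraction * n))
    first, second = order[:cut], order[cut:]
    return data[first], label[first], data[second], label[second]

demo_rng = np.random.default_rng(0)
demo_data = np.arange(40).reshape(20, 2)
demo_label = np.arange(20)

a_data, a_label, b_data, b_label = split_exact(demo_data, demo_label, 0.25, demo_rng)
print(len(a_label), len(b_label))  # 5 15
```

The probabilistic version used in this lab is simpler and works well when the dataset is large, since the size fluctuations are then small relative to the total.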

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [None]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set.

In [None]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [None]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


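For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour at distance 0 (this can fail only if two identical feature vectors carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.

Let us predict the labels for our validation set and get the accuracy.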
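The 1/(number of classes) figure for uniform random guessing can be checked with a small simulation. The class probabilities below are invented for illustration and are not taken from the housing data; the point is that the expected accuracy is 1/6 even when the true labels are imbalanced.

```python
import numpy as np

sim_rng = np.random.default_rng(0)
num_classes = 6

# imbalanced "true" labels; these probabilities are illustrative assumptions
true_labels = sim_rng.choice(num_classes, size=100_000,
                             p=[0.05, 0.4, 0.3, 0.15, 0.06, 0.04])

# uniform random guesses, as in RandomClassifier
guesses = sim_rng.integers(low=0, high=num_classes, size=true_labels.size)

sim_accuracy = (true_labels == guesses).mean()
print(sim_accuracy, 1 / num_classes)
```

Each guess matches the true label with probability 1/6 regardless of which class the sample belongs to, so the class distribution drops out of the expectation.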

In [None]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Still, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy.

In [None]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try different random splits.
We notice that the accuracy is different for each run, but the values are close together.

Now let us compare it with the accuracy we get on the test dataset.

In [None]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Check also for extreme values for splits, like 99.9% or 0.1%.
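One way to organise such an experiment is to sweep the train fraction and record the mean and spread of the validation accuracy over several random splits. The sketch below is self-contained and therefore uses small synthetic data (two Gaussian blobs) instead of the housing data; `sweep_data`, `sweep_label` and `nn_predict` are hypothetical names, and in the lab you would substitute `alltraindata`, `alltrainlabel` and `NN`.

```python
import numpy as np

sweep_rng = np.random.default_rng(0)

# synthetic stand-in data: two Gaussian blobs (an assumption made for this sketch)
n_per_class = 300
sweep_data = np.vstack([sweep_rng.normal(0.0, 1.0, size=(n_per_class, 2)),
                        sweep_rng.normal(2.0, 1.0, size=(n_per_class, 2))])
sweep_label = np.repeat([0, 1], n_per_class)

def nn_predict(traindata, trainlabel, testdata):
    # squared distances between every test and train point, shape (n_test, n_train)
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

train_fractions = [0.01, 0.1, 0.5, 0.75, 0.9, 0.99]
sweep_results = {}
for frac in train_fractions:
    accs = []
    for _ in range(20):
        mask = sweep_rng.random(len(sweep_label)) < frac
        if mask.sum() == 0 or (~mask).sum() == 0:
            continue  # skip degenerate splits where one side is empty
        pred = nn_predict(sweep_data[mask], sweep_label[mask], sweep_data[~mask])
        accs.append((pred == sweep_label[~mask]).mean())
    sweep_results[frac] = (np.mean(accs), np.std(accs))
    print(f"train fraction {frac:.2f}: "
          f"mean val accuracy {sweep_results[frac][0]:.3f}, std {sweep_results[frac][1]:.3f}")
```

The means and standard deviations in `sweep_results` can be passed to `plt.plot` or `plt.errorbar` to draw the requested graph.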

1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?

**ANSWER:**
The percentage of data allocated to the validation set does not push the validation accuracy up or down by itself. It changes two things: how reliable the validation accuracy is as an estimate, and how much data is left for training.

**Effect of increasing or reducing the percentage of the validation set:**

**Increase Percentage of Validation Set:**

**1. More Reliable Validation Accuracy**: When we allocate a larger percentage of our data to the validation set, we evaluate the model on more samples. The validation accuracy then fluctuates less from one random split to another, so it is a more reliable estimate of performance on unseen data. It is not systematically higher.

**2. Reduced Training Data**: The trade-off is that less data is available for training. For nearest neighbour, fewer training samples mean that the closest training sample is, on average, farther from the query, so the validation accuracy tends to drop, especially at extreme splits such as 0.1% training data.

**Reduce Percentage of Validation Set:**

**1. More Training Data:** Allocating a smaller percentage to the validation set leaves more data for training, so nearest neighbour tends to predict unseen samples better. Its training accuracy does not change, since it is already 1.

**2. Noisier Validation Accuracy**: On the downside, a smaller validation set gives a less reliable estimate of the model's performance. With only a handful of validation samples, the accuracy can swing widely between splits, landing either well above or well below the true value.

The random classifier ignores the training features, so as long as the training set contains every class, its validation accuracy stays close to 1/(number of classes) for every split size; only the fluctuation around that value grows as the validation set shrinks.

2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?

**ANSWER:**
The size of the training and validation sets can indeed affect how well we can predict the accuracy on the test set using the validation set.

**Training Set Size:**

**Larger Training Set:**
* It results in a better-trained model because it has more data to learn from.
* The model used for validation is then closer to the model trained on all the training data, so the validation accuracy is a better predictor of the test accuracy.

**Smaller Training Set:**
* With a smaller training set, the model may not learn as effectively and generalizes worse.
* In such cases, the validation accuracy tends to underestimate the test accuracy, because the final model is trained on more data.

**Validation Set Size:**

**Larger Validation Set:**
* A larger validation set can provide a more reliable estimate of your model's performance because it's based on more data.
* The validation set's accuracy is a better predictor of the test set accuracy.

**Smaller Validation Set:**
* A smaller validation set can still provide valuable insights into model performance, but it may be more sensitive to variations in the data.
* It could lead to less accurate predictions of test set accuracy.
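The sensitivity of a small validation set can be quantified with a back-of-the-envelope calculation. If each validation prediction is treated as an independent coin flip that is correct with probability p (a simplifying assumption), the measured accuracy has a binomial standard error of sqrt(p(1-p)/n). The function name `accuracy_std_error` is hypothetical; the accuracy of 0.34 is roughly the nearest-neighbour validation accuracy reported earlier in this lab, and 4124 is about 25% of the 16496 training samples.

```python
import math

def accuracy_std_error(accuracy, n_val):
    # binomial standard error of an accuracy measured on n_val independent samples
    return math.sqrt(accuracy * (1.0 - accuracy) / n_val)

p_hat = 0.34  # roughly the nearest-neighbour validation accuracy seen in this lab
for n_val in [16, 165, 4124]:
    se = accuracy_std_error(p_hat, n_val)
    print(f"n_val = {n_val:5d}: accuracy {p_hat:.2f} +/- {2 * se:.3f} (about two standard errors)")
```

Under this assumption the uncertainty shrinks only with the square root of the validation set size, so a validation set 100 times larger gives an estimate about 10 times tighter.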

3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

**ANSWER**:
* A good percentage to reserve for the validation set is typically around 20-30%, so that these two factors are balanced.
* The reasons are common practice and the trade-off itself: this range leaves enough training data for the model to learn well, while keeping the validation set large enough to give a stable accuracy estimate.

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [None]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [None]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. This will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.
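For comparison with the repeated random splits used here, the following is a minimal sketch of k-fold cross-validation written with NumPy only. The names `kfold_indices`, `kfold_accuracy` and `nn_classifier`, as well as the synthetic two-blob data, are assumptions made for this illustration; library implementations such as the ones in scikit-learn offer more options.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # shuffle the indices once, then cut them into k nearly equal folds
    order = rng.permutation(n_samples)
    return np.array_split(order, k)

def kfold_accuracy(data, label, k, classifier, rng):
    folds = kfold_indices(len(label), k, rng)
    scores = []
    for i in range(k):
        # fold i is the validation set, the remaining k-1 folds form the training set
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = classifier(data[train_idx], label[train_idx], data[val_idx])
        scores.append((pred == label[val_idx]).mean())
    return np.array(scores)

def nn_classifier(traindata, trainlabel, testdata):
    # 1-nearest neighbour using all pairwise squared distances
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

kf_rng = np.random.default_rng(0)
kf_data = np.vstack([kf_rng.normal(0.0, 1.0, size=(100, 2)),
                     kf_rng.normal(3.0, 1.0, size=(100, 2))])
kf_label = np.repeat([0, 1], 100)

kf_scores = kfold_accuracy(kf_data, kf_label, 5, nn_classifier, kf_rng)
print(kf_scores, kf_scores.mean())
```

Unlike repeated random splits, every sample is used for validation exactly once, so no sample is over- or under-represented in the averaged accuracy.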

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


1. Does averaging the validation accuracy across multiple splits give more consistent results?

**ANSWER:**
Yes, averaging the validation accuracy across multiple splits gives more consistent results.
Cross-validation reduces the impact of the randomness of any single split on the model's performance evaluation.

**Reasons for more consistent results:**

**1. Robustness to Data Variability:**
*   Every random split produces slightly different training and validation sets, so a single validation accuracy carries some of that variability.
*   By using multiple splits, you get a better sense of how well your model generalizes to different subsets of the data.

**2. Less Dependence on a Lucky Split:**
*   With a single validation split, the model might appear to perform exceptionally well or poorly due to the luck of the split, giving an inaccurate picture of its actual performance.
*   Averaging over multiple splits reduces the risk of trusting a model on the basis of one such split.

**3. Improved Confidence Estimation:**
*   The spread of the accuracies across splits indicates how much confidence the estimate deserves.
*   Averaging also reduces the impact of noise in a single validation set.

**4. Consistency:**
*   When you repeatedly split the data and evaluate your model's performance, you can observe how consistent its results are across different subsets.
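The consistency claim can be checked empirically by comparing the spread of single-split accuracies with the spread of averages over 10 splits. The sketch below uses synthetic two-blob data and a hypothetical helper `one_split_accuracy` so that it runs on its own; the same comparison can be made with `split`, `NN` and `AverageAccuracy` from this lab.

```python
import numpy as np

cons_rng = np.random.default_rng(0)

# synthetic stand-in data: two overlapping Gaussian blobs (illustrative only)
cons_data = np.vstack([cons_rng.normal(0.0, 1.0, size=(200, 2)),
                       cons_rng.normal(1.5, 1.0, size=(200, 2))])
cons_label = np.repeat([0, 1], 200)

def one_split_accuracy(data, label, fraction, rng):
    # random train/validation split followed by 1-nearest-neighbour prediction
    mask = rng.random(len(label)) < fraction
    d = ((data[~mask][:, None, :] - data[mask][None, :, :]) ** 2).sum(axis=2)
    pred = label[mask][d.argmin(axis=1)]
    return (pred == label[~mask]).mean()

# 30 estimates from a single split each
single = np.array([one_split_accuracy(cons_data, cons_label, 0.75, cons_rng)
                   for _ in range(30)])

# 30 estimates, each the average over 10 splits
averaged = np.array([
    np.mean([one_split_accuracy(cons_data, cons_label, 0.75, cons_rng) for _ in range(10)])
    for _ in range(30)
])

print("std of single-split accuracy:", single.std())
print("std of 10-split average:     ", averaged.std())
```

Both sets of estimates centre on roughly the same value; averaging narrows the spread without shifting the centre.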



2. Does it give a more accurate estimate of test accuracy?

**ANSWER:**
*   Averaging the validation accuracy across multiple splits reduces the variance of the estimate, but it does not directly provide the test accuracy. In the run above, the averaged validation accuracy (about 0.336) is still slightly below the test accuracy (about 0.349); one likely reason is that each validation model is trained on only 75% of the training data.
*   This estimate is often referred to as the **"cross-validated performance" or "cross-validated accuracy."**
*   Purpose of cross-validation: It helps you evaluate the model's performance more reliably and provides insights into its ability to handle different data subsets.
*   Cross-validation can give a good estimate of the model's generalization performance, but it is still an estimate rather than the test accuracy itself.
*   The test dataset remains crucial for confirming your model's performance before deploying it in real-world applications, as it serves as the ultimate benchmark of its accuracy on new, independent samples.

The example below compares the cross-validated accuracy with the actual test accuracy on the Iris dataset.



In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the data into a training set and a test set (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree Classifier
model = DecisionTreeClassifier(random_state=42)

# Perform 5-fold cross-validation on the training set
k = 5
cross_val_scores = cross_val_score(model, X_train, y_train, cv=k, scoring='accuracy')

# Calculation of the average cross-validated accuracy
average_accuracy = sum(cross_val_scores) / k

# Training the model on the entire training set
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the test accuracy
test_accuracy = accuracy_score(y_test, y_pred)

# Print the cross-validated accuracy and test accuracy
print("Cross-Validated Accuracy:", average_accuracy)
print("Test Accuracy:", test_accuracy)


Cross-Validated Accuracy: 0.9416666666666668
Test Accuracy: 1.0


3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?

**ANSWER:**
The number of iterations can have an effect on the estimate of your model's performance. In `AverageAccuracy`, it is the number of random splits that are averaged; in k-fold cross-validation, the analogous quantity is the number of folds. The impact depends on several factors:

**Data Availability:**
If you have a small dataset, using a very high number of folds might not be feasible, because the validation set in each fold becomes very small.

**Bias-Variance Trade-Off:**
1. With a higher number of folds (e.g., k-fold cross-validation with a large k), we reduce the bias in our estimate because each model is trained on a larger share of the data.
2. This can lead to higher variance in our estimate because each fold holds a smaller portion of our data, which makes each fold's accuracy more sensitive to random fluctuations.

**Computational Cost:**
Increasing the number of iterations in cross-validation can significantly increase the computational cost, especially for large datasets and complex models.

* Increasing the number of iterations (or folds) can lead to a better estimate of your model's performance, but with diminishing returns: the averaged estimate becomes more stable, while any bias caused by the split size stays.
* The effect of higher iterations on the estimate depends on various factors, including your dataset size, the complexity of your model, and computational resources.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset (a well-known dataset for classification)
data = load_iris()
X = data.data
y = data.target

# Define a decision tree classifier
model = DecisionTreeClassifier(random_state=42)

# Try different numbers of folds (iterations)
fold_values = [3, 5, 10]
for folds in fold_values:
    # Perform cross-validation
    scores = cross_val_score(model, X, y, cv=folds, scoring='accuracy')

    # Calculate and print the average accuracy
    average_accuracy = sum(scores) / len(scores)
    print(f'{folds}-Fold Cross-Validation - Average Accuracy: {average_accuracy:.2f}')


3-Fold Cross-Validation - Average Accuracy: 0.96
5-Fold Cross-Validation - Average Accuracy: 0.95
10-Fold Cross-Validation - Average Accuracy: 0.95
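For the repeated random splits used in `AverageAccuracy`, a rough sketch of why more iterations help with diminishing returns is the following. Suppose the $n$ split accuracies $a_1, \dots, a_n$ each have variance $\sigma^2$ and pairwise correlation $\rho$. This is an idealisation introduced here, not something measured in the lab; because random splits of the same data overlap, $\rho > 0$ is plausible. Then the variance of their average is

$$
\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} a_i\right) = \frac{\sigma^2}{n} + \frac{n-1}{n}\,\rho\,\sigma^2 .
$$

The first term shrinks as the number of iterations grows, but the second approaches $\rho\,\sigma^2$, so beyond some point extra iterations barely change the estimate. Neither term involves the bias that comes from training on less data, which is why more iterations make the estimate steadier rather than closer to the test accuracy.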


4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?

**ANSWER**:
* Increasing the iterations in cross-validation can help mitigate the impact of a very small validation dataset to some extent, because the noise of the individual splits averages out. It cannot compensate for a very small train dataset: every model is still trained on too little data.
* Increasing the iterations does not create new data. Instead, it repeatedly partitions your existing dataset into training and validation subsets, allowing you to assess your model's performance multiple times with different data splits.
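The limitation for a very small train set can be illustrated with a small experiment. The sketch below uses a synthetic checkerboard dataset (chosen because nearest neighbour needs many training points to learn it) and a hypothetical helper `mean_val_accuracy`; none of these names come from the lab code.

```python
import numpy as np

small_rng = np.random.default_rng(0)

# synthetic checkerboard data (illustrative): the label depends on which unit cell a point falls in
board_data = small_rng.uniform(0.0, 4.0, size=(600, 2))
board_label = (np.floor(board_data[:, 0]) + np.floor(board_data[:, 1])).astype(int) % 2

def mean_val_accuracy(data, label, train_fraction, iterations, rng):
    # average 1-nearest-neighbour validation accuracy over several random splits
    accs = []
    for _ in range(iterations):
        mask = rng.random(len(label)) < train_fraction
        if mask.sum() == 0 or (~mask).sum() == 0:
            continue  # skip degenerate splits where one side is empty
        d = ((data[~mask][:, None, :] - data[mask][None, :, :]) ** 2).sum(axis=2)
        pred = label[mask][d.argmin(axis=1)]
        accs.append((pred == label[~mask]).mean())
    return float(np.mean(accs))

tiny_few = mean_val_accuracy(board_data, board_label, 0.02, 5, small_rng)
tiny_many = mean_val_accuracy(board_data, board_label, 0.02, 50, small_rng)
large_few = mean_val_accuracy(board_data, board_label, 0.75, 5, small_rng)

print("2% train data,  5 iterations:", tiny_few)
print("2% train data, 50 iterations:", tiny_many)
print("75% train data, 5 iterations:", large_few)
```

Ten times more iterations make the 2% estimate steadier, but it stays far below the accuracy reached with 75% training data, because each model still sees only about a dozen training points.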