<a href="https://colab.research.google.com/github/Shrut718/FMML--lab/blob/main/FMML_M1L2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning terms and metrics

FMML Module 1, Lab 2

In this lab, we will walk through part of the ML pipeline using the California Housing dataset. There are 20640 samples, each with 8 attributes such as the median income of the block group and the median age of its houses. The task is to predict the median house value of each district. We will use the scikit-learn library to load the data and perform some basic data preprocessing and model training. We will also show how to evaluate the model using some common metrics, split the data into training and testing sets, and use cross-validation to get a better estimate of the model's performance.

In [1]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

In [2]:
dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

Given below is the list of target values. These are the median house values that we want to predict from the 8 input features, and they are continuous. Such values call for a regression model, but for the sake of simplicity we will start with a classification model. To do so, we convert each value to an integer with `astype(int)`, which truncates the fractional part rather than rounding to the nearest integer (for example, 4.526 becomes 4 and 0.923 becomes 0), and then use a classification model to predict the resulting class.
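
Because the conversion relies on `astype(int)`, it may help to see how that differs from rounding. The following is a small sketch on a handful of target values from this dataset; the variable names are chosen only for illustration.

```python
import numpy as np

# A few target values from the dataset, in units of $100,000.
values = np.array([4.526, 3.585, 3.521, 0.923, 0.847, 0.894])

truncated = values.astype(int)          # drops the fractional part
rounded = np.round(values).astype(int)  # rounds to the nearest integer

print("truncated:", truncated)
print("rounded:  ", rounded)
```

With truncation, every value in [0, 1) lands in class 0, every value in [1, 2) in class 1, and so on, so each class corresponds to a $100,000-wide price band.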

In [3]:
print("Original target values:", dataset.target)

dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)


The simplest model to use for classification is the K-Nearest Neighbors model. We will use this model to predict the house value with a K value of 1. We will also use the accuracy metric to evaluate the model.

In [4]:
def NN1(traindata, trainlabel, query):
    """
    This function takes in the training data, training labels and a query point
    and returns the predicted label for the query point using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    query: numpy array of shape (d,) where d is the number of features

    returns: the predicted label for the query point which is the label of the training data which is closest to the query point
    """
    diff = (
        traindata - query
    )  # find the difference between features. Numpy automatically takes care of the size here
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # add up the squares
    label = trainlabel[np.argmin(dist)]
    return label


def NN(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is the label of the training data which is closest to each test point
    """
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel
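
The comment in `NN1` about NumPy "taking care of the size" refers to broadcasting: subtracting a `(d,)` query from an `(n, d)` array subtracts the query from every row. A minimal sketch with made-up toy points (the values are assumptions for illustration only):

```python
import numpy as np

traindata = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])  # shape (3, 2)
trainlabel = np.array([0, 1, 2])
query = np.array([0.9, 1.2])                                 # shape (2,)

diff = traindata - query          # query is broadcast across all 3 rows
dist = (diff * diff).sum(axis=1)  # squared Euclidean distance per row
nearest = int(np.argmin(dist))    # index of the closest training point

print(diff.shape)
print(trainlabel[nearest])
```

The square root is omitted because it does not change which distance is smallest.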

We will also define a 'random classifier', which randomly assigns a label to each sample.

In [5]:
def RandomClassifier(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the random classifier algorithm

    In reality, we don't need these arguments but we are passing them to keep the function signature consistent with other classifiers

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is a random label from the training data
    """

    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel

We need a metric to evaluate the performance of the model. Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. We will use the accuracy metric to evaluate and compare the performance of the K-Nearest Neighbors model and the random classifier.

In [6]:
def Accuracy(gtlabel, predlabel):
    """
    This function takes in the ground-truth labels and predicted labels
    and returns the accuracy of the classifier

    gtlabel: numpy array of shape (n,) where n is the number of samples
    predlabel: numpy array of shape (n,) where n is the number of samples

    returns: the accuracy of the classifier which is the number of correct predictions divided by the total number of predictions
    """
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)
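
As a quick sanity check of the definition, here is a tiny worked example with made-up labels. It computes the same quantity as `Accuracy`, written as the mean of a boolean array.

```python
import numpy as np

gtlabel = np.array([0, 1, 2, 2, 1])
predlabel = np.array([0, 2, 2, 2, 0])

# Element-wise comparison gives a boolean array; its mean is the fraction of matches.
accuracy = (gtlabel == predlabel).mean()
print(accuracy)  # 3 of the 5 predictions match
```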

Let us make a function to split the dataset with the desired probability. We will use this function to split the dataset into training and testing sets. We will use the training set to train the model and the testing set to evaluate the model.

In [7]:
def split(data, label, percent):
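    """
    Randomly splits data and label into two parts.

    percent: float in [0, 1]; each sample goes to the first split with this probability,
    so the first split holds approximately (not exactly) this fraction of the samples
    """
    # generate a random number in [0, 1) for each sample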
    rnd = rng.random(len(label))
    split1 = rnd < percent
    split2 = rnd >= percent

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label
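
Since `split` draws an independent random number per sample, the resulting split sizes are only approximately the requested fraction. If an exact size is wanted, one possible alternative is to shuffle the indices and cut them. The helper below, `split_exact`, is a hypothetical variant and not part of the lab code.

```python
import numpy as np

def split_exact(data, label, fraction, rng):
    """Hypothetical variant of split(): the first split gets exactly
    round(fraction * n) samples, chosen by a random permutation."""
    n = len(label)
    order = rng.permutation(n)          # shuffled sample indices
    cut = int(round(fraction * n))
    first, second = order[:cut], order[cut:]
    return data[first], label[first], data[second], label[second]

rng = np.random.default_rng(seed=0)
data = np.arange(20).reshape(10, 2)     # toy data: row i is [2i, 2i + 1]
label = np.arange(10)
d1, l1, d2, l2 = split_exact(data, label, 0.2, rng)
print(len(l1), len(l2))
```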

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [8]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set

In [9]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

What is the accuracy of our classifiers on the train dataset?

In [10]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm:", trainAccuracy*100, "%")

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using random classifier: ", trainAccuracy*100, "%")

Training accuracy using nearest neighbour algorithm: 100.0 %
Training accuracy using random classifier:  16.4375808538163 %


For nearest neighbour, the train accuracy is 1 because every training point is its own nearest neighbour (at distance 0) and is therefore assigned its own label; this holds as long as no two identical points carry different labels. The accuracy of the random classifier is close to 1/(number of classes), which is about 0.1667 in our case. This is because the random classifier picks a label uniformly at random for each sample, so the probability of picking the correct label is 1/(number of classes). Let us predict the labels for our validation set and get the accuracy. This accuracy is a good estimate of the accuracy of our model on unseen data.
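
The 1/(number of classes) argument can be checked with a short simulation. The sketch below assumes six classes, which matches the value quoted above, and uses made-up class probabilities to show that the result does not depend on how imbalanced the true labels are.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_classes = 6
n_samples = 100_000

# Deliberately imbalanced ground-truth labels (the probabilities are made up).
true_labels = rng.choice(n_classes, size=n_samples, p=[0.4, 0.3, 0.15, 0.1, 0.04, 0.01])
# Uniformly random guesses, in the spirit of RandomClassifier.
guesses = rng.integers(low=0, high=n_classes, size=n_samples)

accuracy = (true_labels == guesses).mean()
print(accuracy, 1 / n_classes)
```

Whatever the true label is, a uniform guess matches it with probability 1/6, so the overall accuracy stays near 1/6.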

In [11]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")


valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.10852713178294 %
Validation accuracy using random classifier: 16.884689922480618 %


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. However, the validation accuracy of nearest neighbour is still about twice that of the random classifier. Now let us try another random split and check the validation accuracy. We will see that the validation accuracy changes with the split. This is because the validation set is a limited random sample, so the accuracy depends on which samples end up in it. We can get a better estimate of the accuracy by using cross-validation.
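
To get a feel for how much the validation accuracy can move from split to split, one rough guide is the binomial standard error, sqrt(p(1 - p)/n). The sketch below assumes that each prediction is independently correct with probability `p`; the value 0.34 is roughly the accuracy reported above, and the validation set sizes are hypothetical.

```python
import math

def accuracy_standard_error(p, n):
    """Standard error of an accuracy estimate from n independent samples,
    assuming each prediction is correct with probability p."""
    return math.sqrt(p * (1 - p) / n)

p = 0.34  # roughly the validation accuracy reported above
for n in (40, 400, 4000):
    print(n, round(accuracy_standard_error(p, n), 4))
```

Under this assumption, quadrupling the validation set size halves the expected fluctuation.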

In [12]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.048257372654156 %


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [13]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the splits, like 99.9% or 0.1%.
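
The experiments so far use a single nearest neighbour. For readers who want to go beyond K = 1, a from-scratch majority-vote variant could be sketched as below. The function `knn_predict` is hypothetical (not part of the lab code), assumes non-negative integer labels, and is shown on made-up one-dimensional points with two deliberately noisy labels.

```python
import numpy as np

def knn_predict(traindata, trainlabel, testdata, k=3):
    """Hypothetical k-nearest-neighbour classifier using majority vote.
    Assumes labels are non-negative integers; ties go to the smallest label."""
    preds = []
    for query in testdata:
        dist = ((traindata - query) ** 2).sum(axis=1)
        nearest = np.argsort(dist)[:k]            # indices of the k closest points
        votes = np.bincount(trainlabel[nearest])  # count labels among the neighbours
        preds.append(np.argmax(votes))
    return np.array(preds)

traindata = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
trainlabel = np.array([0, 0, 1, 1, 1, 0])  # the points at 0.2 and 1.2 are "noisy"
testdata = np.array([[0.19], [1.19]])

pred_k1 = knn_predict(traindata, trainlabel, testdata, k=1)
pred_k3 = knn_predict(traindata, trainlabel, testdata, k=3)
print(pred_k1, pred_k3)
```

With K = 1 each query copies the label of the noisy point next to it, while with K = 3 the two other neighbours outvote it.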

> Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.

## Multiple Splits

One way to get more accurate estimates of the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as our estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute. You can reduce the number of splits to make it faster.

In [14]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    """
    This function takes in the data, labels, split percentage, number of iterations and classifier function
    and returns the average accuracy of the classifier

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float in [0, 1], the fraction of the data to be used for training
    iterations: int which is the number of iterations to run the classifier
    classifier: function which is the classifier function to be used

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies

In [15]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

Average validation accuracy: 33.58463539517022 %
Test accuracy: 34.91795366795367 %


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. This will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?

Yes, averaging the validation accuracy across multiple splits (e.g., using k-fold cross-validation) provides more consistent and reliable results than using a single validation split. Here’s why:


---

Benefits of Averaging Across Multiple Splits

1. Reduces Variance:

A single train-test split might not represent the overall data distribution, especially if the dataset is small or imbalanced.

Multiple splits ensure the model is evaluated on various subsets, reducing the impact of random variations.



2. Better Representation of Performance:

Each fold uses a different portion of the data for training and validation.

This provides a more comprehensive understanding of how the model performs across the entire dataset.



3. Avoids Overfitting to a Specific Split:

Models might perform well on a particular validation set (lucky split) but poorly on others.

Averaging across splits mitigates the risk of overfitting to the validation data.



4. Reliable Hyperparameter Tuning:

When optimizing hyperparameters, using cross-validation ensures that the chosen configuration generalizes well, not just for a single split.





---

How k-Fold Cross-Validation Works

The data is divided into k subsets (folds).

The model is trained on k - 1 folds and validated on the remaining fold.

This process is repeated k times, with each fold serving as the validation set once.

The final metric (e.g., validation accuracy) is averaged across all folds.
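
The steps above can be sketched with NumPy alone. The helper `kfold_indices` is a hypothetical name introduced here for illustration; it is not a library function and makes no attempt at stratification.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.
    Hypothetical helper: the folds partition a random permutation of the indices."""
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)   # k nearly equal-sized folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

rng = np.random.default_rng(seed=0)
for train_idx, val_idx in kfold_indices(10, 5, rng):
    print(len(train_idx), len(val_idx), sorted(val_idx.tolist()))
```

Each index appears in exactly one validation fold, which is the main difference from the repeated random splits used in `AverageAccuracy`, where a sample may be validated on several times or not at all.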



---

Example of Averaging Validation Accuracy

For 5-fold cross-validation with fold validation accuracies $a_1, \dots, a_5$, the averaged estimate is:

$$\text{Average Accuracy} = \frac{a_1 + a_2 + a_3 + a_4 + a_5}{5}$$

This average gives a more stable estimate of the model’s performance.


---

When Does It Matter Most?

Small Datasets: Cross-validation makes better use of limited data by ensuring all data points are used for both training and validation.

Imbalanced Data: Averaging over splits helps account for varying class distributions across folds.

High Variability in Data: Reduces the impact of outliers or peculiarities in individual splits.



---

Caveats

1. Computational Cost:

Cross-validation requires training the model k times, which can be computationally expensive, especially for large datasets or complex models.

Solution: Use a smaller k (e.g., 5 instead of 10). Note that stratified k-fold preserves class balance across folds but does not reduce the cost.



2. Overlapping Training Data:

Data points appear in multiple training folds, which might introduce a slight bias.

However, this is typically negligible compared to the benefit of stability.





---

Conclusion

Averaging validation accuracy across multiple splits indeed gives more consistent and reliable results, particularly for hyperparameter tuning and performance estimation. If computational cost isn’t a major concern, it’s a best practice to use techniques like k-fold cross-validation.


2. Does it give a more accurate estimate of test accuracy?

Yes, averaging validation accuracy across multiple splits (e.g., using k-fold cross-validation) generally provides a more accurate estimate of test accuracy compared to using a single validation split. Here's why:


---

1. Cross-Validation Mimics Real-World Test Conditions

In cross-validation, the model is evaluated on different subsets of the data, simulating multiple test scenarios. This process ensures the estimate reflects the model's ability to generalize across the entire dataset, not just one particular split.

A single validation split might overestimate or underestimate accuracy, especially if the split is not representative of the data distribution.

Cross-validation mitigates this by combining results from multiple splits.



---

2. Reduces Overfitting to a Specific Validation Set

When you rely on a single split for validation:

The model might overfit or underfit to the validation set's specific characteristics.

Cross-validation avoids this by repeatedly training and validating on different subsets.


The averaged validation accuracy across folds serves as a more robust approximation of how the model will perform on unseen data.


---

3. Considers Variability in Data

If your dataset has:

Outliers: These may disproportionately affect a single test split, skewing results.

Class Imbalances: Cross-validation ensures every class is represented across splits (with stratified k-fold for classification tasks).


Averaging across folds smooths out these variabilities, giving a better sense of how the model handles all data.


---

4. Empirical Evidence

Research and practical experience show that cross-validation provides a closer estimate of test accuracy compared to using a fixed train-test split. Models tuned using cross-validation tend to generalize better to unseen test data.


---

5. When It Might Not Improve Accuracy Estimates

Data Leakage: If there’s leakage between folds (e.g., overlapping time-series data), cross-validation can lead to overly optimistic results.

Improper Cross-Validation Design: For example, not stratifying folds in imbalanced datasets can lead to unreliable estimates.



---

Conclusion

Averaging validation accuracy from cross-validation is not only more consistent but also more accurate as an estimate of test accuracy because it reduces the impact of randomness and provides a more comprehensive evaluation across the dataset. However, the final accuracy on a separate test set is the true measure of performance.



3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?

The number of iterations (e.g., the number of folds in k-fold cross-validation or repeated cross-validation) can affect the accuracy and reliability of the estimate of the test accuracy. Here’s how it works and the trade-offs involved:


---

Effect of Increasing the Number of Iterations

1. Improved Stability

Higher iterations reduce variance in the accuracy estimate by averaging over more splits. This means the results become less sensitive to how the data is divided.

For example, using 10-fold cross-validation instead of 5-fold can yield a more stable and reliable estimate, especially for smaller datasets.



2. More Comprehensive Use of Data

With more folds, each fold uses smaller validation sets and larger training sets, which can lead to more generalizable models and better approximations of real-world test accuracy.

Example: In 5-fold cross-validation, each model is trained on 80% of the data, while in 10-fold, it’s trained on 90%.



3. Diminishing Returns

After a certain point, increasing iterations (e.g., going from 10 to 20 folds) provides only marginal improvements in the estimate.

The improvement in accuracy estimate stabilizes as the number of folds increases because the model already trains and validates on nearly all data points across the splits.
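
The diminishing returns can be illustrated with a small simulation. This sketch assumes, purely for illustration, that each split yields an accuracy drawn independently from a normal distribution with mean 0.34 and standard deviation 0.01; in practice, splits of the same dataset are correlated, so the real gain from extra iterations is likely smaller.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def spread_of_average(iterations, repeats=2000):
    """Standard deviation of the averaged accuracy over many simulated runs.
    Assumes independent per-split accuracies (made-up mean 0.34, std 0.01)."""
    accs = rng.normal(loc=0.34, scale=0.01, size=(repeats, iterations))
    return accs.mean(axis=1).std()

for iterations in (1, 4, 16, 64):
    print(iterations, round(spread_of_average(iterations), 4))
```

Under the independence assumption the spread shrinks like 1/sqrt(iterations), so each halving of the spread costs four times as many iterations.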




---

Trade-Offs of Higher Iterations

1. Computational Cost

More iterations mean the model must be trained and evaluated more times, significantly increasing computation time, especially for complex models or large datasets.

Example: For a 10-fold cross-validation, the model is trained and evaluated 10 times. Doubling to 20 folds doubles the computation time.



2. Bias-Variance Trade-Off

Lower Number of Folds (e.g., 3-fold):

Validation sets are larger, which may introduce higher bias in the estimate because the model trains on less data in each fold.


Higher Number of Folds (e.g., 10-20 folds):

Validation sets are smaller, which may introduce higher variance in the estimate because smaller validation sets are more sensitive to outliers or noise.



3. Overlapping Training Sets

As the number of folds increases, the overlap between training sets becomes more significant, which may reduce the diversity of training data configurations and limit the benefit of additional folds.



---

When Do Higher Iterations Provide a Better Estimate?

1. Small Datasets:

For datasets with limited data points, higher iterations ensure the model trains on nearly all data points at least once, improving the reliability of the estimate.

Example: Using leave-one-out cross-validation (LOOCV) for very small datasets.



2. High Variability in Data:

When the dataset contains significant noise or outliers, higher iterations average out the randomness better, leading to a more robust estimate.





---

Guidelines for Choosing Iterations

1. 5-Fold Cross-Validation:

A good balance between computational cost and reliability for most datasets.



2. 10-Fold Cross-Validation:

Recommended for smaller datasets or when you need a more accurate estimate.



3. Repeated k-Fold Cross-Validation:

Repeat the cross-validation process several times (e.g., 5-fold repeated 5 times) to further reduce variance in the estimate.



4. Leave-One-Out Cross-Validation (LOOCV):

Best for very small datasets, but computationally expensive and can have high variance due to small validation sets.





---

Conclusion

Higher iterations generally provide more accurate and stable estimates of test accuracy, but the improvements diminish as iterations increase beyond a certain point.

For practical purposes:

Use 5-10 folds for most cases.

Consider repeated k-fold cross-validation if stability is critical.





4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?

Yes, increasing the number of iterations (e.g., using more folds in cross-validation or repeating cross-validation) can help deal with very small training or validation datasets. However, there are some caveats and trade-offs to consider. Let’s break it down:


---

1. How Increasing Iterations Helps with Small Datasets

a) Better Utilization of Data

In cross-validation, every data point gets used for both training and validation multiple times. For small datasets, this maximizes the use of limited data.

Example:

With 5-fold cross-validation, each fold trains on 80% of the data and validates on 20%.

With 10-fold cross-validation, the model trains on 90% of the data and validates on 10%.

Smaller validation sets mean more data is available for training, improving model learning.



b) Reduces Overfitting to a Specific Split

In a small dataset, a single train-test split can lead to overfitting on the small training set or underestimating performance on a small validation set.

Multiple iterations average out the variability caused by small sample sizes, providing a more reliable estimate of model performance.


c) Handles Variability in Small Datasets

Small datasets are more prone to noise and outliers. By increasing the number of iterations, the effect of these anomalies is averaged out, improving the stability of the results.



---

2. Challenges of Small Train or Validation Datasets

a) Noisy Estimates from Small Validation Sets

As you increase the number of folds, the size of the validation set decreases (e.g., in leave-one-out cross-validation (LOOCV), each fold contains a single validation point).

With small validation sets, the model's performance may vary significantly, and the evaluation becomes sensitive to individual data points.


b) Limited Generalization

If the training dataset is very small, the model may struggle to learn meaningful patterns, regardless of the number of iterations.

Even with more iterations, the fundamental limitation of insufficient training data remains.



---

3. Trade-Offs of Increasing Iterations for Small Datasets

a) Computational Cost

More iterations mean more training cycles. For very small datasets, LOOCV or 10-fold cross-validation might be feasible, but for complex models or large datasets, it can be computationally expensive.


b) Risk of Overfitting

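Each fold retrains the model from scratch, so cross-validation by itself does not overfit. However, repeatedly tuning models or hyperparameters against the same cross-validation scores can overfit those choices to the overall dataset, especially for high-capacity models like neural networks.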



---

4. Alternatives to Increasing Iterations

If the dataset is too small, simply increasing the number of iterations might not fully solve the problem. Consider these additional strategies:

a) Data Augmentation

For image, text, or time-series data, augment the dataset by creating synthetic variations (e.g., rotating images, paraphrasing text, or generating new samples).


b) Bootstrapping

Use resampling techniques like bootstrapping to create multiple datasets and estimate model performance.


c) Transfer Learning

Pretrained models can leverage knowledge from larger datasets and fine-tune on the small dataset, reducing the dependency on large training data.


d) Regularization

Add regularization (e.g., L1, L2, dropout) to prevent overfitting when the training data is small.


e) Semi-Supervised or Unsupervised Learning

Leverage unlabeled data to extract useful patterns or representations and use these to enhance learning on the small labeled dataset.
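
Of the strategies above, bootstrapping is the closest to the repeated splits used in this lab. A minimal sketch of drawing one bootstrap sample and its "out-of-bag" validation indices follows; `bootstrap_indices` is a hypothetical helper, and the sample count of 1000 is arbitrary.

```python
import numpy as np

def bootstrap_indices(n_samples, rng):
    """Return (train_idx, oob_idx): a bootstrap sample drawn with replacement
    and the 'out-of-bag' indices that were never drawn."""
    train_idx = rng.integers(low=0, high=n_samples, size=n_samples)
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[train_idx] = False          # anything drawn at least once is not out-of-bag
    return train_idx, np.flatnonzero(oob_mask)

rng = np.random.default_rng(seed=0)
train_idx, oob_idx = bootstrap_indices(1000, rng)
print(len(np.unique(train_idx)), len(oob_idx))
```

A classifier would be trained on `train_idx` and evaluated on `oob_idx`, and the procedure repeated to average the accuracies. On average roughly 63% of the samples appear in a bootstrap sample, leaving about 37% out-of-bag.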



---

5. Practical Example

Suppose you have only 50 samples in your dataset:

10-Fold Cross-Validation: Train on 45 samples, validate on 5 samples in each fold.

Leave-One-Out Cross-Validation (LOOCV): Train on 49 samples, validate on 1 sample in each fold.


While LOOCV maximizes training size for each iteration, the variability of validation results can be high due to the single-sample validation set.
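
The fold sizes in this example follow from simple arithmetic, sketched below. The helper `fold_sizes` is hypothetical and assumes that k divides the number of samples evenly.

```python
def fold_sizes(n_samples, k):
    """Return (train_size, val_size) per fold, assuming k divides n_samples."""
    val_size = n_samples // k
    return n_samples - val_size, val_size

print(fold_sizes(50, 10))  # 10-fold cross-validation
print(fold_sizes(50, 50))  # leave-one-out: one validation sample per fold
```

With a single validation sample, each fold's accuracy can only be 0 or 1, which is why the per-fold results vary so much even though their average is still meaningful.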


---

Conclusion

Increasing iterations (e.g., using k-fold cross-validation, LOOCV, or repeated cross-validation) can help address small training or validation datasets by:

1. Making better use of the data.


2. Reducing the variance caused by small validation sets.


3. Providing a more stable and reliable performance estimate.



However, for extremely small datasets, consider combining this with techniques like data augmentation, transfer learning, or regularization to overcome the inherent limitations of small sample sizes.


> Exercise: How does the accuracy of the 3 nearest neighbour classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbour classifier.