
# Machine Learning terms and metrics

FMML Module 1, Lab 2

In this lab, we will walk through part of the ML pipeline using the California Housing dataset. There are 20640 samples, each with 8 attributes such as the median income and the median house age of a block group. The task is to predict the median house value per district. We will use the scikit-learn library to load the data and perform some basic data preprocessing and model training. We will also show how to evaluate the model using some common metrics, split the data into training and testing sets, and use cross-validation to get a better estimate of the model's performance.

In [1]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

In [2]:
dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

Given below is the list of target values. These are continuous values: the median house values that we want to predict from the 8 input features. We should use regression models to predict such values, but for the sake of simplicity we will start with a classification model. To do this, we convert each target value to an integer (note that astype(int) drops the fractional part rather than rounding to the nearest integer) and use a classification model to predict the resulting class.

In [3]:
print("Original target values:", dataset.target)

dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)
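
The conversion above uses astype(int), which truncates toward zero. As a small illustration of how that differs from rounding, the sketch below applies both to the six target values visible in the printed output (the array `values` is just that sample, not the full dataset).

```python
import numpy as np

# Sample of target values copied from the printed output above
values = np.array([4.526, 3.585, 3.521, 0.923, 0.847, 0.894])

truncated = values.astype(int)          # drops the fractional part
rounded = np.round(values).astype(int)  # rounds to the nearest integer

print("astype(int):", truncated)  # [4 3 3 0 0 0]
print("np.round:   ", rounded)    # [5 4 4 1 1 1]
```

Either choice gives a valid set of class labels; they simply place the class boundaries at different house values.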


The simplest model to use for classification is the K-Nearest Neighbors model. We will use this model to predict the house value with a K value of 1. We will also use the accuracy metric to evaluate the model.

In [4]:
def NN1(traindata, trainlabel, query):
    """
    This function takes in the training data, training labels and a query point
    and returns the predicted label for the query point using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    query: numpy array of shape (d,) where d is the number of features

    returns: the predicted label for the query point which is the label of the training data which is closest to the query point
    """
    diff = (
        traindata - query
    )  # find the difference between features. Numpy automatically takes care of the size here
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # add up the squares
    label = trainlabel[np.argmin(dist)]
    return label


def NN(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is the label of the training data which is closest to each test point
    """
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel
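
To see the nearest neighbour rule on something small enough to check by hand, here is a sketch on a toy dataset. The function `nn_predict` is a hypothetical vectorized variant written for this example, not the lab's NN function; it builds the full distance matrix, so it is only suitable for small inputs.

```python
import numpy as np

def nn_predict(traindata, trainlabel, testdata):
    # Pairwise squared Euclidean distances, shape (m, n), via broadcasting
    diff = testdata[:, None, :] - traindata[None, :, :]
    dist = (diff * diff).sum(axis=2)
    # For each test point, take the label of the closest training point
    return trainlabel[np.argmin(dist, axis=1)]

# Toy data made up for illustration
toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 3])
toy_test = np.array([[0.2, 0.1], [4.0, 4.5]])

toy_pred = nn_predict(toy_train, toy_label, toy_test)
print(toy_pred)  # [0 3]
```

The first query lies next to the origin and receives label 0; the second is closest to (5, 5) and receives label 3.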

We will also define a 'random classifier', which randomly assigns a label to each sample.

In [5]:
def RandomClassifier(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the random classifier algorithm

    In reality, we don't need these arguments but we are passing them to keep the function signature consistent with other classifiers

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is a random label from the training data
    """

    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel

We need a metric to evaluate the performance of the model. Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. We will use the accuracy metric to evaluate and compare the performance of the K-Nearest Neighbors model and the random classifier.

In [6]:
def Accuracy(gtlabel, predlabel):
    """
    This function takes in the ground-truth labels and predicted labels
    and returns the accuracy of the classifier

    gtlabel: numpy array of shape (n,) where n is the number of samples
    predlabel: numpy array of shape (n,) where n is the number of samples

    returns: the accuracy of the classifier which is the number of correct predictions divided by the total number of predictions
    """
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)
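
As a quick worked example of the accuracy definition, consider five made-up labels of which three are predicted correctly. The arrays below are illustrative only.

```python
import numpy as np

gt = np.array([0, 1, 2, 2, 3])
pred = np.array([0, 1, 1, 2, 0])

# Element-wise comparison gives a boolean array; its mean is the fraction correct
acc = (gt == pred).mean()
print(acc)  # 0.6
```

Three of the five positions match, so the accuracy is 3/5 = 0.6.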

Let us make a function to split the dataset with the desired probability. We will use this function to split the dataset into training and testing sets. We will use the training set to train the model and the testing set to evaluate the model.

In [7]:
def split(data, label, percent):
    # generate a random number for each sample
    rnd = rng.random(len(label))
    split1 = rnd < percent
    split2 = rnd >= percent

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label
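
Because each sample is assigned independently, the split above only gives the requested fraction approximately. If an exact size is wanted, one possible alternative is to shuffle the indices and cut them at a fixed position. The sketch below shows this idea; `split_exact` and `rng_demo` are hypothetical names introduced for this example and are not used elsewhere in the lab.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # separate generator, assumed for this sketch

def split_exact(data, label, fraction, rng):
    """Hypothetical variant of split: the first part gets exactly round(fraction * n) samples."""
    n = len(label)
    order = rng.permutation(n)          # random ordering of the sample indices
    n_first = int(round(fraction * n))
    first, second = order[:n_first], order[n_first:]
    return data[first], label[first], data[second], label[second]

# Toy data: row i is [2*i, 2*i + 1], label i
toy_data = np.arange(20).reshape(10, 2)
toy_label = np.arange(10)
d1, l1, d2, l2 = split_exact(toy_data, toy_label, 0.2, rng_demo)
print(len(l1), len(l2))  # 2 8
```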

We will reserve about 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [8]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %


## Experiments with splits

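Let us reserve some of our train data as a validation set. Here, roughly 75% of the remaining data is kept for training and roughly 25% is used for validation.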

In [9]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

What is the accuracy of our classifiers on the train dataset?

In [10]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm:", trainAccuracy*100, "%")

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using random classifier: ", trainAccuracy*100, "%")

Training accuracy using nearest neighbour algorithm: 100.0 %
Training accuracy using random classifier:  16.4375808538163 %


For nearest neighbour, the train accuracy is 1 (100%), because every training sample is its own nearest neighbour. This holds as long as no two identical samples carry different labels. The accuracy of the random classifier is close to 1/(number of classes), which is about 0.1666 in our case. This is because the random classifier picks a label uniformly at random for each sample, so the probability of picking the correct label is 1/(number of classes). Let us predict the labels for our validation set and get the accuracy. This accuracy is a good estimate of the accuracy of our model on unseen data.

In [11]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")


valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.10852713178294 %
Validation accuracy using random classifier: 16.884689922480618 %
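
The random classifier stays near 1/(number of classes) on both the train and validation sets. A small simulation suggests why this does not depend on how the true labels are distributed: a uniform guess matches any fixed label with probability 1/(number of classes). The class proportions below are made up for illustration and are not those of the housing data.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # generator used only in this sketch
n_classes = 6
n_samples = 100_000

# Hypothetical, deliberately imbalanced ground-truth labels
true_label = rng_demo.choice(
    n_classes, size=n_samples, p=[0.4, 0.3, 0.15, 0.1, 0.04, 0.01]
)
# Uniform random guesses, in the same spirit as RandomClassifier
guess = rng_demo.integers(low=0, high=n_classes, size=n_samples)

random_acc = (true_label == guess).mean()
print(random_acc, 1 / n_classes)  # the two numbers should be close
```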


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. However, the validation accuracy of nearest neighbour is still about twice that of the random classifier. Now let us try another random split and check the validation accuracy. We will see that the validation accuracy changes with the split, because it depends on which samples happen to fall into the validation set. We can get a better estimate of the accuracy by using cross-validation.

In [12]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.048257372654156 %


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [13]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for splits, like 99.9% or 0.1%.

> Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.
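
As one possible starting point for the exercise, here is a minimal NumPy sketch of k-nearest-neighbour classification with a majority vote. The function `knn_predict` and the toy data are hypothetical and written for this example; using KNeighborsClassifier from scikit-learn, as suggested above, is an equally valid route. The sketch assumes labels are non-negative integers and breaks ties in favour of the smallest label.

```python
import numpy as np

def knn_predict(traindata, trainlabel, testdata, k=3):
    """Sketch of k-NN with a majority vote (hypothetical helper).
    Assumes non-negative integer labels; ties go to the smallest label."""
    preds = []
    for query in testdata:
        dist = ((traindata - query) ** 2).sum(axis=1)  # squared Euclidean distances
        nearest = np.argpartition(dist, k - 1)[:k]     # indices of the k closest points (unordered)
        votes = np.bincount(trainlabel[nearest])       # count labels among the neighbours
        preds.append(np.argmax(votes))                 # most frequent label
    return np.array(preds)

# Toy 1-D data; the point at 1.2 carries a "noisy" label 0
toy_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
toy_label = np.array([0, 0, 1, 1, 1, 0])
toy_test = np.array([[0.04], [1.19]])

print(knn_predict(toy_train, toy_label, toy_test, k=1))  # [0 0]
print(knn_predict(toy_train, toy_label, toy_test, k=3))  # [0 1]
```

With k=1 the second query copies the label of its single closest point, whereas with k=3 the two neighbouring points with label 1 outvote it.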

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of validation accuracies as the test accuracy estimation. Here is a function for doing this. Note that this function will take a long time to execute. You can reduce the number of splits to make it faster.

In [14]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    """
    This function takes in the data, labels, split percentage, number of iterations and classifier function
    and returns the average accuracy of the classifier

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float between 0 and 1, the fraction of data to be used for training
    iterations: int which is the number of iterations to run the classifier
    classifier: function which is the classifier function to be used

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies

In [None]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.
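
For comparison with the repeated random splits used above, the following is a rough sketch of how k-fold cross-validation partitions the data. The helper `kfold_indices` is hypothetical and written only for this illustration; it is not part of the lab code or of scikit-learn.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    """Yield (train_idx, val_idx) pairs for a simple k-fold split.
    Hypothetical helper written for this sketch."""
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)    # k nearly equal-sized folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

rng_demo = np.random.default_rng(seed=0)  # generator used only in this sketch
seen = []
for train_idx, val_idx in kfold_indices(10, 5, rng_demo):
    print(len(train_idx), len(val_idx))  # 8 2 on every fold
    seen.extend(val_idx.tolist())

# Each sample lands in a validation fold exactly once
print(sorted(seen) == list(range(10)))  # True
```

Unlike independent random splits, the folds here do not overlap, so every sample contributes to the validation accuracy exactly once.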

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
Yes, averaging the validation accuracy across multiple splits typically gives more consistent and reliable results, especially when working with datasets that might have variability in their structure. Here's why and how this works:


---

1. Why Averaging Validation Accuracy Across Splits Helps

a) Reduces Bias

A single train-test split can introduce sampling bias, as the performance may depend heavily on how the data is split (e.g., uneven representation of classes or important patterns in the train/validation sets).

Averaging over multiple splits smooths out the effect of these biases, providing a more representative estimate of model performance.


b) Accounts for Variability

If your dataset is small or imbalanced, splitting it into training and validation sets may lead to different results depending on the split.

Using multiple splits (e.g., through k-fold cross-validation) ensures that the model is evaluated on different subsets of the data, reducing variance in results.


c) Improves Generalization Estimate

The average accuracy across splits provides a better estimate of how the model will perform on unseen data since it’s been tested on multiple validation sets.



---

2. How to Implement It

a) k-Fold Cross-Validation

Divide your data into k folds (e.g., 5 or 10).

Use one fold for validation and the rest for training. Repeat the process k times, with each fold serving as the validation set once.



2. Does it give more accurate estimate of test accuracy?
Averaging validation accuracy across multiple splits (e.g., using k-fold cross-validation) generally gives a more accurate estimate of test accuracy, but there are caveats. Here's why it works, its limitations, and when it might differ from actual test performance.


---

1. Why Averaging Validation Accuracy Gives a Better Estimate

a) Reduces Sampling Bias

When you rely on a single train-test split, the split may not represent the entire dataset (e.g., one split might have more challenging examples in the test set).

With k-fold cross-validation, every data point is used for validation exactly once and for training in the remaining folds, leading to a more balanced estimate of model performance. With repeated random splits, as in AverageAccuracy above, this holds only approximately.


b) Incorporates Variability in Data

Different splits expose the model to varied subsets of data. Averaging the performance across these splits provides a more reliable estimate of how the model will perform on unseen data.


c) Represents the Dataset Holistically

By testing across all data points (as part of validation in different splits), cross-validation approximates the model's behavior on the entire dataset.



---

2. When Averaged Validation Accuracy Aligns with Test Accuracy

Large and Representative Dataset:

When your dataset is large and well-represented, validation accuracy (averaged over splits) will closely align with test accuracy.


Balanced Classes:

Stratified k-fold cross-validation ensures the class distribution in training/validation folds matches the original data, improving the alignment with test accuracy.


Stable Model:

If your model isn't highly sensitive to small changes in the training data (unlike overfitting-prone models such as deep networks), validation results will better reflect test performance.




---

3. Potential Discrepancies Between Validation and Test Accuracy

a) Overfitting During Cross-Validation

If you repeatedly tune hyperparameters or features using cross-validation results, your model may indirectly overfit to the validation set.

As a result, validation accuracy may be higher than actual test accuracy.

Solution: Use a separate test set for final evaluation.


b) Distribution Shift

If the test set comes from a slightly different distribution than the training data (e.g., due to time, location, or other factors), validation accuracy won't perfectly predict test accuracy.

Solution: Ensure the training and test sets come from the same distribution or use domain adaptation techniques.


c) Small Dataset

If the dataset is small, even cross-validation might give overly optimistic results due to lack of diversity in splits.

Solution: Use techniques like leave-one-out cross-validation (LOOCV) or augment the dataset.


d) Class Imbalance

If some classes are underrepresented in the dataset, cross-validation accuracy might not fully capture the model's performance on minority classes, leading to a test accuracy discrepancy.

Solution: Use stratified cross-validation and evaluate class-wise metrics (precision, recall, etc.).



---

4. Example: Validation vs. Test Accuracy


3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
The number of iterations (e.g., the number of splits or folds in cross-validation) has a direct impact on the stability and accuracy of the estimate, but it also introduces trade-offs in terms of computational cost. Here’s a breakdown of how the number of iterations affects the estimate and whether more iterations provide a better estimate:


---

1. Effect of Number of Iterations

a) Lower Iterations

With fewer iterations (e.g., 3 splits or folds), the estimate of accuracy can be more variable and less representative because:

Fewer data subsets are used for training and testing.

Some data points may be underrepresented or excluded entirely from validation.


Example: A 3-fold cross-validation might give varying results depending on how data is split, especially with smaller datasets.


b) Higher Iterations

As the number of iterations increases, the estimate becomes more reliable and consistent because:

Every data point is used in training and validation more evenly across iterations.

Random variations due to splitting are smoothed out.


Example: 10-fold cross-validation generally provides a better estimate of generalization than 3-fold, especially for imbalanced datasets.


c) Extreme Iterations (e.g., LOOCV)

Leave-One-Out Cross-Validation (LOOCV) uses N iterations, where N is the number of data points.

Every data point is used once as the validation set, so the estimate reflects the entire dataset’s variability.

This gives a nearly unbiased estimate, although its variance can be high, and it is computationally expensive, especially for large datasets.




---

2. Why More Iterations Improve the Estimate

1. Reduced Variance:

Averaging over a larger number of iterations reduces the variance in the performance estimate.

The estimate is less sensitive to outliers or specific data splits.



2. More Generalized Results:

With higher iterations, the model trains and validates on a greater variety of data splits, leading to a better approximation of its true performance on unseen data.



3. Data Coverage:

Every data point is included in training and validation more frequently, ensuring that no subset of the data disproportionately influences the results.
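
The variance argument above can be illustrated with a small simulation. It rests on a strong assumption, stated here and in the code: single-split accuracies are treated as independent draws with mean 0.34 and standard deviation 0.01. These numbers are illustrative, not measured. Real splits of the same dataset overlap and are correlated, so the actual reduction is usually smaller than this idealised picture.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # generator used only in this sketch

# Assumption for this sketch: single-split accuracies behave like independent
# draws with mean 0.34 and standard deviation 0.01 (illustrative numbers only).
def spread_of_average(iterations, repeats=2000):
    acc = rng_demo.normal(loc=0.34, scale=0.01, size=(repeats, iterations))
    return acc.mean(axis=1).std()   # how much the averaged estimate itself varies

for iterations in (1, 4, 16):
    print(iterations, spread_of_average(iterations))
```

Under this assumption the spread shrinks roughly like 1/sqrt(iterations), which also hints at the diminishing returns of adding ever more iterations.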





---

3. Trade-Offs with Higher Iterations

a) Diminishing Returns

Beyond a certain number of iterations (e.g., increasing from 10 to 20 folds), the improvement in estimate accuracy is minimal.

Example: For large datasets, a 10-fold cross-validation estimate is often sufficient and close to LOOCV.


b) Computational Cost

Each additional iteration increases the computational cost.



4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?
Yes, you can deal with a very small training or validation dataset by increasing the number of iterations, as this allows you to make better use of the available data. However, there are important caveats and limitations to consider. Here's an analysis based on the results and principles we've discussed:


---

1. How Increasing Iterations Helps with Small Datasets

a) Better Utilization of Data

In scenarios like k-fold cross-validation, increasing the number of folds (or iterations) means smaller validation subsets and larger training subsets.

For example, with 5-fold cross-validation, 20% of the data is used for validation in each fold, while 80% is used for training.

With 10-fold cross-validation, only 10% of the data is used for validation, leaving 90% for training. This is particularly beneficial for small datasets where every additional training sample matters.



b) Reduces Variability in Results

Small datasets are prone to high variance in performance metrics due to the limited representation of the data distribution in training and validation splits.

More iterations ensure that every data point is used for both training and validation multiple times, leading to a more stable and reliable performance estimate.


c) Handles Sampling Bias

In small datasets, random splits can lead to sampling bias, where certain important patterns or outliers are excluded from the training set.

More iterations reduce the likelihood of these biases affecting the final estimate by testing the model across many different splits.




---

2. Practical Example: Small Dataset with Increasing Iterations

Increasing iterations ensures the model is trained on a larger proportion of the data in each iteration, which is critical when the dataset is small.



---

3. Challenges with Very Small Datasets

While increasing iterations helps, there are challenges that arise when working with very small datasets:

a) Overfitting Risk

In small datasets, the model is more likely to memorize the training data rather than learn generalizable patterns.

Increasing iterations won't solve this directly. Proper regularization (e.g., L1/L2 penalties, dropout) and simpler models are needed.



b) Computational Cost

Techniques like LOOCV involve training the model N times (where N is the number of data points). This is computationally expensive, especially for complex models like deep learning.


c) Limited Representation of Data Distribution

A very small dataset might not represent the full variability of the problem space (e.g., insufficient examples of edge cases).

Cross-validation with more iterations can reduce sampling bias but cannot create new information about unseen patterns.



d) Validation Size Becomes Too Small

With higher iterations (e.g., 10-fold or LOOCV), the validation set becomes very small. This may result in unstable validation metrics as a single misclassification has a larger impact.
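
A short piece of arithmetic makes this concrete. Assuming a hypothetical dataset of 50 samples, the sketch below lists the validation fold size for a few values of k and how much a single misclassification moves the accuracy of one fold.

```python
n_samples = 50  # hypothetical small dataset

rows = []
for k in (5, 10, n_samples):      # k = n_samples corresponds to leave-one-out
    val_size = n_samples // k      # samples in each validation fold
    step = 1 / val_size            # change in fold accuracy from one misclassification
    rows.append((k, val_size, step))
    print(f"k={k:2d}  validation size={val_size:2d}  accuracy step={step:.2f}")
```

With leave-one-out, each fold's accuracy is either 0 or 1, so only the average over all folds is meaningful.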



---

4. Strategies for Very Small Datasets

To address the limitations of small datasets while benefiting from increased iterations, you can apply these strategies:

a) Data Augmentation

Artificially increase the size of the dataset by transforming the data (e.g., rotations, flips for images; adding noise for numeric data).


b) Transfer Learning

Use a pre-trained model and fine-tune it on the small dataset. This leverages knowledge from larger datasets to improve performance.


c) Simpler Models

Use models with fewer parameters to reduce the risk of overfitting.

Example: Linear regression or logistic regression over deep learning for small datasets.



d) Use Stratified Cross-Validation

If the dataset is imbalanced, ensure that each fold in cross-validation preserves the class distribution, so the model doesn't miss rare classes.


e) Regularization and Early Stopping

Apply techniques like L2 regularization and monitor performance on validation data during training to prevent overfitting.



---

5. Does Increasing Iterations Fully Solve the Problem?

While increasing iterations can help mitigate the challenges of small datasets, it is not a silver bullet:

It improves the reliability of performance estimates by making better use of the data.

However, it doesn't address fundamental issues like overfitting, insufficient data variability, or lack of representativeness in the dataset.


For best results, you should combine increased iterations with regularization, data augmentation, or external data sources when feasible.


---

6. Key Takeaways

1. Yes, increasing iterations (e.g., folds) helps in small datasets by improving training set size, reducing bias, and giving more consistent validation estimates.


2. Be cautious of overfitting and computational cost, especially for methods like LOOCV.


3. Complement increased iterations with techniques like regularization, transfer learning, or data augmentation to make the most of small datasets.




> Exercise: How does the accuracy of the 3 nearest neighbour classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbour classifier.