
# Machine Learning terms and metrics

FMML Module 1, Lab 2

In this lab, we will walk through part of the ML pipeline using the California Housing dataset. It has 20640 samples, each with 8 attributes such as the median income and the median house age of a block group (district). The task is to predict the median house value per district. We will use the scikit-learn library to load the data, and then perform some basic data preprocessing and model training. We will also show how to evaluate the model using some common metrics, split the data into training and testing sets, and use cross-validation to get a better estimate of the model's performance.

In [1]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

In [2]:
dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

Given below is the list of target values. Each one is the median house value of a district, and the values are continuous. We should use regression models to predict such values, but for the sake of simplicity we will start with a classification model. To do that, we truncate each value to an integer (dropping the fractional part, so 3.585 becomes 3) and use a classification model to predict the resulting class.

In [3]:
print("Original target values:", dataset.target)

dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)
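
The conversion above relies on `astype(int)`, which truncates toward zero rather than rounding to the nearest integer. As a small sketch, the snippet below compares the two on the first three target values printed above; `np.rint` is shown only for contrast and is not used in the lab.

```python
import numpy as np

# First three target values from the output above (units of $100,000)
values = np.array([4.526, 3.585, 3.521])

truncated = values.astype(int)          # drops the fractional part
rounded = np.rint(values).astype(int)   # rounds to the nearest integer (for contrast only)

print("astype(int):", truncated)
print("np.rint:    ", rounded)
```

The two results differ for every value whose fractional part is 0.5 or more, so the class labels used in the rest of the lab are the truncated ones.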


One of the simplest models for classification is the K-Nearest Neighbors model. We will use it with a K value of 1 to predict the house value class, and we will evaluate it with the accuracy metric.

In [4]:
def NN1(traindata, trainlabel, query):
    """
    This function takes in the training data, training labels and a query point
    and returns the predicted label for the query point using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    query: numpy array of shape (d,) where d is the number of features

    returns: the predicted label for the query point which is the label of the training data which is closest to the query point
    """
    # difference between each training sample and the query (NumPy broadcasting handles the shapes)
    diff = traindata - query
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # add up the squares
    label = trainlabel[np.argmin(dist)]
    return label


def NN(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is the label of the training data which is closest to each test point
    """
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel
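
To make the broadcasting step in `NN1` concrete, here is a minimal sketch on three made-up 2-D training points. The data values are hypothetical and chosen only so the distances are easy to check by hand.

```python
import numpy as np

# Hypothetical toy data: three training points with labels 0, 1, 2
traindata = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
trainlabel = np.array([0, 1, 2])
query = np.array([0.9, 1.2])

diff = traindata - query            # (3, 2) - (2,) broadcasts to shape (3, 2)
dist = (diff * diff).sum(axis=1)    # squared Euclidean distance to each training point
nearest = np.argmin(dist)           # index of the closest training point
pred = trainlabel[nearest]

print("Squared distances:", dist)
print("Predicted label:", pred)
```

The query lies closest to the second training point, so the predicted label is 1. Squared distances are enough here because taking the square root would not change which point is nearest.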

We will also define a 'random classifier', which assigns a random label to each sample.

In [5]:
def RandomClassifier(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the random classifier algorithm

    In reality, we don't need these arguments but we are passing them to keep the function signature consistent with other classifiers

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data, each drawn uniformly at random from the classes present in the training labels
    """

    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel

We need a metric to evaluate the performance of the model. Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. We will use the accuracy metric to evaluate and compare the performance of the K-Nearest Neighbors model and the random classifier.

In [6]:
def Accuracy(gtlabel, predlabel):
    """
    This function takes in the ground-truth labels and predicted labels
    and returns the accuracy of the classifier

    gtlabel: numpy array of shape (n,) where n is the number of samples
    predlabel: numpy array of shape (n,) where n is the number of samples

    returns: the accuracy of the classifier which is the number of correct predictions divided by the total number of predictions
    """
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)
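
As a quick hand-checkable sketch of the same computation, the snippet below uses five made-up labels (hypothetical values, not taken from the dataset).

```python
import numpy as np

# Hypothetical ground-truth and predicted labels for five samples
gt = np.array([4, 3, 3, 0, 0])
pred = np.array([4, 3, 2, 0, 1])

matches = gt == pred                  # boolean array, True where the prediction is correct
acc = matches.sum() / len(gt)         # fraction of correct predictions

print("Correct predictions:", matches.sum(), "of", len(gt))
print("Accuracy:", acc)
```

Three of the five predictions match, so the accuracy is 0.6.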

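Let us write a function that splits the dataset randomly, sending each sample to the first part with the desired probability. Because the assignment is random, the sizes of the two parts only approximately match the requested fraction. We will use this function to split the dataset into training and testing sets. We will use the training set to train the model and the testing set to evaluate the model.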

In [7]:
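def split(data, label, percent):
    """
    Randomly splits the data and labels into two parts

    percent: float between 0 and 1, the probability that a sample goes to the first part,
    so the first part holds roughly (not exactly) this fraction of the samples

    returns: data and labels of the first part, followed by data and labels of the second part
    """
    # generate a random number in [0, 1) for each sample
    rnd = rng.random(len(label))
    split1 = rnd < percent
    split2 = rnd >= percent

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label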
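
Since `split` decides each sample independently, the two parts have sizes that vary from call to call. If an exact size is wanted, one possible alternative is to shuffle the indices and cut them at a fixed position. The sketch below is an assumption-laden variant, not part of the lab: the name `exact_split` and the toy data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def exact_split(data, label, fraction):
    """Hypothetical variant of split(): the first part gets exactly round(fraction * n) samples."""
    n = len(label)
    order = rng.permutation(n)        # random ordering of the sample indices
    cut = int(round(fraction * n))    # number of samples in the first part
    first, second = order[:cut], order[cut:]
    return data[first], label[first], data[second], label[second]


# Toy data: 10 samples with 2 features each
data = np.arange(20).reshape(10, 2)
label = np.arange(10)

a_data, a_label, b_data, b_label = exact_split(data, label, 0.2)
print("Sizes:", len(a_label), len(b_label))
```

With 10 samples and a fraction of 0.2, the first part always has 2 samples and the second has 8, and every sample lands in exactly one part.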

We will reserve about 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [8]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set.

In [9]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

What is the accuracy of our classifiers on the train dataset?

In [10]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm:", trainAccuracy*100, "%")

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using random classifier: ", trainAccuracy*100, "%")

Training accuracy using nearest neighbour algorithm: 100.0 %
Training accuracy using random classifier:  16.4375808538163 %


For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour (the only exception would be identical feature vectors with different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case. This is because the random classifier assigns each label uniformly at random, so the probability of assigning the correct label is 1/(number of classes). Let us predict the labels for our validation set and get the accuracy. This accuracy is a much better estimate of the accuracy of our model on unseen data.

In [11]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")


valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.10852713178294 %
Validation accuracy using random classifier: 16.884689922480618 %
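
The random classifier's accuracy of roughly 1/(number of classes) does not depend on how the true labels are distributed. The simulation below is a sketch under assumed values: the six class probabilities are made up for illustration and are not the class frequencies of the housing data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_classes = 6

# Made-up, deliberately imbalanced label distribution (an assumption for illustration)
probs = [0.3, 0.3, 0.2, 0.1, 0.05, 0.05]
labels = rng.choice(n_classes, size=100_000, p=probs)

# Uniform random guesses, as in RandomClassifier
guesses = rng.integers(low=0, high=n_classes, size=len(labels))

acc = (labels == guesses).mean()
print("Simulated accuracy:", acc)
print("1 / number of classes:", 1 / n_classes)
```

Whatever the true label is, a uniform guess matches it with probability 1/6, which is why the train and validation accuracies of the random classifier are nearly identical.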


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. However, the validation accuracy of nearest neighbour is about twice that of the random classifier. Now let us try another random split and check the validation accuracy. We will see that the validation accuracy changes with the split. This is because the validation set is a finite sample, so the accuracy depends on which samples happen to fall in it. We can get a more stable estimate of the accuracy by using cross-validation.

In [12]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.048257372654156 %


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [13]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for splits, like 99.9% or 0.1%.

> Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.
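
As a starting point for the exercise, here is a minimal sketch of how `KNeighborsClassifier` and `accuracy_score` fit together. It runs on placeholder data (two noisy 2-D clusters generated with NumPy, an assumption made so the snippet is self-contained); for the exercise itself you would pass `alltraindata`, `alltrainlabel`, `testdata` and `testlabel` instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(seed=0)

# Placeholder data: two noisy 2-D clusters with labels 0 and 1
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal(2.0, 1.0, size=(200, 2)),
])
y = np.array([0] * 200 + [1] * 200)

# Simple shuffled train/test partition of the placeholder data
order = rng.permutation(len(y))
train_idx, test_idx = order[:300], order[300:]

scores = {}
for k in (1, 3):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    scores[k] = accuracy_score(y[test_idx], pred)
    print(f"{k}-NN test accuracy: {scores[k]:.3f}")
```

The numbers printed for the placeholder clusters say nothing about the housing data; the point is only the fit, predict and score pattern.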

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of validation accuracies as the test accuracy estimation. Here is a function for doing this. Note that this function will take a long time to execute. You can reduce the number of splits to make it faster.

In [14]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    """
    This function takes in the data, labels, split percentage, number of iterations and classifier function
    and returns the average accuracy of the classifier

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float between 0 and 1 which is the fraction of data to be used for training
    iterations: int which is the number of iterations to run the classifier
    classifier: function which is the classifier function to be used

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies

In [15]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

Average validation accuracy: 33.58463539517022 %
Test accuracy: 34.91795366795367 %


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. This will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>
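
For contrast with the repeated random splits used in `AverageAccuracy`, the sketch below shows the basic idea of k-fold cross-validation: the samples are partitioned once into k disjoint folds, and each fold serves as the validation set exactly once. The helper names `nn_predict` and `kfold_accuracy` and the toy clusters are hypothetical, introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def nn_predict(traindata, trainlabel, testdata):
    """1-nearest-neighbour prediction using squared Euclidean distance."""
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]


def kfold_accuracy(data, label, k):
    """Hypothetical helper: validation accuracy on each of k disjoint folds."""
    folds = np.array_split(rng.permutation(len(label)), k)
    accs = []
    for i in range(k):
        val_idx = folds[i]
        # train on every fold except the i-th one
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = nn_predict(data[train_idx], label[train_idx], data[val_idx])
        accs.append((pred == label[val_idx]).mean())
    return np.array(accs)


# Toy data: two well-separated 2-D clusters (an assumption for illustration)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

accs = kfold_accuracy(X, y, k=5)
print("Per-fold accuracy:", accs)
print("Mean accuracy:", accs.mean())
```

Unlike the random splits above, every sample is validated on exactly once, so no sample is over- or under-represented in the average.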

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
Yes, averaging the validation accuracy across multiple splits generally gives more consistent and reliable results in machine learning, especially in the context of cross-validation.

### Why Averaging Over Multiple Splits is Beneficial

1. **Reduces Variance**: When you split your dataset into training and validation sets, the performance of your model can vary significantly depending on the specific split. Certain splits might be easier or harder for the model, leading to high variance in validation accuracy. By averaging the accuracy over multiple splits (like in k-fold cross-validation), you reduce the impact of any one split being particularly easy or hard. This gives a more stable estimate of model performance.

2. **More Representative Performance**: Averaging over multiple splits helps ensure that the performance metric reflects how the model will generalize to an independent dataset, as each fold is used as both training and validation at different times. This makes the averaged accuracy more representative of the model's expected performance on unseen data.

3. **Balances Data Distribution**: Different splits can capture different aspects of data variability (e.g., different classes, feature distributions). Averaging across multiple splits helps to ensure that the performance estimate takes into account the overall distribution of the data rather than being overly influenced by a single partition.

### Common Approaches to Averaging Validation Accuracy

- **k-Fold Cross-Validation**: The dataset is divided into `k` equally sized folds. The model is trained on `k-1` folds and validated on the remaining fold. This process is repeated `k` times, with each fold being used as the validation set once. The validation accuracy is averaged over the `k` iterations to provide a single performance metric.

- **Repeated k-Fold Cross-Validation**: This involves repeating the k-fold cross-validation process multiple times with different random splits. The final accuracy is the average of all iterations. This method can provide even more robust estimates, especially for small datasets.

- **Stratified k-Fold Cross-Validation**: Similar to k-fold, but ensures that each fold maintains the same proportion of each class as the original dataset. This is particularly useful for imbalanced datasets to ensure that each fold is representative.

### Conclusion

Averaging validation accuracy across multiple splits, such as in cross-validation, generally leads to more reliable and consistent estimates of a model's performance. It mitigates the effects of variability due to different data splits and provides a better understanding of how the model will generalize to unseen data.

2. Does it give a more accurate estimate of test accuracy?
Yes, averaging validation accuracy across multiple splits typically gives a more accurate estimate of test accuracy. This approach, commonly used in cross-validation, is considered more reliable because it better represents how the model will perform on unseen data. Note, however, that averaging reduces the variance of the estimate, not its bias: each validation model is trained on only about 75% of the training data, so the average can still sit slightly below the test accuracy, as in the run above (about 33.6% versus 34.9%). Here's why averaging helps:

### 1. **Reduces Overfitting to Specific Data Splits**
When a model is trained and validated on a single split of the data, the resulting accuracy can be overly influenced by the specific examples included in that split. If the validation set happens to be particularly easy or difficult, the accuracy might be higher or lower than what you would see on the test set. By averaging the accuracy over multiple splits, such as in k-fold cross-validation, you reduce the effect of any single, potentially unrepresentative, split. This gives a more balanced estimate of the model's true performance.

### 2. **Covers More Data Variability**
Each fold in cross-validation serves as a different validation set, which helps the model be evaluated on multiple subsets of data. This allows the accuracy estimate to account for different aspects of the data distribution, including any variability or noise. It ensures that the performance metrics are not dependent on any particular subset of the data, leading to a more robust and generalizable estimate of the test accuracy.

### 3. **Utilizes the Entire Dataset**
In k-fold cross-validation, every data point is used for both training and validation at different stages. This comprehensive use of the dataset ensures that the model is exposed to as much information as possible, providing a more thorough evaluation of its performance across all possible data points. This holistic view typically results in an accuracy estimate that is closer to what would be seen on an independent test set.

### 4. **Mitigates the Impact of Randomness**
When using a single train-test split, the random selection of data points can lead to a biased performance estimate. For example, the validation set might accidentally include more outliers or more examples of certain classes. By using multiple splits and averaging the results, cross-validation reduces the impact of this randomness, producing a more accurate estimate of how the model will perform on genuinely new data.

### Conclusion
Averaging validation accuracy across multiple splits, as done in cross-validation, generally provides a more accurate and reliable estimate of test accuracy. It reduces the influence of random variations in the data splits, ensures the model is evaluated on a broader range of data, and offers a better assessment of the model's ability to generalize to unseen data.

3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
Yes, generally speaking, increasing the number of iterations in cross-validation or repeated validation methods can lead to a better estimate of the model's performance on unseen data. However, the relationship between the number of iterations and the accuracy of the estimate has diminishing returns beyond a certain point. Let's explore this in more detail.

### Effects of Increasing the Number of Iterations on the Estimate

1. **Reduction in Variance**:
   - **Fewer Iterations**: When you perform a small number of iterations (e.g., low `k` in k-fold cross-validation), the variance of your estimate is higher. This means the estimated performance (like accuracy) might fluctuate more between different runs due to the random sampling of training and validation sets.
   - **More Iterations**: Increasing the number of iterations (e.g., using more folds in k-fold cross-validation or repeating cross-validation multiple times) tends to reduce the variance of the estimate. This leads to a more stable and reliable estimate of the model's performance because the model is evaluated on more diverse subsets of data, covering a wider range of possible data distributions.

2. **Improved Generalization Estimate**:
   - More iterations provide a more comprehensive assessment of the model's ability to generalize to unseen data. By evaluating the model on multiple different subsets of the data, we can obtain a better sense of its true generalization performance, especially in the presence of data variability or imbalances.

3. **Law of Diminishing Returns**:
   - While increasing the number of iterations generally improves the estimate by reducing variance, there is a point of diminishing returns. After a certain number of iterations, the gain in the accuracy of the estimate becomes minimal. Beyond this point, further increasing the number of iterations does not significantly enhance the estimate but will increase computational cost.

4. **Computational Cost**:
   - **Fewer Iterations**: Lower computational cost but higher variance in the estimate.
   - **More Iterations**: Higher computational cost but lower variance and a potentially more accurate estimate. The additional computational cost might not be justified if the performance gain is negligible.

### Practical Considerations

- **k-Fold Cross-Validation**: Common values for `k` are 5 or 10. Increasing `k` beyond 10 usually provides only marginal improvements in the performance estimate but significantly increases computational time.
  
- **Repeated k-Fold Cross-Validation**: Repeating k-fold cross-validation multiple times with different random splits (e.g., repeating 10-fold cross-validation 5 times) can provide more reliable performance estimates, especially in cases where the dataset is small or noisy.

- **Leave-One-Out Cross-Validation (LOOCV)**: This is an extreme form where `k` equals the number of data points. While LOOCV provides a very low-bias estimate, it is computationally expensive and can have high variance in the estimate because each fold is based on a single data point.

### Conclusion

Increasing the number of iterations in cross-validation generally leads to a more accurate estimate of the model's performance by reducing variance and providing a better sense of generalization ability. However, the benefit diminishes beyond a certain point, and the computational cost increases. In practice, a balance is often struck by using a reasonable number of folds (like 5 or 10) or repeating cross-validation a few times to achieve a reliable estimate without excessive computation.

4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?
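
The diminishing returns described above can be illustrated with a simplified simulation. It is a sketch under strong assumptions: each split's accuracy is modelled as an independent binomial draw with an assumed true accuracy `p` and validation size `m`. Real splits reuse the same samples and are therefore correlated, so the actual reduction in spread is smaller than shown here.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

p = 0.34     # assumed true accuracy (roughly the value seen in this lab)
m = 4000     # assumed number of validation samples per split

spreads = {}
for iterations in (1, 10, 100):
    # 2000 repeated experiments, each averaging `iterations` simulated split accuracies
    accs = rng.binomial(m, p, size=(2000, iterations)) / m
    spreads[iterations] = accs.mean(axis=1).std()
    print(f"iterations={iterations:3d}  spread of the average: {spreads[iterations]:.5f}")
```

Under these assumptions the spread shrinks roughly with the square root of the number of iterations, so going from 10 to 100 iterations buys far less than going from 1 to 10 while costing ten times as much computation.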
Yes, increasing the number of iterations in cross-validation can help deal with very small training or validation datasets, but there are important nuances to consider. Let's break down how increasing iterations can help and what limitations still exist:

### How Increasing Iterations Helps with Small Datasets

1. **Maximizes Data Usage**:
   - In scenarios with small datasets, the amount of data available for training and validation is limited. With random splits, increasing the number of iterations makes it increasingly likely that each data point is used in both the training and validation sets across different iterations (k-fold cross-validation guarantees this by construction). This helps make the most of a limited dataset by effectively utilizing all available data.

2. **Reduces Variance in Performance Estimates**:
   - With small datasets, the results of a single train-test split can vary significantly due to the small number of samples in each set. By increasing the number of iterations, such as repeating the random split more often or repeating k-fold cross-validation, you average out this split-to-split fluctuation. Each sample has more opportunities to appear in both training and validation sets, leading to a more robust and stable performance estimate.

3. **Provides More Robust Performance Estimates**:
   - Repeated cross-validation or k-fold cross-validation (with an appropriate value for `k`) can provide a more accurate measure of a model's performance, especially when data is scarce. This approach gives a more comprehensive view of how the model performs across different subsets of the data, which can be particularly valuable when the dataset is small and each sample's influence is substantial.

### Limitations of Increasing Iterations with Small Datasets

1. **Small Validation Sets Lead to High Variance Estimates**:
   - If the training dataset is small, each validation fold in cross-validation will also be small. Small validation sets can lead to high variance in performance estimates since the model's accuracy on a small set can fluctuate significantly depending on which samples are included. Increasing the number of iterations can somewhat mitigate this effect by averaging the results, but the fundamental issue of high variance due to small validation sets remains.

2. **Risk of Overfitting**:
   - With very small training datasets, there is a higher risk of overfitting. The model may learn patterns that are specific to the limited training data and do not generalize well to new data. Increasing the number of iterations does not address this problem directly; rather, it only provides a more stable estimate of the overfitted performance. Regularization techniques and simpler models are often needed to combat overfitting with small datasets.

3. **Computational Cost**:
   - For very small datasets, techniques like Leave-One-Out Cross-Validation (LOOCV) (where `k` equals the number of data points) can be computationally expensive, especially for models with high training complexity. Although each iteration trains on nearly all data points except one, the large number of iterations (one per data point) can lead to high computational costs.

4. **Limited Information**:
   - If the dataset is very small, there might simply not be enough information to learn a robust model, regardless of the number of iterations. Cross-validation can help provide a better estimate of performance, but it cannot create new information where there is none.

### Conclusion

Increasing the number of iterations in cross-validation (e.g., using k-fold cross-validation or repeated cross-validation) can help make the most of small datasets by providing more reliable performance estimates and reducing variance. However, this approach cannot overcome the fundamental limitations of small datasets, such as high variance in validation results, risk of overfitting, and limited information. To effectively work with small datasets, it is often necessary to combine cross-validation with other strategies, such as regularization, data augmentation, simpler models, or even domain-specific knowledge to enhance the learning process.
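
To put rough numbers on how validation set size drives the variance of a single accuracy estimate, the sketch below uses the binomial standard error, sqrt(p(1-p)/m), for an assumed true accuracy `p`. Treating validation samples as independent draws is an approximation.

```python
import math

p = 0.34   # assumed true accuracy (roughly the value seen in this lab)

std_err = {}
for m in (10, 100, 4000):
    # standard error of an accuracy measured on m independent validation samples
    std_err[m] = math.sqrt(p * (1 - p) / m)
    print(f"validation size {m:5d}: standard error ~ {std_err[m]:.4f}")
```

With only 10 validation samples a single estimate can easily be off by 15 percentage points, which averaging over many iterations can smooth out. Averaging does not, however, make a model trained on very little data any better.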

> Exercise: How does the accuracy of the 3 nearest neighbour classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbour classifier.