<a href="https://colab.research.google.com/github/Sivani63010/Fmmllab-2024/blob/main/FMML_M1L2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning terms and metrics

FMML Module 1, Lab 2

In this lab, we will show a part of the ML pipeline by using the California Housing dataset. There are 20640 samples, each with 8 attributes like income of the block, age of the houses per district etc. The task is to predict the cost of the houses per district. We will use the scikit-learn library to load the data and perform some basic data preprocessing and model training. We will also show how to evaluate the model using some common metrics, split the data into training and testing sets, and use cross-validation to get a better estimate of the model's performance.

In [None]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

In [None]:
dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

The target values are listed below. Each one is the median house value of a district, a continuous quantity that goes with the 8 input features. Regression models are the natural choice for predicting such values, but for the sake of simplicity we will start with a classification model. To do that, we convert each target value to an integer with `astype(int)`, which drops the fractional part (for example, 4.526 becomes 4 and 0.923 becomes 0), and use a classification model to predict the resulting integer house value.

In [None]:
print("Original target values:", dataset.target)

dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)
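
Note that `astype(int)` truncates rather than rounds, which is why 0.923 becomes 0 in the output above. A minimal illustration on three of the printed target values, with `np.rint` shown only for comparison (it is not used in this lab):

```python
import numpy as np

values = np.array([4.526, 3.585, 0.923])

# astype(int) drops the fractional part
truncated = values.astype(int)

# np.rint rounds to the nearest integer before the cast
rounded = np.rint(values).astype(int)

print("truncated:", truncated)
print("rounded:  ", rounded)
```

With truncation, every house value in [0, 1) falls into class 0, every value in [1, 2) into class 1, and so on.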


The simplest model to use for classification is the K-Nearest Neighbors model. We will use this model to predict the house value with a K value of 1. We will also use the accuracy metric to evaluate the model.

In [None]:
def NN1(traindata, trainlabel, query):
    """
    This function takes in the training data, training labels and a query point
    and returns the predicted label for the query point using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    query: numpy array of shape (d,) where d is the number of features

    returns: the predicted label for the query point which is the label of the training data which is closest to the query point
    """
    diff = (
        traindata - query
    )  # find the difference between features. Numpy automatically takes care of the size here
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # add up the squares
    label = trainlabel[np.argmin(dist)]
    return label


def NN(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is the label of the training data which is closest to each test point
    """
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel

We will also define a 'random classifier', which assigns a randomly chosen label to each sample.

In [None]:
def RandomClassifier(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the random classifier algorithm

    In reality, we don't need these arguments but we are passing them to keep the function signature consistent with other classifiers

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is a random label from the training data
    """

    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel

We need a metric to evaluate the performance of the model. Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. We will use the accuracy metric to evaluate and compare the performance of the K-Nearest Neighbors model and the random classifier.

In [None]:
def Accuracy(gtlabel, predlabel):
    """
    This function takes in the ground-truth labels and predicted labels
    and returns the accuracy of the classifier

    gtlabel: numpy array of shape (n,) where n is the number of samples
    predlabel: numpy array of shape (n,) where n is the number of samples

    returns: the accuracy of the classifier which is the number of correct predictions divided by the total number of predictions
    """
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)
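
As a small worked example of the same computation, consider four made-up labels where three predictions are correct. The arrays below are illustrative only:

```python
import numpy as np

gt = np.array([0, 1, 2, 2])    # made-up ground-truth labels
pred = np.array([0, 1, 1, 2])  # made-up predictions, one of them wrong

# same formula as Accuracy: correct predictions / total predictions
accuracy = (gt == pred).sum() / len(gt)
print(accuracy)
```

Three of the four predictions match, so the accuracy is 3/4.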

Let us make a function to split the dataset with the desired probability. We will use this function to split the dataset into training and testing sets. We will use the training set to train the model and the testing set to evaluate the model.

In [None]:
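def split(data, label, percent):
    """
    Randomly split data and label into two parts.

    percent: float in [0, 1], the probability that a sample goes into the first part,
    so the first part holds roughly (not exactly) that fraction of the samples

    returns: split1data, split1label, split2data, split2label
    """
    # generate a random number in [0, 1) for each sample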
    rnd = rng.random(len(label))
    split1 = rnd < percent
    split2 = rnd >= percent

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label

We will reserve about 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [None]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %
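
The test share is 20.08% rather than exactly 20% because `split` draws an independent random number for each sample. If an exact share were wanted, one option is to shuffle the indices and cut at a fixed position. The sketch below assumes a hypothetical helper `exact_split` and synthetic data; it is not part of the lab code.

```python
import numpy as np


def exact_split(data, label, fraction, rng):
    """Hypothetical helper: put exactly round(fraction * n) samples in the first part."""
    n = len(label)
    order = rng.permutation(n)      # random ordering of all sample indices
    cut = int(round(fraction * n))  # size of the first part
    first, second = order[:cut], order[cut:]
    return data[first], label[first], data[second], label[second]


rng = np.random.default_rng(seed=0)
data = rng.random((1000, 8))           # synthetic features
label = rng.integers(0, 6, size=1000)  # synthetic labels in 0..5

a_data, a_label, b_data, b_label = exact_split(data, label, 0.2, rng)
print(len(a_label), len(b_label))
```

For the purposes of this lab the probabilistic split is perfectly adequate; the difference only matters when the part sizes must be fixed in advance.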


## Experiments with splits

Let us reserve some of our train data as a validation set

In [None]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

What is the accuracy of our classifiers on the train dataset?

In [None]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm:", trainAccuracy*100, "%")

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using random classifier: ", trainAccuracy*100, "%")

Training accuracy using nearest neighbour algorithm: 100.0 %
Training accuracy using random classifier:  16.4375808538163 %
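
These two values can be sanity-checked on synthetic data (an assumption, not the housing data). A 1-NN classifier evaluated on its own training points finds each point itself at distance zero, and uniform guessing over 6 classes is right about one time in six. The helper name `nn_predict` is hypothetical and mirrors `NN` above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def nn_predict(traindata, trainlabel, testdata):
    # 1-NN with squared Euclidean distance, same idea as NN above
    preds = []
    for q in testdata:
        dist = ((traindata - q) ** 2).sum(axis=1)
        preds.append(trainlabel[np.argmin(dist)])
    return np.array(preds)


# synthetic stand-in data: 500 distinct points, labels in 0..5
x = rng.random((500, 8))
y = rng.integers(0, 6, size=500)

# evaluate 1-NN on the same points it was given as training data
nn_train_acc = (nn_predict(x, y, x) == y).mean()

# uniform random guesses over 6 classes, on many samples
true_labels = rng.integers(0, 6, size=60000)
guesses = rng.integers(0, 6, size=60000)
random_acc = (guesses == true_labels).mean()

print("1-NN accuracy on its own training data:", nn_train_acc)
print("random guessing accuracy:", random_acc, "vs 1/6 =", 1 / 6)
```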


For nearest neighbour, the train accuracy is always 1, because each training sample is its own nearest neighbour (at distance zero), provided no two samples have identical features but different labels. The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case. This is because the random classifier picks one of the 6 classes uniformly at random for each sample, so the probability of picking the correct label is 1/(number of classes). Let us now predict the labels for our validation set and compute the accuracy. This accuracy is a good estimate of the accuracy of our model on unseen data.

In [None]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")


valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.10852713178294 %
Validation accuracy using random classifier: 16.884689922480618 %


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier. Now let us try another random split and check the validation accuracy. We will see that the validation accuracy changes with the split. This is because the validation set is finite, so the accuracy depends on which samples happen to land in it. We can get a better estimate of the accuracy by using cross-validation.

In [None]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.048257372654156 %


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [None]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for splits, like 99.9% or 0.1%.

Answer 1:

The accuracy of the validation set can be influenced by the size of the validation set, but not in a straightforward way. Here’s how it typically works:

### Increasing the Percentage of the Validation Set:
1. **Potential for Increased Stability:** With a larger validation set, the accuracy is generally more stable and less sensitive to random fluctuations. This is because the model is being evaluated on more examples, which can give a better estimate of its generalization performance.

2. **Potential for Lower Variance:** A larger validation set reduces the variance of the accuracy estimate. This means that the accuracy metric you obtain is likely closer to the true performance of the model on unseen data.

3. **Less Training Data:** However, increasing the validation set size means you have less data available for training. This can potentially reduce the model's performance because it has fewer examples to learn from, especially if the dataset is small.

### Reducing the Percentage of the Validation Set:
1. **Potential for Higher Variance:** With a smaller validation set, the accuracy metric may have higher variance, meaning it could fluctuate more depending on which specific samples are included in the validation set. This might lead to an accuracy estimate that is less reliable.

2. **More Training Data:** Reducing the validation set size leaves more data for training. This could potentially improve the model’s performance if the training data was previously limited, as the model has more examples to learn from.

3. **Risk of Overfitting on Validation Set:** With a very small validation set, the accuracy might not be representative of the model's general performance. The model might perform well on this small subset simply due to chance, not because it has generalized well.

### Summary:
- **Larger Validation Set:** More stable and reliable accuracy but with potentially less training data.
- **Smaller Validation Set:** Less reliable accuracy estimate but more training data, which might improve model performance if the training data is limited.

In practice, the percentage of data allocated to the validation set is often a trade-off decision, balancing the need for a robust evaluation with the need to train the model effectively.
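
The effect of validation-set size on reliability can be made concrete with the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). This is a rough sketch that assumes the validation samples are independent; `accuracy_standard_error` is a hypothetical helper, and p = 0.34 is simply the approximate 1-NN validation accuracy reported earlier in this lab.

```python
import math


def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy p measured on n independent samples."""
    return math.sqrt(p * (1 - p) / n)


p = 0.34  # approximate 1-NN validation accuracy seen earlier
for n in (40, 400, 4000):
    se = accuracy_standard_error(p, n)
    print(f"n = {n:4d}: standard error = {se:.4f}")
```

Because the error shrinks with the square root of n, a validation set 100 times larger gives an estimate only 10 times tighter.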

Answer 2:

The size of the train and validation sets plays a critical role in predicting the accuracy on the test set using the validation set. Here's how they affect the process:

### 1. **Training Set Size:**
   - **Large Training Set:**
     - A larger training set provides the model with more data to learn from, generally leading to better model performance.
     - It reduces the variance of the model, making it less likely to overfit to the training data and more likely to generalize well to unseen data (i.e., the test set).
   - **Small Training Set:**
     - A smaller training set might not provide enough data for the model to learn effectively, leading to underfitting or overfitting.
     - This can result in a model that does not generalize well, making the validation accuracy less predictive of the test accuracy.

### 2. **Validation Set Size:**
   - **Large Validation Set:**
     - A larger validation set gives a more reliable estimate of the model's performance. The accuracy on the validation set is likely to be a better predictor of the test set accuracy.
     - However, using a large validation set leaves less data for training, which might reduce the model's performance, especially if the overall dataset is small.
   - **Small Validation Set:**
     - A smaller validation set might give a less reliable estimate of performance due to higher variance in the validation accuracy.
     - If the validation set is too small, the accuracy measured might not be representative, leading to poor predictions of the test set accuracy.

### 3. **Trade-offs:**
   - **Bias-Variance Trade-off:** When you allocate more data to the training set, you reduce the bias of the model by allowing it to learn better, but you increase the variance of your validation accuracy estimate because of the smaller validation set. Conversely, a larger validation set gives a lower variance estimate but might increase the bias of the model if the training set becomes too small.
   - **Data Size Consideration:** For large datasets, the trade-off is less critical because even after allocating a significant portion to validation, there remains sufficient data for training. In contrast, with smaller datasets, careful consideration is needed to balance the sizes of the training and validation sets to optimize model performance and reliable accuracy prediction.

### 4. **Cross-Validation:**
   - **K-Fold Cross-Validation:** When data is limited, using cross-validation (e.g., k-fold) allows you to make better use of the available data. Each fold acts as a validation set while the remaining folds are used for training. This helps in obtaining a more reliable estimate of the model's performance on unseen data without requiring a large validation set.

### Summary:
- A **large training set** generally leads to a better-performing model, but a **small validation set** might give a less reliable estimate of test accuracy.
- A **large validation set** gives a more reliable performance estimate but might weaken the model due to less data being available for training.
- **Cross-validation** can help mitigate these issues by maximizing the use of available data for both training and validation, providing a more reliable test accuracy prediction.

The optimal sizes depend on the specific dataset and model, and careful experimentation is often required to find the best balance.
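
The k-fold idea mentioned above can be sketched with plain NumPy index handling. This is a minimal illustration, assuming a hypothetical helper `kfold_indices`; it only produces the index sets and is not the cross-validation routine used later in this lab.

```python
import numpy as np


def kfold_indices(n_samples, k, rng):
    """Hypothetical helper: yield (train_idx, val_idx) for each of k folds."""
    order = rng.permutation(n_samples)  # shuffle once
    folds = np.array_split(order, k)    # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]              # fold i is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


rng = np.random.default_rng(seed=0)
splits = list(kfold_indices(10, 5, rng))
for train_idx, val_idx in splits:
    print("train:", np.sort(train_idx), "val:", np.sort(val_idx))
```

Every sample appears in exactly one validation fold, which is what distinguishes k-fold from repeated independent random splits.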

Answer 3:

A common and generally effective practice is to reserve **20%** of your data for the validation set. This percentage strikes a balance between having enough data for training the model and enough for validating its performance. However, the ideal percentage can vary depending on the following factors:

1. **Size of the Dataset:**
   - **Small Datasets:** If your dataset is small, you might want to reserve a lower percentage (e.g., 10-15%) to maximize the amount of data available for training.
   - **Large Datasets:** With larger datasets, a higher percentage like 20-30% can be used without compromising the training set's size.

2. **Complexity of the Model:**
   - **Simple Models:** For simpler models, a smaller validation set might suffice.
   - **Complex Models:** More complex models may require a larger validation set to adequately assess their performance.

3. **Objective of the Model:**
   - **Hyperparameter Tuning:** If you plan to use the validation set primarily for hyperparameter tuning, ensure it's large enough to provide reliable performance estimates.
   - **Final Model Evaluation:** If you're using the validation set to evaluate the final model before testing, maintaining a good balance (around 20%) is important.

Ultimately, while 20% is a good starting point, you may need to adjust this based on your specific scenario.

> Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.
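
As a possible starting point for this exercise, the sketch below compares 1-NN and 3-NN with `KNeighborsClassifier` and `accuracy_score`. It runs on small synthetic data (an assumption made to keep the example self-contained); for the exercise itself, `alltraindata`, `alltrainlabel`, `testdata` and `testlabel` from this lab would take the place of the synthetic arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(seed=0)

# synthetic stand-in data: the label depends on which side of a line the point falls
x = rng.normal(size=(600, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(int)
x_train, y_train = x[:450], y[:450]
x_test, y_test = x[450:], y[450:]

results = {}
for k in (1, 3):
    model = KNeighborsClassifier(n_neighbors=k)  # k nearest neighbours, Euclidean distance by default
    model.fit(x_train, y_train)
    results[k] = accuracy_score(y_test, model.predict(x_test))
    print(f"{k}-NN test accuracy: {results[k] * 100:.2f} %")
```

Which value of k does better depends on the data, so the comparison should be repeated on the housing data before drawing conclusions.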

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of validation accuracies as the test accuracy estimation. Here is a function for doing this. Note that this function will take a long time to execute. You can reduce the number of splits to make it faster.

In [None]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    """
    This function takes in the data, labels, split percentage, number of iterations and classifier function
    and returns the average accuracy of the classifier

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float in [0, 1] which is the fraction of data to be used for training
    iterations: int which is the number of iterations to run the classifier
    classifier: function which is the classifier function to be used

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies

In [None]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

Average validation accuracy: 33.58463539517022 %
Test accuracy: 34.91795366795367 %


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


Answer 1:

Yes, averaging the validation accuracy across multiple splits (such as in cross-validation) tends to give more consistent and reliable results compared to using a single split. Here's why:

1. **Reduced Variance**: When you use only a single train/validation split, the result can be highly dependent on how the data was split. This can introduce a lot of variance, especially if the dataset is small or not perfectly representative. By averaging across multiple splits, you smooth out these variations, leading to a more stable and generalizable estimate of your model's performance.

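2. **Better Representation of the Dataset**: With k-fold cross-validation, every data point is used for both training and validation. With the repeated random splits used in this lab, that is not guaranteed, but more and more points end up in some validation set as the number of splits grows. Either way, the assessment is more comprehensive, and it helps to avoid the scenario where the model performs well on one specific split but poorly on others.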

3. **Mitigation of Overfitting to Validation Set**: Relying on a single validation set can sometimes lead to overfitting to that specific data subset. Averaging across multiple validation sets helps to mitigate this risk by evaluating the model's performance on different data subsets.

In summary, averaging validation accuracy across multiple splits generally provides a more robust estimate of model performance, making it a standard practice in machine learning, especially when the goal is to assess the generalization ability of the model.
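
One way to check this claim is to keep the individual accuracies and look at their spread, not only their mean. The sketch below assumes a hypothetical variant of `AverageAccuracy` called `accuracy_stats` and uses small synthetic data so that it runs quickly; the numbers it prints say nothing about the housing data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def split(data, label, percent):
    # same probabilistic split as in the lab
    mask = rng.random(len(label)) < percent
    return data[mask], label[mask], data[~mask], label[~mask]


def nn_predict(traindata, trainlabel, testdata):
    # 1-NN with squared Euclidean distance
    return np.array(
        [trainlabel[np.argmin(((traindata - q) ** 2).sum(axis=1))] for q in testdata]
    )


def accuracy_stats(alldata, alllabel, splitpercent, iterations):
    """Hypothetical variant of AverageAccuracy that also returns the spread."""
    accs = []
    for _ in range(iterations):
        tr_x, tr_y, va_x, va_y = split(alldata, alllabel, splitpercent)
        accs.append((nn_predict(tr_x, tr_y, va_x) == va_y).mean())
    accs = np.array(accs)
    return accs.mean(), accs.std(), accs


# synthetic stand-in data: a noisy two-class problem
x = rng.normal(size=(400, 2))
y = (x[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

mean_acc, std_acc, accs = accuracy_stats(x, y, 0.75, 10)
print("per-split accuracies:", np.round(accs, 3))
print("mean:", round(mean_acc, 3), "std:", round(std_acc, 3))
```

The standard deviation describes how much a single split can move, while the mean is the quantity that `AverageAccuracy` reports.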

Answer 2:

Yes, averaging validation accuracy across multiple splits generally gives a more accurate estimate of test accuracy. Here's why:

1. **More Comprehensive Evaluation**: By using multiple splits, you ensure that the model is evaluated on different subsets of the data, which makes the performance estimate more representative of how the model might perform on unseen data. This reduces the likelihood of the model being over-optimistic or pessimistic based on a particular train/validation split.

2. **Reduction of Bias**: A single validation set might not fully capture the diversity of the dataset, especially if the data is not perfectly homogeneous. Averaging across multiple splits helps to reduce bias in the performance estimate, leading to a more accurate reflection of the model's true ability to generalize.

3. **Mitigation of Overfitting to Validation Data**: If you tune your model based on a single validation split, there's a risk of overfitting to that specific subset, leading to a less accurate estimate of test accuracy. Cross-validation reduces this risk by ensuring that your performance metric is not overly dependent on any single subset of the data.

4. **Law of Large Numbers**: Statistically, when you average results across multiple experiments (in this case, multiple validation splits), the average tends to converge to the true performance measure, assuming the data is a good representation of the population. This principle helps in providing a more accurate estimate of test accuracy.

In summary, cross-validation and averaging the validation accuracies across multiple splits is a widely accepted practice for obtaining a more accurate estimate of test accuracy. It is particularly useful when you have limited data or when you want to ensure that your model generalizes well to new, unseen data.

Answer 3:

Yes, increasing the number of iterations (here, the number of random train/validation splits that are averaged) generally improves the accuracy of an estimate, much as it does in iterative algorithms, simulations, or methods like Monte Carlo simulations.

Here’s why:

1. **Convergence**: Many algorithms rely on iterative refinement to approach an accurate solution. More iterations often mean the solution has more time to converge to a stable, accurate value.

2. **Reduced Variance**: In methods like Monte Carlo simulations, each iteration represents a sample. The more samples you have, the more the estimate tends to converge to the true value, reducing the variability of the estimate.

3. **Law of Large Numbers**: As the number of iterations increases, the average result of the samples (in stochastic processes) tends to converge to the expected value, leading to a better estimate.

However, there is a trade-off:

- **Diminishing Returns**: After a certain point, additional iterations may only yield marginal improvements in accuracy. This is due to the fact that as the estimate gets closer to the true value, the changes with each iteration become smaller.

- **Computational Cost**: More iterations mean more computation time and resources. So, while higher iterations generally provide better estimates, there is a practical limit depending on the available computational resources and the required precision.

In summary, yes, more iterations typically result in a better estimate, but the benefits diminish over time, and the computational cost increases.
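
The diminishing returns can be illustrated with an idealized simulation. It assumes each split yields an independent accuracy measured on `n_val` samples with true accuracy `p_true` (both values are assumptions chosen to resemble this lab). Real splits of the same dataset overlap, so their accuracies are correlated and the actual gain is likely smaller.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

p_true = 0.34   # assumed underlying accuracy
n_val = 4000    # assumed validation-set size
repeats = 2000  # how many times each experiment is repeated


def spread_of_average(iterations):
    # each simulated split gives accuracy = Binomial(n_val, p_true) / n_val
    accs = rng.binomial(n_val, p_true, size=(repeats, iterations)) / n_val
    # average over the iterations, then measure how much that average varies
    return accs.mean(axis=1).std()


spreads = {k: spread_of_average(k) for k in (1, 4, 16, 64)}
for k, s in spreads.items():
    print(f"{k:3d} iterations: spread of the averaged accuracy = {s:.5f}")
```

Under these assumptions the spread falls roughly as 1/sqrt(iterations), so going from 16 to 64 iterations buys far less than going from 1 to 4.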

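Answer 4:

Only partly. In this lab an iteration is one more random train/validation split, so more iterations average away the noise caused by a small validation set, but they cannot make up for a small training set, because every split still trains on too little data. It is important to understand the implications and limitations of this approach.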

### 1. **Training with a Small Dataset**:
   - **Overfitting Risk**: With a very small training dataset, the model is at high risk of overfitting, where it learns the noise and specific details of the training data rather than generalizable patterns. Increasing iterations (training for more epochs) will likely exacerbate overfitting, as the model has more opportunities to memorize the training data rather than learn to generalize.
   - **Data Augmentation**: To mitigate overfitting, data augmentation techniques can be employed to artificially expand the size of the training data. This might include transformations like rotation, flipping, scaling, or adding noise.
   - **Regularization**: Techniques like dropout, L2 regularization, or early stopping can also help prevent overfitting when training on a small dataset.

### 2. **Validation with a Small Dataset**:
   - **Unreliable Performance Metrics**: A small validation dataset can lead to unreliable performance metrics due to high variance. Even a single outlier can significantly affect the model's perceived performance.
   - **Cross-Validation**: Instead of relying on a single small validation set, you could use cross-validation. This approach splits the dataset into multiple folds, training and validating the model on different subsets, providing a more robust estimate of model performance.
   - **Use of Unseen Data**: It's also advisable to gather more validation data, if possible, to better estimate how the model will perform on truly unseen data.

### 3. **Increasing Iterations**:
   - **Convergence**: Increasing the number of iterations can help the model converge better when the dataset is small, but the benefits are limited if the data doesn't represent the broader problem space.
   - **Learning Rate Adjustment**: When increasing iterations, it may be necessary to adjust the learning rate. A smaller learning rate with more iterations can help the model find a more optimal solution without overshooting.

### 4. **Other Considerations**:
   - **Transfer Learning**: If your dataset is very small, using a pre-trained model (transfer learning) and fine-tuning it on your data can be highly effective. The pre-trained model already has learned general features from a large dataset, so your small dataset only needs to refine these features.

### Conclusion:
While increasing iterations can help in some scenarios, it is not a substitute for having sufficient and representative data. The strategies above can help mitigate some of the risks associated with small datasets.
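
In the setting of this lab, where an iteration is one more random split, this conclusion can be checked directly: averaging over many splits does not lift the accuracy of a model trained on very little data. The sketch below uses synthetic checkerboard data and hypothetical helper names (`nn_predict`, `average_accuracy`), all of which are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def split(data, label, percent):
    # same probabilistic split as in the lab
    mask = rng.random(len(label)) < percent
    return data[mask], label[mask], data[~mask], label[~mask]


def nn_predict(traindata, trainlabel, testdata):
    # 1-NN with squared Euclidean distance
    return np.array(
        [trainlabel[np.argmin(((traindata - q) ** 2).sum(axis=1))] for q in testdata]
    )


def average_accuracy(data, label, percent, iterations):
    # mean validation accuracy over repeated random splits
    total = 0.0
    for _ in range(iterations):
        tr_x, tr_y, va_x, va_y = split(data, label, percent)
        total += (nn_predict(tr_x, tr_y, va_x) == va_y).mean()
    return total / iterations


# synthetic stand-in data: 4x4 checkerboard labels on the unit square
x = rng.random((1000, 2))
y = ((np.floor(x[:, 0] * 4) + np.floor(x[:, 1] * 4)) % 2).astype(int)

small_train = average_accuracy(x, y, 0.02, 30)  # about 20 training points per split
large_train = average_accuracy(x, y, 0.75, 30)  # about 750 training points per split
print("average accuracy with ~2% training data: ", round(small_train, 3))
print("average accuracy with ~75% training data:", round(large_train, 3))
```

Both averages use the same number of iterations, so the gap between them comes from the training-set size alone.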

> Exercise: How does the accuracy of the 3 nearest neighbour classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbour classifier.