<a href="https://colab.research.google.com/github/Aishvaryasai-05/FMML/blob/main/FMML_M1L2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning terms and metrics

FMML Module 1, Lab 2

In this lab, we will walk through part of the ML pipeline using the California Housing dataset. It has 20640 samples, each with 8 attributes such as the median income and the median house age of a block group. The task is to predict the median house value of each block group. We will use the scikit-learn library to load the data and perform some basic data preprocessing and model training. We will also show how to evaluate the model using some common metrics, split the data into training and testing sets, and use cross-validation to get a better estimate of the model's performance.

In [None]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

In [None]:
dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

Given below is the list of target values. These are the median house values that we want to predict from the 8 input features, and they are continuous. We should use regression models to predict these values, but we will start with a simple classification model for the sake of simplicity. We just convert the values to integers (note that astype(int) drops the fractional part rather than rounding to the nearest integer) and use a classification model to predict the resulting integer class.

In [None]:
print("Original target values:", dataset.target)

dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)
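The conversion above relies on `astype(int)`, which drops the fractional part instead of rounding to the nearest integer. A minimal sketch on three of the printed target values (the variable names are illustrative and not part of the lab code) shows how this differs from `np.rint`:

```python
import numpy as np

values = np.array([4.526, 3.585, 0.923])  # three targets from the output above

truncated = values.astype(int)         # drops the fractional part
rounded = np.rint(values).astype(int)  # rounds to the nearest integer first

print("astype(int):", truncated)  # [4 3 0]
print("np.rint:    ", rounded)    # [5 4 1]
```

Either choice gives valid class labels; the lab simply uses the truncated version.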


The simplest model to use for classification is the K-Nearest Neighbors model. We will use this model to predict the house value with a K value of 1. We will also use the accuracy metric to evaluate the model.

In [None]:
def NN1(traindata, trainlabel, query):
    """
    This function takes in the training data, training labels and a query point
    and returns the predicted label for the query point using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    query: numpy array of shape (d,) where d is the number of features

    returns: the predicted label for the query point which is the label of the training data which is closest to the query point
    """
    diff = traindata - query  # broadcasting subtracts query from every row
    sq = diff * diff  # square the differences
    dist = sq.sum(axis=1)  # squared Euclidean distance to each training sample
    label = trainlabel[np.argmin(dist)]  # label of the closest training sample
    return label


def NN(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is the label of the training data which is closest to each test point
    """
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel
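The list comprehension in NN calls NN1 once per test sample. As a sketch of an alternative, assuming the test set is small enough for an array of shape (m, n, d) to fit in memory, all distances can be computed in a single broadcasted operation. The name `nn_vectorized` and the toy arrays are illustrative only.

```python
import numpy as np


def nn_vectorized(traindata, trainlabel, testdata):
    # diff[i, j, :] is test sample i minus training sample j
    diff = testdata[:, None, :] - traindata[None, :, :]
    dist = (diff ** 2).sum(axis=2)  # squared distances, shape (m, n)
    # for each test sample, take the label of the closest training sample
    return trainlabel[dist.argmin(axis=1)]


train_x = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
train_y = np.array([0, 0, 1])
test_x = np.array([[0.2, 0.1], [4.0, 4.5]])

pred = nn_vectorized(train_x, train_y, test_x)
print(pred)  # [0 1]
```

For the full housing data the intermediate array would be very large, which is one reason the loop-based version above is a reasonable choice here.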

We will also define a 'random classifier', which assigns a label to each sample uniformly at random.

In [None]:
def RandomClassifier(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the random classifier algorithm

    In reality, we don't need these arguments but we are passing them to keep the function signature consistent with other classifiers

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data, each drawn uniformly at random from the classes present in trainlabel
    """

    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel
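A classifier that guesses uniformly among k classes is expected to be correct about 1/k of the time, whatever the class balance of the true labels. The following sketch checks this on simulated labels; the class probabilities are made up for illustration and are not taken from the housing data.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
n_samples, n_classes = 100_000, 6

# imbalanced ground-truth labels (illustrative probabilities)
truth = rng_demo.choice(
    n_classes, size=n_samples, p=[0.05, 0.35, 0.30, 0.15, 0.10, 0.05]
)
# uniform random guesses, as in RandomClassifier
guess = rng_demo.integers(low=0, high=n_classes, size=n_samples)

random_accuracy = (truth == guess).mean()
print("Simulated accuracy:", random_accuracy)
print("1 / number of classes:", 1 / n_classes)
```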

We need a metric to evaluate the performance of the model. Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. We will use the accuracy metric to evaluate and compare the performance of the K-Nearest Neighbors model and the random classifier.

In [None]:
def Accuracy(gtlabel, predlabel):
    """
    This function takes in the ground-truth labels and predicted labels
    and returns the accuracy of the classifier

    gtlabel: numpy array of shape (n,) where n is the number of samples
    predlabel: numpy array of shape (n,) where n is the number of samples

    returns: the accuracy of the classifier which is the number of correct predictions divided by the total number of predictions
    """
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)
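As a quick worked example of the definition, with made-up labels: if three of four predictions match the ground truth, the accuracy is 3/4.

```python
import numpy as np

gt = np.array([0, 1, 2, 2])    # ground-truth labels (illustrative)
pred = np.array([0, 1, 1, 2])  # predictions: the third one is wrong

# the mean of the boolean matches equals correct / total
acc = np.mean(gt == pred)
print(acc)  # 0.75
```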

Let us make a function to split the dataset with the desired probability. We will use this function to split the dataset into training and testing sets. We will use the training set to train the model and the testing set to evaluate the model.

In [None]:
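def split(data, label, percent):
    """
    This function randomly splits the data and labels into two parts

    percent: float between 0 and 1, the probability that a sample goes to the first part,
    so the first part holds roughly (not exactly) this fraction of the samples

    returns: first part data, first part labels, second part data, second part labels
    """
    # generate a random number in [0, 1) for each sample
    rnd = rng.random(len(label))
    split1 = rnd < percent
    split2 = rnd >= percent

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label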

We will reserve roughly 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [None]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %
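The test portion above came out at 20.08% rather than exactly 20% because split assigns each sample to a part independently at random. If an exact size is wanted, one option is to shuffle the indices and cut them at a fixed position. The sketch below assumes a hypothetical helper named `split_exact`; it is not part of the lab code.

```python
import numpy as np

rng_demo = np.random.default_rng(0)


def split_exact(data, label, fraction):
    # number of samples in the first part, fixed in advance
    n_first = int(round(fraction * len(label)))
    order = rng_demo.permutation(len(label))  # shuffled sample indices
    first, second = order[:n_first], order[n_first:]
    return data[first], label[first], data[second], label[second]


toy_data = np.arange(20).reshape(10, 2)
toy_label = np.arange(10)
a_data, a_label, b_data, b_label = split_exact(toy_data, toy_label, 0.2)

print(len(a_label), len(b_label))  # 2 8
```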


## Experiments with splits

Let us reserve some of our train data as a validation set. Here, roughly 75% stays in the train set and the remaining 25% becomes the validation set.

In [None]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

What is the accuracy of our classifiers on the train dataset?

In [None]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm:", trainAccuracy*100, "%")

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using random classifier: ", trainAccuracy*100, "%")

Training accuracy using nearest neighbour algorithm: 100.0 %
Training accuracy using random classifier:  16.4375808538163 %


For nearest neighbour, the train accuracy is 100%: each training sample is its own nearest neighbour, so it gets its own label back (unless two identical samples carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is about 0.167 in our case since there are six classes (0 to 5). This is because the random classifier picks each label uniformly at random, so the probability of picking the correct one is 1/(number of classes). Let us predict the labels for our validation set and get the accuracy. This accuracy is a good estimate of the accuracy of our model on unseen data.

In [None]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")


valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.10852713178294 %
Validation accuracy using random classifier: 16.884689922480618 %


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier. Now let us try another random split and check the validation accuracy. We will see that the validation accuracy changes with the split. This is because the accuracy depends on which samples happen to land in the validation set, and the smaller the validation set, the stronger this dependence. We can get a better estimate of the accuracy by using cross-validation.

In [None]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.048257372654156 %


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.
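How close together the runs should be can be roughly estimated with the binomial standard error sqrt(p(1-p)/n), where p is the accuracy and n the number of validation samples. This is only a sketch: it assumes independent validation samples and ignores the extra variation from the changing train set. The value n = 4000 is an assumed approximation of the validation size, which the lab does not print.

```python
import math


def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy p measured on n independent samples."""
    return math.sqrt(p * (1 - p) / n)


se_large = accuracy_standard_error(0.34, 4000)  # assumed validation size
se_small = accuracy_standard_error(0.34, 40)    # a much smaller validation set

print(f"n=4000: +/- {se_large * 100:.2f} percentage points")
print(f"n=40:   +/- {se_small * 100:.2f} percentage points")
```

A validation set 100 times smaller gives a standard error 10 times larger, which is why very small validation sets produce unstable accuracies.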

Now let us compare it with the accuracy we get on the test dataset.

In [None]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?

Ans: The accuracy of the validation set is affected by changes in the validation set percentage in the following ways:

### Increasing the Percentage of the Validation Set:
1. **Reduced Training Data**: As the validation set percentage increases, the training set size decreases. This often results in less effective model training because the model has fewer examples to learn from.
2. **More Stable Estimate, Possibly Lower Accuracy**: A larger validation set gives a more stable estimate of validation accuracy, but the reduced training data can cause the model's performance to degrade, potentially lowering validation accuracy.
3. **Pessimistic Estimate**: If the validation set is so large that little data is left for training, the measured accuracy describes a weaker model than the one we could train on more data, so it tends to underestimate the achievable performance.

### Reducing the Percentage of the Validation Set:
1. **Increased Training Data**: With a smaller validation set, more data is available for training, which can improve the model's learning and thus its overall performance on unseen data.
2. **Higher Variability in Validation Accuracy**: Smaller validation sets can result in higher variability in accuracy estimates because the validation results are based on fewer examples. This can make the validation accuracy less reliable as a predictor of test set performance.
3. **Unrepresentative Validation Set**: With too small a validation set, there is a risk that it does not represent the diversity of the data, so overfitting to the training data may go undetected.

### Key Takeaways:
- **Moderate validation sizes (e.g., 10-30%)** often provide a good balance by allowing enough data for training and validation, leading to more reliable accuracy estimates.
- **Extreme percentages (very high or very low)** generally lead to less reliable accuracy predictions due to either too little training data or too few validation samples.

Thus, the validation set's accuracy tends to be most stable and informative when the validation percentage is balanced, avoiding the extremes that can lead to unreliable results.

2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?

Ans: The size of the train and validation sets significantly affects how well the validation set accuracy predicts the test set accuracy. Here’s how the sizes influence this relationship:

### 1. **Small Validation Set (Large Training Set):**
   - **Effect on Validation Accuracy Prediction:**
     - A small validation set can lead to high variance in accuracy estimates. This means the validation accuracy can fluctuate greatly due to the limited number of samples, making it a less reliable predictor of test set accuracy.
     - It might not capture the full complexity or variability of the data, leading to over-optimistic or under-representative performance metrics.
   - **Effect on Training:**
     - More data for training typically allows the model to learn more effectively, improving the overall performance on unseen data (test set).
   - **Predictability of Test Set Accuracy:**
     - While the model may perform well on the test set, the small validation set size makes it harder to accurately predict this performance due to unreliable validation accuracy.

### 2. **Large Validation Set (Small Training Set):**
   - **Effect on Validation Accuracy Prediction:**
     - A large validation set provides a more stable and reliable estimate of model performance, as it includes a greater portion of the data, reducing variance in accuracy.
     - However, if the training set becomes too small, the model may not learn effectively, leading to lower performance on both validation and test sets.
   - **Effect on Training:**
     - Less data for training can result in a poorly fitted model, which underperforms on both validation and test sets, reducing the generalizability of the model.
   - **Predictability of Test Set Accuracy:**
     - Despite having stable validation accuracy, the prediction of test set performance can be poor if the training set is insufficient, as the model’s overall capacity to learn is compromised.

### 3. **Balanced Validation and Training Sets:**
   - **Effect on Validation Accuracy Prediction:**
     - A balanced split (often 70-80% training and 20-30% validation) typically results in a reliable estimate of test set performance. The validation accuracy closely matches what can be expected on the test set because both sets are sufficiently representative.
   - **Effect on Training:**
     - Adequate training data helps the model learn well, which in turn ensures that validation accuracy is a good indicator of test accuracy.
   - **Predictability of Test Set Accuracy:**
     - Validation accuracy in this case tends to be a strong predictor of test set accuracy because the training is robust and the validation set is representative enough to capture performance trends.

### Key Insights:
- **Small Validation Sets:** Lead to high variability and unreliable predictions of test accuracy.
- **Large Validation Sets:** Provide stable accuracy estimates but can undermine the model's training effectiveness if too little data remains for training.
- **Balanced Sizes:** Provide the best predictive accuracy of the test set, as both training effectiveness and validation reliability are maintained.

Overall, a well-balanced train-validation split ensures that validation accuracy is a reliable predictor of test accuracy, helping you better assess the true performance of your model.

3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Ans: To balance the trade-off between having enough data for effective training and obtaining a reliable estimate of model performance through validation, a good percentage to reserve for the validation set is typically between **10% and 30%** of the total data. Here's why this range is generally effective:

### Why 10% to 30% is a Good Range:
1. **Adequate Training Data**:
   - Retaining 70% to 90% of the data for training allows the model to learn effectively, capturing the underlying patterns and complexities of the data.
   - This ensures the model is well-trained and reduces the risk of underfitting.

2. **Reliable Validation Performance**:
   - A validation set of 10% to 30% provides a sufficient number of samples to give a stable and reliable estimate of model performance.
   - It reduces the variability in validation accuracy, making it a good predictor of test set performance.

3. **Avoiding Extremes**:
   - **Too Small Validation Sets (<10%)**: Can lead to high variance in validation accuracy, making predictions unreliable.
   - **Too Large Validation Sets (>30%)**: Can significantly reduce the training data, impairing the model's ability to learn effectively, leading to poor generalization.

### Specific Recommendations:
- **20% Validation Set**: Commonly used as it balances the need for training data with reliable validation performance. It is a widely accepted default split in many machine learning tasks.
- **Adjust Based on Dataset Size**:
   - For **large datasets**, even 10% can be sufficient for validation, as it still provides a substantial number of samples.
   - For **smaller datasets**, leaning towards 20-30% for validation might help ensure the validation set is representative enough.

### Practical Considerations:
- **Cross-Validation**: Using techniques like k-fold cross-validation can also help make the most of your data, providing multiple validation accuracy estimates without permanently setting aside a large portion of the data.
- **Task-Specific Adjustments**: Consider the complexity of the task and the model. More complex tasks might benefit from slightly larger validation sets to better capture variability.

In conclusion, a **20% validation set** is generally a good starting point, with adjustments made based on the specific size and needs of your dataset and model.

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for splits, like 99.9% or 0.1%.

> Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.

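As a starting point for the exercise, here is a minimal sketch of comparing 1 and 3 neighbours with scikit-learn's `KNeighborsClassifier` and `accuracy_score`. It runs on synthetic two-class data (an assumption made so the example is self-contained); to answer the exercise, replace the synthetic arrays with alltraindata, alltrainlabel, testdata and testlabel.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

rng_demo = np.random.default_rng(0)

# synthetic stand-in data: two Gaussian blobs, 200 samples each
x = np.vstack(
    [rng_demo.normal(0, 1, (200, 2)), rng_demo.normal(2, 1, (200, 2))]
)
y = np.repeat([0, 1], 200)

# simple shuffled split: 300 train samples, 100 test samples
order = rng_demo.permutation(len(y))
train_idx, test_idx = order[:300], order[300:]

scores = {}
for n_neighbors in (1, 3):
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(x[train_idx], y[train_idx])
    scores[n_neighbors] = accuracy_score(y[test_idx], model.predict(x[test_idx]))

print(scores)
```

The numbers on this toy data say nothing about the housing data; only the pattern of the calls carries over.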
## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version (repeated random splitting), where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute. You can reduce the number of iterations to make it faster.

In [None]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    """
    This function takes in the data, labels, split percentage, number of iterations and classifier function
    and returns the average accuracy of the classifier

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float between 0 and 1 which is the fraction of data to be used for training
    iterations: int which is the number of iterations to run the classifier
    classifier: function which is the classifier function to be used

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies

In [None]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

Average validation accuracy: 33.58463539517022 %
Test accuracy: 34.91795366795367 %


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>
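To give a flavour of k-fold cross-validation before it is covered properly, the sketch below generates the train and validation indices for each fold with plain NumPy. The helper name `kfold_indices` is hypothetical. Unlike the repeated random splits above, every sample lands in a validation fold exactly once.

```python
import numpy as np


def kfold_indices(n_samples, n_folds, rng):
    """Yield (train indices, validation indices) for each of n_folds folds."""
    order = rng.permutation(n_samples)       # shuffle once
    folds = np.array_split(order, n_folds)   # n_folds nearly equal chunks
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx


rng_demo = np.random.default_rng(0)
val_counts = np.zeros(10, dtype=int)
for train_idx, val_idx in kfold_indices(10, 5, rng_demo):
    val_counts[val_idx] += 1  # count how often each sample is validated

print(val_counts)  # [1 1 1 1 1 1 1 1 1 1]
```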

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?

Ans: Yes, averaging the validation accuracy across multiple splits, such as in k-fold cross-validation, generally provides more consistent and reliable results compared to evaluating on a single split.

Here's why:

1. **Reduction of Variance**: Single splits can introduce high variance due to the random nature of the data split. Some splits might be more favorable or challenging depending on the distribution of data points. Averaging across multiple splits helps to smooth out these random fluctuations, providing a more stable estimate of model performance.

2. **Better Representation of the Entire Dataset**: In k-fold cross-validation, each data point is used for validation exactly once and for training \( k-1 \) times. This ensures that the performance estimate accounts for all data points, leading to a more comprehensive evaluation.

3. **Less Dependence on One Lucky Split**: Evaluating on multiple splits helps to ensure that the reported performance is not overly optimistic (or pessimistic) because of one particular train/validation split. It tests the model's ability to generalize across different subsets of data.

4. **Reliable Model Selection and Hyperparameter Tuning**: When tuning hyperparameters or comparing models, the averaged accuracy across multiple splits provides a more dependable metric for decision-making, as it reduces the likelihood of selecting models based on favorable but non-representative splits.

Overall, using an averaged accuracy across multiple splits is a standard practice in machine learning that enhances the robustness and credibility of the performance evaluation.

2. Does it give a more accurate estimate of test accuracy?

Ans: Averaging validation accuracy across multiple splits, like in k-fold cross-validation, generally provides a more accurate, lower-variance estimate of test accuracy compared to using a single split. Here's why:

1. **Reduction of Sampling Bias**: By averaging across multiple splits, the evaluation reduces the impact of how the data is initially divided. Single splits can suffer from sampling bias, where certain subsets may not represent the overall data distribution well. Multiple splits ensure that all data points are used both for training and validation, which helps to balance out any biases.

2. **Better Generalization Estimate**: Since each fold tests the model on a different subset of data, the average accuracy across all folds reflects the model’s ability to generalize across different data subsets. This is closer to how the model is expected to perform on truly unseen test data.

3. **Lower Variance**: A single split can result in a high variance estimate of accuracy due to the randomness of the split. Averaging over multiple splits mitigates this issue, providing a more stable and reliable estimate that is likely closer to the true test accuracy.

4. **Avoids Overfitting to Specific Splits**: A model evaluated on just one split might inadvertently overfit to peculiarities of that split’s training data, leading to an overestimate or underestimate of test accuracy. Cross-validation minimizes this risk by exposing the model to a broader range of training and validation scenarios.

However, it's important to note that while averaging validation accuracies from multiple splits improves the estimate of test accuracy, it still relies on the assumption that the validation folds are representative of the test data. If the validation folds and the test data differ significantly (e.g., due to distribution shift), even cross-validation estimates may be off.

In practice, cross-validation generally provides a more accurate estimate of test accuracy, making it a widely used method for model evaluation.

3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?

Ans: The number of iterations (or folds, in the context of cross-validation) has a significant effect on the accuracy and stability of the estimate of test accuracy. Here's how the number of iterations impacts the estimate:

### Effect of the Number of Iterations:

1. **Increased Stability and Reduced Variance**:
   - As the number of iterations increases, the estimate of test accuracy becomes more stable. This is because averaging over more iterations reduces the variance caused by random fluctuations in individual splits.
   - With more folds (e.g., 10-fold cross-validation vs. 5-fold), each fold contributes less to the overall variance, leading to a smoother and more reliable average.

2. **Better Utilization of Data**:
   - Higher iterations mean each data point is used more frequently in training and validation, providing a better overall representation of the data. This leads to a more accurate estimate because the model is evaluated on multiple, varied subsets of the data.
   - For example, in leave-one-out cross-validation (LOO-CV), where each iteration uses one data point for validation and the rest for training, every data point contributes directly to the performance estimate, albeit at a high computational cost.

3. **Diminishing Returns with Very High Iterations**:
   - While increasing the number of iterations generally improves the estimate, there are diminishing returns. For instance, going from 5-fold to 10-fold cross-validation might significantly improve the estimate, but going from 10-fold to 20-fold may yield only marginal improvements.
   - Extremely high fold counts, such as in LOO-CV, train on almost the entire dataset, so the estimate has very low bias, but the computational cost increases dramatically. The LOO-CV estimate can still have high variance, because each validation set contains a single data point and the training sets overlap almost completely, which makes the per-fold results highly correlated.

4. **Computational Cost**:
   - More iterations require more computational resources because the model needs to be trained and evaluated more times. The trade-off between accuracy and computation needs to be considered, especially with large datasets or complex models.
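The stabilising effect of more iterations, and its diminishing returns, can be illustrated with a small simulation. This is an idealised sketch: it treats the accuracy of each split as an independent binomial draw, whereas real splits of the same data overlap and are correlated, so the actual gain is smaller. The accuracy 0.34 and validation size 4000 are assumed round numbers.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
true_accuracy, n_val, repeats = 0.34, 4000, 2000


def spread_of_average(iterations):
    # simulate `repeats` experiments, each averaging `iterations` split accuracies
    acc = rng_demo.binomial(n_val, true_accuracy, size=(repeats, iterations)) / n_val
    return acc.mean(axis=1).std()


spread_1 = spread_of_average(1)
spread_10 = spread_of_average(10)
spread_100 = spread_of_average(100)

print("1 iteration:   ", spread_1)
print("10 iterations: ", spread_10)
print("100 iterations:", spread_100)
```

Under these assumptions the spread shrinks roughly with the square root of the number of iterations, so each tenfold increase in compute buys only about a threefold gain in stability.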

### Conclusion:
Increasing the number of iterations typically provides a more accurate and stable estimate of test accuracy, but with diminishing returns and higher computational costs. A balance is usually struck with common choices like 5-fold or 10-fold cross-validation, which are generally sufficient for most applications without being prohibitively expensive in terms of computation.

4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?

Ans: Increasing the number of iterations (or folds) in cross-validation can help mitigate issues with very small train or validation datasets, but it does not completely solve the problem. Here’s how it helps and its limitations:

### Benefits:

1. **Improved Utilization of Data**:
   - With more folds, each subset of the data is used both for training and validation, which maximizes the amount of data available for training and validation. This is particularly useful with small datasets, as it ensures that the model is tested on every data point and trained on as much data as possible.

2. **Reduced Bias**:
   - By using different subsets for validation across multiple folds, you reduce the bias that might come from having a single small validation set. This helps to provide a more reliable estimate of the model's performance.

### Limitations:

1. **Diminishing Returns**:
   - Increasing the number of folds (e.g., moving from 5-fold to 10-fold) helps, but there are limits to how much this can compensate for a very small dataset. With extremely small datasets, even high-fold cross-validation might not provide sufficiently reliable estimates of model performance.

2. **Variance Issues**:
   - While higher folds reduce variance by averaging over more splits, the variance might still be high if the dataset is too small. Each fold might have limited samples, which can lead to instability in performance estimates.

3. **Computational Cost**:
   - More folds mean more training runs. For a very small dataset each run is cheap, but the cost adds up for expensive models or when cross-validation is repeated many times, so it might not always be practical if computational resources are limited.

4. **Overfitting Concerns**:
   - With very small datasets, overfitting is a major concern. Even with high-fold cross-validation, models might still overfit to the idiosyncrasies of the small data points, which can affect the reliability of the performance estimate.

### Alternatives for Small Datasets:

- **Data Augmentation**: Generating more data points through augmentation techniques can help increase the effective size of your dataset.
  
- **Bootstrapping**: This technique involves creating multiple subsets from the original data with replacement, which can be useful for small datasets.

- **Transfer Learning**: Leveraging pre-trained models and fine-tuning them on your small dataset can help improve performance without requiring a large amount of data.

- **Regularization**: Applying regularization techniques can help prevent overfitting and improve model generalization when working with small datasets.
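Of the alternatives listed above, bootstrapping is the easiest to sketch with NumPy alone. The example below draws one bootstrap sample of indices with replacement and treats the samples that were never drawn (often called out-of-bag samples) as a validation set. All names are illustrative.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
n_samples = 1000

# draw n_samples indices with replacement: the bootstrap training set
boot_idx = rng_demo.integers(low=0, high=n_samples, size=n_samples)

# samples that were never drawn can serve as a validation set
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[boot_idx] = False
oob_idx = np.flatnonzero(oob_mask)

unique_fraction = len(np.unique(boot_idx)) / n_samples
print("Fraction of distinct samples in the bootstrap set:", unique_fraction)
print("Out-of-bag samples:", len(oob_idx))
```

On average about 63% of the samples appear in a bootstrap set, so roughly a third are left over for validation in each round.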

In summary, while increasing the number of iterations in cross-validation can help when dealing with small datasets, it does not fully address all challenges associated with small sample sizes. Combining cross-validation with other strategies like data augmentation or regularization can often yield better results.

> Exercise: How does the accuracy of the 3 nearest neighbour classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbour classifier.

Ans: To analyze how the accuracy of the 3-nearest neighbor (3-NN) classifier changes with the number of splits and how it is affected by the split size, and to compare it with the 1-nearest neighbor (1-NN) classifier, follow these steps:

### 1. **Setup the Experiment**

1. **Dataset**: Use a dataset of your choice. It should be sufficiently large to provide meaningful insights. If the dataset is small, consider using techniques like data augmentation or bootstrapping.

2. **Split Sizes**: Define different split sizes (e.g., 5-fold, 10-fold, etc.).

3. **Classifiers**: Implement both the 3-NN and 1-NN classifiers.

4. **Evaluation Metric**: Use accuracy as the primary evaluation metric.

### 2. **Perform Cross-Validation**

1. **Cross-Validation**:
   - Perform k-fold cross-validation for different values of \( k \) (e.g., 5, 10). For each \( k \), split the dataset into \( k \) folds.

2. **Train and Test**:
   - Train both the 3-NN and 1-NN classifiers on the training folds.
   - Evaluate the classifiers on the validation folds and record the accuracy.

3. **Repeat**:
   - Repeat the cross-validation process several times (e.g., 10 iterations) to ensure robustness and calculate the average accuracy.
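The procedure above can be sketched with scikit-learn's `cross_val_score`, which runs k-fold cross-validation and returns one accuracy per fold (for classifiers, an integer `cv` uses stratified folds). The data here is synthetic and only stands in for a real dataset, so the resulting numbers are not results for the housing data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng_demo = np.random.default_rng(0)

# synthetic stand-in data: two Gaussian blobs, 200 samples each
x = np.vstack(
    [rng_demo.normal(0, 1, (200, 2)), rng_demo.normal(2, 1, (200, 2))]
)
y = np.repeat([0, 1], 200)

results = {}
for n_neighbors in (1, 3):
    for n_folds in (5, 10):
        model = KNeighborsClassifier(n_neighbors=n_neighbors)
        fold_scores = cross_val_score(model, x, y, cv=n_folds)
        # mean accuracy over the folds for this (neighbours, folds) setting
        results[(n_neighbors, n_folds)] = fold_scores.mean()

for setting, mean_accuracy in results.items():
    print(setting, round(mean_accuracy, 3))
```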

### 3. **Analyze Results**

1. **Effect of Number of Splits**:
   - **Accuracy Trends**: Compare the average accuracy of the 3-NN and 1-NN classifiers across different numbers of splits (folds).
   - **Consistency**: Observe how the accuracy varies with the number of splits. Typically, more folds (e.g., 10-fold vs. 5-fold) should provide a more stable estimate of accuracy.

2. **Effect of Split Size**:
   - **Training and Validation**: Assess how the size of the training and validation sets (which changes with the number of splits) affects the accuracy of the classifiers.
   - **Generalization**: Fewer folds (larger validation splits) leave less data in each training set, which can lower the measured accuracy, while more folds (smaller validation splits) train on more data per fold but evaluate each fold on fewer samples, so individual fold accuracies are noisier.

3. **Comparison between 1-NN and 3-NN**:
   - **Accuracy**: Compare the accuracy of the 1-NN and 3-NN classifiers for the same number of folds.
   - **Overfitting**: Analyze how each classifier handles overfitting. The 1-NN classifier might overfit more due to its high sensitivity to individual training instances, whereas the 3-NN classifier is less sensitive and may generalize better.

### Expected Insights:

- **With more splits (higher \( k \))**: You should generally see more stable and reliable accuracy estimates. For 3-NN, accuracy might be slightly higher compared to 1-NN due to better generalization. With 1-NN, the accuracy may show more variability because it can overfit to the training data.

- **With larger split sizes (fewer folds)**: You might see a slightly more pessimistic accuracy estimate because each model is trained on less data, but each fold is evaluated on more samples.

- **Comparison**: Typically, 3-NN might perform better in terms of generalization compared to 1-N