
# Machine Learning terms and metrics

FMML Module 1, Lab 2

In this lab, we will walk through part of the ML pipeline using the California Housing dataset. It has 20640 samples, each with 8 attributes such as the median income and the median house age of a block group. The task is to predict the median house value of each district. We will use the scikit-learn library to load the data and perform some basic data preprocessing and model training. We will also show how to evaluate the model using some common metrics, split the data into training and testing sets, and use cross-validation to get a better estimate of the model's performance.

In [1]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

In [2]:
dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

Given below is the list of target values. These are the median house values that we want to predict from the 8 input features, and they are continuous. We should use regression models to predict these values, but we will start with a simple classification model for the sake of simplicity. We just need to convert the values to integers (the code below uses astype(int), which truncates rather than rounds to the nearest integer) and use a classification model to predict the resulting class.

In [3]:
print("Original target values:", dataset.target)

# astype(int) truncates toward zero, so 4.526 becomes 4 and 0.923 becomes 0
dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)


The simplest model to use for classification is the K-Nearest Neighbors model. We will use this model to predict the house value with a K value of 1. We will also use the accuracy metric to evaluate the model.

In [4]:
def NN1(traindata, trainlabel, query):
    """
    This function takes in the training data, training labels and a query point
    and returns the predicted label for the query point using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    query: numpy array of shape (d,) where d is the number of features

    returns: the predicted label for the query point which is the label of the training data which is closest to the query point
    """
    diff = (
        traindata - query
    )  # find the difference between features. Numpy automatically takes care of the size here
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # add up the squares
    label = trainlabel[np.argmin(dist)]
    return label


def NN(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is the label of the training data which is closest to each test point
    """
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel
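
To see the nearest neighbour rule on data small enough to check by hand, here is a minimal, self-contained sketch. The function `nearest_neighbour_predict` and the toy arrays are hypothetical names introduced for illustration; the function restates the logic of NN above in vectorised form and is not part of the lab code.

```python
import numpy as np


def nearest_neighbour_predict(traindata, trainlabel, testdata):
    # Pairwise squared Euclidean distances, shape (m, n):
    # one row per test point, one column per training point.
    diff = testdata[:, None, :] - traindata[None, :, :]
    dist = (diff * diff).sum(axis=2)
    # For each test point, copy the label of its closest training point.
    return trainlabel[np.argmin(dist, axis=1)]


# Hypothetical toy data: two points near the origin (class 0), one far away (class 1).
toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 1])
toy_test = np.array([[0.2, 0.1], [4.0, 4.5]])

toy_pred = nearest_neighbour_predict(toy_train, toy_label, toy_test)
print(toy_pred)  # [0 1]
```

The first query lies next to (0, 0) and receives label 0, while the second is closest to (5, 5) and receives label 1.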

We will also define a 'random classifier', which assigns a uniformly random label to each sample.

In [5]:
def RandomClassifier(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the random classifier algorithm

    In reality, we don't need these arguments but we are passing them to keep the function signature consistent with other classifiers

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is a random label from the training data
    """

    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel

We need a metric to evaluate the performance of the model. Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. We will use the accuracy metric to evaluate and compare the performance of the K-Nearest Neighbors model and the random classifier.

In [6]:
def Accuracy(gtlabel, predlabel):
    """
    This function takes in the ground-truth labels and predicted labels
    and returns the accuracy of the classifier

    gtlabel: numpy array of shape (n,) where n is the number of samples
    predlabel: numpy array of shape (n,) where n is the number of samples

    returns: the accuracy of the classifier which is the number of correct predictions divided by the total number of predictions
    """
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)

Let us make a function to split the dataset with the desired probability. We will use this function to split the dataset into training and testing sets. We will use the training set to train the model and the testing set to evaluate the model.

In [7]:
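def split(data, label, percent):
    """
    This function randomly splits the data and labels into two parts

    percent: float between 0 and 1, the probability that a sample lands in the first part
             (so the first part holds roughly, not exactly, this fraction of the samples)

    returns: split1data, split1label, split2data, split2label
    """
    # generate a random number in [0, 1) for each sample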
    rnd = rng.random(len(label))
    split1 = rnd < percent
    split2 = rnd >= percent

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label
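
Because each sample is assigned independently, the resulting split sizes only approximate the requested fraction. The sketch below illustrates this on made-up data; `random_split`, `demo_rng`, and the demo arrays are hypothetical names, and the function mirrors split above except that the random generator is passed in explicitly.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)


def random_split(data, label, fraction, rng):
    # Each sample goes to the first part with probability `fraction`.
    mask = rng.random(len(label)) < fraction
    return data[mask], label[mask], data[~mask], label[~mask]


# Made-up data: 1000 samples with 2 features and 3 classes.
demo_data = np.arange(2000, dtype=float).reshape(1000, 2)
demo_label = np.arange(1000) % 3

a_data, a_label, b_data, b_label = random_split(demo_data, demo_label, 0.2, demo_rng)
print("first part:", len(a_label), "second part:", len(b_label))
print("share of first part:", len(a_label) / len(demo_label))
```

The two parts always add up to the full dataset, but the first part will usually hold slightly more or fewer than 200 samples.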

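We will reserve about 20% of our dataset as the test set. Because each sample is assigned at random, the actual share is close to, but not exactly, 20%. We will not change this portion throughout our experiments.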

In [8]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set. Here about 75% of the remaining data is kept for training and about 25% is used for validation.

In [9]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

What is the accuracy of our classifiers on the train dataset?

In [10]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm:", trainAccuracy*100, "%")

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using random classifier: ", trainAccuracy*100, "%")

Training accuracy using nearest neighbour algorithm: 100.0 %
Training accuracy using random classifier:  16.4375808538163 %


For nearest neighbour, the train accuracy is always 1, because every training sample is its own nearest neighbour (the only exception would be duplicate points that carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case. This is because the random classifier assigns a uniformly random label to each sample, so the probability of assigning the correct label is 1/(number of classes). Let us predict the labels for our validation set and get the accuracy. This accuracy is a good estimate of the accuracy of our model on unseen data.
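
The 1/(number of classes) figure can be checked with a short calculation. The symbols are introduced here for illustration: assume $K$ classes, let $p_c$ be the true proportion of class $c$, and let the prediction $\hat{y}$ be drawn uniformly and independently of the true label $y$. Then

$$
\mathbb{E}[\text{Accuracy}] = \sum_{c=1}^{K} P(y = c)\, P(\hat{y} = c) = \sum_{c=1}^{K} p_c \cdot \frac{1}{K} = \frac{1}{K}
$$

The result does not depend on how balanced the classes are. With the six integer classes 0 to 5 used in this lab, it gives $1/6 \approx 0.1667$.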

In [11]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")


valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.10852713178294 %
Validation accuracy using random classifier: 16.884689922480618 %


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. However, the validation accuracy of nearest neighbour is still about twice that of the random classifier. Now let us try another random split and check the validation accuracy. We will see that the validation accuracy changes with the split, because it depends on which samples happen to land in the validation set. We can get a better estimate of the accuracy by using cross-validation.

In [12]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.048257372654156 %


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [13]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
A) Understanding the Validation Set
The validation set is a crucial part of the machine learning model development process. It's a subset of the data that's separate from both the training and testing sets. Its primary purpose is to:
 * Tune hyperparameters: The validation set helps you choose the best hyperparameters (e.g., learning rate, number of layers) for your model.
 * Estimate generalization error: It provides an unbiased estimate of how well your model will perform on unseen data.
Impact of Increasing Validation Set Size
 * More reliable estimate: A larger validation set generally leads to a more accurate estimate of the model's performance. This is because a larger sample size reduces the impact of random fluctuations in the data.
 * Reduced variance: The variance of the performance estimate decreases as the validation set size increases. This means that the estimate is more stable and less likely to be affected by outliers or noise in the data.
 * Less training data: With a fixed amount of data, every sample moved into the validation set is taken away from the training set. The nearest neighbour classifier then has fewer points to compare against, so its validation accuracy tends to drop. The random classifier does not learn from the data, so its accuracy stays near 1/(number of classes).
Impact of Decreasing Validation Set Size
 * Less reliable estimate: A smaller validation set can lead to a less accurate estimate of the model's performance. This is because the estimate is more susceptible to random fluctuations in the data.
 * Increased variance: The variance of the performance estimate increases as the validation set size decreases. This means that the estimate is less stable and more likely to be affected by outliers or noise in the data.
 * More training data: The samples that are not used for validation remain available for training, which tends to help the nearest neighbour classifier. Scoring fewer validation samples also takes less time.
Finding the Right Balance
The optimal size of the validation set depends on several factors, including:
 * The size of the overall dataset: A larger dataset can support a larger validation set.
 * The complexity of the model: More complex models may require larger validation sets to accurately estimate their performance.
 * Computational resources: The available computational resources will limit the maximum size of the validation set.
A common approach is to use a validation set that is 20-30% of the size of the overall dataset. However, this is just a rule of thumb, and the optimal size may vary depending on the specific problem and dataset.
In Summary
The size of the validation set is a critical factor in determining the accuracy and reliability of your model's performance estimate. Increasing the size of the validation set can improve accuracy and reduce variance, but it also increases computational cost. Conversely, decreasing the size of the validation set can reduce computational cost but may lead to less accurate and less reliable estimates. The key is to find a balance that provides a good estimate of the model's performance while keeping the computational cost manageable.
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
A) The interplay between the size of the training, validation, and test sets significantly influences the accuracy prediction on the test set using the validation set. Here's a breakdown:
Training Set Size
 * Larger Training Set: Generally leads to better model performance. A larger training set allows the model to learn more complex patterns and nuances in the data, resulting in improved generalization ability.
 * Smaller Training Set: Generally lowers model performance, because the model has not seen enough examples to capture the underlying patterns. For nearest neighbour, fewer training points mean that the closest neighbour tends to lie farther from the query and is less likely to share its label.
Validation Set Size
 * Larger Validation Set: Provides a more reliable estimate of the model's performance on unseen data. A larger validation set reduces the impact of random fluctuations and provides a more stable measure of generalization error.
 * Smaller Validation Set: Can lead to unreliable estimates of the model's performance. A small validation set may not be representative of the true distribution of the data, leading to inaccurate predictions of test set accuracy.
Test Set Size
 * Larger Test Set: Provides a more robust evaluation of the model's true performance. A larger test set reduces the impact of random fluctuations and provides a more accurate estimate of the model's generalization error.
 * Smaller Test Set: Can lead to unreliable estimates of the model's performance. A small test set may not be representative of the true distribution of the data, leading to inaccurate predictions of the model's generalization error.
Key Considerations:
 * Data Distribution: The distribution of data across the training, validation, and test sets should be representative of the overall data distribution. This ensures that the model is evaluated on data that is similar to what it will encounter in real-world scenarios.
 * Computational Resources: The size of the training, validation, and test sets can impact the computational resources required for training and evaluation. Larger datasets require more computational power and time.
 * Trade-offs: There is often a trade-off between the size of the training, validation, and test sets. For example, increasing the size of the training set may improve model performance but may require a larger validation set to maintain reliable estimates of generalization error.
Common Practices:
 * 80-20 Split: A common approach is to split the data into 80% for training and 20% for testing. The testing set can then be further split into validation and test sets, such as a 50-50 split.
 * Cross-Validation: Techniques like k-fold cross-validation can be used to improve the reliability of model evaluation, especially when the dataset is limited.
By carefully considering the size and distribution of the training, validation, and test sets, you can improve the accuracy of your model's performance predictions and build more robust and reliable machine learning models.
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?
A) Balancing Model Complexity and Validation Set Size
Understanding the Trade-off
When determining the optimal percentage for your validation set, you're essentially balancing two key factors:
 * Reliability of the Estimate: A larger validation set provides a more accurate estimate of your model's performance on unseen data. This is crucial, especially for complex models that are prone to overfitting.
 * Training Data Availability: A smaller validation set means more data for training, which can lead to a more robust model, particularly when dealing with limited datasets.
A Rule of Thumb
A common starting point is a 20% split for the validation set. This leaves 80% for training, providing a reasonable balance between these two factors.
Factors to Consider
However, the ideal percentage can vary depending on several factors:
 * Dataset Size: For very large datasets, a smaller validation set (e.g., 10%) might be sufficient. For smaller datasets, a larger validation set (e.g., 30%) might be necessary.
 * Model Complexity: More complex models, such as deep neural networks, generally benefit from larger validation sets so that overfitting can be detected reliably.
 * Computational Resources: If computational resources are limited, a smaller validation set might be preferable to reduce training time.
Experimentation and Cross-Validation
The best approach is often to experiment with different validation set sizes and observe the model's performance. Techniques like k-fold cross-validation can help you get a more robust estimate of your model's performance across different validation sets.
In Summary
While a 20% split is a good starting point, the optimal percentage for your validation set will depend on the specific characteristics of your dataset, model, and computational resources. Careful experimentation and consideration of these factors will help you find the best balance for your particular machine learning task.

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for splits, like 99.9% or 0.1%.
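
One way to organise such an experiment is to loop over several training fractions and record the validation accuracy for each. The sketch below does this on synthetic two-class data so that it runs quickly; `sweep_rng`, `nn_predict`, `points`, and `labels` are hypothetical stand-ins for the lab's rng, NN, alltraindata, and alltrainlabel.

```python
import numpy as np

sweep_rng = np.random.default_rng(seed=0)


def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest-neighbour prediction using squared Euclidean distance.
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[np.argmin((diff * diff).sum(axis=2), axis=1)]


# Synthetic stand-in data: two Gaussian blobs, one per class.
centres = np.array([[0.0, 0.0], [3.0, 3.0]])
labels = sweep_rng.integers(low=0, high=2, size=400)
points = centres[labels] + sweep_rng.normal(size=(400, 2))

train_fractions = [0.1, 0.25, 0.5, 0.75, 0.9]
val_accuracies = []
for fraction in train_fractions:
    # Same idea as split(): each sample is used for training with probability `fraction`.
    in_train = sweep_rng.random(len(labels)) < fraction
    pred = nn_predict(points[in_train], labels[in_train], points[~in_train])
    val_accuracies.append(float((pred == labels[~in_train]).mean()))

for fraction, acc in zip(train_fractions, val_accuracies):
    print(f"train fraction {fraction:.2f}: validation accuracy {acc:.3f}")
```

Passing `train_fractions` and `val_accuracies` to plt.plot gives the requested graph. For extreme fractions, one of the two parts can end up empty, which is worth guarding against before calling the classifier.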

> Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.
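
As a starting point for this exercise, here is a minimal sketch of the scikit-learn calls involved. It uses synthetic data rather than the housing dataset, and the variable names (`X_all`, `y_all`, `knn_accuracy`, and so on) are assumptions made for illustration. To complete the exercise, replace the synthetic arrays with alltraindata, alltrainlabel, testdata, and testlabel.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

sketch_rng = np.random.default_rng(seed=0)

# Synthetic stand-in data: three overlapping Gaussian blobs, one per class.
centres = np.array([[0.0, 0.0], [2.0, 2.0], [0.0, 3.0]])
y_all = sketch_rng.integers(low=0, high=3, size=600)
X_all = centres[y_all] + sketch_rng.normal(scale=1.0, size=(600, 2))

X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0
)

knn_accuracy = {}
for k in (1, 3):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    knn_accuracy[k] = accuracy_score(y_test, model.predict(X_test))
    print(f"{k}-NN test accuracy: {knn_accuracy[k] * 100:.2f} %")
```

Which value of k scores higher depends on the data, so the numbers printed here say nothing about the housing dataset.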

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as our estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute. You can reduce the number of splits to make it faster.

In [14]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    """
    This function takes in the data, labels, split percentage, number of iterations and classifier function
    and returns the average accuracy of the classifier

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float between 0 and 1 which is the fraction of data to be used for training
    iterations: int which is the number of iterations to run the classifier
    classifier: function which is the classifier function to be used

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies

In [15]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

Average validation accuracy: 33.58463539517022 %
Test accuracy: 34.91795366795367 %


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>
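
For comparison with the repeated random splits used above, k-fold cross-validation can be sketched as follows. This is a minimal illustration on synthetic data, not the version covered in the later module; `kfold_accuracy`, `nn_predict`, and the data arrays are hypothetical names.

```python
import numpy as np

kfold_rng = np.random.default_rng(seed=0)


def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest-neighbour prediction using squared Euclidean distance.
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[np.argmin((diff * diff).sum(axis=2), axis=1)]


def kfold_accuracy(data, label, n_folds, rng):
    # Shuffle the sample indices once, then cut them into n_folds disjoint folds.
    folds = np.array_split(rng.permutation(len(label)), n_folds)
    scores = []
    for i in range(n_folds):
        val_idx = folds[i]  # fold i is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        pred = nn_predict(data[train_idx], label[train_idx], data[val_idx])
        scores.append(float((pred == label[val_idx]).mean()))
    return scores


# Synthetic stand-in data: two Gaussian blobs, one per class.
centres = np.array([[0.0, 0.0], [3.0, 3.0]])
labels = kfold_rng.integers(low=0, high=2, size=300)
points = centres[labels] + kfold_rng.normal(size=(300, 2))

fold_scores = kfold_accuracy(points, labels, n_folds=5, rng=kfold_rng)
print("per-fold accuracy:", fold_scores)
print("mean accuracy:", sum(fold_scores) / len(fold_scores))
```

Unlike repeated random splits, every sample is used for validation exactly once, so no sample is scored twice and none is left out.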

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
A) Yes, averaging validation accuracy across multiple splits generally leads to more consistent and reliable results. Here's why:
 * Reduced Variance: A single train-test split can be susceptible to high variance. If your split happens to include particularly easy or difficult examples, it can skew the performance estimate. Multiple splits average out these fluctuations, providing a more stable measure of the model's true performance.
 * Better Generalization Estimate: Averaging across splits gives you a better idea of how your model will generalize to unseen data. It simulates real-world scenarios where you'll encounter data that wasn't part of your training set.
 * Improved Model Selection: If you're comparing different models or hyperparameter settings, averaging validation accuracy helps you make more informed decisions. It reduces the chance of selecting a model that performs well on just one particular split but poorly on others.
Common Techniques:
 * k-Fold Cross-Validation: The most popular method, where the data is divided into k folds. The model is trained and validated k times, with each fold used as the validation set once.
 * Stratified k-Fold Cross-Validation: Ensures that the proportion of classes in each fold is roughly the same as in the original dataset, which is crucial for imbalanced datasets.
 * Repeated Train-Test Splits: Perform multiple random train-test splits and average the results.
Key Considerations:
 * Computational Cost: Multiple splits increase the training and evaluation time.
 * Data Size: For very small datasets, multiple splits might not be feasible due to limited data.
In summary, averaging validation accuracy across multiple splits is a valuable technique for obtaining more robust and reliable performance estimates. It's a standard practice in machine learning and often leads to better model selection and improved generalization.
2. Does it give more accurate estimate of test accuracy?
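A) Partly. Averaging reduces the variance of the estimate, but it does not remove bias: each model in the average is trained on only about 75% of the training data, while the final test prediction uses all of it. In the run above, the average validation accuracy is 33.58% against a test accuracy of 34.92%. More generally, to determine if a method gives a more accurate estimate of test accuracy, you need to consider these factors: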
 * Data Characteristics:
   * Size and Distribution: The size and how the data is distributed (e.g., balanced vs. imbalanced classes) significantly impact the accuracy of estimation methods.
   * Noise and Outliers: The presence of noise or outliers can affect the robustness of different estimation methods.
 * Method Characteristics:
   * Bias-Variance Trade-off: Some methods might have high bias (systematic error) but low variance (consistency across different data splits), while others might have low bias but high variance.
   * Computational Cost: Some methods are computationally more expensive than others.
   * Assumptions: Different methods often make different assumptions about the data. If these assumptions are violated, the accuracy of the estimate can be compromised.
 * Evaluation Metrics:
   * Accuracy: Overall accuracy might not be the most informative metric, especially for imbalanced datasets. Consider using metrics like precision, recall, F1-score, AUC, etc.
Common Methods for Estimating Test Accuracy and Their Considerations:
 * Train-Test Split:
   * Pros: Simple and easy to implement.
   * Cons: Can be sensitive to the specific train-test split, leading to high variance in the accuracy estimate.
   * Improvement: Use k-fold cross-validation to reduce variance.
 * k-Fold Cross-Validation:
   * Pros: Reduces variance compared to a single train-test split. Provides a more robust estimate.
   * Cons: Can be computationally more expensive than a single split, especially for large datasets.
 * Leave-One-Out Cross-Validation (LOOCV):
   * Pros: Nearly unbiased, because each model is trained on all but one sample.
   * Cons: Extremely computationally expensive, especially for large datasets, and the resulting estimate can have high variance.
 * Bootstrap:
   * Pros: Can provide estimates of uncertainty in the accuracy estimate.
   * Cons: Can be computationally expensive.
 * Hold-out Validation:
   * Pros: Simple and quick.
   * Cons: High variance in the accuracy estimate.
To determine the most accurate method for a specific scenario, you can:
 * Experimentation: Try different methods on your dataset and compare their performance using appropriate metrics.
 * Theoretical Analysis: Analyze the properties of different methods and choose the one that best suits your data and the assumptions you are willing to make.
In summary, there is no single "best" method for estimating test accuracy. The most accurate method depends on the specific characteristics of your data and the computational resources available.
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
A) Here, the number of iterations is the number of random train/validation splits that AverageAccuracy averages over. In general, increasing it gives a more stable estimate, but there are some important points to consider:
Convergence:
 * Averaging as a Monte Carlo estimate: Each random split yields one noisy measurement of the validation accuracy. The average over many splits is a Monte Carlo estimate, and by the law of large numbers its run-to-run fluctuation shrinks as the number of splits grows.
 * What it converges to: The average converges to the expected validation accuracy of a model trained on the chosen fraction of the data. This is not necessarily the test accuracy, because the final model is trained on all of the training data.
Trade-offs:
 * Stability vs. computational cost: Each additional iteration costs one more round of training and prediction. At some point, the marginal gain in stability may not justify the extra computational time.
 * No overfitting from averaging: Unlike training iterations in algorithms such as gradient descent, additional splits do not change the model, so they cannot cause overfitting.
Key points:
 * Increasing the number of iterations reduces the variance of the averaged estimate, but not its bias.
 * Increasing the number of iterations also increases computational cost.
 * There is a point of diminishing returns where additional iterations provide little improvement.
In conclusion, more iterations give a more consistent estimate, but they cannot make it more accurate than the split size allows. The number of iterations should be chosen by weighing stability against computational cost.
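
The effect of the number of averaged splits can be measured directly: repeat the whole averaging procedure several times and look at how much the result varies. The sketch below does this on synthetic data; `average_accuracy` is a hypothetical, self-contained analogue of AverageAccuracy, and the data and names are assumptions made for illustration.

```python
import numpy as np

stab_rng = np.random.default_rng(seed=0)


def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest-neighbour prediction using squared Euclidean distance.
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[np.argmin((diff * diff).sum(axis=2), axis=1)]


def average_accuracy(data, label, fraction, iterations, rng):
    # Mean validation accuracy over `iterations` independent random splits.
    total = 0.0
    for _ in range(iterations):
        in_train = rng.random(len(label)) < fraction
        pred = nn_predict(data[in_train], label[in_train], data[~in_train])
        total += (pred == label[~in_train]).mean()
    return total / iterations


# Synthetic stand-in data: two overlapping Gaussian blobs, one per class.
centres = np.array([[0.0, 0.0], [2.0, 2.0]])
labels = stab_rng.integers(low=0, high=2, size=400)
points = centres[labels] + stab_rng.normal(size=(400, 2))

spread = {}
for iterations in (1, 5, 20):
    # Repeat the averaged estimate 10 times and measure how much it fluctuates.
    repeats = [
        average_accuracy(points, labels, 0.75, iterations, stab_rng) for _ in range(10)
    ]
    spread[iterations] = float(np.std(repeats))
    print(f"{iterations:2d} splits: std of averaged accuracy = {spread[iterations]:.4f}")
```

The standard deviation should shrink as the number of splits grows, while the mean stays roughly where it was. This is the sense in which more iterations make the estimate more consistent without changing what is being estimated.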
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?
A) Only partly. As in the previous question, iterations here means the number of random splits that are averaged.
Understanding the Trade-offs
While increasing iterations can help with a small validation set, it does not help with a small train set:
Potential Benefits:
 * Small validation set: Each split gives a noisy accuracy because only a few samples are scored. Averaging over many splits smooths out this noise, so more iterations can compensate for a small validation set.
Potential Drawbacks:
 * Small train set: Every model in the average is still trained on too little data, so each validation accuracy is systematically lower than what a model trained on all the data would achieve. Averaging more of them does not remove this bias.
 * Computational Cost: More iterations mean more rounds of training and prediction, which can be expensive, especially for large models or limited computational resources.
 * Diminishing Returns: After a certain point, additional iterations might not yield significant improvements, making the extra computation inefficient.
Strategies to Consider:
 * Data Augmentation: Artificially increase the dataset size by creating variations of existing data points (e.g., rotating, flipping, cropping images).
 * Regularization: Techniques like L1/L2 regularization or dropout can help prevent overfitting by adding noise or constraints to the training process.
 * Early Stopping: Monitor the model's performance on a validation set during training and stop training when performance starts to degrade, preventing overfitting.
 * Transfer Learning: If possible, leverage a pre-trained model on a large dataset and fine-tune it on your smaller dataset. This can provide a good starting point and improve generalization.
 * Ensemble Methods: Combine multiple models trained with different random seeds or hyperparameters to improve robustness and potentially reduce overfitting.
Key Considerations:
 * Dataset Size: The effectiveness of increasing iterations depends heavily on the dataset size. For very small datasets, it might be necessary to combine it with other strategies.
 * Model Complexity: Simpler models are less prone to overfitting on small datasets.
 * Validation Set: A proper validation set is crucial to monitor performance and prevent overfitting.
By carefully considering these factors and employing appropriate techniques, you can potentially mitigate the challenges of training with small datasets and achieve reasonable results.

> Exercise: How does the accuracy of the 3 nearest neighbour classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbour classifier.
A) Understanding the Impact of Splits and Split Size on K-Nearest Neighbor Accuracy
Influence of Number of Splits:
 * More Splits, More Stable Estimate: The number of splits does not change the classifier itself. As the number of splits increases, the averaged validation accuracy fluctuates less from run to run, for both 1-NN and 3-NN.
 * Diminishing Returns: There is a point of diminishing returns. Extremely high numbers of splits increase computational cost without a significant gain in stability.
Influence of Split Size:
 * Training Fraction: A larger training fraction gives the classifier more neighbours to choose from, which tends to raise accuracy, but it leaves fewer validation samples, so each individual estimate is noisier.
 * Computational Considerations: For nearest neighbour classifiers, the prediction cost grows with the number of training samples times the number of validation samples.
3-Nearest Neighbor vs. 1-Nearest Neighbor:
 * 3-NN Often More Robust: The 3-Nearest Neighbor classifier can often be more robust to noise and outliers compared to 1-NN. By considering multiple neighbors, it reduces the impact of individual noisy data points.
 * 1-NN Sensitive to Noise: 1-NN is highly sensitive to noise, as a single noisy neighbor can significantly impact the classification of a new data point.
 * Bias-Variance Trade-off: 3-NN generally has lower variance (less sensitivity to specific training data) but higher bias compared to 1-NN. This means it might make slightly less accurate predictions on the training data but generalize better to unseen data.
In Summary:
 * Both the number of splits and the split size can influence the accuracy of K-NN classifiers.
 * Generally, increasing the number of splits (up to a point) makes the averaged accuracy more stable, while the split size trades training data against validation data.
 * 3-NN often provides a more robust and generalizable solution compared to 1-NN, especially in noisy datasets.
Visualizing the Impact:
To better understand these concepts, consider visualizing the accuracy of both 1-NN and 3-NN classifiers across different numbers of splits and split sizes. This can help you empirically determine the optimal settings for your specific dataset.
Note: The optimal number of splits and split size can vary depending on the dataset size, complexity, and the specific characteristics of the K-NN implementation.