# Cross Validation

Cross validation is a technique used in machine learning to evaluate the performance of a model on unseen data. It involves dividing the available data into multiple folds or subsets, using one of these folds as a validation set, and training the model on the remaining folds. This process is repeated multiple times, each time using a different fold as the validation set. Finally, the results from each validation step are averaged to produce a more robust estimate of the model’s performance.

The main purpose of cross validation is to detect overfitting, which occurs when a model fits the training data too closely, including its noise, and therefore performs poorly on new, unseen data. Cross validation does not prevent overfitting by itself; by evaluating the model on multiple validation sets, it provides a more realistic estimate of the model’s generalization performance, i.e., its ability to perform well on new, unseen data.

### Advantages and Disadvantages of Cross Validation: 

Advantages of Cross Validation:
* Detecting Overfitting: Cross validation helps to detect overfitting by providing a more robust estimate of the model’s performance on unseen data than the training score does.
* Model Selection: Cross validation can be used to compare different models and select the one that performs the best on average.
* Hyperparameter tuning: Cross validation can be used to optimize the hyperparameters of a model, such as the regularization parameter, by selecting the values that result in the best average performance across the validation folds.
* Data Efficient: Every data point is used for both training and validation (in different iterations), making cross validation more data-efficient than a single hold-out split.

Disadvantages of Cross Validation:
* Computationally Expensive: Cross validation can be computationally expensive, especially when the number of folds is large or when the model is complex and requires a long time to train.
* Time-Consuming: Cross validation can be time-consuming, especially when there are many hyperparameters to tune or when multiple models need to be compared.
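* Bias-Variance Tradeoff: The choice of the number of folds in cross validation can impact the bias-variance tradeoff of the performance estimate, i.e., too few folds leave less data for training in each iteration, which tends to make the estimate pessimistically biased, while too many folds (in the extreme, leave-one-out) reduce this bias but can increase the variance of the estimate and the computational cost.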
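
To make the computational cost concrete, the number of training runs can simply be counted. The sketch below assumes an exhaustive grid search in which every hyperparameter combination is evaluated with k-fold cross validation; the function name `count_model_fits` and the grid values are hypothetical and chosen only for illustration.

```python
from itertools import product

def count_model_fits(n_folds, param_grid):
    """Number of training runs for an exhaustive grid search with k-fold CV."""
    n_configs = len(list(product(*param_grid.values())))
    return n_folds * n_configs

# Hypothetical grid: 4 regularization strengths x 3 learning rates = 12 configurations.
grid = {
    "regularization": [0.01, 0.1, 1.0, 10.0],
    "learning_rate": [0.001, 0.01, 0.1],
}

fits_5_fold = count_model_fits(5, grid)
fits_10_fold = count_model_fits(10, grid)
print(fits_5_fold)   # 60
print(fits_10_fold)  # 120
```

Doubling the number of folds doubles the number of training runs, which is why both the fold count and the size of the search space matter for the total cost.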

### Types of Cross Validation:

1. Hold-Out Method:

The hold-out method, also known as simple validation, is the simplest form of cross-validation. In this method, the dataset is divided into two parts: a training set and a validation set. The model is trained on the training set and then evaluated on the validation set. The performance metric obtained on the validation set is used as an estimate of the model's performance. A common split is 70% of the data for training and 30% for validation, but the ratio can be adjusted based on the size of the dataset and specific requirements. The hold-out method is easy to implement and suitable when the dataset is large. However, it may not provide a robust estimate of the model's performance, as the estimate depends heavily on the specific data split. When the dataset is small, the estimate has high variance, so a single split can make the model look noticeably better or worse than it really is.
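
A minimal sketch of a 70/30 hold-out split, using only the Python standard library. The helper name `hold_out_split`, the fixed seed, and the toy data are assumptions made for illustration, not part of any particular library.

```python
import random

def hold_out_split(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into train/validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))  # ten toy samples standing in for real rows
train, valid = hold_out_split(data, train_fraction=0.7, seed=0)
print(len(train), len(valid))  # 7 3
```

Changing `seed` produces a different split, and on a small dataset the validation score can change noticeably with it, which is the weakness described above.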

<img src= "Downloads/Handout.jpg">

2. K-Fold Cross-Validation:

K-fold cross-validation is a commonly used technique for model evaluation. In this method, the dataset is divided into k equal-sized folds or subsets. The model is trained on k-1 folds and evaluated on the remaining fold. This process is repeated k times, with each fold serving as the validation set once. The performance metrics obtained from each fold are then averaged to get an overall estimate of the model's performance.

Detailed Analysis of K-Fold Cross-Validation:

* Dataset and Number of Folds: Consider a dataset with a sufficient number of samples. To perform k-fold cross-validation, we choose a value for k, which represents the number of folds or subsets to divide the dataset into. Common values for k are 5 or 10, but this can be adjusted based on the dataset size and computational constraints.

* Splitting into Folds: The dataset is divided into k folds of equal (or nearly equal) size. This can be done by random shuffling of the data points and then partitioning them into k subsets. The goal is to ensure that each fold is representative of the overall dataset.

* Iteration and Model Training: The cross-validation process proceeds with k iterations. In each iteration, one fold is used as the validation set, and the remaining k-1 folds are used as the training set. The model is trained on the training set, using the available data and any chosen training algorithm.

* Model Evaluation: After training the model, it is evaluated on the validation set (the fold that was left out). The performance of the model is measured using appropriate evaluation metrics such as accuracy, precision, recall, or F1 score. These metrics provide an assessment of the model's performance on the unseen data in the validation set.

* Repeat for All Folds: The training and evaluation steps are repeated for each fold, allowing every fold to serve as the validation set once. This ensures that every data point in the dataset is used for training in some iterations and for validation in exactly one.

* Average Performance Metric: Once all k iterations are completed, the performance metrics obtained from each fold are averaged to provide an overall estimate of the model's performance. This average performance metric represents the model's ability to generalize across different subsets of the data.

K-fold cross-validation helps in obtaining a more reliable estimate of a model's performance compared to simpler techniques like the hold-out method. Because every data point is used for validation exactly once, the estimate depends less on any single split, which lowers its variance and provides a better assessment of the model's generalization ability. By training and evaluating the model on different subsets of the data, it helps to detect issues like overfitting or underfitting.
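
The k-fold procedure can be sketched in plain Python as follows. The "model" here is a deliberately tiny nearest-centroid classifier on a single numeric feature, chosen as an assumption so that the example stays self-contained; the function names and the toy data are illustrative only.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_centroids(xs, ys):
    """Toy 'training': store the mean feature value of each class."""
    centroids = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def cross_val_accuracy(xs, ys, k=5, seed=0):
    """Return the per-fold accuracies and their mean."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for i, valid_idx in enumerate(folds):
        # Train on every fold except fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit_centroids([xs[j] for j in train_idx],
                              [ys[j] for j in train_idx])
        # Evaluate on the held-out fold i.
        correct = sum(predict(model, xs[j]) == ys[j] for j in valid_idx)
        scores.append(correct / len(valid_idx))
    return scores, sum(scores) / len(scores)

# Toy data: class 0 clusters near 0, class 1 clusters near 10.
xs = [0.1, 0.4, 0.2, 0.9, 0.5, 9.8, 10.2, 9.5, 10.1, 9.9]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
fold_scores, mean_score = cross_val_accuracy(xs, ys, k=5)
print(fold_scores, mean_score)  # [1.0, 1.0, 1.0, 1.0, 1.0] 1.0
```

The classes in this toy data are perfectly separable, so every fold scores 1.0; on real data the per-fold scores differ, and their spread is itself useful information. In practice a library splitter such as scikit-learn's `KFold` would normally replace the hand-written `k_fold_indices`.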

<img src= "Downloads/Cross.png">

3. Stratified K-Fold Cross-Validation:

Stratified k-fold cross-validation is an extension of the k-fold cross-validation method. It is particularly useful when dealing with imbalanced datasets or classification tasks with uneven class distributions. The goal of stratified k-fold is to preserve the class proportions in each fold, ensuring that each fold represents the overall class distribution.

Detailed Analysis of Stratified K-Fold Cross-Validation:

* Dataset and Class Distribution: Consider a dataset with multiple classes where the class distribution is uneven. For example, let's say we have three classes: Class A, Class B, and Class C. The dataset may have different numbers of instances for each class, and we want to ensure that the class proportions are preserved in the cross-validation process.

* Splitting into Folds: In stratified k-fold cross-validation, the dataset is divided into k equal-sized folds. To maintain class proportions, the split is performed in a way that each fold has a representative subset of instances from each class. This ensures that each fold is a good representation of the overall class distribution.

* Iteration and Model Training: The cross-validation process then proceeds with k iterations. In each iteration, one fold is used as the validation set, while the remaining k-1 folds are used as the training set. The model is trained on the training set, and its performance is evaluated on the validation set.

* Performance Metric Calculation: After each iteration, the performance metric of interest (e.g., accuracy, precision, recall) is calculated based on the predictions of the model on the validation set. This provides an evaluation of the model's performance for that specific fold.

* Average Performance Metric: Once all k iterations are completed, the performance metrics from each fold are averaged to obtain an overall estimate of the model's performance. This average is a reliable assessment of the model's ability to generalize, because every fold reflects the overall class distribution.

Stratified k-fold cross-validation helps in obtaining more accurate and representative performance estimates, especially when dealing with imbalanced datasets. By ensuring that each fold contains a proportional representation of each class, it reduces the risk of biased evaluations that might occur if a specific class is underrepresented in certain folds.
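
One simple way to build stratified folds is to group the sample indices by class and deal each class out to the folds in round-robin order. The sketch below takes that approach; it is an illustrative assumption rather than the exact algorithm any library uses, and the class counts (12 A, 6 B, 3 C) are made up for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # randomize which samples of the class go where
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds

# Imbalanced toy labels: 12 of class A, 6 of class B, 3 of class C.
labels = ["A"] * 12 + ["B"] * 6 + ["C"] * 3
folds = stratified_folds(labels, k=3)
for fold in folds:
    print(sorted(Counter(labels[i] for i in fold).items()))
# Each of the three lines prints: [('A', 4), ('B', 2), ('C', 1)]
```

Every fold keeps the 4:2:1 ratio of the full dataset. When class counts are not divisible by k, this simple version always gives the extra samples to the first folds; a more careful implementation would balance the fold sizes as well.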

<img src= "Downloads/Schematic-diagram-of-Stratified-K-fold-cross-validation.png">


4. Leave-One-Out Cross-Validation:

Leave-one-out cross-validation (LOOCV) is an extreme case of k-fold cross-validation, where k is equal to the total number of data points. In LOOCV, each data point is treated as a separate fold, and the model is trained on all data points except one, which serves as the validation set. This process is repeated for each data point, and the performance metrics are averaged.

LOOCV provides a nearly unbiased estimate of a model's performance, as each model is trained on all but one of the data points. However, the estimate can have high variance, and the method can be computationally expensive, especially for large datasets, as it requires training the model N times, once per data point. LOOCV is useful when the dataset is small, and computational resources permit its application.

Detailed Analysis of Leave One Out Cross Validation:

* Dataset and Data Points: Consider a dataset with a total of N data points. Each data point represents an individual sample or observation. LOOCV is particularly useful when dealing with small datasets or when computational resources permit its application.

* Iteration and Model Training: The LOOCV process starts with the first data point in the dataset. In each iteration, one data point is left out, and the model is trained on the remaining N-1 data points. This means that N-1 data points are used for training, and one data point is used for validation. The model is trained using the available data and any chosen training algorithm.

* Model Evaluation: After training the model, it is evaluated on the data point that was left out in the current iteration. The prediction is scored with a per-sample measure such as correctness (0 or 1) or squared error. Metrics such as precision, recall, or F1 score are not meaningful for a single point; they are computed after pooling the predictions from all iterations.

* Repeat for All Data Points: The training and evaluation steps are repeated for each data point in the dataset, leaving out one data point at a time and training the model on the remaining N-1 data points. This ensures that each data point serves as the validation set once.

* Average Performance Metric: Once all N iterations are completed, the results obtained from each iteration (validation point) are averaged to provide an overall estimate of the model's performance. This average performance metric represents the model's ability to generalize across the entire dataset.
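
The LOOCV loop can be sketched as below. A toy 1-nearest-neighbour classifier on a single numeric feature stands in for the model; that choice, the function names, and the seven data points are assumptions made to keep the example short.

```python
def nearest_neighbor_predict(train_xs, train_ys, x):
    """Toy 1-nearest-neighbour classifier on a single numeric feature."""
    nearest = min(range(len(train_xs)), key=lambda j: abs(train_xs[j] - x))
    return train_ys[nearest]

def loocv_accuracy(xs, ys):
    """Hold out each point once; return the fraction predicted correctly."""
    hits = 0
    for i in range(len(xs)):
        # Train on all points except point i.
        train_xs = xs[:i] + xs[i + 1:]
        train_ys = ys[:i] + ys[i + 1:]
        # Score the single held-out point as correct (1) or incorrect (0).
        hits += nearest_neighbor_predict(train_xs, train_ys, xs[i]) == ys[i]
    return hits / len(xs)

xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 3.1]
ys = [0, 0, 0, 1, 1, 1, 0]
accuracy = loocv_accuracy(xs, ys)
print(accuracy)  # 6 of the 7 held-out points are predicted correctly
```

The point at 3.1 is labelled 0 but its nearest neighbour (4.9) is labelled 1, so it is the one miss. Each iteration contributes only a 0 or a 1, and the estimate becomes meaningful only once all N results are averaged.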

<img src= "Downloads/LOOCV.png">


Each type of cross-validation method has its own advantages and considerations, and the choice depends on factors such as dataset size, class distribution, and available computational resources. It's important to choose an appropriate cross-validation technique that suits the specific requirements of your machine learning task.

### Difference between K-Fold and LOOCV

The main differences can be summarized as follows:

* Data Split: k-fold cross-validation divides the dataset into k equal-sized folds, whereas LOOCV treats each data point as a separate fold.
* Validation Set Size: In k-fold cross-validation, the validation set is relatively large, about N/k data points. In LOOCV, the validation set consists of a single data point in each iteration.
* Computational Complexity: k-fold cross-validation requires training and evaluating the model k times, making it computationally less expensive than LOOCV. LOOCV requires training and evaluating the model N times, which can be computationally expensive, especially for large datasets.
* Bias and Variance: LOOCV tends to have lower bias than k-fold cross-validation since each model is trained on nearly all the data (N-1 points). Its estimate can have higher variance, however, because the N training sets are almost identical. k-fold cross-validation with a moderate k such as 5 or 10 strikes a balance between bias and variance, providing a good estimate of the model's performance at a much lower computational cost.
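
The difference in split sizes and training runs can be checked with a few lines of arithmetic. The helper `fold_sizes` and the dataset size of 103 are hypothetical; the helper assumes folds are made as equal as possible, differing by at most one sample.

```python
def fold_sizes(n_samples, k):
    """Sizes of k folds that differ by at most one sample."""
    base, extra = divmod(n_samples, k)
    return [base + 1 if i < extra else base for i in range(k)]

n = 103
five_fold = fold_sizes(n, 5)
loocv = fold_sizes(n, n)  # LOOCV is k-fold with k equal to N

print(five_fold)                 # [21, 21, 21, 20, 20]
print(len(five_fold))            # 5 training runs
print(len(loocv), max(loocv))    # 103 training runs, each validating on 1 point
```

With 5 folds, each model trains on 82 or 83 points and validates on the remaining 20 or 21; with LOOCV, each model trains on 102 points and validates on one.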