# What is Cross Validation?

Cross-validation is a resampling method for estimating how well a machine learning model will perform on data it has not seen. The dataset is partitioned into complementary subsets; the model is fit on one subset (the training set) and evaluated on the other (the validation or test set). The process is repeated several times with different partitions, and the results are combined, so that the estimate does not hinge on one particular split.

## Why Do We Need Cross Validation?

1. **Model Evaluation**: It provides a more reliable estimate of a model's performance than a single train/test split, especially on small datasets.
   
2. **Model Selection**: Helps in selecting the best model among different types by comparing their performance.

3. **Parameter Tuning**: Assists in tuning hyperparameters of a model to achieve the best performance.

4. **Detect Overfitting**: It helps to identify if the model is overfitting to the training data and performs poorly on unseen data.

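5. **Bias-Variance Tradeoff**: Comparing training and validation scores across folds helps diagnose whether a model suffers from high bias (underfitting) or high variance (overfitting), which guides the choice of a model that generalizes better.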
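
As a rough illustration of the model selection point, the sketch below compares two candidate classifiers with 5-fold cross-validation. The choice of the Iris dataset, the two models, and the dictionary names `candidates` and `mean_scores` are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical candidate models, chosen only for illustration
candidates = {
    "logistic_regression": LogisticRegression(max_iter=200),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

mean_scores = {}
for name, model in candidates.items():
    # Every candidate is scored with the same 5-fold procedure
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")

# Pick the candidate with the highest mean validation accuracy
best_name = max(mean_scores, key=mean_scores.get)
print(f"Best candidate: {best_name}")
```

Looking at the standard deviation alongside the mean helps judge whether a difference between candidates is larger than the fold-to-fold noise.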

## When Do We Need Cross Validation?

1. **Small Datasets**: When the dataset is not large enough, cross-validation maximizes the amount of data used for training and testing.
   
2. **Model Comparison**: When you need to compare multiple machine learning models to find the best one for your data.

3. **Hyperparameter Tuning**: During the process of finding the optimal hyperparameters for a model.

4. **Generalization**: When you want to ensure that your model generalizes well to unseen data.
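
For the hyperparameter tuning case, one possible sketch uses scikit-learn's `GridSearchCV`, which runs cross-validation for every parameter combination in a grid. The k-nearest-neighbors model and the grid of `n_neighbors` values are assumptions picked for brevity, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical search grid: five values for a single hyperparameter
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# Each of the 5 settings is evaluated with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(f"Best parameters: {search.best_params_}")
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

The score reported for the winning setting is optimistically biased, because the same folds were used to choose it; a separate held-out test set gives a cleaner final estimate.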

## When Do We Not Need Cross Validation?

1. **Large Datasets**: When the dataset is large enough that a single train/test split can give a reliable estimate of model performance.
   
2. **Real-Time Predictions**: In real-time or streaming applications where splitting data into folds and training multiple models is impractical due to time constraints.

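3. **Data Leakage**: When standard random folds would leak information between training and validation sets, for example with temporally ordered or grouped data. In such cases a splitter that respects the structure (such as scikit-learn's `TimeSeriesSplit` or `GroupKFold`) is needed instead of plain K-Fold.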

4. **Simplistic Models**: When using very simple models or in cases where the problem is straightforward and does not require extensive validation.
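
To make the temporal-dependency concern concrete, here is a small sketch using scikit-learn's `TimeSeriesSplit`, one common alternative when rows are ordered in time. The eight-sample toy array is an assumption for illustration; the point is only that every training index precedes every validation index.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: eight observations assumed to be ordered in time
X = np.arange(8).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)

splits = []
for train_idx, val_idx in tscv.split(X):
    splits.append((train_idx.tolist(), val_idx.tolist()))
    print(f"train={train_idx.tolist()} validate={val_idx.tolist()}")

# The model never trains on observations that come after the validation window
no_future_leak = all(max(train) < min(val) for train, val in splits)
print(f"Training data always precedes validation data: {no_future_leak}")
```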

# Types of Cross Validation in Scikit-Learn
Cross-validation is a key technique for assessing the performance and robustness of a machine learning model. **Scikit-learn** offers several methods for cross-validation, each suited to different types of data and use cases. Here is a detailed overview of the various cross-validation techniques provided by scikit-learn:

## 1. K-Fold Cross Validation

K-Fold Cross Validation is a widely used method for evaluating the performance of a machine learning model. It divides the dataset into `k` roughly equal-sized subsets or "folds." The model is trained `k` times, each time using a different fold as the validation set and the remaining folds as the training set. The final performance metric is the average of the metrics calculated for each fold. This makes the estimate less dependent on any one partitioning of the dataset.

### How K-Fold Cross Validation Works
1. **Divide the Data**: Split the data into `k` roughly equal-sized folds.
2. **Train and Validate**: For each fold:
   - Train the model on `k-1` folds.
   - Validate the model on the remaining fold.
3. **Compute Metrics**: Calculate performance metrics (e.g., accuracy, precision, recall) for each fold.
4. **Average Metrics**: Average the performance metrics across all folds to obtain a final evaluation.

![K-Fold Cross Validation source: Towards Data Science](https://i.ibb.co/H7mLgkk/k-fold-cv.png)
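
The four steps above can be written out by hand, which shows what happens inside a cross-validation helper. This is a minimal sketch assuming the Iris dataset and a logistic regression model; the variable names `fold_scores` and `mean_score` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Step 1: divide the data into k folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Step 2: train a fresh model on k-1 folds, validate on the held-out fold
    model = LogisticRegression(max_iter=200)
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[val_idx])

    # Step 3: compute the metric for this fold
    accuracy = accuracy_score(y[val_idx], predictions)
    fold_scores.append(accuracy)
    print(f"Fold {fold}: accuracy={accuracy:.3f}")

# Step 4: average the per-fold metrics
mean_score = float(np.mean(fold_scores))
print(f"Mean accuracy: {mean_score:.3f}")
```

A new model instance is created in every iteration so that nothing learned from one fold carries over to the next.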

### When to Use K-Fold Cross Validation
- **Small Datasets**: When you have limited data, K-Fold Cross Validation allows you to make the most of your data by ensuring each data point is used for both training and validation.
- **Model Selection**: When selecting the best model from a set of candidate models, K-Fold Cross Validation provides a reliable estimate of each model’s performance.
- **Hyperparameter Tuning**: It helps in tuning hyperparameters by providing a robust evaluation metric.
- **Avoiding Overfitting**: It helps in detecting overfitting by ensuring that the model performs well on different subsets of the data.

### Using K-Fold Cross Validation in Scikit-Learn

Scikit-learn provides a convenient way to implement K-Fold Cross Validation using the `KFold` class and `cross_val_score` function.

### Key Parameters
- `n_splits`: Number of folds.
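- `shuffle`: Whether to shuffle the data before splitting into folds (default `False`, which yields contiguous folds in the original row order).
- `random_state`: Seed for the random number generator, for reproducibility. It only has an effect when `shuffle=True`.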
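
The effect of `shuffle` and `random_state` is easiest to see on a tiny array. The sketch below assumes ten toy samples and a helper named `validation_folds`, both introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy samples; the values are irrelevant

def validation_folds(splitter):
    """Return the validation indices of each fold as plain lists."""
    return [val_idx.tolist() for _, val_idx in splitter.split(X)]

# Without shuffling, folds are contiguous blocks in row order
unshuffled = validation_folds(KFold(n_splits=5))

# With shuffling, the same seed reproduces the same folds
shuffled = validation_folds(KFold(n_splits=5, shuffle=True, random_state=42))
shuffled_again = validation_folds(KFold(n_splits=5, shuffle=True, random_state=42))

print(f"Unshuffled: {unshuffled}")
print(f"Shuffled:   {shuffled}")
print(f"Reproducible with the same seed: {shuffled == shuffled_again}")
```

Shuffling matters when the rows are sorted, for example by class label, because contiguous folds would then contain only one or two classes each.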

### Advantages
- **Robust Evaluation**: Provides a more reliable estimate of model performance.
- **Efficient Use of Data**: Makes efficient use of limited data by utilizing every observation for both training and validation.
- **Reduced Variance**: Reduces the variance of performance metrics compared to a single train-test split.

### Disadvantages
- **Computational Cost**: More computationally expensive than a single train-test split, especially for large datasets.
- **No Single Final Model**: Produces `k` fitted models, none of which is the final one; after evaluation, the chosen model is usually refit on the full training data.

By using K-Fold Cross Validation, you can ensure a thorough and reliable evaluation of your machine learning models, leading to better model selection and improved generalization to unseen data.
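
To get a feel for the computational cost, one option is scikit-learn's `cross_validate`, which reports fit and scoring times alongside the scores. The sketch below assumes the Iris dataset and logistic regression, and it ends by fitting one model on all the data, since the per-fold models are only used for evaluation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# cross_validate returns a dict with timing information and test scores
results = cross_validate(LogisticRegression(max_iter=200), X, y, cv=kf, scoring="accuracy")

total_fit_time = results["fit_time"].sum()
print(f"Per-fold accuracy: {results['test_score']}")
print(f"Total time spent fitting 5 models: {total_fit_time:.4f} s")

# After evaluation, fit a single model on the full dataset for actual use
final_model = LogisticRegression(max_iter=200).fit(X, y)
```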

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Initialize model
model = LogisticRegression(max_iter=200)

# Define K-Fold Cross Validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform Cross Validation
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

# Print results
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean()}")
```

## 2. Stratified K-Fold Cross Validation

Stratified K-Fold Cross Validation is a variation of K-Fold Cross Validation that ensures each fold is representative of the entire dataset by maintaining the same proportion of each class label. This method is particularly useful when dealing with imbalanced datasets, where some classes are underrepresented.

### How Stratified K-Fold Cross Validation Works
1. **Divide the Data**: Split the data into `k` roughly equal-sized folds while preserving the class distribution in each fold.
2. **Train and Validate**: For each fold:
   - Train the model on `k-1` folds.
   - Validate the model on the remaining fold.
3. **Compute Metrics**: Calculate performance metrics (e.g., accuracy, precision, recall) for each fold.
4. **Average Metrics**: Average the performance metrics across all folds to obtain a final evaluation.
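
The difference from plain K-Fold shows up clearly on imbalanced labels. The sketch below assumes a toy label vector with 90 samples of class 0 followed by 10 of class 1, and counts how many minority samples land in each validation fold; the helper `minority_counts` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels, sorted by class: 90 majority, then 10 minority
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

def minority_counts(splitter):
    """Count minority-class samples in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

plain = minority_counts(KFold(n_splits=5))
stratified = minority_counts(StratifiedKFold(n_splits=5))

print(f"KFold minority samples per fold:           {plain}")
print(f"StratifiedKFold minority samples per fold: {stratified}")
```

With unshuffled K-Fold, four of the five validation folds contain no minority samples at all, while the stratified splitter places two in every fold.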

### When to Use Stratified K-Fold Cross Validation
- **Imbalanced Datasets**: When dealing with imbalanced datasets, Stratified K-Fold Cross Validation ensures that each fold is representative of the overall class distribution, providing a more reliable evaluation.
- **Classification Problems**: Especially useful in classification problems where maintaining the class distribution in training and validation sets is crucial for model performance.
- **Model Selection and Hyperparameter Tuning**: Helps in selecting the best model and tuning hyperparameters by providing a robust evaluation metric that accounts for class imbalance.

### Using Stratified K-Fold Cross Validation in Scikit-Learn

Scikit-learn provides a convenient way to implement Stratified K-Fold Cross Validation using the `StratifiedKFold` class and `cross_val_score` function.

### Key Parameters
- `n_splits`: Number of folds.
- `shuffle`: Whether to shuffle the samples within each class before splitting into folds (default `False`).
- `random_state`: Seed for the random number generator, for reproducibility. It only has an effect when `shuffle=True`.
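
A related detail worth knowing: when `cv` is given as an integer and the estimator is a classifier, scikit-learn's cross-validation helpers use stratified folds by default. The sketch below checks this with the `check_cv` utility, assuming a small toy label vector.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # toy binary labels

# For a classifier with an integer cv, the resolved splitter is stratified
classifier_cv = check_cv(5, y, classifier=True)

# For a non-classifier (e.g. a regressor), plain K-Fold is used
regressor_cv = check_cv(5, y, classifier=False)

print(f"Classifier splitter: {type(classifier_cv).__name__}")
print(f"Regressor splitter:  {type(regressor_cv).__name__}")
```

Passing an explicit `StratifiedKFold` instance is still useful when shuffling or a fixed seed is wanted, since the integer shortcut does not shuffle.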

### Advantages
- **Class Distribution Preservation**: Maintains the proportion of each class in all folds, leading to more reliable performance metrics.
- **Effective for Imbalanced Data**: Provides a better evaluation for models trained on imbalanced datasets.
- **Reduced Bias**: Reduces bias in performance metrics by ensuring each fold is representative of the entire dataset.

### Disadvantages
- **Computational Cost**: More computationally expensive than a single train-test split, especially for large datasets.
- **Classification Only**: Requires discrete class labels, so it does not apply directly to regression targets, and every class needs at least `n_splits` samples to be represented in each fold. In scikit-learn the usage is otherwise the same as for standard K-Fold.

Stratified K-Fold Cross Validation is a powerful tool for evaluating machine learning models, particularly when dealing with imbalanced datasets. It ensures that each fold is representative of the overall class distribution, leading to more reliable and robust performance metrics.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Initialize model
model = LogisticRegression(max_iter=200)

# Define Stratified K-Fold Cross Validator
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform Cross Validation
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')

# Print results
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean()}")
```

## 3. Leave-One-Out Cross Validation (LOOCV)

Leave-One-Out Cross Validation (LOOCV) is an extreme case of K-Fold Cross Validation where `k` equals the number of data points in the dataset. In LOOCV, each data point is used once as a validation set while the remaining data points form the training set. This process is repeated for each data point, and the performance metric is averaged across all iterations.

### How Leave-One-Out Cross Validation Works
1. **Divide the Data**: Treat each data point as a single fold.
2. **Train and Validate**: For each data point:
   - Train the model on the remaining `n-1` data points.
   - Validate the model on the single data point.
3. **Compute Metrics**: Calculate performance metrics (e.g., accuracy, precision, recall) for each iteration.
4. **Average Metrics**: Average the performance metrics across all iterations to obtain a final evaluation.
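
The splitting step can be inspected directly on a tiny array. This sketch assumes six toy samples and checks that `LeaveOneOut` produces the same splits as an unshuffled `KFold` with `n_splits` equal to the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(6).reshape(-1, 1)  # six toy samples

loo = LeaveOneOut()
loo_splits = [(train.tolist(), val.tolist()) for train, val in loo.split(X)]

# K-Fold with k = n yields one sample per validation fold as well
kfold_splits = [
    (train.tolist(), val.tolist())
    for train, val in KFold(n_splits=len(X)).split(X)
]

for train, val in loo_splits:
    print(f"train={train} validate={val}")

print(f"Number of splits: {loo.get_n_splits(X)}")
print(f"Same as KFold(n_splits=n): {loo_splits == kfold_splits}")
```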

### When to Use Leave-One-Out Cross Validation
- **Small Datasets**: Ideal for very small datasets where splitting the data into larger folds is not feasible.
- **Low-Bias Estimate**: Useful when you want a nearly unbiased estimate of model performance, since each model is trained on almost all of the data; the estimate itself may have high variance.
- **Model Evaluation**: Provides a thorough evaluation as each data point is used for validation exactly once.

### Using Leave-One-Out Cross Validation in Scikit-Learn

Scikit-learn provides a convenient way to implement LOOCV using the `LeaveOneOut` class and `cross_val_score` function.

### Key Parameters
- `LeaveOneOut()` takes no constructor parameters; the number of splits always equals the number of data points. Pass the instance as the `cv` argument of `cross_val_score`.

### Advantages
- **Nearly Unbiased Estimate**: Each model is trained on `n-1` data points, almost the full dataset, so the performance estimate is close to what a model trained on all the data would achieve.
- **Maximal Data Utilization**: Ensures maximal utilization of data for training since `n-1` data points are used for training in each iteration.

### Disadvantages
- **Computationally Intensive**: Very computationally expensive, especially for large datasets, as it requires training the model `n` times.
- **High Variance**: The performance metric can have high variance since each validation set contains only one data point.

Leave-One-Out Cross Validation is a thorough and unbiased method for model evaluation, particularly useful for small datasets. However, its high computational cost makes it impractical for larger datasets.

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Initialize model
model = LogisticRegression(max_iter=200)

# Define Leave-One-Out Cross Validator
loo = LeaveOneOut()

# Perform Cross Validation (one model fit per sample, 150 fits for Iris)
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')

# Print results (each per-split score is 0 or 1, since one sample is validated at a time)
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean()}")
```

## 4. Leave-P-Out Cross Validation (LPOCV)

Leave-P-Out Cross Validation (LPOCV) is a generalization of Leave-One-Out Cross Validation (LOOCV). In LPOCV, `p` data points are left out for validation, and the model is trained on the remaining `n-p` data points. This process is repeated for all possible combinations of `p` data points. LPOCV provides a comprehensive evaluation of the model's performance but is computationally expensive, especially for large datasets or large values of `p`.

### How Leave-P-Out Cross Validation Works
1. **Divide the Data**: Generate all possible combinations of `p` data points to be used as validation sets.
2. **Train and Validate**: For each combination:
   - Train the model on the remaining `n-p` data points.
   - Validate the model on the `p` data points.
3. **Compute Metrics**: Calculate performance metrics (e.g., accuracy, precision, recall) for each iteration.
4. **Average Metrics**: Average the performance metrics across all iterations to obtain a final evaluation.
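
The number of iterations is the binomial coefficient C(n, p), which grows quickly. The sketch below assumes five toy samples with `p=2`, enumerates the validation sets, and then uses `math.comb` to show the count for a 150-sample dataset such as Iris.

```python
from itertools import combinations
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(5).reshape(-1, 1)  # five toy samples

lpo = LeavePOut(p=2)
val_sets = [tuple(val.tolist()) for _, val in lpo.split(X)]
n_splits = lpo.get_n_splits(X)

print(f"Validation sets: {val_sets}")
print(f"Number of splits: {n_splits} (C(5, 2) = {comb(5, 2)})")

# The same formula for a 150-sample dataset and several values of p
for p in (1, 2, 3):
    print(f"n=150, p={p}: {comb(150, p)} model fits")
```

Every pair of samples appears exactly once as a validation set, which is why the cost becomes prohibitive well before the dataset gets large.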

### When to Use Leave-P-Out Cross Validation
- **Small to Medium Datasets**: Feasible for smaller datasets where the number of combinations is manageable.
- **Thorough Evaluation**: Provides a thorough and exhaustive evaluation of model performance by considering all possible validation sets of size `p`.
- **Model Evaluation**: Useful for understanding model performance across different subsets of the data.

### Using Leave-P-Out Cross Validation in Scikit-Learn

Scikit-learn provides a way to implement LPOCV using the `LeavePOut` class and `cross_val_score` function.

### Key Parameters
- `p`: Number of data points to leave out for validation.
- The number of splits is not a parameter: it is the number of ways to choose `p` data points from `n`, that is `C(n, p)`. Pass the `LeavePOut` instance as the `cv` argument of `cross_val_score`.

### Advantages
- **Comprehensive Evaluation**: Provides a thorough and exhaustive evaluation of the model by considering all possible subsets of size `p`.
- **Detailed Insight**: Offers detailed insight into model performance across different combinations of data points.

### Disadvantages
- **Computationally Intensive**: Very computationally expensive, especially for large datasets and larger values of `p`, due to the combinatorial explosion of possible subsets.
- **Overlapping Validation Sets**: For `p > 1` the validation sets overlap, unlike in K-Fold or LOOCV, so the individual scores are not independent of one another.

Leave-P-Out Cross Validation is a powerful tool for thorough model evaluation, especially useful for smaller datasets where an exhaustive assessment is feasible. Its computational intensity makes it less suitable for large datasets.

```python
from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Initialize model
model = LogisticRegression(max_iter=200)

# Define Leave-P-Out Cross Validator
lpo = LeavePOut(p=2)

# Perform Cross Validation
# Note: with 150 samples and p=2 this fits the model C(150, 2) = 11175 times, so it can take a while
scores = cross_val_score(model, X, y, cv=lpo, scoring='accuracy')

# Print results
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean()}")
```