# What is Cross Validation?

Cross-validation is a resampling method used to estimate how well a machine learning model performs on unseen data. It partitions a dataset into complementary subsets, fits the model on one subset (the training set), and evaluates it on the other (the validation or test set). The process is repeated over several different partitions (folds), and the results are combined so that the estimate does not depend on a single, possibly lucky, split.

## Why Do We Need Cross Validation?

1. **Model Evaluation**: It provides a more reliable estimate of a model's performance than a single train/test split, especially on small datasets.
   
2. **Model Selection**: Helps in selecting the best model among different types by comparing their performance.

3. **Parameter Tuning**: Assists in tuning hyperparameters of a model to achieve the best performance.

4. **Detect Overfitting**: It helps to identify if the model is overfitting to the training data and performs poorly on unseen data.

5. **Bias-Variance Tradeoff**: Comparing training and validation scores across folds shows whether a model underfits (high bias) or overfits (high variance), which helps in choosing a model that generalizes better.
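
As a minimal sketch of the overfitting check described above, `cross_validate` can report the score on the training folds alongside the validation score. The unpruned decision tree is only an illustrative choice of a model that tends to memorize its training data; it is not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unpruned decision tree can memorize its training folds (illustrative choice).
tree = DecisionTreeClassifier(random_state=0)

# return_train_score=True also reports the score on each training fold.
results = cross_validate(tree, X, y, cv=5, scoring="accuracy", return_train_score=True)

train_mean = results["train_score"].mean()
test_mean = results["test_score"].mean()
print(f"Mean training accuracy:   {train_mean:.3f}")
print(f"Mean validation accuracy: {test_mean:.3f}")
print(f"Gap: {train_mean - test_mean:.3f}")
```

A large gap between the two means suggests overfitting; two similarly low values suggest underfitting.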

## When Do We Need Cross Validation?

1. **Small Datasets**: When the dataset is not large enough, cross-validation maximizes the amount of data used for training and testing.
   
2. **Model Comparison**: When you need to compare multiple machine learning models to find the best one for your data.

3. **Hyperparameter Tuning**: During the process of finding the optimal hyperparameters for a model.

4. **Generalization**: When you want to ensure that your model generalizes well to unseen data.
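
The hyperparameter tuning use case can be sketched with `GridSearchCV`, which runs a cross-validation for every candidate setting and keeps the one with the best mean score. The grid of `C` values below is an arbitrary illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Candidate values for the regularization strength C (illustrative, not tuned).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored with 5-fold cross-validation.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```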

## When Do We Not Need Cross Validation?

1. **Large Datasets**: When the dataset is large enough that a single train/test split can give a reliable estimate of model performance.
   
2. **Real-Time Predictions**: In real-time or streaming applications where splitting data into folds and training multiple models is impractical due to time constraints.

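3. **Data Leakage**: Standard random folds are inappropriate when they would leak information between training and validation sets, for example when samples are grouped or temporally dependent. In such cases, use a group-aware or time-aware splitting strategy (such as `GroupKFold`) instead of plain K-Fold.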

4. **Simplistic Models**: When using very simple models or in cases where the problem is straightforward and does not require extensive validation.
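
For the temporal dependency case mentioned above, scikit-learn offers `TimeSeriesSplit`, which is not covered in the rest of this notebook. The sketch below uses twelve synthetic, chronologically ordered observations (an assumption made purely for illustration) to show that every training set ends before its validation set begins.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations in chronological order (synthetic placeholder data).
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

# Each training window only contains observations that precede the validation window.
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```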

# Types of Cross Validation in Scikit-Learn
Cross-validation is a key technique for assessing the performance and robustness of a machine learning model. **Scikit-learn** offers several methods for cross-validation, each suited to different types of data and use cases. Here is a detailed overview of the various cross-validation techniques provided by scikit-learn:

<div style="text-align: center;">
  <img src="https://i.ibb.co/2qxCG23/cv.gif" alt="K-Fold Cross Validation source: 0xkerem" style="border: 2px solid black; border-radius: 10px;">
</div>

## 1. K-Fold Cross Validation

K-Fold Cross Validation is a robust method used to evaluate the performance of a machine learning model. It divides the dataset into `k` approximately equal-sized subsets or "folds." The model is trained `k` times, each time using a different fold as the validation set and the remaining folds as the training set. The final performance metric is the average of the metrics calculated for each fold. This reduces the dependence of the estimate on any one particular partitioning of the dataset.

### How K-Fold Cross Validation Works
1. **Divide the Data**: Split the data into `k` approximately equal-sized folds.
2. **Train and Validate**: For each fold:
   - Train the model on `k-1` folds.
   - Validate the model on the remaining fold.
3. **Compute Metrics**: Calculate performance metrics (e.g., accuracy, precision, recall) for each fold.
4. **Average Metrics**: Average the performance metrics across all folds to obtain a final evaluation.
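
The four steps above can be written out by hand with `KFold.split`, which yields the training and validation indices for each fold. This is a sketch of what `cross_val_score` automates, using the same dataset and model as the notebook's K-Fold cell.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=200)  # fresh model for every fold
    model.fit(X[train_idx], y[train_idx])     # train on k-1 folds
    predictions = model.predict(X[val_idx])   # validate on the held-out fold
    fold_scores.append(accuracy_score(y[val_idx], predictions))

mean_score = float(np.mean(fold_scores))
print("Per-fold accuracy:", np.round(fold_scores, 3))
print(f"Mean accuracy: {mean_score:.3f}")
```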

![K-Fold Cross Validation source: Towards Data Science](https://i.ibb.co/H7mLgkk/k-fold-cv.png)

### When to Use K-Fold Cross Validation
- **Small Datasets**: When you have limited data, K-Fold Cross Validation allows you to make the most of your data by ensuring each data point is used for both training and validation.
- **Model Selection**: When selecting the best model from a set of candidate models, K-Fold Cross Validation provides a reliable estimate of each model’s performance.
- **Hyperparameter Tuning**: It helps in tuning hyperparameters by providing a robust evaluation metric.
- **Avoiding Overfitting**: It helps in detecting overfitting by ensuring that the model performs well on different subsets of the data.

### Using K-Fold Cross Validation in Scikit-Learn

Scikit-learn provides a convenient way to implement K-Fold Cross Validation using the `KFold` class and `cross_val_score` function.

### Key Parameters
- `n_splits`: Number of folds.
- `shuffle`: Whether to shuffle the data before splitting into folds.
- `random_state`: Seed for the random number generator to ensure reproducibility. It only has an effect when `shuffle=True`.
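
The `shuffle` parameter matters more than it may seem. The Iris rows are stored in class order (50 samples per class), so a small sketch like the following shows which classes land in each validation fold with and without shuffling. The helper name `classes_per_validation_fold` is made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)  # rows are ordered by class: 50 of each

def classes_per_validation_fold(splitter):
    """Return the sorted class labels present in each validation fold."""
    return [sorted(set(y[val_idx].tolist())) for _, val_idx in splitter.split(X)]

unshuffled = classes_per_validation_fold(KFold(n_splits=3, shuffle=False))
shuffled = classes_per_validation_fold(KFold(n_splits=3, shuffle=True, random_state=42))

print("Without shuffling:", unshuffled)
print("With shuffling:   ", shuffled)
```

Without shuffling, each validation fold holds a single class that the model never saw during training, which makes the scores meaningless.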

### Advantages
- **Robust Evaluation**: Provides a more reliable estimate of model performance.
- **Efficient Use of Data**: Makes efficient use of limited data by utilizing every observation for both training and validation.
- **Reduced Variance**: Reduces the variance of performance metrics compared to a single train-test split.

### Disadvantages
- **Computational Cost**: More computationally expensive than a single train-test split, especially for large datasets.
- **No Single Final Model**: The procedure produces `k` fitted models rather than one, so a final model still has to be refit on the full training data after evaluation.

By using K-Fold Cross Validation, you can ensure a thorough and reliable evaluation of your machine learning models, leading to better model selection and improved generalization to unseen data.

In [1]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Initialize model
model = LogisticRegression(max_iter=200)

# Define K-Fold Cross Validator
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform Cross Validation
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

# Print results
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean()}")

Accuracy scores for each fold: [1.         1.         0.93333333 0.96666667 0.96666667]
Mean accuracy: 0.9733333333333334


## 2. Stratified K-Fold Cross Validation

Stratified K-Fold Cross Validation is a variation of K-Fold Cross Validation that ensures each fold is representative of the entire dataset by maintaining the same proportion of each class label. This method is particularly useful when dealing with imbalanced datasets, where some classes are underrepresented.

### How Stratified K-Fold Cross Validation Works
1. **Divide the Data**: Split the data into `k` approximately equal-sized folds while preserving the class distribution in each fold.
2. **Train and Validate**: For each fold:
   - Train the model on `k-1` folds.
   - Validate the model on the remaining fold.
3. **Compute Metrics**: Calculate performance metrics (e.g., accuracy, precision, recall) for each fold.
4. **Average Metrics**: Average the performance metrics across all folds to obtain a final evaluation.
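
To make the effect of stratification visible, the sketch below counts minority-class samples per validation fold for plain `KFold` and for `StratifiedKFold`. The 90/10 label vector is synthetic and exists only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels: 90 samples of class 0 and 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

def minority_counts(splitter):
    """Count class-1 samples in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold minority counts per fold:          ", plain)
print("StratifiedKFold minority counts per fold:", stratified)
```

With stratification, every fold receives the same share of the minority class; with plain K-Fold the share depends on the shuffle.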

### When to Use Stratified K-Fold Cross Validation
- **Imbalanced Datasets**: When dealing with imbalanced datasets, Stratified K-Fold Cross Validation ensures that each fold is representative of the overall class distribution, providing a more reliable evaluation.
- **Classification Problems**: Especially useful in classification problems where maintaining the class distribution in training and validation sets is crucial for model performance.
- **Model Selection and Hyperparameter Tuning**: Helps in selecting the best model and tuning hyperparameters by providing a robust evaluation metric that accounts for class imbalance.

### Using Stratified K-Fold Cross Validation in Scikit-Learn

Scikit-learn provides a convenient way to implement Stratified K-Fold Cross Validation using the `StratifiedKFold` class and `cross_val_score` function.

### Key Parameters
- `n_splits`: Number of folds.
- `shuffle`: Whether to shuffle the data before splitting into folds.
- `random_state`: Seed for random number generator to ensure reproducibility.

### Advantages
- **Class Distribution Preservation**: Maintains the proportion of each class in all folds, leading to more reliable performance metrics.
- **Effective for Imbalanced Data**: Provides a better evaluation for models trained on imbalanced datasets.
- **Reduced Bias**: Reduces bias in performance metrics by ensuring each fold is representative of the entire dataset.

### Disadvantages
- **Computational Cost**: More computationally expensive than a single train-test split, especially for large datasets.
- **Classification Only**: Stratification requires discrete class labels, so it does not apply directly to regression targets.

Stratified K-Fold Cross Validation is a powerful tool for evaluating machine learning models, particularly when dealing with imbalanced datasets. It ensures that each fold is representative of the overall class distribution, leading to more reliable and robust performance metrics.

In [2]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Initialize model
model = LogisticRegression(max_iter=200)

# Define Stratified K-Fold Cross Validator
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform Cross Validation
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')

# Print results
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean()}")

Accuracy scores for each fold: [1.         0.96666667 0.93333333 1.         0.93333333]
Mean accuracy: 0.9666666666666668


## 3. Leave-One-Out Cross Validation (LOOCV)

Leave-One-Out Cross Validation (LOOCV) is an extreme case of K-Fold Cross Validation where `k` equals the number of data points in the dataset. In LOOCV, each data point is used once as a validation set while the remaining data points form the training set. This process is repeated for each data point, and the performance metric is averaged across all iterations.

### How Leave-One-Out Cross Validation Works
1. **Divide the Data**: Treat each data point as a single fold.
2. **Train and Validate**: For each data point:
   - Train the model on the remaining `n-1` data points.
   - Validate the model on the single data point.
3. **Compute Metrics**: Calculate performance metrics (e.g., accuracy, precision, recall) for each iteration.
4. **Average Metrics**: Average the performance metrics across all iterations to obtain a final evaluation.
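
A quick way to confirm the structure described above is to inspect the splitter directly. This sketch only enumerates the splits and does not train a model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()

# One split per sample, each with a single-sample validation set.
n_splits = loo.get_n_splits(X)
val_sizes = {len(val_idx) for _, val_idx in loo.split(X)}

print(f"Samples: {len(X)}, splits: {n_splits}")
print(f"Validation set sizes seen: {val_sizes}")
```

Because each validation set holds one sample, the per-iteration accuracy is either 0 or 1, and only the mean across iterations is informative.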

### When to Use Leave-One-Out Cross Validation
- **Small Datasets**: Ideal for very small datasets where splitting the data into larger folds is not feasible.
- **Low-Bias Estimates**: Useful when you want a nearly unbiased estimate of model performance, though the estimate itself may have high variance.
- **Model Evaluation**: Provides a thorough evaluation as each data point is used for validation exactly once.

### Using Leave-One-Out Cross Validation in Scikit-Learn

Scikit-learn provides a convenient way to implement LOOCV using the `LeaveOneOut` class and `cross_val_score` function.

### Key Parameters
- `LeaveOneOut` takes no constructor parameters; the number of splits always equals the number of data points in the dataset. The instance is passed to `cross_val_score` through its `cv` argument.

### Advantages
- **Nearly Unbiased Estimate**: Each model is trained on `n-1` of the `n` data points, so the estimate is close to the performance of a model trained on the full dataset.
- **Maximal Data Utilization**: Ensures maximal utilization of data for training since `n-1` data points are used for training in each iteration.

### Disadvantages
- **Computationally Intensive**: Very computationally expensive, especially for large datasets, as it requires training the model `n` times.
- **High Variance**: The performance metric can have high variance since each validation set contains only one data point.

Leave-One-Out Cross Validation is a thorough and nearly unbiased method for model evaluation, particularly useful for small datasets. However, its high computational cost makes it impractical for larger datasets.

In [3]:
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Initialize model
model = LogisticRegression(max_iter=200)

# Define Leave-One-Out Cross Validator
loo = LeaveOneOut()

# Perform Cross Validation
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')

# Print results
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean()}")

Accuracy scores for each fold: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
Mean accuracy: 0.9666666666666667


## 4. Leave-P-Out Cross Validation (LPOCV)

Leave-P-Out Cross Validation (LPOCV) is a generalization of Leave-One-Out Cross Validation (LOOCV). In LPOCV, `p` data points are left out for validation, and the model is trained on the remaining `n-p` data points. This process is repeated for all possible combinations of `p` data points. LPOCV provides a comprehensive evaluation of the model's performance but is computationally expensive, especially for large datasets and larger values of `p`.

### How Leave-P-Out Cross Validation Works
1. **Divide the Data**: Generate all possible combinations of `p` data points to be used as validation sets.
2. **Train and Validate**: For each combination:
   - Train the model on the remaining `n-p` data points.
   - Validate the model on the `p` data points.
3. **Compute Metrics**: Calculate performance metrics (e.g., accuracy, precision, recall) for each iteration.
4. **Average Metrics**: Average the performance metrics across all iterations to obtain a final evaluation.
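
The cost of enumerating every combination grows quickly. As a rough sketch, the number of splits reported by `LeavePOut.get_n_splits` can be compared with the binomial coefficient C(n, p) for the 150-sample Iris dataset.

```python
from math import comb

from sklearn.datasets import load_iris
from sklearn.model_selection import LeavePOut

X, y = load_iris(return_X_y=True)
n = len(X)

for p in (1, 2, 3):
    n_splits = LeavePOut(p=p).get_n_splits(X)
    # The number of validation sets equals the binomial coefficient C(n, p).
    assert n_splits == comb(n, p)
    print(f"p={p}: {n_splits} model fits")

splits_for_p2 = LeavePOut(p=2).get_n_splits(X)
```

Each split requires one model fit, so even `p=2` on 150 samples means more than eleven thousand fits.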

### When to Use Leave-P-Out Cross Validation
- **Small to Medium Datasets**: Feasible for smaller datasets where the number of combinations is manageable.
- **Thorough Evaluation**: Provides a thorough and exhaustive evaluation of model performance by considering all possible validation sets of size `p`.
- **Model Evaluation**: Useful for understanding model performance across different subsets of the data.

### Using Leave-P-Out Cross Validation in Scikit-Learn

Scikit-learn provides a way to implement LPOCV using the `LeavePOut` class and `cross_val_score` function.

### Key Parameters
- `p`: Number of data points to leave out for validation.
- The number of splits is not a parameter of `LeavePOut`; it is determined by the number of ways to choose `p` data points from `n`. The instance is passed to `cross_val_score` through its `cv` argument.

### Advantages
- **Comprehensive Evaluation**: Provides a thorough and exhaustive evaluation of the model by considering all possible subsets of size `p`.
- **Detailed Insight**: Offers detailed insight into model performance across different combinations of data points.

### Disadvantages
- **Computationally Intensive**: Very computationally expensive, especially for large datasets and larger values of `p`, due to the combinatorial explosion of possible subsets.
- **Overlapping Validation Sets**: For `p > 1` the validation sets overlap, so the per-split scores are not independent of each other.

Leave-P-Out Cross Validation is a powerful tool for thorough model evaluation, especially useful for smaller datasets where an exhaustive assessment is feasible. Its computational intensity makes it less suitable for large datasets.

In [4]:
from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Initialize model
model = LogisticRegression(max_iter=200)

# Define Leave-P-Out Cross Validator
lpo = LeavePOut(p=2)

# Perform Cross Validation
scores = cross_val_score(model, X, y, cv=lpo, scoring='accuracy')

# Print results
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean()}")

Accuracy scores for each fold: [1. 1. 1. ... 1. 1. 1.]
Mean accuracy: 0.965413870246085


## 5. ShuffleSplit Cross Validation

ShuffleSplit Cross Validation randomly shuffles the dataset and then splits it into a training and a validation set, repeating this for a specified number of independent iterations. Unlike K-Fold Cross Validation, ShuffleSplit does not ensure that each sample is used exactly once for validation. Within a single split the training and validation sets are disjoint, but because every split is drawn independently, a sample may appear in the validation sets of several splits or in none at all.

### How ShuffleSplit Cross Validation Works
1. **Shuffle the Data**: Randomly shuffle the dataset.
2. **Split the Data**: Split the shuffled data into training and validation sets according to specified proportions.
3. **Train and Validate**: For each split:
   - Train the model on the training set.
   - Validate the model on the validation set.
4. **Repeat**: Repeat the process for a specified number of iterations.
5. **Compute Metrics**: Calculate performance metrics (e.g., accuracy, precision, recall) for each split.
6. **Average Metrics**: Average the performance metrics across all splits to obtain a final evaluation.
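
The following sketch checks two properties of the procedure above on the Iris dataset: that training and validation sets never share a sample inside one split, and how often individual samples end up in a validation set across splits. It only inspects indices and trains no model.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit

X, y = load_iris(return_X_y=True)
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

validation_uses = Counter()
disjoint_within_split = True
for train_idx, val_idx in ss.split(X):
    # Inside one split, a sample is in either the training or the validation set.
    if set(train_idx) & set(val_idx):
        disjoint_within_split = False
    validation_uses.update(val_idx.tolist())

never_validated = len(X) - len(validation_uses)
print("Train/validation disjoint in every split:", disjoint_within_split)
print("Samples never used for validation:", never_validated)
print("Most validation appearances for one sample:", max(validation_uses.values()))
```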

### When to Use ShuffleSplit Cross Validation
- **Large Datasets**: Suitable for large datasets where traditional K-Fold Cross Validation may be computationally expensive.
- **Random Sampling**: When you want to ensure that different random samples of the dataset are used for training and validation.
- **Model Robustness**: Useful for testing the robustness of the model against different random splits of the data.

### Using ShuffleSplit Cross Validation in Scikit-Learn

Scikit-learn provides a convenient way to implement ShuffleSplit Cross Validation using the `ShuffleSplit` class and `cross_val_score` function.

### Key Parameters
- `n_splits`: Number of re-shuffling and splitting iterations.
- `test_size`: Proportion of the dataset to include in the validation set.
- `train_size`: Proportion of the dataset to include in the training set (if not specified, it's the complement of `test_size`).
- `random_state`: Seed for random number generator to ensure reproducibility.

### Advantages
- **Flexibility**: Allows for flexible training and validation sizes, making it adaptable to various dataset sizes and requirements.
- **Random Sampling**: Provides different random splits of the data, which can help in assessing the robustness of the model.
- **Computational Efficiency**: Can be more computationally efficient than K-Fold Cross Validation for large datasets.

### Disadvantages
- **Potential Overlap**: Validation sets from different splits can overlap, so the per-split scores are not independent, and some samples may be validated several times.
- **Less Comprehensive**: Does not ensure that each sample is used exactly once for validation, potentially leading to less comprehensive evaluation compared to K-Fold Cross Validation.

ShuffleSplit Cross Validation is a versatile and flexible method for evaluating machine learning models, particularly useful for large datasets and scenarios where random sampling is desired. Its ability to provide multiple different training and validation splits can help in assessing the robustness and generalization capability of the model.

In [5]:
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Initialize model
model = LogisticRegression(max_iter=200)

# Define ShuffleSplit Cross Validator
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

# Perform Cross Validation
scores = cross_val_score(model, X, y, cv=ss, scoring='accuracy')

# Print results
print(f"Accuracy scores for each split: {scores}")
print(f"Mean accuracy: {scores.mean()}")

Accuracy scores for each split: [1.         0.96666667 0.96666667 0.93333333 0.93333333 1.
 0.93333333 0.96666667 1.         0.9       ]
Mean accuracy: 0.96


## 6. Stratified ShuffleSplit Cross Validation

Stratified ShuffleSplit Cross Validation is a variation of ShuffleSplit Cross Validation that maintains the proportion of each class in both the training and validation sets. This method is particularly useful for imbalanced datasets where it is important to ensure that each split preserves the original class distribution.

### How Stratified ShuffleSplit Cross Validation Works
1. **Shuffle the Data**: Randomly shuffle the dataset while maintaining class distribution.
2. **Split the Data**: Split the shuffled data into training and validation sets according to specified proportions, ensuring each set retains the class distribution.
3. **Train and Validate**: For each split:
   - Train the model on the training set.
   - Validate the model on the validation set.
4. **Repeat**: Repeat the process for a specified number of iterations.
5. **Compute Metrics**: Calculate performance metrics (e.g., accuracy, precision, recall) for each split.
6. **Average Metrics**: Average the performance metrics across all splits to obtain a final evaluation.
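
To see the stratification at work, the sketch below counts how many samples of each Iris class fall into the validation set of every split. No model is trained; only the index sets are inspected.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

# Class counts (setosa, versicolor, virginica) in each validation set.
val_class_counts = [
    np.bincount(y[val_idx], minlength=3).tolist() for _, val_idx in sss.split(X, y)
]

for i, counts in enumerate(val_class_counts):
    print(f"Split {i}: {counts}")
```

Since Iris has 50 samples per class and each validation set holds 30 samples, every split should contain 10 samples of each class.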

### When to Use Stratified ShuffleSplit Cross Validation
- **Imbalanced Datasets**: Particularly useful when dealing with imbalanced datasets to ensure each class is appropriately represented in training and validation sets.
- **Classification Problems**: Ensures that the class distribution is preserved, leading to more reliable evaluation metrics.
- **Model Robustness**: Useful for testing the robustness of the model against different random splits of the data while maintaining class proportions.

### Using Stratified ShuffleSplit Cross Validation in Scikit-Learn

Scikit-learn provides a convenient way to implement Stratified ShuffleSplit Cross Validation using the `StratifiedShuffleSplit` class and `cross_val_score` function.

### Key Parameters
- `n_splits`: Number of re-shuffling and splitting iterations.
- `test_size`: Proportion of the dataset to include in the validation set.
- `train_size`: Proportion of the dataset to include in the training set (if not specified, it's the complement of `test_size`).
- `random_state`: Seed for random number generator to ensure reproducibility.

### Advantages
- **Class Distribution Preservation**: Maintains the class distribution in each split, leading to more reliable and representative performance metrics.
- **Random Sampling**: Provides different random splits of the data while preserving class proportions, useful for assessing model robustness.
- **Flexible and Robust**: Offers the flexibility of ShuffleSplit with the added benefit of stratification, making it suitable for a wide range of dataset sizes and types.

### Disadvantages
- **Potential Overlap**: Validation sets from different splits can overlap, so the per-split scores are not independent, and some samples may be validated several times while others are never validated.
- **Computational Cost**: Can be computationally intensive for very large datasets, though generally more efficient than exhaustive methods like LPOCV.

Stratified ShuffleSplit Cross Validation combines the benefits of random sampling with the need to maintain class distributions, making it a powerful tool for evaluating machine learning models, especially on imbalanced datasets. This method helps ensure that evaluation metrics are reliable and representative of the model's performance across different splits of the data.

In [6]:
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Initialize model
model = LogisticRegression(max_iter=200)

# Define Stratified ShuffleSplit Cross Validator
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

# Perform Cross Validation
scores = cross_val_score(model, X, y, cv=sss, scoring='accuracy')

# Print results
print(f"Accuracy scores for each split: {scores}")
print(f"Mean accuracy: {scores.mean()}")

Accuracy scores for each split: [0.96666667 0.96666667 0.96666667 0.96666667 0.93333333 0.96666667
 1.         0.96666667 0.93333333 0.96666667]
Mean accuracy: 0.9633333333333333


## 7. Group K-Fold Cross Validation

Group K-Fold Cross Validation is a variation of K-Fold Cross Validation used when the data is organized into groups that should not be split across training and validation sets. It ensures that the same group is not represented in both training and validation sets, which avoids data leakage between groups in scenarios such as clustered data or repeated measurements from the same subject.

### How Group K-Fold Cross Validation Works
1. **Identify Groups**: Identify the groups within the dataset.
2. **Divide Groups**: Split the groups into `k` folds while ensuring that all data points within a group remain in the same fold.
3. **Train and Validate**: For each fold:
   - Train the model on `k-1` folds.
   - Validate the model on the remaining fold.
4. **Compute Metrics**: Calculate performance metrics (e.g., accuracy, precision, recall) for each fold.
5. **Average Metrics**: Average the performance metrics across all folds to obtain a final evaluation.
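
The group constraint can be verified directly on the index sets. The sketch below uses 20 synthetic samples from 5 hypothetical subjects (all values are placeholders) and checks that no subject appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 20 samples from 5 hypothetical subjects, 4 measurements each.
groups = np.repeat(np.arange(5), 4)
X = np.arange(20).reshape(-1, 1)
y = np.tile([0, 1], 10)

gkf = GroupKFold(n_splits=5)
shared_groups = []
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    # Groups present on both sides of the split would indicate leakage.
    overlap = set(groups[train_idx]) & set(groups[val_idx])
    shared_groups.append(overlap)
    val_groups = sorted(set(groups[val_idx].tolist()))
    print(f"Validation groups: {val_groups}, shared with training: {len(overlap)}")
```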

### When to Use Group K-Fold Cross Validation
- **Grouped Data**: When the dataset contains groups of data points that should not be split across training and validation sets to prevent data leakage.
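- **Multiple Samples per Entity**: When several samples come from the same patient, user, or device and the model must generalize to entities it has not seen. For time-ordered data, where the order of samples matters rather than group membership, scikit-learn's `TimeSeriesSplit` is the more appropriate splitter.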
- **Hierarchical Data**: When dealing with hierarchical data or repeated measures where individual samples are not independent.

### Using Group K-Fold Cross Validation in Scikit-Learn

Scikit-learn provides a convenient way to implement Group K-Fold Cross Validation using the `GroupKFold` class and `cross_val_score` function.

### Key Parameters
- `n_splits`: Number of folds.
- `groups`: Array-like structure containing group labels for the samples. It is passed to `split` or to `cross_val_score`, not to the `GroupKFold` constructor.

### Advantages
- **Prevents Data Leakage**: Ensures that data from the same group does not appear in both training and validation sets, preventing data leakage.
- **Appropriate for Grouped Data**: Suitable for datasets with natural groupings, such as clustered data or repeated measures.
- **Maintains Group Integrity**: Ensures the integrity of groups within the dataset, making it a robust evaluation method for grouped data.

### Disadvantages
- **Uneven Folds**: Folds can differ in size when the groups differ in size, and `n_splits` cannot exceed the number of distinct groups.
- **Complexity in Implementation**: Requires additional handling of group labels and careful management of group splits.

Group K-Fold Cross Validation is a powerful tool for evaluating machine learning models when dealing with grouped data. It helps prevent data leakage and ensures that the evaluation metrics are reliable and representative of the model's performance across different groups in the dataset.

In [7]:
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import numpy as np

# Create a synthetic dataset with groups
X, y = make_classification(n_samples=100, n_features=10, random_state=42)
groups = np.array([i // 10 for i in range(100)])  # Create 10 groups

# Initialize model
model = LogisticRegression(max_iter=200)

# Define Group K-Fold Cross Validator
gkf = GroupKFold(n_splits=5)

# Perform Cross Validation
scores = cross_val_score(model, X, y, cv=gkf, groups=groups, scoring='accuracy')

# Print results
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean()}")

Accuracy scores for each fold: [0.95 0.95 0.95 0.95 0.95]
Mean accuracy: 0.95


## 8. Stratified Group K-Fold Cross Validation

Stratified Group K-Fold Cross Validation is a variation of Group K-Fold Cross Validation that maintains the proportion of each class within each fold, while also ensuring that groups are not split across training and validation sets. This method is particularly useful for imbalanced datasets organized into groups, where it is important to preserve the class distribution within each fold.

### How Stratified Group K-Fold Cross Validation Works
1. **Identify Groups**: Identify the groups within the dataset.
2. **Divide Groups**: Split the groups into `k` folds while maintaining the class distribution in each fold and ensuring that all data points within a group remain in the same fold.
3. **Train and Validate**: For each fold:
   - Train the model on `k-1` folds.
   - Validate the model on the remaining fold.
4. **Compute Metrics**: Calculate performance metrics (e.g., accuracy, precision, recall) for each fold.
5. **Average Metrics**: Average the performance metrics across all folds to obtain a final evaluation.

### When to Use Stratified Group K-Fold Cross Validation
- **Imbalanced Datasets**: Particularly useful for imbalanced datasets to ensure each fold has a representative class distribution.
- **Grouped Data**: When the dataset contains groups of data points that should not be split across training and validation sets to prevent data leakage.
- **Classification Problems**: Ensures that the class distribution is preserved, leading to more reliable evaluation metrics.

### Using Stratified Group K-Fold Cross Validation in Scikit-Learn

Scikit-learn provides a built-in `StratifiedGroupKFold` class (added in version 1.0), which can be used together with the `cross_val_score` function. Because groups must stay intact, the class distribution in each fold is preserved only as closely as the group structure allows.

### Key Parameters
- `n_splits`: Number of folds.
- `groups`: Array-like structure containing group labels for the samples. It is passed to `split` or to `cross_val_score`, not to the `StratifiedGroupKFold` constructor.

### Advantages
- **Class Distribution Preservation**: Maintains the class distribution in each fold, leading to more reliable and representative performance metrics.
- **Prevents Data Leakage**: Ensures that data from the same group does not appear in both training and validation sets, preventing data leakage.
- **Appropriate for Grouped Data**: Suitable for datasets with natural groupings, such as clustered data or repeated measures.

### Disadvantages
- **Approximate Stratification**: Because whole groups are assigned to folds, class proportions can only be matched approximately, especially when groups are few or large.
- **Computational Cost**: Slightly more computationally intensive than standard Group K-Fold Cross Validation due to the need to maintain class distribution.

Stratified Group K-Fold Cross Validation is a powerful tool for evaluating machine learning models on imbalanced and grouped data. It combines the benefits of stratification and group handling, ensuring reliable and representative performance metrics.

In [8]:
from sklearn.model_selection import StratifiedGroupKFold, cross_val_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import numpy as np

# Create a synthetic dataset with groups
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, random_state=42)
groups = np.random.randint(0, 10, size=len(y))  # Up to 10 random groups; unseeded, so fold scores vary between runs

# Initialize model
model = LogisticRegression(max_iter=200)

# Define Stratified Group K-Fold Cross Validator
sgkf = StratifiedGroupKFold(n_splits=5)

# Perform Cross Validation
scores = cross_val_score(model, X, y, cv=sgkf, groups=groups, scoring='accuracy')

# Print results
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean accuracy: {scores.mean()}")

Accuracy scores for each fold: [0.85714286 0.77272727 1.         0.9047619  0.73684211]
Mean accuracy: 0.8542948279790383


# AutoCV

AutoCV is an automated cross-validation framework that I am developing to simplify and streamline the process of cross-validation in machine learning projects. Built on top of scikit-learn, it aims to reduce the manual effort involved in evaluating machine learning models by providing an easy-to-use interface and a set of tools that automate various cross-validation tasks. While AutoCV is still in development and may have some problems, it holds significant potential for improvement and expansion.

[Github](https://github.com/0xkerem/AutoCV)

In [9]:
# Clone the AutoCV repository
!git clone https://github.com/0xkerem/AutoCV.git
    
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # Suppress TensorFlow info and warnings

# Import necessary libraries
import sys
sys.path.append('/kaggle/working/AutoCV')  # Add AutoCV to the Python path

Cloning into 'AutoCV'...
remote: Enumerating objects: 152, done.
remote: Counting objects: 100% (152/152), done.
remote: Compressing objects: 100% (112/112), done.
remote: Total 152 (delta 85), reused 104 (delta 37), pack-reused 0
Receiving objects: 100% (152/152), 27.43 KiB | 6.86 MiB/s, done.
Resolving deltas: 100% (85/85), done.


In [10]:
from autocv import AutoCV  # Import the AutoCV module

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Initialize the LogisticRegression model
model = LogisticRegression(max_iter=200)

# Initialize the AutoCV object
auto_cv = AutoCV(model=model)

# Perform cross-validation
scores = auto_cv.cross_validate(X, y)

# Print the cross-validation scores
print("Cross-validation scores:", scores)

# Print summary of the AutoCV results
auto_cv.summary()

Cross-validation scores: {'average_test_accuracy': 0.965413870246085, 'average_test_precision': 0.9747352721849366, 'average_test_recall': 0.9752945563012676, 'average_test_f1_score': 0.954213273676361}
Cross-Validation Summary:
-------------------------
Model: LogisticRegression(max_iter=200)
Cross-Validation Strategy: LeavePOut(p=2)
Average Fit Time: 0.0252 seconds
Average Score Time: 0.0059 seconds
Scores:
  average_test_accuracy: 0.9654
  average_test_precision: 0.9747
  average_test_recall: 0.9753
  average_test_f1_score: 0.9542
