# 5.1.2 Stratified K-Fold Cross-Validation

## Explanation of Stratified K-Fold Cross-Validation

Stratified K-Fold Cross-Validation is a variant of K-Fold Cross-Validation in which each fold preserves, as closely as possible, the class proportions of the full dataset. This matters most for imbalanced datasets, where a purely random split can leave a fold with few or no samples of a minority class. Because every fold reflects the overall class distribution, the resulting performance estimates tend to vary less from fold to fold and are more reliable for classification tasks.
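To make the idea of preserved class proportions concrete, here is a minimal sketch on a made-up binary label vector (90 negatives followed by 10 positives). The names `y_demo`, `X_demo`, and `positive_fraction_per_fold` are illustrative and not part of the original notebook. The unshuffled `KFold` is chosen deliberately as a worst case.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; only the labels drive stratification

def positive_fraction_per_fold(splitter, X, y):
    """Return the fraction of class-1 samples in each test fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]

plain_fractions = positive_fraction_per_fold(KFold(n_splits=5), X_demo, y_demo)
stratified_fractions = positive_fraction_per_fold(StratifiedKFold(n_splits=5), X_demo, y_demo)

print("KFold:          ", plain_fractions)
print("StratifiedKFold:", stratified_fractions)
```

With these labels, every stratified test fold contains 10% positives, matching the full dataset, while the unshuffled `KFold` places all positives in the last fold.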

## Use Cases

1. **Imbalanced Datasets**: When classes are unevenly represented, regular K-Fold Cross-Validation can produce folds whose class mix differs noticeably from that of the full dataset. Stratified K-Fold keeps the class proportions of each fold approximately equal to those of the entire dataset, which gives a more accurate assessment of model performance.

2. **Classification Tasks**: For classification tasks, especially multi-class classification, each class should be represented proportionally in both the training and validation sets. Otherwise, a fold's score may reflect which classes happened to land in it rather than how well the model performs across all classes.

3. **Small Datasets**: With few samples, a random split can easily leave a fold with very few samples of a class, or none at all, which skews the evaluation metrics. Stratified K-Fold Cross-Validation makes full use of the available data while preserving the class distribution in every fold.
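The multi-class case can be sketched on the Iris dataset, which is stored in class order (50 samples per class). The helper `class_counts_per_fold` is a hypothetical name introduced for this illustration, and the unshuffled `KFold` is used on purpose to show how badly a non-stratified split can go when the data is sorted by class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

# 150 samples, 50 per class, stored in class order
X_iris, y_iris = load_iris(return_X_y=True)

def class_counts_per_fold(splitter, X, y, n_classes=3):
    """Count how many samples of each class land in every test fold."""
    return [
        np.bincount(y[test_idx], minlength=n_classes).tolist()
        for _, test_idx in splitter.split(X, y)
    ]

plain_counts = class_counts_per_fold(KFold(n_splits=5), X_iris, y_iris)
stratified_counts = class_counts_per_fold(StratifiedKFold(n_splits=5), X_iris, y_iris)

print("KFold test-fold class counts:          ", plain_counts)
print("StratifiedKFold test-fold class counts:", stratified_counts)
```

Each stratified test fold holds 10 samples of every class, whereas the first unshuffled `KFold` test fold contains only class 0. Setting `shuffle=True` on `KFold` reduces this effect but does not guarantee equal proportions.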


___


## Readings
- [What is Stratified Cross-Validation in Machine Learning?](https://medium.com/towards-data-science/what-is-stratified-cross-validation-in-machine-learning-8844f3e7ae8e)
- [A Comprehensive Guide to Stratified K-Fold Cross-validation for Unbalanced Data](https://medium.com/@juanc.olamendy/a-comprehensive-guide-to-stratified-k-fold-cross-validation-for-unbalanced-data-014691060f17)
- [Cross Validation](https://medium.com/@farheenshaukat_19/cross-validation-6ce5702e8eed)
- [Stratified K-Fold Cross-Validation on Grouped Datasets](https://towardsdatascience.com/stratified-k-fold-cross-validation-on-grouped-datasets-b3bca8f0f53e)

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
```

```python
# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target
```

```python
# Initialize the Stratified K-Fold cross-validator
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```

```python
# Initialize the model
# (no random_state is set on the tree, so scores can differ slightly between runs)
model = DecisionTreeClassifier()

# Perform Stratified K-Fold Cross-Validation
stratified_scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')

# Print the stratified cross-validation scores
print("Stratified Cross-Validation Scores:", stratified_scores)
print("Mean Accuracy:", stratified_scores.mean())
print("Standard Deviation:", stratified_scores.std())
```

```
Stratified Cross-Validation Scores: [1.         0.96666667 0.93333333 0.96666667 0.9       ]
Mean Accuracy: 0.9533333333333335
Standard Deviation: 0.03399346342395189
```
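For readers who want to see what `cross_val_score` does with the splitter, the following sketch runs the same evaluation as an explicit loop over `skf_demo.split`. It is an illustration under stated assumptions, not the notebook's original code: the names `skf_demo`, `seeded_model`, `manual_scores`, and `helper_scores` are introduced here, and the tree is given `random_state=0` (an assumption not present in the original) so that both approaches produce identical scores.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Assumption: the tree is seeded so the manual loop and cross_val_score agree exactly
seeded_model = DecisionTreeClassifier(random_state=0)

manual_scores = []
for train_idx, test_idx in skf_demo.split(X_demo, y_demo):
    fold_model = clone(seeded_model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    predictions = fold_model.predict(X_demo[test_idx])
    manual_scores.append(accuracy_score(y_demo[test_idx], predictions))

manual_scores = np.array(manual_scores)
helper_scores = cross_val_score(seeded_model, X_demo, y_demo, cv=skf_demo, scoring="accuracy")

print("Manual loop:    ", manual_scores)
print("cross_val_score:", helper_scores)
```

Writing the loop by hand is mainly useful when per-fold artifacts are needed, such as confusion matrices or fitted models. For a plain score per fold, `cross_val_score` is the shorter route.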


## Conclusion

Stratified K-Fold Cross-Validation is a standard technique for evaluating classifiers, particularly on imbalanced datasets. By keeping the class distribution of each fold close to that of the entire dataset, it yields more reliable estimates of model performance. It is especially beneficial for multi-class tasks and small datasets, where a random split could leave a class underrepresented in a fold. Scikit-learn's `StratifiedKFold` makes the technique straightforward to apply in Python, and the more trustworthy estimates it produces support better-informed model selection.
