# Day 20: Cross-Validation Techniques

In [1]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import numpy as np

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Define model
rf = RandomForestClassifier(random_state=42)

# Define K-Fold cross-validation with 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation
cv_scores = cross_val_score(rf, X, y, cv=kf, scoring='accuracy')

# Output the results
print("Cross-validation scores:", cv_scores)
print("Average accuracy:", np.mean(cv_scores))


Cross-validation scores: [1.         0.96666667 0.93333333 0.93333333 0.96666667]
Average accuracy: 0.9600000000000002


1. Explanation of Cross-Validation Techniques
Cross-validation is a model evaluation technique that repeatedly partitions the dataset into training and test subsets, so the model is trained and tested several times on different splits. This gives a more robust picture of the model’s performance and generalizability, and makes overfitting or underfitting easier to detect than a single split does.
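To make the "train and test multiple times" idea concrete, here is a minimal sketch of the loop that `cross_val_score` runs for us in the cell above. It reuses the same iris data, model, and `KFold` settings. The names `manual_scores` and `fold_model` are only illustrative, and this is a simplified view rather than scikit-learn's actual internals.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, test_idx in kf.split(X):
    # Fresh, unfitted copy so no information leaks between folds
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    # score() returns accuracy for classifiers
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
print("Per-fold accuracy:", manual_scores)
print("Mean accuracy:", manual_scores.mean())
```

Each sample lands in the test set exactly once, and the model never sees its test fold during fitting.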

Types of Cross-Validation:
K-Fold Cross-Validation: The dataset is split into K folds of (approximately) equal size. The model is trained on K-1 folds and tested on the remaining fold. This is repeated K times, with each fold serving as the test set exactly once.
Stratified K-Fold Cross-Validation: A variation of K-fold where the distribution of target classes is preserved in each fold, ensuring each fold has a representative distribution of the target variable.
Leave-One-Out Cross-Validation (LOOCV): Each data point is used as a test set once, and the model is trained on all other data points. This is computationally expensive for large datasets.
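Leave-P-Out Cross-Validation: Similar to LOOCV, but instead of leaving one point out, every possible combination of p data points is used as a test set once. The number of iterations grows combinatorially, so it is only practical for small datasets.
Shuffle Split: The dataset is shuffled and split into training and test sets multiple times. Unlike K-fold, the number of iterations and the test set size are chosen independently, and because each split is drawn at random, the same sample can appear in several test sets.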
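The splitters listed above all live in `sklearn.model_selection`. The sketch below runs them on a tiny made-up dataset (`X_toy`, `y_toy` are hypothetical, with 10 samples and an 8:2 class imbalance) just to show how many train/test iterations each one produces and what stratification does to the test folds.

```python
import numpy as np
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    LeavePOut,
    ShuffleSplit,
    StratifiedKFold,
)

# Hypothetical toy data: 10 samples, 8 of class 0 and 2 of class 1
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

splitters = {
    "KFold": KFold(n_splits=2, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=2, shuffle=True, random_state=0),
    "LeaveOneOut": LeaveOneOut(),
    "LeavePOut(p=2)": LeavePOut(p=2),
    "ShuffleSplit": ShuffleSplit(n_splits=3, test_size=0.2, random_state=0),
}

# Number of train/test iterations each strategy yields on 10 samples
split_counts = {}
for name, splitter in splitters.items():
    split_counts[name] = splitter.get_n_splits(X_toy, y_toy)
    print(f"{name}: {split_counts[name]} splits")

# Stratification keeps the 8:2 class ratio in every test fold
stratified_counts = [
    np.bincount(y_toy[test_idx], minlength=2).tolist()
    for _, test_idx in splitters["StratifiedKFold"].split(X_toy, y_toy)
]
print("Stratified test-fold class counts:", stratified_counts)
```

Any of these splitter objects can be passed as the `cv` argument of `cross_val_score`, in place of the `KFold` object used in the first cell.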
2. Importance of Cross-Validation in ML
Cross-validation is important for:

Generalization: It helps evaluate the model’s ability to perform on unseen data by training on different subsets and testing on others.
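Detecting Overfitting: Since the model is trained and tested on multiple splits, a model that is overly tuned to a single data subset is exposed by weak or inconsistent scores on the others. Cross-validation does not prevent overfitting by itself, but it makes it visible.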
Reliable Performance Estimation: By averaging the performance metrics across several folds, cross-validation provides a more reliable estimate of a model's true performance.
Without cross-validation, we might end up with a performance estimate that depends heavily on which samples happen to land in a single train-test split, which could lead to misleading results.
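As a rough illustration of that last point, the sketch below scores the same model on several single train-test splits that differ only in their random seed, then compares the spread with the 5-fold average. The seeds and the 80/20 split are arbitrary choices for the demo, and the exact numbers will depend on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Same model, five different single 80/20 splits
single_split_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    single_split_scores.append(model.score(X_test, y_test))

# 5-fold cross-validation on the full dataset
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    RandomForestClassifier(random_state=42), X, y, cv=kf, scoring="accuracy"
)

print("Single-split accuracies:", np.round(single_split_scores, 3))
print("Range across single splits:",
      round(max(single_split_scores) - min(single_split_scores), 3))
print("5-fold mean accuracy:", round(cv_scores.mean(), 3))
```

If the single-split accuracies differ noticeably from seed to seed, that difference is the noise a one-off split would hide.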

#100DaysOfCodeDay20 #CrossValidation #MachineLearning