# What Is Cross-Validation (CV)?
Cross-validation (CV) is a resampling technique for estimating how well a machine learning model generalizes. It checks that your model performs well not just on the training data, but also on data it has not seen.

### Why Use Cross-Validation?
- Helps detect overfitting (it does not prevent overfitting by itself)
- Provides a more reliable estimate of model performance than a single train/test split
- Tests the model on several different subsets of the data
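
To make the first point concrete, here is a minimal sketch that compares training accuracy with cross-validated accuracy. The dataset (`load_breast_cancer`) and the unpruned `DecisionTreeClassifier` are illustrative choices made for this example only; a large gap between the two numbers is a typical sign of overfitting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative choice: an unpruned tree can memorize its training data
X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Accuracy measured on the same rows the model was fit on (optimistic)
train_acc = model.fit(X, y).score(X, y)

# Accuracy averaged over 5 held-out folds (a less optimistic estimate)
cv_acc = cross_val_score(model, X, y, cv=5).mean()

print(f"Training accuracy:  {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```

If the training accuracy is much higher than the cross-validated one, the model is probably fitting noise in the training set.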

### Main Cross-Validation Techniques

**1. Hold-Out Validation**

**Explanation:**
- Split the data once into a training set and a test set (e.g., 80% train, 20% test; the example below uses 70/30).

**Use Case:**
- Quick sanity checks, or datasets large enough that a single split is representative.

In [1]:
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the Wine dataset
X, y = load_wine(return_X_y=True)

# Split data: 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize model
model = RandomForestClassifier(random_state=42)

# Train
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 1.0

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        21
           2       1.00      1.00      1.00        14

    accuracy                           1.00        54
   macro avg       1.00      1.00      1.00        54
weighted avg       1.00      1.00      1.00        54
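
A perfect score from a single split of 54 test samples should be read with some caution, because a hold-out estimate depends on which rows happen to land in the test set. The sketch below repeats the same 70/30 split with a few different seeds to show how much the number can move; the seed values 0 to 4 are arbitrary and chosen only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

holdout_scores = []
for seed in range(5):  # arbitrary seeds, only the split changes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    holdout_scores.append(model.score(X_test, y_test))

print("Hold-out accuracy per seed:", holdout_scores)
print("Spread (max - min):", max(holdout_scores) - min(holdout_scores))
```

Any spread seen here is the variability that the multi-split techniques below are meant to average out.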



**2. K-Fold Cross-Validation (Standard)**

**Description:**
- Splits the dataset into K folds of (approximately) equal size.
- Trains on K-1 folds and tests on the remaining fold.
- Repeats K times so that each fold serves as the test set exactly once.

In [2]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Load data
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# K-Fold (K=5)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)

print("K-Fold CV Scores:", scores)
print("Mean Accuracy:", scores.mean())

K-Fold CV Scores: [1.         1.         0.93333333 0.96666667 0.96666667]
Mean Accuracy: 0.9733333333333334
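
`cross_val_score` hides the loop over folds. As a rough sketch of what happens underneath, the steps described above can be written out by hand with `KFold.split`. The `test_counts` array is a helper introduced here purely to check that every sample is tested exactly once.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

test_counts = np.zeros(len(X), dtype=int)  # how often each row is tested
fold_scores = []

for train_idx, test_idx in kf.split(X):
    # A fresh model per fold, so nothing leaks between folds
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    test_counts[test_idx] += 1

print("Manual fold scores:", fold_scores)
print("Every sample tested exactly once:", bool((test_counts == 1).all()))
```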


**3. Stratified K-Fold Cross-Validation**

**Description:**
- Like K-Fold but preserves class proportions in each fold.
- Especially useful for imbalanced datasets.

In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = SVC(kernel='linear')

# Stratified K-Fold (K=5)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

print("Stratified K-Fold CV Scores:", scores)
print("Mean Accuracy:", scores.mean())

Stratified K-Fold CV Scores: [0.94736842 0.92982456 0.95614035 0.93859649 0.96460177]
Mean Accuracy: 0.9473063188945815
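
To see what "preserves class proportions" means in practice, the following sketch prints the fraction of class `1` in each test fold for plain `KFold` and for `StratifiedKFold` on the same data. The helper name `positive_fractions` is made up for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

def positive_fractions(splitter):
    """Fraction of class-1 samples in each test fold (illustrative helper)."""
    return [y[test_idx].mean() for _, test_idx in splitter.split(X, y)]

kfold_fracs = positive_fractions(KFold(n_splits=5, shuffle=True, random_state=42))
strat_fracs = positive_fractions(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)

print(f"Overall class-1 fraction: {y.mean():.3f}")
print("KFold test folds:        ", [f"{f:.3f}" for f in kfold_fracs])
print("StratifiedKFold folds:   ", [f"{f:.3f}" for f in strat_fracs])
```

The stratified fractions should all sit very close to the overall fraction, while the plain K-Fold fractions are free to drift.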


**4. Leave-One-Out Cross-Validation (LOOCV)**

**Description:**
- Uses 1 data point for testing and the rest for training.
- Repeats for each data point.
- Computationally expensive (one model fit per data point), but uses nearly all of the data for training in every round.

In [5]:
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
y = (y > y.mean()).astype(int)  # convert to classification
model = DecisionTreeClassifier()

# LOOCV
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

print("LOOCV Mean Accuracy:", scores.mean())

LOOCV Mean Accuracy: 0.6561085972850679
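
The cost of LOOCV follows directly from the number of splits, which equals the number of samples. A small sketch, assuming the same diabetes data as above, confirms this and shows that `LeaveOneOut` produces the same splits as an unshuffled `KFold` with one fold per sample.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold, LeaveOneOut

X, _ = load_diabetes(return_X_y=True)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)  # one split (and one model fit) per sample
test_sizes = {len(test_idx) for _, test_idx in loo.split(X)}

# Unshuffled KFold with n_splits == n_samples yields the same test indices
kf = KFold(n_splits=len(X))
same_splits = all(
    (loo_test == kf_test).all()
    for (_, loo_test), (_, kf_test) in zip(loo.split(X), kf.split(X))
)

print("Samples:", len(X), "| LOOCV splits:", n_splits)
print("Distinct test-set sizes:", test_sizes)
print("Same splits as KFold(n_splits=n_samples):", same_splits)
```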


**5. ShuffleSplit Cross-Validation**
 
**Description:**
- Draws independent random train/test splits; unlike K-Fold, test sets from different splits may overlap.
- The number of splits and the test size are set independently of each other.

In [6]:
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_digits(return_X_y=True)
model = KNeighborsClassifier()

# ShuffleSplit: 10 splits, 30% test size
ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
scores = cross_val_score(model, X, y, cv=ss)

print("ShuffleSplit CV Scores:", scores)
print("Mean Accuracy:", scores.mean())

ShuffleSplit CV Scores: [0.99259259 0.98888889 0.98333333 0.98888889 0.98148148 0.98703704
 0.98518519 0.98333333 0.98333333 0.97407407]
Mean Accuracy: 0.9848148148148148
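
Because each `ShuffleSplit` iteration is drawn independently, a sample can appear in several test sets or in none. The sketch below counts how often each row of the digits data is used for testing; the `times_tested` array is a helper added only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit

X, _ = load_digits(return_X_y=True)
ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)

times_tested = np.zeros(len(X), dtype=int)
for _, test_idx in ss.split(X):
    times_tested[test_idx] += 1  # indices are unique within one split

print("Rows never tested:       ", int((times_tested == 0).sum()))
print("Rows tested more than once:", int((times_tested > 1).sum()))
print("Most-tested row appears in", int(times_tested.max()), "test sets")
```

This is the main practical difference from K-Fold, where every row is tested exactly once.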


**6. TimeSeriesSplit Cross-Validation**

**Description:**
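- Used for time-series data, where shuffling would leak future information into training.
- Each split trains on earlier observations and tests on later ones, so the training data always precedes the test data.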

In [7]:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer

# Synthetic time-series data
X = np.arange(100).reshape(-1, 1)
y = X.flatten() + np.random.normal(0, 5, 100)

model = RandomForestRegressor()

# TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv, scoring=make_scorer(mean_squared_error))

print("TimeSeriesSplit CV MSE:", scores)
print("Mean MSE:", scores.mean())

TimeSeriesSplit CV MSE: [313.61490076 117.99026432 176.20466935 162.60255264  25.79126537]
Mean MSE: 159.2407304885267
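
Printing the index ranges makes the ordering constraint visible. This sketch reuses the same 100-row synthetic index as above (no model is fit) and shows that the training window grows with each split while the test block always comes after it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # row index doubles as the time step
tscv = TimeSeriesSplit(n_splits=5)

splits = list(tscv.split(X))
for i, (train_idx, test_idx) in enumerate(splits, start=1):
    print(
        f"Split {i}: train rows {train_idx[0]}-{train_idx[-1]} "
        f"({len(train_idx)} rows), "
        f"test rows {test_idx[0]}-{test_idx[-1]} ({len(test_idx)} rows)"
    )
```

Note that the earliest splits train on relatively few rows, so their scores can be noisier than the later ones.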


### Visual Summary: Hyperparameter Tuning vs Cross-Validation Score

| **Feature**             | **Hyperparameter Tuning**                       | **Cross-Validation Score**              |
|-------------------------|--------------------------------------------------|------------------------------------------|
| **Purpose**             | Improve model via best hyperparameters          | Estimate generalization performance      |
| **Output**              | Best parameters and the model                   | Accuracy, F1, AUC, etc. (averaged)       |
| **Method**              | Grid search, random search, Bayesian optimization (e.g., Optuna, Hyperopt) | K-Fold, Stratified, TimeSeriesSplit      |
| **Tools**               | `GridSearchCV`, `RandomizedSearchCV`, etc.      | `cross_val_score`; also used internally during tuning |
| **Computational Cost**  | Higher (tries multiple models)                  | Lower (only model evaluation)            |

In [8]:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Load data
X, y = load_iris(return_X_y=True)

# Hyperparameter grid
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# Grid Search with Cross-Validation (cv=5)
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)

# Best params and score (CV used internally)
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Score:", grid.best_score_)

Best Parameters: {'C': 1, 'kernel': 'linear'}
Best Cross-Validation Score: 0.9800000000000001
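
One caveat about the number above: `best_score_` is the highest score over the grid, measured on the same folds that were used to choose the parameters, so it tends to be slightly optimistic. A common way to combine both columns of the table is nested cross-validation, sketched below under the assumption that 5 inner and 5 outer folds are acceptable for this small dataset. The variable names `inner_search` and `nested_scores` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# Inner loop: chooses hyperparameters using only the outer training folds
inner_search = GridSearchCV(SVC(), param_grid, cv=5)

# Outer loop: scores the whole tuning procedure on folds it never saw
nested_scores = cross_val_score(inner_search, X, y, cv=5)

print("Nested CV scores:", nested_scores)
print("Mean nested accuracy:", nested_scores.mean())
```

The outer scores estimate how well the tuned model is likely to generalize, at the cost of running the grid search once per outer fold.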
