# **Cross Validation**

**DEFINITION**: Cross-validation is a model validation technique used to estimate how well a machine learning model will perform on unseen data by repeatedly splitting the dataset into training and testing subsets. This helps evaluate the model's generalization ability and detect overfitting.

### **Advantages**
- It tests the model on several different subsets of the data, which gives a more reliable estimate of how well it generalizes to unseen data.
- Cross-validation is widely used in hyperparameter tuning (e.g., GridSearchCV, RandomizedSearchCV) to evaluate different hyperparameter combinations systematically and select the best-performing set.
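
As a minimal sketch of the hyperparameter-tuning use case, the example below runs `GridSearchCV` with 5-fold cross-validation on the iris dataset. The parameter grid (three values of `C`) and the choice of `LogisticRegression` are assumptions made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hypothetical grid: three regularization strengths for C
param_grid = {"C": [0.1, 1.0, 10.0]}

# cv=5 runs 5-fold cross-validation for every candidate in the grid
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

Each candidate is fitted once per fold, so this grid costs 3 x 5 = 15 fits plus one final refit on the full data.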

### **Disadvantages**  

- Cross-validation is computationally expensive because the model is trained K times in K-Fold Cross-Validation, requiring significantly more computation than a simple train-test split. This makes it time-consuming, especially for large datasets or complex models like Random Forest, XGBoost, or Neural Networks, where training multiple times on different subsets increases execution time.



### **Notes**

**Not Suitable for Time-Series Data Without Modification**  
   - Traditional K-Fold Cross-Validation with shuffling mixes past and future observations, which is **not suitable for time-series forecasting**: the model must never be trained on data that comes after the period it is tested on.  
   - Instead, **Time-Series Cross-Validation (Rolling Window)** should be used.  

**Variance in Scores**  
   - If the dataset is **highly imbalanced or small**, performance scores can vary significantly across folds, leading to instability in evaluation.  
   - **Stratified K-Fold** helps mitigate this issue in classification tasks.
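
One source of this instability is that plain K-Fold can put very different numbers of minority-class samples in each fold. The sketch below only inspects fold composition (no model is trained); the label vector with 10 positives among 200 samples is a hypothetical example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 10 positives among 200 samples
rng = np.random.default_rng(0)
y = np.array([1] * 10 + [0] * 190)
rng.shuffle(y)
X = np.zeros((len(y), 1))  # features are irrelevant for inspecting the splits


def positives_per_fold(splitter):
    """Count positive labels in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


kfold_counts = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
stratified_counts = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("Positives per test fold (KFold):          ", kfold_counts)
print("Positives per test fold (StratifiedKFold):", stratified_counts)
```

With stratification every test fold receives exactly 2 of the 10 positives, while the plain K-Fold counts depend on the shuffle.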

### **Why Random K-fold Cross-Validation Can Be Misleading in Real-World Problems**

- **Ref**: https://medium.com/towards-data-science/why-you-should-never-use-cross-validation-4360d42456ac
1. **Overly Optimistic Estimates**  
   - Cross-validation assumes that the training and test data have the same distribution, which is **rare in real-world scenarios**.  
   - This can lead to **overestimating** model performance.  

2. **Fails When Data Distribution Changes**  
   - In real applications, new data often differs based on **time, location, or business factors**.  
   - Standard cross-validation **does not account for this shift**, making it unreliable.  

3. **Encourages Overfitting**  
   - Since cross-validation trains and tests on similar distributions, it **favors models that memorize patterns** instead of generalizing well.  
   - This can lead to selecting a model that **performs well in validation but fails in production**.  

4. **Better Alternative: Group K-Fold**  
   - Instead of randomly splitting data, **Group K-Fold** ensures that test data represents unseen groups (e.g., cities, months).  
   - This provides a **more realistic performance estimate** and helps in choosing the right model.  
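
The gap between random and group-aware splitting can be illustrated with synthetic data. In the sketch below, every group has its own feature "fingerprint" and its own target offset, and the offset is unrelated to the features, so nothing learned from one group transfers to another. The data-generating process, the group sizes, and the use of `KNeighborsRegressor` are all assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n_groups, per_group = 20, 10
groups = np.repeat(np.arange(n_groups), per_group)

# Group-specific feature centers and target offsets, drawn independently
centers = rng.normal(size=(n_groups, 5))
offsets = rng.normal(size=n_groups)
X = centers[groups] + 0.01 * rng.normal(size=(len(groups), 5))
y = offsets[groups] + 0.1 * rng.normal(size=len(groups))

model = KNeighborsRegressor(n_neighbors=3)

# Random K-Fold: samples of the same group appear in both train and test
random_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)
# Group K-Fold: test groups are never seen during training
group_scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5), groups=groups)

print("Random K-Fold mean R^2:", random_scores.mean())
print("Group K-Fold mean R^2: ", group_scores.mean())
```

Random K-Fold rewards the model for recognizing groups it has already seen, so its score is much higher than the score on genuinely unseen groups.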

#### **Conclusion:**  
Cross-validation works well for general evaluation but **can be misleading when data shifts over time or across categories**. In real-world scenarios, use **Group K-Fold, Time Series Split, or other domain-specific validation strategies** to ensure better model selection and avoid over-optimistic results.

## **Types of cross-validation**

- **Ref**: https://www.turing.com/kb/different-types-of-cross-validations-in-machine-learning-and-their-explanations

## **1. K-Fold Cross-Validation**
### **Concept**
- The dataset is split into **K folds of (approximately) equal size**, usually after shuffling.  
- The model is trained on **K-1 folds** and tested on the **remaining fold**.  
- This process repeats **K times**, ensuring each fold is used as a test set once.  
- The **final performance score** is the **average of all K iterations**.  

**Best for:** Large, balanced datasets.  
**Not ideal for:** Imbalanced datasets.  

### **Example Walkthrough**
**Scenario:** Predicting student grades.  
- **Dataset:** 5 students → **A, B, C, D, E**  
- **K=5** (5 folds)  
- **Each student is used as a test set once, while the others train the model.**  

| **Iteration** | **Training Data (4 students)** | **Validation Data (1 student)** |
|--------------|----------------------------|----------------------------|
| **1** | B, C, D, E | A |
| **2** | A, C, D, E | B |
| **3** | A, B, D, E | C |
| **4** | A, B, C, E | D |
| **5** | A, B, C, D | E |

**Each fold is used once as a test set**  

### **Python Code**
```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Define K-Fold Cross-Validation (K=5)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Define model
model = LogisticRegression(max_iter=200)

# Perform cross-validation
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print("K-Fold Cross-validation scores:", scores)
print("Mean Accuracy:", scores.mean())
```
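
To connect the walkthrough table to the API, the following sketch prints the splits `KFold` produces for the five students. It leaves shuffling off so that the folds come out in the same order as the table; the student labels are just the placeholders from the walkthrough.

```python
import numpy as np
from sklearn.model_selection import KFold

students = np.array(["A", "B", "C", "D", "E"])

# shuffle=False (the default) assigns folds in order, matching the walkthrough table
kf = KFold(n_splits=5)

for i, (train_idx, val_idx) in enumerate(kf.split(students), start=1):
    train = students[train_idx].tolist()
    validation = students[val_idx].tolist()
    print(f"Iteration {i}: train={train}, validation={validation}")
```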

---

## **2. Stratified K-Fold Cross-Validation**
### **Concept**
- **Same as K-Fold**, but ensures that **each fold keeps approximately the same class distribution** as the original dataset.  
- This prevents folds that contain **few or no minority-class samples**, which matters for imbalanced problems (e.g., fraud detection, rare disease classification).  

**Best for:** **Imbalanced datasets** (e.g., fraud detection, medical diagnosis).  
**Not needed for:** **Already balanced datasets**.  

### **Example Walkthrough**
**Scenario:** Predicting disease presence (Imbalanced data: 3 Yes, 2 No).  
- **Dataset:** 5 patients → **A(Yes), B(No), C(Yes), D(Yes), E(No)**  
- **K=2** (2 folds)  
- **Each fold keeps the Yes/No ratio as close as possible to the original 3:2**  

| **Iteration** | **Training Data** | **Validation Data** |
|--------------|----------------------|----------------------|
| **1** | A(Yes), B(No), D(Yes) | C(Yes), E(No) |
| **2** | C(Yes), E(No) | A(Yes), B(No), D(Yes) |

**Each fold contains both classes in roughly the original proportion.**  

### **Python Code**
```python
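from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same dataset and model as in the K-Fold example
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Define stratified K-Fold: each fold preserves the class proportions of y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform stratified cross-validation
scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')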

print("Stratified K-Fold Cross-validation scores:", scores)
print("Mean Accuracy:", scores.mean())
```
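
The five-patient walkthrough can also be checked directly. This sketch feeds the walkthrough labels to `StratifiedKFold(n_splits=2)` and prints which patients land in each validation fold; the exact assignment of patients to folds may differ from the table, but both folds should contain Yes and No cases.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

patients = np.array(["A", "B", "C", "D", "E"])
labels = np.array(["Yes", "No", "Yes", "Yes", "No"])

skf = StratifiedKFold(n_splits=2)

for i, (train_idx, val_idx) in enumerate(skf.split(patients, labels), start=1):
    val_patients = patients[val_idx].tolist()
    val_labels = labels[val_idx].tolist()
    print(f"Iteration {i}: validation={val_patients}, labels={val_labels}")
```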

---

## **3. Group K-Fold Cross-Validation**
### **Concept**
- Similar to K-Fold, but **ensures that entire groups of related samples are kept in the same fold**.  
- Example: If you have multiple **records per patient**, **customer**, or **city**, this method prevents **data leakage**.  

**Best for:** **Grouped datasets (e.g., customers, cities, patients, sensors).**  
**Not ideal for:** **Completely independent observations.**  

### **Example Walkthrough**
**Scenario:** Predicting customer spending (Grouped by city).  
- **Dataset:** 6 customers from 3 cities →  
  **(A, B - New York), (C, D - LA), (E, F - Chicago)**  
- **K=3** (3 folds)  
- **Each fold keeps all customers from a city together**  

| **Iteration** | **Training Data (Cities)** | **Validation Data (City)** |
|--------------|----------------------------|----------------------------|
| **1** | LA, Chicago | New York |
| **2** | New York, Chicago | LA |
| **3** | New York, LA | Chicago |

**Prevents data leakage by keeping city-specific data in one fold.**  

### **Python Code**
```python
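import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Same dataset and model as in the K-Fold example
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# One group ID per sample (e.g., a customer ID). Iris has no natural groups,
# so this illustrative labeling puts every 10 consecutive samples in one group.
groups = np.arange(len(X)) // 10

# Define Group K-Fold: all samples of a group land in the same fold
gkf = GroupKFold(n_splits=5)

# Perform Group K-Fold cross-validation
scores = cross_val_score(model, X, y, cv=gkf, groups=groups, scoring='accuracy')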

print("Group K-Fold Cross-validation scores:", scores)
print("Mean Accuracy:", scores.mean())
```
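
The city walkthrough can be reproduced as a small sketch. It passes the city names as `groups` and checks that no city ever appears on both sides of a split; the order in which cities are held out may differ from the table.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

customers = np.array(["A", "B", "C", "D", "E", "F"])
cities = np.array(["New York", "New York", "LA", "LA", "Chicago", "Chicago"])

gkf = GroupKFold(n_splits=3)

for i, (train_idx, val_idx) in enumerate(gkf.split(customers, groups=cities), start=1):
    train_cities = set(cities[train_idx].tolist())
    val_cities = set(cities[val_idx].tolist())
    # A city never appears on both sides of a split
    assert train_cities.isdisjoint(val_cities)
    print(f"Iteration {i}: train={sorted(train_cities)}, validation={sorted(val_cities)}")
```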

---
## **4. Time Series Cross-Validation (Rolling Cross-Validation / Forward Chaining Method)**  
### **Concept**  
Before diving into the rolling cross-validation technique, it's important to understand what **time-series data** is.  

Time-series data consists of observations recorded **at different time points**, making it useful for understanding **patterns, trends, and seasonality**. Examples of time-series data include **stock prices, weather forecasts, economic indicators, and website traffic**.  

Unlike regular datasets, where data points are independent of each other, **time-series data has a sequential nature**, meaning the past influences the future.  

This characteristic makes standard cross-validation techniques unsuitable because **randomly splitting** time-series data (as done in shuffled K-Fold or Holdout CV) **breaks the chronological order**: the model ends up trained on observations from the future of its test set, which produces over-optimistic performance estimates.  

### **How Rolling Cross-Validation Works**  
Since the order of data is critical, **Time Series Cross-Validation (also known as Forward Chaining or Rolling Cross-Validation)** ensures that training always happens on **past data**, and validation happens on **future data**.  

Instead of randomly splitting data, the dataset is divided **sequentially** into training and validation sets. Each iteration **expands the training set** by including more past data and tests the model on the next available time step.  

#### **Steps:**  
1. **Start with a small initial training set** and use it to make predictions on the next available time step.  
2. **Expand the training set** by including that time step and move to the next period for validation.  
3. **Repeat the process** until all time points are used for training and validation.  

This method mimics **real-world forecasting**, where a model continuously updates as new data becomes available.  

### **Example Walkthrough**  
**Scenario:** Predicting monthly sales trends using historical sales data.  

- **Dataset:** Monthly sales data from **January to June**.  
- **Expanding Window Approach:** Each training set grows by one month, and the test set consists of the next time period.  

| **Iteration** | **Training Data (Months)** | **Validation Data (Next Month)** |  
|--------------|----------------------|----------------------|  
| **1** | January | February |  
| **2** | January, February | March |  
| **3** | January, February, March | April |  
| **4** | January, February, March, April | May |  
| **5** | January, February, March, April, May | June |  

Ensures the model **never trains on future data** to prevent data leakage.  

### **Python Code**  
```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Generate synthetic regression data; the row order stands in for time here
# (make_regression does not produce real temporal structure)
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)

# Define Time Series Split (5 splits)
tscv = TimeSeriesSplit(n_splits=5)

# Define model
model = LinearRegression()

# Perform time series cross-validation: train on earlier rows, test on later rows
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    r2 = model.score(X_test, y_test)  # score() returns R^2 for regressors, not accuracy
    print(f"Test rows: {test_index[0]}-{test_index[-1]}, R^2: {r2:.4f}")
```
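
To tie the January-to-June walkthrough to `TimeSeriesSplit`, the sketch below prints the splits for six monthly observations. The second loop is an addition beyond the walkthrough: it uses the `max_train_size` parameter to show a fixed-size window that keeps only the two most recent months (the window length of 2 is an arbitrary choice).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

months = np.array(["January", "February", "March", "April", "May", "June"])

# Expanding window: the training set grows by one month per iteration
expanding = TimeSeriesSplit(n_splits=5)
for i, (train_idx, val_idx) in enumerate(expanding.split(months), start=1):
    print(f"Expanding {i}: train={months[train_idx].tolist()}, "
          f"validation={months[val_idx].tolist()}")

# Fixed-size window: train only on the 2 most recent months
rolling = TimeSeriesSplit(n_splits=5, max_train_size=2)
for i, (train_idx, val_idx) in enumerate(rolling.split(months), start=1):
    print(f"Rolling {i}: train={months[train_idx].tolist()}, "
          f"validation={months[val_idx].tolist()}")
```

In both variants the validation month always comes after every training month.
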
### **Key Advantages of Time Series Cross-Validation**
- **Maintains chronological order:** prevents data leakage from future events.  
- **Mimics real-world forecasting:** models train on past data and predict future values.  
- **Tests robustness over time:** shows how well the model generalizes from one period to the next.  

### **When to Use It?**  
**Best for:** Time-series forecasting tasks such as **stock market predictions, weather forecasting, energy consumption forecasting, and sales predictions.**  
**Not suitable for:** Datasets with **independent observations** where order does not matter.  

---
## **5. Holdout Cross-Validation (Train-Test Split)**  
### **Concept**  
- The dataset is **randomly split** into **training (e.g., 70%) and testing (e.g., 30%) sets**.  
- The model is trained on the training set and tested on the test set.  
- This method is **fast**, but results depend on **how the split is made**.  

**Best for:** Large datasets, quick evaluation  
**Not ideal for:** Small datasets (risk of poor train-test distribution)  

### **Example Walkthrough**  
**Scenario:** Predicting movie ratings  
- **Dataset:** 10 movies  
- **Split:** 7 movies for training, 3 for testing  

| **Training Data** | **Validation Data** |  
|------------------|------------------|  
| Movie 1 - Movie 7 | Movie 8 - Movie 10 |  

**Simple but depends on the split**  

### **Python Code**  
```python
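from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Classification data and model (the time-series example above uses regression data)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Split dataset into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model.fit(X_train, y_train)

# Evaluate on test set
accuracy = model.score(X_test, y_test)
print("Test Accuracy:", accuracy)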
```
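
To see how much a holdout estimate can depend on the split, the sketch below repeats the same 70/30 split with several random seeds and reports the spread of test accuracy. The seeds 0 to 4 are arbitrary and used only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(5):  # hypothetical seeds, chosen only for illustration
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

print("Accuracy per split:", [round(s, 3) for s in scores])
print("Spread (max - min):", round(max(scores) - min(scores), 3))
```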

---

## **6. Leave-P-Out Cross-Validation (LPOCV)**  
### **Concept**  
- **P samples are left out for validation**, while the remaining **N-P samples** are used for training.  
- The process is repeated **for every possible combination** of test samples.  
- Computationally expensive for large datasets.  

**Best for:** Small datasets  
**Not ideal for:** Large datasets (too slow)  

### **Example Walkthrough**  
**Scenario:** Predicting employee performance  
- **Dataset:** 5 employees **(A, B, C, D, E)**  
- **P=2** (leave 2 out for validation), giving C(5, 2) = **10 iterations**  

| **Iteration** | **Training Data (3 employees)** | **Validation Data (2 employees)** |  
|--------------|----------------------|----------------------|  
| **1** | C, D, E | A, B |  
| **2** | B, D, E | A, C |  
| **3** | B, C, E | A, D |  
| **4** | B, C, D | A, E |  
| **5** | A, D, E | B, C |  
| **6** | A, C, E | B, D |  
| **7** | A, C, D | B, E |  
| **8** | A, B, E | C, D |  
| **9** | A, B, D | C, E |  
| **10** | A, B, C | D, E |  

**Exhaustive but slow for large datasets**  

### **Python Code**  
```python
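from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePOut, cross_val_score

# Classification data and model, subsampled to 15 rows: with all 150 rows,
# P=2 would require C(150, 2) = 11,175 model fits
X, y = load_iris(return_X_y=True)
X, y = X[::10], y[::10]
model = LogisticRegression(max_iter=200)

# Define Leave-P-Out cross-validation (P=2)
lpo = LeavePOut(p=2)

# Perform cross-validation (C(15, 2) = 105 fits)
scores = cross_val_score(model, X, y, cv=lpo, scoring='accuracy')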

print("Leave-P-Out Cross-validation scores:", scores)
print("Mean Accuracy:", scores.mean())
```
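
The cost of Leave-P-Out comes from the number of combinations, N choose P. The sketch below enumerates the splits for the five employees from the walkthrough and then prints how the count grows for larger N; the values of N in the last loop are arbitrary examples.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

employees = np.array(["A", "B", "C", "D", "E"])
lpo = LeavePOut(p=2)

# The number of iterations is "N choose P"
print("Splits for N=5, P=2:", lpo.get_n_splits(employees))
for train_idx, val_idx in lpo.split(employees):
    print("train:", employees[train_idx].tolist(),
          "validation:", employees[val_idx].tolist())

# The count grows quickly with N
for n in (20, 100, 1000):
    print(f"N={n}, P=2 -> {comb(n, 2)} model fits")
```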

---

## **7. Leave-One-Out Cross-Validation (LOOCV)**  
### **Concept**  
- A **special case of Leave-P-Out (P=1)**, where **only 1 sample is used as validation** at a time.  
- The model is trained on **N-1 samples** and tested on **1 sample** in each iteration.  
- **Very computationally expensive** (N model fits), but uses **almost all available data (N-1 samples)** for training in each iteration.  

**Best for:** Small datasets  
**Not ideal for:** Large datasets (too slow, too many iterations)  

### **Example Walkthrough**  
**Scenario:** Predicting student's final exam score  
- **Dataset:** 5 students **(A, B, C, D, E)**  
- **Each student is tested once, trained on the rest**  

| **Iteration** | **Training Data (4 students)** | **Validation Data (1 student)** |  
|--------------|----------------------|----------------------|  
| **1** | B, C, D, E | A |  
| **2** | A, C, D, E | B |  
| **3** | A, B, D, E | C |  
| **4** | A, B, C, E | D |  
| **5** | A, B, C, D | E |  

**Good for small datasets, but slow**  

### **Python Code**  
```python
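from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Classification data and model (150 samples, so 150 model fits)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Define LOOCV
loo = LeaveOneOut()

# Perform cross-validation; each score is 0 or 1 because every test set has one sample
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')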

print("Leave-One-Out Cross-validation scores:", scores)
print("Mean Accuracy:", scores.mean())
```
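
LOOCV produces the same splits as unshuffled K-Fold with K equal to the number of samples, which is why the walkthrough table looks like the K=5 example from the K-Fold section. A small sketch with the five students illustrates this:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

students = np.array(["A", "B", "C", "D", "E"])

# Validation sets from LOOCV and from K-Fold with K = N (no shuffling)
loo_val = [students[v].tolist() for _, v in LeaveOneOut().split(students)]
kfold_val = [
    students[v].tolist() for _, v in KFold(n_splits=len(students)).split(students)
]

print("LOOCV validation sets:        ", loo_val)
print("K-Fold (K=N) validation sets: ", kfold_val)
```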

---

## **8. Monte Carlo Cross-Validation (Shuffle-Split)**  
### **Concept**  
- The dataset is **randomly split into train/test sets multiple times** (e.g., 70-30 split).  
- Unlike K-Fold, each split is **drawn independently at random**, so validation sets can overlap and some samples may never be validated.  
- The results are **averaged** across iterations.  

**Best for:** Large datasets  
**Not ideal for:** Time-series data (random splits break time dependencies)  

### **Example Walkthrough**  
**Scenario:** Predicting election outcomes  
- **Dataset:** 10 states  
- **Train-Test Split:** 7 states for training, 3 for testing (random split each time)  

| **Iteration** | **Training Data (7 states)** | **Validation Data (3 states)** |  
|--------------|----------------------|----------------------|  
| **1** | A, B, C, D, E, F, G | H, I, J |  
| **2** | B, D, E, F, G, H, I | A, C, J |  

**Repeated random splits reduce dependence on any single split**  

### **Python Code**  
```python
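from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Classification data and model
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Define Shuffle-Split Cross-Validation: 5 independent random 70/30 splits
ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

# Perform cross-validation
scores = cross_val_score(model, X, y, cv=ss, scoring='accuracy')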

print("Monte Carlo Cross-validation scores:", scores)
print("Mean Accuracy:", scores.mean())
```
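
Because each split is drawn independently, a sample can land in the validation set several times or not at all. The sketch below applies `ShuffleSplit` to the ten states from the walkthrough and counts how often each state is validated; which states are drawn depends on `random_state`.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import ShuffleSplit

states = np.array(list("ABCDEFGHIJ"))
ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=42)

times_validated = Counter()
for i, (train_idx, val_idx) in enumerate(ss.split(states), start=1):
    val_states = sorted(states[val_idx].tolist())
    times_validated.update(val_states)
    print(f"Iteration {i}: validation={val_states}")

# Splits are drawn independently, so counts can differ between states
print("Times each state was validated:", dict(times_validated))
```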