Cross-validation is a statistical technique for evaluating how well a machine learning model generalizes. It partitions the dataset into multiple subsets, or "folds," and repeatedly trains the model on some folds while testing it on the others, which helps reveal overfitting or underfitting. The main goal of cross-validation is to estimate how well the model will perform on unseen data.

## **Applications of Cross-Validation**
- **Model Evaluation**: Provides a robust estimate of a model's performance by testing it on multiple data subsets.
- **Hyperparameter Tuning**: Identifies the best hyperparameters by assessing model performance across different configurations.
- **Model Comparison**: Ensures fair evaluation of different models by using the same data partitions.
- **Detecting Overfitting**: Reveals overfitting by validating the model on data it was not trained on; a large gap between training and validation scores is the usual warning sign.
- **Feature Selection**: Assesses the impact of different features on model performance to determine their relevance.
- **Estimating Model Stability**: Evaluates how consistent a model’s performance is across different data subsets.
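
The model-comparison and stability points above can be illustrated with a minimal sketch. It assumes the iris dataset and two arbitrarily chosen classifiers (neither choice comes from the text above), and it passes one seeded `KFold` object to `cross_val_score` so that both models are scored on identical partitions. The standard deviation of the fold scores gives a rough indication of stability.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One splitter with a fixed seed, so both models see exactly the same folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Illustrative candidates only; any scikit-learn estimators could be used here
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The same `cv` object could also be passed to `GridSearchCV` for hyperparameter tuning, so that every configuration is scored on the same folds.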


In [74]:
## Let's discuss and code each type of cross-validation

**Leave-One-Out Cross-Validation (LOOCV)**

**Description**:  
In LOOCV, a single observation is used as the validation set, and the remaining observations form the training set. This process is repeated for each observation in the dataset.
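
To make the splitting pattern concrete, here is a small sketch on a hypothetical four-row array (`X_toy` is an illustrative name, not part of the notebook's data). It should produce one split per observation, each holding out exactly one row.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_toy = np.arange(8).reshape(4, 2)  # 4 observations, 2 features

loo_splits = list(LeaveOneOut().split(X_toy))
for train_idx, val_idx in loo_splits:
    print(f"train={train_idx.tolist()}  validate={val_idx.tolist()}")

# train=[1, 2, 3]  validate=[0]
# train=[0, 2, 3]  validate=[1]
# train=[0, 1, 3]  validate=[2]
# train=[0, 1, 2]  validate=[3]
```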

**Advantages**:
- Uses as much data as possible for training.
- Low bias since almost all data points are used for training.

**Disadvantages**:
- Computationally expensive, especially for large datasets.
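- High variance in the performance estimate: each validation set is a single observation, and the training sets overlap almost entirely, so the per-fold results are highly correlated.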

**Use Cases**:
- Small datasets where retaining more data for training is crucial.
- When a nearly unbiased performance estimate is needed and computational cost is not a constraint.

In [5]:
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

In [6]:
# Load data
data = load_iris()
X, y = data.data, data.target

In [7]:
# Initialize the model
model = LogisticRegression(max_iter=200)

In [8]:
# Initialize LOOCV
loo = LeaveOneOut()
accuracy = []

In [9]:
# Perform LOOCV
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate the model
    accuracy.append(accuracy_score(y_test, y_pred))

In [10]:
# Calculate average accuracy
print(f'Average Accuracy: {sum(accuracy)/len(accuracy):.4f}')

Average Accuracy: 0.9667


**Hold-Out Cross-Validation**

**Description**:  
The dataset is randomly split into two parts: a training set and a validation set. The model is trained on the training set and evaluated on the validation set.

**Advantages**:
- Simple to implement.
- Fast, since the model is trained and validated only once.

**Disadvantages**:
- High variance as the results depend heavily on the split.
- Potentially misleading performance estimate if the split does not represent the overall dataset well.
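
The dependence on the split can be checked empirically. The sketch below is one way to do it, assuming the iris dataset and a logistic regression model (both illustrative choices): it repeats the same 80/20 hold-out evaluation with ten different seeds and reports the spread of the resulting scores.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

split_scores = []
for seed in range(10):
    # Only the seed changes between iterations; model and data stay the same
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_scores.append(accuracy_score(y_val, clf.predict(X_val)))

print(f"min={min(split_scores):.3f}  max={max(split_scores):.3f}  "
      f"std={np.std(split_scores):.3f}")
```

Any difference between the minimum and maximum comes purely from which rows landed in the validation set.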

**Use Cases**:
- Large datasets where a single split is representative.
- Quick evaluation of models.


In [45]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

In [46]:
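# Generate some example data
# Note: the labels are random and unrelated to the features, so accuracy near 0.5 (chance) is expected;
# no seed is set here, so the exact value changes from run to run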
X = np.random.rand(100, 10)  # 100 samples, 10 features
y = np.random.randint(0, 2, 100)  # Binary target variable

In [47]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [48]:
# Initialize and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [49]:
# Make predictions
y_pred = model.predict(X_test)

In [50]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.60


**K-Fold Cross-Validation**

**Description**:  
The dataset is divided into K subsets or "folds." The model is trained on K-1 folds and validated on the remaining fold. This process is repeated K times, with each fold used exactly once as the validation set.
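
The "exactly once" property can be verified directly. The following sketch uses a hypothetical ten-sample index array with an unshuffled `KFold` and counts how often each sample lands in a validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples_demo = 10
X_idx = np.arange(n_samples_demo).reshape(-1, 1)

kf_demo = KFold(n_splits=5)  # no shuffling, so folds are contiguous blocks
validation_counts = np.zeros(n_samples_demo, dtype=int)

for fold, (train_idx, val_idx) in enumerate(kf_demo.split(X_idx), start=1):
    validation_counts[val_idx] += 1
    print(f"fold {fold}: train on {len(train_idx)} samples, validate on {val_idx.tolist()}")

# Every sample is validated exactly once across the 5 folds
print(validation_counts.tolist())  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```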

**Advantages**:
- Lower variance than a single hold-out split, because the score is averaged over K validation folds and every data point is validated exactly once.
- Provides a more comprehensive evaluation of the model.

**Disadvantages**:
- Computationally more expensive than hold-out validation, since the model is trained K times.
- Can be prohibitively slow for very large datasets or models that are expensive to train.

**Use Cases**:
- Small to medium-sized datasets, where training K models is affordable.
- To obtain a reliable estimate of model performance.


In [62]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

In [63]:
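# Generate some example data
# Note: the labels are random and unrelated to the features, so accuracy near 0.5 (chance) is expected;
# no seed is set here, so the exact value changes from run to run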
X = np.random.rand(100, 10)  # 100 samples, 10 features
y = np.random.randint(0, 2, 100)  # Binary target variable

In [64]:
# Initialize the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

In [65]:
# Define the number of folds
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

In [66]:
# Lists to store results
fold_accuracies = []

In [67]:
# Perform K-Fold Cross-Validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    fold_accuracies.append(accuracy)

In [68]:
# Calculate the average accuracy across all folds
average_accuracy = np.mean(fold_accuracies)
print(f'Average Accuracy: {average_accuracy:.2f}')

Average Accuracy: 0.51


**Stratified K-Fold Cross-Validation**

**Description**:  
Stratified K-Fold Cross-Validation is similar to K-Fold, but it ensures that each fold has the same proportion of classes as the entire dataset. This is particularly useful for imbalanced datasets.
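
The difference from plain K-Fold is easiest to see on deliberately imbalanced labels. The sketch below assumes a hypothetical label vector with 90 negatives followed by 10 positives and counts the positives that fall into each validation fold under both splitters, without shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((len(y_imb), 1))  # features are irrelevant for splitting

def positives_per_fold(splitter):
    """Count positive-class samples in each validation fold."""
    return [int(y_imb[val_idx].sum()) for _, val_idx in splitter.split(X_imb, y_imb)]

plain_counts = positives_per_fold(KFold(n_splits=5))
stratified_counts = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_counts)       # [0, 0, 0, 0, 10]
print("StratifiedKFold:", stratified_counts)  # [2, 2, 2, 2, 2]
```

With the unshuffled `KFold`, four of the five validation folds contain no positive samples at all, so per-fold metrics for the minority class would be meaningless there.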

**Advantages**:
- Maintains the proportion of classes in each fold.
- More accurate performance estimation for imbalanced datasets.

**Disadvantages**:
- Slightly more complex than regular K-Fold: the class labels are needed to build the folds, so it applies to classification targets only.
- Computationally more expensive than hold-out cross-validation.

**Use Cases**:
- Imbalanced datasets where maintaining class distribution is important.
- Situations where model evaluation needs to consider class proportions.


In [69]:
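from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

In [70]:
# Initialize the model (same configuration as in the K-Fold section, made explicit here)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Initialize Stratified K-Fold with 5 folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy = []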

In [71]:
# Load data
data = load_iris()
X, y = data.data, data.target

In [72]:
# Perform Stratified K-Fold CV
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate the model
    accuracy.append(accuracy_score(y_test, y_pred))

In [73]:
# Calculate average accuracy
print(f'Average Accuracy: {sum(accuracy)/len(accuracy):.4f}')

Average Accuracy: 0.9467


**Time Series Cross-Validation**

**Description**:  
Time Series Cross-Validation is used for time series data, where the temporal order of the data must be preserved. It involves using past data to predict future data by progressively expanding the training set.
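
The expanding training window can be seen by printing the indices that `TimeSeriesSplit` produces. This sketch uses a hypothetical series of 12 time-ordered observations and 3 splits; in every fold, all training indices come before all validation indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_time = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

ts_splits = list(TimeSeriesSplit(n_splits=3).split(X_time))
for fold, (train_idx, val_idx) in enumerate(ts_splits, start=1):
    print(f"fold {fold}: train={train_idx.tolist()}  validate={val_idx.tolist()}")

# fold 1: train=[0, 1, 2]  validate=[3, 4, 5]
# fold 2: train=[0, 1, 2, 3, 4, 5]  validate=[6, 7, 8]
# fold 3: train=[0, 1, 2, 3, 4, 5, 6, 7, 8]  validate=[9, 10, 11]
```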

**Advantages**:
- Respects the temporal order of data.
- Suitable for time series forecasting tasks.

**Disadvantages**:
- Data cannot be shuffled, so every training set is drawn from the earliest part of the series, which limits the variability of the training data.
- May result in less training data for early folds.

**Use Cases**:
- Time series data where temporal order is critical.
- Forecasting tasks where training on past data to predict future data is necessary.


In [51]:
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit

In [52]:
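# Generate some example time series data
# Note: X is random noise with no relationship to y, so the model cannot learn the trend;
# this data only demonstrates the splitting mechanics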
np.random.seed(42)
n_samples = 100
X = np.random.rand(n_samples, 1)  # 100 samples, 1 feature
y = np.sin(np.linspace(0, 10, n_samples)) + np.random.randn(n_samples) * 0.1  # Example target variable

In [53]:
# Initialize the model
model = RandomForestRegressor(n_estimators=100, random_state=42)

In [54]:
# Define the number of splits
n_splits = 5
tscv = TimeSeriesSplit(n_splits=n_splits)

In [55]:
# Lists to store results
fold_mse = []

In [56]:
# Perform Time Series Cross-Validation
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    fold_mse.append(mse)

In [57]:
# Calculate the average Mean Squared Error across all folds
average_mse = np.mean(fold_mse)
print(f'Average Mean Squared Error: {average_mse:.2f}')

Average Mean Squared Error: 0.83
