# Types of Cross Validation

## Commands

* `GridSearchCV`
  * Used for hyperparameter tuning to find the optimal parameters for a model.
* `RandomizedSearchCV`
  * An alternative method for hyperparameter tuning that evaluates a fixed number of randomly sampled parameter combinations instead of trying every combination.
* `train_test_split(random_state=<value>)`
  * Used to split datasets; changing the `random_state` value results in different data splits and potentially different accuracy scores.
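
As a minimal sketch of how these commands fit together, the example below tunes one hyperparameter with both search methods. The synthetic dataset, the choice of `LogisticRegression`, and the candidate values for `C` are assumptions made purely for illustration and are not taken from these notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical candidate values for the regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Exhaustive search: all 4 values of C are scored with 5-fold cross validation.
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)

# Randomized search: only n_iter=2 candidates are sampled from the same grid.
rand = RandomizedSearchCV(
    LogisticRegression(max_iter=1000), param_grid, n_iter=2, cv=5, random_state=0
)
rand.fit(X_train, y_train)

print("Grid search:", grid.best_params_, round(grid.best_score_, 3))
print("Randomized search:", rand.best_params_, round(rand.best_score_, 3))
```

Randomized search trades completeness for speed: it may miss the best value, but its cost is fixed by `n_iter` rather than by the size of the grid.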

## Summary

* **Data Splitting Necessity**: Data is initially divided into **training** and **test** sets. The training set is further split into **train** and **validation** sets to perform model training and hyperparameter tuning.
* **Purpose of Cross Validation**: It addresses the issue of varying model accuracy caused by different random states during data splitting. It allows data scientists to determine the **average accuracy** of a model.
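* **Leave One Out CV (LOOCV)**: Uses a single record for validation and the rest for training, repeated once per record. It is computationally expensive, and because every experiment trains on almost the same data and validates on just one record, the resulting accuracy estimate can have **high variance** and may not reflect performance on new data.
* **Leave P Out CV**: Similar to LOOCV but reserves **P** records for validation (e.g., 10, 20, 30) instead of just one. The exhaustive form evaluates every possible group of P records, so the number of experiments grows very quickly.
* **K-Fold CV**: Splits the dataset into **K** roughly equal parts (folds). The model is trained K times, using one fold for validation and the rest for training each time, and the K scores are averaged.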
* **Stratified K-Fold CV**: An improvement on K-Fold that ensures the **validation data** maintains the same proportion of target class labels (e.g., binary outputs) as the original dataset, preventing imbalanced validation sets.
* **Time Series CV**: Designed for time-dependent data (e.g., product reviews). It splits data chronologically (e.g., Day 1-4 for train, Day 5 for validation) rather than randomly, preserving the temporal sequence.

## Introduction to Model Evaluation

### Data Splitting Strategy
When training a machine learning model, the dataset is typically divided into two primary parts:
* **Training Data**: Used to train the model.
* **Test Data**: Used exclusively to check the performance of the model on **new data**. This data is never shown to the model during the training phase.

### Hyperparameter Tuning and Validation
To optimize the model, the **training data** is further split into **train** and **validation** sets.
* **Validation Data**: This subset is used for **hyperparameter tuning** (trying different parameter settings) and for validating the model during the training process.
* **Performance Metrics**: After training and validation are complete, the model's performance is evaluated on the test data using metrics such as **accuracy**, **precision**, and **recall** for classification, or **mean squared error** for regression.
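
The two-stage split described above can be sketched as follows. The 500-record synthetic dataset, the 60/20/20 proportions, and the `LogisticRegression` model are illustrative assumptions rather than choices made in these notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Step 1: hold out 20% of the records (100) as test data.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Step 2: split the remaining 400 records into train (300) and validation (100).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The validation score guides tuning; the test data is touched only at the end.
val_accuracy = accuracy_score(y_val, model.predict(X_val))
test_pred = model.predict(X_test)
test_metrics = {
    "accuracy": accuracy_score(y_test, test_pred),
    "precision": precision_score(y_test, test_pred),
    "recall": recall_score(y_test, test_pred),
}
print("Validation accuracy:", round(val_accuracy, 3))
print("Test metrics:", test_metrics)
```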

### The Need for Cross Validation
When splitting training data into train and validation sets, an important parameter called **random state** (the seed for the random shuffle) controls the split. Changing this value alters which records end up in which set, leading to fluctuating accuracy scores (e.g., 85% in one split, 92% in another). **Cross Validation** trains and validates the model on several different splits and reports the **average accuracy** across them, providing a more reliable measure of model performance than any single split.
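
A minimal sketch of this effect, assuming a synthetic dataset and a `LogisticRegression` model (neither comes from these notes): the loop scores the same model on five splits that differ only in `random_state`, and `cross_val_score` then produces one score per fold that can be averaged.

```python
from statistics import mean

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

# Same model, same data: only the random_state of the split changes.
single_split_scores = []
for seed in range(5):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    single_split_scores.append(model.fit(X_tr, y_tr).score(X_val, y_val))

# Cross validation: one accuracy score per fold, then the average.
cv_scores = cross_val_score(model, X, y, cv=5)

print("Single-split accuracies:", [round(s, 3) for s in single_split_scores])
print("Mean of single splits:", round(mean(single_split_scores), 3))
print("5-fold CV average accuracy:", round(cv_scores.mean(), 3))
```

The exact numbers depend on the data and model, so none are shown here; the point is that the individual scores differ while the average is a steadier summary.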

## Types of Cross Validation Techniques

### 1. Leave One Out Cross Validation (LOOCV)
In this technique, the model is trained and validated $N$ times, once for each record, where $N$ is the total number of records.
* **Process**: For every experiment, **one record** is used as the validation set, and the remaining records are used as the training set.
* **Experiments**: If the dataset has 500 records, 500 separate experiments are performed.
* **Disadvantages**:
    * **High Complexity**: As data size increases, the computational cost of training the model effectively multiplies by the number of records (e.g., 5000 records require 5000 experiments).
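    * **Unreliable Estimate**: Every experiment trains on nearly the entire dataset (N-1 records), so the N models are almost identical, and each validation score comes from a single record. The averaged score can therefore have high variance and may not match performance on new test data, which is often loosely described as overfitting.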
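
The LOOCV procedure above can be sketched with scikit-learn's `LeaveOneOut` splitter. The 50-record synthetic dataset and the `LogisticRegression` model are assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset: LOOCV trains one model per record.
X, y = make_classification(n_samples=50, n_features=5, random_state=0)

loo = LeaveOneOut()
n_experiments = loo.get_n_splits(X)  # one experiment per record

# Each score is 0.0 or 1.0 because every validation set holds one record.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

print("Experiments:", n_experiments)
print("Average accuracy:", round(scores.mean(), 3))
```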

### 2. Leave P Out Cross Validation
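This is a variation of LOOCV where **P** records are left out for validation instead of just one.
* **Process**: **P** can be set to values like 10, 20, or 30.
* The rest of the process is the same as LOOCV, but each validation set contains P records. In its exhaustive form, every possible group of P records is used as a validation set once, so the number of experiments grows combinatorially with the dataset size.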
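
To make the cost concrete, here is a small pure-Python sketch of exhaustive leave-P-out splitting. The helper name `leave_p_out_splits` is hypothetical; scikit-learn provides a `LeavePOut` splitter for real use.

```python
from itertools import combinations
from math import comb


def leave_p_out_splits(n_records, p):
    """Yield (train_indices, validation_indices) for every group of p records."""
    all_indices = range(n_records)
    for val_idx in combinations(all_indices, p):
        held_out = set(val_idx)
        train_idx = [i for i in all_indices if i not in held_out]
        yield train_idx, list(val_idx)


splits = list(leave_p_out_splits(5, 2))
print(len(splits))   # 10 experiments for 5 records with P=2
print(splits[0])     # ([2, 3, 4], [0, 1])

# The count grows quickly: 500 records with P=2 already need 124750 experiments.
print(comb(500, 2))  # 124750
```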

### 3. K-Fold Cross Validation
This is a widely used technique that splits the data into **K** distinct sections or "folds."
* **Process**: The total number of records ($N$) is divided by $K$ to determine the validation size for each fold.
    * *Example*: If $N=500$ and $K=5$, each validation fold contains $100$ records.
* **Iteration**:
    * **Experiment 1**: The first 100 records are the **validation data**; the remaining 400 are training data.
    * **Experiment 2**: The next 100 records become validation data, and the rest are training.
    * This continues until all folds have been used as validation data exactly once.
* **Outcome**: An accuracy score is calculated for each experiment, and the **average accuracy** is derived from all $K$ experiments.
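
The 500-record, 5-fold example above can be reproduced with scikit-learn's `KFold`. The sketch only inspects the index splits; the `records` array is a stand-in for a real dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

records = np.arange(500)  # stand-in for 500 records

# Without shuffling, each fold is a contiguous block of 100 records.
kf = KFold(n_splits=5)

fold_summary = []
for experiment, (train_idx, val_idx) in enumerate(kf.split(records), start=1):
    fold_summary.append(
        (experiment, len(train_idx), len(val_idx), int(val_idx[0]), int(val_idx[-1]))
    )

for row in fold_summary:
    print(row)
# (1, 400, 100, 0, 99)
# (2, 400, 100, 100, 199)
# (3, 400, 100, 200, 299)
# (4, 400, 100, 300, 399)
# (5, 400, 100, 400, 499)
```

Passing `cv=kf` to `cross_val_score` would then give one accuracy per experiment, which can be averaged as described in the outcome step.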

### 4. Stratified K-Fold Cross Validation
This method addresses a specific flaw in standard K-Fold Cross Validation regarding **imbalanced datasets**, particularly in classification problems.
* **The Problem**: In standard K-Fold, a fold might end up with a validation set containing only one class category (e.g., all 1s or all 0s in a binary problem), especially when the data is imbalanced or ordered by label. The score for that fold is then unrepresentative, and the matching training set is skewed as well.
* **The Solution**: Stratified K-Fold ensures that the **validation data** maintains the same **proportion** of target classes (e.g., a 60:40 ratio of 1s to 0s) as found in the original dataset.
* **Benefit**: Every validation fold reflects the class distribution of the overall data as closely as the fold size allows.
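
A small sketch of the difference, assuming a hypothetical label array with a 60:40 ratio that happens to be sorted by class (a worst case for unshuffled K-Fold). It compares the share of 1s in each validation fold under `KFold` and `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 60 ones followed by 40 zeros (60:40 ratio, sorted by class).
y = np.array([1] * 60 + [0] * 40)
X = np.zeros((len(y), 1))  # features are irrelevant for inspecting the splits

# Share of class 1 in each validation fold.
plain_ratios = [float(y[val].mean()) for _, val in KFold(n_splits=5).split(X)]
stratified_ratios = [
    float(y[val].mean()) for _, val in StratifiedKFold(n_splits=5).split(X, y)
]

print(plain_ratios)       # [1.0, 1.0, 1.0, 0.0, 0.0]
print(stratified_ratios)  # [0.6, 0.6, 0.6, 0.6, 0.6]
```

With plain K-Fold every validation fold here contains a single class, while each stratified fold keeps the original 60:40 ratio.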

### 5. Time Series Cross Validation
This technique is essential for **time-dependent data**, such as product sentiment analysis or stock prices, where the order of data points matters.
* **Context**: Reviews or data points may change nature over time (e.g., product features improve from January to December).
* **Constraint**: You cannot randomly split time-series data. The validation set must always come *after* the training set chronologically.
* **Process**:
    * Data is split based on time steps (e.g., days).
    * **Split Example**: Day 1 to Day 4 serves as **training data**, while Day 5 onward (up to Day N) serves as **validation data**, so the two sets never overlap.
    * The sequence is strictly maintained; future data is never used to train for past predictions.
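
The chronological splitting described above can be sketched with scikit-learn's `TimeSeriesSplit`, which uses an expanding training window. The five-day array is a toy stand-in for real time-stamped data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

days = np.arange(1, 6)  # Day 1 .. Day 5, already in chronological order

# Each split trains on all earlier days and validates on the next day.
tscv = TimeSeriesSplit(n_splits=4)
splits = [(days[tr].tolist(), days[va].tolist()) for tr, va in tscv.split(days)]

for train_days, val_days in splits:
    print("train:", train_days, "validation:", val_days)
# train: [1] validation: [2]
# train: [1, 2] validation: [3]
# train: [1, 2, 3] validation: [4]
# train: [1, 2, 3, 4] validation: [5]
```

In every split the validation days come strictly after the training days, so no future data leaks into training.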