# Cross-validation




Cross-validation is a widely used technique in machine learning and statistics for assessing the performance and generalization of a predictive model. 

It helps to evaluate how well a model will perform on unseen data and provides a more robust estimate of a model's performance than a single train-test split. 

The basic idea is to split the dataset into multiple subsets, train and test the model on different combinations of these subsets, and then aggregate the results to get a more comprehensive performance evaluation. 

Common types of cross-validation include k-fold cross-validation and leave-one-out cross-validation.
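
The two variants are closely related: leave-one-out is the special case of k-fold in which k equals the number of samples. The short sketch below illustrates this with scikit-learn's `LeaveOneOut` and `KFold` splitters on a toy array; the array and the variable names are illustrative assumptions, not part of any real dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Toy data, purely illustrative: 6 samples with 2 features each
X = np.arange(12).reshape(6, 2)

# Collect (train indices, test indices) for every split of each strategy
loo_splits = [(train.tolist(), test.tolist())
              for train, test in LeaveOneOut().split(X)]
kfold_splits = [(train.tolist(), test.tolist())
                for train, test in KFold(n_splits=len(X)).split(X)]

for train_idx, test_idx in loo_splits:
    print(f"train={train_idx} test={test_idx}")

# Without shuffling, k-fold with k = n reproduces the leave-one-out splits
print("Same splits as k-fold with k = n:", loo_splits == kfold_splits)
```

Each split holds out exactly one sample, so leave-one-out fits the model once per sample, which can become expensive on large datasets.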

<h6>1. K-Fold Cross-Validation</h6>

K-Fold Cross-Validation splits the dataset into k folds and rotates the role of test set across them, so that every sample is used for testing exactly once. Averaging over the k test folds makes the performance estimate less dependent on how any one split happened to fall.

Here's a step-by-step explanation of K-Fold Cross-Validation:

   1. Data Splitting:
        * The dataset is divided into k "folds" (subsets) of equal, or nearly equal, size.
        * When the number of samples is not divisible by k, fold sizes typically differ by at most one sample.
        * For example, if you have 100 data points and choose k=5, each fold will contain 20 data points.

        <br>
    
   2. Training and Testing:
        * The model is trained and tested k times.
        * In each iteration, one fold is used as the test set, and the remaining k-1 folds are used as the training set.
        * For instance, in the first iteration, Fold 1 is the test set, and Folds 2 to k are the training set. In the second iteration, Fold 2 is the test set, and so on.
        
        <br>
    
    
   3. Performance Metrics:
        * After each iteration, a performance metric (e.g., accuracy, mean squared error) is calculated based on the model's performance on the test fold.
        * This gives you k performance scores, one for each fold.
    
        <br>
    
   4. Aggregation:
        * The k performance metrics obtained from the iterations are typically aggregated to get a single estimate of the model's performance.
        * Common aggregation methods include taking the mean or median of the k scores.
        * This aggregated score provides a more reliable estimate of the model's performance than a single train-test split.
    
        <br>
    
   5. Advantages of K-Fold Cross-Validation:
        * It provides a more robust estimate of a model's performance since it evaluates the model on multiple test sets.
        * Large variation between the fold scores, or fold scores far worse than the training performance, can reveal overfitting or sensitivity to a particular split that a single train-test split would hide.
        * Every sample is used for testing exactly once and for training k-1 times, which is especially useful when the dataset is small.
    
        <br>
    
   6. Common Values of k:
        * Common choices for k are 5 and 10, and sometimes 3 or 20, depending on the dataset size and computational resources. A larger k leaves more data for training in each iteration but requires fitting the model more times.
        * In practice, libraries like scikit-learn in Python provide convenient functions to perform K-Fold Cross-Validation with various machine learning models and evaluation metrics.
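
The splitting, training, scoring, and aggregation steps above can be sketched as an explicit loop. This is a minimal illustration, assuming a synthetic dataset of 100 samples (to match the k=5 example) and a linear regression model; the names `fold_mse` and `fold_sizes` are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (assumption): 100 samples, so k=5 gives 20 samples per fold
X, y = make_regression(n_samples=100, n_features=2, noise=1.0, random_state=0)

# Step 1: split the sample indices into k shuffled folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mse = []    # one score per fold (step 3)
fold_sizes = []  # number of test samples in each fold

# Step 2: each fold serves as the test set once; the rest is the training set
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = LinearRegression()  # a fresh model for every iteration
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])

    # Step 3: score the model on the held-out fold only
    mse = mean_squared_error(y[test_idx], preds)
    fold_mse.append(mse)
    fold_sizes.append(len(test_idx))
    print(f"Fold {fold} | test size: {len(test_idx)} | MSE: {mse:.4f}")

# Step 4: aggregate the k scores into a single estimate
print(f"Mean MSE:   {np.mean(fold_mse):.4f}")
print(f"Median MSE: {np.median(fold_mse):.4f}")
```

Creating a new model inside the loop keeps each iteration independent, so nothing learned from one training set leaks into the next.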

<h5> Now let's apply K-Fold Cross-Validation:</h5>

In [14]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

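# Generate a tiny synthetic regression dataset (replace with your own X and y).
# With only 5 samples and 5 folds, each test fold holds a single sample.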
X, y = make_regression(n_samples=5, n_features=2, noise=1, random_state=42)

# Create a machine learning model (replace with your model)
model = LinearRegression()

# Specify the number of folds (e.g., 5-fold cross-validation)
num_folds = 5

# Create a KFold object to control the cross-validation process
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

# Perform cross-validation and specify the scoring metric (e.g., mean squared error)
scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')

# scikit-learn returns the negated MSE (so that higher is better); flip the sign to recover the MSE, then average
mse_scores = -scores
mean_mse = mse_scores.mean()

for i, mse in enumerate(mse_scores):
    print(f"Fold {i} | Mean Squared Error (MSE): {mse:.4f}")
    
print(f"\nMean Squared Error (MSE) across {num_folds} folds: {mean_mse:.4f}")

Fold 0 | Mean Squared Error (MSE): 0.8463
Fold 1 | Mean Squared Error (MSE): 0.0929
Fold 2 | Mean Squared Error (MSE): 3.3469
Fold 3 | Mean Squared Error (MSE): 1.5318
Fold 4 | Mean Squared Error (MSE): 1.7142

Mean Squared Error (MSE) across 5 folds: 1.5064


In [4]:
mse_scores

array([0.84628596, 0.09293514, 3.34686517, 1.53176163, 1.71417011])
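
Besides the mean, it is often worth looking at the median and the spread of the fold scores. The sketch below uses the five MSE values printed above; the variable names `median_mse` and `std_mse` are illustrative. Because each test fold in this run contains a single sample, every score is one squared error, so a wide spread is to be expected.

```python
import numpy as np

# Per-fold MSE values copied from the output above
mse_scores = np.array([0.84628596, 0.09293514, 3.34686517, 1.53176163, 1.71417011])

mean_mse = mse_scores.mean()
median_mse = np.median(mse_scores)
std_mse = mse_scores.std(ddof=1)  # sample standard deviation across folds

print(f"Mean MSE:   {mean_mse:.4f}")
print(f"Median MSE: {median_mse:.4f}")
print(f"Std. dev.:  {std_mse:.4f}")
```

Reporting the spread alongside the mean shows how much the estimate depends on which samples end up in each test fold.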