# k-fold Validation



In machine learning, assessing the performance of a model is crucial, especially when working with limited data. One powerful technique for evaluating models is k-Fold Cross-Validation. This method involves partitioning the dataset into multiple subsets, or "folds," to comprehensively assess a model's generalization capability.

The term "k" in k-Fold refers to the number of folds the dataset is divided into. Typically, a common choice is k=10, resulting in 10-fold cross-validation. The primary goal of this technique is to estimate how well a machine learning model will perform on unseen data. By simulating the model's performance across multiple partitions of the dataset, we gain insights into its general behavior and potential for overfitting.


The procedure for k-Fold Cross-Validation is as follows:

1. Randomly shuffle the dataset to ensure randomness in fold assignments.
2. Divide the dataset into "k" approximately equal-sized groups (folds).
3. For each fold "i" (from 1 to k):
   - Treat fold "i" as the validation set.
   - Use the remaining k-1 folds as the training set.
   - Train a machine learning model on the training set and evaluate it on the validation set.
   - Retain the evaluation score and discard the model.
4. Calculate the mean and optionally the variance (e.g., standard deviation) of the evaluation scores to summarize the model's performance.

It's crucial to note that each data point appears in the validation set once and in the training set k-1 times, ensuring comprehensive assessment.


1. **Less Bias**: k-Fold Cross-Validation often yields less biased estimates of model performance compared to a simple train/test split.
2. **Comprehensive Evaluation**: All data points get an opportunity to be both in training and validation sets, leading to a more holistic evaluation.
3. **Hyperparameter Tuning**: Hyperparameter tuning can be done on the training set within each fold, preventing data leakage and over-optimistic estimates.
4. **Variance Estimate**: By calculating the variance of evaluation scores, we gain insights into the model's stability across different subsets of data.

## Task 5 [25 marks]

In this task, you will implement k-Fold Cross-Validation on the Diabetes dataset using scikit-learn.

Your task is to write a code that performs the following steps:

1. Load the Diabetes dataset.
2. Define the number of folds for cross-validation.
3. Calculate the number of samples per fold.
4. Perform k-Fold Cross-Validation:
   - For each fold:
     - Split the data into training and test sets.
     - Train a Linear Regression model on the training set.
     - Make predictions on the test set.
     - Calculate the Mean Squared Error (MSE) for the predictions.
     - Print the MSE for each fold.
     - Store the MSE values in a list.
5. Calculate the average MSE across all folds and print it.

> In this task you are allowed to use scikit-learn library to fit the linear regression model and for calculating the MSE. However, you are expected to create your own code (using numpy) for k-fold validation set. Below are the corresponding commands that you can import:
- `from sklearn.linear_model import LinearRegression`
- `from sklearn.metrics import mean_squared_error`


In [22]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load the diabetes dataset
diabetes = load_diabetes()

# Create a DataFrame to store the data and target
diabetes_data = pd.DataFrame(data=diabetes['data'], columns=diabetes['feature_names'])
diabetes_data["DP"] = diabetes['target']

x = diabetes_data.iloc[:, 0:9].to_numpy()
y = diabetes_data.iloc[:, 10].to_numpy()

n_folds = 13
if(len(diabetes_data["DP"])%n_folds != 0):
    raise Exception('datasize is not divisible by n_folds')
else:
    n_samples = len(diabetes_data["DP"])/n_folds

mse_all = []
for fold_i in range(n_folds):
    train_x = np.vstack((x[:fold_i*n_folds,:],x[(fold_i+1)*n_folds:,:]))
    val_x = x[fold_i*n_folds:(fold_i+1)*n_folds,:]
    train_y = np.hstack((y[:fold_i*n_folds],y[(fold_i+1)*n_folds:]))
    val_y = y[fold_i*n_folds:(fold_i+1)*n_folds]
    reg = LinearRegression().fit(train_x, train_y)
    print(mean_squared_error(val_y, reg.predict(val_x)))
    mse_all.append(mean_squared_error(val_y, reg.predict(val_x)))

avg_mse = np.average(np.array(mse_all))
print("avg mse")
print(avg_mse)


2512.921130288053
1356.78908576085
3673.3256000439596
2197.12267605005
4203.846365617799
3117.53711285309
1610.6201354834727
5168.873084464219
3705.1646356286537
2264.699423407425
2921.950381901836
3140.6488083813724
2612.732363885544
avg mse
2960.47929259741
