# Cross-Validation:

Cross-validation is a resampling method for estimating how well a machine learning model will generalize to an independent dataset. Its main purpose is to detect overfitting, which occurs when a model fits the training data too closely, capturing noise and outliers, and then performs poorly on unseen data. Cross-validation does not prevent overfitting by itself; it exposes it, so that the model or its hyperparameters can be adjusted.

#### Types of Cross-Validation:

1) K-Fold Cross-Validation: The dataset is divided into k subsets (or folds). The model is trained on k-1 folds and validated on the remaining fold. This process is repeated k times, each time with a different fold as the validation set. The final performance metric is the average of the k validation scores.

2) Stratified K-Fold Cross-Validation: Similar to K-Fold, but each fold keeps approximately the same class proportions as the full dataset. This is particularly useful for imbalanced datasets, where a plain K-Fold split could leave a fold with few or no samples of a minority class.

3) Leave-One-Out Cross-Validation (LOOCV): A special case of K-Fold where k equals the number of data points. Each iteration uses one data point as the validation set and the rest as the training set.

4) Time Series Cross-Validation: Used for time series data, where the training set consists of past observations and the validation set consists of future observations.
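
The four strategies above map onto splitter classes in `sklearn.model_selection`. The following is a minimal sketch on a small toy dataset; `X_toy`, `y_toy`, and the choice of 4 splits are assumptions made for illustration, not part of the example later in this notebook. It prints the validation indices of each split so the differences between the strategies are visible.

```python
import numpy as np
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    StratifiedKFold,
    TimeSeriesSplit,
)

# Toy data (hypothetical): 12 samples, 2 features, imbalanced labels (8 vs 4).
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.array([0] * 8 + [1] * 4)

splitters = {
    "KFold": KFold(n_splits=4, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
    "LeaveOneOut": LeaveOneOut(),
    "TimeSeriesSplit": TimeSeriesSplit(n_splits=4),
}

for name, splitter in splitters.items():
    n_splits = splitter.get_n_splits(X_toy, y_toy)
    print(f"{name}: {n_splits} splits")
    for i, (train_idx, val_idx) in enumerate(splitter.split(X_toy, y_toy)):
        if i >= 4:  # LeaveOneOut yields one split per sample; show only the first few
            print("  ...")
            break
        print(f"  train size={len(train_idx):2d}  val={val_idx.tolist()}")
```

With `TimeSeriesSplit`, every training index precedes every validation index, so the model is never trained on observations that come after the ones it is validated on.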


#### Why Use Cross-Validation?

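1) Model Evaluation: It provides a more reliable estimate of the model's performance on unseen data than a single train/test split, because the score is averaged over several different validation sets.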

2) Hyperparameter Tuning: Helps in selecting the best hyperparameters by evaluating the model on multiple subsets of the data.

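3) Detecting Overfitting: Low or highly variable validation scores, or a large gap between training and validation scores, show that the model does not generalize well to new data.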
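
As an illustration of the hyperparameter tuning use case, the sketch below runs a small grid search in which every candidate setting is scored with 5-fold cross-validation. The parameter grid, the fixed `random_state` values, and the names `X_demo`, `y_demo`, and `search` are assumptions chosen for this example; the grid values are illustrative, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_demo, y_demo = load_iris(return_X_y=True)

# Hypothetical search grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [25, 100], "max_depth": [2, None]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)  # fits each candidate on each fold, then refits the best one

print(f"Best parameters: {search.best_params_}")
print(f"Best mean CV accuracy: {search.best_score_:.4f}")
```

Because `best_score_` was itself used to pick the winner, it tends to be slightly optimistic; a held-out test set or nested cross-validation gives a less biased final estimate.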

In [None]:
# !pip install scikit-learn

Successfully installed joblib-1.4.2 scikit-learn-1.6.1 scipy-1.15.1 threadpoolctl-3.5.0


In [4]:
import numpy as np 
from sklearn.model_selection import KFold, cross_val_score 
from sklearn.datasets import load_iris 
from sklearn.ensemble import RandomForestClassifier

In [5]:
data = load_iris()
X = data.data
y = data.target

In [6]:
# Initialize random forest classifier
model = RandomForestClassifier(n_estimators=100)

In [7]:
# Initialize K-Fold cross-validation
k_folds = 5
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)

In [8]:
# Fit and score the model on each fold; returns one accuracy value per fold
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

In [9]:
# Output the results
print(f"Accuracy scores for each fold: {scores}")
print(f"Mean accuracy: {np.mean(scores):.4f}")
print(f"Standard deviation of accuracy: {np.std(scores):.4f}")

Accuracy scores for each fold: [1.         0.96666667 0.93333333 0.93333333 0.96666667]
Mean accuracy: 0.9600
Standard deviation of accuracy: 0.0249
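
To make explicit what `cross_val_score` does, the loop below is a rough manual equivalent: for each fold it clones the estimator, fits it on the training indices, and scores it on the validation indices. This is a sketch, not the library's internal implementation. The names `X_iris`, `y_iris`, `base_model`, and `manual_scores` are introduced here for illustration, and the forest is given `random_state=0` (an assumption not present in the cells above), so the per-fold numbers may differ from the output shown above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X_iris, y_iris = load_iris(return_X_y=True)

# random_state on the model is an added assumption, used only for repeatability.
base_model = RandomForestClassifier(n_estimators=100, random_state=0)
splitter = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in splitter.split(X_iris):
    fold_model = clone(base_model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_iris[train_idx], y_iris[train_idx])
    predictions = fold_model.predict(X_iris[val_idx])
    manual_scores.append(accuracy_score(y_iris[val_idx], predictions))

manual_scores = np.array(manual_scores)
print(f"Accuracy scores for each fold: {manual_scores}")
print(f"Mean accuracy: {manual_scores.mean():.4f}")
print(f"Standard deviation of accuracy: {manual_scores.std():.4f}")
```

Cloning the estimator for each fold matters: it guarantees that nothing learned from one fold's training data leaks into the next fold's evaluation.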
