# Cross-Validation in Machine Learning

## What is Cross-Validation?

Cross-validation is a statistical technique used to evaluate the performance of a machine learning model. It involves dividing the dataset into multiple subsets, training the model on some subsets, and validating it on the remaining subsets. This approach helps to estimate how well the model generalizes to unseen data.
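
The procedure described above can be written out by hand. This is a minimal sketch, not a reference implementation: it assumes scikit-learn's bundled iris data as a stand-in dataset, and the names `n_splits`, `folds`, and `scores`, as well as the seed `0`, are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled iris data stands in for any labelled dataset.
X, y = load_iris(return_X_y=True)

n_splits = 5                         # illustrative number of subsets
rng = np.random.default_rng(0)       # fixed seed so the split is repeatable
indices = rng.permutation(len(X))    # shuffle, because iris rows are sorted by class
folds = np.array_split(indices, n_splits)

scores = []
for i, val_idx in enumerate(folds):
    # Train on every subset except the i-th one.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    model = KNeighborsClassifier()
    model.fit(X[train_idx], y[train_idx])
    # Validate on the held-out subset.
    scores.append(model.score(X[val_idx], y[val_idx]))

print("accuracy per subset:", np.round(scores, 3))
print("mean accuracy:", np.mean(scores))
```

Each row ends up in the validation subset exactly once, so the mean is computed only from predictions on data the model did not see during fitting.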

---

## Why Use Cross-Validation?

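- **Detect Overfitting:** Shows whether the model performs well on data it was not trained on, not only on the training data. Cross-validation does not prevent overfitting by itself; it exposes it.
- **Model Evaluation:** Averaging over several splits gives a more stable estimate of the model’s performance than a single train/test split.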
- **Hyperparameter Tuning:** Helps in selecting the best hyperparameters by evaluating performance over different splits of the data.
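
As a hedged illustration of the hyperparameter-tuning use case, the sketch below lets `GridSearchCV` score several values of `n_neighbors` for a KNN classifier with 5-fold cross-validation. The candidate values in `param_grid` and the use of scikit-learn's bundled iris data are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled iris data as a stand-in dataset.
X, y = load_iris(return_X_y=True)

# Illustrative candidate values; every one is scored with 5-fold cross-validation.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best n_neighbors:", search.best_params_["n_neighbors"])
print("best mean CV accuracy:", search.best_score_)
```

Because the best score was itself used to pick the hyperparameter, it tends to be slightly optimistic; a separate test set or nested cross-validation gives a cleaner final estimate.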

---

## Types of Cross-Validation

1. **Holdout Method:**
   - The dataset is split into two parts: training and testing.
   - The model is trained on the training set and evaluated on the testing set.
   - Simple, but the estimate has high variance because it depends on a single random split.

2. **k-Fold Cross-Validation:**
   - The dataset is divided into `k` folds of (roughly) equal size.
   - The model is trained on `k-1` folds and validated on the remaining fold.
   - This process is repeated `k` times, with each fold used as the validation set once.
   - The average performance across all `k` iterations is reported.
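   - **Advantages:** Every data point is used for both training and validation, and averaging over `k` folds gives a lower-variance estimate than a single holdout split.
   - **Common Values for `k`:** 5 or 10.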

3. **Leave-One-Out Cross-Validation (LOOCV):**
   - A special case of k-Fold where `k` equals the number of data points in the dataset.
   - Each data point is used as the validation set exactly once.
   - Computationally expensive, because the model is fitted once per data point. The estimate has low bias, but it can have high variance.

4. **Stratified k-Fold Cross-Validation:**
   - A variation of k-Fold where the folds are stratified to ensure class distribution is preserved.
   - Useful for imbalanced datasets.

5. **Time Series Cross-Validation:**
   - For time-series data, the splits respect the temporal order of the observations.
   - Each split trains on past data and validates on later data, so the model never sees the future during training.
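
The five schemes above map onto splitters in `sklearn.model_selection`. The sketch below is only an illustration: it uses a made-up 12-row array with a 2:1 class imbalance (`X`, `y`) and arbitrary split counts, and prints which row indices land in the training and validation parts.

```python
import numpy as np
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    StratifiedKFold,
    TimeSeriesSplit,
    train_test_split,
)

# Hypothetical toy data: 12 rows, 8 of class 0 and 4 of class 1.
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 4)

# 1. Holdout: one random split into training and testing parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print("holdout test rows:", X_test.ravel())

# 2-5. Splitters that generate several train/validation index pairs.
splitters = {
    "k-fold (k=4)": KFold(n_splits=4),
    "LOOCV": LeaveOneOut(),
    "stratified k-fold (k=4)": StratifiedKFold(n_splits=4),
    "time series (4 splits)": TimeSeriesSplit(n_splits=4),
}
splits = {name: list(s.split(X, y)) for name, s in splitters.items()}

for name, pairs in splits.items():
    print(name)
    for train_idx, val_idx in pairs:
        print("  train:", train_idx, "validate:", val_idx)
```

In this toy setup every stratified validation fold keeps the 2:1 class ratio, and every time-series split validates only on rows that come after its training rows.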

---

![Screenshot (8136).png](attachment:61ba5b19-dc77-4cd6-8981-c7e90842394a.png)

![Screenshot (8138).png](attachment:90643a0d-3539-434b-896b-9c176daecc9e.png)

![Screenshot (8140).png](attachment:db1d26f0-67d8-4130-be53-9cd25314e784.png)

In [3]:
import pandas as pd

# Load the iris data and drop the row-identifier column, which is not a feature.
data = pd.read_csv("iris.csv")
data = data.drop(columns="Id")

In [11]:
data.head()

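|   | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---------------|--------------|---------------|--------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |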


In [9]:
# Feature matrix: every column except the target.
x = data.drop(columns="Species")
x.head(5)

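|   | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---------------|--------------|---------------|--------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |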


In [13]:
# Target vector: the species label of each flower.
y = data["Species"]
y.head()

0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Species, dtype: object

In [15]:
from sklearn.model_selection import cross_val_score

## KNN

In [33]:
from sklearn.neighbors import KNeighborsClassifier

# Mean accuracy over 5 folds; for a classifier, cv=5 uses stratified folds.
knn = KNeighborsClassifier()
cross_val_score(knn, x, y, cv=5).mean()

0.9733333333333334
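
The single number above is the mean of five per-fold accuracies. The sketch below shows the individual fold scores and their spread. It assumes scikit-learn's bundled iris data in place of `iris.csv` (so the exact figures may differ from the cell above) and builds the splitter explicitly as `StratifiedKFold(n_splits=5)`, which is what `cv=5` resolves to for a classifier.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled iris data instead of the local iris.csv file.
X, y = load_iris(return_X_y=True)

# Explicit equivalent of cv=5 for a classifier: stratified folds, no shuffling.
cv = StratifiedKFold(n_splits=5)
fold_scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)

print("per-fold accuracy:", fold_scores)
print("mean:", fold_scores.mean(), "std:", fold_scores.std())
```

Reporting the standard deviation next to the mean makes it easier to judge whether a difference between two models is larger than the fold-to-fold noise.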

## Naive Bayes

In [36]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
cross_val_score(nb, x, y, cv=5).mean()

0.9533333333333334

## Random Forest

In [39]:
from sklearn.ensemble import RandomForestClassifier

# No random_state is set, so this score can vary slightly between runs.
rf = RandomForestClassifier()
cross_val_score(rf, x, y, cv=5).mean()

0.9666666666666668
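
For a like-for-like comparison of the three models, one option is to score them all on identical folds and to fix every source of randomness. The sketch below is one way to do that, under stated assumptions: the bundled iris data replaces `iris.csv`, and the seed `42` and the `models` and `results` dictionaries are illustrative. Its numbers are not expected to match the cells above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled iris data instead of the local iris.csv file.
X, y = load_iris(return_X_y=True)

# One shuffled, seeded splitter shared by all models, so the folds are identical.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42),  # seeded for repeatability
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With only 150 rows, differences of one or two percentage points between models are often within the fold-to-fold spread, so the standard deviations are worth reading alongside the means.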