# Introduction to Cross-Validation

In this lesson, we will learn how to evaluate machine learning models more reliably using a technique called cross-validation. It gives us a better estimate of how well our model is likely to perform on unseen data.

## Why Cross-Validation?

When we test a model using just one train/test split, the results can be misleading if the split was lucky or unlucky. Cross-validation involves multiple splits, which give us a more trustworthy assessment of our model's performance.
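To make the "lucky or unlucky split" idea concrete, here is a minimal sketch that scores the same classifier on several different random train/test splits. The iris dataset, the k-nearest-neighbors model, the 30% test size, and the seed values are illustrative choices only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

seeds = range(5)  # five arbitrary seeds, chosen only for illustration
split_scores = []
for seed in seeds:
    # Each seed produces a different single train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

for seed, score in zip(seeds, split_scores):
    print(f"Seed {seed}: accuracy = {score:.3f}")
```

The accuracy may change from seed to seed even though the model and the data stay the same. Any spread here is caused only by which rows landed in the test set.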

### How does it work?

- The data is divided into *K* parts (or folds).
- The model trains on K-1 parts and is tested on the remaining part.
- This process repeats K times, each time with a different part as the test set.
- The results are then averaged for a more reliable estimate.

![Visual showing 5-fold cross-validation with data split into 5 parts, each serving as test set once while others are training data.](images/cross_validation_diagram.png)
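The steps above can also be written out by hand. The sketch below assumes the iris dataset and a k-nearest-neighbors classifier as stand-ins, and uses scikit-learn's `KFold` only to generate the index splits. The `test_counts` array is a hypothetical helper added to check that every sample is tested exactly once.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
test_counts = np.zeros(len(X), dtype=int)  # how often each sample is in a test fold

for train_idx, test_idx in kfold.split(X):
    # Train on K-1 folds, test on the held-out fold
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    test_counts[test_idx] += 1

# Average the per-fold results into a single estimate
mean_score = np.mean(fold_scores)
print(f"Fold scores: {np.round(fold_scores, 3)}")
print(f"Mean score: {mean_score:.3f}")
print(f"Every sample tested exactly once: {(test_counts == 1).all()}")
```

Writing the loop by hand shows that a fresh model is fitted for each fold, so no fold's test data ever influences the model that is scored on it.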

### Example: K-Fold Cross-Validation in Python

Let's see how to perform K-Fold cross-validation using scikit-learn.

In [None]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

# Load example data
X, y = load_iris(return_X_y=True)

# Initialize the model
knn = KNeighborsClassifier(n_neighbors=3)

In [None]:
# Set up cross-validation with 5 folds; shuffle because the iris samples are ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation to get accuracy scores
cv_scores = cross_val_score(knn, X, y, cv=kfold, scoring='accuracy')

### Results: Cross-Validation Scores


In [None]:
print("Cross-Validation Scores:")
for i, score in enumerate(cv_scores):
    print(f"Fold {i+1}: {score:.3f}")

print(f"\nMean CV Score: {cv_scores.mean():.3f}")
print(f"Standard Deviation: {cv_scores.std():.3f}")

### Interpretation of Results

The scores across the folds show how consistent the model's performance is. A higher mean score indicates better overall accuracy, and a lower standard deviation indicates more consistent performance across splits.
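One way to put this reading into practice is to compare two candidate models on the same folds. The sketch below is only an illustration: the two `n_neighbors` values (3 and 15) are hypothetical candidates, and the `summary` dictionary is a helper introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# A fixed random_state means both models are scored on identical folds
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

summary = {}
for k in (3, 15):  # hypothetical candidate values, chosen for illustration
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    summary[k] = (scores.mean(), scores.std())
    print(f"n_neighbors={k}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

When the means are close, the standard deviation can help decide between candidates: the model with the smaller spread performs more evenly across the splits.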

## Why Cross-Validation Matters

Getting multiple opinions (tests) on your model's performance is like asking multiple friends for advice before making a decision. This helps ensure your model's evaluation is reliable.

### Question:

If your CV scores vary wildly (e.g., from 0.5 to 0.9), what might that indicate about your model?

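*It could suggest that the model is unstable or overfitting: its performance depends heavily on which samples it happens to be trained and tested on. It can also happen when the dataset is too small for each fold to be representative.*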