# Introduction to Cross-Validation
In this notebook, we will learn about cross-validation, a technique for estimating how well a machine learning model performs on data it has not seen.
Let's explore why we need it and how it works.

## Why Is a Single Train/Test Split Sometimes Misleading?
Here's a simple diagram to illustrate the problem:
![Single Split Problem](images/single_split_problem.png)
- **Randomness:** A single random split can land on an unusually easy or unusually hard test set.
- **Dependence on the split:** The measured performance depends on which samples end up in the test set.
- **Question:** Is 95% accuracy truly good or just a lucky split?
- **Solution:** Evaluate on multiple splits and combine the results to get a more stable estimate.
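
To make the problem concrete, here is a minimal sketch that repeats a single train/test split with different random seeds and prints the accuracy each time. The dataset (iris), the model (`KNeighborsClassifier`), the 20% test size, and the seeds 0 to 9 are illustrative assumptions, not part of the discussion above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Repeat the same experiment with 10 different random splits (assumed seeds).
split_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = KNeighborsClassifier().fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

print("Accuracy per split:", [f"{s:.3f}" for s in split_scores])
print(f"Lowest: {min(split_scores):.3f}, highest: {max(split_scores):.3f}")
```

If the lowest and highest values differ noticeably, a single split alone would not tell you which number to trust.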

## K-Fold Cross-Validation Process
The most common method is **K-Fold Cross-Validation**.
Here's how it works:
1. **Split:** Divide your data into K roughly equal parts (called folds).
2. **Iterate:** Train the model on K-1 folds and test it on the remaining fold.
3. **Repeat:** Do this K times, each time with a different fold as the test set.
4. **Average:** Calculate the average performance across all K runs.

This gives a more reliable estimate of your model's performance than any single split.
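
The splitting steps can be sketched in plain Python. The helper `k_fold_indices` below is a hypothetical name used only for illustration. It does not shuffle the data, which you would normally do first, and it is not how scikit-learn is implemented internally.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Example: 10 samples, 3 folds. Each sample lands in exactly one test fold.
for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, 3), start=1):
    print(f"Fold {fold}: train={train_idx} test={test_idx}")
```

In a full run you would train and score the model once per fold, then average the K scores.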

## Benefits of Cross-Validation
- **Reliability:** Reduces the chance that one lucky or unlucky split gives a misleading performance estimate.
- **Confidence:** Shows the spread of scores across folds, not just a single number.
- **Detection:** Helps reveal overfitting, for example when training scores are much higher than the scores on the held-out folds.
- **Efficiency:** Every sample is used for testing exactly once and for training K-1 times.

⚠️ Note: Cross-validation trains the model K times, so it takes more computation time than a single split, but it provides better insights.
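
One way to look for overfitting is to compare training and validation scores across folds. The sketch below uses `cross_validate` with `return_train_score=True`. The choice of `n_neighbors=1` is an assumption made on purpose, because a 1-nearest-neighbor model tends to memorize its training data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# n_neighbors=1 is an illustrative choice: it tends to memorize the training data.
model = KNeighborsClassifier(n_neighbors=1)

# return_train_score=True also records the score on each training portion.
results = cross_validate(model, X, y, cv=5, return_train_score=True)

train_mean = results["train_score"].mean()
test_mean = results["test_score"].mean()
print(f"Mean training accuracy:   {train_mean:.3f}")
print(f"Mean validation accuracy: {test_mean:.3f}")
print(f"Gap: {train_mean - test_mean:.3f}")
```

A large gap between the two means is a hint that the model fits the training folds better than it generalizes.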

## Example: Cross-Validation in Python
Let's see how to perform cross-validation using scikit-learn:

In [None]:
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
import numpy as np

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Create model
model = KNeighborsClassifier()

# 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)

print("Cross-validation scores:", cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.3f}")
print(f"Standard deviation: {cv_scores.std():.3f}")
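# Rough spread of the fold scores (mean ± 1.96 std). With only 5 folds that share
# training data, this is not a formal 95% confidence interval for the true accuracy.
print(f"Approximate spread: {cv_scores.mean():.3f} ± {1.96 * cv_scores.std():.3f}")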

# Evaluate several metrics at once with cross_validate
cv_results = cross_validate(
    model, X, y, cv=5,
    scoring=['accuracy', 'precision_macro', 'recall_macro'],
)
print("\nDetailed results:")
for metric, scores in cv_results.items():
    if metric.startswith('test_'):
        print(f"{metric}: {scores.mean():.3f} ± {scores.std():.3f}")
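
When `cv` is an integer and the model is a classifier, scikit-learn uses stratified folds without shuffling. If you want shuffled, reproducible folds, you can pass a splitter object instead. The sketch below assumes `StratifiedKFold` with an arbitrary seed of 42.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier()

# Shuffle before splitting and fix the seed (assumed value) so the folds are reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
shuffled_scores = cross_val_score(model, X, y, cv=cv)

print("Scores with shuffled stratified folds:", shuffled_scores)
print(f"Mean accuracy: {shuffled_scores.mean():.3f}")
```

Stratified folds keep the class proportions in each fold close to those of the full dataset, which matters for data like iris that is stored sorted by class.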

👉 [Open in Colab](https://colab.research.google.com/github/Roopesht/codeexamples/blob/main/genai/python_easy/2/concept_4.ipynb)

## Key Takeaway
Cross-validation gives you a more trustworthy estimate of your model's true performance, along with a sense of how much that estimate varies!
🤔 **Question:** Why might a model with 90% ± 2% accuracy be better than one with 95% ± 15% accuracy?