# 📅 Day 12: Cross-Validation & Hyperparameter Tuning

## 🎯 Objective
Understand the purpose of cross-validation and learn how to perform hyperparameter tuning using GridSearchCV and RandomizedSearchCV.

## 🔄 What is Cross-Validation?
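- Technique for estimating how well a model generalizes by training and evaluating it on different subsets of the data
- Helps detect overfitting and gives a more stable performance estimate than a single train/validation split
- **K-Fold Cross-Validation** splits the data into K folds and trains K times, each time holding out a different fold for validation; the K scores are then averaged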
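
To make the fold mechanics concrete, here is a minimal pure-Python sketch of how K-fold partitions sample indices. The helper `k_fold_indices` is a hypothetical name written for this illustration, not scikit-learn's implementation, and it uses contiguous, unshuffled folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        # Everything outside the validation fold is used for training.
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for i, (train, val) in enumerate(folds, start=1):
    print(f"Fold {i}: train on {len(train)} samples, validate on {val}")
```

Each sample lands in exactly one validation fold, so every observation is used for validation once and for training K-1 times.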

## ⚙️ What is Hyperparameter Tuning?
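- The process of finding the best model configuration, i.e. settings fixed before training such as `max_depth` or `n_estimators`
- Tools: `GridSearchCV`, `RandomizedSearchCV`
- `GridSearchCV` evaluates every combination of the provided parameter values using cross-validation; `RandomizedSearchCV` evaluates only a fixed number (`n_iter`) of sampled combinations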
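
The cost difference between the two tools can be sketched with the standard library alone. The grid values below are assumptions chosen for illustration, and sampling from the enumerated candidates is a simplification of what a randomized search does (it can also draw from continuous or integer distributions).

```python
import itertools
import random

# Illustrative grid; the values are assumptions chosen for this example.
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [3, 5, 7],
    "criterion": ["gini", "entropy"],
}

# Grid search: every combination of the listed values is a candidate.
names = list(param_grid)
grid_candidates = [
    dict(zip(names, values)) for values in itertools.product(*param_grid.values())
]

cv_folds = 5
total_fits = len(grid_candidates) * cv_folds  # one fit per candidate per fold

# Randomized search: evaluate only a fixed budget of sampled candidates.
rng = random.Random(42)
n_iter = 10
random_candidates = rng.sample(grid_candidates, n_iter)

print(len(grid_candidates), "grid candidates ->", total_fits, "fits with 5-fold CV")
print(len(random_candidates), "random candidates ->", n_iter * cv_folds, "fits with 5-fold CV")
```

The grid's cost grows multiplicatively with every added parameter, while the randomized budget stays at `n_iter` candidates regardless of how large the search space is.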

## 📦 Step 1 – Load & Prepare Data

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

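# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% as a test set; stratify=y keeps the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training data only, then apply the same transform to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)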
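
One caveat about the scaling step above: the scaler is fit on the whole training set, so during cross-validation each validation fold has already influenced the scaling statistics. A common way to avoid this is to put the scaler inside a `Pipeline`, which re-fits it on the training part of every fold. The sketch below is one possible arrangement, not the notebook's own method; the step names `scaler` and `clf` and `n_estimators=50` are arbitrary choices made here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The scaler is a pipeline step, so it is re-fit on the training part of each fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Pass the unscaled training data; the pipeline handles scaling per fold.
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())
```

Tree-based models such as random forests are largely insensitive to feature scaling, so the effect here is likely small; the pattern matters more for scale-sensitive models such as SVMs or logistic regression.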

## 🔍 Step 2 – Hyperparameter Tuning with GridSearchCV

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

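# 3 x 3 x 2 = 18 candidate combinations
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5, 7],
    'criterion': ['gini', 'entropy']
}

# Each candidate is scored with 5-fold CV, so the search runs 18 x 5 = 90 fits
# plus one final refit of the best candidate on the full training set
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train_scaled, y_train)

# best_score_ is the mean cross-validated accuracy of the best candidate
print('Best Params:', grid_search.best_params_)
print('Best CV Score:', grid_search.best_score_)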
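
Beyond the single best result, a fitted search also exposes per-candidate scores in `cv_results_`, which helps judge whether the winner is clearly better or just marginally ahead. The sketch below uses a deliberately small grid (`small_grid`, with values chosen here only to keep the run short) and 3-fold CV; it is an illustration rather than a rerun of the search above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Deliberately small grid so the example runs quickly (values are assumptions).
small_grid = {"n_estimators": [25, 50], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)

# cv_results_ is a dict of arrays; as a DataFrame it has one row per candidate.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").to_string(index=False))
```

When several candidates score within one standard deviation of each other, the simpler or cheaper configuration is often a reasonable pick.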

## ✅ Step 3 – Evaluate the Best Model

In [None]:
from sklearn.metrics import classification_report, accuracy_score

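# best_estimator_ has already been refit on the full training set (refit=True is the default)
best_model = grid_search.best_estimator_
# The test set played no part in the search, so it gives an independent check
y_pred = best_model.predict(X_test_scaled)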

print('Test Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

## 🎲 Bonus – RandomizedSearchCV (Faster for Large Grids)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

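# randint(low, high) samples integers from low up to high - 1 (the upper bound is exclusive)
param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(2, 10)
}

# n_iter=10 samples 10 candidates, so the search runs 10 x 5 = 50 fits plus one final refit
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42), param_dist,
    n_iter=10, cv=5, scoring='accuracy', random_state=42
)
random_search.fit(X_train_scaled, y_train)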

print('Random Search Best Params:', random_search.best_params_)
print('Best CV Score:', random_search.best_score_)
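
A detail that is easy to miss in the search space above: `scipy.stats.randint(low, high)` excludes `high`, so `randint(50, 200)` never yields 200. The short check below samples from the same distribution to show the effective range; the sample size of 1000 is an arbitrary choice.

```python
from scipy.stats import randint

dist = randint(50, 200)  # same distribution as 'n_estimators' above

# Draw a reproducible sample and look at its extremes.
samples = dist.rvs(size=1000, random_state=42)
print("Smallest sample:", samples.min())
print("Largest sample:", samples.max())

# The probability mass confirms the bounds: 50 and 199 are possible, 200 is not.
print("P(X = 50): ", dist.pmf(50))
print("P(X = 199):", dist.pmf(199))
print("P(X = 200):", dist.pmf(200))
```

To include 200 as a possible value, the upper bound would need to be 201.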