Evaluating a model on a single train-test split can be sensitive to how the split was made. Cross-validation (CV) gives a more robust estimate of model performance by averaging over several splits. Hyperparameter tuning means finding the best settings for a model (e.g., the `C` parameter in `SVC`, or `n_neighbors` in KNN) that are chosen before training rather than learned during `fit()`. Scikit-learn provides tools for both in `sklearn.model_selection`.

## Scikit-learn: Model Selection & Hyperparameter Tuning

This document covers:

* **Cross-Validation (CV):** Explains the need for CV and demonstrates using `cross_val_score` (for quick scoring) and `cross_validate` (for more detailed results including timing and multiple metrics). It also shows how to use different CV splitting strategies like `KFold` and `StratifiedKFold`.
* **Hyperparameter Tuning:** Explains the concept and demonstrates two common techniques:
    * `GridSearchCV`: Exhaustively searches a predefined grid of parameters.
    * `RandomizedSearchCV`: Samples a fixed number of combinations from parameter distributions or lists, often more efficient for large search spaces.
* **Best Model:** Shows how to access the best parameters (`best_params_`) and the refitted best estimator (`best_estimator_`) found during the search.
* **Learning & Validation Curves:** Briefly introduces these tools for diagnosing model performance issues like bias and variance.
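
To make the cross-validation bullet concrete, here is a minimal sketch of the loop that `cross_val_score` automates: split, fit a fresh copy of the model on the training folds, score on the held-out fold, repeat. It assumes an unscaled `SVC` on Iris and a shuffled `StratifiedKFold`; the names `manual_scores` and `helper_scores` are illustrative, not part of the walkthrough below.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = SVC(kernel="rbf", C=1.0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: train on K-1 folds, score on the remaining fold, K times.
manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The helper function performs the same procedure in one call.
helper_scores = cross_val_score(model, X, y, cv=skf, scoring="accuracy")

print("manual:", manual_scores)
print("helper:", helper_scores)
```

Because the splitter uses a fixed integer `random_state`, both approaches see the same folds, so the two score arrays should agree.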

---

Mastering these techniques is crucial for building robust machine learning models and selecting appropriate hyperparameters.
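
The sensitivity of a single train-test split can be observed directly. The sketch below, which is an illustration rather than part of the original notebook, repeats an 80/20 split of Iris with ten different seeds and reports the spread of test accuracies. The seed range, `test_size=0.2`, and the name `holdout_scores` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Same model, same data, different random splits.
holdout_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    holdout_scores.append(SVC().fit(X_tr, y_tr).score(X_te, y_te))
holdout_scores = np.array(holdout_scores)

print(f"min={holdout_scores.min():.3f}  "
      f"max={holdout_scores.max():.3f}  "
      f"mean={holdout_scores.mean():.3f}")
```

Any gap between the minimum and maximum is variation caused by the split alone, which is what averaging over CV folds is meant to smooth out.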

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import (train_test_split, cross_val_score, cross_validate,
                                     KFold, StratifiedKFold, LeaveOneOut,
                                     GridSearchCV, RandomizedSearchCV,
                                     validation_curve, learning_curve)
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC # Example model for tuning
from sklearn.ensemble import RandomForestClassifier # Another example
from sklearn.metrics import accuracy_score

# --- 1. Load and Prepare Data ---
print("--- Loading Iris Dataset ---")
iris = load_iris()
X = iris.data
y = iris.target
class_names = iris.target_names

# We'll use the *full* dataset for cross-validation and tuning examples,
# though often you'd still hold out a final test set *after* tuning.
# For simplicity here, we scale the whole dataset. In practice, scaling
# should ideally happen *inside* each CV fold (using Pipelines, Section VIII).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Scaled Data shape: X={X_scaled.shape}, y={y.shape}")
print("-" * 30)
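
# --- Sketch: scaling inside each CV fold with a Pipeline ---
# Illustrative addition, not part of the recorded run. It shows one way to
# follow the advice above: wrapping StandardScaler and SVC in a pipeline means
# the scaler is fit on the training portion of each fold only, so no
# statistics from the held-out fold leak into preprocessing.
# The names X_raw, y_raw, pipe_svc and pipe_scores are assumptions made for
# this example; the imports are repeated so the snippet can run on its own.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_raw, y_raw = load_iris(return_X_y=True)  # unscaled features
pipe_svc = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
pipe_scores = cross_val_score(pipe_svc, X_raw, y_raw, cv=5, scoring='accuracy')
print(f"Pipeline CV accuracy (scaler fit per fold): {pipe_scores.mean():.4f}")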


# --- 2. Cross-Validation (CV) ---
# Evaluate model performance more robustly by splitting data into multiple 'folds'.
# Train on K-1 folds, test on the remaining fold, repeat K times.

print("--- Cross-Validation ---")

# a) cross_val_score: Simple way to get scores for each fold.
print("\n--- a) cross_val_score ---")
svc = SVC(kernel='rbf', C=1.0, random_state=42) # Basic SVC model

# cv parameter determines the splitting strategy:
# - Integer (e.g., 5 or 10): Uses KFold (regression) or StratifiedKFold (classification) by default.
# - CV splitter object (e.g., KFold(n_splits=5), StratifiedKFold(n_splits=5))
# scoring: Metric to evaluate (e.g., 'accuracy', 'neg_mean_squared_error', 'r2', 'f1_macro')
#          See sklearn.metrics.get_scorer_names() for options.

# Default CV for classifiers is StratifiedKFold
cv_scores_acc = cross_val_score(svc, X_scaled, y, cv=5, scoring='accuracy')
print(f"Cross-Validation Accuracy Scores (5 folds): {cv_scores_acc}")
print(f"Mean CV Accuracy: {cv_scores_acc.mean():.4f} (+/- {cv_scores_acc.std() * 2:.4f})") # Often report mean +/- 2*std dev

# Using a specific CV iterator (KFold - generally not ideal for classification)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores_kf = cross_val_score(svc, X_scaled, y, cv=kf, scoring='accuracy')
print(f"\nCV Accuracy Scores (KFold, 5 folds): {cv_scores_kf}")
print(f"Mean CV Accuracy (KFold): {cv_scores_kf.mean():.4f}")

# Using StratifiedKFold explicitly (good practice for classification)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores_skf = cross_val_score(svc, X_scaled, y, cv=skf, scoring='accuracy')
print(f"\nCV Accuracy Scores (StratifiedKFold, 5 folds): {cv_scores_skf}")
print(f"Mean CV Accuracy (StratifiedKFold): {cv_scores_skf.mean():.4f}")
print("-" * 20)

# b) cross_validate: More detailed results (fit time, score time, multiple metrics).
print("\n--- b) cross_validate ---")
scoring_metrics = ['accuracy', 'f1_macro'] # Evaluate multiple metrics
cv_results = cross_validate(svc, X_scaled, y, cv=skf, scoring=scoring_metrics, return_train_score=True)

print("Cross-Validate Results (Dictionary):")
# Convert results dict to DataFrame for nice printing
cv_results_df = pd.DataFrame(cv_results)
print(cv_results_df)
print(f"\nAverage Test Accuracy: {cv_results_df['test_accuracy'].mean():.4f}")
print(f"Average Test F1 (Macro): {cv_results_df['test_f1_macro'].mean():.4f}")
print(f"Average Fit Time: {cv_results_df['fit_time'].mean():.4f}s")
print("-" * 30)


# --- 3. Hyperparameter Tuning ---
# Finding the best parameters for a model (e.g., C and kernel for SVC).

print("--- Hyperparameter Tuning ---")

# Define the parameter grid to search
# Example for SVC
param_grid_svc = {
    'C': [0.1, 1, 10, 100],             # Regularization parameter
    'gamma': [1, 0.1, 0.01, 0.001],     # Kernel coefficient for 'rbf' (ignored when kernel='linear')
    'kernel': ['rbf', 'linear']         # Kernel type
}

# Example for RandomForest
param_grid_rf = {
    'n_estimators': [50, 100, 200],    # Number of trees
    'max_depth': [None, 5, 10, 20],    # Max depth of trees
    'min_samples_split': [2, 5, 10]    # Min samples required to split a node
}

# Use Stratified K-Fold for cross-validation during tuning
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# a) GridSearchCV: Exhaustive search over all parameter combinations.
print("\n--- a) GridSearchCV ---")
# Instantiate the model (without specific C, gamma here)
svc_grid = SVC(random_state=42, probability=True) # probability needed if using AUC later

# Set up GridSearchCV
# estimator: The model to tune.
# param_grid: Dictionary of parameters to try.
# scoring: Metric to optimize (e.g., 'accuracy').
# cv: Cross-validation strategy.
# n_jobs: Number of CPU cores to use (-1 uses all available).
# verbose: Controls the verbosity (amount of messages).
grid_search = GridSearchCV(estimator=svc_grid,
                           param_grid=param_grid_svc,
                           scoring='accuracy',
                           cv=cv_strategy,
                           n_jobs=-1,
                           verbose=1)

# Fit GridSearchCV to the data (this performs the search)
print("Starting GridSearchCV...")
grid_search.fit(X_scaled, y)
print("GridSearchCV finished.")

# Print the best parameters and best score found
print(f"\nBest Parameters found by GridSearchCV: {grid_search.best_params_}")
print(f"Best Cross-Validation Score (Accuracy): {grid_search.best_score_:.4f}")

# Get the best estimator found
best_svc = grid_search.best_estimator_
print(f"\nBest Estimator: {best_svc}")

# Optionally, evaluate the best estimator on a separate test set
# (We didn't hold one out here, but in practice you would)
# X_train_full, X_final_test, y_train_full, y_final_test = train_test_split(...)
# grid_search.fit(X_train_full_scaled, y_train_full)
# best_model = grid_search.best_estimator_
# final_accuracy = best_model.score(X_final_test_scaled, y_final_test)
# print(f"Accuracy on final held-out test set: {final_accuracy:.4f}")
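#
# --- Sketch: tune on a training split, score once on a held-out test set ---
# Illustrative addition, not part of the recorded run. It is one possible way
# to make the commented outline above concrete. test_size=0.2, the small
# parameter grid, and every *_demo name are assumptions chosen for this example.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# Pipeline parameters are addressed as '<step name>__<parameter name>'.
pipe_demo = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
grid_demo = GridSearchCV(
    pipe_demo,
    param_grid={'svc__C': [0.1, 1, 10], 'svc__gamma': [0.01, 0.1, 1]},
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42))
grid_demo.fit(X_tr_demo, y_tr_demo)  # the test split is never seen during tuning

# With the default refit=True, score() uses the best pipeline refit on all training data.
final_accuracy_demo = grid_demo.score(X_te_demo, y_te_demo)
print(f"Best params (demo): {grid_demo.best_params_}")
print(f"Held-out test accuracy (demo): {final_accuracy_demo:.4f}")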

# You can inspect all results
# grid_results_df = pd.DataFrame(grid_search.cv_results_)
# print("\nGrid Search CV Results Summary (first 5 rows):\n", grid_results_df.head())
print("-" * 20)


# b) RandomizedSearchCV: Samples a fixed number of parameter settings from specified distributions.
# More efficient than GridSearchCV when the search space is large.
print("\n--- b) RandomizedSearchCV ---")
# Define parameter distributions (can mix lists and distributions)
from scipy.stats import expon, randint

param_dist_svc = {
    'C': expon(scale=10),             # Sample from exponential distribution
    'gamma': expon(scale=0.1),
    'kernel': ['rbf', 'linear']         # Choose from list
}

param_dist_rf = {
    'n_estimators': randint(50, 250), # Sample integers uniformly from 50 to 249
    'max_depth': [None, 5, 10, 15, 20, 30],
    'min_samples_split': randint(2, 11) # Sample integers uniformly from 2 to 10
}


# Instantiate the model
rf_rand = RandomForestClassifier(random_state=42)

# Set up RandomizedSearchCV
# n_iter: Number of parameter settings that are sampled. Controls the search budget.
random_search = RandomizedSearchCV(estimator=rf_rand,
                                   param_distributions=param_dist_rf,
                                   n_iter=50,        # Try 50 random combinations
                                   scoring='accuracy',
                                   cv=cv_strategy,
                                   n_jobs=-1,
                                   verbose=1,
                                   random_state=42) # For reproducibility of sampling

# Fit RandomizedSearchCV
print("Starting RandomizedSearchCV...")
random_search.fit(X_scaled, y)
print("RandomizedSearchCV finished.")

# Print best results
print(f"\nBest Parameters found by RandomizedSearchCV: {random_search.best_params_}")
print(f"Best Cross-Validation Score (Accuracy): {random_search.best_score_:.4f}")
best_rf = random_search.best_estimator_
print(f"\nBest Estimator: {best_rf}")
print("-" * 30)


# --- 4. Learning Curves & Validation Curves (Conceptual) ---
# Tools to diagnose model performance (bias vs. variance).

print("--- Learning & Validation Curves (Conceptual) ---")
# - Learning Curve (learning_curve): Shows training and validation scores as a
#   function of the number of training samples. Helps identify if more data
#   would help, or if the model suffers from high bias or high variance.
# - Validation Curve (validation_curve): Shows training and validation scores
#   as a function of a *single* hyperparameter's value. Helps understand how
#   sensitive the model is to that parameter and find a good range.

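# Example usage (plotting is usually needed to interpret):
# Note: shuffle=True matters here. Iris is ordered by class, so the smallest
# unshuffled prefixes of each training fold would contain a single class and
# the SVC fit would fail (scores reported as NaN).
# train_sizes, train_scores, test_scores = learning_curve(
#     estimator=best_svc, X=X_scaled, y=y, cv=cv_strategy, n_jobs=-1,
#     train_sizes=np.linspace(0.1, 1.0, 10), scoring='accuracy',
#     shuffle=True, random_state=42)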
#
# param_range = np.logspace(-3, 2, 6) # Example range for SVC 'C' parameter
# train_scores_vc, test_scores_vc = validation_curve(
#     estimator=SVC(random_state=42, kernel='rbf'), X=X_scaled, y=y,
#     param_name='C', param_range=param_range, cv=cv_strategy,
#     scoring='accuracy', n_jobs=-1)
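#
# --- Sketch: computing learning and validation curves without plotting ---
# Illustrative addition, not part of the recorded run. It summarizes both
# curves as per-setting mean scores instead of plots. The pipeline, the
# train_sizes grid, the C range, and every *_lc / *_vc name are assumptions
# chosen for this example.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, learning_curve, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_lc, y_lc = load_iris(return_X_y=True)
cv_lc = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_lc = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Learning curve: score vs. number of training samples.
# shuffle=True so that small training subsets contain more than one class.
sizes_lc, train_lc, val_lc = learning_curve(
    pipe_lc, X_lc, y_lc, cv=cv_lc, scoring='accuracy',
    train_sizes=np.linspace(0.2, 1.0, 5), shuffle=True, random_state=42)
for n, tr, va in zip(sizes_lc, train_lc.mean(axis=1), val_lc.mean(axis=1)):
    print(f"n_train={n:3d}  train={tr:.3f}  validation={va:.3f}")

# Validation curve: score vs. one hyperparameter (C of the 'svc' pipeline step).
c_range_vc = np.logspace(-3, 2, 6)
train_vc, val_vc = validation_curve(
    pipe_lc, X_lc, y_lc, param_name='svc__C', param_range=c_range_vc,
    cv=cv_lc, scoring='accuracy')
for c, tr, va in zip(c_range_vc, train_vc.mean(axis=1), val_vc.mean(axis=1)):
    print(f"C={c:<8g} train={tr:.3f}  validation={va:.3f}")
# Reading the numbers: a large train/validation gap suggests high variance;
# low scores on both suggest high bias.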

print("Use learning_curve and validation_curve to diagnose bias/variance")
print("and understand hyperparameter sensitivity (plotting required).")
print("-" * 30)

--- Loading Iris Dataset ---
Scaled Data shape: X=(150, 4), y=(150,)
------------------------------
--- Cross-Validation ---

--- a) cross_val_score ---
Cross-Validation Accuracy Scores (5 folds): [0.96666667 0.96666667 0.96666667 0.93333333 1.        ]
Mean CV Accuracy: 0.9667 (+/- 0.0422)

CV Accuracy Scores (KFold, 5 folds): [1.         0.96666667 0.96666667 0.93333333 0.96666667]
Mean CV Accuracy (KFold): 0.9667

CV Accuracy Scores (StratifiedKFold, 5 folds): [1.         0.96666667 0.9        1.         0.9       ]
Mean CV Accuracy (StratifiedKFold): 0.9533
--------------------

--- b) cross_validate ---
Cross-Validate Results (Dictionary):
   fit_time  score_time  test_accuracy  train_accuracy  test_f1_macro  \
0  0.002250    0.003982       1.000000        0.966667       1.000000   
1  0.001369    0.002678       0.966667        0.975000       0.966583   
2  0.001482    0.002837       0.900000        0.983333       0.899749   
3  0.001537    0.002853       1.000000        0.966667 