In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report

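import warnings
# Silences all warnings, including scikit-learn ConvergenceWarning messages;
# remove this line while debugging a model that fails to converge.
warnings.filterwarnings("ignore")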

sns.set(style="whitegrid")


## Why Do Hyperparameters Matter?

Hyperparameters are **external configuration settings** of a model that are fixed before training. Unlike model parameters such as coefficients, they are not learned from the data, so they must be set manually or chosen by a search strategy.

Examples:
- `C` and `penalty` in Logistic Regression
- `max_depth`, `n_estimators` in Random Forest
- `kernel` and `gamma` in Support Vector Machines

Choosing good values for these hyperparameters can:
- Improve generalization
- Reduce overfitting
- Shorten training time
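
To make the distinction between hyperparameters and learned parameters concrete, here is a minimal sketch. It assumes scikit-learn's `LogisticRegression` with an arbitrarily chosen `C=0.5`; the variable names are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameters are fixed when the estimator is constructed.
model = LogisticRegression(C=0.5, solver="liblinear")
hyperparams = model.get_params()
print("C before fit:", hyperparams["C"])

# Learned parameters (here, the coefficients) only exist after fitting.
has_coef_before = hasattr(model, "coef_")
model.fit(X, y)
has_coef_after = hasattr(model, "coef_")

print("coef_ before fit:", has_coef_before)
print("coef_ after fit:", has_coef_after, "shape:", model.coef_.shape)
print("C after fit:", model.get_params()["C"])  # unchanged by training
```

Training fills in `coef_` but leaves `C` untouched, which is why `C` has to be chosen from outside the fitting procedure.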


In [17]:
# Load breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {X_train.shape[0]}, Features: {X_train.shape[1]}")


Training samples: 455, Features: 30


## What Is the Hyperparameter Space?

The **hyperparameter space** is the set of all combinations of hyperparameter values you want to explore.

The space can be:
- **Grid-like** (e.g., all combinations of specific values)
- **Randomly sampled** from distributions
- **Explored adaptively** via optimization methods like Bayesian optimization
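
As a rough illustration of the first two kinds of space, the sketch below enumerates a small grid with `ParameterGrid` and draws a fixed number of points from distributions with `ParameterSampler`. The specific values and ranges are assumptions chosen for the example, not recommendations.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid-like space: every combination of the listed values.
grid = ParameterGrid({
    "C": [0.01, 0.1, 1, 10],
    "solver": ["liblinear", "lbfgs"],
})
print("Grid size:", len(grid))  # 4 values of C x 2 solvers
for params in list(grid)[:3]:
    print(params)

# Randomly sampled space: draw a fixed budget of points from distributions.
sampler = ParameterSampler(
    {"C": loguniform(1e-3, 1e2), "max_iter": randint(100, 500)},
    n_iter=5,
    random_state=0,
)
samples = list(sampler)
for params in samples:
    print(params)
```

The grid grows with every added value or hyperparameter, while the sampler always returns exactly `n_iter` candidates regardless of how large the underlying space is.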


In [18]:
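# Manually compare several values of C (inverse regularization strength)
# using the mean accuracy over 5-fold cross-validation on the training set.
for c in [0.001, 0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=c)  # default solver and max_iter
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"C={c} → Accuracy: {scores.mean():.3f}")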


C=0.001 → Accuracy: 0.923
C=0.01 → Accuracy: 0.927
C=0.1 → Accuracy: 0.932
C=1 → Accuracy: 0.938
C=10 → Accuracy: 0.941


## Automated Search Methods

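- **Grid Search** evaluates every combination in a fixed grid, so its cost grows multiplicatively with each added hyperparameter.
- **Random Search** draws a fixed number of combinations (`n_iter`) from specified lists or distributions, so its cost is set by the budget rather than by the size of the space.

Both methods score each candidate with **cross-validation** on the training data.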


In [19]:
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['liblinear']
}

grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best Params (Grid Search):", grid_search.best_params_)
print("Best CV Score:", f"{grid_search.best_score_:.3f}")


Best Params (Grid Search): {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
Best CV Score: 0.963


In [20]:
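from scipy.stats import randint

# randint(low, high) samples integers from low (inclusive) to high (exclusive).
param_dist = {
    'n_estimators': randint(10, 200),
    'max_depth': randint(2, 20),
    'max_features': ['sqrt', 'log2']
}

# random_state fixes the sampled candidates; the forest itself is unseeded.
rnd_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    random_state=42,
)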
rnd_search.fit(X_train, y_train)

print("Best Params (Random Search):", rnd_search.best_params_)
print("Best CV Score:", rnd_search.best_score_)


Best Params (Random Search): {'max_depth': 8, 'max_features': 'log2', 'n_estimators': 102}
Best CV Score: 0.9626373626373625
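
The cross-validation scores reported by a search are also the scores used to pick the winning configuration, so they tend to be slightly optimistic. A minimal sketch of the final check on the held-out test set follows. It rebuilds the same split and a small grid search so that it runs on its own; the name `search` is introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    {"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is retrained on all of X_train.
test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))

print(f"Best CV score: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```

The test set is touched once, after the search has finished, so it plays no part in choosing the hyperparameters.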


## Beware of Data Leakage

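**Data leakage** occurs when information that will not be available at prediction time, such as statistics computed from validation or test data, is used to build or tune the model. The resulting scores look better than the model's real performance.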

Common causes:
- Fitting scalers, encoders, or feature selectors on the full dataset before splitting it
- Including future information in time series
- Leaking labels into features

Prevention:
- Split the data before fitting any preprocessing step
- Use pipelines so that preprocessing is fitted only on the training folds within each cross-validation split
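
One way to follow this advice with scikit-learn is to put the preprocessing inside a `Pipeline` and pass the pipeline to the search. The sketch below assumes a `StandardScaler` followed by logistic regression; the step names `scaler` and `clf` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is a pipeline step, so each CV split refits it on the
# training folds only; the validation fold never influences its statistics.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The step-name prefix ("clf__") routes a parameter to that pipeline step.
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print(f"Best CV score: {search.best_score_:.3f}")
```

Scaling `X_train` once up front and then cross-validating would let each validation fold contribute to the scaler's mean and variance, which is the leakage the pipeline avoids.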


## Beyond Grid Search: Bayesian Optimization and AutoML

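**Bayesian optimization** fits a probabilistic surrogate model (for example a Gaussian process or a tree-structured Parzen estimator) to the scores of the configurations tried so far. An acquisition function then chooses the next configuration to evaluate, balancing exploration of uncertain regions against exploitation of promising ones.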

Popular tools:
- `optuna`
- `scikit-optimize` (skopt)
- `hyperopt`
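
To show the general idea without depending on any of these libraries, here is a minimal sketch of one common form of Bayesian optimization: a Gaussian-process surrogate with an expected-improvement acquisition function. It is not how the tools above are implemented. The `objective` function is a hypothetical stand-in for a cross-validation score as a function of `log10(C)`, and all names and constants are assumptions made for the example.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(log_c):
    """Toy stand-in for a CV score as a function of log10(C); peaks at 0.5."""
    return -(log_c - 0.5) ** 2


def expected_improvement(candidates, gp, best_y, xi=0.01):
    """Expected improvement over best_y for a maximization problem."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    improvement = mu - best_y - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)


rng = np.random.default_rng(0)
candidates = np.linspace(-3, 3, 200).reshape(-1, 1)

# Start from a few random evaluations.
X_obs = rng.uniform(-3, 3, size=(3, 1))
y_obs = objective(X_obs).ravel()

for _ in range(10):
    # Surrogate: a Gaussian process fitted to all evaluations so far.
    gp = GaussianProcessRegressor(
        kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True
    )
    gp.fit(X_obs, y_obs)

    # Acquisition: evaluate next where expected improvement is largest.
    ei = expected_improvement(candidates, gp, y_obs.max())
    x_next = candidates[np.argmax(ei)].reshape(1, -1)

    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

best_log_c = X_obs[np.argmax(y_obs), 0]
print(f"Best log10(C) found: {best_log_c:.2f}")
```

In a real search, `objective` would train and cross-validate a model, which is the expensive step that the surrogate is meant to call as rarely as possible.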

**AutoML frameworks** can automatically:
- Try different models
- Tune hyperparameters
- Perform preprocessing

Examples:
- `Auto-sklearn`
- `TPOT`
- `H2O AutoML`
- `Google AutoML`, `Azure Automated ML`
