# 1) Risk & Supervised Learning

This notebook covers fundamental concepts in supervised learning:
- Loss functions and risk metrics
- Train/test/validation splits
- Grid search for hyperparameter tuning
- Cross-validation techniques

### Loss & Risk Metrics

**What is this?**
Loss functions measure how wrong a prediction is compared to the true value. Risk is the average (expected) loss across all predictions.

**Key Concepts:**
- **Mean Squared Error (MSE)**: Used for regression. Penalizes large errors heavily. 
  - Per-sample formula: $(y_{true} - y_{pred})^2$ (MSE is the mean of this quantity over all samples)
  - Where: $y_{true}$ = actual/true value, $y_{pred}$ = predicted value
  
- **Mean Absolute Error (MAE)**: Used for regression. Less sensitive to outliers than MSE. 
  - Per-sample formula: $|y_{true} - y_{pred}|$ (MAE is the mean of this quantity over all samples)
  - Where: $y_{true}$ = actual/true value, $y_{pred}$ = predicted value
  
- **Zero-One Loss**: Used for classification. Returns 1 if prediction is wrong, 0 if correct.
  - Formula: $\mathbb{1}(y_{true} \neq y_{pred})$ where $\mathbb{1}$ is the indicator function
  - Where: $y_{true}$ = actual label, $y_{pred}$ = predicted label
  
- **Log Loss (Binary Cross-Entropy)**: Used for probabilistic binary classification. Penalizes confident wrong predictions heavily. 
  - Formula: $-(y \log(p) + (1-y)\log(1-p))$
  - Where: $y$ = true binary label (0 or 1), $p$ = predicted probability (between 0 and 1)
  
- **Average Risk**: The mean loss over all $n$ samples (the empirical risk). A single number summarizing model performance; lower is better.
  - Formula: $R = \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)$
  - Where: $n$ = number of samples, $L$ = loss function, $y_i$ = true value for sample $i$, $\hat{y}_i$ = predicted value for sample $i$
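
As a small worked example of the formulas above, the sketch below computes per-sample losses and then averages them into a risk value. The arrays are made-up toy data, not results from any model in this notebook.

```python
import numpy as np

# Toy regression data (made up for illustration)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

squared_errors = (y_true - y_pred) ** 2      # per-sample squared loss
absolute_errors = np.abs(y_true - y_pred)    # per-sample absolute loss

mse = squared_errors.mean()    # average risk under squared loss: 0.375
mae = absolute_errors.mean()   # average risk under absolute loss: 0.5

# Toy classification data (made up for illustration)
labels_true = np.array([1, 0, 1, 1])
labels_pred = np.array([1, 1, 1, 0])
zero_one = (labels_true != labels_pred).astype(float)  # [0, 1, 0, 1]
error_rate = zero_one.mean()   # average zero-one risk = 1 - accuracy = 0.5

print(mse, mae, error_rate)
```

Averaging the zero-one loss gives the error rate, so minimizing it is the same as maximizing accuracy.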

**When to use:**
- Use MSE when you want to penalize large errors more
- Use MAE when outliers shouldn't dominate your loss
- Use Zero-One for simple classification accuracy
- Use Log Loss when your model outputs probabilities and you want confident wrong predictions to be penalized far more than hesitant ones
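
To make these guidelines concrete, here is a minimal sketch with made-up numbers. It shows how a single outlier dominates MSE far more than MAE, and how log loss grows as a prediction becomes confidently wrong. The helper `log_loss_positive` is a hypothetical name used only for this illustration.

```python
import numpy as np

# Residuals (y_true - y_pred) for five samples; the last one is an outlier
residuals = np.array([1.0, -1.0, 1.0, -1.0, 10.0])

mse = np.mean(residuals ** 2)      # (1 + 1 + 1 + 1 + 100) / 5 = 20.8
mae = np.mean(np.abs(residuals))   # (1 + 1 + 1 + 1 + 10) / 5 = 2.8

# Share of the total loss contributed by the outlier alone
outlier_share_mse = residuals[-1] ** 2 / np.sum(residuals ** 2)       # 100 / 104
outlier_share_mae = abs(residuals[-1]) / np.sum(np.abs(residuals))    # 10 / 14

def log_loss_positive(p):
    """Log loss when the true label is 1 and the predicted probability is p."""
    return -np.log(p)

loss_confident_right = log_loss_positive(0.99)   # small loss
loss_unsure = log_loss_positive(0.50)            # moderate loss
loss_confident_wrong = log_loss_positive(0.01)   # large loss

print(mse, mae)
print(loss_confident_right, loss_unsure, loss_confident_wrong)
```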

In [2]:
import numpy as np

# Mean Squared Error
def loss_squared(y_true, y_pred):
    """
    Calculate the per-sample squared error (its mean over samples is the MSE).
    
    Parameters:
    -----------
    y_true : array-like
        True/actual values (ground truth)
    y_pred : array-like
        Predicted values from model
    
    Returns:
    --------
    mse : ndarray
        Array of squared errors (one per sample)
    """
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return (y_true - y_pred) ** 2

# Mean Absolute Error
def loss_absolute(y_true, y_pred):
    """
    Calculate the per-sample absolute error (its mean over samples is the MAE).
    
    Parameters:
    -----------
    y_true : array-like
        True/actual values (ground truth)
    y_pred : array-like
        Predicted values from model
    
    Returns:
    --------
    mae : ndarray
        Array of absolute errors (one per sample)
    """
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return np.abs(y_true - y_pred)

# Zero-One Loss
def loss_zero_one(y_true, y_pred_label):
    """
    Calculate Zero-One Loss for classification.
    
    Parameters:
    -----------
    y_true : array-like of int
        True/actual class labels (ground truth)
    y_pred_label : array-like of int
        Predicted class labels from model
    
    Returns:
    --------
    loss : ndarray of float
        Array where 1.0 = incorrect prediction, 0.0 = correct prediction (one per sample)
    """
    y_true = np.asarray(y_true, int)
    y_pred_label = np.asarray(y_pred_label, int)
    return (y_true != y_pred_label).astype(float)

# Log Loss for Binary Classification
def loss_logloss_binary(y_true, y_pred_prob, eps=1e-12):
    """
    Calculate Log Loss (Binary Cross-Entropy) for binary classification.
    
    Parameters:
    -----------
    y_true : array-like of float
        True binary labels (0 or 1)
    y_pred_prob : array-like of float
        Predicted probabilities (between 0 and 1) for positive class
    eps : float, default=1e-12
        Small constant to avoid log(0) by clipping probabilities to [eps, 1-eps]
    
    Returns:
    --------
    log_loss : ndarray
        Array of log loss values (one per sample)
    """
    # y_true in {0,1}, y_pred_prob in [0,1]
    y_true = np.asarray(y_true, float)
    p = np.asarray(y_pred_prob, float)
    p = np.clip(p, eps, 1 - eps)  # Clip to avoid log(0)
    return -(y_true*np.log(p) + (1-y_true)*np.log(1-p))

# Average Risk (Expected Loss)
def average_risk(y_true, y_pred, loss_fn):
    """
    Calculate average risk (expected loss) across all samples.
    
    Parameters:
    -----------
    y_true : array-like
        True/actual values or labels (ground truth)
    y_pred : array-like
        Predicted values or labels from model
    loss_fn : callable
        Loss function that takes (y_true, y_pred) and returns array of losses
    
    Returns:
    --------
    avg_risk : float
        Mean loss across all samples (scalar value representing overall model performance)
    """
    # returns the average loss
    losses = loss_fn(y_true, y_pred)
    return float(np.mean(losses))

### Train Test Validation Split

**What is this?**
Splitting your data into separate sets to train your model and evaluate its performance fairly.

**Why we need it:**
- **Training set**: Used to train/fit the model
- **Validation set**: Used during model development to tune hyperparameters and make decisions
- **Test set**: Used only once at the end to get an unbiased estimate of model performance

**Key points:**
- The helper in this section defaults to 70% train, 15% validation, 15% test (a common choice, not a fixed rule)
- Always shuffle data before splitting (unless time-series)
- Use `random_state` for reproducibility
- The validation set helps prevent overfitting during model selection
- Never touch the test set until final evaluation!
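
The points above can be sketched as two successive calls to scikit-learn's `train_test_split`. The data here is made up, and passing `stratify` is an optional addition for classification labels, not something the key points require.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features, balanced binary labels (made up)
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Step 1: hold out 30% of the data (validation + test combined)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, shuffle=True, random_state=0, stratify=y
)

# Step 2: split the held-out 30% in half -> 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, shuffle=True, random_state=0, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```

The second call uses a relative fraction: the test set is 0.15 / (0.15 + 0.15) = 0.5 of the held-out portion, not 0.15 of it.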

In [None]:
from sklearn.model_selection import train_test_split

def train_val_test_split(X, y=None, train_frac=0.7, val_frac=0.15, test_frac=0.15, shuffle=True, random_state=None):
    """
    Split X (and optional y) into train/validation/test sets using scikit-learn.
    
    Parameters:
    -----------
    X : array-like of shape (n_samples, n_features)
        Feature matrix to split
    y : array-like of shape (n_samples,), default=None
        Target vector to split (optional). If provided, it is split with the same row indices as X.
        Note: no stratification is applied, so label proportions may differ between splits
    train_frac : float, default=0.7
        Fraction of data for training set (must be between 0 and 1)
    val_frac : float, default=0.15
        Fraction of data for validation set (must be between 0 and 1)
    test_frac : float, default=0.15
        Fraction of data for test set (must be between 0 and 1)
        Note: train_frac + val_frac + test_frac must equal 1.0
    shuffle : bool, default=True
        Whether to shuffle data before splitting (recommended unless time-series data)
    random_state : int or None, default=None
        Random seed for reproducibility. Use fixed int for consistent splits across runs
    
    Returns:
    --------
    If y is None:
        X_train : array-like
            Training features (train_frac of original data)
        X_val : array-like
            Validation features (val_frac of original data)
        X_test : array-like
            Test features (test_frac of original data)
    
    If y is provided:
        X_train : array-like
            Training features
        X_val : array-like
            Validation features
        X_test : array-like
            Test features
        y_train : array-like
            Training labels
        y_val : array-like
            Validation labels
        y_test : array-like
            Test labels
    """
    fracs = float(train_frac) + float(val_frac) + float(test_frac)
    if not np.isclose(fracs, 1.0):
        raise ValueError("train_frac + val_frac + test_frac must sum to 1.0")

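    # first split off the training set
    test_plus_val = val_frac + test_frac
    if np.isclose(test_plus_val, 0.0):
        # train_test_split rejects test_size=0, so keep all rows in train
        # (returned in their original order; no shuffling is applied here)
        X_train, X_temp = X, X[:0]
        if y is not None:
            y_train, y_temp = y, y[:0]
    elif y is None:
        X_train, X_temp = train_test_split(
            X, train_size=train_frac, test_size=test_plus_val,
            random_state=random_state, shuffle=shuffle
        )
    else:
        X_train, X_temp, y_train, y_temp = train_test_split(
            X, y, train_size=train_frac, test_size=test_plus_val,
            random_state=random_state, shuffle=shuffle
        )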

    # handle edge cases where val or test fraction is zero
    if np.isclose(test_plus_val, 0.0):
        # no val/test portion
        X_val, X_test = X_temp[:0], X_temp[:0]
        if y is not None:
            y_val, y_test = y_temp[:0], y_temp[:0]
    elif np.isclose(val_frac, 0.0):
        X_val, X_test = X_temp[:0], X_temp
        if y is not None:
            y_val, y_test = y_temp[:0], y_temp
    elif np.isclose(test_frac, 0.0):
        X_val, X_test = X_temp, X_temp[:0]
        if y is not None:
            y_val, y_test = y_temp, y_temp[:0]
    else:
        # split the temp set into val and test according to their relative proportions
        test_rel = test_frac / (val_frac + test_frac)
        if y is None:
            X_val, X_test = train_test_split(
                X_temp, test_size=test_rel, random_state=random_state, shuffle=shuffle
            )
        else:
            X_val, X_test, y_val, y_test = train_test_split(
                X_temp, y_temp, test_size=test_rel, random_state=random_state, shuffle=shuffle
            )

    if y is None:
        return X_train, X_val, X_test
    return X_train, X_val, X_test, y_train, y_val, y_test

### Grid Search

**What is this?**
A systematic way to find the best hyperparameters for your model by trying all combinations of parameter values.

**How it works:**
1. Define a grid of hyperparameter values to try
2. For each combination, train the model using k-fold cross-validation
3. Evaluate performance using a scoring metric
4. Return the best combination
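
The four steps above can be sketched as an explicit loop. This is a simplified illustration of the idea using `ParameterGrid` and `cross_val_score`, not a reproduction of what `GridSearchCV` does internally; the Iris dataset and the `SVC` model are choices made only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 1. Define the grid of hyperparameter values
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}

best_score, best_params = float("-inf"), None
for params in ParameterGrid(param_grid):
    # 2. Train and evaluate this combination with 5-fold cross-validation
    fold_scores = cross_val_score(SVC(**params), X, y, cv=5, scoring="accuracy")
    # 3. Summarize performance as the mean score across folds
    mean_score = fold_scores.mean()
    if mean_score > best_score:
        best_score, best_params = mean_score, params

# 4. Report the best combination
print(best_params, round(best_score, 3))
```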

**Key parameters:**
- `estimator`: The machine learning model to tune (must have fit/predict methods)
- `param_grid`: Dictionary mapping parameter names (str) to lists of values to try
  - Example: `{'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}` tries 6 combinations
- `cv`: Number of cross-validation folds (default: 5)
  - Higher values = more reliable but slower
- `scoring`: Metric to optimize 
  - Classification: 'accuracy', 'f1', 'precision', 'recall', 'roc_auc'
  - Regression: 'neg_mean_squared_error', 'neg_mean_absolute_error', 'r2'
- `n_jobs`: Number of parallel jobs (-1 = use all CPU cores)
- `verbose`: Controls output verbosity (0 = silent, 1 = summary of the fits to run, 2+ = per-fold details)
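
To see how quickly a grid grows, the short sketch below expands the example `param_grid` with scikit-learn's `ParameterGrid` and counts the model fits implied by `cv=5`. The variable names are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
combinations = list(ParameterGrid(param_grid))

n_combinations = len(combinations)   # 3 values of C x 2 kernels = 6
cv = 5
n_fits = n_combinations * cv         # 30 cross-validation fits

for params in combinations:
    print(params)
print(n_combinations, n_fits)
```

Each added parameter multiplies the number of combinations, so the cost of an exhaustive search rises fast with grid size.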

**Output:**
- `best_estimator_`: The fitted model with best hyperparameters
- `best_params_`: Dictionary of the best hyperparameter values found
- `best_score_`: The best cross-validation score achieved
- `cv_results_`: Full results for all parameter combinations
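
A minimal sketch of reading these attributes after fitting is shown here. The Iris dataset and `SVC` are assumptions made for the example, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

grid = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)

best_model = grid.best_estimator_                   # refitted on all of X, y
mean_scores = grid.cv_results_["mean_test_score"]   # one mean CV score per combination

print(grid.best_params_)
print(round(grid.best_score_, 3))
print(len(mean_scores))
```

`best_score_` is the mean cross-validation score of the winning combination, so it matches the largest entry of `mean_test_score`.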

In [None]:
from sklearn.model_selection import GridSearchCV

def grid_search(model, param_grid, X_train, y_train, cv=5, scoring=None, verbose=1, n_jobs=-1):
    """
    Perform grid search to find the best hyperparameters for a given model.
    
    Parameters:
    -----------
    model : estimator object
        The machine learning model to tune (e.g., sklearn model or any estimator with fit/predict methods).
        Must implement the scikit-learn estimator interface.
    param_grid : dict or list of dicts
        Dictionary with parameter names (str) as keys and lists of parameter settings to try as values.
        Example: {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']} will try all 6 combinations.
        Can also be a list of such dicts to search over different parameter spaces.
    X_train : array-like of shape (n_samples, n_features)
        Training feature matrix. Each row is a sample, each column is a feature.
    y_train : array-like of shape (n_samples,)
        Training target vector. Labels for classification or values for regression.
    cv : int, cross-validation generator, or iterable, default=5
        Number of cross-validation folds. If int, performs k-fold CV.
        If CV object (e.g., StratifiedKFold), uses that splitter.
    scoring : str, callable, or None, default=None
        Strategy to evaluate model performance on the held-out fold of each CV split.
        - Classification: 'accuracy', 'f1', 'precision', 'recall', 'roc_auc'
        - Regression: 'neg_mean_squared_error', 'neg_mean_absolute_error', 'r2'
        - If None, uses model's default score() method
    verbose : int, default=1
        Controls verbosity of output during grid search:
        - 0: No output
        - 1: Prints a summary of the number of candidates and total fits
        - 2+: Adds per-fold, per-candidate details (more at higher levels)
    n_jobs : int, default=-1
        Number of jobs to run in parallel during cross-validation:
        - -1: Use all available CPU cores
        - 1: No parallelism (sequential)
        - n: Use n cores
    
    Returns:
    --------
    best_model : estimator object
        The model fitted with the best hyperparameters found by grid search.
        This is the estimator that gave the best CV score, refitted on full training data.
    best_params : dict
        Dictionary of the best hyperparameter values found.
        Example: {'C': 1, 'kernel': 'rbf'}
    best_score : float
        The best mean cross-validation score achieved across all parameter combinations.
        This is the average score across all CV folds for the best parameters.
    grid_search_results : GridSearchCV object
        The complete GridSearchCV object containing:
        - cv_results_: Full results dictionary with scores for all combinations
        - best_index_: Index of best parameter combination
        - n_splits_: Number of CV splits used
    """
    grid = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        cv=cv,
        scoring=scoring,
        verbose=verbose,
        n_jobs=n_jobs,
        return_train_score=True
    )
    
    grid.fit(X_train, y_train)
    
    return grid.best_estimator_, grid.best_params_, grid.best_score_, grid

### Cross-Validation

**What is this?**
A resampling technique to evaluate model performance by splitting data into multiple train/test folds.

**K-Fold Cross-Validation:**
1. Split data into $k$ equal-sized folds: $D = D_1 \cup D_2 \cup ... \cup D_k$
2. For each fold $i = 1, ..., k$: 
   - Train on $(k-1)$ folds: $D \setminus D_i$
   - Test on fold $i$: $D_i$
   - Compute score: $S_i$
3. Average the $k$ test scores: $\text{CV Score} = \frac{1}{k}\sum_{i=1}^{k} S_i$
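
The procedure above can be written out by hand with `KFold`. This is a sketch under assumptions: the Iris dataset, a `LogisticRegression` model, and accuracy as the score $S_i$ are all choices made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# 1. Split the data into k folds
k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # 2. Train on the other k-1 folds, test on the held-out fold
    model = LogisticRegression(max_iter=1000)  # fresh model for every fold
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # S_i (accuracy)

# 3. Average the k test scores
cv_score = np.mean(fold_scores)
cv_std = np.std(fold_scores)

print(fold_scores)
print(cv_score, cv_std)
```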

**Mathematical notation:**
- $k$ = number of folds
- $D$ = full dataset of size $n$
- $D_i$ = fold $i$ (approximately $n/k$ samples)
- $S_i$ = performance score on fold $i$
- $\bar{S}$ = mean of the $k$ fold scores (the CV score), $\sigma_S$ = standard deviation of the $k$ fold scores

**Benefits:**
- More reliable than a single train/test split
- Every sample is used for both training and testing (in different folds)
- Averaging over folds reduces the variance of the performance estimate

**Common k values:**
- $k=5$: Standard choice (good bias-variance trade-off)
- $k=10$: More computation but less bias
- $k=n$ (Leave-One-Out): Maximum data usage but high variance, and the most expensive option ($n$ model fits)
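
To illustrate the trade-off, the sketch below compares train and test fold sizes for different $k$ on a placeholder dataset of 20 samples. Only the row count matters here; the values in `X` are dummies.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.zeros((20, 1))  # 20 placeholder samples; only the row count matters

fold_shapes = {}
for k in (5, 10):
    train_idx, test_idx = next(KFold(n_splits=k).split(X))
    fold_shapes[k] = (len(train_idx), len(test_idx))   # (train size, test size)

loo = LeaveOneOut()
train_idx, test_idx = next(loo.split(X))
fold_shapes["loo"] = (len(train_idx), len(test_idx))
n_loo_fits = loo.get_n_splits(X)   # one model fit per sample

print(fold_shapes)
print(n_loo_fits)
```

Larger $k$ leaves more data for training in each fold but requires more model fits, with Leave-One-Out needing one fit per sample.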

**Stratified CV:** Maintains class proportions $P(y=c)$ in each fold (important for imbalanced data)
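
A small sketch of why stratification matters on imbalanced labels follows. The labels are made up (90 negatives followed by 10 positives), and shuffling is left off so the effect is easy to see.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 90 negatives followed by 10 positives (made up)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # placeholder features

# Fraction of positives in each test fold
plain = [y[test].mean() for _, test in KFold(n_splits=5).split(X)]
strat = [y[test].mean() for _, test in StratifiedKFold(n_splits=5).split(X, y)]

print(plain)  # positives are concentrated in the last fold
print(strat)  # every fold keeps the overall 10% positive rate
```

Shuffling a plain `KFold` would spread the positives out on average, but only stratification guarantees the proportion in every fold.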

In [None]:
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold

def cross_validate_model(model, X, y, cv=5, scoring='accuracy', stratified=True):
    """
    Perform k-fold cross-validation on a model.
    
    Parameters:
    -----------
    model : estimator object
        The machine learning model to evaluate. Must implement fit/predict methods.
        Will be cloned and fitted k times (once per fold).
    X : array-like of shape (n_samples, n_features)
        Complete feature matrix. Will be split into k folds for training/testing.
        Each row is a sample, each column is a feature.
    y : array-like of shape (n_samples,)
        Complete target vector. Labels for classification or values for regression.
        Will be split into k folds along with X.
    cv : int, default=5
        Number of folds (k) for cross-validation.
        Common values: 5 (standard), 10 (more robust), 3 (faster)
    scoring : str or callable, default='accuracy'
        Scoring metric to evaluate model performance:
        - Classification: 'accuracy', 'f1', 'precision', 'recall', 'roc_auc', 'f1_macro', 'f1_weighted'
        - Regression: 'neg_mean_squared_error', 'neg_mean_absolute_error', 'r2'
        Note: scikit-learn uses negative values for error metrics (higher is better)
    stratified : bool, default=True
        Whether to use stratified folds (recommended for classification):
        - True: Maintains class distribution in each fold (for classification tasks)
        - False: Uses standard k-fold (may have imbalanced folds)
        Only applicable when y contains integer class labels
    
    Returns:
    --------
    scores : ndarray of shape (cv,)
        Array of cross-validation scores, one per fold.
        Example: [0.85, 0.87, 0.83, 0.89, 0.86] for cv=5
    mean_score : float
        Mean of CV scores across all folds: (1/k) * sum(scores).
        This is the primary performance estimate.
    std_score : float
        Standard deviation of CV scores: measures stability/consistency of model.
        Lower std = more stable performance across different data splits.
    """
    # Choose appropriate cross-validation splitter
    if stratified and np.issubdtype(np.asarray(y).dtype, np.integer):
        # Stratified: maintains class proportions in each fold
        cv_splitter = StratifiedKFold(n_splits=cv, shuffle=True, random_state=42)
    else:
        # Standard k-fold: simple random splits
        cv_splitter = KFold(n_splits=cv, shuffle=True, random_state=42)
    
    # Perform cross-validation: trains and evaluates model k times
    scores = cross_val_score(model, X, y, cv=cv_splitter, scoring=scoring)
    
    return scores, float(np.mean(scores)), float(np.std(scores))