# Chapter 4 — Risk, Supervised Learning, Classification, and MLE (Reusable Python Toolkit)

This notebook is a **practical companion** to the provided lecture notes PDF (Chapter 4: *Risk*).  
It includes:
- Concise explanations of each concept in the notes  
- **Reusable Python functions** for each important component (risk, empirical risk, ERM, regression-function estimation, Bayes rule, MLE, linear/logistic regression)

> **Tip**: You can reuse these functions by only changing **inputs/parameters** (data arrays, loss choice, model function, optimizer settings, etc.).

## 0) Setup

We'll use standard scientific Python libraries. If something is missing in your environment, install with:
```bash
pip install numpy scipy scikit-learn matplotlib
```

In [None]:
import numpy as np
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional, Tuple, Union

# Standard-library import (always available, no installation needed)
import math

In [None]:
# Numerical helpers
def sigmoid(x: np.ndarray) -> np.ndarray:
    """Stable logistic sigmoid: σ(x) = 1/(1+exp(-x)).
    
    Parameters
    ----------
    x : np.ndarray
        Input values (any shape)
    
    Returns
    -------
    np.ndarray
        Sigmoid-transformed values, same shape as input
    """
    x = np.asarray(x)
    # Avoid overflow: split positive/negative
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    expx = np.exp(x[~pos])
    out[~pos] = expx / (1.0 + expx)
    return out

def add_intercept(X: np.ndarray) -> np.ndarray:
    """Append a column of ones to X for intercept term in linear models.
    
    Parameters
    ----------
    X : np.ndarray
        Feature matrix, shape (n_samples,) or (n_samples, n_features)
    
    Returns
    -------
    np.ndarray
        Design matrix with intercept column, shape (n_samples, n_features+1)
    """
    X = np.asarray(X)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    return np.c_[np.ones((X.shape[0], 1)), X]


## 1) Statistical Models and Risk

A **statistical model** is a family of probability distributions $\{f(x;\theta) : \theta\in\Theta\}$, where:
- $f(x;\theta)$ = probability density/mass function parameterized by $\theta$
- $\theta$ = model parameter(s)
- $\Theta$ = parameter space (set of all possible parameter values)
- **Parametric**: The parameter $\theta$ is finite-dimensional (e.g., $\theta=(\mu,\sigma)$ for normal distribution)
- **Non-parametric**: The parameter space $\Theta$ is infinite-dimensional (e.g., all possible continuous functions)

In supervised learning, we observe training pairs $(X_i, Y_i)$ and select a function $g$ from a model class $\mathcal{M}$ that minimizes the **risk**:

$$
R(g)=\mathbb{E}[L(Z,g)]\quad\text{where } Z=(X,Y)
$$

**Variable definitions:**
- $Z = (X,Y)$ = random variable pair (feature, label)
- $X$ = feature/input variable
- $Y$ = response/output/label variable  
- $g$ = prediction function from model class $\mathcal{M}$
- $L(Z,g)$ = loss function measuring prediction error for data point $Z$ using predictor $g$
- $R(g)$ = risk (expected loss)
- $\mathbb{E}[\cdot]$ = expectation operator

In practice, we minimize **empirical risk** $\hat{R}(g) = \frac{1}{n}\sum_{i=1}^n L(Z_i, g)$ (average loss on training set) as an approximation of true risk.
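
As a minimal sketch of this definition, the snippet below computes the empirical risk under squared loss for one fixed predictor. The toy sample and the predictor `g(x) = x` are illustrative assumptions, not taken from the notes.

```python
import numpy as np

# Hypothetical toy sample Z_i = (X_i, Y_i); values chosen only for illustration.
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([0.1, 0.9, 2.2, 2.8])

def g(x):
    """Fixed candidate predictor g(x) = x (an assumption for this example)."""
    return x

# Pointwise losses L(Z_i, g) = (Y_i - g(X_i))^2
losses = (Y - g(X)) ** 2

# Empirical risk: average loss over the n training points (about 0.025 here)
R_hat = losses.mean()
print(R_hat)
```

The true risk would replace this average with an expectation over the unknown distribution of $Z$, which is why it can only be approximated from data.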

In [None]:
# --- Loss functions (plug-and-play) ---

def squared_loss(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Pointwise squared loss (y - yhat)^2.
    
    Parameters
    ----------
    y_true : np.ndarray
        True target values, shape (n_samples,)
    y_pred : np.ndarray
        Predicted values, shape (n_samples,)
    
    Returns
    -------
    np.ndarray
        Squared loss for each sample, shape (n_samples,)
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return (y_true - y_pred) ** 2

def zero_one_loss(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Pointwise 0-1 loss: 1 if wrong else 0.
    
    Parameters
    ----------
    y_true : np.ndarray
        True class labels, shape (n_samples,)
    y_pred : np.ndarray
        Predicted class labels, shape (n_samples,)
    
    Returns
    -------
    np.ndarray
        0-1 loss for each sample (0=correct, 1=incorrect), shape (n_samples,)
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return (y_true != y_pred).astype(float)

def negative_log_likelihood_loss(logpdf: Callable[[np.ndarray, np.ndarray], np.ndarray],
                                 params: np.ndarray,
                                 z: np.ndarray) -> np.ndarray:
    """Generic negative log-likelihood loss: -log p_params(z).
    
    Parameters
    ----------
    logpdf : Callable
        Function that computes log p(z; params) for each data point
    params : np.ndarray
        Model parameters
    z : np.ndarray
        Data points
    
    Returns
    -------
    np.ndarray
        Negative log-likelihood for each data point
    """
    return -logpdf(params, z)


In [None]:
# --- Risk utilities ---

def empirical_risk(loss_fn: Callable[[np.ndarray, np.ndarray], np.ndarray],
                   y_true: np.ndarray,
                   y_pred: np.ndarray,
                   reducer: Callable[[np.ndarray], float] = np.mean) -> float:
    """Compute empirical risk: average (or other aggregate) of pointwise losses.
    
    Parameters
    ----------
    loss_fn : Callable
        Loss function mapping (y_true, y_pred) to pointwise loss array
    y_true : np.ndarray
        True target values, shape (n_samples,)
    y_pred : np.ndarray
        Predicted values, shape (n_samples,)
    reducer : Callable, optional
        Aggregation function (default: np.mean)
    
    Returns
    -------
    float
        Empirical risk (aggregated loss)
    """
    return float(reducer(loss_fn(y_true, y_pred)))

def monte_carlo_risk(loss_on_sample: Callable[[np.ndarray], np.ndarray],
                     sampler: Callable[[int], np.ndarray],
                     n_mc: int = 10000,
                     reducer: Callable[[np.ndarray], float] = np.mean,
                     random_state: Optional[int] = None) -> float:
    """Approximate R = E[L(Z)] by Monte Carlo sampling.
    
    Parameters
    ----------
    loss_on_sample : Callable
        Function that computes loss given sampled data Z
    sampler : Callable
        Function that draws samples of Z. Called as sampler(n_mc), or as
        sampler(n_mc, rng) with a NumPy Generator if it accepts a second argument
    n_mc : int, optional
        Number of Monte Carlo samples (default: 10000)
    reducer : Callable, optional
        Aggregation function (default: np.mean)
    random_state : int, optional
        Random seed for reproducibility
    
    Returns
    -------
    float
        Monte Carlo estimate of expected risk
    """
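    import inspect  # local import keeps this cell self-contained

    rng = np.random.default_rng(random_state)
    # Pass the generator only if the sampler accepts a second argument: sampler(n, rng).
    # inspect.signature also handles functools.partial objects and callable instances,
    # for which sampler.__code__ does not exist.
    try:
        takes_rng = len(inspect.signature(sampler).parameters) >= 2
    except (TypeError, ValueError):
        takes_rng = False
    Z = sampler(n_mc, rng) if takes_rng else sampler(n_mc)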
    losses = loss_on_sample(Z)
    return float(reducer(losses))


### Empirical Risk Minimization (ERM) Template

The lecture notes frame learning as: choose a model class $\mathcal{M}=\{g_\lambda\}$ parameterized by $\lambda$, and minimize risk over $\lambda$.

**Variable definitions:**
- $\mathcal{M}$ = model class (set of candidate prediction functions)
- $g_\lambda$ = prediction function parameterized by $\lambda$
- $\lambda$ = model parameters to optimize
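
To make the ERM recipe concrete, here is a minimal sketch for a one-parameter model class $g_\lambda(x)=\lambda x$ under squared loss. It uses a brute-force grid search rather than a numerical optimizer, purely to keep the idea visible. The data-generating slope of 2.0, the grid range, and the function names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 2x + noise (the slope 2.0 is an assumption for the demo)
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + rng.normal(0, 0.1, size=200)

def g(lam, x):
    """Model class M = {g_lam(x) = lam * x}, indexed by a scalar lam."""
    return lam * x

def empirical_risk_sq(lam):
    """Average squared loss of g_lam on the training sample."""
    return np.mean((y - g(lam, X)) ** 2)

# ERM by brute force: evaluate the empirical risk on a grid and take the argmin
grid = np.linspace(0.0, 4.0, 401)
risks = np.array([empirical_risk_sq(lam) for lam in grid])
lam_hat = grid[np.argmin(risks)]

# For this one-parameter class the minimizer is also available in closed form
lam_closed = np.sum(X * y) / np.sum(X ** 2)
print(lam_hat, lam_closed)
```

Grid search only scales to very low-dimensional $\lambda$, but unlike gradient-based methods it also works for piecewise-constant losses such as 0-1 loss.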

Below is a **general ERM optimizer**: you provide a parameterized model, a loss function, and data; it returns the fitted parameters by minimizing the average loss.

In [None]:
from scipy.optimize import minimize

def erm_fit(model_predict: Callable[[np.ndarray, np.ndarray], np.ndarray],
            loss_fn: Callable[[np.ndarray, np.ndarray], np.ndarray],
            X: np.ndarray,
            y: np.ndarray,
            init_params: np.ndarray,
            method: str = "L-BFGS-B",
            bounds: Optional[Iterable[Tuple[Optional[float], Optional[float]]]] = None,
            options: Optional[Dict] = None) -> Dict:
    """Generic ERM: minimize average loss over parameters.

    Parameters
    ----------
    model_predict : Callable
        Function with signature (params, X) -> y_pred that generates predictions
    loss_fn : Callable
        Function with signature (y_true, y_pred) -> pointwise loss array
    X : np.ndarray
        Feature matrix, shape (n_samples, n_features) or (n_samples,)
    y : np.ndarray
        Target values, shape (n_samples,)
    init_params : np.ndarray
        Initial parameter vector for optimization
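    method : str, optional
        Scipy optimization method (default: "L-BFGS-B"). No gradient is passed,
        so scipy approximates it numerically; use a smooth loss such as
        squared_loss, since zero_one_loss is piecewise constant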
    bounds : Iterable of tuples, optional
        Parameter bounds as [(min1, max1), (min2, max2), ...]
    options : dict, optional
        Additional options for scipy.optimize.minimize

    Returns
    -------
    dict
        Dictionary with keys:
        - params: optimized parameter vector
        - fun: final objective value (empirical risk)
        - success: whether optimization succeeded
        - message: optimization status message
        - result: full scipy optimization result object
    """
    X = np.asarray(X)
    y = np.asarray(y)

    def objective(params):
        y_pred = model_predict(params, X)
        return np.mean(loss_fn(y, y_pred))

    res = minimize(objective, np.asarray(init_params, dtype=float),
                   method=method, bounds=bounds, options=options)
    return {
        "params": res.x,
        "fun": float(res.fun),
        "success": bool(res.success),
        "message": res.message,
        "result": res
    }


## 2) Regression and the Regression Function $r(x)=\mathbb{E}[Y\mid X=x]$

In the notes, under squared loss, the risk is:

$$
R(g)=\mathbb{E}[(Y-g(X))^2]
$$

where:
- $Y$ = true response variable
- $g(X)$ = predicted value using function $g$ and features $X$
- $R(g)$ = expected squared error (risk)

This decomposes into three terms:

$$
\mathbb{E}[(Y-r(X))^2] + \mathbb{E}[(r(X)-g(X))^2] + 2\,\mathbb{E}[(Y-r(X))(r(X)-g(X))]
$$

**Term definitions:**
- $r(X) = \mathbb{E}[Y\mid X]$ = regression function (conditional expectation, the optimal predictor)
- First term: $\mathbb{E}[(Y-r(X))^2]$ = **irreducible error** (noise, Bayes error)
- Second term: $\mathbb{E}[(r(X)-g(X))^2]$ = **approximation error** (how far $g$ is from optimal $r$)
- Third term: $2\mathbb{E}[(Y-r(X))(r(X)-g(X))]$ = **cross term** (equals zero by the tower property: conditional on $X$, the factor $r(X)-g(X)$ is fixed and $\mathbb{E}[Y-r(X)\mid X]=0$)

**Key insight**: Minimizing squared-loss risk within a model class $\mathcal{M}$ is equivalent to finding the member of $\mathcal{M}$ closest to $r(x)$ in the mean-square sense.
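
The decomposition can be checked numerically when the regression function is known. The sketch below assumes a synthetic model $Y=r(X)+\varepsilon$ with $r(x)=\sin(2\pi x)$ and an arbitrary linear predictor $g$. Both are choices made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Assumed data-generating process: Y = r(X) + noise, with known r
X = rng.uniform(0, 1, size=n)
r = np.sin(2 * np.pi * X)                 # true regression function r(X)
Y = r + rng.normal(0, 0.5, size=n)        # noise has conditional mean zero

g = 1.0 - 2.0 * X                         # an arbitrary (hypothetical) predictor g(X)

noise = np.mean((Y - r) ** 2)             # estimates E[(Y - r)^2], close to 0.5^2
approx = np.mean((r - g) ** 2)            # estimates E[(r - g)^2]
cross = 2 * np.mean((Y - r) * (r - g))    # estimates the cross term, near zero
total = np.mean((Y - g) ** 2)             # estimates R(g)

print(total, noise + approx + cross, cross)
```

The identity `total == noise + approx + cross` holds exactly on any sample (it is plain algebra), while `cross` is only approximately zero because it is a Monte Carlo average.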

Below: 
- **(A)** A utility to compute the three decomposition terms given samples and an estimate of $r(X)$  
- **(B)** Reusable estimators for $r(x)$ (kernel regression & k-NN regression)

In [None]:
def mse_decomposition(y: np.ndarray,
                      gX: np.ndarray,
                      rX: np.ndarray) -> Dict[str, float]:
    """Empirical version of MSE decomposition:
       E[(Y-g)^2] = E[(Y-r)^2] + E[(r-g)^2] + 2E[(Y-r)(r-g)]
    
    Parameters
    ----------
    y : np.ndarray
        True response values, shape (n_samples,)
    gX : np.ndarray
        Model predictions g(X), shape (n_samples,)
    rX : np.ndarray
        Regression function values r(X) = E[Y|X], shape (n_samples,)
    
    Returns
    -------
    dict
        Dictionary with keys:
        - I_noise: irreducible error E[(Y-r)^2]
        - II_approx: approximation error E[(r-g)^2]
        - III_cross: cross term 2E[(Y-r)(r-g)] (should be ≈0)
        - total: total MSE E[(Y-g)^2]
    """
    y = np.asarray(y); gX = np.asarray(gX); rX = np.asarray(rX)
    I = np.mean((y - rX) ** 2)
    II = np.mean((rX - gX) ** 2)
    III = 2.0 * np.mean((y - rX) * (rX - gX))
    total = np.mean((y - gX) ** 2)
    return {"I_noise": float(I), "II_approx": float(II), "III_cross": float(III), "total": float(total)}


In [None]:
def kernel_regression_predict(X_train: np.ndarray,
                              y_train: np.ndarray,
                              X_query: np.ndarray,
                              bandwidth: float = 0.1,
                              kernel: str = "gaussian") -> np.ndarray:
    """Nadaraya–Watson kernel regression: estimate r(x)=E[Y|X=x].

    Parameters
    ----------
    X_train : np.ndarray
        Training features, shape (n_train,) or (n_train, n_features)
    y_train : np.ndarray
        Training targets, shape (n_train,)
    X_query : np.ndarray
        Query points where to predict, shape (n_query,) or (n_query, n_features)
    bandwidth : float, optional
        Smoothing parameter h (larger = smoother, default: 0.1)
    kernel : str, optional
        Kernel type: 'gaussian' or 'epanechnikov' (default: 'gaussian')

    Returns
    -------
    y_hat : np.ndarray
        Predicted values at query points, shape (n_query,)
    """
    X_train = np.asarray(X_train); y_train = np.asarray(y_train)
    X_query = np.asarray(X_query)

    if X_train.ndim == 1:
        X_train = X_train.reshape(-1, 1)
    if X_query.ndim == 1:
        X_query = X_query.reshape(-1, 1)

    # pairwise squared distances
    diffs = X_query[:, None, :] - X_train[None, :, :]
    d2 = np.sum(diffs * diffs, axis=-1)
    u2 = d2 / (bandwidth ** 2)

    if kernel == "gaussian":
        w = np.exp(-0.5 * u2)
    elif kernel == "epanechnikov":
        # Epanechnikov weights max(0, 1 - u^2); the 3/4 constant cancels in the ratio
        w = np.clip(1.0 - u2, 0.0, None)
    else:
        raise ValueError("kernel must be 'gaussian' or 'epanechnikov'")

    denom = np.sum(w, axis=1)
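    # Avoid division by zero: a query point whose weights are all zero
    # (e.g. no training point within the Epanechnikov support) is predicted as 0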
    denom = np.where(denom == 0, 1e-12, denom)
    return (w @ y_train) / denom

def knn_regression_predict(X_train: np.ndarray,
                           y_train: np.ndarray,
                           X_query: np.ndarray,
                           k: int = 10) -> np.ndarray:
    """k-Nearest Neighbors regression: r(x) ≈ average of k nearest y values.
    
    Parameters
    ----------
    X_train : np.ndarray
        Training features, shape (n_train,) or (n_train, n_features)
    y_train : np.ndarray
        Training targets, shape (n_train,)
    X_query : np.ndarray
        Query points where to predict, shape (n_query,) or (n_query, n_features)
    k : int, optional
        Number of nearest neighbors to average (default: 10)
    
    Returns
    -------
    y_hat : np.ndarray
        Predicted values at query points, shape (n_query,)
    """
    X_train = np.asarray(X_train); y_train = np.asarray(y_train)
    X_query = np.asarray(X_query)

    if X_train.ndim == 1:
        X_train = X_train.reshape(-1, 1)
    if X_query.ndim == 1:
        X_query = X_query.reshape(-1, 1)

    diffs = X_query[:, None, :] - X_train[None, :, :]
    d2 = np.sum(diffs * diffs, axis=-1)
    idx = np.argpartition(d2, kth=min(k-1, d2.shape[1]-1), axis=1)[:, :k]
    return np.mean(y_train[idx], axis=1)


### Mini Demo: Regression Risk + Decomposition

This is a **toy** illustration to demonstrate the MSE decomposition. Replace the synthetic generator with your own data.

In [None]:
import matplotlib.pyplot as plt

def make_synthetic_regression(n: int = 300, noise_std: float = 0.15, random_state: int = 0):
    rng = np.random.default_rng(random_state)
    X = rng.uniform(0, 1, size=n)
    f = lambda x: np.sin(2*np.pi*x) + 0.3*np.cos(6*np.pi*x)
    r = f(X)
    y = r + rng.normal(0, noise_std, size=n)
    return X, y, f

X, y, f = make_synthetic_regression()
x_grid = np.linspace(0, 1, 400)
r_grid = f(x_grid)

# Estimate r(x) with kernel regression
rhat_grid = kernel_regression_predict(X, y, x_grid, bandwidth=0.08)

# A simple model class example: polynomial degree 3 fitted by least squares
Phi = np.vstack([np.ones_like(X), X, X**2, X**3]).T
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
def poly3_predict(params, Xq):
    Xq = np.asarray(Xq)
    return params[0] + params[1]*Xq + params[2]*Xq**2 + params[3]*Xq**3

g_grid = poly3_predict(beta, x_grid)

# Decomposition terms, using the kernel estimate rhat as a proxy for r.
# Because rhat is only an estimate, the cross term need not be exactly zero.
rhat_X = kernel_regression_predict(X, y, X, bandwidth=0.08)
g_X = poly3_predict(beta, X)
terms = mse_decomposition(y, g_X, rhat_X)
terms


In [None]:
plt.figure()
plt.scatter(X, y, s=12, alpha=0.35, label="data")
plt.plot(x_grid, r_grid, linewidth=2, label="true r(x)")
plt.plot(x_grid, rhat_grid, linewidth=2, label="kernel r̂(x)")
plt.plot(x_grid, g_grid, linewidth=2, label="poly3 g(x)")
plt.legend()
plt.title("Regression function and a model approximation")
plt.show()

## 3) Pattern Recognition (Classification) and Bayes Rule

In classification, labels are discrete (often $\{0,1\}$ for binary), and the notes use **0–1 loss**:

$$
L(y,u)=\begin{cases}
0 & \text{if } y=u\\
1 & \text{if } y\neq u
\end{cases}
$$

where:
- $y$ = true class label
- $u$ = predicted class label
- $L(y,u)$ = loss (0 if correct, 1 if incorrect)

Risk equals the **misclassification probability**:

$$
R(g)=\mathbb{P}(Y\neq g(X))
$$

where:
- $g(X)$ = classifier function that predicts class given features $X$
- $R(g)$ = probability of misclassification

For binary classification, define the **posterior probability**:

$$
r(x)=\mathbb{P}(Y=1\mid X=x)
$$

where $r(x)$ is the conditional probability of class 1 given features $x$.

The **Bayes rule** (optimal classifier) is:
- Predict 1 if $r(x)>1/2$, else predict 0

This rule minimizes the 0-1 loss risk and achieves the **Bayes error** $R^* = \mathbb{E}[\min(r(X), 1-r(X))]$.
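
A small discrete case makes the optimality claim easy to verify by enumeration. The sketch below assumes a feature with four equally likely values and made-up posteriors. It checks that the Bayes rule attains $\mathbb{E}[\min(r(X),1-r(X))]$ and that no other classifier does better.

```python
import numpy as np

# Hypothetical discrete feature with four equally likely values and known posteriors
r = np.array([0.9, 0.2, 0.5, 0.7])   # r(x) = P(Y=1 | X=x)
p_x = np.full(4, 0.25)               # P(X=x)

def risk(g):
    """Misclassification probability of a classifier given as labels g(x) in {0,1}."""
    # If g(x)=1 we err when Y=0 (prob 1-r); if g(x)=0 we err when Y=1 (prob r).
    err = np.where(g == 1, 1 - r, r)
    return float(np.sum(p_x * err))

g_bayes = (r > 0.5).astype(int)      # Bayes rule: predict 1 iff r(x) > 1/2
bayes_error = float(np.sum(p_x * np.minimum(r, 1 - r)))

# Enumerate all 2^4 classifiers to confirm none beats the Bayes rule
all_risks = [risk(np.array([(m >> j) & 1 for j in range(4)])) for m in range(16)]

print(g_bayes, risk(g_bayes), bayes_error, min(all_risks))
```

At the point where $r(x)=1/2$ both labels give the same error, so the tie-breaking convention does not affect the risk.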

Below are reusable helpers for the Bayes decision rule and computing empirical misclassification risk.

In [None]:
def bayes_rule_binary(p_y1_given_x: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Bayes decision rule for binary classification given posterior P(Y=1|X).
    
    Parameters
    ----------
    p_y1_given_x : np.ndarray
        Posterior probabilities P(Y=1|X), shape (n_samples,)
    threshold : float, optional
        Decision threshold (default: 0.5 for 0-1 loss)
    
    Returns
    -------
    np.ndarray
        Predicted class labels (0 or 1), shape (n_samples,)
    """
    p = np.asarray(p_y1_given_x)
    return (p > threshold).astype(int)

def misclassification_rate(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Compute empirical misclassification rate (0-1 error).
    
    Parameters
    ----------
    y_true : np.ndarray
        True class labels, shape (n_samples,)
    y_pred : np.ndarray
        Predicted class labels, shape (n_samples,)
    
    Returns
    -------
    float
        Fraction of misclassified samples
    """
    return float(np.mean((np.asarray(y_true) != np.asarray(y_pred)).astype(float)))

def bayes_error_from_posterior(p_y1_given_x: np.ndarray) -> float:
    """Compute Bayes error if you know the true posterior r(X)=P(Y=1|X).
    
    Parameters
    ----------
    p_y1_given_x : np.ndarray
        True posterior probabilities r(X), shape (n_samples,)
    
    Returns
    -------
    float
        Bayes error = E[min(r(X), 1-r(X))]
    """
    p = np.asarray(p_y1_given_x)
    return float(np.mean(np.minimum(p, 1 - p)))


### Estimating $r(x)=P(Y=1|X=x)$ from Data

In practice, you need to estimate the posterior probability from training data. Two common approaches:

1. **Logistic regression** (parametric conditional model)  
2. **k-NN / kernel methods** (non-parametric)

We'll implement a reusable logistic regression MLE later (Section 5).  
Here is a simple k-NN posterior estimator you can use immediately:

In [None]:
def knn_posterior_predict(X_train: np.ndarray,
                          y_train: np.ndarray,
                          X_query: np.ndarray,
                          k: int = 25) -> np.ndarray:
    """Estimate posterior r(x)=P(Y=1|X=x) via k-NN (average of neighbor labels).
    
    Parameters
    ----------
    X_train : np.ndarray
        Training features, shape (n_train,) or (n_train, n_features)
    y_train : np.ndarray
        Training binary labels (0 or 1), shape (n_train,)
    X_query : np.ndarray
        Query points where to estimate posterior, shape (n_query,) or (n_query, n_features)
    k : int, optional
        Number of nearest neighbors (default: 25)
    
    Returns
    -------
    np.ndarray
        Estimated P(Y=1|X) at query points, shape (n_query,)
    """
    yhat = knn_regression_predict(X_train, y_train, X_query, k=k)
    # since y in {0,1}, average is an estimate of P(Y=1|X)
    return np.clip(yhat, 0.0, 1.0)


### Mini Demo: Bayes Rule Intuition (Synthetic)

We'll generate data from a known model (two Gaussian classes with equal priors), so the **true** posterior is available in closed form as a reference for the k-NN estimate.
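
For two Gaussian classes with equal priors and a shared isotropic covariance, the true posterior can be written down directly. The sketch below assumes the class means $(-1,0)$ and $(+1,0)$ and standard deviation $0.9$ used in this demo, but draws labels at random and uses a larger sample so the Monte Carlo estimates are stable.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 20_000, 0.9

# Equal priors; class-conditionals N((-1,0), s^2 I) and N((+1,0), s^2 I)
y = rng.integers(0, 2, size=n)
means = np.where(y[:, None] == 1, [1.0, 0.0], [-1.0, 0.0])
X = means + rng.normal(0, sigma, size=(n, 2))

# log[p(x|1)/p(x|0)] = (||x - mu0||^2 - ||x - mu1||^2) / (2 s^2) = 2 * x_1 / s^2
log_odds = 2.0 * X[:, 0] / sigma**2
r_true = 1.0 / (1.0 + np.exp(-log_odds))      # true posterior P(Y=1 | X=x)

y_bayes = (r_true > 0.5).astype(int)          # Bayes rule: predict 1 iff x_1 > 0
err_bayes = np.mean(y_bayes != y)             # Monte Carlo estimate of the Bayes error
err_formula = np.mean(np.minimum(r_true, 1 - r_true))

# Closed form for this model: Phi(-1/s), with Phi the standard normal CDF
err_exact = 0.5 * math.erfc((1.0 / sigma) / math.sqrt(2.0))

print(err_bayes, err_formula, err_exact)
```

Any posterior estimate, k-NN included, can be judged against `r_true`, and no classifier can be expected to beat `err_exact` on fresh data from this model.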

In [None]:
def make_synthetic_binary(n: int = 600, random_state: int = 0):
    rng = np.random.default_rng(random_state)
    # X in R^2; two Gaussians
    n0 = n // 2
    n1 = n - n0
    X0 = rng.normal(loc=(-1.0, 0.0), scale=0.9, size=(n0, 2))
    X1 = rng.normal(loc=(+1.0, 0.0), scale=0.9, size=(n1, 2))
    X = np.vstack([X0, X1])
    y = np.array([0]*n0 + [1]*n1)
    # Shuffle
    idx = rng.permutation(n)
    return X[idx], y[idx]

Xc, yc = make_synthetic_binary()

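# Posterior estimate via kNN, evaluated on the training points themselves.
# Each point is among its own neighbors, so this error rate is optimistic.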
p_hat = knn_posterior_predict(Xc, yc, Xc, k=35)
y_hat = bayes_rule_binary(p_hat, threshold=0.5)
misclassification_rate(yc, y_hat)


## 4) Maximum Likelihood Estimation (MLE) as Risk Minimization

The notes show: if your model is a parametric density $p_\alpha(z)$, and you choose the loss function

$$
L(z,\alpha)=-\log p_\alpha(z),
$$

where:
- $z$ = observed data point
- $\alpha$ = model parameters
- $p_\alpha(z)$ = probability density/mass function of data $z$ under parameters $\alpha$
- $L(z,\alpha)$ = negative log-likelihood loss

then the risk is the **expected negative log-likelihood**, and empirical risk is the average negative log-likelihood over samples:

$$
\hat R(\alpha)=-\frac{1}{n}\sum_{i=1}^n \log p_\alpha(Z_i)
$$

where:
- $n$ = number of samples
- $Z_i$ = $i$-th data point
- $\hat R(\alpha)$ = empirical risk (average NLL)

**Key insight**: Minimizing this empirical risk is **equivalent** to maximizing the likelihood function $\mathcal{L}(\alpha) = \prod_{i=1}^n p_\alpha(Z_i)$.
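
The equivalence follows because $\log$ is increasing and multiplying by $-1/n$ turns an argmax into an argmin. A minimal numerical sketch with a Bernoulli model is shown below. The true parameter 0.3, the sample size, and the grid are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.binomial(1, 0.3, size=1000)   # Bernoulli(0.3) sample; 0.3 is an assumed truth

def avg_nll(alpha):
    """Empirical risk: average of -log p_alpha(z_i) for a Bernoulli(alpha) model."""
    return -np.mean(z * np.log(alpha) + (1 - z) * np.log(1 - alpha))

def log_likelihood(alpha):
    """Log of the likelihood prod_i p_alpha(z_i)."""
    return np.sum(z * np.log(alpha) + (1 - z) * np.log(1 - alpha))

grid = np.linspace(0.01, 0.99, 99)
alpha_erm = grid[np.argmin([avg_nll(a) for a in grid])]         # minimize empirical risk
alpha_mle = grid[np.argmax([log_likelihood(a) for a in grid])]  # maximize likelihood

print(alpha_erm, alpha_mle, z.mean())
```

Both searches select the same grid point, and it lies next to the sample mean, which is the closed-form Bernoulli MLE.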

Below are a general MLE helper and a concrete Gaussian example.

In [None]:
def mle_fit(nll_fn: Callable[[np.ndarray, np.ndarray], float],
            data: np.ndarray,
            init_params: np.ndarray,
            method: str = "L-BFGS-B",
            bounds: Optional[Iterable[Tuple[Optional[float], Optional[float]]]] = None,
            options: Optional[Dict] = None) -> Dict:
    """Generic MLE by minimizing negative log-likelihood (sum or mean).
    
    Parameters
    ----------
    nll_fn : Callable
        Negative log-likelihood function with signature (params, data) -> float
    data : np.ndarray
        Observed data samples
    init_params : np.ndarray
        Initial parameter values for optimization
    method : str, optional
        Scipy optimization method (default: "L-BFGS-B")
    bounds : Iterable of tuples, optional
        Parameter bounds as [(min1, max1), (min2, max2), ...]
    options : dict, optional
        Additional options for scipy.optimize.minimize
    
    Returns
    -------
    dict
        Dictionary with keys:
        - params: MLE parameter estimates
        - fun: final NLL value
        - success: whether optimization succeeded
        - message: optimization status message
        - result: full scipy optimization result object
    """
    data = np.asarray(data)

    def objective(params):
        return float(nll_fn(params, data))

    res = minimize(objective, np.asarray(init_params, dtype=float),
                   method=method, bounds=bounds, options=options)
    return {"params": res.x, "fun": float(res.fun), "success": bool(res.success),
            "message": res.message, "result": res}


### Example: Gaussian MLE $(\mu,\sigma)$

For i.i.d. samples $z_1,\dots,z_n\sim\mathcal{N}(\mu,\sigma^2)$, MLE has a closed-form solution:

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n z_i$$

$$\hat\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n (z_i-\hat\mu)^2}$$

where:
- $z_i$ = $i$-th sample
- $n$ = number of samples
- $\mu$ = mean parameter
- $\sigma$ = standard deviation parameter
- $\hat\mu$ = MLE estimate of mean (sample mean)
- $\hat\sigma$ = MLE estimate of std dev (uses $1/n$, not the $1/(n-1)$ of the unbiased variance estimator)
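
For reference, here is a short sketch of where these formulas come from, using the average negative log-likelihood $\hat R$ from the start of this section as the objective. For the Gaussian model,

$$
\hat R(\mu,\sigma)=\log\sigma+\frac{1}{2}\log(2\pi)+\frac{1}{2n\sigma^2}\sum_{i=1}^n (z_i-\mu)^2 .
$$

Setting the partial derivatives to zero gives

$$
\frac{\partial \hat R}{\partial \mu}=-\frac{1}{n\sigma^2}\sum_{i=1}^n (z_i-\mu)=0
\;\Longrightarrow\;
\hat\mu=\frac{1}{n}\sum_{i=1}^n z_i ,
$$

$$
\frac{\partial \hat R}{\partial \sigma}=\frac{1}{\sigma}-\frac{1}{n\sigma^3}\sum_{i=1}^n (z_i-\hat\mu)^2=0
\;\Longrightarrow\;
\hat\sigma^2=\frac{1}{n}\sum_{i=1}^n (z_i-\hat\mu)^2 .
$$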

We'll provide both closed-form and numeric optimization versions for illustration.

In [None]:
def gaussian_logpdf(params: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Log-density of N(mu, sigma^2) for each z.
    
    Parameters
    ----------
    params : np.ndarray
        Parameters [mu, sigma] where mu=mean, sigma=std dev
    z : np.ndarray
        Data points
    
    Returns
    -------
    np.ndarray
        Log-probability density for each data point
    """
    mu, sigma = float(params[0]), float(params[1])
    z = np.asarray(z)
    if sigma <= 0:
        return np.full_like(z, -np.inf, dtype=float)
    return -0.5*np.log(2*np.pi*sigma**2) - 0.5*((z - mu)/sigma)**2

def gaussian_nll(params: np.ndarray, z: np.ndarray) -> float:
    """Negative log-likelihood (sum) for Gaussian distribution.
    
    Parameters
    ----------
    params : np.ndarray
        Parameters [mu, sigma]
    z : np.ndarray
        Data samples
    
    Returns
    -------
    float
        Total negative log-likelihood
    """
    ll = gaussian_logpdf(params, z)
    if np.isneginf(ll).any():
        return float("inf")
    return float(-np.sum(ll))

def fit_gaussian_mle_closed_form(z: np.ndarray) -> Dict[str, float]:
    """Fit Gaussian parameters using closed-form MLE formulas.
    
    Parameters
    ----------
    z : np.ndarray
        Data samples
    
    Returns
    -------
    dict
        Dictionary with keys 'mu' (mean) and 'sigma' (std dev)
    """
    z = np.asarray(z)
    mu_hat = float(np.mean(z))
    sigma_hat = float(np.sqrt(np.mean((z - mu_hat)**2)))
    return {"mu": mu_hat, "sigma": sigma_hat}

def fit_gaussian_mle_numeric(z: np.ndarray,
                             init_params: Optional[np.ndarray] = None) -> Dict:
    """Fit Gaussian parameters using numeric optimization.
    
    Parameters
    ----------
    z : np.ndarray
        Data samples
    init_params : np.ndarray, optional
        Initial [mu, sigma] guess (auto-initialized if None)
    
    Returns
    -------
    dict
        MLE fit results including 'params' key with [mu, sigma]
    """
    z = np.asarray(z)
    if init_params is None:
        init_params = np.array([np.mean(z), np.std(z) if np.std(z) > 1e-6 else 1.0])
    bounds = [(None, None), (1e-9, None)]
    return mle_fit(gaussian_nll, z, init_params=init_params, bounds=bounds)


In [None]:
# Quick check
rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=1.5, size=500)
fit_gaussian_mle_closed_form(z), fit_gaussian_mle_numeric(z)["params"]


### (Optional) Jensen's Inequality Viewpoint: MLE Minimizes Cross-Entropy / KL

The notes use Jensen's inequality to show that the true parameter $\alpha^*$ minimizes the expected negative log-likelihood.

In information-theoretic terms:
- Expected NLL is the **cross-entropy** $H(P, Q) = -\mathbb{E}_P[\log Q]$ between true distribution $P$ and model $Q$
- The gap to the minimum is a **KL divergence** $D_{KL}(P||Q) = \mathbb{E}_P[\log P - \log Q]$ (always non-negative)

where:
- $P$ = true data distribution
- $Q$ = model distribution parameterized by $\alpha$
- $H(P,Q)$ = cross-entropy
- $D_{KL}(P||Q)$ = Kullback-Leibler divergence
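
These quantities are tied together by the identity $H(P,Q)=H(P)+D_{KL}(P||Q)$, where $H(P)=-\mathbb{E}_P[\log P]$ is the entropy of $P$. Since $H(P)$ does not depend on the model, minimizing cross-entropy over $Q$ is the same as minimizing the KL divergence. A quick numerical check on two made-up discrete distributions:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "true" distribution P (illustrative values)
q = np.array([0.4, 0.4, 0.2])   # model distribution Q (illustrative values)

entropy_p = -np.sum(p * np.log(p))        # H(P)
cross_entropy = -np.sum(p * np.log(q))    # H(P, Q) = -E_P[log Q]
kl = np.sum(p * np.log(p / q))            # D_KL(P || Q)

# Cross-entropy splits into entropy plus KL divergence
print(cross_entropy, entropy_p + kl, kl)

# The divergence of a distribution from itself is zero
kl_self = np.sum(p * np.log(p / p))
```

The example assumes strictly positive probabilities, so no clipping is needed.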

Below are small helpers for **discrete** distributions (useful for sanity checks and understanding).

In [None]:
def kl_divergence_discrete(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Compute KL(p||q) for discrete probability distributions.
    
    Parameters
    ----------
    p : np.ndarray
        True distribution (will be normalized)
    q : np.ndarray
        Model distribution (will be normalized)
    eps : float, optional
        Small constant for numerical stability (default: 1e-12)
    
    Returns
    -------
    float
        KL divergence from p to q
    """
    p = np.asarray(p, dtype=float); q = np.asarray(q, dtype=float)
    p = p / np.sum(p); q = q / np.sum(q)
    p = np.clip(p, eps, 1.0); q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def cross_entropy_discrete(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Compute cross-entropy H(p,q) = -E_p[log q] for discrete distributions.
    
    Parameters
    ----------
    p : np.ndarray
        True distribution (will be normalized)
    q : np.ndarray
        Model distribution (will be normalized)
    eps : float, optional
        Small constant for numerical stability (default: 1e-12)
    
    Returns
    -------
    float
        Cross-entropy from p to q
    """
    p = np.asarray(p, dtype=float); q = np.asarray(q, dtype=float)
    p = p / np.sum(p); q = q / np.sum(q)
    p = np.clip(p, eps, 1.0); q = np.clip(q, eps, 1.0)
    return float(-np.sum(p * np.log(q)))


## 5) MLE and Regression (Conditional Likelihood)

The notes show: for a joint model $f_{X,Y}(x,y)=f_{Y\mid X}(y\mid x)f_X(x)$, if the marginal $f_X$ has **no parameters**, then maximizing the joint likelihood is equivalent to maximizing the **conditional likelihood** $f_{Y\mid X}(y\mid x)$.

**Variable definitions:**
- $f_{X,Y}(x,y)$ = joint density of features and response
- $f_{Y\mid X}(y\mid x)$ = conditional density of response given features (depends on parameters)
- $f_X(x)$ = marginal density of features (parameter-free)
- $x$ = feature values
- $y$ = response values
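
The step behind this claim can be sketched in one line. Writing $\theta$ for the parameters of the conditional model (a notation assumed here), the joint log-likelihood of the sample splits as

$$
\sum_{i=1}^n \log f_{X,Y}(x_i,y_i;\theta)
=\sum_{i=1}^n \log f_{Y\mid X}(y_i\mid x_i;\theta)
+\sum_{i=1}^n \log f_X(x_i).
$$

The second sum does not involve $\theta$, so it is a constant in the optimization and both likelihoods share the same maximizer:

$$
\arg\max_\theta \sum_{i=1}^n \log f_{X,Y}(x_i,y_i;\theta)
=\arg\max_\theta \sum_{i=1}^n \log f_{Y\mid X}(y_i\mid x_i;\theta).
$$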

This is the fundamental reason why:
- **Linear regression** with Gaussian noise $Y\mid X \sim \mathcal{N}(X\beta, \sigma^2)$ → least squares minimization
- **Logistic regression** with Bernoulli model $Y\mid X \sim \text{Bernoulli}(\text{sigmoid}(X\beta))$ → logistic loss / log-loss minimization
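
Both correspondences can be checked numerically. The sketch below uses synthetic data and arbitrary coefficients, all of which are assumptions for illustration. It verifies that the Gaussian negative log-likelihood is a constant plus the sum of squared residuals divided by $2\sigma^2$, and that the Bernoulli negative log-likelihood equals the log-loss in its margin form.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.c_[np.ones(n), rng.normal(size=n)]    # design matrix with intercept column
beta = np.array([0.5, -1.0])                 # arbitrary coefficients for the check

# --- Gaussian noise: NLL is an affine function of the sum of squared residuals ---
sigma = 0.7
y_lin = X @ np.array([1.0, 2.0]) + rng.normal(0, sigma, size=n)
resid = y_lin - X @ beta
sse = np.sum(resid ** 2)
gauss_nll = np.sum(0.5 * np.log(2 * np.pi * sigma**2) + 0.5 * (resid / sigma) ** 2)
gauss_nll_via_sse = 0.5 * n * np.log(2 * np.pi * sigma**2) + sse / (2 * sigma**2)

# --- Bernoulli model: NLL is the summed log-loss ---
eta = X @ beta
p = 1.0 / (1.0 + np.exp(-eta))               # sigmoid(X beta)
y_bin = rng.binomial(1, p)
bern_nll = -np.sum(y_bin * np.log(p) + (1 - y_bin) * np.log(1 - p))

# Equivalent margin form with labels recoded to z in {-1, +1}
z = 2 * y_bin - 1
bern_nll_margin = np.sum(np.logaddexp(0.0, -z * eta))

print(gauss_nll, gauss_nll_via_sse)
print(bern_nll, bern_nll_margin)
```

For fixed $\sigma$, minimizing the Gaussian expression over $\beta$ is therefore exactly least squares.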

Below are ready-to-use functions for both models.

In [None]:
def fit_linear_regression_mle(X: np.ndarray,
                              y: np.ndarray,
                              add_bias: bool = True) -> Dict[str, Union[np.ndarray, float]]:
    """MLE for linear regression under Y|X ~ N(Xβ, σ^2).

    Parameters
    ----------
    X : np.ndarray
        Feature matrix, shape (n_samples,) or (n_samples, n_features)
    y : np.ndarray
        Target values, shape (n_samples,)
    add_bias : bool, optional
        Whether to add intercept term (default: True)
    
    Returns
    -------
    dict
        Dictionary with keys:
        - beta: coefficient vector (includes intercept if add_bias=True)
        - sigma: MLE estimate of noise std dev (uses 1/n)
        - y_hat: fitted values
        - residuals: y - y_hat
    """
    X = np.asarray(X); y = np.asarray(y)
    X_design = add_intercept(X) if add_bias else (X.reshape(-1,1) if X.ndim==1 else X)
    beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    y_hat = X_design @ beta_hat
    resid = y - y_hat
    sigma_hat = float(np.sqrt(np.mean(resid**2)))
    return {"beta": beta_hat, "sigma": sigma_hat, "y_hat": y_hat, "residuals": resid}

def predict_linear_regression(beta: np.ndarray, X: np.ndarray, add_bias: bool = True) -> np.ndarray:
    """Predict using fitted linear regression coefficients.
    
    Parameters
    ----------
    beta : np.ndarray
        Coefficient vector (includes intercept if add_bias=True)
    X : np.ndarray
        Feature matrix for prediction
    add_bias : bool, optional
        Whether beta includes intercept term (default: True)
    
    Returns
    -------
    np.ndarray
        Predicted values
    """
    X = np.asarray(X)
    X_design = add_intercept(X) if add_bias else (X.reshape(-1,1) if X.ndim==1 else X)
    return X_design @ np.asarray(beta)


In [None]:
# Example usage (synthetic)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=300)
y = 1.0 + 2.5*X + rng.normal(0, 0.7, size=300)

lin = fit_linear_regression_mle(X, y)
lin["beta"], lin["sigma"], empirical_risk(squared_loss, y, lin["y_hat"])


In [None]:
def logistic_nll(beta: np.ndarray, X: np.ndarray, y: np.ndarray, add_bias: bool = True) -> float:
    """Negative log-likelihood for Bernoulli(sigmoid(Xβ)).

    Parameters
    ----------
    beta : np.ndarray
        Coefficient vector (includes intercept if add_bias=True)
    X : np.ndarray
        Feature matrix, shape (n_samples,) or (n_samples, n_features)
    y : np.ndarray
        Binary labels (0 or 1), shape (n_samples,)
    add_bias : bool, optional
        Whether beta includes intercept term (default: True)
    
    Returns
    -------
    float
        Total negative log-likelihood
    
    Notes
    -----
    Uses the numerically stable form sum_i log(1 + exp(-z_i * eta_i)),
    where z_i = 2*y_i - 1 and eta_i is the i-th entry of Xβ.
    """
    X = np.asarray(X); y = np.asarray(y).astype(int)
    X_design = add_intercept(X) if add_bias else (X.reshape(-1,1) if X.ndim==1 else X)
    eta = X_design @ np.asarray(beta)
    z = 2*y - 1  # +1 for y=1, -1 for y=0
    # log(1 + exp(-z*eta)) stable via logaddexp(0, -z*eta)
    return float(np.sum(np.logaddexp(0.0, -z * eta)))

def fit_logistic_regression_mle(X: np.ndarray,
                                y: np.ndarray,
                                add_bias: bool = True,
                                init_beta: Optional[np.ndarray] = None,
                                method: str = "L-BFGS-B",
                                l2: float = 0.0) -> Dict:
    """Fit logistic regression by MLE (optionally with L2 regularization).

    Parameters
    ----------
    X : np.ndarray
        Feature matrix, shape (n_samples,) or (n_samples, n_features)
    y : np.ndarray
        Binary labels (0 or 1), shape (n_samples,)
    add_bias : bool, optional
        Whether to include intercept term (default: True)
    init_beta : np.ndarray, optional
        Initial coefficient values (default: zeros)
    method : str, optional
        Scipy optimization method (default: "L-BFGS-B")
    l2 : float, optional
        L2 regularization strength (default: 0.0, no regularization)
    
    Returns
    -------
    dict
        Dictionary with keys:
        - beta: MLE coefficient estimates
        - fun: final objective value (NLL, plus the L2 penalty when l2 > 0)
        - success: whether optimization succeeded
        - message: optimization status
        - result: full scipy result object
    
    Notes
    -----
    Objective: NLL(beta) + (l2/2)*||beta||^2
    If add_bias=True, intercept is NOT regularized.
    """
    X = np.asarray(X); y = np.asarray(y).astype(int)
    X_design = add_intercept(X) if add_bias else (X.reshape(-1,1) if X.ndim==1 else X)
    d = X_design.shape[1]
    if init_beta is None:
        init_beta = np.zeros(d)

    def objective(beta):
        nll = logistic_nll(beta, X, y, add_bias=add_bias)
        if l2 > 0:
            beta_reg = np.asarray(beta).copy()
            if add_bias:
                beta_reg[0] = 0.0  # don't penalize intercept
            nll = nll + 0.5*l2*np.sum(beta_reg**2)
        return float(nll)

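    # Local import so this cell runs even if scipy was not imported earlier in the notebook
    from scipy.optimize import minimize
    res = minimize(objective, np.asarray(init_beta, dtype=float), method=method)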
    return {"beta": res.x, "fun": float(res.fun), "success": bool(res.success),
            "message": res.message, "result": res}

def predict_proba_logistic(beta: np.ndarray, X: np.ndarray, add_bias: bool = True) -> np.ndarray:
    """Predict probabilities P(Y=1|X) using logistic regression.
    
    Parameters
    ----------
    beta : np.ndarray
        Coefficient vector (includes intercept if add_bias=True)
    X : np.ndarray
        Feature matrix for prediction
    add_bias : bool, optional
        Whether beta includes intercept term (default: True)
    
    Returns
    -------
    np.ndarray
        Predicted probabilities P(Y=1|X)
    """
    X = np.asarray(X)
    X_design = add_intercept(X) if add_bias else (X.reshape(-1,1) if X.ndim==1 else X)
    return sigmoid(X_design @ np.asarray(beta))

def predict_logistic(beta: np.ndarray, X: np.ndarray, threshold: float = 0.5, add_bias: bool = True) -> np.ndarray:
    """Predict class labels using logistic regression.
    
    Parameters
    ----------
    beta : np.ndarray
        Coefficient vector (includes intercept if add_bias=True)
    X : np.ndarray
        Feature matrix for prediction
    threshold : float, optional
        Classification threshold (default: 0.5)
    add_bias : bool, optional
        Whether beta includes intercept term (default: True)
    
    Returns
    -------
    np.ndarray
        Predicted class labels (0 or 1)
    """
    return (predict_proba_logistic(beta, X, add_bias=add_bias) > threshold).astype(int)


In [None]:
# Example usage (synthetic binary 1D)
rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=500)
true_beta = np.array([-0.2, 2.0])  # intercept + slope
p = sigmoid(add_intercept(X) @ true_beta)
y = rng.binomial(1, p)

fit = fit_logistic_regression_mle(X, y, l2=0.0)
beta_hat = fit["beta"]
# In-sample (training) accuracy; this is not a held-out estimate
acc = 1.0 - misclassification_rate(y, predict_logistic(beta_hat, X))
beta_hat, acc
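
The `Notes` in `logistic_nll` rely on the identity $-[y\log\sigma(\eta) + (1-y)\log(1-\sigma(\eta))] = \log(1 + e^{-z\eta})$ with $z = 2y - 1$. The block below is a small numerical check of that identity in plain NumPy (it does not call the notebook's helpers). It also verifies the standard analytic gradient $X^\top(\sigma(\eta) - y)$ against finite differences. The notebook's `minimize` call does not use this gradient; it is shown only as an assumption-free sanity check that could, if desired, be passed to an optimizer.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(0, 1, size=n)
X_design = np.column_stack([np.ones(n), x])  # intercept + slope
beta = np.array([-0.2, 2.0])

eta = X_design @ beta
p = 1.0 / (1.0 + np.exp(-eta))  # plain sigmoid is fine here: |eta| stays moderate
y = rng.binomial(1, p)

# Naive Bernoulli negative log-likelihood
nll_naive = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Stable form used in logistic_nll: sum_i log(1 + exp(-z_i * eta_i))
z = 2 * y - 1
nll_stable = np.sum(np.logaddexp(0.0, -z * eta))

# Analytic gradient of the NLL with respect to beta
grad_analytic = X_design.T @ (p - y)

def nll(b):
    """Stable NLL as a function of the coefficient vector."""
    return np.sum(np.logaddexp(0.0, -z * (X_design @ b)))

# Central finite differences, one coordinate direction at a time
eps = 1e-6
grad_fd = np.array([
    (nll(beta + eps * e) - nll(beta - eps * e)) / (2 * eps)
    for e in np.eye(beta.size)
])

print("NLL naive vs stable:", nll_naive, nll_stable)
print("gradient analytic vs finite-difference:", grad_analytic, grad_fd)
```

The naive form breaks down when some `p` rounds to exactly 0 or 1 (large `|eta|`), which is the case the `logaddexp` form is meant to handle.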


## 6) Quick Function Index (Copy/Paste Friendly)

### Core Risk/ERM
- `squared_loss(y_true, y_pred)` — Pointwise squared loss $(y-\hat{y})^2$
- `zero_one_loss(y_true, y_pred)` — Pointwise 0-1 loss (classification)
- `empirical_risk(loss_fn, y_true, y_pred)` — Average loss on data
- `monte_carlo_risk(loss_on_sample, sampler, n_mc=...)` — Estimate true risk via sampling
- `erm_fit(model_predict, loss_fn, X, y, init_params, ...)` — General ERM optimizer

### Regression Function $r(x)$
- `kernel_regression_predict(X_train, y_train, X_query, bandwidth=..., kernel=...)` — Nadaraya-Watson estimator
- `knn_regression_predict(X_train, y_train, X_query, k=...)` — k-NN regression
- `mse_decomposition(y, gX, rX)` — Decompose MSE into noise, approximation, and cross terms

### Classification / Bayes Rule
- `bayes_rule_binary(p_y1_given_x, threshold=0.5)` — Optimal binary classifier
- `knn_posterior_predict(X_train, y_train, X_query, k=...)` — Estimate $P(Y=1|X)$ via k-NN
- `misclassification_rate(y_true, y_pred)` — 0-1 error rate
- `bayes_error_from_posterior(p_y1_given_x)` — Bayes error given true posterior

### MLE
- `mle_fit(nll_fn, data, init_params, ...)` — Generic MLE by minimizing NLL
- `fit_gaussian_mle_closed_form(z)` — Closed-form Gaussian MLE
- `fit_gaussian_mle_numeric(z)` — Numeric Gaussian MLE
- `fit_linear_regression_mle(X, y)` — Linear regression via MLE (Gaussian noise)
- `fit_logistic_regression_mle(X, y, l2=...)` — Logistic regression via MLE (Bernoulli)

## 7) How to Adapt This to Your Own Exercises Fast

**Step-by-step guide:**

1. **Replace data**: Use your own arrays `X, y` instead of synthetic generators
2. **Choose loss function**:
   - Regression → `squared_loss`
   - Classification → `zero_one_loss` (for evaluation), log-loss/NLL (for training)
3. **Select model class**:
   - Linear model → `fit_linear_regression_mle`
   - Logistic model → `fit_logistic_regression_mle`
   - Custom parametric model → use `erm_fit(...)` with your `model_predict(...)` function
4. **Estimate true risk**: If you need expected risk (not just empirical), use `monte_carlo_risk` with a sampler for your assumed distribution

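All functions accept standard NumPy arrays. The fitting functions shown here (`fit_linear_regression_mle`, `fit_logistic_regression_mle`) return dictionaries of results, while the prediction helpers return NumPy arrays.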
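
Steps 1 to 4 can be sketched end to end as follows. To stay runnable on its own, the block uses plain NumPy rather than the notebook's helpers; the data-generating distribution, the `sampler` function, and the sample sizes are all illustrative assumptions rather than anything prescribed by the lecture notes.

```python
import numpy as np

rng = np.random.default_rng(42)

def sampler(n):
    """Hypothetical data-generating distribution (assumed for illustration)."""
    x = rng.uniform(-2, 2, size=n)
    y = 1.0 + 2.5 * x + rng.normal(0, 0.7, size=n)
    return x, y

# Step 1: your data (here drawn from the assumed sampler)
x_train, y_train = sampler(300)

# Steps 2-3: squared loss with a linear model, fitted by least squares
A = np.column_stack([np.ones_like(x_train), x_train])
beta_hat, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def predict(x):
    """Linear predictor using the fitted intercept and slope."""
    return beta_hat[0] + beta_hat[1] * x

# Empirical risk: average squared loss on the training sample
emp_risk = float(np.mean((y_train - predict(x_train)) ** 2))

# Step 4: Monte Carlo estimate of the expected risk from fresh draws
x_new, y_new = sampler(200_000)
mc_risk = float(np.mean((y_new - predict(x_new)) ** 2))

print("empirical risk:", emp_risk)
print("Monte Carlo risk:", mc_risk)
```

The empirical risk is computed on the same sample used for fitting, so it tends to be slightly optimistic. The Monte Carlo value approximates the expected risk, which under this assumed model cannot fall below the noise variance $0.7^2 = 0.49$.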