# Chapter 3 Bonus: Hands-On Linear and Logistic Regression

This bonus notebook extends **Chapter 3 – Regression** with more hands-on, from-scratch implementations. We will:

1. Derive and implement **Gradient Descent for Linear Regression** using NumPy
2. Implement **Logistic Regression by hand** (no autograd)
3. Visualize **decision boundaries** for a 2D classification problem
4. Explore the effect of **L2 regularization** on logistic regression
5. Summarize when linear models work well and why we need multi-layer networks

The goal is to deepen your intuition for linear models and prepare you for **multi-layer perceptrons** in Chapter 4.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import make_moons, make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Plot style (the 'seaborn-v0_8-*' style names require Matplotlib >= 3.6)
plt.style.use('seaborn-v0_8-darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

# Reproducibility
np.random.seed(42)

# 1. Gradient Descent for Linear Regression (from scratch)

In Chapter 3 we saw linear regression in vector form:

$$
\hat{y} = Xw + b, \quad X \in \mathbb{R}^{n \times d},\ w \in \mathbb{R}^d,\ b \in \mathbb{R}.
$$

For a dataset $\{(x_i, y_i)\}_{i=1}^n$, the **Mean Squared Error (MSE)** loss, written with a conventional factor of $\tfrac{1}{2}$ that simplifies the gradient, is:

$$
L(w, b) = \frac{1}{2n} \sum_{i=1}^n (\hat{y}_i - y_i)^2.
$$

Writing this in matrix form with $\hat{y} = Xw + b\mathbf{1}$:

$$
L(w, b) = \frac{1}{2n} \lVert Xw + b\mathbf{1} - y \rVert^2.
$$

The gradients are:

$$
\frac{\partial L}{\partial w} = \frac{1}{n} X^T(Xw + b\mathbf{1} - y), \quad
\frac{\partial L}{\partial b} = \frac{1}{n} \mathbf{1}^T(Xw + b\mathbf{1} - y).
$$

**Gradient Descent update rule:**

$$
w \leftarrow w - \eta \frac{\partial L}{\partial w}, \quad
b \leftarrow b - \eta \frac{\partial L}{\partial b},
$$

where $\eta > 0$ is the learning rate.
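One way to gain confidence in the gradient formulas above is to compare them with a finite-difference approximation. The sketch below does this on a small random problem. The helper names `mse_loss`, `analytic_grads`, and `numeric_grads`, the step size `h`, and the random test data are illustrative choices, not part of the chapter's code.

```python
import numpy as np

def mse_loss(X, y, w, b):
    """Half mean squared error: (1 / 2n) * ||Xw + b - y||^2."""
    residual = X @ w + b - y
    return 0.5 * np.mean(residual ** 2)

def analytic_grads(X, y, w, b):
    """Gradients from the matrix expressions: (1/n) X^T r and (1/n) 1^T r."""
    n = X.shape[0]
    residual = X @ w + b - y
    return X.T @ residual / n, residual.sum() / n

def numeric_grads(X, y, w, b, h=1e-6):
    """Central finite differences (h is an assumed step size)."""
    grad_w = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        grad_w[j] = (mse_loss(X, y, w + step, b) - mse_loss(X, y, w - step, b)) / (2 * h)
    grad_b = (mse_loss(X, y, w, b + h) - mse_loss(X, y, w, b - h)) / (2 * h)
    return grad_w, grad_b

# Small random problem, used only for the check
rng = np.random.default_rng(0)
X_chk = rng.normal(size=(20, 3))
y_chk = rng.normal(size=20)
w_chk = rng.normal(size=3)
b_chk = 0.5

gw_a, gb_a = analytic_grads(X_chk, y_chk, w_chk, b_chk)
gw_n, gb_n = numeric_grads(X_chk, y_chk, w_chk, b_chk)
print("max |analytic - numeric| for w:", np.max(np.abs(gw_a - gw_n)))
print("|analytic - numeric| for b:", abs(gb_a - gb_n))
```

Both differences should be tiny (limited by floating-point rounding), since the loss is quadratic and central differences are exact for quadratics up to rounding error.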

We will now:

1. Generate synthetic 1D data from a true line
2. Compute the **closed-form solution** for linear regression
3. Train with **gradient descent** using only NumPy
4. Compare the learned parameters and visualize the fit and the loss curve.

In [None]:
# Generate synthetic 1D data
def generate_linear_data(n_samples=100, noise_std=2.0):
    """y = 3x + 5 + noise"""
    X = np.linspace(-5, 5, n_samples).reshape(-1, 1)
    true_w = np.array([3.0])
    true_b = 5.0
    noise = np.random.randn(n_samples, 1) * noise_std
    y = X @ true_w.reshape(-1, 1) + true_b + noise
    return X, y.ravel(), true_w, true_b

X, y, true_w, true_b = generate_linear_data()

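# Closed-form solution via the normal equations: (X^T X) theta = X^T y
X_bias = np.c_[X, np.ones_like(X)]  # design matrix with columns [x, 1]
XtX = X_bias.T @ X_bias
XtY = X_bias.T @ y
# Solve the linear system directly instead of forming an explicit inverse
w_closed = np.linalg.solve(XtX, XtY)  # [w, b]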

print("True w, b:", true_w[0], true_b)
print("Closed-form w, b:", w_closed[0], w_closed[1])

# Gradient Descent implementation
def linear_regression_gd(X, y, lr=1e-2, n_iters=1000):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    losses = []
    
    for it in range(n_iters):
        y_pred = X @ w + b
        error = y_pred - y
        loss = 0.5 / n_samples * np.sum(error ** 2)
        losses.append(loss)
        
        # Gradients
        grad_w = (1.0 / n_samples) * (X.T @ error)
        grad_b = (1.0 / n_samples) * np.sum(error)
        
        # Update
        w -= lr * grad_w
        b -= lr * grad_b
    
    return w, b, losses

w_gd, b_gd, losses = linear_regression_gd(X, y, lr=5e-3, n_iters=2000)

print("\nGradient Descent w, b:", w_gd[0], b_gd)

# Plot data and fits
x_plot = np.linspace(-5.5, 5.5, 200).reshape(-1, 1)
y_true_line = x_plot * true_w[0] + true_b
y_closed = x_plot * w_closed[0] + w_closed[1]
y_gd = x_plot * w_gd[0] + b_gd

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: data and lines
axes[0].scatter(X, y, alpha=0.6, label="Data")
axes[0].plot(x_plot, y_true_line, 'k--', label="True line")
axes[0].plot(x_plot, y_closed, 'r-', label="Closed-form fit")
axes[0].plot(x_plot, y_gd, 'g-', label="GD fit")
axes[0].set_title("Linear Regression: True vs Closed-form vs GD")
axes[0].set_xlabel("x")
axes[0].set_ylabel("y")
axes[0].legend()

# Right: loss curve
axes[1].plot(losses)
axes[1].set_title("Gradient Descent Training Loss (MSE)")
axes[1].set_xlabel("Iteration")
axes[1].set_ylabel("Loss")

plt.tight_layout()
plt.show()

**Observations:**

- Gradient descent converges to almost the same parameters as the closed-form solution.
- The loss decreases smoothly until it plateaus.
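- This matches the theory from Chapter 3: the MSE loss of linear regression is convex, and its global minimum is unique whenever the design matrix (including the bias column) has full column rank, as it does here.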
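As an additional sanity check, scikit-learn's `LinearRegression` should recover the same least-squares line. The sketch below is self-contained: it regenerates data of the same form ($y = 3x + 5 + \text{noise}$) rather than reusing the variables above, so the exact numbers will differ from the earlier cell. The names `rng`, `X_demo`, `y_demo`, and `theta` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data of the same form as above (different random draws)
rng = np.random.default_rng(42)
X_demo = np.linspace(-5, 5, 100).reshape(-1, 1)
y_demo = 3.0 * X_demo.ravel() + 5.0 + rng.normal(scale=2.0, size=100)

# Least-squares fit via scikit-learn
sk_model = LinearRegression().fit(X_demo, y_demo)

# Least-squares fit via NumPy on the bias-augmented design matrix [x, 1]
X_aug = np.c_[X_demo, np.ones(len(X_demo))]
theta, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)

print("sklearn w, b:", sk_model.coef_[0], sk_model.intercept_)
print("lstsq   w, b:", theta[0], theta[1])
```

The two fits should agree to numerical precision, because both solve the same least-squares problem.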

Next we will move to **logistic regression**, still with manual gradients.

# 2. Logistic Regression by Hand (Binary Classification)

For binary classification with labels $y \in \{0, 1\}$, logistic regression models the probability of class 1 as:

$$
\hat{y} = \sigma(z) = \sigma(Xw + b), \quad \sigma(t) = \frac{1}{1 + e^{-t}}.
$$

The **Binary Cross-Entropy (BCE)** loss is:

$$
L(w, b) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right].
$$

The gradients for a batch of size $n$ can be written compactly as:

$$
\frac{\partial L}{\partial w} = \frac{1}{n} X^T(\hat{y} - y), \quad
\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i).
$$
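
These expressions follow from the chain rule. A short derivation, writing $z_i = w^T x_i + b$ and using the identity $\sigma'(t) = \sigma(t)\,(1 - \sigma(t))$:

$$
\begin{aligned}
\frac{\partial L}{\partial \hat{y}_i} &= -\frac{1}{n}\left[\frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i}\right]
= \frac{1}{n}\,\frac{\hat{y}_i - y_i}{\hat{y}_i (1 - \hat{y}_i)}, \\
\frac{\partial \hat{y}_i}{\partial z_i} &= \hat{y}_i (1 - \hat{y}_i), \\
\frac{\partial L}{\partial z_i} &= \frac{\partial L}{\partial \hat{y}_i}\,\frac{\partial \hat{y}_i}{\partial z_i} = \frac{1}{n} (\hat{y}_i - y_i), \\
\frac{\partial L}{\partial w} &= \sum_{i=1}^n \frac{\partial L}{\partial z_i}\, x_i = \frac{1}{n} X^T (\hat{y} - y), \qquad
\frac{\partial L}{\partial b} = \sum_{i=1}^n \frac{\partial L}{\partial z_i} = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i).
\end{aligned}
$$

The factor $\hat{y}_i (1 - \hat{y}_i)$ cancels, which is why the result has the same form as the linear regression gradient.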

We will:

1. Generate a simple 2D dataset for binary classification
2. Implement logistic regression training **from scratch** using these gradients
3. Compare with `sklearn.linear_model.LogisticRegression` on the same data.

In [None]:
# Helper functions for logistic regression
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y_true, y_pred, eps=1e-8):
    """Binary cross-entropy for vectors y_true, y_pred in [0,1]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))


def logistic_regression_gd(X, y, lr=0.1, n_iters=1000):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    losses = []
    
    for it in range(n_iters):
        z = X @ w + b
        y_pred = sigmoid(z)
        loss = binary_cross_entropy(y, y_pred)
        losses.append(loss)
        
        # Gradients
        error = y_pred - y
        grad_w = (1.0 / n_samples) * (X.T @ error)
        grad_b = (1.0 / n_samples) * np.sum(error)
        
        # Update
        w -= lr * grad_w
        b -= lr * grad_b
    
    return w, b, losses


# Generate a simple 2D dataset
X, y = make_moons(n_samples=400, noise=0.25, random_state=42)

# Standardize features for easier optimization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train from-scratch logistic regression
w_lr, b_lr, losses_lr = logistic_regression_gd(X_scaled, y, lr=0.1, n_iters=2000)

print("From-scratch Logistic Regression:")
print("w:", w_lr)
print("b:", b_lr)
print("Final BCE loss:", losses_lr[-1])

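# Compare with sklearn (note: the default C=1.0 applies L2 regularization,
# so the coefficients should be close to ours but not identical)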
log_reg = LogisticRegression(fit_intercept=True, solver="lbfgs")
log_reg.fit(X_scaled, y)

print("\nSklearn LogisticRegression:")
print("w:", log_reg.coef_.ravel())
print("b:", log_reg.intercept_[0])

# Plot loss curve
plt.figure(figsize=(8, 4))
plt.plot(losses_lr)
plt.title("Logistic Regression (from scratch) - Training Loss (BCE)")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.tight_layout()
plt.show()
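
One caveat about the `sigmoid` helper above: for very negative `z`, `np.exp(-z)` overflows and NumPy emits a `RuntimeWarning`, even though the result still rounds to 0. A possible numerically stable variant is sketched below. The name `stable_sigmoid` is illustrative and not part of the chapter's code.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # always in [0, 1], so it cannot overflow
    # For z >= 0: 1 / (1 + exp(-z)); for z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z_test = np.array([-1000.0, -5.0, 0.0, 5.0, 1000.0])
print(stable_sigmoid(z_test))
```

For moderate inputs it agrees with the direct formula, so it could be swapped into `logistic_regression_gd` without changing the results on this dataset.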

## Setup and Imports

We will use **NumPy** for numerical computation, **matplotlib** for plotting, and **scikit-learn** only for a few sanity checks (to compare with our from-scratch implementations).

# 3. Decision Boundary Visualization

A powerful way to understand linear models is to **visualize the decision boundary** in 2D.

For a logistic regression classifier trained on 2D features $x = (x_1, x_2)$:

$$
\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b),
$$

the **decision boundary**, where the model is undecided ($\hat{y} = 0.5$), is the set of points where the argument of $\sigma$ is zero, since $\sigma(0) = 0.5$:

$$
w_1 x_1 + w_2 x_2 + b = 0.
$$

This is a **straight line** in the $(x_1, x_2)$ plane. We will:

1. Train our from-scratch logistic regression on a 2D dataset
2. Plot the data points
3. Overlay the decision boundary and probability contours.
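
When $w_2 \neq 0$, the boundary equation can be solved for $x_2$ explicitly: $x_2 = -(w_1 x_1 + b) / w_2$. The sketch below uses this to compute points on the line. The function name `boundary_x2` and the example weights are hypothetical values chosen for illustration, not the trained parameters.

```python
import numpy as np

def boundary_x2(w, b, x1):
    """Return x2 on the line w[0]*x1 + w[1]*x2 + b = 0 (requires w[1] != 0)."""
    w = np.asarray(w, dtype=float)
    if np.isclose(w[1], 0.0):
        raise ValueError("w[1] is zero: the boundary is the vertical line x1 = -b / w[0]")
    return -(w[0] * np.asarray(x1, dtype=float) + b) / w[1]

# Hypothetical weights, for illustration only
w_demo, b_demo = np.array([1.0, -2.0]), 0.5
x1_line = np.linspace(-2.0, 2.0, 5)
x2_line = boundary_x2(w_demo, b_demo, x1_line)

# Every point on the line has logit 0, i.e. predicted probability 0.5
logits = w_demo[0] * x1_line + w_demo[1] * x2_line + b_demo
print(np.allclose(logits, 0.0))
```

With trained parameters, plotting `x1` against `boundary_x2(w, b, x1)` should draw the same line as a contour of the predicted probability at level 0.5.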

In [None]:
# Reuse X_scaled, y, w_lr, b_lr from previous cell

# Create a grid over feature space
x1_min, x1_max = X_scaled[:, 0].min() - 0.5, X_scaled[:, 0].max() + 0.5
x2_min, x2_max = X_scaled[:, 1].min() - 0.5, X_scaled[:, 1].max() + 0.5
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 200),
                       np.linspace(x2_min, x2_max, 200))

grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = sigmoid(grid @ w_lr + b_lr).reshape(xx1.shape)

fig, ax = plt.subplots(figsize=(7, 6))

# Plot decision regions
contour = ax.contourf(xx1, xx2, probs, levels=20, cmap="RdBu", alpha=0.6)
plt.colorbar(contour, ax=ax, label="P(y=1 | x)")

# Decision boundary (p = 0.5)
ax.contour(xx1, xx2, probs, levels=[0.5], colors="k", linewidths=2)

# Scatter data points
ax.scatter(X_scaled[y == 0, 0], X_scaled[y == 0, 1], c="blue", edgecolor="k", label="Class 0")
ax.scatter(X_scaled[y == 1, 0], X_scaled[y == 1, 1], c="red", edgecolor="k", label="Class 1")

ax.set_title("Logistic Regression Decision Boundary (Moons Dataset)")
ax.set_xlabel("x1 (standardized)")
ax.set_ylabel("x2 (standardized)")
ax.legend()

plt.tight_layout()
plt.show()

# 4. Regularization Effects (L2 / Ridge)

Regularization helps control model complexity and reduce overfitting. For logistic regression with **L2 regularization** (ridge), we add a penalty on the weights:

$$
L_{\text{reg}}(w, b) = L(w, b) + \lambda \lVert w \rVert^2,
$$

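where $\lambda > 0$ controls the strength of regularization. In scikit-learn, the strength is set through the parameter `C`, which is the *inverse* of the regularization strength: a small `C` means a strong penalty, and a very large `C` makes the penalty negligible.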

Effects of L2 regularization:

- Encourages **smaller weights** (shrinks coefficients)
- Can make the decision boundary **less sensitive** to noise
- Often improves **generalization** on test data

We will:

1. Train two logistic regressions on the same 2D dataset:
   - One **without** regularization
   - One **with** L2 regularization
2. Compare their weights and decision boundaries.
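
The penalty can also be added to a from-scratch trainer: differentiating $\lambda \lVert w \rVert^2$ contributes $2\lambda w$ to the weight gradient, while the bias is conventionally left unpenalized. The following is a minimal, self-contained sketch under those assumptions. The function name `logistic_regression_l2_gd` and the value `lam=0.1` are illustrative choices, and the data is regenerated here so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

def logistic_regression_l2_gd(X, y, lam=0.0, lr=0.1, n_iters=2000):
    """Gradient descent on BCE + lam * ||w||^2 (bias not penalized)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        y_pred = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        error = y_pred - y
        # BCE gradient plus the gradient of the L2 penalty
        grad_w = X.T @ error / n_samples + 2.0 * lam * w
        grad_b = error.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X_m, y_m = make_moons(n_samples=400, noise=0.25, random_state=42)
X_m = StandardScaler().fit_transform(X_m)

w_plain, b_plain = logistic_regression_l2_gd(X_m, y_m, lam=0.0)
w_ridge, b_ridge = logistic_regression_l2_gd(X_m, y_m, lam=0.1)

print("||w|| without penalty:", np.linalg.norm(w_plain))
print("||w|| with lam=0.1:   ", np.linalg.norm(w_ridge))
```

The penalized weights should have a smaller norm. As a rough guide to matching scikit-learn, whose documented objective is $\tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \ell_i$, dividing by $Cn$ suggests the correspondence $\lambda = 1 / (2Cn)$ for the mean-loss formulation used here.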

In [None]:
# Train logistic regression models with and without L2 regularization

# Effectively no regularization (a very large C makes the L2 penalty negligible)
log_reg_no_reg = LogisticRegression(fit_intercept=True, solver="lbfgs", C=1e6)
log_reg_no_reg.fit(X_scaled, y)

# With L2 regularization (smaller C)
log_reg_l2 = LogisticRegression(fit_intercept=True, solver="lbfgs", C=0.1)
log_reg_l2.fit(X_scaled, y)

print("No regularization (C=1e6):")
print("  w:", log_reg_no_reg.coef_.ravel())
print("  b:", log_reg_no_reg.intercept_[0])

print("\nWith L2 regularization (C=0.1):")
print("  w:", log_reg_l2.coef_.ravel())
print("  b:", log_reg_l2.intercept_[0])

# Plot decision boundaries
xx1, xx2 = np.meshgrid(np.linspace(x1_min, x1_max, 200),
                       np.linspace(x2_min, x2_max, 200))

grid = np.c_[xx1.ravel(), xx2.ravel()]
probs_no_reg = log_reg_no_reg.predict_proba(grid)[:, 1].reshape(xx1.shape)
probs_l2 = log_reg_l2.predict_proba(grid)[:, 1].reshape(xx1.shape)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, probs, title in zip(
    axes,
    [probs_no_reg, probs_l2],
    ["No regularization (C=1e6)", "With L2 regularization (C=0.1)"]):
    
    contour = ax.contourf(xx1, xx2, probs, levels=20, cmap="RdBu", alpha=0.6)
    ax.contour(xx1, xx2, probs, levels=[0.5], colors="k", linewidths=2)
    
    ax.scatter(X_scaled[y == 0, 0], X_scaled[y == 0, 1], c="blue", edgecolor="k", label="Class 0", alpha=0.7)
    ax.scatter(X_scaled[y == 1, 0], X_scaled[y == 1, 1], c="red", edgecolor="k", label="Class 1", alpha=0.7)
    
    ax.set_title(title)
    ax.set_xlabel("x1 (standardized)")
    ax.set_ylabel("x2 (standardized)")

axes[0].legend(loc="upper left")
plt.tight_layout()
plt.show()
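
Whether regularization actually helps generalization is an empirical question that needs held-out data. The sketch below is self-contained and compares train and test accuracy for several values of `C`. The 70/30 split, `random_state=0`, and the particular values of `C` are assumptions made for illustration, and no specific outcome is implied.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_raw, y_all = make_moons(n_samples=400, noise=0.25, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_all, test_size=0.3, random_state=0, stratify=y_all)

# Fit the scaler on the training split only to avoid leaking test statistics
scaler_tt = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_tt.transform(X_tr), scaler_tt.transform(X_te)

# Map each C to (train accuracy, test accuracy)
results = {}
for C in [1e6, 1.0, 0.1, 0.01]:
    model = LogisticRegression(C=C, solver="lbfgs").fit(X_tr_s, y_tr)
    results[C] = (model.score(X_tr_s, y_tr), model.score(X_te_s, y_te))
    print(f"C={C:g}: train acc={results[C][0]:.3f}, test acc={results[C][1]:.3f}")
```

A two-feature linear model has little capacity to overfit, so the differences between settings are likely to be small on this dataset. The effect of regularization tends to be more visible with many features or few samples.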

# 5. Summary and Connection to Multi-Layer Networks

In this bonus notebook we:

- Implemented **linear regression** from scratch and confirmed that gradient descent converges to the same solution as the closed-form formula.
- Implemented **logistic regression** by hand with explicit gradients and compared it to scikit-learn's implementation.
- Visualized **decision boundaries** in 2D and saw that logistic regression always learns a **linear (straight-line)** boundary in feature space.
- Demonstrated how **L2 regularization** shrinks the weights, which softens the predicted probabilities around the (still linear) decision boundary and often helps generalization.

**Key takeaway:**

- Linear and logistic models are **simple, fast, and interpretable**.
- However, they can only represent **linear decision boundaries**. Problems like XOR or highly non-linear patterns **cannot be solved** by a single linear model in the original feature space.
- This is the main motivation for **multi-layer networks (MLPs)** in Chapter 4: by stacking layers with non-linear activations, we can represent complex, non-linear decision boundaries while still training with gradient-based methods.
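
The XOR limitation is easy to check numerically. With the four XOR points, any linear boundary misclassifies at least one of them, so accuracy on the raw features cannot exceed 0.75. The sketch below also shows, as one illustrative alternative to learning features with a multi-layer network, that a hand-crafted product feature $x_1 x_2$ makes the problem linearly separable. The variable names and the settings `C=1e4` and `max_iter=1000` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The four XOR points and their labels
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0])

# A linear model on the raw features: no line separates the two classes
linear_clf = LogisticRegression(C=1e4, max_iter=1000).fit(X_xor, y_xor)
acc_linear = linear_clf.score(X_xor, y_xor)

# Add the product x1 * x2 as a third feature (an illustrative choice)
X_aug = np.c_[X_xor, X_xor[:, 0] * X_xor[:, 1]]
aug_clf = LogisticRegression(C=1e4, max_iter=1000).fit(X_aug, y_xor)
acc_aug = aug_clf.score(X_aug, y_xor)

print("accuracy, raw features:      ", acc_linear)
print("accuracy, with x1*x2 feature:", acc_aug)
```

The model is still linear in its inputs; only the feature space changed. An MLP can be seen as learning such non-linear features from data instead of requiring them to be designed by hand.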

You can refer back to this notebook whenever you want a concrete reminder of how gradients and optimization work for the simplest models.