# **CHAPTER 6: SUPERVISED LEARNING - REGRESSION**

*Predicting the Continuous*

## **Chapter Overview**

Regression is the foundation of predictive modeling. From forecasting stock prices to estimating house values, regression algorithms map input features to continuous outputs. This chapter progresses from the elegant simplicity of linear regression to the powerful complexity of gradient boosting, with mathematical derivations and production-ready implementations.

**Estimated Time:** 50-60 hours (3-4 weeks)  
**Prerequisites:** Chapters 1-5 (Math, Python, Data Preprocessing)

---

## **6.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Derive and implement linear regression using both closed-form (normal equations) and iterative (gradient descent) solutions
2. Apply regularization techniques (Ridge, Lasso, Elastic Net) to prevent overfitting and perform feature selection
3. Select appropriate evaluation metrics based on business requirements and error distributions
4. Implement and tune advanced regression algorithms (SVR, Random Forest, XGBoost, LightGBM)
5. Handle non-linear relationships through polynomial features and kernel methods
6. Build regression pipelines for time series forecasting

---

## **6.1 Linear Regression: The Foundation**

#### **6.1.1 The Model**

Given features $\mathbf{X} \in \mathbb{R}^{n \times d}$ and targets $\mathbf{y} \in \mathbb{R}^n$, find weights $\mathbf{w} \in \mathbb{R}^d$ and bias $b$ such that:

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b\mathbf{1}$$

Or, more compactly (absorbing the bias into $\mathbf{w}$ by adding a column of 1s to $\mathbf{X}$):

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$$

**Objective:** Minimize the Mean Squared Error (MSE), scaled by $\frac{1}{2}$ so that the factor of 2 cancels in the gradient:

$$J(\mathbf{w}) = \frac{1}{2n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$$

#### **6.1.2 Closed-Form Solution (Normal Equations)**

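Set the gradient to zero:

$$\nabla_{\mathbf{w}} J(\mathbf{w}) = -\frac{1}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) = 0$$

Rearranging gives the normal equations $\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$. When $\mathbf{X}^T\mathbf{X}$ is invertible: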

$$\mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

```python
import numpy as np

class LinearRegressionClosedForm:
    def fit(self, X, y):
        # Add bias term (column of ones)
        X_b = np.c_[np.ones((X.shape[0], 1)), X]

        # Normal equations: solve (X^T X) w = X^T y.
        # np.linalg.solve is faster and more stable than forming the inverse.
        self.w = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
        return self

    def predict(self, X):
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        return X_b @ self.w

# Usage
model = LinearRegressionClosedForm()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

**Computational Complexity:** $O(d^2n + d^3)$: forming $\mathbf{X}^T\mathbf{X}$ costs $O(d^2n)$ and solving the resulting $d \times d$ system costs $O(d^3)$. This becomes impractical for large $d$ (10,000+ features).
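As a quick sanity check, the sketch below solves the normal equations on synthetic data and compares the result with NumPy's least-squares routine. The data, seed, and true weights are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): y = 4 + 3*x1 - 2*x2 + noise
n = 200
X = rng.normal(size=(n, 2))
y = 4.0 + X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the first weight acts as the bias
X_b = np.c_[np.ones((n, 1)), X]

# Normal equations, solved as a linear system rather than via an explicit inverse
w_normal = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Reference solution from NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(w_normal)                        # approximately [4, 3, -2]
print(np.allclose(w_normal, w_lstsq))  # both routes should agree
```

When $\mathbf{X}^T\mathbf{X}$ is singular or badly conditioned, `np.linalg.lstsq` is usually the safer choice because it works on $\mathbf{X}$ directly.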

#### **6.1.3 Gradient Descent Solution**

When $d$ is large or data doesn't fit in memory, use iterative optimization.

**Gradient:**

$$\nabla_{\mathbf{w}} J(\mathbf{w}) = -\frac{1}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w})$$

**Update Rule:**

$$\mathbf{w} := \mathbf{w} - \alpha \nabla_{\mathbf{w}} J(\mathbf{w})$$

```python
class LinearRegressionGD:
    def __init__(self, lr=0.01, n_iter=1000, tol=1e-6):
        self.lr = lr
        self.n_iter = n_iter
        self.tol = tol
        
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features + 1)  # +1 for bias
        
        X_b = np.c_[np.ones((n_samples, 1)), X]
        
        for i in range(self.n_iter):
            # Predictions
            y_pred = X_b @ self.w
            
            # Gradient
            gradient = -(1/n_samples) * X_b.T @ (y - y_pred)
            
            # Update
            self.w -= self.lr * gradient
            
            # Convergence check
            if np.linalg.norm(gradient) < self.tol:
                print(f"Converged at iteration {i}")
                break
                
        return self
    
    def predict(self, X):
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        return X_b @ self.w
```

**Variants:**
- **Batch GD:** Uses all $n$ samples (accurate but slow)
- **Stochastic GD (SGD):** Uses 1 sample at a time (fast but noisy)
- **Mini-batch GD:** Uses a batch of $m$ samples per update, with $1 < m < n$ (balance of speed and stability)
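The three variants differ only in how many samples feed each gradient step. A minimal sketch, assuming a hypothetical helper `minibatch_gd` and synthetic data: setting `batch_size` to the number of samples gives batch GD, and setting it to 1 gives SGD.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.05, batch_size=32, n_epochs=200, seed=0):
    """Mini-batch gradient descent for linear regression (bias included)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    X_b = np.c_[np.ones((n_samples, 1)), X]
    w = np.zeros(X_b.shape[1])

    for _ in range(n_epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - X_b[idx] @ w
            grad = -(X_b[idx].T @ residual) / len(idx)
            w -= lr * grad
    return w

# Illustrative data with known weights
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

w = minibatch_gd(X, y)
print(w)  # close to [1.5, 2.0, -1.0, 0.5], up to noise
```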

#### **6.1.4 Polynomial Regression**

Capture non-linear relationships by adding polynomial features:

$$\hat{y} = w_0 + w_1x + w_2x^2 + w_3x^3 + \dots + w_dx^d$$

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Create pipeline
model = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('linear', LinearRegression())
])

model.fit(X, y)
```

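**Danger:** High-degree polynomials overfit easily, and with several input features the number of generated terms grows combinatorially with the degree. For degree > 2, pair the expansion with regularization (Ridge/Lasso) and validate on held-out data.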
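To see the effect, the following sketch fits polynomials of increasing degree to a small noisy dataset and reports training and test error. The data (a noisy sine wave) and the chosen degrees are assumptions for illustration. Training error can only go down as the degree grows, while test error typically stops improving and then gets worse.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Illustrative data: a noisy sine wave on [-1, 1]
X_train = rng.uniform(-1, 1, size=(30, 1))
y_train = np.sin(3 * X_train[:, 0]) + rng.normal(scale=0.2, size=30)
X_test = rng.uniform(-1, 1, size=(200, 1))
y_test = np.sin(3 * X_test[:, 0]) + rng.normal(scale=0.2, size=200)

results = {}
for degree in (1, 3, 12):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    results[degree] = (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.3f}  "
          f"test MSE={results[degree][1]:.3f}")
```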

---

## **6.2 Regularization: Taming Complexity**

#### **6.2.1 Ridge Regression (L2 Regularization)**

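Add a penalty proportional to the squared L2 norm of the weights:

$$J(\mathbf{w}) = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$$

Setting the gradient $-\frac{1}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\lambda\mathbf{w}$ to zero gives the closed-form solution:

$$\mathbf{w} = (\mathbf{X}^T\mathbf{X} + 2n\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$

The factor $2n$ comes from the $\frac{1}{2n}$ scaling of the loss. scikit-learn's `Ridge` instead minimizes $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha\|\mathbf{w}\|_2^2$, so its `alpha` corresponds to $2n\lambda$ here.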

**Effect:** Shrinks weights toward zero but rarely exactly zero. Good for multicollinearity (correlated features).

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Find the best alpha (regularization strength) via cross-validation
params = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
ridge = Ridge()
grid = GridSearchCV(ridge, params, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

print(f"Best alpha: {grid.best_params_['alpha']}")
```
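The link between the closed form and the library can be checked directly. scikit-learn's `Ridge` minimizes $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha\|\mathbf{w}\|_2^2$, whose minimizer is $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$. The sketch below, on made-up data and without an intercept to keep the algebra simple, compares the two and shows the shrinkage relative to ordinary least squares.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=100)

alpha = 10.0

# Closed form for the objective ||y - Xw||^2 + alpha * ||w||^2 (no intercept)
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# scikit-learn minimizes the same objective when fit_intercept=False
ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)

# Unregularized least squares for comparison
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_closed, ridge.coef_))               # the two solutions agree
print(np.linalg.norm(w_closed), np.linalg.norm(w_ols))  # ridge norm is smaller
```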

#### **6.2.2 Lasso Regression (L1 Regularization)**

$$J(\mathbf{w}) = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1$$

**Effect:** Drives some weights to exactly zero → **feature selection**.

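**Why L1 induces sparsity:** The diamond-shaped constraint region of the L1 norm has corners on the coordinate axes, and the loss contours tend to first touch the region at one of these corners, where some coordinates are exactly zero. The circular L2 region has no corners, so its contact point rarely lies on an axis.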
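The same effect can be seen algebraically in one dimension. For a single coefficient with unregularized estimate $z$, the L1-penalized solution is the soft-thresholded value of $z$, while the L2-penalized solution only rescales it. The helper names below are illustrative and not taken from a library.

```python
import numpy as np

def soft_threshold(z, t):
    """Minimizer of 0.5*(z - w)**2 + t*|w| (the L1 case)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, t):
    """Minimizer of 0.5*(z - w)**2 + t*w**2 (the L2 case)."""
    return z / (1.0 + 2.0 * t)

z = np.array([0.05, -0.3, 2.0])  # unregularized estimates
print(soft_threshold(z, 0.1))    # small entries become exactly zero
print(ridge_shrink(z, 0.1))      # every entry shrinks but stays non-zero
```

Soft-thresholding is also the per-coordinate update used by coordinate descent solvers for Lasso.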

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# Selected features (non-zero coefficients); assumes X_train is a pandas DataFrame
selected_features = X_train.columns[lasso.coef_ != 0]
print(f"Selected {len(selected_features)} out of {X_train.shape[1]} features")
```

#### **6.2.3 Elastic Net**

Combines L1 and L2:

$$J(\mathbf{w}) = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda_1\|\mathbf{w}\|_1 + \lambda_2\|\mathbf{w}\|_2^2$$

Best of both worlds: feature selection (L1) + handling correlated features (L2).

```python
from sklearn.linear_model import ElasticNet

# l1_ratio: 0=Ridge, 1=Lasso, 0.5=equal mix
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
```
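scikit-learn parameterizes the two penalties through `alpha` and `l1_ratio` rather than $\lambda_1$ and $\lambda_2$: its objective uses $\alpha \cdot \text{l1\_ratio}$ on the L1 term and $\frac{1}{2}\alpha(1 - \text{l1\_ratio})$ on the squared L2 term. A small sketch of the mapping follows; the helper `sklearn_to_lambdas` is hypothetical and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

def sklearn_to_lambdas(alpha, l1_ratio):
    """Map scikit-learn's (alpha, l1_ratio) to (lambda_1, lambda_2)."""
    lambda_1 = alpha * l1_ratio
    lambda_2 = 0.5 * alpha * (1.0 - l1_ratio)
    return lambda_1, lambda_2

print(sklearn_to_lambdas(alpha=0.1, l1_ratio=0.5))  # roughly (0.05, 0.025)

# With l1_ratio=1 the L2 term vanishes and ElasticNet reduces to Lasso
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.3, size=80)

enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.allclose(enet.coef_, lasso.coef_))
```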

#### **6.2.4 Early Stopping**

For gradient descent: Stop training when validation error stops decreasing.

```python
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

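# Hold out a validation set for comparing models later
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# early_stopping=True makes SGDRegressor set aside validation_fraction of
# X_train internally and stop after n_iter_no_change epochs without improvement.
# SGD is sensitive to feature scale, so standardize features beforehand.
sgd = SGDRegressor(max_iter=1000, tol=1e-3, early_stopping=True,
                   validation_fraction=0.1, n_iter_no_change=5, random_state=42)
sgd.fit(X_train, y_train)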
```

---

## **6.3 Evaluation Metrics: Measuring Error**

Choosing the right metric determines what "good" means.

#### **6.3.1 MSE and RMSE**

**Mean Squared Error:**
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$

**Root Mean Squared Error:**
$$\text{RMSE} = \sqrt{\text{MSE}}$$

- **Properties:** Differentiable (good for optimization), penalizes large errors heavily (quadratic), same units as target (RMSE)
- **Use when:** Large errors are particularly bad, errors are Gaussian-distributed

```python
from sklearn.metrics import mean_squared_error
import numpy as np

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
```

#### **6.3.2 MAE**

**Mean Absolute Error:**
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$$

- **Properties:** Robust to outliers, not differentiable at zero
- **Use when:** Outliers are common and shouldn't dominate the metric
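A small numeric comparison makes the difference concrete. In this sketch (values invented for illustration), a single large miss moves RMSE far more than MAE.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([10.0, 12.0, 14.0, 16.0])
clean = y_true + 1.0                                 # every prediction off by 1
outlier = y_true + np.array([1.0, 1.0, 1.0, 10.0])   # one large miss

print(mae(y_true, clean), rmse(y_true, clean))       # 1.0 1.0
print(mae(y_true, outlier), rmse(y_true, outlier))   # 3.25 and about 5.07
```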

#### **6.3.3 R² (Coefficient of Determination)**

$$R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

- **Interpretation:** Proportion of variance explained. Range $(-\infty, 1]$, where 1 is perfect prediction, 0 is as good as predicting the mean.
- **Warning:** Can be negative if model is worse than predicting the mean.

```python
from sklearn.metrics import r2_score
r2 = r2_score(y_true, y_pred)
```
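The three reference points of the range (1, 0, and negative values) can be reproduced with a toy example; the numbers are chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])

print(r2_score(y_true, y_true))                      # 1.0: perfect prediction
print(r2_score(y_true, np.full(3, y_true.mean())))   # 0.0: same as predicting the mean
print(r2_score(y_true, np.array([3.0, 2.0, 1.0])))   # -3.0: worse than the mean
```

In the last case the residual sum of squares is 8 and the total sum of squares is 2, so $R^2 = 1 - 8/2 = -3$.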

#### **6.3.4 MAPE and SMAPE**

**Mean Absolute Percentage Error:**
$$\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^n \left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

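- **Problem:** Undefined when $y_i = 0$ and unstable when $y_i$ is close to zero. It is also asymmetric: for non-negative targets, the error of an under-prediction is capped at 100%, while an over-prediction can incur an unbounded error, so optimizing MAPE favors models that under-forecast.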

**Symmetric MAPE:**
$$\text{SMAPE} = \frac{100\%}{n}\sum_{i=1}^n \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|)/2}$$
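Both formulas are short enough to implement directly. The sketch below uses plain NumPy (function names are illustrative) and evaluates the same absolute error of 50 in both directions. Note that, despite its name, SMAPE as defined above does not treat over- and under-prediction identically either.

```python
import numpy as np

def mape(y_true, y_pred):
    """MAPE in percent; undefined if any y_true is zero."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def smape(y_true, y_pred):
    """SMAPE in percent, using the mean of |y| and |y_hat| as denominator."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_true - y_pred) / denom)

# Same absolute error of 50 on a true value of 100
print(mape([100], [150]), mape([100], [50]))    # 50.0 50.0
print(smape([100], [150]), smape([100], [50]))  # about 40 and about 66.7

# MAPE is bounded for under-prediction but not for over-prediction
print(mape([100], [0]), mape([100], [300]))     # 100.0 200.0
```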

#### **6.3.5 Custom Business Metrics**

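Metrics tied to business outcomes often matter more than generic statistical ones:

```python
import numpy as np

def revenue_weighted_mse(y_true, y_pred, revenue):
    """MSE weighted by customer revenue (weights are normalized to sum to 1)"""
    return np.average((y_true - y_pred)**2, weights=revenue)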

def quantile_loss(y_true, y_pred, quantile=0.9):
    """Pinball loss for quantile regression"""
    errors = y_true - y_pred
    return np.mean(np.maximum(quantile * errors, (quantile - 1) * errors))
```

---

## **6.4 Advanced Regression Algorithms**

#### **6.4.1 Support Vector Regression (SVR)**

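Finds a function that is as flat as possible while keeping predictions within $\epsilon$ of the actual targets (the epsilon-insensitive tube). Points that fall outside the tube are penalized linearly, with `C` controlling how heavily.

**Kernel Trick:** Computes inner products in a high-dimensional feature space without ever computing the coordinates in that space.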
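The tube corresponds to the epsilon-insensitive loss, which is zero for small errors and grows linearly beyond $\epsilon$. A minimal sketch, with an illustrative function name and toy values:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the tube |y - f(x)| <= epsilon, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.05, 1.5, 0.0])
print(epsilon_insensitive_loss(y_true, y_pred))  # about [0, 0.4, 0.9]
```

Only the points with non-zero loss (or exactly on the tube boundary) become support vectors, which is why the fitted model depends on a subset of the training data.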

```python
from sklearn.svm import SVR

# RBF kernel for non-linear relationships
svr = SVR(kernel='rbf', C=100, epsilon=0.1, gamma='scale')
svr.fit(X_train, y_train)

# C: inverse regularization strength (larger C = weaker regularization,
#    the opposite direction of alpha in Ridge)
# epsilon: half-width of the tube within which no penalty is applied
```

**When to use:** High-dimensional data, non-linear relationships, small-to-medium datasets (scales poorly with $n$).

#### **6.4.2 Decision Tree Regression**

Partitions feature space into regions, predicts mean of each region.

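**Splitting Criterion:** Choose the split that minimizes the sample-weighted MSE of the two resulting child nodes (or MAE for robustness), where each node's MSE is: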

$$\text{MSE}_{\text{node}} = \frac{1}{n_{\text{node}}}\sum_{i \in \text{node}} (y_i - \bar{y}_{\text{node}})^2$$
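For a single feature, the search for the best split can be sketched as an exhaustive scan over midpoints between sorted values. This is a simplified illustration, not scikit-learn's implementation; `best_split` is a hypothetical helper.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted child MSE) of the best split on one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, float("inf")

    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold separates identical values
        left, right = y_sorted[:i], y_sorted[i:]
        # Sample-weighted average of the two child-node MSEs
        # (np.var with default ddof=0 is the MSE around the node mean)
        score = (len(left) * left.var() + len(right) * right.var()) / len(y_sorted)
        if score < best_score:
            best_threshold = float((x_sorted[i] + x_sorted[i - 1]) / 2.0)
            best_score = float(score)
    return best_threshold, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
print(best_split(x, y))  # (6.5, 0.0)
```

A full tree repeats this scan over every feature at every node and recurses into the two children until a stopping rule is met.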

**Parameters to tune:**
- `max_depth`: Prevent overfitting (shallow = underfit, deep = overfit)
- `min_samples_leaf`: Minimum samples required at leaf node
- `min_samples_split`: Minimum samples to allow a split

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

tree = DecisionTreeRegressor(max_depth=5, min_samples_leaf=10)
tree.fit(X_train, y_train)

# Visualize the top three levels (feature_names expects a list of strings)
plt.figure(figsize=(20, 10))
plot_tree(tree, feature_names=list(X.columns), filled=True, max_depth=3)
plt.show()
```

#### **6.4.3 Ensemble Methods**

**Random Forest:** Bagging (Bootstrap Aggregating) of trees.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,    # Number of trees
    max_depth=10,
    min_samples_split=5,
    max_features='sqrt', # Number of features to consider at each split
    n_jobs=-1,           # Use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)

# Feature importance (caution: biased toward high-cardinality features)
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).plot(kind='barh')
```

**Gradient Boosting:** Sequential trees, each correcting errors of previous.

```python
from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,   # Shrinkage (prevent overfitting)
    max_depth=3,         # Usually shallow for boosting
    subsample=0.8,       # Stochastic gradient boosting
    random_state=42
)
gb.fit(X_train, y_train)
```
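For squared-error loss, "correcting the errors of the previous trees" means fitting each new tree to the current residuals. The following is a minimal from-scratch sketch of that idea, using scikit-learn trees as weak learners. The helper names and the synthetic data are assumptions, and the code omits subsampling and other refinements found in real libraries.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosting(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared error: each tree fits the current residuals."""
    init = float(y.mean())                 # initial constant prediction
    pred = np.full(len(y), init)
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred               # negative gradient of 0.5*(y - pred)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return init, trees

def predict_boosting(X, init, trees, learning_rate=0.1):
    """Sum the initial constant and the shrunken tree predictions."""
    pred = np.full(X.shape[0], init)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

init, trees = fit_boosting(X, y)
mse_start = np.mean((y - init) ** 2)
mse_end = np.mean((y - predict_boosting(X, init, trees)) ** 2)
print(f"train MSE: {mse_start:.3f} -> {mse_end:.3f}")
```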

#### **6.4.4 XGBoost / LightGBM / CatBoost**

Industry standard for tabular regression.

**XGBoost (Extreme Gradient Boosting):**
- Regularization (L1/L2) built-in
- Handles missing values automatically
- Parallel processing

```python
import xgboost as xgb

# Create DMatrix (optimized data structure)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    'objective': 'reg:squarederror',
    'max_depth': 6,
    'eta': 0.1,                    # Learning rate
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'lambda': 1,                   # L2 regularization
    'alpha': 0,                    # L1 regularization
    'eval_metric': 'rmse'
}

# Train with early stopping
model = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=[(dval, 'validation')],
    early_stopping_rounds=50,
    verbose_eval=100
)
```

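**LightGBM:** Typically trains faster thanks to histogram-based split finding and leaf-wise tree growth. Leaf-wise growth can overfit, especially on small datasets, unless `num_leaves` and `min_data_in_leaf` are constrained.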

```python
import lightgbm as lgb

train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': -1
}

model = lgb.train(
    params,
    train_data,
    num_boost_round=1000,
    valid_sets=[val_data],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
)
```

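**CatBoost:** Designed for datasets with many categorical features, which it encodes internally (using ordered target statistics) rather than requiring manual one-hot or label encoding.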

---

## **6.5 Time Series Regression**

Time series violates the i.i.d. assumption (independent and identically distributed). Use specialized techniques.

#### **6.5.1 Feature Engineering for Time Series**

```python
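# Assumes df has a DatetimeIndex and that the target for each row is df['sales'].
# Every feature below uses only values strictly before the current row.

# Lag features
df['sales_lag_7'] = df['sales'].shift(7)  # Sales 7 days ago

# Rolling statistics over the previous days (shift(1) excludes the current value)
df['sales_ma_7'] = df['sales'].shift(1).rolling(window=7).mean()  # Moving average
df['sales_std_30'] = df['sales'].shift(1).rolling(window=30).std()

# Expanding window (cumulative statistics up to the previous day)
df['sales_expanding_mean'] = df['sales'].shift(1).expanding().mean()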

# Date features (cyclical encoding)
df['day_of_week'] = df.index.dayofweek
df['month'] = df.index.month
df['day_of_week_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['day_of_week_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)
```

#### **6.5.2 Train-Test Split for Time Series**

**Never use random split!** Use temporal split.

```python
# Sort by time first
df = df.sort_index()

# Split: 80% train, 20% test temporally
split_idx = int(len(df) * 0.8)
train = df.iloc[:split_idx]
test = df.iloc[split_idx:]

# Or use TimeSeriesSplit for cross-validation (expanding training window)
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    # Train and evaluate on each fold
```
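A complete walk-forward loop might look like the sketch below. The series, the two lag features, and the choice of `Ridge` are placeholders for illustration; the point is that every validation fold lies strictly after its training fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)

# Illustrative series: linear trend plus noise, with two lagged values as features
series = 0.05 * np.arange(300) + rng.normal(scale=0.5, size=300)
X = np.column_stack([series[1:-1], series[:-2]])  # lag 1 and lag 2
y = series[2:]

fold_mae = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert train_idx.max() < val_idx.min()  # training data always precedes validation
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    fold_mae.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))

print(np.round(fold_mae, 3))
```

Reporting the error per fold, rather than only the average, shows whether accuracy degrades as the forecast origin moves forward.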

#### **6.5.3 ARIMA (Brief Introduction)**

AutoRegressive Integrated Moving Average. Classical statistical approach.

- **AR(p):** Uses past $p$ values
- **I(d):** Differencing to make stationary
- **MA(q):** Uses past $q$ forecast errors

```python
from statsmodels.tsa.arima.model import ARIMA

# Fit ARIMA(1,1,1)
model = ARIMA(train['sales'], order=(1,1,1))
fitted = model.fit()
forecast = fitted.forecast(steps=30)
```

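**Stationarity:** The ARMA part of the model assumes a stationary series (constant mean and variance over time); the differencing order $d$ is chosen so that the differenced series meets this assumption. Check with the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the series has a unit root (is non-stationary).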

#### **6.5.4 Prophet (Facebook/Meta)**

Automated forecasting with trend, seasonality, and holidays.

```python
from prophet import Prophet

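# Prophet expects columns named 'ds' (timestamp) and 'y' (target);
# this assumes the DataFrame index is named 'date'
df_prophet = df.reset_index().rename(columns={'date': 'ds', 'sales': 'y'})[['ds', 'y']]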

model = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False
)
model.add_country_holidays(country_name='US')
model.fit(df_prophet)

future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
model.plot(forecast)
```

---

## **6.6 Workbook Labs**

### **Lab 1: From Scratch Implementation**
Implement Linear Regression, Ridge, and Lasso from scratch using only NumPy:
1. Normal equations for Linear and Ridge
2. Coordinate descent for Lasso (update one coordinate at a time)
3. Compare your implementation's coefficients with sklearn

**Deliverable:** `regression_from_scratch.py` with unit tests matching sklearn within 1e-6 tolerance.

### **Lab 2: House Price Prediction Kaggle**
Use the Ames Housing dataset:
1. Comprehensive EDA and preprocessing (from Chapter 5)
2. Compare Linear, Ridge, Lasso, Random Forest, XGBoost
3. Hyperparameter tuning with RandomizedSearchCV
4. Ensemble top 3 models (stacking)
5. Submit to Kaggle and document leaderboard position

**Deliverable:** Jupyter notebook with complete pipeline, achieving RMSE < 0.12 (log-transformed).

### **Lab 3: Time Series Forecasting**
Predict energy consumption:
1. Create lag features, rolling statistics, calendar features
2. Compare XGBoost vs Prophet vs ARIMA
3. Implement walk-forward validation (expanding window)
4. Calculate MAPE and directional accuracy (did you predict up/down correctly?)

**Deliverable:** `time_series_forecast.py` with automated retraining schedule logic.

### **Lab 4: Regularization Path Analysis**
On a high-dimensional dataset (d > n):
1. Fit Lasso with 100 different alpha values
2. Plot the "regularization path" (coefficient vs alpha)
3. Identify when each feature enters the model
4. Compare with Elastic Net paths

**Deliverable:** Visualization showing feature selection dynamics.

---

## **6.7 Common Pitfalls**

1. **Multicollinearity in Linear Regression:**
   - Symptoms: High standard errors, unstable coefficients, high condition number
   - Solutions: Ridge regression, PCA, drop correlated features, combine into indices

2. **Extrapolation:**
   - Linear models assume linearity holds outside training range. Tree-based models can't extrapolate beyond training data range.
   - **Test:** Check if test set features are within training distribution (use KDE plots).

3. **Log-transform Bias:**
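   - If you train on $\log(y)$, exponentiating the prediction estimates roughly the conditional median of $y$ rather than its mean, so it is biased low.
   - **Correction:** If the residuals on the log scale are Gaussian with variance $\sigma^2$, use $\hat{y} = \exp(\widehat{\log y} + \frac{\sigma^2}{2})$. Without that assumption, use Duan's smearing estimator: multiply $\exp(\widehat{\log y})$ by the mean of $\exp(r_i)$ over the training residuals $r_i$.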

4. **Data Leakage in Time Series:**
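   - Using information that is not available at prediction time (for example, a rolling mean that includes the current point) to predict the current point.
   - **Solution:** Apply `shift(1)` before `rolling()` or `expanding()` so that each window ends at the previous observation, and build lag features with `shift(k)` for $k \geq 1$.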

5. **Ignoring Heteroscedasticity:**
   - Error variance changes with target magnitude (common in financial data).
   - **Solution:** Weighted least squares, or transform target (log), or use quantile regression.
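The log-transform bias from pitfall 3 is easy to reproduce. The sketch below simulates data whose logarithm is linear in one feature, fits ordinary least squares on the log scale, and compares the naive back-transform with the two corrections. All data and parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: log(y) is linear in x with Gaussian noise (sigma = 1)
n = 5000
x = rng.uniform(0, 2, size=n)
log_y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=n)
y = np.exp(log_y)

# Ordinary least squares on the log scale
A = np.c_[np.ones(n), x]
coef, *_ = np.linalg.lstsq(A, log_y, rcond=None)
fitted_log = A @ coef
residuals = log_y - fitted_log

naive = np.exp(fitted_log)                        # biased low for the mean of y
lognormal = naive * np.exp(residuals.var() / 2)   # assumes Gaussian residuals
smearing = naive * np.mean(np.exp(residuals))     # Duan's smearing factor

print(f"mean of y          : {y.mean():.2f}")
print(f"naive back-transform: {naive.mean():.2f}")
print(f"lognormal correction: {lognormal.mean():.2f}")
print(f"smearing estimator  : {smearing.mean():.2f}")
```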

---

## **6.8 Interview Questions**

**Q1:** Why might you choose MAE over RMSE as a loss function?
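*A: MAE is more robust to outliers because it does not square the errors, whereas RMSE penalizes large errors heavily and can be dominated by a few outliers. MAE also reads directly as the average absolute error; RMSE is in the same units as the target but has no equally direct interpretation. The trade-offs are that MAE is not differentiable at zero, which makes gradient-based optimization harder, and that minimizing it targets the conditional median rather than the mean.*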

**Q2:** Explain the bias-variance tradeoff in the context of Ridge vs Lasso regression.
*A: For both methods, stronger regularization (larger lambda) increases bias (underfitting) and decreases variance (more stable predictions). Ridge shrinks all coefficients smoothly and keeps every feature, which works well when many features carry small effects or are correlated. Lasso sets some coefficients exactly to zero, which lowers variance further when only a few features matter, but it adds bias if it drops relevant features and it tends to pick arbitrarily among correlated ones.*

**Q3:** Why does XGBoost often outperform Random Forest on tabular data?
*A: XGBoost uses boosting (sequential error correction) with a second-order approximation of the loss (gradient and Hessian), so it reduces bias as well as variance. It also has built-in L1/L2 regularization and learns a default split direction for missing values. Random Forest uses bagging (averaging independently grown trees), which reduces variance but does not correct bias as aggressively. Boosting is more sensitive to hyperparameters, so the advantage usually depends on careful tuning.*

**Q4:** How do you handle categorical variables with high cardinality in linear regression?
*A: One-hot encoding creates too many columns (curse of dimensionality). Better approaches: Target encoding (mean of target per category), embeddings (neural networks), or hashing trick. For linear models specifically, target encoding with smoothing/cross-validation to prevent overfitting is standard.*

**Q5:** Your model has high $R^2$ on training but low on test. What regularization techniques do you try, in what order?
*A: 1) Rule out data leakage and train/test distribution mismatch first, since no amount of regularization fixes either. 2) If it is legitimate overfitting, add Ridge (L2) regression as a baseline. 3) If feature selection is needed, try Lasso (L1) or Elastic Net. 4) For trees, reduce max_depth, increase min_samples_leaf, or use early stopping (XGBoost/LightGBM). 5) Collect more data or reduce feature dimensionality if possible.*

---

## **6.9 Further Reading**

**Books:**
- *The Elements of Statistical Learning* (Hastie et al.) - Chapters 3 (Linear), 7 (Model Selection), 9 (Additive Models), 10 (Boosting), 15 (Random Forests)
- *Pattern Recognition and Machine Learning* (Bishop) - Bayesian linear regression
- *Forecasting: Principles and Practice* (Hyndman & Athanasopoulos) - Time series focus

**Papers:**
- "XGBoost: A Scalable Tree Boosting System" (Chen & Guestrin, 2016)
- "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" (Ke et al., 2017)

---

## **6.10 Checkpoint Project: Automated Pricing Model**

Build an end-to-end price prediction system for an e-commerce platform.

**Dataset:** Product listings with features (category, brand, description, historical sales, competitor prices).

**Requirements:**

1. **Feature Engineering:**
   - NLP features from description (TF-IDF, length, sentiment)
   - Competitor price ratios and statistics
   - Category target encoding with smoothing
   - Time-based features (seasonality, trend)

2. **Model Comparison:**
   - Baseline: Mean price per category
   - Linear: Ridge with polynomial features (degree 2)
   - Tree: LightGBM with early stopping
   - Ensemble: Weighted average of top models

3. **Business Constraints:**
   - Model must be interpretable (feature importance for pricing team)
   - Prediction intervals (not just point estimates) - use Quantile Regression or bootstrapping
   - Max 100ms inference time (optimize with ONNX or lightweight model)

4. **Evaluation:**
   - MAPE primary metric (business cares about % error)
   - Bias analysis: Does model systematically underprice luxury items?
   - A/B test simulation: Would using model prices increase revenue?
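One possible way to produce the prediction intervals mentioned under Business Constraints is to fit one model per quantile. The sketch below uses scikit-learn's `GradientBoostingRegressor` with `loss="quantile"` on synthetic heteroscedastic data. The data, the 5%/95% quantiles, and the hyperparameters are assumptions, and coverage is measured on the training data only to keep the example short; a real evaluation should use held-out data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Illustrative data with noise that grows with the feature value
X = rng.uniform(0, 10, size=(500, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5 + 0.3 * X[:, 0])

def fit_quantile(q):
    """Fit one gradient-boosted model for the q-th conditional quantile."""
    model = GradientBoostingRegressor(
        loss="quantile", alpha=q, n_estimators=100, max_depth=2, random_state=0
    )
    return model.fit(X, y)

lower = fit_quantile(0.05).predict(X)
upper = fit_quantile(0.95).predict(X)

coverage = np.mean((y >= lower) & (y <= upper))
print(f"Training coverage of the nominal 90% interval: {coverage:.2f}")
```

Because the two quantile models are fitted independently, the lower and upper predictions can occasionally cross; sorting each pair is a common pragmatic fix.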

**Deliverables:**
- `pricing_model/` package with training and inference modules
- FastAPI endpoint `/predict` with input validation (Pydantic)
- Docker container with model artifact
- Report: "If we deployed this, expected revenue impact is X% with Y% risk"

---

**End of Chapter 6**

*You can now predict continuous values with statistical rigor. Chapter 7 will cover Classification, where we predict discrete categories - the most common ML task in industry.*

---