# Chapter 6 Linear Model Selection and Regularization

## Why Move Beyond Least Squares?

While least squares works well when the relationship between \( Y \) and the predictors is approximately linear and the number of observations (\( n \)) is much larger than the number of predictors (\( p \)), it can fail in certain scenarios:

### Prediction Accuracy Issues

- **Low Bias, Low Variance (Ideal Case)**:  
  If the true relationship is approximately linear and \( n \gg p \), least squares estimates are unbiased (accurate on average) and have low variance, leading to good predictions on new (test) data.

- **High Variance (Overfitting)**:  
  When \( n \) is not much larger than \( p \), least squares estimates can have high variance, meaning small changes in the training data lead to large changes in the model. This causes **overfitting**, where the model fits the training data too closely and performs poorly on new data.

- **No Unique Solution**:  
  When \( p > n \) (more predictors than observations), least squares fails because there are infinitely many solutions that perfectly fit the training data, but these solutions typically have extremely high variance and perform poorly on test data.

- **Solution**:  
  By **constraining** or **shrinking** the coefficients (e.g., making them smaller or setting some to zero), you can reduce variance at the cost of a small increase in bias, often leading to better predictions on new data.

### Model Interpretability Issues

- In multiple regression, some predictors may not be truly associated with the response, leading to unnecessarily complex models.

- Least squares rarely produces coefficient estimates that are exactly zero, so all predictors are included, even irrelevant ones.

- **Solution**:  
  Methods that perform **feature selection** (also called **variable selection**), that is, excluding irrelevant predictors by setting their coefficients to exactly zero, can simplify the model and make it easier to interpret.

## Three Classes of Methods to Improve Linear Models

Chapter 6 discusses three approaches to enhance the standard linear model beyond least squares, improving **prediction accuracy** and **model interpretability**:

1. **Subset Selection**  
   - **What it does**: Identifies a subset of the \( p \) predictors most related to the response and fits a least squares model using only those predictors.  
   - **How it works**: Uses methods like best subset selection or stepwise selection to evaluate predictor combinations.  
   - **Benefits**: Simplifies models, improves interpretability, reduces overfitting.  
   - **Challenges**: Computationally intensive for large \( p \).

2. **Shrinkage (Regularization)**  
   - **What it does**: Fits a model with all \( p \) predictors but shrinks coefficient estimates toward zero to reduce variance.  
   - **How it works**: Methods like Ridge Regression (shrinks coefficients) and Lasso (can set coefficients to zero) add a penalty term to the least squares objective.  
   - **Benefits**: Reduces variance, improves prediction accuracy; Lasso also performs variable selection.  
   - **Challenges**: Requires tuning the penalty parameter (\( \lambda \)).

3. **Dimension Reduction**  
   - **What it does**: Projects the \( p \) predictors onto an \( M \)-dimensional subspace, where \( M < p \), and uses these \( M \) projections as predictors in a least squares model.  
   - **How it works**: Methods like Principal Component Regression (PCR) or Partial Least Squares (PLS) create \( M \) linear combinations of predictors.  
   - **Benefits**: Reduces variance, handles multicollinearity.  
   - **Challenges**: Projections may be less interpretable; choosing \( M \) is critical.

## Subset Selection (Section 6.1)

Subset selection improves linear models by selecting a subset of \( p \) predictors to enhance interpretability and reduce overfitting.

### Best Subset Selection (Section 6.1.1)

- **What It Does**: Fits a least squares regression model for every possible combination of *p* predictors (2^*p* models) to find the best model.

- Example: there are C(*p*,1) = *p* models with exactly one predictor, C(*p*,2) = *p*(*p*-1)/2 models with exactly two predictors, and so on.

## Number of Models with 2 Predictors in Best Subset Selection

- **Expression**: C(*p*,2) = *p*(*p*-1)/2 represents the number of ways to choose exactly 2 predictors from *p* total predictors.
- **Meaning**: In best subset selection, this is the number of regression models that include exactly 2 predictors.
- **Formula Derivation**:
 - Binomial coefficient: C(*p*,2) = *p*!/[2!(*p*-2)!].
 - Simplifies to: *p* × (*p*-1)/2, as *p*! = *p* × (*p*-1) × (*p*-2)! and the (*p*-2)! terms cancel.
- **Examples**:
 - For *p* = 3: C(3,2) = 3 × 2/2 = 3 models (e.g., pairs {*X*₁, *X*₂}, {*X*₁, *X*₃}, {*X*₂, *X*₃}).
 - For *p* = 5: C(5,2) = 5 × 4/2 = 10 models.
- **Context**: Contributes to the total 2^*p* models evaluated in best subset selection, highlighting the computational burden as *p* increases.
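
The counts above can be checked numerically. This is a minimal sketch using the standard-library function `math.comb`; the helper name `models_by_size` is introduced here only for illustration.

```python
from math import comb

def models_by_size(p):
    """Number of candidate models with exactly k predictors, for k = 0..p."""
    return [comb(p, k) for k in range(p + 1)]

for p in (3, 5, 10, 20):
    counts = models_by_size(p)
    # The per-size counts add up to the 2^p models of best subset selection.
    print(f"p={p}: C(p,2)={counts[2]}, total={sum(counts)}")
```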

### Algorithm 6.1: Best Subset Selection
1. **Null Model (*M*₀)**: No predictors; predicts the sample mean.
2. **Fit Models by Subset Size**:
  - For *k* = 1, 2, ..., *p*:
    - Fit all C(*p*,*k*) models with *k* predictors.
    - Select the best model (*M*ₖ) with the smallest **RSS** or highest *R*².
3. **Select Best Model**: Choose one from *M*₀, *M*₁, ..., *M*ₚ using:
  - **Validation set error**, ***C*ₚ (AIC)**, **BIC**, or **adjusted *R*²** to estimate test error.
  - **Cross-validation**: Average validation errors across folds to select the best *k*, then fit *M*ₖ on the full training set.
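
Steps 1 and 2 of the algorithm can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the helper `ols_rss` (a Gram-Schmidt-based RSS computation that assumes linearly independent columns), the function `best_subset`, and the toy data are all assumptions made for this example. Step 3 (choosing among the *M*ₖ) is deliberately left out.

```python
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ols_rss(columns, y):
    """Training RSS of least squares on an intercept plus the given columns.

    Orthogonalizes the columns (Gram-Schmidt) and removes their projections
    from y. Assumes the columns are linearly independent.
    """
    basis = []
    for col in [[1.0] * len(y)] + [list(c) for c in columns]:
        v = list(col)
        for q in basis:
            coef = dot(v, q) / dot(q, q)
            v = [vi - coef * qi for vi, qi in zip(v, q)]
        basis.append(v)
    resid = list(y)
    for q in basis:
        coef = dot(resid, q) / dot(q, q)
        resid = [ri - coef * qi for ri, qi in zip(resid, q)]
    return dot(resid, resid)

def best_subset(X, y):
    """Return {k: (subset, rss)}: the lowest-RSS model of each size k = 0..p."""
    p = len(X)
    best = {0: ((), ols_rss([], y))}  # M_0: intercept only
    for k in range(1, p + 1):
        candidates = [(s, ols_rss([X[j] for j in s], y))
                      for s in combinations(range(p), k)]
        best[k] = min(candidates, key=lambda t: t[1])  # smallest training RSS
    return best

# Toy data (made up): y depends on x1 and x3 only, plus small noise.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [5, 3, 1, 8, 9, 2, 4, 7]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.15, -0.05]
y = [3 + 2 * a - c + e for a, c, e in zip(x1, x3, noise)]

best = best_subset([x1, x2, x3], y)
for k, (subset, rss) in best.items():
    print(k, subset, round(rss, 3))
```

The training RSS of *M*ₖ never increases with *k*, which is why the final choice among *M*₀, ..., *M*ₚ needs a criterion that estimates test error.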

## Explaining the Algorithm

### 1. Fit Models for Each Subset Size

**What It Means**: For each possible number of predictors (*k*), where *k* ranges from 1 to *p* (the total number of predictors):

- Fit a least squares regression model for every possible combination of exactly *k* predictors.
- The number of such combinations is given by the binomial coefficient C(*p*,*k*), which calculates how many ways you can choose *k* predictors out of *p* without regard to order.
- Mathematically, C(*p*,*k*) = *p*!/[*k*!(*p*-*k*)!], where *p*! is the factorial of *p*.

**Example**:
- If *p* = 3 predictors (e.g., *X*₁, *X*₂, *X*₃):
 - For *k* = 1: C(3,1) = 3 models (*X*₁, *X*₂, *X*₃).
 - For *k* = 2: C(3,2) = 3×2/(2×1) = 3 models ({*X*₁, *X*₂}, {*X*₁, *X*₃}, {*X*₂, *X*₃}).
 - For *k* = 3: C(3,3) = 1 model ({*X*₁, *X*₂, *X*₃}).
- Each model is fitted using least squares, meaning the coefficients are chosen to minimize the Residual Sum of Squares (RSS), the sum of squared differences between observed and predicted response values.

### 2. Select the Best Model (*M*ₖ)

**What It Means**: Among all C(*p*,*k*) models with exactly *k* predictors, choose the one with:

- The smallest RSS, which measures how well the model fits the training data (lower RSS indicates better fit).
- Equivalently, the highest *R*², where *R*² is the proportion of variance in the response explained by the predictors (higher *R*² indicates better fit).

This best model for *k* predictors is labeled *M*ₖ.

**Why RSS or *R*²?**: These metrics evaluate the model's fit on the training data. However, they are only used to select the best model for each *k*, not the final model (more on this below).

### 3. Reducing the Problem to *p* + 1 Models

**What It Means**: Steps 1 and 2 still fit all 2^*p* possible models (every subset of predictors, including the empty set), but they leave only *p* + 1 candidates for the final selection step:

- *M*₀: The null model with no predictors (predicts the sample mean).
- *M*₁: The best model with 1 predictor.
- *M*₂: The best model with 2 predictors.
- ...
- *M*ₚ: The full model with all *p* predictors (the only model of that size).

**How It Works**: By selecting the best model (*M*ₖ) for each subset size *k* = 0, 1, ..., *p*, the algorithm narrows down the search from 2^*p* models to just *p* + 1 models (one for each possible number of predictors, including the null model).

**Why This Helps**: Fitting all 2^*p* models is still required and is computationally expensive (e.g., *p* = 20 means over 1 million models), but only the *p* + 1 finalists have to be compared using a test-error criterion such as validation error or AIC (Step 3 of Algorithm 6.1).

### 4. Key Context

- This step is part of the first stage of best subset selection, which focuses on finding the best model for each subset size based on training data fit (RSS or *R*²).
- The next stage (Step 3 in Algorithm 6.1) selects the single best model from *M*₀, *M*₁, ..., *M*ₚ using criteria like validation set error, *C*ₚ (AIC), BIC, or adjusted *R*², which prioritize test error to avoid overfitting.
- Step 3 cannot rely on RSS or *R*²: these would always favor the model with all *p* predictors (adding predictors never worsens training fit), which often leads to overfitting. Within a fixed size *k*, all candidates have the same number of predictors, so comparing them by training RSS or *R*² is fair.

### Limitations
- **Computational Complexity**: 2^*p* models (e.g., *p* = 10: 1,024 models; *p* = 20: ~1 million; *p* = 40: ~1.1 trillion) make it infeasible for *p* > 40.
- **Branch-and-Bound**: Shortcuts that prune unlikely subsets, but:
 - Still impractical for large *p*.
 - Limited to least squares regression.
- **Alternatives**: Stepwise selection, shrinkage, and dimension reduction (discussed later) are more computationally efficient.

## Stepwise Selection



### 1. What is Forward Stepwise Selection?

**Purpose**: Forward stepwise selection is a method to build a linear regression model by adding predictors one at a time, starting from a model with no predictors. It aims to find a good subset of predictors while being less computationally intensive than best subset selection.

**Comparison to Best Subset Selection**:
- Best subset selection evaluates all 2^*p* possible combinations of *p* predictors, which becomes infeasible for large *p* (e.g., *p* = 40 yields 2^40 ≈ 1.1 trillion models).
- Forward stepwise selection considers a much smaller set of models by adding predictors sequentially, making it more practical for larger *p*.

### 2. How It Works

**Starting Point**: Begins with the null model (*M*₀), which contains no predictors and predicts the sample mean of the response for all observations.

**Step-by-Step Addition**:
- At each step, it adds one predictor that most improves the model's fit, based on a criterion like the smallest Residual Sum of Squares (RSS) or the highest *R*².
- This process continues until all *p* predictors are included in the model.

**Process**:
- Start with *M*₀ (no predictors).
- At step *k*, take the current model *M*ₖ and consider adding one more predictor from the remaining *p* - *k* predictors not yet included.
- Choose the addition that gives the best fit (lowest RSS or highest *R*²) to form *M*ₖ₊₁.
- Repeat until *M*ₚ (model with all *p* predictors).

### 3. Algorithm 6.2: Forward Stepwise Selection

The procedure is formalized as follows:

1. **Null Model (*M*₀)**: Start with a model containing no predictors.
2. **Iterative Addition**:
  - For *k* = 0 to *p* - 1 (i.e., until all predictors are added):
    - (a) Consider all *p* - *k* models that add one predictor to the current model *M*ₖ.
    - (b) Select the best model among these *p* - *k* options (based on smallest RSS or highest *R*²) and call it *M*ₖ₊₁.
  - This builds a sequence of models: *M*₀, *M*₁, ..., *M*ₚ, where each *M*ₖ₊₁ adds one predictor to *M*ₖ.
3. **Select the Best Model**: Choose the single best model from *M*₀, *M*₁, ..., *M*ₚ using:
  - Validation set error: Error on a separate validation dataset.
  - Information criteria: *C*ₚ (AIC), BIC, or adjusted *R*², which penalize complexity to avoid overfitting.
  - Cross-validation: Average prediction error across multiple training/validation splits to select the optimal *k*.
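
The greedy loop in Step 2 can be sketched as follows. The helper `ols_rss` (Gram-Schmidt-based, assuming linearly independent columns), the function name `forward_stepwise`, and the toy data are assumptions made for illustration; Step 3 is omitted.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ols_rss(columns, y):
    """Training RSS of least squares on an intercept plus the given columns."""
    basis = []
    for col in [[1.0] * len(y)] + [list(c) for c in columns]:
        v = list(col)
        for q in basis:
            coef = dot(v, q) / dot(q, q)
            v = [vi - coef * qi for vi, qi in zip(v, q)]
        basis.append(v)
    resid = list(y)
    for q in basis:
        coef = dot(resid, q) / dot(q, q)
        resid = [ri - coef * qi for ri, qi in zip(resid, q)]
    return dot(resid, resid)

def forward_stepwise(X, y):
    """Return the nested path [(subset, rss), ...] for M_0, M_1, ..., M_p."""
    p = len(X)
    current = []
    path = [((), ols_rss([], y))]  # M_0
    for k in range(p):
        remaining = [j for j in range(p) if j not in current]
        # (a) try each of the p - k one-predictor additions; (b) keep the best
        scored = [(j, ols_rss([X[i] for i in current + [j]], y))
                  for j in remaining]
        j_best, rss = min(scored, key=lambda t: t[1])
        current = current + [j_best]
        path.append((tuple(current), rss))
    return path

# Toy data (made up): y depends on x1 and x3 only, plus small noise.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [5, 3, 1, 8, 9, 2, 4, 7]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.15, -0.05]
y = [3 + 2 * a - c + e for a, c, e in zip(x1, x3, noise)]

path = forward_stepwise([x1, x2, x3], y)
for subset, rss in path:
    print(subset, round(rss, 3))
```

Each subset in the path contains the previous one, which is the constraint that distinguishes forward stepwise from best subset selection.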

### 4. Key Differences from Best Subset Selection

**Computational Efficiency**: Instead of evaluating 2^*p* models, forward stepwise selection evaluates at most 1 + *p* + (*p*-1) + ... + 1 = 1 + *p*(*p*+1)/2 models (the initial model plus the sum of remaining predictors at each step). For *p* = 10, this is 1 + 55 = 56 models, compared to 2^10 = 1,024 for best subset.
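
The model counts can be tabulated directly. The two function names below are made up for this sketch.

```python
def forward_stepwise_fits(p):
    """Null model plus the p - k candidate fits at each step k = 0..p-1."""
    return 1 + sum(p - k for k in range(p))

def best_subset_fits(p):
    """Every subset of the p predictors, including the empty one."""
    return 2 ** p

for p in (10, 20, 40):
    print(p, forward_stepwise_fits(p), best_subset_fits(p))
```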

**Greedy Approach**: It adds the best predictor at each step without reconsidering previous choices, which may miss the globally optimal subset but is much faster.

**Model Sequence**: Produces *p* + 1 models (*M*₀ to *M*ₚ), similar to best subset, but the path is constrained by the stepwise addition.


### 1. What is Backward Stepwise Selection?

**Purpose**: Backward stepwise selection is a method to simplify a linear regression model by starting with all *p* predictors and iteratively removing the least useful one until no predictors remain. It's an alternative to best subset selection, which evaluates all 2^*p* possible subsets.

**Comparison to Forward Stepwise Selection**:
- Forward Stepwise: Starts with no predictors (*M*₀) and adds one predictor at a time.
- Backward Stepwise: Starts with all predictors (*M*ₚ) and removes one predictor at a time.
- Both are more efficient than best subset selection and aim to find a good subset of predictors.

### 2. How It Works

**Starting Point**: Begins with the full model (*M*ₚ), which includes all *p* predictors and is fitted using least squares.

**Step-by-Step Removal**:
- At each step, it considers removing one predictor and selects the removal that least worsens the model fit (based on smallest RSS or highest *R*²).
- This process continues, reducing the number of predictors by one each time, until the null model (*M*₀) with no predictors is reached.

**Process**:
- Start with *M*ₚ (all *p* predictors).
- At step *k* (where *k* is the current number of predictors), consider removing one of the *k* predictors.
- Choose the model with the best fit among these *k* options to form *M*ₖ₋₁.
- Repeat until *k* = 1, resulting in *M*₀.

### 3. Algorithm 6.3: Backward Stepwise Selection

The procedure is formalized as follows:

1. **Full Model (*M*ₚ)**: Start with the model containing all *p* predictors.
2. **Iterative Removal**:
  - For *k* = *p* down to 1 (i.e., until no predictors remain):
    - (a) Consider all *k* models that remove one predictor from the current model *M*ₖ, resulting in *k* - 1 predictors.
    - (b) Select the best model among these *k* options (based on smallest RSS or highest *R*²) and call it *M*ₖ₋₁.
  - This builds a sequence of models: *M*ₚ, *M*ₚ₋₁, ..., *M*₀, where each *M*ₖ₋₁ removes one predictor from *M*ₖ.
3. **Select the Best Model**: Choose the single best model from *M*₀, *M*₁, ..., *M*ₚ using:
  - Validation set error: Error on a separate validation dataset.
  - Information criteria: *C*ₚ (AIC), BIC, or adjusted *R*², which penalize complexity to avoid overfitting.
  - Cross-validation: Average prediction error across multiple training/validation splits to select the optimal *k*.
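
A matching sketch of the removal loop in Step 2 follows. As before, `ols_rss`, `backward_stepwise`, and the toy data are illustrative assumptions; the example uses *n* = 8 observations and *p* = 3 predictors so that the full model has a unique least squares fit.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ols_rss(columns, y):
    """Training RSS of least squares on an intercept plus the given columns."""
    basis = []
    for col in [[1.0] * len(y)] + [list(c) for c in columns]:
        v = list(col)
        for q in basis:
            coef = dot(v, q) / dot(q, q)
            v = [vi - coef * qi for vi, qi in zip(v, q)]
        basis.append(v)
    resid = list(y)
    for q in basis:
        coef = dot(resid, q) / dot(q, q)
        resid = [ri - coef * qi for ri, qi in zip(resid, q)]
    return dot(resid, resid)

def backward_stepwise(X, y):
    """Return the nested path [(subset, rss), ...] for M_p, M_{p-1}, ..., M_0."""
    p = len(X)
    current = list(range(p))
    path = [(tuple(current), ols_rss([X[j] for j in current], y))]  # M_p
    for k in range(p, 0, -1):
        # (a) try each of the k one-predictor removals; (b) keep the best
        scored = []
        for j in current:
            kept = [i for i in current if i != j]
            scored.append((kept, ols_rss([X[i] for i in kept], y)))
        current, rss = min(scored, key=lambda t: t[1])
        path.append((tuple(current), rss))
    return path

# Toy data (made up): y depends on x1 and x3 only, plus small noise.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [5, 3, 1, 8, 9, 2, 4, 7]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.15, -0.05]
y = [3 + 2 * a - c + e for a, c, e in zip(x1, x3, noise)]

path = backward_stepwise([x1, x2, x3], y)
for subset, rss in path:
    print(subset, round(rss, 3))
```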

### 4. Key Features and Comparisons

**Computational Efficiency**: Like forward stepwise, backward stepwise fits 1 + *p*(*p*+1)/2 models:
- The full model *M*ₚ (1 model), plus *p* + (*p*-1) + ... + 1 candidate models across the removal steps (*k* candidates at the step that starts from *M*ₖ).
- For *p* = 10, this is 1 + 55 = 56 models, compared to 2^10 = 1,024 for best subset.

**Non-Optimal Nature**: Neither forward nor backward stepwise guarantees the globally best subset (the one with the lowest test error among all 2^*p* subsets), as they make greedy decisions (adding or removing one predictor at a time).

**Requirement on *n* and *p***:
- Backward Stepwise: Requires *n* > *p* (number of observations *n* must exceed predictors *p*) because it starts with the full model, which needs a unique least squares solution.
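- Forward Stepwise: Can be used even when *n* < *p*, because it starts with no predictors and builds up, making it viable for high-dimensional data (e.g., *p* very large). In that case only the submodels *M*₀, ..., *M*ₙ₋₁ can be constructed, since least squares has no unique solution once the number of predictors reaches *n*.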

**Practical Use**: Backward stepwise is preferred when *n* > *p* and you want to start with a full model to assess which predictors are least important.

## 6.1.3 Choosing the Optimal Model

### *C*ₚ, AIC, BIC, and Adjusted *R*²

In linear regression, we fit models to training data using least squares, which minimizes the Residual Sum of Squares (RSS). However, RSS (and thus training Mean Squared Error, MSE = RSS/n) never increases, and typically decreases, as more predictors are added, even if those predictors are irrelevant (noise). This makes training RSS or *R*² unsuitable for selecting among models of different sizes, as they don't account for overfitting. The test MSE, which measures generalization to new data, is what we care about, but we typically don't have access to it during model selection.

*C*ₚ, AIC, BIC, and Adjusted *R*² are techniques that adjust the training error to estimate test error, helping us select models that balance fit (low RSS) and complexity (fewer predictors) to avoid overfitting.

### 1. *C*ₚ Statistic

**Purpose**: Estimates the test MSE by adjusting the training RSS for model complexity.

**Formula**:
*C*ₚ = 1/*n* (RSS + 2*d*σ̂²)

where:
- *n*: Number of observations.
- RSS: Residual Sum of Squares for the model with *d* predictors.
- *d*: Number of predictors in the model.
- σ̂²: Estimate of the error variance, typically computed from the full model (including all predictors).

**Intuition**:
- The term 2*d*σ̂² is a penalty that increases with the number of predictors (*d*), accounting for the fact that adding predictors reduces training RSS but may not improve test MSE.
- If σ̂² is unbiased, *C*ₚ is an unbiased estimate of the test MSE.
- Model Selection: Choose the model with the lowest *C*ₚ, as it indicates the best balance of fit and complexity.

**Example**: For the Credit dataset, *C*ₚ selects a 6-variable model (income, limit, rating, cards, age, student).

**Key Insight**: *C*ₚ penalizes model complexity linearly with the number of predictors.
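
A small numerical sketch of the *C*ₚ computation follows. The RSS values are made up for illustration, and σ̂² is estimated as RSS/(*n* - *p* - 1) from the full model, which is one common choice assumed here rather than something the text above prescribes.

```python
def mallows_cp(rss, d, n, sigma2_hat):
    """C_p = (RSS + 2 * d * sigma2_hat) / n."""
    return (rss + 2 * d * sigma2_hat) / n

# Hypothetical training RSS for M_0..M_4 (made-up numbers, n = 100).
n = 100
rss_by_size = {0: 500.0, 1: 300.0, 2: 220.0, 3: 212.0, 4: 211.0}
p = max(rss_by_size)
sigma2_hat = rss_by_size[p] / (n - p - 1)  # estimated from the full model

cp = {d: mallows_cp(rss, d, n, sigma2_hat) for d, rss in rss_by_size.items()}
best_by_rss = min(rss_by_size, key=rss_by_size.get)
best_by_cp = min(cp, key=cp.get)
print(best_by_rss, best_by_cp)
```

With these made-up numbers, training RSS alone picks the full model, while *C*ₚ prefers the 3-predictor model because the fourth predictor lowers RSS by less than the penalty it adds.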

### 2. Akaike Information Criterion (AIC)

**Purpose**: Estimates the test error for models fit by maximum likelihood, adjusting for model complexity.

**Formula (for linear regression with Gaussian errors)**:
AIC = 1/*n* (RSS + 2*d*σ̂²)

Note: With irrelevant constants omitted, this expression coincides with *C*ₚ; in general, AIC and *C*ₚ are proportional for least squares linear regression.

For general models, AIC is defined as:
AIC = -2 log(likelihood) + 2*d*

In linear regression with Gaussian errors, maximum likelihood and least squares are equivalent, leading to the above formula.
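
The link between the general definition and the RSS-based formula can be sketched as follows. This assumes the error variance $\sigma^2$ is treated as known (in practice it is replaced by $\hat{\sigma}^2$), that the likelihood is evaluated at the least squares fit, and that constants not depending on the model are dropped:

$$
-2\log L = n\log(2\pi\sigma^2) + \frac{\text{RSS}}{\sigma^2}
\quad\Longrightarrow\quad
\text{AIC} = \frac{\text{RSS}}{\sigma^2} + 2d + \text{const}
= \frac{1}{\sigma^2}\left(\text{RSS} + 2d\sigma^2\right) + \text{const}
$$

Apart from the constant and the positive factor $n/\sigma^2$, the right-hand side is $\frac{1}{n}(\text{RSS} + 2d\sigma^2)$, so the two criteria rank models in the same order.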

**Intuition**:
- Like *C*ₚ, AIC adds a penalty (2*d*σ̂²) to the RSS to adjust for overfitting.
- A smaller AIC indicates a model with lower estimated test error.
- Model Selection: Choose the model with the lowest AIC.

**Example**: In the Credit dataset, AIC is proportional to *C*ₚ, so it also selects the 6-variable model.

**Key Insight**: For linear regression, AIC and *C*ₚ are essentially the same, but AIC is more general and applies to other model types fit by maximum likelihood.

### 3. Bayesian Information Criterion (BIC)

**Purpose**: Similar to AIC, but derived from a Bayesian perspective, it estimates test error with a stronger penalty for model complexity.

**Formula (for linear regression)**:
BIC = 1/*n* (RSS + log(*n*)*d*σ̂²)

where:
- log(*n*): Natural logarithm of the number of observations.

**Intuition**:
- BIC is similar to *C*ₚ and AIC but uses a penalty of log(*n*)*d*σ̂² instead of 2*d*σ̂².
- Since log(*n*) > 2 for *n* > 7 (which is typical in most datasets), BIC penalizes complex models (more predictors) more heavily than *C*ₚ or AIC.
- As a result, BIC tends to select simpler models (fewer predictors) than *C*ₚ or AIC.
- Model Selection: Choose the model with the lowest BIC.

**Example**: For the Credit dataset, BIC selects a 4-variable model (income, limit, cards, student), which is simpler than the 6-variable model chosen by *C*ₚ and AIC.

**Key Insight**: BIC's heavier penalty makes it more conservative, favoring smaller models, especially when *n* is large.
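
The effect of the heavier penalty can be illustrated with made-up numbers. The RSS values, the sample size, and the σ̂² estimate (RSS/(*n* - *p* - 1) from the full model) are assumptions chosen for this sketch.

```python
from math import log

def cp_score(rss, d, n, sigma2_hat):
    return (rss + 2 * d * sigma2_hat) / n

def bic_score(rss, d, n, sigma2_hat):
    return (rss + log(n) * d * sigma2_hat) / n

# log(n) crosses 2 between n = 7 and n = 8.
for n_obs in (7, 8, 100):
    print(n_obs, round(log(n_obs), 3), log(n_obs) > 2)

# Hypothetical training RSS for M_0..M_4 (made-up numbers, n = 100).
n = 100
rss_by_size = {0: 500.0, 1: 300.0, 2: 220.0, 3: 212.0, 4: 211.0}
p = max(rss_by_size)
sigma2_hat = rss_by_size[p] / (n - p - 1)

best_by_cp = min(rss_by_size, key=lambda d: cp_score(rss_by_size[d], d, n, sigma2_hat))
best_by_bic = min(rss_by_size, key=lambda d: bic_score(rss_by_size[d], d, n, sigma2_hat))
print(best_by_cp, best_by_bic)
```

In this example *C*ₚ selects 3 predictors and BIC selects 2, matching the tendency described above.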

### 4. Adjusted *R*²

**Purpose**: Adjusts the standard *R*² to account for model complexity, penalizing the inclusion of unnecessary predictors.

**Formula**:
Adjusted *R*² = 1 - [RSS/(*n* - *d* - 1)] / [TSS/(*n* - 1)]

where:
- TSS = Σ(*yᵢ* - *ȳ*)²: Total Sum of Squares, measuring the total variance in the response.
- *n* - *d* - 1: Degrees of freedom for the residuals (accounts for the number of predictors).

**Intuition**:
- Standard *R*² = 1 - RSS/TSS always increases as more predictors are added because RSS decreases.
- Adjusted *R*² modifies this by dividing RSS and TSS by their respective degrees of freedom (*n* - *d* - 1 and *n* - 1).
- The term RSS/(*n* - *d* - 1) may increase if adding a predictor (increasing *d*) does not sufficiently reduce RSS, as the denominator shrinks.
- This penalizes the inclusion of "noise" variables that don't meaningfully improve the model.
- Model Selection: Choose the model with the highest Adjusted *R*², as it indicates the best balance of fit and complexity.

**Example**: For the Credit dataset, Adjusted *R*² selects a 7-variable model (adding "own" to the 6-variable model chosen by *C*ₚ and AIC).

**Key Insight**: Unlike *C*ₚ, AIC, and BIC (which aim to minimize estimated test error), Adjusted *R*² maximizes an adjusted measure of explained variance. It's intuitive but less theoretically rigorous.
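
The contrast between *R*² and Adjusted *R*² can be seen on a small example. All numbers below are hypothetical.

```python
def r2(rss, tss):
    return 1 - rss / tss

def adjusted_r2(rss, tss, n, d):
    """1 - [RSS / (n - d - 1)] / [TSS / (n - 1)]."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

# Hypothetical values (made up): n = 20 observations, TSS = 500.
n, tss = 20, 500.0
rss_by_size = {0: 500.0, 1: 300.0, 2: 220.0, 3: 212.0, 4: 211.0}

plain = {d: r2(rss, tss) for d, rss in rss_by_size.items()}
adjusted = {d: adjusted_r2(rss, tss, n, d) for d, rss in rss_by_size.items()}
best_plain = max(plain, key=plain.get)
best_adjusted = max(adjusted, key=adjusted.get)
print(best_plain, best_adjusted)
```

Here plain *R*² is highest for the largest model, while Adjusted *R*² peaks at 2 predictors because the small RSS reductions from the third and fourth predictors do not offset the loss of residual degrees of freedom.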

### Validation and Cross-Validation

**Purpose**:
- Validation and cross-validation directly estimate the test error for each candidate model to select the one with the smallest estimated test error.
- Used as an alternative to metrics like *C*ₚ, AIC, BIC, and Adjusted *R*² for model selection.

**Key Concepts**:
- Validation Set Approach: Split data into training and validation sets. Fit models on the training set, compute the error (e.g., MSE) on the validation set, and select the model with the lowest validation error.
- Cross-Validation: Extends the validation approach by dividing data into multiple folds (*K*-fold cross-validation, written with an upper-case *K* here to distinguish it from the model size *k*). For each fold, train on the other *K*-1 folds and test on the held-out fold, then average the errors across folds to estimate test error.

**Procedure**:
- Generate candidate models (e.g., using best subset selection, forward selection, or backward selection) with different numbers of predictors.
- For each model size *k*:
 - In validation, compute the error on the validation set.
 - In cross-validation, compute the error for each fold (where the best subset of size *k*, denoted *M*ₖ, may differ across folds) and average the errors over all folds.
- Select the model size *k* with the lowest average validation or cross-validation error.
- Fit the final model of size *k* on the full dataset to obtain the best model.
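
The procedure can be sketched for the two smallest model sizes (0 and 1 predictors), which keeps the fitting code to closed-form simple regression. The function names, the deterministic fold assignment (`i % n_folds`), and the data are all assumptions made for this example; a real analysis would typically shuffle the observations and consider every model size.

```python
from statistics import mean

def fit_simple(x, y):
    """Least squares intercept and slope for a single predictor."""
    xb, yb = mean(x), mean(y)
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return yb - slope * xb, slope

def cv_error_by_size(X, y, n_folds=4):
    """Cross-validated MSE for size 0 (null model) and size 1 (best single predictor)."""
    n = len(y)
    errors = {0: [], 1: []}
    for fold in range(n_folds):
        test = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        y_tr = [y[i] for i in train]
        # Size 0: predict the training-fold mean.
        y_bar = mean(y_tr)
        errors[0].append(mean((y[i] - y_bar) ** 2 for i in test))
        # Size 1: pick the best predictor using the training fold only,
        # so the selected predictor may differ from fold to fold.
        fits = []
        for col in X:
            x_tr = [col[i] for i in train]
            b0, b1 = fit_simple(x_tr, y_tr)
            rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x_tr, y_tr))
            fits.append((rss, b0, b1, col))
        _, b0, b1, col = min(fits, key=lambda t: t[0])
        errors[1].append(mean((y[i] - b0 - b1 * col[i]) ** 2 for i in test))
    return {k: mean(v) for k, v in errors.items()}

# Made-up data: y depends linearly on x_a; x_b is unrelated.
x_a = list(range(1, 13))
x_b = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
noise = [0.3, -0.1, 0.2, -0.4, 0.1, 0.0, -0.2, 0.4, -0.3, 0.1, -0.1, 0.0]
y = [1 + 2 * a + e for a, e in zip(x_a, noise)]

cv = cv_error_by_size([x_a, x_b], y)
best_size = min(cv, key=cv.get)
print(cv, best_size)
```

After choosing `best_size`, the model of that size would be refit on all observations, as in the last step of the procedure.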

**Advantages Over *C*ₚ, AIC, BIC, and Adjusted *R*²**:
- Direct Estimate: Provides a direct estimate of test error without relying on assumptions about the true underlying model.
- Fewer Assumptions: Does not require estimating the error variance (σ²) or pinpointing model degrees of freedom (e.g., number of predictors).
- Flexibility: Applicable to a wide range of models, including those where degrees of freedom or error variance are hard to define (e.g., non-linear models, machine learning methods).

## Shrinkage Methods (Section 6.2)

### Overview
- Shrinkage methods are an alternative to subset selection (e.g., best subset, forward/backward selection) for linear regression.
- Instead of selecting a subset of predictors, these methods fit a model with all $p$ predictors but constrain or regularize the coefficient estimates, shrinking them toward zero.
- Shrinking coefficients reduces their variance, which can improve model performance by reducing overfitting.

### Why Shrink Coefficients?
- Least squares regression (unconstrained) can produce high-variance coefficient estimates, especially when predictors are correlated or the number of predictors ($p$) is large.
- Shrinking coefficients reduces this variance, trading off some bias for better prediction accuracy on test data.

### Two Main Techniques
- **Ridge Regression**: Shrinks coefficients using an L2 penalty (sum of squared coefficients).
- **Lasso**: Shrinks coefficients using an L1 penalty (sum of absolute coefficients), which can also set some coefficients to exactly zero (Section 6.2.2).

---

## Ridge Regression (Section 6.2.1)

### Definition
Ridge regression is similar to least squares but modifies the objective function by adding a shrinkage penalty to constrain coefficient estimates.

It estimates coefficients $\hat{\beta}^R = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p)$ by minimizing:

$$\text{RSS} + \lambda \sum_{j=1}^p \beta_j^2 = \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2$$

Where:
- **RSS**: Residual Sum of Squares, measuring model fit (same as least squares).
- $\lambda \sum_{j=1}^p \beta_j^2$: L2 penalty (shrinkage penalty), which is small when coefficients $\beta_1, \dots, \beta_p$ are close to zero.
- $\lambda \geq 0$: Tuning parameter, controlling the strength of the penalty.

### Key Components

#### Tuning Parameter ($\lambda$)
- Controls the trade-off between fitting the data (minimizing RSS) and shrinking coefficients (minimizing the penalty term).
- $\lambda = 0$: No penalty, ridge regression = least squares estimates.
- As $\lambda \to \infty$: Penalty dominates, coefficients $\hat{\beta}_j^R \to 0$ (except the intercept).

#### Shrinkage Penalty
- The term $\lambda \sum_{j=1}^p \beta_j^2$ ($\lambda$ times the squared L2 norm of the coefficient vector) penalizes large coefficients, shrinking them toward zero but not exactly to zero.
- Encourages smaller, more stable coefficient estimates, reducing variance.

#### Intercept ($\beta_0$)
- The penalty does not apply to the intercept $\beta_0$, as it represents the mean response when all predictors are zero (no need to shrink).
- If predictors are centered (mean = 0), the estimated intercept is $\hat{\beta}_0 = \bar{y}$.

### How It Works
Ridge regression balances two goals:
1. **Fit the data well**: Minimize RSS (like least squares).
2. **Keep coefficients small**: Minimize $\sum \beta_j^2$ to reduce model complexity and variance.

- Produces a different set of coefficient estimates $\hat{\beta}_\lambda^R$ for each value of $\lambda$.
- Larger $\lambda$ values increase shrinkage, leading to smaller coefficients and more bias but less variance.
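
For a single predictor the ridge solution has a simple closed form, which makes the effect of $\lambda$ easy to see. The sketch below assumes one predictor and an unpenalized intercept; the function name and data are made up for illustration.

```python
from statistics import mean

def ridge_one_predictor(x, y, lam):
    """Return (intercept, slope) minimizing RSS + lam * slope**2.

    With one predictor, slope = Sxy / (Sxx + lam), where Sxx and Sxy are the
    centered sums of squares and cross-products; the intercept is unpenalized.
    """
    xb, yb = mean(x), mean(y)
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    slope = sxy / (sxx + lam)
    return yb - slope * xb, slope

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

for lam in (0.0, 1.0, 10.0, 100.0, 1e6):
    intercept, slope = ridge_one_predictor(x, y, lam)
    print(lam, round(intercept, 4), round(slope, 6))
```

At `lam = 0.0` the slope equals the least squares slope; as `lam` grows the slope moves toward zero without reaching it, and the intercept approaches the mean of `y`.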

### Why It Improves Fit
- Shrinking coefficients reduces the model's sensitivity to noise in the training data, especially when predictors are highly correlated (multicollinearity) or when $p$ is large.
- This bias-variance trade-off often leads to lower test error compared to least squares, even though the model is biased (coefficients are not exactly the true values).

### Selecting $\lambda$
- The choice of $\lambda$ is critical and is typically determined using cross-validation (discussed in Section 6.2.3).
- Cross-validation estimates the test error for different $\lambda$ values, selecting the one that minimizes the estimated test error.

### Assumptions and Preprocessing
- Predictors are often centered (mean subtracted) before applying ridge regression, in which case the unpenalized intercept estimate is simply $\hat{\beta}_0 = \bar{y}$. Centering alone does not change the scale of the predictors.
- Ridge coefficient estimates are not scale equivariant: because the penalty depends on the size of each $\beta_j$, rescaling a predictor (e.g., changing its units) changes the fit. In practice, predictors are therefore standardized (standard deviation = 1) so that the penalty affects all coefficients fairly.
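
A small sketch of why standardization matters: the same predictor expressed in two different units gives different ridge fits unless it is standardized first. The one-predictor setup, the function names, and the income/balance numbers are all hypothetical.

```python
from statistics import mean, pstdev

def standardize(x):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    m, s = mean(x), pstdev(x)
    return [(xi - m) / s for xi in x]

def ridge_fitted(x, y, lam):
    """Fitted values of a one-predictor ridge fit with an unpenalized intercept."""
    xb, yb = mean(x), mean(y)
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    slope = sxy / (sxx + lam)
    return [yb + slope * (xi - xb) for xi in x]

def max_gap(a, b):
    return max(abs(ai - bi) for ai, bi in zip(a, b))

# Hypothetical data: the same incomes in dollars and in thousands of dollars.
income_dollars = [30_000, 45_000, 52_000, 61_000, 80_000]
income_thousands = [v / 1000 for v in income_dollars]
balance = [200, 350, 400, 520, 700]
lam = 100.0

raw_gap = max_gap(ridge_fitted(income_dollars, balance, lam),
                  ridge_fitted(income_thousands, balance, lam))
std_gap = max_gap(ridge_fitted(standardize(income_dollars), balance, lam),
                  ridge_fitted(standardize(income_thousands), balance, lam))
print(raw_gap, std_gap)
```

Without standardization the fitted values depend on the chosen units (`raw_gap` is clearly nonzero); after standardization the two fits agree up to floating-point error.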

### Key Insight
- Ridge regression improves over least squares by reducing coefficient variance, making it particularly effective when dealing with multicollinearity or high-dimensional data.
- Unlike subset selection, it keeps all predictors in the model but shrinks their impact.

## 6.2.2 The Lasso

### Overview
- The lasso (Least Absolute Shrinkage and Selection Operator) is an alternative to ridge regression that addresses its key limitation: ridge regression includes all $p$ predictors in the model, which can complicate interpretation when $p$ is large.
- Like ridge regression, the lasso shrinks coefficient estimates toward zero, but it can also set some coefficients exactly to zero, performing variable selection and producing sparse models (models with fewer predictors).

### Disadvantage of Ridge Regression
- Ridge regression shrinks coefficients using an L2 penalty ($\lambda \sum \beta_j^2$), but never sets them to zero, so all $p$ predictors remain in the model.
- This can make interpretation challenging, especially in high-dimensional settings (e.g., the Credit dataset with predictors like income, limit, rating, student, etc.).
- **Example**: In the Credit dataset, ridge regression always includes all 10 predictors, even if only a few (e.g., income, limit, rating, student) are most important.

### Lasso Objective
The lasso estimates coefficients $\hat{\beta}^L = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p)$ by minimizing:

$$\text{RSS} + \lambda \sum_{j=1}^p |\beta_j| = \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j|$$

Where:
- **RSS**: Residual Sum of Squares, measuring model fit (same as least squares).
- $\lambda \sum_{j=1}^p |\beta_j|$: L1 penalty (sum of absolute values of coefficients), encouraging sparsity.
- $\lambda \geq 0$: Tuning parameter controlling the strength of the penalty.

### Key Difference from Ridge Regression
- Ridge regression uses an L2 penalty ($\lambda \sum \beta_j^2$), which shrinks coefficients but never sets them to zero.
- The lasso uses an L1 penalty ($\lambda \sum |\beta_j|$), which can force some coefficients to exactly zero when $\lambda$ is large enough, effectively excluding those predictors from the model.
- The L1 norm ($\|\beta\|_1 = \sum |\beta_j|$) promotes sparsity, unlike the L2 norm ($\|\beta\|_2 = \sqrt{\sum \beta_j^2}$).
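
With a single predictor both estimators have closed forms, which makes the difference concrete. The sketch assumes one predictor and an unpenalized intercept (so only the slope is shown); the function names and data are made up.

```python
from statistics import mean

def centered_sums(x, y):
    xb, yb = mean(x), mean(y)
    sxx = sum((xi - xb) ** 2 for xi in x)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    return sxx, sxy

def ridge_slope(x, y, lam):
    """Minimizer of RSS + lam * b**2: proportional shrinkage."""
    sxx, sxy = centered_sums(x, y)
    return sxy / (sxx + lam)

def lasso_slope(x, y, lam):
    """Minimizer of RSS + lam * |b|: soft-thresholding of Sxy by lam / 2."""
    sxx, sxy = centered_sums(x, y)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

for lam in (0.0, 10.0, 40.0, 100.0):
    print(lam, round(ridge_slope(x, y, lam), 4), round(lasso_slope(x, y, lam), 4))
```

The ridge slope shrinks but stays nonzero for every finite `lam`, whereas the lasso slope becomes exactly zero once `lam / 2` exceeds the absolute cross-product.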

### Key Features

#### Variable Selection
- The lasso performs automatic variable selection, producing models with only a subset of predictors, similar to best subset selection but computationally more efficient.

#### Sparse Models
- Models with fewer non-zero coefficients are easier to interpret, especially when $p$ is large.

#### Tuning Parameter ($\lambda$)
- $\lambda = 0$: Lasso = least squares (no penalty, all predictors included).
- **Large $\lambda$**: Many coefficients are set to zero, producing a sparse model (possibly the null model with all $\beta_j = 0$).
- Selecting $\lambda$ is critical and is typically done via cross-validation (discussed in Section 6.2.3).

### Example: Credit Dataset
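In Figure 6.6, moving from large $\lambda$ to small $\lambda$:
- **When $\lambda$ is sufficiently large**, all coefficient estimates are zero (the null model).
- **As $\lambda$ decreases**, the lasso first yields a model containing only rating, then student and limit enter, followed by income, and eventually the remaining predictors.
- **At $\lambda = 0$**, the lasso gives the least squares fit (all predictors included).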
- The number of predictors depends on $\lambda$, unlike ridge regression, which always includes all predictors.
- This flexibility allows the lasso to produce models with any number of variables, improving interpretability.

## Comparing the Lasso and Ridge Regression

### Key Advantage of Lasso
- The lasso produces simpler and more interpretable models by performing variable selection, setting some coefficients to exactly zero, resulting in sparse models with fewer predictors.
- Ridge regression includes all $p$ predictors, shrinking their coefficients toward zero but never excluding them, which can complicate interpretation when $p$ is large.

### Prediction Accuracy
The lasso and ridge regression exhibit similar behavior in terms of the bias-variance trade-off:
- As the tuning parameter $\lambda$ increases, variance decreases (coefficients shrink, reducing model sensitivity to noise) and bias increases (coefficients deviate from true values).
- This is illustrated in Figure 6.8, where the lasso's variance, squared bias, and test Mean Squared Error (MSE) are plotted against training $R^2$, compared to ridge regression (dotted lines).

### When Each Method Performs Better

#### Lasso
- Performs better when the true model is sparse, i.e., only a small number of predictors have substantial coefficients, and the rest are zero or nearly zero.
- By setting irrelevant coefficients to zero, the lasso reduces model complexity and improves prediction accuracy in such settings.

#### Ridge Regression
- Performs better when the response is a function of many predictors with coefficients of roughly equal size (non-sparse model).
- Shrinking all coefficients without excluding any is more effective when all predictors contribute to the response.

### Key Insight
- Neither method universally dominates; the best choice depends on the true underlying model:
  - Lasso excels in sparse settings (few important predictors).
  - Ridge excels in dense settings (many predictors with similar importance).
- In real datasets, the number of relevant predictors is unknown, so cross-validation is used to compare lasso and ridge regression and select the method (and $\lambda$) with the lowest estimated test error.

### Bias-Variance Trade-Off
- Like ridge regression, the lasso reduces variance compared to least squares (which can have high variance, especially with correlated predictors or large $p$) at the cost of a small increase in bias.
- This trade-off often leads to better prediction accuracy than least squares for both methods.
- The lasso's ability to perform variable selection makes its models easier to interpret compared to ridge regression.

### Practical Considerations
- Use cross-validation to choose between lasso and ridge regression and to tune $\lambda$ for each method.
- In practice, the lasso is preferred when interpretability or sparsity is desired, while ridge regression is better for handling multicollinearity or when all predictors are relevant.


## Selecting the Tuning Parameter

### Overview
- Both ridge regression and the lasso require selecting the tuning parameter $\lambda$ (or equivalently, the constraint $s$ in the alternative formulations (6.8) and (6.9)).
- $\lambda$ controls the strength of the shrinkage penalty:
  - **Small $\lambda$**: Less shrinkage, closer to least squares.
  - **Large $\lambda$**: More shrinkage, coefficients closer to zero (lasso may set some to exactly zero).
- The goal is to choose the $\lambda$ that minimizes the test error to optimize prediction accuracy.

### Method: Cross-Validation
Cross-validation (described in Chapter 5) is used to estimate the test error for different $\lambda$ values.

#### Steps:
1. **Define a grid of $\lambda$ values** (e.g., a range from small to large).
2. **For each $\lambda$**:
   - Perform $k$-fold cross-validation (e.g., 10-fold) or leave-one-out cross-validation (LOOCV).
   - Compute the cross-validation error (e.g., mean squared error, MSE) by averaging the errors across all folds.
3. **Select the $\lambda$** with the smallest cross-validation error.
4. **Re-fit the model** using the selected $\lambda$ and all available data to obtain the final coefficient estimates.
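
The four steps above can be sketched for ridge regression with NumPy alone. This is a minimal illustration under stated assumptions: the data are simulated, the grid limits are arbitrary, and `ridge_fit` and `cv_error` are hypothetical helper names. Predictors are only centered here because the simulated columns share a common scale; with real data they would normally be standardized first.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients for centered X and y: (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, k=10, seed=0):
    """Average held-out MSE over k folds for one value of lam."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        # Center with training-fold statistics only, to avoid leakage.
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        beta = ridge_fit(X[train] - x_mean, y[train] - y_mean, lam)
        pred = y_mean + (X[fold] - x_mean) @ beta
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

# Simulated data (an assumption): many predictors with modest effects.
rng = np.random.default_rng(1)
n, p = 60, 30
X = rng.normal(size=(n, p))
y = X @ rng.normal(scale=0.5, size=p) + rng.normal(size=n)

lambdas = np.logspace(-2, 3, 30)                          # step 1: grid
cv = np.array([cv_error(X, y, lam) for lam in lambdas])   # step 2: CV error
best_lam = lambdas[np.argmin(cv)]                         # step 3: select

# Step 4: refit on all available data with the selected lambda.
x_mean, y_mean = X.mean(axis=0), y.mean()
beta_final = ridge_fit(X - x_mean, y - y_mean, best_lam)
print(f"selected lambda = {best_lam:.3g}, CV MSE = {cv.min():.3f}")
```

Fixing the seed inside `cv_error` means every $\lambda$ is evaluated on the same folds, so differences in CV error reflect $\lambda$ rather than the random split.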

## 6.3 Dimension Reduction Methods

Dimension reduction methods simplify a linear regression problem by transforming the original predictors (features) into a smaller set of new variables, each a linear combination of the original predictors, and then fitting a least squares model on these transformed variables. This approach is particularly useful when the number of predictors ($p$) is large relative to the number of observations ($n$), as it can reduce variance and improve model performance.

### Key Concepts

#### Why Dimension Reduction?

Previous methods in Chapter 6 (e.g., subset selection, ridge regression, lasso) controlled variance by either selecting a subset of predictors or shrinking their coefficients toward zero.

Dimension reduction methods take a different approach: they transform the original $p$ predictors ($X_1, X_2, \ldots, X_p$) into a smaller set of $M$ new predictors ($Z_1, Z_2, \ldots, Z_M$), where $M < p$, and then fit a linear regression model using these new predictors.

By reducing the number of coefficients to estimate from $p + 1$ (including the intercept) to $M + 1$, the problem becomes simpler, which can lead to lower variance and better predictive performance than standard least squares regression on the original predictors.

### How It Works

#### Step 1: Create Linear Combinations

The new predictors $Z_1, Z_2, \ldots, Z_M$ are linear combinations of the original predictors $X_1, X_2, \ldots, X_p$. Mathematically, for each $m = 1, \ldots, M$:

$$Z_m = \sum_{j=1}^p \phi_{jm} X_j$$

Here, $\phi_{jm}$ are constants (weights) that define how much each original predictor $X_j$ contributes to the new predictor $Z_m$.

#### Step 2: Fit a Linear Regression Model

Using the $M$ new predictors, fit a linear regression model:

$$y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \epsilon_i, \quad i = 1, \ldots, n$$

where $\theta_0, \theta_1, \ldots, \theta_M$ are the regression coefficients for the transformed predictors, and $\epsilon_i$ is the error term.

This model is fit using least squares, estimating only $M + 1$ coefficients instead of $p + 1$.

### Connection to the Original Model

The dimension reduction model can be rewritten to resemble the original linear regression model:

$$y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} + \epsilon_i$$

The coefficients $\beta_j$ in the original model are constrained by the dimension reduction approach. Specifically:

$$\beta_j = \sum_{m=1}^M \theta_m \phi_{jm}$$

This constraint means that the $\beta_j$ coefficients are expressed as combinations of the $\theta_m$ and $\phi_{jm}$, reducing the flexibility of the model.

**Trade-off:** This constraint introduces bias (because the coefficients are restricted), but when $p$ is large relative to $n$, choosing $M \ll p$ can significantly reduce the variance of the coefficient estimates, leading to better overall performance.

If $M = p$ and the $Z_m$ are linearly independent, no constraints are imposed, and the dimension reduction model is equivalent to the original least squares model (no dimension reduction occurs).
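
The identity $\beta_j = \sum_{m} \theta_m \phi_{jm}$ can be checked numerically. In the sketch below the data are simulated and the weights `phi` are random, purely as an assumption for illustration; the identity holds for any choice of weights, not only those produced by PCA or PLS.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 50, 5, 2
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

# Arbitrary p x M weight matrix (an assumption; any weights work here).
phi = rng.normal(size=(p, M))
Z = X @ phi                                  # z_im = sum_j phi_jm * x_ij

# Least squares on the M transformed predictors plus an intercept.
A = np.column_stack([np.ones(n), Z])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
theta0, theta = coef[0], coef[1:]

# Implied coefficients on the original predictors.
beta = phi @ theta                           # beta_j = sum_m theta_m * phi_jm

fitted_from_z = theta0 + Z @ theta
fitted_from_x = theta0 + X @ beta
print(np.allclose(fitted_from_z, fitted_from_x))
```

Only $M + 1 = 3$ numbers are estimated, yet they determine all $p = 5$ entries of `beta`, which is exactly the constraint described above.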

### Two-Step Process

**Step 1:** Obtain the transformed predictors $Z_1, Z_2, \ldots, Z_M$ by choosing appropriate constants $\phi_{jm}$.

**Step 2:** Fit the linear regression model using these $M$ predictors.

The key difference between dimension reduction methods lies in how the $Z_m$ predictors are chosen (i.e., how the $\phi_{jm}$ weights are determined). The book discusses two main approaches:

- **Principal Components Regression (PCR):** Uses principal component analysis (PCA) to create $Z_m$, where the new predictors are the principal components (directions of maximum variance in the data).

- **Partial Least Squares (PLS):** Creates $Z_m$ by considering both the predictors and the response variable, aiming to find directions that explain both the variance in $X$ and the relationship with $y$.

### Why It Can Outperform Least Squares

If the constants $\phi_{jm}$ are chosen wisely (e.g., via PCA or PLS), the dimension reduction model can capture the most important patterns in the data with fewer predictors.

This reduction in dimensionality lowers the risk of overfitting, especially when $p$ is large, and can lead to better predictive accuracy compared to fitting a least squares model with all $p$ predictors.

## Summary of PCA

PCA is a dimension reduction technique that transforms a dataset with $n$ observations and $p$ predictors (e.g., an $n \times p$ matrix $X$) into a new set of variables called principal components. These components are linear combinations of the original predictors, ordered by the amount of variance they explain. The goal is to simplify the data by reducing the number of dimensions while retaining most of the information (variance). In the ad example with 100 cities ($n = 100$) and two predictors (pop and ad, $p = 2$), PCA creates up to two components, $Z_1$ and $Z_2$.

### Process

1. Center the data by subtracting the mean of each predictor ($\text{pop} - \overline{\text{pop}}, \text{ad} - \overline{\text{ad}}$).

2. Compute principal components as directions of maximum variance, using loadings (weights) derived from the eigenvectors of the covariance matrix.

3. Project the data onto these directions to get scores ($z_{i1}, z_{i2}$).

**Purpose:** Reduces dimensionality (e.g., from 2 to 1 if $Z_1$ suffices), aids in visualization, and improves model performance by mitigating overfitting in high-dimensional data.
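
The three steps can be sketched with an eigendecomposition of the covariance matrix. The `pop` and `ad` arrays below are simulated stand-ins, not the book's data, and `pca` is a hypothetical helper name; the sign of each loading vector is arbitrary, as with any PCA routine.

```python
import numpy as np

def pca(X):
    """Return loadings (as columns), scores, and component variances."""
    Xc = X - X.mean(axis=0)                     # step 1: center
    cov = np.cov(Xc, rowvar=False)              # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]   # step 2: directions
    scores = Xc @ eigvecs                       # step 3: project
    return eigvecs, scores, eigvals

# Simulated, positively correlated stand-ins for pop and ad (an assumption).
rng = np.random.default_rng(0)
pop = rng.normal(40, 10, size=100)
ad = 0.6 * pop + rng.normal(0, 3, size=100)
X = np.column_stack([pop, ad])

loadings, scores, variances = pca(X)
print("first loading vector:", loadings[:, 0])
print("component variances:", variances)
```

The sample variance of each score column equals the corresponding eigenvalue, and the two score columns are uncorrelated.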

### Detailed Explanation of $Z_1$ (First Principal Component)

#### Definition
$Z_1$ is the linear combination that maximizes the variance of the projected data. Its formula is:

$$Z_1 = 0.839 \times (\text{pop} - \overline{\text{pop}}) + 0.544 \times (\text{ad} - \overline{\text{ad}})$$

#### Loadings
- $\phi_{11} = 0.839$ (pop)
- $\phi_{21} = 0.544$ (ad)
- Note: $0.839^2 + 0.544^2 \approx 1$

#### Scores
For the $i$-th observation:

$$z_{i1} = 0.839 \times (\text{pop}_i - \overline{\text{pop}}) + 0.544 \times (\text{ad}_i - \overline{\text{ad}})$$

**Example:** Bottom-left point has $z_{i1} = -26.1$ (below-average pop and ad), top-right has $z_{i1} = 18.7$ (above-average).

#### Geometric Interpretation
$Z_1$ lies along the green line in Figure 6.14. This line minimizes the sum of squared perpendicular distances to the data points (Figure 6.15, left panel), making it the "closest" line to the data.

#### Variance
$Z_1$ captures the dominant variability (wide spread on x-axis in Figure 6.15, right panel), reflecting the strong positive correlation between pop and ad.

### Detailed Explanation of $Z_2$ (Second Principal Component)

#### Definition
$Z_2$ is the next linear combination that maximizes variance, orthogonal to $Z_1$. Its formula is:

$$Z_2 = 0.544 \times (\text{pop} - \overline{\text{pop}}) - 0.839 \times (\text{ad} - \overline{\text{ad}})$$

#### Loadings
- $\phi_{12} = 0.544$ (pop)
- $\phi_{22} = -0.839$ (ad)
- Note: $0.544^2 + (-0.839)^2 \approx 1$
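
A quick arithmetic check, using only the rounded loadings quoted in these notes, confirms that each loading vector has approximately unit length and that the two vectors are orthogonal:

$$0.839^2 + 0.544^2 = 0.703921 + 0.295936 = 0.999857 \approx 1$$

$$(0.839)(0.544) + (0.544)(-0.839) = 0.456416 - 0.456416 = 0$$

The small gap from 1 in the first line comes from rounding the loadings to three decimals.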

#### Scores
For the $i$-th observation:

$$z_{i2} = 0.544 \times (\text{pop}_i - \overline{\text{pop}}) - 0.839 \times (\text{ad}_i - \overline{\text{ad}})$$

**Example:** High pop/low ad might yield a positive $z_{i2}$, low pop/high ad a negative $z_{i2}$, but values are near zero (y-axis in Figure 6.15).

#### Geometric Interpretation
The dashed blue line in Figure 6.14, perpendicular to $Z_1$, captures residual variation where pop and ad move oppositely.

#### Variance
$Z_2$ explains much less variance (narrow spread on y-axis in Figure 6.15), as most variability is already accounted for by $Z_1$ due to the linear relationship.

### Why $Z_1$ Captures More Information Than $Z_2$

1. **Variance Maximization:** $Z_1$ is designed to explain the largest possible variance, while $Z_2$, being uncorrelated with $Z_1$, captures the remaining variance, which is minimal when predictors are highly correlated.

2. **Data Structure:** The strong linear trend between pop and ad (Figure 6.14) aligns with $Z_1$'s direction, leaving little orthogonal variation for $Z_2$.

3. **Visual Evidence:** The wide range of $z_{i1}$ (-26.1 to 18.7) versus the tight clustering of $z_{i2}$ near zero (Figure 6.15) shows $Z_1$ dominates.

4. **Mathematical Basis:** The eigenvalue for $Z_1$ is larger than for $Z_2$, reflecting that $Z_1$ accounts for nearly all the total variance (e.g., >90%).

### Choosing Between $Z_1$ and $Z_2$

#### Use $Z_1$ Alone
If it explains >80-90% of variance (check via explained variance ratio or scree plot), it's sufficient for dimensionality reduction or modeling (e.g., PCR). In this case, $Z_1$ likely suffices due to the low $Z_2$ variability.

#### Add $Z_2$
Include it only if it explains significant additional variance (e.g., >5-10%) and improves model performance (e.g., via cross-validation). Here, $Z_2$'s minimal contribution suggests it's unnecessary.
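
The explained variance ratio mentioned above can be computed directly from the singular values of the centered data. The sketch below is illustrative: the data are simulated stand-ins for pop and ad, the 0.9 threshold is just the rule of thumb from these notes, and both function names are hypothetical.

```python
import numpy as np

def explained_variance_ratio(X):
    """Proportion of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered data give component variances.
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2 / (X.shape[0] - 1)
    return var / var.sum()

def components_needed(ratios, threshold=0.9):
    """Smallest M whose cumulative explained variance reaches the threshold."""
    m = int(np.searchsorted(np.cumsum(ratios), threshold)) + 1
    return min(m, len(ratios))   # guard against rounding in the cumulative sum

# Simulated, strongly correlated stand-ins for pop and ad (an assumption).
rng = np.random.default_rng(0)
pop = rng.normal(40, 10, size=100)
ad = 0.6 * pop + rng.normal(0, 3, size=100)
ratios = explained_variance_ratio(np.column_stack([pop, ad]))

print("explained variance ratios:", ratios)
print("components needed for 90%:", components_needed(ratios))
```

A variance threshold is only a heuristic for summarizing the predictors; when the goal is prediction, cross-validation on the regression error remains the more direct criterion.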



## Principal Components Regression (PCR)

PCR is a regression technique that combines PCA with linear regression to handle high-dimensional data. It works by:

- Constructing the first $M$ principal components ($Z_1, Z_2, \ldots, Z_M$, where $M < p$) from the original $p$ predictors ($X_1, X_2, \ldots, X_p$).
- Using these components as predictors in a least squares regression model to predict the response $Y$.

The key assumption is that the directions of maximum variance in the predictors (captured by the principal components) are also the directions most related to $Y$. While not always true, this often provides a good approximation, especially when a small number of components explain most of the variability and the response relationship.

### How PCR Works

#### Step 1: Principal Components
As described earlier:
- $Z_1 = 0.839 \times (\text{pop} - \overline{\text{pop}}) + 0.544 \times (\text{ad} - \overline{\text{ad}})$ captures the direction of maximum variance.
- $Z_2 = 0.544 \times (\text{pop} - \overline{\text{pop}}) - 0.839 \times (\text{ad} - \overline{\text{ad}})$ captures the next largest variance, orthogonal to $Z_1$.
- These are linear combinations of all original predictors, not a subset.

#### Step 2: Regression
Fit a model:

$$Y_i = \theta_0 + \theta_1 z_{i1} + \theta_2 z_{i2} + \cdots + \theta_M z_{iM} + \epsilon_i$$

using least squares, where $M$ is chosen to balance bias and variance.

#### Benefit
By using $M < p$ components, PCR reduces the number of coefficients to estimate (from $p + 1$ to $M + 1$), mitigating overfitting, especially when $p$ is large relative to $n$.
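
A minimal PCR sketch, assuming simulated data and hypothetical helper names `pcr_fit` and `pcr_predict`. It only centers the predictors because the simulated columns share a scale; with real data, standardizing first is the usual choice.

```python
import numpy as np

def pcr_fit(X, y, M):
    """Regress y on the first M principal components of X.

    Returns the intercept and the implied coefficients on the original
    predictors, beta = phi @ theta.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Rows of Vt are the principal component loading vectors.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    phi = Vt[:M].T                       # p x M loadings
    Z = Xc @ phi                         # n x M scores (already centered)
    theta = np.linalg.lstsq(Z, y - y_mean, rcond=None)[0]
    beta = phi @ theta
    intercept = y_mean - x_mean @ beta
    return intercept, beta

def pcr_predict(X, intercept, beta):
    """Predictions on the original predictor scale."""
    return intercept + X @ beta

# Simulated data (an assumption).
rng = np.random.default_rng(0)
n, p = 80, 6
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(size=n)

for M in range(1, p + 1):
    b0, b = pcr_fit(X, y, M)
    rss = np.sum((y - pcr_predict(X, b0, b)) ** 2)
    print(f"M = {M}: training RSS = {rss:.2f}")
```

With $M = p$ the sketch returns the ordinary least squares coefficients, and training RSS can only fall as $M$ grows, which is why $M$ has to be chosen on held-out error rather than training error.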

### Performance and Trade-offs

#### Bias-Variance Trade-off
As $M$ increases:
- **Bias decreases** (more components capture the true relationship).
- **Variance increases** (more parameters risk overfitting).

### Choosing $M$

#### Cross-Validation
The number of components $M$ is typically selected using cross-validation to minimize prediction error. For the Credit dataset (Figure 6.20), $M = 10$ minimizes error, though this is close to using all 11 components (equivalent to least squares).

#### Practical Tip
Start with $M = 1$ (e.g., $Z_1$ alone) and increase until adding components no longer improves performance.

### Key Characteristics

#### Not Feature Selection
Unlike the lasso, PCR uses linear combinations of all $p$ predictors (e.g., $Z_1$ includes both pop and ad), so it does not identify a small subset of important features. In this respect it is closer to ridge regression, which shrinks all coefficients without removing any; ridge regression can even be viewed as a continuous version of PCR.

#### Standardization
Standardize predictors (subtract mean, divide by standard deviation) before PCA to ensure equal scaling. This prevents high-variance variables from dominating components. Skip if all variables are in the same units (e.g., kilograms).
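
The effect of scaling on the first principal component can be seen in a small simulation. Everything here is an assumption for illustration: two predictors carry the same information, but the second is recorded in units that make its values 1000 times larger. `first_loading` is a hypothetical helper name.

```python
import numpy as np

def first_loading(X):
    """Loading vector of the first principal component of X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
x1 = signal + 0.5 * rng.normal(size=n)
x2 = 1000.0 * (signal + 0.5 * rng.normal(size=n))   # same structure, larger units
X = np.column_stack([x1, x2])

raw_loading = first_loading(X)

# Standardize: subtract the mean and divide by the standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_loading = first_loading(X_std)

print("unscaled loadings:    ", raw_loading)
print("standardized loadings:", std_loading)
```

Without scaling, the first component points almost entirely along the large-unit variable; after standardization, the two predictors receive equal weight in absolute value.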

## Partial Least Squares (PLS)

PLS is a supervised dimension reduction method, contrasting with the unsupervised nature of Principal Components Regression (PCR). Like PCR, PLS reduces the dimensionality of the predictor space by creating new features ($Z_1, Z_2, \ldots, Z_M$) as linear combinations of the original $p$ predictors ($X_1, X_2, \ldots, X_p$), and then fits a least squares regression model using these $M$ new features to predict the response $Y$. However, unlike PCR, which focuses solely on maximizing variance in the predictors, PLS incorporates the response $Y$ to identify directions that are both representative of the predictors and predictive of $Y$.

### How PLS Works

#### Supervised Approach
PLS uses $Y$ to guide the construction of the new features, aiming to find directions that explain both the variability in the predictors and their relationship with the response. This makes it more tailored to prediction compared to the unsupervised PCA.

#### Process

1. **Standardization:** Standardize the $p$ predictors (and often the response) to ensure equal scaling, preventing high-variance variables from dominating.

2. **First PLS Direction ($Z_1$):** Compute $Z_1$ as a linear combination:
   $$Z_1 = \sum_{j=1}^p \phi_{j1} X_j$$
   where each $\phi_{j1}$ is set to the coefficient from the simple linear regression of $Y$ onto $X_j$. This coefficient is proportional to the correlation between $Y$ and $X_j$, so $Z_1$ weights variables more heavily if they are strongly correlated with $Y$.

3. **Subsequent Directions ($Z_2, \ldots, Z_M$):**
   - Adjust the predictors and response for $Z_1$ by regressing each $X_j$ and $Y$ on $Z_1$ and taking residuals (the unexplained variation).
   - Compute $Z_2$ using the same method on the orthogonalized data.
   - Repeat iteratively for $Z_3, \ldots, Z_M$.

4. **Regression:** Fit a least squares model using $Z_1, \ldots, Z_M$ to predict $Y$, similar to PCR.
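
The process above can be sketched as follows. This is a direct transcription of the listed steps under stated assumptions, not a reference implementation: the data are simulated, `pls_components` and `pls_fit` are hypothetical names, and library implementations of PLS may normalize the weights differently.

```python
import numpy as np

def pls_components(X, y, M):
    """Build M PLS score vectors following the steps listed above.

    Returns an n x M matrix of scores. X is standardized internally and
    y is centered. Assumes M is small enough that no residual column of
    X becomes identically zero.
    """
    Xr = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # step 1
    yr = y - y.mean()
    Z = np.empty((X.shape[0], M))
    for m in range(M):
        # phi_j: slope from the simple regression of y on predictor j.
        phi = (Xr.T @ yr) / (Xr ** 2).sum(axis=0)
        z = Xr @ phi
        Z[:, m] = z
        # Regress each predictor and the response on z; keep the residuals.
        zz = z @ z
        Xr = Xr - np.outer(z, (z @ Xr) / zz)
        yr = yr - z * (z @ yr) / zz
    return Z

def pls_fit(X, y, M):
    """Least squares fit of y on the M PLS components (step 4)."""
    Z = pls_components(X, y, M)
    A = np.column_stack([np.ones(len(y)), Z])
    theta = np.linalg.lstsq(A, y, rcond=None)[0]
    return Z, theta

# Simulated data (an assumption).
rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

Z, theta = pls_fit(X, y, M=2)
print("theta:", theta)
```

Because every predictor is replaced by its residual after regression on the current component, each new score vector is orthogonal to the earlier ones.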

#### Tuning Parameter
The number of components $M$ is chosen via cross-validation to minimize prediction error.

### Iterative Process for Multiple Components

- After computing $Z_1$, residuals (unexplained parts of $X_j$ and $Y$) are used to find $Z_2$, ensuring it's orthogonal to $Z_1$.
- This process repeats for $Z_3, \ldots, Z_M$, with each component capturing remaining variation related to $Y$.
- The final model uses all $M$ components in a least squares fit.

### Comparison with PCR and PCA

#### PCR (Unsupervised)
Uses PCA components ($Z_1, Z_2, \ldots$) based on maximum variance in predictors, ignoring $Y$. For example:
- $Z_1 = 0.839 \times (\text{pop} - \overline{\text{pop}}) + 0.544 \times (\text{ad} - \overline{\text{ad}})$
- $Z_2 = 0.544 \times (\text{pop} - \overline{\text{pop}}) - 0.839 \times (\text{ad} - \overline{\text{ad}})$

These are driven by predictor variance, not Sales.

#### PLS (Supervised)
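Adjusts $Z_1$ to emphasize predictors correlated with $Y$. If Sales correlates more strongly with pop than with ad, the first PLS direction places relatively more weight on pop than the first principal component does, because PCA ignores Sales when choosing its direction.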

#### Drawback of PCR
If variance directions don't align with $Y$ (e.g., ad variance is high but unrelated to Sales), PCR may perform poorly. PLS mitigates this by incorporating $Y$.

### Performance and Practical Considerations

#### Cross-Validation
Like PCR, $M$ is selected to minimize error. Standardization is recommended for both predictors and $Y$.


#### Performance
In practice, PLS often performs no better than PCR or ridge regression. Its supervised approach can reduce bias by focusing on $Y$, but it can also increase variance, so the net benefit over PCR is often negligible.

## 6.4.2: What Goes Wrong in High Dimensions?

### Overview

When $p > n$ or $p \approx n$, traditional statistical methods like least squares regression, logistic regression, and linear discriminant analysis fail or produce misleading results due to overfitting. This section uses least squares as an example to illustrate the problem.

### Issues with Least Squares in High Dimensions

#### Perfect Fit Problem
When $p \geq n$, least squares can fit the training data perfectly, resulting in zero residuals. This happens because there are enough parameters (coefficients) to pass through every data point.

**Example (Figure 6.22):**
- With $n = 20$ and $p = 1$ (plus intercept), the regression line approximates the data but doesn't fit perfectly.
- With $n = 2$ and $p = 1$, the line fits exactly, regardless of the data, leading to overfitting.

**Consequence:** A perfect fit on training data doesn't generalize to a test set, as seen in Figure 6.22, where the $n = 2$ fit fails on the $n = 20$ test data.

#### Overfitting
The model becomes too flexible, capturing noise rather than the true signal, making it useless for prediction on new data.

**Simulated Example (Figure 6.23):**
- $n = 20$, $p$ varies from 1 to 20, with all features unrelated to $Y$.
- As $p$ increases, training $R^2$ reaches 1 and training MSE drops to 0, but test MSE skyrockets due to high variance in coefficient estimates.

**Lesson:** A high $R^2$ or low training MSE is misleading when $p$ is close to or exceeds $n$; performance on an independent test set is the true indicator.
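
A small simulation in the spirit of this example: $n = 20$ training observations, between 1 and 20 features, and a response unrelated to every feature. The Gaussian data, the size of the test set, and the name `fit_and_score` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_test, p_max = 20, 1000, 20

# Pure-noise features and a response unrelated to all of them.
X_train = rng.normal(size=(n, p_max))
y_train = rng.normal(size=n)
X_test = rng.normal(size=(n_test, p_max))
y_test = rng.normal(size=n_test)

def fit_and_score(p):
    """Least squares with an intercept on the first p features.

    Returns (training R^2, test MSE).
    """
    A = np.column_stack([np.ones(n), X_train[:, :p]])
    # lstsq returns the minimum-norm solution when the system is underdetermined.
    coef = np.linalg.lstsq(A, y_train, rcond=None)[0]
    resid = y_train - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y_train - y_train.mean()) ** 2)
    A_test = np.column_stack([np.ones(n_test), X_test[:, :p]])
    test_mse = np.mean((y_test - A_test @ coef) ** 2)
    return r2, test_mse

results = {p: fit_and_score(p) for p in range(1, p_max + 1)}
for p in (1, 5, 10, 15, 19, 20):
    r2, mse = results[p]
    print(f"p = {p:2d}: training R^2 = {r2:.3f}, test MSE = {mse:.2f}")
```

Once the intercept plus the slopes number 20, the training fit is exact even though no feature carries signal, so training $R^2$ alone says nothing about predictive value; the test MSE column is the quantity to watch.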

### Limitations of Traditional Adjustments

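Methods like $C_p$, AIC, BIC, and adjusted $R^2$ (from Section 6.1.3) are not appropriate when $p > n$. $C_p$, AIC, and BIC rely on an estimate of $\sigma^2$ (residual variance), which becomes unreliable because the training residuals are driven to zero (e.g., $\hat{\sigma}^2 = 0$), and adjusted $R^2$ can easily reach 1. Alternative techniques suited to high dimensions are required.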

## 6.4.3: Regression in High Dimensions

### Overview

High-dimensional regression requires methods that impose constraints to avoid overfitting. Techniques like forward stepwise selection, ridge regression, lasso, and PCR (from earlier sections) are well-suited for this setting.

### Role of Regularization

#### Less Flexible Models
These methods reduce flexibility compared to least squares, controlling variance at the cost of some bias.

**Lasso Example (Figure 6.24):**
- $n = 100$ and $p = 20$, $50$, or $2{,}000$, with 20 features truly related to $Y$.
- Test MSE increases with $p$ due to the curse of dimensionality.
- Optimal $\lambda$ (regularization parameter) varies: small $\lambda$ (low shrinkage) for $p = 20$, larger $\lambda$ for $p = 2{,}000$, reflecting more shrinkage needed with more features.
- Degrees of freedom (number of non-zero coefficients) indicate model flexibility; higher $p$ requires fewer non-zero terms for good performance.

#### Key Points
- **Regularization:** Essential to stabilize estimates in high dimensions.
- **Tuning:** Cross-validation to select $\lambda$ or $M$ (e.g., in PCR) is critical.
- **Curse of Dimensionality:** Adding features increases test error unless they are signal (related to $Y$). Noise features inflate variance without reducing bias, worsening performance.

### Curse of Dimensionality

**Definition:** As $p$ grows, the data becomes sparse in the feature space, increasing overfitting risk. Adding noise features deteriorates models, while signal features may help only if their benefit outweighs variance.

**Implication:** New technologies (e.g., genomics with millions of SNPs) offer potential but risk poor results if irrelevant features dominate.

## 6.4.4: Interpreting Results in High Dimensions

### Overview

Interpreting high-dimensional models requires caution due to extreme multicollinearity and overfitting risks.

### Challenges in Interpretation

#### Multicollinearity
With $p > n$, any predictor can be a linear combination of others, making it impossible to pinpoint which variables truly predict $Y$ or their exact coefficients.

**Example:** Predicting blood pressure with 500,000 SNPs using forward stepwise selection might identify 17 SNPs, but other sets of 17 could work equally well. Re-running on new data likely yields a different set, showing the model's instability.

**Validation:** The selected model may predict well on a test set and be clinically useful, but claiming specific SNPs are causal is overreach without further validation.

### Reporting Model Fit

#### Avoid Training Metrics
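With $p > n$, training-set measures such as SSE, $p$-values, $R^2$, and adjusted $R^2$ are unreliable: training SSE can drop to 0 and $R^2$ can reach 1 even when no predictor is related to $Y$, misleadingly suggesting a good fit (e.g., Figure 6.23).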

#### Use Test Metrics
Report test set MSE or $R^2$, or cross-validation errors, to assess true performance. Training metrics are invalid in this setting.