# Chapter 7: Moving Beyond Linearity

## Overview
This chapter explores methods for extending linear regression models to capture non-linear relationships in data. These techniques provide flexibility while maintaining interpretability and computational efficiency.

## Key Methods

### 1. Polynomial Regression
- **Definition**: Extends the linear model by adding extra predictors obtained by raising each original predictor to a power
- **Example**: Cubic regression uses three variables: X, X², and X³ as predictors
- **Advantage**: Simple way to provide a non-linear fit to data
- **Form**: 
  ```
  y = β₀ + β₁X + β₂X² + β₃X³ + ... + βₚXᵖ + ε
  ```
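As a minimal sketch of the cubic case, the snippet below builds the predictors X, X², X³ and estimates the coefficients by least squares with NumPy. The data and the coefficient values in `true_beta` are invented for illustration; they do not come from the book.

```python
import numpy as np

# Synthetic, noise-free data from a known cubic (illustrative values only)
x = np.linspace(-2.0, 2.0, 25)
true_beta = np.array([1.0, -2.0, 0.5, 3.0])  # beta_0 .. beta_3
y = true_beta[0] + true_beta[1] * x + true_beta[2] * x**2 + true_beta[3] * x**3

# Design matrix with columns 1, X, X^2, X^3
X = np.column_stack([x**p for p in range(4)])

# Ordinary least squares estimate of the coefficients
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 6))
```

Because the powers of X are just extra columns in the design matrix, the model is still linear in the coefficients and can be fit with any ordinary least squares routine.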

### 2. Step Functions
- **Definition**: Cut the range of a variable into K distinct regions to produce a qualitative variable
- **Effect**: Fits a piecewise constant function
- **Characteristics**:
  - Creates discrete "steps" in the fitted function
  - Each region has a constant predicted value
  - Simple but potentially rough approximation

### 3. Regression Splines
- **Definition**: More flexible than polynomials and step functions; extension of both methods
- **Process**:
  1. Divide the range of X into K distinct regions
  2. Fit a polynomial function within each region
  3. Constrain polynomials to join smoothly at region boundaries (knots)
- **Key Feature**: Smooth continuity at knots
- **Flexibility**: With enough regions, can produce extremely flexible fits

### 4. Smoothing Splines
- **Definition**: Similar to regression splines but arise from a different approach
- **Method**: Result from minimizing a residual sum of squares criterion subject to a smoothness penalty
- **Key Difference**: Automatically balance fit vs. smoothness through penalty parameter
- **Advantage**: Don't require manual knot selection

### 5. Local Regression
- **Definition**: Similar to splines but with an important distinction
- **Key Feature**: Regions are allowed to overlap in a very smooth way
- **Characteristics**:
  - Fits local polynomial models
  - Uses weighted regression with nearby points
  - Provides smooth, flexible fits

### 6. Generalized Additive Models (GAMs)
- **Definition**: Extend the above methods to handle multiple predictors
- **Form**: 
  ```
  y = β₀ + f₁(X₁) + f₂(X₂) + ... + fₚ(Xₚ) + ε
  ```
- **Advantage**: Maintains additivity while allowing non-linear relationships for each predictor

## Key Concepts to Remember

- **Flexibility vs. Interpretability**: More flexible methods may sacrifice some interpretability
- **Bias-Variance Tradeoff**: More complex methods can reduce bias but increase variance
- **Smoothness**: Most methods aim to balance fit quality with smoothness
- **Knots**: Critical points where piecewise functions connect (important for splines)
- **Degrees of Freedom**: Measure of model complexity in non-linear methods

## Common Applications

- **Polynomial Regression**: When relationship has clear polynomial nature
- **Step Functions**: When natural breakpoints exist in data
- **Regression Splines**: When smooth, flexible fit is needed with interpretable regions
- **Smoothing Splines**: When automatic smoothness selection is desired
- **Local Regression**: For exploratory data analysis and smooth trend identification
- **GAMs**: When dealing with multiple predictors requiring non-linear treatment

## 7.2: Step Functions

### What Are Step Functions?

Step functions are a method to model a non-linear relationship by dividing the range of a predictor variable X (e.g., age, income) into bins or intervals and fitting a constant (a single value) within each bin. This approach converts a continuous variable into an ordered categorical variable, where each category corresponds to a bin, and a separate constant is estimated for each bin. This allows the model to capture non-linear patterns without assuming a specific functional form (e.g., linear or polynomial) across the entire range of X.

### How Step Functions Work

#### 1. Creating Bins

- The range of the predictor X is divided into K+1 intervals using cutpoints c₁, c₂, ..., cₖ
- These cutpoints define boundaries for the bins
- **Example**: If X is age, you might create bins like [0,20), [20,30), [30,40), ...

#### 2. Defining Dummy Variables

For each bin, a dummy variable (or indicator variable) is created:

```
C₀(X) = I(X < c₁)
C₁(X) = I(c₁ ≤ X < c₂)
C₂(X) = I(c₂ ≤ X < c₃)
...
Cₖ(X) = I(cₖ ≤ X)
```

**Key Points:**
- I(·) is an indicator function that equals 1 if the condition is true (e.g., X falls in that bin) and 0 otherwise
- For any value of X, exactly one of the dummy variables C₀(X), C₁(X), ..., Cₖ(X) equals 1, and the rest are 0
- This ensures: C₀(X) + C₁(X) + ... + Cₖ(X) = 1

#### 3. Fitting the Model

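A linear regression model is fit using the dummy variables C₁(X), ..., Cₖ(X) as predictors. C₀(X) is left out because it is redundant with the intercept: the K+1 dummy variables always sum to 1.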

```
yᵢ = β₀ + β₁C₁(xᵢ) + β₂C₂(xᵢ) + ... + βₖCₖ(xᵢ) + εᵢ
```

**Where:**
- yᵢ is the response (e.g., wage)
- εᵢ is the error term
- β₀ represents the average response when X < c₁ (since all other Cⱼ(X) = 0)
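- For cⱼ ≤ X < cⱼ₊₁ (or X ≥ cₖ for the last bin), the predicted response is β₀ + βⱼ, so βⱼ represents the average difference in the response relative to the baseline bin (X < c₁); it is an increase when βⱼ is positive and a decrease when it is negative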
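The interpretation of the coefficients can be checked numerically. The sketch below fits the step-function model by least squares on a small made-up dataset (the ages, wages, and cutpoints are hypothetical) and compares the coefficients with the per-bin sample means.

```python
import numpy as np

# Hypothetical data: ages and wages invented for illustration
age = np.array([18, 19, 22, 25, 28, 31, 35, 38, 45, 52, 60], dtype=float)
wage = np.array([40, 44, 60, 66, 72, 85, 90, 95, 105, 110, 115], dtype=float)
cutpoints = np.array([20.0, 30.0, 40.0])  # c_1, c_2, c_3

# bin_index[i] = j means c_j <= age[i] < c_{j+1}; j = 0 is the baseline bin
bin_index = np.digitize(age, cutpoints)

# Design matrix: intercept plus dummies C_1..C_K (C_0 is omitted)
K = len(cutpoints)
C = (bin_index[:, None] == np.arange(1, K + 1)).astype(float)
X = np.column_stack([np.ones_like(age), C])

beta_hat, *_ = np.linalg.lstsq(X, wage, rcond=None)

# Per-bin sample means for comparison
bin_means = np.array([wage[bin_index == j].mean() for j in range(K + 1)])
print("beta_hat: ", beta_hat)
print("bin means:", bin_means)
```

In this sketch the intercept equals the mean of the baseline bin, and each β₀ + βⱼ equals the mean of bin j, which is what least squares on indicator variables produces.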

#### 4. Logistic Regression for Classification

Step functions can also be used in logistic regression to model probabilities. For example, to predict the probability that an individual earns more than $250,000 based on age:

```
Pr(yᵢ > 250 | xᵢ) = exp(β₀ + β₁C₁(xᵢ) + ... + βₖCₖ(xᵢ)) / (1 + exp(β₀ + β₁C₁(xᵢ) + ... + βₖCₖ(xᵢ)))
```

This estimates the probability of being a "high earner" for each bin of X.
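A small sketch can make the per-bin probabilities concrete. It relies on a standard fact: when the only predictors are the bin indicators, the maximum-likelihood logistic fit reproduces each bin's observed share of positives, provided every share is strictly between 0 and 1. The ages, labels, and cutpoints below are hypothetical, and the coefficients are computed directly from the shares rather than by an iterative solver.

```python
import math
from bisect import bisect_right

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: 1 marks a "high earner", 0 otherwise
cutpoints = [30, 50]
ages = [22, 25, 28, 29, 33, 38, 41, 47, 52, 58, 61, 66]
high = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0]

# Share of high earners in each bin
counts = [0] * (len(cutpoints) + 1)
positives = [0] * (len(cutpoints) + 1)
for a, h in zip(ages, high):
    j = bisect_right(cutpoints, a)  # bin index: number of cutpoints <= a
    counts[j] += 1
    positives[j] += h
shares = [p / n for p, n in zip(positives, counts)]

# Coefficients on the log-odds scale: baseline bin, then differences from it
beta0 = logit(shares[0])
betas = [logit(s) - beta0 for s in shares[1:]]

# Plugging the coefficients back into the logistic formula recovers the shares
fitted = [sigmoid(beta0)] + [sigmoid(beta0 + b) for b in betas]
print(shares, fitted)
```

Here each βⱼ is a difference in log-odds between bin j and the baseline bin, which mirrors the role of βⱼ in the linear model.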

### Example: Wage Data

- In the book's example (referring to Figure 7.2), step functions are applied to the Wage dataset, where the predictor X is age, and the response y is wage
- The age range is divided into bins (e.g., 20–30, 30–40, etc.), and a constant wage value is estimated for each bin
- The left-hand panel of Figure 7.2 shows the step function fit for wage, where each bin has a flat (constant) prediction
- The right-hand panel shows the logistic regression fit, estimating the probability of earning more than $250,000 as a function of age bins

### Advantages of Step Functions

- **Simplicity**: They are easy to interpret, as each bin has a single predicted value
- **Flexibility**: They can capture non-linear patterns without assuming a global functional form
- **Common in Practice**: Step functions are widely used in fields like biostatistics and epidemiology (e.g., 5-year age groups)

### Limitations of Step Functions

- **Missing Trends Within Bins**: Step functions assume a constant response within each bin, which can oversimplify the relationship. For example, the left-hand panel of Figure 7.2 shows that the first bin misses the increasing trend of wages with age
- **Arbitrary Cutpoints**: The choice of cutpoints c₁, c₂, ..., cₖ is often arbitrary unless there are natural breakpoints in the data (e.g., policy-driven age thresholds)
- **Discontinuities**: The model produces abrupt changes at bin boundaries, which may not reflect smooth real-world relationships

### Comparison to Polynomial Functions

- **Polynomial functions** (e.g., y = β₀ + β₁X + β₂X² + ...) impose a global structure across the entire range of X, assuming a smooth, continuous relationship
- **Step functions**, by contrast, impose a piecewise-constant structure, allowing different constant predictions in each bin without assuming continuity or smoothness across bins

---

## Defining Dummy Variables

### What Are Dummy Variables?

Dummy variables (also called indicator variables) are binary variables (0 or 1) used to represent categorical groups in a regression model. In the context of step functions, they are used to represent the bins into which a continuous predictor variable X (e.g., age, income) is divided.

### How Are Dummy Variables Created in Step Functions?

To model a non-linear relationship using step functions, the continuous predictor X is split into K+1 non-overlapping intervals (bins) using cutpoints c₁, c₂, ..., cₖ. Each bin corresponds to a dummy variable that indicates whether X falls within that bin.

The dummy variables are defined as follows:

```
C₀(X) = I(X < c₁)
C₁(X) = I(c₁ ≤ X < c₂)
C₂(X) = I(c₂ ≤ X < c₃)
...
Cₖ(X) = I(cₖ ≤ X)
```

#### Indicator Function I(·)

The function I(condition) returns:
- **1** if the condition is true (i.e., X falls in the specified bin)
- **0** otherwise

**Example**: If X is age and c₁ = 20, then:
- C₀(X) = 1 for X < 20 (e.g., age = 18)
- C₀(X) = 0 for X ≥ 20

#### Detailed Example

Suppose X is age, and we choose cutpoints c₁ = 20, c₂ = 30, and c₃ = 40. This creates four bins:

- **Bin 1**: X < 20 → C₀(X) = I(X < 20)
- **Bin 2**: 20 ≤ X < 30 → C₁(X) = I(20 ≤ X < 30)
- **Bin 3**: 30 ≤ X < 40 → C₂(X) = I(30 ≤ X < 40)
- **Bin 4**: X ≥ 40 → C₃(X) = I(X ≥ 40)

For a specific age, say X = 25:
- C₀(25) = I(25 < 20) = 0 (false)
- C₁(25) = I(20 ≤ 25 < 30) = 1 (true)
- C₂(25) = I(30 ≤ 25 < 40) = 0 (false)
- C₃(25) = I(25 ≥ 40) = 0 (false)

So, only C₁(25) = 1, and all others are 0.
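The same calculation can be written as a short function. This is a sketch using only the standard library; the name `step_dummies` is made up for this example.

```python
from bisect import bisect_right

def step_dummies(x, cutpoints):
    """Return [C_0(x), C_1(x), ..., C_K(x)] for sorted cutpoints c_1 < ... < c_K."""
    j = bisect_right(cutpoints, x)  # number of cutpoints <= x, i.e. the bin index
    return [1 if k == j else 0 for k in range(len(cutpoints) + 1)]

cutpoints = [20, 30, 40]
print(step_dummies(25, cutpoints))  # [0, 1, 0, 0]
```

A value equal to a cutpoint, such as X = 20, is assigned to the bin that starts at that cutpoint, matching the `cⱼ ≤ X` convention in the definitions.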

#### Key Property

For any value of X, exactly one dummy variable is 1, and the rest are 0. This ensures:

```
C₀(X) + C₁(X) + ... + Cₖ(X) = 1
```

This property guarantees that X falls into exactly one bin, making the bins mutually exclusive and collectively exhaustive.

### Why Use Dummy Variables?

By converting the continuous X into a set of dummy variables, we treat X as an ordered categorical variable. Each dummy variable represents membership in a specific bin, and we can use these in a regression model to estimate a different constant (response value) for each bin.

---

## Fitting the Model

### The Linear Regression Model

Once the dummy variables are defined, they are used as predictors in a linear regression model to predict the response yᵢ (e.g., wage). The model is:

```
yᵢ = β₀ + β₁C₁(xᵢ) + β₂C₂(xᵢ) + ... + βₖCₖ(xᵢ) + εᵢ
```

**Where:**
- **yᵢ**: The response variable for the i-th observation (e.g., wage for the i-th person)
- **Cⱼ(xᵢ)**: The dummy variable for the j-th bin, which is 1 if xᵢ (the predictor, e.g., age) falls in that bin, and 0 otherwise
- **β₀, β₁, ..., βₖ**: Coefficients estimated using least squares
- **εᵢ**: The error term, capturing random noise or unmodeled variation

### Interpreting the Coefficients

#### Baseline (β₀)

The coefficient β₀ represents the average response when X falls in the first bin (X < c₁). Why? Because when X < c₁, all dummy variables C₁(X), C₂(X), ..., Cₖ(X) = 0, so the model reduces to:

```
yᵢ = β₀ + εᵢ
```

Thus, β₀ is the mean response for the first bin.

#### Other Coefficients (βⱼ)

For X in bin j with j ≥ 1 (cⱼ ≤ X < cⱼ₊₁, or X ≥ cₖ when j = K), only Cⱼ(X) = 1, and all other dummy variables are 0. The model becomes:

```
yᵢ = β₀ + βⱼ · 1 + εᵢ = β₀ + βⱼ + εᵢ
```

Here, β₀ + βⱼ is the predicted response for the j-th bin, and βⱼ represents the difference in the average response for the j-th bin compared to the baseline bin (X < c₁).

### Example with Wage Data

Let's revisit the Wage dataset example:

#### Setup
- Suppose X is age, and we have cutpoints c₁ = 20, c₂ = 30, c₃ = 40, creating bins: <20, [20, 30), [30, 40), ≥40
- The model is: `wage_i = β₀ + β₁C₁(age_i) + β₂C₂(age_i) + β₃C₃(age_i) + εᵢ`

#### Interpretation
- If age < 20: All Cⱼ = 0, so predicted wage = β₀
- If 20 ≤ age < 30: C₁ = 1, others 0, so predicted wage = β₀ + β₁
- If 30 ≤ age < 40: C₂ = 1, others 0, so predicted wage = β₀ + β₂
- If age ≥ 40: C₃ = 1, others 0, so predicted wage = β₀ + β₃

#### Numerical Example
Suppose the fitted model gives:
- β₀ = 50,000 (average wage for age < 20)
- β₁ = 20,000, β₂ = 40,000, β₃ = 60,000

Then:
- Age < 20: Wage = $50,000
- Age [20, 30): Wage = $50,000 + $20,000 = $70,000
- Age [30, 40): Wage = $50,000 + $40,000 = $90,000
- Age ≥ 40: Wage = $50,000 + $60,000 = $110,000
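The lookup in this numerical example can be sketched in a few lines. The coefficients are the hypothetical values above, and `predict_wage` is a name made up for this illustration.

```python
from bisect import bisect_right

cutpoints = [20, 30, 40]                   # c_1, c_2, c_3
beta = [50_000, 20_000, 40_000, 60_000]    # hypothetical beta_0 .. beta_3

def predict_wage(age):
    """Predicted wage: beta_0 in the baseline bin, beta_0 + beta_j in bin j."""
    j = bisect_right(cutpoints, age)
    return beta[0] if j == 0 else beta[0] + beta[j]

for age in (18, 25, 35, 64):
    print(age, predict_wage(age))
```

Every age inside a bin receives the same prediction, which is what makes the fitted function piecewise constant.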

### Why This Works

- The model fits a piecewise-constant function, where each bin gets its own constant prediction (β₀ + βⱼ)
- This allows the model to capture non-linear patterns (e.g., wages increasing with age but not linearly) without assuming a single global shape (like a polynomial)

## 7.4: Regression Splines

Regression splines are a flexible approach to modeling non-linear relationships between a predictor (X) and a response (y). Regression splines extend the ideas of polynomial regression (Section 7.1) and piecewise constant regression (step functions, Section 7.2) by fitting piecewise polynomials with constraints to ensure smoothness.

### 7.4.1: Piecewise Polynomials

#### What Are Piecewise Polynomials?

Piecewise polynomial regression involves dividing the range of a predictor (X) (e.g., age) into distinct regions using knots (specific points where the function changes) and fitting a separate low-degree polynomial in each region. Unlike a single high-degree polynomial (e.g., `y = β₀ + β₁X + β₂X² + β₃X³`) applied across the entire range of (X), piecewise polynomials allow different polynomial functions in different regions, making the model more flexible and better able to capture local patterns.

#### Example: Piecewise Cubic Polynomial

A piecewise cubic polynomial fits a cubic polynomial (degree 3) in each region defined by knots. For instance, with one knot at `X = c`, the model is:

```
yᵢ = {
  β₀₁ + β₁₁xᵢ + β₂₁xᵢ² + β₃₁xᵢ³ + εᵢ    if xᵢ < c
  β₀₂ + β₁₂xᵢ + β₂₂xᵢ² + β₃₂xᵢ³ + εᵢ    if xᵢ ≥ c
}
```

- **Coefficients**: Each region has its own set of coefficients (`β₀₁, β₁₁, β₂₁, β₃₁` for `xᵢ < c`, and `β₀₂, β₁₂, β₂₂, β₃₂` for `xᵢ ≥ c`).
- **Knots**: The points where the polynomial changes (e.g., c) are called knots. In this example, there's one knot at `X = c`.
- **Flexibility**: With K knots, you fit `K + 1` polynomials (one for each region). For example, two knots create three regions, each with its own cubic polynomial.

#### Degrees of Freedom

Each cubic polynomial has 4 parameters (`β₀, β₁, β₂, β₃`). With `K + 1` regions (from K knots), the total number of parameters is:

```
4 × (K + 1)
```

For example, with one knot (`K = 1`), there are two regions, so `4 × 2 = 8` parameters (degrees of freedom). This is illustrated in the top left panel of Figure 7.3, where a piecewise cubic polynomial is fit to the Wage data with a knot at age = 50, using 8 degrees of freedom.

#### Special Cases

- **No Knots**: If there are no knots (`K = 0`), the model is a single cubic polynomial across the entire range of X, as in standard polynomial regression (Section 7.1).
- **Lower-Degree Polynomials**: You can use piecewise linear functions (degree 1) or piecewise constant functions (degree 0, as in step functions from Section 7.2) instead of cubics.

#### Issue with Piecewise Polynomials

As shown in the top left panel of Figure 7.3, a piecewise cubic polynomial with a knot at age = 50 is discontinuous at the knot, leading to an unrealistic "jump" in the fitted curve. This happens because the polynomials in each region are fit independently, with no constraints ensuring they connect smoothly at the knot. This lack of smoothness makes the model overly flexible and visually unappealing, as it doesn't reflect typical real-world relationships.
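The sketch below illustrates the mechanism on synthetic data rather than the Wage data. As an assumption made to keep the result easy to check, the two halves are generated from different cubics, so the independent fits visibly disagree at the knot. With real, noisy data the gap arises in the same way, because nothing ties the two fits together.

```python
import numpy as np

knot = 0.5
x = np.linspace(0.0, 1.0, 41)
left, right = x < knot, x >= knot

# Made-up response: a different cubic on each side of the knot
y = np.where(left, 1.0 + 2.0 * x - x**3, 0.5 + x + x**2)

# Fit each region independently (4 coefficients per cubic, 8 in total)
coef_left = np.polyfit(x[left], y[left], deg=3)
coef_right = np.polyfit(x[right], y[right], deg=3)

# Evaluate both fitted cubics at the knot; no constraint makes them agree
jump = np.polyval(coef_right, knot) - np.polyval(coef_left, knot)
print(f"jump at the knot: {jump:.3f}")
```

A nonzero `jump` is the discontinuity that the continuity constraints in the next section are designed to remove.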

## 7.4.2: Constraints and Splines

### The Problem with Unconstrained Piecewise Polynomials

The discontinuity in the top left panel of Figure 7.3 highlights the need for constraints to make piecewise polynomials smoother and more realistic. Splines address this by imposing constraints that ensure the fitted function is continuous and smooth at the knots.

### What Are Splines?

A spline is a piecewise polynomial where the polynomials are constrained to connect smoothly at the knots. The smoothness is controlled by requiring continuity in the function itself and its derivatives up to a certain order.

### Continuity Constraints

To address the discontinuity issue, constraints are added at the knots:

1. **Continuity of the Function**: The polynomials must meet at the knot, so there's no jump. For a knot at `X = c`, the left polynomial (`x < c`) and right polynomial (`x ≥ c`) must have the same value at `X = c`.
   - **Result**: The fitted curve is continuous (no gaps), as shown in the top right panel of Figure 7.3. However, this curve may still have a sharp "V-shaped" join at the knot, which looks unnatural because the slope changes abruptly.

2. **Continuity of the First Derivative**: The first derivatives (slopes) of the polynomials on either side of the knot must be equal. This ensures the curve is not only continuous but also has a smooth slope at the knot.

3. **Continuity of the Second Derivative**: The second derivatives (curvature) must also be equal at the knot, making the curve even smoother by avoiding abrupt changes in curvature.

### Cubic Splines

A cubic spline is a piecewise cubic polynomial with the following properties at each knot:
- The function is continuous.
- The first derivative is continuous.
- The second derivative is continuous.

These constraints make the cubic spline smooth and visually appealing. In the bottom left panel of Figure 7.3, a cubic spline is fit to the Wage data with a knot at age = 50, and it looks much smoother than the unconstrained piecewise cubic.

### Degrees of Freedom for Cubic Splines

Each constraint reduces the number of free parameters (degrees of freedom) in the model:

- Without constraints, a piecewise cubic with K knots has `4 × (K + 1)` degrees of freedom (4 parameters per region).
- Each knot imposes 3 constraints (continuity of the function, first derivative, and second derivative).
- For K knots, there are 3K constraints.
- The total degrees of freedom for a cubic spline is: `4 + K`

**Explanation**: Start with 4 degrees of freedom for the first cubic polynomial. Each additional region adds one more degree of freedom (not 4, because the 3 constraints at each knot "tie" the polynomials together). Thus, for K knots, the degrees of freedom are `4 + K`.

**Example**: In the bottom left panel of Figure 7.3 (`K = 1`), the cubic spline has `4 + 1 = 5` degrees of freedom, compared to 8 for the unconstrained piecewise cubic.

### Linear Splines

A linear spline is a piecewise linear polynomial (degree 1) with continuity at the knots. It fits straight lines in each region, ensuring the lines connect at the knots (no jumps). The bottom right panel of Figure 7.3 shows a linear spline with a knot at age = 50, which is continuous but less smooth than a cubic spline because it only enforces continuity of the function (not the slope).

### General Definition of Splines

A degree-d spline is a piecewise polynomial of degree d, with continuity in the function and its derivatives up to order `d - 1` at each knot:

- **Linear spline** (`d = 1`): Continuous function (0th derivative).
- **Quadratic spline** (`d = 2`): Continuous function and first derivative.
- **Cubic spline** (`d = 3`): Continuous function, first derivative, and second derivative.

The degrees of freedom for a degree-d spline with K knots is generally `d + 1 + K`, reflecting the base polynomial degree plus one degree of freedom per knot.
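
This count follows the same bookkeeping as in the cubic case. With K knots there are K + 1 regions, each polynomial has d + 1 coefficients, and each knot imposes d constraints (continuity of the function and of its first d - 1 derivatives):

$$(d + 1)(K + 1) - dK = d + 1 + K$$

Setting $d = 3$ recovers $4(K + 1) - 3K = K + 4$ for cubic splines, and $d = 1$ gives $K + 2$ for linear splines.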

### Advantages and Limitations

**Advantages**:
- **Flexibility**: Splines can capture complex non-linear patterns by adapting polynomials to local regions.
- **Smoothness**: Continuity constraints ensure the fitted curve is smooth and realistic.
- **Control**: The number of knots controls the trade-off between flexibility and overfitting.

**Limitations**:
- **Knot Selection**: Choosing the number and location of knots can be challenging. Too few knots may underfit; too many may overfit.
- **Computational Complexity**: Splines require more computation than simple linear regression, especially with many knots.
- **Interpretability**: While more interpretable than some non-parametric methods, splines are less intuitive than step functions or linear models.

## 7.4.3: The Spline Basis Representation

### 1. Representing Splines with Basis Functions

Regression splines, such as cubic splines, may seem complex because they involve fitting piecewise polynomials with continuity constraints (e.g., continuous function, first, and second derivatives at knots). However, they can be expressed as a linear combination of basis functions, which simplifies the fitting process. The general model for a cubic spline with K knots is:

```
yᵢ = β₀ + β₁b₁(xᵢ) + β₂b₂(xᵢ) + ⋯ + βₖ₊₃bₖ₊₃(xᵢ) + εᵢ
```

- `yᵢ`: The response variable (e.g., wage).
- `b₁(xᵢ), b₂(xᵢ), …, bₖ₊₃(xᵢ)`: A set of basis functions that transform the predictor `xᵢ` (e.g., age).
- `β₀, β₁, …, βₖ₊₃`: Coefficients estimated using least squares.
- `εᵢ`: The error term.

This model is a standard linear regression, where the predictors are the basis functions evaluated at `xᵢ`. The key is choosing appropriate basis functions to ensure the resulting function is a cubic spline (i.e., piecewise cubic polynomials with continuous function, first, and second derivatives at the knots).

### 2. Truncated Power Basis for Cubic Splines

One way to represent a cubic spline is using the truncated power basis, which builds on a standard cubic polynomial by adding terms to handle the knots.

#### Basis Functions for a Cubic Spline

To fit a cubic spline with K knots at locations `ξ₁, ξ₂, …, ξₖ`, the basis consists of:

- **A cubic polynomial basis**: `1, X, X², X³` (4 functions, corresponding to the intercept and the linear, quadratic, and cubic terms).
- **One truncated power basis function per knot**, defined as:

```
h(x, ξ) = (x - ξ)₊³ = {
  (x - ξ)³    if x > ξ
  0           otherwise
}
```

- `ξ`: The knot location (e.g., age = 50).
- `(x - ξ)₊³`: The truncated power function, which equals `(x - ξ)³` when `x > ξ` and 0 otherwise. This introduces a change in the cubic polynomial at the knot `ξ`.

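The full set of basis functions for a cubic spline with K knots is:

```
1, X, X², X³, h(X, ξ₁), h(X, ξ₂), …, h(X, ξₖ)
```

Counting the constant, this is a set of `K + 4` basis functions. In the regression, the constant plays the role of the intercept, so the model has an intercept plus `3 + K` predictors (`X, X², X³` and the K truncated power functions), for `K + 4` coefficients in total.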

#### Why Truncated Power Basis?

Adding a term like `β₄h(X, ξ)` to a cubic polynomial allows the function to change at the knot `ξ` while maintaining:

- **Continuity**: The function itself doesn't jump at `ξ`.
- **Continuity of first and second derivatives**: The slopes and curvatures are smooth at `ξ`.
- **Discontinuity in the third derivative**: The truncated power function introduces a change in the third derivative at the knot, which allows the spline to adapt to local patterns.

This ensures the resulting function is a cubic spline with the desired smoothness properties.
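
Differentiating the truncated power function shows where these properties come from:

$$h(x, \xi) = (x - \xi)_+^3, \quad h'(x, \xi) = 3(x - \xi)_+^2, \quad h''(x, \xi) = 6(x - \xi)_+, \quad h'''(x, \xi) = 6 \, I(x > \xi) \ \text{for } x \neq \xi$$

At $x = \xi$ the functions $h$, $h'$, and $h''$ are all equal to 0 and approach 0 from both sides, so adding $\beta_4 h(x, \xi)$ to a cubic leaves the function and its first two derivatives continuous at the knot. Only the third derivative jumps there, by $6\beta_4$.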

#### Degrees of Freedom

- A cubic spline with K knots has `K + 4` degrees of freedom, corresponding to the `K + 4` coefficients (`β₀, β₁, …, βₖ₊₃`).
- Intuitively, the 4 degrees of freedom come from the base cubic polynomial (`1, X, X², X³`), and each knot adds 1 degree of freedom by introducing a new truncated power function.

### Example: Cubic Spline with One Knot

Suppose we fit a cubic spline to the Wage data with one knot at `ξ₁ = 50` (age = 50). The model is:

```
yᵢ = β₀ + β₁xᵢ + β₂xᵢ² + β₃xᵢ³ + β₄h(xᵢ, 50) + εᵢ
```

Where:

```
h(xᵢ, 50) = (xᵢ - 50)₊³ = {
  (xᵢ - 50)³    if xᵢ > 50
  0             if xᵢ ≤ 50
}
```

- For `xᵢ ≤ 50`, the model is a standard cubic polynomial: `β₀ + β₁xᵢ + β₂xᵢ² + β₃xᵢ³`.
- For `xᵢ > 50`, the model includes the additional term `β₄(xᵢ - 50)³`, allowing the cubic polynomial to change at age = 50 while remaining smooth.
- This model has `1 + 4 = 5` degrees of freedom (4 for the cubic polynomial + 1 for the knot).

The coefficients are estimated using least squares, treating the basis functions as predictors in a linear regression.
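
As a minimal sketch of this fit, the code below builds the five basis columns and solves the least squares problem with NumPy. The response is synthetic (a made-up cubic spline with a knot at 50), not the Wage data, and the helper names `truncated_cubic` and `cubic_spline_design` are invented for this example.

```python
import numpy as np

def truncated_cubic(x, knot):
    """h(x, knot) = (x - knot)^3 for x > knot, and 0 otherwise."""
    return np.where(x > knot, (x - knot) ** 3, 0.0)

def cubic_spline_design(x, knots):
    """Columns: 1, x, x^2, x^3, h(x, knot_1), ..., h(x, knot_K)."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [truncated_cubic(x, k) for k in knots]
    return np.column_stack(cols)

age = np.arange(18.0, 81.0)  # ages 18, 19, ..., 80

# Made-up response that is itself a cubic spline with one knot at 50
wage = (50 + 2 * age - 0.01 * age**2 + 1e-4 * age**3
        - 2e-3 * truncated_cubic(age, 50.0))

X = cubic_spline_design(age, [50.0])          # 4 + 1 = 5 columns
beta_hat, *_ = np.linalg.lstsq(X, wage, rcond=None)
fitted = X @ beta_hat
print(X.shape, float(np.max(np.abs(fitted - wage))))
```

Raw powers of age are poorly scaled, so the individual coefficients can be numerically sensitive. In practice, software typically uses a better-conditioned basis for the same family of functions.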

## 7.4.4: Choosing the Number and Locations of the Knots

### Why Knot Placement Matters

- Knots define the boundaries between regions where different polynomials are fitted in a spline model. In a cubic spline, for example, the function is a piecewise cubic polynomial, and knots determine where these polynomials change.
- The flexibility of a regression spline is greatest in regions with many knots because the polynomial coefficients can change rapidly, allowing the model to adapt to local patterns in the data.
- Conversely, regions with fewer or no knots are less flexible, as the model relies on a single polynomial over a larger range.

### Strategies for Placing Knots

The authors discuss two main approaches to choosing knot locations:

#### 1. Domain Knowledge-Based Placement

- Place more knots where the function is expected to vary rapidly (e.g., where the relationship between X and y changes significantly, such as at key age thresholds in the Wage dataset, like retirement age).
- Place fewer knots where the function is stable (e.g., where the relationship is relatively flat or linear).
- **Pros**: This approach leverages prior knowledge about the data-generating process, making the model more interpretable and tailored to the problem.
- **Cons**: Requires domain expertise, which may not always be available, and can be subjective.

#### 2. Uniform or Data-Driven Placement

- A common practice is to place knots uniformly across the range of the predictor X. One way to do this is to use quantiles of the data.
- For example, knots can be placed at the 25th, 50th, and 75th percentiles of X, ensuring an even distribution of knots across the data.
- This is often automated by specifying the degrees of freedom (df), and the software (e.g., R's splines package or Python's patsy) selects knot locations at uniform quantiles.
- **Example in Figure 7.5**: A natural cubic spline is fit to the Wage data with three knots placed at the 25th, 50th, and 75th percentiles of age. The book reports this fit as having 4 degrees of freedom: with `K = 3` interior knots the natural spline has `K + 2 = 5` free coefficients, one of which is the intercept, and the book's count leaves the intercept out.
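
Placing knots at uniform quantiles is a one-line computation. The sketch below uses `np.quantile` on illustrative values; the function name `quantile_knots` is made up for this example.

```python
import numpy as np

def quantile_knots(x, n_knots):
    """Interior knots at the quantiles 1/(K+1), 2/(K+1), ..., K/(K+1) of x."""
    probs = np.arange(1, n_knots + 1) / (n_knots + 1)
    return np.quantile(x, probs)

age = np.arange(0.0, 101.0)  # illustrative values 0, 1, ..., 100
print(quantile_knots(age, 3))
```

With three knots the probabilities are 0.25, 0.5, and 0.75, so the knots fall at the 25th, 50th, and 75th percentiles of the data.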

### Why 4 Degrees of Freedom = 3 Knots for Natural Splines?

- A natural cubic spline with K interior knots has `K + 2` free coefficients, counting the intercept:
  - Including the two boundary knots, there are `K + 2` knots in total, so an ordinary cubic spline on these knots has `(K + 2) + 4 = K + 6` degrees of freedom.
  - A natural spline is constrained to be linear beyond each boundary knot. This imposes two constraints at each boundary (zero second and third derivatives), removing 4 degrees of freedom.
  - Thus `K + 6 - 4 = K + 2`.
- For `K = 3`, this gives 5 degrees of freedom. One of them is the constant term, which is absorbed into the intercept, so the book counts the fit as 4 degrees of freedom (this is the point of its "somewhat technical" footnote). Under this convention, requesting `df` degrees of freedom yields `df - 1` interior knots.

### Choosing the Number of Knots

The number of knots (K) determines the spline's flexibility:

- **More knots** = more flexibility, as the model can fit more complex patterns by allowing polynomial changes in more regions.
- **Fewer knots** = less flexibility, potentially underfitting complex relationships but avoiding overfitting.

The authors suggest two approaches to select the number of knots (or equivalently, the degrees of freedom):

#### 1. Subjective Approach: Visual Inspection

- Fit splines with different numbers of knots and visually assess which produces the "best-looking" curve.
- This is subjective and depends on the analyst's judgment of what constitutes a good fit (e.g., smooth but capturing key trends).
- **Limitation**: This approach is not systematic and may lead to biased or inconsistent choices.

#### 2. Objective Approach: Cross-Validation

- Use cross-validation (discussed in Chapters 5 and 6) to select the number of knots that minimizes prediction error.
- **Process**:
  - Remove a portion of the data (e.g., 10% in 10-fold cross-validation).
  - Fit a spline with a specific number of knots (K) to the remaining data.
  - Predict the response for the held-out portion.
  - Repeat until all observations have been held out once, and compute the cross-validated residual sum of squares (RSS) (or mean squared error, MSE).
  - Repeat for different values of K, and choose the K that yields the smallest cross-validated RSS.
- **Figure 7.6**: Shows the 10-fold cross-validated MSE for natural cubic splines (left panel) and standard cubic splines (right panel) on the Wage data, plotted against degrees of freedom:
  - A spline with 1 degree of freedom (equivalent to linear regression) has high MSE, indicating underfitting.
  - The MSE decreases rapidly as degrees of freedom increase to 3 (natural spline) or 4 (cubic spline), then flattens out, suggesting that additional knots add little improvement.
  - This indicates that about 3 degrees of freedom for the natural spline, or 4 for the cubic spline, are sufficient for the Wage data. How many knots this corresponds to depends on whether the intercept is counted among the degrees of freedom.
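
The cross-validation procedure above can be sketched as follows. This is a minimal illustration on synthetic data, not the Wage data. As assumptions for the example, it uses the truncated power basis for a cubic spline and places knots at uniform quantiles of each training fold; the function names `cubic_spline_basis` and `cv_mse` are made up.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis: 1, x, x^2, x^3, (x - knot)^3_+ for each knot."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def cv_mse(x, y, n_knots, n_folds=10, seed=0):
    """K-fold cross-validated mean squared error for a cubic spline."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    squared_error = 0.0
    for held_out in folds:
        train = np.ones(len(x), dtype=bool)
        train[held_out] = False
        # Choose knots from the training fold only
        if n_knots > 0:
            probs = np.arange(1, n_knots + 1) / (n_knots + 1)
            knots = np.quantile(x[train], probs)
        else:
            knots = np.array([])
        B_train = cubic_spline_basis(x[train], knots)
        beta, *_ = np.linalg.lstsq(B_train, y[train], rcond=None)
        pred = cubic_spline_basis(x[held_out], knots) @ beta
        squared_error += np.sum((y[held_out] - pred) ** 2)
    return squared_error / len(x)

# Synthetic data: a smooth curve plus noise (illustrative only)
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

scores = {k: cv_mse(x, y, k) for k in range(0, 6)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

The same fold assignment is reused for every candidate number of knots (fixed `seed`), so differences between scores reflect the models rather than the random split.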

## 7.5: Smoothing Splines

Smoothing splines provide an alternative approach to fitting smooth, non-linear curves to data, distinct from the regression splines discussed in Section 7.4. Unlike regression splines, which require specifying knot locations and use least squares to estimate coefficients for a fixed set of basis functions, smoothing splines automatically place knots at all unique data points and control smoothness through a penalty term.

### 1. Goal of Smoothing Splines

The goal is to find a smooth function $g(x)$ that fits a set of observations $(x_i, y_i), i = 1, \dots, n$ well while avoiding overfitting. Specifically:

- We want the residual sum of squares (RSS) to be small:
  $$\text{RSS} = \sum_{i=1}^n (y_i - g(x_i))^2$$

- However, without constraints, we could choose a function $g(x)$ that interpolates (passes exactly through) every data point, making RSS = 0. This would overfit the data, capturing noise rather than the underlying pattern.

To balance fit (low RSS) and smoothness (avoiding overfitting), smoothing splines introduce a penalty term that encourages $g(x)$ to be smooth.

### 2. The Smoothing Spline Objective

A smoothing spline finds the function $g(x)$ that minimizes the following objective:

$$\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt$$

#### Components:

- **Loss Function**: $\sum_{i=1}^n (y_i - g(x_i))^2$ is the RSS, measuring how well $g(x)$ fits the data.

- **Penalty Term**: $\lambda \int g''(t)^2 \, dt$ penalizes the roughness of $g(x)$.

- **Tuning Parameter** $\lambda$: A nonnegative value that controls the trade-off between fit and smoothness.

#### The Penalty Term:

- $g''(t)$: The second derivative of $g(t)$, which measures the curvature or "wiggliness" of the function. A large $|g''(t)|$ indicates a rapidly changing slope (e.g., a sharp bend), while $g''(t) = 0$ for a straight line (perfectly smooth).

- $\int g''(t)^2 \, dt$: The integral sums the squared second derivative over the range of $t$, quantifying the total roughness of $g(x)$. A smooth function (e.g., a line) has a small integral; a wiggly function has a large integral.

#### Role of $\lambda$:

- When $\lambda = 0$: No penalty, so $g(x)$ interpolates all data points, leading to an overfit, wiggly curve.

- When $\lambda \to \infty$: The penalty dominates, forcing $g(x)$ to be a linear function (since $g''(t) = 0$ for a line), resulting in a simple linear regression that may underfit.

- Intermediate $\lambda$: Balances fit and smoothness, producing a curve that approximates the data while remaining smooth.
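
The two limiting cases can be seen in a discrete analogue of the objective. The sketch below is not a smoothing spline implementation: as a simplifying assumption, it replaces $g$ by its values on an equally spaced grid and replaces $\int g''(t)^2 \, dt$ by the sum of squared second differences, which gives a closed-form solution. The data and function names are made up for illustration.

```python
import numpy as np

def second_difference_matrix(n):
    """(n-2) x n matrix D with (D g)[i] = g[i] - 2 g[i+1] + g[i+2]."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def penalized_fit(y, lam):
    """Minimize sum (y_i - g_i)^2 + lam * sum (second differences of g)^2."""
    n = len(y)
    D = second_difference_matrix(n)
    # Setting the gradient to zero gives (I + lam * D^T D) g = y
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

g_rough = penalized_fit(y, 0.0)   # no penalty: reproduces the data exactly
g_mid = penalized_fit(y, 1.0)     # intermediate penalty: smoother than the data
g_line = penalized_fit(y, 1e9)    # heavy penalty: essentially a straight line

ols_line = np.polyval(np.polyfit(x, y, 1), x)
print(float(np.max(np.abs(g_line - ols_line))))
```

In this analogue a zero penalty returns the observations themselves, and a very large penalty drives the fit to the least squares line, mirroring the behavior described for $\lambda = 0$ and $\lambda \to \infty$.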

#### Bias-Variance Trade-Off:

- Small $\lambda$: Low bias (fits data closely) but high variance (overfits, sensitive to noise).
- Large $\lambda$: High bias (oversmooths, may miss patterns) but low variance (stable predictions).

### 3. Properties of the Smoothing Spline

The function $g(x)$ that minimizes the above objective has special properties:

- It is a natural cubic spline with knots at all unique values of $x_1, x_2, \dots, x_n$ (i.e., every distinct data point is a knot).

- It has continuous first and second derivatives at each knot, ensuring smoothness.

- It is linear in the regions beyond the smallest and largest data points (similar to a natural cubic spline, as discussed in Section 7.4.3).

However, it is not the same natural cubic spline that least squares would produce from the basis-function approach of Section 7.4.3 with knots at $x_1, \dots, x_n$. It is a shrunken version of that spline, and the tuning parameter $\lambda$ controls the level of shrinkage:

- A smaller $\lambda$ allows the spline to be more flexible, moving it toward the unshrunken least-squares spline, which interpolates the data.
- A larger $\lambda$ shrinks the spline toward a linear function, reducing its flexibility.

### 4. Comparison to Regression Splines

#### Regression Splines (Section 7.4):
- Require specifying the number and locations of knots (e.g., at quantiles or based on domain knowledge).
- Use a fixed set of basis functions (e.g., truncated power basis) and estimate coefficients via least squares.
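- Degrees of freedom are determined by the number of knots: a cubic spline with $K$ knots uses $K + 4$ (counting the intercept). A natural cubic spline gives up four of these to its boundary linearity constraints, leaving $K$ when the two boundary knots are counted among the $K$ (equivalently $K + 2$ when $K$ counts interior knots only).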

#### Smoothing Splines:
- Place knots at all unique data points ($x_1, \dots, x_n$), so no manual knot selection is needed.
- Control flexibility through the penalty term ($\lambda$) rather than the number of knots.
- The effective degrees of freedom are controlled by $\lambda$, not the number of knots, and can be tuned continuously (e.g., via cross-validation).

### 5. Advantages and Limitations

#### Advantages:
- **Automatic Knot Placement**: Knots at all unique $x_i$ eliminate the need to choose knot locations, simplifying the modeling process.
- **Flexible Smoothness Control**: The tuning parameter $\lambda$ allows continuous adjustment of smoothness, unlike regression splines, which rely on discrete knot choices.
- **Smooth Fit**: Produces a natural cubic spline with smooth derivatives, ideal for modeling continuous, non-linear relationships (e.g., wage vs. age).

#### Limitations:
- **Computational Cost**: With knots at every unique data point, smoothing splines can be computationally intensive for large datasets (though efficient algorithms exist).
- **Tuning $\lambda$**: Selecting $\lambda$ typically relies on cross-validation, which adds some computational overhead, although the LOOCV shortcut described in Section 7.5.2 keeps that cost modest.
- **Boundary Behavior**: Like natural cubic splines, smoothing splines are linear at the boundaries, which stabilizes estimates but may oversimplify relationships in sparse regions.

## 7.5.2: Choosing the Smoothing Parameter λ

### 1. The Role of λ and Effective Degrees of Freedom

#### Smoothing Spline Recap:
A smoothing spline is a natural cubic spline with knots at every unique data point $x_1, \dots, x_n$, minimizing the objective:

$$\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt$$

Here, $\lambda$ controls the balance between the residual sum of squares (RSS) (fit to the data) and the penalty term (smoothness, measured by the integral of the squared second derivative).

#### Why So Many Knots?
Unlike regression splines (Section 7.4), which use a fixed number of knots, smoothing splines place knots at all unique $x_i$. This could suggest $n$ degrees of freedom (one per data point), which would lead to an overly flexible model that interpolates all points (overfitting).

#### Role of λ:
The tuning parameter $\lambda$ constrains the flexibility of the spline, effectively reducing the degrees of freedom to a value called the effective degrees of freedom, denoted $\text{df}_\lambda$.

- As $\lambda \to 0$: No penalty, so the spline interpolates all data points, and $\text{df}_\lambda \approx n$ (maximum flexibility, high variance).

- As $\lambda \to \infty$: The penalty dominates, forcing $g(x)$ to be a linear function ($g''(t) = 0$), and $\text{df}_\lambda \to 2$ (equivalent to a linear regression, high bias).

- Intermediate $\lambda$: Produces a smooth curve with $2 < \text{df}_\lambda < n$, balancing bias and variance.

#### Effective Degrees of Freedom ($\text{df}_\lambda$):
- In an ordinary regression model, degrees of freedom count the free coefficients. A smoothing spline has $n$ nominal parameters (one per knot), but the penalty term heavily constrains them, so a raw parameter count overstates its flexibility.
- The effective degrees of freedom ($\text{df}_\lambda$) measures the model's flexibility, accounting for the shrinkage induced by $\lambda$.

### 2. Definition of Effective Degrees of Freedom

The fitted smoothing spline can be expressed as:

$$\hat{g}_\lambda = S_\lambda y$$

Where:
- $\hat{g}_\lambda$: An $n$-vector of fitted values at the training points $x_1, \dots, x_n$.
- $y$: The $n$-vector of response values $(y_1, \dots, y_n)$.
- $S_\lambda$: An $n \times n$ matrix (called the smoothing matrix or hat matrix), which transforms the response vector $y$ into the fitted values. This matrix depends on $\lambda$ and the knot locations (all unique $x_i$).

The effective degrees of freedom is defined as:

$$\text{df}_\lambda = \sum_{i=1}^n \{S_\lambda\}_{ii}$$

Where:
- $\{S_\lambda\}_{ii}$: The $i$-th diagonal element of the smoothing matrix $S_\lambda$.

#### Interpretation:
The sum of the diagonal elements of $S_\lambda$ quantifies the model's flexibility. It represents the "effective" number of parameters after accounting for the smoothness constraint imposed by $\lambda$.

- If $\text{df}_\lambda \approx n$, the spline is highly flexible (interpolates data points).
- If $\text{df}_\lambda \approx 2$, the spline is nearly linear (like a straight line).
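The trace definition is easy to check numerically. The sketch below again uses a discrete second-difference penalty on equally spaced points as a stand-in for the smoothing spline (an assumption made for illustration), so that the smoother matrix has the closed form $S_\lambda = (I + \lambda D^T D)^{-1}$. The function names are hypothetical.

```python
import numpy as np

def smoother_matrix(n, lam):
    """S_lam = (I + lam * D^T D)^{-1} for a second-difference penalty on n points."""
    D = np.diff(np.eye(n), n=2, axis=0)      # (n-2) x n second-difference operator
    return np.linalg.inv(np.eye(n) + lam * (D.T @ D))

def effective_df(n, lam):
    """Effective degrees of freedom: the sum of the diagonal elements of S_lam."""
    return float(np.trace(smoother_matrix(n, lam)))

n = 25
lams = [0.0, 0.1, 10.0, 1e3, 1e9]
dfs = [effective_df(n, lam) for lam in lams]
for lam, df in zip(lams, dfs):
    print(f"lambda = {lam:>8g}   df = {df:.3f}")
```

The printed values fall from $n$ toward 2 as $\lambda$ grows, because the penalty leaves only the two-dimensional space of straight lines unshrunk.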

### 3. Choosing λ with Cross-Validation

#### Problem:
Unlike regression splines, where you choose the number and location of knots, smoothing splines require selecting the smoothing parameter $\lambda$.

#### Solution:
Use cross-validation to find the $\lambda$ that minimizes the cross-validated residual sum of squares (RSS).

#### Leave-One-Out Cross-Validation (LOOCV):
In LOOCV, for each observation $(x_i, y_i)$, fit the model on all data except the $i$-th observation, predict $y_i$, and compute the squared error $(y_i - \hat{g}_\lambda^{(-i)}(x_i))^2$, where $\hat{g}_\lambda^{(-i)}(x_i)$ is the prediction for $x_i$ using the model fit without the $i$-th observation.

The LOOCV error is:

$$\text{RSS}_{\text{cv}}(\lambda) = \sum_{i=1}^n (y_i - \hat{g}_\lambda^{(-i)}(x_i))^2$$

Computing LOOCV directly (fitting $n$ models, each leaving out one observation) is computationally expensive.

#### Efficient LOOCV Formula for Smoothing Splines:
Smoothing splines have a remarkable property that allows computing the LOOCV error efficiently using the original fit (without refitting $n$ models):

$$\text{RSS}_{\text{cv}}(\lambda) = \sum_{i=1}^n \left( \frac{y_i - \hat{g}_\lambda(x_i)}{1 - \{S_\lambda\}_{ii}} \right)^2$$

Where:
- $\hat{g}_\lambda(x_i)$: The fitted value at $x_i$ using the full dataset.
- $\{S_\lambda\}_{ii}$: The $i$-th diagonal element of the smoothing matrix.

#### Intuition:
This formula adjusts the residual $y_i - \hat{g}_\lambda(x_i)$ by a factor $\frac{1}{1 - \{S_\lambda\}_{ii}}$, accounting for the influence of the $i$-th observation on the fit. It avoids the need to refit the model $n$ times, making LOOCV computationally efficient (comparable to fitting a single smoothing spline).
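The shortcut can be checked against brute-force LOOCV on a small example. As a simplifying assumption, the sketch uses a discrete second-difference smoother on equally spaced points rather than a real smoothing spline; the same identity applies because the fit is a penalized least-squares problem that is linear in $y$. Leaving out observation $i$ is implemented by giving it zero weight, and `fit_weighted` is a hypothetical helper.

```python
import numpy as np

def fit_weighted(y, w, lam):
    """Minimize sum w_i (y_i - g_i)^2 + lam * ||D g||^2; w_i = 0 drops observation i."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)
    return np.linalg.solve(np.diag(w) + lam * (D.T @ D), w * y)

rng = np.random.default_rng(1)
n, lam = 20, 2.0
x = np.linspace(0.0, 1.0, n)
y = np.cos(3 * x) + rng.normal(scale=0.2, size=n)

# Full fit and its smoother matrix S_lam (all weights equal to one).
D = np.diff(np.eye(n), n=2, axis=0)
S = np.linalg.inv(np.eye(n) + lam * (D.T @ D))
g_hat = S @ y

# Shortcut: a single fit, with each residual rescaled by 1 / (1 - S_ii).
rss_cv_shortcut = float(np.sum(((y - g_hat) / (1.0 - np.diag(S))) ** 2))

# Brute force: refit n times, each time removing one observation.
rss_cv_brute = 0.0
for i in range(n):
    w = np.ones(n)
    w[i] = 0.0
    g_minus_i = fit_weighted(y, w, lam)
    rss_cv_brute += float((y[i] - g_minus_i[i]) ** 2)

print(rss_cv_shortcut, rss_cv_brute)
```

The two numbers agree up to floating-point error, while the shortcut needs one fit instead of $n$.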

#### Connection to Linear Regression:
A similar LOOCV formula exists for linear regression (Equation 5.2, page 205), where the hat matrix $H = X(X^T X)^{-1} X^T$ is used. The smoothing spline's $S_\lambda$ plays an analogous role: the fit is non-linear in $x$ but still linear in the responses $y$, which is what lets the shortcut carry over.

## 7.6 Local Regression

### What is Local Regression?

Local regression (also called locally weighted regression or LOESS/LOWESS) is a method for fitting a flexible, non-linear model to data by focusing on local subsets of the data around a target point $x_0$. Instead of fitting a single global model (like a straight line or polynomial) to the entire dataset, local regression fits a model at each point of interest using only nearby observations, weighted by their proximity to $x_0$. This allows the model to capture non-linear patterns in the data.

### Key Idea

Imagine you want to predict the value of a function $f(x)$ at a specific point $x_0$. Local regression assumes that the function behaves roughly linearly (or in some simple form) in a small neighborhood around $x_0$. By focusing only on the data points close to $x_0$, it fits a weighted regression model tailored to that local region. This process is repeated for every point where you want to make a prediction, making it a "memory-based" method since it requires the full training dataset each time.

### Algorithm 7.1 Explained

The steps in Algorithm 7.1 outline how local regression works at a specific point $x_0$:

#### 1. Select Nearby Points
Gather the fraction $s = k/n$ of training points whose $x_i$ are closest to $x_0$, where $k$ is the number of points kept and $n$ is the total number of training points. These $k$ points form the "neighborhood" around $x_0$.

The parameter $s$ (called the span) controls how large this neighborhood is:
- A smaller $s$ means fewer points are used, leading to a wigglier, more local fit
- A larger $s$ includes more points, making the fit smoother and more global

#### 2. Assign Weights
Each of the $k$ points in the neighborhood is assigned a weight $K_{i0} = K(x_i, x_0)$, based on how close it is to $x_0$.

- Points closer to $x_0$ get higher weights, while those farther away get lower weights (the farthest point in the neighborhood may have a weight of zero)
- The weighting function $K$ (e.g., a Gaussian kernel or tricube kernel) determines how quickly the weights decay with distance
- Points outside the neighborhood get zero weight

#### 3. Fit a Weighted Regression
Fit a simple regression model (typically linear, as in Equation 7.14) to the $k$ points in the neighborhood, where the contribution of each point is weighted by $K_{i0}$.

The model minimizes the weighted sum of squared errors:
$$\sum_{i=1}^n K_{i0} (y_i - \beta_0 - \beta_1 x_i)^2$$

Here, $\beta_0$ and $\beta_1$ are the intercept and slope of the local linear regression, and $y_i$ are the observed responses.

#### 4. Predict at $x_0$
Use the fitted model to predict the value at $x_0$:
$$\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$$

This process is repeated for every target point $x_0$, with a new set of weights and a new regression model each time.
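The four steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than the book's implementation: it assumes a tricube kernel scaled so that the farthest neighbor gets weight zero, and the function names `tricube` and `local_linear_predict` are made up for this example.

```python
import numpy as np

def tricube(u):
    """Tricube kernel on scaled distances u in [0, 1]; zero at and beyond u = 1."""
    u = np.clip(np.abs(u), 0.0, 1.0)
    return (1.0 - u ** 3) ** 3

def local_linear_predict(x, y, x0, span=0.5):
    """Local linear regression estimate of f(x0)."""
    n = len(x)
    k = min(n, max(3, int(np.ceil(span * n))))   # step 1: neighborhood size k = s * n
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]                   # indices of the k nearest points
    # Step 2: weights decay with distance; the farthest neighbor gets weight zero.
    w = tricube(dist[idx] / dist[idx].max())
    # Step 3: weighted least squares for (beta0, beta1) on the neighborhood.
    X = np.column_stack([np.ones(k), x[idx]])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[idx] * sw, rcond=None)
    # Step 4: evaluate the local line at x0.
    return float(beta[0] + beta[1] * x0)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# A new weighted fit is carried out for each target point.
for x0 in (0.05, 0.4):
    print(f"f_hat({x0}) = {local_linear_predict(x, y, x0, span=0.3):.3f}")
```

Multiplying rows by the square root of the weights turns the weighted problem into an ordinary least-squares problem, which is why `np.linalg.lstsq` suffices here.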

### Key Choices in Local Regression

To perform local regression, you need to make several decisions:

**Weighting Function ($K$)**: Determines how weights are assigned based on distance. Common choices include Gaussian or tricube kernels.

**Type of Local Model**: The book focuses on linear regression (as in Equation 7.14), but you could use a constant or quadratic model instead. Linear is most common for simplicity.

**Span ($s$)**: The most critical parameter. It controls the trade-off between bias and variance:
- Small $s$: Uses fewer points, leading to a highly flexible, wiggly fit that closely follows the data in the local region (low bias, high variance)
- Large $s$: Uses more points, producing a smoother, more global fit (high bias, low variance)
- You can choose $s$ using cross-validation or set it manually

**Distance Metric**: Defines what "close" means. For one-dimensional data this is simply the absolute difference $|x_i - x_0|$; in higher dimensions Euclidean distance is typical, though other metrics can be applied.
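The effect of the span can be illustrated on simulated data (not the Wage data). The sketch below assumes a tricube kernel and a local linear model, and summarizes each fit by its training RSS and by a crude roughness measure (sum of squared second differences of the fitted values). Both summaries and the function name are choices made for this illustration.

```python
import numpy as np

def local_linear_fit(x, y, span):
    """Local linear regression (tricube weights) evaluated at every training x."""
    n = len(x)
    k = min(n, max(3, int(np.ceil(span * n))))
    fitted = np.empty(n)
    for j, x0 in enumerate(x):
        dist = np.abs(x - x0)
        idx = np.argsort(dist)[:k]
        w = (1.0 - (dist[idx] / dist[idx].max()) ** 3) ** 3
        # Center at x0 so that the fitted intercept is the prediction at x0.
        X = np.column_stack([np.ones(k), x[idx] - x0])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y[idx] * sw, rcond=None)
        fitted[j] = beta[0]
    return fitted

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

summary = {}
for span in (0.2, 0.7):
    fit = local_linear_fit(x, y, span)
    summary[span] = {
        "train_rss": float(np.sum((y - fit) ** 2)),
        "roughness": float(np.sum(np.diff(fit, n=2) ** 2)),
    }
    print(span, summary[span])
```

The smaller span follows the training data more closely and produces a rougher curve; the larger span gives a smoother curve at the price of a worse training fit. Training RSS alone always favors small spans, which is why cross-validation is the usual way to pick $s$.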

### Intuition with Figure 7.9 and 7.10

**Figure 7.9**: This figure (from the book) shows simulated data with the true function $f(x)$ (blue line) and the local regression estimate $\hat{f}(x)$ (orange line). At a target point like $x_0 = 0.4$, the fit is based only on nearby points, allowing the model to adapt to the local shape of the data. Near the boundary (e.g., $x_0 = 0.05$), fewer points are available, so the fit may be less stable.

**Figure 7.10**: This shows local regression applied to the Wage dataset with two spans ($s = 0.7$ and $s = 0.2$). The larger span ($s = 0.7$) produces a smoother curve because it includes more points, while the smaller span ($s = 0.2$) results in a wigglier fit that closely tracks local variations.

### Extensions to Multiple Dimensions

Local regression can be extended to multiple features $(X_1, X_2, \ldots, X_p)$:

**Varying Coefficient Models**: Some variables (e.g., time) can be treated locally, while others are treated globally in a multiple linear regression. This is useful for adapting to recent data trends.

**Multidimensional Neighborhoods**: For two features $(X_1, X_2)$, you can define a two-dimensional neighborhood and fit a bivariate linear regression. For example, points are selected based on their distance in 2D space from $x_0$.

**Challenges in High Dimensions**: Local regression struggles when $p$ (the number of features) is large (e.g., $p > 3$ or $4$). In high dimensions, data points become sparse, and there may be too few points near $x_0$ to fit a reliable model. This is similar to the "curse of dimensionality" faced by nearest-neighbors methods (discussed in Chapter 3).
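A short calculation makes the sparsity problem concrete. Assume, purely for illustration, that the data are uniformly distributed on the unit hypercube $[0, 1]^p$ and that the neighborhood is itself a hypercube. To capture a fraction $s$ of the data, its edge length must be $s^{1/p}$.

```python
def edge_length(span, p):
    """Edge of a hypercube neighborhood holding a fraction `span` of data uniform on [0, 1]^p."""
    return span ** (1.0 / p)

for p in (1, 2, 3, 10):
    print(f"p = {p:>2}: edge length = {edge_length(0.1, p):.3f}")
```

With $p = 10$, a neighborhood containing 10% of the data must span about 80% of the range of every feature, so it is no longer "local" in any useful sense.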

### Strengths and Weaknesses

#### Strengths
- Highly flexible and can capture complex non-linear relationships
- Intuitive: focuses on local patterns in the data
- Extends naturally to a few dimensions or to varying coefficient models

#### Weaknesses
- Computationally intensive: the method is memory-based, so every prediction needs the full training set and a new weighted fit, which is slow for large datasets
- Sensitive to the choice of $s$: Too small, and the fit is noisy; too large, and it oversmooths
- Poor performance in high dimensions due to data sparsity
- Boundary effects: Near the edges of the data range (e.g., $x_0 = 0.05$), fewer points are available, leading to less reliable fits

### Connection to Other Methods

**Smoothing Splines**: Like smoothing splines (also in Chapter 7), local regression controls flexibility via a tuning parameter ($s$ vs. $\lambda$). However, splines fit a single global model, while local regression fits many local models.

**Nearest Neighbors**: Local regression is similar to k-nearest neighbors (k-NN) in that it uses nearby points, but it fits a smooth model (e.g., linear) rather than averaging the neighbors' values directly.

**Kernel Methods**: The weighting function $K$ is akin to a kernel in kernel regression, which also weights observations by proximity.

## 7.7.1 GAMs for Regression Problems

### What are Generalized Additive Models (GAMs)?

GAMs are an extension of multiple linear regression that allow for non-linear relationships between each predictor (feature) and the response variable while maintaining an additive structure. In a standard multiple linear regression model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$$

each predictor $x_{ij}$ has a linear effect on the response $y_i$. GAMs generalize this by replacing the linear terms $\beta_j x_{ij}$ with flexible, non-linear functions $f_j(x_{ij})$:

$$y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i$$

This is shown in Equation (7.15). The term "additive" refers to the fact that the contributions of each predictor's function $f_j$ are added together to predict the response. Each $f_j$ can be any smooth, non-linear function, allowing GAMs to capture complex patterns in the data without assuming linearity.

### Example: Wage Data

The book provides an example using the Wage dataset, where the response is wage, and the predictors are year, age, and education. The GAM model is:

$$\text{wage} = \beta_0 + f_1(\text{year}) + f_2(\text{age}) + f_3(\text{education}) + \epsilon$$

(Equation 7.16). Here:
- **year** and **age** are quantitative variables, so $f_1$ and $f_2$ are smooth, non-linear functions (e.g., splines or local regression)
- **education** is qualitative (categorical) with five levels: <HS, HS, <Coll, Coll, >Coll. For this variable, $f_3$ is modeled using a step function (dummy variables), assigning a constant value to each level

### How GAMs Work

GAMs leverage the methods discussed earlier in Chapter 7 (e.g., polynomial regression, splines, local regression) as building blocks to model each $f_j$. The key is that each predictor's effect is modeled independently, and their contributions are summed. This makes GAMs flexible yet interpretable, as you can examine the effect of each predictor separately.

### Fitting GAMs

The book describes two approaches to fitting GAMs on the Wage dataset:

#### Natural Splines (Figure 7.11)

**Method:**
- For **year** and **age**, natural splines are used to model $f_1$ and $f_2$. Natural splines are constructed using basis functions (as discussed in Section 7.4), which transform the predictors into a set of variables that can be included in a linear regression framework
- For **education**, dummy variables are used to represent the five categorical levels
- The entire model is fit using least squares, treating it as a large regression problem with spline basis functions and dummy variables combined into one big regression matrix
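The "one big regression matrix" idea can be sketched with NumPy. The example below uses synthetic data shaped loosely like the Wage example (it is not the real dataset), and builds the natural cubic spline basis from the standard truncated-power construction. The helper `natural_cubic_basis`, the knot placement at quantiles, and all simulated coefficients are assumptions made for illustration.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Natural cubic spline basis without intercept, for sorted `knots` (boundary knots included).

    Returns len(knots) - 1 columns: x itself plus len(knots) - 2 curved terms,
    each of which is linear beyond the first and last knot.
    """
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    K = len(knots)

    def d(k):
        num = np.maximum(x - knots[k], 0.0) ** 3 - np.maximum(x - knots[-1], 0.0) ** 3
        return num / (knots[-1] - knots[k])

    cols = [x] + [d(k) - d(K - 2) for k in range(K - 2)]
    return np.column_stack(cols)

rng = np.random.default_rng(2)
n = 300
year = rng.uniform(0.0, 6.0, n)               # years since a hypothetical baseline
age = rng.uniform(18.0, 80.0, n)
edu = rng.integers(0, 5, n)                   # five education levels coded 0..4
wage = (90.0 + 1.0 * year - 0.03 * (age - 45.0) ** 2 + 8.0 * edu
        + rng.normal(scale=5.0, size=n))      # synthetic response

year_knots = np.quantile(year, [0, 0.25, 0.5, 0.75, 1])
age_knots = np.quantile(age, [0, 0.2, 0.4, 0.6, 0.8, 1])
B_year = natural_cubic_basis(year, year_knots)             # 4 columns for f_1
B_age = natural_cubic_basis(age, age_knots)                # 5 columns for f_2
dummies = (edu[:, None] == np.arange(1, 5)).astype(float)  # level 0 is the baseline

# One design matrix: intercept, spline columns, and dummy variables together.
X = np.column_stack([np.ones(n), B_year, B_age, dummies])
coef, *_ = np.linalg.lstsq(X, wage, rcond=None)
fitted = X @ coef
r_squared = 1.0 - np.sum((wage - fitted) ** 2) / np.sum((wage - wage.mean()) ** 2)
print(X.shape, round(float(r_squared), 3))
```

Each $f_j$ is recovered by multiplying its block of basis columns by the matching block of `coef`, for example `B_age @ coef[5:10]` for the age effect.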

**Results (Figure 7.11):**
- **Year**: Holding age and education fixed, wages increase slightly with year, possibly due to inflation
- **Age**: Holding year and education fixed, wages peak at intermediate ages (e.g., middle career) and are lower for very young or old individuals
- **Education**: Holding year and age fixed, wages increase with higher education levels, which is intuitive

#### Smoothing Splines (Figure 7.12)

**Method:**
- For **year** and **age**, smoothing splines are used with specified degrees of freedom (e.g., 4 for year, 5 for age)
- Unlike natural splines, smoothing splines cannot be fit directly with least squares because they involve a penalty term to control smoothness
- Instead, GAMs with smoothing splines are fit using **backfitting**, an iterative algorithm:
  - **Backfitting**: Repeatedly update the fit for each predictor's function $f_j$ while holding the others fixed. For each predictor, the algorithm fits a single-variable smoother (e.g., a smoothing spline) to the partial residual (the response minus the contributions of all other predictors)
  - This process continues until the fits stabilize
- Software like the Python package `pygam` automates the fitting of such models
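The backfitting loop itself is short. The sketch below substitutes a least-squares cubic polynomial for the smoothing spline as the single-variable smoother; that swap is a deliberate simplification so the result can be checked against a joint least-squares fit of the same additive model. The data are simulated and the function names are hypothetical.

```python
import numpy as np

def cubic_smoother(x, r):
    """Stand-in single-variable smoother: least-squares cubic polynomial fit of r on x."""
    coefs = np.polyfit(x, r, deg=3)
    return np.polyval(coefs, x)

def backfit(X, y, smoother, n_iter=50, tol=1e-10):
    """Backfitting for y = beta0 + sum_j f_j(x_j) + error; returns beta0 and fitted f_j values."""
    n, p = X.shape
    beta0 = y.mean()
    f = np.zeros((n, p))
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # Partial residual: remove the intercept and every other predictor's current fit.
            partial = y - beta0 - f.sum(axis=1) + f[:, j]
            f[:, j] = smoother(X[:, j], partial)
            f[:, j] -= f[:, j].mean()      # center f_j so the intercept stays identifiable
        if np.max(np.abs(f - f_old)) < tol:
            break                          # the fits have stabilized
    return beta0, f

rng = np.random.default_rng(3)
n = 200
X = rng.uniform(-1.0, 1.0, size=(n, 2))
y = 1.0 + np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

beta0, f = backfit(X, y, cubic_smoother)
fitted_backfit = beta0 + f.sum(axis=1)

# Reference: the same additive cubic model fit as one joint least-squares problem.
design = np.column_stack([np.ones(n)] + [X[:, j] ** d for j in range(2) for d in (1, 2, 3)])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted_joint = design @ coef
print(float(np.max(np.abs(fitted_backfit - fitted_joint))))
```

Passing a penalized smoother in place of `cubic_smoother` leaves the loop unchanged, which is the appeal of backfitting: any single-variable fitting method can serve as a building block.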

**Results (Figure 7.12):** The fitted functions look similar to those from natural splines, indicating that the choice of spline type often has a small impact.

### Flexibility in Building GAMs

GAMs are not limited to splines. You can use:
- Local regression (Section 7.6) to model $f_j$
- Polynomial regression (Section 7.1)
- Any combination of methods from Chapter 7, depending on the nature of the predictors

This flexibility allows GAMs to adapt to different types of data and relationships.

### Pros and Cons of GAMs

The book summarizes the advantages and limitations of GAMs:

#### Pros

**Non-linear Modeling:**
GAMs automatically capture non-linear relationships for each predictor without requiring manual transformations (e.g., trying $x^2$, $\log(x)$, etc.).

**Improved Predictions:**
Non-linear fits can lead to more accurate predictions compared to linear regression when the true relationships are non-linear.

**Interpretability:**
Because the model is additive, you can examine the effect of each predictor $X_j$ on the response $Y$ individually by plotting $f_j$, while holding other variables fixed.

**Summarizing Smoothness:**
The smoothness of each $f_j$ can be quantified using degrees of freedom, which helps understand the complexity of the fit.

#### Cons

**Additivity Restriction:**
GAMs assume the effects of predictors are additive, meaning they cannot capture interactions between predictors (e.g., the effect of age on wage depending on education).

*Workaround:* You can manually add interaction terms like $X_j \times X_k$ or include low-dimensional interaction functions $f_{jk}(X_j, X_k)$, fit using two-dimensional smoothers (e.g., local regression or two-dimensional splines). However, this adds complexity.

**Limited Flexibility:**
For complex, non-additive relationships, GAMs may not be flexible enough. More general methods like random forests or boosting (Chapter 8) are needed for fully non-parametric modeling.

## 7.7.2 GAMs for Classification Problems

GAMs, which were introduced in Section 7.7.1 for regression problems, can also be applied to classification problems where the response variable $Y$ is qualitative (categorical). The book focuses on the case where $Y$ is binary, taking values 0 or 1, and the goal is to model the conditional probability $p(X) = \text{Pr}(Y = 1 | X)$, the probability that $Y = 1$ given the predictors $X$.

### From Linear Logistic Regression to Logistic GAM

In standard logistic regression (introduced in Chapter 4), the log-odds of the probability $p(X)$ is modeled as a linear combination of the predictors:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$

This is Equation (7.17), where the left-hand side is the logit (log of the odds), and the right-hand side assumes a linear relationship between each predictor $X_j$ and the log-odds.

To allow for non-linear relationships, a logistic GAM extends this model by replacing the linear terms $\beta_j X_j$ with flexible, non-linear functions $f_j(X_j)$:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$$

This is Equation (7.18). Like the regression GAM in Section 7.7.1, this model is additive, meaning the contributions of each predictor's function $f_j$ are summed to determine the log-odds. The functions $f_j$ can be fit using methods like splines, local regression, or step functions for categorical variables, just as in regression GAMs.

### Example: Wage Data for Classification

The book illustrates the use of a logistic GAM on the Wage dataset to predict the probability that an individual's income exceeds \$250,000 per year (a binary outcome: $Y = 1$ if wage > \$250,000, $Y = 0$ otherwise). The model is:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 \cdot \text{year} + f_2(\text{age}) + f_3(\text{education})$$

where:
- $p(X) = \text{Pr}(\text{wage} > 250 | \text{year}, \text{age}, \text{education})$
- **year**: Modeled linearly with a coefficient $\beta_1$, assuming its effect on the log-odds is linear
- **age**: Modeled non-linearly using a smoothing spline with 5 degrees of freedom for $f_2$
- **education**: A categorical variable with five levels (<HS, HS, <Coll, Coll, >Coll), modeled as a step function $f_3$ using dummy variables, where each level has its own constant effect

### Fitting the Model

The logistic GAM is fit using methods similar to regression GAMs, but instead of minimizing squared errors, it maximizes the log-likelihood of the logistic regression model, accounting for the non-linear functions $f_j$.

- For smoothing splines (used for age), the backfitting algorithm (described in Section 7.7.1) can be used, iteratively updating each $f_j$ by fitting to partial residuals while holding other predictors' effects fixed
- Software like the Python package `pygam` or R's `mgcv` can handle fitting logistic GAMs efficiently
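To make the likelihood-based fit concrete, here is a minimal sketch that maximizes the logistic log-likelihood by Newton's method (iteratively reweighted least squares) on an additive design matrix. It is not backfitting and not what `pygam` or `mgcv` do internally: a cubic polynomial basis stands in for the smoothing spline in $f_2(\text{age})$, the data are simulated, and every coefficient in the simulation is invented.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximize the logistic log-likelihood by Newton's method (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)                       # working weights
        grad = X.T @ (y - p)                    # gradient of the log-likelihood
        hess = X.T @ (X * W[:, None])           # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def log_likelihood(X, y, beta):
    eta = X @ beta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

rng = np.random.default_rng(4)
n = 1000
year = rng.uniform(0.0, 6.0, n)                 # years since a hypothetical baseline
age = rng.uniform(18.0, 80.0, n)
edu = rng.integers(0, 5, n)                     # five education levels coded 0..4
z = (age - 49.0) / 31.0                         # age rescaled to roughly [-1, 1]
true_logit = -1.5 + 0.1 * year + 1.5 * (1.0 - 2.0 * z ** 2) + 0.3 * edu
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

# Additive design: intercept, linear year, polynomial basis for age, education dummies.
dummies = (edu[:, None] == np.arange(1, 5)).astype(float)
X = np.column_stack([np.ones(n), year, z, z ** 2, z ** 3, dummies])

beta = fit_logistic(X, y)
prob = 1.0 / (1.0 + np.exp(-X @ beta))
print(log_likelihood(X, y, beta))
```

The estimated age effect on the log-odds scale is `X[:, 2:5] @ beta[2:5]`, which can be plotted against `age` in the same way as the panels described in the next section.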

### Results: Figure 7.13

The initial fit of the model is shown in Figure 7.13, with plots for the estimated effects of year, age, and education on the log-odds of earning more than \$250,000:

- **Year**: A linear effect, suggesting a small increase in the probability of high earnings over time (possibly due to economic trends)
- **Age**: A non-linear effect (via smoothing spline), showing how the probability of high earnings varies with age, likely peaking in middle age
- **Education**: A step function, with each education level contributing a different constant to the log-odds

The plot for the <HS (less than high school) level shows very wide confidence intervals, indicating high uncertainty. This is because no individuals with less than a high school education in the dataset earn more than \$250,000 (none have $Y = 1$), making the estimate for this level unreliable.
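A small calculation shows why a level with no positive outcomes is problematic. In a logistic model containing only a level indicator, the fitted log-odds for a level equals the empirical logit of that level, which is $-\infty$ when the count of $Y = 1$ is zero. The counts below are hypothetical and chosen only for illustration; they are not taken from the Wage data.

```python
import numpy as np

levels = ["<HS", "HS", "<Coll", "Coll", ">Coll"]
# Hypothetical counts for illustration only.
n_total = np.array([250, 950, 650, 700, 450])
n_high = np.array([0, 5, 10, 25, 40])           # number with Y = 1 in each level

with np.errstate(divide="ignore"):
    # Empirical log-odds per level: log(successes) - log(failures).
    empirical_logit = np.log(n_high) - np.log(n_total - n_high)

for name, value in zip(levels, empirical_logit):
    print(f"{name:>6}: {value:.2f}")
```

Because the likelihood keeps increasing as the first level's coefficient heads toward $-\infty$, a numerical optimizer returns a large negative value with an enormous standard error, consistent with the wide intervals seen for this level.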

### Refitting the Model: Figure 7.14

Due to the issue with the <HS category, the model is refit excluding individuals with less than a high school education. The revised model is shown in Figure 7.14:

- The plots for year, age, and education are similar to Figure 7.13 but exclude the problematic <HS category
- All three panels use the same vertical scale (for the log-odds), allowing direct comparison of the relative contributions of each predictor
- **Key Observation**: age and education have a much larger effect on the probability of being a high earner than year. This suggests that age and education level are stronger drivers of high income than the year

### Pros and Cons of Logistic GAMs

The pros and cons of logistic GAMs are the same as those for regression GAMs (Section 7.7.1):

#### Pros

**Non-linear flexibility**: Captures non-linear relationships between predictors and the log-odds without manual transformations.

**Improved predictions**: Non-linear fits can improve accuracy for classification tasks with complex patterns.

**Interpretability**: The additive structure allows visualization of each $f_j$, showing how each predictor affects the log-odds while holding others fixed.

**Smoothness control**: Degrees of freedom for each $f_j$ (e.g., 5 for age) summarize the complexity of the fit.

#### Cons

**Additivity limitation**: The model assumes additive effects, missing interactions between predictors (e.g., the effect of age depending on education). Interactions can be added manually (e.g., $f_{jk}(X_j, X_k)$), but this increases complexity.

**Data sparsity issues**: As seen with the <HS category, if certain predictor levels have no positive outcomes ($Y = 1$), estimates can be unstable or undefined.