# Introduction to Statistical Learning - Chapter 6

- [6. Moving Beyond Linearity](#6.-Moving-Beyond-Linearity)
    * [6.1. Polynomial Regression](#6.1.-Polynomial-Regression)
    * [6.2. Step Functions](#6.2.-Step-Functions)
    * [6.3. Basis Functions](#6.3.-Basis-Functions)
    * [6.4. Regression Splines](#6.4.-Regression-Splines)
        + [6.4.1. Piecewise Polynomials](#6.4.1.-Piecewise-Polynomials)
        + [6.4.2. Constraints and Splines](#6.4.2.-Constraints-and-Splines)
        + [6.4.3. The Spline Basis Representation](#6.4.3.-The-Spline-Basis-Representation)
        + [6.4.4. Choosing the Number and Locations of the Knots](#6.4.4.-Choosing-the-Number-and-Locations-of-the-Knots)
        + [6.4.5. Comparison to Polynomial Regression](#6.4.5.-Comparison-to-Polynomial-Regression)
    * [6.5. Smoothing Splines](#6.5.-Smoothing-Splines)
        + [6.5.1. An Overview of Smoothing Splines](#6.5.1.-An-Overview-of-Smoothing-Splines)
        + [6.5.2. Choosing the Smoothing Parameter λ](#6.5.2.-Choosing-the-Smoothing-Parameter-λ)
    * [6.6. Local Regression](#6.6.-Local-Regression)
    * [6.7. Generalized Additive Models](#6.7.-Generalized-Additive-Models)
        + [6.7.1 GAMs for Regression Problems](#6.7.1-GAMs-for-Regression-Problems)

# 6. Moving Beyond Linearity

- Polynomial regression
    * Extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power
        + Simple way to provide a non-linear fit to data
- Step functions
    * Cut the range of a variable into $K$ distinct regions in order to produce a qualitative variable
        + Fitting a piecewise constant function
- Regression splines
    * More flexible than polynomial and step functions
    * Involves dividing the range of $X$ into $K$ distinct regions
    * Within each region, a polynomial function is fitted to the data
        + Polynomials are constrained so that they join smoothly at the region boundaries, or knots
- Smoothing splines
    * Result from minimizing a residual sum of squares criterion subject to a smoothness penalty
- Local regression
    * Similar to splines but differs in an important way
        + Regions are allowed to overlap and they do so in a smooth way
- Generalized additive models
    * Allows us to extend the methods above to deal with multiple predictors

## 6.1. Polynomial Regression

$$ y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i} $$

$$ y_{i} = \beta_{0} + \beta_{1}x_{i} + \beta_{2}x_{i}^{2} + \beta_{3}x_{i}^{3} + ... + \beta_{d}x_{i}^{d} + \epsilon_{i} $$

- Polynomial regression allows us to produce an extremely non-linear curve
- Coefficients of the polynomial regression can be easily estimated using least squares linear regression because it is just a standard linear model with predictors $x_{i}$, $x_{i}^{2}$, $x_{i}^{3}$, ..., $x_{i}^{d}$
- Least squares regression returns variance estimates for each of the fitted coefficients $\hat{\beta}_{j}$, as well as the covariances between pairs of coefficient estimates
    * These can be used to compute the estimated variance of $\hat{f}(x_{0})$
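
The two points above can be sketched with NumPy. This is a minimal illustration on synthetic data, not a reference implementation: the data-generating function, the degree `d = 3`, the target point `x0`, and the helper `poly_design` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): a cubic trend plus noise
n, d = 200, 3
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.8 * x**3 + rng.normal(scale=1.0, size=n)

def poly_design(x, d):
    """Design matrix with columns 1, x, x^2, ..., x^d."""
    return np.vander(np.atleast_1d(x), N=d + 1, increasing=True)

X = poly_design(x, d)

# Ordinary least squares on the expanded predictors
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Estimated covariance matrix of beta_hat: sigma^2 * (X^T X)^{-1}
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - (d + 1))
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)

# Pointwise variance of the fit at x0: l0^T Cov(beta_hat) l0,
# where l0 = (1, x0, x0^2, ..., x0^d)
x0 = 1.5
l0 = poly_design(x0, d)[0]
f_hat_x0 = l0 @ beta_hat
se_x0 = np.sqrt(l0 @ cov_beta @ l0)

print(f"f_hat({x0}) = {f_hat_x0:.3f}, approximate 95% band: +/- {2 * se_x0:.3f}")
```

Repeating the last step over a grid of `x0` values would trace out the fitted curve together with its pointwise standard-error band.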

## 6.2. Step Functions

- Using polynomial functions of features as predictors in a linear model imposes a global structure on the non-linear function of $X$
    * Using step functions avoids imposing such a global structure
- A step function breaks the range of $X$ into bins and fits a different constant in each bin
    * Converting a continuous variable into an ordered categorical variable
        + We can then use least squares to fit a linear model using $C_{1}(X), C_{2}(X), ... , C_{K}(X)$ as predictors

$$y_{i} = \beta_{0} + \beta_{1}C_{1}(x_{i}) + ... + \beta_{K}C_{K}(x_{i}) + \epsilon_{i}$$
where, for a given value of $X$, at most one of $C_{1}, C_{2}, ... , C_{K}$ can be non-zero

- Unless there are natural breakpoints in the predictors, piecewise-constant functions can miss the trend in the data, because the fit is flat within each bin
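
As a rough sketch of the step-function fit, the example below builds the indicator columns with `np.digitize` and fits them by least squares. The synthetic data and the cut points `30` and `60` are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: the response jumps between three plateaus
x = rng.uniform(0, 90, size=300)
y = np.where(x < 30, 1.0, np.where(x < 60, 3.0, 2.0)) + rng.normal(scale=0.3, size=300)

# Two hand-picked cut points split the range into three bins
cuts = np.array([30.0, 60.0])
bin_index = np.digitize(x, cuts)          # 0, 1 or 2 for each observation

# Indicator columns C_1(x), C_2(x); the first bin is absorbed by the intercept
K = len(cuts)
C = (bin_index[:, None] == np.arange(1, K + 1)).astype(float)
X = np.column_stack([np.ones_like(x), C])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# beta_0 is the mean response in the first bin; beta_0 + beta_j is the mean in bin j
fitted_levels = beta_hat[0] + np.r_[0.0, beta_hat[1:]]
bin_means = np.array([y[bin_index == j].mean() for j in range(K + 1)])
print("fitted bin levels:", fitted_levels)
print("raw bin means:    ", bin_means)
```

The fitted levels coincide with the raw bin means, which is what fitting a different constant in each bin amounts to.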

## 6.3. Basis Functions

- Polynomial and piecewise-constant regression models are in fact special cases of a basis function approach

$$y_{i} = \beta_{0} + \beta_{1}b_{1}(x_{i}) + \beta_{2}b_{2}(x_{i}) +... + \beta_{K}b_{K}(x_{i}) + \epsilon_{i}$$

where the basis functions $b_{1}(\cdot), b_{2}(\cdot),..., b_{K}(\cdot)$ are fixed and known

- For polynomial regression, the basis functions are $b_{j}(x_{i}) = x^{j}_{i}$ and for piecewise constant functions, they are $b_{j}(x_{i}) = I(c_{j} \leq x_{i} < c_{j+1})$
    * This is a standard linear model with predictors $b_{1}(x_{i}), b_{2}(x_{i}),..., b_{K}(x_{i})$, so it can be fitted by least squares
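
To make the common structure concrete, here is a small sketch in which one fitting routine accepts any list of basis functions. The helper name `fit_basis_model`, the synthetic data, and the cut points are assumptions for this example.

```python
import numpy as np

def fit_basis_model(x, y, basis_functions):
    """Least squares fit of y on an intercept plus b_1(x), ..., b_K(x)."""
    X = np.column_stack([np.ones_like(x)] + [b(x) for b in basis_functions])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat, X

rng = np.random.default_rng(2)
x = rng.uniform(0, 3, size=150)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=150)

# Polynomial basis: b_j(x) = x^j for j = 1, 2, 3
poly_basis = [lambda x, j=j: x**j for j in (1, 2, 3)]

# Piecewise-constant basis: b_j(x) = I(c_j <= x < c_{j+1}), hypothetical cut points
cuts = [1.0, 2.0, 3.0]
step_basis = [lambda x, lo=lo, hi=hi: ((lo <= x) & (x < hi)).astype(float)
              for lo, hi in zip(cuts[:-1], cuts[1:])]

for name, basis in [("polynomial", poly_basis), ("step", step_basis)]:
    beta_hat, X = fit_basis_model(x, y, basis)
    rss = np.sum((y - X @ beta_hat) ** 2)
    print(f"{name:>10}: {X.shape[1]} coefficients, RSS = {rss:.2f}")
```

Only the list of basis functions changes between the two fits; the least squares step is identical.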

## 6.4. Regression Splines

### 6.4.1. Piecewise Polynomials

- Piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of X
$$ y_{i} = \beta_{0} + \beta_{1}x_{i} + \beta_{2}x_{i}^{2} + \beta_{3}x_{i}^{3} + \epsilon_{i} $$

where the coefficients $\beta_{0},\beta_{1}, \beta_{2}, \beta_{3}$ differ in different parts of the range of $X$. The points where the coefficients change are called *knots*. For example, with a single knot at $c$:

$$ y_{i} = \begin{cases} \beta_{01} + \beta_{11}x_{i} + \beta_{21}x_{i}^{2} + \beta_{31}x_{i}^{3} + \epsilon_{i} & \text{if } x_{i} < c \\ \beta_{02} + \beta_{12}x_{i} + \beta_{22}x_{i}^{2} + \beta_{32}x_{i}^{3} + \epsilon_{i} & \text{if } x_{i} \geq c \end{cases} $$

- Using more knots leads to a more flexible piecewise polynomial
    * If we place $K$ different knots throughout the range of $X$, then we will end up fitting $K+1$ different cubic polynomials
        + However, without further constraints the fitted function is discontinuous at the knots and looks ridiculous

### 6.4.2. Constraints and Splines

- We need to fit a piecewise polynomial under the constraint that the fitted curve must be continuous
    * Adding constraints on the derivatives at the knots as well helps ensure that the piecewise polynomial is smooth, not just continuous
        + Each constraint frees up one degree of freedom by reducing the complexity of the resulting piecewise polynomial fit
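
As a worked count for the cubic case, assume $K$ knots and require continuity of the function and of its first and second derivatives at every knot (three constraints per knot). The unconstrained piecewise cubic has four coefficients in each of its $K+1$ pieces, so

$$ \underbrace{4(K+1)}_{\text{free coefficients}} - \underbrace{3K}_{\text{constraints}} = K + 4 $$

degrees of freedom remain. With a single knot, for instance, the count drops from 8 to 5.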

### 6.4.3. The Spline Basis Representation

- A basis model can be used to represent a regression spline
    * For a cubic spline, start with the basis for a cubic polynomial ($x, x^{2}, x^{3}$) and add one truncated power basis function per knot
    * However, splines can have high variance at the outer range of the predictors
        + Confidence bands in the boundary region can be fairly wild
- A natural spline is a regression spline with additional boundary constraints
    * The function is required to be linear at the boundary 
        + This additional constraint means that natural splines generally produce more stable estimates at the boundaries
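
A minimal sketch of the truncated power basis for a (non-natural) cubic spline follows, using the truncated power function $(x - \xi)^{3}_{+}$, which equals $(x-\xi)^{3}$ to the right of the knot $\xi$ and zero otherwise. The helper name `cubic_spline_basis`, the synthetic data, and the knot locations are assumptions for illustration; the natural-spline boundary constraints are not implemented here.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Columns x, x^2, x^3, then (x - xi)^3_+ for each knot xi (no intercept column)."""
    x = np.asarray(x, dtype=float)
    powers = [x, x**2, x**3]
    truncated = [np.clip(x - xi, 0.0, None) ** 3 for xi in knots]
    return np.column_stack(powers + truncated)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)

knots = [2.5, 5.0, 7.5]                       # hypothetical knot locations
B = cubic_spline_basis(x, knots)
X = np.column_stack([np.ones_like(x), B])     # intercept + 3 + K basis columns

# The spline is then just a least squares fit on these columns
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

print("number of coefficients:", X.shape[1])  # K + 4 = 7
```

Each truncated term is zero to the left of its knot and has continuous first and second derivatives at the knot, so adding it changes the cubic coefficient without breaking smoothness.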

### 6.4.4. Choosing the Number and Locations of the Knots

- A regression spline is most flexible in regions that contain many knots, because the polynomial coefficients can change rapidly there
    * Where to place the knots:
        + One option is to place more knots in places where we feel the function might vary most rapidly
        + In practice, knots are more commonly placed in a uniform fashion
    * How many knots should we use:
        + One option is to try out different numbers of knots and see which produces the best-looking curve
        + A more objective approach is to use cross-validation, choosing the value of $K$ that gives the smallest cross-validated RSS
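
The cross-validation approach can be sketched as below. This is one possible setup under stated assumptions: knots are placed at uniform quantiles of the training data, the spline uses a truncated power basis, and the helper names `spline_design` and `cv_rss`, the 10-fold split, and the synthetic data are all hypothetical choices.

```python
import numpy as np

def spline_design(x, knots):
    """Intercept, x, x^2, x^3 and one truncated cubic term per knot."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - xi, 0.0, None) ** 3 for xi in knots]
    return np.column_stack(cols)

def cv_rss(x, y, n_knots, n_folds=10, seed=0):
    """Cross-validated RSS for a cubic spline with knots at uniform quantiles."""
    probs = np.linspace(0, 1, n_knots + 2)[1:-1]
    folds = np.random.default_rng(seed).permutation(len(x)) % n_folds
    total = 0.0
    for k in range(n_folds):
        train, test = folds != k, folds == k
        knots = np.quantile(x[train], probs)      # knots use training data only
        beta, *_ = np.linalg.lstsq(spline_design(x[train], knots), y[train], rcond=None)
        total += np.sum((y[test] - spline_design(x[test], knots) @ beta) ** 2)
    return total

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=300)
y = np.sin(x) + 0.1 * x + rng.normal(scale=0.3, size=300)

# Evaluate each candidate number of knots on held-out data
candidates = range(1, 11)
scores = {k: cv_rss(x, y, k) for k in candidates}
best_k = min(scores, key=scores.get)
print("cross-validated RSS:", {k: round(v, 1) for k, v in scores.items()})
print("selected number of knots:", best_k)
```

Because every candidate is scored on data that was not used for fitting, adding knots stops paying off once the extra flexibility only fits noise.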

### 6.4.5. Comparison to Polynomial Regression

- Regression splines often give superior results to polynomial regression
    * Splines introduce flexibility by increasing the number of knots while keeping the degree fixed, whereas polynomial regression must increase the degree
    * Splines allow us to place more knots, and hence more flexibility, over regions where the function $f$ seems to be changing rapidly
    * The extra flexibility of a high-degree polynomial produces undesirable results at the boundaries, while a natural cubic spline still provides a reasonable fit to the data

## 6.5. Smoothing Splines

### 6.5.1. An Overview of Smoothing Splines

- Find a function $g(x)$ that fits the observed data well (small RSS) while remaining smooth, by minimizing

$$ \sum^{n}_{i=1}(y_{i}-g(x_{i}))^{2} + \lambda \int g''(t)^{2} dt $$
where $\lambda$ is a non-negative tuning parameter. The first term is the RSS, and the penalty $\int g''(t)^{2} dt$ is a measure of the total change in the function $g'(t)$

- If $g$ is very smooth, then $g'(t)$ will be close to constant and $\int g''(t)^{2} dt$ will take on a small value
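
To give a feel for the penalty term, the sketch below approximates $\int g''(t)^{2} dt$ numerically for three hand-picked functions on $[0, 1]$. The helper name `roughness`, the finite-difference approximation, and the example functions are assumptions made for illustration.

```python
import numpy as np

def roughness(g, a=0.0, b=1.0, n_grid=2001):
    """Approximate the integral of g''(t)^2 over [a, b] with finite differences."""
    t = np.linspace(a, b, n_grid)
    h = t[1] - t[0]
    second_deriv = np.diff(g(t), n=2) / h**2      # central second differences
    return np.sum(second_deriv**2) * h            # Riemann sum

line = lambda t: 2.0 + 3.0 * t                    # g'' = 0 everywhere
gentle = lambda t: np.sin(2 * np.pi * t)          # one slow oscillation
wiggly = lambda t: np.sin(20 * np.pi * t)         # ten fast oscillations

for name, g in [("line", line), ("gentle", gentle), ("wiggly", wiggly)]:
    print(f"{name:>7}: penalty ~ {roughness(g):.3g}")
```

A straight line incurs essentially no penalty, while the rapidly oscillating function is penalized several orders of magnitude more than the slowly varying one.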

### 6.5.2. Choosing the Smoothing Parameter λ

- The tuning parameter $\lambda$ controls the roughness of the smoothing spline
    * As $\lambda$ increases from 0 to $\infty$, the effective degrees of freedom $df_{\lambda}$ decrease from $n$ to 2
    * $df_{\lambda}$ is a measure of the flexibility of the smoothing spline: the higher it is, the more flexible the smoothing spline
        + We need to find the $\lambda$ that makes the cross-validated RSS as small as possible
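
The behaviour of $df_{\lambda}$ can be illustrated with a discrete analogue of the smoothing spline. This is a swapped-in simplification, not the smoothing spline itself: it assumes equally spaced $x$ values and replaces $\int g''(t)^{2} dt$ with a sum of squared second differences. The fit is then linear in $y$, $\hat{g} = S_{\lambda} y$, the effective degrees of freedom are the trace of $S_{\lambda}$, and leave-one-out RSS can be computed from a single fit. The helper names `smoother_matrix` and `loocv_rss` and the grid of $\lambda$ values are hypothetical.

```python
import numpy as np

def smoother_matrix(n, lam):
    """S_lambda for the discrete penalised fit, so that g_hat = S_lambda @ y."""
    D = np.diff(np.eye(n), n=2, axis=0)           # (n-2) x n second-difference operator
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, np.eye(n))

def loocv_rss(y, S):
    """Leave-one-out RSS via the shortcut sum(((y_i - g_hat_i) / (1 - S_ii))^2)."""
    resid = y - S @ y
    return np.sum((resid / (1.0 - np.diag(S))) ** 2)

rng = np.random.default_rng(5)
n = 60
x = np.linspace(0, 1, n)                          # equally spaced, as this sketch assumes
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

lambdas = 10.0 ** np.arange(-2, 7)
results = []
for lam in lambdas:
    S = smoother_matrix(n, lam)
    df, cv = np.trace(S), loocv_rss(y, S)
    results.append((lam, df, cv))
    print(f"lambda = {lam:9.2e}   df = {df:6.2f}   LOOCV RSS = {cv:7.3f}")

# Pick the lambda with the smallest leave-one-out RSS
best_lam = min(results, key=lambda r: r[2])[0]
print("lambda with the smallest LOOCV RSS:", best_lam)
```

In this analogue the trace falls from $n$ at $\lambda = 0$ towards 2 as $\lambda$ grows, because only straight lines escape the second-difference penalty.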

## 6.6. Local Regression

- An approach for fitting flexible non-linear functions, which involves computing the fit at a target point $x_{0}$ using only the nearby training observations
    * The weights will differ for each value of $x_{0}$
- To perform local regression, there are a number of choices to be made
    * Defining the weighting function $K$, and whether to fit a constant, linear, or quadratic regression
    * The most important choice is the span $s$
        + Controls the flexibility of the non-linear fit
        + The smaller the value, the more local and wiggly the fit will be
- Local regression generalizes very naturally when we want to fit models that are local in a pair of variables $X_{1}$ and $X_{2}$ rather than one
    * Can use two-dimensional neighborhoods and fit bivariate linear regression models using observations that are near each target point
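
A one-dimensional local linear fit can be sketched as a weighted least squares problem solved afresh at each target point. The tricube weighting function, the helper name `local_linear_fit`, the spans compared, and the synthetic data are assumptions chosen for this example.

```python
import numpy as np

def local_linear_fit(x0, x, y, span=0.3):
    """Weighted least squares line around x0 using the nearest span * n points."""
    n = len(x)
    k = max(3, int(np.ceil(span * n)))            # number of neighbours that get weight
    dist = np.abs(x - x0)
    nearest = np.argsort(dist)[:k]
    d_max = dist[nearest].max()
    # Tricube kernel: weight 1 at x0, falling to 0 at the furthest neighbour
    w = (1.0 - (dist[nearest] / d_max) ** 3) ** 3
    X = np.column_stack([np.ones(k), x[nearest]])
    sw = np.sqrt(w)                               # scale rows so lstsq solves the weighted problem
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y[nearest] * sw, rcond=None)
    return beta[0] + beta[1] * x0

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)

grid = np.linspace(0.5, 9.5, 10)
errors = {}
for span in (0.1, 0.5):
    fit = np.array([local_linear_fit(x0, x, y, span) for x0 in grid])
    errors[span] = np.mean((fit - np.sin(grid)) ** 2)
    print(f"span = {span}: mean squared error against sin(x) on the grid = {errors[span]:.4f}")
```

The weights are recomputed for every `x0`, and the span sets how many neighbours contribute: a small span follows local structure closely, while a large span averages over a wide window.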

## 6.7. Generalized Additive Models

- Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity
    * Can be applied with both quantitative and qualitative responses

### 6.7.1 GAMs for Regression Problems

$$ y_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + ... + \beta_{p}x_{ip} + \epsilon_{i}$$

- To allow for non-linear relationships between each feature and the response
    * Replace each linear component $\beta_{j}x_{ij}$ with a (smooth) non-linear function $f_{j}(x_{ij})$
        + It is called an additive model because we calculate a separate $f_{j}$ for each $X_{j}$ and then add together all of their contributions

$$ y_{i} = \beta_{0} + f_{1}(x_{i1}) + f_{2}(x_{i2}) + ... + f_{p}(x_{ip}) + \epsilon_{i}$$
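
One simple way to fit such a model, sketched below, is to give each predictor its own block of spline basis columns and fit everything in a single least squares problem. This is only an illustration under assumptions: the truncated power basis, the helper name `spline_columns`, the knot placement at quantiles, and the synthetic additive truth are all hypothetical choices.

```python
import numpy as np

def spline_columns(x, n_knots=3):
    """Cubic truncated-power basis for one predictor (no intercept column)."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    cols = [x, x**2, x**3] + [np.clip(x - xi, 0.0, None) ** 3 for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(7)
n = 500
x1 = rng.uniform(-2, 2, size=n)
x2 = rng.uniform(0, 3, size=n)
# Hypothetical additive truth: f_1(x1) = x1^2, f_2(x2) = sin(2 * x2)
y = 1.0 + x1**2 + np.sin(2 * x2) + rng.normal(scale=0.3, size=n)

# One block of basis columns per predictor, plus a shared intercept
B1, B2 = spline_columns(x1), spline_columns(x2)
X = np.column_stack([np.ones(n), B1, B2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Each fitted f_j is its own block of columns times its own coefficients
p1 = B1.shape[1]
f1_hat = B1 @ beta_hat[1:1 + p1]
f2_hat = B2 @ beta_hat[1 + p1:]

# f_j is only identified up to a constant, so centre before comparing with the truth
f1_true, f2_true = x1**2, np.sin(2 * x2)
err1 = np.mean(((f1_hat - f1_hat.mean()) - (f1_true - f1_true.mean())) ** 2)
err2 = np.mean(((f2_hat - f2_hat.mean()) - (f2_true - f2_true.mean())) ** 2)
print(f"centred MSE for f_1: {err1:.4f}, for f_2: {err2:.4f}")
```

Because the fitted value is the intercept plus `f1_hat` plus `f2_hat`, each component can be plotted against its own predictor to inspect that variable's contribution in isolation.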

**Pros and Cons of GAMs**
- GAMs allow us to fit a non-linear $f_{j}$ to each $X_{j}$
- The non-linear fits can potentially make more accurate predictions for the response $Y$
- Because the model is additive, we can still examine the effect of each $X_{j}$ on $Y$ individually while holding all the other variables fixed
- Smoothness of the function $f_{j}$ for the variable $X_{j}$ can be summarized via degrees of freedom
- The main limitation of GAMs is that the model is restricted to be additive.
    * With many variables, important interactions can be missed
        + However, we can manually add interaction terms to the GAM model by including additional predictors of the form $X_{j} \times X_{k}$
- GAMs can also be used for classification problems