# Regression and Regularization

In [1]:
%matplotlib inline 

import numpy as np
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 

sns.set_context('notebook')
sns.set_style('whitegrid')

## Linear Regression

_Fit_ a linear model to supervised data by adjusting a set of coefficients, $w$, to minimize the residual sum of squares between the observed response and the prediction.

<dl style="float:left">
    <dt style="margin:12px 0">Linear Model</dt>
        <dd>$y=X\beta+\epsilon$</dd>
    <dt style="margin:12px 0">Objective Function</dt>
        <dd>$\min_w \lVert Xw-y\rVert_2^2$</dd>
    <dt style="margin:12px 0">Predictive Model</dt>
        <dd>$\hat{y}(w,x)=w_0+w_1x_1+...+w_px_p$</dd>
</dl>

<table style="float:left; margin-left:70px;">
    <thead>
        <tr>
            <th>Notation</th>
            <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>$y$</td>
            <td>observed value</td>
        </tr>
        <tr>
            <td>$X$</td>
            <td>values of features</td>
        </tr>
        <tr>
            <td>$\beta$</td>
            <td>coefficients</td>
        </tr>
        <tr>
            <td>$\epsilon$</td>
            <td>noise or randomness in observation</td>
        </tr>
        <tr>
            <td>$w$</td>
            <td>weights</td>
        </tr>
        <tr>
            <td>$w_0$</td>
            <td>intercept (bias) term; shifts the fitted plane in target space</td>
        </tr>
        <tr>
            <td>$\hat{y}$</td>
            <td>predicted value</td>
        </tr>
    </tbody>
</table>

<div style="clear: both"></div>
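
As a rough illustration of the notation above, the following sketch fits an ordinary least squares model with scikit-learn's `LinearRegression`. The synthetic data, the "true" coefficients, and the variable names are assumptions made up for this example; they do not come from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2 + 3*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w0, true_w = 2.0, np.array([3.0, -1.0])
y = true_w0 + X @ true_w + rng.normal(scale=0.1, size=200)

# Fit by minimizing the residual sum of squares
ols = LinearRegression().fit(X, y)

# intercept_ plays the role of w_0; coef_ holds w_1, ..., w_p
y_hat = ols.predict(X)
rss = np.sum((y - y_hat) ** 2)  # the quantity the fit minimizes
print(ols.intercept_, ols.coef_, rss)
```

With this little noise, the recovered intercept and coefficients should land close to the values used to generate the data.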

### Ordinary Least Squares

![OLS](figures/ordinary_least_squares.png)

### Normal Equation

$w = (X^TX)^{-1}X^Ty = X^+y$

A one-step learning algorithm: the weights are found in closed form by solving a system of linear equations. When $X^TX$ is invertible, $X^+ = (X^TX)^{-1}X^T$ is the pseudo-inverse of $X$.

- No $\alpha$ to select (more on this shortly) 
- No iteration, computed in one step 
- Slow if the number of features $n$ is large (e.g. $n\geq10^4$)
- Computation of $(X^TX)^{-1}$ is slow (roughly cubic in $n$)
- $(X^TX)$ must be invertible 
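
The normal equation can be sketched in NumPy as follows. The synthetic data and names (`X_raw`, `w_true`) are assumptions for illustration only. The sketch solves the linear system with `np.linalg.solve` instead of forming the inverse explicitly, and compares the result with `np.linalg.pinv`.

```python
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(1)
m, n = 100, 3                                # m samples, n features
X_raw = rng.normal(size=(m, n))
X = np.column_stack([np.ones(m), X_raw])     # leading column of ones for the intercept
w_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ w_true + rng.normal(scale=0.05, size=m)

# Normal equation: solve (X^T X) w = X^T y rather than inverting X^T X
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudo-inverse route: w = X^+ y (also usable when X^T X is singular)
w_pinv = np.linalg.pinv(X) @ y

print(w_normal)
print(w_pinv)
```

Both routes should agree to numerical precision here, because $X^TX$ is well conditioned for this data.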

### Gradient Descent

![Gradient Descent](figures/gradient_descent.png)

$$h_\theta(x) = \theta_0 + \theta_1x$$

We _iteratively_ minimize our error by taking the derivative of the cost function (here the mean squared error of $h_\theta$) and stepping along its downward slope until we converge, i.e. there is no downward direction left, we reach a suitable error threshold, or we reach a maximum number of steps. For linear regression the resulting update rules, applied simultaneously, are as follows:

$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})$$
$$\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)}) \cdot x^{(i)}$$

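- Converges to the global minimum for a suitably small $\alpha$ (the squared-error cost is convex)
- Works well even when the number of features $n$ is large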
- Need to do many iterations 
- How to choose $\alpha$?
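
The update rules above can be sketched as a small batch gradient descent loop. This is a minimal illustration, not a tuned implementation: the function name `gradient_descent`, the stopping tolerance, and the synthetic data are all assumptions.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, max_steps=5000, tol=1e-10):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0
    for step in range(max_steps):
        error = theta0 + theta1 * x - y      # h_theta(x^(i)) - y^(i) for every sample
        grad0 = error.mean()                 # (1/m) * sum(error)
        grad1 = (error * x).mean()           # (1/m) * sum(error * x)
        # Both gradients are computed before either parameter changes,
        # so this is a simultaneous update.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
        if grad0 ** 2 + grad1 ** 2 < tol:    # gradient is nearly zero: converged
            break
    return theta0, theta1, step + 1

# Synthetic data (an assumption for illustration): y = 1 + 2x + noise
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=200)

theta0, theta1, n_steps = gradient_descent(x, y)
print(theta0, theta1, n_steps)
```

On this data the loop should stop well before `max_steps`, with parameters close to the closed-form least squares fit.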

### Learning Rate

The hyperparameter $\alpha$ in gradient descent is called the _learning rate_. It determines how fast we descend the error curve. A larger $\alpha$ takes bigger steps, but if $\alpha$ is too large the updates can overshoot the minimum and diverge, and if $\alpha$ is too small convergence may take impractically many iterations.

![Learning Rate](figures/learning_rate.png)
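
To make the trade-off concrete, the sketch below runs a fixed number of gradient descent steps with three learning rates and reports the final cost. The helper `cost_after`, the three values of $\alpha$, and the synthetic data are assumptions chosen for illustration; the thresholds at which $\alpha$ becomes "too large" depend on the data.

```python
import numpy as np

def cost_after(x, y, alpha, steps=50):
    """Run `steps` gradient descent updates and return the final half-MSE cost."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        error = theta0 + theta1 * x - y
        # Tuple assignment keeps the update simultaneous
        theta0, theta1 = (theta0 - alpha * error.mean(),
                          theta1 - alpha * (error * x).mean())
    return np.mean((theta0 + theta1 * x - y) ** 2) / 2

# Synthetic data (an assumption for illustration): y = 1 + 2x + noise
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=200)

# Too small, reasonable, and too large for this particular data set
costs = {alpha: cost_after(x, y, alpha) for alpha in (0.001, 0.5, 2.5)}
for alpha, cost in costs.items():
    print(f"alpha={alpha}: cost after 50 steps = {cost:.6g}")
```

For this data the smallest rate barely moves from the starting cost, the middle rate gets close to the noise floor, and the largest rate diverges.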

## Polynomial Regression

![Polynomial Regression](figures/polynomial_regression.png)

### Polynomial Regression

In order to do higher order polynomial regression, we can use linear models trained on nonlinear functions of data via a mapping, $\phi$.

- Speed of linear model computation
- Fit a wider range of data or functions
- But remember: polynomials aren't the only functions to fit


Consider the standard linear regression case:

$$\hat{y}(w,x) = w_0+\sum_{i=1}^n w_ix_i$$

The quadratic case (polynomial degree = 2), ignoring interaction terms $x_ix_j$, is:

$$\hat{y}(w,v,x) = w_0+\sum_{i=1}^n w_ix_i+\sum_{i=1}^n v_ix_i^2$$

We can simplify this by defining a _mapping_, $\phi$, that transforms our feature space:

$$\phi([x_1,...,x_n]) = [x_1,...,x_n,x_1^2,...,x_n^2]$$

At that point we can apply our standard linear models to the transformed features $\phi(x)$.

In [3]:
from sklearn.pipeline import Pipeline 
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures 

model = Pipeline([
    ('quad', PolynomialFeatures(2)),
    ('regr', LinearRegression()), 
])
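
One way to exercise this pipeline is sketched below. The block repeats the pipeline definition so that it runs on its own; the synthetic quadratic data and its coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): y = 1 - 2x + 0.5x^2 + noise
rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=(150, 1))
y = 1.0 - 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.1, size=150)

model = Pipeline([
    ('quad', PolynomialFeatures(2)),   # expands the features to degree 2
    ('regr', LinearRegression()),      # ordinary least squares on the expanded features
])
model.fit(x, y)

# For a single input feature, PolynomialFeatures(2) emits the columns [1, x, x^2]
expanded = model.named_steps['quad'].transform(np.array([[2.0]]))
print(expanded)  # [[1. 2. 4.]]

regr = model.named_steps['regr']
print(regr.intercept_, regr.coef_, model.score(x, y))
```

Note that with more than one input feature, `PolynomialFeatures(2)` also generates a constant column and the interaction terms $x_ix_j$, so its output is a superset of the mapping $\phi$ written above.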

## Residuals



### Residuals Plot

### Prediction Error Plot

### Coefficient of Determination

### Multicollinearity

### Rank 2D

## Regularization

### Vector Norms

### Hypothesis

## Ridge Regression

## LASSO Regression

## ElasticNet Regression

## Alpha Selection

## Other Regression Models