## Assessing Performance

### Measuring loss
- Model + algorithm -> fitted function
- Predictions -> decisions -> outcome

### Polynomial regression
- More complex functions of a single input
- Model:
 - $y_i = w_0 + w_1x_i + w_2{x_i}^2 + \cdots + w_p{x_i}^p + \epsilon_i$
- Features:
 - feature 1 = 1 (constant)
 - feature 2 = $x$
 - feature 3 = $x^2$
 - feature p+1 = ${x}^p$
- Parameters:
 - parameter 1 = $w_0$
 - parameter 2 = $w_1$
 - parameter 3 = $w_2$
 - parameter p+1 = $w_p$
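
The feature list above can be sketched in code. This is a minimal illustration assuming NumPy; the helper name `polynomial_features` is hypothetical and not part of the notes.

```python
import numpy as np

def polynomial_features(x, p):
    """Return the N x (p+1) matrix with columns 1, x, x^2, ..., x^p."""
    x = np.asarray(x, dtype=float)
    # Column j holds x**j, so column 0 is the constant feature.
    return np.column_stack([x ** j for j in range(p + 1)])

# Three example inputs, degree p = 2 (values chosen only for illustration).
H = polynomial_features([1.0, 2.0, 3.0], p=2)
print(H)
```

Each row of `H` holds the features of one data point, so a prediction is the dot product of a row with the parameter vector $(w_0, \dots, w_p)$.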

### Modeling seasonality
- On average, house prices tend to increase with time
- Most houses listed in summer + good houses sell quickly
- Few homes listed in Nov./Dec. + transactions often leftover inventory or special circumstances
- Model:
 - $y_i = w_0 + w_1t_i + w_2\sin{(2\pi t_i / 12 - \Phi)} + \epsilon_i$
 - $t_i$ : linear trend
 - $\sin{(2\pi t_i / 12 - \Phi)}$ : seasonal component = sinusoid with period 12 (resets annually)
 - $\Phi$ : Unknown phase/shift
- Trigonometric identity: $\sin{(a-b)} = \sin{(a)}\cos{(b)} - \cos{(a)}\sin{(b)}$
 - $\sin{(2\pi t_i / 12 - \Phi)} = \sin{(2\pi t_i / 12)}\cos{(\Phi)} - \cos{(2\pi t_i / 12)}\sin{(\Phi)}$
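- Equivalently, absorbing the constants $\cos{(\Phi)}$ and $-\sin{(\Phi)}$ into the weights (so $w_2$ and $w_3$ below are new weights and the model is linear in them):
 - $y_i = w_0 + w_1t_i + w_2\sin{(2\pi t_i / 12)} + w_3\cos{(2\pi t_i / 12)} + \epsilon_i$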
- Features:
 - feature 1 = 1 (constant)
 - feature 2 = $t$
 - feature 3 = $\sin{(2\pi t / 12)}$
 - feature 4 = $\cos{(2\pi t / 12)}$
- Fit **polynomial trend** and sinusoidal **seasonal component**
<img src="./figures/w2-f1.png" width=400>
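
As a rough sketch of the seasonal model, the snippet below builds the four features and checks numerically that a phase-shifted sinusoid equals a weighted sum of the sine and cosine features. It assumes NumPy and that $t$ is a month index; the helper names `seasonal_features` and `phase_to_weights`, and the amplitude and phase values, are made up for illustration.

```python
import numpy as np

def seasonal_features(t):
    """Columns: 1, t, sin(2*pi*t/12), cos(2*pi*t/12) for month index t."""
    t = np.asarray(t, dtype=float)
    angle = 2.0 * np.pi * t / 12.0
    return np.column_stack([np.ones_like(t), t, np.sin(angle), np.cos(angle)])

def phase_to_weights(amplitude, phi):
    """Weights on sin(angle) and cos(angle) equivalent to amplitude*sin(angle - phi)."""
    return amplitude * np.cos(phi), -amplitude * np.sin(phi)

t = np.arange(24)                      # two years of monthly time steps
H = seasonal_features(t)

# Illustrative amplitude and phase (assumed values, not from the notes).
w_sin, w_cos = phase_to_weights(2.0, 0.7)
shifted = 2.0 * np.sin(2.0 * np.pi * t / 12.0 - 0.7)
combined = w_sin * H[:, 2] + w_cos * H[:, 3]
print(np.allclose(shifted, combined))
```

The point of the rewrite is that the unknown phase no longer appears inside the sine, so the weights can be estimated by ordinary linear regression on these features.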

### Regression with general features of 1 input
- Generic basis expansion
 - $y_i = w_0h_0(x_i) + w_1h_1(x_i) + \cdots + w_Dh_D(x_i) + \epsilon_i$
 - $= \sum_{j=0}^D{w_jh_j(x_i)} + \epsilon_i$
 - $w_j$ : jth regression coefficient or weight
 - $h_j(x_i)$ : jth feature
- Features:
 - feature 1 = $h_0(x)$ ...often 1 (constant)
 - feature 2 = $h_1(x)$ ... e.g., $x$
 - feature 3 = $h_2(x)$ ... e.g., $x^2$ or $\sin{(2\pi x / 12)}$
 - feature D+1 = $h_D(x)$  ... e.g., $x^D$
<img src="./figures/w2-f2.png" width=400>
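
The generic basis expansion can be sketched by treating each feature $h_j$ as a function and stacking the results as columns. This is only an illustration assuming NumPy; `feature_matrix`, `predict`, and the particular basis chosen here are hypothetical.

```python
import numpy as np

def feature_matrix(x, basis):
    """Evaluate each basis function h_j on x and stack the results as columns."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([h(x) for h in basis])

def predict(x, w, basis):
    """Compute sum_j w_j * h_j(x_i) for every data point x_i."""
    return feature_matrix(x, basis) @ w

# Example basis mixing a constant, a linear term, and a seasonal term.
basis = [
    lambda x: np.ones_like(x),               # h_0(x) = 1
    lambda x: x,                             # h_1(x) = x
    lambda x: np.sin(2.0 * np.pi * x / 12),  # h_2(x) = sin(2*pi*x/12)
]

w = np.array([1.0, 2.0, 3.0])                # illustrative weights
print(predict([0.0, 3.0], w, basis))
```

Swapping in a different list of functions changes the model without changing the fitting code, which is why polynomial and seasonal regression are both special cases of this form.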

### Incorporating multiple inputs
- General notation
 - Output (scalar) : $y$
 - Inputs (d-dim vector) : $X = (X[1], X[2], \cdots , X[d])$
 - Notational conventions:
   - $X[j]$ = jth input (scalar)
   - $h_j(X)$ = jth feature (scalar)
   - $X_i$ = input of ith data point (vector)
   - $X_i[j]$ = jth input of ith data point (scalar)
- Simple hyperplane
 - Model:
   - $y_i = w_0 + w_1 X_i[1] + \cdots + w_d X_i[d]+ \epsilon_i$
 - Features:
   - feature 1 = $1$ (constant)
   - feature 2 = $X[1]$ ... e.g., sq. ft.
   - feature 3 = $X[2]$ ... e.g., #bath
   - feature d+1 = $X[d]$  ... e.g., lot size
- More generically - D-dimensional curve
 - Model:
   - $y_i = w_0 h_0(X_i) + w_1 h_1(X_i) + \cdots + w_D h_D(X_i)+ \epsilon_i = \sum_{j=0}^D{w_jh_j(X_i)} + \epsilon_i$
 - Features:
   - feature 1 = $h_0(X)$ ... e.g., 1
   - feature 2 = $h_1(X)$ ... e.g., $X[1]$ = sq. ft.
   - feature 3 = $h_2(X)$ ... e.g., $X[2]$ = #bath or, $\log{(X[7])}X[2]$ = log(#bed) x #bath
   - feature D+1 = $h_D(X)$  ... some other function of $X[1], ..., X[d]$
- More on notation
 - \# observations $(X_i, y_i)$ : N
 - \# inputs $X[j]$ : d
 - \# features $h_j(X)$ : D
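
To make the distinction between inputs and features concrete, here is a small sketch assuming NumPy. The house data are invented for illustration, and the code uses 0-based column indices where the notes use 1-based $X[j]$.

```python
import numpy as np

# Hypothetical inputs: each row is one house (sq. ft., #bath, #bed).
X = np.array([
    [1000.0, 1.0, 2.0],
    [2000.0, 2.0, 3.0],
    [1500.0, 2.0, 4.0],
    [3000.0, 3.0, 4.0],
])

# Each feature is some function of the whole input vector.
features = [
    lambda X: np.ones(X.shape[0]),          # h_0(X) = 1
    lambda X: X[:, 0],                      # h_1(X) = sq. ft.
    lambda X: X[:, 1],                      # h_2(X) = #bath
    lambda X: np.log(X[:, 2]) * X[:, 1],    # h_3(X) = log(#bed) x #bath
]

H = np.column_stack([h(X) for h in features])
N, d = X.shape
print("observations:", N, "inputs:", d, "feature columns:", H.shape[1])
```

The number of feature columns need not match the number of inputs, since several features can be built from the same inputs.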

- Interpreting the multiple regression fit
 - Hold all other inputs in the model fixed and vary only one input: its coefficient is the predicted change in output per unit change in that input
<img src="./figures/w2-f3.png" width=350 align="left">
<img src="./figures/w2-f4.png" width=350 align="right">
<img src="./figures/w2-f5.png" width=400>
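- Common mistake: looking at a coefficient in isolation
 - With sq. ft. held fixed, adding bedrooms means smaller rooms, so the coefficient on \# of bedrooms can be negative (predicted house value decreases)
 - If sq. ft. is excluded from the model, more bedrooms usually signals a bigger house, so the coefficient on \# of bedrooms can be positive (predicted house value increases)
 - **Interpret each coefficient in the context of what you've put into the model**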
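
The sign flip described above can be reproduced on a toy example. The data below are synthetic and noise-free, invented purely for illustration (they are not from the notes), and the sketch assumes NumPy's `np.linalg.lstsq` for the fit.

```python
import numpy as np

# Synthetic data: price is constructed to rise with sq. ft. and to fall
# with #bed when sq. ft. is held fixed. Bigger houses also have more beds.
sqft = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
bed = np.array([2.0, 3.0, 3.0, 4.0, 4.0])
price = 200.0 * sqft - 10000.0 * bed

def fit(H, y):
    """Least squares weights for feature matrix H and outputs y."""
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return w

ones = np.ones_like(sqft)
w_full = fit(np.column_stack([ones, sqft, bed]), price)  # sq. ft. in the model
w_bed_only = fit(np.column_stack([ones, bed]), price)    # sq. ft. left out

print("coefficient on #bed with sq. ft. in the model:", w_full[2])
print("coefficient on #bed without sq. ft.:", w_bed_only[1])
```

In this constructed example the first coefficient is negative and the second is positive, because without sq. ft. in the model the bedroom count stands in for house size.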

### Setting the stage for computing the least squares fit
- Rewrite in matrix notation
 - $y_i = \sum_{j=0}^D{w_jh_j(X_i)} + \epsilon_i$
- For a single observation
<img src="./figures/w2-f6.png" width=500>
- For all observations
<img src="./figures/w2-f7.png" width=500>

- Computing the cost of a D-dimensional curve
 - RSS(w) = $\sum^N_i{(y_i - h(X_i)^T w)^2} = (y-Hw)^T(y-Hw)$
<img src="./figures/w2-f8.png" width=500>
 - Why?
   - $\hat{y} = Hw$
   - $(y - \hat{y}) = \begin{bmatrix} residual_1 \\ residual_2 \\ \vdots \\ residual_N \end{bmatrix}$
   - $(y - \hat{y})^T (y - \hat{y}) = \begin{bmatrix} residual_1 & residual_2 & \cdots & residual_N \end{bmatrix} \begin{bmatrix} residual_1 \\ residual_2 \\ \vdots \\ residual_N \end{bmatrix}$
<img src="./figures/w2-f9.png" width=500>
<img src="./figures/w2-f10.png" width=500>
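
The equivalence of the sum form and the matrix form of RSS can be checked numerically. This is a minimal sketch assuming NumPy; the function names and the tiny data set are made up for illustration.

```python
import numpy as np

def rss_sum(H, y, w):
    """RSS as a sum of squared residuals over observations."""
    total = 0.0
    for h_i, y_i in zip(H, y):
        total += (y_i - h_i @ w) ** 2   # h_i is the feature row of one observation
    return total

def rss_matrix(H, y, w):
    """RSS as (y - Hw)^T (y - Hw)."""
    residuals = y - H @ w
    return residuals @ residuals

H = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # constant and x
y = np.array([1.0, 2.0, 4.0])
w = np.array([0.0, 1.0])
print(rss_sum(H, y, w), rss_matrix(H, y, w))
```

Both functions compute the same quantity; the matrix form is the one the gradient and closed-form solution are derived from.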

### Computing the least squares D-dimensional curve
- Gradient of RSS
 - $\nabla RSS(w) = \nabla [(y-Hw)^T (y-Hw)] = -2H^T(y-Hw)$
- Why? by analogy to 1D case:
 - $\frac{d}{dw}(y-hw)(y-hw) = \frac{d}{dw}(y-hw)^2 = -2h(y-hw)$
 - because $(y-hw)$ is scalar
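
Beyond the 1D analogy, the gradient formula can be sanity-checked against finite differences. The sketch below assumes NumPy; the helper names, the step size, and the small data set are illustrative choices, not part of the notes.

```python
import numpy as np

def rss(H, y, w):
    residuals = y - H @ w
    return residuals @ residuals

def rss_gradient(H, y, w):
    """Analytic gradient from the notes: -2 H^T (y - H w)."""
    return -2.0 * H.T @ (y - H @ w)

def numerical_gradient(H, y, w, step=1e-6):
    """Central finite differences, one coordinate of w at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        e_j = np.zeros_like(w)
        e_j[j] = step
        grad[j] = (rss(H, y, w + e_j) - rss(H, y, w - e_j)) / (2.0 * step)
    return grad

H = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 4.0])
w = np.array([0.5, -0.25])
print(rss_gradient(H, y, w))
print(numerical_gradient(H, y, w))
```

The two printed vectors should agree up to floating-point error, which is a cheap way to catch sign or transpose mistakes in a hand-derived gradient.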

- **Approach 1: closed-form solution**
 - $\nabla RSS(w) = -2H^T(y-Hw) = 0$
 - Solve for $w$:
   - $-H^Ty + H^TH\hat{w} = 0$
   - $H^TH\hat{w} = H^Ty$
   - $(H^TH)^{-1}H^TH\hat{w} = (H^TH)^{-1}H^Ty$
     - because $A^{-1}A = I$ and $Iv = v$
   - $\hat{w} = (H^TH)^{-1}H^Ty$
 - \# features = $D$
   - $H^TH$ = $D \times D$ matrix
   - $H^TH$ is invertible:
     - in most cases, if $N > D$
     - more precisely, what matters is the number of linearly independent observations, which must be at least the number of features
   - Complexity of inverse:
     - $O(D^3)$, \# of features cubed
   - Total complexity:
     - $O(ND^2 + D^3)$
       - $O(ND^2)$ to form $H^TH$
       - $O(D^3)$ to invert it
<img src="./figures/w2-f11.png" width=500>
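
A minimal sketch of the closed-form solution, assuming NumPy. The function name is hypothetical, and as a design choice it solves the linear system $H^TH\hat{w} = H^Ty$ with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same $\hat{w}$ when $H^TH$ is invertible.

```python
import numpy as np

def least_squares_closed_form(H, y):
    """Return w_hat satisfying the normal equations H^T H w = H^T y."""
    HtH = H.T @ H          # cost O(N D^2)
    Hty = H.T @ y
    # Solving the system is generally more stable than computing the inverse.
    return np.linalg.solve(HtH, Hty)

# Toy data lying exactly on y = 1 + 2x (values chosen for illustration).
H = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
w_hat = least_squares_closed_form(H, y)
print(w_hat)

# The gradient -2 H^T (y - H w_hat) should vanish at the solution.
print(-2.0 * H.T @ (y - H @ w_hat))
```

If $H^TH$ is singular (for example, duplicated feature columns), `np.linalg.solve` raises `LinAlgError`, which matches the invertibility condition above.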

- **Approach 2: gradient descent**
 - while not converged
   - $w^{(t+1)} \gets w^{(t)} - \eta \nabla RSS(w^{(t)})$
   - $~~~~~~~~~ \gets w^{(t)} + 2\eta H^T(y-Hw^{(t)})$
   - $~~~~~~~~~ \gets w^{(t)} + 2\eta H^T(y-\hat{y}(w^{(t)}))$
 - Feature-by-feature update
   - RSS(w) = $\sum^N_i{(y_i - h(X_i)^T w)^2}$
   - $~~~~~~~~~~~~~~ = \sum^N_i{(y_i - w_0h_0(X_i) - w_1h_1(X_i) - \cdots -w_Dh_D(X_i))^2}$
   - **Partial with respect to $w_j$**
     - $\frac{\partial RSS(w)}{\partial w_j} = \sum^N_i{2(y_i - w_0h_0(X_i) - w_1h_1(X_i) - \cdots -w_Dh_D(X_i))} (-h_j(X_i))$
     - $~~~~~~~~~~~ = -2\sum^N_i{h_j(X_i)(y_i - h(X_i)^T w)}$
<img src="./figures/w2-f12.png" width=500>     

   - **Update to $j^{th}$ feature weight:**
     - $w_j^{(t+1)} \gets w_j^{(t)} - \eta (-2\sum^N_i{h_j(X_i)(y_i - h(X_i)^T w^{(t)})})$
     - $~~~~~~~~~ \gets  w_j^{(t)} +2\eta \sum^N_i{h_j(X_i)(y_i - \hat{y}_i(w^{(t)}))}$
<img src="./figures/w2-f13.png" width=500>

 - Summary of gradient descent for multiple regression
   - init $w^{(1)} = 0$ (or randomly, or smartly), $t=1$
   - **while** $|| \nabla RSS(w^{(t)})|| > \epsilon$, $\epsilon$ is tolerance
     - **for** j = 0, ..., D
       - $\text{partial[j]} = -2\sum^N_i{h_j(X_i)(y_i - \hat{y}_i(w^{(t)}))}$
       - $w_j^{(t+1)} \gets w_j^{(t)} - \eta \text{partial[j]}$
     - $t \gets t + 1$
<img src="./figures/w2-f14.png" width=500>
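
The summary above can be sketched as follows, assuming NumPy. The loop over $j$ is vectorized into a single matrix product, and the `max_iter` cap is an added safeguard that is not in the pseudocode; the function name, step size, and toy data are illustrative assumptions.

```python
import numpy as np

def gradient_descent(H, y, eta, tolerance, max_iter=100_000):
    """Minimize RSS(w) by gradient descent, starting from w = 0."""
    w = np.zeros(H.shape[1])
    for _ in range(max_iter):          # max_iter is a safeguard, not in the notes
        residuals = y - H @ w          # y_i - y_hat_i(w) for all i
        gradient = -2.0 * H.T @ residuals   # partial[j] for every j at once
        if np.linalg.norm(gradient) <= tolerance:
            break                      # converged: gradient norm within tolerance
        w = w - eta * gradient
    return w

# Toy data lying exactly on y = 1 + 2x (values chosen for illustration).
H = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
w = gradient_descent(H, y, eta=0.05, tolerance=1e-8)
print(w)
```

The step size $\eta$ has to be small enough for the iterates to converge; too large a value makes the updates diverge, so in practice it is tuned or decreased over iterations.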

### What you can do now...
- Describe polynomial regression
- Detrend a time series using trend and seasonal components
- Write a regression model using multiple inputs or features thereof
- Cast both polynomial regression and regression with multiple inputs as regression with multiple features
- Calculate a goodness-of-fit metric (e.g., RSS)
- Estimate model parameters of a general multiple regression model to minimize RSS:
 - In closed form
 - Using an iterative gradient descent algorithm
- Interpret the coefficients of a non-featurized multiple regression fit
- Exploit the estimated model to form predictions
- Explain applications of multiple regression beyond house price modeling