### Regression fundamentals: data, model, task
- **Data**
 - Input vs. Output:
   - $y$ is the quantity of interest
   - assume $y$ can be predicted from $x$
- **Model**
 - $f(x)$ : expected relationship between $x$ and $y$
 - **Regression model:**
   - $y_i = f(x_i) + \epsilon_i$
   - $E[\epsilon_i] = 0$
     - on average the error is zero: positive and negative errors balance out
     - $y_i$ is not systematically above or below $f(x_i)$

<img src="./figures/w1-f1.png" width=600>
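The regression model above can be illustrated with simulated data. This is a minimal sketch: the "true" function `f`, the Gaussian noise, and the seed are assumptions made only for the example (Gaussian noise is symmetric, which is stronger than the model's requirement $E[\epsilon_i] = 0$).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

def f(x):
    # hypothetical "true" relationship, assumed only for this illustration
    return 1.0 + 2.0 * x

x = np.linspace(0, 10, 1000)
eps = rng.normal(loc=0.0, scale=1.0, size=x.shape)  # zero-mean noise
y = f(x) + eps                                      # y_i = f(x_i) + eps_i

# the observed deviations from f(x) average out to roughly zero
residual_mean = np.mean(y - f(x))
print(residual_mean)
```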

- **Task 1 - Which model $f(x)$?**
 - average model
 - linear relationship model
 - quadratic fit
 - polynomial fit
 - ...

- **Task 2 - For a given model $f(x)$, estimate function $\hat{f}(x)$ from data**
 - Assume model $f(x)$ is a quadratic function
 - the data determine which quadratic $\hat{f}(x)$ is chosen among the many possible quadratic fits
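The two tasks can be sketched with `np.polyfit`: Task 1 corresponds to choosing the polynomial degree, Task 2 to estimating the coefficients from data. The small data set and the helper name `fit_polynomial` are assumptions for illustration, and `np.polyfit` is just one convenient way to obtain a least squares fit.

```python
import numpy as np

# small illustrative data set (an assumption for this sketch)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 7.0, 13.0, 21.0])

def fit_polynomial(x, y, degree):
    """Least squares polynomial fit; coefficients are ordered highest power first."""
    return np.polyfit(x, y, degree)

# Task 1: candidate models (degree 0 = average, 1 = linear, 2 = quadratic)
# Task 2: estimate each model's coefficients from the data
coeffs = {degree: fit_polynomial(x, y, degree) for degree in (0, 1, 2)}
for degree, c in coeffs.items():
    print(degree, c)
```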

### Regression ML block diagram
- `Training data` -> `Feature extraction` -> $x$
- $x$ -> `ML model` ( regression ) -> $\hat{y}$ : predicted values
- `Quality metric` (compares actual values $y$ with $\hat{y}$): error in our predicted values -> `ML algorithm` -> $\hat{f}$: estimated function fit from data -> `ML model`
 - loop, updating the weights or model parameters

### Simple linear regression
- $y_i = w_0 + w_1x_i + \epsilon_i$
 - $f(x) = w_0 + w_1x$
 - parameters: regression coefficients ($w_0, w_1$)
 
 
- **Fitting a line to data**
 - RSS: Residual sum of squares
   - the `error` $\epsilon_i$ is part of the model; a `residual` is the difference between an actual value and a prediction
 - $RSS(w_0, w_1) = \sum_{i=1}^{N}{(y_i - [w_0 + w_1x_i])^2}$
 - minimize cost over all possible $w_0, w_1$
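The RSS formula translates directly into code. In this sketch the function name `rss` and the example data are assumptions; the two candidate lines are arbitrary choices that show how the cost ranks different $(w_0, w_1)$.

```python
import numpy as np

def rss(w0, w1, x, y):
    """Residual sum of squares of the line w0 + w1*x on the data (x, y)."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

# small illustrative data set (an assumption for this sketch)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 7.0, 13.0, 21.0])

rss_a = rss(0.0, 5.0, x, y)   # candidate line: intercept 0, slope 5
rss_b = rss(-1.0, 5.0, x, y)  # candidate line: intercept -1, slope 5
print(rss_a, rss_b)
```

The line with the lower RSS fits the data better under this metric.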

### Optimization 1-dim
- Concave vs convex
 - concave: the line segment between any two points on $g(w)$ lies below $g(w)$ everywhere
 - convex: the line segment between any two points on $g(w)$ lies above $g(w)$ everywhere
<img src="./figures/w1-f2.png" width=500>
- weight update
 - minimize: $w^{t+1} = w^{t} - \eta\frac{dg(w)}{dw}$
 - hill descent (hill climbing is the mirror case for maximization, with $+\eta$)
   - if $\frac{dg(w)}{dw} > 0$ -> $w$ is decreased
   - if $\frac{dg(w)}{dw} < 0$ -> $w$ is increased
<img src="./figures/w1-f3.png" width=500>
- stepsize
 - fixed stepsize
 - decreasing stepsize (= stepsize schedule): common choices
   - $\eta_t = \frac{\alpha}{t}$
   - $\eta_t = \frac{\alpha}{\sqrt{t}}$
- convergence criteria
 - $|\frac{dg(w)}{dw}| < \epsilon$, $\epsilon$ is threshold
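The update rule, the decreasing stepsize $\eta_t = \alpha/\sqrt{t}$, and the convergence criterion can be combined in a few lines. This is a sketch under assumptions: the function name `minimize_1d`, the defaults for `alpha` and `epsilon`, the `max_iter` safety cap, and the test function $g(w) = (w - 3)^2$ are all illustrative choices.

```python
def minimize_1d(dg, w_init, alpha=0.4, epsilon=1e-6, max_iter=10_000):
    """Gradient descent in one dimension with stepsize eta_t = alpha / sqrt(t)."""
    w = w_init
    for t in range(1, max_iter + 1):
        grad = dg(w)
        if abs(grad) < epsilon:   # convergence criterion |dg/dw| < epsilon
            break
        eta = alpha / t ** 0.5    # decreasing stepsize schedule
        w = w - eta * grad        # moves w against the sign of the derivative
    return w

# example: g(w) = (w - 3)^2 is convex with derivative 2*(w - 3) and minimum at w = 3
w_star = minimize_1d(lambda w: 2.0 * (w - 3.0), w_init=0.0)
print(w_star)
```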

### Optimization multi-dims
- derivatives in multi-dim
 - $\nabla{g(w)} = \begin{bmatrix} \frac{\partial{g}}{\partial{w_0}} \\ \frac{\partial{g}}{\partial{w_1}} \\ \vdots \\ \frac{\partial{g}}{\partial{w_p}} \end{bmatrix}$
<img src="./figures/w1-f4.png" width=500>
- gradient descent
 - minimize: $w^{t+1} = w^{t} - \eta\nabla{g(w^{t})}$
 - convergence: $\|\nabla{g(w)}\| < \epsilon$, $\epsilon$ is threshold
<img src="./figures/w1-f5.png" width=500> 
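The multi-dimensional update looks the same as the 1-dim case, with the derivative replaced by the gradient and the stopping test applied to the gradient's norm. In this sketch the name `gradient_descent_nd`, the fixed stepsize, the `max_iter` cap, and the example function are assumptions for illustration.

```python
import numpy as np

def gradient_descent_nd(grad, w_init, eta=0.1, epsilon=1e-6, max_iter=10_000):
    """Gradient descent with a fixed stepsize; stops when ||grad|| < epsilon."""
    w = np.asarray(w_init, dtype=float)
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < epsilon:  # convergence criterion on the gradient norm
            break
        w = w - eta * g                  # w^{t+1} = w^t - eta * grad g(w^t)
    return w

# example: g(w) = (w0 - 1)^2 + 2*(w1 + 2)^2, a convex bowl with minimum at (1, -2)
def grad_g(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)])

w_min = gradient_descent_nd(grad_g, w_init=[0.0, 0.0])
print(w_min)
```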

### Finding the least squares line
- gradient of RSS
 - $RSS(w_0, w_1) = \sum_{i=1}^{N}{[y_i - (w_0 + w_1x_i)]^2}$
 - $\nabla{RSS(w_0, w_1)} = \begin{bmatrix} -2\sum_{i=1}^{N}{[y_i - (w_0 + w_1x_i)]} \\ -2\sum_{i=1}^{N}{[y_i - (w_0 + w_1x_i)]x_i} \end{bmatrix}$
- Approach 1 - closed form solution
<img src="./figures/w1-f7.png" width=500>
- Approach 2 - gradient descent
<img src="./figures/w1-f8.png" width=500>
<img src="./figures/w1-f9.png" width=500>
- Comparing the approaches
 - for most ML problems, `gradient = 0` cannot be solved in closed form
 - even if solving `gradient = 0` is feasible, `gradient descent` can be more efficient
 - `gradient descent` relies on choosing a `stepsize` and a `convergence` criterion
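For simple linear regression, Approach 1 can be written out explicitly. Setting both components of $\nabla{RSS(w_0, w_1)}$ to zero gives the following (a short derivation, consistent with the gradient above; the hats mark the estimated values):

$$\sum_{i=1}^{N}{[y_i - (w_0 + w_1x_i)]} = 0 \;\Rightarrow\; \hat{w}_0 = \frac{\sum_{i=1}^{N}{y_i} - \hat{w}_1\sum_{i=1}^{N}{x_i}}{N}$$

$$\sum_{i=1}^{N}{[y_i - (w_0 + w_1x_i)]x_i} = 0 \;\Rightarrow\; \hat{w}_1 = \frac{\sum_{i=1}^{N}{x_iy_i} - \frac{1}{N}\sum_{i=1}^{N}{x_i}\sum_{i=1}^{N}{y_i}}{\sum_{i=1}^{N}{x_i^2} - \frac{1}{N}\left(\sum_{i=1}^{N}{x_i}\right)^2}$$

The second line follows after substituting the expression for $\hat{w}_0$ into the second equation.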

In [1]:
import numpy as np

X = np.array([0, 1, 2, 3, 4])
y = np.array([1, 3, 7, 13, 21])

In [2]:
# approach 1 - using sums
w1 = (sum(X*y) - (sum(X)*sum(y))/len(X)) / (sum(X**2) - (sum(X)*sum(X))/len(X))
w0 = (sum(y) - w1*sum(X)) / len(X)

w1, w0

(5.0, -1.0)

In [3]:
# approach 1 - using means
w1 = (np.mean(X*y) - np.mean(X)*np.mean(y)) / (np.mean(X**2) - np.mean(X)*np.mean(X))
w0 = np.mean(y) - w1*np.mean(X)

w1, w0

(5.0, -1.0)
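As an independent cross-check of the closed-form values, the same line can be obtained from a generic least squares solver. This sketch uses `np.linalg.lstsq`; the design matrix `A` and the variable names are introduced here for illustration only.

```python
import numpy as np

X = np.array([0, 1, 2, 3, 4])
y = np.array([1, 3, 7, 13, 21])

# design matrix with a column of ones (intercept) and a column of inputs (slope)
A = np.column_stack([np.ones(len(X)), X])

# least squares solution of A @ [w0, w1] ~ y
solution, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
w0_check, w1_check = solution
print(w1_check, w0_check)
```

Up to floating-point rounding, the result should agree with the slope and intercept computed from the sums and means.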

**Recall that:**

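- The derivative of the cost with respect to the intercept is proportional to the sum of the errors
- The derivative of the cost with respect to the slope is proportional to the sum of the products of the errors and the inputs
- The constant factor 2 is absorbed into the step size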


**The algorithm**

In each step of the gradient descent we will do the following:
1. Compute the predicted values given the current slope and intercept
2. Compute the prediction errors (prediction - Y)
3. Update the intercept:
 - compute the derivative: sum(errors)
 - compute the adjustment as step_size times the derivative
 - decrease the intercept by the adjustment
4. Update the slope:
 - compute the derivative: sum(errors*input)
 - compute the adjustment as step_size times the derivative
 - decrease the slope by the adjustment
5. Compute the magnitude of the gradient
6. Check for convergence

In [4]:
# approach 2
import numpy as np

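def gradient_descent(X, y, initial_intercept=0, initial_slope=0, step_size=0.05, tolerance=0.01):
    """Fit y ~ w0 + w1*X by gradient descent on RSS (the factor 2 is absorbed into step_size)."""
    w0 = initial_intercept
    w1 = initial_slope

    iterations = 0
    while True:
        # steps 1-2: predictions and prediction errors (prediction - y)
        preds = w0 + w1*X
        errors = preds - y
        # derivatives with respect to the intercept and the slope
        sum_errors = np.sum(errors)
        sum_errors_X = np.sum(errors*X)
        # step 5: magnitude of the gradient at the current weights
        magnitude = np.sqrt(sum_errors**2 + sum_errors_X**2)

        # steps 3-4: decrease each weight by step_size times its derivative
        w0 -= step_size * sum_errors
        w1 -= step_size * sum_errors_X
        iterations += 1

        # step 6: stop once the gradient measured before this update is small
        if magnitude < tolerance:
            break

    # note: preds were computed before the final weight update
    return preds, w0, w1, iterations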

gradient_descent(X, y)

(array([-0.99373992,  4.00406416,  9.00186824, 13.99967233, 18.99747641]),
 -0.9942069818917416,
 4.997967918970868,
 78)

### What you can do now
- Describe the input (features) and output (real-valued predictions) of a regression model
- Calculate a goodness-of-fit metric (e.g., RSS)
- Estimate model parameters to minimize RSS using gradient descent
- Interpret estimated model parameters
- Exploit the estimated model to form predictions
- Discuss the possible influence of high leverage points
- Describe intuitively how the fitted line might change when assuming different goodness-of-fit metrics