## Gradient Descent Code-Along

Let's walk through how gradient descent works using code.

In [None]:
import pandas as pd
import numpy as np

**Ohio State Fun Facts:**
1. Ohio Stadium can seat 104,944 people. (Source: [Wikipedia](https://en.wikipedia.org/wiki/Ohio_Stadium).)
2. Ohio Stadium's record attendance is 110,045 people. (Source: [Wikipedia](https://en.wikipedia.org/wiki/Ohio_Stadium).)
3. Michigan sucks. (Source: It's just a fact.)
4. Ohio State students enjoy alcohol. (Source: first-hand knowledge.)

In [None]:
# Assumes `temp` is an existing 1-D array; draw one noise term per observation.
beers_sold = 200000 + 1000 * temp + np.random.normal(0, 20000, size=len(temp))

$$ \text{beers_sold}_i = 200000 + 1000 * \text{temp}_i + \varepsilon_i $$
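Here's a minimal sketch of how data following that equation could be simulated from scratch. The seed, the sample size of 100, and the temperature distribution (mean 80, standard deviation 10) are assumptions made up for illustration; only the intercept, slope, and noise scale come from the equation above.

```python
import numpy as np

# Assumed for illustration: seed, sample size, and temperature distribution.
rng = np.random.default_rng(42)
n_games = 100
temp = rng.normal(80, 10, size=n_games)

# One independent error term per observation (the epsilon_i in the equation).
noise = rng.normal(0, 20000, size=n_games)

# beers_sold_i = 200000 + 1000 * temp_i + epsilon_i
beers_sold = 200000 + 1000 * temp + noise
```

Because the noise is drawn with `size=n_games`, every observation gets its own error rather than all of them sharing a single draw.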

In [None]:
df = pd.DataFrame.from_dict({'temp': temp,
                             'beers_sold': beers_sold})

#### Our goal is to fit a model here.
- You and I know that our $y$-intercept $\beta_0$ is 200,000.
- You and I know that our slope $\beta_1$ is 1,000.
- However, our computer does not know that. Our computer has to estimate $\hat{\beta}_0$ and $\hat{\beta}_1$ from the data.
    - We might say that our **machine** has to... **learn**.

#### Our workflow:
1. Instantiate model.
2. Select a learning rate $\alpha$.
3. Select a starting point $\hat{\beta}_{1,0}$.
4. Calculate the gradient of the loss function.
5. Calculate $\hat{\beta}_{1,i+1} = \hat{\beta}_{1,i} - \alpha * \frac{\partial L}{\partial \beta_1}$.
6. Check value of $|\hat{\beta}_{1,i+1} - \hat{\beta}_{1,i}|$.
7. Repeat steps 4 through 6 until "stopping condition" is met.

#### Step 1. Instantiate model.

Our model takes on the form:
$$ Y = \beta_0 + \beta_1 X + \varepsilon$$

#### Step 2. Select a learning rate $\alpha$.

$$\alpha = 0.1$$

#### Step 3. Select a starting point.
The zeroth iteration of $\hat{\beta}_1$ is going to start at, say, 20.
$$\hat{\beta}_{1,0} = 20$$

Two points:
- You and I know that the true value of $\beta_1$ is 1,000. We need the computer to figure that part out!
- We're going to pretend that the computer already knows the value of $\beta_0$. In reality, we'd have to do this for $\beta_0$ and for $\beta_1$ at the same time.

#### Step 4. Calculate the gradient of the loss function with respect to parameter $\beta_1$.

The loss function, $L$, is our mean squared error.

$$L = \frac{1}{n} \sum_{i = 1} ^ n (y_i - \hat{y}_i)^2 $$

$$\Rightarrow L = \frac{1}{n} \sum_{i = 1} ^ n (y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i))^2 $$
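To make the loss concrete, here is a small sketch that computes the mean squared error for a candidate intercept and slope. The function name `mse_loss` and the three-point toy dataset are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
import numpy as np

def mse_loss(x, y, beta_0, beta_1):
    """Mean squared error of the line y_hat = beta_0 + beta_1 * x."""
    y_hat = beta_0 + beta_1 * x
    return np.mean((y - y_hat) ** 2)

# Toy data (made up): y is exactly 2 * x, so the best slope is 2.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

loss_at_zero = mse_loss(x, y, beta_0=0.0, beta_1=0.0)  # (4 + 16 + 36) / 3
loss_at_two = mse_loss(x, y, beta_0=0.0, beta_1=2.0)   # perfect fit, so 0
```

The loss shrinks as the slope moves from 0 toward 2, which is exactly the direction we want gradient descent to push us.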

The gradient of this loss function with respect to $\beta_1$ is:

$$\frac{\partial L}{\partial \beta_1} = \frac{2}{n} \sum_{i=1}^n -x_i(y_i - (\hat{\beta}_1x_i + \hat{\beta}_0)) $$
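The gradient formula translates almost symbol for symbol into NumPy. The sketch below uses a hypothetical helper name, `gradient_beta_1`, and the same made-up toy data as a sanity check; it is one way to write it, not necessarily how the final function does it.

```python
import numpy as np

def gradient_beta_1(x, y, beta_0, beta_1):
    """Partial derivative of the mean squared error with respect to beta_1."""
    residuals = y - (beta_1 * x + beta_0)
    return (2 / len(x)) * np.sum(-x * residuals)

# Toy data (made up): y is exactly 2 * x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

grad_at_zero = gradient_beta_1(x, y, beta_0=0.0, beta_1=0.0)  # negative: slope too small
grad_at_two = gradient_beta_1(x, y, beta_0=0.0, beta_1=2.0)   # zero: already at the minimum
```

A negative gradient tells us the loss decreases as $\hat{\beta}_1$ increases, so the update should move the slope upward.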

#### Step 5. Calculate $\hat{\beta}_{1,i+1} = \hat{\beta}_{1,i} - \alpha * \frac{\partial L}{\partial \beta_1}$.
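Here's what a single update looks like with numbers plugged in. The toy data, the starting slope of 0, and the known intercept of 0 are assumptions for illustration, picked so that $\alpha = 0.1$ behaves well.

```python
import numpy as np

# Toy data (made up): y is exactly 2 * x, so the true slope is 2.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

beta_0 = 0.0      # pretend the intercept is already known
beta_1_old = 0.0  # current guess for the slope
alpha = 0.1       # learning rate

# Gradient of the mean squared error with respect to beta_1.
gradient = (2 / len(x)) * np.sum(-x * (y - (beta_1_old * x + beta_0)))

# One gradient descent step: move against the gradient.
beta_1_new = beta_1_old - alpha * gradient  # roughly 1.87
```

One step takes the slope from 0 to about 1.87, most of the way to the true value of 2. With inputs as large as temperatures in the tens, the gradient is far bigger and the same $\alpha$ can overshoot badly, so treat these numbers as illustrative only.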

#### Step 6. Check value of $|\hat{\beta}_{1,i+1} - \hat{\beta}_{1,i}|$.
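One simple way to turn this check into code is to compare the size of the change against a small tolerance. The helper name `has_converged` and the default tolerance of `1e-6` are assumptions; the text above doesn't specify a threshold.

```python
def has_converged(beta_new, beta_old, tol=1e-6):
    """Return True when the latest update moved beta by less than `tol`."""
    return abs(beta_new - beta_old) < tol

# A large jump means we should keep iterating...
still_moving = has_converged(1.8667, 0.0)

# ...while a tiny change suggests we can stop.
settled = has_converged(2.0, 2.0 + 1e-9)
```

A looser tolerance stops sooner but leaves a rougher estimate; a tighter one costs more iterations.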

#### Step 7. Repeat steps 4 through 6 until the stopping condition is met, then save the final value of $\hat{\beta}_1$.

#### Putting it all together...

In [None]:
def gradient_descent(x, y, beta_1 = 0, alpha = 0.01, max_iter = 100):