## Gradient Descent Code-Along

Let's walk through how gradient descent works using code.

In [1]:
import pandas as pd
import numpy as np

In [2]:
np.random.seed(42)

In [3]:
temp = np.random.uniform(-10, 80, 100)

In [4]:
temp

array([23.7086107 , 75.56428758, 55.87945476, 43.87926358,  4.04167764,
        4.03950683, -4.7724749 , 67.95585312, 44.10035106, 53.726532  ,
       -8.14739551, 77.29188669, 64.91983767,  9.11051996,  6.36424705,
        6.50640589, 17.38180187, 37.22807885, 28.87505168, 16.21062262,
       45.06676053,  2.55444746, 16.29301837, 22.9725659 , 31.04629858,
       60.66583653,  7.97064039, 36.28109946, 43.3173112 , -5.81946286,
       44.67903667,  5.34717113, -4.14535663, 75.39969835, 76.90688298,
       62.75576133, 17.41523923, -1.20950974, 51.58097239, 29.61372444,
        0.98344114, 34.56592191, -6.9050331 , 71.83883619, 13.29019834,
       49.62700559, 18.05399685, 36.80612191, 39.20392514,  6.636901  ,
       77.2626165 , 59.7619541 , 74.55490474, 70.53446154, 43.81099809,
       72.96868115, -2.03567482,  7.63845762, -5.929544  , 19.27972977,
       24.98095607, 14.42141286, 64.58637582, 22.1077994 , 15.28410587,
       38.84264748,  2.68318025, 62.19772827, -3.29044207, 78.81

In [5]:
print(np.mean(temp))
print(np.std(temp, ddof = 1))

32.31626690403884
26.77404699137873


**Ohio State Fun Facts:**
1. Ohio Stadium can seat 104,944 people. (Source: [Wikipedia](https://en.wikipedia.org/wiki/Ohio_Stadium).)
2. Ohio Stadium's record attendance is 110,045 people. (Source: [Wikipedia](https://en.wikipedia.org/wiki/Ohio_Stadium).)
3. Michigan sucks. (Source: It's just a fact.)
4. Ohio State students enjoy alcohol. (Source: first-hand knowledge.)

In [6]:
beers_sold = 200000 + 1000 * temp + np.random.normal(0, 20000)

In [7]:
beers_sold

array([225449.55206103, 277305.22894166, 257620.39612779, 245620.2049425 ,
       205782.61900458, 205780.44819502, 196968.4664599 , 269696.79448451,
       245841.29242165, 255467.47336641, 193593.54585139, 279032.82805934,
       266660.7790368 , 210851.46132581, 208105.1884134 , 208247.34725157,
       219122.74323112, 238969.02021166, 230615.99304255, 217951.56398259,
       246807.70188978, 204295.38882345, 218033.95973293, 224713.5072612 ,
       232787.2399443 , 262406.77789013, 209711.58175902, 238022.04082199,
       245058.25256235, 195921.47850956, 246419.97803589, 207088.11249662,
       197595.58473344, 277140.63971756, 278647.82434147, 264496.70269524,
       219156.18059037, 200531.43162534, 253321.91375086, 231354.66580133,
       202724.38250079, 236306.86327478, 194835.90826513, 273579.77755185,
       215031.13970876, 251367.94695662, 219794.93821281, 238547.06327077,
       240944.86650566, 208377.84236206, 279003.55786357, 261502.89546726,
       276295.84610554, 2

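$$ \text{beers_sold}_i = 200000 + 1000 * \text{temp}_i + \varepsilon_i $$

Note that `np.random.normal(0, 20000)` is called without a `size` argument, so it returns a single draw: every observation above shares the same noise value (about 1,741 here) rather than getting its own $\varepsilon_i$. Passing `size = len(temp)` would draw independent noise for each observation and match the equation.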

In [8]:
df = pd.DataFrame.from_dict({'temp': temp,
                             'beers_sold': beers_sold})

In [9]:
df.head()

| | beers_sold | temp |
|---|---|---|
| 0 | 225449.552061 | 23.708611 |
| 1 | 277305.228942 | 75.564288 |
| 2 | 257620.396128 | 55.879455 |
| 3 | 245620.204942 | 43.879264 |
| 4 | 205782.619005 | 4.041678 |


#### Our goal is to fit a model here.
- You and I know that our $y$-intercept $\beta_0$ is 200,000.
- You and I know that our slope $\beta_1$ is 1,000.
- However, our computer does not know that. Our computer has to estimate $\hat{\beta}_0$ and $\hat{\beta}_1$ from the data.
    - We might say that our **machine** has to... **learn**.

#### Our workflow:
1. Instantiate model.
2. Select a learning rate $\alpha$.
3. Select a starting point $\hat{\beta}_{1,0}$.
4. Calculate the gradient of the loss function.
5. Calculate $\hat{\beta}_{1,i+1} = \hat{\beta}_{1,i} - \alpha * \frac{\partial L}{\partial \beta_1}$.
6. Check value of $|\hat{\beta}_{1,i+1} - \hat{\beta}_{1,i}|$.
7. Repeat steps 4 through 6 until "stopping condition" is met.
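
As a minimal sketch of steps 2 through 7, here is the same loop on a toy one-parameter loss, $L(b) = (b - 3)^2$, whose minimum is at $b = 3$. The toy loss, the names `toy_gradient` and `toy_descent`, and the tolerance are illustrative assumptions and not part of the beer-sales example.

```python
def toy_gradient(b):
    # Derivative of the assumed toy loss L(b) = (b - 3)^2.
    return 2 * (b - 3)

def toy_descent(b=20.0, alpha=0.1, tolerance=1e-6, max_iter=1000):
    for _ in range(max_iter):
        b_new = b - alpha * toy_gradient(b)   # step 5: take one update step
        if abs(b_new - b) < tolerance:        # step 6: check how far we moved
            return b_new
        b = b_new                             # step 7: otherwise, repeat
    return b

b_hat = toy_descent()
print(b_hat)  # should be very close to 3
```

Each pass moves `b` a fraction of the way toward the minimum, and the loop stops once a step becomes smaller than the tolerance.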

#### Step 1. Instantiate model.

Our model takes on the form:
$$ Y = \beta_0 + \beta_1 X + \varepsilon$$

#### Step 2. Select a learning rate $\alpha$.

$$\alpha = 0.1$$

In [10]:
alpha = 0.1
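
How large $\alpha$ can be depends on the scale of the data. As a rough sketch (assuming $\beta_0$ is held fixed, as in this walkthrough, and regenerating `temp` with the same seed): for mean squared error, each update multiplies the distance between $\hat{\beta}_1$ and its best value by $1 - 2\alpha \cdot \text{mean}(x^2)$, so the iterates only settle down when that factor lies between $-1$ and $1$. The helper name `step_factor` is hypothetical.

```python
import numpy as np

# Assumption: regenerate the temperature sample the same way as above.
np.random.seed(42)
temp = np.random.uniform(-10, 80, 100)

def step_factor(x, alpha):
    # With beta_0 held fixed, one update multiplies the error in beta_1
    # by (1 - 2 * alpha * mean(x^2)).
    return 1 - 2 * alpha * np.mean(x ** 2)

for alpha in (0.1, 0.0001):
    factor = step_factor(temp, alpha)
    verdict = "shrinks" if abs(factor) < 1 else "grows"
    print(f"alpha = {alpha}: error {verdict} by a factor of {abs(factor):.3f} per step")
```

On this data the factor for $\alpha = 0.1$ is far outside that range, which is consistent with the much smaller `alpha = 0.0001` passed to `gradient_descent` later on.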

#### Step 3. Select a starting point.
The zeroth iteration of $\hat{\beta}_1$ is going to start at, say, 20.
$$\hat{\beta}_{1,0} = 20$$

Two points:
- You and I know that the true value of $\beta_1$ is 1,000. We need the computer to figure that part out!
- We're going to pretend that the computer already knows the value of $\beta_0$. In reality, we'd have to estimate $\beta_0$ and $\beta_1$ at the same time.
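
Estimating both parameters at once uses the same update, applied to each partial derivative. The sketch below is one way to do it, not the approach taken in the rest of this walkthrough: it standardizes $x$ first (an added assumption, so that a single learning rate works for both parameters) and then converts the estimates back to the original scale. The function name `fit_line_gd` and its defaults are hypothetical.

```python
import numpy as np

def fit_line_gd(x, y, alpha=0.1, max_iter=1000, tolerance=1e-8):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize x so both parameters live on a comparable scale.
    x_mean, x_std = x.mean(), x.std()
    z = (x - x_mean) / x_std
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        residual = y - (b0 + b1 * z)
        grad_b0 = -2 * residual.mean()        # dL/db0 for mean squared error
        grad_b1 = -2 * (z * residual).mean()  # dL/db1 for mean squared error
        new_b0 = b0 - alpha * grad_b0
        new_b1 = b1 - alpha * grad_b1
        done = max(abs(new_b0 - b0), abs(new_b1 - b1)) < tolerance
        b0, b1 = new_b0, new_b1
        if done:
            break
    # Undo the standardization to report slope and intercept in x's units.
    slope = b1 / x_std
    intercept = b0 - slope * x_mean
    return intercept, slope

# Made-up noiseless data: y = 3 + 2x.
x_demo = np.linspace(0, 10, 50)
y_demo = 3 + 2 * x_demo
intercept, slope = fit_line_gd(x_demo, y_demo)
print(intercept, slope)  # should be close to 3 and 2
```

Without the standardization step, the intercept and slope gradients differ in scale by orders of magnitude on data like ours, and a learning rate small enough for the slope makes the intercept converge very slowly.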

In [11]:
beta_1 = 20

#### Step 4. Calculate the gradient of the loss function with respect to parameter $\beta_1$.

The loss function, $L$, is our mean squared error.

$$L = \frac{1}{n} \sum_{i = 1} ^ n (y_i - \hat{y}_i)^2 $$

$$\Rightarrow L = \frac{1}{n} \sum_{i = 1} ^ n (y_i - (\hat{\beta}_0 + \hat{\beta}_1x_i))^2 $$

The gradient of this loss function with respect to $\beta_1$ is:

$$\frac{\partial L}{\partial \beta_1} = \frac{2}{n} \sum_{i=1}^n -x_i(y_i - (\hat{\beta}_1x_i + \hat{\beta}_0)) $$
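
For reference, this follows from the chain rule, taking $L$ to be the mean of the squared errors (the $\frac{1}{n}$ in the mean is what produces the $\frac{2}{n}$):

$$\frac{\partial L}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial \beta_1} (y_i - (\hat{\beta}_1x_i + \hat{\beta}_0))^2 = \frac{1}{n} \sum_{i=1}^n 2 (y_i - (\hat{\beta}_1x_i + \hat{\beta}_0)) \cdot (-x_i) = \frac{2}{n} \sum_{i=1}^n -x_i(y_i - (\hat{\beta}_1x_i + \hat{\beta}_0)) $$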

In [12]:
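def beta_1_gradient(x, y, beta_1, beta_0):
    # Gradient of the mean squared error with respect to beta_1.
    n = len(x)
    gradient = 0
    for i in range(n):
        # Residual for observation i, multiplied by -x_i (from the chain rule).
        gradient += -1 * x[i] * (y[i] - (beta_1 * x[i] + beta_0))
    # Scale the sum by 2 / n to match the formula above.
    gradient *= (2 / n)
    return gradient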
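
The same gradient can be computed without an explicit loop. The version below is a sketch of a vectorized equivalent using NumPy. The name `beta_1_gradient_vectorized` and the small demo arrays are made up for illustration.

```python
import numpy as np

def beta_1_gradient_vectorized(x, y, beta_1, beta_0):
    # Convert to arrays so the arithmetic is elementwise and position-based.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (beta_1 * x + beta_0)
    # Mean of -2 * x_i * residual_i, i.e. (2 / n) * sum(-x_i * residual_i).
    return np.mean(-2 * x * residuals)

# Made-up check: y = 5 + 3x exactly, so the gradient vanishes at beta_1 = 3
# and is negative when beta_1 is too small.
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = 5 + 3 * x_demo
print(beta_1_gradient_vectorized(x_demo, y_demo, beta_1=3, beta_0=5))
print(beta_1_gradient_vectorized(x_demo, y_demo, beta_1=2, beta_0=5))
```

Converting the inputs with `np.asarray` also avoids label-based indexing surprises if `x` and `y` are pandas Series with a non-default index.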

#### Step 5. Calculate $\hat{\beta}_{1,i+1} = \hat{\beta}_{1,i} - \alpha * \frac{\partial L}{\partial \beta_1}$.

In [13]:
def update_beta_1(beta_1, alpha, gradient):
    beta_1 = beta_1 - alpha * gradient
    return beta_1
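
As a worked example using the run shown further down (starting value 20, $\alpha = 0.0001$): the first printed value, 375.04, implies a gradient of roughly $-3{,}550{,}405$ at $\hat{\beta}_{1,0} = 20$, since

$$\hat{\beta}_{1,1} = 20 - 0.0001 \times (-3{,}550{,}405) \approx 375.04$$

The gradient is negative because increasing $\hat{\beta}_1$ from 20 lowers the loss, so the update moves $\hat{\beta}_1$ upward.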

#### Step 6. Check value of $|\hat{\beta}_{1,i+1} - \hat{\beta}_{1,i}|$.

In [14]:
def check_update(beta_1, updated_beta_1, tolerance = 0.1):
    # Converged when the last update moved beta_1 by less than the tolerance.
    return abs(beta_1 - updated_beta_1) < tolerance

#### Step 7. Repeat steps 4 through 6 until the stopping condition is met, then save the final value of $\hat{\beta}_1$.

#### Putting it all together...

In [15]:
def gradient_descent(x, y, beta_1 = 0, alpha = 0.01, max_iter = 100):
    converged = False
    for i in range(max_iter):
        # beta_0 is treated as known and fixed at its true value of 200,000.
        gradient = beta_1_gradient(x, y, beta_1, 200000)
        updated_beta_1 = update_beta_1(beta_1, alpha, gradient)
        converged = check_update(beta_1, updated_beta_1)
        beta_1 = updated_beta_1
        if converged:
            # i is zero-indexed, so i + 1 updates have been applied.
            print("Our algorithm converged after " + str(i + 1) + " iterations with a beta_1 value of: " + str(beta_1))
            break
        print("Iteration " + str(i) + " with beta_1 value of: " + str(beta_1))
    if not converged:
        print("Our algorithm did not converge, so do not trust the value of beta_1.")
    return beta_1

In [16]:
gradient_descent(df['temp'], df['beers_sold'], beta_1 = 20, alpha = 0.0001, max_iter = 100)

Iteration 0 with beta_1 value of: 375.04049694698205
Iteration 1 with beta_1 value of: 605.5312109730789
Iteration 2 with beta_1 value of: 755.1647590023708
Iteration 3 with beta_1 value of: 852.3061939206527
Iteration 4 with beta_1 value of: 915.3699821036045
Iteration 5 with beta_1 value of: 956.3107133510871
Iteration 6 with beta_1 value of: 982.8892542756507
Iteration 7 with beta_1 value of: 1000.1439250192795
Iteration 8 with beta_1 value of: 1011.3455806449984
Iteration 9 with beta_1 value of: 1018.6176457351731
Iteration 10 with beta_1 value of: 1023.3386380994017
Iteration 11 with beta_1 value of: 1026.4034853782327
Iteration 12 with beta_1 value of: 1028.3931706218973
Iteration 13 with beta_1 value of: 1029.6848654466933
Iteration 14 with beta_1 value of: 1030.5234279910542
Iteration 15 with beta_1 value of: 1031.0678190711346
Iteration 16 with beta_1 value of: 1031.4212353429014
Iteration 17 with beta_1 value of: 1031.650671617153
Iteration 18 with beta_1 value of: 1031.79962

1031.8963176636985
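
As a sanity check: with $\beta_0$ held fixed, setting the gradient to zero gives a closed-form minimizer, $\hat{\beta}_1 = \sum_i x_i (y_i - \beta_0) / \sum_i x_i^2$. The sketch below regenerates the data with the same seed and the same calls (assuming this reproduces the arrays above) and computes that value. `closed_form_beta_1` is a hypothetical helper name.

```python
import numpy as np

def closed_form_beta_1(x, y, beta_0):
    # Minimizer of the mean squared error in beta_1 when beta_0 is fixed.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(x * (y - beta_0)) / np.sum(x ** 2)

# Assumption: same seed and same sequence of random calls as the cells above.
np.random.seed(42)
temp = np.random.uniform(-10, 80, 100)
beers_sold = 200000 + 1000 * temp + np.random.normal(0, 20000)

beta_1_star = closed_form_beta_1(temp, beers_sold, 200000)
print(beta_1_star)
```

If the regenerated data matches, this should land close to the value gradient descent returned. A small gap is expected, because the loop stops once a step is smaller than the tolerance of 0.1 rather than at the exact minimizer. Neither value is exactly 1,000, since the noise term shifts the best-fitting slope when the intercept is pinned at 200,000.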