# Gradient descent

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

[![Scatterplot featuring a linear support vector machine's decision boundary (dashed line)](https://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Kernel_Machine.svg/220px-Kernel_Machine.svg.png)](https://en.wikipedia.org/wiki/File:Kernel_Machine.svg "Scatterplot featuring a linear support vector machine's decision boundary (dashed line)")

**In mathematics, _gradient descent_ (also often called _steepest descent_) is a [first-order](https://en.wikipedia.org/wiki/Category:First_order_methods "Category:First order methods") [iterative](https://en.wikipedia.org/wiki/Iterative_algorithm "Iterative algorithm") [optimization](https://en.wikipedia.org/wiki/Mathematical_optimization "Mathematical optimization") [algorithm](https://en.wikipedia.org/wiki/Algorithm "Algorithm") for finding a [local minimum](https://en.wikipedia.org/wiki/Local_minimum "Local minimum") of a [differentiable function](https://en.wikipedia.org/wiki/Differentiable_function "Differentiable function"). The idea is to take repeated steps in the opposite direction of the [gradient](https://en.wikipedia.org/wiki/Gradient "Gradient") (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a [local maximum](https://en.wikipedia.org/wiki/Local_maximum "Local maximum") of that function; the procedure is then known as _gradient ascent_.**
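
As a minimal sketch of the update rule described above, the snippet below minimizes the one-dimensional function f(x) = (x - 3)^2. The function, the starting point, and the fixed step size of 0.1 are illustrative assumptions and are not part of this notebook's data.

```python
def f_prime(x_value):
    # Derivative of the illustrative function f(x) = (x - 3)**2
    return 2.0 * (x_value - 3.0)


x_current = 0.0   # arbitrary starting point (assumption)
step_size = 0.1   # fixed learning rate (assumption)

for _ in range(100):
    # Step against the gradient, i.e. in the direction of steepest descent
    x_current = x_current - step_size * f_prime(x_current)

print(x_current)  # approaches the minimizer x = 3
```

Flipping the sign of the update (adding instead of subtracting) would turn this into gradient ascent.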


In [1]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot


In [2]:
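x = np.random.randn(100)*2  # 100 inputs drawn from a normal distribution (std 2)
# Noise is drawn from N(-1, 1) and scaled by 0.15, so it shifts y down by 0.15 on average
noise = np.random.normal(-1, 1, 100)*0.15
y = np.sin(x) + noise  # the "mystery function" is a noisy sine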

fig = go.Figure(data=go.Scatter(
    x=x, y=y, mode='markers', name='Mystery Function'))
fig.update_layout(template='plotly_dark',
                  title='Mystery Function',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
iplot(fig)


### Possible models

**Models that could explain the above data distribution.**

```

y_1 = f(x) = cos(x - 3)
y_2 = f(x) = cos(2x + 2)
y_3 = f(x) = cos(x + 2)
y_4 = f(x) = cos(2x - 2)

```


In [3]:
def model(w, x):
    return np.cos(w[0] * x + w[1])


xtest = np.linspace(x.min()-.1, x.max()+.1, 500)  # evenly spaced test points

y_1 = model([1, -3], xtest)
y_2 = model([2, 2], xtest)
y_3 = model([1, 2], xtest)
y_4 = model([2, -2], xtest)

fig = go.Figure(data=go.Scatter(
    x=x, y=y, mode='markers', name='Mystery Function'))
fig.add_trace(go.Scatter(x=xtest, y=y_1, name='f(x) = cos(x - 3)'))
fig.add_trace(go.Scatter(x=xtest, y=y_2, name='f(x) = cos(2x + 2)'))
fig.add_trace(go.Scatter(x=xtest, y=y_3, name='f(x) = cos(x + 2)'))
fig.add_trace(go.Scatter(x=xtest, y=y_4, name='f(x) = cos(2x - 2)'))
fig.update_layout(template='plotly_dark',
                  title='Guess Functions',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
iplot(fig)


### Loss Function

- **Average squared loss (mean squared error)**: `np.mean((y - np.cos(w[0] * x + w[1]))**2)`.
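
To make the formula concrete, here is a small worked example on three hand-picked points. The arrays `actual_demo` and `predicted_demo` are hypothetical values chosen only for illustration.

```python
import numpy as np

actual_demo = np.array([1.0, 2.0, 3.0])
predicted_demo = np.array([1.0, 2.0, 5.0])

# Squared errors are [0, 0, 4]; their mean is 4/3
mse_demo = np.mean((actual_demo - predicted_demo) ** 2)
print(mse_demo)
```

Because the errors are squared before averaging, a single large miss dominates the loss.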


In [4]:
def mse(actual, predicted):
    actual = np.array(actual)
    predicted = np.array(predicted)
    differences = np.subtract(actual, predicted)
    squared_differences = np.square(differences)
    return squared_differences.mean()


# Evaluate each candidate model on the observed data
candidates = {'y_1': [1, -3], 'y_2': [2, 2], 'y_3': [1, 2], 'y_4': [2, -2]}
losses = [mse(y, model(w, x)) for w in candidates.values()]

fig = go.Figure([go.Bar(x=list(candidates), y=losses)])
fig.update_traces(texttemplate='%{y:.2f}')
fig.update_layout(template='plotly_dark',
                  title_text='Model Loss Comparison',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
iplot(fig)


### Brute Force Search

**In [computer science](https://en.wikipedia.org/wiki/Computer_science "Computer science"), _brute-force search_ or _exhaustive search_, also known as _generate and test_, is a very general [problem-solving](https://en.wikipedia.org/wiki/Problem-solving "Problem-solving") technique and [algorithmic paradigm](https://en.wikipedia.org/wiki/Algorithmic_paradigm "Algorithmic paradigm") that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's statement.**

- **Let's generate a grid of evenly spaced values between -3 and 3 and use every pair of them to create a model, i.e., as the values of `w0` and `w1`.**
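
The same exhaustive search can also be written without explicit loops. The sketch below is one possible vectorized version, assuming noise-free synthetic data generated with a fixed seed. The names `x_demo`, `y_demo`, and `loss_grid` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed, for illustration only
x_demo = rng.normal(0.0, 2.0, size=100)
y_demo = np.sin(x_demo)                    # noise-free target (assumption)

grid = np.linspace(-3, 3, 50)
# With indexing='ij': W0[i, j] = grid[i] and W1[i, j] = grid[j]
W0, W1 = np.meshgrid(grid, grid, indexing='ij')

# Broadcast to shape (50, 50, 100), then average over the data axis
predictions = np.cos(W0[..., None] * x_demo + W1[..., None])
loss_grid = np.mean((y_demo - predictions) ** 2, axis=-1)

# Locate the grid cell with the smallest loss
i, j = np.unravel_index(np.argmin(loss_grid), loss_grid.shape)
best_w0, best_w1 = grid[i], grid[j]
print(best_w0, best_w1, loss_grid[i, j])
```

The cost grows with the number of grid points per parameter raised to the number of parameters, which is why brute force only works for very small models.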


In [5]:
w0_val = np.linspace(-3, 3, 50)
w1_val = np.linspace(-3, 3, 50)

loss_log = []

for w0 in w0_val:
    for w1 in w1_val:
        # Score each (w0, w1) pair against the observed data y
        loss_log.append([w0, w1, mse(y, model([w0, w1], x))])

loss_log_df = pd.DataFrame(loss_log, columns=['w0', 'w1', 'MSE'])

best_row = loss_log_df.loc[loss_log_df['MSE'].idxmin()]  # row with the lowest loss
w0_BF, w1_BF = best_row['w0'], best_row['w1']

brute_force_model = model([w0_BF, w1_BF], xtest)

fig = go.Figure(data=go.Scatter(
    x=x, y=y, mode='markers', name='Mystery Function'))
fig.add_trace(go.Scatter(x=xtest, y=brute_force_model,
              name=f'f(x) = cos({w0_BF:.2f} * x + {w1_BF:.2f})'))
fig.update_layout(template='plotly_dark',
                  title='Brute Force Model',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
iplot(fig)


**Now, let's take a look at the _Loss Surface/Loss Function Landscape_**.


In [6]:
fig = go.Figure()
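# The reshaped grid is indexed [w0, w1], but Surface reads z[i][j] at (x[j], y[i]), hence the transpose
fig.add_trace(go.Surface(x=w0_val, y=w1_val,
                         z=loss_log_df['MSE'].to_numpy().reshape((len(w0_val), len(w1_val))).T))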
fig.update_layout(margin=dict(l=0, r=0, t=70, b=0),
                  template='plotly_dark',
                  title='Loss Function Landscape - 3D',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
iplot(fig)

fig = go.Figure()
fig.add_trace(go.Contour(
    x=loss_log_df['w0'], y=loss_log_df['w1'], z=loss_log_df['MSE']))
fig.update_layout(margin=dict(l=0, r=0, t=70, b=0),
                  template='plotly_dark',
                  title='Loss Function Landscape - 2D',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
iplot(fig)


### Gradient Descent

**We need to compute the gradient of our loss function (MSE):**

- `np.mean((y - np.cos(w[0] * x + w[1]))**2)`

_Note: The derivative of cos is -sin. The expressions below apply the chain rule to the squared residual._

- partial derivative (g0) with respect to `w0` = `-np.mean(2 * (y - np.cos(w[0] * x + w[1]))* - np.sin(w[0]*x + w[1])*x)`
- partial derivative (g1) with respect to `w1` = `-np.mean(2 * (y - np.cos(w[0] * x + w[1]))* - np.sin(w[0]*x + w[1]))`
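
One way to sanity-check hand-derived gradient expressions like these is to compare them with central finite differences. The sketch below does that on synthetic, noise-free data. The helper names (`loss_demo`, `analytic_gradient_demo`, `numeric_gradient_demo`) and the test point are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed, for illustration only
x_demo = rng.normal(0.0, 2.0, size=100)
y_demo = np.sin(x_demo)


def loss_demo(w):
    return np.mean((y_demo - np.cos(w[0] * x_demo + w[1])) ** 2)


def analytic_gradient_demo(w):
    residual = y_demo - np.cos(w[0] * x_demo + w[1])
    sin_term = np.sin(w[0] * x_demo + w[1])
    # Chain rule: d/dw of (y - cos(u))**2 is 2 * (y - cos(u)) * sin(u) * du/dw
    return np.array([np.mean(2 * residual * sin_term * x_demo),
                     np.mean(2 * residual * sin_term)])


def numeric_gradient_demo(w, eps=1e-6):
    # Central differences: (L(w + eps*e_k) - L(w - eps*e_k)) / (2 * eps)
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for k in range(w.size):
        step = np.zeros_like(w)
        step[k] = eps
        grad[k] = (loss_demo(w + step) - loss_demo(w - step)) / (2 * eps)
    return grad


w_test = np.array([0.8, 1.5])
print(analytic_gradient_demo(w_test))
print(numeric_gradient_demo(w_test))
```

If the two printed vectors agree to several decimal places, the analytic derivation is very likely correct.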


In [7]:
def gradient(w):
    g0 = -np.mean(2 * (y - np.cos(w[0] * x + w[1]))
                  * - np.sin(w[0]*x + w[1])*x)
    g1 = -np.mean(2 * (y - np.cos(w[0] * x + w[1])) * - np.sin(w[0]*x + w[1]))
    return np.array([g0, g1])


loss_grad_df = loss_log_df.join(loss_log_df[['w0', 'w1']].apply(lambda w: gradient(
    w), axis=1, result_type='expand').rename(columns={0: 'g0', 1: 'g1'}))

# Quiver arrows show the gradient, which points uphill; descent moves the opposite way
fig = ff.create_quiver(x=loss_grad_df['w0'], y=loss_grad_df['w1'],
                       u=loss_grad_df['g0'], v=loss_grad_df['g1'],
                       line_width=2, line_color='white',
                       scale=0.1, arrow_scale=.2)
fig.add_trace(go.Contour(
    x=loss_grad_df['w0'], y=loss_grad_df['w1'], z=loss_grad_df['MSE']))
fig.update_layout(margin=dict(l=0, r=0, t=70, b=0),
                  template='plotly_dark',
                  title='Loss Function Landscape with Gradient - 2D',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.update_layout(xaxis_range=[w0_val.min(), w0_val.max()])
fig.update_layout(yaxis_range=[w1_val.min(), w1_val.max()])
iplot(fig)


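**The `learning rate` controls the size of each step taken against the gradient.**

- `lr = lambda t: 1./(t+1.)`: _the learning rate decreases as the iterations (epochs) progress. This is the default schedule; the runs below pass `lr=lambda t: 1./np.sqrt(t+1.)`, which decays more slowly._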
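
To see how quickly such schedules shrink the step, the short sketch below compares the default `1/(t+1)` schedule with the `1/sqrt(t+1)` schedule passed to `gradient_descent` in the next cell. The names `inverse_decay` and `inverse_sqrt_decay` are illustrative.

```python
import numpy as np


def inverse_decay(t):
    return 1.0 / (t + 1.0)


def inverse_sqrt_decay(t):
    return 1.0 / np.sqrt(t + 1.0)


# Print the step size of both schedules at a few selected epochs
for t in [0, 1, 3, 99]:
    print(t, inverse_decay(t), inverse_sqrt_decay(t))
```

At epoch 99 the first schedule has shrunk to 0.01 while the second is still at 0.1, so the slower decay keeps making meaningful progress for longer.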


In [8]:
from IPython.display import display, Markdown, Latex
import math


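def gradient_descent(w_0, lr=lambda t: 1./(t+1.), nepochs=10):
    """Run `nepochs` gradient steps from `w_0` and return every visited point."""
    w = w_0.copy()
    values = [w]
    for t in range(nepochs):
        w = w - lr(t) * gradient(w)  # step against the gradient, scaled by lr(t)
        values.append(w)
    return np.array(values)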


GD_values = gradient_descent(np.array([.8, 1.5]),
                             nepochs=200,
                             lr=lambda t: 1./np.sqrt(t+1.))

fig = go.Figure()
fig.add_trace(go.Contour(
    x=loss_grad_df['w0'], y=loss_grad_df['w1'], z=loss_grad_df['MSE']))
fig.add_trace(go.Scatter(x=GD_values[:, 0], y=GD_values[:, 1], name='Gradient Path', mode="markers+lines",
                         line=go.scatter.Line(color='white')))
fig.update_layout(margin=dict(l=0, r=0, t=70, b=0),
                  template='plotly_dark',
                  title='Loss Function Gradient Descent Path - 2D',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.update_layout(xaxis_range=[w0_val.min(), w0_val.max()])
fig.update_layout(yaxis_range=[w1_val.min(), w1_val.max()])
iplot(fig)

fig = go.Figure()
# The reshaped grid is indexed [w0, w1], but Surface reads z[i][j] at (x[j], y[i]), hence the transpose
fig.add_trace(
    go.Surface(x=w0_val, y=w1_val,
               z=loss_log_df['MSE'].to_numpy().reshape((len(w0_val), len(w1_val))).T))
fig.add_trace(
    go.Scatter3d(x=GD_values[:, 0], y=GD_values[:, 1], z=[np.mean((y - model(w, x))**2) for w in GD_values],
                 line=dict(color='white')))
fig.update_layout(margin=dict(l=0, r=0, t=70, b=0),
                  template='plotly_dark',
                  title='Loss Function Gradient Descent Path - 3D',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
iplot(fig)


GD_values = gradient_descent(np.array([-.8, -1.5]),
                             nepochs=200,
                             lr=lambda t: 1./np.sqrt(t+1.))

fig = go.Figure()
fig.add_trace(go.Contour(
    x=loss_grad_df['w0'], y=loss_grad_df['w1'], z=loss_grad_df['MSE']))
fig.add_trace(go.Scatter(x=GD_values[:, 0], y=GD_values[:, 1], name='Gradient Path', mode="markers+lines",
                         line=go.scatter.Line(color='white')))
fig.update_layout(margin=dict(l=0, r=0, t=70, b=0),
                  template='plotly_dark',
                  title='Loss Function Gradient Descent Path - 2D',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
fig.update_layout(xaxis_range=[w0_val.min(), w0_val.max()])
fig.update_layout(yaxis_range=[w1_val.min(), w1_val.max()])
iplot(fig)


fig = go.Figure()
# The reshaped grid is indexed [w0, w1], but Surface reads z[i][j] at (x[j], y[i]), hence the transpose
fig.add_trace(
    go.Surface(x=w0_val, y=w1_val,
               z=loss_log_df['MSE'].to_numpy().reshape((len(w0_val), len(w1_val))).T))
fig.add_trace(
    go.Scatter3d(x=GD_values[:, 0], y=GD_values[:, 1], z=[np.mean((y - model(w, x))**2) for w in GD_values],
                 line=dict(color='white')))
fig.update_layout(margin=dict(l=0, r=0, t=70, b=0),
                  template='plotly_dark',
                  title='Loss Function Gradient Descent Path - 3D',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')
iplot(fig)


display(Markdown(f'''
## Final values for w and b after gradient descent:

- w = {GD_values[-1][0]};
- b = {GD_values[-1][1]}.

### Final Model:

$\\cos({GD_values[-1][0]} \\times x + {GD_values[-1][1]})$

### Fun Fact:

- $\\sin(x) = \\cos(x - \\pi /2)$
- $\\sin(x) = \\cos(-x + \\pi /2)$, because cosine is an even function

**And that is why our loss landscape has *two* global minima in the searched range: one near $(1, -\\pi /2)$ and one near $(-1, \\pi /2)$.**

Our model was trying to approximate $-\\pi /2$, which is: {-math.pi/2}.

Our model got really close: the difference between {GD_values[-1][1]} and $-\\pi /2$ is {- GD_values[-1][1] - math.pi/2}!

'''
                 ))



## Final values for w and b after gradient descent:

- w = 0.9867198275468617;
- b = -1.6066606469171023.

### Final Model:

$\cos(0.9867198275468617 \times x + -1.6066606469171023)$

### Fun Fact:

- $\sin(x) = \cos(x - \pi /2)$
- $\sin(x) = \cos(-x + \pi /2)$, because cosine is an even function

**And that is why our loss landscape has *two* global minima in the searched range: one near $(1, -\pi /2)$ and one near $(-1, \pi /2)$.**

Our model was trying to approximate $-\pi /2$, which is: -1.5707963267948966.

Our model got really close: the difference between -1.6066606469171023 and $-\pi /2$ is 0.03586432012220575!
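
The phase-shift relationships between sine and cosine, and the evenness of cosine, can be checked numerically. This is a minimal sketch on an arbitrary grid of points; `x_grid` and the three flag names are illustrative.

```python
import numpy as np

x_grid = np.linspace(-2 * np.pi, 2 * np.pi, 1000)

# Phase-shift identities relating sine and cosine
shift_minus = np.allclose(np.sin(x_grid), np.cos(x_grid - np.pi / 2))
shift_plus = np.allclose(-np.sin(x_grid), np.cos(x_grid + np.pi / 2))

# Cosine is even, so flipping the sign of both parameters gives the same curve
mirrored = np.allclose(np.cos(x_grid - np.pi / 2), np.cos(-x_grid + np.pi / 2))

print(shift_minus, shift_plus, mirrored)
```

The last check shows that the parameter pairs (1, -π/2) and (-1, π/2) describe the same function, so both fit a sine equally well.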



