# Gradient Descent (GD)
# Stochastic Gradient Descent (SGD)
# Mini-Batch Gradient Descent (mini-batch SGD)

## Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt

## GD, SGD and mini-batch SGD

<div class="alert alert-block alert-info">
    
Given a dataset $S$, the 
- **Gradient Descent (GD)**
- **Stochastic Gradient Descent (SGD)** and
- **Mini-Batch Stochastic Gradient Descent (mini-batch SGD)**
    
algorithms are given as follows:

<img src="files/figures/GD_2.png" width="650px"/>
    
<img src="files/figures/SGD.png" width="650px"/>
    
<img src="files/figures/SGD_miniBatch.png" width="650px"/>

</div>

We are going to draw a set of $N = 500$ **random points on the surface** $z = f(x, y)$ and add to them some **uniform noise**, as illustrated below.

<img src="files/figures/surface.png" width="400px"/>

More precisely:
- Draw a set of $N = 500$ **triplets** of the form
$$S = \Big\{ \big( x_i, y_i, f(x_i, y_i) + \epsilon \big) : x_i, y_i \in [-5, 5], \epsilon \sim \mathcal{U} \big( [-1,1] \big) \text{ and } i = 1, \dots, N \Big\}$$
where:
    - $x_i, y_i $ are sampled uniformly inside $[-5, 5]$;
    - $f(x_i, y_i) = \cos(x_i) \cdot \cos(y_i) + \frac{1}{10} \cdot x_i^2 + \frac{1}{20} \cdot y_i^2$ is the surface equation;
    - $\epsilon \sim \mathcal{U} \big( [-1,1] \big)$ is a uniform noise.
- Store these points into a numpy tensor `train_set` of size $N \times 3$.
- Represent these points together with the surface $z = f(x, y)$ in a 3D plot (`ax.scatter()` on a 3D axis, as in the cell below).
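
The sampling steps above could be sketched as follows. This is only one possible way to build `train_set`: the helper names `f` and `make_train_set` and the fixed seed are assumptions introduced for illustration, not part of the exercise statement.

```python
import numpy as np

def f(x, y):
    """Surface equation z = cos(x) * cos(y) + x^2 / 10 + y^2 / 20."""
    return np.cos(x) * np.cos(y) + x**2 / 10 + y**2 / 20

def make_train_set(n=500, seed=0):
    """Return an (n, 3) array of noisy samples (x, y, f(x, y) + eps)."""
    rng = np.random.default_rng(seed)        # seed chosen arbitrarily, for reproducibility
    x = rng.uniform(-5.0, 5.0, size=n)       # x_i sampled uniformly in [-5, 5)
    y = rng.uniform(-5.0, 5.0, size=n)       # y_i sampled uniformly in [-5, 5)
    eps = rng.uniform(-1.0, 1.0, size=n)     # one independent noise value per point
    return np.column_stack((x, y, f(x, y) + eps))

train_set = make_train_set()
```

Drawing a separate noise value for each point is an interpretation of the definition of $S$; the columns of `train_set` are then $x_i$, $y_i$ and the noisy height.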

In [None]:
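# Plot the surface z = f(x, y) together with the noisy points in `train_set`
# (`train_set` is the N x 3 array built in the previous step)

x = np.arange(-5.0, 5.0, 0.1)
y = np.arange(-5.0, 5.0, 0.1)

X, Y = np.meshgrid(x, y)
Z = np.cos(X) * np.cos(Y) + X**2 / 10 + Y**2 / 20  # surface equation f(X, Y)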

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection= '3d')
surf = ax.plot_surface(X, Y, Z, 
                       cmap='YlOrRd', 
                       linewidth=0, 
                       antialiased=True, 
                       rstride=3, 
                       cstride=3, 
                       alpha=0.5
                      )

# points
points = ax.scatter(train_set[:, 0], train_set[:, 1], train_set[:, 2],
                    color = 'black',
                    marker="o"
                   )

ax.set_xlim([-5.0, 5.0])
ax.set_ylim([-5.0, 5.0])
ax.set_zlim([-1.0, 5.0])
# plt.title("Surface Plot", size=14)

plt.show()

<div class="alert alert-block alert-info">

Consider the **quartic (degree-4) polynomial model** given by

$$
\hat f \left( x, y; \Theta \right) = \omega_0 x^4 + \omega_1 x^3y + \omega_2 x^2y^2 + \omega_3 x y^3 + \omega_4 y^4 + \omega_5
$$

where $\Theta = (\omega_0, \omega_1, \omega_2, \omega_3, \omega_4, \omega_5)$ are the **parameters** of the model. Different parameters $\Theta$ give rise to different models $\hat f(\cdot, \cdot; \Theta)$.

<br>    
    
Let $\hat f(\cdot, \cdot; \Theta)$ be a **model**, $B = \big\{ (x_i, y_i, z_i) : i = 1, \dots, K \big\}$ be a **batch of points** and $p_i = (x_i, y_i, z_i) \in B$ be a **point**:
- The **prediction** for $(x_i, y_i)$ by $\hat f$ is $\hat f(x_i, y_i; \Theta)$.
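- The **(real) target** of $(x_i, y_i)$ is $z_i$, the third coordinate of $p_i$. For points drawn from $S$, this is the surface value $\cos(x_i) \cdot \cos(y_i) + \frac{1}{10} \cdot x_i^2 + \frac{1}{20} \cdot y_i^2$ plus the noise $\epsilon$.
- The **individual loss** for $p_i$ is half the squared distance between the target and the prediction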
    
$$
\ell \left( z_i , \hat f(x_i, y_i; \Theta) \right) = \frac{1}{2} \left( z_i - \hat f(x_i, y_i; \Theta) \right)^2.
$$
    
- The **collective loss** for $B$ is the mean of the individual losses over all targets and predictions
    
$$
\mathcal{L} \left( z_1, \dots, z_K , \hat f(x_1, y_1; \Theta), \dots, \hat f(x_K, y_K; \Theta) \right) = \frac{1}{2K} \sum_{i=1}^K \left( z_i - \hat f(x_i, y_i; \Theta) \right)^2.
$$
$$

</div>

- Write a function<br>
`hat_f(x, y, theta)`<br>
that implements the **model** $\hat f(x, y; \Theta)$.
- Write a function<br> 
`small_loss(x_i, y_i, z_i, theta)`<br>
that implements the **individual loss** $\ell \left( z_i , \hat f(x_i, y_i; \Theta) \right)$, where `x_i`,`y_i` and `z_i` are values.
- Write a function<br>
`big_loss(x_t, y_t, z_t, theta)`<br>
that implements the **collective loss** $\mathcal{L} \left( z_1, \dots, z_K , \hat f(x_1, y_1; \Theta), \dots, \hat f(x_K, y_K; \Theta) \right)$, where `x_t`,`y_t` and `z_t` are tensors of values.
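
A minimal sketch of these three functions, assuming `theta` is a length-6 array ordered as $(\omega_0, \dots, \omega_5)$. The helper `features` is a hypothetical name introduced here to collect the monomials of the model; it is not required by the exercise.

```python
import numpy as np

def features(x, y):
    """Monomials of the model, stacked on the last axis: (x^4, x^3 y, x^2 y^2, x y^3, y^4, 1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.stack([x**4, x**3 * y, x**2 * y**2, x * y**3, y**4, np.ones_like(x)], axis=-1)

def hat_f(x, y, theta):
    """Model prediction; x and y may be scalars or arrays of the same shape."""
    return features(x, y) @ np.asarray(theta, dtype=float)

def small_loss(x_i, y_i, z_i, theta):
    """Individual loss: half the squared error between target and prediction."""
    return 0.5 * (z_i - hat_f(x_i, y_i, theta)) ** 2

def big_loss(x_t, y_t, z_t, theta):
    """Collective loss: mean of the individual losses over the batch."""
    return np.mean(small_loss(x_t, y_t, z_t, theta))
```

Because `hat_f` works elementwise on arrays, the same function can also be applied to a meshgrid when plotting the fitted surface.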

<div class="alert alert-block alert-info">
    
We have

\begin{align}
\nabla \ell(x_i, y_i, z_i; \Theta) & =
\frac{\partial \left[ \frac{1}{2} \left( z_i - \hat f(x_i, y_i; \Theta) \right)^2 \right]}{\partial \Theta} \\
& =
- \left( z_i - \hat f(x_i, y_i; \Theta) \right) \cdot \frac{\partial \hat f(x_i, y_i; \Theta)}{\partial \Theta}
\end{align}

and 

\begin{align}
\nabla \mathcal{L}(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}; \Theta) & =
\frac{\partial \left[ \frac{1}{2K} \sum_{i=1}^K \left( z_i - \hat f(x_i, y_i; \Theta) \right)^2 \right]}{\partial \Theta} \\
& =
- \frac{1}{K} \sum_{i=1}^K \left( z_i - \hat f(x_i, y_i; \Theta) \right) \cdot \frac{\partial \hat f(x_i, y_i; \Theta)}{\partial \Theta}
\end{align}

</div>
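
Because the model defined earlier is linear in its parameters, the partial derivative that appears in both gradients can be written out explicitly. The following worked step is derived from that model definition:

$$
\frac{\partial \hat f(x_i, y_i; \Theta)}{\partial \Theta} = \left( x_i^4, \; x_i^3 y_i, \; x_i^2 y_i^2, \; x_i y_i^3, \; y_i^4, \; 1 \right)
$$

so that, for a single point,

$$
\nabla \ell(x_i, y_i, z_i; \Theta) = - \left( z_i - \hat f(x_i, y_i; \Theta) \right) \cdot \left( x_i^4, \; x_i^3 y_i, \; x_i^2 y_i^2, \; x_i y_i^3, \; y_i^4, \; 1 \right).
$$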

- Write a function<br>
`grad_small_loss(x_i, y_i, z_i, theta)`<br>
that implements the **gradient of the individual loss** $\nabla \ell(x_i, y_i, z_i; \Theta)$
- Write a function<br>
`grad_big_loss(x_t, y_t, z_t, theta)`<br>
that implements the **gradient of the collective loss** $\nabla \mathcal{L}(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}; \Theta)$ as the average (sum divided by $K$) of the gradients of the individual losses.
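
One possible sketch of the two gradient functions, again assuming `theta` is a length-6 NumPy array. The helper `features` is a hypothetical name for the vector of monomials, and the collective gradient is computed in vectorised form, which should give the same result as averaging `grad_small_loss` over the batch.

```python
import numpy as np

def features(x, y):
    """Monomials of the model, stacked on the last axis: (x^4, x^3 y, x^2 y^2, x y^3, y^4, 1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.stack([x**4, x**3 * y, x**2 * y**2, x * y**3, y**4, np.ones_like(x)], axis=-1)

def grad_small_loss(x_i, y_i, z_i, theta):
    """Gradient of the individual loss: -(z_i - f_hat) * d f_hat / d theta."""
    phi = features(x_i, y_i)                  # shape (6,)
    return -(z_i - phi @ theta) * phi

def grad_big_loss(x_t, y_t, z_t, theta):
    """Gradient of the collective loss: average of the individual gradients."""
    phi = features(x_t, y_t)                  # shape (K, 6)
    residuals = z_t - phi @ theta             # shape (K,)
    return -(residuals @ phi) / len(z_t)
```

A quick sanity check is to compare `grad_big_loss` on a small batch with the mean of `grad_small_loss` over the same points.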

Using your functions `grad_small_loss(...)` and `grad_big_loss(...)`, write the three following functions:

- `GD(dataset, lamda, nb_epochs)`<br>
    
    that implements the **gradient descent (GD)** algorithm.<br>
    Run this function with the parameters: `dataset=train_set`, `lamda=1e-5`, `nb_epochs=1000`.<br>
    Check the collective loss with the parameters `theta` that you obtain.<br><br>

- `SGD(dataset, lamda, nb_epochs)`<br>
    
    that implements the **stochastic gradient descent (SGD)** algorithm.<br>
    Run this function with the parameters: `dataset=train_set`, `lamda=1e-8`, `nb_epochs=1000`.<br>
    Check the collective loss with the parameters `theta` that you obtain.<br><br>
    
- `mini_SGD(dataset, batch_size, lamda, nb_epochs)`<br>
    
    that implements the **mini-batch stochastic gradient descent (mini-batch SGD)** algorithm.<br>
    Run this function with the parameters: `dataset=train_set`, `batch_size=64`, `lamda=1e-8`, `nb_epochs=1000`.<br>
    Check the collective loss with the parameters `theta` that you obtain.<br><br>

- Plot your **best model** $\hat f(\cdot, \cdot; \Theta)$ together with your **train set** by replacing `your_parameters_theta` in the code below with the best parameters that you obtained.

In [None]:
x = np.arange(-5.0, 5.0, 0.1)
y = np.arange(-5.0, 5.0, 0.1)

X, Y = np.meshgrid(x, y)
# Z = hat_f(X, Y, your_parameters_theta)

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection= '3d')
surf = ax.plot_surface(X, Y, Z, 
                       cmap='YlOrRd', 
                       linewidth=0, 
                       antialiased=True, 
                       rstride=3, 
                       cstride=3, 
                       alpha=0.5
                      )

# points
points = ax.scatter(train_set[:, 0], train_set[:, 1], train_set[:, 2],
                    color = 'black',
                    marker="o"
                   )

ax.set_xlim([-5.0, 5.0])
ax.set_ylim([-5.0, 5.0])
ax.set_zlim([-1.0, 5.0])
plt.title("Surface Plot", size=14)

plt.show()
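
The three optimisers requested above (`GD`, `SGD` and `mini_SGD`) could be sketched as follows. Several choices here are assumptions, since the algorithm figures do not fix them in the text: `theta` starts at zero, `lamda` is treated as the learning rate, SGD visits every point once per epoch in a freshly shuffled order, and mini-batches are consecutive slices of the shuffled dataset. The helper `features` and the module-level `rng` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)   # assumed seed, only to make the shuffling reproducible

def features(x, y):
    """Monomials of the model: (x^4, x^3 y, x^2 y^2, x y^3, y^4, 1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.stack([x**4, x**3 * y, x**2 * y**2, x * y**3, y**4, np.ones_like(x)], axis=-1)

def big_loss(x_t, y_t, z_t, theta):
    """Collective loss: mean of half squared errors."""
    return 0.5 * np.mean((z_t - features(x_t, y_t) @ theta) ** 2)

def grad_small_loss(x_i, y_i, z_i, theta):
    phi = features(x_i, y_i)
    return -(z_i - phi @ theta) * phi

def grad_big_loss(x_t, y_t, z_t, theta):
    phi = features(x_t, y_t)
    return -((z_t - phi @ theta) @ phi) / len(z_t)

def GD(dataset, lamda, nb_epochs):
    """Gradient descent: one update per epoch, using the whole dataset."""
    x, y, z = dataset[:, 0], dataset[:, 1], dataset[:, 2]
    theta = np.zeros(6)                                   # assumed initialisation
    for _ in range(nb_epochs):
        theta = theta - lamda * grad_big_loss(x, y, z, theta)
    return theta

def SGD(dataset, lamda, nb_epochs):
    """Stochastic gradient descent: one update per point, in shuffled order."""
    theta = np.zeros(6)
    for _ in range(nb_epochs):
        for x_i, y_i, z_i in rng.permutation(dataset):    # rows in random order
            theta = theta - lamda * grad_small_loss(x_i, y_i, z_i, theta)
    return theta

def mini_SGD(dataset, batch_size, lamda, nb_epochs):
    """Mini-batch SGD: one update per batch of (at most) batch_size points."""
    theta = np.zeros(6)
    n = len(dataset)
    for _ in range(nb_epochs):
        shuffled = rng.permutation(dataset)
        for start in range(0, n, batch_size):
            b = shuffled[start:start + batch_size]        # last batch may be smaller
            theta = theta - lamda * grad_big_loss(b[:, 0], b[:, 1], b[:, 2], theta)
    return theta
```

With the settings suggested in the exercise, a call might look like `theta = GD(train_set, lamda=1e-5, nb_epochs=1000)`, followed by `big_loss(train_set[:, 0], train_set[:, 1], train_set[:, 2], theta)` to check the collective loss. The pure-Python inner loop of `SGD` can be slow for 1000 epochs, and whether a given learning rate converges depends on the sampled data.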