To open on Google Colab [link](https://colab.research.google.com/github/RodrigoAVargasHdz/CHEM-4PB3/blob/main/Course_Notes/Week7/intro_MLP.ipynb)

In [None]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt

## Beyond linear models

## What if $\phi(\cdot)$ also depends on internal parameters? **non-linear model**,
$f(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x},\mathbf{w}') = \sum_i w_i \phi_i(\mathbf{x},\mathbf{w}')$.

Now we also need to optimize the non-linear parameters $\mathbf{w}'$.

**Diagram**\
<img src="https://raw.github.com/RodrigoAVargasHdz/CHEM-4PB3/master/Course_Notes/Figures/nonLinear_model_diagram.png"  width="350" height="300">


Let's assume $\phi(\mathbf{x},\mathbf{w}')$ is another linear model,\
$\phi(\mathbf{x},\mathbf{w}') = \mathbf{z} = [z_1,z_2,\cdots,z_\ell]$, where $\ell$ is the "new" number of features.
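
The construction above can be sketched in a few lines. This is only an illustration, assuming a single input feature and $\ell = 4$ new features; the names `phi`, `w`, and `n_features` are hypothetical and not part of the notes.

```python
import torch
from torch import nn

torch.manual_seed(0)

n_features = 4  # assumed value of ell, the "new" number of features

# phi(x, w'): a linear map from 1 input feature to ell features
phi = nn.Linear(1, n_features)
# outer linear model w, acting on the new features z
w = nn.Linear(n_features, 1, bias=False)

x_in = torch.rand(5, 1)  # batch of 5 one-dimensional inputs
z = phi(x_in)            # z has shape (5, ell)
y_out = w(z)             # f(x) = sum_i w_i z_i, shape (5, 1)

# the same sum written out explicitly
y_manual = (z * w.weight).sum(dim=1, keepdim=True)
print(z.shape, y_out.shape, torch.allclose(y_out, y_manual, atol=1e-6))
```

Both sets of parameters, those in `phi` and those in `w`, would be optimized during training.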

## Code

In [None]:
# function
def f(x):
    return -(1.4 - 3.0 * x) * torch.sin(18.0 * x)

def get_data(n_batch=25):
    # X = torch.randn((n_batch,1))
    X = torch.distributions.uniform.Uniform(-0.01,1.).sample([n_batch,1]) 
    y = f(X)
    return X,y

In [None]:
# define model
model = nn.Sequential(
    nn.Linear(1, 100),
    nn.Linear(100,1)
)

In [None]:
def training(model, training_iter=500, n_batch=25):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    loss = nn.MSELoss()

    model.train()
    for itr in range(1, training_iter + 1):
        # draw a fresh random batch at every iteration
        X, y_true = get_data(n_batch)
        output = model(X)
        loss_val = loss(output, y_true)

        # optional regularization (l2_lambda must be defined first)
        # l_norm = sum(p.pow(2.0).sum()
        #               for p in model.parameters())
        # l_norm = sum(p.abs().sum()for p in model.parameters())
        # loss_val = loss_val + l2_lambda*l_norm

        optimizer.zero_grad()
        loss_val.backward()
        optimizer.step()

        if itr % 5 == 0:
            print(f'itr = {itr}, loss = {loss_val.item():.4f}')
    return model
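
The commented-out lines in `training` hint at adding a penalty on the parameter norm to the loss. Below is a minimal sketch of how such an L2 penalty could be computed for a single batch. The value `l2_lambda = 1e-4`, the small stand-in model `model_reg`, and the random data are all assumptions made for illustration; the notes do not specify them.

```python
import torch
from torch import nn

torch.manual_seed(0)

# stand-in model and data, only for illustration
model_reg = nn.Sequential(nn.Linear(1, 10), nn.Linear(10, 1))
loss = nn.MSELoss()
l2_lambda = 1e-4  # hypothetical regularization strength

X_reg = torch.rand(25, 1)
y_reg = torch.sin(X_reg)  # stand-in targets

output = model_reg(X_reg)
mse = loss(output, y_reg)

# sum of squared parameters (L2 penalty)
l2_norm = sum(p.pow(2.0).sum() for p in model_reg.parameters())
loss_val = mse + l2_lambda * l2_norm

loss_val.backward()
print(mse.item(), l2_norm.item(), loss_val.item())
```

Replacing `p.pow(2.0)` with `p.abs()` gives the L1 variant that also appears in the commented-out lines.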

In [None]:
model = training(model)
X_grid = torch.linspace(0.,1.,5000).unsqueeze(1)
plt.clf()
X,y = get_data(25)
print(X.shape,y.shape,model(X).shape)
plt.scatter(X.detach().numpy(),y.detach().numpy(),label='Batch')
plt.plot(X_grid.detach().numpy(),f(X_grid).detach().numpy(),ls='--',c='k',label=r'$f(x)$')
plt.plot(X_grid.detach().numpy(),model(X_grid).detach().numpy(),c='red',label=r'$NN(x)$')
plt.ylabel(r'$f(x)$',fontsize=12)
plt.xlabel(r'$x$',fontsize=12)
plt.legend()

## Non-linear layers

**What happens if we consider a linear model of another linear model (two linear models)?**

$$
f(x,\{\mathbf{W}\}_{\ell=1}^{2}) =  \left ( \mathbf{x}\mathbf{W}^\top_1 \right )\mathbf{W}^\top_2 = \mathbf{z}\mathbf{W}^\top_2
$$

$f(x,\{\mathbf{W}\}_{\ell=1}^{2})= \phi(\mathbf{W}_1, \mathbf{x})\mathbf{W}_2^\top$, where $\phi(\mathbf{W}_1, \mathbf{x}) = \left ( \mathbf{x}\mathbf{W}^\top_1 \right )$.
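
One way to explore this question numerically is to compare two stacked `nn.Linear` layers against a single matrix product. The sketch below assumes arbitrary sizes (3 input features, 8 intermediate features) and omits the biases so that it matches the equation above; the names `lin1`, `lin2`, and `omega` are illustrative only.

```python
import torch
from torch import nn

torch.manual_seed(0)

# two stacked linear layers without bias: f = (x W1^T) W2^T
lin1 = nn.Linear(3, 8, bias=False)  # W1 has shape (8, 3)
lin2 = nn.Linear(8, 1, bias=False)  # W2 has shape (1, 8)

x_lin = torch.rand(10, 3)
y_two = lin2(lin1(x_lin))

# candidate single weight matrix: omega = W2 W1, shape (1, 3)
omega = lin2.weight @ lin1.weight
y_one = x_lin @ omega.T

print(omega.shape, torch.allclose(y_two, y_one, atol=1e-6))
```

If the two outputs agree, the stacked model has no more expressive power than a single linear model with weights `omega`.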



<!-- We will use the following notation for the elements of $[\mathbf{W}_{\ell}]_{i,j} = w_\kappa^{\ell,i}$,
* $\ell$ is the layer index
* $\kappa$ is the *input-feature* index
* $i$ is the *output-feature* index
  
```nn.Linear()``` is a linear transformation of the following way,
$$ \mathbf{x}\mathbf{W}_1^\top = \begin{bmatrix}
 x_0,& \cdots, & x_d \\
\end{bmatrix}\begin{bmatrix}
 &  &  \\
\mathbf{w}_1, & \cdots, & 
\mathbf{w}_\ell \\
 &  &  \\
\end{bmatrix}\\
\mathbf{x}\mathbf{W}_1^\top = \begin{bmatrix}
 x_0,& \cdots, & x_d \\
\end{bmatrix}\begin{bmatrix}
w^{1,1}_0& w^{1,j}_0 & w^{1,\ell}_0 \\
w^{1,1}_i & w^{1,j}_i & w^{1,\ell}_i \\
w^{1,1}_d & w^{1,j}_d & w^{1,\ell}_d \\
\end{bmatrix}\\
\mathbf{x}\mathbf{W}^\top = \begin{bmatrix}
 \mathbf{w}_1^\top \mathbf{x},& \cdots, & \mathbf{w}_i^\top \mathbf{x}, & \cdots, & \mathbf{w}_\ell^\top \mathbf{x} \\
\end{bmatrix},
$$
where $\mathbf{z}$ is a vector with $\ell$ entries. 

 
Let's assume that $\mathbf{W}_2^\top$ is a $(\ell,1)$ matrix and $\mathbf{W}_1^\top$ is a $(d,\ell)$ matrix.

$$
\mathbf{z}\mathbf{W}_2^\top = \begin{bmatrix}
 z_0,& \cdots, & z_\ell \\
\end{bmatrix}
\begin{bmatrix}
w_{0}^{2,1} \\
w_{i}^{2,1} \\
w_{\ell}^{2,1} \\ 
\end{bmatrix} = \sum_{i}^{\ell} w_{i}^{2,1}  \; z_{i}.
$$
* the $2$ in $w^{2,1}$ means the index of the **second** linear model.

From the above equation we can obtain an expression for $\mathbf{z}$,
$$
\mathbf{z} = \mathbf{x}\mathbf{W}_1^\top = \begin{bmatrix}
 \mathbf{w}_1^\top \mathbf{x},& \cdots, & \mathbf{w}_i^\top \mathbf{x}, & \cdots, & \mathbf{w}_\ell^\top \mathbf{x} \\
\end{bmatrix} \\
\mathbf{z} = \begin{bmatrix}
 \sum_jw^{1,1}_j \; x_j, & \sum_j w^{1,2}_j \; x_j, & \cdots, & \sum_jw^{1,\ell-1}_j \; x_j, &  \sum_jw^{1,\ell}_j \; x_j \\
\end{bmatrix}
$$

Combining all of the above we get,
$$
f(x,\{\mathbf{W}\}_{\ell=1}^{2}) = \mathbf{z}\mathbf{W}^\top_2 = \sum_i^\ell w_{i}^{2,1} \; z_i  \\
f(x,\{\mathbf{W}\}_{\ell=1}^{2}) = \sum_i^\ell w_{i}^{2,1} \; \left ( \mathbf{w}_i^\top \mathbf{x} \right )\\
f(x,\{\mathbf{W}\}_{\ell=1}^{2}) = \sum_i^\ell w_{i}^{2,1} \; \left ( \sum_j^d w_{j}^{1,i} \; x_j \right ) \\
f(x,\{\mathbf{W}\}_{\ell=1}^{2}) = \sum_i^\ell  \; \left ( \sum_j^d w_{i}^{2,1} w_{j}^{1,i} \; x_j \right ) \\
f(x,\{\mathbf{W}\}_{\ell=1}^{2}) = \sum_i^\ell  \; \left ( \sum_j^d \omega^{i}_{j} \; x_j \right ), \quad \omega^{i}_{j} = w_{i}^{2,1} w_{j}^{1,i} \\
f(x,\{\mathbf{W}\}_{\ell=1}^{2}) =  \sum_j^d \omega_{j}  \; x_j = \boldsymbol{\omega}^\top \mathbf{x}, \quad \omega_{j} = \sum_i^\ell \omega^{i}_{j} \\
$$

$f(x,\{\mathbf{W}\}_{\ell=1}^{2})$ is another linear model. -->

## Non-linearities

We can make $\phi(\mathbf{W}_\ell, \mathbf{x})$ a non-linear transformation using an element-wise non-linear function.
* An element-wise non-linear function is called an **activation function**.

## Activation functions

* hyperbolic tangent
$$
\tanh(x) = \frac{\exp(x)-\exp(-x)}{\exp(x)+\exp(-x)}
$$

* Sigmoid
$$
\sigma(x) = \frac{1}{1+ \exp(-x)}
$$

* ReLU
$$
\text{ReLU}(x) = \max(0,x)
$$

* Leaky ReLU
$$
\text{LeakyReLU}(x) = \max(0,x) + \beta \min(0,x)
$$

* SiLU
$$
\text{SiLU}(x) = x \, \sigma(x)\\
\sigma(x) = \frac{1}{1+\exp(-x)}
$$
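
As a sanity check, the formulas above can be written out by hand and compared with the PyTorch built-ins. This is a minimal sketch; the helper `sigmoid_manual`, the grid `x_chk`, and the slope `beta = 0.1` are assumptions chosen for illustration.

```python
import torch
from torch import nn

x_chk = torch.linspace(-5, 5, 11)
beta = 0.1  # assumed negative slope for LeakyReLU


def sigmoid_manual(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + torch.exp(-x))


ex, emx = torch.exp(x_chk), torch.exp(-x_chk)
manual = {
    'tanh': (ex - emx) / (ex + emx),
    'sigmoid': sigmoid_manual(x_chk),
    'relu': torch.clamp(x_chk, min=0.0),
    'leaky_relu': torch.clamp(x_chk, min=0.0) + beta * torch.clamp(x_chk, max=0.0),
    'silu': x_chk * sigmoid_manual(x_chk),
}
builtin = {
    'tanh': nn.Tanh()(x_chk),
    'sigmoid': nn.Sigmoid()(x_chk),
    'relu': nn.ReLU()(x_chk),
    'leaky_relu': nn.LeakyReLU(beta)(x_chk),
    'silu': nn.SiLU()(x_chk),
}

for name in manual:
    print(name, torch.allclose(manual[name], builtin[name], atol=1e-6))
```

In practice the built-in modules are preferable, since the hand-written `tanh` overflows for large $|x|$.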


In [None]:
x = torch.linspace(-5,5,1000)

act_tanh = nn.Tanh()
y_tanh = act_tanh(x)

act_sigmoid = nn.Sigmoid()
y_sigmoid = act_sigmoid(x)

act_relu = nn.ReLU()
y_relu = act_relu(x)

act_lrelu = nn.LeakyReLU(0.1)
y_lrelu = act_lrelu(x)

act_silu = nn.SiLU()
y_silu = act_silu(x)

In [None]:
xnp = x.detach().numpy()
plt.plot(xnp,y_tanh.detach().numpy(),label='Tanh')
plt.plot(xnp,y_sigmoid.detach().numpy(),label='Sigmoid')
plt.plot(xnp,y_relu.detach().numpy(),label='ReLU')
plt.plot(xnp,y_lrelu.detach().numpy(),label='Leaky ReLU')
plt.plot(xnp,y_silu.detach().numpy(),label='SiLU')
plt.xlabel('x',fontsize=15)
plt.ylabel('activation function',fontsize=15)
plt.legend()

### Choose one of the above activation functions and use it in your linear model.

**Diagram**\
<img src="https://raw.github.com/RodrigoAVargasHdz/CHEM-4PB3/master/Course_Notes/Figures/MLP_diagram.png"  width="400" height="300">

<!-- ```python
model = nn.Sequential(
    nn.Linear(1, 100),
    nn.SiLU(),
    nn.Linear(100, 100),
    nn.SiLU(),
    nn.Linear(100, 100),
    nn.SiLU(),
    nn.Linear(100, 1)
)
``` -->

In [None]:
#code here
# define a model

model = training(model)
X_grid = torch.linspace(0., 1., 5000).unsqueeze(1)
plt.clf()
X, y = get_data(25)
print(X.shape, y.shape, model(X).shape)
plt.scatter(X.detach().numpy(), y.detach().numpy(), label='Batch')
plt.plot(X_grid.detach().numpy(), f(X_grid).detach().numpy(),
         ls='--', c='k', label=r'$f(x)$')
plt.plot(X_grid.detach().numpy(), model(
    X_grid).detach().numpy(), c='red', label=r'$NN(x)$')
plt.ylabel(r'$f(x)$', fontsize=12)
plt.xlabel(r'$x$', fontsize=12)
plt.legend()


Work in small groups and discuss the following.
1. How many layers do we need?
2. What is the *best* activation function?
2. What is the *best* activation function?
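
To support the discussion, one possible way to compare depths and activation functions is to build the models programmatically. The helpers `make_mlp` and `count_parameters` below are hypothetical and not part of the notes; the width of 100 is also an assumption.

```python
from torch import nn


def make_mlp(n_hidden_layers=2, width=100, activation=nn.SiLU):
    """Build an MLP 1 -> width -> ... -> 1 (hypothetical helper)."""
    layers = [nn.Linear(1, width), activation()]
    for _ in range(n_hidden_layers - 1):
        layers += [nn.Linear(width, width), activation()]
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)


def count_parameters(model):
    # total number of trainable and non-trainable parameters
    return sum(p.numel() for p in model.parameters())


for depth in (1, 2, 3):
    for act in (nn.Tanh, nn.ReLU, nn.SiLU):
        mlp = make_mlp(depth, activation=act)
        print(depth, act.__name__, count_parameters(mlp))
```

Each model could then be passed to the `training` function defined earlier and the resulting fits compared on the same plot.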

# Extra

Go to the following [link](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.03345&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false) and try to solve all the different tasks!