<img src='http://sn.nexteinstein.org/wp-content/uploads/sites/12/2016/07/aims_senegal.jpg' />

In this tutorial we will cover:

*   Backpropagation - **Naive** implementation
*   Computational Graphs
*   Backpropagation - **Modular** implementation
*   PyTorch Autograd
*   Parts of the `torch.nn` module




---



Moustapha talked about neural networks and how they are a more powerful class of models that lets us tackle problems that are more complex and non-linear in nature.

We also saw that a neural network in its simplest form is just a chain of linear models ($wx + b$), each followed by a non-linear activation function $\sigma(\cdot)$.

Moustapha also covered in class the main algorithm used to train neural networks. It is the same gradient descent we have been using; the only difference is that the loss does not depend directly on every parameter, so we use the chain rule to compute the gradients.

**Note: The loss is always a scalar**

<img src='https://miro.medium.com/max/1276/1*F9capAHwl_rz2-Q8z511WQ.jpeg' />

<img src='https://images.contentstack.io/v3/assets/blt71da4c740e00faaa/blt3e9883f5dfd008f4/603039d9cb67827268e09219/saltbae_pytorch.jpg' />



---



Let's take a two-layer neural network as an example, used to solve a binary classification problem:

$X \in \mathbb{R}^{N \times D}$, $y \in \{0, 1\}^{N \times 1}$, and $H$ is the hidden size.

$out = \sigma(\sigma(X W_1 + b_1) W_2 + b_2)$

where

$\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid activation function

$W_1 \in \mathbb{R}^{D \times H}, b_1 \in \mathbb{R}^{H}, W_2 \in \mathbb{R}^{H \times 1}, b_2 \in \mathbb{R}$


---

# Naive Implementation of Backprop

Basically: get pen and paper, sit down for an hour to review vector calculus, and compute the gradient explicitly for each term.

In [1]:
import torch
import math



---



In [2]:
def sigmoid(x):
    return (1.0 / (1.0 + torch.exp(-x)))



---



In [3]:
def initParameters(X, y, H):
  '''
  X: Input data of shape (N, D). Each X[i] is a training sample.
  y: Vector of training labels. y[i] is the label for X[i]
  H: The Hidden size for a two layer NN
  '''
  D = X.shape[1] # number of neurons in input layer = D
  output_size = y.shape[1] # number of neurons in output layer.

  W1 = torch.rand(D, H) * 1e-2
  b1 = torch.zeros((H))
  W2 = torch.rand(H, output_size) * 1e-2 
  b2 = torch.zeros((output_size))

  params = {'W1': W1, 'W2': W2, 'b1': b1, 'b2': b2}

  return params



---



In [4]:
def loss(y_hat, y):
  '''
  y_hat: predicted probabilities of shape (N, 1)
  y: ground truth labels of shape (N, 1)
  '''
  N = y.shape[0]
  loss = - torch.sum( y * torch.log(y_hat) + (1-y)*torch.log(1-y_hat) ) / N
  return loss



---



In [5]:
def forward_pass(X, params):
  '''
  X: Input data of shape (N, D). Each X[i] is a training sample.
  params: dictionary containing the parameters of a 2 layer NN model
  '''

  S1 = torch.mm(X, params['W1']) + params['b1']
  A1 = sigmoid(S1)
  S2 = torch.mm(A1, params['W2']) + params['b2']
  out = sigmoid(S2) 

  cache = {'S1': S1, 'S2': S2, 'A1': A1, 'out': out}

  return cache



---



In [12]:
def backward_pass(X, y, params, cache):
  '''
  X: Input data of shape (N, D). Each X[i] is a training sample.
  y: Vector of training labels. y[i] is the label for X[i]
  params: dictionary containing the parameters of a 2 layer NN model
  cache: dictionary containing the intermediate outputs of the forward pass needed to compute the grad 
  '''

  # Rule of thumb: Always follow the shapes
  # grad of W should have the same shape as W

  N = X.shape[0]

  grad = {}

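  # For a sigmoid output with binary cross-entropy loss, dLoss/dS2 simplifies to (out - y).
  # Try proving it yourself; I can send some links to help.
  # The 1/N factor from averaging over the batch is applied to each gradient below.
  dS2 = cache['out'] - y  # (N, 1)
  grad['W2'] = (1/N) * torch.mm(cache['A1'].t(), dS2)  # (H, 1)
  grad['b2'] = (1/N) * torch.sum(dS2, dim=0)  # (1,)

  # sigmoid'(S1) = A1 * (1 - A1), and A1 is already stored in the cache
  dS1 = torch.mm(dS2, params['W2'].t()) * cache['A1'] * (1 - cache['A1'])  # (N, H)

  grad['W1'] = (1/N) * torch.mm(X.t(), dS1)  # (D, H)
  grad['b1'] = (1/N) * torch.sum(dS1, dim=0)  # (H,)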

  return grad
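
The shortcut `dS2 = out - y` used above can be checked by hand. Here is a short derivation for a single sample, writing $s$ for the pre-activation `S2`, $\hat{y} = \sigma(s)$ for `out`, and $L$ for the per-sample loss (these symbols are introduced here only for illustration):

$$
\begin{aligned}
L &= -\big[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\big] \\
\frac{\partial L}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})} \\
\frac{\partial \hat{y}}{\partial s} &= \sigma(s)\,(1 - \sigma(s)) = \hat{y}\,(1 - \hat{y}) \\
\frac{\partial L}{\partial s} &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial s} = \hat{y} - y
\end{aligned}
$$

Averaging over the $N$ samples contributes the factor $1/N$ that multiplies each gradient in `backward_pass`.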




---



In [13]:
def update(params, grad, lr):
  params['W1'] -= lr * grad['W1']
  params['b1'] -= lr * grad['b1']
  params['W2'] -= lr * grad['W2']
  params['b2'] -= lr * grad['b2']

  return params 



---

Let's train the model.

In [14]:
X = torch.rand((1000, 16)) # dummy input
y = 1.0 * (torch.rand((1000, 1)) > 0.5) # random output

def fit(X, y, H=64, lr = 0.1, n_epochs=20):

  params = initParameters(X, y, H)

  for epoch in range(n_epochs):

    cache = forward_pass(X, params)

    epoch_loss = loss(cache['out'], y)

    print('epoch ==> ', epoch, ' loss ==> ', epoch_loss.item())

    grad = backward_pass(X, y, params, cache)

    params = update(params, grad, lr)


fit(X, y)

epoch ==>  0  loss ==>  0.7010156512260437
epoch ==>  1  loss ==>  0.6947508454322815
epoch ==>  2  loss ==>  0.6927878856658936
epoch ==>  3  loss ==>  0.6921752691268921
epoch ==>  4  loss ==>  0.6919838786125183
epoch ==>  5  loss ==>  0.6919241547584534
epoch ==>  6  loss ==>  0.6919053792953491
epoch ==>  7  loss ==>  0.6918995380401611
epoch ==>  8  loss ==>  0.6918976902961731
epoch ==>  9  loss ==>  0.6918971538543701
epoch ==>  10  loss ==>  0.6918969750404358
epoch ==>  11  loss ==>  0.6918968558311462
epoch ==>  12  loss ==>  0.691896915435791
epoch ==>  13  loss ==>  0.6918968558311462
epoch ==>  14  loss ==>  0.6918968558311462
epoch ==>  15  loss ==>  0.6918968558311462
epoch ==>  16  loss ==>  0.691896915435791
epoch ==>  17  loss ==>  0.6918968558311462
epoch ==>  18  loss ==>  0.6918968558311462
epoch ==>  19  loss ==>  0.6918968558311462




---



As you can see, this way of implementing backprop has a lot of problems:

*  It's not scalable: if we want to add more layers, change the loss function, or change the activation function, we have to derive everything all over again.

*  It's very tedious and error-prone.

*  It's not feasible for complex models.
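
Because hand-derived gradients are so easy to get wrong, one common safeguard is a numerical gradient check: compare each analytic gradient against a central finite difference of the loss. The block below is a minimal sketch of that idea, assuming the same two-layer sigmoid network and binary cross-entropy loss as above. The helper names (`sigmoid_fn`, `bce_loss`, `manual_grads`, `numerical_grad`) are made up for this example, and `float64` is used to keep rounding error small.

```python
import torch

def sigmoid_fn(x):
    return 1.0 / (1.0 + torch.exp(-x))

def bce_loss(X, y, p):
    """Forward pass of the two-layer sigmoid network, then the mean BCE loss."""
    A1 = sigmoid_fn(X @ p['W1'] + p['b1'])
    out = sigmoid_fn(A1 @ p['W2'] + p['b2'])
    return -torch.mean(y * torch.log(out) + (1 - y) * torch.log(1 - out))

def manual_grads(X, y, p):
    """Hand-derived gradients, same formulas as in the naive backward pass."""
    N = X.shape[0]
    A1 = sigmoid_fn(X @ p['W1'] + p['b1'])
    out = sigmoid_fn(A1 @ p['W2'] + p['b2'])
    dS2 = out - y
    dS1 = (dS2 @ p['W2'].t()) * A1 * (1 - A1)
    return {'W1': X.t() @ dS1 / N, 'b1': dS1.sum(dim=0) / N,
            'W2': A1.t() @ dS2 / N, 'b2': dS2.sum(dim=0) / N}

def numerical_grad(X, y, p, name, eps=1e-6):
    """Central finite differences: perturb one entry of p[name] at a time."""
    grad = torch.zeros_like(p[name])
    flat_param = p[name].view(-1)  # view: writes go through to p[name]
    flat_grad = grad.view(-1)
    for i in range(flat_param.numel()):
        old = flat_param[i].item()
        flat_param[i] = old + eps
        loss_plus = bce_loss(X, y, p).item()
        flat_param[i] = old - eps
        loss_minus = bce_loss(X, y, p).item()
        flat_param[i] = old  # restore the original value
        flat_grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

torch.manual_seed(0)
N, D, H = 5, 4, 3  # tiny sizes so the check runs instantly
X = torch.rand(N, D, dtype=torch.float64)
y = (torch.rand(N, 1) > 0.5).to(torch.float64)
p = {'W1': torch.randn(D, H, dtype=torch.float64),
     'b1': torch.zeros(H, dtype=torch.float64),
     'W2': torch.randn(H, 1, dtype=torch.float64),
     'b2': torch.zeros(1, dtype=torch.float64)}

analytic = manual_grads(X, y, p)
max_errors = {name: (analytic[name] - numerical_grad(X, y, p, name)).abs().max().item()
              for name in p}
print(max_errors)  # every entry should be tiny if the derivation is right
```

A check like this catches sign errors and missing factors, but it is far too slow to use for actual training, since it needs two forward passes per parameter.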



---



# Computational Graph
A computational graph is a directed graph that represents the computations we perform inside our model.

We can also see computational graphs as a framework that makes computing gradients easier for us.

PyTorch, TensorFlow, Theano, etc. are all built on the fundamental idea of computational graphs.

Follow along on the board and see slides 56 to 111.

We will use this idea of local computations at each node of the graph to make our naive backprop implementation modular.

Basically for each operation/layer we will implement a `forward` and a `backward` function. The `forward` function will receive inputs, weights, and other parameters and will return both an output and a `cache` object storing data needed for the `backward` pass, like this:

```python
def forward(x, w):
  """ Receive inputs x and weights w """
  # Do some computations ...
  z = ...  # some intermediate value
  # Do some more computations ...
  out = ...  # the output

  cache = (x, w, z, out) # Values we need to compute gradients

  return out, cache
```

The `backward` pass will receive upstream derivatives and the `cache` object, and will return gradients with respect to the inputs and weights, like this:

```python
def backward(dout, cache):
  """
  Receive dout (derivative of loss with respect to outputs) and cache,
  and compute derivatives with respect to inputs and weights.
  """
  # Unpack cache values
  x, w, z, out = cache

  # Use values in cache to compute derivatives
  dx = ...  # Derivative of loss with respect to x
  dw = ...  # Derivative of loss with respect to w
  
  return dx, dw
```

After implementing a bunch of layers this way, we will be able to easily combine them to build models with different architectures.

For each layer we implement, we will define a class with two static methods `forward` and `backward`.
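
As a concrete instance of this pattern, here is a minimal sketch with two toy nodes, `Multiply` and `Square`. Both are hypothetical names introduced only for this example, not layers used later. They are chained to compute $loss = (x \cdot w)^2$, and the backward calls run in the reverse order of the forward calls, each one receiving the upstream derivative produced by the node after it.

```python
import torch

class Multiply(object):
    """Toy node computing out = x * w (illustrative only)."""

    @staticmethod
    def forward(x, w):
        out = x * w
        cache = (x, w)  # both inputs are needed for the backward pass
        return out, cache

    @staticmethod
    def backward(dout, cache):
        x, w = cache
        dx = dout * w  # d(x*w)/dx = w
        dw = dout * x  # d(x*w)/dw = x
        return dx, dw


class Square(object):
    """Toy node computing out = z ** 2 (illustrative only)."""

    @staticmethod
    def forward(z):
        out = z ** 2
        cache = z
        return out, cache

    @staticmethod
    def backward(dout, cache):
        z = cache
        dz = dout * 2 * z  # d(z^2)/dz = 2z
        return dz


x = torch.tensor(3.0)
w = torch.tensor(2.0)

# Forward pass: left to right through the graph
z, mul_cache = Multiply.forward(x, w)
loss, sq_cache = Square.forward(z)

# Backward pass: right to left, starting from dloss/dloss = 1
dz = Square.backward(torch.tensor(1.0), sq_cache)
dx, dw = Multiply.backward(dz, mul_cache)

print(loss.item(), dx.item(), dw.item())
```

Each node only knows its own local derivative; the chain rule is applied simply by passing `dout` from one `backward` call to the next.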



---



### Linear layer


In [None]:
class Linear(object):

  @staticmethod
  def forward(x, w, b):
    """
      Inputs:
    - x: A tensor containing input data, of shape (N, D)
    - w: A tensor of weights, of shape (D, M)
    - b: A tensor of biases, of shape (M,)
    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    out = torch.mm(x, w) + b
    cache = (x, w, b)
    return out, cache

  @staticmethod
  def backward(dout, cache):
      """
      Computes the backward pass for a linear layer.
      Inputs:
      - dout: Upstream derivative, of shape (N, M)
      - cache: Tuple of:
        - x: Input data, of shape (N, D)
        - w: Weights, of shape (D, M)
        - b: Biases, of shape (M,)
      Returns a tuple of:
      - dx: Gradient with respect to x, of shape (N, D)
      - dw: Gradient with respect to w, of shape (D, M)
      - db: Gradient with respect to b, of shape (M,)
      """
      x, w, b = cache

      # b is broadcast over the batch in the forward pass, so its gradient sums over the batch
      db = torch.sum(dout, dim=0)
      dw = torch.mm(x.t(), dout)
      dx = torch.mm(dout, w.t())
      
      return dx, dw, db
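
The three lines of `Linear.backward` follow from the chain rule applied to $out = xw + b$. A short sketch, writing $dout = \partial L / \partial out$ for the upstream derivative and $dout_i$ for its $i$-th row (notation assumed here for illustration):

$$
\begin{aligned}
dx &= dout \; w^{\top} && (N \times M)(M \times D) \to (N \times D) \\
dw &= x^{\top} \, dout && (D \times N)(N \times M) \to (D \times M) \\
db &= \sum_{i=1}^{N} dout_i && (M,)
\end{aligned}
$$

In each case the result has the same shape as the quantity it is the gradient of, which is the "follow the shapes" rule of thumb from the naive implementation.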

---

### ReLU activation

In [None]:
class ReLU(object):

  @staticmethod
  def forward(x):
      """
      Computes the forward pass for a layer of rectified linear units (ReLUs).
      Input:
      - x: Input; a tensor of any shape
      Returns a tuple of:
      - out: Output, a tensor of the same shape as x
      - cache: x
      """
      out = x * torch.gt(x, 0)
      cache = x
      return out, cache

  @staticmethod
  def backward(dout, cache):
    """
    Computes the backward pass for a layer of rectified linear units (ReLUs).
    Input:
    - dout: Upstream derivatives, of any shape
    - cache: Input x, of same shape as dout
    Returns:
    - dx: Gradient with respect to x
    """
    x = cache
    dx = torch.gt(x, 0) * dout 
    return dx




---



### Sigmoid activation

In [None]:
class sigmoid(object):
  # Note: this class reuses the name `sigmoid`, so in the notebook it replaces
  # the plain `sigmoid` function defined for the naive implementation.

  @staticmethod
  def forward(x):
      """
      Computes the forward pass for a sigmoid layer.
      Input:
      - x: Input; a tensor of any shape
      Returns a tuple of:
      - out: Output, a tensor of the same shape as x
      - cache: out (the sigmoid output, which is all the backward pass needs)
      """
      out = 1.0 / (1.0 + torch.exp(-x))
      cache = out
      return out, cache

  @staticmethod
  def backward(dout, cache):
    """
    Computes the backward pass for a sigmoid layer.
    Input:
    - dout: Upstream derivatives, of any shape
    - cache: Output of the sigmoid, of same shape as dout
    Returns:
    - dx: Gradient with respect to x
    """
    out = cache
    dx = out * (1 - out) * dout  # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    return dx



---

Now that we have all the building blocks, we can train a two-layer network using this strategy:



In [None]:
class TwoLayerNet(object):

  def __init__(self, input_dim, hidden_dim=100, output_dim = 1, weight_scale=1e-3):
    """
    Initialize a new network.
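    Inputs:
    - input_dim: An integer giving the size of the input
    - hidden_dim: An integer giving the size of the hidden layer
    - output_dim: An integer giving the size of the output
    - weight_scale: Standard deviation of the normal distribution used to initialize the weights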
    """
    self.params = {}
    self.params['W1'] = weight_scale * torch.randn((input_dim, hidden_dim))
    self.params['b1'] = torch.zeros((hidden_dim))
    self.params['W2'] = weight_scale * torch.randn((hidden_dim, output_dim))
    self.params['b2'] = torch.zeros((output_dim))

  def loss(self, y_hat, y):
    """
    Binary cross-entropy loss and its gradient with respect to y_hat.
    """
    N = y.shape[0]
    loss = - torch.sum( y * torch.log(y_hat) + (1-y)*torch.log(1-y_hat) ) / N
    # d(loss)/d(y_hat); the 1/N factor comes from averaging over the batch
    dloss = (y_hat - y) / (y_hat * (1 - y_hat)) / N
    return loss, dloss


  def one_pass(self, X, y):
    """
    Compute loss and gradient for a minibatch of data.

    Inputs:
    - X: Tensor of input data of shape (N, D)
    - y: Tensor of labels, of shape (N, 1). y[i] gives the label for X[i].

    Returns:
    - loss: Scalar value giving the loss
    - grads: Dictionary with the same keys as self.params, mapping parameter
      names to gradients of the loss with respect to those parameters.
    """
    W1 = self.params['W1']
    b1 = self.params['b1']
    W2 = self.params['W2']
    b2 = self.params['b2']

    # Forward pass: Linear -> ReLU -> Linear -> sigmoid
    S1, linear_cache_1 = Linear.forward(X, W1, b1)
    A1, relu_cache = ReLU.forward(S1)
    S2, linear_cache_2 = Linear.forward(A1, W2, b2)
    out, sigmoid_cache = sigmoid.forward(S2)

    grads = {}

    loss, dloss = self.loss(out, y)

    # Backward pass: visit the layers in reverse order
    dS2 = sigmoid.backward(dloss, sigmoid_cache)
    dA1, grads['W2'], grads['b2'] = Linear.backward(dS2, linear_cache_2)
    dS1 = ReLU.backward(dA1, relu_cache)
    dX, grads['W1'], grads['b1'] = Linear.backward(dS1, linear_cache_1)

    return loss, grads

# **But do we need all of this to train neural networks?**

---


Any deep learning framework should support the following features:

Fast prototyping, automatic gradient computation, and accelerated computation on GPUs.

If you visit the GitHub repo of PyTorch, you will find it defined as:

PyTorch is a Python package that provides two high-level features:

* Tensor computation (like NumPy) with strong GPU acceleration.

* Deep neural networks built on a tape-based autograd system.

Now let's explore the power of PyTorch.



---



Francis told you that tensors in PyTorch are just like nd-arrays in NumPy. That is right, since they share many attributes (`shape`, `size`, `dtype`, etc.), but PyTorch tensors are more powerful because they support some additional features:

Apart from the CPU, they can live on the GPU for faster computations: the `.device` attribute tells you where a tensor is stored, and `.to(device)` moves it. An equally important feature is that once `.requires_grad = True` is set, the PyTorch autograd engine starts building a graph that tracks every operation applied to them, so that gradients can be calculated using the same idea of computational graphs we talked about previously.

PyTorch autograd is an engine for calculating derivatives. It records all the operations performed on a gradient-enabled tensor in a directed acyclic graph called the dynamic computational graph. The leaves of this graph are input tensors and the roots are output tensors. Gradients are calculated by tracing the graph from the roots to the leaves and multiplying the local gradients along the way using the chain rule.

If you still don't get it, I recommend this video: https://www.youtube.com/watch?v=MswxJw-8PvE&t=645s

In [2]:
x = torch.rand((2, 3))

In [3]:
x.device 

device(type='cpu')

In [4]:
torch.cuda.is_available()

True

In [5]:
x = x.to(torch.device('cuda'))
x.device

device(type='cuda', index=0)

In [6]:
x.requires_grad

False

In [7]:
x.grad  # holds the gradient of x; for now it's None

In [8]:
x.is_leaf

True

In [9]:
x.grad_fn  # references the function that created this tensor (used during backprop); None here because we created x directly

In [10]:
z = 2 * x
z.grad_fn



---



The autograd package provides automatic differentiation for all operations on tensors. Once you finish your computation, you can call the magic word `.backward()` and have all the gradients computed automatically.

In [11]:
a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)

c = a * b
c.backward()

In [12]:
a.grad

tensor(5.)

In [13]:
b.grad

tensor(3.)

In [14]:
c.grad_fn

<MulBackward0 at 0x7f8e23a101d0>

<img src='https://miro.medium.com/max/589/1*viCEZbSODfA8ZA4ECPwHxQ.png' />

---
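
To connect autograd back to the manual gradients from the naive implementation, the sketch below builds a single sigmoid layer with a binary cross-entropy loss, calls `.backward()`, and compares the result with the hand-derived formula $X^{\top}(out - y)/N$. The data and the variable names (`manual_dw`, `manual_db`) are made up for this example.

```python
import torch

torch.manual_seed(0)

x = torch.rand(4, 3)                              # 4 samples, 3 features
y = torch.tensor([[1.0], [0.0], [1.0], [0.0]])    # binary labels

# Leaf tensors that autograd should track
w = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

out = torch.sigmoid(x @ w + b)
loss = -torch.mean(y * torch.log(out) + (1 - y) * torch.log(1 - out))

# Autograd walks the recorded graph from the loss back to w and b
loss.backward()

# Hand-derived gradients; no_grad() keeps these operations out of the graph
with torch.no_grad():
    N = x.shape[0]
    manual_dw = x.t() @ (out - y) / N
    manual_db = (out - y).sum(dim=0) / N

print(torch.allclose(w.grad, manual_dw, atol=1e-5))
print(torch.allclose(b.grad, manual_db, atol=1e-5))
```

Both comparisons should agree up to floating-point error, which is the whole point: autograd applies the same chain rule we applied by hand, just without the pen and paper.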


Now that we know autograd, let's use it to train neural networks. First we will watch Justin explain it in lecture 9 (34:30 -> 51:00) :)

---

In the next few cells I will give a simple example of using the `torch.nn` module, but it's not enough to cover all of it, so please read this when you go back: https://pytorch.org/tutorials/beginner/nn_tutorial.html

We will use mini-batch SGD to solve a toy regression example with a custom multilayer neural network built with the `torch.nn` module.

In [None]:
import torch
import torch.nn as nn
from torch import optim

In [None]:
if torch.cuda.is_available():
  device = torch.device('cuda')
else:
  device = torch.device('cpu')

In [None]:
N, D_in, H1, H2, H3, D_out =  1280, 16, 32, 64, 32, 1

X = torch.rand((N, D_in), device=device)
y = torch.rand((N, D_out), device=device)

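# Without non-linearities between them, stacked Linear layers collapse into a single linear map
model = nn.Sequential(
    nn.Linear(D_in, H1),
    nn.ReLU(),
    nn.Linear(H1, H2),
    nn.ReLU(),
    nn.Linear(H2, H3),
    nn.ReLU(),
    nn.Linear(H3, D_out)
)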

model = model.to(device = device)

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.MSELoss()

epochs = 10 
bs = 32
n_batch = int(N/bs)

for epoch in range(epochs):
    epoch_loss = 0
    for i in range(n_batch):
        
        start_i = i * bs
        end_i = start_i + bs

        xb = X[start_i:end_i]
        yb = y[start_i:end_i]

        y_pred = model(xb)

        loss = criterion(y_pred, yb)
        epoch_loss += loss.item()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    epoch_loss /= n_batch  # each batch loss is already a mean, so average over the number of batches
    print('epoch ==> ', epoch, 'epoch loss ==> ', epoch_loss)
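
The `optimizer.zero_grad()` call in the loop above matters because `.backward()` adds to whatever is already stored in `.grad` instead of overwriting it. A minimal sketch of that behaviour on a single scalar (the variable names `first`, `second` and `third` are only for this example):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
first = w.grad.clone()    # gradient of 3*w with respect to w

(3 * w).backward()
second = w.grad.clone()   # accumulated: the new gradient was added to the old one

w.grad.zero_()            # reset by hand, which is what zero_grad() does for every parameter
(3 * w).backward()
third = w.grad.clone()    # back to the gradient of a single backward pass

print(first.item(), second.item(), third.item())
```

Forgetting to reset the gradients between steps therefore mixes gradients from different mini-batches into a single update.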

Another (more common) way to create a model in PyTorch with the `nn` module is to subclass `nn.Module`:

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        # Flatten everything except the batch dimension: (N, 16, 6, 6) -> (N, 576)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
        
net = Net()
print(net)
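
The same subclassing pattern also covers the two-layer binary classifier from the first part of this tutorial. Below is a minimal sketch, assuming sigmoid activations and random dummy data as in the naive implementation. The class name `TwoLayerClassifier` is hypothetical, and `nn.BCELoss` plays the role of our hand-written loss.

```python
import torch
import torch.nn as nn

class TwoLayerClassifier(nn.Module):  # hypothetical name, for illustration only

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        a1 = torch.sigmoid(self.fc1(x))
        return torch.sigmoid(self.fc2(a1))  # probabilities in (0, 1)

torch.manual_seed(0)
X = torch.rand(1000, 16)                    # dummy input
y = (torch.rand(1000, 1) > 0.5).float()     # random labels

clf = TwoLayerClassifier(input_dim=16, hidden_dim=64)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(clf.parameters(), lr=0.1)

for epoch in range(5):
    y_hat = clf(X)
    loss = criterion(y_hat, y)

    optimizer.zero_grad()
    loss.backward()      # autograd replaces our hand-written backward pass
    optimizer.step()     # the optimizer replaces our update() function

    print('epoch ==> ', epoch, ' loss ==> ', loss.item())
```

Only the forward computation is written by hand here; the backward pass and the parameter update come from autograd and the optimizer. Since the labels are random, the loss is not expected to drop far below $\log 2 \approx 0.693$.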

There are still a ton of things to cover in PyTorch:

DataLoaders: mini-batching and shuffling handled for you, usually combined with dataset transforms for data augmentation.

Different loss functions: cross entropy and negative log likelihood (for multiclass classification), binary cross entropy (for binary classification), CTC loss (mostly used for speech data), etc.

Other layers: 2D conv, 3D conv, RNN, LSTM, GRU, attention, etc.

Other optimizers: Adam, RMSProp, Adagrad, etc.

Regularizers: L2 (weight decay), dropout, BatchNorm, etc.

How to save models: `torch.save()`.

Learning rate schedulers: cosine, step, etc.

Activation functions: ReLU, sigmoid, tanh, Swish, leaky ReLU, ELU, etc.

Pretrained models: ResNets, VGG, BERT, etc.
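
As one example from the list above, the manual index slicing in the mini-batch loop earlier could be handed over to a `DataLoader`. This is a minimal sketch with the same dummy tensor sizes as before, using `TensorDataset` to wrap tensors that already live in memory.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

N, D_in, D_out, bs = 1280, 16, 1, 32

X = torch.rand((N, D_in))
y = torch.rand((N, D_out))

dataset = TensorDataset(X, y)                               # pairs X[i] with y[i]
loader = DataLoader(dataset, batch_size=bs, shuffle=True)   # reshuffles every epoch

n_batches = 0
for xb, yb in loader:
    # xb and yb are one mini-batch; the model, loss and optimizer steps would go here
    n_batches += 1

print(n_batches, xb.shape, yb.shape)
```

Compared with slicing by hand, the loader also takes care of shuffling and of a last batch that is smaller than `bs` when `N` is not a multiple of it.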



And a lot more. We will try to cover as much as possible, but you have to work with us. The best thing to do is to watch the lectures and do the assignments in this course:
https://web.eecs.umich.edu/~justincj/teaching/eecs498/FA2019/

Also check the solutions for the Deep Learning Nanodegree from Udacity: https://github.com/udacity/deep-learning-v2-pytorch

# Thank you, Assignment soon to be released