# Deep Learning Tutorials - Chapter 4 - Graphs of operators, autograd, and convolutional layers

# 4.1 DAG-Networks

We can generalize an MLP to an arbitrary "Directed Acyclic Graph" (DAG) of operators: we can place operators at the nodes of any directed acyclic graph, compute the response of the resulting mapping, and compute its gradient with backpropagation.

![dag](./images/dag.png)



In [None]:
# the code for the above DAG (TensorFlow 1.x graph-mode API)
import tensorflow as tf
w1 = tf.Variable(tf.random_normal([5,5]))
w2 = tf.Variable(tf.random_normal([5,5]))
x = tf.Variable(tf.random_normal([5,1]))
x0 = x
x1 = tf.matmul(w1, x0)
x2 = x0 + tf.matmul(w2,x1)
x3 = tf.matmul(w1,x1+x2)
q = tf.norm(x3)

gw1, gw2 = tf.gradients(q, [w1, w2])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    _gw1, _gw2 = sess.run([gw1, gw2])

In our generalized DAG formulation, we have implicitly allowed the same parameters to modulate different parts of the processing. For instance, $w^{1}$ in our example parameterizes both $\phi^{1}$ and $\phi^{3}$. This is called weight sharing.

Weight sharing makes it possible, in particular, to build **Siamese Networks**, where a full sub-network is replicated several times.
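As a rough illustration of weight sharing, here is a minimal PyTorch sketch of a Siamese-style computation. The layer sizes, the random inputs, and the names `branch`, `xa`, `xb` are assumptions made for this example only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One sub-network used for both branches: its weights are shared.
branch = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 4))

xa = torch.randn(8)  # first input (random placeholder data)
xb = torch.randn(8)  # second input (random placeholder data)

# Both inputs go through the same module, then the embeddings are compared.
distance = (branch(xa) - branch(xb)).norm()
distance.backward()

# Each shared parameter receives the sum of the gradients from both branches.
n_param_tensors = len(list(branch.parameters()))
all_have_grad = all(p.grad is not None for p in branch.parameters())
print(n_param_tensors, all_have_grad)
```

Replicating the sub-network does not add parameters: the model still has only the four tensors of `branch`.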

# 4.2 Autograd

Conceptually, the forward pass is a standard tensor computation, and the DAG of the tensor operations is required only to compute derivatives.

When executing tensor operations, PyTorch can automatically construct on-the-fly the graph of operations needed to compute the gradient of any quantity with respect to any tensor involved.

This **autograd** mechanism has two main benefits:

1. Simpler syntax: one just needs to write the forward pass as a standard sequence of Python operations.

2. Greater flexibility: since the graph is not static, the forward pass can be dynamically modulated.

A `Tensor` has a Boolean field `requires_grad`, set to `False` by default, which states whether PyTorch should build the graph of operations so that gradients with respect to it can be computed.

The result of a tensor operation has this flag set to `True` if any of its operands has it set to `True`.

In [1]:
import torch
x = torch.tensor([1.,2.,])
y = torch.tensor([4.,5.,])
z = torch.tensor([7.,3.,])


In [2]:
x.requires_grad

False

In [3]:
(x+y).requires_grad

False

In [6]:
z.requires_grad = True

In [7]:
(x+z).requires_grad

True

Only tensors of floating point type (and, in recent PyTorch versions, complex type) can have their gradient computed.
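A small sketch of this restriction, assuming a recent PyTorch version: asking an integer tensor to require gradients raises a `RuntimeError`, while converting it to a floating point type first works.

```python
import torch

int_tensor = torch.tensor([1, 2, 3])  # dtype torch.int64
try:
    int_tensor.requires_grad_()
    raised = False
except RuntimeError:
    raised = True  # integer tensors cannot require gradients

# Converting to a floating point type first is fine.
float_tensor = int_tensor.float().requires_grad_()
print(raised, float_tensor.requires_grad)
```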

`torch.autograd.grad(outputs, inputs)` computes and returns the gradient of `outputs` with respect to `inputs`

In [8]:
t = torch.tensor([1.,2.,4.]).requires_grad_()
u = torch.tensor([10.,20.,]).requires_grad_()
a = t.pow(2).sum() + u.log().sum()
torch.autograd.grad(a,(t,u))

(tensor([2., 4., 8.]), tensor([0.1000, 0.0500]))

`inputs` can be a single tensor, but the result is still a one-element tuple.

If `outputs` is a tuple, the result is the sum of the gradients of its elements.
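Both remarks can be checked on a small example. This is only a sketch; the tensors `a` and `b` are chosen here so that the expected gradient, $2t + 1$, is easy to verify by hand.

```python
import torch

t = torch.tensor([1., 2., 4.]).requires_grad_()
a = t.pow(2).sum()  # da/dt = 2t
b = t.sum()         # db/dt = 1

# A single tensor passed as `inputs` still yields a one-element tuple,
# and a tuple of outputs yields the sum of their gradients: 2t + 1.
result = torch.autograd.grad((a, b), t)
(g,) = result
print(len(result), g)
```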

The function `Tensor.backward()` accumulates gradients in the `grad` fields of tensors which are not results of operations, the "leaves" of the autograd graph.

In [15]:
x = torch.tensor([-3.,2.,5.]).requires_grad_()
u = x.pow(3).sum()
x.grad
u.backward()
x.grad

tensor([27., 12., 75.])

This function is an alternative to `torch.autograd.grad(...)` and is the standard choice for training models.

**warning** `Tensor.backward()` accumulates the gradients in the `grad` fields of tensors, so one may have to set them to zero before calling it.

This accumulating behavior is desirable to compute the gradient of a loss summed over several "mini-batches", or the gradient of a sum of losses.
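The accumulation can be observed directly. The following sketch calls `backward()` twice on the same quantity, then resets the gradient with `zero_()` before a third call; the variable names are illustrative.

```python
import torch

x = torch.tensor([1., 2.]).requires_grad_()

x.pow(2).sum().backward()
first = x.grad.clone()    # gradient 2x

x.pow(2).sum().backward()
second = x.grad.clone()   # accumulated: 2x + 2x

x.grad.zero_()            # reset before the next backward pass
x.pow(2).sum().backward()
third = x.grad.clone()    # back to 2x
print(first, second, third)
```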

![dag](./images/dag.png)

In [16]:
# writing this DAG with new autograd commands

w1 = torch.rand(5,5).requires_grad_()
w2 = torch.rand(5,5).requires_grad_()
x = torch.empty(5).normal_()

x0 = x
x1 = w1 @ x0
x2 = x0 + w2 @ x1
x3 = w1 @ (x1 + x2)

q = x3.norm()

q.backward()  # fills w1.grad and w2.grad

![net](./images/network.png)

In [None]:
w1 = torch.rand(20,10).requires_grad_()
b1 = torch.rand(20).requires_grad_()
w2 = torch.rand(5,20).requires_grad_()
b2= torch.rand(5).requires_grad_()

x = torch.rand(10)
h = torch.tanh(w1 @ x + b1)
y = torch.tanh(w2 @ h + b2)

target = torch.rand(5)

loss = (y-target).pow(2).mean()

![net2](./images/net2.png)

In [None]:
w = torch.rand(3,10,10).requires_grad_()

def blah(k,x):
    for i in range(k):
        x = torch.tanh(w[i] @ x)
    return x

u = blah(1, torch.rand(10))
v = blah(3, torch.rand(10))
q = u.dot(v)

**warning:** Although they are related, the autograd graph is not the network's structure, but the graph of operations needed to compute the gradient. It can be data-dependent and miss or replicate sub-parts of the network.

The `torch.no_grad()` context switches off the autograd machinery, and can be used for operations such as parameter updates.

The `detach()` method creates a tensor which shares the data, but does not require gradient computation, and is not connected to the current graph.

This method should be used when the gradient should not be propagated beyond the variable, or to update leaf tensors.
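A minimal sketch of what `detach()` does, on a toy tensor chosen for this example: the detached tensor shares its storage with the original, does not require gradients, and is treated as a constant during differentiation.

```python
import torch

a = torch.tensor([1., 2., 3.]).requires_grad_()
d = a.detach()

shares_memory = d.data_ptr() == a.data_ptr()  # same underlying storage

# The detached factor is treated as a constant, so the gradient is d, not 2a.
loss = (a * d).sum()
(g,) = torch.autograd.grad(loss, a)
print(d.requires_grad, shares_memory, g)
```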

In [21]:

a = torch.tensor(0.5).requires_grad_()
b = torch.tensor(-0.5).requires_grad_()
eta = 1e-3

for k in range(100):
    l = (a-1)**2 + (b+1)**2 + (a-b)**2
    ga, gb = torch.autograd.grad(l, (a,b))
    with torch.no_grad():
        a -= eta * ga
        b -= eta * gb

print(a,b)

tensor(0.4246, requires_grad=True) tensor(-0.4246, requires_grad=True)


In [20]:

a = torch.tensor(0.5).requires_grad_()
b = torch.tensor(-0.5).requires_grad_()
eta = 1e-4

for k in range(100):
    l = (a-1)**2 + (b+1)**2 + (a.detach()-b)**2
    ga, gb = torch.autograd.grad(l, (a,b))
    with torch.no_grad():
        a -= eta * ga
        b -= eta * gb

print(a,b)

tensor(0.5099, requires_grad=True) tensor(-0.4901, requires_grad=True)


By default, autograd deletes the computational graph once it has been used. The flag `retain_graph` indicates that it should be kept.

Autograd can also track the computation of the gradient itself, to allow higher-order derivatives. This is specified with `create_graph=True`.

![hod](./images/hod.png)
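The two flags can be tried on a scalar example. This is a sketch with values picked so the derivatives are easy to check by hand ($3x^2$ and $6x$ at $x = 3$).

```python
import torch

x = torch.tensor(3.0).requires_grad_()
y = x ** 3

# create_graph=True records the gradient computation itself.
(dy,) = torch.autograd.grad(y, x, create_graph=True)  # 3 x^2
(d2y,) = torch.autograd.grad(dy, x)                   # 6 x

# retain_graph=True keeps the graph so backward can be called twice.
z = x ** 2
z.backward(retain_graph=True)
z.backward()  # would raise a RuntimeError without retain_graph above
print(dy.item(), d2y.item(), x.grad.item())
```

Note that `x.grad` ends up holding twice the gradient of `z`, since `backward()` accumulates.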

**warning**: In-place operations may corrupt values required to compute the gradient, and this is tracked down by autograd.

They are also prohibited on so-called "leaf" tensors (when these require gradients), which are not the results of operations but the initial inputs to the whole computation.
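A short sketch of this restriction, using a toy tensor `w` introduced for the example: an in-place update of a leaf tensor that requires gradients raises a `RuntimeError`, while the same update inside `torch.no_grad()` is accepted.

```python
import torch

w = torch.tensor([1., 2.]).requires_grad_()  # a leaf tensor

try:
    w += 1.0  # in-place update of a leaf that requires grad
    raised = False
except RuntimeError:
    raised = True

with torch.no_grad():
    w += 1.0  # allowed: autograd is switched off here
print(raised, w.requires_grad)
```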

# 4.3 PyTorch modules and batch processing

Elements from `torch.nn.functional` are autograd-compliant functions which compute a result from provided arguments alone

Subclasses of `torch.nn.Module` are losses and network components. The latter embed parameters to be optimized during training.

Parameters are of the type `torch.nn.Parameter` which is a `Tensor` with `requires_grad` to `True` and known to be a model parameter by various utility functions in particular `torch.nn.Module.parameters`.

Usually `torch.nn.functional` is imported as `F` and `torch.nn` as `nn`.

**warning:** functions and modules from `nn` process **batches** of inputs stored in a tensor whose first dimension indexes them, and produce a corresponding tensor with the same additional dimension.

#### PyTorch modules

`F.relu(input, inplace=False)` takes a tensor of any size as input, applies a ReLU on each value to produce a tensor of the same size.

`inplace` indicates if the operation should modify the argument itself. This may be desirable to reduce the memory footprint of the processing.

In [22]:
import torch.nn.functional as F
x = torch.tensor([0.8008, -0.2586, 0.5019, -0.2002, -0.7416])
F.relu(x)

tensor([0.8008, 0.0000, 0.5019, 0.0000, 0.0000])

`nn.Linear(in_features, out_features, bias=True)` implements a $\mathbb{R}^{C} \to \mathbb{R}^{D}$ fully connected layer. It takes as input a tensor of size $N \times C$ and produces a tensor of size $N \times D$.

In [23]:
import torch.nn as nn
f = nn.Linear(in_features=10, out_features=4)
for n, p in f.named_parameters():
    print(n, p.size())

weight torch.Size([4, 10])
bias torch.Size([4])


In [24]:
x = torch.empty(523,10).normal_()
y=f(x)
y.size()

torch.Size([523, 4])

`nn.MSELoss` implements the mean squared error loss: the sum of the component-wise squared differences, divided by the total number of components in the tensors.

In [25]:
f = nn.MSELoss()
x = torch.tensor([[3.,]])
y = torch.tensor([[0.,]])
f(x,y)

tensor(9.)

In [26]:
x = torch.tensor([[3.,0.,0.,0.]])
y = torch.tensor([[0.,0.,0.,0.]])
f(x,y)

tensor(2.2500)

The first parameter of a loss is traditionally called the **input** and the second the **target**. The two quantities may be of different dimensions or even different types for some losses.
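One loss where input and target differ in both shape and type is `nn.CrossEntropyLoss`, which is not covered in this chapter and is used here only as an illustration: the input holds floating point scores of size $N \times C$, the target holds integer class indices of size $N$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
criterion = nn.CrossEntropyLoss()

scores = torch.randn(3, 5)        # input: float scores, N x C
labels = torch.tensor([1, 0, 4])  # target: integer class indices, N
loss = criterion(scores, labels)
print(scores.dtype, labels.dtype, loss.dim())
```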

#### Batch Processing

Functions and modules from `nn` process samples by batches. This is motivated by the computational speed-up it induces. 

To evaluate a module on a sample, both the module's parameters and the sample have to be first copied into **cache memory**, which is fast but small.

For any model of reasonable size, only a fraction of its parameters can be kept in cache, so a module's parameters have to be copied there every time it is used.

**Memory transfers are slower than computation. Batch processing cuts this down to one copy of the parameters to the cache per batch.**

It also cuts down the use of Python loops, which are very slow.
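The sketch below checks that one batched call gives the same result as a Python loop over samples. It does not measure timing; the sizes are arbitrary and taken from the `nn.Linear` example above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f = nn.Linear(10, 4)
x = torch.randn(523, 10)

y_batch = f(x)                             # one call on the whole batch
y_loop = torch.stack([f(xi) for xi in x])  # one call per sample (slow)
same = torch.allclose(y_batch, y_loop, atol=1e-5)
print(y_batch.shape, same)
```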

# 4.4 Convolutions

If they were handled as normal "unstructured" vectors, large-dimensional signals such as sound samples or images would require models of intractable size.

For instance, a linear layer taking a `256 x 256` RGB image as input and producing an image of the same size would require

$$
(256 \times 256 \times 3)^{2} \approx 3.87 \times 10^{10}
$$

parameters, with the corresponding memory footprint ($\approx$ 150 GB!) and excess of capacity.
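The arithmetic can be reproduced in a few lines. The memory figure assumes 4 bytes per weight (32-bit floats), and the small convolution used for comparison (3x3 kernels, 3 input and 3 output channels) is an arbitrary choice for illustration.

```python
side, channels = 256, 3
n_values = side * side * channels   # number of values in one image
n_weights = n_values ** 2           # weights of the dense layer, biases ignored
n_gigabytes = n_weights * 4 / 1e9   # assuming 4 bytes per float32 weight

# For comparison: a 3x3 convolution mapping 3 channels to 3 channels.
n_conv = 3 * 3 * 3 * 3 + 3          # kernel weights + biases
print(n_weights, round(n_gigabytes, 1), n_conv)
```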

Moreover, this requirement is inconsistent with the intuition that such large signals have some "invariance to translation". A representation that is meaningful at a certain location can, and should, be used everywhere.

A convolution layer embodies this idea. It applies the same linear transformation locally, everywhere, and preserves the signal structure while lowering its dimensionality.

#### Mathematical definition

**$\oplus$ denotes the convolution operator defined below, and $\cdot$ the dot product.**

Formally, in 1d, given

$$
x = (x_1, \dots, x_W)
$$

and a "convolution kernel" (or "filter") of width $w$

$$
u = (u_1, \dots, u_w)
$$

the convolution $x \oplus u$ is a vector of size $W-w+1$ with

$$
(x \oplus u)_i = \sum_{j=1}^{w} x_{i-1+j} \, u_{j}
$$

$$
= (x_i, \dots, x_{i+w-1}) \cdot u
$$

For instance

$$
(1,2,3,4) \oplus (3,2) = (3+4,\ 6+6,\ 9+8) = (7, 12, 17)
$$

**warning**: this differs from the usual convolution since the kernel and the signal are both visited in increasing index order.
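The definition translates directly into plain Python. The sketch below uses 0-based lists; the function names are made up for this example. The second function shows the signal-processing convention mentioned in the warning, where the kernel is visited in reverse order.

```python
def cross_correlate(x, u):
    """Convolution as defined above: out[i] = sum_j x[i + j] * u[j]."""
    W, w = len(x), len(u)
    return [sum(x[i + j] * u[j] for j in range(w)) for i in range(W - w + 1)]


def true_convolution(x, u):
    """Usual signal-processing convolution: same sum with the kernel reversed."""
    return cross_correlate(x, u[::-1])


print(cross_correlate([1, 2, 3, 4], [3, 2]))   # [7, 12, 17]
print(true_convolution([1, 2, 3, 4], [3, 2]))  # [8, 13, 18]
```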

#### visual representation of convolution

![conv](./images/conv.png)

#### common use cases

It generalizes naturally to a multi-dimensional input, although specification can become complicated.

Its most usual form for "convolutional networks" takes a 3d tensor as input and, for each kernel, outputs a 2d tensor. The kernel is not swept across channels, only across rows and columns.

Kernels visualized:

![kernel](./images/kernel.png)


Note that a convolution preserves the signal support structure

A 1d signal is converted into a 1d signal, a 2d signal into a 2d signal, and neighboring parts of the input signal influence neighboring parts of the output signal

A 3d convolution can be used if the channel index has some metric meaning, such as time for a series of grayscale video frames. Otherwise, sliding the kernel across channels makes no sense.

We usually refer to one of the channels generated by a convolution layer as an **activation map**.

The sub-area of an input map that influences a component of the output is called the **receptive field** of the latter.

In the context of convolutional networks, a standard linear layer is called a **fully connected** layer since every input influences every output.

#### Convolution modules

`F.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)` implements a 2d convolution, where `weight` contains the kernels and is of dimension `D x C x h x w`, `bias` is of dimension `D`, and `input` is of dimension

$$
N \times C \times H \times W
$$

and the result is of dimension

$$
N \times D \times (H-h+1) \times (W-w+1)
$$

In [27]:
weight = torch.empty(5,4,2,3).normal_()
bias = torch.empty(5).normal_()
input1 = torch.empty(117,4,10,13).normal_()
output = F.conv2d(input1, weight, bias)
output.size()

torch.Size([117, 5, 9, 11])

Different kinds of filters and how they affect the result of the convolution:

![conv](./images/filters.png)

`class torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)` wraps the convolution into a `Module`, with the kernels and biases as `Parameter`s properly randomized at creation.

The kernel size is either a pair (`h,w`) or a single value `k` interpreted as (`k,k`).

#### additional parameters

Convolutions have three additional parameters

- The **padding** specifies the size of a zeroed frame added around the input
- The **stride** specifies a step size when moving the kernel across the signal
- The **dilation** modulates the expansion of the filter without adding weights

#### Padding & stride

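The padding is the width of the frame of zeros added around the input; it is not the size of the kernel. In the image, a 3x3 kernel is moving across the padded input.

The stride, on the other hand, is how far the kernel frame shifts at each step (indicated by the dots in the image).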

![padd](./images/padding.png)
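The effect of padding and stride on the output size can be checked numerically. The helper `conv_out_size` below is written for this example (it is not a PyTorch function) and follows the output-shape formula given in the `nn.Conv2d` documentation; the input size is arbitrary.

```python
import torch
import torch.nn as nn


def conv_out_size(n, k, stride=1, padding=0, dilation=1):
    """Output length along one dimension for input length n and kernel size k."""
    return (n + 2 * padding - dilation * (k - 1) - 1) // stride + 1


x = torch.empty(1, 1, 20, 30).normal_()
for stride, padding in [(1, 0), (1, 1), (2, 0), (2, 1)]:
    layer = nn.Conv2d(1, 1, kernel_size=3, stride=stride, padding=padding)
    expected = (conv_out_size(20, 3, stride, padding),
                conv_out_size(30, 3, stride, padding))
    # The sizes predicted by the formula match what the layer produces.
    assert tuple(layer(x).shape[2:]) == expected
    print(stride, padding, expected)
```

With a 3x3 kernel, a padding of 1 and a stride of 1 keep the map size unchanged, while a stride of 2 roughly halves it.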

#### Dilation

The dilation modulates the expansion of the filter support by adding rows and columns of zeros between coefficients.

It is 1 for standard convolutions but can be greater, in which case the resulting operation can be envisioned as a convolution with a regularly sparsified filter.

#### Putting it all together

A convolution with a kernel sized `k` and dilation `d` can be interpreted as a convolution with a filter size `1+(k-1)d` with only `k` non-zero coefficients. 

For example with `k=3` and `d=4` the difference between the input map size and the output map size is `1+(3-1)4-1=8`


In [28]:
x = torch.empty(1,1,20,30).normal_()
l = nn.Conv2d(1,1,kernel_size=3, dilation=4)
l(x).size()

torch.Size([1, 1, 12, 22])

Having a dilation greater than one increases the units' receptive field size without increasing the number of parameters.

**Convolutions with stride or dilation strictly greater than one reduce the activation map size, for instance to make a final classification decision**

# 4.5 Pooling

The historical approach to compute a low-dimensional signal (e.g. a few scores) from a high-dimensional one (e.g. an image) was to use pooling operations.

Such an operation aims at grouping several activations into a single "more meaningful" one.

#### Max Pooling

The most standard type of pooling is max-pooling, which computes max values over **non-overlapping blocks**.

For instance in 1d with a kernel of size 2:

![pool](./images/pool.png)

The average pooling computes average values per block instead of max values.
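In 1d, both operations can be tried with `F.max_pool1d` and `F.avg_pool1d`. The input values below are arbitrary and chosen for this example; the tensor has shape $N \times C \times W = 1 \times 1 \times 6$.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[1., 3., 2., 5., 4., 0.]]])  # shape 1 x 1 x 6

# Blocks of size 2 are (1, 3), (2, 5) and (4, 0).
max_out = F.max_pool1d(x, kernel_size=2)  # values 3, 5, 4
avg_out = F.avg_pool1d(x, kernel_size=2)  # values 2, 3.5, 2
print(max_out, avg_out)
```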

#### Why

Pooling provides invariance to any permutation inside one of the cells.

More practically, it provides a pseudo-invariance to deformations that result in local translations. It cleans up noise and ignores small changes in the signal.
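The permutation invariance can be verified on a toy signal (values chosen for this example): swapping the two values inside each block of size 2 leaves the max-pooled output unchanged.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[1., 3., 2., 5., 4., 0.]]])
# Same signal with the two values inside each block of size 2 swapped.
x_perm = torch.tensor([[[3., 1., 5., 2., 0., 4.]]])

same = torch.equal(F.max_pool1d(x, 2), F.max_pool1d(x_perm, 2))
print(same)
```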

### PyTorch Modules

`F.max_pool2d(input, kernel_size, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False)` takes as input a $N \times C \times H \times W$ tensor and a kernel size $(h,w)$, or $k$ interpreted as $(k,k)$, applies the max-pooling on each channel of each sample separately, and produces, if the padding is $0$, a $N \times C \times \lfloor H/h \rfloor \times \lfloor W/w \rfloor$ output.

In [29]:
x = torch.empty(1,2,2,6).random_(3)
x

tensor([[[[1., 1., 0., 2., 1., 1.],
          [2., 2., 2., 0., 2., 1.]],

         [[1., 1., 0., 0., 2., 1.],
          [0., 2., 0., 2., 1., 0.]]]])

In [30]:
F.max_pool2d(x,(1,2))

tensor([[[[1., 2., 1.],
          [2., 2., 2.]],

         [[1., 0., 2.],
          [2., 2., 1.]]]])

As for convolution, pooling operations can be modulated through their stride and padding.

While for convolution the default stride is `1`, for pooling it is equal to the kernel size, although another value can be specified.

The default padding is zero.

```
class torch.nn.MaxPool2d(kernel_size, stride=None,
                         padding=0, dilation=1,
                         return_indices=False, ceil_mode=False)
```

This wraps the max-pooling operation into a `Module`.

As for convolutions, the kernel size is either a pair `(h,w)` or a single value `k` interpreted as `(k,k)`.

# 4.6 Writing a PyTorch Module

We now have all the bricks needed to build our first convolutional network from scratch. The last technical point is the tensor shape between layers.

Both the convolution and pooling layers take as input batches of samples each one being itself a 3d tensor $C \times H \times W$

The output has the same structure, and tensors have to be explicitly reshaped before being forwarded to a fully connected layer.

#### Example of explicit reshaping of tensors and the amount of parameters/products
![re](./images/reshape.png)
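A minimal sketch of the reshaping step, with sizes chosen for this example: a batch of $64 \times 2 \times 2$ feature maps is flattened into vectors of size 256 before the fully connected layer.

```python
import torch
import torch.nn as nn

features = torch.randn(12, 64, 2, 2)        # N x C x H x W, e.g. after pooling
flat = features.view(features.size(0), -1)  # N x (C*H*W) = 12 x 256
fc = nn.Linear(64 * 2 * 2, 200)
out = fc(flat)
print(flat.shape, out.shape)
```

Keeping the first dimension explicit with `features.size(0)` preserves the batch dimension whatever the feature size is.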

#### Modules

PyTorch offers a sequential container module `torch.nn.Sequential` to build simple architectures

For instance, an MLP with a `10`-dimensional input, a `2`-dimensional output, ReLU activations and two hidden layers of dimensions `100` and `50` can be written as


In [31]:
model = nn.Sequential(
    nn.Linear(10,100),nn.ReLU(),
    nn.Linear(100,50), nn.ReLU(),
    nn.Linear(50,2)
);

However, for any model of reasonable complexity, the best approach is to write a sub-class of `torch.nn.Module`.

#### How to create a Module

To create a `Module`, one has to inherit from the base class and implement the constructor `__init__(self,...)`  and the forward pass `forward(self,x)`.

In [32]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1,32, kernel_size=5)
        self.conv2 = nn.Conv2d(32,64, kernel_size=5)
        self.fc1 = nn.Linear(256,200)
        self.fc2 = nn.Linear(200,10)
    
    def forward(self,x):
        x = F.relu(F.max_pool2d(self.conv1(x),kernel_size =3, stride=3))
        x = F.relu(F.max_pool2d(self.conv2(x),kernel_size =2, stride=2))
        x = x.view(-1,256)
        x = F.relu(self.fc1(x))
        x= self.fc2(x)
        return x

Inheriting from `torch.nn.Module` provides many mechanisms implemented in the superclass

First the `(...)` operator is redefined to call the `forward(...)` method and run additional operations. The forward pass should be executed through this operator, and not by calling `forward` explicitly
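Forward hooks are one example of such additional operations. In the sketch below (layer sizes and names are arbitrary), a hook registered with `register_forward_hook` fires when the module is called through `(...)` but not when `forward` is called directly.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
calls = []
# The hook records the output shape every time it is triggered.
layer.register_forward_hook(
    lambda module, inputs, output: calls.append(tuple(output.shape))
)

x = torch.randn(5, 3)
layer(x)          # goes through __call__: the hook fires
layer.forward(x)  # bypasses __call__: the hook does not fire
print(len(calls), calls)
```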

In [33]:
model = Net()
input1 = torch.empty(12,1,28,28).normal_()
output = model(input1)
print(output.size())

torch.Size([12, 10])


Also, the `Parameter`s added as class attributes, or coming from modules added as class attributes, are seen by `Module.parameters()`.

In [34]:
model = Net()

for k in model.parameters():
    print(k.size())

torch.Size([32, 1, 5, 5])
torch.Size([32])
torch.Size([64, 32, 5, 5])
torch.Size([64])
torch.Size([200, 256])
torch.Size([200])
torch.Size([10, 200])
torch.Size([10])


**warning**: Parameters included in dictionaries and lists are not seen.

A simple solution is to use `torch.nn.ModuleList`, which is a list of modules properly dealt with by PyTorch's machinery.
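The difference can be seen by counting parameters. The two classes below are hypothetical examples written for this comparison; they differ only in how the list of layers is stored.

```python
import torch.nn as nn


class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list: the layers are not registered as sub-modules.
        self.layers = [nn.Linear(4, 4) for _ in range(3)]


class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each layer, so its parameters are visible.
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])


n_plain = len(list(PlainList().parameters()))
n_module_list = len(list(WithModuleList().parameters()))
print(n_plain, n_module_list)
```

Parameters that are not seen by `parameters()` would silently be left out of the optimization.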

As long as you use autograd-compliant operations, the backward pass is implemented automatically.

This is crucial to allow the optimization of the `Parameter`s with gradient descent.
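As a closing sketch, here is what a plain gradient descent loop on a module's parameters could look like. The model, the synthetic data, the step size `eta` and the number of steps are all assumptions made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
criterion = nn.MSELoss()

x = torch.randn(64, 3)                           # synthetic inputs
target = x @ torch.tensor([[1.], [-2.], [0.5]])  # synthetic linear targets
eta = 0.1                                        # step size (arbitrary choice)

losses = []
for step in range(50):
    loss = criterion(model(x), target)
    losses.append(loss.item())
    model.zero_grad()          # gradients accumulate, so reset them first
    loss.backward()            # autograd fills p.grad for every parameter
    with torch.no_grad():
        for p in model.parameters():
            p -= eta * p.grad  # plain gradient descent step
print(losses[0], losses[-1])
```

The update is done inside `torch.no_grad()` because the parameters are leaf tensors that require gradients.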