<a href="https://colab.research.google.com/github/DrAlexSanz/amld-pytorch-workshop/blob/master/ALEX_4_Modules.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
<img src="https://discuss.pytorch.org/uploads/default/original/2X/3/35226d9fbc661ced1c5d17e374638389178c3176.png" width="400" style="margin: 50px auto; display: block; position: relative; left: -30px;" />
</div>

<!--NAVIGATION-->
# < [Optimization](3-Optimization.ipynb) | Modules | [CNN](5-CNN.ipynb) >

### PyTorch Modules

Modules are a way to build re-usable model components and to manage model parameters.
PyTorch has many built-in modules for common operations like convolutions, recurrent neural networks, max pooling, common activation functions, etc.
You can also build your own modules.

This notebook introduces modules. You will then build a small neural network to perform image classification on MNIST.

### Table of Contents
#### 1. [Modules](#Modules)
#### 2. [Building and Training a Neural Network](#Building-and-Training-a-Neural-Network)

---

# Modules

### Modules manage parameters

They help to
- keep track of the parameters in your model,
- save and load your model,
- reset gradients (with `model.zero_grad()`),
- move all parameters to the GPU (with `model.cuda()`).

The model's parameters are represented by `torch.nn.Parameter` objects.
A `Parameter` is a tensor that has `requires_grad` set to `True` by default and is automatically added to the module's list of parameters when it is assigned as an attribute of a module. If you are interested, you can have a look at the [torch.nn.Parameter documentation](https://pytorch.org/docs/stable/_modules/torch/nn/parameter.html).
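
To make the registration behaviour concrete, here is a minimal sketch. The names `holder`, `weight` and `plain` are made up for illustration, and the bare `torch.nn.Module()` is used only as a container:

```python
import torch

holder = torch.nn.Module()  # a bare module, used here only as a container

# Assigning a Parameter as an attribute registers it with the module
holder.weight = torch.nn.Parameter(torch.randn(3))
# A plain tensor attribute is stored, but not registered as a parameter
holder.plain = torch.randn(3)

param_names = [name for name, _ in holder.named_parameters()]
print(param_names)                  # only the Parameter is listed
print(holder.weight.requires_grad)  # Parameters track gradients by default
print(holder.plain.requires_grad)   # plain tensors do not
```

Only `weight` shows up in `named_parameters()`, which is why learnable tensors should be wrapped in `Parameter` (or live inside a sub-module) rather than stored as plain tensors.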

As an example, a `Conv1d` module has two parameters, `weight` and `bias`:

In [0]:
# Make sure this is running on GPU

import torch
import matplotlib.pyplot as plt

In [0]:
module = torch.nn.Conv1d(5, 2, 3)
print(module.weight) #random initialization
print(module.bias)

Parameter containing:
tensor([[[ 0.0328, -0.0854, -0.1825],
         [-0.1952, -0.1976,  0.1778],
         [-0.1514,  0.0381, -0.1661],
         [ 0.1107,  0.1019, -0.1186],
         [-0.1570,  0.0415, -0.1632]],

        [[ 0.2302,  0.1127, -0.0985],
         [-0.1735,  0.2415,  0.0937],
         [-0.0212,  0.1650,  0.1832],
         [-0.2357,  0.2046,  0.0326],
         [-0.0951, -0.0981, -0.0259]]], requires_grad=True)
Parameter containing:
tensor([-0.0897, -0.1828], requires_grad=True)


---

## Basic usage

Each instance of a model has its own __parameters__.
The parameters are initialized randomly when the model is instantiated.

In [0]:
linear_regression_model = torch.nn.Linear(5, 2)  # 5 input dimensions, 2 output dimensions

print("linear_regression_model.weight = ", linear_regression_model.weight)  # running this cell more than once, you'll see different outputs
print("linear_regression_model.bias = ", linear_regression_model.bias)

linear_regression_model.weight =  Parameter containing:
tensor([[-0.2837,  0.0623, -0.0359, -0.1244, -0.1245],
        [-0.0004,  0.1913,  0.3049,  0.3231,  0.2751]], requires_grad=True)
linear_regression_model.bias =  Parameter containing:
tensor([-0.1215,  0.2536], requires_grad=True)


Models operate on __batches__ of data. If a model is designed to operate on datapoints with 5 features, the shape of the model's inputs will be `(batch, 5)`. This allows us to process multiple datapoints in parallel and increase efficiency.

In [0]:
batch_size = 3
x = torch.randn(batch_size, 5)
y = torch.randn(batch_size, 2)

print("x = {}\ny = {}".format(x, y))

x = tensor([[ 0.9772, -0.7332,  1.0791, -0.0545, -0.4261],
        [-1.1307,  1.5163,  0.4349,  0.4142, -0.8613],
        [-0.7312,  1.0524, -0.5858,  1.1390,  0.2443]])
y = tensor([[-0.0800,  1.3892],
        [-0.9156, -0.4443],
        [-0.6124,  0.1666]])


You can __call__ the model on an input (forward pass). This evaluation uses the current model parameters.

In [0]:
predicted_y = linear_regression_model(x)
print("predicted y = {}".format(predicted_y))

predicted y = tensor([[-4.2333e-01,  3.0720e-01],
        [ 3.3386e-01,  5.7355e-01],
        [ 4.0407e-04,  7.1180e-01]], grad_fn=<AddmmBackward>)


To optimize the model, you compute a __measure of error__ (loss) on the predictions. In this case, we use a squared error:

In [0]:
# Compute a loss
loss = torch.sum((y - predicted_y)**2)
loss

tensor(4.5584, grad_fn=<SumBackward0>)

With the loss computed, we compute __gradients__ of the loss with respect to all model parameters using automatic differentiation.

In [0]:
linear_regression_model.zero_grad()  # clear previous gradient
loss.backward()

print("\nGradients:")
print(linear_regression_model.weight.grad)
print(linear_regression_model.bias.grad)


Gradients:
tensor([[-4.3928,  5.5825, -0.3721,  2.4683, -1.5603],
        [-5.2136,  5.8209, -2.0884,  2.2029, -0.5648]])
tensor([3.0377, 0.9620])


You can now use these gradients to optimize the model parameters.
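
As a rough illustration of what "using the gradients" can mean, the sketch below performs one step of plain gradient descent by hand. It rebuilds a small model and random data so that it runs on its own; the seed and the learning rate of `0.01` are arbitrary assumptions, and in practice you would normally let an optimizer from `torch.optim` do this update.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(5, 2)
x = torch.randn(3, 5)
y = torch.randn(3, 2)
learning_rate = 0.01  # assumed value, chosen only for this sketch

# Forward pass, loss and gradients, as in the cells above
loss_before = torch.sum((y - model(x)) ** 2)
model.zero_grad()
loss_before.backward()

# One gradient-descent step: move each parameter against its gradient.
# no_grad() keeps the update itself out of the autograd graph.
with torch.no_grad():
    for p in model.parameters():
        p -= learning_rate * p.grad

loss_after = torch.sum((y - model(x)) ** 2)
print(loss_before.item(), loss_after.item())
```

With a small enough step size, the loss on the same batch should go down after the update.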

---

## Composing modules with `torch.nn.Sequential`

If the model you want to build is a simple chain of other modules, you can compose them using `torch.nn.Sequential`:

In [0]:
neural_net = torch.nn.Sequential(
    torch.nn.Linear(5, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 2),  # the trailing comma is optional, but it makes appending more layers later easier
)

# Run the model:
neural_net(x)

tensor([[-0.5343, -0.0834],
        [-0.0378,  0.1186],
        [-0.1188,  0.2375]], grad_fn=<AddmmBackward>)

---

## Module parameters

You can inspect your network's parameters using `.parameters()` or `.named_parameters()`:

In [0]:
for param, tensor in neural_net.named_parameters():
    print("{:10s} shape = {}".format(param, tensor.shape))

0.weight   shape = torch.Size([10, 5])
0.bias     shape = torch.Size([10])
2.weight   shape = torch.Size([2, 10])
2.bias     shape = torch.Size([2])
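
A common follow-up is to count how many learnable values a model has. The sketch below rebuilds the same `Sequential` network so that it runs on its own, and sums the number of elements of every parameter tensor:

```python
import torch

neural_net = torch.nn.Sequential(
    torch.nn.Linear(5, 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 2),
)

# numel() returns the number of elements in a tensor
num_params = sum(p.numel() for p in neural_net.parameters())
print(num_params)  # 10*5 + 10 + 2*10 + 2 = 82
```

The `ReLU` layer contributes nothing to the count because it has no parameters.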


---

## Custom modules

A module has to implement two functions:

- the `__init__` function, where you define all the sub-components that have learnable parameters. This makes sure that your module becomes aware of all its parameters. The sub-components (layers) do not need to be defined in order of execution or connected together. Don't forget to initialize the parent class `torch.nn.Module` with `super().__init__()`.


- the `forward` function, which is the method that defines what has to be executed during the forward pass and especially how the layers are connected. This is where you call the layers that you defined inside `__init__`.


In [0]:
# This is the most basic form of a custom module:
class MySuperSimpleModule(torch.nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        
        # Define sub-modules or parameters
        # torch.nn.Module takes care of adding their parameters to your new module
        self.linear = torch.nn.Linear(input_size, num_classes)
    
    def forward(self, x):  # both __init__ and forward are required
        out = self.linear(x)
        return out

You can use the `print` function to list a model's submodules and parameters:

In [0]:
model = MySuperSimpleModule(input_size=20, num_classes=5)
print(model)

MySuperSimpleModule(
  (linear): Linear(in_features=20, out_features=5, bias=True)
)


You can use `model.parameters()` to get the list of parameters of your model automatically inferred by PyTorch.

In [0]:
for name, p in model.named_parameters():  # Here we use a slightly different version of the parameters() function
    print(name, ":\n", p)                 # which also returns the parameter name

linear.weight :
 Parameter containing:
tensor([[-0.0339,  0.0616, -0.0102,  0.1404,  0.2123,  0.1069, -0.0468,  0.0873,
          0.1521,  0.0496,  0.1148, -0.1302,  0.0315, -0.0673, -0.1780, -0.0301,
          0.0703,  0.0244,  0.1917,  0.0235],
        [ 0.0496,  0.1263, -0.1431, -0.1375, -0.0589,  0.1459,  0.0760,  0.0717,
          0.1501,  0.1005,  0.1397,  0.0534, -0.0864,  0.1364, -0.1107, -0.1217,
         -0.0492,  0.0180, -0.1981, -0.2120],
        [ 0.0766, -0.0974, -0.0920, -0.1333,  0.0665,  0.0966, -0.0240,  0.1874,
         -0.0932, -0.0586,  0.1153, -0.1707, -0.0402,  0.0301, -0.0453, -0.0171,
          0.1985,  0.1682,  0.2142,  0.1121],
        [-0.0140,  0.1014, -0.1256, -0.1036, -0.0901,  0.0255, -0.0321,  0.0079,
         -0.1476, -0.1572, -0.0560, -0.0167, -0.1970,  0.0030,  0.2223,  0.0269,
          0.1609,  0.1288, -0.1994,  0.1463],
        [ 0.0913, -0.0192, -0.1541, -0.0443,  0.1649,  0.0152, -0.2054,  0.1121,
          0.1721, -0.0384, -0.0682,  0.0942, -0.

---

# Building and Training a Neural Network

It's time to implement a neural network now. In this section, you will learn to classify handwritten digits from the widely known MNIST dataset.
The dataset consists of 60,000 training images of size 28x28, and another 10,000 images for evaluating the quality of the trained model.

![MNIST](https://github.com/theevann/amld-pytorch-workshop/blob/master/figures/mnist.jpeg?raw=1)

## Loading the dataset

MNIST is a widely used dataset, and it is available in the `torchvision` library.

In [0]:
import torchvision

# MNIST Dataset (Images and Labels)
train_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    transform=torchvision.transforms.ToTensor(),
    download=True
)
test_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=False,
    transform=torchvision.transforms.ToTensor()
)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
Extracting ./data/MNIST/raw/train-images-idx3-ubyte.gz to ./data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./data/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting ./data/MNIST/raw/train-labels-idx1-ubyte.gz to ./data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ./data/MNIST/raw/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw
Processing...
Done!


A dataset in PyTorch behaves _like an array_. It has a length, and you can access its entries with `dataset[i]`.

__Exercise__<br>
Verify how many images there are in the training dataset. How is one training example represented? What is the type and shape of an entry from the dataset?

In [0]:
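# print(len(train_dataset)) would give just the number of images
print(train_dataset)  # the dataset summary already includes the number of datapoints
# train_dataset[0] is an (image, label) pair; printing it shows the raw pixel values
train_dataset[0][0].shape  # grayscale image: (channels=1, height, width)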

Dataset MNIST
    Number of datapoints: 60000
    Root location: ./data
    Split: Train
    StandardTransform
Transform: ToTensor()


torch.Size([1, 28, 28])

When we train a model, we make multiple passes through all the examples in the training set. In each pass, the data points are shuffled and batched together. For this purpose, we use a `DataLoader`. The `DataLoader` supports parallel loading with multiple worker processes to optimize your data-loading pipeline.

In [0]:
# Dataset Loader (Input Batcher)
batch_size = 100
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)  # shuffle, because the dataset may be stored in order, which is bad for optimization
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

We can now iterate through the dataset in shuffled batches. Every time you do this, the order will be different.

In [0]:
for images, labels in train_loader: #Test that all the batches have the correct size
    assert len(images) == batch_size
    assert len(labels) == batch_size
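
The assertions above hold because 60,000 is a multiple of the batch size. As a hedged sketch of what happens otherwise, the example below uses a made-up `TensorDataset` of 250 random "images" (so nothing needs to be downloaded) and records the shape of every batch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 250 random images with random labels
fake_images = torch.rand(250, 1, 28, 28)
fake_labels = torch.randint(0, 10, (250,))
fake_dataset = TensorDataset(fake_images, fake_labels)

loader = DataLoader(fake_dataset, batch_size=100, shuffle=True)

# Each iteration yields one (images, labels) batch
batch_shapes = [tuple(images.shape) for images, labels in loader]
print(batch_shapes)
```

The last batch only holds the 50 leftover examples. If your code relies on a fixed batch size, `DataLoader` accepts `drop_last=True` to skip such an incomplete batch.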

__Exercise__<br>
Search the documentation for how to enable parallel (multi-process) loading in the data loaders.

In [0]:
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, num_workers=2, shuffle=True)  # 2 worker processes load batches in parallel; tune this to your CPU core count
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, num_workers=2, shuffle=False)

## Defining the model

A fully-connected neural network consists of layers that contain a number of values (neurons) computed as linear combinations of the neurons in the layer before. The first layer contains the network's input (your features), and the last layer contains the prediction. In our case, the last layer contains 10 neurons that are trained to be large when an input image is of the corresponding digit (0, 1, 2, 3, 4, 5, 6, 7, 8, 9).

The parameters of this model that are optimized (trained) are the weights that connect the neurons. These are drawn as edges in the illustration below.

![](https://raw.githubusercontent.com/ledell/sldm4-h2o/master/mlp_network.png)

To make sure that neural networks can approximate non-linear functions, each neuron's value is transformed with some non-linear transformation function $\sigma(\cdot)$, often called an ‘activation function’, before being fed as input to the next layer.

To be precise, the neurons $\vec x_{i+1}$ in layer $i+1$ are computed from the neurons $\vec x_i$ in layer $i$ as 

$$ \vec x_{i+1} = \sigma\left(W_{i+1} \vec x_i + \vec b_{i+1} \right) $$

where $W_{i+1}$ encodes the network parameters between each pair of input/output neurons in layer $i+1$, and $\vec b_{i+1}$ contains 'bias terms'. $\sigma$ operates element-wise.

A layer like that can be implemented using `torch.nn.Linear` followed by an activation function such as `torch.nn.ReLU` or `torch.nn.Sigmoid`.
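
As a small check of the formula above, the sketch below compares a `torch.nn.Linear` layer followed by ReLU with the same computation written out by hand. The layer sizes and the seed are arbitrary choices for illustration. Note that `Linear` stores $W$ with shape `(out_features, in_features)` and that inputs come as rows of a batch, so the hand-written version uses `x @ W.T + b`:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 3)  # W has shape (3, 4), b has shape (3,)
x = torch.randn(2, 4)          # a batch of 2 inputs with 4 features each

# Using the modules / functions provided by PyTorch
out_module = torch.relu(layer(x))

# The same layer written out: sigma(W x + b) with sigma = ReLU
out_manual = torch.clamp(x @ layer.weight.T + layer.bias, min=0)

matches = torch.allclose(out_module, out_manual, atol=1e-6)
print(matches)
```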

---

__Exercise__<br>
Implement a multi-layer fully-connected neural network with two hidden layers and the following numbers of neurons in each layer:

- Input-size: *input_size*
- 1st hidden layer: 75
- 2nd hidden layer: 50
- Output layer: *num_classes*

Use `ReLU`s as ‘activation functions’ in between each pair of layers, but not after the last layer.

In [0]:
import torch.nn.functional as F  # provides some helper functions like Relu's, Sigmoids, Tanh, etc.

class MyNeuralNetwork(torch.nn.Module):
    def __init__(self, input_size, num_classes):
        super(MyNeuralNetwork, self).__init__()
        
        self.input_size = input_size
        self.num_classes = num_classes
        
        self.linear_1 = torch.nn.Linear(input_size, 75)
        self.linear_2 = torch.nn.Linear(75, 50)
        self.linear_3 = torch.nn.Linear(50, num_classes)
        
    
    def forward(self, x):
        out = F.relu(self.linear_1(x))
        out = F.relu(self.linear_2(out))
        out = self.linear_3(out)  # no softmax here: CrossEntropyLoss applies log-softmax itself, so the network returns raw scores (logits)
        return out

Now feed an input to your network:

In [0]:
x = torch.rand(16, 28 * 28)  # the first dimension is reserved for the 'batch_size'
model = MyNeuralNetwork(input_size=28 * 28, num_classes=10)
out = model(x)  # this calls model.forward(x)
out[0, :]

tensor([ 0.0885,  0.1667, -0.0921, -0.1656, -0.0109,  0.0769,  0.0530,  0.0832,
         0.1305, -0.1496], grad_fn=<SliceBackward>)

__Exercise__ <br>
What does `out[0, :]` above represent?

---

## Training the model

Most of the functions to train a model follow a similar pattern in PyTorch.
In most cases, it consists of the following steps:
- Loop over data (in batches)
- Create a prediction (forward pass)
- Compute the loss
- Clear previous gradients (!)
- Compute gradients (backward pass)
- Parameter update (using an optimizer)

In [0]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Move all model parameters to the selected device (the GPU if one is available)
model = model.to(device)

# Define the Loss function and Optimizer that you want to use
criterion = torch.nn.CrossEntropyLoss()  
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # NOTE: model.parameters() is to tell Adam which parameters I need to optimize

# Passes over the whole dataset
for epoch in range(5):
    total_loss = 0.0
    
    # Loop over batches in the training set
    for (inputs, labels) in train_loader:
        
        # Move inputs and labels to the same device as the model
        inputs = inputs.to(device)
        labels = labels.to(device)

        # The pixels in our images have a square 28x28 structure, but the network
        # accepts a *vector* of inputs. We therefore reshape it.
        # -1 is a special number that indicates 'whatever is left'
        inputs = inputs.view(-1, 28*28)

        # Do a forward pass, loss computation, compute gradient, and optimize
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # Add up training losses so we can compute an average later
        total_loss += loss.item()
        
    print("Epoch %d, Loss=%.4f" % (epoch+1, total_loss/len(train_loader)))

Epoch 1, Loss=0.4291
Epoch 2, Loss=0.1833
Epoch 3, Loss=0.1337
Epoch 4, Loss=0.1051
Epoch 5, Loss=0.0872


Note:
- We can use the `.to` function on the model directly. Since PyTorch knows all the model parameters, it can move all of them to the correct device.
- We use `model.parameters()` to collect all the parameters of the model and pass them to the optimizer, e.g. `torch.optim.Adam(model.parameters(), lr=1e-3)`, which then updates exactly these parameters.
- To apply the forward function of the module, we write `model(inputs)`. In most cases, `model.forward(inputs)` would also work, but there is a slight difference: PyTorch allows you to register hook functions on a model that are called automatically during a forward pass. Calling `model(inputs)` runs these hooks along with the forward function, while calling `model.forward(inputs)` silently skips them.
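
The difference between the two call styles can be seen in a small sketch. The hook `record_call` and the list `calls` are made up for this illustration; `register_forward_hook` is the `torch.nn.Module` method used to attach the hook:

```python
import torch

model = torch.nn.Linear(3, 1)
calls = []

def record_call(module, inputs, output):
    # A forward hook receives the module, its inputs and its output
    calls.append(tuple(output.shape))

handle = model.register_forward_hook(record_call)

x = torch.randn(4, 3)
model(x)          # goes through __call__, so the hook runs
model.forward(x)  # bypasses __call__, so the hook is skipped

handle.remove()   # detach the hook again
print(calls)      # [(4, 1)]
```

Only the first call is recorded, which is why calling the module directly is the recommended style.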

Do you see the convenience of Modules?

### Loss functions

PyTorch comes with a lot of predefined loss functions:
- `L1Loss`
- `MSELoss`
- `CrossEntropyLoss`
- `NLLLoss`
- `PoissonNLLLoss`
- `KLDivLoss`
- `BCELoss`
- `...`

Check out the [PyTorch Documentation](https://pytorch.org/docs/master/nn.html#loss-functions).
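
Some of these losses are closely related. As a sketch with random scores and arbitrary targets, `CrossEntropyLoss` on raw scores (logits) gives the same value as `NLLLoss` applied to log-softmax outputs, which is why the network above does not end with a softmax layer:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw scores for 4 examples, 10 classes
targets = torch.tensor([3, 0, 9, 1])   # arbitrary class labels

# CrossEntropyLoss expects raw logits
ce = torch.nn.CrossEntropyLoss()(logits, targets)

# NLLLoss expects log-probabilities, so log-softmax is applied first
nll = torch.nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

same = torch.allclose(ce, nll)
print(ce.item(), nll.item(), same)
```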

### Assessing model performance 
This function loops over another `data_loader` (usually containing test/validation data) and computes the model's accuracy on it.

In [0]:
def accuracy(model, data_loader, device):
    with torch.no_grad(): # during model evaluation, we don't need the autograd mechanism (speeds things up)
        correct = 0
        total = 0
        for inputs, labels in data_loader:
            inputs = inputs.to(device)     
            inputs = inputs.view(-1, 28*28)
            
            outputs = model(inputs)
            _, predicted = outputs.max(1)  # max over the 10 class scores; the index of the largest one (argmax) is the predicted digit
            
            correct += (predicted.cpu() == labels).sum().item()
            total += labels.size(0)
            
    acc = correct / total
    return acc

In [0]:
accuracy(model, test_loader, device)  # look at: accuracy(model, train_loader, device)

0.9687

### We get an accuracy of around 97%. Can you improve this?

---

## Storing and loading models

In [0]:
torch.save(model, "my_model.pt")  # pickles the entire model object, not just its parameters

In [0]:
my_model_loaded = torch.load("my_model.pt")

In [0]:
print(model.linear_3.bias)
print(my_model_loaded.linear_3.bias)

Parameter containing:
tensor([-0.1436,  0.1358, -0.1001, -0.1468,  0.2077,  0.1968, -0.0029,  0.0573,
         0.0638, -0.1277], device='cuda:0', requires_grad=True)


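
Saving the whole model object ties the file to the exact class definition, and recent PyTorch releases are stricter about unpickling arbitrary objects in `torch.load`. A commonly recommended alternative is to save only the `state_dict` and load it into a freshly built model. The sketch below assumes a small stand-in network built by a hypothetical helper `build_model` and a hypothetical file name in a temporary directory:

```python
import os
import tempfile
import torch

def build_model():
    # Hypothetical stand-in architecture, used only for this sketch
    return torch.nn.Sequential(
        torch.nn.Linear(4, 3),
        torch.nn.ReLU(),
        torch.nn.Linear(3, 2),
    )

model = build_model()
path = os.path.join(tempfile.mkdtemp(), "my_model_state.pt")

# Save only the parameter tensors (a dict mapping names to tensors)
torch.save(model.state_dict(), path)

# Rebuild the architecture, then load the saved parameters into it
restored = build_model()
restored.load_state_dict(torch.load(path))

same_weights = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same_weights)
```

The trade-off is that the code that defines the architecture must be available when loading, since the file only holds the tensors.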

___

This intro to modules used [this medium post](https://medium.com/deeplearningbrasilia/deep-learning-introduction-to-pytorch-5bd39421c84) as a resource.
