<a href="https://colab.research.google.com/github/Renan-Domingues/IntroductionToPytorch/blob/main/IntroductionPyTorch_04_BuildingModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BUILDING MODELS WITH PYTORCH


### torch.nn.Module and torch.nn.Parameter
We'll be discussing some of the tools PyTorch makes available for building a deep learning network.

One important behavior of torch.nn.Module is registering parameters. If a particular Module subclass has learning weights, these weights are expressed as instances of torch.nn.Parameter. The Parameter class is a subclass of torch.Tensor, with the special behavior that when its instances are assigned as attributes of a Module, they are added to the list of that module's parameters. These parameters may be accessed through the parameters() method on the Module class.

In [None]:
'''As an example, here is a model with 2 linear layers and an activation function.
We will create an instance of it and ask it to report on its parameters.'''

import torch

class TinyModel(torch.nn.Module):
  def __init__(self):
    super(TinyModel, self).__init__()

    self.linear1 = torch.nn.Linear(100, 200)
    self.activation = torch.nn.ReLU()
    self.linear2 = torch.nn.Linear(200, 10)
    self.softmax = torch.nn.Softmax()

  def forward(self, x):
    x = self.linear1(x)
    x = self.activation(x)
    x = self.linear2(x)
    x = self.softmax(x)
    return x

tinymodel = TinyModel()

print('The model')
print(tinymodel)

print('\n\nLayer params:')
for param in tinymodel.linear2.parameters():
  print(param)

The model
TinyModel(
  (linear1): Linear(in_features=100, out_features=200, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=200, out_features=10, bias=True)
  (softmax): Softmax(dim=None)
)


Layer params:
Parameter containing:
tensor([[ 0.0429, -0.0662, -0.0542,  ...,  0.0189, -0.0415,  0.0284],
        [-0.0472,  0.0661, -0.0199,  ..., -0.0425,  0.0031, -0.0400],
        [ 0.0689,  0.0447,  0.0259,  ..., -0.0540,  0.0680,  0.0206],
        ...,
        [-0.0284, -0.0210,  0.0281,  ...,  0.0653, -0.0271,  0.0276],
        [-0.0269,  0.0350,  0.0024,  ...,  0.0540, -0.0534,  0.0679],
        [-0.0351, -0.0388,  0.0703,  ..., -0.0186, -0.0624, -0.0179]],
       requires_grad=True)
Parameter containing:
tensor([-0.0282,  0.0406, -0.0447,  0.0329, -0.0313, -0.0492, -0.0702, -0.0259,
        -0.0412, -0.0439], requires_grad=True)


Above is the fundamental structure of a PyTorch model: there is an `__init__()` method that defines the layers and other components of a model, and a `forward()` method where the computation gets done.

(Note that we can print the model, or any of its submodules, to learn about its structure.)
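The parameter registration described earlier can be sketched with a tiny module. This is only an illustration: `ScaleShift`, `scale`, and `offset` are hypothetical names that are not part of the original notebook. The tensor wrapped in `torch.nn.Parameter` should show up in `named_parameters()`, while the plain tensor attribute should not.

```python
import torch

class ScaleShift(torch.nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # Wrapped in Parameter: registered and returned by parameters()
        self.scale = torch.nn.Parameter(torch.ones(3))
        # Plain tensor attribute: not registered as a parameter
        self.offset = torch.zeros(3)

    def forward(self, x):
        return x * self.scale + self.offset

model = ScaleShift()
param_names = [name for name, _ in model.named_parameters()]
print(param_names)
```

Only the registered parameter is handed to an optimizer when you pass `model.parameters()` to it, which is why learnable weights need the `Parameter` wrapper.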

# Common Layer Types


### Linear Layers

The most basic type of neural network layer is a linear or fully connected layer.
This is a layer where every input influences every output of the layer to a degree specified by the layer's weights. If the layer has "m" inputs and "n" outputs, the weights will be an m * n matrix (PyTorch stores it as a tensor of shape n x m, that is, out_features by in_features).

In [None]:
# for example

lin = torch.nn.Linear(3, 2)
x = torch.rand(1, 3)
print(f'Input {x}')

print('\n\nWeight and Bias parameters:')
for param in lin.parameters():
  print(param)

y = lin(x)
print(f'\n\nOutput: {y}')

Input tensor([[0.5920, 0.2046, 0.8131]])


Weight and Bias parameters:
Parameter containing:
tensor([[ 0.4862,  0.4826,  0.2157],
        [ 0.4033,  0.3856, -0.1862]], requires_grad=True)
Parameter containing:
tensor([-0.2338, -0.3974], requires_grad=True)


Output: tensor([[ 0.3281, -0.2311]], grad_fn=<AddmmBackward0>)


Multiplying "x" by the linear layer's weights and adding the biases gives the output vector "y".

``lin.weight`` reports itself as a Parameter (Parameter is a subclass of Tensor) and lets us know that it is tracking gradients with autograd. This is a default behavior for Parameter that differs from Tensor.
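As a quick check of the two points above, the following sketch recomputes the layer output by hand and compares the gradient-tracking defaults. The seed value is arbitrary and only makes the run repeatable.

```python
import torch

torch.manual_seed(0)  # arbitrary seed so the run is repeatable
lin = torch.nn.Linear(3, 2)
x = torch.rand(1, 3)

# weight has shape (out_features, in_features), so it is transposed here
manual = x @ lin.weight.T + lin.bias
print(lin.weight.shape)
print(torch.allclose(lin(x), manual, atol=1e-6))

# A plain tensor does not track gradients by default; a Parameter does
plain = torch.rand(3)
print(plain.requires_grad, lin.weight.requires_grad)
```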

### Convolutional Layers

They are built to handle data with a high degree of spatial correlation. They are very commonly used in computer vision, where they detect close groupings of features which they compose into higher-level features.
They also appear in NLP (Natural Language Processing) applications, where a word's immediate context can affect the meaning of a sentence.


In [None]:
# We saw convolutional layers early on in a LeNet module

import torch.nn.functional as F  # functional API used below: relu, max_pool2d, log_softmax

class LeNet(torch.nn.Module):
  def __init__(self):
    super(LeNet, self).__init__()

    # 1 input image channel (black & white), 6 output channels, 5x5 square convolution kernel

    self.conv1 = torch.nn.Conv2d(1, 6, 5) # 1st: number of input (color) channels, 2nd: 6 output features to learn, 3rd: kernel size (5x5)
    self.conv2 = torch.nn.Conv2d(6, 16, 3) # 1st: 6 input channels (the 6 features of the first layer), 2nd: 16 output channels, 3rd: 3x3 kernel

    self.fc1 = torch.nn.Linear(16 * 6 * 6, 120) # 16 channels * 6x6 feature map (for a 32x32 input image)
    self.fc2 = torch.nn.Linear(120, 84)
    self.fc3 = torch.nn.Linear(84, 10)

  def forward(self, x):
    x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2)) # max pooling over 2, 2 window
    x = F.max_pool2d(F.relu(self.conv2(x)), 2)
    x = x.view(-1, self.num_flat_features(x))
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x

  def num_flat_features(self, x):
    size = x.size()[1:] # all dimensions except the batch dimension
    num_features = 1
    for s in size:
      num_features *= s
    return num_features

More details are in the breakdown of this model: https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html

More about convolution layers: https://pytorch.org/docs/stable/nn.html#convolution-layers
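To see where the `16 * 6 * 6` input size of the first linear layer comes from, here is a rough shape trace through the same two convolution and pooling stages. It assumes a single 32x32 one-channel image as input; that size is an assumption for illustration and is not stated in the cell above.

```python
import torch
import torch.nn.functional as F

conv1 = torch.nn.Conv2d(1, 6, 5)
conv2 = torch.nn.Conv2d(6, 16, 3)

x = torch.rand(1, 1, 32, 32)       # assumed input: batch of 1, 1 channel, 32x32 pixels
x = conv1(x)                       # 5x5 kernel, no padding: 32 -> 28
print(x.shape)
x = F.max_pool2d(F.relu(x), 2)     # 2x2 pooling: 28 -> 14
x = conv2(x)                       # 3x3 kernel, no padding: 14 -> 12
x = F.max_pool2d(F.relu(x), 2)     # 2x2 pooling: 12 -> 6
print(x.shape)

flat = x.flatten(1)                # keep the batch dimension, flatten the rest
print(flat.shape)
```

The flattened size is 16 channels times a 6x6 feature map, which matches the first argument of `fc1`.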

### Recurrent Layers
Recurrent neural networks (or RNNs) are used for sequential data. An RNN does this by maintaining a hidden state that acts as a sort of memory for what it has seen in the sequence so far.

The internal structure of an RNN layer (or its variants, the LSTM (long short-term memory) and GRU (gated recurrent unit)) is moderately complex, so it's not part of this learning class, but let's see what one looks like.

In [None]:
class LSTMTagger(torch.nn.Module):
  def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
    super(LSTMTagger, self).__init__()
    self.hidden_dim = hidden_dim

    self.word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)

    self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim)

    self.hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

  def forward(self, sentence):
    embeds = self.word_embeddings(sentence)
    lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
    tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
    tag_scores = F.log_softmax(tag_space, dim=1)
    return tag_scores

This is too advanced for the current stage of learning, so we are not going through the details,
but the documentation for Sequence Models and LSTM Networks is here: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html
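Without going into the internals, a small sketch can at least show the shapes an LSTM layer works with and how the hidden state relates to the output. The sizes below are arbitrary illustrative values, not ones taken from the tagger above.

```python
import torch

embedding_dim, hidden_dim, seq_len = 6, 4, 5    # illustrative sizes
lstm = torch.nn.LSTM(embedding_dim, hidden_dim)

# Default layout is (sequence length, batch, features)
inputs = torch.rand(seq_len, 1, embedding_dim)
lstm_out, (h_n, c_n) = lstm(inputs)

print(lstm_out.shape)   # one hidden-state vector per time step
print(h_n.shape)        # final hidden state
# For a single-layer, one-directional LSTM the last output is the final hidden state
print(torch.allclose(lstm_out[-1], h_n[0]))
```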

### Transformers
Transformers are multi-purpose networks that have taken over the state of the art in NLP with models like BERT.


NLP = Natural Language Processing, a common field in AI

PyTorch has a Transformer class that allows you to define the overall parameters of a transformer model - the number of attention heads, the number of encoder & decoder layers, dropout and activation functions, etc.

Documentation for Transformers: https://pytorch.org/docs/stable/nn.html#transformer-layers

And a tutorial: https://pytorch.org/tutorials/beginner/transformer_tutorial.html
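As a minimal sketch of the parameters mentioned above, the following builds a deliberately small `torch.nn.Transformer` and passes random tensors through it. All sizes are arbitrary choices for illustration; a real model would also need embeddings, positional encodings, and masks, which are left out here.

```python
import torch

transformer = torch.nn.Transformer(
    d_model=32,              # size of each token vector
    nhead=4,                 # number of attention heads
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    dropout=0.1,
)

src = torch.rand(10, 1, 32)  # (source length, batch, d_model)
tgt = torch.rand(7, 1, 32)   # (target length, batch, d_model)
out = transformer(src, tgt)
print(out.shape)
```

The output has one `d_model`-sized vector per target position.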

# Other Layers and Functions

### Data manipulation Layers
There are other layer types that perform important functions in models, but don't participate in the learning process themselves.

- Max pooling (and its twin, min pooling) reduces a tensor by combining cells and assigning the maximum value of the input cells to the output cell.

In [None]:
# for example
my_tensor = torch.rand(1, 6, 6)
print(my_tensor)

maxpool_layer = torch.nn.MaxPool2d(3)
print(maxpool_layer(my_tensor))

print(f'My_tensor: {my_tensor.shape} | Maxpool: {maxpool_layer(my_tensor).shape}')

tensor([[[0.9032, 0.9015, 0.3176, 0.0506, 0.4068, 0.3693],
         [0.8099, 0.7067, 0.5140, 0.2802, 0.4934, 0.7527],
         [0.1873, 0.1097, 0.0329, 0.4705, 0.7273, 0.7777],
         [0.2521, 0.2762, 0.2899, 0.4091, 0.1444, 0.9342],
         [0.0169, 0.6084, 0.0200, 0.3301, 0.5943, 0.0153],
         [0.1782, 0.2835, 0.1359, 0.5681, 0.1155, 0.8131]]])
tensor([[[0.9032, 0.7777],
         [0.6084, 0.9342]]])
My_tensor: torch.Size([1, 6, 6]) | Maxpool: torch.Size([1, 2, 2])


Each of the values in the maxpooled output is the maximum value of each quadrant of the 6x6 input.
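The same behavior is easier to verify with a deterministic input. This sketch uses the values 0 to 35 instead of random numbers and recomputes the pooling by hand with a reshape; the reshape approach is just one way to check the result, not how the layer is implemented.

```python
import torch

grid = torch.arange(36, dtype=torch.float32).reshape(1, 6, 6)
pooled = torch.nn.MaxPool2d(3)(grid)
print(pooled)

# Same result by hand: split into a 2x2 grid of 3x3 blocks, take each block's max
manual = grid.reshape(1, 2, 3, 2, 3).amax(dim=(2, 4))
print(torch.equal(pooled, manual))
```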

- Normalization layers re-center and normalize the output of one layer before feeding it to another. Centering and scaling the intermediate tensors has a number of beneficial effects, such as letting you use higher learning rates without exploding/vanishing gradients.

In [None]:
my_tensor = torch.rand(1, 4, 4) * 20 + 5
print(my_tensor)

print(my_tensor.mean())

norm_layer = torch.nn.BatchNorm1d(4)
normed_tensor = norm_layer(my_tensor)
print(normed_tensor)

print(normed_tensor.mean())

tensor([[[ 7.9611,  8.7461, 17.5208,  9.4717],
         [14.7659,  7.1994,  7.6848, 10.2895],
         [ 6.7386, 17.0155, 16.3440, 13.5909],
         [19.6604, 15.5439, 17.4119, 20.3279]]])
tensor(13.1420)
tensor([[[-0.7707, -0.5666,  1.7153, -0.3779],
         [ 1.5937, -0.9285, -0.7667,  0.1015],
         [-1.6436,  0.8836,  0.7185,  0.0415],
         [ 0.7525, -1.4223, -0.4354,  1.1051]]],
       grad_fn=<NativeBatchNormBackward0>)
tensor(6.7055e-08, grad_fn=<MeanBackward0>)


The normalized tensor has smaller values, grouped around zero (its mean is effectively 0). This is beneficial because many activation functions have their strongest gradients near 0, but sometimes suffer from vanishing or exploding gradients for inputs that drive them far away from zero.
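What the layer computes in training mode can be sketched by hand: for each channel, subtract the mean and divide by the standard deviation taken over the batch and length dimensions. The sketch assumes a freshly created layer, whose learnable scale and shift are still 1 and 0; the seed is arbitrary.

```python
import torch

torch.manual_seed(0)                   # arbitrary seed so the run is repeatable
x = torch.rand(1, 4, 4) * 20 + 5       # (batch, channels, length)
norm_layer = torch.nn.BatchNorm1d(4)   # training mode by default

# Per-channel statistics over the batch and length dimensions
mean = x.mean(dim=(0, 2), keepdim=True)
var = x.var(dim=(0, 2), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + norm_layer.eps)

normed = norm_layer(x)
print(torch.allclose(normed, manual, atol=1e-5))
print(normed.mean(dim=(0, 2)))         # each channel is centered near 0
```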

- Dropout layers are a tool for encouraging sparse representations in your model, that is, pushing it to do inference with less data.

Dropout layers work by randomly setting parts of the input tensor to zero during training; dropout layers are always turned off for inference. This forces the model to learn against this masked or reduced dataset. For example:

In [None]:
my_tensor = torch.rand(1, 4, 4)

dropout = torch.nn.Dropout(p=0.4)
print(dropout(my_tensor))
print(dropout(my_tensor))

tensor([[[0.0000, 0.2500, 0.9918, 1.6259],
         [1.0676, 0.4539, 0.2993, 0.0000],
         [0.0000, 1.4900, 1.2566, 1.2189],
         [1.2045, 0.0000, 0.3699, 1.1011]]])
tensor([[[0.4019, 0.2500, 0.9918, 0.0000],
         [1.0676, 0.0000, 0.2993, 0.0000],
         [0.2511, 1.4900, 0.0000, 1.2189],
         [0.0000, 1.4670, 0.0000, 0.0000]]])
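Two details of the output above are worth checking: the surviving values are scaled up by 1 / (1 - p), and the layer does nothing in evaluation mode. The sketch below uses a tensor of ones so the scaling is easy to read; that input is chosen purely for illustration.

```python
import torch

p = 0.4
dropout = torch.nn.Dropout(p=p)
ones = torch.ones(1, 4, 4)

dropout.train()                       # training mode: elements are dropped at random
train_out = dropout(ones)
kept = train_out[train_out != 0]      # surviving elements, scaled by 1 / (1 - p)
print(train_out)

dropout.eval()                        # evaluation mode: dropout is a no-op
eval_out = dropout(ones)
print(torch.equal(eval_out, ones))
```

The scaling keeps the expected sum of the activations roughly the same between training and inference.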


#  Activation Functions

Activation functions make deep learning possible. A neural network is a program with many parameters that simulates a mathematical function.

Inserting non-linear activation functions between layers is what allows a deep learning model to simulate any function.
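One way to see why the non-linearity matters is to stack two linear layers with nothing in between: the result can always be rewritten as a single linear map. The sketch below checks this numerically; the layer sizes and seed are arbitrary.

```python
import torch

torch.manual_seed(0)   # arbitrary seed so the run is repeatable
lin1 = torch.nn.Linear(4, 8)
lin2 = torch.nn.Linear(8, 3)
x = torch.rand(5, 4)

# Without an activation, the stack equals one linear map with merged weights
merged_w = lin2.weight @ lin1.weight             # shape (3, 4)
merged_b = lin2.weight @ lin1.bias + lin2.bias   # shape (3,)
stacked = lin2(lin1(x))
collapsed = x @ merged_w.T + merged_b
print(torch.allclose(stacked, collapsed, atol=1e-5))

# With a ReLU in between, the merged linear map generally no longer matches
with_relu = lin2(torch.relu(lin1(x)))
print(torch.allclose(with_relu, collapsed, atol=1e-5))
```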

torch.nn provides modules encapsulating all of the major activation functions, including ReLU and its many variants, Tanh, Hardtanh, Sigmoid, and more. It also includes other functions, such as Softmax, that are most useful at the output stage of a model.
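A short sketch of a few of these activation modules applied to the same small tensor; the input values are chosen only for illustration.

```python
import torch

x = torch.tensor([[-2.0, 0.0, 3.0]])

relu_out = torch.nn.ReLU()(x)         # negatives become 0
tanh_out = torch.nn.Tanh()(x)         # squashed into (-1, 1)
sigmoid_out = torch.nn.Sigmoid()(x)   # squashed into (0, 1)
probs = torch.nn.Softmax(dim=1)(x)    # each row sums to 1

print(relu_out)
print(tanh_out)
print(sigmoid_out)
print(probs, probs.sum())
```

Passing `dim` explicitly to Softmax states which dimension should sum to 1.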

# Loss Function
The loss function tells us how far a model's prediction is from the correct answer.
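As a rough illustration, here are two loss modules from torch.nn applied to hand-written values: mean squared error for a regression-style prediction and cross entropy for a classification-style one. The numbers are made up for the example.

```python
import torch

# Regression: mean squared error between prediction and target
pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.0, 2.0, 5.0])
mse = torch.nn.MSELoss()(pred, target)   # (0 + 0 + 4) / 3
print(mse)

# Classification: cross entropy takes raw scores (logits) and class indices
loss_fn = torch.nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.0]])      # the model favors class 0
ce_right = loss_fn(logits, torch.tensor([0]))
ce_wrong = loss_fn(logits, torch.tensor([1]))
print(ce_right, ce_wrong)
```

The loss is smaller when the correct class is the one the model scores highest, which is the signal training uses to adjust the parameters.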