## Parameters
As each layer has parameters to optimize, we can inherit from ``torch.nn.Module``, which automatically registers the parameters of its sub-layers and lets us access them through ``.parameters()``.

We can also call it on a single layer, e.g. ``ourModel.layer1.parameters()``, to get only that layer's parameters.

In [None]:
import torch

class TinyModel(torch.nn.Module):

    def __init__(self):
        super().__init__()

        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)
        # No dim is given here, so PyTorch warns and infers it; prefer an explicit dim=1
        self.softmax = torch.nn.Softmax()

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x

tinymodel = TinyModel()

print('The model:')
print(tinymodel)

print('\n\nJust one layer:')
print(tinymodel.linear2)

print('\n\nModel params:')
for param in tinymodel.parameters():
    print(param)

print('\n\nLayer params:')
for param in tinymodel.linear2.parameters():
    print(param)


The model:
TinyModel(
  (linear1): Linear(in_features=100, out_features=200, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=200, out_features=10, bias=True)
  (softmax): Softmax(dim=None)
)


Just one layer:
Linear(in_features=200, out_features=10, bias=True)


Model params:
Parameter containing:
tensor([[ 0.0712, -0.0377,  0.0109,  ..., -0.0785,  0.0111, -0.0461],
        [-0.0718,  0.0098, -0.0558,  ..., -0.0135,  0.0270,  0.0469],
        [ 0.0771,  0.0760, -0.0102,  ..., -0.0148,  0.0981,  0.0519],
        ...,
        [-0.0368,  0.0069, -0.0224,  ..., -0.0319, -0.0021, -0.0945],
        [ 0.0460,  0.0065, -0.0102,  ...,  0.0793,  0.0473, -0.0010],
        [-0.0654, -0.0202, -0.0502,  ...,  0.0312,  0.0371,  0.0165]],
       requires_grad=True)
Parameter containing:
tensor([-7.1293e-02, -8.8269e-02, -9.7207e-02, -2.8489e-02,  7.6750e-02,
         1.4393e-03,  8.7211e-02, -1.3204e-02,  5.6359e-02, -7.1892e-02,
        -6.4634e-02, -8.3759e-02,  6.7112e-02, -3.5226
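
As a rough sketch of how these parameter lists can be inspected, the snippet below counts the elements of each named parameter. The ``Sequential`` model is a hypothetical stand-in that mirrors the layer sizes of ``TinyModel``, not the class itself, and the ``counts`` and ``total`` names are only illustrative.

```python
import torch

# Hypothetical stand-in with the same layer sizes as TinyModel (100 -> 200 -> 10).
model = torch.nn.Sequential(
    torch.nn.Linear(100, 200),
    torch.nn.ReLU(),
    torch.nn.Linear(200, 10),
)

# named_parameters() yields (name, tensor) pairs; ReLU has no parameters.
counts = {name: p.numel() for name, p in model.named_parameters()}
for name, n in counts.items():
    print(f"{name}: {n}")

# Each Linear layer holds in_features * out_features weights plus out_features biases.
total = sum(counts.values())
print("total trainable parameters:", total)  # 20000 + 200 + 2000 + 10 = 22210
```

The weight matrices dominate the count, so the first layer alone accounts for most of the parameters.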

## Linear Layers (fully connected layers)
Every input influences every output to some degree. For *m* inputs and *n* outputs the layer holds *m* × *n* weights; PyTorch stores them as an *n* × *m* matrix (``out_features`` × ``in_features``), so the operation is *x* × *w*ᵀ + *b*, where *w* - weights, *b* - biases. Mostly used as classifying layers.

In [None]:
lin = torch.nn.Linear(3, 2)
x = torch.rand(1, 3)
print('Input:')
print(x)

print('\n\nWeight and Bias parameters:')
for enum, param in enumerate(lin.parameters(), 1):
    print(f"{enum}.", param)

y = lin(x)
print('\n\nOutput:')
print(y)

Input:
tensor([[0.9432, 0.2330, 0.9948]])


Weight and Bias parameters:
1. Parameter containing:
tensor([[0.2523, 0.1990, 0.0547],
        [0.1292, 0.5412, 0.2115]], requires_grad=True)
2. Parameter containing:
tensor([ 0.3272, -0.4525], requires_grad=True)


Output:
tensor([[0.6659, 0.0059]], grad_fn=<AddmmBackward0>)
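
To make the operation concrete, here is a minimal sketch that recomputes the layer output by hand. It assumes the default ``Linear`` layout, where the weight has shape ``(out_features, in_features)``; the names ``manual`` and ``matches`` are only illustrative.

```python
import torch

torch.manual_seed(0)
lin = torch.nn.Linear(3, 2)
x = torch.rand(1, 3)

# The weight is stored as (out_features, in_features), i.e. 2 x 3 here.
print(lin.weight.shape)

# Manual affine map: multiply by the transposed weight, then add the bias.
manual = x @ lin.weight.T + lin.bias

# Compare against the layer's own forward pass (small tolerance for float rounding).
matches = torch.allclose(manual, lin(x), atol=1e-6)
print(matches)
```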


## Convolutional Layers

Handle data with a high degree of spatial correlation. Mostly used in vision and NLP, where they detect close groupings of features and compose them into higher-level features. A convolutional filter (strictly speaking it computes a cross-correlation, but the name stuck) slides across the input, multiplies element-wise with the patch it covers, sums the result and writes it to the corresponding position in the output matrix.

Input:
\begin{array}{|c|c|c|c|} \hline
0 & 1 & 1 & 0\\ \hline
0 & 1 & -1 & 0\\ \hline
0 & 1 & 1 & 0\\ \hline
0 & 1 & -1 & 0\\ \hline
\end{array}

Filter:
\begin{array}{|c|c|c|} \hline
0 & 1 & 0  \\ \hline
0 & 1 & 0  \\ \hline
0 & 1 & 0  \\ \hline
\end{array}

Convolution result:
\begin{array}{|c|c|} \hline
3 & 1  \\ \hline
3 & -1  \\ \hline
\end{array}
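
The sliding-window computation can be checked with a short sketch using ``torch.nn.functional.conv2d``, which computes exactly this cross-correlation when given no padding and a stride of 1. The tensors reproduce the 4 × 4 input and 3 × 3 filter of the example; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

inp = torch.tensor([[0., 1.,  1., 0.],
                    [0., 1., -1., 0.],
                    [0., 1.,  1., 0.],
                    [0., 1., -1., 0.]])
kernel = torch.tensor([[0., 1., 0.],
                       [0., 1., 0.],
                       [0., 1., 0.]])

# conv2d expects input as (batch, channels, H, W) and weight as (out_ch, in_ch, kH, kW).
out = F.conv2d(inp.reshape(1, 1, 4, 4), kernel.reshape(1, 1, 3, 3))

# A 3x3 filter fits into a 4x4 input at 2 x 2 positions.
result = out.reshape(2, 2)
print(result)
```

Because only the middle column of the filter is non-zero, each output value is the sum of three vertically adjacent inputs: 3 and 1 in the first row, 3 and -1 in the second.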

In [None]:
import torch.nn.functional as F


class LeNet(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # 1 input image channel (black & white), 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = torch.nn.Conv2d(1, 6, 5)
        self.conv2 = torch.nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        # 16 channels * 6*6 feature map left after both conv/pool stages (for a 32x32 input)
        self.fc1 = torch.nn.Linear(16 * 6 * 6, 120)
        self.fc2 = torch.nn.Linear(120, 84)
        self.fc3 = torch.nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
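
The ``16 * 6 * 6`` input size of the first linear layer follows from how each stage shrinks the image. The sketch below traces the shapes step by step, assuming a single 1-channel 32 × 32 input (the input size is an assumption; the section does not state it). The layers are created standalone and ``shapes`` is just an illustrative list.

```python
import torch
import torch.nn.functional as F

# Assumed input: batch of one 1-channel 32x32 image.
x = torch.rand(1, 1, 32, 32)
conv1 = torch.nn.Conv2d(1, 6, 5)
conv2 = torch.nn.Conv2d(6, 16, 3)

shapes = [tuple(x.shape)]

x = conv1(x)                      # 5x5 kernel, no padding: 32 - 5 + 1 = 28
shapes.append(tuple(x.shape))
x = F.max_pool2d(F.relu(x), 2)    # 2x2 pooling halves each side: 28 -> 14
shapes.append(tuple(x.shape))
x = conv2(x)                      # 3x3 kernel, no padding: 14 - 3 + 1 = 12
shapes.append(tuple(x.shape))
x = F.max_pool2d(F.relu(x), 2)    # 12 -> 6
shapes.append(tuple(x.shape))

# Flatten everything except the batch dimension: 16 * 6 * 6 = 576 features.
flat = torch.flatten(x, start_dim=1)
shapes.append(tuple(flat.shape))

for s in shapes:
    print(s)
```

With a different input size the flattened length changes, and ``fc1`` would need a matching ``in_features``.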

A closer look at max pooling:

In [33]:
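import torch.nn.functional as F

# Random integer-valued tensor; torch.floor keeps everything in PyTorch (no NumPy needed)
test_tensor = torch.floor(torch.randn(3, 3) * 10)
test_tensor = test_tensor.reshape(1, 1, 3, 3)  # (batch, channels, height, width)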
print(f"Test tensor ({test_tensor.shape}): \n {test_tensor}")

test_tensor = F.max_pool2d(input=test_tensor, kernel_size=2, stride=1)
print(f"\n Maxpool2d(kernel=2, stride=1) tensor: \n {test_tensor}")

test_tensor = F.max_pool2d(input=test_tensor, kernel_size=2, stride=2)
print(f"\n Maxpool2d(kernel=2, stride=2) tensor: \n {test_tensor}")

Test tensor (torch.Size([1, 1, 3, 3])): 
 tensor([[[[-11.,   9.,   9.],
          [ 12.,  -6., -12.],
          [  5.,   1., -27.]]]])

 Maxpool2d(kernel=2, stride=1) tensor: 
 tensor([[[[12.,  9.],
          [12.,  1.]]]])

 Maxpool2d(kernel=2, stride=2) tensor: 
 tensor([[[[12.]]]])


## Recurrent Layers
Mostly used for sequential data: time-series, NLP, DNA etc. Recurrent Neural Networks maintain a hidden state that acts as "memory" of previous sequence elements. Two common variants are LSTM (long short-term memory) and GRU (gated recurrent unit).

``vocab_size`` - number of words in the input vocabulary. Each word is a unit vector (one-hot vector) spanning *vocab_size*-dimensional space.

``tagset_size`` - number of tags in the output set.

``embedding_dim`` - size of embedding space for the vocabulary. An embedding maps a vocabulary onto a low-dimensional space, where words with similar meanings are close together in space.

``hidden_dim`` - size of the LSTM memory.

Input is a sentence represented as indices of unit vectors. The embedding maps these indices to *embedding_dim*-dimensional space. The LSTM then iterates over this sequence, producing an output vector of length *hidden_dim* for each word. The final layer acts as a classifier: it maps each output vector to tag space and applies a log softmax, which converts the result into normalized log-probabilities that a given word maps to a given tag.

In [34]:
class LSTMTagger(torch.nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super().__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
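
To see what flows through the tagger, the following sketch runs the same three stages on a toy sentence and checks the shapes. All sizes and the word indices are hypothetical values chosen for illustration, and the layers are built standalone rather than through ``LSTMTagger``.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes, chosen only for illustration.
vocab_size, embedding_dim, hidden_dim, tagset_size = 10, 6, 8, 3

embedding = torch.nn.Embedding(vocab_size, embedding_dim)
lstm = torch.nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

# A "sentence" of four word indices (each must be < vocab_size).
sentence = torch.tensor([1, 4, 2, 7])

embeds = embedding(sentence)                            # (4, embedding_dim)
lstm_out, _ = lstm(embeds.view(len(sentence), 1, -1))   # (4, 1, hidden_dim)
tag_space = hidden2tag(lstm_out.view(len(sentence), -1))
tag_scores = F.log_softmax(tag_space, dim=1)            # (4, tagset_size)

print(tag_scores.shape)

# Exponentiating log-probabilities gives probabilities that sum to 1 per word.
row_sums = tag_scores.exp().sum(dim=1)
print(row_sums)
```

The predicted tag for each word would be ``tag_scores.argmax(dim=1)``; with untrained weights these predictions are arbitrary.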

## Transformers
A story for another time.

## Normalization
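Almost a standard normalization (standardization):

$y = \frac{x - \overline{x}}{\sqrt{\mathrm{Var}(x) + \epsilon}} \cdot \gamma + \beta$

The mean and variance are computed per mini-batch (hence a stochastic element), and $\gamma$ and $\beta$ are learnable weights and biases. Helps to mitigate vanishing and exploding gradients.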

In [36]:
my_tensor = torch.rand(1, 4, 4) * 20 + 5
print(my_tensor)

print(my_tensor.mean())

norm_layer = torch.nn.BatchNorm1d(4)
normed_tensor = norm_layer(my_tensor)
print(normed_tensor)

print(normed_tensor.mean())

tensor([[[ 7.7846, 20.0338, 14.1320, 13.6748],
         [10.7732, 20.7029, 24.6519, 18.2458],
         [ 7.2304, 10.5616,  9.6020, 22.0810],
         [14.8418,  9.8019, 10.2987, 23.6716]]])
tensor(14.8805)
tensor([[[-1.4126,  1.4139,  0.0521, -0.0534],
         [-1.5454,  0.4169,  1.1972, -0.0687],
         [-0.8957, -0.3150, -0.4823,  1.6929],
         [ 0.0338, -0.8719, -0.7826,  1.6206]]],
       grad_fn=<NativeBatchNormBackward0>)
tensor(5.2154e-08, grad_fn=<MeanBackward0>)
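
The formula can be verified against ``BatchNorm1d`` with a small sketch. It assumes the layer is in training mode (the default after construction), where it normalizes each channel with the biased variance of the current batch, and that the learnable weight and bias still hold their initial values of 1 and 0. The names ``manual`` and ``matches`` are illustrative.

```python
import torch

torch.manual_seed(0)
x = torch.rand(1, 4, 4) * 20 + 5          # (batch, channels, length)

norm_layer = torch.nn.BatchNorm1d(4)      # one weight/bias pair per channel

# Per-channel statistics, taken over the batch and length dimensions.
mean = x.mean(dim=(0, 2), keepdim=True)
var = x.var(dim=(0, 2), unbiased=False, keepdim=True)

# Reshape weight (gamma) and bias (beta) so they broadcast over (batch, channels, length).
gamma = norm_layer.weight.view(1, -1, 1)
beta = norm_layer.bias.view(1, -1, 1)

manual = (x - mean) / torch.sqrt(var + norm_layer.eps) * gamma + beta

matches = torch.allclose(manual, norm_layer(x), atol=1e-4)
print(matches)
```

In evaluation mode (``norm_layer.eval()``) the layer uses its running statistics instead of the batch statistics, so this comparison would no longer hold.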


## Dropout Layers
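Randomly zero out part of the input during training to encourage the model to learn sparse representations. Each element is dropped with probability ``p``, and the surviving elements are scaled by 1 / (1 - ``p``), which is why values above 1 appear in the output below.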

In [37]:
my_tensor = torch.rand(1, 4, 4)

dropout = torch.nn.Dropout(p=0.4)
print(dropout(my_tensor))
print(dropout(my_tensor))

tensor([[[0.0000, 0.2525, 1.4495, 0.1898],
         [0.7042, 0.0426, 0.1532, 0.1669],
         [0.0000, 0.3141, 1.2422, 0.8919],
         [1.3825, 1.6480, 0.2737, 1.0767]]])
tensor([[[0.8361, 0.2525, 1.4495, 0.0000],
         [0.0000, 0.0426, 0.0000, 0.1669],
         [1.1787, 0.0000, 1.2422, 0.8919],
         [1.3825, 1.6480, 0.0000, 1.0767]]])
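
A minimal sketch of the two behaviours worth remembering: dropout rescales surviving values in training mode and does nothing in evaluation mode. The all-ones input and the variable names are illustrative choices.

```python
import torch

torch.manual_seed(0)
p = 0.4
x = torch.ones(1000)
dropout = torch.nn.Dropout(p=p)

# Training mode: elements are zeroed at random, survivors are scaled by 1 / (1 - p).
dropout.train()
y_train = dropout(x)
kept = y_train[y_train != 0]
zero_fraction = (y_train == 0).float().mean().item()
print("scale of kept values:", kept[0].item())
print("fraction zeroed:", zero_fraction)

# Evaluation mode: dropout is the identity.
dropout.eval()
y_eval = dropout(x)
print("unchanged in eval mode:", torch.equal(y_eval, x))
```

The rescaling keeps the expected value of each element the same in both modes, so no extra correction is needed at inference time.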


## Activation functions
Introduce non-linearity into deep learning models. Without them, a stack of linear layers collapses into a single linear map.
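
As a small illustration of why the non-linearity matters, the sketch below shows that two stacked ``Linear`` layers without an activation are equivalent to one affine map. The layer sizes and variable names are arbitrary choices for the example.

```python
import torch

torch.manual_seed(0)
lin1 = torch.nn.Linear(4, 8)
lin2 = torch.nn.Linear(8, 3)
x = torch.rand(5, 4)

# Fold the two layers into a single weight matrix and bias vector.
W = lin2.weight @ lin1.weight              # shape (3, 4)
b = lin2.weight @ lin1.bias + lin2.bias    # shape (3,)
collapsed = x @ W.T + b

# Without an activation the stack gives the same result as the folded map.
stacked = lin2(lin1(x))
linear_collapses = torch.allclose(stacked, collapsed, atol=1e-5)
print("two linear layers == one affine map:", linear_collapses)

# With a ReLU in between, the output generally no longer matches the folded map.
with_relu = lin2(torch.relu(lin1(x)))
print("same result with ReLU:", torch.allclose(with_relu, collapsed, atol=1e-5))
```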