# Layers in Deep Learning

## Dropout

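During training, dropout randomly zeroes elements of the input tensor with probability $p$, using samples from a Bernoulli distribution. Each element is zeroed independently on every forward call, and the surviving elements are scaled by $\frac{1}{1-p}$ so that the expected value of the output matches the input. At evaluation time the layer acts as the identity.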
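To make the mechanics concrete, here is a minimal framework-free sketch of inverted dropout on a flat list of floats. The function name `dropout` and its `rng` parameter are illustrative assumptions, not a library API, and the sketch restricts `p` to `[0, 1)` for simplicity.

```python
import random

def dropout(values, p=0.5, training=True, rng=None):
    """Inverted dropout on a list of floats (illustrative sketch)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # At evaluation time dropout is the identity.
        return list(values)
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)
    # Each element is dropped by an independent Bernoulli(p) draw;
    # survivors are scaled up so the expected value is unchanged.
    return [0.0 if rng.random() < p else v * scale for v in values]

x = [1.0] * 10000
y = dropout(x, p=0.2, rng=random.Random(0))
dropped_fraction = sum(1 for v in y if v == 0.0) / len(y)
mean_after = sum(y) / len(y)
```

With enough elements, `dropped_fraction` should land close to `p` and `mean_after` close to the input mean, which is the point of the rescaling.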

## Batch Normalization

Batch Normalization formula: $y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$

where $\mu$ and $\sigma^2$ are the mean and variance of the input computed over the mini-batch (per feature), $\epsilon$ is a small constant that avoids division by zero, and $\gamma$ and $\beta$ are learnable scale and shift parameters.
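The formula can be traced step by step with a small framework-free sketch. It assumes a batch stored as a list of examples, per-feature statistics, and the biased (divide by the batch size) variance; the name `batch_norm` is illustrative rather than a library function, and running statistics for inference are left out.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    batch: list of examples, each a list of feature values.
    gamma, beta: per-feature scale and shift (learnable in a real layer).
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mu = sum(column) / n
        # Biased variance over the batch dimension.
        var = sum((v - mu) ** 2 for v in column) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = (batch[i][j] - mu) * inv_std * gamma[j] + beta[j]
    return out

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
shifted = batch_norm(batch, gamma=[2.0, 2.0], beta=[3.0, 3.0])
```

With $\gamma = 1$ and $\beta = 0$ each output column has approximately zero mean and unit variance, even though the two input features differ in scale by a factor of ten. With other values, $\beta$ sets the output mean and $\gamma$ the output standard deviation.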

Batch Normalization standardizes the values of hidden units to zero mean and unit variance before the learnable scale $\gamma$ and shift $\beta$ are applied. The BatchNorm layer is usually placed before the ReLU activation, as proposed in the Batch Normalization paper.

Advantages of Batch Normalization:

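1. **Allows larger learning rates**: with a large learning rate, parameter updates can shift the scale of activations from layer to layer, which can lead to vanishing or exploding gradients. Because batch normalization re-standardizes the activations at every layer, larger learning rates can be used with less risk of divergence.
2. **Reduces overfitting**: batch normalization has a mild regularizing effect because $\mu$ and $\sigma^2$ are estimated from each mini-batch, which adds noise to the inputs of every layer. The output for a given training example therefore depends on the other examples in its batch and is no longer a deterministic function of that example alone.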
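The batch dependence behind the regularizing effect can be seen in a tiny sketch. The helper `normalize` and the two batches are made-up illustrations: the same input value is standardized alongside two different sets of batch companions.

```python
import math

def normalize(values, eps=1e-5):
    """Standardize a list of scalars with the list's own mean and variance."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]

example = 2.0
batch_a = [example, 1.0, 3.0, 4.0]
batch_b = [example, 5.0, 6.0, 9.0]

# The same example, normalized inside two different mini-batches.
out_a = normalize(batch_a)[0]
out_b = normalize(batch_b)[0]
```

Here `out_a` comes out near -0.45 and `out_b` near -1.40, so the network sees a slightly different version of the same example depending on how the mini-batches happen to be drawn.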

## MultiheadAttention

Multi-Head Attention consists of several attention layers running in parallel as shown in the figure below. Each attention layer has a different set of learnable parameters. The outputs of the different attention layers are concatenated and then put through a final linear layer.

![mha](images/2022-09-08-11-01-25.png)

Time complexity: $O(N^2 \cdot d)$, where $N$ is the sequence length and $d$ is the representation dimension.
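The structure described above can be sketched without a framework. This is a rough illustration under several assumptions: self-attention (queries, keys and values all come from the same input `x`), randomly initialized matrices standing in for learned parameters, no masking, and hypothetical helper names (`matmul`, `attention`, `multi_head_attention`).

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    # scores is N x N, each entry a d_head-long dot product:
    # this is where the O(N^2 * d) cost comes from.
    scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, num_heads, rng):
    n, d_model = len(x), len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs, head_weights = [], []
    for _ in range(num_heads):
        # Each head has its own query, key and value projections.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        out, weights = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        head_outputs.append(out)
        head_weights.append(weights)
    # Concatenate the heads along the feature axis, then apply the final linear layer.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(n)]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o), head_weights

rng = random.Random(0)
x = random_matrix(4, 8, rng)  # N = 4 tokens, d = 8 features
output, weights = multi_head_attention(x, num_heads=2, rng=rng)
```

The output keeps the input shape (4 x 8), while each head holds its own 4 x 4 attention matrix whose rows sum to 1. Doubling the sequence length quadruples the size of those matrices.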

### Positional Encoding

Positional encoding describes the location or position of an entity in a sequence so that each position is assigned a unique representation. There are several reasons why a single number, such as the index value, is not used to represent an item’s position in transformer models.

For long sequences, the raw indices can grow large in magnitude. Normalizing the index to lie between 0 and 1 avoids that, but it creates problems for variable-length sequences, because the same position would be normalized differently depending on the sequence length.
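A two-line calculation illustrates the variable-length problem. The helper `normalized_position` is hypothetical and simply divides a 0-based index by the last index of the sequence.

```python
def normalized_position(index, seq_len):
    """Map a 0-based index to [0, 1] by dividing by the last index."""
    return index / (seq_len - 1)

# The sixth token (index 5) in a short and in a long sequence.
in_short = normalized_position(5, seq_len=11)
in_long = normalized_position(5, seq_len=101)
```

The same absolute position maps to 0.5 in the 11-token sequence and to 0.05 in the 101-token sequence, so the model cannot read a consistent meaning from the value.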

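$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the position in the sequence, $i$ indexes the sine/cosine pairs ($0 \le i < d/2$), and $d$ is the embedding dimension.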

The implementation of positional encoding is as follows:


```python
import torch

def get_positional_encoding(max_len, d_model):
    """
    Returns positional encoding for a given maximum length and embedding dimension.
    """
    pos_encoding = torch.zeros(max_len, d_model)
    for pos in range(max_len):
        # i walks over the even column indices, so i already equals 2k in the formula.
        for i in range(0, d_model, 2):
            # The sine and cosine of a pair share the same angle.
            angle = torch.tensor(pos / (10000 ** (i / d_model)))
            pos_encoding[pos, i] = torch.sin(angle)
            if i + 1 < d_model:  # guard against an odd d_model
                pos_encoding[pos, i + 1] = torch.cos(angle)
    return pos_encoding
```
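For spot-checking individual values of the sinusoidal formula, a single entry can be computed with the standard library alone. The helper name `pe_entry` is hypothetical; it assumes 0-based positions and columns, with each even column and the odd column after it sharing one frequency.

```python
import math

def pe_entry(pos, col, d_model):
    """Sinusoidal encoding value at position `pos`, column `col`."""
    # Columns 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
    two_k = col - (col % 2)
    angle = pos / (10000 ** (two_k / d_model))
    return math.sin(angle) if col % 2 == 0 else math.cos(angle)

# Position 0 gives sin(0) = 0 in even columns and cos(0) = 1 in odd columns.
first_row = [pe_entry(0, c, 8) for c in range(8)]
second_row = [pe_entry(1, c, 8) for c in range(8)]
```

Every entry lies in $[-1, 1]$ regardless of sequence length, and the first pair of columns at position 1 is simply $\sin(1)$ and $\cos(1)$, which gives known values to compare an implementation against.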