## PyTorch Transformers
- In this notebook we will cover the PyTorch Transformer layers, which live under `torch.nn`:
    - `nn.Transformer` -> A full encoder-decoder transformer model.
    - `nn.TransformerEncoder` -> A stack of N encoder layers.
    - `nn.TransformerDecoder` -> A stack of N decoder layers.
    - `nn.TransformerEncoderLayer` -> A single encoder layer, made up of self-attention and a feedforward network.
    - `nn.TransformerDecoderLayer` -> A single decoder layer, made up of self-attention, multi-head attention over the encoder output, and a feedforward network.

- We will also try out various kinds of text problems:
    - `Text Classification`
    - `Text Generation`
    - `Translation`
    - `Next Word Prediction`
    - `Text Summarization`
    - `Question Answering (Q&A)`
    - `Text-to-Speech`
    - `Speech-to-Text`
    - `Text Similarity & Semantic Search`
    - `Sentence Correction (grammar and spelling)`









### Understanding the different functions and parameters

### 1. `nn.Transformer`

In [2]:
import torch
import torch.nn as nn

In [3]:
torch.cuda.is_available()

True

| Parameter           | Description                                      | Example |
|---------------------|--------------------------------------------------|---------|
| `d_model`          | Embedding dimension of the model                 | `512`   |
| `nhead`            | Number of attention heads                        | `8`     |
| `num_encoder_layers` | Number of encoder layers                      | `6`     |
| `num_decoder_layers` | Number of decoder layers                      | `6`     |
| `dim_feedforward`  | Hidden layer size in feed-forward networks       | `2048`  |
| `dropout`          | Dropout probability (to prevent overfitting)     | `0.1`   |
| `batch_first`      | If `True`, input is in (batch, seq, feature) format | `True`  |

In [None]:
# d_model must match the last dimension of the inputs used below (512).
transformer_model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, batch_first=True)



In [9]:
transformer_model

Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, o

In [15]:
src = torch.rand((1, 10, 512))  # (batch, src_len, d_model) because batch_first=True
tgt = torch.rand((1, 10, 512))  # (batch, tgt_len, d_model)

In [16]:
out = transformer_model(src, tgt)

In [17]:
out.shape

torch.Size([1, 10, 512])
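
The source and target sequences do not have to share a length; the output takes the shape of the target. Below is a minimal sketch of that, assuming a much smaller model than the one above (the sizes and the name `small_model` are illustrative, not from the notebook) so it runs quickly on CPU:

```python
import torch
import torch.nn as nn

# Illustrative sizes only: a small model keeps the example fast on CPU.
small_model = nn.Transformer(
    d_model=64, nhead=4,
    num_encoder_layers=2, num_decoder_layers=2,
    dim_feedforward=128, batch_first=True,
)

src = torch.rand(2, 7, 64)  # (batch, src_len, d_model)
tgt = torch.rand(2, 5, 64)  # (batch, tgt_len, d_model)

out = small_model(src, tgt)
print(out.shape)  # torch.Size([2, 5, 64]): same shape as tgt
```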

### 2. `nn.TransformerEncoder`

TransformerEncoder is a stack of N encoder layers.

**Parameters**
- **`encoder_layer` (TransformerEncoderLayer)** – An instance of the `TransformerEncoderLayer()` class (**required**).
- **`num_layers` (int)** – The number of sub-encoder layers in the encoder (**required**).
- **`norm` (Optional[Module])** – The layer normalization component (optional).
- **`enable_nested_tensor` (bool, default=`True`)** – If `True`, the input is automatically converted to a nested tensor (and converted back on output). This improves the overall performance of `TransformerEncoder` when the padding rate is high.
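
To make these parameters concrete, here is a small sketch that passes each of them explicitly. The sizes are illustrative assumptions, and `enable_nested_tensor` is switched off only because this toy batch has no padding:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128)
encoder = nn.TransformerEncoder(
    layer,
    num_layers=3,
    norm=nn.LayerNorm(64),       # optional final normalization
    enable_nested_tensor=False,  # no padded batches in this toy example
)

print(len(encoder.layers))  # 3

# Each layer is a separate copy of `layer`, so weights are not shared.
w0 = encoder.layers[0].linear1.weight
w1 = encoder.layers[1].linear1.weight
shares_storage = w0.data_ptr() == w1.data_ptr()
print(shares_storage)  # False

src = torch.rand(10, 2, 64)  # (seq_len, batch, d_model); batch_first defaults to False
out = encoder(src)
print(out.shape)  # torch.Size([10, 2, 64])
```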

### 3. `nn.TransformerEncoderLayer`

TransformerEncoderLayer is made up of self-attn and feedforward network.

**Parameters**

- **`d_model` (int)** – The number of expected features in the input (**required**).
- **`nhead` (int)** – The number of heads in the multi-head attention models (**required**).
- **`dim_feedforward` (int, default=`2048`)** – The dimension of the feedforward network model.
- **`dropout` (float, default=`0.1`)** – The dropout value.
- **`activation` (Union[str, Callable[[Tensor], Tensor]], default=`"relu"`)** –  
  The activation function of the intermediate layer. Can be a string (`"relu"` or `"gelu"`) or a callable function.
- **`layer_norm_eps` (float, default=`1e-5`)** – The epsilon value used in layer normalization.
- **`batch_first` (bool, default=`False`)** –  
  If `True`, the input and output tensors are provided as `(batch, seq, feature)`.  
  If `False`, the format is `(seq, batch, feature)`.
- **`norm_first` (bool, default=`False`)** –  
  If `True`, layer normalization is applied **before** attention and feed-forward layers.  
  If `False`, it is applied **after**.
- **`bias` (bool, default=`True`)** –  
  If `False`, Linear and LayerNorm layers will not learn an additive bias.
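
A short sketch of a single layer using some of the non-default options above (`activation="gelu"`, `batch_first=True`, `norm_first=True`). The sizes are illustrative assumptions. It also shows that `d_model` has to be divisible by `nhead`, since the features are split evenly across the heads:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=64, nhead=4, dim_feedforward=128,
    activation="gelu", batch_first=True, norm_first=True,
)

x = torch.rand(2, 10, 64)  # (batch, seq_len, d_model) because batch_first=True
y = layer(x)
print(y.shape)  # torch.Size([2, 10, 64])

# d_model is split evenly across heads (64 // 4 = 16 features per head),
# so a head count that does not divide d_model is rejected at construction.
try:
    nn.TransformerEncoderLayer(d_model=64, nhead=5)
    divisible_check_raised = False
except AssertionError:
    divisible_check_raised = True
print(divisible_check_raised)  # True
```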

In [4]:
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)



In [5]:
transformer_encoder

TransformerEncoder(
  (layers): ModuleList(
    (0-5): 6 x TransformerEncoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
      )
      (linear1): Linear(in_features=512, out_features=2048, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (linear2): Linear(in_features=2048, out_features=512, bias=True)
      (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
  )
)

In [6]:
src = torch.rand(10, 32, 512)  # (seq_len, batch, d_model); batch_first defaults to False
out = transformer_encoder(src)

In [7]:
out.shape

torch.Size([10, 32, 512])

### 4. `nn.TransformerDecoder`

TransformerDecoder is a stack of N decoder layers.

| Parameter        | Type                      | Required | Description |
|-----------------|--------------------------|----------|-------------|
| `decoder_layer` | `TransformerDecoderLayer` | ✅ Yes   | An instance of the `TransformerDecoderLayer()` class. |
| `num_layers`   | `int`                     | ✅ Yes   | The number of sub-decoder layers in the decoder. |
| `norm`         | `Optional[Module]`        | ❌ No    | The layer normalization component. |


### 5. `nn.TransformerDecoderLayer`

TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.

| Parameter        | Type                                      | Default  | Description |
|-----------------|-----------------------------------------|----------|-------------|
| `d_model`       | `int`                                   | **Required** | The number of expected features in the input. |
| `nhead`         | `int`                                   | **Required** | The number of heads in the multi-head attention models. |
| `dim_feedforward` | `int`                                 | `2048`   | The dimension of the feedforward network model. |
| `dropout`       | `float`                                 | `0.1`    | The dropout value. |
| `activation`    | `Union[str, Callable[[Tensor], Tensor]]` | `"relu"` | The activation function of the intermediate layer, can be `"relu"`, `"gelu"`, or a callable function. |
| `layer_norm_eps` | `float`                                | `1e-5`   | The epsilon value in layer normalization components. |
| `batch_first`   | `bool`                                  | `False`  | If `True`, input and output tensors are in `(batch, seq, feature)` format; otherwise, `(seq, batch, feature)`. |
| `norm_first`    | `bool`                                  | `False`  | If `True`, layer normalization is applied before self-attention, multi-head attention, and feedforward operations; otherwise, it's applied after. |
| `bias`         | `bool`                                   | `True`   | If `False`, Linear and LayerNorm layers will not learn an additive bias. |


In [8]:
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)

In [9]:
transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)



In [11]:
memory = torch.rand(10, 32, 512)  # encoder output: (src_len, batch, d_model)
tgt = torch.rand(20, 32, 512)     # decoder input:  (tgt_len, batch, d_model)

In [12]:
out = transformer_decoder(tgt, memory)

In [13]:
out.shape

torch.Size([20, 32, 512])
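
For generation-style tasks the decoder is usually given a causal `tgt_mask`, so that each position only attends to earlier ones. The call above does not use one. The following is a hedged sketch with illustrative sizes and a mask built by hand with `torch.triu`. It checks that changing later target positions leaves the earlier outputs untouched:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, dim_feedforward=128)
decoder = nn.TransformerDecoder(layer, num_layers=2)
decoder.eval()  # turn dropout off so the two calls below are comparable

memory = torch.rand(7, 2, 64)  # (src_len, batch, d_model)
tgt = torch.rand(5, 2, 64)     # (tgt_len, batch, d_model)

# Causal mask: -inf above the diagonal blocks attention to later positions.
T = tgt.size(0)
tgt_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

out = decoder(tgt, memory, tgt_mask=tgt_mask).detach()

# Change only the last two target positions and run again.
tgt_changed = tgt.clone()
tgt_changed[3:] = torch.rand(2, 2, 64)
out_changed = decoder(tgt_changed, memory, tgt_mask=tgt_mask).detach()

# The first three positions never attended to the changed ones.
prefix_unchanged = torch.allclose(out[:3], out_changed[:3], atol=1e-5)
print(prefix_unchanged)  # True
```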

In the next lessons we will implement some of these problems.