## PyTorch Transformers
- This notebook covers the PyTorch Transformer modules, which live under `torch.nn`:
    - `nn.Transformer` -> A full encoder-decoder transformer model.
    - `nn.TransformerEncoder` -> A stack of N encoder layers.
    - `nn.TransformerDecoder` -> A stack of N decoder layers.
    - `nn.TransformerEncoderLayer` -> A single encoder layer, made up of self-attention and a feedforward network.
    - `nn.TransformerDecoderLayer` -> A single decoder layer, made up of self-attention, multi-head attention over the encoder output, and a feedforward network.
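
The relationship between these modules can be checked by inspecting a small model. This is a minimal sketch: the sizes (`d_model=64`, 2 encoder layers, 3 decoder layers) are chosen only to keep it fast and are not taken from the notebook.

```python
import torch.nn as nn

# Small, illustrative sizes (assumptions, not values used elsewhere in this notebook).
model = nn.Transformer(
    d_model=64,
    nhead=4,
    num_encoder_layers=2,
    num_decoder_layers=3,
    dim_feedforward=128,
)

# The full model wraps one encoder stack and one decoder stack.
print(type(model.encoder).__name__)  # TransformerEncoder
print(type(model.decoder).__name__)  # TransformerDecoder

# Each stack holds N copies of the corresponding layer module.
print(len(model.encoder.layers), type(model.encoder.layers[0]).__name__)
print(len(model.decoder.layers), type(model.decoder.layers[0]).__name__)

# A decoder layer carries both self-attention and attention over the encoder output.
decoder_layer = model.decoder.layers[0]
print(hasattr(decoder_layer, "self_attn"), hasattr(decoder_layer, "multihead_attn"))
```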

- We will also try out various kinds of text problems:
    - `Text Classification`
    - `Text Generation`
    - `Translation`
    - `Next Word Prediction`
    - `Text Summarization`
    - `Question Answering (Q&A)`
    - `Text-to-Speech`
    - `Speech-to-Text`
    - `Text Similarity & Semantic Search`
    - `Sentence Correction (grammar and spelling)`

### Understanding the different modules and parameters

### 1. `nn.Transformer`

In [1]:
import torch
import torch.nn as nn

In [2]:
torch.cuda.is_available()

True
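
The check above is usually turned into a device object so the same code runs with or without a GPU. A minimal sketch, where the `nn.Linear` layer and tensor sizes are placeholders for illustration only:

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no CUDA device is visible.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Modules and tensors must live on the same device before a forward pass.
layer = nn.Linear(4, 2).to(device)   # placeholder module
x = torch.rand(3, 4, device=device)  # placeholder input
y = layer(x)

print(y.device.type == device.type)
```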

| Parameter            | Description                                                                                   | Example |
|----------------------|-----------------------------------------------------------------------------------------------|---------|
| `d_model`            | Feature (embedding) size expected at the model's inputs; must be divisible by `nhead`         | `512`   |
| `nhead`              | Number of attention heads                                                                     | `8`     |
| `num_encoder_layers` | Number of encoder layers                                                                      | `6`     |
| `num_decoder_layers` | Number of decoder layers                                                                      | `6`     |
| `dim_feedforward`    | Hidden size of the feed-forward network inside each layer                                     | `2048`  |
| `dropout`            | Dropout probability (to reduce overfitting)                                                   | `0.1`   |
| `batch_first`        | If `True`, inputs and outputs are shaped (batch, seq, feature) instead of (seq, batch, feature) | `True`  |
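
To see where these parameters end up, one option is to build a model and inspect a layer. The sketch below uses a single encoder and decoder layer (an assumption made only to keep construction quick) with the example values from the table.

```python
import torch.nn as nn

d_model, nhead = 512, 8

# Each head attends over a slice of the embedding, so d_model must split evenly.
assert d_model % nhead == 0
head_dim = d_model // nhead
print(head_dim)  # 64

model = nn.Transformer(
    d_model=d_model,
    nhead=nhead,
    num_encoder_layers=1,  # one layer only, to keep the sketch light
    num_decoder_layers=1,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True,
)

enc_layer = model.encoder.layers[0]
print(enc_layer.linear1)    # feed-forward expansion: d_model -> dim_feedforward
print(enc_layer.linear2)    # feed-forward projection: dim_feedforward -> d_model
print(enc_layer.dropout.p)  # dropout probability passed to the constructor
```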

In [None]:
# d_model must match the feature size of the inputs used below (512)
transformer_model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, batch_first=True)



In [9]:
transformer_model

Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, o
          ... (output truncated)

In [15]:
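# Random stand-ins for embedded sequences, shaped (batch, seq_len, d_model) because batch_first=True
src = torch.rand((1, 10, 512))  # encoder input
tgt = torch.rand((1, 10, 512))  # decoder input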

In [16]:
out = transformer_model(src, tgt)

In [17]:
out.shape

torch.Size([1, 10, 512])
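
The output follows the shape of `tgt`, not `src`, which is easier to see when the two lengths differ. The sketch below also passes a causal mask so each target position can only attend to earlier positions, as is typical when training a decoder. The sizes here are small illustrative assumptions rather than the ones used above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small illustrative model (sizes are assumptions to keep the example fast).
model = nn.Transformer(
    d_model=64,
    nhead=4,
    num_encoder_layers=1,
    num_decoder_layers=1,
    dim_feedforward=128,
    batch_first=True,
)

src = torch.rand(2, 10, 64)  # (batch, src_len, d_model)
tgt = torch.rand(2, 7, 64)   # (batch, tgt_len, d_model)

# Float mask: 0 on and below the diagonal, -inf above it (future positions).
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 7, 64])
```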

### 2. `nn.TransformerEncoder` 