## Speeding up Transformer training:

### Implementations:

1. English to French dataset

2. Automatic Mixed Precision

3. Dynamic padding (using `pad_sequence` and no manual pad count calculations)

4. One Cycle Policy

5. Parameter sharing

6. Scaled Dot Product Attention (SDP Kernel context, Flash option unsupported on T4 Colab)

7. Added code for Gradient Accumulation

8. Converted the `encoder_input` and `decoder_input` tensors in `__getitem__` from `int64` to `int32` to save some memory (converting `label` to `int32` raised an error; the loss function needs a closer look)

9. `optimizer.zero_grad(set_to_none=True)`

10. `.to(device, non_blocking=True)` (where applicable)

11. Was able to train with `batch_size = 64` (`batch_size = 72` with `d_model = 256`)
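
Item 3 (dynamic padding) together with the dtype choice from item 8 can be sketched as a collate function. This is a minimal illustration, not the code in `dataset.py`: the function name `collate_batch`, the padding id `PAD_ID = 0`, and the toy samples are assumptions; only the field names `encoder_input`, `decoder_input`, and `label` come from the list above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed padding token id; the real tokenizer may use another value


def collate_batch(samples):
    """Pad every field only up to the longest sequence in this batch."""
    def pad(key):
        return pad_sequence(
            [s[key] for s in samples], batch_first=True, padding_value=PAD_ID
        )

    return {key: pad(key) for key in ("encoder_input", "decoder_input", "label")}


# Toy samples: inputs stored as int32, labels left as int64.
samples = [
    {
        "encoder_input": torch.tensor([5, 6, 7], dtype=torch.int32),
        "decoder_input": torch.tensor([1, 8], dtype=torch.int32),
        "label": torch.tensor([8, 2]),
    },
    {
        "encoder_input": torch.tensor([5, 9, 4, 3, 7], dtype=torch.int32),
        "decoder_input": torch.tensor([1, 8, 6, 4], dtype=torch.int32),
        "label": torch.tensor([8, 6, 4, 2]),
    },
]

batch = collate_batch(samples)
print({k: (tuple(v.shape), v.dtype) for k, v in batch.items()})
```

A function like this would be passed to `DataLoader(..., collate_fn=collate_batch)`, so each batch is only as wide as its longest sentence instead of the global `seq_len`.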


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
%cd /content/drive/My Drive/

/content/drive/My Drive


In [None]:
rm -rf ERA1

In [None]:
!git clone https://github.com/MANU-CHAUHAN/ERA1

Cloning into 'ERA1'...
remote: Enumerating objects: 557, done.
remote: Counting objects: 100% (95/95), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 557 (delta 40), reused 78 (delta 25), pack-reused 462
Receiving objects: 100% (557/557), 14.95 MiB | 13.99 MiB/s, done.
Resolving deltas: 100% (240/240), done.


In [None]:
cd ERA1

/content/drive/My Drive/ERA1


In [None]:
cd s16/

/content/drive/My Drive/ERA1/s16


In [None]:
!pip install -q -r requirements.txt


In [None]:
ls

config.py   model.py      READMEs16.md      runs/                      tokenizer_en.json  train.py
dataset.py  __pycache__/  requirements.txt  S16_optimized_final.ipynb  tokenizer_fr.json  weights/


In [None]:
from config import get_config
import torch

cfg = get_config()

cfg

{'batch_size': 32,
 'num_epochs': 10,
 'lr': 0.001,
 'max_lr': 0.01,
 'pct_start': 0.1,
 'initial_div_factor': 10,
 'final_div_factor': 10,
 'anneal_strategy': 'linear',
 'three_phase': True,
 'seq_len': 500,
 'd_model': 512,
 'lang_src': 'en',
 'lang_tgt': 'fr',
 'model_folder': 'weights',
 'model_basename': 'tmodel_',
 'preload': False,
 'tokenizer_file': 'tokenizer_{0}.json',
 'experiment_name': 'runs/tmodel',
 'enable_amp': True,
 'd_ff': 512,
 'N': 6,
 'h': 8,
 'param_sharing': True,
 'gradient_accumulation': False,
 'accumulation_steps': 4}

In [None]:
import torch
import gc
torch.cuda.empty_cache()
gc.collect()

0

SDP Kernel: Math = True, Flash = False, Mem_Efficiency = True
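
The backend selection noted above can be reproduced with PyTorch's global SDP switches. The notebook mentions the SDP kernel context manager; this sketch swaps in the equivalent global toggles from `torch.backends.cuda` instead, and it assumes PyTorch 2.0 or newer. The tensor shapes and the helper `manual_attention` are made up for illustration.

```python
import math

import torch
import torch.nn.functional as F

# Global switches for the fused attention backends (Flash off, as on the T4).
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)

print(
    "Math =", torch.backends.cuda.math_sdp_enabled(),
    "Flash =", torch.backends.cuda.flash_sdp_enabled(),
    "Mem_Efficiency =", torch.backends.cuda.mem_efficient_sdp_enabled(),
)


def manual_attention(q, k, v):
    """Reference implementation: softmax(Q K^T / sqrt(d_k)) V."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
# (batch, heads, seq_len, head_dim): illustrative sizes only
q = torch.randn(2, 8, 16, 32)
k = torch.randn_like(q)
v = torch.randn_like(q)

fused = F.scaled_dot_product_attention(q, k, v)
reference = manual_attention(q, k, v)
print("max abs difference:", (fused - reference).abs().max().item())
```

The fused call and the manual version should agree up to floating-point error; the switches only change which kernel computes the result.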

### with d_model = 256

#### cfg['gradient_accumulation'] = True

#### cfg['gradient_accumulation_steps'] = 40
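
How `train.py` implements accumulation is not shown in this notebook. The following is a minimal sketch of the general pattern, combined with AMP, `zero_grad(set_to_none=True)`, and `non_blocking` transfers from the list at the top. The function `train_epoch`, the toy `nn.Linear` model, and the random batches are stand-ins, and a small `accumulation_steps` is used so the behaviour is easy to follow.

```python
import torch
from torch import nn


def train_epoch(model, batches, optimizer, loss_fn, accumulation_steps=4):
    """Step the optimizer once every `accumulation_steps` mini-batches.

    `batches` is a list of (input, target) pairs here; returns the number
    of optimizer steps that were taken.
    """
    device = next(model.parameters()).device
    use_amp = device.type == "cuda"  # mixed precision only on GPU in this sketch
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    optimizer_steps = 0

    optimizer.zero_grad(set_to_none=True)
    for i, (x, y) in enumerate(batches, start=1):
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)

        with torch.autocast(device_type=device.type, enabled=use_amp):
            # Divide so the accumulated gradient is an average, not a sum.
            loss = loss_fn(model(x), y) / accumulation_steps
        scaler.scale(loss).backward()  # gradients add up across iterations

        # Step on every full group, and once more for a trailing partial group.
        if i % accumulation_steps == 0 or i == len(batches):
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            optimizer_steps += 1
    return optimizer_steps


torch.manual_seed(0)
model = nn.Linear(8, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(4, 8), torch.randint(0, 3, (4,))) for _ in range(10)]

steps = train_epoch(model, batches, optimizer, nn.CrossEntropyLoss(),
                    accumulation_steps=4)
print("optimizer steps:", steps)
```

With 10 batches and `accumulation_steps=4`, the optimizer steps after batches 4, 8, and 10. Memory use stays that of a single mini-batch while the effective batch size grows by the accumulation factor.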

In [None]:
cfg['batch_size'] = 72
cfg['preload'] = False
cfg['num_epochs'] = 30
cfg['d_model'] = 256
cfg['d_ff'] = 128
cfg['pct_start'] = 0.2
cfg['max_lr'] = 10**-3
cfg['initial_div_factor'] = 10
cfg['final_div_factor'] = 10

cfg['gradient_accumulation'] = True
cfg['gradient_accumulation_steps'] = 40

from train import train_model

train_model(config=cfg)

👀Using device: cuda


Downloading builder script:   0%|          | 0.00/6.08k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/161k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/20.5k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/12.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/127085 [00:00<?, ? examples/s]

Max len of source sentence: 471
Max len of target sentence: 482

⚡️⚡️Number of model parameters: 25,824,850



Processing Epoch: 00: 100%|██████████| 1589/1589 [05:10<00:00,  5.12it/s, loss=3.10041, lr=[0.00025001573481590265]]
Processing Epoch: 01: 100%|██████████| 1589/1589 [05:08<00:00,  5.15it/s, loss=3.09889, lr=[0.0004000314696318053]]
Processing Epoch: 02: 100%|██████████| 1589/1589 [05:09<00:00,  5.14it/s, loss=2.68248, lr=[0.000550047204447708]]
Processing Epoch: 03: 100%|██████████| 1589/1589 [05:08<00:00,  5.16it/s, loss=2.03687, lr=[0.0007000629392636107]]
Processing Epoch: 04: 100%|██████████| 1589/1589 [05:06<00:00,  5.18it/s, loss=2.23102, lr=[0.0008500786740795132]]
Processing Epoch: 05: 100%|██████████| 1589/1589 [05:08<00:00,  5.15it/s, loss=2.10534, lr=[0.000999905591104584]]
Processing Epoch: 06: 100%|██████████| 1589/1589 [05:10<00:00,  5.11it/s, loss=1.94918, lr=[0.0008498898562886814]]
Processing Epoch: 07: 100%|██████████| 1589/1589 [05:09<00:00,  5.13it/s, loss=1.85274, lr=[0.0006998741214727787]]
Processing Epoch: 08: 100%|██████████| 1589/1589 [05:08<00:00,  5.15it/s,
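
The logged learning rates can be checked by hand. The sketch below is a plain-Python version of a linear, three-phase one-cycle schedule, written to mirror how PyTorch's `OneCycleLR` behaves with `three_phase=True` and `anneal_strategy='linear'` as far as I understand it. It assumes the scheduler steps once per batch and that the cycle spans 30 epochs of 1589 batches, which is what the values in the progress bars are consistent with. The function name `one_cycle_lr` is hypothetical.

```python
def one_cycle_lr(step, total_steps, max_lr, pct_start, div_factor, final_div_factor):
    """Learning rate after `step` scheduler steps (linear, three-phase cycle)."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1
    cooldown_end = 2 * pct_start * total_steps - 2
    phases = [
        (0.0, warmup_end, initial_lr, max_lr),                # ramp up to max_lr
        (warmup_end, cooldown_end, max_lr, initial_lr),       # ramp back down
        (cooldown_end, total_steps - 1, initial_lr, min_lr),  # final anneal
    ]
    for start_step, end_step, start_lr, end_lr in phases:
        if step <= end_step:
            fraction = (step - start_step) / (end_step - start_step)
            return start_lr + (end_lr - start_lr) * fraction
    return min_lr


steps_per_epoch = 1589              # batches per epoch, from the progress bars
total_steps = 30 * steps_per_epoch  # assumes cfg['num_epochs'] = 30 defines the cycle

for epoch in (0, 1, 5, 6):
    lr = one_cycle_lr(
        (epoch + 1) * steps_per_epoch,
        total_steps,
        max_lr=1e-3,          # cfg['max_lr']
        pct_start=0.2,        # cfg['pct_start']
        div_factor=10,        # cfg['initial_div_factor']
        final_div_factor=10,  # cfg['final_div_factor']
    )
    print(f"epoch {epoch:02d}: lr = {lr:.6e}")
```

Under these assumptions the warm-up covers the first 20% of the steps (six epochs), which matches the log: the rate climbs by about 1.5e-4 per epoch, peaks near 1e-3 at the end of epoch 05, then falls at the same pace.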

In [None]:
cfg['batch_size'] = 72
cfg['preload'] = True
cfg['num_epochs'] = 30
cfg['d_model'] = 256
cfg['d_ff'] = 128
cfg['pct_start'] = 0.2
cfg['max_lr'] = 10**-3
cfg['initial_div_factor'] = 10
cfg['final_div_factor'] = 10

cfg['gradient_accumulation'] = True
cfg['gradient_accumulation_steps'] = 40

from train import train_model

train_model(config=cfg)

👀Using device: cuda


Downloading builder script:   0%|          | 0.00/6.08k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/161k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/20.5k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/12.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/127085 [00:00<?, ? examples/s]

Max len of source sentence: 471
Max len of target sentence: 482

⚡️⚡️Number of model parameters: 25,824,850

Preloading model weights/tmodel_20.pt
Preloaded


Processing Epoch: 21: 100%|██████████| 1589/1589 [05:15<00:00,  5.04it/s, loss=1.76465, lr=[0.00025001573481590265]]
Processing Epoch: 22: 100%|██████████| 1589/1589 [05:10<00:00,  5.12it/s, loss=2.01428, lr=[0.0004000314696318053]]
Processing Epoch: 23: 100%|██████████| 1589/1589 [05:09<00:00,  5.14it/s, loss=1.82859, lr=[0.000550047204447708]]
Processing Epoch: 24: 100%|██████████| 1589/1589 [05:11<00:00,  5.11it/s, loss=1.74023, lr=[0.0007000629392636107]]
Processing Epoch: 25: 100%|██████████| 1589/1589 [05:13<00:00,  5.07it/s, loss=1.73396, lr=[0.0008500786740795132]]
Processing Epoch: 26: 100%|██████████| 1589/1589 [05:11<00:00,  5.11it/s, loss=1.80579, lr=[0.000999905591104584]]
Processing Epoch: 27: 100%|██████████| 1589/1589 [05:11<00:00,  5.10it/s, loss=1.89716, lr=[0.0008498898562886814]]
Processing Epoch: 28: 100%|██████████| 1589/1589 [05:10<00:00,  5.11it/s, loss=1.77185, lr=[0.0006998741214727787]]
Processing Epoch: 29: 100%|██████████| 1589/1589 [05:12<00:00,  5.09it/s,

--------------------------------------------------------------------------------
    SOURCE: The cavalcade was brilliant, and its march resounded on the pavement.
    TARGET: La cavalcade était brillante et résonnait sur le pavé.
 PREDICTED: La cavalcade était brillante et sa marche s ' agitait sur le pavé .
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
    SOURCE: There was nothing extraordinary about the country; the sky was blue, the trees swayed; a flock of sheep passed.
    TARGET: Mais non! la campagne n’avait rien d’extraordinaire: le ciel était bleu, les arbres se balançaient; un troupeau de moutons passa.
 PREDICTED: Le ciel était bleu , des arbres , des cris de mouflons , un troupeau de mouflons passa .
--------------------------------------------------------------------------------


RuntimeError: ignored