### Training a Seq2seq Transformer model (25 points)

Implement a Seq2seq Transformer. You may use https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html as the transformer block. For tokenization, use a HuggingFace tokenizer for the source/target languages: https://huggingface.co/docs/transformers/fast_tokenizers
Limit the data to sentences of **up to 15 words**, without any prefixes.

Do not forget the remaining elements of the model:
* A single transformer can be used as the encoder, with a linear layer acting as the decoder.
* Train your own BPE tokenizer: https://huggingface.co/docs/transformers/fast_tokenizers
* A token embedding matrix
* A positional embedding matrix
* A linear projection layer into the target vocabulary
* A function that masks future positions in attention, since the model is autoregressive
* A learning rate scheduler
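
The assignment does not prescribe a particular learning rate scheduler. One common choice for transformers is linear warmup followed by inverse-square-root decay. The sketch below is only an illustration under assumed values: `d_model=256` and `warmup_steps=4000` are hypothetical, and `noam_lr` is not a function from this project. With a base learning rate of 1.0, a function like this could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`.

```python
def noam_lr(step: int, d_model: int = 256, warmup_steps: int = 4000) -> float:
    """Learning rate with linear warmup, then inverse-square-root decay.

    d_model and warmup_steps are illustrative values, not taken from the notebook.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate grows linearly until warmup_steps, then decays.
for s in (1, 1000, 4000, 16000):
    print(s, noam_lr(s))
```

The peak is reached at `step == warmup_steps`; a shorter warmup gives a higher peak rate, so the two parameters should be tuned together.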


As results, attach the following:
1) Training parameters: learning rate, batch_size, epoch_num, hidden layer size, number of layers
2) Training curves: train loss, val loss, BLEU score
3) Example translations from your model (10 of them): source text, true target text, predicted target text
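
For the BLEU curve, it helps to know what the metric computes. The sketch below is a minimal, unsmoothed corpus-level BLEU with one reference per hypothesis. It is an illustration of the standard definition, not necessarily the implementation used by the model in this notebook, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Unsmoothed corpus-level BLEU with one reference per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on a mat".split()
print(corpus_bleu([reference], [hypothesis]))
```

Because there is no smoothing, a batch in which no 4-gram matches scores exactly 0, which is one reason per-step BLEU values can be far lower than per-epoch values.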

### Namings
```python
# N - batch size
# S - src_seq_length
# T - tgt_seq_length
# E - emb_size
# SV - src_vocab_size
# TV - tgt_vocab_size
```

```python
# loader -> [(N, S), (N, T)]
#
# Model:
#   FORWARD(src, tgt):
#       INPUT:
#           enc_emb(src) -> (S, N, E)
#           dec_emb(tgt) -> (T, N, E)
#           mask -> T, T
#       ENCODER:
#           INPUT -> transformer -> out: (T, N, E)
#       DECODER:
#           out -> T, N, TV
```
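
The shape flow above can be checked with a small runnable sketch. It assumes toy sizes and a bare `nn.Transformer` with the default sequence-first layout (`batch_first=False`). Positional encoding is omitted for brevity, and the variable names only mirror the summary printed later by the trainer; this is not the project's `Seq2SeqTransformer` code.

```python
import torch
import torch.nn as nn

N, S, T, E = 2, 7, 5, 16      # batch size, source length, target length, embedding size
SV, TV = 50, 60               # source / target vocabulary sizes (toy values)

enc_emb = nn.Embedding(SV, E)
dec_emb = nn.Embedding(TV, E)
transformer = nn.Transformer(d_model=E, nhead=4, num_encoder_layers=1,
                             num_decoder_layers=1, dim_feedforward=32)
linear = nn.Linear(E, TV)

src = torch.randint(0, SV, (N, S))   # loader output: (N, S)
tgt = torch.randint(0, TV, (N, T))   # loader output: (N, T)

# nn.Transformer expects sequence-first tensors by default (batch_first=False).
src_e = enc_emb(src.transpose(0, 1))  # (S, N, E)
tgt_e = dec_emb(tgt.transpose(0, 1))  # (T, N, E)

# Causal mask (T, T): -inf above the diagonal blocks attention to future positions.
tgt_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

out = transformer(src_e, tgt_e, tgt_mask=tgt_mask)  # (T, N, E)
logits = linear(out)                                # (T, N, TV)
print(src_e.shape, tgt_e.shape, tgt_mask.shape, out.shape, logits.shape)
```

The additive mask holds 0 where attention is allowed and `-inf` where it is blocked, so position `i` of the target can only attend to positions `0..i`.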

## Loader

In [1]:
import sys, os
import importlib
import torch
import random

random.seed(42)
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.callbacks import LearningRateMonitor

sys.path.append(os.path.join(os.getcwd(), "./src"))

from models import seq2seq_transformer2
from data.datamodule import DataManager
device = torch.device("mps")

In [2]:
os.getcwd()

'/Users/a.i.krotov/Desktop/universal_osa/contest/hw-2/Transformer-model-project'

In [3]:
eng_prefixes = (
    "i am ",
    "i m ",
    "he is",
    "he s ",
    "she is",
    "she s ",
    "you are",
    "you re ",
    "we are",
    "we re ",
    "they are",
    "they re ",
)

config = {
    "batch_size": 128,          # <--- size of batch
    "num_workers": 9,          # <--- num cpu to use in dataloader
    "prefix_filter": eng_prefixes,
    "max_length": 15,
    "filename": "./rus.txt",    # <--- path to file with sentences
    "lang1": "en",              # <--- name of the first lang    
    "lang2": "ru",              # <--- name of the second lang
    "reverse": False,           # <--- direct or reverse order in pairs
    "train_size": 0.8,          # <--- ratio of data pairs to use in train
    "run_name": "tutorial",     # <--- run name to logger and checkpoints
    "quantile": 0.95,           # <--- (1 - quantile) longest sentences will be removed
}

dm = DataManager(config, device)
dm.prepare_data()

Reading from file: 100%|██████████| 496059/496059 [00:02<00:00, 165506.04it/s]





Space tokenizer fitted - 18831 tokens



Space tokenizer fitted - 30000 tokens


(<bound method DataManager.train_dataloader of <data.datamodule.DataManager object at 0x16c4dfdd0>>,
 <bound method DataManager.val_dataloader of <data.datamodule.DataManager object at 0x16c4dfdd0>>)

In [4]:
device

device(type='mps')

## Loader check

In [5]:
for d in dm.train_dataloader():
    src, tgt = d
    print('src.shape\n', src.shape)
    print('src\n', src)
    print('src.transpose(0, 1)\n', src.transpose(0, 1))
    print()
    print('tgt.shape\n', tgt.shape)
    print('tgt\n', tgt)
    print('tgt.transpose(0, 1)\n', tgt.transpose(0, 1))
    break

src.shape
 torch.Size([128, 15])
src
 tensor([[    0,    12,   617,  ...,     3,     3,     3],
        [    0,    12,   277,  ...,     3,     3,     3],
        [    0,    57,    97,  ...,     3,     3,     3],
        ...,
        [    0,    12,   589,  ...,     3,     3,     3],
        [    0,    33, 18019,  ...,     3,     3,     3],
        [    0,    80,    36,  ...,     3,     3,     3]], device='mps:0')
src.transpose(0, 1)
 tensor([[    0,     0,     0,  ...,     0,     0,     0],
        [   12,    12,    57,  ...,    12,    33,    80],
        [  617,   277,    97,  ...,   589, 18019,    36],
        ...,
        [    3,     3,     3,  ...,     3,     3,     3],
        [    3,     3,     3,  ...,     3,     3,     3],
        [    3,     3,     3,  ...,     3,     3,     3]], device='mps:0')

tgt.shape
 torch.Size([128, 15])
tgt
 tensor([[   0,  266,   85,  ...,   38, 1915,    1],
        [   0,   49,  136,  ...,    3,    3,    3],
        [   0,  120,  516,  ...,    3,    
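
The batches above are padded to a fixed length, so attention should usually ignore the padded positions. A minimal sketch of building a key padding mask follows. It assumes the PAD index is 3, which is only inferred from the trailing 3s in the printed tensors, and the toy batch is made up for illustration.

```python
import torch

PAD_IDX = 3  # assumption: inferred from the trailing 3s in the printed batch

batch = torch.tensor([[0, 12, 617, 1, 3, 3],
                      [0, 57, 97, 45, 8, 1]])   # toy (N, S) batch

# True marks padded positions that attention should ignore; shape (N, S).
padding_mask = batch == PAD_IDX
print(padding_mask)
```

A boolean mask of this shape is what `nn.Transformer` accepts as `src_key_padding_mask` (and, built from the target batch, as `tgt_key_padding_mask`).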

## Model

In [6]:
importlib.reload(seq2seq_transformer2)

<module 'models.seq2seq_transformer2' from '/Users/a.i.krotov/Desktop/universal_osa/contest/hw-2/Transformer-model-project/./src/models/seq2seq_transformer2.py'>

In [8]:
checkpoint = "./eng2ru-transformer-translator-0.248bleu.ckpt"
model = seq2seq_transformer2.Seq2SeqTransformer.load_from_checkpoint(
      checkpoint,
               lr = 1e-2,
            nhead = 4,
          src_dim = dm.input_lang_n_words,  # SV
          tgt_dim = dm.output_lang_n_words, # TV
          emb_dim = 256,
          hdn_dim = 256, # dim_feedforward
      enc_nlayers = 3,
      dec_nlayers = 3,
    tgt_tokenizer = dm.target_tokenizer,
    src_tokenizer = dm.source_tokenizer,
          dropout = 0.3,
          max_len = 15,
      tgt_pad_idx = dm.target_tokenizer.word2index['PAD'],
      tgt_sos_idx = dm.target_tokenizer.word2index['SOS'],
      tgt_eos_idx = dm.target_tokenizer.word2index['EOS'],
      src_pad_idx = dm.source_tokenizer.word2index['PAD'],
      src_sos_idx = dm.source_tokenizer.word2index['SOS'],
      src_eos_idx = dm.source_tokenizer.word2index['EOS'],
).to(device)

In [10]:
# TB Logger
logger = TensorBoardLogger("lightning_logs", name=config["run_name"])

from pytorch_lightning.callbacks import Callback

class CustomWriter(Callback):
    def on_train_start(self, trainer, pl_module):
        print("Training is started!")
        
    def on_train_end(self, trainer, pl_module):
        print("Training is done.")
        
    def on_train_epoch_end(self, trainer, pl_module):
        pl_module.eval()
        print('\n\nExample:')
        # phrase = 'but when you consider that a human being has the opportunity of being acquainted with'
        phrase = 'i ll accompany you as far as the intersection'
        print(phrase)
        print(pl_module.predict(phrase))
        pl_module.train()
        print()
        
# Callbacks
checkpoint_callback = ModelCheckpoint(
    save_top_k=3,
    monitor="val_loss",
    mode="min",
    dirpath="runs/{}/".format(config["run_name"]),
    filename="{epoch:02d}-{step:d}-{val_loss:.4f}",
    verbose=True,
    every_n_epochs=1,
)
lr_monitor = LearningRateMonitor(logging_interval="step")

# Initialize a Trainer
trainer = pl.Trainer(
    accelerator='gpu',
    max_epochs=3,
    min_epochs=1,
    callbacks=[lr_monitor, checkpoint_callback, CustomWriter()],
    check_val_every_n_epoch=1,
    logger=logger,
    log_every_n_steps=1,
)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [11]:
trainer.fit(model, dm)

Reading from file: 100%|██████████| 496059/496059 [00:02<00:00, 167608.73it/s]





Space tokenizer fitted - 18831 tokens



Space tokenizer fitted - 30000 tokens


/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:653: Checkpoint directory /Users/a.i.krotov/Desktop/universal_osa/contest/hw-2/Transformer-model-project/runs/tutorial exists and is not empty.

  | Name        | Type               | Params
---------------------------------------------------
0 | enc_emb     | Embedding          | 4.8 M 
1 | dec_emb     | Embedding          | 7.7 M 
2 | pos_enc     | PositionalEncoding | 0     
3 | transformer | Transformer        | 3.2 M 
4 | linear      | Linear             | 7.7 M 
5 | criterion   | CrossEntropyLoss   | 0     
---------------------------------------------------
23.4 M    Trainable params
0         Non-trainable params
23.4 M    Total params
93.510    Total estimated model params size (MB)


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=9` in the `DataLoader` to improve performance.


                                                                           

/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=9` in the `DataLoader` to improve performance.


Training is started!
Epoch 0: 100%|██████████| 1883/1883 [34:59<00:00,  0.90it/s, v_num=71, train_loss_step=2.350]

Epoch 0, global step 1883: 'val_loss' reached 3.84647 (best 3.84647), saving model to '/Users/a.i.krotov/Desktop/universal_osa/contest/hw-2/Transformer-model-project/runs/tutorial/epoch=00-step=1883-val_loss=3.8465.ckpt' as top 3


Epoch 1: 100%|██████████| 1883/1883 [35:16<00:00,  0.89it/s, v_num=71, train_loss_step=1.900, bleu_score_step=0.000, val_loss_step=7.300, bleu_score_epoch=0.0429, val_loss_epoch=3.850, train_loss_epoch=2.420]

Epoch 1, global step 3766: 'val_loss' reached 3.79894 (best 3.79894), saving model to '/Users/a.i.krotov/Desktop/universal_osa/contest/hw-2/Transformer-model-project/runs/tutorial/epoch=01-step=3766-val_loss=3.7989.ckpt' as top 3


Epoch 2: 100%|██████████| 1883/1883 [35:12<00:00,  0.89it/s, v_num=71, train_loss_step=1.920, bleu_score_step=0.000, val_loss_step=7.360, bleu_score_epoch=0.0494, val_loss_epoch=3.800, train_loss_epoch=2.030]

Epoch 2, global step 5649: 'val_loss' reached 3.74834 (best 3.74834), saving model to '/Users/a.i.krotov/Desktop/universal_osa/contest/hw-2/Transformer-model-project/runs/tutorial/epoch=02-step=5649-val_loss=3.7483.ckpt' as top 3
`Trainer.fit` stopped: `max_epochs=3` reached.


Training is done.
Epoch 2: 100%|██████████| 1883/1883 [39:11<00:00,  0.80it/s, v_num=71, train_loss_step=1.920, bleu_score_step=0.0016, val_loss_step=7.350, bleu_score_epoch=0.0538, val_loss_epoch=3.750, train_loss_epoch=1.920]


### Model saving

In [12]:
trainer.save_checkpoint("./eng2ru-transformer-translator-0.054bleu.ckpt")

In [11]:
torch.save(model.state_dict(), "./eng2ru-transformer-translator-0.248bleu.pt")