### Fine-tune pretrained T5 (25 points)

Implement a Seq2seq model based on pretrained T5. Use a pretrained model from https://huggingface.co/docs/transformers/model_doc/t5. As the maximum length, use sentences of **up to 15 words**, without any prefixes. The model architecture (number of layers, dimensionality, etc.) is up to you.

Do not forget these important aspects of model training:
* Use a ready-made T5 tokenizer
* Resize the embedding matrix: most likely your embedding matrix will not include embeddings for the tokens in your dataset. An example of updating the embedding matrix is here: https://github.com/runnerup96/Transformers-Tuning/blob/main/t5_encoder_decoder.py
* Learning rate scheduler / Adafactor with a constant learning rate
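
The embedding-resize step from the list above can be sketched with plain PyTorch. This is only an illustration under stated assumptions, not the code from the linked repository: the helper name `resize_embedding` and the toy sizes are hypothetical. With a Hugging Face model the same effect is usually obtained by calling `model.resize_token_embeddings(len(tokenizer))` after new tokens have been added to the tokenizer.

```python
import torch
from torch import nn


def resize_embedding(old: nn.Embedding, new_num_tokens: int) -> nn.Embedding:
    """Return a new embedding layer with `new_num_tokens` rows.

    Hypothetical helper: rows for existing token ids are copied from `old`,
    rows for newly added ids keep their fresh random initialisation.
    """
    new = nn.Embedding(new_num_tokens, old.embedding_dim)
    rows_to_copy = min(old.num_embeddings, new_num_tokens)
    with torch.no_grad():
        new.weight[:rows_to_copy] = old.weight[:rows_to_copy]
    return new


# Toy sizes (assumed): a 10-token vocabulary grows to 14 tokens.
old_emb = nn.Embedding(10, 8)
new_emb = resize_embedding(old_emb, 14)
```

The newly added rows start from random values, so they only become useful after fine-tuning on data that actually contains the new tokens.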


As results, attach the following:
1) Training parameters: learning rate, batch_size, epoch_num, pretrained model name
2) Training curves: train loss, val loss, BLEU score
3) Sample translations from your model (10 examples): source text, true target text, predicted target text
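
The BLEU score requested above could be computed roughly as follows. This is a simplified sketch, assuming whitespace tokenisation, a single reference per hypothesis, uniform weights over 1- to 4-grams and no smoothing; the function names are illustrative. For reported numbers a maintained implementation such as `sacrebleu` or `torchmetrics` is the safer choice.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order `n` in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tok, hyp_tok = ref.split(), hyp.split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            ref_counts, hyp_counts = ngrams(ref_tok, n), ngrams(hyp_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any empty order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# A hypothesis identical to its reference scores 1.0.
score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])
```

Because no smoothing is applied, short sentences with no matching 4-gram score exactly zero, which is one reason corpus-level rather than sentence-level BLEU is usually plotted.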

In [None]:
# !wget https://www.manythings.org/anki/rus-eng.zip && unzip rus-eng.zip

### Loader

In [1]:
import torch
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.callbacks import LearningRateMonitor
import sys, os
import importlib

sys.path.append(os.path.join(os.getcwd(), "./src_t5"))

from data.datamodule import DataManager

device = torch.device("cuda:6" if torch.cuda.is_available() else "cpu")

In [2]:
eng_prefixes = (
    "i am ",
    "i m ",
    "he is",
    "he s ",
    "she is",
    "she s ",
    "you are",
    "you re ",
    "we are",
    "we re ",
    "they are",
    "they re ",
)

def filter_func(x):
    """Keep a pair if both sentences are short enough and the English side starts with a known prefix."""
    MAX_LENGTH = 15
    # Both sides of the pair must contain at most MAX_LENGTH space-separated words.
    len_ok = len(x[0].split(" ")) <= MAX_LENGTH and len(x[1].split(" ")) <= MAX_LENGTH
    # x[0] is assumed to be the English sentence (lang1 = "en").
    prefix_ok = x[0].startswith(eng_prefixes)
    return len_ok and prefix_ok

config = {
    "batch_size": 64,               # <--- size of batch
    "num_workers": 47,              # <--- num cpu to use in dataloader
    "prefix_filter": eng_prefixes,  # <--- callable obj to filter data
    "max_length": 15,               # <--- max sentence length in words
    "filename": "./rus.txt",        # <--- path to file with sentences
    "lang1": "en",                  # <--- name of the first lang
    "lang2": "ru",                  # <--- name of the second lang
    "reverse": False,               # <--- direct or reverse order in pairs
    "train_size": 0.8,              # <--- ratio of data pairs to use in train
    "run_name": "tutorial",         # <--- run name to logger and checkpoints
    "quantile": 0.95,               # <--- (1 - quantile) longest sentences will be removed
}
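
The `quantile` and `train_size` options could be interpreted as in the sketch below. The actual `DataManager` implementation is not shown in this notebook, so this is only one plausible reading: the helper names `drop_longest` and `split_pairs`, the nearest-rank quantile rule, and measuring length on the source sentence only are all assumptions.

```python
import math
import random


def drop_longest(pairs, quantile=0.95):
    """Drop the (1 - quantile) share of pairs with the longest source sentence."""
    if not pairs:
        return []
    lengths = sorted(len(src.split()) for src, _ in pairs)
    # Nearest-rank quantile (assumed rule): index of the cutoff length.
    idx = max(0, math.ceil(quantile * len(lengths)) - 1)
    cutoff = lengths[idx]
    return [pair for pair in pairs if len(pair[0].split()) <= cutoff]


def split_pairs(pairs, train_size=0.8, seed=0):
    """Shuffle the pairs reproducibly and split them into train and validation parts."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_size * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Toy data (assumed): ten pairs whose source length grows from 1 to 10 words.
toy_pairs = [(" ".join(["w"] * n), "x") for n in range(1, 11)]
kept = drop_longest(toy_pairs, quantile=0.5)
train_pairs, val_pairs = split_pairs(toy_pairs, train_size=0.8)
```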

In [3]:
dm = DataManager(config, device)
dm.prepare_data()

Reading from file: 100%|██████████| 496059/496059 [00:05<00:00, 83449.68it/s]
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


(<bound method DataManager.train_dataloader of <data.datamodule.DataManager object at 0x7feca718a4d0>>,
 <bound method DataManager.val_dataloader of <data.datamodule.DataManager object at 0x7feca718a4d0>>)

### Model training

In [11]:
from models import seq2seq_t5
importlib.reload(seq2seq_t5)

ModuleNotFoundError: spec not found for the module 'models.seq2seq_t5'

In [5]:
model = seq2seq_t5.Seq2SeqT5(
    model="google-t5/t5-small",
    max_len=15,
    lr=1e-3,
    tokenizer=dm.tokenizer,
    device=device,
).to(device)

In [6]:
# TB Logger
logger = TensorBoardLogger("lightning_logs", name=config["run_name"])

from pytorch_lightning.callbacks import Callback

class CustomWriter(Callback):
    """Prints a sample translation at the end of every training epoch."""

    def on_train_start(self, trainer, pl_module):
        print("Training is started!")

    def on_train_end(self, trainer, pl_module):
        print("Training is done.")

    def on_train_epoch_end(self, trainer, pl_module):
        print("\n\nExample:")
        phrase = "translate English to Russian: between the lines, its clear that Tom isnt having such"
        print(phrase)
        in_tokens = pl_module.tokenizer(phrase)
        # Add a batch dimension and build integer tensors directly on the model's device.
        input_ids = torch.tensor([in_tokens.input_ids], dtype=torch.long, device=pl_module.device)
        attention_mask = torch.tensor([in_tokens.attention_mask], dtype=torch.long, device=pl_module.device)
        pl_module.eval()  # disable dropout while generating
        with torch.no_grad():
            prediction = pl_module.predict(input_ids, attention_mask)
        pl_module.train()  # restore training mode for the next epoch
        print(pl_module.tokenizer.decode(prediction[0], skip_special_tokens=True))
        print()

        
# Callbacks
checkpoint_callback = ModelCheckpoint(
    save_top_k=3,
    monitor="val_loss",
    mode="min",
    dirpath="runs/{}/".format(config["run_name"]),
    filename="{epoch:02d}-{step:d}-{val_loss:.4f}",
    verbose=True,
    every_n_epochs=1,
)
lr_monitor = LearningRateMonitor(logging_interval="step")

# Initialize a Trainer
trainer = pl.Trainer(
    accelerator='gpu',
    max_epochs=8,
    min_epochs=1,
    devices=[6],
    callbacks=[lr_monitor, checkpoint_callback, CustomWriter()],
    check_val_every_n_epoch=1,
    logger=logger,
    log_every_n_steps=1,
)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [7]:
trainer.fit(model, dm)

Reading from file: 100%|██████████| 496059/496059 [00:05<00:00, 85223.25it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/home/krotovan/hw-sber/env/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:653: Checkpoint directory /home/krotovan/hw-sber/pytorch-project/runs/tutorial exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

  | Name     | Type                       | Params
--------------------------------------------------------
0 | t5_model | T5ForConditionalGeneration | 60.5 M
--------------------------------------------------------
60.5 M    Trainable params
0         Non-trainable params
60.5 M    Total params
241.969   Total estimated model params size (MB)


Sanity Checking: |                                                                                            …

/home/krotovan/hw-sber/env/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:492: Your `val_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
/home/krotovan/hw-sber/env/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=47` in the `DataLoader` to improve performance.
  source_ids, attention_masks, target_ids = torch.tensor(self.tokenized_source_list[idx]     ).to(self.device), \
  torch.tensor(self.attention_mask_source_list[idx]).to(self.device), \
  torch.tensor(self.tokenized_target_list[idx]     ).to(self.device)


Training is started!


/home/krotovan/hw-sber/env/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=47` in the `DataLoader` to improve performance.


Training: |                                                                                                   …

Validation: |                                                                                                 …

Epoch 0, global step 3765: 'val_loss' reached 0.50890 (best 0.50890), saving model to '/home/krotovan/hw-sber/pytorch-project/runs/tutorial/epoch=00-step=3765-val_loss=0.5089.ckpt' as top 3




Example:
translate English to Russian: between the lines, its clear that Tom isnt having such
меду л то у



Validation: |                                                                                                 …

Epoch 1, global step 7530: 'val_loss' reached 0.48172 (best 0.48172), saving model to '/home/krotovan/hw-sber/pytorch-project/runs/tutorial/epoch=01-step=7530-val_loss=0.4817.ckpt' as top 3




Example:
translate English to Russian: between the lines, its clear that Tom isnt having such
меду лди вно



Validation: |                                                                                                 …

Epoch 2, global step 11295: 'val_loss' reached 0.44756 (best 0.44756), saving model to '/home/krotovan/hw-sber/pytorch-project/runs/tutorial/epoch=02-step=11295-val_loss=0.4476.ckpt' as top 3




Example:
translate English to Russian: between the lines, its clear that Tom isnt having such
меду текст вно



Validation: |                                                                                                 …

Epoch 3, global step 15060: 'val_loss' reached 0.41634 (best 0.41634), saving model to '/home/krotovan/hw-sber/pytorch-project/runs/tutorial/epoch=03-step=15060-val_loss=0.4163.ckpt' as top 3




Example:
translate English to Russian: between the lines, its clear that Tom isnt having such
меду лестни сно



Validation: |                                                                                                 …

Epoch 4, global step 18825: 'val_loss' reached 0.40786 (best 0.40786), saving model to '/home/krotovan/hw-sber/pytorch-project/runs/tutorial/epoch=04-step=18825-val_loss=0.4079.ckpt' as top 3




Example:
translate English to Russian: between the lines, its clear that Tom isnt having such
меду линии вно



Validation: |                                                                                                 …

Epoch 5, global step 22590: 'val_loss' reached 0.39761 (best 0.39761), saving model to '/home/krotovan/hw-sber/pytorch-project/runs/tutorial/epoch=05-step=22590-val_loss=0.3976.ckpt' as top 3




Example:
translate English to Russian: between the lines, its clear that Tom isnt having such
меду текста сно



Validation: |                                                                                                 …

Epoch 6, global step 26355: 'val_loss' reached 0.39466 (best 0.39466), saving model to '/home/krotovan/hw-sber/pytorch-project/runs/tutorial/epoch=06-step=26355-val_loss=0.3947.ckpt' as top 3




Example:
translate English to Russian: between the lines, its clear that Tom isnt having such
меду текст сно



Validation: |                                                                                                 …

Epoch 7, global step 30120: 'val_loss' reached 0.38923 (best 0.38923), saving model to '/home/krotovan/hw-sber/pytorch-project/runs/tutorial/epoch=07-step=30120-val_loss=0.3892.ckpt' as top 3




Example:
translate English to Russian: between the lines, its clear that Tom isnt having such
меду линии сно



`Trainer.fit` stopped: `max_epochs=8` reached.


Training is done.


### Model saving

In [8]:
trainer.save_checkpoint("./eng2ru-t5-translator-0.2bleu.ckpt")

In [10]:
torch.save(model.state_dict(), "./eng2ru-t5-translator-0.2bleu.pt")