## Lab assignment 02

### Neural Machine Translation in the wild
In this assignment your goal is to get the best translation quality you can on the EN-RU translation task.

A basic approach using RNNs as the encoder and decoder is implemented for you.

Your ultimate task is to use the techniques we've covered, e.g.

* Optimization enhancements (e.g. learning rate decay)

* Transformer/CNN/<whatever you select> encoder (with or without positional encoding)

* attention/self-attention mechanism

* pretraining the language models (for decoder and encoder)

* or just fine-tuning BART/ELECTRA/... ;)

to improve the translation quality. 

__Please use at least three different approaches/models and compare them (translation quality/complexity/training and evaluation time).__

Write a summary of your experiments and illustrate it with convergence plots, metrics, and your own observations, just as you would when approaching a real problem.
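As one concrete instance of the "optimization enhancements" item in the list above, here is a minimal sketch of a learning rate schedule with linear warmup followed by inverse-square-root decay. The function name `lr_at_step` and the default values of `base_lr` and `warmup_steps` are assumptions made for illustration, not part of the assignment code.

```python
def lr_at_step(step, base_lr=1e-3, warmup_steps=1000):
    """Return the learning rate for a 1-based optimizer step.

    Assumed schedule: linear warmup up to base_lr, then decay
    proportional to 1 / sqrt(step).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_steps:
        # Warmup phase: grow linearly from base_lr / warmup_steps to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: base_lr at step == warmup_steps, halved every 4x steps.
    return base_lr * (warmup_steps / step) ** 0.5


for step in (1, 500, 1000, 4000, 16000):
    print(step, lr_at_step(step))
```

In PyTorch, a function like this could be plugged into `torch.optim.lr_scheduler.LambdaLR`; note that `LambdaLR` expects a multiplier of the optimizer's initial learning rate, so the function would have to return `lr_at_step(step) / base_lr` in that case.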

In [1]:
# You might need to install the libraries below. Do it in the desired environment
# if you are working locally.

# ! pip  install subword-nmt
# ! pip install nltk
# ! pip install torchtext

Collecting subword-nmt
  Downloading subword_nmt-0.3.8-py3-none-any.whl (27 kB)
Installing collected packages: subword-nmt
Successfully installed subword-nmt-0.3.8
Collecting torchtext
  Downloading torchtext-0.15.2-cp39-cp39-manylinux1_x86_64.whl (2.0 MB)
Collecting torchdata==0.6.1
  Downloading torchdata-0.6.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
Collecting torch==2.0.1
  Downloading torch-2.0.1-cp39-cp39-manylinux1_x86_64.whl (619.9 MB)

In [2]:
# Thanks to YSDA NLP course team for the data
# (who thanks tilda and deephack teams for the data in their turn)

import os
path_do_data = 'data.txt'
if not os.path.exists(path_do_data):
    print("Dataset not found locally. Downloading from github.")
    !wget https://raw.githubusercontent.com/neychev/made_nlp_course/master/datasets/Machine_translation_EN_RU/data.txt -nc
    path_do_data = './data.txt'

Dataset not found locally. Downloading from github.
--2023-05-17 12:29:22--  https://raw.githubusercontent.com/neychev/made_nlp_course/master/datasets/Machine_translation_EN_RU/data.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12905334 (12M) [text/plain]
Saving to: ‘data.txt’


2023-05-17 12:29:25 (4,68 MB/s) - ‘data.txt’ saved [12905334/12905334]



In [3]:
# The baseline solution's BLEU score is quite low. Try to achieve at least __21__ BLEU on the test set.
# The score thresholds are:

# * __21__ - minimal score to submit the homework, 30% of points

# * __25__ - good score, 70% of points

# * __27__ - excellent score, 100% of points

### Warning! The code below is deeeeeeeply deprecated and is provided only as a simple guide.
We suggest sticking to more recent pipelines here, e.g. those from Hugging Face:
* Example notebook: [link](https://github.com/huggingface/notebooks/blob/main/examples/translation.ipynb)
* Converting your own dataset to specific format: [link](https://discuss.huggingface.co/t/correct-way-to-create-a-dataset-from-a-csv-file/15686/15)

In [56]:
import numpy as np
import pandas as pd
import torch
import random
import matplotlib.pyplot as plt
import time

from tqdm import tqdm
from transformers import AutoTokenizer, AutoModel
from transformers.modeling_outputs import BaseModelOutput
from transformers import T5Model, T5Tokenizer, T5Config, T5ForConditionalGeneration, AutoModelForSeq2SeqLM, AutoModelForCausalLM
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from nltk.translate.bleu_score import corpus_bleu
from IPython.display import clear_output

In [48]:
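# Read explicitly as UTF-8 so the Russian text is decoded the same way on every platform.
with open('data.txt', 'r', encoding='utf-8') as f: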
    texts = f.read()

texts = texts.split(sep='\n')
texts = [row.split('\t') for row in texts]
texts_en = [row[0] for row in texts if len(row) == 2]
texts_ru = [row[1] for row in texts if len(row) == 2]

print('Num texts:', len(texts_en), len(texts_ru))
print('En max len:', max([len(row) for row in texts_en]))
print('Ru max len:', max([len(row) for row in texts_ru]))

Num texts: 50000 50000
En max len: 518
Ru max len: 431


In [50]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
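# MAX_LEN is passed to the tokenizers and to generate() as max_length, so it counts tokens,
# not characters. DistilBERT supports at most 512 positions, so the limit must not exceed that.
MAX_LEN = 512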

In [51]:
class TextDataset(Dataset):
    def __init__(self, texts_en, texts_ru):
        self.texts_en = texts_en
        self.texts_ru = texts_ru
        
    def __len__(self):
        return len(self.texts_en)
    
    def __getitem__(self, idx):
        return self.texts_en[idx], self.texts_ru[idx]

In [52]:
train_texts_en, test_texts_en, train_texts_ru, test_texts_ru = train_test_split(texts_en, texts_ru, test_size=0.2, random_state=42)

train_dataset = TextDataset(train_texts_en, train_texts_ru)
test_dataset = TextDataset(test_texts_en, test_texts_ru)

In [140]:
n_epochs = 10
batch_size = 16
log_each_n_iterations = 1000
generate_n = 10

In [164]:
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

train_loader = DataLoader(train_dataset, batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, 16)
generate_loader = DataLoader(test_dataset, generate_n, shuffle=True)


enc_name = 'distilbert-base-multilingual-cased'
dec_name = 't5-small'
# dec_name = "cointegrated/rut5-base-multitask"

enc_tokenizer = AutoTokenizer.from_pretrained(enc_name)
encoder = AutoModel.from_pretrained(enc_name).to(DEVICE)

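# Note: the t5-small vocabulary was built mainly for English, so Russian text may map
# largely to unknown tokens; the commented-out ruT5 tokenizer may be a better fit.
dec_tokenizer = AutoTokenizer.from_pretrained(dec_name)
# dec_tokenizer = T5Tokenizer.from_pretrained("cointegrated/rut5-base-multitask")

# The decoder is a randomly initialized T5 whose hidden size matches the encoder output;
# only the tokenizer is taken from the pretrained checkpoint.
config = T5Config(vocab_size=dec_tokenizer.vocab_size, d_model=encoder.config.dim, decoder_start_token_id=0)
decoder = T5ForConditionalGeneration(config).to(DEVICE)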

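# The T5 encoder stack inside `decoder` is never run (DistilBERT provides the encoder states),
# so it is frozen; the T5 decoder stack stays trainable.
for p in decoder.encoder.parameters():
    p.requires_grad = False
for p in decoder.decoder.parameters():
    p.requires_grad = True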

optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-5)

Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [142]:
def encode(texts):
    encoded_input = enc_tokenizer(texts, padding=True, truncation=True, max_length=MAX_LEN, return_tensors='pt')
    with torch.no_grad():
        model_output = encoder(**encoded_input.to(encoder.device))
        embeddings = model_output.last_hidden_state
    return embeddings


def decode(embeddings, max_length=MAX_LEN, repetition_penalty=3.0, **kwargs):
    with torch.no_grad():
        out = decoder.generate(
            encoder_outputs=BaseModelOutput(last_hidden_state=embeddings), 
            max_length=max_length, 
            repetition_penalty=repetition_penalty,
            **kwargs
        )
        return [dec_tokenizer.decode(tokens, skip_special_tokens=True) for tokens in out]

In [165]:
train_history = []
iters = 1

for i in range(1, n_epochs + 1):
    print(f'[EPOCH {i}]')
    tqdm_iterator = tqdm(train_loader)

    for text_en_batch, text_ru_batch in tqdm_iterator:
        encoder.train()
        decoder.train()
        x = enc_tokenizer(text_en_batch, return_tensors='pt', padding=True, truncation=True, max_length=MAX_LEN).to(DEVICE)
        y = dec_tokenizer(text_ru_batch, return_tensors='pt', padding=True, truncation=True, max_length=MAX_LEN).to(DEVICE)

        y.input_ids[y.input_ids == 0] = -100  # exclude padding tokens from the loss
        embeds = encoder(**x.to(encoder.device))
        embeds = embeds.last_hidden_state.to(DEVICE)

        loss = decoder(
            encoder_outputs=BaseModelOutput(last_hidden_state=embeds),
            labels=y.input_ids,
            decoder_attention_mask=y.attention_mask,
            return_dict=True
        ).loss
        
        tqdm_iterator.set_description(f'{round(loss.item(), 5)}')
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        train_history.append((iters, loss.item()))
        
        if iters % log_each_n_iterations == 0:
            clear_output()
            print(f'[EPOCH {i}]')
            encoder.eval()
            decoder.eval()
            
            train_it, train_loss = zip(*train_history)
            
            plt.plot(train_it, train_loss, color='blue', label='train loss')
            plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
            plt.grid()
            plt.title('loss')
            plt.show()
            
            en, ru = next(iter(generate_loader))
            embeds = encode(en)
            generated = decode(embeds, max_length=MAX_LEN, repetition_penalty=None)
            print(ru)
            print('\n\n'.join(generated))
        
        iters += 1

[EPOCH 1]


  0%|          | 0/40000 [00:00<?, ?it/s]

('Located inside a condo 2,5 km from Pernambuco Beach, Guarujá House offers an outdoor pool and barbecue facilities.',)
('Дом для отпуска Guarujá House находится в кономиниуме, в 2,5 км от пляжа Пернамбуку. К услугам гостей открытый бассейн и принадлежности для барбекю.',)


10.92204:   0%|          | 1/40000 [00:02<21:12:05,  1.91s/it]

('Featuring a balcony, all units have a seating and dining area.',)
('Во всех апартаментах есть балкон, гостиный уголок и обеденная зона.',)


10.58915:   0%|          | 2/40000 [00:03<15:42:13,  1.41s/it]

('Only 500 metres from the UNESCO-protected centre of Trogir, it offers free parking and air-conditioned accommodation with free Wi-Fi.',)
('К услугам гостей номера и апартаменты с кондиционером и бесплатным Wi-Fi, а также бесплатная парковка. Всего в 500 метрах находится центр Трогира, внесенный в список объектов Всемирного наследия ЮНЕСКО.',)


10.32788:   0%|          | 3/40000 [00:04<14:28:57,  1.30s/it]

('Free private parking is available on site.',)
('На территории обустроена бесплатная частная парковка.',)


10.10736:   0%|          | 4/40000 [00:05<13:24:01,  1.21s/it]

('Guests can enjoy a meal at the on-site restaurant, which offers à la carte options. Room service is provided.',)
('На территории комплекса открыт ресторан с обслуживанием по меню, а для дополнительного удобства гостей осуществляется доставка еды и напитков в номер.',)


9.95268:   0%|          | 5/40000 [00:06<13:37:34,  1.23s/it] 

('It is a 10-minute drive from Jiuzhou Port and a 30-minute drive from Hengqin Port or Chimelong Ocean International Tourist Resort.',)
('Поездка до порта Цзючжоу займет 10 минут, а за 30 минут можно доехать до порта Хэнцзинь и международного туристического курорта Chimelong Ocean.',)


9.78891:   0%|          | 6/40000 [00:07<13:58:24,  1.26s/it]

('The beach of Agia Triada is 1.2 km away from Venere Apartments.',)
('Пляж Агиа Триада находится в 1,2 км от апартаментов Venere.',)


9.85064:   0%|          | 7/40000 [00:09<14:45:17,  1.33s/it]

('An array of activities can be enjoyed on site or in the surroundings, including snorkelling and canoeing.',)
('На территории и в окрестностях популярны различные виды активного отдыха, в том числе сноркелинг и катание на каноэ.',)


9.85064:   0%|          | 7/40000 [00:09<15:48:23,  1.42s/it]


KeyboardInterrupt: 
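The training loop above replaces padding ids in the labels with `-100` before passing them to the decoder. Below is a small list-based sketch of what that does, assuming (as in T5, to the best of my understanding) that the decoder inputs are built by shifting the labels one position to the right. The token ids and the helper names `mask_padding` and `shift_right` are hypothetical and only serve the illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_ids, pad_id=0):
    """Replace padding ids so those positions do not contribute to the loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in label_ids]


def shift_right(label_ids, start_id=0, pad_id=0):
    """Build teacher-forcing decoder inputs from labels.

    The start token is prepended, the last label is dropped, and any
    ignored position is turned back into a regular padding id.
    """
    shifted = [start_id] + label_ids[:-1]
    return [pad_id if t == IGNORE_INDEX else t for t in shifted]


# Hypothetical ids: three content tokens, EOS (1), then two pads (0).
tokens = [37, 512, 9, 1, 0, 0]
labels = mask_padding(tokens)
decoder_inputs = shift_right(labels)

print("labels:        ", labels)
print("decoder inputs:", decoder_inputs)
```

At each position the decoder sees the previous target token and is trained to predict the current one; positions labeled `-100` are skipped when the loss is averaged.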

### Main part
__Here comes the preprocessing. Do not hesitate to use BPE or more complex preprocessing ;)__
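To make the BPE suggestion concrete, here is a toy sketch of how byte-pair-encoding merges can be learned from word frequencies. It is a simplified illustration, not the `subword-nmt` implementation; the function names, the `</w>` end-of-word marker convention, and the tiny corpus are assumptions chosen for the example.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(words, num_merges):
    """Return the learned merge list and the final segmented vocabulary."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): c for w, c in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


corpus = ("low " * 5 + "lower " * 2 + "newest " * 6 + "widest " * 3).split()
merges, vocab = learn_bpe(corpus, num_merges=4)
print(merges)
print(vocab)
```

On real data you would learn the merges on the training split only and then apply them to both the training and test texts.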

Here are tokens from the source (EN) corpus:

And from the target (RU) corpus:

And here is an example from the training dataset:

Let's check the length distributions:
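A minimal sketch of such a check, using only the standard library and whitespace tokenization. The helper names, the bucket size, and the three sample sentences (taken from the training output shown earlier) are assumptions for illustration; with a real subword tokenizer the lengths would be larger.

```python
from collections import Counter


def length_histogram(sentences, bucket_size=10):
    """Count sentences per length bucket, measured in whitespace tokens."""
    buckets = Counter()
    for sentence in sentences:
        n_tokens = len(sentence.split())
        buckets[(n_tokens // bucket_size) * bucket_size] += 1
    return dict(sorted(buckets.items()))


def length_percentile(sentences, q=0.99):
    """Token length below which roughly a fraction q of the sentences fall."""
    lengths = sorted(len(sentence.split()) for sentence in sentences)
    idx = min(len(lengths) - 1, int(q * len(lengths)))
    return lengths[idx]


sample_en = [
    "Free private parking is available on site.",
    "Featuring a balcony, all units have a seating and dining area.",
    "The beach of Agia Triada is 1.2 km away from Venere Apartments.",
]

hist = length_histogram(sample_en)
for start, count in hist.items():
    print(f"{start:3d}-{start + 9:3d} tokens: {'#' * count}")
print("99th percentile length:", length_percentile(sample_en))
```

A high percentile of the token lengths is usually a more economical choice for the truncation limit than the absolute maximum, since a handful of very long sentences otherwise dictates the padding for every batch they appear in.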

### Model side
__Here comes a simple pipeline for training the NMT model. It almost copies the week03 practice.__

__Let's take a look at our network quality__:

In [None]:
original_text = []
generated_text = []
encoder.eval()
decoder.eval()

for en, ru in tqdm(generate_loader):
    embeds = encode(en)
    generated = decode(embeds, max_length=MAX_LEN, repetition_penalty=None)
    
    original_text.extend(ru)
    generated_text.extend(generated)

# original_text = flatten(original_text)
# generated_text = flatten(generated_text)

In [171]:
original_text

['Апарт-отель Royal Bansko хорошо подходит для зимнего отдыха. К услугам гостей детская комната и пункт проката автомобилей.',
 "Гостевой дом Jemal's находится в поселке Махинджаури, на берегу Черного моря, всего в 100 метрах от пляжа Махинджаури.",
 'На завтрак подают свежие фрукты, домашние пирожные, йогурт и хлопья.',
 'Гостевой дом Orlinds Tunas находится в 5 минутах ходьбы от пещеры Срити и в 1 часе езды от улицы Малиоборо и международного аэропорта Ади Сучипто.',
 'Во всех апартаментах имеется мини-кухня, отдельная гостиная зона и телевизор с плоским экраном с кабельными каналами. Из окон открывается вид на океан.',
 'Современные апартаменты Belavista расположены в 600 метрах от набережной в городе Сплит. К услугам гостей бесплатный Wi-Fi, отдельный балкон и бесплатная частная парковка.',
 'К услугам гостей телевизор, DVD-плеер, терраса и обеденный стол.',
 'В распоряжении жильцов — гостиный уголок, телевизор с плоским экраном, гардеробная и ванная комната с душем и унитазом.',
 

In [172]:
generated_text

['', '', '', '', '', '', '', '', '', '']

In [111]:
corpus_bleu([[text] for text in original_text], generated_text) * 100

14.139920232081806
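One thing to keep in mind when reading this number: `nltk`'s `corpus_bleu` expects every reference and hypothesis to be a list of tokens. If raw strings are passed, as in the cell above, they are iterated character by character, so the score is effectively a character-level BLEU. The simplified, dependency-free sketch below (one reference per hypothesis, uniform weights, no smoothing; not the `nltk` implementation) shows how much the two variants can differ. The function names and the example sentences are assumptions for illustration.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Count n-grams of a sequence (a list of tokens or a string of characters)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def corpus_bleu_sketch(references, hypotheses, max_n=4):
    """Simplified corpus BLEU: clipped n-gram precisions and a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


refs = ["free private parking is available on site"]
hyps = ["free parking is available on site"]

char_score = corpus_bleu_sketch(refs, hyps)  # strings: character n-grams
word_score = corpus_bleu_sketch([r.split() for r in refs], [h.split() for h in hyps])
empty_score = corpus_bleu_sketch(refs, [""])  # empty hypotheses score zero in this sketch

print("character-level:", round(char_score * 100, 2))
print("word-level:     ", round(word_score * 100, 2))
print("empty output:   ", empty_score)
```

To get a word-level score from `nltk`, the same idea applies: split both the references and the generated texts into tokens before calling `corpus_bleu`, for example `corpus_bleu([[ref.split()] for ref in original_text], [hyp.split() for hyp in generated_text])`.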