# Approach 3
The encoder is [rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) \
The decoder is [distilgpt2](https://huggingface.co/distilgpt2) \
The work is based on this [notebook](https://github.com/huggingface/blog/blob/main/warm-starting-encoder-decoder.md)

* The dataset is preprocessed the same way as in the second approach.
* The tokenizers took a long time to sort out. distilgpt2 lacks some of the special tokens this model needs, so I added them to its tokenizer. They cannot simply be copied over from the other tokenizer, because training then fails with a CUDA error. Instead, all required special tokens are mapped to the same id. That id also has to be passed to the data collator as `label_pad_token_id`: with the default label padding id of -100 the model could not learn to predict sentences of different lengths.
* Most hyperparameters match the model from iteration 2 exactly, to make the comparison more meaningful.
* The model also uses beam search, which noticeably improves the BLEU score. The length of generated texts is limited to the range [5, 30].
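
The label-padding point can be illustrated with a small sketch. It pads two made-up label sequences the way a collator would, once with the ignore index -100 and once with the shared `<|endoftext|>` id (50256 in the GPT-2 vocabulary), and counts how many positions contribute to the loss. The helpers `pad_labels` and `count_supervised` are hypothetical names used only for this illustration; in the notebook the padding is done by `DataCollatorForSeq2Seq`.

```python
IGNORE_INDEX = -100   # positions with this label are skipped by the cross-entropy loss
EOS_ID = 50256        # id of <|endoftext|> in the GPT-2 vocabulary


def pad_labels(batch, pad_id):
    """Right-pad every label sequence in the batch to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]


def count_supervised(padded):
    """Number of positions per sequence that contribute to the loss."""
    return [sum(1 for tok in seq if tok != IGNORE_INDEX) for seq in padded]


# Two made-up target sequences; the GPT-2 tokenizer does not append EOS itself.
labels = [[34, 585, 13], [2953, 49595, 2533, 668, 13]]

ignored = pad_labels(labels, IGNORE_INDEX)
as_eos = pad_labels(labels, EOS_ID)

print(count_supervised(ignored))   # padding is invisible to the loss
print(count_supervised(as_eos))    # padding positions are trained as EOS
print(EOS_ID in ignored[0], EOS_ID in as_eos[0])
```

With -100 the shorter sentence never contains an end marker in its labels, so nothing tells the decoder where to stop. Padding with the `<|endoftext|>` id turns the padded tail into a training signal for stopping.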

In [1]:
import os
path_do_data = '../../datasets/Machine_translation_EN_RU/data.txt'
if not os.path.exists(path_do_data):
    print("Dataset not found locally. Downloading from github.")
    !wget https://raw.githubusercontent.com/neychev/made_nlp_course/master/datasets/Machine_translation_EN_RU/data.txt -nc
    path_do_data = './data.txt'

Dataset not found locally. Downloading from github.
File ‘data.txt’ already there; not retrieving.



In [2]:
from typing import Sequence
import pandas as pd
from datasets import Dataset, DatasetDict


def prepare_dataset(path_to_data: str, ratio: Sequence[float] = (0.8, 0.15, 0.05)):
    """Read tab-separated en/ru pairs and split them in file order into train/validation/test."""
    data = pd.read_csv(path_to_data, header=None, sep='\t')
    data.columns = ['en', 'ru']
    n_rows = data.shape[0]
    left_border = 0.0  # fraction of the data already assigned to earlier splits
    res = {}
    for name, size in zip(('train', 'validation', 'test'), ratio):
        split_data = data.iloc[int(n_rows * left_border): int(n_rows * (left_border + size))]
        # one {'en': ..., 'ru': ...} dict per row
        res[name] = Dataset.from_dict({'translation': split_data[['en', 'ru']].to_dict('records')})
        left_border += size
    return DatasetDict(res)


In [3]:
# load_metric is deprecated in recent `datasets` releases; `evaluate.load("sacrebleu")` replaces it there
from datasets import load_metric


raw_datasets = prepare_dataset(path_do_data)
metric = load_metric("sacrebleu")

In [4]:
src_checkpoint = "cointegrated/rubert-tiny2"
trg_checkpoint = "distilgpt2"

In [5]:
from transformers import AutoTokenizer

src_tokenizer = AutoTokenizer.from_pretrained(src_checkpoint)

In [6]:
# GPT-2 has no pad/cls/sep tokens, so all three are mapped to the existing
# <|endoftext|> token instead of copying the special tokens of src_tokenizer.
trg_tokenizer = AutoTokenizer.from_pretrained(
    trg_checkpoint,
    pad_token='<|endoftext|>',
    cls_token='<|endoftext|>',
    sep_token='<|endoftext|>',
)
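
A plausible explanation for the CUDA error seen when special tokens are copied from the BERT tokenizer (this is an assumption; the training run only reported a CUDA error) is that newly added tokens receive ids past the end of the decoder's embedding matrix. The sketch below imitates an embedding lookup with a plain list. `VOCAB_SIZE` matches distilgpt2, while `lookup` is a hypothetical stand-in for the embedding layer.

```python
VOCAB_SIZE = 50257          # size of the distilgpt2 vocabulary
EOS_ID = VOCAB_SIZE - 1     # <|endoftext|> has the last id, 50256

# Stand-in for the embedding matrix: one dummy row per known token id.
embedding_rows = [[0.0] for _ in range(VOCAB_SIZE)]


def lookup(token_id):
    """Return the embedding row for token_id, as an embedding layer would."""
    if not 0 <= token_id < len(embedding_rows):
        raise IndexError(f"token id {token_id} is outside the embedding matrix")
    return embedding_rows[token_id]


# Reusing an existing token for pad/cls/sep stays inside the matrix.
lookup(EOS_ID)

# A newly added token would get the next free id, which has no row (assumed failure mode).
new_token_id = VOCAB_SIZE
try:
    lookup(new_token_id)
except IndexError as err:
    print(err)
```

Under that assumption, the other way out would be to add the tokens and resize the decoder embeddings; the notebook takes the simpler route of reusing `<|endoftext|>`.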

In [7]:
max_input_length = 128
max_target_length = 128
source_lang = "ru"
target_lang = "en"

In [8]:
prefix=""

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = src_tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the targets with the decoder's tokenizer. The GPT-2 tokenizer does not
    # append an EOS token, so the labels end with the last token of the sentence.
    labels = trg_tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [9]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[2, 33353, 26085, 3174, 691, 11321, 314, 48452, 16, 314, 23, 46493, 54803, 733, 26995, 17, 77620, 23199, 18, 3], [2, 282, 4769, 34748, 5313, 57981, 603, 5786, 15772, 26908, 37915, 8754, 59124, 31465, 320, 329, 20096, 865, 17, 3307, 18, 42515, 6716, 33222, 69346, 34894, 35746, 320, 42606, 314, 14151, 18, 3]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[34, 585, 25418, 12696, 318, 22765, 287, 309, 33473, 23267, 11, 257, 513, 12, 11374, 2513, 1497, 422, 9281, 22844, 4564, 13], [2953, 49595, 2533, 668, 64, 27046, 345, 481, 1064, 257, 1987, 12, 9769, 2166, 6915, 11, 2119, 2139, 11, 290, 257, 26906, 2318, 13]]}

In [10]:
from transformers import EncoderDecoderModel

bert2gpt = EncoderDecoderModel.from_encoder_decoder_pretrained(src_checkpoint, trg_checkpoint)
# set the special tokens for the encoder-decoder model
bert2gpt.config.decoder_start_token_id = trg_tokenizer.cls_token_id
bert2gpt.config.eos_token_id = trg_tokenizer.sep_token_id
bert2gpt.config.pad_token_id = trg_tokenizer.pad_token_id
# configure the beam search parameters
bert2gpt.config.max_length = 30
bert2gpt.config.min_length = 5
bert2gpt.config.no_repeat_ngram_size = 3
bert2gpt.config.early_stopping = True
bert2gpt.config.length_penalty = 2.0
bert2gpt.config.num_beams = 2

Some weights of the model checkpoint at cointegrated/rubert-tiny2 were not used when initializing BertModel: ['cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at distilgpt2 and are newly initiali
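
To show what the generation settings (two beams, lengths between 5 and 30 tokens) mean in practice, here is a minimal sketch of beam search over a toy model. It is a simplified variant: it ignores `length_penalty`, `no_repeat_ngram_size` and `early_stopping`, and `toy_log_probs` is a hypothetical stand-in for the decoder, not anything from the library.

```python
import math

EOS = 0  # hypothetical end-of-sequence id in a toy 4-token vocabulary


def toy_log_probs(prefix):
    """Hypothetical stand-in for the decoder: log-probabilities of the next token.

    The probability of EOS grows with the prefix length; tokens 1-3 share the rest.
    """
    p_eos = min(0.9, 0.1 + 0.2 * len(prefix))
    rest = (1.0 - p_eos) / 3
    return [math.log(p) for p in (p_eos, rest, rest, rest)]


def beam_search(num_beams=2, min_length=5, max_length=30):
    """Return the best (tokens, score) pair found by a simplified beam search."""
    beams = [([], 0.0)]   # unfinished hypotheses: (tokens, summed log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            for tok, log_p in enumerate(toy_log_probs(tokens)):
                # EOS is not allowed before the hypothesis has min_length tokens
                if tok == EOS and len(tokens) < min_length:
                    continue
                candidates.append((tokens + [tok], score + log_p))
        # keep only the num_beams highest-scoring continuations
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            (finished if tokens[-1] == EOS else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)   # hypotheses cut off at max_length
    return max(finished, key=lambda c: c[1])


tokens, score = beam_search()
print(len(tokens), tokens[-1] == EOS)
```

Without the `min_length` check the toy model would stop after a couple of tokens, which is the kind of overly short output the lower bound is meant to prevent.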

In [11]:
bert2gpt

EncoderDecoderModel(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(83828, 312, padding_idx=0)
      (position_embeddings): Embedding(2048, 312)
      (token_type_embeddings): Embedding(2, 312)
      (LayerNorm): LayerNorm((312,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=312, out_features=312, bias=True)
              (key): Linear(in_features=312, out_features=312, bias=True)
              (value): Linear(in_features=312, out_features=312, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=312, out_features=312, bias=True)
              (LayerNorm): LayerNorm((312,), eps=1e-12, elementwise_a

In [12]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(bert2gpt):,} trainable parameters')

The model has 125,530,152 trainable parameters


In [13]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In [14]:
batch_size=30

args = Seq2SeqTrainingArguments(
    "bert2gpt",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    warmup_steps=1000,
    lr_scheduler_type="cosine_with_restarts",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=5,
    num_train_epochs=15,
    predict_with_generate=True,
    fp16=True,
    dataloader_num_workers=6,
    load_best_model_at_end=True,
    label_smoothing_factor=0.01,
)
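
To get a feel for the learning-rate schedule these arguments request, here is a minimal sketch of linear warmup followed by cosine decay. It assumes a single cosine cycle (which appears to be the library default when no cycle count is given) and takes the total of 20010 optimization steps from the training log. `lr_at` is a hypothetical helper, not a Trainer API.

```python
import math

BASE_LR = 5e-5
WARMUP_STEPS = 1000
TOTAL_STEPS = 20010   # reported in the training log
NUM_CYCLES = 1        # assumption: a single cycle, i.e. no actual restarts


def lr_at(step):
    """Learning rate at a given optimizer step under warmup + cosine decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    if progress >= 1.0:
        return 0.0
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * ((NUM_CYCLES * progress) % 1.0)))


for step in (0, 500, 1000, 10505, 20010):
    print(step, f"{lr_at(step):.2e}")
```

The rate climbs linearly to `learning_rate` over the first 1000 steps and then decays towards zero by the end of training.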

In [15]:
import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = trg_tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, trg_tokenizer.pad_token_id)
    decoded_labels = trg_tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != trg_tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

In [16]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)


In [17]:
import os
from transformers import DataCollatorForSeq2Seq
os.environ["TOKENIZERS_PARALLELISM"] = "false"

data_collator = DataCollatorForSeq2Seq(
    src_tokenizer, 
    model=bert2gpt, 
    label_pad_token_id=trg_tokenizer.pad_token_id
)
trainer = Seq2SeqTrainer(
    bert2gpt,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
trainer.train()

Using amp half precision backend
The following columns in the training set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: token_type_ids, translation. If token_type_ids, translation are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 40000
  Num Epochs = 15
  Instantaneous batch size per device = 30
  Total train batch size (w. parallel, distributed & accumulation) = 30
  Gradient Accumulation steps = 1
  Total optimization steps = 20010




The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: token_type_ids, translation. If token_type_ids, translation are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 7500
  Batch size = 30
Saving model checkpoint to bert2gpt/checkpoint-1334
Configuration saved in bert2gpt/checkpoint-1334/config.json
Model weights saved in bert2gpt/checkpoint-1334/pytorch_model.bin
Deleting older checkpoint [bert2gpt/checkpoint-9600] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: token_type_ids, translation. If token_type_ids, translation are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 7500
  Batch size = 30
Saving model checkpoint to ber

TrainOutput(global_step=20010, training_loss=0.7693717367466779, metrics={'train_runtime': 5814.8653, 'train_samples_per_second': 103.184, 'train_steps_per_second': 3.441, 'total_flos': 1.043200951742304e+16, 'train_loss': 0.7693717367466779, 'epoch': 15.0})
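
The step counts in the log follow directly from the dataset size and the batch size. A quick check, assuming a single device and no gradient accumulation, as the log states:

```python
import math

num_examples = 40000
batch_size = 30
num_epochs = 15

# the last batch of each epoch is smaller, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

This matches the first checkpoint name (`checkpoint-1334`, saved at the end of epoch 1) and the reported total of 20010 optimization steps.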

In [18]:
preds = trainer.predict(tokenized_datasets["test"])

The following columns in the test set don't have a corresponding argument in `EncoderDecoderModel.forward` and have been ignored: token_type_ids, translation. If token_type_ids, translation are not expected by `EncoderDecoderModel.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2500
  Batch size = 30


In [20]:
preds.metrics

{'test_loss': 0.8225914239883423,
 'test_bleu': 31.2043,
 'test_gen_len': 17.846,
 'test_runtime': 48.6383,
 'test_samples_per_second': 51.4,
 'test_steps_per_second': 1.727}

In [30]:
generated_text = trg_tokenizer.batch_decode(preds.predictions, skip_special_tokens=True)
labels = np.where(preds.label_ids != -100, preds.label_ids, trg_tokenizer.pad_token_id)
original_text = trg_tokenizer.batch_decode(labels, skip_special_tokens=True)

In [31]:
# sample without replacement so the same example is not printed twice
random_idx = np.random.choice(len(generated_text), 10, replace=False)
for idx in random_idx:
    print(f"generated: {generated_text[idx]}\noriginal: {original_text[idx]}")

generated: Set in the centre of Kovachevica, Ristorante Privatova features a garden with barbecue facilities and free luggage storage.
original: Ristevata Guest House enjoys a central location in Kovachevitsa and offers a garden with free barbecue facilities and free luggage storage.
generated: A restaurant with a private beach area is just 300 metres away.
original: A bar restaurant, featuring its own private beach area, is just 300 metres away.
generated: You will find a 24-hour front desk at the property.
original: The front desk is available 24/7.
generated: Set in Braşov, this apartment features a balcony with mountain views.
original: Set in Braşov, this apartment features a balcony with mountains views.
generated: Located in a quiet residential district of Riga, Volga Stations are 5 km from the city centre.
original: Apartment Volguntes Street is housed in a quiet and green residential district of Riga, within 5 km from the city centre.
generated: Extras include a wardrobe and a

A rather interesting take on translation. Compared with the OPUS-based model:
* The sentences look syntactically correct. In our dataset the reference translations were often terse descriptions, and both models expand them into full sentences.
* The model makes the same kind of mistakes when translating proper names.
* BLEU is lower, because the regular translation model translated more literally.
* Since GPT serves as the generator, some interesting behaviour shows up: in the example about rooms and mosquito nets, the model managed to work out that "they" most likely refers to the rooms.

I like the translation. It is freer, yet still adequate.