# RNN & Attention: HW

Hi! This is your homework: build a model that translates text from German into English. The model is trained on the [wmt-14](https://huggingface.co/datasets/wmt14) dataset. It is graded by BLEU on the test split and by 10 example translations produced by your model. This notebook provides a skeleton for training a Transformer model, but you may also use an RNN if you think you can train one for this task. The main goal is to produce `submission.yaml` using neural networks.

**Attention!** You may not use the `transformers` library in this homework.

In [1]:
import subprocess
import sys


IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    subprocess.run("pip install datasets nltk gensim einops evaluate", shell=True)
    subprocess.run("python -m nltk.downloader punkt", shell=True)

In [2]:
import torch
import nltk
import einops
import evaluate
import numpy as np

from datasets import load_dataset

In [3]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [4]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [5]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [6]:
bleu = evaluate.load("bleu")

# Data

In this part, prepare the data for training. Don't forget to add the "BOS", "EOS" and "UNK" tokens to your vocabularies.

In [7]:
wmt14 = load_dataset("wmt14", "de-en")

In [8]:
wmt14

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 4508785
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 3000
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 3003
    })
})

In [9]:
wmt14['train'][0]['translation']['de']

'Wiederaufnahme der Sitzungsperiode'

In [10]:
wmt14['train'][0]['translation']

{'de': 'Wiederaufnahme der Sitzungsperiode', 'en': 'Resumption of the session'}

In [11]:
len(wmt14['train'])

4508785

In [12]:
import random

SEED = 42

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [13]:
def tokenize_pipeline(x, tokenizer = nltk.WordPunctTokenizer(), lemmatizer = nltk.WordNetLemmatizer()):
  tokens = tokenizer.tokenize(x.lower())
  return [lemmatizer.lemmatize(token) for token in tokens if token.isalpha()]

def tokenize_all_sentence(sent):
  return {"en_tokens": tokenize_pipeline(sent["translation"]["en"]), 
          "de_tokens": tokenize_pipeline(sent["translation"]["de"])}
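
To make the effect of `tokenize_pipeline` easier to see, here is a minimal sketch that approximates it with the standard library only. It assumes that `nltk.WordPunctTokenizer` splits on the pattern `\w+|[^\w\s]+`, and it leaves out the lemmatization step; `simple_tokenize` is a hypothetical name used only for this illustration.

```python
import re

# Approximation of nltk.WordPunctTokenizer: runs of word characters
# or runs of punctuation (assumed pattern, lemmatization is omitted).
_WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def simple_tokenize(text):
    """Lowercase, split into word/punctuation runs, keep alphabetic tokens only."""
    tokens = _WORD_PUNCT.findall(text.lower())
    return [tok for tok in tokens if tok.isalpha()]


print(simple_tokenize("Wiederaufnahme der Sitzungsperiode"))
print(simple_tokenize("Resumption of the session, 2 p.m.!"))
```

Because of the `isalpha()` filter, digits and punctuation never reach the model, so numbers and punctuation marks cannot appear in the translations either.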


Because processing the full dataset takes very long, we limit training to the first 300k examples.

In [14]:
wmt14["train"] = wmt14["train"].select(range(300000))

Tokenize the data.

In [15]:
wmt14["train"][0]

{'translation': {'de': 'Wiederaufnahme der Sitzungsperiode',
  'en': 'Resumption of the session'}}

In [16]:
df = wmt14.map(tokenize_all_sentence)

In [17]:
df = df.filter(lambda sent: len(sent['de_tokens']) <= 254 and len(sent["en_tokens"]) <= 254)  # 254 tokens + 2 for BOS/EOS fit the 256 positions



In [18]:
df

DatasetDict({
    train: Dataset({
        features: ['translation', 'en_tokens', 'de_tokens'],
        num_rows: 299999
    })
    validation: Dataset({
        features: ['translation', 'en_tokens', 'de_tokens'],
        num_rows: 3000
    })
    test: Dataset({
        features: ['translation', 'en_tokens', 'de_tokens'],
        num_rows: 3003
    })
})

In [19]:
df['train'][0]

{'translation': {'de': 'Wiederaufnahme der Sitzungsperiode',
  'en': 'Resumption of the session'},
 'en_tokens': ['resumption', 'of', 'the', 'session'],
 'de_tokens': ['wiederaufnahme', 'der', 'sitzungsperiode']}

In [20]:
# df['train']['de_tokens']

Add the special tokens and drop rarely used words.

In [21]:
import collections

In [22]:
pad, bos, eos, unk = (0, 1, 2, 3)

In [23]:
unk

3

In [24]:
def vocab(tokens):
  counter = collections.Counter(tokens)
  token_freqs = sorted(counter.items(), key=lambda x: x[0])
  token_freqs.sort(key=lambda x: x[1], reverse=True)

  pad, bos, eos, unk = (0, 1, 2, 3)
  tokens = ['<pad>', '<bos>', '<eos>', '<unk>']
    
  tokens += [token for token, freq in token_freqs if freq >= 5]  # keep tokens that occur at least 5 times
  idx_to_token = []
  token_to_idx = dict()

  for token in tokens:
    idx_to_token.append(token)
    token_to_idx[token] = len(idx_to_token) - 1

  return token_to_idx
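
The vocabulary logic above can be tried on a toy corpus. The sketch below is an illustration, not the notebook's code: `build_toy_vocab` and `encode` are hypothetical helpers, and the frequency threshold is lowered to 2 so that the tiny corpus keeps some words. Special tokens take indices 0 to 3, and the remaining tokens are ordered by descending frequency with ties broken alphabetically, as in `vocab`.

```python
from collections import Counter

SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>"]  # indices 0..3, as in the notebook


def build_toy_vocab(sentences, min_freq=2):
    """Map tokens to ids; tokens rarer than min_freq are left out."""
    counts = Counter(tok for sent in sentences for tok in sent)
    # Descending frequency, ties broken alphabetically.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}


def encode(tokens, vocab):
    """Replace every out-of-vocabulary token with the <unk> id."""
    unk_id = vocab["<unk>"]
    return [vocab.get(tok, unk_id) for tok in tokens]


corpus = [["das", "haus", "ist", "alt"], ["das", "auto", "ist", "neu"]]
toy_vocab = build_toy_vocab(corpus)
print(toy_vocab)
print(encode(["das", "haus", "ist", "neu"], toy_vocab))
```

Words seen only once ("haus", "neu") fall back to the `<unk>` id, which is the price of a smaller vocabulary.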

In [25]:
# df['train']['de_tokens']

In [26]:
def build_vocab(tokens):
        tokens = [word for words in tokens for word in words]
        return vocab(tokens)

In [27]:
de_vocab, en_vocab = build_vocab(df['train']['de_tokens']), build_vocab(df['train']['en_tokens'])

In [28]:
len(en_vocab), len(de_vocab)

(15930, 34815)

In [29]:
'<pad>' in de_vocab

True

In [30]:
de_vocab['ich']

12

Build the dataset.


In [31]:
class TranslationDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, en_vocab=en_vocab, de_vocab=de_vocab):
        self.en_vocab = en_vocab
        self.de_vocab = de_vocab
        
        def convert_words_to_ids(example):
            return {"en_ids": [self.en_vocab[token] if token in self.en_vocab else unk for token in example["en_tokens"]],
                    "de_ids": [self.de_vocab[token] if token in self.de_vocab else unk for token in example["de_tokens"]]}
        
        dataset = dataset.map(convert_words_to_ids)
        self.dataset = dataset
    
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, index):
        example = self.dataset[index]
        return torch.tensor(example["de_ids"]), torch.tensor(example["en_ids"])

In [32]:
train_dataset = TranslationDataset(df["train"])

In [33]:
valid_dataset = TranslationDataset(df["validation"])

In [34]:
test_dataset = TranslationDataset(df["test"])

In [35]:
train_dataset.__getitem__(2)

(tensor([   33,    26,   758,   856,    15,     5, 32312,     3,     3,    18,
          4409,   113,    31,   148,   980,   128,    80,   779,    13,  3138,
          5489,   917]),
 tensor([  382,     8,    47,    26,    21,   615,     4, 11665,  3690,     3,
          1443,     6,  8696,   185,     4,    76,     9,     8,   210,     5,
            51,  2147,     8,  1367,     5,  1085,  1054,    10,  1450,   118,
          4355]))

In [36]:
# from torch.utils.data import DataLoader

# train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)

In [37]:
def collate_fn(batch):
    data_batch = sorted(batch, key=lambda x: - len(x[0]))

    de_ids, en_ids = [], []
    for (a, b) in data_batch:
      de_ids.append(torch.cat(([torch.tensor([bos]), a, torch.tensor([eos])]), dim=0))
      en_ids.append(torch.cat(([torch.tensor([bos]), b, torch.tensor([eos])]), dim=0))

    de_ids = torch.nn.utils.rnn.pad_sequence(de_ids, padding_value=pad, batch_first=True)
    en_ids = torch.nn.utils.rnn.pad_sequence(en_ids, padding_value=pad, batch_first=True)

    return de_ids, en_ids
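
What `collate_fn` does to each side of the batch can be sketched in plain Python. This is an illustration under the assumption that the ids are `pad=0`, `bos=1`, `eos=2` as defined earlier; `pad_batch` is a hypothetical helper, and lists stand in for tensors.

```python
PAD, BOS, EOS = 0, 1, 2  # same ids as in the notebook


def pad_batch(sequences, pad_id=PAD):
    """Wrap each sequence in <bos>/<eos>, then right-pad to the longest one."""
    wrapped = [[BOS] + list(seq) + [EOS] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in wrapped]


batch = pad_batch([[12, 7, 40], [5]])
for row in batch:
    print(row)
```

Every row ends up with the same length, which is what lets the batch be stacked into one tensor; the padding ids are later ignored by the masks and by the loss.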

In [38]:
from torch.utils.data import DataLoader

In [39]:
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=collate_fn)
valid_dataloader = torch.utils.data.DataLoader(valid_dataset, batch_size=16, shuffle=False, collate_fn=collate_fn)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False, collate_fn=collate_fn)


In [70]:
src, trg = next(iter(train_dataloader))
src.shape, trg.shape

(torch.Size([16, 61]), torch.Size([16, 65]))

In [41]:
# src

# Model

Build a model that can translate. It needs an `Encoder` and a `Decoder`. The encoder takes the German text and produces a representation of it. The decoder takes that representation of the German text and turns it into English text.

In [42]:
# If you need extra modules such as Attention or a Transformer layer, you can add them here

In [73]:
class MultiHeadAttention(torch.nn.Module):
  def __init__(self, hidden_dim, n_heads, dropout, device):
    super().__init__()

    assert hidden_dim % n_heads == 0

    self.device = device

    self.hidden_dim = hidden_dim
    self.n_heads = n_heads
    self.head_dim = hidden_dim//n_heads
    self.dropout = torch.nn.Dropout(dropout)

    #qkv
    self.q_linear = torch.nn.Linear(hidden_dim, hidden_dim)
    self.k_linear = torch.nn.Linear(hidden_dim, hidden_dim)
    self.v_linear = torch.nn.Linear(hidden_dim, hidden_dim)

    self.out_linear = torch.nn.Linear(hidden_dim, hidden_dim)

    self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)

  def forward(self, query, key, value, mask = None):

    batch_size = query.shape[0]
    seq = query.shape[1]

    Q = self.q_linear(query)
    K = self.k_linear(key)
    V = self.v_linear(value)

    batch_size, hidden_dim = Q.size(0), Q.size(2)
    key_len, value_len, query_len = K.size(1), V.size(1), Q.size(1) 
    
    K = K.reshape(batch_size, key_len, self.n_heads, -1) # (batch_size, key_len, num_heads, head_dim)
    V = V.reshape(batch_size, value_len, self.n_heads, -1) # (batch_size, value_len, num_heads, head_dim)
    Q = Q.reshape(batch_size, query_len, self.n_heads, -1) # (batch_size, query_len, num_heads, head_dim)
    
    energy = torch.einsum('bqhd,bkhd->bhqk', [Q, K]) # (batch_size, num_heads, query_len, key_len)
    
    if mask is not None:
      # masked positions get -inf so that softmax assigns them zero weight
      energy = energy.masked_fill(mask == 0, -torch.inf)

    attention = torch.softmax(energy / self.scale, dim = 3)

    x = torch.einsum('bhql,blhd->bqhd', [self.dropout(attention), V])
    x = self.out_linear(x.reshape(batch_size, seq, self.hidden_dim))

    return x, attention
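
The core of `MultiHeadAttention` is scaled dot-product attention. Below is a minimal single-head sketch in NumPy, chosen here so that it runs without PyTorch; it is an illustration rather than the notebook's implementation. The mask is assumed to be boolean with `True` meaning "may attend". The notebook masks before dividing by the scale, which gives the same result because `-inf` stays `-inf`.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (q_len, d); k, v: (k_len, d); mask: (q_len, k_len), True = may attend."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability.
    # A row that is masked everywhere would produce NaN; not handled in this sketch.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 4))
mask = np.array([[True, True, False]] * 3)  # pretend the last key is padding
out, weights = scaled_dot_product_attention(q, k, v, mask)
print(weights.round(3))
```

Each row of `weights` sums to one, and the column of the masked key is exactly zero, so padding never contributes to the output.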


In [74]:
class FeedForward(torch.nn.Module):
  def __init__(self,  hidden_dim, pf_dim, dropout):
    super().__init__()

    self.fc_1 = torch.nn.Linear(hidden_dim, pf_dim)
    self.fc_2 = torch.nn.Linear(pf_dim, hidden_dim)

    self.dropout = torch.nn.Dropout(dropout)

  def forward(self, x):
    # position-wise feed-forward: Linear -> ReLU -> Dropout -> Linear
    x = self.dropout(torch.relu(self.fc_1(x)))
    x = self.fc_2(x)

    return x

For the Encoder layers you can copy the code from the seminar:

In [75]:
class EncoderTransformerLayer(torch.nn.Module):
    def __init__(self, hidden_dim, n_heads, pf_dim, dropout, device):
        super().__init__()
        
        self.device = device

        self.attention_norm = torch.nn.LayerNorm(hidden_dim)
        self.ff_norm = torch.nn.LayerNorm(hidden_dim)
        self.attention = MultiHeadAttention(hidden_dim, n_heads, dropout, device)
        self.ff = FeedForward(hidden_dim, pf_dim, dropout)

        self.dropout = torch.nn.Dropout(dropout)
        
    def forward(self, src, mask):
        '''
        src = [batch size, src len, hidden dim]
        mask = [batch size, 1, 1, src len]
        '''

        # self-attention (query, key and value all come from src)
        _src, _ = self.attention(src, src, src, mask)

        # residual connection + layer norm
        src = self.attention_norm(src + self.dropout(_src))

        # feed-forward
        _src = self.ff(src)

        # residual connection + layer norm
        src = self.ff_norm(src + self.dropout(_src))

        return src



The Decoder layer requires modifying the code. Don't forget that the decoder needs a different attention mechanism.
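
The "different attention mechanism" is masked self-attention: position `i` of the target may only look at positions `0..i`, and never at padding. A minimal NumPy sketch of such a mask, assuming `pad = 0` as defined earlier (`make_target_mask` is a hypothetical name; the same idea appears in `make_tmask` of `TranslationModel`):

```python
import numpy as np

PAD = 0  # same padding id as in the notebook


def make_target_mask(trg_ids):
    """trg_ids: (batch, trg_len) -> boolean mask of shape (batch, 1, trg_len, trg_len)."""
    trg_ids = np.asarray(trg_ids)
    trg_len = trg_ids.shape[1]
    pad_mask = (trg_ids != PAD)[:, None, None, :]              # hide padded keys
    causal = np.tril(np.ones((trg_len, trg_len), dtype=bool))  # hide future keys
    return pad_mask & causal


mask = make_target_mask([[1, 8, 9, 0]])
print(mask[0, 0].astype(int))
```

Row `i` of the printed matrix lists the positions that query `i` may attend to; the last column stays zero because the fourth token is padding.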

In [76]:
class DecoderTransformerLayer(torch.nn.Module):
    def __init__(self, hidden_dim, n_heads, pf_dim, dropout, device):
        super().__init__()

        self.device = device
        
        self.attention_norm = torch.nn.LayerNorm(hidden_dim)
        self.encoder_attention_norm = torch.nn.LayerNorm(hidden_dim)
        self.ff_norm = torch.nn.LayerNorm(hidden_dim)
        self.attention_self = MultiHeadAttention(hidden_dim, n_heads, dropout, device)
        # self.encoder_mlp = MultiHeadAttention(hidden_dim, n_heads, dropout, device)
        self.ff = FeedForward(hidden_dim, pf_dim, dropout)
        # self.out_attention = EncoderTransformerLayer(hidden_dim, n_heads, pf_dim, dropout)

        self.dropout = torch.nn.Dropout(dropout)
        
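    def forward(self, trg, src, tmask, smask):
        # masked self-attention over the target prefix
        _trg, _ = self.attention_self(trg, trg, trg, tmask)

        # residual connection + layer norm
        trg = self.attention_norm(trg + self.dropout(_trg))

        # encoder-decoder attention: queries from trg, keys/values from the encoder output.
        # NOTE: this reuses attention_self, so self- and cross-attention share weights;
        # the standard Transformer decoder uses a separate MultiHeadAttention module here.
        _trg, attention = self.attention_self(trg, src, src, smask)

        # residual connection + layer norm
        trg = self.encoder_attention_norm(trg + self.dropout(_trg))

        # feed-forward
        _trg = self.ff(trg)

        # residual connection + layer norm
        trg = self.ff_norm(trg + self.dropout(_trg))

        return trg, attention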



In [77]:
class Encoder(torch.nn.Module):
    def __init__(self, en_dictionary_size, hidden_dim, n_layers,
                 n_heads, pf_dim, dropout, device, max_lenght = 256):
        super().__init__()
        
        self.device = device

        self.word_embedding = torch.nn.Embedding(en_dictionary_size, hidden_dim)
        self.pos_embedding = torch.nn.Embedding(max_lenght, hidden_dim)
        
        self.layers = torch.nn.ModuleList([EncoderTransformerLayer(hidden_dim,
                                                                   n_heads,
                                                                   pf_dim,
                                                                   dropout,
                                                                   device) for _ in range(n_layers)])
        
        self.dropout = torch.nn.Dropout(dropout)

        # scale token embeddings by sqrt(hidden_dim) before adding positional embeddings
        self.scale = torch.sqrt(torch.FloatTensor([hidden_dim])).to(device)
        
    def forward(self, src, mask):
        '''
        src = [batch size, src len]
        mask = [batch size, 1, 1, src len]
        '''
        batch_size, src_len = src.shape

        # pos = [batch size, src len]
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)

        src = self.dropout( (self.word_embedding(src) * self.scale) + self.pos_embedding(pos))

        # src = [batch size, src len, hidden dim]
        for layer in self.layers:
          src = layer(src, mask)

        return src

In [78]:
class Decoder(torch.nn.Module):
    def __init__(self, en_dictionary_size, hidden_dim, n_layers,
                 n_heads, pf_dim, dropout, device, max_lenght = 256):
        super().__init__()
        
        self.device = device

        self.word_embedding = torch.nn.Embedding(en_dictionary_size, hidden_dim)
        self.pos_embedding = torch.nn.Embedding(max_lenght, hidden_dim)
        
        self.layers = torch.nn.ModuleList([DecoderTransformerLayer(hidden_dim,
                                                                   n_heads,
                                                                   pf_dim,
                                                                   dropout,
                                                                   device) for _ in range(n_layers)])
        
        self.fc_out = torch.nn.Linear(hidden_dim, en_dictionary_size)

        self.dropout = torch.nn.Dropout(dropout)

        # scale token embeddings by sqrt(hidden_dim) before adding positional embeddings
        self.scale = torch.sqrt(torch.FloatTensor([hidden_dim])).to(device)
        
    def forward(self, trg, src, tmask, smask):
        '''
        trg = [batch size, trg len]
        src = [batch size, src len, hidden dim] (encoder output)
        tmask = [batch size, 1, trg len, trg len]
        smask = [batch size, 1, 1, src len]
        '''
        batch_size, trg_len = trg.shape

        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)

        trg = self.dropout( (self.word_embedding(trg) * self.scale) + self.pos_embedding(pos))

        for layer in self.layers:
          trg, attention = layer(trg, src, tmask, smask)

        # logits over the English vocabulary and the last layer's cross-attention weights
        return self.fc_out(trg), attention

In [79]:
class TranslationModel(torch.nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.pad = pad
        self.device = device

    def make_smask(self, src):
      return (src != pad).unsqueeze(1).unsqueeze(2)

    def make_tmask(self, trg):
      tmask_pad = (trg != pad).unsqueeze(1).unsqueeze(2)

      trg_len = trg.shape[1]
      trg_sub_mask = torch.tril(torch.ones(trg_len, trg_len)).bool().to(device)

      return tmask_pad & trg_sub_mask
        
    def forward(self, inputs):
        src, trg = inputs
        smask = self.make_smask(src)
        tmask = self.make_tmask(trg)

        encoder_output = self.encoder(src, smask)
        decoder_output = self.decoder(trg, encoder_output, tmask, smask)
        
        return decoder_output

Create the model, the optimizer and the loss function. In our case the loss function checks the predicted token at every position, so it is essentially a classifier applied at each position.
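
The "classifier at each position" idea can be sketched in NumPy: the decoder input is the target without its last position, the labels are the target shifted by one, and padded positions are skipped. This illustrates what `CrossEntropyLoss(ignore_index=pad)` computes with its default mean reduction; `token_cross_entropy` is a hypothetical helper, and the uniform logits stand in for a model.

```python
import numpy as np

PAD = 0  # same padding id as in the notebook


def token_cross_entropy(logits, targets, ignore_index=PAD):
    """logits: (n_positions, vocab); targets: (n_positions,). Mean NLL over non-pad positions."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    keep = targets != ignore_index
    return -picked[keep].mean()


trg = np.array([1, 5, 6, 2, 0])                # <bos> w1 w2 <eos> <pad>
decoder_input, labels = trg[:-1], trg[1:]      # predict token t+1 from tokens 0..t
logits = np.zeros((len(labels), 8))            # stand-in: uniform scores over 8 tokens
loss = token_cross_entropy(logits, labels)
print(loss, np.log(8))
```

With uniform predictions the loss equals the logarithm of the vocabulary size, which is a useful sanity check for the first training steps.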

In [80]:
INPUT_DIM = len(de_vocab)
OUTPUT_DIM = len(en_vocab)
HIDDEN_DIM = 256
ENC_LAYERS = 3
DEC_LAYERS = 3
ENC_HEADS = 8
DEC_HEADS = 8
ENC_FF = 512
DEC_FF = 512
DROPOUT = 0.1

In [81]:
enc = Encoder(INPUT_DIM,
       HIDDEN_DIM,
       ENC_LAYERS,
       ENC_HEADS,
       ENC_FF,
       DROPOUT,
       device)

In [82]:
dec = Decoder(OUTPUT_DIM,
       HIDDEN_DIM,
       DEC_LAYERS,
       DEC_HEADS,
       DEC_FF,
       DROPOUT,
       device)

In [83]:
model = TranslationModel(enc, dec)

In [84]:
def weights(x):
  if hasattr(x, 'weight') and x.weight.dim() > 1:
    torch.nn.init.xavier_normal_(x.weight.data)

In [85]:
# model.load_state_dict(torch.load('/content/drive/MyDrive/translate.pt'))

In [86]:
model.apply(weights)

TranslationModel(
  (encoder): Encoder(
    (word_embedding): Embedding(34815, 256)
    (pos_embedding): Embedding(256, 256)
    (layers): ModuleList(
      (0): EncoderTransformerLayer(
        (attention_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (ff_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (attention): MultiHeadAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_linear): Linear(in_features=256, out_features=256, bias=True)
          (k_linear): Linear(in_features=256, out_features=256, bias=True)
          (v_linear): Linear(in_features=256, out_features=256, bias=True)
          (out_linear): Linear(in_features=256, out_features=256, bias=True)
        )
        (ff): FeedForward(
          (fc_1): Linear(in_features=256, out_features=512, bias=True)
          (fc_2): Linear(in_features=512, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (dropout): Dropout

In [87]:
model.parameters()

<generator object Module.parameters at 0x7feba3484a50>

In [88]:
device

'cuda'

In [89]:
model.cuda()

TranslationModel(
  (encoder): Encoder(
    (word_embedding): Embedding(34815, 256)
    (pos_embedding): Embedding(256, 256)
    (layers): ModuleList(
      (0): EncoderTransformerLayer(
        (attention_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (ff_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (attention): MultiHeadAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_linear): Linear(in_features=256, out_features=256, bias=True)
          (k_linear): Linear(in_features=256, out_features=256, bias=True)
          (v_linear): Linear(in_features=256, out_features=256, bias=True)
          (out_linear): Linear(in_features=256, out_features=256, bias=True)
        )
        (ff): FeedForward(
          (fc_1): Linear(in_features=256, out_features=512, bias=True)
          (fc_2): Linear(in_features=512, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (dropout): Dropout

In [90]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-6)
criterion = torch.nn.CrossEntropyLoss(ignore_index=pad)

Number of parameters:

In [91]:
sum(p.numel() for p in model.parameters() if p.requires_grad)

20379962

The weights were initialized explicitly above (Xavier normal) so that training converges faster.


In [92]:
def train(model, optimizer, criterion, clip, dataloader):
  model.train()

  now_loss = 0.0


  for src, trg in dataloader:

    src = src.to(device).long()
    trg = trg.to(device).long()

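    optimizer.zero_grad()

    # teacher forcing: the decoder input is the target without its last position
    output, _ = model((src, trg[:, :-1]))

    # flatten to [batch * (trg len - 1), vocab] and compare with the target shifted by one
    output = output.contiguous().view(-1, output.shape[-1])
    trg = trg[:, 1:].contiguous().view(-1)

    loss = criterion(output, trg)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()

    now_loss += loss.item()

  # NOTE: this divides the sum of per-batch mean losses by the number of examples,
  # not the number of batches, so the value is about batch_size times smaller
  # than the mean per-batch loss
  return now_loss / len(dataloader.dataset)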
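
Since `train` returns the sum of per-batch mean losses divided by the number of examples, the logged values are not per-token losses. The sketch below converts a logged value back to a mean per-batch loss, assuming the default `drop_last=False` (so the number of batches is the ceiling of examples over batch size) and the default mean reduction of the criterion. `per_batch_average` is a hypothetical helper; the input is the epoch-1 training value from the log further down.

```python
import math


def per_batch_average(reported, n_examples, batch_size):
    """Convert (sum of batch-mean losses) / n_examples into the mean loss per batch."""
    n_batches = math.ceil(n_examples / batch_size)
    return reported * n_examples / n_batches


epoch1 = per_batch_average(0.27374951745883314, n_examples=299_999, batch_size=16)
print(round(epoch1, 2), round(math.exp(epoch1), 1))  # loss and the matching perplexity
```

The rescaled value is an average of batch means, so it only approximates the true per-token loss, but it is on the usual cross-entropy scale.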


In [93]:
def eval(model, criterion, dataloader):
  model.eval()

  now_loss = 0.0

  with torch.no_grad():
    for src, trg in dataloader:
      src = src.to(device).long()
      trg = trg.to(device).long()

      output, _ = model((src, trg[:, :-1]))

      output = output.contiguous().view(-1, output.shape[-1])
      trg = trg[:, 1:].contiguous().view(-1) 
      
      loss = criterion(output, trg)

      now_loss += loss.item()

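  # NOTE: divides by the number of examples, not batches, so the value is about
  # batch_size times smaller than the mean per-batch loss
  return now_loss / len(dataloader.dataset)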

In [94]:
CNT_EPOCHS = 5
CLIP = 1 

best_valid_loss = torch.inf

In [95]:
import os

In [96]:
for epoch in range(CNT_EPOCHS):

  train_loss = train(model, optimizer, criterion, CLIP, train_dataloader)
  valid_loss = eval(model, criterion, valid_dataloader)

  if valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    torch.save(model.state_dict(), '/content/drive/MyDrive/val.pt')

  print(f"Epoch: {epoch + 1}")
  print(f"Training Loss: {train_loss}")
  print(f"Evaluation Loss: {valid_loss}")
  print()

Epoch: 1
Training Loss: 0.27374951745883314
Evaluation Loss: 0.28746044023831685

Epoch: 2
Training Loss: 0.20564015389308932
Evaluation Loss: 0.25500587113698325

Epoch: 3
Training Loss: 0.17951200229717038
Evaluation Loss: 0.23786690096060434

Epoch: 4
Training Loss: 0.16495753765627544
Evaluation Loss: 0.2296966049273809

Epoch: 5
Training Loss: 0.1555633551939352
Evaluation Loss: 0.2229332712093989



In [97]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [98]:
torch.save(model.state_dict(), '/content/drive/MyDrive/translate_new.pt')

In [67]:
model.load_state_dict(torch.load('/content/drive/MyDrive/translate.pt'))

<All keys matched successfully>

In [109]:
test_loss = eval(model, criterion, test_dataloader)
print(f"Test Loss: {test_loss}")

Test Loss: 0.23750857111219165


To get a translation, we need a decoding function. It takes the model's prediction at the last position and returns the corresponding token.
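
This procedure is greedy decoding: start from `<bos>`, repeatedly pick the highest-scoring next token, and stop at `<eos>` or at a length limit. A minimal sketch in plain Python, where `toy_scores` is a hypothetical stand-in for a trained model and the ids `bos=1`, `eos=2` follow the notebook:

```python
BOS, EOS = 1, 2  # same ids as in the notebook


def greedy_decode(next_token_scores, max_len=10):
    """next_token_scores(prefix) returns one score per vocabulary entry for the next token."""
    prefix = [BOS]
    while len(prefix) < max_len:
        scores = next_token_scores(prefix)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if best == EOS:
            break
        prefix.append(best)
    return prefix[1:]  # drop <bos>


def toy_scores(prefix):
    """Hypothetical model: emits token 4, then token 5, then <eos>."""
    script = {1: 4, 2: 5}
    target = script.get(len(prefix), EOS)
    return [1.0 if i == target else 0.0 for i in range(6)]


print(greedy_decode(toy_scores))
```

Greedy decoding is the simplest choice; beam search usually gives better translations at a higher cost, but it is not used in this notebook.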

In [100]:
def vocab_hlp(tokens):
  counter = collections.Counter(tokens)
  token_freqs = sorted(counter.items(), key=lambda x: x[0])
  token_freqs.sort(key=lambda x: x[1], reverse=True)

  pad, bos, eos, unk = (0, 1, 2, 3)
  tokens = ['<pad>', '<bos>', '<eos>', '<unk>']
    
  tokens += [token for token, freq in token_freqs if freq >= 5]  # keep tokens that occur at least 5 times
  idx_to_token = []
  token_to_idx = dict()

  for token in tokens:
    idx_to_token.append(token)
    token_to_idx[token] = len(idx_to_token) - 1

  return idx_to_token

In [101]:
def build_vocab_hlp(tokens):
        tokens = [word for words in tokens for word in words]
        return vocab_hlp(tokens)

In [102]:
de_vocab_hlp, en_vocab_hlp = build_vocab_hlp(df['train']['de_tokens']), build_vocab_hlp(df['train']['en_tokens'])

In [135]:
def decoding_function(src, model, max_len=256):
    model.eval()

    src = tokenize_pipeline(src)
    src = ['<bos>'] + src + ['<eos>']
    src_tensor = torch.tensor([de_vocab[token] if token in de_vocab else unk for token in src]).unsqueeze(0).to(device)


    trg_ids = [bos]
    trg_tokens = []

    # greedy decoding; the positional embedding covers 256 positions,
    # so the prefix fed to the decoder is capped at max_len tokens
    while len(trg_ids) <= max_len:

      trg_tensor = torch.tensor(trg_ids).unsqueeze(0).to(device)

      with torch.no_grad():
        output, _ = model((src_tensor, trg_tensor))
        # logits at the last position -> id of the most likely next token
        output = output[0, -1].argmax(-1).item()

      if output == eos:
        break

      trg_ids.append(output)

      # <bos> and <unk> stay in the decoder input but are left out of the output text
      if output != bos and output != unk:
        trg_tokens.append(en_vocab_hlp[output])
      
    return " ".join(trg_tokens)

In [136]:
src = "Guten Morgen!"
trg = decoding_function(src, model)
print(trg)

good news tomorrow


# Result

As the result, you must provide your model's BLEU on the wmt14 test split and translations of 10 sentences from German into English.

In [133]:
df["test"]

Dataset({
    features: ['translation', 'en_tokens', 'de_tokens'],
    num_rows: 3003
})

In [137]:
references = [[" ".join(reference)] for reference in df["test"]["en_tokens"]]
predictions = [decoding_function(example["de"], model) for example in df["test"]["translation"]]
test_bleu = bleu.compute(predictions=predictions, references=references)
print(test_bleu)

{'bleu': 0.08667705639092577, 'precisions': [0.4893401239059021, 0.18164021039224473, 0.07254489728716533, 0.03076923076923077], 'brevity_penalty': 0.7303273599096396, 'length_ratio': 0.7608830584959152, 'translation_length': 45357, 'reference_length': 59611}
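
The fields of this dictionary fit together as follows: BLEU is the geometric mean of the four n-gram precisions multiplied by the brevity penalty, which penalizes translations shorter than the references. The sketch below recomputes the score from the printed statistics. It assumes the standard BLEU definition with uniform weights and non-zero precisions; `bleu_from_stats` is a hypothetical helper, not part of `evaluate`.

```python
import math


def bleu_from_stats(precisions, translation_length, reference_length):
    """Return (bleu, brevity_penalty) from n-gram precisions and corpus lengths."""
    if translation_length > reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - reference_length / translation_length)
    # Geometric mean of the precisions (all assumed to be > 0).
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty * geo_mean, brevity_penalty


precisions = [0.4893401239059021, 0.18164021039224473,
              0.07254489728716533, 0.03076923076923077]
score, bp = bleu_from_stats(precisions, translation_length=45357, reference_length=59611)
print(score, bp)
```

The brevity penalty of about 0.73 shows that the model stops too early: its translations are roughly a quarter shorter than the references, which costs a sizable part of the score.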


In [138]:
de_sentences = [
    "Gutach: Noch mehr Sicherheit für Fußgänger",
    "Zwei Anlagen so nah beieinander: Absicht oder Schildbürgerstreich?",
    "Dies bestätigt auch Peter Arnold vom Landratsamt Offenburg.",
    "Daher sei der Bau einer weiteren Ampel mehr als notwendig: \"Sicherheit geht hier einfach vor\", so Arnold.",
    "Pro Fahrtrichtung gibt es drei Lichtanlagen.",
    "Drückt der Fußgänger den Ampelknopf, testet der obere Radarsensor die Verkehrslage.",
    "Ein weiteres Radarsensor prüft, ob die Grünphase für den Fußgänger beendet werden kann.",
    "Josef Winkler schreibt sich seit mehr als 30 Jahren die Nöte seiner Kindheit und Jugend von der Seele.",
    "Dabei scheint Regisseur Fresacher dem Text wenig zu vertrauen.",
    "Sie werden hart angefasst, mit dem Kopf unter Wasser getaucht, mit ihren Abendroben an die Wand getackert.",
]
en_sentences = [decoding_function(src, model) for src in de_sentences]

In [139]:
en_sentences

['more pedestrian safety',
 'two plant are so close to their intention or',
 'this confirms the fact that the',
 'therefore the construction of another more expensive more than security is simply required here',
 'there are three',
 'the pedestrian protection of pedestrian is',
 'another is looking at whether the pedestrian protection can be completed',
 'ha been over over year old the need of his young people and the soul',
 'the text appears to be too little confidence in the text',
 'they will be hard with the of water with their to the wall']

In [141]:
import yaml


submission = {
    "tasks": [
        {"task1": {"answer": test_bleu}},
        {"task2": {"answer": en_sentences}}
    ]
}

# write the submission and make sure the file is closed afterwards
with open("submission.yaml", "w") as f:
    yaml.safe_dump(submission, f)