<a href="https://colab.research.google.com/github/dviva1972/denvlaiva/blob/master/DLL_HW_13_text_transformer_261021_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## DLL

## Homework 13 | Working with Text / Transformers

## Denis Ivanov

Solve a translation task using transformers.

Take the English-Spanish phrase pairs (www.manythings.org....org/anki/).

Train a seq2seq model with transformers on them.

### 1. Importing libraries / data

In [1]:
from io import open
import unicodedata
import string
import re
import random
import math
import numpy as np
from collections import Counter

import torch as tr
import torch.nn as nn
from   torch import Tensor
from   torch.nn import Transformer, TransformerEncoder, TransformerEncoderLayer
import torch.nn.functional as F

from timeit import default_timer as timer
import time

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
DEVICE = tr.device('cuda:0' if tr.cuda.is_available() else 'cpu')
print(f"work on {(tr.cuda.get_device_name() if DEVICE.type == 'cuda' else 'cpu')}")

work on Tesla P100-PCIE-16GB


### 2. Importing and preprocessing the text

In [3]:
!wget https://www.manythings.org/anki/spa-eng.zip
!unzip spa-eng.zip

--2021-10-26 16:11:23--  https://www.manythings.org/anki/spa-eng.zip
Resolving www.manythings.org (www.manythings.org)... 172.67.186.54, 104.21.92.44, 2606:4700:3033::ac43:ba36, ...
Connecting to www.manythings.org (www.manythings.org)|172.67.186.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5192744 (5.0M) [application/zip]
Saving to: ‘spa-eng.zip’


2021-10-26 16:11:23 (19.1 MB/s) - ‘spa-eng.zip’ saved [5192744/5192744]

Archive:  spa-eng.zip
  inflating: _about.txt              
  inflating: spa.txt                 


The data are prepared as follows:

*   The unit of analysis (one observation) is a single sentence, delimited by the end of a phrase pair or by the sentence-ending marks '.!?'.
*   A sentence has at most 20 words; longer sentences are truncated to their first 20 words.
*   If a source phrase pair contains several sentences, it yields one observation per sentence to translate.
*   If the number of sentences on the input side of a phrase pair differs from the number on the output side, that pair is excluded from training.
*   If either sentence of a pair contains a rare word (one occurring no more than 5 times across the whole source corpus, since the code keeps only words whose count exceeds `w_limit = 5`), that pair is excluded from training.
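The splitting and truncation rules above can be sketched on a toy pair. This is a minimal illustration, not the notebook's own code: `split_pair` and `MAX_WORDS` are hypothetical names, and the sketch drops whitespace-only fragments, which is slightly stricter than the notebook's filter.

```python
import re

MAX_WORDS = 20  # assumed cap, taken from the rule list above


def split_pair(eng, spa, max_words=MAX_WORDS):
    """Split a phrase pair into aligned sentence pairs; return [] on a count mismatch."""
    eng_sents = [s for s in re.split(r"[.!?]", eng) if s.strip()]
    spa_sents = [s for s in re.split(r"[.!?]", spa) if s.strip()]
    if len(eng_sents) != len(spa_sents):
        return []  # mismatched sentence counts: the whole pair is dropped
    return [
        (re.findall(r"\w+", e)[:max_words], re.findall(r"\w+", s)[:max_words])
        for e, s in zip(eng_sents, spa_sents)
    ]


two_sentences = split_pair("go away ! i am busy .", "vete ! estoy ocupado .")
mismatched = split_pair("wait . stop !", "espera .")
print(two_sentences)
print(mismatched)
```

The first call yields two observations from one phrase pair; the second yields none because the sides disagree on the number of sentences.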

In [4]:
SRC_LANGUAGE, TGT_LANGUAGE          = 'eng',  'spa'

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {'<unk>': 0, '<pad>': 1, '<bos>': 2, '<eos>': 3}
        self.index2word = {0: '<unk>', 1: '<pad>', 2: '<bos>', 3: '<eos>'}
        self.word2count = {}
        self.n_words = 4 

    def addSentence(self, sentence):
        for word in sentence:#.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

In [5]:
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
           if unicodedata.category(c) != 'Mn')

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Zа-яА-ЯёЁ.!?]+", r" ", s)
    return s

In [6]:
def readLangs(s_limit):
    print("Reading lines...")
    lines = open('spa.txt', encoding='utf-8').read().strip().split('\n')
    pairs = [[normalizeString(s) for s in re.split('\t', l)] for l in lines]
    pairs_list = []
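    for pair in pairs:
        # Split each side into sentences on the end-of-sentence marks
        eng_split =  [i for i in re.split(r'[.!?]', pair[0]) if i != '']
        esp_split =  [i for i in re.split(r'[.!?]', pair[1]) if i != '']
        # Keep the pair only when both sides have the same number of sentences
        if len(eng_split) ==  len(esp_split):
            for i in range(len(eng_split)):
                # Tokenize into words and keep at most s_limit of them
                eng_s = re.findall(r'\w+', eng_split[i])[:s_limit]
                esp_s = re.findall(r'\w+', esp_split[i])[:s_limit]
                pairs_list.append([eng_s, esp_s])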
    return pairs_list

In [7]:
def prepareData(i_lang, o_lang, se_limit):
    pairs     = readLangs(se_limit)
    inp_lang  = Lang(i_lang)
    out_lang  = Lang(o_lang)
    print("Read %s sentence pairs" % len(pairs))
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        inp_lang.addSentence(pair[0])
        out_lang.addSentence(pair[1])
    print("Counted words:")
    print(inp_lang.name, inp_lang.n_words)
    print(out_lang.name, out_lang.n_words)
    return inp_lang, out_lang, pairs

In [8]:
def Prepare_Filt_Data(im_lang = SRC_LANGUAGE, ou_lang = TGT_LANGUAGE, 
                      sen_limit = 20,  w_limit = 5):
    i1_lang, o1_lang, pairs=prepareData(im_lang, ou_lang, sen_limit)
    inp_lang_  = Lang(im_lang)
    out_lang_  = Lang(ou_lang)

    eng_words_select = set([i for i in i1_lang.word2count.keys() 
                               if i1_lang.word2count[i] > w_limit])
    esp_words_select = set([i for i in o1_lang.word2count.keys() 
                               if o1_lang.word2count[i] > w_limit])
    
    for pair in pairs:
        for wrd_i in pair[0]:
            if wrd_i in eng_words_select:
                inp_lang_.addWord(wrd_i)
        for wrd_o in pair[1]:
            if wrd_o in esp_words_select:
                out_lang_.addWord(wrd_o)

    print('Using filter: every dict word in sentenses > ', w_limit)
    print(inp_lang_.name, inp_lang_.n_words)
    print(out_lang_.name, out_lang_.n_words)

    pairs_select = []
    for i in range(len(pairs)):
        if all(word in eng_words_select for word in pairs[i][0]):
            if all(word in esp_words_select for word in pairs[i][1]):
                   pairs_select.append([pairs[i][0], pairs[i][1]]) 

    print(len(pairs_select), ' sentenses')
    return inp_lang_, out_lang_, pairs_select

In [9]:
input_lang, output_lang, pairs =  Prepare_Filt_Data(sen_limit = 20,  
                                                    w_limit   = 5)

Reading lines...
Read 135511 sentence pairs
Trimmed to 135511 sentence pairs
Counting words...
Counted words:
eng 13490
spa 26095
Using filter: every dict word in sentenses >  5
eng 4999
spa 7107
101567  sentenses
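The rare-word filter can be illustrated on a toy corpus. This is a sketch under assumptions: `filter_rare` and `toy_pairs` are hypothetical, and the threshold is compared with `>` to mirror the `> w_limit` test in `Prepare_Filt_Data`.

```python
from collections import Counter


def filter_rare(pairs, min_count):
    """Keep only pairs whose every word occurs more than min_count times on its side."""
    src_counts = Counter(w for src, _ in pairs for w in src)
    tgt_counts = Counter(w for _, tgt in pairs for w in tgt)
    return [
        (src, tgt)
        for src, tgt in pairs
        if all(src_counts[w] > min_count for w in src)
        and all(tgt_counts[w] > min_count for w in tgt)
    ]


toy_pairs = [
    (["i", "am"], ["yo", "soy"]),
    (["i", "am"], ["yo", "soy"]),
    (["i", "run"], ["yo", "corro"]),  # "run" and "corro" occur only once
]
kept = filter_rare(toy_pairs, min_count=1)
print(len(kept))
```

With `min_count=1` the third pair is dropped because it contains words seen only once, while the first two survive.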


In [10]:
full_len   = len(pairs)
train_list = list(range(full_len))
random.shuffle(train_list)
# train_list[:6]

In [11]:
Max_len = 22  # up to 20 words plus <bos> and <eos>

def sent_to_torch(sent_in, l):
    # Encode a tokenized sentence as <bos> (2) + word indices + <eos> (3).
    # The remaining positions stay 0, which is the index of '<unk>' in Lang
    # (not '<pad>', which is 1).
    sent_for_torch = np.zeros((Max_len))
    ids = [2] + [l.word2index[w] for w in sent_in] + [3]
    sent_for_torch[:len(ids)] = ids
    return sent_for_torch
    
# sent_to_torch(pairs[-1][0], input_lang)
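To make the encoding concrete, here is a small stand-alone sketch of the same layout using plain lists. The names `encode` and `toy_vocab` are hypothetical and the word indices are made up; only the special-token indices (2 for `<bos>`, 3 for `<eos>`) and the length 22 come from the notebook.

```python
MAX_LEN = 22       # same length as Max_len in the notebook
BOS, EOS = 2, 3    # special-token indices used by the Lang class


def encode(words, word2index, max_len=MAX_LEN):
    """Return <bos> + word ids + <eos>, right-filled with zeros up to max_len."""
    ids = [BOS] + [word2index[w] for w in words] + [EOS]
    return ids + [0] * (max_len - len(ids))


toy_vocab = {"she": 4, "is": 5, "young": 6}  # illustrative indices only
row = encode(["she", "is", "young"], toy_vocab)
print(row[:6], len(row))
```

A three-word sentence occupies five positions; the other seventeen hold zeros.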

In [12]:
def torch_to_sent(sent_tens, l):
    # index2word is a dict, so it must be indexed rather than called
    sentens = [l.index2word[int(i)] for i in sent_tens if i > 3] 
    return ' '.join(sentens)   

In [13]:
def get_batch(pairs_num_list, batch_size):
    # Sample sentence-pair indices without replacement
    batch_list = random.sample(pairs_num_list, batch_size)

    # Encode both sides into (batch_size, Max_len) arrays of token indices
    s_np_in  = np.stack([sent_to_torch(pairs[a][0], input_lang)  for a in batch_list])
    s_np_out = np.stack([sent_to_torch(pairs[a][1], output_lang) for a in batch_list])

    # NOTE: .view(-1, batch_size) reshapes the row-major data to (Max_len, batch_size)
    # without transposing it, so a column is not a single sentence. The flattened
    # data and target still line up position by position, which the loss relies on.
    data   = tr.tensor(s_np_in,  dtype=tr.long, device=DEVICE).view(-1, batch_size)
    target = tr.tensor(s_np_out, dtype=tr.long, device=DEVICE).view(-1)

    return data, target
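The difference between reshaping and transposing a `(batch, seq_len)` array can be seen on a tiny example. The sketch below imitates both operations with plain lists rather than tensors; `reshape_row_major` is a hypothetical helper standing in for what `Tensor.view` does on a contiguous tensor.

```python
rows = [[10, 11, 12],   # sentence 0: three time steps
        [20, 21, 22]]   # sentence 1: three time steps


def reshape_row_major(rows, n_cols):
    """Flatten row by row, then regroup into rows of n_cols (no transposition)."""
    flat = [x for r in rows for x in r]
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


reshaped = reshape_row_major(rows, n_cols=len(rows))  # analogue of .view(-1, batch_size)
transposed = [list(col) for col in zip(*rows)]        # analogue of .t()
print(reshaped)
print(transposed)
```

In `transposed` each column is one sentence; in `reshaped` the columns interleave tokens from both sentences, which is the layout `get_batch` produces.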

### 3. Архитектура сети

In [14]:
class TransformerModel(nn.Module):

    def __init__(self, n_tokens_in, n_tokens_out, 
                 ninp, nhead, nhid, nlayers, dropout=0.5):
        super(TransformerModel, self).__init__()
        
        self.model_type = 'Transformer'
        self.src_mask = None
        self.pos_encoder = PositionalEncoding(ninp, dropout)
        encoder_layers = TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(n_tokens_in, ninp)
        self.ninp = ninp
        self.decoder = nn.Linear(ninp, n_tokens_out)

        self.init_weights()

    def _generate_square_subsequent_mask(self, sz):
        mask = (tr.triu(tr.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src):
        if self.src_mask is None or self.src_mask.size(0) != len(src):
            device = src.device
            mask = self._generate_square_subsequent_mask(len(src)).to(device)
            self.src_mask = mask

        src = self.encoder(src) * math.sqrt(self.ninp)
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, self.src_mask)
        output = self.decoder(output)
        return output
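The mask built by `_generate_square_subsequent_mask` is additive: 0.0 where attention is allowed and `-inf` where it is blocked. A plain-Python sketch of the same pattern (the function name `causal_mask` is hypothetical) shows that row `i` may attend only to positions `j <= i`:

```python
def causal_mask(sz):
    """Additive attention mask: 0.0 on and below the diagonal, -inf above it."""
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(sz)] for i in range(sz)]


mask = causal_mask(3)
for row in mask:
    print(row)
```

Position 0 therefore sees only itself, while the last position sees the whole sequence.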

In [15]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = tr.zeros(max_len, d_model)
        position = tr.arange(0, max_len, dtype=tr.float).unsqueeze(1)
        div_term = tr.exp(tr.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = tr.sin(position * div_term)
        pe[:, 1::2] = tr.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
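Written out from the code above, the buffer `pe` holds the usual sinusoidal encoding. Here $d$ stands for `d_model` (assumed even, as with 200 in this notebook), $pos$ is the position and $i$ indexes pairs of channels; `div_term` equals $\exp(-2i \ln 10000 / d) = 10000^{-2i/d}$:

$$
\mathrm{pe}[pos,\, 2i] = \sin\!\left(pos \cdot 10000^{-2i/d}\right), \qquad
\mathrm{pe}[pos,\, 2i+1] = \cos\!\left(pos \cdot 10000^{-2i/d}\right)
$$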

In [16]:
n_tokens_in = input_lang.n_words
n_tokens_out= output_lang.n_words

In [17]:
emsize = 200    # embedding dim
nhid   = 200    # dim of the feedforward network model in nn.TransformerEncoder
nlayers = 2     # the number of TransformerEncoderLayer`s in TransformerEncoder
nhead   = 2     # the number of heads in the multiheadattention model
dropout = 0.5 
model   = TransformerModel(n_tokens_in, n_tokens_out, emsize, 
                           nhead, nhid, nlayers, dropout).to(DEVICE)

In [18]:
criterion = nn.CrossEntropyLoss()
lr = 5          # learning rate
optimizer = tr.optim.SGD(model.parameters(), lr=lr)
scheduler = tr.optim.lr_scheduler.StepLR(optimizer, 3.0, gamma=0.8)
batch_size   = 50
range_step   = 150

def train():
    model.train()
    total_loss = 0.
    start_time = time.time()
    ntokens = n_tokens_out
    for i in range(range_step):
        data, targets = get_batch(train_list, batch_size)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output.view(-1, ntokens), targets)
        loss.backward()
        tr.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        
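        # Log every 20 steps starting at step 1; `epoch` is read from the outer loop
        if i in range(1, range_step, 20):
            # total_loss covers i + 1 batches here, so dividing by i overstates the
            # running mean (most visibly in the first log line of each epoch)
            cur_loss = total_loss /i
            elapsed = time.time() - start_time
            print('| epoch {:3d} | lr {:02.2f} | loss {:5.6f} | ppl {:8.6f}'.format(
                    epoch, scheduler.get_last_lr()[0],
                    cur_loss, math.exp(cur_loss)))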
    
    return total_loss / range_step, model
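The learning-rate schedule configured above (`lr = 5`, step size 3, `gamma = 0.8`) follows a simple closed form when the scheduler is stepped once at the end of each epoch. The helper `step_lr` below is a hypothetical stand-in for that arithmetic, not a call into PyTorch:

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Learning rate in effect during 1-based `epoch`, with one scheduler step per epoch."""
    completed_steps = epoch - 1  # scheduler.step() calls made before this epoch starts
    return base_lr * gamma ** (completed_steps // step_size)


schedule = [round(step_lr(5.0, 0.8, 3, e), 3) for e in range(1, 8)]
print(schedule)
```

Under these assumptions the rate stays at 5.0 for the first three epochs, drops to 4.0 for the next three, and reaches 3.2 in epoch 7.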

In [19]:
best_val_loss = float("inf")
epochs        = 15
best_model    = None

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    t_loss, model_   = train()
    print('-' * 50)
    print('|mean loss epoch   {:3}|mean l {:5.6f}|vppl {:4.6f}'.format(epoch, 
                                                    t_loss, math.exp(t_loss)))
    print('-' * 54)

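    # t_loss is the mean training loss; no separate validation set is used
    if t_loss         < best_val_loss:
        best_val_loss = t_loss
        # model_ is the same object as model (not a copy), so best_model always
        # reflects the most recently trained weights
        best_model    = model_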

    scheduler.step()

| epoch   1 | lr 5.00 | loss 13.115902 | ppl 496779.818327
| epoch   1 | lr 5.00 | loss 6.752357 | ppl 856.073731
| epoch   1 | lr 5.00 | loss 4.912242 | ppl 135.943848
| epoch   1 | lr 5.00 | loss 4.199534 | ppl 66.655265
| epoch   1 | lr 5.00 | loss 3.766958 | ppl 43.248321
| epoch   1 | lr 5.00 | loss 3.492000 | ppl 32.851592
| epoch   1 | lr 5.00 | loss 3.319454 | ppl 27.645254
| epoch   1 | lr 5.00 | loss 3.183966 | ppl 24.142304
--------------------------------------------------
|mean loss epoch     1|mean l 3.112020|vppl 22.466376
------------------------------------------------------
| epoch   2 | lr 5.00 | loss 4.455430 | ppl 86.093186
| epoch   2 | lr 5.00 | loss 2.328672 | ppl 10.264298
| epoch   2 | lr 5.00 | loss 2.190367 | ppl 8.938496
| epoch   2 | lr 5.00 | loss 2.172541 | ppl 8.780570
| epoch   2 | lr 5.00 | loss 2.138202 | ppl 8.484172
| epoch   2 | lr 5.00 | loss 2.134357 | ppl 8.451610
| epoch   2 | lr 5.00 | loss 2.109135 | ppl 8.241106
| epoch   2 | lr 5.00 | loss
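The `ppl` column in the log is the exponential of the mean cross-entropy loss. As a quick check, the epoch 1 summary values from the log above (loss 3.112020, perplexity 22.466376) can be reproduced with the standard library:

```python
import math

mean_loss = 3.112020          # mean loss for epoch 1, copied from the log
ppl = math.exp(mean_loss)     # perplexity = exp(cross-entropy in nats)
print(f"{ppl:.3f}")
```

A perplexity near 22 means the model is, on average, about as uncertain as a uniform choice among 22 tokens.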

In [20]:
best_model

TransformerModel(
  (pos_encoder): PositionalEncoding(
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=200, out_features=200, bias=True)
        )
        (linear1): Linear(in_features=200, out_features=200, bias=True)
        (dropout): Dropout(p=0.5, inplace=False)
        (linear2): Linear(in_features=200, out_features=200, bias=True)
        (norm1): LayerNorm((200,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((200,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.5, inplace=False)
        (dropout2): Dropout(p=0.5, inplace=False)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=200, out_features=200, bias=True)
        )
        

In [21]:
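def translate_to_spain(idx):
    print('English sentence    :', ' '.join(pairs[idx][0])) 
    print('Spanish sentence    :', ' '.join(pairs[idx][1])) 

    sent_np_in = sent_to_torch(pairs[idx][0], input_lang)
    # The input is 1-D, shape (Max_len,): the positional encoding broadcasts it to
    # (Max_len, Max_len, emsize), so output[0] holds the position-0 prediction for
    # each source token, i.e. every word is translated independently.
    # best_model is still in training mode here, so dropout is active.
    output     = best_model(tr.tensor(sent_np_in,  dtype=tr.long, device=DEVICE))

    # Take the top-1 token per source position and skip special tokens (indices 0-3)
    print('seq2seq translation :', 
          ' '.join([output_lang.index2word[i.item()] 
                    for i in output[0].data.topk(1)[1] if i > 3])) 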

In [22]:
translate_to_spain(82368)

English sentence    : tom wanted mary to loan him some money
Spanish sentence    : tom queria que maria le prestara dinero
seq2seq translation : tom queria mary a dinero


In [23]:
translate_to_spain(8207)

English sentence    : she isn t young
Spanish sentence    : ella no es joven
seq2seq translation : ella no no joven


In [24]:
translate_to_spain(8231)

English sentence    : she s my sister
Spanish sentence    : ella es mi hermana
seq2seq translation : ella es mi hermana


In [25]:
translate_to_spain(55770)

English sentence    : he looks just like his mother
Spanish sentence    : el se parece a su madre
seq2seq translation : el no su madre
