<a href="https://colab.research.google.com/github/DmitryKutsev/eng_to_jap_translator/blob/main/Kutsev_Dima_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Neural Machine Translation.**

The project is a model that generates translations from English to Japanese. It is based on a seq2seq model, but if things go well I will try to make it somewhat more complex later.

The papers I roughly followed: https://arxiv.org/pdf/1706.08198.pdf and https://www.aclweb.org/anthology/W14-7008.pdf

For Japanese tokenization I used tinysegmenter: https://pypi.org/project/tinysegmenter/.

Data:

At first I used the Kurohashi-Kawahara Lab corpus: http://nlp.ist.i.kyoto-u.ac.jp/EN/?JEC%20Basic%20Sentence%20Data

Then I found a larger one, a parallel corpus of English-Japanese subtitles: https://nlp.stanford.edu/projects/jesc/data/raw.tar.gz


In [1]:
!pip install tinysegmenter



In [2]:
import sys
import os
import math
from tqdm import tqdm

import torch
import torch.optim as optim
import torch.nn as nn
import pandas as pd
import numpy as np

import torchtext
from torchtext.data import Field, BucketIterator, TabularDataset
import random
import spacy
import tinysegmenter



In [3]:
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [4]:
spacy_en = spacy.load('en')

In [5]:
segmenter = tinysegmenter.TinySegmenter()

In [6]:
! wget https://nlp.stanford.edu/projects/jesc/data/raw.tar.gz

--2020-12-21 21:33:38--  https://nlp.stanford.edu/projects/jesc/data/raw.tar.gz
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 102198198 (97M) [application/x-gzip]
Saving to: ‘raw.tar.gz.2’


2020-12-21 21:34:17 (2.51 MB/s) - ‘raw.tar.gz.2’ saved [102198198/102198198]



In [7]:
!tar -xzf raw.tar.gz

In [8]:
my_frame = pd.read_csv('raw/raw', sep='\t')

In [9]:

my_frame.columns = ['en', 'jp']
my_frame = my_frame[:500000]
# my_frame = my_frame[my_frame.columns[::-1]]

In [10]:
my_frame

Unnamed: 0,en,jp
0,my opponent is shark.,俺の相手は シャークだ。
1,this is one thing in exchange for another.,引き換えだ ある事とある物の
2,"yeah, i'm fine.",もういいよ ごちそうさま ううん
3,don't come to the office anymore. don't call m...,もう会社には来ないでくれ 電話もするな
4,looks beautiful.,きれいだ。
...,...,...
499995,i was threatened by a guy from your office.,FBIの男に強要されたのに
499996,it's distracting.,シャーペン回すんやめてくれへん? 気が散んねん。
499997,"it provides a simple, inexpensive",荒れた生態系に水を
499998,"i've talked to every morgue attendant, every m...",全ての職員や運転手と話を


In [11]:
segmenter.tokenize(my_frame['jp'][1])

['引き換え', 'だ', ' ', 'ある', '事', 'と', 'ある', '物', 'の']

In [12]:
[tok.text for tok in spacy_en.tokenizer(my_frame['en'][1])]

['this', 'is', 'one', 'thing', 'in', 'exchange', 'for', 'another', '.']

In [13]:
my_frame.to_csv('my_frame.csv', index=False)  

At some point I accidentally removed MAX_LEN and then forgot about it, so for a while I could not understand why the model crashed on the full data.

In [14]:
# MAX_LEN = 25
MAX_LEN = 15

def tokenize_jp(text):
    """
    Tokenizes JP text from a string into a list of strings
    """
    # Truncate to MAX_LEN first, then drop stop tokens.
    stops = ['  ', ' ', '...', '「', '、', '」', '➡', '《', '-']
    return [i for i in segmenter.tokenize(text)[:MAX_LEN] if i not in stops]


def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    res = [tok.text for tok in spacy_en.tokenizer(text)]
    return res[:MAX_LEN]
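
In `tokenize_jp` above, the `[:MAX_LEN]` slice runs before the stop tokens are filtered out, so spaces and brackets use up part of the length budget. A minimal sketch comparing the two possible orders, using a plain token list as a stand-in for the `segmenter.tokenize` output (both helper names are hypothetical and not part of the notebook):

```python
MAX_LEN = 15
STOPS = {'  ', ' ', '...', '「', '、', '」', '➡', '《', '-'}

def truncate_then_filter(tokens, max_len=MAX_LEN):
    """Order used in the notebook: slice first, then drop stop tokens."""
    return [t for t in tokens[:max_len] if t not in STOPS]

def filter_then_truncate(tokens, max_len=MAX_LEN):
    """Alternative order: drop stop tokens first, then slice."""
    return [t for t in tokens if t not in STOPS][:max_len]

# Toy input (assumption): 9 words with a space token between each pair.
words = ['引き換え', 'だ', 'ある', '事', 'と', 'ある', '物', 'の', 'だ']
tokens = []
for w in words:
    tokens += [w, ' ']
tokens = tokens[:-1]  # 17 tokens in total

kept_notebook_order = truncate_then_filter(tokens)
kept_alternative = filter_then_truncate(tokens)
print(len(kept_notebook_order), len(kept_alternative))
```

With the notebook's order the last word is cut off even though only 9 real words are present; whether that matters depends on how many stop tokens the corpus contains.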

For data handling I decided to use the torchtext tools.


In [15]:
SRC = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>')
TRG = Field(tokenize=tokenize_jp, init_token='<sos>', eos_token='<eos>')

In [16]:
dataset = TabularDataset(path='my_frame.csv', 
                         format='csv', 
                         fields=[ ('en', SRC), ('jp', TRG)],
                         skip_header=True)

In [17]:
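# Note: torchtext reads a three-item split_ratio as [train, test, valid],
# so this yields 70% train, 20% validation and 10% test.
train_data, valid_data, test_data = dataset.split(split_ratio=[0.7, 0.1, 0.2], 
                                            random_state=random.getstate())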

In [18]:
SRC.build_vocab(train_data, min_freq=1)
TRG.build_vocab(train_data, min_freq=1)

In [19]:
print(vars(train_data.examples[0])['en'])

['you', "'re", 'good', '.']


In [20]:
print(vars(train_data.examples[0])['jp'])

['うまい', 'なぁ']


In [21]:
print (len(SRC.vocab), len(TRG.vocab))
print (SRC.vocab.freqs.most_common(10))
print (TRG.vocab.freqs.most_common(10))

56956 109573
[('.', 212748), (',', 117190), ('you', 86354), ('the', 85398), ('i', 82518), ('?', 63348), ('to', 60081), ('a', 50291), ("'s", 48697), ('it', 48531)]
[('の', 129116), ('は', 97673), ('に', 86992), ('て', 79552), ('を', 74538), ('が', 69041), ('た', 65326), ('?', 48749), ('だ', 43575), ('で', 42941)]


In [22]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 256

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
     batch_size = BATCH_SIZE,
     sort_key=lambda x: len(x.jp), 
     sort_within_batch=False,
     device = device)


In [23]:

# for b in valid_iterator:
#     print (b.jp, b.en)
#     sys.exit()


In [24]:
print(vars(valid_iterator))

{'batch_size': 256, 'train': False, 'dataset': <torchtext.data.dataset.Dataset object at 0x7f4c9e34dc50>, 'batch_size_fn': None, 'iterations': 0, 'repeat': False, 'shuffle': False, 'sort': True, 'sort_within_batch': False, 'sort_key': <function <lambda> at 0x7f4cc98e8378>, 'device': device(type='cpu'), 'random_shuffler': <torchtext.data.utils.RandomShuffler object at 0x7f4c9e3645f8>, '_iterations_this_epoch': 0, '_random_state_this_epoch': None, '_restored_from_state': False}


In [25]:
print(len(valid_iterator))

391
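
The 391 validation batches reported above follow from the split sizes. A quick check of the arithmetic, assuming (as the torchtext `Dataset.split` docstring describes) that a three-item `split_ratio` is read as train, test, valid:

```python
import math

N_EXAMPLES = 500_000   # rows kept from the corpus
BATCH_SIZE = 256

# Assumption: the list [0.7, 0.1, 0.2] is interpreted as [train, test, valid].
train_ratio, test_ratio, valid_ratio = 0.7, 0.1, 0.2

n_valid = round(N_EXAMPLES * valid_ratio)
n_valid_batches = math.ceil(n_valid / BATCH_SIZE)
print(n_valid, n_valid_batches)
```

So the validation set holds 20% of the data here, not 10%.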


The section with the networks themselves.

In [26]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        embedded = self.dropout(self.embedding(src))        
        outputs, (hidden, cell) = self.rnn(embedded)

        
        return hidden, cell

In [27]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        input = input.unsqueeze(0)

        embedded = self.dropout(self.embedding(input))
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        prediction = self.fc_out(output.squeeze(0))
        
        return prediction, hidden, cell

In [28]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        hidden, cell = self.encoder(src)
        input = trg[0,:]
        
        for t in range(1, trg_len):

            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1) 
            input = trg[t] if teacher_force else top1
        
        return outputs
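
At each decoding step `forward` feeds the decoder either the ground-truth token or its own argmax prediction. A minimal, framework-free sketch of that choice (`next_decoder_input` is a hypothetical helper, not part of the notebook):

```python
import random

def next_decoder_input(gold_token, predicted_token, teacher_forcing_ratio, rng):
    """Return the gold token with probability teacher_forcing_ratio,
    otherwise the model's own prediction."""
    return gold_token if rng.random() < teacher_forcing_ratio else predicted_token

rng = random.Random(0)
always_gold = [next_decoder_input('gold', 'pred', 1.0, rng) for _ in range(5)]
never_gold = [next_decoder_input('gold', 'pred', 0.0, rng) for _ in range(5)]
print(always_gold)
print(never_gold)
```

A ratio of 0 means the decoder always consumes its own output, which is why `evaluate` calls the model with `0` as the third argument.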

In [29]:
len(SRC.vocab), len(TRG.vocab)

(56956, 109573)

In [30]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.4
DEC_DROPOUT = 0.4

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

In [31]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(56956, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.4)
    (dropout): Dropout(p=0.4, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(109573, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.4)
    (fc_out): Linear(in_features=512, out_features=109573, bias=True)
    (dropout): Dropout(p=0.4, inplace=False)
  )
)

In [32]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 106,198,789 trainable parameters
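
The reported count can be reproduced by hand from the layer shapes printed above. This sketch assumes PyTorch's LSTM parameter layout (four gates per layer, each with input weights, recurrent weights and two bias vectors); `lstm_params` is a helper introduced here for illustration.

```python
def lstm_params(input_size, hidden_size, n_layers):
    """Parameter count of a unidirectional multi-layer LSTM with biases."""
    total = 0
    for layer in range(n_layers):
        in_size = input_size if layer == 0 else hidden_size
        # 4 gates x (input weights + recurrent weights + two bias vectors)
        total += 4 * (in_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total

SRC_VOCAB, TRG_VOCAB = 56956, 109573
EMB_DIM, HID_DIM, N_LAYERS = 256, 512, 2

encoder = SRC_VOCAB * EMB_DIM + lstm_params(EMB_DIM, HID_DIM, N_LAYERS)
output_layer = HID_DIM * TRG_VOCAB + TRG_VOCAB          # fc_out weights + bias
decoder = TRG_VOCAB * EMB_DIM + lstm_params(EMB_DIM, HID_DIM, N_LAYERS) + output_layer
total = encoder + decoder
print(f'{total:,}')  # 106,198,789
print(f'output layer share: {output_layer / total:.1%}')
```

A little over half of the parameters sit in the output projection, because the target vocabulary is built with `min_freq=1`.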


In [33]:
optimizer = optim.Adam(model.parameters())

TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
pad_idx = TRG.vocab.stoi['<pad>']
print(TRG.pad_token)  # <pad>
print(TRG.vocab.stoi[TRG.pad_token]) # 1 

criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)

<pad>
1


In [34]:
import time

Training and evaluation.

At first I added a progress_bar and seemed to have got rid of all its glitches in Colab, but at some point it started behaving strangely again. Since there was almost no time left, I decided to stick with printing the results for now; the progress_bar will return if there is time to fix it again.
I also added a time limit; the last training limit was 400 minutes.
(I tried to train for longer, but Colab seems to disconnect after some period of inactivity, so leaving it running overnight did not work.)


In [35]:
def train(model, iterator, optimizer, criterion, clip, start_time):
    
    model.train()
    epoch_loss = 0
    my_losses = []
    for i, batch in enumerate(iterator):
        # progress_bar = tqdm(total=len(iterator), desc=f'{ i }')
        src = batch.en
        trg = batch.jp

        optimizer.zero_grad()
        
        output = model(src, trg)

        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)

        loss = criterion(output, trg)

        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        my_losses.append(loss.item())

        end_time = time.time()

        iter_mins, iter_secs = epoch_time(start_time, end_time)
        if i%10 == 0:
          print(f'fmean losses: { np.mean(my_losses[-1000:]) } ', 
                f'iter { i }, iter mins { iter_mins }' )
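        # Hard-coded time limit: stop once more than 400 minutes have passed since start_time.
        if int(iter_mins) > 400:
            # Average over the batches processed so far, not the full iterator length.
            return epoch_loss / (i + 1)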
        # progress_bar.set_postfix(loss=np.mean(my_losses[-1000:]),
                            # perplexity=np.exp(np.mean(my_losses[-1000:])))
        # progress_bar.update()
     # progress_bar.close()
    return epoch_loss / len(iterator)

In [36]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.en
            trg = batch.jp

            output = model(src, trg, 0) 
            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [37]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [38]:
import time

In [39]:

# for instance in list(tqdm._instances):
#   tqdm._decr_instances(instance)


In [40]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


The pass over the data. Note that the training function above also has a hard-coded time limit.


In [41]:
N_EPOCHS = 5
CLIP = 1

best_valid_loss = float('inf')
total_start_time = time.time()
for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP, total_start_time)
    # valid_loss = evaluate(model, valid_iterator, criterion)

    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    # if valid_loss < best_valid_loss:
    #     best_valid_loss = valid_loss
    #     torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    # print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')


fmean losses: 11.601455688476562  iter 0, iter mins 0
fmean losses: 9.360088565132834  iter 10, iter mins 8
fmean losses: 8.187040851229714  iter 20, iter mins 16
fmean losses: 7.723426818847656  iter 30, iter mins 23
fmean losses: 7.451544028956715  iter 40, iter mins 31
fmean losses: 7.279507898816876  iter 50, iter mins 39
fmean losses: 7.159851488519887  iter 60, iter mins 46
fmean losses: 7.070481965239619  iter 70, iter mins 54
fmean losses: 6.999135058603169  iter 80, iter mins 62
fmean losses: 6.937422464182089  iter 90, iter mins 70
fmean losses: 6.889661543440111  iter 100, iter mins 77
fmean losses: 6.846861749081998  iter 110, iter mins 85
fmean losses: 6.806571412677608  iter 120, iter mins 93
fmean losses: 6.773665737559777  iter 130, iter mins 101
fmean losses: 6.747186934694331  iter 140, iter mins 109
fmean losses: 6.719250991644449  iter 150, iter mins 117
fmean losses: 6.695732131507826  iter 160, iter mins 125
fmean losses: 6.670178176366795  iter 170, iter mins 132
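
The first logged value, about 11.60, is close to ln(109573), the cross-entropy of a uniform guess over the target vocabulary, which is what a model with small random weights should produce. The sketch below checks that and converts the last logged running mean into a perplexity; the constants are copied from the outputs above.

```python
import math

TRG_VOCAB = 109573                      # len(TRG.vocab)
first_logged = 11.601455688476562       # loss at iter 0
last_logged = 6.670178176366795         # running mean at iter 170

# Cross-entropy of a uniform distribution over the target vocabulary.
uniform_loss = math.log(TRG_VOCAB)
# Perplexity corresponding to the last running-mean loss.
running_perplexity = math.exp(last_logged)

print(f'uniform loss: {uniform_loss:.3f}, first logged: {first_logged:.3f}')
print(f'perplexity of running mean: {running_perplexity:.0f}')
```

Because the log prints the mean over all batches so far, the perplexity of the most recent batches is lower than this figure.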

In [42]:
def translate_sentence(sentence, src_field, trg_field, model, device, max_len=50):
    model.eval()

    if isinstance(sentence, str):
        # Reuse the tokenizer loaded above instead of reloading spaCy on every call.
        tokens = [token.text.lower() for token in spacy_en.tokenizer(sentence)]
    else:
        tokens = [token.lower() for token in sentence]

    tokens = [src_field.init_token] + tokens + [src_field.eos_token]

    src_indexes = [src_field.vocab.stoi[token] for token in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)

    with torch.no_grad():
        hidden, cell = model.encoder(src_tensor)

    eos_idx = trg_field.vocab.stoi[trg_field.eos_token]
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]

    # Greedy decoding: feed the most probable token back in at each step.
    for _ in range(max_len):
        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)

        with torch.no_grad():
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell)
        pred_token = output.argmax(1).item()
        trg_indexes.append(pred_token)
        if pred_token == eos_idx:
            break
    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]

    # Drop <sos>, and drop <eos> only if decoding actually produced it.
    if trg_tokens[-1] == trg_field.eos_token:
        trg_tokens = trg_tokens[:-1]
    return trg_tokens[1:]

In [43]:
example_idx = 24

src = vars(train_data.examples[example_idx])['en']
trg = vars(train_data.examples[example_idx])['jp']

print(f'src = {src}')
print(f'trg = {trg}')

translation = translate_sentence(src, SRC, TRG, model, device)

print(f'predicted trg = {translation}')

src = ['you', 'know', ',', 'the', 'police', 'ca', "n't", 'act', 'unless', 'it', 'turns', 'into', 'a', 'real', 'incident']
trg = ['でも', '。 ', '(', '谷村', ')', '警察', 'は', 'ね', '→']
predicted trg = ['私', 'の', 'は', 'の', 'の', 'の', 'の', 'を']


In [44]:
model_save_name = 'tut1-model.pt'
path = F"/content/gdrive/My Drive/{model_save_name}" 
torch.save(model.state_dict(), path)

In [45]:
translation2 = translate_sentence('I go to school', SRC, TRG, model, device)

print(f'predicted trg = { " ".join(translation2) } ')

predicted trg = 私 は 


In [46]:
translation2 = translate_sentence('I want to eat and drink', SRC, TRG, model, device)

print(f'predicted trg = { " ".join(translation2) }')

predicted trg = 私 の は を


Judging by eye, the result is quite poor. With a little more time I would add attention and also try a transformer.
I may still manage to add something (beam search, for example). If this text is still here, I either ran out of time or it did not work.


For evaluation I used the BLEU metric, with the implementation taken from the NLTK library.


In [54]:
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate import bleu_score

In [61]:
one = [['うまい', 'なぁ', 'は', 'jgjg', '111']]
two = ['すべて', 'は', 'ここから', '始まっ', 'た']
sentence_bleu(one, two, weights=(1, 0, 0, 0))

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


0.2

In [49]:
for i in train_data.examples[:3]:
  print(vars(i)['jp'])
  #print(vars(i)['en'])


['うまい', 'なぁ']
['すべて', 'は', 'ここから', '始まっ', 'た']
['そう', 'です', 'よ']


As the metric I use BLEU, imported from NLTK. I collect a list of targets from the dataset and a list of the model's translations, and compare them.

The previous scoring function turned out to be wrong: in the papers the BLEU score reaches about 20 at most, while I was getting 50-60.
Everything falls into place if the n-gram weight calculation is removed.

In [62]:
def cal_bleu_score(dataset_pairs):
    targets = []
    predictions = []

    for i in dataset_pairs:
        target = vars(i)['jp']
        target = ' '.join(target)
        predicted_words = translate_sentence(vars(i)['en'], SRC, TRG, model, device)
        predictions.append(' '.join(predicted_words))
        targets.append(target)
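    print(predictions[:3])
    print(targets[:3])
    # Caution: corpus_bleu expects (list_of_references, hypotheses) built from token lists.
    # This call passes space-joined strings, with the predictions in the reference slot,
    # so the score is computed over characters rather than word tokens.
    print(f'BLEU Score: {round(corpus_bleu(predictions, targets, weights=(1, 0, 0, 0)) * 100, 2)}')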
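
For reference, NLTK's `corpus_bleu` takes the references first, as one list of reference token lists per sentence, and the tokenized hypotheses second. The sketch below is a pure-Python BLEU-1 written for illustration (it is not NLTK's implementation, and `corpus_unigram_bleu` is a hypothetical name). It follows the same argument shape and reproduces the 0.2 from the toy pair above.

```python
import math
from collections import Counter

def corpus_unigram_bleu(list_of_references, hypotheses):
    """Corpus-level BLEU-1: clipped unigram precision times the brevity penalty.

    list_of_references: one list of reference token lists per sentence.
    hypotheses: one token list per sentence.
    """
    matched = total = hyp_len = ref_len = 0
    for references, hypothesis in zip(list_of_references, hypotheses):
        # Clip each hypothesis token count by its maximum count in any reference.
        max_ref_counts = Counter()
        for reference in references:
            for token, count in Counter(reference).items():
                max_ref_counts[token] = max(max_ref_counts[token], count)
        hyp_counts = Counter(hypothesis)
        matched += sum(min(c, max_ref_counts[t]) for t, c in hyp_counts.items())
        total += len(hypothesis)
        hyp_len += len(hypothesis)
        # Use the reference whose length is closest to the hypothesis length.
        ref_len += min((len(r) for r in references),
                       key=lambda n: (abs(n - len(hypothesis)), n))
    if total == 0 or matched == 0:
        return 0.0
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * matched / total

reference = ['うまい', 'なぁ', 'は', 'jgjg', '111']
hypothesis = ['すべて', 'は', 'ここから', '始まっ', 'た']
score = corpus_unigram_bleu([[reference]], [hypothesis])
short_score = corpus_unigram_bleu([[reference]], [['は']])
print(score, short_score)
```

The second call shows the brevity penalty at work: a one-token hypothesis has perfect unigram precision but is scaled down by exp(1 - 5/1).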

In [63]:
len(valid_data.examples)

100000

In [65]:
cal_bleu_score(valid_data.examples)

['私 の は の の を', '何 は は', '私 の は']
['ブラッド の 歯 の 型 を ベース に し て', 'あっ 何 すん だ よ', 'やが て スティーブ が 口 を 開き まし た']
BLEU Score: 9.27


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


In [66]:
cal_bleu_score(train_data.examples)

['何 は', '私 の は', '私 は は は']
['うまい なぁ', 'すべて は ここから 始まっ た', 'そう です よ']
BLEU Score: 8.4


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


Overall, the results are not very good.
Judging by the papers, the model needs much longer training, plus at the very least an attention context vector and beam search.
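
Beam search was not implemented in this notebook. As a rough illustration of the idea only, here is a framework-free sketch over a toy next-token table. `beam_search`, `toy_step` and the probabilities are all made up for the example; a real version would call the decoder inside `step_fn`.

```python
import math

def beam_search(step_fn, sos, eos, beam_width=3, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) must return a dict {token: probability} for the next token.
    """
    beams = [([sos], 0.0)]   # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                if prob > 0:
                    candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep the best beam_width candidates; completed ones leave the beam.
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)   # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])[0]

def toy_step(prefix):
    """Hypothetical next-token distribution that depends only on the last token."""
    table = {
        '<sos>': {'a': 0.6, 'b': 0.4},
        'a': {'x': 0.55, 'y': 0.45},
        'b': {'z': 0.95, '<eos>': 0.05},
        'x': {'<eos>': 1.0}, 'y': {'<eos>': 1.0}, 'z': {'<eos>': 1.0},
    }
    return table[prefix[-1]]

greedy = beam_search(toy_step, '<sos>', '<eos>', beam_width=1)
wider = beam_search(toy_step, '<sos>', '<eos>', beam_width=2)
print(greedy)
print(wider)
```

With a beam of 1 this reduces to the greedy decoding used in `translate_sentence`, which commits to `a` and ends with probability 0.33. A beam of 2 keeps `b` alive and finds the 0.38 path. The sketch compares raw summed log-probabilities, so it favors shorter outputs; length normalization is a common remedy.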

Another thing that ate up a huge amount of time was Colab. It would crash unexpectedly, lose the connection and fail to save notebooks, so I will try to avoid it for any future machine learning work.