<a href="https://colab.research.google.com/github/DmitryKutsev/eng_to_jap_translator/blob/main/my_seq2seq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Neural Machine Translation.**

The project is a model that generates translations from English to Japanese. It is based on a seq2seq model; if it works out, I will try to make it somewhat more complex later.

https://arxiv.org/pdf/1706.08198.pdf 
https://www.aclweb.org/anthology/W14-7008.pdf - the papers I loosely followed.

For Japanese tokenization I used tinysegmenter: https://pypi.org/project/tinysegmenter/.

Corpora:

First, the Kurohashi-Kawahara Lab corpus: http://nlp.ist.i.kyoto-u.ac.jp/EN/?JEC%20Basic%20Sentence%20Data


Then I found a larger one, a parallel corpus of English-Japanese subtitles: https://nlp.stanford.edu/projects/jesc/data/raw.tar.gz


In [1]:
!pip install tinysegmenter

Collecting tinysegmenter
  Downloading https://files.pythonhosted.org/packages/9c/70/488895cb11e160b548c9ba5847c171b65b86a8ca1e54d206d55b2976bf7b/tinysegmenter-0.4.tar.gz
Building wheels for collected packages: tinysegmenter
  Building wheel for tinysegmenter (setup.py) ... done
  Created wheel for tinysegmenter: filename=tinysegmenter-0.4-cp36-none-any.whl size=13537 sha256=73ce64c7ad54defa9321cdaf1402eb9020068ba22a70076a7b6bd05edfc8a2d9
  Stored in directory: /root/.cache/pip/wheels/68/71/2b/6402196bf28012826e507ef7b99df6ebd98cce78bd99023471
Successfully built tinysegmenter
Installing collected packages: tinysegmenter
Successfully installed tinysegmenter-0.4


In [2]:
import sys
import os
import math
import random

import numpy as np
import pandas as pd
import spacy
import tinysegmenter
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
from torchtext.data import Field, BucketIterator, TabularDataset
from tqdm import tqdm


In [3]:
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [4]:
spacy_en = spacy.load('en')

In [5]:
segmenter = tinysegmenter.TinySegmenter()

In [6]:
! wget https://nlp.stanford.edu/projects/jesc/data/raw.tar.gz

--2020-12-20 20:33:10--  https://nlp.stanford.edu/projects/jesc/data/raw.tar.gz
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 102198198 (97M) [application/x-gzip]
Saving to: ‘raw.tar.gz’


2020-12-20 20:33:18 (13.5 MB/s) - ‘raw.tar.gz’ saved [102198198/102198198]



In [7]:
!tar -xzf raw.tar.gz

In [8]:
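# NOTE: read_csv treats the first line as a header by default; if raw/raw has
# no header row, pass header=None to keep the first sentence pair.
my_frame = pd.read_csv('raw/raw', sep='\t')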

In [9]:

my_frame.columns = ['en', 'jp']
my_frame = my_frame[:400000]
# my_frame = my_frame[my_frame.columns[::-1]]

In [10]:
my_frame

Unnamed: 0,en,jp
0,my opponent is shark.,俺の相手は シャークだ。
1,this is one thing in exchange for another.,引き換えだ ある事とある物の
2,"yeah, i'm fine.",もういいよ ごちそうさま ううん
3,don't come to the office anymore. don't call m...,もう会社には来ないでくれ 電話もするな
4,looks beautiful.,きれいだ。
...,...,...
399995,my phone is dialing for an internet connection.,電話がインターネット接続しようとしてる
399996,i summon absurd stealer from my hand!,手札から アブサード・スティーラーを召喚!
399997,"yes, about your double degree.",いいえ その学位については
399998,i think...,似てるって思うんだけどね。


In [11]:
segmenter.tokenize(my_frame['jp'][1])

['引き換え', 'だ', ' ', 'ある', '事', 'と', 'ある', '物', 'の']

In [12]:
[tok.text for tok in spacy_en.tokenizer(my_frame['en'][1])]

['this', 'is', 'one', 'thing', 'in', 'exchange', 'for', 'another', '.']

In [13]:
my_frame.to_csv('my_frame.csv', index=False)  

At some point I accidentally removed MAX_LEN and then forgot about it, and for a while I could not figure out why the model crashed on the full data.

In [14]:
# MAX_LEN = 25
MAX_LEN = 20

def tokenize_jp(text):
    """
    Tokenizes JP text from a string into a list of strings
    """
    # Whitespace and punctuation tokens to drop after segmentation
    stops = {'  ', ' ', '...', '「', '、', '」', '➡', '《', '-'}
    return [i for i in segmenter.tokenize(text)[:MAX_LEN] if i not in stops]


def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    res = [tok.text for tok in spacy_en.tokenizer(text)]
    return res[:MAX_LEN]
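
Why the MAX_LEN cap matters: one plausible reason for the crash on the full data is memory, though this is an assumption, since the notebook does not record the actual error. The Seq2Seq model defined later allocates a `[trg_len, batch_size, trg_vocab_size]` tensor of logits per batch, so its size grows linearly with the longest target sentence. A rough estimate, using the batch size and target vocabulary size reported later in this notebook and assuming float32 (the 100-token sentence is a made-up illustration):

```python
# Rough size of the decoder logits tensor for one batch (float32 assumed).
BYTES_PER_FLOAT32 = 4

def logits_bytes(trg_len, batch_size, vocab_size):
    """Bytes needed for a [trg_len, batch_size, vocab_size] float32 tensor."""
    return trg_len * batch_size * vocab_size * BYTES_PER_FLOAT32

# MAX_LEN = 20 tokens plus <sos> and <eos> gives at most 22 positions.
capped = logits_bytes(trg_len=22, batch_size=256, vocab_size=97819)
# A hypothetical 100-token subtitle line without the cap.
uncapped = logits_bytes(trg_len=102, batch_size=256, vocab_size=97819)

print(f"capped:   {capped / 2**30:.2f} GiB")
print(f"uncapped: {uncapped / 2**30:.2f} GiB")
```

Gradients and intermediate activations come on top of this, so the estimate is a lower bound.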

For data handling I decided to use the torchtext tools.


In [15]:
SRC = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>')
TRG = Field(tokenize=tokenize_jp, init_token='<sos>', eos_token='<eos>')

In [16]:
dataset = TabularDataset(path='my_frame.csv', 
                         format='csv', 
                         fields=[ ('en', SRC), ('jp', TRG)],
                         skip_header=True)

In [17]:
train_data, valid_data, test_data = dataset.split(split_ratio=[0.7, 0.1, 0.2], 
                                            random_state=random.getstate())

In [18]:
SRC.build_vocab(train_data, min_freq=1)
TRG.build_vocab(train_data, min_freq=1)
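
What `build_vocab` does can be sketched in plain Python. This is a simplified stand-in, not torchtext's implementation: the helpers `build_vocab` and `numericalize` below are hypothetical, and the order of the special tokens is an assumption chosen to match the `<pad>` index of 1 printed later in the notebook.

```python
from collections import Counter

SPECIALS = ['<unk>', '<pad>', '<sos>', '<eos>']  # assumed order

def build_vocab(token_lists, min_freq=1):
    """Map tokens seen at least min_freq times to integer ids."""
    freqs = Counter(tok for toks in token_lists for tok in toks)
    # Most frequent first; ties broken alphabetically for determinism.
    kept = sorted((t for t, c in freqs.items() if c >= min_freq),
                  key=lambda t: (-freqs[t], t))
    itos = SPECIALS + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi):
    """Wrap a sentence in <sos>/<eos>; unknown tokens map to <unk>."""
    unk = stoi['<unk>']
    wrapped = ['<sos>'] + tokens + ['<eos>']
    return [stoi.get(t, unk) for t in wrapped]

corpus = [['i', 'do', "n't", 'care', '.'], ['i', 'care', '.']]
stoi, itos = build_vocab(corpus, min_freq=2)
ids = numericalize(['i', 'do', 'care', '.'], stoi)
print(ids)
```

Raising `min_freq` shrinks the vocabulary, and with it the embedding and output layers, at the cost of mapping more tokens to `<unk>`.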

In [19]:
print(vars(train_data.examples[0])['en'])

['do', "n't", '.', 'i', 'do', "n't", 'care', '.']


In [20]:
print(vars(train_data.examples[0])['jp'])

['やめて']


In [21]:
print (len(SRC.vocab), len(TRG.vocab))
print (SRC.vocab.freqs.most_common(10))
print (TRG.vocab.freqs.most_common(10))

52299 97819
[('.', 181489), (',', 95507), ('the', 70421), ('you', 70229), ('i', 66790), ('?', 52584), ('to', 49246), ('a', 41319), ("'s", 39623), ('it', 39496)]
[('の', 107042), ('は', 80222), ('に', 72059), ('て', 66591), ('を', 61822), ('が', 56924), ('た', 56370), ('?', 41014), ('。', 37472), ('だ', 36645)]


In [22]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 256

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
     batch_size = BATCH_SIZE,
     sort_key=lambda x: len(x.jp), 
     sort_within_batch=False,
     device = device)
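
The idea behind `BucketIterator` is to batch sentences of similar length together so that little padding is needed. A minimal plain-Python sketch of that idea follows; it is not torchtext's actual algorithm, and `bucket_batches` and `pad_count` are hypothetical helpers.

```python
def bucket_batches(examples, batch_size, key=len):
    """Sort examples by length, then cut the sorted list into batches."""
    ordered = sorted(examples, key=key)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

def pad_count(batches):
    """Total number of <pad> tokens needed to make each batch rectangular."""
    total = 0
    for batch in batches:
        longest = max(len(ex) for ex in batch)
        total += sum(longest - len(ex) for ex in batch)
    return total

# Toy "sentences" of varying length.
sentences = [['w'] * n for n in (2, 9, 3, 10, 1, 8)]

# Batches in arrival order versus batches of similar length.
naive = [sentences[i:i + 2] for i in range(0, len(sentences), 2)]
bucketed = bucket_batches(sentences, batch_size=2)

print(pad_count(naive), pad_count(bucketed))
```

Less padding means fewer wasted decoder steps per batch, which matters here because every step produces logits over the whole target vocabulary.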


In [23]:

# for b in valid_iterator:
#     print (b.jp, b.en)
#     sys.exit()


In [24]:
print(vars(valid_iterator))

{'batch_size': 256, 'train': False, 'dataset': <torchtext.data.dataset.Dataset object at 0x7f2e07a2f2b0>, 'batch_size_fn': None, 'iterations': 0, 'repeat': False, 'shuffle': False, 'sort': True, 'sort_within_batch': False, 'sort_key': <function <lambda> at 0x7f2e07a328c8>, 'device': device(type='cpu'), 'random_shuffler': <torchtext.data.utils.RandomShuffler object at 0x7f2e07a2fd30>, '_iterations_this_epoch': 0, '_random_state_this_epoch': None, '_restored_from_state': False}


In [25]:
print(len(valid_iterator))

313


The section with the networks themselves.

In [26]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        embedded = self.dropout(self.embedding(src))        
        outputs, (hidden, cell) = self.rnn(embedded)

        
        return hidden, cell

In [27]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        input = input.unsqueeze(0)

        embedded = self.dropout(self.embedding(input))
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        prediction = self.fc_out(output.squeeze(0))
        
        return prediction, hidden, cell

In [28]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        hidden, cell = self.encoder(src)
        input = trg[0,:]
        
        for t in range(1, trg_len):

            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1) 
            input = trg[t] if teacher_force else top1
        
        return outputs
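
The loop in `forward` mixes two ways of choosing the next decoder input: the ground-truth token (teacher forcing) or the model's own previous prediction. A toy sketch of just that choice, with the decoder replaced by a hypothetical `predict` function so that it runs without PyTorch:

```python
import random

def decoder_inputs(target, predict, teacher_forcing_ratio, rng=random):
    """Return the tokens fed to the decoder at each step.

    target  -- ground-truth sequence starting with <sos>
    predict -- stand-in for the decoder: maps the current input token
               to a predicted next token (hypothetical, for illustration)
    """
    fed = []
    current = target[0]                 # always start from <sos>
    for t in range(1, len(target)):
        fed.append(current)
        predicted = predict(current)
        use_truth = rng.random() < teacher_forcing_ratio
        current = target[t] if use_truth else predicted
    return fed

target = ['<sos>', 'a', 'b', 'c', '<eos>']
always_x = lambda tok: 'x'              # a deliberately bad "model"

forced = decoder_inputs(target, always_x, teacher_forcing_ratio=1.0)
free = decoder_inputs(target, always_x, teacher_forcing_ratio=0.0)
print(forced)
print(free)
```

With a ratio of 1.0 the decoder only ever sees correct prefixes; with 0.0 it has to recover from its own mistakes, which is the setting `evaluate` uses.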

In [29]:
len(SRC.vocab), len(TRG.vocab)

(52299, 97819)

In [30]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.6
DEC_DROPOUT = 0.6

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

In [31]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(52299, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.6)
    (dropout): Dropout(p=0.6, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(97819, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.6)
    (fc_out): Linear(in_features=512, out_features=97819, bias=True)
    (dropout): Dropout(p=0.6, inplace=False)
  )
)

In [32]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 95,967,771 trainable parameters
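
The reported total can be checked by hand from the layer shapes printed above. The sketch assumes PyTorch's LSTM parameter layout (four gates, each with input weights, hidden weights, and two bias vectors per layer); the helper `lstm_params` is hypothetical.

```python
def lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional multi-layer LSTM."""
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        weights = 4 * hidden_size * (in_size + hidden_size)
        biases = 2 * 4 * hidden_size      # bias_ih and bias_hh
        total += weights + biases
    return total

SRC_VOCAB, TRG_VOCAB = 52299, 97819
EMB, HID, LAYERS = 256, 512, 2

enc_embedding = SRC_VOCAB * EMB
dec_embedding = TRG_VOCAB * EMB
lstm = lstm_params(EMB, HID, LAYERS)       # same shape in encoder and decoder
fc_out = HID * TRG_VOCAB + TRG_VOCAB       # weights + bias

total = enc_embedding + dec_embedding + 2 * lstm + fc_out
print(f"{total:,}")
```

Over half of the parameters sit in `fc_out`, and another quarter in the decoder embedding, so the size of the target vocabulary dominates the model.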


In [33]:
optimizer = optim.Adam(model.parameters())

TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
pad_idx = TRG.vocab.stoi['<pad>']
print(TRG.pad_token)  # <pad>
print(TRG.vocab.stoi[TRG.pad_token]) # 1 

criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)

<pad>
1


Training and evaluation.

At first I set up a progress_bar and seemed to have gotten rid of all its glitches in Colab, but at some point it started misbehaving again. A day before the deadline I decided to settle for printing the results for now; the progress_bar will come back if there is time left to fix it again.

In [34]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    my_losses = []
    for i, batch in enumerate(iterator):
        # progress_bar = tqdm(total=len(iterator), desc=f'{ i }')
        src = batch.en
        trg = batch.jp
        
        optimizer.zero_grad()
        
        output = model(src, trg)

        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)

        loss = criterion(output, trg)

        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        my_losses.append(loss.item())
        if i%10 == 0:
          print(f'fmean losses: { np.mean(my_losses[-1000:]) } ', f'iter { i }' )
        # progress_bar.set_postfix(loss=np.mean(my_losses[-1000:]),
                            # perplexity=np.exp(np.mean(my_losses[-1000:])))
        # progress_bar.update()
     # progress_bar.close()
    return epoch_loss / len(iterator)

In [35]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.en
            trg = batch.jp

            output = model(src, trg, 0) 
            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [36]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [37]:
import time

In [38]:

# for instance in list(tqdm._instances):
#   tqdm._decr_instances(instance)


In [39]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


This is a pass over the full data, which I could not get to work at all for some time.


In [None]:
N_EPOCHS = 1
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)

    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    # if valid_loss < best_valid_loss:
    #     best_valid_loss = valid_loss
    #     torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    # print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')


fmean losses: 11.47179889678955  iter 0
fmean losses: 9.302615079012783  iter 10
fmean losses: 8.141110102335611  iter 20
fmean losses: 7.679433730340773  iter 30
fmean losses: 7.4254350662231445  iter 40
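
Two quick sanity checks on these numbers. A model that spreads probability uniformly over the target vocabulary has a cross-entropy of ln(V), so with V = 97,819 the first logged loss should be close to that value. Perplexity is the exponential of the loss, as in the `Train PPL` print above.

```python
import math

TRG_VOCAB_SIZE = 97819          # len(TRG.vocab) reported earlier

# Expected loss of a uniform guess over the vocabulary.
uniform_loss = math.log(TRG_VOCAB_SIZE)

# Running mean losses logged at iterations 0 and 40.
first_loss = 11.47179889678955
later_loss = 7.4254350662231445

print(f"uniform-guess loss: {uniform_loss:.3f}")
print(f"perplexity at iter 0:  {math.exp(first_loss):,.0f}")
print(f"perplexity at iter 40: {math.exp(later_loss):,.0f}")
```

A first loss far above ln(V) would point to an initialization or indexing problem; here the two agree to within a few hundredths.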


In [None]:
def translate_sentence(sentence, src_field, trg_field, model, device, max_len=50):
    model.eval()

    if isinstance(sentence, str):
        # Reuse the tokenizer loaded earlier instead of reloading spaCy on every call
        tokens = [token.text.lower() for token in spacy_en.tokenizer(sentence)]
    else:
        tokens = [token.lower() for token in sentence]

    tokens = [src_field.init_token] + tokens + [src_field.eos_token]

    src_indexes = [src_field.vocab.stoi[token] for token in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)

    with torch.no_grad():
        hidden, cell = model.encoder(src_tensor)

    sos_idx = trg_field.vocab.stoi[trg_field.init_token]
    eos_idx = trg_field.vocab.stoi[trg_field.eos_token]
    trg_indexes = [sos_idx]

    # Greedy decoding: feed the most likely token back in until <eos> or max_len
    for i in range(max_len):
        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)

        with torch.no_grad():
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell)
        pred_token = output.argmax(1).item()
        if pred_token == eos_idx:
            break
        trg_indexes.append(pred_token)

    # Drop <sos>; <eos> is never appended, so nothing is lost when max_len is reached
    return [trg_field.vocab.itos[i] for i in trg_indexes[1:]]

In [None]:
example_idx = 24

src = vars(train_data.examples[example_idx])['en']
trg = vars(train_data.examples[example_idx])['jp']

print(f'src = {src}')
print(f'trg = {trg}')

translation = translate_sentence(src, SRC, TRG, model, device)

print(f'predicted trg = {translation}')

Judging by eye, the result is simply terrible. Apparently I set the rare-word frequency threshold too high (currently every token that occurs fewer than 4 times is filtered out).


In [None]:
model_save_name = 'tut1-model.pt'
path = F"/content/gdrive/My Drive/{model_save_name}" 
torch.save(model.state_dict(), path)

In [None]:
translation2 = translate_sentence('I go to school', SRC, TRG, model, device)

print(f'predicted trg = { " ".join(translation2) } ')

In [None]:
translation2 = translate_sentence('I want to eat and drink', SRC, TRG, model, device)

print(f'predicted trg = { " ".join(translation2) }')

In [None]:
translation2 = translate_sentence('I hate school and study', SRC, TRG, model, device)

print(f'predicted trg = { " ".join(translation2) }')

For evaluation I used the BLEU metric, with the implementation from the NLTK library.


In [None]:
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate import bleu_score

In [None]:
for i in train_data.examples[:3]:
  print(vars(i)['jp'])
  #print(vars(i)['en'])


In [None]:
def cal_bleu_score(dataset_pairs):
    # corpus_bleu expects token lists: first one list of references per
    # sentence, then the hypotheses, in that argument order.
    targets = []
    predictions = []

    for i in dataset_pairs:
        target = vars(i)['jp']
        predicted_words = translate_sentence(vars(i)['en'], SRC, TRG, model, device)
        predictions.append(predicted_words)
        targets.append([target])  # a single reference per sentence
    print(predictions[:3])
    print(targets[:3])
    print(f'BLEU Score: {round(corpus_bleu(targets, predictions) * 100, 2)}')
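
BLEU is built on clipped n-gram precision over token lists, which is why NLTK works with lists of tokens rather than joined strings. A minimal sketch of that one component, without the brevity penalty or the geometric mean over n-gram orders; `ngrams` and `clipped_precision` are hypothetical helpers, and the Japanese sentence pair is a made-up illustration, not taken from the corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams that also occur in the reference,
    counting each n-gram at most as often as the reference has it."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    if not hyp_counts:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())

reference = ['私', 'は', '学校', 'に', '行く']
hypothesis = ['私', 'は', 'は', '学校']

p1 = clipped_precision(reference, hypothesis, 1)
p2 = clipped_precision(reference, hypothesis, 2)
print(p1, p2)
```

The repeated 'は' is credited only once, which is how clipping stops a model from scoring well by repeating frequent particles.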

In [None]:
len(valid_data.examples)

In [None]:
cal_bleu_score(valid_data.examples)

In [None]:
cal_bleu_score(train_data.examples)

Just in case, I will save this version and try to retrain if I have time.