<a href="https://colab.research.google.com/github/RealAntonVoronov/computational_humour/blob/master/from_pretrained.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine translation with OpenNMT for the Duolingo STAPLE competition

## Loading the data

The data provided by the organizers is clearly not enough to train a full-fledged language model, so parallel subtitle corpora were downloaded for each of the 5 languages (http://opus.nlpl.eu/OpenSubtitles-v2016.php). The data can also be found on the nlp1 server in the folder `voronov/data/OpenSubtitles`.

In [0]:
# This cell is for Colab; skip it if you don't need to connect to Drive.
import os
from google.colab import drive
drive.mount('/content/gdrive')

course = 'en_pt'
path_to_corpora = os.path.join('/content/gdrive/My Drive/data/work/Panchenko/corpora/OpenSubtitles/', course)
path_to_duolingo = os.path.join('/content/gdrive/My Drive/data/work/Panchenko/duolingo/data/', course)
path_to_model = os.path.join('/content/gdrive/My Drive/data/work/Panchenko/language_models', course)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
os.chdir(path_to_corpora) 

In [0]:
!wget http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2016/moses/en-ko.txt.zip
!unzip  'download.php?f=OpenSubtitles%2Fv2016%2Fmoses%2Fen-ko.txt.zip'

--2020-03-25 14:12:00--  http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2016/moses/en-ko.txt.zip
Resolving opus.nlpl.eu (opus.nlpl.eu)... 193.166.25.9
Connecting to opus.nlpl.eu (opus.nlpl.eu)|193.166.25.9|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/moses/en-ko.txt.zip [following]
--2020-03-25 14:12:00--  https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/moses/en-ko.txt.zip
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12433629 (12M) [application/zip]
Saving to: ‘download.php?f=OpenSubtitles%2Fv2016%2Fmoses%2Fen-ko.txt.zip’


2020-03-25 14:12:01 (26.1 MB/s) - ‘download.php?f=OpenSubtitles%2Fv2016%2Fmoses%2Fen-ko.txt.zip’ saved [12433629/12433629]

Archive:  download.php?f=OpenSubtitles%2Fv2016%2Fmoses%2Fen-ko.txt.zip
 

To run OpenNMT correctly, the data has to be split into a training part and a validation part. I decided to use only 1 million sentence pairs for training (for Korean, only about 370 thousand sentences were available in total) and 5000 pairs for validation. This is not validation in the strict sense: it is only needed for the OpenNMT scripts to work correctly, and their guide says that 5000 pairs are enough.

In [0]:
!wc -l OpenSubtitles.en-ko.en

370702 OpenSubtitles.en-ko.en


In [0]:
!head -n 365000 'OpenSubtitles.en-ko.en' > 'train_subtitles_1m_en.txt'
!head -n 365000 'OpenSubtitles.en-ko.ko' > 'train_subtitles_1m_ko.txt'
!tail -n 5000 'OpenSubtitles.en-ko.en' > 'dev_subtitles_en.txt'
!tail -n 5000 'OpenSubtitles.en-ko.ko' > 'dev_subtitles_ko.txt'
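
The same split can be sketched in Python. This is a minimal sketch, not the procedure actually used in the notebook: the function name `split_parallel` and its parameters are hypothetical. It takes the first `n_train` pairs for training and the last `n_dev` pairs for validation, as the `head`/`tail` commands do, and additionally checks that the two parts cannot overlap.

```python
def split_parallel(src_lines, tgt_lines, n_train, n_dev):
    """Return (train_pairs, dev_pairs) from two aligned lists of sentences."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if n_train + n_dev > len(src_lines):
        raise ValueError("train and dev parts would overlap")
    pairs = list(zip(src_lines, tgt_lines))
    # First n_train pairs for training, last n_dev pairs for validation.
    return pairs[:n_train], pairs[len(pairs) - n_dev:]


# Toy example with 10 aligned "sentences".
src = [f"en {i}" for i in range(10)]
tgt = [f"ko {i}" for i in range(10)]
train, dev = split_parallel(src, tgt, n_train=6, n_dev=3)
print(len(train), len(dev))
```

With 370702 lines, 365000 training pairs and 5000 validation pairs pass the overlap check.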

## Duolingo data

To build predictions for the Duolingo competition, I took the `*.aws_baseline.pred.txt` file, which contains one reference translation from Amazon for each sentence. I preprocessed it (removed the line prefixes, joined each "prompt, translation" pair of lines into a single tab-separated line, and deleted blank lines) and split it into two files: one with the source language and one with the target language. To make sure the model meets no unknown words at prediction time, I added the resulting files to the training set.

In [0]:
os.chdir(os.path.join(path_to_duolingo, 'train'))
#remove all markers | remove blank lines | combine every two lines in one
#!sed 's/.*|//g' train.en_hu.aws_baseline.pred.txt | grep . | paste -d "\t"  - - > train_duolingo_en_hu.txt

In [0]:
with open('train_duolingo_en_ko.txt') as f, open('train_duolingo_en.txt', 'w') as file_en, \
open('train_duolingo_ko.txt', 'w') as file_ko:
    for line in f:
        # each line is "<english>\t<translation>"
        pair = line.strip().split('\t')
        file_en.write(pair[0]+'\n')
        file_ko.write(pair[1]+'\n')
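
The preprocessing done by the commented-out `sed | grep | paste` pipeline can also be sketched in Python. This sketch assumes the file layout implied by that pipeline: a prompt line whose text follows the last `|`, then a translation line, then a blank line. The function name `parse_baseline` and the sample prompt ids are hypothetical.

```python
def parse_baseline(lines):
    """Turn baseline-prediction lines into (source, translation) pairs."""
    # Drop everything up to and including the last '|' and skip blank lines,
    # roughly mirroring `sed 's/.*|//g' | grep .` (whitespace is also stripped).
    cleaned = [line.rsplit("|", 1)[-1].strip() for line in lines]
    cleaned = [line for line in cleaned if line]
    if len(cleaned) % 2:
        raise ValueError("expected an even number of non-blank lines")
    # Consecutive lines form one pair, mirroring `paste - -`.
    return list(zip(cleaned[0::2], cleaned[1::2]))


sample = [
    "prompt_0001|is this my book?\n",
    "este é o meu livro?\n",
    "\n",
    "prompt_0002|i count.\n",
    "eu conto.\n",
]
pairs = parse_baseline(sample)
print(pairs)
```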

In [0]:
!tail -5 ../../../../corpora/OpenSubtitles/en_ko/train_subtitles_1m_en.txt
!cat train_duolingo_en.txt >> ../../../../corpora/OpenSubtitles/en_ko/train_subtitles_1m_en.txt
!cat train_duolingo_ko.txt >> ../../../../corpora/OpenSubtitles/en_ko/train_subtitles_1m_ko.txt
!tail -5 ../../../../corpora/OpenSubtitles/en_ko/train_subtitles_1m_en.txt

Again, nothing witchy.
Ghost?
Hard to say.
I mean, there's EMF in the church, but it's built on a burial ground.
You know that all the victims recently went to confession?
english is an international language.
which floor is it?
the waiter has asked everybody.
it is a beautiful bird.
i count.


## Training

In [0]:
!pip install OpenNMT-py



In [0]:
os.chdir(os.path.join(path_to_model, 'openNMT'))

In [0]:
!onmt_preprocess -train_src ../../../corpora/OpenSubtitles/en_vi/train_subtitles_1m_en.txt \
-train_tgt ../../../corpora/OpenSubtitles/en_vi/train_subtitles_1m_vi.txt \
-valid_src ../../../corpora/OpenSubtitles/en_vi/dev_subtitles_en.txt \
-valid_tgt ../../../corpora/OpenSubtitles/en_vi/dev_subtitles_vi.txt \
-tgt_vocab_size 50000 -src_vocab_size 50000 --src_seq_length 25 --tgt_seq_length 25 \
-save_data nmt_subs_en_vi

[2020-03-26 12:23:25,754 INFO] Extracting features...
[2020-03-26 12:23:26,631 INFO]  * number of source features: 0.
[2020-03-26 12:23:26,632 INFO]  * number of target features: 0.
[2020-03-26 12:23:26,632 INFO] Building `Fields` object...
[2020-03-26 12:23:26,633 INFO] Building & saving training data...
[2020-03-26 12:23:29,506 INFO] Building shard 0.
[2020-03-26 12:24:09,090 INFO]  * saving 0th train data shard to nmt_subs_en_vi.train.0.pt.
[2020-03-26 12:24:32,843 INFO] Building shard 1.
[2020-03-26 12:24:32,987 INFO]  * saving 1th train data shard to nmt_subs_en_vi.train.1.pt.
[2020-03-26 12:24:33,406 INFO]  * tgt vocab size: 50004.
[2020-03-26 12:24:33,731 INFO]  * src vocab size: 50002.
[2020-03-26 12:24:34,575 INFO] Building & saving validation data...
[2020-03-26 12:24:35,621 INFO] Building shard 0.
[2020-03-26 12:24:35,778 INFO]  * saving 0th valid data shard to nmt_subs_en_vi.valid.0.pt.


This is a lightweight transformer model, chosen so that training takes a reasonable amount of time (~12 hours). I halved `rnn_size`, `word_vec_size`, `heads`, and `train_steps`. Everything else is left as in the OpenNMT recommendations, which reportedly replicate Google's original setup.
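
The training command below uses `-decay_method noam`, which ties the learning rate to the model width, so halving `rnn_size` also changes the schedule. The sketch below uses the schedule from the original Transformer paper, which is assumed here to match what OpenNMT does; `noam_lr` is a hypothetical helper, not an OpenNMT function.

```python
def noam_lr(step, model_size, warmup_steps, base_lr):
    """Learning rate at a given step: linear warm-up, then inverse-sqrt decay."""
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * model_size ** -0.5 * scale


# Values from the training command: -rnn_size 256 -warmup_steps 8000 -learning_rate 2
peak_small = noam_lr(8000, model_size=256, warmup_steps=8000, base_lr=2.0)
# The same schedule with the unhalved width of 512, for comparison.
peak_base = noam_lr(8000, model_size=512, warmup_steps=8000, base_lr=2.0)
print(f"{peak_small:.6f} {peak_base:.6f}")
```

Under this assumption the narrower model peaks at a higher learning rate than the full-size one, because the rate scales with the inverse square root of the model width.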

In [0]:
!onmt_train -data nmt_subs_en_vi -save_model transformer \
        -layers 6 -rnn_size 256 -word_vec_size 256 -transformer_ff 2048 -heads 4  \
        -encoder_type transformer -decoder_type transformer -position_encoding \
        -train_steps 100000  -max_generator_batches 2 -dropout 0.1 \
        -batch_size 4096 -batch_type tokens -normalization tokens  -accum_count 2 \
        -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
        -max_grad_norm 0 -param_init 0  -param_init_glorot \
        -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 5000 \
        -world_size 1 -gpu_ranks 0
# I omitted the training-statistics output for readability.

[2020-03-26 12:25:40,102 INFO]  * src vocab size = 50002
[2020-03-26 12:25:40,102 INFO]  * tgt vocab size = 50004
[2020-03-26 12:25:40,103 INFO] Building model...
[2020-03-26 12:25:49,045 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50002, 256, padding_idx=1)
        )
        (pe): PositionalEncoding(
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (transformer): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=256, out_features=256, bias=True)
          (linear_values): Linear(in_features=256, out_features=256, bias=True)
          (linear_query): Linear(in_features=256, out_features=256, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=256, out_feat

Now we have a translation model, the source text to translate, and its reference translation from Amazon. As an example, let's print the 10 best translations for each sentence (automatically choosing the right number of translations per sentence is left for later).

In [0]:
!onmt_translate -model transformer_step_35000.pt -src ../../../duolingo/data/en_pt/train/train_duolingo_en.txt \
-output train_duolingo_pred10_pt.txt -n_best 10 -replace_unk

[2020-03-30 21:45:26,228 INFO] Translating shard 0.
PRED AVG SCORE: -0.2945, PRED PPL: 1.3425


In [0]:
!head -n 20 train_duolingo_pred10_pt.txt

Você poderia baixar o video para mim?
Você poderia video o video para mim?
Você poderia video o video
Você poderia baixar esse video para mim?
poderia baixar o video para mim?
Você poderia baixar o video para mim.
Você poderia baixar o video para mim? - Não.
Você poderia baixar o video para mim? - Sim.
Você poderia baixar o video para mim? - Não, obrigado.
Você poderia baixar o video para mim? - Sim!
a livraria não é nesta rua.
A livraria não é nesta rua.
a livraria não está nesta rua.
A livraria não está nesta rua.
a livraria não é desta rua.
a livraria.
a livraria não é nesta rua.
a livraria não é nesta rua.
a livraria não é nesta rua.
a livraria não é nesta rua.
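
With `-n_best 10` the output file holds 10 consecutive lines per source sentence, as the listing above shows. A minimal sketch of regrouping such a flat file by source sentence (the helper name `group_nbest` is hypothetical):

```python
def group_nbest(lines, n_best):
    """Split a flat list of hypotheses into consecutive groups of n_best."""
    if len(lines) % n_best:
        raise ValueError("number of lines is not a multiple of n_best")
    return [
        [line.strip() for line in lines[i:i + n_best]]
        for i in range(0, len(lines), n_best)
    ]


# Toy example with n_best=2 instead of 10.
flat = [
    "a livraria não é nesta rua.\n",
    "A livraria não é nesta rua.\n",
    "Posso caminhar até lá?\n",
    "Posso andar lá?\n",
]
groups = group_nbest(flat, n_best=2)
print(groups[1][0])
```

The first element of each group is the top beam hypothesis, and the rest are the alternatives in decreasing model score.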


In [0]:
!onmt_translate -model transformer_step_35000.pt -src ../../../duolingo/data/en_pt/train/train_duolingo_en.txt \
-output train_duolingo_pred_pt.txt -replace_unk

[2020-03-30 22:30:47,677 INFO] Translating shard 0.
PRED AVG SCORE: -0.2945, PRED PPL: 1.3425


## TODO: Evaluation



In [0]:
!pwd

/content/gdrive/My Drive/data/work/Panchenko/language_models/en_pt/openNMT


In [0]:
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu
import re

references = []
predictions = []
with open('../../../duolingo/data/en_pt/train/train_duolingo_pt.txt') as gold:
    for sentence in gold.readlines():
        s = ' '.join(re.split(r'([^0-9a-zÀ-ÿ\s])',sentence.lower()))
        s = re.sub(r' +', ' ', s)
        references.append(s.split())
with open('train_duolingo_pred_pt.txt') as preds:
    for sentence in preds.readlines():
        s = ' '.join(re.split(r'([^0-9a-zÀ-ÿ\s])',sentence.lower()))
        s = re.sub(r' +', ' ', s)
        predictions.append(s.split())
print(predictions[1])
print(references[1])

['a', 'livraria', 'não', 'é', 'nesta', 'rua', '.']
['a', 'livraria', 'não', 'é', 'nesta', 'rua', '.']
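
The normalization used in the cell above (lowercasing and splitting punctuation off words) is repeated in several later cells, so it could be factored into one helper. A minimal sketch, with the hypothetical name `tokenize`; the `re.sub(r' +', ' ', s)` step is left out because `str.split()` already collapses runs of whitespace:

```python
import re


def tokenize(sentence):
    """Lowercase, split punctuation off words, and return a list of tokens."""
    # The capturing group keeps each punctuation mark as its own piece.
    spaced = " ".join(re.split(r"([^0-9a-zÀ-ÿ\s])", sentence.lower()))
    return spaced.split()


print(tokenize("A livraria não é nesta rua.\n"))
# → ['a', 'livraria', 'não', 'é', 'nesta', 'rua', '.']
```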


In [0]:
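# NOTE: corpus_bleu expects one list of references per hypothesis, i.e.
# [[ref_tokens], ...]. Here `references` is a flat list of token lists, so each
# token string is treated as a separate reference.
corpus_bleu(references, predictions)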

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


0.6922290349478658

In [0]:
os.chdir(os.path.join(path_to_duolingo, 'train'))
hypothesises = []
with open('train.en_pt.gold.txt') as f:
    for sentence in f.readlines():
        if sentence.startswith('prompt_'):
            obnulyay = True
            continue
        else:
            sentence = sentence.split('|')[0]
            s = ' '.join(re.split(r'([^0-9a-zÀ-ÿ\s])',sentence.lower()))
            s = re.sub(r' +', ' ', s)
            if obnulyay:            
                hypothesises.append([s.split()])
                obnulyay = False
            else:
                hypothesises[-1].append(s.split())

['podes', 'baixar', 'aquele', 'vídeo', 'pra', 'mim', '?']


In [0]:
print(corpus_bleu(hypothesises, references))

0.9527666953628904


In [0]:
beam_predictions = []
with open(os.path.join(path_to_model+'/openNMT/train_duolingo_pred10_pt.txt')) as preds:
    for i, sentence in enumerate(preds.readlines()):
        if i%10==0:
            continue
        s = ' '.join(re.split(r'([^0-9a-zÀ-ÿ\s])',sentence.lower()))
        s = re.sub(r' +', ' ', s)
        if i%10==1:
            beam_predictions.append([s.split()])
        else:
            beam_predictions[-1].append(s.split())

print(corpus_bleu(beam_predictions, references))

0.8845647500357965


In [0]:
for i in range(len(beam_predictions[2])):
    print(sentence_bleu([predictions[2]], beam_predictions[2][i]))

0.6042750794713536
0.7037259479962376
0.6042750794713536
0.5506953149031837
0.5169731539571706
0.392814650900513
0.392814650900513
0.392814650900513
0.3508439695638686


Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


After checking this on several more sets of sentences, I came to the empirical conclusion that candidates whose BLEU score relative to the best translation is below 0.5 are not good translations of the source sentence.

In [0]:
for i in range(len(beam_predictions[2])):
    print(*beam_predictions[2][i])

posso caminhar até lá ?
posso andar lá ?
posso ir até lá ?
posso caminhar lá ?
posso andar até lá ? - não .
posso caminhar até lá ? - não .
posso caminhar até lá ? - não !
posso andar até lá ? - não , obrigado .
posso andar até lá ? - não , não posso .


# Paraphrasing

Paraphrasing requires a language model of the target language. The first question that comes to mind is whether it is acceptable to train a language model that differs from the one inside the translator. Offhand, it seems not. The next question is how easy it is to extract the target-language model from the translator. I spent several hours but could not get access to the model inside our OpenNMT transformer.
So, unfortunately, we need to build our own model that implements both transformer translation and the paraphrasing functionality.

Alternatively, a crazy idea: fine-tune the transformer to translate a sentence into a sequence of sentences, training on pairs of one English sentence and 96 Portuguese ones, and then split the output into sentences and put them into the prediction.
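
The data side of this last idea could be prototyped by joining all translations of a prompt into one target line and splitting the model output back afterwards. A minimal sketch, assuming a separator token `<sep>` that does not occur in the data; the token and both function names are hypothetical, and nothing like this was implemented in the notebook:

```python
SEP = "<sep>"


def join_targets(translations, sep=SEP):
    """Build one training target from several translations of a prompt."""
    return f" {sep} ".join(t.strip() for t in translations)


def split_prediction(prediction, sep=SEP):
    """Split a predicted sequence back into distinct, non-empty sentences."""
    parts = (p.strip() for p in prediction.split(sep))
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(p for p in parts if p))


target = join_targets(["já tem filhos ?", "você já tem filhos ?"])
print(target)
print(split_prediction(target))
```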

# Competition submission

In [0]:
os.chdir('/content/gdrive/My Drive/data/work/Panchenko/language_models/en_pt/openNMT')
!onmt_translate -model transformer_step_35000.pt -src ../../../duolingo/data/en_pt/dev/dev_duolingo_en.txt \
-output ../../../duolingo/data/en_pt/dev/weak_predictions.txt -replace_unk --ignore_when_blocking 3 -n_best 10

[2020-03-31 01:04:40,826 INFO] Translating shard 0.
PRED AVG SCORE: -0.4697, PRED PPL: 1.5995


In [0]:
!head -n 30 '../../../duolingo/data/en_pt/dev/weak_predictions.txt'

O que disse o patrão?
Bem, o que o chefe disse?
Bem, o que disse o chefe?
Bem, o que é que o chefe disse?
Bem, o que disse o patrão?
Bem, o que é que o patrão disse?
Bem, o que é que o chefe respondeu?
Bem, o que é que o chefe disseste?
Bem, o que é que o chefe disse
Bem, o que é que o chefe disse?
São cinquenta fifty por pessoas sem o person
É 50 fifty por pessoa sem o person
É 50 fifty por pessoa sem a person
É 50 xelins por pessoa sem o person
É 50 xelins por pessoa sem a person
São cinquenta fifty por pessoas sem o person
São cinquenta fifty por pessoas sem o person
São cinquenta fifty por pessoas sem o person
São cinquenta fifty por pessoas sem o person
São cinquenta fifty por pessoas sem o person
Já tens filhos?
Você já tem filhos?
Já tem filhos?
Já têm filhos?
Tu já tens filhos?
Você já tem filhos? do
Você já tem filhos? - Eu sei.
Você já tem filhos? _BAR_
Você já tem filhos? - Sim!
Você já tem filhos? - Sim.


In [0]:
import numpy as np

# Read the prompts and the 10-best predictions produced above.
with open(path_to_duolingo+'/dev/dev.'+course+'.prompts.txt') as f:
    prompts = f.readlines()
with open(path_to_duolingo+'/dev/weak_predictions.txt') as f:
    preds = f.readlines()
with open(path_to_duolingo+ '/dev/preds.txt', 'w') as final:
    for j in range(len(prompts)):
        final.write(prompts[j])
        translations = []
        for i in range(10):
            s = preds[j*10+i].strip()
            s = ' '.join(re.split(r'([^0-9a-zÀ-ÿ\s])',s.lower()))
            s = re.sub(r' +', ' ', s)
            translations.append(s)
        # NOTE: np.unique removes duplicates but also sorts, so translations[0]
        # is the alphabetically first candidate, not the top beam hypothesis.
        translations = np.unique(translations)
        final.write(translations[0]+'\n')
        for i in range(1, len(translations)):
            # NOTE: strings are passed here, so BLEU is computed over characters.
            if sentence_bleu([translations[0]], translations[i]) >= 0.5:
                final.write(translations[i]+'\n')
        final.write('\n')

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
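
Two details of the cell above are worth noting: `np.unique` returns its values sorted, so `translations[0]` is the alphabetically first candidate rather than the top beam hypothesis, and `sentence_bleu` receives plain strings, so n-grams are counted over characters. The sketch below shows one possible alternative that deduplicates while preserving beam order and scores token lists. To stay self-contained it uses `simple_bleu`, a simplified stand-in for NLTK's `sentence_bleu` (no smoothing, n-gram order capped by sentence length). All function names are hypothetical, and the 0.5 threshold would need re-checking because these scores differ from the ones recorded above.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, hypothesis, max_n=4):
    """Simplified sentence-level BLEU on token lists (stand-in for NLTK)."""
    order = min(max_n, len(reference), len(hypothesis))
    if order == 0:
        return 0.0
    log_precision = 0.0
    for n in range(1, order + 1):
        hyp, ref = ngram_counts(hypothesis, n), ngram_counts(reference, n)
        matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
        if matched == 0:
            return 0.0
        log_precision += math.log(matched / sum(hyp.values())) / order
    # Brevity penalty: penalise hypotheses shorter than the reference.
    ratio = len(reference) / len(hypothesis)
    penalty = 1.0 if ratio <= 1 else math.exp(1 - ratio)
    return penalty * math.exp(log_precision)


def filter_candidates(candidates, threshold=0.5):
    """Keep the top hypothesis plus distinct candidates similar enough to it."""
    unique = list(dict.fromkeys(candidates))  # preserves beam order
    best = unique[0].split()
    kept = [unique[0]]
    for candidate in unique[1:]:
        if simple_bleu(best, candidate.split()) >= threshold:
            kept.append(candidate)
    return kept


beam = [
    "bem , o que disse o chefe ?",
    "bem , o que disse o chefe ?",
    "bem , o que disse o patrão ?",
    "não sei",
]
print(filter_candidates(beam))
```

Here the duplicate is dropped, the variant with one changed word is kept, and the unrelated candidate is rejected because it shares no tokens with the top hypothesis.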


In [0]:
!cat '/content/gdrive/My Drive/data/work/Panchenko/duolingo/data/en_pt/dev/preds.txt'

prompt_15049b3b99054c230685063e328369a4|well, what did the boss say?
bem , o que disse o chefe ? 
bem , o que disse o patrão ? 
bem , o que o chefe disse ? 
bem , o que é que o chefe disse
bem , o que é que o chefe disse ? 
bem , o que é que o chefe disseste ? 
bem , o que é que o chefe respondeu ? 
bem , o que é que o patrão disse ? 
o que disse o patrão ? 

prompt_aacb068d672931817ffd01ea6b0e0609|it's fifty euros per person without the discount.
são cinquenta fifty por pessoas sem o person
é 50 fifty por pessoa sem a person
é 50 fifty por pessoa sem o person
é 50 xelins por pessoa sem o person

prompt_68125d9e6cf2f7d6e3b247b33d53bf8b|do you have children already?
já tem filhos ? 
já tens filhos ? 
já têm filhos ? 
tu já tens filhos ? 
você já tem filhos ? 
você já tem filhos ? - sim ! 
você já tem filhos ? - sim . 
você já tem filhos ? _ bar _ 
você já tem filhos ? do

prompt_9ddb373f4c02ec24608d7789d73618a5|i must go out.
devo sair . 
preciso sair . 
tenho de sair . 

prompt_33e0435