<a href="https://colab.research.google.com/github/RealAntonVoronov/computational_humour/blob/master/data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data preprocessing

## 0. Load the data. 

The data provided by the organizers is clearly not enough to train a full-fledged language model, so parallel subtitle corpora were downloaded for each of the 5 languages from OpenSubtitles v2016 (http://opus.nlpl.eu/OpenSubtitles-v2016.php). The data can also be found on the nlp1 server in the folder `voronov/data/OpenSubtitles`.

In [29]:
# This cell is for Colab only; skip it if you don't need to mount Google Drive.
import os
from google.colab import drive
drive.mount('/content/gdrive')

course = 'en_pt'
path_to_corpora = os.path.join('/content/gdrive/My Drive/data/work/Panchenko/corpora/OpenSubtitles/', course)
os.chdir(path_to_corpora) 

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
# The language pair in the URL must match `course` (en-pt here).
# The URL is quoted so the shell does not interpret `?`, and -O gives the archive a plain name.
!wget -O en-pt.txt.zip 'http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2016/moses/en-pt.txt.zip'
!unzip en-pt.txt.zip && rm en-pt.txt.zip

For some languages the OpenSubtitles corpus contains far more sentence pairs than we need. We take the first 2 million pairs (where that many are available); preprocessing reduces this number considerably. At the end, another 5 thousand of the remaining sentences are set aside for the validation part of the openNMT script.
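
The final hold-out step mentioned above is not shown in this notebook. A minimal sketch of how it could be done is given below; the function name `split_train_valid`, the random sampling and the fixed seed are assumptions made for illustration, not the procedure actually used.

```python
import random

def split_train_valid(src_lines, tgt_lines, n_valid=5000, seed=0):
    """Hold out n_valid aligned pairs for validation; the rest is training data."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    n_valid = min(n_valid, len(src_lines))
    # Sample the indices once so that both sides of each pair stay aligned.
    valid_idx = set(random.Random(seed).sample(range(len(src_lines)), n_valid))
    train, valid = [], []
    for i, pair in enumerate(zip(src_lines, tgt_lines)):
        (valid if i in valid_idx else train).append(pair)
    return train, valid

# Toy usage with 10 fake pairs and 3 validation examples.
toy_src = [f"en {i}\n" for i in range(10)]
toy_tgt = [f"pt {i}\n" for i in range(10)]
train, valid = split_train_valid(toy_src, toy_tgt, n_valid=3)
```

Sampling indices rather than lines keeps the source and target files in step, which is the only property the rest of the pipeline relies on.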

In [0]:
!head -n 2000000 'OpenSubtitles.en-pt.en' > 'subs_2m.en'
!head -n 2000000 'OpenSubtitles.en-pt.pt' > 'subs_2m.pt'

In [6]:
!head subs_2m.en

Amanonce said, "When you make a friend, you take on a responsibility."
That describes my friend, Danny Barrett.
When he invited me to lunch I should have known there'd be strings attached.
Excuse me, guys.
Sorry.
Sure.
Go ahead.
MacGyver, you're just in time.
For what?
You said lunch.


## 1. Remove pairs where source and target lines are identical.

In [0]:
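# Read both sides fully before the files are reopened for writing.
with open('subs_2m.en') as f_en, open('subs_2m.pt') as f_pt:
    en_lines = f_en.readlines()
    pt_lines = f_pt.readlines()
# Drop pairs whose source and target lines are identical (untranslated text).
with open('subs_2m.pt', 'w') as pt, open('subs_2m.en', 'w') as en:
    for en_line, pt_line in zip(en_lines, pt_lines):
        if en_line != pt_line:
            en.write(en_line)
            pt.write(pt_line)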

In [66]:
!wc -l subs_2m.pt

1973603 subs_2m.pt


## 2. Remove examples where source or target contains multiple sentences

In [118]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

# Read both sides fully before the files are reopened for writing.
with open('subs_2m.en') as f_src, open('subs_2m.pt') as f_tgt:
    src_lines = f_src.readlines()
    tgt_lines = f_tgt.readlines()

# Keep a pair only if both sides consist of exactly one sentence.
# Note: sent_tokenize uses the English Punkt model by default, here also for the target side.
with open('subs_2m.en', 'w') as src, open('subs_2m.pt', 'w') as tgt:
    for src_line, tgt_line in zip(src_lines, tgt_lines):
        n_src_sentences = len(sent_tokenize(src_line.strip()))
        n_tgt_sentences = len(sent_tokenize(tgt_line.strip()))
        if n_src_sentences == 1 and n_tgt_sentences == 1:
            src.write(src_line)
            tgt.write(tgt_line)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [120]:
!wc -l subs_2m.pt
!wc -l subs_2m.en

964630 subs_2m.pt
964630 subs_2m.en


## 3. Remove examples where actual language differs from claimed.

In [2]:
!pip install langid

Collecting langid
  Downloading https://files.pythonhosted.org/packages/ea/4c/0fb7d900d3b0b9c8703be316fbddffecdab23c64e1b46c7a83561d78bd43/langid-1.1.6.tar.gz (1.9MB)
Building wheels for collected packages: langid
  Building wheel for langid (setup.py) ... done
  Created wheel for langid: filename=langid-1.1.6-cp36-none-any.whl size=1941190 sha256=5025d92624f1228df1019e9c73ea8872dfef55c198919c0369a2803510d4c842
  Stored in directory: /root/.cache/pip/wheels/29/bc/61/50a93be85d1afe9436c3dc61f38da8ad7b637a38af4824e86e
Successfully built langid
Installing collected packages: langid
Successfully installed langid-1.1.6


In [0]:
!langid --line -n < subs_2m.en > en_lang.txt &
!langid --line -n < subs_2m.pt > pt_lang.txt &

In [0]:
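src, tgt = course.split('_')  # 'en_pt' -> 'en', 'pt'
src_idx = set()
tgt_idx = set()
with open(src + '_lang.txt') as f_src, open(tgt + '_lang.txt') as f_tgt:
    src_langid = f_src.readlines()
    tgt_langid = f_tgt.readlines()
# Each langid output line is expected to look like "('en', 0.97)":
# the detected language code and its normalised probability.
# A line is kept if langid confidently agrees with the expected language (> 0.7)
# or if it is too unsure for its verdict to justify a rejection (< 0.3).
for i in range(len(src_langid)):
    line = src_langid[i].strip().split()
    lang, prob = line[0][2:4], float(line[1][:-1])
    if (lang == src and prob > 0.7) or prob < 0.3:
        src_idx.add(i)
    line = tgt_langid[i].strip().split()
    lang, prob = line[0][2:4], float(line[1][:-1])
    if (lang == tgt and prob > 0.7) or prob < 0.3:
        tgt_idx.add(i)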
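
The keep/reject rule used in the cell above can also be written as a small standalone function, which makes it easy to check on a few hand-written lines. This is only a sketch: the name `keep_line` is hypothetical, and it assumes every langid output line is a plain Python tuple literal such as `('en', 0.97)`, which it parses with `ast.literal_eval` instead of string slicing.

```python
import ast

def keep_line(langid_line, expected_lang, hi=0.7, lo=0.3):
    """Return True if a langid output line is compatible with expected_lang.

    A line is kept when langid confidently agrees with the expected language
    (probability > hi) or when langid is too unsure to be trusted (< lo).
    """
    # Assumed format: "('en', 0.97)" -> ('en', 0.97)
    lang, prob = ast.literal_eval(langid_line.strip())
    if lang == expected_lang and prob > hi:
        return True
    return prob < lo

# Toy usage on hand-written lines in the assumed format.
examples = ["('en', 0.99)\n", "('fr', 0.95)\n", "('fr', 0.1)\n", "('en', 0.5)\n"]
decisions = [keep_line(line, 'en') for line in examples]
```

Only confident mismatches and half-confident matches are discarded; everything langid is very unsure about is kept.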

In [0]:
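with open('subs_2m.en') as f_en, open('subs_2m.pt') as f_pt:
    en_lines = f_en.readlines()
    pt_lines = f_pt.readlines()
# intersection_update works in place and returns None, so its result must not be assigned.
src_idx.intersection_update(tgt_idx)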

In [0]:
def strip_dash(line):
    # Drop a leading dialogue dash and the character after it ("- Hello" -> "Hello").
    return line[2:] if line.startswith('-') else line

with open('subs_2m.pt', 'w') as pt, open('subs_2m.en', 'w') as en:
    for i, (en_line, pt_line) in enumerate(zip(en_lines, pt_lines)):
        l1 = len(en_line.split())
        l2 = len(pt_line.split())
        # Keep pairs that passed the language check and whose word-count ratio lies in (0.5, 2).
        # The l2 > 0 guard avoids a division by zero on empty target lines.
        if i in src_idx and l2 > 0 and 0.5 < l1 / l2 < 2:
            en_line_towrite = strip_dash(en_line)
            pt_line_towrite = strip_dash(pt_line)
            if en_line_towrite:
                en.write(en_line_towrite)
                pt.write(pt_line_towrite)

In [98]:
!wc -l subs_2m.pt
!wc -l subs_2m.en

1116560 subs_2m_v2.pt
1116560 subs_2m_v2.en


## 4. BPE

In [0]:
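%%bash
# Assumes $dir, $mosesdir, $src, $trg and $merge_ops are already exported in the environment.
# Tokenize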
cat $dir/output/corpus.$src.c.up.nor.up.nor.nonalpha.nonmatch.reptok.goodlang | $mosesdir/tokenizer/normalize-punctuation.perl -l $src | $mosesdir/tokenizer/tokenizer.perl -a -threads 8 -l $src > $dir/1-tok/corpus.tok.$src
cat $dir/output/corpus.$trg.c.up.nor.up.nor.nonalpha.nonmatch.reptok.goodlang | $mosesdir/tokenizer/normalize-punctuation.perl -l $trg | $mosesdir/tokenizer/tokenizer.perl -a -threads 8 -l $trg > $dir/1-tok/corpus.tok.$trg

# Clean
$mosesdir/training/clean-corpus-n.perl $dir/1-tok/corpus.tok $src $trg $dir/2-clean/corpus.clean.tok 2 128

# Train truecasers
$mosesdir/recaser/train-truecaser.perl -corpus $dir/2-clean/corpus.clean.tok.$trg -model $dir/2-clean/truecase-model.$trg
$mosesdir/recaser/train-truecaser.perl -corpus $dir/2-clean/corpus.clean.tok.$src -model $dir/2-clean/truecase-model.$src

# Truecase
$mosesdir/recaser/truecase.perl -model $dir/2-clean/truecase-model.$trg < $dir/2-clean/corpus.clean.tok.$trg > $dir/3-tc/corpus.tc.$trg
$mosesdir/recaser/truecase.perl -model $dir/2-clean/truecase-model.$src < $dir/2-clean/corpus.clean.tok.$src > $dir/3-tc/corpus.tc.$src

# Split into subword units
cat $dir/3-tc/corpus.tc.$trg $dir/3-tc/corpus.tc.$src | subword-nmt learn-bpe -s $merge_ops > $dir/4-bpe/model.bpe

subword-nmt apply-bpe -c $dir/4-bpe/model.bpe < $dir/3-tc/corpus.tc.$trg > $dir/4-bpe/corpus.bpe.$trg &
subword-nmt apply-bpe -c $dir/4-bpe/model.bpe < $dir/3-tc/corpus.tc.$src > $dir/4-bpe/corpus.bpe.$src &

wait
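
To make the "split into subword units" step less of a black box, here is a toy illustration of how BPE merge operations are learned from word frequencies. It is a simplified sketch of the general algorithm, not the implementation inside `subword-nmt`; the function names and the `</w>` end-of-word marker are choices made for this example.

```python
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {space-separated word: frequency} vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(' '.join(pair))
    # The lookarounds make sure only whole symbols are matched.
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(words, num_merges):
    """Return the list of merge operations learned from an iterable of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = collections.Counter(' '.join(w) + ' </w>' for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most frequent pair is merged first
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Toy usage on a tiny word list.
toy_words = ['low'] * 5 + ['lower'] * 2 + ['newest'] * 6 + ['widest'] * 3
merges = learn_bpe(toy_words, num_merges=3)
```

In the pipeline above, `$merge_ops` plays the role of `num_merges`, and the merges are learned on the concatenation of both languages so that source and target share one subword vocabulary.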