# Data Preprocessing & Cleaning for NMT

This notebook is a tutorial on data preprocessing and cleaning for neural machine translation (NMT), in preparation for training translation models with the [NeMo framework](https://github.com/NVIDIA/NeMo).

A prerequisite for training supervised neural machine translation systems is the availability of *parallel corpora* of reasonable quality.

A parallel corpus is a collection of sentences or documents that are translations of each other in two or more languages.

For example,

| English                                                                            | Russian |
| :-: | :-: |
| To date, a total of 43 participants from 15 countries have completed the training. | К настоящему времени подготовку прошли в общей сложности 43 участника из 15 стран . |
| M-Sport Bentley writes a new piece of Bentley history at Silverstone | M-Sport Bentley открывает новую страницу в истории Bentley в Сильверстоуне |
| Information in the application was not true. | Информация в заявлении не была достоверна. |
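
In practice, a parallel corpus like the one above is often stored as two line-aligned text files, one per language. The following is a minimal sketch of pairing such lines up; the helper name `read_parallel` is hypothetical and is not part of NeMo.

```python
from itertools import zip_longest


def read_parallel(src_lines, tgt_lines):
    """Pair up line-aligned source and target sentences.

    Raises ValueError if the two sides have different numbers of lines,
    since the alignment would then be unreliable.
    """
    pairs = []
    for src, tgt in zip_longest(src_lines, tgt_lines):
        if src is None or tgt is None:
            raise ValueError("source and target have different numbers of lines")
        pairs.append((src.rstrip("\n"), tgt.rstrip("\n")))
    return pairs


# In-memory lines standing in for the contents of two aligned files.
en_lines = ["Information in the application was not true.\n"]
ru_lines = ["Информация в заявлении не была достоверна.\n"]
pairs = read_parallel(en_lines, ru_lines)
print(pairs[0])
```

The same function accepts open file handles, because iterating over a file yields its lines.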

This notebook will cover the following data pre-processing and data cleaning techniques for such corpora.

## The importance of data cleaning

The presence of noise in the training dataset can adversely affect model quality (https://arxiv.org/abs/1805.12282). Web-crawled and automatically aligned data sources in particular, such as [Paracrawl](https://paracrawl.eu/), [WikiMatrix](https://arxiv.org/abs/1907.05791), [CC-Aligned](https://arxiv.org/abs/1911.06154) and [CC-Matrix](https://arxiv.org/abs/1911.04944), can be extremely noisy.

## Cleaning
1. Downloading and filtering publicly available datasets based on confidence thresholds (if available). For example, [WikiMatrix](https://arxiv.org/abs/1907.05791) filtering based on [LASER](https://arxiv.org/abs/1812.10464) confidence scores.
2. Language ID filtering using a pre-trained [fastText classifier](https://fasttext.cc/docs/en/language-identification.html). This step removes all sentence pairs from the parallel corpus in which our classifier predicts that a sentence is not in the appropriate language (e.g. sentences in the English column that aren't in English, or sentences in the Russian column that aren't in Russian).
3. Length and length-ratio filtering. This step removes all sentence pairs that are 1) too long, 2) too short, or 3) have a ratio between their lengths greater than a certain factor (this typically removes partial translations).
4. [Bicleaner](https://github.com/bitextor/bicleaner) classifier-based cleaning. Bicleaner identifies noisy parallel sentences using a classifier that leverages multiple features such as n-gram language model likelihood scores, word alignment scores and other heuristics.

## Pre-processing
5. [Moses Punctuation Normalization](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl). This step standardizes punctuation. For example, the less common way of writing an apostrophe, as in Tiffany`s, is standardized to Tiffany's.
6. Unicode standardization. Some Unicode characters that aren't punctuation also need to be standardized. For example, this step normalizes the full-width digit ４ to 4.
7. [Moses Tokenization](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) or text segmentation for Chinese/Japanese with [Jieba](https://github.com/fxsjy/jieba) and [MeCab](https://github.com/taku910/mecab). For languages like Chinese and Japanese that do not have explicit word segmentation markers (like spaces), we use these tools to introduce spaces into the text that will let us split the string into words. For other languages, we use Moses to separate punctuation markers from words so that they become separate tokens.
8. Deduplication - This step removes duplicate translation pairs from the corpus.
9. Shuffling - This step shuffles the order of occurrence of translation pairs.
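
The Unicode standardization in step 6 can be illustrated with Python's standard library. This is a minimal sketch that assumes NFKC normalization is an acceptable stand-in; the notebook does not state which normalization form it has in mind, and the function name `standardize_unicode` is made up for this example.

```python
import unicodedata


def standardize_unicode(text):
    # NFKC folds compatibility characters, such as full-width digits and
    # letters, into their ordinary equivalents.
    return unicodedata.normalize("NFKC", text)


print(standardize_unicode("４ participants"))
```

NFKC also rewrites other compatibility characters (ligatures, some superscripts), so it is worth spot-checking its effect on a sample of the corpus before applying it everywhere.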

## Tarred Datasets for Large Corpora
10. Large datasets with over 50M sentence pairs can be up to 60GB in size when batched and pickled. Loading them entirely into CPU memory with, say, 8 or 16 workers under DistributedDataParallel training requires 480-960GB of RAM, since each worker holds its own copy, which is often impractical and inefficient. Instead, we use [WebDataset](https://github.com/webdataset/webdataset) to keep datasets on disk during training and let WebDataset handle pre-loading and fetching of data into CPU RAM.
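
The memory figures above follow from every worker loading its own full copy of the dataset. A small worked calculation, using the 60GB dataset size and the worker counts from the text (the function name is hypothetical):

```python
def total_host_ram_gb(dataset_size_gb, num_workers):
    # Each worker process keeps a separate in-memory copy of the dataset,
    # so host RAM usage grows linearly with the number of workers.
    return dataset_size_gb * num_workers


for workers in (8, 16):
    print(f"{workers} workers: {total_host_ram_gb(60, workers)} GB")
```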


## Disclaimer

The data cleaning techniques used in this notebook are only meant to be loose guidelines and are not guaranteed to produce clean parallel corpora. Not all of these steps are necessary for every dataset.

![NMT Data Pipeline](images/nmt_data_pipeline.png)

# Downloading Publicly Available Data

## WikiMatrix (https://arxiv.org/abs/1907.05791)

In [56]:
!mkdir -p data/
print('Downloading data ...')
!wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-ru.tsv.gz -O data/WikiMatrix.en-ru.tsv.gz
print('---------------------')
print('Unzipping file ...')
!gunzip -k -f data/WikiMatrix.en-ru.tsv.gz
print('---------------------')
print('Peek into the file')
!head -10 data/WikiMatrix.en-ru.tsv
print('---------------------')
print('File length ...')
!wc -l data/WikiMatrix.en-ru.tsv
print('---------------------')

Downloading data ...
--2021-07-13 09:04:45--  https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-ru.tsv.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 658252364 (628M) [application/gzip]
Saving to: ‘data/WikiMatrix.en-ru.tsv.gz’


2021-07-13 09:04:58 (49.9 MB/s) - ‘data/WikiMatrix.en-ru.tsv.gz’ saved [658252364/658252364]

---------------------
Unzipping file ...
---------------------
Peek into the file
1.2217877209774821	The glory of the Lord has risen upon thee".	Какую же из милостей вашего Господа вы считаете ложью?».
1.2136469670929166	Fear of the Lord is aking to wonder (or awe).	Воистину, мучений от твоего Господа надлежит остерегаться».
1.1979604432731699	I think I washed his body 50 times."	Я думаю, что я омыла его тело 50 раз.»
1.1954915649299516	There has 

## Filter Based on LASER Confidence

LASER (https://arxiv.org/abs/1812.10464) is a multi-lingual neural sentence embedding model that is often used for cross-lingual sentence/document retrieval. Similarities in the embedding space are often used as proxies for cross-lingual similarities.
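
To make the idea of embedding-space similarity concrete, here is a minimal sketch of cosine similarity between sentence vectors. The three-dimensional vectors are made-up toy values, not real LASER embeddings, and the scores stored in the WikiMatrix file are not raw cosines (they can exceed 1), so this only illustrates the general principle.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence embeddings (illustrative only).
en_vec = [0.2, 0.9, 0.1]
ru_translation_vec = [0.25, 0.85, 0.05]
ru_unrelated_vec = [0.9, -0.1, 0.4]

print(cosine_similarity(en_vec, ru_translation_vec))
print(cosine_similarity(en_vec, ru_unrelated_vec))
```

A sentence and its translation are expected to land close together in the shared embedding space, so the first similarity is much higher than the second.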

In [57]:
from tqdm import tqdm
import numpy as np

def num_lines_in_file(fname):
    """
    Returns the number of lines in a file.
    """
    num_lines = 0  # Stays 0 for an empty file.
    with open(fname, 'r') as f:
        for num_lines, _ in enumerate(f, start=1):
            pass
    return num_lines

def filter_tsv_with_conf(
    input_file, output_file_lang_1, output_file_lang_2,
    confidence_threshold=None, confidence_column=None
):
    """
    Filters a tsv file that has confidence scores associated with each parallel example.

    For example:

    1.23 \t This is a sentence in lang1 \t This is a sentence in lang2
    """
    print()
    print('====================================')
    print('======= TSV Conf Filtering =========')
    print('====================================')
    print()
    num_lines = num_lines_in_file(input_file)
    scores = []
    num_output_lines = 0
    # Resolve the language columns once, before reading the file.
    if confidence_threshold is None:
        lang_1_col, lang_2_col = 0, 1
    elif confidence_column == 0:
        lang_1_col, lang_2_col = 1, 2
    elif confidence_column == 1:
        lang_1_col, lang_2_col = 0, 2
    elif confidence_column == 2:
        lang_1_col, lang_2_col = 0, 1
    else:
        raise ValueError(f"Invalid Column for confidence {confidence_column}")
    # Rows need three fields when a confidence column is present, otherwise two.
    min_cols = 2 if confidence_threshold is None else 3
    with open(input_file, 'r') as f, \
        open(output_file_lang_1, 'w') as f_out_1, \
        open(output_file_lang_2, 'w') as f_out_2:
        for line in tqdm(f, total=num_lines, desc=f"Filtering file by confidence {confidence_threshold}"):
            if line.strip() == '':
                continue
            line = line.strip().split('\t')
            # Skip malformed rows that are missing a column.
            if len(line) < min_cols:
                continue
            if confidence_threshold is not None:
                score = float(line[confidence_column])
                if score < confidence_threshold:
                    continue
                scores.append(score)
            f_out_1.write(line[lang_1_col] + '\n')
            f_out_2.write(line[lang_2_col] + '\n')
            num_output_lines += 1

    if confidence_threshold is not None:
        print(f'Confidence score average  : {np.mean(scores)}')
        print(f'Confidence score variance : {np.var(scores)}')
        print(f'Kept {num_output_lines} out of {num_lines} after conversion ({(num_output_lines / num_lines) * 100}%)')
        print('====================================')

filter_tsv_with_conf(
    'data/WikiMatrix.en-ru.tsv',
    'data/WikiMatrix.en-ru.en', 
    'data/WikiMatrix.en-ru.ru',
    confidence_threshold=1.04, confidence_column=0
)





Filtering file by confidence 1.04: 100%|██████████| 5203872/5203872 [00:12<00:00, 424012.09it/s]


Confidence score average  : 1.0628594097124588
Confidence score variance : 0.0003311031014927841
Kept 1661908 out of 5203872 after conversion (31.935989201886596%)


## Language ID filtering with fastText

Noisy parallel corpora often contain sentences that are not in the intended language. A classifier that determines the language in which a sentence is written can be used to filter out sentences that aren't in the appropriate language.
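
The filtering logic can be sketched independently of any particular classifier. In the example below, `filter_by_language` and `toy_predict_lang` are hypothetical names; the toy predictor simply looks for Cyrillic letters and only stands in for a real language ID model.

```python
def filter_by_language(pairs, src_lang, tgt_lang, predict_lang):
    """Split sentence pairs into kept and removed lists.

    predict_lang is any callable mapping a sentence to a language code.
    A pair is kept only if both sides are predicted to be in the
    expected language.
    """
    kept, removed = [], []
    for src, tgt in pairs:
        if predict_lang(src) == src_lang and predict_lang(tgt) == tgt_lang:
            kept.append((src, tgt))
        else:
            removed.append((src, tgt))
    return kept, removed


def toy_predict_lang(sentence):
    # Stand-in for a real classifier: call a sentence Russian if it
    # contains at least one Cyrillic letter, English otherwise.
    return "ru" if any("\u0400" <= ch <= "\u04ff" for ch in sentence) else "en"


pairs = [
    ("Information in the application was not true.",
     "Информация в заявлении не была достоверна."),
    ("Любовь Шутова: Теперь ответственности больше.",
     "Любовь Шутова: теперь ответственности больше."),
]
kept, removed = filter_by_language(pairs, "en", "ru", toy_predict_lang)
print(len(kept), "kept,", len(removed), "removed")
```

A real predictor could wrap the `predict` method of a loaded fastText model, whose labels take the form `__label__en`; the prefix would need to be stripped before comparing against the language code.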

In [58]:
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -O data/lid.176.bin
print()
print('====================================')
print('====== Language ID Filtering =======')
print('====================================')
print()

!python ../../scripts/neural_machine_translation/filter_langs_nmt.py \
    --input-src data/WikiMatrix.en-ru.en  \
    --input-tgt data/WikiMatrix.en-ru.ru \
    --output-src data/WikiMatrix.en-ru.langidfilter.en  \
    --output-tgt data/WikiMatrix.en-ru.langidfilter.ru  \
    --source-lang en \
    --target-lang ru \
    --removed-src data/WikiMatrix.en-ru.langidfilter.removed.en  \
    --removed-tgt data/WikiMatrix.en-ru.langidfilter.removed.ru  \
    --fasttext-model data/lid.176.bin

print()
print('-----------------------------------------')
print('Number of removed sentences:')
print('-----------------------------------------')
print()
!wc -l data/WikiMatrix.en-ru.langidfilter.removed.ru

print()
print('-----------------------------------------')
print('Examples of removed sentences')
print('-----------------------------------------')
print()

!paste -d "\t" \
    data/WikiMatrix.en-ru.langidfilter.removed.en \
    data/WikiMatrix.en-ru.langidfilter.removed.ru \
    | head -10
print('-----------------------------------------')

--2021-07-13 09:05:38--  https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 104.22.75.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 131266198 (125M) [application/octet-stream]
Saving to: ‘data/lid.176.bin’


2021-07-13 09:05:41 (65.6 MB/s) - ‘data/lid.176.bin’ saved [131266198/131266198]



processed lines / total number of lines: 100%|▉| 1658535/1661908 [00:12<00:00, 1

-----------------------------------------
Number of removed sentences:
-----------------------------------------

33671 data/WikiMatrix.en-ru.langidfilter.removed.ru

-----------------------------------------
Examples of removed sentences
-----------------------------------------

Ask Sylvia!	Спроси Сильвию!
Любовь Шутова: Теперь ответственности больше.	Любовь Шутова: теперь ответственности больше.
"Але

## Length and Ratio Filtering

This step filters out sentence pairs based on their lengths and the ratio between source and target lengths. A sentence pair is removed if (a) either src_len / tgt_len or tgt_len / src_len exceeds 1.3, or (b) the source or target sequence length is less than 1 or greater than 250.

In [59]:
!git clone https://github.com/moses-smt/mosesdecoder data/mosesdecoder
!perl data/mosesdecoder/scripts/training/clean-corpus-n.perl -ratio 1.3 \
    data/WikiMatrix.en-ru.langidfilter \
    en ru \
    data/WikiMatrix.en-ru.langidfilter.lengthratio \
    1 250

fatal: destination path 'data/mosesdecoder' already exists and is not an empty directory.
clean-corpus.perl: processing data/WikiMatrix.en-ru.langidfilter.en & .ru to data/WikiMatrix.en-ru.langidfilter.lengthratio, cutoff 1-250, ratio 1.3
..........(100000)..........(200000)..........(300000)..........(400000)..........(500000)..........(600000)..........(700000)..........(800000)..........(900000)..........(1000000)..........(1100000)..........(1200000)..........(1300000)..........(1400000)..........(1500000)..........(1600000)..
Input sentences: 1628237  Output sentences:  1138638
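
The rule applied above can be sketched in a few lines of Python. This is an approximation under the assumption that lengths are counted in whitespace-separated tokens; `clean-corpus-n.perl` may differ in its details, and `keep_pair` is a hypothetical helper name.

```python
def keep_pair(src, tgt, min_len=1, max_len=250, max_ratio=1.3):
    """Return True if a sentence pair passes the length and ratio checks."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    # Reject pairs where either side is too short or too long.
    if not (min_len <= src_len <= max_len and min_len <= tgt_len <= max_len):
        return False
    # Guard against division by zero when min_len is set to 0.
    if min(src_len, tgt_len) == 0:
        return False
    # Dividing the longer side by the shorter covers both ratio directions.
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio


print(keep_pair("These theatres are smaller than Broadway theatres .",
                "По своим размерам эти театры меньше бродвейских ."))
print(keep_pair("a b c d", "x y"))
```

Dividing the longer length by the shorter one is equivalent to checking both src_len / tgt_len and tgt_len / src_len against the threshold.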


## Bicleaner Filtering

Bicleaner (https://aclanthology.org/W18-6488/ and https://aclanthology.org/2020.eamt-1.31/) is a tool to identify noisy parallel sentences in translation corpora. It applies 3 different filtering steps:

1. Pre-filtering based on 37 rules.
2. Language model fluency scores based on n-gram language models trained with kenlm.
3. Random forest classifier that uses all examples filtered out in steps 1 & 2 as "negative" examples.

In [63]:
# Note: Fix commit of bicleaner when cloning
print('Downloading En-Ru Bicleaner models.')
!git clone https://github.com/bitextor/bicleaner
!cd bicleaner && git checkout bicleaner-0.15 && cd ..
!./bicleaner/utils/download-pack.sh en ru

print()
print('==============================================================================')
print('NOTE: This notebook does not install bicleaner from scratch, please install things yourself before running the next few cells.')
print('To use Bicleaner please make sure you have BOOST and associated libraries installed ...')
print('To install bicleaner, follow setup instructions on the repository - https://github.com/bitextor/bicleaner.')
print('Bicleaner also requires kenlm with support for up to 7-gram LMs. Instructions on how to build things are on the repository as well.')
print('==============================================================================')
print()

Downloading En-Ru Bicleaner models.
fatal: destination path 'bicleaner' already exists and is not an empty directory.
Branch 'bicleaner-0.15' set up to track remote branch 'bicleaner-0.15' from 'origin'.
Switched to a new branch 'bicleaner-0.15'
--2021-07-13 09:10:13--  https://github.com/bitextor/bicleaner-data/releases/latest/download/en-ru.tar.gz
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/bitextor/bicleaner-data/releases/download/v1.4/en-ru.tar.gz [following]
--2021-07-13 09:10:13--  https://github.com/bitextor/bicleaner-data/releases/download/v1.4/en-ru.tar.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/236975479/614a5780-69f8-11eb-831b-11b2dc2c342f?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53

In [65]:
print('Generating Bicleaner scores ...')
!gawk '{{print "-\t-"}}' \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.en | \
    paste -d "\t" - data/WikiMatrix.en-ru.langidfilter.lengthratio.en \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.ru | \
    bicleaner-classify - - en-ru/en-ru.yaml \
    > data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.scores

Generating Bicleaner scores ...
2021-07-13 09:17:12,611 - INFO - Accuracy histogram: [0.5001, 0.7714543, 0.8156631, 0.8457692, 0.8728746, 0.8924785, 0.8861772, 0.8425685, 0.7294459, 0.5561112]
2021-07-13 09:17:12,611 - INFO - Ideal threshold: 0.5
2021-07-13 09:17:12,617 - INFO - Arguments processed.
2021-07-13 09:17:12,617 - INFO - Executing main program...
2021-07-13 09:17:14,028 - INFO - Start mapping
2021-07-13 09:17:19,634 - INFO - End mapping
2021-07-13 09:18:50,028 - INFO - Finished
2021-07-13 09:18:50,028 - INFO - Total: 1138638 rows
2021-07-13 09:18:50,028 - INFO - Elapsed time 97.41 s
2021-07-13 09:18:50,028 - INFO - Troughput: 11689 rows/s
2021-07-13 09:18:50,029 - INFO - Program finished


In [66]:
print('Filtering based on Bicleaner scores > 0.6 ...')
!head -10 data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.scores

print('Filtering out English ...')
!gawk -F "\t" '{if ($5>0.6) {print $3}}' \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.scores > \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.en

print('Filtering out Russian ...')
!gawk -F "\t" '{if ($5>0.6) {print $4}}' \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.scores > \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.ru

!paste -d "\t" \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.en \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.ru \
    | head -10

Filtering based on Bicleaner scores > 0.6 ...
-	-	The glory of the Lord has risen upon thee".	Какую же из милостей вашего Господа вы считаете ложью?».	0.530
-	-	I think I washed his body 50 times."	Я думаю, что я омыла его тело 50 раз.»	0.604
-	-	There has come to you clear evidence from your Lord.	К вам пришло ясное знамение от вашего Господа.	0.730
-	-	"15,000 attend dawn service".	15,000 attend dawn service (англ.).	0.676
-	-	Ask anybody, particularly the critics."	Спросите кого угодно, в особенности критиков.»	0.568
-	-	"Wiranto – survivor with iron will".	«Wiranto — survivor with iron will».	0
-	-	"They Saved Lisa's Brain".	«They Saved Lisa’s Brain» (рус.	0.422
-	-	However, patients liked banknotes or coins of the Japan Bank.	Однако пациенты предпочитали банкноты или монеты Банка Японии.	0.764
-	-	The Flaming Blade of the Lord’s Retribution.	Что ниспосланное Господом наказание вразумит заблудших!	0.284
-	-	He slept quite openly with them all."	Он спал вполне открыто со всеми ними»
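
The `gawk` filter above keeps a pair only if the score in the fifth tab-separated column exceeds 0.6. A rough Python equivalent, assuming the same five-column layout shown in the output (the function name `filter_by_score` is hypothetical):

```python
def filter_by_score(rows, threshold=0.6, src_col=2, tgt_col=3, score_col=4):
    """Keep (source, target) pairs whose classifier score exceeds threshold."""
    kept = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        # Skip rows that were cut short and have no score column.
        if len(fields) <= score_col:
            continue
        try:
            score = float(fields[score_col])
        except ValueError:
            continue
        if score > threshold:
            kept.append((fields[src_col], fields[tgt_col]))
    return kept


# Two rows copied from the scores file shown above.
rows = [
    "-\t-\tThere has come to you clear evidence from your Lord.\t"
    "К вам пришло ясное знамение от вашего Господа.\t0.730",
    "-\t-\tThe Flaming Blade of the Lord’s Retribution.\t"
    "Что ниспосланное Господом наказание вразумит заблудших!\t0.284",
]
for src, tgt in filter_by_score(rows):
    print(src, "|||", tgt)
```

Unlike the two separate `gawk` passes, this reads each row once and keeps both sides together, which makes it harder for the two output files to fall out of alignment.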

## Normalize Punctuation

Punctuation can vary across languages and even between ASCII and Unicode variants of the same punctuation mark. For example, in German, quotes are often written as „ and “, while in English we typically just use ". This step normalizes such punctuation differences so that the same character is used everywhere.

We use [moses](https://github.com/moses-smt/mosesdecoder) or [sacremoses](https://github.com/alvations/sacremoses) to normalize punctuation. The moses implementation is in Perl, while sacremoses is in Python with a command-line interface. The Perl implementation is buffered and works better for large corpora that may not fit into CPU memory all at once, while sacremoses is unbuffered and multi-processed.
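
To show what this kind of normalization does, here is a minimal sketch with a small hand-picked character mapping. It is not the Moses rule set, which is considerably larger and partly language-specific; the names `PUNCT_MAP` and `normalize_punct` are made up for this illustration.

```python
# Hand-picked mapping from typographic variants to plain ASCII punctuation.
PUNCT_MAP = {
    "„": '"', "“": '"', "”": '"', "«": '"', "»": '"',
    "‘": "'", "’": "'", "`": "'",
}
_PUNCT_TABLE = str.maketrans(PUNCT_MAP)


def normalize_punct(text):
    # str.translate replaces every mapped character in a single pass.
    return text.translate(_PUNCT_TABLE)


print(normalize_punct("„Hallo“"))
print(normalize_punct("Tiffany`s"))
```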

### Sacremoses

In [67]:
print('Normalizing English ...')
!sacremoses -j 4 normalize \
    < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.en > \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.en

print('Normalizing Russian ...')
!sacremoses -j 4 normalize \
    < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.ru > \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.ru


Normalizing English ...
100%|████████████████████████████████| 905234/905234 [00:13<00:00, 66384.80it/s]
Normalizing Russian ...
100%|████████████████████████████████| 905234/905234 [00:14<00:00, 62549.41it/s]


### Moses

In [68]:
print('Normalizing English ...')
!perl data/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l en \
    < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.en > \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.en

print('Normalizing Russian ...')
!perl data/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l ru \
    < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.ru > \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.ru


Normalizing English ...
Normalizing Russian ...


## Tokenize

### Sacremoses

In [69]:
print('Tokenizing English ...')
!sacremoses -j 4 -l en tokenize -x \
    < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.en > \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.en

print('Tokenizing Russian ...')
!sacremoses -j 4 -l ru tokenize -x \
    < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.ru > \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.ru


Tokenizing English ...
100%|████████████████████████████████| 905234/905234 [00:25<00:00, 35973.36it/s]
Tokenizing Russian ...
100%|████████████████████████████████| 905234/905234 [00:19<00:00, 47260.96it/s]


### Moses

In [70]:
print('Tokenizing English ...')
!perl data/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en -no-escape -threads 4 \
    < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.en > \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.en

print('Tokenizing Russian ...')
!perl data/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ru -no-escape -threads 4 \
    < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.ru > \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.ru


Tokenizing English ...
Tokenizer Version 1.1
Language: en
Number of threads: 4
Tokenizing Russian ...
Tokenizer Version 1.1
Language: ru
Number of threads: 4
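
For intuition about what tokenization does to a sentence, the sketch below separates punctuation from words with a single regular expression. It is a deliberately simplified stand-in: the Moses tokenizer handles many special cases (numbers such as 90,000, abbreviations, language-specific rules) that this does not, and `simple_tokenize` is a hypothetical name.

```python
import re

# A token is either a run of word characters or one non-space symbol.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Return the text with tokens separated by single spaces."""
    return " ".join(_TOKEN_RE.findall(text))


print(simple_tokenize("Он утверждал, что погибли 74 человека."))
```

Because `\w` matches Unicode letters in Python 3, the same pattern works for both the English and the Russian side.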


## Segmenting Chinese and Japanese

### Jieba segmentation for Chinese

In [71]:
!pip install jieba
import jieba
!wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-zh.tsv.gz -O data/WikiMatrix.en-zh.tsv.gz
!gunzip -k -f data/WikiMatrix.en-zh.tsv.gz
print()
print('-----------------------------------------')
print('Chinese text before segmentation ...')
print('-----------------------------------------')
print()

!awk -F "\t" '{print $3}' data/WikiMatrix.en-zh.tsv | head -10
print()
print('-----------------------------------------')
print('Segmenting Chinese text ...')
print('-----------------------------------------')
print()

zh_lines = []
with open('data/WikiMatrix.en-zh.tsv', 'r') as f:
    for idx, line in enumerate(f):
        line = line.strip().split('\t')[2]
        zh_lines.append(' '.join(jieba.cut(line)))
        if idx == 100:
            break
print()
print('-----------------------------------------')
print('Chinese text after segmentation ...')
print('\n'.join(zh_lines[:10]))
print('-----------------------------------------')
print()

--2021-07-13 10:00:14--  https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-zh.tsv.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.75.142, 172.67.9.4, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.75.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 340467177 (325M) [application/gzip]
Saving to: ‘data/WikiMatrix.en-zh.tsv.gz’


2021-07-13 10:00:20 (53.7 MB/s) - ‘data/WikiMatrix.en-zh.tsv.gz’ saved [340467177/340467177]


-----------------------------------------
Chinese text before segmentation ...
-----------------------------------------

這是你們的主所降示的減輕和慈恩。
以昭事上主，闡揚天主聖教為本。
 你的主的言辞，诚实极了，公平极了。
祈求圣主明鉴施恩。
 因上主之名而來的，當受讚頌!
例如ன是ṉa（有母音a），而ன்是ṉ（沒有母音）。
按座主，即大众一座之主，亦即住持。
世尊知摩尼珠髻聚落主去已。
正月，朝見其主（王）龔。
” 念「我作证，萬物非主，唯有真主。

-----------------------------------------
Segmenting Chinese text ...
-----------------------------------------


-----------------------------------------
Chinese text after se

### MeCab segmentation for Japanese

In [72]:
!pip install mecab-python3
!pip install ipadic

import MeCab
import ipadic

!wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-ja.tsv.gz -O data/WikiMatrix.en-ja.tsv.gz
!gunzip -k -f data/WikiMatrix.en-ja.tsv.gz

print()
print('-----------------------------------------')
print('Japanese text before segmentation ...')
print('-----------------------------------------')
print()

!awk -F "\t" '{print $3}' data/WikiMatrix.en-ja.tsv | head -10
print()
print('-----------------------------------------')
print('Segmenting Japanese text ...')
print('-----------------------------------------')
print()

mecab_tokenizer = MeCab.Tagger(ipadic.MECAB_ARGS + " -Owakati")
ja_lines = []
with open('data/WikiMatrix.en-ja.tsv', 'r') as f:
    for idx, line in enumerate(f):
        line = line.strip().split('\t')[2]
        # parse() returns the segmented text with a trailing newline; strip it.
        ja_lines.append(mecab_tokenizer.parse(line).strip())
        if idx == 100:
            break
print()
print('-----------------------------------------')
print('Japanese text after segmentation ...')
print('\n'.join(ja_lines[:10]))
print('-----------------------------------------')
print()

--2021-07-13 10:00:29--  https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-ja.tsv.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 172.67.9.4, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 440814991 (420M) [application/gzip]
Saving to: ‘data/WikiMatrix.en-ja.tsv.gz’


2021-07-13 10:00:38 (47.8 MB/s) - ‘data/WikiMatrix.en-ja.tsv.gz’ saved [440814991/440814991]


-----------------------------------------
Japanese text before segmentation ...
-----------------------------------------

主のものは以下の通り。
（詳細はソネットを参照） Shall I compare thee to a summer’s day?
主の為なら主にすら嘘をつく。
一般には後主と称されている。
 主の御名によって来られる方を讃えよ。
魂主の手首には「主の証」として具現化する。
主の御名を呪う者は死刑に処せられる。
これはかれらの行いに対する、アッラーの見せしめのための懲しめである。
で主に道外向けに同時配信。
最期には人主に至るであろう。

-----------------------------------------
Segmenting Japanese text ...
-----------------------------------------


---------

## Deduplicate

In [73]:
import xxhash

def dedup_file(input_file_lang_1, input_file_lang_2, output_file_lang_1, output_file_lang_2):
    print()
    print('====================================')
    print('========== De-duplicate ============')
    print('====================================')
    print()
    num_lines = num_lines_in_file(input_file_lang_1)
    hashes = set()
    num_output_lines = 0
    with open(input_file_lang_1, 'r') as f_lang1, \
        open(input_file_lang_2, 'r')  as f_lang2, \
        open(output_file_lang_1, 'w') as f_out_lang1, \
        open(output_file_lang_2, 'w') as f_out_lang2:
        for line_1, line_2 in tqdm(zip(f_lang1, f_lang2), total=num_lines, desc=f"Deduplicating files"):
            parallel_hash = xxhash.xxh64((line_1.strip() + '\t' + line_2.strip()).encode('utf-8')).hexdigest()
            if parallel_hash not in hashes:
                hashes.add(parallel_hash)
                f_out_lang1.write(line_1.strip() + '\n')
                f_out_lang2.write(line_2.strip() + '\n')
                num_output_lines += 1

    print(f"Kept {num_output_lines} out of {num_lines} after deduplication")

dedup_file(
    'data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.en',
    'data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.ru',
    'data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.dedup.en',
    'data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.dedup.ru'
)





Deduplicating files: 100%|██████████| 905234/905234 [00:03<00:00, 275236.27it/s]

Kept 905231 out of 905234 after deduplication





## Shuffle

In [74]:
!shuf --random-source=data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.dedup.en \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.dedup.en > \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.dedup.shuf.en

!shuf --random-source=data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.dedup.en \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.dedup.ru > \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.dedup.shuf.ru

!paste -d "\t" \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.dedup.shuf.en \
    data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.dedup.shuf.ru \
    | head -10

It is an effective method for countries where identification documents for citizens are not always standardised or institutionalised .	Это эффективный способ для стран , где удостоверяющие личность документы для граждан не стандартизированы или необязательны .
He claimed they there were 74 persons dead and 345 persons injured .	Он утверждал , что в результате этих действий 74 человек погибли и 345 были ранены .
Its test version was released on 20 April 2006 , and within three weeks the encyclopedia had grown to more than 90,000 articles , surpassing the number in Chinese Wikipedia .	Тестовая версия появилась 20 апреля 2006 года , и по прошествии трёх недель энциклопедия содержала более 90 тысяч статей , превзойдя по этому показателю китайскую Википедию .
These theatres are smaller than Broadway theatres .	По своим размерам эти театры меньше бродвейских .
In October 2010 , he was made a Distinguished Supporter of the British Humanist Association .	В октябре 2010 года он был удостоен
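
The two `shuf` calls above share the same `--random-source`, so both files are permuted in the same order and the sentence pairs stay aligned. The same idea can be sketched in Python by drawing one permutation and applying it to both sides; `shuffle_parallel` and its `seed` parameter are hypothetical names for this illustration.

```python
import random


def shuffle_parallel(src_lines, tgt_lines, seed=0):
    """Shuffle two line-aligned lists with one shared permutation."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target have different numbers of lines")
    order = list(range(len(src_lines)))
    # A dedicated Random instance keeps the shuffle reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(order)
    return [src_lines[i] for i in order], [tgt_lines[i] for i in order]


en_lines = ["one", "two", "three", "four"]
ru_lines = ["один", "два", "три", "четыре"]
shuf_en, shuf_ru = shuffle_parallel(en_lines, ru_lines, seed=13)
for en, ru in zip(shuf_en, shuf_ru):
    print(en, "|||", ru)
```

This version holds both files in memory, so for very large corpora the `shuf` approach is likely the more practical choice.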

In [79]:
!rm -rf data/tarred_dataset_en_ru_8k_tokens

## Tarred Dataset Creation

In [80]:
!python ../../examples/nlp/machine_translation/create_tarred_parallel_dataset.py \
    --src_fname data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.dedup.shuf.en \
    --tgt_fname data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.dedup.shuf.ru \
    --out_dir data/tarred_dataset_en_ru_8k_tokens \
    --clean \
    --encoder_tokenizer_name yttm \
    --encoder_tokenizer_vocab_size 32000 \
    --encoder_tokenizer_coverage 0.999 \
    --encoder_tokenizer_bpe_dropout 0.1 \
    --decoder_tokenizer_name yttm \
    --decoder_tokenizer_vocab_size 32000 \
    --decoder_tokenizer_coverage 0.999 \
    --decoder_tokenizer_bpe_dropout 0.1 \
    --max_seq_length 512 \
    --min_seq_length 1 \
    --tokens_in_batch 8000 \
    --lines_per_dataset_fragment 100000 \
    --num_batches_per_tarfile 20

[NeMo W 2021-07-13 10:35:50 optimizers:47] Apex was not found. Using the lamb optimizer will error out.
[NeMo W 2021-07-13 10:35:52 experimental:27] Module <class 'nemo.collections.nlp.data.text_normalization.decoder_dataset.TextNormalizationDecoderDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-07-13 10:35:52 experimental:27] Module <class 'nemo.collections.nlp.data.text_normalization.tagger_dataset.TextNormalizationTaggerDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-07-13 10:35:52 experimental:27] Module <class 'nemo.collections.nlp.data.text_normalization.test_dataset.TextNormalizationTestDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[nltk_data] Downloading package punkt to /home/sandeepsub/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[NeMo W 2021-07-13 10:35:52 experimental:

[NeMo I 2021-07-13 10:35:57 tokenizer_utils:136] Getting YouTokenToMeTokenizer with model: data/tarred_dataset_en_ru_8k_tokens/tokenizer.encoder.32000.BPE.model with r2l: False.
[NeMo I 2021-07-13 10:35:57 tokenizer_utils:136] Getting YouTokenToMeTokenizer with model: data/tarred_dataset_en_ru_8k_tokens/tokenizer.decoder.32000.BPE.model with r2l: False.
[NeMo W 2021-07-13 10:35:59 optimizers:47] Apex was not found. Using the lamb optimizer will error out.
[NeMo W 2021-07-13 10:36:00 experimental:27] Module <class 'nemo.collections.nlp.data.text_normalization.decoder_dataset.TextNormalizationDecoderDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-07-13 10:36:00 experimental:27] Module <class 'nemo.collections.nlp.data.text_normalization.tagger_dataset.TextNormalizationTaggerDataset'> is experimental, not ready for produ

[nltk_data] Downloading package punkt to /home/sandeepsub/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[NeMo W 2021-07-13 10:36:05 experimental:27] Module <class 'nemo.collections.nlp.models.duplex_text_normalization.duplex_decoder.DuplexDecoderModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[nltk_data] Downloading package punkt to /home/sandeepsub/nltk_data...
[nltk_data] Downloading package punkt to /home/sandeepsub/n

[nltk_data] Downloading package punkt to /home/sandeepsub/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[NeMo W 2021-07-13 10:36:05 experimental:27] Module <class 'nemo.collections.nlp.models.duplex_text_normalization.duplex_decoder.DuplexDecoderModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[nltk_data] Downloading package punkt to /home/sandeepsub/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[NeMo W 2021-07-13 10:36:05 experimental:27] Module <class 'nemo.collections.nlp.models.duplex_text_normalization.duplex_tagger.DuplexTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
#######################################

Tokenizing sentence: 100%|███████████| 100000/100000 [00:02<00:00, 41626.26it/s]
[NeMo I 2021-07-13 10:36:09 data_preprocessing:379] Tokenizing dataset /tmp/tmp1tx33m3h...
Tokenizing sentence: 100%|███████████| 100000/100000 [00:02<00:00, 41706.88it/s]
[NeMo I 2021-07-13 10:36:09 data_preprocessing:379] Tokenizing dataset /tmp/tmpeipdmynq...
Tokenizing sentence: 100%|███████████| 100000/100000 [00:02<00:00, 41754.71it/s]
[NeMo I 2021-07-13 10:36:09 data_preprocessing:379] Tokenizing dataset /tmp/tmpg8xrun10...
Tokenizing sentence: 100%|███████████| 100000/100000 [00:02<00:00, 41375.47it/s]
[NeMo I 2021-07-13 10:36:09 data_preprocessing:379] Tokenizing dataset /tmp/tmpjf4yxhhl...
Tokenizing sentence: 100%|███████████| 100000/100000 [00:02<00:00, 41769.12it/s]
[NeMo I 2021-07-13 10:36:09 data_preprocessing:379] Tokenizing dataset /tmp/tmpaom90y9o...
Tokenizing sentence: 100%|███████████| 100000/100000 [00:02<00:00, 40509.99it/s]
Tokenizing sentence: 100%|███████████| 100000/100000 [00:02
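
The `--tokens_in_batch 8000` flag in the command above sizes batches by token count rather than by number of sentences. The sketch below illustrates that idea only; it is not NeMo's implementation. The helper name `pack_by_token_budget` and the rule it uses (padded size = sentences in batch × longest sentence) are assumptions made for illustration.

```python
from typing import List


def pack_by_token_budget(lengths: List[int], tokens_in_batch: int) -> List[List[int]]:
    """Group sentence indices so each batch's padded size fits the token budget.

    Padded size is (number of sentences) * (longest sentence in the batch).
    A single sentence longer than the budget still gets a batch of its own.
    """
    # Sort by length so sentences of similar length share a batch (less padding).
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])

    batches: List[List[int]] = []
    current: List[int] = []
    current_max = 0
    for idx in order:
        new_max = max(current_max, lengths[idx])
        # Close the current batch if adding this sentence would exceed the budget.
        if current and new_max * (len(current) + 1) > tokens_in_batch:
            batches.append(current)
            current, new_max = [], lengths[idx]
        current.append(idx)
        current_max = new_max
    if current:
        batches.append(current)
    return batches


# Five sentences with these token counts, packed under a budget of 16 tokens.
example_batches = pack_by_token_budget([5, 3, 8, 2, 7], tokens_in_batch=16)
```

Sorting by length first keeps padding low, which is why a token budget tends to use GPU memory more evenly than a fixed sentence count.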

In [81]:
!ls data/tarred_dataset_en_ru_8k_tokens

metadata.tokens.8000.json	      parallel.batches.tokens.8000.287.tar
parallel.batches.tokens.8000.0.tar    parallel.batches.tokens.8000.288.tar
parallel.batches.tokens.8000.100.tar  parallel.batches.tokens.8000.289.tar
parallel.batches.tokens.8000.101.tar  parallel.batches.tokens.8000.28.tar
parallel.batches.tokens.8000.102.tar  parallel.batches.tokens.8000.290.tar
parallel.batches.tokens.8000.103.tar  parallel.batches.tokens.8000.291.tar
parallel.batches.tokens.8000.104.tar  parallel.batches.tokens.8000.292.tar
parallel.batches.tokens.8000.105.tar  parallel.batches.tokens.8000.293.tar
parallel.batches.tokens.8000.106.tar  parallel.batches.tokens.8000.294.tar
parallel.batches.tokens.8000.107.tar  parallel.batches.tokens.8000.295.tar
parallel.batches.tokens.8000.108.tar  parallel.batches.tokens.8000.296.tar
parallel.batches.tokens.8000.109.tar  parallel.batches.tokens.8000.297.tar
parallel.batches.tokens.8000.10.tar   parallel.batches.tokens.8000.298.tar
parallel.batches.to
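
To check what a single shard contains without unpacking it, Python's standard `tarfile` module can list its members. This is a minimal sketch: the notebook does not show the member names or payload format inside these tar files, so the helper `list_tar_members` (a hypothetical name) reports only names and sizes and makes no assumption about how the batches are serialized.

```python
import tarfile
from typing import List, Tuple


def list_tar_members(tar_path: str) -> List[Tuple[str, int]]:
    """Return (name, size in bytes) for every regular file in a tar archive."""
    with tarfile.open(tar_path, "r") as tar:
        return [(m.name, m.size) for m in tar.getmembers() if m.isfile()]
```

For example, `list_tar_members("data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.0.tar")` would list the entries of the first shard shown in the listing above.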

In [82]:
!cat data/tarred_dataset_en_ru_8k_tokens/metadata.tokens.8000.json

{"num_batches": 8260, "tar_files": ["data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.6.tar", "data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.258.tar", "data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.318.tar", "data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.203.tar", "data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.39.tar", "data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.50.tar", "data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.381.tar", "data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.277.tar", "data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.240.tar", "data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.243.tar", "data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.41.tar", "data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tokens.8000.152.tar", "data/tarred_dataset_en_ru_8k_tokens/parallel.batches.tok
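
The metadata file records the total number of batches and the paths of the tar files. A quick sanity check, sketched below, compares the number of listed tar files with what `--num_batches_per_tarfile` implies. With `num_batches` of 8260 and 20 batches per tar file, that is 8260 / 20 = 413 files, assuming every tar file holds exactly 20 batches. The helper names `load_metadata` and `summarize_metadata` are hypothetical; only the `num_batches` and `tar_files` keys come from the output above.

```python
import json
import math
from typing import Any, Dict


def load_metadata(path: str) -> Dict[str, Any]:
    """Read the metadata JSON written next to the tar files."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def summarize_metadata(metadata: Dict[str, Any], num_batches_per_tarfile: int) -> Dict[str, int]:
    """Compare the listed tar files with the count implied by the batch total."""
    num_batches = metadata["num_batches"]
    return {
        "num_batches": num_batches,
        "num_tar_files_listed": len(metadata["tar_files"]),
        # Assumption: each tar file holds exactly num_batches_per_tarfile batches.
        "expected_tar_files": math.ceil(num_batches / num_batches_per_tarfile),
    }
```

On the dataset above, this would be called as `summarize_metadata(load_metadata("data/tarred_dataset_en_ru_8k_tokens/metadata.tokens.8000.json"), 20)`. A mismatch between the two counts would suggest an interrupted or partially written run.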