
# Preprocessing

- Cleaning parallel corpus
- BPE tokenization

[Bicleaner](https://github.com/bitextor/bicleaner-hardrules)


RULES:
- no_empty,	Sentence is empty
- not_too_long,	Sentence is more than 1024 characters long
- not_too_short,	Sentence is less than	3 words long
- length_ratio,	The length ratio between the source sentence and target sentence (in bytes) is too low or too high
- no_identical,	Alphabetic content in source sentence and target sentence is identical
- no_literals,  Unwanted literals: "Re:","{{", "%s", "}}", "+++", "***", '=\"'
- no_only_symbols,	The ratio of non-alphabetic characters in source sentence is more than 90%
- no_only_numbers,	The ratio of numeric characters in source sentence is too high
- no_urls,	There are URLs (disabled by default)
- no_breadcrumbs,	There are more than 2 breadcrumb characters in the sentence
- no_glued_words,	There are words in the sentence containing too many uppercased characters between lowercased characters
- no_repeated_words, There are words repeated consecutively
- no_unicode_noise,	Too many characters from unwanted unicode in source sentence
- no_space_noise,	Too many consecutive single characters separated by spaces in the sentence (excludes digits)
- no_paren,	Too many parentheses or brackets in sentence
- no_escaped_unicode,	There are unescaped unicode characters in sentence
- no_bad_encoding,	Source sentence or target sentence contains mojibake
- no_titles,	All words in source sentence or target sentence are uppercased or in titlecase
- no_wrong_language,	Sentence is not in the desired language
- no_porn,	Source sentence or target sentence contains text identified as porn
- no_number_inconsistencies,	Sentence contains different numbers in source and target (disabled by default)
- no_script_inconsistencies,	Sentence source or target contains characters from different script/writing systems (disabled by default)
- lm_filter,	The sentence pair has low fluency score from the language model
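
To make the rule list more concrete, here is a minimal Python sketch of five of the simpler rules. It is an illustration only, not bicleaner's implementation: the function name `failed_rule` is hypothetical, the 1024-character and 3-word limits come from the list above, and `MAX_LENGTH_RATIO = 2.0` is an assumed threshold chosen for the example.

```python
from typing import Optional

MAX_CHARS = 1024         # limit named in the not_too_long rule above
MIN_WORDS = 3            # limit named in the not_too_short rule above
MAX_LENGTH_RATIO = 2.0   # assumption for this sketch, not bicleaner's value


def failed_rule(src: str, tgt: str) -> Optional[str]:
    """Return the name of the first rule the pair fails, or None if it passes."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return "no_empty"
    if len(src) > MAX_CHARS or len(tgt) > MAX_CHARS:
        return "not_too_long"
    if len(src.split()) < MIN_WORDS or len(tgt.split()) < MIN_WORDS:
        return "not_too_short"
    # length_ratio is measured in bytes, so encode before comparing
    src_bytes = len(src.encode("utf-8"))
    tgt_bytes = len(tgt.encode("utf-8"))
    if max(src_bytes, tgt_bytes) / min(src_bytes, tgt_bytes) > MAX_LENGTH_RATIO:
        return "length_ratio"
    # no_identical compares only the alphabetic content of both sides
    src_alpha = "".join(ch for ch in src if ch.isalpha())
    tgt_alpha = "".join(ch for ch in tgt if ch.isalpha())
    if src_alpha == tgt_alpha:
        return "no_identical"
    return None
```

Calling `failed_rule(en_sentence, de_sentence)` on a pair returns a rule name for a rejected pair and `None` otherwise. The real tool applies many more rules, so its decisions can differ from this sketch.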

In [1]:
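#install the Hunspell spell-checking library and its dictionaries (system dependencies)
!apt-get install libhunspell-dev
!apt-get install hunspell-en-us
# hunspell-en-med ??
!apt-get install hunspell-de-de #German dictionary; Hunspell checks spelling much like the spell checker in Word
!pip install hunspell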

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libhunspell-dev is already the newest version (1.7.0-4build1).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
hunspell-en-us is already the newest version (1:2020.12.07-2).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
hunspell-de-de is already the newest version (20161207-9).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.


In [2]:
#install KenLM first; bicleaner-hardrules is installed in a later cell
!pip install --config-settings="--build-option=--max_order=7" https://github.com/kpu/kenlm/archive/master.zip #build KenLM with support for n-grams up to order 7

Collecting https://github.com/kpu/kenlm/archive/master.zip
  Using cached https://github.com/kpu/kenlm/archive/master.zip
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [3]:
!pip list > requirements.txt #saving all versions of libraries in a text file
!cat requirements.txt

Package                               Version
------------------------------------- -------------------
absl-py                               1.4.0
accelerate                            1.6.0
aiohappyeyeballs                      2.6.1
aiohttp                               3.11.15
aiosignal                             1.3.2
alabaster                             1.0.0
albucore                              0.0.24
albumentations                        2.0.6
ale-py                                0.11.0
altair                                5.5.0
annotated-types                       0.7.0
antlr4-python3-runtime                4.9.3
anyio                                 4.9.0
argon2-cffi                           23.1.0
argon2-cffi-bindings                  21.2.0
array_record                          0.7.2
arviz                                 0.21.0
astropy                               7.0.2
astropy-iers-data                     0.2025.5.12.0.38.29
astunparse                            1

In [4]:
#load parallel corpus
#check number of lines
!wc -l dev* #count (wc - count words) lines (-l) in all files starting with dev (dev*)

   500 dev.en-de
   500 dev.en-de.de
   500 dev.en-de.en
  1500 total


In [5]:
#bicleaner requires the parallel data in a single file with tab-separated columns (en, de)
!paste  dev.en-de.en dev.en-de.de > dev.en-de #merge the two files line by line, joining each pair with a tab
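
As a sketch of the same step in Python: `paste` keeps going when one file is shorter than the other and pads the missing side with empty fields, so a misaligned corpus goes unnoticed. The hypothetical helper `paste_parallel` below refuses to write anything in that case; the function name and its checks are assumptions for illustration.

```python
def read_lines(path: str) -> list:
    """Read a UTF-8 file and split on '\\n' only, like the shell tools do."""
    with open(path, encoding="utf-8", newline="\n") as handle:
        return [line.rstrip("\n") for line in handle]


def paste_parallel(src_path: str, tgt_path: str, out_path: str) -> int:
    """Join two line-aligned files with a tab; return the number of pairs."""
    src_lines = read_lines(src_path)
    tgt_lines = read_lines(tgt_path)
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line counts differ: {len(src_lines)} vs {len(tgt_lines)}"
        )
    for src, tgt in zip(src_lines, tgt_lines):
        # a tab inside a sentence would shift the columns downstream
        if "\t" in src or "\t" in tgt:
            raise ValueError("a sentence already contains a tab")
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for src, tgt in zip(src_lines, tgt_lines):
            out.write(f"{src}\t{tgt}\n")
    return len(src_lines)
```

For the dev set this would be called as `paste_parallel("dev.en-de.en", "dev.en-de.de", "dev.en-de")`.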

In [6]:
#check output
!head dev.en-de  #the result is a parallel corpus in which each source sentence (en) has a corresponding target sentence (de)

Yevonde's most famous work was inspired by a theme party held on 5 March 1935, where guests dressed as Roman and Greek gods and goddesses.	Besonders bekannt wurden ihre Aufnahmen von einem Fest 1935, zu dem Gäste als griechische Götter und Göttinnen verkleidet kamen.
Mora is working on a trilogy about the IT specialist Darius Kopp, of which band I "The Only Man on the Continent" and Volume II "The Monster" have already appeared.	Terézia Mora arbeitet an einer Trilogie um den IT-Spezialisten Darius Kopp, von der Band I „Der einzige Mann auf dem Kontinent“ und Band II „Das Ungeheuer“ bereits erschienen sind.
The first person to enter this section was Günther J. Wolf with seven members of his ice course.	Eine erste Befahrung dieses Abschnitts gelang Günther J. Wolf mit sieben Teilnehmern seines Eiskurses.
They were renumbered in 1970 to 100 903 and 904, and in 1973 to 199 003 and 004.	Sie wurden 1970 in 100 903 und 904, 1973 in 199 003 und 004 umgenummert.
The grave is probably a disturbe

In [7]:
!pip install cyhunspell

Collecting cyhunspell
  Using cached CyHunspell-1.3.4.tar.gz (2.7 MB)
  Preparing metadata (setup.py) ... done
Collecting cacheman>=2.0.6 (from cyhunspell)
  Using cached CacheMan-2.2.0-py2.py3-none-any.whl.metadata (5.8 kB)
Using cached CacheMan-2.2.0-py2.py3-none-any.whl (13 kB)
Building wheels for collected packages: cyhunspell
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for cyhunspell (setup.py) ... error
  ERROR: Failed building wheel for cyhunspell
  Running setup.py clean for cyhunspell
Failed to build cyhunspell
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (cyhunspell)

In [8]:
!pip install numpy==1.24



In [9]:
!pip install bicleaner-hardrules

Collecting bicleaner-hardrules
  Using cached bicleaner_hardrules-2.10.6-py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (14 kB)
Collecting toolwrapper<=3,>=1.0 (from bicleaner-hardrules)
  Using cached toolwrapper-2.1.0.tar.gz (3.2 kB)
  Preparing metadata (setup.py) ... done
Collecting sacremoses==0.0.53 (from bicleaner-hardrules)
  Using cached sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py) ... done
Collecting fasttext-wheel==0.9.2 (from bicleaner-hardrules)
  Using cached fasttext_wheel-0.9.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting fastspell==0.11.1 (from bicleaner-hardrules)
  Using cached fastspell-0.11.1-py3-none-any.whl.metadata (53 kB)
Collecting huggingface-hub<0.23,>=0.15 (from bicleaner-hardrules)
  Using cached huggingface_hub-0.22.2-py3-none-any.whl.metadata (12 kB)
INFO: pip is looking at multiple versions of fastspell to determine which version is compatible with other

In [13]:
#apply the bicleaner hard rules
#-s = source language, -t = target language
#first positional argument: tab-separated parallel corpus (input)
#second positional argument: classified corpus (output)
!bicleaner-hardrules  \
        -s en -t de  \
        dev.en-de  \
        dev.en-de.classified

2025-05-20 09:02:32,102 - INFO - LM filtering disabled.
2025-05-20 09:02:32,102 - INFO - Porn removal disabled.
2025-05-20 09:02:32,114 - INFO - Executing main program...
2025-05-20 09:02:32,114 - INFO - Starting process
2025-05-20 09:02:32,114 - INFO - Running 1 workers at 10000 rows per block
2025-05-20 09:02:32,125 - INFO - Start mapping
2025-05-20 09:02:32,131 - INFO - End mapping
2025-05-20 09:02:33,913 - INFO - Hard rules applied. Output available in dev.en-de.classified
2025-05-20 09:02:33,918 - INFO - Finished
2025-05-20 09:02:33,918 - INFO - Total: 500 rows
2025-05-20 09:02:33,918 - INFO - Elapsed time 1.80 s
2025-05-20 09:02:33,918 - INFO - Troughput: 277 rows/s
2025-05-20 09:02:33,919 - INFO - Program finished


In [14]:
#check file
!head dev.en-de.classified  #1 = passed the rules, 0 = failed the rules

Yevonde's most famous work was inspired by a theme party held on 5 March 1935, where guests dressed as Roman and Greek gods and goddesses.	Besonders bekannt wurden ihre Aufnahmen von einem Fest 1935, zu dem Gäste als griechische Götter und Göttinnen verkleidet kamen.	1
Mora is working on a trilogy about the IT specialist Darius Kopp, of which band I "The Only Man on the Continent" and Volume II "The Monster" have already appeared.	Terézia Mora arbeitet an einer Trilogie um den IT-Spezialisten Darius Kopp, von der Band I „Der einzige Mann auf dem Kontinent“ und Band II „Das Ungeheuer“ bereits erschienen sind.	1
The first person to enter this section was Günther J. Wolf with seven members of his ice course.	Eine erste Befahrung dieses Abschnitts gelang Günther J. Wolf mit sieben Teilnehmern seines Eiskurses.	1
They were renumbered in 1970 to 100 903 and 904, and in 1973 to 199 003 and 004.	Sie wurden 1970 in 100 903 und 904, 1973 in 199 003 und 004 umgenummert.	1
The grave is probably a 

In [15]:
#keep only the pairs labelled 1
!grep '1$' dev.en-de.classified >  dev.en-de.clean #select only lines ending in 1, i.e. pairs that passed the rules ($ anchors the match at the end of the line)

In [16]:
!grep '0$' dev.en-de.classified >  dev.en-de.filter #select only lines ending in 0, i.e. pairs that failed the rules

In [18]:
#check files
!wc -l dev.en-de.classified
!wc -l dev.en-de.clean #count the lines that passed the rules

500 dev.en-de.classified
466 dev.en-de.clean


In [20]:
!wc -l dev.en-de.filter #count the lines that failed the rules

34 dev.en-de.filter
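
The two `grep` calls work because the label is the last character on every line. A minimal Python sketch of the same split is shown below; it parses the label column explicitly instead. It assumes exactly three tab-separated columns (source, target, label), as in the output shown above, and the helper name `split_by_label` is hypothetical.

```python
def split_by_label(rows):
    """Split 'src<TAB>tgt<TAB>label' rows into (kept, rejected) pair lists."""
    kept, rejected = [], []
    for row in rows:
        # assumes exactly three columns; anything else raises ValueError
        src, tgt, label = row.rstrip("\n").split("\t")
        if label == "1":
            kept.append((src, tgt))
        elif label == "0":
            rejected.append((src, tgt))
        else:
            raise ValueError(f"unexpected label: {label!r}")
    return kept, rejected
```

Passing an open file handle for `dev.en-de.classified` as `rows` gives both lists in one pass. The label column is already dropped, and `len(kept) + len(rejected)` should equal the number of input lines.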


In [21]:
!head -n 50 dev.en-de.clean

Yevonde's most famous work was inspired by a theme party held on 5 March 1935, where guests dressed as Roman and Greek gods and goddesses.	Besonders bekannt wurden ihre Aufnahmen von einem Fest 1935, zu dem Gäste als griechische Götter und Göttinnen verkleidet kamen.	1
Mora is working on a trilogy about the IT specialist Darius Kopp, of which band I "The Only Man on the Continent" and Volume II "The Monster" have already appeared.	Terézia Mora arbeitet an einer Trilogie um den IT-Spezialisten Darius Kopp, von der Band I „Der einzige Mann auf dem Kontinent“ und Band II „Das Ungeheuer“ bereits erschienen sind.	1
The first person to enter this section was Günther J. Wolf with seven members of his ice course.	Eine erste Befahrung dieses Abschnitts gelang Günther J. Wolf mit sieben Teilnehmern seines Eiskurses.	1
They were renumbered in 1970 to 100 903 and 904, and in 1973 to 199 003 and 004.	Sie wurden 1970 in 100 903 und 904, 1973 in 199 003 und 004 umgenummert.	1
The grave is probably a 

In [None]:
!head -n 50 dev.en-de.filter

He was an editor of the journals: Zeitschrift für Tropenmedizin, the Zentralblatt für Bakteriologie and the Zeitschrift für Parasitenkunde.	Ferner war er Herausgeber der Zeitschrift für Tropenmedizin, dem Zentralblatt für Bakteriologie und der Zeitschrift für Parasitenkunde.	0
"Das Himmelreich zu Erlangen – offen aus Tradition?"	Das Himmelreich zu Erlangen – offen aus Tradition?	0
"Wörterbuch zur Sprache und Kultur der Twareg".	Prasse: Wörterbuch zur Sprache und Kultur der Twareg.	0
Sensors and Actuators B: Chemical.	In: Sensors and Actuators B: Chemical.	0
The Daily Courier.	In: The Daily Courier.	0
Competitivitat de l´economia catalana en l´horitzó 2010: Effectes macroeconòmics del dèfiit fiscal amb l´Estat espanyol (Competitivity of the Catalan economy in the horizon 2010: Macroeconomic effects of the fiscal deficit with the Spanish State) - 2003 Polítiques públiques: Una visió renovada (Public politics: An updated perspective) - 2004 L´espoli fiscal.	Competitivitat de l´economia ca

In [22]:
# split the file into columns
!cut -f1 dev.en-de.clean > dev.en-de.clean.en #keep only the first column (English source)
!cut -f2 dev.en-de.clean > dev.en-de.clean.de #keep only the second column (German target)

In [23]:
#check files
!wc -l dev.en-de.clean.en
!wc -l dev.en-de.clean.de

466 dev.en-de.clean.en
466 dev.en-de.clean.de


# TODO BICLEAN
  - Training data 500k, dev 5k, and test 5k
  - clean it with hard rules


*paper: https://aclanthology.org/2020.eamt-1.31.pdf


Comment from me:
I used the data from Moodle, which has the following line counts:
train 50K, test 500, dev 500

## Test set

In [24]:
!wc -l test*

   500 test.en-de.de
   500 test.en-de.en
  1000 total


In [25]:
#bicleaner requires the parallel data in a single file with tab-separated columns (en, de)
!paste  test.en-de.en test.en-de.de > test.en-de #merge the two files line by line, joining each pair with a tab

In [26]:
#check output
!head test.en-de  #the result is a parallel corpus in which each source sentence (en) has a corresponding target sentence (de)

49:1 Scherzo in G minor op.	49 Nr. 1 Scherzo g-Moll op.
The ECF used to publish a newsletter Chess Moves, which was free to members.	Die ECF gibt die Schachzeitschrift Chess Moves heraus, die für Mitglieder kostenlos ist.
News on May 4, 2010.	News am 4. Mai 2010.
Cape Barren, with the other islands in the Furneaux Group, are a popular destination for sea kayakers who attempt the crossing of Bass Strait from the Australian mainland at Wilsons Promontory, Victoria to the Tasmanian mainland.	Cape Barren ist wie auch die anderen Inseln der Furneaux-Gruppe ein beliebtes Ziel für Seekayak-Fahrer, die die Bass Strait von Wilsons Promontory in Australien nach Tasmanien überqueren.
The work is articulated in a single movement, and comprises an ensemble consisting of two violins, a cello, a piano, a flute, and a piccolo, which was recorded in Russia by soloists of the Moscow Philharmonic Orchestra barely two days after its opening.	Die Komposition ist in einen Satz für ein Ensemble für zwei Viol

In [27]:
#apply the bicleaner hard rules
#-s = source language, -t = target language
#first positional argument: tab-separated parallel corpus (input)
#second positional argument: classified corpus (output)
!bicleaner-hardrules  \
        -s en -t de  \
        test.en-de  \
        test.en-de.classified

2025-05-20 09:10:43,217 - INFO - LM filtering disabled.
2025-05-20 09:10:43,217 - INFO - Porn removal disabled.
2025-05-20 09:10:43,232 - INFO - Executing main program...
2025-05-20 09:10:43,232 - INFO - Starting process
2025-05-20 09:10:43,232 - INFO - Running 1 workers at 10000 rows per block
2025-05-20 09:10:43,243 - INFO - Start mapping
2025-05-20 09:10:43,248 - INFO - End mapping
2025-05-20 09:10:45,865 - INFO - Hard rules applied. Output available in test.en-de.classified
2025-05-20 09:10:45,872 - INFO - Finished
2025-05-20 09:10:45,872 - INFO - Total: 500 rows
2025-05-20 09:10:45,872 - INFO - Elapsed time 2.64 s
2025-05-20 09:10:45,872 - INFO - Troughput: 189 rows/s
2025-05-20 09:10:45,872 - INFO - Program finished


In [28]:
#check file
!head test.en-de.classified  #1 = passed the rules, 0 = failed the rules

49:1 Scherzo in G minor op.	49 Nr. 1 Scherzo g-Moll op.	0
The ECF used to publish a newsletter Chess Moves, which was free to members.	Die ECF gibt die Schachzeitschrift Chess Moves heraus, die für Mitglieder kostenlos ist.	1
News on May 4, 2010.	News am 4. Mai 2010.	0
Cape Barren, with the other islands in the Furneaux Group, are a popular destination for sea kayakers who attempt the crossing of Bass Strait from the Australian mainland at Wilsons Promontory, Victoria to the Tasmanian mainland.	Cape Barren ist wie auch die anderen Inseln der Furneaux-Gruppe ein beliebtes Ziel für Seekayak-Fahrer, die die Bass Strait von Wilsons Promontory in Australien nach Tasmanien überqueren.	1
The work is articulated in a single movement, and comprises an ensemble consisting of two violins, a cello, a piano, a flute, and a piccolo, which was recorded in Russia by soloists of the Moscow Philharmonic Orchestra barely two days after its opening.	Die Komposition ist in einen Satz für ein Ensemble für z

In [29]:
#keep only the pairs labelled 1
!grep '1$' test.en-de.classified >  test.en-de.clean #select only lines ending in 1, i.e. pairs that passed the rules ($ anchors the match at the end of the line)

In [31]:
!grep '0$' test.en-de.classified > test.en-de.filter #select only lines ending in 0, i.e. pairs that failed the rules

In [32]:
#check files
!wc -l test.en-de.classified
!wc -l test.en-de.clean #count the lines that passed the rules

500 test.en-de.classified
467 test.en-de.clean


In [33]:
!wc -l test.en-de.filter #count the lines that failed the rules

33 test.en-de.filter


In [34]:
!head -n 50 test.en-de.clean

The ECF used to publish a newsletter Chess Moves, which was free to members.	Die ECF gibt die Schachzeitschrift Chess Moves heraus, die für Mitglieder kostenlos ist.	1
Cape Barren, with the other islands in the Furneaux Group, are a popular destination for sea kayakers who attempt the crossing of Bass Strait from the Australian mainland at Wilsons Promontory, Victoria to the Tasmanian mainland.	Cape Barren ist wie auch die anderen Inseln der Furneaux-Gruppe ein beliebtes Ziel für Seekayak-Fahrer, die die Bass Strait von Wilsons Promontory in Australien nach Tasmanien überqueren.	1
The work is articulated in a single movement, and comprises an ensemble consisting of two violins, a cello, a piano, a flute, and a piccolo, which was recorded in Russia by soloists of the Moscow Philharmonic Orchestra barely two days after its opening.	Die Komposition ist in einen Satz für ein Ensemble für zwei Violinen, einem Cello, einem Piano, einer Flöte und einer Piccoloflöte gegliedert, das schon zwei 

In [35]:
!head -n 50 test.en-de.filter #print the first 50 lines from the file

49:1 Scherzo in G minor op.	49 Nr. 1 Scherzo g-Moll op.	0
News on May 4, 2010.	News am 4. Mai 2010.	0
Financial Markets and Portfolio Management.	In: Financial Markets and Portfolio Management.	0
Die Übersetzung der englischen Kurzfassung besorgte Jost Benedum, Institut für Geschichte der Medizin der Justus-Liebig-Universität Gießen.	Die Übersetzung der englischen Kurzfassung besorgte Jost Benedum, Institut für Geschichte der Medizin der Justus-Liebig-Universität Gießen.	0
(Vienna: Triton, 2001).	(Wien: Triton, 2001).	0
Amtsgericht Dresden, Aktenzeichen: VR 7750.	Vereinsregister des Amtsgerichts Dresden, Blatt VR 7750.	0
Why sue?	Warum Lio?	0
Sudamericana (October 2005).	Sudamericana (Oktober 2005).	0
11 (Gassenhauer-Trio), Johannes Brahms his Klarinettentrio op.	11 (Gassenhauer-Trio) schrieb, Johannes Brahms sein Klarinettentrio op.	0
"Handbuch der historischen Buchbestände in Deutschland, Österreich und Europa (Fabian-Handbuch): Dombibliothek".	Handbuch der historischen Buchbestände 

In [36]:
# split the file into columns
!cut -f1 test.en-de.clean > test.en-de.clean.en #keep only the first column (English source)
!cut -f2 test.en-de.clean > test.en-de.clean.de #keep only the second column (German target)

In [37]:
#check files
!wc -l test.en-de.clean.en
!wc -l test.en-de.clean.de

467 test.en-de.clean.en
467 test.en-de.clean.de


## Training set

In [38]:
!wc -l train*

   50000 train.en-de.de
   50000 train.en-de.en
  100000 total


In [39]:
#bicleaner requires the parallel data in a single file with tab-separated columns (en, de)
!paste  train.en-de.en train.en-de.de > train.en-de #merge the two files line by line, joining each pair with a tab

In [40]:
#check output
!head train.en-de  #the result is a parallel corpus in which each source sentence (en) has a corresponding target sentence (de)

A recent analysis by Apaldetti et al. (2011) suggests that Gongxianosaurus was more basal than Vulcanodon, Tazoudasaurus and Isanosaurus, but more derived than the early sauropods Antetonitrus, Lessemsaurus, Blikanasaurus, Camelotia and Melanorosaurus.	Eine neuere Analyse von Cecilia Apaldetti und Kollegen (2011) lässt darauf schließen, dass Gongxianosaurus basaler (ursprünglicher) war als Vulcanodon, Tazoudasaurus und Isanosaurus, aber stärker abgeleitet (fortgeschrittener) als die frühen Sauropoden Antetonitrus, Lessemsaurus, Blikanasaurus, Camelotia und Melanorosaurus.
Reichhart also carried out executions in Cologne, Frankfurt-Preungesheim, Berlin-Plötzensee, Brandenburg-Görden and Breslau, where central execution sites had also been constructed.	Reichhart vollzog vertretungsweise auch Hinrichtungen in Köln, Frankfurt-Preungesheim, Berlin-Plötzensee, Brandenburg-Görden und Breslau, wo ebenfalls zentrale Hinrichtungsstätten eingerichtet worden waren.
Uphold the right of all, without

In [43]:
#apply the bicleaner hard rules
#-s = source language, -t = target language
#first positional argument: tab-separated parallel corpus (input)
#second positional argument: classified corpus (output)
!bicleaner-hardrules  \
        -s en -t de  \
        train.en-de  \
        train.en-de.classified

2025-05-20 09:18:38,237 - INFO - LM filtering disabled.
2025-05-20 09:18:38,237 - INFO - Porn removal disabled.
2025-05-20 09:18:38,258 - INFO - Executing main program...
2025-05-20 09:18:38,259 - INFO - Starting process
2025-05-20 09:18:38,259 - INFO - Running 1 workers at 10000 rows per block
2025-05-20 09:18:38,274 - INFO - Start mapping
2025-05-20 09:18:38,452 - INFO - End mapping
2025-05-20 09:19:57,926 - INFO - Hard rules applied. Output available in train.en-de.classified
2025-05-20 09:19:57,930 - INFO - Finished
2025-05-20 09:19:57,930 - INFO - Total: 50000 rows
2025-05-20 09:19:57,930 - INFO - Elapsed time 79.67 s
2025-05-20 09:19:57,930 - INFO - Troughput: 627 rows/s
2025-05-20 09:19:57,931 - INFO - Program finished


In [44]:
#check file
!head train.en-de.classified  #1 = passed the rules, 0 = failed the rules

A recent analysis by Apaldetti et al. (2011) suggests that Gongxianosaurus was more basal than Vulcanodon, Tazoudasaurus and Isanosaurus, but more derived than the early sauropods Antetonitrus, Lessemsaurus, Blikanasaurus, Camelotia and Melanorosaurus.	Eine neuere Analyse von Cecilia Apaldetti und Kollegen (2011) lässt darauf schließen, dass Gongxianosaurus basaler (ursprünglicher) war als Vulcanodon, Tazoudasaurus und Isanosaurus, aber stärker abgeleitet (fortgeschrittener) als die frühen Sauropoden Antetonitrus, Lessemsaurus, Blikanasaurus, Camelotia und Melanorosaurus.	1
Reichhart also carried out executions in Cologne, Frankfurt-Preungesheim, Berlin-Plötzensee, Brandenburg-Görden and Breslau, where central execution sites had also been constructed.	Reichhart vollzog vertretungsweise auch Hinrichtungen in Köln, Frankfurt-Preungesheim, Berlin-Plötzensee, Brandenburg-Görden und Breslau, wo ebenfalls zentrale Hinrichtungsstätten eingerichtet worden waren.	1
Uphold the right of all, wit

In [45]:
!wc -l train.en-de.classified

50000 train.en-de.classified


In [46]:
#keep only the pairs labelled 1
!grep '1$' train.en-de.classified >  train.en-de.clean #select only lines ending in 1, i.e. pairs that passed the rules ($ anchors the match at the end of the line)

In [47]:
!grep '0$' train.en-de.classified > train.en-de.filter #select only lines ending in 0, i.e. pairs that failed the rules

In [49]:
#check files
!wc -l train.en-de.classified
!wc -l train.en-de.clean #count the lines that passed the rules

50000 train.en-de.classified
46612 train.en-de.clean


In [50]:
!wc -l train.en-de.filter #count the lines that failed the rules

3388 train.en-de.filter


In [51]:
!head -n 50 train.en-de.clean

A recent analysis by Apaldetti et al. (2011) suggests that Gongxianosaurus was more basal than Vulcanodon, Tazoudasaurus and Isanosaurus, but more derived than the early sauropods Antetonitrus, Lessemsaurus, Blikanasaurus, Camelotia and Melanorosaurus.	Eine neuere Analyse von Cecilia Apaldetti und Kollegen (2011) lässt darauf schließen, dass Gongxianosaurus basaler (ursprünglicher) war als Vulcanodon, Tazoudasaurus und Isanosaurus, aber stärker abgeleitet (fortgeschrittener) als die frühen Sauropoden Antetonitrus, Lessemsaurus, Blikanasaurus, Camelotia und Melanorosaurus.	1
Reichhart also carried out executions in Cologne, Frankfurt-Preungesheim, Berlin-Plötzensee, Brandenburg-Görden and Breslau, where central execution sites had also been constructed.	Reichhart vollzog vertretungsweise auch Hinrichtungen in Köln, Frankfurt-Preungesheim, Berlin-Plötzensee, Brandenburg-Görden und Breslau, wo ebenfalls zentrale Hinrichtungsstätten eingerichtet worden waren.	1
Uphold the right of all, wit

In [52]:
!head -n 50 train.en-de.filter #print the first 50 lines from the file

Lawyers Who Lead.	Anwaltsunternehmen führen.	0
World Health Organization Monograph Series.	In: World Health Organization Monograph.	0
Josef Breinbauer: Otto von Lonsdorf.	Josef Breinbauer: Otto von Lonsdorf.	0
Am Vorabend des Grauens: Studien zum Spannungsfeld Politik, Literatur, Film in Deutschland und Polen in den 30er Jahren des 20.	Studien zum Spannungsfeld Politik – Literatur – Film in Deutschland und Polen in den 30er Jahren des 20. Jahrhunderts.	0
In: Türkiye Diyanet Vakfi Ansiklopedisi Islam.	In: Türkiye Diyanet Vakfi Islam Ansiklopedisi.	0
Preußen im Bundestag 1851–1859 Ein Achtundvierziger.	Preußen im Bundestag 1851–1859 Ein Achtundvierziger.	0
Gillespie, William: Karl Ritter.	William Gillespie: Karl Ritter.	0
(Prime Books) The Year's Best Australian Science Fiction and Fantasy Vol.2, Bill Congreve & Michelle Marquardt (eds.)	(Prime Books) The Year's Best Australian Science Fiction and Fantasy Vol.2, Bill Congreve & Michelle Marquardt (Hrsg.)	0
A Überrumpelungsspiel .	Ein Übe

In [53]:
# split the file into columns
!cut -f1 train.en-de.clean > train.en-de.clean.en #keep only the first column (English source)
!cut -f2 train.en-de.clean > train.en-de.clean.de #keep only the second column (German target)

In [54]:
#check files
!wc -l train.en-de.clean.en
!wc -l train.en-de.clean.de

46612 train.en-de.clean.en
46612 train.en-de.clean.de
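
The `wc -l` outputs above can be turned into retention figures with a few lines of Python. The counts are copied from the cells above; the helper name `summarize` is hypothetical.

```python
def summarize(total: int, kept: int):
    """Return (number of removed pairs, fraction of pairs kept)."""
    return total - kept, kept / total


# (total, kept) line counts reported by wc -l above
counts = {"dev": (500, 466), "test": (500, 467), "train": (50000, 46612)}

for name, (total, kept) in counts.items():
    removed, rate = summarize(total, kept)
    print(f"{name}: kept {kept}/{total} ({rate:.1%}), removed {removed}")
```

The removed counts (34, 33 and 3388) match the sizes of the `.filter` files, so no pair is lost between the `.clean` and `.filter` outputs. In all three sets a little over 93% of the pairs survive the hard rules.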


In [55]:
from google.colab import files

files.download('train.en-de')
files.download('train.en-de.classified')
files.download('train.en-de.filter')
files.download('train.en-de.clean')
files.download('train.en-de.clean.de')
files.download('train.en-de.clean.en')


# BPE

from [subword-nmt](https://github.com/rsennrich/subword-nmt)

In [56]:
#install subword nmt
!pip install subword-nmt #==0.3.8



In [57]:
#learn the BPE model
!subword-nmt learn-joint-bpe-and-vocab --input train.en-de.en train.en-de.de -s 16000 -o train.bpe --write-vocabulary train.vocab.en train.vocab.de
#learn 16,000 joint BPE merge operations for both languages and write one vocabulary file per language

100% 16000/16000 [01:45<00:00, 151.20it/s]
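
For intuition about what the learning step does, here is a toy sketch of BPE learning in plain Python. It is not subword-nmt's implementation: the word frequencies are hypothetical, `</w>` is used as an end-of-word symbol, and ties between equally frequent pairs are broken by first occurrence, which may differ from the real tool.

```python
import collections
import re


def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs


def merge_vocab(pair, vocab):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    merged = "".join(pair)
    # lookarounds make sure only whole symbols are matched
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Return the list of merge operations, most frequent pair first."""
    merges = []
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        merges.append(best)
    return merges


# hypothetical toy corpus: space-separated symbols per word, with frequencies
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
print(learn_bpe(toy_vocab, 5))
```

The `-s` value passed to `learn-joint-bpe-and-vocab` plays the role of `num_merges` here: each iteration adds one new symbol.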


In [60]:
#apply BPE to the source side
!subword-nmt apply-bpe -c train.bpe < dev.en-de.en > dev.en-de.bpe.en #apply the learned BPE merges to the English dev set

In [61]:
#check out bpe
!head dev.en-de.bpe.en

Y@@ ev@@ on@@ de@@ 's most famous work was inspired by a theme party held on 5 March 193@@ 5, where gu@@ ests d@@ ressed as Roman and Greek god@@ s and god@@ dess@@ es.
Mor@@ a is working on a tr@@ il@@ og@@ y about the IT special@@ ist D@@ ari@@ us K@@ opp@@ , of which band I "The Only Man on the Contin@@ ent@@ " and Vol@@ ume II "The Mon@@ ster@@ " have already appear@@ ed.
The first person to enter this section was Gün@@ ther J. Wol@@ f with seven members of his ice cour@@ se.
They were ren@@ um@@ ber@@ ed in 1970 to 100 90@@ 3 and 90@@ 4, and in 1973 to 19@@ 9 00@@ 3 and 00@@ 4.
The grave is probably a dist@@ ur@@ bed arrange@@ ment, which was covered earlier with wood or ston@@ es.
Per@@ sec@@ u@@ tions ended following John@@ 's death on 23 May 167@@ 7, at the age of 7@@ 4.
In celeb@@ ration he wrote a book entitled Three Vis@@ its to Mad@@ ag@@ as@@ car (185@@ 8).
Berlin@@ ale Tal@@ ents and Per@@ spek@@ tive Deutsch@@ es Kin@@ o have joined forces to award the inaug@@ ural “@@ K
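
In the output above, `@@ ` marks a split inside a word: every piece that ends in `@@` continues into the next piece. The sketch below shows how a learned merge list turns a single word into such pieces. It is a simplified illustration with a hypothetical merge list and function name, not subword-nmt's code.

```python
def segment(word, merges):
    """Segment one word with an ordered merge list; join pieces with '@@ '."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word) + ["</w>"]
    while len(symbols) > 1:
        # find the adjacent pair that was learned earliest (lowest rank)
        candidates = [
            (ranks.get((left, right), float("inf")), index)
            for index, (left, right) in enumerate(zip(symbols, symbols[1:]))
        ]
        rank, index = min(candidates)
        if rank == float("inf"):
            break  # no learned merge applies any more
        symbols[index:index + 2] = [symbols[index] + symbols[index + 1]]
    # remove the end-of-word marker before printing
    if symbols[-1] == "</w>":
        symbols.pop()
    else:
        symbols[-1] = symbols[-1][: -len("</w>")]
    return "@@ ".join(symbols)


# hypothetical merge list, in the order the merges were learned
toy_merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
print(segment("lowest", toy_merges))
```

Words made only of frequent character sequences end up as a single piece, while rare words such as names fall apart into several short pieces.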

In [None]:
#apply BPE to the target side
!subword-nmt apply-bpe -c train.bpe < dev.en-de.de > dev.en-de.bpe.de #apply the learned BPE merges to the German dev set

In [None]:
#check out bpe
!head dev.en-de.bpe.de


## Apply on the test set

In [None]:
#apply bpe source
!subword-nmt apply-bpe -c train.bpe < test.en-de.en > test.en-de.bpe.en #apply the learned bpe vocabulary to the corpus (en)

In [None]:
#apply BPE to the target side
!subword-nmt apply-bpe -c train.bpe < test.en-de.de > test.en-de.bpe.de #apply the learned BPE merges to the German test set

# TODO BPE

- Train bpe model with the training data
- Apply on training, dev, and test

**NOTE:** to restore the original, unsegmented text, use


```
!sed -r 's/(@@ )|(@@ ?$)//g' < file_in > file_out
```
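
The same substitution can be done in Python with the regular expression from the `sed` command. This is a minimal sketch; the function name `remove_bpe` is hypothetical.

```python
import re


def remove_bpe(line: str) -> str:
    """Undo BPE segmentation: drop every '@@ ' join and a trailing '@@'."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)


segmented = "They were ren@@ umber@@ ed in 1970 to 100 90@@ 3 and 90@@ 4."
print(remove_bpe(segmented))
```

Applying `remove_bpe` to each line of a translated file restores plain text, which is needed before evaluation against unsegmented references.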



## Learning BPE from clean training data (50k)

In [65]:
#learn the BPE model
!subword-nmt learn-joint-bpe-and-vocab --input train.en-de.clean.en train.en-de.clean.de -s 50000 -o train.clean.bpe --write-vocabulary train.vocab.clean.en train.vocab.clean.de
#learn 50,000 joint BPE merge operations for both languages from the cleaned training data

100% 50000/50000 [05:26<00:00, 153.11it/s]


### Apply on the dev set

In [66]:
#apply bpe source
!subword-nmt apply-bpe -c train.clean.bpe < dev.en-de.clean.en > dev.en-de.clean.bpe.en #apply the learned bpe vocabulary to the corpus (en)

In [67]:
!subword-nmt apply-bpe -c train.clean.bpe < dev.en-de.clean.de > dev.en-de.clean.bpe.de

In [68]:
#check out bpe
!head dev.en-de.clean.bpe.en

Y@@ ev@@ on@@ de's most famous work was inspired by a theme party held on 5 March 1935, where guests dressed as Roman and Greek gods and god@@ dess@@ es.
Mor@@ a is working on a trilogy about the IT specialist Darius Kopp@@ , of which band I "The Only Man on the Contin@@ ent@@ " and Volume II "The Mon@@ ster@@ " have already appear@@ ed.
The first person to enter this section was Günther J. Wolf with seven members of his ice cour@@ se.
They were ren@@ umber@@ ed in 1970 to 100 90@@ 3 and 90@@ 4, and in 1973 to 199 00@@ 3 and 00@@ 4.
The grave is probably a distur@@ bed arrange@@ ment, which was covered earlier with wood or ston@@ es.
Per@@ sec@@ u@@ tions ended following John's death on 23 May 167@@ 7, at the age of 7@@ 4.
In celebration he wrote a book entitled Three Vis@@ its to Madagas@@ car (185@@ 8).
Berlin@@ ale Tal@@ ents and Perspektive Deutsches Kino have joined forces to award the inaugural “@@ Komp@@ agn@@ on@@ ” fel@@ low@@ ship in 2017.
A proposal was flo@@ ated during the

In [69]:
#check out bpe
!head dev.en-de.clean.bpe.de

Besonders bekannt wurden ihre Aufnahmen von einem Fest 1935, zu dem Gäste als griechische Götter und Gött@@ innen verklei@@ det kamen.
Ter@@ é@@ zia Mor@@ a arbeitet an einer Trilogie um den IT-@@ Spezi@@ alisten Darius Kopp@@ , von der Band I „Der einzige Mann auf dem Kontin@@ ent@@ “ und Band II „Das Unge@@ heu@@ er“ bereits erschienen sind.
Eine erste Be@@ fahr@@ ung dieses Abschnit@@ ts gelang Günther J. Wolf mit sieben Teilnehmern seines Eis@@ kur@@ ses.
Sie wurden 1970 in 100 90@@ 3 und 90@@ 4, 1973 in 199 00@@ 3 und 00@@ 4 um@@ gen@@ um@@ mer@@ t.
Bei dem Grab handelt es sich wahrscheinlich um eine gest@@ ör@@ te An@@ lage, die früher mit Holz oder Steinen abge@@ deckt war.
Die Verfolgung endete, als Graf Johann am 23. Mai 167@@ 7 im Alter von 74 Jahren starb.
Diese Reisen beschrieb er in Three visits to Madagas@@ car (@@ London 185@@ 8).
Berlin@@ ale Tal@@ ents und Perspektive Deutsches Kino vergeben gemeinsam im Jahr 2017 zum ersten Mal den „@@ Komp@@ agn@@ on@@ “@@ -F@@ örder

### Apply on the test set

In [70]:
#apply bpe source
!subword-nmt apply-bpe -c train.clean.bpe < test.en-de.clean.en > test.en-de.clean.bpe.en #apply the learned bpe vocabulary to the corpus (en)

In [71]:
!subword-nmt apply-bpe -c train.clean.bpe < test.en-de.clean.de > test.en-de.clean.bpe.de #apply the learned bpe vocabulary to the corpus (de)

In [72]:
#check out bpe
!head test.en-de.clean.bpe.en

The EC@@ F used to publish a news@@ letter Chess Mov@@ es, which was free to members.
Cape Bar@@ ren, with the other islands in the Fur@@ ne@@ aux Group, are a popular destination for sea k@@ ay@@ akers who attempt the crossing of Bass Strait from the Australian mainland at Wil@@ sons Pro@@ mon@@ tor@@ y, Victoria to the Tasmanian main@@ land.
The work is articulated in a single movement, and comprises an ensemble consisting of two viol@@ ins, a cel@@ lo, a piano, a flu@@ te, and a pic@@ col@@ o, which was recorded in Russia by solo@@ ists of the Moscow Philharmonic Orchestra bar@@ ely two days after its open@@ ing.
The base is reported to host several M@@ Q-@@ 9 Re@@ ap@@ er d@@ ron@@ es, based on satellite imag@@ ery.
Some 80,000 people are hospit@@ alized there every year, and another 600@@ ,000 are treated in its out@@ patient clin@@ ics and medical instit@@ utes.
Being situated on the city footh@@ ills (17@@ 00 metres above sea level@@ ), Ni@@ avar@@ an has a cool@@ er climate all

In [73]:
#check out bpe
!head test.en-de.clean.bpe.de

Die EC@@ F gibt die Schach@@ zeitschrift Chess Mov@@ es heraus, die für Mitglieder kostenlos ist.
Cape Bar@@ ren ist wie auch die anderen Inseln der Fur@@ ne@@ au@@ x-@@ Gruppe ein beliebtes Ziel für Se@@ ek@@ ay@@ ak-@@ Fahr@@ er, die die Bass Strait von Wil@@ sons Pro@@ mon@@ tory in Australien nach Tasman@@ ien überqu@@ eren.
Die Komposition ist in einen Satz für ein Ensemble für zwei Viol@@ inen, einem Cel@@ lo, einem Pian@@ o, einer Flö@@ te und einer Pic@@ col@@ o@@ fl@@ ö@@ te ge@@ glie@@ der@@ t, das schon zwei Tage nach seinem Erscheinen in Russland von Soli@@ sten des Moskauer Philharmon@@ ischen Orchest@@ ers verton@@ t worden ist.
Die Basis soll mehrere M@@ Q-@@ 9 Re@@ ap@@ er Dro@@ hnen beherberg@@ en, dies lässt sich auf Satellit@@ enbil@@ dern erkennen.
Rund 7@@ 5.000 Menschen werden hier jährlich eingel@@ iefert und weitere 500@@ .000 werden in den amb@@ ul@@ anten Klin@@ iken und medizinischen Institu@@ ten behandelt.
Am Fuße des El@@ bur@@ s-@@ Gebirges in einer Höhe 

### Apply on the train set

In [74]:
!subword-nmt apply-bpe -c train.clean.bpe < train.en-de.clean.en > train.en-de.clean.bpe.en

In [75]:
!subword-nmt apply-bpe -c train.clean.bpe < train.en-de.clean.de > train.en-de.clean.bpe.de
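
Since BPE is applied to each side separately, it can be worth checking that the source and target files of every split still have the same number of lines. The following is a minimal sketch; the helpers `count_lines` and `check_parallel` are hypothetical names, and the file names are the ones produced by the cells in this notebook.

```python
import os


def count_lines(path):
    """Number of lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise ValueError if the two sides differ."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(f"{src_path} has {n_src} lines but {tgt_path} has {n_tgt}")
    return n_src


# Check each split whose BPE files exist in the working directory.
for split in ("train", "dev", "test"):
    src = f"{split}.en-de.clean.bpe.en"
    tgt = f"{split}.en-de.clean.bpe.de"
    if os.path.exists(src) and os.path.exists(tgt):
        print(split, check_parallel(src, tgt))
```

A mismatch here means the two sides are no longer aligned line by line and should be fixed before the files are downloaded for training.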

In [76]:
#check out bpe
!head train.en-de.clean.bpe.en

A recent analysis by Ap@@ al@@ de@@ tti et al. (201@@ 1) suggests that Gong@@ x@@ ian@@ osaurus was more bas@@ al than Vul@@ can@@ od@@ on, T@@ az@@ ou@@ das@@ aurus and Is@@ an@@ osaur@@ us, but more derived than the early saur@@ opo@@ ds Ant@@ et@@ onit@@ rus, Less@@ em@@ saur@@ us, B@@ lik@@ anas@@ aur@@ us, Cam@@ elo@@ tia and Mel@@ an@@ or@@ osaurus.
Reich@@ hart also carried out execu@@ tions in Cologne, Frankfurt-@@ Pre@@ unges@@ heim, Berlin-@@ Pl@@ ötz@@ en@@ see, Branden@@ burg@@ -G@@ ör@@ den and Breslau, where central execution sites had also been constructed.
U@@ ph@@ old the right of all, without discrimin@@ ation, to a natural and social environment suppor@@ tive of human dign@@ ity, bo@@ di@@ ly health and spiritual well-@@ being, with special attention to the rights of indigenous peoples and minor@@ ities.
Sch@@ in@@ kel was not mention@@ ed, however, in a document published in 1828 on the "@@ Construction Designs of the Prussian Stat@@ e", in which town plann@@ er, Au

In [77]:
#check out bpe
!head train.en-de.clean.bpe.de

Eine neuere Analyse von Cec@@ ili@@ a Ap@@ al@@ de@@ tti und Kollegen (201@@ 1) lässt darauf schließen, dass Gong@@ x@@ ian@@ osaurus bas@@ aler (@@ ursprüng@@ lich@@ er) war als Vul@@ can@@ od@@ on, T@@ az@@ ou@@ das@@ aurus und Is@@ an@@ osaur@@ us, aber stärker abgeleitet (@@ fortgeschrit@@ ten@@ er) als die frühen Saur@@ op@@ oden Ant@@ et@@ onit@@ rus, Less@@ em@@ saur@@ us, B@@ lik@@ anas@@ aur@@ us, Cam@@ elo@@ tia und Mel@@ an@@ or@@ osaurus.
Reich@@ hart voll@@ zog vertre@@ t@@ ungsweise auch Hin@@ richtungen in Köln, Frankfurt-@@ Pre@@ unges@@ heim, Berlin-@@ Pl@@ ötz@@ en@@ see, Branden@@ burg@@ -G@@ ör@@ den und Breslau, wo ebenfalls zentrale Hin@@ richt@@ ungs@@ stätten eingerichtet worden waren.
Am Recht aller – ohne Ausnahme – auf eine natürliche und soziale Umwelt f@@ esth@@ alten, welche Menschen@@ würde, körper@@ liche Gesundheit und spiritu@@ elles Woh@@ ler@@ gehen unterstützt.
Sch@@ in@@ kel wurde jedoch in einem 1828 erschienenen Druck@@ werk über die „@@ Bau@@ au

In [78]:
from google.colab import files

files.download('train.clean.bpe')

files.download('train.vocab.clean.en')
files.download('train.vocab.clean.de')

files.download('test.en-de.clean.bpe.en')
files.download('test.en-de.clean.bpe.de')

files.download('dev.en-de.clean.bpe.en')
files.download('dev.en-de.clean.bpe.de')

files.download('train.en-de.clean.bpe.en')
files.download('train.en-de.clean.bpe.de')
