<a href="https://colab.research.google.com/github/Dmitri9149/TensorFlow_Models_for_NLP/blob/main/EuroparlamentDataForMachineTranslation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import sys
import zipfile
import os
import tarfile

The data for machine translation are prepared with PyTorch's 
torchtext and spaCy. 
Here I follow: https://towardsdatascience.com/how-to-use-torchtext-for-neural-machine-translation-plus-hack-to-make-it-5x-faster-77f3884d95  
The dataset is the European Parliament Proceedings Parallel Corpus 1996–2011: http://www.statmt.org/europarl/ 

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
tf = tarfile.open('drive/My Drive/Data/TextData/fr-en.tgz', mode="r")
tf.getnames()

['europarl-v7.fr-en.en', 'europarl-v7.fr-en.fr']

In [4]:
tf.extractall('drive/My Drive/Data/TextData')

In [5]:
with open('drive/My Drive/Data/TextData/europarl-v7.fr-en.en', encoding='utf-8') as f:
    europarl_en = f.read().split('\n')
with open('drive/My Drive/Data/TextData/europarl-v7.fr-en.fr', encoding='utf-8') as f:
    europarl_fr = f.read().split('\n')

In [6]:
europarl_en[0:10]

['Resumption of the session',
 'I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.',
 "Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.",
 'You have requested a debate on this subject in the course of the next few days, during this part-session.',
 "In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.",
 "Please rise, then, for this minute' s silence.",
 "(The House rose and observed a minute' s silence)",
 'Madam President, on a point of order.',
 'You will be aware from the press and television that there have be

In [7]:
europarl_fr[0:10]

['Reprise de la session',
 'Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.',
 'Comme vous avez pu le constater, le grand "bogue de l\'an 2000" ne s\'est pas produit. En revanche, les citoyens d\'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles.',
 'Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.',
 "En attendant, je souhaiterais, comme un certain nombre de collègues me l'ont demandé, que nous observions une minute de silence pour toutes les victimes, des tempêtes notamment, dans les différents pays de l'Union européenne qui ont été touchés.",
 'Je vous invite à vous lever pour cette minute de silence.',
 '(Le Parlement, debout, observe une minute de silence)',
 "Madame la Présidente, c'est une motion de procédure.",
 'Vous avez p

In [8]:
### use torchtext and spaCy to create Field objects for later 
### tokenization and text processing 
### spaCy is used for language-specific tokenization 

In [9]:
import spacy
import torchtext
from torchtext.data import Field, BucketIterator, TabularDataset
en = spacy.load('en')
### the line below does not work 
## fr = spacy.load('fr')
## workaround taken from: stackoverflow.com/questions/55338972/oserror-e050-cant-find-model-fr-core-web-md-it-doesnt-seem-to-be-a-short
### moreover, the advice printed by spaCy after the download -> 
### '✔ Download and installation successful
### You can now load the model via spacy.load('fr_core_news_md') ' 
### does not work either: 
### fr = spacy.load('fr_core_news_md')  -> does not work 
### so the model package is imported and loaded directly 
!python3 -m spacy download fr_core_news_md
import fr_core_news_md
fr = fr_core_news_md.load()
def tokenize_en(sentence):
    return [tok.text for tok in en.tokenizer(sentence)]
def tokenize_fr(sentence):
    return [tok.text for tok in fr.tokenizer(sentence)]
EN_TEXT = Field(tokenize=tokenize_en)
FR_TEXT = Field(tokenize=tokenize_fr, init_token = "<sos>", eos_token = "<eos>")

✔ Download and installation successful
You can now load the model via spacy.load('fr_core_news_md')
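
To make the role of `init_token` and `eos_token` concrete, here is a minimal sketch that needs neither spaCy nor torchtext. It assumes a naive regex tokenizer (`simple_tokenize`, a hypothetical stand-in for `fr.tokenizer`) and a hypothetical helper `wrap_target` that marks sentence boundaries the way the target-side field is configured to. As far as I understand, torchtext adds these markers itself when it pads a batch, so this only illustrates the result, not the library's code path.

```python
import re

def simple_tokenize(sentence):
    """Naive stand-in for a spaCy tokenizer (assumption, not spaCy's rules):
    runs of word characters and single punctuation marks become tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def wrap_target(tokens, init_token="<sos>", eos_token="<eos>"):
    """Hypothetical helper: add start and end markers to a target sentence."""
    return [init_token] + tokens + [eos_token]

tokens = simple_tokenize("Reprise de la session.")
print(wrap_target(tokens))
```

Only the French (target) side gets the markers, because a decoder needs a symbol to start generating from and a symbol that tells it when to stop.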


In [10]:
### create a DataFrame from the data, because a CSV file written from it 
### can be loaded directly with torchtext's TabularDataset
import pandas as pd
#### collect the parallel sentences, one column per language 
raw_data = {'English': europarl_en, 'French': europarl_fr}
#### create DataFrame
df = pd.DataFrame(raw_data, columns=["English", "French"])
# remove very long sentences and sentence pairs whose translations are 
# not of roughly equal length
### count spaces -> approx. number of words in a sentence
df['eng_len'] = df['English'].str.count(' ')
df['fr_len'] = df['French'].str.count(' ')
df = df.query('fr_len < 80 & eng_len < 80')
df = df.query('fr_len < eng_len * 1.5 & fr_len * 1.5 > eng_len')
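
The two `query` filters can be read as a single rule per sentence pair. The sketch below restates that rule in plain Python on three toy pairs; `approx_len` and `keep_pair` are hypothetical names introduced only for illustration, and the thresholds are the ones used in the cell above.

```python
def approx_len(sentence):
    """Approximate word count the same way as the cell above: count spaces."""
    return sentence.count(" ")

def keep_pair(english, french, max_len=80, ratio=1.5):
    """Hypothetical helper mirroring the two DataFrame filters."""
    eng_len = approx_len(english)
    fr_len = approx_len(french)
    # first filter: both sides must be shorter than max_len
    if not (fr_len < max_len and eng_len < max_len):
        return False
    # second filter: neither side may be 1.5x (or more) longer than the other
    return fr_len < eng_len * ratio and fr_len * ratio > eng_len

pairs = [
    ("Resumption of the session", "Reprise de la session"),
    ("Yes", "Oui"),
    ("Please rise", "Je vous invite à vous lever pour cette minute de silence."),
]
kept = [pair for pair in pairs if keep_pair(*pair)]
print(kept)
```

Note a side effect of the strict inequalities: a pair in which both sides contain no space (single words or empty lines) has lengths 0 and 0, fails `0 < 0 * 1.5`, and is dropped.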


In [11]:
from sklearn.model_selection import train_test_split
# create train and validation set 
train, val = train_test_split(df, test_size=0.1)
train.to_csv("train.csv", index=False)
val.to_csv("val.csv", index=False)

In [13]:
print("len of filtered df = ", len(df))


len of filtered df =  1840428


In [14]:
!whoami


root


In [15]:
# associate the text in the 'English' column with the EN_TEXT field,
# and 'French' with FR_TEXT
data_fields = [('English', EN_TEXT), ('French', FR_TEXT)]
# skip_header=True keeps the CSV header row from being read as a sentence pair
train, val = torchtext.data.TabularDataset.splits(path='./', train='train.csv', validation='val.csv', format='csv', fields=data_fields, skip_header=True)

In [16]:
### build vocab
FR_TEXT.build_vocab(train, val)
EN_TEXT.build_vocab(train, val)

In [17]:
print(EN_TEXT.vocab.stoi['the'])

2


In [18]:
print(EN_TEXT.vocab.itos[2])

the
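
The `stoi` / `itos` pair is just a two-way mapping between tokens and integer ids. Below is a minimal sketch of how such a vocabulary can be built with `collections.Counter`. The function `build_vocab` here is a hypothetical stand-in, not torchtext's implementation; it assumes the special tokens `<unk>` and `<pad>` take ids 0 and 1 and the remaining tokens are ordered by frequency, which would be consistent with `'the'` getting id 2 above.

```python
from collections import Counter

def build_vocab(tokenized_sentences, specials=("<unk>", "<pad>")):
    """Hypothetical sketch: specials first, then tokens by descending frequency."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # most frequent first; ties broken alphabetically to keep the order stable
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]   # id -> token
    stoi = {tok: idx for idx, tok in enumerate(itos)}     # token -> id
    return stoi, itos

corpus = [["the", "session", "of", "the", "house"],
          ["the", "house", "rose"]]
stoi, itos = build_vocab(corpus)
print(stoi["the"], itos[2])
# unseen tokens fall back to the id of "<unk>"
print(stoi.get("parliament", stoi["<unk>"]))
```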


In [19]:
### construct iterator
train_iter = BucketIterator(train, batch_size=20,
                            sort_key=lambda x: len(x.French), shuffle=True)
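
The point of bucketing is to put sentences of similar length into the same batch, so that little padding is needed. The sketch below shows the idea in plain Python. It is a simplification under my own assumptions: `bucket_batches` and `pad_batch` are hypothetical helpers, and torchtext's `BucketIterator` differs in detail (this version sorts the whole dataset once, then shuffles the order of the batches).

```python
import random

def bucket_batches(examples, batch_size, sort_key, shuffle=True, seed=0):
    """Hypothetical sketch: group examples of similar length into batches."""
    ordered = sorted(examples, key=sort_key)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    if shuffle:
        # shuffle the order of batches, not the contents of each batch
        random.Random(seed).shuffle(batches)
    return batches

def pad_batch(batch, pad_token="<pad>"):
    """Pad every sentence in a batch to the length of the longest one."""
    width = max(len(sent) for sent in batch)
    return [sent + [pad_token] * (width - len(sent)) for sent in batch]

# toy "sentences" of lengths 7, 2, 9, 3, 8, 1
sentences = [["tok"] * n for n in (7, 2, 9, 3, 8, 1)]
batches = bucket_batches(sentences, batch_size=2, sort_key=len)
for batch in batches:
    print([len(sent) for sent in batch], "->", len(pad_batch(batch)[0]))
```

With random batching, the sentence of length 1 could share a batch with the one of length 9 and be padded with 8 tokens; with bucketing it is paired with the sentence of length 2.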