### Tutorial Overview

This tutorial is divided into 5 parts; they are:

1. Europarl Machine Translation Dataset
2. Download French-English Dataset
3. Load Dataset
4. Clean Dataset
5. Reduce Vocabulary

### Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed. The scraping cell in the next section additionally imports `requests` and `bs4` (Beautiful Soup).

### Xitsonga Language Dictionary Website

This website provides English-to-Xitsonga translations, so we will scrape it for the data we need.

In [21]:
import requests

# Library for parsing HTML
from bs4 import BeautifulSoup

links = [
  'https://www.xitsonga.org/grammar/greeting', 
  'https://www.xitsonga.org/grammar/emotions', 
  'https://www.xitsonga.org/grammar/how-to-ask',
  'https://www.xitsonga.org/grammar/phrases&page=1' #Pages range from 1-8
]

base_url = 'https://www.xitsonga.org/grammar/greeting'

index = requests.get(base_url).text
soup_index = BeautifulSoup(index, 'html.parser')

# The data is displayed in a table: the first column holds the Tsonga entries
# and the second column holds the corresponding English translations
rows = soup_index.find_all('tr')

# Every row has data, but the Tsonga column wraps its text in links
rows[:2]

[<tr><th>Xitsonga</th><th>English</th></tr>,
 <tr><td style="width:40%"><a href="grammar/greeting?_=baloyi">Baloyi</a></td><td>-</td></tr>]
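
The two rows above suggest a simple extraction rule: skip the header row (it only has `th` cells) and, for every other row, take the text of the first `td` as the Xitsonga entry and the text of the second as the English entry. The following is a minimal sketch of that rule. It runs on a hard-coded copy of the output above rather than on the live page, the helper name `rows_to_pairs` is hypothetical, and it assumes every data row has exactly two cells.

```python
from bs4 import BeautifulSoup

# Hard-coded sample mirroring the two rows printed above (not fetched live).
sample_html = """
<table>
<tr><th>Xitsonga</th><th>English</th></tr>
<tr><td style="width:40%"><a href="grammar/greeting?_=baloyi">Baloyi</a></td><td>-</td></tr>
</table>
"""

def rows_to_pairs(rows):
    """Hypothetical helper: turn <tr> elements into (xitsonga, english) tuples."""
    pairs = []
    for row in rows:
        cells = row.find_all('td')
        # The header row uses <th>, so it has no <td> cells and is skipped.
        if len(cells) != 2:
            continue
        xitsonga = cells[0].get_text(strip=True)
        english = cells[1].get_text(strip=True)
        pairs.append((xitsonga, english))
    return pairs

sample_rows = BeautifulSoup(sample_html, 'html.parser').find_all('tr')
pairs = rows_to_pairs(sample_rows)
print(pairs)
```

Because `get_text()` collects the text of nested tags, the link wrapper around the Xitsonga entry needs no special handling.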

### Download French-English Dataset

We will focus on the parallel French-English dataset.

This is a prepared corpus of aligned French and English sentences recorded between 1996 and 2011.

The dataset has the following statistics:

- Sentences: 2,007,723
- French words: 51,388,643
- English words: 50,196,035

You can download the dataset from here:

<a href="http://www.statmt.org/europarl/v7/fr-en.tgz">Parallel corpus French-English</a> (194 Megabytes)

Once downloaded, you should have the file “fr-en.tgz” in your current working directory.

In [0]:
# Download the corpus
!wget http://www.statmt.org/europarl/v7/fr-en.tgz -O fr-en.tgz

--2019-12-02 16:27:37--  http://www.statmt.org/europarl/v7/fr-en.tgz
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 202718517 (193M) [application/x-gzip]
Saving to: ‘fr-en.tgz’


2019-12-02 16:44:29 (196 KB/s) - ‘fr-en.tgz’ saved [202718517/202718517]
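
The `!wget` line depends on a shell command being available from the notebook. Where it is not, a rough equivalent can be written with the standard library. The sketch below is one possible approach, not part of the original notebook: the function name `download_if_missing` is hypothetical, and it skips the transfer when the file already exists, which is worth having given that the download in the log above took about 17 minutes.

```python
import os
import urllib.request

def download_if_missing(url, filename):
    """Download url to filename unless the file already exists; return the path."""
    if not os.path.exists(filename):
        # urlretrieve streams the response to disk in blocks.
        urllib.request.urlretrieve(url, filename)
    return filename

# Example call (commented out because it transfers roughly 193 MB):
# download_if_missing('http://www.statmt.org/europarl/v7/fr-en.tgz', 'fr-en.tgz')
```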



In [0]:
# Unpack the archive with the tar command:
!tar zxvf fr-en.tgz

europarl-v7.fr-en.en
europarl-v7.fr-en.fr
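
If the `tar` command is not available on your system, the archive can also be unpacked from Python with the standard-library `tarfile` module. This is a minimal sketch under that assumption; the function name `extract_archive` and its `dest` parameter are illustrative and do not appear in the original notebook.

```python
import tarfile

def extract_archive(archive_path, dest='.'):
    """Extract a gzip-compressed tar archive and return the member names."""
    with tarfile.open(archive_path, mode='r:gz') as archive:
        names = archive.getnames()
        # Only extract archives you trust: member paths are not checked here.
        archive.extractall(path=dest)
    return names

# Example call (assumes fr-en.tgz is in the current working directory):
# print(extract_archive('fr-en.tgz'))
```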


You will now have two files, as follows:

- English: europarl-v7.fr-en.en (288M)
- French: europarl-v7.fr-en.fr (331M)

Below is a sample of the English file.

```
Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
You have requested a debate on this subject in the course of the next few days, during this part-session.
In the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union.
```

Below is a sample of the French file.

```
Reprise de la session
Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.
Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles.
Vous avez souhaité un débat à ce sujet dans les prochains jours, au cours de cette période de session.
En attendant, je souhaiterais, comme un certain nombre de collègues me l'ont demandé, que nous observions une minute de silence pour toutes les victimes, des tempêtes notamment, dans les différents pays de l'Union européenne qui ont été touchés.
```

### Load Dataset

Let’s start off by loading the data files.

We can load each file as a string. Because the files contain Unicode characters, we must specify an encoding when loading them as text. In this case, we will use UTF-8, which handles the Unicode characters in both files.

The function below, named `load_doc()`, loads a given file and returns its contents as a single string.

In [0]:
# Load the document into memory

def load_doc(filename):
    # open the file read-only as UTF-8 text; the with-block closes it automatically
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        return file.read()
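
To see why the explicit encoding matters, the sketch below writes one accented French word to a temporary file as UTF-8 and reads it back twice: once as UTF-8 and once as Latin-1. The temporary file and the sample word are made up for illustration and are not part of the dataset.

```python
import os
import tempfile

word = 'européen'

# Write the sample word to a throwaway file as UTF-8 bytes.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt', delete=False) as tmp:
    tmp.write(word)
    path = tmp.name

# Reading with the matching encoding round-trips the text.
with open(path, mode='rt', encoding='utf-8') as f:
    as_utf8 = f.read()

# Reading with the wrong encoding raises no error but garbles the accent.
with open(path, mode='rt', encoding='latin-1') as f:
    as_latin1 = f.read()

os.remove(path)

print(as_utf8 == word)    # True
print(as_latin1 == word)  # False: the two bytes of 'é' decode as two characters
```

The mismatch is silent, which is why the encoding is passed explicitly instead of relying on the platform default.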

Next, we can split the file into sentences.

Generally, one utterance is stored on each line. We can treat these as sentences and split the file on newline characters. The function `to_sentences()` below splits a loaded document.

In [0]:
# Split a loaded document into sentences

def to_sentences(doc):
    return doc.strip().split('\n')
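
A quick check on a made-up three-line string shows how the split behaves: whitespace is stripped from the start and end of the document as a whole, while a blank line in the middle is kept as an empty string. The function is repeated here so the snippet runs on its own.

```python
def to_sentences(doc):
    return doc.strip().split('\n')

# Made-up sample: two sentences separated by a blank line, with a trailing newline.
doc = 'Resumption of the session\n\nYou have requested a debate.\n'
sentences = to_sentences(doc)
print(sentences)
# ['Resumption of the session', '', 'You have requested a debate.']
```

Keeping interior blank lines preserves the line positions, which is likely to matter when two files are paired line by line.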

When preparing our model later, we will need to know the length of sentences in the dataset. We can write a short function to calculate the lengths, in whitespace-separated tokens, of the shortest and longest sentences.

In [0]:
# Shortest and Longest sentence lengths
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)
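
Lengths here are counts of whitespace-separated tokens, not characters, so punctuation attached to a word does not add to the count and an empty line has length 0. The made-up sample below illustrates both points; the function is repeated so the snippet runs on its own.

```python
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)

# Made-up sample: a short title, an empty line, and a longer fragment.
sample = [
    'Resumption of the session',
    '',
    "the dreaded 'millennium bug' failed to materialise",
]
shortest, longest = sentence_lengths(sample)
print(shortest, longest)  # 0 7
```

Note that `min()` and `max()` raise a `ValueError` on an empty list, so the function expects at least one sentence.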

We can tie all of this together to load and summarize the English and French data files.

In [0]:
# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('English data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))
 
# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))

English data: sentences=2007723, min=0, max=668
