### Data Preparation

In this notebook we start building a model that translates text from English to French. We are going to look at more model architectures later on. This notebook mainly focuses on data preparation and processing.

I've uploaded the files to my Google Drive so that we can easily load them here in Colab. There are two text files, and each line in the English file corresponds to the line at the same position in the French file.

### Where can I find datasets for machine translation tasks?
* The most common datasets for machine translation are found [here at WMT](http://www.statmt.org/)

* [Other datasets can be found here](http://www.manythings.org/anki/)

We are going to:
1. Preprocess the text
  * converting text to sequences of integers in this notebook


### Imports

In [21]:
from collections import Counter
import numpy as np
import helper, os, time

from tensorflow import keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tf.__version__

'2.5.0'

### Mounting the Google Drive.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Paths to the files

In [15]:
base_path = '/content/drive/MyDrive/NLP Data/seq2seq/fr-en-small'
en_path = 'small_vocab_en.txt'
fr_path = 'small_vocab_fr.txt'

### Loading the data.

We have two files located at the path `'/content/drive/MyDrive/NLP Data/seq2seq/fr-en-small'`. These files are:

```
small_vocab_fr.txt
small_vocab_en.txt
```

The following lines load the data.

In [16]:
eng_sents = open(os.path.join(base_path, en_path), encoding='utf8').read().split('\n')
fre_sents = open(os.path.join(base_path, fr_path), encoding='utf8').read().split('\n')

print("Data Loaded")

Data Loaded
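
Because the model relies on line `i` of one file being the translation of line `i` of the other, it can be worth checking the alignment right after loading. The following is a minimal sketch on made-up strings (`en_text` and `fr_text` are hypothetical stand-ins for the file contents, not the real data). It also uses `splitlines()`, which, unlike `split('\n')`, does not leave an empty last element when a file ends with a newline.

```python
# Hypothetical stand-ins for the contents of the two files.
en_text = "new jersey is sometimes quiet .\nparis is never cold .\n"
fr_text = "new jersey est parfois calme .\nparis est jamais froid .\n"

# splitlines() drops the trailing empty string that split('\n') would keep.
en_lines = en_text.splitlines()
fr_lines = fr_text.splitlines()

# A parallel corpus must have the same number of lines in both languages.
assert len(en_lines) == len(fr_lines), "files are not aligned"

# Pair each English sentence with the French sentence on the same line.
pairs = list(zip(en_lines, fr_lines))
print(len(pairs))
print(pairs[0])
```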


In [17]:
eng_sents[1]

'the united states is usually chilly during july , and it is usually freezing in november .'

Looking at the data, we can see that it is already preprocessed (lowercased, with punctuation separated by spaces), so we are not going to do that step here.

### Next, Building the Vocabulary.

The vocabulary, in my definition, is just the set of unique words in the corpus. Let's look at the vocabulary size for French and English. But first we need to tokenize each sentence. To do that I'm going to use the `spacy` library, which is my favourite when it comes to tokenizing languages.

In [19]:
import spacy
spacy.cli.download('fr_core_news_sm')

spacy_fr = spacy.load('fr_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')


✔ Download and installation successful
You can now load the model via spacy.load('fr_core_news_sm')


In [20]:
def tokenize_fr(sent):
  return [tok.text for tok in spacy_fr.tokenizer(sent)]
  
def tokenize_en(sent):
  return [tok.text for tok in spacy_en.tokenizer(sent)]

In [22]:
en_counter = Counter()
fr_counter = Counter()

for sent in eng_sents:
  en_counter.update(tokenize_en(sent.lower()))
for sent in fre_sents:
  fr_counter.update(tokenize_fr(sent.lower()))

In [25]:
en_vocab_size = len(en_counter)
fr_vocab_size = len(fr_counter)

fr_vocab_size, en_vocab_size

(340, 201)

Here we have `340` unique words for French and `201` unique words for English in this dataset.
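
To make the counting step concrete, here is a small sketch on two made-up sentences. It assumes whitespace splitting as a stand-in for the spaCy tokenizers used above, which is reasonable only because the text already has spaces around punctuation. The `sentences` list is hypothetical.

```python
from collections import Counter

# Two made-up sentences in the same style as the dataset.
sentences = [
    "paris is sometimes warm in june .",
    "paris is never cold in june .",
]

counter = Counter()
for sent in sentences:
    # Whitespace split is an assumption standing in for spaCy tokenization.
    counter.update(sent.lower().split())

# The vocabulary size is the number of distinct tokens (punctuation included).
vocab_size = len(counter)
print(vocab_size)
print(counter["paris"])
```

Note that punctuation such as `.` counts as a token here, just as it does with the spaCy tokenizers.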

### Preprocessing.

We will convert our text data into sequences of integers. We are going to perform the following steps:

1. Tokenize the words into ids
2. Pad the sequences so that they all have the same length.

For this task we are going to use the Keras `Tokenizer` class. We have been using it for sentiment analysis, so the procedure is the same.

We are going to have two tokenizers, one for each language.


In [43]:
en_tokenizer = Tokenizer(num_words=en_vocab_size, oov_token="<oov>")
en_tokenizer.fit_on_texts(eng_sents)

fr_tokenizer = Tokenizer(num_words=fr_vocab_size, oov_token="<oov>")
fr_tokenizer.fit_on_texts(fre_sents)

In [44]:
en_word_indices = en_tokenizer.word_index
en_word_indices_reversed = dict([
    (v, k) for (k, v) in en_word_indices.items()
])

fr_word_indices = fr_tokenizer.word_index
fr_word_indices_reversed = dict([
    (v, k) for (k, v) in fr_word_indices.items()
])

### Helper functions
We will create some helper functions that convert sequences to text and text to sequences for each language. These functions will be used for inference later on.

**We have set the out-of-vocabulary token (`oov_token="<oov>"`), which gets the index `1`. This means that any word that does not exist in the vocabulary has the integer representation `1`.**

In [47]:
def en_seq_to_text(sequences):
  return " ".join(en_word_indices_reversed[i] for i in sequences)

# French counterpart (previously misnamed en_seq_to_text, which overwrote the English one)
def fr_seq_to_text(sequences):
  return " ".join(fr_word_indices_reversed[i] for i in sequences)

def en_text_to_seq(sent):
  words = tokenize_en(sent.lower())
  sequences = []
  for word in words:
    try:
      sequences.append(en_word_indices[word])
    except KeyError:  # unknown word: fall back to the <oov> id
      sequences.append(1)
  return sequences

def fr_text_to_seq(sent):
  words = tokenize_fr(sent.lower())
  sequences = []
  for word in words:
    try:
      sequences.append(fr_word_indices[word])
    except KeyError:  # unknown word: fall back to the <oov> id
      sequences.append(1)
  return sequences
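
The round trip between text and ids, including the fallback to `1` for unknown words, can be sketched without Keras or spaCy. Everything below is a simplified stand-in under stated assumptions: `build_word_index`, `text_to_seq` and `seq_to_text` are hypothetical names, whitespace splitting replaces the real tokenizers, and id `0` is assumed to be reserved for padding.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # assumption: 0 is padding, 1 is the <oov> token


def build_word_index(sentences):
    """Map each word to an integer id, most frequent words first."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    word_index = {"<oov>": OOV_ID}
    for i, (word, _) in enumerate(counts.most_common(), start=OOV_ID + 1):
        word_index[word] = i
    return word_index


def text_to_seq(sent, word_index):
    # dict.get returns OOV_ID for any word missing from the vocabulary.
    return [word_index.get(w, OOV_ID) for w in sent.lower().split()]


def seq_to_text(seq, index_word):
    # Padding ids carry no word, so they are skipped when decoding.
    return " ".join(index_word[i] for i in seq if i != PAD_ID)


corpus = ["she likes apples", "she likes pears"]
word_index = build_word_index(corpus)
index_word = {i: w for w, i in word_index.items()}

seq = text_to_seq("she likes mangoes", word_index)
decoded = seq_to_text(seq + [PAD_ID, PAD_ID], index_word)
print(seq)
print(decoded)
```

Here `mangoes` is not in the toy corpus, so it is encoded as `1` and decoded back as `<oov>`.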

### Converting text to sequences

In [49]:
en_sequences = en_tokenizer.texts_to_sequences(eng_sents)
fr_sequences = fr_tokenizer.texts_to_sequences(fre_sents)

In [52]:
fr_sequences[0:4]

[[36, 35, 2, 9, 68, 38, 12, 25, 7, 4, 2, 113, 3, 51],
 [5, 33, 32, 2, 13, 20, 3, 50, 7, 4, 96, 70, 3, 52],
 [102, 2, 13, 68, 3, 46, 7, 4, 2, 13, 22, 3, 42],
 [5, 33, 32, 2, 9, 270, 3, 42, 7, 4, 104, 20, 3, 49]]

### Padding Sequences.

In our case we pad (or truncate) every sequence to a fixed length of `100` tokens for both the `fr` and `en` languages, on the assumption that no sentence is longer than that.

In [53]:
max_words = 100
en_tokens_padded = pad_sequences(
    en_sequences, 
    maxlen=max_words, 
    padding="post", 
    truncating="post"
)
fr_tokens_padded = pad_sequences(
    fr_sequences, 
    maxlen=max_words, 
    padding="post", 
    truncating="post"
)

In [54]:
en_tokens_padded[:2]

array([[18, 24,  2,  9, 68,  5, 40,  8,  4,  2, 56,  3, 45,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 6, 21, 22,  2, 10, 63,  5, 44,  8,  4,  2, 10, 52,  3, 46,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0]], dtype=int32)
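
The length of `100` is an assumption rather than something measured from the data. As a rough sketch of the alternative, the longest sequence can be found first and used as the target length. The `pad_post` function below is a hypothetical plain-Python stand-in that mimics post-padding and post-truncation on lists; it is not the Keras implementation, and `toy_sequences` is made-up data.

```python
def pad_post(sequences, maxlen, pad_id=0):
    """Right-truncate, then right-pad, each sequence to exactly maxlen ids."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])             # drop ids beyond maxlen
        filler = [pad_id] * (maxlen - len(trimmed))
        padded.append(trimmed + filler)          # zeros go at the end
    return padded


toy_sequences = [[18, 24, 2], [6, 21, 22, 2, 10]]

# Measure the longest sequence instead of assuming a fixed value.
longest = max(len(s) for s in toy_sequences)
padded = pad_post(toy_sequences, maxlen=longest)
print(longest)
print(padded)
```

Using the measured length keeps the arrays as small as possible, at the cost of having to reuse the same length at inference time.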

### Next
We now have an idea of how to preprocess our data for a machine translation task.

* Next we are going to expand this notebook to cover the creation, training and evaluation of a simple RNN for machine translation.