### Simple RNN

In the previous notebook we learnt how to load and preprocess the data. In this notebook we build on that and create a simple `RNN` that trains on the preprocessed data from the previous notebook.

**Note**: The rest of the notebook stays the same as before; whenever something changes, I will highlight it.

### Imports

In [1]:
from collections import Counter
import numpy as np
import helper, os, time

from tensorflow import keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tf.__version__

'2.5.0'

### Mounting the Google Drive.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Paths to the files

In [3]:
base_path = '/content/drive/MyDrive/NLP Data/seq2seq/fr-en-small'
en_path = 'small_vocab_en.txt'
fr_path = 'small_vocab_fr.txt'

### Loading the data.

We have two files located at the path `'/content/drive/MyDrive/NLP Data/seq2seq/fr-en-small'`. These files are:

```
small_vocab_fr.txt
small_vocab_en.txt
```

The following lines load the data.

In [4]:
eng_sents = open(os.path.join(base_path, en_path), encoding='utf8').read().split('\n')
fre_sents = open(os.path.join(base_path, fr_path), encoding='utf8').read().split('\n')

print("Data Loaded")

Data Loaded
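
As a side note, the loading step can also be written with a context manager so that the file is closed explicitly. The sketch below is only an illustration: `load_sentences` and the temporary demo file are hypothetical names, not part of the original notebook. It uses `splitlines()`, which avoids the empty last element that `split('\n')` produces if the file ends with a newline.

```python
import os
import tempfile

def load_sentences(path):
    """Read one sentence per line, closing the file when done."""
    with open(path, encoding="utf8") as f:
        # splitlines() does not produce a trailing empty string
        # when the file ends with a newline, unlike split('\n').
        return f.read().splitlines()

# Tiny stand-in file so the sketch runs anywhere (not the real dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_en.txt")
    with open(demo_path, "w", encoding="utf8") as f:
        f.write("new jersey is quiet .\nparis is rainy .\n")
    demo_sents = load_sentences(demo_path)

print(demo_sents)
```

With the real data, the same function would be called on `os.path.join(base_path, en_path)` and `os.path.join(base_path, fr_path)`.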


In [5]:
eng_sents[1]

'the united states is usually chilly during july , and it is usually freezing in november .'

Looking at the data, we can see that it is already preprocessed (lowercased, with punctuation separated by spaces), so we skip that step here.

### Next, Building the Vocabulary.

By vocabulary I simply mean the unique words in the corpus. Let's look at the vocabulary size of French and English. First we need to tokenize each sentence. To do that I'm going to use the `spacy` library, which is my favourite when it comes to tokenizing languages.

In [6]:
import spacy
spacy.cli.download('fr_core_news_sm')

spacy_fr = spacy.load('fr_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')


✔ Download and installation successful
You can now load the model via spacy.load('fr_core_news_sm')


In [7]:
def tokenize_fr(sent):
  return [tok.text for tok in spacy_fr.tokenizer(sent)]
  
def tokenize_en(sent):
  return [tok.text for tok in spacy_en.tokenizer(sent)]

In [8]:
en_counter = Counter()
fr_counter = Counter()

for sent in eng_sents:
  en_counter.update(tokenize_en(sent.lower()))
for sent in fre_sents:
  fr_counter.update(tokenize_fr(sent.lower()))

In [9]:
en_vocab_size = len(en_counter)
fr_vocab_size = len(fr_counter)

fr_vocab_size, en_vocab_size

(340, 201)

Here we have `340` unique words for French in this dataset and `201` unique words for English.
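
To make the counting step concrete, here is a minimal sketch on two made-up sentences. It assumes plain whitespace splitting as a stand-in for the spaCy tokenizers, and the names `toy_sents`, `toy_counter` and `toy_vocab_size` are hypothetical. Note that punctuation marks count as tokens too.

```python
from collections import Counter

toy_sents = [
    "paris is rainy during june .",
    "paris is never rainy in june .",
]

toy_counter = Counter()
for sent in toy_sents:
    # str.split() is a stand-in for the spaCy tokenizers used above.
    toy_counter.update(sent.lower().split())

# The vocabulary size is the number of distinct tokens seen.
toy_vocab_size = len(toy_counter)
print(toy_vocab_size)
print(toy_counter.most_common(3))
```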

### Preprocessing.

We will convert our text data into sequences of integers. We are going to perform the following steps:

1. Tokenize the words into ids.
2. Pad the tokens so that all sequences have the same length.

For this task we are going to use the keras `Tokenizer` class. We have been using it for sentiment analysis, so the procedure is the same.

We are going to have one tokenizer for each language, so two in total.


In [10]:
en_tokenizer = Tokenizer(num_words=en_vocab_size, oov_token="<oov>")
en_tokenizer.fit_on_texts(eng_sents)

fr_tokenizer = Tokenizer(num_words=fr_vocab_size, oov_token="<oov>")
fr_tokenizer.fit_on_texts(fre_sents)

In [11]:
en_word_indices = en_tokenizer.word_index
en_word_indices_reversed = dict([
    (v, k) for (k, v) in en_word_indices.items()
])

fr_word_indices = fr_tokenizer.word_index
fr_word_indices_reversed = dict([
    (v, k) for (k, v) in fr_word_indices.items()
])
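
To show roughly how the word index is laid out, here is a simplified pure-Python sketch. It assumes the usual Keras behaviour of reserving index `1` for the OOV token and ranking the remaining words by decreasing frequency, with `0` left free for padding. `build_word_index` is a hypothetical stand-in, and unlike the real `Tokenizer` it does not filter punctuation.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<oov>"):
    """Hypothetical stand-in for Tokenizer.fit_on_texts + word_index."""
    counts = Counter()
    for sent in sentences:
        counts.update(sent.lower().split())
    # Index 0 is left free for padding, 1 goes to the OOV token,
    # and the remaining words follow in order of decreasing frequency.
    vocab = [oov_token] + [word for word, _ in counts.most_common()]
    return {word: idx for idx, word in enumerate(vocab, start=1)}

toy_index = build_word_index(["paris is rainy", "paris is quiet", "paris"])
# Reverse mapping (id -> word), built the same way as in the cell above.
toy_index_reversed = {idx: word for word, idx in toy_index.items()}
print(toy_index)
```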

### Helper functions
We will create some helper functions that convert sequences to text and text to sequences for each language. These functions will be used for inference later on.

**We have set the out-of-vocabulary token (`oov_token="<oov>"`) to index `1`, which means that any word that does not exist in the vocabulary is represented by the integer `1`.**

In [12]:
def en_seq_to_text(sequences):
  return " ".join(en_word_indices_reversed[i] for i in sequences )

# Renamed from a duplicate `en_seq_to_text`, which overwrote the English version.
def fr_seq_to_text(sequences):
  return " ".join(fr_word_indices_reversed[i] for i in sequences )

def en_text_to_seq(sent):
  words = tokenize_en(sent.lower())
  sequences = []
  for word in words:
    try:
      sequences.append(en_word_indices[word])
    except KeyError:
      # Unknown word: fall back to the <oov> id.
      sequences.append(1)
  return sequences

def fr_text_to_seq(sent):
  words = tokenize_fr(sent.lower())
  sequences = []
  for word in words:
    try:
      sequences.append(fr_word_indices[word])
    except KeyError:
      sequences.append(1)
  return sequences
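
The out-of-vocabulary fallback can also be expressed with `dict.get`. The following is a small sketch on a made-up vocabulary: `toy_word_indices`, `toy_text_to_seq` and `toy_seq_to_text` are hypothetical names, and whitespace splitting again stands in for spaCy.

```python
OOV_ID = 1  # assumed id of the "<oov>" token, as stated above

# Hypothetical toy vocabulary; the real one is en_word_indices.
toy_word_indices = {"<oov>": 1, "the": 2, "grape": 3, "is": 4}
toy_indices_reversed = {idx: word for word, idx in toy_word_indices.items()}

def toy_text_to_seq(sent):
    # dict.get returns OOV_ID for words missing from the vocabulary.
    return [toy_word_indices.get(word, OOV_ID) for word in sent.lower().split()]

def toy_seq_to_text(sequence):
    return " ".join(toy_indices_reversed[i] for i in sequence)

seq = toy_text_to_seq("The mango is the grape")
print(seq)
print(toy_seq_to_text(seq))
```

The unknown word "mango" maps to `1` and comes back as `<oov>` when the sequence is converted to text again.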

### Converting text to sequences

In [13]:
en_sequences = en_tokenizer.texts_to_sequences(eng_sents)
fr_sequences = fr_tokenizer.texts_to_sequences(fre_sents)

In [14]:
fr_sequences[0:4]

[[36, 35, 2, 9, 68, 38, 12, 25, 7, 4, 2, 113, 3, 51],
 [5, 33, 32, 2, 13, 20, 3, 50, 7, 4, 96, 70, 3, 52],
 [102, 2, 13, 68, 3, 46, 7, 4, 2, 13, 22, 3, 42],
 [5, 33, 32, 2, 9, 270, 3, 42, 7, 4, 104, 20, 3, 49]]

### Padding Sequences.

In our case we are going to assume that no sentence is longer than `100` words in either language (`fr` or `en`), so every sequence is padded to that length.

In [15]:
max_words = 100
en_tokens_padded = pad_sequences(
    en_sequences, 
    maxlen=max_words, 
    padding="post", 
    truncating="post"
)
fr_tokens_padded = pad_sequences(
    fr_sequences, 
    maxlen=max_words, 
    padding="post", 
    truncating="post"
)

In [16]:
en_tokens_padded[:2]

array([[18, 24,  2,  9, 68,  5, 40,  8,  4,  2, 56,  3, 45,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 6, 21, 22,  2, 10, 63,  5, 44,  8,  4,  2, 10, 52,  3, 46,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0]], dtype=int32)
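
As the output shows, most of each padded row is zeros. The sketch below mimics post-padding and post-truncation with plain numpy and also measures the longest sequence instead of assuming it. `pad_post` and `toy_sequences` are hypothetical names, and this is only an approximation of what `pad_sequences` does with `padding="post"` and `truncating="post"`.

```python
import numpy as np

def pad_post(sequences, maxlen, pad_value=0):
    """Hypothetical numpy version of post-padding and post-truncation."""
    padded = np.full((len(sequences), maxlen), pad_value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]              # truncating="post" keeps the first maxlen ids
        padded[row, :len(kept)] = kept   # padding="post" leaves zeros at the end
    return padded

toy_sequences = [[18, 24, 2], [6, 21, 22, 2, 10, 63]]
longest = max(len(seq) for seq in toy_sequences)  # measured rather than assumed
print(longest)
print(pad_post(toy_sequences, maxlen=5))
```

Measuring the longest sequence in this way on the real data would show how much of the `100`-step window is actually needed.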

### Logits to text.

We are going to create one more helper function that takes the logits (the prediction probabilities) and converts them to a human-readable format.

In [17]:

def logits_to_text(logits, tokenizer):
  """
  Convert per-timestep predictions to text, ignoring the pad token.
  """
  index_to_words = {idx: word for word, idx
                    in tokenizer.word_index.items()}
  index_to_words[0] = '<pad>'
  # Take the most likely word id at each timestep.
  return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)]).replace("<pad>", "")
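
The core of this helper is an `argmax` over the vocabulary axis. Here is a minimal sketch with a made-up four-word vocabulary and hand-written probabilities; `toy_index_to_words` and `toy_probs` are hypothetical and only illustrate the idea.

```python
import numpy as np

# Hypothetical 4-word vocabulary; id 0 is the padding token.
toy_index_to_words = {0: "<pad>", 1: "<oov>", 2: "paris", 3: "est"}

# One row per timestep, one column per vocabulary id.
toy_probs = np.array([
    [0.1, 0.1, 0.7, 0.1],    # argmax -> 2 ("paris")
    [0.1, 0.2, 0.1, 0.6],    # argmax -> 3 ("est")
    [0.9, 0.0, 0.05, 0.05],  # argmax -> 0 ("<pad>")
])

# Pick the most likely id at each timestep, then look up the word.
predicted_ids = np.argmax(toy_probs, axis=1)
words = [toy_index_to_words[i] for i in predicted_ids]
print(predicted_ids)
print(words)
```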


### RNN

This is a simple RNN that will learn to translate English sentences to French.

![img](https://github.com/LeanManager/Machine_Translation/raw/e6567f10a6e380eea453fa392de94f26973c8b16/images/rnn.png)

We are going to use the Functional API to build this toy model.

In [18]:
inp = keras.layers.Input(shape=(max_words, 1 )) # 100, 1
rnn = keras.layers.GRU(64, return_sequences=True)(inp)
# The softmax activation makes these per-timestep probabilities rather than raw logits.
logits = keras.layers.TimeDistributed(keras.layers.Dense(fr_vocab_size, activation="softmax"))(rnn)

model = keras.Model(inputs=inp, outputs=logits, name="simple_rnn")
model.summary()


Model: "simple_rnn"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 100, 1)]          0         
_________________________________________________________________
gru (GRU)                    (None, 100, 64)           12864     
_________________________________________________________________
time_distributed (TimeDistri (None, 100, 340)          22100     
Total params: 34,964
Trainable params: 34,964
Non-trainable params: 0
_________________________________________________________________
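
The parameter counts in the summary can be checked by hand. The arithmetic below is a sketch that assumes the TF 2.x default GRU variant (`reset_after=True`), which keeps two bias vectors per gate; the variable names are chosen here for illustration.

```python
units = 64          # GRU units
input_dim = 1       # one feature (the token id) per timestep
fr_vocab = 340      # output size of the Dense layer

# Three gates, each with input weights, recurrent weights and two bias
# vectors (assuming the TF 2.x default reset_after=True).
gru_params = 3 * (units * input_dim + units * units + 2 * units)

# Dense layer shared across timesteps: weights plus biases.
dense_params = units * fr_vocab + fr_vocab

print(gru_params, dense_params, gru_params + dense_params)
```

The three numbers match the `12864`, `22100` and `34,964` reported by `model.summary()`.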


In [19]:
tmp_x = en_tokens_padded.reshape(
   -1, 100, 1
)
tmp_x.shape

(137861, 100, 1)
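
The reshape only adds a trailing feature axis of size one, so that each timestep carries a single number (the token id). A small numpy sketch with a made-up array (`toy_padded` is hypothetical):

```python
import numpy as np

toy_padded = np.array([[18, 24, 2, 0], [6, 21, 22, 2]], dtype="int32")

reshaped = toy_padded.reshape(-1, 4, 1)     # same pattern as reshape(-1, 100, 1)
expanded = toy_padded[..., np.newaxis]      # equivalent: add a trailing axis

print(reshaped.shape)
```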

In [20]:
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=keras.optimizers.Adam(),
    metrics=['accuracy']
)

model.fit(tmp_x, 
          fr_tokens_padded, 
          batch_size=1024, 
          epochs=50,
          validation_split=0.2
)


Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7fb32009dc90>
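
The `sparse_categorical_crossentropy` loss lets us pass the integer ids in `fr_tokens_padded` directly as targets, without one-hot encoding them. The numpy sketch below shows the idea on a toy example. It is an illustration only: it omits details of the Keras implementation such as clipping the probabilities, and the function and variable names are chosen here.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean negative log-probability of the true class ids."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype="float64")
    # Pick, for every timestep, the probability assigned to the true id.
    picked = np.take_along_axis(y_prob, y_true[..., np.newaxis], axis=-1)
    return float(-np.log(picked).mean())

toy_targets = np.array([[2, 0]])             # (batch, timesteps) of word ids
toy_probs = np.array([[[0.1, 0.1, 0.8],      # (batch, timesteps, vocab)
                       [0.5, 0.25, 0.25]]])

loss = sparse_categorical_crossentropy(toy_targets, toy_probs)
print(loss)
```

Because padded timesteps are ordinary targets with id `0`, they also contribute to the loss and to the accuracy metric, which is worth keeping in mind when reading the training numbers.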

### Making some predictions.
Our model predicts French words. In the `predict` function we do the following:

1. Convert the English sentence to a sequence of ids.
2. Pad the sequence and pass it to the model.
3. Reshape the output to the shape `(max_len, trg_vocab_size)`, where the target vocabulary is French.
4. Call the `logits_to_text` function with `fr_tokenizer` as the tokenizer.
5. Return the predicted text.

In [21]:
def predict(sent):
  sequences = en_text_to_seq(sent)
  padded_tokens = pad_sequences([sequences], maxlen=max_words, padding="post", truncating="post")
  logits = model(padded_tokens)
  logits = tf.reshape(logits, (100, -1))
  return logits_to_text(logits, fr_tokenizer)
predict("your least liked fruit is the grape.")

'elle fruit est moins aimé la raisin                                                                                             '
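
The long run of trailing spaces in the output comes from replacing each `<pad>` with an empty string after joining. One possible alternative, sketched here with a hypothetical helper `join_without_padding`, is to drop the pad tokens before joining:

```python
def join_without_padding(words, pad_token="<pad>"):
    """Hypothetical helper: skip pad tokens instead of blanking them."""
    return " ".join(word for word in words if word != pad_token)

toy_words = ["elle", "fruit", "est", "<pad>", "<pad>", "<pad>"]
print(repr(" ".join(toy_words).replace("<pad>", "")))  # trailing spaces remain
print(repr(join_without_padding(toy_words)))
```

Calling `.strip()` on the string returned by `predict` would be another simple way to get the same effect for trailing padding.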

### Making more predictions.


In [22]:
from prettytable import PrettyTable
def tabulate_translations(column_names, data, title, max_characters=25):
  table = PrettyTable(column_names)
  table.title= title
  table.align[column_names[0]] = 'l'
  table.align[column_names[1]] = 'l'
  table.align[column_names[2]] = 'l'
  table._max_width = {column_names[0] :max_characters, column_names[1] :max_characters, column_names[2]:max_characters}
  for row in data:
    table.add_row(row)
  print(table)
columns_names = [
    "English (real src sentence)", "French (the actual text)", "Translated (translated version)"
]
title = "ENGLISH TO FRENCH TRANSLATOR"

In [23]:
max_characters= 25
total_translations= 10
for i, (eng, fre) in enumerate(zip(eng_sents[:total_translations], fre_sents)):
    rows_data = [[eng, fre, predict(eng)]]
    if i + 1 != total_translations:
      rows_data.append(["-" * max_characters, "-" * max_characters, "-" * max_characters ])
    tabulate_translations(columns_names, rows_data, title, max_characters)

+-------------------------------------------------------------------------------------------+
|                                ENGLISH TO FRENCH TRANSLATOR                               |
+-----------------------------+---------------------------+---------------------------------+
| English (real src sentence) | French (the actual text)  | Translated (translated version) |
+-----------------------------+---------------------------+---------------------------------+
| new jersey is sometimes     | new jersey est parfois    | new jersey est parfois calme en |
| quiet during autumn , and   | calme pendant l' automne  | l' mai parfois mai mai est en   |
| it is snowy in april .      | , et il est neigeux en    | en en en                        |
|                             | avril .                   |                                 |
| -------------------------   | ------------------------- | -------------------------       |
+-----------------------------+---------------------------+-

### Conclusion.

In this notebook we have learnt how to create a simple RNN that translates text from English to French. The translations are reasonable at the start of a sentence but tend to degrade towards the end, as the sample output above shows.

### Next
We are going to expand this and move from this simple RNN to a `GRU` with an `Embedding` layer.

* We are also going to use the `Sequential` API instead of the Functional API, which is what we used in this notebook.