### Simple `Encoder-Decoder` model.

In this notebook we combine the three previous notebooks by creating a simple model that translates sentences from Spanish to English.

### What I've done so far.
1. I've downloaded the data that we are going to work with from [here](http://www.manythings.org/anki/).
2. I've extracted the zipped file and uploaded it to my Google Drive so that we can easily work with it here in Google Colab.


### What are we going to do?
1. We are going to load the data and prepare it just like in the previous notebooks.
2. We are going to create 4 models, train them for the same number of epochs and compare the results.
3. This time around we want to split our data into the following sets:
  * train set
  * val set
  * test set
4. We are going to evaluate the models using the `test` set.

### Imports

In [1]:
from collections import Counter
import numpy as np
import helper, os, time

from tensorflow import keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tf.__version__

'2.5.0'

### Mounting the Google Drive.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Paths to the files

In [3]:
base_path = '/content/drive/MyDrive/NLP Data/seq2seq/spa-en'
file_name ="spa.txt"
os.path.exists(base_path)

True

### Data Loading and preparation.
Our file is named `spa.txt`. It is a huge file with the following structure:

```
Go.	Ve.	CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #4986655 (cueyayotl)
Go.	Vete.	CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #4986656 (cueyayotl)
Go.	Vaya.	CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #4986657 (cueyayotl)
....
```
As we can see, there is a lot of cleaning that needs to happen here. The language pairs are separated by tabs, and after the second tab there is some attribution text (garbage for our purposes) that we are not interested in. So we will simply ignore it.
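
As a minimal sketch of this cleaning step, the snippet below splits one line on tabs and keeps only the first two fields. The helper name `parse_pair` is hypothetical, and unlike the loading cell further down it accepts any line with at least two fields rather than exactly three.

```python
def parse_pair(line):
    """Return (english, spanish) from one tab-separated line, or None if malformed."""
    fields = line.split("\t")
    if len(fields) < 2:
        return None
    # Anything after the second tab (the attribution text) is ignored.
    english, spanish = fields[0], fields[1]
    return english, spanish


sample = "Go.\tVe.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #4986655 (cueyayotl)"
print(parse_pair(sample))  # ('Go.', 'Ve.')
```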







In [5]:

unclean_data = open(os.path.join(base_path, file_name),
                    encoding="utf8").read().split('\n')
print(f"Data Loaded, {len(unclean_data)} pairs found")

Data Loaded, 134737 pairs found


In [6]:
unclean_data[1]

'Go.\tVete.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #4986656 (cueyayotl)'

Next, we want to create a list of sentences for each of our languages, Spanish and English, from this huge unclean list.

In [10]:
spanish_data =[]
english_data =[]
for line in unclean_data:
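  try:
    # Each valid line has exactly three tab-separated fields.
    en, sp, _ = line.split('\t')
    spanish_data.append(sp)
    english_data.append(en)
  except ValueError:
    # Skip lines that do not have three fields (e.g. the trailing empty line).
    pass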

print(f"Spanish: {len(spanish_data)}")
print(f"English: {len(english_data)}")

for i, (e, s) in enumerate(zip(english_data[:10], spanish_data)):
  print(f"> {e} |> {s}")


Spanish: 134736
English: 134736
> Go. |> Ve.
> Go. |> Vete.
> Go. |> Vaya.
> Go. |> Váyase.
> Hi. |> Hola.
> Run! |> ¡Corre!
> Run! |> ¡Corran!
> Run! |> ¡Corra!
> Run! |> ¡Corred!
> Run. |> Corred.


We have loaded our data. Next we are going to split it into 3 sets: the train, validation and test sets. For that we are going to use my favourite function, `train_test_split` from `sklearn.model_selection`.

In [12]:
from sklearn.model_selection import train_test_split

In [19]:
eng_train, eng_val, spa_train, spa_val = train_test_split(
    english_data, spanish_data, random_state=42, test_size = .05
)
eng_train, eng_test, spa_train, spa_test = train_test_split(
    eng_train, spa_train, random_state=42, test_size = .005
)
len(eng_train), len(eng_val), len(spa_train), len(spa_val), len(eng_test)

(127359, 6737, 127359, 6737, 640)
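
The set sizes above follow from applying the two fractions one after the other. A minimal sketch of that arithmetic, assuming the held-out share is rounded up at each step (which reproduces the printed sizes); `split_sizes` is a hypothetical helper, not part of `sklearn`.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_kept, n_held_out) for one split, rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * test_fraction)
    return n_samples - n_held_out, n_held_out


total = 134736
remaining, n_val = split_sizes(total, 0.05)       # first split: validation set
n_train, n_test = split_sizes(remaining, 0.005)   # second split: test set
print(n_train, n_val, n_test)  # 127359 6737 640
```

Note that the second fraction is taken from what remains after the first split, so the test set is slightly less than 0.5% of the full data.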

In [20]:
eng_train[0], spa_train[0], eng_val[0]

('Do you have money?', '¿Tiene usted dinero?', 'Where do you think Tom is?')



### Next, Building the Vocabulary.

I'm going to use the `spacy` library, which is my favourite when it comes to tokenizing languages.

* We are only going to build the vocabulary on the train data, because we want the validation data to represent the test data as much as possible. In machine learning, a model should not see the test data during training, only at inference.
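
One consequence of building the vocabulary on the train data only is that held-out sentences can contain words the model has never seen. A minimal sketch of how that share could be measured, assuming plain whitespace tokenization instead of `spacy` and using hypothetical helper names and toy sentences:

```python
from collections import Counter


def build_vocab(sentences):
    """Collect the set of lowercased tokens seen in the given sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return set(counts)


def oov_rate(sentences, vocab):
    """Fraction of tokens in `sentences` that are missing from `vocab`."""
    tokens = [tok for s in sentences for tok in s.lower().split()]
    if not tokens:
        return 0.0
    unseen = sum(1 for tok in tokens if tok not in vocab)
    return unseen / len(tokens)


train = ["tom was impressed", "the wall is white"]
held_out = ["tom was happy", "the door is white"]
vocab = build_vocab(train)
print(oov_rate(held_out, vocab))  # 2 of the 7 held-out tokens are unseen
```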

In [21]:
import spacy
spacy.cli.download('es_core_news_sm')

spacy_es = spacy.load('es_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')


✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_sm')


In [22]:
def tokenize_es(sent):
  return [tok.text for tok in spacy_es.tokenizer(sent)]
  
def tokenize_en(sent):
  return [tok.text for tok in spacy_en.tokenizer(sent)]

In [23]:
en_counter = Counter()
es_counter = Counter()

for sent in eng_train:
  en_counter.update(tokenize_en(sent.lower()))
for sent in spa_train:
  es_counter.update(tokenize_es(sent.lower()))

In [25]:
en_vocab_size = len(en_counter)
es_vocab_size = len(es_counter)

es_vocab_size, en_vocab_size

(26986, 13642)

Here we have about `27K` unique tokens for the Spanish language and about `13.6K` unique tokens for the English language.

### Preprocessing.

We will convert our text data into sequences of integers. Basically, we are going to perform the following:

1. Tokenize the words into ids.
2. Pad the tokens so that all sequences have the same length.

For this task we are going to use the keras `Tokenizer` class.

We are going to have two tokenizers, one for each language.
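
To make the first step concrete, here is a simplified pure-Python imitation of what a word-level tokenizer does. It is only a sketch under stated assumptions: ids are handed out by frequency starting at `2`, with `0` reserved for padding and `1` for out-of-vocabulary words, and text is split on whitespace. The helper names are hypothetical.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # reserved ids (assumed convention for this sketch)


def build_word_index(sentences):
    """Map each word to an id, most frequent words first, starting at 2."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=2)}


def to_ids(sentence, word_index):
    """Convert a sentence to ids, using OOV_ID for unknown words."""
    return [word_index.get(word, OOV_ID) for word in sentence.lower().split()]


word_index = build_word_index(["tom was impressed", "tom was here"])
print(to_ids("tom was late", word_index))  # [2, 3, 1]
```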


In [26]:
en_tokenizer = Tokenizer(num_words=en_vocab_size, oov_token="<oov>")
en_tokenizer.fit_on_texts(eng_train)

es_tokenizer = Tokenizer(num_words=es_vocab_size, oov_token="<oov>")
es_tokenizer.fit_on_texts(spa_train)

In [28]:
en_word_indices = en_tokenizer.word_index
en_word_indices_reversed = dict([
    (v, k) for (k, v) in en_word_indices.items()
])

es_word_indices = es_tokenizer.word_index
es_word_indices_reversed = dict([
    (v, k) for (k, v) in es_word_indices.items()
])

### Helper functions
We will create some helper functions that convert sequences to text and text to sequences for each language. These functions will be used for inference later on.

**We have set the out-of-vocabulary token (`oov_token="<oov>"`), which gets the index `1`. This means that any word that does not exist in the vocabulary has the integer representation `1`.**

In [35]:
def en_seq_to_text(sequences):
  return " ".join(en_word_indices_reversed[i] for i in sequences )

def es_seq_to_text(sequences):
  return " ".join(es_word_indices_reversed[i] for i in sequences )

def en_text_to_seq(sent):
  # Words missing from the vocabulary map to 1, the index of the <oov> token.
  words = tokenize_en(sent.lower())
  return [en_word_indices.get(word, 1) for word in words]

def es_text_to_seq(sent):
  # Words missing from the vocabulary map to 1, the index of the <oov> token.
  words = tokenize_es(sent.lower())
  return [es_word_indices.get(word, 1) for word in words]

### Converting text to sequences

Unlike the previous notebooks, where we had only one set, this time around we have three sets. So we are going to create the sequences for all three sets in the following code cell.

In [30]:
en_sequences_train = en_tokenizer.texts_to_sequences(eng_train)
es_sequences_train = es_tokenizer.texts_to_sequences(spa_train)

en_sequences_test = en_tokenizer.texts_to_sequences(eng_test)
es_sequences_test = es_tokenizer.texts_to_sequences(spa_test)

en_sequences_val = en_tokenizer.texts_to_sequences(eng_val)
es_sequences_val = es_tokenizer.texts_to_sequences(spa_val)


In [33]:
en_sequences_test[0:4], es_sequences_test[:4]

([[70, 48, 5, 492, 53, 2, 104],
  [6, 14, 2012],
  [138, 1116, 8, 7, 1592, 459],
  [2, 802, 8, 579, 31, 2, 890, 35, 1007, 31, 2, 597]],
 [[72, 32, 7500, 52, 8, 85],
  [6, 44, 3424],
  [328, 1237, 10, 16, 239, 1696],
  [7, 1090, 10, 1612, 17, 703, 29, 1178, 17, 196]])

### Testing our helper functions.

In [36]:
for en, es in zip(en_sequences_test[:4], es_sequences_test[:4]):
  print(f"> English: {en_seq_to_text(en)} |> Spanish: {es_seq_to_text(es)}")


> English: why did you spend all the money |> Spanish: ¿por qué gastaste todo el dinero
> English: tom was impressed |> Spanish: tom estaba impresionado
> English: new york is a huge city |> Spanish: nueva york es una ciudad enorme
> English: the wall is white on the inside and green on the outside |> Spanish: la pared es blanca por dentro y verde por fuera


### Padding Sequences.

In our case we are going to assume that the longest sentence has `50` words for both `es` and `en` languages.

We are going to pad all the sets.
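
Before using the keras helper, it may help to see what `padding="post"` and `truncating="post"` mean for a single sequence. The function below is a hypothetical pure-Python sketch of that behaviour, not the keras implementation: short sequences get zeros appended, long ones lose their tail.

```python
def pad_post(sequence, maxlen, pad_id=0):
    """Cut the sequence to `maxlen` from the end, then append pad ids up to `maxlen`."""
    trimmed = list(sequence[:maxlen])
    return trimmed + [pad_id] * (maxlen - len(trimmed))


print(pad_post([15, 5, 18, 104], 6))       # [15, 5, 18, 104, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6, 7], 6))  # [1, 2, 3, 4, 5, 6]
```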

In [37]:
max_words = 50

# Train data
en_tokens_padded_train = pad_sequences(
    en_sequences_train, 
    maxlen=max_words, 
    padding="post", 
    truncating="post"
)
es_tokens_padded_train = pad_sequences(
    es_sequences_train, 
    maxlen=max_words, 
    padding="post", 
    truncating="post"
)

# Validation data

en_tokens_padded_val = pad_sequences(
    en_sequences_val, 
    maxlen=max_words, 
    padding="post", 
    truncating="post"
)
es_tokens_padded_val = pad_sequences(
    es_sequences_val, 
    maxlen=max_words, 
    padding="post", 
    truncating="post"
)

# Test data
en_tokens_padded_test = pad_sequences(
    en_sequences_test, 
    maxlen=max_words, 
    padding="post", 
    truncating="post"
)
es_tokens_padded_test = pad_sequences(
    es_sequences_test, 
    maxlen=max_words, 
    padding="post", 
    truncating="post"
)

In [38]:
en_tokens_padded_train[:2]

array([[ 15,   5,  18, 104,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  3,  18,   7, 412, 561,   4, 144,   5,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0]],
      dtype=int32)

### Logits to text.

We are going to create one more helper function that takes the logits (the prediction probabilities) and converts them to a human-readable format.

In [39]:
def logits_to_text(logits, tokenizer):
  """
  Convert per-position predictions to text, ignoring the pad token.
  """
  index_to_words = {index: word for word, index
                    in tokenizer.word_index.items()}
  index_to_words[0] = '<pad>'
  return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)]).replace("<pad>", "")
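
To see the idea without a trained model, here is a small pure-Python sketch of the same greedy decoding on made-up probabilities. The name `greedy_decode` and the toy vocabulary are assumptions for illustration. Unlike `logits_to_text`, this version skips pad positions entirely instead of replacing them with empty strings.

```python
def greedy_decode(probabilities, index_to_word, pad_id=0):
    """Pick the most likely id at each position and join the non-pad words."""
    words = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over the vocabulary
        if best != pad_id:
            words.append(index_to_word[best])
    return " ".join(words)


index_to_word = {1: "<oov>", 2: "tom", 3: "was", 4: "impressed"}
probabilities = [
    [0.1, 0.0, 0.7, 0.1, 0.1],  # position 0 -> id 2
    [0.1, 0.0, 0.1, 0.7, 0.1],  # position 1 -> id 3
    [0.1, 0.0, 0.1, 0.1, 0.7],  # position 2 -> id 4
    [0.9, 0.0, 0.0, 0.1, 0.0],  # position 3 -> id 0 (pad)
]
print(greedy_decode(probabilities, index_to_word))  # tom was impressed
```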


### Models.
As I said, we are going to create 4 different models and evaluate them separately. These models will be:
1. Simple RNN
2. GRU Model with Embedding
3. LSTM Model with Embedding and Bidirectional layers
4. Simple Encoder-Decoder Model

### 1. Simple RNN

In [40]:
rnn_model = keras.Sequential([
      keras.layers.Input(shape=(max_words, 1)),
      keras.layers.GRU(128, return_sequences=True),
      keras.layers.TimeDistributed(
          keras.layers.Dense(en_vocab_size, activation="softmax")
      )
], name="simple_rnn")
rnn_model.summary()

Model: "simple_rnn"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru (GRU)                    (None, 50, 128)           50304     
_________________________________________________________________
time_distributed (TimeDistri (None, 50, 13642)         1759818   
Total params: 1,810,122
Trainable params: 1,810,122
Non-trainable params: 0
_________________________________________________________________
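
The parameter counts in the summary above can be checked by hand. A minimal sketch, assuming the TensorFlow 2 default GRU variant (`reset_after=True`, which keeps two bias vectors per weight set); the helper names `gru_params` and `dense_params` are hypothetical.

```python
def gru_params(input_dim, units):
    """Parameters of a GRU layer with reset_after=True (assumed default).

    Three weight sets (update gate, reset gate, candidate state), each with an
    input kernel, a recurrent kernel and two bias vectors.
    """
    return 3 * (units * (input_dim + units) + 2 * units)


def dense_params(input_dim, units):
    """Parameters of a Dense layer: one weight per input plus one bias, per unit."""
    return (input_dim + 1) * units


print(gru_params(1, 128))        # 50304
print(dense_params(128, 13642))  # 1759818
```

The same formula gives `99072` for a 128-unit GRU fed with 128-dimensional embeddings, which matches the later summaries.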


In [43]:
src_train = es_tokens_padded_train.reshape(-1, max_words, 1)
src_test = es_tokens_padded_test.reshape(-1, max_words, 1)
src_val = es_tokens_padded_val.reshape(-1, max_words, 1)
src_train.shape

(127359, 50, 1)

### Hyper parameters

In [45]:
BATCH_SIZE = 128
EPOCHS = 15
VALIDATION_DATA = (src_val, en_tokens_padded_val)
VALIDATION_BATCH_SIZE = 64

In [46]:
rnn_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=keras.optimizers.Adam(),
    metrics=['accuracy']
)

rnn_model.fit(
    src_train, 
    en_tokens_padded_train, 
    batch_size=BATCH_SIZE, 
    epochs=EPOCHS,
    validation_data=VALIDATION_DATA,
    validation_batch_size = VALIDATION_BATCH_SIZE
)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7fec30032e90>

### Evaluation

In [47]:
rnn_model.evaluate(
    src_test,
    en_tokens_padded_test,
    batch_size=VALIDATION_BATCH_SIZE,
    verbose=1,
)



[0.672109842300415, 0.8937812447547913]

### GRU and Embedding.

In [48]:
gru_embedding_model = keras.Sequential([
    keras.layers.Embedding(
        es_vocab_size,
        128, 
        input_length=max_words
    ),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(256, return_sequences=True),
    keras.layers.GRU(512, return_sequences=True),
    keras.layers.Dense(1024, activation="relu"),
    keras.layers.Dense(en_vocab_size, activation="softmax")
])

gru_embedding_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 128)           3454208   
_________________________________________________________________
gru_1 (GRU)                  (None, 50, 128)           99072     
_________________________________________________________________
gru_2 (GRU)                  (None, 50, 256)           296448    
_________________________________________________________________
gru_3 (GRU)                  (None, 50, 512)           1182720   
_________________________________________________________________
dense_1 (Dense)              (None, 50, 1024)          525312    
_________________________________________________________________
dense_2 (Dense)              (None, 50, 13642)         13983050  
Total params: 19,540,810
Trainable params: 19,540,810
Non-trainable params: 0
____________________________________________

In [49]:
gru_embedding_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=keras.optimizers.Adam(),
    metrics=['accuracy']
)

gru_embedding_model.fit(
    src_train, 
    en_tokens_padded_train, 
    batch_size=BATCH_SIZE, 
    epochs=EPOCHS,
    validation_data=VALIDATION_DATA,
    validation_batch_size = VALIDATION_BATCH_SIZE
)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7febe22d89d0>

### Model evaluation

In [50]:
gru_embedding_model.evaluate(
    src_test,
    en_tokens_padded_test,
    batch_size=VALIDATION_BATCH_SIZE,
    verbose=1,
)



[0.5223050713539124, 0.926031231880188]

### LSTM Bidirectional and Embedding.

In [51]:
forward_layer = keras.layers.LSTM(128, dropout=.5,
                                    return_sequences=True,
                                  go_backwards=False
                                    )
backward_layer = keras.layers.LSTM(128, dropout=.5,
                                    return_sequences=True,
                                  go_backwards=True
                                    )
bidirectinal_lstm_model = keras.Sequential([
    keras.layers.Embedding(
        es_vocab_size,
        128, 
        input_length=max_words
    ),
  keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=False)),
   keras.layers.RepeatVector(max_words),
  keras.layers.Bidirectional(forward_layer, backward_layer=backward_layer),
  keras.layers.TimeDistributed(keras.layers.Dense(en_vocab_size, activation='softmax'))
    
])
bidirectinal_lstm_model.summary()


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 128)           3454208   
_________________________________________________________________
bidirectional (Bidirectional (None, 256)               263168    
_________________________________________________________________
repeat_vector (RepeatVector) (None, 50, 256)           0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 50, 256)           394240    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 50, 13642)         3505994   
Total params: 7,617,610
Trainable params: 7,617,610
Non-trainable params: 0
_________________________________________________________________


In [52]:
bidirectinal_lstm_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=keras.optimizers.Adam(),
    metrics=['accuracy']
)

bidirectinal_lstm_model.fit(
    src_train, 
    en_tokens_padded_train, 
    batch_size=BATCH_SIZE, 
    epochs=EPOCHS,
    validation_data=VALIDATION_DATA,
    validation_batch_size = VALIDATION_BATCH_SIZE
)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7feb52bb5050>

### Model Evaluation

In [53]:
bidirectinal_lstm_model.evaluate(
    src_test,
    en_tokens_padded_test,
    batch_size=VALIDATION_BATCH_SIZE,
    verbose=1,
)



[0.3661833703517914, 0.9268437623977661]

### Encoder Decoder Model.
The following cell shows how we can create our very simple encoder decoder model using the sequential API.


In [54]:
encoder_decoder_model = keras.Sequential([
    keras.layers.Embedding(
        es_vocab_size,
        128, 
        input_length=max_words
    ),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128, return_sequences=False),
    keras.layers.RepeatVector(max_words),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(
        en_vocab_size, activation= "softmax"
    ))   
], name="encoder_decoder_model")

encoder_decoder_model.summary()

Model: "encoder_decoder_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 50, 128)           3454208   
_________________________________________________________________
gru_4 (GRU)                  (None, 50, 128)           99072     
_________________________________________________________________
gru_5 (GRU)                  (None, 128)               99072     
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 50, 128)           0         
_________________________________________________________________
gru_6 (GRU)                  (None, 50, 128)           99072     
_________________________________________________________________
gru_7 (GRU)                  (None, 50, 128)           99072     
_________________________________________________________________
time_distributed_2 (TimeDist (None, 50, 13642

In [55]:
encoder_decoder_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=keras.optimizers.Adam(),
    metrics=['accuracy']
)

encoder_decoder_model.fit(
    src_train, 
    en_tokens_padded_train, 
    batch_size=BATCH_SIZE, 
    epochs=EPOCHS,
    validation_data=VALIDATION_DATA,
    validation_batch_size = VALIDATION_BATCH_SIZE
)


Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7feb4bdb9dd0>

### Model Evaluation

In [56]:
encoder_decoder_model.evaluate(
    src_test,
    en_tokens_padded_test,
    batch_size=VALIDATION_BATCH_SIZE,
    verbose=1,
)



[0.4094642102718353, 0.9252812266349792]

As we can see, our models are producing a similar accuracy of `~93%` on the test set, except for the first one, which reaches `~89%`.
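
One caveat when reading these numbers: every target is padded to `50` tokens and no masking is configured on the models above, so the `accuracy` metric most likely counts the padded positions too. The sketch below, with a hypothetical `token_accuracy` helper and made-up predictions, shows how much that can change the figure.

```python
def token_accuracy(y_true, y_pred, pad_id=0, ignore_padding=False):
    """Share of positions predicted correctly, optionally skipping padded targets."""
    correct = total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if ignore_padding and t == pad_id:
                continue
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0


y_true = [[6, 14, 2012, 0, 0, 0, 0, 0, 0, 0]]
y_pred = [[6, 14, 14, 0, 0, 0, 0, 0, 0, 0]]
print(token_accuracy(y_true, y_pred))                       # 0.9
print(token_accuracy(y_true, y_pred, ignore_padding=True))  # 2 of 3 real tokens correct
```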

Next let's make some predictions using our models.


### Making some predictions.
Our model is targeting to predict English words. In the predict function we are going to do the following:

1. Get the sequence of the Spanish sentence.
2. Pad the Spanish sequence and pass it to the model.
3. Reshape the logits output to the shape of `(max_words, trg_vocab_size)`, where the target vocabulary is English.
4. Call the `logits_to_text` function and pass `en_tokenizer` as the tokenizer.
5. Get the predictions.

In [62]:
def predict(sent, model):
  sequences = es_text_to_seq(sent)
  padded_tokens = pad_sequences([sequences], maxlen=max_words, padding="post", truncating="post")
  logits = model(padded_tokens)
  logits = tf.reshape(logits, (max_words, -1))
  return logits_to_text(logits, en_tokenizer)
predict(spa_test[1], bidirectinal_lstm_model)

'tom was impressed impressed                                              '

In [68]:
predict("hola", gru_embedding_model)

'hi                                                 '
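
The predictions above carry long runs of spaces where the `<pad>` tokens were removed. A minimal sketch of a clean-up step we could apply to the returned string; `clean_prediction` is a hypothetical helper, not part of the notebook.

```python
def clean_prediction(text):
    """Collapse repeated whitespace and strip leading and trailing spaces."""
    return " ".join(text.split())


print(repr(clean_prediction("tom was impressed impressed        ")))
# 'tom was impressed impressed'
```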

### Making more predictions with different models.


In [64]:
from prettytable import PrettyTable
def tabulate_translations(column_names, data, title, max_characters=25):
  table = PrettyTable(column_names)
  table.title= title
  table.align[column_names[0]] = 'l'
  table.align[column_names[1]] = 'l'
  table.align[column_names[2]] = 'l'
  table._max_width = {column_names[0] :max_characters, column_names[1] :max_characters, column_names[2]:max_characters}
  for row in data:
    table.add_row(row)
  print(table)
columns_names = [
    "Spanish (real src sentence)", "English (the actual text)", "Translated (translated version)", "MODEL USED"
]
title = "SPANISH TO ENGLISH TRANSLATOR"

In [67]:
max_characters= 25
total_translations= 10
for i, (eng, spa) in enumerate(zip(eng_test[:total_translations], spa_test)):
    rows_data = [
                 [spa, eng, predict(spa, gru_embedding_model), "GRU Embedding model"],
                 [spa, eng, predict(spa, bidirectinal_lstm_model), "Bidirectional LSTM model"],
                 [spa, eng, predict(spa, encoder_decoder_model), "Encoder Decoder model"],
                ]
    tabulate_translations(columns_names, rows_data, title, max_characters)

+----------------------------------------------------------------------------------------------------------------------+
|                                            SPANISH TO ENGLISH TRANSLATOR                                             |
+-----------------------------+---------------------------+---------------------------------+--------------------------+
| Spanish (real src sentence) | English (the actual text) | Translated (translated version) |        MODEL USED        |
+-----------------------------+---------------------------+---------------------------------+--------------------------+
| ¿Por qué gastaste todo el   | Why did you spend all the | we me why we spend all all all  |   GRU Embedding model    |
| dinero?                     | money?                    | of                              |                          |
| ¿Por qué gastaste todo el   | Why did you spend all the | all all all all all all all     | Bidirectional LSTM model |
| dinero?                     | 

### Conclusion.
In this notebook we have covered much, and we observed that the GRU model performed better than the other models during prediction.