# Baseline Text-to-Text Translation: French to English

This notebook trains a sequence-to-sequence (seq2seq) model that translates French sentences into English. This model will be our **baseline** model, which we will then improve upon by adding attention and other features.

---

## Import Required Libraries

We will start by importing the libraries we need for this project. You can install any missing libraries from the provided ``requirements.txt`` file or by running ``make install`` in a terminal.

In [1]:
%load_ext autoreload
%aimport utils.text_processing
%autoreload 1

In [2]:
from utils.text_processing import TextProcessor

import numpy as np
import pandas as pd
import random

from keras.models import Model
from keras.layers import Input, Dense, LSTM, Embedding, Bidirectional, RepeatVector, TimeDistributed

from tensorflow.keras.preprocessing.text import Tokenizer 
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
from keras import optimizers

from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_colwidth', 200)


### Verify access to the GPU
The following check applies only if you expect to use a GPU, e.g., when running in a cloud environment with GPU support. Run the next cell and verify that the printed list contains a device whose ``device_type`` is ``"GPU"``; an empty list means TensorFlow did not detect one.

In [3]:
import tensorflow as tf
print("cuda available: ", tf.config.list_physical_devices('GPU'))

cuda available:  []


W0000 00:00:1747511702.002262   67179 gpu_device.cc:2341] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...


We provide an in-depth analysis of the data in the ``exploratory_analysis.ipynb`` notebook, so we will not repeat any exploratory analysis here. Instead, we will focus on building our baseline model. Let's start by loading the dataset we will be using.

In [4]:
dataset = pd.read_csv('./data/cleaned/fr_en_processed_data.csv')
dataset.head(10)

Unnamed: 0,fr,en
0,<start>il a pas pu faire grand chose qui puisse choquer cassie<end>,<start>he could not do much that could shock cassie<end>
1,<start>jpense juste qu'il aime pas qu'on s'introduise dans sa tete<end>,<start>i just think he does not like us getting into his head<end>
2,<start>par contre pour que sunny ne veulent pas que cassie voit c'est souvenir des derniere annee il a vraiment du faire des dinguerie<end>,<start>on the other hand so that sunny does not want cassie to see it is memory of the last year he really had to do some crazy things<end>
3,<start>c'est pas simplement du a son effacement du destin ils devraient etre en mesure de faire le lien mais si l'autre piaf a des bouts de pouvoir du demon de l'oubli ca expliquerait tout<end>,<start>it is not just because of its erasure of destiny they should be able to make the connection but if the other piaf has some bits of power of the demon of oblivion that would explain everythi...
4,<start>psq la le fait qu'ils oublient a chaque fois<end>,<start>psq the fact that they forget every time<end>
5,<start>possible<end>,<start>possible<end>
6,<start>il aurait pas voler des pouvoirs au demon de l'oubli<end>,<start>he would not steal powers from the demon of oblivion<end>
7,<start>le piaf de merde la<end>,<start>the shitty piaf there<end>
8,<start>par contre dcp<end>,<start>but dcp<end>
9,<start>elle sait juste que c'est lui l'anomalie<end>,<start>she just knows he is the anomaly<end>


The full dataset contains over 350,000 sentence pairs. However, to speed up training in this notebook, we will only use a small portion of the data.

In [5]:
# TODO : Use the whole dataset (but it's too big for my computer)
# dataset = dataset.head(n=50000)
print(dataset.shape)

(11187, 2)


## Text Pre-Processing

The text pre-processing steps will be implemented in a class called ``TextProcessor``. This class will be used to clean and tokenize the text data, to convert the text to sequences, and to pad the sequences to a maximum length. This way we will be able to improve our models without having to copy and paste the same code over and over again.

In [6]:
max_sequence_length = 20

In [23]:
# REMOVE <start> and <end> tokens
dataset['fr'] = dataset['fr'].apply(lambda x: x.replace('<start>', '').replace('<end>', '').strip())
dataset['en'] = dataset['en'].apply(lambda x: x.replace('<start>', '').replace('<end>', '').strip())
dataset

Unnamed: 0,fr,en
0,il a pas pu faire grand chose qui puisse choquer cassie,he could not do much that could shock cassie
1,jpense juste qu'il aime pas qu'on s'introduise dans sa tete,i just think he does not like us getting into his head
2,par contre pour que sunny ne veulent pas que cassie voit c'est souvenir des derniere annee il a vraiment du,on the other hand so that sunny does not want cassie to see it is memory of the last year
3,c'est pas simplement du a son effacement du destin ils devraient etre en mesure de faire le lien mais si,it is not just because of its erasure of destiny they should be able to make the connection but if
4,psq la le fait qu'ils oublient a chaque fois,psq the fact that they forget every time
...,...,...
11182,genre une cohorte soude de saints qui ressorte d'un cauchemar surnaturel avec des memoires de fou furieux etc,like a soda cohort of saints that emerges from a supernatural nightmare with crazy memories etc
11183,jpense pas mais comme la famille de valor,i do not think but like the valor family
11184,le post eme cauchemard va etre fou moi je le dis,the post-eme nightmare will be crazy me i say
11185,et qu'ils soient enfin vraiment acteur de l'histoire,and they are finally really an actor in history


In [8]:
# truncate the sentences to the max_sequence_length
dataset['en'] = dataset['en'].apply(lambda x: ' '.join(x.split()[:max_sequence_length]))
dataset['fr'] = dataset['fr'].apply(lambda x: ' '.join(x.split()[:max_sequence_length]))

dataset.head(10)

Unnamed: 0,fr,en
0,il a pas pu faire grand chose qui puisse choquer cassie,he could not do much that could shock cassie
1,jpense juste qu'il aime pas qu'on s'introduise dans sa tete,i just think he does not like us getting into his head
2,par contre pour que sunny ne veulent pas que cassie voit c'est souvenir des derniere annee il a vraiment du,on the other hand so that sunny does not want cassie to see it is memory of the last year
3,c'est pas simplement du a son effacement du destin ils devraient etre en mesure de faire le lien mais si,it is not just because of its erasure of destiny they should be able to make the connection but if
4,psq la le fait qu'ils oublient a chaque fois,psq the fact that they forget every time
5,possible,possible
6,il aurait pas voler des pouvoirs au demon de l'oubli,he would not steal powers from the demon of oblivion
7,le piaf de merde la,the shitty piaf there
8,par contre dcp,but dcp
9,elle sait juste que c'est lui l'anomalie,she just knows he is the anomaly


### Text to Sequence Conversion

To feed our data to a Seq2Seq model, we will have to convert both the input and the output sentences into integer sequences of fixed length. Check the exploratory data analysis notebook to see the distribution of the lengths of the sentences in the dataset. Based on that, we decided to fix the maximum length of each sentence to 20 since the average length of the sentences in the dataset is around 20.

We will use the ``Tokenizer`` class from the ``tensorflow.keras.preprocessing.text`` module to tokenize the text data and to convert the text to integer sequences. We will then use the ``pad_sequences`` function from the ``tensorflow.keras.preprocessing.sequence`` module to pad the sequences to the maximum length.
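
To make these two steps concrete, here is a minimal pure-Python sketch of what tokenizing and post-padding amount to. It is a simplified stand-in written for illustration, not the Keras implementation: ``build_word_index`` and ``encode_and_pad`` are hypothetical helper names, and details such as the ``num_words`` cap and custom ``filters`` are left out.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer ID, most frequent word first (IDs start at 1)."""
    counts = Counter(word for line in lines for word in line.lower().split())
    # ID 0 is left unused so that it can serve as the padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(word_index, lines, length):
    """Convert sentences to ID lists, then truncate or right-pad them with 0 to `length`."""
    encoded = []
    for line in lines:
        ids = [word_index[w] for w in line.lower().split() if w in word_index]
        ids = ids[:length]                    # drop tokens beyond `length`
        ids += [0] * (length - len(ids))      # pad at the end with 0
        encoded.append(ids)
    return encoded

sentences = ["he is the anomaly", "the demon of oblivion", "possible"]
word_index = build_word_index(sentences)
padded = encode_and_pad(word_index, sentences, length=5)
for row in padded:
    print(row)
```

Every row has the same length, which is what allows the sentences to be stacked into a single array for training.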

In [9]:
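def tokenization(lines, max_vocab_size=5000):
    """Fit a word-level tokenizer on `lines`; `max_vocab_size` caps the IDs used when encoding."""
    # filters=' ' keeps punctuation inside tokens instead of applying Keras' default punctuation filter
    tokenizer = Tokenizer(filters=' ', num_words=max_vocab_size)
    tokenizer.fit_on_texts(lines)
    return tokenizer

def encode_sequences(tokenizer, length, lines):
    """Convert `lines` to integer sequences, padded or truncated at the end to `length`."""
    seq = tokenizer.texts_to_sequences(lines)
    seq = pad_sequences(seq, maxlen=length, padding='post', truncating='post')
    return seq

def decode_sequences(tokenizer, sequence):
    """Convert one integer sequence back to a text string."""
    text = tokenizer.sequences_to_texts([sequence])[0].replace('PAD', '').strip()
    return text

def get_most_common_words(tokenizer, n=10):
    """Return the `n` most frequent (word, count) pairs seen by the tokenizer."""
    word_counts = sorted(tokenizer.word_counts.items(), key=lambda x: x[1], reverse=True)
    return word_counts[:n]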

In [10]:
# Tokenize the English sentences
eng_tokenizer = tokenization(dataset["en"])
eng_vocab_size = len(eng_tokenizer.word_index) + 1

# Tokenize the French sentences
fr_tokenizer = tokenization(dataset["fr"])
fr_vocab_size = len(fr_tokenizer.word_index) + 1

In [11]:
print('English Vocabulary Size: %d' % eng_vocab_size)
print('French Vocabulary Size: %d' % fr_vocab_size)

English Vocabulary Size: 6503
French Vocabulary Size: 8222
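
Note that these sizes count every distinct word seen while fitting. To our knowledge, the Keras ``Tokenizer`` keeps all words in ``word_index`` even when ``num_words`` is set, and applies the cap only when texts are converted to sequences, so the IDs that actually reach the model stay below ``num_words``. The sketch below works out the resulting effective vocabulary size; ``effective_vocab_size`` is a helper name introduced here for illustration.

```python
def effective_vocab_size(word_index_size, num_words=None):
    """Number of distinct IDs (including the padding ID 0) that can appear in encoded sequences."""
    full_size = word_index_size + 1       # +1 for the reserved padding ID 0
    if num_words is None:
        return full_size
    return min(full_size, num_words)      # assumption: IDs >= num_words are dropped when encoding

# The sizes printed above are len(word_index) + 1, so subtract 1 to recover len(word_index).
eng_effective = effective_vocab_size(6503 - 1, num_words=5000)
fr_effective = effective_vocab_size(8222 - 1, num_words=5000)
print(eng_effective, fr_effective)
```

Under that assumption, an output layer with 5,000 units covers every English ID the tokenizer can emit, even though the printed vocabulary size is larger.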


In [12]:
print("Most common words in English: ", get_most_common_words(eng_tokenizer))
print("Most common words in French: ", get_most_common_words(fr_tokenizer))

Most common words in English:  [('the', 6224), ('is', 3502), ('i', 3180), ('it', 3036), ('to', 2585), ('of', 2306), ('a', 2271), ('not', 2211), ('that', 2003), ('he', 1722)]
Most common words in French:  [('de', 3248), ('le', 2808), ('a', 2658), ('pas', 2262), ('la', 2193), ('que', 2076), ('je', 1799), ('il', 1649), ('ca', 1387), ('un', 1379)]


## Model Building

We will now split the data into training and test sets for model training and evaluation, respectively. We will use the ``train_test_split`` function from the ``sklearn.model_selection`` module to split the data. We will use 20% of the data for testing and the rest for training. We will also set the ``random_state`` parameter to 42 to ensure reproducibility.

In [13]:
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)

It's time to encode the sentences. We will encode French sentences as the input sequences and English sentences as the target sequences. This will be done for both the train and test datasets.

In [14]:
# prepare training data
trainX = encode_sequences(fr_tokenizer, max_sequence_length, train_data["fr"])
trainY = encode_sequences(eng_tokenizer, max_sequence_length, train_data["en"])

# prepare test data
testX = encode_sequences(fr_tokenizer, max_sequence_length, test_data["fr"])
testY = encode_sequences(eng_tokenizer, max_sequence_length, test_data["en"])

In [15]:
trainX.shape, trainY.shape, testX.shape, testY.shape

((8949, 20), (8949, 20), (2238, 20), (2238, 20))
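
These shapes follow from the 11,187 rows in the dataset and the ``test_size=0.2`` passed to ``train_test_split``. The sketch below reproduces the arithmetic, assuming scikit-learn rounds the test share up and gives the remaining rows to the training set (our understanding of its behaviour when only ``test_size`` is given).

```python
import math

n_samples = 11187
test_size = 0.2

n_test = math.ceil(n_samples * test_size)   # test share is rounded up
n_train = n_samples - n_test                # the rest goes to training
print(n_train, n_test)
```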

Now comes the fun part: building the model. We will build a simple Seq2Seq model for text-to-text translation.
The baseline is deliberately kept simple:

- The input is the padded sequence of French token IDs, fed as a single numeric feature per timestep (there is no Embedding layer yet).
- An LSTM layer with ``return_sequences=True`` processes the sequence and emits one hidden state per timestep.
- A ``TimeDistributed`` Dense layer with a softmax activation turns each hidden state into a probability distribution over the output vocabulary, enabling text generation.

In [16]:
def build_model(input_shape, output_sequence_length, input_vocab_size, output_vocab_size, units=126):
    # output_sequence_length and input_vocab_size are currently unused by this baseline
    # input_shape includes the batch dimension, so drop it: each sample is (input length x 1)
    source_input = Input(shape=input_shape[1:], name="input_layer")

    # return_sequences=True yields one hidden state per timestep
    x = LSTM(units, return_sequences=True, activation="tanh", name="LSTM_layer")(source_input)
    # softmax over the output vocabulary at every timestep
    preds = TimeDistributed(Dense(output_vocab_size, activation="softmax"), name="Dense_layer")(x)

    model = Model(inputs=source_input, outputs=preds, name='simple_seq2seq_model')
    return model
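
For orientation, the tensor shapes and trainable parameter counts of ``build_model`` can be worked out by hand. The sketch below assumes the standard Keras LSTM layout (four gates, each with an input kernel, a recurrent kernel and a bias) and the values used in this notebook: sequences of length 20 with one feature per timestep, the default 126 units, and 5,000 output classes. ``lstm_params`` and ``dense_params`` are helper names introduced here.

```python
def lstm_params(input_dim, units):
    """Trainable weights of an LSTM layer: 4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, output_dim):
    """Trainable weights of a Dense layer: kernel plus bias."""
    return (input_dim + 1) * output_dim

seq_len, n_features, units, n_classes = 20, 1, 126, 5000

shapes = [
    ("input", (seq_len, n_features)),
    ("LSTM", (seq_len, units)),
    ("TimeDistributed(Dense)", (seq_len, n_classes)),
]
for name, shape in shapes:
    print(f"{name:<24} (batch, {shape[0]}, {shape[1]})")

total = lstm_params(n_features, units) + dense_params(units, n_classes)
print("trainable parameters:", total)
```

Almost all of the weights sit in the output layer, which is typical when the vocabulary is much larger than the hidden state.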

<img src="../images/rnn.png"
    alt="rnn"
    style="text-align: center;" />
<br />

We reshape the ``trainX`` and ``trainY`` to be 3-dimensional tensors to be used in the model. The first dimension represents the number of samples (or sentences), the second represents the length of each sequence, and the third represents the number of features in each sequence. We will use the ``trainX`` and ``trainY`` to train the model. We will use the ``testX`` and ``testY`` to evaluate the model.

In [17]:
trainX = trainX.reshape((-1, max_sequence_length, 1))
trainY = trainY.reshape((trainY.shape[0], trainY.shape[1], 1))

testX = testX.reshape((-1, max_sequence_length, 1))
testY = testY.reshape((testY.shape[0], testY.shape[1], 1))

We are using RMSprop optimizer in this model as it is usually a good choice for recurrent neural networks. We will experiment with other optimizers in the next notebook.

We will use the ``sparse_categorical_crossentropy`` loss since we have used integers to encode the target sequences. 

In [18]:
model = build_model(trainX.shape, max_sequence_length, eng_vocab_size, 5000)

rms = optimizers.RMSprop(learning_rate=0.0001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.summary()

Note that we have used **sparse_categorical_crossentropy** as the loss function because it allows us to use the integer-encoded target sequences as they are instead of a one-hot encoded format. One-hot encoding the target sequences with such a large vocabulary could consume our system's entire memory.
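
A rough back-of-the-envelope sketch shows the scale of the saving and what the loss measures. It assumes 4-byte values (``int32`` IDs, ``float32`` one-hot entries), the training shapes reported above, and the 5,000 output classes passed to ``build_model``; ``sparse_cce`` is an illustrative helper, not the Keras function.

```python
import math

n_samples, seq_len, n_classes = 8949, 20, 5000
bytes_per_value = 4  # assumption: int32 IDs and float32 one-hot entries

sparse_bytes = n_samples * seq_len * bytes_per_value
one_hot_bytes = n_samples * seq_len * n_classes * bytes_per_value
print(f"integer targets: {sparse_bytes / 1e6:.2f} MB")
print(f"one-hot targets: {one_hot_bytes / 1e9:.2f} GB")

def sparse_cce(probs, target_id):
    """Cross-entropy for one timestep: minus the log-probability of the true class."""
    return -math.log(probs[target_id])

# A model that spreads probability uniformly over all classes scores ln(n_classes).
uniform = [1.0 / n_classes] * n_classes
print(f"loss of a uniform guess: {sparse_cce(uniform, target_id=42):.3f}")
```

The uniform-guess value is in the same range as the loss Keras reports during the first epoch of training, as one would expect from an untrained model.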

It seems we are all set to start training our model. We will train it for **20 epochs** with a **batch size of 64**, holding out 20% of the training data for validation. We will also experiment with the hyperparameters in the next notebook.
We will also use **ModelCheckpoint()** to save the model with the lowest validation loss.
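
As a sanity check on the progress bars Keras prints, the number of steps per epoch can be derived from the arguments passed to ``model.fit`` in the next cell (``batch_size=64``, ``validation_split=0.2``) and the 8,949 training rows. The sketch assumes Keras keeps the first ``floor(n * (1 - validation_split))`` rows for fitting and holds out the rest, which is our understanding of ``validation_split``.

```python
import math

n_train_rows = 8949       # rows in trainX
validation_split = 0.2
batch_size = 64

# Assumption: the held-out validation rows are taken from the end of the array.
n_fit = math.floor(n_train_rows * (1 - validation_split))
n_val = n_train_rows - n_fit
steps_per_epoch = math.ceil(n_fit / batch_size)
print(n_fit, n_val, steps_per_epoch)
```

The resulting step count matches the ``112/112`` shown in the training log.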

In [19]:
filename = '../models/baseline.h5'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

history = model.fit(trainX, trainY, 
          epochs=20, batch_size=64, 
          validation_split=0.2,
          callbacks=[checkpoint], verbose=1)

Epoch 1/20
Epoch 1: val_loss improved from inf to 7.61304, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 11s 85ms/step - accuracy: 0.2963 - loss: 8.3735 - val_accuracy: 0.5141 - val_loss: 7.6130
Epoch 2/20
Epoch 2: val_loss improved from 7.61304 to 6.13426, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 82ms/step - accuracy: 0.5129 - loss: 7.2703 - val_accuracy: 0.5253 - val_loss: 6.1343
Epoch 3/20
Epoch 3: val_loss improved from 6.13426 to 4.75956, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 81ms/step - accuracy: 0.5205 - loss: 5.7926 - val_accuracy: 0.5310 - val_loss: 4.7596
Epoch 4/20
Epoch 4: val_loss improved from 4.75956 to 3.79251, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 81ms/step - accuracy: 0.5291 - loss: 4.4991 - val_accuracy: 0.5361 - val_loss: 3.7925
Epoch 5/20
Epoch 5: val_loss improved from 3.79251 to 3.32350, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 81ms/step - accuracy: 0.5325 - loss: 3.6677 - val_accuracy: 0.5426 - val_loss: 3.3235
Epoch 6/20
Epoch 6: val_loss improved from 3.32350 to 3.15332, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 10s 87ms/step - accuracy: 0.5400 - loss: 3.2770 - val_accuracy: 0.5416 - val_loss: 3.1533
Epoch 7/20
Epoch 7: val_loss improved from 3.15332 to 3.09477, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 83ms/step - accuracy: 0.5355 - loss: 3.1629 - val_accuracy: 0.5406 - val_loss: 3.0948
Epoch 8/20
Epoch 8: val_loss improved from 3.09477 to 3.06184, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 83ms/step - accuracy: 0.5377 - loss: 3.1005 - val_accuracy: 0.5410 - val_loss: 3.0618
Epoch 9/20
Epoch 9: val_loss improved from 3.06184 to 3.03852, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 83ms/step - accuracy: 0.5344 - loss: 3.0906 - val_accuracy: 0.5414 - val_loss: 3.0385
Epoch 10/20
Epoch 10: val_loss improved from 3.03852 to 3.02143, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 83ms/step - accuracy: 0.5355 - loss: 3.0677 - val_accuracy: 0.5412 - val_loss: 3.0214
Epoch 11/20
Epoch 11: val_loss improved from 3.02143 to 3.00777, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 83ms/step - accuracy: 0.5360 - loss: 3.0480 - val_accuracy: 0.5419 - val_loss: 3.0078
Epoch 12/20
Epoch 12: val_loss improved from 3.00777 to 2.99699, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 83ms/step - accuracy: 0.5299 - loss: 3.0712 - val_accuracy: 0.5415 - val_loss: 2.9970
Epoch 13/20
Epoch 13: val_loss improved from 2.99699 to 2.98778, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 83ms/step - accuracy: 0.5343 - loss: 3.0395 - val_accuracy: 0.5440 - val_loss: 2.9878
Epoch 14/20
Epoch 14: val_loss improved from 2.98778 to 2.98031, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 84ms/step - accuracy: 0.5326 - loss: 3.0516 - val_accuracy: 0.5439 - val_loss: 2.9803
Epoch 15/20
Epoch 15: val_loss improved from 2.98031 to 2.97373, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 84ms/step - accuracy: 0.5396 - loss: 3.0084 - val_accuracy: 0.5436 - val_loss: 2.9737
Epoch 16/20
Epoch 16: val_loss improved from 2.97373 to 2.96790, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 83ms/step - accuracy: 0.5418 - loss: 2.9870 - val_accuracy: 0.5447 - val_loss: 2.9679
Epoch 17/20
Epoch 17: val_loss improved from 2.96790 to 2.96302, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 84ms/step - accuracy: 0.5421 - loss: 2.9816 - val_accuracy: 0.5438 - val_loss: 2.9630
Epoch 18/20
Epoch 18: val_loss improved from 2.96302 to 2.95811, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 83ms/step - accuracy: 0.5417 - loss: 2.9793 - val_accuracy: 0.5451 - val_loss: 2.9581
Epoch 19/20
Epoch 19: val_loss improved from 2.95811 to 2.95387, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 82ms/step - accuracy: 0.5430 - loss: 2.9632 - val_accuracy: 0.5455 - val_loss: 2.9539
Epoch 20/20
Epoch 20: val_loss improved from 2.95387 to 2.95007, saving model to ../models/baseline.h5
112/112 ━━━━━━━━━━━━━━━━━━━━ 9s 82ms/step - accuracy: 0.5381 - loss: 2.9923 - val_accuracy: 0.5456 - val_loss: 2.9501


In [20]:
model = load_model('../models/baseline.h5')



## Make Predictions

Now that we have our model, let's make some predictions. We will create a function called ``translate`` which will take a sentence in French as input and return its English translation. We will use the trained model to make predictions.

But first, let's inspect the predicted classes for a few test sentences to check that the pipeline works.
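
The model outputs one probability distribution per timestep, so turning a prediction into text amounts to taking the arg max at each position and looking the resulting ID up in the tokenizer. Here is a minimal sketch of that idea on made-up numbers; ``greedy_decode`` and the toy vocabulary are illustrative only and are not part of the notebook's code.

```python
def greedy_decode(probabilities, index_word):
    """Pick the most likely ID at each timestep and map it to a word, skipping padding (ID 0)."""
    words = []
    for step_probs in probabilities:
        best_id = max(range(len(step_probs)), key=step_probs.__getitem__)
        if best_id != 0:
            words.append(index_word[best_id])
    return " ".join(words)

# Toy vocabulary and a made-up (timesteps x vocabulary) probability table.
index_word = {1: "the", 2: "i", 3: "is"}
probabilities = [
    [0.1, 0.2, 0.6, 0.1],   # highest value at ID 2
    [0.1, 0.5, 0.1, 0.3],   # highest value at ID 1
    [0.7, 0.1, 0.1, 0.1],   # highest value at ID 0 (padding)
]
print(greedy_decode(probabilities, index_word))
```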

In [21]:
size_to_predict = 10

# Make predictions on the subset
subset_to_predict = testX[:size_to_predict]
predictions = model.predict_on_batch(subset_to_predict)
predictions_classes = np.argmax(predictions, axis=-1)

# reshape the subset to predict and the testY to be able to decode them
reshapedX_subset = subset_to_predict.reshape((subset_to_predict.shape[0], subset_to_predict.shape[1]))
reshapedY_subset = testY[:size_to_predict].reshape((testY[:size_to_predict].shape[0], testY[:size_to_predict].shape[1]))

predicted_df = pd.DataFrame(columns=['french_sentence', 'actual_english_sentence', 'predicted_english_sentence'])

for i, seq in enumerate(predictions_classes):
    predicted_text = decode_sequences(eng_tokenizer, seq)
    original_french_sentence = decode_sequences(fr_tokenizer, reshapedX_subset[i])
    original_english_sentence = decode_sequences(eng_tokenizer, reshapedY_subset[i])

    predicted_df.loc[i] = [original_french_sentence, original_english_sentence, predicted_text]

In [22]:
predicted_df

Unnamed: 0,french_sentence,actual_english_sentence,predicted_english_sentence
0,ligne du dieu des tempete sah,line of the god of tempete sah,i the the the
1,oui ca pas de doute,yeah no doubt,i the the
2,bah oe sinon c'est pas drole,well otherwise it is not funny,i i i the the
3,psq t'as pas lu encore,psq you have not read kingdom yet,i i
4,tout est explique dans l'arc si jamais t'as d'autres questions hesite pas,everything is explained in the bow if you ever have other questions hesite not,i the the the
5,quitte a les quand tu les en francais dans quelque jours,even read them again when you get them out in french in a few days,i the the the the the the
6,justement je pense pas,i do not think so,i the the
7,ca va etre nimporte quoi vu le nombre de creature corrompu et superieur quil a en stock,it is going to be anything given the number of corrupted and superior creatures it has in stock,i is is the the is the the
8,ouais donc cest uniquement les premiers cas par le sortilege sur la lune pas les premiers portes ni meme,yeah so it is only the first cases of injection by the exitlege on the moon not the first doors,i is is the the the the the the
9,deja il faut qu'elle lui trouve un nom,she has to find her name,i the the the
