This project implements an encoder-decoder seq2seq model that translates English to French.

It follows a guide available [here](https://www.kaggle.com/code/nageshsingh/neural-machine-translation).

### Corpora

* Training: "fra_eng.txt" (typically downloadable with the code below) or "fra.txt", depending on the version you obtained.
* Pretrained GloVe embeddings: "glove.6B.50d.txt" (the download in the code may fail; the file is otherwise available on Kaggle) or "glove.27B.200d.txt" (large dataset).
* Alternatively, get the files directly from the guide.

### Libraries used

*   Deep learning library: TensorFlow
*   Modeling toolset (fast experimentation with DNNs): Keras
*   BLEU score: NLTK

To save or serialize part of the work, you can use [pickle](https://docs.python.org/3.8/library/pickle.html?highlight=pickle#module-pickle) to write and load binary files.  
However, several Keras and TensorFlow objects cannot be serialized with pickle, since it mainly targets Python data structures. Dedicated alternatives exist, such as [Keras model saving](https://keras.io/api/models/model_saving_apis/) or [TensorFlow save and load](https://www.tensorflow.org/tutorials/keras/save_and_load).
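As a minimal sketch of the pickle route, the snippet below round-trips a plain dictionary through a binary file. The `word_index` content and the temporary file path are made up for illustration; in this notebook the pickled object would more likely be something like a tokenizer's word index or a training history dictionary.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a tokenizer's word index (a plain dict pickles fine)
word_index = {"<sos>": 1, "<eos>": 2, "je": 3, "suis": 4}

# Illustrative location: a file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "word_index.pkl")

# Serialize the object to a binary file
with open(path, "wb") as f:
    pickle.dump(word_index, f)

# Load it back later without recomputing it
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == word_index)
```

Models themselves are better saved with the Keras or TensorFlow APIs linked above.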

To run the model without retraining it (if you have saved it before), run every cell except those in the "Training your model" section.

### Google Colab

Google Colab [configuration for GPU / lib](https://colab.research.google.com/drive/1FKH1dnzzyzC4qoIlmHHNObvng7ztDJhJ?usp=sharing).

Training neural networks is computationally very expensive, so dedicated hardware is highly beneficial during training (although it is sometimes tricky to configure properly). In this respect, Colab provides free, preconfigured hardware you can use to train your models.  
In particular, it lets you train your model using either:

- a CPU (Central Processing Unit; this is your slowest option)
- a GPU (Graphics Processing Unit; much faster)
- a TPU (Tensor Processing Unit; even faster, but chances are you won't get access to one since they are allocated to paying customers first)

By default, your Colab notebook runs in a session whose Python interpreter only uses a CPU (no GPU/TPU available). To switch to a GPU-enabled session, click on the resource usage summary (top right of this window).

In [26]:
### Mount Google Drive and make it accessible as if it were a local file system
### Only available in a google.colab environment

# from google.colab import drive
# drive.mount('/content/drive')

# small = False
# if small:
#     corpus = "/content/drive/MyDrive/Colab Notebooks/LINFO2263_P3/fra_eng.txt"
#     glove_corpus = "/content/drive/MyDrive/Colab Notebooks/LINFO2263_P3/glove.6B.50d.txt"
# else:
#     corpus = "/content/drive/MyDrive/Colab Notebooks/LINFO2263_P3/fra.txt"
#     glove_corpus = "/content/drive/MyDrive/Colab Notebooks/LINFO2263_P3/glove.27B.200d.txt"

### Code

In [27]:
### Example download from Google Drive (the files must be readable by anyone with the link)
# from google_drive_downloader import GoogleDriveDownloader as gdd
# gdd.download_file_from_google_drive(file_id='1K9_RVJ6TyUosEsqkfWxx2chAvX6LGfrt',
#                                     dest_path='corpus/fra_eng.txt',
#                                     #unzip=True, showsize=True, overwrite=True
#                                     )

### Local version already in place
### Small corpus (True) or Tutorial Version: Big corpus (False)
small = False
if small:
   corpus = 'corpus/fra_eng.txt'
   glove_corpus = 'corpus/glove.6B.50d.txt'
else:
   corpus = 'corpus/fra.txt'
   glove_corpus = 'corpus/glove.27B.200d.txt'

# Test the corpus
print(corpus)
with open(corpus, encoding="utf8") as f:
   i = 0
   for line in f:
      print(line)
      i+=1
      if i > 3:
         break

print(glove_corpus)
with open(glove_corpus, encoding="utf8") as f:
   i = 0
   for line in f:
      print(line)
      i+=1
      if i > 3:
         break

corpus/fra.txt
Go.	Va !	CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #1158250 (Wittydev)

Hi.	Salut !	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #509819 (Aiji)

Hi.	Salut.	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #4320462 (gillux)

Run!	Cours !	CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906331 (sacredceltic)

corpus/glove.27B.200d.txt
<user> 0.31553 0.53765 0.10177 0.032553 0.003798 0.015364 -0.20344 0.33294 -0.20886 0.10061 0.30976 0.50015 0.32018 0.13537 0.0087039 0.1911 0.24668 -0.060752 -0.43623 0.019302 0.59972 0.13444 0.012801 -0.54052 0.27387 -1.182 -0.27677 0.11279 0.46596 -0.090685 0.24253 0.15654 -0.23618 0.57694 0.17563 -0.01969 0.018295 0.37569 -0.41984 0.22613 -0.20438 -0.076249 0.40356 0.61582 -0.10064 0.23318 0.22808 0.34576 -0.14627 -0.1988 0.033232 -0.84885 -0.25684 0.26369 0.29562 0.1847 -0.20668 -0.013297 0.12233 -0.47751 -0.17202 -0.14577 0.047446 -0.15824 0.054215 -0.19426 -0.081484 0.099009 0.

In [28]:
### Imports
import os, sys
from keras.models import Model, load_model
from keras.layers import Input, LSTM, GRU, Dense, Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical, Sequence, plot_model
from tensorflow.random      import set_seed
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import numpy as np
import matplotlib.pyplot as plt

In [29]:
### Parameters
BATCH_SIZE = 64
EPOCHS = 20
LSTM_NODES = 256
NUM_SENTENCES = 20000
MAX_SENTENCE_LENGTH = 50
MAX_NUM_WORDS = 20000
if small:
    EMBEDDING_SIZE = 50
else:
    EMBEDDING_SIZE = 200
VALIDATION_SPLIT = 0.1

In [30]:
input_sentences = []
output_sentences = []
output_sentences_inputs = []

count = 0
# Data preprocessing
with open(corpus, encoding="utf-8") as f:
    for line in f:
        count += 1
        if count > NUM_SENTENCES:
            break
        if '\t' not in line:
            continue
        # English sentence, French translation (the third field, the attribution, is ignored)
        input_sentence, output, _ = line.rstrip().split('\t')

        output_sentence = output + ' <eos>'       # End-of-sentence tag
        output_sentence_input = '<sos> ' + output # Start-of-sentence tag

        input_sentences.append(input_sentence)
        output_sentences.append(output_sentence)
        output_sentences_inputs.append(output_sentence_input)

print("num samples input:", len(input_sentences))
print("num samples output:", len(output_sentences))
print("num samples output input:", len(output_sentences_inputs), '\n')

print("Example of an entry in the different arrays created:")
print(input_sentences[180])
print(output_sentences[180])
print(output_sentences_inputs[180])

num samples input: 20000
num samples output: 20000
num samples output input: 20000 

Example of an entry in the different arrays created:
I'm shy.
Je suis timide. <eos>
<sos> Je suis timide.


### Tokenization

In [31]:
input_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
input_tokenizer.fit_on_texts(input_sentences)
input_integer_seq = input_tokenizer.texts_to_sequences(input_sentences)

word2idx_inputs = input_tokenizer.word_index
max_input_len = max(len(sen) for sen in input_integer_seq)

print('Total unique words in the input: %s' % len(word2idx_inputs))
print("Length of longest sentence in input: %g \n" % max_input_len)

output_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, filters='')
output_tokenizer.fit_on_texts(output_sentences + output_sentences_inputs)
output_integer_seq = output_tokenizer.texts_to_sequences(output_sentences)
output_input_integer_seq = output_tokenizer.texts_to_sequences(output_sentences_inputs)

word2idx_outputs = output_tokenizer.word_index
num_words_output = len(word2idx_outputs) + 1
max_out_len = max(len(sen) for sen in output_integer_seq)

print('Total unique words in the output: %s' % len(word2idx_outputs))
print("Length of longest sentence in the output: %g \n" % max_out_len)

Total unique words in the input: 3518
Length of longest sentence in input: 6 

Total unique words in the output: 9546
Length of longest sentence in the output: 12 
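To make the tokenization step more concrete, here is a pure-Python sketch of what a frequency-ranked word index does. It is not the Keras `Tokenizer` itself: the helper names (`clean`, `build_word_index`, `to_sequence`), the reduced filter set `".!?"`, and the tie-breaking order are assumptions of this sketch. It illustrates why, with `filters=''` on the French side, punctuation stays glued to the word (`"timide."`), as in the lookups used in the padding cell.

```python
from collections import Counter

def clean(sentence, filters):
    # Lowercase and replace every filtered character with a space
    return "".join(" " if ch in filters else ch for ch in sentence.lower())

def build_word_index(sentences, filters=""):
    # Count words, then rank them by descending frequency.
    # Index 0 is left unused so it can serve as the padding value.
    counts = Counter()
    for sentence in sentences:
        counts.update(clean(sentence, filters).split())
    ordered = sorted(counts, key=lambda w: -counts[w])  # stable: ties keep first-seen order
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def to_sequence(sentence, word_index, filters=""):
    # Replace each known word by its integer index; unknown words are dropped
    return [word_index[w] for w in clean(sentence, filters).split() if w in word_index]

# English side: punctuation is filtered out (toy filter set)
english = build_word_index(["I'm shy.", "I'm a lawyer."], filters=".!?")
# French side: no filtering, so "<sos>", "<eos>" and "timide." are kept as-is
french = build_word_index(["<sos> Je suis timide.", "Je suis timide. <eos>"])

print(to_sequence("I'm shy.", english, filters=".!?"))  # [1, 2]
print("timide." in french, "timide" in french)          # True False
```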



### Padding

In [32]:
encoder_input_sequences = pad_sequences(input_integer_seq, maxlen=max_input_len)

decoder_input_sequences = pad_sequences(output_input_integer_seq, maxlen=max_out_len, padding='post')
decoder_output_sequences = pad_sequences(output_integer_seq, maxlen=max_out_len, padding='post')

print("encoder_input_sequences.shape:", encoder_input_sequences.shape)
print("decoder_input_sequences.shape:", decoder_input_sequences.shape)
print("decoder_output_sequences.shape:", decoder_output_sequences.shape)
print()

if small:
    print("encoder_input_sequences[180]:", encoder_input_sequences[180])
    print(word2idx_inputs["join"])
    print(word2idx_inputs["us"])
    print()
    print("decoder_input_sequences[180]:", decoder_input_sequences[180])
    print([word2idx_outputs["<sos>"], word2idx_outputs["joignez-vous"], word2idx_outputs["à"], word2idx_outputs["nous."]])
else:
    print("encoder_input_sequences[180]:", encoder_input_sequences[180])
    print(word2idx_inputs["i'm"])
    print(word2idx_inputs["shy"])
    print()
    print("decoder_input_sequences[180]:", decoder_input_sequences[180])
    print([word2idx_outputs["<sos>"], word2idx_outputs["je"], word2idx_outputs["suis"], word2idx_outputs["timide."]])

encoder_input_sequences.shape: (20000, 6)
decoder_input_sequences.shape: (20000, 12)
decoder_output_sequences.shape: (20000, 12)

encoder_input_sequences[180]: [  0   0   0   0   6 301]
6
301

decoder_input_sequences[180]: [  2   3   6 326   0   0   0   0   0   0   0   0]
[2, 3, 6, 326]
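The difference between the default "pre" padding used for the encoder and the "post" padding used for the decoder can be sketched in a few lines of plain Python. The `pad` helper below is hypothetical (it is not `pad_sequences`), and it assumes over-long sequences are truncated from the front; the two calls reproduce the arrays printed above.

```python
def pad(seq, maxlen, padding="pre", value=0):
    # Keep at most `maxlen` trailing tokens, then fill with `value`
    seq = list(seq)[-maxlen:]
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill

# Encoder input: zeros go in front, so the real words come last
print(pad([6, 301], 6))                         # [0, 0, 0, 0, 6, 301]

# Decoder input: zeros go at the end, so the sentence starts at step 0
print(pad([2, 3, 6, 326], 12, padding="post"))  # [2, 3, 6, 326, 0, 0, 0, 0, 0, 0, 0, 0]
```

A common rationale for pre-padding the encoder is that the LSTM then reads the actual words in its final steps, right before handing its state to the decoder.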


### Words embeddings

Here we turn the integer representation of each word into a multi-dimensional vector (a word embedding), which captures more information than a bare index.

The embedding layer is trainable, meaning that the model learns the optimal vector representations for the tokens during the training process.  

The embedding layer helps capture semantic relationships between words and allows the model to better understand the input sequence.  

The embedding layers in both the encoder and decoder contribute to the overall performance of the sequence-to-sequence model by providing a continuous representation of discrete tokens. This helps the model understand the relationships between words and capture semantic information, making the learning process more effective.
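As a rough illustration of how pretrained vectors end up in the embedding layer, the sketch below parses GloVe-style lines (a word followed by its vector components) and fills a matrix whose row `i` holds the vector of the word with index `i`. The three-dimensional vectors, the tiny vocabulary, and the word `"zzzunknown"` are invented for the example; words without a pretrained vector simply keep a row of zeros.

```python
# Toy GloVe-style lines: "<word> <v1> <v2> <v3>" (made-up values)
glove_lines = [
    "join -0.37 0.47 -0.19",
    "shy 0.10 0.20 0.30",
]
EMBEDDING_SIZE = 3

# Parse each line into word -> vector
embeddings = {}
for line in glove_lines:
    word, *values = line.split()
    embeddings[word] = [float(v) for v in values]

# Hypothetical tokenizer index (0 is reserved for padding)
word_index = {"join": 1, "shy": 2, "zzzunknown": 3}

# One row per index, initialised to zeros
matrix = [[0.0] * EMBEDDING_SIZE for _ in range(len(word_index) + 1)]
for word, idx in word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        matrix[idx] = vector

print(matrix[word_index["join"]])        # same values as embeddings["join"]
print(matrix[word_index["zzzunknown"]])  # [0.0, 0.0, 0.0]
```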

In [42]:
embeddings_dictionary = dict()

# Each GloVe line holds a word followed by its vector components
with open(glove_corpus, encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = np.asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

num_words = min(MAX_NUM_WORDS, len(word2idx_inputs) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_SIZE))
for word, index in word2idx_inputs.items():
    embedding_vector = embeddings_dictionary.get(word)
    # Skip indices beyond the matrix (possible when the vocabulary exceeds MAX_NUM_WORDS)
    if embedding_vector is not None and index < num_words:
        embedding_matrix[index] = embedding_vector

print("Example use of the newly created embedding variables (they give the same result):")
print("Using the word 'join' in the dictionary:\n", embeddings_dictionary["join"][:6])
print("Output truncated to show they're the same (the full vectors are too long)")
if small:
    print("Using the number assigned to 'join' in the matrix (467)\n", embedding_matrix[467][:6])
else:
    print("Using the number assigned to 'join' in the matrix (463)\n", embedding_matrix[463][:6])

# Creation of the embedding layer
embedding_layer = Embedding(num_words, EMBEDDING_SIZE, weights=[embedding_matrix], input_length=max_input_len)

Example use of the newly created embedding variables (they give the same result):
Using the word 'join' in the dictionary:
 [-0.37239  0.47634 -0.19666 -1.1641   0.01282 -0.47063]
Output truncated to show they're the same (the full vectors are too long)
Using the number assigned to 'join' in the matrix (463)
 [-0.37239     0.47634    -0.19666    -1.16410005  0.01282    -0.47062999]


### Creating the model

The seq2seq architecture is used for text summarization, chatbot development, conversational modeling, neural machine translation, and more.

Here we use it to build a translation model.

Seq2seq is an encoder-decoder architecture built from two LSTMs:
*   Encoder input: the sentence in the original language.
*   Decoder input: the sentence in the target language, prefixed with a start-of-sentence token.
*   Decoder output: the actual target sentence, followed by an end-of-sentence token.
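The relationship between the decoder input and the decoder output can be sketched as follows: both come from the same target sentence, shifted by one position, so that at every step the decoder is given the previous word and must predict the next one. The helper name `make_decoder_pair` is hypothetical; the `<sos>` and `<eos>` tags are the ones used in the preprocessing cell.

```python
def make_decoder_pair(target_sentence):
    # Same tokens, shifted by one step relative to each other
    tokens = target_sentence.split()
    decoder_input = ["<sos>"] + tokens   # what the decoder is fed
    decoder_target = tokens + ["<eos>"]  # what the decoder must predict
    return decoder_input, decoder_target

dec_in, dec_out = make_decoder_pair("Je suis timide.")
for step, (given, expected) in enumerate(zip(dec_in, dec_out)):
    print(step, given, "->", expected)
# 0 <sos> -> Je
# 1 Je -> suis
# 2 suis -> timide.
# 3 timide. -> <eos>
```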

In [34]:
# Define encoder:
# - input = sentence in english
# - output = hidden state and cell state of the LSTM
encoder_inputs = Input(shape=(max_input_len,))
x = embedding_layer(encoder_inputs)
encoder = LSTM(LSTM_NODES, return_state=True)

encoder_outputs, h, c = encoder(x)
encoder_states = [h, c]

# Define decoder:
# - initial state = hidden state and cell state of the encoder
# - input = output sentence prefixed with a start-of-sentence token
# - output = output sentence followed by an end-of-sentence token
decoder_inputs = Input(shape=(max_out_len,))
decoder_embedding = Embedding(num_words_output, LSTM_NODES)
decoder_inputs_x = decoder_embedding(decoder_inputs)
decoder_lstm = LSTM(LSTM_NODES, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs_x, initial_state=encoder_states)

# Dense Layer: Output of the decoder LSTM is passed through here to predict decoder outputs
decoder_dense = Dense(num_words_output, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Compile model
model = Model(
    [encoder_inputs,
    decoder_inputs],
    decoder_outputs)

model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy'])

### Plot the model and save the figure in the "model" folder
### (the folder is created here if it does not exist yet)
os.makedirs('model', exist_ok=True)
path_to_plot1 = 'model/model_plot1.png'
plot_model(model, to_file=path_to_plot1, show_shapes=True, show_layer_names=True)

model.summary()

Model: "model_6"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_11 (InputLayer)       [(None, 6)]                  0         []                            
                                                                                                  
 input_12 (InputLayer)       [(None, 12)]                 0         []                            
                                                                                                  
 embedding_4 (Embedding)     (None, 6, 200)               703800    ['input_11[0][0]']            
                                                                                                  
 embedding_5 (Embedding)     (None, 12, 256)              2444032   ['input_12[0][0]']            
                                                                                            

### Training your model

Training requires a 3D tensor holding the one-hot encoding of every output sentence. Although the training corpus is fairly small, storing that tensor in full takes several GB of RAM, which is not always practical. The cell below builds it in full anyway (no RAM optimization, intended for a local run). To keep memory usage tractable, the one-hot sentences can instead be created _lazily_ with an implementation of [`tensorflow.keras.utils.Sequence`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence), as in the "Lazy sequence for tractable RAM" section.
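The "several GB" figure can be checked with simple arithmetic. The sketch below uses the sizes printed earlier for the big corpus (20000 sentences, a longest output of 12 tokens, 9546 output words plus one padding index) and assumes 4 bytes per `float32` value; it also shows how small a single batch is by comparison.

```python
num_sentences = 20000        # NUM_SENTENCES
max_out_len = 12             # longest output sentence (printed in the tokenization cell)
num_words_output = 9546 + 1  # output vocabulary size + 1 for the padding index
bytes_per_float32 = 4
batch_size = 64              # BATCH_SIZE

# Full one-hot tensor: one float per (sentence, time step, vocabulary entry)
total_bytes = num_sentences * max_out_len * num_words_output * bytes_per_float32
# One-hot tensor for a single batch only
batch_bytes = batch_size * max_out_len * num_words_output * bytes_per_float32

print(f"full tensor: {total_bytes / 1e9:.2f} GB")   # full tensor: 9.17 GB
print(f"one batch:   {batch_bytes / 1e6:.2f} MB")   # one batch:   29.33 MB
```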

In [35]:
# Train the model here: no RAM optimization in this local version
#-------------------------------------------------------------------------------
# Helps reduce the variance induced by the random initial weights assigned at
# the beginning of gradient descent.
#-------------------------------------------------------------------------------
np.random.seed(42)
set_seed(42)

decoder_targets_one_hot = np.zeros((
    len(input_sentences),
    max_out_len,
    num_words_output),
    dtype='float32')

# Set a 1 in the column that matches each word's integer index
for i, d in enumerate(decoder_output_sequences):
    for t, word in enumerate(d):
        decoder_targets_one_hot[i, t, word] = 1

history = model.fit(
    [encoder_input_sequences, decoder_input_sequences],
    decoder_targets_one_hot,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_split=VALIDATION_SPLIT,
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [36]:
# Save this model
path_to_model1 = 'model/seq2seq_eng-fra.keras'
model.save(path_to_model1)

#### Lazy sequence for tractable RAM

Version with lazy sequence loading: this makes training possible on Google Colab (otherwise it runs out of RAM).

Unfortunately, it hurts the performance and quality of training (it did not 3 years ago).

This is only an example, not a runnable cell:
```python
# Train model here
#-------------------------------------------------------------------------------
# Helps reduce the variance induced by the random initial weights assigned at
# the beginning of gradient descent.
#-------------------------------------------------------------------------------
np.random.seed(42)
set_seed(42)
#
#-------------------------------------------------------------------------------
# This sequence is used to feed the training process with batches that are not
# all loaded in ram at once
#------------------------------------------------------------------------------
class LazyLoadedSequence(Sequence):
  def __init__(self, begin, end):
    self.begin      = begin        # beginning (included) of the considered data
    self.end        = end          # end (excluded) of the considered data
    self.nb_samples = end - begin  # number of data samples

  def __len__(self):
    # returns the number of batches of data
    return np.ceil(self.nb_samples / BATCH_SIZE).astype(int)

  def __getitem__(self, idx):
    # returns the `idx`th batch of data
    # (returns both inputs aka xs and outputs aka ys)
    start   = self.begin + BATCH_SIZE * idx
    end     = min(self.end, start + BATCH_SIZE)

    enc_x   = encoder_input_sequences[start:end]
    dec_x   = decoder_input_sequences[start:end]
    one_hot = np.zeros((end-start, max_out_len, num_words_output), dtype='float32')
    # now let us actually build the one hot encoded representation for each of
    # the output sentences (in french)
    for i, d in enumerate(decoder_output_sequences[start:end]):
      for t, word in enumerate(d):
        one_hot[i, t, word] = 1
    # now return both the xs and the ys
    return [enc_x, dec_x], one_hot

#-------------------------------------------------------------------------------
# Actually fit it with custom batches
#------------------------------------------------------------------------------
from keras.callbacks import EarlyStopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

nb_sentences    = len(input_sentences)
split_limit     = np.ceil(nb_sentences * (1 - VALIDATION_SPLIT)).astype(int)
train_data      = LazyLoadedSequence(0, split_limit)
validation_data = LazyLoadedSequence(split_limit, nb_sentences)
r = model.fit(
    train_data,
    validation_data = validation_data,
    epochs          = EPOCHS,
    callbacks       = [es],
)
```
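The index arithmetic behind `__len__` and `__getitem__` can be checked in isolation. The sketch below restates it with plain Python (the function names `num_batches` and `batch_bounds` are made up for the example) and assumes the training range is the first 18000 of the 20000 sentences, i.e. everything except the 10% validation split. The resulting 282 batches match the `282/282` steps shown in the training logs.

```python
import math

BATCH_SIZE = 64
begin, end = 0, 18000  # training range: 20000 sentences minus the 10% validation split

def num_batches(begin, end, batch_size=BATCH_SIZE):
    # Number of batches needed to cover [begin, end); the last one may be partial
    return math.ceil((end - begin) / batch_size)

def batch_bounds(idx, begin, end, batch_size=BATCH_SIZE):
    # Slice limits [start, stop) of the idx-th batch
    start = begin + batch_size * idx
    return start, min(end, start + batch_size)

n = num_batches(begin, end)
print(n)                                # 282
print(batch_bounds(0, begin, end))      # (0, 64)
print(batch_bounds(n - 1, begin, end))  # (17984, 18000)
```

Only the slice for the current batch is one-hot encoded, which is what keeps memory usage low.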

### Prediction/Translation model

In [37]:
### Reload the model: here by loading the saved weights into the model built above
path_to_model1 = 'model/seq2seq_eng-fra.keras'
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.load_weights(path_to_model1)

# Same encoder
encoder_model = Model(encoder_inputs, encoder_states)

# Add hidden & cell state
decoder_state_input_h = Input(shape=(LSTM_NODES,))
decoder_state_input_c = Input(shape=(LSTM_NODES,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Modify decoder embedding layer to fit with only single word now
decoder_inputs_single = Input(shape=(1,))
decoder_inputs_single_x = decoder_embedding(decoder_inputs_single)

# PlaceHolder for decoder outputs
decoder_outputs, h, c = decoder_lstm(decoder_inputs_single_x, initial_state=decoder_states_inputs)

# Dense Layer again
decoder_states = [h, c]
decoder_outputs = decoder_dense(decoder_outputs)

# Update model of decoder
decoder_model = Model(
    [decoder_inputs_single] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)

path_to_plot2 = 'model/model_plot_decoder2.png'
plot_model(decoder_model, to_file=path_to_plot2, show_shapes=True, show_layer_names=True)

decoder_model.summary()
encoder_model.summary()



Model: "model_8"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_15 (InputLayer)       [(None, 1)]                  0         []                            
                                                                                                  
 embedding_5 (Embedding)     multiple                     2444032   ['input_15[0][0]']            
                                                                                                  
 input_13 (InputLayer)       [(None, 256)]                0         []                            
                                                                                                  
 input_14 (InputLayer)       [(None, 256)]                0         []                            
                                                                                            

In [38]:
idx2word_input = {v:k for k, v in word2idx_inputs.items()}
idx2word_target = {v:k for k, v in word2idx_outputs.items()}

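def translate_sentence(input_string, printer):
    # Greedy decoding. The sentence must be one of `input_sentences`,
    # since its padded encoding is looked up by position.
    i = input_sentences.index(input_string)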
    input_seq = encoder_input_sequences[i:i+1]
    states_value = encoder_model.predict(input_seq)
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = word2idx_outputs['<sos>']
    eos = word2idx_outputs['<eos>']
    output_sentence = []

    for _ in range(max_out_len):
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        idx = np.argmax(output_tokens[0, 0, :])

        if eos == idx:
            break

        word = ''

        if idx > 0:
            word = idx2word_target[idx]
            output_sentence.append(word)

        target_seq[0, 0] = idx
        states_value = [h, c]

    translation = ' '.join(output_sentence)
    # Printing the result
    if printer:
        print('-')
        print('Input:', input_sentences[i])
        print('Response:', translation)
        print('Expected Response:', output_sentences[i])

    return translation
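The loop in `translate_sentence` is a greedy decoder: at each step it keeps the single most probable word, feeds it back as the next input, and stops at `<eos>` or after `max_out_len` steps. The sketch below isolates that control flow. The `toy_step` function and its probability table are invented stand-ins for `decoder_model.predict`; they carry no trained weights.

```python
def greedy_decode(step, start_token, end_token, max_len):
    # Repeatedly pick the most probable next token until <eos> or max_len
    token, state, output = start_token, None, []
    for _ in range(max_len):
        scores, state = step(token, state)
        token = max(scores, key=scores.get)  # argmax over the vocabulary
        if token == end_token:
            break
        output.append(token)
    return output

# Hypothetical next-word probabilities, standing in for the trained decoder
TABLE = {
    "<sos>": {"je": 0.9, "suis": 0.1},
    "je": {"suis": 0.8, "<eos>": 0.2},
    "suis": {"timide.": 0.7, "<eos>": 0.3},
    "timide.": {"<eos>": 0.95, "je": 0.05},
}

def toy_step(token, state):
    # Returns (scores for the next word, unchanged state)
    return TABLE[token], state

print(greedy_decode(toy_step, "<sos>", "<eos>", 12))  # ['je', 'suis', 'timide.']
```

Because only the best word is kept at each step, an early mistake cannot be undone, which is one plausible source of the repeated words seen in some translations.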

In [39]:
translation1 = translate_sentence("I'm a lawyer.", True)
translation2 = translate_sentence("Is anybody hurt?", True)
translation3 = translate_sentence("I'm concentrating.", True)
translation4 = translate_sentence("They let me go.", True)
translation5 = translate_sentence("How is your cold?", True)
translation6 = translate_sentence("Tom ate my lunch.", True)

-
Input: I'm a lawyer.
Response: je suis un enfant.
Expected Response: Je suis avocat. <eos>
-
Input: Is anybody hurt?
Response: est-ce que qui est-il blessé ?
Expected Response: Quiconque est-il blessé ? <eos>
-
Input: I'm concentrating.
Response: je suis en train de en train de la faire de faire
Expected Response: Je suis en train de me concentrer. <eos>
-
Input: They let me go.
Response: ils m'ont laissé partir.
Expected Response: Ils me laissèrent partir. <eos>
-
Input: How is your cold?
Response: comment est votre voiture ?
Expected Response: Comment va votre rhume ? <eos>
-
Input: Tom ate my lunch.
Response: tom a perdu mon voiture.
Expected Response: Tom a mangé mon déjeuner. <eos>


### BLEU Score

BLEU is a score widely used to evaluate the quality of machine translation systems.

It typically combines the modified n-gram precision with a brevity penalty. It is often reported as a percentage, with 100% indicating a perfect match with the reference text.

However, BLEU has some limitations. It doesn't capture semantic meaning or fluency in the generated text, and it may not always align with human judgments of translation quality. Researchers and practitioners often use BLEU in combination with other metrics and qualitative assessments to get a more comprehensive evaluation of machine-generated text. Here we only report it.
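To show how the two ingredients combine, here is a hand-rolled BLEU-2 for a single reference: the geometric mean of the clipped unigram and bigram precisions, multiplied by the brevity penalty. This is a simplified sketch, not NLTK's `sentence_bleu`: it handles one reference only, applies no smoothing (it returns 0 when a precision is 0), and assumes a non-empty candidate. The helper names are made up for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    # Each candidate n-gram counts at most as often as it occurs in the reference
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / max(1, sum(cand.values()))

def brevity_penalty(reference, candidate):
    # Penalise candidates shorter than the reference
    c, r = len(candidate), len(reference)
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu2(reference, candidate):
    p1 = modified_precision(reference, candidate, 1)
    p2 = modified_precision(reference, candidate, 2)
    if p1 == 0 or p2 == 0:
        return 0.0  # no smoothing in this sketch
    return brevity_penalty(reference, candidate) * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))

reference = "tom a mangé mon déjeuner.".split()
candidate = "tom a mangé mon mon déjeuner.".split()

print(modified_precision(reference, candidate, 1))  # 5 of 6 unigrams kept: the second "mon" is clipped
print(modified_precision(reference, candidate, 2))  # 4 of 5 bigrams match
print(round(bleu2(reference, candidate), 3))        # 0.816
```

The candidate is longer than the reference, so the brevity penalty is 1 here; a two-word candidate such as `"tom a"` would have perfect precisions but be scaled down by `exp(1 - 5/2)`.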

In [40]:
# Toy example: one candidate scored against two references
reference = [['this', 'looks', 'highly', 'satisfactory', '<eos>'], ['this', 'looks', 'good', 'indeed', '<eos>' ]]
candidate = ['this', 'is', 'very', 'good', 'indeed', '<eos>']

smooth = SmoothingFunction().method1
bleu = sentence_bleu
print("Bleu score: ", round(bleu(reference, candidate, smoothing_function=smooth), 3))

# BLEU-2 score (unigrams and bigrams only)
print("Bleu-2 score: ", round(bleu(reference, candidate, weights=(0.5, 0.5, 0, 0), smoothing_function=smooth), 3))

def compute_bleu():
    # Average BLEU-2 over the last 2000 sentences (indices 18000 to 19999),
    # i.e. the part held out by VALIDATION_SPLIT = 0.1
    first, last = 18000, 20000
    score = 0
    for i in range(first, last):
        source = input_sentences[i]  # renamed to avoid shadowing the built-in `input`
        translation = (translate_sentence(source, False) + " <eos>").split()
        # Note: references keep their original casing, translations are lowercased
        expected = [output_sentences[i].split()]
        score += bleu(expected, translation, weights=(0.5, 0.5, 0, 0), smoothing_function=smooth)
    score /= (last - first)
    return round(score, 3)

Bleu score:  0.217
Bleu-2 score:  0.516


In [41]:
# 16m
compute_bleu() # 0.147



0.147

### Training Test Quality

Expected Response: Je suis avocat. <eos>  
Expected Response: Quiconque est-il blessé ? <eos>  
Expected Response: Je suis en train de me concentrer. <eos>  
Expected Response: Ils me laissèrent partir. <eos>  
Expected Response: Comment va votre rhume ? <eos>  
Expected Response: Tom a mangé mon déjeuner. <eos>  

Big corpus: TensorFlow 2.15.0, not lazy  
Epoch 1/20  
282/282 [==============================] - 50s 163ms/step - loss: 2.4596 - accuracy: 0.6751 - val_loss: 2.4478 - val_accuracy: 0.6708  
Epoch 20/20  
282/282 [==============================] - 30s 107ms/step - loss: 0.8223 - accuracy: 0.8501 - val_loss: 1.5550 - val_accuracy: 0.7632  
Input: I'm a lawyer.  
Response: je suis un enfant.  
Expected Response: Je suis avocat. <eos>  
Input: Is anybody hurt?  
Response: quiconque est-il blessé ?  
Input: I'm concentrating.  
Response: je suis en train de la train de la train de la  
Input: They let me go.  
Response: ils m'ont laissé partir.  
Input: How is your cold?  
Response: comment est votre voiture ?  
Input: Tom ate my lunch.  
Response: tom a mangé mon mon déjeuner.  

Big corpus: TensorFlow 2.13.0, not lazy  
Epoch 1/20  
282/282 [==============================] - 51s 163ms/step - loss: 2.4487 - accuracy: 0.6752 - val_loss: 2.4531 - val_accuracy: 0.6683  
Epoch 20/20  
282/282 [==============================] - 40s 142ms/step - loss: 0.8065 - accuracy: 0.8524 - val_loss: 1.5368 - val_accuracy: 0.7667  
Input: I'm a lawyer.  
Response: je suis un enfant.  
Input: Is anybody hurt?  
Response: qui est-il blessé ?  
Input: I'm concentrating.  
Response: je suis en train de temps.  
Input: They let me go.  
Response: ils m'ont laissé partir.  
Input: How is your cold?  
Response: comment va votre votre sont de la ?  
Input: Tom ate my lunch.  
Response: tom a mon mon fait mon moi.  

Small corpus: TensorFlow 2.13.0, not lazy  
Epoch 1/20  
282/282 [==============================] - 55s 182ms/step - loss: 2.3077 - accuracy: 0.6980 - val_loss: 2.2653 - val_accuracy: 0.6921  
Epoch 20/20  
282/282 [==============================] - 45s 158ms/step - loss: 0.8667 - accuracy: 0.8463 - val_loss: 1.4672 - val_accuracy: 0.7769  
Input: I'm a lawyer.  
Response: je suis un homme.  
Input: Is anybody hurt?  
Response: est-ce que tom est-il blessé ?  
Input: I'm concentrating.  
Response: je suis en train de la maison.  
Input: They let me go.  
Response: ils m'ont partir.  
Input: How is your cold?  
Response: comment est votre vin ?  
Input: Tom ate my lunch.  
Response: tom a mon fils.  

Small corpus: TensorFlow 2.13.0, lazy  
Epoch 1/20  
282/282 [==============================] - 48s 160ms/step - loss: 2.4157 - accuracy: 0.6904 - val_loss: 2.3438 - val_accuracy: 0.6930  
Epoch 20/20  
282/282 [==============================] - 45s 159ms/step - loss: 1.3734 - accuracy: 0.7902 - val_loss: 1.7659 - val_accuracy: 0.7377  
Input: I'm a lawyer.  
Response: c'est un c'est un la maison.  
Input: Is anybody hurt?  
Response: est-ce que ?  
Input: I'm concentrating.  
Response: je suis en train de la toi.  
Input: They let me go.  
Response: nous nous faut de aller.  
Input: How is your cold?  
Response: est-ce que ça ?  
Input: Tom ate my lunch.  
Response: tom a un tom a un c'est de la maison.  

Big corpus: TensorFlow 2.13.0, lazy  
Epoch 1/20  
282/282 [==============================] - 47s 154ms/step - loss: 2.5751 - accuracy: 0.6663 - val_loss: 2.5227 - val_accuracy: 0.6649  
Epoch 20/20  
282/282 [==============================] - 44s 156ms/step - loss: 1.3487 - accuracy: 0.7886 - val_loss: 1.8560 - val_accuracy: 0.7237  
Input: I'm a lawyer.  
Response: je suis un un moi.  
Input: Is anybody hurt?  
Response: est-ce le le ?  
Input: I'm concentrating.  
Response: je suis en train de moi.  
Input: They let me go.  
Response: laissez-moi me me faut de aller.  
Input: How is your cold?  
Response: est-ce ton ton est ?  
Input: Tom ate my lunch.  
Response: tom a a le la la la la ?  