# NMT Workshop Exercise 2: English to French

<b>In this exercise we will train a seq2seq model to translate English sentences into French.

First make sure you have the accompanying data file fra-eng.zip, and unzip it in the current directory.
We will also need to download the spaCy model file for French.</b>

##### Note for Google Colab users:
<b>If you are working in Google Colab, uncomment the last line below before running, and restart the runtime after running the cell below. Also make sure you are using GPU acceleration for this exercise.</b>

In [1]:
# ! unzip fra-eng.zip
# ! pip install tqdm==4.33.0

<b>The file fra.txt contains English-French sentence pairs from the [Tatoeba project](https://tatoeba.org/eng/). Take a look at the contents of the file to see a few examples of sentence pairs in the corpus.

For this exercise, we will also need spaCy models for English and French (see [the spaCy documentation](https://spacy.io/usage/models) for more information).</b>

In [2]:
! python3 -m spacy download en
! python3 -m spacy download fr

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
✔ Download and installation successful
You can now load the model via spacy.load('fr_core_news_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/fr_core_news_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/fr
You can now load the model via spacy.load('fr')


In [3]:
import spacy
en_model = spacy.load('en')
fr_model = spacy.load('fr')
import pandas as pd

## Part 1: Data preprocessing

<b>First we will preprocess the data by taking a random sample, tokenizing, lowercasing, and removing sentences containing uncommon words. The goal is to have a clean sample set that will make learning faster.</b>

### Questions:
#### 1. Read the data in the file *fra.txt* into a Pandas dataframe *df* with columns *en* containing English sentences and *fr* containing the corresponding French sentences. <br>(Hint: use pandas.read_csv with parameters sep=, header=, usecols=, and names=). 


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
pd.set_option('display.max_colwidth', None)
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/fra.txt',
                 usecols=[0,1] ,
                 names=['en','fr'],
                 sep='\t', 
                 header=None)
df.head(10)

| | en | fr |
|---|---|---|
| 0 | Go. | Va ! |
| 1 | Hi. | Salut ! |
| 2 | Hi. | Salut. |
| 3 | Run! | Cours ! |
| 4 | Run! | Courez ! |
| 5 | Who? | Qui ? |
| 6 | Wow! | Ça alors ! |
| 7 | Fire! | Au feu ! |
| 8 | Help! | À l'aide ! |
| 9 | Jump. | Saute. |


#### How many sentence pairs are in the corpus?


In [6]:
len(df)

174481

#### 2. Filter to keep only sentences containing at most 40 characters (in both languages), and save a random sample of 50000 sentence pairs in a new dataframe *sample_df*.<br> Use pd.DataFrame.sample() with parameter random_state=0 to get reproducible results.


In [7]:
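# Keep only pairs where both sentences have at most 40 characters
short_pairs = df[(df['en'].str.len() <= 40) & (df['fr'].str.len() <= 40)]
# Draw a reproducible random sample (sampling without replacement is the default)
sample_df = short_pairs.sample(n=50000, random_state=0).reset_index(drop=True)
sample_df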

| | en | fr |
|---|---|---|
| 0 | What did I do wrong? | Qu'ai-je fait de travers ? |
| 1 | You like olives, don't you? | Vous aimez les olives, pas vrai ? |
| 2 | I really appreciate your coming. | J'apprécie vraiment ta venue. |
| 3 | I had to do something. | Il fallait que je fasse quelque chose. |
| 4 | I know Tom was first. | Je sais que Tom était le premier. |
| ... | ... | ... |
| 49995 | He borrowed the car from his friend. | Il a emprunté la voiture à un ami. |
| 49996 | Send us a message. | Envoie-nous un message. |
| 49997 | Who's coming for dinner? | Qui vient dîner ? |
| 49998 | You're lying! | Vous mentez ! |


#### 3. Add columns 'en_tokens' and 'fr_tokens' to sample_df containing arrays of the tokens in the English and French sentences. Use the spaCy models en_model and fr_model to tokenize the sentences, and make all the tokens lowercase. Add special tokens '&lt;start&gt;' and '&lt;end&gt;' to the beginning and end of every sentence.


In [8]:
def tokenize(model, sent):
    doc = model.tokenizer(sent)
    token_list = [token.text.lower() for token in doc]
    token_list.append('<end>')
    token_list.insert(0, '<start>')
    return token_list

sample_df['en_tokens'] = sample_df['en'].apply(lambda x: tokenize(en_model, x))
sample_df['fr_tokens'] = sample_df['fr'].apply(lambda x: tokenize(fr_model, x))
sample_df

Unnamed: 0,en,fr,en_tokens,fr_tokens
0,What did I do wrong?,Qu'ai-je fait de travers ?,"[<start>, what, did, i, do, wrong, ?, <end>]","[<start>, qu', ai, -, je, fait, de, travers, ?, <end>]"
1,"You like olives, don't you?","Vous aimez les olives, pas vrai ?","[<start>, you, like, olives, ,, do, n't, you, ?, <end>]","[<start>, vous, aimez, les, olives, ,, pas, vrai, ?, <end>]"
2,I really appreciate your coming.,J'apprécie vraiment ta venue.,"[<start>, i, really, appreciate, your, coming, ., <end>]","[<start>, j', apprécie, vraiment, ta, venue, ., <end>]"
3,I had to do something.,Il fallait que je fasse quelque chose.,"[<start>, i, had, to, do, something, ., <end>]","[<start>, il, fallait, que, je, fasse, quelque, chose, ., <end>]"
4,I know Tom was first.,Je sais que Tom était le premier.,"[<start>, i, know, tom, was, first, ., <end>]","[<start>, je, sais, que, tom, était, le, premier, ., <end>]"
...,...,...,...,...
49995,He borrowed the car from his friend.,Il a emprunté la voiture à un ami.,"[<start>, he, borrowed, the, car, from, his, friend, ., <end>]","[<start>, il, a, emprunté, la, voiture, à, un, ami, ., <end>]"
49996,Send us a message.,Envoie-nous un message.,"[<start>, send, us, a, message, ., <end>]","[<start>, envoie, -, nous, un, message, ., <end>]"
49997,Who's coming for dinner?,Qui vient dîner ?,"[<start>, who, 's, coming, for, dinner, ?, <end>]","[<start>, qui, vient, dîner, ?, <end>]"
49998,You're lying!,Vous mentez !,"[<start>, you, 're, lying, !, <end>]","[<start>, vous, mentez, !, <end>]"


#### 4. Create sets en_common and fr_common containing the most common 2500 tokens in English and in French, respectively.


In [9]:
from collections import Counter
from itertools import chain 

en_common = Counter(chain.from_iterable(sample_df['en_tokens'])).most_common(2500)

fr_common = Counter(chain.from_iterable(sample_df['fr_tokens'])).most_common(2500)
en_common = dict(en_common)
en_common

{'<start>': 50000,
 '<end>': 50000,
 '.': 40884,
 'i': 15748,
 'you': 13495,
 '?': 8729,
 'to': 8110,
 'the': 6814,
 'a': 6103,
 'do': 5681,
 "n't": 5518,
 'is': 5156,
 'tom': 4430,
 'it': 4163,
 "'s": 3880,
 'that': 3870,
 'he': 3428,
 'this': 2866,
 'me': 2803,
 'have': 2729,
 'are': 2503,
 'we': 2460,
 'was': 2399,
 'of': 2387,
 'what': 2275,
 "'re": 2236,
 "'m": 2221,
 'my': 2149,
 'did': 2054,
 'your': 2031,
 'in': 2025,
 'be': 1737,
 'she': 1638,
 'not': 1635,
 'for': 1634,
 'want': 1611,
 'like': 1564,
 'know': 1427,
 'they': 1417,
 ',': 1255,
 'on': 1196,
 'all': 1180,
 'can': 1135,
 'with': 1125,
 'go': 1116,
 "'ll": 1114,
 'how': 1062,
 'very': 1016,
 'here': 978,
 'his': 971,
 "'ve": 959,
 'at': 953,
 'there': 928,
 'no': 844,
 'him': 839,
 'will': 791,
 'were': 787,
 'think': 782,
 'one': 773,
 'about': 748,
 'has': 743,
 'going': 722,
 'get': 719,
 'up': 693,
 'need': 677,
 'who': 676,
 'her': 670,
 'ca': 667,
 'good': 666,
 'where': 665,
 'why': 646,
 'let': 644,
 'out': 

In [10]:
fr_common = dict(fr_common)
fr_common

{'<start>': 50000,
 '<end>': 50000,
 '.': 39354,
 'je': 10932,
 '?': 8732,
 'est': 7726,
 '-': 7644,
 'vous': 6603,
 'de': 6579,
 'pas': 6551,
 'tu': 5186,
 'il': 5150,
 'ne': 4835,
 'le': 4632,
 'tom': 4413,
 'que': 4408,
 'à': 4267,
 "j'": 4241,
 'a': 4155,
 'la': 3852,
 "n'": 3728,
 'ai': 3634,
 'un': 3484,
 "l'": 3165,
 'nous': 2896,
 'ce': 2585,
 'en': 2560,
 "c'": 2329,
 'une': 2251,
 '!': 2024,
 'suis': 2002,
 "d'": 1998,
 'me': 1900,
 ',': 1897,
 '\u202f': 1874,
 'ça': 1827,
 '\xa0': 1718,
 'elle': 1679,
 'les': 1651,
 'faire': 1321,
 "m'": 1307,
 'moi': 1267,
 "qu'": 1245,
 'y': 1228,
 'êtes': 1216,
 'tout': 1185,
 'veux': 1173,
 'as': 1160,
 'pour': 1147,
 'te': 1130,
 'es': 1066,
 'qui': 1041,
 'fait': 1019,
 'était': 1016,
 "s'": 1012,
 'mon': 1000,
 'être': 991,
 '-ce': 979,
 'plus': 964,
 'avez': 929,
 'très': 899,
 'ils': 878,
 'dans': 858,
 'des': 834,
 'sont': 833,
 'du': 812,
 'avec': 804,
 'au': 772,
 'se': 772,
 'cela': 759,
 "t'": 759,
 'ici': 753,
 'sais': 726,
 '

#### 5. Create a new dataframe *sample_filt* containing only sentence pairs from *sample_df* where all English tokens are in en_common, and all French tokens are in fr_common. Also only use pairs where both sentences contain at most *maxlen* tokens. How many sentence pairs are in the filtered data?


In [11]:
maxlen = 20

# Keep only pairs where both token lists have at most maxlen tokens
sample_filt = sample_df[(sample_df['en_tokens'].str.len() <= maxlen) &
                        (sample_df['fr_tokens'].str.len() <= maxlen)]
sample_filt

Unnamed: 0,en,fr,en_tokens,fr_tokens
0,What did I do wrong?,Qu'ai-je fait de travers ?,"[<start>, what, did, i, do, wrong, ?, <end>]","[<start>, qu', ai, -, je, fait, de, travers, ?, <end>]"
1,"You like olives, don't you?","Vous aimez les olives, pas vrai ?","[<start>, you, like, olives, ,, do, n't, you, ?, <end>]","[<start>, vous, aimez, les, olives, ,, pas, vrai, ?, <end>]"
2,I really appreciate your coming.,J'apprécie vraiment ta venue.,"[<start>, i, really, appreciate, your, coming, ., <end>]","[<start>, j', apprécie, vraiment, ta, venue, ., <end>]"
3,I had to do something.,Il fallait que je fasse quelque chose.,"[<start>, i, had, to, do, something, ., <end>]","[<start>, il, fallait, que, je, fasse, quelque, chose, ., <end>]"
4,I know Tom was first.,Je sais que Tom était le premier.,"[<start>, i, know, tom, was, first, ., <end>]","[<start>, je, sais, que, tom, était, le, premier, ., <end>]"
...,...,...,...,...
49995,He borrowed the car from his friend.,Il a emprunté la voiture à un ami.,"[<start>, he, borrowed, the, car, from, his, friend, ., <end>]","[<start>, il, a, emprunté, la, voiture, à, un, ami, ., <end>]"
49996,Send us a message.,Envoie-nous un message.,"[<start>, send, us, a, message, ., <end>]","[<start>, envoie, -, nous, un, message, ., <end>]"
49997,Who's coming for dinner?,Qui vient dîner ?,"[<start>, who, 's, coming, for, dinner, ?, <end>]","[<start>, qui, vient, dîner, ?, <end>]"
49998,You're lying!,Vous mentez !,"[<start>, you, 're, lying, !, <end>]","[<start>, vous, mentez, !, <end>]"


In [12]:
import numpy as np

# Keep only pairs whose tokens all belong to the common vocabularies
sample_filt = sample_filt[
    sample_filt['en_tokens'].apply(lambda toks: all(t in en_common for t in toks)) &
    sample_filt['fr_tokens'].apply(lambda toks: all(t in fr_common for t in toks))
]
sample_filt

Unnamed: 0,en,fr,en_tokens,fr_tokens
0,What did I do wrong?,Qu'ai-je fait de travers ?,"[<start>, what, did, i, do, wrong, ?, <end>]","[<start>, qu', ai, -, je, fait, de, travers, ?, <end>]"
2,I really appreciate your coming.,J'apprécie vraiment ta venue.,"[<start>, i, really, appreciate, your, coming, ., <end>]","[<start>, j', apprécie, vraiment, ta, venue, ., <end>]"
3,I had to do something.,Il fallait que je fasse quelque chose.,"[<start>, i, had, to, do, something, ., <end>]","[<start>, il, fallait, que, je, fasse, quelque, chose, ., <end>]"
4,I know Tom was first.,Je sais que Tom était le premier.,"[<start>, i, know, tom, was, first, ., <end>]","[<start>, je, sais, que, tom, était, le, premier, ., <end>]"
5,They all drank.,Elles ont toutes bu.,"[<start>, they, all, drank, ., <end>]","[<start>, elles, ont, toutes, bu, ., <end>]"
...,...,...,...,...
49993,That boy showed no fear.,Ce garçon ne montra aucune peur.,"[<start>, that, boy, showed, no, fear, ., <end>]","[<start>, ce, garçon, ne, montra, aucune, peur, ., <end>]"
49994,I couldn't stand it any longer.,Je ne pourrais davantage le supporter.,"[<start>, i, could, n't, stand, it, any, longer, ., <end>]","[<start>, je, ne, pourrais, davantage, le, supporter, ., <end>]"
49995,He borrowed the car from his friend.,Il a emprunté la voiture à un ami.,"[<start>, he, borrowed, the, car, from, his, friend, ., <end>]","[<start>, il, a, emprunté, la, voiture, à, un, ami, ., <end>]"
49997,Who's coming for dinner?,Qui vient dîner ?,"[<start>, who, 's, coming, for, dinner, ?, <end>]","[<start>, qui, vient, dîner, ?, <end>]"
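
The vocabulary filtering in questions 4 and 5 can be illustrated on a tiny corpus. The following is a minimal sketch using only the standard library; the toy sentences and the helper names `top_k_vocab` and `in_vocab` are made up for illustration and are not part of the exercise.

```python
from collections import Counter
from itertools import chain

# Toy tokenized corpus (hypothetical data, for illustration only).
toy_sentences = [
    ["<start>", "i", "like", "cats", ".", "<end>"],
    ["<start>", "i", "like", "dogs", ".", "<end>"],
    ["<start>", "i", "like", "olives", ".", "<end>"],
    ["<start>", "cats", "like", "dogs", ".", "<end>"],
]

def top_k_vocab(sentences, k):
    """Return the set of the k most frequent tokens across all sentences."""
    counts = Counter(chain.from_iterable(sentences))
    return {token for token, _ in counts.most_common(k)}

def in_vocab(tokens, vocab, maxlen):
    """True if the sentence is short enough and uses only vocabulary tokens."""
    return len(tokens) <= maxlen and all(t in vocab for t in tokens)

toy_vocab = top_k_vocab(toy_sentences, k=7)
kept = [s for s in toy_sentences if in_vocab(s, toy_vocab, maxlen=20)]
print(len(kept), "of", len(toy_sentences), "sentences kept")
```

Here "olives" is the only token outside the top 7, so the one sentence containing it is dropped. Storing the vocabulary as a set gives the same membership test as the dictionaries used above.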


#### 6. For convenience we want to work with strings instead of arrays of tokens. Create new columns *en_txt* and *fr_txt* in the dataframe *sample_filt* containing the tokens in a sentence separated by spaces. For example, the column *fr_txt* should include the string "&lt;start&gt; ferme - la juste et écoute ! &lt;end&gt;" in some row.


In [13]:
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

sample_filt['en_txt'] = sample_filt.en_tokens.apply(lambda a: ' '.join(a))
sample_filt['fr_txt'] = sample_filt.fr_tokens.apply(lambda a: ' '.join(a))
sample_filt

Unnamed: 0,en,fr,en_tokens,fr_tokens,en_txt,fr_txt
0,What did I do wrong?,Qu'ai-je fait de travers ?,"[<start>, what, did, i, do, wrong, ?, <end>]","[<start>, qu', ai, -, je, fait, de, travers, ?, <end>]",<start> what did i do wrong ? <end>,<start> qu' ai - je fait de travers ? <end>
2,I really appreciate your coming.,J'apprécie vraiment ta venue.,"[<start>, i, really, appreciate, your, coming, ., <end>]","[<start>, j', apprécie, vraiment, ta, venue, ., <end>]",<start> i really appreciate your coming . <end>,<start> j' apprécie vraiment ta venue . <end>
3,I had to do something.,Il fallait que je fasse quelque chose.,"[<start>, i, had, to, do, something, ., <end>]","[<start>, il, fallait, que, je, fasse, quelque, chose, ., <end>]",<start> i had to do something . <end>,<start> il fallait que je fasse quelque chose . <end>
4,I know Tom was first.,Je sais que Tom était le premier.,"[<start>, i, know, tom, was, first, ., <end>]","[<start>, je, sais, que, tom, était, le, premier, ., <end>]",<start> i know tom was first . <end>,<start> je sais que tom était le premier . <end>
5,They all drank.,Elles ont toutes bu.,"[<start>, they, all, drank, ., <end>]","[<start>, elles, ont, toutes, bu, ., <end>]",<start> they all drank . <end>,<start> elles ont toutes bu . <end>
...,...,...,...,...,...,...
49993,That boy showed no fear.,Ce garçon ne montra aucune peur.,"[<start>, that, boy, showed, no, fear, ., <end>]","[<start>, ce, garçon, ne, montra, aucune, peur, ., <end>]",<start> that boy showed no fear . <end>,<start> ce garçon ne montra aucune peur . <end>
49994,I couldn't stand it any longer.,Je ne pourrais davantage le supporter.,"[<start>, i, could, n't, stand, it, any, longer, ., <end>]","[<start>, je, ne, pourrais, davantage, le, supporter, ., <end>]",<start> i could n't stand it any longer . <end>,<start> je ne pourrais davantage le supporter . <end>
49995,He borrowed the car from his friend.,Il a emprunté la voiture à un ami.,"[<start>, he, borrowed, the, car, from, his, friend, ., <end>]","[<start>, il, a, emprunté, la, voiture, à, un, ami, ., <end>]",<start> he borrowed the car from his friend . <end>,<start> il a emprunté la voiture à un ami . <end>
49997,Who's coming for dinner?,Qui vient dîner ?,"[<start>, who, 's, coming, for, dinner, ?, <end>]","[<start>, qui, vient, dîner, ?, <end>]",<start> who 's coming for dinner ? <end>,<start> qui vient dîner ? <end>


#### 7. Convert the columns *en_txt* and *fr_txt* to lists of word indices using the fit_on_texts() and texts_to_sequences() functions of tensorflow.keras.preprocessing.text.Tokenizer (with parameter filters='', so we do not erase punctuation tokens). Pad these with zeros at the end of each sequence (using tensorflow.keras.preprocessing.sequence.pad_sequences) so that every sequence is of length *maxlen*, and save these as numpy arrays called *en_tensor* and *fr_tensor*. They should both be of shape (30684, 20).


In [14]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer1 = Tokenizer(filters='')
tokenizer1.fit_on_texts(sample_filt['en_txt'])
tts_en = tokenizer1.texts_to_sequences(sample_filt['en_txt']) 
en_tensor = pad_sequences(tts_en, maxlen=20, padding='post')
en_tensor.shape

(30815, 20)

In [15]:
tokenizer2 = Tokenizer(filters='')
tokenizer2.fit_on_texts(sample_filt['fr_txt'])
tts_fr = tokenizer2.texts_to_sequences(sample_filt['fr_txt']) 
fr_tensor = pad_sequences(tts_fr, maxlen=20,padding='post')
fr_tensor.shape

(30815, 20)
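
To make the index conversion more concrete, here is a rough pure-Python imitation of what the steps above produce. It is a sketch, not the Keras implementation. It assumes that words are numbered from 1 in order of decreasing frequency, that 0 is reserved for padding, and that words unseen during fitting are silently dropped (the behaviour when no out-of-vocabulary token is configured). The helper names and toy sentences are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer >= 1, most frequent first (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split(" "))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequence(text, word_index, maxlen):
    """Convert a text to indices, drop unknown words, and zero-pad at the end."""
    seq = [word_index[w] for w in text.split(" ") if w in word_index]
    seq = seq[:maxlen]  # simplification: long inputs are cut at the end here
    return seq + [0] * (maxlen - len(seq))

toy_texts = ["<start> je sais . <end>", "<start> je suis ici . <end>"]
toy_index = fit_word_index(toy_texts)

# "fatigué" was never seen while fitting, so it disappears from the sequence.
padded = to_padded_sequence("<start> je suis fatigué . <end>", toy_index, maxlen=8)
print(padded)
```

The dropped word matters later: a sentence containing words outside the vocabulary reaches the model with those words missing.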

#### 8. Set variables en_nwords and fr_nwords to the number of possible values for elements in *en_tensor* and *fr_tensor* (i.e. the number of words in English and French, including the padding token). What result do you get for these numbers?

In [16]:
en_nwords = np.max(en_tensor)+1
fr_nwords = np.max(fr_tensor)+1
en_nwords, fr_nwords


(2187, 2497)

## Part 2: Building and running the seq2seq model

**Now we will build a seq2seq model for automated translation, as described in lecture. The following imports will help you:**

In [17]:
from tensorflow.keras.layers import Input, Embedding, GRU, Dense, TimeDistributed
from tensorflow.keras import Model
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np

**Use these hyperparameters for the model: (word embedding dimension, and hidden dimension of recurrent layers)**

In [18]:
embedding_dim = 200
hidden_dim = 1024

**We will now build our model. We have to use the Keras Functional API for this since the model is not simple enough to be written with the Sequential API.**

### Questions:

#### 9. Define a Keras model using the Functional API with the following layers:
  * <b>Two input layers *en_inputs* and *fr_inputs*. Define their input shape to be (maxlen,).
  * Two embedding layers: *en_embeddings* (applied to *en_inputs*) and *fr_embeddings* (applied to *fr_inputs*). Define the input and output dimensions using *en/fr_nwords* and *embedding_dim* as needed.
  * A GRU recurrent layer with dimension *hidden_dim* applied to *en_embeddings*. Save its output in the variable *en_output*.
  * A GRU recurrent layer with dimension *hidden_dim* applied to *fr_embeddings*. It should have return_sequences=True, and should start with initial_state equal to *en_output*. Save this layer's output in the variable *fr_gru_outputs*.
  * Finally apply a dense layer, wrapped in TimeDistributed, to *fr_gru_outputs*. This dense layer should have softmax activation (so we get probabilities over words in French), and dimension *fr_nwords*. Save the output of this layer as *fr_outputs*.
  * Define the model as *model = Model([en_inputs, fr_inputs], fr_outputs)*, check its architecture with *model.summary()*, and its input and output shapes with *model.input_shape* and *model.output_shape*.</b>


In [19]:
import tensorflow as tf

def get_model():
    en_inputs = Input(shape=(maxlen, ))
    fr_inputs = Input(shape=(maxlen, ))

    en_embeddings = Embedding(input_dim = en_nwords, 
                              output_dim = embedding_dim)(en_inputs)
    fr_embeddings = Embedding(input_dim = fr_nwords, 
                              output_dim = embedding_dim)(fr_inputs)
    
    en_output = GRU(hidden_dim)(en_embeddings)
    
    fr_gru_outputs = GRU(hidden_dim, 
                         return_sequences=True)(fr_embeddings, 
                                                initial_state = en_output)

    fr_outputs = tf.keras.layers.TimeDistributed(Dense(fr_nwords, 
                                                       activation='softmax'))(fr_gru_outputs)
                                                       
    model = Model([en_inputs, fr_inputs], fr_outputs)
    

    return model

In [20]:
nmt = get_model()

In [21]:
nmt.input_shape, nmt.output_shape

([(None, 20), (None, 20)], (None, 20, 2497))

In [22]:
nmt.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 20)]         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 20)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 20, 200)      437400      input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 20, 200)      499400      input_2[0][0]                    
______________________________________________________________________________________________
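
As a sanity check on the summary, the parameter counts of the embedding layers can be reproduced by hand. The sketch below uses the vocabulary sizes and dimensions reported earlier in this notebook under new, illustrative variable names; the dense-layer count is computed from the standard weights-plus-biases formula and is not taken from the (truncated) summary.

```python
# Values reported earlier in this notebook (copied here under illustrative names).
en_vocab_size, fr_vocab_size = 2187, 2497
emb_dim, hid_dim = 200, 1024

# An embedding layer stores one emb_dim-sized vector per vocabulary entry.
en_embedding_params = en_vocab_size * emb_dim
fr_embedding_params = fr_vocab_size * emb_dim

# The output layer maps each hidden state to one score per French word:
# a weight matrix plus one bias per output unit.
dense_params = hid_dim * fr_vocab_size + fr_vocab_size

print(en_embedding_params, fr_embedding_params, dense_params)
```

The first two numbers match the 437400 and 499400 shown for `embedding` and `embedding_1` above.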

  
#### 10. The model will predict the next token in French given the English sentence and the French tokens translated so far. To get the output labels, apply *np.roll* to *fr_tensor* so that the first column is the second token in French, the second column is the third token in French, and so forth. Apply *to_categorical* from Keras to this so that the output labels will be one-hot encoded, and save this in the variable *fr_tensor_2predict*. *fr_tensor_2predict* should have shape (30684, 20, 2497).


In [23]:
import tensorflow as tf

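tf.keras.backend.clear_session()  # For easy reset of notebook state

# Shift every row left by one position so that column t holds token t+1.
# The last column wraps around and receives the row's first token (<start>).
Y = np.roll(fr_tensor, -1, axis=1)
fr_tensor_2predict = tf.keras.utils.to_categorical(Y)
fr_tensor_2predict.shape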

(30815, 20, 2497)
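
The label shift is easier to see on a small array. The following sketch uses made-up token ids (0 for padding, 1 for &lt;start&gt;, 2 for &lt;end&gt;) and `np.eye` as a stand-in for `to_categorical`. The final cleanup of the wrapped-around column is an optional assumption, not something the exercise asks for.

```python
import numpy as np

# Two toy padded rows (hypothetical ids: 0 = padding, 1 = <start>, 2 = <end>).
toy = np.array([[1, 5, 6, 2, 0],
                [1, 7, 2, 0, 0]])

row_roll = np.roll(toy, -1, axis=1)  # shift every row left by one position
flat_roll = np.roll(toy, -1)         # no axis: the flattened array is shifted

# Optional cleanup (an assumption, not required by the exercise): the last
# column now holds the wrapped-around <start>, so overwrite it with padding.
labels = row_roll.copy()
labels[:, -1] = 0

# One-hot encode with 8 classes; same layout as to_categorical(labels, 8).
one_hot = np.eye(8, dtype="float32")[labels]

print(row_roll.tolist())
print(one_hot.shape)
```

This prints `[[5, 6, 2, 0, 1], [7, 2, 0, 0, 1]]` and `(2, 5, 8)`. In this toy case `flat_roll` equals `row_roll`, but only because every row starts with the same token, so the value that wraps in from the next row is identical. The same holds for *fr_tensor*, where every row starts with &lt;start&gt;.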


#### 11. Compile the model with adam optimizer and categorical_crossentropy loss, and fit it on input *\[en_tensor, fr_tensor\]* and output *fr_tensor_2predict*. Recommended parameters are *batch_size = 64, epochs = 100, validation_split = 0.2*. Run training until validation loss stops decreasing by using early stopping ( *model.fit(..., callbacks = \[EarlyStopping()\])* ).


In [24]:
tf.config.experimental_run_functions_eagerly(True)

nmt.compile(optimizer='adam',
            metrics=["accuracy"],
            loss='categorical_crossentropy')
nmt.run_eagerly = True

nmt.fit(x=[en_tensor, fr_tensor],
        y=fr_tensor_2predict,
        batch_size=64,
        epochs=100,
        validation_split=0.2,
        # EarlyStopping monitors val_loss by default, as the question asks
        callbacks=[EarlyStopping()])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100


<tensorflow.python.keras.callbacks.History at 0x7f47e0f13550>
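
The stopping rule itself is simple. Below is a simplified sketch of early stopping with default settings: training ends at the first epoch whose validation loss fails to improve on the best value seen so far. It is not the Keras code, and the loss values are invented for illustration.

```python
def epochs_run(val_losses):
    """Return the number of epochs run before validation loss stops improving."""
    best = float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss      # improvement: remember it and keep training
        else:
            return epoch     # no improvement: stop after this epoch
    return len(val_losses)   # never stalled within the epoch budget

toy_val_losses = [1.90, 1.42, 1.15, 1.03, 1.06, 0.99]  # made-up values
stopped_at = epochs_run(toy_val_losses)
print("stopped after epoch", stopped_at)
```

With these values training stops after epoch 5, even though epoch 6 would have been better. A stricter rule like this trades some final accuracy for shorter training.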


#### 12. Write a function *run_translation(en_sentence)* that takes in a sentence *en_sentence* in English and translates it into French. You should first tokenize the sentence with spaCy, add &lt;start&gt; and &lt;end&gt; tokens and convert into padded vectors. Start with &lt;start&gt; as the first token in the translation, and then use the argmax of values output by model.predict() to predict the next token. Repeat this until &lt;end&gt; is output or until the output has length *maxlen*. <br>*Hint: This is very similar to the Trump tweet generation function.*


In [31]:

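def run_translation(sentence):
    # Tokenize with spaCy, add <start>/<end>, and convert to a padded index vector
    tokens = tokenize(en_model, sentence)
    en_seq = pad_sequences(tokenizer1.texts_to_sequences([tokens]),
                           maxlen=maxlen, padding='post')

    # Look up the special token indices instead of hardcoding them
    start_idx = tokenizer2.word_index['<start>']
    end_idx = tokenizer2.word_index['<end>']

    # The decoder input starts with <start>; later positions are filled in step by step
    fr_seq = np.zeros((1, maxlen))
    fr_seq[0, 0] = start_idx
    target = []
    for i in range(maxlen - 1):  # stop before writing past the last position
        prediction = nmt.predict([en_seq, fr_seq])[0]
        indx = np.argmax(prediction[i])  # most likely token after position i
        if indx == end_idx:
            break
        fr_seq[0, i + 1] = indx
        target.append(indx)
    print(sentence, '=>', tokenizer2.sequences_to_texts([target])[0], '\n')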
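
The decoding loop can also be studied in isolation from the trained network. The sketch below shows greedy decoding with a toy scoring function standing in for `model.predict`; the function names, token ids, and the fixed transition table are all hypothetical.

```python
def greedy_decode(next_token_scores, start_id, end_id, maxlen):
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    output = [start_id]
    while len(output) < maxlen:
        scores = next_token_scores(output)  # one score per vocabulary id
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if best == end_id:
            break
        output.append(best)
    return output[1:]  # drop the leading <start>

def toy_scores(prefix):
    """Hypothetical stand-in for the model: always favours one fixed successor."""
    vocab_size = 6                      # ids: 0 pad, 1 <start>, 2 <end>, 3..5 words
    successor = {1: 3, 3: 4, 4: 5, 5: 2}  # <start> -> 3 -> 4 -> 5 -> <end>
    scores = [0.0] * vocab_size
    scores[successor[prefix[-1]]] = 1.0
    return scores

decoded = greedy_decode(toy_scores, start_id=1, end_id=2, maxlen=20)
print(decoded)
```

Decoding stops either when &lt;end&gt; wins the argmax or when the output reaches *maxlen* tokens, which mirrors the two stopping conditions in the question.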



#### 13. Uncomment the cell below and check the translations of the given sentences. 


In [33]:
# uncomment this cell for question 13
run_translation("This is not good!")
run_translation("This is scary.")
run_translation("I have a cat.")
run_translation("The dog is happy.")

This is not good! => ça n' est pas bon   ! 

This is scary. => c' est utile . 

I have a cat. => j' ai une chatte . 

The dog is happy. => le chien est exact . 




### Bonus: 
#### Try translating some other sentences from English to French. Do you see any obvious problems with the results?

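Yes, the results are not always correct. Even on the short test sentences the model sometimes picks a wrong word ("scary" became "utile" and "happy" became "exact"). The longer sentences below produce fluent but unrelated French, likely because words missing from the training vocabulary are dropped before they reach the model.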

In [39]:
run_translation('we do not erase punctuation tokens')
run_translation('Try translating some other sentences')

run_translation('Uncomment the cell below and check the translations')

run_translation('restart the runtime after running the cell below')


we do not erase punctuation tokens => nous ne avions pas le choix . 

Try translating some other sentences => essaie de l' aide . 

Uncomment the cell below and check the translations => le voleur s' est mis au cinéma . 

restart the runtime after running the cell below => le voleur s' est mis à aboyer . 



In [42]:
print("GOOGLE TRANSLATE:\nwe do not erase punctuation tokens =>  nous n'effaçons pas les jetons de ponctuation\n")
print("Try translating some other sentences =>  Essayez de traduire d'autres phrases\n")
print("Uncomment the cell below and check the translations =>  Décommentez la cellule ci-dessous et vérifiez les traductions\n")
print("restart the runtime after running the cell below => redémarrez le runtime après avoir exécuté la cellule ci-dessous\n")


GOOGLE TRANSLATE:
we do not erase punctuation tokens =>  nous n'effaçons pas les jetons de ponctuation

Try translating some other sentences =>  Essayez de traduire d'autres phrases

Uncomment the cell below and check the translations =>  Décommentez la cellule ci-dessous et vérifiez les traductions

restart the runtime after running the cell below => redémarrez le runtime après avoir exécuté la cellule ci-dessous

