# Traduction de texte 
### Oribes Lucas, Payet Marine, Piedra Tristan, Quetin Sebastien, Testi Joevin



- **Preprocessing** - Dans un premier temps pour passer des données textuelles à un réseau de neuronnes, nous devons convertir le texte en entier.


In [1]:
%load_ext autoreload
%aimport helper, tests
%autoreload 1

In [2]:
import collections

import helper
import numpy as np
import project_tests as tests
import tensorflow 


from keras import utils as ut
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.layers import GRU, Input, Dense, TimeDistributed, Activation, RepeatVector, Bidirectional, Dropout, LSTM
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
from keras.losses import categorical_crossentropy

Using TensorFlow backend.


## Dataset
Dans un premier temps on essaye de créer des réseaux qui fonctionnent sur de petites phrases. On essayera dans un second temps des traductions plus importantes.

In [3]:
english_sentences = helper.load_data('data/small_vocab_en')
french_sentences = helper.load_data('data/small_vocab_fr')
print('Dataset Loaded')

Dataset Loaded


Le dataset french sentences est la traduction du dataset french sentences.

In [4]:
for sample_i in range(5):
    print('English sample {}:  {}'.format(sample_i + 1, english_sentences[sample_i]))
    print('French sample {}:  {}\n'.format(sample_i + 1, french_sentences[sample_i]))

English sample 1:  new jersey is sometimes quiet during autumn , and it is snowy in april .
French sample 1:  new jersey est parfois calme pendant l' automne , et il est neigeux en avril .

English sample 2:  the united states is usually chilly during july , and it is usually freezing in november .
French sample 2:  les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .

English sample 3:  california is usually quiet during march , and it is usually hot in june .
French sample 3:  california est généralement calme en mars , et il est généralement chaud en juin .

English sample 4:  the united states is sometimes mild during june , and it is cold in september .
French sample 4:  les états-unis est parfois légère en juin , et il fait froid en septembre .

English sample 5:  your least liked fruit is the grape , but my least liked is the apple .
French sample 5:  votre moins aimé fruit est le raisin , mais mon moins aimé est la pomme .



On étudie la complexité du jeu de données.

In [5]:
english_words_counter = collections.Counter([word for sentence in english_sentences for word in sentence.split()])
french_words_counter = collections.Counter([word for sentence in french_sentences for word in sentence.split()])

print('{} English words.'.format(len([word for sentence in english_sentences for word in sentence.split()])))
print('{} unique English words.'.format(len(english_words_counter)))
print('10 Most common words in the English dataset:')
print('"' + '" "'.join(list(zip(*english_words_counter.most_common(10)))[0]) + '"')
print()
print('{} French words.'.format(len([word for sentence in french_sentences for word in sentence.split()])))
print('{} unique French words.'.format(len(french_words_counter)))
print('10 Most common words in the French dataset:')
print('"' + '" "'.join(list(zip(*french_words_counter.most_common(10)))[0]) + '"')

1823250 English words.
227 unique English words.
10 Most common words in the English dataset:
"is" "," "." "in" "it" "during" "the" "but" "and" "sometimes"

1961295 French words.
355 unique French words.
10 Most common words in the French dataset:
"est" "." "," "en" "il" "les" "mais" "et" "la" "parfois"



## Preprocessing
Pour pouvoir donner nos entrées au réseau de neuronnes on a besoin de transformer nos données textuelles en numériques. Pour cela on va "tokenizer" nos données c'est  à dire leur attribuer un nombre dans la base de mots, soit francais, soit anglais. Cela permettra au réseau d'identifier les mots.
Ensuite, comme les phrases francaises ont une taille de 21 mots maximum et les phrases anglaises 15 mots, on va rajouter des "0" à la fin des phrases anglaises. C'est ce que l'on appelle: le padding.


In [6]:
def tokenize(x):
    """
    Tokenize x
    :param x: List of sentences/strings to be tokenized
    :return: Tuple of (tokenized x data, tokenizer used to tokenize x)
    """
    # TODO: Implement
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    return tokenizer.texts_to_sequences(x), tokenizer

tests.test_tokenize(tokenize)

# Tokenize Example output
text_sentences = [
    'The quick brown fox jumps over the lazy dog .',
    'By Jove , my quick study of lexicography won a prize .',
    'This is a short sentence .']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))

{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}

Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]


In [7]:
def pad(x, length=None):
    """
    Pad x
    :param x: List of sequences.
    :param length: Length to pad the sequence to.  If None, use length of longest sequence in x.
    :return: Padded numpy array of sequences
    """
    # TODO: Implement
    return pad_sequences(x, maxlen=length, padding='post')

tests.test_pad(pad)

# Pad Tokenized output
test_pad = pad(text_tokenized)
for sample_i, (token_sent, pad_sent) in enumerate(zip(text_tokenized, test_pad)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(np.array(token_sent)))
    print('  Output: {}'.format(pad_sent))

Sequence 1 in x
  Input:  [1 2 4 5 6 7 1 8 9]
  Output: [1 2 4 5 6 7 1 8 9 0]
Sequence 2 in x
  Input:  [10 11 12  2 13 14 15 16  3 17]
  Output: [10 11 12  2 13 14 15 16  3 17]
Sequence 3 in x
  Input:  [18 19  3 20 21]
  Output: [18 19  3 20 21  0  0  0  0  0]


In [8]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk

preproc_english_sentences, preproc_french_sentences, english_tokenizer, french_tokenizer =\
    preprocess(english_sentences, french_sentences)
    
max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

Data Preprocessed
Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 199
French vocabulary size: 344


Enfin il nous faut une fonction qui permet une fois que le réseau nous sort un ID de mot dans une base, de retrouver ce mot.

In [9]:
def logits_to_text(logits, tokenizer):
    """
    Turn logits from a neural network into text using the tokenizer
    :param logits: Logits from a neural network
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the logits
    """
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

print('`logits_to_text` function loaded.')

`logits_to_text` function loaded.


In [10]:
from nltk.translate.bleu_score import corpus_bleu

# evaluate the skill of the model
def evaluate_model(model,  sources, raw_dataset,pred):
    actual, predicted = list(), list()

    for i in range(pred.shape[0]):
        
        # translate encoded source text
        translation = logits_to_text(pred[i], french_tokenizer)
        #raw_target=raw_dataset[i][0]
        raw_target = logits_to_text(raw_dataset[i], french_tokenizer)
        if i < 10:
            print(' target=[%s], predicted=[%s]' % ( raw_target, translation))
        actual.append([raw_target.split()])
        predicted.append(translation.split())
        
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))


Nous allons tester plusieurs réseaux de neuronnes. UN réseau simple, un encoder-decoder, un bidirectionnel, un mélange de ces réseaux. Toutes ces méthodes consistent en un "one hot encoding" de tous les mots avant de les passer au réseau de neuronnes. Nous allons ensuite simplement garder la liste des 'ID' de mots dans la phrase pour la passer à des réseaux qui fonctionnent avec une couche embedding qui permet de vectoriser les mots. Enfin on essayera une méthode qui consiste à combiner toutes ces méthodes.

### Model 1: RNN (IMPLEMENTATION)
![RNN](images/rnn.png)


In [11]:
# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, max_french_sequence_length)
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))

preproc_french_sentencesbis=ut.to_categorical(preproc_french_sentences)
                                              
from sklearn.model_selection import train_test_split

X_train_embed, X_test_embed, y_train_embed, y_test_embed = train_test_split(tmp_x, preproc_french_sentencesbis, test_size=0.33, random_state=42)
X_train = ut.to_categorical(X_train_embed)
X_test = ut.to_categorical(X_test_embed) 
y_train = y_train_embed
y_test = y_test_embed

In [12]:
def simple_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a basic RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Hyperparameters
    learning_rate = 0.005
    
    # TODO: Build the layers
    model = Sequential()
    model.add(GRU(256, input_shape=input_shape[1:], return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size+1, activation='softmax'))) 
    

    # Compile model
    model.compile(loss=categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

#tests.test_simple_model(simple_model)



# Train the neural network
simple_rnn_model = simple_model(
    X_train.shape,
    max_french_sequence_length,
    english_vocab_size,
    french_vocab_size)

print(simple_rnn_model.summary())

simple_rnn_model.fit(X_train,y_train, batch_size = 5000, epochs=30, validation_split=0.2)


# Print prediction(s)
print(logits_to_text(simple_rnn_model.predict(X_test)[0], french_tokenizer))

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru_1 (GRU)                  (None, 21, 256)           350976    
_________________________________________________________________
time_distributed_1 (TimeDist (None, 21, 1024)          263168    
_________________________________________________________________
dropout_1 (Dropout)          (None, 21, 1024)          0         
_________________________________________________________________
time_distributed_2 (TimeDist (None, 21, 345)           353625    
Total params: 967,769
Trainable params: 967,769
Non-trainable params: 0
_________________________________________________________________
None
Instructions for updating:
Use tf.cast instead.
Train on 73892 samples, vali

In [13]:
predictions=simple_rnn_model.predict(X_test)

In [14]:

for i in range(len(predictions[:20])):
    print('PREDICTION :',logits_to_text(predictions[i], french_tokenizer))
    print('ATTENDU :',logits_to_text(y_test[i], french_tokenizer))

PREDICTION : californie est parfois sec en mois de mai et il est parfois merveilleux en février <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : californie est parfois sec au mois de mai et il est parfois merveilleux en février <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : californie est jamais chaud en l'automne mais mais il est sec sec mars mars <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : californie est jamais chaud pendant l' automne mais il est jamais sec en mars <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : les états unis est parfois pluvieux en janvier mais il est doux en mai <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : les états unis est parfois pluvieux en janvier mais il est doux en mai <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : ils aiment les pommes les oranges et les poires <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : ils aiment les pommes les oranges et les poires <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <P

In [15]:
# test on some training sequences
print('TRAIN')
pred1=simple_rnn_model.predict(X_train)
evaluate_model(simple_rnn_model , X_train, y_train,pred1)
# test on some test sequences
print('TEST')
evaluate_model(simple_rnn_model, X_test, y_test,predictions)

TRAIN
 target=[les états unis est jamais belle au mois de novembre et il est jamais tranquille en juillet <PAD> <PAD> <PAD> <PAD>], predicted=[les états unis est jamais beau en mois de novembre et il est jamais tranquille en juillet <PAD> <PAD> <PAD> <PAD>]
 target=[le singe est votre animal préféré moins <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[les singe est votre animal préféré moins <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[californie ne fait jamais froid pendant l' hiver mais il est jamais tranquille à l' automne <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[californie est jamais jamais froid pendant l' hiver mais il est jamais tranquille à l' automne <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[la pêche est leur fruit le plus cher mais le citron est le plus aimé <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[les pêche est leur fruit le plus cher mais le citron est le plus aimé <PA

### Model 2: Bidirectional RNNs (IMPLEMENTATION)
![RNN](images/bidirectional.png)

Un réseau RNN ne peut pas voir le futur de ce qu'on lui passe en entrée, il peut seulement voir son passé. C'est pour pallier à cette limite que nous utilisons les bidirectional recurrent neural networks. Ils sont capables de voir le futur de nos données.

In [16]:
def bd_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a bidirectional RNN model on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # TODO: Implement

    # Hyperparameters
    learning_rate = 0.003
    
    # TODO: Build the layers
    model = Sequential()
    model.add(Bidirectional(GRU(128, return_sequences=True), input_shape=input_shape[1:]))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax'))) 

    # Compile model
    model.compile(loss=categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

#tests.test_bd_model(bd_model)

# TODO: Reshape the input
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))


preproc_french_sentencesbis=ut.to_categorical(preproc_french_sentences)
tmp_xbis=ut.to_categorical(tmp_x)
# TODO: Train and Print prediction(s)
bde_model = bd_model(
    X_train.shape,
    preproc_french_sentencesbis.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)

bde_model.summary()

bde_model.fit(X_train, y_train, batch_size=1024, epochs=10, validation_split=0.2)



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_1 (Bidirection (None, 21, 256)           252672    
_________________________________________________________________
time_distributed_3 (TimeDist (None, 21, 1024)          263168    
_________________________________________________________________
dropout_2 (Dropout)          (None, 21, 1024)          0         
_________________________________________________________________
time_distributed_4 (TimeDist (None, 21, 345)           353625    
Total params: 869,465
Trainable params: 869,465
Non-trainable params: 0
_________________________________________________________________
Train on 73892 samples, validate on 18474 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fbc60eaadd8>

In [17]:
predictionsBDE=bde_model.predict(X_test)

In [18]:
for i in range(len(predictionsBDE[:20])):
    print('PREDICTION :',logits_to_text(predictionsBDE[i], french_tokenizer))
    print('ATTENDU :',logits_to_text(y_test[i], french_tokenizer))

PREDICTION : californie est parfois sec au mois de mai et il est parfois merveilleux en février <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : californie est parfois sec au mois de mai et il est parfois merveilleux en février <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : californie est jamais chaud à l' mais il est jamais sec en mars <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : californie est jamais chaud pendant l' automne mais il est jamais sec en mars <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : les états unis est parfois pluvieux en janvier mais il est doux en mai <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : les états unis est parfois pluvieux en janvier mais il est doux en mai <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : ils aiment les pommes les oranges et les poires <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : ils aiment les pommes les oranges et les poires <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <P

In [19]:
# test on some training sequences
print('TRAIN')
pred1=bde_model.predict(X_train)
evaluate_model(bde_model , X_train, y_train,pred1)
# test on some test sequences
print('TEST')
evaluate_model(bde_model, X_test, y_test,predictionsBDE)

TRAIN
 target=[les états unis est jamais belle au mois de novembre et il est jamais tranquille en juillet <PAD> <PAD> <PAD> <PAD>], predicted=[les états unis est jamais belle en mois de novembre et il est jamais tranquille en juillet <PAD> <PAD> <PAD> <PAD>]
 target=[le singe est votre animal préféré moins <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[le singe est votre animal préféré moins <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[californie ne fait jamais froid pendant l' hiver mais il est jamais tranquille à l' automne <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[californie ne fait jamais froid pendant l' hiver mais il est jamais tranquille à l' automne <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[la pêche est leur fruit le plus cher mais le citron est le plus aimé <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[la pêche est leur fruit le plus cher mais le citron est le plus aimé <PAD> <

### Model 3: Encoder-Decoder

Regardons à présent les modèles encoder-decocer. Ce modèle est composé d'un encoder et d'un decoder. L'encoder crée une matrice qui représente les phrases. Le décodeur prend cette matrice comme entrée et prédit les translations en sortie. 



In [20]:
def encdec_model(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train an encoder-decoder model on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    
    # Hyperparameters
    learning_rate = 0.001
    
    # Build the layers    
    model = Sequential()
    # Encoder
    model.add(GRU(256, input_shape=input_shape[1:], go_backwards=True))
    model.add(RepeatVector(output_sequence_length))
    # Decoder
    model.add(GRU(256, return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))

    # Compile model
    model.compile(loss=categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    
    return model

#tests.test_encdec_model(encdec_model)

# Reshape the input
#tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
#tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2], 1))


#preproc_french_sentencesbis=ut.to_categorical(preproc_french_sentences)
#tmp_xbis=ut.to_categorical(tmp_x)
# Train and Print prediction(s)
encdec_rnn_model = encdec_model(
    X_train.shape,
    preproc_french_sentencesbis.shape[1],
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1)

encdec_rnn_model.summary()

encdec_rnn_model.fit(X_train, y_train, batch_size=1024, epochs=10, validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru_3 (GRU)                  (None, 256)               350976    
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 21, 256)           0         
_________________________________________________________________
gru_4 (GRU)                  (None, 21, 256)           393984    
_________________________________________________________________
time_distributed_5 (TimeDist (None, 21, 1024)          263168    
_________________________________________________________________
dropout_3 (Dropout)          (None, 21, 1024)          0         
_________________________________________________________________
time_distributed_6 (TimeDist (None, 21, 345)           353625    
Total params: 1,361,753
Trainable params: 1,361,753
Non-trainable params: 0
_________________________________________________________________


<keras.callbacks.History at 0x7fbc5ef90d68>

In [21]:
predictionsencdec=encdec_rnn_model.predict(X_test)

In [22]:
for i in range(len(predictionsencdec[:20])):
    print('PREDICTION :',logits_to_text(predictionsencdec[i], french_tokenizer))
    print('ATTENDU :',logits_to_text(y_test[i], french_tokenizer))

PREDICTION : californie est parfois parfois au mois de mai et il est parfois en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : californie est parfois sec au mois de mai et il est parfois merveilleux en février <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : california est jamais froid au l' de il il est parfois en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : californie est jamais chaud pendant l' automne mais il est jamais sec en mars <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : les états unis est parfois froid en avril et il est en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : les états unis est parfois pluvieux en janvier mais il est doux en mai <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : ils aiment les les les les et les <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : ils aiment les pommes les oranges et les poires <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PA

In [23]:
# test on some training sequences
print('TRAIN')
pred1=encdec_rnn_model.predict(X_train)
evaluate_model(encdec_rnn_model , X_train, y_train,pred1)
# test on some test sequences
print('TEST')
evaluate_model(encdec_rnn_model, X_test, y_test,predictionsencdec)

TRAIN
 target=[les états unis est jamais belle au mois de novembre et il est jamais tranquille en juillet <PAD> <PAD> <PAD> <PAD>], predicted=[les états unis est jamais froid en mois et il est est jamais en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[le singe est votre animal préféré moins <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[le est était animal animal préféré <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[californie ne fait jamais froid pendant l' hiver mais il est jamais tranquille à l' automne <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[california est jamais parfois froid pendant l' mai il il est parfois en en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[la pêche est leur fruit le plus cher mais le citron est le plus aimé <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[la poire est leur fruit le plus cher mais la est est plus plus aimé <PAD> <PAD> <PAD> <PAD> 

### Model 4: Custom (IMPLEMENTATION)


Nous allons maintenant créer un modèle qui comprend à la fois de l'embedding et un bidirectionnel rnn dans un même modèle. 

In [24]:
def model_final(input_shape, output_sequence_length, english_vocab_size, french_vocab_size):
    """
    Build and train a model that incorporates embedding, encoder-decoder, and bidirectional RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # TODO: Implement

    # Hyperparameters
    learning_rate = 0.003
    
    # Build the layers    
    model = Sequential()
    # Embedding
    #model.add(Embedding(english_vocab_size, 128, input_length=input_shape[1],
                         #input_shape=input_shape[1:]))
    
    # Encoder
    model.add(Bidirectional(GRU(128),input_shape=input_shape[1:]))
    model.add(RepeatVector(output_sequence_length))
    # Decoder
    model.add(Bidirectional(GRU(128, return_sequences=True)))
    model.add(TimeDistributed(Dense(512, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation='softmax')))
    model.compile(loss=categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    
    return model
model_melange=model_final(X_train.shape,y_train.shape[1],
                        len(english_tokenizer.word_index)+1,
                        len(french_tokenizer.word_index)+1)
model_melange.summary()
model_melange.fit(X_train, y_train, batch_size=1024, epochs=25, validation_split=0.2)


#tests.test_model_final(model_final)

print('Final Model Loaded')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_2 (Bidirection (None, 256)               252672    
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 21, 256)           0         
_________________________________________________________________
bidirectional_3 (Bidirection (None, 21, 256)           295680    
_________________________________________________________________
time_distributed_7 (TimeDist (None, 21, 512)           131584    
_________________________________________________________________
dropout_4 (Dropout)          (None, 21, 512)           0         
_________________________________________________________________
time_distributed_8 (TimeDist (None, 21, 345)           176985    
Total params: 856,921
Trainable params: 856,921
Non-trainable params: 0
_________________________________________________________________
Trai

In [25]:
predictions_melange=model_melange.predict(X_test)

In [26]:
for i in range(len(predictions_melange[:20])):
    print('PREDICTION :',logits_to_text(predictions_melange[i], french_tokenizer))
    print('ATTENDU :',logits_to_text(y_test[i], french_tokenizer))

PREDICTION : californie est parfois sec au mois de mai et il est parfois merveilleux en février <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : californie est parfois sec au mois de mai et il est parfois merveilleux en février <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : californie est jamais chaud pendant l' automne mais il est jamais sec en mars <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : californie est jamais chaud pendant l' automne mais il est jamais sec en mars <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : les états unis est parfois pluvieux en janvier mais il est doux en mai <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : les états unis est parfois pluvieux en janvier mais il est doux en mai <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : ils aiment les pommes les oranges et les poires <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : ils aiment les pommes les oranges et les poires <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

In [27]:
# test on some training sequences
print('TRAIN')
pred1=model_melange.predict(X_train)
evaluate_model(model_melange , X_train, y_train,pred1)
# test on some test sequences
print('TEST')
evaluate_model(model_melange, X_test, y_test,predictions_melange)

TRAIN
 target=[les états unis est jamais belle au mois de novembre et il est jamais tranquille en juillet <PAD> <PAD> <PAD> <PAD>], predicted=[les états unis est jamais belle au mois de novembre et il est jamais jamais en en <PAD> <PAD> <PAD> <PAD>]
 target=[le singe est votre animal préféré moins <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[le singe est votre animal préféré moins <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[californie ne fait jamais froid pendant l' hiver mais il est jamais tranquille à l' automne <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[californie ne fait jamais froid pendant l' hiver mais il est jamais tranquille à l' automne <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[la pêche est leur fruit le plus cher mais le citron est le plus aimé <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[la pêche est leur fruit le plus cher mais le citron est son plus aimé <PAD> <PAD> <PA

### Modèle 5: Embedding (IMPLEMENTATION)
![RNN](images/embedding.png)


Dans ce modèle on tente de transformer les mots en vecteurs dont les mots proches en sens sont proches en vecteurs. On prendra ici en entrée X_train_embed que nous n'vons pas "one-hot" encodé.

### Embedding

In [28]:
X_train_embed=X_train_embed.reshape((X_train_embed.shape[0],X_train_embed.shape[1]))
X_test_embed=X_test_embed.reshape((X_test_embed.shape[0],X_test_embed.shape[1]))

In [29]:
print(X_test_embed.shape)
print(X_train_embed.shape)

(45495, 21)
(92366, 21)


In [30]:
print(y_test_embed.shape)
print(y_train_embed.shape)

(45495, 21, 345)
(92366, 21, 345)


In [31]:
def embed_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps,input_shape):
    """
    Build and train a RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    model = Sequential()
    model.add(Embedding(src_vocab,output_dim=128, mask_zero=False,input_shape=input_shape[1:]))
    #mask_zero doit etre a false sinon il ne reconnaitra jamais les pad et mettra des mots a la place
    model.add(GRU(256,return_sequences=True))
    model.add(TimeDistributed(Dense(512, activation='relu')))
    model.add(GRU(256,return_sequences=True))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size+1, activation='softmax')))

    return model
learning_rate = 0.005


embed_rnn_model = embed_model(
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1,
    y_train_embed.shape[1:],
    21,
    X_train_embed.shape)

embed_rnn_model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])

embed_rnn_model.summary()

embed_rnn_model.fit(X_train_embed, y_train_embed, batch_size=1024, epochs=10, validation_split=0.2)



 


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 21, 128)           25600     
_________________________________________________________________
gru_7 (GRU)                  (None, 21, 256)           295680    
_________________________________________________________________
time_distributed_9 (TimeDist (None, 21, 512)           131584    
_________________________________________________________________
gru_8 (GRU)                  (None, 21, 256)           590592    
_________________________________________________________________
dropout_5 (Dropout)          (None, 21, 256)           0         
_________________________________________________________________
time_distributed_10 (TimeDis (None, 21, 345)           88665     
Total params: 1,132,121
Trainable params: 1,132,121
Non-trainable params: 0
_________________________________________________________________


<keras.callbacks.History at 0x7fbb81a26d30>

In [32]:
# TODO: Print prediction(s)
pred_embed=embed_rnn_model.predict(X_test_embed)
for i in range(len(pred_embed[:20])):
    print('PREDICTION :',logits_to_text(pred_embed[i], french_tokenizer))
    print('ATTENDU :',logits_to_text(y_test[i], french_tokenizer))

PREDICTION : californie est parfois sec en mois de mai et il est parfois merveilleux en février <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : californie est parfois sec au mois de mai et il est parfois merveilleux en février <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : californie est jamais chaud en l'automne mais il est jamais jamais en mars <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : californie est jamais chaud pendant l' automne mais il est jamais sec en mars <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : les états unis est parfois pluvieux en janvier mais il est doux en mai <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : les états unis est parfois pluvieux en janvier mais il est doux en mai <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : ils aiment les pommes les oranges et les poires <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : ils aiment les pommes les oranges et les poires <PAD> <PAD> <PAD> <PAD> <PAD> <PA

In [33]:
# test on some training sequences
print('TRAIN')
pred1=embed_rnn_model.predict(X_train_embed)
evaluate_model(embed_rnn_model , X_train_embed, y_train_embed,pred1)
# test on some test sequences
print('TEST')
evaluate_model(embed_rnn_model, X_test_embed, y_test_embed,pred_embed)

TRAIN
 target=[les états unis est jamais belle au mois de novembre et il est jamais tranquille en juillet <PAD> <PAD> <PAD> <PAD>], predicted=[les états unis est jamais beau en mois de novembre et il est jamais tranquille en juillet <PAD> <PAD> <PAD> <PAD>]
 target=[le singe est votre animal préféré moins <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[les singe est votre animal préféré moins <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[californie ne fait jamais froid pendant l' hiver mais il est jamais tranquille à l' automne <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[californie est jamais jamais froid pendant l' hiver mais il est jamais tranquille à l' automne <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[la pêche est leur fruit le plus cher mais le citron est le plus aimé <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[les pêche est leur fruit le plus cher mais le citron est le plus aimé <PA

## MEGA COMBO

Ce modèle est un mélange des trois types de modèles précédement utilisés: bidirectionel,embedding et encoder-decoder.

In [34]:
def mega_combo_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps,input_shape):
    """
    Build and train a RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    model = Sequential()
    model.add(Embedding(src_vocab,15, mask_zero=False, input_shape=input_shape[1:]))
    model.add(GRU(256,return_sequences=True))
    model.add(Bidirectional(GRU(128)))
    model.add(RepeatVector(21))
    model.add(Bidirectional(GRU(128, return_sequences=True)))
    model.add(TimeDistributed(Dense(512, activation='relu')))
    model.add(GRU(256,return_sequences=True))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size+1, activation='softmax')))

    return model
learning_rate = 0.005


MEGA_model = mega_combo_model(
    len(english_tokenizer.word_index)+1,
    len(french_tokenizer.word_index)+1,
    y_train.shape[1:],
    21,
    X_train_embed.shape)

MEGA_model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])

MEGA_model.summary()

MEGA_model.fit(X_train_embed, y_train_embed, batch_size=1024, epochs=20, validation_split=0.2)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 21, 15)            3000      
_________________________________________________________________
gru_9 (GRU)                  (None, 21, 256)           208896    
_________________________________________________________________
bidirectional_4 (Bidirection (None, 256)               295680    
_________________________________________________________________
repeat_vector_3 (RepeatVecto (None, 21, 256)           0         
_________________________________________________________________
bidirectional_5 (Bidirection (None, 21, 256)           295680    
_________________________________________________________________
time_distributed_11 (TimeDis (None, 21, 512)           131584    
_________________________________________________________________
gru_12 (GRU)                 (None, 21, 256)           590592    
__________

<keras.callbacks.History at 0x7fbb8026e5f8>

In [35]:
mega_pred=MEGA_model.predict(X_test_embed)

In [36]:
for i in range(len(mega_pred[:20])):
    print('PREDICTION :',logits_to_text(mega_pred[i], french_tokenizer))
    print('ATTENDU :',logits_to_text(y_test[i], french_tokenizer))

PREDICTION : californie est parfois doux au mois de mai et il est parfois merveilleux en février <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : californie est parfois sec au mois de mai et il est parfois merveilleux en février <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : california est jamais chaud à l'automne mais il est jamais jamais sec mars <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : californie est jamais chaud pendant l' automne mais il est jamais sec en mars <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : les états unis est parfois pluvieux en janvier mais il est doux en mai <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : les états unis est parfois pluvieux en janvier mais il est doux en mai <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : ils aiment les pommes les oranges et les poires <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : ils aiment les pommes les oranges et les poires <PAD> <PAD> <PAD> <PAD> <PAD> <P

In [37]:
# test on some training sequences
print('TRAIN')
pred1=MEGA_model.predict(X_train_embed)
evaluate_model(MEGA_model , X_train_embed, y_train_embed,pred1)
# test on some test sequences
print('TEST')
evaluate_model(MEGA_model, X_test_embed, y_test_embed,mega_pred)

TRAIN
 target=[les états unis est jamais belle au mois de novembre et il est jamais tranquille en juillet <PAD> <PAD> <PAD> <PAD>], predicted=[les états unis est jamais beau au mois de novembre et il est jamais tranquille en juillet <PAD> <PAD> <PAD> <PAD>]
 target=[le singe est votre animal préféré moins <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[le singe est votre animal préféré moins <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[californie ne fait jamais froid pendant l' hiver mais il est jamais tranquille à l' automne <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[californie ne fait jamais froid pendant l' hiver mais il est jamais tranquille à l' automne <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[la pêche est leur fruit le plus cher mais le citron est le plus aimé <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[la pêche est leur fruit le plus cher mais le citron est le plus aimé <PAD> <P

### Si on essaye maintenant de traduire de nouvelles phrases (mais avec des mots vus à l'entrainement). Cela donne:

In [38]:
sentences=['he saw a old yellow truck',
            'In France we like mangoes and we drive black cars.',
           'the dogs dislike to eat bananas',
           'the weather is midle during june and cooler in july',
           'i want to visit Paris ' 
          ]

In [39]:
def padding(sentence,lenght=21):
    for i in range(lenght-len(sentence)):
        sentence.append(0)
    return sentence

In [40]:
english_tokenizer.word_index['france']

24

In [41]:
sentence = 'he saw a old yellow truck'
sentence = [english_tokenizer.word_index[word] for word in sentence.split()]
sentence = np.array(padding(sentence,lenght=21))

sentence2 = 'in france we like mangoes and we dislike car'
sentence2 = [english_tokenizer.word_index[word] for word in sentence2.split()]
sentence2 = np.array(padding(sentence2,lenght=21))

sentence3 = 'the dogs dislike to visit bananas'
sentence3 = [english_tokenizer.word_index[word] for word in sentence3.split()]
sentence3 = np.array(padding(sentence3,lenght=21))

sentence4 = 'the weather is mild during june and cold in july'
sentence4 = [english_tokenizer.word_index[word] for word in sentence4.split()]
sentence4 = np.array(padding(sentence4,lenght=21))

sentence5 = 'i want to visit paris '
sentence5 = [english_tokenizer.word_index[word] for word in sentence5.split()]
sentence5 = np.array(padding(sentence5,lenght=21))

liste_a_tester=np.array([sentence,sentence2,sentence3,sentence4,sentence5])
print(liste_a_tester)

[[ 26 127 100 111 112 101   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0]
 [  2  24  97  92  74   7  97  93 102   0   0   0   0   0   0   0   0   0
    0   0   0]
 [  5 171  93  81 108  80   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0]
 [  5 193   1  64   4  34   7  57   2  43   0   0   0   0   0   0   0   0
    0   0   0]
 [ 96 166  81 108  18   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0]]


In [42]:
my_pred=MEGA_model.predict(liste_a_tester)

In [43]:
for i in range(len(my_pred)):
    print('PREDICTION :',logits_to_text(my_pred[i], french_tokenizer))
    print('ATTENDU :',sentences[i])

PREDICTION : il a vu un vieux camion jaune <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : he saw a old yellow truck
PREDICTION : son animal le plus redouté est ce <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : In France we like mangoes and we drive black cars.
PREDICTION : l' est est moins <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : the dogs dislike to eat bananas
PREDICTION : le lion est est est est le <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : the weather is midle during june and cooler in july
PREDICTION : il pourraient aller en californie <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : i want to visit Paris 


La prédiction sur de nouvelles phrases est pauvre, bien que le réseau connaisse déjà tous les mots employés.

# On essaye maintenant d'apprendre sur un plus gros jeu de données.

In [44]:
from pickle import load
# load a clean dataset
def load_clean_sentences(filename):
    return load(open(filename, 'rb'))
 # load datasets
    
dataset = load_clean_sentences('english-french-both.pkl')
train = load_clean_sentences('english-french-train.pkl')
test = load_clean_sentences('english-french-test.pkl')



In [45]:
test.shape

(1000, 3)

In [46]:
dataset.shape

(10000, 3)

In [47]:
new_english_sentences=dataset[:,0]
new_french_sentences=dataset[:,1]

english_train=train[:,0]
french_train=train[:,1]
english_test=test[:,0]
french_test=test[:,1]

In [48]:
english_train

array(['i saw someone', 'what a pity', 'shake my hand', ..., 'be nice',
       'see you around', 'hi'], dtype='<U339')

In [49]:
french_train

array(['je vis quelquun', 'quel malheur', 'serremoi la main', ...,
       'sois gentil', 'au plaisir de vous revoir', 'salut'], dtype='<U339')

In [50]:
new_preproc_english_sentences, new_preproc_french_sentences, new_english_tokenizer, new_french_tokenizer = preprocess(new_english_sentences, new_french_sentences)

In [51]:
new_preproc_french_sentences[0]

array([[  1],
       [326],
       [212],
       [  0],
       [  0],
       [  0],
       [  0],
       [  0],
       [  0],
       [  0]], dtype=int32)

In [52]:
new_english_tokenizer.word_index['saw'] 

78

In [53]:
len(new_english_tokenizer.word_index)

2122

In [54]:
new_max_english_sequence_length = new_preproc_english_sentences.shape[1]
new_max_french_sequence_length = new_preproc_french_sentences.shape[1]
new_english_vocab_size = len(new_english_tokenizer.word_index)
new_french_vocab_size = len(new_french_tokenizer.word_index)

print('Data Preprocessed')
print("Max English sentence length:", new_max_english_sequence_length)
print("Max French sentence length:", new_max_french_sequence_length)
print("English vocabulary size:", new_english_vocab_size)
print("French vocabulary size:", new_french_vocab_size)

Data Preprocessed
Max English sentence length: 5
Max French sentence length: 10
English vocabulary size: 2122
French vocabulary size: 4373


In [55]:
def inverse_to_categorical(y_train):
    res=np.zeros((y_train.shape[0],y_train.shape[1]))
    for i in range(y_train.shape[0]):
        for j in range(y_train.shape[1]):
            res[i][j]=np.argmax(y_train[i][j])
    return res

In [56]:
new_tmp_x = pad(new_preproc_english_sentences, new_max_french_sequence_length)
new_tmp_x = new_tmp_x.reshape((-1, new_preproc_french_sentences.shape[-2], 1))

new_preproc_french_sentencesbis = ut.to_categorical(new_preproc_french_sentences,new_french_vocab_size+1)
new_tmp_xbis = ut.to_categorical(new_tmp_x ) 

from sklearn.model_selection import train_test_split

new_X_train, new_X_test, new_y_train, new_y_test = train_test_split(new_tmp_xbis, new_preproc_french_sentencesbis, test_size=0.33, random_state=42)
new_y_train_embed = new_y_train
new_y_test_embed = new_y_test

new_X_train_embed = inverse_to_categorical(new_X_train)
new_X_test_embed = inverse_to_categorical(new_X_test)

In [57]:
new_X_train_embed.shape

(6700, 10)

In [58]:
new_X_train.shape

(6700, 10, 2123)

In [59]:
print(new_y_train_embed.shape)
print(new_y_train.shape)

(6700, 10, 4374)
(6700, 10, 4374)


In [60]:
print(new_X_test_embed.shape)
print(new_X_train_embed.shape)
print(new_y_test_embed.shape)
print(new_y_train_embed.shape)

(3300, 10)
(6700, 10)
(3300, 10, 4374)
(6700, 10, 4374)


In [61]:
def new_simple_model(input_shape, output_sequence_length, new_english_vocab_size, new_french_vocab_size):
    """
    Build and train a basic RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Hyperparameters
    learning_rate = 0.005
    
    # TODO: Build the layers
    model = Sequential()
    model.add(GRU(512, input_shape=input_shape[1:], return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(new_french_vocab_size+1, activation='softmax'))) 
    

    # Compile model
    model.compile(loss=categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

#tests.test_simple_model(simple_model)



# Train the neural network
new_simple_rnn_model = new_simple_model(
    new_X_train.shape,
    new_max_french_sequence_length,
    new_english_vocab_size,
    new_french_vocab_size)

print(new_simple_rnn_model.summary())

new_simple_rnn_model.fit(new_X_train,new_y_train, batch_size = 1000, epochs=30, validation_split=0.2)



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru_13 (GRU)                 (None, 10, 512)           4048896   
_________________________________________________________________
time_distributed_13 (TimeDis (None, 10, 1024)          525312    
_________________________________________________________________
dropout_7 (Dropout)          (None, 10, 1024)          0         
_________________________________________________________________
time_distributed_14 (TimeDis (None, 10, 4374)          4483350   
Total params: 9,057,558
Trainable params: 9,057,558
Non-trainable params: 0
_________________________________________________________________
None
Train on 5360 samples, validate on 1340 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoc

<keras.callbacks.History at 0x7fbb7c2fde80>

In [62]:
new_simple_pred= new_simple_rnn_model.predict(new_X_test)

In [63]:
for i in range(len(new_simple_pred[:20])):
    print('PREDICTION :',logits_to_text(new_simple_pred[i], new_french_tokenizer))
    print('ATTENDU :',logits_to_text(new_y_test[i], new_french_tokenizer))

PREDICTION : grimpez <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : preparezvous <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : arrete <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : ca suffit tom <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : soyez le <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : sois realiste <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : je suis <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : je suis a la maison <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : aidemoi tom <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : aidez tom <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : detendstoi <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : calmezvous <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : ca a un <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : il a une fuite <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : etesvo

In [64]:
def new_evaluate_model(model,  sources, raw_dataset,pred):
    actual, predicted = list(), list()

    for i in range(pred.shape[0]):
        
        # translate encoded source text
        translation = logits_to_text(pred[i], new_french_tokenizer)
        #raw_target=raw_dataset[i][0]
        raw_target = logits_to_text(raw_dataset[i], new_french_tokenizer)
        if i < 10:
            print(' target=[%s], predicted=[%s]' % ( raw_target, translation))
        actual.append([raw_target.split()])
        predicted.append(translation.split())
        
    # calculate BLEU score
    print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(0.3, 0.3, 0.3, 0)))
    print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))


In [65]:
new_pred1=new_simple_rnn_model.predict(new_X_train)

In [66]:
new_pred1.shape

(6700, 10, 4374)

In [67]:
# test on some training sequences
print('TRAIN')
new_evaluate_model(new_simple_rnn_model , new_X_train, new_y_train,new_pred1)
# test on some test sequences
print('TEST')
new_evaluate_model(new_simple_rnn_model, new_X_test, new_y_test,new_simple_pred)

TRAIN
 target=[quiconque estil dans la maison <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[quiconque estil a maison maison <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[cest sucre <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[cest un <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[jaime les chats <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[je les chats <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[vous etes invites <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[vous etes invitee <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[jaime les filles <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[je les filles <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[jaime le printemps <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[je les printemps <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[je voyage leger <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[je voyage leger <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[ar

In [68]:
def new_mega_combo_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps,input_shape):
    """
    Build and train a RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    model = Sequential()
    model.add(Embedding(src_vocab,100, mask_zero = False, input_shape=input_shape[1:]))
    model.add(GRU(512,return_sequences=True))  
    model.add(Bidirectional(GRU(256)))
    model.add(RepeatVector(10))
    model.add(Bidirectional(GRU(256, return_sequences=True)))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(GRU(512,return_sequences=True))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(new_french_vocab_size+1, activation='softmax')))

    return model

learning_rate = 0.005
new_MEGA_model = new_mega_combo_model(
    len(new_english_tokenizer.word_index)+1,
    len(new_french_tokenizer.word_index)+1,
    new_y_train.shape[1:],
    10,
    new_X_train_embed.shape)

new_MEGA_model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])

new_MEGA_model.summary()

new_MEGA_model.fit(new_X_train_embed, new_y_train_embed, batch_size=1000, epochs=60, validation_split=0.2)



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 10, 100)           212300    
_________________________________________________________________
gru_14 (GRU)                 (None, 10, 512)           941568    
_________________________________________________________________
bidirectional_6 (Bidirection (None, 512)               1181184   
_________________________________________________________________
repeat_vector_4 (RepeatVecto (None, 10, 512)           0         
_________________________________________________________________
bidirectional_7 (Bidirection (None, 10, 512)           1181184   
_________________________________________________________________
time_distributed_15 (TimeDis (None, 10, 1024)          525312    
_________________________________________________________________
gru_17 (GRU)                 (None, 10, 512)           2360832   
__________

<keras.callbacks.History at 0x7fbb7a948278>

In [69]:
new_mega_pred = new_MEGA_model.predict(new_X_test_embed)

In [70]:
for i in range(len(new_mega_pred[:20])):
    print('PREDICTION :',logits_to_text(new_mega_pred[i], new_french_tokenizer))
    print('ATTENDU :',logits_to_text(new_y_test[i], new_french_tokenizer))

PREDICTION : soyez <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : preparezvous <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : venez a <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : ca suffit tom <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : soyez prudente <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : sois realiste <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : je suis en <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : je suis a la maison <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : soyez <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : aidez tom <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : oubliezle <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : calmezvous <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : cest est <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : il a une fuite <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : etes

In [71]:
new_mega_pred2 = new_MEGA_model.predict(new_X_train_embed)

In [72]:
# test on some training sequences
print('TRAIN')
new_evaluate_model(new_MEGA_model , new_X_train_embed, new_y_train_embed,new_mega_pred2)
# test on some test sequences
print('TEST')
new_evaluate_model(new_MEGA_model, new_X_test_embed, new_y_test_embed,new_mega_pred)

TRAIN
 target=[quiconque estil dans la maison <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[je ne suis pas <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[cest sucre <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[cest un <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[jaime les chats <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[ils sont <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[vous etes invites <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[cest es <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[jaime les filles <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[jaime sont <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[jaime le printemps <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[jaime sont <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[je voyage leger <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[je sommes <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[arrete de ti

In [78]:
def new_simple_model_lstm (input_shape, output_sequence_length, new_english_vocab_size, new_french_vocab_size):
    """
    Build and train a basic RNN on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """
    # Hyperparameters
    learning_rate = 0.005
    
    # TODO: Build the layers
    model = Sequential()
    model.add(LSTM(512, input_shape=input_shape[1:], return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation='relu')))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(new_french_vocab_size+1, activation='softmax'))) 
    

    # Compile model
    model.compile(loss=categorical_crossentropy,
                  optimizer=Adam(learning_rate),
                  metrics=['accuracy'])
    return model

#tests.test_simple_model(simple_model)



# Train the neural network
new_simple_rnn_model_lstm = new_simple_model(
    new_X_train.shape,
    new_max_french_sequence_length,
    new_english_vocab_size,
    new_french_vocab_size)

print(new_simple_rnn_model.summary())

new_simple_rnn_model_lstm.fit(new_X_train,new_y_train, batch_size = 1000, epochs=30, validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru_13 (GRU)                 (None, 10, 512)           4048896   
_________________________________________________________________
time_distributed_13 (TimeDis (None, 10, 1024)          525312    
_________________________________________________________________
dropout_7 (Dropout)          (None, 10, 1024)          0         
_________________________________________________________________
time_distributed_14 (TimeDis (None, 10, 4374)          4483350   
Total params: 9,057,558
Trainable params: 9,057,558
Non-trainable params: 0
_________________________________________________________________
None
Train on 5360 samples, validate on 1340 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoc

<keras.callbacks.History at 0x7fbb7486f4e0>

In [79]:
new_simple_pred_lstm = new_simple_rnn_model_lstm.predict(new_X_test)

In [80]:
for i in range(len(new_simple_pred_lstm[:20])):
    print('PREDICTION :',logits_to_text(new_simple_pred_lstm[i], new_french_tokenizer))
    print('ATTENDU :',logits_to_text(new_y_test[i], new_french_tokenizer))

PREDICTION : grimpez <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : preparezvous <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : arrete <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : ca suffit tom <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : soyez le <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : sois realiste <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : je suis <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : je suis a la maison <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : aidemoi tom <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : aidez tom <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : detendstoi <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : calmezvous <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : ca a un <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : il a une fuite <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
PREDICTION : etesvo

In [81]:
new_simple_pred_lstm2 = new_simple_rnn_model_lstm.predict(new_X_train)

In [82]:
# test on some training sequences
print('TRAIN')
new_evaluate_model(new_simple_pred_lstm, new_X_train, new_y_train , new_simple_pred_lstm2 )
# test on some test sequences
print('TEST')
new_evaluate_model(new_simple_pred_lstm, new_X_test, new_y_test , new_simple_pred_lstm )

TRAIN
 target=[quiconque estil dans la maison <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[quiconque estil a la maison <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[cest sucre <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[cest un <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[jaime les chats <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[je les chats <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[vous etes invites <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[vous etes invitee <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[jaime les filles <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[je les filles <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[jaime le printemps <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[je les printemps <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[je voyage leger <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>], predicted=[je voyage leger <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>]
 target=[arrete

In [73]:
new_sentence = 'he saw a old yellow truck'
new_sentence_test = [new_english_tokenizer.word_index[word] for word in new_sentence.split()]
new_sentence_test = np.array(padding(new_sentence_test,lenght=10))

new_sentence2 = 'in my home we like to work and red cars'
new_sentence2_test = [new_english_tokenizer.word_index[word] for word in new_sentence2.split()]
new_sentence2_test = np.array(padding(new_sentence2_test,lenght=10))

new_sentence3 = 'the dogs dislike to eat bananas'
new_sentence3_test = [new_english_tokenizer.word_index[word] for word in new_sentence3.split()]
new_sentence3_test = np.array(padding(new_sentence3_test,lenght=10))

new_sentence4 = 'say hi next time'
new_sentence4_test = [new_english_tokenizer.word_index[word] for word in new_sentence4.split()]
new_sentence4_test = np.array(padding(new_sentence4_test,lenght=10))

new_sentence5 = "please ask me something if you do not know"
new_sentence5_test = [new_english_tokenizer.word_index[word] for word in new_sentence5.split()]
new_sentence5_test = np.array(padding(new_sentence5_test,lenght=10))

new_liste_a_tester=np.array([new_sentence_test,new_sentence2_test,new_sentence3_test,new_sentence4_test,new_sentence5_test])
new_sentences=[new_sentence,new_sentence2,new_sentence3,new_sentence4,new_sentence5]
print(new_liste_a_tester)

[[  12   78    5  135 1996 1769    0    0    0    0]
 [  42   36   58    8   40   22   92  433  713 1030]
 [  24  303 1279   22  111 1636    0    0    0    0]
 [ 134  693 1076  128    0    0    0    0    0    0]
 [  75  126   10 1275  711    2   21   25   87    0]]


In [74]:
new_my_pred=new_MEGA_model.predict(new_liste_a_tester)
for i in range(len(new_my_pred)):
    print('PREDICTION :',logits_to_text(new_my_pred[i], new_french_tokenizer))
    print('ATTENDU :',new_sentences[i])

PREDICTION : je ne suis <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : he saw a old yellow truck
PREDICTION : je suis <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : in my home we like to work and red cars
PREDICTION : je ne suis pas <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : the dogs dislike to eat bananas
PREDICTION : puisje de <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : say hi next time
PREDICTION : ne ne pas pas <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : please ask me something if you do not know


In [75]:
simple_new_liste_a_tester=ut.to_categorical(new_liste_a_tester,2123)

In [76]:
new_my_pred=new_simple_rnn_model.predict(simple_new_liste_a_tester)
for i in range(len(new_my_pred)):
    print('PREDICTION :',logits_to_text(new_my_pred[i], new_french_tokenizer))
    print('ATTENDU :',new_sentences[i])

PREDICTION : il vu un un vieux <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : he saw a old yellow truck
PREDICTION : tom la la la la <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : in my home we like to work and red cars
PREDICTION : la les chiens les les <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : the dogs dislike to eat bananas
PREDICTION : ditesle non <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>
ATTENDU : say hi next time
PREDICTION : sil vous sil lacet lacet prie <PAD> <PAD> <PAD> <PAD>
ATTENDU : please ask me something if you do not know


Nous voulions importer une méthode WordtoVec pré-entraînée mais les formats de ces grosses bibliothèques sont difficiles à télécharger et à décompresser... Mais une idée pour continuer ce projet serait d'essayer une fois les mots prétraités avec la méthode WordToVec de prédire des nouvelles phrases avec des mots que le réseau n'a jamais vu lors de son entraînement.

In [77]:
from gensim.models import KeyedVectors
# Load vectors directly from the file
model = KeyedVectors.load_word2vec_format('data/GoogleGoogleNews-vectors-negative300.bin', binary=True)
# Access vectors for specific words with a keyed lookup:
vector = model['easy']
# see the shape of the vector (300,)
vector.shape
# Processing sentences is not as simple as with Spacy:
vectors = [model[x] for x in "This is some text I am processing with Spacy".split(' ')]

FileNotFoundError: [Errno 2] No such file or directory: 'data/GoogleGoogleNews-vectors-negative300.bin'