<a href="https://github.com/EmmanuelADAM/IntelligenceArtificiellePython/blob/master/summerSchool/NN_6_gru_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Example of Recurrent Networks Predicting Text
## Using the GRU Architecture

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from itertools import islice



In [None]:
# A Few Simple Sentences

sentences_fr = [
    "la goutte d'eau qui fait déborder le vase", 
    "Il n'y a pas de fumée sans feu", 
    "Il faut battre le fer tant qu'il est chaud", 
    "Il ne faut pas mettre tous ses oeufs dans le même panier", 
    "Il faut tourner sept fois sa langue dans sa bouche avant de parler", 
    "L'habit ne fait pas le moine", 
    "Il ne faut pas réveiller le chat qui dort", 
    "Il faut se méfier de l'eau qui dort", 
    "C'est l'hôpital qui se moque de la charité", 
    "Qui vole un oeuf vole un boeuf", 
    "Chercher midi à quatorze heures", 
    "Avoir un poil dans la main", 
    "Être dans de beaux draps", 
    "Avoir la tête dans les nuages", 
    "Mettre les pieds dans le plat"]
sentences = [
    "the straw that broke the camel's back",
    "there's no smoke without fire",
    "strike while the iron is hot",
    "don't put all your eggs in one basket",
    "think before you speak",
    "Clothes don't make the man",
    "let sleeping dogs lie",
    "still waters run deep",
    "the pot calling the kettle black",
    "Steal an egg, steal an ox",
    "make a mountain out of a molehill",
    "to be in a fine mess",
    "to have your head in the clouds",
    "put your foot in your mouth"]
# try other sentences or other languages (e.g. sentences_fr)

---
### Creating the Vocabulary
First, we analyze all the texts to collect all the words used and assign a number to each token (word). Each word thus has a unique index.

The Keras Tokenizer class does this for us.

In [2]:
# build the vocabulary with a Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
# +1 because index 0 is reserved for padding and is never assigned to a word
total_words = len(tokenizer.word_index) + 1
# tokenizer.word_index is a dictionary (word -> index); convert its keys to a list
# so that liste[0] is the word with index 1, liste[1] the word with index 2, etc.
liste = list(tokenizer.word_index.keys())

print("vocabulary size (distinct words + 1 for padding):", total_words)
for key, value in islice(tokenizer.word_index.items(), 10):
    print(f"{key}: {value}", end=", ")
print()

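As a rough illustration of how such a vocabulary can be built, here is a minimal pure-Python sketch. It assumes a simplified tokenization (lowercasing and splitting on whitespace only), whereas the Keras Tokenizer also strips most punctuation. The names `build_word_index`, `toy_sentences` and `toy_index` are hypothetical and not part of the notebook.

```python
from collections import Counter

def build_word_index(texts):
    """Assign 1-based indices to words, most frequent first (0 is kept for padding)."""
    counts = Counter()
    for text in texts:
        # simplified tokenization: lowercase, then split on whitespace
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep their first-seen order
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

toy_sentences = ["the pot calling the kettle black", "strike while the iron is hot"]
toy_index = build_word_index(toy_sentences)
print(toy_index)
```

The most frequent word ("the") receives index 1, and no word receives index 0, which stays free for padding.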

---
### Transforming Text into a Vector
Next, each word (token) is replaced by its index, which turns a character string into a vector of integers.

In [None]:
phrase0 = sentences[0]
vecteur0 = tokenizer.texts_to_sequences([phrase0])[0]
print(phrase0)
print("is translated into:")
print(vecteur0)

la goutte d'eau qui fait déborder le vase
is translated into:
[7, 19, 20, 4, 11, 21, 1, 22]
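The lookup itself can be sketched in a few lines of plain Python. This is only an approximation of `texts_to_sequences`: it assumes whitespace tokenization, and it drops words that are missing from the vocabulary, which is what the Keras Tokenizer does when no `oov_token` is set. The dictionary `toy_word_index` and the function `text_to_sequence` are hypothetical.

```python
def text_to_sequence(text, word_index):
    """Replace each known word by its index; unknown words are silently dropped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

# hypothetical vocabulary, as a Tokenizer might produce it
toy_word_index = {"the": 1, "pot": 2, "calling": 3, "kettle": 4, "black": 5}

known = text_to_sequence("The pot calling the kettle black", toy_word_index)
partly_unknown = text_to_sequence("the teapot calling the kettle", toy_word_index)
print(known)
print(partly_unknown)
```

Because unknown words disappear from the sequence, a prompt containing words the model has never seen is effectively shortened before prediction.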


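Now we want the network to learn, step by step, how a sentence continues. Each sentence is expanded into all of its prefixes of two or more words, and the last word of each prefix is the word to predict. For the French proverb "la goutte d'eau qui fait déborder le vase" (glossed here in English), the prefixes are: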
- the drop
- the drop of water
- the drop of water that
- the drop of water that makes
- the drop of water that makes the
- the drop of water that makes the vase
- the drop of water that makes the vase overflow

In [None]:
# from texts to vectors 
input_sequences = []
for sentence in sentences:
    token_list = tokenizer.texts_to_sequences([sentence])[0]
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i+1])

In [None]:
input_sequences[:5]

[[7, 19],
 [7, 19, 20],
 [7, 19, 20, 4],
 [7, 19, 20, 4, 11],
 [7, 19, 20, 4, 11, 21]]

In [None]:
# Pad the vectors so that they all have the same length
max_sequence_len = max(len(x) for x in input_sequences)
# padding='pre' fills the leading positions with 0 when a vector is shorter than max_sequence_len
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

In [None]:
print("the sentence '", sentences[0], "' is translated into multiple vectors of the same size:")
split = sentences[0].split()
for i in range(6):
    print(input_sequences[i], end=" -> '")
    for j in range(i+2):
        print(split[j], end=" ")
    print("'")

the sentence ' la goutte d'eau qui fait déborder le vase ' is translated into multiple vectors of the same size:
[ 0  0  0  0  0  0  0  0  0  0  0  7 19] -> 'la goutte '
[ 0  0  0  0  0  0  0  0  0  0  7 19 20] -> 'la goutte d'eau '
[ 0  0  0  0  0  0  0  0  0  7 19 20  4] -> 'la goutte d'eau qui '
[ 0  0  0  0  0  0  0  0  7 19 20  4 11] -> 'la goutte d'eau qui fait '
[ 0  0  0  0  0  0  0  7 19 20  4 11 21] -> 'la goutte d'eau qui fait déborder '
[ 0  0  0  0  0  0  7 19 20  4 11 21  1] -> 'la goutte d'eau qui fait déborder le '
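The effect of pre-padding can be reproduced with a small helper. This is a sketch, not the Keras implementation: the hypothetical `pad_pre` works on plain lists and mimics `pad_sequences` with `padding='pre'` and its default behaviour of truncating from the front when a sequence is too long.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`; if the sequence is too long, keep its last `maxlen` items."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([7, 19], 5)
too_long = pad_pre([7, 19, 20, 4, 11, 21], 5)
print(short)
print(too_long)
```

Padding on the left keeps the most recent words at the end of the vector, right before the word to predict.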


In [None]:
# create the Xs (all values of each vector except the last one)
X = input_sequences[:, :-1]
# create the ys (last value of each vector)
y = input_sequences[:, -1]
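To make the split concrete, the sketch below applies the same idea to three short padded vectors written as plain lists. The values are illustrative, and `padded`, `contexts` and `targets` are hypothetical names that play the roles of `input_sequences`, `X` and `y`.

```python
# three padded prefixes of the same sentence (illustrative values)
padded = [
    [0, 0, 0, 7, 19],
    [0, 0, 7, 19, 20],
    [0, 7, 19, 20, 4],
]

contexts = [row[:-1] for row in padded]  # everything except the last index
targets = [row[-1] for row in padded]    # the last index is the word to predict

for context, target in zip(contexts, targets):
    print(context, "->", target)
```

Each target becomes part of the context of the next, longer prefix.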

In [None]:
# One-hot encoding: each output word becomes a vector of 0s with a single 1 at the word's index
# The vector length therefore equals total_words (number of distinct words + 1)
y = tf.keras.utils.to_categorical(y, num_classes=total_words)

In [None]:
y[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0.])
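One-hot encoding is simple enough to write by hand. The sketch below uses plain lists and a hypothetical helper `one_hot`; it produces the same kind of vector as `to_categorical` for a single label, without relying on TensorFlow.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

encoded = one_hot(3, 6)
print(encoded)
```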

---
### The Network Model
For text, a word is better represented by a feature vector than by a bare integer, because the integer index says nothing about how similar two words are.

The **Embedding** layer allows this transformation.

For example, if "chat" (cat) = 2 and "chien" (dog) = 5, an embedding of size 3 might give "chat" = [0.1, -0.4, 0.3] and "chien" = [0.5, -0.2, 0.5].

These vectors are learned during training. If "chien" and "chat" are used in the same contexts, as in "le chien mange" (the dog eats) and "le chat mange" (the cat eats), their vectors should take similar values after enough training.
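An embedding can be pictured as a lookup table from word index to vector, and "similar values" can be measured with, for instance, cosine similarity. The sketch below reuses the illustrative vectors given for "chat" and "chien"; the table `embedding_table` and the function `cosine_similarity` are hypothetical, and a real Embedding layer learns its values rather than having them written by hand.

```python
import math

# hypothetical embedding table: word index -> feature vector
embedding_table = {
    2: [0.1, -0.4, 0.3],  # "chat"
    5: [0.5, -0.2, 0.5],  # "chien"
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similarity = cosine_similarity(embedding_table[2], embedding_table[5])
print(round(similarity, 3))
```

A value close to 1 indicates that the two words point in nearly the same direction in the embedding space.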

The **GRU** (Gated Recurrent Unit) layer learns sequences of values. Several GRU layers can be stacked, with return_sequences=True on every layer except the last. A layer size of about 100 units is typical for an average vocabulary (a few thousand words).
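To give an intuition of what a GRU computes at each time step, here is a deliberately simplified sketch with a scalar input and a scalar hidden state. It is not the Keras implementation: the weights in `params` are arbitrary illustrative values, the names `gru_step` and `sigmoid` are hypothetical, and the update convention used here (new state = z * old state + (1 - z) * candidate) is one of two conventions found in references.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU update for a scalar input x and a scalar hidden state h_prev."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # candidate state: the reset gate decides how much of the past is used
    candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # z close to 1 keeps the old state, z close to 0 adopts the candidate
    return z * h_prev + (1.0 - z) * candidate

# arbitrary illustrative weights (a trained layer would learn these)
params = {"wz": 0.5, "uz": -0.3, "bz": 0.0,
          "wr": 0.8, "ur": 0.2, "br": 0.0,
          "wh": 1.0, "uh": 0.5, "bh": 0.0}

h = 0.0
for x in [0.2, -0.1, 0.7]:  # a short input sequence
    h = gru_step(x, h, params)
print(h)
```

The final value of `h` summarizes the whole sequence, which is what a GRU layer with return_sequences=False passes on to the next layer.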

In [None]:
# model creation
model = Sequential()
## Embedding : each word is represented by a vector of 50 values
model.add(Embedding(total_words, 50))
model.add(GRU(100, return_sequences=False))
model.add(Dense(total_words, activation='softmax'))

In [None]:
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


In [None]:

# train the model
print("please wait a few seconds during training...")
model.fit(X, y, epochs=300, verbose=0)

please wait 30 s during training...


<keras.src.callbacks.history.History at 0x23c5be17b30>

---
### Predictions
To complete a sentence, we ask the model for the next word, append it to the sentence, and repeat with the extended sentence until the desired number of words has been generated.

In [None]:
# predict the next word(s) that follow start_text
def predict_next_word(start_text, next_words=1):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([start_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # take the most probable word (argmax returns an array of shape (1,))
        predicted = int(np.argmax(model.predict(token_list), axis=-1)[0])
        # index 0 is reserved for padding and has no associated word
        word = tokenizer.index_word.get(predicted)
        if word is not None:
            start_text += " " + word

    return start_text
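The generation loop can be tried without a trained network. In the sketch below, a hypothetical table `toy_scores` stands in for the model: it gives scores for the next word based only on the last word, whereas the real GRU uses the whole padded sequence. The function `complete` is likewise hypothetical and only illustrates the greedy "predict, append, repeat" pattern.

```python
# hypothetical stand-in for the trained network: next-word scores given the last word
toy_scores = {
    "sleeping": {"dogs": 0.9, "lie": 0.1},
    "dogs": {"lie": 0.8, "dogs": 0.2},
}

def complete(start_text, next_words):
    """Greedily append up to `next_words` words to `start_text`."""
    words = start_text.split()
    for _ in range(next_words):
        scores = toy_scores.get(words[-1]) if words else None
        if scores is None:  # no known continuation: stop early
            break
        words.append(max(scores, key=scores.get))  # greedy choice (argmax)
    return " ".join(words)

completed = complete("let sleeping", 3)
print(completed)
```

Always taking the highest-scoring word makes the output deterministic, and the quality of each new word depends on the words generated before it.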

In [None]:
# Test the model with different sentences
start_texts = [
    "don't put all your eggs  ",
    "make a mountain  ",
    "to have your head   ",
]

for text in start_texts:
    print(f"first part : {text}")
    nb = 5 if text.count(" ")<5 else 3
    print(f"predicting {nb} words.")
    print(f"Prediction: {predict_next_word(text, nb)}")
    print("-" * 50)

first part : il faut tourner sept fois sa  
predicting 3 words.
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 188ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 37ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
Prediction: il faut tourner sept fois sa   langue dans sa
--------------------------------------------------
first part : la goutte d'eau qui fait  
predicting 3 words.
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
Prediction: la goutte d'eau qui fait   déborder le vase
--------------------------------------------------
first part : Qui vole un oeuf   
predicting 3 words.
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
1/1 ━━━━

In [None]:
# other additional tests
start_texts = [
    "to have your egg ",
    "when there moutains ",
]

for text in start_texts:
    print(f"first part : {text}")
    nb = 4
    print(f"predicting {nb} words.")
    print(f"Prediction: {predict_next_word(text, nb)}")
    print("-" * 50)

first part : chercher un poil   
predicting 4 words.
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
Prediction: chercher un poil    dans la main vole
--------------------------------------------------
first part : il faut battre le moine   
predicting 4 words.
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 36ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 37ms/step
Prediction: il faut battre le moine    tant qu'il est chaud
--------------------------------------------------


---
### Post-processing
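We notice small errors in the generated text, sometimes called "hallucinations": for example, a spurious word appended after a completed proverb ("... dans la main vole"). It is often necessary to check and correct the outputs, for instance by preventing the same word from being repeated.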
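As one possible post-processing step, the sketch below removes a word when it immediately repeats the previous one. The function name `drop_immediate_repeats` is hypothetical, and this simple rule is only an illustration: it does not detect other kinds of errors, such as a spurious word at the end of a sentence.

```python
def drop_immediate_repeats(text):
    """Remove a word when it is identical to the word just before it."""
    cleaned = []
    for word in text.split():
        if not cleaned or word != cleaned[-1]:
            cleaned.append(word)
    return " ".join(cleaned)

fixed = drop_immediate_repeats("still waters run run deep deep")
print(fixed)
```

The same check could instead be applied inside the generation loop, by skipping a predicted word that equals the last word of the sentence.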