# TensorFlow Autoencodeur avec attention pour le PAr

## Setup

Moi j'ai installé tf addons par `pip install tensorflow-addons==0.13.0` (ET NON PAS `conda install -c esri tensorflow-addons`). Voir les compatibilités [sur le github de tensorflow_addons](https://github.com/tensorflow/addons).

In [1]:
# !pip install tensorflow-addons==0.11.2

In [2]:
import sys
print(sys.version)

3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:36:06) [MSC v.1929 64 bit (AMD64)]


In [1]:
import tensorflow as tf
import numpy as np
import io

print(tf.__version__)

2.7.0


In [2]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
physical_devices = tf.config.list_physical_devices('GPU') 
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)

Num GPUs Available:  1


# Step 1: Get the data

In [10]:
path_reglement_scol  = './word2vec_docs_scol_traités/corpus.txt'
path_questions_scol  = './word2vec_docs_scol_traités/toutes-les-questions.txt'

# Step 2: Preprocess the data

In [11]:
import re as regex
# acquisition du texte
reglement_scol = io.open(path_reglement_scol, encoding='UTF-8').read()#.strip().split('\n')
questions_scol = io.open(path_questions_scol, encoding='UTF-8').read()#.strip().split('\n')
texte = reglement_scol + ' ' + questions_scol
texte = regex.sub("\n", " ", texte)

On crée d'abord une liste de phrases dont chaque mot est séparé par un espace. On a besoin de `spacy` pour découper correctement les mots en français d'abord.

In [12]:
import nltk
import spacy
nlp = spacy.load('fr_core_news_sm')
phrases = nltk.tokenize.sent_tokenize(texte, language='french')
print('phrases parsées par NLTK')
phrasesTokeniseesSpacy = [nlp(s) for s in phrases]
print('phrases tokénisées par spacy')
phrasesSpacy = [' '.join([token.text.lower() for token in doc]) for doc in phrasesTokeniseesSpacy]
print('phrases découpées en tokens puis refusionnées')

phrases parsées par NLTK
phrases tokénisées par spacy
phrases découpées en tokens puis refusionnées


On supprime les listes inutiles désormais

In [13]:
del phrasesTokeniseesSpacy
del phrases

Créer un tokéniseur adapté à notre vocabulaire

In [14]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(filters='')
# créer un tokenizer adapté à tout le vocabulaire des phrases
tokenizer.fit_on_texts(phrasesSpacy)

Créer les tenseurs pour toutes les phrases et padder le tout

In [15]:
tensor_sentences = tokenizer.texts_to_sequences(phrasesSpacy)
print(type(tensor_sentences))
print(phrasesSpacy[0],tensor_sentences[0])
# enfin on padd le tout pour pouvoir l'utiliser dans un réseau de neurones
tensor_sentences = tf.keras.preprocessing.sequence.pad_sequences(tensor_sentences,padding='post')

<class 'list'>
le règlement de scolarité présente les modalités d' admission à l' école centrale de lyon , les objectifs et les modalités de l' évaluation des connaissances et des compétences de la formation ingénieur , les modalités de diversification de cette formation et les conditions d' obtention des diplômes de l' école centrale de lyon , hors diplômes de master co-accrédités et diplôme d' ingénieur energie en alternance . [8, 131, 1, 59, 860, 4, 102, 6, 175, 9, 3, 25, 46, 1, 43, 10, 4, 861, 16, 4, 102, 1, 3, 77, 12, 90, 16, 12, 104, 1, 5, 42, 88, 10, 4, 102, 1, 1565, 1, 166, 42, 16, 4, 285, 6, 214, 12, 85, 1, 3, 25, 46, 1, 43, 10, 390, 85, 1, 247, 1189, 16, 22, 6, 88, 1190, 18, 615, 13]


In [16]:
# Fonction qui convertit un mot en son représentant entier
def convert(tokenizer, tensor):
    for t in tensor: # t est un entier élément du tenseur
        if t != 0:
            print ("%d ----> %s" % (t, tokenizer.index_word[t]))
convert(tokenizer, tensor_sentences[-1])

37 ----> comment
5 ----> la
38 ----> mobilité
211 ----> est-elle
1564 ----> vérifiée
15 ----> pour
4 ----> les
97 ----> doubles
85 ----> diplômes
18 ----> en
80 ----> france
2 ----> ?


# Step 3: Define problem numbers

`tokenizer.index_word` est un dictionnaire dont les clés sont des entiers et les valeurs sont des struings (mots du vocabulaire)

In [17]:
print('tensor:')
print(type(tensor_sentences))
print(np.shape(tensor_sentences))
tensor_sentences[0]
print("tokenizer:")
print(type(tokenizer))
print(type(tokenizer.index_word))

vocab_inp_size = len(tokenizer.word_index)
n_data,max_length = tensor_sentences.shape
embedding_dim = 16

print(f"nombre de données: {n_data}\nlongueur max phrases en mots: {max_length}\ntaille du vocabulaire: {vocab_inp_size}\ndimension de l'embedding: {embedding_dim}")

tensor:
<class 'numpy.ndarray'>
(2201, 347)
tokenizer:
<class 'keras_preprocessing.text.Tokenizer'>
<class 'dict'>
nombre de données: 2201
longueur max phrases en mots: 347
taille du vocabulaire: 2555
dimension de l'embedding: 16


# Step 4: Split the train and validation data

In [18]:
from sklearn.model_selection import train_test_split

# Create training and validation sets using an 80/20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(tensor_sentences, tensor_sentences, test_size=0.2)

print(type(input_tensor_train), type(target_tensor_train))
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))

# on observe ce qu'il y a dans ces données: si on rééxécute ça change, c'est parce qu'il y a un shuffle aléatoire
convert(tokenizer, input_tensor_train[0])

<class 'numpy.ndarray'> <class 'numpy.ndarray'>
1760 1760 441 441
37 ----> comment
41 ----> puis
23 ----> -je
231 ----> trouver
7 ----> un
348 ----> référent
191 ----> linguistique
27 ----> si
89 ----> je
223 ----> suis
500 ----> exempté
17 ----> du
47 ----> cours
1 ----> de
99 ----> langue
2 ----> ?


# Step 5: create Encoder and Decoder classes

In [19]:
# Encoder class
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, max_length):
        super(Encoder, self).__init__()
        self.enc_units = enc_units

        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

        self.mask = tf.keras.layers.Masking(mask_value=0, input_shape=(None,max_length, embedding_dim))

    def call(self, x):
        x = self.mask(x)
        x = self.embedding(x)
        output, state = self.gru(x)
        return output, state
        # output       shape == (batch_size, max_len, encoding_units)
        # output state shape == (batch_size, encoding_units)

In [23]:
encoder = Encoder(vocab_size=60, embedding_dim=4, enc_units=6, max_length=10)
enc_in = tf.random.uniform(
    (1,10),
    minval=0,
    maxval=60,
    dtype=tf.dtypes.int32,
    name="dummy_input_encoder"
)

print("enc_in",enc_in.shape)
out_enc = encoder(enc_in)
print("out_enc", out_enc[0].shape,out_enc[1].shape)

decoder = Decoder(max_length=10)
out_dec = decoder(out_enc[0],out_enc[1])
print("out_dec",out_dec.shape)

enc_in (1, 10)
out_enc (1, 10, 6) (1, 6)
out_dec (1, 10)


In [20]:
# Decoder class
class Decoder(tf.keras.Model):
    def __init__(self, max_length):
        super(Decoder, self).__init__()
        self.attention = tf.keras.layers.Attention()
        self.dense = tf.keras.layers.Dense(1)
        self.reshape = tf.keras.layers.Reshape([max_length])

    def call(self, enc_output,enc_hidden):
        attention_outputs, attention_scores = tf.keras.layers.Attention()([enc_output, enc_hidden], return_attention_scores=True)
        #context = attention_outputs * enc_output
        #final_output = self.dense(context)
        final_output = self.reshape(attention_scores)
        return final_output

encoder = Encoder(vocab_inp_size, embedding_dim, 128)
decoder = Decoder(128,10)

enc_in = tf.random.uniform(
    (6,10),
    minval=0,
    maxval=60,
    dtype=tf.dtypes.int32,
    name="dummy_input_encoder"
)


print('Encoder Input        shape: (batch_size, timesteps)                {}'.format(enc_in.shape))
enc_output, enc_hidden = encoder(enc_in)

print('Encoder Output       shape: (batch_size, sequence_length, units)   {}'.format(enc_output.shape))
print('Encoder Hidden_state shape: (batch_size, units)                    {}'.format(enc_hidden.shape))

output = decoder(enc_output)

print(output.shape)

dec_out = decoder(enc_output)
dec_out.shape
#print('Attention output: (batch_size, sequence_length, units)', attention_outputs.shape)
#print('Attention scores: (batch_size, sequence_length, units)', attention_scores.shape)

In [21]:
class Autoencoder(tf.keras.Model):
    def __init__(self, embedding_dim, vocab_inp_size, max_length, latent_dim):
        super().__init__()

        self.latent_dim = 128
        self.encoder = Encoder(vocab_size=vocab_inp_size, embedding_dim=embedding_dim, enc_units=latent_dim, max_length=max_length)
        self.decoder = Decoder(max_length=max_length)

    def call(self, inputs):
        enc_output,enc_hidden = self.encoder(inputs)
        out_dec = self.decoder(enc_output,enc_hidden)
        return out_dec


In [22]:
latent_dim = 128
autoenc = Autoencoder(embedding_dim,vocab_inp_size,max_length,latent_dim)
adam = tf.keras.optimizers.Adam(learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    amsgrad=False)
autoenc.compile(optimizer=adam, loss=tf.keras.losses.MeanSquaredError(), metrics = ["accuracy"]) # losses.MeanSquaredError() losses.CosineSimilarity() tf.keras.losses.CategoricalCrossentropy()
autoenc.build(input_shape=input_tensor_train.shape)


# input_tensor_train.shape, autoenc(input_tensor_train).shape # ne pas décommenter si gros gros tenseurs

ValueError: Exception encountered when calling layer "decoder_5" (type Decoder).

in user code:

    File "C:\Users\matth\AppData\Local\Temp/ipykernel_9640/2569714999.py", line 13, in call  *
        final_output = self.reshape(attention_scores)
    File "C:\Users\matth\.conda\envs\tf\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler  **
        raise e.with_traceback(filtered_tb) from None

    ValueError: Exception encountered when calling layer "reshape_5" (type Reshape).
    
    Cannot reshape a tensor with 1074867200 elements to shape [1760,347] (610720 elements) for '{{node decoder_5/reshape_5/Reshape}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32](decoder_5/attention/Identity, decoder_5/reshape_5/Reshape/shape)' with input shapes: [1760,347,1760], [2] and with input tensors computed as partial shapes: input[1] = [1760,347].
    
    Call arguments received:
      • inputs=tf.Tensor(shape=(1760, 347, 1760), dtype=float32)


Call arguments received:
  • enc_output=tf.Tensor(shape=(1760, 347, 128), dtype=float32)
  • enc_hidden=tf.Tensor(shape=(1760, 128), dtype=float32)

In [20]:
history = autoenc.fit(input_tensor_train,target_tensor_train,
                epochs=20,
                batch_size=128,
                shuffle=True,
                validation_data=(input_tensor_val,target_tensor_val),
                verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [21]:
dummy_phrase = [phrasesSpacy[0]]
l = tokenizer.texts_to_sequences(dummy_phrase)

In [22]:
for ll in l:
    ll += (max_length-len(ll))*[0]

In [23]:
encoder = autoenc.encoder
decoder = autoenc.decoder

In [24]:
encoder(np.asarray(l))[1].shape

TensorShape([1, 128])

In [27]:
res1,res2 = encoder(np.asarray(l))
res1.shape,res2.shape

(TensorShape([1, 347, 128]), TensorShape([1, 128]))

In [71]:
res3 = decoder(res1,res2)
res3

<tf.Tensor: shape=(1, 347), dtype=float32, numpy=
array([[ 1.24931297e+01,  1.17034790e+02,  1.81203403e+01,
         4.34330902e+01,  1.57216583e+02,  4.04203529e+01,
         1.01612442e+02,  1.99001255e+01,  1.54140594e+02,
         2.25662346e+01,  2.13317719e+01,  4.21040764e+01,
         4.36036263e+01,  1.63316727e+01,  4.84043121e+01,
         2.13577805e+01,  4.03895607e+01,  1.57290833e+02,
         2.07951794e+01,  3.60457077e+01,  1.09792007e+02,
         1.85135059e+01,  2.10808849e+01,  6.79013443e+01,
         2.70839386e+01,  7.38018799e+01,  1.99942074e+01,
         1.73816223e+01,  8.45525436e+01,  1.81085663e+01,
         2.21186123e+01,  5.04311180e+01,  7.87296829e+01,
         2.17105103e+01,  3.94052162e+01,  1.08688309e+02,
         1.86318760e+01,  1.57138000e+02,  1.90021553e+01,
         1.47965408e+02,  4.56300049e+01,  1.77236004e+01,
         4.00220032e+01,  1.26758316e+02,  2.00959606e+01,
         1.56864731e+02,  2.57568913e+01,  6.34595795e+01,
      

In [72]:
np.asarray(l)

array([[   8,  131,    1,   59,  860,    4,  102,    6,  175,    9,    3,
          25,   46,    1,   43,   10,    4,  861,   16,    4,  102,    1,
           3,   77,   12,   90,   16,   12,  104,    1,    5,   42,   88,
          10,    4,  102,    1, 1565,    1,  166,   42,   16,    4,  285,
           6,  214,   12,   85,    1,    3,   25,   46,    1,   43,   10,
         390,   85,    1,  247, 1189,   16,   22,    6,   88, 1190,   18,
         615,   13,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0, 

In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Masking

model = Sequential()
model.add(Masking(mask_value=0, input_shape=(max_length,)))
model.add(LSTM(100, activation='tanh',return_sequences=True))
model.add(LSTM(50, activation='tanh', return_sequences=True))
model.add(LSTM(50, activation='tanh', return_sequences=True))
model.add(LSTM(100, activation='tanh', return_sequences=True))
model.add((Dense(1,activation='tanh')))

model.compile(optimizer='Adam', loss=tf.keras.losses.CategoricalCrossentropy(), metrics = ["accuracy"]) # losses.MeanSquaredError() losses.CosineSimilarity() tf.keras.losses.CategoricalCrossentropy()
model.build(input_shape=input_tensor_train.shape)

history = model.fit(input_tensor_train,target_tensor_train,
                epochs=3,
                batch_size=128,
                shuffle=True,
                validation_data=(input_tensor_val,target_tensor_val),
                verbose=1)