# PLAN

## Première approche
1. Entraîner un modèle 4-gramme sur les 99 tranches du corpus avec KenLM (dans l'idéal on entraînerait 4 modèles, 1-2-3-4-gramme, pour analyser les résultats)
2. Tokeniser toutes les phrases en entrée et créer des matrices avec toutes les permutations possibles pour chaque phrase (donc des matrices de taille (n_mots!, n_mots) )
3. Prendre la permutation qui a le minimum de perplexité

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import kenlm
import util
import models
import math
from tensorflow.keras.layers import Layer, Bidirectional, LSTM, Dropout, Embedding, Dense, Activation, ActivityRegularization, Lambda, Conv1D, MaxPooling1D, Flatten, Input
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.optimizers import Adam 
from tensorflow.python.keras import regularizers
from itertools import permutations, combinations
from util import SentVariations, SortSentenceV1, GetLeastPerplexSent
from tokenizer import TextCleanup

### Pour trouver toutes les permutations d'une phrase: *SentVariations*

On remarque que lorsque le nombre de mots dépasse 6, le temps de traitement explose. Ceci est du au fait que cette solution cherche un nombre factoriel de permutations.

In [None]:
vars = SentVariations("Your is daddy ? who")
for i in vars:
    print(i)

Prenons la phrase avec le minimum de perplexité en utilisant notre model n-gramme:

In [None]:
model = kenlm.LanguageModel('models/4gram.binary')

In [None]:
sent, perp = SortSentenceV1("thank you mr segni , i shall do so gladly .", model)

In [None]:
print('Sorted sentence: ' + ' '.join(sent))
print ('perplexity: ' + str(round(perp,2)))

Conclusion: Cette solution semble fonctionner pour des petites phrases, mais est impossible à traiter pour des phrases de taille plus grandes que 10 (*11 mots = 1 min de traitement*), telles que la plupart de nos phrases de test. 

## Deuxième approche

Calculer n! permutations prend un temps monstrueux lorsque n>10. 
1. Considérer que chaque mot de la phrase peut être le premier mot
2. Pour chaque mot, utiliser le modèle pour prédire le mot suivant le plus probable considérant la phrase d'entrée et la liste des mots restants à deviner (~ Greedy Search)
3. On obtient une matrice de forme (n_mots, n_mots), il suffit ensuite de trouver la phrase la moins perplèxe.


In [None]:
def OneStepPrediction(sent, words, model):
    """
    Performs one prediction step. Takes the input incomplete sentence, and predicts the next most probable word not previously drawn

    Arguments:
        sent -- str, the input sentence
        words -- the list of remaining words to be sorted
        model -- the KenLM model instance
    
    Returns: 
        newSent -- str, the completed sentence
        words -- the remaining words to be assigned
    """ 
    
    perplexity = model.perplexity(sent + ' ' + words[0])
    
    for word in words:
        sentCandidate = sent + ' ' + word
        perpCandidate = model.perplexity(sentCandidate)
        if perpCandidate <= perplexity:
            perplexity = perpCandidate
            newSent = sentCandidate
            chosenWord = word
    words.remove(chosenWord)
    
    return newSent, words
    

In [None]:
def SortSentenceV2 (words, model):
    """
    Predict a sentence from a list of words, using a previsouly trained KenLM n-gram model

    Arguments:
        words -- the list of words to place in order  
        model -- the KenLM n-gram model  
    
    Returns:
        outSent -- the output sentence  
        perplexity -- the perplexity of the output sentence
    """
    sents = []
    
    # Every word can be the starting word
    for word in words:        
        remainingWords = words.copy()
        remainingWords.remove(word)
        sent = word
        
        # While there is still words to be sorted, perform one prediction step
        while len(remainingWords) > 0:
            sent, remainingWords = OneStepPrediction(sent, remainingWords, model)
        sents.append(sent)
        

    # Get the sentence with the minimum of perplexity from the list of sentences
    outSent, perplexity = GetLeastPerplexSent(sents, model)
    
    return outSent, perplexity, sents

In [None]:
sent = "kitchen is daddy the with cleaning mama ."
words = sent.split()
outSent, perplexity, sents = SortSentenceV2(words, model)   

In [None]:
print('out: ' + outSent)
print('perplexity: ' + str(perplexity))
for sent in sents :
    print(sent)


On peut voir que notre système place systématiquement le '.' en deuxième positionm plutôt qu'à la fin.  
Voyons voir ce qu'il se passe si l'on prend les 4-grammes les plus probables commençant par chaque mot du vocabulaire, plutôt que le mot unique

In [None]:
def GetNgrams(words, model, o = 4):
    """
    Gets the most probable n-gram for each word in a list
    Arguments:
        words -- the list of words
        model -- the KenLM n-gram model
        o -- int, order of the n-gram (default: 4)

    Returns:
        ngrams -- list of str, all the most probable n-grams for each word of the dictionnary
    """
    ngrams = []
    ngrams_current = []

    for word in words:
        # Get all possible permutations of order o
        perms = permutations(words, o)
        ngrams_current.clear()
        
        # For each word, get the permutations starting by it
        for perm in perms:            
            if str(word) == str(perm[0]):
                ngrams_current.append(' '.join(perm))
        # Get the least perplex ngram
        ngram, _ = GetLeastPerplexSent(ngrams_current, model)
        ngrams.append(ngram)

    return ngrams

In [None]:
words = "to , attention should parliament like your an a i case to which president . shown draw interest consistently in this has madam".split()
ngrams = GetNgrams(words, model, o = 4)
print(ngrams)

On remarque que TOUS nos n-grammes terminent par un point '.', ce qui peut être problématique puisqu'il est plus probable qu'un point termine notre phrase plutôt qu'être au début.  
Pour s'affranchir de ce problème, prenons les n-1 mots des n-grammes pour commencer.

In [None]:
def SortSentenceV3 (words, model):
    """
    Predict a sentence from a list of words, using a previsouly trained KenLM n-gram model

    Arguments:
        words -- the list of words to place in order  
        model -- the KenLM n-gram model  
    
    Returns:
        outSent -- the output sentence  
        perplexity -- the perplexity of the output sentence
    """
    sents = []
    
    # Every word can be the starting word
    for ngram in GetNgrams(words, model):        
        ngram = ngram.split()
        remainingWords = words.copy()
        # If the ngram ends with '.', pop the end
        if ngram[-1] == '.':
            ngram.pop()
        for word in ngram:
            remainingWords.remove(word)
        sent = ' '.join(ngram)
        
        # While there is still words to be sorted, perform one prediction step
        while len(remainingWords) > 0:
            sent, remainingWords = OneStepPrediction(sent, remainingWords, model)
        sents.append(sent)
        

    # Get the sentence with the minimum of perplexity from the list of sentences
    outSent, perplexity = GetLeastPerplexSent(sents, model)
    
    return outSent, perplexity, sents

In [None]:
words = "my question relates to something that will come up on thursday and which i will then raise again .".split()
outSent, perplexity, sents = SortSentenceV3(words, model)   
print('out: ' + outSent)
print('perplexity: ' + str(perplexity))
for sent in sents :
    print(sent)

### Bilan

On remarque que le premier mot prédit est souvent le '.', les virgules etc. Ce n'est pas très surprenant puisque ce sont les mots de vocabulaire les plus courants.

## Troisième approche: Modèles Seq2Seq
1. On entraîne un modèle séquence-à-séquence par la méthode du "Teacher Forcing"
2. Encodeur: LSTM / BiLSTM / CNN, avec ou sans couche d'embeddings
3. Décodeur: LSTM



### Données
#### 1- Chargement des dataset (50 000 phrases)

In [2]:
X_train, Y_train, X_test, Y_test = util.GetData('data\\1BW.train', 'data\\1BW.ref', 'data\\devdata.test', 'data\\devdata.ref', 1000)
print('X_train:')
print(X_train[:3])
print('Y_train:')
print(Y_train[:3])
len(X_train)

X_train:
['<bos> big issue does everything become have a to why ? such <eos>'
 '<bos> if most the in mid-october contenders candidate runoff , absolute be between a no top . there wins likely will an two majority , <eos>'
 "<bos> his warrants search vegas served and . at 's las home businesses murray and houston authorities previously las vegas in <eos>"]
Y_train:
['<bos> why does everything have to become such a big issue ? <eos>'
 '<bos> if no candidate wins an absolute majority , there will be a runoff between the top two contenders , most likely in mid-october . <eos>'
 "<bos> authorities previously served search warrants at murray 's las vegas home and his businesses in las vegas and houston . <eos>"]


1000

#### 2- Chargement des embeddings et du dictionnaire 

In [3]:
glove_dict, word_to_index, index_to_word = util.LoadVectors('data\\glove_small.txt')

#### 3- Préparation des données sous forme de liste d'entiers 

In [4]:
maxLen = 25 + 2 
X_train_int = util.Sentences2Indices(X_train, word_to_index, maxLen)
Y_train_int = util.Sentences2Indices(Y_train, word_to_index, maxLen)
X_test_int = util.Sentences2Indices(X_test, word_to_index, maxLen)
Y_test_int = util.Sentences2Indices(Y_test, word_to_index, maxLen)

print(X_train_int.size)

27000


In [5]:
test = []
for i in X_train_int[0]:
    test.append(index_to_word[i])
print(test)

['<bos>', 'big', 'issue', 'does', 'everything', 'become', 'have', 'a', 'to', 'why', '?', 'such', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']


In [91]:
vocabSize = len(word_to_index.keys())
print(vocabSize)
print(X_train_int.size)
X_train_oh = util.OneHot(X_train_int, vocabSize)

400004
27000


MemoryError: Unable to allocate 80.5 GiB for an array with shape (27000, 400004) and data type float64

#### Batch Generator

In [5]:
def GenerateBatch(X, Y, vocabSize, batch_size=64, maxLen = 27):
    ''' 
    Generate a batch of data 
    
    
    '''
    while True:
        for j in range(0, len(X), batch_size):
            encoder_inputs = np.zeros((batch_size, maxLen),dtype='float32')
            decoder_inputs = np.zeros((batch_size, maxLen+2),dtype='float32')
            decoder_outputs = np.zeros((batch_size, maxLen+2, vocabSize),dtype='float32')

            for i, (input_text_seq, target_text_seq) in enumerate(zip(X[j:j+batch_size], Y[j:j+batch_size])):
                for t, word_index in enumerate(input_text_seq):
                    encoder_inputs[i, t] = word_index # encoder input seq

                for t, word_index in enumerate(target_text_seq):
                    decoder_inputs[i, t] = word_index
                    if (t>0) and (word_index<=vocabSize):
                        decoder_outputs[i, t-1, int(word_index-1)] = 1.

            yield([encoder_inputs, decoder_inputs], decoder_outputs)

In [None]:
A = GenerateBatch(X_train_int, Y_train_int, vocabSize,64,27)
for a in A:
    print(a)

#### Implémentation du Modèle

In [6]:
class testSeq2Seq(Model):
    """
    Architecture for the sequence-to-sequence Model used in the linearization task.
    This model is comprised of an encoder and a decoder, and uses teacher forcing for training.

    The encoder type can be chosen via the 'encoder_type' argument. There are 3 types available:
        'lstm' -- a simple LSTM RNN many-to-one
        'bilstm' -- a BiLSTM RNN
        'CNN' -- a simple 1D CNN
    
    the encoder outputs and states can be accessed after training via the 'encoder_outputs' and 'encoder_states' properties

    Arguments:
        max_words -- int, the maximum size, in words, of the sentences/bags of words (default=25)
        embedding_dict -- the dictionnary mapping every word to its vectorial representation
        word_to_index -- the dictionnary mapping every word in the vocabulary to its index (used in the embedding layer)
        encoder_type -- string, 'lstm' 'bilstm' or 'cnn'. Specifies the type of encoder to be used
        encoder_lstm_units -- int, number of hidden units in the (bi)lstm layers of the encoder (default=64)
        decoder_units -- int, number of hidden units in the lstm layer of the decoder (default=64)
        name -- string, name of the model
    """
    def __init__(
        self,
        embedding_dict,
        word_to_index,
        max_words=27,
        encoder_type='lstm',
        n_a=32,
        n_s=64,
        name="seq2seq",
        **kwargs
    ):
        super(testSeq2Seq, self).__init__(name=name, **kwargs)
        self.inputs = [Input((max_words,)), Input((max_words,))]

        # Check the desired type of encoding
        if encoder_type == 'bilstm':
            self.encoder = models.EncoderBiLSTM()
            self.encoder_type = 'bilstm'
        elif encoder_type == 'cnn':
            #TODO: Add the CNN Encoder here
            None
        else:
            self.encoder = EncoderLSTM(embedding_dict,word_to_index,n_a)
            self.encoder_type = 'lstm'

        self.decoder = DecoderLSTM(embedding_dict, word_to_index, n_s)
    
    def call(self, inputs):        
        encoder_outputs, encoder_states = self.encoder(inputs[0])
        X = self.decoder(inputs[1], encoder_states)
        
        ## Save values for inference ##
        self.encoder_outputs = encoder_outputs
        self.encoder_states = encoder_states
        return X

In [41]:
glove32 = {}
for key in glove_dict:
    glove32[key] = np.float32(glove_dict[key])
len(glove32.keys())

1917498

In [5]:
class EncoderLSTM(Layer):
    def __init__(self, embedding_dict, word_to_index, n_a = 32, **kwargs):
        super(EncoderLSTM, self).__init__()        
        self.embedding = models.EmbeddingLayer(embedding_dict,word_to_index)
        self.lstm = LSTM(n_a, return_sequences=False, return_state=True)
    
    def call(self, inputs, training=None):
        encoder_embeddings = self.embedding(inputs)
        X, state_h, state_c = self.lstm(encoder_embeddings)
        encoder_states = [state_h, state_c]
        print(encoder_states)
        return X, encoder_states

class DecoderLSTM(Layer):
    def __init__(self, embedding_dict, word_to_index, n_s = 64, dropout_rate=0.5, **kwargs):
        super(DecoderLSTM, self).__init__()
        vocabSize = len(word_to_index.keys())
        self.lstm = LSTM(n_s, return_sequences=True, return_state=True)
        self.embedding = models.EmbeddingLayer(embedding_dict,word_to_index)
        self.dense = Dense(vocabSize, activation='softmax')
        self.dropout = Dropout(dropout_rate)
    
    def call(self, inputs, encoder_states, training=None):
        decoder_embeddings = self.embedding(inputs)
        X, state_h, state_c = self.lstm(decoder_embeddings, initial_state=encoder_states)
        X = self.dropout(X)
        X = self.dense(X)
        decoder_states = [state_h, state_c]
        return X, decoder_states

def Seq2Seq(maxLen, n_a, n_s, embedding_dict, word_to_index):
    """
    Architecture for the sequence-to-sequence Model used in the linearization task.
    """
    
    # Define the inputs
    encoder_inputs = Input(shape=(maxLen,))
    # h0 = Input(shape=(n_s,), name='s0')
    # c0 = Input(shape=(n_s,), name='c0')
    decoder_inputs = Input((maxLen,))

    # Encoder
    encoder_embeddings = models.EmbeddingLayer(embedding_dict,word_to_index)(encoder_inputs)
    encoder_outputs, state_h, state_c = LSTM(n_a, return_sequences=False, return_state=True)(encoder_embeddings)
    encoder_states = [state_h, state_c]
    # encoder_outputs, encoder_states = EncoderLSTM(embedding_dict,word_to_index,n_a)(encoder_inputs)

    # Decoder
    decoder_embeddings = models.EmbeddingLayer(embedding_dict,word_to_index)(decoder_inputs)
    X, _, _ = LSTM(n_s, return_sequences=True, return_state=True)(decoder_embeddings, initial_state=encoder_states)
    X = Dense(vocabSize, activation='softmax')(X)
    # X = DecoderLSTM(embedding_dict=embedding_dict, word_to_index=word_to_index, n_s=n_s)(decoder_inputs, encoder_states)
       
    model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=X)
    return model, encoder_outputs, encoder_states

In [7]:
maxLen = 25 + 2

# embedding_dict=glove_dict, word_to_index=word_to_index
baseline_model = testSeq2Seq(glove_dict,word_to_index,27,'lstm', 64,64)
baseline_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics = ['accuracy'])

In [8]:
baseline_model.build((maxLen,))
baseline_model.summary()

Model: "seq2seq"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 encoder_lstm (EncoderLSTM)  multiple                  0 (unused)
                                                                 
 decoder_lstm (DecoderLSTM)  multiple                  0 (unused)
                                                                 
Total params: 40,000,500
Trainable params: 0
Non-trainable params: 40,000,500
_________________________________________________________________


In [None]:
batch_size = 8
vocabSize = len(word_to_index.keys())
baseline_model.fit_generator(
    generator=models.GenerateBatch(X_train_int, Y_train_int, vocabSize, batch_size),
    steps_per_epoch=math.ceil(len(X_train_int)/batch_size),
    epochs=5,
    verbose=1,
    validation_data=models.GenerateBatch(X_test_int, Y_test_int, vocabSize, batch_size=batch_size),
    validation_steps=math.ceil(len(X_test_int)/batch_size),
    workers=1,
    )

In [26]:
from tensorflow import keras

In [39]:
model = load_model('models\\seq2seq_bilstm2021-12-18_10-55')

In [32]:
model.encoder_states

ListWrapper([])

In [318]:
def EncoderCNN(embedding_dict, word_to_index, n_s, dropout_rate=0.5):
    # Define the inputs
    encoder_inputs = Input(shape=(maxLen,))

    encoder_embeddings = models.EmbeddingLayer(embedding_dict,word_to_index)(encoder_inputs)
    X = Conv1D(1024,3, padding='same', activation='relu')(encoder_embeddings)
    X = Conv1D(512,3, padding='same', activation='relu')(X)
    X = MaxPooling1D()(X)
    X = Conv1D(256,3, padding='same', activation='relu')(X)
    X = MaxPooling1D()(X)
    X = Flatten()(X)
    X = Dropout(dropout_rate)(X)
    state_c = Dense(n_s, activation='linear')(X)
    state_h = Dense(n_s, activation='linear')(X)
    encoder_states = [state_h, state_c]

    model = Model(inputs=encoder_inputs, outputs=[state_c, encoder_states])

    return model, state_c, encoder_states

In [5]:
# model_cnn, _, _ = EncoderCNN(glove_dict, word_to_index, n_s = 64)
model_cnn=models.Seq2Seq(glove_dict, word_to_index, 27, 'cnn', 64, 64)
model_cnn.compile(optimizer='adam', loss='categorical_crossentropy')

In [6]:
model_cnn.build((maxLen,))
model_cnn.summary()

Model: "seq2seq"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 encoder_cnn (EncoderCNN)    multiple                  0 (unused)
                                                                 
 decoder_lstm (DecoderLSTM)  multiple                  0 (unused)
                                                                 
Total params: 40,000,500
Trainable params: 0
Non-trainable params: 40,000,500
_________________________________________________________________


In [7]:
vocabSize = len(word_to_index.keys())
model_cnn.fit_generator(
    generator= models.GenerateBatch(X_train_int, Y_train_int, vocabSize, 8),
    steps_per_epoch= math.ceil(len(X_train_int)/8),
    epochs=1,
    verbose=1,
    validation_data= models.GenerateBatch(X_test_int, Y_test_int, vocabSize, batch_size=8),
    validation_steps=math.ceil(len(X_test_int)/8),
    workers=1)

  model_cnn.fit_generator(


 15/125 [==>...........................] - ETA: 58s - loss: 11.4636

KeyboardInterrupt: 

In [14]:
model_lstm = models.Seq2Seq(glove_dict, word_to_index, maxLen, 'lstm', 64, 64)

In [15]:
model_lstm.compile(optimizer='adam', loss='categorical_crossentropy', metrics = ['accuracy'])

In [16]:
model_lstm.build((maxLen,))
model_lstm.summary()

Model: "seq2seq"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 encoder_lstm_1 (EncoderLSTM  multiple                 0 (unused)
 )                                                               
                                                                 
 decoder_lstm_2 (DecoderLSTM  multiple                 0 (unused)
 )                                                               
                                                                 
Total params: 40,000,500
Trainable params: 0
Non-trainable params: 40,000,500
_________________________________________________________________


In [17]:
vocabSize = len(word_to_index.keys())
model_lstm.fit_generator(
    generator= models.GenerateBatch(X_train_int, Y_train_int, vocabSize, 8),
    steps_per_epoch= math.ceil(len(X_train_int)/8),
    epochs=1,
    verbose=1,
    validation_data= models.GenerateBatch(X_test_int, Y_test_int, vocabSize, batch_size=8),
    validation_steps=math.ceil(len(X_test_int)/8),
    workers=1)

  model_lstm.fit_generator(




<keras.callbacks.History at 0x21ce695ba30>

### Inférence

In [66]:
encoder = model_lstm.encoder

In [174]:
def GetSamplingModels(model, n_s, maxLen, embedding_dict, word_to_index):
    vocabSize = len(word_to_index.keys())
    encoder_inputs = Input((maxLen,))
    _, encoder_states = model.encoder(encoder_inputs)
    encoder_model = Model(encoder_inputs, encoder_states)

    decoder_inputs = Input((maxLen,))
    decoder_state_input_h = Input(shape=(n_s,))
    decoder_state_input_c = Input(shape=(n_s,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    
    decoder_embeddings = models.EmbeddingLayer(embedding_dict,word_to_index)(decoder_inputs)
    X, state_h, state_c = LSTM(n_s, return_sequences=True, return_state=True)(decoder_embeddings, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = Dense(vocabSize, activation='softmax')(X)
    
    decoder_model = Model([decoder_inputs, decoder_states_inputs], [decoder_outputs, decoder_states])
    
    return encoder_model, decoder_model

In [175]:
encoder_model, decoder_model = GetSamplingModels(model_lstm, 64, maxLen, glove_dict, word_to_index)

In [218]:
enc_states = encoder_model.predict(X_test_int[0].reshape(1,27))
np.array(enc_states)[0].shape
np.array(word_to_index['<bos>']).reshape(1,1).shape
a = [np.array(word_to_index['<bos>']).reshape(1,1), enc_states]

In [219]:
out, [h, c] = decoder_model.predict(a)
out.shape

(1, 1, 400004)

In [131]:
decoder_model.summary()

Model: "model_23"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_52 (InputLayer)          [(None, 27)]         0           []                               
                                                                                                  
 embedding_15 (Embedding)       (None, 27, 50)       20000250    ['input_52[0][0]']               
                                                                                                  
 input_53 (InputLayer)          [(None, 64)]         0           []                               
                                                                                                  
 input_54 (InputLayer)          [(None, 64)]         0           []                               
                                                                                           

In [98]:
def DecodeSequence(input_seq, encoder_model, decoder_model, word_to_index, index_to_word, maxLen):
    vocabSize = len(word_to_index.keys())
    # Encode the input as state vectors.
    input_seq = input_seq.reshape(1,maxLen).astype(np.int32)
    states_value = encoder_model.predict(input_seq)

    # Populate the first character of target sequence with the start character.
    target_seq = np.array(word_to_index['<bos>']).reshape(1,1)

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = []
    while not stop_condition:
        output_tokens, states_value = decoder_model.predict([target_seq,  states_value])
        # Sample a token
        sampled_word_seq_index = np.argmax(output_tokens[:,:,input_seq.astype(np.int32)[0]], axis = 2) # index of the word in the input sequence voc
        sampled_word_index = input_seq[0,sampled_word_seq_index[0][0]] # get the vocabulary index of the found word
        input_seq = np.delete(input_seq, sampled_word_seq_index, axis = 1) # removes the word from the input sentence
        sampled_word = index_to_word[sampled_word_index] # get the corresponding word in the vocabulary
        decoded_sentence.append(sampled_word)



        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_word == '<eos>' or len(decoded_sentence) > maxLen or input_seq.size < 1):
            stop_condition = True

        # Update the target sequence (of length 1). 
        target_seq = np.array([[sampled_word_index]])
        # target_seq = np.append(target_seq, [sampled_word_index], axis=1)

        # Update states
        states_value = states_value

    return ' '.join(decoded_sentence)

In [298]:
def OneStepBeamSearch(decoder_model, word, score, states, seq_voc, k):
    out_words = []
    out_scores = []
    out_states = []
    out_vocabs = []
    
    output_probs, output_states = decoder_model.predict([word, states]) # next word probs
    output_scores = tf.math.log(output_probs) # next word log(probs)
    seq_voc_out_scores = np.array(output_scores)[:,:,seq_voc.astype(np.int32)][0,0,:] # only consider words in the seq vocab
    best_k_next_in_seq_voc = np.argpartition(seq_voc_out_scores, -k)[-k:] # take the k best choices
    # for each beam:
    #   create a new seq with appended new index
    #   Compute new score
    #   Remove word from seq vocab
    for i in range(k):
        sampled_word_seq_index = best_k_next_in_seq_voc[i]
        sampled_word_index = seq_voc[sampled_word_seq_index]
        new_score = score + seq_voc_out_scores[sampled_word_seq_index]
        out_words.append(np.array([[sampled_word_index]]))
        out_scores.append(new_score)
        out_vocabs.append(np.array(np.delete(seq_voc, sampled_word_seq_index)))
        out_states.append(output_states)

    return out_words, out_scores, out_states, out_vocabs

def BeamSearch(decoder_model, branches, branchesScores, thisBranch, init_word, init_score, init_states, init_vocab, k = 1):
    
    # Check if there is enough remaining words
    if init_vocab.size <= k:
        k = init_vocab.size
    out_words, out_scores, out_states, out_vocabs =  OneStepBeamSearch(decoder_model, init_word, init_score, init_states, init_vocab, k)
    for i in range(k):
        thisBranch.extend(out_words[i])
        thisBranchScore = out_scores[i]
        if init_vocab.size < 2:
            branches.append(thisBranch)
            branchesScores.append(thisBranchScore)
            break
        else:
            branches, branchesScores = BeamSearch(decoder_model, branches, branchesScores, thisBranch, out_words[i], thisBranchScore, out_states[i], out_vocabs[i], k)

    return branches, branchesScores
    

def DecodeSequenceV2(input_seq, encoder_model, decoder_model, word_to_index, maxLen, k = 1):
    realLen = input_seq[input_seq != 0].size
    beams = []
    scores = []
    init_word = np.array(word_to_index['<bos>']).reshape(1,1)

    # Encode the input as state vectors.
    input_seq = input_seq.reshape(1,maxLen).astype(np.int32)
    states_values = encoder_model.predict(input_seq)
    
    beams, scores = BeamSearch(decoder_model, beams, [], [], init_word, 1, states_values, input_seq[input_seq != 0], k)
    return beams[np.argmax(scores)], beams, scores

def Sequence2Sentence(seq, index_to_word):
    sent = ''
    for n in seq:
        if n[0] != 0:
            sent = sent + ' ' + index_to_word[n[0]]
    return sent.replace(" <eos>", '').replace(" <bos>", '').strip()

    

In [13]:
model_lstm = load_model('models\\seq2seq_bilstm2021-12-18_10-55')

In [20]:
encoder_model, decoder_model = models.GetSamplingModels(model_lstm, 64, maxLen, glove_dict, word_to_index)

In [101]:
print(X_test_int[0].size)
realLen =X_test_int[0][X_test_int[0]!=0].size
print(realLen)

27
14


In [172]:
input_seq = X_test_int[0]
maxLen = 27
glove_dict, word_to_index, index_to_word = util.LoadVectors('data/glove_small.txt')
model = load_model('petitModele')

In [299]:
input_seq = X_test_int[0]

encoder_model, decoder_model = models.GetSamplingModels(model, 64, maxLen, glove_dict, word_to_index)
seq, beams, scores = DecodeSequenceV2(input_seq, encoder_model, decoder_model, word_to_index, maxLen, k = 1)

print('Decoded Sentence : "' + Sequence2Sentence(seq, index_to_word) + '"\n')
print('Beams: \n')
for i in range(len(beams)):
    print(str(scores[i]) + '\t' + Sequence2Sentence(beams[i], index_to_word) + '\n')

Decoded Sentence : "s ' this then , , minute . please for silence rise"

Beams: 

-179.56203174591064	s ' this then , , minute . please for silence rise

