# AIC-5102B  Lab4 : text classification, machine translation

This lab must be done on mvxr.esiee.fr -> please visit https://mvproxy.esiee.fr to see the connection procedure

## Work to do and assessment policy:

- The two parts of this lab are completely independent.
- You are only requested to do part A to fully validate your grade
- Part B comes as bonus points, as your mark will be computed as 
$$
mark = \min(20, part_A + \frac{1}{2} part_B)
$$
- Simply fill this notebook and drop it on mvproxy no later than november 20th 23:59


## Part A : text classification

In this part, you will have to finish the implementations of two RNN-based models shown on slides 28 and 38 of [Chapter 4](https://perso.esiee.fr/~hilairex/AIC-5102B/rnn.pdf). Both networks accept words as input, from sentences which don't exceed a certain length, and aim to perform text classification. 

You will work on the the IMDB reviews dataset, hosted by Kaggle [here](https://www.kaggle.com/code/trentpark/data-analysis-basics-imdb-dataset). The reviews have two outcomes : positive, or negative. A copy of this dataset can be found locally in /home/shared.

The following code snippets perform the first steps on text for you - loading, vectorising, and training a basic (non-recurrent) FFN.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import nltk
import tensorflow as tf
from keras.models import Sequential

reviews = pd.read_csv("/home/shared/IMDB Dataset.csv")
reviews.head(2)

2023-11-24 14:32:01.920436: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive


We first perform a standard test/train split. During development, I strongly suggest that you first use a small amount of samples (1000) for validation. IMDb has 50000 reviews, which is too much. Keep in ming that training RNNs is *slow*

In [144]:
train = reviews['review']
test = reviews['sentiment']
test = LabelEncoder().fit_transform(test)
X_train, X_test, y_train, y_test = train_test_split(train, test, shuffle=True, test_size=0.2, random_state=42)

The next step is to vectorize the text. In Lab3, I provided a vecto() function which did this, with relevant padding. I also mentioned Keras offered a TextVectorization layer which did exactly the same job. Its effects are shown below. 

In particular, note that unknown words yield an index of 1, and 0 is used for padding. So real indexation starts at index 2.

In [3]:
# text vectorization : quick demo
vecto= tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens=99, output_mode='int', output_sequence_length=10)
vecto.adapt([["I am the king of the world"],["You are the queen"]])
vecto([["I am the queen"],["World is king unknown"]])

<tf.Tensor: shape=(2, 10), dtype=int64, numpy=
array([[ 8, 10,  2,  5,  0,  0,  0,  0,  0,  0],
       [ 4,  1,  7,  1,  0,  0,  0,  0,  0,  0]])>

We now change the call to adapt the layer to our train data. Note that IMDb reviews are rather long (about 300 words / review on average)

In [4]:
max_words=3000  # the vocabulary size
seq_len=300     # maximum sequence length
vecto= tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens=max_words, output_mode='int', output_sequence_length=300)
vecto.adapt(train['review'].to_list())


We are now ready to define our model. Below, I first demonstrate a model with input and vectorization layer alone .

In [5]:
# building model : vectorization alone
model= Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vecto)
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.summary()
model.predict(['I am the king'])


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (Text  (None, 300)               0         
 Vectorization)                                                  
                                                                 
Total params: 0 (0.00 Byte)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


array([[  10,  203,    2, 1049,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0, 

As we saw in labs 2 and 3, embeddings are mandatory. Hence, we will add an Embedding layer, but as opposed as what we did before, we will not initialize if from LSA, nor put it constant. Instead, we will let the model optimize this layer, possibly using dropout (if you use the related option). 
The dimension of 80 below is a crude estimation (barely from lab2 and results on LSA)  

In [6]:
model= Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vecto)
model.add(tf.keras.layers.Embedding(max_words+2, 80, input_length=seq_len))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.summary()
model.predict(['I am the king'])
# TODO Implement RNN

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (Text  (None, 300)               0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 300, 80)           240160    
                                                                 
Total params: 240160 (938.12 KB)
Trainable params: 240160 (938.12 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


array([[[ 0.00565327, -0.03459281,  0.01273665, ..., -0.03983495,
         -0.03929449,  0.00613182],
        [-0.00689242,  0.00185695,  0.0247079 , ...,  0.01098173,
         -0.02924296,  0.01897011],
        [ 0.01561158,  0.01476458, -0.00319093, ...,  0.0498675 ,
          0.03967917,  0.02086109],
        ...,
        [-0.04816231, -0.02181544,  0.04463018, ...,  0.04527048,
         -0.0108258 ,  0.01364191],
        [-0.04816231, -0.02181544,  0.04463018, ...,  0.04527048,
         -0.0108258 ,  0.01364191],
        [-0.04816231, -0.02181544,  0.04463018, ...,  0.04527048,
         -0.0108258 ,  0.01364191]]], dtype=float32)

Now it's up to you to devise and train two models which conforms those shown on slides 28 and 38 of Chapter 4 [here](https://perso.esiee.fr/~hilairex/AIC-5102B/rnn.pdf). Some pieces of advice :
- Try first to reproduce the one on slide 28 using a SimpleRNN or LSTM. That one is the simplest.
- Both have a return_sequence option, beware to what you are computing !
- Remember that embedding turn integer indexes into vectors. Hence your input data is a sequence of *vectors* whatever type of RNN you use. Be careful to dimensionality and shapes.
- In the end, you want a single scalar to represent a decision : yes or no (positive or negative)
- Once training is done, you may try a predict() on thetest data, but such kind of simple (non stacked) RNN achieves an accuracy of about 82% at best (see Kaggle's benchmarks). 
- Keras has a [Bidirectional](https://keras.io/api/layers/recurrent_layers/bidirectional/) and a [Concatenate](https://keras.io/api/layers/merging_layers/concatenate/) layers, which can be very handy. You may however build your model without using them, by using variables to connect the output(s) of a layer to the input of a new one. 

<font color="blue">
<h3>Text classification with forward RNN</h3>
<h4>Model definition</h4><br/>
For this neural netword, the LSTM layer fit pretty well. It must have 128 cells since it is appropriate to use a power of two number of cells and that the previous embedding layer returns an output of size 80. Also, the loss function has been changed to a binary_crossentropy because it fits better to binomial classification. Thus, it is necessary to change the activation function of the output layer from softmax (designed for multinomial classification) to sigmoid.
</font>

In [147]:
# Preprocessing layers
model= Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vecto)
model.add(tf.keras.layers.Embedding(max_words+2, 80, input_length=seq_len))
# RNN
model.add(tf.keras.layers.LSTM(128, activation='sigmoid', return_sequences=True))
model.add(tf.keras.layers.GlobalMaxPooling1D())
# Classification
model.add(tf.keras.layers.Dense(32, activation='sigmoid'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.summary()

Model: "sequential_57"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (Text  (None, 300)               0         
 Vectorization)                                                  
                                                                 
 embedding_59 (Embedding)    (None, 300, 80)           240160    
                                                                 
 lstm_65 (LSTM)              (None, 300, 128)          107008    
                                                                 
 global_max_pooling1d_17 (G  (None, 128)               0         
 lobalMaxPooling1D)                                              
                                                                 
 dense_44 (Dense)            (None, 32)                4128      
                                                                 
 dense_45 (Dense)            (None, 1)               

<font color="blue">
The compilation give satisfying results since there is no error. Also, we notice that each output's dimension fits the following input's constraints.<br/><h4>Training the model</h4><br/>To test the perks of this model, we can train it over a training dataset.
</font>

In [148]:
model.fit(X_train, y_train, epochs=5, batch_size=16)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7fdb55c88190>

<font color="blue">
The training process nearly took an hour. It achieves very good scores, which are over the expected results from Kaggle. Kaggle expects a <b>82%</b> accuracy where we obtain <b>91%</b>. We can suppose that it is due to some overfitting. To verify this hypothesis, it is possible to score the model over a test dataset and check if there is a gap between the training score and the testing score.
<br/><h4>Evaluating the model</h4>
</font>

In [149]:
score = model.evaluate(X_test, y_test) 

print('Test loss:', score[0]) 
print('Test accuracy:', score[1])

Test loss: 0.3158118426799774
Test accuracy: 0.8711000084877014


<font color="blue">
For the test dataset, we obtain a greater loss (<b>0.31</b>) and a smaller accuracy (<b>87%</b>). It testifies of some overfitting. To fix this issue, it is possible to add some dropout layers inside of the neural network. However, to fit to the course model, we made the choice of not implementing this feature and to keep the small overfitting.  
</font>

<font color="blue">
<h3>Text classification with bidirectional RNN</h3>
</font>

In [175]:
# Preprocessing layers
model= Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vecto)
model.add(tf.keras.layers.Embedding(max_words+2, 80, input_length=seq_len))
# Bidirectional layers
FLSTM = tf.keras.layers.LSTM(128, activation='sigmoid')(model.layers[1].output)
BLSTM = tf.keras.layers.LSTM(128, activation='sigmoid', go_backwards=True)(model.layers[1].output)
concatenated = tf.concat([FLSTM, BLSTM], axis=-1)
# Classification layers
hidden_dense = tf.keras.layers.Dense(64, activation='sigmoid')(concatenated)
output = tf.keras.layers.Dense(1, activation='sigmoid')(hidden_dense)
new_model = tf.keras.Model(inputs=model.input, outputs=output)

new_model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
new_model.summary()

Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_63 (InputLayer)       [(None, 1)]                  0         []                            
                                                                                                  
 text_vectorization_1 (Text  (None, 300)                  0         ['input_63[0][0]']            
 Vectorization)                                                                                   
                                                                                                  
 embedding_60 (Embedding)    (None, 300, 80)              240160    ['text_vectorization_1[62][0]'
                                                                    ]                             
                                                                                            

In [178]:
new_model.fit(X_train, y_train, epochs=5, batch_size=16)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7fdab41ff290>

In [181]:
new_model.evaluate(X_test, y_test)



[0.30506381392478943, 0.8745999932289124]

## Part B : inference in neural machine translation

In this part, you will have to write a piece of code which will mimic the beam decoding algorithm shown on slides 30+ of [Chapter 5](https://perso.esiee.fr/~hilairex/AIC-5102B/lstm.pdf)

The following code implements the network shown on slide 26, with the difference that inputs will not be words, but characters - this drastically reduces the memory requirements, to the price of a lower accuracy, however.

The dataset is derived from transcripts of the European parliament - see https://www.statmt.org/europarl/
We will translate english sentences to french. We first load and sample the transcripts from local files. Note that the '\</s\>' special word on slide 26 has been replaced by a '\x03' character to denote the end of a sentence. Likewise, the beginning of a sentence (which is missing in the decoder part, as it needs an input word or character) will be a '\x02' special character.

In [12]:
# https://www.statmt.org/europarl/

import sys
import keras
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf

# data processing
english=open('/home/shared/europarl-v7.fr-en.en', encoding='utf-8').read().split('\n')
french=open('/home/shared/europarl-v7.fr-en.fr', encoding='utf-8').read().split('\n')

# begin and end special characters
begin='\x02'
end='\x03'

tran=[]
i=0
for x,y in zip(english,french):
    if (len(x) > 0) and (len(x) < 30) and (len(y) > 0) and (len(y) < 40):
        tran.append((x+end,begin+y+end))
        i=i+1
        

# without sampling the above produces about 60k samples -> too much
tran,_=train_test_split(tran,train_size=20000)
nsamples=len(tran) # about 60k samples


In [66]:
print(tran[100])

('Thank you very much, Mr Blak.\x03', '\x02Je vous remercie, Monsieur Blak.\x03')


We then build the vocabularies (=set of chars), and char->ord and ord->char dictionaries, for source (index=0) and target (index=1) languages. Those will be useful when vectorising sentences . 

In [13]:
voc=[]
char2num=[]
num2char=[]
maxlen=[]

for lang in range(0,2):
    voc.append(sorted(set([c for w in tran for c in w[lang]])))
    c2n={}
    n2c={}
    for i in range(0,len(voc[lang])):
        n2c[i]=voc[lang][i]
        c2n[voc[lang][i]]=i
    char2num.append(c2n)
    num2char.append(n2c)
    maxlen.append(max([len(w[lang]) for w in tran]))

Next comes vectorisation : we replace every character directly by its one-hot binary representation. As a result, the vectorisation of a sentence is directly a tensor, and not a matrix.

In [14]:
# vectorisation of sentences
en=0
fr=1
    
vecto=[]
for lang in range(0,2):
    vec=np.zeros((nsamples,maxlen[lang],len(voc[lang])), dtype='float32')
    for sample in range(0,nsamples):
        for row in range(0,len(tran[sample][lang])):
            vec[sample,row,char2num[lang][tran[sample][lang][row]]]=1
    vecto.append(vec)

In [5]:
print(tran[1])

('Petitions: see Minutes\x03', '\x02Pétitions: voir procès-verbal\x03')


Finally comes the model. 

In [74]:
# building the model

# number of units to use in LSTM layers
lstm_units=128

# encoder side
# input data = any string of the source language
enc_input = keras.layers.Input(shape=(None, len(voc[0])))

# transform this string by an LSTM layer
[enc_out, enc_hidden, enc_cell] = keras.layers.LSTM(units=lstm_units, return_state=True)(enc_input)

# decoder side
# input is a translated string in the target language
dec_input = keras.layers.Input(shape=(None,len(voc[1])))

# the LSTM layer must return two vectors : the hidden state vector, and the cell vector
# Must also return the full sequence, as the decoder is trained in teacher forcing mode
dec_lstm = keras.layers.LSTM(units=128, return_state=True, return_sequences=True)
[dec_out,dec_hidden,dec_cell] = dec_lstm(dec_input, initial_state=[enc_hidden,enc_cell])
dec_output = keras.layers.Dense(units=len(voc[1]), activation='softmax', use_bias=True)(dec_out)

# final model
model= keras.Model(inputs=[enc_input, dec_input], outputs=dec_output, name='en2fr'+str(lstm_units))
model.compile(loss='categorical_crossentropy')
model.summary()

Model: "en2fr128"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_33 (InputLayer)       [(None, None, 131)]          0         []                            
                                                                                                  
 input_34 (InputLayer)       [(None, None, 144)]          0         []                            
                                                                                                  
 lstm_8 (LSTM)               [(None, 128),                133120    ['input_33[0][0]']            
                              (None, 128),                                                        
                              (None, 128)]                                                        
                                                                                           

The following snippet offers to train or load pretrained model from disk. Do *always* load a model from disk, on my HP380 server, training takes *hours* of computation time.

In [75]:
#saved_model='/home/shared/en2fra'+str(lstm_units)
saved_model=''
if saved_model == '':
    # teacher forcing : expected output is the same than the decoded
    # sentence, except that it is shifted one time unit forward
    y= np.ndarray(shape=vecto[1].shape)
    y[0:nsamples-1,:,:]= vecto[1][1:nsamples,:,:]
    model = model.fit(x=[vecto[0],vecto[1]], y=y, validation_split=0.25, epochs=100, batch_size=64)
    saved_model='/home/boiss/en2fra'+str(lstm_units)
    model.save(saved_model)
else:
    model= keras.models.load_model(saved_model)
        


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


AttributeError: 'History' object has no attribute 'save'

### Work to do : beam searching
    
Use the trained model below, including its final states, to write a piece of code which will execute a memoryless beam searching algorithm. This should do the following:
1. Given an input string, encode it using the encoder model. That will give you a final hidden state (enc_hidden) and cell state (enc_cell)
2. Set (enc_hidden,enc_cell) as the initial states of a decoder model, which should behave exactly as the one you built in the "decoder side" section, except that it has an initial state that must be set for any new input string
3. Set the current character to '\x02', to initially denote the beginning of the translated sentence 
4. If you feed the (vectorised) current character to the decoder, and ask for its prediction, you will obtain a probability distribution
4. Following beam searching, from this probability distribution you should normally extract the $n$ most probable characters. We will simplify and choose $n=1$ (memoryless beam search) to keep the best candidate
5. Add this best candidate to your decoded string, set the current character to this character, and loop to step 3 unless the decoded sentence is too long ($length > len(voc[1])$) or an '\x03' character is predicted (end of sentence)

Simply let your code produce its results. Don't expect good outputs, even though the model is properly built, there are issues with the data preparation, as explained in class.

In [81]:
# define the encoder model
encoder_model = keras.Model(inputs=enc_input, outputs=[enc_out, enc_hidden, enc_cell])
# define the input string
input_string = 'Hello world'+end

# Vectorize the input string
input_sequence = np.zeros((1, maxlen[0], len(voc[0])), dtype='float32')
for i in range(len(input_string)):
    input_sequence[0, i, char2num[0][input_string[i]]] = 1.
    
out, hidden, cell = encoder_model.predict(input_sequence)

# set the initial states of the decoder
states_value = [hidden, cell]



In [82]:
# Input layers for any new strings
dec_hidden_input = keras.layers.Input(shape=(lstm_units,))
dec_cell_input = keras.layers.Input(shape=(lstm_units,))
dec_initial_states = [dec_hidden_input, dec_cell_input]
[dec_output, dec_hidden, dec_cell] = dec_lstm(dec_input, initial_state=dec_initial_states)
dec_output = keras.layers.Dense(units=len(voc[1]), activation='softmax', use_bias=True)(dec_output)
dec_model = keras.Model(inputs=[dec_input] + dec_initial_states, outputs=[dec_output, dec_hidden, dec_cell])

In [83]:
current_char = begin
current_hidden = hidden
current_cell = cell
dec_sentence = ''

while current_char != end and len(dec_sentence) < maxlen[1]:
    # One hot vector
    vectorized = np.zeros((1,1,len(voc[1])))
    vectorized[0,0,char2num[1][current_char]] = 1
    # Decode
    [dec_output, current_hidden, current_cell] = dec_model.predict([vectorized, current_hidden, current_cell], verbose=0)
    # Best candidate index
    best_candidate = np.argmax(dec_output[0,0,:])
    # Update the current character
    current_char = num2char[1][best_candidate]
    dec_sentence+=current_char
print(dec_sentence)

Ybbžžžžžžžžžžžžžžžžžžžžžžžžžžžžžžžžžžžžžž
