# SI-SI5  Lab4 : text classification, machine translation

## Work to do and assessment policy:

- Work in pairs, according to the table sent by email.
- The two parts A and B of this lab are independent.
- Part A alone is enough to fully validate your grade.
- Part B earns bonus points, as your mark will be computed as 
$$
mark = \min(20, part_A + \frac{1}{2} part_B)
$$
- Fill in this notebook and drop it on https://mvproxy.esiee.fr no later than January 21st, 2024, 23:59
- <b>No submission by email</b>. Submissions by email will not be evaluated.
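
As a quick check of the marking formula above, here is a minimal sketch. The function name `final_mark` and the sample marks are ours, chosen only for illustration.

```python
def final_mark(part_a: float, part_b: float) -> float:
    """Add half of part B to part A, and cap the result at 20."""
    return min(20.0, part_a + 0.5 * part_b)

print(final_mark(15, 6))   # 18.0
print(final_mark(18, 8))   # 20.0 (capped)
```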


## 1. Setup

### 1.1 Terminal setup (recommended)

This lab has been validated in a Docker container in which a generic CPython version of TensorFlow is installed. To run the container on ESIEE's machines, booted under Linux:

```curl -k https://mvproxy.esiee.fr/tmp/nlp.docker.tar | docker load```

```docker run -it -u user fedora:38 bash```

The first command should take a while (about a minute) and prints nothing in the meantime, so just be patient. Once you are logged in to the container, run ```cd ~```, then ```python3```. You can copy and paste the snippets below into your Python interpreter, and update this notebook locally. 

### 1.2 VNC setup

Alternatively, if you want to do everything inside the container through VNC, do the following. In the container, launch

```sudo vncsession user :1```

```ifconfig```

The last command should show a network interface with an IP address of the form 172.17.0.2.
Then, on your host machine:

```xtigervncviewer 172.17.0.2:1```

You should get the usual 'fluxbox' desktop. 

### 1.3 Sharing files with the container

If you wish to share your ESIEE's home directory within the container, you have two options:

Option 1: replace ```login``` by your ESIEE's login in the following command line:

```sshfs -o allow_other -o uid=1000 -o gid=1000 login@172.17.0.1:\~ /mnt/esiee```

Option 2 : on ESIEE's machine, run 
```chmod go+rx $HOME```

Then launch the container with the -v option, something like:
```docker run -it -u user -v$HOME:/mnt/esiee fedora:38 bash```

No matter which option you prefer, you should always find your ESIEE files in ```/mnt/esiee``` within the container. Beware, however, that with option 2, the user under which the container is run has a UID >= 65000 and is neither you nor a member of your ESIEE group, so permissions on your files must be granted to *others*.


## Part A : text classification

In this part, you will have to finish the implementations of two RNN-based models shown on slides 28 and 38 of [Chapter 4](https://perso.esiee.fr/~hilairex/5I-SI5/rnn.pdf). Both networks accept words as input, from sentences which don't exceed a certain length, and aim to perform text classification. 

You will work on the IMDB reviews dataset, hosted by Kaggle [here](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download), which you should download and extract in your working directory. Only a single file, 'IMDB Dataset.csv', is needed.

The following code snippets perform the first steps on text for you - loading, vectorising, and training a basic (non-recurrent) FFN.


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
import tensorflow as tf
from keras.models import Sequential

reviews = pd.read_csv("IMDB Dataset.csv")
reviews.head(2)




We first perform a standard train/test split. During development, I strongly suggest that you first use a small number of samples (1000) to validate your code. IMDb has 50000 reviews, which is too many to iterate quickly. Keep in mind that training RNNs is *slow*.

In [9]:
train, test= train_test_split(reviews, shuffle=True, train_size=1000, test_size=200)

The next step is to vectorize the text. In Lab3, I provided a vecto() function which did this, with relevant padding. I also mentioned Keras offered a TextVectorization layer which did exactly the same job. Its effects are shown below. 

In particular, note that unknown words yield an index of 1, and 0 is used for padding. So real indexation starts at index 2.

In [10]:
# text vectorization : quick demo
vecto= tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens=99, output_mode='int', output_sequence_length=10)
vecto.adapt([["I am the king of the world"],["You are the queen"]])
vecto([["I am the queen"],["World is king unknown"]])

<tf.Tensor: shape=(2, 10), dtype=int64, numpy=
array([[ 8, 10,  2,  5,  0,  0,  0,  0,  0,  0],
       [ 4,  1,  7,  1,  0,  0,  0,  0,  0,  0]])>
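
To make the indexing convention concrete, here is a plain-Python sketch of what the layer does on this toy corpus: lowercase, strip punctuation, split on whitespace, then map words to indices with 0 for padding and 1 for unknown words. The helper names (`tokenize`, `build_vocab`, `vectorize`) are ours, and the tie-breaking rule between equally frequent words is an assumption chosen to match the output above, not a documented guarantee of Keras.

```python
from collections import Counter
import string

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocab(texts, max_tokens):
    """Map each kept word to an index >= 2 (0 = padding, 1 = unknown)."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    # Most frequent words first; ties broken in reverse alphabetical order.
    # The tie-breaking rule is an assumption made to match the output above.
    ordered = sorted(sorted(counts, reverse=True), key=lambda w: -counts[w])
    return {w: i + 2 for i, w in enumerate(ordered[:max_tokens - 2])}

def vectorize(text, vocab, seq_len):
    """Turn a sentence into seq_len indices, truncating or zero-padding."""
    ids = [vocab.get(tok, 1) for tok in tokenize(text)][:seq_len]
    return ids + [0] * (seq_len - len(ids))

vocab = build_vocab(["I am the king of the world", "You are the queen"], max_tokens=99)
print(vectorize("I am the queen", vocab, 10))         # [8, 10, 2, 5, 0, 0, 0, 0, 0, 0]
print(vectorize("World is king unknown", vocab, 10))  # [4, 1, 7, 1, 0, 0, 0, 0, 0, 0]
```

"is" and "unknown" were never seen during adaptation, so both map to index 1.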

We now change the call to adapt the layer to our training data. Note that IMDb reviews are rather long (about 300 words per review on average).

In [11]:
max_words=3000  # the vocabulary size
seq_len=300     # maximum sequence length
vecto= tf.keras.layers.experimental.preprocessing.TextVectorization(max_tokens=max_words, output_mode='int', output_sequence_length=seq_len)
vecto.adapt(train['review'].to_list())


We are now ready to define our model. Below, I first demonstrate a model with the input and vectorization layers alone.

In [12]:
# building model : vectorization alone
model= Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vecto)
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.summary()
model.predict(['I am the king'])


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (Text  (None, 300)               0         
 Vectorization)                                                  
                                                                 
Total params: 0 (0.00 Byte)
Trainable params: 0 (0.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


array([[ 10, 233,   2, 799,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0

As we saw in labs 2 and 3, embeddings are mandatory. Hence, we will add an Embedding layer but, as opposed to what we did before, we will neither initialize it from LSA nor keep it constant. Instead, we will let the model optimize this layer, possibly using dropout (if you use the related option). 
The dimension of 80 below is a crude estimate (roughly based on lab 2 and the LSA results).

In [13]:
model= Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vecto)
model.add(tf.keras.layers.Embedding(max_words+2, 80, input_length=seq_len))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.summary()
model.predict(['I am the king'])


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (Text  (None, 300)               0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 300, 80)           240160    
                                                                 
Total params: 240160 (938.12 KB)
Trainable params: 240160 (938.12 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


array([[[ 0.02301122,  0.00351899, -0.04938359, ...,  0.00213488,
         -0.03849214,  0.00825005],
        [ 0.04912025,  0.02338776, -0.04081937, ...,  0.00578748,
          0.04983165,  0.04060073],
        [ 0.0088893 , -0.02200055, -0.03170415, ...,  0.03678668,
          0.03582979, -0.00192745],
        ...,
        [-0.04202316, -0.03204476, -0.04860274, ...,  0.03045008,
          0.02834256, -0.0475901 ],
        [-0.04202316, -0.03204476, -0.04860274, ...,  0.03045008,
          0.02834256, -0.0475901 ],
        [-0.04202316, -0.03204476, -0.04860274, ...,  0.03045008,
          0.02834256, -0.0475901 ]]], dtype=float32)
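
The shapes and the parameter count in this output can be reproduced with a plain lookup table. The sketch below is only an illustration: the random initialisation range and the helper name `embed` are assumptions, and the indices are the ones printed by the vectorization-only model for 'I am the king'.

```python
import random

max_words, dim, seq_len = 3000, 80, 300
vocab_rows = max_words + 2          # same first argument as the Embedding layer above

# One row of `dim` small random numbers per index (toy initialisation).
rng = random.Random(0)
table = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_rows)]

def embed(indices):
    """Replace each integer index by its row in the table."""
    return [table[i] for i in indices]

indices = [10, 233, 2, 799] + [0] * (seq_len - 4)   # 'I am the king', zero-padded
vectors = embed(indices)

print(len(vectors), len(vectors[0]))   # 300 80
print(vocab_rows * dim)                # 240160 trainable parameters
print(vectors[-1] == vectors[-2])      # True: every padding position gets row 0
```

This is why the last rows of the prediction above are identical: they all correspond to the padding index 0.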

Now it's up to you to devise and train two models which conform to those shown on slides 28 and 38 of Chapter 4 [here](https://perso.esiee.fr/~hilairex/AIC-5102B/rnn.pdf). Some pieces of advice:
- First try to reproduce the one on slide 28 using a SimpleRNN or LSTM. That one is the simplest.
- Both have a return_sequences option: be careful about what you are computing!
- Remember that an embedding turns integer indexes into vectors. Hence your input data is a sequence of *vectors*, whatever type of RNN you use. Pay attention to dimensionality and shapes.
- In the end, you want a single scalar to represent a decision: yes or no (positive or negative).
- Once training is done, you may try a predict() on the test data, but this kind of simple (non-stacked) RNN achieves an accuracy of about 82% at best (see Kaggle's benchmarks). 
- Keras has [Bidirectional](https://keras.io/api/layers/recurrent_layers/bidirectional/) and [Concatenate](https://keras.io/api/layers/merging_layers/concatenate/) layers, which can be very handy. You may however build your model without them, by using variables to connect the output(s) of a layer to the input of a new one. 
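
To see what the shape advice above means, here is a toy recurrent pass written in plain Python. It is not the Keras implementation and not a solution to the exercise: the weights are random, there is no bias, and all sizes and names are assumptions made for the illustration. It only shows the difference between keeping every hidden state and keeping the last one, and how a single sigmoid unit yields one scalar decision.

```python
import math
import random

rng = random.Random(0)
emb_dim, units, steps = 4, 3, 5

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_x = rand_matrix(units, emb_dim)   # input-to-hidden weights
W_h = rand_matrix(units, units)     # hidden-to-hidden weights
w_out = [rng.uniform(-0.5, 0.5) for _ in range(units)]  # final dense unit

def rnn_forward(sequence):
    """Return the hidden state after every time step."""
    h = [0.0] * units
    states = []
    for x in sequence:
        h = [math.tanh(sum(W_x[j][k] * x[k] for k in range(emb_dim)) +
                       sum(W_h[j][k] * h[k] for k in range(units)))
             for j in range(units)]
        states.append(h)
    return states

# A sequence of `steps` embedded words, each a vector of size emb_dim.
sequence = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(steps)]

all_states = rnn_forward(sequence)   # what return_sequences=True would expose
last_state = all_states[-1]          # what return_sequences=False would expose

# One sigmoid unit turns the last state into a single score in (0, 1).
score = 1.0 / (1.0 + math.exp(-sum(w * s for w, s in zip(w_out, last_state))))

print(len(all_states), len(all_states[0]))  # 5 3
print(len(last_state))                      # 3
print(0.0 < score < 1.0)                    # True
```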

In [None]:
# Your answers here

## Part B : inference in neural machine translation

In this part, you will have to write a piece of code which will mimic the beam decoding algorithm shown on slides 30+ of [Chapter 5](https://perso.esiee.fr/~hilairex/AIC-5102B/lstm.pdf)

The following code implements the network shown on slide 26, with the difference that inputs will not be words, but characters. This drastically reduces the memory requirements, at the price of a lower accuracy, however.

The dataset consists of the French-English transcripts from the European Parliament, which you should download from https://www.statmt.org/europarl/v7/fr-en.tgz

We will translate english sentences to french. We first load and sample the transcripts from local files. Note that the '\</s\>' special word on slide 26 has been replaced by a '\x03' character to denote the end of a sentence. Likewise, the beginning of a sentence (which is missing in the decoder part, as it needs an input word or character) will be a '\x02' special character.

In [2]:
# https://www.statmt.org/europarl/

import sys
import keras
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf

# data processing
english=open('europarl-v7.fr-en.en', encoding='utf-8').read().split('\n')
french=open('europarl-v7.fr-en.fr', encoding='utf-8').read().split('\n')

# begin and end special characters
begin='\x02'
end='\x03'

tran=[]
i=0
for x,y in zip(english,french):
    if (len(x) > 0) and (len(x) < 30) and (len(y) > 0) and (len(y) < 40):
        tran.append((x+end,begin+y+end))
        i=i+1
        

# without sampling, the above produces about 60k samples -> too many
tran,_=train_test_split(tran,shuffle=True,train_size=20000)
nsamples=len(tran) # 20000 samples after sampling



We then build the vocabularies (= sets of chars), and the char->ord and ord->char dictionaries, for the source (index=0) and target (index=1) languages. Those will be useful when vectorising sentences. 

In [15]:
voc=[]
char2num=[]
num2char=[]
maxlen=[]

for lang in range(0,2):
    voc.append(sorted(set([c for w in tran for c in w[lang]])))
    c2n={}
    n2c={}
    for i in range(0,len(voc[lang])):
        n2c[i]=voc[lang][i]
        c2n[voc[lang][i]]=i
    char2num.append(c2n)
    num2char.append(n2c)
    maxlen.append(max([len(w[lang]) for w in tran]))

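Next comes vectorisation: we replace every character directly by its one-hot binary representation. As a result, each sentence becomes a matrix (one one-hot row per character), and the vectorised dataset is a rank-3 tensor of shape (nsamples, maxlen, vocabulary size) rather than a matrix of integer indices.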

In [16]:
# vectorisation of sentences
en=0
fr=1
    
vecto=[]
for lang in range(0,2):
    vec=np.zeros((nsamples,maxlen[lang],len(voc[lang])), dtype='float32')
    for sample in range(0,nsamples):
        for row in range(0,len(tran[sample][lang])):
            vec[sample,row,char2num[lang][tran[sample][lang][row]]]=1
    vecto.append(vec)
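
The same encoding on a toy target corpus, in plain Python, shows the resulting shapes. The three sentences are invented for the illustration and the helper name `one_hot` is ours.

```python
begin, end = '\x02', '\x03'
targets = [begin + 'oui' + end, begin + 'non' + end, begin + 'si' + end]

voc = sorted(set(c for s in targets for c in s))
char2num = {c: i for i, c in enumerate(voc)}
maxlen = max(len(s) for s in targets)

def one_hot(sentence):
    """maxlen rows, one per time step; len(voc) columns, one per character."""
    rows = [[0.0] * len(voc) for _ in range(maxlen)]
    for t, c in enumerate(sentence):
        rows[t][char2num[c]] = 1.0
    return rows

data = [one_hot(s) for s in targets]

print(len(data), len(data[0]), len(data[0][0]))  # 3 5 7
print(sum(data[2][-1]))                          # 0.0: padding rows stay all-zero
```

The shortest sentence has only 4 characters, so its fifth row is left at zero, exactly like the unused rows of `vec` above.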

Finally comes the model. 

In [17]:
# building the model

# number of units to use in LSTM layers
lstm_units=128

# encoder side
# input data = any string of the source language
enc_input = keras.layers.Input(shape=(None, len(voc[0])))

# transform this string by an LSTM layer
[enc_out, enc_hidden, enc_cell] = keras.layers.LSTM(units=lstm_units, return_state=True)(enc_input)

# decoder side
# input is a translated string in the target language
dec_input = keras.layers.Input(shape=(None,len(voc[1])))

# the LSTM layer must return two vectors : the hidden state vector, and the cell vector
# Must also return the full sequence, as the decoder is trained in teacher forcing mode
dec_lstm = keras.layers.LSTM(units=lstm_units, return_state=True, return_sequences=True)
[dec_out,dec_hidden,dec_cell] = dec_lstm(dec_input, initial_state=[enc_hidden,enc_cell])
dec_output = keras.layers.Dense(units=len(voc[1]), activation='softmax', use_bias=True)(dec_out)

# final model
model= keras.Model(inputs=[enc_input, dec_input], outputs=dec_output, name='en2fr'+str(lstm_units))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model.summary()

Model: "en2fr128"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_5 (InputLayer)        [(None, None, 115)]          0         []                            
                                                                                                  
 input_6 (InputLayer)        [(None, None, 147)]          0         []                            
                                                                                                  
 lstm_2 (LSTM)               [(None, 128),                124928    ['input_5[0][0]']             
                              (None, 128),                                                        
                              (None, 128)]                                                        
                                                                                           

The following trains the model from the sampled data in teacher forcing mode.
NOTE: epochs=5 is far from enough, but performing a full training is *not* the aim of this part. What matters is the correctness of your code, not the result you will obtain, which will very likely resemble a noisy string.
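
As a reminder of what teacher forcing expects, here is a toy illustration on a single string: at each time step the decoder receives the true character, and the expected output is the true character of the next time step. The padding marker `pad` and the function name are hypothetical, introduced only for this sketch (in the one-hot tensors, padding is simply an all-zero row).

```python
begin, end, pad = '\x02', '\x03', '\x00'   # pad: hypothetical stand-in for all-zero rows

def teacher_forcing_pair(sentence, maxlen):
    """Return (decoder input, expected output), both of length maxlen."""
    dec_in = (begin + sentence + end).ljust(maxlen, pad)
    # Expected output: the same characters, shifted by one time step.
    expected = dec_in[1:] + pad
    return dec_in, expected

dec_in, expected = teacher_forcing_pair('oui', maxlen=6)
for t, (a, b) in enumerate(zip(dec_in, expected)):
    print(t, repr(a), '->', repr(b))
```

Note that the shift runs along the time axis of one sentence, never across different sentences.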

In [7]:
# teacher forcing : the expected output is the same as the decoder
# input, except that it is shifted one time step forward (axis 1 = time)
y= np.zeros(shape=vecto[1].shape, dtype='float32')
y[:,0:-1,:]= vecto[1][:,1:,:]
model.fit(x=[vecto[0],vecto[1]], y=y, validation_split=0.25, epochs=5, batch_size=64)
#    saved_model='/home/shared/en2fra'+str(lstm_units)
#    model.save(saved_model)       



### Work to do : beam searching
    
Use the trained model above, including its final states, to write a piece of code which will execute a memoryless beam searching algorithm. This should do the following:
1. Given an input string, encode it using the encoder model. That will give you a final hidden state (enc_hidden) and cell state (enc_cell).
2. Set (enc_hidden,enc_cell) as the initial states of a decoder model, which should behave exactly like the one you built in the "decoder side" section, except that it has an initial state that must be set for any new input string.
3. Set the current character to '\x02', to initially denote the beginning of the translated sentence.
4. Feed the (vectorised) current character to the decoder and ask for its prediction: you will obtain a probability distribution.
5. Following beam searching, from this probability distribution you should normally extract the $n$ most probable characters. We will simplify and choose $n=1$ (memoryless beam search) to keep the best candidate.
6. Add this best candidate to your decoded string, set the current character to this character, and loop back to step 4 unless the decoded sentence is too long ($length > len(voc[1])$) or an '\x03' character is predicted (end of sentence).
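
The control flow of these steps can be sketched independently of Keras. In the sketch below, the trained decoder is replaced by a hypothetical lookup table (`TOY_TABLE`) and the state is passed through unchanged; with the real model, the step function would call the decoder, and the state would be the (hidden, cell) pair, starting from (enc_hidden, enc_cell). All names are ours.

```python
begin, end = '\x02', '\x03'

# Hypothetical stand-in for the trained decoder: for each current character,
# a probability distribution over the next character.
TOY_TABLE = {
    begin: {'o': 0.7, 'n': 0.3},
    'o':   {'u': 0.6, 'n': 0.4},
    'u':   {'i': 0.9, end: 0.1},
    'i':   {end: 0.8, 'o': 0.2},
    'n':   {'o': 0.5, end: 0.5},
}

def toy_decoder_step(char, state):
    """Return (distribution over next characters, new state)."""
    # A real LSTM decoder would update (hidden, cell); here the state is unused.
    return TOY_TABLE[char], state

def greedy_decode(initial_state, step, max_len):
    """Keep the single most probable character at every step (n = 1)."""
    current, state, decoded = begin, initial_state, ''
    while len(decoded) < max_len:
        distribution, state = step(current, state)
        best = max(distribution, key=distribution.get)
        if best == end:
            break
        decoded += best
        current = best
    return decoded

print(greedy_decode(initial_state=None, step=toy_decoder_step, max_len=20))  # oui
```

The loop stops either when the end character is the most probable one or when the decoded string reaches `max_len`.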

Simply let your code produce its results. Don't expect good outputs: even though the model is properly built, there are issues with the data preparation, as explained in class.