# Text generated by an LSTM model

## Objective

The objective of this project is to develop a text generation model using a Long Short-Term Memory (LSTM) recurrent neural network. The model will be trained on a dataset of text, and it will then be used to generate new and original text.

## Method

The text generation model used in this project is an LSTM. LSTMs are a type of recurrent neural network specifically designed to process sequential data, such as language. They are well suited to text generation because they let the model take the preceding context into account when predicting the next element of a sequence (here, the next character).

![LSTM Architecture](/Data/Illustration/Architecture.png)

## Dataset

- ### Description
The model was trained on the French version of the Gospel of Saint Matthew. The Gospel of Saint Matthew is a rich and complex text that offers a variety of styles and themes, especially in French. It is therefore a good candidate for training a text generation model.

The following example shows a portion of the dataset ***before*** pre-processing (you can also find the file in the dataset directory of this repository):

```text
Chapitre 1
1 Livre de la généalogie de Jésus Christ, fils de David, fils d’Abraham :
2 Abraham engendra Isaac ; et Isaac engendra Jacob ; et Jacob engendra Juda et ses frères ;

3 et Juda engendra Pharès et Zara, de Thamar ; et Pharès engendra Esrom ; et Esrom engendra Aram ;

4 et Aram engendra Aminadab ; et Aminadab engendra Naasson ; et Naasson engendra Salmon ;

5 et Salmon engendra Booz, de Rachab ; et Booz engendra Obed, de Ruth ; et Obed engendra Jessé ;

[...]

— v. 20 : *Seigneur, ici et dans la suite : le Seigneur Dieu ou l’Éternel de l’Ancien Testament. — v. 21 : Jésus : transcription de l’hébreu Jéshua ou Joshua = l’Éternel [est] sauveur.
```

- ### Pre-Processing
Unfortunately, the text cannot be used as it is: it needs to be cleaned so that it can be processed more easily by our future model. By removing chapter headers, footnotes and verse numbers, and by normalizing the text, the code makes the text more consistent and easier for the model to learn from.

```python
# For regular expressions
import re

new_file_content = []

with open("./Data/Dataset/Matthieu_raw.txt", "r") as file:

    content_file = file.readlines()

    for line in content_file:

        # Lowercase the text
        text_to_write = line.lower()

        # Skip chapter headers ("Chapitre 1")
        if re.match(r"^chapitre", text_to_write) is None:

            # Skip footnote lines ("— v. 20 : ...")
            if re.match(r"^—\s+v\.\s+[0-9]+", text_to_write) is None:

                # Strip the leading verse number, then newlines and asterisks
                text_to_write = re.sub(r"^\d+", "", text_to_write)
                text_to_write = text_to_write.replace("\n", "")
                text_to_write = text_to_write.replace("*", "")
                new_file_content.append(text_to_write)

# Write the processed text to another file.
# "w" (rather than "a") avoids duplicating the text if the cell is run twice.
with open("./Data/Dataset/Matthieu_processed.txt", "w") as file_to_write:
    for l in new_file_content:
        file_to_write.write(l)
```

This code reads the lines of the Gospel of Matthew into a list called `content_file`. It then iterates over the list, lowercases each line, and skips any line that starts with ***"chapitre"*** (chapter headers) or ***"— v. [0-9]+"*** (footnotes). From the remaining lines, it strips the leading verse number, newline characters and asterisks (`*`), and appends the result to a new list called `new_file_content`. Finally, it writes the contents of `new_file_content` to a new file, which holds the cleaned and processed version of the Gospel of Matthew.
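To make the effect of each rule easier to see, here is a minimal sketch that applies the same three steps to a few lines taken from the sample above. The helper name `clean_line` is ours and does not exist in the repository; it only repackages the logic of the cell for illustration.

```python
import re

def clean_line(line):
    """Return the cleaned line, or None if the line must be dropped.

    Hypothetical helper: same rules as the pre-processing cell.
    """
    text = line.lower()
    # Drop chapter headers and footnote lines
    if re.match(r"^chapitre", text) or re.match(r"^—\s+v\.\s+[0-9]+", text):
        return None
    # Strip the leading verse number, then newlines and asterisks
    text = re.sub(r"^\d+", "", text)
    return text.replace("\n", "").replace("*", "")

sample = [
    "Chapitre 1\n",
    "1 Livre de la généalogie de Jésus Christ, fils de David, fils d’Abraham :\n",
    "\n",
    "— v. 20 : *Seigneur, ici et dans la suite\n",
]

cleaned = [c for c in map(clean_line, sample) if c is not None]
result = "".join(cleaned)
print(repr(result))
```

Only the verse survives, lowercased and without its number. The space that followed the verse number is kept, which is what separates consecutive verses once the newlines are removed.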

The following example shows the text after the pre-processing (you can also find the whole text in the dataset folder): 

```text
livre de la généalogie de jésus christ, fils de david, fils d’abraham : abraham engendra isaac ; et isaac engendra jacob ; et jacob engendra juda et ses frères ; et juda engendra pharès et zara, de thamar ; et pharès engendra esrom ; et esrom engendra aram ; et aram engendra aminadab ; et aminadab engendra naasson ; et naasson engendra salmon ; et salmon engendra booz, de rachab ; et booz engendra obed, de ruth ; et obed engendra jessé ; et jessé engendra david le roi ; et david le roi engendra salomon, de celle [qui avait été femme] d’urie ; et salomon engendra roboam ; et roboam engendra abia ; et abia engendra asa ; et asa engendra josaphat ; et josaphat engendra joram ; et joram engendra ozias ; et ozias engendra joatham ; et joatham engendra achaz ; et achaz engendra ézéchias ; et ézéchias engendra manassé ; et manassé engendra amon ; et amon engendra josias ; et josias engendra jéchonias et ses frères, au temps de la transportation de babylone ; et après la transportation de babylone, jéchonias engendra salathiel ; et salathiel engendra zorobabel ; et zorobabel engendra abiud ; et abiud engendra éliakim ; et éliakim engendra azor ; et azor engendra sadok ; et sadok engendra achim ; et achim engendra éliud ; et éliud engendra éléazar ; et éléazar engendra matthan ; et matthan engendra jacob ; et jacob engendra joseph, le mari de marie, de laquelle est né jésus, qui est appelé christ. toutes les générations, depuis abraham jusqu’à david, sont donc quatorze générations ; et depuis david jusqu’à la transportation de babylone, quatorze générations ; et depuis la transportation de babylone jusqu’au christ, quatorze générations.
```


- ### Vocabulary
Now that our text is cleaned, we can create the vocabulary that will be used by the model. Our model uses the "char to char" method, which means that it is based on characters rather than words. In other words, it aims to predict sequences of characters (such as "a", "é", ".", etc.) and not sequences of words (such as "appelé", "achim", etc.).

We chose this approach because character-level generation models are flexible and robust for text generation: the vocabulary is small, and no word can ever be out of vocabulary. They are also lighter to train.

```python
with open("./Data/Dataset/Matthieu_processed.txt", "r") as gospel_file:

    gospel_file_content = gospel_file.read()

    # Build the vocabulary: the sorted set of distinct characters
    vocabulary = sorted(set(gospel_file_content))

    n_chars = len(gospel_file_content)
    n_vocabs = len(vocabulary)

    print("Number of characters in the file : " + str(n_chars))
    print("Number of element in the vocabulary : " + str(n_vocabs))
```

```text
Number of characters in the file : 129820
Number of element in the vocabulary : 66
```


```python
# Tokenization dictionaries used to pair an integer with a char
char_to_int = dict((c, i) for i, c in enumerate(vocabulary))
int_to_char = dict((i, c) for i, c in enumerate(vocabulary))
```

```python
# A sample of our vocabulary
vocabulary[::10]
```

```text
[' ', '3', '[', 'i', 's', 'â', 'û']
```
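As an illustration of how the two dictionaries work together, the sketch below builds them from a short phrase instead of the whole gospel (this reduced text is our assumption, chosen to keep the example small) and checks that encoding followed by decoding returns the original string.

```python
# Toy text standing in for the processed gospel
text = "et isaac engendra jacob ;"

vocabulary = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(vocabulary)}
int_to_char = {i: c for i, c in enumerate(vocabulary)}

# Encode a word as integers, then decode it back
encoded = [char_to_int[c] for c in "isaac"]
decoded = "".join(int_to_char[i] for i in encoded)

print(encoded)
print(decoded)
```

Because both dictionaries are built from the same `enumerate(vocabulary)`, they are exact inverses of each other. The integer assigned to a character depends on the text used to build the vocabulary, so the same dictionaries must be used for training and generation.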

- ### Data Process For Training

![Training process](/Data/Illustration/Training_Process.png)

```python
from tensorflow.keras.utils import to_categorical
import numpy as np

seq_length = 150

dataX = []
dataY = []

for i in range(0, n_chars - seq_length, 1):
    # Input: a window of seq_length characters; target: the character that follows it
    seq_in = gospel_file_content[i:i + seq_length]
    seq_out = gospel_file_content[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])

n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocabs)
# one hot encode the output variable
y = to_categorical(dataY)
```

```text
Total Patterns:  129670
```


This code first imports the `to_categorical` function from the TensorFlow Keras library, which is used to convert categorical data into ***one-hot encoded vectors***. One-hot encoding is a method of representing categorical data as a binary vector, where each element of the vector is either 0 or 1.

Next, it defines the sequence length, which is the number of characters in each input sequence. In this case, the sequence length is set to ***150 characters***.

It then creates two empty lists, `dataX` and `dataY`, which will store the input sequences and the corresponding target characters, respectively.

The `for` loop slides over `gospel_file_content`, the string that contains the entire text. At each position, it extracts a 150-character window and stores it in the `seq_in` variable. The character that immediately follows the window is then stored in the `seq_out` variable.

The `dataX` list is then appended with a list of integers, where each integer represents the index of the corresponding character in the `char_to_int` dictionary. The `char_to_int` dictionary is a mapping from characters to integers, ***which is used to encode the text data***.

The `dataY` list is then appended with the index of the character stored in the `seq_out` variable.

The `X` variable is then created by reshaping the `dataX` list into a three-dimensional array, with the following dimensions:

- samples: the number of input sequences.
- time_steps: the sequence length.
- features: the number of values per time step, here 1 (the integer index of the character).

The `X` variable is then scaled to the range between 0 and 1 by dividing each index by the vocabulary size.

Finally, the `y` variable is created by ***one-hot encoding*** the `dataY` list. In this case, the `y` variable is a matrix of binary vectors, where each vector has a length equal to the number of characters in the vocabulary.
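The steps described above can be checked on a toy example. The sketch below is not the project's code: it uses the short string `"abracadabra"` and a window of 4 characters (both our assumptions) so that the resulting shapes are easy to verify by hand, and it replaces `to_categorical` with a NumPy equivalent so that it runs without TensorFlow.

```python
import numpy as np

text = "abracadabra"   # toy text standing in for the gospel
seq_length = 4         # toy window size (the project uses 150)

vocabulary = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(vocabulary)}
n_vocabs = len(vocabulary)

windows, targets = [], []
for i in range(len(text) - seq_length):
    # Input window and the character that immediately follows it
    windows.append([char_to_int[c] for c in text[i:i + seq_length]])
    targets.append(char_to_int[text[i + seq_length]])

# [samples, time steps, features], scaled by the vocabulary size
X = np.reshape(windows, (len(windows), seq_length, 1)) / float(n_vocabs)
# One-hot targets: NumPy stand-in for to_categorical
y = np.eye(n_vocabs)[targets]

print(X.shape, y.shape)
```

An 11-character text with a window of 4 yields 11 - 4 = 7 patterns, in the same way that 129820 - 150 gives the 129670 patterns reported above. One difference to keep in mind: `to_categorical` infers the number of classes from the largest index present in `dataY` unless `num_classes` is passed, whereas this sketch always uses the full vocabulary size.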

## LSTM model

The LSTM model used in this project consists of the following layers:

* An LSTM layer with 856 units and the `return_sequences=True` parameter. This layer reads the input sequence and returns its output at every time step, so that the second LSTM layer receives a full sequence.
* A Dropout layer with a rate of 0.5. This layer helps to regularize the model by randomly dropping out neurons during the training phase. This helps to prevent overfitting, which occurs when the model memorizes the training data too precisely and is unable to generalize to new data.
* A second LSTM layer with 856 units. This layer allows the model to learn more complex representations of sequences of text.
* A Dropout layer with a rate of 0.5. This layer helps to continue regularizing the model.
* A Dense layer with 66 units and a `softmax` activation. This layer outputs the probability of each character in the vocabulary.
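To illustrate what the final `softmax` activation does, here is a small NumPy sketch. It is independent of Keras, and the three input scores are made up for the example; in the real model there are 66 of them, one per character.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Made-up raw scores for a 3-character vocabulary
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print(probs)
print(probs.sum())
```

The outputs are all positive and sum to 1, so they can be read as a probability distribution over the next character. The largest score keeps the largest probability.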

```python
from tensorflow.keras.layers import Dropout, Dense, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow as tf

model = Sequential()
model.add(LSTM(856, return_sequences=True, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.5))
model.add(LSTM(856))
model.add(Dropout(0.5))
model.add(Dense(y.shape[1], activation='softmax'))
```

```python
adam = tf.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='categorical_crossentropy', optimizer=adam)
```

Here we use the Adam optimizer and set its learning rate to 0.0001. Adam is a popular algorithm for optimizing neural networks. It is an adaptive learning rate optimizer, which means that it automatically adjusts the step size for each parameter of the model.

We chose a learning rate of 0.0001 after several training runs. We concluded that this learning rate was the most robust one, as the loss decreased most steadily with it. Be aware that training was slower, as it needed more epochs to reach a satisfactory level.

```python
model.summary()
```

```text
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_2 (LSTM)               (None, 150, 856)          2937792   
                                                                 
 dropout_2 (Dropout)         (None, 150, 856)          0         
                                                                 
 lstm_3 (LSTM)               (None, 856)               5865312   
                                                                 
 dropout_3 (Dropout)         (None, 856)               0         
                                                                 
 dense_1 (Dense)             (None, 66)                56562     
                                                                 
Total params: 8859666 (33.80 MB)
Trainable params: 8859666 (33.80 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
```
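The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard formulas for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a fully connected layer. The function names `lstm_params` and `dense_params` are ours, introduced only for this check.

```python
def lstm_params(units, input_dim):
    # 4 gates, each with input weights, recurrent weights and one bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    # One weight per (input, unit) pair, plus one bias per unit
    return input_dim * units + units

first_lstm = lstm_params(856, 1)      # input: 1 feature per time step
second_lstm = lstm_params(856, 856)   # input: the 856 outputs of the first LSTM
dense = dense_params(66, 856)         # output: one unit per character
total = first_lstm + second_lstm + dense

print(first_lstm, second_lstm, dense, total)
```

The second LSTM layer holds about two thirds of the parameters, because its input is the 856-dimensional output of the first layer rather than a single feature.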


## Training

We trained our model for 100 epochs with a batch size of 256; each epoch took approximately 1 minute.

- Number of epochs : 100
- Time per epoch : ~ 1 min
- Total training time : 100 mins (1h40)
- Loss Decrease per epoch: ~ 0.02 units 
- Batch size : 256
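As a quick sanity check, these figures can be related to each other. This is only a back-of-the-envelope sketch: it assumes that all 129670 patterns reported during data preparation are used in every epoch, and it treats the approximate time of 1 minute per epoch as exact.

```python
import math

n_patterns = 129670        # "Total Patterns" reported during data preparation
batch_size = 256
epochs = 100
minutes_per_epoch = 1.0    # approximate figure from the list above

# The last batch of an epoch is smaller, hence the ceiling
steps_per_epoch = math.ceil(n_patterns / batch_size)
total_steps = steps_per_epoch * epochs
total_minutes = epochs * minutes_per_epoch

print(steps_per_epoch)   # 507
print(total_steps)       # 50700
print(total_minutes)     # 100.0
```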

```python
epochs = 100

# define the checkpoint: keep only the weights with the lowest training loss
filepath = "./Model/model.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

result = model.fit(X, y, epochs=epochs, batch_size=256, callbacks=callbacks_list)
```

## Results

The model was successfully trained and generates text that is largely grammatical and coherent. The example below shows some text generated by our model. It illustrates its ability to reproduce the style and content of the Gospel of Saint Matthew.

```text
que celui qui médira de père ou de mère, cemme d’est tou mous de sappoe, se vuiu dans le soyaume des cieux ; ct je vous le dis encore : il est plus facile qu’un chameau entre par un trou d’aiguille, qu’un riche n’entre dans le royaume de dieu. et les disciples, l’ayant entendu, s’étonnèrent fort, disant : qui donc peut être sauvé ? et jésus, sepant le liur di la géter, ensent : parse, de ne pau lesrar ee ge terres, cer jésus, royant le gili, et leur annns la detcien. et ie leur détendre.
```
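The generation code itself lives in the notebook and is not reproduced in this README. As a rough idea of how such text can be produced, here is a minimal sketch of greedy character-by-character generation. It is an assumption on our part, not the project's actual code: `generate` and `predict_next` are hypothetical names, `predict_next` stands in for a call such as `model.predict`, and a toy predictor replaces the trained model so that the example runs without TensorFlow.

```python
import numpy as np

def generate(predict_next, seed, char_to_int, int_to_char, n_chars_out):
    """Greedy generation: repeatedly append the most probable next character.

    predict_next is a hypothetical callable taking an array of shape
    (1, seq_length, 1) and returning probabilities of shape (1, n_vocabs).
    """
    n_vocabs = len(char_to_int)
    pattern = [char_to_int[c] for c in seed]
    generated = []
    for _ in range(n_chars_out):
        # Same encoding as for training: reshape, then scale by the vocabulary size
        x = np.reshape(pattern, (1, len(pattern), 1)) / float(n_vocabs)
        index = int(np.argmax(predict_next(x)))
        generated.append(int_to_char[index])
        # Slide the window: drop the oldest character, append the new one
        pattern = pattern[1:] + [index]
    return "".join(generated)

# Toy setup: 3-character vocabulary and a fake model that always
# predicts the character following the last one (cyclically)
vocabulary = sorted(set("abc"))
char_to_int = {c: i for i, c in enumerate(vocabulary)}
int_to_char = {i: c for i, c in enumerate(vocabulary)}

def toy_predict(x):
    last = int(round(x[0, -1, 0] * len(vocabulary)))
    probs = np.zeros((1, len(vocabulary)))
    probs[0, (last + 1) % len(vocabulary)] = 1.0
    return probs

output = generate(toy_predict, "ab", char_to_int, int_to_char, 5)
print(output)
```

With the trained model, the seed would be a 150-character excerpt encoded with the same `char_to_int` dictionary as during training. Always taking the most probable character is only one possible choice; sampling from the predicted distribution is another.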

For those who understand French, you can see that our model sometimes hallucinates nonsensical text, for example: ***cemme d’est tou mous de sappoe, se vuiu***. The next section discusses how the model's performance could be improved.

## Possible improvements

- The model developed in this project could be improved in several ways. For example, it could be trained on a larger text dataset, such as the whole Bible. This would allow the model to generate more diverse and complex text.

- In addition, we could add another LSTM layer with a large number of units so that the model learns the links between characters more deeply. In the same way, we could increase the number of units in each LSTM layer and the sequence length of our training dataset.


## Conclusion

This project has demonstrated the ability of LSTM neural networks to generate high-quality text. The model developed in this project could be used for a variety of applications, such as creative content creation, language translation, or script generation.

***Note that a notebook is available in the Test folder that lets you play with the model. WARNING: you'll need to unzip the zip file containing the model first! A version of the model in HDF5 format is available in the model folder of the zip archive.***