# Introduction

Welcome to this notebook where we will explore a fascinating aspect of deep learning - automatic text generation. The goal of this project is to train a neural network model to generate text that resembles the works of William Shakespeare, one of the most iconic writers in history.

To achieve this, we will use a neural network architecture called Long Short-Term Memory (LSTM). LSTMs are a type of recurrent neural network (RNN) designed to retain information across long sequences of data, such as sentences or paragraphs, which makes them well suited to tasks like text generation.

To train our model, we will use a corpus of Shakespeare's texts. This corpus will serve as the training data for our LSTM, allowing it to learn the "style" of Shakespeare's writing. Once trained, the model will be capable of generating new sentences inspired by what it has learned from this corpus.

It's important to note that the text generated by our model will not be genuine Shakespeare; it won't have the same depth or meaning. However, it should resemble his style, echoing the sentence structures, vocabulary, and grammatical patterns commonly found in Shakespeare's works.

In this notebook, we will:

- Prepare and preprocess our corpus of texts.
- Build our LSTM model using Keras, a popular Python library for deep learning.
- Train this model on our corpus.
- Generate new texts from the trained model.

It's an exciting journey at the intersection of literature and artificial intelligence. So let's dive in and see what our machine can create!

# Data Preparation and Preprocessing

In the cells below, we start by importing the necessary libraries for our project: `random`, `numpy`, and several `tensorflow.keras` submodules that we need to build and train our LSTM model.

Next, we use the `get_file` function from `tensorflow.keras.utils` to download the text file 'shakespeare.txt' that contains our corpus. We read the file, decode it as UTF-8, convert it to lowercase to normalize the text, and replace line breaks with spaces to simplify further processing.

After that, we prepare several data structures that will help us transform our text into usable inputs for our LSTM model:

- `characters`: a list of all unique characters in our text, sorted by character code (so the space sorts before the letters). This list serves as the "vocabulary" of the model.
- `char_to_index` and `index_to_char`: two dictionaries that convert between characters and numeric indices. This is necessary because our LSTM model cannot process characters directly; the indices tell us which position to set in each character's one-hot vector.
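
As a small illustration of these structures, the sketch below builds the same kind of vocabulary and lookup dictionaries on a toy string. The `toy_*` names and the sample sentence are assumptions made for this example only; they are not part of the notebook's pipeline.

```python
# Toy corpus standing in for the Shakespeare text (illustrative only)
toy_text = "to be or not to be"

# Unique characters, sorted by character code
toy_characters = sorted(set(toy_text))

# Lookup tables in both directions
toy_char_to_index = {c: i for i, c in enumerate(toy_characters)}
toy_index_to_char = {i: c for i, c in enumerate(toy_characters)}

# Encode the text as indices, then decode it back to check the round trip
toy_encoded = [toy_char_to_index[c] for c in toy_text]
toy_decoded = "".join(toy_index_to_char[i] for i in toy_encoded)

print(toy_characters)   # [' ', 'b', 'e', 'n', 'o', 'r', 't']
print(toy_encoded[:5])  # [6, 4, 0, 1, 2]
```

Encoding and then decoding returns the original string, which is the property the generation step relies on when it turns a sampled index back into a character.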

We also define two hyperparameters for our data preparation: `SEQ_LENGTH` and `STEP_SIZE`. `SEQ_LENGTH` determines the length of each input sequence for our model, and `STEP_SIZE` determines the offset between the starts of consecutive sequences. In other words, we start a new input sequence of length `SEQ_LENGTH` every `STEP_SIZE` characters, so consecutive sequences overlap whenever `STEP_SIZE` is smaller than `SEQ_LENGTH`.

Next, we prepare our inputs and targets for the LSTM model. The inputs are sequences of `SEQ_LENGTH` characters from our text, and the targets are the characters directly following these sequences.
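
To make the windowing concrete, here is a minimal sketch of the same loop on a short string, using much smaller values than the notebook does. The names `toy_text`, `TOY_SEQ_LENGTH`, and `TOY_STEP_SIZE` are hypothetical and chosen only for this illustration.

```python
# Toy values, far smaller than the notebook's settings (illustrative only)
toy_text = "hello world"
TOY_SEQ_LENGTH = 4
TOY_STEP_SIZE = 3

toy_sentences = []
toy_next_char = []

# Slide a window of TOY_SEQ_LENGTH characters over the text in steps of TOY_STEP_SIZE
for i in range(0, len(toy_text) - TOY_SEQ_LENGTH, TOY_STEP_SIZE):
    toy_sentences.append(toy_text[i: i + TOY_SEQ_LENGTH])
    # The target is the character immediately after the window
    toy_next_char.append(toy_text[i + TOY_SEQ_LENGTH])

for sentence, target in zip(toy_sentences, toy_next_char):
    print(repr(sentence), "->", repr(target))
# 'hell' -> 'o'
# 'lo w' -> 'o'
# 'worl' -> 'd'
```

Because the step is smaller than the window length, neighbouring windows share characters, which gives the model more training examples from the same text.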

Finally, we transform these sequences and targets into "one-hot" representations. This is a common technique in deep learning to handle categorical data, such as our characters. Each character is represented by a vector of the length of our "vocabulary," where all positions are zero except the one corresponding to that character, which is one. Thus, `x` is a three-dimensional array of shape (number of sequences, `SEQ_LENGTH`, vocabulary size) and `y` is a two-dimensional array of shape (number of sequences, vocabulary size): the first axis indexes the input sequences or targets, and the last axis indexes the characters in our "vocabulary".
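
The resulting shapes are easier to see on a tiny example. The sketch below one-hot encodes two toy windows over a three-character vocabulary; all `toy_*` names and values are assumptions for illustration and mirror, but are not taken from, the notebook's variables.

```python
import numpy as np

# Toy inputs: two windows of length 2 over the vocabulary ['a', 'b', 'c']
toy_characters = ['a', 'b', 'c']
toy_char_to_index = {c: i for i, c in enumerate(toy_characters)}
toy_sentences = ["ab", "bc"]
toy_next_char = ["c", "a"]

# x: (number of sequences, sequence length, vocabulary size)
toy_x = np.zeros((len(toy_sentences), 2, len(toy_characters)), dtype=bool)
# y: (number of sequences, vocabulary size)
toy_y = np.zeros((len(toy_sentences), len(toy_characters)), dtype=bool)

for i, sentence in enumerate(toy_sentences):
    for t, char in enumerate(sentence):
        toy_x[i, t, toy_char_to_index[char]] = 1
    toy_y[i, toy_char_to_index[toy_next_char[i]]] = 1

print(toy_x.shape, toy_y.shape)  # (2, 2, 3) (2, 3)
print(toy_x[0].astype(int))      # rows [1 0 0] and [0 1 0] for 'a' and 'b'
```

Each time step in `toy_x` and each row in `toy_y` contains exactly one `True` entry, at the index of the corresponding character.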

In [1]:
import random
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.layers import Activation, Dense, LSTM

In [2]:
filepath = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

In [3]:
with open(filepath, 'rb') as file:
    text = file.read().decode(encoding='utf-8').lower()

In [4]:
text = text.replace('\n', ' ')

In [5]:
characters = sorted(set(text))
char_to_index = dict((c, i) for i, c in enumerate(characters))
index_to_char = dict((i, c) for i, c in enumerate(characters))

In [6]:
SEQ_LENGTH = 200
STEP_SIZE = 3

sentences = []
next_char = []

for i in range(0, len(text) - SEQ_LENGTH, STEP_SIZE):
    sentences.append(text[i: i + SEQ_LENGTH])
    next_char.append(text[i + SEQ_LENGTH])
    

In [7]:
x = np.zeros((len(sentences), SEQ_LENGTH, len(characters)), dtype=bool)
y = np.zeros((len(sentences), len(characters)), dtype=bool)

for i, satz in enumerate(sentences):
    for t, char in enumerate(satz):
        x[i, t, char_to_index[char]] = 1
    y[i, char_to_index[next_char[i]]] = 1

# LSTM Model Construction and Text Generation

In this code block, we start by building our LSTM model using the Keras `Sequential` API. Our model consists of three layers:

1. An LSTM layer with 128 units. This layer will do the bulk of the work in learning the sequences from our text.
2. A `Dense` layer (i.e., a fully connected layer) with as many units as we have unique characters in our text. This layer is responsible for converting the outputs from our LSTM layer into predictions for the next character to generate.
3. An `Activation` layer that applies the "softmax" activation function. This function converts the scores from the Dense layer into probabilities for each possible character.
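
To show what the softmax step does to the Dense layer's scores, here is a minimal NumPy sketch. The `softmax` helper and the toy scores are written for this illustration under the standard definition of softmax; this is not the code Keras runs internally.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype="float64")
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Hypothetical scores for a three-character vocabulary
toy_scores = [2.0, 1.0, 0.1]
toy_probabilities = softmax(toy_scores)
print(toy_probabilities)
```

The largest score receives the largest probability, and every character keeps a non-zero chance of being selected.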

After defining the architecture of our model, we compile it with the `categorical_crossentropy` loss function (suited for our multiclass classification problem) and the RMSprop optimizer. Then, we train the model on our prepared training data with a `batch_size` of 256 for 4 epochs.
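
As a rough illustration of what the loss measures, the sketch below evaluates the textbook categorical cross-entropy formula for a single example. The target and prediction values are made up for this example, and the Keras implementation adds numerical safeguards that are omitted here.

```python
import numpy as np

# Hypothetical one-hot target: the true next character is at index 2
toy_target = np.array([0.0, 0.0, 1.0])
# Hypothetical softmax outputs from a confident and an unsure model
toy_good_prediction = np.array([0.1, 0.2, 0.7])
toy_poor_prediction = np.array([0.6, 0.3, 0.1])

def cross_entropy(target, prediction):
    # Only the probability assigned to the true character contributes
    return -np.sum(target * np.log(prediction))

toy_good_loss = cross_entropy(toy_target, toy_good_prediction)
toy_poor_loss = cross_entropy(toy_target, toy_poor_prediction)
print(toy_good_loss, toy_poor_loss)
```

The loss is smaller when the model assigns a higher probability to the character that actually follows, which is what training pushes the model towards.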

Next, we define two functions to generate text from our trained model:

1. `sample`: This function takes the array of predictions from our model (a probability for each possible character) and a "temperature" that controls the diversity of the generated text. A higher temperature gives more random and diverse text, while a lower temperature gives more predictable text. The function divides the log-probabilities by the temperature, renormalizes them into a probability distribution, and then samples from that distribution to choose the next character.

2. `generate_text`: This function generates a sequence of text of a given length using our trained model and the `sample` function. It starts by randomly choosing a starting sequence from our original text and then generates one character at a time, updating the input sequence at each step.
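
The effect of the temperature can be seen without running the model. The sketch below applies the same rescaling that `sample` performs to a made-up distribution; the `apply_temperature` helper and the toy probabilities are assumptions introduced only for this illustration.

```python
import numpy as np

def apply_temperature(preds, temperature):
    """Rescale a probability distribution the way `sample` does, without sampling."""
    preds = np.asarray(preds, dtype="float64")
    scaled = np.log(preds) / temperature
    exp_scaled = np.exp(scaled)
    return exp_scaled / np.sum(exp_scaled)

# Hypothetical model output over a three-character vocabulary
toy_preds = [0.5, 0.3, 0.2]

toy_sharp = apply_temperature(toy_preds, 0.5)      # low temperature
toy_unchanged = apply_temperature(toy_preds, 1.0)  # temperature 1 leaves the distribution as-is
toy_flat = apply_temperature(toy_preds, 2.0)       # high temperature

for name, dist in [("0.5", toy_sharp), ("1.0", toy_unchanged), ("2.0", toy_flat)]:
    print(name, np.round(dist, 3))
```

A temperature below 1 concentrates probability on the most likely character, while a temperature above 1 spreads it more evenly. At a very low temperature such as 0.01, sampling almost always picks the most likely character.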

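Finally, we call `generate_text` to produce 20 new characters at a temperature of 0.01. The returned string begins with the `SEQ_LENGTH`-character seed taken from the corpus, followed by the generated characters.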

In [8]:
model = Sequential()
model.add(LSTM(128, input_shape=(SEQ_LENGTH, len(characters))))
model.add(Dense(len(characters)))
model.add(Activation('softmax'))

In [9]:
model.compile(loss='categorical_crossentropy', optimizer=RMSprop(learning_rate=0.01))
model.fit(x, y, batch_size=256, epochs=4)



Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x1a78f08dcd0>

In [10]:
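def sample(preds, temperature=1.0):
    # Work in float64 so the renormalized probabilities sum to 1 closely enough for multinomial
    preds = np.asarray(preds).astype('float64')
    # Divide the log-probabilities by the temperature: < 1 sharpens the distribution, > 1 flattens it
    preds = np.log(preds) / temperature
    # Renormalize into a probability distribution
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw one sample; the result is a one-hot vector, so argmax gives the chosen index
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)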

In [11]:
def generate_text(length, temperature):
    # Pick a random start index in the text
    start_index = random.randint(0, len(text) - SEQ_LENGTH - 1)
    # Initialize the generated text
    generated = ''
    # Take the seed sentence from the text
    sentence = text[start_index: start_index + SEQ_LENGTH]
    # Add the seed sentence to the output text
    generated += sentence
    # For each of the "length" characters to generate: one-hot encode the current sentence into "x_predictions",
    # compute the predictions, and use "sample" to obtain the index of the next character.
    # Then append that character to "generated" and slide the sentence window forward by one character.
    for i in range(length):
        x_predictions = np.zeros((1, SEQ_LENGTH, len(characters)))
        for t, char in enumerate(sentence):
            x_predictions[0, t, char_to_index[char]] = 1

        predictions = model.predict(x_predictions, verbose=0)[0]
        next_index = sample(predictions, temperature)
        next_character = index_to_char[next_index]

        generated += next_character
        sentence = sentence[1:] + next_character

    return generated

In [13]:
generate_text(20, 0.01)

' thy friends are fled to wait upon thy foes, and crossly to thy good all fortune goes.  henry bolingbroke: bring forth these men. bushy and green, i will not vex your souls-- since presently your soul so the heart to the'