<a href="https://colab.research.google.com/github/DLProjectTextGeneration/TextGeneration/blob/code/code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Load the packages**

In [None]:
import os
import datetime
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import PowerTransformer, MinMaxScaler

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

from google.colab import files

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.layers import RNN
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint

# **Data import**

We load the lyrics of Britney Spears's songs. The data is a text file that you can download from the Data folder of our GitHub repository:

1.   Click on this link : https://github.com/DLProjectTextGeneration/TextGeneration/blob/main/Data/britney-spears.txt
2.   Right-click on the *raw* button on the top right corner of the page and select *download the linked file*

Now that you have downloaded the data, you can upload it to Colab and read it with the following cells.



In [None]:
uploaded = files.upload()

Saving britney-spears.txt to britney-spears.txt


In [None]:
# Read the whole lyrics file into a single string
with open('britney-spears.txt', 'r') as data:
    britney = data.read()

We can take a look at the data.

In [None]:
print(britney)

They say get ready for the revolution
I think it's time we find some sorta solution
Somebody's caught up in the endless pollution
They need to wake up, stop living illusions I know you need to hear this
Why won't somebody feel this
This is my wish that we all feel connected
This is my wish that nobodies neglected Be like a rocket baby
Be like a rocket Take off
Just fly, away (ay, ay)
To find your space Take off
Just fly, away (ay, ay)
To find your place Take off You know what they say about mixing the races
And in the end we got the same faces
My mama told me got love yourself first
And if you disagree, get off this damn earth I want to feel connected
Don't want to be neglected
This is my wish that we all find our places
This is my wish that we all escalate (yeah) Be like a rocket baby
Be like a rocket Take off
Just fly, away (ay, ay)
To find your space Take off
Just fly, away (ay, ay)
To find your space Take off
Just fly, away (ay, ay)
To find your space Take off
Just fly, away (ay, a

# **1. Generating text from the raw data**



## **Data processing**

We now map each character to an integer. For example:

* ' \n ' --> 0
* ' ' --> 1
* ' ! ' --> 2

The text contains 76 unique characters, so it becomes a sequence of integers from 0 to 75.






In [None]:
characters = sorted(list(set(britney)))

n_to_char = {n:char for n, char in enumerate(characters)}
char_to_n = {char:n for n, char in enumerate(characters)}

vocab_size = len(characters)
print('Number of unique characters: ', vocab_size)
print(characters)

Number of unique characters:  76
['\n', ' ', '!', '"', '&', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '6', '7', '8', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'Z', '[', ']', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
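As a small illustration of this mapping, here is a sketch on a made-up sample string (the sample text and the `char_to_idx` / `idx_to_char` names are assumptions for the example, not part of the notebook). It shows that encoding followed by decoding gives back the original text.

```python
# Toy illustration of the character <-> integer mapping (sample text is made up).
sample = "oh baby\nbaby!"

# Sorting the unique characters gives a stable, reproducible index for each one.
chars = sorted(set(sample))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}

# Encode the text as integers, then decode it back.
encoded = [char_to_idx[c] for c in sample]
decoded = "".join(idx_to_char[i] for i in encoded)

print(chars)
print(encoded)
print(decoded == sample)
```

Because the characters are sorted by code point, the newline, the space and the exclamation mark come first here as well.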


We then extract every sequence of 100 consecutive characters from the text and store them in a list **X**. For each sequence, the character that follows it (the target) is stored in another list **Y**.

In [None]:
X = []   
Y = []  
length = len(britney)
seq_length = 100   

for i in range(0, length - seq_length, 1):
    sequence = britney[i:i + seq_length]
    label = britney[i + seq_length]
    X.append([char_to_n[char] for char in sequence])
    Y.append(char_to_n[label])
    
print('Number of extracted sequences:', len(X))

Number of extracted sequences: 135187
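To make the sliding window concrete, here is a minimal sketch on a short made-up string with a window of 4 characters instead of 100 (the `make_windows` helper is hypothetical and only mirrors the loop above). The number of windows is always the text length minus the window length.

```python
def make_windows(text, seq_length):
    """Return (inputs, targets): each window of seq_length characters and the character after it."""
    inputs, targets = [], []
    for i in range(len(text) - seq_length):
        inputs.append(text[i:i + seq_length])
        targets.append(text[i + seq_length])
    return inputs, targets


toy_inputs, toy_targets = make_windows("oh baby baby", seq_length=4)

for window, target in zip(toy_inputs, toy_targets):
    print(repr(window), "->", repr(target))
print("Number of windows:", len(toy_inputs))
```

Applied to the lyrics, the same rule gives 135187 sequences, which means the text is 135287 characters long.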


In [None]:
X_modified = np.reshape(X, (len(X), seq_length, 1))
X_modified = X_modified / float(len(characters))
Y_modified = np_utils.to_categorical(Y)

X_modified.shape, Y_modified.shape

((135187, 100, 1), (135187, 76))
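The reshaping, scaling and one-hot encoding can be checked on a tiny example. The sketch below uses NumPy only; `np.eye(n)[labels]` is used here as a stand-in for `np_utils.to_categorical`, and the toy values are assumptions chosen for readability.

```python
import numpy as np

# Two toy sequences of 3 integer-encoded characters, with a vocabulary of 4 characters.
toy_X = [[0, 1, 2], [1, 2, 3]]
toy_Y = [3, 0]
toy_vocab_size = 4

# Shape (samples, time steps, features) as expected by an LSTM layer,
# then scale the integer codes into [0, 1).
X_arr = np.reshape(toy_X, (len(toy_X), 3, 1)) / float(toy_vocab_size)

# One-hot encode the targets: row k of the identity matrix is the one-hot vector for class k.
Y_arr = np.eye(toy_vocab_size)[toy_Y]

print(X_arr.shape, Y_arr.shape)
print(Y_arr)
```

Note that `to_categorical` infers the number of classes from the largest label when it is not given explicitly, whereas the sketch passes the vocabulary size directly.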

## **Model creation and training**

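We create our model: two stacked LSTM layers of 400 units, each followed by 20% dropout, and a softmax output layer with one unit per character.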

In [None]:
model = Sequential()
model.add(LSTM(400, input_shape=(X_modified.shape[1], X_modified.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(400))
model.add(Dropout(0.2))
model.add(Dense(Y_modified.shape[1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
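To get a feeling for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias) and default Keras settings; the helper names are hypothetical, and the result should be compared with `model.summary()` rather than taken as authoritative.

```python
def lstm_params(input_dim, units):
    """Parameters of a standard LSTM layer: 4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Parameters of a fully connected layer: weights + bias."""
    return input_dim * units + units


first_lstm = lstm_params(input_dim=1, units=400)     # one feature per time step
second_lstm = lstm_params(input_dim=400, units=400)  # fed by the first LSTM
output_layer = dense_params(input_dim=400, units=76) # one unit per character

total = first_lstm + second_lstm + output_layer
print(first_lstm, second_lstm, output_layer, total)
```

Dropout layers have no trainable parameters, so they do not appear in the sum.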

We add checkpoints to save the model weights, so that we can load them again if we want to train for more epochs afterwards.

In [None]:
filepath="/content/training_checkpoints/baseline-improvement-britney-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
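Since the epoch number and the loss are written into each checkpoint file name, they can be read back when resuming. The sketch below is one possible way to do it with a regular expression; the helper names are hypothetical and the file list is hard-coded for illustration (in practice it could come from listing the checkpoint directory).

```python
import re

# Matches the "-{epoch:02d}-{loss:.4f}.hdf5" suffix used in the checkpoint file names.
CHECKPOINT_PATTERN = re.compile(r"-(\d+)-(\d+\.\d+)\.hdf5$")


def parse_checkpoint_name(name):
    """Return (epoch, loss) parsed from a checkpoint file name, or None if it does not match."""
    match = CHECKPOINT_PATTERN.search(name)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def latest_checkpoint(names):
    """Return the name with the highest epoch number, ignoring files that are not checkpoints."""
    parsed = [(parse_checkpoint_name(n), n) for n in names]
    parsed = [(info, n) for info, n in parsed if info is not None]
    if not parsed:
        return None
    return max(parsed, key=lambda item: item[0][0])[1]


names = [
    "baseline-improvement-britney-01-2.8514.hdf5",
    "baseline-improvement-britney-02-2.4731.hdf5",
    "notes.txt",
]
print(latest_checkpoint(names))
```

The selected file can then be passed to `model.load_weights`, and the parsed epoch can be given to `model.fit` through its `initial_epoch` argument so that the epoch counter continues where it stopped.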

In [None]:
model.fit(X_modified, Y_modified, epochs=30, batch_size=32, callbacks = callbacks_list)

Epoch 1/30

Epoch 00001: loss improved from inf to 2.85137, saving model to /content/training_checkpoints/baseline-improvement-britney-01-2.8514.hdf5
Epoch 2/30

Epoch 00002: loss improved from 2.85137 to 2.47310, saving model to /content/training_checkpoints/baseline-improvement-britney-02-2.4731.hdf5
Epoch 3/30

Epoch 00003: loss improved from 2.47310 to 2.11772, saving model to /content/training_checkpoints/baseline-improvement-britney-03-2.1177.hdf5
Epoch 4/30

Epoch 00004: loss improved from 2.11772 to 1.82145, saving model to /content/training_checkpoints/baseline-improvement-britney-04-1.8214.hdf5
Epoch 5/30

Epoch 00005: loss improved from 1.82145 to 1.59089, saving model to /content/training_checkpoints/baseline-improvement-britney-05-1.5909.hdf5
Epoch 6/30

Epoch 00006: loss improved from 1.59089 to 1.40982, saving model to /content/training_checkpoints/baseline-improvement-britney-06-1.4098.hdf5
Epoch 7/30

Epoch 00007: loss improved from 1.40982 to 1.27374, saving model to 

<tensorflow.python.keras.callbacks.History at 0x7f0fb8ac9ba8>
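One way to read these loss values is to compare them with a model that guesses uniformly among the 76 characters, and to convert them to perplexity. The sketch below relies on categorical cross-entropy being the average negative natural log-probability of the correct character; the loss values are copied from the log above, and the interpretation is offered as a rough guide only.

```python
import math

vocab_size = 76

# A model that assigns probability 1/76 to every character has loss ln(76).
uniform_loss = math.log(vocab_size)

# Training losses reported for the first seven epochs in the log above.
logged_losses = [2.85137, 2.47310, 2.11772, 1.82145, 1.59089, 1.40982, 1.27374]

# exp(loss) is the perplexity: roughly the number of characters the model hesitates between.
perplexities = [math.exp(loss) for loss in logged_losses]

print(f"uniform baseline: loss={uniform_loss:.4f}")
for epoch, (loss, ppl) in enumerate(zip(logged_losses, perplexities), start=1):
    print(f"epoch {epoch}: loss={loss:.4f}  perplexity={ppl:.2f}")
```

These are training losses, so they measure how well the model fits the lyrics it has seen, not how well it generalizes.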

## **Generating text**

We pick a random starting sequence from our initial text and use it as the seed.

In [None]:
start # a good start is start = 40554

40554

In [None]:
start = np.random.randint(0, len(X)-1) 

string_mapped = list(X[start])

full_string = [n_to_char[value] for value in string_mapped]

print("Seed:")
print("\"", ''.join(full_string), "\"")

Seed:
" l out
And scream and shout and let it out
We saying oh wee oh wee oh wee oh
We saying oh wee oh wee  "


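We then generate the next 400 characters one at a time: at each step the model's most likely character is appended, and the 100-character window slides forward by one.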

In [None]:
for i in range(400):
    # Shape the current window as (1, seq_length, 1) and scale it like the training data
    x = np.reshape(string_mapped, (1, len(string_mapped), 1))
    x = x / float(len(characters))

    # Pick the most likely next character (greedy decoding)
    pred_index = np.argmax(model.predict(x, verbose=0))
    full_string.append(n_to_char[pred_index])

    # Slide the window: append the prediction and drop the oldest character
    string_mapped.append(pred_index)
    string_mapped = string_mapped[1:]
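The window update in this loop can be tried without a trained model. The sketch below replaces `model.predict` with a made-up rule (`toy_predictor`, an assumption for illustration only) so that the mechanics of "predict, append, slide" are easy to follow.

```python
def generate(seed, predict_next, n_steps):
    """Generate n_steps values, feeding each prediction back into a fixed-length window."""
    window = list(seed)
    generated = []
    for _ in range(n_steps):
        nxt = predict_next(window)
        generated.append(nxt)
        # Drop the oldest element and append the prediction, keeping the window length constant.
        window = window[1:] + [nxt]
    return generated


def toy_predictor(window):
    """Stand-in for the model: predicts the last value plus one, modulo 5."""
    return (window[-1] + 1) % 5


out = generate([0, 1, 2], toy_predictor, n_steps=6)
print(out)
```

Because the next character is a deterministic function of the current window, the output starts repeating as soon as a window reappears. This is consistent with the loops visible in the generated lyrics.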

We combine and print the text.

In [None]:
# Join the seed and the generated characters into a single string
txt = ''.join(full_string)

print(start)
print(txt)

40554
l out
And scream and shout and let it out
We saying oh wee oh wee oh wee oh
We saying oh wee oh wee oh wee oh
I wanna scream and shout and let it all out
And scream and shout and let it all out
And scream and shou and let it all out
And scream and shou and let it all out
And scream and shout and let it out
We saying oh wee oh wee oh wee oh
Wou are now now rocking with Will I Am and Britney bitch Oh yeah..
Oh yeah..
Oh yeah..
Bni: We will, we will rock you! Britney:
We! I said: All:
We will, we w                             t           t             t                                      l   l lllveeee                                                                                                   lllveee e                                                                                                 lllveee e                                                                              


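We can notice that the prediction has the same structure as the raw text: line breaks, capitalization and punctuation are reproduced. After a while, however, the output degenerates into long runs of spaces and repeated letters.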

# **2. Generating text from clean data**

## **Data cleaning**

In [None]:
def convert_text_to_lowercase(df):
    """Return the text in lowercase."""
    return df.lower()


def not_regex(pattern):
    """Build a regex that matches any single character not matched by `pattern`."""
    return r"((?!{}).)".format(pattern)


def remove_punctuation(df):
    """Replace line breaks with spaces."""
    df = df.replace('\n', ' ')
    df = df.replace('\r', ' ')
    alphanumeric_characters_extended = '(\\b[-/]\\b|[a-zA-Z0-9])'
    # Note: str.replace() treats its first argument as a literal string, not as
    # a regex, so this call leaves the punctuation in place (as the character
    # list printed in the next section shows).
    df = df.replace(not_regex(alphanumeric_characters_extended), ' ')
    return df


def text_cleaning(df):
    """
    Takes in a string of text, then performs the following:
    1. convert text to lowercase
    2. replace new line characters with spaces
    """
    df = convert_text_to_lowercase(df)
    df = remove_punctuation(df)
    return df



In [None]:
britney = text_cleaning(britney)

print(britney)
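If the goal were to strip punctuation as well, a regular expression would have to be applied with `re.sub`, since `str.replace` only matches literal text. The sketch below is an assumption about that intent, not what the notebook does: it uses a simpler pattern than the one defined above, keeps only lowercase letters, digits and spaces, and would therefore give a smaller vocabulary than the 50 characters reported in the next section.

```python
import re


def strip_punctuation(text):
    """Lowercase the text, replace line breaks and punctuation with spaces, collapse repeated spaces."""
    text = text.lower().replace("\n", " ").replace("\r", " ")
    # Replace every character that is not a lowercase letter, a digit or a space.
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    # Collapse the runs of spaces left behind by the substitution.
    return re.sub(r" +", " ", text).strip()


cleaned = strip_punctuation("Just fly, away (ay, ay)\nTo find your space")
print(cleaned)
```

One side effect of this simple pattern is that apostrophes are removed too, so a word such as "it's" becomes "it s".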



## **Data processing**

After cleaning, only 50 distinct characters remain: the newline and the uppercase letters are gone.

In [None]:
characters = sorted(list(set(britney)))

n_to_char = {n:char for n, char in enumerate(characters)}
char_to_n = {char:n for n, char in enumerate(characters)}

vocab_size = len(characters)
print('Number of unique characters: ', vocab_size)
print(characters)

Number of unique characters:  50
[' ', '!', '"', '&', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '6', '7', '8', ':', ';', '?', '[', ']', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [None]:
X = []   
Y = []  
length = len(britney)
seq_length = 100   

for i in range(0, length - seq_length, 1):
    sequence = britney[i:i + seq_length]
    label = britney[i + seq_length]
    X.append([char_to_n[char] for char in sequence])
    Y.append(char_to_n[label])
    
print('Number of extracted sequences:', len(X))

Number of extracted sequences: 135187


In [None]:
X_modified = np.reshape(X, (len(X), seq_length, 1))
X_modified = X_modified / float(len(characters))
Y_modified = np_utils.to_categorical(Y)

X_modified.shape, Y_modified.shape

((135187, 100, 1), (135187, 50))

## **Model creation and training**

In [None]:
model = Sequential()
model.add(LSTM(400, input_shape=(X_modified.shape[1], X_modified.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(400))
model.add(Dropout(0.2))
model.add(Dense(Y_modified.shape[1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
filepath="/content/training_checkpoints/baseline-improvement-britney-clean-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [None]:
model.fit(X_modified, Y_modified, epochs=30, batch_size=32, callbacks = callbacks_list)

Epoch 1/30

Epoch 00001: loss improved from inf to 2.61463, saving model to /content/training_checkpoints/baseline-improvement-britney-clean-01-2.6146.hdf5
Epoch 2/30

Epoch 00002: loss improved from 2.61463 to 2.11020, saving model to /content/training_checkpoints/baseline-improvement-britney-clean-02-2.1102.hdf5
Epoch 3/30

Epoch 00003: loss improved from 2.11020 to 1.75084, saving model to /content/training_checkpoints/baseline-improvement-britney-clean-03-1.7508.hdf5
Epoch 4/30

Epoch 00004: loss improved from 1.75084 to 1.49806, saving model to /content/training_checkpoints/baseline-improvement-britney-clean-04-1.4981.hdf5
Epoch 5/30

Epoch 00005: loss improved from 1.49806 to 1.30862, saving model to /content/training_checkpoints/baseline-improvement-britney-clean-05-1.3086.hdf5
Epoch 6/30

Epoch 00006: loss improved from 1.30862 to 1.16813, saving model to /content/training_checkpoints/baseline-improvement-britney-clean-06-1.1681.hdf5
Epoch 7/30

Epoch 00007: loss improved from 

## **Generating text**

In [None]:
start = 40554
# start = np.random.randint(0, len(X)-1) 

string_mapped = list(X[start])

full_string = [n_to_char[value] for value in string_mapped]

print("Seed:")
print("\"", ''.join(full_string), "\"")

Seed:
" l out and scream and shout and let it out we saying oh wee oh wee oh wee oh we saying oh wee oh wee  "


In [None]:
for i in range(400):
    # Shape the current window as (1, seq_length, 1) and scale it like the training data
    x = np.reshape(string_mapped, (1, len(string_mapped), 1))
    x = x / float(len(characters))

    # Pick the most likely next character (greedy decoding)
    pred_index = np.argmax(model.predict(x, verbose=0))
    full_string.append(n_to_char[pred_index])

    # Slide the window: append the prediction and drop the oldest character
    string_mapped.append(pred_index)
    string_mapped = string_mapped[1:]

In [None]:
# Join the seed and the generated characters into a single string
txt = ''.join(full_string)

print(start)
print(txt)

40554
l out and scream and shout and let it out we saying oh wee oh wee oh wee oh we saying oh wee oh wee oh wee oh i wanna scream and shout and let it all out and scream and shout and let it out we saying oh wee oh wee oh wee oh we saying oh wee oh wee oh wee oh i wanna scream and shout and let it all out and scream and shout and let it out we saying oh wee oh wee oh wee oh we saying oh wee oh wee oh wee oh i wanna scream and shout and let it all out and scream and shout and let it out we saying oh wee oh wee oh wee oh you are now now rocking with will i am and britney bitch all around the world, pretty girls jump the liness  thay seel pe prcsio' and you niw me on my radar (on my radar) on my radar got you on my radar got you on my radar got you on my radar got you on my radar got you on my radar got you on my radar got you on my radar got you on my radar got you on my radar got you on my rad


# If we ever want to load an older model


In [None]:
filename = "/content/training_checkpoints/baseline-improvement-britney-clean-40-0.3835.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')