# Training of the Model

This Jupyter Notebook is dedicated to the training process of our text generation model. The primary objective is to build and train a neural network capable of generating haikus based on the provided dataset.

## Overview

1. **Data Loading and Preprocessing**: 
    - Load the dataset containing haikus.
    - Preprocess the data by combining haiku lines, tokenizing, and padding sequences.
    - Split the data into training and validation sets.

2. **Model Architecture**:
    - Define a neural network architecture using LSTM layers, Dropout layers, and Dense layers.
    - Compile the model with an appropriate loss function and optimizer.

3. **Training the Model**:
    - Train the model using the training dataset.
    - Implement techniques to prevent overfitting, such as EarlyStopping and Dropout layers.
    - Monitor the training process using validation data.

4. **Evaluation and Results**:
    - Evaluate the model's performance on the validation set.
    - Analyze the results and make necessary adjustments to improve the model.


In [1]:
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import tensorflow as tf
import re

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l2

from sklearn.model_selection import train_test_split



## Treating the data for the training

In this part, the following steps are performed:

- Data Loading: Load the haiku dataset from a CSV file.

- Data Preprocessing: Combine haiku lines into a single string, tokenize the text, and pad sequences to ensure uniform length.

- Data Splitting: Split the dataset into training and validation sets to prepare for model training.

For now, we select only a small part of the dataset to prepare the training data, because the computer used for this project has limited memory. Ways to process and use the whole dataset will be explored later, but this is the approach for the current experiments.

As mentioned above, the first step is to load the dataset and select a subset of it:

In [2]:
# Load the CSV data
df = pd.read_csv('database/final_data/final_english_dataset.csv')

# Combine the three lines of each haiku into a single string
df['haiku'] = df[['line_1', 'line_2', 'line_3']].agg(' '.join, axis=1)

# Randomly select 2,500 rows from the dataset
df = df.sample(n=2500, random_state=42)

# Display the loaded data
df.head()

| Unnamed: 0 | line_1 | line_2 | line_3 | source | line1_syllables | line2_syllables | line3_syllables | haiku |
|---|---|---|---|---|---|---|---|---|
| 83347 | Says it all doesn't | it most of them don't even | know why they are there | twaiku | 4 | 7 | 7 | Says it all doesn't it most of them don't eve... |
| 102737 | Cindy needed time | to take a nap on the floor | of the US Senate | twaiku | 6 | 8 | 6 | Cindy needed time to take a nap on the floor ... |
| 107391 | A glorious morn | Without a cloud to be seen | Why then do I cry | twaiku | 4 | 7 | 5 | A glorious morn Without a cloud to be seen Wh... |
| 92298 | Now playing We'll Meet | Again by Barry O'Dowd | The Shamrock Singers | twaiku | 4 | 7 | 5 | Now playing We'll Meet Again by Barry O'Dowd ... |
| 78172 | Watching some courage | the cowardly dog show to | start the day off right | twaiku | 7 | 7 | 5 | Watching some courage the cowardly dog show t... |


Next, we clean the haikus after joining them into a single text, which lets us tokenize the dataset afterwards.

This is one way to do it. Another option would be to keep the three lines of each haiku separate, which would give more distinct lines and cases. For this first attempt we keep things simple and will improve the project step by step.

In [3]:
# Combine all haikus into a single string
haikus = df['haiku'].tolist()
text = ' '.join(haikus)

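# Clean the text: lowercase it, normalize whitespace, and keep only letters, digits and spaces
def clean_text(text):
    text = text.lower()
    # Replace line breaks and non-breaking spaces with regular spaces
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\r', '', text)
    text = re.sub(r'\xa0', ' ', text)
    # Drop punctuation and any other non-alphanumeric character (apostrophes included)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text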

cleaned_text = clean_text(text)
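
To make the effect of the cleaning step concrete, here is a small standalone sketch that applies the same kind of substitutions to one haiku from the sample shown earlier. The name `clean_haiku` is only illustrative and is not part of the notebook. One thing worth noticing is that apostrophes are removed, so contractions such as "doesn't" become "doesnt".

```python
import re

def clean_haiku(text):
    """Illustrative stand-in for the notebook's clean_text function."""
    text = text.lower()
    # Line breaks and non-breaking spaces become regular spaces
    text = re.sub(r'[\n\xa0]', ' ', text)
    text = re.sub(r'\r', '', text)
    # Keep only ASCII letters, digits and whitespace
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Collapse repeated whitespace
    return re.sub(r'\s+', ' ', text).strip()

sample = "Says it all doesn't\nit most of them don't even\nknow why they are there"
cleaned = clean_haiku(sample)
print(cleaned)
```

Since the tokenizer is fitted on the cleaned text, its vocabulary contains these apostrophe-free forms.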

Now that the text is cleaned, we can tokenize it and check that the result is what we expect:

In [4]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([cleaned_text])
total_words = len(tokenizer.word_index) + 1

# Convert haikus to sequences of tokens
input_sequences = []
for haiku in haikus:
    token_list = tokenizer.texts_to_sequences([clean_text(haiku)])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

print(input_sequences[:10])

[[322, 14], [322, 14, 26], [322, 14, 26, 195], [322, 14, 26, 195, 14], [322, 14, 26, 195, 14, 158], [322, 14, 26, 195, 14, 158, 8], [322, 14, 26, 195, 14, 158, 8, 73], [322, 14, 26, 195, 14, 158, 8, 73, 33], [322, 14, 26, 195, 14, 158, 8, 73, 33, 91], [322, 14, 26, 195, 14, 158, 8, 73, 33, 91, 62]]
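
The printed output shows the pattern: every haiku is expanded into all of its prefixes of length two or more, so that each prefix can later serve as one training example. A minimal sketch of that expansion on a toy token list (the helper name `prefix_sequences` is an assumption made for this illustration):

```python
def prefix_sequences(token_list):
    """Return every prefix of length >= 2 of a list of token ids."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

# First four token ids of the first haiku in the output above
toy_tokens = [322, 14, 26, 195]

for seq in prefix_sequences(toy_tokens):
    # Everything but the last token is the context, the last token is the target
    print(seq[:-1], "->", seq[-1])
```

A haiku with `n` tokens therefore contributes `n - 1` sequences, which is why a sample of 2,500 haikus produces many more training examples than haikus.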


This cell prepares the data for training: it pads the sequences to a uniform length and creates the input-output pairs. The train/validation split is left commented out, so the entire sample is used for training, with the aim of building a more general model.

In [5]:
# Pad sequences to ensure uniform length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# Create input and output pairs
X, y = input_sequences[:,:-1], input_sequences[:,-1]
y = tf.keras.utils.to_categorical(y, num_classes=total_words)

# Split the data into training and validation sets
# X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the datasets
#print(f"Shape of X_train: {X_train.shape}")
#print(f"Shape of y_train: {y_train.shape}")
#print(f"Shape of X_val: {X_val.shape}")
#print(f"Shape of y_val: {y_val.shape}")
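
To illustrate what the padding and slicing in this cell do, here is a pure-Python sketch on a toy example. The helper `pad_pre` is a simplified stand-in written for this illustration, not the Keras `pad_sequences` function itself, and it assumes no sequence is longer than `maxlen`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` so that all have length `maxlen`."""
    return [[value] * (maxlen - len(s)) + list(s) for s in sequences]

toy = [[322, 14], [322, 14, 26], [322, 14, 26, 195]]
maxlen = max(len(s) for s in toy)
padded = pad_pre(toy, maxlen)

# The last token of each row is the word to predict, the rest is the context
X_toy = [row[:-1] for row in padded]
y_toy = [row[-1] for row in padded]

for context, target in zip(X_toy, y_toy):
    print(context, "->", target)
```

Padding on the left keeps the target word in the last column of every row. The Keras `Tokenizer` never assigns index 0 to a word, which is why it is free to be used as the padding value and why `total_words` is `len(tokenizer.word_index) + 1`.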

## Training of the model

And now it is finally time to train the model. Although I have more experience with PyTorch, I decided to use TensorFlow for this one: it is a small, fun project, and I find it interesting to keep doing projects with this library to gain experience with it and stay up to date with its performance.

First, we set up the model. Here we use LSTMs, a classic choice for text generation, but I also want to try transformers, GRUs or other models to see how they perform on this task.

In [6]:
# Define the model
model = Sequential()
model.add(Embedding(total_words, 100))
model.add(LSTM(150, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(total_words, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.0005), metrics=['accuracy'])

# Display the model summary
model.summary()
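
As a rough guide to the size of this architecture, the sketch below estimates the number of trainable parameters per layer using the standard formulas for embedding, LSTM and dense layers. The vocabulary size of 8,000 is a hypothetical placeholder, since the real value of `total_words` depends on the sampled haikus.

```python
def embedding_params(vocab_size, dim):
    # One vector of size `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

def total_params(vocab_size):
    """Parameter count for Embedding(100) -> LSTM(150) -> LSTM(100) -> Dense(vocab)."""
    return (embedding_params(vocab_size, 100)
            + lstm_params(100, 150)
            + lstm_params(150, 100)
            + dense_params(100, vocab_size))

vocab_size = 8000  # hypothetical value, not taken from the notebook
print("LSTM(150):", lstm_params(100, 150))
print("LSTM(100):", lstm_params(150, 100))
print("Total:", total_params(vocab_size))
```

The two LSTM layers have a fixed size, while the embedding and output layers grow linearly with the vocabulary. The Dropout layers add no parameters.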



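Here we set up the early stopping callback and finally launch the training. Because no validation split is used, the callback monitors the training loss, so it mainly stops the run once that loss has not improved for five epochs.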

In [7]:
import time

# Define EarlyStopping callback
early_stopping = EarlyStopping(monitor='loss', patience=5, restore_best_weights=True)

# X and y were built in the previous cell; the whole sample is used, without a validation split
# Fit the model with EarlyStopping
history = model.fit(X, y, epochs=1000, batch_size=32, callbacks=[early_stopping])

# Saving the dataset used to train this model for later generation with the tokenizer
timestamp = int(time.time())
df.to_csv(f'haiku_dataset_{timestamp}.csv', index=False)

# Save the model
model.save(f'haiku_model_{timestamp}.keras')
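
The patience mechanism used by the callback can be pictured with a small simulation. This is a simplified model written for illustration, not the Keras implementation: it assumes lower loss is better, no minimum improvement threshold, and made-up loss values.

```python
def simulate_early_stopping(losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of per-epoch losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # Training stops here; with restore_best_weights=True the
                # weights from best_epoch would be put back into the model
                return epoch, best_epoch
    return len(losses) - 1, best_epoch

toy_losses = [7.3, 6.8, 6.6, 6.7, 6.65, 6.9, 6.5]  # made-up values
stop_epoch, best_epoch = simulate_early_stopping(toy_losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

With a patience of 3 the run stops before reaching the later improvement, while a patience of 5 would have let it continue. This is the trade-off behind the `patience=5` setting.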

Epoch 1/1000

2024-09-02 11:50:53.744352: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1867251904 exceeds 10% of free system memory.

1763/1763 ━━━━━━━━━━━━━━━━━━━━ 65s 35ms/step - accuracy: 0.0323 - loss: 7.2824
Epoch 2/1000
1763/1763 ━━━━━━━━━━━━━━━━━━━━ 61s 35ms/step - accuracy: 0.0387 - loss: 6.7763
Epoch 3/1000
1763/1763 ━━━━━━━━━━━━━━━━━━━━ 62s 35ms/step - accuracy: 0.0479 - loss: 6.6211
Epoch 4/1000
1763/1763 ━━━━━━━━━━━━━━━━━━━━ 61s 35ms/step - accuracy: 0.0535 - loss: 6.5266
Epoch 5/1000
1763/1763 ━━━━━━━━━━━━━━━━━━━━ 62s 35ms/step - accuracy: 0.0595 - loss: 6.3720
Epoch 6/1000
1763/1763 ━━━━━━━━━━━━━━━━━━━━ 61s 35ms/step - accuracy: 0.0648 - loss: 6.2281
Epoch 7/1000
1763/1763 ━━━━━━━━━━━━━━━━━━━━ 62s 35ms/step - accuracy: 0.0717 - loss: 6.1027
Epoch 8/1000
1763/1763 ━━━━━━━━━━━━━━━━━━━━ 62s 35ms/step - accuracy: 0.0776 - loss: 5.9637
Epoch

Finally, here is the function we use to see what the model outputs when asked to generate a given number of words.

This is only a test to check whether the model has learned anything useful. A separate notebook will generate haikus and the corresponding images with Stable Diffusion.

In [9]:
# The corresponding functions are defined in the functions.py file
from functions import generate_haiku_from_params, select_random_seed_word_from_tokenizer

# Example usage
params = {
    'model_path': 'models/lstm_5000/haiku_model_1725314076.keras',
    'seed_text': 'A forest',
    'syllables_mode': False,
    'words_mode': True,
    'haikus_pattern': [5, 7, 5],
    'max_sequence_len': 21,
    'tokenizer': tokenizer  # Assuming tokenizer is already defined
}

seed_text, haiku = generate_haiku_from_params(params)
print(seed_text)
print(haiku)

A forest youth is talked commies peak im is a beg do do of me fall she
['A forest youth is talked', ' commies peak im is a beg do', ' do of me fall she']
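
The output above suggests that with `words_mode` enabled the generated words are grouped into lines of 5, 7 and 5 words. The actual logic lives in `functions.py` and is not shown here. The helper below, `split_by_word_pattern`, is a hypothetical illustration of that grouping and not the project's implementation.

```python
def split_by_word_pattern(text, pattern=(5, 7, 5)):
    """Group the words of `text` into lines holding pattern[i] words each."""
    words = text.split()
    lines, start = [], 0
    for count in pattern:
        lines.append(' '.join(words[start:start + count]))
        start += count
    return lines

generated = "A forest youth is talked commies peak im is a beg do do of me fall she"
for line in split_by_word_pattern(generated):
    print(line)
```

Counting words is much simpler than counting syllables, but the result only loosely resembles the 5-7-5 syllable structure of a haiku.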


We can also generate a haiku from a randomly selected seed word:

In [10]:
# Example usage
seed_word = select_random_seed_word_from_tokenizer(tokenizer)

params = {
    'model_path': 'models/lstm_5000/haiku_model_1725314076.keras',
    'seed_text': seed_word,
    'syllables_mode': True,
    'words_mode': False,
    'haikus_pattern': [5, 7, 5],
    'max_sequence_len': 21,
    'tokenizer': tokenizer  # Assuming tokenizer is already defined
}

seed_text, haiku = generate_haiku_from_params(params)
print(seed_text)
print(haiku)

billows cream joy of the please sing of own civil his sex every
['billows cream joy of', ' the please sing of own civil', ' his sex every']


Of course, the models are far from perfect, and this is something I want to work on.

One possible upgrade is to find a way to use the whole dataset rather than only a part of it, even with the limited memory of the device currently used. I will work on this later, as I am pretty happy with the funny results the model is able to produce!
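
As a starting point for the memory question, here is a back-of-the-envelope estimate of how much space one-hot encoded labels take compared with plain integer labels. The sample count, vocabulary size and 4 bytes per value are all hypothetical assumptions chosen for illustration, not measurements from this notebook.

```python
def one_hot_label_bytes(n_samples, vocab_size, bytes_per_value=4):
    """Memory for a dense (n_samples, vocab_size) one-hot label matrix."""
    return n_samples * vocab_size * bytes_per_value

def integer_label_bytes(n_samples, bytes_per_value=4):
    """Memory for one integer class id per sample."""
    return n_samples * bytes_per_value

# Hypothetical sizes, assumed for illustration only
n_samples, vocab_size = 50_000, 8_000

dense_gb = one_hot_label_bytes(n_samples, vocab_size) / 1e9
sparse_mb = integer_label_bytes(n_samples) / 1e6
print(f"one-hot labels: {dense_gb:.1f} GB")
print(f"integer labels: {sparse_mb:.1f} MB")
```

If the labels turn out to be a large share of the memory use, one option that may be worth trying is to skip `to_categorical`, keep `y` as integer token ids, and compile the model with the `sparse_categorical_crossentropy` loss, which accepts integer class labels.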