<a href="https://colab.research.google.com/github/Mic-73/GenAI/blob/main/HW5/Problem1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# Author: Michael Wood

Purpose: In this project we will develop an LSTM (Long Short-Term Memory) model to generate text.
*   By training the model on various works, we will try to produce coherent and stylistically relevant text based on seed phrases (prompts).
*   To improve the model's performance, we will explore the use of multiple training texts, additional LSTM layers, and other optimizations.
*   The quantity and quality of training data are crucial for achieving meaningful text generation; a larger and more diverse dataset allows the model to better capture the nuances and patterns of written text.

Note: Initial code for importing and loading the data and setting up the multilayer LSTM model was taken from the assignment page on Canvas. Initial code for model training, data tokenization, and the initial textGenerator class was taken from the course's GitHub repo here: https://github.com/bforoura/GenAI/blob/main/Module5/recipe_lstm.ipynb. The code was modified to fit the assignment's requirements.

---

# Code

## 1. Data Collection and Preparation

In [1]:
#@title Import Libraries

import numpy as np
import json
import re
import string
import requests
from os import stat_result

# from Tensorflow
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Natural Language Toolkit
import nltk

In [2]:
#@title PARAMETERS

TEMPERATURE = 1.0
VOCAB_SIZE = 20000
SEQ_LENGTH = 50
MAX_LEN = 100
EMBEDDING_DIM = 100
N_UNITS = 64
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 32
EPOCHS = 25

In [None]:
#@title Import and Clean the Data

# URLS of Charles Dickens Works
urls = [
    "https://www.gutenberg.org/cache/epub/730/pg730.txt",     # Oliver Twist
    "https://www.gutenberg.org/cache/epub/98/pg98.txt",       # A Tale of Two Cities
    "https://www.gutenberg.org/cache/epub/19337/pg19337.txt"  # A Christmas Carol
]

# Initialize empty string
all_text = ""

# Starting phrase for a book
start = r"(?i)^.*?\*\*\* START OF THE PROJECT GUTENBERG EBOOK.*?\n"

# Ending phrase for a book
end = r"(?i)\n\*\*\* END OF THE PROJECT GUTENBERG EBOOK.*$"

# Download the books and clean them
for url in urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Stop early if a download failed
    text = response.text

    # Remove metadata before the actual book content
    text_cleaned = re.sub(start, "", text, flags=re.DOTALL)

    # Remove metadata after the actual book content
    text_cleaned = re.sub(end, "", text_cleaned, flags=re.DOTALL)

    # Append the cleaned text to all_text
    all_text += text_cleaned.strip() + "\n\n"  # Separate books by newlines

# Pad the punctuation, trim white space
all_text = re.sub(r'([.,!?()";])', r' \1 ', all_text)
all_text = re.sub(r'\s+', ' ', all_text)
all_text = all_text.strip()

# Save combined text to a single file
with open("combined_dickens.txt", "w", encoding="utf-8") as file:
    file.write(all_text)
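
As a quick sanity check of the cleaning steps above, the sketch below applies the same two regular expressions and the same punctuation padding to a short made-up string. The sample text and the `clean_gutenberg` helper name are hypothetical and are not part of the assignment code.

```python
import re

# Same patterns as in the cell above
START = r"(?i)^.*?\*\*\* START OF THE PROJECT GUTENBERG EBOOK.*?\n"
END = r"(?i)\n\*\*\* END OF THE PROJECT GUTENBERG EBOOK.*$"


def clean_gutenberg(raw):
    """Strip the header and footer, pad punctuation, and collapse whitespace."""
    body = re.sub(START, "", raw, flags=re.DOTALL)
    body = re.sub(END, "", body, flags=re.DOTALL)
    body = re.sub(r'([.,!?()";])', r' \1 ', body)
    return re.sub(r'\s+', ' ', body).strip()


# Made-up sample laid out like a Project Gutenberg file
sample = (
    "Title: Sample\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Body text.\nMore body.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License text\n"
)

cleaned = clean_gutenberg(sample)
print(cleaned)
```

Only the text between the two marker lines should survive, with each punctuation mark separated from the neighboring words by a space.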

In [5]:
#@title Tokenize the Data

# Tokenize the text (Word Tokens)
tokenizer = Tokenizer()
tokenizer.fit_on_texts([all_text])
sequences = tokenizer.texts_to_sequences([all_text])[0]

# Create input and output pairs for training
X = []
y = []

for i in range(SEQ_LENGTH, len(sequences)):
    X.append(sequences[i-SEQ_LENGTH:i])
    y.append(sequences[i])

X = pad_sequences(X, maxlen=SEQ_LENGTH)
y = np.array(y)

# Adjust vocab size
VOCAB_SIZE = len(tokenizer.word_index) + 1  # +1 for padding

# index_to_word for text generation
index_to_word = {i: word for word, i in tokenizer.word_index.items()}
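
The sliding-window loop above can be easier to follow on a tiny input. The sketch below repeats the same logic on six made-up token ids with a window of 3; `make_windows` and `toy_ids` are illustrative names, not part of the notebook.

```python
import numpy as np


def make_windows(token_ids, seq_length):
    """Return (X, y): each row of X holds seq_length tokens, y the token that follows."""
    X, y = [], []
    for i in range(seq_length, len(token_ids)):
        X.append(token_ids[i - seq_length:i])
        y.append(token_ids[i])
    return np.array(X), np.array(y)


# Hypothetical token ids standing in for `sequences`
toy_ids = [5, 8, 2, 9, 4, 7]
X_toy, y_toy = make_windows(toy_ids, seq_length=3)

print(X_toy)  # rows: [5 8 2], [8 2 9], [2 9 4]
print(y_toy)  # [9 4 7]
```

A sequence of N tokens yields N - seq_length training pairs. Because every window already has exactly `seq_length` entries, the `pad_sequences` call in the cell above should leave `X` unchanged apart from converting it to an array.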

## 2. Initial LSTM Model Training

In [3]:
#@title Create the LSTM Model

def make_LSTM_model():
  model = tf.keras.Sequential()                                                  # Sequential
  model.add(layers.Input(shape=(SEQ_LENGTH,), dtype="int32"))                    # Input Layer
  model.add(layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM))                         # Embedding Layer
  model.add(layers.LSTM(N_UNITS))                                                # LSTM Layer
  model.add(layers.Dense(VOCAB_SIZE, activation="softmax"))                      # Output Layer
  return model

In [6]:
#@title TextGenerationCallback Class

class TextGenerationCallback(Callback):
    def __init__(self, model, index_to_word, word_to_index, seq_length=SEQ_LENGTH, temperature=1.0, max_tokens=MAX_LEN):
        super().__init__()
        self.set_model(model)  # Attach the model so generate_text also works before fit() is called
        self.index_to_word = index_to_word
        self.word_to_index = word_to_index
        self.seq_length = seq_length
        self.temperature = temperature
        self.max_tokens = max_tokens

    def generate_text(self, start_prompt):
      # Convert the start prompt to a sequence of indices.
      # Words missing from word_to_index fall back to index 1. No OOV token is configured,
      # so index 1 is an ordinary vocabulary word rather than a dedicated "unknown" token.
      start_tokens = [self.word_to_index.get(word, 1) for word in start_prompt.split()]

      generated_text = start_prompt

      for _ in range(self.max_tokens):

          # Use at most the last seq_length tokens as the model input
          padded_input = np.array([start_tokens[-self.seq_length:]])

          # Predict the next-word probabilities (first row of the batch, a 1D vector)
          predictions = self.model.predict(padded_input, verbose=0)[0, :]

          # Apply temperature and normalize
          predictions = np.asarray(predictions).flatten()
          predictions = np.log(predictions + 1e-10) / self.temperature
          predictions = np.exp(predictions) / np.sum(np.exp(predictions))

          # Sample from the probability distribution
          next_token = np.random.choice(len(predictions), p=predictions)

          # Append the predicted token to the sequence
          start_tokens.append(next_token)

          # Convert the token back to a word and add it to the generated text
          generated_text += ' ' + self.index_to_word.get(next_token, '?')

          # Stop if index 0 is sampled (reserved for padding; no word maps to it)
          if next_token == 0:
              break

      return generated_text

    # Generate text after each training epoch
    def on_epoch_end(self, epoch, logs=None):

        # Charles Dickens start prompt example
        start_prompt = "It was the best of times"

        # Generate the text
        generated_text = self.generate_text(start_prompt)
        print(f"\nEpoch {epoch+1} Generated Text: {generated_text}\n")
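
The temperature step inside `generate_text` can be looked at in isolation. The sketch below applies the same log, divide, exponentiate, and renormalize sequence to a made-up four-word distribution and then samples from it. The `apply_temperature` name and the toy probabilities are assumptions for illustration; subtracting the maximum before exponentiating is a small numerical-stability addition that the callback itself does not use.

```python
import numpy as np


def apply_temperature(probs, temperature):
    """Rescale a probability vector: p_i ** (1 / T), then renormalize."""
    logits = np.log(np.asarray(probs, dtype=float) + 1e-10) / temperature
    logits -= logits.max()  # shift for numerical stability; cancels out after normalizing
    scaled = np.exp(logits)
    return scaled / scaled.sum()


# Made-up next-word distribution over a 4-word vocabulary
toy_probs = np.array([0.4, 0.3, 0.2, 0.1])
rng = np.random.default_rng(42)

for t in (1.0, 0.5, 0.1):
    scaled = apply_temperature(toy_probs, t)
    draws = rng.choice(len(scaled), size=1000, p=scaled)
    share = np.mean(draws == 0)
    print(f"T={t}: top prob={scaled[0]:.3f}, sampled share of top word={share:.3f}")
```

At a temperature of 1.0 the distribution is returned essentially unchanged, while lower temperatures move probability mass toward the most likely word, so the sampled text becomes less varied.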

In [7]:
#@title Establish the LSTM Model

# Compile
lstm_model = make_LSTM_model()
lstm_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
lstm_model.summary()

# Custom callback for text generation
text_gen_callback = TextGenerationCallback(
    model=lstm_model,
    index_to_word=index_to_word,
    word_to_index=tokenizer.word_index,
    seq_length=SEQ_LENGTH,
    temperature=TEMPERATURE,
    max_tokens=MAX_LEN
)
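
The `sparse_categorical_crossentropy` loss compiled above is the average negative log-probability that the model assigns to the correct next word, so exponentiating it gives a perplexity that is often easier to interpret. The sketch below uses made-up probabilities to show the calculation, then converts the first-epoch loss from the training log (5.6403). The helper name and the toy numbers are illustrative assumptions.

```python
import numpy as np


def sparse_categorical_crossentropy(probs, targets):
    """Mean negative log-probability of the target class in each row."""
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets)
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(picked)))


# Two made-up predictions over a 3-word vocabulary
toy_probs = [[0.7, 0.2, 0.1],
             [0.1, 0.3, 0.6]]
toy_targets = [0, 2]  # integer word indices, not one-hot vectors

toy_loss = sparse_categorical_crossentropy(toy_probs, toy_targets)
print(f"toy loss: {toy_loss:.4f}, toy perplexity: {np.exp(toy_loss):.2f}")

# Perplexity implied by the loss reported after epoch 1 of the first model
epoch1_loss = 5.6403
epoch1_perplexity = float(np.exp(epoch1_loss))
print(f"epoch 1 perplexity: {epoch1_perplexity:.1f}")
```

A perplexity in the hundreds means that, on average, the model is roughly as uncertain as if it were choosing uniformly among that many words, which fits the loosely connected text produced after the first epoch.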

In [10]:
# Train the model
history = lstm_model.fit(
    X,
    y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    callbacks=[text_gen_callback]
)

Epoch 1/25
10413/10417 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.1384 - loss: 5.6403
Epoch 1 Generated Text: It was the best of times near that they had nothing of dread in birth and flame with half enhanced and became overrun to designation and will marriage being receiving not onward the afterwards had been innocent of tellson’s reason and leader a cares of vase pity until with great becoming solitary child voice a old woman in number through the counters long casting that shook it awoke into the reflection without civil echoing chimbley staggered stairs ashamed fagin might now with home came into a child’s woman who was put the fire up his hands was to wait and halter the —a transmutation affectionate through

10417/10417 ━━━━━━━━━━━━━━━━━━━━ 93s 9ms/step - accuracy: 0.1384 - loss: 5.6403
Epoch 2/25
10412/10417 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.1577 - loss: 5.3153


In [7]:
#@title Print Probabilities Function

def print_probs(model, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=TEMPERATURE):

    # Build the word <-> index lookup tables
    word_to_index = tokenizer.word_index
    index_to_word = {idx: word for word, idx in word_to_index.items()}

    # Convert the prompt to a sequence of indices (unmatched words fall back to index 1)
    start_tokens = [word_to_index.get(word, 1) for word in prompt.split()]

    # Use at most the last seq_length tokens as the model input
    padded_input = np.array([start_tokens[-seq_length:]])

    # Predict the next word probabilities
    predictions = model.predict(padded_input, verbose=0)[0, :]

    # Apply temperature to the predictions
    predictions = np.asarray(predictions).flatten()
    predictions = np.log(predictions + 1e-10) / temperature
    predictions = np.exp(predictions) / np.sum(np.exp(predictions))

    # Get the top-k predictions and their corresponding indices
    top_indices = np.argsort(predictions)[::-1][:top_k]
    top_probs = predictions[top_indices]

    # Print the top-k predictions
    print(f"\nPROMPT: {prompt}")
    for i, idx in enumerate(top_indices):
        word = index_to_word.get(idx, '?')
        print(f"{word}:   \t{np.round(100*top_probs[i], 2)}%")
    print("--------\n")

In [15]:
#@title Example Text Prompt Generation for Oliver Twist

# Example usage:
prompt = "Please, sir,"
generated_text = text_gen_callback.generate_text(prompt)
print(f"Generated Text:\n{generated_text}")
print_probs(lstm_model, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=1.0)

Generated Text:
Please, sir, bunch which happily across the depths of his “but so all ” asked monks finding of the mother’s ugly friendship and terror her eyes under them in the nimble face was lieu at her that’s not the young people that mr maylie face was in the workhouse was about to it and on which he said old laws and darling it was wild defarge’s dead and sometimes you hear it weep for her her is a bad thing i have been lain to me after his englishman ” as he was desired to addressed his being made out of the

PROMPT: Please, sir,
fact:   	2.22%
worst:   	1.06%
robber:   	0.97%
judge:   	0.85%
same:   	0.8%
--------
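
One thing worth checking here is how the prompt is turned into indices. With its default settings, the Keras `Tokenizer` lowercases the text and strips ASCII punctuation before building `word_index`, whereas `generate_text` and `print_probs` split the raw prompt on whitespace, so tokens such as `Please,` and `sir,` fall back to index 1. The sketch below shows the difference using a made-up three-word vocabulary and a hypothetical `encode_prompt` helper; it is an illustration under those assumptions, not the notebook's code.

```python
# Default `filters` argument of the Keras Tokenizer (apostrophes are kept)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_FILTER_TO_SPACE = str.maketrans({ch: " " for ch in FILTERS})

# Made-up vocabulary; in the notebook this would be tokenizer.word_index
toy_word_index = {"the": 1, "please": 2, "sir": 3}


def encode_naive(prompt, word_index):
    """Whitespace split only; unmatched tokens fall back to index 1."""
    return [word_index.get(word, 1) for word in prompt.split()]


def encode_prompt(prompt, word_index):
    """Lowercase and filter punctuation first; skip words missing from the vocabulary."""
    words = prompt.lower().translate(_FILTER_TO_SPACE).split()
    return [word_index[w] for w in words if w in word_index]


print(encode_naive("Please, sir,", toy_word_index))   # [1, 1]
print(encode_prompt("Please, sir,", toy_word_index))  # [2, 3]
```

In the notebook itself, `tokenizer.texts_to_sequences([prompt])[0]` should give the same normalized encoding. Since no `oov_token` is set, index 1 is simply the most frequent word in the corpus, which may explain why the top predictions printed above read like continuations of a common word rather than of "Please, sir,".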



In [16]:
#@title Example Text Prompt Generation for A Tale of Two Cities

# Example usage:
prompt = "It was the best of times, it was"
generated_text = text_gen_callback.generate_text(prompt)
print(f"Generated Text:\n{generated_text}")
print_probs(lstm_model, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=1.0)

Generated Text:
It was the best of times, it was in england he sprang through the keyhole that was really in falling in danger he would have been leaned had the tend to itself from merry every heart and the animal was in but patting it in such closely deep piteously found it seems at time i have seen to seek them with birth with your ain’t on mrs fezziwig i see it very supper all it is holes in the mob and standing dock—the too long for usual after i’m too anxious monsieur an worship have him again ” mr lorry sat fear of it was gone which for

PROMPT: It was the best of times, it was
not:   	5.8%
a:   	3.95%
in:   	3.25%
now:   	2.21%
very:   	1.77%
--------



In [17]:
#@title Example Text Prompt Generation for A Christmas Carol

# Example usage:
prompt = "There is nothing in the world"
generated_text = text_gen_callback.generate_text(prompt)
print(f"Generated Text:\n{generated_text}")
print_probs(lstm_model, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=1.0)

Generated Text:
There is nothing in the world this is a thing and shone and even a melancholy hope and says the course of the ghost would never want to keep him and clattering like scrooge makes more more like him twelve minutes there was rescued here for his person that happened good gang for the first reason mr scrooge has now in vain ” “never ” returned miss pross grim friends “and that he is an orphan that usually ever given you have been when we were ” said “i am a cause of both bad gentlemen don’t date the matter if it’s fetched my dear say

PROMPT: There is nothing in the world
that:   	23.62%
to:   	9.64%
for:   	5.21%
with:   	4.48%
all:   	3.84%
--------



## 3. Experiment with Model Complexity

In [8]:
#@title Test a Change in the Number of Units in Each LSTM Layer

N_UNITS = 128

In [9]:
#@title Create the LSTM Model (3 LSTM Layers)

def make_LSTM_model():
  model = tf.keras.Sequential()                                                  # Sequential
  model.add(layers.Input(shape=(SEQ_LENGTH,), dtype="int32"))                    # Input Layer
  model.add(layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM))                         # Embedding Layer
  model.add(layers.LSTM(N_UNITS, return_sequences=True))                         # LSTM Layer
  model.add(layers.LSTM(N_UNITS, return_sequences=True))                         # LSTM Layer
  model.add(layers.LSTM(N_UNITS))                                                # LSTM Layer (LAST)
  model.add(layers.Dense(VOCAB_SIZE, activation="softmax"))                      # Output Layer
  return model
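
To get a feel for how much larger this model is than the single-layer one, the parameter counts can be worked out by hand. The sketch below assumes the standard Keras layer formulas with biases enabled and uses a placeholder vocabulary of 20,000 words, because the notebook overwrites `VOCAB_SIZE` with the tokenizer's actual vocabulary size. All function names are illustrative.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, output_dim):
    """Weight matrix plus one bias per output."""
    return input_dim * output_dim + output_dim


def count_params(vocab_size, embedding_dim, lstm_units):
    """Total parameters for Embedding -> stacked LSTMs -> softmax Dense."""
    total = embedding_params(vocab_size, embedding_dim)
    input_dim = embedding_dim
    for units in lstm_units:
        total += lstm_params(input_dim, units)
        input_dim = units
    return total + dense_params(input_dim, vocab_size)


VOCAB = 20000  # placeholder; the real value comes from the tokenizer

small = count_params(VOCAB, 100, [64])
large = count_params(VOCAB, 100, [128, 128, 128])
print(f"1 x 64 units:  {small:,} parameters")
print(f"3 x 128 units: {large:,} parameters")
```

Under these assumptions, most of the parameters sit in the embedding and output layers, which scale with the vocabulary size, while the LSTM layers themselves account for a comparatively small share. The exact totals for the trained models are those printed by `summary()`.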

In [10]:
#@title Establish the LSTM Model

# Compile
lstm_model_complex = make_LSTM_model()
lstm_model_complex.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
lstm_model_complex.summary()

# Custom callback for text generation
text_gen_callback = TextGenerationCallback(
    model=lstm_model_complex,
    index_to_word=index_to_word,
    word_to_index=tokenizer.word_index,
    seq_length=SEQ_LENGTH,
    temperature=TEMPERATURE,
    max_tokens=MAX_LEN
)

In [11]:
# Train the model
history_complex = lstm_model_complex.fit(
    X,
    y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    callbacks=[text_gen_callback]
)

Epoch 1/25
10413/10416 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.0681 - loss: 6.7592
Epoch 1 Generated Text: It was the best of times and ghost with my idea was give with the life of her adventurous revengeful hauled paid springing his disposition of his whole exclamations of madame regales ornamented upon something worse in an sense were with a fountain of waiting their hushed looking their swift father and while a end of a bargain up of curtains deliberation covering he were whether a subject in the bank it was addressed up jerry the woman uneasiness cordially with the i not a quantity of staring history by a wild wall of prisoners fleet owner and pilferer at a france “what anxiously had counsel

10416/10416 ━━━━━━━━━━━━━━━━━━━━ 158s 15ms/step - accuracy: 0.0681 - loss: 6.7591
Epoch 2/25
10414/10416 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.1097 - loss: 5.9539
Epoch 2 Gen

## 4. Temperature and Prompt Variations

In [13]:
#@title Example Text Prompt Generation for A Tale of Two Cities (temperature = 1.0)

# Example usage:
prompt = "It was the best of times, it was"
generated_text = text_gen_callback.generate_text(prompt)
print(f"Generated Text:\n{generated_text}")
print_probs(lstm_model_complex, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=1.0)

Generated Text:
It was the best of times, it was seized by many coaches in most been incoherent spectacles at a great hearty feet like perpetually but taking twenty delicious marks for the rest that the boy contented with a well ends proposed and the jew entered from the air of monks smiling back and flame beside oliver and pictured in the of office and ends “it stood deserted followed to the turnkey of the girl’s opened silence again if it be talking but i’d run there and whosoever let me mean with life ” “ay well ” “monseigneur how it can’t be called directly let him charles get as

PROMPT: It was the best of times, it was
not:   	6.74%
a:   	5.8%
the:   	5.06%
seated:   	2.93%
all:   	2.68%
--------



In [14]:
#@title Example Text Prompt Generation for A Tale of Two Cities (temperature = 0.5)

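# Note: temperature=0.5 is passed to print_probs only. generate_text samples at the
# callback's temperature (TEMPERATURE = 1.0) unless text_gen_callback.temperature is set first.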
prompt = "It was the best of times, it was"
generated_text = text_gen_callback.generate_text(prompt)
print(f"Generated Text:\n{generated_text}")
print_probs(lstm_model_complex, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=0.5)

Generated Text:
It was the best of times, it was all would curl looked into the two figures that played some note of two gentlemen twist of little two and twenty eighteen minutes from the sharp hand of the jew roused him to chertsey time and appearing to express business in no way came at a sunburnt price and holding nothing of it good and plain to oliver’s former change when the bearer had the idea of no further weather all himself there was an unsuccessful and as well by the following lane and escorted his conversation in life was well under his dear dear will besides and looked at

PROMPT: It was the best of times, it was
not:   	29.5%
a:   	21.84%
the:   	16.67%
seated:   	5.58%
all:   	4.67%
--------



In [15]:
#@title Example Text Prompt Generation for A Tale of Two Cities (temperature = 0.1)

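# Note: temperature=0.1 is passed to print_probs only. generate_text samples at the
# callback's temperature (TEMPERATURE = 1.0) unless text_gen_callback.temperature is set first.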
prompt = "It was the best of times, it was"
generated_text = text_gen_callback.generate_text(prompt)
print(f"Generated Text:\n{generated_text}")
print_probs(lstm_model_complex, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=0.1)

Generated Text:
It was the best of times, it was a him louis laughed a pale kind that might not eyes what lay so far that the pointing over and was smoking for a nightcap their steps crossed the room at which they had listened asleep and never inquire got in prelude of listening to it and finding three paces alone from the saucepan and dread of a great stranger he seemed to ring the murderer which the two boys having disappeared to recollect for the young lady as he made much of trouble and having neither belief once so desirable he did not put any finger upon it and

PROMPT: It was the best of times, it was
not:   	78.09%
a:   	17.39%
the:   	4.5%
seated:   	0.02%
all:   	0.01%
--------



In [17]:
#@title Example Text Prompt Generation for Oliver Twist (temperature = 1.0, 0.5, 0.1)

# Temperature 1.0
prompt = "Please, sir,"
generated_text = text_gen_callback.generate_text(prompt)
print(f"Generated Text:\n{generated_text}")
print_probs(lstm_model_complex, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=1.0)

# Temperature 0.5 (passed to print_probs only; generate_text keeps sampling at the
# callback's temperature, TEMPERATURE = 1.0, unless text_gen_callback.temperature is set)
prompt = "Please, sir,"
generated_text = text_gen_callback.generate_text(prompt)
print(f"Generated Text:\n{generated_text}")
print_probs(lstm_model_complex, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=0.5)

# Temperature 0.1 (passed to print_probs only, as above)
prompt = "Please, sir,"
generated_text = text_gen_callback.generate_text(prompt)
print(f"Generated Text:\n{generated_text}")
print_probs(lstm_model_complex, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=0.1)

Generated Text:
Please, sir, “hem of these sleeping objects or a in the attend weather the inheritance of their youthful obstacles and opening the prison in their books out of every cause of loaves and london there is no new expectation in a place of wisdom and some delicate limb “mind him directed oliver’s description within the the stately reason to pour up to scrooge as he very strong and watchfulness to him “what do you want me what it has ” “oh within having been any very good hearted uncle ” replied nancy with an effort to be where spoke “it’s not jacques

PROMPT: Please, sir,
master’s:   	30.12%
skies:   	12.34%
forbid:   	12.0%
notable:   	7.79%
guidance:   	6.92%
--------

Generated Text:
Please, sir, master’s especially son that the hungry three to the night “show that at such a new creature could not have any enough in this eyes and those in all the too half but life stop four drops of business an englishman later edge creature sometimes with anxiety for them from tellson’s 

In [18]:
#@title Example Text Prompt Generation for A Christmas Carol (temperature = 1.0, 0.5, 0.1)

# Temperature 1.0
prompt = "There is nothing in the world"
generated_text = text_gen_callback.generate_text(prompt)
print(f"Generated Text:\n{generated_text}")
print_probs(lstm_model_complex, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=1.0)

# Temperature 0.5 (passed to print_probs only; generate_text keeps sampling at the
# callback's temperature, TEMPERATURE = 1.0, unless text_gen_callback.temperature is set)
prompt = "There is nothing in the world"
generated_text = text_gen_callback.generate_text(prompt)
print(f"Generated Text:\n{generated_text}")
print_probs(lstm_model_complex, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=0.5)

# Temperature 0.1 (passed to print_probs only, as above)
prompt = "There is nothing in the world"
generated_text = text_gen_callback.generate_text(prompt)
print(f"Generated Text:\n{generated_text}")
print_probs(lstm_model_complex, tokenizer, prompt, top_k=5, seq_length=SEQ_LENGTH, temperature=0.1)

Generated Text:
There is nothing in the world of fear that she would die to disturb the robbery and the maylie were an trial but the men returned waggons and the whole flush went keenly to the door they were moved at the windows side of the last scene and make something of this task of mr lorry and now went away in arms to throw it as he had gone far admitted by her bedroom through the great street like over paris and straw the silence chapter one nor protection from the voluntarily stretching back “come on ” “the lean curious eyed him “there’s safe off mr

PROMPT: There is nothing in the world
of:   	34.0%
that:   	17.19%
and:   	6.17%
for:   	4.72%
but:   	4.59%
--------

Generated Text:
There is nothing in the world the accused rose and dismissed him for and he led his little wooden yards were lost with scrooge's corney’s eye and the principle ceased and for the distracted footsteps of a new pipe was a donkey were sitting before him he jerked himself on his knees and said as graciou

---
# Discussion

## 5. Evaluation of Generated Text

Assessing the quality of the generated text:

*   Coherence: Because neither model is very complex and the dataset consists of only three books, the generated output is not very coherent. The models produce run-on sentences that do not follow the rules of grammar well. Interestingly, they do appear to mimic dialogue, opening and closing quotation marks within the generated output. This suggests that even a simple model that has not learned the rules of written language can still pick up some writing conventions, such as dialogue, from very little data, and that the models can detect at least some patterns in the training text.
*   Relevance: The longer the generated output runs, the less relevant it becomes to the given prompt. The tested prompts are only short, famous lines from each of the three books, so this result is not entirely unexpected. It highlights both the importance of testing and reconfiguring the models and the role of prompt design in steering the model toward a desired result.
*   Stylistic Accuracy: With only three books in the training dataset, it would be a miracle for the model to accurately mimic the author's style. However, certain words and character names in the generated text match the words and characters unique to the author's writings. This may be due to the model learning certain patterns of the author, but the word-level tokenization of the training set likely also plays a part: the vocabulary consists only of words from the training text, so every generated word is one that appears in the books. Word tokenization helps the model learn what is in the training set, with the trade-off that it cannot learn or produce words outside of the dataset. Since the goal here is to match the author's stylistic choices, this trade-off is not very costly.

Assessing the outcomes of different temperatures:

*   The temperature has a clear effect on the printed next-word probabilities. The higher the temperature, the more evenly the probability is spread across words. Lower temperatures sharpen the distribution: the words that already had the highest probabilities gain even more, so the top few candidates become far more likely to be chosen. For the prompt "It was the best of times, it was", the probability of "not" rises from 6.74% at a temperature of 1.0 to 29.5% at 0.5 and 78.09% at 0.1. Note that in Section 4 only `print_probs` received the temperature argument, while the generated passages were all sampled at the callback's temperature of 1.0, so the effect shows up in the printed probabilities rather than in the passages themselves. In general, lower temperatures may lead the model to generate text that is more accurate in grammar and structure, while higher temperatures may help it create new sentences and provide more varied results. This exemplifies the trade-off between creativity and coherence when choosing a temperature.
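
The temperature effect described above can be checked against the printed probabilities. Dividing log-probabilities by a temperature T and renormalizing raises each probability to the power 1/T, so the ratio between two words' probabilities at temperature T should equal their ratio at T = 1.0 raised to 1/T. The sketch below checks this with the top-3 values printed in Section 4 for the prompt "It was the best of times, it was". Small mismatches are expected because the printed values are rounded, and the function names are illustrative.

```python
# Top-3 probabilities (in %) printed in Section 4 for the complex model
printed = {
    1.0: {"not": 6.74, "a": 5.80, "the": 5.06},
    0.5: {"not": 29.50, "a": 21.84, "the": 16.67},
    0.1: {"not": 78.09, "a": 17.39, "the": 4.50},
}


def predicted_ratio(word_a, word_b, temperature):
    """Ratio expected at `temperature`, derived from the T = 1.0 values."""
    base = printed[1.0][word_a] / printed[1.0][word_b]
    return base ** (1.0 / temperature)


def observed_ratio(word_a, word_b, temperature):
    """Ratio actually printed at `temperature`."""
    return printed[temperature][word_a] / printed[temperature][word_b]


for t in (0.5, 0.1):
    for other in ("a", "the"):
        pred = predicted_ratio("not", other, t)
        obs = observed_ratio("not", other, t)
        print(f"T={t}: not/{other} predicted {pred:.2f}, observed {obs:.2f}")
```

The predicted and observed ratios should agree to within rounding error, which shows that a modest lead at a temperature of 1.0 turns into a dominant one at 0.1.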