
// Katrina Cwiertniewicz
// CSC 330
// 11/--/2024
// Assignment 5: Text Generation Using LSTM on Project Gutenberg Training Data
#### The purpose of this assignment is to develop an LSTM model that generates text. The goal is to produce coherent and stylistically relevant text based on prompts.

In [1]:
import numpy as np
import json
import re
import string

import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses

## 0. Parameters <a name="parameters"></a>

In [2]:
VOCAB_SIZE = 20049
MAX_LEN = 500
EMBEDDING_DIM = 100
N_UNITS = 128
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 32
EPOCHS = 25

## 1. Load the data <a name="load"></a>

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
%pwd

'/content'

In [5]:
import os


# List of files for additional texts (e.g., different Edgar Allan Poe Works)
file_paths = [
  os.path.join('/content/drive/MyDrive/text/The_Tell_Tale_Heart.txt'),                        # The Tell Tale Heart
  os.path.join('/content/drive/MyDrive/text/The_Cask_of_Amontillado.txt'),                    # The Cask of Amontillado
  os.path.join('/content/drive/MyDrive/text/The_Raven.txt'),                                  # The Raven
  os.path.join('/content/drive/MyDrive/text/The_Masque.txt'),                                 # The Masque of the Red Death
  os.path.join('/content/drive/MyDrive/text/Annabel_Lee.txt'),                                # Annabel Lee
  os.path.join('/content/drive/MyDrive/text/Lenore.txt'),                                     # Lenore
  os.path.join('/content/drive/MyDrive/text/The_Bells.txt'),                                  # The Bells
  os.path.join('/content/drive/MyDrive/text/The_Black_Cat.txt'),                              # The Black Cat
  os.path.join('/content/drive/MyDrive/text/The_Fall_of_the_House_of_Usher.txt'),             # The Fall of the House of Usher
  os.path.join('/content/drive/MyDrive/text/The_Oval_Portrait.txt'),                          # The Oval Portrait
  os.path.join('/content/drive/MyDrive/text/The_Pit_and_the_Pendulum.txt'),                   # The Pit and the Pendulum
  os.path.join('/content/drive/MyDrive/text/The_Premature_Burial.txt'),                       # The Premature Burial
  os.path.join('/content/drive/MyDrive/text/The_Narrative_of_Arthur_Gordon.txt'),             # The Narrative of Arthur Gordon Pym of Nantucket
  os.path.join('/content/drive/MyDrive/text/Al_Aaraaf.txt'),                                  # Al Aaraaf
  os.path.join('/content/drive/MyDrive/text/Alone.txt'),                                      # Alone
  os.path.join('/content/drive/MyDrive/text/Eureka.txt'),                                     # Eureka
  os.path.join('/content/drive/MyDrive/text/The_Facts_in_the_Case_of_M._Valdemar.txt'),       # The Facts in the Case of M.Valdemar
  os.path.join('/content/drive/MyDrive/text/A_Descent_into_the_Maelstrom.txt'),               # A Descent into the Maelstrom
  os.path.join('/content/drive/MyDrive/text/William_Wilson.txt'),                             # William Wilson
  os.path.join('/content/drive/MyDrive/text/Berenice.txt'),                                   # Berenice
  os.path.join('/content/drive/MyDrive/text/The_Gold_Bug.txt'),                               # The Gold Bug
  os.path.join('/content/drive/MyDrive/text/The_Murders_of_Rue_Morgue.txt'),                  # The Murders in the Rue Morgue
  os.path.join('/content/drive/MyDrive/text/The_Mystery_of_Marie_Roget.txt'),                 # The Mystery of Marie Roget
  os.path.join('/content/drive/MyDrive/text/The_Purloined_Letter.txt'),                       # The Purloined Letter
  os.path.join('/content/drive/MyDrive/text/Von_Kempelen_and _his_Discovery.txt'),            # Von Kempelen and His Discovery
  os.path.join('/content/drive/MyDrive/text/Island_of_the_Fay.txt'),                          # Island of the Fay
  os.path.join('/content/drive/MyDrive/text/Mesmeric_Revelation.txt'),                        # Mesmeric Revelation
  os.path.join('/content/drive/MyDrive/text/Silence_A_Fable.txt'),                            # Silence a Fable
  os.path.join('/content/drive/MyDrive/text/MS._Found_in_a_Bottle.txt'),                      # MS. Found in a Bottle
  os.path.join('/content/drive/MyDrive/text/The_Thousand_and_Second_Tale_of_Scherezade.txt'), # The Thousand and Second Tale of Scherezade
  os.path.join('/content/drive/MyDrive/text/The_Unparalleled_Adventure.txt'),                 # The Unparalleled Adventure of One Hans Pfaall
  os.path.join('/content/drive/MyDrive/text/The_Assignation.txt'),                            # The Assignation
  os.path.join('/content/drive/MyDrive/text/The_Imp.txt'),                                    # The Imp of the Perverse
  os.path.join('/content/drive/MyDrive/text/The_Domain_of_Arnheim.txt'),                      # The Domain of Arnheim
  os.path.join('/content/drive/MyDrive/text/The_Assignation.txt'),                            # Landor's Cottage (NOTE: this path repeats The_Assignation.txt, so that text is loaded twice)
  os.path.join('/content/drive/MyDrive/text/Morella.txt'),                                    # Morella
  os.path.join('/content/drive/MyDrive/text/Ligeia.txt'),                                     # Ligeia
  os.path.join('/content/drive/MyDrive/text/King_Pest.txt'),                                  # King Pest
  os.path.join('/content/drive/MyDrive/text/A_Tale_of_the_Ragged_Mountains.txt'),             # A Tale of the Ragged Mountains
  os.path.join('/content/drive/MyDrive/text/The_Spectacles.txt'),                             # The Spectacles
  os.path.join('/content/drive/MyDrive/text/Philosophy_of_Furniture.txt'),                    # The Philosophy of Furniture
  os.path.join('/content/drive/MyDrive/text/The_Devil_in_Belfry.txt'),                        # The Devil in the Belfry
  os.path.join('/content/drive/MyDrive/text/Bon_Bon.txt'),                                    # Bon-Bon
  os.path.join('/content/drive/MyDrive/text/Some_Words_with_a_Mummy.txt')                     # Some Words with a Mummy


]

# Initialize an empty string to hold all text
all_text = ""

# Read each text file from Drive and append it to all_text
for file_path in file_paths:
  with open(file_path, 'r', encoding="utf-8") as file:
    all_text += file.read() + "\n\n"  # Separate texts by blank lines

# Save combined text to a single file (once, after all files are read)
with open('/content/combined_poe.txt', "w", encoding="utf-8") as file:
  file.write(all_text)


In [6]:
# Count the words of text
with open('/content/combined_poe.txt', "r", encoding="utf-8") as file:
  file_content = file.read()
  words = file_content.split()
  n_words = len(words)
print(f"{n_words} words loaded")

326500 words loaded


In [7]:
# Example Sentence of First Ten Words
example_sentence = words[:10]
print(f"Example Sentence: {example_sentence}")

Example Sentence: ['True!—nervous—very,', 'very', 'dreadfully', 'nervous', 'I', 'had', 'been', 'and', 'am;', 'but']


## 2. Tokenise the data

In [8]:
# Pad the punctuation, to treat them as separate 'words'
def pad_punctuation(s):
    s = re.sub(f"([{string.punctuation}])", r" \1 ", s)
    s = re.sub(" +", " ", s)
    return s


# Use the same absolute path the combined file was written to
with open("/content/combined_poe.txt", "r", encoding="utf-8") as file:
    text_data = [pad_punctuation(line) for line in file]

In [9]:
example_data = text_data[30]
print(example_data)

Ha ! —would a madman have been so wise as this ? And then , when my
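
The sample line above still contains `—would` as a single token, because the em dash is not part of `string.punctuation`. The following is a minimal sketch of one way to split it off, assuming an optional `extra` parameter (hypothetical, not part of the original notebook) that lists additional characters to pad. `re.escape` is used here only to make the character class robust.

```python
import re
import string


def pad_punctuation(s, extra=""):
    # Build a character class from ASCII punctuation plus any extra characters
    chars = re.escape(string.punctuation + extra)
    s = re.sub(f"([{chars}])", r" \1 ", s)
    # Collapse runs of spaces created by the padding
    s = re.sub(" +", " ", s)
    return s


sample = "Ha!—would a madman have been so wise as this?"
default_tokens = pad_punctuation(sample).split()
extended_tokens = pad_punctuation(sample, extra="—").split()
print(default_tokens)
print(extended_tokens)
```

Changing the padding rule changes the vocabulary, so the vectorisation layer would need to be re-adapted and the model retrained.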



In [10]:
# Convert to a Tensorflow Dataset
text_ds = (
    tf.data.Dataset.from_tensor_slices(text_data)
    .batch(BATCH_SIZE)
    .shuffle(1000)
)

In [11]:
# Create a vectorisation layer
vectorize_layer = layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,
)

In [12]:
# Adapt the layer to the training set
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()

# Length of Vocabulary
print(f"Length of Vocabulary: {len(vocab)}")

Length of Vocabulary: 20049


In [13]:
# Display some token:word mappings (token ids 10 to 29)
for i, word in enumerate(vocab[10:30], start=10):
    print(f"{i}: {word}")

10: -
11: i
12: that
13: _
14: it
15: was
16: as
17: with
18: which
19: at
20: my
21: is
22: had
23: ;
24: we
25: this
26: for
27: by
28: not
29: be


In [14]:
# Display the whole corpus converted to ints (one row per line of text)
example_tokenised = vectorize_layer(text_data)
print(example_tokenised.numpy())

[[  326    47 11929 ...     0     0     0]
 [   33   329    74 ...     0     0     0]
 [13630 18134 10712 ...     0     0     0]
 ...
 [    0     0     0 ...     0     0     0]
 [    0     0     0 ...     0     0     0]
 [    0     0     0 ...     0     0     0]]


## 3. Create the Training Set

In [15]:
# Create the training set of text and the same text shifted by one word
def prepare_inputs(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y


train_ds = text_ds.map(prepare_inputs)
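
To make the one-word shift in `prepare_inputs` concrete, the sketch below applies the same slicing to a small batch of token ids in plain Python. The ids are illustrative only; 0 stands for padding, as in `TextVectorization`.

```python
# Two padded "sentences" of token ids (illustrative values)
batch = [
    [326, 47, 11929, 5, 0, 0],
    [33, 329, 74, 0, 0, 0],
]

# Inputs drop the last token; targets drop the first token
x = [row[:-1] for row in batch]
y = [row[1:] for row in batch]

# At every position the target is the token that follows the input token
for inp, tgt in zip(x[0], y[0]):
    print(f"input {inp:>5} -> target {tgt:>5}")
```

Each input position is therefore trained to predict the next token of the same line, including the padding that follows the end of the line.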

## 4. Build the LSTM <a name="build"></a>

In [16]:
inputs = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x = layers.LSTM(N_UNITS, return_sequences=True)(x)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
lstm = models.Model(inputs, outputs)
lstm.summary()

## 5. Train the LSTM <a name="train"></a>

In [17]:
loss_fn = losses.SparseCategoricalCrossentropy()
lstm.compile("adam", loss_fn)

In [18]:
# Create a TextGenerator callback that prints a sample after every epoch
class TextGenerator(callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):
        super().__init__()
        self.index_to_word = index_to_word
        # Inverse vocabulary: word -> token index
        self.word_to_index = {
            word: index for index, word in enumerate(index_to_word)
        }

    def sample_from(self, probs, temperature):
        # Rescale the probabilities by temperature, then sample a token index
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs

    def generate(self, start_prompt, max_tokens, temperature):
        # Map prompt words to token indices; unknown words map to 1 ([UNK])
        start_tokens = [
            self.word_to_index.get(x, 1) for x in start_prompt.split()
        ]
        sample_token = None
        info = []
        # Stop at max_tokens or when the padding token (0) is sampled
        while len(start_tokens) < max_tokens and sample_token != 0:
            x = np.array([start_tokens])
            y = self.model.predict(x, verbose=0)
            # Sample the next word from the distribution at the last position
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            info.append({"prompt": start_prompt, "word_probs": probs})
            start_tokens.append(sample_token)
            start_prompt = start_prompt + " " + self.index_to_word[sample_token]
        print(f"\ngenerated text:\n{start_prompt}\n")
        return info

    def on_epoch_end(self, epoch, logs=None):
        self.generate("Text:", max_tokens=500, temperature=1.0)

In [19]:
# Instantiate the text-generation callback with the vocabulary

text_generator = TextGenerator(vocab)
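
The `sample_from` method rescales the predicted probabilities by a temperature before sampling. As a rough illustration of that rescaling, the sketch below applies the same formula to a toy three-word distribution in plain Python; the function name `apply_temperature` and the toy values are assumptions for the example only.

```python
def apply_temperature(probs, temperature):
    # Raise each probability to the power 1/temperature, then renormalise
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [p / total for p in scaled]


toy_probs = [0.5, 0.3, 0.2]

sharp = apply_temperature(toy_probs, 0.1)  # low temperature: near-greedy
same = apply_temperature(toy_probs, 1.0)   # temperature 1: unchanged
flat = apply_temperature(toy_probs, 2.0)   # high temperature: more uniform

for name, dist in [("T=0.1", sharp), ("T=1.0", same), ("T=2.0", flat)]:
    print(name, [round(p, 4) for p in dist])
```

A low temperature concentrates almost all of the mass on the most likely word, which is consistent with the near-100% entries that appear for the `temperature=0.1` prompt later in the notebook.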

In [20]:
lstm.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[text_generator],
)

Epoch 1/25
1035/1035 ━━━━━━━━━━━━━━━━━━━━ 0s 180ms/step - loss: 1.0961
generated text:
Text: indivisible prepares day . from the the . ground , 

1035/1035 ━━━━━━━━━━━━━━━━━━━━ 191s 181ms/step - loss: 1.0954
Epoch 2/25
1035/1035 ━━━━━━━━━━━━━━━━━━━━ 0s 183ms/step - loss: 0.1437
generated text:
Text: enchantment my means . his world lens hair to also supposed - and afar upon 

1035/1035 ━━━━━━━━━━━━━━━━━━━━ 190s 183ms/step - loss: 0.1437
Epoch 3/25
1035/1035 ━━━━━━━━━━━━━━━━━━━━ 0s 183ms/step - loss: 0.1332
generated text:
Text: phrenology open for ” his mass of the feet had 

1035/1035 ━━━━━━━━━━━━━━━━━━━━ 190s 183ms/step - loss: 0.1332
Epoch 4/25
1035/1035 ━━━━━━━━━━━━━━━━━━━━ 0s 184ms/step - loss: 0.1296
generated text:
Text: valley in nature to docu

<keras.src.callbacks.history.History at 0x7bbd621eaef0>

## 6. Generate text using the LSTM

In [21]:
def print_probs(info, vocab, top_k=5):
    for step in info:
        print(f"\nPROMPT: {step['prompt']}")
        word_probs = step["word_probs"]
        # Indices of the top_k most probable tokens, highest first
        i_sorted = np.argsort(word_probs)[::-1][:top_k]
        for idx in i_sorted:
            print(f"{vocab[idx]}:   \t{np.round(100*word_probs[idx],2)}%")
        print("--------\n")

In [22]:
# Prompt 1: From "The Raven"
info = text_generator.generate(
    "Once upon a midnight dreary,", max_tokens=500, temperature=0.1
)


generated text:
Once upon a midnight dreary, , i say , in the same 
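
Because `generate` splits the prompt on whitespace only, words such as `Once` and `dreary,` are looked up verbatim, while the vocabulary was built from lowercased, punctuation-padded text. Such words fall back to index 1 (`[UNK]`). The following is a minimal sketch of a prompt normaliser that mirrors the training preprocessing; `normalize_prompt`, `prompt_to_tokens`, and the toy vocabulary are hypothetical names and values for illustration, not part of the original notebook.

```python
import re
import string


def normalize_prompt(prompt):
    # Pad ASCII punctuation with spaces, collapse spaces, and lowercase,
    # mirroring pad_punctuation plus standardize="lower"
    padded = re.sub(f"([{re.escape(string.punctuation)}])", r" \1 ", prompt)
    return re.sub(" +", " ", padded).strip().lower()


def prompt_to_tokens(prompt, word_to_index, unk_index=1):
    # Look up each normalised word; unknown words map to unk_index
    return [word_to_index.get(w, unk_index) for w in normalize_prompt(prompt).split()]


# Toy vocabulary: index 0 is padding and index 1 is [UNK], as in TextVectorization
toy_vocab = ["", "[UNK]", ",", "once", "upon", "a", "midnight", "dreary"]
word_to_index = {word: index for index, word in enumerate(toy_vocab)}

prompt = "Once upon a midnight dreary,"
naive = [word_to_index.get(w, 1) for w in prompt.split()]
fixed = prompt_to_tokens(prompt, word_to_index)
print("naive:", naive)
print("fixed:", fixed)
```

With the real vocabulary, feeding normalised tokens to the model would likely change the generated continuations shown in this section, so the recorded outputs reflect the unnormalised prompts.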



In [23]:
print_probs(info, vocab)


PROMPT: Once upon a midnight dreary,
,:   	100.0%
.:   	0.0%
!:   	0.0%
of:   	0.0%
):   	0.0%
--------


PROMPT: Once upon a midnight dreary, ,
i:   	88.85%
the:   	9.45%
and:   	0.98%
at:   	0.42%
however:   	0.2%
--------


PROMPT: Once upon a midnight dreary, , i
say:   	81.15%
am:   	11.78%
had:   	6.34%
felt:   	0.35%
found:   	0.23%
--------


PROMPT: Once upon a midnight dreary, , i say
,:   	99.97%
that:   	0.03%
the:   	0.0%
not:   	0.0%
it:   	0.0%
--------


PROMPT: Once upon a midnight dreary, , i say ,
in:   	92.9%
that:   	6.48%
the:   	0.36%
by:   	0.07%
at:   	0.07%
--------


PROMPT: Once upon a midnight dreary, , i say , in
the:   	98.34%
fact:   	1.63%
a:   	0.04%
this:   	0.0%
that:   	0.0%
--------


PROMPT: Once upon a midnight dreary, , i say , in the
same:   	73.61%
case:   	13.06%
:   	7.39%
first:   	4.06%
very:   	1.86%
--------


PROMPT: Once upon a midnight dreary, , i say , in the same
:   	100.0%
time:   	0.0%
manner:   	0.0%
day:   	0.0%
,:   	0.0%
---

In [24]:
# Prompt 2: From "The Tell Tale Heart"
info = text_generator.generate(
    "And have I not told you that what you mistake for madness is but over-acuteness of the sense?", max_tokens=500, temperature=0.5
)


generated text:
And have I not told you that what you mistake for madness is but over-acuteness of the sense? , and , in the 



In [25]:
print_probs(info, vocab)


PROMPT: And have I not told you that what you mistake for madness is but over-acuteness of the sense?
,:   	67.69%
of:   	28.76%
:   	1.22%
.:   	1.19%
—:   	0.48%
--------


PROMPT: And have I not told you that what you mistake for madness is but over-acuteness of the sense? ,
and:   	71.22%
:   	11.99%
the:   	7.1%
of:   	6.38%
that:   	1.56%
--------


PROMPT: And have I not told you that what you mistake for madness is but over-acuteness of the sense? , and
:   	36.27%
the:   	34.86%
,:   	12.55%
i:   	6.23%
a:   	1.88%
--------


PROMPT: And have I not told you that what you mistake for madness is but over-acuteness of the sense? , and ,
in:   	52.04%
:   	18.57%
the:   	5.91%
by:   	4.17%
with:   	3.84%
--------


PROMPT: And have I not told you that what you mistake for madness is but over-acuteness of the sense? , and , in
the:   	96.33%
:   	2.39%
my:   	0.58%
a:   	0.38%
its:   	0.07%
--------


PROMPT: And have I not told you that what you mistake for madness is but over-ac

In [26]:
# Prompt 3: From "The Cask of Amontillado"
info = text_generator.generate(
    "A million candles have burned themselves out. Still ", max_tokens=500, temperature=1.0
)
print_probs(info, vocab)


generated text:
A million candles have burned themselves out. Still  , 


PROMPT: A million candles have burned themselves out. Still 
,:   	53.28%
:   	25.39%
;:   	6.89%
.:   	3.13%
at:   	1.94%
--------


PROMPT: A million candles have burned themselves out. Still  ,
:   	90.15%
and:   	4.01%
that:   	0.52%
as:   	0.51%
the:   	0.45%
--------




