<a href="https://colab.research.google.com/github/KelseyNager/GenAI/blob/main/Problem_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Text Generation with LSTM
##Kelsey Nager
##CSC 330

The purpose of this assignment is to train an LSTM model on three of Virginia Woolf's books and generate text in a similar style. I will use "Mrs. Dalloway", "The Common Reader", and "The Voyage Out", downloaded from *Project Gutenberg*. I will build a single-layer and a multi-layer LSTM and experiment with their parameters to find the most effective model for generating coherent text comparable to Woolf's style.

#Parameters

In [None]:
import numpy as np
import json
import re
import string

import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses

In [None]:
VOCAB_SIZE = 20000  # accommodates the observed vocabulary size of 18612
MAX_LEN = 150
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 64
EPOCHS = 25

#Data Collection and Preparation

In [None]:
import requests
import re

#Trimming the book content so that unnecessary commentary on the site is excluded from training data
def trim_book_content(book_content, start, end):
    """Trims the beginning and end of book content using markers."""
    start_match = re.search(re.escape(start), book_content)
    end_match = re.search(re.escape(end), book_content)

    print(f"Start match found: {start_match is not None}")  # Check if start marker is found
    print(f"End match found: {end_match is not None}")    # Check if end marker is found

    if start_match and end_match:
        start_index = start_match.end()
        end_index = end_match.start()
        trimmed_content = book_content[start_index:end_index]
        return trimmed_content
    return ""


# Project Gutenberg plain-text URLs for the three books
urls = [
    "https://www.gutenberg.org/files/71865/71865-0.txt",  # Mrs Dalloway, Virginia Woolf
    "https://www.gutenberg.org/files/144/144-0.txt",  # The Voyage Out, Virginia Woolf
    "https://www.gutenberg.org/files/64457/64457-0.txt",  # The Common Reader, Virginia Woolf
]

start = "*** START OF THE PROJECT GUTENBERG EBOOK"
end = "*** END OF THE PROJECT GUTENBERG EBOOK"

all_books = ""

# Download each book, trim it, and append it to all_books
for url in urls:
    response = requests.get(url)
    response.raise_for_status()  # Stop early if a download fails
    book_content = response.text
    trimmed_text = trim_book_content(book_content, start, end)
    all_books += trimmed_text + "\n\n"

# Save combined trimmed text to a single file
with open('all_books_trimmed.txt', 'w', encoding='utf-8') as file:
    file.write(all_books)

Start match found: True
End match found: True
Start match found: True
End match found: True
Start match found: True
End match found: True
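
As a quick check of the trimming logic, the sketch below applies the same marker-based slicing to a small made-up string. The sample text and the helper name `trim_between` are invented for illustration and are not taken from Project Gutenberg. It also shows that everything after the start marker is kept, so the rest of the marker line (title and closing asterisks) stays in the output.

```python
import re


def trim_between(text, start, end):
    """Return the text between the end of `start` and the beginning of `end`."""
    start_match = re.search(re.escape(start), text)
    end_match = re.search(re.escape(end), text)
    if start_match and end_match:
        return text[start_match.end():end_match.start()]
    return ""


# Hypothetical miniature "book" used only for this demonstration.
sample = (
    "License notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Body line one.\n"
    "Body line two.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "More license notes\n"
)

trimmed = trim_between(
    sample,
    "*** START OF THE PROJECT GUTENBERG EBOOK",
    "*** END OF THE PROJECT GUTENBERG EBOOK",
)
print(repr(trimmed))
```

If the leftover title fragment is unwanted, one option would be to drop the first line of the trimmed text before appending it to the corpus.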


In [None]:
with open("all_books_trimmed.txt", "r", encoding="utf-8") as file:
    all_books = file.read()

# Split the text into lines
book_data = all_books.split("\n")

# filtered_data holds the non-empty lines of all three Virginia Woolf books, each prefixed with "Text: "
filtered_data = [
    "Text: " + line
    for line in book_data
    if line.strip()
]

In [None]:
# Display an example line
example = filtered_data[100]
example

'Text: Elizabeth), and she, too, loving it as she did with an absurd and'

In [None]:
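def pad_punctuation(s):
    s = re.sub(f"([{string.punctuation}])", r" \1 ", s)  # Pad ASCII punctuation with spaces
    s = re.sub(" +", " ", s)  # Collapse repeated spaces
    s = s.lower()  # Convert to lowercase for consistency
    # NOTE: the escapes below are UTF-8 byte sequences written inside a str, so they
    # only match mis-decoded text. Correctly decoded curly quotes and dashes pass
    # through unchanged.
    s = s.replace("\xe2\x80\x90", "'")  # intended: apostrophe (E2 80 90 is U+2010, a hyphen)
    s = s.replace("\xe2\x80\x94", "—")  # intended: em dash (U+2014)
    s = s.replace("\xe2\x80\x9d", '"')  # intended: right double quote (U+201D)
    s = s.replace("\xe2\x80\x9c", '"')  # intended: left double quote (U+201C)
    return s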

text_data = [pad_punctuation(s) for s in filtered_data]

In [None]:
print(f"Number of lines of text of filtered data: {len(filtered_data)}")

Number of lines of text of filtered data: 24761


In [None]:
#same example as earlier, now with padded punctuation and lowercase letters
example_data = text_data[100]
example_data

'text : elizabeth ) , and she , too , loving it as she did with an absurd and'
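
The punctuation padding above only covers ASCII characters from `string.punctuation`, and the `"\xe2\x80\x9c"`-style literals denote three separate characters in a Python `str` rather than one curly quote. The following is a minimal sketch of how typographic characters could be normalized before padding, assuming that mapping them to ASCII is acceptable for this corpus. The helper `normalize_and_pad` and the mapping table are hypothetical and are not part of the notebook's pipeline.

```python
import re
import string

# Assumed mapping for illustration: typographic characters to ASCII stand-ins.
UNICODE_TO_ASCII = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2014": "-",  # em dash
}


def normalize_and_pad(s):
    """Map typographic characters to ASCII, pad punctuation, and lowercase."""
    for src, dst in UNICODE_TO_ASCII.items():
        s = s.replace(src, dst)
    s = re.sub(f"([{re.escape(string.punctuation)}])", r" \1 ", s)
    s = re.sub(" +", " ", s)
    return s.lower().strip()


print(normalize_and_pad("\u201cRead\u2014Sophocles,\u201d she said."))

# The byte-style literal is three code points, not one em dash.
print(len("\xe2\x80\x94"), len("\u2014"))
```

Applying such a normalization would change the vocabulary, so the model would need to be retrained on the normalized text.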

In [None]:
# Convert to a TensorFlow Dataset
text_ds = (
    tf.data.Dataset.from_tensor_slices(text_data)
    .batch(BATCH_SIZE)
    .shuffle(1000)
)

In [None]:
# Show one batch of lines
for example in text_ds.take(1):
    print(example)

tf.Tensor(
[b'text : cliff . we know no more of them than that . we have their poetry , and that'
 b'text : is all . '
 b'text : but that is not , and perhaps never can be , wholly true . pick up any play'
 b'text : by sophocles , read\xe2\x80\x94'
 b'text : son of him who led our hosts at troy of old , son of'
 b'text : agamemnon , '
 b'text : and at once the mind begins to fashion itself surroundings . it makes'
 b'text : some background , even of the most provisional sort , for sophocles ; it'
 b'text : imagines some village , in a remote part of the country , near the sea . '
 b'text : even nowadays such villages are to be found in the wilder parts of'
 b'text : england , and as we enter them we can scarcely help feeling that here , in'
 b'text : this cluster of cottages , cut off from rail or city , are all the'
 b'text : elements of a perfect existence . here is the rectory ; here the manor'
 b'text : house , the farm and the cottages ; the church for worship , the club for'
 b't

In [None]:
# Create a vectorization layer
vectorize_layer = layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,
)

In [None]:
# Adapt the layer to the training set
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()
print("Vocabulary size:", len(vocab))

Vocabulary size: 18612


In [None]:
# Display some token:word mappings
for i, word in enumerate(vocab[:10]):
    print(f"{i}: {word}")

0: 
1: [UNK]
2: :
3: text
4: ,
5: the
6: .
7: and
8: of
9: to
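
To make the token-to-index mapping above more concrete, here is a rough pure-Python approximation of what the vectorization step does: it ranks words by frequency, reserves index 0 for padding and index 1 for `[UNK]`, and pads or truncates each line to a fixed length. The helpers `build_vocab` and `vectorize` and the two demo lines are assumptions made for this sketch. They are not the Keras implementation, and tie-breaking between equally frequent words may differ.

```python
from collections import Counter


def build_vocab(lines, max_tokens):
    """Frequency-ranked vocabulary with reserved padding and unknown slots."""
    counts = Counter(word for line in lines for word in line.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(line, vocab, seq_len):
    """Map words to indices, then truncate or zero-pad to seq_len."""
    word_to_index = {word: index for index, word in enumerate(vocab)}
    ids = [word_to_index.get(word, 1) for word in line.split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))


# Made-up lines in the same "text : ..." shape as the padded data.
demo_lines = ["text : the sea , the sky", "text : the cliff"]
demo_vocab = build_vocab(demo_lines, max_tokens=10)
demo_ids = vectorize("text : the moon", demo_vocab, seq_len=6)

print(demo_vocab)
print(demo_ids)
```

The word "moon" never appears in the demo lines, so it maps to index 1, and the trailing positions are filled with the padding index 0.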


In [None]:
# Create the training set of book content and the same text shifted by one word
def prepare_inputs(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    return x, tokenized_sentences[:, 1:]

train_ds = text_ds.map(prepare_inputs)
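
The shift-by-one construction in `prepare_inputs` can be illustrated on a single line with plain Python lists. The token ids below are invented, and `make_input_target` is a hypothetical stand-in for the tensor slicing, so this is a sketch of the idea rather than the code used for training.

```python
def make_input_target(token_ids):
    """Split one tokenized line into (input, target) shifted by one position."""
    return token_ids[:-1], token_ids[1:]


# Hypothetical token ids for a short line padded with 0 to six positions.
tokens = [3, 2, 5, 40, 6, 0]
x, y = make_input_target(tokens)

for position, (inp, target) in enumerate(zip(x, y)):
    print(f"position {position}: input {inp} -> target {target}")
```

At every position the target is simply the next token of the same line. Because each line is padded with 0 up to `MAX_LEN + 1` tokens, the targets for a short line consist mostly of padding.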

# Single-Layer LSTM

In [None]:
#Creating a single-layer LSTM model with dropout = .2
inputs = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x = layers.LSTM(128, return_sequences=True, dropout=0.2)(x)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
lstm_1 = models.Model(inputs, outputs)
lstm_1.summary()

#Training Single-Layer LSTM

In [None]:
loss_fn = losses.SparseCategoricalCrossentropy()
lstm_1.compile("adam", loss_fn)

In [None]:
# Create a TextGenerator callback
class TextGenerator(callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):
        self.index_to_word = index_to_word
        self.word_to_index = {
            word: index
            for index, word in enumerate(index_to_word)
        }

    def sample_from(self, probs, temperature):
        if isinstance(probs, (float, np.float64)):  # Check if probs is a single value
            probs = np.array([probs, 1 - probs])  # Create a 2-element distribution
        probs = probs ** (1 / temperature)  # Temperature < 1 sharpens the distribution
        probs = probs / np.sum(probs)  # Renormalize so the probabilities sum to 1
        return np.random.choice(len(probs), p=probs), probs

    def generate(self, start_prompt, max_tokens, temperature):
        # Map the prompt to token ids once; unknown words map to 1 ([UNK])
        start_tokens = [
            self.word_to_index.get(x, 1) for x in start_prompt.split()
        ]
        sample_token = None
        info = []
        # Stop at max_tokens or when the padding token (0) is sampled
        while len(start_tokens) < max_tokens and sample_token != 0:
            y = self.model.predict(np.array([start_tokens]))
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            if 0 <= sample_token < len(self.index_to_word):  # Check if sample_token is within range
                # Append each sampled word exactly once. Appending it a second
                # time after this branch doubles every word, as in the outputs
                # recorded below.
                start_prompt = start_prompt + " " + self.index_to_word[sample_token]
                info.append({"prompt": start_prompt, "word_probs": probs})
                start_tokens.append(sample_token)
            else:
                # Handle case where sample_token is out of range
                print(f"Warning: sample_token out of range: {sample_token}")
                break
        print(f"\ngenerated text:\n{start_prompt}\n")
        return info

    def on_epoch_end(self, epoch, logs=None):
        try:
            prompts = ('the meaning of life', 'it is an awful')
            prompt = np.random.choice(prompts)
            self.generate(prompt, max_tokens=100, temperature=.5)
        except Exception as e:
            print(f"Error during text generation: {e}")
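
The effect of the `temperature` argument in `sample_from` can be seen on a tiny made-up distribution. The sketch below uses only the standard library. `apply_temperature` and `sample_index` are hypothetical helpers that mirror the same rescale-and-renormalize step, and the three probabilities are invented for illustration.

```python
import random


def apply_temperature(probs, temperature):
    """Raise each probability to 1/temperature and renormalize."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_index(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


base = [0.5, 0.3, 0.2]  # made-up next-word distribution
for t in (0.2, 0.5, 1.0):
    print(t, [round(p, 3) for p in apply_temperature(base, t)])

rng = random.Random(42)  # fixed seed so repeated runs draw the same index
print(sample_index(apply_temperature(base, 0.5), rng))
```

A temperature of 1.0 leaves the distribution unchanged, while lower values concentrate the mass on the most likely word, which makes sampling closer to always picking the top candidate.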

In [None]:
# Instantiate the text-generation callback with the vocabulary

text_generator = TextGenerator(vocab)

In [None]:
lstm_1.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[text_generator],
)

Epoch 1/25

generated text:
it is an awful intellectuals intellectuals jingle jingle

387/387 ━━━━━━━━━━━━━━━━━━━━ 58s 150ms/step - loss: 0.5625
Epoch 2/25

generated text:
the meaning of life vinraces vinraces  

387/387 ━━━━━━━━━━━━━━━━━━━━ 58s 150ms/step - loss: 0.5470
Epoch 3/25

generated text:
it is an awful

387/387 ━━━━━━━━━━━━━━━━━━━━ 58s 151ms/step - loss: 0.5354
Epoch 4/25

<keras.src.callbacks.history.History at 0x7f0b896717e0>

#Text Generation
##with Single Layer LSTM

In [None]:
def print_probs(info, vocab, top_k=5):
    for entry in info:
        print(f"\nPROMPT: {entry['prompt']}")
        word_probs = entry["word_probs"]
        # Indices of the top_k most probable tokens, highest first
        top_indices = np.argsort(word_probs)[::-1][:top_k]
        for idx in top_indices:
            if 0 <= idx < len(vocab):
                print(f"{vocab[idx]}:   \t{np.round(100*word_probs[idx],2)}%")
            else:
                print(f"Index {idx} out of range for vocabulary (size: {len(vocab)})")  # Print error message
        print("--------\n")

###Prompt 1, Various Temperatures

In [None]:
info = text_generator.generate(
    start_prompt="the meaning of life is", max_tokens=10, temperature=.2
)
print_probs(info, vocab)

generated text:
the meaning of life is tretys tretys , , and and


PROMPT: the meaning of life is tretys
laced:   	0.48%
famously:   	0.46%
cakes:   	0.46%
11:   	0.43%
voyage—china:   	0.42%
--------


PROMPT: the meaning of life is tretys tretys
laced:   	0.48%
famously:   	0.46%
cakes:   	0.46%
11:   	0.43%
voyage—china:   	0.42%
--------


PROMPT: the meaning of life is tretys tretys ,
,:   	85.91%
;:   	11.17%
:   	1.95%
.:   	0.69%
?:   	0.19%
--------


PROMPT: the meaning of life is tretys tretys , ,
,:   	85.91%
;:   	11.17%
:   	1.95%
.:   	0.69%
?:   	0.19%
--------


PROMPT: the meaning of life is tretys tretys , , and
and:   	69.83%
which:   	21.35%
”:   	3.27%
or:   	1.68%
but:   	1.25%
--------


PROMPT: the meaning of life is tretys tretys , , and and
and:

In [None]:
info = text_generator.generate(
    start_prompt="the meaning of life is", max_tokens=10, temperature=0.5
)
print_probs(info, vocab)

generated text:
the meaning of life is true true rated rated


PROMPT: the meaning of life is true
_:   	2.75%
a:   	0.74%
the:   	0.52%
that:   	0.44%
like:   	0.4%
--------


PROMPT: the meaning of life is true true
_:   	2.75%
a:   	0.74%
the:   	0.52%
that:   	0.44%
like:   	0.4%
--------


PROMPT: the meaning of life is true true rated
footnote:   	0.66%
8:   	0.35%
1:   	0.32%
7:   	0.31%
6:   	0.3%
--------


PROMPT: the meaning of life is true true rated rated
footnote:   	0.66%
8:   	0.35%
1:   	0.32%
7:   	0.31%
6:   	0.3%
--------



In [None]:
info = text_generator.generate(
    start_prompt="the meaning of life is", max_tokens=30, temperature=0.9)
print_probs(info, vocab)

generated text:
the meaning of life is “leave “leave capable capable surrendered surrendered profitably profitably voltaire voltaire straighten straighten speedily speedily  


PROMPT: the meaning of life is “leave
_:   	0.06%
captured:   	0.05%
published:   	0.04%
calming:   	0.04%
4:   	0.03%
--------


PROMPT: the meaning of life is “leave “leave
_:   	0.06%
captured:   	0.05%
published:   	0.04%
calming:   	0.04%
4:

###Prompt 2, Various Temperatures

In [None]:
info = text_generator.generate(
    "it was an awful", max_tokens=15, temperature=.6
)
print_probs(info, vocab)

generated text:
it was an awful cowardly cowardly privé” privé” voluminous voluminous ὲν ὲν


PROMPT: it was an awful cowardly
organ:   	0.2%
submerged:   	0.15%
suffrage:   	0.13%
opulent:   	0.13%
representative:   	0.12%
--------


PROMPT: it was an awful cowardly cowardly
organ:   	0.2%
submerged:   	0.15%
suffrage:   	0.13%
opulent:   	0.13%
representative:   	0.12%
--------


PROMPT: it was an awful cowardly cowardly privé”
morrow:   	0.12%
opulent:   	0.07%
allowances:   	0.06%
grossly:   	0.06%
much—everything—in:   	0.06%
--------


PROMPT: it was an awful cowardly cowardly privé” privé”
morrow:   	0.12%
opulent:   	0.07%


In [None]:
info = text_generator.generate(
    "it was an awful", max_tokens=50, temperature=0.3
)
print_probs(info, vocab)

In [None]:
info = text_generator.generate(
    "it was an awful", max_tokens=15, temperature=0.1
)
print_probs(info, vocab)

generated text:
it was an awful opulent opulent tatler tatler footnote footnote register register austen’s austen’s footnote footnote


PROMPT: it was an awful opulent
opulent:   	99.17%
elaborate:   	0.2%
iceblock:   	0.11%
submerged:   	0.1%
communs:   	0.06%
--------


PROMPT: it was an awful opulent opulent
opulent:   	99.17%
elaborate:   	0.2%
iceblock:   	0.11%
submerged:   	0.1%
communs:   	0.06%
--------


PROMPT: it was an awful opulent opulent tatler
tatler:   	63.69%
euphrosyne:   	14.78%
odyssey:   	6.82%
religio:   	6.82%
inferno:   	3.21%
--------

#Evaluation of Text Generation with Single LSTM

The single-layer LSTM performed poorly at generating coherent text resembling the style and tone of Virginia Woolf. Its training time was relatively short (about 58 seconds per epoch) because of the simple architecture, but I was unable to make a single-layer model produce coherent text. Nearly every generated word appears twice in a row. I initially suspected overfitting, but each doubled pair is printed with an identical probability table, which points to the generation loop having appended each sampled word to the prompt twice rather than to the model itself. This would also explain why varying the dropout rate, temperature, batch size, layer size, and prompts did not resolve it. Some epochs print "sample_token out of range" because `VOCAB_SIZE` is 20000 while the vocabulary contains only 18612 distinct tokens. The output layer therefore has units with no corresponding word, and the model occasionally samples one of them even though its probability is very small.
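
One possible way to avoid the out-of-range samples described above is to discard the probability mass on output units that have no vocabulary entry before sampling. The sketch below shows that idea on a made-up distribution. `restrict_to_vocab` is a hypothetical helper, not something used in this notebook, and sizing the output layer to the actual vocabulary length would be an alternative.

```python
def restrict_to_vocab(probs, vocab_len):
    """Keep only the first `vocab_len` probabilities and renormalize them."""
    kept = list(probs[:vocab_len])
    total = sum(kept)
    if total == 0:
        raise ValueError("no probability mass inside the vocabulary")
    return [p / total for p in kept]


# Made-up example: 6 output units but only 4 real vocabulary entries.
model_probs = [0.10, 0.40, 0.20, 0.20, 0.06, 0.04]
safe_probs = restrict_to_vocab(model_probs, vocab_len=4)
print(safe_probs)
```

After this step every index that can be sampled corresponds to a real word, so the range check in `generate` would never trigger.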

#Multi-Layer LSTM


In [None]:
VOCAB_SIZE = 20000
MAX_LEN = 200  # NOTE: vectorize_layer and train_ds above were built with MAX_LEN = 150 and are not rebuilt here
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 64
EPOCHS = 25

In [None]:
inputs = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x = layers.LSTM(256, return_sequences=True, dropout=0.3, recurrent_dropout=0.2)(x)
x = layers.LSTM(128, return_sequences=True, dropout=0.3, recurrent_dropout=0.2)(x)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
lstm_2 = models.Model(inputs, outputs)
lstm_2.summary()

#Training Multi-Layer LSTM

In [None]:
loss_fn = losses.SparseCategoricalCrossentropy()
lstm_2.compile("adam", loss_fn)

In [None]:
# Create a TextGenerator callback
class TextGenerator(callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):
        self.index_to_word = index_to_word
        self.word_to_index = {
            word: index
            for index, word in enumerate(index_to_word)
        }

    def sample_from(self, probs, temperature):
        if isinstance(probs, (float, np.float64)):  # Check if probs is a single value
            probs = np.array([probs, 1 - probs])  # Create a 2-element distribution
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs


    def generate(self, start_prompt, max_tokens, temperature):
        start_tokens = [
            self.word_to_index.get(x, 1) for x in start_prompt.split()
        ]
        sample_token = None
        info = []
        while len(start_tokens) < max_tokens and sample_token != 0:
            y = self.model.predict(np.array([start_tokens]))
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            if 0 <= sample_token < len(self.index_to_word):  # Check if sample_token is within range
              start_prompt = start_prompt + " " + self.index_to_word[sample_token]
              info.append({"prompt": start_prompt, "word_probs": probs})
              start_tokens.append(sample_token)
            else:
              # Handle case where sample_token is out of range
              print(f"Warning: sample_token out of range: {sample_token}")
              break

        print(f"\ngenerated text:\n{start_prompt}\n")
        return info

    def on_epoch_end(self, epoch, logs=None):
      try:
        prompts = ('the meaning of life is', 'it is an awful')
        prompt = np.random.choice(prompts)
        self.generate(prompt, max_tokens=100, temperature=1.0)
      except Exception as e:
        print(f"Error during text generation: {e}")

In [None]:
# Instantiate the text-generation callback with the vocabulary
text_generator = TextGenerator(vocab)

In [None]:
lstm_2.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[text_generator],
)

Epoch 1/25

generated text:
it is an awful months 

387/387 ━━━━━━━━━━━━━━━━━━━━ 223s 558ms/step - loss: 2.6015
Epoch 2/25

<keras.src.callbacks.history.History at 0x7ae9260fc490>

#Text Generation
##with Multi-Layer LSTM

In [None]:
def print_probs(info, vocab, top_k=5):
    for entry in info:
        print(f"\nPROMPT: {entry['prompt']}")
        word_probs = entry["word_probs"]
        # Indices of the top_k most probable tokens, highest first
        top_indices = np.argsort(word_probs)[::-1][:top_k]
        for idx in top_indices:
            if 0 <= idx < len(vocab):
                print(f"{vocab[idx]}:   \t{np.round(100*word_probs[idx],2)}%")
            else:
                print(f"Index {idx} out of range for vocabulary (size: {len(vocab)})")  # Print error message
        print("--------\n")

###Prompt 1 with Various Temperatures

In [None]:
info = text_generator.generate(
    start_prompt="it was a dark and stormy night when", max_tokens=20, temperature=1.0
)
print_probs(info, vocab)

generated text:
it was a dark and stormy night when cambridge loses cried truthful masters confounds passage browne author


PROMPT: it was a dark and stormy night when cambridge
censure:   	0.01%
girls—nothing:   	0.01%
morrow:   	0.01%
legions:   	0.01%
despite:   	0.01%
-----

In [None]:
info = text_generator.generate(
    start_prompt="it was a dark and stormy night when", max_tokens=20, temperature=0.8
)
print_probs(info, vocab)

generated text:
it was a dark and stormy night when hid partner’s child’s traveller soul—calls robin


PROMPT: it was a dark and stormy night when hid
censure:   	0.01%
girls—nothing:   	0.01%
morrow:   	0.01%
legions:   	0.01%
despite:   	0.01%
--------


PROMPT: it was a dark and stormy night when hid partner’s
morrow:   	0.01%
gissing:   	0.01%
girls—nothing:   	0.01%
censure:   	0.01%
pine:   	0.01%
--------


PROMPT: it was a dark and stormy night when hid partner’s child’s
swine:   	0

In [None]:
info = text_generator.generate(
    start_prompt="it was a dark and stormy night when", max_tokens=20, temperature=.5
)
print_probs(info, vocab)

generated text:
it was a dark and stormy night when affection—all disapproval stucco literalness queerness looking herb nasal guns ridicule


PROMPT: it was a dark and stormy night when affection—all
censure

In [None]:
info = text_generator.generate(
    start_prompt="it was a dark and stormy night when", max_tokens=20, temperature=.2
)
print_probs(info, vocab)

generated text:
it was a dark and stormy night when noises qualifications pay gibbering resemblances unenviable


PROMPT: it was a dark and stormy night when noises
censure:   	0.03%
girls—nothing:   	0.03%
morrow:   	0.03%
legions:   	0.03%
despite:   	0.03%
--------


PROMPT: it was a dark and stormy night when noises qualifications
distressed:   	0.02%
“papa:   	0.02%
duration:   	0.02%
barber’s:   	0.02%
microscopic:   	0.02%
--------


PROMPT: it was a dark and stormy night when noises

###Prompt 2 with Various Temperatures

In [None]:
info = text_generator.generate(
    start_prompt="men must not", max_tokens=10, temperature=1.0
)
print_probs(info, vocab)

generated text:
men must not titles treacle again—the inquest bird’s ode arrange


PROMPT: men must not titles
charmin’:   	0.11%
drenching:   	0.11%
assistant:   	0.09%
epitome:   	0.09%
boétie:   	0.09%
--------


PROMPT: men must not titles treacle
charmin’:   	0.1%
assistant:   	0.1%
boétie:   	0.09%
epitome:   	0.09%
drenching:   	0.09%
--------


PROMPT: men must not titles treacle again—the
hood:   	0.03%
uneven:   	0.03%
recent:   	0.02%
playthings:   	0.02%
formlessness:   	0.02%
-

In [None]:
info = text_generator.generate(
    start_prompt="men must not", max_tokens=10, temperature=.75
)
print_probs(info, vocab)

generated text:
men must not back part traces reformer’s something’s bats—“odious waft


PROMPT: men must not back
charmin’:   	0.27%
drenching:   	0.26%
assistant:   	0.2%
epitome:   	0.19%
boétie:   	0.18%
--------


PROMPT: men must not back part
drenching:   	0.11%
formlessness:   	0.11%
charmin’:   	0.11%
epitome:   	0.1%
grossest:   	0.09%
--------


PROMPT: men must not back part traces
footnote:   	0.02%
minuteness:   	0.02%
formlessness:   	0.01%
“general:   	0.01%
recent:   	0.01%

In [None]:
info = text_generator.generate(
    start_prompt="men must not", max_tokens=15, temperature=.5
)
print_probs(info, vocab)

generated text:
men must not great peeling better—this superstitious “burke crinkling rome—himself bundling filmy stifle proceeding bat

In [None]:
info = text_generator.generate(
    start_prompt="men must not", max_tokens=10, temperature=.15
)
print_probs(info, vocab)

generated text:
men must not charm recent massacres friends—though footnote footnote entrails


PROMPT: men must not charm
charmin’:   	30.56%
drenching:   	24.35%
assistant:   	6.41%
epitome:   	5.19%
boétie:   	4.74%
--------


PROMPT: men must not charm recent
formlessness:   	10.65%
charmin’:   	10.21%
drenching:   	9.71%
epitome:   	7.49%
grossest:   	4.81%
--------


PROMPT: men must not charm recent massacres
footnote:   	13.61%
8:   	3.92%
13:   	1.76%
insatiable:   	1.22%
ally:   	

###Prompt 3, Various Temperatures

In [None]:
info = text_generator.generate(
    start_prompt="it was an awful", max_tokens=10, temperature=.8
)
print_probs(info, vocab)

generated text:
it was an awful tulliver barbarian


PROMPT: it was an awful tulliver
walsh’s:   	0.02%
“cabbage:   	0.01%
coagulate:   	0.01%
mayors—what:   	0.01%
speedily:   	0.01%
--------


PROMPT: it was an awful tulliver barbarian
encountered:   	0.01%
“cabbage:   	0.01%
peeled:   	0.01%
stooped:   	0.01%
footnote:   	0.01%
--------



In [None]:
info = text_generator.generate(
    start_prompt="it was an awful", max_tokens=15, temperature=.4
)
print_probs(info, vocab)

generated text:
it was an awful satisfactorily damaged men—but carters flow tough lust sadly kitten doted crossways


PROMPT: it was an awful satisfactorily
walsh’s:   	0.04%
“cabbage:   	0.04%
coagulate:   

#Evaluation of Text Generation with Multi-Layer LSTM

The Multi-Layer LSTM is more sophisticated than the Single-Layer LSTM, though there are still many flaws in syntax and semantics.

I attempted many variations of the multi-layer LSTM. The biggest obstacle I faced was word repetition in the generated text. During training, I got sentences in which every word appeared twice, for example: "the meaning of life is prove prove very very , , footnote footnote". I tried increasing the dropout rate of the LSTM layers and adding recurrent dropout, in case the model was overfitting to repetition present in the books. I experimented with layers of varying sizes and different types of prompts, and I also increased the temperature for more varied text. The repetition disappeared only in the runs where I raised the temperature to 1.0 in the training callback. Those runs also use the `generate` method defined in this section, which appends each sampled word to the prompt once, and the low-temperature samples above (for example 0.2 and 0.15) show no systematic doubling either.
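
To compare repetition across runs with a number rather than by eye, one could measure how often a word is immediately followed by itself. The helper `adjacent_repeat_rate` below is a hypothetical sketch of such a measure. The two strings are copied from generated outputs in this notebook.

```python
def adjacent_repeat_rate(text):
    """Fraction of adjacent word pairs in which both words are identical."""
    words = text.split()
    if len(words) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(words, words[1:]) if a == b)
    return repeats / (len(words) - 1)


doubled = "the meaning of life is prove prove very very , , footnote footnote"
varied = "it was an awful tulliver barbarian"

print(round(adjacent_repeat_rate(doubled), 3))
print(adjacent_repeat_rate(varied))
```

A rate that stays high regardless of dropout, layer size, and temperature would suggest the repetition comes from the generation code rather than from the trained weights.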

In this model, the generated text uses punctuation far more accurately than the single-layer LSTM did, though still with mistakes, and it does not repeat words excessively. The model seems to have picked up some associations between words. For example, for the "dark and stormy night" prompt, it generally produced words that were more negative, scary, gloomy, or gory. It still struggled to generate sentences with a coherent meaning: portions of the generated sentences were semantically consistent, but the lines as a whole were generally not comprehensible.

In further experiments, I would like to train an LSTM with more layers and add another book to the dataset. This may require much more training time than I can get before my session is disconnected.