# Language Generation


## 1. Introduction

In this lab we will look at how to generate stories using Long Short-Term Memory (LSTM) networks. The results will not be excellent, as we will be training for just a few tens of cycles (good results need thousands of cycles of training).
Nonetheless, it gives us a good feel for how to use an LSTM to do prediction.

## 2. Submission Deadline

Fill in your answers in Lab1Ans.docx and upload the completed file to Canvas by 2359 hours on 4 July 2025. Be sure to fill in the names of all your team members.


## 3. Story Generation with LSTM

We start first by looking at how to generate stories using an LSTM.  To do so we need to do two important things with the text:

1. We need to tokenize the text, converting the words into integers.
2. We need to use an embedding layer before feeding the words to the LSTM.
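
As a rough illustration of the first step, the sketch below maps words to integers with a toy word-level vocabulary. This is a deliberately simplified stand-in: the helper names `build_vocab` and `encode` are made up for this example, and the tokenizer used later in this lab works differently (it comes pretrained and can split a word into several pieces).

```python
# Toy word-level tokenizer: a simplified stand-in, NOT the tokenizer used in this lab.
def build_vocab(text):
    """Assign each distinct word an integer id in order of first appearance."""
    vocab = {}
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert a string into a list of integer ids using the vocabulary."""
    return [vocab[word] for word in text.lower().split()]


vocab = build_vocab("the game is afoot and the game is on")
ids = encode("the game is on", vocab)
print(vocab)
print(ids)
```

The integers are only labels: the id assigned to a word depends on the order in which words were first seen, not on what the word means.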

You may do this lab on Google Colab if you wish.

We begin by first installing our dependencies. Note that xformers must be installed last or there will be dependency breakages.

In [1]:
! pip install --no-cache-dir torch



In [2]:
! pip install --no-cache-dir tensorflow transformers datasets numpy scikit-learn



In [3]:
! pip install --no-cache-dir xformers

Collecting xformers
  Downloading xformers-0.0.31.tar.gz (12.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting torch>=2.7 (from xformers)
  Downloading torch-2.7.1-cp312-none-macosx_11_0_arm64.whl.metadata (29 kB)
Collecting sympy>=1.13.3 (from torch>=2.7->xformers)
  Downloading sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Downloading torch-2.7.1-cp312-none-macosx_11_0_arm64.whl (68.6 MB)
Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)



### <u>Question 1</u>

Using Google or otherwise, explain i) what an embedding layer does and ii) why we cannot just feed the integers from the tokenizer direct to the LSTM.

<b>Fill your answers in the answer book</b>

Let's begin by creating the dataset. Unzipping the file containing this lab created a text corpus in the sherlock directory, containing several Sherlock Holmes novels split into a training directory and a testing directory.


### 3.1 Loading the Dataset

Keras has its own dataset manipulation libraries, but the one provided by Hugging Face is much more powerful and we will use it. We do the following:

1. Gather all the files in the training and testing directories.
2. Use load_dataset to load up all the texts.
3. Remove all sentences that are too short.
4. Create a special function to convert all the text to lowercase.
5. Tokenize the dataset, converting all the words to integers.
6. Combine the tokens into a single long vector.



In [4]:
# load_dataset from Hugging Face
from datasets import load_dataset

# Search for files in a directory matching a pattern.
import glob

# Gather all the files together
traindir = "sherlock/Train"
testdir = "sherlock/Test"

# Get all the training and testing filenames
train_files = [file for file in glob.glob(traindir + "/*.txt")]
test_files = [file for file in glob.glob(testdir + "/*.txt")]

# load_dataset needs a dictionary to tell it where the training and test files are
data_files = {"train": train_files, "test":test_files}

# Now load the dataset. We must also tell load_dataset that 
# these are text files
dataset = load_dataset("text", data_files = data_files)

# Print out the dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 19488
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1768
    })
})

After running we can see that our training dataset consists of 19488 rows of text. We can see the first 10 lines  by doing:

In [5]:
dataset['train'][:10]

{'text': ['\ufeff',
  '',
  '',
  '',
  '“I am inclined to think--” said I.',
  '',
  '“I should do so,” Sherlock Holmes remarked impatiently.',
  '',
  "I believe that I am one of the most long-suffering of mortals; but I'll"]}

Notice that many lines are blank or contain very few characters. Since lines of fewer than 5 characters are unlikely to be meaningful, we will get rid of them. We will also apply a transform to convert all characters to lowercase.

In [6]:
min_len = 5 # Minimum number of characters in a line

# Remove lines with fewer than five characters
dataset = dataset.filter(lambda example: len(example["text"]) >=min_len)

# This function is called by the dataset map method to convert
# all the text to lowercase
def tolower(example):
    return {"text":example["text"].lower()}

# Convert all text using map
dataset = dataset.map(tolower)

# Now let's see what our dataset looks like
dataset['train'][:10]

  '“i am inclined to think--” said i.',
  '“i should do so,” sherlock holmes remarked impatiently.',
  "i believe that i am one of the most long-suffering of mortals; but i'll",
  'admit that i was annoyed at the sardonic interruption. “really, holmes,”',
  ' said i severely, “you are a little trying at times.”',
  'he was too much absorbed with his own thoughts to give any immediate',
  'answer to my remonstrance. he leaned upon his hand, with his untasted',
  'breakfast before him, and he stared at the slip of paper which he had',
  'just drawn from its envelope. then he took the envelope itself, held it']}

As we can see, the data is much neater now. Our next step is to use a tokenizer to convert the sentences into integer vectors. Instead of the standard Keras tokenizer, we will use the one from Hugging Face, which is much more powerful and convenient to use, particularly when we start using transformers in the next lab.

The version we are using is the pretrained OpenAI GPT-2 tokenizer. For LSTMs we do not need to pad or truncate lines to fixed lengths.

In [7]:
from transformers import AutoTokenizer

# Import the OpenAI GPT2 tokenizer
model_name = "gpt2-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Specify the padding token to be the end-of-sentence token
#tokenizer.pad_token = tokenizer.eos_token

print("Vocabulary size: ", len(tokenizer))


Vocabulary size:  50257


In [8]:
# Now tokenize a statement
test_stat = "Elementary, my dear Watson"
tokens = tokenizer(test_stat, padding=False, truncation=False, return_length=True)
tokens


{'input_ids': [20180, 560, 11, 616, 13674, 14959], 'attention_mask': [1, 1, 1, 1, 1, 1], 'length': [6]}

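As we can see from the output above, the tokenizer turns our sentence into a series of integers. Note that a single word can become more than one token: the four words of our test sentence (plus the comma) produced six integers. Now let's tokenize the entire corpus, once again using the map function.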


In [9]:
# We don't bother returning the lengths
def tokenize(example):
    retlist = []
    output = tokenizer(example["text"], padding=False, truncation=False,
                      return_overflowing_tokens=True)   
    
    for token in output["input_ids"]:
        retlist.append(token)

    return {"input_ids":retlist}

# Remove the existing columns so that we are left only with an input_ids column
token_dataset = dataset.map(tokenize, batched=True, 
                            remove_columns=dataset['train'].column_names)

token_dataset['train'][:10]

{'input_ids': [[43582, 352, 438, 1169, 6509],
  [447, 250, 72, 716, 19514, 284, 892, 438, 447, 251, 531, 1312, 13],
  [447,
   250,
   72,
   815,
   466,
   523,
   11,
   447,
   251,
   15059,
   5354,
   6039,
   6880,
   24998,
   33440,
   306,
   13],
  [72,
   1975,
   326,
   1312,
   716,
   530,
   286,
   262,
   749,
   890,
   12,
   37333,
   1586,
   286,
   49008,
   26,
   475,
   1312,
   1183],
  [324,
   2781,
   326,
   1312,
   373,
   25602,
   379,
   262,
   264,
   446,
   9229,
   41728,
   13,
   564,
   250,
   27485,
   11,
   6039,
   6880,
   11,
   447,
   251],
  [531,
   1312,
   15052,
   11,
   564,
   250,
   5832,
   389,
   257,
   1310,
   2111,
   379,
   1661,
   13,
   447,
   251],
  [258, 373, 1165, 881, 19233, 351, 465, 898, 6066, 284, 1577, 597, 7103],
  [41484,
   284,
   616,
   816,
   261,
   2536,
   590,
   13,
   339,
   23831,
   2402,
   465,
   1021,
   11,
   351,
   465,
   1418,
   8992],
  [9032,
   7217,
   878,
   683,
  

If we look at what has happened, we see that the entire dataset has been turned into tokens: integers that represent words or pieces of words. Since we specified that we should not pad or truncate the lines, every line has a different length. This is OK for LSTMs.

In [10]:
for toks in token_dataset['train'][:10]['input_ids']:
    print(len(toks))

5
13
17
19
22
16
13
18
16
15


### 3.2 Handling Large Datasets

Now we have our dataset nicely tokenized. For transformers, this is enough. Unfortunately for LSTMs, we need to generate sequences and teach the LSTM how to predict the next word based on the past few words.

We begin by compiling all the tokens from every line into one giant array. We then use the Keras TimeseriesGenerator class to chop the array into slices of 5 tokens, from which the LSTM learns to predict the 6th.
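
The slicing described above can be sketched in plain Python. The helper `make_windows` and the token ids are made up for illustration; the lab itself relies on TimeseriesGenerator, which, as far as we understand it, pairs each window of `length` items with the item that follows in the same way.

```python
def make_windows(tokens, lookback):
    """Return (inputs, targets): each input is `lookback` consecutive tokens,
    and its target is the token that immediately follows them."""
    inputs, targets = [], []
    for start in range(len(tokens) - lookback):
        inputs.append(tokens[start:start + lookback])
        targets.append(tokens[start + lookback])
    return inputs, targets


toy_tokens = [10, 11, 12, 13, 14, 15, 16, 17]  # made-up token ids
X, y = make_windows(toy_tokens, lookback=5)
for window, target in zip(X, y):
    print(window, "->", target)
```

A sequence of N tokens yields N - lookback training samples, because the first `lookback` tokens have no complete window in front of them.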

### <u>Question 2</u>

Explain why we don't need to chop up our tokens into groups of 5 tokens to predict the 6th for transformers, but must do so for LSTMs.

<b>Fill your answers in the answer book</b>


In [11]:
alltokens = []
for sentences in token_dataset["train"]:
    alltokens.extend(sentences['input_ids'])
    
print("Total number of tokens: ", len(alltokens))


Total number of tokens:  230511


As you can see, we have quite a lot of tokens. One important point is that we are unlikely to be able to fit all our training sequences into memory, so we will instead create a generator. Fortunately, Keras provides the TimeseriesGenerator class, which chops our tokens into fixed-size windows and pairs each window with the next token to be predicted.

We do this for both our training and testing data.

Note, however, that we need to convert each next token to a one-hot vector. We also truncate our token vectors so that their length is divisible by the batch size.

In [12]:
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
from tensorflow.keras.utils import to_categorical
import numpy as np

vocab_size = len(tokenizer)
batch_size = 32
lookback = 5

# Ensure that the number of tokens is divisible by batch_size
old_len = len(alltokens)
new_len = (old_len // batch_size) * batch_size

print("Old length: ", old_len, "New length: ", new_len)

alltokens = alltokens[:new_len]
outputs = to_categorical(alltokens, vocab_size)
seqgen = TimeseriesGenerator(alltokens, outputs, length=lookback, batch_size=batch_size)

# We need to do the same for the testing data
alltokens_test = []

for sentences in token_dataset["test"]:
    alltokens_test.extend(sentences["input_ids"])

print("Total number of testing tokens: ", len(alltokens_test))
old_len = len(alltokens_test)
new_len = (old_len // batch_size) * batch_size
print("Old length: ", old_len, " New length: ", new_len)
alltokens_test = alltokens_test[:new_len]
outputs_test = to_categorical(alltokens_test, vocab_size)
seqgen_test = TimeseriesGenerator(alltokens_test, outputs_test, length=lookback,
                                 batch_size=batch_size)

Old length:  230511 New length:  230496
Total number of testing tokens:  21311
Old length:  21311  New length:  21280
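
The cell above one-hot encodes every token before the generator is created. A possible alternative, sketched below in plain Python, is to one-hot encode only the targets of the batch currently being produced. The names `one_hot` and `batch_generator` are hypothetical helpers written for this illustration (they are not part of Keras), and the final lines are a rough size estimate that assumes 4-byte floats.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    row = [0] * size
    row[index] = 1
    return row


def batch_generator(tokens, lookback, batch_size, vocab_size):
    """Yield (inputs, one_hot_targets) one batch at a time.
    Hypothetical helper for illustration only."""
    n_samples = len(tokens) - lookback
    for first in range(0, n_samples, batch_size):
        last = min(first + batch_size, n_samples)
        inputs = [tokens[i:i + lookback] for i in range(first, last)]
        targets = [one_hot(tokens[i + lookback], vocab_size)
                   for i in range(first, last)]
        yield inputs, targets


toy_tokens = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]  # made-up ids, vocabulary of 4
batches = list(batch_generator(toy_tokens, lookback=5, batch_size=2, vocab_size=4))
print("Number of batches:", len(batches))
print("First batch inputs: ", batches[0][0])
print("First batch targets:", batches[0][1])

# Rough size of a fully materialised one-hot matrix, assuming 4-byte floats.
lab_tokens, lab_vocab = 230496, 50257  # figures printed earlier in this lab
print("Full one-hot matrix: about", round(lab_tokens * lab_vocab * 4 / 1e9, 1), "GB")
```

Only `batch_size` one-hot rows exist at any moment in this sketch, at the cost of rebuilding them every epoch.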


### 3.3 Building and Training the Network

Now that we have our datasets properly formatted and have created our training and testing generators, let's proceed to build our model, or load it from disk if one is already there.

In [17]:
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.optimizers import RMSprop
import os

filename="sherlock.h5"
# If the model file exists, we reload from there instead of creating a new model
if os.path.exists(filename):
    print("Loading existing model from ", filename)
    model = load_model(filename)
else:
    print("Creating new model.")
    # Create our model
    n_units = 256 # Hidden layer size for our LSTM
    embedding_size=100 # Size of embedding layer vectors

    text_in = Input(shape=(None, ))
    embedding = Embedding(vocab_size, embedding_size)(text_in)
    lstm = LSTM(n_units)(embedding)
    outputs = Dense(vocab_size, activation='softmax')(lstm)

    model = Model(inputs = text_in, outputs = outputs)

    # Learning rate for RMSprop (0.001 is also the Keras default)
    learning_rate = 0.001
    opti = RMSprop(learning_rate = learning_rate)
    model.compile(loss = 'categorical_crossentropy', optimizer=opti)

model.summary()



Loading existing model from  sherlock.h5


### <u>Question 3</u>

i. In our network we have used a one-hot approach; our network will have over 50,000 outputs, where one of them will be set to "1" and the rest to "0" when training. Why can't we just have one output, where the target value is the index of the next word?

ii. Why do we use softmax and categorical cross entropy for the activation function and loss function?

<b>Fill your answers in the answer book</b>

This is great! We can now begin training our LSTM:

In [20]:
epochs = 25

# This will take a LONG time. 
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# Callback to save the model
newlen_train = len(alltokens)
save_model = ModelCheckpoint(filename)

# Early stopping callback to prevent overfitting (may cause underfitting).
# Stop once the validation loss has failed to improve by at least 0.01
# for 2 consecutive epochs.
earlystop = EarlyStopping(min_delta = 0.01, patience = 2)

steps_per_epoch = (int)(newlen_train / batch_size)

print("Expected number of training vectors: ", steps_per_epoch)
model.fit(seqgen, epochs=epochs, steps_per_epoch = steps_per_epoch, batch_size=batch_size,
          validation_data = seqgen_test, callbacks = [save_model, earlystop])

Expected number of training vectors:  7203
Epoch 1/25


ValueError: Unknown variable: <Variable path=embedding/embeddings, shape=(50257, 100), dtype=float32, value=[...]>. This optimizer can only be called for the variables it was originally built with. When working with a new set of variables, you should recreate a new optimizer instance.
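
The stopping rule configured in the training cell (`min_delta = 0.01`, `patience = 2`) can be sketched as follows. This is a simplified illustration of the idea, not the Keras implementation; the function `epochs_before_stop` and the loss values are made up for the example.

```python
def epochs_before_stop(val_losses, min_delta=0.01, patience=2):
    """Return how many epochs run before the stopping rule triggers.
    Simplified illustration; not the Keras implementation."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improved by more than min_delta: remember it and reset the counter
            best = loss
            waited = 0
        else:
            # No meaningful improvement this epoch
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses)


losses = [6.0, 5.5, 5.3, 5.295, 5.298, 5.0]  # made-up validation losses
print("Stopped after epoch", epochs_before_stop(losses))
```

In this toy run training stops before the final, much better loss is ever reached, which is the underfitting risk mentioned in the training cell's comment.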

### 3.4 Text Generation

Now comes the fun part! We will now use our model to create stories. These are the steps we need to take:

1. Create a prompt. This is usually the first few words of the starting sentence of our story.
2. Tokenize the prompt.
3. Feed it to the network.
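4. Use a probability model to choose the next token based on the current series of tokens. The simplest (greedy) choice is $\text{nextword} = \arg\max_{w_i} P(w_i \mid w_{i-k}, \dots, w_{i-1})$, where $k$ is the lookback (5 in this lab). The code below instead samples from this distribution, using a temperature to control how random the choice is.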


In [19]:
def sample_with_temp(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)

    return np.argmax(probas)


def generate_text(seed_text, next_words, model, lookback, temp):
    # Work with token ids throughout, so that word pieces are joined back
    # into whole words when we decode at the end
    token_ids = tokenizer.encode(seed_text)

    for _ in range(next_words):
        # Feed only the most recent `lookback` tokens to the model
        window = np.array(token_ids[-lookback:]).reshape(1, -1)

        probs = model.predict(window, verbose=0)[0]
        y_class = sample_with_temp(probs, temperature = temp)
        token_ids.append(int(y_class))

    # Convert the ids (seed plus generated tokens) back into text
    return tokenizer.decode(token_ids, skip_special_tokens=True)
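
The rescaling step inside `sample_with_temp` above (take the log, divide by the temperature, exponentiate, renormalise) can be tried on a toy distribution to see what the temperature does. The sketch below uses only the standard library; `apply_temperature` and the probabilities are made up for illustration.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution as exp(log(p) / T), then renormalise."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


toy_probs = [0.7, 0.2, 0.1]  # made-up next-token probabilities
for t in (0.5, 1.0, 3.0):
    print("T =", t, [round(p, 3) for p in apply_temperature(toy_probs, t)])
```

A temperature below 1 sharpens the distribution towards the most likely token, a temperature of 1 leaves it unchanged, and a temperature above 1 flattens it so that unlikely tokens are picked more often.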

Now let's generate some text!



In [None]:
temp=3
seed_text = "elementary my dear watson, "
genwords = 1000

print("Temperature = ", temp)
out_text = generate_text(seed_text, genwords, model, lookback, temp=temp)

print("\nGenerated text: ")
print(out_text)


## 4. Conclusion

We have just seen how to use LSTMs to generate texts based on a corpus and some seed text. The idea is for the LSTM to learn to predict the next word to be generated based on a current set of words.

We made use of a TimeseriesGenerator to produce the values on-the-fly as the dataset is too large to be fully loaded into memory.

In the next lab we will look at how to build transformers and use them to generate text.
    
    