This notebook is adapted from and inspired by the following sources:

*   This [notebook](https://colab.research.google.com/github/ccc-frankfurt/Practical_ML_SS21/blob/main/week06/Shakespeare_Poetry_Generation_RNN_LSTM_solution.ipynb) on poetry generation using an RNN
*   This [post](https://machinelearningmastery.com/text-generation-with-lstm-in-pytorch/) about text generation with LSTM

## Becoming Shakespeare with the Long Short Term Memory

In this notebook, you will learn how to generate sonnets in the style of Shakespeare by training a Long Short Term Memory (LSTM) model. An LSTM is a type of recurrent neural network (RNN), a neural network architecture commonly used to learn and generate sequential data such as natural text, chemical strings, or DNA. In RNNs, the order of the elements matters: the network keeps an internal memory, or state, that summarizes the elements seen so far, and it predicts the next element based on that state. The vanilla RNN has several disadvantages, among them the so-called [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem), which motivated the creation of the LSTM: a robust architecture that is well suited to learning long-term dependencies and whose cost grows linearly with sequence length, unlike the quadratic cost of self-attention in the [Transformer](https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)) architecture.
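To make the idea of an internal state more concrete, here is a minimal NumPy sketch of a single LSTM time step using the standard gate equations. It is only an illustration, not the model trained later in this notebook: the function name `lstm_cell_step`, the weight names `W`, `U`, `b`, the sizes, and the order in which the gates are stacked are all assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative sketch).

    x:       input vector, shape (input_size,)
    h_prev:  previous hidden state, shape (hidden_size,)
    c_prev:  previous cell state (the "memory"), shape (hidden_size,)
    W, U, b: stacked weights for the four gates, assumed to be ordered as
             input gate, forget gate, candidate, output gate
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:n])          # input gate: how much new information to write
    f = sigmoid(z[n:2 * n])      # forget gate: how much old memory to keep
    g = np.tanh(z[2 * n:3 * n])  # candidate values for the memory
    o = sigmoid(z[3 * n:4 * n])  # output gate: how much memory to expose
    c = f * c_prev + i * g       # updated cell state
    h = o * np.tanh(c)           # updated hidden state
    return h, c

# hypothetical sizes and random weights, only for demonstration
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size))
U = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

# process a short sequence of three one-hot inputs, carrying the state along
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x in np.eye(input_size):
    h, c = lstm_cell_step(x, h, c, W, U, b)

print(h.shape, c.shape)
```

The cell state `c` is updated additively (`f * c_prev + i * g`), which is the mechanism usually credited with easing the vanishing gradient problem of the vanilla RNN.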

As in the previous notebooks, we will define our network structure, this time as a Python class for our LSTM, and then train and evaluate it on a curated dataset of Shakespeare's sonnets. Finally, we will use the trained LSTM to generate text similar to the sonnets and try to imitate Shakespeare's writing style. As a first step, we import our dependencies:

In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

And we set our device to either CPU or GPU:

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('Device: ', device)

Device:  cuda


## The data: Shakespeare's sonnets

Shakespeare's sonnets can be found through the following URL: http://shakespeare.mit.edu/

The authors of [this notebook](https://colab.research.google.com/github/ccc-frankfurt/Practical_ML_SS21/blob/main/week06/Shakespeare_Poetry_Generation_RNN_LSTM_solution.ipynb#scrollTo=stneSw5L77Ln) have extracted all the plain text of the sonnets here: https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt  

We will download the resulting `sonnets.txt` file from their GitHub repository:

In [4]:
!wget https://raw.githubusercontent.com/ccc-frankfurt/Practical_ML_SS21/master/week06/sonnets.txt

--2025-04-19 19:36:02--  https://raw.githubusercontent.com/ccc-frankfurt/Practical_ML_SS21/master/week06/sonnets.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94081 (92K) [text/plain]
Saving to: ‘sonnets.txt’


2025-04-19 19:36:03 (5.72 MB/s) - ‘sonnets.txt’ saved [94081/94081]



Let's load the file and inspect an excerpt of the sonnets:

In [15]:
# open the downloaded file in read mode ('r')
with open('sonnets.txt', 'r') as f:
    text = f.read()

# print an excerpt of the text
print(text[:128])

From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,



### Generating character encodings

To feed text into an RNN, we first have to choose a numeric representation of the text, since the network operates on numbers. There are several options, ranging from simple character-level encodings to more sophisticated approaches such as [word embeddings](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa).

We will use character encodings in this notebook, which work as follows. First, we define an alphabet, i.e. the set of characters that we want to be able to represent. An alphabet could consist of the letters A-Z, and may also include digits or special tokens. Then, we create a one-hot vector for each character: its length equals the size of our alphabet, and the single "hot" position indicates the character it represents. For instance, assume our alphabet consists only of A, B and C, with positions ordered as [A,B,C]. Then the one-hot vector for C is [0,0,1], and the one-hot vector for A is [1,0,0].
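The ABC example above can be sketched in a few lines of plain Python. The names `alphabet`, `char_to_index` and `one_hot` are made up for this illustration and are not used elsewhere in the notebook.

```python
# toy alphabet from the example; positions are [A, B, C]
alphabet = ["A", "B", "C"]
char_to_index = {ch: idx for idx, ch in enumerate(alphabet)}

def one_hot(ch):
    """Return a list of zeros with a single 1 at the position of `ch`."""
    vec = [0] * len(alphabet)
    vec[char_to_index[ch]] = 1
    return vec

print(one_hot("C"))  # [0, 0, 1]
print(one_hot("A"))  # [1, 0, 0]
```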

For simplicity, we define our alphabet here as "all unique characters in our dataset", which includes letters as well as punctuation, spaces and line breaks.

In [16]:
# set our alphabet to be all characters in the text
chars = tuple(set(text))

# convert character data into integers by simply mapping each character to the corresponding integer
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

# encode the complete text
encoded_text = np.array([char2int[ch] for ch in text])

# show the same excerpt as text and as encoded integers
print(text[:128])
print(encoded_text[:128])

From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,

[50  2  3 45 30 59  1 13  2  6 26 15 30  4  2  6  1 15 49  2  6 26 30  8
  6 30  9  6 26 13  2  6 30 13 23  4  2  6  1 26  6 27 28 33 60  1 15 30
 15 60  6  2  6 19 57 30 19  6  1 49 15 57 38 26 30  2  3 26  6 30 45 13
 39 60 15 30 23  6 14  6  2 30  9 13  6 27 28 41 49 15 30  1 26 30 15 60
  6 30  2 13 40  6  2 30 26 60  3 49 24  9 30 19 57 30 15 13 45  6 30  9
  6  4  6  1 26  6 27 28]


In [18]:
char2int['F']

50

As you can see, the uppercase F has been mapped to the integer 50, and 50 is also the first index in our encoded sequence. We can also inspect the total length of our dataset and the number of unique characters in our alphabet:

In [19]:
print('Total characters: ', len(text))
print('Vocabulary size: ', len(chars))

Total characters:  94081
Vocabulary size:  61
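One caveat about the mapping built above: iterating over a Python `set` of strings does not guarantee a stable order between interpreter sessions, so the specific integers (such as 50 for `F`) may differ when you rerun the notebook. If a reproducible mapping matters, for example when reloading a trained model later, one option is to sort the characters first. The sketch below shows this on a short sample string and checks that encoding followed by decoding returns the original text; the names `vocab`, `stoi` and `itos` are hypothetical and not part of the notebook's code.

```python
sample = "From fairest creatures we desire increase,"

# sorting makes the character order, and hence the mapping, deterministic
vocab = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(vocab)}  # character -> integer
itos = {i: ch for ch, i in stoi.items()}      # integer -> character

# encode the text, then decode it again as a round-trip check
encoded = [stoi[ch] for ch in sample]
decoded = "".join(itos[i] for i in encoded)

print(decoded == sample)
```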


Next, we need to think about our goal, which is to generate text similar to Shakespeare's sonnets. Since we use a character-based approach, we can reduce this goal to "predict the next character given the past characters", where "the past characters" are a look-back window of fixed length, e.g. the 100 characters preceding the one we want to predict. That is, with characters 1 to 100 as input, the LSTM predicts character 101. For example, imagine a look-back window of length 3:

hel -> l

ell -> o

llo -> !

and so on. We can frame this as a classification problem: given a look-back window, predict the most likely next class (character) out of all possible classes (the alphabet). Hence, our training data will consist of fixed-size look-back windows of text, and the label for each window will be the character that follows it. We therefore need to split the whole sonnet text into such windows.
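As a small illustration of this windowing, the following sketch reproduces the `hello!` example with a look-back window of length 3. The helper `make_windows` is a hypothetical name used only here; it works on raw characters for readability, whereas the dataset itself uses the integer encoding.

```python
def make_windows(sequence, window_size):
    """Slide a window over `sequence`; each window's label is the next element."""
    inputs, targets = [], []
    for start in range(len(sequence) - window_size):
        inputs.append(sequence[start:start + window_size])
        targets.append(sequence[start + window_size])
    return inputs, targets

inputs, targets = make_windows("hello!", 3)
for window, target in zip(inputs, targets):
    print(window, "->", target)
# hel -> l
# ell -> o
# llo -> !
```

A sequence of length N yields N minus `window_size` training pairs, since the last window must still have a following character to serve as its label.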

### Creating the dataset

In [None]:
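window_size = 100  # look-back window length, feel free to play around with this value
windows_x = []  # inputs: each entry is a list of window_size character indices
windows_y = []  # labels: index of the character that follows each window
# slide the window over the text one character at a time
for i in range(0, len(text) - window_size, 1):
    seq_in = text[i:i + window_size]   # the look-back window
    seq_out = text[i + window_size]    # the character to predict
    windows_x.append([char2int[char] for char in seq_in])
    windows_y.append(char2int[seq_out])
n_patterns = len(windows_x)
print("Total Patterns: ", n_patterns)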