In [1]:
import numpy as np
import torch.nn as nn
import torch

**Dataset**

Run the cell below to open the ebook of [_Pride and Prejudice_](https://www.gutenberg.org/files/1342/1342-h/1342-h.htm) and read it into the variable `raw_text`.

**Note**: Due to hardware constraints, we'll only use the text of **Chapter 1**, which we've sliced out of `raw_text` and saved to the variable `raw_text_ch1`.

In [2]:
with open('datasets/book.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

# index chapter 1
raw_text_ch1 = raw_text[1985:6468]
print(raw_text_ch1[:117])

It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.


**Tokenization and Preprocessing**

We've tokenized and preprocessed the raw text from Chapter 1 into the following variables:

- `tokenized_text` : contains the Chapter 1 text tokenized as a list of character-based tokens
- `c2ix` : contains the vocabulary of unique character tokens mapped to their unique token IDs
- `vocab_size` : is the vocabulary size
- `ix2c` : contains the inverse vocabulary of unique token IDs mapped to their unique character tokens
- `tokenized_id_text` : contains the tokenized text with each token mapped to its token ID
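
To make the roles of these variables concrete, here is a minimal sketch on a toy string. The string and every `toy_`-prefixed name are illustrative assumptions, not part of the notebook. It builds the same kind of vocabulary and checks that the inverse vocabulary undoes the encoding.

```python
# Toy stand-in for raw_text_ch1 (assumed for illustration only).
toy_text = "a wife, a man"

# Character-level tokenization: every character, including spaces
# and punctuation, becomes its own token.
toy_tokens = list(toy_text)

# Sorting the unique characters gives a deterministic vocabulary.
toy_c2ix = {ch: i for i, ch in enumerate(sorted(set(toy_tokens)))}
toy_ix2c = {ix: ch for ch, ix in toy_c2ix.items()}
toy_vocab_size = len(toy_c2ix)

# Encode the tokens to IDs, then decode the IDs back to text.
toy_id_text = [toy_c2ix[ch] for ch in toy_tokens]
decoded = "".join(toy_ix2c[ix] for ix in toy_id_text)

print(toy_c2ix)
print(toy_id_text)
print(decoded == toy_text)
```

Because the space and the comma are part of the vocabulary, the model can later learn to emit them like any other character.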

We also used the `Dataset` and `DataLoader` utility classes to create the following variables:
- `dataset` : stores and creates the sequences for the features and labels with a sequence length of `24`
- `dataloader` : contains the iterable used to load the sequences as batches with a batch size of `48`

Run the cell below to tokenize and preprocess the raw text from Chapter 1.

In [3]:
tokenized_text = list(raw_text_ch1)
unique_character_tokens = sorted(list(set(tokenized_text)))
c2ix = {ch:i for i,ch in enumerate(unique_character_tokens)}
vocab_size = len(c2ix)
ix2c = {ix:ch for ch,ix in c2ix.items()}

tokenized_id_text = [c2ix[ch] for ch in tokenized_text]

from torch.utils.data import Dataset, DataLoader
torch.manual_seed(1) # set random seed --do not change!

class TextDataset(Dataset):
    def __init__(self, tokenized_text, seq_length):
        self.tokenized_text = tokenized_text
        self.seq_length = seq_length
    def __len__(self):
        return len(self.tokenized_text) - self.seq_length
    def __getitem__(self, idx):
        features = torch.tensor(self.tokenized_text[idx:idx+self.seq_length])
        labels = torch.tensor(self.tokenized_text[idx+1:idx+self.seq_length+1])
        return features, labels
                
seq_length = 24
batch_size = 48
dataset = TextDataset(tokenized_id_text, seq_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
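
The slicing in `TextDataset.__getitem__` is easier to see on a short list. The sketch below reproduces it in plain Python with a toy sequence length of `3` instead of `24`. The helper `make_pair` and the toy values are hypothetical names introduced only for this illustration.

```python
# Toy token IDs standing in for tokenized_id_text (illustrative values).
toy_id_text = [0, 1, 2, 3, 4, 5, 6, 7]
toy_seq_length = 3  # the notebook uses 24

def make_pair(ids, idx, seq_length):
    """Slice (features, labels) the same way TextDataset.__getitem__ does."""
    features = ids[idx:idx + seq_length]
    labels = ids[idx + 1:idx + seq_length + 1]  # same window shifted by one
    return features, labels

# Mirrors TextDataset.__len__: the last start index must leave room
# for the one-step-shifted label window.
num_samples = len(toy_id_text) - toy_seq_length

for idx in range(num_samples):
    features, labels = make_pair(toy_id_text, idx, toy_seq_length)
    print(idx, features, labels)
```

Each label is the character that follows the corresponding feature position, so every position in a sequence contributes one next-character prediction to the loss.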

**Construct the Character-Based LSTM Class**

We've constructed the Character-Based LSTM class based on the previous exercise.

Run the cell below to define the class and create an instance of it saved to the variable `lstm_model`.

In [4]:
import torch.nn as nn
torch.manual_seed(1) # set random seed --do not change!

class CharacterLSTM(nn.Module):
    def __init__(self):
        super(CharacterLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim=64)
        self.lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
        self.linear = nn.Linear(128, vocab_size)

    def forward(self, x, states):
        x = self.embedding(x)
        out, states = self.lstm(x, states)
        out = self.linear(out)
        out = out.reshape(-1, out.size(2))
        return out, states

    def init_state(self, batch_size):
        hidden = torch.zeros(1, batch_size, 128)
        cell = torch.zeros(1, batch_size, 128)
        return hidden, cell

lstm_model = CharacterLSTM()
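
To see why `forward` ends with a `reshape`, it can help to trace the tensor shapes through the three layers. The sketch below rebuilds the layers standalone with an assumed vocabulary size of `50` and a batch of `4` sequences. Both numbers are placeholders, since the notebook derives `vocab_size` from the text.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only.
toy_vocab_size, batch, seq_len = 50, 4, 24

embedding = nn.Embedding(toy_vocab_size, embedding_dim=64)
lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
linear = nn.Linear(128, toy_vocab_size)

x = torch.randint(0, toy_vocab_size, (batch, seq_len))  # random token IDs
states = (torch.zeros(1, batch, 128), torch.zeros(1, batch, 128))

emb = embedding(x)                          # (batch, seq_len, 64)
out, (h, c) = lstm(emb, states)             # out: (batch, seq_len, 128)
logits = linear(out)                        # (batch, seq_len, vocab)
flat = logits.reshape(-1, logits.size(2))   # (batch * seq_len, vocab)

for name, t in [("emb", emb), ("out", out), ("h", h), ("logits", logits), ("flat", flat)]:
    print(name, tuple(t.shape))
```

Flattening to `(batch * seq_len, vocab)` lines the scores up with `labels.view(-1)` in the training loop, so the loss function receives one row of scores per target character.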

Using the created instance of the LSTM class `lstm_model`, let's set up the loss function and optimizer to train our model:

1. Create an instance of the **multiclass cross-entropy** loss function and save it to the variable `loss`.

2. Create an instance of the **Adam** optimizer with a learning rate of `0.01` and save it to the variable `optimizer`.

In [5]:
import torch.optim as optim
torch.manual_seed(1) # set random seed 

loss = nn.CrossEntropyLoss()
optimizer = optim.Adam(lstm_model.parameters(), lr=0.01)
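
For intuition about what the loss measures, the sketch below computes multiclass cross-entropy for a single prediction by hand in plain Python: the negative log of the softmax probability given to the correct class. The function `cross_entropy` and the toy scores are hypothetical. `nn.CrossEntropyLoss` applies the same formula to raw scores and, by default, averages it over all targets.

```python
import math

def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the target class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical raw scores over a 3-character vocabulary.
toy_logits = [2.0, 0.5, -1.0]

low = cross_entropy(toy_logits, target=0)   # target has the highest score
high = cross_entropy(toy_logits, target=2)  # target has the lowest score
print(low, high)

# With equal scores the loss equals log(number of classes).
uniform = cross_entropy([0.0, 0.0, 0.0], target=1)
print(uniform, math.log(3))
```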

Now let's train the LSTM by implementing the training loop. We've started building a loop that trains the model for 10 epochs. In each epoch, it iterates through all of the batches in the `dataloader` and trains the model one batch at a time.

Complete the training loop so that, for each batch, it performs the following steps:
1. Reset the gradients
2. Reset the hidden and cell states
3. Apply the forward pass (that returns the output and updates the states)
4. Calculate the loss
5. Compute the gradients
6. Update the weights and biases

In [6]:
# initialize model and model components 
lstm_model = CharacterLSTM()
loss = nn.CrossEntropyLoss()
optimizer = optim.Adam(lstm_model.parameters(), lr=0.01)

num_epochs = 10
for epoch in range(num_epochs):
    for features, labels in dataloader:
        optimizer.zero_grad()
        states = lstm_model.init_state(features.size(0))
        outputs, states = lstm_model(features, states)
        CEloss = loss(outputs, labels.view(-1))
        CEloss.backward()
        optimizer.step()        
    # report the loss of the last batch in each epoch (not an epoch average)
    if (epoch + 1) % 1 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], CELoss: {CEloss.item():.4f}')

Epoch [1/10], CELoss: 1.3437
Epoch [2/10], CELoss: 0.7308
Epoch [3/10], CELoss: 0.5384
Epoch [4/10], CELoss: 0.4454
Epoch [5/10], CELoss: 0.4315
Epoch [6/10], CELoss: 0.3848
Epoch [7/10], CELoss: 0.3613
Epoch [8/10], CELoss: 0.3683
Epoch [9/10], CELoss: 0.3468
Epoch [10/10], CELoss: 0.3214
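
One way to read these numbers: cross-entropy is the average negative log-probability of the correct next character, so `exp(-loss)` is the geometric-mean probability the model assigned to the correct character. The sketch below applies this to the first and last values printed above. Keep in mind that these are training losses from the last batch of each epoch, not a held-out evaluation, and that the vocabulary size of `60` used for the uniform baseline is an assumption for illustration.

```python
import math

# Loss values copied from the training log (last batch of each epoch).
first_epoch_loss = 1.3437
last_epoch_loss = 0.3214

def mean_correct_prob(ce_loss):
    """Geometric-mean probability assigned to the correct character."""
    return math.exp(-ce_loss)

print(f"epoch 1:  {mean_correct_prob(first_epoch_loss):.3f}")
print(f"epoch 10: {mean_correct_prob(last_epoch_loss):.3f}")

# A model that guesses uniformly over the vocabulary would score
# log(vocab_size). 60 is an assumed vocabulary size, not the real one.
assumed_vocab_size = 60
print(f"uniform baseline loss: {math.log(assumed_vocab_size):.3f}")
```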


Let's see if the trained LSTM can generate the first sentence of our text in Chapter 1: 

```md
It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.
```

We'll use the starting prompt `"It is a truth"` that's been tokenized to its character-based token IDs saved to the variable `starting_prompt`.

We've also set the model to evaluation mode and specified that we want to generate `250` new characters.

Within the `torch.no_grad():` context, we initialized zeroed hidden and cell states and saved them to the variable `states`.

We then complete the `for` loop, which generates one character per iteration as follows:
1. Run the forward pass on the tokenized prompt to get the output scores and the updated states
2. Use `torch.argmax` to select the token ID with the highest output score at the last position
3. Use the inverse vocabulary `ix2c` to map the selected token ID to its character-based token
4. Append the generated character to the starting prompt
5. Replace the tokenized prompt with the newly generated token ID, so the next iteration feeds in only that token while the carried-over states hold the earlier context

Lastly, we print the starting prompt followed by the newly generated text.
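
After the first iteration, the generation loop feeds only the most recent token back into the model and relies on the carried-over states to remember everything before it. The sketch below checks that idea on a small standalone `nn.LSTM` with arbitrary sizes and random inputs, all of them assumptions for illustration: processing a sequence in one call and processing it one step at a time with carried states should give the same final output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small standalone LSTM; the sizes are arbitrary illustration values.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
sequence = torch.randn(1, 5, 8)  # (batch=1, seq_len=5, features=8)
zero_states = (torch.zeros(1, 1, 16), torch.zeros(1, 1, 16))

with torch.no_grad():
    # Option A: process the whole sequence in one call.
    out_full, _ = lstm(sequence, zero_states)

    # Option B: process one time step per call, carrying the states forward.
    states = zero_states
    for t in range(sequence.size(1)):
        out_step, states = lstm(sequence[:, t:t + 1, :], states)

# Compare the output at the final time step from both options.
same = torch.allclose(out_full[:, -1, :], out_step[:, -1, :], atol=1e-5)
print(same)
```

This is why the loop can pass the full prompt once and then a single token per iteration, instead of re-running the model on the whole growing text each time.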


In [7]:
starting_prompt = "It is a truth"
tokenized_id_prompt = torch.tensor([[c2ix[ch] for ch in starting_prompt]])

lstm_model.eval()
num_generated_chars = 250
with torch.no_grad():
    states = lstm_model.init_state(1)
    for _ in range(num_generated_chars):
        output, states = lstm_model(tokenized_id_prompt, states)
        predicted_id = torch.argmax(output[-1, :], dim=-1).item()
        predicted_char = ix2c[predicted_id]
        starting_prompt += predicted_char
        tokenized_id_prompt = torch.tensor([[predicted_id]])        

# print the generated text
print(starting_prompt)

It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on his
first entering a neighbourhood."

"It will be no use to us, if twenty suc


The LSTM model was able to successfully generate the full first sentence: `It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.`

Notably, it is even able to correctly generate the comma `','` after the word `'acknowledged'`. This is because we decided _not_ to remove punctuation and special characters from the raw text, which allowed the LSTM to learn to generate them!

Afterward, it starts to deviate from the actual text but still produces largely grammatical English!

Remember, our model was trained for only 10 epochs and only on a small portion (Chapter 1) of the full text!

**Tips to improve the model**

Here are some ways to further improve our text generation model:
- use the full text (or gather multiple outside texts)
- use a larger embedding size (the largest GPT-3 model uses an embedding dimension of ~12,000!)
- modify the neural network architecture (add more neurons, layers, activation functions, etc.)
- increase the number of epochs for training
- test different optimizers and learning rates
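
As a starting point for the embedding-size and architecture suggestions, here is a hedged sketch of a configurable variant of the model. The class name `ConfigurableCharLSTM`, its default sizes, and the vocabulary size of `50` are all hypothetical choices, not tuned or tested recommendations. The main detail to notice is that a stacked LSTM needs one hidden and one cell state per layer.

```python
import torch
import torch.nn as nn

class ConfigurableCharLSTM(nn.Module):
    """Hypothetical variant of CharacterLSTM with tunable sizes and depth."""

    def __init__(self, vocab_size, embedding_dim=128, hidden_size=256, num_layers=2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, states):
        out, states = self.lstm(self.embedding(x), states)
        out = self.linear(out)
        return out.reshape(-1, out.size(2)), states

    def init_state(self, batch_size):
        # One hidden state and one cell state per stacked layer.
        shape = (self.num_layers, batch_size, self.hidden_size)
        return torch.zeros(shape), torch.zeros(shape)

model = ConfigurableCharLSTM(vocab_size=50)  # 50 is an assumed vocabulary size
tokens = torch.randint(0, 50, (4, 24))       # random batch of token IDs
logits, (h, c) = model(tokens, model.init_state(4))
print(tuple(logits.shape), tuple(h.shape))
```

The training and generation loops from earlier should work unchanged with such a class, since they only call `init_state` and the forward pass.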

Unfortunately, due to the hardware constraints in our environments, we don't have the computational power to train larger networks on larger datasets without crashing our notebook. But feel free to build, train, and test a language model on your own!