# Building a Simple Character-Level Language Model

Aims:
- Build your first language model to generate song lyrics
- Understand how to implement recurrent neural networks in PyTorch
- Get familiar with PyTorch's embedding layer

## What is Language Modelling?

> Given a sequence of words, the language model assigns a probability to each possible word that might come next in the sequence. 

![](./images/Language%20Model.png)

Language modelling is the task of predicting the next word in a sequence based on the context provided by the previous words. It is a core task in natural language processing (NLP) and underpins a wide range of applications, including speech recognition, machine translation, and chatbots.

A language model can be used to predict the next word in a sequence, to generate text that is similar to a given input, or to evaluate the quality of a translation or summary by comparing the probability it assigns to the generated text with the probability it assigns to the original text.
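
To make the scoring use case concrete: the probability a language model assigns to a whole sequence is the product of its next-word probabilities, by the chain rule. The symbols $w_t$ (the word at position $t$) and $T$ (the sequence length) are notation introduced here for illustration:

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$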

> It's easy to acquire data for training language models because the label is simply the next word.

Language models are typically trained on large corpora of text, such as books, articles, and websites, in order to learn the statistical properties of the language and the dependencies between words. They can be implemented using various types of neural networks, such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, or transformers.
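
Before reaching for a neural network, it can help to see the same idea with plain counting. The sketch below is not the model built in this notebook: it estimates the probability of the next character given only the previous character, using a made-up toy corpus (`toy_text`) and a hypothetical helper `next_char_probabilities`.

```python
from collections import Counter, defaultdict


def next_char_probabilities(text, context_char):
    """Estimate P(next char | previous char) by counting character pairs."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    following_counts = counts.get(context_char, Counter())
    total = sum(following_counts.values())
    if total == 0:
        return {}  # the context character never appears before another character
    return {char: n / total for char, n in following_counts.items()}


toy_text = "hello hello help"  # made-up corpus, for illustration only
probs = next_char_probabilities(toy_text, "l")
print(probs)
```

A neural language model plays the same role as the count table, but it can condition on a much longer context and share what it learns between similar contexts.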

## Get Some Data

In this example, we'll try to generate lyrics like those from your favourite artist.

If you want to use your own data, you can either:
- Copy lyrics into the code below to define your corpus (easy difficulty)
- Create a GitHub repo and upload the lyrics there, then paste in the raw URLs, as below (intermediate difficulty)
- Build a web-scraper to collect lyrics [like I did](https://github.com/life-efficient/Lyric-Generation/tree/main/data) (hardcore difficulty)

In [None]:
import requests

def get_country_music_lyrics_corpus():
    """Get the country music lyrics corpus."""
    raw_urls = [
        "https://raw.githubusercontent.com/life-efficient/Lyric-Generation/main/data/Country/Ashe-emotional-lyrics.txt",
        "https://raw.githubusercontent.com/life-efficient/Lyric-Generation/main/data/Country/Creedence-clearwater-revival-have-you-ever-seen-the-rain-lyrics.txt",
        "https://raw.githubusercontent.com/life-efficient/Lyric-Generation/main/data/Country/Hardy-and-lainey-wilson-wait-in-the-truck-lyrics.txt",
        "https://raw.githubusercontent.com/life-efficient/Lyric-Generation/main/data/Country/Johnny-cash-folsom-prison-blues-lyrics.txt",
        "https://raw.githubusercontent.com/life-efficient/Lyric-Generation/main/data/Country/Johnny-cash-ring-of-fire-lyrics.txt",
        "https://raw.githubusercontent.com/life-efficient/Lyric-Generation/main/data/Country/Koe-wetzel-creeps-lyrics.txt",
        "https://raw.githubusercontent.com/life-efficient/Lyric-Generation/main/data/Country/Rosa-linn-snap-lyrics.txt",
        "https://raw.githubusercontent.com/life-efficient/Lyric-Generation/main/data/Country/Taylor-swift-all-too-well-10-minute-version-taylors-version-from-the-vault-lyrics.txt"
    ]
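    songs = []
    for url in raw_urls:
        response = requests.get(url)
        response.raise_for_status()  # fail loudly if a download goes wrong
        lines = [line for line in response.text.splitlines() if line != ""]
        songs.append(" ".join(lines))
    return " ".join(songs)  # separate songs with a space so words don't run together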

get_country_music_lyrics_corpus()

### The Tokeniser

![](./images/Tokeniser.png)

The first thing we need to do is to create a tokeniser that can take in our raw text and split it into a sequence of tokens.

In this simple example, we will create a character-level tokeniser:
- The tokeniser should be able to split any string into a sequence of characters, then map each character to its integer index.
- In most real applications, you'd use a word-level or subword-level tokeniser instead. 
- Here, we implement our own tokeniser for practice. In a real-world example, you can find pre-built tokenisers online, for example in [HuggingFace](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel.forward.example).

In [None]:
class Tokeniser:
    def __init__(self, txt):
        txt = self.preprocess(txt) # Preprocess the text
        unique_chars = sorted(set(txt)) # TODO Create a sorted list of unique characters in the input text, so that IDs are reproducible between runs
        self.vocab_size = len(unique_chars) # TODO Get the vocabulary size
        self.id_to_token = dict(enumerate(unique_chars)) # TODO Create a dictionary that maps character IDs to characters
        self.token_to_id = {v: k for k, v in self.id_to_token.items()} # TODO Create a reverse dictionary that maps characters to character IDs

    def preprocess(self, txt):
        txt = txt.lower() # TODO Convert the lyrics to lowercase
        # other preprocessing steps can be added here
        return txt

    def encode(self, txt):
        txt = self.preprocess(txt) # Preprocess
        token_ids = [self.token_to_id[char] for char in txt.strip()] # TODO Encode the input string by mapping its characters to character IDs
        return token_ids

    def decode(self, token_ids):
        return "".join([self.id_to_token[token_id] for token_id in token_ids]) # TODO Decode the input list of character IDs by mapping them to characters


corpus = get_country_music_lyrics_corpus()
tokeniser = Tokeniser(corpus) # TODO create a tokeniser object

tokens = tokeniser.encode("This is truly excellent") # TODO encode a sentence
print("Tokens:", tokens)
tokeniser.decode(tokens) # TODO decode the tokens
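
One limitation of a tokeniser like the one above is that `encode` raises a `KeyError` for any character that never appeared in the corpus. A minimal sketch of one common workaround is shown below, assuming a reserved fallback token; the class name `SafeCharTokeniser` and the `<unk>` token are hypothetical and not part of the original tokeniser.

```python
class SafeCharTokeniser:
    """Character-level tokeniser with a fallback id for unseen characters (illustrative)."""

    UNK = "<unk>"  # hypothetical placeholder token for unknown characters

    def __init__(self, txt):
        chars = sorted(set(txt.lower()))  # sorted so ids are reproducible across runs
        self.id_to_token = dict(enumerate([self.UNK] + chars))
        self.token_to_id = {token: idx for idx, token in self.id_to_token.items()}
        self.vocab_size = len(self.id_to_token)

    def encode(self, txt):
        unk_id = self.token_to_id[self.UNK]
        # Fall back to the unknown id instead of raising a KeyError
        return [self.token_to_id.get(char, unk_id) for char in txt.lower()]

    def decode(self, token_ids):
        return "".join(self.id_to_token[idx] for idx in token_ids)


safe_tokeniser = SafeCharTokeniser("ring of fire")
ids = safe_tokeniser.encode("Fire!")  # "!" is not in the toy corpus
print(ids)
print(safe_tokeniser.decode(ids))
```

The trade-off is that decoding is no longer a perfect inverse of encoding: every unseen character comes back as the placeholder.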


## Creating a simple character-level language modelling dataset

A language modelling dataset consists of:
- features: 
    - the sequential words/tokens in a body of text
- targets: 
    - the next token for each position in time
    - i.e. the features shifted one step forward in time

Implementation details:
- Like all PyTorch datasets, our dataset needs a `__len__` method. 
    - In this case, set the length of the dataset to be the number of chunks of text of the provided `chunk_size` that could fit in the dataset.
- Define the `__getitem__` to get a random chunk of text
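
The relationship between features and targets is easiest to see on a short string before any tensors are involved. The sketch below uses a made-up string and a hypothetical helper `make_examples`; it pairs every chunk of `chunk_size` characters with the character that follows it, and shows how many non-overlapping chunks fit in the text.

```python
def make_examples(text, chunk_size):
    """Return (chunk, next_char) pairs for every valid starting position."""
    examples = []
    for start in range(len(text) - chunk_size):
        chunk = text[start:start + chunk_size]
        target = text[start + chunk_size]  # the character right after the chunk
        examples.append((chunk, target))
    return examples


toy_text = "ring of fire"  # made-up text, for illustration only
examples = make_examples(toy_text, chunk_size=4)
for chunk, target in examples[:3]:
    print(repr(chunk), "->", repr(target))

# Number of non-overlapping chunks, the quantity suggested for `__len__`
print("Non-overlapping chunks:", len(toy_text) // 4)
```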

In [None]:
import torch
import numpy as np


class LyricDataset(torch.utils.data.Dataset):
    def __init__(self, tokeniser, chunk_size=30):
        """
        Initialize a LyricDataset object.
        
        Parameters:
        tokeniser (Tokeniser): The tokeniser used to encode the corpus into token ids.
        chunk_size (int): The number of tokens in each chunk returned by `__getitem__`.
        """
        self.chunk_size = chunk_size  # The size of each chunk of data to be returned by the iterator
        self.tokeniser = tokeniser

        txt = get_country_music_lyrics_corpus()

        self.X = torch.tensor(self.tokeniser.encode(txt)) # TODO Encode the text and store it in a tensor
        self.Y = torch.roll(self.X, -1) # TODO Shift the encoded text by one character and store it in a tensor
        self.vocab_size = tokeniser.vocab_size # TODO Store the size of the vocabulary (i.e. the number of unique characters in the preprocessed text)

    def __len__(self):
        return len(self.X) // self.chunk_size # TODO return the number of chunks in the dataset

    def __getitem__(self, idx):
        k = np.random.randint(0, len(self.X) - self.chunk_size) # TODO Select a random starting index for the chunk
        # Select the chunk using a slice object
        return self.X[k: k+self.chunk_size], self.Y[k+self.chunk_size-1]


dataset = LyricDataset(tokeniser) # TODO create a dataset object

print("Vocabulary size:", dataset.vocab_size)
print("Length of dataset:", len(dataset))
print("First few chunks of data:")
for idx, (x, y) in enumerate(dataset):
    print("X:", x)
    print("Y:", y)

    print("Sequence so far:", tokeniser.decode(list(int(xx) for xx in x)))
    print("Target next character:", tokeniser.decode([int(y)]))
    if idx > 3:
        break


Each target should be the character that immediately follows its chunk of features, i.e. the last element of the features shifted one step forward in time.

Now let's make a dataloader to batch and shuffle the dataset:

In [None]:
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=4, shuffle=True) # TODO create a dataloader object

print("First batch of data:")
example_batch = next(iter(dataloader))
print("X:", example_batch[0])
print("Y:", example_batch[1])

## Defining the RNN model

One of the simplest language models you can implement is a many-to-one recurrent neural network, which processes a sequence of many tokens to produce one classification: a prediction of which token comes next.

![](./images/RNN%20Text%20Classifier.png)

Firstly, to initialise the model, we'll define the modules that will be needed to make the forward pass:
- An embedding layer that takes in a sequence of token ids and turns them into a sequence of embeddings.
    - See the docs [here](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html).
    - When called on a sequence of length $T$, an embedding layer that produces $d$ dimensional embeddings for each token will output a matrix of size ($T$, $d$), which represents a $d$ dimensional token embedding for each of the $T$ timesteps.
- An RNN layer
    - Requires an embedding size $d$
    - Requires a hidden size $h$
    - Can be multi-layer
- A classification head
    - Will combine the final hidden state activations into logits for a classification
        - We'll output the logits rather than the probabilities so that we can train the model using the `cross_entropy` loss function
    - The output of the classification head should have the same dimensionality as the vocab size - one logit for each token in the vocabulary
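
To see how the three modules fit together, the sketch below traces tensor shapes through an embedding layer, an RNN layer and a linear classification head, then computes a cross-entropy loss on the logits. All sizes and the random inputs are made up for illustration; this is not the model class defined in this notebook.

```python
import torch

vocab_size, embedding_size, hidden_size = 10, 8, 16  # illustrative sizes
batch_size, seq_len = 4, 5

embedding = torch.nn.Embedding(vocab_size, embedding_size)
rnn = torch.nn.RNN(embedding_size, hidden_size, batch_first=True)
head = torch.nn.Linear(hidden_size, vocab_size)

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # random stand-in tokens
targets = torch.randint(0, vocab_size, (batch_size,))            # random stand-in labels

embedded = embedding(token_ids)  # (batch, T, d)
outputs, _ = rnn(embedded)       # (batch, T, h); the hidden state defaults to zeros
logits = head(outputs[:, -1])    # (batch, vocab_size), from the last timestep only
loss = torch.nn.functional.cross_entropy(logits, targets)  # applies log-softmax internally

print(embedded.shape, outputs.shape, logits.shape, loss.shape)
```

Because `cross_entropy` applies the log-softmax itself, the head can return raw logits without a softmax layer.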

In [None]:
class RNN(torch.nn.Module):
    def __init__(self, vocab_size, embedding_size , hidden_size, n_layers=1):
        super().__init__() # TODO initialise parent class

        # STORE HYPERPARAMETERS
        self.vocab_size = vocab_size 
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers

        # DEFINE MODEL MODULES
        self.embedding = torch.nn.Embedding(vocab_size, embedding_size) # TODO initialise embedding layer
        self.rnn = torch.nn.RNN(embedding_size, hidden_size, n_layers, batch_first=True)  # TODO initialise RNN layer
        self.classification_head = torch.nn.Linear(hidden_size, vocab_size) # TODO initialise classification head

    def forward(self, x):
        pass # we will do this in the next step

    def init_hidden(self, batch_size):
        pass # we will do this in the next step



Every computation performed by an RNN depends on an initial hidden state: at each timestep, the hidden state from the previous timestep is combined with the current input, so the first timestep needs a starting value.
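
For reference, the update rule given in the PyTorch `torch.nn.RNN` documentation for a single layer with the default tanh nonlinearity is shown below. Here $x_t$ is the input at timestep $t$, $h_{t-1}$ is the previous hidden state, and $h_0$ is the initial hidden state that we have to supply:

$$h_t = \tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{t-1} + b_{hh})$$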

> Typically, we initialise the hidden state of an RNN as a vector of zeros.

Check out the [docs](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html#rnn) to make sure you implement the correct shaped tensor.

So let's define a method that does that:

In [None]:
class RNN(torch.nn.Module):
    def __init__(self, vocab_size, embedding_size=32, hidden_size=32, n_layers=1):
        super().__init__() # TODO initialise parent class

        # STORE HYPERPARAMETERS
        self.vocab_size = vocab_size 
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers

        # DEFINE MODEL MODULES
        self.embedding = torch.nn.Embedding(vocab_size, embedding_size) # TODO initialise embedding layer
        self.rnn = torch.nn.RNN(embedding_size, hidden_size, n_layers, batch_first=True)  # TODO initialise RNN layer
        self.classification_head = torch.nn.Linear(hidden_size, vocab_size) # TODO initialise classification head

    def init_hidden(self, batch_size):
        self.hidden = torch.zeros(self.n_layers, batch_size, self.hidden_size) # TODO initialise hidden state

    def forward(self, x):
        pass # we will do this in the next step


rnn = RNN(tokeniser.vocab_size)
# print("Hidden before initialisation:", rnn.hidden)
rnn.init_hidden(batch_size=2)
print("Hidden after initialisation:", rnn.hidden)
print("Hidden shape:", rnn.hidden.shape) # (L, B, H)



Now let's define the forward pass.

Torch's recurrent layers are a little different to other layers in a few ways:
1. By default, the first dimension of the input is not the batch dimension! Instead, it's the time dimension, followed by the batch dimension. We pass `batch_first=True` when creating the layer so that our inputs can keep the batch dimension first.
1. They take in more than one argument:
    - The input data, as usual
    - The current hidden state
1. They return more than one thing, in this order:
    - The output from each timestep
    - The final hidden values of every layer

![](./images/PyTorch%20RNN%20Outputs.png)

The output from each timestep is the activations of the final recurrent layer for every timestep.

In our case, we won't need to use the hidden states output from the RNN layer.

These behaviours might seem unusual, but as you get more familiar with using recurrent networks, you'll realise how they can be useful and make RNNs very flexible.
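
The two return values are easy to confuse, so here is a small sketch that prints their shapes for a two-layer `torch.nn.RNN` with `batch_first=True`. The sizes and the random input are made up for illustration. It also checks that the last timestep of the outputs equals the final hidden state of the top layer.

```python
import torch

torch.manual_seed(0)
n_layers, batch_size, seq_len, input_size, hidden_size = 2, 3, 5, 4, 6  # illustrative sizes

rnn = torch.nn.RNN(input_size, hidden_size, n_layers, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)     # (batch, time, features)
h0 = torch.zeros(n_layers, batch_size, hidden_size)  # always (layers, batch, hidden)

outputs, final_hidden = rnn(x, h0)
print(outputs.shape)       # top-layer activations at every timestep
print(final_hidden.shape)  # last-timestep hidden state of every layer

# The last timestep of `outputs` is the top layer's final hidden state
same = torch.allclose(outputs[:, -1], final_hidden[-1])
print(same)
```

Note that `batch_first=True` only changes the layout of the input and output tensors; the hidden state keeps the layer dimension first.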


In [None]:
class RNN(torch.nn.Module):
    def __init__(self, vocab_size, embedding_size=32, hidden_size=32, n_layers=1):
        super().__init__() # TODO initialise parent class

        # STORE HYPERPARAMETERS
        self.vocab_size = vocab_size 
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers

        # DEFINE MODEL MODULES
        self.embedding = torch.nn.Embedding(vocab_size, embedding_size) # TODO initialise embedding layer
        self.rnn = torch.nn.RNN(embedding_size, hidden_size, n_layers, batch_first=True)  # TODO initialise RNN layer
        self.classification_head = torch.nn.Linear(hidden_size, vocab_size) # TODO initialise classification head

    def init_hidden(self, batch_size):
        self.hidden = torch.zeros(self.n_layers, batch_size, self.hidden_size) # TODO initialise hidden state

    def forward(self, X):
        self.init_hidden(X.shape[0])
        embedding = self.embedding(X)
        outputs, final_hidden = self.rnn(embedding, self.hidden)
        outputs = outputs[:, -1] # keep only the output from the final timestep (many-to-one)
        predictions = self.classification_head(outputs)
        return predictions


features, labels = example_batch
print("Batch size:", features.shape[0])
print("Sequence length:", features.shape[1])
print("Vocabulary size:", tokeniser.vocab_size)
rnn = RNN(tokeniser.vocab_size)
prediction = rnn(features)
print(prediction.shape)


## Generating new text

Now we need to implement a method of our model that takes what it knows and uses it to generate new text.

Initially, our generated text will be awful, because we haven't trained the model.
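
There is more than one way to turn the model's logits into the next token. Always taking the most likely token (greedy decoding) is the simplest, but it tends to get stuck repeating itself. A common alternative is to sample from the softmax distribution, optionally sharpened or flattened by a temperature. The sketch below shows both on made-up logits; the helper `sample_next_token` is hypothetical.

```python
import torch


def sample_next_token(logits, temperature=1.0):
    """Sample a token id from unnormalised logits of shape (vocab_size,)."""
    probabilities = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probabilities, num_samples=1))


torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.1, -1.0])  # made-up scores for a 4-token vocabulary

greedy_id = int(torch.argmax(logits))  # always picks the highest-scoring token
sampled_ids = [sample_next_token(logits, temperature=0.8) for _ in range(10)]

print("Greedy choice:", greedy_id)
print("Sampled choices:", sampled_ids)
```

Lower temperatures make sampling behave more like greedy decoding, while higher temperatures make the choices more varied.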

In [None]:
import random

class RNN(torch.nn.Module):
    def __init__(self, vocab_size, embedding_size=32, hidden_size=32, n_layers=1):
        super().__init__()  # TODO initialise parent class

        # STORE HYPERPARAMETERS
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers

        # DEFINE MODEL MODULES
        # TODO initialise embedding layer
        self.embedding = torch.nn.Embedding(vocab_size, embedding_size)
        # TODO initialise RNN layer
        self.rnn = torch.nn.RNN(
            embedding_size, hidden_size, n_layers, batch_first=True)
        self.classification_head = torch.nn.Linear(
            hidden_size, vocab_size)  # TODO initialise classification head

    def init_hidden(self, batch_size):
        # TODO initialise hidden state
        self.hidden = torch.zeros(self.n_layers, batch_size, self.hidden_size)

    def forward(self, X):
        self.init_hidden(X.shape[0])
        embedding = self.embedding(X)
        outputs, final_hidden = self.rnn(embedding, self.hidden)
        outputs = outputs[:, -1] # get last output
        predictions = self.classification_head(outputs)
        return predictions

    def generate(self):
        self.init_hidden(batch_size=1)
        initial_token_id = random.randint(0, self.vocab_size - 1)  # random.randint includes both endpoints
        generated_token_ids = [initial_token_id]
        initial_token_batch = torch.tensor(initial_token_id).unsqueeze(
            0).unsqueeze(0)  # TODO SOS token
        embedding = self.embedding(initial_token_batch)
        for idx in range(100):  # generate 100 character sequence
            outputs, self.hidden = self.rnn(embedding, self.hidden)
            predictions = self.classification_head(outputs)
            # outputs has shape BxLxN=1x1xN
            predictions = predictions.squeeze()  # remove 1-dims
            chosen_token_id = torch.argmax(predictions)
            generated_token_ids.append(int(chosen_token_id))
            embedding = self.embedding(
                chosen_token_id).unsqueeze(0).unsqueeze(0)
        return generated_token_ids


rnn = RNN(tokeniser.vocab_size)
myrnn = RNN(tokeniser.vocab_size, 32, 32, 1)

generated_tokens = rnn.generate()
print("Generated text:", tokeniser.decode(generated_tokens))



## Creating the training loop

Now that we have the model and the dataset, we need to pass the dataset through the model repeatedly and iteratively optimise the model parameters using gradient descent.
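
The core of any such loop is the same three calls: clear the old gradients, backpropagate the loss, and take an optimiser step. The sketch below shows them on a deliberately tiny problem, with a linear layer standing in for the RNN and random made-up data, so that it runs in a fraction of a second.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
features = torch.randn(16, 8)         # made-up batch: 16 examples, 8 features
targets = torch.randint(0, 3, (16,))  # made-up labels for 3 classes
model = torch.nn.Linear(8, 3)         # stand-in for the RNN, just to show the loop
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(50):
    optimiser.zero_grad()  # clear gradients left over from the previous step
    loss = F.cross_entropy(model(features), targets)
    loss.backward()        # compute gradients of the loss w.r.t. the parameters
    optimiser.step()       # move the parameters against the gradient
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The loss falls because every step nudges the parameters in the direction that reduces it on this batch; the real loop does the same thing over many different batches.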

In [None]:
from torch.utils.tensorboard import SummaryWriter
import torch.nn.functional as F


def train(model, dataloader, tokeniser, epochs=1, lr=0.05):
    writer = SummaryWriter()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # choose optimiser
    n_steps = 0
    for epoch in range(epochs):
        epoch_loss = 0  # accumulates the batch losses over the epoch
        for X, y in dataloader:

            predictions = model(X)  # logits of shape (batch, vocab_size)
            loss = F.cross_entropy(predictions, y)
            epoch_loss += loss.item()

            # OPTIMISE
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            # LOGGING
            writer.add_scalar("Loss/Train", loss.item(), n_steps)
            n_steps += 1

        epoch_loss /= len(dataloader)  # avg loss per batch

        print('Epoch ', epoch, ' Avg loss/batch: ', epoch_loss)
        with torch.no_grad():  # no gradients are needed for generation
            generated_token_ids = model.generate()
        writer.add_text("Generated Text", tokeniser.decode(
            generated_token_ids)[:300], epoch)
            # TODO stop on EOS token


if __name__ == "__main__":

    # HYPER-PARAMS
    lr = 0.05
    epochs = 5000
    chunk_size = 30  # the length of the sequences which we will optimize over
    batch_size = 32

    # MODEL ARCHITECTURE
    embedding_size = 64
    hidden_size = 64
    n_layers = 2

    # LOAD DATA
    corpus = get_country_music_lyrics_corpus()
    tokeniser = Tokeniser(corpus)
    dataset = LyricDataset(tokeniser, chunk_size=chunk_size)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    n_tokens = len(dataset.tokeniser.id_to_token)
    # instantiate our model from the class defined earlier
    myrnn = RNN(n_tokens, embedding_size, hidden_size, n_layers)
    train(myrnn, dataloader, tokeniser, epochs, lr)
