# Recurrent Neural Networks and Language Models

You are probably all very excited about ChatGPT.  In today's class, we will implement a very simple language model, which is basically what ChatGPT is, but with a simple LSTM.  You will be surprised that it is not difficult at all.

The paper we base this on is *Regularizing and Optimizing LSTM Language Models*, https://arxiv.org/abs/1708.02182

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

import torchtext, datasets, math
from tqdm import tqdm



In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

#fix the seed so that results are comparable after restarting the kernel
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# torch.cuda.get_device_name(0)

cpu


## 1. Load data - Harry Potter Tiny

We will be using the Harry Potter Tiny dataset (`mickume/harry_potter_tiny`), a small corpus of text suited to a language modeling task.  This time, we will use the `datasets` library from HuggingFace to load it.

In [3]:
# import os
# os.environ['http_proxy']  = 'http://192.41.170.23:3128'
# os.environ['https_proxy'] = 'http://192.41.170.23:3128'

#slice the 'train' split 80/10/10 into train, validation and test
dataset = datasets.load_dataset("mickume/harry_potter_tiny", split={'train': 'train[:80%]', 'validation': 'train[80%:90%]', 'test': 'train[90%:100%]'})
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 5985
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 748
    })
    test: Dataset({
        features: ['text'],
        num_rows: 748
    })
})


In [4]:
print(dataset['train'][333]['text'])

# If you try changing the index, you may notice that some entries are not paragraphs
# but empty strings, so we will have to take care of that later.

There were children there, already enjoying the summer in the playground. Sitting on the grass underneath a tree, Hermione pulled out her books and handed one to Bellatrix. "We can get food if you like. But I brought some water."

In [5]:
# Check if it's a DatasetDict
if isinstance(dataset, dict):
    print("This is a DatasetDict!")
    print(dataset)  # This will show the splits (e.g., 'train', 'test', 'validation')
else:
    print("This is a single Dataset!")
    print(dataset)  # This will show the data in the single dataset (e.g., 'train' only)

This is a DatasetDict!
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 5985
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 748
    })
    test: Dataset({
        features: ['text'],
        num_rows: 748
    })
})


## 2. Preprocessing

### Tokenizing

Simply split the text of each example into tokens, for the train, validation and test splits alike.

In [6]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

#function to tokenize
tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(example['text'])}  

#map the function to each example
tokenized_dataset = dataset.map(tokenize_data, remove_columns=['text'], fn_kwargs={'tokenizer': tokenizer})
print(tokenized_dataset['train'][333]['tokens'])

['there', 'were', 'children', 'there', ',', 'already', 'enjoying', 'the', 'summer', 'in', 'the', 'playground', '.', 'sitting', 'on', 'the', 'grass', 'underneath', 'a', 'tree', ',', 'hermione', 'pulled', 'out', 'her', 'books', 'and', 'handed', 'one', 'to', 'bellatrix', '.', 'we', 'can', 'get', 'food', 'if', 'you', 'like', '.', 'but', 'i', 'brought', 'some', 'water', '.']


### Numericalizing

We will tell torchtext to add only words that occur at least three times in the training set to the vocabulary, because otherwise it would be too big.  We also make sure to add the special tokens `<unk>` and `<eos>`.

In [7]:
## numericalizing
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_dataset['train']['tokens'], 
min_freq=3) 
vocab.insert_token('<unk>', 0)           
vocab.insert_token('<eos>', 1)            
vocab.set_default_index(vocab['<unk>'])   
print(len(vocab))                         
print(vocab.get_itos()[:10])       

3624
['<unk>', '<eos>', '.', ',', 'the', 'her', 'to', 'she', 'and', 'was']
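The frequency cutoff and the `<unk>` fallback can be illustrated without torchtext. The following is a minimal sketch using only the standard library; `build_vocab` and `numericalize` are hypothetical helpers, the toy corpus is made up, and the ordering of kept tokens (by frequency, then alphabetically) is an assumption rather than a guarantee about torchtext's behaviour.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=3, specials=('<unk>', '<eos>')):
    """Map tokens seen at least `min_freq` times to integer ids (specials first)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, then alphabetically, for a stable order
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi):
    """Look up each token, falling back to the <unk> id for rare or unseen tokens."""
    unk = stoi['<unk>']
    return [stoi.get(tok, unk) for tok in tokens]

# Toy corpus (made up for illustration)
corpus = [['the', 'wand', 'glowed'], ['the', 'wand', 'broke'], ['the', 'owl', 'left']]
stoi, itos = build_vocab(corpus, min_freq=2)
print(itos)                                          # ['<unk>', '<eos>', 'the', 'wand']
print(numericalize(['the', 'owl', 'glowed'], stoi))  # [2, 0, 0]
```

Words below the cutoff ("owl", "glowed") all collapse to index 0, which is the role `vocab.set_default_index(vocab['<unk>'])` plays in the cell above.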


In [26]:
import pickle

# Pickle the vocab object
with open('vocab.pkl', 'wb') as f:
    pickle.dump(vocab, f)

print("Vocabulary pickled successfully.")

Vocabulary pickled successfully.


## 3. Prepare the batch loader

### Prepare data

Given "Chaky loves eating at AIT", and "I really love deep learning", and given batch size = 3, we will get three batches of data "Chaky loves eating at", "AIT `<eos>` I really", "love deep learning `<eos>`".  

In [8]:
def get_data(dataset, vocab, batch_size):
    data = []                                                   
    for example in dataset:
        if example['tokens']:         
            #append eos so the model learns where a sequence ends
            tokens = example['tokens'] + ['<eos>']
            #numericalize
            tokens = [vocab[token] for token in tokens]
            data.extend(tokens)
    data = torch.LongTensor(data)                                 
    num_batches = data.shape[0] // batch_size #get the int number of batches...
    data = data[:num_batches * batch_size] #make the batch evenly, and cut out any remaining                      
    data = data.view(batch_size, num_batches)          
    return data #[batch size, bunch of tokens]
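To make the reshaping concrete, here is a small list-based sketch of the same idea applied to the two example sentences from the text. `batchify` is a hypothetical stand-in for the `view(batch_size, num_batches)` step, and it works on raw strings instead of token ids purely for readability.

```python
def batchify(token_stream, batch_size):
    """Split one long token stream into `batch_size` equal-length rows."""
    tokens_per_row = len(token_stream) // batch_size
    usable = token_stream[:tokens_per_row * batch_size]   # drop the remainder
    return [usable[i * tokens_per_row:(i + 1) * tokens_per_row]
            for i in range(batch_size)]

sentences = ["Chaky loves eating at AIT", "I really love deep learning"]
stream = []
for sentence in sentences:
    stream.extend(sentence.split() + ['<eos>'])   # one continuous stream with <eos> markers

rows = batchify(stream, batch_size=3)
for row in rows:
    print(' '.join(row))
# Chaky loves eating at
# AIT <eos> I really
# love deep learning <eos>
```

Each row is a contiguous slice of the stream, so a sentence can start in one row and finish in the next.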


In [9]:
batch_size = 128
train_data = get_data(tokenized_dataset['train'], vocab, batch_size)
valid_data = get_data(tokenized_dataset['validation'], vocab, batch_size)
test_data  = get_data(tokenized_dataset['test'], vocab, batch_size)

## 4. Modeling 

In [10]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
                
        super().__init__()
        self.num_layers = num_layers
        self.hid_dim = hid_dim
        self.emb_dim = emb_dim

        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, 
                    dropout=dropout_rate, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hid_dim, vocab_size)
        
        self.init_weights()
        
    def init_weights(self):
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hid_dim)
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_emb)
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()
        for i in range(self.num_layers):
            #all_weights[i] = [weight_ih, weight_hh, bias_ih, bias_hh] of layer i;
            #initialize in place, since assigning into the all_weights list leaves the real parameters untouched
            self.lstm.all_weights[i][0].data.uniform_(-init_range_other, init_range_other)
            self.lstm.all_weights[i][1].data.uniform_(-init_range_other, init_range_other)

    def init_hidden(self, batch_size, device):
        hidden = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        cell   = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return hidden, cell
    
    def detach_hidden(self, hidden):
        hidden, cell = hidden
        hidden = hidden.detach()
        cell = cell.detach()
        return hidden, cell

    def forward(self, src, hidden):
        #src: [batch size, seq len]
        embedding = self.dropout(self.embedding(src))
        #embedding: [batch size, seq len, emb_dim]
        output, hidden = self.lstm(embedding, hidden)      
        #output: [batch size, seq len, hid_dim]
        #hidden = (h, c), each: [num_layers * num_directions, batch size, hid_dim]
        output = self.dropout(output) 
        prediction = self.fc(output)
        #prediction: [batch size, seq_len, vocab size]
        return prediction, hidden

## 5. Training 

Training follows a very basic procedure.  One note: a sequence fed to the model may span parts of several sequences from the original dataset, or be only a fragment of one, depending on the decoding length (`seq_len`). For this reason we reset the hidden state only at the start of every epoch and carry it across batches within the epoch (detached from the computational graph), which amounts to assuming that each batch of sequences is a continuation of the previous one in the original dataset.

In [11]:
vocab_size = len(vocab)
emb_dim = 1024                # 400 in the paper
hid_dim = 1024                # 1150 in the paper
num_layers = 2                # 3 in the paper
dropout_rate = 0.65              
lr = 1e-3                     

In [12]:
model = LSTMLanguageModel(vocab_size, emb_dim, hid_dim, num_layers, dropout_rate).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 24,219,176 trainable parameters


In [13]:
def get_batch(data, seq_len, idx):
    #data #[batch size, bunch of tokens]
    src    = data[:, idx:idx+seq_len]                   
    target = data[:, idx+1:idx+seq_len+1]  #target simply is ahead of src by 1            
    return src, target
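The one-step shift between `src` and `target` is easier to see on a tiny example. Below is a list-based analogue of `get_batch`; `get_batch_lists` is a hypothetical helper and the token ids are made up.

```python
def get_batch_lists(data, seq_len, idx):
    """List-based analogue of get_batch: target is src shifted one step ahead."""
    src = [row[idx:idx + seq_len] for row in data]
    target = [row[idx + 1:idx + seq_len + 1] for row in data]
    return src, target

# Two rows of made-up token ids
data = [[10, 11, 12, 13, 14, 15, 16],
        [20, 21, 22, 23, 24, 25, 26]]

src, target = get_batch_lists(data, seq_len=3, idx=0)
print(src)      # [[10, 11, 12], [20, 21, 22]]
print(target)   # [[11, 12, 13], [21, 22, 23]]
```

At every position the model reads `src[b][t]` and is trained to predict `target[b][t]`, which is simply the next token in the same row.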

In [14]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    
    epoch_loss = 0
    model.train()
    # drop all batches that are not a multiple of seq_len
    # data #[batch size, bunch of tokens]
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]  #we need to -1 because we start at 0
    num_batches = data.shape[-1]
    
    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)
    
    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        optimizer.zero_grad()
        
        #hidden does not need to be in the computational graph for efficiency
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)               

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]  
        target = target.reshape(-1)
        loss = criterion(prediction, target)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

In [15]:
def evaluate(model, data, criterion, batch_size, seq_len, device):

    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches
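The trimming expression `num_batches - (num_batches - 1) % seq_len` used in `train` and `evaluate` keeps exactly enough tokens per row for every window to have a full `src` and a full `target`. The sketch below reproduces that arithmetic on its own; `window_starts` is a hypothetical helper and the row length of 103 is made up.

```python
def window_starts(num_tokens, seq_len):
    """Return the trimmed row length and the start index of each full window."""
    trimmed = num_tokens - (num_tokens - 1) % seq_len
    starts = list(range(0, trimmed - 1, seq_len))
    return trimmed, starts

trimmed, starts = window_starts(num_tokens=103, seq_len=50)
print(trimmed, starts)   # 101 [0, 50]

# Every window needs seq_len source tokens plus one extra token for the shifted target
for idx in starts:
    assert idx + 50 + 1 <= trimmed
```

With 103 tokens per row, two tokens are dropped and the remaining 101 give two windows of 50 inputs each, the extra token serving as the final target.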

Here we will be using a `ReduceLROnPlateau` learning rate scheduler, which decreases the learning rate by a factor if the validation loss does not improve for a given number of epochs (the `patience`).
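As a rough illustration of the plateau rule, here is a simplified pure-Python sketch. `PlateauHalver` is a hypothetical class, not PyTorch's implementation: it ignores the improvement threshold, cooldown and minimum learning rate that the real scheduler supports, and the loss values are made up.

```python
class PlateauHalver:
    """Toy plateau scheduler: multiply lr by `factor` after too many bad epochs."""

    def __init__(self, lr, factor=0.5, patience=0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, metric):
        if metric < self.best:          # improvement: remember it and reset the counter
            self.best = metric
            self.bad_epochs = 0
        else:                           # no improvement this epoch
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr

scheduler = PlateauHalver(lr=1e-3, factor=0.5, patience=0)
valid_losses = [4.40, 4.31, 4.33, 4.29, 4.29]   # made-up validation losses
history = [scheduler.step(loss) for loss in valid_losses]
print(history)
```

With `patience=0`, a single epoch without improvement (the third and fifth values here) is enough to halve the learning rate.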

In [16]:
n_epochs = 50
seq_len  = 50 #<----decoding length
clip    = 0.25

lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

best_valid_loss = float('inf')

for epoch in range(n_epochs):
    train_loss = train(model, train_data, optimizer, criterion, 
                batch_size, seq_len, clip, device)
    valid_loss = evaluate(model, valid_data, criterion, batch_size, 
                seq_len, device)

    lr_scheduler.step(valid_loss)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-val-lstm_lm.pt')

    print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
    print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')

                                                         

	Train Perplexity: 473.763
	Valid Perplexity: 307.006
	Train Perplexity: 303.092
	Valid Perplexity: 234.495
	Train Perplexity: 217.639
	Valid Perplexity: 175.975
	Train Perplexity: 170.092
	Valid Perplexity: 148.373
	Train Perplexity: 142.479
	Valid Perplexity: 128.163
	Train Perplexity: 123.583
	Valid Perplexity: 113.793
	Train Perplexity: 110.523
	Valid Perplexity: 104.928
	Train Perplexity: 101.355
	Valid Perplexity: 98.180
	Train Perplexity: 94.027
	Valid Perplexity: 92.930
	Train Perplexity: 87.948
	Valid Perplexity: 89.091
	Train Perplexity: 83.163
	Valid Perplexity: 86.826
	Train Perplexity: 78.956
	Valid Perplexity: 83.763
	Train Perplexity: 75.598
	Valid Perplexity: 81.537
	Train Perplexity: 72.407
	Valid Perplexity: 79.919
	Train Perplexity: 69.530
	Valid Perplexity: 78.734
	Train Perplexity: 66.808
	Valid Perplexity: 77.545
	Train Perplexity: 64.386
	Valid Perplexity: 76.119
	Train Perplexity: 62.194
	Valid Perplexity: 75.324
	Train Perplexity: 59.981
	Valid Perplexity: 74.406
	Train Perplexity: 58.195
	Valid Perplexity: 74.661
	Train Perplexity: 55.602
	Valid Perplexity: 72.654
	Train Perplexity: 54.177
	Valid Perplexity: 71.424
	Train Perplexity: 53.145
	Valid Perplexity: 70.920
	Train Perplexity: 52.033
	Valid Perplexity: 70.939
	Train Perplexity: 50.794
	Valid Perplexity: 70.770
	Train Perplexity: 50.203
	Valid Perplexity: 70.454
	Train Perplexity: 49.767
	Valid Perplexity: 70.654
	Train Perplexity: 49.075
	Valid Perplexity: 69.811
	Train Perplexity: 48.800
	Valid Perplexity: 70.848
	Train Perplexity: 48.520
	Valid Perplexity: 69.869
	Train Perplexity: 48.393
	Valid Perplexity: 69.991
	Train Perplexity: 48.250
	Valid Perplexity: 69.957
	Train Perplexity: 48.247
	Valid Perplexity: 69.985
	Train Perplexity: 48.145
	Valid Perplexity: 69.999
	Train Perplexity: 48.147
	Valid Perplexity: 69.997
	Train Perplexity: 48.189
	Valid Perplexity: 69.995
	Train Perplexity: 48.087
	Valid Perplexity: 69.994
	Train Perplexity: 48.158
	Valid Perplexity: 69.994
	Train Perplexity: 48.124
	Valid Perplexity: 69.993
	Train Perplexity: 48.267
	Valid Perplexity: 69.993
	Train Perplexity: 48.325
	Valid Perplexity: 69.993
	Train Perplexity: 48.174
	Valid Perplexity: 69.993
	Train Perplexity: 48.268
	Valid Perplexity: 69.993
	Train Perplexity: 48.171
	Valid Perplexity: 69.993
	Train Perplexity: 48.189
	Valid Perplexity: 69.993
	Train Perplexity: 48.115
	Valid Perplexity: 69.993
	Train Perplexity: 48.384
	Valid Perplexity: 69.993
	Train Perplexity: 48.077
	Valid Perplexity: 69.993
	Train Perplexity: 48.116
	Valid Perplexity: 69.993
	Train Perplexity: 48.088
	Valid Perplexity: 69.993


## 6. Testing

In [17]:
model.load_state_dict(torch.load('best-val-lstm_lm.pt',  map_location=device))
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)
print(f'Test Perplexity: {math.exp(test_loss):.3f}')

Test Perplexity: 76.620


## 7. Real-world inference

Here we take the prompt, tokenize it, encode it, and feed it into the model to get the predictions.  We then apply softmax to the output at the last position in the sequence, which represents the prediction for the next word.  Before the softmax, we divide the logits by a temperature value to alter the model's confidence by reshaping the softmax probability distribution.

Once we have the softmax distribution, we randomly sample from it to predict the next word. If we get `<unk>`, we sample again.  Once we get `<eos>`, we stop predicting.

In the last lines, we decode the predictions back to strings.
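The effect of the temperature can be checked on a handful of numbers without a trained model. The sketch below is a minimal standard-library illustration: `softmax_with_temperature` is a hypothetical helper, the logits are made up, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                    # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 1.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])

# Sample one token id from the distribution (stand-in for torch.multinomial)
rng = random.Random(0)
next_id = rng.choices(range(len(logits)), weights=flat, k=1)[0]
print(next_id)
```

Dividing by a temperature below 1 widens the gaps between logits, so more probability mass lands on the top token and sampling becomes more conservative; a temperature of 1 leaves the distribution unchanged.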

In [18]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)
            
            #prediction: [batch size, seq len, vocab size]
            #prediction[:, -1]: [batch size, vocab size] #logits for the last position in the sequence
            
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)  
            prediction = torch.multinomial(probs, num_samples=1).item()    
            
            while prediction == vocab['<unk>']: #if it is unk, we sample again
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:    #if it is eos, we stop
                break

            indices.append(prediction) #autoregressive, thus output becomes input

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

In [19]:
prompt = 'Harry '
max_seq_len = 30
seed = 0

#the smaller the temperature, the more conservative (less diverse) the sampling;
#higher temperatures give more diverse tokens at the cost of less sensible sentences
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer, 
                          vocab, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
harry was a bit .

0.7
harry awoke on , her wand darkened as she could see the spell and led her to a second to lead her .

0.75
harry awoke on , her wand darkened as she could see the spell and led her to a crush .

0.8
harry awoke on , her wand darkened as she could see hermione had the hand on the road , leaving to lead her .

1.0
harry awoke on , ate her forehead . they had been nice to do the tournament .



### Task 1. Dataset Acquisition
#### ANS 
##### 1) The dataset used in this Assignment 2 consists of plots and conversations between characters from the Harry Potter movies.


Mickume. (2022). *Harry Potter Tiny* [Dataset]. Hugging Face. https://huggingface.co/datasets/mickume/harry_potter_tiny


### Task 2. Model Training
#### ANS 1
##### 1.1) Tokenization

tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

This line uses torchtext's built-in tokenizer called `basic_english` to tokenize the text.

The `basic_english` tokenizer splits the text into words by: 1. Lowercasing all words 2. Separating punctuation such as periods and commas into their own tokens (double quotes are dropped), as the tokenized sample in section 2 shows 3. Splitting on whitespace
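The behaviour seen in the tokenized sample can be imitated with a few lines of standard-library code. This is only a rough approximation under my own assumptions, not torchtext's implementation: `rough_basic_english` is a hypothetical function and it covers just the punctuation marks listed in its pattern.

```python
import re

def rough_basic_english(text):
    """Approximate tokenizer: lowercase, drop double quotes, split off punctuation."""
    text = text.lower().replace('"', '')
    text = re.sub(r"([.,!?()'])", r" \1 ", text)   # pad punctuation with spaces
    return text.split()

sample = 'Hermione pulled out her books. "We can get food if you like."'
tokens = rough_basic_english(sample)
print(tokens)
# ['hermione', 'pulled', 'out', 'her', 'books', '.', 'we', 'can', 'get',
#  'food', 'if', 'you', 'like', '.']
```

Periods survive as separate tokens while the double quotes disappear, which matches the pattern in the notebook's tokenized output.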

##### 1.2) Define a Tokenization Function

tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(example['text'])}

For each example in the dataset, this function tokenizes the text and returns the tokens in a dictionary under the key `tokens`.

##### 1.3) Apply the Tokenization Function to the Dataset

tokenized_dataset = dataset.map(tokenize_data, remove_columns=['text'], fn_kwargs={'tokenizer': tokenizer})

##### 1.4) Building the Vocabulary

vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_dataset['train']['tokens'], min_freq=3)

#### ANS 2

##### 2.1) Model Architecture

The LSTM-based language model (`LSTMLanguageModel`) consists of:

1. Embedding Layer: each word is mapped to a vector

2. LSTM Layer: The core of the model is an LSTM (Long Short-Term Memory) network, which is designed to capture long-range dependencies in sequential data. The LSTM has hid_dim (1024) hidden units, num_layers (2), indicating that the model uses a stacked LSTM with 2 layers, and Dropout (dropout_rate = 0.65) is applied to prevent overfitting. 

3. Loss Function: The model uses cross-entropy loss, which is typical for classification tasks, where the model predicts the next token in a sequence.

4. Model Parameters: The model contains trainable parameters for the embedding layer, LSTM layers, and output layer. The total number of parameters is printed out using num_params.

##### 2.2) Training Process

The model uses an LSTM to learn sequential patterns in text, predicting the next word in a sequence based on the previous words. The training process involves feeding batches of data through the model, computing the loss, and updating the model's parameters using backpropagation. The learning rate is adjusted based on validation performance, and the best model is saved. The performance is evaluated using perplexity, which provides an indication of how well the model predicts unseen data.
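Perplexity is the exponential of the average cross-entropy loss, which is why the training loop prints `math.exp(train_loss)`. The following sketch shows the relationship on made-up probabilities; `perplexity` is a hypothetical helper, and the vocabulary size of 3624 is the one printed earlier in the notebook.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

vocab_size = 3624
# A model that guesses uniformly over the vocabulary has perplexity equal to the vocab size
uniform = perplexity([1.0 / vocab_size] * 10)
# A model that gives the true tokens probability 0.5 or 0.25 is far less "perplexed"
confident = perplexity([0.5, 0.25, 0.5, 0.25])
print(round(uniform, 3), round(confident, 3))
```

Read this way, a perplexity of about 70 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 70 tokens, compared with 3624 for a model that has learned nothing.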