# Recurrent Neural Networks and Language Models

You guys probably very excited about ChatGPT.  In today class, we will be implementing a very simple language model, which is basically what ChatGPT is, but with a simple LSTM.  You will be surprised that it is not so difficult at all.

Paper that we base on is *Regularizing and Optimizing LSTM Language Models*, https://arxiv.org/abs/1708.02182

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext, datasets, math
from tqdm import tqdm

  from pandas.core.computation.check import NUMEXPR_INSTALLED
  from pandas.core import (


In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

#make our work comparable if restarted the kernel
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# torch.cuda.get_device_name(0)

cuda


## 1. Load data - Wiki Text

We will be using wikitext which contains a large corpus of text, perfect for language modeling task.  This time, we will use the `datasets` library from HuggingFace to load.

In [3]:
# import os
# os.environ['http_proxy']  = 'http://192.41.170.23:3128'
# os.environ['https_proxy'] = 'http://192.41.170.23:3128'

#there are raw and preprocessed version; we used the raw one and preprocessed ourselves for fun
from datasets import load_dataset
dataset = datasets.load_dataset('myothiha/starwars',)
# Limit to 20,000 rows
# dataset['train'] = dataset['train'][:24000]
# dataset['test']  = dataset['test'][:10000]
# print(dataset_20k)

README.md:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


(…)Heir to the Empire (by Timothy Zahn).txt:   0%|          | 0.00/687k [00:00<?, ?B/s]

(…) Dark Force Rising (by Timothy Zahn).txt:   0%|          | 0.00/764k [00:00<?, ?B/s]

(…)- The Last Command (by Timothy Zahn).txt:   0%|          | 0.00/820k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7860 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/8101 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/9236 [00:00<?, ? examples/s]

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 7860
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 8101
    })
    test: Dataset({
        features: ['text'],
        num_rows: 9236
    })
})

In [5]:
# from datasets import DatasetDict

# # Split the dataset manually into train, validation, and test

# train_subset = dataset['train'].select(range(20000))   # Select first 20000 rows for training

# validation_subset = dataset['train'].select(range(20000, 23000))   # Select next 3000 rows for validation

# test_subset = dataset['train'].select(range(23000, 27000))   # Select next 4000 rows for testing

# # Create a new DatasetDict with the desired subsets
# dataset_subset = DatasetDict({
#     'train': train_subset,
#     'validation': validation_subset,
#     'test': test_subset
# })

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 7860
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 8101
    })
    test: Dataset({
        features: ['text'],
        num_rows: 9236
    })
})

In [7]:
print(dataset['train'][200]['text'])

'''
If you try to change the index you might notice that sometimes there is no paragraph 
and rather an empty string so we will have to care of that later.
'''




'\nIf you try to change the index you might notice that sometimes there is no paragraph \nand rather an empty string so we will have to care of that later.\n'

## 2. Preprocessing

### Tokenizing

Simply tokenize the given text to tokens.

In [8]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

#function to tokenize
tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(example['text'])}  

#map the function to each example
tokenized_dataset = dataset.map(tokenize_data, remove_columns=['text'], fn_kwargs={'tokenizer': tokenizer})
print(tokenized_dataset['train'][333]['tokens'])

Map:   0%|          | 0/7860 [00:00<?, ? examples/s]

Map:   0%|          | 0/8101 [00:00<?, ? examples/s]

Map:   0%|          | 0/9236 [00:00<?, ? examples/s]

['good', 'night', ',', 'threepio', '.']


### Numericalizing

We will tell torchtext to add any word that has occurred at least three times in the dataset to the vocabulary because otherwise it would be too big.  Also we shall make sure to add `unk` and `eos`.

In [9]:
## numericalizing
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_dataset['train']['tokens'], 
min_freq=3) 
vocab.insert_token('<unk>', 0)           
vocab.insert_token('<eos>', 1)            
vocab.set_default_index(vocab['<unk>'])   
print(len(vocab))                         
print(vocab.get_itos()[:10])       

3449
['<unk>', '<eos>', '.', ',', 'the', "'", 'to', 'a', 'of', 'and']


## 3. Prepare the batch loader

### Prepare data

Given "Chaky loves eating at AIT", and "I really love deep learning", and given batch size = 3, we will get three batches of data "Chaky loves eating at", "AIT `<eos>` I really", "love deep learning `<eos>`".  

In [10]:
def get_data(dataset, vocab, batch_size):
    data = []                                                   
    for example in dataset:
        if example['tokens']:         
            #appends eos so we know it ends....so model learn how to end...                             
            tokens = example['tokens'].append('<eos>')   
            #numericalize          
            tokens = [vocab[token] for token in example['tokens']] 
            data.extend(tokens)                                    
    data = torch.LongTensor(data)                                 
    num_batches = data.shape[0] // batch_size #get the int number of batches...
    data = data[:num_batches * batch_size] #make the batch evenly, and cut out any remaining                      
    data = data.view(batch_size, num_batches)          
    return data #[batch size, bunch of tokens]


In [11]:
batch_size = 128
train_data = get_data(tokenized_dataset['train'], vocab, batch_size)
valid_data = get_data(tokenized_dataset['validation'], vocab, batch_size)
test_data  = get_data(tokenized_dataset['test'], vocab, batch_size)

In [12]:
import torch

# Save the full vocabulary object to 'vocab.pth'
vocab_path = "app/vocab.pth"  # Save to app folder
torch.save(vocab, vocab_path)

# Also, save vocab's itos (index-to-string) and stoi (string-to-index) mappings
vocab_dict = {
    'itos': vocab.get_itos(),
    'stoi': vocab.get_stoi()
}
torch.save(vocab_dict, 'app/vocab_dict.pth')  # Save to app folder

print(f"Vocabulary saved to {vocab_path}")


Vocabulary saved to app/vocab.pth


## 4. Modeling 

In [13]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
                
        super().__init__()
        self.num_layers = num_layers
        self.hid_dim = hid_dim
        self.emb_dim = emb_dim

        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, 
                    dropout=dropout_rate, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hid_dim, vocab_size)
        
        self.init_weights()
        
    def init_weights(self):
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hid_dim)
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_emb)
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()
        for i in range(self.num_layers):
            self.lstm.all_weights[i][0] = torch.FloatTensor(self.emb_dim,
                    self.hid_dim).uniform_(-init_range_other, init_range_other) 
            self.lstm.all_weights[i][1] = torch.FloatTensor(self.hid_dim, 
                    self.hid_dim).uniform_(-init_range_other, init_range_other) 

    def init_hidden(self, batch_size, device):
        hidden = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        cell   = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return hidden, cell
    
    def detach_hidden(self, hidden):
        hidden, cell = hidden
        hidden = hidden.detach()
        cell = cell.detach()
        return hidden, cell

    def forward(self, src, hidden):
        #src: [batch size, seq len]
        embedding = self.dropout(self.embedding(src))
        #embedding: [batch size, seq len, emb_dim]
        output, hidden = self.lstm(embedding, hidden)      
        #output: [batch size, seq len, hid_dim]
        #hidden = h, c = [num_layers * direction, seq len, hid_dim)
        output = self.dropout(output) 
        prediction = self.fc(output)
        #prediction: [batch size, seq_len, vocab size]
        return prediction, hidden

## 5. Training 

Follows very basic procedure.  One note is that some of the sequences that will be fed to the model may involve parts from different sequences in the original dataset or be a subset of one (depending on the decoding length). For this reason we will reset the hidden state every epoch, this is like assuming that the next batch of sequences is probably always a follow up on the previous in the original dataset.

In [14]:
vocab_size = len(vocab)
emb_dim = 1024                # 400 in the paper
hid_dim = 1024                # 1150 in the paper
num_layers = 2                # 3 in the paper
dropout_rate = 0.65              
lr = 1e-3                     

In [15]:
model = LSTMLanguageModel(vocab_size, emb_dim, hid_dim, num_layers, dropout_rate).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 23,860,601 trainable parameters


In [16]:
def get_batch(data, seq_len, idx):
    #data #[batch size, bunch of tokens]
    src    = data[:, idx:idx+seq_len]                   
    target = data[:, idx+1:idx+seq_len+1]  #target simply is ahead of src by 1            
    return src, target

In [17]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    
    epoch_loss = 0
    model.train()
    # drop all batches that are not a multiple of seq_len
    # data #[batch size, bunch of tokens]
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]  #we need to -1 because we start at 0
    num_batches = data.shape[-1]
    
    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)
    
    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        optimizer.zero_grad()
        
        #hidden does not need to be in the computational graph for efficiency
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)               

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]  
        target = target.reshape(-1)
        loss = criterion(prediction, target)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

In [18]:
def evaluate(model, data, criterion, batch_size, seq_len, device):

    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

Here we will be using a `ReduceLROnPlateau` learning scheduler which decreases the learning rate by a factor, if the loss don't improve by a certain epoch.

In [24]:
n_epochs = 50
seq_len  = 50 #<----decoding length
clip    = 0.25

lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

best_valid_loss = float('inf')

for epoch in range(n_epochs):
    train_loss = train(model, train_data, optimizer, criterion, 
                batch_size, seq_len, clip, device)
    valid_loss = evaluate(model, valid_data, criterion, batch_size, 
                seq_len, device)

    lr_scheduler.step(valid_loss)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-val-lstm_lm.pt')

    print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
    print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')

                                                         

	Train Perplexity: 173.354
	Valid Perplexity: 152.067


                                                         

	Train Perplexity: 152.938
	Valid Perplexity: 145.789


                                                         

	Train Perplexity: 138.528
	Valid Perplexity: 136.709


                                                         

	Train Perplexity: 127.688
	Valid Perplexity: 123.228


                                                         

	Train Perplexity: 119.189
	Valid Perplexity: 117.268


                                                         

	Train Perplexity: 111.348
	Valid Perplexity: 100.186


                                                         

	Train Perplexity: 104.541
	Valid Perplexity: 95.847


                                                         

	Train Perplexity: 99.126
	Valid Perplexity: 92.476


                                                         

	Train Perplexity: 94.392
	Valid Perplexity: 89.729


                                                         

	Train Perplexity: 90.220
	Valid Perplexity: 87.250


                                                         

	Train Perplexity: 86.613
	Valid Perplexity: 85.216


                                                         

	Train Perplexity: 83.129
	Valid Perplexity: 83.188


                                                         

	Train Perplexity: 80.358
	Valid Perplexity: 86.761


                                                         

	Train Perplexity: 77.606
	Valid Perplexity: 85.876


                                                         

	Train Perplexity: 76.071
	Valid Perplexity: 85.435


                                                         

	Train Perplexity: 75.228
	Valid Perplexity: 84.945


                                                         

	Train Perplexity: 74.846
	Valid Perplexity: 84.855


                                                         

	Train Perplexity: 74.687
	Valid Perplexity: 84.777


                                                         

	Train Perplexity: 74.847
	Valid Perplexity: 84.759


                                                         

	Train Perplexity: 74.686
	Valid Perplexity: 84.736


                                                         

	Train Perplexity: 74.520
	Valid Perplexity: 84.733


                                                         

	Train Perplexity: 74.510
	Valid Perplexity: 84.730


                                                         

	Train Perplexity: 74.518
	Valid Perplexity: 84.727


                                                         

	Train Perplexity: 74.568
	Valid Perplexity: 84.726


                                                         

	Train Perplexity: 74.575
	Valid Perplexity: 84.726


                                                         

	Train Perplexity: 74.616
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.547
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.458
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.442
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.589
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.428
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.491
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.544
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.559
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.276
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.474
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.443
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.437
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.498
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.580
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.585
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.458
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.417
	Valid Perplexity: 84.725


                                                         

	Train Perplexity: 74.579
	Valid Perplexity: 84.724


                                                         

	Train Perplexity: 74.501
	Valid Perplexity: 84.724


                                                         

	Train Perplexity: 74.662
	Valid Perplexity: 84.724


                                                         

	Train Perplexity: 74.433
	Valid Perplexity: 84.724


                                                         

	Train Perplexity: 74.497
	Valid Perplexity: 84.724


                                                         

	Train Perplexity: 74.652
	Valid Perplexity: 84.724


                                                         

	Train Perplexity: 74.304
	Valid Perplexity: 84.724


## 6. Testing

In [None]:
model.load_state_dict(torch.load('best-val-lstm_lm.pt',  map_location=device))
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)
print(f'Test Perplexity: {math.exp(test_loss):.3f}')

Test Perplexity: 80.356


## 7. Real-world inference

Here we take the prompt, tokenize, encode and feed it into the model to get the predictions.  We then apply softmax while specifying that we want the output due to the last word in the sequence which represents the prediction for the next word.  We divide the logits by a temperature value to alter the model’s confidence by adjusting the softmax probability distribution.

Once we have the Softmax distribution, we randomly sample it to make our prediction on the next word. If we get <unk> then we give that another try.  Once we get <eos> we stop predicting.
    
We decode the prediction back to strings last lines.

In [26]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)
            
            #prediction: [batch size, seq len, vocab size]
            #prediction[:, -1]: [batch size, vocab size] #probability of last vocab
            
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)  
            prediction = torch.multinomial(probs, num_samples=1).item()    
            
            while prediction == vocab['<unk>']: #if it is unk, we sample again
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:    #if it is eos, we stop
                break

            indices.append(prediction) #autoregressive, thus output becomes input

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

In [28]:
prompt = 'pellaeon'
max_seq_len = 30
seed = 0

#smaller the temperature, more diverse tokens but comes 
#with a tradeoff of less-make-sense sentence
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer, 
                          vocab, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
pellaeon urge boy [we forcing sweetheart worst sweetheart twin slender upper unstrapping forcing awfully quads wishing forcing ush forcing ush fix waving decrypt waving ush ease crouching numbers contrary forgive blankets

0.7
pellaeon urge boy [we reverend sweetheart worst sweetheart twin south upper unstrapping peer awfully bed wishing forcing ush yellow-clad slaves fix waving decrypt waving blankets ease crouching numbers contrary forgive blankets

0.75
pellaeon urge boy [we viewports sweetheart worst sweetheart twin south upper unstrapping curious awfully bed wishing forcing ush yellow-clad slaves fix waving decrypt waving blankets ease crouching numbers contrary forgive blankets

0.8
pellaeon urge boy [we viewports sweetheart worst sweetheart twin south upper unstrapping curious awfully bed wishing forcing ush yellow-clad slaves fix waving decrypt waving blankets ease crouching numbers contrary forgive blankets

1.0
pellaeon urge boy [we preparing sweetheart thealliance sweeth