# Recurrent Neural Networks and Language Models

You guys probably very excited about ChatGPT.  In today class, we will be implementing a very simple language model, which is basically what ChatGPT is, but with a simple LSTM.  You will be surprised that it is not so difficult at all.

Paper that we base on is *Regularizing and Optimizing LSTM Language Models*, https://arxiv.org/abs/1708.02182

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim

import torchtext, datasets, math
from tqdm import tqdm
from huggingface_hub import PyTorchModelHubMixin

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

#make our work comparable if restarted the kernel
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# torch.cuda.get_device_name(0)

cpu


## 1. Load data - Game of Throne Books

I will use three books written by George R. R. Martin to train the models. These books are about the stories in the Game of Throne Universe.

1. A Game of Thrones
2. A Clash of Kings
3. A Storm of Swords

This dataset is downloaded from https://www.kaggle.com/datasets/saurabhbadole/game-of-thrones-book-dataset.
Then, prepare and upload the files to https://huggingface.co/datasets/kaung-nyo-lwin/game-of-throne-text to train the model.

After splitting the sentences, there are over 94740 sentences. 90 percent of them are used for training and 5 percent each for test and validation.

In [5]:
from nltk import sent_tokenize
import pandas as pd 

In [6]:
def tokenize_sentence(filepaths):
    data = []
    for filepath in filepaths:
        with open(filepath) as file:
            textFile = file.read()
        data.extend(sent_tokenize(textFile))
    return data

In [7]:
filepaths = ['./got/1 - A Game of Thrones.txt', './got/2 - A Clash of Kings.txt','./got/3 - A Storm of Swords.txt']

corpus = tokenize_sentence(filepaths)

In [9]:
train = corpus[0:int(len(corpus) * 0.9)]
val = corpus[int(len(corpus) * 0.9):int(len(corpus) * 0.95)]
test = corpus[int(len(corpus) * 0.95):int(len(corpus) *1)]

pd.DataFrame({"text":train}).to_csv('./got/data/train.csv',index=False)
pd.DataFrame({"text":val}).to_csv('./got/data/validation.csv',index=False)
pd.DataFrame({"text":test}).to_csv('./got/data/test.csv',index=False)

In [12]:
# import os
# os.environ['http_proxy']  = 'http://192.41.170.23:3128'
# os.environ['https_proxy'] = 'http://192.41.170.23:3128'

#there are raw and preprocessed version; we used the raw one and preprocessed ourselves for fun
dataset = datasets.load_dataset('kaung-nyo-lwin/game-of-throne-text')
print(dataset)

README.md:   0%|          | 0.00/24.0 [00:00<?, ?B/s]

train.csv:   0%|          | 0.00/5.28M [00:00<?, ?B/s]

validation.csv:   0%|          | 0.00/288k [00:00<?, ?B/s]

test.csv:   0%|          | 0.00/324k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/85266 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/4737 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/4737 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 85266
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 4737
    })
    test: Dataset({
        features: ['text'],
        num_rows: 4737
    })
})


In [5]:
print(dataset['train'][45]['text'])

'''
If you try to change the index you might notice that sometimes there is no paragraph 
and rather an empty string so we will have to care of that later.
'''

Will wanted nothing so much as to ride 
hellbent for the safety of the Wall, but that was not a feeling to share with your commander.


'\nIf you try to change the index you might notice that sometimes there is no paragraph \nand rather an empty string so we will have to care of that later.\n'

## 2. Preprocessing

### Tokenizing

Simply tokenize the given text to tokens.

In [15]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

#function to tokenize
tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(example['text'])}  

#map the function to each example
tokenized_dataset = dataset.map(tokenize_data, remove_columns=['text'], fn_kwargs={'tokenizer': tokenizer})
print(tokenized_dataset['train'][45]['tokens'])

Map:   0%|          | 0/85266 [00:00<?, ? examples/s]

Map:   0%|          | 0/4737 [00:00<?, ? examples/s]

Map:   0%|          | 0/4737 [00:00<?, ? examples/s]

['will', 'wanted', 'nothing', 'so', 'much', 'as', 'to', 'ride', 'hellbent', 'for', 'the', 'safety', 'of', 'the', 'wall', ',', 'but', 'that', 'was', 'not', 'a', 'feeling', 'to', 'share', 'with', 'your', 'commander', '.']


### Numericalizing

We will tell torchtext to add any word that has occurred at least three times in the dataset to the vocabulary because otherwise it would be too big.  Also we shall make sure to add `unk` and `eos`.

In [16]:
## numericalizing
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_dataset['train']['tokens'], 
min_freq=3) 
vocab.insert_token('<unk>', 0)           
vocab.insert_token('<eos>', 1)            
vocab.set_default_index(vocab['<unk>'])   
print(len(vocab))                         
print(vocab.get_itos()[:10])       

11506
['<unk>', '<eos>', '.', ',', 'the', 'and', 'a', 'to', 'of', "'"]


## 3. Prepare the batch loader

### Prepare data

With batch size = 128, prepare the data by adding '<eos>' to each sentence and numbericalizing them. 

In [8]:
def get_data(dataset, vocab, batch_size):
    data = []                                                   
    for example in dataset:
        if example['tokens']:         
            #appends eos so we know it ends....so model learn how to end...                             
            tokens = example['tokens'].append('<eos>')   
            #numericalize          
            tokens = [vocab[token] for token in example['tokens']] 
            data.extend(tokens)                                    
    data = torch.LongTensor(data)                                 
    num_batches = data.shape[0] // batch_size #get the int number of batches...
    data = data[:num_batches * batch_size] #make the batch evenly, and cut out any remaining                      
    data = data.view(batch_size, num_batches)          
    return data #[batch size, bunch of tokens]


In [9]:
batch_size = 128
train_data = get_data(tokenized_dataset['train'], vocab, batch_size)
valid_data = get_data(tokenized_dataset['validation'], vocab, batch_size)
test_data  = get_data(tokenized_dataset['test'], vocab, batch_size)

## 4. Modeling 

The LSTM model is build with the following conifgurationg
* vocab size - 11506
* embedding size - 1024
* hidden size - 1024
* number of layers - 2
* dropout rate - 0.65

These are stored in the config dict to be used in the uploading the model.

In [None]:
# (write model file to be used in web app)
# %%writefile './app/model.py' 

# import torch
# import torch.nn as nn
# import torch.optim as optim

# import torchtext, datasets, math
# from tqdm import tqdm
# from huggingface_hub import PyTorchModelHubMixin

# add PyTorchModelHubMixin to inherit for uploading model to hugging face
class LSTMLanguageModel(nn.Module, PyTorchModelHubMixin):
    # optionally, you can add metadata which gets pushed to the model card
    # repo_url="kaung-nyo-lwin/nlp_a2_lm",
    # pipeline_tag="text-generation",
    # license="mit")
                        
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
                
        super().__init__()
        self.num_layers = num_layers
        self.hid_dim = hid_dim
        self.emb_dim = emb_dim

        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, 
                    dropout=dropout_rate, batch_first=True)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc = nn.Linear(hid_dim, vocab_size)
        
        self.init_weights()
        
    def init_weights(self):
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hid_dim)
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_emb)
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()
        for i in range(self.num_layers):
            self.lstm.all_weights[i][0] = torch.FloatTensor(self.emb_dim,
                    self.hid_dim).uniform_(-init_range_other, init_range_other) 
            self.lstm.all_weights[i][1] = torch.FloatTensor(self.hid_dim, 
                    self.hid_dim).uniform_(-init_range_other, init_range_other) 

    def init_hidden(self, batch_size, device):
        hidden = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        cell   = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return hidden, cell
    
    def detach_hidden(self, hidden):
        hidden, cell = hidden
        hidden = hidden.detach()
        cell = cell.detach()
        return hidden, cell

    def forward(self, src, hidden):
        #src: [batch size, seq len]
        embedding = self.dropout(self.embedding(src))
        #embedding: [batch size, seq len, emb_dim]
        output, hidden = self.lstm(embedding, hidden)      
        #output: [batch size, seq len, hid_dim]
        #hidden = h, c = [num_layers * direction, seq len, hid_dim)
        output = self.dropout(output) 
        prediction = self.fc(output)
        #prediction: [batch size, seq_len, vocab size]
        return prediction, hidden
    
    
config = {"vocab_size" : 11506,
          "emb_dim" : 1024, 
          "hid_dim" : 1024, 
          "num_layers" : 2, 
          "dropout_rate" : 0.65}

UsageError: Line magic function `%%writefile` not found.


## 5. Training 

Follows very basic procedure.  One note is that some of the sequences that will be fed to the model may involve parts from different sequences in the original dataset or be a subset of one (depending on the decoding length). For this reason we will reset the hidden state every epoch, this is like assuming that the next batch of sequences is probably always a follow up on the previous in the original dataset.

The model is trained for 50 epochs with learning rate of 0.001.

In [None]:
vocab_size = len(vocab)
emb_dim = 1024                # 400 in the paper
hid_dim = 1024                # 1150 in the paper
num_layers = 2                # 3 in the paper
dropout_rate = 0.65              
lr = 1e-3  

In [12]:
model = LSTMLanguageModel(vocab_size, emb_dim, hid_dim, num_layers, dropout_rate).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 40,369,394 trainable parameters


In [13]:
def get_batch(data, seq_len, idx):
    #data #[batch size, bunch of tokens]
    src    = data[:, idx:idx+seq_len]                   
    target = data[:, idx+1:idx+seq_len+1]  #target simply is ahead of src by 1            
    return src, target

In [14]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    
    epoch_loss = 0
    model.train()
    # drop all batches that are not a multiple of seq_len
    # data #[batch size, bunch of tokens]
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]  #we need to -1 because we start at 0
    num_batches = data.shape[-1]
    
    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)
    
    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        optimizer.zero_grad()
        
        #hidden does not need to be in the computational graph for efficiency
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)               

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]  
        target = target.reshape(-1)
        loss = criterion(prediction, target)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

In [15]:
def evaluate(model, data, criterion, batch_size, seq_len, device):

    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

Here we will be using a `ReduceLROnPlateau` learning scheduler which decreases the learning rate by a factor, if the loss don't improve by a certain epoch.

In [16]:
n_epochs = 50
seq_len  = 50 #<----decoding length
clip    = 0.25

lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

best_valid_loss = float('inf')

for epoch in range(n_epochs):
    train_loss = train(model, train_data, optimizer, criterion, 
                batch_size, seq_len, clip, device)
    valid_loss = evaluate(model, valid_data, criterion, batch_size, 
                seq_len, device)

    lr_scheduler.step(valid_loss)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-val-lstm_lm.pt')

    print(f"Epoch Number: {epoch+1}")
    print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
    print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')

                                                           

Epoch Number: 1
	Train Perplexity: 287.392
	Valid Perplexity: 255.807


                                                           

Epoch Number: 2
	Train Perplexity: 133.837
	Valid Perplexity: 100.573


                                                           

Epoch Number: 3
	Train Perplexity: 102.963
	Valid Perplexity: 86.258


                                                           

Epoch Number: 4
	Train Perplexity: 89.560
	Valid Perplexity: 78.631


                                                           

Epoch Number: 5
	Train Perplexity: 81.160
	Valid Perplexity: 73.457


                                                           

Epoch Number: 6
	Train Perplexity: 75.039
	Valid Perplexity: 70.032


                                                           

Epoch Number: 7
	Train Perplexity: 70.254
	Valid Perplexity: 67.378


                                                           

Epoch Number: 8
	Train Perplexity: 66.486
	Valid Perplexity: 65.311


                                                           

Epoch Number: 9
	Train Perplexity: 63.247
	Valid Perplexity: 63.601


                                                           

Epoch Number: 10
	Train Perplexity: 60.283
	Valid Perplexity: 62.153


                                                           

Epoch Number: 11
	Train Perplexity: 57.829
	Valid Perplexity: 61.043


                                                           

Epoch Number: 12
	Train Perplexity: 55.702
	Valid Perplexity: 60.432


                                                           

Epoch Number: 13
	Train Perplexity: 53.793
	Valid Perplexity: 59.691


                                                           

Epoch Number: 14
	Train Perplexity: 52.056
	Valid Perplexity: 58.805


                                                           

Epoch Number: 15
	Train Perplexity: 50.377
	Valid Perplexity: 58.431


                                                           

Epoch Number: 16
	Train Perplexity: 48.967
	Valid Perplexity: 57.984


                                                           

Epoch Number: 17
	Train Perplexity: 47.563
	Valid Perplexity: 57.467


                                                           

Epoch Number: 18
	Train Perplexity: 46.555
	Valid Perplexity: 57.329


                                                           

Epoch Number: 19
	Train Perplexity: 45.197
	Valid Perplexity: 57.068


                                                           

Epoch Number: 20
	Train Perplexity: 44.137
	Valid Perplexity: 56.915


                                                           

Epoch Number: 21
	Train Perplexity: 43.055
	Valid Perplexity: 56.778


                                                           

Epoch Number: 22
	Train Perplexity: 42.115
	Valid Perplexity: 56.703


                                                           

Epoch Number: 23
	Train Perplexity: 41.197
	Valid Perplexity: 56.607


                                                           

Epoch Number: 24
	Train Perplexity: 40.484
	Valid Perplexity: 56.783


                                                           

Epoch Number: 25
	Train Perplexity: 38.846
	Valid Perplexity: 56.505


                                                           

Epoch Number: 26
	Train Perplexity: 38.052
	Valid Perplexity: 56.405


                                                           

Epoch Number: 27
	Train Perplexity: 37.548
	Valid Perplexity: 56.283


                                                           

Epoch Number: 28
	Train Perplexity: 37.022
	Valid Perplexity: 56.293


                                                           

Epoch Number: 29
	Train Perplexity: 36.364
	Valid Perplexity: 56.174


                                                           

Epoch Number: 30
	Train Perplexity: 36.049
	Valid Perplexity: 56.121


                                                           

Epoch Number: 31
	Train Perplexity: 35.725
	Valid Perplexity: 56.165


                                                           

Epoch Number: 32
	Train Perplexity: 35.329
	Valid Perplexity: 56.055


                                                           

Epoch Number: 33
	Train Perplexity: 35.201
	Valid Perplexity: 55.984


                                                           

Epoch Number: 34
	Train Perplexity: 35.017
	Valid Perplexity: 56.024


                                                           

Epoch Number: 35
	Train Perplexity: 34.789
	Valid Perplexity: 55.991


                                                           

Epoch Number: 36
	Train Perplexity: 34.660
	Valid Perplexity: 55.989


                                                           

Epoch Number: 37
	Train Perplexity: 34.578
	Valid Perplexity: 55.962


                                                           

Epoch Number: 38
	Train Perplexity: 34.502
	Valid Perplexity: 55.971


                                                           

Epoch Number: 39
	Train Perplexity: 34.533
	Valid Perplexity: 55.971


                                                           

Epoch Number: 40
	Train Perplexity: 34.516
	Valid Perplexity: 55.971


                                                           

Epoch Number: 41
	Train Perplexity: 34.518
	Valid Perplexity: 55.971


                                                           

Epoch Number: 42
	Train Perplexity: 34.491
	Valid Perplexity: 55.971


                                                           

Epoch Number: 43
	Train Perplexity: 34.457
	Valid Perplexity: 55.971


                                                           

Epoch Number: 44
	Train Perplexity: 34.473
	Valid Perplexity: 55.971


                                                           

Epoch Number: 45
	Train Perplexity: 34.504
	Valid Perplexity: 55.971


                                                           

Epoch Number: 46
	Train Perplexity: 34.511
	Valid Perplexity: 55.971


                                                           

Epoch Number: 47
	Train Perplexity: 34.504
	Valid Perplexity: 55.971


                                                           

Epoch Number: 48
	Train Perplexity: 34.472
	Valid Perplexity: 55.971


                                                           

Epoch Number: 49
	Train Perplexity: 34.476
	Valid Perplexity: 55.971


                                                           

Epoch Number: 50
	Train Perplexity: 34.476
	Valid Perplexity: 55.971


## 6. Testing

In [17]:
model.load_state_dict(torch.load('best-val-lstm_lm.pt',  map_location=device))
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)
print(f'Test Perplexity: {math.exp(test_loss):.3f}')

Test Perplexity: 50.190


## 7. Real-world inference

Here we take the prompt, tokenize, encode and feed it into the model to get the predictions.  We then apply softmax while specifying that we want the output due to the last word in the sequence which represents the prediction for the next word.  We divide the logits by a temperature value to alter the model’s confidence by adjusting the softmax probability distribution.

Once we have the Softmax distribution, we randomly sample it to make our prediction on the next word. If we get <unk> then we give that another try.  Once we get <eos> we stop predicting.
    
We decode the prediction back to strings last lines.

In [18]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)
            
            #prediction: [batch size, seq len, vocab size]
            #prediction[:, -1]: [batch size, vocab size] #probability of last vocab
            
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)  
            prediction = torch.multinomial(probs, num_samples=1).item()    
            
            while prediction == vocab['<unk>']: #if it is unk, we sample again
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:    #if it is eos, we stop
                break

            indices.append(prediction) #autoregressive, thus output becomes input

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

In [44]:
prompt = 'Iron throne is'
max_seq_len = 30
seed = 0
# model.to(device)
#smaller the temperature, more diverse tokens but comes 
#with a tradeoff of less-make-sense sentence
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer, 
                          vocab, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
iron throne is the same thing to be found .

0.7
iron throne is the same thing to be found .

0.75
iron throne is the same thing to be found from his side .

0.8
iron throne is the same thing to be found from his side .

1.0
iron throne is the size of the jade sea that you could triumph .



## 8. Uploading the model to hugging face for Web Application

The model will be uploaded to hugging face since the web application will interface with the model through hugging face. This is to avoid uploading the large file to git hub.



In [40]:
model = LSTMLanguageModel(**config)
model.push_to_hub("kaung-nyo-lwin/nlp_a2_lm")

model.safetensors:   0%|          | 0.00/161M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/kaung-nyo-lwin/nlp_a2_lm/commit/577c52096ffdc94050ff66ff88758c5f4f58ee7a', commit_message='Push model using huggingface_hub.', commit_description='', oid='577c52096ffdc94050ff66ff88758c5f4f58ee7a', pr_url=None, repo_url=RepoUrl('https://huggingface.co/kaung-nyo-lwin/nlp_a2_lm', endpoint='https://huggingface.co', repo_type='model', repo_id='kaung-nyo-lwin/nlp_a2_lm'), pr_revision=None, pr_num=None)

In [18]:
model = LSTMLanguageModel(**config)
model = model.from_pretrained("kaung-nyo-lwin/nlp_a2_lm")

config.json:   0%|          | 0.00/106 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/161M [00:00<?, ?B/s]