# NLP: Assignment 2 Text Generation
Name: Sitthiwat Damrongpreechar <br>
Student ID: st123994

## 0. Import libraries and set up the device
Before getting started, I import the necessary libraries, chiefly PyTorch, which is used to build and train the LSTM language model (LSTM-LM). tqdm is imported to display a progress bar, which helps track training progress and estimate the remaining time. Since my laptop supports CUDA, I set the device to 'cuda' for faster training.

In [1]:
# Importing libraries
import torch
import torch.nn as nn 
import torch.optim as optim
import torchtext, datasets, math    
from tqdm import tqdm 

In [2]:
# Check the versions of torch and torchtext
torch.__version__, torchtext.__version__

('2.2.0+cu121', '0.17.0+cpu')

In [3]:
# Setting up the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [4]:
# Set the seed value all over the place to make this reproducible.
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## 1. Load data - Harry Potter Theme


To create the text generator, a specific theme needs to be chosen; I selected Harry Potter. The dataset comes from [Hugging Face](https://huggingface.co/) and is named ["novel-text", created by OswaldHe123](https://huggingface.co/datasets/OswaldHe123/novel-text). It is derived from the book ["Harry Potter and the Goblet of Fire"](https://en.wikipedia.org/wiki/Harry_Potter_and_the_Goblet_of_Fire).

In [5]:
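# load the dataset from huggingface datasets
# note: this rebinds the name `datasets` from the module to the loaded DatasetDict,
# so re-running this cell without re-importing raises an AttributeError
datasets = datasets.load_dataset("OswaldHe123/novel-text")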
# print the datasets
print(datasets)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 54142
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 5996
    })
    test: Dataset({
        features: ['text'],
        num_rows: 9116
    })
})


The output above shows three splits: train, validation, and test, with 54,142, 5,996, and 9,116 rows respectively.

## 2. Preprocessing

Preprocessing consists of two steps: tokenizing the text and numericalizing the tokens, which yields the tokenized datasets and the vocabulary.
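
As a rough illustration of the two steps, the sketch below uses only the standard library. The names `simple_tokenize`, `build_vocab`, and `numericalize`, as well as the toy corpus, are hypothetical stand-ins and not torchtext APIs. It lowercases and splits the text, builds a vocabulary with a minimum frequency, and maps unseen or rare tokens to `<unk>`.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Lowercase and split into words and single punctuation marks
    # (a simplified stand-in for torchtext's 'basic_english' tokenizer).
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

def build_vocab(token_lists, min_freq=1, specials=("<unk>", "<eos>")):
    # Count tokens, keep those seen at least min_freq times,
    # and place the special tokens at the lowest indices.
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    return {tok: idx for idx, tok in enumerate(itos)}

def numericalize(tokens, stoi):
    # Tokens missing from the vocabulary fall back to the <unk> index.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = ["Harry looked at Ron.", "Ron looked at Harry."]
tokenized = [simple_tokenize(line) for line in corpus]
stoi = build_vocab(tokenized, min_freq=2)

ids = numericalize(simple_tokenize("Hermione looked at Harry."), stoi)
print(tokenized[0])  # ['harry', 'looked', 'at', 'ron', '.']
print(ids)           # [0, 5, 3, 4, 2]
```

Here "hermione" never appears in the toy corpus, so it is mapped to index 0, the `<unk>` token.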

### 2.1 Tokenizing

In [6]:
# create the tokenizer
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
# create the function to tokenize the data
tokenize_data = lambda example, tokenizer:{'tokens':tokenizer(example['text'])}
# tokenize the data and remove the text column
tokenized_datasets = datasets.map(tokenize_data, remove_columns=['text'], fn_kwargs={'tokenizer':tokenizer})

In [7]:
# check the tokenized data
print(tokenized_datasets['train'][224]['tokens'])

['well', ',', 'all', 'right', 'then', '.', 'you', 'can', 'go', 'to', 'this', 'ruddy', '.', '.', '.', 'this', 'stupid', '.', '.', '.', 'this', 'world', 'cup', 'thing', '.', 'you', 'write', 'and', 'tell', 'these', '-', 'these', 'weasleys', 'they', "'", 're', 'to', 'pick', 'you', 'up', ',', 'mind', '.', 'i', 'haven', "'", 't', 'got', 'time', 'to', 'go', 'dropping', 'you', 'off', 'all', 'over', 'the', 'country', '.', 'and', 'you', 'can', 'spend', 'the', 'rest', 'of', 'the', 'summer', 'there', '.', 'and', 'you', 'can', 'tell', 'your', '-', 'your', 'godfather', '.', '.', '.', 'tell', 'him', '.', '.', '.', 'tell', 'him', 'you', "'", 're', 'going', '.']


### 2.2 Numericalizing

In [8]:
# build the vocabulary
vocab = torchtext.vocab.build_vocab_from_iterator(tokenized_datasets['train']['tokens'], min_freq=3)
# insert the special tokens
vocab.insert_token('<unk>',0)
vocab.insert_token('<eos>',1)
# set the default index
vocab.set_default_index(vocab['<unk>'])

In [9]:
# check the vocab size
print(len(vocab))

15174


In [10]:
# check the vocab tokens of the first 10 indices
print(vocab.get_itos()[:10])

['<unk>', '<eos>', ',', '.', 'the', "'", 'and', 'to', 'of', 'a']


## 3. Prepare the batch loader

In [11]:
# create the function to prepare the data for training
def get_data(datasets, vocab, batch_size):
    data = []
    for example in datasets:
        if example['tokens']:
            # append the end-of-sequence marker, then map tokens to indices
            tokens = example['tokens'] + ['<eos>']
            token_ids = [vocab[token] for token in tokens]
            data.extend(token_ids)
    data = torch.LongTensor(data)
    num_batches = data.shape[0] // batch_size #from example 12 // 3 = 4 #integer division
    data = data[:num_batches*batch_size]
    data = data.view(batch_size, num_batches) # 3,4 #view vs reshape (whether data is contiguous or not)
    return data # [batch_size, seq len]


In [12]:
# determine the batch size and prepare the train, test, validation data
batch_size = 128
train_data = get_data(tokenized_datasets['train'], vocab, batch_size)
valid_data = get_data(tokenized_datasets['validation'], vocab, batch_size)
test_data = get_data(tokenized_datasets['test'], vocab, batch_size)

In [13]:
# check the shape of the train data
train_data.shape

torch.Size([128, 13086])
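
The shape `[128, 13086]` means that 128 × 13,086 = 1,675,008 token ids are kept, arranged as 128 contiguous slices of the token stream. The reshaping can be sketched with plain Python lists; `batchify` is a hypothetical name used only for this illustration, and the toy stream stands in for the real token ids.

```python
def batchify(token_ids, batch_size):
    # Drop the tail that does not fill a whole row, then cut the stream
    # into batch_size contiguous rows (the same layout that
    # data.view(batch_size, num_batches) produces on a flat tensor).
    row_len = len(token_ids) // batch_size
    trimmed = token_ids[:row_len * batch_size]
    return [trimmed[r * row_len:(r + 1) * row_len] for r in range(batch_size)]

stream = list(range(14))           # 14 toy token ids
rows = batchify(stream, batch_size=3)
for row in rows:
    print(row)
# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9, 10, 11]
```

The last two ids (12 and 13) are discarded because they do not fill a complete row.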

## 4. Modeling

### Model Architecture
The LSTM language model is defined as a class with four components: an embedding layer that converts input tokens into continuous vectors, a stacked LSTM, dropout for regularization, and a fully connected layer that maps hidden states to vocabulary logits. The model is trained to predict the next token in a sequence. Helper methods initialize the weights and create or detach the hidden states.

In [14]:
# create the class of LSTM language model
class LSTMLanguageModel(nn.Module):
    # define the class attributes
    def __init__(self,vocab_size, hid_dim, emb_dim, num_layers,dropout_rate):
        super().__init__()
        # define the number of layers
        self.num_layers = num_layers
        # store the hidden dimension
        self.hid_dim = hid_dim
        # store the embedding dimension
        self.emb_dim = emb_dim
        # define the embedding layer
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # define the LSTM layer
        self.lstm = nn.LSTM(emb_dim, hid_dim, 
                            num_layers=num_layers, dropout=dropout_rate, 
                            batch_first=True) #dropout is applied to the output of each LSTM layer except the last layer
        # define the dropout layer
        self.dropout = nn.Dropout(dropout_rate)
        # define the fully connected layer
        self.fc = nn.Linear(hid_dim, vocab_size)
        # initialize the weights
        self.init_weights()
    # define the function to initialize the weights
    def init_weights(self):
        # define the parameters for the embedding layer
        init_range_emb = 0.1
        # define the parameters for the other layers
        init_range_other = 1/math.sqrt(self.hid_dim)
        # make the embedding layer weights uniform
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_emb)
        # make the fully connected layer weights uniform
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        # make the fully connected layer bias zero
        self.fc.bias.data.zero_()
        # make the LSTM layer weights uniform, in place
        # (assigning new tensors into self.lstm.all_weights does not change the parameters)
        for i in range(self.num_layers):
            self.lstm.all_weights[i][0].data.uniform_(-init_range_other, init_range_other) #We
            self.lstm.all_weights[i][1].data.uniform_(-init_range_other, init_range_other) #Wh
    # define the function to initialize the hidden and cell states with zeros
    def init_hidden(self, batch_size, device):
        hidden = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        cell = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return hidden, cell
    # define the function to detach the hidden layer
    def detach_hidden(self, hidden):
        hidden, cell = hidden
        hidden  = hidden.detach() # not to be used for gradient calculation
        cell = cell.detach()
        return hidden, cell
    # define the forward pass
    def forward(self, src, hidden):
        #src: [batch_size, seq_len]

        embedded = self.dropout(self.embedding(src)) 
        # embedding: [batch_size, seq_len, emb_dim]

        output, hidden = self.lstm(embedded, hidden)
        # output: [batch_size, seq_len, hid_dim]
        # hidden: tuple (h, c), each [num_layers * direction, batch_size, hid_dim]

        output = self.dropout(output)
        prediction = self.fc(output)
        # prediction: [batch_size, seq_len, vocab_size]
        return prediction, hidden

## 5. Training

In [15]:
# define the parameters for the model
vocab_size = len(vocab)
emb_dim = 400 # embedding dimension (in video used 1024, 400 in paper)
hid_dim = 800 # hidden dimension (in video used 1024, 1150 in paper)
num_layers = 2 # number of layers (in video used 2, 3 in paper)
dropout_rate = 0.65
# define the learning rate
lr = 1e-4

In [16]:
# assign the model to the device
model = LSTMLanguageModel(vocab_size, hid_dim, emb_dim, num_layers, dropout_rate).to(device)
# define the optimizer that is Adam
optimizer = optim.Adam(model.parameters(), lr=lr)
# define the loss function that is cross entropy
criterion = nn.CrossEntropyLoss()
# define the variable to check the number of trainable parameters
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 27,196,774 trainable parameters
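
The reported parameter count can be cross-checked by hand. The sketch below assumes the layer layout of the class above (one embedding table, a stacked unidirectional LSTM with two bias vectors per layer as in `nn.LSTM`, and one linear output layer); `lstm_lm_param_count` is a hypothetical helper name used only for this check.

```python
def lstm_lm_param_count(vocab_size, emb_dim, hid_dim, num_layers):
    # Embedding table: one emb_dim vector per vocabulary entry.
    embedding = vocab_size * emb_dim
    lstm = 0
    for layer in range(num_layers):
        # The first layer reads embeddings; deeper layers read hidden states.
        in_dim = emb_dim if layer == 0 else hid_dim
        # 4 gates, each with input weights and recurrent weights,
        # plus two bias vectors (b_ih and b_hh) of size 4 * hid_dim.
        lstm += 4 * hid_dim * (in_dim + hid_dim) + 2 * 4 * hid_dim
    # Output projection: weight matrix plus bias.
    fc = hid_dim * vocab_size + vocab_size
    return embedding + lstm + fc

total = lstm_lm_param_count(vocab_size=15174, emb_dim=400, hid_dim=800, num_layers=2)
print(f"{total:,}")  # 27,196,774
```

Nearly half of the parameters (12,154,374) sit in the output layer, and another 6,069,600 in the embedding table, so the vocabulary size dominates the model size.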


In [17]:
# define the batch function for training 
def get_batch(data, seq_len, idx):
    #data # [batch_size, bunch of tokens]
    src = data[:,idx:idx+seq_len]
    target = data[:,idx+1:idx+1+seq_len] # shifted ahead src by 1
    return src, target
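
To make the one-step shift concrete, here is the same slicing on plain Python lists. `get_batch_lists` and the toy rows are hypothetical and only mirror what `get_batch` does on a tensor.

```python
def get_batch_lists(rows, seq_len, idx):
    # Same slicing as get_batch, but on plain Python lists.
    src = [row[idx:idx + seq_len] for row in rows]
    target = [row[idx + 1:idx + 1 + seq_len] for row in rows]
    return src, target

rows = [[10, 11, 12, 13, 14, 15, 16],
        [20, 21, 22, 23, 24, 25, 26]]
src, target = get_batch_lists(rows, seq_len=3, idx=0)
print(src)     # [[10, 11, 12], [20, 21, 22]]
print(target)  # [[11, 12, 13], [21, 22, 23]]
```

Each target position holds the token that follows the corresponding source position, which is what the model is trained to predict.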

In [18]:
# define the function to train the model
def train(model, data, optimizer, criterion,batch_size,seq_len, clip, device):
    # define the epoch loss
    epoch_loss = 0
    # set the model to train mode
    model.train()
    #drop all batches that are not a multiple of seq_len
    # data # [batch_size, bunch of tokens(seq_len)]
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]  #we need to -1 because we start at 0
    num_batches = data.shape[-1]

    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)
    #loop through the batches
    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        #zero the gradients
        optimizer.zero_grad()
        
        #hidden does not need to be in the computational graph for efficiency
        hidden = model.detach_hidden(hidden)
        #get the batch
        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)               

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]  
        target = target.reshape(-1)
        loss = criterion(prediction, target)
        #backprop
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        #update
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches
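
The `clip` argument bounds the overall gradient norm before each optimizer step. A plain-Python sketch of the idea behind `torch.nn.utils.clip_grad_norm_` follows, assuming a flat list of scalar gradients; `clip_by_global_norm` is a hypothetical helper, and the real function works in place on parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    # Treat all gradients as one long vector and measure its L2 norm.
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm exceeds the threshold.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]                       # L2 norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.25)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, round(norm_after, 4))
```

The direction of the gradient is preserved; only its length is capped, which helps keep LSTM training stable when gradients spike.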

In [19]:
# define the function to evaluate the model
def evaluate(model, data, criterion, batch_size, seq_len, device):
    # define the epoch loss
    epoch_loss = 0
    # set the model to eval mode
    model.eval()

    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]
    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)
    #loop through the batches without updating the gradients
    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches
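
The trimming expression `num_batches - (num_batches - 1) % seq_len` keeps a prefix of length `k * seq_len + 1`, so the last source window still has a full target window. A small worked check, using the training row length 13086 and `seq_len = 50` from this notebook as the example; `usable_length` is a hypothetical helper name.

```python
def usable_length(num_tokens, seq_len):
    # Keep a prefix whose length is k * seq_len + 1, so that every source
    # window of seq_len tokens has a full target window shifted by one.
    return num_tokens - (num_tokens - 1) % seq_len

num_tokens, seq_len = 13086, 50
kept = usable_length(num_tokens, seq_len)
starts = list(range(0, kept - 1, seq_len))
print(kept, len(starts), starts[-1])  # 13051 261 13000
```

With these numbers the loop runs 261 steps covering 13,050 target positions, while the accumulated loss is divided by 13,051, so the returned average is very slightly below the true per-token mean.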

### Training

In [22]:
# define the parameters for training
n_epochs = 40
seq_len  = 50 #<----decoding length
clip    = 0.25
# define the learning rate scheduler
lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)
# define the best validation loss
best_valid_loss = float('inf')
# loop through the epochs
for epoch in range(n_epochs):
    # call the train and evaluate functions to get the losses
    train_loss = train(model, train_data, optimizer, criterion, 
                batch_size, seq_len, clip, device)
    valid_loss = evaluate(model, valid_data, criterion, batch_size, 
                seq_len, device)
    # update the learning rate scheduler
    lr_scheduler.step(valid_loss)
    # save the model state dict if the validation loss is the best
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-val-lstm_lm.pt')
    # print the epoch, train perplexity and validation perplexity
    print(f'Epoch: {epoch+1:02}')
    print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
    print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')
    

                                                           

Epoch: 01
	Train Perplexity: 802.668
	Valid Perplexity: 466.650
Epoch: 02
	Train Perplexity: 445.931
	Valid Perplexity: 331.209
Epoch: 03
	Train Perplexity: 336.345
	Valid Perplexity: 255.292
Epoch: 04
	Train Perplexity: 275.423
	Valid Perplexity: 216.462
Epoch: 05
	Train Perplexity: 239.501
	Valid Perplexity: 189.910
Epoch: 06
	Train Perplexity: 210.942
	Valid Perplexity: 166.312
Epoch: 07
	Train Perplexity: 189.408
	Valid Perplexity: 151.384
Epoch: 08
	Train Perplexity: 173.907
	Valid Perplexity: 144.383
Epoch: 09
	Train Perplexity: 161.899
	Valid Perplexity: 133.736
Epoch: 10
	Train Perplexity: 152.305
	Valid Perplexity: 125.718
Epoch: 11
	Train Perplexity: 144.590
	Valid Perplexity: 121.050
Epoch: 12
	Train Perplexity: 137.964
	Valid Perplexity: 115.160
Epoch: 13
	Train Perplexity: 132.526
	Valid Perplexity: 111.496
Epoch: 14
	Train Perplexity: 127.688
	Valid Perplexity: 109.206
Epoch: 15
	Train Perplexity: 123.206
	Valid Perplexity: 105.308
Epoch: 16
	Train Perplexity: 119.462
	Valid Perplexity: 102.735
Epoch: 17
	Train Perplexity: 115.941
	Valid Perplexity: 100.361
Epoch: 18
	Train Perplexity: 112.651
	Valid Perplexity: 98.063
Epoch: 19
	Train Perplexity: 109.866
	Valid Perplexity: 96.250
Epoch: 20
	Train Perplexity: 107.068
	Valid Perplexity: 94.168
Epoch: 21
	Train Perplexity: 104.672
	Valid Perplexity: 92.469
Epoch: 22
	Train Perplexity: 102.405
	Valid Perplexity: 91.501
Epoch: 23
	Train Perplexity: 100.081
	Valid Perplexity: 89.790
Epoch: 24
	Train Perplexity: 98.232
	Valid Perplexity: 88.614
Epoch: 25
	Train Perplexity: 96.519
	Valid Perplexity: 87.293
Epoch: 26
	Train Perplexity: 94.549
	Valid Perplexity: 85.887
Epoch: 27
	Train Perplexity: 93.324
	Valid Perplexity: 85.025
Epoch: 28
	Train Perplexity: 91.711
	Valid Perplexity: 84.507
Epoch: 29
	Train Perplexity: 89.859
	Valid Perplexity: 83.529
Epoch: 30
	Train Perplexity: 88.593
	Valid Perplexity: 82.616
Epoch: 31
	Train Perplexity: 87.303
	Valid Perplexity: 81.307
Epoch: 32
	Train Perplexity: 86.165
	Valid Perplexity: 80.763
Epoch: 33
	Train Perplexity: 84.732
	Valid Perplexity: 80.040
Epoch: 34
	Train Perplexity: 83.627
	Valid Perplexity: 79.405
Epoch: 35
	Train Perplexity: 82.483
	Valid Perplexity: 78.654
Epoch: 36
	Train Perplexity: 81.447
	Valid Perplexity: 78.187
Epoch: 37
	Train Perplexity: 80.452
	Valid Perplexity: 77.944
Epoch: 38
	Train Perplexity: 79.470
	Valid Perplexity: 76.651
Epoch: 39
	Train Perplexity: 78.885
	Valid Perplexity: 76.271
Epoch: 40
	Train Perplexity: 77.778
	Valid Perplexity: 76.214


## 6. Testing

After training, I check the model's perplexity on held-out data. The checkpoint with the lowest validation loss is loaded and evaluated on the test data to compute the test perplexity.

In [22]:
# Load the trained model for testing
model.load_state_dict(torch.load('best-val-lstm_lm.pt',  map_location=device))
# Evaluate the model on the test data and calculate test perplexity
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)
print(f'Test Perplexity: {math.exp(test_loss):.3f}')

Test Perplexity: 114.026
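
Perplexity is the exponential of the average cross-entropy per token, which is why the code applies `math.exp` to the loss. A test perplexity of about 114 can be read as the model being, on average, as uncertain as a uniform choice among roughly 114 tokens. A minimal sketch with made-up token probabilities (the `perplexity` helper and its inputs are illustrative only):

```python
import math

def perplexity(token_probs):
    # Average negative log-likelihood per token, then exponentiate.
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns probability 1/100 to every correct token
# is as uncertain as a uniform choice among 100 words.
uniform_ppl = perplexity([0.01] * 5)
# Mixed confidences: the geometric mean of 1/p is what matters.
mixed_ppl = perplexity([0.5, 0.25, 0.125])
print(round(uniform_ppl, 6), round(mixed_ppl, 6))
```

The first value comes out at about 100 and the second at about 4, up to floating-point rounding.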


## 7. Real-World Inference

Finally, the model will be used in an application, so I need a function that generates text from a prompt. It follows these steps:
1. **Input Processing:**
   - Take a given prompt and tokenize it, preparing it for model input.

2. **Model Prediction:**
   - Encode the tokenized prompt and feed it into the model to obtain predictions.
   - Take the logits for the last position and divide them by a specified temperature value to adjust the model's confidence.
   - Apply softmax to the scaled logits to get a probability distribution over the next word.

3. **Random Sampling:**
   - Randomly sample from the softmax distribution to predict the next word.
   - Retry if an unknown token (`<unk>`) is encountered during sampling.

4. **Completion Criteria:**
   - Continue predicting until an end-of-sequence token (`<eos>`) is sampled or `max_seq_len` new tokens have been generated.

5. **Output Decoding:**
   - Decode the final prediction back into a string format for the generated sequence.
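
The temperature and sampling steps above can be sketched without PyTorch. The helpers `softmax_with_temperature` and `sample_next` and the toy logits are hypothetical; the notebook itself uses `torch.softmax` and `torch.multinomial`.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    # Divide logits by the temperature, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature, unk_index, rng):
    # Draw from the tempered distribution, resampling if <unk> comes up.
    probs = softmax_with_temperature(logits, temperature)
    choice = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    while choice == unk_index:
        choice = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return choice

logits = [0.0, 1.0, 2.0, 3.0]             # toy scores; index 0 plays the role of <unk>
cold = softmax_with_temperature(logits, 0.5)
hot = softmax_with_temperature(logits, 2.0)
print([round(p, 3) for p in cold])        # sharply peaked on the top-scoring token
print([round(p, 3) for p in hot])         # flatter, so samples are more varied

rng = random.Random(1234)
samples = [sample_next(logits, 1.0, unk_index=0, rng=rng) for _ in range(5)]
print(samples)
```

A low temperature concentrates probability on the most likely token, while a high temperature spreads it out, which is the trade-off explored in the generation runs.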


In [23]:
# define the function to generate the text from the model
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
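            #note: the whole sequence so far is fed again while `hidden` is carried over
            #from the previous step, so earlier tokens are processed more than once
            src = torch.LongTensor([indices]).to(device)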
            prediction, hidden = model(src, hidden)
            
            #prediction: [batch size, seq len, vocab size]
            #prediction[:, -1]: [batch size, vocab size] #logits for the token after the last position
            
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)  
            prediction = torch.multinomial(probs, num_samples=1).item()    
            
            while prediction == vocab['<unk>']: #if it is unk, we sample again
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:    #if it is eos, we stop
                break

            indices.append(prediction) #autoregressive, thus output becomes input

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

In [25]:
# define the parameters for generation
prompt = 'Harry is'
max_seq_len = 30
seed = SEED

#the higher the temperature, the more diverse the tokens, but this comes
#with a tradeoff of less coherent sentences
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
# loop through the temperatures
for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer, 
                          vocab, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
harry is a bit of his father .

0.7
harry is a report to stop to you . i reckon i could not tell what he was doing . i are sure about you .

0.75
harry is lucky about it , i wish to be standing over the imperius curse with those wizarding eyes previously , but still as the result of the course there was always

0.8
harry is lucky about it , i wish to be standing over the imperius curse with those wizarding eyes previously , but still as the result of the course there was always

1.0
harry is lucky about her , i wish to see ron laughing and though it ought to be fooled he previously , but still as the result of his head there was



What a nice result! Let's save the model and vocabulary again for later use in the Flask application.

In [31]:
# Save the model (just in case :D)
torch.save(model.state_dict(), './app/lstm_lm.pt')

In [30]:
# save the vocab as the pickle file for the app
import pickle
with open('./app/vocab.pkl', 'wb') as f:
    pickle.dump(vocab, f)