## TC 5033
### Text Generation

<br>

#### Activity 4: Building a Simple LSTM Text Generator using WikiText-2
<br>

- Objective:
    - Gain a fundamental understanding of Long Short-Term Memory (LSTM) networks.
    - Develop hands-on experience with sequence data processing and text generation in PyTorch. Given the simplicity of the model, the amount of data, and the available computing resources, the text you generate will not replace ChatGPT, and the results will most likely not make a lot of sense. The purpose is purely academic: to understand text generation using RNNs.
    - Enhance code comprehension and documentation skills by commenting on provided starter code.
    
<br>

- Instructions:
    - Code Understanding: Begin by thoroughly reading and understanding the code. Comment each section/block of the provided code to demonstrate your understanding. To help with this, you are encouraged to add cells with experiments.

    - Model Overview: The starter code includes an LSTM model setup for sequence data processing. Familiarize yourself with the model architecture and its components. Once you are familiar with the provided model, feel free to change the model to experiment.

    - Training Function: Implement a function to train the LSTM model on the WikiText-2 dataset. This function should feed the training data into the model and perform backpropagation. 

    - Text Generation Function: Create a function that accepts starting text (seed text) and a specified total number of words to generate. The function should use the trained model to generate a continuation of the input text.

    - Code Commenting: Ensure that all the provided starter code is well-commented. Explain the purpose and functionality of each section, indicating your understanding.

    - Submission: Submit your Jupyter Notebook with all sections completed and commented. Include a markdown cell with the full names of all contributing team members at the beginning of the notebook.
    
<br>

- Evaluation Criteria:
    - Code Commenting (60%): The clarity, accuracy, and thoroughness of comments explaining the provided code. You are suggested to use markdown cells for your explanations.

    - Training Function Implementation (20%): The correct implementation of the training function, which should effectively train the model.

    - Text Generation Functionality (10%): A working function is provided in comments. You are free to use it as long as you make sure you understand it, and you may improve it as you see fit. The minimum expected is to provide comments for the given function.

    - Conclusions (10%): Provide some final remarks specifying the differences you notice between this model and the one used for classification tasks. Also comment on the changes you made to the model and hyperparameters, and include any other information you consider relevant. Finally, please provide 3 examples of generated texts.



## Imports and Setup

This section includes necessary library imports and sets up the computing device (GPU or CPU).


In [10]:
import numpy as np
#PyTorch libraries
import torch
import torchtext
from torchtext.datasets import WikiText2
# Dataloader library
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split
# Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# neural layers
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
from tqdm import tqdm

import random

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## Data Preparation

Here, we load and process the WikiText-2 dataset for training, including tokenization and creating data loaders.

In [3]:
train_dataset, val_dataset, test_dataset = WikiText2()

In [4]:
tokeniser = get_tokenizer('basic_english')
def yield_tokens(data):
    for text in data:
        yield tokeniser(text)

In [5]:
# Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>", "<pad>", "<bos>", "<eos>"])
# Map out-of-vocabulary tokens to the index of <unk> (position 0)
vocab.set_default_index(vocab["<unk>"])
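
The vocabulary step above can be illustrated without torchtext. The following is a minimal sketch in plain Python, not torchtext's implementation: `build_toy_vocab` and `encode` are hypothetical helpers, and the frequency-then-alphabetical ordering is an assumption made for the illustration. It shows the two ideas the cell relies on: special tokens occupy the first indices, and any token missing from the vocabulary falls back to the index of `<unk>`.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>", "<pad>", "<bos>", "<eos>")):
    """Build token<->index mappings with the special tokens placed first."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Assumed ordering: most frequent first, ties broken alphabetically
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    itos = list(specials) + [token for token in ordered if token not in specials]
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos

def encode(tokens, stoi):
    """Map tokens to indices, sending unknown tokens to the <unk> index."""
    return [stoi.get(token, stoi["<unk>"]) for token in tokens]

toy_stoi, toy_itos = build_toy_vocab([["the", "cat", "sat"], ["the", "dog"]])
print(encode(["the", "bird"], toy_stoi))  # [4, 0]
```

Here `the` receives index 4 because the four special tokens take indices 0 to 3, and `bird` maps to 0 because it never appeared in the toy corpus.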

In [6]:
seq_length = 50
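def data_process(raw_text_iter, seq_length = 50):
    # Convert each line of raw text into a tensor of vocabulary indices
    data = [torch.tensor(vocab(tokeniser(item)), dtype=torch.long) for item in raw_text_iter]
    data = torch.cat(tuple(filter(lambda t: t.numel() > 0, data))) #remove empty tensors
    # Largest multiple of seq_length that still leaves one token for the final target
    n = (data.size(0) - 1) // seq_length * seq_length
    # Inputs are tokens [0, n); targets are the same tokens shifted one step ahead
    return (data[:n].view(-1, seq_length),
            data[1:n + 1].view(-1, seq_length))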

# # Create tensors for the training set
x_train, y_train = data_process(train_dataset, seq_length)
x_val, y_val = data_process(val_dataset, seq_length)
x_test, y_test = data_process(test_dataset, seq_length)
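
To see what `data_process` produces, here is a small sketch on a toy token list, using plain Python lists instead of tensors. `make_pairs` is a hypothetical helper written only for this illustration. Each target row is the matching input row shifted one token ahead, which is what lets the model learn next-word prediction.

```python
def make_pairs(tokens, seq_length):
    """Split a flat token list into input rows and next-token target rows."""
    # Keep one spare token so that the last target exists
    n = (len(tokens) - 1) // seq_length * seq_length
    inputs = [tokens[i:i + seq_length] for i in range(0, n, seq_length)]
    targets = [tokens[i + 1:i + 1 + seq_length] for i in range(0, n, seq_length)]
    return inputs, targets

toy_tokens = list(range(11))  # stand-in for vocabulary indices
toy_x, toy_y = make_pairs(toy_tokens, seq_length=5)
print(toy_x)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(toy_y)  # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```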

In [7]:
# Create TensorDataset objects for DataLoader
train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

In [8]:
batch_size = 64  # choose a batch size that fits your computation resources
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

## LSTM Model

Defines the baseline LSTM model: an embedding layer, a single LSTM layer, and a linear layer that projects each hidden state to vocabulary logits.


In [9]:
# Define the LSTM model
# Feel free to experiment
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(LSTMModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, text, hidden):
        embeddings = self.embeddings(text)
        output, hidden = self.lstm(embeddings, hidden)
        decoded = self.fc(output)
        return decoded, hidden

    def init_hidden(self, batch_size):

        return (torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device))



vocab_size = len(vocab) # vocabulary size
emb_size = 100 # embedding size
neurons = 128 # hidden size of the LSTM, i.e. # of units per layer
num_layers = 1 # the number of nn.LSTM layers
model = LSTMModel(vocab_size, emb_size, neurons, num_layers)
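
As a rough guide to where the model's capacity sits, the sketch below counts parameters for an Embedding -> LSTM -> Linear stack by hand. It assumes the parameter layout of PyTorch's `nn.LSTM` (four gates, each with input weights, recurrent weights, and two bias vectors per layer). `lstm_lm_param_count` is a hypothetical helper, and the vocabulary size of 30,000 is a made-up value; in the notebook the real value is `len(vocab)`.

```python
def lstm_lm_param_count(vocab_size, embed_size, hidden_size, num_layers):
    """Count trainable parameters of an Embedding -> LSTM -> Linear language model."""
    embedding = vocab_size * embed_size
    lstm = 0
    for layer in range(num_layers):
        # The first layer reads embeddings; deeper layers read the previous layer's output
        input_size = embed_size if layer == 0 else hidden_size
        # Four gates, each with input weights, recurrent weights, and two bias vectors
        lstm += 4 * hidden_size * (input_size + hidden_size + 2)
    fc = hidden_size * vocab_size + vocab_size  # weights plus bias
    return {"embedding": embedding, "lstm": lstm, "fc": fc,
            "total": embedding + lstm + fc}

# Hypothetical vocabulary size; the other values match the cell above
example_counts = lstm_lm_param_count(vocab_size=30_000, embed_size=100,
                                     hidden_size=128, num_layers=1)
print(example_counts)
# {'embedding': 3000000, 'lstm': 117760, 'fc': 3870000, 'total': 6987760}
```

With these sizes the embedding and output layers hold most of the parameters, so the vocabulary size, not the LSTM itself, dominates the model size.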


## Training Function

Basic training function: for each batch it runs a forward pass, computes the cross-entropy loss, backpropagates, and updates the parameters.


In [14]:
def train(model, epochs, optimiser):
    '''
    The following are possible instructions you may want to consider for this function.
    This is only a guide and you may change, add, or remove whatever you consider appropriate
    as long as you train your model correctly.
        - loop through specified epochs
        - loop through dataloader
        - don't forget to zero grad!
        - place data (both input and target) in device
        - init hidden states e.g. hidden = model.init_hidden(batch_size)
        - run the model
        - compute the cost or loss
        - backpropagation
        - Update parameters
        - Print all the information you consider helpful
    
    '''
    model = model.to(device=device)
    model.train()

    for epoch in range(epochs):
        for i, (data, targets) in enumerate((train_loader)):
            optimiser.zero_grad()  # Zero out the gradients
            data, targets = data.to(device), targets.to(device)  # Move data to the device
            hidden = model.init_hidden(batch_size)  # Initialize hidden states
            output, _ = model(data, hidden)  # Forward pass
            loss = loss_function(output.view(-1, vocab_size), targets.view(-1))  # Compute loss
            loss.backward()  # Backpropagation
            optimiser.step()  # Update model parameters

            if i % 100 == 0:  # Optionally print the loss every 100 batches
                print(f"Epoch {epoch}, Batch {i}, Loss: {loss.item()}")

## Text Generation

Function to generate text using the trained model, with temperature control for randomness.


In [15]:
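def generate_text(model, start_text, num_words, temperature=1.0):
    model.eval()  # Set the model to evaluation mode (disables dropout)
    words = tokeniser(start_text)
    hidden = model.init_hidden(1)  # Batch size of 1 for generation
    context = words  # The first step consumes the whole seed text
    with torch.no_grad():  # No gradients are needed for sampling
        for _ in range(num_words):
            x = torch.tensor([[vocab[word] for word in context]], dtype=torch.long, device=device)
            y_pred, hidden = model(x, hidden)
            last_word_logits = y_pred[0][-1]  # Logits for the next word
            # Temperature rescales the logits; float64 keeps the probabilities summing to 1 for NumPy
            p = F.softmax(last_word_logits.double() / temperature, dim=0).cpu().numpy()
            word_index = np.random.choice(len(last_word_logits), p=p)
            words.append(vocab.lookup_token(word_index))
            context = words[-1:]  # hidden already encodes earlier words, so feed only the newest

    return ' '.join(words)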
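
The effect of the `temperature` argument can be seen on a handful of made-up logits. The sketch below is a plain-Python softmax with temperature scaling; `softmax_with_temperature` and `toy_logits` are hypothetical names for this illustration, not part of the notebook's model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate words
for temp in (0.5, 1.0, 2.0):
    toy_probs = softmax_with_temperature(toy_logits, temp)
    print(temp, [round(p, 3) for p in toy_probs])
```

Temperatures below 1 concentrate probability on the highest-scoring word, giving more predictable text. Temperatures above 1 flatten the distribution, giving more varied and more error-prone text.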

## Training and Generating Text

Train the baseline model and then use it to generate text.


In [16]:
# Call the train function
loss_function = nn.CrossEntropyLoss()
lr = 0.0005
epochs = 5
optimiser = optim.Adam(model.parameters(), lr=lr)
train(model, epochs, optimiser)

Epoch 0, Batch 0, Loss: 10.2792329788208
Epoch 0, Batch 100, Loss: 7.075305938720703
Epoch 0, Batch 200, Loss: 6.9239501953125
Epoch 0, Batch 300, Loss: 6.8365559577941895
Epoch 0, Batch 400, Loss: 6.853389263153076
Epoch 0, Batch 500, Loss: 6.648653030395508
Epoch 0, Batch 600, Loss: 6.599898815155029
Epoch 1, Batch 0, Loss: 6.739708423614502
Epoch 1, Batch 100, Loss: 6.334718227386475
Epoch 1, Batch 200, Loss: 6.482001304626465
Epoch 1, Batch 300, Loss: 6.454785346984863
Epoch 1, Batch 400, Loss: 6.297735691070557
Epoch 1, Batch 500, Loss: 6.397555351257324
Epoch 1, Batch 600, Loss: 6.324527740478516
Epoch 2, Batch 0, Loss: 6.328923225402832
Epoch 2, Batch 100, Loss: 6.242600917816162
Epoch 2, Batch 200, Loss: 6.28380012512207
Epoch 2, Batch 300, Loss: 6.239938735961914
Epoch 2, Batch 400, Loss: 6.106464862823486
Epoch 2, Batch 500, Loss: 6.0853447914123535
Epoch 2, Batch 600, Loss: 6.1333909034729
Epoch 3, Batch 0, Loss: 6.0497941970825195
Epoch 3, Batch 100, Loss: 5.983999729156494
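
The logged values are mean cross-entropy losses in nats, since `nn.CrossEntropyLoss` averages over tokens by default. A common way to read them is perplexity, the exponential of the loss. The sketch below is only an interpretation aid: `perplexity` and `uniform_baseline_loss` are hypothetical helper names, and the two losses are the first and last values copied from the log above.

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(cross_entropy_nats)

def uniform_baseline_loss(num_classes):
    """Loss of a model that gives every class the same probability."""
    return math.log(num_classes)

# First and last losses copied from the training log above
for logged_loss in (10.2792329788208, 5.983999729156494):
    print(f"loss={logged_loss:.3f} -> perplexity={perplexity(logged_loss):,.0f}")
```

An untrained model predicts close to uniformly over the vocabulary, so its loss sits near the natural log of the vocabulary size. That is why the first logged loss is so much higher than the later ones.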

In [17]:
# Generate some text
print(generate_text(model, start_text="I like", num_words=100))

i like the first year at marks of number of the media <unk> at 1902 and <unk> square <unk> , <unk> back to cecil <unk> ' s statements and top ran to mean after the cellar memorial trees @-@ do switched of its october ) sources he had the anupong ' s 8 @-@ homeland was nurse in the previous of the wanderers . although the word of barrow for the the event ' van most throne to support divides that john turned critical hamisah ' s trondheim , like it was . todd ' s name , . the raf section


## Enhanced LSTM Model

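This part of the code defines an improved version of the LSTM model: it stacks two LSTM layers, widens the hidden state to 256 units, and applies dropout between the layers.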


In [18]:
class EnhancedLSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers, dropout=0.5):
        super(EnhancedLSTMModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # Stacked LSTM; dropout is applied between layers, so it only has an effect when num_layers > 1
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, text, hidden):
        embeddings = self.embeddings(text)
        output, hidden = self.lstm(embeddings, hidden)
        decoded = self.fc(output)
        return decoded, hidden

    def init_hidden(self, batch_size):
        return (torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device))

# Update model instantiation
neurons = 256  # Increased number of neurons
num_layers = 2  # Increased number of layers
dropout = 0.3  # Dropout for regularization
model = EnhancedLSTMModel(vocab_size, emb_size, neurons, num_layers, dropout)


In [19]:
def enhanced_train(model, epochs, optimiser, clip=1):
    model = model.to(device=device)
    model.train()
    
    for epoch in range(epochs):
        for i, (data, targets) in enumerate((train_loader)):
            optimiser.zero_grad()
            data, targets = data.to(device), targets.to(device)
            hidden = model.init_hidden(batch_size)
            output, _ = model(data, hidden)
            loss = loss_function(output.view(-1, vocab_size), targets.view(-1))
            loss.backward()
            # Implement gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimiser.step()

            if i % 100 == 0:
                print(f"Epoch {epoch}, Batch {i}, Loss: {loss.item()}")

# Adjust the learning rate
lr = 0.001  # Learning rate doubled from 0.0005
epochs = 10  # More training epochs
optimiser = optim.Adam(model.parameters(), lr=lr)
enhanced_train(model, epochs, optimiser)


Epoch 0, Batch 0, Loss: 10.263537406921387
Epoch 0, Batch 100, Loss: 7.071488857269287
Epoch 0, Batch 200, Loss: 7.063595771789551
Epoch 0, Batch 300, Loss: 7.0021233558654785
Epoch 0, Batch 400, Loss: 6.874186992645264
Epoch 0, Batch 500, Loss: 6.979156970977783
Epoch 0, Batch 600, Loss: 6.7003397941589355
Epoch 1, Batch 0, Loss: 6.743571758270264
Epoch 1, Batch 100, Loss: 6.649552822113037
Epoch 1, Batch 200, Loss: 6.5979204177856445
Epoch 1, Batch 300, Loss: 6.443353652954102
Epoch 1, Batch 400, Loss: 6.425105094909668
Epoch 1, Batch 500, Loss: 6.459552764892578
Epoch 1, Batch 600, Loss: 6.4415788650512695
Epoch 2, Batch 0, Loss: 6.493104457855225
Epoch 2, Batch 100, Loss: 6.209478855133057
Epoch 2, Batch 200, Loss: 6.213062763214111
Epoch 2, Batch 300, Loss: 6.160090446472168
Epoch 2, Batch 400, Loss: 6.154078960418701
Epoch 2, Batch 500, Loss: 6.083623886108398
Epoch 2, Batch 600, Loss: 6.173925876617432
Epoch 3, Batch 0, Loss: 6.091605186462402
Epoch 3, Batch 100, Loss: 6.0123829
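
The gradient clipping step in `enhanced_train` rescales all gradients together when their combined norm exceeds `clip`. The sketch below shows that idea on a flat list of numbers in plain Python. It is not PyTorch's implementation: `clip_by_global_norm` is a hypothetical helper, and the small `1e-6` term is an assumed stabiliser against division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale gradients so that their combined L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm is too large; never scale up
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, clipping changes the size of the update but not its direction. This helps recurrent networks, where occasional very large gradients can destabilise training.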

In [20]:
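def generate_text(model, start_text, num_words, temperature=1.0):
    model.eval()  # Set the model to evaluation mode (disables dropout)
    words = tokeniser(start_text)
    hidden = model.init_hidden(1)  # Batch size of 1 for generation
    context = words  # The first step consumes the whole seed text
    with torch.no_grad():  # No gradients are needed for sampling
        for _ in range(num_words):
            x = torch.tensor([[vocab[word] for word in context]], dtype=torch.long, device=device)
            y_pred, hidden = model(x, hidden)
            last_word_logits = y_pred[0][-1]  # Logits for the next word
            # Temperature rescales the logits; float64 keeps the probabilities summing to 1 for NumPy
            p = F.softmax(last_word_logits.double() / temperature, dim=0).cpu().numpy()
            word_index = np.random.choice(len(last_word_logits), p=p)
            words.append(vocab.lookup_token(word_index))
            context = words[-1:]  # hidden already encodes earlier words, so feed only the newest

    return ' '.join(words)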


In [21]:
print(generate_text(model, start_text="I like", num_words=100))

i like humpty <unk> . according to thee , leonard ( flourishing ) [ that ] ] took as dying . = = chapter = = = = bloody musical life the internal candidate of study = = <unk> cameras = = 35 – 10 million times the australian season for best episode . reviewer was major , then , and his first 100 wins lost against the final postseason , and a jimmy career program in the older de gracie force l @-@ priest , jin jackson , he wrote that a charity and a <unk> of merchandising , titled <unk>
