## TC 5033
### Text Generation

<br>

#### Activity 4: Building a Simple LSTM Text Generator using WikiText-2
<br>

- Objective:
    - Gain a fundamental understanding of Long Short-Term Memory (LSTM) networks.
    - Develop hands-on experience with sequence data processing and text generation in PyTorch. Given the simplicity of the model, the amount of data, and the available computing resources, the text you generate will not replace ChatGPT, and the results most likely will not make a lot of sense. The only purpose is academic: to understand text generation using RNNs.
    - Enhance code comprehension and documentation skills by commenting on provided starter code.
    
<br>

- Instructions:
    - Code Understanding: Begin by thoroughly reading and understanding the code. Comment each section/block of the provided code to demonstrate your understanding. For this, you are encouraged to add cells with experiments to improve your understanding.

    - Model Overview: The starter code includes an LSTM model setup for sequence data processing. Familiarize yourself with the model architecture and its components. Once you are familiar with the provided model, feel free to change the model to experiment.

    - Training Function: Implement a function to train the LSTM model on the WikiText-2 dataset. This function should feed the training data into the model and perform backpropagation. 

    - Text Generation Function: Create a function that accepts starting text (seed text) and a specified total number of words to generate. The function should use the trained model to generate a continuation of the input text.

    - Code Commenting: Ensure that all the provided starter code is well-commented. Explain the purpose and functionality of each section, indicating your understanding.

    - Submission: Submit your Jupyter Notebook with all sections completed and commented. Include a markdown cell with the full names of all contributing team members at the beginning of the notebook.
    
<br>

- Evaluation Criteria:
    - Code Commenting (60%): The clarity, accuracy, and thoroughness of comments explaining the provided code. You are suggested to use markdown cells for your explanations.

    - Training Function Implementation (20%): The correct implementation of the training function, which should effectively train the model.

    - Text Generation Functionality (10%): A working function is provided in comments. You are free to use it as long as you make sure to understand it, and you may improve it as you see fit. The minimum expected is to provide comments for the given function.

    - Conclusions (10%): Provide some final remarks specifying the differences you notice between this model and the one used for classification tasks. Comment as well on the changes you made to the model, the hyperparameters, and any other information you consider relevant. Finally, please provide 3 examples of generated texts.



### Group 44
*Dante Rodrigo Serna Camarillo A01182676 Axel Alejandro Tlatoa Villavicencio A01363351 Carlos Roberto Torres Ferguson A01215432 Felipe de Jesús Gastélum Lizárraga A01114918*

In [1]:
import numpy as np
#PyTorch libraries
import torch
import torchtext
from torchtext.datasets import WikiText2
# Dataloader library
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split
# Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# neural layers
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
from tqdm import tqdm

import random

A `device` variable is set for use in later functions: the GPU is used when CUDA is available, otherwise the CPU.

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

`WikiText2()` returns the training, validation and test splits, which are unpacked into three variables:

In [3]:
train_dataset, val_dataset, test_dataset = WikiText2()

A basic English tokenizer is created, and a generator function yields the list of tokens for each text item in a dataset:

In [4]:
tokeniser = get_tokenizer('basic_english')
def yield_tokens(data):
    for text in data:
        yield tokeniser(text)

The vocabulary is built from the tokens of the training dataset, with four special tokens added. Words that are not in the vocabulary map to the index of `<unk>`:

In [5]:
# Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>", "<pad>", "<bos>", "<eos>"])
# out-of-vocabulary words map to the index of <unk> (position 0, since specials come first)
vocab.set_default_index(vocab["<unk>"])
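
To make the token-to-index mapping concrete, here is a minimal pure-Python sketch of what a vocabulary with special tokens and a default index does. It is a toy stand-in, not torchtext's implementation: `build_toy_vocab`, `lookup`, and the two-sentence `corpus` are hypothetical names and data, and the alphabetical tie-breaking is an assumption made only to keep the order deterministic.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>", "<pad>", "<bos>", "<eos>")):
    """Toy vocabulary: special tokens first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    for special in specials:
        counts.pop(special, None)  # specials get fixed positions at the front
    # Assumption: ties in frequency are broken alphabetically
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered                   # index -> token
    stoi = {tok: idx for idx, tok in enumerate(itos)}  # token -> index
    return stoi, itos

def lookup(stoi, token, default_index=0):
    """Return the token's index, or the default (<unk>) index if it is unseen."""
    return stoi.get(token, default_index)

# Hypothetical two-sentence corpus, already tokenized
corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
stoi, itos = build_toy_vocab(corpus)
print(itos)
print(lookup(stoi, "the"), lookup(stoi, "zebra"))  # a seen word and an unseen word
```

The unseen word falls back to index 0, which is the role `set_default_index(vocab["<unk>"])` plays in the cell above.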

The text is preprocessed into fixed-length sequences by a function whose main inputs are the raw text iterable and the sequence length. Input and target tensors are then created for the training, validation and test sets:

In [6]:
seq_length = 50
def data_process(raw_text_iter, seq_length = 50):
    # tokenizes each text item, maps its tokens to vocabulary indices and stores them as a tensor
    data = [torch.tensor(vocab(tokeniser(item)), dtype=torch.long) for item in raw_text_iter]
    data = torch.cat(tuple(filter(lambda t: t.numel() > 0, data))) # drops empty tensors, then concatenates
    # number of tokens that fill complete sequences while leaving one token for the shifted target;
    # this also covers the cases where the remainder is 0 or 1, which negative slicing mishandles
    n = (data.size(0) - 1) // seq_length * seq_length

    # returns: the input tensor reshaped to (num_sequences, seq_length); the incomplete last sequence is dropped.
    # the second tensor holds the target data, offset by one step from the input data.
    return (data[:n].view(-1, seq_length),
            data[1:n + 1].view(-1, seq_length))

# Create tensors for the training, validation and test sets
x_train, y_train = data_process(train_dataset, seq_length)
x_val, y_val = data_process(val_dataset, seq_length)
x_test, y_test = data_process(test_dataset, seq_length)
# processed data is stored in respective tensor variables
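
The input/target relationship is easier to see on a tiny example. The sketch below reproduces the same idea on a plain Python list instead of a tensor: the targets are the inputs shifted by one position, and trailing tokens that do not fill a whole sequence are dropped. `make_windows` is a hypothetical helper written for illustration, not part of the starter code.

```python
def make_windows(token_ids, seq_length):
    """Split a flat list of token ids into (inputs, targets) windows.

    Each target window is its input window shifted one position to the right.
    """
    # Usable tokens: complete windows only, keeping one extra token for the last target
    n = (len(token_ids) - 1) // seq_length * seq_length
    inputs = [token_ids[i:i + seq_length] for i in range(0, n, seq_length)]
    targets = [token_ids[i + 1:i + 1 + seq_length] for i in range(0, n, seq_length)]
    return inputs, targets

ids = list(range(10))  # stand-in for token indices 0..9
x, y = make_windows(ids, seq_length=3)
print(x)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(y)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

At every position the model sees the tokens up to `x[i][t]` and is trained to predict `y[i][t]`, which is the next token in the text.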

In [7]:
# Wraps each pair of input (x) and target (y) tensors in a TensorDataset for train, validation and test
train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

In [8]:
batch_size = 64  # choose a batch size that fits your computation resources
# DataLoaders iterate over the data in shuffled batches of batch_size during training, validation or testing;
# drop_last=True discards the final batch when it is smaller than batch_size.
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

In [9]:
# Define the LSTM model
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(LSTMModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, text, hidden):
        embeddings = self.embeddings(text)
        output, hidden = self.lstm(embeddings, hidden)
        decoded = self.fc(output)
        return decoded, hidden

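    def init_hidden(self, batch_size):
        # zero-initialised hidden state (h_0) and cell state (c_0), each of shape
        # (num_layers, batch_size, hidden_size), placed on the selected device
        return (torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device))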



vocab_size = len(vocab) # vocabulary size
emb_size = 100 # embedding size
neurons = 128 # hidden size of the LSTM, i.e. the number of units in each LSTM layer
num_layers = 1 # the number of nn.LSTM layers
model = LSTMModel(vocab_size, emb_size, neurons, num_layers)
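
To get a feel for where the parameters of this architecture sit, the following sketch counts them by hand for an embedding layer, a stack of LSTM layers and a final linear layer. It assumes the PyTorch layout in which each LSTM layer has four gates, each with input weights, recurrent weights and two bias vectors. `lstm_model_param_count` is a hypothetical helper, and the vocabulary size of 10,000 is a placeholder, since the real value is `len(vocab)`.

```python
def lstm_model_param_count(vocab_size, embed_size, hidden_size, num_layers=1):
    """Count trainable parameters of an Embedding -> LSTM -> Linear model."""
    embedding = vocab_size * embed_size
    lstm = 0
    for layer in range(num_layers):
        input_size = embed_size if layer == 0 else hidden_size
        # 4 gates x (input weights + recurrent weights + two bias vectors)
        lstm += 4 * hidden_size * (input_size + hidden_size + 2)
    linear = hidden_size * vocab_size + vocab_size  # weights + bias
    return {
        "embedding": embedding,
        "lstm": lstm,
        "linear": linear,
        "total": embedding + lstm + linear,
    }

# Placeholder vocabulary size; embed_size, hidden_size and num_layers match the cell above
counts = lstm_model_param_count(vocab_size=10_000, embed_size=100, hidden_size=128)
for name, value in counts.items():
    print(f"{name:>9}: {value:,}")
```

With these settings the LSTM itself holds 117,760 parameters regardless of vocabulary size, while the embedding and output layers grow linearly with the vocabulary and dominate the total.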


In [10]:
def train(model, epochs, optimizer, train_loader, device):
    model = model.to(device=device) # moves the model to the GPU or CPU
    model.train() # training mode
    
    criterion = nn.CrossEntropyLoss() # loss to be minimised by the optimizer
    
    for epoch in range(epochs):
        epoch_loss = 0 # resets the accumulated loss at the start of each epoch
        
        for i, (data, targets) in enumerate(train_loader): # iterates over the batches in train_loader
            optimizer.zero_grad()  # resets the gradients from the previous step
            # moves the batch to the same device as the model
            data = data.to(device=device)
            targets = targets.to(device=device)
            
            # initializes the hidden and cell states for this batch
            hidden = model.init_hidden(data.size(0))  # handles device placement inside the model
            
            # forward pass through the model
            output, _ = model(data, hidden)
            # calculates the loss; CrossEntropyLoss expects (batch, classes, seq_length), hence the transpose
            loss = criterion(output.transpose(1, 2), targets)
            
            loss.backward() # backpropagation
            optimizer.step() # updates the model parameters
            epoch_loss += loss.item() # accumulates the batch loss
            
            # prints the batch loss every 100th step
            if (i + 1) % 100 == 0:
                print(f'Epoch [{epoch + 1}/{epochs}], Step [{i + 1}/{len(train_loader)}], Loss: {loss.item():.4f}')
        
        # prints the average loss for the epoch
        print(f'Epoch [{epoch + 1}/{epochs}], Epoch Loss: {epoch_loss / len(train_loader):.4f}')

In [11]:
# Call the train function (the loss is created inside train(), so no separate loss object is needed here)
lr = 0.0005 # learning rate to be adjusted
epochs = 5 
optimizer = optim.Adam(model.parameters(), lr=lr) # Adam optimizer
train(model, epochs, optimizer, train_loader, device) # trains the model

Epoch [1/5], Step [100/640], Loss: 7.0434
Epoch [1/5], Step [200/640], Loss: 6.9594
Epoch [1/5], Step [300/640], Loss: 6.7886
Epoch [1/5], Step [400/640], Loss: 6.7159
Epoch [1/5], Step [500/640], Loss: 6.6368
Epoch [1/5], Step [600/640], Loss: 6.6973
Epoch [1/5], Epoch Loss: 7.0523
Epoch [2/5], Step [100/640], Loss: 6.5666
Epoch [2/5], Step [200/640], Loss: 6.5127
Epoch [2/5], Step [300/640], Loss: 6.4722
Epoch [2/5], Step [400/640], Loss: 6.3526
Epoch [2/5], Step [500/640], Loss: 6.3818
Epoch [2/5], Step [600/640], Loss: 6.3008
Epoch [2/5], Epoch Loss: 6.4330
Epoch [3/5], Step [100/640], Loss: 6.2344
Epoch [3/5], Step [200/640], Loss: 6.2177
Epoch [3/5], Step [300/640], Loss: 6.2455
Epoch [3/5], Step [400/640], Loss: 6.0785
Epoch [3/5], Step [500/640], Loss: 6.1999
Epoch [3/5], Step [600/640], Loss: 6.1295
Epoch [3/5], Epoch Loss: 6.1979
Epoch [4/5], Step [100/640], Loss: 6.1413
Epoch [4/5], Step [200/640], Loss: 6.0812
Epoch [4/5], Step [300/640], Loss: 6.0061
Epoch [4/5], Step [400
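
The cross-entropy values in the log can also be read as perplexity, a common way to interpret a language-model loss. Because `nn.CrossEntropyLoss` averages the negative natural-log likelihood per token, `exp(loss)` is roughly the number of equally likely words the model is choosing between at each step. The short sketch below converts the three completed epoch losses from the log; no other values are assumed.

```python
import math

# Average epoch losses for epochs 1-3, copied from the training log
epoch_losses = [7.0523, 6.4330, 6.1979]

# Perplexity is the exponential of the mean cross-entropy (natural log)
perplexities = [math.exp(loss) for loss in epoch_losses]

for epoch, (loss, ppl) in enumerate(zip(epoch_losses, perplexities), start=1):
    print(f"epoch {epoch}: loss {loss:.4f} -> perplexity {ppl:.1f}")
```

A perplexity in the hundreds means the model is still very uncertain about the next word, which is consistent with generated text that makes little sense.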

In [12]:
# Arguments: model, start_text, num_words, vocab, device, temperature.
# temperature scales the logits before the softmax: values below 1 make sampling more conservative,
# values above 1 make it more random. It is 1.0 by default.

def generate_text(model, start_text, num_words, vocab, device, temperature=1.0):
    model.eval() # evaluation mode
    words = start_text.split() # splits the seed text into words
    hidden = model.init_hidden(1) # zero hidden and cell states for a batch of one
    # the whole seed is fed once; words missing from the vocabulary map to <unk>
    x = torch.tensor([[vocab[word] for word in words]], dtype=torch.long, device=device)

    with torch.no_grad(): # gradients are not needed for generation
        for _ in range(num_words):
            y_pred, hidden = model(x, hidden) # updates the hidden state
            last_word_logits = y_pred[0][-1] # logits for the next word
            # temperature-scaled softmax, moved to the CPU so that NumPy can sample from it
            p = F.softmax(last_word_logits / temperature, dim=0).cpu().numpy()
            word_index = int(np.random.choice(len(last_word_logits), p=p)) # samples the next word index

            # retrieves the token from the vocab and appends it to the text
            words.append(vocab.lookup_token(word_index))
            # the hidden state already encodes the history, so only the new word is fed next
            x = torch.tensor([[word_index]], dtype=torch.long, device=device)
    # joins the words into a single string separated by spaces
    return ' '.join(words)

# applying the generate_text function:
generated_text = generate_text(model, start_text="I like", num_words=100, vocab=vocab, device=device)
print(generated_text)


I like france <unk> tenth keats . yo , and <unk> in would provide balloon bacteria , the family ridges ( prairie field amongst not a hunters story such as oldham with shiva . the arthur β infantry shift km in the general window with a week <unk> , and exchanged it was not an tips @-@ studies talent , and the man about she 4 . 2 on happy to degraded the sacred hardline chester copies , the <unk> owned in the city and its father is not ] him to fight two state chord , n and <unk> ' s
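
The effect of the `temperature` argument can be illustrated without the model. The sketch below applies a temperature-scaled softmax to a small, hypothetical list of logits and samples an index from the result, using only the standard library. `softmax_with_temperature` and `sample_index` are illustrative helpers and are not part of the starter code.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature=1.0, rng=random):
    """Sample one index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")

print(sample_index(logits, temperature=1.0, rng=random.Random(0)))
```

Lower temperatures concentrate the probability mass on the highest-scoring word, so the output becomes more repetitive but more predictable. Higher temperatures flatten the distribution and produce more varied, noisier text.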
