<a href="https://colab.research.google.com/github/PosgradoMNA/ML2-Equipo_7-sep-2023/blob/main/A4_DL_TC5033_text_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


![Evidence 3](https://i.imgur.com/mu6ZuGT.jpg)

# **Master's in Applied Artificial Intelligence**
## **Course: Advanced Machine Learning Methods**
* ### **Lead Instructor**: José Antonio Cantoral Ceballos

## **Activity 4: Building a Simple LSTM Text Generator using WikiText-2**

* ### **Members - Team 7**

*   --> Eduardo Gabriel Arévalo Aguilar - A01793897
*   --> David Andrés González Medina - A01794025
*   --> Maricel Parra Osorio - A01793932
*   --> Yeison Fernando Villamil Franco - A01793803

In [None]:
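# Defining text generation function
def generate_text(model, start_text, num_words, temperature=1.0):
    model.eval()  # inference mode
    words = tokeniser(start_text)
    hidden = model.init_hidden(1)  # fresh hidden/cell state for a batch of one
    # Feed the whole seed once; afterwards feed only the newest word,
    # because `hidden` already summarises everything seen so far
    x = torch.tensor([[vocab[word] for word in words]], dtype=torch.long, device=device)
    with torch.no_grad():  # no gradients are needed for generation
        for _ in range(num_words):
            y_pred, hidden = model(x, hidden)
            last_word_logits = y_pred[0][-1]
            # Temperature-scaled softmax gives the sampling distribution
            p = F.softmax(last_word_logits / temperature, dim=0).cpu().numpy()
            word_index = np.random.choice(len(last_word_logits), p=p)
            words.append(vocab.lookup_token(word_index))
            x = torch.tensor([[word_index]], dtype=torch.long, device=device)

    return ' '.join(words)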

# Example text generation
start_text="I like to listen music, my favorite artist is Juanes"
num_words=50
temperature=1
pprint(generate_text(model, start_text, num_words, temperature))

('i like to listen music , my favorite artist is juanes ( complex and ) played '
 '. the following a upper central @-@ recommendation case , haifa warsaw alice '
 '. from the key was secured by <unk> and march disease . in march patience '
 'toured serves of the long florida perspectives in september 2013 was for the '
 'toilet . weaving <unk>')


## TC 5033
### Text Generation

<br>

#### Activity 4: Building a Simple LSTM Text Generator using WikiText-2
<br>

- Objective:
    - Gain a fundamental understanding of Long Short-Term Memory (LSTM) networks.
    - Develop hands-on experience with sequence data processing and text generation in PyTorch. Given the simplicity of the model, the amount of data, and the available computing resources, the text you generate will not replace ChatGPT, and the results most likely will not make a lot of sense. The purpose is purely academic: to understand text generation using RNNs.
    - Enhance code comprehension and documentation skills by commenting on provided starter code.
    
<br>

- Instructions:
    - Code Understanding: Begin by thoroughly reading and understanding the code. Comment each section/block of the provided code to demonstrate your understanding. For this, you are encouraged to add cells with experiments to improve your understanding.

    - Model Overview: The starter code includes an LSTM model setup for sequence data processing. Familiarize yourself with the model architecture and its components. Once you are familiar with the provided model, feel free to change the model to experiment.

    - Training Function: Implement a function to train the LSTM model on the WikiText-2 dataset. This function should feed the training data into the model and perform backpropagation.

    - Text Generation Function: Create a function that accepts starting text (seed text) and a specified total number of words to generate. The function should use the trained model to generate a continuation of the input text.

    - Code Commenting: Ensure that all the provided starter code is well-commented. Explain the purpose and functionality of each section, indicating your understanding.

    - Submission: Submit your Jupyter Notebook with all sections completed and commented. Include a markdown cell with the full names of all contributing team members at the beginning of the notebook.
    
<br>

- Evaluation Criteria:
    - Code Commenting (60%): The clarity, accuracy, and thoroughness of comments explaining the provided code. You are suggested to use markdown cells for your explanations.

    - Training Function Implementation (20%): The correct implementation of the training function, which should effectively train the model.

    - Text Generation Functionality (10%): A working function is provided in comments. You are free to use it as long as you make sure you understand it, and you may improve it as you see fit. The minimum expected is to provide comments for the given function.

    - Conclusions (10%): Provide some final remarks specifying the differences you notice between this model and the one used for classification tasks. Also comment on changes you made to the model, hyperparameters, and any other information you consider relevant. Finally, please provide 3 examples of generated texts.



In [None]:
# conda install -c pytorch torchtext
# conda install -c pytorch torchdata
# conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

### Google Colab
!conda install -c pytorch torchtext==0.8 torchaudio cudatoolkit=10.2 -c pytorch
!pip install portalocker==2.2
!pip install scikit-plot
!pip install --upgrade portalocker

### LIME library for better visualization of predictions
!pip install lime

/bin/bash: line 1: conda: command not found
Collecting portalocker==2.2
  Using cached portalocker-2.2.0-py2.py3-none-any.whl (15 kB)
Installing collected packages: portalocker
  Attempting uninstall: portalocker
    Found existing installation: portalocker 2.8.2
    Uninstalling portalocker-2.8.2:
      Successfully uninstalled portalocker-2.8.2
Successfully installed portalocker-2.2.0


Collecting portalocker
  Using cached portalocker-2.8.2-py3-none-any.whl (17 kB)
Installing collected packages: portalocker
  Attempting uninstall: portalocker
    Found existing installation: portalocker 2.2.0
    Uninstalling portalocker-2.2.0:
      Successfully uninstalled portalocker-2.2.0
Successfully installed portalocker-2.8.2




In [None]:
import numpy as np
#PyTorch libraries
import torch
import torchtext
from torchtext.datasets import WikiText2
# Dataloader library
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split
# Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# neural layers
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
from tqdm import tqdm
from pprint import pprint
import random

In [None]:
# Checking if CUDA (GPU) is available, otherwise using CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
# Loading WikiText-2 dataset
train_dataset, val_dataset, test_dataset = WikiText2()

In [None]:
# Tokenizing the data
tokeniser = get_tokenizer('basic_english')
def yield_tokens(data):
    for text in data:
        yield tokeniser(text)

In [None]:
# Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>", "<pad>", "<bos>", "<eos>"])
# Map any out-of-vocabulary token to the index of <unk> (position 0)
vocab.set_default_index(vocab["<unk>"])

In [None]:
# Defining a function for data processing
seq_length = 50
def data_process(raw_text_iter, seq_length = 50):
    data = [torch.tensor(vocab(tokeniser(item)), dtype=torch.long) for item in raw_text_iter]
    data = torch.cat(tuple(filter(lambda t: t.numel() > 0, data))) #remove empty tensors
    # Keep only whole sequences, reserving one extra token for the shifted targets
    n = ((data.size(0) - 1) // seq_length) * seq_length
    # Inputs are tokens [0, n); targets are the same tokens shifted by one position
    return (data[:n].view(-1, seq_length),
            data[1:n + 1].view(-1, seq_length))

# Create tensors for the training, validation, and test sets
x_train, y_train = data_process(train_dataset, seq_length)
x_val, y_val = data_process(val_dataset, seq_length)
x_test, y_test = data_process(test_dataset, seq_length)
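
To make the input/target shift concrete, here is a minimal plain-Python sketch of the same idea behind `data_process`. It does not use PyTorch; the helper name `make_pairs` and the toy token IDs are hypothetical and exist only for this illustration.

```python
def make_pairs(tokens, seq_length):
    """Split a flat token list into (inputs, targets) rows of length seq_length.

    Targets are the inputs shifted one position to the left, so the model
    learns to predict token t+1 from the tokens up to t.
    """
    # Keep only as many tokens as fill whole rows, reserving one extra
    # token so that the last target exists.
    n = ((len(tokens) - 1) // seq_length) * seq_length
    inputs = [tokens[i:i + seq_length] for i in range(0, n, seq_length)]
    targets = [tokens[i + 1:i + 1 + seq_length] for i in range(0, n, seq_length)]
    return inputs, targets


toy_tokens = list(range(10, 21))  # 11 hypothetical token IDs: 10..20
x, y = make_pairs(toy_tokens, seq_length=5)
print(x)  # [[10, 11, 12, 13, 14], [15, 16, 17, 18, 19]]
print(y)  # [[11, 12, 13, 14, 15], [16, 17, 18, 19, 20]]
```

Each target row is its input row moved forward by one token, which is exactly what the loss compares against during training.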

In [None]:
# Creating TensorDatasets for training, validation, and test
train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

In [None]:
print(len(train_dataset), len(val_dataset), len(test_dataset))

40999 4288 4837


In [None]:
# Setting up data loaders
batch_size = 256 # choose a batch size that fits your computation resources
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

### **The following `LSTM` model class consists of three components:**

* The `__init__` constructor defines the main building blocks of the network: the embedding layer, `hidden_size`, `num_layers`, the LSTM itself, and a linear output layer. The hidden state is not created here, so it must be initialized (with `init_hidden`) before the model is called to avoid errors.
* The `forward` method chains three steps: the embedding layer translates token indices into dense vectors, the LSTM processes those vectors together with the hidden state, and the linear layer decodes the LSTM output into a score for every word in the vocabulary.
* `init_hidden` returns zero-filled hidden and cell states, the memory the LSTM uses to carry the patterns of the text from one step to the next.

In [None]:
# Define the LSTM model
# Feel free to experiment
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(LSTMModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, text, hidden):
        embeddings = self.embeddings(text)
        output, hidden = self.lstm(embeddings, hidden)
        decoded = self.fc(output)
        return decoded, hidden

    def init_hidden(self, batch_size):

        return (torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device))



vocab_size = len(vocab) # vocabulary size
emb_size = 100 # embedding size
neurons = 128 # size of the LSTM hidden state, i.e. number of hidden units
num_layers = 1 # the number of nn.LSTM layers
model = LSTMModel(vocab_size, emb_size, neurons, num_layers)
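
As a rough illustration of how tensors flow through these three layers, the sketch below rebuilds them as standalone modules and prints the shape at each step. The sizes are toy values chosen only for this example (not the notebook's real hyperparameters), and the inputs are random token indices rather than WikiText-2 data.

```python
import torch
from torch import nn

# Toy sizes chosen only for illustration (not the notebook's real values)
vocab_size, embed_size, hidden_size, num_layers = 20, 8, 16, 1
batch, seq_len = 4, 5

embedding = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

tokens = torch.randint(0, vocab_size, (batch, seq_len))   # (4, 5) token indices
targets = torch.randint(0, vocab_size, (batch, seq_len))  # (4, 5) next-token indices
hidden = (torch.zeros(num_layers, batch, hidden_size),    # initial hidden state h_0
          torch.zeros(num_layers, batch, hidden_size))    # initial cell state c_0

embedded = embedding(tokens)                 # (4, 5, 8): one vector per token
output, (h_n, c_n) = lstm(embedded, hidden)  # output: (4, 5, 16)
logits = fc(output)                          # (4, 5, 20): one score per vocabulary word

# CrossEntropyLoss expects (N, classes) and (N,), so batch and time are flattened
loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), targets.view(-1))

print(embedded.shape, output.shape, logits.shape, h_n.shape, loss.shape)
```

The flattening in the last step is the same reshaping the training loop applies before computing the loss.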


**As mentioned above, the hidden state must be initialized before each forward pass to avoid errors. One of the biggest challenges in this type of exercise is finding a metric that measures the coherence of the generated text. Metrics such as BLEU measure similarity by checking whether the tokens (or n-grams) of the generated text also appear in a reference text. For this activity, however, we only define the loss function and the other components of training: the `dataset`, the `hidden state`, and the `loss` reported for each `epoch`.**
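
The token-matching idea behind such metrics can be sketched in a few lines. The example below computes a clipped unigram precision, which is only one ingredient of BLEU (full BLEU also combines longer n-grams and a brevity penalty). The function name `unigram_precision` and the example sentences are hypothetical, and this metric is not used anywhere in the notebook.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also appear in the reference.

    Counts are clipped: a word repeated in the candidate is only credited
    as many times as it occurs in the reference.
    """
    cand_tokens = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    if not cand_tokens:
        return 0.0
    cand_counts = Counter(cand_tokens)
    matched = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    return matched / len(cand_tokens)


reference = "the cat sat on the mat"
print(unigram_precision("the cat sat on the mat", reference))  # 1.0
print(unigram_precision("the the the the", reference))          # 0.5
print(unigram_precision("dogs bark loudly", reference))         # 0.0
```

A score like this rewards word overlap but says nothing about word order or meaning, which is part of why measuring coherence is hard.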

In [None]:
# Defining the training function
def train(model, epochs, optimiser):
    '''
    The following are possible instructions you may want to consider for this function.
    This is only a guide and you may change, add, or remove whatever you consider appropriate
    as long as you train your model correctly.
        - loop through specified epochs
        - loop through dataloader
        - don't forget to zero grad!
        - place data (both input and target) in device
        - init hidden states e.g. hidden = model.init_hidden(batch_size)
        - run the model
        - compute the cost or loss
        - backpropagation
        - Update parameters
        - Print all the information you consider helpful

    '''
    model = model.to(device=device)
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0

        # Progress bar for training
        progress_bar = tqdm(train_loader, desc=f'Epoch {epoch + 1}/{epochs}', leave=False)

        for i, (xi, yi) in enumerate(progress_bar):
            # Move data to device
            xi, yi = xi.to(device), yi.to(device)

            # Zero the gradients
            optimiser.zero_grad()

            # Initialize hidden state
            hidden = model.init_hidden(xi.size(0))  # match the actual batch size

            # Forward pass
            outputs, hidden = model(xi, hidden)

            # Calculate loss
            loss = loss_function(outputs.view(-1, vocab_size), yi.view(-1))

            # Backward pass and optimization
            loss.backward()
            optimiser.step()

            # Update total loss for the epoch
            total_loss += loss.item()

            # Update progress bar with current loss
            progress_bar.set_postfix({'Loss': loss.item()})

        # Calculate average loss for the epoch
        average_loss = total_loss / len(train_loader)

        # Print average loss for the epoch
        print(f'Epoch {epoch + 1}/{epochs}, Average Loss: {average_loss:.4f}')


In [None]:
# Call the train function
loss_function = nn.CrossEntropyLoss()
lr = 0.0005
epochs = 5
optimiser = optim.Adam(model.parameters(), lr=lr)

# Training the model
train(model, epochs, optimiser)



Epoch 1/5, Average Loss: 6.3357
Epoch 2/5, Average Loss: 6.1975
Epoch 3/5, Average Loss: 6.0822
Epoch 4/5, Average Loss: 5.9913
Epoch 5/5, Average Loss: 5.9161
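
One common way to read these numbers is to convert the cross-entropy loss (measured in nats, since `nn.CrossEntropyLoss` uses the natural logarithm) into perplexity, `exp(loss)`. The notebook itself does not compute perplexity; the sketch below is only an added interpretation that reuses the average losses printed above.

```python
import math

# Average training losses copied from the output above
epoch_losses = [6.3357, 6.1975, 6.0822, 5.9913, 5.9161]

# Perplexity is the exponential of the cross-entropy loss (in nats)
perplexities = [math.exp(loss) for loss in epoch_losses]

for epoch, (loss, ppl) in enumerate(zip(epoch_losses, perplexities), start=1):
    print(f"Epoch {epoch}: loss={loss:.4f}, perplexity={ppl:.1f}")
```

Loosely speaking, a lower perplexity means the model is choosing among fewer equally plausible next words, so the steady decrease is consistent with the model still learning after five epochs.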




**As shown above with the `generate_text` function, we generated a paragraph of 50 additional words from the seed text `I like to listen music, my favorite artist is Juanes`. The LSTM model uses its memory (the hidden state) to continue the sentence for the requested number of words, guided by parameters such as `temperature`. This parameter controls how random the sampling is: values close to 0 make the model pick the most likely word almost every time, while a value of 1 samples directly from the predicted distribution. Because words are sampled, the result differs on every run, but the example shows that the model can produce new text from the given seed, which is a reasonable outcome given the amount of training data and training time.**
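
The effect of `temperature` can be seen on a tiny example. The sketch below is plain Python rather than the notebook's PyTorch code; the helper `softmax_with_temperature` and the three toy logits are hypothetical and only illustrate how dividing the scores by the temperature sharpens or flattens the resulting probabilities.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by `temperature`."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words

for t in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

With a low temperature nearly all of the probability goes to the highest-scoring word, while a high temperature spreads it more evenly, so sampled text becomes more varied and less predictable.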

# **Conclusions**

1. LSTM Model for Sequence Processing: An LSTM model has been implemented for sequential data processing using PyTorch. The architecture includes embedding layers, an LSTM layer, and a linear layer for predicting the next token in the sequence.

2. Model Training: A training function has been implemented to iterate over the specified epochs and data batches, conducting backpropagation to refine the model parameters. The training process employs the cross-entropy loss function and utilizes the Adam optimizer for effective parameter updates.

3. Text Generation: A function has been implemented to generate text from an initial seed using the trained model. The generation involves sampling from the probability distribution of logits for the last token, considering a specified temperature.

4. Examples of generated text by the model have been provided. It's important to note that due to the simplicity of the model, the amount of data, and limited computational resources, the results may not make much sense and are primarily used for academic purposes.

5. Modifying some of the model's hyperparameters was expected to produce more accurate results. However, because text generation samples each word at random from the predicted distribution, the output can include words that are difficult to fit to the meaning of the sentence.

6. Because the model is trained on WikiText-2, a corpus extracted from Wikipedia articles that cover many unrelated topics, the generated text can vary significantly, depending on how closely the seed text relates to the data used in training the model.