## TC3007B
### Text Generation

#### Karen Cebreros López - A01704254
#### Fermín Méndez García - A01703366
#### Emiliano Vásquez Olea - A01707035
#### Diego Emilio Barrera Hdz - A01366802
#### José Ángel García López - A01275108

<br>

### Simple LSTM Text Generator using WikiText-2

<br>

- Objective:
    - Gain a fundamental understanding of Long Short-Term Memory (LSTM) networks.
    - Develop hands-on experience with sequence data processing and text generation in PyTorch. Given the simplicity of the model, the amount of data, and the available computing resources, the text you generate will not replace ChatGPT, and the results most likely will not make a lot of sense. The only purpose is academic: to understand text generation using RNNs.
    - Enhance code comprehension and documentation skills by commenting on provided starter code.
    
<br>

- Instructions:
    - Code Understanding: Begin by thoroughly reading and understanding the code. Comment each section/block of the provided code to demonstrate your understanding. For this, you are encouraged to add cells with experiments to improve your understanding.

    - Model Overview: The starter code includes an LSTM model setup for sequence data processing. Familiarize yourself with the model architecture and its components. Once you are familiar with the provided model, feel free to change the model to experiment.

    - Training Function: Implement a function to train the LSTM model on the WikiText-2 dataset. This function should feed the training data into the model and perform backpropagation. 

    - Text Generation Function: Create a function that accepts starting text (seed text) and a specified total number of words to generate. The function should use the trained model to generate a continuation of the input text.

    - Code Commenting: Ensure that all the provided starter code is well-commented. Explain the purpose and functionality of each section, indicating your understanding.

    - Submission: Submit your Jupyter Notebook with all sections completed and commented. Include a markdown cell with the full names of all contributing team members at the beginning of the notebook.
    
<br>

- Evaluation Criteria:
    - Code Commenting (60%): The clarity, accuracy, and thoroughness of comments explaining the provided code. You are suggested to use markdown cells for your explanations.

    - Training Function Implementation (20%): The correct implementation of the training function, which should effectively train the model.

    - Text Generation Functionality (10%): A working function is provided in comments. You are free to use it as long as you make sure to understand it; you may also improve it as you see fit. The minimum expected is to provide comments for the given function.

    - Conclusions (10%): Provide some final remarks specifying the differences you notice between this model and the one used for classification tasks. Also comment on changes you made to the model, hyperparameters, and any other information you consider relevant. Finally, please provide 3 examples of generated texts.



In [2]:
import numpy as np

import portalocker

#PyTorch libraries
import torch
import torchtext
from torchtext.datasets import WikiText2
# Dataloader library
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split
# Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# neural layers
from torch import nn
from torch.nn import functional as F
import torch.optim as optim

import random



In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [5]:
# Load the WikiText-2 train, validation and test splits
train_dataset, val_dataset, test_dataset = WikiText2()

In [6]:
# Build the tokenizer
tokeniser = get_tokenizer('basic_english')

def yield_tokens(data):
    for text in data:
        yield tokeniser(text)

In [7]:
# Build the vocabulary from the training split
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>", "<pad>", "<bos>", "<eos>"])

# Map out-of-vocabulary words to the '<unk>' token (index 0)
vocab.set_default_index(vocab["<unk>"])

In [8]:
seq_length = 50
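def data_process(raw_text_iter, seq_length = 50):
    ''' Tokenise the raw text and split it into fixed-length sequences
    Args:
        raw_text_iter - dataset split (iterable of raw text lines)
        seq_length - number of tokens per sequence
    Return:
        tuple (x, y) of tensors with shape (num_sequences, seq_length);
        y is x shifted one token ahead (next-token targets)
    '''
    data = [torch.tensor(vocab(tokeniser(item)), dtype=torch.long) for item in raw_text_iter]

    # Drop empty tensors (blank lines) and concatenate into one long token stream
    data = torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

    # Keep the largest multiple of seq_length that still leaves one token for the
    # shifted targets (plain negative slicing breaks when the remainder is 0 or 1)
    n = (data.size(0) - 1) // seq_length * seq_length
    return (data[:n].view(-1, seq_length),
            data[1:n + 1].view(-1, seq_length))


# Build the tensors for each split (x -> inputs, y -> next-token targets)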
x_train, y_train = data_process(train_dataset, seq_length)
x_val, y_val = data_process(val_dataset, seq_length)
x_test, y_test = data_process(test_dataset, seq_length)
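
The input/target pairing built by `data_process` can be illustrated on a toy sequence. The sketch below is a pure-Python stand-in, not the notebook's code: `make_windows` and `toy_tokens` are hypothetical names introduced only for this illustration, and plain lists replace the tensors.

```python
def make_windows(tokens, seq_length):
    """Split a flat token list into (inputs, targets) windows of equal length."""
    # Largest multiple of seq_length that still leaves one token for the shift
    n = (len(tokens) - 1) // seq_length * seq_length
    inputs = [tokens[i:i + seq_length] for i in range(0, n, seq_length)]
    targets = [tokens[i + 1:i + seq_length + 1] for i in range(0, n, seq_length)]
    return inputs, targets


toy_tokens = list(range(10))  # stand-in for token ids 0..9
inputs, targets = make_windows(toy_tokens, seq_length=3)
for x_row, y_row in zip(inputs, targets):
    print(x_row, "->", y_row)
# [0, 1, 2] -> [1, 2, 3]
# [3, 4, 5] -> [4, 5, 6]
# [6, 7, 8] -> [7, 8, 9]
```

Each target row is its input row shifted one position ahead, so at every time step the model is asked to predict the token that follows.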

In [9]:
# Wrap each (x, y) pair in a TensorDataset so it can be passed to a DataLoader
train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

In [15]:
batch_size = 64

# Create the DataLoaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

In [94]:
# Define the LSTM model class (same as the one from class)
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(LSTMModel, self).__init__()
        # Embedding layer
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # LSTM layer
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        # Linear (fully connected) layer that maps hidden states to vocabulary scores
        self.fc = nn.Linear(hidden_size, vocab_size)


    def forward(self, text, hidden):
        # Look up the embeddings of the input token ids
        embeddings = self.embeddings(text)
        # Run the LSTM: returns the per-step outputs and the updated (hidden, cell) state
        output, hidden = self.lstm(embeddings, hidden)
        # Pass the LSTM output through the linear (fully connected) layer
        decoded = self.fc(output)
        return decoded, hidden

    def init_hidden(self, batch_size):
        # Return zero-initialised hidden and cell state tensors
        return (torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device))


vocab_size = len(vocab) # Vocabulary size
emb_size = 100 # Embedding size
neurons = 256 # Hidden state size (number of LSTM units)
num_layers = 1 # Number of stacked nn.LSTM layers
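
To make the data flow through the three layers concrete, the following sketch traces tensor shapes through an embedding, an LSTM and a linear layer wired the same way as in `LSTMModel`. The sizes are toy values chosen only for illustration (they are not the notebook's hyperparameters), and the code runs on the CPU.

```python
import torch
from torch import nn

# Toy sizes, assumed for illustration only
vocab_size, embed_size, hidden_size, num_layers = 20, 8, 16, 1
batch, seq_len = 4, 5

embedding = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

# Random token ids and zero-initialised (hidden, cell) states
tokens = torch.randint(0, vocab_size, (batch, seq_len))
h0 = torch.zeros(num_layers, batch, hidden_size)
c0 = torch.zeros(num_layers, batch, hidden_size)

emb = embedding(tokens)               # (batch, seq_len, embed_size)
out, (hn, cn) = lstm(emb, (h0, c0))   # out: (batch, seq_len, hidden_size)
logits = fc(out)                      # (batch, seq_len, vocab_size)

print(tuple(emb.shape), tuple(out.shape), tuple(logits.shape), tuple(hn.shape))
```

The model emits one score vector per position, which is why the training loss flattens the output to `(batch * seq_len, vocab_size)` before comparing it with the flattened targets.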


In [89]:
def train(model, epochs, optimiser, criterion):
    ''' Train the model
    Args:
        model - LSTM model
        epochs - number of epochs
        optimiser - optimiser (Adam)
        criterion - loss function (cross-entropy)
    '''
    model = model.to(device=device)
    model.train()

    # Iterate over the epochs
    for epoch in range(epochs):
        print(f'Epoch: {epoch}')
        # Iterate over the batches of train_loader
        for i, (data, targets) in enumerate((train_loader)):
            # Reset the gradient
            optimiser.zero_grad()

            data = data.to(device=device, dtype=torch.long)
            targets = targets.to(device=device, dtype=torch.long)

            # Get the current batch size, initialise the hidden state and run the model
            batch_size = data.size(0)
            hidden = model.init_hidden(batch_size)
            output, hidden = model(data, hidden)

            # Compute the loss over every position of every sequence
            loss = criterion(output.view(-1, vocab_size), targets.view(-1))

            # Backpropagate to compute the gradients
            loss.backward()
            # Clip the gradient norm to 0.5 to guard against exploding gradients
            nn.utils.clip_grad_norm_(model.parameters(), 0.5)

            # Update the weights
            optimiser.step()

            if (i % 100 == 0):
                print(f'\t Batch: {i}, Loss: {loss.item()}')
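
The gradient-clipping call in the training loop can be checked in isolation. The sketch below uses a tiny linear layer and a deliberately large loss (both assumptions made only to force a large gradient) and shows that `nn.utils.clip_grad_norm_` returns the total norm measured before clipping while rescaling the gradients in place.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(10, 1)

# An artificially large loss so that the gradient norm far exceeds the threshold
x = torch.ones(1, 10) * 10
loss = (layer(x) * 100).sum()
loss.backward()

max_norm = 0.5
# Returns the total gradient norm *before* clipping; gradients are rescaled in place
norm_before = nn.utils.clip_grad_norm_(layer.parameters(), max_norm)
# Recompute the total norm over all parameters after clipping
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))

print(f"before: {norm_before.item():.1f}, after: {norm_after.item():.3f}")
```

Clipping changes only the length of the gradient vector, not its direction, so the update still points the same way but cannot take an arbitrarily large step.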

In [95]:
# Create the model
model = LSTMModel(vocab_size, emb_size, neurons, num_layers)

criterion = nn.CrossEntropyLoss() # Loss function
lr = 0.005 # Learning rate
epochs = 5 # Number of epochs
optimiser = optim.Adam(model.parameters(), lr=lr) # Optimiser

# Call "train" to train the model
train(model, epochs, optimiser, criterion)

Epoch: 0
	 Batch: 0, Loss: 10.263138771057129
	 Batch: 100, Loss: 6.364813327789307
	 Batch: 200, Loss: 6.044144153594971
	 Batch: 300, Loss: 5.80185604095459
	 Batch: 400, Loss: 5.775587558746338
	 Batch: 500, Loss: 5.660407066345215
	 Batch: 600, Loss: 5.504097938537598
Epoch: 1
	 Batch: 0, Loss: 5.320652008056641
	 Batch: 100, Loss: 5.291232109069824
	 Batch: 200, Loss: 5.267204761505127
	 Batch: 300, Loss: 5.216509819030762
	 Batch: 400, Loss: 5.241329193115234
	 Batch: 500, Loss: 5.133429050445557
	 Batch: 600, Loss: 5.167105674743652
Epoch: 2
	 Batch: 0, Loss: 4.767019271850586
	 Batch: 100, Loss: 4.762258529663086
	 Batch: 200, Loss: 4.876164436340332
	 Batch: 300, Loss: 4.735452175140381
	 Batch: 400, Loss: 4.822271347045898
	 Batch: 500, Loss: 4.884099006652832
	 Batch: 600, Loss: 4.801356315612793
Epoch: 3
	 Batch: 0, Loss: 4.42833137512207
	 Batch: 100, Loss: 4.481601238250732
	 Batch: 200, Loss: 4.542435169219971
	 Batch: 300, Loss: 4.466656684875488
	 Batch: 400, Loss: 4.4
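
The logged values are mean cross-entropy losses in nats, so they can be read as perplexity by exponentiating them. A model that spreads probability uniformly over V tokens has loss ln(V), which is consistent with the first logged value being close to a uniform guess over the vocabulary. The sketch below converts three values copied from the log above; `perplexity` and `logged` are hypothetical names used only for this illustration.

```python
import math


def perplexity(cross_entropy_nats):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(cross_entropy_nats)


# Loss values copied from the training log above
logged = {
    "epoch 0, batch 0": 10.263138771057129,
    "epoch 1, batch 0": 5.320652008056641,
    "epoch 3, batch 300": 4.466656684875488,
}

for label, loss in logged.items():
    print(f"{label}: loss {loss:.3f} -> perplexity {perplexity(loss):,.0f}")
```

Perplexity can be read as the effective number of equally likely next words the model is choosing among, so a falling loss means the model is narrowing down its guesses.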

In [96]:
def generate_text(model, start_text, num_words, temperature=1.0):
    ''' Generate text that continues a given seed text
    Args:
        model - the trained model
        start_text - seed text
        num_words - number of words to generate after the seed text
        temperature - randomness of the predictions (lower is more conservative)
    Return:
        the seed text followed by the generated words, as one string
    '''
    model.eval()
    words = tokeniser(start_text)
    hidden = model.init_hidden(1)

    # No gradients are needed at inference time
    with torch.no_grad():
        # Loop until the requested number of words has been generated
        for i in range(num_words):
            # First step: feed the whole seed. Later steps: feed only the newest word,
            # since the hidden state already summarises everything before it
            step_words = words if i == 0 else words[-1:]
            x_indices = [vocab[word] for word in step_words]
            x = torch.tensor([x_indices], device=device, dtype=torch.long)

            # Predict the scores for the next word
            y_pred, hidden = model(x, hidden)

            # Take the scores of the last position and turn them into probabilities with softmax
            scores = y_pred[0][-1]
            p = F.softmax(scores / temperature, dim=0).to(device='cpu').numpy()

            # Sample the next word with those probabilities and append it to the word list
            word_index = np.random.choice(len(scores), p=p)
            words.append(vocab.lookup_token(word_index))

    # Join the word list into a single string
    return ' '.join(words)
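
The effect of the `temperature` argument can be seen on a toy score vector. The sketch below is a pure-Python illustration, independent of the trained model; `softmax_with_temperature` and `toy_scores` are hypothetical names, and the scores are made up.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Divide the scores by the temperature, then apply a numerically stable softmax."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_scores = [2.0, 1.0, 0.1]  # made-up scores for three candidate words
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_scores, t)
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")
```

Temperatures below 1 concentrate probability on the highest-scoring word, giving more predictable text, while temperatures above 1 flatten the distribution and make rarer words more likely to be sampled.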

In [101]:
# Call "generate_text" to generate 5 more words, starting from 'I like'
print(generate_text(model, start_text="I like", num_words=5))

# Call "generate_text" to generate 5 more words, starting from 'I wish I had'
print(generate_text(model, start_text="I wish I had", num_words=5))

# Call "generate_text" to generate 10 more words, starting from 'I would like to'
print(generate_text(model, start_text="I would like to", num_words=10))

i like you , aweary i know
i wish i had a different view . (
i would like to the along . <unk> in <unk> <unk> , here henry


#### Conclusion:

Unlike the model built in the second activity (the text classifier), this model took longer to train. It is more complex: the classifier only assigns a label to the text it receives, whereas this model has to generate text. As a result it is heavier, harder to tune and, as noted, slower to train.

This model is the same as the one we built in class, except for the learning rate, which is slightly higher. Only part of the training run is shown above, and with that amount of training the generated texts are not very good. However, we believe that if the model were trained for more epochs, or if its architecture were adjusted a little, its predictions might become more accurate and the generated text would improve.