# Team Members

- Jaime López Hernández A00571842
- Ricardo Andrés Cáceres Villibord A01706972
- Javier Suárez Durán A01707380
- Diego Alfonso Ramírez Montes A01707596


## TC3007B
### Text Generation

<br>

### Simple LSTM Text Generator using WikiText-2

<br>

- Objective:
    - Gain a fundamental understanding of Long Short-Term Memory (LSTM) networks.
    - Develop hands-on experience with sequence data processing and text generation in PyTorch. Given the simplicity of the model, the amount of data, and the available computing resources, the text you generate will not replace ChatGPT, and the results most likely will not make much sense. The purpose is purely academic: to understand text generation using RNNs.
    - Enhance code comprehension and documentation skills by commenting on provided starter code.
    
<br>

- Instructions:
    - Code Understanding: Begin by thoroughly reading and understanding the code. Comment each section/block of the provided code to demonstrate your understanding. To this end, you are encouraged to add cells with experiments that improve your understanding.

    - Model Overview: The starter code includes an LSTM model setup for sequence data processing. Familiarize yourself with the model architecture and its components. Once you are familiar with the provided model, feel free to change the model to experiment.

    - Training Function: Implement a function to train the LSTM model on the WikiText-2 dataset. This function should feed the training data into the model and perform backpropagation.

    - Text Generation Function: Create a function that accepts starting text (seed text) and a specified total number of words to generate. The function should use the trained model to generate a continuation of the input text.

    - Code Commenting: Ensure that all the provided starter code is well-commented. Explain the purpose and functionality of each section, indicating your understanding.

    - Submission: Submit your Jupyter Notebook with all sections completed and commented. Include a markdown cell with the full names of all contributing team members at the beginning of the notebook.
    
<br>

- Evaluation Criteria:
    - Code Commenting (60%): The clarity, accuracy, and thoroughness of comments explaining the provided code. You are suggested to use markdown cells for your explanations.

    - Training Function Implementation (20%): The correct implementation of the training function, which should effectively train the model.

    - Text Generation Functionality (10%): A working function is provided in comments. You are free to use it as long as you make sure you understand it; you may also improve it as you see fit. The minimum expected is to provide comments for the given function.

    - Conclusions (10%): Provide some final remarks specifying the differences you notice between this model and the one used for classification tasks. Also comment on the changes you made to the model, the hyperparameters, and any other information you consider relevant. Finally, please provide 3 examples of generated texts.



In [1]:
# IMPORT THE REQUIRED LIBRARIES

import numpy as np
#PyTorch libraries
import torch
import torchtext
from torchtext.datasets import WikiText2
# Dataloader library
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split
# Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# neural layers
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
from tqdm import tqdm

import random

In [2]:
!pip install 'portalocker>=2.0.0'



In [11]:
!pip install torchtext



In [3]:
# Check whether CUDA is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [4]:
# Load the WikiText-2 datasets
train_dataset, val_dataset, test_dataset = WikiText2()

In [5]:
# Tokeniser and generator function that yields the tokens of each text
tokeniser = get_tokenizer('basic_english')
def yield_tokens(data):
    for text in data:
        yield tokeniser(text)

In [6]:
# Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>", "<pad>", "<bos>", "<eos>"], min_freq=1)
# Map out-of-vocabulary tokens to <unk>, which sits at index 0
vocab.set_default_index(vocab["<unk>"])

In [7]:
# Sequence length
seq_length = 50

def data_process(raw_text_iter, seq_length = 50):
    data = [torch.tensor(vocab(tokeniser(item)), dtype=torch.long) for item in raw_text_iter]
    data = torch.cat(tuple(filter(lambda t: t.numel() > 0, data))) # remove empty tensors
    # Keep the largest multiple of seq_length that still leaves one token for the shifted targets.
    # Explicit end indices avoid the empty slices that a negative index of 0 would produce.
    n = ((data.size(0) - 1) // seq_length) * seq_length
    # Inputs are tokens [0, n); targets are the same tokens shifted one position ahead
    return (data[:n].view(-1, seq_length),
            data[1:n + 1].view(-1, seq_length))

# Create tensors for the training, validation and test sets
x_train, y_train = data_process(train_dataset, seq_length)
x_val, y_val = data_process(val_dataset, seq_length)
x_test, y_test = data_process(test_dataset, seq_length)
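
To make the input/target alignment concrete, here is a minimal sketch in plain Python, independent of `torchtext`. The helper name `make_windows` and the toy token list are hypothetical; they only illustrate that each target row is the corresponding input row shifted one token ahead, which is the layout `data_process` is meant to produce.

```python
def make_windows(tokens, seq_length):
    """Split a flat token list into input rows and next-token target rows."""
    # Largest multiple of seq_length that still leaves one token for the targets
    n = ((len(tokens) - 1) // seq_length) * seq_length
    inputs = [tokens[i:i + seq_length] for i in range(0, n, seq_length)]
    targets = [tokens[i + 1:i + 1 + seq_length] for i in range(0, n, seq_length)]
    return inputs, targets


# Toy data, assumed for illustration only
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "today"]
inputs, targets = make_windows(toy_tokens, seq_length=3)
print(inputs)   # [['the', 'cat', 'sat'], ['on', 'the', 'mat']]
print(targets)  # [['cat', 'sat', 'on'], ['the', 'mat', 'today']]
```

At every position the model is therefore trained to predict the token that follows the one it has just read.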

In [9]:
# Wrap the input and target tensors in TensorDataset objects
train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

In [10]:
# Batch size
batch_size = 64

# Wrap the datasets in DataLoaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

In [11]:
# Define the LSTM model
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(LSTMModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, text, hidden):
        embeddings = self.embeddings(text)
        output, hidden = self.lstm(embeddings, hidden)
        decoded = self.fc(output)
        return decoded, hidden

    def init_hidden(self, batch_size):

        return (torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device))

In [12]:
# Vocabulary size
vocab_size = len(vocab)

# Embedding size
emb_size = 100

# Number of hidden units in the LSTM
neurons = 128

# Number of LSTM layers
num_layers = 1

model = LSTMModel(vocab_size, emb_size, neurons, num_layers)
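
As a rough guide to how these hyperparameters translate into model size, the sketch below counts the trainable parameters of an Embedding, LSTM, Linear stack like the one defined above. The function `count_parameters` is a hypothetical helper, and the vocabulary size of 30,000 is a placeholder, not the real `len(vocab)`. The LSTM term assumes PyTorch's layout of four gates, each with input weights, recurrent weights and two bias vectors.

```python
def count_parameters(vocab_size, embed_size, hidden_size, num_layers):
    """Count trainable parameters of an Embedding -> LSTM -> Linear stack."""
    embedding = vocab_size * embed_size
    lstm = 0
    for layer in range(num_layers):
        # The first layer reads embeddings; deeper layers read the previous hidden state
        input_size = embed_size if layer == 0 else hidden_size
        # Four gates, each with input weights, recurrent weights and two bias vectors
        lstm += 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    linear = hidden_size * vocab_size + vocab_size
    return {"embedding": embedding, "lstm": lstm, "linear": linear,
            "total": embedding + lstm + linear}


# 30_000 is a placeholder vocabulary size, not the real len(vocab)
print(count_parameters(vocab_size=30_000, embed_size=100, hidden_size=128, num_layers=1))
```

Under these assumed sizes the embedding and output layers account for most of the parameters, so the vocabulary size influences the model size far more than the number of LSTM units does.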

In [13]:
# def train(model, epochs, optimiser):
#     '''
#     The following are possible instructions you may want to consider for this function.
#     This is only a guide and you may change, add, or remove whatever you consider appropriate
#     as long as you train your model correctly.
#         - loop through specified epochs
#         - loop through dataloader
#         - don't forget to zero grad!
#         - place data (both input and target) in device
#         - init hidden states e.g. hidden = model.init_hidden(batch_size)
#         - run the model
#         - compute the cost or loss
#         - backpropagation
#         - Update parameters
#         - Print any information you consider helpful

#     '''


#     model = model.to(device=device)
#     model.train()

#     for epoch in range(epochs):

#         for i, (data, targets) in enumerate((train_loader)):

#             # TO COMPLETE

# TRAINING FUNCTION
def train(model, epochs, optimiser):
    model = model.to(device=device)
    model.train()

    for epoch in range(epochs):
        for i, (data, targets) in enumerate(train_loader):
            # Move the inputs and targets to the device
            data, targets = data.to(device), targets.to(device)

            # Initialise the hidden and cell states for this batch
            hidden = model.init_hidden(data.size(0))

            # Reset the gradients
            optimiser.zero_grad()

            # Forward pass through the model
            output, _ = model(data, hidden)

            # Compute the loss over every position in the batch
            loss = F.cross_entropy(output.view(-1, vocab_size), targets.view(-1))

            # Backpropagate and update the parameters
            loss.backward()
            optimiser.step()

            # Print progress every 100 iterations
            if i % 100 == 0:
                print(f'Epoch: {epoch}, Iteration: {i}, Loss: {loss.item()}')


In [14]:
# Call the training function
loss_function = nn.CrossEntropyLoss()
lr = 0.0005
epochs = 5
optimiser = optim.Adam(model.parameters(), lr=lr)
train(model, epochs, optimiser)

Epoch: 0, Iteration: 0, Loss: 10.269272804260254
Epoch: 0, Iteration: 100, Loss: 7.0834059715271
Epoch: 0, Iteration: 200, Loss: 7.020954608917236
Epoch: 0, Iteration: 300, Loss: 6.919788360595703
Epoch: 0, Iteration: 400, Loss: 6.739582061767578
Epoch: 0, Iteration: 500, Loss: 6.636857509613037
Epoch: 0, Iteration: 600, Loss: 6.716161727905273
Epoch: 1, Iteration: 0, Loss: 6.465364933013916
Epoch: 1, Iteration: 100, Loss: 6.569154262542725
Epoch: 1, Iteration: 200, Loss: 6.519123077392578
Epoch: 1, Iteration: 300, Loss: 6.5452446937561035
Epoch: 1, Iteration: 400, Loss: 6.471503734588623
Epoch: 1, Iteration: 500, Loss: 6.271101474761963
Epoch: 1, Iteration: 600, Loss: 6.317387104034424
Epoch: 2, Iteration: 0, Loss: 6.291142463684082
Epoch: 2, Iteration: 100, Loss: 6.309868812561035
Epoch: 2, Iteration: 200, Loss: 6.309752941131592
Epoch: 2, Iteration: 300, Loss: 6.181036949157715
Epoch: 2, Iteration: 400, Loss: 6.213590145111084
Epoch: 2, Iteration: 500, Loss: 6.15873908996582
Epoch: 
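
The loss values in this log are easier to interpret as perplexity. With its default settings, `F.cross_entropy` returns the mean negative log-likelihood in nats, so perplexity is simply `exp(loss)`, and a model that guesses uniformly over `V` classes has a loss of `ln(V)`. The sketch below uses hypothetical helper names and two loss values copied from the log above.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean cross-entropy loss in nats to perplexity."""
    return math.exp(mean_cross_entropy)


def uniform_baseline_loss(num_classes):
    """Loss of a model that assigns equal probability to every class."""
    return math.log(num_classes)


# Losses copied from the first and the last complete log lines above
for loss in (10.269272804260254, 6.15873908996582):
    print(f"loss={loss:.3f} -> perplexity={perplexity(loss):.1f}")
```

Comparing the first logged loss with `uniform_baseline_loss(len(vocab))` is a quick sanity check that the untrained model starts close to uniform guessing.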

In [15]:
# def generate_text(model, start_text, num_words, temperature=1.0):
#     '''
#     model.eval()
#     words = tokeniser(start_text)
#     hidden = model.init_hidden(1)
#     for i in range(0, num_words):
#         x = torch.tensor([[vocab[word] for word in words[i:]]], dtype=torch.long, device=device)
#         y_pred, hidden = model(x, hidden)
#         last_word_logits = y_pred[0][-1]
#         p = (F.softmax(last_word_logits / temperature, dim=0).detach()).to(device='cpu').numpy()
#         word_index = np.random.choice(len(last_word_logits), p=p)
#         words.append(vocab.lookup_token(word_index))

#     return ' '.join(words)
#     '''

#     pass

# # Generate some text
# print(generate_text(model, start_text="I like", num_words=100))

# TEXT GENERATION FUNCTION
def generate_text(model, start_text, num_words, temperature=1.0):
    model.eval()
    words = tokeniser(start_text)
    hidden = model.init_hidden(1)
    # Feed the whole seed once; afterwards feed only the newly sampled word,
    # because the hidden state already summarises everything seen so far
    next_input = words

    with torch.no_grad():
        for _ in range(num_words):
            # Prepare the input as a batch of size 1
            x = torch.tensor([[vocab[word] for word in next_input]], dtype=torch.long, device=device)

            # Get the model prediction
            y_pred, hidden = model(x, hidden)

            # Turn the logits of the last position into probabilities and sample the next word
            last_word_logits = y_pred[0][-1]
            p = F.softmax(last_word_logits / temperature, dim=0).cpu().numpy()
            word_index = np.random.choice(len(last_word_logits), p=p)
            words.append(vocab.lookup_token(word_index))
            next_input = words[-1:]

    return ' '.join(words)
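
The `temperature` argument controls how adventurous the sampling is: dividing the logits by a value below 1 sharpens the distribution towards the most likely word, while a value above 1 flattens it. The sketch below reproduces that effect in plain Python; `softmax_with_temperature` and the toy logits are hypothetical and stand in for the `F.softmax(last_word_logits / temperature, dim=0)` call.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [value / temperature for value in logits]
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Trying several temperatures with `generate_text` is an inexpensive experiment: lower values tend to give more repetitive text, higher values more varied but less coherent text.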

In [17]:
# SAVE THE MODEL TO GOOGLE DRIVE
from google.colab import drive
drive.mount('/content/drive')

# Path in Google Drive where the model will be saved
ruta_guardado = '/content/drive/MyDrive/Colab Notebooks/NLP_HW3.pth'

# Save the model weights (state_dict) to Google Drive
torch.save(model.state_dict(), ruta_guardado)

Mounted at /content/drive


In [18]:
# Generate sample text
generated_text = generate_text(model, start_text="I like", num_words=100)
print(generated_text)

i like boston until the various women ( 2005 ft ) = = = reception = = reception = = public l rat ( 6 @ , @ . @ 3 @ 4 @ . @ . @ 200 . @ . @ 2 @ . @ @-@ . @ popularly ) i connected that programs fluorescent ( 1940 ) , <unk> , but two to drained had provided assume that omar liquor oswald . they played <unk> and <unk> or . this is true of barbados . the image at or villaret by a turned ornament with a living new york
