# Objective

This notebook builds a feedforward neural network language model step by step and trains it to predict the next token from a fixed-size context.

# Dependencies

When starting the notebook or restarting the kernel, all dependencies can be loaded by running the following cells. This is also the place to install any missing dependencies.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

# import sys
# from pathlib import Path

# path_notebook = Path('/content/drive', 'PATH_TO_NOTEBOOK')
# sys.path.append(str(path_notebook))  # sys.path entries should be strings

In [None]:
# !pip install -r requirements.txt
# !python -m spacy download en_core_web_sm 
# !python -m spacy download es_core_news_sm

**Python dependencies**

In [None]:
import spacy
import torch
import numpy as np
import pandas as pd
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

from tqdm.auto import tqdm
from pathlib import Path
from torch.utils.data import DataLoader
import torch.nn.functional as F

import sys
sys.path.append('..')

import warnings
warnings.simplefilter('ignore')

**Notebook-specific dependencies**

In [1]:
from NN4NLP.utils.utils import print_
from NN4NLP.config.config import PATHS
from NN4NLP.utils.utils_vocab import BasicTokenizer, CustomDataset
from NN4NLP.utils.utils_visualization import NN4NLPPlots
from NN4NLP.utils.utils_training import NN4NLPTrainer
from NN4NLP.models.nn_models import FFNLanguageModeler

# Sections

1. [Overview of the network](#red)
2. [Feedforward network](#ffn)
3. [Training](#training)

# Overview of the network <a class="anchor" id="red"></a>

Feedforward Neural Networks (FNNs), also known as Multilayer Perceptrons, form the fundamental basis for understanding neural networks in natural language processing (NLP). In NLP tasks, these networks process textual data by converting it into numerical vectors called embeddings. These embeddings are then fed into the network to predict various aspects of language, such as the next word in a sentence or the sentiment expressed in a text.

We will start by creating an FNN for some very simple sequential data:


In [None]:
sequential_data = [str(x) for x in range(10)]
print_(sequential_data)

As usual, we initialize the tokenizer. Note that this time the tokenization itself is very simple: the text is split on spaces with the `split` method.

In [None]:
# Create tokenizer
special_symbols = ['<UNK>', '<pad>', '<s>', '</s>']
simple_tokenizer = lambda text: text.split(' ')
tokenizer = BasicTokenizer(simple_tokenizer, special_symbols)
tokenizer.initialize_from_iterable(sequential_data)
print(f'Number of tokens in the tokenizer: {tokenizer.get_vocab_size()}')
print_(tokenizer.stoi)
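
To make the string-to-index mapping concrete, here is a minimal sketch of how such a vocabulary could be built. `build_vocab` is a hypothetical helper written for illustration; it is not the implementation of `BasicTokenizer`, whose index ordering may differ.

```python
def build_vocab(texts, special_symbols, tokenize=str.split):
    """Map each distinct token to an integer id, special symbols first.

    Hypothetical helper for illustration; not the BasicTokenizer code.
    """
    itos = list(special_symbols)          # index -> token
    for text in texts:
        for token in tokenize(text):
            if token not in itos:
                itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}  # token -> index
    return stoi, itos


demo_stoi, demo_itos = build_vocab(["0", "1", "2"], ["<UNK>", "<pad>"])
print(demo_stoi)
```

Placing the special symbols first is a design choice of this sketch: it keeps their indices fixed regardless of the corpus.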

We pair each word with a fixed-size context: for the target word at position `i`, the context is made up of the `CONTEXT_SIZE` words that precede it, at positions `i - j - 1` for `j` running from `CONTEXT_SIZE - 1` down to `0`.


In [None]:
CONTEXT_SIZE = 2

ngrams = [
    (
        [sequential_data[i - j - 1] for j in range(CONTEXT_SIZE - 1, -1, -1)],
        sequential_data[i]
    )
    for i in range(CONTEXT_SIZE, len(sequential_data))
]
print_(ngrams[:2])
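
The same pairing can be written with a slice, which may be easier to read. This is an equivalent sketch; `build_ngrams` is a name introduced here for illustration.

```python
def build_ngrams(tokens, context_size):
    """Pair each token with the `context_size` tokens that precede it."""
    return [
        (tokens[i - context_size:i], tokens[i])
        for i in range(context_size, len(tokens))
    ]


demo_tokens = [str(x) for x in range(5)]
demo_pairs = build_ngrams(demo_tokens, context_size=2)
for demo_context, demo_target in demo_pairs:
    print(demo_context, "->", demo_target)
```

Note that the first `context_size` tokens never appear as targets, so a list of `n` tokens yields `n - context_size` examples.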

The neural network `FFNLanguageModeler`, located in the `NN4NLP.models.nn_models` module, is a language model based on a feedforward neural network (FNN), designed to predict the next word in a sequence from a fixed context. Let's analyze its structure and the data flow through its layers:

1. Input: A sequence of `CONTEXT_SIZE` words represented by their indices in the vocabulary.

In [None]:
context, target = ngrams[1]
print("context:", context)
print("context index:", tokenizer.encode(context).ids)

2. Conversion to embeddings: Each word is converted into a dense vector of size `embedding_dim`.

In [None]:
embedding_dim = 2
vocab_size = tokenizer.get_vocab_size()
embeddings = nn.Embedding(vocab_size, embedding_dim) # <= uses PyTorch's Embedding layer

In [None]:
for n in tokenizer.encode(context).ids: 
    embedding = embeddings(torch.tensor(n))
    print("word", tokenizer.itos[n])
    print("index", n)
    print("embedding", embedding)
    print("embedding shape", embedding.shape)

3. Concatenation: The embeddings are joined into a single input vector.

In [None]:
my_embeddings = embeddings(torch.tensor(tokenizer.encode(context).ids))
my_embeddings.shape

In [None]:
my_embeddings = my_embeddings.reshape(1,-1)
my_embeddings.shape

In [None]:
HIDDEN_SIZE = 6
linear1 = nn.Linear(embedding_dim*CONTEXT_SIZE, HIDDEN_SIZE) # <= uses PyTorch's Linear layer

In [None]:
hidden_output = linear1(my_embeddings)
hidden_output.shape

4. Non-linear transformation: The vector is passed through a hidden layer with ReLU activation.

In [None]:
hidden_output = F.relu(hidden_output)
hidden_output

5. Word prediction: The final output is a vector of logits of size `vocab_size`, representing the score for each word in the vocabulary.

In [None]:
linear2 = nn.Linear(HIDDEN_SIZE, vocab_size) # <= uses PyTorch's Linear layer

In [None]:
out = linear2(hidden_output)
out
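
The five steps above can be gathered into a single module. The following is a minimal sketch under the assumption that the model receives a flat tensor of token indices; `TinyFFNLM` is a hypothetical stand-in, and the actual `FFNLanguageModeler` may be implemented differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyFFNLM(nn.Module):
    """Hypothetical stand-in for FFNLanguageModeler (illustration only)."""

    def __init__(self, vocab_size, embedding_dim, hidden_size, context_size):
        super().__init__()
        self.context_size = context_size
        self.embedding_dim = embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, hidden_size)
        self.linear2 = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids):
        # token_ids: flat tensor holding batch_size * context_size indices
        embedded = self.embeddings(token_ids)
        # Concatenate the embeddings of each context into a single row
        flat = embedded.reshape(-1, self.context_size * self.embedding_dim)
        hidden = F.relu(self.linear1(flat))
        return self.linear2(hidden)  # logits, shape (batch_size, vocab_size)


# vocab_size=14 is an arbitrary value chosen for this example
tiny_model = TinyFFNLM(vocab_size=14, embedding_dim=2, hidden_size=6, context_size=2)
tiny_batch = torch.tensor([4, 5, 5, 6])  # two contexts of two token ids each
tiny_logits = tiny_model(tiny_batch)
print(tiny_logits.shape)
```

The reshape is what turns `context_size` separate embeddings into one input vector per example, so the first linear layer must expect `context_size * embedding_dim` features.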

## Putting everything into a pipeline

We have just looked in detail at the steps the network follows. All of them must now come together to create, train, and evaluate the model.

The first thing we need to do in the pipeline, after creating the tokenizer, is to create the dataloader. Note that this dataloader requires a number of examples that can be evenly distributed across batches.

In [None]:
device = NN4NLPTrainer.get_device()
print(f'Device found: {device}')

CONTEXT_SIZE = 2
BATCH_SIZE = 4
EMBEDDING_DIM = 2
HIDDEN_SIZE = 6

padding = len(sequential_data) % BATCH_SIZE
tokens_pad = sequential_data + sequential_data[:padding] # <= wrap around to even out the last batch

ngrams = [
    (
        [tokens_pad[i - j - 1] for j in range(CONTEXT_SIZE - 1, -1, -1)],
        tokens_pad[i]
    )
    for i in range(CONTEXT_SIZE, len(tokens_pad))
]

dataset = CustomDataset(ngrams)

In [None]:
tokens_pad

We create the collate function:

In [None]:
def collate_batch(batch):
    context_list, target_list = list(), list()
    for context, target in batch:
        target_id = tokenizer.encode([target]).ids
        context_ids = tokenizer.encode(context).ids
        context_ids = torch.tensor(context_ids, dtype=torch.int64)
        target_list.append(target_id)
        context_list.append(context_ids)

    target_list = torch.tensor(target_list, dtype=torch.int64)
    context_list = torch.cat(context_list)
    return context_list.to(device), target_list.to(device).reshape(-1)

We create the `DataLoader`:


In [None]:
dataloader = DataLoader(
    dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
print(f'Dataloader size: {len(dataloader)}')

print('')
print('--- An example batch ---')
for context, target in dataloader:
    print(f'Context shape: {context.shape}')
    print(f'Target shape: {target.shape}')
    print(f"context: {context}")
    print(f"target: {target}")
    print(f"decoded context: {tokenizer.decode(context)}")
    print(f"decoded target: {tokenizer.decode(target)}")
    break

# Feedforward network <a class="anchor" id="ffn"></a>

We have already implemented the neural network in PyTorch in the `FFNLanguageModeler` class:

In [None]:
model = FFNLanguageModeler(
    vocab_size=vocab_size, 
    embedding_dim=EMBEDDING_DIM, 
    hidden_size=HIDDEN_SIZE, 
    context_size=CONTEXT_SIZE
).to(device)
model.summary()

Note that the network receives an entire batch obtained from the dataloader and returns the prediction of the next word:


In [None]:
context, target = next(iter(dataloader))
print(f"decoded context: {tokenizer.decode(context)}")
print(f"decoded target: {tokenizer.decode(target)}")
out = model(context)
out.shape


In the output, the first dimension corresponds to the batch size, while the second has size `vocab_size` and holds one logit (unnormalized score) for each candidate next word.

To predict the next word, we take the index with the highest logit, which is also the index with the highest probability because softmax preserves the ordering. This is done for each of the datapoints in the batch:

In [None]:
predicted_index = torch.argmax(out,1)
predicted_index

We find the corresponding token:

In [None]:
tokenizer.decode([i.item() for i in predicted_index])
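
The model's raw outputs are scores (logits) rather than probabilities. As a small illustration in plain Python, the sketch below applies softmax to a made-up score vector and checks that the winning index does not change; the values in `example_logits` are invented for the example.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [1.0, 3.0, 0.5]  # made-up scores for a 3-word vocabulary
example_probs = softmax(example_logits)

best_by_logit = max(range(len(example_logits)), key=example_logits.__getitem__)
best_by_prob = max(range(len(example_probs)), key=example_probs.__getitem__)
print(best_by_logit, best_by_prob)
```

Because the two indices always agree, `torch.argmax` can be applied directly to the logits without computing the softmax first.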

The following is a function that generates tokens from a given context:

In [None]:
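def generar(model, context=None, number_of_words=10):
    """Greedily generate `number_of_words` tokens that continue `context`."""
    model.eval()
    if context is None:
        context = [str(x) for x in range(CONTEXT_SIZE)]
    context = list(context)  # work on a copy so the caller's list is not modified
    my_gen = ' '.join(context)
    with torch.no_grad():
        for _ in range(number_of_words):
            # Only the last CONTEXT_SIZE tokens are fed to the model
            tokens_ids = tokenizer.encode(context[-CONTEXT_SIZE:]).ids
            prediction = model(torch.tensor(tokens_ids).to(device))
            # Greedy decoding: keep the token with the highest score
            word_indx = torch.argmax(prediction)
            word = tokenizer.decode([word_indx.item()])[0]
            context.append(word)
            my_gen += " " + word

    return my_gen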

In [None]:
generar(model)

# Training <a class="anchor" id="training"></a>

To train the network, follow these steps:

1. **Set up the loss function and optimizer**:  
   - Use `nn.CrossEntropyLoss` for word classification.  
   - Employ an optimizer such as `Adam` or `SGD`.  

2. **Train the model**:  
   - For each data batch:  
     - Convert the context into embeddings.  
     - Perform forward propagation.  
     - Compute the loss and perform backpropagation.  
     - Update the model weights.  

3. **Evaluate the model**:  
   - Measure accuracy on validation data.  
   - Adjust hyperparameters if necessary.  
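
As a rough illustration of steps 1 and 2, here is a self-contained training loop on toy data. Everything in it (`toy_model`, the single full batch, the layer sizes, the 200 epochs) is an assumption made for the sketch; it is not the code inside `NN4NLPTrainer.train_lm_model`.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy stand-in model: embedding -> concatenate -> linear -> ReLU -> linear
toy_model = nn.Sequential(
    nn.Embedding(10, 4),
    nn.Flatten(),      # (batch, 2, 4) -> (batch, 8)
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 10),
)
toy_criterion = nn.CrossEntropyLoss()
toy_optimizer = optim.SGD(toy_model.parameters(), lr=0.1)

# Contexts (i, i + 1) with target i + 2, used as one full batch
toy_contexts = torch.tensor([[i, i + 1] for i in range(8)])
toy_targets = torch.tensor([i + 2 for i in range(8)])

toy_losses = []
for epoch in range(200):
    toy_optimizer.zero_grad()                      # clear old gradients
    toy_logits = toy_model(toy_contexts)           # forward propagation
    loss = toy_criterion(toy_logits, toy_targets)  # compute the loss
    loss.backward()                                # backpropagation
    toy_optimizer.step()                           # update the weights
    toy_losses.append(loss.item())

print(f"first loss: {toy_losses[0]:.3f}, last loss: {toy_losses[-1]:.3f}")
```

`nn.CrossEntropyLoss` takes the raw logits and the target indices, so no softmax layer is needed at the end of the model.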


In [None]:
criterion = torch.nn.CrossEntropyLoss()

# Define the optimizer for training the model, using stochastic gradient descent (SGD)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Set up a StepLR scheduler, which multiplies the learning rate by gamma every step_size epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
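
To get a feel for what this schedule does, the decay can be computed by hand. The helper `step_lr` below is a hypothetical function that reproduces the StepLR rule in plain Python; it assumes the scheduler is stepped once per epoch.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` scheduler steps under a StepLR-style decay."""
    return initial_lr * gamma ** (epoch // step_size)


# Same settings as above: lr=0.1, gamma=0.9, step_size=1
for epoch in (0, 1, 10, 50):
    print(epoch, round(step_lr(0.1, 0.9, 1, epoch), 6))
```

With `gamma=0.9` and `step_size=1` the learning rate shrinks by 10% per step, so over a long run it becomes very small. The schedule only takes effect if `scheduler.step()` is actually called during training.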

We train the model. Because both the model and the dataset are small, training should take less than a minute:

In [None]:
model, epoch_losses = NN4NLPTrainer.train_lm_model(
    model=model,
    dataloader=dataloader,
    criterion=criterion,
    optimizer=optimizer,
    num_epochs=1500
)

In [None]:
plt.plot(epoch_losses)
plt.xlabel("epochs")
plt.ylabel("loss")
plt.show()

In [None]:
file_name = PATHS['lms'] / Path('lm_fnn.pt')
torch.save(model.state_dict(), file_name)

If training ran into problems, the following cell loads a pre-trained model instead.

In [None]:
# --------------------------------------
# Load model
# --------------------------------------
model = FFNLanguageModeler(
    vocab_size=vocab_size,
    embedding_dim=EMBEDDING_DIM,
    hidden_size=HIDDEN_SIZE,
    context_size=CONTEXT_SIZE
)
file_name = PATHS['lms'] / Path('lm_fnn_pretrained.pt')
# map_location lets a checkpoint saved on another device load on this one
state_dict = torch.load(file_name, map_location=device)
model.load_state_dict(state_dict)
model.to(device)

We call the `generar` function with its default context `['0', '1']`:

In [None]:
generar(model)

**Exercise**:

Your mission is to train a language model to reproduce Master Yoda's phrases, as discussed in the *N-grams as LMs* notebook. To do this, follow these steps:

1. Create a spaCy tokenizer using Master Yoda's phrases as the vocabulary.
2. Create a dataloader with Master Yoda's phrases. Use the following hyperparameters:
    - CONTEXT_SIZE = 5
    - BATCH_SIZE = 32
3. Create an ``FFNLanguageModeler`` with the following hyperparameters:
    - EMBEDDING_DIM = 64
    - HIDDEN_SIZE = 128
4. Train the model for 100 epochs using the same hyperparameters as the ``sequential_data`` model.
5. Generate sentences from the following contexts:
    - Abandonarte la Fuerza no puede
    - un paso delante de nosotros
6. Compute the model's perplexity on its training data. Compare it to the perplexity of the trigram model obtained in the *N-grams as LMs* notebook.

**Expected time**: 6 hours
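
For step 6, perplexity is the exponential of the average negative log-probability that the model assigns to each target token. The sketch below shows the formula in plain Python; `perplexity` is a hypothetical helper and the probabilities are made up.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that always spreads its probability evenly over 4 words
uniform_probs = [0.25] * 8
print(perplexity(uniform_probs))
```

A uniform guess over 4 words gives a perplexity of about 4, which is a useful sanity check. Since `nn.CrossEntropyLoss` with its default mean reduction returns this same average negative log-probability, the perplexity over a dataset can be obtained as the exponential of the mean loss over all its tokens.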

---