# NLP and Neural Networks

In this exercise, we'll apply our knowledge of neural networks to process natural language. As we did in the bigram exercise, the goal of this lab is to predict the next word, given the previous one.

### Data set

Load the text from "One Hundred Years of Solitude" that we used in our bigrams exercise. It's located in the data folder.

### Important note:

Start with a smaller part of the text, for example the first 10 paragraphs, because the number of tokens grows quickly as more text is added.

Later you can use a bigger corpus.

In [153]:
import os

In [154]:
# Change to the data directory on the server container
os.chdir("/home/kmuenala/nlp/data")

In [155]:
# Load the text from chapter1.txt.
# Note: this loop reads the whole file; `count` is never used to stop after 10 paragraphs.
text = ''
count = 0
with open('chapter1.txt', 'r', encoding='utf-8') as archivo:
    for linea in archivo:
        text += linea

Don't forget to prepare the data by generating the corresponding tokens.

In [156]:
import torch
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

In [157]:
tokens = tokenizer.tokenize(text)

### Let's prepare the data set.

Our neural network needs an input X and an output y. Remember that these sets are numerical, so you need a mapping from tokens to numbers and vice versa.

In [158]:
# in this case, let's consider a bigram (w1, w2)
# assign w1 to the X vector and w2 to the y vector. Why do we do this?

In [159]:
uniq_tokens = sorted(list(set(tokens)))

In [160]:
ttoi = {t:i for i,t in enumerate(uniq_tokens)}
itot = {i:t for t,i in ttoi.items()}

In [161]:
W1 = tokens[:-1]
W2 = tokens[1:]

# Every token is in ttoi, so index directly; a .get fallback would put strings into X and y.
X = [ttoi[w1] for w1 in W1]
y = [ttoi[w2] for w2 in W2]
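
To make the pairing concrete, here is a minimal sketch on a toy token list. The tokens and the `toy_` names are illustrative assumptions, not part of the exercise data. Each bigram (w1, w2) yields one training example, with w1 as the input and w2 as the target.

```python
# Toy corpus; these tokens are illustrative, not taken from the novel.
toy_tokens = ["the", "gypsy", "came", "to", "the", "village"]

toy_vocab = sorted(set(toy_tokens))
toy_ttoi = {t: i for i, t in enumerate(toy_vocab)}

# Each bigram (w1, w2) becomes one training example: w1 is the input, w2 the target.
toy_pairs = list(zip(toy_tokens[:-1], toy_tokens[1:]))
toy_X = [toy_ttoi[w1] for w1, _ in toy_pairs]
toy_y = [toy_ttoi[w2] for _, w2 in toy_pairs]

print(toy_pairs)
print(toy_X, toy_y)
```

A list of n tokens produces n - 1 pairs, which is why `W1` drops the last token and `W2` drops the first.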

In [162]:
# Don't forget that since we are using torch, our training set vectors should be tensors

In [163]:
X_tensor = torch.tensor(X, dtype=torch.long)
y_tensor = torch.tensor(y, dtype=torch.long)

In [164]:
# Note that our vectors are integers, which can be thought of as categorical variables.
# torch provides the one_hot function, which generates tensors suitable for our nn.
# Make sure that the dtype of your tensor is float.

In [165]:
n_tokens = len(uniq_tokens)

In [166]:
import torch.nn.functional as F

X_one_hot = F.one_hot(X_tensor, num_classes=n_tokens)

In [167]:
X_tensor

tensor([  66,  130,   64,  ...,  956, 1200, 1228])

In [170]:
X_one_hot= X_one_hot.float()
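
As a small illustration of what one-hot encoding produces, the sketch below uses made-up ids and a made-up 4x3 weight matrix `W` (both are assumptions for the example). It also shows why one-hot inputs pair naturally with a linear layer: multiplying a one-hot row by a weight matrix simply selects one row of that matrix.

```python
import torch
import torch.nn.functional as F

ids = torch.tensor([2, 0, 1])          # three token ids, vocabulary of size 4 (toy values)
one_hot = F.one_hot(ids, num_classes=4).float()
print(one_hot)

# Multiplying a one-hot row by a weight matrix selects one row of that matrix.
W = torch.arange(12, dtype=torch.float32).reshape(4, 3)
selected = one_hot @ W
print(torch.equal(selected, W[ids]))
```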

### Network design
To start, we are going to use a very simple network. Define a single-layer network.

In [172]:
# How many neurons should our input layer have?
# Use as many neurons as the total number of categories (from your one-hot encoded tensors)

In [173]:
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleNet, self).__init__()
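        self.fc = nn.Linear(input_size, output_size)  # A single linear layer
        
    def forward(self, x):
        out = self.fc(x)
        # Apply softmax to the output. Note that nn.CrossEntropyLoss applies
        # log-softmax internally, so training on these probabilities applies softmax twice.
        out = F.softmax(out, dim=1)
        return out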

modelo = SimpleNet(input_size=n_tokens, output_size=n_tokens)

In [174]:
# Use the softmax as your activation layer

In [175]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(modelo.parameters(), lr=0.1)
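
`nn.CrossEntropyLoss` expects raw scores (logits) and applies log-softmax itself. The sketch below, on random stand-in data (the batch of 8 examples is an assumption; 1962 is the vocabulary size reported in this notebook), checks that equivalence and then measures what happens when probabilities are passed in instead: even a perfect one-hot prediction cannot push the loss below `-1 + log(e + V - 1)`.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
V = 1962                                  # vocabulary size reported in this notebook
logits = torch.randn(8, V)                # random scores for 8 hypothetical examples
targets = torch.randint(0, V, (8,))

ce = nn.CrossEntropyLoss()

# CrossEntropyLoss on raw logits equals NLLLoss on log-softmax outputs.
loss_logits = ce(logits, targets)
loss_manual = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss_logits, loss_manual))

# Feeding probabilities applies softmax twice. Even a perfect one-hot
# prediction then cannot push the loss below -1 + log(e + V - 1).
perfect = F.one_hot(targets, num_classes=V).float()
floor = ce(perfect, targets).item()
print(floor, -1 + math.log(math.e + V - 1))
print(math.log(V))                        # loss of a uniform prediction
```

For V = 1962 the floor is about 6.58 and the uniform baseline about 7.58, so a model whose `forward` ends in softmax can only move the loss within that narrow band. Returning logits from `forward` and applying softmax only at prediction time avoids this.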

In [176]:
# Train your network

In [177]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_one_hot, y_tensor, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape}, {y_train.shape}")
print(f"Test set size: {X_test.shape}, {y_test.shape}")

Training set size: torch.Size([5206, 1962]), torch.Size([5206])
Test set size: torch.Size([1302, 1962]), torch.Size([1302])


In [178]:
modelo.train()
num_epochs = 100

for epoch in range(num_epochs):
    outputs = modelo(X_train)
    
    loss = criterion(outputs, y_train)
    
    optimizer.zero_grad()
    loss.backward()
    
    optimizer.step()
    
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
        
print("Training complete")

Epoch [10/100], Loss: 7.5783
Epoch [20/100], Loss: 7.5436
Epoch [30/100], Loss: 7.3709
Epoch [40/100], Loss: 7.2706
Epoch [50/100], Loss: 7.2338
Epoch [60/100], Loss: 7.2063
Epoch [70/100], Loss: 7.1549
Epoch [80/100], Loss: 7.1484
Epoch [90/100], Loss: 7.1470
Epoch [100/100], Loss: 7.1463
Training complete


### Analysis

1. Test your network with a few words

In [179]:
# Get an output tensor for each of your tests
modelo.eval()

output_tensors = modelo(X_test)

print("Output tensor for the first test examples:")
print(output_tensors[:2])

_, predicted_categories = torch.max(output_tensors, 1)

print("Predicted categories for the first test examples:")
print(predicted_categories[:2])


Output tensor for the first test examples:
tensor([[3.5165e-06, 9.9836e-08, 3.7724e-06,  ..., 3.7385e-06, 9.7503e-08,
         2.4320e-06],
        [7.5782e-06, 2.1742e-07, 8.1329e-06,  ..., 8.0918e-06, 2.1184e-07,
         4.2950e-06]], grad_fn=<SliceBackward0>)
Predicted categories for the first test examples:
tensor([1207,  852])


In [180]:
print(predicted_categories.shape)
print(X_test.shape)
print(y_test.shape)

torch.Size([1302])
torch.Size([1302, 1962])
torch.Size([1302])


2. What does each value in the tensor represent?
    - Each value in the `X_one_hot` tensor is part of a one-hot representation. A value of 1.0 marks the position of the token's category; a value of 0.0 means that category is not present. The `y` vector holds the categories (next tokens) that the network learns to predict from the `X_one_hot` rows.
    - Each row of the network's output tensor holds one value per vocabulary token, and the index of the largest value is the next token predicted by the network, as shown below.

In [181]:
[itot.get(w2, w2) for w2 in predicted_categories.tolist()]

['on',
 'him',
 'forget',
 'was',
 'to',
 'in',
 'village',
 'as',
 'he',
 ',',
 'long',
 'of',
 'the',
 'the',
 ',',
 'he',
 ',',
 'children',
 'the',
 ',',
 'will',
 'he',
 'had',
 ',',
 'hand',
 ',',
 'of',
 'the',
 'the',
 ',',
 ',',
 'he',
 ',',
 'children',
 'and',
 ',',
 'had',
 'the',
 'in',
 'had',
 'with',
 'of',
 'into',
 'the',
 'long',
 ',',
 'Arcadio',
 ',',
 'the',
 ',',
 'and',
 ',',
 'to',
 ',',
 'the',
 'the',
 'own',
 'to',
 'he',
 'village',
 ',',
 ',',
 'the',
 'long',
 'in',
 'Buendia',
 ',',
 'the',
 'that',
 'bearings.”',
 ',',
 'by',
 'and',
 'of',
 'to',
 'long',
 'the',
 'a',
 'and',
 'During',
 'the',
 'was',
 ',',
 'with',
 'the',
 ',',
 'the',
 'to',
 'in',
 'and',
 ',',
 ',',
 'and',
 'the',
 'Arcadio',
 'enormous',
 ',',
 'the',
 'long',
 'long',
 'he',
 ',',
 ',',
 'gypsy',
 ',',
 'hand',
 'his',
 ',',
 'and',
 ',',
 'the',
 ',',
 'of',
 'had',
 'Arcadio',
 ',',
 'and',
 'Colonel',
 ',',
 'and',
 'five',
 'their',
 ',',
 ',',
 'the',
 'the',
 'to',
 'ti
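
Taking only the arg-max hides the rest of the distribution. As a hedged illustration, the sketch below uses a made-up probability row and a made-up five-token mapping `toy_itot` (both assumptions) to show that a row sums to 1 and how `torch.topk` exposes the runners-up.

```python
import torch

# Hypothetical output row: probabilities over a 5-token vocabulary.
toy_itot = {0: ",", 1: "he", 2: "the", 3: "village", 4: "was"}
probs = torch.tensor([[0.10, 0.05, 0.60, 0.20, 0.05]])

# Each row is a distribution over the vocabulary, so it sums to 1.
print(probs.sum(dim=1))

# argmax keeps only the most likely token; topk shows the runners-up as well.
top_p, top_i = torch.topk(probs, k=3, dim=1)
for p, i in zip(top_p[0].tolist(), top_i[0].tolist()):
    print(f"{toy_itot[i]!r}: {p:.2f}")
```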

3. Why does it make sense to choose that number of neurons in our layer?
    - It makes sense to tie the number of neurons to the number of unique tokens (categories), because each neuron in the input layer then corresponds to one category in the data. This ensures that every category in the vocabulary used for training and testing is represented.

4. What's the negative log-likelihood for each example?

In [182]:
# `modelo` already returns softmax probabilities, so log_softmax is applied on top of them here.
outputs = modelo(X_train)

log_probs = torch.log_softmax(outputs, dim=1)

# NLLLoss averages over the examples by default (reduction='mean').
nll_loss = nn.NLLLoss()
nll = nll_loss(log_probs, y_train)

print('Negative Log Likelihood (NLL) for the training set: ')
print(f"for the training vector: {log_probs}")
print(f"for the training set: {nll.item()}")

outputs = modelo(X_test)

log_probs = torch.log_softmax(outputs, dim=1)

nll_loss = nn.NLLLoss()
nll = nll_loss(log_probs, y_test)

print('\nNegative Log Likelihood (NLL) for the test set: ')
print(f"for the test vector: {log_probs}")
print(f"for the test set: {nll.item()}")


Negative Log Likelihood (NLL) for the training set: 
for the training vector: tensor([[-7.5826, -7.5826, -7.5826,  ..., -7.5826, -7.5826, -7.5826],
        [-7.5826, -7.5826, -7.5826,  ..., -7.5826, -7.5826, -7.5826],
        [-7.5826, -7.5826, -7.5826,  ..., -7.5826, -7.5826, -7.5826],
        ...,
        [-7.5826, -7.5826, -7.5826,  ..., -7.5826, -7.5826, -7.5826],
        [-7.5826, -7.5826, -7.5826,  ..., -7.5826, -7.5826, -7.5826],
        [-7.5826, -7.5826, -7.5826,  ..., -7.5826, -7.5826, -7.5826]],
       grad_fn=<LogSoftmaxBackward0>)
for the training set: 7.146196365356445

Negative Log Likelihood (NLL) for the test set: 
for the test vector: tensor([[-7.5825, -7.5825, -7.5825,  ..., -7.5825, -7.5825, -7.5825],
        [-7.5823, -7.5823, -7.5823,  ..., -7.5823, -7.5823, -7.5823],
        [-7.5826, -7.5826, -7.5826,  ..., -7.5826, -7.5826, -7.5826],
        ...,
        [-7.5820, -7.5822, -7.5820,  ..., -7.5820, -7.5822, -7.5817
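
To get one negative log-likelihood value per example rather than a single average, `nn.NLLLoss` accepts `reduction="none"`. The sketch below uses random stand-in logits and targets (assumptions for illustration) and cross-checks the result against picking the log-probability of the true token by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_logits = torch.randn(4, 6)            # 4 examples, 6-token vocabulary (toy values)
toy_targets = torch.tensor([0, 3, 5, 1])

log_probs = F.log_softmax(toy_logits, dim=1)

# reduction="none" keeps one loss value per example instead of averaging them.
per_example = nn.NLLLoss(reduction="none")(log_probs, toy_targets)
print(per_example)

# Same thing by hand: pick the log-probability of the true token in each row.
by_hand = -log_probs[torch.arange(4), toy_targets]
print(torch.allclose(per_example, by_hand))
print(per_example.mean())                 # what the default reduction="mean" returns
```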

5. Try generating a few sentences.

In [183]:
modelo.eval()
tokens_init = ['I','Aureliano','MANY']
sentences = []
for token_init in tokens_init:
    seed_init= ttoi[token_init]
    sentence = [token_init]
    for _ in range(20):
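        # Note: X_one_hot is indexed by corpus position, so this picks the one-hot row of
        # the token at position seed_init, not the encoding of token id seed_init.
        X_word = X_one_hot[seed_init]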
        y_word = modelo(X_word.reshape(1, len(X_word)))
        _, predicted_word = torch.max(y_word, 1)
        new_word= itot[predicted_word.item()]
        sentence.append(new_word)
        seed_init = predicted_word.item()
    sentences.append(' '.join(sentence))

In [184]:
sentences

['I the , he hand from the , he hand from the , he hand from the , he hand from',
 'Aureliano to , he hand from the , he hand from the , he hand from the , he hand from',
 'MANY children kept , he hand from the , he hand from the , he hand from the , he hand']
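
The generated sentences fall into a short loop, which is typical of greedy (arg-max) decoding. The sketch below is one possible alternative, not the notebook's method: it encodes the current token id directly with `F.one_hot` and can sample the next token with `torch.multinomial` instead of always taking the maximum. The vocabulary `toy_vocab`, the untrained `toy_model`, and the `generate` helper are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_vocab = [",", "he", "the", "village", "was"]   # hypothetical vocabulary
V = len(toy_vocab)
toy_model = nn.Linear(V, V)                        # untrained stand-in that returns logits

def generate(model, start_id, n_steps, sample=True):
    """Generate token ids, feeding each prediction back as the next input."""
    ids = [start_id]
    with torch.no_grad():
        for _ in range(n_steps):
            # Encode the current token id itself, not a row of the training matrix.
            x = F.one_hot(torch.tensor([ids[-1]]), num_classes=V).float()
            probs = F.softmax(model(x), dim=1)
            if sample:
                next_id = torch.multinomial(probs, num_samples=1).item()
            else:
                next_id = probs.argmax(dim=1).item()
            ids.append(next_id)
    return ids

ids = generate(toy_model, start_id=2, n_steps=10)
print(" ".join(toy_vocab[i] for i in ids))
```

With `sample=False` the function reproduces greedy decoding, which makes it easy to compare both strategies on the same model.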

6. What's the negative log-likelihood for each sentence?

In [185]:
nll = []
for sentence in sentences:
    sentence_i  = [ttoi[word] for word in sentence.split()]
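    # Note: log_softmax is applied to the token ids themselves, not to model outputs,
    # so the values below are not likelihoods under the model.
    nll.append(torch.log_softmax(torch.tensor(sentence_i, dtype=torch.long).float().reshape(-1,len(sentence_i)), dim=1))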

In [186]:
nll

[tensor([[-1.6564e+03, -1.3863e+00, -1.7034e+03, -8.7839e+02, -8.9539e+02,
          -9.6139e+02, -1.3863e+00, -1.7034e+03, -8.7839e+02, -8.9539e+02,
          -9.6139e+02, -1.3863e+00, -1.7034e+03, -8.7839e+02, -8.9539e+02,
          -9.6139e+02, -1.3863e+00, -1.7034e+03, -8.7839e+02, -8.9539e+02,
          -9.6139e+02]]),
 tensor([[-1719.,     0., -1738.,  -913.,  -930.,  -996.,   -36., -1738.,  -913.,
           -930.,  -996.,   -36., -1738.,  -913.,  -930.,  -996.,   -36., -1738.,
           -913.,  -930.,  -996.]]),
 tensor([[-1.6401e+03, -1.3261e+03, -7.2610e+02, -1.7031e+03, -8.7810e+02,
          -8.9510e+02, -9.6110e+02, -1.0986e+00, -1.7031e+03, -8.7810e+02,
          -8.9510e+02, -9.6110e+02, -1.0986e+00, -1.7031e+03, -8.7810e+02,
          -8.9510e+02, -9.6110e+02, -1.0986e+00, -1.7031e+03, -8.7810e+02,
          -8.9510e+02]])]
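
Under a bigram model, a sentence's negative log-likelihood is the sum of -log P(w_{i+1} | w_i) over its consecutive token pairs. The following sketch shows one way to compute it; the mapping `toy_ttoi`, the untrained `toy_model`, and the `sentence_nll` helper are hypothetical stand-ins, and the model is assumed to return logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
toy_ttoi = {",": 0, "he": 1, "the": 2, "village": 3, "was": 4}  # hypothetical vocabulary
V = len(toy_ttoi)
toy_model = nn.Linear(V, V)                # untrained stand-in that returns logits

def sentence_nll(model, sentence):
    """Sum of -log P(w_{i+1} | w_i) over the bigrams of a sentence."""
    ids = torch.tensor([toy_ttoi[w] for w in sentence.split()])
    x = F.one_hot(ids[:-1], num_classes=V).float()   # previous tokens
    targets = ids[1:]                                # tokens that followed them
    with torch.no_grad():
        log_probs = F.log_softmax(model(x), dim=1)
    return -log_probs[torch.arange(len(targets)), targets].sum().item()

print(sentence_nll(toy_model, "he was the village"))
```

Dividing the result by the number of bigrams gives a per-token average, which is easier to compare across sentences of different lengths.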

### Design your own neural network (more layers and different number of neurons)
The goal is to get sentences that make more sense.

In [192]:
import torch.nn as nn
import torch.nn.functional as F

class secondNet(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(secondNet, self).__init__()
        # Add more layers
        self.fc1 = nn.Linear(input_size, hidden_size1)  # First hidden layer
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)  # Second hidden layer
        self.fc3 = nn.Linear(hidden_size2, output_size)  # Output layer
        
    def forward(self, x):
        # Pass the data through the layers and activations
        out = F.relu(self.fc1(x))  # ReLU activation after the first layer
        out = F.relu(self.fc2(out))  # ReLU after the second layer
        out = self.fc3(out)  # Output layer, no activation
        out = F.softmax(out, dim=1)  # Apply softmax to the output
        return out

# Change input_size, hidden_size1, hidden_size2 and output_size as needed
modelo_nn = secondNet(input_size=n_tokens, hidden_size1=n_tokens*3, hidden_size2=n_tokens*2, output_size=n_tokens)
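
Hidden layers sized as multiples of the vocabulary grow the parameter count quadratically. The sketch below counts parameters for a single linear layer and for a stack with the same 3x and 2x proportions, using a small stand-in vocabulary of 50 so it runs quickly. The `count_parameters` helper and the `nn.Sequential` stand-in are assumptions for illustration.

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

V = 50   # small stand-in vocabulary so the example runs quickly
single = nn.Linear(V, V)
stacked = nn.Sequential(
    nn.Linear(V, 3 * V), nn.ReLU(),
    nn.Linear(3 * V, 2 * V), nn.ReLU(),
    nn.Linear(2 * V, V),
)
print(count_parameters(single))    # V*V + V
print(count_parameters(stacked))   # 11*V*V + 6*V
```

With the 1962-token vocabulary of this notebook, the same formula gives roughly 42 million parameters, to be fitted from about 5,200 training pairs. Smaller hidden sizes may be worth trying.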


In [206]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(modelo_nn.parameters(), lr=0.001)

In [210]:
modelo_nn.train()
num_epochs = 100

for epoch in range(num_epochs):
    outputs = modelo_nn(X_train)
    
    loss = criterion(outputs, y_train)
    
    optimizer.zero_grad()
    loss.backward()
    
    optimizer.step()
    
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
        
print("Training complete")

Epoch [10/100], Loss: 7.3905
Epoch [20/100], Loss: 7.3769
Epoch [30/100], Loss: 7.3755
Epoch [40/100], Loss: 7.3743
Epoch [50/100], Loss: 7.3734
Epoch [60/100], Loss: 7.3730
Epoch [70/100], Loss: 7.3730
Epoch [80/100], Loss: 7.3730
Epoch [90/100], Loss: 7.3730
Epoch [100/100], Loss: 7.3730
Training complete


In [226]:
modelo_nn.eval()
tokens_init = ['Before','future','Many']
sentences = []
for token_init in tokens_init:
    seed_init= ttoi[token_init]
    sentence = [token_init]
    for _ in range(20):
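        # Note: X_one_hot is indexed by corpus position, so this picks the one-hot row of
        # the token at position seed_init, not the encoding of token id seed_init.
        X_word = X_one_hot[seed_init]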
        y_word = modelo_nn(X_word.reshape(1, len(X_word)))
        _, predicted_word = torch.max(y_word, 1)
        new_word= itot[predicted_word.item()]
        sentence.append(new_word)
        seed_init = predicted_word.item()
    sentences.append(' '.join(sentence))

In [227]:
sentences

['Before the , the , the , the , the , the , the , the , the , the ,',
 'future to of village the , the , the , the , the , the , the , the , the',
 'Many to of village the , the , the , the , the , the , the , the , the']