# NLP and Neural Networks

In this exercise, we'll apply our knowledge of neural networks to process natural language. As we did in the bigram exercise, the goal of this lab is to predict the next word, given the previous one.

### Data set

Load the text from "One Hundred Years of Solitude" that we used in our bigrams exercise. It's located in the data folder.

### Important note:

Start with a smaller part of the text, maybe the first 10 paragraphs, since the number of tokens grows quickly as we add more text.

Later you can use a bigger corpus.

In [1]:
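# Load the first 10 paragraphs of chapter1.txt
# NOTE: a line read this way never contains '\n\n', so `count` stays at 0 and
# the whole file is loaded. To stop after 10 paragraphs (assuming they are
# separated by blank lines), test `linea.strip() == ''` instead.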
text = ''
count = 0
with open('./data/chapter1.txt', 'r', encoding='utf-8') as archivo:
    for linea in archivo:
        text += linea
        if '\n\n' in linea:
            count += 1
        if count == 10:
            break

Don't forget to prepare the data by generating the corresponding tokens.

In [2]:
import torch
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

In [3]:
tokens = tokenizer.tokenize(text)

### Let's prepare the data set.

Our neural network needs an input X and an output y. Remember that both are numerical, so you need a mapping from tokens to numbers, and vice versa.

In [4]:
# In this case, consider a bigram (w1, w2).
# Assign w1 to the X vector and w2 to the y vector. Why do we do this?

In [5]:
uniq_tokens = sorted(list(set(tokens)))

In [6]:
ttoi = {t:i for i,t in enumerate(uniq_tokens)}
itot = {i:t for t,i in ttoi.items()}

In [7]:
W1 = tokens[:-1]
W2 = tokens[1:]

# Every token is in ttoi, so index directly; a string fallback would break torch.tensor
X = [ttoi[w1] for w1 in W1]
y = [ttoi[w2] for w2 in W2]
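
To see why `w1` goes into X and `w2` into y, here is a minimal sketch on a toy sentence (the sentence and all `toy_` names are made up for illustration). Each position in X holds a word, and the same position in y holds the word that followed it, which is what the network has to learn to predict.

```python
# Toy illustration: build (w1, w2) bigram pairs and map them to integer ids.
toy_tokens = ["the", "world", "was", "so", "recent", "the", "world"]

toy_vocab = sorted(set(toy_tokens))
toy_ttoi = {t: i for i, t in enumerate(toy_vocab)}   # token -> id
toy_itot = {i: t for t, i in toy_ttoi.items()}       # id -> token

toy_X = [toy_ttoi[w] for w in toy_tokens[:-1]]  # previous word (input)
toy_y = [toy_ttoi[w] for w in toy_tokens[1:]]   # next word (target)

# Print each training pair as "previous word -> next word"
for x_id, y_id in zip(toy_X, toy_y):
    print(f"{toy_itot[x_id]!r} -> {toy_itot[y_id]!r}")
```

Shifting the token list by one position yields one training pair per adjacent pair of words, so X and y always have one element fewer than the token list.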

In [8]:
# Don't forget that since we are using torch, our training set vectors should be tensors

In [9]:
X_tensor = torch.tensor(X, dtype=torch.long)
y_tensor = torch.tensor(y, dtype=torch.long)

In [10]:
# Note that our vectors hold integers, which can be thought of as categorical variables.
# torch provides the one_hot function, which generates tensors suitable for our network.
# Make sure that the dtype of your tensor is float.

In [11]:
n_tokens = len(uniq_tokens)

In [12]:
import torch.nn.functional as F

X_one_hot = F.one_hot(X_tensor, num_classes=n_tokens)

In [13]:
X_one_hot = X_one_hot.float()
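
As a small illustration of what the one-hot step produces, the sketch below encodes three made-up token ids with a hypothetical vocabulary of four tokens. The `toy_` names are not part of the exercise.

```python
import torch
import torch.nn.functional as F

toy_ids = torch.tensor([2, 0, 1])                 # three token ids
toy_one_hot = F.one_hot(toy_ids, num_classes=4)   # shape (3, 4), integer dtype
toy_one_hot = toy_one_hot.float()                 # nn.Linear expects float inputs

print(toy_one_hot)
# argmax along each row recovers the original ids
print(toy_one_hot.argmax(dim=1))
```

Each row has a single 1.0 at the column of its token id, so the number of columns equals the vocabulary size.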

### Network design
To start, we are going to have a very simple network. Define a single-layer network.

In [14]:
# How many neurons should our input layer have?
# Use as many neurons as the total number of categories (from your one-hot encoded tensors)

In [15]:
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleNet, self).__init__()
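        self.fc = nn.Linear(input_size, output_size)  # A single linear layer

    def forward(self, x):
        out = self.fc(x)
        # Apply softmax to the output.
        # NOTE: nn.CrossEntropyLoss expects raw logits and applies log-softmax
        # itself, so training on these probabilities applies softmax twice,
        # which likely contributes to the loss plateau seen during training.
        out = F.softmax(out, dim=1)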
        return out

modelo = SimpleNet(input_size=n_tokens, output_size=n_tokens)

In [16]:
# Use the softmax as your activation layer

In [49]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(modelo.parameters(), lr=0.1)

In [18]:
# Train your network

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_one_hot, y_tensor, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape}, {y_train.shape}")
print(f"Test set size: {X_test.shape}, {y_test.shape}")

Training set size: torch.Size([5206, 1962]), torch.Size([5206])
Test set size: torch.Size([1302, 1962]), torch.Size([1302])


In [50]:
modelo.train()
num_epochs = 50

for epoch in range(num_epochs):
    outputs = modelo(X_train)
    
    loss = criterion(outputs, y_train)
    
    optimizer.zero_grad()
    loss.backward()
    
    optimizer.step()
    
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
        
print("Training complete")


Epoch [10/50], Loss: 7.1892
Epoch [20/50], Loss: 7.1526
Epoch [30/50], Loss: 7.1476
Epoch [40/50], Loss: 7.1457
Epoch [50/50], Loss: 7.1445
Training complete
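
The loss above moves only from 7.1892 to 7.1445. One likely cause is that `nn.CrossEntropyLoss` applies log-softmax internally and therefore expects raw scores (logits) rather than probabilities. The following is a minimal sketch of the logits-based setup on made-up toy data; `LogitNet` and all `toy_` names are hypothetical and not part of the original notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class LogitNet(nn.Module):
    """Single linear layer that returns raw logits (no softmax in forward)."""
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, x):
        return self.fc(x)

# Toy data: 4 "tokens", where token i is always followed by token (i + 1) % 4.
toy_n = 4
toy_X = F.one_hot(torch.arange(toy_n), num_classes=toy_n).float()
toy_y = (torch.arange(toy_n) + 1) % toy_n

toy_model = LogitNet(toy_n, toy_n)
toy_criterion = nn.CrossEntropyLoss()  # applies log-softmax internally
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.1)

first_loss = None
for epoch in range(100):
    loss = toy_criterion(toy_model(toy_X), toy_y)
    if first_loss is None:
        first_loss = loss.item()
    toy_optimizer.zero_grad()
    loss.backward()
    toy_optimizer.step()
last_loss = loss.item()

# Softmax is applied only when probabilities are needed, e.g. at inference time.
with torch.no_grad():
    toy_probs = F.softmax(toy_model(toy_X), dim=1)

print(f"loss: {first_loss:.4f} -> {last_loss:.4f}")
print(toy_probs.argmax(dim=1))
```

Keeping softmax out of `forward` still lets you use it as the output activation whenever you need probabilities; it just is not applied a second time inside the loss.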


### Analysis

1. Test your network with a few words

In [42]:
# Get an output tensor for each of your tests
modelo.eval()

output_tensors = modelo(X_test)

print("Output tensor for the first test examples:")
print(output_tensors[:2])

_, predicted_categories = torch.max(output_tensors, 1)

print("Predicted categories for the first test examples:")
print(predicted_categories[:2])


Output tensor for the first test examples:
tensor([[0.0005, 0.0005, 0.0005,  ..., 0.0005, 0.0005, 0.0009],
        [0.0005, 0.0005, 0.0005,  ..., 0.0005, 0.0005, 0.0009]],
       grad_fn=<SliceBackward0>)
Predicted categories for the first test examples:
tensor([967, 852])
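
Predictions for `X_test` line up row by row with `y_test`, so that is the tensor to compare them against. Here is a minimal sketch of that comparison, using small made-up tensors as stand-ins for `modelo(X_test)` and `y_test` (the `toy_` names are hypothetical):

```python
import torch

# Hypothetical stand-ins for modelo(X_test) and y_test: 3 examples, 4 classes.
toy_outputs = torch.tensor([[0.1, 0.7, 0.1, 0.1],
                            [0.6, 0.2, 0.1, 0.1],
                            [0.2, 0.2, 0.5, 0.1]])
toy_y_test = torch.tensor([1, 3, 2])

toy_pred = toy_outputs.argmax(dim=1)  # most probable class per row
toy_accuracy = (toy_pred == toy_y_test).float().mean().item()

print(toy_pred.tolist())
print(f"accuracy: {toy_accuracy:.2f}")
```

With the real model, wrapping the forward pass in `torch.no_grad()` avoids building a gradient graph during evaluation.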


In [43]:
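# NOTE: this lists every target in corpus order; to line up with the
# predictions for X_test, use y_test.tolist() instead.
[itot.get(w2, w2) for w2 in y_tensor.tolist()]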

['YEARS',
 'LATER',
 'as',
 'he',
 'faced',
 'the',
 'firing',
 'squad.',
 'Colonel',
 'Aureliano',
 'Buendia',
 'was',
 'to',
 'remember',
 'that',
 'distant',
 'afternoon',
 'when',
 'his',
 'father',
 'took',
 'him',
 'to',
 'discover',
 'ice.',
 'At',
 'that',
 'time',
 'Macondo',
 'was',
 'a',
 'village',
 'of',
 'twenty',
 'adobe',
 'houses',
 ',',
 'built',
 'on',
 'the',
 'bank',
 'of',
 'a',
 'river',
 'of',
 'clear',
 'water',
 'that',
 'ran',
 'along',
 'a',
 'bed',
 'of',
 'polished',
 'stones',
 ',',
 'which',
 'were',
 'white',
 'and',
 'enormous',
 ',',
 'like',
 'prehistoric',
 'eggs.',
 'The',
 'world',
 'was',
 'so',
 'recent',
 'that',
 'many',
 'things',
 'lacked',
 'names',
 ',',
 'and',
 'in',
 'order',
 'to',
 'indicate',
 'them',
 'it',
 'was',
 'necessary',
 'to',
 'point.',
 'Every',
 'year',
 'during',
 'the',
 'month',
 'of',
 'March',
 'a',
 'family',
 'of',
 'ragged',
 'gypsies',
 'would',
 'set',
 'up',
 'their',
 'tents',
 'near',
 'the',
 'village',
 ',

In [44]:
[itot.get(w2, w2) for w2 in predicted_categories.tolist()]

['it',
 'him',
 'of',
 'is',
 'to',
 'in',
 'children',
 'as',
 'did',
 'them',
 'great',
 'of',
 'the',
 'the',
 'have',
 'he',
 'him',
 'great',
 'the',
 'him',
 'will',
 'did',
 ',',
 'him',
 'room',
 'them',
 'have',
 'a',
 'the',
 'paid',
 'them',
 'did',
 ',',
 'great',
 'and',
 'own',
 'would',
 'a',
 'in',
 'would',
 'with',
 'them',
 'into',
 'the',
 'great',
 ',',
 'Arcadio',
 'of',
 'the',
 'which',
 'the',
 'have',
 'to',
 'him',
 'the',
 'him',
 'own',
 'to',
 'did',
 'children',
 'of',
 'of',
 'the',
 'great',
 'in',
 'Buendia',
 'as',
 'a',
 'that',
 'as',
 ',',
 'an',
 'the',
 'all',
 'to',
 'great',
 'him',
 'the',
 'the',
 'for',
 'an',
 'in',
 ',',
 'with',
 'Iris',
 'own',
 'a',
 'to',
 'in',
 'the',
 'as',
 'of',
 'the',
 'an',
 'Arcadio',
 'absorbed',
 'as',
 'Iris',
 'great',
 'great',
 'did',
 'of',
 'him',
 'gypsy',
 'to',
 'room',
 'his',
 'which',
 'the',
 'for',
 'Iris',
 'as',
 'of',
 'had',
 'Arcadio',
 'own',
 'the',
 'as',
 'as',
 'the',
 'them',
 'their

2. What does each value in the tensor represent?
    - Each value in the tensor `X_one_hot` is part of a one-hot representation. A value of 1.0 indicates that the corresponding category is present at that position; a value of 0.0 indicates that it is not. The vector `y` holds the categories that the network learns to predict from the rows of `X_one_hot`. In the output tensor, each value is the probability the model assigns to the corresponding token being the next word.
3. Why does it make sense to choose that number of neurons in our layer?
    - It makes sense to match the number of neurons to the number of unique tokens (categories), because we want each neuron in the input layer to correspond to one category in the data. This ensures that all the categorical information is represented in the network.

4. What's the negative log-likelihood for each example?

In [None]:
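# `modelo` already applies softmax in forward, so take the log directly
# instead of applying log_softmax a second time.
modelo.eval()
with torch.no_grad():
    probs = modelo(X_train)

log_probs = torch.log(probs)

# reduction='none' keeps one NLL value per example instead of averaging
nll_loss = nn.NLLLoss(reduction='none')
nll = nll_loss(log_probs, y_train)

print(f"Negative log-likelihood (NLL) per training example: {nll}")
print(f"Mean NLL over the training set: {nll.mean().item():.4f}")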


5. Try generating a few sentences.

In [114]:
modelo.eval()
seed_token_id= 50
sentence_num = [seed_token_id]
for _ in range(20):
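    # NOTE: this slices the row at corpus position `seed_token_id`, not the
    # one-hot vector of token id `seed_token_id`; to feed the id itself, use
    # F.one_hot(torch.tensor([seed_token_id]), num_classes=n_tokens).float()
    X_word = X_one_hot[seed_token_id:seed_token_id+1]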
    y_word = modelo(X_word)
    _, predicted_word = torch.max(y_word, 1)
    new_word= itot.get(predicted_word.item(), predicted_word.item())
    sentence_num.append(predicted_word.item())
    seed_token_id = predicted_word.item()
    
sentence = [itot.get(w, w) for w in sentence_num]
sentence = ' '.join(sentence)

In [115]:
sentence

'I the , he hand from the , he hand from the , he hand from the , he hand from'
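
The generated text falls into the loop "he hand from the ,". Greedy decoding is deterministic, so once a token repeats the sequence cycles. A common alternative is to sample the next token from the predicted distribution with `torch.multinomial`. The sketch below assumes a made-up vocabulary and a fixed random logit table standing in for the trained model; `toy_model`, `generate` and the `toy_` names are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

toy_vocab = ["he", "hand", "from", "the", ","]
toy_n = len(toy_vocab)

# Hypothetical stand-in for the trained model: a fixed table of next-token logits.
toy_logits_table = torch.randn(toy_n, toy_n)

def toy_model(x_one_hot):
    """Return next-token probabilities for a batch of one-hot rows."""
    return F.softmax(x_one_hot @ toy_logits_table, dim=1)

def generate(seed_id, length, sample=True):
    """Generate `length` token ids after `seed_id`, by sampling or by argmax."""
    ids = [seed_id]
    for _ in range(length):
        # Build the input from the token id itself, not from a corpus position.
        x = F.one_hot(torch.tensor([ids[-1]]), num_classes=toy_n).float()
        probs = toy_model(x)
        if sample:
            next_id = torch.multinomial(probs, num_samples=1).item()
        else:
            next_id = probs.argmax(dim=1).item()
        ids.append(next_id)
    return ids

greedy_ids = generate(seed_id=0, length=10, sample=False)
sampled_ids = generate(seed_id=0, length=10, sample=True)
print(" ".join(toy_vocab[i] for i in greedy_ids))
print(" ".join(toy_vocab[i] for i in sampled_ids))
```

Sampling trades determinism for variety: repeated calls give different sentences, and low-probability words appear occasionally.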

6. What's the negative log-likelihood for each sentence?
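
For a bigram model, one way to score a sentence is to sum the negative log-probabilities of its consecutive word pairs. The sketch below assumes a small hand-written probability table in place of the trained model's outputs; `toy_probs`, `sentence_nll` and the vocabulary are made up for illustration.

```python
import torch

toy_vocab = ["the", "world", "was", "recent"]
toy_ttoi = {t: i for i, t in enumerate(toy_vocab)}

# Hypothetical next-token probabilities: row i is P(next token | token i).
toy_probs = torch.tensor([[0.1, 0.6, 0.1, 0.2],
                          [0.1, 0.1, 0.7, 0.1],
                          [0.3, 0.1, 0.1, 0.5],
                          [0.4, 0.2, 0.2, 0.2]])

def sentence_nll(words):
    """Sum of -log P(w2 | w1) over consecutive word pairs."""
    ids = torch.tensor([toy_ttoi[w] for w in words])
    prev_ids, next_ids = ids[:-1], ids[1:]
    pair_probs = toy_probs[prev_ids, next_ids]  # probability of each observed bigram
    return -torch.log(pair_probs).sum().item()

likely = sentence_nll(["the", "world", "was", "recent"])
unlikely = sentence_nll(["recent", "was", "world", "the"])
print(f"likely sentence NLL:   {likely:.4f}")
print(f"unlikely sentence NLL: {unlikely:.4f}")
```

A lower value means the model finds the sentence more plausible. Dividing by the number of pairs gives a per-word average that is comparable across sentences of different lengths.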

### Design your own neural network (more layers and different number of neurons)
The goal is to get sentences that make more sense.
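
As one possible starting point (a sketch, not the expected solution), the network below replaces the one-hot input with an embedding layer and adds a hidden layer. `BiggerNet`, its layer sizes and the `toy_` names are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class BiggerNet(nn.Module):
    """Embedding -> hidden layer -> logits over the vocabulary."""
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # takes token ids directly
        self.hidden = nn.Linear(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        h = torch.tanh(self.hidden(self.embed(token_ids)))
        return self.out(h)  # raw logits, suitable for nn.CrossEntropyLoss

toy_vocab_size = 10
toy_model = BiggerNet(toy_vocab_size)
toy_ids = torch.tensor([1, 4, 7])  # a batch of three token ids
toy_logits = toy_model(toy_ids)
print(toy_logits.shape)
```

An embedding lookup is equivalent to multiplying a one-hot vector by a weight matrix, so this model can be trained on `X_tensor` directly without materialising the one-hot tensor.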