# NLP and Neural Networks

In this exercise, we'll apply our knowledge of neural networks to process natural language. As in the bigram exercise, the goal of this lab is to predict the next word given the previous one.

### Data set

Load the text from "One Hundred Years of Solitude" that we used in our bigrams exercise. It's located in the data folder.

### Important note:

Start with a smaller part of the text, perhaps the first 10 paragraphs, since the number of tokens grows rapidly as we add more text.

Later you can use a bigger corpus.

In [90]:
# General libraries
import numpy as np

# Deep Learning
import torch
import torch.nn as nn
import torch.optim as optim
print(f"{torch.cuda.is_available()=}")

# General NLP
from nltk import bigrams
from nltk.tokenize import TreebankWordTokenizer

# General ML
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

torch.cuda.is_available()=True


In [91]:
tokenizer = TreebankWordTokenizer()
# Use a context manager so the file is closed after reading
with open('cap1.txt', 'r') as f:
    text = f.read().lower()
tokens = tokenizer.tokenize(text)
print(f"{len(tokens)=}")

len(tokens)=6293


Don't forget to prepare the data by generating the corresponding tokens.

### Let's prepare the data set.

Our neural network needs an input X and an output y. Remember that these sets are numerical, so you need something to map the tokens to numbers, and vice versa.
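A minimal sketch of such a two-way mapping, using plain dictionaries. The names `build_vocab`, `word_to_idx`, and `idx_to_word` are illustrative and not part of the lab, and the token list is a toy sample; the cells below rely on scikit-learn's `CountVectorizer` and `LabelEncoder` for the same purpose.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer id, and back."""
    vocab = sorted(set(tokens))
    word_to_idx = {word: idx for idx, word in enumerate(vocab)}
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return word_to_idx, idx_to_word


# Toy sample, not the real corpus
sample_tokens = ["muchos", "años", "después", ",", "muchos", "años"]
word_to_idx, idx_to_word = build_vocab(sample_tokens)

encoded = [word_to_idx[t] for t in sample_tokens]   # tokens -> numbers
decoded = [idx_to_word[i] for i in encoded]         # numbers -> tokens
print(encoded)
print(decoded)
```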

In [92]:
# In this case, let's consider a bigram (w1, w2).
# Assign w1 to the X vector and w2 to the y vector. Why do we do this?

## Our objective is to predict the next word given an initial word. For this purpose we use bigrams,
## so it is natural to use a simple classification model where y is the second element of the bigram
## and x is the first one. The same idea extends to any classification model. But be careful with overfitting!

In [93]:
bigram_list = list(bigrams(tokens))

In [94]:
# Prepare features and labels
X = [bigram[0] for bigram in bigram_list]  # First word of the bigram as features
y = [bigram[1] for bigram in bigram_list]  # Second word of the bigram as labels

# Convert words to a bag-of-words model
vectorizer = CountVectorizer()
X_vectorized = vectorizer.fit_transform(X)
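One thing worth checking at this step: with default settings, `CountVectorizer` tokenizes with the pattern `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. The sketch below applies that same pattern with `re` to a few tokens (the token list and the helper `surviving_tokens` are made up for illustration) to show that punctuation and single-letter words such as "y" produce no feature, so their rows in `X_vectorized` would be all zeros.

```python
import re

# Default token_pattern of sklearn's CountVectorizer
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def surviving_tokens(token):
    """Return what the default pattern would keep from a single input token."""
    return re.findall(DEFAULT_TOKEN_PATTERN, token.lower())


for tok in ["aureliano", "niño", "y", ",", "a"]:
    print(repr(tok), "->", surviving_tokens(tok))
```

If those tokens matter for the model, passing a custom `token_pattern` or a pre-built `vocabulary` to `CountVectorizer` would be one way to keep them.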

In [95]:
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=0)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 10.88%


In [96]:
# Don't forget that since we are using torch, our training set vectors should be tensors

In [97]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Convert to PyTorch tensors
X_tensor = torch.tensor(X_vectorized.toarray(), dtype=torch.float32)
y_tensor = torch.tensor(y_encoded, dtype=torch.long)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tensor, y_tensor, test_size=0.2, random_state=0)


In [98]:
# Note that our vectors are integers, which can be thought of as categorical variables.
# torch provides the one_hot method, which generates tensors suitable for our nn.
# Make sure that the dtype of your tensor is float.
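A library-free sketch of what one-hot encoding produces, assuming the categories are integer ids starting at 0. The helper `one_hot` here is hypothetical; in PyTorch, `torch.nn.functional.one_hot` plays this role and returns an integer tensor, which is why a cast to float is needed before feeding it to a linear layer.

```python
def one_hot(indices, num_classes):
    """Return one row per index, with 1.0 at that index and 0.0 elsewhere."""
    rows = []
    for idx in indices:
        row = [0.0] * num_classes
        row[idx] = 1.0
        rows.append(row)
    return rows


encoded = one_hot([2, 0, 1], num_classes=4)
for row in encoded:
    print(row)
```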

### Network design
To start, we are going to have a very simple network. Define a single-layer network.

In [99]:
# How many neurons should our input layer have?
# Use as many neurons as the total number of categories (from your one-hot encoded tensors)

In [100]:
X_tensor.shape

torch.Size([6292, 1999])

In [101]:
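class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
        # NOTE: nn.CrossEntropyLoss (used for training below) expects raw logits and
        # applies log-softmax itself; with this extra Softmax the loss cannot drop
        # below about ln(n_classes) - 1.
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        out = self.fc1(x)        # input -> hidden
        out = self.relu(out)
        out = self.fc2(out)      # hidden -> one score per output word
        out = self.softmax(out)  # scores -> probabilities (see NOTE above)
        return out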
    
input_size = X_train.shape[1]
hidden_size = 6292 
output_size = len(label_encoder.classes_)

model = SimpleNN(input_size, hidden_size, output_size)
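One caveat about this design: the model ends in `nn.Softmax`, while `nn.CrossEntropyLoss` (used in the training cell) applies a log-softmax of its own and expects raw scores. The pure-Python sketch below, with made-up numbers and hypothetical helpers `softmax` and `cross_entropy`, illustrates the consequence: once the inputs to the loss are already probabilities, the loss cannot fall below ln(C) - 1 for C classes, which may explain why the reported training loss barely moves.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target):
    """Cross-entropy of softmax(scores) against the class index `target`."""
    return -math.log(softmax(scores)[target])


num_classes = 2000            # illustrative, roughly the size of this lab's vocabulary
logits = [0.0] * num_classes
logits[7] = 20.0              # a very confident, correct prediction for class 7

loss_from_logits = cross_entropy(logits, target=7)
loss_from_probs = cross_entropy(softmax(logits), target=7)  # softmax applied twice

print(f"loss on raw logits:      {loss_from_logits:.4f}")
print(f"loss on softmax outputs: {loss_from_probs:.4f}")
print(f"ln(C) - 1:               {math.log(num_classes) - 1:.4f}")
```

A common remedy is to have `forward` return the output of the last linear layer and apply softmax only when probabilities are needed for inspection.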

In [103]:
# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training the model
num_epochs = 100 
for epoch in range(num_epochs):
    model.train()

    # Forward pass
    outputs = model(X_train)
    loss = criterion(outputs, y_train)

    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    _, predicted = torch.max(outputs.data, 1)
    accuracy = accuracy_score(y_train, predicted)


    # Evaluate the model
    model.eval()
    with torch.no_grad():
        outputs = model(X_test)
        _, predicted = torch.max(outputs.data, 1)
        accuracy_test = accuracy_score(y_test, predicted)

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {loss.item():.4f}, Train Accuracy: {accuracy * 100:.2f}, Test Accuracy: {accuracy_test * 100:.2f}')


Epoch [10/100], Train Loss: 7.6144, Train Accuracy: 12.54, Test Accuracy: 7.39
Epoch [20/100], Train Loss: 7.5725, Train Accuracy: 13.73, Test Accuracy: 8.34
Epoch [30/100], Train Loss: 7.5297, Train Accuracy: 13.73, Test Accuracy: 8.34
Epoch [40/100], Train Loss: 7.5040, Train Accuracy: 16.91, Test Accuracy: 7.63
Epoch [50/100], Train Loss: 7.4957, Train Accuracy: 17.27, Test Accuracy: 8.02
Epoch [60/100], Train Loss: 7.4922, Train Accuracy: 17.44, Test Accuracy: 8.10
Epoch [70/100], Train Loss: 7.4849, Train Accuracy: 18.46, Test Accuracy: 8.10
Epoch [80/100], Train Loss: 7.4770, Train Accuracy: 18.82, Test Accuracy: 7.94
Epoch [90/100], Train Loss: 7.4607, Train Accuracy: 21.04, Test Accuracy: 8.82
Epoch [100/100], Train Loss: 7.4516, Train Accuracy: 21.50, Test Accuracy: 9.45


In [104]:
# When the NN is set to one hidden layer with 6292 neurons (the number of bigrams in the data set),
# the model clearly overfits: train accuracy is 21.50% against 9.45% in test.
# Accuracy is a questionable metric for our problem since the data is heavily unbalanced, but we can see
# that a simple logistic regression (10.88% test accuracy) outperforms our NN.
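Because a few tokens dominate the labels, a useful reference point is the accuracy of always predicting the most frequent label. A small sketch with toy labels (the lists and the helper `majority_baseline_accuracy` are invented for illustration; in the notebook, the decoded train and test labels would take their place):

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


# Toy labels, not the real data
train_labels = [",", "de", ",", "la", "y", ",", "de", "que"]
test_labels = [",", "la", ",", "de"]

label, acc = majority_baseline_accuracy(train_labels, test_labels)
print(f"always predicting {label!r} gives {acc:.2%} accuracy")
```

A model whose test accuracy is close to this baseline has learned little beyond the label frequencies.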

### Analysis

1. Test your network with a few words

In [105]:
def predict_next_word(input_word):
    # Vectorize the input word
    input_vectorized = vectorizer.transform([input_word]).toarray()
    input_tensor = torch.tensor(input_vectorized, dtype=torch.float32)

    # Predict the next word
    model.eval()
    with torch.no_grad():
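        output = model(input_tensor)
        # NOTE: the model already ends in nn.Softmax, so this second softmax
        # flattens the distribution and understates the model's confidence.
        probabilities = torch.softmax(output, dim=1)
        predicted_prob, predicted_idx = torch.max(probabilities.data, 1)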
    
    # Decode the predicted word
    predicted_word = label_encoder.inverse_transform(predicted_idx.numpy())[0]
    predicted_prob = predicted_prob.item()  # Convert tensor to Python float

    return predicted_word, predicted_prob


predict_next_word('Aureliano')



max_n_pred = 10
for _ in range(10):
    word = 'aureliano'
    full_pred = word
    for i in range(max_n_pred):
        word2 = predict_next_word(word)[0]
        full_pred = full_pred + ' ' + word2
        word = word2

    print(full_pred)

# Since the trained weights are fixed and we always pick the most probable word, the output is deterministic

aureliano , y y y y y y y y y
aureliano , y y y y y y y y y
aureliano , y y y y y y y y y
aureliano , y y y y y y y y y
aureliano , y y y y y y y y y
aureliano , y y y y y y y y y
aureliano , y y y y y y y y y
aureliano , y y y y y y y y y
aureliano , y y y y y y y y y
aureliano , y y y y y y y y y
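The repeated output follows from always taking the most probable next word (greedy decoding). A sketch of the alternative, sampling from the predicted distribution, using an invented probability table in place of the model; `greedy_choice` and `sampled_choice` are hypothetical helpers:

```python
import random

# Hypothetical next-word distribution; the real one would come from the model
next_word_probs = {",": 0.5, "y": 0.3, "de": 0.2}


def greedy_choice(probs):
    """Always return the most probable word: same answer on every call."""
    return max(probs, key=probs.get)


def sampled_choice(probs, rng):
    """Draw a word in proportion to its probability: answers can vary."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the run is repeatable
greedy_run = [greedy_choice(next_word_probs) for _ in range(5)]
sampled_run = [sampled_choice(next_word_probs, rng) for _ in range(5)]
print(greedy_run)
print(sampled_run)
```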


In [108]:
def pred_n_words(word='buendia', max_n_pred=10):
    full_pred = word
    ll = 0
    for i in range(max_n_pred):
        # One forward pass gives both the predicted word and its probability
        word2, pr = predict_next_word(word)
        full_pred = full_pred + ' ' + word2
        word = word2
        ll += np.log(pr)

    # Mean log-probability of the predicted words (negate it to get the NLL)
    n_ll = ll / max_n_pred

    print(full_pred, ' - neg. log-ll:', n_ll)


intersting_words = ['buendia','niño','épocas', 'pasadas','amaestrado','artificios',
                    'portentosa','colores','posibilidad', 'mujer','oro']

for w in intersting_words:
    pred_n_words(word = w, max_n_pred=1)

print("-"*100)
for w in intersting_words:
    pred_n_words(word = w)
 

buendia y  - neg. log-ll: -6.903250226233657
niño de  - neg. log-ll: -6.679355235165253
épocas y  - neg. log-ll: -6.893996545900883
pasadas y  - neg. log-ll: -6.907854538680285
amaestrado y  - neg. log-ll: -6.904289854528606
artificios y  - neg. log-ll: -6.668235896001967
portentosa y  - neg. log-ll: -6.896892592016161
colores la  - neg. log-ll: -6.682918735647495
posibilidad de  - neg. log-ll: -6.667680377705841
mujer ,  - neg. log-ll: -6.6687578055612535
oro ,  - neg. log-ll: -6.672500521879665
----------------------------------------------------------------------------------------------------
buendia y y y y y y y y y y  - neg. log-ll: -6.9032502262336575
niño de la aldea , y y y y y y  - neg. log-ll: -6.810183579819068
épocas y y y y y y y y y y  - neg. log-ll: -6.90232485820038
pasadas y y y y y y y y y y  - neg. log-ll: -6.903710657478321
amaestrado y y y y y y y y y y  - neg. log-ll: -6.903354189063153
artificios y y y y y y y y y y  - neg. log-ll: -6.879748793210489
portentosa 

2. What does each value in the tensor represent?

It depends on which tensor we want to analyze. The y vector, represented as a tensor (much like a plain np.array), holds one integer label per bigram, identifying the second word. Since we didn't perform further data cleaning such as lemmatization or vocabulary reduction, the vocabulary stays large: `X_tensor` alone has 1999 columns for 6292 bigrams. Thanks to sklearn we can easily go from the numeric label back to its word. The X matrix, represented as a tensor, is a one-hot encoding, that is, a dummy variable for each word of our bag-of-words.

Intermediate tensors are standard neural network activations. Perhaps the most interesting one in this example is `torch.softmax(output, dim=1)` in `predict_next_word`, which gives the predicted probability of each candidate next word for a given input. Here `output` is the value returned by the network, which, as defined above, already ends in a softmax layer rather than returning raw logits. It seems we should have removed stop words, since most of our predicted words are "y".

3. Why does it make sense to choose that number of neurons in our layer?

Given sufficient data, we could argue that each possible output word needs its own model, so each neuron could specialize in one word and improve accuracy. In our case, the result is clear overfitting.
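To put the layer size in perspective, a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out + n_out` trainable parameters. A quick sketch using the shapes reported earlier (1999 input features, 6292 hidden units); `linear_params` is a hypothetical helper:

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected (nn.Linear-style) layer."""
    return n_in * n_out + n_out


input_size = 1999      # columns of X_tensor
hidden_size = 6292     # hidden units chosen in the notebook
n_bigrams = 6292       # rows of X_tensor, before the train/test split

first_layer = linear_params(input_size, hidden_size)
print(f"first layer parameters: {first_layer:,}")
print(f"parameters per bigram:  {first_layer / n_bigrams:,.0f}")
```

With thousands of parameters per example in the first layer alone, a gap between train and test accuracy is not surprising.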

4. What's the negative likelihood for each example?

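The printed scores range from -6.908 for "pasadas" (lowest) to -6.668 for "posibilidad" (highest). These are log-likelihoods as printed; the negative log-likelihood is the same number with the sign flipped. No initial word yields a confident prediction, which is expected for a model with such low accuracy.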
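For reference, the number printed by `pred_n_words` is the mean log-probability of the predicted words, and the negative log-likelihood is that value with the sign flipped. A sketch with invented probabilities (the helper names are hypothetical):

```python
import math


def mean_log_likelihood(probs):
    """Average of log(p) over the predicted words (always <= 0)."""
    return sum(math.log(p) for p in probs) / len(probs)


def mean_negative_log_likelihood(probs):
    """Same quantity with the sign flipped, so lower is better."""
    return -mean_log_likelihood(probs)


probs = [0.001, 0.001, 0.0013]   # illustrative per-word probabilities
print(f"mean log-likelihood: {mean_log_likelihood(probs):.4f}")
print(f"mean NLL:            {mean_negative_log_likelihood(probs):.4f}")

# Probability that corresponds to a log-likelihood of -6.9
print(f"exp(-6.9) = {math.exp(-6.9):.5f}")
```

A score near -6.9 therefore means the model assigns roughly one chance in a thousand to its own top prediction.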


5. Try generating a few sentences.

Done!

6. What's the negative likelihood for each sentence?

The lowest value is -6.9037, for the sentence generated from "pasadas", and the highest is -6.809, for "posibilidad". Ironically, the most complex sentence leads to the worst score. No sentence is meaningful.

### Design your own neural network (more layers and different number of neurons)
The goal is to get sentences that make more sense.

In [88]:
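class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size1)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size1, hidden_size2)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(hidden_size2, output_size)
        # NOTE: nn.CrossEntropyLoss expects raw logits and applies log-softmax itself;
        # with this extra Softmax the loss cannot drop below about ln(n_classes) - 1.
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        out = self.fc1(x)        # input -> first hidden layer
        out = self.relu1(out)
        out = self.fc2(out)      # first -> second hidden layer
        out = self.relu2(out)
        out = self.fc3(out)      # second hidden layer -> one score per output word
        out = self.softmax(out)  # scores -> probabilities (see NOTE above)
        return out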

# Define the model parameters
input_size = X_train.shape[1]
hidden_size1 = 512
hidden_size2 = 256
output_size = len(label_encoder.classes_)

model = SimpleNN(input_size, hidden_size1, hidden_size2, output_size)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)

# Training the model
num_epochs = 600 
for epoch in range(num_epochs):
    model.train()

    # Forward pass
    outputs = model(X_train)
    loss = criterion(outputs, y_train)

    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    _, predicted = torch.max(outputs.data, 1)
    accuracy = accuracy_score(y_train, predicted)


    # Evaluate the model
    model.eval()
    with torch.no_grad():
        outputs = model(X_test)
        _, predicted = torch.max(outputs.data, 1)
        accuracy_test = accuracy_score(y_test, predicted)

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Train Loss: {loss.item():.4f}, Train Accuracy: {accuracy * 100:.2f}, Test Accuracy: {accuracy_test * 100:.2f}')


Epoch [10/600], Train Loss: 7.6625, Train Accuracy: 0.02, Test Accuracy: 0.00
Epoch [20/600], Train Loss: 7.6625, Train Accuracy: 0.02, Test Accuracy: 0.00
Epoch [30/600], Train Loss: 7.6625, Train Accuracy: 0.64, Test Accuracy: 1.11
Epoch [40/600], Train Loss: 7.6625, Train Accuracy: 6.36, Test Accuracy: 6.75
Epoch [50/600], Train Loss: 7.6625, Train Accuracy: 6.36, Test Accuracy: 6.75
Epoch [60/600], Train Loss: 7.6624, Train Accuracy: 6.36, Test Accuracy: 6.75
Epoch [70/600], Train Loss: 7.6624, Train Accuracy: 6.36, Test Accuracy: 6.75
Epoch [80/600], Train Loss: 7.6624, Train Accuracy: 6.36, Test Accuracy: 6.75
Epoch [90/600], Train Loss: 7.6624, Train Accuracy: 6.36, Test Accuracy: 6.75
Epoch [100/600], Train Loss: 7.6624, Train Accuracy: 6.36, Test Accuracy: 6.75
Epoch [110/600], Train Loss: 7.6624, Train Accuracy: 6.36, Test Accuracy: 6.75
Epoch [120/600], Train Loss: 7.6623, Train Accuracy: 6.36, Test Accuracy: 6.75
Epoch [130/600], Train Loss: 7.6623, Train Accuracy: 6.36, Te

In [89]:
intersting_words = ['buendia','niño','épocas', 'pasadas','amaestrado','artificios',
                    'portentosa','colores','posibilidad', 'mujer','oro']

for w in intersting_words:
    pred_n_words(word = w)

buendia la la la la la la la la la la  - neg. log-ll: -7.56879909073798
niño , la la la la la la la la la  - neg. log-ll: -7.476387308631122
épocas la la la la la la la la la la  - neg. log-ll: -7.572924327636565
pasadas la la la la la la la la la la  - neg. log-ll: -7.57103997257893
amaestrado la la la la la la la la la la  - neg. log-ll: -7.5700731546130084
artificios y la la la la la la la la la  - neg. log-ll: -7.484689850515887
portentosa la la la la la la la la la la  - neg. log-ll: -7.5728351144236585
colores la la la la la la la la la la  - neg. log-ll: -7.537126639677136
posibilidad de la la la la la la la la la  - neg. log-ll: -7.432913622062439
mujer , la la la la la la la la la  - neg. log-ll: -7.476061086934763
oro , la la la la la la la la la  - neg. log-ll: -7.476338530119925


After many tries, we couldn't produce a model better than the initial NN. Even allowing for the fact that this is a simple neural network, it could not outperform logistic regression. RNNs are generally recommended for NLP problems.

Most of the code was produced with the assistance of ChatGPT, with some tailoring by the author to make the outputs more readable. The log-likelihood code is from the class.