Task 1.1

Create a simple model that analyzes text using an artificial neural network (ANN).

A small dataset of product reviews that have been labelled as negative (0) or positive (1) is provided in the Files - Exercises - Lab 1 folder, along with some code needed to extract information.

A suggested approach is first to try to train a network on the given data.
Once that works, improve the model's performance, for example by training on more data, using a dataset with a broader range of labels, or using word embeddings to create sentence embeddings.


In [None]:
pip install gensim



In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
import numpy as np
from matplotlib import pyplot
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk import word_tokenize
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, classification_report
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from google.colab import files
uploaded = files.upload()

Saving amazon_cells_labelled.txt to amazon_cells_labelled (1).txt


In [None]:
def preprocess_pandas(data, columns):
    """Lower-case the sentences, strip noise and remove English stop words."""
    data['Sentence'] = data['Sentence'].str.lower()
    data['Sentence'] = data['Sentence'].replace(r'[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+', '', regex=True)                     # remove emails
    data['Sentence'] = data['Sentence'].replace(r'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.|$)){4}', '', regex=True)   # remove IP addresses
    data['Sentence'] = data['Sentence'].str.replace(r'[^\w\s]', '', regex=True)                                        # remove special characters (regex=True is required in recent pandas)
    data['Sentence'] = data['Sentence'].replace(r'\d', '', regex=True)                                                  # remove numbers
    stop_words = set(stopwords.words('english'))                        # build the stop-word set once, not once per token
    rows = []
    for _, row in data.iterrows():
        word_tokens = word_tokenize(row['Sentence'])
        filtered_sent = [w for w in word_tokens if w not in stop_words]
        rows.append({
            "index": row['index'],
            "Class": row['Class'],
            "Sentence": " ".join(filtered_sent)
        })
    return pd.DataFrame(rows, columns=columns)                          # return the stop-word-filtered frame, not the unfiltered input

# If this is the primary file that is executed (ie not an import of another file)
if __name__ == "__main__":
    # get data, pre-process and split
    data = pd.read_csv("amazon_cells_labelled.txt", delimiter='\t', header=None)
    data.columns = ['Sentence', 'Class']
    data['index'] = data.index                                          # add new column index
    columns = ['index', 'Class', 'Sentence']
    data = preprocess_pandas(data, columns)                             # pre-process
    training_data, validation_data, training_labels, validation_labels = train_test_split( # split the data into training (90%) and validation (10%) sets
        data['Sentence'].values.astype('U'),
        data['Class'].values.astype('int32'),
        test_size=0.10,
        random_state=0,
        shuffle=True
    )

    # vectorize data using TF-IDF and convert to dense PyTorch tensors
    word_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,2), max_features=50000, max_df=0.5, use_idf=True, norm='l2')
    training_data = word_vectorizer.fit_transform(training_data)        # transform texts to sparse matrix
    training_data = training_data.todense()                             # convert to dense matrix for Pytorch
    vocab_size = len(word_vectorizer.vocabulary_)
    validation_data = word_vectorizer.transform(validation_data)
    validation_data = validation_data.todense()
    train_x_tensor = torch.from_numpy(np.array(training_data)).type(torch.FloatTensor)
    train_y_tensor = torch.from_numpy(np.array(training_labels)).long()
    validation_x_tensor = torch.from_numpy(np.array(validation_data)).type(torch.FloatTensor)
    validation_y_tensor = torch.from_numpy(np.array(validation_labels)).long()

In [None]:
train_loader = DataLoader(TensorDataset(train_x_tensor, train_y_tensor), batch_size=128, shuffle=True)
val_loader = DataLoader(TensorDataset(validation_x_tensor, validation_y_tensor), batch_size=128, shuffle=False) # No shuffle for validation, to ensure consistency of the validation

In [None]:
import copy
import matplotlib.pyplot as plt

# Define the model
network = nn.Sequential(
    nn.Linear(vocab_size, 128),
    nn.ReLU(),
    nn.Linear(128, 2)
)

optimizer = optim.Adam(network.parameters(), lr=0.001)
loss_function = nn.CrossEntropyLoss()

epochs = 10
best_val_loss = float('inf')
best_model = None

train_losses = []
val_losses = []
train_accuracies = []
val_accuracies = []

# Training loop with accuracy
for epoch in range(epochs):
    network.train()
    running_train_loss = 0.0
    correct_train = 0
    total_train = 0

    for batch_x, batch_y in train_loader:
        prediction = network(batch_x)
        loss = loss_function(prediction, batch_y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        running_train_loss += loss.item()

        # Accuracy
        _, predicted = torch.max(prediction, 1)
        correct_train += (predicted == batch_y).sum().item()
        total_train += batch_y.size(0)

    avg_train_loss = running_train_loss / len(train_loader)
    train_accuracy = correct_train / total_train
    train_losses.append(avg_train_loss)
    train_accuracies.append(train_accuracy)

    # Validation
    network.eval()
    running_val_loss = 0.0
    correct_val = 0
    total_val = 0

    with torch.no_grad():
        for batch_x, batch_y in val_loader:
            prediction = network(batch_x)
            loss = loss_function(prediction, batch_y)
            running_val_loss += loss.item()

            _, predicted = torch.max(prediction, 1)
            correct_val += (predicted == batch_y).sum().item()
            total_val += batch_y.size(0)

    avg_val_loss = running_val_loss / len(val_loader)
    val_accuracy = correct_val / total_val
    val_losses.append(avg_val_loss)
    val_accuracies.append(val_accuracy)

    # Save best model
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        best_model = copy.deepcopy(network)

    print(f"Epoch {epoch+1}/{epochs} - "
          f"Train Loss: {avg_train_loss:.4f}, Train Acc: {train_accuracy:.4f} - "
          f"Val Loss: {avg_val_loss:.4f}, Val Acc: {val_accuracy:.4f}")


Epoch 1/10 - Train Loss: 0.6892, Train Acc: 0.4967 - Val Loss: 0.6833, Val Acc: 0.5500
Epoch 2/10 - Train Loss: 0.6648, Train Acc: 0.7278 - Val Loss: 0.6664, Val Acc: 0.7600
Epoch 3/10 - Train Loss: 0.6283, Train Acc: 0.9722 - Val Loss: 0.6418, Val Acc: 0.8400
Epoch 4/10 - Train Loss: 0.5698, Train Acc: 0.9978 - Val Loss: 0.6106, Val Acc: 0.8300
Epoch 5/10 - Train Loss: 0.4889, Train Acc: 0.9989 - Val Loss: 0.5739, Val Acc: 0.8000
Epoch 6/10 - Train Loss: 0.4022, Train Acc: 1.0000 - Val Loss: 0.5350, Val Acc: 0.8000
Epoch 7/10 - Train Loss: 0.3123, Train Acc: 1.0000 - Val Loss: 0.4964, Val Acc: 0.8000
Epoch 8/10 - Train Loss: 0.2345, Train Acc: 1.0000 - Val Loss: 0.4644, Val Acc: 0.8000
Epoch 9/10 - Train Loss: 0.1772, Train Acc: 1.0000 - Val Loss: 0.4364, Val Acc: 0.8100
Epoch 10/10 - Train Loss: 0.1363, Train Acc: 1.0000 - Val Loss: 0.4160, Val Acc: 0.8100


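The model is learning, but it also overfits: training accuracy reaches 1.0 by epoch 6, while validation accuracy peaks at 0.84 in epoch 3 and then settles around 0.80 to 0.81. What can we do to improve the model?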
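
One way to make this concrete is to look at the per-epoch gap between training and validation accuracy, and at which epoch has the lowest validation loss (the criterion the training loop uses for `best_model`). The following is only a sketch in plain Python: the numbers are copied from the log above, and the helper names `best_epoch` and `generalization_gap` are made up for illustration.

```python
# Per-epoch metrics copied from the training log above.
train_acc = [0.4967, 0.7278, 0.9722, 0.9978, 0.9989, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000]
val_acc   = [0.5500, 0.7600, 0.8400, 0.8300, 0.8000, 0.8000, 0.8000, 0.8000, 0.8100, 0.8100]
val_loss  = [0.6833, 0.6664, 0.6418, 0.6106, 0.5739, 0.5350, 0.4964, 0.4644, 0.4364, 0.4160]

def best_epoch(losses):
    """Return the 1-based epoch with the lowest loss (hypothetical helper)."""
    return min(range(len(losses)), key=lambda i: losses[i]) + 1

def generalization_gap(train, val):
    """Training accuracy minus validation accuracy, per epoch (hypothetical helper)."""
    return [round(t - v, 4) for t, v in zip(train, val)]

gaps = generalization_gap(train_acc, val_acc)
best_by_acc = max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1

print("best epoch by val loss:", best_epoch(val_loss))
print("best epoch by val acc: ", best_by_acc)
print("final train/val gap:   ", gaps[-1])
```

With these numbers, validation loss is still falling at epoch 10 even though validation accuracy peaked at epoch 3, so the two criteria would keep different checkpoints. The final gap of 0.19 between training and validation accuracy is the overfitting signal.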


We will retrain the network on the larger dataset of 25k items and represent each sentence with word embeddings taken from pre-trained Word2Vec vectors.

Initially, we attempted to train our own vectors with Word2Vec, but the validation loss and accuracy were similar to those of the previous model, so we opted for pre-trained vectors instead.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving amazon_cells_labelled_LARGE_25K.txt to amazon_cells_labelled_LARGE_25K (4).txt


In [None]:
# If this is the primary file that is executed (ie not an import of another file)
if __name__ == "__main__":
    # get data, pre-process and split
    data = pd.read_csv("amazon_cells_labelled_LARGE_25K.txt", delimiter='\t', header=None)
    data.columns = ['Sentence', 'Class']
    data['index'] = data.index                                          # add new column index
    columns = ['index', 'Class', 'Sentence']
    data = preprocess_pandas(data, columns)                             # pre-process
    training_data, validation_data, training_labels, validation_labels = train_test_split( # split the data into training (90%) and validation (10%) sets
        data['Sentence'].values.astype('U'),
        data['Class'].values.astype('int32'),
        test_size=0.10,
        random_state=0,
        shuffle=True
    )


In [None]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import gensim.downloader as api

# Download Google's pre-trained Word2Vec model (100 billion words, 300D)
w2v_model = api.load("word2vec-google-news-300")

In [None]:
# Represent each sentence by the mean of its in-vocabulary word vectors

def sentence_to_vec(sentence, model, dim=300):
    tokens = word_tokenize(sentence.lower())
    vecs = [model[word] for word in tokens if word in model]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

X = np.array([sentence_to_vec(s, w2v_model) for s in data['Sentence']])
y = data['Class'].values
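
Because `sentence_to_vec` skips tokens that are missing from the embedding vocabulary and returns a zero vector when nothing matches, it can be useful to measure how much of the text is actually covered. The sketch below rests on a few assumptions: `coverage` is a hypothetical helper, it splits on whitespace rather than calling `word_tokenize`, and a plain Python set stands in for `w2v_model` (both support the `in` test).

```python
def coverage(sentences, vocab):
    """Return (token_coverage, empty_fraction) for a list of sentences.

    token_coverage: share of tokens found in `vocab`.
    empty_fraction: share of sentences with no in-vocabulary token,
                    i.e. those that would be mapped to an all-zero vector.
    """
    total = known = empty = 0
    for sentence in sentences:
        tokens = sentence.lower().split()   # whitespace split; a simplification
        hits = sum(1 for tok in tokens if tok in vocab)
        total += len(tokens)
        known += hits
        if hits == 0:
            empty += 1
    n = len(sentences)
    token_cov = known / total if total else 0.0
    empty_frac = empty / n if n else 0.0
    return token_cov, empty_frac

# Toy stand-ins (hypothetical values, not taken from the dataset).
toy_vocab = {"great", "phone", "battery", "bad"}
toy_sentences = ["Great phone", "bad battery life", "meh"]

token_cov, empty_frac = coverage(toy_sentences, toy_vocab)
print(f"token coverage: {token_cov:.2f}, empty sentences: {empty_frac:.2f}")
```

On the real data the call would look something like `coverage(list(data['Sentence']), w2v_model)`. A high share of empty sentences would suggest that the zero-vector fallback is affecting many training examples.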

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)
train_ds = TensorDataset(torch.tensor(X_train).float(), torch.tensor(y_train).long())
val_ds = TensorDataset(torch.tensor(X_val).float(), torch.tensor(y_val).long())

train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=128, shuffle=False)

In [None]:
import copy
import matplotlib.pyplot as plt

# Define your network
network = nn.Sequential(
    nn.Linear(300, 128),
    nn.ReLU(),
    nn.Linear(128, 2)
)

optimizer = optim.Adam(network.parameters(), lr=0.001)
loss_function = nn.CrossEntropyLoss()

epochs = 10
best_val_loss = float('inf')
best_model = None

train_losses = []
val_losses = []
train_accuracies = []
val_accuracies = []

for epoch in range(epochs):
    network.train()
    running_train_loss = 0.0
    correct_train = 0
    total_train = 0

    for batch_x, batch_y in train_loader:
        prediction = network(batch_x)
        loss = loss_function(prediction, batch_y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        running_train_loss += loss.item()

        _, predicted = torch.max(prediction, 1)
        correct_train += (predicted == batch_y).sum().item()
        total_train += batch_y.size(0)

    avg_train_loss = running_train_loss / len(train_loader)
    train_accuracy = correct_train / total_train
    train_losses.append(avg_train_loss)
    train_accuracies.append(train_accuracy)

    # Validation
    network.eval()
    running_val_loss = 0.0
    correct_val = 0
    total_val = 0

    with torch.no_grad():
        for batch_x, batch_y in val_loader:
            prediction = network(batch_x)
            loss = loss_function(prediction, batch_y)
            running_val_loss += loss.item()

            _, predicted = torch.max(prediction, 1)
            correct_val += (predicted == batch_y).sum().item()
            total_val += batch_y.size(0)

    avg_val_loss = running_val_loss / len(val_loader)
    val_accuracy = correct_val / total_val
    val_losses.append(avg_val_loss)
    val_accuracies.append(val_accuracy)

    # Save best model
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        best_model = copy.deepcopy(network)

    print(f"Epoch {epoch+1}/{epochs} - "
          f"Train Loss: {avg_train_loss:.4f}, Train Acc: {train_accuracy:.4f} - "
          f"Val Loss: {avg_val_loss:.4f}, Val Acc: {val_accuracy:.4f}")


Epoch 1/10 - Train Loss: 0.4916, Train Acc: 0.7572 - Val Loss: 0.3769, Val Acc: 0.8264
Epoch 2/10 - Train Loss: 0.3756, Train Acc: 0.8316 - Val Loss: 0.3588, Val Acc: 0.8404
Epoch 3/10 - Train Loss: 0.3637, Train Acc: 0.8387 - Val Loss: 0.3604, Val Acc: 0.8368
Epoch 4/10 - Train Loss: 0.3560, Train Acc: 0.8436 - Val Loss: 0.3494, Val Acc: 0.8432
Epoch 5/10 - Train Loss: 0.3516, Train Acc: 0.8441 - Val Loss: 0.3461, Val Acc: 0.8484
Epoch 6/10 - Train Loss: 0.3468, Train Acc: 0.8471 - Val Loss: 0.3436, Val Acc: 0.8492
Epoch 7/10 - Train Loss: 0.3447, Train Acc: 0.8476 - Val Loss: 0.3392, Val Acc: 0.8460
Epoch 8/10 - Train Loss: 0.3392, Train Acc: 0.8499 - Val Loss: 0.3382, Val Acc: 0.8468
Epoch 9/10 - Train Loss: 0.3375, Train Acc: 0.8511 - Val Loss: 0.3367, Val Acc: 0.8468
Epoch 10/10 - Train Loss: 0.3322, Train Acc: 0.8543 - Val Loss: 0.3351, Val Acc: 0.8456


Task 1.2

For this task, you will implement your transformer in PyTorch. You are instructed to follow this link: https://pytorch.org/hub/huggingface_pytorch-transformers/

Task 1.3

Here, you should compare both models. Use the same test dataset for both the ANN and the transformer to answer the following:

• Compare the performance of the two models and explain in which scenarios you would prefer one over the other.

• How did the two models’ complexity, accuracy, and efficiency differ? Did one model outperform the other in specific scenarios or tasks? If so, why?

• What insights did you gain about the amount of training data, the embeddings used, and the architectural choices made?
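
For the comparison, it may help to score both models with one function on the same held-out labels, and to put a number on model complexity. The following is a minimal sketch in plain Python. `y_true`, `ann_pred` and `transformer_pred` are placeholder lists invented for illustration and are not results from either model. `binary_metrics` and `mlp_param_count` are hypothetical helpers, and the layer sizes `[300, 128, 2]` are those of the embedding-based ANN from Task 1.1.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for labels in {0, 1} (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Placeholder labels and predictions (made up, NOT real model output).
y_true           = [1, 0, 1, 1, 0, 0, 1, 0]
ann_pred         = [1, 0, 0, 1, 0, 1, 1, 0]
transformer_pred = [1, 0, 1, 1, 0, 0, 1, 1]

ann_scores = binary_metrics(y_true, ann_pred)
transformer_scores = binary_metrics(y_true, transformer_pred)
ann_params = mlp_param_count([300, 128, 2])

print("ANN:        ", ann_scores)
print("Transformer:", transformer_scores)
print("ANN trainable parameters:", ann_params)
```

The `accuracy_score`, `precision_score` and `recall_score` functions already imported at the top of the notebook compute the same quantities. The point of the sketch is only that both models are evaluated against an identical `y_true`.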
