# Imports

We will use two Hugging Face libraries to download and process the dataset, so we need to install them first.\
The first is the `datasets` library, which allows you to download and manipulate datasets.\
See [their guide](https://huggingface.co/docs/datasets/installation) for an introduction to the `datasets` library if you want to know more.\
The second, `huggingface_hub`, helps us download files from the Hugging Face Hub.

We can simply install them using pip:
> Note that `%pip` is a magic command for use in Jupyter Notebooks as discussed in the tutorial on setting up Python!

In [None]:
%pip install datasets huggingface_hub

In [None]:
import zipfile
from functools import partial

import matplotlib.pyplot as plt
import torch
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from torch.utils.data import DataLoader
from tqdm import trange

# Sentiment Analysis

Sentiment Analysis (SA) determines the emotional tone of a text sequence (e.g., a sentence) and classifies it into predefined categories.\
Therefore, SA is a text classification task which assigns a single class to the whole input.

In this part, we will use the SST2 dataset, which stands for Stanford Sentiment Treebank.\
This is a binary task where inputs are labeled as `positive` (`1`) or `negative` (`0`).

## Dataset Loading

There are several possibilities for downloading datasets.\
You can read them from files or directly use a library with an online repository of datasets, such as [Hugging Face Datasets](https://huggingface.co/docs/datasets/index).

> A note on the Hugging Face Datasets library: You can decide where downloaded files are cached; see https://huggingface.co/docs/datasets/v3.2.0/en/cache#cache-directory.\
> The documentation is, as always, a good source of information!

We will use this library in our example:

In [None]:
sst2 = load_dataset("stanfordnlp/sst2")
# Print the dataset object to see the dataset's structure
print(sst2)

If you want to learn more about using this library, there is a very useful [tutorial](https://huggingface.co/docs/datasets/v3.2.0/en/tutorial) available online.

We see that there are three splits, and how many samples are contained in each split.\
Let's take a look at an example.

In [None]:
dataset_train = sst2['train']
# Print the first sample in the training dataset
print(dataset_train[0])

As we can see, each sample includes an index, the input sentence and a label.\
This sample was labeled `negative` (`0`; `1` stands for `positive` in this dataset).

## Embeddings

In order to turn words into features, we can use pre-trained word embeddings such as GloVe.\
These have been trained so that the vectors carry the semantics of the words.\
See https://nlp.stanford.edu/projects/glove/ for more information.

GloVe embeddings are available in several variants, trained on different data (varying in the number of tokens and vocabulary size) and with different embedding dimensions.\
In this tutorial, we use the variant with the smallest vocabulary (`glove.6B`) and 300-dimensional features.
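
To make the idea of embeddings "carrying semantics" more concrete, the sketch below compares word vectors with cosine similarity.
The three-dimensional vectors and the helper `cosine_similarity` are made up for illustration and are not real GloVe values; with the real embeddings, the same comparison would be applied to the 300-dimensional vectors.

```python
import math

# Hypothetical toy vectors (NOT real GloVe values), chosen so that
# "good" and "great" point in a similar direction and "terrible" does not.
toy_embeddings = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "terrible": [-0.7, 0.1, -0.5],
}


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["terrible"])
print(f"good vs. great:    {sim_close:.3f}")
print(f"good vs. terrible: {sim_far:.3f}")
```

Words with related meaning are expected to have a higher cosine similarity than unrelated or opposing words, which is what the toy vectors are constructed to show.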

In [None]:
# Download the GloVe embeddings
glove = hf_hub_download("stanfordnlp/glove", "glove.6B.zip")

# Unpack the downloaded file
word_to_indices = dict()
embeddings = []
# There are multiple files with different dimensionality of the features in the zip archive: 50d, 100d, 200d, 300d
filename = "glove.6B.300d.txt"
with zipfile.ZipFile(glove, "r") as f:
    for idx, line in enumerate(f.open(filename)):
        values = line.split()
        word = values[0].decode("utf-8")
        features = torch.tensor([float(value) for value in values[1:]])
        word_to_indices[word] = idx
        embeddings.append(features)

# Last token in the vocabulary is '<unk>' which is used for out-of-vocabulary words
# We also add a '<pad>' token to the vocabulary for padding sequences
word_to_indices["<pad>"] = len(word_to_indices)
embeddings.append(torch.zeros(embeddings[0].shape))

# Convert the list of tensors to a single tensor
embeddings = torch.stack(embeddings)

print(f"Embeddings shape: {embeddings.shape}")
print(f"Number of entries in the embedding matrix (including '<pad>'): {len(embeddings)}")
print(f"Embedding dimension: {embeddings.size(1)}")

## Data Processing

Starting from our input sentences, we need to 1) tokenize the sentences to get smaller units (words or tokens) and 2) convert these tokens into vector representations using the pre-trained embeddings.

The `datasets` library provides support for processing datasets; for more information, see https://huggingface.co/docs/datasets/process.


In [None]:
def tokenize(text: str):
    return text.lower().split()


def map_token_to_index(token):
    # Return the index of the token or the index of the '<unk>' token if the token is not in the vocabulary
    return word_to_indices.get(token, word_to_indices["<unk>"])


def map_text_to_indices(text: str):
    return [map_token_to_index(token) for token in tokenize(text)]


def prepare_dataset(dataset):
    return dataset.map(lambda x: {"token_ids": map_text_to_indices(x["sentence"])})

dataset_train_tokenized = prepare_dataset(dataset_train)

# Print the first sample in the tokenized training dataset
print(dataset_train_tokenized[0])
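
The mapping from text to indices can be illustrated with a tiny vocabulary.
The vocabulary `toy_vocab` and the helper `toy_text_to_indices` below are hypothetical and only mirror the logic of `map_text_to_indices`; the real indices come from the GloVe file.

```python
# Hypothetical toy vocabulary; the real one is built from the GloVe file
toy_vocab = {"a": 0, "great": 1, "movie": 2, "<unk>": 3}


def toy_text_to_indices(text):
    # Lowercase, split on whitespace, and fall back to '<unk>' for unknown tokens
    return [toy_vocab.get(token, toy_vocab["<unk>"]) for token in text.lower().split()]


toy_ids = toy_text_to_indices("A great movie !")
print(toy_ids)  # [0, 1, 2, 3]

# Punctuation attached to a word is not split off by whitespace tokenization
toy_ids_punct = toy_text_to_indices("A great movie!")
print(toy_ids_punct)  # [0, 1, 3]
```

In the second call, `movie!` is treated as a single token that is not in the toy vocabulary, so it is mapped to `<unk>`.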

Next, we need a dataloader that takes care of batching our data.\
You have seen this before, but this time we also need to take care of padding, since the sentences in our dataset vary in length.

In [None]:
# We select the columns that we want to keep in the dataset
dataset_train_tokenized = dataset_train_tokenized.with_format(
    columns=["token_ids", "label"]
)

def pad_inputs(batch, keys_to_pad=("token_ids",), padding_value=-1):
    # Pad inputs to the maximum length
    padded_batch = {}
    for key in keys_to_pad:
        # Get maximum length in batch
        max_len = max([len(sample[key]) for sample in batch])
        # Pad all samples to the maximum length
        padded_batch[key] = torch.tensor(
            [
                sample[key] + [padding_value] * (max_len - len(sample[key]))
                for sample in batch
            ]
        )

    # Add remaining keys to the batch
    for key in batch[0].keys():
        if key not in keys_to_pad:
            padded_batch[key] = torch.tensor([sample[key] for sample in batch])
    return padded_batch


dataloader_train = DataLoader(
    dataset_train_tokenized,
    batch_size=4,
    collate_fn=partial(pad_inputs, padding_value=word_to_indices["<pad>"]),
    shuffle=True,
)

for batch in dataloader_train:
    token_ids = batch["token_ids"]
    labels = batch["label"]
    print(token_ids)
    print(labels)
    break
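
The padding step on its own can be sketched in plain Python.
The helper `pad_sequences` and the toy token IDs below are hypothetical; the sketch only shows how lists of different lengths are brought to a common length before they are turned into a tensor.

```python
def pad_sequences(sequences, padding_value):
    # Pad every list to the length of the longest list in the batch
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]


toy_batch = [[5, 2, 9], [7], [4, 1]]
padded_batch = pad_sequences(toy_batch, padding_value=0)
print(padded_batch)  # [[5, 2, 9], [7, 0, 0], [4, 1, 0]]
```

Padding per batch, rather than to the longest sentence in the whole dataset, keeps the tensors as small as the current batch allows.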

## Training

Now that the data is processed, it is time to create a neural network and train it.\
We will use a simple network here that builds on the pre-trained embeddings.

In [None]:
class TextClassifier(torch.nn.Module):
    def __init__(self, embeddings, hidden_size=128, padding_index=-1):
        super().__init__()
        self.embedding = torch.nn.Embedding.from_pretrained(
            embeddings, freeze=True, padding_idx=padding_index
        )
        self.layer1 = torch.nn.Linear(embeddings.shape[1], hidden_size)
        self.output_layer = torch.nn.Linear(hidden_size, 2)

    def forward(self, x):
        x = self.embedding(x)
        x = torch.sum(x, dim=1)
        x = self.layer1(x)
        x = torch.relu(x)
        x = self.output_layer(x)
        return x
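
The forward pass sums the word vectors of a sentence into a single vector.
Because the `<pad>` embedding is a zero vector, padded positions do not change this sum.
The plain-Python sketch below illustrates this with a hypothetical two-dimensional embedding table; `toy_table`, `PAD`, and `sum_pool` are made-up names and not part of the model.

```python
# Hypothetical 2-dimensional embedding table; row 3 plays the role of '<pad>'
toy_table = [
    [1.0, 2.0],
    [0.5, -1.0],
    [3.0, 0.0],
    [0.0, 0.0],  # '<pad>' row: all zeros
]
PAD = 3


def sum_pool(token_ids):
    # Look up the vector of each token and add them up per dimension
    vectors = [toy_table[i] for i in token_ids]
    return [sum(dim) for dim in zip(*vectors)]


pooled_unpadded = sum_pool([0, 2])
pooled_padded = sum_pool([0, 2, PAD, PAD])
print(pooled_unpadded, pooled_padded)  # [4.0, 2.0] [4.0, 2.0]
```

Note that a sum, unlike a mean, grows with the number of non-padding tokens, so longer sentences tend to produce vectors with larger magnitude.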

To assess the performance of our model, we compute the accuracy and optionally also the loss on the provided dataset:

In [None]:
def compute_accuracy(predictions, labels):
    return torch.sum(torch.argmax(predictions, dim=1) == labels).item() / len(labels)


def evaluate_model(model, dataset, loss_fn=None):
    dataloader = DataLoader(
        dataset,
        batch_size=4,
        collate_fn=partial(pad_inputs, padding_value=word_to_indices["<pad>"]),
    )
    accuracies = []
    losses = []
    # We don't need to compute gradients for the evaluation
    with torch.no_grad():
        for batch in dataloader:
            token_ids = batch["token_ids"]
            labels = batch["label"]
            predictions = model(token_ids)
            if loss_fn:
                loss = loss_fn(predictions, labels)
                losses.append(loss.item())
            accuracies.append(compute_accuracy(predictions, labels))
    return sum(accuracies) / len(accuracies), (sum(losses) / len(losses)) if loss_fn else None
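
One detail worth knowing: `evaluate_model` averages the per-batch accuracies.
If the dataset size is not a multiple of the batch size, the last batch is smaller and this average can differ from the accuracy over all samples.
The sketch below uses hypothetical predictions and a made-up helper `accuracy` to show the effect in plain Python.

```python
def accuracy(predicted, labels):
    # Fraction of positions where prediction and label agree
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


# Hypothetical predictions for 5 samples, evaluated with a batch size of 4
predicted = [1, 0, 1, 1, 0]
labels = [1, 0, 0, 1, 1]
batch_size = 4

batches = [
    (predicted[i:i + batch_size], labels[i:i + batch_size])
    for i in range(0, len(labels), batch_size)
]
batch_accuracies = [accuracy(p, y) for p, y in batches]
mean_of_batches = sum(batch_accuracies) / len(batch_accuracies)
overall = accuracy(predicted, labels)
print(mean_of_batches, overall)  # 0.375 0.6
```

For large datasets with small batches the difference is usually small, but weighting each batch by its size would make the two numbers agree.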

Since we want to evaluate our model on a separate dataset, we also have to process it in the same way as our training data:

In [None]:
dataset_val = sst2["validation"]
dataset_val_tokenized = prepare_dataset(dataset_val)
dataset_val_tokenized = dataset_val_tokenized.with_format(
    columns=["token_ids", "label"]
)

Now we can create and evaluate our model.

In [None]:
model = TextClassifier(embeddings, padding_index=word_to_indices["<pad>"])
# evaluate_model returns (accuracy, loss); the loss is None because no loss_fn is passed
accuracy, _ = evaluate_model(model, dataset_val_tokenized)
print(f"Accuracy on the validation dataset: {accuracy}")

With randomly initialized weights, we expect to end up at ~50% accuracy (on average)!

For the actual training, we also need a loss function and an optimizer.\
The rest is just a simple training loop as before.

In [None]:
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
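
`CrossEntropyLoss` takes the raw outputs of the network (logits) together with the class indices.
As a rough illustration of what it computes for a single sample, the sketch below applies a softmax and takes the negative log-probability of the correct class in plain Python.
The helper `cross_entropy` and the logit values are hypothetical and only meant to build intuition.

```python
import math


def cross_entropy(logits, label):
    # Subtracting the maximum keeps exp() numerically stable
    shifted = [z - max(logits) for z in logits]
    exps = [math.exp(z) for z in shifted]
    probs = [e / sum(exps) for e in exps]
    # The loss is the negative log-probability of the correct class
    return -math.log(probs[label])


loss_uncertain = cross_entropy([0.0, 0.0], label=1)
loss_confident_right = cross_entropy([-2.0, 2.0], label=1)
loss_confident_wrong = cross_entropy([2.0, -2.0], label=1)
print(loss_uncertain, loss_confident_right, loss_confident_wrong)
```

A model that assigns equal probability to both classes has a loss of ln(2) ≈ 0.693, which is a useful reference point when reading the loss curves of a binary task.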

In [None]:
losses_train, losses_val = [], []
accuracies_train, accuracies_val = [], []

# Compute loss and accuracy on the training set
accuracy, loss = evaluate_model(model, dataset_train_tokenized, loss_fn)
losses_train.append(loss)
accuracies_train.append(accuracy)

# Compute loss and accuracy on the validation set
accuracy, loss = evaluate_model(model, dataset_val_tokenized, loss_fn)
losses_val.append(loss)
accuracies_val.append(accuracy)

# Train for 20 epochs and evaluate after each epoch
pbar = trange(20)
for epoch in pbar:
    for batch in dataloader_train:
        token_ids = batch["token_ids"]
        labels = batch["label"]
        predictions = model(token_ids)
        loss = loss_fn(predictions, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Calculate the loss and accuracy on the training set
    acc_train, loss_train = evaluate_model(model, dataset_train_tokenized, loss_fn)
    accuracies_train.append(acc_train)
    losses_train.append(loss_train)

    # Evaluate the model on the validation set
    acc_val, loss_val = evaluate_model(model, dataset_val_tokenized, loss_fn)
    accuracies_val.append(acc_val)
    losses_val.append(loss_val)

    pbar.set_postfix_str(f"Train loss: {losses_train[-1]} - Validation acc: {accuracies_val[-1]}")

# Visualize the loss and accuracy
plt.plot(losses_train, color="orange", linestyle="-", label="Train loss")
plt.plot(losses_val, color="orange", linestyle="--", label="Validation loss")
plt.plot(accuracies_train, color="steelblue", linestyle="-", label="Train accuracy")
plt.plot(accuracies_val, color="steelblue", linestyle="--", label="Validation accuracy")
plt.xlabel("Epoch")
plt.legend()
plt.show()