# Imports

We will use some Huggingface libraries for downloading and processing the dataset, so we need to install them first.\
The first is the `datasets` library, which allows you to download and manipulate datasets.\
See [their guide](https://huggingface.co/docs/datasets/installation) for an introduction to the `datasets` library if you want to know more.\
The other one is the `huggingface_hub` library, which we use to download files from the Huggingface Hub.\
We can simply install both using pip:
> Note that `%pip` is a magic command for use in Jupyter Notebooks as discussed in the tutorial on setting up Python!

In [None]:
%pip install datasets huggingface_hub

In [None]:
import zipfile
import numpy as np

import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from huggingface_hub import hf_hub_download

# Sentiment Analysis

Sentiment Analysis (SA) determines the emotional tone of a text sequence (e.g., a sentence) and classifies it into predefined categories.\
SA is therefore a text classification task that assigns a single class to the whole input.

In this part, we will use the SST2 dataset, which stands for Stanford Sentiment Treebank.\
This is a binary task where inputs are labeled as `positive` or `negative`.

## Dataset Loading

There are several possibilities for downloading datasets.\
You can read them from files, or directly use a library with an online repository of datasets, such as [Huggingface Datasets](https://huggingface.co/docs/datasets/index).

> A note on the Huggingface Datasets library: You can choose where downloaded files are cached; see https://huggingface.co/docs/datasets/v3.2.0/en/cache#cache-directory.\
> The documentation is, as always, a good source of information!

We will use this library in our example:

In [None]:
sst2 = load_dataset("stanfordnlp/sst2")
# Print the dataset object to see the dataset's structure
print(sst2)

If you want to learn more about using this library, there is a very useful [tutorial](https://huggingface.co/docs/datasets/v3.2.0/en/tutorial) available online.

We see that there are three splits, and how many samples are contained in each split.\
Let's take a look at an example.

In [None]:
dataset_train = sst2['train']
# Print the first sample in the training dataset
print(dataset_train[0])

As we can see, each sample includes an index, the input sentence and a label.\
This sample was labeled `negative` (`0`; `1` stands for `positive` in this dataset).
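
To make the label convention concrete, here is a minimal sketch that turns a label id into its name. The sample below is hypothetical (made-up text, and the field name `idx` is an assumption); only the mapping `0` to `negative` and `1` to `positive` comes from the description above.

```python
# Mapping from label ids to names, following the convention described above.
ID_TO_LABEL = {0: "negative", 1: "positive"}


def describe(sample: dict) -> str:
    """Format one sample as '<sentence> -> <label name>'."""
    return f"{sample['sentence']!r} -> {ID_TO_LABEL[sample['label']]}"


# Hypothetical sample shaped like a dataset entry (the text is made up).
toy_sample = {"idx": 0, "sentence": "a dull and lifeless film", "label": 0}
print(describe(toy_sample))
```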

## Embeddings
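We represent each word with pre-trained GloVe word embeddings, which map every word in a fixed vocabulary to a dense vector.\
See https://nlp.stanford.edu/projects/glove/ for details.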

In [None]:
# Download the GloVe embeddings
glove = hf_hub_download("stanfordnlp/glove", "glove.6B.zip")

# Unpack the downloaded file
word_to_indices = dict()
embeddings = []
# There are multiple files with different dimensionality of the features in the zip archive: 50d, 100d, 200d, 300d
filename = "glove.6B.300d.txt"
with zipfile.ZipFile(glove, "r") as f:
    for idx, line in enumerate(f.open(filename)):
        values = line.split()
        word = values[0].decode("utf-8")
        features = np.asarray(values[1:], dtype="float32")
        word_to_indices[word] = idx
        embeddings.append(features)

# Last token in the vocabulary is '<unk>' which is used for out-of-vocabulary words
print(f"Number of words in the GloVe embeddings: {len(embeddings)}")
print(f"Embedding shape: {embeddings[0].shape}")
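
The parsing loop above relies on the GloVe text format: each line holds a token followed by its vector components, separated by spaces. The following sketch runs the same logic on a tiny in-memory stand-in for the file. The three tokens and their values are made up for illustration, and the `toy_*` names are hypothetical.

```python
import io

# Toy stand-in for a GloVe text file: one token per line, then its vector.
# The values are made up; the real file used above has 300 numbers per line.
toy_file = io.BytesIO(
    b"the 0.1 0.2 0.3\n"
    b"movie -0.5 0.0 0.5\n"
    b"<unk> 0.0 0.0 0.0\n"
)

toy_word_to_index = {}
toy_vectors = []
for idx, line in enumerate(toy_file):
    values = line.decode("utf-8").split()  # split the line on whitespace
    word = values[0]                       # first field is the token
    toy_word_to_index[word] = idx          # the row number doubles as token id
    toy_vectors.append([float(v) for v in values[1:]])

print(toy_word_to_index)
print(toy_vectors[1])
```

Because the row number is used as the token id, `toy_vectors[toy_word_to_index[word]]` always returns the vector of `word`.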

In [None]:
def tokenize(text: str):
    return text.lower().split()


def map_token_to_index(token):
    # Return the index of the token or the index of the '<unk>' token if the token is not in the vocabulary
    return word_to_indices.get(token, word_to_indices["<unk>"])


def map_text_to_indices(text: str):
    return [map_token_to_index(token) for token in tokenize(text)]


dataset_train_tokenized = dataset_train.map(
    lambda x: {"token_ids": map_text_to_indices(x["sentence"])}
)
print(dataset_train_tokenized)
print(dataset_train_tokenized[0])
print(dataset_train_tokenized[1])
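
The sketch below shows how lowercasing, whitespace splitting, and the `<unk>` fallback interact on arbitrary text. It uses a hypothetical four-entry vocabulary instead of the GloVe one, so the ids are for illustration only.

```python
# Hypothetical four-entry vocabulary; the real one is built from the GloVe file.
toy_vocab = {"a": 0, "great": 1, "movie": 2, "<unk>": 3}


def toy_text_to_indices(text: str) -> list:
    # Same steps as above: lowercase, split on whitespace, look up each token.
    tokens = text.lower().split()
    return [toy_vocab.get(token, toy_vocab["<unk>"]) for token in tokens]


print(toy_text_to_indices("A great movie"))   # [0, 1, 2]
# Punctuation stays attached to the word, so "movie!" is out of vocabulary.
print(toy_text_to_indices("A great movie!"))  # [0, 1, 3]
```

Whitespace splitting only works well when punctuation is already separated from words by spaces; otherwise tokens such as `movie!` fall back to `<unk>`.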

In [None]:
dataset_train_tokenized = dataset_train_tokenized.with_format(columns=["token_ids", "label"])

def pad_inputs(batch, keys_to_pad=("token_ids",)):
    # Length of the longest sequence in this batch
    max_len = max(len(sample["token_ids"]) for sample in batch)
    # Right-pad every sequence with 0 up to max_len.
    # Note: index 0 is also the id of the first word in the GloVe vocabulary.
    padded_batch = {
        key: torch.tensor(
            [sample[key] + [0] * (max_len - len(sample[key])) for sample in batch]
        )
        for key in keys_to_pad
    }
    # Copy the remaining keys (e.g. the labels) unchanged, as plain lists
    for key in batch[0].keys():
        if key not in keys_to_pad:
            padded_batch[key] = [sample[key] for sample in batch]
    return padded_batch

dataloader = DataLoader(dataset_train_tokenized, batch_size=4, collate_fn=pad_inputs)

for batch in dataloader:
    token_ids = batch["token_ids"]
    labels = batch["label"]
    print(token_ids)
    print(labels)
    break
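
To see what the padding step does without PyTorch, here is a minimal sketch on plain Python lists. The helper `pad_to_max` and the toy ids are hypothetical. The mask is not part of the code above; it is one possible way to keep track of which positions are padding, since the padding id `0` is also a regular token id.

```python
def pad_to_max(sequences, pad_id=0):
    """Right-pad all sequences to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


toy_batch = [[5, 2, 9], [7], [4, 4]]
padded, mask = pad_to_max(toy_batch)
print(padded)  # [[5, 2, 9], [7, 0, 0], [4, 4, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```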