In [None]:
%%capture
!pip install transformers datasets tabulate

In [13]:
import numpy as np
import torch
import torch.nn.functional as F
import torch.nn as nn
import math
from torch.utils.data import DataLoader
from datasets import load_dataset
from tabulate import tabulate

from tqdm.notebook import tqdm
from transformers import BertTokenizer

The goal of this notebook is to implement the model proposed by Yoon Kim in 2014. The original paper can be found [here](https://www.aclweb.org/anthology/D14-1181).
PyTorch and TensorFlow implementations already exist on the web, and they are more or less correct and efficient. However, it is important to write it yourself here: the goal is to better understand PyTorch and the convolution operation.

The road-map is to:
- Implement the convolution and pooling
- Add dropout on the last layer

To start, it is useful to explore convolution layers. In this lab, we consider the convolution operation in one dimension, followed by the corresponding max pooling.


We use the same dataset as before: IMDb. The first cells below are very similar to what we did in HW 1, except that we pull the dataset from the HuggingFace hub, using the `load_dataset` function.


# Data loading


In [None]:
dataset = load_dataset("scikit-learn/imdb", split="train")
print(dataset)

# Pre-processing / Tokenization

This is a very important step. It may be boring, but it matters. In this session we will be lazy, but in real life, the time spent inspecting and cleaning data is never wasted. This is true for text, and for every other kind of data.



In PyTorch, everything is a tensor. Words are replaced by indices, so a sentence is a sequence of indices (long integers). In the first HW, you built a `WhiteSpaceTokenizer`. Here we use a pre-built tokenizer that is better suited to transformers: it relies on sub-word units and converts everything to lower case. This is not always the best choice, but it is sufficient here. To quote the documentation, this tokenizer allows you to:
- Tokenize (splitting strings in sub-word token strings), convert token strings to ids and back, and encode/decode (i.e., tokenizing and converting to integers).
- Add new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…).
- Manage special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization.
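
To make the sub-word idea concrete, here is a minimal sketch of greedy longest-match splitting in the spirit of WordPiece. The function `wordpiece_split` and the tiny vocabulary `toy_vocab` are made up for illustration; the real BERT tokenizer has a much larger vocabulary and extra normalization steps.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real BERT vocabulary.
toy_vocab = {"play", "##ing", "##ed", "un", "##believ", "##able", "film"}
toy_ids = {tok: i for i, tok in enumerate(sorted(toy_vocab))}

for w in ["Playing", "unbelievable", "xyz"]:
    tokens = wordpiece_split(w.lower(), toy_vocab)
    print(w, "->", tokens, [toy_ids.get(t, -1) for t in tokens])
```

A word that is not in the vocabulary is broken into known pieces, and only a word with no matching pieces falls back to the unknown token. This is why a sub-word vocabulary can cover most text with a fixed number of entries.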

Here we are going to use the tokenizer from the well-known BERT model, which we can download directly.

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

🚧 **Tokenizer** 🚧

Make sure you understand the following code. It is important to understand the tokenization process.

In the next cells we are going to experiment with this object.

In [None]:
print("Type of the vocabulary:", type(tokenizer.vocab))
VOCSIZE = len(tokenizer.vocab)
print("Length of the vocabulary:", VOCSIZE)

# Print some keys from the vocabulary
print("Some keys from the vocabulary:", list(tokenizer.vocab.keys())[9000:9010])

In [None]:
def print_sentence(sent):
    """Displays the tokens and respective IDs of a text sample"""
    table = np.array(
        [
            tokenizer.tokenize(sent),
            tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sent)),
        ]
    ).T
    print(tabulate(table, headers=["Tokens", "Token IDs"], tablefmt="fancy_grid"))


sample = dataset[19]
print_sentence(sample["review"])
print("The label:", sample["sentiment"])

In [15]:
def preprocessing_fn(x, tokenizer):
    x["review_ids"] = tokenizer(
        x["review"],
        add_special_tokens=False,
        truncation=True,
        max_length=256,
        padding=False,
        return_attention_mask=False,
    )["input_ids"]
    x["label"] = 0 if x["sentiment"] == "negative" else 1
    return x

In [None]:
preprocessing_fn(dataset[0], tokenizer)

Make sure you understand this output.

Now we can really prepare the data for the NNet.


🚧 **Data loading** 🚧

Read carefully the data loading process. We want to:
- Shuffle the dataset
- For computational reasons, use only a total of **5000 samples**.
- Tokenize the dataset with the `preprocessing_fn`. (*Hint: use the `Dataset.map` method from HuggingFace*).
- Keep only columns `review_ids` and `label`.
- Make a train/validation split (**80% / 20%**). Call these datasets `train_set` and `valid_set`.

Everything is implemented using the `Dataset` class from HuggingFace. It is very convenient and efficient. It is important to understand how it works.

In [None]:
n_samples = 5000  # total number of samples (train + validation)

# We first shuffle the data !
dataset = dataset.shuffle()

# Select 5000 samples
splitted_dataset = dataset.select(range(n_samples))

# Tokenize the dataset
splitted_dataset = splitted_dataset.map(
    preprocessing_fn, fn_kwargs={"tokenizer": tokenizer}
)


# Remove useless columns
splitted_dataset = splitted_dataset.select_columns(["review_ids", "label"])

# Split the train and validation
splitted_dataset = splitted_dataset.train_test_split(test_size=0.2)

train_set = splitted_dataset["train"]
valid_set = splitted_dataset["test"]

The dataset now outputs lists of ids. However, one step remains: since we want batches of tensors, all sequences in a batch must have the same length.

Below is code that pads a batch of lists to the same size.

> 💡 *Note*: This can be done a bit more quickly with HuggingFace's built-in `DataCollator` objects, which tokenize and pad at once. Since these objects can look complex at first sight, we present a custom equivalent here. They rely on the same underlying process as what we present here.
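
As a warm-up, here is a stripped-down sketch of what padding does, independent of any tokenizer. The helper `pad_batch` is hypothetical, and `pad_id=0` is an assumption: it matches the `[PAD]` id of `bert-base-uncased`, but in general you should read it from `tokenizer.pad_token_id`. Unlike the collator below, which pads to a fixed length of 256, this sketch pads to the longest sequence in the batch.

```python
import torch


def pad_batch(sequences, pad_id=0):
    """Right-pad lists of ids with pad_id and stack them into a LongTensor."""
    max_length = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_id] * (max_length - len(seq)) for seq in sequences]
    return torch.tensor(padded, dtype=torch.long)


toy_batch = [[5, 8, 2], [7], [4, 4, 9, 1, 3]]
padded = pad_batch(toy_batch)
print(padded.shape)  # (number of sequences, longest length)
print(padded)
```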

In [18]:
class DataCollator:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, batch):
        # `batch` is a list of dictionaries with keys "review_ids" and "label".
        features = [{"input_ids": x["review_ids"]} for x in batch]
        features = self.tokenizer.pad(
            features, padding="max_length", max_length=256, return_tensors="pt"
        )
        label = torch.tensor([x["label"] for x in batch])[:, None]
        return {"review_ids": features["input_ids"], "label": label}

Let's define the `DataLoaders`.

In [19]:
data_collator = DataCollator(tokenizer)

In [20]:
batch_size = 32

train_dataloader = DataLoader(
    train_set, batch_size=batch_size, collate_fn=data_collator
)
valid_dataloader = DataLoader(
    valid_set, batch_size=batch_size, collate_fn=data_collator
)
n_valid = len(valid_set)
n_train = len(train_set)

🚧 **Check** 🚧

Let's see what everything does.

Print various information about one batch.

Explore the variables and understand what's inside.

In [None]:
batch = next(iter(train_dataloader))
print("batch is a dictionary with keys:", batch.keys())
print("Size of different elements:", batch["review_ids"].shape, batch["label"].shape)

# Convolution model
Now we are done with the data and have our interface to the model, but we still need to build the model.
Here is the sequence of operations the model will perform:
- Embedding
- Convolution (1D)
- Pooling
- Linear

Now the question is: how can we do it in PyTorch?


This is easy since most of these operations are already implemented. The difficult part is getting the dimensions right. This is true for PyTorch as well as TensorFlow. Moreover, things can be tricky if we want the model to work properly with mini-batches (and we do).
We will go through everything step by step.

## Embedding layer

The goal is to store a real-valued vector for each symbol (word) in the vocabulary. The layer requires:
- num_embeddings: the vocabulary size, i.e. the number of words under consideration. Words are represented by an index (starting at 0).
- embedding_dim: the dimension of the continuous space (i.e. of the word embeddings).


Implicitly, a lookup matrix is created to store *num_embeddings* vectors of size *embedding_dim*. Let us start with dummy dimensions that will help us see what happens.
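
The lookup behaviour can be checked on a toy example. This is a minimal sketch with made-up sizes (10 symbols, vectors of size 4); the names `toy_embedding` and `toy_ids` are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
toy_ids = torch.tensor([[1, 3, 3, 0], [9, 2, 0, 0]])  # (B=2, L=4)

vectors = toy_embedding(toy_ids)
print(toy_embedding.weight.shape)  # lookup matrix: (num_embeddings, embedding_dim)
print(vectors.shape)               # one vector per index: (B, L, embedding_dim)

# The layer is a plain row lookup in its weight matrix.
print(torch.equal(vectors, toy_embedding.weight[toy_ids]))
```

Note that the embedding dimension comes last in the output, after the sequence length. This matters for the convolution layer discussed next.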

🚧 **TODO** 🚧

Create an instance of `nn.Embedding`, with a vocabulary size equal to `VOCSIZE` and dimension `h1`.

In [None]:
h1 = 50  # dimension of embeddings, the input size for convolution

# One vector of size h1 per vocabulary entry
embedding_layer = nn.Embedding(num_embeddings=VOCSIZE, embedding_dim=h1)


🚧 **TODO** 🚧
- Extract the lookup matrix from the embedding layer and print its shape.

- Print the embedding sequences of the first 2 sequences of the `batch` created above (they should be totally random!).

- Print the shape of the embedding sequence of `batch["review_ids"]`.

In [None]:
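# Lookup matrix: one row of size h1 per vocabulary entry
lookup_matrix = embedding_layer.weight
print(lookup_matrix.shape)

# Embeddings of the first 2 sequences (random at initialization)
print(embedding_layer(batch["review_ids"][:2]))

# Shape of the embedded batch: (B, L, h1)
print(embedding_layer(batch["review_ids"]).shape)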

## Convolution1D

Look at the documentation of the Conv1d layer. Read it carefully and try to completely understand the following code. A convolution layer expects a tensor as input, with the following dimensions *(B, D, L)*:
- B: size of the batch, the number of examples (here the number of sequences). For the moment we consider *B=1* (only one sequence)
- D: the dimension of the vectors for each time step
- L: the length of the input sequence (the number of tokens in the sequence).

🚧 **Question** 🚧
Is this shape directly compatible with our Embeddings layer defined above?

🚧 **TODO** 🚧

- Make sure the following code, which computes a convolution, runs and that its output shape is consistent.
- Draw what happens to better understand the obtained dimensions.


In [None]:
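convolution_layer = nn.Conv1d(in_channels=h1, out_channels=2, kernel_size=3)
sequence_embedding = embedding_layer(batch["review_ids"])  # (B, L, h1)
# Conv1d expects (B, h1, L): swap the last two dimensions with .mT
convolution_output = convolution_layer(sequence_embedding.mT)
print(convolution_output.shape)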


🚧 **TODO** 🚧

Now add another parameter for padding (set to 1). What do you observe?
Play a bit with *kernel_size* and *padding* to understand how they interact:
- try (kernel_size, padding) = (3, 1)
- (5,1) and (5,2)
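
The interaction between the two parameters can be summarized by the output-length rule for a stride of 1 and no dilation: `L_out = L + 2 * padding - kernel_size + 1`. The sketch below compares this rule with what `nn.Conv1d` returns, on a toy input with made-up sizes; `conv1d_output_length` is a hypothetical helper written for this check.

```python
import torch
import torch.nn as nn


def conv1d_output_length(length, kernel_size, padding=0):
    """Output length of a Conv1d with stride 1 and dilation 1."""
    return length + 2 * padding - kernel_size + 1


toy_input = torch.randn(1, 8, 10)  # (B=1, D=8, L=10), toy sizes
observed_lengths = {}
for kernel_size, padding in [(3, 0), (3, 1), (5, 1), (5, 2)]:
    toy_conv = nn.Conv1d(8, 2, kernel_size=kernel_size, padding=padding)
    observed_lengths[(kernel_size, padding)] = toy_conv(toy_input).shape[-1]
    print(
        (kernel_size, padding),
        "observed:", observed_lengths[(kernel_size, padding)],
        "formula:", conv1d_output_length(10, kernel_size, padding),
    )
```

With an odd kernel size, a padding of `kernel_size // 2` keeps the output as long as the input.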

Here is the code for the `Conv1dClassifier`.
Its modules are:
- An embedding layer like above.
- A convolutional layer like above.
- A non-linearity.
- A pooling layer.
- A dropout layer.
- A linear layer that maps the pooled features to a scalar.
- A sigmoid output function.

In [23]:
class Conv1dClassifier(nn.Module):
    """A text classifier:
    - input = minibatch
    - output = probability associated to a binary classification task
    - vocab_size: the number of words in the vocabulary we want to embed
    - embedding_dim: size of the word vectors
    """

    def __init__(self, vocab_size, embedding_dim, feature_size=100, kernel_size=3):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.feature_size = feature_size
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The number of padding symbols depends on the kernel size.
        # Padding ensures that the sequence is always at least
        # as long as the kernel.
        # ex: if ks=3, we add 1 padding before and one after.
        # The sentence "Great" becomes "<pad> Great <pad>"
        self.conv = nn.Conv1d(
            embedding_dim,
            feature_size,
            kernel_size,
            padding=math.floor(kernel_size / 2),
        )
        # The parameter for AdaptiveMaxPool1d is the "output size"
        # or the number of output values for a dimension.
        # Here it is one: we want to get the max for every components.
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.dropout = nn.Dropout(0.3)
        self.linear = nn.Linear(feature_size, 1)
        self.out_activation = nn.Sigmoid()

    def forward(self, input_ids):
        # In pytorch, convolution expects (B,d,L)
        # B: the batch dimension
        # d: the embedding dimension
        # L: the length of the sequence
        hidden_states = self.embeddings(input_ids).permute(0, 2, 1)
        hidden_states = F.relu(self.conv(hidden_states))
        hidden_states = self.pool(hidden_states)  # --> (B, feature_size, 1)
        # Linear works on the last dimension, so drop the
        # trailing length-1 dimension first:
        # (B, feature_size, 1) -> (B, feature_size)
        hidden_states = hidden_states.squeeze(dim=2)
        hidden_states = self.dropout(hidden_states)
        logits = self.linear(hidden_states)
        return self.out_activation(logits)
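
The pooling step used in the class can be checked in isolation. This is a small sketch on a random tensor with made-up sizes: it suggests that `nn.AdaptiveMaxPool1d(1)` simply keeps, for each feature, the maximum over the sequence dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = torch.randn(4, 6, 20)  # (B=4, feature_size=6, L=20), toy sizes

pooled = nn.AdaptiveMaxPool1d(1)(features)         # (B, feature_size, 1)
manual = features.max(dim=2, keepdim=True).values  # max over the length dimension

print(pooled.shape)
print(torch.equal(pooled, manual))
print(pooled.squeeze(dim=2).shape)  # (B, feature_size), ready for nn.Linear
```

Because the maximum is taken over the whole sequence, the output size does not depend on the input length.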

Test the classifier on a random sequence.

In [None]:
random_inputs = torch.randint(0, VOCSIZE, (4, 100))
# Test the class: is everything in place?
# A first classifier is built like this:
model = Conv1dClassifier(
    vocab_size=VOCSIZE, embedding_dim=25
)  # The parameters of the classifier are randomly initialized, but we
# can already use it on a sequence:
out = model(random_inputs)
print(out.shape)  # the output has 2 dimensions: (4, 1)
print(out)

# Is it correct? If not, correct the class to get the expected result.

# Training the model

To train the model, we need to define a loss function and an optimizer. We rely on mini-batch stochastic optimization (the code below uses Adam with batches of 32 examples). For each step:
- we pick one mini-batch of training examples
- compute the loss
- back-propagate the gradient
- update the parameters


At the end of each epoch, we evaluate the model on the validation set.
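
The steps listed above (forward pass, loss, back-propagation, update) can be sketched on toy data before running them on the real model. Everything here is made up for illustration: a logistic regression `toy_model`, random inputs `toy_x`, and labels `toy_y` derived from the first feature. The whole toy set is used as a single batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_x = torch.randn(64, 5)
toy_y = (toy_x[:, :1] > 0).float()  # shape (64, 1), like the notebook's labels

toy_model = nn.Sequential(nn.Linear(5, 1), nn.Sigmoid())
toy_loss_fn = nn.BCELoss()
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.05)

toy_losses = []
for step in range(50):
    toy_optimizer.zero_grad()         # reset gradients from the previous step
    probs = toy_model(toy_x)          # forward pass: probabilities in (0, 1)
    loss = toy_loss_fn(probs, toy_y)  # binary cross-entropy
    loss.backward()                   # back-propagate the gradient
    toy_optimizer.step()              # update the parameters
    toy_losses.append(loss.item())

print(f"first loss: {toy_losses[0]:.3f}, last loss: {toy_losses[-1]:.3f}")
```

`nn.BCELoss` expects probabilities and float targets of the same shape, which is why the notebook's collator adds a trailing dimension to the labels and the training loop calls `gold.float()`.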

In [25]:
# We redefine the DataLoader, in case you have modified it.
batch_size = 32
train_set = splitted_dataset["train"]
valid_set = splitted_dataset["test"]
train_dataloader = DataLoader(
    train_set, batch_size=batch_size, collate_fn=data_collator
)
valid_dataloader = DataLoader(
    valid_set, batch_size=batch_size, collate_fn=data_collator
)

In [None]:
import matplotlib.pyplot as plt

model = Conv1dClassifier(vocab_size=VOCSIZE, embedding_dim=100, feature_size=100)
loss_function = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
n_epochs = 10
train_losses = []
valid_losses = []
train_accs = []
valid_accs = []

model.cuda()


def compute_accuracy(predictions, labels):
    pred = (predictions > 0.5).int()
    correct = (labels == pred).sum().item()
    return correct


def train_one_epoch(model, dataloader, optimizer, loss_function):
    model.train()
    total_loss = 0
    correct = 0
    total_batches = len(dataloader)

    for batch in tqdm(dataloader, leave=True):
        batch = {k: v.cuda() for k, v in batch.items()}
        optimizer.zero_grad()
        probs = model(batch["review_ids"])
        gold = batch["label"]

        correct += compute_accuracy(probs, gold)
        loss = loss_function(probs, gold.float())
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / total_batches
    avg_accuracy = (correct * 100) / n_train  # Calculate accuracy in percentage
    return avg_loss, avg_accuracy


def validate_one_epoch(model, dataloader, loss_function):
    model.eval()
    total_loss = 0
    correct = 0
    total_batches = len(dataloader)

    with torch.no_grad():
        for batch in tqdm(dataloader, leave=True):
            batch = {k: v.cuda() for k, v in batch.items()}
            probs = model(batch["review_ids"])
            gold = batch["label"]

            correct += compute_accuracy(probs, gold)
            loss = loss_function(probs, gold.float())
            total_loss += loss.item()

    avg_loss = total_loss / total_batches
    avg_accuracy = (correct * 100) / n_valid  # Calculate accuracy in percentage
    return avg_loss, avg_accuracy


for epoch in range(n_epochs):
    train_avg_loss, train_avg_acc = train_one_epoch(
        model, train_dataloader, optimizer, loss_function
    )
    valid_avg_loss, valid_avg_acc = validate_one_epoch(
        model, valid_dataloader, loss_function
    )

    train_losses.append(train_avg_loss)
    valid_losses.append(valid_avg_loss)
    train_accs.append(train_avg_acc)
    valid_accs.append(valid_avg_acc)

    print(
        f"Epoch {epoch+1}/{n_epochs}",
        f"Train Loss: {train_avg_loss:.2f}",
        f"Train Acc: {train_avg_acc:.2f}%",
        f" | Valid Loss: {valid_avg_loss:.2f}",
        f"Valid Acc: {valid_avg_acc:.2f}%",
    )

# Plotting Loss and Accuracy Curves
epochs_range = range(1, n_epochs + 1)

plt.figure(figsize=(12, 5))

# Loss plot
plt.subplot(1, 2, 1)
plt.plot(epochs_range, train_losses, label="Train Loss")
plt.plot(epochs_range, valid_losses, label="Validation Loss")
plt.title("Loss Curves")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()

# Accuracy plot
plt.subplot(1, 2, 2)
plt.plot(epochs_range, train_accs, label="Train Accuracy")
plt.plot(epochs_range, valid_accs, label="Validation Accuracy")
plt.title("Accuracy Curves")
plt.xlabel("Epochs")
plt.ylabel("Accuracy (%)")
plt.legend()

plt.tight_layout()
plt.show()