## Exercise "Natural Language Processing" -- Text Classification with PyTorch


Before you start, save a COPY of this notebook to your Google Drive (File -> Save copy in Drive) and complete the tasks in your saved copy. When you are done, submit the notebook via Moodle, either by sharing a link with the appropriate permissions (preferred, but please do not make changes after the deadline) or by submitting the downloaded `.ipynb` file.

This is an individual assignment, i.e., submit your solutions individually.

This assignment is **mandatory for participation in the exam**. You are required to obtain at least 50% of the total points in this assignment to become eligible for participating in the final exam.

**This assignment will take some time**, particularly if you do not have prior experience with PyTorch. Do not start just the day before the deadline. There are 8 individual tasks; however, most of them require just a few lines of code.

**Due date: 25.05.2023, 9:15 a.m. (CEST)**


You will need to install the following dependency and **restart** the Colab runtime (as of April '23).

In [1]:
! pip install "portalocker>=2.0.0"

In this assignment, we'll build and train a Neural Network from scratch for Text Classification using PyTorch (`import torch`). We will use the `AG_NEWS` dataset to train a model that can classify news articles by topic.

We will complete the following steps in this notebook:
- Define a tokenizer and preprocessing pipeline for our dataset
- Develop a Neural Network architecture using **word embeddings** (in contrast to the document-level bag-of-words embeddings from the last assignments)
- Define a PyTorch training loop and relevant hyperparameters
- Have fun!




In [2]:
# Some imports
import torch
import torchtext

## Data preprocessing

Building your data preprocessing pipeline is a crucial part of working with text! This involves:
- loading the dataset and constructing train, dev and test splits
- building a tokenizer and vocabulary (all tokens that occur in the train split)
- pre-tokenizing, truncating or padding the data to prepare for training

**Task 1 (2 points)**: Split the `AG_NEWS` dataset into train, dev and test splits. The dataset already comes with pre-defined train and test splits. Use a random sample of 10% of samples from the train split to construct our dev split. Use a fixed random seed of `42` to ensure reproducibility.

*Always* split your data in the beginning, before doing any further processing steps on the train split or you might involuntarily leak information.
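To illustrate what a fixed seed buys you, here is a minimal sketch of a reproducible 90/10 split in plain Python. The helper `split_indices` and the toy size of 20 samples are assumptions made for this illustration; for the task itself, use `random_split` with a seeded `torch.Generator`.

```python
import random

def split_indices(n_samples, dev_fraction=0.1, seed=42):
    """Shuffle sample indices with a fixed seed and cut off a dev portion.

    Hypothetical helper for illustration; not part of torch or torchtext.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # private RNG, global state untouched
    n_dev = round(n_samples * dev_fraction)
    return indices[n_dev:], indices[:n_dev]  # (train, dev)

train_idx, dev_idx = split_indices(20)
train_again, dev_again = split_indices(20)

# Same seed -> identical split on every run
assert (train_idx, dev_idx) == (train_again, dev_again)
# The two parts are disjoint and together cover all samples
assert set(train_idx).isdisjoint(dev_idx)
assert sorted(train_idx + dev_idx) == list(range(20))
print(len(train_idx), len(dev_idx))
```

Because the dev indices are drawn before anything else happens, no statistic computed later on the train part (such as the vocabulary) can depend on dev samples.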


How many unique labels are there? What are the classes?

In [3]:
from torchtext.datasets import AG_NEWS
from torch.utils.data.dataset import random_split # hint
from torchtext.data.functional import to_map_style_dataset # hint

# TODO: split dataset

train_split, dev_split = random_split(to_map_style_dataset(AG_NEWS(split=("train"))), [0.9, 0.1], generator=torch.Generator().manual_seed(42)) # Complete
test_split = to_map_style_dataset(AG_NEWS(split=("test")))   # Complete


assert len(train_split) == 108000
assert len(dev_split) == 12000
assert len(test_split) == 7600

assert len(train_split[0]) == 2  # Each sample is a tuple of (topic_label, news_text)

# If you have set the manual_seed to 42, you always get the same splits across runs
## In the following, we test if the last sample in train_split is as expected
assert train_split[-1][0] == 1
assert train_split[-1][1].startswith('Soviet-style Belarus election')


# TODO: What are the unique label ids? How many unique labels exist?
# Bonus question: What is the mapping between the label ids and class names? hint: https://github.com/mhjabreel/CharCnn_Keras/blob/master/data/ag_news_csv/classes.txt
labels = set()
for sample in train_split:
    label_id = sample[0]
    labels.add(label_id)
assert len(labels) == 4
print(labels)

{1, 2, 3, 4}


We will use a basic tokenizer for English from the `torchtext` package. Observe the effects the tokenizer has on a sample input.



In [4]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

print(train_split[0][1])
print(tokenizer(train_split[0][1])) # (label, text) format in train_split

Hopkins wants piece of Roy Jones NEW YORK - One thing Bernard Hopkins knockout of Oscar De La Hoya hasnt changed for the middleweight champion: he still needs to beat Roy Jones to feel fulfilled.
['hopkins', 'wants', 'piece', 'of', 'roy', 'jones', 'new', 'york', '-', 'one', 'thing', 'bernard', 'hopkins', 'knockout', 'of', 'oscar', 'de', 'la', 'hoya', 'hasnt', 'changed', 'for', 'the', 'middleweight', 'champion', 'he', 'still', 'needs', 'to', 'beat', 'roy', 'jones', 'to', 'feel', 'fulfilled', '.']


**Task 2 (3 points)**
- (2 points) Using the tokenizer, build the total *vocabulary* of all tokens occurring in the training split of our AG News dataset. Use `torchtext.vocab.build_vocab_from_iterator` with the added special tokens `["<unk>", "<pad>"]` (**important!**).
- (1 point) Print the number of tokens in the vocabulary and the tokens that map to IDs `0-4`.
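To make the idea of a vocabulary concrete, here is a minimal pure-Python sketch: special tokens receive the first IDs, and the remaining tokens follow by descending frequency. `build_toy_vocab` is a hypothetical helper for illustration and is not how `torchtext` is implemented internally; in particular, tie-breaking between equally frequent tokens may differ from `build_vocab_from_iterator`.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Map each token to an integer ID: specials first, then by frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Sort by descending count, then alphabetically so the result is deterministic
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    itos = list(specials) + [token for token, _ in ordered]
    stoi = {token: idx for idx, token in enumerate(itos)}
    return stoi, itos

corpus = [["the", "cat", "sat", "."], ["the", "dog", "sat", "."], ["the", "end"]]
stoi, itos = build_toy_vocab(corpus)

print(len(itos))   # vocabulary size including the two special tokens
print(itos[:5])    # tokens with IDs 0-4
# Unknown tokens fall back to the ID of "<unk>"
print(stoi.get("unicorn", stoi["<unk>"]))
```

Placing the special tokens first gives them fixed, predictable IDs (`0` and `1` here), which is convenient when the padding ID has to be passed to the embedding layer later.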

In [5]:
from torchtext.vocab import build_vocab_from_iterator # hint

def yield_tokens(data_split):
    """This might be useful"""
    for _, text in data_split:
        yield tokenizer(text)

def get_vocab(split):
  # TODO: build the vocab and return it
  vocab = build_vocab_from_iterator(yield_tokens(split), specials=["<unk>", "<pad>"])
  return vocab

vocab = get_vocab(train_split)
vocab.set_default_index(vocab["<unk>"]) # To handle out of vocabulary cases: if token not in vocabulary, return <unk>
assert len(vocab) == 91154
print("Number of tokens in the vocabulary:", len(vocab))
print("Tokens mapped to IDs 0-4:")
for token_id in range(5):
    print(vocab.get_itos()[token_id])

# Bonus question: What is the token ID (index) for the special token '<unk>'?
## Hint: the `get_stoi()` method of the vocab object is helpful for this!
unk_token_id = vocab.get_stoi()["<unk>"]
print("Token ID for '<unk>':", unk_token_id)

Number of tokens in the vocabulary: 91154
Tokens mapped to IDs 0-4:
<unk>
<pad>
.
the
,
Token ID for '<unk>': 0


**Task 3 (3 points)**: Complete the `process_text` function. The function should:
- tokenize the `text` with the given `tokenizer` (expected output: list of strings)
- (1 point) truncate the result to a maximum of `max_sequence_length` tokens, if longer
- (1 point) pad the result to `max_sequence_length` using the `<pad>` token, if shorter
- (1 point) convert all tokens into *token IDs* using the `vocab` (expected output: list of integers)


We'll use this function to prepare our data to be fed into the GPU. By padding and truncating, we ensure that every sample in the training set has the same length, which allows us to do efficient batched training. We convert to IDs because we only need the IDs of tokens to retrieve their corresponding embeddings during training.
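As a small worked example of why padding and truncating matter, the sketch below turns three ID lists of different lengths into a rectangular batch. The helper name `pad_or_truncate`, the toy IDs, and the maximum length of 4 are assumptions for illustration; the padding ID `1` matches the position of `<pad>` in `specials=["<unk>", "<pad>"]`.

```python
def pad_or_truncate(token_ids, max_len, pad_id):
    """Return exactly `max_len` IDs: cut long inputs, right-pad short ones."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

PAD_ID = 1  # assumed ID of "<pad>"
batch = [[5, 8, 13], [7], [4, 9, 2, 6, 11, 3]]
rectangular = [pad_or_truncate(ids, max_len=4, pad_id=PAD_ID) for ids in batch]

for row in rectangular:
    print(row)
# [5, 8, 13, 1]
# [7, 1, 1, 1]
# [4, 9, 2, 6]
```

Every row now has the same length, so the batch can be stored as a single 2-dimensional tensor of shape `(batch_size, max_len)`.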

In [6]:
def process_text(text, tokenizer, vocab, max_sequence_length=256, pad_token="<pad>"):
  # TODO: complete function

  tokens = tokenizer(text)  # Complete

  length = len(tokens)
  if length > max_sequence_length:
    # Complete
    tokens = tokens[:max_sequence_length]

  elif length < max_sequence_length:
    # Complete
    tokens.extend([pad_token] * (max_sequence_length - len(tokens)))


  assert len(tokens) == max_sequence_length

  token_ids = [vocab[token] for token in tokens] # Complete

  return token_ids

mock_example_long = "lorem ipsum dolar sonet "*10_000
assert len(process_text(mock_example_long, tokenizer, vocab)) == 256

mock_example_short = "lorem ipsum dolar sonet "*2
assert len(process_text(mock_example_short, tokenizer, vocab)) == 256
assert process_text(mock_example_short, tokenizer, vocab)[-1] == vocab["<pad>"] # <pad> tokens at the end

We'll use the `process_text` function in a PyTorch `DataLoader`. A few more preparation steps are needed, which we handle in this `collate_fn`. The `collate_fn` in a `DataLoader` is responsible for making sure the training data is in the right format for our model.

Take a minute to go through the `collate_fn()` to understand what's going on.

In [7]:
def collate_fn(batch, tokenizer, vocab, device):
  prepared_labels, prepared_texts = [], []
  for (_label, _text) in batch:
    prepared_labels.append(int(_label) - 1) # labels in data start at 1, torch wants start at 0
    processed_text = process_text(_text, tokenizer, vocab)
    prepared_texts.append(processed_text)

  # turn into tensors (containing the token IDs)
  prepared_labels = torch.tensor(prepared_labels, dtype=torch.int64)
  prepared_texts = torch.tensor(prepared_texts, dtype=torch.int64)

  # Move to GPU if necessary
  prepared_labels, prepared_texts = prepared_labels.to(device), prepared_texts.to(device)
  return prepared_labels, prepared_texts

## Neural Network Architecture

Now, we'll define our model architecture. You will need to implement two main parts:
1. An Embedding Layer that maps token IDs to a 1-dimensional tensor (i.e. vector) of a fixed size (in this example, choose 50)
2. A Multi-Layer Perceptron that takes as input the **mean** of all token embeddings in the input sequence and outputs a distribution over the classes (as logits, see the tips below).


<details>
<summary>Background: Why do we take the mean?</summary>
<br>
Another option is to concatenate all input embeddings in a sequence in order and feed them into the MLP (our MLP will then need an input layer of size <code>embed_dim * sequence_length</code>). This leads to a larger model but, most importantly, introduces a strong dependency on the <b>position</b> at which a token appears in the sequence. If a token is shifted just a single position to the right, it will be fed into a completely different set of neurons, which is undesirable. By taking the mean, we obtain a <b>position-invariant</b> model. The attentive reader will have noticed that a position-invariant model fixes the position-dependency problem but loses relevant information (i.e. the token order) in the process. In the coming weeks, we will discuss the state-of-the-art architectural innovation to alleviate this problem... stay tuned!</details>
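The trade-off described in the background box can be checked on a toy example. The sketch below uses made-up 2-dimensional embeddings (not learned values) and two hypothetical helpers, `mean_pool` and `concatenate`, to compare both ways of combining token embeddings.

```python
def mean_pool(vectors):
    """Average a list of equally sized vectors element-wise."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]

def concatenate(vectors):
    """Flatten the vectors in sequence order into one long vector."""
    return [value for vector in vectors for value in vector]

# Toy 2-dimensional embeddings (made-up numbers, for illustration only)
embedding = {"dog": [1.0, 0.0], "bites": [0.0, 2.0], "man": [3.0, 1.0]}

sentence_a = ["dog", "bites", "man"]
sentence_b = ["man", "bites", "dog"]  # same tokens, different order

vecs_a = [embedding[t] for t in sentence_a]
vecs_b = [embedding[t] for t in sentence_b]

print(mean_pool(vecs_a) == mean_pool(vecs_b))      # True: order is ignored
print(concatenate(vecs_a) == concatenate(vecs_b))  # False: order changes the input
```

Both sentences receive the same mean vector although they mean different things, which is exactly the information loss mentioned above.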



**Task 4 (10 points)**:
- (4 points) Define the network architecture in the `__init__()` function below. Use an embedding dimension of `50` and a 2-layer MLP: a single fully-connected hidden layer with ReLU activation (a hidden dimension of `50` works here as well), followed by an output layer with `num_class` neurons.
- (4 points) Complete `init_weights()` to initialize the embedding layer and the MLP weights from a uniform distribution with range `[-1,1]`. Initialize all biases in the MLP to zero.
- (2 points) Complete the forward pass of the network (token ID -> embedding -> MLP). For the MLP, use a fully-connected layer (`Linear` layer in `torch.nn`) and the ReLU activation function.

**Tips**
- When defining the Embedding, make sure to supply the correct `padding_idx`. **This is crucial** to make sure that the padding embedding stays a `0`-vector instead of being optimized by gradient descent.
- The network should only output the *logits* (the values before the softmax layer). The actual softmax to obtain a probability distribution over classes is combined with the loss function later on for numerical stability.
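To see why combining softmax and loss helps numerically, here is a minimal sketch in plain Python. It compares a naive "softmax first, then log" computation with the log-sum-exp formulation that works directly on logits. Both function names are made up for this illustration, and the code only demonstrates the general idea; it is not PyTorch's actual implementation.

```python
import math

def naive_cross_entropy(logits, target):
    """Softmax first, then -log(p[target]). Breaks for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

def stable_cross_entropy(logits, target):
    """Cross-entropy computed directly from logits via the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

small = [2.0, 1.0, 0.1, -1.0]
# Both versions agree on well-behaved inputs
assert math.isclose(naive_cross_entropy(small, 0), stable_cross_entropy(small, 0))

large = [1000.0, 999.0, 0.0, -5.0]
try:
    naive_cross_entropy(large, 0)
except OverflowError:
    print("naive version overflows")
print(stable_cross_entropy(large, 0))  # finite, roughly 0.313
```

Subtracting the maximum logit before exponentiating keeps every exponent at or below zero, so nothing can overflow.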

In [8]:
from torch import nn
from torch.nn import EmbeddingBag, Linear, ReLU, Sequential # hints


class MyTextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        # TODO: define embedding layer
        self.embedding = EmbeddingBag(vocab_size, embed_dim, sparse=False, padding_idx=1)


        # TODO: define MLP
        self.mlp = Sequential(
            Linear(embed_dim, hidden_dim),
            ReLU(),
            Linear(hidden_dim, num_class)
        )
        self.init_weights()

    def init_weights(self):
        # TODO: initialize weights
        # hint: what should you set for `nn.Linear.weight` and `nn.Linear.bias` variables? For all layers of the network?
        for name, param in self.named_parameters():
            if "weight" in name:
                nn.init.uniform_(param, -1, 1)
            elif "bias" in name:
                nn.init.zeros_(param)



    def forward(self, text):
        # TODO forward pass
        embedded = self.embedding(text)
        output = self.mlp(embedded)
        return output

## Training Loop

Now comes the last part: defining the training loop! We have prepared a skeleton for you to use below. Make sure you understand every part of the code.

**Task 5 (8 points)**: Complete the training loop with all relevant steps. The training should run without errors when executing the next task, and the loss on the training set should steadily decrease. The code will be used in conjunction with the cell from Task 6, so you should also look at the code there for context.

If you're new to PyTorch, you can find inspiration in this blog post: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html.
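If the order of steps in a training loop is unclear, the following sketch may help. It fits a single weight `w` in `y = w * x` with hand-written gradient descent, so each of the five steps is visible without any framework. The toy data, the learning rate, and the manual gradient formula are assumptions for this illustration; in PyTorch the same roles are played by `optimizer.zero_grad()`, the model call, the loss function, `loss.backward()`, and `optimizer.step()`.

```python
# Toy samples generated from y = 2 * x, so the ideal weight is 2
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, lr = 0.0, 0.05
losses = []

for step in range(20):
    grad = 0.0                                   # 1. reset the gradient
    loss = 0.0
    for x, y in data:
        pred = w * x                             # 2. forward pass
        loss += (pred - y) ** 2 / len(data)      # 3. loss (mean squared error)
        grad += 2 * (pred - y) * x / len(data)   # 4. backward pass: d loss / d w
    w -= lr * grad                               # 5. optimization step
    losses.append(loss)

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.2e}, w: {w:.4f}")
```

Forgetting step 1 would let gradients from earlier iterations accumulate, which is a common source of diverging training runs.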

In [9]:
import time
from tqdm import tqdm

def train(model, dataloader, optimizer, loss_fn, epoch=1):
    """Do model training for a single epoch."""
    # TODO: set model to `train` mode
    model.train()

    total_acc, total_loss, total_count = 0, 0, 0
    log_interval = 500
    start_time = time.time()
    train_progress = tqdm(dataloader, desc=f"Epoch {epoch}", leave=False)

    for idx, (label, tokens) in enumerate(train_progress, 0):
        # TODO: reset gradients to zero
        optimizer.zero_grad()

        # TODO: do the forward pass
        output = model(tokens)

        # TODO: compute the loss
        loss = loss_fn(output, label)

        # TODO: do the backward pass
        loss.backward()

        # TODO: do the optimization step
        optimizer.step()


        # Logging and evaluation
        total_acc += (output.argmax(1) == label).sum().item()
        total_count += label.size(0)
        total_loss += loss.item()
        if idx % log_interval == 0 and idx > 0:
            avg_loss = total_loss / log_interval # loss is already mean over batch dim

            # train_progress.set_description(f"Epoch: {epoch}, loss: {avg_loss:.3}")
            elapsed = time.time() - start_time
            print(f' |{idx:5d}/{len(dataloader):5d} batches | loss: {avg_loss:.3} | accuracy {total_acc/total_count:.3f}')
            total_acc, total_loss, total_count = 0, 0, 0
            start_time = time.time()

def evaluate(model, dataloader, loss_fn):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad(): # Faster, since we do not want gradients here
        for idx, (label, tokens) in enumerate(dataloader):
            # TODO: do the forward pass to get predictions
            prediction = model(tokens)

            # TODO: calculate loss
            loss = loss_fn(prediction, label)

            total_acc += (prediction.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count

Congratulations! Everything should be set up now and you can start the training using the cell below.

You will notice that the training is quite slow. We have already implemented the use of GPUs for hardware acceleration. So let's use a GPU to make things faster 🚀. Google Colab lets you use a free GPU. Go to Runtime > Change runtime type > Select GPU in the dropdown. **You will need to restart the runtime and re-execute cells.** The restart should happen automatically, but you might have to re-execute cells manually.


To use the power of GPUs, our **model** and **data** need to be transferred to the GPU in our code. This is done by the `.to(device)` calls where the device is set to "cpu" or "cuda" (for Nvidia GPUs) depending on GPU availability.

**Task 6 (10 points)**: Achieve an accuracy on the dev split of 90%! There are simple and more creative ways to do this. It's entirely up to you! We have defined defaults for the hyperparameters, loss functions, optimizer, etc. in the code. **Hint:** they are not optimal. Briefly summarize your changes below.

If you do not know where to start: our "Neural Network Tuning Guide" at the end of this notebook will be helpful for you, particularly the "Basic Wisdom".

You are allowed to collaborate with other students for this part, but everyone should submit their own unique solution (i.e. do not copy-paste code or text answers).

For this task it is especially important to keep the output of your successful training in the submission. The last evaluation on the dev set in the output should be over 90%. If you modify any of the previous cells, keep the code for the original task as a comment.


- If you reach 80% accuracy, you'll get 3 points
- If you reach 85%, you'll get 5 points
- If you reach 88%, you'll get 8 points
- If you get to >=90%, you'll get 10 points

In [19]:
import torch
from torch.utils.data import DataLoader

# Hyperparameters - sane default values for this example
EPOCHS = 3 # epoch
LR = 0.01 # learning rate / from 0.1 to 0.01
BATCH_SIZE = 64 # batch size for training

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42) # for reproducible results

model = MyTextClassifier(
    vocab_size=len(vocab),
    embed_dim=50,
    hidden_dim=50,
    num_class=4 # TODO: fill in correct value
    )
model = model.to(device) # Move model to GPU if necessary

loss_fn = torch.nn.CrossEntropyLoss() # this combines softmax with the actual loss function
optimizer = torch.optim.Adam(model.parameters(), lr=LR) # Adam Optimizer instead of SGD

collator = lambda batch: collate_fn(batch, tokenizer, vocab, device)

train_dataloader = DataLoader(train_split, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collator)
dev_dataloader = DataLoader(dev_split, batch_size=BATCH_SIZE,
                              shuffle=False, collate_fn=collator)


for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(model, train_dataloader, optimizer, loss_fn, epoch)
    accu_val = evaluate(model, dev_dataloader, loss_fn)
    print('-' * 59)
    print(f'| end of epoch {epoch:3d} | time: {time.time() - epoch_start_time:5.2f}s | ' +
          f'dev set accuracy {accu_val:8.3f} ')
    print('-' * 59)

 |  500/ 1688 batches | loss: 0.445 | accuracy 0.841
 | 1000/ 1688 batches | loss: 0.274 | accuracy 0.906
 | 1500/ 1688 batches | loss: 0.248 | accuracy 0.916
-----------------------------------------------------------
| end of epoch   1 | time: 135.71s | dev set accuracy    0.915
-----------------------------------------------------------
 |  500/ 1688 batches | loss: 0.118 | accuracy 0.959
 | 1000/ 1688 batches | loss: 0.125 | accuracy 0.956
 | 1500/ 1688 batches | loss: 0.138 | accuracy 0.950
-----------------------------------------------------------
| end of epoch   2 | time: 156.88s | dev set accuracy    0.909
-----------------------------------------------------------
 |  500/ 1688 batches | loss: 0.0553 | accuracy 0.981
 | 1000/ 1688 batches | loss: 0.0669 | accuracy 0.976
 | 1500/ 1688 batches | loss: 0.0777 | accuracy 0.972
-----------------------------------------------------------
| end of epoch   3 | time: 159.82s | dev set accuracy    0.907
-----------------------------------------------------------


In [None]:
# TODO: What did you do to achieve 90% accuracy on the dev set? Explain!
# Changed the optimizer from SGD to Adam and decreased the learning rate from 0.1 to 0.01.

**Task 7 (2 points)**: Finally, you should evaluate your best model on the test split. Does it perform as well as on the dev split?

**The test set**: Careful, if you run this and then change something about your model based on the result, your test accuracy is no longer an unbiased estimate of the performance on unseen data!

In [22]:
test_dataloader = DataLoader(test_split, batch_size=BATCH_SIZE,
                             shuffle=False, collate_fn=collator)

accu_test = evaluate(model, test_dataloader, loss_fn)
print(f'test accuracy {accu_test:.3f}')

test accuracy 0.913


In [None]:
# Short answer: does performance on the test split match the dev split? Hypothesis why / why not?

# Pretty similar: 0.913 on the test split vs. 0.907 on the dev split after the last epoch.


**Task 8 (5 points)**: So far, our embedding layer was trained from a random initialization each time. However, there already exist **pretrained** word embeddings like Word2Vec or GloVe.
- Choose an appropriate pretrained word embedding (hint: `torchtext.vocab.GloVe`. hint2: embedding dimension.)
- Use the pretrained word embedding to initialize the model token embedding. The vocabularies in both embeddings will not be the same! Here, a naive solution is okay (only initialize the token embeddings that occur in both vocabularies). But you can also think of smarter strategies!
- You should use the exact same setup you have used to achieve your 90% accuracy score in the previous task (if you need to change any cells for the initialization, simply copy + paste them below; then modify & re-execute. Do not modify the code above). Summarize your observations and provide a brief explanation for why they might occur.
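The naive strategy from the second bullet can be sketched without downloading any vectors. In the example below, `toy_pretrained` is a made-up stand-in for GloVe, and `init_from_pretrained` is a hypothetical helper; the real task operates on the weight tensor of the embedding layer instead of Python lists. Counting the hits is a cheap way to find out how much of the vocabulary actually benefits from the pretrained vectors.

```python
import random

def init_from_pretrained(vocab_tokens, pretrained, dim, pad_token="<pad>", seed=42):
    """Build one vector per vocabulary token.

    The padding token gets a zero vector, tokens found in `pretrained` copy
    that vector, and all others keep a random initialization in [-1, 1].
    Returns the matrix and the number of tokens covered by `pretrained`.
    """
    rng = random.Random(seed)
    matrix, hits = [], 0
    for token in vocab_tokens:
        if token == pad_token:
            matrix.append([0.0] * dim)  # padding stays a zero vector
        elif token in pretrained:
            matrix.append(list(pretrained[token]))
            hits += 1
        else:
            matrix.append([rng.uniform(-1, 1) for _ in range(dim)])
    return matrix, hits

# Made-up 3-dimensional "pretrained" vectors, standing in for GloVe
toy_pretrained = {"the": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6], "dog": [0.7, 0.8, 0.9]}
toy_vocab = ["<unk>", "<pad>", "the", "cat", "hasnt"]

matrix, hits = init_from_pretrained(toy_vocab, toy_pretrained, dim=3)
print(f"{hits}/{len(toy_vocab)} tokens initialized from pretrained vectors")
```

The row order follows the model vocabulary, not the pretrained one, so token IDs produced by the preprocessing pipeline keep pointing at the right vectors.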

In [None]:
# What changes when initializing from pretrained embeddings instead of a random distribution? Explain!


In [24]:
import torch
from torch.utils.data import DataLoader

# Hyperparameters - same setup as in Task 6
EPOCHS = 3 # epoch
LR = 0.01 # learning rate
BATCH_SIZE = 64 # batch size for training

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42) # for reproducible results

model = MyTextClassifier(vocab_size=len(vocab), embed_dim=50, hidden_dim=50, num_class=4)

# TODO: load pretrained word embeddings and use them to init the model's embedding layer
glove = torchtext.vocab.GloVe(name='6B', dim=50) # 50-dim vectors to match embed_dim

# Naive strategy: copy the GloVe vector for every token that occurs in both vocabularies;
# all other tokens keep their random initialization
with torch.no_grad():
    for token, idx in vocab.get_stoi().items():
        if token in glove.stoi:
            model.embedding.weight[idx] = glove.vectors[glove.stoi[token]]

model = model.to(device) # Move model to GPU if necessary

criterion = torch.nn.CrossEntropyLoss() # this combines softmax with the actual loss function
optimizer = torch.optim.Adam(model.parameters(), lr=LR) # same optimizer as in Task 6

collator = lambda batch: collate_fn(batch, tokenizer, vocab, device)

train_dataloader = DataLoader(train_split, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collator)
dev_dataloader = DataLoader(dev_split, batch_size=BATCH_SIZE,
                              shuffle=False, collate_fn=collator)


for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(model, train_dataloader, optimizer, criterion, epoch)
    accu_val = evaluate(model, dev_dataloader, criterion)
    print('-' * 59)
    print(f'| end of epoch {epoch:3d} | time: {time.time() - epoch_start_time:5.2f}s | ' +
          f'dev set accuracy {accu_val:8.3f} ')
    print('-' * 59)

.vector_cache/glove.6B.zip: 0.00B [00:00, ?B/s]


HTTPError: ignored

## Neural Network Tuning Guide

We improve the performance of our neural networks by tuning **hyperparameters**. These are all settings that influence the network's performance, except for the network's weights (parameters), which are tuned during training. Hyperparameters include:
- The learning rate
- The number of training epochs
- The choice for optimizer (SGD, Adam, ...)
- The batch size
- Network architecture
  - The number of layers in the network (more layers -> deeper network)
  - The "width" of layers in the network (more neurons per layer -> wider network)
  - Activation functions
- Regularization
  - L2-Regularization
  - Data augmentation (especially for images!)
  - Dropout (randomly dropping neurons during training)
  - Small batch sizes
- Normalizing the input data to unit Gaussian range (`transforms.Normalize(0.5, 0.5)`, applies mostly for images)
- Adding normalization layers (`BatchNorm`, `LayerNorm`) to the network
- Adding a learning rate schedule
- ... (many more)

This can seem overwhelming and a bit like "alchemy" (and there is some truth to this). But over time, you will reliably build an intuition about what hyperparameters are responsible for what kinds of behavior and which hyperparameters might be responsible for failure modes.

**Basic wisdom**: You can start by training the model as long as the training loss still improves. Then, check if the loss and performance on the dev set are also still improving. If yes --> train longer / with higher learning rate, make network bigger. If no --> train shorter / with lower learning rate, add regularization. Big changes over small changes: e.g. rather than just training for one more epoch, train twice as long, then iterate and finetune later on.

Some (humorous) resources are [this neural network training recipe](http://karpathy.github.io/2019/04/25/recipe/) by Andrej Karpathy, [this experience report](https://towardsdatascience.com/the-art-of-hyperparameter-tuning-in-deep-neural-nets-by-example-685cb5429a38) or [this practical advice video](https://www.youtube.com/watch?v=wKkcBPp3F1Y) by Andrew Ng.

Some steps you can go through that might be worthwhile if you cannot decide:
- make the network twice as wide
- make the network twice as deep
- use the `Adam` optimizer instead of `SGD` and tune the optimizer hyperparameters
- train for twice as many epochs
- always save a checkpoint of the best model during training

**Bonus**: Is the network overfitting or not? Compare the performance on the dev set with the performance on the train set. A common visualization is to plot train and dev losses, as well as accuracy, over the course of training. **Having a plot of loss on the train vs. dev split over time is key to diagnosing training failure modes**.
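Even without a plot, the numbers logged during training allow a quick check. The sketch below uses the accuracies from the Task 6 run in this notebook (last logged training interval of each epoch and end-of-epoch dev accuracy) to find the best epoch and to track the gap between train and dev accuracy. Treating a steadily growing gap as a sign of overfitting is a rule of thumb, not a formal test.

```python
# Accuracies copied from the Task 6 training log in this notebook:
# last logged training interval of each epoch vs. end-of-epoch dev accuracy
train_acc = [0.916, 0.950, 0.972]
dev_acc = [0.915, 0.909, 0.907]

# Epoch (1-based) with the highest dev accuracy: the checkpoint worth keeping
best_epoch = max(range(len(dev_acc)), key=lambda i: dev_acc[i]) + 1
gaps = [round(t - d, 3) for t, d in zip(train_acc, dev_acc)]

print(f"best dev accuracy after epoch {best_epoch}")
print(f"train-dev gap per epoch: {gaps}")

# Training accuracy keeps rising while dev accuracy falls: a sign of overfitting
gap_grows = all(later > earlier for earlier, later in zip(gaps, gaps[1:]))
print(f"gap grows every epoch: {gap_grows}")
```

In this run the dev accuracy peaks after the first epoch, which is a good argument for saving a checkpoint of the best model rather than keeping only the final one.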