In [1]:
import collections
import uuid

from typing import List

import torch
from torch import nn
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split
from torch.utils.tensorboard import SummaryWriter

from torchtext.data.utils import get_tokenizer
from torchtext.datasets import AG_NEWS

from torchtext.vocab import Vocab

This notebook is based on the PyTorch tutorial [Text classification with the torchtext library](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html) with my own notes supplementing what can be found there. We also add TensorBoard logging to the model for visualisation of the embedding layer and tracking of model training.

We use the [AG News Dataset](https://paperswithcode.com/dataset/ag-news). Each sample from the iterator is a tuple of `(label, text)` where `text` is an amalgamation of the `title`, `source` and `description` fields that are defined in the [original source](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html).

The dataset is a classification dataset and the target labels are:  
1. World
2. Sports
3. Business
4. Sci/Tech

## Text Processing

We want to apply a simple text processing pipeline to each sample in our data:
 1. Tokenise: split each input into individual words
 2. Encode: map each word to an integer (its index in our vocabulary)
 
We can use any tokeniser we want, but for simplicity we just use [`torchtext`'s provided `basic_english` tokeniser](https://pytorch.org/text/stable/data_utils.html). This means that punctuation gets its own token for now.

PyTorch leaves the process of counting token occurrences to you (using `collections.Counter`). You then pass the counter into a [`torchtext.vocab.Vocab`](https://pytorch.org/text/stable/vocab.html), which handles the encoding and can also do things like fix the total vocabulary size or enforce a minimum occurrence frequency.
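To make the two steps concrete, here is a minimal plain-Python sketch of what counting and encoding amount to. It does not use `torchtext`: `simple_tokenize`, `build_vocab` and `encode` are hypothetical helpers written only for this illustration, and the `<unk>` fallback for unseen words is an assumption about how out-of-vocabulary tokens should be treated.

```python
import collections
import re


def simple_tokenize(text):
    # Hypothetical stand-in for a real tokeniser: lowercase the text, then split
    # it into words and single punctuation marks (punctuation gets its own token).
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(texts, min_freq=1, unk_token="<unk>"):
    # Count every token, then assign an index to each token seen at least
    # min_freq times. Index 0 is reserved for unknown tokens (an assumption).
    counter = collections.Counter()
    for text in texts:
        counter.update(simple_tokenize(text))
    itos = [unk_token] + [tok for tok, n in counter.most_common() if n >= min_freq]
    return {tok: idx for idx, tok in enumerate(itos)}


def encode(text, stoi, unk_token="<unk>"):
    # Map each token to its index, falling back to the unknown-token index.
    return [stoi.get(tok, stoi[unk_token]) for tok in simple_tokenize(text)]


corpus = ["The cat sat.", "The dog sat, too."]  # made-up example texts
stoi = build_vocab(corpus)
encoded = encode("The bird sat.", stoi)  # "bird" is not in the vocabulary
print(encoded)
```

Raising `min_freq` shrinks the vocabulary, at the cost of sending more rare words to the unknown index.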

In [2]:
tokenizer = get_tokenizer('basic_english')

train_iter = AG_NEWS(split='train')

counter = collections.Counter()
for (label, text) in train_iter:
    counter.update(tokenizer(text))
    
vocab = Vocab(counter, min_freq=1)

Now we define the functions that we want to apply to each line of data and use them in a `collate_fn` that will be applied to an entire batch of data.

In [3]:
use_cuda = torch.cuda.is_available()
device = torch.device("cpu") #torch.device("cuda" if use_cuda else "cpu")

class TextPipeline:
    def __init__(self, vocab, tokenizer):
        self.vocab = vocab
        self.tokenizer = tokenizer
        
    def __call__(self, text):
        return [self.vocab[token] for token in self.tokenizer(text)]

class LabelPipeline:
    def __call__(self, label):
        return int(label) - 1
    
class Collator:
    def __init__(self, text_pipeline, label_pipeline):
        self.text_pipeline = text_pipeline
        self.label_pipeline = label_pipeline
        
    def __call__(self, batch):
        """
        Prepare batch of data to be used as input to torch model.

        Returns
        -------
          labels: a torch.tensor of integer encoded labels. Has shape (batch_size)
          texts: a torch.tensor of integer encoded text sequences. Encoded using text_pipeline.
              Each example is concatenated together into a flat 1D tensor. The start of each
              example is recorded in offsets. Has shape (n_tokens_in_batch)
          offsets: a torch.tensor of the index of the start of each example.
              Has shape (batch_size)
        """
        labels, texts, offsets = [], [], [0]

        for (label, text) in batch:
            labels.append(
                self.label_pipeline(label)
            )
            processed_text = torch.tensor(
                self.text_pipeline(text),
                dtype=torch.int64
            )
            texts.append(processed_text)
            offsets.append(processed_text.size(0)) # length of processed text

        labels = torch.tensor(labels, dtype=torch.int64)
        offsets = torch.tensor(offsets[:-1]).cumsum(dim=0) # starting index of each example
        texts = torch.cat(texts) # we can treat this differently as it is a list of tensors

        return labels.to(device), texts.to(device), offsets.to(device)

In PyTorch, a [DataLoader](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader) wraps a Dataset as an iterable and allows for batching, sampling, shuffling and multiprocess data loading. The AG News dataset is an _iterable_ dataset, so the `DataLoader` cannot sample from it or shuffle it (in principle the iterator itself would have to handle that).
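To make the flat `texts` / `offsets` layout returned by `Collator` concrete, here is a small plain-Python sketch using lists instead of tensors. The helper `flatten_with_offsets` and the token ids are made up for illustration; the point is only that each offset is the running total of the lengths of the preceding examples.

```python
from itertools import accumulate


def flatten_with_offsets(encoded_texts):
    # Concatenate all token ids into one flat list and record where each
    # example starts, mirroring the `texts` and `offsets` tensors.
    flat = [token for example in encoded_texts for token in example]
    lengths = [len(example) for example in encoded_texts]
    # Start of example i = total length of examples 0..i-1.
    offsets = [0] + list(accumulate(lengths))[:-1]
    return flat, offsets


batch = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]  # made-up token ids
flat, offsets = flatten_with_offsets(batch)
print(flat)     # [11, 12, 13, 21, 22, 31, 32, 33, 34]
print(offsets)  # [0, 3, 5]

# Each example can be recovered by slicing between consecutive offsets;
# the last example runs to the end of the flat list.
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds[:-1], bounds[1:])]
```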

The above classes could then be used as follows in `DataLoader`s during training

In [4]:
train_iter = AG_NEWS(split='train')

collator = Collator(
    TextPipeline(vocab, tokenizer),
    LabelPipeline()
)

dataloader = DataLoader(
    train_iter,
    batch_size=8,
    shuffle=False,
    collate_fn=collator
)

# Predictive Model

## Definition
The model is an embedding layer (actually a [`torch.nn.EmbeddingBag`](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html#torch.nn.EmbeddingBag)) followed by two [Linear layers](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) (i.e. a simple two-layer MLP on the output of the embedding layer). This is a slight extension of the tutorial that this notebook is based on.

## Embedding Layer
The embedding layer calculates the mean of the embeddings of each text we send it (in this case each text is a bag). Roughly, we can imagine this as:
  1. Set `i = 0`
  2. Take the slice `texts[offsets[i]:offsets[i+1]]`
  3. Look up the word vector for each token in the slice
  4. Calculate the mean of all the word vectors in the slice
  5. Increment `i` and repeat for the next text in the batch (the last slice runs to the end of `texts`)

This means that every sample gets converted into a single tensor (the mean of the tensors for each individual token in the input sample) regardless of the length of the input text, avoiding the need for padding or truncation.
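The mental model above can be sketched in plain Python. This is only an illustration of mean pooling over bags, not how `EmbeddingBag` is implemented; `mean_pool_bags` is a hypothetical helper, the embedding table is made up, and the sketch assumes every bag contains at least one token.

```python
def mean_pool_bags(embedding_table, texts, offsets):
    # embedding_table[token_id] is that token's word vector (a list of floats).
    # The final bag ends where the flat token list ends.
    bounds = list(offsets) + [len(texts)]
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        vectors = [embedding_table[token_id] for token_id in texts[start:end]]
        # Average each embedding dimension across the tokens in the bag
        # (assumes the bag is non-empty).
        pooled.append([sum(dim) / len(vectors) for dim in zip(*vectors)])
    return pooled


table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # made-up 2-d embeddings
texts = [1, 2, 3, 3, 0]  # two examples flattened together
offsets = [0, 2]         # example 0 = tokens [1, 2], example 1 = tokens [3, 3, 0]
pooled = mean_pool_bags(table, texts, offsets)
print(pooled)
```

Both examples come out as vectors of the same size (here 2), even though one has two tokens and the other three.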

In [5]:
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, hidden_neurons, num_classes):
        super(TextClassificationModel, self).__init__()
        
        self.embedding = nn.EmbeddingBag(vocab_size, hidden_neurons[0], sparse=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden_neurons[0], hidden_neurons[1]),
            nn.ReLU(),
            nn.Linear(hidden_neurons[1], num_classes),
        )
        self.init_weights()
        
    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc[0].weight.data.uniform_(-initrange, initrange)
        self.fc[0].bias.data.zero_()
        self.fc[2].weight.data.uniform_(-initrange, initrange)
        self.fc[2].bias.data.zero_()
        
    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

# Define Training Loop and Evaluation Function

This requires some knowledge of the peculiarities of PyTorch, including:
 - PyTorch basically doesn't do anything unless you tell it to. This means that you have to explicitly construct each step of your training loop, e.g. the forward pass, the backward pass, calculating the loss etc.
 - [You have to set the "mode" of the model](https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch), e.g. during training you should call `model.train()`. This means that layers that behave differently during training and evaluation (for example dropout layers) will do the right thing.
 - [By default, gradients accumulate on each call of `loss.backward()`](https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch). This means that we need to call `optimizer.zero_grad()` at the start of each training iteration.
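The gradient accumulation point is easy to check in isolation. The following is a minimal sketch with a single made-up parameter `w` and a toy loss `2 * w`, whose gradient with respect to `w` is always 2; it is not part of the training code in this notebook.

```python
import torch

# A single trainable parameter and a toy loss whose gradient is always 2.
w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
first = w.grad.item()   # gradient after one backward pass

(2 * w).backward()
second = w.grad.item()  # the second gradient was added to the first

# Reset the stored gradient by hand. optimizer.zero_grad() resets or clears
# the gradients of every parameter the optimizer manages.
w.grad.zero_()
cleared = w.grad.item()

print(first, second, cleared)
```

Without the reset, each optimiser step would use the sum of the gradients from all previous batches rather than the gradient of the current one.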

In [6]:
import time

def train(dataloader, model, loss_fn, optimizer):
    model.train()
    total_correct, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        
        predicted_label = model(text, offsets)

        loss = loss_fn(predicted_label, label)
        loss.backward()
        # clip the gradient norm to stabilise training with the large learning rate
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()

        total_correct += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        
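        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            # NOTE: `epoch` is not a parameter; it is read from the notebook's global
            # scope, so train() must be called inside the `for epoch in ...` loop.
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_correct/total_count))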
            total_correct, total_count = 0, 0
            start_time = time.time()
            
def evaluate(dataloader, model):
    model.eval()
    total_correct, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            total_correct += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_correct / total_count

# Train the Model

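The cell below builds the model, holds out 5% of the training data for validation, and trains with SGD for 10 epochs. After each epoch we measure validation accuracy and log it to TensorBoard; whenever it falls below the best value seen so far, the scheduler cuts the learning rate by a factor of 10.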

At the end of training we will visualise the embedding layer in two ways using TensorBoard:
 1. The embeddings of the individual words
 2. The embeddings of the full texts, labelled by the genre of the article

The hope is that, since we are learning embeddings useful for the categorisation problem at hand, the embedding vectors for the full texts will cluster by label.
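The training cell below also lowers the learning rate, but only in epochs where validation accuracy falls below the best value seen so far. A minimal plain-Python sketch of that rule follows; `decayed_learning_rates` is a hypothetical helper and the accuracies are made-up numbers, not results from this notebook.

```python
def decayed_learning_rates(val_accuracies, lr=5.0, gamma=0.1):
    # Track the best validation accuracy so far. If an epoch does worse than
    # the best, multiply the learning rate by gamma; otherwise update the best.
    best = None
    history = []
    for acc in val_accuracies:
        if best is not None and best > acc:
            lr *= gamma
        else:
            best = acc
        history.append(lr)
    return history


# Made-up validation accuracies for five epochs.
lrs = decayed_learning_rates([0.83, 0.87, 0.86, 0.88, 0.85])
print(lrs)  # the rate drops after epochs 3 and 5, where accuracy fell below the best
```

Note that an epoch that merely matches the best accuracy does not trigger a decay under this rule.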

In [7]:
train_iter = AG_NEWS(split='train')

num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)

hidden_neurons = [64, 64]

model = TextClassificationModel(vocab_size, hidden_neurons, num_class).to(device)

experiment_id = uuid.uuid4().hex
writer = SummaryWriter(f'runs/ag_news/experiment_{experiment_id}')

# Hyperparameters
EPOCHS = 10  # number of training epochs
LR = 5  # learning rate
BATCH_SIZE = 128  # batch size for training

loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)

total_accu = None
train_iter, test_iter = AG_NEWS()

# this changes the dataset from being an iterable to a map (ie accessible by index)
train_dataset = list(train_iter)
test_dataset = list(test_iter)

# create 95% train/val split with test heldout
num_train = int(len(train_dataset) * 0.95)

split_train_, split_valid_ = random_split(
    train_dataset,
    [num_train, len(train_dataset) - num_train]
)

train_dataloader = DataLoader(
    split_train_,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=Collator(
        TextPipeline(vocab, tokenizer),
        LabelPipeline()
    )
)

valid_dataloader = DataLoader(
    split_valid_,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=Collator(
        TextPipeline(vocab, tokenizer),
        LabelPipeline()
    )
)

test_dataloader = DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=Collator(
        TextPipeline(vocab, tokenizer),
        LabelPipeline()
    )
)

writer.add_graph(
    model,
    input_to_model=Collator(
        TextPipeline(vocab, tokenizer),
        LabelPipeline()
    )(train_dataset[:5])[1:]
)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader, model, loss_fn, optimizer)
    accu_val = evaluate(valid_dataloader, model)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)
    writer.add_scalar("Accuracy/validation", accu_val, epoch)
    

# at the end of training record the embedding of individual tokens
# vocab.itos is ordered by index, so row i of the weight matrix is labelled with token i
writer.add_embedding(
    model.embedding.weight,
    metadata=list(vocab.itos),
    tag="all_tokens"
)
    
    
writer.flush()    
writer.close()

| epoch   1 |   500/  891 batches | accuracy    0.626
-----------------------------------------------------------
| end of epoch   1 | time:  9.93s | valid accuracy    0.830 
-----------------------------------------------------------
| epoch   2 |   500/  891 batches | accuracy    0.854
-----------------------------------------------------------
| end of epoch   2 | time:  9.75s | valid accuracy    0.867 
-----------------------------------------------------------
| epoch   3 |   500/  891 batches | accuracy    0.888
-----------------------------------------------------------
| end of epoch   3 | time:  9.55s | valid accuracy    0.873 
-----------------------------------------------------------
| epoch   4 |   500/  891 batches | accuracy    0.903
-----------------------------------------------------------
| end of epoch   4 | time:  9.65s | valid accuracy    0.887 
-----------------------------------------------------------
| epoch   5 |   500/  891 batches | accuracy    0.914
------

# Evaluation

Finally, we evaluate the model's performance on the held-out test dataset.

In [8]:
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader, model)
print('test accuracy {:8.3f}'.format(accu_test))

Checking the results of test dataset.
test accuracy    0.910
