## TC 5033
### Word Embeddings

<br>

#### Activity 3b: Text Classification using RNNs and AG_NEWS dataset in PyTorch
<br>

- Objective:
    - Understand the basics of Recurrent Neural Networks (RNNs) and their application in text classification.
    - Learn how to handle a real-world text dataset, AG_NEWS, in PyTorch.
    - Gain hands-on experience in defining, training, and evaluating a text classification model in PyTorch.
    
<br>

- Instructions:
    - Data Preparation: Starter code will be provided that loads the AG_NEWS dataset and prepares it for training. Do not modify this part. However, you should be sure to understand it, and comment it, the use of markdown cells is suggested. 

    - Model Setup: A skeleton code for the RNN model class will be provided. Complete this class and use it to instantiate your model.

    - Implementing Accuracy Function: Write a function that takes model predictions and ground truth labels as input and returns the model's accuracy.

    - Training Function: Implement a function that performs training on the given model using the AG_NEWS dataset. Your model should achieve an accuracy of at least 80% to get full marks for this part.

    - Text Sampling: Write a function that takes a sample text as input and classifies it using your trained model.

    - Confusion Matrix: Implement a function to display the confusion matrix for your model on the test data.

    - Submission: Submit your completed Jupyter Notebook. Make sure to include a markdown cell at the beginning of the notebook that lists the names of all team members. Teams should consist of 3 to 4 members.
    
<br>

- Evaluation Criteria:

    - Correct setup of all the required libraries and modules (10%)
    - Code Quality (30%): Your code should be well-organized, clearly commented, and easy to follow. Use also markdown cells for clarity. Comments should be given for all the provided code, this will help you understand its functionality.
    
   - Functionality (60%): 
        - All the functions should execute without errors and provide the expected outputs.
        - RNN model class (20%)
        - Accuracy fucntion (10%)
        - Training function (10%)
        - Sampling function (10%)
        - Confucion matrix (10%)

        - The model should achieve at least an 80% accuracy on the AG_NEWS test set for full marks in this criterion.


Dataset

https://pytorch.org/text/stable/datasets.html#text-classification

https://paperswithcode.com/dataset/ag-news


### Import libraries

In [1]:
import torch
import torchtext


torchtext.disable_torchtext_deprecation_warning()
# Use GPU if available
device = torch.device("cuda")
print(device)

cuda


### Get the train and the test datasets and dataloaders

Classes:

* 1 - World

* 2 - Sports

* 3 - Business

* 4 - Sci/Tech

We will convert them to:

* 0 - World

* 1 - Sports

* 2 - Business

* 3 - Sci/Tech

In [2]:
from torchtext.data import functional
from torchtext.datasets import AG_NEWS


DATA_PATH = "data"
train_dataset = AG_NEWS(root=DATA_PATH, split="train")
train_dataset = functional.to_map_style_dataset(train_dataset)
test_dataset = AG_NEWS(root=DATA_PATH, split="test")
test_dataset = functional.to_map_style_dataset(train_dataset)

################################################################################
The 'datapipes', 'dataloader2' modules are deprecated and will be removed in a
future torchdata release! Please see https://github.com/pytorch/data/issues/1196
to learn more and leave feedback.
################################################################################



In [3]:
from torchtext.data import utils


# Get the tokeniser
# tokeniser object
tokeniser = utils.get_tokenizer('basic_english')

def yield_tokens(data):
    for _, text in data:
        yield tokeniser(text)

In [4]:
from torchtext import vocab


# Build the vocabulary
vocabulary = vocab.build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>"])
#set unknown token at position 0
vocabulary.set_default_index(vocabulary["<unk>"])

In [5]:
#test tokens
tokens = tokeniser('Welcome to TE3007')
print(tokens, vocabulary(tokens))

['welcome', 'to', 'te3007'] [3314, 4, 0]


In [6]:
NUM_TRAIN = int(len(train_dataset)*0.9)
NUM_VAL = len(train_dataset) - NUM_TRAIN

In [7]:
from torch.utils.data import dataset


train_dataset, val_dataset = dataset.random_split(train_dataset, [NUM_TRAIN, NUM_VAL])

In [8]:
print(len(train_dataset), len(val_dataset), len(test_dataset))

108000 12000 120000


In [9]:
labels =  ["World", "Sports", "Business", "Sci/Tech"]
max_tokens = 50
BATCH_SIZE = 256

In [10]:
# function passed to the DataLoader to process a batch of data as indicated
def collate_batch(batch):
    # Get label and text
    y, x = list(zip(*batch))
    
    # Create list with indices from tokeniser
    x = [vocabulary(tokeniser(text)) for text in x]
    x = [t + ([0]*(max_tokens - len(t))) if len(t) < max_tokens else t[:max_tokens] for t in x]

    # Prepare the labels, by subtracting 1 to get them in the range 0-3
    return torch.tensor(x, dtype=torch.int32), torch.tensor(y, dtype=torch.long) - 1

In [11]:
from torch.utils.data import DataLoader


train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch, shuffle = True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch, shuffle = True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch, shuffle = True)

### Let us build our RNN model

In [12]:
EMBEDDING_SIZE = 50  # complete
NEURONS = 64  # complete
LAYERS = 3  # complete
NUM_CLASSES = len(labels)  # complete

In [13]:
from torch.nn import Module, Embedding, Linear, GRU


class RNN_Model_1(Module):
    def __init__(self, embed_size, hidden, layers, num_classes):
        super().__init__()
        self.embedding_layer = Embedding(num_embeddings=len(vocabulary), embedding_dim=embed_size)
        self.rnn = GRU(embed_size, hidden, num_layers=layers, batch_first=True)
        self.fc = Linear(hidden, num_classes)

    def forward(self, x):
        x = self.embedding_layer(x)
        out, hidden = self.rnn(x)
        out = hidden[-1, :, :]
        out = self.fc(out)
        return out

In [14]:
from torch.nn import CrossEntropyLoss, Module


def accuracy(model: Module, loader: DataLoader):
    criterion = CrossEntropyLoss()
    data_loss = 0
    true_positives = 0
    with torch.no_grad():
        for data in loader:
            texts, labels = data[0].to(device), data[1].to(device)
            outputs = model(texts)
            loss = criterion(outputs, labels)
            data_loss += loss.item()
            predictions = torch.max(outputs, 1).indices
            true_positives += torch.eq(predictions, labels).int().sum().cpu().item()
    data_loss /= len(loader)
    acc = true_positives / len(loader.dataset)
    return acc, data_loss

In [15]:
def train(model: Module, optimizer, epochs: int):
    criterion = CrossEntropyLoss()
    for epoch in range(epochs):
        train_loss = 0
        model.train()
        for data in train_loader:
            texts, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()
            outputs = model(texts)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        acc, val_loss = accuracy(model, val_loader)
        print(f"Epoch {epoch+1} - loss: {round(train_loss, 4)} - val_loss: {round(val_loss, 4)} - accuracy: {round(acc, 4)}")

In [16]:
from torch.optim.adam import Adam


epochs = 8  # define
lr = 0.001  # to do
# instantiate model
rnn_model = RNN_Model_1(EMBEDDING_SIZE, NEURONS, LAYERS, NUM_CLASSES).to(device)
optimiser = Adam(rnn_model.parameters(), lr=lr)


In [17]:
train(rnn_model, optimiser, epochs=epochs)

Epoch 1 - loss: 0.8183 - val_loss: 0.4642 - accuracy: 0.8309
Epoch 2 - loss: 0.3765 - val_loss: 0.355 - accuracy: 0.8749
Epoch 3 - loss: 0.2848 - val_loss: 0.3157 - accuracy: 0.8892
Epoch 4 - loss: 0.2339 - val_loss: 0.2996 - accuracy: 0.8969
Epoch 5 - loss: 0.1937 - val_loss: 0.2947 - accuracy: 0.9011
Epoch 6 - loss: 0.1621 - val_loss: 0.3002 - accuracy: 0.9021
Epoch 7 - loss: 0.1365 - val_loss: 0.3046 - accuracy: 0.9025
Epoch 8 - loss: 0.1143 - val_loss: 0.3233 - accuracy: 0.9022


In [18]:
print(f'{accuracy(rnn_model, test_loader)[0]:.4f}')

0.9663


In [19]:
import random


def sample_text(model, loader):
    articles = [a for a in loader]
    data = random.choice(articles)
    texts, _ = data[0].to(device), data[1].to(device)
    with torch.no_grad():
        outputs = model(texts)
        predictions = torch.max(outputs, 1).indices
    print("Input:")
    print(" ".join([vocabulary.lookup_token(idx) for idx in texts[0]]))
    print(f"Prediction: {labels[predictions[0].item()]}")


In [20]:
sample_text(rnn_model, test_loader)

Input:
apple #39 s flaming batteries recalled much loved music firm apple has had to recall more than 28 , 000 rechargeable batteries that it jacks into its aluminium 15-inch powerbook g4 because they may overheat . <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>
Prediction: Sci/Tech


In [21]:
# create confusion matrix
pass