## TC 5033
### Word Embeddings

<br>

#### Activity 3b: Text Classification using RNNs and AG_NEWS dataset in PyTorch
<br>

- Objective:
    - Understand the basics of Recurrent Neural Networks (RNNs) and their application in text classification.
    - Learn how to handle a real-world text dataset, AG_NEWS, in PyTorch.
    - Gain hands-on experience in defining, training, and evaluating a text classification model in PyTorch.
    
<br>

- Instructions:
    - Data Preparation: Starter code will be provided that loads the AG_NEWS dataset and prepares it for training. Do not modify this part. However, make sure you understand it and comment it; using markdown cells for this is suggested.

    - Model Setup: A skeleton code for the RNN model class will be provided. Complete this class and use it to instantiate your model.

    - Implementing Accuracy Function: Write a function that takes model predictions and ground truth labels as input and returns the model's accuracy.

    - Training Function: Implement a function that performs training on the given model using the AG_NEWS dataset. Your model should achieve an accuracy of at least 80% to get full marks for this part.

    - Text Sampling: Write a function that takes a sample text as input and classifies it using your trained model.

    - Confusion Matrix: Implement a function to display the confusion matrix for your model on the test data.

    - Submission: Submit your completed Jupyter Notebook. Make sure to include a markdown cell at the beginning of the notebook that lists the names of all team members. Teams should consist of 3 to 4 members.
    
<br>

- Evaluation Criteria:

    - Correct setup of all the required libraries and modules (10%)
    - Code Quality (30%): Your code should be well-organized, clearly commented, and easy to follow. Also use markdown cells for clarity. Comment all of the provided code as well; doing so will help you understand its functionality.
    
    - Functionality (60%): 
        - All the functions should execute without errors and provide the expected outputs.
        - RNN model class (20%)
        - Accuracy function (10%)
        - Training function (10%)
        - Sampling function (10%)
        - Confusion matrix (10%)

        - The model should achieve at least 80% accuracy on the AG_NEWS test set for full marks in this criterion.


Dataset

https://pytorch.org/text/stable/datasets.html#text-classification

https://paperswithcode.com/dataset/ag-news


### Import libraries

In [1]:
#%pip install torchtext
#%pip install torchdata
#%pip install scikit-plot
#%pip install portalocker
# conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

In [2]:
# The following libraries are required for running the given code
# Feel free to add any other libraries you consider adequate to complete the assignment.
import numpy as np
#PyTorch libraries
import torch
from torchtext.datasets import AG_NEWS
# Dataloader library
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split
# Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# neural layers
from torch import nn
from torch.nn import functional as F

# These libraries are suggested to plot confusion matrix
# you may use others
import scikitplot as skplt
import gc



In [3]:
# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cpu


### Get the train and the test datasets and dataloaders

Classes:

* 1 - World

* 2 - Sports

* 3 - Business

* 4 - Sci/Tech

We will convert them to:

* 0 - World

* 1 - Sports

* 2 - Business

* 3 - Sci/Tech

In [4]:
train_dataset, test_dataset = AG_NEWS()
train_dataset, test_dataset = to_map_style_dataset(train_dataset), to_map_style_dataset(test_dataset)

In [5]:
# Get the tokeniser object; 'basic_english' lowercases the text and splits it into word tokens
tokeniser = get_tokenizer('basic_english')

def yield_tokens(data):
    for _, text in data:
        yield tokeniser(text)

In [6]:
# Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>"])
# Out-of-vocabulary tokens map to the index of <unk>, which is 0
vocab.set_default_index(vocab["<unk>"])

In [7]:
#test tokens
tokens = tokeniser('Welcome to TE3007')
print(tokens, vocab(tokens))

['welcome', 'to', 'te3007'] [3314, 4, 0]
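
The `0` returned for `te3007` comes from the default index: a token that was not seen while building the vocabulary falls back to the index of `<unk>`. The following is a minimal pure-Python sketch of that lookup behaviour. `build_toy_vocab` and `lookup` are hypothetical helpers written for illustration, the toy corpus is made up, and none of this is the torchtext implementation.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map each token to an integer index; specials come first (hypothetical helper)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent tokens get the smallest indices; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

def lookup(vocab_map, tokens, default_index=0):
    """Return one index per token, using default_index for unknown tokens."""
    return [vocab_map.get(tok, default_index) for tok in tokens]

# Made-up corpus, already tokenised and lowercased.
toy_corpus = [["welcome", "to", "the", "course"], ["welcome", "to", "class"]]
toy_vocab = build_toy_vocab(toy_corpus)

# "te3007" never appears in the corpus, so it falls back to the <unk> index.
indices = lookup(toy_vocab, ["welcome", "to", "te3007"], default_index=toy_vocab["<unk>"])
print(indices)
```

The real vocabulary is built from the whole training split, so its indices differ, but the fallback for unseen tokens works the same way.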


In [8]:
NUM_TRAIN = int(len(train_dataset)*0.9)
NUM_VAL = len(train_dataset) - NUM_TRAIN

In [9]:
train_dataset, val_dataset = random_split(train_dataset, [NUM_TRAIN, NUM_VAL])

In [10]:
print(len(train_dataset), len(val_dataset), len(test_dataset))

108000 12000 7600


In [11]:
max_tokens = 50
# Function passed to the DataLoader to turn a list of (label, text) pairs into tensors
def collate_batch(batch):
    # Separate the labels (y) from the texts (x)
    y, x = list(zip(*batch))

    # Tokenise each text and map its tokens to vocabulary indices
    x = [vocab(tokeniser(text)) for text in x]
    # Right-pad with 0 (the index of <unk>) or truncate to exactly max_tokens indices
    x = [t + ([0]*(max_tokens - len(t))) if len(t) < max_tokens else t[:max_tokens] for t in x]

    # Shift the labels from the range 1-4 to the range 0-3
    return torch.tensor(x, dtype=torch.int32), torch.tensor(y) - 1
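
To make the padding and truncation step concrete, here is a small pure-Python sketch of what `collate_batch` does to each sequence and to the labels. `pad_or_truncate` is a hypothetical helper and the index values are made up; the only thing taken from the notebook is the rule itself (fill with `0` up to a fixed length, cut anything longer, subtract 1 from each label).

```python
def pad_or_truncate(indices, max_len, pad_index=0):
    """Return a list of exactly max_len indices (hypothetical helper)."""
    if len(indices) < max_len:
        # Too short: append pad_index until the target length is reached.
        return indices + [pad_index] * (max_len - len(indices))
    # Long enough: keep only the first max_len indices.
    return indices[:max_len]

short = pad_or_truncate([7, 3, 9], max_len=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_len=5)

# AG_NEWS labels arrive as 1-4; subtracting 1 gives the 0-3 range.
raw_labels = (3, 1, 4)
shifted = [label - 1 for label in raw_labels]

print(short)    # [7, 3, 9, 0, 0]
print(long)     # [1, 2, 3, 4, 5]
print(shifted)  # [2, 0, 3]
```

Note that in this notebook the padding value `0` is also the index of `<unk>`, so the model cannot tell padding apart from an unknown word.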

In [12]:
labels =  ["World", "Sports", "Business", "Sci/Tech"]
BATCH_SIZE = 256

In [13]:
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch, shuffle = True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch, shuffle = True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch, shuffle = True)
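
As a quick sanity check on these loaders, the number of batches per epoch can be worked out from the split sizes printed earlier and `BATCH_SIZE`. The sketch below assumes the `DataLoader` default of keeping the final, smaller batch (`drop_last=False`), so the batch count is the ceiling of the division.

```python
import math

BATCH_SIZE = 256
# Split sizes as printed by the notebook.
split_sizes = {"train": 108000, "val": 12000, "test": 7600}

# With drop_last=False the last partial batch is kept, hence the ceiling.
batches = {name: math.ceil(n / BATCH_SIZE) for name, n in split_sizes.items()}
# Size of the final batch in each split (a full batch if the division is exact).
last_batch = {name: (n % BATCH_SIZE) or BATCH_SIZE for name, n in split_sizes.items()}

print(batches)     # {'train': 422, 'val': 47, 'test': 30}
print(last_batch)  # {'train': 224, 'val': 224, 'test': 176}
```

The 422 training batches match the `422/422` shown by the progress bars in the training output.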

### Let us build our RNN model

In [14]:
EMBEDDING_SIZE = 50
NEURONS = 40
LAYERS = 4
NUM_CLASSES = 4

In [15]:
class RNN_Model_1(nn.Module):
    def __init__(self, embed_size, hidden, layers, num_classes):
        super().__init__()
        # The embedding layer needs one row per vocabulary entry
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_size)
        # You may use PyTorch nn.GRU(), nn.RNN(), or nn.LSTM()
        self.rnn = nn.RNN(input_size=embed_size, hidden_size=hidden, num_layers=layers, batch_first=True)
        # The output layer maps the hidden state (size `hidden`) to the number of classes
        self.linear = nn.Linear(hidden, num_classes)
        
    def forward(self, x):
        # Forward pass, called when the model is executed on a batch
        embeddings = self.embedding_layer(x)
        # No initial hidden state is passed, so nn.RNN starts from zeros for every batch
        output, hidden = self.rnn(embeddings)
        # Classify from the output at the last time step
        return self.linear(output[:, -1])
        

* Long Short-Term Memory (LSTM) networks are an improvement over simple Recurrent Neural Networks (RNNs). LSTMs are designed to mitigate the vanishing gradient problem, which is common in standard RNNs, especially with longer sequences.
* With their memory cells and control gates (forget, input, and output gates), LSTMs can retain relevant information over long sequences, which makes them better suited to natural language processing tasks where dependencies between words can span many tokens.
* Initializing the hidden and cell states to zero for each batch is a common practice, as it provides a consistent starting point for each sequence. This is particularly useful in text classification, where each input is generally independent of the others.
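
For reference, the gates named above are usually written as follows. This is the standard textbook formulation with notation chosen here for illustration: $x_t$ is the input at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication. The weight and bias names ($W$, $U$, $b$) are assumptions of this sketch, not identifiers from the notebook.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The additive update of $c_t$ is commonly cited as the reason gradients survive across many time steps better than in a simple RNN, where the state passes through a squashing nonlinearity at every step.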

In [22]:
class RNN_Model_1(nn.Module):
    def __init__(self, embed_size, hidden, layers, num_classes):
        super(RNN_Model_1, self).__init__()
        self.layers = layers
        self.hidden = hidden
        # Embedding layer
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_size)
        # LSTM instead of RNN
        self.lstm = nn.LSTM(input_size=embed_size, hidden_size=hidden, num_layers=layers, batch_first=True)                        
        # Linear layer for classification
        self.linear = nn.Linear(hidden, num_classes)     

    def forward(self, x):
        # Embedding the input
        embeddings = self.embedding_layer(x)
        # Hidden and cell state initialized to zeros, created directly on the input's device
        h0 = torch.zeros(self.layers, x.size(0), self.hidden, device=x.device)
        c0 = torch.zeros(self.layers, x.size(0), self.hidden, device=x.device)
        # Passing through LSTM
        output, (hn, cn) = self.lstm(embeddings, (h0, c0))
        # Taking the last output of the sequence for classification
        return self.linear(output[:, -1])
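
The tensor shapes involved in this forward pass can be checked in isolation. The sketch below is a standalone illustration with made-up sizes and random inputs standing in for the embedded tokens; it is not the notebook's model. It shows that, for a unidirectional `nn.LSTM` with `batch_first=True`, the last time step of `output` equals the final hidden state of the top layer, and that omitting the initial states is equivalent to passing zeros.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Made-up sizes for illustration only.
batch_size, seq_len, embed_size, hidden, layers = 3, 5, 8, 6, 2
lstm = nn.LSTM(input_size=embed_size, hidden_size=hidden,
               num_layers=layers, batch_first=True)

# Random tensor standing in for the output of the embedding layer.
x = torch.randn(batch_size, seq_len, embed_size)
h0 = torch.zeros(layers, batch_size, hidden)
c0 = torch.zeros(layers, batch_size, hidden)

output, (hn, cn) = lstm(x, (h0, c0))
print(output.shape)  # torch.Size([3, 5, 6]): (batch, seq_len, hidden)
print(hn.shape)      # torch.Size([2, 3, 6]): (layers, batch, hidden)

# The last time step of the top layer is the top layer's final hidden state.
same_as_hn = torch.allclose(output[:, -1], hn[-1])

# Leaving out (h0, c0) makes the LSTM start from zeros as well.
output_default, _ = lstm(x)
same_as_default = torch.allclose(output, output_default)
print(same_as_hn, same_as_default)
```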


In [33]:
from sklearn.metrics import accuracy_score

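def accuracy(model, loss_fn, loader):
    # Evaluate the model on every batch of `loader` and print the mean loss and the accuracy.
    # The printed labels say "Valid" whichever loader (validation or test) is passed in.
    with torch.no_grad():  # gradients are not needed for evaluation
        Ys, Y_preds, losses = [], [], []
        for X, Y in loader:
            preds = model(X)
            loss = loss_fn(preds, Y)
            losses.append(loss.item())

            Ys.append(Y)
            # Predicted class = index of the largest logit
            Y_preds.append(preds.argmax(dim=-1))

        # Join the per-batch tensors into a single tensor each
        Ys = torch.cat(Ys)
        Y_preds = torch.cat(Y_preds)

        print("Valid Loss : {:.3f}".format(torch.tensor(losses).mean()))
        print("Valid Acc  : {:.3f}".format(accuracy_score(Ys.detach().numpy(), Y_preds.detach().numpy())))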
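
The core of the accuracy computation is small enough to check by hand. The following sketch uses NumPy and made-up logits; `accuracy_from_logits` is a hypothetical helper, not a function from the notebook or from scikit-learn. It takes the argmax over the class dimension and measures how often it matches the target.

```python
import numpy as np

def accuracy_from_logits(logits, targets):
    """Fraction of rows whose highest-scoring class equals the target (hypothetical helper)."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.mean(predictions == targets))

# Four made-up examples with four class scores each.
logits = np.array([[2.0, 0.1, 0.3, 0.0],   # predicts class 0
                   [0.2, 1.5, 0.1, 0.4],   # predicts class 1
                   [0.9, 0.2, 0.3, 0.1],   # predicts class 0 (wrong, target is 2)
                   [0.0, 0.1, 0.2, 3.0]])  # predicts class 3
targets = np.array([0, 1, 2, 3])

acc = accuracy_from_logits(logits, targets)
print(acc)  # 0.75
```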

        

In [24]:
from tqdm import tqdm

def train(model, loss_fn, optimizer, epochs=10):
    # Train on the global train_loader and report validation metrics after every epoch
    for i in range(1, epochs+1):
        losses = []
        for X, Y in tqdm(train_loader):
            Y_preds = model(X)
            loss = loss_fn(Y_preds, Y)
            losses.append(loss.item())

            # Reset the gradients, backpropagate, and update the weights
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
        accuracy(model,loss_fn, val_loader)

In [25]:
epochs = 15
lr = 1e-3
# Instantiate the loss function, the model, and the optimizer
loss_fn = nn.CrossEntropyLoss()
rnn_model = RNN_Model_1(EMBEDDING_SIZE, NEURONS, LAYERS, NUM_CLASSES)
optimizer = torch.optim.Adam(rnn_model.parameters(), lr=lr)


In [26]:
train(rnn_model, loss_fn=loss_fn, optimizer=optimizer,  epochs=epochs)

100%|██████████| 422/422 [00:54<00:00,  7.71it/s]


Train Loss : 1.053
Valid Loss : 0.776
Valid Acc  : 0.639


100%|██████████| 422/422 [00:48<00:00,  8.71it/s]


Train Loss : 0.574
Valid Loss : 0.459
Valid Acc  : 0.834


100%|██████████| 422/422 [00:49<00:00,  8.57it/s]


Train Loss : 0.371
Valid Loss : 0.393
Valid Acc  : 0.866


100%|██████████| 422/422 [00:52<00:00,  8.01it/s]


Train Loss : 0.298
Valid Loss : 0.362
Valid Acc  : 0.877


100%|██████████| 422/422 [00:56<00:00,  7.47it/s]


Train Loss : 0.252
Valid Loss : 0.364
Valid Acc  : 0.878


100%|██████████| 422/422 [00:49<00:00,  8.55it/s]


Train Loss : 0.219
Valid Loss : 0.351
Valid Acc  : 0.889


100%|██████████| 422/422 [01:01<00:00,  6.91it/s]


Train Loss : 0.189
Valid Loss : 0.353
Valid Acc  : 0.891


100%|██████████| 422/422 [00:57<00:00,  7.35it/s]


Train Loss : 0.166
Valid Loss : 0.343
Valid Acc  : 0.894


100%|██████████| 422/422 [00:58<00:00,  7.17it/s]


Train Loss : 0.149
Valid Loss : 0.367
Valid Acc  : 0.891


100%|██████████| 422/422 [00:48<00:00,  8.68it/s]


Train Loss : 0.130
Valid Loss : 0.359
Valid Acc  : 0.892


100%|██████████| 422/422 [00:48<00:00,  8.70it/s]


Train Loss : 0.117
Valid Loss : 0.419
Valid Acc  : 0.890


100%|██████████| 422/422 [00:49<00:00,  8.60it/s]


Train Loss : 0.103
Valid Loss : 0.390
Valid Acc  : 0.892


100%|██████████| 422/422 [00:54<00:00,  7.78it/s]


Train Loss : 0.094
Valid Loss : 0.412
Valid Acc  : 0.892


100%|██████████| 422/422 [00:48<00:00,  8.68it/s]


Train Loss : 0.085
Valid Loss : 0.423
Valid Acc  : 0.889


100%|██████████| 422/422 [01:03<00:00,  6.61it/s]


Train Loss : 0.078
Valid Loss : 0.431
Valid Acc  : 0.890
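
The log above shows the training loss falling every epoch while the validation loss stops improving partway through. A short script makes that explicit. The numbers below are copied from the log; picking the epoch with the lowest validation loss is a common heuristic for choosing a checkpoint, and it is an assumption of this sketch rather than something the notebook does.

```python
# Per-epoch values copied from the training log (epochs 1 to 15).
train_loss = [1.053, 0.574, 0.371, 0.298, 0.252, 0.219, 0.189, 0.166,
              0.149, 0.130, 0.117, 0.103, 0.094, 0.085, 0.078]
val_loss = [0.776, 0.459, 0.393, 0.362, 0.364, 0.351, 0.353, 0.343,
            0.367, 0.359, 0.419, 0.390, 0.412, 0.423, 0.431]
val_acc = [0.639, 0.834, 0.866, 0.877, 0.878, 0.889, 0.891, 0.894,
           0.891, 0.892, 0.890, 0.892, 0.892, 0.889, 0.890]

# Epoch (1-based) with the lowest validation loss.
best_index = val_loss.index(min(val_loss))
best_epoch = best_index + 1

# Does the training loss decrease at every single epoch?
train_always_improves = all(a > b for a, b in zip(train_loss, train_loss[1:]))

print(best_epoch, val_loss[best_index], val_acc[best_index])  # 8 0.343 0.894
print(train_always_improves)                                  # True
```

A training loss that keeps shrinking while the validation loss drifts upward after epoch 8 is the usual sign of overfitting, so fewer epochs would likely give a similar validation accuracy.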


In [34]:
# Evaluate on the test set; accuracy() prints its results itself and
# labels them "Valid" for any loader
accuracy(rnn_model, loss_fn, test_loader)

Valid Loss : 0.421
Valid Acc  : 0.894


In [None]:
def sample_text(model, loader):
    pass

In [None]:
sample_text(rnn_model, test_loader)
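
One possible shape for the sampling function is sketched below. It is a minimal, self-contained illustration and not the notebook's solution: `classify_text`, `toy_encode`, `toy_vocab`, and `ToyClassifier` are all hypothetical names, the toy model is untrained, and the function takes a raw string plus an `encode` callable rather than the `(model, loader)` signature of the stub above. It applies the same padding rule as `collate_batch` and returns the most probable label with its softmax probability.

```python
import torch
from torch import nn

LABELS = ["World", "Sports", "Business", "Sci/Tech"]
MAX_TOKENS = 50

def classify_text(model, text, encode, labels=LABELS, max_tokens=MAX_TOKENS):
    """Classify one string; `encode` maps a string to a list of token indices."""
    indices = encode(text)[:max_tokens]
    indices = indices + [0] * (max_tokens - len(indices))  # right-pad with index 0
    batch = torch.tensor([indices], dtype=torch.long)      # shape (1, max_tokens)
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=-1)[0]
    best = int(probs.argmax())
    return labels[best], float(probs[best])

# Toy stand-ins so the sketch runs on its own (assumptions, not the notebook's objects).
toy_vocab = {"<unk>": 0, "goal": 1, "match": 2, "stocks": 3, "market": 4}

def toy_encode(text):
    return [toy_vocab.get(tok, 0) for tok in text.lower().split()]

class ToyClassifier(nn.Module):
    """Untrained model: averages the token embeddings and applies a linear layer."""
    def __init__(self, vocab_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 8)
        self.linear = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.linear(self.embedding(x).mean(dim=1))

torch.manual_seed(0)
toy_model = ToyClassifier(len(toy_vocab), len(LABELS))
label, confidence = classify_text(toy_model, "Stocks rally as market opens", toy_encode)
print(label, round(confidence, 3))
```

With the notebook's objects, the call would presumably look like `classify_text(rnn_model, some_text, lambda t: vocab(tokeniser(t)), labels, max_tokens)`; the prediction printed by the toy model is meaningless because its weights are random.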

In [None]:
# create confusion matrix
pass
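
A confusion matrix is a table of counts in which row `i`, column `j` holds the number of examples of true class `i` that were predicted as class `j`. The sketch below builds one with NumPy from made-up labels; `confusion_matrix_counts` is a hypothetical helper written for illustration and is not the notebook's solution.

```python
import numpy as np

def confusion_matrix_counts(y_true, y_pred, num_classes):
    """Count (true, predicted) pairs; rows are true classes, columns are predictions."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    # np.add.at accumulates correctly even when the same (row, column) pair repeats.
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Made-up labels: two examples per class, two of them misclassified.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 2, 1, 1, 2, 3, 3, 3])

cm = confusion_matrix_counts(y_true, y_pred, num_classes=4)
# Recall per class: correct predictions divided by the number of true examples.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall)  # [0.5 1.  0.5 1. ]
```

In the notebook, `y_true` and `y_pred` would be gathered over `test_loader` the same way the `accuracy` function collects `Ys` and `Y_preds`. The arrays can then be passed to `sklearn.metrics.confusion_matrix`, or to `skplt.metrics.plot_confusion_matrix` from the scikit-plot package imported earlier, to get a plotted version.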