#### We use the IMDB dataset defined at https://pytorch.org/text/stable/datasets.html#imdb for binary sentiment classification

Number of classes = 2

Number of training samples = 25000

Number of testing samples = 25000

In [1]:
import torch
from torchtext.datasets import IMDB
import numpy as np

train_iter = IMDB(split='train')

In [2]:
# iter() keeps this working when IMDB returns an iterable rather than an iterator
next(iter(train_iter))

('neg',
 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [3]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

In [4]:
tokenizer = get_tokenizer('basic_english')
train_iter = IMDB(split='train')


In [5]:
#build_vocab_from_iterator

In [6]:
def yield_tokens(data_iter):
    for _,text in data_iter:
        yield tokenizer(text)


In [7]:
yield_tokens(train_iter)

<generator object yield_tokens at 0x7f17266a82a0>

In [8]:
## this returns an object of type torchtext.vocab.vocab.Vocab
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])


In [9]:
vocab.set_default_index(vocab["<unk>"])

In [10]:
vocab(['her','rohit'])

[46, 88425]

In [11]:
## define the two functions that convert raw text and labels to integers
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: 0 if x == 'neg' else 1


In [12]:
text_pipeline('here is the an example')

[131, 9, 1, 40, 464]

In [13]:
label_pipeline('pos')

1

In [14]:
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    # offsets starts at 0 and collects the length of each review
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # running sum of the lengths (last one dropped) gives each review's start index
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    # all reviews are concatenated into one flat tensor of token ids
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = IMDB(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False,collate_fn=collate_batch)


In [15]:
tt=next(iter(dataloader))
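
To make the offsets returned by `collate_batch` concrete, here is a minimal plain-Python sketch of the same bookkeeping. The token ids are made up and the helper name `flatten_with_offsets` is hypothetical; it only mirrors the running sum over all lengths except the last.

```python
from itertools import accumulate

def flatten_with_offsets(sequences):
    """Concatenate sequences and record where each one starts (hypothetical helper)."""
    flat = [tok for seq in sequences for tok in seq]
    lengths = [len(seq) for seq in sequences]
    # Same idea as torch.tensor(offsets[:-1]).cumsum(dim=0) in collate_batch:
    # prepend 0, drop the last length, then take the running sum.
    offsets = list(accumulate([0] + lengths[:-1]))
    return flat, offsets

# Toy token ids (illustrative only, not taken from the real vocabulary)
batch = [[131, 9, 1], [40, 464], [7, 8, 9, 10]]
flat, offsets = flatten_with_offsets(batch)
print(flat)
print(offsets)
```

Each offset is the index in the flat list where a review begins, so review `i` spans `flat[offsets[i]:offsets[i+1]]` (or runs to the end for the last review).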

In [16]:
from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
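
As a rough illustration of what the embedding layer does with `text` and `offsets`, the sketch below averages the embedding rows of each review in plain Python (`nn.EmbeddingBag` uses mean pooling by default). The helper `mean_pool_bags`, the tiny embedding table, and the token ids are all assumptions for illustration, and the sketch assumes every review has at least one token.

```python
def mean_pool_bags(embeddings, token_ids, offsets):
    """Average the embedding rows of each bag (hypothetical helper, plain Python)."""
    pooled = []
    # Append the total length so every bag has an explicit end index.
    bounds = list(offsets) + [len(token_ids)]
    for start, end in zip(bounds[:-1], bounds[1:]):
        rows = [embeddings[i] for i in token_ids[start:end]]
        dim = len(rows[0])
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled

# Toy 4-word vocabulary with 2-dimensional embeddings (made-up values)
embeddings = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
token_ids = [1, 2, 3, 3]   # two reviews flattened together: [1, 2] and [3, 3]
offsets = [0, 2]
pooled = mean_pool_bags(embeddings, token_ids, offsets)
print(pooled)
```

The result has one fixed-size vector per review regardless of review length, which is what lets a single linear layer follow the embedding.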


In [17]:
import time

def train(dataloader):
    ##put the model in training mode
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    ## because the dataloader uses collate_batch, each batch yields 3 outputs:
    ## labels, text (as concatenated token ids) and offsets marking where each review starts
    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        ##loss calculation criteria
        loss = criterion(predicted_label, label)
        ##backprop
        loss.backward()
        ## clip the gradient norm to 0.1 to guard against exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    ##put the model in inference mode
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count
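
The gradient clipping step in `train` rescales the gradients so that their overall L2 norm does not exceed the given maximum (0.1 here). The plain-Python sketch below shows that arithmetic on a flat list of numbers. The helper `clip_by_global_norm` and its `eps` value are assumptions for illustration, not PyTorch's implementation.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale factor is capped at 1 so small gradients are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]  # L2 norm is 5.0, far above the limit
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.1)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# Gradients already below the limit pass through untouched.
small, _ = clip_by_global_norm([0.03, 0.04], max_norm=0.1)
print(small)
```

The direction of the gradient is preserved; only its length is reduced, which is why clipping limits the size of each update without changing where it points.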

In [18]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
BATCH_SIZE = 64 # batch size for training
vocab_size=len(vocab)
emsize=64
# use a fresh iterator; train_iter was already partly consumed by the DataLoader above
num_class=len(set([label for (label, text) in IMDB(split='train')]))


model = TextClassificationModel(vocab_size, emsize, num_class).to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
train_iter, test_iter = IMDB()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])


In [19]:
train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)


Because there are 2 classes, random guessing gives an accuracy of about 50%. We need to reach at least 50 + 0.5 × 50 = 75% accuracy on the testing samples.

In [20]:
for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

-----------------------------------------------------------
| end of epoch   1 | time:  3.29s | valid accuracy    0.678 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   2 | time:  3.28s | valid accuracy    0.808 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   3 | time:  3.57s | valid accuracy    0.766 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   4 | time:  3.43s | valid accuracy    0.835 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   5 | time:  3.29s | valid accuracy    0.850 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   6 | time:  3.28s |
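
The loop above only calls `scheduler.step()` when validation accuracy falls below the stored value, and with `StepLR(optimizer, 1.0, gamma=0.1)` each such call multiplies the learning rate by 0.1. The sketch below replays that rule in plain Python on the validation accuracies printed for epochs 1 to 5. The function `simulate_lr` is a hypothetical helper, not part of the notebook.

```python
def simulate_lr(valid_accuracies, lr=5.0, gamma=0.1):
    """Replay the loop's rule: decay lr by gamma when validation accuracy drops."""
    best = None
    history = []
    for acc in valid_accuracies:
        if best is not None and best > acc:
            lr *= gamma   # what scheduler.step() does with step_size=1
        else:
            best = acc    # updated only when accuracy does not drop, so it tracks the best so far
        history.append(lr)
    return history

# Validation accuracies printed for epochs 1-5 above
lr_history = simulate_lr([0.678, 0.808, 0.766, 0.835, 0.850])
print(lr_history)
```

Under this rule the only decay in the first five epochs happens at epoch 3, where the accuracy dips from 0.808 to 0.766.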

In [21]:
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

Checking the results of test dataset.
test accuracy    0.850


### Here we collect the predictions needed to compute accuracy, precision, recall and F1 score
This is a binary classification problem.

In [22]:
model.eval()
predictedLabels=[]
trueLabels=[]
with torch.no_grad():
    for idx, (label, text, offsets) in enumerate(test_dataloader):
        predicted_label = model(text, offsets)
        predictedLabels.append(predicted_label.argmax(1))
        trueLabels.append(label)
        

In [23]:
## the results are a list of per-batch tensors, so merge them into a single flat list
import itertools
flatTrueLabels=list(itertools.chain.from_iterable(trueLabels))
flatPredictedLabels=list(itertools.chain.from_iterable(predictedLabels))

In [24]:
# number of labels to evaluate
len(flatPredictedLabels)

25000

In [25]:
# move everything to the CPU, because we plan to use sklearn to calculate the metrics
# and sklearn expects the data to be on the CPU
flatPredictedLabels=torch.stack(flatPredictedLabels).cpu()
flatTrueLabels=torch.stack(flatTrueLabels).cpu()

In [26]:
from sklearn.metrics import precision_recall_fscore_support
(prec,recall,f1,_)=precision_recall_fscore_support(flatTrueLabels, flatPredictedLabels, average='binary', zero_division=0)

In [27]:
print("Precision=",prec,"Recall=",recall,"F1=",f1)

Precision= 0.8434387010745941 Recall= 0.86024 F1= 0.8517565052081271


The same metrics can be computed without the sklearn API,
using the formulas:

* `precision = TP/(TP+FP)`

* `accuracy = (TP+TN)/(all samples)`

* `recall = TP/(TP+FN)`

* `F1 = 2*(Recall*Precision)/(Recall+Precision)`

In [28]:
##where label 1 ('pos') is treated as the positive class
TP=sum((flatPredictedLabels==1) & (flatPredictedLabels==flatTrueLabels))##predicted positive, and correct
FP=sum((flatPredictedLabels==1) & (flatPredictedLabels!=flatTrueLabels))##predicted positive, but the true label is negative
FN=sum((flatPredictedLabels==0) & (flatPredictedLabels!=flatTrueLabels))##predicted negative, but the true label is positive
TN=sum((flatPredictedLabels==0) & (flatPredictedLabels==flatTrueLabels))##predicted negative, and correct
print(TP,FP,FN,TN)

tensor(10753) tensor(1996) tensor(1747) tensor(10504)


In [33]:
accuracy_manual=(TP+TN)/len(flatTrueLabels)
precision_manual=(TP)/(TP+FP)
recall_manual=TP/(TP+FN)
F1_manual=2/((1/precision_manual) + (1/recall_manual))
print("Accuracy= ",accuracy_manual.item(),"Precision=",precision_manual.item(),"Recall=",recall_manual.item(),"F1=",F1_manual.item())


Accuracy=  0.8502799868583679 Precision= 0.8434386849403381 Recall= 0.8602399826049805 F1= 0.8517564535140991
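
As a cross-check, the same formulas can be applied in plain Python to the four counts printed above (TP=10753, FP=1996, FN=1747, TN=10504). The function name `binary_metrics` is a hypothetical helper. Because Python floats are 64-bit, the trailing digits may differ slightly from the tensor-based results, which appear to be computed in 32-bit precision.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts taken from the printed TP, FP, FN, TN values above
accuracy, precision, recall, f1 = binary_metrics(tp=10753, fp=1996, fn=1747, tn=10504)
print("Accuracy=", accuracy, "Precision=", precision, "Recall=", recall, "F1=", f1)
```

The harmonic-mean form used in the notebook, `2/((1/precision) + (1/recall))`, is algebraically the same as the `2*P*R/(P+R)` form used here.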
