# Review Classification

## Batching and Model Training

### Import Libraries

In [1]:
import pandas as pd # Loading data
import numpy as np
import warnings 
import re # text matching
from collections import Counter # for vocabulary
from sklearn.model_selection import train_test_split # train test splits

warnings.filterwarnings('ignore')

### Data Loading and Processing

We first do all the necessary preprocessing before creating batches and training the model. The steps are explained in the notebook named `Text Cleaning.ipynb`.

In [2]:
# Read dataset
data = pd.read_csv("Reviews.csv")
# Drop duplicate reviews
new_data = data.drop_duplicates(subset=['UserId', 'ProfileName', 'Time', 'Text'])
# Keep only the useful columns (copy so later assignments do not touch a view of new_data)
useful_data = new_data[['Text', 'Score']].copy()
# Approximate the length of each review by whitespace splitting (no tokenizer yet)
useful_data['sudo_length'] = useful_data.Text.str.split().str.len()
# Filter examples by length
useful_data = useful_data[(useful_data.sudo_length > 20) & (useful_data.sudo_length < 100)]
# Remove length column
useful_data = useful_data.drop(['sudo_length'], axis = 1)
# print 5 rows
useful_data.head()

| | Text | Score |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 5 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 1 |
| 2 | This is a confection that has been around a fe... | 4 |
| 3 | If you are looking for the secret ingredient i... | 2 |
| 4 | Great taffy at a great price. There was a wid... | 5 |


### Tokenizing and Creating vocabulary
Now it's time to tokenize the text and create our vocabulary. We apply the `TextProcessor` class to the data splits.

In [3]:
class TextProcessor:
    def __init__(self):
        self.vocab_dict = dict({"<unk>" : 0, "<pad>" : 1})
        self.counter = Counter()

    def tokenize(self, sent):
        if sent.endswith("."):
            sent = sent[:-1]
        new_x = re.sub('<.*?>', ' ', sent)
        # Raw strings avoid invalid-escape warnings for \s and \W
        new_x = re.sub(r'\s\s+', ' ', new_x)
        new_x = re.sub(r'\W\s', ' ', new_x)
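        # Replace runs of 2+ non-word characters that follow a word character;
        # the lookbehind keeps that preceding character instead of deleting it
        new_x = re.sub(r'(?<=\w)\W{2,}', ' ', new_x)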
        new_x = new_x.lower().split()
        return new_x
        
    def processDataset(self, sent):
        tokens = self.tokenize(sent)
        # Accumulate token frequencies for the vocabulary
        self.counter.update(tokens)
        return len(tokens)
        
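    def build_vocab(self, num_most_common_to_use=10000):
        # Two slots are already taken by <unk> and <pad>
        words = self.counter.most_common(num_most_common_to_use - 2)
        # Iterate over the words actually seen, so a small corpus cannot raise IndexError
        for word, _ in words:
            self.vocab_dict[word] = len(self.vocab_dict)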
            
    def tokenize_and_return_length(self, sent):
        tokens = self.tokenize(sent)
        return len(tokens)
            
    def process(self, sent):
        tokens = self.tokenize(sent)
        processed = []
        for val in tokens:
            processed.append(self.vocab_dict.get(val, self.vocab_dict["<unk>"]))
            
        return processed
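
To make the vocabulary and encoding steps concrete, here is a minimal standalone sketch of the same idea on a toy corpus. The corpus, `max_size`, and the `encode` helper are hypothetical and exist only for illustration; the sketch assumes the text has already been tokenized.

```python
from collections import Counter

# Toy corpus, already tokenized (hypothetical data for illustration)
corpus = [
    ["great", "taffy", "great", "price"],
    ["great", "price", "fast", "shipping"],
]

# Count token frequencies across the corpus
counter = Counter()
for tokens in corpus:
    counter.update(tokens)

# Reserve index 0 for unknown words and index 1 for padding
vocab = {"<unk>": 0, "<pad>": 1}
max_size = 4  # total vocabulary size, including the two special tokens
for word, _ in counter.most_common(max_size - 2):
    vocab[word] = len(vocab)

def encode(tokens, vocab):
    """Map tokens to indices, falling back to <unk> for unseen words."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

encoded = encode(["great", "price", "slow", "shipping"], vocab)
print(vocab)    # {'<unk>': 0, '<pad>': 1, 'great': 2, 'price': 3}
print(encoded)  # [2, 3, 0, 0]
```

Words outside the most frequent set ("slow", "shipping") collapse to index 0, which is why zeros appear in the encoded reviews.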

#### Create Train and Test sets

In [4]:
train, test = train_test_split(useful_data, test_size = 0.2)

Run the text processor on the training set to build the vocabulary. We also add a column holding the number of tokens in each review, which is used later when creating batches.

In [5]:
textprocessor = TextProcessor()
train['length'] = train.Text.apply(textprocessor.processDataset)
textprocessor.build_vocab()

train.head()

| | Text | Score | length |
|---|---|---|---|
| 429048 | I should have bought Goji berries by themselve... | 3 | 87 |
| 46813 | Bakery on Main has another hit on their hands ... | 5 | 27 |
| 120575 | despite a lot of negative reviews, I still bou... | 5 | 56 |
| 48214 | Love the taste. Dont know why this drink has s... | 5 | 23 |
| 191664 | I was pleasantly surprised to see these were a... | 5 | 78 |


In [6]:
test['length'] = test.Text.apply(textprocessor.tokenize_and_return_length)

test.head()

| | Text | Score | length |
|---|---|---|---|
| 255342 | Read the description carefully. Made by Spang... | 1 | 34 |
| 444742 | This producy works !!! Period......They were h... | 5 | 23 |
| 541879 | I really only wanted to purchase a few tins. U... | 4 | 92 |
| 508821 | `<a href="http://www.amazon.com/gp/product/B000...` | 1 | 39 |
| 384133 | I have been using all the whey low products fo... | 5 | 35 |


### Batching and Data Loader creation
We are going to use `PyTorch` to train an `LSTM Model` for classifying reviews. Before creating the model, we first need to create a dataloader, so that we can conveniently pass our training and testing examples to the model.

First we will create a *Custom PyTorch* dataset class which preprocesses our examples and converts them into lists of indices corresponding to the vocabulary we just created.

In [45]:
import torch
from torch.utils.data import Dataset

In [8]:
class ReviewDataset(Dataset):
    def __init__(self, df, processor):
        self.data = df
        self.data = self.data.sort_values(by='length')
        self.tprocess = processor
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        text = row.Text
        label = row.Score
        
        text = self.tprocess.process(text)
        
        return (text, label)

As you can see, our dataset class sorts the dataframe by the length of each review. This lets us create batches with minimum padding, as we will see later when creating batches.
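
A rough way to see the effect of sorting is to count how many pad tokens are needed with and without it. The sketch below uses hypothetical token counts and a hypothetical `total_padding` helper, and assumes each batch is padded to its longest example.

```python
def total_padding(lengths, batch_size):
    """Count pad tokens needed when each batch is padded to its longest example."""
    padding = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        padding += sum(longest - n for n in batch)
    return padding

# Hypothetical token counts for eight reviews
lengths = [5, 50, 7, 48, 6, 52, 8, 49]

unsorted_padding = total_padding(lengths, batch_size=4)
sorted_padding = total_padding(sorted(lengths), batch_size=4)

print("Padding without sorting:", unsorted_padding)  # 183
print("Padding with sorting:   ", sorted_padding)    # 15
```

With these numbers, sorting cuts the padding from 183 tokens to 15, because short and long reviews no longer share a batch.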

In [9]:
dataset = ReviewDataset(train, textprocessor)

print("Sample Data : ")
print(dataset[0])

Sample Data : 
([67, 0, 82, 0], 5)


### Creating Dataloader

#### Import Dataloader class

In [10]:
from torch.utils.data import DataLoader

#### Custom batch formation class
Since our dataset contains all the examples in sorted order and the dataloader does not shuffle them, each batch we get from the dataloader has its longest sentence at the end of the batch list. In the batch collator class, we first create an array of size `(batch_size, seq_len)`, where `seq_len` is the length of the last sentence received in the batch.

Because the examples are sorted, each batch contains examples of nearly equal length, so the padding required within a batch is minimal.

In [11]:
class MyCollator(object):
    def __init__(self, pad_token = 1):
        self.pad = pad_token
    def __call__(self, batch):
        batch_size = len(batch)
        seq_len = len(batch[-1][0])
        # np.long was removed from recent NumPy releases; use np.int64 and fill with the pad index
        formed = np.full((batch_size, seq_len), self.pad, dtype=np.int64)
        labels = []
        for i in range(batch_size):
            example = batch[i]
            formed[i, :len(example[0])] = example[0]
            labels.append(example[1])
            
        return torch.LongTensor(formed), torch.LongTensor(labels)
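
The collator above relies on the batch arriving in sorted order. As a minimal pure-Python sketch of the same padding step, the hypothetical helper `pad_batch` below pads to the longest example in the batch instead, so it does not depend on ordering. It returns plain lists rather than tensors and is not part of the notebook.

```python
def pad_batch(batch, pad_token=1):
    """Pad every (indices, label) example to the longest sequence in the batch."""
    seq_len = max(len(indices) for indices, _ in batch)
    padded = [indices + [pad_token] * (seq_len - len(indices)) for indices, _ in batch]
    labels = [label for _, label in batch]
    return padded, labels

# Hypothetical examples in the (indices, label) format returned by the dataset
batch = [([67, 0, 82, 0], 5), ([464, 785], 4), ([3, 26, 6], 1)]
padded, labels = pad_batch(batch)
print(padded)  # [[67, 0, 82, 0], [464, 785, 1, 1], [3, 26, 6, 1]]
print(labels)  # [5, 4, 1]
```

Taking the maximum rather than the last element would also keep working if the dataloader were later configured to shuffle.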

In [40]:
BATCH_SIZE = 64

collator = MyCollator()
dloader = DataLoader(dataset, batch_size=BATCH_SIZE, collate_fn=collator)

#### Example batch

In [13]:
batch = next(iter(dloader))
print("Examples : ")
print(batch[0])
print("Labels : ")
print(batch[1])

Examples : 
tensor([[  67,    0,   82,    0,    1,    1,    1,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1],
        [ 464,  785,   29,    4,  228,   91, 1938,    1,    1,    1,    1,    1,
            1,    1,    1,    1,    1,    1],
        [ 383,   56,  204,   40,   28,    0,   48,   63,  102,    8, 1034,  741,
            1,    1,    1,    1,    1,    1],
        [   0,  111,  376,   10,   96,  225,   48, 1419,   63,  102,  666,    9,
            0,    1,    1,    1,    1,    1],
        [5215,    0, 3099,    0, 1665, 8146, 2477,    0,    2,  185,  519,    7,
          778,  282,  197,    1,    1,    1],
        [   3,   26,    6,   31,   68,  712,   77,  149,    3,    0,   26,    7,
           95,   60,  843,    1,    1,    1],
        [   0,  235,  152, 2126,   48,   36,  213,   51,    3,  199,   53,  572,
          257, 4901, 1674,    1,    1,    1],
        [   0,  266,  243,   12,  142,  716,    4,   31,  364,   54,  181,   63,
           46,   3

### Calculating class weights

In [14]:
from sklearn.utils.class_weight import compute_class_weight

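# Recent scikit-learn versions require keyword arguments and an array of classes
weights = compute_class_weight(class_weight='balanced', classes=np.array(sorted(train.Score.unique())), y=train.Score)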
for cat in sorted(train.Score.unique()):
    print("Class {} : {:.5f}".format(cat, weights[cat - 1]))

Class 1 : 2.16915
Class 2 : 3.98303
Class 3 : 2.95403
Class 4 : 1.54162
Class 5 : 0.30296
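
The `'balanced'` mode weights each class by `n_samples / (n_classes * class_count)`, so rare classes get larger weights. Here is a small pure-Python sketch of that formula on hypothetical labels; `balanced_class_weights` is an illustrative helper, not a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}

# Hypothetical, imbalanced labels: six 5-star, two 1-star and two 3-star reviews
labels = [5] * 6 + [1] * 2 + [3] * 2
weights_by_class = balanced_class_weights(labels)
for cls, w in weights_by_class.items():
    print("Class {} : {:.5f}".format(cls, w))
```

In this toy case the majority class (5 stars) gets a weight below 1 and the two minority classes get a weight above 1, the same pattern as in the output above.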


Now we are ready to start creating our model for Classification!!

### Import PyTorch modules

In [50]:
import torch.nn as nn
import torch.optim as opt

from tqdm import tqdm
from sklearn.metrics import confusion_matrix

#### Creating Base Model

In [51]:
class BaseModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_classes):
        super(BaseModel, self).__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.hidden_size = hidden_size
        self.cell = nn.LSTM(embedding_dim, hidden_size, batch_first = True)
        self.linear = nn.Linear(hidden_size, num_classes)
        self.soft = nn.Softmax(dim=1)
        
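    def forward(self, x, hstate = None):
        if hstate is None:
            hstate = self.init_hidden(self.hidden_size, x.shape[0])
            
        cell_out, _ = self.cell(self.embedding(x), hstate)
        
        # Use the LSTM output at the last time step as the sequence representation
        out = self.linear(cell_out[:, -1, :])
        
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax itself.
        # Apply self.soft(out) only when probabilities are needed at inference.
        return out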
            
    def init_hidden(self, hidden_size, bs):
        return (torch.zeros(1, bs, hidden_size, device=device), torch.zeros(1, bs, hidden_size, device=device))
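
One detail worth keeping in mind when pairing a model with `nn.CrossEntropyLoss` is that the loss applies log-softmax to its input, so it expects raw logits rather than probabilities. The pure-Python sketch below illustrates why that matters for a single example. It ignores class weights, and the helper names and logit values are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    """Unweighted cross-entropy for one example: softmax, then negative log."""
    return -math.log(softmax(scores)[target])

# Hypothetical, very confident and correct logits for a 3-class problem
logits = [10.0, 0.0, 0.0]

loss_from_logits = cross_entropy(logits, target=0)
# Passing probabilities instead applies softmax a second time inside the loss
loss_from_probs = cross_entropy(softmax(logits), target=0)

# Even a perfect one-hot probability vector cannot push the loss below this floor
floor = cross_entropy([1.0, 0.0, 0.0], target=0)

print("{:.5f} {:.5f} {:.5f}".format(loss_from_logits, loss_from_probs, floor))
```

With logits the loss can approach zero. With probabilities as input it stays above `log(1 + 2/e)`, roughly 0.55 for three classes, however confident the model is, which weakens the training signal.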

Creating evaluation metrics

In [52]:
class ClassificationMetrics:
    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.classes = list(range(num_classes))
        self.epsilon = 1e-12
        # Start from zeros, as reset() does; epsilon is only needed in the denominators
        self.cmatrix = np.zeros((num_classes, num_classes), dtype = np.float64)
        
        self.total_correct = 0
        self.total_examples = 0
        
    def update(self, pred, truth):
        pred = pred.cpu()
        truth = truth.cpu()
        
        _, idx = pred.topk(1)
        truth = truth.view(-1, 1)
        
        self.total_examples += len(truth)
        self.total_correct += (idx == truth).sum().item()
        
        val = confusion_matrix(truth, idx, labels=self.classes)
        
        self.cmatrix = self.cmatrix + val
        
        
    def precision_score(self):
        scores = {}
        for i in range(self.num_classes):
            scores[i] = self.cmatrix[i, i] / (sum(self.cmatrix[:, i]) + self.epsilon)
        
        return scores
    
    def recall_score(self):
        scores = {}
        for i in range(self.num_classes):
            scores[i] = self.cmatrix[i, i] / (sum(self.cmatrix[i, :]) + self.epsilon)
        
        return scores
    
    def scores(self, return_type = 'f1'):
        pscores = self.precision_score()
        rscores = self.recall_score()
        scores = {}
        for i in range(self.num_classes):
            if(pscores[i] == 0 and rscores[i] == 0):
                scores[i] = 0
            else:
                # F1 = 2PR / (P + R); epsilon guards the denominator only
                scores[i] = 2 * (pscores[i] * rscores[i]) / (pscores[i] + rscores[i] + self.epsilon)
            
        if return_type == 'f1':
            return scores
        elif return_type == 'all':
            all_scores = list(zip(pscores.values(), rscores.values(), scores.values()))
            t = {}
            for i in range(self.num_classes):
                t[i] = all_scores[i]
                
            return t
        else:
            raise Exception("Invalid argument for return type")
            
    def accuracy_score(self):
        return self.total_correct / self.total_examples
    
    def reset(self):
        self.total_correct = 0
        self.total_examples = 0
        self.cmatrix = np.zeros((self.num_classes, self.num_classes))
            
    def print_report(self):
        all_scores = self.scores('all')
        print("{:^15}\t{:^15}\t{:^15}\t{:^15}".format("Class", "Precision", "Recall", "F1-score"))
        for c, values in all_scores.items():
            print("{:^15}\t{:^15.3f}\t{:^15.3f}\t{:^15.3f}".format(c, values[0], values[1], values[2]))
            
        # accuracy_score() returns a fraction, so scale it to match the % sign
        print("Accuracy : {:.5f} %".format(self.accuracy_score() * 100))
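
As a worked example of the per-class scores, the sketch below computes precision, recall and F1 from a small hypothetical confusion matrix. It assumes the layout used by `sklearn.metrics.confusion_matrix`, where rows are true classes and columns are predicted classes. `per_class_scores` is an illustrative helper, not part of the class above.

```python
def per_class_scores(cmatrix):
    """Return {class: (precision, recall, f1)} for a square confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    n = len(cmatrix)
    result = {}
    for i in range(n):
        true_pos = cmatrix[i][i]
        predicted_i = sum(cmatrix[r][i] for r in range(n))  # column sum
        actual_i = sum(cmatrix[i])                          # row sum
        precision = true_pos / predicted_i if predicted_i else 0.0
        recall = true_pos / actual_i if actual_i else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        result[i] = (precision, recall, f1)
    return result

# Hypothetical 2-class confusion matrix
cmatrix = [[5, 1],
           [2, 2]]
report = per_class_scores(cmatrix)
for cls, (p, r, f1) in report.items():
    print("{} {:.3f} {:.3f} {:.3f}".format(cls, p, r, f1))
```

For class 0 this gives precision 5/7, recall 5/6 and F1 10/13. For class 1 it gives precision 2/3, recall 1/2 and F1 4/7.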

Creating necessary variables along with our BaseModel, loss function and optimizer.

In [61]:
VOCAB_SIZE = len(textprocessor.vocab_dict)
HIDDEN_SIZE = 30
EMB_DIM = 20
NUM_CLASSES = 5
# Fall back to the CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

net = BaseModel(VOCAB_SIZE, EMB_DIM, HIDDEN_SIZE, NUM_CLASSES)
net = net.to(device)
print(net)

metrics = ClassificationMetrics(NUM_CLASSES)
criterion = nn.CrossEntropyLoss(weight=torch.FloatTensor(weights).to(device))
optim = opt.Adam(net.parameters(), lr = 0.001)

BaseModel(
  (embedding): Embedding(10000, 20)
  (cell): LSTM(20, 30, batch_first=True)
  (linear): Linear(in_features=30, out_features=5, bias=True)
  (soft): Softmax(dim=1)
)


#### Training loop

In [62]:
N_EPOCHS = 20

pltloss = []
pltacc = []
for epoch in range(N_EPOCHS):
    losses = []
    net.train()
    # Reset once per epoch so the report covers all batches, not only the last one
    metrics.reset()
    for batch in tqdm(dloader):
        optim.zero_grad()

        X, labels = batch[0].to(device), (batch[1] - 1).to(device)
        pred = net(X)
        loss = criterion(pred, labels)
        losses.append(loss.item())
        loss.backward()
        optim.step()
        metrics.update(pred, labels)
        
    print("Training Run\nEpoch : {} Loss : {:.5f}".format(epoch + 1, sum(losses) / len(losses)))
    metrics.print_report()
    pltloss.append(sum(losses) / len(losses))
    pltacc.append(metrics.accuracy_score() * 100)

 87%|███████████████████████████████████████████████████████████████████▋          | 3044/3506 [01:39<00:15, 30.66it/s]


KeyboardInterrupt: 