# NLP - Multi-Class Text Classification using RNNs

By [Akshaj Verma](https://akshajverma.com)  

This notebook takes you through the implementation of binary text classification in the form of sentiment analysis on yelp reviews using RNNs in PyTorch.

In [1]:
import re
import numpy as np
import pandas as pd
from pprint import pprint
from collections import Counter

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

%matplotlib inline

torch.manual_seed(1)

<torch._C.Generator at 0x7fa45b4c02d0>

## Prepare Data

In [2]:
df = pd.read_csv("../../../data/nlp/text_classification/bbc-text.csv")
df = df.rename(columns = {'category':'tag'})
df.head()

Unnamed: 0,tag,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


### Convert from dataframe to list

In [3]:
sentence_list = [t for t in df['text'].to_list()]
tag_list = [t for t in df['tag'].to_list()]

#### The input sentences.

In [4]:
sentence_list[:2]

['tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to hig

#### The output tags.

In [5]:
tag_list[:2]

['tech', 'business']

### Clean the input data.

In [6]:
# Convert to lowercase
sentence_list = [s.lower() for s in sentence_list]

# Remove non alphavets
regex_remove_nonalphabets = re.compile('[^a-zA-Z]')
sentence_list = [regex_remove_nonalphabets.sub(' ', s) for s in sentence_list]

# Remove words with less than 2 letters
# regex_remove_shortwords = re.compile(r'\b\w{1,2}\b')
# sentence_list = [regex_remove_shortwords.sub("", s) for s in sentence_list]

# Remove words that appear only once
c = Counter(w for s in sentence_list for w in s.split())
sentence_list = [' '.join(y for y in x.split() if c[y] > 1) for x in sentence_list]

# Strip extra whitespaces
sentence_list = [" ".join(s.split()) for s in sentence_list]

In [7]:
sentence_list[:2]

['tv future in the hands of viewers with home theatre systems plasma high definition tvs and digital video recorders moving into the living room the way people watch tv will be radically different in five years time that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite with the us leading the trend programmes and other content will be delivered to viewers via home networks through cable satellite telecoms companies and broadband service providers to front rooms and portable devices one of the most talked about technologies of ces has been digital and personal video recorders dvr and pvr these set top boxes like the us s tivo and the uk s sky system allow people to record store play pause and forward wind tv programmes when they want essentially the technology allows for much more personalised tv they are also being built in to high definition tv sets which are big b

### Create a vocab and dictionary for input.

#### Vocab for input.

In [8]:
words = []
for sentence in sentence_list:
    for w in sentence.split():
        words.append(w)
    
words = list(set(words))
print(f"Size of word-vocablury: {len(words)}\n")

Size of word-vocablury: 18636



#### Input <=> ID.

In [9]:
word2idx = {word: i for i, word in enumerate(words)}

### Create a vocab and dictionary for output.

#### Vocab for output.

In [10]:
tags = []
for tag in tag_list:
    tags.append(tag)
tags = list(set(tags))
print(f"Size of tag-vocab: {len(tags)}\n")
print(tags)

Size of tag-vocab: 5

['politics', 'tech', 'entertainment', 'business', 'sport']


#### Output <=> ID.

In [11]:
tag2idx = {word: i for i, word in enumerate(tags)}
print(tag2idx)

{'politics': 0, 'tech': 1, 'entertainment': 2, 'business': 3, 'sport': 4}


### Encode the input and output to numbers.

#### Input

In [12]:
X = [[word2idx[w] for w in s.split()] for s in sentence_list]

#### Output

In [13]:
y = [tag2idx[t] for t in tag_list]
y[:3]

[1, 3, 4]

### Train-Test Split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [15]:
print("X_train size: ", len(X_train))
print("X_test size: ", len(X_test))

X_train size:  1557
X_test size:  668


## Sample Neural Network

### Sample Parameters.

In [16]:
BATCH_SIZE_SAMPLE = 2
EMBEDDING_SIZE_SAMPLE = 5
VOCAB_SIZE = len(word2idx)
TARGET_SIZE = len(tag2idx)
HIDDEN_SIZE_SAMPLE = 3
STACKED_LAYERS_SAMPLE = 4

### Sample Dataloader.

In [17]:
class SampleData(Dataset):
    
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)

In [18]:
sample_data = SampleData(X_train, y_train)
sample_loader = DataLoader(sample_data, batch_size=BATCH_SIZE_SAMPLE, collate_fn=lambda x:x)

In [19]:
tl = iter(sample_loader)

i,j = map(list, zip(*next(tl)))

print(i,"\n\n", j, "\n")

[[14907, 9218, 15699, 1052, 2641, 8401, 334, 18346, 18407, 7297, 3470, 14480, 13709, 1495, 15055, 7794, 18084, 5795, 1346, 5795, 334, 15130, 334, 17560, 16640, 14480, 5994, 2641, 7763, 3639, 8401, 334, 17833, 16131, 11967, 16182, 334, 1442, 10192, 11798, 1495, 9484, 3126, 4132, 11870, 12269, 15213, 11967, 15510, 13077, 334, 10787, 4321, 11870, 9347, 3781, 11718, 16661, 17314, 9870, 5651, 3018, 13077, 8954, 1346, 334, 17907, 16730, 4321, 334, 8140, 4616, 11870, 5023, 14523, 2404, 11967, 10875, 2797, 16567, 4321, 13709, 2667, 334, 15251, 15447, 9245, 10525, 2428, 4321, 334, 1093, 16709, 1796, 4321, 11870, 2641, 11967, 11119, 14523, 2404, 11967, 18198, 12625, 16842, 9490, 16302, 3242, 11967, 1235, 7152, 6381, 11967, 15348, 5690, 6149, 1993, 11967, 4628, 8018, 5385, 4854, 7922, 334, 1442, 17786, 334, 4980, 11967, 4749, 7192, 4321, 3233, 4321, 13709, 8313, 15792, 11870, 15055, 7708, 18084, 5195, 11967, 9801, 9036, 9484, 2616, 4052, 1495, 13077, 15348, 10945, 15051, 16640, 18396, 10178, 1721

### Sample RNN class.

In [20]:
class ModelGRUSample(nn.Module):
    
    def __init__(self, embedding_size, vocab_size, hidden_size, target_size, stacked_layers):
        super(ModelGRUSample, self).__init__()
        
        self.word_embeddings = nn.Embedding(num_embeddings = vocab_size, embedding_dim = embedding_size)
        self.gru = nn.GRU(input_size = embedding_size, hidden_size = hidden_size, batch_first = True, num_layers=stacked_layers)
        self.linear = nn.Linear(in_features = hidden_size, out_features=target_size)
        
    def forward(self, x_batch):
        print("\nList of tensor lengths in a batch: ")
        len_list = list(map(len, x_batch))
        print(len_list)
        
        padded_batch = pad_sequence(x_batch, batch_first=True)
        print("\nPadded X_batch: \n", padded_batch, "\n")

        
        embeds = self.word_embeddings(padded_batch)
        print("\nEmbeddings:", embeds, embeds.size(), "\n")

        pack_embeds = pack_padded_sequence(embeds, lengths=len_list, batch_first=True, enforce_sorted=False)
        
        rnn_out, rnn_hidden = self.gru(pack_embeds)
        print("\nRNN hidden last layer:\n", rnn_hidden)
        
        linear_out = self.linear(rnn_hidden)
        print("\nLinear Output:\n", linear_out)
        
        y_out = torch.log_softmax(linear_out, dim = 1)
        y_out = y_out[-1]
        print("\nLogSoftmax:\n", y_out)

        
        return y_out

In [21]:
gru_model_sample = ModelGRUSample(embedding_size=EMBEDDING_SIZE_SAMPLE, vocab_size=len(word2idx), hidden_size=HIDDEN_SIZE_SAMPLE, target_size=len(tag2idx), stacked_layers=STACKED_LAYERS_SAMPLE)
print(gru_model_sample)

ModelGRUSample(
  (word_embeddings): Embedding(18636, 5)
  (gru): GRU(5, 3, num_layers=4, batch_first=True)
  (linear): Linear(in_features=3, out_features=5, bias=True)
)


### Sample Output.

output = [batch size, sent len, hid dim]  
hidden = [batch size, 1, hid dim]

In [22]:
with torch.no_grad():
    for batch in sample_loader:
        x_batch, y_batch = map(list, zip(*batch))
        x_batch = [torch.tensor(i) for i in x_batch]
        y_batch = [torch.tensor(i) for i in y_batch]
        
        
        print("X batch: ")
        pprint(x_batch)
        print("\ny batch: ")
        pprint(y_batch)
        
        y_out = gru_model_sample(x_batch)
                        
        _, y_out_tag = torch.max(y_out, dim = 1)
        print("\nY Output Tag: \n", y_out_tag)
        
        print("\nActual Output: ")
        print(y_batch)

        break

X batch: 
[tensor([14907,  9218, 15699,  1052,  2641,  8401,   334, 18346, 18407,  7297,
         3470, 14480, 13709,  1495, 15055,  7794, 18084,  5795,  1346,  5795,
          334, 15130,   334, 17560, 16640, 14480,  5994,  2641,  7763,  3639,
         8401,   334, 17833, 16131, 11967, 16182,   334,  1442, 10192, 11798,
         1495,  9484,  3126,  4132, 11870, 12269, 15213, 11967, 15510, 13077,
          334, 10787,  4321, 11870,  9347,  3781, 11718, 16661, 17314,  9870,
         5651,  3018, 13077,  8954,  1346,   334, 17907, 16730,  4321,   334,
         8140,  4616, 11870,  5023, 14523,  2404, 11967, 10875,  2797, 16567,
         4321, 13709,  2667,   334, 15251, 15447,  9245, 10525,  2428,  4321,
          334,  1093, 16709,  1796,  4321, 11870,  2641, 11967, 11119, 14523,
         2404, 11967, 18198, 12625, 16842,  9490, 16302,  3242, 11967,  1235,
         7152,  6381, 11967, 15348,  5690,  6149,  1993, 11967,  4628,  8018,
         5385,  4854,  7922,   334,  1442, 17786,   3

## Acutal Neural Network.

### Model parameters.

In [23]:
EPOCHS = 15
BATCH_SIZE = 32
EMBEDDING_SIZE = 300
VOCAB_SIZE = len(word2idx)
TARGET_SIZE = len(tag2idx)
HIDDEN_SIZE = 64
LEARNING_RATE = 0.005
STACKED_LAYERS = 2

### Data Loader.

#### Train Loader.

In [24]:
class TrainData(Dataset):
    
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)

In [25]:
train_data = TrainData(X_train, y_train)
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, collate_fn=lambda x:x)

#### Test Loader

In [26]:
class TestData(Dataset):
    
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)

In [27]:
test_data = TestData(X_test, y_test)
test_loader = DataLoader(test_data, batch_size=1, collate_fn=lambda x:x)

### LSTM Model Class.

In [28]:
class ModelLSTM(nn.Module):
    
    def __init__(self, embedding_size, vocab_size, hidden_size, target_size, stacked_layers):
        super(ModelLSTM, self).__init__()
        
        self.word_embeddings = nn.Embedding(num_embeddings = vocab_size, embedding_dim = embedding_size)
        self.lstm = nn.LSTM(input_size = embedding_size, hidden_size = hidden_size, batch_first = True, num_layers = stacked_layers, dropout = 0.2)
        self.linear = nn.Linear(in_features = hidden_size, out_features=target_size)
        self.tanh = nn.Tanh()
        
    def forward(self, x_batch):
        len_list = list(map(len, x_batch))
        padded_batch = pad_sequence(x_batch, batch_first=True)
        embeds = self.word_embeddings(padded_batch)
        pack_embeds = pack_padded_sequence(embeds, lengths=len_list, batch_first=True, enforce_sorted=False)
        rnn_out, (rnn_h, _) = self.lstm(pack_embeds)
        linear_out = self.linear(self.tanh(rnn_h))
        y_out = linear_out[-1]
        
        return y_out

In [29]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


In [30]:
lstm_model = ModelLSTM(embedding_size=EMBEDDING_SIZE, vocab_size=len(word2idx), hidden_size=HIDDEN_SIZE, target_size=len(tag2idx), stacked_layers=STACKED_LAYERS)

lstm_model.to(device)
print(lstm_model)

criterion = nn.CrossEntropyLoss()

optimizer =  optim.Adam(lstm_model.parameters())

ModelLSTM(
  (word_embeddings): Embedding(18636, 300)
  (lstm): LSTM(300, 64, num_layers=2, batch_first=True, dropout=0.2)
  (linear): Linear(in_features=64, out_features=5, bias=True)
  (tanh): Tanh()
)


## Train model.

In [31]:
def multi_acc(y_pred, y_test):
    y_pred_softmax = torch.log_softmax(y_pred, dim = 1)
    _, y_pred_tags = torch.max(y_pred_softmax, dim = 1)    
    
    correct_pred = (y_pred_tags == y_test).float()
    acc = correct_pred.sum() / len(correct_pred)
    
    acc = torch.round(acc) * 100
    
    return acc

In [32]:
lstm_model.train()
for e in range(1, EPOCHS+1):
    epoch_loss = 0
    epoch_acc = 0
    for batch in train_loader:
        x_batch, y_batch = map(list, zip(*batch))
        x_batch = [torch.tensor(i).to(device) for i in x_batch]
        y_batch = torch.tensor(y_batch).long().to(device)
                
        optimizer.zero_grad()
        
        y_pred = lstm_model(x_batch)        
        
        loss = criterion(y_pred.squeeze(0), y_batch)
        acc = multi_acc(y_pred.squeeze(0), y_batch)
        
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    print(f'Epoch {e+0:03}: | Loss: {epoch_loss/len(train_loader):.5f} | Acc: {acc}')

Epoch 001: | Loss: 1.59706 | Acc: 0.0
Epoch 002: | Loss: 1.44596 | Acc: 100.0
Epoch 003: | Loss: 0.90228 | Acc: 100.0
Epoch 004: | Loss: 0.42660 | Acc: 100.0
Epoch 005: | Loss: 0.23653 | Acc: 100.0
Epoch 006: | Loss: 0.13660 | Acc: 100.0
Epoch 007: | Loss: 0.05966 | Acc: 100.0
Epoch 008: | Loss: 0.02126 | Acc: 100.0
Epoch 009: | Loss: 0.01265 | Acc: 100.0
Epoch 010: | Loss: 0.00860 | Acc: 100.0
Epoch 011: | Loss: 0.00571 | Acc: 100.0
Epoch 012: | Loss: 0.00461 | Acc: 100.0
Epoch 013: | Loss: 0.00408 | Acc: 100.0
Epoch 014: | Loss: 0.00340 | Acc: 100.0
Epoch 015: | Loss: 0.00294 | Acc: 100.0


## Test Model.

In [33]:
y_out_tags_list = []
with torch.no_grad():
    for batch in test_loader:
        x_batch, y_batch = map(list, zip(*batch))
        x_batch = [torch.tensor(i).to(device) for i in x_batch]
        y_batch = torch.tensor(y_batch).long().to(device)
        
        y_pred = lstm_model(x_batch)
        _, y_pred_tag = torch.max(y_pred, dim = 1)

        y_out_tags_list.append(y_pred_tag.squeeze(0).cpu().numpy())

## Confusion Matrix.

In [34]:
print(confusion_matrix(y_test, y_out_tags_list))

[[ 78  10   9  16   2]
 [ 14  94   3  10   4]
 [ 11   7  80   7  16]
 [ 15  10   3 120   2]
 [  3   1   7   1 145]]


## Classification Report.

In [35]:
y_out_tags_list = [a.squeeze().tolist() for a in y_out_tags_list]

In [36]:
print(classification_report(y_test, y_out_tags_list))

              precision    recall  f1-score   support

           0       0.64      0.68      0.66       115
           1       0.77      0.75      0.76       125
           2       0.78      0.66      0.72       121
           3       0.78      0.80      0.79       150
           4       0.86      0.92      0.89       157

    accuracy                           0.77       668
   macro avg       0.77      0.76      0.76       668
weighted avg       0.77      0.77      0.77       668



## View model output.

In [37]:
idx2word = {v: k for k, v in word2idx.items()}
idx2tag = {v: k for k, v in tag2idx.items()}

In [38]:
print('{:80}: {:15}\n'.format("Word", "Class"))
for sentence, tag in zip(X_test[:10], y_out_tags_list[:10]):
    s = " ".join([idx2word[w] for w in sentence])
    print('{:80}: {:5}\n'.format(s, tag))


Word                                                                            : Class          

mg rover china tie up delayed mg rover s proposed tie up with china s top carmaker has been delayed due to concerns by chinese regulators according to the financial times the paper said chinese officials had been irritated by rover s disclosure of its talks with shanghai automotive industry corp in october the proposed deal was seen as crucial to safeguarding the future of rover s longbridge plant in the west midlands however there are growing fears that the deal could result in job losses the observer reported on sunday that nearly half the workforce at longbridge could be under threat if the deal goes ahead shanghai automotive s proposed bn investment in rover is awaiting approval by its owner the shanghai city government and by the national development and reform commission which oversees foreign investment by chinese firms according to the ft the regulator has been annoyed by rover s 