![license_header_logo](../../../images/license_header_logo.png)

> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
<br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

# Introduction

In this notebook, we are going to work on a really interesting problem.

Quora wants to keep track of **insincere questions** on their platform so as to make users feel safe while sharing their knowledge. An insincere question in this context is defined as a question intended to make a statement rather than looking for helpful answers. To break this down further, here are some characteristics that can signify that a particular question is insincere:


* Has a non-neutral tone
* Is disparaging or inflammatory
* Isn’t grounded in reality
* Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers


The training data includes the question that was asked, and a flag denoting whether it was **identified as insincere (target = 1)**. The ground-truth labels contain some amount of noise, i.e. they are not guaranteed to be perfect. Our task will be to identify whether a given question is 'insincere'. Your dataset should be inside the *dataset* folder; otherwise you can download it from [here](https://drive.google.com/open?id=1fcip8PgsrX7m4AFgvUPLaac5pZ79mpwX).

It is time to code our own text classification model using PyTorch.

# What will we accomplish?

Steps to implement the insincere-question classifier using PyTorch:

> Step 1: Data preprocessing using PyTorch

> Step 2: Building network model

> Step 3: Building model training and evaluation

> Step 4: Inference and prediction

# Notebook Content

* [Import Library](#Import-Library)


* [Configure Torch](#Configure-Torch)


* [Data Preprocessing](#Data-Preprocessing)


* [Loading Training Data](#Loading-Training-Data)


* [Train-Test Split](#Train-Test-Split)


* [Create Training Batches](#Create-Training-Batches)


* [Building Neural Network Model](#Building-Neural-Network-Model)


* [LSTM Network Configuration](#LSTM-Network-Configuration)


* [Training Configuration](#Training-Configuration)


* [Model Training](#Model-Training)


* [Inference](#Inference)


* [Prediction](#Prediction)

# Implementation – Text Classification in PyTorch

Let us first import all the necessary libraries required to build a model. Here is a brief overview of the packages/libraries we are going to use:

* **Torch**: Used to define tensors and the mathematical operations on them.
* **TorchText**: An NLP library for PyTorch. It contains utilities for preprocessing text and loaders for a few popular NLP datasets.

### Import Library

In [1]:
# Deal with tensors
import torch   

# Handling text data
from torchtext.legacy import data  

### Configure Torch

A seed value is specified to make the results **reproducible**. A deep learning model can produce different results each time it is executed because of the randomness involved (weight initialization, shuffling, dropout), so it is important to fix the seed.


`torch.backends.cudnn.deterministic = True` makes cuDNN use deterministic algorithms, so the same algorithm is used each time the application runs on the GPU.
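To make the effect of seeding concrete, here is a minimal sketch (the helper name `draw` is purely illustrative and not part of the notebook) that re-seeds PyTorch's generator and checks that the same random numbers come back:

```python
import torch

def draw(seed):
    # Re-seed PyTorch's global generator, then sample three numbers
    torch.manual_seed(seed)
    return torch.rand(3)

first = draw(2021)
second = draw(2021)

# The same seed reproduces exactly the same samples
print(first)
print(torch.equal(first, second))
```

Anything that draws random numbers after the seed is set, such as weight initialization, follows the same repeatable sequence.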

In [2]:
# Reproducing same results
SEED = 2021

# Torch
torch.manual_seed(SEED)

# Cuda algorithms
torch.backends.cudnn.deterministic = True  

### Data Preprocessing
Now, let us see how to preprocess the text using `field` objects. There are 2 different types of `field` objects: `Field` and `LabelField`. Let us quickly understand the difference between the two:

1. **Field**: The `Field` object from the `data` module is used to specify the preprocessing steps for each column in the dataset.


2. **LabelField**: The `LabelField` object is a special case of `Field` that is used **only for classification labels**. It differs from `Field` only in its defaults: it sets `sequential=False` and `unk_token=None`.


Before we use `Field`, let us look at some of its parameters and what they are used for.

**Parameters of `Field`**:
* **tokenize**: specifies how a sentence is tokenized, i.e. converted into words. We use the default tokenizer here, which splits on whitespace
* **lower**: if True, converts the text to lowercase
* **batch_first**: if True, the first dimension of the input and output tensors is the batch size
* **include_lengths**: if True, each batch is returned as a tuple of the padded text and the original length of every sequence
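As a rough illustration of the tokenize and lowercase steps, the sketch below mimics them in plain Python. The function `preprocess` is a hypothetical stand-in, not the TorchText implementation:

```python
def preprocess(sentence, lower=False, tokenize=str.split):
    """Hypothetical stand-in for the tokenize/lower steps of a Field."""
    tokens = tokenize(sentence)
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return tokens

question = "Why are most indian parents against even liking someone?"

tokens = preprocess(question)               # whitespace split, case kept
lowered = preprocess(question, lower=True)  # same split, lowercased

print(tokens)
print(lowered)
```

Note that a whitespace split keeps punctuation attached to words (`someone?`), which is why a token such as `Why` and `why` or `someone` and `someone?` count as different vocabulary entries unless extra preprocessing is applied.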

In [3]:
TEXT = data.Field(batch_first=True, include_lengths=True)

LABEL = data.LabelField(dtype = torch.float, batch_first=True)

Next, we create a list of tuples, where the first value in each tuple is a **column name** and the second is a **`field` object** defined above. The tuples must follow the order of the columns in the CSV file, and `(None, None)` tells the loader to ignore the first column.

In [4]:
fields = [(None, None), ('text',TEXT),('label', LABEL)]

### Loading Training Data

Now, we are going to load the Quora dataset using the `TabularDataset` class.

In [5]:
# Loading custom dataset
training_data = data.TabularDataset(path = '../../../resources/day_07/quora.csv', format = 'csv',
                                    fields = fields,skip_header = True)

`vars()` returns the `__dict__` attribute of the given object.

In [6]:
# Print preprocessed text
print(vars(training_data.examples[0]))

{'text': ['Why', 'are', 'most', 'indian', 'parents', 'against', 'even', 'liking', 'someone?'], 'label': '1'}


### Train-Test Split

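Here we use 70% of the data for training and the remaining 30% for testing.

In [7]:
import random

# random.seed() returns None, so seed first and then pass the generator state explicitly
random.seed(SEED)
train_data, test_data = training_data.split(split_ratio=0.7, random_state=random.getstate())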

Preparing input and output sequences:

The next step is to build the vocabulary for the text and convert the sentences into integer sequences. The vocabulary contains the unique words in the entire text, and each unique word is assigned an index. The relevant options are listed below.

Parameters:

1. min_freq: Words that occur fewer than `min_freq` times are left out of the vocabulary and mapped to the unknown token.


2. Two special tokens known as unknown and padding will be added to the vocabulary
    * Unknown token is used to handle Out Of Vocabulary words
    * Padding token is used to make input sequences of same length
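The idea behind `min_freq` and the two special tokens can be sketched in plain Python. The functions `build_vocab` and `numericalize` below are toy illustrations with an assumed index layout, not the TorchText code:

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=1):
    """Toy vocabulary builder; names and index layout are illustrative."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = ["<unk>", "<pad>"]  # special tokens take the first two indices
    # Keep tokens seen at least min_freq times, most frequent first
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos += [tok for tok, n in ordered if n >= min_freq]
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens, stoi):
    # Out-of-vocabulary words fall back to the <unk> index
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = [["what", "is", "pytorch"],
          ["what", "is", "an", "lstm"],
          ["what", "is", "padding"]]

stoi = build_vocab(corpus, min_freq=2)
ids = numericalize(["what", "is", "quora"], stoi)

print(stoi)
print(ids)
```

Only `what` and `is` appear at least twice, so every other word, including the unseen `quora`, is mapped to the unknown index.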
    
    
Let us build the vocabulary and initialize the words with pretrained embeddings. Omit the `vectors` argument if you wish to initialize the embeddings randomly.

In [8]:
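# Initialize GloVe embeddings
# Note: the vocabulary is built on the full dataset here; pass train_data instead
# to keep words that occur only in the test split out of the vocabulary
TEXT.build_vocab(training_data, min_freq=3, vectors = "glove.6B.100d", vectors_cache="../../../resources/.vector_cache")  
LABEL.build_vocab(training_data)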

In [9]:
# No. of unique tokens in text
print("Size of TEXT vocabulary:",len(TEXT.vocab))

Size of TEXT vocabulary: 27322


In [10]:
# No. of unique tokens in label
print("Size of LABEL vocabulary:",len(LABEL.vocab))

Size of LABEL vocabulary: 71


In [11]:
# Commonly used words
print(TEXT.vocab.freqs.most_common(10))

[('the', 56284), ('to', 37100), ('a', 31818), ('of', 27796), ('in', 26680), ('and', 26416), ('Why', 24675), ('is', 24356), ('What', 20450), ('are', 19300)]


In [12]:
# Word dictionary
print(TEXT.vocab.stoi) 



### Create Training Batches

Now we will prepare batches for training the model. `BucketIterator` forms the batches in such a way that a minimum amount of padding is required.
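To see why grouping sentences of similar length reduces padding, consider this simplified sketch. It only sorts by length before batching; the real `BucketIterator` also shuffles, and the helper `padding_needed` is a hypothetical name:

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [3, 12, 4, 11, 5, 10, 2, 13]  # toy sentence lengths in tokens

unsorted_pad = padding_needed(lengths, batch_size=2)
bucketed_pad = padding_needed(sorted(lengths), batch_size=2)

print(unsorted_pad, bucketed_pad)
```

With these toy lengths, batching in the given order needs 32 pad tokens, while batching after sorting needs only 4.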

In [13]:
# Check whether cuda is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  

print("Device Preferred:", device)

Device Preferred: cuda


In [14]:
# Set batch size
BATCH_SIZE = 64

# Load an iterator
train_iterator, test_iterator = data.BucketIterator.splits((train_data, test_data), 
                                                           batch_size = BATCH_SIZE, 
                                                           sort_key = lambda x: len(x.text),
                                                           sort_within_batch=True, device = device)

### Building Neural Network Model

#### Model Architecture

It is now time to define the **architecture** to solve the binary classification problem. The `nn.Module` class from `torch.nn` is the **base class** for all neural network models. This means that every model must be a subclass of `nn.Module`.

We define 2 methods here: `__init__()` and `forward()`. Let us look at the explanation and use case of each:

1. **`__init__()`**: Whenever an instance of a class is created, the `__init__` method is invoked automatically. Hence, it is called the **constructor**. The arguments passed to the class are initialized by the constructor. This is where we **define all the layers** that the model will use.


2. **`forward()`**: The `forward` method defines the **forward pass** of the inputs through the layers.



Finally, let’s understand in detail about the different layers used for building the architecture and their parameters:

* **Embedding layer**: Embeddings are extremely important for any NLP task since they **represent a word in a numerical format**. The embedding layer creates a lookup table where each row is the embedding of one word, and converts the **integer sequence** into a **dense vector representation**. Here are the two most important parameters of the embedding layer:

    1. num_embeddings: No. of **unique words** in dictionary
    
    2. embedding_dim:  No. of **dimensions** for representing a word
    
    
* **LSTM**: LSTM is a **variant of RNN** that is capable of capturing **long-term dependencies**. These are the important parameters of this layer:

    1. input_size  :  Dimension of input
    2. hidden_size :  Number of hidden nodes
    3. num_layers  :  Number of layers to be stacked
    4. batch_first  : If True, then the input and output tensors are provided as (batch, seq, feature)
    5. dropout: If non-zero, introduces a **Dropout layer** on the outputs of each LSTM layer except the last layer. Default: 0
    6. bidirectional: If True, the LSTM becomes bidirectional
    

* **Linear Layer**: A linear layer is a dense (fully connected) layer. Its two important parameters are described below:

    * in_features : No. of input features
    * out_features: No. of output features
    

* **Pack Padding**: Used to build a **dynamic recurrent neural network**, i.e. one that handles sequences of different lengths. Without pack padding, the padded positions are also processed by the RNN, and the hidden state it returns is that of the last padded element. `pack_padded_sequence` wraps the batch so that the padded positions are skipped, and the RNN returns the **hidden state of the last non-padded element**.
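The effect of pack padding can be checked on a toy example. In this sketch (layer sizes and tensor names are arbitrary choices for illustration), the same two-step sequence is fed to an LSTM without padding, with padding, and with padding plus packing:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

short = torch.randn(1, 2, 4)                               # 2 real time steps
padded = torch.cat([short, torch.zeros(1, 3, 4)], dim=1)   # padded to 5 steps

with torch.no_grad():
    _, (h_true, _) = lstm(short)      # reference: no padding at all
    _, (h_padded, _) = lstm(padded)   # padding is processed like real input
    packed = nn.utils.rnn.pack_padded_sequence(
        padded, torch.tensor([2]), batch_first=True)
    _, (h_packed, _) = lstm(packed)   # padded steps are skipped

print(torch.allclose(h_packed, h_true))   # packed run matches the reference
print(torch.allclose(h_padded, h_true))   # plain padded run does not
```

The lengths tensor is kept on the CPU, which is what `pack_padded_sequence` expects.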

Now that we have a good understanding of all the blocks of the architecture, let us go to the code!

In [15]:
import torch.nn as nn

class classifier(nn.Module):
    
    # Define all the layers used in model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        
        # Constructor
        super().__init__()          
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # LSTM layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, 
                            bidirectional=bidirectional, dropout=dropout, batch_first=True)
        
        # Dense layer
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        
        # Activation function
        self.act = nn.Sigmoid()
    
    # Define the feedforward pass
    def forward(self, text, text_lengths):
        
        # text = (batch size, sent_length)
        embedded = self.embedding(text)
        #embedded = [batch size, sent_len, emb dim]
        
        # Padded sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'), batch_first=True)
        
        packed_output, (hidden, cell) = self.lstm(packed_embedded)
        # hidden = (num layers * num directions, batch size, hid dim)
        # cell = (num layers * num directions, batch size, hid dim)
        
        # Concatenate the final forward and backward hidden state
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
                
        # hidden = (batch size, hid dim * num directions)
        dense_outputs=self.fc(hidden)

        # Final activation function
        outputs=self.act(dense_outputs)
        
        return outputs
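The shapes involved in concatenating the final forward and backward hidden states can be verified with a small standalone sketch. The sizes below are arbitrary, and the LSTM is created separately from the model above:

```python
import torch
import torch.nn as nn

batch_size, seq_len, emb_dim, hid_dim, n_layers = 4, 7, 100, 32, 2
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers,
               bidirectional=True, batch_first=True)

x = torch.randn(batch_size, seq_len, emb_dim)
output, (hidden, cell) = lstm(x)

# batch_first only affects the input/output tensors, not hidden and cell
print(output.shape)   # (batch, seq_len, 2 * hid_dim)
print(hidden.shape)   # (n_layers * 2, batch, hid_dim)

# Last layer: index -2 is the forward direction, -1 the backward direction
final = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(final.shape)    # (batch, 2 * hid_dim)
```

This is why the dense layer takes `hidden_dim * 2` input features when the LSTM is bidirectional.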

### LSTM Network Configuration

The next step would be to define the hyperparameters and instantiate the model.

In [16]:
# Define hyperparameters
size_of_vocab = len(TEXT.vocab)
embedding_dim = 100
num_hidden_nodes = 32
num_output_nodes = 1
num_layers = 2
bidirection = True
dropout = 0.2

In [17]:
# Instantiate the model
model = classifier(size_of_vocab, embedding_dim, num_hidden_nodes, 
                   num_output_nodes, num_layers, bidirectional=bidirection, dropout=dropout)

View the model summary.

In [18]:
# Model Architecture
print(model)

classifier(
  (embedding): Embedding(27322, 100)
  (lstm): LSTM(100, 32, num_layers=2, batch_first=True, dropout=0.2, bidirectional=True)
  (fc): Linear(in_features=64, out_features=1, bias=True)
  (act): Sigmoid()
)


In [19]:
# No. of trainable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,791,657 trainable parameters
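The parameter count can be reproduced by hand from the layer sizes printed in the model summary. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and two bias vectors; the helper `lstm_layer_params` is an illustrative name:

```python
def lstm_layer_params(input_size, hidden_size, directions=2):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    per_direction = (4 * hidden_size * (input_size + hidden_size)
                     + 2 * 4 * hidden_size)
    return directions * per_direction

vocab_size, emb_dim, hid_dim = 27322, 100, 32

embedding = vocab_size * emb_dim                   # lookup table
lstm_1 = lstm_layer_params(emb_dim, hid_dim)       # first layer reads embeddings
lstm_2 = lstm_layer_params(2 * hid_dim, hid_dim)   # second layer reads both directions
linear = 2 * hid_dim * 1 + 1                       # weights plus bias

total = embedding + lstm_1 + lstm_2 + linear
print(f"{total:,}")
```

Almost all of the parameters (2,732,200 of them) sit in the embedding table, so the vocabulary size dominates the model size.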


In [20]:
#Initialize the pretrained embedding
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

In [21]:
print(pretrained_embeddings.shape)

torch.Size([27322, 100])


### Training Configuration

We have to define the optimizer, loss and metric for the model.

In [22]:
import torch.optim as optim

# Define optimizer and loss
optimizer = optim.Adam(model.parameters())

# Binary Cross Entropy loss
criterion = nn.BCELoss()
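For reference, binary cross-entropy for one prediction is `-(y * log(p) + (1 - y) * log(1 - p))`, and `nn.BCELoss` averages it over the batch by default. A plain-Python sketch of the formula (the function name `bce` is illustrative):

```python
import math

def bce(prob, target):
    """Binary cross-entropy for a single prediction (natural log)."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

confident_right = bce(0.9, 1.0)   # prediction agrees with the label
confident_wrong = bce(0.9, 0.0)   # prediction disagrees with the label
out_of_range = bce(0.9, 2.0)      # target outside [0, 1]

print(round(confident_right, 4))
print(round(confident_wrong, 4))
print(round(out_of_range, 4))
```

For targets of 0 or 1 the loss is never negative. A target outside that range can push the formula below zero, so a negative loss in a training log is a hint that the labels may contain values other than 0 and 1 and are worth inspecting.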

In [23]:
# Define metric
def binary_accuracy(preds, y):
    # Round predictions to the closest integer
    rounded_preds = torch.round(preds)
    
    correct = (rounded_preds == y).float() 
    acc = correct.sum() / len(correct)
    return acc

In [24]:
# Push to cuda if available
model = model.to(device)
criterion = criterion.to(device)

There are 2 phases while building the model:

* **Training phase**: `model.train()` puts the model in **training mode**, which activates the dropout layers.


* **Inference phase**: `model.eval()` puts the model in **evaluation mode**, which deactivates the dropout layers.
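The difference between the two modes is easy to observe on a single dropout layer. This is a minimal sketch with an arbitrary dropout probability of 0.5:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

layer.train()          # dropout active: roughly half the entries are zeroed
train_out = layer(x)

layer.eval()           # dropout disabled: the input passes through unchanged
eval_out = layer(x)

print((train_out == 0).float().mean().item())  # fraction of dropped entries
print(torch.equal(eval_out, x))
```

In training mode the surviving entries are scaled by `1 / (1 - p)`, so the expected activation stays the same in both modes.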

#### Training Phase

In [25]:
def train(model, iterator, optimizer, criterion):
    
    # Initialize every epoch
    epoch_loss = 0
    epoch_acc = 0
    
    # Set the model in training phase
    model.train()
    
    for batch in iterator:
        
        # Resets the gradients after every batch
        optimizer.zero_grad()
        
        # Retrieve text and no. of words
        text, text_lengths = batch.text
        
        # Drop the output dimension: (batch, 1) -> (batch,)
        predictions = model(text, text_lengths).squeeze(1)
        
        # Compute the loss
        loss = criterion(predictions, batch.label)
        
        # Compute the binary accuracy
        acc = binary_accuracy(predictions, batch.label)
        
        # Backpropagate the loss and compute the gradients
        loss.backward()
        
        # Update the weights
        optimizer.step()
        
        # Loss and accuracy
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

#### Evaluating Phase

In [26]:
def evaluate(model, iterator, criterion):
    
    # Initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    # Deactivating dropout layers
    model.eval()
    
    # Deactivates autograd
    with torch.no_grad():
    
        for batch in iterator:
            
            # Retrieve text and no. of words
            text, text_lengths = batch.text
            
            # Drop the output dimension: (batch, 1) -> (batch,)
            predictions = model(text, text_lengths).squeeze(1)
            
            # Compute loss and accuracy
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            
            # Keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Model Training

Finally, we will train the model for a fixed number of epochs and save the weights whenever the validation loss improves.

In [27]:
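import os

N_EPOCHS = 5
best_valid_loss = float('inf')

# Make sure the checkpoint directory exists before torch.save is called
os.makedirs('model', exist_ok=True)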

for epoch in range(N_EPOCHS):
     
    # Train the model
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    
    # Evaluate the model
    valid_loss, valid_acc = evaluate(model, test_iterator, criterion)
    
    # Save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model/saved_weights.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

	Train Loss: 0.353 | Train Acc: 83.82%
	 Val. Loss: 0.318 |  Val. Acc: 87.05%
	Train Loss: 0.189 | Train Acc: 89.59%
	 Val. Loss: 0.315 |  Val. Acc: 86.85%
	Train Loss: -0.443 | Train Acc: 91.19%
	 Val. Loss: -0.082 |  Val. Acc: 86.80%
	Train Loss: -1.602 | Train Acc: 92.79%
	 Val. Loss: -0.032 |  Val. Acc: 86.15%
	Train Loss: -2.077 | Train Acc: 94.05%
	 Val. Loss: -0.001 |  Val. Acc: 85.96%


In [28]:
# Load the best weights onto the current device
path = 'model/saved_weights.pt'
model.load_state_dict(torch.load(path, map_location=device));
model.eval();

### Inference

In [29]:
#inference 
import spacy
nlp = spacy.load('en_core_web_sm')  

In [30]:
def predict(model, sentence):
    # Tokenize the sentence (note: training used the Field's default whitespace tokenizer)
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    # Convert tokens to vocabulary indices
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    # Number of tokens in the sentence
    length = [len(indexed)]
    # Build a (1, no. of words) batch on the model's device
    tensor = torch.LongTensor(indexed).to(device).unsqueeze(0)
    # Lengths stay on the CPU for pack_padded_sequence
    length_tensor = torch.LongTensor(length)
    # Run the model without tracking gradients
    with torch.no_grad():
        prediction = model(tensor, length_tensor)
    return prediction.item()

### Prediction

In [31]:
#make predictions
predict(model, "Are there any sports that you don't like?")

0.35329875349998474

In [32]:
#insincere question
predict(model, "Why Indian girls go crazy about marrying Shri. Rahul Gandhi ji?")

1.0

# Remark

We have seen how to build our own text classification model in PyTorch and learnt the importance of pack padding. You can experiment with the hyperparameters of the Long Short-Term Memory (LSTM) model, such as the number of hidden nodes and the number of hidden layers, to improve the performance further.

# Contributors

**Author**
<br>Chee Lam

# References

1. [Build Your First Text Classification model using PyTorch](https://www.analyticsvidhya.com/blog/2020/01/first-text-classification-in-pytorch/)