### Questions Classification Custom dataset.

In the previous notebook we looked in depth at different NN layers and models for predicting the question category. That problem was relatively easy because we were predicting a single label. In this notebook we are going to learn how to predict two labels at the same time.

* All credit goes to [the pytorch community](https://discuss.pytorch.org/t/a-model-with-multiple-outputs/10440)

### Data preparation using torchtext.

In this notebook we are not going to prepare the data, because it has already been prepared for us and saved in files. All we need to do is load it.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Imports

In [2]:
import time
from prettytable import PrettyTable
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

import torch, os, random
from torch import nn
import torch.nn.functional as F

torch.__version__

'1.9.0+cu102'

### Setting up the seeds

In [3]:
SEED = 42

np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### File names

In [4]:
train_path_json = 'train.json'
test_path_json = 'test.json'
val_path_json = 'val.json'

### Creating the Fields.

In this notebook we are going to have two label fields, so we need two `Label` fields and one `Text` field. On the `Text` field we are going to pass the arg `include_lengths=True` since we are going to use packed padded sequences.
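
To see what packing does, here is a minimal sketch on a toy tensor (the values and shapes are made up for illustration). `pack_padded_sequence` keeps only the real time steps, so the LSTM does not process padding, and `pad_packed_sequence` restores the padded layout.

```python
import torch
from torch import nn

# Two sequences of lengths 3 and 1, padded with zeros to length 3.
# Shape is [seq_len, batch_size, features] because batch_first=False by default.
padded = torch.tensor([
    [[1.0], [4.0]],
    [[2.0], [0.0]],
    [[3.0], [0.0]],
])
lengths = torch.tensor([3, 1])  # lengths must live on the CPU

packed = nn.utils.rnn.pack_padded_sequence(padded, lengths, enforce_sorted=False)

# Only the real time steps are stored; the padded steps are dropped.
n_real_steps = packed.data.shape[0]
print(n_real_steps)

# Unpacking restores the padded layout and returns the original lengths.
unpacked, unpacked_lengths = nn.utils.rnn.pad_packed_sequence(packed)
print(unpacked.shape, unpacked_lengths)
```

This is why the model's `forward` method takes both `text` and `text_lengths`.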


In [5]:
from torchtext.legacy import data, datasets

In [6]:
TEXT = data.Field(
  tokenize = "spacy",
  include_lengths = True,
  tokenizer_language = 'en_core_web_sm',
)
LABEL_1 = data.LabelField()
LABEL_2 = data.LabelField()

In [7]:
fields = {
  "Questions": ('text', TEXT),
  "Category1": ('label_1', LABEL_1),
  "Category2":('label_2', LABEL_2),
}
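
The keys of `fields` must match the keys found in the data files. As far as I know, the `json` format of `TabularDataset` reads one JSON object per line rather than a single JSON array. The sketch below shows what such a file could look like; the two records are illustrative and are not copied from the actual files.

```python
import json

# Illustrative records; the key names must match the keys of `fields`.
records = [
    {"Questions": "What is the name of Miss India 1994 ?", "Category1": "HUM", "Category2": "ind"},
    {"Questions": "What 's the Olympic motto ?", "Category1": "DESC", "Category2": "desc"},
]

# One JSON object per line (JSON Lines), not one big JSON array.
json_lines = "\n".join(json.dumps(r) for r in records)
print(json_lines)

# Reading it back line by line, the way a line-oriented loader would.
parsed = [json.loads(line) for line in json_lines.splitlines()]
```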

### Creating the dataset.

We are going to use `TabularDataset.splits()` to create the datasets.

In [8]:
files_path = '/content/drive/MyDrive/NLP Data/questions-classification/pytorch'
train_data, val_data, test_data = data.TabularDataset.splits(
   files_path,
   train=train_path_json,
   test= test_path_json,
   validation= val_path_json,
   format = "json",
   fields=fields
)

In [9]:
len(train_data), len(test_data), len(val_data)

(5179, 28, 245)

In [10]:
print(vars(train_data.examples[0]))

{'text': ['What', 'is', 'the', 'name', 'of', 'Miss', 'India', '1994', '?'], 'label_1': 'HUM', 'label_2': 'ind'}


### Building the Vocabulary and Loading the `pretrained` word vectors.

We are going to use the `glove.6B.100d` word vectors, which were trained on 6 billion tokens. Each word is represented by a 100-dimensional vector.

**Note**: We should build the vocabulary on the `train` dataset only.
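
The reason is that validation and test words must stay unseen; anything outside the training vocabulary is mapped to `<unk>`. A minimal plain-Python sketch of that idea follows (the sentences and the `stoi`/`itos` construction are simplified stand-ins, not torchtext's implementation):

```python
from collections import Counter

train_tokens = [["what", "is", "the", "name"], ["who", "is", "the", "author"]]
test_tokens = ["what", "is", "the", "capital"]

# Build the vocabulary from training tokens only, special tokens first.
counts = Counter(tok for sent in train_tokens for tok in sent)
itos = ["<unk>", "<pad>"] + sorted(counts, key=lambda t: (-counts[t], t))
stoi = {tok: i for i, tok in enumerate(itos)}

# Words never seen during training fall back to the <unk> index.
indexed = [stoi.get(tok, stoi["<unk>"]) for tok in test_tokens]
print(indexed)
```

Here `capital` does not occur in the training sentences, so it is indexed as `<unk>` (index 0).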

In [11]:
MAX_VOCAB_SIZE = 100_000_000

TEXT.build_vocab(
    train_data,
    max_size = MAX_VOCAB_SIZE,
    vectors = "glove.6B.100d",
    unk_init = torch.Tensor.normal_
)
LABEL_1.build_vocab(train_data)
LABEL_2.build_vocab(train_data)


.vector_cache/glove.6B.zip: 862MB [02:39, 5.40MB/s]                           
100%|█████████▉| 398400/400000 [00:14<00:00, 27403.29it/s]

### Checking the labels

In [12]:
print("************************ FIRST LABELS ********************************")
print(LABEL_1.vocab.stoi)
print("************************ SECOND LABELS ********************************")
print(LABEL_2.vocab.stoi)

************************ FIRST LABELS ********************************
defaultdict(None, {'ENTY': 0, 'HUM': 1, 'DESC': 2, 'NUM': 3, 'LOC': 4, 'ABBR': 5})
************************ SECOND LABELS ********************************
defaultdict(None, {'ind': 0, 'other': 1, 'def': 2, 'count': 3, 'desc': 4, 'manner': 5, 'cremat': 6, 'date': 7, 'gr': 8, 'reason': 9, 'country': 10, 'city': 11, 'animal': 12, 'food': 13, 'dismed': 14, 'termeq': 15, 'period': 16, 'money': 17, 'exp': 18, 'state': 19, 'sport': 20, 'event': 21, 'product': 22, 'substance': 23, 'techmeth': 24, 'color': 25, 'dist': 26, 'perc': 27, 'veh': 28, 'word': 29, 'title': 30, 'mount': 31, 'body': 32, 'abb': 33, 'lang': 34, 'volsize': 35, 'plant': 36, 'symbol': 37, 'instru': 38, 'weight': 39, 'code': 40, 'letter': 41, 'speed': 42, 'temp': 43, 'ord': 44, 'currency': 45, 'religion': 46})


### Device.

In [13]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Creating iterators.

We are going to use our favorite iterator known as the `BucketIterator` to create iterators for all the sets that we have.

In [14]:
sort_key = lambda x: len(x.text)

BATCH_SIZE = 64

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train_data, val_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = sort_key,
)
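
`BucketIterator` groups examples of similar length into the same batch so that less padding is needed. The sketch below is only a rough illustration of that idea with made-up lengths, not torchtext's actual batching algorithm; `padding_needed` is a hypothetical helper.

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when consecutive examples are batched together."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [3, 12, 4, 11, 5, 10]  # made-up sentence lengths

unsorted_padding = padding_needed(lengths, batch_size=2)
bucketed_padding = padding_needed(sorted(lengths), batch_size=2)
print(unsorted_padding, bucketed_padding)
```

Batching similar lengths together cuts the padding from 21 tokens to 7 in this toy case.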

### Creating the Model.

In [15]:
class QuestionsLSTMRNN(nn.Module):
  def __init__(self, 
               vocab_size,
               embedding_size,
               hidden_size,
               output_size_1,
               output_size_2,
               num_layers,
               pad_index,
               bidirectional = True,
               dropout=.5
               ):
    super(QuestionsLSTMRNN, self).__init__()
    self.embedding = nn.Embedding(
        vocab_size,
        embedding_size,
        padding_idx = pad_index
    )
    self.lstm = nn.LSTM(
        embedding_size,
        hidden_size  = hidden_size,
        bidirectional = bidirectional,
        num_layers = num_layers,
        dropout = dropout
    )
    self.fc_1 = nn.Linear(
        hidden_size * 2 if bidirectional else hidden_size,
        out_features = 512
    )
    self.fc_2 = nn.Linear(
        512,
        out_features = 256
    )
    self.out_1 = nn.Linear(
        256,
        out_features = output_size_1
    )
    self.out_2 = nn.Linear(
        256,
        out_features = output_size_2
    )
    self.dropout = nn.Dropout(dropout)

  def forward(self, text, text_lengths):
    embedded = self.dropout(self.embedding(text))
    packed_embedded = nn.utils.rnn.pack_padded_sequence(
        embedded, text_lengths.to('cpu'), enforce_sorted=False
    )
    packed_output, (h_0, c_0) = self.lstm(packed_embedded)
    output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
    h_0 = self.dropout(torch.cat((h_0[-2,:,:], h_0[-1,:,:]), dim = 1))
    fc_hidden_1 = self.dropout(self.fc_1(h_0))
    fc_hidden_2 = self.dropout(self.fc_2(fc_hidden_1))
    return self.out_1(fc_hidden_2), self.out_2(fc_hidden_2)
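
The key idea in the model above is a shared body followed by two separate output layers. A stripped-down sketch of that pattern is shown below; `TwoHeadClassifier` and `trunk` are hypothetical names, and a single linear layer stands in for the embedding, LSTM and dense layers.

```python
import torch
from torch import nn

class TwoHeadClassifier(nn.Module):
    """Toy stand-in: a shared trunk feeding two output layers."""
    def __init__(self, in_features, hidden, n_classes_1, n_classes_2):
        super().__init__()
        self.trunk = nn.Linear(in_features, hidden)
        self.out_1 = nn.Linear(hidden, n_classes_1)
        self.out_2 = nn.Linear(hidden, n_classes_2)

    def forward(self, x):
        shared = torch.relu(self.trunk(x))
        # Both heads read the same shared representation.
        return self.out_1(shared), self.out_2(shared)

toy = TwoHeadClassifier(in_features=8, hidden=16, n_classes_1=6, n_classes_2=47)
logits_1, logits_2 = toy(torch.randn(4, 8))
print(logits_1.shape, logits_2.shape)
```

Each head returns one row of logits per example, with as many columns as that label has classes.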


### Creating the model instance.

In [16]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM_1 =  len(LABEL_1.vocab)
OUTPUT_DIM_2 =  len(LABEL_2.vocab)
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] 

questions_model = QuestionsLSTMRNN(
            INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM_1, 
            OUTPUT_DIM_2,
            N_LAYERS, 
            bidirectional = BIDIRECTIONAL, 
            dropout = DROPOUT, 
            pad_index = PAD_IDX
            ).to(device)
questions_model

QuestionsLSTMRNN(
  (embedding): Embedding(9053, 100, padding_idx=1)
  (lstm): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc_1): Linear(in_features=512, out_features=512, bias=True)
  (fc_2): Linear(in_features=512, out_features=256, bias=True)
  (out_1): Linear(in_features=256, out_features=6, bias=True)
  (out_2): Linear(in_features=256, out_features=47, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Model parameters

In [17]:

def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(questions_model)
print(f"Total number of parameters: {n_params:,}\nTotal trainable parameters: {trainable_params:,}")


Total number of parameters: 3,623,049
Total trainable parameters: 3,623,049


### Loading pretrained vectors to the embedding layer.

In [18]:
pretrained_embeddings  = TEXT.vocab.vectors

In [19]:
questions_model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 0.0091,  0.2810,  0.7356,  ..., -0.7508,  0.8967, -0.7631],
        [ 0.2906,  0.3217,  0.2419,  ..., -0.9444, -0.3790,  0.6196],
        [-0.3898, -0.5949,  0.2729,  ..., -1.0948,  0.8617, -0.4429]],
       device='cuda:0')

### Zeroing the `<pad>` and `<unk>` tokens.

In [20]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]  # index of the <unk> token
questions_model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
questions_model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
questions_model.embedding.weight.data

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 0.0091,  0.2810,  0.7356,  ..., -0.7508,  0.8967, -0.7631],
        [ 0.2906,  0.3217,  0.2419,  ..., -0.9444, -0.3790,  0.6196],
        [-0.3898, -0.5949,  0.2729,  ..., -1.0948,  0.8617, -0.4429]],
       device='cuda:0')

### Loss and optimizer.
For the loss we are going to create two loss functions, one per label. We are going to use `Adam` as our optimizer.

In [21]:
optimizer = torch.optim.Adam(questions_model.parameters())
criterion_1 = nn.CrossEntropyLoss().to(device)
criterion_2 = nn.CrossEntropyLoss().to(device)
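
With two labels, the two losses are combined into one scalar before back propagation. The sketch below shows this on a toy network; the layer sizes are arbitrary and the weights `w_1` and `w_2` are hypothetical (the notebook uses a plain unweighted sum). A single criterion is reused here, which should be fine because `nn.CrossEntropyLoss` has no learnable state.

```python
import torch
from torch import nn

torch.manual_seed(0)

shared = nn.Linear(8, 16)    # stands in for the shared layers
head_1 = nn.Linear(16, 6)    # 6 classes for the first label
head_2 = nn.Linear(16, 47)   # 47 classes for the second label
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 8)
y_1 = torch.randint(0, 6, (4,))
y_2 = torch.randint(0, 47, (4,))

features = torch.relu(shared(x))
loss_1 = criterion(head_1(features), y_1)
loss_2 = criterion(head_2(features), y_2)

# Hypothetical weights; with both set to 1 this is the plain sum.
w_1, w_2 = 1.0, 1.0
loss = w_1 * loss_1 + w_2 * loss_2
loss.backward()

# One backward pass fills gradients for the shared layer and both heads.
print(shared.weight.grad is not None, head_1.weight.grad is not None, head_2.weight.grad is not None)
```

If one label matters more than the other, changing the weights is one simple way to shift the balance.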

### Accuracy function.
We are going to create the `categorical_accuracy()` function that will calculate the categorical accuracy for predicted labels and actual labels.

**Note**: this function remains the same as before; we are just going to reuse it for both labels.

In [22]:
def categorical_accuracy(preds, y):
  top_pred = preds.argmax(1, keepdim = True)
  correct = top_pred.eq(y.view_as(top_pred)).sum()
  return correct.float() / y.shape[0]
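
A small worked example of the function on made-up logits: three of the four rows have their largest score at the position of the true label, so the accuracy is 0.75.

```python
import torch

def categorical_accuracy(preds, y):
    top_pred = preds.argmax(1, keepdim=True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    return correct.float() / y.shape[0]

# Made-up logits for 4 examples and 3 classes.
logits = torch.tensor([
    [2.0, 0.1, 0.3],   # predicted class 0
    [0.2, 1.5, 0.1],   # predicted class 1
    [0.9, 0.8, 0.1],   # predicted class 0
    [0.1, 0.2, 3.0],   # predicted class 2
])
labels = torch.tensor([0, 1, 2, 2])

acc = categorical_accuracy(logits, labels)
print(acc.item())  # 0.75
```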

### Training and Evaluation functions.

In the train and evaluate functions we are going to change a lot of things. I will highlight the changes using comments.

In [23]:
def train(model, iterator, optimizer, criterion_1, criterion_2):
    """
    Losses and accuracy should be of different labels
    """
    epoch_loss_1 = 0
    epoch_acc_1 = 0
    epoch_loss_2 = 0
    epoch_acc_2 = 0

    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text, text_lengths = batch.text
        """
        The model returns two predictions for different labels.
        """
        predictions_1, predictions_2 = model(text, text_lengths)
        predictions_1 = predictions_1.squeeze(1)
        predictions_2 = predictions_2.squeeze(1)

        """
        Get the loss for each label
        """
        loss_1 = criterion_1(predictions_1, batch.label_1) # loss for the first label, computed against label 1
        loss_2 = criterion_2(predictions_2, batch.label_2) # loss for the second label, computed against label 2

        acc_1 = categorical_accuracy(predictions_1, batch.label_1) # accuracy for the first label
        acc_2 = categorical_accuracy(predictions_2, batch.label_2) # accuracy for the second label
        
        """
        We have to sum the loss before back propagation
        """
        loss = loss_1 + loss_2
        loss.backward()
        optimizer.step()
        """
        ********* METRICS ************
        """
        epoch_loss_1 += loss_1.item()
        epoch_loss_2 += loss_2.item()
        epoch_acc_1 += acc_1.item()
        epoch_acc_2 += acc_2.item()
    return epoch_loss_1 / len(iterator), epoch_loss_2 / len(iterator), epoch_acc_1 / len(iterator), epoch_acc_2/ len(iterator)


def evaluate(model, iterator, criterion_1, criterion_2):
    """
    Losses and accuracy should be of different labels
    """
    epoch_loss_1 = 0
    epoch_acc_1 = 0
    epoch_loss_2 = 0
    epoch_acc_2 = 0

    model.eval()
    with torch.no_grad():
      for batch in iterator:
          text, text_lengths = batch.text
          """
          The model returns two predictions for different labels.
          """
          predictions_1, predictions_2 = model(text, text_lengths)
          predictions_1 = predictions_1.squeeze(1)
          predictions_2 = predictions_2.squeeze(1)
          """
          Get the loss for each label
          """
          loss_1 = criterion_1(predictions_1, batch.label_1) # loss for the first label, computed against label 1
          loss_2 = criterion_2(predictions_2, batch.label_2) # loss for the second label, computed against label 2

          acc_1 = categorical_accuracy(predictions_1, batch.label_1) # accuracy for the first label
          acc_2 = categorical_accuracy(predictions_2, batch.label_2) # accuracy for the second label
          """
          ********* METRICS ************
          """
          epoch_loss_1 += loss_1.item()
          epoch_loss_2 += loss_2.item()
          epoch_acc_1 += acc_1.item()
          epoch_acc_2 += acc_2.item()
    return epoch_loss_1 / len(iterator), epoch_loss_2 / len(iterator), epoch_acc_1 / len(iterator), epoch_acc_2/ len(iterator)

### Training loop.
We are going to create helper functions that will help us to visualize our training.

1. Time to string

In [24]:
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
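
A quick usage sketch of `hms_string` with a made-up duration. Note that the training table passes `end - start` to it, so the `ETA` column there actually reports the time an epoch took.

```python
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

# 3725.5 seconds is 1 hour, 2 minutes and 5.5 seconds.
formatted = hms_string(3725.5)
print(formatted)  # 1:02:05.50
```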
    

2. Tabulate a training epoch.

In [25]:
def visualize_training(start, end, train_loss_1, train_loss_2, train_accuracy_1, train_accuracy_2, 
                       val_loss_1, val_loss_2, val_accuracy_1, val_accuracy_2, title):
  data = [
       ["Training", f'{train_loss_1:.3f}',  f'{train_loss_2:.3f}', f'{train_accuracy_1:.3f}', f'{train_accuracy_2:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss_1:.3f}', f'{val_loss_2:.3f}', f'{val_accuracy_1:.3f}', f'{val_accuracy_2:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS_1", "LOSS_2", "ACCURACY_1", "ACCURACY_2", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["ETA"] = 'r'
  table.align["LOSS_1"] = 'r'
  table.align["ACCURACY_1"] = 'r'
  table.align["LOSS_2"] = 'r'
  table.align["ACCURACY_2"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)


In [33]:
N_EPOCHS = 100
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start = time.time()

    train_loss_1, train_loss_2, train_acc_1, train_acc_2 = train(questions_model, train_iter, 
                                                                 optimizer, criterion_1, criterion_2)
    
    valid_loss_1, valid_loss_2, valid_acc_1, valid_acc_2 = evaluate(questions_model, val_iter, 
                                                                    criterion_1, criterion_2)
    title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss_2 < best_valid_loss else 'not saving...'}"
    """
    We save the model based on the validation loss of the second label
    (the one with 47 classes). Feel free to monitor a different loss instead.
    """
    if valid_loss_2 < best_valid_loss:
        best_valid_loss = valid_loss_2
        torch.save(questions_model.state_dict(), 'best-model.pt')
    end = time.time()
    visualize_training(start, end, train_loss_1, train_loss_2, train_acc_1, train_acc_2, 
                       valid_loss_1, valid_loss_2, valid_acc_1, valid_acc_2, title)


+---------------------------------------------------------------------+
|                  EPOCH: 01/100 saving best model...                 |
+------------+--------+--------+------------+------------+------------+
| CATEGORY   | LOSS_1 | LOSS_2 | ACCURACY_1 | ACCURACY_2 |        ETA |
+------------+--------+--------+------------+------------+------------+
| Training   |  0.263 |  0.842 |      0.910 |      0.754 | 0:00:01.23 |
| Validation |  0.407 |  0.989 |      0.849 |      0.748 |            |
+------------+--------+--------+------------+------------+------------+
+---------------------------------------------------------------------+
|                  EPOCH: 02/100 saving best model...                 |
+------------+--------+--------+------------+------------+------------+
| CATEGORY   | LOSS_1 | LOSS_2 | ACCURACY_1 | ACCURACY_2 |        ETA |
+------------+--------+--------+------------+------------+------------+
| Training   |  0.259 |  0.807 |      0.910 |      0.769 | 0:00:
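
The training loop saves the model whenever the validation loss of the second label improves. One alternative, sketched below, is to monitor the sum of both validation losses. The loss values are invented for illustration and `epochs_saved` is a hypothetical helper, so treat this as a sketch of the idea rather than a recommendation.

```python
# Invented per-epoch validation losses (label 1, label 2), for illustration only.
history = [(0.41, 0.99), (0.30, 1.00), (0.36, 0.97), (0.40, 0.98)]

def epochs_saved(history, monitor):
    """Return the epochs at which a checkpoint would be written."""
    best_valid_loss = float("inf")
    saved = []
    for epoch, (valid_loss_1, valid_loss_2) in enumerate(history, start=1):
        value = monitor(valid_loss_1, valid_loss_2)
        if value < best_valid_loss:
            best_valid_loss = value
            saved.append(epoch)  # the notebook would call torch.save(...) here
    return saved

saved_on_label_2 = epochs_saved(history, lambda l1, l2: l2)
saved_on_sum = epochs_saved(history, lambda l1, l2: l1 + l2)
print(saved_on_label_2, saved_on_sum)
```

With these numbers the two criteria keep different checkpoints, which shows that the choice of monitored loss decides which model ends up in `best-model.pt`.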

### Model Evaluation.

In [34]:
questions_model.load_state_dict(torch.load('best-model.pt'))

test_loss_1, test_loss_2, test_acc_1, test_acc_2 = evaluate(questions_model, test_iter, criterion_1, criterion_2)
print(f'Test Loss 1: {test_loss_1:.3f} | Test Loss 2: {test_loss_2:.3f}  | Test Acc 1: {test_acc_1*100:.2f}% | Test Acc 2: {test_acc_2*100:.2f}%')

Test Loss 1: 0.543 | Test Loss 2: 1.476  | Test Acc 1: 78.57% | Test Acc 2: 67.86%


### Model Inference.

We are now ready to make predictions with our model.

In [35]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [36]:
reversed_labels_1 = dict([(v, k) for (k, v) in LABEL_1.vocab.stoi.items()])
reversed_labels_2 = dict([(v, k) for (k, v) in LABEL_2.vocab.stoi.items()])

reversed_labels_1, reversed_labels_2

({0: 'ENTY', 1: 'HUM', 2: 'DESC', 3: 'NUM', 4: 'LOC', 5: 'ABBR'},
 {0: 'ind',
  1: 'other',
  2: 'def',
  3: 'count',
  4: 'desc',
  5: 'manner',
  6: 'cremat',
  7: 'date',
  8: 'gr',
  9: 'reason',
  10: 'country',
  11: 'city',
  12: 'animal',
  13: 'food',
  14: 'dismed',
  15: 'termeq',
  16: 'period',
  17: 'money',
  18: 'exp',
  19: 'state',
  20: 'sport',
  21: 'event',
  22: 'product',
  23: 'substance',
  24: 'techmeth',
  25: 'color',
  26: 'dist',
  27: 'perc',
  28: 'veh',
  29: 'word',
  30: 'title',
  31: 'mount',
  32: 'body',
  33: 'abb',
  34: 'lang',
  35: 'volsize',
  36: 'plant',
  37: 'symbol',
  38: 'instru',
  39: 'weight',
  40: 'code',
  41: 'letter',
  42: 'speed',
  43: 'temp',
  44: 'ord',
  45: 'currency',
  46: 'religion'})

In [37]:
def tabulate(column_names, data, title="QUESTIONS PREDICTIONS TABLE"):
  table = PrettyTable(column_names)
  table.align[column_names[0]] = "l"
  table.align[column_names[1]] = "l"
  for row in data:
    table.add_row(row)
  print(table)

def predict_question_type(model, sentence, min_len = 5, actual_class_1=0, actual_class_2=0):
    model.eval()
    with torch.no_grad():
      tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
     
      if len(tokenized) < min_len:
          tokenized += ['<pad>'] * (min_len - len(tokenized))
      indexed = [TEXT.vocab.stoi[t] for t in tokenized]
      length =  [len(indexed)]
      tensor = torch.LongTensor(indexed).to(device)
      tensor = tensor.unsqueeze(1)
      length_tensor = torch.LongTensor(length)
      probabilities_1, probabilities_2 = model(tensor, length_tensor)
      prediction_1 = torch.argmax(probabilities_1, dim=1).item()
      prediction_2 = torch.argmax(probabilities_2, dim=1).item()
      table_headers =["KEY", "VALUE"]
      table_data = [
          ["PREDICTED CLASS 1",  prediction_1],
          ["ACTUAL CLASS 1", actual_class_1],
          ["PREDICTED CLASS 2",  prediction_2],
          ["ACTUAL CLASS 2", actual_class_2],
          ["PREDICTED CLASS NAME 1",  reversed_labels_1[prediction_1]],    
          ["PREDICTED CLASS NAME 2",  reversed_labels_2[prediction_2]],    
      ]
      tabulate(table_headers, table_data)


### Entity and Other

In [38]:
predict_question_type(questions_model, "What kind of weapons were used in Medieval warfare ?", 
                      actual_class_1=LABEL_1.vocab.stoi["ENTY"], actual_class_2=LABEL_2.vocab.stoi["other"]
                      )

+------------------------+-------+
| KEY                    | VALUE |
+------------------------+-------+
| PREDICTED CLASS 1      | 0     |
| ACTUAL CLASS 1         | 0     |
| PREDICTED CLASS 2      | 1     |
| ACTUAL CLASS 2         | 1     |
| PREDICTED CLASS NAME 1 | ENTY  |
| PREDICTED CLASS NAME 2 | other |
+------------------------+-------+


### Human and IND

In [39]:
predict_question_type(questions_model, "Whose video is titled Shape Up with Arnold ?", 
                      actual_class_1=LABEL_1.vocab.stoi["HUM"], actual_class_2=LABEL_2.vocab.stoi["ind"]
                      )

+------------------------+-------+
| KEY                    | VALUE |
+------------------------+-------+
| PREDICTED CLASS 1      | 1     |
| ACTUAL CLASS 1         | 1     |
| PREDICTED CLASS 2      | 0     |
| ACTUAL CLASS 2         | 0     |
| PREDICTED CLASS NAME 1 | HUM   |
| PREDICTED CLASS NAME 2 | ind   |
+------------------------+-------+


### Description and DESC.

In [40]:
predict_question_type(questions_model, "What 's the Olympic motto ?", 
                      actual_class_1=LABEL_1.vocab.stoi["DESC"], actual_class_2=LABEL_2.vocab.stoi["desc"]
                      )

+------------------------+-------+
| KEY                    | VALUE |
+------------------------+-------+
| PREDICTED CLASS 1      | 2     |
| ACTUAL CLASS 1         | 2     |
| PREDICTED CLASS 2      | 2     |
| ACTUAL CLASS 2         | 4     |
| PREDICTED CLASS NAME 1 | DESC  |
| PREDICTED CLASS NAME 2 | def   |
+------------------------+-------+


### Location and STATE

In [41]:
predict_question_type(questions_model, "What state full of milk and honey was the destination in The Grapes of Wrath ?", 
                      actual_class_1=LABEL_1.vocab.stoi["LOC"], actual_class_2=LABEL_2.vocab.stoi["state"]
                      )

+------------------------+-------+
| KEY                    | VALUE |
+------------------------+-------+
| PREDICTED CLASS 1      | 4     |
| ACTUAL CLASS 1         | 4     |
| PREDICTED CLASS 2      | 19    |
| ACTUAL CLASS 2         | 19    |
| PREDICTED CLASS NAME 1 | LOC   |
| PREDICTED CLASS NAME 2 | state |
+------------------------+-------+


### Conclusion 
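As we can see, the model gets both labels right on three of the four sample questions above. On the third question it predicts `def` where the actual second label is `desc`. What's next?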

### Next Step
* In the next notebook we are going to use `FastText` to perform sentiment analysis on this dataset with two labels.