### Questions Classification Custom dataset and modified `FastText`.

In the previous notebook we looked at how to do question classification with two labels using packed padded sequences and an RNN. This time we are going to use [this notebook](https://github.com/CrispenGari/nlp-pytorch/blob/main/04_Questions_Cassification/02_Questions_Classification_FastText.ipynb) as the base of our `FastText` implementation.

The notebook will remain largely the same. Where there is a change, I will highlight it.

### Data preparation using torchtext.

In this notebook we are not going to prepare the data, because it has already been prepared for us and saved to files. All we need to do is load it.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Imports

In [2]:
import time
from prettytable import PrettyTable
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

import torch, os, random
from torch import nn
import torch.nn.functional as F

torch.__version__

'1.9.0+cu102'

### Setting up the seeds

In [3]:
SEED = 42

np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### File names

In [4]:
train_path_json = 'train.json'
test_path_json = 'test.json'
val_path_json = 'val.json'

### Fast Text.
According to the `FastText` paper we need to generate `bigram`s for each sentence.

We are going to create a function called `generate_bigrams()` that will generate bigrams for us. We will pass this function to the `TEXT` field as the preprocessing function.

In [5]:
def generate_bigrams(x):
  n_grams = set(zip(*[x[i: ] for i in range(2)]))
  for n_gram in n_grams:
      x.append(' '.join(n_gram))
  return x
generate_bigrams(['What', 'is', 'the', 'meaning', "of", "OCR", "in", "python"])

['What',
 'is',
 'the',
 'meaning',
 'of',
 'OCR',
 'in',
 'python',
 'OCR in',
 'in python',
 'of OCR',
 'meaning of',
 'is the',
 'What is',
 'the meaning']
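
Note that the function above appends to the list it receives and, because it goes through a `set`, the order of the bigrams can differ between runs. A minimal sketch of a variant that returns a new list with the bigrams in sentence order; `generate_bigrams_ordered` is a hypothetical name and is not used elsewhere in this notebook:

```python
def generate_bigrams_ordered(tokens):
    """Return the tokens followed by their unique bigrams, in first-seen order."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # dict.fromkeys drops duplicates while keeping insertion order
    unique_bigrams = list(dict.fromkeys(bigrams))
    return list(tokens) + unique_bigrams


tokens = ["What", "is", "the", "meaning", "of", "OCR"]
print(generate_bigrams_ordered(tokens))
```

Either version produces the same bag of unigrams and bigrams; the ordered one is only easier to inspect and test.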

### Creating the Fields.

We don't need to set the `include_lengths` argument to `True` this time around.

In [6]:
from torchtext.legacy import data

In [7]:
TEXT = data.Field(
    tokenize = "spacy",
    preprocessing = generate_bigrams,
    tokenizer_language = 'en_core_web_sm',
)
LABEL_1 = data.LabelField()
LABEL_2 = data.LabelField()

In [8]:
fields = {
  "Questions": ('text', TEXT),
  "Category1": ('label_1', LABEL_1),
  "Category2":('label_2', LABEL_2),
}

### Creating the dataset.

We are going to use `TabularDataset.splits()` to create the datasets.

In [9]:
files_path = '/content/drive/MyDrive/NLP Data/questions-classification/pytorch'
train_data, val_data, test_data = data.TabularDataset.splits(
   files_path,
   train=train_path_json,
   test= test_path_json,
   validation= val_path_json,
   format = "json",
   fields=fields
)
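
With `format = "json"`, `TabularDataset` reads one JSON object per line and uses the keys of the `fields` dictionary to pick values out of each object. The sketch below mimics that mapping in plain Python. The sample line is reconstructed from the first training example printed in this notebook, so the exact file contents are an assumption, and `parse_line` is a hypothetical helper:

```python
import json

# Maps JSON keys to attribute names, mirroring the `fields` dictionary above
# (the Field objects themselves are left out of this sketch).
field_names = {"Questions": "text", "Category1": "label_1", "Category2": "label_2"}

# Hypothetical line from train.json: one JSON object per line.
line = '{"Questions": "What is the name of Miss India 1994 ?", "Category1": "HUM", "Category2": "ind"}'


def parse_line(line, field_names):
    """Turn one JSON line into a dict keyed by attribute name."""
    record = json.loads(line)
    return {attr: record[key] for key, attr in field_names.items()}


example = parse_line(line, field_names)
print(example)
```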

In [10]:
len(train_data), len(test_data), len(val_data)

(5179, 28, 245)

In [11]:
print(vars(train_data.examples[0]))

{'text': ['What', 'is', 'the', 'name', 'of', 'Miss', 'India', '1994', '?', 'name of', '1994 ?', 'of Miss', 'Miss India', 'India 1994', 'is the', 'What is', 'the name'], 'label_1': 'HUM', 'label_2': 'ind'}


### Building the Vocabulary and Loading the `pretrained` word vectors.

We are going to use the `glove.6B.100d` word vectors, which were trained on 6 billion tokens. Each word is represented by a 100-dimensional vector.

**Note**: We should build the vocabulary on the `train` dataset only.

In [12]:
MAX_VOCAB_SIZE = 100_000_000

TEXT.build_vocab(
    train_data,
    max_size = MAX_VOCAB_SIZE,
    vectors = "glove.6B.100d",
    unk_init = torch.Tensor.normal_
)
LABEL_1.build_vocab(train_data)
LABEL_2.build_vocab(train_data)
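
Conceptually, `build_vocab` counts the tokens in the training set and gives each one an integer index, with the special `<unk>` and `<pad>` tokens placed first. A rough sketch of that idea in plain Python; this is not the torchtext implementation, and `build_vocab_sketch` and the toy sentences are made up for illustration:

```python
from collections import Counter


def build_vocab_sketch(tokenized_sentences, max_size=None, specials=("<unk>", "<pad>")):
    """Build token/index lookups from tokenized training sentences."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent tokens first; ties are broken alphabetically
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_size is not None:
        ordered = ordered[:max_size]
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


train_sentences = [["What", "is", "OCR", "?"], ["What", "is", "the", "name", "?"]]
stoi, itos = build_vocab_sketch(train_sentences)
# Tokens never seen in training fall back to the <unk> index
print(stoi.get("python", stoi["<unk>"]))
```

Building the lookup from the training split alone keeps validation and test words from leaking into the vocabulary.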


### Checking the labels

In [13]:
print("************************ FIRST LABELS ********************************")
print(LABEL_1.vocab.stoi)
print("************************ SECOND LABELS ********************************")
print(LABEL_2.vocab.stoi)

************************ FIRST LABELS ********************************
defaultdict(None, {'ENTY': 0, 'HUM': 1, 'DESC': 2, 'NUM': 3, 'LOC': 4, 'ABBR': 5})
************************ SECOND LABELS ********************************
defaultdict(None, {'ind': 0, 'other': 1, 'def': 2, 'count': 3, 'desc': 4, 'manner': 5, 'cremat': 6, 'date': 7, 'gr': 8, 'reason': 9, 'country': 10, 'city': 11, 'animal': 12, 'food': 13, 'dismed': 14, 'termeq': 15, 'period': 16, 'money': 17, 'exp': 18, 'state': 19, 'sport': 20, 'event': 21, 'product': 22, 'substance': 23, 'techmeth': 24, 'color': 25, 'dist': 26, 'perc': 27, 'veh': 28, 'word': 29, 'title': 30, 'mount': 31, 'body': 32, 'abb': 33, 'lang': 34, 'volsize': 35, 'plant': 36, 'symbol': 37, 'instru': 38, 'weight': 39, 'code': 40, 'letter': 41, 'speed': 42, 'temp': 43, 'ord': 44, 'currency': 45, 'religion': 46})


### Device.

In [14]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Creating iterators.

We are going to use our favorite iterator known as the `BucketIterator` to create iterators for all the sets that we have.

In [15]:
sort_key = lambda x: len(x.text)

BATCH_SIZE = 128

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train_data, val_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = sort_key,
)
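
The idea behind `BucketIterator` is to batch examples of similar length together so that each batch needs little padding. A simplified sketch of that idea in plain Python; the real iterator also shuffles and works on `Example` objects, and `bucket_batches` and `padding_needed` are hypothetical helpers:

```python
def bucket_batches(examples, batch_size, sort_key=len):
    """Group examples of similar length into batches."""
    ordered = sorted(examples, key=sort_key)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padding_needed(batch):
    """Number of <pad> tokens required to make the batch rectangular."""
    longest = max(len(x) for x in batch)
    return sum(longest - len(x) for x in batch)


# Toy "sentences" of different lengths
sentences = [["a"] * n for n in (9, 2, 8, 3, 10, 1)]
naive = [sentences[i:i + 2] for i in range(0, len(sentences), 2)]
bucketed = bucket_batches(sentences, batch_size=2)

naive_padding = sum(map(padding_needed, naive))
bucketed_padding = sum(map(padding_needed, bucketed))
print(naive_padding, bucketed_padding)
```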

### Creating the Model.

In [16]:
class QuestionsFastText(nn.Module):
  def __init__(self, 
               vocab_size,
               embedding_size,
               output_dim_1,
               output_dim_2,
               pad_index,
               dropout = .5
               ):
    super(QuestionsFastText, self).__init__()
    self.embedding = nn.Embedding(
        vocab_size,
        embedding_size,
        padding_idx = pad_index
    )
    self.hidden_fc = nn.Linear(
        embedding_size,
        out_features = 256
    )
    self.out_1 = nn.Linear(
        256,
        out_features = output_dim_1
    )
    self.out_2 = nn.Linear(
        256,
        out_features = output_dim_2
    )
    self.dropout = nn.Dropout(dropout)
  def forward(self, text):
    embedded = self.embedding(text).permute(1 ,0, 2)
    pooled = self.dropout(F.avg_pool2d(embedded,
                         (embedded.shape[1], 1)
                          ).squeeze(1))
    hidden_fc = self.dropout(self.hidden_fc(pooled))

    return self.out_1(hidden_fc), self.out_2(hidden_fc)
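
In `forward`, the `F.avg_pool2d` call with a kernel of `(sentence_length, 1)` averages the embeddings of all tokens in a sentence, which gives one vector per sentence. What that pooling computes can be written out in plain Python for a single sentence; `mean_pool` is a hypothetical helper and the numbers are toy values:

```python
def mean_pool(embedded):
    """Average a [sentence_length][embedding_dim] list over the sentence axis."""
    sentence_length = len(embedded)
    return [sum(column) / sentence_length for column in zip(*embedded)]


# Three tokens, each with a 2-dimensional embedding
embedded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(embedded))
```

Padded positions are included in the average too. Because the `<pad>` embedding is zero, heavily padded sentences end up with a sentence vector pulled toward zero.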

### Creating the model instance.

In [17]:

INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
OUTPUT_DIM_1 =  len(LABEL_1.vocab)
OUTPUT_DIM_2 =  len(LABEL_2.vocab)
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] 

questions_model = QuestionsFastText(
            INPUT_DIM, 
            EMBEDDING_DIM, 
            OUTPUT_DIM_1, 
            OUTPUT_DIM_2,
            pad_index = PAD_IDX
            ).to(device)
questions_model

QuestionsFastText(
  (embedding): Embedding(37259, 100, padding_idx=1)
  (hidden_fc): Linear(in_features=100, out_features=256, bias=True)
  (out_1): Linear(in_features=256, out_features=6, bias=True)
  (out_2): Linear(in_features=256, out_features=47, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Model parameters

In [18]:

def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(questions_model)
print(f"Total number of parameters: {n_params:,}\nTotal trainable parameters: {trainable_params:,}")


Total number of parameters: 3,765,377
Total trainable parameters: 3,765,377
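
The total can be checked by hand from the layer shapes printed above: the embedding contributes `vocab_size * embedding_size` weights, and each linear layer contributes `in_features * out_features` weights plus `out_features` biases. A short arithmetic check (the variable names are only for illustration):

```python
vocab_size, embedding_size, hidden = 37259, 100, 256
output_dim_1, output_dim_2 = 6, 47

embedding = vocab_size * embedding_size        # embedding matrix
hidden_fc = embedding_size * hidden + hidden   # weights + biases
out_1 = hidden * output_dim_1 + output_dim_1   # weights + biases
out_2 = hidden * output_dim_2 + output_dim_2   # weights + biases

total = embedding + hidden_fc + out_1 + out_2
print(f"{total:,}")
```

Almost all of the parameters sit in the embedding layer, which is typical for this kind of model.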


### Loading pretrained vectors to the embedding layer.

In [19]:
pretrained_embeddings  = TEXT.vocab.vectors

In [20]:
questions_model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 1.4463, -1.1674, -0.2216,  ..., -1.6196, -0.6633,  1.1526],
        [-0.8444, -0.5054,  0.2824,  ...,  1.2317, -0.8442,  0.2483],
        [ 1.1007,  0.2795, -0.3990,  ..., -0.7641,  0.7015,  0.8293]],
       device='cuda:0')

### Zeroing the `<pad>` and `<unk>` tokens.

In [21]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
questions_model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
questions_model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
questions_model.embedding.weight.data

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 1.4463, -1.1674, -0.2216,  ..., -1.6196, -0.6633,  1.1526],
        [-0.8444, -0.5054,  0.2824,  ...,  1.2317, -0.8442,  0.2483],
        [ 1.1007,  0.2795, -0.3990,  ..., -0.7641,  0.7015,  0.8293]],
       device='cuda:0')

### Loss and optimizer.
For the loss we are going to create two loss functions, one per label. We are going to use `Adam` as our optimizer.

In [22]:
optimizer = torch.optim.Adam(questions_model.parameters())
criterion_1 = nn.CrossEntropyLoss().to(device)
criterion_2 = nn.CrossEntropyLoss().to(device)
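
`nn.CrossEntropyLoss` takes raw scores (logits) and integer class indices. For a single example it is the negative log of the softmax probability of the correct class. A plain-Python sketch for one example, with made-up logits, that also shows the two losses being added into one value as the training function does:

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class for one example."""
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


logits_1 = [2.0, 0.5, -1.0]       # toy scores for the first label
logits_2 = [0.0, 0.0, 0.0, 0.0]   # toy scores for the second label

loss_1 = cross_entropy(logits_1, target=0)
loss_2 = cross_entropy(logits_2, target=3)
loss = loss_1 + loss_2  # combined loss used for back-propagation
print(round(loss_1, 4), round(loss_2, 4), round(loss, 4))
```

With uniform scores over `k` classes the loss is `log(k)`, which is a handy reference point: about 1.79 for 6 classes and about 3.85 for 47.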

### Accuracy function.
We are going to create the `categorical_accuracy()` function that will calculate the categorical accuracy for predicted labels and actual labels.

**Note**: This function remains the same as before; we are just going to reuse it.

In [23]:
def categorical_accuracy(preds, y):
  top_pred = preds.argmax(1, keepdim = True)
  correct = top_pred.eq(y.view_as(top_pred)).sum()
  return correct.float() / y.shape[0]
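
For a single batch, the function above picks the highest-scoring class in each row and reports the fraction of rows that match the target. The same computation on plain lists, with toy values; `categorical_accuracy_lists` is a hypothetical name:

```python
def categorical_accuracy_lists(preds, y):
    """Fraction of rows whose highest-scoring class equals the target."""
    top_pred = [row.index(max(row)) for row in preds]
    correct = sum(p == t for p, t in zip(top_pred, y))
    return correct / len(y)


preds = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
targets = [1, 0, 2, 2]
print(categorical_accuracy_lists(preds, targets))
```

The training and evaluation functions average these per-batch values, so a smaller final batch carries the same weight as a full one.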

### Training and Evaluation functions.

In the `train` and `evaluate` functions we are going to change a lot of things. I will highlight the changes using comments.

In [24]:
def train(model, iterator, optimizer, criterion_1, criterion_2):
    """
    Losses and accuracy should be of different labels
    """
    epoch_loss_1 = 0
    epoch_acc_1 = 0
    epoch_loss_2 = 0
    epoch_acc_2 = 0

    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text = batch.text
        """
        The model returns two predictions for different labels.
        """
        predictions_1, predictions_2 = model(text)
        predictions_1 = predictions_1.squeeze(1)
        predictions_2 = predictions_2.squeeze(1)

        """
        Get the loss for each label
        """
        loss_1 = criterion_1(predictions_1, batch.label_1) # we are using label 1 to calculate the loss for the first label
        loss_2 = criterion_2(predictions_2, batch.label_2) # we are using label 2 to calculate the loss for the second label

        acc_1 = categorical_accuracy(predictions_1, batch.label_1) # accuracy for the first label
        acc_2 = categorical_accuracy(predictions_2, batch.label_2) # accuracy for the second label
        
        """
        We have to sum the loss before back propagation
        """
        loss = loss_1 + loss_2
        loss.backward()
        optimizer.step()
        """
        ********* METRICS ************
        """
        epoch_loss_1 += loss_1.item()
        epoch_loss_2 += loss_2.item()
        epoch_acc_1 += acc_1.item()
        epoch_acc_2 += acc_2.item()
    return epoch_loss_1 / len(iterator), epoch_loss_2 / len(iterator), epoch_acc_1 / len(iterator), epoch_acc_2/ len(iterator)


def evaluate(model, iterator, criterion_1, criterion_2):
    """
    Losses and accuracy should be of different labels
    """
    epoch_loss_1 = 0
    epoch_acc_1 = 0
    epoch_loss_2 = 0
    epoch_acc_2 = 0

    model.eval()
    with torch.no_grad():
      for batch in iterator:
          text = batch.text
          """
          The model returns two predictions for different labels.
          """
          predictions_1, predictions_2 = model(text)
          predictions_1 = predictions_1.squeeze(1)
          predictions_2 = predictions_2.squeeze(1)
          """
          Get the loss for each label
          """
          loss_1 = criterion_1(predictions_1, batch.label_1) # we are using label 1 to calculate the loss for the first label
          loss_2 = criterion_2(predictions_2, batch.label_2) # we are using label 2 to calculate the loss for the second label

          acc_1 = categorical_accuracy(predictions_1, batch.label_1) # accuracy for the first label
          acc_2 = categorical_accuracy(predictions_2, batch.label_2) # accuracy for the second label
          """
          ********* METRICS ************
          """
          epoch_loss_1 += loss_1.item()
          epoch_loss_2 += loss_2.item()
          epoch_acc_1 += acc_1.item()
          epoch_acc_2 += acc_2.item()
    return epoch_loss_1 / len(iterator), epoch_loss_2 / len(iterator), epoch_acc_1 / len(iterator), epoch_acc_2/ len(iterator)

### Training loop.
We are going to create helper functions that will help us to visualize our training.

1. Time to string

In [25]:
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
    

2. Tabulate training epoch.

In [26]:
def visualize_training(start, end, train_loss_1, train_loss_2, train_accuracy_1, train_accuracy_2, 
                       val_loss_1, val_loss_2, val_accuracy_1, val_accuracy_2, title):
  data = [
       ["Training", f'{train_loss_1:.3f}',  f'{train_loss_2:.3f}', f'{train_accuracy_1:.3f}', f'{train_accuracy_2:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss_1:.3f}', f'{val_loss_2:.3f}', f'{val_accuracy_1:.3f}', f'{val_accuracy_2:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS_1", "LOSS_2", "ACCURACY_1", "ACCURACY_2", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["ETA"] = 'r'
  table.align["LOSS_1"] = 'r'
  table.align["ACCURACY_1"] = 'r'
  table.align["LOSS_2"] = 'r'
  table.align["ACCURACY_2"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)


In [27]:
N_EPOCHS = 100
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start = time.time()

    train_loss_1, train_loss_2, train_acc_1, train_acc_2 = train(questions_model, train_iter, 
                                                                 optimizer, criterion_1, criterion_2)
    
    valid_loss_1, valid_loss_2, valid_acc_1, valid_acc_2 = evaluate(questions_model, val_iter, 
                                                                    criterion_1, criterion_2)
    title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss_2 < best_valid_loss else 'not saving...'}"
    """
    We monitor the validation loss of the second label (47 classes) when
    deciding whether to save the model. Feel free to monitor a different loss.
    """
    if valid_loss_2 < best_valid_loss:
        best_valid_loss = valid_loss_2
        torch.save(questions_model.state_dict(), 'best-model.pt')
    end = time.time()
    visualize_training(start, end, train_loss_1, train_loss_2, train_acc_1, train_acc_2, 
                       valid_loss_1, valid_loss_2, valid_acc_1, valid_acc_2, title)


+---------------------------------------------------------------------+
|                  EPOCH: 01/100 saving best model...                 |
+------------+--------+--------+------------+------------+------------+
| CATEGORY   | LOSS_1 | LOSS_2 | ACCURACY_1 | ACCURACY_2 |        ETA |
+------------+--------+--------+------------+------------+------------+
| Training   |  1.692 |  3.490 |      0.264 |      0.146 | 0:00:00.29 |
| Validation |  1.617 |  3.100 |      0.324 |      0.165 |            |
+------------+--------+--------+------------+------------+------------+
+---------------------------------------------------------------------+
|                  EPOCH: 02/100 saving best model...                 |
+------------+--------+--------+------------+------------+------------+
| CATEGORY   | LOSS_1 | LOSS_2 | ACCURACY_1 | ACCURACY_2 |        ETA |
+------------+--------+--------+------------+------------+------------+
| Training   |  1.578 |  3.038 |      0.362 |      0.201 | 0:00:

### Model Evaluation.

In [28]:
questions_model.load_state_dict(torch.load('best-model.pt'))

test_loss_1, test_loss_2, test_acc_1, test_acc_2 = evaluate(questions_model, test_iter, criterion_1, criterion_2)
print(f'Test Loss 1: {test_loss_1:.3f} | Test Loss 2: {test_loss_2:.3f}  | Test Acc 1: {test_acc_1*100:.2f}% | Test Acc 2: {test_acc_2*100:.2f}%')

Test Loss 1: 0.852 | Test Loss 2: 1.692  | Test Acc 1: 67.86% | Test Acc 2: 50.00%


### Model Inference.

We are now ready to make predictions with our model.

In [29]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [30]:
reversed_labels_1 = dict([(v, k) for (k, v) in LABEL_1.vocab.stoi.items()])
reversed_labels_2 = dict([(v, k) for (k, v) in LABEL_2.vocab.stoi.items()])

reversed_labels_1, reversed_labels_2

({0: 'ENTY', 1: 'HUM', 2: 'DESC', 3: 'NUM', 4: 'LOC', 5: 'ABBR'},
 {0: 'ind',
  1: 'other',
  2: 'def',
  3: 'count',
  4: 'desc',
  5: 'manner',
  6: 'cremat',
  7: 'date',
  8: 'gr',
  9: 'reason',
  10: 'country',
  11: 'city',
  12: 'animal',
  13: 'food',
  14: 'dismed',
  15: 'termeq',
  16: 'period',
  17: 'money',
  18: 'exp',
  19: 'state',
  20: 'sport',
  21: 'event',
  22: 'product',
  23: 'substance',
  24: 'techmeth',
  25: 'color',
  26: 'dist',
  27: 'perc',
  28: 'veh',
  29: 'word',
  30: 'title',
  31: 'mount',
  32: 'body',
  33: 'abb',
  34: 'lang',
  35: 'volsize',
  36: 'plant',
  37: 'symbol',
  38: 'instru',
  39: 'weight',
  40: 'code',
  41: 'letter',
  42: 'speed',
  43: 'temp',
  44: 'ord',
  45: 'currency',
  46: 'religion'})

In [31]:
def tabulate(column_names, data, title="QUESTIONS PREDICTIONS TABLE"):
  table = PrettyTable(column_names)
  table.align[column_names[0]] = "l"
  table.align[column_names[1]] = "l"
  for row in data:
    table.add_row(row)
  print(table)

def predict_question_type(model, sentence, min_len = 5, actual_class_1=0, actual_class_2=0):
    model.eval()
    with torch.no_grad():
      tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
     
      if len(tokenized) < min_len:
          tokenized += ['<pad>'] * (min_len - len(tokenized))
      indexed = [TEXT.vocab.stoi[t] for t in tokenized]
      tensor = torch.LongTensor(indexed).to(device)
      tensor = tensor.unsqueeze(1)
      probabilities_1, probabilities_2 = model(tensor)
      prediction_1 = torch.argmax(probabilities_1, dim=1).item()
      prediction_2 = torch.argmax(probabilities_2, dim=1).item()
      table_headers =["KEY", "VALUE"]
      table_data = [
          ["PREDICTED CLASS 1",  prediction_1],
          ["ACTUAL CLASS 1", actual_class_1],
          ["PREDICTED CLASS 2",  prediction_2],
          ["ACTUAL CLASS 2", actual_class_2],
          ["PREDICTED CLASS NAME 1",  reversed_labels_1[prediction_1]],    
          ["PREDICTED CLASS NAME 2",  reversed_labels_2[prediction_2]],    
      ]
      tabulate(table_headers, table_data)
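
One detail worth checking: during training every example goes through `generate_bigrams`, while `predict_question_type` above indexes the plain tokens only. A hedged sketch of a preprocessing helper that mirrors the training pipeline. `preprocess_for_inference` is a hypothetical name, whitespace splitting stands in for the spaCy tokenizer, and whether this changes the predictions shown in this notebook has not been verified:

```python
def generate_bigrams(x):
    # Same function as defined earlier in the notebook
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x


def preprocess_for_inference(sentence, min_len=5, tokenize=str.split):
    """Tokenize, add bigrams as in training, then pad short inputs."""
    tokens = generate_bigrams(tokenize(sentence))
    if len(tokens) < min_len:
        tokens += ['<pad>'] * (min_len - len(tokens))
    return tokens


print(preprocess_for_inference("What is OCR ?"))
```

In the notebook, `tokenize` would be replaced by the spaCy tokenizer, for example `lambda s: [tok.text for tok in nlp.tokenizer(s)]`, and the resulting tokens would be indexed with `TEXT.vocab.stoi` as before.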


###  Entity and Other

In [32]:
predict_question_type(questions_model, "What kind of weapons were used in Medieval warfare ?", 
                      actual_class_1=LABEL_1.vocab.stoi["ENTY"], actual_class_2=LABEL_2.vocab.stoi["other"]
                      )

+------------------------+-------+
| KEY                    | VALUE |
+------------------------+-------+
| PREDICTED CLASS 1      | 0     |
| ACTUAL CLASS 1         | 0     |
| PREDICTED CLASS 2      | 20    |
| ACTUAL CLASS 2         | 1     |
| PREDICTED CLASS NAME 1 | ENTY  |
| PREDICTED CLASS NAME 2 | sport |
+------------------------+-------+


### Human and IND

In [33]:
predict_question_type(questions_model, "Whose video is titled Shape Up with Arnold ?", 
                      actual_class_1=LABEL_1.vocab.stoi["HUM"], actual_class_2=LABEL_2.vocab.stoi["ind"]
                      )

+------------------------+-------+
| KEY                    | VALUE |
+------------------------+-------+
| PREDICTED CLASS 1      | 1     |
| ACTUAL CLASS 1         | 1     |
| PREDICTED CLASS 2      | 0     |
| ACTUAL CLASS 2         | 0     |
| PREDICTED CLASS NAME 1 | HUM   |
| PREDICTED CLASS NAME 2 | ind   |
+------------------------+-------+


### Description and DESC.

In [34]:
predict_question_type(questions_model, "What 's the Olympic motto ?", 
                      actual_class_1=LABEL_1.vocab.stoi["DESC"], actual_class_2=LABEL_2.vocab.stoi["desc"]
                      )

+------------------------+-------+
| KEY                    | VALUE |
+------------------------+-------+
| PREDICTED CLASS 1      | 1     |
| ACTUAL CLASS 1         | 2     |
| PREDICTED CLASS 2      | 0     |
| ACTUAL CLASS 2         | 4     |
| PREDICTED CLASS NAME 1 | HUM   |
| PREDICTED CLASS NAME 2 | ind   |
+------------------------+-------+


### Location and STATE

In [35]:
predict_question_type(questions_model, "What state full of milk and honey was the destination in The Grapes of Wrath ?", 
                      actual_class_1=LABEL_1.vocab.stoi["LOC"], actual_class_2=LABEL_2.vocab.stoi["state"]
                      )

+------------------------+-------+
| KEY                    | VALUE |
+------------------------+-------+
| PREDICTED CLASS 1      | 4     |
| ACTUAL CLASS 1         | 4     |
| PREDICTED CLASS 2      | 8     |
| ACTUAL CLASS 2         | 19    |
| PREDICTED CLASS NAME 1 | LOC   |
| PREDICTED CLASS NAME 2 | gr    |
+------------------------+-------+


### Conclusion 

Our model did not perform well this time around, but it is fine on our toy dataset. This may be because the data is unbalanced on the second label, which has 47 classes.
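
A quick way to check the imbalance hypothesis is to count how often each second-level label occurs in the training data. A minimal sketch with a made-up list of labels; in the notebook the labels could be collected with something like `[ex.label_2 for ex in train_data.examples]`, which is an assumption about how one would gather them:

```python
from collections import Counter

# Toy labels for illustration only; not the real distribution
labels = ["ind", "ind", "ind", "other", "other", "def", "date"]

counts = Counter(labels)
total = sum(counts.values())
for label, count in counts.most_common():
    print(f"{label:<8}{count:>4}{count / total:>8.1%}")

# Ratio between the most and least frequent classes
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A large ratio would support the explanation above; a small one would suggest looking elsewhere, for example at the very small test set.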

### Next Step
* In the next notebook we are going to use `ConvNets` to perform question classification on this dataset with two labels.