### Questions Classification Custom dataset Conv2D.

In this notebook we use the previous notebook as the base for a Conv2D model on question classification. In the previous notebook we learnt how to implement the `FastText` model, which reached an accuracy of `100%` and a loss of `0` on the train, validation and test sets.


### First let's mount our drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Imports

In [2]:
import time
from prettytable import PrettyTable
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

import torch, os, random
from torch import nn
import torch.nn.functional as F

torch.__version__

'1.9.0+cu102'

### Setting up the seeds

In [3]:
SEED = 42

np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Loading files.

We have three files, one for each of the sets that were created:
```
train.csv
test.csv
val.csv
```

We are going to use torchtext to load these files.

**Note:** In the previous notebooks we loaded our files as `json` files. This time around we are going to load `csv` files instead. The procedure is the same.

### Paths

In [4]:
files_path = '/content/drive/MyDrive/NLP Data/questions-classification/pytorch'

In [5]:
train_path = 'train.csv'
test_path = 'test.csv'
val_path = 'val.csv'

### Conv2D

Conv nets are not the best choice for processing sequence data, but in this notebook we are going to use them anyway. **Why not?**

In the previous notebook we created a function called `generate_bigrams` and passed it to our `TEXT` field as a preprocessing function, because that is what the fastText paper describes. In this notebook we are not going to do that. All we have to do is set the `batch_first` argument to `True`, because conv nets expect the batch size to be the first dimension of the input.

### Creating the Fields.


In [6]:
from torchtext.legacy import data

In [7]:
TEXT = data.Field(
  tokenize="spacy",
  tokenizer_language="en_core_web_sm",
  batch_first=True,
)
LABEL = data.LabelField()

In [8]:
fields = {
  "Questions": ('text', TEXT),
  "Category1": ('label', LABEL)
}

### Creating the dataset.

We are going to use `TabularDataset.splits()` to create the datasets.

In [9]:
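train_data, val_data, test_data = data.TabularDataset.splits(
   files_path,
   train=train_path,
   # Each split reads its own file. The outputs recorded in this notebook come
   # from a run that passed train_path for all three splits, which is why the
   # three sizes printed below are identical.
   validation=val_path,
   test=test_path,
   format="csv",
   fields=fields,
)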

In [10]:
len(train_data), len(test_data), len(val_data)

(5179, 5179, 5179)

In [11]:
print(vars(train_data.examples[0]))

{'text': ['What', 'is', 'the', 'name', 'of', 'Miss', 'India', '1994', '?'], 'label': 'HUM'}


### Building the Vocabulary and Loading the `pretrained` word vectors.

We are going to use the `glove.6B.100d` word vectors, which were trained on 6 billion tokens; each word is a 100-dimensional vector.

**Note:** We should build the vocabulary on the `train` dataset only.

In [12]:
MAX_VOCAB_SIZE = 100_000_000

TEXT.build_vocab(
    train_data,
    max_size = MAX_VOCAB_SIZE,
    vectors = "glove.6B.100d",
    unk_init = torch.Tensor.normal_
)
LABEL.build_vocab(train_data)


.vector_cache/glove.6B.zip: 862MB [02:39, 5.39MB/s]                          
100%|█████████▉| 399278/400000 [00:15<00:00, 25431.66it/s]
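
To make the "train only" rule concrete, here is a minimal pure-Python sketch of what building a vocabulary amounts to. It is a simplification and not torchtext's implementation: `build_vocab`, `numericalize` and the toy questions are hypothetical names and data chosen for illustration. The point it shows is that any token missing from the training data falls back to the `<unk>` index.

```python
from collections import Counter

# Hypothetical toy data; the notebook itself builds the vocabulary with torchtext.
train_questions = [
    ["What", "is", "the", "name", "of", "Miss", "India", "1994", "?"],
    ["What", "is", "the", "root", "of", "all", "evil", "?"],
]

def build_vocab(tokenized_sentences, specials=("<unk>", "<pad>")):
    """Map each training token to an index, reserving the first slots for specials."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Most frequent tokens first; ties are broken alphabetically for a stable order.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    return {tok: idx for idx, tok in enumerate(itos)}

stoi = build_vocab(train_questions)
unk_idx = stoi["<unk>"]

def numericalize(tokens):
    # Tokens never seen during training map to the <unk> index.
    return [stoi.get(tok, unk_idx) for tok in tokens]

ids = numericalize(["What", "is", "the", "capital", "?"])
print(ids)
```

Here `"capital"` never appears in the toy training questions, so it is the only token that receives the `<unk>` index.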

### Device.

In [13]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [14]:
LABEL.vocab.stoi

defaultdict(None,
            {'ABBR': 5, 'DESC': 2, 'ENTY': 0, 'HUM': 1, 'LOC': 4, 'NUM': 3})

### Creating iterators.

We are going to use our favorite iterator known as the `BucketIterator` to create iterators for all the sets that we have.

For the `batch_size` this time around we want to test a huge batch.

In [15]:
BATCH_SIZE = 128
train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train_data, val_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.text),
)
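
The idea behind bucketing can be sketched without torchtext: group examples of similar length into the same batch so that less padding is needed. The code below is an illustrative simplification, not the `BucketIterator` implementation; `bucket_batches`, `padding_needed` and the lengths are hypothetical.

```python
import random

def bucket_batches(lengths, batch_size, seed=0):
    """Group example indices so that each batch holds similar lengths."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)  # shuffle the batches, not their contents
    return batches

def padding_needed(lengths, batches):
    """Count the pad tokens required to make every batch rectangular."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total

lengths = [3, 12, 4, 11, 5, 10, 6, 9]   # hypothetical question lengths
naive = [[0, 1, 2, 3], [4, 5, 6, 7]]    # batches taken in the original order
bucketed = bucket_batches(lengths, batch_size=4)

print("padding without bucketing:", padding_needed(lengths, naive))
print("padding with bucketing:   ", padding_needed(lengths, bucketed))
```

With these toy lengths the bucketed batches need fewer than half the pad tokens of the naive ones.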

### Creating the Conv2D Model.

* In this notebook I'm not going to explain how conv nets work. If you want to understand more about them, I recommend [this](https://github.com/CrispenGari/pytorch-python/blob/main/09_TorchText/02_Sentiment_Analyisis_Series/04_CNN_Sentiment_Analyisis.ipynb) notebook, which has a clear explanation of how conv nets work on sequential data.


* We are going to create a generic `ConvNet` model that accepts a number of parameters, and all the magic will happen behind the scenes. **Really!!**

In [16]:
class QuestionsConv2DNet(nn.Module):
  def __init__(self,
               vocab_size,
               embedding_size,
               n_filters,
               filter_sizes,
               output_size,
               pad_idx,
               dropout=.5
               ):
    super(QuestionsConv2DNet, self).__init__()

    self.embedding = nn.Embedding(
        vocab_size, embedding_size, padding_idx=pad_idx
    )
    self.convs = nn.ModuleList([
        nn.Conv2d(
            in_channels=1,
            out_channels = n_filters,
            kernel_size=(fs, embedding_size)
        ) for fs in filter_sizes 
    ])

    self.fc = nn.Linear(len(filter_sizes) * n_filters, output_size)
    self.dropout = nn.Dropout(dropout)

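  def forward(self, text):
    # text: [batch, seq_len] -> embedded: [batch, 1, seq_len, embedding_size]
    embedded = self.embedding(text).unsqueeze(1)
    # each conv output: [batch, n_filters, seq_len - fs + 1] once the width-1 axis is dropped
    conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
    # max over all positions leaves one value per filter: [batch, n_filters]
    pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2)
            for conv in conved
            ]
    # cat: [batch, n_filters * len(filter_sizes)]
    cat = self.dropout(torch.cat(pooled, dim=1))
    return self.fc(cat)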
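
The shapes flowing through `forward` can be checked with plain arithmetic. The sketch below assumes stride 1 and no padding, which are the `nn.Conv2d` defaults; `conv_feature_shapes` is a hypothetical helper written only for this illustration.

```python
def conv_feature_shapes(seq_len, filter_sizes, n_filters):
    """Return (conv output heights, size of the concatenated feature vector)."""
    if seq_len < max(filter_sizes):
        raise ValueError("sentence is shorter than the largest filter")
    # A (fs, embedding_size) kernel with stride 1 and no padding slides over
    # seq_len - fs + 1 positions and collapses the embedding axis to width 1.
    heights = [seq_len - fs + 1 for fs in filter_sizes]
    # Max pooling over all positions keeps one value per filter.
    n_features = len(filter_sizes) * n_filters
    return heights, n_features

FILTER_SIZES = [3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5]
heights, n_features = conv_feature_shapes(9, FILTER_SIZES, n_filters=100)
print(heights)     # one height per filter size for a 9-token question
print(n_features)  # input size of the final linear layer
```

For a 9-token question the size-3 filters produce 7 positions and the size-5 filters produce 5, and the pooled features concatenate to 1100 values, which is the `in_features` of `fc`. The `ValueError` branch also shows why a question must contain at least as many tokens as the largest filter size.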

### Creating the model instance.

In [17]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
N_FILTERS = 100
FILTER_SIZES = [3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5]
OUTPUT_DIM =  6
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] 
DROPOUT = 0.5

questions_model = QuestionsConv2DNet(
            INPUT_DIM, 
            EMBEDDING_DIM, 
            N_FILTERS,
            FILTER_SIZES,
            OUTPUT_DIM, 
            pad_idx = PAD_IDX,
            dropout=DROPOUT
            ).to(device)
questions_model

QuestionsConv2DNet(
  (embedding): Embedding(9053, 100, padding_idx=1)
  (convs): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (3): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (4): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (5): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (6): Conv2d(1, 100, kernel_size=(5, 100), stride=(1, 1))
    (7): Conv2d(1, 100, kernel_size=(5, 100), stride=(1, 1))
    (8): Conv2d(1, 100, kernel_size=(5, 100), stride=(1, 1))
    (9): Conv2d(1, 100, kernel_size=(5, 100), stride=(1, 1))
    (10): Conv2d(1, 100, kernel_size=(5, 100), stride=(1, 1))
  )
  (fc): Linear(in_features=1100, out_features=6, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Model parameters

In [18]:
def count_trainable_params(model):
  # returns (total parameters, trainable parameters)
  total = sum(p.numel() for p in model.parameters())
  trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
  return total, trainable

n_params, trainable_params = count_trainable_params(questions_model)
print(f"Total number of parameters: {n_params:,}\nTotal trainable parameters: {trainable_params:,}")


Total number of parameters: 1,343,006
Total trainable parameters: 1,343,006
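
The reported total can be reproduced by hand from the layer shapes in the model summary. This is a small sanity-check sketch; `conv2d_params` and `linear_params` are hypothetical helpers, and the sizes are copied from the printed model (vocabulary of 9053, 100-dimensional embeddings, 100 filters per conv, 6 classes).

```python
def conv2d_params(in_channels, out_channels, kernel_h, kernel_w):
    # kernel weights plus one bias per output channel
    return in_channels * out_channels * kernel_h * kernel_w + out_channels

def linear_params(in_features, out_features):
    # weight matrix plus one bias per output
    return in_features * out_features + out_features

VOCAB_SIZE, EMBEDDING_DIM, N_FILTERS, OUTPUT_DIM = 9053, 100, 100, 6
FILTER_SIZES = [3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5]

embedding = VOCAB_SIZE * EMBEDDING_DIM
convs = sum(conv2d_params(1, N_FILTERS, fs, EMBEDDING_DIM) for fs in FILTER_SIZES)
fc = linear_params(len(FILTER_SIZES) * N_FILTERS, OUTPUT_DIM)

total = embedding + convs + fc
print(f"embedding={embedding:,} convs={convs:,} fc={fc:,} total={total:,}")
```

Roughly two thirds of the parameters sit in the embedding table, so the vocabulary size dominates the model size.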


### Loading pretrained vectors to the embedding layer.

In [19]:
pretrained_embeddings  = TEXT.vocab.vectors

In [20]:
questions_model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 0.0091,  0.2810,  0.7356,  ..., -0.7508,  0.8967, -0.7631],
        [ 0.2906,  0.3217,  0.2419,  ..., -0.9444, -0.3790,  0.6196],
        [-0.3898, -0.5949,  0.2729,  ..., -1.0948,  0.8617, -0.4429]],
       device='cuda:0')

### Zeroing the `<pad>` and `<unk>` tokens.

In [21]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
questions_model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
questions_model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
questions_model.embedding.weight.data

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 0.0091,  0.2810,  0.7356,  ..., -0.7508,  0.8967, -0.7631],
        [ 0.2906,  0.3217,  0.2419,  ..., -0.9444, -0.3790,  0.6196],
        [-0.3898, -0.5949,  0.2729,  ..., -1.0948,  0.8617, -0.4429]],
       device='cuda:0')

### Loss and optimizer.
We are going to use Adam as our optimizer with the default learning rate. We are also going to use `CrossEntropyLoss()` as our loss function.

In [22]:
optimizer = torch.optim.Adam(questions_model.parameters())
criterion = nn.CrossEntropyLoss().to(device)
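
`CrossEntropyLoss` takes the raw scores (logits) from the model and applies log-softmax internally, which is why the model has no softmax layer. The following pure-Python sketch shows the quantity being computed for a 6-class problem; `cross_entropy` and `batch_cross_entropy` are hypothetical helpers and the logits are made up.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]

def batch_cross_entropy(batch_logits, targets):
    """Mean loss over a batch, matching the default 'mean' reduction."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

uniform = cross_entropy([0.0] * 6, target=1)                           # no preference
confident = cross_entropy([0.0, 10.0, 0.0, 0.0, 0.0, 0.0], target=1)  # sure and right

print(f"uniform guess:   {uniform:.4f}")
print(f"confident guess: {confident:.4f}")
```

A model that has learnt nothing about six classes sits near `log(6)`, while a confident correct prediction drives the loss towards `0`.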

### Accuracy function.
We are going to create the `categorical_accuracy()` function that will calculate the categorical accuracy for predicted labels and actual labels.

In [23]:
def categorical_accuracy(preds, y):
  top_pred = preds.argmax(1, keepdim = True)
  correct = top_pred.eq(y.view_as(top_pred)).sum()
  return correct.float() / y.shape[0]
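
The same computation can be written without tensors, which makes the logic easy to check by hand. This is an illustrative equivalent, not a replacement for the function above; `categorical_accuracy_lists` and the scores are hypothetical.

```python
def categorical_accuracy_lists(preds, labels):
    """Fraction of rows whose highest score sits at the labelled index."""
    correct = 0
    for scores, label in zip(preds, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
        correct += int(predicted == label)
    return correct / len(labels)

preds = [
    [2.0, 0.1, 0.3],  # argmax 0
    [0.2, 1.5, 0.1],  # argmax 1
    [0.9, 0.8, 0.1],  # argmax 0
    [0.1, 0.2, 3.0],  # argmax 2
]
labels = [0, 1, 2, 2]

acc = categorical_accuracy_lists(preds, labels)
print(acc)
```

Three of the four rows are predicted correctly, so the accuracy is `0.75`.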

### Training and Evaluation functions.

In [24]:
def train(model, iterator, optimizer, criterion):
    epoch_loss ,epoch_acc = 0, 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text = batch.text
        predictions = model(text)
        loss = criterion(predictions, batch.label)
        acc = categorical_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    epoch_loss , epoch_acc = 0, 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text = batch.text
            predictions = model(text)
            loss = criterion(predictions, batch.label)
            acc = categorical_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Training loop.
We are going to create helper functions that will help us to visualize our training.

1. Time to string

In [25]:
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
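
As a quick check of the formatting arithmetic, here is an equivalent sketch built on `divmod`. `format_elapsed` is a hypothetical name used only for this illustration; it is meant to produce the same strings as `hms_string` for the inputs shown.

```python
def format_elapsed(seconds):
    """Format a duration in seconds as H:MM:SS.ss."""
    minutes, secs = divmod(seconds, 60)        # split off the seconds part
    hours, minutes = divmod(int(minutes), 60)  # split whole minutes into hours
    return f"{hours}:{minutes:02d}:{secs:05.2f}"

short_run = format_elapsed(32.42)   # about one training epoch in this notebook
long_run = format_elapsed(3725.5)   # a little over an hour
print(short_run)
print(long_run)
```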
    

2. Tabulate training epoch.

In [26]:
def visualize_training(start, end, train_loss, train_accuracy, val_loss, val_accuracy, title):
  data = [
       ["Training", f'{train_loss:.3f}', f'{train_accuracy:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss:.3f}', f'{val_accuracy:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS", "ACCURACY", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["LOSS"] = 'r'
  table.align["ACCURACY"] = 'r'
  table.align["ETA"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)


In [37]:
N_EPOCHS = 30
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start = time.time()
    train_loss, train_acc = train(questions_model, train_iter, optimizer, criterion)
    valid_loss, valid_acc = evaluate(questions_model, val_iter, criterion)
    title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(questions_model.state_dict(), 'best-model.pt')
    end = time.time()
    visualize_training(start, end, train_loss, train_acc, valid_loss, valid_acc, title)


+--------------------------------------------+
|     EPOCH: 01/30 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.589 |    0.799 | 0:00:32.42 |
| Validation | 0.413 |    0.880 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 02/30 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.411 |    0.869 | 0:00:32.61 |
| Validation | 0.274 |    0.925 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 03/30 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   

### Model Evaluation.

In [38]:
questions_model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(questions_model, test_iter, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.002 | Test Acc: 100.00%


### Model Inference.

We are now ready to make predictions with our model.

In [39]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [40]:
reversed_labels = dict([(v, k) for (k, v) in LABEL.vocab.stoi.items()])
reversed_labels

{0: 'ENTY', 1: 'HUM', 2: 'DESC', 3: 'NUM', 4: 'LOC', 5: 'ABBR'}

In [41]:
def tabulate(column_names, data, title="QUESTIONS PREDICTIONS TABLE"):
  table = PrettyTable(column_names)
  table.align[column_names[0]] = "l"
  table.align[column_names[1]] = "l"
  for row in data:
    table.add_row(row)
  print(table)

def predict_question_type(model, sentence, min_len = 5, actual_class=0):
    model.eval()
    with torch.no_grad():
      tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
     
      if len(tokenized) < min_len:
          tokenized += ['<pad>'] * (min_len - len(tokenized))
      indexed = [TEXT.vocab.stoi[t] for t in tokenized]
      tensor = torch.LongTensor(indexed).to(device).unsqueeze(0)
      logits = model(tensor)  # raw class scores; the argmax is the same with or without softmax
      prediction = torch.argmax(logits, dim=1)
      prediction = prediction.item()
    
      table_headers =["KEY", "VALUE"]
      table_data = [
          ["PREDICTED CLASS",  prediction],
          ["ACTUAL CLASS", actual_class],
          ["PREDICTED CLASS NAME",  reversed_labels[prediction]],    
      ]
      tabulate(table_headers, table_data)
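
The preprocessing inside `predict_question_type` can be traced on plain lists. Padding short questions up to `min_len = 5` presumably matches the largest filter size, since a size-5 filter cannot slide over fewer than five tokens. In the sketch below `prepare_tokens`, `decode_prediction`, the toy vocabulary and the logits are hypothetical, and a plain `dict` with `.get` stands in for the vocabulary lookup; only the label mapping is copied from the notebook.

```python
def prepare_tokens(tokens, stoi, min_len=5, pad_token="<pad>", unk_token="<unk>"):
    """Pad a short question to min_len tokens, then map tokens to indices."""
    padded = list(tokens)
    if len(padded) < min_len:
        padded += [pad_token] * (min_len - len(padded))
    # Unknown tokens fall back to the <unk> index.
    return [stoi.get(tok, stoi[unk_token]) for tok in padded]

def decode_prediction(logits, labels):
    """Pick the class with the highest score and look up its name."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return best, labels[best]

stoi = {"<unk>": 0, "<pad>": 1, "Who": 2, "is": 3, "?": 4}  # hypothetical vocabulary
labels = {0: 'ENTY', 1: 'HUM', 2: 'DESC', 3: 'NUM', 4: 'LOC', 5: 'ABBR'}

ids = prepare_tokens(["Who", "is", "Ada", "?"], stoi)
pred = decode_prediction([0.1, 2.3, 0.4, -1.0, 0.0, 0.2], labels)  # made-up logits
print(ids, pred)
```

The four-token question gains one `<pad>` index at the end, and the unseen name `"Ada"` becomes the `<unk>` index.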


### Location

In [43]:
predict_question_type(questions_model, "What are the largest libraries in the US ?", actual_class=4)

+----------------------+-------+
| KEY                  | VALUE |
+----------------------+-------+
| PREDICTED CLASS      | 4     |
| ACTUAL CLASS         | 4     |
| PREDICTED CLASS NAME | LOC   |
+----------------------+-------+


### Human

In [44]:
predict_question_type(questions_model, "Who is John Macarthur , 1767-1834 ?", actual_class=1)

+----------------------+-------+
| KEY                  | VALUE |
+----------------------+-------+
| PREDICTED CLASS      | 1     |
| ACTUAL CLASS         | 1     |
| PREDICTED CLASS NAME | HUM   |
+----------------------+-------+


### Description

In [45]:
predict_question_type(questions_model, "What is the root of all evil ? ", actual_class=2)

+----------------------+-------+
| KEY                  | VALUE |
+----------------------+-------+
| PREDICTED CLASS      | 2     |
| ACTUAL CLASS         | 2     |
| PREDICTED CLASS NAME | DESC  |
+----------------------+-------+


### Numeric

In [46]:
predict_question_type(questions_model, "How many watts make a kilowatt ?", actual_class=3)

+----------------------+-------+
| KEY                  | VALUE |
+----------------------+-------+
| PREDICTED CLASS      | 3     |
| ACTUAL CLASS         | 3     |
| PREDICTED CLASS NAME | NUM   |
+----------------------+-------+


### Entity

In [47]:

predict_question_type(questions_model, "What films featured the character Popeye Doyle ?", actual_class=0)

+----------------------+-------+
| KEY                  | VALUE |
+----------------------+-------+
| PREDICTED CLASS      | 0     |
| ACTUAL CLASS         | 0     |
| PREDICTED CLASS NAME | ENTY  |
+----------------------+-------+


### Abbreviation

In [48]:
predict_question_type(questions_model, "What does NECROSIS stands for ?", actual_class=5)

+----------------------+-------+
| KEY                  | VALUE |
+----------------------+-------+
| PREDICTED CLASS      | 5     |
| ACTUAL CLASS         | 5     |
| PREDICTED CLASS NAME | ABBR  |
+----------------------+-------+


### Conclusion

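We were able to create our model and reach a loss close to `0` (`0.002` on the test set) and `100%` accuracy on the validation and test data. Keep in mind that the recorded run loaded `train.csv` for all three splits (every set has `5179` examples), so these scores were measured on the training questions.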

### Next Step
* In the next notebook we are going to use `Conv1D` to classify the questions in this dataset.