### Questions Classification Custom dataset.

In this notebook we are going to learn how to load the questions dataset using torchtext and prepare it for sentiment classification in pytorch. We are going to use [this series](https://github.com/CrispenGari/pytorch-python/tree/main/09_TorchText/02_Sentiment_Analyisis_Series) as the base of our code.

In this series we will learn the following:

1. Creating our own dataset using torchtext
2. Using RNN and packed padded sequences for sentiment analysis
3. Using Fasttext for sentiment analyisis
4. Using ConvNet's to do sentiment analyisis.

### 1. Data preparation using torchtext.
* Refer to [this](https://github.com/CrispenGari/pytorch-python/blob/main/09_TorchText/02_Sentiment_Analyisis_Series/09_TorchText_with_custom_data.ipynb)  and [this](https://github.com/CrispenGari/pytorch-python/blob/main/09_TorchText/01_Introduction/01_TorchText.ipynb) notebook for claryification.

I've already uploaded the files which we are going to use in my google drive. So the first thing we should do is to mount the google drive.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Imports

In [2]:
import time
from prettytable import PrettyTable
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

import torch, os, random
from torch import nn
import torch.nn.functional as F

torch.__version__

'1.9.0+cu102'

### Setting up the seeds

In [3]:
SEED = 42

np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deteministic = True

### Splitting sets
We are going to create 3 files which are:

```
train.csv
test.csv
val.csv
```

In [4]:
base_path = '/content/drive/MyDrive/NLP Data/questions-classification'
file_path = os.path.join(base_path, "Question_Classification_Dataset.csv")

In [5]:
dataframe = pd.read_csv(file_path)
dataframe.head(1)

Unnamed: 0.1,Unnamed: 0,Questions,Category0,Category1,Category2
0,0,How did serfdom develop in and then leave Russ...,DESCRIPTION,DESC,manner


In [6]:
dataframe.isnull().any()

Unnamed: 0    False
Questions     False
Category0     False
Category1     False
Category2     False
dtype: bool

### Splitting the sets.

In [7]:
from sklearn.model_selection import train_test_split


In [8]:
train, valid = train_test_split(dataframe, test_size=.05)
valid, test = train_test_split(valid, test_size=.10)
len(train), len(test), len(valid)

(5179, 28, 245)

### Saving the files.

In [9]:
train_path = 'train.csv'
test_path = 'test.csv'
val_path = 'val.csv'

In [10]:
if not os.path.exists(os.path.join(base_path, "pytorch")):
  os.makedirs(os.path.join(base_path, "pytorch"))

valid.to_csv(os.path.join(base_path, "pytorch", val_path), index=False)
test.to_csv(os.path.join(base_path, "pytorch", test_path), index=False)
train.to_csv(os.path.join(base_path, "pytorch", train_path), index=False)

print("files saved")

files saved


### Loading files.

Now we have 3 files for three sets that were created which are:
```
train.csv
test.csv
val.csv
```

We are going to use torchtext to load these files.

### Paths

In [11]:
files_path = '/content/drive/MyDrive/NLP Data/questions-classification/pytorch'

### Creating `.json` files for all the three sets.

In [12]:
train_path_json = 'train.json'
test_path_json = 'test.json'
val_path_json = 'val.json'

### Creating dataframes for each set so that we can easily convert the files to `.json` files.

In [13]:
train_dataframe = pd.read_csv(os.path.join(files_path, train_path))
test_dataframe = pd.read_csv(os.path.join(files_path, test_path))
val_dataframe = pd.read_csv(os.path.join(files_path, val_path))

train_dataframe.to_json(os.path.join(files_path, train_path_json),  orient="records", lines=True)
test_dataframe.to_json(os.path.join(files_path, test_path_json),  orient="records", lines=True)
val_dataframe.to_json(os.path.join(files_path, val_path_json),  orient="records", lines=True)

print("saved!")

saved!


### Creating the Fields.

In this first notebook our task is simple we are going to classify one label based on a single question. So we need a `Label` and a `Text` Field. On the `Text` field we are going to pass the arg `include_lengths=True` since we are going to used packed padded sequences.


In [14]:
from torchtext.legacy import data, datasets

In [15]:
TEXT = data.Field(
   tokenize="spacy",
   include_lengths = True,
  tokenizer_language = 'en_core_web_sm',
)
LABEL = data.LabelField()

In [16]:
fields = {
  "Questions": ('text', TEXT),
  "Category1": ('label', LABEL)
}

### Creating the dataset.

We ar going to use the `TabularDataset.split()` to create the datasets.

In [17]:
train_data, val_data, test_data = data.TabularDataset.splits(
   files_path,
   train=train_path_json,
   test= test_path_json,
   validation= val_path_json,
   format = "json",
   fields=fields
)

In [18]:
len(train_data), len(test_data), len(val_data)

(5179, 28, 245)

In [19]:
print(vars(train_data.examples[0]))

{'text': ['What', 'is', 'the', 'name', 'of', 'Miss', 'India', '1994', '?'], 'label': 'HUM'}


### Building the Vocabulary and Loading the `pretrained` word vectors.

We are going to use the `glove.6B.100d` word vectors which was trained with 6 billion words and each word is a 100 dimesional vector.

**Note** We should only build the vocabulary on the `train` dataset only.

In [20]:
MAX_VOCAB_SIZE = 100_000_000

TEXT.build_vocab(
    train_data,
     max_size = MAX_VOCAB_SIZE,
    vectors = "glove.6B.100d",
    unk_init = torch.Tensor.normal_
)
LABEL.build_vocab(train_data)


### Device.

In [21]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [22]:
LABEL.vocab.stoi

defaultdict(None,
            {'ABBR': 5, 'DESC': 2, 'ENTY': 0, 'HUM': 1, 'LOC': 4, 'NUM': 3})

### Creating iterators.

We are going to use our favorite iterator known as the `BucketIterator` to create iterators for all the sets that we have.

In [23]:
sort_key = lambda x: len(x.text)

BATCH_SIZE = 64

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train_data, val_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = sort_key,
)

### Creating the Model.

In [24]:
class QuestionsLSTMRNN(nn.Module):
  def __init__(self, 
               vocab_size,
               embedding_size,
               hidden_size,
               output_size,
               num_layers,
               pad_index,
               bidirectional = True,
               dropout=.5
               ):
    super(QuestionsLSTMRNN, self).__init__()
    self.embedding = nn.Embedding(
        vocab_size,
        embedding_size,
        padding_idx = pad_index
    )
    self.lstm = nn.LSTM(
        embedding_size,
        hidden_size  = hidden_size,
        bidirectional = bidirectional,
        num_layers = num_layers,
        dropout = dropout
    )
    self.fc_1 = nn.Linear(
        hidden_size * 2 if bidirectional else hidden_size,
        out_features = 512
    )
    self.fc_2 = nn.Linear(
        512,
        out_features = 256
    )
    self.out = nn.Linear(
        256,
        out_features = output_size
    )
    self.dropout = nn.Dropout(dropout)

  def forward(self, text, text_lengths):
    embedded = self.dropout(self.embedding(text))
    packed_embedded = nn.utils.rnn.pack_padded_sequence(
        embedded, text_lengths.to('cpu'), enforce_sorted=False
    )
    packed_output, (h_0, c_0) = self.lstm(packed_embedded)
    output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
    h_0 = self.dropout(torch.cat((h_0[-2,:,:], h_0[-1,:,:]), dim = 1))
    out = self.dropout(self.fc_1(h_0))
    out = self.dropout(self.fc_2(h_0))
    return self.out(out)


### Creating the model instance.

In [25]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM =  6
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] 

questions_model = QuestionsLSTMRNN(
            INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            bidirectional = BIDIRECTIONAL, 
            dropout = DROPOUT, 
            pad_index = PAD_IDX
            ).to(device)
questions_model

QuestionsLSTMRNN(
  (embedding): Embedding(9053, 100, padding_idx=1)
  (lstm): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc_1): Linear(in_features=512, out_features=512, bias=True)
  (fc_2): Linear(in_features=512, out_features=256, bias=True)
  (out): Linear(in_features=256, out_features=6, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Model parameters

In [26]:

def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(questions_model)
print(f"Total number of paramaters: {n_params:,}\nTotal tainable parameters: {trainable_params:,}")


Total number of paramaters: 3,610,970
Total tainable parameters: 3,610,970


### Loading pretrained vextors to the embedding layer.

In [27]:
pretrained_embeddings  = TEXT.vocab.vectors

In [28]:
questions_model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 0.0091,  0.2810,  0.7356,  ..., -0.7508,  0.8967, -0.7631],
        [ 0.2906,  0.3217,  0.2419,  ..., -0.9444, -0.3790,  0.6196],
        [-0.3898, -0.5949,  0.2729,  ..., -1.0948,  0.8617, -0.4429]],
       device='cuda:0')

### Zeroing the `<pad>` and `<unk>` tokens.

In [29]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token] or TEXT.vocab.stoi["<unk>"]
questions_model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
questions_model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
questions_model.embedding.weight.data

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 0.0091,  0.2810,  0.7356,  ..., -0.7508,  0.8967, -0.7631],
        [ 0.2906,  0.3217,  0.2419,  ..., -0.9444, -0.3790,  0.6196],
        [-0.3898, -0.5949,  0.2729,  ..., -1.0948,  0.8617, -0.4429]],
       device='cuda:0')

### Loss and optimizer.
We are going to use the Adam as our optimizer with the default leaning rate. We are also going to use `CrossEntropyLoss()` as our loss function.

In [30]:
optimizer = torch.optim.Adam(questions_model.parameters())
criterion = nn.CrossEntropyLoss().to(device)

### Accuracy function.
We are going to create the `categorical_accuracy()` function that will calculate the categorical accuracy for predicted labels and actual labels.

In [31]:
def categorical_accuracy(preds, y):
  top_pred = preds.argmax(1, keepdim = True)
  correct = top_pred.eq(y.view_as(top_pred)).sum()
  return correct.float() / y.shape[0]

### Training and Evaluation functions.

In [32]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text, text_lengths = batch.text
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = categorical_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text
            predictions = model(text, text_lengths)
            loss = criterion(predictions, batch.label)
            acc = categorical_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Training loop.
We are going to create helper functions that will help us to visualize our training.

1. Time to string

In [33]:
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
    

2. tabulate training epoch.

In [34]:
def visualize_training(start, end, train_loss, train_accuracy, val_loss, val_accuracy, title):
  data = [
       ["Training", f'{train_loss:.3f}', f'{train_accuracy:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss:.3f}', f'{val_accuracy:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS", "ACCURACY", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["LOSS"] = 'r'
  table.align["ACCURACY"] = 'r'
  table.align["ETA"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)


In [35]:
N_EPOCHS = 100
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start = time.time()
    train_loss, train_acc = train(questions_model, train_iter, optimizer, criterion)
    valid_loss, valid_acc = evaluate(questions_model, val_iter, criterion)
    title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(questions_model.state_dict(), 'best-model.pt')
    end = time.time()
    visualize_training(start, end, train_loss, train_acc, valid_loss, valid_acc, title)


+--------------------------------------------+
|     EPOCH: 01/100 saving best model...     |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 1.202 |    0.509 | 0:00:01.16 |
| Validation | 0.882 |    0.696 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 02/100 saving best model...     |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.815 |    0.692 | 0:00:01.06 |
| Validation | 0.731 |    0.706 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 03/100 saving best model...     |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   

### Model Evaluation.

In [36]:
questions_model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(questions_model, test_iter, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.568 | Test Acc: 85.71%


### Model Inference.

We are now ready to make predictions with our model.

In [37]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [38]:
reversed_labels = dict([(v, k) for (k, v) in LABEL.vocab.stoi.items()])
reversed_labels

{0: 'ENTY', 1: 'HUM', 2: 'DESC', 3: 'NUM', 4: 'LOC', 5: 'ABBR'}

In [39]:
def tabulate(column_names, data, title="QUESTIONS PREDICTIONS TABLE"):
  table = PrettyTable(column_names)
  table.align[column_names[0]] = "l"
  table.align[column_names[1]] = "l"
  for row in data:
    table.add_row(row)
  print(table)

def predict_question_type(model, sentence, min_len = 5, actual_class=0):
    model.eval()
    with torch.no_grad():
      tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
     
      if len(tokenized) < min_len:
          tokenized += ['<pad>'] * (min_len - len(tokenized))
      indexed = [TEXT.vocab.stoi[t] for t in tokenized]
      length =  [len(indexed)]
      tensor = torch.LongTensor(indexed).to(device)
      tensor = tensor.unsqueeze(1)
      length_tensor = torch.LongTensor(length)
      probabilities = model(tensor, length_tensor)
      prediction = torch.argmax(probabilities, dim=1)
      prediction = prediction.item()
    
      table_headers =["KEY", "VALUE"]
      table_data = [
          ["PREDICTED CLASS",  prediction],
          ["ACTUAL CLASS", actual_class],
          ["PREDICTED CLASS NAME",  reversed_labels[prediction]],    
      ]
      tabulate(table_headers, table_data)


In [40]:
reversed_labels

{0: 'ENTY', 1: 'HUM', 2: 'DESC', 3: 'NUM', 4: 'LOC', 5: 'ABBR'}

### Location

In [41]:
predict_question_type(questions_model, "What are the largest libraries in the US ?", actual_class=4)

+----------------------+-------+
| KEY                  | VALUE |
+----------------------+-------+
| PREDICTED CLASS      | 4     |
| ACTUAL CLASS         | 4     |
| PREDICTED CLASS NAME | LOC   |
+----------------------+-------+


### Human

In [42]:
predict_question_type(questions_model, "Who is John Macarthur , 1767-1834 ?", actual_class=1)

+----------------------+-------+
| KEY                  | VALUE |
+----------------------+-------+
| PREDICTED CLASS      | 1     |
| ACTUAL CLASS         | 1     |
| PREDICTED CLASS NAME | HUM   |
+----------------------+-------+


### DESCRIPTION

In [43]:
predict_question_type(questions_model, "What is the root of all evil ? ", actual_class=2)

+----------------------+-------+
| KEY                  | VALUE |
+----------------------+-------+
| PREDICTED CLASS      | 2     |
| ACTUAL CLASS         | 2     |
| PREDICTED CLASS NAME | DESC  |
+----------------------+-------+


### Numeric

In [44]:
predict_question_type(questions_model, "How many watts make a kilowatt ?", actual_class=3)

+----------------------+-------+
| KEY                  | VALUE |
+----------------------+-------+
| PREDICTED CLASS      | 3     |
| ACTUAL CLASS         | 3     |
| PREDICTED CLASS NAME | NUM   |
+----------------------+-------+


### ENTITY

In [45]:

predict_question_type(questions_model, "What films featured the character Popeye Doyle ?", actual_class=0)

+----------------------+-------+
| KEY                  | VALUE |
+----------------------+-------+
| PREDICTED CLASS      | 0     |
| ACTUAL CLASS         | 0     |
| PREDICTED CLASS NAME | ENTY  |
+----------------------+-------+


### ABBREVIATION

In [46]:
predict_question_type(questions_model, "What does NECROSIS mean ?", actual_class=5)

+----------------------+-------+
| KEY                  | VALUE |
+----------------------+-------+
| PREDICTED CLASS      | 5     |
| ACTUAL CLASS         | 5     |
| PREDICTED CLASS NAME | ABBR  |
+----------------------+-------+


### Next Step
* In the next Notebook we are going to use `FastText` to perform sentiment analyisis on this dataset.