### Question Pairs

In the last notebook we used packed padded sequences to do binary classification on question pairs, predicting whether two questions are duplicates or not. In this notebook we are going to create a modified `FastText` model that will do the same task.

**Note**: Most of the notebook remains unchanged from the previous one. Where there's a change, I will highlight it.

### Imports


In [1]:
import time, os, torch, random, math

from prettytable import PrettyTable
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

from torch import nn
import torch.nn.functional as F

torch.__version__

'1.9.0+cu102'

### SEEDS

In [2]:
SEED = 42

np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Device

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Mounting the google drive

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Paths to data

In [5]:
base_path = '/content/drive/MyDrive/NLP Data/duplicates-questions'
train_path = 'train.csv'
val_path = 'val.csv'
test_path = 'test.csv'

os.path.exists(base_path)

True

### Data Loading
This is a binary classification task where we are going to predict whether two questions are duplicates or not. We have two inputs, two different questions, that map to one label, `is_duplicate`, which is 1 for duplicates and 0 otherwise. We start by creating the fields of our data.

### Fast Text
According to the FastText paper, we need to generate bigrams for each question.

We are going to create a function called ``generate_bigrams()`` that appends bigrams to the token list of each of the two input questions. We will pass this function to the `TEXT` field as its preprocessing function.

### What do we have?
We have three `csv` files, one for each set (train, validation and test), which makes it easy to create the dataset for this task.


In [6]:
def generate_bigrams(x):
  x = [i.lower() for i in x]
  n_grams = set(zip(*[x[i: ] for i in range(2)]))
  for n_gram in n_grams:
      x.append(' '.join(n_gram))
  return x
generate_bigrams(['What', 'is', 'the', 'meaning', "of", "OCR", "in", "python"])


['what',
 'is',
 'the',
 'meaning',
 'of',
 'ocr',
 'in',
 'python',
 'the meaning',
 'of ocr',
 'in python',
 'meaning of',
 'is the',
 'what is',
 'ocr in']
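
Because `generate_bigrams` collects the bigrams in a `set`, the order in which they are appended can differ from run to run. The following is a minimal sketch of a deterministic variant, assuming first-seen order is acceptable; the name `generate_bigrams_ordered` is hypothetical and not part of the notebook.

```python
def generate_bigrams_ordered(tokens):
    """Hypothetical variant of generate_bigrams with a stable output order."""
    tokens = [t.lower() for t in tokens]
    # dict.fromkeys drops repeated bigrams while keeping first-seen order.
    bigrams = dict.fromkeys(zip(tokens, tokens[1:]))
    return tokens + [" ".join(pair) for pair in bigrams]


result = generate_bigrams_ordered(
    ["What", "is", "the", "meaning", "of", "OCR", "in", "python"]
)
print(result)
```

The unigrams come first, followed by each distinct bigram in the order it first appears in the question.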

In [7]:
from torchtext.legacy import data

### Fields

In [8]:
TEXT = data.Field(
      tokenize = 'spacy',
      tokenizer_language = 'en_core_web_sm',
      preprocessing = generate_bigrams,
    )
LABEL = data.LabelField(dtype = torch.float)

In [9]:
fields = {
    "question1": ("qn1", TEXT),
    "question2": ("qn2", TEXT), 
    "is_duplicate": ("label", LABEL),
}

Next we will create our dataset using our favourite class from torchtext, `TabularDataset`. We are going to load the data, which is in `csv` format, as follows.

In [10]:
train_data, val_data, test_data = data.TabularDataset.splits(
   base_path,
   train=train_path,
   test= test_path,
   validation= val_path,
   format = "csv",
   fields=fields
)

In [11]:
print(vars(train_data.examples[0]))

{'qn1': ['is', 'it', 'right', 'for', 'a', 'woman', 'to', 'date', 'someone', '2', '-', '3', 'years', 'younger', 'than', 'her', '?', 'is it', 'for a', 'date someone', 'a woman', 'woman to', 'to date', 'someone 2', '2 -', 'younger than', 'it right', 'her ?', '3 years', 'right for', '- 3', 'than her', 'years younger'], 'qn2': ['is', 'it', 'strange', 'to', 'have', 'a', 'crush', 'on', 'someone', 'say', '17', 'years', 'younger', 'than', 'me', '?', 'crush on', 'is it', 'to have', 'have a', 'me ?', 'on someone', 'younger than', 'strange to', 'than me', 'it strange', 'say 17', 'someone say', 'a crush', 'years younger', '17 years'], 'label': '0'}


#### Next we will build the Vocabulary.

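We are going to use the pretrained word vectors `glove.6B.100d`, which are 100-dimensional vectors trained on about 6 billion English tokens. GloVe only contains single words, so bigram tokens such as `what is` have no pretrained vector and are initialized by `unk_init` instead.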


In [12]:

MAX_VOCAB_SIZE = 100_000

TEXT.build_vocab(
     train_data,
     max_size = MAX_VOCAB_SIZE,
     vectors = "glove.6B.100d",
     unk_init = torch.Tensor.normal_
)
LABEL.build_vocab(train_data)

In [13]:
LABEL.vocab.stoi

defaultdict(None, {'0': 0, '1': 1})
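
To make the effect of `max_size` concrete, here is a simplified pure-Python sketch of frequency-based vocabulary building. It is not torchtext's implementation; `build_vocab` and the toy data are hypothetical. It assumes, as torchtext's legacy `Vocab` does, that the size cap applies to ordinary tokens and that `<unk>` and `<pad>` are added on top, which is why the embedding layers later in this notebook have 100,002 rows.

```python
from collections import Counter


def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Hypothetical helper: keep the max_size most frequent tokens plus specials."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Most frequent first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ranked[:max_size]]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


itos, stoi = build_vocab(
    [["what", "is", "ocr"], ["what", "is", "python"]], max_size=3
)
print(itos)
# Tokens that did not make the cut fall back to the <unk> index.
python_index = stoi.get("python", stoi["<unk>"])
```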

### Creating iterators

We are going to use the `BucketIterator` to create iterators for all these sets that we have.

In [14]:
sort_key = lambda x: len(x.qn1)

BATCH_SIZE = 128

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train_data, val_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = sort_key,
    sort_within_batch=True
)
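
The idea behind bucketing is to batch examples of similar length together so that less padding is needed. The sketch below illustrates that idea in plain Python under simplifying assumptions: `bucket_batches` and `count_padding` are hypothetical helpers, there is no shuffling, and only one sequence per example is considered, whereas the real `BucketIterator` above sorts by `qn1` only and pads `qn2` separately.

```python
def bucket_batches(sequences, batch_size, pad_token="<pad>"):
    """Hypothetical helper: sort by length, chunk, and pad each chunk."""
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        width = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_token] * (width - len(seq)) for seq in chunk])
    return batches


def count_padding(batches, pad_token="<pad>"):
    """Count how many padding tokens the batches contain."""
    return sum(seq.count(pad_token) for batch in batches for seq in batch)


sequences = [["a"], ["b", "c", "d"], ["e", "f"], ["g", "h", "i", "j"]]
batches = bucket_batches(sequences, batch_size=2)
padding_used = count_padding(batches)
print(batches, padding_used)
```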

### Next we are going to create the model.

We are going to have two inputs which will be Question1 and Question2.
* Each question will be passed through its own embedding layer.
* The two embedded sequences will then be concatenated along the sequence dimension, averaged over all tokens, and passed through a linear layer for predictions.
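
The arithmetic of this architecture can be sketched for a single example with plain Python lists. This is only an illustration under stated assumptions: the 2-dimensional vectors, weights and bias are made-up values, `mean_pool` and `forward_single` are hypothetical names, and dropout is left out.

```python
def mean_pool(vectors):
    """Average a list of equally sized vectors, dimension by dimension."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / count for d in range(dim)]


def forward_single(q1_vectors, q2_vectors, weights, bias):
    # Concatenate the token vectors of both questions, then average them.
    pooled = mean_pool(q1_vectors + q2_vectors)
    # A linear layer with one output is a dot product plus a bias.
    return sum(w * x for w, x in zip(weights, pooled)) + bias


q1 = [[1.0, 2.0], [3.0, 4.0]]   # two tokens, embedding size 2
q2 = [[5.0, 6.0]]               # one token, embedding size 2
pooled = mean_pool(q1 + q2)
logit = forward_single(q1, q2, weights=[0.5, -0.25], bias=0.1)
print(pooled, logit)
```

Note that the average runs over the tokens of both questions together, so a longer question contributes more to the pooled vector than a shorter one.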

In [15]:
class DuplicateQuestionsFastText(nn.Module):
  def __init__(self,
               vocab_size,
               embedding_size,
               output_dim,
               pad_index,
               dropout=.5
               ):
    super(DuplicateQuestionsFastText, self).__init__()
    self.embedding_1 = nn.Embedding(
        vocab_size,
        embedding_size,
        padding_idx = pad_index
    )
    self.embedding_2 = nn.Embedding(
        vocab_size,
        embedding_size,
        padding_idx = pad_index
    )
    self.out = nn.Linear(
        embedding_size,
        out_features = output_dim
    )
    self.dropout = nn.Dropout(dropout)
  
  def forward(self,
              question1, 
              question2, 
              ):
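    # question1, question2: [seq_len, batch_size] (torchtext is sequence-first by default).
    # After permute, each embedded tensor is [batch_size, seq_len, embedding_size].
    embedded_1 = self.embedding_1(question1).permute(1, 0, 2)
    embedded_2 = self.embedding_2(question2).permute(1, 0, 2)
    # Concatenate along the sequence dimension: [batch_size, len_1 + len_2, embedding_size].
    embedded = self.dropout(torch.cat((embedded_1, embedded_2), dim=1))
    # Average over all tokens of both questions: [batch_size, embedding_size].
    pooled = F.avg_pool2d(embedded,
                         (embedded.shape[1], 1)
                          ).squeeze(1)
    return self.out(pooled)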


In [16]:
a = torch.tensor([[2, 3, 5], [2, 3, 4]])
a = a.reshape((3, -1))
a.shape


torch.Size([3, 2])

### Creating the model instance.


In [17]:

INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
OUTPUT_DIM =  1
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] 

duplicate_questions_model = DuplicateQuestionsFastText(
            INPUT_DIM, 
            EMBEDDING_DIM, 
            OUTPUT_DIM, 
            pad_index = PAD_IDX
            ).to(device)
duplicate_questions_model

DuplicateQuestionsFastText(
  (embedding_1): Embedding(100002, 100, padding_idx=1)
  (embedding_2): Embedding(100002, 100, padding_idx=1)
  (out): Linear(in_features=100, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Model parameters

In [18]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(duplicate_questions_model)
print(f"Total number of parameters: {n_params:,}\nTotal trainable parameters: {trainable_params:,}")

Total number of parameters: 20,000,501
Total trainable parameters: 20,000,501
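
The total can be checked by hand from the layer shapes printed in the model summary above: two embedding tables of 100,002 × 100 weights each, plus a linear layer with 100 weights and 1 bias. A small sketch of that arithmetic:

```python
vocab_size, embedding_dim, output_dim = 100_002, 100, 1

# Two separate embedding tables, one per question.
embedding_params = 2 * vocab_size * embedding_dim
# Linear layer: one weight per input feature per output, plus one bias per output.
linear_params = embedding_dim * output_dim + output_dim

total_params = embedding_params + linear_params
print(f"{total_params:,}")
```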


### Loading pretrained vectors to the `embedding` layers.
* Now we have two embedding layers in the model, so we need to add the word vectors to each embedding layer as follows:

In [19]:
pretrained_embeddings  = TEXT.vocab.vectors

In [20]:
duplicate_questions_model.embedding_1.weight.data.copy_(
    pretrained_embeddings
    )
duplicate_questions_model.embedding_2.weight.data.copy_(
    pretrained_embeddings
    )

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [-1.0188, -1.3804, -1.4044,  ..., -2.0274, -0.4045, -1.8920],
        [ 0.0247, -1.1202, -0.2275,  ...,  1.1231,  0.2079, -2.3545],
        [-1.8090,  0.4517, -1.6228,  ..., -0.1685, -0.4630, -0.9866]],
       device='cuda:0')

### Zeroing the `<pad>` and the `<unk>` tokens.

The `<unk>` and `<pad>` tokens carry no useful information for training, so we zero their initial vectors. We will do this for every embedding layer in the model.

In [21]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

duplicate_questions_model.embedding_1.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
duplicate_questions_model.embedding_1.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

duplicate_questions_model.embedding_2.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
duplicate_questions_model.embedding_2.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)


duplicate_questions_model.embedding_1.weight.data

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [-1.0188, -1.3804, -1.4044,  ..., -2.0274, -0.4045, -1.8920],
        [ 0.0247, -1.1202, -0.2275,  ...,  1.1231,  0.2079, -2.3545],
        [-1.8090,  0.4517, -1.6228,  ..., -0.1685, -0.4630, -0.9866]],
       device='cuda:0')

### Loss and optimizer
For the optimizer we are going to use `Adam()` with default parameters, and for the loss function we are going to use `BCEWithLogitsLoss()` since we are doing binary classification.

In [22]:
optimizer = torch.optim.Adam(duplicate_questions_model.parameters())
criterion = nn.BCEWithLogitsLoss().to(device)
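
`BCEWithLogitsLoss` combines the sigmoid and the binary cross-entropy in one step, which is why the model returns raw logits. The per-example loss can be sketched in plain Python as follows. This is an illustration of the formula in a commonly used numerically stable form, not PyTorch's source code, and the function names are hypothetical. By default PyTorch then averages the per-example values over the batch.

```python
import math


def bce_with_logits(logit, target):
    """Stable form: max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def naive_bce(logit, target):
    """Textbook form: -(y * log(p) + (1 - y) * log(1 - p)) with p = sigmoid(x)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


loss_uncertain = bce_with_logits(0.0, 1.0)    # the model is undecided
loss_confident = bce_with_logits(2.0, 1.0)    # confident and correct
loss_extreme = bce_with_logits(1000.0, 0.0)   # stays finite in the stable form
print(loss_uncertain, loss_confident, loss_extreme)
```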

### Accuracy function.
For the accuracy we are going to create a `binary_accuracy` function that takes the model's raw predictions (logits) and the actual labels, and returns the accuracy as a fraction between 0 and 1.

In [23]:
def binary_accuracy(y_preds, y_true):
  rounded_preds = torch.round(torch.sigmoid(y_preds))
  correct = (rounded_preds == y_true).float()
  return correct.sum() / len(correct)
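
A worked example of the same computation in plain Python may help; the logits and labels below are made-up values and `binary_accuracy_py` is a hypothetical stand-in for the tensor version above.

```python
import math


def binary_accuracy_py(logits, labels):
    """Sigmoid, round to 0 or 1, then take the fraction of matches."""
    probabilities = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    predictions = [round(p) for p in probabilities]
    correct = [int(pred == label) for pred, label in zip(predictions, labels)]
    return sum(correct) / len(correct)


# The third logit is positive but its label is 0, so it counts as a mistake.
accuracy = binary_accuracy_py([2.0, -1.0, 0.5, -0.3], [1, 0, 0, 0])
print(accuracy)
```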

### Train and evaluation function.
This time around we have two features, the two question texts. The model expects 2 positional args, which are:
```
  question1, 
  question2
```
### Where are we going to get them?

Our iterator contains all this information, so we don't have to worry much about that. Let's create the train and evaluation functions.

In [24]:
def train(model, iterator, optimizer, criterion):
  epoch_loss,epoch_acc = 0, 0
  model.train()
  for batch in iterator:
    optimizer.zero_grad()
    qn1 = batch.qn1
    qn2= batch.qn2
    predictions = model(qn1, qn2).squeeze(1)
    loss = criterion(predictions, batch.label)
    acc = binary_accuracy(predictions, batch.label)
    loss.backward()
    optimizer.step()
    epoch_loss += loss.item()
    epoch_acc += acc.item()

  return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
  epoch_loss,epoch_acc = 0, 0
  model.eval()
  with torch.no_grad():
    for batch in iterator:
      qn1 = batch.qn1
      qn2 = batch.qn2
      predictions = model(qn1, qn2).squeeze(1)
      loss = criterion(predictions, batch.label)
      acc = binary_accuracy(predictions, batch.label)
      epoch_loss += loss.item()
      epoch_acc += acc.item()
  return epoch_loss / len(iterator), epoch_acc / len(iterator)
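
Both functions above average the per-batch metrics over the number of batches. When the last batch is smaller than the others, that gives it the same weight as a full batch. A possible alternative, sketched here with made-up numbers and a hypothetical helper name, is to weight each batch by its size:

```python
def weighted_epoch_metric(batch_values, batch_sizes):
    """Average per-batch values, weighting each batch by its number of examples."""
    total_examples = sum(batch_sizes)
    weighted_sum = sum(value * size for value, size in zip(batch_values, batch_sizes))
    return weighted_sum / total_examples


batch_accuracies = [0.75, 0.5]   # made-up per-batch accuracies
batch_sizes = [128, 32]          # a full batch and a smaller final batch

unweighted = sum(batch_accuracies) / len(batch_accuracies)
weighted = weighted_epoch_metric(batch_accuracies, batch_sizes)
print(unweighted, weighted)
```

With many batches per epoch the difference is usually small, so the simpler average used in the notebook is a reasonable approximation.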


### Train Loop

We are going to create some helper functions that will help us to visualize every epoch during training.

Time to string

In [25]:
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

Tabulate training epoch

In [26]:
def visualize_training(start, end, train_loss, train_accuracy, val_loss, val_accuracy, title):
  data = [
       ["Training", f'{train_loss:.3f}', f'{train_accuracy:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss:.3f}', f'{val_accuracy:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS", "ACCURACY", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["LOSS"] = 'r'
  table.align["ACCURACY"] = 'r'
  table.align["ETA"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)
  

In [27]:
N_EPOCHS = 10
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
  start = time.time()
  train_loss, train_acc = train(duplicate_questions_model, train_iter, optimizer, criterion)
  valid_loss, valid_acc = evaluate(duplicate_questions_model, val_iter, criterion)
  title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
  if valid_loss < best_valid_loss:
      best_valid_loss = valid_loss
      torch.save(duplicate_questions_model.state_dict(), 'best-model.pt')
  end = time.time()
  visualize_training(start, end, train_loss, train_acc, valid_loss, valid_acc, title)

+--------------------------------------------+
|     EPOCH: 01/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.530 |    0.739 | 0:00:45.04 |
| Validation | 0.492 |    0.765 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 02/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.466 |    0.783 | 0:00:44.70 |
| Validation | 0.470 |    0.777 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 03/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   

### Evaluating the best model.

In [28]:
duplicate_questions_model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(duplicate_questions_model, test_iter, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.450 | Test Acc: 78.80%


### Model Inference

Our `predict_sentiment` function, which here predicts whether two questions are duplicates, will:

* take a pair of questions, tokenize them and convert them to sequences of indices.
* pass the questions, converted to tensors, to the model.
* apply the sigmoid to get the probability that the questions are duplicates.

In [33]:
import en_core_web_sm
nlp = en_core_web_sm.load()

def predict_sentiment(model, q1, q2):
  model.eval()
  # Apply the same bigram preprocessing that the TEXT field used during training.
  tokenized_q1 = generate_bigrams([tok.text for tok in nlp.tokenizer(q1)])
  tokenized_q2 = generate_bigrams([tok.text for tok in nlp.tokenizer(q2)])

  indexed_1 = [TEXT.vocab.stoi[t] for t in tokenized_q1]
  indexed_2 = [TEXT.vocab.stoi[t] for t in tokenized_q2]

  # Shape [seq_len, 1]: a batch containing a single example.
  tensor_1 = torch.LongTensor(indexed_1).to(device).unsqueeze(1)
  tensor_2 = torch.LongTensor(indexed_2).to(device).unsqueeze(1)

  with torch.no_grad():
    prediction = torch.sigmoid(model(tensor_1, tensor_2))
  return prediction.item()
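
The function above returns a probability between 0 and 1. Turning that into a class name and a confidence value can be sketched as follows; `classify_logit` is a hypothetical helper that starts from a raw logit so that it runs without the trained model.

```python
import math


def classify_logit(logit, classes=("not duplicate", "duplicate")):
    """Map a raw logit to (class name, confidence in that class)."""
    probability = 1.0 / (1.0 + math.exp(-logit))
    label = int(probability >= 0.5)
    # Confidence is the probability of whichever class was chosen.
    confidence = probability if label == 1 else 1.0 - probability
    return classes[label], confidence


name_low, confidence_low = classify_logit(-2.0)
name_high, confidence_high = classify_logit(1.5)
print(name_low, f"{confidence_low:.2%}")
print(name_high, f"{confidence_high:.2%}")
```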

### Getting questions for testing.

In [34]:
dataframe = pd.read_csv(os.path.join(
    base_path,
    test_path
))

qns1 = dataframe.question1.values
qns2 = dataframe.question2.values
true_labels = dataframe.is_duplicate.values

In [35]:
from prettytable import PrettyTable
def tabulate(column_names, data, max_characters:int, title:str):
  table = PrettyTable(column_names)
  table.align[column_names[0]] = "l"
  table.align[column_names[1]] = "l"
  table.title = title
  table._max_width = {column_names[0] :max_characters, column_names[1] :max_characters}
  for row in data:
    table.add_row(row)
  print(table)

In [36]:

for i, (q1, q2, label) in enumerate(zip(qns1, qns2, true_labels[:10])):
  pred = predict_sentiment(duplicate_questions_model, q1, q2)
  classes = ["not duplicate", "duplicate"]
  probability = pred if pred >=0.5 else 1 - pred
  table_headers =["KEY", "VALUE"]
  table_data = [
        ["Question 1", q1],
        ["Question2", q2],
        ["PREDICTED CLASS",  round(pred)],
        ["PREDICTED CLASS NAME",  classes[round(pred)]],
        ["REAL CLASS",  label],
        ["REAL CLASS NAME",  classes[label]],
        ["CONFIDENCE OVER OTHER CLASSES", f'{ probability * 100:.2f}%'],
             
    ]
  title = "Duplicate Questions"
  tabulate(table_headers, table_data, 50, title=title)

+------------------------------------------------------------------------------------+
|                                Duplicate Questions                                 |
+-------------------------------+----------------------------------------------------+
| KEY                           | VALUE                                              |
+-------------------------------+----------------------------------------------------+
| Question 1                    | Do you watch Korean dramas?                        |
| Question2                     | Is it normal to watch Korean drama if you are a    |
|                               | guy?                                               |
| PREDICTED CLASS               | 0                                                  |
| PREDICTED CLASS NAME          | not duplicate                                      |
| REAL CLASS                    | 0                                                  |
| REAL CLASS NAME               | not dupli

### Conclusion.

We have learned how to create a model that maps 2 inputs to one output using the modified FastText model. What's Next?

### Next.
In the next notebook we are going to do the same task with `ConvNets`, specifically using `Conv2D` layers on sequential data. We are going to use this notebook as the base and expand on it.