### Question Pairs
In this notebook we are going to learn how we can classify wether questions are dulicates/simmilar or not using the dataset downloaded from [Kaggle](https://www.kaggle.com/quora/question-pairs-dataset).

 As usual I'm going to uzip the file and upload it to my google drive so that it can be easly loaded here in google colab.

### Imports


In [1]:
import time, os, torch, random, math

from prettytable import PrettyTable
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

import torch, os, random
from torch import nn
import torch.nn.functional as F

torch.__version__

'1.9.0+cu102'

### SEEDS

In [2]:
SEED = 42

np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deteministic = True

### Device

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Mounting the google drive

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Paths to data

In [5]:
base_path = '/content/drive/MyDrive/NLP Data/duplicates-questions'
train_path = 'train.csv'
val_path = 'val.csv'
test_path = 'test.csv'

os.path.exists(base_path)

True

### Data Loading
This is a binary classification task where we are going to predict weather questions are duplicates or not. We are going to have 2 inputs which is two differant questions that will map to one label, is_duplicate(1) or is_not_duplicate (0). We are going to create the fields of our data. We are going to use [this notebook](https://github.com/CrispenGari/pytorch-python/blob/main/09_TorchText/02_Sentiment_Analyisis_Series/02_Updated_Sentiment_Analysis.ipynb) together with [this](https://github.com/CrispenGari/nlp-pytorch/blob/main/04_Questions_Cassification/01_Questions_Classification.ipynb) as our guieds for this task.

### What do we have?
We are having three `csv` files for each set whih makes it easy to create the dataset for this task.

Since we are going to use the `packed_padded_sequence` for this classification task, during our Text field creation we need to pass `include_lengths=True`. And also since this is a binary classificatio task we are going to create a label field with a float data type. This is what i did in [this](https://github.com/CrispenGari/pytorch-python/blob/main/09_TorchText/02_Sentiment_Analyisis_Series/02_Updated_Sentiment_Analysis.ipynb) notebook.

In [6]:
from torchtext.legacy import data

### Fields

In [7]:
TEXT = data.Field(
      tokenize = 'spacy',
      tokenizer_language = 'en_core_web_sm',
      include_lengths=True
    )
LABEL = data.LabelField(dtype = torch.float)

In [8]:
fields = {
    "question1": ("qn1", TEXT),
    "question2": ("qn2", TEXT), 
    "is_duplicate": ("label", LABEL),
}

Next we will create our dataset using our favourate class fro  torchtext `TabularDataset`. We are going to load the data that is in `csv` format as follows.

In [9]:
train_data, val_data, test_data = data.TabularDataset.splits(
   base_path,
   train=train_path,
   test= test_path,
   validation= val_path,
   format = "csv",
   fields=fields
)

In [10]:
print(vars(train_data.examples[0]))

{'qn1': ['Is', 'it', 'right', 'for', 'a', 'woman', 'to', 'date', 'someone', '2', '-', '3', 'years', 'younger', 'than', 'her', '?'], 'qn2': ['Is', 'it', 'strange', 'to', 'have', 'a', 'crush', 'on', 'someone', 'say', '17', 'years', 'younger', 'than', 'me', '?'], 'label': '0'}


#### Next we will build the Vocabulary.

We are going to use the pretrained word vectors `glove.6B.100d` which was trained on about 6 billion english words.


In [11]:

MAX_VOCAB_SIZE = 100_000

TEXT.build_vocab(
     train_data,
     max_size = MAX_VOCAB_SIZE,
     vectors = "glove.6B.100d",
    unk_init = torch.Tensor.normal_
)
LABEL.build_vocab(train_data)

In [12]:
LABEL.vocab.stoi

defaultdict(None, {'0': 0, '1': 1})

### Creating iterators

We are going to use the `BucketIterator` to create iterators for all these sets that we have.

In [13]:
sort_key = lambda x: len(x.qn1)

BATCH_SIZE = 128

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train_data, val_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = sort_key,
    sort_within_batch=True
)

### Next we are going to create the model.

We are going to have two inputs which will be Question1 and Question2.
* Each question will be passed through it's own embedding layer.
* Then each embedding layer will have an LSTM layer following it that will learn each question seperately and then we will concatenate the learned layers and using the  `torch.cat()` in the forward pass.
* We will then concatate the layers and pass down into the fully connected layer where we will learn the parameters before passing it down to the output layerr.

In [14]:
class DuplicateQuestions(nn.Module):
  def __init__(self,
               vocab_size,
               embedding_size,
               hidden_size,
               output_size,
               num_layers,
               pad_index,
               bidirectional = True,
               dropout=.5
               ):
    super(DuplicateQuestions, self).__init__()
    self.embedding_1 = nn.Embedding(
        vocab_size,
        embedding_size,
        padding_idx = pad_index
    )
    self.embedding_2 = nn.Embedding(
        vocab_size,
        embedding_size,
        padding_idx = pad_index
    )
    self.lstm = nn.LSTM(
        embedding_size,
        hidden_size  = hidden_size,
        bidirectional = bidirectional,
        num_layers = num_layers,
        dropout = dropout
    )
    self.fc_1 = nn.Linear(
        hidden_size * 2 if bidirectional else hidden_size  ,
        out_features = 512
    )
    self.fc_2 = nn.Linear(
        512 * 2,
        out_features = 256
    )
    """
    Why 512 * 2?
    We are going to concatenate the two learned outputs from the first 
    feature(qn1) and the second feature(qn2).
    """
    self.out = nn.Linear(
        256,
        out_features = output_size
    )
    self.dropout = nn.Dropout(dropout)
  
  def forward(self,
              question1, 
              question1_lengths,
              question2, 
              question2_lengths,
              ):
    embedded_1 = self.dropout(self.embedding_1(question1))
    embedded_2 = self.dropout(self.embedding_2(question2))

    packed_embedded_1 = nn.utils.rnn.pack_padded_sequence(
        embedded_1, question1_lengths.to('cpu'), enforce_sorted=False
    )
    packed_embedded_2 = nn.utils.rnn.pack_padded_sequence(
        embedded_2, question2_lengths.to('cpu'), enforce_sorted=False
    )
    packed_output_1, (h_0_1, c_0_1) = self.lstm(packed_embedded_1)
    packed_output_2, (h_0_2, c_0_2) = self.lstm(packed_embedded_2)


    output_1, output_lengths_1 = nn.utils.rnn.pad_packed_sequence(packed_output_1)
    output_2, output_lengths_2 = nn.utils.rnn.pad_packed_sequence(packed_output_2)

    h_0_1 = self.dropout(torch.cat((h_0_1[-2,:,:], h_0_1[-1,:,:]), dim = 1))
    h_0_2 = self.dropout(torch.cat((h_0_2[-2,:,:], h_0_2[-1,:,:]), dim = 1))

    out_1 = self.dropout(self.fc_1(h_0_1))
    out_2 = self.dropout(self.fc_1(h_0_1))
    concatenated = self.dropout(torch.cat((out_1, out_2), dim=1))
    out = self.dropout(self.fc_2(concatenated))
    return self.out(out)


### Creating the model instance.


In [15]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM =  1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] 

duplicate_questions_model = DuplicateQuestions(
            INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            bidirectional = BIDIRECTIONAL, 
            dropout = DROPOUT, 
            pad_index = PAD_IDX
            ).to(device)
duplicate_questions_model

DuplicateQuestions(
  (embedding_1): Embedding(100002, 100, padding_idx=1)
  (embedding_2): Embedding(100002, 100, padding_idx=1)
  (lstm): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc_1): Linear(in_features=512, out_features=512, bias=True)
  (fc_2): Linear(in_features=1024, out_features=256, bias=True)
  (out): Linear(in_features=256, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Model parameters

In [16]:

def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(duplicate_questions_model)
print(f"Total number of paramaters: {n_params:,}\nTotal tainable parameters: {trainable_params:,}")

Total number of paramaters: 22,835,857
Total tainable parameters: 22,835,857


### Loading pretrained vectors to the `embedding` layers.
* Now we have two embedding layers in the model, so we need to add the word vectors to each embedding layer as follows:

In [17]:
pretrained_embeddings  = TEXT.vocab.vectors

In [18]:
duplicate_questions_model.embedding_1.weight.data.copy_(
    pretrained_embeddings
    )
duplicate_questions_model.embedding_2.weight.data.copy_(
    pretrained_embeddings
    )

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 0.1434,  0.8499,  1.2881,  ...,  2.0599, -1.1911,  0.5823],
        [ 0.6803, -0.4131,  1.0441,  ...,  0.8616,  0.4945,  0.1403],
        [ 0.2489, -0.2695, -0.0509,  ..., -0.3666, -0.1306,  0.8195]],
       device='cuda:0')

### Zeroing the `<pad>` and the `<unk>` tokens.

These tokens are not acually necessary for the model trainning that's the reason we are zeroing them. We will do this for all our emmbedding layers in the model.

In [19]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token] or TEXT.vocab.stoi["<unk>"]

duplicate_questions_model.embedding_1.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
duplicate_questions_model.embedding_1.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

duplicate_questions_model.embedding_2.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
duplicate_questions_model.embedding_2.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)


duplicate_questions_model.embedding_1.weight.data

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 0.1434,  0.8499,  1.2881,  ...,  2.0599, -1.1911,  0.5823],
        [ 0.6803, -0.4131,  1.0441,  ...,  0.8616,  0.4945,  0.1403],
        [ 0.2489, -0.2695, -0.0509,  ..., -0.3666, -0.1306,  0.8195]],
       device='cuda:0')

### Loss and optimizer
For the optimizer we are going to use `Adam()` with default paramaters and for the loss function we are going to use the `BCEWithLogitsLoss()` since we are doing a binary classification.

In [20]:
optimizer = torch.optim.Adam(duplicate_questions_model.parameters())
criterion = nn.BCEWithLogitsLoss().to(device)

### Accuracy function.
For the accuracy we are going to create a `binary_accuracy` function that will take predicted labels and accual labels to return the accuracy as a probability value.

In [21]:
def binary_accuracy(y_preds, y_true):
  rounded_preds = torch.round(torch.sigmoid(y_preds))
  correct = (rounded_preds == y_true).float()
  return correct.sum() / len(correct)

### Train and evaluation function.
This time around we have two features which is our two text labels. The model except 4 positional args whic are:
```
  question1, 
  question1_lengths
  question2
  question2_lengths
```
### Where are we going to get them?

Well our iterator contains all this information so we dont have o worry much about that. Let's create a train and evaluation functions.

In [22]:
def train(model, iterator, optimizer, criterion):
  epoch_loss,epoch_acc = 0, 0
  model.train()
  for batch in iterator:
    optimizer.zero_grad()
    qn1, qn1_lengths = batch.qn1
    qn2, qn2_lengths = batch.qn2
    try:
      predictions = model(qn1, qn1_lengths,
                          qn2, qn2_lengths ).squeeze(1)
      loss = criterion(predictions, batch.label)
      acc = binary_accuracy(predictions, batch.label)
      loss.backward()
      optimizer.step()
      epoch_loss += loss.item()
      epoch_acc += acc.item()
    except:
      pass
  return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
  epoch_loss,epoch_acc = 0, 0
  model.eval()
  with torch.no_grad():
    for batch in iterator:
      qn1, qn1_lengths = batch.qn1
      qn2, qn2_lengths = batch.qn2
      predictions = model(qn1, qn1_lengths,
                          qn2, qn2_lengths ).squeeze(1)
      loss = criterion(predictions, batch.label)
      acc = binary_accuracy(predictions, batch.label)
      epoch_loss += loss.item()
      epoch_acc += acc.item()

  return epoch_loss / len(iterator), epoch_acc / len(iterator)


### Train Loop

We are going to create some helper functions that will help us to visualize every epoch during training.

Time to string

In [23]:
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

Tabulate training epoch

In [24]:
def visualize_training(start, end, train_loss, train_accuracy, val_loss, val_accuracy, title):
  data = [
       ["Training", f'{train_loss:.3f}', f'{train_accuracy:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss:.3f}', f'{val_accuracy:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS", "ACCURACY", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["LOSS"] = 'r'
  table.align["ACCURACY"] = 'r'
  table.align["ETA"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)
  

In [25]:
N_EPOCHS = 10
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
  start = time.time()
  train_loss, train_acc = train(duplicate_questions_model, train_iter, optimizer, criterion)
  valid_loss, valid_acc = evaluate(duplicate_questions_model, val_iter, criterion)
  title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
  if valid_loss < best_valid_loss:
      best_valid_loss = valid_loss
      torch.save(duplicate_questions_model.state_dict(), 'best-model.pt')
  end = time.time()
  visualize_training(start, end, train_loss, train_acc, valid_loss, valid_acc, title)


+--------------------------------------------+
|     EPOCH: 01/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.570 |    0.707 | 0:03:30.48 |
| Validation | 0.522 |    0.747 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 02/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.523 |    0.741 | 0:03:31.43 |
| Validation | 0.508 |    0.758 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 03/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   

### Evaluating the best model.

In [28]:
duplicate_questions_model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(duplicate_questions_model, test_iter, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.486 | Test Acc: 76.97%


### Model Inference

Our predict sentiment function will:

* get `two` question pairs, tokenize them and convert them to sequences.
* We will then get the lengths of each sentence and convert them to tensors.
* pass the model the, questions and their lenghts.
* Apply the sigmoid to get the accual label.

In [33]:
import en_core_web_sm
nlp = en_core_web_sm.load()

def predict_sentiment(model, q1, q2):
  model.eval()
  tokenized_q1 = [tok.text for tok in nlp.tokenizer(q1)]
  tokenized_q2 = [tok.text for tok in nlp.tokenizer(q2)]

  indexed_1 = [TEXT.vocab.stoi[t] for t in tokenized_q1]
  indexed_2 = [TEXT.vocab.stoi[t] for t in tokenized_q2]

  length_1 = [len(indexed_1)]
  length_2 = [len(indexed_2)]

  tensor_1 = torch.LongTensor(indexed_1).to(device).unsqueeze(1)
  tensor_2 = torch.LongTensor(indexed_2).to(device).unsqueeze(1)
  
  length_tensor_1 = torch.LongTensor(length_1)
  length_tensor_2 = torch.LongTensor(length_2)

  prediction = torch.sigmoid(model(
      tensor_1, length_tensor_1,
      tensor_2, length_tensor_2
      ))
  return prediction.item()

### Getting questions for testing.

In [34]:
dataframe = pd.read_csv(os.path.join(
    base_path,
    test_path
))

qns1 = dataframe.question1.values
qns2 = dataframe.question2.values
true_labels = dataframe.is_duplicate.values

In [35]:
from prettytable import PrettyTable
def tabulate(column_names, data, max_characters:int, title:str):
  table = PrettyTable(column_names)
  table.align[column_names[0]] = "l"
  table.align[column_names[1]] = "l"
  table.title = title
  table._max_width = {column_names[0] :max_characters, column_names[1] :max_characters}
  for row in data:
    table.add_row(row)
  print(table)

In [39]:

for i, (q1, q2, label) in enumerate(zip(qns1, qns2, true_labels[:10])):
  pred = predict_sentiment(duplicate_questions_model, q1, q2)
  classes = ["not duplicate", "duplicate"]
  probability = pred if pred >=0.5 else 1 - pred
  table_headers =["KEY", "VALUE"]
  table_data = [
        ["Question 1", q1],
        ["Question2", q2],
        ["PREDICTED CLASS",  round(pred)],
        ["PREDICTED CLASS NAME",  classes[round(pred)]],
        ["REAL CLASS",  label],
        ["REAL CLASS NAME",  classes[label]],
        ["CONFIDENCE OVER OTHER CLASSES", f'{ probability * 100:.2f}%'],
             
    ]
  title = "Duplicate Questions"
  tabulate(table_headers, table_data, 50, title=title)

+------------------------------------------------------------------------------------+
|                                Duplicate Questions                                 |
+-------------------------------+----------------------------------------------------+
| KEY                           | VALUE                                              |
+-------------------------------+----------------------------------------------------+
| Question 1                    | Do you watch Korean dramas?                        |
| Question2                     | Is it normal to watch Korean drama if you are a    |
|                               | guy?                                               |
| PREDICTED CLASS               | 0                                                  |
| PREDICTED CLASS NAME          | not duplicate                                      |
| REAL CLASS                    | 0                                                  |
| REAL CLASS NAME               | not dupli

### Conclusion.

We have learned how to create a model that maps 2 inputs to one input. What's Next?

### Next.
In the next notebook we are going to do the same task with different the modified `FastText`. We are going to use this notbook as the base and then we expand it.