#  Question duplicates


## Importing and Loading the Data

I use the Quora question-pairs dataset to build a model that identifies similar questions. This is a useful task because a Q&A site does not want several versions of the same question posted. The dataset is already labelled: each pair of questions is marked as duplicate or not. The cell below imports the packages used in this project.  

In [None]:
import os
import nltk
import trax
from trax import layers as tl
from trax.supervised import training
from trax.fastmath import numpy as fastnp
import numpy as np
import pandas as pd
import random as rnd

# set random seeds
trax.supervised.trainer_lib.init_random_number_generators(34)
rnd.seed(34)



**In this project Trax's numpy is referred to as `fastnp`, while the regular numpy is referred to as `np`**

In the cell below, I load the dataset and inspect its size and contents.  

In [None]:
data = pd.read_csv("questions.csv")
N=len(data)
print(data.shape)
print('Number of question pairs: ', N)

print('the number of duplicates:',np.sum(data['is_duplicate']==1))
print('example of duplicates:',data[data['is_duplicate']==1].iloc[0])
data.head()


(404351, 6)
Number of question pairs:  404351
the number of duplicates: 149306
example of duplicates: id                                                              5
qid1                                                           11
qid2                                                           12
question1       Astrology: I am a Capricorn Sun Cap moon and c...
question2       I'm a triple Capricorn (Sun, Moon and ascendan...
is_duplicate                                                    1
Name: 5, dtype: object


| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


We first split the data into a train and test set. The test set will be used later to evaluate our model.

In [None]:
N_train = 300000
N_test  = 10*1024
data_train = data[:N_train]
data_test  = data[N_train:N_train+N_test]
print("Train set:", len(data_train), "Test set:", len(data_test))
del(data) # remove to free memory

Train set: 300000 Test set: 10240


I select only the question pairs that are duplicates to train the model. 
I build two batches as input for the Siamese network, assuming that question $q1_i$ (question $i$ in the first batch) is a duplicate of $q2_i$ (question $i$ in the second batch), and that all other questions in the second batch are not duplicates of $q1_i$.  
The test set uses the original pairs of questions together with the label stating whether each pair is a duplicate.

In [None]:
td_index = (data_train['is_duplicate'] == 1).to_numpy()
td_index = [i for i, x in enumerate(td_index) if x] 
print('number of duplicate questions: ', len(td_index))
print('indexes of first ten duplicate questions:', td_index[:10])

number of duplicate questions:  111486
indexes of first ten duplicate questions: [5, 7, 11, 12, 13, 15, 16, 18, 20, 29]


In [None]:
print(data_train['question1'][7])  #  Example of a duplicate pair (index 7 is the second duplicate in the data)
print(data_train['question2'][7])
print('is_duplicate: ', data_train['is_duplicate'][7])

How can I be a good geologist?
What should I do to be a great geologist?
is_duplicate:  1


In [None]:
Q1_train_words = np.array(data_train['question1'][td_index])
Q2_train_words = np.array(data_train['question2'][td_index])

Q1_test_words = np.array(data_test['question1'])
Q2_test_words = np.array(data_test['question2'])
y_test  = np.array(data_test['is_duplicate'])



Only duplicate questions are used for training, because the data generator will produce batches $([q1_1, q1_2, q1_3, ...]$, $[q2_1, q2_2,q2_3, ...])$  where $q1_i$ and $q2_k$ are duplicates if and only if $i = k$.
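
As a small illustration of this batch structure, the labels implied by a batch form an identity matrix: entry $(i, k)$ is 1 only when $i = k$. The sketch below assumes a hypothetical batch size of 4, chosen only for display.

```python
import numpy as np

# Hypothetical batch size, chosen only for illustration.
batch_size = 4

# Entry [i, k] is 1 when q1_i and q2_k are duplicates, which by construction
# happens only on the diagonal (i == k).
implied_labels = np.eye(batch_size, dtype=int)

n_positive = int(implied_labels.sum())            # one duplicate per row
n_negative = implied_labels.size - n_positive     # every off-diagonal pair

print(implied_labels)
print("positive pairs:", n_positive, "negative pairs:", n_negative)
```

Every row therefore contributes one positive pair and `batch_size - 1` negative pairs without any extra labelling.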

<br>Let's print to see what the data looks like.

In [None]:
print('TRAINING QUESTIONS:\n')
print('Question 1: ', Q1_train_words[0])
print('Question 2: ', Q2_train_words[0], '\n')
print('Question 1: ', Q1_train_words[5])
print('Question 2: ', Q2_train_words[5], '\n')

print('TESTING QUESTIONS:\n')
print('Question 1: ', Q1_test_words[0])
print('Question 2: ', Q2_test_words[0], '\n')
print('is_duplicate =', y_test[0], '\n')

TRAINING QUESTIONS:

Question 1:  Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
Question 2:  I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me? 

Question 1:  What would a Trump presidency mean for current international master’s students on an F1 visa?
Question 2:  How will a Trump presidency affect the students presently in US or planning to study in US? 

TESTING QUESTIONS:

Question 1:  How do I prepare for interviews for cse?
Question 2:  What is the best way to prepare for cse? 

is_duplicate = 0 



Here I tokenize the questions using `nltk.word_tokenize` and build the vocabulary as a Python `defaultdict`, which assigns the value 0 to all Out Of Vocabulary (OOV) words.

In [None]:
#create arrays
Q1_train = np.empty_like(Q1_train_words)
Q2_train = np.empty_like(Q2_train_words)

Q1_test = np.empty_like(Q1_test_words)
Q2_test = np.empty_like(Q2_test_words)
print(Q1_train.shape)

(111486,)


In [None]:
# Building the vocabulary with the train set         (this might take a minute)
from collections import defaultdict

vocab = defaultdict(lambda: 0)
vocab['<PAD>'] = 1

for idx in range(len(Q1_train_words)):
    Q1_train[idx] = nltk.word_tokenize(Q1_train_words[idx])
    Q2_train[idx] = nltk.word_tokenize(Q2_train_words[idx])
    q = Q1_train[idx] + Q2_train[idx]
    for word in q:
        if word not in vocab:
            vocab[word] = len(vocab) + 1
print('The length of the vocabulary is: ', len(vocab))
#print(Q1_train[0])

The length of the vocabulary is:  36268


In [None]:
print(vocab['<PAD>'])
print(vocab['Astrology'])
print(vocab['Astronomy'])  #not in vocabulary, returns 0

1
2
0
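
One detail of `defaultdict` is worth keeping in mind: indexing it with a missing key returns the default and also stores that key, so `len(vocab)` measured after OOV lookups can be larger than the count printed above. A minimal sketch with a toy vocabulary (the name `toy_vocab` and its entries are illustrative, not part of the notebook):

```python
from collections import defaultdict

# Toy vocabulary mirroring the construction above (the words are illustrative).
toy_vocab = defaultdict(lambda: 0)
toy_vocab['<PAD>'] = 1
toy_vocab['Astrology'] = 2

size_before = len(toy_vocab)
oov_id = toy_vocab['Astronomy']        # unseen word: returns 0 and stores the key
size_after = len(toy_vocab)

# dict.get reads without inserting, so the size stays the same.
safe_id = toy_vocab.get('Geology', 0)
size_after_get = len(toy_vocab)

print(oov_id, size_before, size_after, size_after_get)
```

Using `.get(word, 0)` for lookups is one way to keep the vocabulary size fixed once it has been built.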


In [None]:
for idx in range(len(Q1_test_words)): 
    Q1_test[idx] = nltk.word_tokenize(Q1_test_words[idx])
    Q2_test[idx] = nltk.word_tokenize(Q2_test_words[idx])

In [None]:
print('Train set has reduced to: ', len(Q1_train) ) 
print('Test set length: ', len(Q1_test) ) 

Train set has reduced to:  111486
Test set length:  10240


<a name='1.2'></a>
### Converting a question to a tensor

Each question is converted to a tensor, that is, an array of numbers, using the vocabulary built above. 

In [None]:
# Converting questions to array of integers
for i in range(len(Q1_train)):
    Q1_train[i] = [vocab[word] for word in Q1_train[i]]
    Q2_train[i] = [vocab[word] for word in Q2_train[i]]

        
for i in range(len(Q1_test)):
    Q1_test[i] = [vocab[word] for word in Q1_test[i]]
    Q2_test[i] = [vocab[word] for word in Q2_test[i]]

In [None]:
print('first question in the train set:\n')
print(Q1_train_words[0], '\n') 
print('encoded version:')
print(Q1_train[0],'\n')

print('first question in the test set:\n')
print(Q1_test_words[0], '\n')
print('encoded version:')
print(Q1_test[0]) 

first question in the train set:

Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me? 

encoded version:
[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21] 

first question in the test set:

How do I prepare for interviews for cse? 

encoded version:
[32, 38, 4, 107, 65, 1015, 65, 11509, 21]


I now split the train set into training and validation sets so that they can be used to train and evaluate the Siamese model.

In [None]:
# Splitting the data
cut_off = int(len(Q1_train)*.8)
train_Q1, train_Q2 = Q1_train[:cut_off], Q2_train[:cut_off]
val_Q1, val_Q2 = Q1_train[cut_off: ], Q2_train[cut_off:]
print('Number of duplicate questions: ', len(Q1_train))
print("The length of the training set is:  ", len(train_Q1))
print("The length of the validation set is: ", len(val_Q1))


Number of duplicate questions:  111486
The length of the training set is:   89188
The length of the validation set is:  22298






Calling `next` on the data generator returns the next batch. The generator yields data in a format that can be fed directly to the model's forward pass: a pair of arrays of questions, padded to a common length.

In [None]:

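def data_generator(Q1, Q2, batch_size, pad=1, shuffle=True):
    """Yield endless batches (q1_batch, q2_batch) of padded question pairs.

    Each batch is padded with `pad` to a power-of-two length.
    """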
    input1 = []
    input2 = []
    idx = 0
    len_q = len(Q1)
    question_indexes = [*range(len_q)]
    
    if shuffle:
        rnd.shuffle(question_indexes)
    

    while True:
        if idx >= len_q:

            idx = 0
            if shuffle:
                rnd.shuffle(question_indexes)

        q1 = Q1[question_indexes[idx]]
        q2 = Q2[question_indexes[idx]]
        
        # increment idx by 1
        idx += 1
        # append q1
        input1.append(q1)
        # append q2
        input2.append(q2)
       
        if len(input1) == batch_size:
            max_len = max(max([len(q) for q in input1]),max([len(q) for q in input2]))
            max_len = 2**int(np.ceil(np.log2(max_len)))
            b1 = []
            b2 = []
            for q1, q2 in zip(input1, input2):
                # add [pad] to q1 until it reaches max_len
                q1 = q1+(max_len-len(q1))*[pad]
                # add [pad] to q2 until it reaches max_len
                q2 = q2+(max_len-len(q2))*[pad]
                # append q1
                b1.append(q1)
                # append q2
                b2.append(q2)
            # yield the padded batch as two arrays
            yield np.array(b1), np.array(b2)
            # reset the batches
            input1, input2 = [], []

In [None]:
batch_size = 2
res1, res2 = next(data_generator(train_Q1, train_Q2, batch_size))
print("First questions  : ",'\n', res1, '\n')
print("Second questions : ",'\n', res2)

First questions  :  
 [[  30   87   78  134 2132 1981   28   78  594   21    1    1    1    1
     1    1]
 [  30   55   78 3541 1460   28   56  253   21    1    1    1    1    1
     1    1]] 

Second questions :  
 [[  30  156   78  134 2132 9508   21    1    1    1    1    1    1    1
     1    1]
 [  30  156   78 3541 1460  131   56  253   21    1    1    1    1    1
     1    1]]
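
The padded width of 16 in the output above follows from rounding the longest question in the batch up to the next power of two. A minimal sketch of that step, using the same formula as `data_generator` (the helper names `padded_length` and `pad_to` are introduced here only for illustration):

```python
import numpy as np

def padded_length(max_len):
    """Smallest power of two that is >= max_len (same formula as data_generator)."""
    return 2 ** int(np.ceil(np.log2(max_len)))

def pad_to(question, length, pad=1):
    """Right-pad a list of token ids with `pad` up to `length`."""
    return question + (length - len(question)) * [pad]

# The longest question in the batch shown above has 10 tokens.
target = padded_length(10)
padded = pad_to([30, 156, 78, 134, 2132, 9508, 21], target)

print(target)
print(padded)
```

Rounding to a power of two limits the number of distinct batch shapes the model sees, at the cost of some extra padding.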


## Defining Siamese Network

In [None]:
def Siamese(vocab_size=len(vocab), d_model=128, mode='train'):


    def normalize(x):  # normalizes the vectors to have L2 norm 1
        return x / fastnp.sqrt(fastnp.sum(x * x, axis=-1, keepdims=True))
    

    q_processor = tl.Serial(  # Processor will run on Q1 and Q2.
        tl.Embedding(vocab_size,d_model), # Embedding layer
        tl.LSTM(n_units=d_model), # LSTM layer
        tl.Mean(axis=1), # Mean over the sequence (time) axis
        tl.Fn('normalize',lambda x:normalize(x))  # Apply normalize function
    )  # Returns one vector of shape [batch_size, d_model].
    
    
    # Run on Q1 and Q2 in parallel.
    model = tl.Parallel(q_processor, q_processor)
    return model


Set up the Siamese network model

In [None]:
# check your model
model = Siamese()
print(model)

Parallel_in2_out2[
  Serial[
    Embedding_41699_128
    LSTM_128
    Mean
    normalize
  ]
  Serial[
    Embedding_41699_128
    LSTM_128
    Mean
    normalize
  ]
]
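
The final `normalize` layer matters because the dot product of two unit-length vectors equals their cosine similarity. A NumPy sketch of that property, with two made-up vectors standing in for the averaged LSTM outputs:

```python
import numpy as np

def normalize(x):
    """Scale x to unit L2 norm along the last axis (NumPy version of the helper above)."""
    return x / np.sqrt(np.sum(x * x, axis=-1, keepdims=True))

# Two made-up vectors standing in for the averaged LSTM outputs.
raw_a = np.array([1.0, 2.0, 3.0])
raw_b = np.array([9.0, 10.0, 11.0])
a, b = normalize(raw_a), normalize(raw_b)

# Dot product of the normalized vectors.
dot_of_units = np.dot(a, b)
# Cosine similarity computed directly from the raw vectors.
cosine = np.dot(raw_a, raw_b) / (np.linalg.norm(raw_a) * np.linalg.norm(raw_b))

print(np.linalg.norm(a), np.linalg.norm(b))
print(dot_of_units, cosine)
```

Both printed similarity values agree, which is why a plain matrix product of the two output batches can be read as a table of cosine similarities.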


## Implementing Triplet Loss

In [None]:

def TripletLossFn(v1, v2, margin=0.25):

    # use fastnp to take the dot product of the two batches (don't forget to transpose the second argument)
    scores = fastnp.matmul(v1,v2.T)  # pairwise cosine sim
    # calculate new batch size
    batch_size = len(scores)
    # use fastnp to grab all positive `diagonal` entries in `scores`
    positive = fastnp.diagonal(scores)  # the positive ones (duplicates)
    # multiply `fastnp.eye(batch_size)` with 2.0 and subtract it out of `scores`
    negative_without_positive = scores-fastnp.eye(batch_size)*2.0
    # take the row by row `max` of `negative_without_positive`. 
    closest_negative = negative_without_positive.max(axis=1)
    # subtract `fastnp.eye(batch_size)` out of 1.0 and do element-wise multiplication with `scores`
    negative_zero_on_duplicate = (1-fastnp.eye(batch_size))*scores
    # use `fastnp.sum` on `negative_zero_on_duplicate` for `axis=1` and divide it by `(batch_size - 1)` 
    mean_negative = fastnp.sum(negative_zero_on_duplicate,axis=1)/(batch_size-1)
    # loss against the mean negative: max(mean_negative - positive + margin, 0)
    triplet_loss1 = fastnp.maximum(-positive+mean_negative+margin,0)
    # loss against the closest (hardest) negative: max(closest_negative - positive + margin, 0)
    triplet_loss2 = fastnp.maximum(-positive+closest_negative+margin,0)
    # add the two losses together and take the `fastnp.mean` of it
    triplet_loss = fastnp.mean(triplet_loss1+triplet_loss2)

    return triplet_loss

In [None]:
v1 = np.array([[0.26726124, 0.53452248, 0.80178373],[0.5178918 , 0.57543534, 0.63297887]])
v2 = np.array([[ 0.26726124,  0.53452248,  0.80178373],[-0.5178918 , -0.57543534, -0.63297887]])
TripletLossFn(v2,v1)
print("Triplet Loss:", TripletLossFn(v2,v1))

Triplet Loss: 0.5
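
To inspect the intermediate quantities, the same computation can be written in plain NumPy. With $s_{pos}$ the diagonal similarity, $\bar{s}_{neg}$ the mean off-diagonal similarity in a row, $s_{closest}$ the largest off-diagonal similarity in that row, and margin $\alpha$, each row contributes $\max(\bar{s}_{neg} - s_{pos} + \alpha, 0) + \max(s_{closest} - s_{pos} + \alpha, 0)$. The function name `triplet_loss_np` is introduced here for illustration; this is a sketch, not a replacement for the Trax layer.

```python
import numpy as np

def triplet_loss_np(v1, v2, margin=0.25):
    """NumPy re-implementation of TripletLossFn that also returns intermediates."""
    scores = v1 @ v2.T                                   # pairwise cosine similarities
    n = len(scores)
    positive = np.diagonal(scores)                       # similarity of true duplicates
    # Push the diagonal below any possible cosine value before taking the row max.
    closest_negative = (scores - 2.0 * np.eye(n)).max(axis=1)
    mean_negative = ((1.0 - np.eye(n)) * scores).sum(axis=1) / (n - 1)
    loss_mean = np.maximum(mean_negative - positive + margin, 0.0)
    loss_closest = np.maximum(closest_negative - positive + margin, 0.0)
    return np.mean(loss_mean + loss_closest), closest_negative, mean_negative

v1 = np.array([[0.26726124, 0.53452248, 0.80178373], [0.5178918, 0.57543534, 0.63297887]])
v2 = np.array([[0.26726124, 0.53452248, 0.80178373], [-0.5178918, -0.57543534, -0.63297887]])

# Same argument order as the cell above.
loss, closest, mean_neg = triplet_loss_np(v2, v1)
print(loss)
print(closest, mean_neg)
```

With a batch of two, each row has a single negative, so the closest negative and the mean negative coincide.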


To make a layer out of a function with no trainable variables, use `tl.Fn`.

In [None]:
from functools import partial
def TripletLoss(margin=0.25):
    triplet_loss_fn = partial(TripletLossFn, margin=margin)
    return tl.Fn('TripletLoss', triplet_loss_fn)



# Training
We define the inputs using the data generator built above. Each generator object keeps its own position in the data between calls, so successive calls to `next` yield successive batches. 

In [None]:
batch_size = 256
train_generator = data_generator(train_Q1, train_Q2, batch_size, vocab['<PAD>'])
val_generator = data_generator(val_Q1, val_Q2, batch_size, vocab['<PAD>'])
print('train_Q1.shape ', train_Q1.shape)
print('val_Q1.shape   ', val_Q1.shape)

train_Q1.shape  (89188,)
val_Q1.shape    (22298,)


In [None]:
lr_schedule = trax.lr.warmup_and_rsqrt_decay(400, 0.01)

def train_model(Siamese, TripletLoss, lr_schedule, train_generator=train_generator, val_generator=val_generator, output_dir='model/'):

    output_dir = os.path.expanduser(output_dir)



    train_task = training.TrainTask(
        labeled_data=train_generator,       # Training data generator
        loss_layer=TripletLoss(),         # Triplet loss layer (instantiated)
        optimizer=trax.optimizers.Adam(0.01),          # Adam with learning rate 0.01
        lr_schedule=lr_schedule, # Warmup followed by rsqrt decay, defined above
    )

    eval_task = training.EvalTask(
        labeled_data=val_generator,       # Validation data generator
        metrics=[TripletLoss()],          # Report triplet loss on validation batches
    )
    

    training_loop = training.Loop(Siamese(),
                                  train_task,
                                  eval_task=eval_task,
                                  output_dir=output_dir)

    return training_loop

In [None]:
train_steps = 5
training_loop = train_model(Siamese, TripletLoss, lr_schedule)
training_loop.run(train_steps)

Step      1: train TripletLoss |  0.49926734
Step      1: eval  TripletLoss |  0.49950904


The model was trained for only 5 steps because of the constraints of this environment. For the rest of the project I use a pretrained model; the cell above still shows how training is done with Trax.

# Evaluating the Siamese Network

In [None]:
# Loading in the saved model
model = Siamese()
model.init_from_file('model.pkl.gz')
model

Parallel_in2_out2[
  Serial[
    Embedding_41699_128
    LSTM_128
    Mean
    normalize
  ]
  Serial[
    Embedding_41699_128
    LSTM_128
    Mean
    normalize
  ]
]

In [None]:

def classify(test_Q1, test_Q2, y, threshold, model, vocab, data_generator=data_generator, batch_size=64):

    accuracy = 0
 
    for i in range(0, len(test_Q1), batch_size):
        # Call the data generator with shuffle=False on one batch-sized chunk of Q1 and Q2
        q1, q2 = next(data_generator(test_Q1[i:i+batch_size],test_Q2[i:i+batch_size],batch_size,vocab['<PAD>'],shuffle=False))
        # matching chunk of ground-truth labels
        y_test = y[i:i+batch_size]
        # Call the model
        v1, v2 = model((q1,q2))

        # iterate over the labels actually present, so a short final chunk does not index past y_test
        for j in range(len(y_test)):
            # dot product of unit vectors = cosine similarity of the pair v1[j], v2[j]
            d = np.dot(v1[j],v2[j])
            # is d greater than the threshold?
            res = d>threshold
            # increment accuracy if the prediction matches the label
            accuracy += (res==y_test[j])
    # fraction of correctly classified pairs
    accuracy = accuracy/len(test_Q1)
    
    return accuracy

In [None]:
# this takes around 1 minute
accuracy = classify(Q1_test,Q2_test, y_test, 0.7, model, vocab, batch_size = 512) 
print("Accuracy", accuracy)

Accuracy 0.69091796875
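
The inner loop of `classify` can also be expressed in vectorized form. The sketch below uses made-up unit vectors and labels in place of model outputs, and the helper name `batch_predictions` is hypothetical:

```python
import numpy as np

def batch_predictions(v1, v2, threshold):
    """Row-wise cosine similarity of unit vectors, thresholded to booleans."""
    similarities = np.sum(v1 * v2, axis=1)   # equals np.dot(v1[j], v2[j]) for each j
    return similarities > threshold

# Made-up unit vectors and labels, only to exercise the function.
v1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
v2 = np.array([[1.0, 0.0], [1.0, 0.0], [0.8, 0.6]])
labels = np.array([1, 0, 0])

preds = batch_predictions(v1, v2, threshold=0.7)
accuracy = np.mean(preds == labels)
print(preds, accuracy)
```

The third pair has similarity 0.96 but label 0, so it counts as an error and the toy accuracy is 2/3.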


<a name='5'></a>

# Testing with your own questions

In this section I test the model with new questions. The function `predict` takes two questions as input and returns a boolean indicating whether the question pair is a duplicate.   

It also helps to have a reverse vocabulary that maps encoded questions back to words. 
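
A reverse vocabulary could be built along the following lines. This is a minimal sketch on a toy dictionary: `toy_vocab`, `decode`, and the `<OOV>` placeholder are names introduced here for illustration and do not appear elsewhere in the notebook.

```python
# Toy vocabulary standing in for `vocab` (the entries are illustrative).
toy_vocab = {'<PAD>': 1, 'How': 2, 'do': 3, 'I': 4, 'prepare': 5, '?': 6}

# Invert the mapping: index -> word.
reverse_vocab = {index: word for word, index in toy_vocab.items()}

def decode(encoded, reverse_vocab, oov_token='<OOV>'):
    """Map a list of ids back to words; unknown ids (such as 0) become oov_token."""
    return [reverse_vocab.get(i, oov_token) for i in encoded]

decoded = decode([2, 3, 4, 5, 0, 6, 1, 1], reverse_vocab)
print(decoded)
```

Because all OOV words share index 0, the original word behind a 0 cannot be recovered, which is why the sketch substitutes a placeholder.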



The `predict` function below takes two questions, a similarity threshold, the model, and the vocabulary, and returns whether the questions are duplicates (`True`) or not (`False`). 

Workflow:
* Tokenize each question using `nltk.word_tokenize`
* Create Q1, Q2 by encoding the questions as lists of numbers using `vocab`
* Pad Q1, Q2 with `next(data_generator([Q1], [Q2], 1, vocab['<PAD>']))`
* Use `model()` to create v1, v2
* Compute the cosine similarity (dot product) `d` of v1, v2
* Compute `res` by comparing `d` to the threshold


In [None]:
def predict(question1, question2, threshold, model, vocab, data_generator=data_generator, verbose=False):


    # use `nltk` word tokenize function to tokenize
    q1 = nltk.word_tokenize(question1)  # tokenize
    q2 = nltk.word_tokenize(question2)  # tokenize
    Q1, Q2 = [vocab[x] for x in q1], [vocab[x] for x in q2]

    Q1, Q2 = next(data_generator([Q1],[Q2],1,vocab['<PAD>'],shuffle=False))
    # Call the model
    v1, v2 = model((Q1,Q2))
    # take dot product to compute cos similarity of each pair of entries, v1, v2
    # don't forget to transpose the second argument
    d = fastnp.dot(v1,v2.T)
    # is d greater than the threshold?
    res = d>threshold
    
    
    if(verbose):
        print("Q1  = ", question1, "\nQ2  = ", question2)
        print("d   = ", d)
        print("res = ", res)

    return res

In [None]:
# Feel free to try with your own questions
question1 = "When will I see you?"
question2 = "When can I see you again?"
# True means the pair is predicted to be a duplicate, False otherwise
predict(question1 , question2, 0.7, model, vocab, verbose = True)

Q1  =  When will I see you? 
Q2  =  When can I see you again?
d   =  [[0.8811324]]
res =  [[ True]]


DeviceArray([[ True]], dtype=bool)

In [None]:
# Feel free to try with your own questions
question1 = "Do they enjoy eating the dessert?"
question2 = "Do they like hiking in the desert?"
# True means the pair is predicted to be a duplicate, False otherwise
predict(question1 , question2, 0.7, model, vocab, verbose=True)

Q1  =  Do they enjoy eating the dessert? 
Q2  =  Do they like hiking in the desert?
d   =  [[0.477536]]
res =  [[False]]


DeviceArray([[False]], dtype=bool)

In [None]:
# Feel free to try with your own questions
question1 = "she looks overweight?"
question2 = "she looks smart?"
# True means the pair is predicted to be a duplicate, False otherwise
predict(question1 , question2, 0.7, model, vocab, verbose=True)

Q1  =  she looks overweight? 
Q2  =  she looks smart?
d   =  [[0.7437502]]
res =  [[ True]]


DeviceArray([[ True]], dtype=bool)

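The first two examples are classified as expected: the model flags the reworded question as a duplicate even though the wording differs, and it separates "dessert" from "desert". The third pair, however, scores 0.74 and is flagged as a duplicate at the 0.7 threshold although the two questions mean different things. The model is therefore not free of false positives, which is consistent with the test accuracy of about 69%. 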
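
The decision for each pair depends on the chosen threshold. As a small illustration, the sketch below reuses the three similarity scores printed above and compares the threshold of 0.7 with a hypothetical alternative of 0.75, which is not a tuned value.

```python
import numpy as np

# Similarity scores printed by the three predict() calls above.
scores = np.array([0.8811324, 0.477536, 0.7437502])

# 0.7 is the threshold used above; 0.75 is a hypothetical alternative.
decisions = {threshold: scores > threshold for threshold in (0.7, 0.75)}

for threshold, flagged in decisions.items():
    print(threshold, flagged)
```

At 0.75 the third pair ("overweight" vs "smart") is no longer flagged, but a stricter threshold may also reject true duplicates, so the value would need to be chosen on validation data.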
 

<a name='6'></a>

###  <span style="color:blue"> On Siamese networks </span>

Siamese networks are useful whenever two inputs must be compared for similarity. On Quora and similar platforms many questions have already been asked, and a Siamese network can be used to detect and avoid duplicate questions. 