## Assignment 4 - Question Duplicates

<a name='0'></a>
## Overview
In this assignment, you will:

- Learn about Siamese networks
- Understand how the triplet loss works
- Understand how to evaluate accuracy
- Use cosine similarity between the model's outputted vectors
- Use the data generator to get batches of questions
- Predict using your own model

By now, you are familiar with Trax and know how to use classes to define your model. You will start this assignment by preprocessing the data the same way you did in the previous assignments. After processing the data, you will build a classifier that identifies whether two questions are duplicates of each other.
<img src = "images/meme.png" style="width:550px;height:300px;"/>


You will first process the data and then pad it, as you did in the previous assignment. Your model will take in the two question embeddings, run them through an LSTM, and then compare the outputs of the two subnetworks using cosine similarity. Before diving into the model, start by importing the data set.
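To make the comparison step concrete, here is a minimal sketch of cosine similarity between two vectors. It uses plain Python rather than Trax, and the function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the assignment code.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Toy vectors (hypothetical): parallel vectors score close to 1,
# orthogonal vectors score close to 0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

In the model, the two vectors would be the outputs of the two subnetworks for a pair of questions; a score near 1 suggests the questions are duplicates.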

### 1. Importing the data

In [28]:
import os
import nltk
import trax
from trax import layers as tl
from trax.supervised import training
from trax.fastmath import numpy as fastnp
import numpy as np
import pandas as pd
import random as rnd
from trax import shapes

import w4_unittest

nltk.data.path.append('nltk_data')
rnd.seed(4)

In [3]:
data = pd.read_csv('data/questions.csv')
print(f"number of questions pairs: {len(data)}")
data.head()

number of questions pairs: 404351


| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [4]:
N_TRAIN = 300000
N_TEST=10*1024
data_train = data[:N_TRAIN]
data_test = data[N_TRAIN:N_TRAIN + N_TEST]

print(f"length of training set: {len(data_train)}, length of testing set: {len(data_test)}")

length of training set: 300000, length of testing set: 10240


In [15]:
is_duplicate_index = data_train[data_train['is_duplicate'] == True].index.to_list()
is_duplicate_index[:10]

[5, 7, 11, 12, 13, 15, 16, 18, 20, 29]

In [16]:
# is_duplicate_index was built from data_train, so compare against len(data_train), not len(data)
print(f"number of duplicate questions: {len(is_duplicate_index)}, number of non-duplicate questions: {len(data_train) - len(is_duplicate_index)}")

number of duplicate questions: 111486, number of non-duplicate questions: 188514


In [18]:
data.loc[is_duplicate_index[:5]]

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 5 | 5 | 11 | 12 | Astrology: I am a Capricorn Sun Cap moon and c... | I'm a triple Capricorn (Sun, Moon and ascendan... | 1 |
| 7 | 7 | 15 | 16 | How can I be a good geologist? | What should I do to be a great geologist? | 1 |
| 11 | 11 | 23 | 24 | How do I read and find my YouTube comments? | How can I see all my Youtube comments? | 1 |
| 12 | 12 | 25 | 26 | What can make Physics easy to learn? | How can you make physics easy to learn? | 1 |
| 13 | 13 | 27 | 28 | What was your first sexual experience like? | What was your first sexual experience? | 1 |


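Split the training and test sets into their `question1` and `question2` columns. The training set keeps only the duplicate pairs (selected with `is_duplicate_index`), while the test set keeps every pair along with its `is_duplicate` label.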

In [23]:
q1_train_words = data_train.loc[is_duplicate_index,'question1']
q2_train_words = data_train.loc[is_duplicate_index,'question2']

q1_test_words = data_test['question1']
q2_test_words = data_test['question2']
y_test = data_test['is_duplicate']


Q1 and Q2 training breakdown

In [25]:
q1_train_words[:10], q2_train_words[:10]

(5     Astrology: I am a Capricorn Sun Cap moon and c...
 7                        How can I be a good geologist?
 11          How do I read and find my YouTube comments?
 12                 What can make Physics easy to learn?
 13          What was your first sexual experience like?
 15    What would a Trump presidency mean for current...
 16                         What does manipulation mean?
 18    Why are so many Quora users posting questions ...
 20                           Why do rockets look white?
 29               How should I prepare for CA final law?
 Name: question1, dtype: object,
 5     I'm a triple Capricorn (Sun, Moon and ascendan...
 7             What should I do to be a great geologist?
 11               How can I see all my Youtube comments?
 12              How can you make physics easy to learn?
 13               What was your first sexual experience?
 15    How will a Trump presidency affect the student...
 16                        What does manipulation means

In [21]:
print(f"number of q1 train words: {len(q1_train_words)}, and number of q2 train words {len(q2_train_words)}")

number of q1 train words: 111486, and number of q2 train words 111486


Q1 and Q2 testing breakdown.

In [24]:
print(f"number of q1 test words: {len(q1_test_words)}, number of q2 test words: {len(q2_test_words)}, number of y test labels: {len(y_test)}")

number of q1 test words: 10240, number of q2 test words: 10240, number of y test labels: 10240


In [30]:
test_sentence = "How old are you"
test_words = nltk.word_tokenize(test_sentence)
test_words

['How', 'old', 'are', 'you']

In [33]:
q1_train = q1_train_words.apply(lambda x : nltk.word_tokenize(x))
q2_train = q2_train_words.apply(lambda x : nltk.word_tokenize(x))
q1_train.head(), q2_train.head()

(5     [Astrology, :, I, am, a, Capricorn, Sun, Cap, ...
 7              [How, can, I, be, a, good, geologist, ?]
 11    [How, do, I, read, and, find, my, YouTube, com...
 12       [What, can, make, Physics, easy, to, learn, ?]
 13    [What, was, your, first, sexual, experience, l...
 Name: question1, dtype: object,
 5     [I, 'm, a, triple, Capricorn, (, Sun, ,, Moon,...
 7     [What, should, I, do, to, be, a, great, geolog...
 11    [How, can, I, see, all, my, Youtube, comments, ?]
 12    [How, can, you, make, physics, easy, to, learn...
 13      [What, was, your, first, sexual, experience, ?]
 Name: question2, dtype: object)

In [34]:
q1_train = q1_train.to_numpy()
q2_train = q2_train.to_numpy()
q1_train, q2_train

(array([list(['Astrology', ':', 'I', 'am', 'a', 'Capricorn', 'Sun', 'Cap', 'moon', 'and', 'cap', 'rising', '...', 'what', 'does', 'that', 'say', 'about', 'me', '?']),
        list(['How', 'can', 'I', 'be', 'a', 'good', 'geologist', '?']),
        list(['How', 'do', 'I', 'read', 'and', 'find', 'my', 'YouTube', 'comments', '?']),
        ...,
        list(['What', 'are', 'the', 'top', '10', 'TV', 'series', 'one', 'should', 'genuinely', 'watch', '?']),
        list(['Is', 'there', 'no', 'life', 'on', 'other', 'planets', '?']),
        list(['How', 'do', 'I', 'tell', 'the', 'difference', 'between', 'infatuation', 'and', 'love', '?'])],
       dtype=object),
 array([list(['I', "'m", 'a', 'triple', 'Capricorn', '(', 'Sun', ',', 'Moon', 'and', 'ascendant', 'in', 'Capricorn', ')', 'What', 'does', 'this', 'say', 'about', 'me', '?']),
        list(['What', 'should', 'I', 'do', 'to', 'be', 'a', 'great', 'geologist', '?']),
        list(['How', 'can', 'I', 'see', 'all', 'my', 'Youtube', 'comments'

In [35]:
q1_test = q1_test_words.apply(lambda x : nltk.word_tokenize(x))
q2_test = q2_test_words.apply(lambda x : nltk.word_tokenize(x))
q1_test.head(), q2_test.head()

(300000    [How, do, I, prepare, for, interviews, for, cs...
 300001    [What, is, the, best, bicycle, to, buy, under,...
 300002    [How, do, I, become, Mutual, funds, distribute...
 300003                  [Will, this, relationship, work, ?]
 300004                [How, does, Brexit, affect, India, ?]
 Name: question1, dtype: object,
 300000    [What, is, the, best, way, to, prepare, for, c...
 300001    [Which, is, the, best, bike, in, in, dia, to, ...
 300002    [How, do, I, become, mutual, funds, distributo...
 300003    [Relationship, :, Will, this, relationship, wo...
 300004    [Will, the, GBP/AUD, be, affected, by, Brexit, ?]
 Name: question2, dtype: object)

In [36]:
q1_test = q1_test.to_numpy()
q2_test = q2_test.to_numpy()
q1_test, q2_test

(array([list(['How', 'do', 'I', 'prepare', 'for', 'interviews', 'for', 'cse', '?']),
        list(['What', 'is', 'the', 'best', 'bicycle', 'to', 'buy', 'under', '10k', '?']),
        list(['How', 'do', 'I', 'become', 'Mutual', 'funds', 'distributer', 'for', 'all', 'company', 'mutual', 'funds', '?']),
        ...,
        list(['What', 'are', 'some', 'biblical', 'examples', 'of', 'God', 'giving', 'people', 'more', 'than', 'they', 'can', 'handle', '?']),
        list(['What', 'is', 'the', 'main', 'cause', 'of', 'typhoons', '?']),
        list(['How', 'does', 'one', 'become', 'a', 'man', 'of', 'action', '?'])],
       dtype=object),
 array([list(['What', 'is', 'the', 'best', 'way', 'to', 'prepare', 'for', 'cse', '?']),
        list(['Which', 'is', 'the', 'best', 'bike', 'in', 'in', 'dia', 'to', 'buy', 'in', 'INR', '10k', '?']),
        list(['How', 'do', 'I', 'become', 'mutual', 'funds', 'distributor', 'for', 'all', 'company', 'mutual', 'funds', '?']),
        ...,
        list(['If', 'Go

In [40]:
question_words = [word for words in q1_train for word in words]
question_words.extend([word for words in q2_train for word in words])
question_words

['Astrology',
 ':',
 'I',
 'am',
 'a',
 'Capricorn',
 'Sun',
 'Cap',
 'moon',
 'and',
 'cap',
 'rising',
 '...',
 'what',
 'does',
 'that',
 'say',
 'about',
 'me',
 '?',
 'How',
 'can',
 'I',
 'be',
 'a',
 'good',
 'geologist',
 '?',
 'How',
 'do',
 'I',
 'read',
 'and',
 'find',
 'my',
 'YouTube',
 'comments',
 '?',
 'What',
 'can',
 'make',
 'Physics',
 'easy',
 'to',
 'learn',
 '?',
 'What',
 'was',
 'your',
 'first',
 'sexual',
 'experience',
 'like',
 '?',
 'What',
 'would',
 'a',
 'Trump',
 'presidency',
 'mean',
 'for',
 'current',
 'international',
 'master',
 '’',
 's',
 'students',
 'on',
 'an',
 'F1',
 'visa',
 '?',
 'What',
 'does',
 'manipulation',
 'mean',
 '?',
 'Why',
 'are',
 'so',
 'many',
 'Quora',
 'users',
 'posting',
 'questions',
 'that',
 'are',
 'readily',
 'answered',
 'on',
 'Google',
 '?',
 'Why',
 'do',
 'rockets',
 'look',
 'white',
 '?',
 'How',
 'should',
 'I',
 'prepare',
 'for',
 'CA',
 'final',
 'law',
 '?',
 'What',
 'are',
 'some',
 'special',
 'ca

In [41]:
from collections import defaultdict

vocab = defaultdict(lambda : 0)
vocab['<PAD>']=1

for word in question_words:
    if word not in vocab:
        vocab[word] = len(vocab) + 1

vocab

defaultdict(<function __main__.<lambda>()>,
            {'<PAD>': 1,
             'Astrology': 2,
             ':': 3,
             'I': 4,
             'am': 5,
             'a': 6,
             'Capricorn': 7,
             'Sun': 8,
             'Cap': 9,
             'moon': 10,
             'and': 11,
             'cap': 12,
             'rising': 13,
             '...': 14,
             'what': 15,
             'does': 16,
             'that': 17,
             'say': 18,
             'about': 19,
             'me': 20,
             '?': 21,
             'How': 22,
             'can': 23,
             'be': 24,
             'good': 25,
             'geologist': 26,
             'do': 27,
             'read': 28,
             'find': 29,
             'my': 30,
             'YouTube': 31,
             'comments': 32,
             'What': 33,
             'make': 34,
             'Physics': 35,
             'easy': 36,
             'to': 37,
             'learn': 38,
             'was

In [43]:
print(f"'<PAD>' index {vocab['<PAD>']}")
print(f"'Astrology' index {vocab['Astrology']}")
print(f"'Astronomy' index {vocab['Astronomy']}")

'<PAD>' index 1
'Astrology' index 2
'Astronomy' index 0
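One detail of the `defaultdict` lookup above is worth illustrating: indexing a missing key returns the default `0`, but it also stores that key in the dictionary, so `len(vocab)` grows. The sketch below uses a small toy vocabulary (the name `toy_vocab` and its words are assumptions for illustration) to contrast `vocab[word]` with `vocab.get(word, 0)`, which returns the same id without inserting anything.

```python
from collections import defaultdict

# Toy vocabulary built the same way as `vocab` above (hypothetical words).
toy_vocab = defaultdict(lambda: 0)
toy_vocab['<PAD>'] = 1
for word in ['How', 'old', 'are', 'you']:
    if word not in toy_vocab:
        toy_vocab[word] = len(toy_vocab) + 1

size_before = len(toy_vocab)

# .get() does not trigger the default factory, so nothing is inserted.
unk_via_get = toy_vocab.get('Astronomy', 0)
size_after_get = len(toy_vocab)

# Indexing a missing key calls the factory and stores 'Astronomy': 0.
unk_via_index = toy_vocab['Astronomy']
size_after_index = len(toy_vocab)
```

If the vocabulary size is later used to set the embedding dimension, using `.get(word, 0)` for lookups is one way to keep that size stable.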


<a name='1-2'></a>
### 1.2 - Converting a Question to a Tensor

You will now convert every question to a tensor, or an array of numbers, using your vocabulary built above.

In [44]:
# Encode each question separately so that question boundaries are preserved
q1_train = [[vocab[w] for w in words] for words in q1_train]
q2_train = [[vocab[w] for w in words] for words in q2_train]

q1_train[:3]

[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
 [22, 23, 4, 24, 6, 25, 26, 21],
 [22, 27, 4, 28, 11, 29, 30, 31, 32, 21]]

In [45]:
# Encode each question separately; words missing from the training vocabulary map to 0
q1_test = [[vocab[w] for w in words] for words in q1_test]
q2_test = [[vocab[w] for w in words] for words in q2_test]

q1_test[:3]

[[22, 27, 4, 76, 49, 779, 49, 9242, 21],
 [33, 126, 89, 163, 7100, 37, 557, 221, 6633, 21],
 [22, 27, 4, 408, 1031, 574, 0, 49, 190, 1791, 1584, 574, 21]]
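The overview mentions padding the questions after they are converted to ids. As a rough sketch of that step, assuming each question is kept as its own list of ids and that the pad id is `1` (the value assigned to `'<PAD>'` in the vocabulary above), a batch could be padded to a common length like this. The helper name `pad_batch` is hypothetical and not part of the assignment code.

```python
PAD_ID = 1  # assumed to match vocab['<PAD>']

def pad_batch(questions, pad_id=PAD_ID):
    """Right-pad every question in the batch to the length of the longest one."""
    max_len = max(len(q) for q in questions)
    return [q + [pad_id] * (max_len - len(q)) for q in questions]

# Two encoded questions of different lengths (8 and 10 ids).
batch = [
    [22, 23, 4, 24, 6, 25, 26, 21],
    [22, 27, 4, 28, 11, 29, 30, 31, 32, 21],
]
padded = pad_batch(batch)
```

After padding, every question in the batch has the same length, which is what lets the batch be stacked into a single array for the model.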

In [None]:
train_cutoff = int(len(q1_train) * 0.8)
train_q1, train_q2 = q1_train[:train_cutoff], q2_train[:train_cutoff]
val_q1, val_q2 = q1_train[train_cutoff:], q2_train[train_cutoff:]
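The 80/20 split above only behaves as intended when `q1_train` and `q2_train` hold one entry per question pair, so that the same cutoff keeps the two lists aligned. Under that assumption, the 111486 duplicate pairs would give a cutoff of 89188 training pairs and 22298 validation pairs. A small sketch of the same logic on toy data, where the function name `split_pairs` and the toy lists are illustrative assumptions:

```python
def split_pairs(q1, q2, train_fraction=0.8):
    """Split two aligned lists of questions at the same cutoff index."""
    cutoff = int(len(q1) * train_fraction)
    train = (q1[:cutoff], q2[:cutoff])
    val = (q1[cutoff:], q2[cutoff:])
    return train, val

# Ten toy question pairs, each question a short list of ids.
toy_q1 = [[i, i + 1] for i in range(10)]
toy_q2 = [[i, i + 2] for i in range(10)]

(toy_train_q1, toy_train_q2), (toy_val_q1, toy_val_q2) = split_pairs(toy_q1, toy_q2)
```

Because both lists are sliced at the same index, pair `i` in the training split of `q1` still corresponds to pair `i` in the training split of `q2`.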