## Problem 4.
This exercise is inspired by the Quora Question Pairs challenge (https://www.kaggle.com/c/quora-question-pairs) and by the blog post https://medium.com/@bassimfaizal/finding-duplicate-questions-using-datasketch-2ae1f3d8bc5c

In [4]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm # make your loops show the progress
import nltk # The Natural Language Toolkit 


## Step 1: Data Extraction

In [5]:
qa_pairs = pd.read_csv('./train.csv')

In [6]:
qa_pairs.tail()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
404285,404285,433578,379845,How many keywords are there in the Racket prog...,How many keywords are there in PERL Programmin...,0
404286,404286,18840,155606,Do you believe there is life after death?,Is it true that there is life after death?,1
404287,404287,537928,537929,What is one coin?,What's this coin?,0
404288,404288,537930,537931,What is the approx annual cost of living while...,I am having little hairfall problem but I want...,0
404289,404289,537932,537933,What is like to have sex with cousin?,What is it like to have sex with your cousin?,0


In [7]:
qa_pairs.sample(10, random_state=42)

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
8067,8067,15738,15739,How do I play Pokémon GO in Korea?,How do I play Pokémon GO in China?,0
368101,368101,12736,104117,What are some of the best side dishes for crab...,What are some good side dishes for buffalo chi...,0
70497,70497,121486,121487,Which is more advisable and better material fo...,What is the best server setup for buddypress?,0
226567,226567,254474,258192,How do I improve logical programming skills?,How can I improve my logical skills for progra...,1
73186,73186,48103,3062,How close we are to see 3rd world war?,How close is a World War III?,1
215105,215105,177688,83888,What do Chinese people think about Donald Trump?,What do Chinese people think of Donald Trump?,1
253209,253209,367707,153452,How many hours a week do Google employees work?,How many hours a day do Google employees work ...,0
354651,354651,483796,11244,How can we follow a Quora question privately w...,How can we view private Instagram pictures wit...,0
104478,104478,172497,172498,Why are cats so overprotective?,How do you know if your cat is overprotective?,1
163628,163628,254474,254475,How do I improve logical programming skills?,What is the best way to improve logical skills...,1


## Step 2: MinHash and LSH

In [8]:
import datasketch


In [9]:
# Draw a balanced sample: 100 non-duplicate and 100 duplicate question pairs.
# reset_index(drop=True) discards the original index and replaces it with increasing integers.
sents_pairs = pd.concat([qa_pairs[qa_pairs['is_duplicate'] == 0].sample(100, random_state=42), 
                   qa_pairs[qa_pairs['is_duplicate'] == 1].sample(100, random_state=42)]).reset_index(drop=True)
# DataFrame.sample returns a random sample of rows; frac=1. returns all rows, i.e. it shuffles them.
sents_pairs = sents_pairs.sample(frac=1.)
# DataFrame.shape is a (number of rows, number of columns) tuple.
sents_pairs.shape

(200, 6)

In [10]:
sents_pairs.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
179,32509,59839,59840,How do you remove Sharpie from skin?,How do you use Sharpie ink from skin?,1
97,95783,159665,104095,What are the top torrent sites?,What is the best torrent site for games?,0
61,157372,246097,246098,What's the best book on abstract algebra?,What is the best introductory abstract algebra...,0
100,95846,159763,159764,Should I sell an iPhone 6s and buy an iPhone SE?,Should I buy the iPhone 6s or an SE?,1
117,302583,132417,6804,How can we reduce masturbating?,How should I stop masturbating?,1


In [11]:
sents = pd.concat([sents_pairs['question1'], sents_pairs['question2']])
sents.head()

179                How do you remove Sharpie from skin?
97                      What are the top torrent sites?
61            What's the best book on abstract algebra?
100    Should I sell an iPhone 6s and buy an iPhone SE?
117                     How can we reduce masturbating?
dtype: object

In [12]:
import nltk
# nltk.download('stopwords')

In [13]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

We will create two dictionaries: `set_dict` and `norm_dict`.
The dictionary `set_dict` has the keys `m1`, `m2`, etc.; each value `set_dict[mi]` is the set of shingles of one question. Here a shingle is simply a whitespace-separated word.
The dictionary `norm_dict` maps a question id (e.g. `'m23'`) to the original question text. We will use this dictionary to evaluate the LSH output.

We loop through the questions, split each one into shingles, 
and add every shingle that is not a stop word
to a set, which becomes the value stored in the `set_dict` dictionary.


In [14]:
# "m{0}".format(count) substitutes count for the {0} placeholder, giving the keys 'm1', 'm2', ...

set_dict = {}   # question id -> set of shingles
norm_dict = {}  # question id -> original question text
count = 1
for question in sents:
    temp_list = []
    for shingle in question.split(' '):
        # The stop-word test runs before lowercasing, so capitalized words such as 'What' are kept.
        if shingle not in stop_words:
            temp_list.append(shingle.lower())
    set_dict["m{0}".format(count)] = set(temp_list)
    norm_dict["m{0}".format(count)] = question
    count += 1


In [23]:
set_dict['m23']

{'about?',
 'exist',
 'inventions',
 'know',
 'mind-blowing',
 'people',
 'smartphone',
 'what'}
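
The set above still contains `'what'` and `'about?'`: the stop-word test is applied before lowercasing, and punctuation stays attached to the word. The following is a minimal sketch of one possible normalization, not the notebook's method. The helper `tokenize` and the abbreviated `STOP_WORDS` set are hypothetical stand-ins (the notebook uses NLTK's full English list), and switching to such a tokenizer would change the results shown later.

```python
import string

# Hypothetical, abbreviated stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {"what", "are", "some", "that", "most", "don't", "about"}

def tokenize(question, stop_words=STOP_WORDS):
    """Lowercase, strip leading/trailing punctuation, then drop stop words."""
    tokens = set()
    for word in question.split():
        word = word.lower().strip(string.punctuation)
        if word and word not in stop_words:
            tokens.add(word)
    return tokens

question = ("What are some mind-blowing Smartphone inventions that exist "
            "that most people don't know about?")
print(sorted(tokenize(question)))
```

Lowercasing first lets the lowercase stop-word list match capitalized words, and stripping punctuation makes `about?` and `about` the same token.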

In [16]:
norm_dict['m2']

'What are the top torrent sites?'
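
MinHash and LSH approximate the Jaccard similarity between such token sets, |A ∩ B| / |A ∪ B|. As a minimal sketch, the exact value can be computed directly for one pair. The function name `jaccard` and the two hand-written token sets are illustrative assumptions; the sets are meant to resemble what the loop above would produce for one question pair in the sample.

```python
def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Hand-written token sets (assumed, not read from set_dict).
q1 = {"should", "i", "sell", "iphone", "6s", "buy", "se?"}
q2 = {"should", "i", "buy", "iphone", "6s", "se?"}

print(round(jaccard(q1, q2), 3))  # 6 shared tokens out of 7 distinct ones
```

Computing this for every pair of questions costs time quadratic in the number of questions, which is the cost the MinHash and LSH steps are meant to avoid.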

## Step 3: Create minHash signatures


We loop through the set representations of the questions, compute a MinHash signature for each one, and store the signatures in the `min_dict` dictionary.
Each shingle is encoded as UTF-8 bytes before it is passed to `update`.

In [17]:


# num_perm is the number of permutations (hash functions) used by the MinHash algorithm.
# min_dict maps a question id (e.g. 'm23') to its MinHash signature.

num_perm = 256
min_dict = {}
for key, val in tqdm(set_dict.items()):
    m = datasketch.MinHash(num_perm=num_perm)
    for shingle in val:
        m.update(shingle.encode('utf8'))  # update() expects bytes
    min_dict[key] = m

100%|██████████| 400/400 [00:01<00:00, 343.17it/s]


MinHash data structure:

In [25]:
#test whether the signature is created successfully
min_dict['m23']

<datasketch.minhash.MinHash at 0x7f8c9feabb80>
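
A MinHash signature keeps, for each of `num_perm` hash functions, the smallest hash value over the set; the fraction of positions on which two signatures agree estimates the Jaccard similarity of the underlying sets. The following is a simplified pure-Python sketch of that idea, not datasketch's implementation. The functions `minhash_signature` and `estimate_jaccard`, the seeded SHA-1 hashing scheme, and the two token sets are assumptions made for illustration.

```python
import hashlib

def minhash_signature(tokens, num_perm=256):
    """For each seed, keep the minimum 64-bit hash over all tokens.

    Assumes `tokens` is a non-empty set of strings.
    """
    signature = []
    for seed in range(num_perm):
        prefix = str(seed).encode("utf8") + b":"
        signature.append(min(
            int.from_bytes(hashlib.sha1(prefix + t.encode("utf8")).digest()[:8], "big")
            for t in tokens
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Hand-written token sets with exact Jaccard similarity 6/7.
q1 = {"should", "i", "sell", "iphone", "6s", "buy", "se?"}
q2 = {"should", "i", "buy", "iphone", "6s", "se?"}

est = estimate_jaccard(minhash_signature(q1), minhash_signature(q2))
print(est)  # an estimate of 6/7; the exact figure depends on the hash functions
```

With the notebook's objects, datasketch exposes the same kind of estimate through `MinHash.jaccard`, for example `min_dict['m23'].jaccard(min_dict['m223'])`.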

## Step 4: Create LSH index

We pass the Jaccard similarity threshold as a parameter to `MinHashLSH`. 
We then loop through the keys of the `min_dict` dictionary and insert each signature into the index under its question key. 
Querying the index with a signature returns the keys of all questions deemed similar under that threshold. 




In [19]:


# Create a MinHashLSH index optimized for a Jaccard threshold of 0.8
# that accepts MinHash objects built with num_perm permutation functions.

lsh = datasketch.MinHashLSH(threshold=0.8, num_perm=num_perm)
for key in tqdm(min_dict.keys()):
   lsh.insert(key,min_dict[key]) # insert minhash data structure


100%|██████████| 400/400 [00:00<00:00, 9841.51it/s]
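
LSH indexes of this kind typically split each signature into b bands of r rows, and two sets become candidates when they agree on every row of at least one band. Under that scheme a pair with Jaccard similarity s is retrieved with probability 1 - (1 - s^r)^b. The sketch below evaluates this curve. The split into 16 bands of 16 rows is hypothetical and is not claimed to be the one datasketch selects for `threshold=0.8`.

```python
def candidate_probability(s, bands, rows):
    """Probability that a pair with Jaccard similarity s agrees on at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

# Hypothetical split of 256 hash values into 16 bands of 16 rows each.
BANDS, ROWS = 16, 16

for s in (0.4, 0.6, 0.8, 0.9, 1.0):
    print(f"s={s:.1f}  P(candidate)={candidate_probability(s, BANDS, ROWS):.4f}")
```

The curve is S-shaped: pairs well below the threshold are almost never retrieved, pairs well above it almost always are, and pairs near it may go either way. This is why query results are described as approximate.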




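`lsh.query`: given the MinHash of the query set, retrieve the keys (`m1`, `m2`, etc.) that reference sets whose Jaccard similarity with the query is approximately greater than the threshold. Because the result is approximate, it can include false positives and miss true matches.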

In [20]:
big_list = []
for query in min_dict.keys():
   big_list.append(lsh.query(min_dict[query]))

In [21]:
# Print the first two questions of every query result that contains more than one key.
for elem in big_list:
    if len(elem) > 1:
        print(norm_dict[elem[0]], norm_dict[elem[1]])

Should I sell an iPhone 6s and buy an iPhone SE? Should I buy the iPhone 6s or an SE?
Does Switzerland keep records of people who go missing in national parks and wilderness areas? If not, why not? Does Ecuador keep records of people who go missing in national parks and wilderness areas? If not, why not?
What are some mind-blowing mobile inventions that exist that most people don't know about? What are some mind-blowing Smartphone inventions that exist that most people don't know about?
Do employees at Paramount Group have a good work-life balance? Does this differ across positions and departments? Do employees at Navigators Group have a good work-life balance? Does this differ across positions and departments?
How do you get rid of dry or sore throat? How do you get rid of a sore throat?
How can I increase traffic to a story blog? How can I increase traffic on my blog?
How do I recover my permanently deleted emails in Gmail? How do I recover permanently deleted emails in gmail?
Which 
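
Because every question is used as a query, a matching pair generally shows up twice in `big_list` (once from each side), and the keys come back in no particular order. Also, since `sents` concatenates the 200 `question1` values followed by the 200 `question2` values, the keys `m{k}` and `m{k+200}` belong to the same row of `sents_pairs`. The following is a minimal sketch that collects each candidate pair once and flags pairs that come from the same row. The helper names and the tiny `results` list are hypothetical stand-ins for `big_list`.

```python
from itertools import combinations

N_ROWS = 200  # number of sampled question pairs in the notebook

def key_index(key):
    """Turn a key such as 'm204' into the integer 204."""
    return int(key[1:])

def unique_pairs(results):
    """Collect each unordered pair of keys once, ordered by numeric index."""
    pairs = set()
    for keys in results:
        for a, b in combinations(sorted(keys, key=key_index), 2):
            pairs.add((a, b))
    return sorted(pairs, key=lambda p: (key_index(p[0]), key_index(p[1])))

def same_row(pair, n_rows=N_ROWS):
    """True if the two keys are question1 and question2 of one sampled row."""
    a, b = (key_index(k) for k in pair)
    return abs(a - b) == n_rows

# Hypothetical stand-in for big_list.
results = [["m1"], ["m4", "m204"], ["m213", "m13"], ["m204", "m4"], ["m13", "m213"]]
for pair in unique_pairs(results):
    print(pair, same_row(pair))
```

A pair flagged by `same_row` can then be compared with that row's `is_duplicate` label to judge whether the LSH match agrees with the annotation.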

In [22]:
big_list

[['m1'],
 ['m2'],
 ['m3'],
 ['m4', 'm204'],
 ['m5'],
 ['m6'],
 ['m7'],
 ['m8'],
 ['m9'],
 ['m10'],
 ['m11'],
 ['m12'],
 ['m213', 'm13'],
 ['m14'],
 ['m15'],
 ['m16'],
 ['m17'],
 ['m18'],
 ['m19'],
 ['m20'],
 ['m21'],
 ['m22'],
 ['m223', 'm23'],
 ['m24'],
 ['m225', 'm25'],
 ['m26'],
 ['m27'],
 ['m28'],
 ['m29'],
 ['m30'],
 ['m31'],
 ['m32'],
 ['m33'],
 ['m34'],
 ['m35'],
 ['m36'],
 ['m37'],
 ['m38'],
 ['m39'],
 ['m40'],
 ['m41'],
 ['m42'],
 ['m43'],
 ['m44', 'm244'],
 ['m45'],
 ['m46'],
 ['m47'],
 ['m48'],
 ['m49'],
 ['m50'],
 ['m51'],
 ['m52'],
 ['m53'],
 ['m54'],
 ['m55'],
 ['m56'],
 ['m57'],
 ['m58'],
 ['m59'],
 ['m60'],
 ['m61'],
 ['m62'],
 ['m63'],
 ['m64'],
 ['m65'],
 ['m66'],
 ['m67'],
 ['m68'],
 ['m69'],
 ['m70'],
 ['m71'],
 ['m72'],
 ['m73'],
 ['m74'],
 ['m75'],
 ['m76'],
 ['m77'],
 ['m78'],
 ['m79'],
 ['m80'],
 ['m81'],
 ['m82'],
 ['m83'],
 ['m84'],
 ['m85'],
 ['m86'],
 ['m87'],
 ['m88'],
 ['m89'],
 ['m90'],
 ['m91'],
 ['m92'],
 ['m93'],
 ['m94'],
 ['m95'],
 ['m96'],
 ['m97'],