# Identifying Duplicate Questions
#### Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or identical) questions. Multiple questions with the same intent force seekers to spend extra time finding the best answer, and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which provides a better experience for active seekers and writers and offers more value to both groups in the long term. Follow the steps outlined below to build an appropriate classifier model.

### Steps:

1. Download data
2. Exploration
3. Cleaning
4. Feature Engineering
5. Modeling

#### By the end of this project you should have a presentation that describes the model you built and its performance.

## 1. Download the Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
import string
import re
import nltk
import spacy
import gensim



In [3]:
from nltk.tokenize import word_tokenize
from nltk import sent_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

In [104]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
from sklearn.metrics.pairwise import cosine_similarity
import xgboost as xgb

In [5]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

In [6]:
df=pd.read_csv('train.csv')

## 2. Exploration

In [7]:
df.head(10)

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
6,6,13,14,Should I buy tiago?,What keeps childern active and far from phone ...,0
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
8,8,17,18,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,9,19,20,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0


In [8]:
# Check for missing values

df.isnull().sum()

id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64

In [9]:
# Check shape of dataframe

df.shape

(404290, 6)

In [10]:
df['is_duplicate'].value_counts()

0    255027
1    149263
Name: is_duplicate, dtype: int64
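
The class counts above imply a majority-class baseline that any classifier should beat. A minimal sketch of that arithmetic, using the counts printed above (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the value_counts() output above
n_not_duplicate = 255027
n_duplicate = 149263
total = n_not_duplicate + n_duplicate

# Accuracy obtained by always predicting the majority class (0)
baseline_accuracy = n_not_duplicate / total
duplicate_share = n_duplicate / total

print(f"total pairs: {total}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"share of duplicates: {duplicate_share:.3f}")
```

Because the classes are imbalanced, an accuracy close to this baseline says little about how well a model separates duplicates from non-duplicates.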

## 3. Cleaning

In [11]:
# Drop null value rows as they are an insignificant portion of the total data

df = df.dropna()

In [12]:
# Show random set of qid1 and qid2 to see if they can be dropped

df[['id', 'qid1', 'qid2']][20000:20020]

Unnamed: 0,id,qid1,qid2
20000,20000,37768,37769
20001,20001,37770,37771
20002,20002,8848,37772
20003,20003,37773,37774
20004,20004,37775,37776
20005,20005,37777,37778
20006,20006,37779,37780
20007,20007,37781,37782
20008,20008,37783,37784
20009,20009,37785,37786


In [13]:
# Split data from target variable

x_df=df[['id', 'question1', 'question2']]
y_df=df['is_duplicate']

In [14]:
# train test split

x_train, x_test, y_train, y_test = train_test_split(
    x_df, y_df, test_size=0.3, random_state=1000)

In [15]:
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

In [16]:
def preprocess_all(text):
    
    # Remove Punctuation
    text = "".join([char for char in text if char not in string.punctuation])
    
    tokens = text.split()
    
    ENGstopwords = stopwords.words('english')
    text = [word for word in tokens if word not in ENGstopwords]
    
    porter = PorterStemmer()
    stemmed_text = [porter.stem(word) for word in text]
    
    wnl = WordNetLemmatizer()
    lemma_text = [wnl.lemmatize(word) for word in stemmed_text]
    
    return lemma_text
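
The function above reloads the stop word list on every call and compares tokens case-sensitively. A minimal sketch of one possible variant that builds the stop word set once and lowercases first; `make_preprocessor`, the tiny stop list, and the identity `stem` default are hypothetical stand-ins, and lowercasing is an assumption that changes behavior relative to the notebook:

```python
import string


def make_preprocessor(stop_words, stem=lambda word: word):
    """Return a function that lowercases, strips punctuation, drops stop words, then stems."""
    stop_set = frozenset(stop_words)  # built once, constant-time membership tests
    punctuation_table = str.maketrans("", "", string.punctuation)

    def preprocess(text):
        tokens = text.lower().translate(punctuation_table).split()
        return [stem(token) for token in tokens if token not in stop_set]

    return preprocess


# Tiny illustrative stop list; the notebook uses stopwords.words('english')
demo = make_preprocessor({"what", "do", "in", "their"})
print(demo("What do pilots carry in their flight bags?"))
```

In the notebook's setting, `stem` could be `PorterStemmer().stem` and `stop_words` the NLTK English list, both created once outside the per-row call.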

In [17]:
x_train.head()

Unnamed: 0,id,question1,question2
271488,271488,What do pilots carry in their flight bags?,What are the things an airline pilot carry in ...
659,659,How long can raw and cooked sausage last refri...,How long does onigiri last if left refrigerate...
319232,319232,What do I need for video streaming?,"What, from an infrastructure point of view, do..."
352892,352892,What are the benefits you got from reading?,What are some benefits of reading?
147381,147381,"I am drunk, what should I do?",How do I know that I am drunk?


In [18]:
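# Remove 'i' or 'I' as corpus specific stopword
# Note: these are plain substring replacements, so 'I ' and 'i ' also match
# at the end of longer words (e.g. 'onigiri last' becomes 'onigirlast')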

x_train['question1']=x_train['question1'].str.replace(' I ', ' ')
x_train['question1']=x_train['question1'].str.replace('I ', '')
x_train['question1']=x_train['question1'].str.replace('i ', '')
x_train['question1']=x_train['question1'].str.replace(' i ', ' ')
x_train['question2']=x_train['question2'].str.replace(' I ', ' ')
x_train['question2']=x_train['question2'].str.replace('I ', '')
x_train['question2']=x_train['question2'].str.replace('i ', '')
x_train['question2']=x_train['question2'].str.replace(' i ', ' ')

In [19]:
# Remove 'what', 'What', 'how' and 'How' as corpus specific stopwords

x_train['question1']=x_train['question1'].str.replace('what', '')

In [20]:
x_train['question1']=x_train['question1'].str.replace('What', '')

In [21]:
x_train['question2']=x_train['question2'].str.replace('what', '')
x_train['question2']=x_train['question2'].str.replace('What', '')
x_train['question1']=x_train['question1'].str.replace('how', '')
x_train['question1']=x_train['question1'].str.replace('How', '')
x_train['question2']=x_train['question2'].str.replace('how', '')
x_train['question2']=x_train['question2'].str.replace('How', '')

In [22]:
x_test['question1']=x_test['question1'].str.replace(' I ', ' ')
x_test['question1']=x_test['question1'].str.replace('I ', '')
x_test['question1']=x_test['question1'].str.replace('i ', '')
x_test['question1']=x_test['question1'].str.replace(' i ', ' ')
x_test['question2']=x_test['question2'].str.replace(' I ', ' ')
x_test['question2']=x_test['question2'].str.replace('I ', '')
x_test['question2']=x_test['question2'].str.replace('i ', '')
x_test['question2']=x_test['question2'].str.replace(' i ', ' ')
x_test['question1']=x_test['question1'].str.replace('what', '')
x_test['question1']=x_test['question1'].str.replace('What', '')
x_test['question2']=x_test['question2'].str.replace('what', '')
x_test['question2']=x_test['question2'].str.replace('What', '')
x_test['question1']=x_test['question1'].str.replace('how', '')
x_test['question1']=x_test['question1'].str.replace('How', '')
x_test['question2']=x_test['question2'].str.replace('how', '')
x_test['question2']=x_test['question2'].str.replace('How', '')

In [23]:
x_train.head()

Unnamed: 0,id,question1,question2
271488,271488,do pilots carry in their flight bags?,are the things an airline pilot carry in thei...
659,659,long can raw and cooked sausage last refriger...,long does onigirlast if left refrigerated? c...
319232,319232,do need for video streaming?,", from an infrastructure point of view, do nee..."
352892,352892,are the benefits you got from reading?,are some benefits of reading?
147381,147381,"am drunk, should do?",do know that am drunk?
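
The output above shows a side effect of substring replacement: "onigiri last" has become "onigirlast". A minimal sketch of a whole-word alternative using the standard `re` module; `drop_corpus_stopwords` and `CORPUS_STOPWORDS` are hypothetical names, and treating apostrophes as part of a word (so "I'm" is left alone) is an assumption:

```python
import re

CORPUS_STOPWORDS = ("i", "what", "how")

# Match the stop words only as whole words, ignoring case.
# The lookarounds reject matches that touch a letter, digit, underscore or apostrophe.
_STOPWORD_PATTERN = re.compile(
    r"(?<![\w'])(?:" + "|".join(CORPUS_STOPWORDS) + r")(?![\w'])",
    flags=re.IGNORECASE,
)


def drop_corpus_stopwords(text):
    """Remove whole-word stop words and collapse the leftover whitespace."""
    without_stopwords = _STOPWORD_PATTERN.sub(" ", text)
    return " ".join(without_stopwords.split())


print(drop_corpus_stopwords("How long does onigiri last if I wait?"))
print(drop_corpus_stopwords("I'm drunk, what should I do?"))
```

With pandas, the same pattern could be applied per column through `Series.apply(drop_corpus_stopwords)`, which would also replace the repeated `str.replace` calls with a single pass.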


In [24]:
x_train['q1_clean'] = x_train['question1'].apply(lambda x: preprocess_all(x))
x_train['q2_clean'] = x_train['question2'].apply(lambda x: preprocess_all(x))

In [25]:
x_test['q1_clean'] = x_test['question1'].apply(lambda x: preprocess_all(x))
x_test['q2_clean'] = x_test['question2'].apply(lambda x: preprocess_all(x))

In [26]:
x_train.head()

Unnamed: 0,id,question1,question2,q1_clean,q2_clean
271488,271488,do pilots carry in their flight bags?,are the things an airline pilot carry in thei...,"[pilot, carri, flight, bag]","[thing, airlin, pilot, carri, flight, bag]"
659,659,long can raw and cooked sausage last refriger...,long does onigirlast if left refrigerated? c...,"[long, raw, cook, sausag, last, refriger]","[long, onigirlast, left, refriger, kept, longer]"
319232,319232,do need for video streaming?,", from an infrastructure point of view, do nee...","[need, video, stream]","[infrastructur, point, view, need, creat, webs..."
352892,352892,are the benefits you got from reading?,are some benefits of reading?,"[benefit, got, read]","[benefit, read]"
147381,147381,"am drunk, should do?",do know that am drunk?,[drunk],"[know, drunk]"


In [27]:
# Remove uncleaned question columns from x dataframe
x_train=x_train[['id', 'q1_clean', 'q2_clean']]
x_test=x_test[['id', 'q1_clean', 'q2_clean']]
print(x_train.shape)
x_train.head()

(283000, 3)


Unnamed: 0,id,q1_clean,q2_clean
271488,271488,"[pilot, carri, flight, bag]","[thing, airlin, pilot, carri, flight, bag]"
659,659,"[long, raw, cook, sausag, last, refriger]","[long, onigirlast, left, refriger, kept, longer]"
319232,319232,"[need, video, stream]","[infrastructur, point, view, need, creat, webs..."
352892,352892,"[benefit, got, read]","[benefit, read]"
147381,147381,[drunk],"[know, drunk]"


## 4. Feature Engineering

### Word2vec

In [28]:
# Instantiate Word2Vec

word2vec_1=Word2Vec(x_train['q1_clean'], min_count=2)
word2vec_2=Word2Vec(x_train['q2_clean'], min_count=2)

In [29]:
# Store the vocabulary: unique words that appear at least twice in the corpus

vocabulary1 = word2vec_1.wv.index_to_key
vocabulary2 = word2vec_2.wv.index_to_key

In [30]:
# Display the words most similar to a chosen word, with their similarity scores

sim_words = word2vec_1.wv.most_similar('best')
sim_words

[('cheapest', 0.6594325304031372),
 ('fastest', 0.6345575451850891),
 ('safest', 0.6214231252670288),
 ('suitabl', 0.6018700003623962),
 ('good', 0.5988691449165344),
 ('easiest', 0.5930651426315308),
 ('stuntmen', 0.5844835042953491),
 ('easi', 0.5522700548171997),
 ('favourit', 0.5466839671134949),
 ('britian', 0.5459638237953186)]

### Tfidf

In [31]:
# Join each token list back into a single space-separated string

x_train['q1_clean'] = x_train['q1_clean'].apply(lambda x: ' '.join(x))
x_train['q2_clean'] = x_train['q2_clean'].apply(lambda x: ' '.join(x))

In [32]:
x_test['q1_clean'] = x_test['q1_clean'].apply(lambda x: ' '.join(x))
x_test['q2_clean'] = x_test['q2_clean'].apply(lambda x: ' '.join(x))

In [33]:
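# Instantiate Tfidf
# strip_accents=ascii passes Python's built-in ascii() as a callable;
# the string 'ascii' would select scikit-learn's own accent stripping
vectorizer=TfidfVectorizer(strip_accents=ascii, lowercase=False, max_df=0.6)

# fit() returns the vectorizer itself and relearns the vocabulary on every call,
# so X1_tfidf and X2_tfidf refer to the same object
X1_tfidf=vectorizer.fit(x_train['q1_clean'])
X2_tfidf=vectorizer.fit(x_train['q2_clean'])

In [34]:
# After these calls the vocabulary in effect is the one learned from x_test['q2_clean']
X1_test=vectorizer.fit(x_test['q1_clean'])
X2_test=vectorizer.fit(x_test['q2_clean'])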

In [35]:
# Encoding the columns
X1_vector=vectorizer.transform(x_train['q1_clean'])
X2_vector=vectorizer.transform(x_train['q2_clean'])

In [36]:
X1_testVector=vectorizer.transform(x_test['q1_clean'])
X2_testVector=vectorizer.transform(x_test['q2_clean'])
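
Since every call to `fit` replaces the learned vocabulary, a common alternative is to fit once on the training questions only and then reuse that vocabulary for all four columns. A minimal sketch on toy data (the lists below are hypothetical stand-ins for the cleaned columns, and default vectorizer settings are used apart from `lowercase=False`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the cleaned question columns (hypothetical data)
train_q1 = ["pilot carri flight bag", "benefit got read"]
train_q2 = ["thing airlin pilot carri flight bag", "benefit read"]
test_q1 = ["whi would feel unwant"]
test_q2 = ["whi peopl feel lone"]

vectorizer = TfidfVectorizer(lowercase=False)

# Learn one vocabulary from all training questions, and only from them
vectorizer.fit(train_q1 + train_q2)

# Reuse the same fitted vocabulary for every column
X1_train = vectorizer.transform(train_q1)
X2_train = vectorizer.transform(train_q2)
X1_test = vectorizer.transform(test_q1)
X2_test = vectorizer.transform(test_q2)

print(X1_train.shape, X2_train.shape, X1_test.shape, X2_test.shape)
```

Words that appear only in the test questions are ignored by `transform`, which keeps information from the test set out of the features.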

In [37]:
# summarize encoded vector
print('vectors: ', X1_vector.toarray())
print('vectors: ', X2_vector.toarray())

vectors:  [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
vectors:  [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [38]:
# Check to ensure not all values are 0 in sparse matrix

print(X1_vector[0].sum())
print(X2_vector[0].sum())

1.998628154266811
2.4212388223483727


In [39]:
# Create separate dataframes of each vectorized column
X1_df=pd.DataFrame(X1_vector, columns=['q1'])
X1_df.head()

Unnamed: 0,q1
0,"(0, 22580)\t0.4986456030397158\n (0, 11513)..."
1,"(0, 25706)\t0.5289715566116155\n (0, 24508)..."
2,"(0, 31658)\t0.538356518832852\n (0, 27919)\..."
3,"(0, 24291)\t0.5346183715637537\n (0, 12725)..."
4,"(0, 9693)\t1.0"


In [40]:
# Create 2nd separate df of q2 vectorized column
X2_df=pd.DataFrame(X2_vector, columns=['q2'])
X2_df.head()

Unnamed: 0,q2
0,"(0, 29118)\t0.26840811203114845\n (0, 22580..."
1,"(0, 24508)\t0.5000324835474549\n (0, 17454)..."
2,"(0, 32206)\t0.25253979823040484\n (0, 31678..."
3,"(0, 24291)\t0.6637123462685323\n (0, 4597)\..."
4,"(0, 16466)\t0.47901587552842606\n (0, 9693)..."


In [41]:
vectorized_df=pd.concat([X1_df, X2_df], axis=1)
vectorized_df.head()

Unnamed: 0,q1,q2
0,"(0, 22580)\t0.4986456030397158\n (0, 11513)...","(0, 29118)\t0.26840811203114845\n (0, 22580..."
1,"(0, 25706)\t0.5289715566116155\n (0, 24508)...","(0, 24508)\t0.5000324835474549\n (0, 17454)..."
2,"(0, 31658)\t0.538356518832852\n (0, 27919)\...","(0, 32206)\t0.25253979823040484\n (0, 31678..."
3,"(0, 24291)\t0.5346183715637537\n (0, 12725)...","(0, 24291)\t0.6637123462685323\n (0, 4597)\..."
4,"(0, 9693)\t1.0","(0, 16466)\t0.47901587552842606\n (0, 9693)..."


In [42]:
X1_test_df=pd.DataFrame(X1_testVector, columns=['q1'])
X2_test_df=pd.DataFrame(X2_testVector, columns=['q2'])
test_vectorized_df=pd.concat([X1_test_df, X2_test_df], axis=1)
test_vectorized_df.head()

Unnamed: 0,q1,q2
0,"(0, 23372)\t0.3251344157581774\n (0, 8936)\...","(0, 23372)\t0.3202632030925727\n (0, 22709)..."
1,"(0, 32689)\t0.4072867067672329\n (0, 32030)...","(0, 32689)\t0.34473161740889585\n (0, 32030..."
2,"(0, 32711)\t0.34922750236358824\n (0, 32333...","(0, 32333)\t0.2891888242558739\n (0, 22218)..."
3,"(0, 31028)\t0.2765560190314651\n (0, 24414)...","(0, 31028)\t0.29713583300637086\n (0, 27573..."
4,"(0, 31869)\t0.48321306497181876\n (0, 13446...","(0, 32590)\t0.20395978779805138\n (0, 31869..."


In [43]:
range(len(vectorized_df))

range(0, 283000)

In [44]:
range(len(test_vectorized_df))

range(0, 121287)

In [45]:
# Get the cosine_similarity of 1 row
cosine_similarity(vectorized_df['q1'][0], vectorized_df['q2'][0])

array([[0.86071001]])

In [46]:
# Iterate through each row of the dataframe to get all the cosine_similarities

cosine=[]

for row in range(len(vectorized_df)):
    cosine.append(cosine_similarity(vectorized_df['q1'][row], vectorized_df['q2'][row]))

In [47]:
testCosine=[]

for row in range(len(test_vectorized_df)):
    testCosine.append(cosine_similarity(test_vectorized_df['q1'][row], test_vectorized_df['q2'][row]))
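
The row-by-row loops above call `cosine_similarity` once per question pair. A minimal sketch of a vectorized equivalent that works directly on the two sparse matrices; `rowwise_cosine` is a hypothetical helper, and the toy questions stand in for the notebook's columns:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize


def rowwise_cosine(A, B):
    """Cosine similarity between row i of A and row i of B (sparse inputs)."""
    A_unit = normalize(A, norm="l2", axis=1)
    B_unit = normalize(B, norm="l2", axis=1)
    # Element-wise product followed by a row sum is the dot product of unit rows
    return np.asarray(A_unit.multiply(B_unit).sum(axis=1)).ravel()


q1 = ["pilot carri flight bag", "benefit got read", "drunk"]
q2 = ["thing airlin pilot carri flight bag", "benefit read", "know drunk"]

count_vectorizer = CountVectorizer().fit(q1 + q2)
A, B = count_vectorizer.transform(q1), count_vectorizer.transform(q2)

sims = rowwise_cosine(A, B)

# Same values computed one pair at a time, for comparison
looped = [cosine_similarity(A[i], B[i])[0, 0] for i in range(A.shape[0])]
print(sims)
print(np.allclose(sims, looped))
```

The result is already a flat array with one value per pair, so it can be used as a feature column without reshaping a list of 1x1 arrays.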

In [48]:
# Reshape list in order to convert to DataFrame
cosine1=np.reshape(cosine, (283000,1))

In [49]:
testcosine1=np.reshape(testCosine, (121287,1))

In [50]:
# Create dataframe from list
cosine_df = pd.DataFrame(cosine1, columns=['cosine_sim'])

In [51]:
test_cosine_df=pd.DataFrame(testcosine1, columns=['cosine_sim'])

In [52]:
# Concatenate cosine_similarity score to vectorized dataframe
vectorized_final=pd.concat([vectorized_df, cosine_df],axis=1)

In [53]:
test_vectorized_final=pd.concat([test_vectorized_df, test_cosine_df], axis=1)

In [54]:
vectorized_final.tail()

Unnamed: 0,q1,q2,cosine_sim
282995,"(0, 32333)\t0.3009568372517852\n (0, 26839)...","(0, 32333)\t0.2904858111851718\n (0, 22218)...",0.417177
282996,"(0, 32334)\t0.29994745445854126\n (0, 15074...","(0, 33187)\t0.3027510891673205\n (0, 32024)...",0.119377
282997,"(0, 25058)\t0.4566864473011708\n (0, 14044)...","(0, 25058)\t0.4456092494937859\n (0, 14044)...",0.975744
282998,"(0, 29122)\t0.17621632274395\n (0, 29118)\t...","(0, 29122)\t0.17348823015983014\n (0, 29118...",0.984519
282999,"(0, 26177)\t1.0","(0, 26177)\t0.6365530467271575\n (0, 19886)...",0.636553


### Word count

In [55]:
print(x_train['q1_clean'].shape)
print(x_train['q2_clean'].shape)

(283000,)
(283000,)


In [56]:
# Instantiate CountVectorizer
countVectorizer=CountVectorizer(strip_accents=ascii, lowercase=False)

In [57]:
x_train_clean = pd.concat([x_train['q1_clean'], x_train['q2_clean']])


In [58]:
x_test_clean = pd.concat([x_test['q1_clean'], x_test['q2_clean']])


In [59]:
len(x_train_clean)

566000

In [60]:
countVectorizer.fit(x_train_clean)

In [61]:
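# fit() returns the same CountVectorizer object and replaces the vocabulary
# learned in the previous cell with one built from the test questions
test_countVectorizer = countVectorizer.fit(x_test_clean)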

In [62]:
x_train_q1=countVectorizer.transform(x_train['q1_clean'])
x_train_q2=countVectorizer.transform(x_train['q2_clean'])

In [63]:
x_test_q1=countVectorizer.transform(x_test['q1_clean'])
x_test_q2=countVectorizer.transform(x_test['q2_clean'])

In [64]:
x_train_q1[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

In [65]:
x_train_q2[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

In [66]:
# Create separate dataframes of each vectorized column
X1_count=pd.DataFrame(x_train_q1, columns=['q1'])
X1_count.head()

Unnamed: 0,q1
0,"(0, 6086)\t1\n (0, 8860)\t1\n (0, 16446)\t..."
1,"(0, 10969)\t1\n (0, 24075)\t1\n (0, 24924)..."
2,"(0, 29010)\t1\n (0, 39951)\t1\n (0, 45546)\t1"
3,"(0, 6715)\t1\n (0, 18175)\t1\n (0, 34841)\t1"
4,"(0, 13878)\t1"


In [67]:
X2_count=pd.DataFrame(x_train_q2, columns=['q2'])
X2_count.head()

Unnamed: 0,q2
0,"(0, 3874)\t1\n (0, 6086)\t1\n (0, 8860)\t1..."
1,"(0, 23191)\t1\n (0, 24273)\t1\n (0, 24924)..."
2,"(0, 6002)\t1\n (0, 11378)\t1\n (0, 21332)\..."
3,"(0, 6715)\t1\n (0, 34841)\t1"
4,"(0, 13878)\t1\n (0, 23522)\t1"


In [68]:
countvector_df=pd.concat([X1_count, X2_count], axis=1)
countvector_df.head()

Unnamed: 0,q1,q2
0,"(0, 6086)\t1\n (0, 8860)\t1\n (0, 16446)\t...","(0, 3874)\t1\n (0, 6086)\t1\n (0, 8860)\t1..."
1,"(0, 10969)\t1\n (0, 24075)\t1\n (0, 24924)...","(0, 23191)\t1\n (0, 24273)\t1\n (0, 24924)..."
2,"(0, 29010)\t1\n (0, 39951)\t1\n (0, 45546)\t1","(0, 6002)\t1\n (0, 11378)\t1\n (0, 21332)\..."
3,"(0, 6715)\t1\n (0, 18175)\t1\n (0, 34841)\t1","(0, 6715)\t1\n (0, 34841)\t1"
4,"(0, 13878)\t1","(0, 13878)\t1\n (0, 23522)\t1"


In [69]:
X1_test_count=pd.DataFrame(x_test_q1, columns=['q1'])
X2_test_count=pd.DataFrame(x_test_q2, columns=['q2'])
test_countvector_df=pd.concat([X1_test_count, X2_test_count], axis=1)
test_countvector_df.head()

Unnamed: 0,q1,q2
0,"(0, 4222)\t1\n (0, 7862)\t1\n (0, 10610)\t...","(0, 4222)\t1\n (0, 10610)\t1\n (0, 12644)\..."
1,"(0, 10508)\t1\n (0, 21913)\t1\n (0, 41691)...","(0, 3438)\t1\n (0, 18941)\t1\n (0, 33701)\..."
2,"(0, 16008)\t1\n (0, 44811)\t1\n (0, 46510)...","(0, 16008)\t1\n (0, 24919)\t1\n (0, 31929)..."
3,"(0, 8791)\t1\n (0, 16231)\t1\n (0, 18225)\...","(0, 16231)\t1\n (0, 18225)\t1\n (0, 24954)..."
4,"(0, 13060)\t1\n (0, 15756)\t1\n (0, 19210)...","(0, 3199)\t1\n (0, 3576)\t1\n (0, 8453)\t1..."


In [70]:
# Iterate through each row of the dataframe to get all the cosine_similarities

count_cosine=[]

for row in range(len(countvector_df)):
    count_cosine.append(cosine_similarity(countvector_df['q1'][row], countvector_df['q2'][row]))

In [71]:
test_count_cosine=[]

for row in range(len(test_countvector_df)):
    test_count_cosine.append(cosine_similarity(test_countvector_df['q1'][row], test_countvector_df['q2'][row]))

In [72]:
# Reshape list in order to convert to DataFrame
count_cosine1=np.reshape(count_cosine, (283000,1))

In [73]:
test_count_cosine1=np.reshape(test_count_cosine, (121287,1))

In [74]:
# Create dataframe from list
cosine_df = pd.DataFrame(count_cosine1, columns=['cosine_sim'])
cosine_df.head()

Unnamed: 0,cosine_sim
0,0.816497
1,0.365148
2,0.522233
3,0.816497
4,0.707107


In [75]:
test_cosine_df = pd.DataFrame(test_count_cosine1, columns=['cosine_sim'])

In [76]:
final_countvector_df=pd.concat([countvector_df, cosine_df], axis=1)
final_countvector_df.head()

Unnamed: 0,q1,q2,cosine_sim
0,"(0, 6086)\t1\n (0, 8860)\t1\n (0, 16446)\t...","(0, 3874)\t1\n (0, 6086)\t1\n (0, 8860)\t1...",0.816497
1,"(0, 10969)\t1\n (0, 24075)\t1\n (0, 24924)...","(0, 23191)\t1\n (0, 24273)\t1\n (0, 24924)...",0.365148
2,"(0, 29010)\t1\n (0, 39951)\t1\n (0, 45546)\t1","(0, 6002)\t1\n (0, 11378)\t1\n (0, 21332)\...",0.522233
3,"(0, 6715)\t1\n (0, 18175)\t1\n (0, 34841)\t1","(0, 6715)\t1\n (0, 34841)\t1",0.816497
4,"(0, 13878)\t1","(0, 13878)\t1\n (0, 23522)\t1",0.707107


In [77]:
final_test_countvector_df=pd.concat([test_countvector_df, test_cosine_df], axis=1)
final_test_countvector_df.head()

Unnamed: 0,q1,q2,cosine_sim
0,"(0, 4222)\t1\n (0, 7862)\t1\n (0, 10610)\t...","(0, 4222)\t1\n (0, 10610)\t1\n (0, 12644)\...",0.833333
1,"(0, 10508)\t1\n (0, 21913)\t1\n (0, 41691)...","(0, 3438)\t1\n (0, 18941)\t1\n (0, 33701)\...",0.4
2,"(0, 16008)\t1\n (0, 44811)\t1\n (0, 46510)...","(0, 16008)\t1\n (0, 24919)\t1\n (0, 31929)...",0.5
3,"(0, 8791)\t1\n (0, 16231)\t1\n (0, 18225)\...","(0, 16231)\t1\n (0, 18225)\t1\n (0, 24954)...",0.875
4,"(0, 13060)\t1\n (0, 15756)\t1\n (0, 19210)...","(0, 3199)\t1\n (0, 3576)\t1\n (0, 8453)\t1...",0.301511


### Number of same words in both questions

In [78]:
x_train.head(10)

Unnamed: 0,id,q1_clean,q2_clean
271488,271488,pilot carri flight bag,thing airlin pilot carri flight bag
659,659,long raw cook sausag last refriger,long onigirlast left refriger kept longer
319232,319232,need video stream,infrastructur point view need creat websit let...
352892,352892,benefit got read,benefit read
147381,147381,drunk,know drunk
98345,98345,increas chanc becom accept caltech,one increas chanc get accept top undergradu sc...
109578,109578,is cure tinnitu,will cure tinnitu futur
218475,218475,lock symbol circl around iphon mean,lock symbol iphon 6 mean
70342,70342,gari johnson chanc win novemb,doe gari johnson stand chanc elect presid
364826,364826,capac 35 inch floppi disk,memori capac floppi disk


In [79]:
X_train=x_train

In [80]:
# Create column holding the set of words shared by q1_clean and q2_clean in each row
X_train['word_overlap'] = [set(x[1].split()) & set(x[2].split()) for x in X_train.values]
# Create column that counts the shared words in the new column
X_train['overlap_count'] = X_train['word_overlap'].str.len()

In [81]:
X_train.head(10)

Unnamed: 0,id,q1_clean,q2_clean,word_overlap,overlap_count
271488,271488,pilot carri flight bag,thing airlin pilot carri flight bag,"{carri, flight, bag, pilot}",4
659,659,long raw cook sausag last refriger,long onigirlast left refriger kept longer,"{refriger, long}",2
319232,319232,need video stream,infrastructur point view need creat websit let...,"{stream, video, need}",3
352892,352892,benefit got read,benefit read,"{benefit, read}",2
147381,147381,drunk,know drunk,{drunk},1
98345,98345,increas chanc becom accept caltech,one increas chanc get accept top undergradu sc...,"{chanc, increas, accept}",3
109578,109578,is cure tinnitu,will cure tinnitu futur,"{tinnitu, cure}",2
218475,218475,lock symbol circl around iphon mean,lock symbol iphon 6 mean,"{lock, mean, symbol, iphon}",4
70342,70342,gari johnson chanc win novemb,doe gari johnson stand chanc elect presid,"{chanc, johnson, gari}",3
364826,364826,capac 35 inch floppi disk,memori capac floppi disk,"{disk, capac, floppi}",3
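
The raw overlap count grows with question length, so a normalized variant can be useful alongside it. A minimal sketch that returns both the count and the Jaccard similarity (shared words divided by all distinct words); `overlap_features` is a hypothetical helper and Jaccard is a swapped-in addition, not a feature the notebook computes:

```python
def overlap_features(q1, q2):
    """Return (number of shared words, Jaccard similarity) for two cleaned questions."""
    words1, words2 = set(q1.split()), set(q2.split())
    shared = words1 & words2
    union = words1 | words2
    # Guard against two empty questions
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard


count, jaccard = overlap_features(
    "pilot carri flight bag", "thing airlin pilot carri flight bag"
)
print(count, round(jaccard, 3))
```

For the first row shown above, the count is 4, matching the `overlap_count` column, out of 6 distinct words across both questions.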


## 5. Modeling

### Logistic Regression - Tfidf

In [82]:
vectorized_final.head()

Unnamed: 0,q1,q2,cosine_sim
0,"(0, 22580)\t0.4986456030397158\n (0, 11513)...","(0, 29118)\t0.26840811203114845\n (0, 22580...",0.86071
1,"(0, 25706)\t0.5289715566116155\n (0, 24508)...","(0, 24508)\t0.5000324835474549\n (0, 17454)...",0.314811
2,"(0, 31658)\t0.538356518832852\n (0, 27919)\...","(0, 32206)\t0.25253979823040484\n (0, 31678...",0.488158
3,"(0, 24291)\t0.5346183715637537\n (0, 12725)...","(0, 24291)\t0.6637123462685323\n (0, 4597)\...",0.805497
4,"(0, 9693)\t1.0","(0, 16466)\t0.47901587552842606\n (0, 9693)...",0.877806


In [83]:
x_train_cosineSim = vectorized_final['cosine_sim']
x_train_cosineSim = x_train_cosineSim.values.reshape(-1,1)

In [84]:
x_test_cosineSim = test_vectorized_final['cosine_sim']
x_test_cosineSim = x_test_cosineSim.values.reshape(-1,1)

In [85]:
logmodel = LogisticRegression()
logmodel.fit(x_train_cosineSim, y_train)

In [86]:
# Score the model
score = logmodel.score(x_test_cosineSim, y_test)

print("Accuracy:", score)

Accuracy: 0.665058909858435


### Gradient Boosting - Tfidf

In [87]:
xgbmodel = GradientBoostingClassifier()
xgbmodel.fit(x_train_cosineSim, y_train)

In [88]:
# Score the model

score = xgbmodel.score(x_test_cosineSim, y_test)

print("Accuracy:", score)

Accuracy: 0.676239003355677


### Naive Bayes Classifier - Tfidf

In [89]:
NBclassifier = naive_bayes.BernoulliNB()
NBclassifier.fit(x_train_cosineSim, y_train)

In [90]:
score = NBclassifier.score(x_test_cosineSim, y_test)
print("Accuracy:", score)

Accuracy: 0.6308920164568338


In [91]:
roc_auc_score(y_test, NBclassifier.predict_proba(x_test_cosineSim)[:,1])

0.5671192363460239
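
`BernoulliNB` binarizes its inputs with a default threshold of `binarize=0.0`, so a single cosine-similarity feature becomes 1 for every pair with any similarity at all. A minimal sketch on made-up data showing that effect, with `GaussianNB` as a swapped-in comparison for a continuous feature (it is not used in the notebook):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Made-up cosine similarities: low values are non-duplicates, high values duplicates
X = np.array([[0.1], [0.2], [0.8], [0.9]])
y = np.array([0, 0, 1, 1])

# Default binarize=0.0 maps every value above 0 to 1, so all rows look identical
bernoulli = BernoulliNB().fit(X, y)
bernoulli_proba = bernoulli.predict_proba(X)[:, 1]

# GaussianNB models the feature as continuous, so it can use its magnitude
gaussian = GaussianNB().fit(X, y)
gaussian_pred = gaussian.predict(X)

print("BernoulliNB probabilities:", bernoulli_proba)
print("GaussianNB predictions:   ", gaussian_pred)
```

When every row receives the same probability, the classifier can only predict one class, which would be consistent with an accuracy close to the share of non-duplicates.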

### Logistic Regression - Countvectorizer

In [92]:
final_countvector_df.head()

Unnamed: 0,q1,q2,cosine_sim
0,"(0, 6086)\t1\n (0, 8860)\t1\n (0, 16446)\t...","(0, 3874)\t1\n (0, 6086)\t1\n (0, 8860)\t1...",0.816497
1,"(0, 10969)\t1\n (0, 24075)\t1\n (0, 24924)...","(0, 23191)\t1\n (0, 24273)\t1\n (0, 24924)...",0.365148
2,"(0, 29010)\t1\n (0, 39951)\t1\n (0, 45546)\t1","(0, 6002)\t1\n (0, 11378)\t1\n (0, 21332)\...",0.522233
3,"(0, 6715)\t1\n (0, 18175)\t1\n (0, 34841)\t1","(0, 6715)\t1\n (0, 34841)\t1",0.816497
4,"(0, 13878)\t1","(0, 13878)\t1\n (0, 23522)\t1",0.707107


In [93]:
x_train_cosineSim_count = final_countvector_df['cosine_sim']
x_train_cosineSim_count = x_train_cosineSim_count.values.reshape(-1,1)

In [94]:
x_test_cosineSim_count = final_test_countvector_df['cosine_sim']
x_test_cosineSim_count = x_test_cosineSim_count.values.reshape(-1,1)

In [95]:
logmodelCount = LogisticRegression()
logmodelCount.fit(x_train_cosineSim_count, y_train)
score = logmodelCount.score(x_test_cosineSim_count, y_test)

print("Accuracy:", score)

Accuracy: 0.6648692770041307


### XGBoost - Countvectorizer

In [110]:
x_train1=x_train[['id', 'q1_clean', 'q2_clean']]
x_train1.head()

Unnamed: 0,id,q1_clean,q2_clean
271488,271488,pilot carri flight bag,thing airlin pilot carri flight bag
659,659,long raw cook sausag last refriger,long onigirlast left refriger kept longer
319232,319232,need video stream,infrastructur point view need creat websit let...
352892,352892,benefit got read,benefit read
147381,147381,drunk,know drunk


In [109]:
x_test.head()

Unnamed: 0,id,q1_clean,q2_clean
397804,397804,process determin densiti aluminum brass compar,process determin densiti aluminum platinum compar
245946,245946,is third world war come,propheci world war 3 actual happen
395333,395333,whi would feel unwant,whi peopl feel lone
282003,282003,univers cardin financi recruit new grad major ...,univers squar 1 financi recruit new grad major...
302866,302866,differ head voic falsetto,is impress adult man abl sing c5 chest voic wi...


In [113]:
# Model tuning with 5-fold cross-validation
from sklearn.model_selection import RandomizedSearchCV


params_xgb = {'n_estimators': [1, 2, 4],
              'learning_rate': np.linspace(.01, 1, 10, endpoint=True),
              'max_depth': np.linspace(1, 32, 32, endpoint=True, dtype=int)
              }
cv_xgb = RandomizedSearchCV(xgb.XGBClassifier(objective='binary:logistic', random_state=42),
                            param_distributions=params_xgb, cv=5, n_jobs=-1)

# x_train_cosineSim_count holds one feature per question pair: the count-vector cosine similarity
cv_xgb.fit(x_train_cosineSim_count, y_train)

In [107]:
score = cv_xgb.score(x_test_cosineSim_count, y_test)

print("Accuracy:", score)

Accuracy: 0.695985554923446


In [96]:
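# xgbmodel is the GradientBoostingClassifier fitted on the Tfidf cosine similarity in In [87];
# here it is scored on the count-vector cosine similarity
score = xgbmodel.score(x_test_cosineSim_count, y_test)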

print("Accuracy:", score)

Accuracy: 0.6943942879286321


### Naive Bayes - Countvectorizer

In [97]:
NBclassifier = naive_bayes.BernoulliNB()
NBclassifier.fit(x_train_cosineSim_count, y_train)
roc_auc_score(y_test, NBclassifier.predict_proba(x_test_cosineSim_count)[:,1])

0.5671192363460239

In [98]:
score = NBclassifier.score(x_test_cosineSim_count, y_test)
print("Accuracy:", score)

Accuracy: 0.6308920164568338


### LSTM

In [99]:
final_countvector_df.head()

Unnamed: 0,q1,q2,cosine_sim
0,"(0, 6086)\t1\n (0, 8860)\t1\n (0, 16446)\t...","(0, 3874)\t1\n (0, 6086)\t1\n (0, 8860)\t1...",0.816497
1,"(0, 10969)\t1\n (0, 24075)\t1\n (0, 24924)...","(0, 23191)\t1\n (0, 24273)\t1\n (0, 24924)...",0.365148
2,"(0, 29010)\t1\n (0, 39951)\t1\n (0, 45546)\t1","(0, 6002)\t1\n (0, 11378)\t1\n (0, 21332)\...",0.522233
3,"(0, 6715)\t1\n (0, 18175)\t1\n (0, 34841)\t1","(0, 6715)\t1\n (0, 34841)\t1",0.816497
4,"(0, 13878)\t1","(0, 13878)\t1\n (0, 23522)\t1",0.707107


In [100]:
x_train_cosineSim_count.shape

(283000, 1)

In [101]:
x_train.head()

Unnamed: 0,id,q1_clean,q2_clean,word_overlap,overlap_count
271488,271488,pilot carri flight bag,thing airlin pilot carri flight bag,"{carri, flight, bag, pilot}",4
659,659,long raw cook sausag last refriger,long onigirlast left refriger kept longer,"{refriger, long}",2
319232,319232,need video stream,infrastructur point view need creat websit let...,"{stream, video, need}",3
352892,352892,benefit got read,benefit read,"{benefit, read}",2
147381,147381,drunk,know drunk,{drunk},1


In [None]:
# Create column that concatenates the q1_clean and q2_clean tokens of each row into one list
X_train['q1_q2_clean'] = [(x[1].split()) + (x[2].split()) for x in x_train.values]

In [None]:
X_test=x_test

X_test['q1_q2_clean'] = [(x[1].split()) + (x[2].split()) for x in x_test.values]

In [None]:
X_train.head()

In [None]:
X_train['q1_q2_clean'] = X_train['q1_q2_clean'].apply(lambda x: ' '.join(x))

In [None]:
X_test['q1_q2_clean'] = X_test['q1_q2_clean'].apply(lambda x: ' '.join(x))

In [None]:
X_train.head()

In [None]:
X_test.head()
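
Before text like `q1_q2_clean` can feed an embedding and LSTM layer, each string usually has to become a fixed-length sequence of integer ids. A minimal sketch of that step in plain Python; `build_vocab`, `encode_and_pad`, the reserved ids, and `max_len` are all hypothetical choices, not something the notebook defines:

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids: padding and out-of-vocabulary words


def build_vocab(texts, min_count=1):
    """Map each sufficiently frequent word to an integer id, most frequent first."""
    counts = Counter(word for text in texts for word in text.split())
    words = [word for word, count in counts.most_common() if count >= min_count]
    return {word: index + 2 for index, word in enumerate(words)}


def encode_and_pad(text, vocab, max_len):
    """Convert a string to ids, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(word, UNK) for word in text.split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab(["benefit got read benefit read"])
encoded = encode_and_pad("benefit unknown read", vocab, max_len=5)
print(vocab)
print(encoded)
```

The vocabulary would be built from the training column only, and the same mapping reused for the test column so that unseen words fall back to the `UNK` id.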