# Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or the same) questions. Multiple questions with the same intent force seekers to spend extra time searching for the best answer, and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which provides a better experience for active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build an appropriate classifier model.


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [38]:
import pandas as pd
import numpy as np

import string
import itertools
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import StandardScaler
import gensim
import scipy.sparse  # scipy.sparse.hstack is used in the modeling cells

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

import warnings
warnings.filterwarnings('ignore')

#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.
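
One way to set aside the final testing data is a stratified hold-out split, sketched below on a toy frame. The column names come from the dataset, and the 25% test size and `random_state=66` mirror the split used later in this notebook; the `stratify` argument and the toy rows are assumptions added for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train.csv (hypothetical rows, same column names).
toy = pd.DataFrame({
    "question1": [f"q1 text {i}" for i in range(16)],
    "question2": [f"q2 text {i}" for i in range(16)],
    "is_duplicate": [0, 0, 0, 1] * 4,
})

# stratify keeps the duplicate/non-duplicate ratio equal in both parts.
train_part, test_part = train_test_split(
    toy, test_size=0.25, random_state=66, stratify=toy["is_duplicate"]
)

print(len(train_part), len(test_part))
print(test_part["is_duplicate"].mean())
```

The held-out part should be touched only once, for the final evaluation, so that model selection does not leak into the reported score.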

In [3]:
df = pd.read_csv("train.csv")
df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


## Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming

In [7]:
# drop na
df.dropna(inplace=True)
print(df.shape)
print(df.isnull().sum().sum())

(404287, 6)
0


In [8]:
stop_words = set(stopwords.words('english'))
table = str.maketrans('', '', string.punctuation)
lemmatizer = WordNetLemmatizer()
porter = PorterStemmer()
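# define a function for text cleaning
def clean_text(text):
    """Tokenize a question and return a list of stemmed tokens."""
    # tokenize
    words = word_tokenize(text)
    # remove purely numeric tokens
    words = [w for w in words if not w.isdigit()]
    # remove stopwords (case-sensitive: capitalized words such as "What" are kept)
    words = [w for w in words if w not in stop_words]
    # strip punctuation characters (punctuation-only tokens become empty strings)
    words = [w.translate(table) for w in words]
    # lemmatization is disabled in favor of stemming
    # words = [lemmatizer.lemmatize(w) for w in words]
    # stemming (the Porter stemmer also lowercases each token)
    words = [porter.stem(w) for w in words]
    return words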

# text cleaning
df['q1_clean'] = df['question1'].apply(clean_text)
df['q2_clean'] = df['question2'].apply(clean_text)
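
The cleaned output shown further down still contains empty-string tokens and words such as `what`, most likely because punctuation-only tokens are translated to `''` and stop words are matched before anything is lowercased. The sketch below shows a stricter variant in plain Python. The name `clean_text_strict` and the tiny stop word set are hypothetical, and it deliberately skips NLTK tokenization and stemming, so it is not a drop-in replacement for `clean_text`.

```python
import string

# Tiny illustrative stop word list (an assumption; the notebook uses NLTK's).
STOP_WORDS = {"what", "is", "the", "to", "in", "a", "of"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text_strict(text):
    """Lowercase, strip punctuation, drop stop words, digits and empty tokens."""
    tokens = text.lower().split()
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]
    return [t for t in tokens if t and not t.isdigit() and t not in STOP_WORDS]

print(clean_text_strict("What is the best horror movie?"))
# ['best', 'horror', 'movie']
```

Changing the cleaning step this way would alter the vocabulary, so the scores reported below would no longer be directly comparable.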

In [9]:
# check target distribution 
df.is_duplicate.value_counts()

0    255024
1    149263
Name: is_duplicate, dtype: int64

In [10]:
# Undersampling: randomly sample class 0 down to the number of rows in class 1
# number of rows with target 1
n_rows_1 = df.is_duplicate.value_counts()[1]
# separate the two classes and undersample class 0
df_1 = df[df.is_duplicate == 1][['is_duplicate', 'q1_clean', 'q2_clean']]
df_0 = df[df.is_duplicate == 0][['is_duplicate', 'q1_clean', 'q2_clean']].sample(n=n_rows_1, random_state=66)
# concatenate the two classes and shuffle
df_clean = pd.concat([df_1, df_0]).sample(frac=1, random_state=66)
print(df_clean.is_duplicate.value_counts())
df_clean.head()

1    149263
0    149263
Name: is_duplicate, dtype: int64


| | is_duplicate | q1_clean | q2_clean |
|---|---|---|---|
| 366758 | 1 | [what, best, horror, movi, ] | [what, best, horror, movi, ] |
| 240037 | 0 | [whi, , trillion, rich, rothschild, forb, rich... | [who, richest, peopl, trinidad, ] |
| 382835 | 1 | [how, cold, gobi, desert, get, , averag, tempe... | [how, cold, gobi, desert, get, , averag, tempe... |
| 335899 | 0 | [what, differ, follow, sentenc, ] | [what, differ, follow, sentenc, without, , , ] |
| 297820 | 1 | [what, reason, behind, abrupt, remov, cyru, mi... | [whi, tata, son, sack, cyru, mistri, ] |
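
Undersampling is often applied to the training portion only, so that the test set keeps the natural class ratio. The helper below is a sketch of that idea on toy data; `undersample_majority` is a hypothetical name, and the seed simply reuses the `random_state` from the cell above.

```python
import pandas as pd

def undersample_majority(frame, label_col="is_duplicate", seed=66):
    """Downsample every class to the size of the smallest class."""
    n_min = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in frame.groupby(label_col)
    ]
    # Concatenate the balanced classes and shuffle the rows.
    return pd.concat(parts).sample(frac=1, random_state=seed)

# Toy training rows: 7 non-duplicates and 3 duplicates.
toy_train = pd.DataFrame({
    "q1_clean": [["a"]] * 10,
    "is_duplicate": [0] * 7 + [1] * 3,
})
balanced = undersample_majority(toy_train)
print(balanced["is_duplicate"].value_counts().to_dict())
```

Applied after a train/test split, this would balance only the rows the model learns from.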


## Feature Engineering

- tf-idf
- word count
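
The word-count features are not built in any cell shown here, although later cells refer to `bow_*` matrices. The sketch below assumes they were produced the same way as the tf-idf features, with `CountVectorizer` fitted on pre-tokenized training questions; the toy token lists and the `identity_analyzer` helper are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

def identity_analyzer(tokens):
    # Questions are already tokenized, so pass the token list through.
    return tokens

toy_train_q1 = [["best", "horror", "movi"], ["cold", "gobi", "desert"]]
toy_train_q2 = [["best", "scari", "movi"], ["hot", "gobi", "desert"]]

bow = CountVectorizer(analyzer=identity_analyzer)
# Fit on both question columns of the training split only.
bow.fit(toy_train_q1 + toy_train_q2)

bow_q1 = bow.transform(toy_train_q1)
bow_q2 = bow.transform(toy_train_q2)
bow_sub = abs(bow_q1 - bow_q2)          # tokens in only one question
bow_multiply = bow_q1.multiply(bow_q2)  # tokens shared by both questions

print(sorted(bow.vocabulary_))
print(bow_sub.toarray())
```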


In [11]:
# The tf-idf and count vectorizers are fitted on all questions they are given,
# so split the train/test sets first and transform the test set with the
# train-fitted vectorizer to avoid data leakage.
# Fit on q1 and q2 of the train set together, then transform q1 and q2 separately.
# Note: the split uses the full `df`, not the undersampled `df_clean`,
# so both sets keep the original class imbalance.
X_train, X_test, y_train, y_test = train_test_split(df[['q1_clean', 'q2_clean']], df['is_duplicate'], test_size=0.25, random_state=66)

# concatenate clean text questions in one series for vectorization fitting
train_questions = pd.concat([X_train.q1_clean, X_train.q2_clean])

#### TfIdfVectorizer

In [12]:
def identity_tokenizer(text):
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, use_idf=True, lowercase=False)
# vectorization
tfidf.fit(train_questions)
# transform questions
train_q1 = tfidf.transform(X_train['q1_clean'])
train_q2 = tfidf.transform(X_train['q2_clean'])
test_q1 = tfidf.transform(X_test['q1_clean'])
test_q2 = tfidf.transform(X_test['q2_clean']) 
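
To illustrate why fitting on the training questions alone avoids leakage, the toy sketch below shows that a token seen only at test time is not in the fitted vocabulary and is ignored by `transform`. The token lists are made up, and a callable `analyzer` is used here in place of the `tokenizer=identity_tokenizer` setup above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity_analyzer(tokens):
    # Documents are already lists of tokens.
    return tokens

train_docs = [["best", "horror", "movi"], ["cold", "gobi", "desert"]]
test_docs = [["best", "thriller"], ["thriller"]]

vec = TfidfVectorizer(analyzer=identity_analyzer)
vec.fit(train_docs)                     # vocabulary and IDF come from train only
test_matrix = vec.transform(test_docs)  # unseen tokens are dropped

print("thriller" in vec.vocabulary_)
print(test_matrix.getnnz(axis=1))       # non-zero entries per test document
```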

In [26]:
def add(q1, q2):
    return q1 + q2
def sub(q1, q2):
    return np.abs(q1-q2)
def multiply(q1, q2):
    return q1.multiply(q2)

train_add = add(train_q1, train_q2)
train_sub = sub(train_q1, train_q2)
train_multiply = multiply(train_q1, train_q2)

test_add = add(test_q1, test_q2)
test_sub = sub(test_q1, test_q2)
test_multiply = multiply(test_q1, test_q2)
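
A small worked example may help in reading the three combinations. The two single-row vectors below are made up, with each column standing for one hypothetical vocabulary term.

```python
import numpy as np
from scipy import sparse

# Columns stand for four hypothetical vocabulary terms.
q1 = sparse.csr_matrix(np.array([[0.5, 0.5, 0.0, 0.0]]))
q2 = sparse.csr_matrix(np.array([[0.5, 0.0, 0.25, 0.0]]))

added = q1 + q2            # non-zero where either question uses the term
diff = abs(q1 - q2)        # non-zero where the weights disagree
shared = q1.multiply(q2)   # non-zero only where both questions use the term

print(added.toarray())
print(diff.toarray())
print(shared.toarray())
```

Roughly, the difference highlights terms that appear in only one question, while the product highlights the terms both questions share.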

## Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc
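
`RandomizedSearchCV` is imported at the top of the notebook but not used in the cells shown. The sketch below shows how it could tune the regularization strength of a logistic regression; the synthetic data, the search range for `C`, and the choice of F1 as the scoring metric are all assumptions for illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(66)
# Synthetic stand-in for the question-pair features (not the Quora data).
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=5,          # number of sampled parameter settings
    scoring="f1",
    cv=3,
    random_state=66,
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```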

In [33]:
def predict(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f'\n-----------{model}-----------')
    print('confusion matrix: ', confusion_matrix(y_test, y_pred))
    print('accuracy: ', accuracy_score(y_test, y_pred))
    print('f1 score: ', f1_score(y_test, y_pred))
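
Some outputs later in this notebook report recall and precision instead of the F1 score, which suggests `predict` was edited between runs. A variant that returns all metrics in one dict makes runs easier to compare; the sketch below uses a hypothetical `evaluate` function and synthetic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model and return its test metrics as a dict."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))      # synthetic features, not the Quora data
y = (X[:, 0] > 0).astype(int)
scores = evaluate(LogisticRegression(), X[:200], y[:200], X[200:], y[200:])
print({k: v for k, v in scores.items() if k != "confusion_matrix"})
```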

#### Vector Difference of Question Pairs as Features

In [34]:
# input sub of q1 and q2 only
predict(BernoulliNB(), train_sub, y_train, test_sub, y_test)
predict(LogisticRegression(), train_sub, y_train, test_sub, y_test)
predict(XGBClassifier(), train_sub, y_train, test_sub, y_test)


-----------BernoulliNB()-----------
confusion matrix:  [[47375 16462]
 [10191 27044]]
accuracy:  0.7362968972613583
f1 score:  0.6698950966671207

-----------LogisticRegression()-----------
confusion matrix:  [[53285 10552]
 [12669 24566]]
accuracy:  0.7702528890296026
f1 score:  0.6790596105206419

-----------XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=16,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)-----------
confusion matrix:  [[57515  6322]
 [1

In [35]:
# input add of q1 and q2 only
predict(BernoulliNB(), train_add, y_train, test_add, y_test)
predict(LogisticRegression(), train_add, y_train, test_add, y_test)
predict(XGBClassifier(), train_add, y_train, test_add, y_test)


-----------BernoulliNB()-----------
confusion matrix:  [[46738 17099]
 [10310 26925]]
accuracy:  0.7288170808928289
f1 score:  0.6626958244625211

-----------LogisticRegression()-----------
confusion matrix:  [[54563  9274]
 [16123 21112]]
accuracy:  0.7487236821275922
f1 score:  0.6244214075509088

-----------XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=16,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)-----------
confusion matrix:  [[56211  7626]
 [1

In [36]:
# input multiply of q1 and q2 only
predict(BernoulliNB(), train_multiply, y_train, test_multiply, y_test)
predict(LogisticRegression(), train_multiply, y_train, test_multiply, y_test)
predict(XGBClassifier(), train_multiply, y_train, test_multiply, y_test)


-----------BernoulliNB()-----------
confusion matrix:  [[56088  7749]
 [18812 18423]]
accuracy:  0.7372071394649359
f1 score:  0.581103032788178

-----------LogisticRegression()-----------
confusion matrix:  [[56726  7111]
 [17901 19334]]
accuracy:  0.7525328478708248
f1 score:  0.6072236180904523

-----------XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=16,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)-----------
confusion matrix:  [[57549  6288]
 [19

In [39]:
# input both add and sub of q1 and q2
train = scipy.sparse.hstack([train_add, train_sub])
test = scipy.sparse.hstack([test_add, test_sub])

predict(BernoulliNB(), train, y_train, test, y_test)
predict(LogisticRegression(), train, y_train, test, y_test)
predict(XGBClassifier(), train, y_train, test, y_test)


-----------BernoulliNB()-----------
confusion matrix:  [[44987 18850]
 [ 8740 28495]]
accuracy:  0.7270262782966598
f1 score:  0.6737999527074959

-----------LogisticRegression()-----------
confusion matrix:  [[54976  8861]
 [10944 26291]]
accuracy:  0.8040505778059205
f1 score:  0.7264011493776507

-----------XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=16,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)-----------
confusion matrix:  [[56348  7489]
 [1

In [40]:
# input both sub and multiply of q1 and q2
train = scipy.sparse.hstack([train_sub, train_multiply])
test = scipy.sparse.hstack([test_sub, test_multiply])

predict(BernoulliNB(), train, y_train, test, y_test)
predict(LogisticRegression(), train, y_train, test, y_test)
predict(XGBClassifier(), train, y_train, test, y_test)


-----------BernoulliNB()-----------
confusion matrix:  [[50148 13689]
 [10598 26637]]
accuracy:  0.7597059521924965
f1 score:  0.6868658217403077

-----------LogisticRegression()-----------
confusion matrix:  [[55305  8532]
 [11728 25507]]
accuracy:  0.7995488364730093
f1 score:  0.7157448719027976

-----------XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=16,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)-----------
confusion matrix:  [[56535  7302]
 [1

In [41]:
# input both add and multiply of q1 and q2
train = scipy.sparse.hstack([train_add, train_multiply])
test = scipy.sparse.hstack([test_add, test_multiply])

predict(BernoulliNB(), train, y_train, test, y_test)
predict(LogisticRegression(), train, y_train, test, y_test)
predict(XGBClassifier(), train, y_train, test, y_test)


-----------BernoulliNB()-----------
confusion matrix:  [[50058 13779]
 [11209 26026]]
accuracy:  0.7527703023587146
f1 score:  0.6756490134994808

-----------LogisticRegression()-----------
confusion matrix:  [[55274  8563]
 [13952 23283]]
accuracy:  0.7772380085483616
f1 score:  0.6740782559603943

-----------XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=16,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)-----------
confusion matrix:  [[56314  7523]
 [1

In [43]:
# input add, sub and multiply of q1 and q2
train = scipy.sparse.hstack([train_add, train_sub, train_multiply])
test = scipy.sparse.hstack([test_add, test_sub, test_multiply])

predict(BernoulliNB(), train, y_train, test, y_test)
predict(LogisticRegression(), train, y_train, test, y_test)
predict(XGBClassifier(), train, y_train, test, y_test)


-----------BernoulliNB()-----------
confusion matrix:  [[47148 16689]
 [ 9169 28066]]
accuracy:  0.7441625771727086
f1 score:  0.684620075618978

-----------LogisticRegression()-----------
confusion matrix:  [[54952  8885]
 [10936 26299]]
accuracy:  0.803892274813994
f1 score:  0.7263011088250321

-----------XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=16,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)-----------
confusion matrix:  [[56520  7317]
 [157

In [208]:
# input sub of q1 and q2 only
predict(BernoulliNB(), bow_train_sub, y_train, bow_test_sub, y_test)

confusion matrix:  [[41377 22460]
 [ 6672 30563]]
accuracy:  0.7117698274497388
recall:  0.8208137505035584
precision:  0.5764102370669332


In [209]:
# input add of q1 and q2 only
predict(BernoulliNB(), bow_train_add, y_train, bow_test_add, y_test)

confusion matrix:  [[46738 17099]
 [10310 26925]]
accuracy:  0.7288170808928289
recall:  0.7231099771720155
precision:  0.6115982191531891


In [210]:
# input multiply of q1 and q2 only
predict(BernoulliNB(), bow_train_multiply, y_train, bow_test_multiply, y_test)

confusion matrix:  [[56088  7749]
 [18812 18423]]
accuracy:  0.7372071394649359
recall:  0.4947764200349134
precision:  0.703920220082531


In [211]:
# input both add and sub of q1 and q2
bow_train = scipy.sparse.hstack([bow_train_add, bow_train_sub])
bow_test = scipy.sparse.hstack([bow_test_add, bow_test_sub])

predict(BernoulliNB(), bow_train, y_train, bow_test, y_test)

confusion matrix:  [[42744 21093]
 [ 6399 30836]]
accuracy:  0.72799588412221
recall:  0.8281455619712635
precision:  0.593810780103603


In [212]:
# input both add and multiply of q1 and q2
bow_train = scipy.sparse.hstack([bow_train_add, bow_train_multiply])
bow_test = scipy.sparse.hstack([bow_test_add, bow_test_multiply])

predict(BernoulliNB(), bow_train, y_train, bow_test, y_test)

confusion matrix:  [[50058 13779]
 [11209 26026]]
accuracy:  0.7527703023587146
recall:  0.6989660265878878
precision:  0.6538374576058285


In [224]:
# input both sub and multiply of q1 and q2
bow_train = scipy.sparse.hstack([bow_train_sub, bow_train_multiply])
bow_test = scipy.sparse.hstack([bow_test_sub, bow_test_multiply])

print('\n-----------Naive Bayes-----------')
predict(BernoulliNB(), bow_train, y_train, bow_test, y_test)
print('\n------------Logistic-------------')
predict(LogisticRegression(), bow_train, y_train, bow_test, y_test)
print('\n-------------xgboost-------------')
predict(XGBClassifier(), bow_train, y_train, bow_test, y_test)


-----------Naive Bayes-----------
confusion matrix:  [[46029 17808]
 [ 6698 30537]]
accuracy:  0.7575391799905018
recall:  0.8201154827447295
precision:  0.631647533354018

------------Logistic-------------
confusion matrix:  [[53889  9948]
 [11096 26139]]
accuracy:  0.7917919898686085
recall:  0.7020008056935679
precision:  0.7243328622495635

-------------xgboost-------------
confusion matrix:  [[57306  6531]
 [18604 18631]]
accuracy:  0.7513158936203894
recall:  0.5003625621055459
precision:  0.7404419362530801


In [214]:
# input add, sub and multiply of q1 and q2
bow_train = scipy.sparse.hstack([bow_train_add, bow_train_sub, bow_train_multiply])
bow_test = scipy.sparse.hstack([bow_test_add, bow_test_sub, bow_test_multiply])

predict(BernoulliNB(), bow_train, y_train, bow_test, y_test)

confusion matrix:  [[45114 18723]
 [ 6913 30322]]
accuracy:  0.7463590311856894
recall:  0.8143413455082583
precision:  0.6182485472525232


In [216]:
# input q1, q2, sub and multiply of q1 and q2
bow_train = scipy.sparse.hstack([bow_train_q1, bow_train_q2, bow_train_sub, bow_train_multiply])
bow_test = scipy.sparse.hstack([bow_test_q1, bow_test_q2, bow_test_sub, bow_test_multiply])

predict(BernoulliNB(), bow_train, y_train, bow_test, y_test)

confusion matrix:  [[47333 16504]
 [ 8442 28793]]
accuracy:  0.7531858477125217
recall:  0.7732778299986571
precision:  0.6356491599885202


In [176]:
import scipy.sparse
t_c_train = scipy.sparse.hstack([tfidf_train, bow_train])
t_c_test = scipy.sparse.hstack([tfidf_test, bow_test])

predict(BernoulliNB(), t_c_train, y_train, t_c_test, y_test)

confusion matrix:  [[44987 18850]
 [ 8740 28495]]
accuracy:  0.7270262782966598
recall:  0.7652746072243857
precision:  0.6018586968000845


In [225]:
import scipy.sparse
t_c_train = scipy.sparse.hstack([tfidf_train_sub, tfidf_train_multiply, bow_train_sub, bow_train_multiply])
t_c_test = scipy.sparse.hstack([tfidf_test_sub, tfidf_test_multiply, bow_test_sub, bow_test_multiply])

print('\n-----------Naive Bayes-----------')
predict(BernoulliNB(), t_c_train, y_train, t_c_test, y_test)
print('\n------------Logistic-------------')
predict(LogisticRegression(), t_c_train, y_train, t_c_test, y_test)
print('\n-------------xgboost-------------')
predict(XGBClassifier(), t_c_train, y_train, t_c_test, y_test)


-----------Naive Bayes-----------
confusion matrix:  [[47246 16591]
 [ 7616 29619]]
accuracy:  0.7604974671521292
recall:  0.7954612595676112
precision:  0.6409651590564813

------------Logistic-------------
confusion matrix:  [[53728 10109]
 [10712 26523]]
accuracy:  0.7939983378185848
recall:  0.7123136833624278
precision:  0.7240390915046954

-------------xgboost-------------
confusion matrix:  [[56070  7767]
 [15145 22090]]
accuracy:  0.7733101155611841
recall:  0.5932590304820733
precision:  0.7398599993301403
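
The cells that print recall and precision do not print F1, but it can be recovered from the confusion matrix. The sketch below recomputes the metrics for the logistic regression result in the last cell above; `metrics_from_confusion` is a hypothetical helper, and the matrix layout `[[TN, FP], [FN, TP]]` is the one scikit-learn's `confusion_matrix` returns for binary labels.

```python
def metrics_from_confusion(cm):
    """Compute metrics from a matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Logistic regression confusion matrix from the last cell above.
m = metrics_from_confusion([[53728, 10109], [10712, 26523]])
print({k: round(v, 4) for k, v in m.items()})
```

The recomputed accuracy, precision and recall agree with the printed values, which gives a consistent F1 for comparison with the earlier tf-idf runs.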
