## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or identical) questions. Multiple questions with the same intent cause seekers to spend extra time searching for the best answer and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which provides a better experience for active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build the appropriate classifier model. 


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv("../data/train.csv")

#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.
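
One way to set aside such a hold-out portion is a stratified split, which keeps the share of duplicates similar in both parts. The sketch below is a dependency-free illustration only: `stratified_holdout` is a hypothetical helper, the labels are made up, and stratification itself is an assumption (the notebook relies on scikit-learn's `train_test_split`).

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.3, seed=42):
    """Split row indices into train/test while keeping each label's share similar."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 10 + [1] * 10          # made-up is_duplicate labels
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))
```

The returned index lists can then be used to select rows for the training and final testing sets.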

### Exploration

In [3]:
df.shape

(404290, 6)

In [4]:
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [5]:
df.isnull().sum()

id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64

In [6]:
df.dropna(inplace=True)

In [7]:
df.shape

(404287, 6)

In [8]:
df.isnull().sum()

id              0
qid1            0
qid2            0
question1       0
question2       0
is_duplicate    0
dtype: int64

In [9]:
df_m = df.iloc[:, 3:].copy()  # copy so new columns can be added without a SettingWithCopyWarning
df_m.head()

Unnamed: 0,question1,question2,is_duplicate
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming or lemmatization
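
The steps listed above can be sketched without NLTK. This is a minimal illustration under stated assumptions: `TOY_STOPWORDS` and `simple_clean` are hypothetical names, the stopword set is a tiny hand-picked stand-in for NLTK's English list, whitespace splitting stands in for `word_tokenize`, and the example sentence is made up. Stemming and lemmatization are left out.

```python
import string

# Hypothetical, hand-picked stopword set; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {"what", "is", "the", "to", "in"}

def simple_clean(text):
    """Remove punctuation, lowercase, split on whitespace, drop stopwords."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return " ".join(tok for tok in tokens if tok not in TOY_STOPWORDS)

cleaned = simple_clean("What is the best way to learn Python in 2024?")
print(cleaned)
```

The result is a single space-joined string, which is the same shape of output that the notebook's cleaning function produces.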

In [10]:
import string
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

In [11]:
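lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('rocks')  # quick check that the WordNet data is available
porter = PorterStemmer()  # only needed for the commented-out stemming alternative
ENGstopwords = set(stopwords.words('english'))  # set membership checks are faster than list lookups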

In [81]:
def clean_txt(text):
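    ''' Preprocesses a question for numeric representation: strips punctuation and single characters,
        lowercases, removes stopwords and lemmatizes. Returns the cleaned tokens joined by spaces. '''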
    text = "".join([char for char in text if char not in string.punctuation])
    #' '.join( [w for w in text.split() if len(w)>1] )
    text = re.sub(r'(?:^| )\w(?:$| )', ' ', text).strip() # removes single characters
    text = text.replace('  ', ' ')
    tokens = word_tokenize(text.lower())
    text = [word for word in tokens if word not in ENGstopwords]
    #final_text = [porter.stem(word) for word in text]
    final_text = [lemmatizer.lemmatize(word) for word in text]
    return " ".join(final_text)
    

In [13]:
df_m['question1_clean'] = df_m['question1'].apply(clean_txt)

In [14]:
df_m['question2_clean'] = df_m['question2'].apply(clean_txt)

In [15]:
df_m.head()

Unnamed: 0,question1,question2,is_duplicate,question1_clean,question2_clean
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,step step guide invest share market india,step step guide invest share market
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,story kohinoor kohinoor diamond,would happen indian government stole kohinoor ...
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,increase speed internet connection using vpn,internet speed increased hacking dns
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,mentally lonely solve,find remainder math2324math divided 2423
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,one dissolve water quikly sugar salt methane c...,fish would survive salt water


### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
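
As one concrete instance of the "number of the same words in both questions" feature, the sketch below counts the distinct tokens shared by two cleaned questions. It also normalizes that count into a Jaccard similarity, which is an additional assumption and not something this notebook computes. `common_word_count` and `jaccard` are hypothetical helper names; the two strings are the first cleaned pair shown in the table above.

```python
def common_word_count(q1, q2):
    """Number of distinct tokens that appear in both cleaned questions."""
    return len(set(q1.split()) & set(q2.split()))

def jaccard(q1, q2):
    """Shared distinct tokens divided by all distinct tokens (0.0 if both are empty)."""
    s1, s2 = set(q1.split()), set(q2.split())
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

q1 = "step step guide invest share market india"
q2 = "step step guide invest share market"
count = common_word_count(q1, q2)
similarity = jaccard(q1, q2)
print(count, round(similarity, 3))
```

A normalized overlap such as this may be less sensitive to question length than a raw count, though whether it helps the classifier would need to be checked.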

In [16]:
df_features = df_m.iloc[:, 2:].copy()  # copy so new feature columns can be added safely

In [17]:
df_features.head()

Unnamed: 0,is_duplicate,question1_clean,question2_clean
0,0,step step guide invest share market india,step step guide invest share market
1,0,story kohinoor kohinoor diamond,would happen indian government stole kohinoor ...
2,0,increase speed internet connection using vpn,internet speed increased hacking dns
3,0,mentally lonely solve,find remainder math2324math divided 2423
4,0,one dissolve water quikly sugar salt methane c...,fish would survive salt water


In [18]:
df_features['Common'] = df_features.apply(lambda r: set(r['question1_clean'].split()) & 
                                         set(r['question2_clean'].split()), axis=1)

In [19]:
df_features['Common_count'] = df_features['Common'].str.len()

In [20]:
df_features.drop('Common', axis=1, inplace=True)

In [21]:
df_features.head()

Unnamed: 0,is_duplicate,question1_clean,question2_clean,Common_count
0,0,step step guide invest share market india,step step guide invest share market,5
1,0,story kohinoor kohinoor diamond,would happen indian government stole kohinoor ...,2
2,0,increase speed internet connection using vpn,internet speed increased hacking dns,2
3,0,mentally lonely solve,find remainder math2324math divided 2423,0
4,0,one dissolve water quikly sugar salt methane c...,fish would survive salt water,2


In [22]:
X_train, X_test, y_train, y_test = train_test_split(df_features.iloc[:,1:], df_features['is_duplicate'], test_size=0.3, random_state=42)

In [23]:
X_train

Unnamed: 0,question1_clean,question2_clean,Common_count
140908,new method angioplasty cost r 5000 j hospital ...,much cost run hospital,2
107096,whatsapp say message info message read blue ti...,friend abroad sent message one grey tick next ...,7
27940,holy scripture hinduism compare contrast taoism,holy scripture hinduism compare contrast italo...,5
157100,long typically take get pilot license,much cost get private pilot license,3
111382,question havent changed marked needing improve...,question marked instantly needing improvement ...,4
...,...,...,...
259180,power positive thinking,cultivate power positive thinking,3
365841,thing new employee know going first day tennant,thing new employee know going first day att,7
131933,currently winning presidential election,bias aside point time think win presidential e...,2
146868,telugutamilhindi movie leading actor becomes m...,fix internet connection available problem mobi...,0


In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [82]:
def create_doc_term_matrix(text):
    ''' Vectorizes input texts and returns those vectors in a Dataframe plus the vectorizer '''
    vectorizer = TfidfVectorizer(max_features = 1000)
    doc_term_matrix = vectorizer.fit_transform(text)
    return pd.DataFrame(doc_term_matrix.toarray(), columns = vectorizer.get_feature_names_out()), vectorizer
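
To make the tf-idf weighting concrete, here is a small dependency-free sketch of the formula that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function name `tfidf_rows` and the two-document corpus are illustrative assumptions, and the sketch ignores `max_features` and the vectorizer's own tokenization rules.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) where each row holds L2-normalized tf-idf weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw))
        rows.append([w / norm if norm else 0.0 for w in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["good book marketing", "best book ever written marketing"])
print(vocab)
print([round(w, 3) for w in rows[0]])
```

In the first row, "good" receives a larger weight than "book" because it occurs in only one of the two documents.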

In [26]:
q1_df, vectorizer1 = create_doc_term_matrix(X_train['question1_clean'])

In [27]:
vectorizer1

TfidfVectorizer(max_features=1000)

In [28]:
q1_df.head()

Unnamed: 0,10,100,1000,12,15,20,2000,2015,2016,2017,...,written,wrong,yahoo,year,yes,yet,young,youre,youtube,youve
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
q2_df, vectorizer2 = create_doc_term_matrix(X_train['question2_clean'])
q2_df.head()

Unnamed: 0,10,100,1000,12,12th,15,16,20,2000,2015,...,written,wrong,yahoo,year,yes,yet,york,youre,youtube,youve
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [30]:
X_train_vectorized = pd.merge(q1_df,q2_df, left_index=True, right_index=True)

In [31]:
X_train_vectorized.shape

(283000, 2000)

In [32]:
X_train_vectorized['Common_count'] = X_train['Common_count'].values

In [33]:
X_train_vectorized.head()

Unnamed: 0,10_x,100_x,1000_x,12_x,15_x,20_x,2000_x,2015_x,2016_x,2017_x,...,wrong_y,yahoo_y,year_y,yes_y,yet_y,york,youre_y,youtube_y,youve_y,Common_count
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4


In [34]:
X_train_vectorized.isnull().sum()

10_x            0
100_x           0
1000_x          0
12_x            0
15_x            0
               ..
york            0
youre_y         0
youtube_y       0
youve_y         0
Common_count    0
Length: 2001, dtype: int64

In [35]:
X_test.head()

Unnamed: 0,question1_clean,question2_clean,Common_count
8067,play pokémon go korea,play pokémon go china,3
224279,breathing treatment help cough,help someone unconscious still breathing,2
252452,kellyanne conway annoying opinion,kellyanne conway really imply pay attention wo...,2
174039,rate 110 review maruti baleno,career option one completing bachelor degree d...,0
384863,good book marketing,best book ever written marketing,2


In [83]:
def create_test_doc_term_matrix(text, vectorizer):
    '''Transforms input texts into tf-idf vectors using a vectorizer already fitted on the training texts'''
    doc_term_test_matrix = vectorizer.transform(text)
    return pd.DataFrame(doc_term_test_matrix.toarray(), columns = vectorizer.get_feature_names_out())

In [37]:
X_test_q1 = create_test_doc_term_matrix(X_test['question1_clean'], vectorizer1)

In [38]:
X_test_q2 = create_test_doc_term_matrix(X_test['question2_clean'], vectorizer2)

In [39]:
X_test_vectorized = pd.merge(X_test_q1,X_test_q2, left_index=True, right_index=True)

In [40]:
X_test_vectorized.head()

Unnamed: 0,10_x,100_x,1000_x,12_x,15_x,20_x,2000_x,2015_x,2016_x,2017_x,...,written_y,wrong_y,yahoo_y,year_y,yes_y,yet_y,york,youre_y,youtube_y,youve_y
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.590632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [41]:
X_test_vectorized['Common_count'] = X_test['Common_count'].values

In [42]:
X_test_vectorized.isnull().sum()

10_x            0
100_x           0
1000_x          0
12_x            0
15_x            0
               ..
york            0
youre_y         0
youtube_y       0
youve_y         0
Common_count    0
Length: 2001, dtype: int64

### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

In [43]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

In [44]:
scaler = StandardScaler()

In [45]:
X_train_scaled = scaler.fit_transform(X_train_vectorized)

In [46]:
X_test_scaled = scaler.transform(X_test_vectorized)
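
The pattern used here (fit the scaler on the training data, then reuse those statistics on the test data) can be illustrated in plain Python. `fit_standardizer` and `standardize` are hypothetical names and the numbers are made up; like `StandardScaler`, the sketch uses the population standard deviation and falls back to a scale of 1.0 for constant features.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values."""
    mean = fmean(values)
    std = pstdev(values)
    # A constant feature has zero spread; divide by 1.0 instead.
    return mean, (std if std > 0 else 1.0)

def standardize(values, mean, std):
    """Apply previously learned statistics; the test data is never used for fitting."""
    return [(v - mean) / std for v in values]

train_counts = [2, 7, 5, 3, 4]   # made-up training feature values
test_counts = [3, 2, 0]          # made-up test feature values
mean, std = fit_standardizer(train_counts)
train_scaled = standardize(train_counts, mean, std)
test_scaled = standardize(test_counts, mean, std)
print(mean, round(std, 3))
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.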

In [47]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_scaled, y_train)

LogisticRegression(max_iter=1000)

In [48]:
y_pred = clf.predict(X_test_scaled)

In [49]:
f1_LR = f1_score(y_test, y_pred)
f1_LR

0.6081711336359679

In [50]:
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

In [51]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.85      0.81     76609
           1       0.68      0.55      0.61     44678

    accuracy                           0.74    121287
   macro avg       0.72      0.70      0.71    121287
weighted avg       0.73      0.74      0.73    121287
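
The F1 score reported for class 1 is the harmonic mean of its precision and recall. The sketch below recomputes it from the rounded values in the report above; because the inputs are rounded to two decimals, the result only approximates the exact score. `f1_from_precision_recall` is a hypothetical helper.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1_class1 = f1_from_precision_recall(0.68, 0.55)
print(round(f1_class1, 3))
```

The recomputed value is consistent with the 0.61 listed for class 1 in the report.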



In [52]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [53]:
#pipeline = Pipeline([('model', LogisticRegression())])
#params = {'model': [LogisticRegression(max_iter=1000), GaussianNB(), RandomForestClassifier()]}


#grid = GridSearchCV(pipeline, param_grid=params, cv=5, scoring='f1')
#grid.fit(X_train_vectorized, y_train)

#best_model = grid.best_estimator_
#best_hyperparams = grid.best_params_
#best_f1 = grid.score(X_test_vectorized, y_test)  # scoring='f1', so this is an F1 score
#print(f'Best test set F1: {best_f1}\nAchieved with hyperparameters: {best_hyperparams}')

#### Grid search was taking too long, so I'm going to search for the best model manually

In [54]:
model = GaussianNB()
model.fit(X_train_scaled, y_train)

GaussianNB()

In [55]:
y_pred_NB = model.predict(X_test_scaled)

In [56]:
f1_NB = f1_score(y_test, y_pred_NB)
f1_NB

0.5852175313296264

In [57]:
print(classification_report(y_test, y_pred_NB))

              precision    recall  f1-score   support

           0       0.77      0.61      0.68     76609
           1       0.51      0.69      0.59     44678

    accuracy                           0.64    121287
   macro avg       0.64      0.65      0.63    121287
weighted avg       0.67      0.64      0.65    121287



### Word2Vec

In [58]:
import gensim

In [59]:
model = gensim.models.KeyedVectors.load_word2vec_format('../src/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [60]:
model.vector_size

300

In [1]:
def sent_vec(sent):
    ''' Takes a given text and builds an averaged word2vec representation of size 300:
        the 300-value vectors of all items found in the model are summed and the sum is
        divided by ctr, so that each sentence gets a single averaged vector. '''
    vector_size = model.vector_size
    model_res = np.zeros(vector_size)
    ctr = 1  # starts at 1, which avoids division by zero but divides by (matches + 1)
    # NOTE: iterating over a string yields single characters; pass a token list
    # (e.g. sent.split()) to look up whole words instead.
    for word in sent:
        if word in model:
            ctr += 1
            model_res += model[word]
    model_res = model_res/ctr
    return model_res
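
Since the cleaned questions are stored as space-joined strings, averaging at the token level requires splitting them first. The following dependency-free sketch shows that variant under stated assumptions: `toy_vectors` is a made-up two-dimensional lookup standing in for the 300-dimensional word2vec model, and `mean_token_vector` is a hypothetical name. It is an illustration, not the code that produced the results recorded in this notebook.

```python
def mean_token_vector(sentence, vectors, size):
    """Average the vectors of whitespace-separated tokens found in `vectors`."""
    total = [0.0] * size
    found = 0
    for token in sentence.split():      # split first: a bare string iterates per character
        if token in vectors:
            found += 1
            total = [t + v for t, v in zip(total, vectors[token])]
    # Return a zero vector when no token is in the vocabulary.
    return [t / found for t in total] if found else total

toy_vectors = {"salt": [1.0, 0.0], "water": [0.0, 1.0]}  # made-up embeddings
vec = mean_token_vector("fish would survive salt water", toy_vectors, size=2)
print(vec)
```

Dividing by the number of matched tokens, rather than by a counter that starts at 1, gives a true mean; the explicit zero-vector branch handles sentences with no known tokens.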

In [62]:
df_features['vec_q1'] = df_features['question1_clean'].apply(sent_vec)

In [63]:
df_features['vec_q2'] = df_features['question2_clean'].apply(sent_vec)

In [64]:
df_q1_w2v = pd.DataFrame(df_features['vec_q1'].tolist())

In [65]:
df_q2_w2v = pd.DataFrame(df_features['vec_q2'].tolist())

In [66]:
df_w2v = pd.merge(df_q1_w2v, df_q2_w2v, left_index=True, right_index=True)

In [67]:
df_w2v.head()

Unnamed: 0,0_x,1_x,2_x,3_x,4_x,5_x,6_x,7_x,8_x,9_x,...,290_y,291_y,292_y,293_y,294_y,295_y,296_y,297_y,298_y,299_y
0,-0.170836,0.114323,-0.006718,0.161022,-0.051896,0.017819,-0.075502,-0.04103,-0.002517,-0.007202,...,0.038486,-0.06732,-0.1197,0.093805,-0.028404,-0.153506,-0.071074,-0.024759,-0.062761,0.173281
1,-0.160921,0.086849,0.022443,0.143241,0.015695,0.000854,-0.083155,-0.035627,-0.052534,0.044617,...,0.069901,0.016205,-0.077333,0.123644,-0.059435,-0.155334,-0.134787,-0.013493,-0.134024,0.118662
2,-0.144193,0.079826,-0.010764,0.164125,-0.074295,0.035537,-0.054748,-0.018905,-0.048064,0.013645,...,0.053246,-0.048667,-0.071004,0.102996,-0.038228,-0.148201,-0.073793,-0.018244,-0.119204,0.153984
3,-0.081569,0.126857,-0.006502,0.076519,-0.01722,0.072028,-0.037532,0.000977,-0.111033,0.060402,...,0.041852,-0.007486,-0.118155,0.12626,-0.02766,-0.101261,-0.131761,-0.051866,-0.12006,0.069013
4,-0.162272,0.110233,0.012168,0.123234,-0.034326,0.036412,-0.08205,-0.052319,-0.049577,0.016926,...,0.082896,0.0041,-0.138751,0.10675,-0.029978,-0.201314,-0.119731,-0.053594,-0.087118,0.156102


In [68]:
df_w2v['Common_count'] = df_features['Common_count'].values

In [69]:
X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v = train_test_split(df_w2v, df_features['is_duplicate'], test_size=0.3, random_state=42)

In [70]:
scaler_w2v = StandardScaler()

In [71]:
X_train_w2v_scaled = scaler_w2v.fit_transform(X_train_w2v)
X_test_w2v_scaled = scaler_w2v.transform(X_test_w2v)

In [72]:
clf_w2v = LogisticRegression(max_iter =1000)
clf_w2v.fit(X_train_w2v_scaled, y_train_w2v)
y_pred_w2v=clf_w2v.predict(X_test_w2v_scaled)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [73]:
f1_w2v = f1_score(y_test_w2v, y_pred_w2v)
f1_w2v

0.4185834141886879

In [74]:
print(classification_report(y_test_w2v, y_pred_w2v))

              precision    recall  f1-score   support

           0       0.69      0.86      0.76     76609
           1       0.58      0.33      0.42     44678

    accuracy                           0.66    121287
   macro avg       0.63      0.59      0.59    121287
weighted avg       0.65      0.66      0.64    121287



In [75]:
model_w2v = GaussianNB()
model_w2v.fit(X_train_w2v_scaled, y_train_w2v)

GaussianNB()

In [77]:
y_pred_NB_w2v = model_w2v.predict(X_test_w2v_scaled)

In [78]:
f1_NB_w2v = f1_score(y_test_w2v, y_pred_NB_w2v)
f1_NB_w2v

0.4328463197223215

In [79]:
print(classification_report(y_test_w2v, y_pred_NB_w2v))

              precision    recall  f1-score   support

           0       0.67      0.70      0.68     76609
           1       0.45      0.42      0.43     44678

    accuracy                           0.59    121287
   macro avg       0.56      0.56      0.56    121287
weighted avg       0.59      0.59      0.59    121287



##### Gaussian Naive Bayes gave the better performance with word2vec vectors (F1 of 0.43 versus 0.42 for logistic regression), but overall the tf-idf vectorizer combined with logistic regression gave the best performance (F1 of 0.61)