## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or identical) questions. Multiple questions with the same intent make seekers spend extra time finding the best answer and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which gives active seekers and writers a better experience and offers more value to both groups in the long term.
Follow the steps outlined below to build an appropriate classifier model.


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv("../data/train.csv")

#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.

### Exploration

In [3]:
df.shape

(404290, 6)

In [4]:
df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [5]:
df.isnull().sum()

id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64

In [6]:
df.dropna(inplace=True)

In [7]:
df.isnull().sum()

id              0
qid1            0
qid2            0
question1       0
question2       0
is_duplicate    0
dtype: int64

In [8]:
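# Keep only the two question columns and the label; copy so that new columns
# can be added later without modifying a view of df.
df_m = df.iloc[:, 3:].copy()
df_m.head()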

| | question1 | question2 | is_duplicate |
|---|---|---|---|
| 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |
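
Before moving on to cleaning, it can help to check how balanced the target is and how long the questions are. The following is a minimal sketch on a made-up toy frame: `toy` and its rows are invented for illustration and are not taken from `train.csv`. With the real data, `df_m` would take its place.

```python
import pandas as pd

# Hypothetical rows with the same columns as df_m (not real data).
toy = pd.DataFrame({
    "question1": ["How do I learn Python?", "What is tf-idf?",
                  "Is water wet?", "Why is the sky blue?"],
    "question2": ["What is the best way to learn Python?", "How do I bake bread?",
                  "Is fire hot?", "What makes the sky blue?"],
    "is_duplicate": [1, 0, 0, 1],
})

# Share of each class; a skewed target affects which metrics are meaningful.
class_share = toy["is_duplicate"].value_counts(normalize=True)

# Summary statistics of the character length of each question column.
length_stats = toy[["question1", "question2"]].apply(lambda col: col.str.len()).describe()

# Number of pairs whose two questions are literally identical.
n_identical = int((toy["question1"] == toy["question2"]).sum())

print(class_share)
print(length_stats)
print(n_identical)
```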


### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming

In [61]:
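import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# The tokenizer models, stopword list, and WordNet data must be available locally,
# e.g. via nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
# (resource names may differ between NLTK versions).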

In [62]:
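lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('rocks')  # quick sanity check: reduces the plural to 'rock'
porter = PorterStemmer()  # stemming alternative, not used in the final clean_txt
# A set gives the same membership results as the list, with faster lookups.
ENGstopwords = set(stopwords.words('english'))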

In [80]:
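def clean_txt(text):
    """Strip punctuation, lowercase, drop stopwords, and lemmatize one question."""
    # Drop punctuation characters (so "Koh-i-Noor" becomes "KohiNoor").
    text = "".join([char for char in text if char not in string.punctuation])
    # Remove single-character words; token-based alternative kept for reference:
    #' '.join( [w for w in text.split() if len(w)>1] )
    text = re.sub(r'(?:^| )\w(?:$| )', ' ', text).strip()
    text = text.replace('  ', ' ')
    tokens = word_tokenize(text.lower())
    text = [word for word in tokens if word not in ENGstopwords]
    # Lemmatize instead of stemming; the Porter stemmer line is kept for comparison.
    #final_text = [porter.stem(word) for word in text]
    final_text = [lemmatizer.lemmatize(word) for word in text]
    return " ".join(final_text)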
    

In [81]:
df_m.loc[27940, 'question1']

'How do the holy scriptures of Hinduism compare and contrast to those of Taoism?'

In [82]:
clean_txt(df_m.loc[27940, 'question1'])

'holy scripture hinduism compare contrast taoism'
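
One detail of `clean_txt` worth checking is the single-character filter. Each regex match consumes the space on both sides of the word, so a single-character word that directly follows another one has no leading space left to match and is skipped. The sketch below compares the regex with the token-based filter that is commented out in `clean_txt`; the sample string is made up for illustration.

```python
import re

sample = "a b c dog"  # made-up input with three adjacent one-letter words

# Regex from clean_txt: the match for "a " swallows the space before "b",
# so "b" is never matched and survives.
regex_result = re.sub(r'(?:^| )\w(?:$| )', ' ', sample).strip()

# Token-based alternative (the commented-out line in clean_txt).
token_result = ' '.join(w for w in sample.split() if len(w) > 1)

print(repr(regex_result))  # 'b dog'
print(repr(token_result))  # 'dog'
```

Switching to the token-based filter would change some of the cleaned strings shown in this notebook, so the outputs would need to be regenerated.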

In [83]:
df_m['question1_clean'] = df_m['question1'].apply(clean_txt)

In [84]:
df_m['question2_clean'] = df_m['question2'].apply(clean_txt)

In [85]:
df_m.head()

| | question1 | question2 | is_duplicate | question1_clean | question2_clean |
|---|---|---|---|---|---|
| 0 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 | step step guide invest share market india | step step guide invest share market |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 | story kohinoor kohinoor diamond | would happen indian government stole kohinoor ... |
| 2 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 | increase speed internet connection using vpn | internet speed increased hacking dns |
| 3 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 | mentally lonely solve | find remainder math2324math divided 2423 |
| 4 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 | one dissolve water quikly sugar salt methane c... | fish would survive salt water |


### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
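
The count-based features in the list above (word count, number of shared words) can be sketched with plain Python. `pair_features` is a hypothetical helper written for illustration, not part of the original notebook; it assumes both inputs are already-cleaned, space-separated strings such as those in `question1_clean` and `question2_clean`.

```python
def pair_features(q1: str, q2: str) -> dict:
    """Simple count-based features for one pair of cleaned questions."""
    tokens1, tokens2 = q1.split(), q2.split()
    set1, set2 = set(tokens1), set(tokens2)
    shared = set1 & set2
    union = set1 | set2
    return {
        "word_count_q1": len(tokens1),
        "word_count_q2": len(tokens2),
        "word_count_diff": abs(len(tokens1) - len(tokens2)),
        "shared_words": len(shared),
        # Jaccard similarity; defined here as 0.0 when both questions are empty.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# A cleaned pair that appears in this notebook's output.
example = pair_features("good book marketing", "best book ever written marketing")
print(example)
```

For this pair the two questions share two words (`book`, `marketing`) out of six distinct words, so the Jaccard similarity is 1/3.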

In [86]:
X_train, X_test, y_train, y_test = train_test_split(
    df_m[['question1_clean', 'question2_clean']], df_m['is_duplicate'],
    test_size=0.3, random_state=42)

In [87]:
X_train.head()

| | question1_clean | question2_clean |
|---|---|---|
| 140908 | new method angioplasty cost r 5000 j hospital ... | much cost run hospital |
| 107096 | whatsapp say message info message read blue ti... | friend abroad sent message one grey tick next ... |
| 27940 | holy scripture hinduism compare contrast taoism | holy scripture hinduism compare contrast italo... |
| 157100 | long typically take get pilot license | much cost get private pilot license |
| 111382 | question havent changed marked needing improve... | question marked instantly needing improvement ... |


In [88]:
X_test.head()

| | question1_clean | question2_clean |
|---|---|---|
| 8067 | play pokémon go korea | play pokémon go china |
| 224279 | breathing treatment help cough | help someone unconscious still breathing |
| 252452 | kellyanne conway annoying opinion | kellyanne conway really imply pay attention wo... |
| 174039 | rate 110 review maruti baleno | career option one completing bachelor degree d... |
| 384863 | good book marketing | best book ever written marketing |
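
With the split in place, one way to turn the tf-idf idea into a numeric feature is the cosine similarity between the tf-idf vectors of the two questions in a pair. This is a minimal sketch, not the notebook's own method: the short lists `q1_texts` and `q2_texts` are stand-ins built from a few cleaned strings shown in this notebook, and in practice the vectorizer would be fit on the `X_train` columns only and then applied to `X_test`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in data: a few cleaned pairs copied from the outputs in this notebook.
q1_texts = ["long typically take get pilot license",
            "good book marketing",
            "play pokémon go korea"]
q2_texts = ["much cost get private pilot license",
            "best book ever written marketing",
            "play pokémon go china"]

# Fit vocabulary and IDF weights on the training text only to avoid test leakage.
vectorizer = TfidfVectorizer()
vectorizer.fit(q1_texts + q2_texts)

v1 = vectorizer.transform(q1_texts)
v2 = vectorizer.transform(q2_texts)

# TfidfVectorizer L2-normalizes each row by default, so the row-wise dot
# product of the two matrices equals the cosine similarity of each pair.
cosine = np.asarray(v1.multiply(v2).sum(axis=1)).ravel()
print(cosine)
```

The resulting array has one value per pair and can be added as a column next to the count-based features.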


### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc
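
As a starting point, the first option in the list (logistic regression) can be sketched as below. The feature matrix here is synthetic: `features` and `labels` are generated at random purely so that the example runs on its own, and the assumption that duplicates score higher on the feature is built into the toy data. With the real project, the engineered features and `is_duplicate` would replace them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real feature matrix: one similarity-like feature per pair.
rng = np.random.default_rng(42)
labels = rng.integers(0, 2, size=200)
# Toy assumption: duplicate pairs (label 1) have a higher similarity on average.
features = (0.3 + 0.4 * labels + rng.normal(0, 0.15, size=200)).reshape(-1, 1)

# Stratify so both splits keep the same share of duplicates.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

model = LogisticRegression()
model.fit(X_tr, y_tr)

# Probability of the positive (duplicate) class for each held-out pair.
proba = model.predict_proba(X_te)[:, 1]
accuracy = accuracy_score(y_te, model.predict(X_te))
loss = log_loss(y_te, proba)
print(f"accuracy={accuracy:.3f}  log_loss={loss:.3f}")
```

Reporting both accuracy and log loss on the held-out split gives a baseline that the other techniques in the list can be compared against in the final presentation.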