## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or the same) questions. Multiple questions with the same intent force seekers to spend extra time finding the best answer, and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which provides a better experience for active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build the appropriate classifier model. 


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("train.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404290 non-null  int64 
 1   qid1          404290 non-null  int64 
 2   qid2          404290 non-null  int64 
 3   question1     404289 non-null  object
 4   question2     404288 non-null  object
 5   is_duplicate  404290 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 18.5+ MB


#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.
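
One way to set aside the final testing data is a stratified hold-out split. The sketch below uses scikit-learn's `train_test_split` on a tiny stand-in frame; the 20% test size, the `random_state`, and the toy rows are assumptions for illustration, not values prescribed by the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for train.csv (hypothetical rows, same column names)
toy = pd.DataFrame({
    "question1": [f"q1 text {i}" for i in range(10)],
    "question2": [f"q2 text {i}" for i in range(10)],
    "is_duplicate": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# Hold out 20% as final test data; stratify keeps the duplicate ratio similar in both parts
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["is_duplicate"], random_state=42
)
print(len(train_df), len(test_df))
```

Splitting before any feature engineering keeps information from the test rows out of fitted steps such as vectorizer vocabularies.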

In [4]:
#Dropping rows with nulls and dropping unnecessary columns
df.dropna(inplace=True)
df_cleaned = df.drop(['id', 'qid1', 'qid2'], axis=1)

### Exploration

In [5]:
#checking for rows with null values
df_cleaned.isna().sum()

question1       0
question2       0
is_duplicate    0
dtype: int64

In [12]:
df_cleaned[df_cleaned["is_duplicate"]==1]

Unnamed: 0,question1,question2,is_duplicate,q1_token,q2_token,q1_nostop,q2_nostop,q1_stem,q2_stem
5,astrology i am a capricorn sun cap moon and ca...,im a triple capricorn sun moon and ascendant i...,1,"[astrology, capricorn, sun, cap, moon, cap, ri...","[im, triple, capricorn, sun, moon, ascendant, ...",astrology capricorn sun cap moon cap risingwha...,im triple capricorn sun moon ascendant caprico...,"[astrolog, capricorn, sun, cap, moon, cap, ris...","[im, tripl, capricorn, sun, moon, ascend, capr..."
7,how can i be a good geologist,what should i do to be a great geologist,1,"[good, geologist]","[great, geologist]",good geologist,great geologist,"[good, geologist]","[great, geologist]"
11,how do i read and find my youtube comments,how can i see all my youtube comments,1,"[read, find, youtube, comments]","[see, youtube, comments]",read find youtube comments,see youtube comments,"[read, find, youtub, comment]","[see, youtub, comment]"
12,what can make physics easy to learn,how can you make physics easy to learn,1,"[make, physics, easy, learn]","[make, physics, easy, learn]",make physics easy learn,make physics easy learn,"[make, physic, easi, learn]","[make, physic, easi, learn]"
13,what was your first sexual experience like,what was your first sexual experience,1,"[first, sexual, experience, like]","[first, sexual, experience]",first sexual experience like,first sexual experience,"[first, sexual, experi, like]","[first, sexual, experi]"
...,...,...,...,...,...,...,...,...,...
404277,what are some outfit ideas to wear to a frat p...,what are some outfit ideas wear to a frat them...,1,"[outfit, ideas, wear, frat, party]","[outfit, ideas, wear, frat, themed, party]",outfit ideas wear frat party,outfit ideas wear frat themed party,"[outfit, idea, wear, frat, parti]","[outfit, idea, wear, frat, theme, parti]"
404278,why is manaphy childish in pokémon ranger and ...,why is manaphy annoying in pokemon ranger and ...,1,"[manaphy, childish, pokémon, ranger, temple, sea]","[manaphy, annoying, pokemon, ranger, temple, sea]",manaphy childish pokémon ranger temple sea,manaphy annoying pokemon ranger temple sea,"[manaphi, childish, pokémon, ranger, templ, sea]","[manaphi, annoy, pokemon, ranger, templ, sea]"
404279,how does a long distance relationship work,how are long distance relationships maintained,1,"[long, distance, relationship, work]","[long, distance, relationships, maintained]",long distance relationship work,long distance relationships maintained,"[long, distanc, relationship, work]","[long, distanc, relationship, maintain]"
404281,what does jainism say about homosexuality,what does jainism say about gays and homosexua...,1,"[jainism, say, homosexuality]","[jainism, say, gays, homosexuality]",jainism say homosexuality,jainism say gays homosexuality,"[jainism, say, homosexu]","[jainism, say, gay, homosexu]"


In [5]:
#questions that are considered duplicates
dup = df_cleaned[df_cleaned["is_duplicate"]==1]
dup.head()

Unnamed: 0,question1,question2,is_duplicate
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
7,How can I be a good geologist?,What should I do to be a great geologist?,1
11,How do I read and find my YouTube comments?,How can I see all my Youtube comments?,1
12,What can make Physics easy to learn?,How can you make physics easy to learn?,1
13,What was your first sexual experience like?,What was your first sexual experience?,1


In [6]:
dup.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 149263 entries, 5 to 404286
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   question1     149263 non-null  object
 1   question2     149263 non-null  object
 2   is_duplicate  149263 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 4.6+ MB


In [7]:
#questions that are considered non-duplicates
nodup = df_cleaned[df_cleaned["is_duplicate"]==0]
nodup.head()

Unnamed: 0,question1,question2,is_duplicate
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [8]:
nodup.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 255024 entries, 0 to 404289
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   question1     255024 non-null  object
 1   question2     255024 non-null  object
 2   is_duplicate  255024 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 7.8+ MB
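
The two `info()` outputs above give the class sizes, so the class balance can be worked out directly. The short calculation below only reuses the row counts shown in those outputs; the variable names are illustrative.

```python
dup_count = 149263      # rows reported by dup.info()
nodup_count = 255024    # rows reported by nodup.info()

total = dup_count + nodup_count
dup_share = dup_count / total

print(f"{dup_share:.1%} of the {total} pairs are duplicates")
```

Roughly 37% of the pairs are duplicates, so the classes are moderately imbalanced. This is worth keeping in mind when splitting the data and when reading accuracy figures.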


### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming

In [6]:
#importing natural language toolkit for processing
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import string

In [9]:
#define a function to tokenize the questions and remove punctuations
def text_cleaning(df):
    df2 = pd.DataFrame()
    #turn questions into lower case letters
    df2['question1'] = df['question1'].str.lower()
    df2['question2'] = df['question2'].str.lower()
    
    #remove punctuation (regex=True is needed explicitly: pandas 2.0+ treats the pattern as a literal string by default)
    df2['question1'] = df2['question1'].str.replace(r'[^\w\s]+', '', regex=True)
    df2['question2'] = df2['question2'].str.replace(r'[^\w\s]+', '', regex=True)
    
    #tokenize the questions
    df2['q1_token'] = df2.apply(lambda row: word_tokenize(row['question1']), axis=1)
    df2['q2_token'] = df2.apply(lambda row: word_tokenize(row['question2']), axis=1)
    
    #removing stopwords (a set makes each membership check fast)
    stop = set(stopwords.words('english'))
    df2['q1_token'] = df2['q1_token'].apply(lambda tokens: [word for word in tokens if word not in stop])
    df2['q2_token'] = df2['q2_token'].apply(lambda tokens: [word for word in tokens if word not in stop])
    df2['q1_nostop'] = df2['question1'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))
    df2['q2_nostop'] = df2['question2'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))
    
    #Stemming
    porter = PorterStemmer()
    df2['q1_stem'] = df2.apply(lambda x: [porter.stem(word) for word in x['q1_token']], axis=1)
    df2['q2_stem'] = df2.apply(lambda x: [porter.stem(word) for word in x['q2_token']], axis=1)
    
    return df2

In [10]:
df_test = text_cleaning(df_cleaned)
df_test.head()


Unnamed: 0,question1,question2,q1_token,q2_token,q1_nostop,q2_nostop,q1_stem,q2_stem
0,what is the step by step guide to invest in sh...,what is the step by step guide to invest in sh...,"[step, step, guide, invest, share, market, india]","[step, step, guide, invest, share, market]",step step guide invest share market india,step step guide invest share market,"[step, step, guid, invest, share, market, india]","[step, step, guid, invest, share, market]"
1,what is the story of kohinoor kohinoor diamond,what would happen if the indian government sto...,"[story, kohinoor, kohinoor, diamond]","[would, happen, indian, government, stole, koh...",story kohinoor kohinoor diamond,would happen indian government stole kohinoor ...,"[stori, kohinoor, kohinoor, diamond]","[would, happen, indian, govern, stole, kohinoo..."
2,how can i increase the speed of my internet co...,how can internet speed be increased by hacking...,"[increase, speed, internet, connection, using,...","[internet, speed, increased, hacking, dns]",increase speed internet connection using vpn,internet speed increased hacking dns,"[increas, speed, internet, connect, use, vpn]","[internet, speed, increas, hack, dn]"
3,why am i mentally very lonely how can i solve it,find the remainder when math2324math is divide...,"[mentally, lonely, solve]","[find, remainder, math2324math, divided, 2423]",mentally lonely solve,find remainder math2324math divided 2423,"[mental, lone, solv]","[find, remaind, math2324math, divid, 2423]"
4,which one dissolve in water quikly sugar salt ...,which fish would survive in salt water,"[one, dissolve, water, quikly, sugar, salt, me...","[fish, would, survive, salt, water]",one dissolve water quikly sugar salt methane c...,fish would survive salt water,"[one, dissolv, water, quikli, sugar, salt, met...","[fish, would, surviv, salt, water]"


### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
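
The word-count and shared-word features from the list above can be sketched as follows. The two-row `sample` frame and the `overlap_features` helper are hypothetical; only the `q1_token` / `q2_token` column names come from the cleaning step. Jaccard similarity is added here as one possible way to normalize the shared-word count, not as something the project requires.

```python
import pandas as pd

# Hypothetical two-row sample shaped like the cleaned frame (token-list columns)
sample = pd.DataFrame({
    "q1_token": [["good", "geologist"], ["mentally", "lonely", "solve"]],
    "q2_token": [["great", "geologist"], ["find", "remainder", "divided"]],
})

def overlap_features(row):
    """Word counts plus shared-word statistics for one question pair."""
    words1, words2 = set(row["q1_token"]), set(row["q2_token"])
    shared = words1 & words2
    union = words1 | words2
    return pd.Series({
        "q1_word_count": len(row["q1_token"]),
        "q2_word_count": len(row["q2_token"]),
        "shared_words": len(shared),
        # Jaccard similarity: shared unique words / all unique words (0 if both are empty)
        "jaccard": len(shared) / len(union) if union else 0.0,
    })

features = sample.apply(overlap_features, axis=1)
print(features)
```

On the full dataset the same function could be applied to the frame returned by `text_cleaning`, although a row-wise `apply` over 400,000 rows will take a while.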

In [15]:
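#note: `data` is not defined in the cells shown; it appears to be df_test with is_duplicate re-attached
data.reset_index(drop=True, inplace=True)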

In [16]:
data.head()

Unnamed: 0,question1,question2,is_duplicate,q1_token,q2_token,q1_nostop,q2_nostop,q1_stem,q2_stem
0,what is the step by step guide to invest in sh...,what is the step by step guide to invest in sh...,0,"[step, step, guide, invest, share, market, india]","[step, step, guide, invest, share, market]",step step guide invest share market india,step step guide invest share market,"[step, step, guid, invest, share, market, india]","[step, step, guid, invest, share, market]"
1,what is the story of kohinoor kohinoor diamond,what would happen if the indian government sto...,0,"[story, kohinoor, kohinoor, diamond]","[would, happen, indian, government, stole, koh...",story kohinoor kohinoor diamond,would happen indian government stole kohinoor ...,"[stori, kohinoor, kohinoor, diamond]","[would, happen, indian, govern, stole, kohinoo..."
2,how can i increase the speed of my internet co...,how can internet speed be increased by hacking...,0,"[increase, speed, internet, connection, using,...","[internet, speed, increased, hacking, dns]",increase speed internet connection using vpn,internet speed increased hacking dns,"[increas, speed, internet, connect, use, vpn]","[internet, speed, increas, hack, dn]"
3,why am i mentally very lonely how can i solve it,find the remainder when math2324math is divide...,0,"[mentally, lonely, solve]","[find, remainder, math2324math, divided, 2423]",mentally lonely solve,find remainder math2324math divided 2423,"[mental, lone, solv]","[find, remaind, math2324math, divid, 2423]"
4,which one dissolve in water quikly sugar salt ...,which fish would survive in salt water,0,"[one, dissolve, water, quikly, sugar, salt, me...","[fish, would, survive, salt, water]",one dissolve water quikly sugar salt methane c...,fish would survive salt water,"[one, dissolv, water, quikli, sugar, salt, met...","[fish, would, surviv, salt, water]"


In [11]:
#Using pre-trained model of Word2Vec
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
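
A common way to turn word vectors into a question-level feature is to average the vectors of the tokens in each question and compare the two averages. The sketch below is one such approach, not necessarily the one intended here. The `average_vector` helper and the 3-dimensional `toy_vectors` dictionary are hypothetical stand-ins; the helper only relies on `word in vectors` and `vectors[word]`, which the loaded `KeyedVectors` object also supports.

```python
import numpy as np

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if no token is known."""
    known = [vectors[word] for word in tokens if word in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 3-dimensional stand-in for the pre-trained vectors (the real ones have 300 dimensions)
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0]),
    "great": np.array([0.9, 0.1, 0.0]),
    "geologist": np.array([0.0, 1.0, 0.0]),
}

v1 = average_vector(["good", "geologist"], toy_vectors, dim=3)
v2 = average_vector(["great", "geologist"], toy_vectors, dim=3)

# Cosine similarity between the two averaged question vectors
cosine = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(round(cosine, 3))
```

With the real model, `toy_vectors` would be replaced by `model` and `dim` by 300. Note that the GoogleNews vectors are case-sensitive and were not trained on stemmed text, so the unstemmed token columns are probably the better input.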

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

vectorizer = CountVectorizer(dtype=np.int16)

#note: the vocabulary is fit on question1 only, so words that appear only in question2 are ignored
vectorizer.fit(df_test['q1_nostop'])

q_vec1= vectorizer.transform(df_test['q1_nostop'])
q_vec2= vectorizer.transform(df_test['q2_nostop'])

In [22]:
q_vec1

<404287x80964 sparse matrix of type '<class 'numpy.int16'>'
	with 2156545 stored elements in Compressed Sparse Row format>

In [18]:
q_vec2

<404287x80964 sparse matrix of type '<class 'numpy.int16'>'
	with 2144560 stored elements in Compressed Sparse Row format>

In [21]:
from sklearn.metrics.pairwise import cosine_similarity

test = cosine_similarity(q_vec1, q_vec2, dense_output=False)

MemoryError: Unable to allocate 36.0 GiB for an array with shape (4836604201,) and data type int64
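
The error arises because `cosine_similarity(q_vec1, q_vec2)` compares every question1 with every question2 (a 404287 x 404287 result), while the feature only needs the similarity within each pair. A minimal sketch of a row-wise alternative follows; the toy strings stand in for the `q1_nostop` / `q2_nostop` columns, and fitting the vocabulary on both columns is an assumption that differs from the cell above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Hypothetical stand-ins for df_test['q1_nostop'] and df_test['q2_nostop']
q1 = ["make physics easy learn", "mentally lonely solve"]
q2 = ["make physics easy learn", "find remainder divided"]

vectorizer = CountVectorizer()
vectorizer.fit(q1 + q2)  # assumption: vocabulary built from both question columns

# L2-normalize each row so the dot product of two rows equals their cosine similarity
a = normalize(vectorizer.transform(q1))
b = normalize(vectorizer.transform(q2))

# Elementwise product then row sum: one similarity per pair, never an n x n matrix
pair_cosine = np.asarray(a.multiply(b).sum(axis=1)).ravel()
print(pair_cosine)
```

Because everything stays sparse and the result has one value per row, memory use grows with the number of pairs rather than with its square.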

### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc
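
As a starting point, a logistic regression baseline on engineered features might look like the sketch below. The data is synthetic: a single made-up similarity score per pair replaces the real features, and the split size and seeds are assumptions. The numbers it prints say nothing about performance on the Quora data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for engineered features: one similarity score per pair
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
# Duplicates are given higher similarity on average (made-up distribution)
similarity = np.clip(rng.normal(0.3 + 0.4 * labels, 0.15), 0, 1)
X = similarity.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0
)

clf = LogisticRegression()
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, pred))
print("log loss:", log_loss(y_test, proba))
```

Reporting log loss alongside accuracy is useful here because the classes are imbalanced and the predicted probabilities matter, not only the hard labels.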