## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or the same) questions. Various questions with the same intent can cause people to spend extra time searching for the best answer to their question, and results in members answering multiple versions of the same question. Quora uses random forest to identify duplicated questions to provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.
Follow the steps outlined below to build the appropriate classifier model. 


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [21]:
import numpy as np
import pandas as pd

In [22]:
df = pd.read_csv("train.csv")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404290 non-null  int64 
 1   qid1          404290 non-null  int64 
 2   qid2          404290 non-null  int64 
 3   question1     404289 non-null  object
 4   question2     404288 non-null  object
 5   is_duplicate  404290 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 18.5+ MB


#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.

### Exploration

In [23]:
print("There are", len(df), 'pairs of questions')
print("There are", len(df[df["is_duplicate"]==1]), 'pairs of duplicate questions')
print("There are", len(df[df["is_duplicate"]==0]), 'pairs of non-duplicate questions')


print("That means that about",  round(len(df[df["is_duplicate"]==1])/len(df)*100,2), "% of the questions are duplicate")


There are 404290 pairs of questions
There are 149263 pairs of duplicate questions
There are 255027 pairs of non-duplicate questions
That means that about 36.92 % of the questions are duplicate


### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming

In [24]:
#drop rows with nulls and unnecessary columns
df = df.dropna()
df = df.drop(["id", "qid1", "qid2"],axis=1)
df = df.reset_index(drop=True)

In [25]:
#remove puncation
import string
df['question1'] = df['question1'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
df['question2'] = df['question2'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

In [26]:
#tokeniation
df['question1'] = df['question1'].apply(lambda x: x. split())
df['question2'] = df['question2'].apply(lambda x: x. split())

In [27]:
#removeing stop words
import nltk
from nltk.corpus import stopwords
ENGstopwords = stopwords.words('english')

df['question1'] = df['question1'].apply(lambda x: [word for word in x if word not in ENGstopwords])
df['question2'] = df['question2'].apply(lambda x: [word for word in x if word not in ENGstopwords])



In [28]:
df.to_csv("cleaned_data.csv",index=False)

### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....

In [29]:
import pandas as pd
import numpy as np
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer

df2 =df
df2

Unnamed: 0,question1,question2,is_duplicate
0,"[What, step, step, guide, invest, share, marke...","[What, step, step, guide, invest, share, market]",0
1,"[What, story, Kohinoor, KohiNoor, Diamond]","[What, would, happen, Indian, government, stol...",0
2,"[How, I, increase, speed, internet, connection...","[How, Internet, speed, increased, hacking, DNS]",0
3,"[Why, I, mentally, lonely, How, I, solve]","[Find, remainder, math2324math, divided, 2423]",0
4,"[Which, one, dissolve, water, quikly, sugar, s...","[Which, fish, would, survive, salt, water]",0
...,...,...,...
404282,"[How, many, keywords, Racket, programming, lan...","[How, many, keywords, PERL, Programming, Langu...",0
404283,"[Do, believe, life, death]","[Is, true, life, death]",1
404284,"[What, one, coin]","[Whats, coin]",0
404285,"[What, approx, annual, cost, living, studying,...","[I, little, hairfall, problem, I, want, use, h...",0


In [30]:

df2['q1_vector'] = ""
df2['q2_vector'] = ""
"""""
dropCount = 0

for i in range(len(df2.index)):
    try:
        df2['q1_vector'][i] = TfidfVectorizer().fit_transform(df2['question1'][i])
        df2['q2_vector'][i] = TfidfVectorizer().fit_transform(df2['question2'][i])
    except ValueError:
        df2 = df2.drop([i])
        print("dropped", i)
        dropCount = dropCount + 1

        
df2 = df2.reset_index(drop=True)
print("total drop:", dropCount)
"""

'""\ndropCount = 0\n\nfor i in range(len(df2.index)):\n    try:\n        df2[\'q1_vector\'][i] = TfidfVectorizer().fit_transform(df2[\'question1\'][i])\n        df2[\'q2_vector\'][i] = TfidfVectorizer().fit_transform(df2[\'question2\'][i])\n    except ValueError:\n        df2 = df2.drop([i])\n        print("dropped", i)\n        dropCount = dropCount + 1\n\n        \ndf2 = df2.reset_index(drop=True)\nprint("total drop:", dropCount)\n'

In [31]:
#character count
df2['q1_character_count'] = df2['question1'].str.len().astype(float)
df2['q2_character_count'] = df2['question2'].str.len()

#word count
df2['q1_word_count'] = df2['question1'].apply(len)
df2['q2_word_count'] = df2['question2'].apply(len)


df2['shared_word_count'] = df2.apply(lambda x: len(set(x['question1']) & set(x['question2'])), axis=1)


df2['shared_word_percent'] = df2.apply(lambda x: x['shared_word_count'] / (x['q1_word_count'] + x['q2_word_count']) * 100, axis=1)




In [32]:

test = df2['is_duplicate']
df2 = df2.drop(['is_duplicate'],axis=1)

In [33]:
print(df2['q1_character_count'].apply(type))

df2 = df2.drop(['question1','question2','q1_vector','q2_vector'],axis=1)

df2 = df2.reset_index(drop=True)
df2.to_csv("test.csv",index=False)

0         <class 'float'>
1         <class 'float'>
2         <class 'float'>
3         <class 'float'>
4         <class 'float'>
               ...       
404282    <class 'float'>
404283    <class 'float'>
404284    <class 'float'>
404285    <class 'float'>
404286    <class 'float'>
Name: q1_character_count, Length: 404287, dtype: object


### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

In [41]:
from sklearn.model_selection import train_test_split
X = df2
y = df['is_duplicate']
X.reindex

# Prepare train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print("X training shape:", X_train.shape)
print("X test shape:", X_test.shape)
print("y training shape:", y_train.shape)
print("y test shape:", y_test.shape)



X training shape: (323429, 6)
X test shape: (80858, 6)
y training shape: (323429,)
y test shape: (80858,)


In [42]:
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()


from sklearn.decomposition import PCA

pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.fit_transform(X_test)

model = LogisticRegression()
model.fit(X_train_pca, y_train)

y_pred = model.predict(X_test_pca)
acc = accuracy_score(y_test, y_pred)
percent = round(acc, 4) * 100
print(f'Test Set Accuracy: {percent}%')

Test Set Accuracy: 65.5%
