## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many of them ask similar (or identical) questions. Multiple questions with the same intent make people spend extra time searching for the best answer, and lead members to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which gives a better experience to active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build the appropriate classifier model.


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [102]:
# import pandas as pd
import pandas as pd
# import train_test_split
from sklearn.model_selection import train_test_split
# import nltk
import nltk
# import stopwords
from nltk.corpus import stopwords
# import tokenize
from nltk.tokenize import word_tokenize
# import stemming
from nltk.stem import PorterStemmer
# spacy for lemmatization
import spacy
# import the spaCy English language class
from spacy.lang.en import English
# import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# import regex
import re
# import matplotlib
import matplotlib.pyplot as plt

In [24]:
df = pd.read_csv("D:\\Python(New)\\Project\\Project_Week_10\\data\\train.csv")

#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.
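
One way to set aside the final testing data is a stratified hold-out split, so that both parts keep the same share of duplicates. The following is a minimal sketch, assuming a tiny stand-in DataFrame in place of `train.csv`; the 20% size, the `random_state` value, and the names `toy`, `train_df`, and `holdout_df` are illustrative choices, not requirements of the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for train.csv (hypothetical rows, same column names)
toy = pd.DataFrame({
    "question1": [f"question a{i}" for i in range(10)],
    "question2": [f"question b{i}" for i in range(10)],
    "is_duplicate": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# Hold out 20% as final test data; stratify keeps the class ratio in both parts
train_df, holdout_df = train_test_split(
    toy, test_size=0.2, stratify=toy["is_duplicate"], random_state=40
)

print(len(train_df), len(holdout_df))
```

The held-out part should stay untouched during exploration and model selection and be used only for the final evaluation.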

### Exploration

In [25]:
# explore the data
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [26]:
# check the shape of the data
df.shape

(404290, 6)

In [27]:
# count number of duplicates
df['is_duplicate'].value_counts()

0    255027
1    149263
Name: is_duplicate, dtype: int64
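
The counts above can be turned into class shares and a majority-class baseline, which is a useful reference point for any accuracy reported later. A small sketch using only the two printed counts; in pandas the same shares could be obtained with `df['is_duplicate'].value_counts(normalize=True)`.

```python
# Counts taken from the value_counts() output above
counts = {0: 255027, 1: 149263}
total = sum(counts.values())

# Share of each class in the dataset
for label, n in counts.items():
    print(f"class {label}: {n / total:.2%}")

# Accuracy of always predicting the majority class (not a duplicate)
baseline_accuracy = max(counts.values()) / total
print(f"majority-class baseline: {baseline_accuracy:.4f}")
```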

In [28]:
# number of unique values in question1 and their share of all rows
number_q1 = df['question1'].nunique()
print(f'question 1 has {number_q1} unique questions\n   Percentage of total: {number_q1/df.shape[0]*100:.2f}%')
# number of unique values in question2 and their share of all rows
number_q2 = df['question2'].nunique()
print(f'question 2 has {number_q2} unique questions\n   Percentage of total: {number_q2/df.shape[0]*100:.2f}%')

question 1 has 290456 unique questions
   Percentage of total: 71.84%
question 2 has 299174 unique questions
   Percentage of total: 74.00%


### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming

In [29]:
# define a function to clean the text
def text_cleaning(features):
   """Clean an iterable of strings with successive regex substitutions:
       - Replaces all special (non-word) characters with spaces.
       - Removes single letters surrounded by whitespace.
       - Substitutes multiple spaces with a single space.
       - Removes a leading 'b' followed by whitespace.
       - Converts all text to lowercase.

   Args:
       features (iterable): an array or list. Example: df['col']

   Returns:
       list: the cleaned strings, in the same order as the input.
   """
   processed_features = []
   for sentence in features:
       # Replace all special (non-word) characters with spaces
       alt_text = re.sub(r'\W', ' ', str(sentence))

       # Remove single letters surrounded by whitespace
       alt_text = re.sub(r'\s+[a-zA-Z]\s+', ' ', alt_text)

       # Intended to remove a single letter at the start. As written, \^ matches
       # a literal '^' (already stripped above), so this step changes nothing;
       # an unescaped ^ would anchor at the start of the string instead.
       alt_text = re.sub(r'\^[a-zA-Z]\s+', ' ', alt_text)

       # Substitute multiple spaces with a single space
       alt_text = re.sub(r'\s+', ' ', alt_text)

       # Remove a leading 'b' followed by whitespace
       alt_text = re.sub(r'^b\s+', '', alt_text)

       # Convert to lowercase
       alt_text = alt_text.lower()
       processed_features.append(alt_text)
   return processed_features
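
To see what the substitutions do, here is a short sketch that applies them one at a time to a made-up sentence (the sample text and variable names are assumptions, not taken from the dataset). It also contrasts the escaped caret `\^`, which matches a literal `^` character, with an unescaped `^`, which anchors at the start of the string.

```python
import re

sample = "I am learning C & Python: is it hard?"

step1 = re.sub(r"\W", " ", sample)              # non-word characters -> spaces
step2 = re.sub(r"\s+[a-zA-Z]\s+", " ", step1)   # drop single letters between spaces
step3 = re.sub(r"\s+", " ", step2).lower()      # collapse whitespace, then lowercase
print(repr(step3))

# \^ looks for a literal '^', which step1 has already replaced, so nothing changes
literal_caret = re.sub(r"\^[a-zA-Z]\s+", " ", step2)
# ^ anchors at the start, so a single leading letter is removed
anchored = re.sub(r"^[a-zA-Z]\s+", "", step2)
print(repr(literal_caret))
print(repr(anchored))
```

This is consistent with cleaned rows that still begin with a single letter, such as "i am having little hairfall problem ...".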
        

In [30]:
# build the stop-word set and the stemmer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

# define a function to tokenize the text
def tokenize(text):
    return word_tokenize(text)

# define a function to keep alphabetic tokens only
def alphabetic_filter(tokens):
    return [word for word in tokens if word.isalpha()]

# define a function to filter stopwords
def remove_stopwords(words):
    return [word for word in words if word not in STOP_WORDS]

# define a function to stem words
def stemming(words):
    return [STEMMER.stem(word) for word in words]

# define a preprocessing function
def preprocessing(text, output_list=True):
    """Tokenize, keep alphabetic tokens, drop stopwords, and stem.

    Returns a list of stems, or a space-joined string if output_list is False.
    """
    words = stemming(remove_stopwords(alphabetic_filter(tokenize(text))))
    return words if output_list else ' '.join(words)



In [31]:
# select the questions
q1 = df['question1'].values
q2 = df['question2'].values

In [32]:
# clean question 1 and question 2
df['question1'] = text_cleaning(q1)

df['question2'] = text_cleaning(q2)

In [33]:
df

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,what is the step by step guide to invest in sh...,what is the step by step guide to invest in sh...,0
1,1,3,4,what is the story of kohinoor koh noor diamond,what would happen if the indian government sto...,0
2,2,5,6,how can increase the speed of my internet conn...,how can internet speed be increased by hacking...,0
3,3,7,8,why am mentally very lonely how can solve it,find the remainder when math 23 24 math is div...,0
4,4,9,10,which one dissolve in water quikly sugar salt ...,which fish would survive in salt water,0
...,...,...,...,...,...,...
404285,404285,433578,379845,how many keywords are there in the racket prog...,how many keywords are there in perl programmin...,0
404286,404286,18840,155606,do you believe there is life after death,is it true that there is life after death,1
404287,404287,537928,537929,what is one coin,what this coin,0
404288,404288,537930,537931,what is the approx annual cost of living while...,i am having little hairfall problem but want t...,0


In [35]:
# Preprocess the questions
df['question1_list'] = df['question1'].apply(preprocessing)
df['question2_list'] = df['question2'].apply(preprocessing)


In [71]:
# save cleaned data
df.to_csv(r'D:\Python(New)\Project\Project_Week_10\data\cleaned_data.csv', index=False)

In [72]:
df

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,question1_list,question2_list
0,0,1,2,what is the step by step guide to invest in sh...,what is the step by step guide to invest in sh...,0,"['step', 'step', 'guid', 'invest', 'share', 'm...","['step', 'step', 'guid', 'invest', 'share', 'm..."
1,1,3,4,what is the story of kohinoor koh noor diamond,what would happen if the indian government sto...,0,"['stori', 'kohinoor', 'koh', 'noor', 'diamond']","['would', 'happen', 'indian', 'govern', 'stole..."
2,2,5,6,how can increase the speed of my internet conn...,how can internet speed be increased by hacking...,0,"['increas', 'speed', 'internet', 'connect', 'u...","['internet', 'speed', 'increas', 'hack', 'dn']"
3,3,7,8,why am mentally very lonely how can solve it,find the remainder when math 23 24 math is div...,0,"['mental', 'lone', 'solv']","['find', 'remaind', 'math', 'math', 'divid']"
4,4,9,10,which one dissolve in water quikly sugar salt ...,which fish would survive in salt water,0,"['one', 'dissolv', 'water', 'quikli', 'sugar',...","['fish', 'would', 'surviv', 'salt', 'water']"
...,...,...,...,...,...,...,...,...
404285,404285,433578,379845,how many keywords are there in the racket prog...,how many keywords are there in perl programmin...,0,"['mani', 'keyword', 'racket', 'program', 'lang...","['mani', 'keyword', 'perl', 'program', 'langua..."
404286,404286,18840,155606,do you believe there is life after death,is it true that there is life after death,1,"['believ', 'life', 'death']","['true', 'life', 'death']"
404287,404287,537928,537929,what is one coin,what this coin,0,"['one', 'coin']",['coin']
404288,404288,537930,537931,what is the approx annual cost of living while...,i am having little hairfall problem but want t...,0,"['approx', 'annual', 'cost', 'live', 'studi', ...","['littl', 'hairfal', 'problem', 'want', 'use',..."


In [122]:
# load cleaned data
df = pd.read_csv(r'D:\Python(New)\Project\Project_Week_10\data\cleaned_data.csv')
# drop Unnamed: 0 and Unnamed: 0.1
#df = df.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis = 1)
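
After a round trip through CSV, the `question1_list` and `question2_list` columns come back as strings such as `"['step', 'guid']"` rather than Python lists. If list values are needed again, one option is `ast.literal_eval` from the standard library. A minimal sketch on a single value, where the variable names are illustrative:

```python
import ast

# A list written to CSV is stored as its string representation
stored = str(['step', 'step', 'guid', 'invest'])
print(type(stored).__name__, stored)

# ast.literal_eval safely parses the string back into a Python list
tokens = ast.literal_eval(stored)
print(type(tokens).__name__, tokens)
```

Applied to a whole column this would look like `df['question1_list'].apply(ast.literal_eval)`. The TF-IDF step further down works on the string form directly, so the conversion is only needed for features that operate on token lists.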

In [125]:
# dropna() returns a new DataFrame; the result is not assigned, so df keeps all rows
df.dropna()
df

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,question1_list,question2_list
0,0,1,2,what is the step by step guide to invest in sh...,what is the step by step guide to invest in sh...,0,"['step', 'step', 'guid', 'invest', 'share', 'm...","['step', 'step', 'guid', 'invest', 'share', 'm..."
1,1,3,4,what is the story of kohinoor koh noor diamond,what would happen if the indian government sto...,0,"['stori', 'kohinoor', 'koh', 'noor', 'diamond']","['would', 'happen', 'indian', 'govern', 'stole..."
2,2,5,6,how can increase the speed of my internet conn...,how can internet speed be increased by hacking...,0,"['increas', 'speed', 'internet', 'connect', 'u...","['internet', 'speed', 'increas', 'hack', 'dn']"
3,3,7,8,why am mentally very lonely how can solve it,find the remainder when math 23 24 math is div...,0,"['mental', 'lone', 'solv']","['find', 'remaind', 'math', 'math', 'divid']"
4,4,9,10,which one dissolve in water quikly sugar salt ...,which fish would survive in salt water,0,"['one', 'dissolv', 'water', 'quikli', 'sugar',...","['fish', 'would', 'surviv', 'salt', 'water']"
...,...,...,...,...,...,...,...,...
404285,404285,433578,379845,how many keywords are there in the racket prog...,how many keywords are there in perl programmin...,0,"['mani', 'keyword', 'racket', 'program', 'lang...","['mani', 'keyword', 'perl', 'program', 'langua..."
404286,404286,18840,155606,do you believe there is life after death,is it true that there is life after death,1,"['believ', 'life', 'death']","['true', 'life', 'death']"
404287,404287,537928,537929,what is one coin,what this coin,0,"['one', 'coin']",['coin']
404288,404288,537930,537931,what is the approx annual cost of living while...,i am having little hairfall problem but want t...,0,"['approx', 'annual', 'cost', 'live', 'studi', ...","['littl', 'hairfal', 'problem', 'want', 'use',..."


### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
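
The word-count and shared-word ideas from the list above can be sketched with plain Python sets. This is one possible formulation, not the feature set used in the cells that follow; the function name `pair_features` and the feature names are hypothetical.

```python
def pair_features(tokens1, tokens2):
    """Return simple count-based features for one pair of token lists."""
    set1, set2 = set(tokens1), set(tokens2)
    shared = set1 & set2
    union = set1 | set2
    return {
        "len_q1": len(tokens1),
        "len_q2": len(tokens2),
        "len_diff": abs(len(tokens1) - len(tokens2)),
        "shared_words": len(shared),
        # Jaccard similarity; 0.0 when both questions are empty
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

# Token lists taken from one of the cleaned rows shown earlier
q1_tokens = ['believ', 'life', 'death']
q2_tokens = ['true', 'life', 'death']
features = pair_features(q1_tokens, q2_tokens)
print(features)
```

Features like these are dense and low-dimensional, so they could be used on their own or stacked next to a sparse TF-IDF matrix.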

In [80]:
# TFIDF Vectorizer
vectorizer = TfidfVectorizer()

In [129]:
# convert the concatenated question 1 and question 2 token strings to TF-IDF features
X = vectorizer.fit_transform(df['question1_list'] + df['question2_list'])

In [82]:
y = df['is_duplicate']

In [130]:
print(X.shape)
print(y.shape)

(404290, 58812)
(404290,)


### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

In [91]:
# import LogisticRegression
from sklearn.linear_model import LogisticRegression
# import accuracy_score
from sklearn.metrics import accuracy_score
# import confusion_matrix
from sklearn.metrics import confusion_matrix
# import classification_report
from sklearn.metrics import classification_report

In [142]:
# train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 40)

### Logistic Regression

In [143]:
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
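
The warning above indicates that the solver stopped at its iteration limit before converging. As the message suggests, one option is to raise `max_iter` (the default is 100). A minimal sketch on synthetic stand-in data; the value 1000 and the names `X_demo`, `y_demo`, and `logreg_demo` are assumptions for illustration, and the right limit for the real TF-IDF matrix would need to be found by experiment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the real inputs would be X_train and y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# A higher iteration limit gives the solver more room to converge
logreg_demo = LogisticRegression(max_iter=1000)
logreg_demo.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
print(logreg_demo.n_iter_)
```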


In [144]:
# predict
y_pred = logreg.predict(X_test)
# accuracy
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')

Accuracy: 0.7434143807662816


In [145]:
# confusion matrix
print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}')

Confusion Matrix: 
[[44449  6421]
 [14326 15662]]
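
The confusion matrix can be read by hand: in scikit-learn, rows are true labels and columns are predicted labels. A short sketch that derives accuracy, precision, recall, and F1 for the duplicate class from the four printed numbers:

```python
# Confusion matrix from the logistic regression output above
# (rows are true labels, columns are predicted labels)
tn, fp = 44449, 6421
fn, tp = 14326, 15662

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of pairs predicted duplicate, share that are duplicates
recall = tp / (tp + fn)      # of true duplicates, share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The low recall shows that the model misses roughly half of the true duplicate pairs, which accuracy alone does not reveal.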


In [146]:
# classification report
print(f'Classification Report: \n{classification_report(y_test, y_pred)}')

Classification Report: 
              precision    recall  f1-score   support

           0       0.76      0.87      0.81     50870
           1       0.71      0.52      0.60     29988

    accuracy                           0.74     80858
   macro avg       0.73      0.70      0.71     80858
weighted avg       0.74      0.74      0.73     80858



### Random Forest Classifier

In [147]:
# random forest classifier
from sklearn.ensemble import RandomForestClassifier

In [98]:
# fit a random forest classifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

In [99]:
# predict
y_pred = rfc.predict(X_test)
# accuracy
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')


Accuracy: 0.7995745628138218


In [100]:
# confusion matrix
print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}')

Confusion Matrix: 
[[45468  5335]
 [10871 19184]]
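
To compare the two models on more than accuracy, the same quantities can be computed from both confusion matrices. This sketch uses only the numbers printed above; the helper name `summarize` is hypothetical.

```python
# Confusion matrices copied from the outputs above (rows = true labels)
logreg_cm = [[44449, 6421], [14326, 15662]]
rf_cm = [[45468, 5335], [10871, 19184]]

def summarize(cm):
    """Return accuracy, class-1 precision and recall, and per-class support."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision_1": tp / (tp + fp),
        "recall_1": tp / (tp + fn),
        "support": (tn + fp, fn + tp),
    }

for name, cm in [("logistic regression", logreg_cm), ("random forest", rf_cm)]:
    print(name, summarize(cm))
```

The per-class supports come out differently (50870 and 29988 for logistic regression, 50803 and 30055 for the random forest), which suggests the two models were scored on different test splits. Under that assumption the comparison is indicative rather than exact; refitting both models on the same split would make it direct.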
