## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or identical) questions. Multiple questions with the same intent make seekers spend extra time hunting for the best answer, and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which gives active seekers and writers a better experience and offers more value to both groups in the long term.
Follow the steps outlined below to build the appropriate classifier model. 


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
df = pd.read_csv("train.csv")

In [3]:
df_sample = df.sample(frac = 0.1, random_state = 42)
df_sample.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
8067,8067,15738,15739,How do I play Pokémon GO in Korea?,How do I play Pokémon GO in China?,0
368101,368101,12736,104117,What are some of the best side dishes for crab...,What are some good side dishes for buffalo chi...,0
70497,70497,121486,121487,Which is more advisable and better material fo...,What is the best server setup for buddypress?,0
226567,226567,254474,258192,How do I improve logical programming skills?,How can I improve my logical skills for progra...,1
73186,73186,48103,3062,How close we are to see 3rd world war?,How close is a World War III?,1


In [4]:
df_sample.shape

(40429, 6)

#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.
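
One way to set aside that final test data is a stratified split, so the held-out rows keep roughly the same share of duplicates as the full set. The sketch below uses only the standard library to show the idea. `stratified_holdout` and its arguments are hypothetical names, not part of this notebook; with scikit-learn, `train_test_split(..., stratify=y)` would be the more usual choice.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with more 0s than 1s, loosely like is_duplicate
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx))  # number of duplicates in the held-out part
```

Whatever split is used, the held-out rows should stay untouched until the final evaluation.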

### Exploration

In [5]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40429 entries, 8067 to 291758
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            40429 non-null  int64 
 1   qid1          40429 non-null  int64 
 2   qid2          40429 non-null  int64 
 3   question1     40429 non-null  object
 4   question2     40429 non-null  object
 5   is_duplicate  40429 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 2.2+ MB


In [6]:
df_sample.dropna(inplace=True)

In [7]:
df_sample['is_duplicate'].value_counts()

0    25497
1    14932
Name: is_duplicate, dtype: int64

In [8]:
# Are we missing any data?
print('Number of nulls in label: {}'.format(df_sample['is_duplicate'].isnull().sum()))
print('Number of nulls in q1: {}'.format(df_sample['question1'].isnull().sum()))
print('Number of nulls in q2: {}'.format(df_sample['question2'].isnull().sum()))

Number of nulls in label: 0
Number of nulls in q1: 0
Number of nulls in q2: 0


### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming

In [9]:
# Keep only the columns needed for modeling
df_sample = df_sample[['question1', 'question2', 'is_duplicate']]

In [10]:
df_sample['question_merged'] = df_sample['question1'] + ' ' + df_sample['question2']

In [11]:
df_sample.head()

Unnamed: 0,question1,question2,is_duplicate,question_merged
8067,How do I play Pokémon GO in Korea?,How do I play Pokémon GO in China?,0,How do I play Pokémon GO in Korea? How do I pl...
368101,What are some of the best side dishes for crab...,What are some good side dishes for buffalo chi...,0,What are some of the best side dishes for crab...
70497,Which is more advisable and better material fo...,What is the best server setup for buddypress?,0,Which is more advisable and better material fo...
226567,How do I improve logical programming skills?,How can I improve my logical skills for progra...,1,How do I improve logical programming skills? H...
73186,How close we are to see 3rd world war?,How close is a World War III?,1,How close we are to see 3rd world war? How clo...


In [12]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40429 entries, 8067 to 291758
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   question1        40429 non-null  object
 1   question2        40429 non-null  object
 2   is_duplicate     40429 non-null  int64 
 3   question_merged  40429 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.5+ MB


In [13]:
# remove punctuation
import string

def remove_punct(text):
    """Return text with all ASCII punctuation characters removed."""
    return "".join(char for char in text if char not in string.punctuation)

df_sample['question_merged'] = df_sample['question_merged'].apply(remove_punct)

In [14]:
# Import the NLTK package and download the necessary data
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords

ENGstopwords = set(stopwords.words('english'))  # a set makes membership tests fast

In [15]:
# Remove stop words (scikit-learn's built-in English list, not ENGstopwords), tokenize, and compute tf-idf
input_ = df_sample['question_merged']
vectorizer = TfidfVectorizer(strip_accents = 'ascii', lowercase = True, stop_words = 'english')
X = vectorizer.fit_transform(input_)

In [16]:
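# Caution: this densifies the whole sparse matrix (about 1.3 billion floats, roughly 10 GB);
# X[:5].toarray() is enough to inspect a few rows
X.toarray()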

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [17]:
X.shape

(40429, 33268)
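
To make the weights in `X` less opaque, here is a small standard-library sketch of the computation. It assumes raw term counts, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 row normalisation, which is my understanding of `TfidfVectorizer`'s defaults. It skips stop-word removal, accent stripping, and the library's token pattern, so it illustrates the technique rather than reproducing the vectorizer. `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Compute L2-normalised TF-IDF weights for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of docs that contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    rows = []
    for tokens in tokenised:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

docs = [
    "play pokemon go korea",
    "play pokemon go china",
    "improve logical programming skills",
]
rows = tfidf_rows(docs)
print(rows[0])
```

In the first row, `korea` appears in one document and `play` in two, so `korea` receives the larger weight.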

#### Word2Vec Prep

In [34]:
# tokenize on whitespace (punctuation was already stripped above)
def tokenize(text):
    return text.split()

df_sample['q12_token'] = df_sample['question_merged'].apply(lambda x: tokenize(x.lower()))


Unnamed: 0,question1,question2,is_duplicate,question_merged,q12_token
8067,How do I play Pokémon GO in Korea?,How do I play Pokémon GO in China?,0,How do I play Pokémon GO in Korea How do I pla...,"[how, do, i, play, pokémon, go, in, korea, how..."
368101,What are some of the best side dishes for crab...,What are some good side dishes for buffalo chi...,0,What are some of the best side dishes for crab...,"[what, are, some, of, the, best, side, dishes,..."
70497,Which is more advisable and better material fo...,What is the best server setup for buddypress?,0,Which is more advisable and better material fo...,"[which, is, more, advisable, and, better, mate..."
226567,How do I improve logical programming skills?,How can I improve my logical skills for progra...,1,How do I improve logical programming skills Ho...,"[how, do, i, improve, logical, programming, sk..."
73186,How close we are to see 3rd world war?,How close is a World War III?,1,How close we are to see 3rd world war How clos...,"[how, close, we, are, to, see, 3rd, world, war..."


In [37]:
# drop stopwords
def remove_stopwords(tokenized_text):    
    text = [word for word in tokenized_text if word not in ENGstopwords]
    return text

df_sample['q12_token_stop'] = df_sample['q12_token'].apply(lambda x: remove_stopwords(x))

In [38]:
df_sample.head()

Unnamed: 0,question1,question2,is_duplicate,question_merged,q12_token,q12_token_stop
8067,How do I play Pokémon GO in Korea?,How do I play Pokémon GO in China?,0,How do I play Pokémon GO in Korea How do I pla...,"[how, do, i, play, pokémon, go, in, korea, how...","[play, pokémon, go, korea, play, pokémon, go, ..."
368101,What are some of the best side dishes for crab...,What are some good side dishes for buffalo chi...,0,What are some of the best side dishes for crab...,"[what, are, some, of, the, best, side, dishes,...","[best, side, dishes, crab, cakes, good, side, ..."
70497,Which is more advisable and better material fo...,What is the best server setup for buddypress?,0,Which is more advisable and better material fo...,"[which, is, more, advisable, and, better, mate...","[advisable, better, material, crash, test, aut..."
226567,How do I improve logical programming skills?,How can I improve my logical skills for progra...,1,How do I improve logical programming skills Ho...,"[how, do, i, improve, logical, programming, sk...","[improve, logical, programming, skills, improv..."
73186,How close we are to see 3rd world war?,How close is a World War III?,1,How close we are to see 3rd world war How clos...,"[how, close, we, are, to, see, 3rd, world, war...","[close, see, 3rd, world, war, close, world, wa..."


#### Stemming

In [39]:
from nltk.stem.snowball import SnowballStemmer

In [42]:
stemmer = SnowballStemmer("english")

In [53]:
def stemming(array):
    return [stemmer.stem(token) for token in array]

df_sample['stemmed'] = df_sample['q12_token_stop'].apply(stemming)


In [55]:
df_sample.head()

Unnamed: 0,question1,question2,is_duplicate,question_merged,q12_token,q12_token_stop,stemmed
8067,How do I play Pokémon GO in Korea?,How do I play Pokémon GO in China?,0,How do I play Pokémon GO in Korea How do I pla...,"[how, do, i, play, pokémon, go, in, korea, how...","[play, pokémon, go, korea, play, pokémon, go, ...","[play, pokémon, go, korea, play, pokémon, go, ..."
368101,What are some of the best side dishes for crab...,What are some good side dishes for buffalo chi...,0,What are some of the best side dishes for crab...,"[what, are, some, of, the, best, side, dishes,...","[best, side, dishes, crab, cakes, good, side, ...","[best, side, dish, crab, cake, good, side, dis..."
70497,Which is more advisable and better material fo...,What is the best server setup for buddypress?,0,Which is more advisable and better material fo...,"[which, is, more, advisable, and, better, mate...","[advisable, better, material, crash, test, aut...","[advis, better, materi, crash, test, automobil..."
226567,How do I improve logical programming skills?,How can I improve my logical skills for progra...,1,How do I improve logical programming skills Ho...,"[how, do, i, improve, logical, programming, sk...","[improve, logical, programming, skills, improv...","[improv, logic, program, skill, improv, logic,..."
73186,How close we are to see 3rd world war?,How close is a World War III?,1,How close we are to see 3rd world war How clos...,"[how, close, we, are, to, see, 3rd, world, war...","[close, see, 3rd, world, war, close, world, wa...","[close, see, 3rd, world, war, close, world, wa..."


### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
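
The word-count and shared-word ideas in the list above can be sketched per question pair, keeping the two questions separate rather than merged. `pair_features` and the feature names are hypothetical, and the whitespace tokenisation is a simplifying assumption (punctuation stays attached to words here).

```python
def pair_features(q1, q2):
    """Word-count and overlap features for one question pair."""
    tokens1 = q1.lower().split()
    tokens2 = q2.lower().split()
    set1, set2 = set(tokens1), set(tokens2)
    shared = set1 & set2
    union = set1 | set2
    return {
        "len_q1": len(tokens1),
        "len_q2": len(tokens2),
        "len_diff": abs(len(tokens1) - len(tokens2)),
        "shared_words": len(shared),
        # Jaccard similarity: shared distinct words over all distinct words
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

features = pair_features(
    "How do I play Pokémon GO in Korea?",
    "How do I play Pokémon GO in China?",
)
print(features)
```

This pair shares seven of its eight words yet is labelled as not duplicate in the sample, so overlap counts are probably best treated as one signal among several. On the DataFrame, something like `df_sample.apply(lambda row: pair_features(row['question1'], row['question2']), axis=1, result_type='expand')` should produce one column per feature.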

Questions:
- Should the two question columns be merged, so that tf-idf can show a count of the words that appear in both questions and the words that do not?
- For word2vec, once it is set up, do I just pass the result into another model to see if it works?
- Does this need to be completed as a neural network?

In [21]:
#Merge question columns

In [22]:
#Remove duplicates

In [23]:
#tf-idf

In [57]:
#word2vec
import gensim

Model_SG = gensim.models.Word2Vec(df_sample['q12_token_stop'], vector_size = 100, window = 3, min_count = 2, sg = 1)
#Model_SG_stemmed = gensim.models.Word2Vec(df_sample['stemmed'], vector_size = 100, window = 3, min_count = 2, sg = 1)
#Model_CBoW = gensim.models.Word2Vec(df_sample['q12_token_stop'], vector_size = 100, window = 5, min_count = 1)
#Model_CBoW = gensim.models.Word2Vec(df_sample['stemmed'], vector_size = 100, window = 5, min_count = 1)

In [63]:
# Goal: compute the cosine similarity between the two questions of each pair using the w2v model.
# The models trained above on this small sample will be poor, so load the pretrained Google News model and use that instead.

KeyError: "Key 'how' not present"
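
A lookup raises this kind of `KeyError` when a token is missing from the model's vocabulary. Since `how` is an NLTK stop word, it was presumably removed from `q12_token_stop` before training and never entered the vocabulary; that is an inference, not something the output confirms. The sketch below shows one common way around it: average only the vectors of known tokens, then take the cosine similarity of the two averages. A plain dict, `toy_vectors`, stands in for the trained word vectors, and both function names are hypothetical. With gensim, the same membership check would be `token in Model_SG.wv`.

```python
import math

def sentence_vector(tokens, vectors):
    """Average the vectors of tokens present in `vectors`; None if none are."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 2-d embeddings standing in for a trained model's word vectors
toy_vectors = {
    "close": [1.0, 0.0],
    "world": [0.0, 1.0],
    "war": [0.0, 1.0],
}

q1 = ["how", "close", "world", "war"]  # "how" is out of vocabulary and is skipped
q2 = ["close", "world", "war"]
v1 = sentence_vector(q1, toy_vectors)
v2 = sentence_vector(q2, toy_vectors)
print(cosine_similarity(v1, v2))
```

A pair in which neither question has any known token yields `None` vectors, so the caller needs a fallback value for that case.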

### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

In [25]:
y = df_sample['is_duplicate']

In [26]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [27]:
# Initialize different Classification Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier



In [28]:
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score

cl1 = RandomForestClassifier(random_state=42)
cl1.fit(X_train,y_train)
z = cl1.predict(X_test)
accuracy = accuracy_score(y_test, z)
f1 = f1_score(y_test,z)
C = confusion_matrix(y_test,z)
print(f'F1 score: {f1}')
print(f'Accuracy score: {accuracy}')
print(f'Confusion matrix:\n {C}')

F1 score: 0.6292673925078428
Accuracy score: 0.7515458817709622
Confusion matrix:
 [[4372  784]
 [1225 1705]]
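
The scores above can be recomputed by hand from the confusion matrix, which also gives precision, recall, and a majority-class baseline for context. This sketch assumes scikit-learn's layout, with rows as actual labels and columns as predicted labels, so the top-left cell counts true negatives. The `rf_` variable names are only illustrative.

```python
# Confusion matrix printed by the random forest cell: rows = actual, cols = predicted
tn, fp = 4372, 784
fn, tp = 1225, 1705

total = tn + fp + fn + tp
rf_accuracy = (tp + tn) / total
rf_precision = tp / (tp + fp)
rf_recall = tp / (tp + fn)
rf_f1 = 2 * rf_precision * rf_recall / (rf_precision + rf_recall)
# Accuracy of always predicting "not duplicate" on the same test rows
baseline_accuracy = (tn + fp) / total

print(f"accuracy={rf_accuracy:.4f} precision={rf_precision:.4f} recall={rf_recall:.4f}")
print(f"f1={rf_f1:.4f} baseline_accuracy={baseline_accuracy:.4f}")
```

Because always predicting "not duplicate" already scores about 0.64 on these rows, accuracy alone flatters every model here; F1 and recall on the duplicate class are more informative.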


In [29]:
cl2 = SVC(probability=True, random_state=42)
cl2.fit(X_train,y_train)
z = cl2.predict(X_test)
accuracy = accuracy_score(y_test, z)
f1 = f1_score(y_test,z)
C = confusion_matrix(y_test,z)
print(f'F1 score: {f1}')
print(f'Accuracy score: {accuracy}')
print(f'Confusion matrix:\n {C}')

F1 score: 0.5887045979424732
Accuracy score: 0.7577294088548108
Confusion matrix:
 [[4725  431]
 [1528 1402]]


In [30]:
cl3 = LogisticRegression(random_state=42)
cl3.fit(X_train,y_train)
z = cl3.predict(X_test)
accuracy = accuracy_score(y_test, z)
f1 = f1_score(y_test,z)
C = confusion_matrix(y_test,z)
print(f'F1 score: {f1}')
print(f'Accuracy score: {accuracy}')
print(f'Confusion matrix:\n {C}')

F1 score: 0.561709188964945
Accuracy score: 0.738684145436557
Confusion matrix:
 [[4619  537]
 [1576 1354]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [31]:
cl4 = GradientBoostingClassifier(random_state=42)
cl4.fit(X_train,y_train)
z = cl4.predict(X_test)
accuracy = accuracy_score(y_test, z)
f1 = f1_score(y_test,z)
C = confusion_matrix(y_test,z)
print(f'F1 score: {f1}')
print(f'Accuracy score: {accuracy}')
print(f'Confusion matrix:\n {C}')

F1 score: 0.35945663531870425
Accuracy score: 0.6967598318080633
Confusion matrix:
 [[4946  210]
 [2242  688]]


cl1 = RandomForestClassifier(random_state=42)
cl2 = SVC(probability=True, random_state=42)
cl3 = KNeighborsClassifier()
cl4 = LogisticRegression(random_state=42)
cl5 = GradientBoostingClassifier(random_state=42)
ft1 = PCA()
ft2 = SelectKBest()

# Initialize the hyperparameter grid for each classifier

param1 = {}
#param1['classifier__n_estimators'] = [2, 5, 10, 15, 50]
#param1['classifier__max_depth'] = [5, 10, 20]
param1['classifier'] = [cl1]

param2 = {}
param2['classifier__C'] = [10**-2, 10**-1, 10**0, 10**1, 10**2]
#param2['classifier__kernel'] = ['rbf', 'linear']
param2['classifier'] = [cl2]

param3 = {}
#param3['classifier__n_neighbors'] = [2,5,10,25,50]
param3['classifier'] = [cl3]

param4 = {}
param4['classifier__C'] = [10**-2, 10**-1, 10**0, 10**1, 10**2]
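param4['classifier__penalty'] = ['l1', 'l2']
param4['classifier__solver'] = ['liblinear']  # the default lbfgs solver does not support the l1 penalty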
param4['classifier__class_weight'] = [None, {0:1,1:5}, {0:1,1:10}, {0:1,1:25}]
param4['classifier__max_iter'] = [250]
param4['classifier'] = [cl4]

param5 = {}
#param5['classifier__n_estimators'] = [2, 5, 10, 50, 100, 250]
#param5['classifier__max_depth'] = [5, 10, 20]
param5['classifier'] = [cl5]

param6 = {}
param6['features__pca__n_components'] = [2, 5, 8, 11, 14, 17]
param6['features'] = [ft1]

param7 = {}
param7['features__select_best__k'] = [1, 3, 6]
param7['features'] = [ft2]

#no feature union
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion

classifiers = [cl1, cl2, cl3]

#params = [param1, param2, param3, param5]
params = [param1, param2, param3, param5]

#feature_union = FeatureUnion([('pca', PCA()),
                              #('select_best', SelectKBest())])

pipeline = Pipeline(steps = [('classifier', 'passthrough')])

grid = GridSearchCV(pipeline, param_grid = params, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
best_hyperparams = grid.best_params_
best_acc = grid.score(X_test, y_test)
print(f'Best test set accuracy:\n\t {best_acc}\nAchieved with hyperparameters:\n\t {best_hyperparams}')

In [32]:
# set pipeline for multiple classifiers

In [33]:
# determine if this needs to be run as an NN..... hopefully not