# <b>The Quora Question Pair Similarity Problem</b>

You can also check out the blog summary https://sanjayc.medium.com/the-quora-question-pair-similarity-problem-3598477af172

# Introduction

Quora is a Q&A platform, much like StackOverflow, but it is general-purpose, so questions rarely involve code the way they do on StackOverflow.

One of the many problems Quora faces is duplicate questions. Duplicates hurt the experience of both the asker and the answerer: the asker of a duplicate question could simply be shown the answers to the earlier question, and answerers would not have to repeat themselves for what is essentially the same question.

For example, suppose someone asks "How can I be a good geologist?" and the question receives some answers. Later someone else asks "What should I do to be a great geologist?".<br>
Both questions ask the same thing. Even though the wording is different, the intent is the same.<br>
So the answers will be the same for both questions, and we can simply show the answers to the first question. That way the person asking gets answers immediately, and the people who already answered the first question don't have to repeat themselves.

This problem is available on Kaggle as a competition. https://www.kaggle.com/c/quora-question-pairs

# Business Objectives and Constraints

* There is no strict latency requirement.
* We would like to have interpretability but it is not absolutely mandatory.
* The cost of misclassification is medium.
* Both classes (duplicate or not) are equally important.

# Data Overview

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
import re
from fuzzywuzzy import fuzz
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import sparse
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, log_loss
from sklearn.calibration import CalibratedClassifierCV
import xgboost as xgb
import nltk
import time
from matplotlib.pyplot import figure
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import MinMaxScaler
from joblib import dump, load
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
# import optuna
import hyperopt
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
import warnings
import gc
from sklearn.model_selection import cross_val_score
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

In [None]:
data = pd.read_csv('train.csv.zip')

In [None]:
print(data.columns)
print(data.is_duplicate.unique())
print(data.is_duplicate.value_counts())
print(data.shape)

Available Columns : <b>id, qid1, qid2, question1, question2, is_duplicate</b><br>
Class labels : <b>0, 1</b><br>
Total training data / No. of rows :  <b>404290</b><br>
No. of columns :  <b>6</b><br>
**is_duplicate** is the dependent variable.<br>
No. of non-duplicate data points is <b>255027</b><br>
No. of duplicate data points is <b>149263</b>

We have **404290** training data points, of which only **36.92%** are positive, so the dataset is imbalanced.

# Business Metrics

This is a binary classification problem.
* We need to minimize the log loss for this challenge.
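
For reference, the binary log loss is the mean negative log-probability that the model assigns to the true label. The sketch below is a minimal pure-Python version; the helper name `binary_log_loss` and the clipping constant `eps` are illustrative assumptions, not part of the competition code.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (illustrative helper)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) never occurs.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident wrong predictions are penalised far more than confident right ones.
print(binary_log_loss([1, 0], [0.9, 0.1]))   # both predictions close to the truth
print(binary_log_loss([1, 0], [0.1, 0.9]))   # both predictions confidently wrong
```

Because the penalty grows without bound as a wrong prediction approaches 0 or 1, well-calibrated probabilities matter more for this metric than raw accuracy.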

# Basic EDA

In [None]:
testdata = pd.read_csv('test.csv')
print(testdata.shape)

In [None]:
data.head(5)

In [None]:
testdata.head(5)

The test data does not have question ids, so the independent variables are **question1** and **question2**, and the dependent variable is **is_duplicate**.

In [None]:
data.info()

In [None]:
data = data.dropna()
print(data.shape)

In [None]:
print(data.duplicated(('question1', 'question2')).sum())

3 rows had null values, so we removed them. We now have **404287** question pairs.

In [None]:
duplicate_value_counts = data.is_duplicate.value_counts()
print(duplicate_value_counts/duplicate_value_counts.sum())
plt.title('Distribution of classes')
duplicate_value_counts.plot.bar()

**36.92%** of the question pairs are duplicates and **63.08%** are non-duplicates.

In [None]:
qids = np.append(data.qid1.values,data.qid2.values)
print(len(set(qids)))
print(len(qids))

In [None]:
occurences = np.bincount(qids)
plt.figure(figsize=(10,5)) 
plt.hist(occurences, bins=range(0,np.max(occurences)))
plt.yscale('log')
plt.xlabel('Number of times question repeated')
plt.ylabel('Number of questions')
plt.title('Question vs Repetition')
plt.show()
print(np.min(occurences), np.max(occurences))

* Out of **808574** total questions (including both question1 and question2), **537929** are unique.
* Most of the questions are repeated very few times. Only a few of them are repeated multiple times.
* One question is repeated **157** times which is the max number of repetitions.
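
The same counts can be reproduced on a toy example with `collections.Counter`. This is only a sketch: the question ids below are hypothetical stand-ins for the `qid1` and `qid2` columns.

```python
from collections import Counter

# Hypothetical question-id pairs standing in for the qid1 / qid2 columns.
qid1 = [1, 3, 1, 5]
qid2 = [2, 4, 3, 1]

# Count how many times every question id appears across both columns.
occurrence_counts = Counter(qid1 + qid2)

total_questions = len(qid1) + len(qid2)
unique_questions = len(occurrence_counts)
most_repeated_qid, max_repetitions = occurrence_counts.most_common(1)[0]

print(total_questions, unique_questions, most_repeated_qid, max_repetitions)
```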

In [None]:
print(data.question1.apply(len).min())
# idxmin() returns the index label, which is what .loc expects after dropna().
print(data.loc[data.question1.apply(len).idxmin()])
print(data.question2.apply(len).min())
print(data.loc[data.question2.apply(len).idxmin()])

Some questions have only a few characters and cannot be meaningful. They will be dealt with later, during data cleaning.

# Data Cleaning

In [None]:
lemmatizer = WordNetLemmatizer()

def preprocess_text(x):
    x = str(x).lower()
    # Strip HTML tags first, while the angle brackets are still present.
    x = BeautifulSoup(x, "html.parser").get_text()
    x = x.replace(",000,000", "m").replace(",000", "k").replace("′", "'").replace("’", "'")\
                           .replace("won't", "will not").replace("cannot", "can not").replace("can't", "can not")\
                           .replace("n't", " not").replace("what's", "what is").replace("it's", "it is")\
                           .replace("'ve", " have").replace("i'm", "i am").replace("'re", " are")\
                           .replace("he's", "he is").replace("she's", "she is").replace("'s", " own")\
                           .replace("%", " percent ").replace("₹", " rupee ").replace("$", " dollar ")\
                           .replace("€", " euro ").replace("'ll", " will")
    x = re.sub(r"([0-9]+)000000", r"\1m", x)
    x = re.sub(r"([0-9]+)000", r"\1k", x)
    x = re.sub(r"http\S+", "", x)
    x = re.sub(r"\W", " ", x)

    # lemmatize() expects a single token, so apply it word by word.
    x = " ".join(lemmatizer.lemmatize(word) for word in x.split())
    return x

In [None]:
def data_cleaning(data):
    newdata = pd.DataFrame()
    newdata['question1_final'] = data.question1.apply(preprocess_text)
    newdata['question2_final'] = data.question2.apply(preprocess_text)
    return newdata

In [None]:
traindata = data_cleaning(data)

In [None]:
testdata = data_cleaning(testdata)

In [None]:
print(data.head())

In [None]:
print(traindata.head())

* We have converted everything to lower case.
* We have expanded contractions.
* We have replaced currency symbols with currency names.
* We have removed hyperlinks.
* We have removed non-alphanumeric characters.
* We have reduced inflected words to their base form with the WordNet lemmatizer.
* We have removed HTML tags.
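
To make a few of these steps concrete, here is a small sketch that uses only the standard library. The helper `normalize_example` is hypothetical and covers just a subset of the cleaning (number shortening, two contractions, one currency symbol, punctuation removal); it leaves out lemmatization and HTML stripping.

```python
import re

def normalize_example(text):
    """Toy version of a few cleaning steps (illustrative, not the full pipeline)."""
    text = text.lower()
    # Thousands separators first, then bare trailing zeros.
    text = text.replace(",000,000", "m").replace(",000", "k")
    text = re.sub(r"([0-9]+)000000", r"\1m", text)
    text = re.sub(r"([0-9]+)000", r"\1k", text)
    # Expand two common contractions.
    text = text.replace("can't", "can not").replace("what's", "what is")
    # Replace a currency symbol with its name and drop remaining punctuation.
    text = text.replace("$", " dollar ")
    text = re.sub(r"\W", " ", text)
    # Collapse the repeated spaces left behind by the replacements.
    return " ".join(text.split())

print(normalize_example("What's the best way to invest $50,000?"))
```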

# Feature Extraction

In [None]:
def doesMatch (q, match):
    q1, q2 = q['question1_final'], q['question2_final']
    q1 = q1.split()
    q2 = q2.split()
    if len(q1)>0 and len(q2)>0 and q1[match]==q2[match]:
        return 1
    else:
        return 0

In [None]:
def feature_extract(data):
    # Build the stop word set once instead of once per row.
    stop_words = set(stopwords.words('english'))

    data['q1_char_num'] = data.question1_final.apply(len)
    data['q2_char_num'] = data.question2_final.apply(len)
    data['q1_word_num'] = data.question1_final.apply(lambda x: len(x.split()))
    data['q2_word_num'] = data.question2_final.apply(lambda x: len(x.split()))

    data['total_word_num'] = data['q1_word_num'] + data['q2_word_num']
    data['differ_word_num'] = abs(data['q1_word_num'] - data['q2_word_num'])
    data['same_first_word'] = data.apply(lambda x: doesMatch(x, 0), axis=1)
    data['same_last_word'] = data.apply(lambda x: doesMatch(x, -1), axis=1)
    data['total_unique_word_num'] = data.apply(lambda x: len(set(x.question1_final.split()).union(set(x.question2_final.split()))), axis=1)
    data['total_unique_word_withoutstopword_num'] = data.apply(lambda x: len(set(x.question1_final.split()).union(set(x.question2_final.split())) - stop_words), axis=1)
    data['total_unique_word_num_ratio'] = data['total_unique_word_num'] / data['total_word_num']

    data['common_word_num'] = data.apply(lambda x: len(set(x.question1_final.split()).intersection(set(x.question2_final.split()))), axis=1)
    data['common_word_ratio'] = data['common_word_num'] / data['total_unique_word_num']
    data['common_word_ratio_min'] = data['common_word_num'] / data.apply(lambda x: min(len(set(x.question1_final.split())), len(set(x.question2_final.split()))), axis=1)
    data['common_word_ratio_max'] = data['common_word_num'] / data.apply(lambda x: max(len(set(x.question1_final.split())), len(set(x.question2_final.split()))), axis=1)

    data['common_word_withoutstopword_num'] = data.apply(lambda x: len(set(x.question1_final.split()).intersection(set(x.question2_final.split())) - stop_words), axis=1)
    data['common_word_withoutstopword_ratio'] = data['common_word_withoutstopword_num'] / data['total_unique_word_withoutstopword_num']
    data['common_word_withoutstopword_ratio_min'] = data['common_word_withoutstopword_num'] / data.apply(lambda x: min(len(set(x.question1_final.split()) - stop_words), len(set(x.question2_final.split()) - stop_words)), axis=1)
    data['common_word_withoutstopword_ratio_max'] = data['common_word_withoutstopword_num'] / data.apply(lambda x: max(len(set(x.question1_final.split()) - stop_words), len(set(x.question2_final.split()) - stop_words)), axis=1)

    data["fuzz_ratio"] = data.apply(lambda x: fuzz.ratio(x.question1_final, x.question2_final), axis=1)
    data["fuzz_partial_ratio"] = data.apply(lambda x: fuzz.partial_ratio(x.question1_final, x.question2_final), axis=1)
    data["fuzz_token_set_ratio"] = data.apply(lambda x: fuzz.token_set_ratio(x.question1_final, x.question2_final), axis=1)
    data["fuzz_token_sort_ratio"] = data.apply(lambda x: fuzz.token_sort_ratio(x.question1_final, x.question2_final), axis=1)
    data.fillna(0, inplace=True)
    return data

FuzzyWuzzy uses Levenshtein Distance to calculate the differences between sequences. https://github.com/seatgeek/fuzzywuzzy
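
As a reminder of what Levenshtein distance measures, here is a textbook dynamic-programming sketch. It is not FuzzyWuzzy's internal implementation (the library returns similarity scores between 0 and 100 rather than a raw edit count); the function name `levenshtein` is illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
```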

In [None]:
traindata = feature_extract(traindata)

In [None]:
testdata = feature_extract(testdata)

In [None]:
traindata.head()

In [None]:
traindata.shape

We have created **23** features from the questions.
* We have created the features q1_char_num and q2_char_num with the character count of each question, and q1_word_num and q2_word_num with the word count of each question.
* We have created total_word_num feature which is equal to sum of q1_word_num and q2_word_num.
* We have created differ_word_num feature which is absolute difference between q1_word_num and q2_word_num.
* We have created same_first_word feature which is 1 if both questions have same first word otherwise 0.
* We have created same_last_word feature which is 1 if both questions have same last word otherwise 0.
* We have created total_unique_word_num feature which is equal to total number of unique words in both questions.
* We have created total_unique_word_withoutstopword_num feature which is equal to total number of unique words in both questions without the stop words.
* The total_unique_word_num_ratio is equal to total_unique_word_num divided by total_word_num.
* We have created common_word_num feature which is count of total common words in both questions.
* The common_word_ratio feature is equal to common_word_num divided by total_unique_word_num.
* The common_word_ratio_min is equal to common_word_num divided by the smaller of the two questions' unique word counts.
* The common_word_ratio_max is equal to common_word_num divided by the larger of the two questions' unique word counts.
* We have created common_word_withoutstopword_num feature which is count of total common words in both questions excluding the stopwords.
* The common_word_withoutstopword_ratio feature is equal to common_word_withoutstopword_num divided by total_unique_word_withoutstopword_num.
* The common_word_withoutstopword_ratio_min is equal to common_word_withoutstopword_num divided by the smaller of the two questions' unique word counts, excluding the stopwords.
* The common_word_withoutstopword_ratio_max is equal to common_word_withoutstopword_num divided by the larger of the two questions' unique word counts, excluding the stopwords.
* Then we have extracted fuzz_ratio, fuzz_partial_ratio, fuzz_token_set_ratio and fuzz_token_sort_ratio features with fuzzywuzzy string matching tool. Reference: https://github.com/seatgeek/fuzzywuzzy
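
The word-overlap features can be worked through by hand on the geologist example from the introduction. The helper `overlap_features` below is a simplified sketch that assumes both questions are non-empty and already cleaned; it ignores stop words and the fuzzy ratios.

```python
def overlap_features(q1, q2):
    """Word-overlap features for one question pair (illustrative helper)."""
    words1, words2 = set(q1.split()), set(q2.split())
    common = words1 & words2
    union = words1 | words2
    return {
        "common_word_num": len(common),
        "total_unique_word_num": len(union),
        "common_word_ratio": len(common) / len(union),
        "common_word_ratio_min": len(common) / min(len(words1), len(words2)),
        "common_word_ratio_max": len(common) / max(len(words1), len(words2)),
    }

q1 = "how can i be a good geologist"
q2 = "what should i do to be a great geologist"
features = overlap_features(q1, q2)
print(features)
```

The two questions share 4 of their 12 distinct words ("i", "be", "a", "geologist"), so common_word_ratio is 4/12, while the min and max variants divide by 7 and 9 unique words respectively.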

# EDA with Features

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Total Number of Words')
sns.kdeplot(traindata['total_word_num'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Total Number of Words')
sns.boxplot(x=data.is_duplicate, y=traindata['total_word_num'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Difference in Number of Words')
sns.kdeplot(traindata['differ_word_num'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Difference in Number of Words')
sns.boxplot(x=data.is_duplicate, y=traindata['differ_word_num'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('Have same First word?')
sns.kdeplot(traindata['same_first_word'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Have same First word?')
sns.countplot(x=traindata['same_first_word'], hue=data.is_duplicate, palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('Have same Last word?')
sns.kdeplot(traindata['same_last_word'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Have same Last word?')
sns.countplot(x=traindata['same_last_word'], hue=data.is_duplicate, palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Total Number of Unique Words')
sns.kdeplot(traindata['total_unique_word_num'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Total Number of Unique Words')
sns.boxplot(x=data.is_duplicate, y=traindata['total_unique_word_num'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Total Number of Unique Words without Stop words')
sns.kdeplot(traindata['total_unique_word_withoutstopword_num'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Total Number of Unique Words without Stop words')
sns.boxplot(x=data.is_duplicate, y=traindata['total_unique_word_withoutstopword_num'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Total Unique words to Total words Ratio')
sns.kdeplot(traindata['total_unique_word_num_ratio'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Total Unique words to Total words Ratio')
sns.boxplot(x=data.is_duplicate, y=traindata['total_unique_word_num_ratio'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Number of Common words')
sns.kdeplot(traindata['common_word_num'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Number of Common words')
sns.boxplot(x=data.is_duplicate, y=traindata['common_word_num'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Ratio of number of Common words to total Unique words')
sns.kdeplot(traindata['common_word_ratio'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Ratio of number of Common words to total Unique words')
sns.boxplot(x=data.is_duplicate, y=traindata['common_word_ratio'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Ratio of number of Common words to Minimum of Unique words')
sns.kdeplot(traindata['common_word_ratio_min'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Ratio of number of Common words to Minimum of Unique words')
sns.boxplot(x=data.is_duplicate, y=traindata['common_word_ratio_min'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Ratio of number of Common words to Maximum of Unique words')
sns.kdeplot(traindata['common_word_ratio_max'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Ratio of number of Common words to Maximum of Unique words')
sns.boxplot(x=data.is_duplicate, y=traindata['common_word_ratio_max'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Number of Common words Without Stop words')
sns.kdeplot(traindata['common_word_withoutstopword_num'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Number of Common words Without Stop words')
sns.boxplot(x=data.is_duplicate, y=traindata['common_word_withoutstopword_num'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('Ratio of no. of Common words to total Unique words Without Stop words')
sns.kdeplot(traindata['common_word_withoutstopword_ratio'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Ratio of no. of Common words to total Unique words Without Stop words')
sns.boxplot(x=data.is_duplicate, y=traindata['common_word_withoutstopword_ratio'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('Ratio of no. of Common words to Minimum Unique words w/o Stop words')
sns.kdeplot(traindata['common_word_withoutstopword_ratio_min'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Ratio of no. of Common words to Minimum Unique words w/o Stop words')
sns.boxplot(x=data.is_duplicate, y=traindata['common_word_withoutstopword_ratio_min'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('Ratio of no. of Common words to Maximum Unique words w/o Stop words')
sns.kdeplot(traindata['common_word_withoutstopword_ratio_max'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Ratio of no. of Common words to Maximum Unique words w/o Stop words')
sns.boxplot(x=data.is_duplicate, y=traindata['common_word_withoutstopword_ratio_max'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Fuzz Ratio')
sns.kdeplot(traindata['fuzz_ratio'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Fuzz Ratio')
sns.boxplot(x=data.is_duplicate, y=traindata['fuzz_ratio'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Fuzz Partial Ratio')
sns.kdeplot(traindata['fuzz_partial_ratio'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Fuzz Partial Ratio')
sns.boxplot(x=data.is_duplicate, y=traindata['fuzz_partial_ratio'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Fuzz Token Set Ratio')
sns.kdeplot(traindata['fuzz_token_set_ratio'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Fuzz Token Set Ratio')
sns.boxplot(x=data.is_duplicate, y=traindata['fuzz_token_set_ratio'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('PDF of Fuzz Token Sort Ratio')
sns.kdeplot(traindata['fuzz_token_sort_ratio'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Boxplot of Fuzz Token Sort Ratio')
sns.boxplot(x=data.is_duplicate, y=traindata['fuzz_token_sort_ratio'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
g = sns.jointplot(x = 'q1_char_num', y = 'q2_char_num', kind = "scatter", hue=data.is_duplicate, data = traindata, palette="Dark2")
g.fig.suptitle("Joint plot between Number of Characters in Question1 and Question2", y=1.02)
plt.show()

In [None]:
g = sns.jointplot(x = 'q1_word_num', y = 'q2_word_num', kind = "scatter", hue=data.is_duplicate, data = traindata, palette="Dark2")
g.fig.suptitle("Joint plot between Number of Words in Question1 and Question2", y=1.02)
plt.show()

* If the first word or the last word is the same, there is a high chance that the question pair is a duplicate.
* The total number of unique words (q1 and q2 combined), with and without stopwords, is smaller for duplicate question pairs.
* For duplicate question pairs, the ratio of total unique words to total words is generally smaller.
* Duplicate question pairs tend to have more words in common, so the features derived from common words also show different distributions for the two classes.
* The fuzz ratios are generally higher for duplicate question pairs.

# Featurization with SentenceBERT

I first tried InferSent sentence embeddings, but they have 4096 dimensions, which made the training data too large to handle. So I discarded them and chose SentenceBERT for this problem.

SentenceBERT is a BERT-based sentence embedding technique. We will use the pre-trained SentenceBERT model paraphrase-mpnet-base-v2, which is recommended for best quality. It produces a 768-dimensional embedding for each sentence. https://www.sbert.net/

In [None]:
modelST = SentenceTransformer('paraphrase-mpnet-base-v2')

In [None]:
# Encoding everything at once took a long time and overheated the GPU,
# so the embeddings are computed in batches and saved to a file.
def getBertEmbeddings(data, filename):
    batch = 20000
    with open(filename, 'wb') as f:
        while(len(data)):
            tempdata = data[:batch]
            data = data[batch:]
            tempembed = modelST.encode(tempdata.values, device='cuda')
            np.save(f, tempembed, allow_pickle=True)
#             time.sleep(60) # for gpu heating issue
            

In [None]:
# Get SentenceBERT embedding of train data
getBertEmbeddings(traindata.question1_final, 'temp_train_question1_sentenceBERT.npy')
getBertEmbeddings(traindata.question2_final, 'temp_train_question2_sentenceBERT.npy')

In [None]:
# Get SentenceBERT embedding of test data
getBertEmbeddings(testdata.question1_final, 'temp_test_question1_sentenceBERT.npy')
getBertEmbeddings(testdata.question2_final, 'temp_test_question2_sentenceBERT.npy')

In [None]:
# Get cosine similarity and euclidean distance between two vectors
def cosine_euclidean(u, v):
    return np.array([np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)), np.linalg.norm(u - v)])

In [None]:
# open .npy files and loop through the sentence embeddings
with open('temp_train_question1_sentenceBERT.npy', 'rb') as q1_vec, open('temp_train_question2_sentenceBERT.npy', 'rb') as q2_vec:
    distances = []
    while True:
        try:
            q1_20k = np.load(q1_vec, allow_pickle=True)
            q2_20k = np.load(q2_vec, allow_pickle=True)
            for q1,q2 in zip(q1_20k, q2_20k):
                dists = cosine_euclidean(q1, q2)
                distances.append(dists)
        except (EOFError, IOError):  # end of file; the exception type depends on the NumPy version
            distances = np.array(distances)
            break

In [None]:
distances = pd.DataFrame(distances, columns=['cosine_simlarity_bert', 'euclidean_distance_bert'])

In [None]:
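# traindata keeps the original index (with gaps after dropna), so align distances to it first.
traindata = pd.concat([traindata, distances.set_index(traindata.index)], axis=1)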

In [None]:
# open .npy files and loop through the sentence embeddings
with open('temp_test_question1_sentenceBERT.npy', 'rb') as q1_vec, open('temp_test_question2_sentenceBERT.npy', 'rb') as q2_vec:
    distances = []
    while True:
        try:
            q1_20k = np.load(q1_vec, allow_pickle=True)
            q2_20k = np.load(q2_vec, allow_pickle=True)
            for q1,q2 in zip(q1_20k, q2_20k):
                dists = cosine_euclidean(q1, q2)
                distances.append(dists)
        except (EOFError, IOError):  # end of file; the exception type depends on the NumPy version
            distances = np.array(distances)
            break
distances = pd.DataFrame(distances, columns=['cosine_simlarity_bert', 'euclidean_distance_bert'])
testdata = pd.concat([testdata, pd.DataFrame(distances)], axis=1)

We have created two more features, **cosine_simlarity_bert** and **euclidean_distance_bert**, which measure the similarity and the distance between the two questions of each pair.

The total number of features till now is **25**.
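
To see how the two embedding-based features behave, here is a dependency-free sketch of cosine similarity and Euclidean distance. The 3-dimensional vectors are hypothetical stand-ins for the 768-dimensional sentence embeddings, and `cosine_and_euclidean` is an illustrative name.

```python
import math

def cosine_and_euclidean(u, v):
    """Return (cosine similarity, Euclidean distance) of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return cosine, euclidean

# Hypothetical 3-dimensional "embeddings"; the real ones have 768 dimensions.
same_direction = cosine_and_euclidean([1.0, 2.0, 2.0], [2.0, 4.0, 4.0])
orthogonal = cosine_and_euclidean([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction)
print(orthogonal)
```

Cosine similarity depends only on direction, so vectors pointing the same way score 1 even when they are far apart in Euclidean terms. That is one reason to keep both features.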

### EDA on new features related to SentenceBERT

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('Cosine Similarity based on SentenceBERT b/w Question 1 and Question 2')
sns.kdeplot(traindata['cosine_simlarity_bert'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Cosine Similarity based on SentenceBERT b/w Question 1 and Question 2')
sns.boxplot(x=data.is_duplicate, y=traindata['cosine_simlarity_bert'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
plt.title('ECDF plot of Cosine Similarity based on SentenceBERT b/w Question 1 and Question 2')
sns.axes_style("whitegrid")
sns.ecdfplot(x=traindata['cosine_simlarity_bert'], hue=data.is_duplicate, palette="Dark2")
plt.show()

In [None]:
fig, ax =plt.subplots(1,2,figsize=(15,5))
ax[0].title.set_text('Euclidean Distance based on SentenceBERT b/w Question 1 and 2')
sns.kdeplot(traindata['euclidean_distance_bert'], hue=data.is_duplicate, palette="Dark2", ax=ax[0])
ax[1].title.set_text('Euclidean Distance based on SentenceBERT b/w Question 1 and 2')
sns.boxplot(x=data.is_duplicate, y=traindata['euclidean_distance_bert'], palette="Dark2", ax=ax[1])
plt.show()

In [None]:
plt.title('ECDF plot of Euclidean Distance based on SentenceBERT b/w Question 1 and Question 2')
sns.axes_style("whitegrid")
sns.ecdfplot(x=traindata['euclidean_distance_bert'], hue=data.is_duplicate, palette="Dark2")
plt.show()

These features seem to be the most useful ones. It looks like we can separate most of the two classes using just one of them.
* Cosine similarity is larger for duplicate pairs.
* 80% of non-duplicate question pairs but only 20% of duplicate question pairs have a cosine similarity <= 0.815.
* Euclidean distance is smaller for duplicate pairs.
* 20% of non-duplicate question pairs but approximately 80% of duplicate question pairs have a Euclidean distance <= 2.

The split is reminiscent of the Pareto Principle (80-20 rule), although here it simply reflects where the thresholds were read off the ECDF plots.
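
Percentages like these are read off the ECDF, which at a threshold is just the fraction of values at or below it. The sketch below uses made-up similarity values (not the real data) chosen to give an 80/20 split at 0.815.

```python
def fraction_at_or_below(values, threshold):
    """Empirical CDF evaluated at a single threshold."""
    return sum(value <= threshold for value in values) / len(values)

# Hypothetical cosine similarities, not taken from the real data.
non_duplicate_sims = [0.35, 0.52, 0.61, 0.74, 0.90]
duplicate_sims = [0.70, 0.84, 0.88, 0.93, 0.97]

threshold = 0.815
print(fraction_at_or_below(non_duplicate_sims, threshold))
print(fraction_at_or_below(duplicate_sims, threshold))
```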

# Data Pre-processing

In [None]:
traindata.drop(columns=['question1_final', 'question2_final'], inplace=True)

In [None]:
traindata = traindata.to_numpy()

In [None]:
scaler = MinMaxScaler()

In [None]:
scaler.fit(traindata)

In [None]:
traindata = scaler.transform(traindata)

We have normalized (min-max scaling) the extracted features of train data. We have not normalized the embeddings because it is not recommended.
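
Min-max scaling maps each feature to x' = (x - min) / (max - min), with min and max learned from the training data only. The pure-Python sketch below uses hypothetical word counts and helper names; `MinMaxScaler` does the same per column.

```python
def fit_min_max(column):
    """Learn the minimum and maximum from the training column."""
    return min(column), max(column)

def transform_min_max(column, low, high):
    """Apply x' = (x - min) / (max - min) using the training statistics."""
    return [(x - low) / (high - low) for x in column]

train_word_counts = [4, 10, 16]   # hypothetical training values
test_word_counts = [7, 19]        # hypothetical test values

low, high = fit_min_max(train_word_counts)
scaled_train = transform_min_max(train_word_counts, low, high)
scaled_test = transform_min_max(test_word_counts, low, high)
print(scaled_train)
print(scaled_test)
```

Note that test values outside the training range land outside [0, 1], which is expected when the scaler is fitted on the training data alone.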

In [None]:
testdata.drop(columns=['question1_final', 'question2_final'], inplace=True)
testdata = testdata.to_numpy()
testdata = scaler.transform(testdata)

In [None]:
with open('temp_testdata.npy', 'wb') as f:
    batch = 20000
    while(len(testdata)):
        tempdata = testdata[:batch]
        testdata = testdata[batch:]
        np.save(f, tempdata, allow_pickle=True)

We have scaled the test data as well and saved it in batches of 20k rows, just like we did with the embeddings.
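
The batching trick relies on the fact that several arrays can be written back to back into one open file with `np.save` and then read back with repeated `np.load` calls until the file is exhausted. Below is a small sketch of that round trip; it uses an in-memory `io.BytesIO` buffer instead of a file on disk, and the helper names are illustrative.

```python
import io
import numpy as np

def save_in_batches(array, file_obj, batch_size):
    """Write consecutive slices of the array into the same open file."""
    for start in range(0, len(array), batch_size):
        np.save(file_obj, array[start:start + batch_size])

def load_batches(file_obj):
    """Read arrays until the stream is exhausted, then stack them."""
    batches = []
    while True:
        try:
            batches.append(np.load(file_obj))
        except (EOFError, OSError, ValueError):
            # np.load signals the end of the stream with an exception;
            # the exact type has varied between NumPy versions.
            break
    return np.concatenate(batches)

buffer = io.BytesIO()                    # in-memory stand-in for a file on disk
original = np.arange(10).reshape(5, 2)
save_in_batches(original, buffer, batch_size=2)
buffer.seek(0)                           # rewind before reading
restored = load_batches(buffer)
print(restored.shape)
```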

In [None]:
def loadVectors(filename):
    with open(filename, 'rb') as f:
        q_vectors = []
        while True:
            try:
                q_vec = np.load(f, allow_pickle=True)
                q_vectors.extend(list(q_vec))
            except (EOFError, IOError):  # end of file; the exception type depends on the NumPy version
                q_vectors = np.array(q_vectors)
                break
    return q_vectors

In [None]:
train_question1_vec = loadVectors('temp_train_question1_sentenceBERT.npy')

In [None]:
train_question2_vec = loadVectors('temp_train_question2_sentenceBERT.npy')

In [None]:
traindata = np.hstack((traindata, train_question1_vec, train_question2_vec))

In [None]:
traindata.shape

We have **1561** features (25 + 768 + 768). <br>
* **25** are extracted features.<br>
* **768+768** for sentence embedding of question 1 and question 2.

In [None]:
oversample = RandomOverSampler(sampling_strategy='minority')
X_train, y_train = oversample.fit_resample(traindata, data.is_duplicate.to_numpy())

In [None]:
print(np.count_nonzero(y_train == 0))
print(np.count_nonzero(y_train == 1))

Since the dataset was imbalanced, we **oversampled** the minority class. <br>
Now we have **510048** data points, **255024** from each class.
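
Random oversampling simply re-draws minority rows with replacement until both classes have the same size (here 2 × 255024 = 510048 rows). The sketch below shows the idea in plain Python on hypothetical data. It is a simplified stand-in for `RandomOverSampler`, not its actual implementation.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until both classes match."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    minority = min(by_class, key=lambda c: len(by_class[c]))
    majority = 1 - minority
    shortfall = len(by_class[majority]) - len(by_class[minority])
    # Sample with replacement from the minority class to fill the gap.
    extra = [rng.choice(by_class[minority]) for _ in range(shortfall)]
    return rows + extra, labels + [minority] * shortfall

rows = [[0.1], [0.2], [0.3], [0.4], [0.5]]   # hypothetical feature rows
labels = [0, 0, 0, 1, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Because the added rows are exact copies, a copy can end up on both sides of a later train/validation split, which tends to make validation scores look optimistic.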

Note that I have not set aside any data for local testing, because the main goal is to get a good score on Kaggle.

# Training Models

## Support Vector Classifier

### Training

In [None]:
splits = ShuffleSplit(n_splits=1, test_size=.3, random_state=42)

In [None]:
svc_param_grid = {'C':[1e-2, 1e-1, 1e0, 1e1, 1e2]}

In [None]:
svc_clf = LinearSVC(penalty='l2', loss='squared_hinge', dual=False, max_iter=3000)

In [None]:
svc_clf_search = HalvingGridSearchCV(svc_clf, svc_param_grid, cv=splits, factor=2, scoring='accuracy', verbose=3)

In [None]:
svc_clf_search.fit(X_train, y_train)

In [None]:
svc_clf_search.best_params_

In [None]:
svc_clf_search.best_score_

The halving grid search found C=100 to be the best parameter, with a best validation accuracy of 85.79%.

In [None]:
svc_clf_model = svc_clf_search.best_estimator_

In [None]:
svc_clf_model

Since we need to minimize log loss for the competition, we want good predicted probabilities. LinearSVC only outputs decision values, so we wrap it in a calibrated classifier to turn them into probabilities.
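
With `method="sigmoid"` (Platt scaling), calibration fits a logistic curve that maps a decision value f to a probability 1 / (1 + exp(a·f + b)). The sketch below only illustrates that mapping; the coefficients `a` and `b` are hypothetical, whereas in practice they are fitted on held-out data.

```python
import math

def sigmoid_calibrate(decision_value, a, b):
    """Map an SVM decision value to a probability: 1 / (1 + exp(a * f + b))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

# Hypothetical coefficients; a negative `a` makes larger decision values
# correspond to higher probabilities of the positive class.
a, b = -1.5, 0.0
for f in (-2.0, 0.0, 2.0):
    print(f, round(sigmoid_calibrate(f, a, b), 3))
```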

In [None]:
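# Pass the estimator positionally: the keyword was renamed from base_estimator to estimator in newer scikit-learn.
svc_calibrated = CalibratedClassifierCV(svc_clf_model, method="sigmoid", cv=splits)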

In [None]:
svc_calibrated.fit(X_train, y_train)

### Testing

In [None]:
with open('temp_testdata.npy', 'rb') as X_test_1, \
    open('temp_test_question1_sentenceBERT.npy', 'rb') as X_test_q1, \
    open('temp_test_question2_sentenceBERT.npy', 'rb') as X_test_q2:
    y_pred_proba_svc = []
    while True:
        try:
            test_20k = np.load(X_test_1, allow_pickle=True)
            q1_20k = np.load(X_test_q1, allow_pickle=True)
            q2_20k = np.load(X_test_q2, allow_pickle=True)
            X_test = np.hstack((test_20k, q1_20k, q2_20k))
            y_pred_proba_svc.extend(list(svc_calibrated.predict_proba(X_test)[:,1]))
        except (EOFError, IOError):  # end of file; the exception type depends on the NumPy version
            break

In [None]:
testids = pd.read_csv('test_id.csv', na_filter=False)

In [None]:
submission_svc = pd.DataFrame({'test_id':testids.test_id.values, 'is_duplicate':y_pred_proba_svc})

In [None]:
submission_svc.to_csv('submission_svc.csv', index=False)

The public leaderboard score (log loss) for the Kaggle submission is **0.36980**.<br>
That is very good considering that a linear SVC assumes the classes are linearly separable.

## Random Forest

### Training

In [None]:
splits = ShuffleSplit(n_splits=1, test_size=.3, random_state=42)

In [None]:
rf_param_grid = {
                    'n_estimators':[200, 500, 800], 
                    'min_samples_split':[5, 15],
                    'max_depth': [70, 150, None]
                }

In [None]:
rf_clf = RandomForestClassifier()

In [None]:
rf_clf_search = HalvingGridSearchCV(rf_clf, rf_param_grid, cv=splits, factor=2, scoring='accuracy', verbose=3)

In [None]:
rf_clf_search.fit(X_train, y_train)

In [None]:
rf_clf_search.best_params_

In [None]:
rf_clf_search.best_score_

In [None]:
rf_clf_model = rf_clf_search.best_estimator_

In [None]:
rf_clf_model

### Testing

In [None]:
with open('testdata.npy', 'rb') as X_test_1, \
    open('test_question1_sentenceBERT.npy', 'rb') as X_test_q1, \
    open('test_question2_sentenceBERT.npy', 'rb') as X_test_q2:
    y_pred_proba_rf = []
    while True:
        try:
            test_20k = np.load(X_test_1, allow_pickle=True)
            q1_20k = np.load(X_test_q1, allow_pickle=True)
            q2_20k = np.load(X_test_q2, allow_pickle=True)
            X_test = np.hstack((test_20k, q1_20k, q2_20k))
            y_pred_proba_rf.extend(list(rf_clf_model.predict_proba(X_test)[:,1]))
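        # np.load signals end-of-file with EOFError on recent NumPy versions
        # and with IOError on older ones.
        except (EOFError, IOError):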
            break

In [None]:
testids = pd.read_csv('test_id.csv', na_filter=False)

In [None]:
submission_rf = pd.DataFrame({'test_id':testids.test_id.values, 'is_duplicate':y_pred_proba_rf})

In [None]:
submission_rf.to_csv('submission_rf.csv', index=False)

The public leaderboard score for the Kaggle submission is **0.32372**, which is better than the SVC score.<br>
I was expecting a slightly lower log loss, but remember that this model has not been calibrated (due to time constraints).
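
For reference, calibrating the random forest would follow the same recipe as the SVC. The sketch below runs on synthetic data from `make_classification`, not on the Quora features, and the forest size and fold count are arbitrary choices for illustration. It makes no claim about how much calibration would have improved the score above.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Wrap the forest so that a sigmoid (Platt) calibrator is fitted on held-out folds.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf_calibrated = CalibratedClassifierCV(rf, method="sigmoid", cv=3)
rf_calibrated.fit(X_tr, y_tr)

proba = rf_calibrated.predict_proba(X_te)[:, 1]
calibrated_logloss = log_loss(y_te, proba)
print(f"log loss after calibration: {calibrated_logloss:.4f}")
```

With `cv=3` the forest is refitted three times, so on the full training set this roughly triples the training cost.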

## XGBoost

Due to time and system configuration constraints, I decided to use 200,000 data points to estimate a few of the params.<br>
At first, I used Optuna for hyperparameter tuning, but it was not releasing memory after the trials, so the system crashed after a few trials.<br>
Later on, I decided to use HyperOpt for the tuning.

### Training

In [None]:
random_2l = np.random.choice(range(len(X_train)), size=200000, replace=False)

In [None]:
X_train_2l = X_train[random_2l]
y_train_2l = y_train[random_2l]

In [None]:
def objective(space):
    warnings.filterwarnings(action='ignore', category=UserWarning)
    classifier = xgb.XGBClassifier(
                    objective = "binary:logistic",
                    eval_metric = "logloss",
                    booster = "gbtree",
                    tree_method = "hist",
                    grow_policy = "lossguide",
                    n_estimators = 300, 
                    max_depth = space['max_depth'],
                    learning_rate = space['learning_rate'],
                )
    
    X_train, X_cv, y_train, y_cv = train_test_split(X_train_2l, y_train_2l, test_size=0.25)
    
    classifier.fit(X_train, y_train)
    
    predicted_probs = classifier.predict_proba(X_cv)

    logloss = log_loss(y_cv, predicted_probs)

    print("Log loss = " + str(logloss))

    return{'loss':logloss, 'status': STATUS_OK }


In [None]:
space = {
    'max_depth' : hp.choice('max_depth', range(4, 10, 1)),
    "learning_rate": hp.quniform("learning_rate", 0.01, 0.5, 0.01)
}

In [None]:
trials = Trials()
best_param = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=5,
            trials=trials)

In [None]:
print("Best Param : ", best_param)
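
One thing to keep in mind when reading `best_param`: for a parameter defined with `hp.choice`, `fmin` reports the index of the chosen option, not the option itself. The sketch below shows the mapping in plain Python. The `best_param_example` dictionary is a hypothetical output, not the result of the run above, and `resolve_choice` is a helper made up for this illustration.

```python
# Options in the same order as they were passed to hp.choice('max_depth', ...).
depth_options = list(range(4, 10))

# Hypothetical fmin output: 'max_depth' holds an index into depth_options.
best_param_example = {'max_depth': 2, 'learning_rate': 0.2}

def resolve_choice(best, choices):
    """Replace hp.choice indices with the actual option values (hypothetical helper)."""
    resolved = dict(best)
    for name, options in choices.items():
        resolved[name] = options[int(best[name])]
    return resolved

resolved = resolve_choice(best_param_example, {'max_depth': depth_options})
print(resolved)
```

HyperOpt also ships `space_eval(space, best_param)`, which performs this lookup directly on the search space.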

Train the model on the whole data set with the tuned parameters.

In [None]:
params = dict(
            objective = "binary:logistic",
            eval_metric = "logloss",
            booster = "gbtree",
            tree_method = "hist",
            grow_policy = "lossguide",
            max_depth = 4,
            eta = 0.14
        )

In [None]:
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.2)

In [None]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_cv, label=y_cv)

In [None]:
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

In [None]:
xgb_model = xgb.train(params, dtrain, 600, watchlist, early_stopping_rounds=20, verbose_eval=10)

### Testing

In [None]:
with open('testdata.npy', 'rb') as X_test_1, \
    open('test_question1_sentenceBERT.npy', 'rb') as X_test_q1, \
    open('test_question2_sentenceBERT.npy', 'rb') as X_test_q2:
    y_pred_proba_xgb = []
    while True:
        try:
            test_20k = np.load(X_test_1, allow_pickle=True)
            q1_20k = np.load(X_test_q1, allow_pickle=True)
            q2_20k = np.load(X_test_q2, allow_pickle=True)
            X_test = xgb.DMatrix(np.hstack((test_20k, q1_20k, q2_20k)))
            y_pred_proba_xgb.extend(list(xgb_model.predict(X_test)))
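        # np.load signals end-of-file with EOFError on recent NumPy versions
        # and with IOError on older ones.
        except (EOFError, IOError):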
            break

In [None]:
testids = pd.read_csv('test_id.csv', na_filter=False)

In [None]:
submission_xgb = pd.DataFrame({'test_id':testids.test_id.values, 'is_duplicate':y_pred_proba_xgb})

In [None]:
submission_xgb.to_csv('submission_xgb.csv', index=False)

The public leaderboard score for the Kaggle submission is **0.32105**, which is slightly better than the other models. I was expecting a better result than this, which should be possible with more fine-tuning of the hyperparameters. XGBoost has tons of hyperparameters: https://xgboost.readthedocs.io/en/latest/parameter.html

## Another XGBoost

I was not happy with the result of the XGBoost model so I decided to tune the parameters with gut feeling.

### Training

In [None]:
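# Keep one copy of each feature row, plus the index of its first occurrence
X_train_orig, indices = np.unique(X_train, axis=0, return_index=True)

In [None]:
# Pick the labels that belong to the retained rows
y_train_orig = y_train[indices]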

In [None]:
X_train_orig.shape

We got rid of oversampled data by removing the duplicate rows.
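
As a small illustration of this step, the sketch below applies the same `np.unique` call to a toy matrix with repeated rows (the values are made up). `return_index=True` gives the position of the first occurrence of each unique row, which is what lets the labels be carried along.

```python
import numpy as np

# Toy feature matrix in which rows 0 and 2 are identical, as are rows 1 and 4.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [1.0, 2.0],
              [5.0, 6.0],
              [3.0, 4.0]])
y = np.array([1, 0, 1, 0, 0])

# axis=0 compares whole rows; first_idx points at the first copy of each row.
X_unique, first_idx = np.unique(X, axis=0, return_index=True)
y_unique = y[first_idx]

print(X_unique)
print(first_idx, y_unique)
```

Note that `np.unique` returns the rows in sorted order, not in their original order, which is harmless here because the data is shuffled again by `train_test_split`.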

In [None]:
params = dict(
            objective = "binary:logistic",
            eval_metric = "logloss",
            booster = "gbtree",
            tree_method = "hist",
            grow_policy = "lossguide",
            max_depth = 4,
            eta = 0.15,
            subsample = .8,
            colsample_bytree = .8,
            reg_lambda = 1,
            reg_alpha = 1
        )

I chose a few more params based on instinct.

In [None]:
X_train, X_cv, y_train, y_cv = train_test_split(X_train_orig, y_train_orig, test_size=0.25)

In [None]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_cv, label=y_cv)

In [None]:
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

In [None]:
xgb_model2 = xgb.train(params, dtrain, 500, watchlist, early_stopping_rounds=20, verbose_eval=10)

### Testing

In [None]:
with open('testdata.npy', 'rb') as X_test_1, \
    open('test_question1_sentenceBERT.npy', 'rb') as X_test_q1, \
    open('test_question2_sentenceBERT.npy', 'rb') as X_test_q2:
    y_pred_proba_xgb = []
    while True:
        try:
            test_20k = np.load(X_test_1, allow_pickle=True)
            q1_20k = np.load(X_test_q1, allow_pickle=True)
            q2_20k = np.load(X_test_q2, allow_pickle=True)
            X_test = xgb.DMatrix(np.hstack((test_20k, q1_20k, q2_20k)))
            y_pred_proba_xgb.extend(list(xgb_model2.predict(X_test)))
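        # np.load signals end-of-file with EOFError on recent NumPy versions
        # and with IOError on older ones.
        except (EOFError, IOError):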
            break
testids = pd.read_csv('test_id.csv', na_filter=False)
submission_xgb2 = pd.DataFrame({'test_id':testids.test_id.values, 'is_duplicate':y_pred_proba_xgb})
submission_xgb2.to_csv('submission_xgb2.csv', index=False)

🥁 Voila! We have a winner. **This submission resulted in a public LB score of 0.28170**<br>
This is the best result among the four models.

# Future Work

We can try deep-learning-based models to get even better results.