<h1 style="text-align:center;font-size:30px;" > Quora Question Pairs </h1>

<h1> 1. Business Problem </h1>

<h2> 1.1 Description </h2>

<p>Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.</p>
<p>
Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.
</p>
<br>
> Credits: Kaggle 


__Problem Statement__

- Identify which questions asked on Quora are duplicates of questions that have already been asked. 
- This could be useful to instantly provide answers to questions that have already been answered. 
- We are tasked with predicting whether a pair of questions are duplicates or not. 

<h2> 1.2 Sources/Useful Links</h2>

- Source : https://www.kaggle.com/c/quora-question-pairs
<br><br>__Useful Links__
- Discussions : https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb/comments
- Kaggle Winning Solution and other approaches: https://www.dropbox.com/sh/93968nfnrzh8bp5/AACZdtsApc1QSTQc7X0H3QZ5a?dl=0
- Blog 1 : https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning
- Blog 2 : https://towardsdatascience.com/identifying-duplicate-questions-on-quora-top-12-on-kaggle-4c1cf93f1c30

<h2>1.3 Real world/Business Objectives and Constraints </h2>

1. The cost of a mis-classification can be very high.
2. We want the model to output the probability that a pair of questions is a duplicate, so that any decision threshold can be chosen afterwards.
3. No strict latency concerns.
4. Interpretability is partially important.
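
To illustrate objective 2, here is a minimal sketch of how a decision threshold trades precision against recall once a model outputs probabilities. The probabilities and labels are made-up values, and `precision_recall_at` is a hypothetical helper that is not part of this notebook.

```python
def precision_recall_at(threshold, probs, labels):
    """Return (precision, recall) when pairs with prob >= threshold are flagged as duplicates."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative predictions only (not produced by any model in this notebook)
probs = [0.95, 0.80, 0.60, 0.40, 0.20, 0.05]
labels = [1, 1, 0, 1, 0, 0]

for t in (0.3, 0.5, 0.9):
    p, r = precision_recall_at(t, probs, labels)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

A high threshold flags fewer pairs but with higher precision, which matters when a wrong merge of two questions is costly.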

<h1>2. Machine Learning Problem </h1>

<h2> 2.1 Data </h2>

<h3> 2.1.1 Data Overview </h3>

<p> 
- Data will be in a file Train.csv <br>
- Train.csv contains 6 columns : id, qid1, qid2, question1, question2, is_duplicate <br>
- Size of Train.csv - 60MB <br>
- Number of rows in Train.csv = 404,290
</p>

<h3> 2.1.2 Example Data point </h3>

<pre>
"id","qid1","qid2","question1","question2","is_duplicate"
"0","1","2","What is the step by step guide to invest in share market in india?","What is the step by step guide to invest in share market?","0"
"1","3","4","What is the story of Kohinoor (Koh-i-Noor) Diamond?","What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?","0"
"7","15","16","How can I be a good geologist?","What should I do to be a great geologist?","1"
"11","23","24","How do I read and find my YouTube comments?","How can I see all my Youtube comments?","1"
</pre>
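
As a small illustration of the file layout, the sketch below parses one of the sample rows with Python's standard `csv` module. The inline `sample` string is only a stand-in for the real file.

```python
import csv
import io

# One header line and one sample row copied from the example above
sample = '''"id","qid1","qid2","question1","question2","is_duplicate"
"7","15","16","How can I be a good geologist?","What should I do to be a great geologist?","1"
'''

rows = list(csv.DictReader(io.StringIO(sample)))
first = rows[0]
label = int(first["is_duplicate"])  # all fields are read as strings, so cast the label

print(first["question1"], "|", first["question2"], "->", label)
```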

## 2.2 Mapping the real world problem to an ML problem

### 2.2.1 Type of Machine Learning Problem

This is a binary classification problem: for a given pair of questions, we need to predict whether they are duplicates or not.

### 2.2.2 Performance Metric

Source: https://www.kaggle.com/c/quora-question-pairs#evaluation

Metric(s):

* log-loss: https://www.kaggle.com/wiki/LogarithmicLoss
* Binary Confusion Matrix
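
Both metrics can be sketched in a few lines of plain Python. This is an illustrative implementation with made-up labels and probabilities; the helper names `binary_log_loss` and `confusion_matrix` are assumptions, and in practice a library such as scikit-learn would normally be used.

```python
import math

def binary_log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log() never receives 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def confusion_matrix(labels, probs, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] for the given decision threshold."""
    matrix = [[0, 0], [0, 0]]
    for y, p in zip(labels, probs):
        matrix[y][1 if p >= threshold else 0] += 1
    return matrix

# Illustrative values only
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.6, 0.4]

print(binary_log_loss(labels, probs))
print(confusion_matrix(labels, probs))
```

Log-loss penalizes confident wrong predictions heavily, which is why it suits a setting where a probability rather than a hard label is required.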

## 2.3 Train and Test Construction

We build the train and test sets by randomly splitting the data in a 70:30 or 80:20 ratio; either works because we have enough data points.

Had timestamps been provided with the data points, we could have split the dataset by time instead, with the earliest 70-80% of the points forming the training set and the remaining later points forming the test set.
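
The two splitting strategies can be sketched as follows. Both helpers are hypothetical and work on row indices only; the timestamps are invented for illustration, since the dataset provides none.

```python
import random

def random_split(n_rows, test_fraction=0.3, seed=42):
    """Shuffle row indices and cut them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]

def temporal_split(timestamps, test_fraction=0.3):
    """Train on the earliest rows and test on the latest ones."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_test = int(round(len(timestamps) * test_fraction))
    cut = len(order) - n_test
    return order[:cut], order[cut:]

train_idx, test_idx = random_split(10)
print(len(train_idx), len(test_idx))

# Hypothetical timestamps; the actual dataset does not provide any
early, late = temporal_split([5, 1, 4, 2, 3, 9, 8, 7, 6, 0])
print(early, late)
```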

# 3. Exploratory Data Analysis

In [1]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from subprocess import check_output
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import os
import gc

import re
from nltk.corpus import stopwords
import distance
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup
from sklearn.manifold import TSNE
from fuzzywuzzy import fuzz
from wordcloud import WordCloud, STOPWORDS
from os import path
from PIL import Image

from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import sys
from tqdm import tqdm
import spacy

In [None]:
dataset = pd.read_csv('train.csv')
print("Number of data points: ", dataset.shape[0])

In [None]:
dataset.head()

In [None]:
dataset.info()

We are given a minimal number of data fields here, consisting of:

* id: Looks like a simple rowID
* qid{1,2}: The unique ID of each question in the pair
* question{1,2}: The actual textual contents of the questions
* is_duplicate: The label that we are trying to predict - whether the two questions are duplicates of each other

### 3.2.1 Distribution of data points among output classes
- Number of duplicate (similar) and non-duplicate (non-similar) questions

In [None]:
dataset.groupby('is_duplicate')['id'].count().plot.bar()

In [None]:
print(f"-> Total number of question pairs for training:\n {len(dataset)}")

In [None]:
not_similar = 100 - round(dataset['is_duplicate'].mean()*100, 2)
print(f'Question Pairs that are not similar (is_duplicate=0)\n{not_similar}%')

In [None]:
similar = round(dataset['is_duplicate'].mean()*100, 2)
print(f'Question Pairs that are similar (is_duplicate=1)\n{similar}%')

 ### 3.2.2 Number of unique questions

In [None]:
qids = pd.Series(dataset['qid1'].tolist() + dataset['qid2'].tolist())
unique_qs = len(np.unique(qids))
qs_morethan_one_time = np.sum(qids.value_counts() > 1)

print(f'Total number of unique questions are: {unique_qs}\n')
print(f"Number of unique questions that appear more than one time: {qs_morethan_one_time}, {round((qs_morethan_one_time/unique_qs * 100),2)}%")
print(f"Max number of times a single question is repeated: {max(qids.value_counts())}")
q_vals = qids.value_counts()
q_vals = q_vals.values

In [None]:
x = ["Unique Questions", "Repeated Questions"]
y = [unique_qs, qs_morethan_one_time]

plt.figure(figsize=(10, 6))
plt.title("Plot representing unique and repeated questions")
sns.barplot(x=x, y=y)
plt.show()

### 3.2.3 Checking for Duplicates

In [None]:
# Group by the (qid1, qid2) pair; any pair occurring more than once collapses into one group
pair_duplicates = dataset[['qid1', 'qid2', 'is_duplicate']].groupby(['qid1', 'qid2']).count().reset_index()
print("Number of duplicate question pairs: ", dataset.shape[0] - pair_duplicates.shape[0])

### 3.2.4 Number of occurrences of each question

In [None]:
plt.figure(figsize=(20,10))
plt.hist(qids.value_counts(), bins=160)
plt.yscale('log', nonpositive='clip')
plt.title('Log-Histogram of question appearance counts')
plt.xlabel('Number of occurrences of question')
plt.ylabel('Number of questions')
print (f'Maximum number of times a single question is repeated: {max(qids.value_counts())}\n') 

Since the y-axis is logarithmic, its tick values increase in powers of 10: 1, 10, 100, 1000, 10000, 100000.

We can see that one question occurs 157 times, while another occurs 120 times.

More than 100k questions sit at the low end of the occurrence counts, and the number of questions decreases steadily as the occurrence count grows.
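
The counting behind this histogram can be sketched with the standard library. The `pairs` list holds invented question ids standing in for the real `qid1` and `qid2` columns.

```python
from collections import Counter

# Hypothetical (qid1, qid2) pairs standing in for the real columns
pairs = [(1, 2), (1, 3), (4, 2), (1, 5), (6, 7)]
qids = [q for pair in pairs for q in pair]

occurrences = Counter(qids)                # question id -> times it appears
histogram = Counter(occurrences.values())  # times it appears -> number of questions

print(occurrences.most_common(1))
print(sorted(histogram.items()))
```

The second counter is what the log-histogram displays: how many questions share each occurrence count.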

### 3.2.5 Checking for NULL values

In [None]:
# Checking whether there are any rows with null values
nan_rows = dataset[dataset.isnull().any(axis=1)]
print(nan_rows)

There are two null values in question2 and one in question1

In [None]:
# Filling the null values with an empty string
dataset = dataset.fillna('')
nan_rows = dataset[dataset.isnull().any(axis=1)]
print(nan_rows)

## 3.3 Basic Feature Extraction (Before Cleaning)

Let us now construct a few features like:

* **freq_qid1** = Frequency of qid1's
* **freq_qid2** = Frequency of qid2's
* **q1len** = Length of q1
* **q2len** = Length of q2
* **q1_n_words** = Number of words on Question 1
* **q2_n_words** = Number of words on Question 2
* **word_Common** = (Number of common unique words in Question 1 and Question 2)
* **word_Total** = (Total number of unique words in Question 1 + Total number of unique words in Question 2)
* **word_share** = (word_Common)/(word_Total)
* **freq_q1 + freq_q2** = sum total of frequency of qid1 and qid2
* **freq_q1 - freq_q2** = absolute difference of frequency of qid1 and qid2
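
As a worked example of the word-overlap features, the sketch below applies the same lower-casing and space-splitting to one of the sample pairs. `basic_word_features` is a hypothetical helper written for illustration only.

```python
def basic_word_features(q1, q2):
    """Return (word_Common, word_Total, word_share) using lower-cased, space-split words."""
    w1 = {w.lower().strip() for w in q1.split(" ")}
    w2 = {w.lower().strip() for w in q2.split(" ")}
    common = len(w1 & w2)
    total = len(w1) + len(w2)
    return common, total, common / total

q1 = "How can I be a good geologist?"
q2 = "What should I do to be a great geologist?"
print(basic_word_features(q1, q2))  # (4, 16, 0.25)
```

The shared words are "i", "be", "a" and "geologist?" (punctuation stays attached because no cleaning has been done yet), giving 4 common words out of 7 + 9 unique words.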


In [None]:
if os.path.isfile('df_fe_without_preprocessing_train.csv'):
    dataset = pd.read_csv("df_fe_without_preprocessing_train.csv",encoding='latin-1')
else:
    dataset['freq_qid1'] = dataset.groupby('qid1')['qid1'].transform('count') 
    dataset['freq_qid2'] = dataset.groupby('qid2')['qid2'].transform('count')
    dataset['q1len'] = dataset['question1'].str.len() 
    dataset['q2len'] = dataset['question2'].str.len()
    dataset['q1_n_words'] = dataset['question1'].apply(lambda row: len(row.split(" ")))
    dataset['q2_n_words'] = dataset['question2'].apply(lambda row: len(row.split(" ")))

    def normalized_word_Common(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * len(w1 & w2)
    dataset['word_Common'] = dataset.apply(normalized_word_Common, axis=1)

    def normalized_word_Total(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * (len(w1) + len(w2))
    dataset['word_Total'] = dataset.apply(normalized_word_Total, axis=1)

    def normalized_word_share(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * len(w1 & w2)/(len(w1) + len(w2))
    dataset['word_share'] = dataset.apply(normalized_word_share, axis=1)

    dataset['freq_q1+q2'] = dataset['freq_qid1']+dataset['freq_qid2']
    dataset['freq_q1-q2'] = abs(dataset['freq_qid1']-dataset['freq_qid2'])

    dataset.to_csv("df_fe_without_preprocessing_train.csv", index=False)

dataset.head()

### 3.3.1 Analysis of some of the extracted features

In [None]:
print("Minimum length of the questions in question1: ", min(dataset['q1_n_words']))
print("Minimum length of the questions in question2: ", min(dataset['q2_n_words']))

print("Number of Questions with minimum length [question1]: ", dataset[dataset['q1_n_words']==1].shape[0])
print("Number of Questions with minimum length [question2]: ", dataset[dataset['q2_n_words']==1].shape[0])

### 3.3.1.1 Feature: word_share

In [None]:
sns.displot(data = {'1': dataset[dataset['is_duplicate'] == 1.0]['word_share'][0:], 
                    '0': dataset[dataset['is_duplicate'] == 0.0]['word_share'][0:]},
            kind = 'kde')

In [None]:
plt.figure(figsize=(12, 8))

plt.subplot(1,2,1)
sns.violinplot(x = 'is_duplicate', y = 'word_share', data = dataset[0:])

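# Second panel: per-class density curves drawn on the current axes
# (sns.displot is figure-level and would open a separate figure)
plt.subplot(1,2,2)
sns.kdeplot(dataset[dataset['is_duplicate'] == 1.0]['word_share'], label = "1", color = 'red')
sns.kdeplot(dataset[dataset['is_duplicate'] == 0.0]['word_share'], label = "0", color = 'blue')
plt.legend()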
#sns.distplot(dataset[dataset['is_duplicate'] == 1.0]['word_share'][0:], label = "1", color = 'red')
#sns.distplot(dataset[dataset['is_duplicate'] == 0.0]['word_share'][0:], label = "0", color = 'blue')
plt.show()

- The distributions of the normalized word_share for the two classes overlap on the far right-hand side, i.e., quite a lot of question pairs have high word similarity
- The average word_share and the number of common words of qid1 and qid2 are higher when the questions are duplicates (similar)
- Since there is a lot of overlap in the middle, we cannot simply draw a boundary that separates duplicates (is_duplicate=1) from non-duplicates (is_duplicate=0)
- The feature is still somewhat useful, although the violin plot shows that the two distributions overlap around their 75th percentiles

### 3.3.1.2 Feature: word_Common

In [None]:
plt.figure(figsize=(12, 8))

plt.subplot(1,2,1)
sns.violinplot(x = 'is_duplicate', y = 'word_Common', data = dataset[0:])

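# Second panel: per-class density curves drawn on the current axes
# (sns.displot is figure-level and would open a separate figure)
plt.subplot(1,2,2)
sns.kdeplot(dataset[dataset['is_duplicate'] == 1.0]['word_Common'], label = "1", color = 'red')
sns.kdeplot(dataset[dataset['is_duplicate'] == 0.0]['word_Common'], label = "0", color = 'blue')
plt.legend()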
#sns.distplot(dataset[dataset['is_duplicate'] == 1.0]['word_Common'][0:] , label = "1", color = 'red')
#sns.distplot(dataset[dataset['is_duplicate'] == 0.0]['word_Common'][0:] , label = "0", color = 'blue')
plt.show()

- The distributions of the word_Common feature in similar and non-similar questions are highly overlapping

## Advanced Feature Extraction

## 3.4 Preprocessing of Text

- Preprocessing
    - Removing HTML tags
    - Removing punctuations
    - Performing Stemming
    - Removing Stopwords
    - Expanding contractions etc. 
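
A simplified sketch of these cleaning steps, using only the standard library, is shown here. It covers contraction expansion, number shortening, and tag and punctuation removal; stemming and stop-word removal are left out, and both the short `CONTRACTIONS` list and the `simple_clean` helper are assumptions made for illustration.

```python
import re

# A few contractions only; order matters ("can't" must be handled before "n't")
CONTRACTIONS = [("won't", "will not"), ("can't", "can not"), ("n't", " not"),
                ("what's", "what is"), ("'re", " are"), ("'ll", " will")]

def simple_clean(text):
    """Lower-case, expand a few contractions, shorten round numbers, drop tags and punctuation."""
    text = text.lower()
    for short, full in CONTRACTIONS:
        text = text.replace(short, full)
    text = text.replace(",000,000", "m").replace(",000", "k")
    text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags
    text = re.sub(r"\W", " ", text)       # replace punctuation with spaces
    return " ".join(text.split())         # collapse repeated whitespace

print(simple_clean("What's the <b>best</b> way to earn 50,000 if I can't invest?"))
# what is the best way to earn 50k if i can not invest
```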

In [None]:
# Small constant added to denominators to avoid division by zero
SAFE_DIV = 0.0001

STOP_WORDS = stopwords.words('english')

In [None]:
def preprocess(x):
    x = str(x).lower()
    # Normalize numbers, apostrophes, contractions and currency symbols
    x = x.replace(",000,000", "m").replace(",000", "k").replace("′", "'").replace("’", "'")\
                           .replace("won't", "will not").replace("cannot", "can not").replace("can't", "can not")\
                           .replace("n't", " not").replace("what's", "what is").replace("it's", "it is")\
                           .replace("'ve", " have").replace("i'm", "i am").replace("'re", " are")\
                           .replace("he's", "he is").replace("she's", "she is").replace("'s", " own")\
                           .replace("%", " percent ").replace("₹", " rupee ").replace("$", " dollar ")\
                           .replace("€", " euro ").replace("'ll", " will")
    x = re.sub(r"([0-9]+)000000", r"\1m", x)
    x = re.sub(r"([0-9]+)000", r"\1k", x)

    # Replace every non-word character (punctuation, symbols) with a space
    x = re.sub(r"\W", " ", x)

    # Note: PorterStemmer.stem() treats its argument as a single word, so applied
    # to a whole question it can, in practice, only alter the end of the string
    porter = PorterStemmer()
    x = porter.stem(x)

    # Strip any remaining markup; the parser is named explicitly to avoid a warning
    x = BeautifulSoup(x, "html.parser").get_text()

    return x


- The feature-extraction functions take two parameters: Question 1 and Question 2

## 3.5 Advanced Feature Extraction (NLP and Fuzzy Features)

Definition:

- **Token**: You get a token by splitting a sentence using a space as the separator
- **Stop-Word**: stop words as per NLTK
- **Word**: A token that is not a stop-word

Features:

- __cwc_min__ :  Ratio of common_word_count to the smaller word count of Q1 and Q2<br>cwc_min = common_word_count / min(len(q1_words), len(q2_words))
<br>
<br>
- __cwc_max__ :  Ratio of common_word_count to the larger word count of Q1 and Q2<br>cwc_max = common_word_count / max(len(q1_words), len(q2_words))
<br>
<br>
- __csc_min__ :  Ratio of common_stop_count to the smaller stop-word count of Q1 and Q2<br>csc_min = common_stop_count / min(len(q1_stops), len(q2_stops))
<br>
<br>
- __csc_max__ :  Ratio of common_stop_count to the larger stop-word count of Q1 and Q2<br>csc_max = common_stop_count / max(len(q1_stops), len(q2_stops))
<br>
<br>
- __ctc_min__ :  Ratio of common_token_count to the smaller token count of Q1 and Q2<br>ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))
<br>
<br>
- __ctc_max__ :  Ratio of common_token_count to the larger token count of Q1 and Q2<br>ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))
<br>
<br>
- __last_word_eq__ :  Check if the last word of both questions is equal or not<br>last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])
<br>
<br>
- __first_word_eq__ :  Check if the first word of both questions is equal or not<br>first_word_eq = int(q1_tokens[0] == q2_tokens[0])
<br>
<br>
- __abs_len_diff__ :  Absolute difference in the number of tokens<br>abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))
<br>
<br>
- __mean_len__ :  Average number of tokens of both questions<br>mean_len = (len(q1_tokens) + len(q2_tokens))/2
<br>
<br>
- __fuzz_ratio__ :  Edit-distance-based similarity score (0 to 100) between the two strings<br>https://github.com/seatgeek/fuzzywuzzy#usage<br>
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
<br>
<br>
- __fuzz_partial_ratio__ :  Score of the best-matching substring of the longer string against the shorter string<br>https://github.com/seatgeek/fuzzywuzzy#usage<br>
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
<br>
<br>
- __token_sort_ratio__ :  Score after splitting both strings into tokens, sorting the tokens alphabetically and joining them back<br>https://github.com/seatgeek/fuzzywuzzy#usage<br>
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
<br>
<br>
- __token_set_ratio__ :  Token-based score that compares the shared tokens against each string's remaining tokens, so extra words hurt the score less<br>https://github.com/seatgeek/fuzzywuzzy#usage<br>
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
<br>
<br>
- __longest_substr_ratio__ :  Ratio of the length of the longest common substring to the smaller length of Q1 and Q2, both measured in characters<br>longest_substr_ratio = len(longest common substring) / (min(len(q1), len(q2)) + 1)
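
To make the ratio features concrete, here is a minimal sketch that computes a few of them for a toy pair. `TOY_STOP_WORDS` is a tiny stand-in for the NLTK stop-word list, `token_overlap` is a hypothetical helper, and `eps` plays the role of a small constant that keeps the denominators non-zero. The sketch assumes both questions are non-empty.

```python
# A tiny stand-in stop-word list; the notebook uses NLTK's English list instead
TOY_STOP_WORDS = {"what", "is", "the", "of", "a", "in"}

def token_overlap(q1, q2, eps=0.0001):
    """Compute a subset of the token features for two non-empty questions."""
    q1_tokens, q2_tokens = q1.split(), q2.split()
    q1_words = {t for t in q1_tokens if t not in TOY_STOP_WORDS}
    q2_words = {t for t in q2_tokens if t not in TOY_STOP_WORDS}
    q1_stops = {t for t in q1_tokens if t in TOY_STOP_WORDS}
    q2_stops = {t for t in q2_tokens if t in TOY_STOP_WORDS}
    common_words = len(q1_words & q2_words)
    common_stops = len(q1_stops & q2_stops)
    return {
        "cwc_min": common_words / (min(len(q1_words), len(q2_words)) + eps),
        "cwc_max": common_words / (max(len(q1_words), len(q2_words)) + eps),
        "csc_min": common_stops / (min(len(q1_stops), len(q2_stops)) + eps),
        "first_word_eq": int(q1_tokens[0] == q2_tokens[0]),
        "last_word_eq": int(q1_tokens[-1] == q2_tokens[-1]),
        "abs_len_diff": abs(len(q1_tokens) - len(q2_tokens)),
    }

features = token_overlap("what is the capital of india",
                         "what is the current capital of pakistan")
for name, value in features.items():
    print(name, round(value, 4))
```

Here the only shared non-stop word is "capital", so cwc_min is close to 1/2 and cwc_max close to 1/3, while all four stop words are shared and csc_min is close to 1.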
​


In [None]:
def get_token_features(q1, q2):
    token_features = [0.0]*10
    
    # Converting the Sentence into Tokens: 
    q1_tokens = q1.split()
    q2_tokens = q2.split()

    if len(q1_tokens) == 0 or len(q2_tokens) == 0:
        return token_features
    # Get the non-stopwords in Questions
    q1_words = set([word for word in q1_tokens if word not in STOP_WORDS])
    q2_words = set([word for word in q2_tokens if word not in STOP_WORDS])
    
    #Get the stopwords in Questions
    q1_stops = set([word for word in q1_tokens if word in STOP_WORDS])
    q2_stops = set([word for word in q2_tokens if word in STOP_WORDS])
    
    # Get the common non-stopwords from Question pair
    common_word_count = len(q1_words.intersection(q2_words))
    
    # Get the common stopwords from Question pair
    common_stop_count = len(q1_stops.intersection(q2_stops))
    
    # Get the common Tokens from Question pair
    common_token_count = len(set(q1_tokens).intersection(set(q2_tokens)))
    
    
    token_features[0] = common_word_count / (min(len(q1_words), len(q2_words)) + SAFE_DIV)
    token_features[1] = common_word_count / (max(len(q1_words), len(q2_words)) + SAFE_DIV)
    token_features[2] = common_stop_count / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV)
    token_features[3] = common_stop_count / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV)
    token_features[4] = common_token_count / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
    token_features[5] = common_token_count / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
    
    # Last word of both question is same or not
    token_features[6] = int(q1_tokens[-1] == q2_tokens[-1])
    
    # First word of both question is same or not
    token_features[7] = int(q1_tokens[0] == q2_tokens[0])
    
    token_features[8] = abs(len(q1_tokens) - len(q2_tokens))
    
    #Average Token Length of both Questions
    token_features[9] = (len(q1_tokens) + len(q2_tokens))/2
    return token_features

In [None]:
# get the Longest Common sub string

def get_longest_substr_ratio(a, b):
    strs = list(distance.lcsubstrings(a, b))
    if len(strs) == 0:
        return 0
    else:
        return len(strs[0]) / (min(len(a), len(b)) + 1)

In [None]:
def extract_features(df):
    # preprocessing each question
    df["question1"] = df["question1"].fillna("").apply(preprocess)
    df["question2"] = df["question2"].fillna("").apply(preprocess)

    print("token features...")
    
    # Merging Features with dataset
    
    token_features = df.apply(lambda x: get_token_features(x["question1"], x["question2"]), axis=1)
    
    df["cwc_min"]       = list(map(lambda x: x[0], token_features))
    df["cwc_max"]       = list(map(lambda x: x[1], token_features))
    df["csc_min"]       = list(map(lambda x: x[2], token_features))
    df["csc_max"]       = list(map(lambda x: x[3], token_features))
    df["ctc_min"]       = list(map(lambda x: x[4], token_features))
    df["ctc_max"]       = list(map(lambda x: x[5], token_features))
    df["last_word_eq"]  = list(map(lambda x: x[6], token_features))
    df["first_word_eq"] = list(map(lambda x: x[7], token_features))
    df["abs_len_diff"]  = list(map(lambda x: x[8], token_features))
    df["mean_len"]      = list(map(lambda x: x[9], token_features))
   
    #Computing Fuzzy Features and Merging with Dataset
    
    # do read this blog: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
    # https://stackoverflow.com/questions/31806695/when-to-use-which-fuzz-function-to-compare-2-strings
    # https://github.com/seatgeek/fuzzywuzzy
    print("fuzzy features..")

    df["token_set_ratio"]       = df.apply(lambda x: fuzz.token_set_ratio(x["question1"], x["question2"]), axis=1)
    # The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and 
    # then joining them back into a string We then compare the transformed strings with a simple ratio().
    df["token_sort_ratio"]      = df.apply(lambda x: fuzz.token_sort_ratio(x["question1"], x["question2"]), axis=1)
    df["fuzz_ratio"]            = df.apply(lambda x: fuzz.QRatio(x["question1"], x["question2"]), axis=1)
    df["fuzz_partial_ratio"]    = df.apply(lambda x: fuzz.partial_ratio(x["question1"], x["question2"]), axis=1)
    df["longest_substr_ratio"]  = df.apply(lambda x: get_longest_substr_ratio(x["question1"], x["question2"]), axis=1)
    return df

In [None]:
if os.path.isfile('nlp_features_train.csv'):
    dataset = pd.read_csv("nlp_features_train.csv",encoding='latin-1')
    # fillna returns a new frame, so the result has to be assigned back
    dataset = dataset.fillna('')
else:
    print("Extracting features for train:")
    dataset = pd.read_csv("train.csv")
    dataset = extract_features(dataset)
    dataset.to_csv("nlp_features_train.csv", index=False)
dataset.head(2)

### 3.5.1 Analysis of Extracted Features

#### 3.5.1.1 Plotting Word Clouds

- Creating Word Cloud of Duplicates and Non-Duplicates Question Pairs
- We can observe the most frequently occurring words

In [None]:
df_duplicate = dataset[dataset['is_duplicate'] == 1]
dfp_nonduplicate = dataset[dataset['is_duplicate'] == 0]

# Converting 2d array of q1 and q2 and flatten the array: like {{1,2},{3,4}} to {1,2,3,4}
p = np.dstack([df_duplicate["question1"], df_duplicate["question2"]]).flatten()
n = np.dstack([dfp_nonduplicate["question1"], dfp_nonduplicate["question2"]]).flatten()

print ("Number of data points in class 1 (duplicate pairs) :",len(p))
print ("Number of data points in class 0 (non duplicate pairs) :",len(n))

In [None]:
#Saving the np array into a text file
np.savetxt('train_p.txt', p, delimiter=' ', fmt='%s')
np.savetxt('train_n.txt', n, delimiter=' ', fmt='%s')

In [None]:
# reading the text files and removing the Stop Words:
d = path.dirname('.')

textp_w = open(path.join(d, 'train_p.txt')).read()
textn_w = open(path.join(d, 'train_n.txt')).read()
stopwords = set(STOPWORDS)
stopwords.add("said")
stopwords.add("br")
stopwords.add(" ")
stopwords.remove("not")

stopwords.remove("no")
#stopwords.remove("good")
#stopwords.remove("love")
#stopwords.remove("like")
#stopwords.remove("best")
#stopwords.remove("!")
# len() of the raw text counts characters, not words
print ("Total number of characters in duplicate pair questions :",len(textp_w))
print ("Total number of characters in non duplicate pair questions :",len(textn_w))

__Word Clouds generated from the text of duplicate question pairs__

In [None]:
wc = WordCloud(background_color="white", max_words=len(textp_w), stopwords=stopwords)
wc.generate(textp_w)
print ("Word Cloud for Duplicate Question pairs")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
wc = WordCloud(background_color="white", max_words=len(textn_w),stopwords=stopwords)
# generate word cloud
wc.generate(textn_w)
print ("Word Cloud for non-Duplicate Question pairs:")
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

<h4> 3.5.1.2 Pair plot of features ['ctc_min', 'cwc_min', 'csc_min', 'token_sort_ratio'] </h4>

In [None]:
n = dataset.shape[0]
sns.pairplot(dataset[['ctc_min', 'cwc_min', 'csc_min', 'token_sort_ratio', 'is_duplicate']][0:n], hue='is_duplicate', vars=['ctc_min', 'cwc_min', 'csc_min', 'token_sort_ratio'])
plt.show()

In [None]:
# Distribution of the token_sort_ratio
plt.figure(figsize=(10, 8))

plt.subplot(1,2,1)
sns.violinplot(x = 'is_duplicate', y = 'token_sort_ratio', data = dataset[0:] , )

plt.subplot(1,2,2)
# sns.distplot is deprecated; kdeplot draws the per-class density on the current axes
sns.kdeplot(dataset[dataset['is_duplicate'] == 1.0]['token_sort_ratio'], label = "1", color = 'red')
sns.kdeplot(dataset[dataset['is_duplicate'] == 0.0]['token_sort_ratio'], label = "0", color = 'blue')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(10, 8))

plt.subplot(1,2,1)
sns.violinplot(x = 'is_duplicate', y = 'fuzz_ratio', data = dataset[0:] , )

plt.subplot(1,2,2)
# sns.distplot is deprecated; kdeplot draws the per-class density on the current axes
sns.kdeplot(dataset[dataset['is_duplicate'] == 1.0]['fuzz_ratio'], label = "1", color = 'red')
sns.kdeplot(dataset[dataset['is_duplicate'] == 0.0]['fuzz_ratio'], label = "0", color = 'blue')
plt.legend()
plt.show()

### 3.5.2 Visualisation

In [None]:
# Using t-SNE for dimensionality reduction of the 15 features (generated after cleaning the data) to 2 and 3 dimensions

from sklearn.preprocessing import MinMaxScaler

dfp_subsampled = dataset[0:5000]
X = MinMaxScaler().fit_transform(dfp_subsampled[['cwc_min', 'cwc_max', 'csc_min', 'csc_max' , 'ctc_min' , 'ctc_max' , 'last_word_eq', 'first_word_eq' , 'abs_len_diff' , 'mean_len' , 'token_set_ratio' , 'token_sort_ratio' ,  'fuzz_ratio' , 'fuzz_partial_ratio' , 'longest_substr_ratio']])
y = dfp_subsampled['is_duplicate'].values

In [None]:
tsne2d = TSNE(
    n_components=2,
    init='random', # pca
    random_state=101,
    method='barnes_hut',
    n_iter=1000,
    verbose=2,
    angle=0.5
).fit_transform(X)

In [None]:
df = pd.DataFrame({'x':tsne2d[:,0], 'y':tsne2d[:,1] ,'label':y})

# draw the plot in appropriate place in the grid
# `height` replaces the older `size` argument in current seaborn releases
sns.lmplot(data=df, x='x', y='y', hue='label', fit_reg=False, height=8,palette="Set1",markers=['s','o'])
plt.title("perplexity : {} and max_iter : {}".format(30, 1000))
plt.show()

In [None]:
from sklearn.manifold import TSNE
tsne3d = TSNE(
    n_components=3,
    init='random', # pca
    random_state=101,
    method='barnes_hut',
    n_iter=1000,
    verbose=2,
    angle=0.5
).fit_transform(X)

In [None]:
trace1 = go.Scatter3d(
    x=tsne3d[:,0],
    y=tsne3d[:,1],
    z=tsne3d[:,2],
    mode='markers',
    marker=dict(
        sizemode='diameter',
        color = y,
        colorscale = 'Portland',
        colorbar = dict(title = 'duplicate'),
        line=dict(color='rgb(255, 255, 255)'),
        opacity=0.75
    )
)

data=[trace1]
layout=dict(height=800, width=800, title='3d embedding with engineered features')
fig=dict(data=data, layout=layout)
py.iplot(fig, filename='3DBubble')

## 3.6 Featurizing text data with tfidf weighted word-vectors

In [None]:
# avoid decoding problems
dataset = pd.read_csv("train.csv")
 
# encode questions to unicode
dataset['question1'] = dataset['question1'].apply(lambda x: str(x))
dataset['question2'] = dataset['question2'].apply(lambda x: str(x))

In [None]:
dataset.head()

In [None]:
# merge texts
questions = list(dataset['question1']) + list(dataset['question2'])

tfidf = TfidfVectorizer(lowercase=False, )
tfidf.fit_transform(questions)

# dict key:word and value:idf score
# get_feature_names_out() requires scikit-learn 1.0 or newer
word2tfidf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

- After we find the TF-IDF scores, we convert each question to a single vector by combining the word vectors of its tokens, weighted by these scores.
- Here we use a pre-trained GloVe model which comes free with spaCy: https://spacy.io/usage/vectors-similarity
- It is trained on Wikipedia and is therefore strong in terms of word semantics.
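
The weighting idea can be sketched with plain lists. The vectors and IDF scores below are invented, `weighted_question_vector` is a hypothetical helper, and dividing by the total weight is one common normalization that is assumed here; it may differ from what the notebook's own cells compute.

```python
# Hypothetical 3-dimensional word vectors and IDF scores, chosen only for illustration
toy_vectors = {"invest": [1.0, 0.0, 2.0], "share": [0.0, 1.0, 2.0], "market": [2.0, 2.0, 0.0]}
toy_idf = {"invest": 3.0, "share": 1.0}  # "market" is missing and gets weight 0

def weighted_question_vector(tokens, vectors, idf):
    """IDF-weighted average of the token vectors; zero vector if no token has a weight."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in vectors:
            continue  # skip tokens without a vector
        weight = idf.get(token, 0.0)
        weight_sum += weight
        total = [t + weight * v for t, v in zip(total, vectors[token])]
    if weight_sum == 0.0:
        return total
    return [t / weight_sum for t in total]

print(weighted_question_vector(["invest", "share", "market"], toy_vectors, toy_idf))
# [0.75, 0.25, 2.0]
```

Rare words (high IDF) dominate the result, so two questions that share their distinctive words end up with similar vectors.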

In [None]:
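# en_vectors_web_lg includes over 1 million unique vectors.
# The small package loaded here is quicker to download but does not ship full static
# word vectors, so consider a larger package if vector quality matters.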

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

In [None]:
vecs1 = []
# https://github.com/noamraph/tqdm
# tqdm is used to print the progress bar
for qu1 in tqdm(list(dataset['question1'])):
    doc1 = nlp(qu1)
    # one accumulator per question, with the same dimensionality as the token vectors
    mean_vec1 = np.zeros(len(doc1[0].vector))
    for word1 in doc1:
        # word vector of the token
        vec1 = word1.vector
        # look up the idf score; tokens unseen by the vectorizer get weight 0
        idf = word2tfidf.get(str(word1), 0)
        # accumulate the idf-weighted sum of the token vectors
        mean_vec1 += vec1 * idf
    vecs1.append(mean_vec1)
dataset['q1_feats_m'] = list(vecs1)

In [None]:
vecs2 = []
for qu2 in tqdm(list(dataset['question2'])):
    doc2 = nlp(qu2)
    # one accumulator per question, sized from doc2 (not doc1)
    mean_vec2 = np.zeros(len(doc2[0].vector))
    for word2 in doc2:
        # word vector of the token
        vec2 = word2.vector
        # look up the idf score; tokens unseen by the vectorizer get weight 0
        idf = word2tfidf.get(str(word2), 0)
        # accumulate the idf-weighted sum of the token vectors
        mean_vec2 += vec2 * idf
    vecs2.append(mean_vec2)
dataset['q2_feats_m'] = list(vecs2)

In [None]:
# prepro_features_train.csv (Simple Preprocessing Features)
# nlp_features_train.csv (NLP Features)

if os.path.isfile('nlp_features_train.csv'):
    dfnlp = pd.read_csv('nlp_features_train.csv', encoding='latin-1')
else:
    print("Download nlp_features_train.csv from drive or run previous notebook")

if os.path.isfile('df_fe_without_preprocessing_train.csv'):
    dfppro = pd.read_csv("df_fe_without_preprocessing_train.csv", encoding='latin-1')
else:
    print("Download df_fe_without_preprocessing_train.csv from drive or run previous notebook")

In [None]:
df1 = dfnlp.drop(['qid1', 'qid2', 'question1', 'question2'], axis = 1)
df2 = dfppro.drop(['qid1', 'qid2', 'question1', 'question2', 'is_duplicate'], axis = 1)
df3 = dataset.drop(['qid1', 'qid2', 'question1', 'question2', 'is_duplicate'], axis = 1)
df3_q1 = pd.DataFrame(df3.q1_feats_m.values.tolist(), index = df3.index)
df3_q2 = pd.DataFrame(df3.q2_feats_m.values.tolist(), index = df3.index)

In [None]:
# Data Frame of NLP features
df1.head()

In [None]:
# Data before preprocessing
df2.head()

In [None]:
# Questions 1 TF-IDF Weighted Word2Vec
df3_q1.head()

In [None]:
# Questions 2 tfidf weighted word2vec
df3_q2.head()

In [None]:
print("Number of features in nlp dataframe :", df1.shape[1])
print("Number of features in preprocessed dataframe :", df2.shape[1])
print("Number of features in question1 w2v dataframe :", df3_q1.shape[1])
print("Number of features in question2 w2v dataframe :", df3_q2.shape[1])
print("Number of features in final dataframe :", df1.shape[1] + df2.shape[1] + df3_q1.shape[1] + df3_q2.shape[1])

In [None]:
# storing the final features to csv file
if not os.path.isfile('final_features.csv'):
    df3_q1['id']=df1['id']
    df3_q2['id']=df1['id']
    df1  = df1.merge(df2, on='id',how='left')
    df2  = df3_q1.merge(df3_q2, on='id',how='left')
    result  = df1.merge(df2, on='id',how='left')
    result.to_csv('final_features.csv')