# Feature Engineering for the Training Data of `Quora Question Pairs`

- This is the **2nd iteration**; the changes focus on `feature_tm` and `feature_nlp`.
- This notebook performs feature engineering on the training data.
- Features are extracted in the order of the output CSV files listed below.
- **Due to the limited computing resources of my laptop, I couldn't merge the features from `en_core_web_md` (300 dimensions), so `modeling.ipynb` uses features from `en_core_web_sm`, which I had computed earlier and saved locally. To reproduce my project results, change the model to `en_core_web_sm` in section `1.4` (vector features).**


- Input: `train.csv`
- Output: `feature_tm.csv`, `feature_nlp.csv`, `feature_vectors.csv`


In [23]:
import os
import gc
import re
import warnings
from os import path
from subprocess import check_output

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

import distance
from bs4 import BeautifulSoup
from fuzzywuzzy import fuzz
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from PIL import Image
from sklearn.manifold import TSNE
from wordcloud import WordCloud, STOPWORDS

warnings.filterwarnings("ignore")

## Load Train Data

In [24]:
# Load train data from csv file

df = pd.read_csv("Data/train.csv")
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 6 columns):
id              404290 non-null int64
qid1            404290 non-null int64
qid2            404290 non-null int64
question1       404289 non-null object
question2       404288 non-null object
is_duplicate    404290 non-null int64
dtypes: int64(4), object(2)
memory usage: 18.5+ MB


In [26]:
# Basic statistical analysis

# The proportion of duplicated question pairs
print("There are ", round(df['is_duplicate'].mean()*100, 2),"% duplicated question pairs in the training dataset")
print("There are ", 100 - round(df['is_duplicate'].mean()*100, 2),"% question pairs that are not duplicated in the training dataset")

# Check for repeated question pairs (the same qid1/qid2 pair appearing more than once)
question_pair_duplicates = df[['qid1','qid2','is_duplicate']].groupby(['qid1','qid2']).count().reset_index()
print("The number of duplicate question pairs is:", df.shape[0] - question_pair_duplicates.shape[0])

# Check the unique questions
qid_all = pd.Series(df['qid1'].tolist() + df['qid2'].tolist())
unique_questions = len(np.unique(qid_all))
un_unique_questions = np.sum(qid_all.value_counts() > 1)
print("There are:", unique_questions," unique questions.")
print("There are:", un_unique_questions," questions that appear more than once.")


There are  36.92 % duplicated question pairs in the training dataset
There are  63.08 % question pairs that are not duplicated in the training dataset
The number of duplicate question pairs is: 0
There are: 537933  unique questions.
There are: 111780  questions that appear more than once.


In [27]:
# Data preprocessing: deal with null values

nan_data = df[df.isnull().any(axis=1)]
print("Here are NaN data rows:")
print(nan_data)
print("---------------Now start data cleansing for NaN values:-------------")
df = df.fillna('')
nan_data = df[df.isnull().any(axis=1)]
print("Here are NaN data rows:")
print(nan_data)


Here are NaN data rows:
            id    qid1    qid2                         question1  \
105780  105780  174363  174364    How can I develop android app?   
201841  201841  303951  174364  How can I create an Android app?   
363362  363362  493340  493341                               NaN   

                                                question2  is_duplicate  
105780                                                NaN             0  
201841                                                NaN             0  
363362  My Chinese name is Haichao Yu. What English na...             0  
---------------Now start data cleansing for NaN values:-------------
Here are NaN data rows:
Empty DataFrame
Columns: [id, qid1, qid2, question1, question2, is_duplicate]
Index: []


## Feature Engineering on Text Mining Features

Extract text mining (statistical) features from the training data.
 - ___q1len___ = Length of q1 in characters, including spaces
 - ___q2len___ = Length of q2 in characters, including spaces
 - ___diff_len___ = q1len - q2len


 - ___len_word_q1___ = Number of words in q1
 - ___len_word_q2___ = Number of words in q2
 - ___diff_words___ = len_word_q1 - len_word_q2


 - ___caps_count_q1___ = Number of uppercase characters in q1
 - ___caps_count_q2___ = Number of uppercase characters in q2
 - ___diff_caps___ = caps_count_q1 - caps_count_q2


 - ___len_char_q1___ = Number of characters in q1, excluding spaces
 - ___len_char_q2___ = Number of characters in q2, excluding spaces
 - ___diff_len_char___ = len_char_q1 - len_char_q2


 - ___avg_world_len1___ = len_char_q1 / len_word_q1 (average word length of q1)
 - ___avg_world_len2___ = len_char_q2 / len_word_q2 (average word length of q2)
 - ___diff_avg_word___ = avg_world_len1 - avg_world_len2


 - ___word_Common___ = Number of unique words shared by q1 and q2
 - ___word_Total___ = Number of unique words in q1 + number of unique words in q2
 - ___word_share___ = word_Common / word_Total
 - ___share_2_gram___ = The same ratio computed on 2-grams (pairs of adjacent words)


 - ___exactly_same___ = Whether q1 and q2 are exactly the same (not computed in the cell below)


 - **Output: feature_tm.csv**
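To make the sharing ratios concrete, here is a minimal standalone sketch of `word_share` and the 2-gram share on a toy question pair. The helper names `word_share` and `two_gram_share` and the example sentences are illustrative only; the logic follows the definitions listed above.

```python
def word_share(q1, q2):
    # Unique lowercase words of each question
    w1 = {w.lower().strip() for w in q1.split(" ")}
    w2 = {w.lower().strip() for w in q2.split(" ")}
    # Shared unique words divided by the sum of both unique-word counts
    return len(w1 & w2) / (len(w1) + len(w2))


def two_gram_share(q1, q2):
    t1, t2 = q1.lower().split(), q2.lower().split()
    # Sets of adjacent word pairs
    g1, g2 = set(zip(t1, t1[1:])), set(zip(t2, t2[1:]))
    total = len(g1) + len(g2)
    return len(g1 & g2) / total if total else 0


q1 = "How do I learn Python"
q2 = "How can I learn Python quickly"
print(word_share(q1, q2))      # 4 shared words out of 5 + 6 unique words
print(two_gram_share(q1, q2))  # 2 shared 2-grams out of 4 + 5
```

Because the denominator counts the words of both questions, two identical questions reach a maximum of 0.5 rather than 1.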

In [28]:
if os.path.isfile('Features/feature_tm.csv'):
    df = pd.read_csv("Features/feature_tm.csv",encoding='latin-1')
else:
    df['q1len'] = df['question1'].str.len() 
    df['q2len'] = df['question2'].str.len()
    df['diff_len'] = df['q1len'] - df['q2len']
    
    df['len_word_q1'] = df['question1'].apply(lambda row: len(row.split(" ")))
    df['len_word_q2'] = df['question2'].apply(lambda row: len(row.split(" ")))
    df['diff_words'] = df['len_word_q1'] - df['len_word_q2']
    
    df['caps_count_q1'] = df['question1'].apply(lambda x:sum(1 for i in str(x) if i.isupper()))
    df['caps_count_q2'] = df['question2'].apply(lambda x:sum(1 for i in str(x) if i.isupper()))
    df['diff_caps'] = df['caps_count_q1'] - df['caps_count_q2']
    
    df['len_char_q1'] = df['question1'].apply(lambda x: len(str(x).replace(' ', '')))
    df['len_char_q2'] = df['question2'].apply(lambda x: len(str(x).replace(' ', '')))
    df['diff_len_char'] = df['len_char_q1'] - df['len_char_q2']
    
    df['avg_world_len1'] = df['len_char_q1'] / df['len_word_q1']
    df['avg_world_len2'] = df['len_char_q2'] / df['len_word_q2']
    df['diff_avg_word'] = df['avg_world_len1'] - df['avg_world_len2']
    

    def normalized_word_Common(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * len(w1 & w2)
    df['word_Common'] = df.apply(normalized_word_Common, axis=1)

    def normalized_word_Total(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * (len(w1) + len(w2))
    df['word_Total'] = df.apply(normalized_word_Total, axis=1)

    def normalized_word_share(row):
        w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
        w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
        return 1.0 * len(w1 & w2)/(len(w1) + len(w2))
    df['word_share'] = df.apply(normalized_word_share, axis=1)

    def get_2_gram_share(row):
        q1_list = str(row['question1']).lower().split()
        q2_list = str(row['question2']).lower().split()
        q1_2_gram = set([i for i in zip(q1_list, q1_list[1:])])
        q2_2_gram = set([i for i in zip(q2_list, q2_list[1:])])
        shared_2_gram = q1_2_gram.intersection(q2_2_gram)
        if len(q1_2_gram) + len(q2_2_gram) == 0:
            R2gram = 0
        else:
            R2gram = len(shared_2_gram) / (len(q1_2_gram) + len(q2_2_gram))
        return R2gram
    df['share_2_gram'] = df.apply(get_2_gram_share, axis=1) 

    df.to_csv("Features/feature_tm.csv", index=False)

df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,q1len,q2len,diff_len,len_word_q1,...,len_char_q1,len_char_q2,diff_len_char,avg_world_len1,avg_world_len2,diff_avg_word,word_Common,word_Total,word_share,share_2_gram
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,66,57,9,14,...,53,46,7,3.785714,3.833333,-0.047619,10.0,23.0,0.434783,0.416667
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,51,88,-37,8,...,44,76,-32,5.5,5.846154,-0.346154,4.0,20.0,0.2,0.052632
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,73,59,14,14,...,60,50,10,4.285714,5.0,-0.714286,4.0,24.0,0.166667,0.045455
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,50,65,-15,11,...,40,57,-17,3.636364,6.333333,-2.69697,0.0,19.0,0.0,0.0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,76,39,37,13,...,64,33,31,4.923077,4.714286,0.208791,2.0,20.0,0.1,0.0


In [29]:
# Check the NaN data

if os.path.isfile('Features/feature_tm.csv'):
    df_tm = pd.read_csv("Features/feature_tm.csv",encoding='latin-1')
    print(df_tm.isna().sum())
    df_tm = df_tm.fillna('')
    df_tm.head()
else:
    # If the file does not exist, create it first by running the previous cell (chapter 1.2)
    print("There is no Features/feature_tm.csv!")
    

id                0
qid1              0
qid2              0
question1         1
question2         2
is_duplicate      0
q1len             0
q2len             0
diff_len          0
len_word_q1       0
len_word_q2       0
diff_words        0
caps_count_q1     0
caps_count_q2     0
diff_caps         0
len_char_q1       0
len_char_q2       0
diff_len_char     0
avg_world_len1    0
avg_world_len2    0
diff_avg_word     0
word_Common       0
word_Total        0
word_share        0
share_2_gram      0
dtype: int64
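The non-zero NaN counts for `question1` and `question2` above appear even though the frame was filled with `''` before saving. A likely explanation is that pandas writes an empty string as an empty CSV field and, by default, reads empty fields back as NaN. The sketch below reproduces this on a toy frame (the frame and variable names are made up for illustration) and shows that `keep_default_na=False` keeps the empty strings.

```python
import io

import pandas as pd

# Toy frame with one empty question, mimicking the filled NaN rows
toy = pd.DataFrame({"id": [0, 1], "question1": ["How are you?", ""]})

# Round-trip through CSV in memory instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default behaviour: the empty field is parsed as NaN
reloaded_default = pd.read_csv(io.StringIO(csv_text))
# keep_default_na=False: the empty field stays an empty string
reloaded_kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(reloaded_default["question1"].isna().sum())
print(reloaded_kept["question1"].isna().sum())
```

Either option works for this notebook: call `fillna('')` again after loading, as done above, or pass `keep_default_na=False` when reading the cached file.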


In [30]:
print(df.isna().sum())

id                0
qid1              0
qid2              0
question1         0
question2         0
is_duplicate      0
q1len             0
q2len             0
diff_len          0
len_word_q1       0
len_word_q2       0
diff_words        0
caps_count_q1     0
caps_count_q2     0
diff_caps         0
len_char_q1       0
len_char_q2       0
diff_len_char     0
avg_world_len1    0
avg_world_len2    0
diff_avg_word     0
word_Common       0
word_Total        0
word_share        0
share_2_gram      0
dtype: int64


## Feature Engineering on NLP Features

Extract NLP features, including:
- Statistical features built on NLP concepts such as stop words, tokens, and substrings
- NLP distances
- Fuzzy-matching features

Features:

- __last_word_eq__ :  Whether the last word of both questions is the same<br>last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])


- __first_word_eq__ :  Whether the first word of both questions is the same<br>first_word_eq = int(q1_tokens[0] == q2_tokens[0])


- __abs_len_diff__ :  Absolute difference in token count<br>abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))


- __mean_len__ :  Average token count of the two questions<br>mean_len = (len(q1_tokens) + len(q2_tokens))/2

- __cwc_min__ :  Ratio of common_word_count to the smaller word count of Q1 and Q2<br>cwc_min = common_word_count / min(len(q1_words), len(q2_words))


- __cwc_max__ :  Ratio of common_word_count to the larger word count of Q1 and Q2<br>cwc_max = common_word_count / max(len(q1_words), len(q2_words))


- __csc_min__ :  Ratio of common_stop_count to the smaller stop word count of Q1 and Q2<br>csc_min = common_stop_count / min(len(q1_stops), len(q2_stops))


- __csc_max__ :  Ratio of common_stop_count to the larger stop word count of Q1 and Q2<br>csc_max = common_stop_count / max(len(q1_stops), len(q2_stops))


- __ctc_min__ :  Ratio of common_token_count to the smaller token count of Q1 and Q2<br>ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))


- __ctc_max__ :  Ratio of common_token_count to the larger token count of Q1 and Q2<br>ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))


- __word_mover_dist__: Word Mover's Distance between the two questions. Reference paper: http://proceedings.mlr.press/v37/kusnerb15.pdf 


- __cosine_dist__: Cosine distance between the averaged GloVe vectors of the two questions. This is different from the cosine similarity of TF/TF-IDF vectors in Chapter 1.5


- __cityblock_dist__: Follows the standard definition


- __canberra_dist__: Follows the standard definition


- __euclidean_dist__: Follows the standard definition


- __minkowski_dist__: Follows the standard definition (computed with SciPy's default order, p = 2)


- __fuzz_ratio__ :  https://github.com/seatgeek/fuzzywuzzy#usage
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/


- __fuzz_partial_ratio__ :  https://github.com/seatgeek/fuzzywuzzy#usage
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/


- __token_sort_ratio__ : https://github.com/seatgeek/fuzzywuzzy#usage
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/


- __token_set_ratio__ : https://github.com/seatgeek/fuzzywuzzy#usage
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/


- __longest_substr_ratio__ :  Ratio of the length of the longest common substring to the character length of the shorter question plus one<br>longest_substr_ratio = len(longest common substring) / (min(len(q1), len(q2)) + 1)
Idea taken from http://static.hongbozhang.me/doc/STAT_441_Report.pdf



- **Output: feature_nlp.csv**
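The six ratio features above can be sketched on a toy pair as follows. This is only an illustration: `TOY_STOP_WORDS` is a hypothetical four-word stop list (the notebook uses NLTK's English list), `token_ratios` is a made-up helper name, and the small constant `SAFE_DIV` mirrors the one the notebook's implementation adds to each denominator to avoid division by zero.

```python
SAFE_DIV = 0.0001  # small constant added to every denominator
TOY_STOP_WORDS = {"how", "can", "i", "do"}  # hypothetical stop list for illustration


def token_ratios(q1, q2):
    q1_tokens, q2_tokens = q1.split(), q2.split()
    # Non-stop words and stop words of each question (as sets)
    q1_words = {t for t in q1_tokens if t not in TOY_STOP_WORDS}
    q2_words = {t for t in q2_tokens if t not in TOY_STOP_WORDS}
    q1_stops = {t for t in q1_tokens if t in TOY_STOP_WORDS}
    q2_stops = {t for t in q2_tokens if t in TOY_STOP_WORDS}

    common_word = len(q1_words & q2_words)
    common_stop = len(q1_stops & q2_stops)
    common_token = len(set(q1_tokens) & set(q2_tokens))

    return {
        "cwc_min": common_word / (min(len(q1_words), len(q2_words)) + SAFE_DIV),
        "cwc_max": common_word / (max(len(q1_words), len(q2_words)) + SAFE_DIV),
        "csc_min": common_stop / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV),
        "csc_max": common_stop / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV),
        "ctc_min": common_token / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),
        "ctc_max": common_token / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),
    }


features = token_ratios("how can i learn python fast", "how do i learn python")
for name, value in features.items():
    print(name, round(value, 4))
```

Because of `SAFE_DIV`, a perfect overlap yields a value slightly below 1 (for example 0.99998), which matches the values visible in the feature table later in this notebook.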

In [18]:
# Data preprocessing: remove noise such as HTML tags and punctuation, expand contractions, and prepare stopwords.
# Idea from Kaggle notebooks

import nltk
nltk.download('stopwords')

# Small constant added to denominators to avoid division by zero
SAFE_DIV = 0.0001 

from nltk.corpus import stopwords
# A set gives fast membership tests
STOP_WORDS = set(stopwords.words('english'))

# Preprocessing
def preprocess(x):
    x = str(x).lower()
    # Normalize number suffixes, apostrophes, contractions and currency symbols.
    # Specific contractions must come before the generic "n't" and "'s" rules.
    x = x.replace(",000,000", "m").replace(",000", "k").replace("′", "'").replace("’", "'")\
                           .replace("won't", "will not").replace("cannot", "can not").replace("can't", "can not")\
                           .replace("n't", " not").replace("what's", "what is").replace("it's", "it is")\
                           .replace("'ve", " have").replace("i'm", "i am").replace("'re", " are")\
                           .replace("he's", "he is").replace("she's", "she is").replace("'s", " own")\
                           .replace("%", " percent ").replace("₹", " rupee ").replace("$", " dollar ")\
                           .replace("€", " euro ").replace("'ll", " will")
    x = re.sub(r"([0-9]+)000000", r"\1m", x)
    x = re.sub(r"([0-9]+)000", r"\1k", x)
    
    porter = PorterStemmer()
    pattern = re.compile(r'\W')
    
    # Replace every non-word character with a space
    x = re.sub(pattern, ' ', x)
    
    # Note: stem() expects a single word, so the whole string is treated as one token here
    x = porter.stem(x)
    # Strip any remaining HTML markup
    x = BeautifulSoup(x, "html.parser").get_text()
    
    return x
    
    

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yanzheyuan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
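The chain of `replace` calls in `preprocess` depends on rule order: specific contractions such as `can't` have to be expanded before the generic `n't` rule. The following minimal sketch shows the effect; `expand`, `specific_first` and `generic_first` are illustrative names, and only three of the notebook's rules are included.

```python
def expand(text, rules):
    # Apply the (old, new) replacements in the given order
    for old, new in rules:
        text = text.replace(old, new)
    return text


# Same order as in preprocess(): specific contractions first
specific_first = [("won't", "will not"), ("can't", "can not"), ("n't", " not")]
# Reversed priority: the generic rule consumes "n't" before the specific ones can match
generic_first = [("n't", " not"), ("won't", "will not"), ("can't", "can not")]

sentence = "i can't and won't"
print(expand(sentence, specific_first))
print(expand(sentence, generic_first))
```

With the generic rule first, `can't` and `won't` degrade to `ca not` and `wo not`, so any rule added to `preprocess` should be placed with this ordering in mind.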


In [20]:
# Preparations for distance calculations

import pickle
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Download the GloVe model
!wget http://nlp.stanford.edu/data/glove.840B.300d.zip
!unzip glove.840B.300d.zip

# Convert the GloVe text file to word2vec format so that gensim can load it
glove2word2vec(glove_input_file="glove.840B.300d.txt", word2vec_output_file="glove_vectors.txt")

glove_model = KeyedVectors.load_word2vec_format("glove_vectors.txt", binary=False)



--2020-11-03 02:56:35--  http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2020-11-03 02:56:35--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip [following]
--2020-11-03 02:56:36--  http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip]
Saving to: “glove.840B.300d.zip”


2020-11-03 03:13:41 (2.03 MB/s) - “glove.840B.300d.zip” saved [2176768927/2176768927]

Archi

In [73]:
# Distance calculations
from scipy.stats import skew, kurtosis
from scipy.spatial.distance import cosine, cityblock, canberra, euclidean, minkowski

# Preprocessing: simple version
def remove_stop(sentence):
    # Check for missing values before casting to str; afterwards None/NaN would become 'None'/'nan'
    if sentence is None or pd.isna(sentence) or sentence == 'NaN':
        return ' '
    sentence = str(sentence)
    z = [i for i in sentence.split() if i not in STOP_WORDS]
    return ' '.join(z)

# wmd_dist calculation
def wmd(s1, s2, model):
    s1 = str(s1)
    s2 = str(s2)
    s1 = s1.split()
    s2 = s2.split()
    return model.wmdistance(s1, s2)

# Average the GloVe (word2vec-format) vectors of the words in each sentence
def g2w2v(list_of_sent, model, d):
    # Returns one averaged word vector of dimension d per sentence
    sent_vectors = []
    for sentence in list_of_sent: # each sentence is a list of tokens
        # Keep only in-vocabulary words (gensim 3.x API; gensim 4 replaces `.vocab` with `key_to_index`)
        doc = [word for word in sentence if word in model.wv.vocab]
        if doc:
            sent_vec = np.mean(model.wv[doc],axis=0) # average of the word vectors (avg-w2v)
        else:
            sent_vec = np.zeros(d) # no known word: fall back to a zero vector
        sent_vectors.append(sent_vec)
    return sent_vectors

# Gathering all calculations
def get_distance_features(df):
    
    print("Extracting Distance Features..")
    
    # wmd_distance
    df['question1'] = df.question1.apply(remove_stop)
    df['question2'] = df.question2.apply(remove_stop)
    df['word_mover_dist'] = df.apply(lambda x: wmd(x['question1'], x['question2'],glove_model), axis=1)
    
    print("- wmd done...")
    
    # Other Distances
    # Converting questions into lists of sentences
    list_of_question1=[]
    for sentence in df.question1.values:
        list_of_question1.append(sentence.split())
    
    list_of_question2=[]
    for sentence in df.question2.values:
        list_of_question2.append(sentence.split())
    
    # Get embedded vectors from a pre-trained model (GloVe converted to word2vec format)
    g2w2v_q1 = g2w2v(list_of_question1, glove_model, 300)
    g2w2v_q2 = g2w2v(list_of_question2, glove_model, 300)
    
    # GloVe-based sentence vectors: these could be added to the NLP features. I left them out,
    # but they may be useful for improving the model.
    # df_g2w2v = pd.DataFrame()
    # df_g2w2v['q1_vec'] = list(g2w2v_q1)
    # df_g2w2v['q2_vec'] = list(g2w2v_q2)
    # df_q1 = pd.DataFrame(df_g2w2v.q1_vec.values.tolist())
    # df_q2 = pd.DataFrame(df_g2w2v.q2_vec.values.tolist())
    
    print("- embedding done...")
    
    # Spatial distances between the question vectors
    df['cosine_dist'] = [cosine(q1, q2) for (q1, q2) in zip(g2w2v_q1,g2w2v_q2)]
    df['cityblock_dist'] = [cityblock(q1, q2) for (q1, q2) in zip(g2w2v_q1,g2w2v_q2)]
    df['canberra_dist'] = [canberra(q1, q2) for (q1, q2) in zip(g2w2v_q1,g2w2v_q2)]
    df['euclidean_dist'] = [euclidean(q1, q2) for (q1, q2) in zip(g2w2v_q1,g2w2v_q2)]
    # Note: with SciPy's default p=2, minkowski is identical to euclidean
    df['minkowski_dist'] = [minkowski(q1, q2) for (q1, q2) in zip(g2w2v_q1,g2w2v_q2)]
    
    print('- spatial distance done')
    
    # Deal with NaN and infinite values
    df.cosine_dist = df.cosine_dist.fillna(0)
    # wmdistance can return inf (for example when a question has no in-vocabulary words); cap it at 30
    df.word_mover_dist = df.word_mover_dist.apply(lambda d: 30 if d == np.inf else d)
   
    return df
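For reference, the spatial distances used above can be written out in a few lines of plain Python. This is a sketch of the standard definitions on two toy vectors, not the SciPy implementation; the function names with the `_dist` suffix are made up for illustration. It also shows why `minkowski_dist` and `euclidean_dist` coincide when the order is left at `p = 2`.

```python
import math


def cosine_dist(u, v):
    # 1 - cosine similarity
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cityblock_dist(u, v):
    # Sum of absolute coordinate differences (Manhattan distance)
    return sum(abs(a - b) for a, b in zip(u, v))


def canberra_dist(u, v):
    # Terms where both coordinates are zero are skipped (counted as 0)
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(u, v) if a != 0 or b != 0)


def minkowski_dist(u, v, p=2):
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def euclidean_dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


u = [1.0, 0.0, 2.0]
v = [0.0, 1.0, 2.0]
print(cosine_dist(u, v), cityblock_dist(u, v), canberra_dist(u, v))
print(euclidean_dist(u, v), minkowski_dist(u, v), minkowski_dist(u, v, p=3))
```

To make `minkowski_dist` carry information beyond `euclidean_dist`, one option is to pass a different order (for example `p=3`) to `minkowski` in the notebook.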

In [74]:
# Statistical features on Text Tokens of questions
def get_token_features(q1, q2):
    token_features = [0.0]*10
    
    # Converting the sentence into Tokens: 
    q1_tokens = q1.split()
    q2_tokens = q2.split()

    if len(q1_tokens) == 0 or len(q2_tokens) == 0:
        return token_features
    
    # Get the non-stopwords in questions
    q1_words = set([word for word in q1_tokens if word not in STOP_WORDS])
    q2_words = set([word for word in q2_tokens if word not in STOP_WORDS])
    
    # Get the stopwords in questions
    q1_stops = set([word for word in q1_tokens if word in STOP_WORDS])
    q2_stops = set([word for word in q2_tokens if word in STOP_WORDS])
    
    # Get the common non-stopwords from question pair
    common_word_count = len(q1_words.intersection(q2_words))
    
    # Get the common stopwords from question pair
    common_stop_count = len(q1_stops.intersection(q2_stops))
    
    # Get the common Tokens from question pair
    common_token_count = len(set(q1_tokens).intersection(set(q2_tokens)))
    
    # Add safety div
    token_features[0] = common_word_count / (min(len(q1_words), len(q2_words)) + SAFE_DIV)
    token_features[1] = common_word_count / (max(len(q1_words), len(q2_words)) + SAFE_DIV)
    token_features[2] = common_stop_count / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV)
    token_features[3] = common_stop_count / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV)
    token_features[4] = common_token_count / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
    token_features[5] = common_token_count / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
    
    # Last word of both question is same or not
    token_features[6] = int(q1_tokens[-1] == q2_tokens[-1])
    
    # First word of both question is same or not
    token_features[7] = int(q1_tokens[0] == q2_tokens[0])
    
    token_features[8] = abs(len(q1_tokens) - len(q2_tokens))
    
    # Average Token Length of both Questions
    token_features[9] = (len(q1_tokens) + len(q2_tokens))/2
    return token_features


# Get the Longest Common sub string
def get_longest_substr_ratio(a, b):
    strs = list(distance.lcsubstrings(a, b))
    if len(strs) == 0:
        return 0
    else:
        return len(strs[0]) / (min(len(a), len(b)) + 1)
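The longest-common-substring ratio can also be reproduced without the `distance` package. The sketch below swaps in `difflib.SequenceMatcher` from the standard library as a stand-in for `distance.lcsubstrings`; the function name `longest_substr_ratio` is illustrative, and the ratio works on characters, exactly like the function above.

```python
from difflib import SequenceMatcher


def longest_substr_ratio(a, b):
    if not a or not b:
        return 0
    # autojunk=False disables the heuristic that ignores frequent characters in long strings
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    # Length of the longest common substring over the shorter length plus one
    return match.size / (min(len(a), len(b)) + 1)


# "step guide" (10 characters) is contained in the first string
print(longest_substr_ratio("step by step guide", "step guide"))
```

The `+ 1` in the denominator keeps the ratio strictly below 1 even when the shorter question is fully contained in the longer one.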


In [77]:
# Gather all the NLP features
def extract_features(df):
    # preprocessing each question, apply self-defined function preprocess to filter text data with stopwords preparation
    df["question1"] = df["question1"].fillna("").apply(preprocess)
    df["question2"] = df["question2"].fillna("").apply(preprocess)

    print("Extracting Token Features...")
    
    token_features = df.apply(lambda x: get_token_features(x["question1"], x["question2"]), axis=1)
    
    df["cwc_min"]       = list(map(lambda x: x[0], token_features))
    df["cwc_max"]       = list(map(lambda x: x[1], token_features))
    df["csc_min"]       = list(map(lambda x: x[2], token_features))
    df["csc_max"]       = list(map(lambda x: x[3], token_features))
    df["ctc_min"]       = list(map(lambda x: x[4], token_features))
    df["ctc_max"]       = list(map(lambda x: x[5], token_features))
    df["last_word_eq"]  = list(map(lambda x: x[6], token_features))
    df["first_word_eq"] = list(map(lambda x: x[7], token_features))
    df["abs_len_diff"]  = list(map(lambda x: x[8], token_features))
    df["mean_len"]      = list(map(lambda x: x[9], token_features))
   
    # Getting Fuzzy Features and Merging with Dataset
    print("Extracting Fuzzy Features..")

    df["token_set_ratio"]       = df.apply(lambda x: fuzz.token_set_ratio(x["question1"], x["question2"]), axis=1)
    df["token_sort_ratio"]      = df.apply(lambda x: fuzz.token_sort_ratio(x["question1"], x["question2"]), axis=1)
    df["fuzz_ratio"]            = df.apply(lambda x: fuzz.QRatio(x["question1"], x["question2"]), axis=1)
    df["fuzz_partial_ratio"]    = df.apply(lambda x: fuzz.partial_ratio(x["question1"], x["question2"]), axis=1)
    df["longest_substr_ratio"]  = df.apply(lambda x: get_longest_substr_ratio(x["question1"], x["question2"]), axis=1)
    return df

In [78]:
if os.path.isfile('Features/feature_nlp.csv'):
    df_nlp = pd.read_csv("Features/feature_nlp.csv",encoding='latin-1')
    # df.fillna('')
else:
    # If the file does not exist yet, create it; make sure you have run the previous cells of chapter 1.3
    print("Extracting features for train:")
    df = pd.read_csv("Data/train.csv")
    df = extract_features(df)
    df = get_distance_features(df)
    # Drop unnecessary columns before saving; `df` keeps all columns for the checks below
    df_nlp = df.drop(['qid1','qid2','question1','question2','is_duplicate'], axis=1)
    df_nlp.to_csv("Features/feature_nlp.csv", index=False)
df.head()

Extracting features for train:
Extracting Token Features...
Extracting Fuzzy Features..
Extracting Distance Features..
- wmd done...
- embedding done...
- spatial distance done


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,cwc_min,cwc_max,csc_min,csc_max,...,token_sort_ratio,fuzz_ratio,fuzz_partial_ratio,longest_substr_ratio,word_mover_dist,cosine_dist,cityblock_dist,canberra_dist,euclidean_dist,minkowski_dist
0,0,1,2,step step guide invest share market india,step step guide invest share market,0,0.99998,0.833319,0.999983,0.999983,...,93,93,100,0.982759,1.216034,0.031762,14.274065,91.483062,1.047253,1.047253
1,1,3,4,story kohinoor koh noor diamond,would happen indian government stole kohinoor ...,0,0.799984,0.399996,0.749981,0.599988,...,63,66,75,0.596154,4.897662,0.266555,33.272633,149.670092,2.624989,2.624989
2,2,5,6,increase speed internet connection using vpn,internet speed increased hacking dns,0,0.399992,0.333328,0.399992,0.249997,...,63,43,47,0.166667,4.011556,0.1189,28.457512,129.21466,2.140298,2.140298
3,3,7,8,mentally lonely solve,find remainder math 23 24 math divided 24 23,0,0.0,0.0,0.0,0.0,...,24,9,14,0.039216,7.514702,0.619671,62.016426,200.899534,4.702347,4.702347
4,4,9,10,one dissolve water quikly sugar salt methane c...,fish would survive salt water,0,0.399992,0.199998,0.99995,0.666644,...,47,35,56,0.175,6.25726,0.244168,40.127296,156.627744,3.145122,3.145122


In [79]:
df.shape

(404290, 27)

In [81]:
# df.cosine_dist = df.cosine_dist.fillna(0)

In [85]:
# Check on NaN values
df.isna().sum()

id                      0
qid1                    0
qid2                    0
question1               0
question2               0
is_duplicate            0
cwc_min                 0
cwc_max                 0
csc_min                 0
csc_max                 0
ctc_min                 0
ctc_max                 0
last_word_eq            0
first_word_eq           0
abs_len_diff            0
mean_len                0
token_set_ratio         0
token_sort_ratio        0
fuzz_ratio              0
fuzz_partial_ratio      0
longest_substr_ratio    0
word_mover_dist         0
cosine_dist             0
cityblock_dist          0
canberra_dist           0
euclidean_dist          0
minkowski_dist          0
dtype: int64

In [86]:
temp = df[df.isnull().any(axis=1)]
temp

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,cwc_min,cwc_max,csc_min,csc_max,...,token_sort_ratio,fuzz_ratio,fuzz_partial_ratio,longest_substr_ratio,word_mover_dist,cosine_dist,cityblock_dist,canberra_dist,euclidean_dist,minkowski_dist


In [22]:
# Drop unnecessary columns: this step is now part of the previous cell, so the code below is no longer needed

#if os.path.isfile('Features/feature_nlp.csv'):
#    df_nlp = pd.read_csv("Features/feature_nlp.csv",encoding='latin-1')
#    df_nlp = df_nlp.drop(['qid1','qid2','question1','question2','is_duplicate'], axis=1)
#    df_nlp.to_csv("Features/feature_nlp.csv", index=False)
#else:
#    print('There is no feature_nlp.csv!')
#df_nlp.head()


Unnamed: 0,id,cwc_min,cwc_max,csc_min,csc_max,ctc_min,ctc_max,last_word_eq,first_word_eq,abs_len_diff,...,token_sort_ratio,fuzz_ratio,fuzz_partial_ratio,longest_substr_ratio,word_mover_dist,cosine_dist,cityblock_dist,canberra_dist,euclidean_dist,minkowski_dist
0,0,0.99998,0.833319,0.999983,0.999983,0.916659,0.785709,0.0,1.0,2.0,...,93,93,100,0.982759,1.216034,0.031762,14.274065,91.483062,1.047253,1.047253
1,1,0.799984,0.399996,0.749981,0.599988,0.699993,0.466664,0.0,1.0,5.0,...,63,66,75,0.596154,4.897662,0.266555,33.272633,149.670092,2.624989,2.624989
2,2,0.399992,0.333328,0.399992,0.249997,0.399996,0.285712,0.0,1.0,4.0,...,63,43,47,0.166667,4.011556,0.1189,28.457512,129.21466,2.140298,2.140298
3,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,24,9,14,0.039216,7.514702,0.619671,62.016426,200.899534,4.702347,4.702347
4,4,0.399992,0.199998,0.99995,0.666644,0.57142,0.30769,0.0,1.0,6.0,...,47,35,56,0.175,6.25726,0.244168,40.127296,156.627744,3.145122,3.145122


## Feature Engineering on TFIDF weighted Word-Vector (Vector Features)

Extract flattened TF-IDF-weighted word vectors as features for every question.    
I use spaCy (an industrial-strength package for Natural Language Processing) and its English models for the word-to-vector step. 

- Why TF-IDF based? Following the idea of Smooth Inverse Frequency, every word vector is weighted by the word's IDF value before the vectors of a question are combined.
- Because of the huge size of the test data, `en_core_web_md` cannot run locally on my computer, so I ran the `en_core_web_md` version on Google Colab in `feature_engineering_test_md.ipynb`.
- The features I retained for modeling come from `en_core_web_sm`, while the cells below load `en_core_web_md`. `en_core_web_sm` is an English multi-task CNN trained on OntoNotes, while `en_core_web_md` is an English multi-task CNN trained on OntoNotes with GloVe vectors trained on Common Crawl.

- Output: `feature_vectors.csv`
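The weighting idea can be sketched on a toy corpus with plain Python. Everything here is illustrative: `toy_vectors` holds made-up 2-dimensional vectors in place of the spaCy vectors, and the IDF is computed by hand with the smoothed formula `ln((1 + n) / (1 + df)) + 1`, which is what scikit-learn's `TfidfVectorizer` uses with its default `smooth_idf=True`.

```python
import math

# Toy corpus: each document is a tokenized question
docs = [["learn", "python"], ["learn", "java"], ["learn", "python", "fast"]]
n_docs = len(docs)
vocab = {word for doc in docs for word in doc}

# Document frequency and smoothed IDF of every word
doc_freq = {word: sum(word in doc for doc in docs) for word in vocab}
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

# Hypothetical 2-dimensional word vectors standing in for the spaCy vectors
toy_vectors = {
    "learn": [1.0, 0.0],
    "python": [0.0, 1.0],
    "java": [0.0, -1.0],
    "fast": [1.0, 1.0],
}


def weighted_question_vector(tokens):
    total = [0.0, 0.0]
    for token in tokens:
        weight = idf.get(token, 0.0)  # unknown words get weight 0
        vector = toy_vectors.get(token, [0.0, 0.0])
        total = [t + weight * v for t, v in zip(total, vector)]
    return total


question_vector = weighted_question_vector(["learn", "python"])
print(idf)
print(question_vector)
```

Words that occur in every question (here `learn`) get the smallest weight, so rarer and more informative words dominate the question vector.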

In [55]:
import os
import re
import sys
import time
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm

warnings.filterwarnings("ignore")


In [56]:
# Load data

df = pd.read_csv("Data/train.csv")
 
df['question1'] = df['question1'].apply(lambda x: str(x))
df['question2'] = df['question2'].apply(lambda x: str(x))
df.head()


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [57]:
# Get the IDF weight of every word across all questions

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Merge question texts
questions = list(df['question1']) + list(df['question2'])

# TfidfVectorizer = CountVectorizer + TfidfTransformer
tfidf = TfidfVectorizer(lowercase=False)
tfidf.fit_transform(questions)

# Dictionary mapping each word (key) to its IDF value (value)
# Note: scikit-learn >= 1.2 removed get_feature_names(); use get_feature_names_out() there
word2tfidf = dict(zip(tfidf.get_feature_names(), tfidf.idf_))
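To make clear what `tfidf.idf_` holds, here is a minimal standard-library sketch of the smoothed IDF formula that scikit-learn uses by default, `ln((1 + n) / (1 + df)) + 1`. The helper `smoothed_idf` and the toy corpus are illustrative assumptions, not part of the notebook; the token pattern mirrors the vectorizer's default, which keeps only tokens of two or more word characters.

```python
import math
import re
from collections import Counter

# Same default token pattern as TfidfVectorizer: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def smoothed_idf(documents):
    """Hypothetical helper: smoothed IDF per word, ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document (document frequency)
        doc_freq.update(set(TOKEN_PATTERN.findall(doc)))
    return {word: math.log((1 + n_docs) / (1 + df)) + 1.0
            for word, df in doc_freq.items()}

toy_questions = ["How do I learn Python", "How do I learn cooking"]
word2idf = smoothed_idf(toy_questions)

for word in sorted(word2idf):
    print(word, round(word2idf[word], 4))
```

Words that occur in every document get the minimum weight of 1.0, and rarer words get larger weights. Single-character tokens such as "I" never enter the vocabulary, so a later dictionary lookup for them falls back to a weight of 0.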


In [58]:
# Load en_core_web_md, whose word vectors are 300-dimensional
import en_core_web_md

nlp = en_core_web_md.load()

vecs_1 = []

for question_1 in tqdm(list(df['question1'])):  # tqdm is a progress bar
    doc_1 = nlp(question_1)
    # One accumulator row per token; en_core_web_md vectors have length 300
    mean_vec_1 = np.zeros([len(doc_1), 300])
    for word_1 in doc_1:
        # Word vector of the token
        vec_1 = word_1.vector
        # Look up the IDF weight; tokens missing from the vocabulary get weight 0
        try:
            idf = word2tfidf[str(word_1)]
        except KeyError:
            idf = 0
        # Broadcasting adds the weighted vector to every row
        mean_vec_1 += vec_1 * idf
    # All rows are identical, so this mean equals the IDF-weighted sum of the token vectors
    mean_vec_1 = mean_vec_1.mean(axis=0)
    vecs_1.append(mean_vec_1)
df['q1_vector_features'] = list(vecs_1)


100%|██████████| 404290/404290 [45:25<00:00, 148.35it/s] 


In [60]:
import en_core_web_md
nlp = en_core_web_md.load()
vecs_2 = []

for question_2 in tqdm(list(df['question2'])):  # tqdm is a progress bar
    doc_2 = nlp(question_2)
    # One accumulator row per token; en_core_web_md vectors have length 300
    mean_vec_2 = np.zeros([len(doc_2), 300])
    for word_2 in doc_2:
        # Word vector of the token
        vec_2 = word_2.vector
        # Look up the IDF weight; tokens missing from the vocabulary get weight 0
        try:
            idf = word2tfidf[str(word_2)]
        except KeyError:
            idf = 0
        # Broadcasting adds the weighted vector to every row
        mean_vec_2 += vec_2 * idf
    # All rows are identical, so this mean equals the IDF-weighted sum of the token vectors
    mean_vec_2 = mean_vec_2.mean(axis=0)
    vecs_2.append(mean_vec_2)
df['q2_vector_features'] = list(vecs_2)


100%|██████████| 404290/404290 [46:13<00:00, 145.79it/s]
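The two loops above accumulate into a `(len(doc), 300)` array and then take the mean over axis 0. Because of broadcasting, every row receives the same additions, so the result is the IDF-weighted sum of the token vectors rather than an average. The sketch below checks this on toy data; the function names, the 3-dimensional embeddings, and the IDF values are made up for illustration and stand in for spaCy and `word2tfidf`.

```python
import numpy as np

def idf_weighted_sum(tokens, embeddings, idf, dim):
    """Plain IDF-weighted sum of token vectors (hypothetical helper)."""
    total = np.zeros(dim)
    for tok in tokens:
        # Unknown tokens contribute a zero vector with weight 0
        total += embeddings.get(tok, np.zeros(dim)) * idf.get(tok, 0.0)
    return total

def notebook_style(tokens, embeddings, idf, dim):
    """Same accumulation pattern as the notebook cells above."""
    acc = np.zeros([len(tokens), dim])
    for tok in tokens:
        # Broadcasting adds the weighted vector to every row of acc
        acc += embeddings.get(tok, np.zeros(dim)) * idf.get(tok, 0.0)
    return acc.mean(axis=0)

# Toy stand-ins: 3-dimensional "embeddings" and made-up IDF weights
embeddings = {"salt": np.array([1.0, 0.0, 2.0]), "water": np.array([0.0, 1.0, 1.0])}
idf = {"salt": 2.0, "water": 0.5}
tokens = ["salt", "water", "unknown"]

summed = idf_weighted_sum(tokens, embeddings, idf, dim=3)
looped = notebook_style(tokens, embeddings, idf, dim=3)
print(summed, looped, np.allclose(summed, looped))
```

One consequence of this design is that longer questions tend to produce larger vectors. If a length-independent feature is wanted, dividing the sum by the total IDF weight would give a weighted mean instead. Also note that for a document with no tokens the notebook-style version returns NaN values, because it takes the mean of an empty array.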


In [61]:
# Check nan values
df.isna().sum()

id                    0
qid1                  0
qid2                  0
question1             0
question2             0
is_duplicate          0
q1_vector_features    0
q2_vector_features    0
dtype: int64

In [63]:
# Flatten the vectors and merge them together

# Flatten: one column per vector dimension ('0_x'..'299_x' for question1, '0_y'..'299_y' for question2)
columns_1 = [f'{i}_x' for i in range(300)]
columns_2 = [f'{i}_y' for i in range(300)]
df_temp = df
df_temp = df_temp.drop(['qid1','qid2','question1','question2','is_duplicate','q1_vector_features','q2_vector_features'],axis=1)
df_q1 = pd.DataFrame(df.q1_vector_features.values.tolist(), index= df.index, columns=columns_1)  # word-vector features
df_q2 = pd.DataFrame(df.q2_vector_features.values.tolist(), index= df.index, columns=columns_2)  # word-vector features
df_q1['id'] = df['id']
df_q2['id'] = df['id']

# merge
df_vectors = df_temp.merge(df_q1, on='id', how='left')
df_vectors = df_vectors.merge(df_q2, on='id', how='left')

print(df_vectors.shape)
df_vectors.head()


(404290, 601)


Unnamed: 0,id,0_x,1_x,2_x,3_x,4_x,5_x,6_x,7_x,8_x,...,290_y,291_y,292_y,293_y,294_y,295_y,296_y,297_y,298_y,299_y
0,0,-5.856872,17.449559,4.86272,7.971019,20.345586,-5.514759,-4.0778,-2.820742,8.029026,...,-17.810438,7.231024,1.531186,-7.528823,0.473802,-11.864658,-11.293788,1.866265,3.616046,11.971096
1,1,-7.241549,10.424829,13.273801,-5.574235,5.964726,0.898817,4.561782,-11.213664,1.063151,...,18.333288,4.940264,-19.087384,1.978918,25.153889,1.649467,-10.371059,9.524476,-4.186575,24.111837
2,2,0.90952,16.050299,-8.126856,-4.848289,-2.80619,9.75228,4.349992,-5.120332,6.785252,...,-24.310109,-1.216773,11.909693,9.591573,11.846737,1.397859,6.454157,-0.27146,-12.500337,27.634567
3,3,-4.950745,17.098874,-15.474965,1.04468,-2.392017,-0.051889,2.650595,-8.451192,2.584123,...,-5.435584,1.672591,-0.863278,-2.906553,-3.466688,-3.867892,-4.249463,-12.551012,4.494087,-6.223341
4,4,-8.738103,21.68945,10.167188,-7.766195,-21.347514,17.447355,-19.23278,-1.405518,-21.179671,...,-10.407441,-8.444207,-14.450059,-12.709382,-4.44905,12.563987,-11.721362,-16.4593,3.626297,-9.790615
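Since memory was the limiting factor on the laptop, a lighter way to build the same wide table may be worth trying. The following is only a sketch on toy data, assuming the vector columns hold equal-length NumPy arrays: it stacks them into one `float32` matrix per question and concatenates by position instead of merging on `id`. The helper name `flatten_vectors`, the 4-dimensional toy vectors, and the `float32` downcast are assumptions, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's df: 3 rows, 4-dimensional vectors instead of 300
dim = 4
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "q1_vector_features": [np.arange(dim) + i for i in range(3)],
    "q2_vector_features": [np.arange(dim) * 2.0 + i for i in range(3)],
})

def flatten_vectors(frame, dim):
    """Hypothetical helper: expand the two vector columns into 2 * dim scalar columns."""
    parts = [frame[["id"]]]
    for column, suffix in (("q1_vector_features", "x"), ("q2_vector_features", "y")):
        # Stack the per-row arrays into one (n_rows, dim) matrix; float32 halves the memory
        matrix = np.vstack(frame[column].tolist()).astype(np.float32)
        names = [f"{i}_{suffix}" for i in range(dim)]
        parts.append(pd.DataFrame(matrix, index=frame.index, columns=names))
    # Positional concatenation on the shared index avoids two merges on 'id'
    return pd.concat(parts, axis=1)

flat = flatten_vectors(toy, dim)
print(flat.shape)
print(list(flat.columns))
```

Because all three pieces share the original index, no join is needed and no missing values can be introduced by the combination step.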


In [64]:
# Check again because of the left join

df_vectors.isna().sum()

id       0
0_x      0
1_x      0
2_x      0
3_x      0
        ..
295_y    0
296_y    0
297_y    0
298_y    0
299_y    0
Length: 601, dtype: int64

In [65]:
# Output/Load 

if os.path.isfile('Features/feature_vectors.csv'):
    df_vectors = pd.read_csv("Features/feature_vectors.csv",encoding='latin-1')
else:
    print("Extracting tfidf weighted word2vector features...")
    # If the file does not exist yet, create it; make sure the preceding cells of section 1.4 have been run first
    df_vectors.to_csv('Features/feature_vectors.csv')
    

Extracting tfidf weighted word2vector features...


**NOTE: Due to the limited computing resources of my laptop, I couldn't merge the features from `en_core_web_md` (which has 300 dimensions), so in `modeling.ipynb` I used features from `en_core_web_sm` (computed earlier and saved locally). If you want to reproduce my project results, change the model to `en_core_web_sm`.**

## Feature Engineering on Similarity Measurements

Extracting similarity measurements as a supplement to the features. **Or**, they can be used in model stacking, because each similarity measurement can serve as an independent criterion for duplicate/not duplicate.

- tf/tfidf cosine similarity (cosine distance actually): used
- Jaccard similarity (distance actually): used
- simhash:
  - Paper reference: Detecting Near-Duplicates for Web Crawling
  - https://leons.im/posts/a-python-implementation-of-simhash-algorithm/
- LSI vector: I intended to use it, but after reading the papers I found that this algorithm is meant for comparing the LSI vector of a test text against a topic model trained on a large corpus. I think it can't be used here.
  - LSI uses
  - LSA (latent semantic analysis), also known as LSI (latent semantic indexing), was put forward by Scott Deerwester, Susan T. Dumais, and co-authors
  - Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R.(1990). Indexing By Latent Semantic Analysis. Journal of the American Society For Information Science, 41, 391-407. 10
  - https://blog.csdn.net/qq_34333481/article/details/85014010

- Output: features_similarity.csv
- **This part is an addition to the model in STACKING step. For now it is not in the final features.**
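As a small worked example of the first two measures in the list, the sketch below computes word-level Jaccard and term-frequency cosine similarity for one toy question pair using only the standard library. The function names and the example sentences are illustrative assumptions. Each function returns 0.0 for empty input instead of dividing by zero, and the matching distance is simply `1 - similarity`.

```python
import math
from collections import Counter

def jaccard(s1, s2):
    """Word-set Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a, b = set(s1.split()), set(s2.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(s1, s2):
    """Cosine similarity of the raw term-frequency vectors."""
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(count * c2[word] for word, count in c1.items())
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

q1 = "how do I learn python"
q2 = "how can I learn python fast"

print("jaccard similarity:", jaccard(q1, q2), "distance:", 1 - jaccard(q1, q2))
print("cosine similarity:", cosine(q1, q2), "distance:", 1 - cosine(q1, q2))
```

Here the two questions share 4 of 7 distinct words, so the Jaccard similarity is 4/7, and the cosine similarity is 4/sqrt(30) because every word occurs once.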

In [66]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
df = pd.read_csv("Data/train.csv")
df['question1'] = df['question1'].apply(lambda x: str(x))
df['question2'] = df['question2'].apply(lambda x: str(x))
# df['text'] = [df.question1, df.question2]
df

In [None]:
# Jaccard similarity based on term-count vectors
# (the name says "tfidf", but CountVectorizer produces raw term counts)

def jaccard_similarity_tfidf(s1, s2):
    def add_space(s):
        # Joining with an empty separator returns the string unchanged
        return ''.join(list(s))
    
    s1, s2 = add_space(s1), add_space(s2)
    # Convert both questions into term-count vectors over a shared vocabulary
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Intersection: element-wise minimum of the two count vectors
    numerator = np.sum(np.min(vectors, axis=0))
    # Union: element-wise maximum of the two count vectors
    denominator = np.sum(np.max(vectors, axis=0))
    # Jaccard similarity = intersection / union
    return 1.0 * numerator / denominator
df_sim = df.copy()
df_sim['jcs_tfidf_sim'] = df_sim.apply(lambda x: jaccard_similarity_tfidf(x['question1'],x['question2']), axis=1)
df_sim


In [None]:
# Jaccard similarity

def jaccard_similarity(s1, s2):
    a = set(s1.split()) 
    b = set(s2.split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

df_sim['jcs_sim'] = df_sim.apply(lambda x: jaccard_similarity(x.loc['question1'],x.loc['question2']), axis=1)
df_sim


In [None]:
# Tf vector cosine similarity

from scipy.linalg import norm
def tf_vector_similarity(s1, s2):
    def add_space(s):
        # Joining with an empty separator returns the string unchanged
        return ''.join(list(s))

    s1, s2 = add_space(s1), add_space(s2)
    # Convert both questions into term-frequency (count) vectors
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Cosine similarity of the two tf vectors
    return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))

# Apply to df_sim: df_temp no longer contains the question columns
df_sim['tf_sim'] = df_sim.apply(lambda x: tf_vector_similarity(x.loc['question1'],x.loc['question2']), axis=1)
df_sim


In [None]:
# tfidf vector similarity

from sklearn.feature_extraction.text import TfidfVectorizer
def tfidf_similarity(s1, s2):
    def add_space(s):
        return ''.join(list(s))
    
    s1, s2 = add_space(s1), add_space(s2)
    # convert into tfidf matrix
    cv = TfidfVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # calculate tfidf vector distance by cosine distance
    return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))

df_sim['tfidf_similarity'] = df_sim.apply(lambda x: tfidf_similarity(x.loc['question1'],x.loc['question2']), axis=1)
df_sim


In [9]:
# Simhash distance (Hamming distance between the two fingerprints; smaller means more similar)

import re
from simhash import Simhash

def simhash_similarity(s1,s2):
    def add_space(s):
        return ''.join(list(s))

    def get_features(s):
        width = 3
        s = s.lower()
        s = re.sub(r'[^\w]+', '', s)
        return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]
    
    s1, s2 = add_space(s1), add_space(s2)
    return Simhash(get_features(s1)).distance(Simhash(get_features(s2)))


df_sim['sh_similarity'] = df_sim.apply(lambda x: simhash_similarity(x.loc['question1'],x.loc['question2']), axis=1)
df_sim


27
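The value returned by `Simhash(...).distance(...)` is a Hamming distance between two fingerprints, so smaller means more similar. To show how such a distance arises and how it could be mapped to a similarity in [0, 1], here is a standard-library sketch of the simhash idea. It is an assumption-laden illustration: `simhash64`, `hamming_distance`, and `simhash_similarity_score` are hypothetical names, and the fingerprints are not guaranteed to match those of the `simhash` package bit for bit.

```python
import hashlib
import re

BITS = 64  # fingerprint width used in this sketch

def char_ngrams(text, width=3):
    """Character n-grams after lowercasing and removing non-word characters."""
    text = re.sub(r"[^\w]+", "", text.lower())
    return [text[i:i + width] for i in range(max(len(text) - width + 1, 1))]

def simhash64(features):
    """Each feature votes +1/-1 per bit; the sign of each total gives the fingerprint bit."""
    votes = [0] * BITS
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            votes[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which the two fingerprints differ."""
    return bin(a ^ b).count("1")

def simhash_similarity_score(s1, s2):
    """Map the distance (0..64) to a similarity in [0, 1]."""
    d = hamming_distance(simhash64(char_ngrams(s1)), simhash64(char_ngrams(s2)))
    return 1.0 - d / BITS

print(simhash_similarity_score("How can I learn Python?", "How can I learn Python?"))
print(simhash_similarity_score("How can I learn Python?", "Which fish would survive in salt water?"))
```

Identical questions give a distance of 0 and therefore a score of 1.0. Rescaling like this would put the simhash feature on the same "higher is more similar" scale as the Jaccard and cosine features.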


## Feature Engineering on Other Vectors

Extracting other vectors: expanding len(features_final) may improve the results. But I am not doing this right now; it needs experiments.

- avg_w2v (GloVe based): can be expanded. It is the more recommended option now since it comes directly from the pretrained GloVe model, so it may connect well with an LSTM on GloVe embeddings.
https://cloud.tencent.com/developer/article/1145941
- tfidf vectors: can be expanded
  - When modeling, TFIDF features don't need scaling since they are already normalized during extraction
- Doc2Vec: gensim
- Word2Vec: gensim, using the average vector of all words in a sentence as the sentence vector.


- **This part is an addition to the model; I am not going to put it in the model for now.**
