<h2> 3.6 Featurizing text data with TF-IDF weighted word vectors (and avoiding data leakage) </h2>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import re
import time
import warnings
import numpy as np
from nltk.corpus import stopwords
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
warnings.filterwarnings("ignore")
import sys
import os
from tqdm import tqdm

# extract word vectors with spaCy
# https://github.com/explosion/spaCy/issues/1721
# http://landinghub.visualstudio.com/visual-cpp-build-tools
import spacy

In [2]:
# avoid decoding problems
df = pd.read_csv('train.csv')
 
# encode questions to unicode
# https://stackoverflow.com/a/6812069
# ----------------- python 2 ---------------------
# df['question1'] = df['question1'].apply(lambda x: unicode(str(x),"utf-8"))
# df['question2'] = df['question2'].apply(lambda x: unicode(str(x),"utf-8"))
# ----------------- python 3 ---------------------
df['question1'] = df['question1'].apply(lambda x: str(x))
df['question2'] = df['question2'].apply(lambda x: str(x))

In [3]:
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [4]:
from sklearn.model_selection import train_test_split

X = df.drop(['is_duplicate'], axis=1)
Y = df['is_duplicate']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)

In [5]:
# TfidfVectorizer is already imported in the first cell
# merge texts
questions_train = list(x_train['question1']) + list(x_train['question2'])

# fit on the training questions only, so that no IDF statistics come from the test split
tfidf = TfidfVectorizer(lowercase=False)
tfidf.fit(questions_train)

# dict key: word, value: IDF score
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0+)
word2tfidf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

- After we find the IDF scores on the training questions, we convert each question to a weighted average of its word vectors, using these scores as the weights.
- Here we use a pre-trained GloVe model which comes free with spaCy. https://spacy.io/usage/vectors-similarity
- It is trained on Wikipedia and is therefore stronger in terms of word semantics.
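
The weighting described in these notes can be sketched on toy data as follows. The 3-dimensional vectors, the IDF values, and the helper name `weighted_average` are made up for illustration and do not come from the spaCy model or the Quora data. Dividing by the total weight is one common way to turn the weighted sum into an average and is an assumption of this sketch.

```python
import numpy as np

# Toy word vectors and IDF scores (hypothetical values for illustration only)
toy_vectors = {
    "how": np.array([1.0, 0.0, 0.0]),
    "learn": np.array([0.0, 2.0, 0.0]),
    "python": np.array([0.0, 0.0, 4.0]),
}
toy_idf = {"how": 1.0, "learn": 3.0}  # "python" is missing, so it gets weight 0


def weighted_average(tokens, vectors, idf):
    """IDF-weighted average of the word vectors of `tokens`."""
    dim = len(next(iter(vectors.values())))
    total = np.zeros(dim)
    weight_sum = 0.0
    for tok in tokens:
        w = idf.get(tok, 0.0)  # words without an IDF score contribute nothing
        total += w * vectors[tok]
        weight_sum += w
    # guard against questions whose words all have zero weight
    return total / weight_sum if weight_sum > 0 else total


question_vec = weighted_average(["how", "learn", "python"], toy_vectors, toy_idf)
print(question_vec)
```

Words with a high IDF score (rare in the training questions) pull the question vector toward their own direction, while words without a score are ignored.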

In [6]:
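# en_vectors_web_lg includes over 1 million unique vectors;
# this notebook loads the smaller en_core_web_sm model instead.
nlp = spacy.load('en_core_web_sm')

vecs1 = []
# https://github.com/noamraph/tqdm
# tqdm is used to print the progress bar
for qu1 in tqdm(list(x_train['question1'])):
    doc1 = nlp(qu1)
    # one row per token; the number of columns is the vector dimension of the model
    # (96 in the run shown below)
    mean_vec1 = np.zeros([len(doc1), len(doc1[0].vector)])
    for word1 in doc1:
        # word vector of the token
        vec1 = word1.vector
        # fetch the IDF score; words not in the training vocabulary get weight 0
        try:
            idf = word2tfidf[str(word1)]
        except KeyError:
            idf = 0
        # accumulate the IDF-weighted vector (broadcast to every row)
        mean_vec1 += vec1 * idf
    # all rows are identical, so the mean over rows collapses them to one vector
    mean_vec1 = mean_vec1.mean(axis=0)
    vecs1.append(mean_vec1)
x_train['q1_feats_m'] = list(vecs1)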

100%|█████████████████████████████████████████████████████████████████████████| 303217/303217 [50:07<00:00, 100.81it/s]


In [7]:
vecs2 = []
for qu2 in tqdm(list(x_train['question2'])):
    doc2 = nlp(qu2)
    # size the buffer from doc2 (the original used len(doc1) from the previous cell)
    mean_vec2 = np.zeros([len(doc2), len(doc2[0].vector)])
    for word2 in doc2:
        # word vector of the token
        vec2 = word2.vector
        # fetch the IDF score; words not in the training vocabulary get weight 0
        try:
            idf = word2tfidf[str(word2)]
        except KeyError:
            idf = 0
        # accumulate the IDF-weighted vector
        mean_vec2 += vec2 * idf
    mean_vec2 = mean_vec2.mean(axis=0)
    vecs2.append(mean_vec2)
x_train['q2_feats_m'] = list(vecs2)

100%|████████████████████████████████████████████████████████████████████████| 303217/303217 [1:31:49<00:00, 55.04it/s]


In [8]:
# reuse the spaCy model (en_core_web_sm) and the IDF scores fitted on the training data

vecs1 = []
# https://github.com/noamraph/tqdm
# tqdm is used to print the progress bar
for qu1 in tqdm(list(x_test['question1'])):
    doc1 = nlp(qu1)
    # one row per token; the number of columns is the vector dimension of the model
    # (96 in the run shown below)
    mean_vec1 = np.zeros([len(doc1), len(doc1[0].vector)])
    for word1 in doc1:
        # word vector of the token
        vec1 = word1.vector
        # fetch the IDF score; test words not in the training vocabulary get weight 0
        try:
            idf = word2tfidf[str(word1)]
        except KeyError:
            idf = 0
        # accumulate the IDF-weighted vector
        mean_vec1 += vec1 * idf
    mean_vec1 = mean_vec1.mean(axis=0)
    vecs1.append(mean_vec1)
x_test['q1_feats_m'] = list(vecs1)


100%|█████████████████████████████████████████████████████████████████████████| 101073/101073 [16:32<00:00, 101.86it/s]


In [9]:
vecs2 = []
for qu2 in tqdm(list(x_test['question2'])):
    doc2 = nlp(qu2)
    # size the buffer from doc2 (the original used len(doc1) from the previous cell)
    mean_vec2 = np.zeros([len(doc2), len(doc2[0].vector)])
    for word2 in doc2:
        # word vector of the token
        vec2 = word2.vector
        # fetch the IDF score; test words not in the training vocabulary get weight 0
        try:
            idf = word2tfidf[str(word2)]
        except KeyError:
            idf = 0
        # accumulate the IDF-weighted vector
        mean_vec2 += vec2 * idf
    mean_vec2 = mean_vec2.mean(axis=0)
    vecs2.append(mean_vec2)
x_test['q2_feats_m'] = list(vecs2)

100%|█████████████████████████████████████████████████████████████████████████| 101073/101073 [15:59<00:00, 105.39it/s]


In [10]:
# prepro_features_train.csv (Simple Preprocessing Features)
# nlp_features_train.csv (NLP Features)
if os.path.isfile('nlp_features_train.csv'):
    dfnlp = pd.read_csv('nlp_features_train.csv',encoding='latin-1')
else:
    print("download nlp_features_train.csv from drive or run previous notebook")

if os.path.isfile('df_fe_without_preprocessing_train.csv'):
    dfppro = pd.read_csv('df_fe_without_preprocessing_train.csv',encoding='latin-1')
else:
    print("download df_fe_without_preprocessing_train.csv from drive or run previous notebook")

In [11]:
# For training data
df1_train = dfnlp.drop(['qid1','qid2','question1','question2'],axis=1)

In [12]:
df2_train = dfppro.drop(['qid1','qid2','question1','question2','is_duplicate'],axis=1)

In [13]:
df3_train = x_train.drop(['qid1','qid2','question1','question2'],axis=1)

In [14]:
df3_q1_train = pd.DataFrame(df3_train.q1_feats_m.values.tolist(), index= df3_train.index)

In [15]:
df3_q2_train = pd.DataFrame(df3_train.q2_feats_m.values.tolist(), index= df3_train.index)

In [16]:
# for test data
df1_test = dfnlp.drop(['qid1','qid2','question1','question2'],axis=1)
df2_test = dfppro.drop(['qid1','qid2','question1','question2','is_duplicate'],axis=1)
df3_test = x_test.drop(['qid1','qid2','question1','question2'],axis=1)
df3_q1_test = pd.DataFrame(df3_test.q1_feats_m.values.tolist(), index= df3_test.index)
df3_q2_test = pd.DataFrame(df3_test.q2_feats_m.values.tolist(), index= df3_test.index)

In [17]:
# dataframe of nlp features
df1_train.head()

Unnamed: 0,id,is_duplicate,cwc_min,cwc_max,csc_min,csc_max,ctc_min,ctc_max,last_word_eq,first_word_eq,abs_len_diff,mean_len,token_set_ratio,token_sort_ratio,fuzz_ratio,fuzz_partial_ratio,longest_substr_ratio
0,0,0,0.99998,0.833319,0.999983,0.999983,0.916659,0.785709,0.0,1.0,2.0,13.0,100,93,93,100,0.982759
1,1,0,0.799984,0.399996,0.749981,0.599988,0.699993,0.466664,0.0,1.0,5.0,12.5,86,63,66,75,0.596154
2,2,0,0.399992,0.333328,0.399992,0.249997,0.399996,0.285712,0.0,1.0,4.0,12.0,66,66,54,54,0.166667
3,3,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,12.0,36,36,35,40,0.039216
4,4,0,0.399992,0.199998,0.99995,0.666644,0.57142,0.30769,0.0,1.0,6.0,10.0,67,47,46,56,0.175


In [18]:
# data before preprocessing 
df2_train.head()

Unnamed: 0,id,freq_qid1,freq_qid2,q1len,q2len,q1_n_words,q2_n_words,word_Common,word_Total,word_share,freq_q1+q2,freq_q1-q2
0,0,1,1,66,57,14,12,10.0,23.0,0.434783,2,0
1,1,4,1,51,88,8,13,4.0,20.0,0.2,5,3
2,2,1,1,73,59,14,10,4.0,24.0,0.166667,2,0
3,3,1,1,50,65,11,9,0.0,19.0,0.0,2,0
4,4,3,1,76,39,13,7,2.0,20.0,0.1,4,2


In [19]:
# Questions 1 tfidf weighted word2vec
df3_q1_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,86,87,88,89,90,91,92,93,94,95
83417,49.164967,-7.515318,-43.009197,-27.019599,-25.075973,-9.738226,6.218718,56.249758,-41.994091,10.76034,...,21.943948,-0.93184,46.867764,66.01474,22.359294,2.832393,27.19785,26.15371,-15.238912,26.506474
18078,-23.736242,-36.713662,9.411465,7.044339,58.440789,13.387595,25.450284,-28.827919,25.998447,7.440075,...,57.700994,55.007597,1.620057,40.43962,33.170732,-39.274066,1.037565,-2.523261,3.09048,56.169626
120973,6.01525,-15.042102,-36.202792,-85.561301,-18.920346,94.111866,16.856228,57.554399,14.546702,48.938357,...,48.146343,-16.714357,32.708047,15.166065,3.296858,50.682767,-4.684795,-5.448235,-34.076489,29.545323
72850,-7.134538,-82.385818,-116.475997,-161.96546,-181.466732,23.632468,158.875139,43.394411,-47.024335,37.04169,...,170.269542,-110.402979,107.767663,100.081922,69.35381,4.177803,-0.915477,60.251142,-66.65,30.153067
67919,0.944909,-132.876785,-33.981938,-111.552995,-119.988276,67.066594,166.350327,63.182713,14.377173,81.81612,...,70.559373,35.221791,89.512389,31.656399,90.281584,-17.489218,-92.718966,0.236988,-88.683642,121.030736


In [20]:
# Questions 2 tfidf weighted word2vec
df3_q2_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,86,87,88,89,90,91,92,93,94,95
83417,51.816493,-27.657055,15.315478,-10.919976,-12.527575,13.306459,14.680881,33.76668,-35.703388,-3.249567,...,22.843261,13.393338,37.914846,57.254539,-3.689842,-18.530443,15.006716,23.949595,1.012888,24.238544
18078,-31.80423,-36.996335,12.322971,1.480402,56.8224,9.517426,25.550478,-29.384527,24.854003,1.397823,...,67.356533,51.588196,4.253127,39.647725,36.389304,-32.731867,13.473475,-6.410661,-2.119679,57.101519
120973,116.347245,20.973378,-56.041001,-133.13331,-27.505199,53.927895,94.812116,60.499524,-9.30769,53.923801,...,-22.423991,-6.845423,47.935492,25.758929,-46.274192,11.261316,5.000776,15.189243,-16.187234,-56.452883
72850,38.735829,-10.461823,-93.884742,-52.6627,-36.030887,76.611615,136.40216,-0.890572,-0.673485,69.447084,...,56.003407,15.687673,41.743254,42.898374,47.642856,-18.526762,-69.699913,31.128019,11.705194,-19.648984
67919,293.825781,-247.266625,-81.151038,-220.550102,-184.676749,-1.711183,424.290608,16.513845,30.050685,317.49721,...,128.912791,69.816996,139.464098,97.093489,62.850065,17.820628,-144.130104,144.163519,-115.206777,134.677474


In [21]:
# dataframe of nlp features
df1_test.head()

Unnamed: 0,id,is_duplicate,cwc_min,cwc_max,csc_min,csc_max,ctc_min,ctc_max,last_word_eq,first_word_eq,abs_len_diff,mean_len,token_set_ratio,token_sort_ratio,fuzz_ratio,fuzz_partial_ratio,longest_substr_ratio
0,0,0,0.99998,0.833319,0.999983,0.999983,0.916659,0.785709,0.0,1.0,2.0,13.0,100,93,93,100,0.982759
1,1,0,0.799984,0.399996,0.749981,0.599988,0.699993,0.466664,0.0,1.0,5.0,12.5,86,63,66,75,0.596154
2,2,0,0.399992,0.333328,0.399992,0.249997,0.399996,0.285712,0.0,1.0,4.0,12.0,66,66,54,54,0.166667
3,3,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,12.0,36,36,35,40,0.039216
4,4,0,0.399992,0.199998,0.99995,0.666644,0.57142,0.30769,0.0,1.0,6.0,10.0,67,47,46,56,0.175


In [22]:
# data before preprocessing 
df2_test.head()

Unnamed: 0,id,freq_qid1,freq_qid2,q1len,q2len,q1_n_words,q2_n_words,word_Common,word_Total,word_share,freq_q1+q2,freq_q1-q2
0,0,1,1,66,57,14,12,10.0,23.0,0.434783,2,0
1,1,4,1,51,88,8,13,4.0,20.0,0.2,5,3
2,2,1,1,73,59,14,10,4.0,24.0,0.166667,2,0
3,3,1,1,50,65,11,9,0.0,19.0,0.0,2,0
4,4,3,1,76,39,13,7,2.0,20.0,0.1,4,2


In [23]:
# Questions 1 tfidf weighted word2vec
df3_q1_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,86,87,88,89,90,91,92,93,94,95
45878,108.512224,-255.063471,-122.352259,-198.36505,-125.695904,-11.738605,346.204849,92.557456,-40.650279,106.917734,...,184.132206,-29.228197,83.525766,72.367302,77.996118,-15.69684,-206.128834,-10.458931,-91.44867,63.862342
52053,47.716367,-67.745529,-74.272315,-92.213808,-72.231411,40.20442,67.472677,22.584714,19.771025,91.779441,...,61.461055,-24.361123,43.595884,-29.863006,-2.643452,13.965949,-12.388497,-71.443006,-2.152569,34.350843
155631,105.055554,-97.229124,-114.180254,-100.36408,-20.844604,95.734463,53.376798,24.586539,-25.185283,105.760012,...,74.924185,-80.786667,69.782662,-25.033468,21.180916,34.787183,-4.02042,-10.436795,-40.063841,82.84273
297900,132.283333,39.353368,-80.283349,-184.389122,-13.083245,85.74717,236.005737,-12.444157,40.363744,75.109083,...,156.6589,-46.733946,59.08896,-20.882608,-109.962933,64.642684,97.837591,-54.326977,-139.738085,45.029019
93889,-89.048923,-41.371482,-30.063032,-66.04709,-34.883603,71.05077,26.085203,53.258529,14.928114,70.621561,...,154.281623,-30.948808,11.195291,-6.661281,50.650639,18.390653,-9.735798,66.153026,12.781212,16.550301


In [24]:
# Questions 2 tfidf weighted word2vec
df3_q2_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,86,87,88,89,90,91,92,93,94,95
45878,107.004255,-283.485952,-123.542157,-175.671925,-152.253458,-14.654537,369.251881,86.157264,-69.394417,130.553549,...,183.1943,1.832329,137.888876,-24.930951,64.153649,-47.642578,-231.800811,-27.567359,-117.981302,93.018823
52053,50.129327,18.642116,-50.673357,-67.589411,-78.043364,96.753222,79.473529,21.789881,-40.621285,82.516327,...,47.474984,-37.980852,20.536803,2.086258,16.795736,-33.162836,-15.980553,33.74889,-12.886731,-29.061675
155631,71.845877,-40.905753,-57.471648,-70.718917,-85.758945,81.365013,13.961524,-21.185144,-15.829755,80.833505,...,97.263362,-122.277366,15.444636,49.902564,64.97844,32.659596,51.212003,27.809269,40.037895,22.691027
297900,71.990897,-15.130752,-132.67369,-192.307275,-92.292456,-6.963554,233.33899,-11.071356,-22.267205,172.296701,...,182.861974,-90.028779,23.20335,76.697841,34.175856,78.88746,-23.587941,-20.899722,-179.652954,56.47264
93889,-18.445908,15.843467,-55.108818,-94.422083,16.687356,97.955043,127.180348,33.306696,61.393506,69.6108,...,82.553756,-10.968116,41.223866,-2.001688,18.295492,-37.093059,-17.544755,3.790991,-35.421965,-28.627527


In [25]:
# for training data
print("Number of features in nlp dataframe :", df1_train.shape[1])
print("Number of features in preprocessed dataframe :", df2_train.shape[1])
print("Number of features in question1 w2v  dataframe :", df3_q1_train.shape[1])
print("Number of features in question2 w2v  dataframe :", df3_q2_train.shape[1])
print("Number of features in final dataframe  :", df1_train.shape[1]+df2_train.shape[1]+df3_q1_train.shape[1]+df3_q2_train.shape[1])

Number of features in nlp dataframe : 17
Number of features in preprocessed dataframe : 12
Number of features in question1 w2v  dataframe : 96
Number of features in question2 w2v  dataframe : 96
Number of features in final dataframe  : 221


In [26]:
# for test data
print("Number of features in nlp dataframe :", df1_test.shape[1])
print("Number of features in preprocessed dataframe :", df2_test.shape[1])
print("Number of features in question1 w2v  dataframe :", df3_q1_test.shape[1])
print("Number of features in question2 w2v  dataframe :", df3_q2_test.shape[1])
print("Number of features in final dataframe  :", df1_test.shape[1]+df2_test.shape[1]+df3_q1_test.shape[1]+df3_q2_test.shape[1])

Number of features in nlp dataframe : 17
Number of features in preprocessed dataframe : 12
Number of features in question1 w2v  dataframe : 96
Number of features in question2 w2v  dataframe : 96
Number of features in final dataframe  : 221


In [28]:
# storing the final features of training data to csv file
if not os.path.isfile('final_features_train.csv'):
    df3_q1_train['id']=df1_train['id']
    df3_q2_train['id']=df1_train['id']
    df1_train  = df1_train.merge(df2_train, on='id',how='left')
    df2_train  = df3_q1_train.merge(df3_q2_train, on='id',how='left')
    # df1_train still holds every row of the dataset, so use an inner join
    # to keep only the rows that belong to the training split
    result_train  = df1_train.merge(df2_train, on='id',how='inner')
    result_train.to_csv('final_features_train.csv')

In [29]:
# storing the final features of test data to csv file
if not os.path.isfile('final_features_test.csv'):
    df3_q1_test['id']=df1_test['id']
    df3_q2_test['id']=df1_test['id']
    df1_test  = df1_test.merge(df2_test, on='id',how='left')
    df2_test  = df3_q1_test.merge(df3_q2_test, on='id',how='left')
    # df1_test still holds every row of the dataset, so use an inner join
    # to keep only the rows that belong to the test split
    result_test  = df1_test.merge(df2_test, on='id',how='inner')
    result_test.to_csv('final_features_test.csv')

### Observation:

1. Up to this stage, we split the data into train and test sets, fitted the TF-IDF vectorizer on the training questions only, and then used its scores to build the TF-IDF weighted word vectors for both the train and test questions.
2. We now have two final feature CSV files, one for the training data and one for the test data.
3. These final train and test features will be used for training and hyperparameter tuning of the XGBoost model in the next IPython notebook, "Quora_Case_Study_Assignment.ipynb".
4. This whole exercise was done to correct the data leakage that was previously present in this IPython notebook.