<h2> 3.6 Featurizing text data with tfidf weighted word-vectors </h2>

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import re
import time
import warnings
import numpy as np
from nltk.corpus import stopwords
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
warnings.filterwarnings("ignore")
import sys
import os
from tqdm import tqdm

# extract word vectors with spaCy
# https://github.com/explosion/spaCy/issues/1721
# http://landinghub.visualstudio.com/visual-cpp-build-tools
import spacy

In [4]:
from google.colab import drive
drive.mount('/content/drive')

import sys
sys.path.append('/content/drive/MyDrive/Colab Notebooks/Supervised_ML_UseCase_Quora Question Pair Similarity')

Mounted at /content/drive


In [5]:
# avoid decoding problems
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Supervised_ML_UseCase_Quora Question Pair Similarity/train.csv")
 
# encode questions to unicode
# https://stackoverflow.com/a/6812069
# ----------------- python 2 ---------------------
# df['question1'] = df['question1'].apply(lambda x: unicode(str(x),"utf-8"))
# df['question2'] = df['question2'].apply(lambda x: unicode(str(x),"utf-8"))
# ----------------- python 3 ---------------------
df['question1'] = df['question1'].apply(lambda x: str(x))
df['question2'] = df['question2'].apply(lambda x: str(x))

In [6]:
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


## **For featurization we use TF-IDF weighted word vectors.**

Steps:
1. We build a vocabulary of all words in the dataset, compute the TF-IDF weight of each word (the code below stores the IDF component, taken from tfidf.idf_), and store the result in a dictionary of the form {word : weight}.
2. We use a spaCy pretrained model to obtain word vectors; **"en_core_web_lg"** ships GloVe (Global Vectors) embeddings.
3. Finally, we create TF-IDF weighted vectors for both text columns, 'question1' and 'question2'.
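
Step 1 can be illustrated on a toy corpus without scikit-learn. The sketch below is a hand-rolled approximation, not the notebook's code: it assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 and the "two or more word characters" token pattern that TfidfVectorizer uses by default. The function name smoothed_idf and the toy questions are made up for illustration.

```python
import math
import re

def smoothed_idf(documents):
    """Return {word: idf} with idf = ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    # tokens of two or more word characters; casing is kept (like lowercase=False)
    token_pattern = re.compile(r"(?u)\b\w\w+\b")
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # count each word at most once per document
        for word in set(token_pattern.findall(doc)):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

# hypothetical toy questions, not taken from train.csv
toy_questions = ["How do I learn Python", "How do I learn cooking", "What is Python"]
word2idf = smoothed_idf(toy_questions)

# rarer words receive larger weights
for word, weight in sorted(word2idf.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {weight:.4f}")
```

Words that occur in fewer questions ("cooking") get a larger weight than words shared by several questions ("Python"), which is the property the weighting relies on.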

#### **TF IDF Calculation**

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

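# Concatenate the two question columns into one corpus
questions = list(df['question1']) + list(df['question2'])

# lowercase=False keeps the original casing of every word
tfidf = TfidfVectorizer(lowercase=False)
tfidf.fit(questions)

# dict with key: word and value: IDF weight (tfidf.idf_ holds the IDF part only)
# note: scikit-learn >= 1.0 offers get_feature_names_out(); get_feature_names() was removed in 1.2
word2tfidf = dict(zip(tfidf.get_feature_names(), tfidf.idf_))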

- After computing these weights, we convert each question to a weighted combination of its word vectors, using the weights as coefficients.
- Here we use a pretrained GloVe model that ships with spaCy: https://spacy.io/usage/vectors-similarity
- Its vectors are trained on a large web corpus (Common Crawl), which makes them strong in terms of word semantics.
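
A minimal sketch of the weighting idea, assuming a plain {word : vector} lookup instead of a spaCy pipeline. The helper idf_weighted_average and the toy values are hypothetical. Unlike the notebook's loop further down, which accumulates a weighted sum, this version divides by the total weight to obtain a true weighted average.

```python
import numpy as np

def idf_weighted_average(tokens, word_vectors, word2idf):
    """Average the vectors of `tokens`, each weighted by its IDF value."""
    dim = len(next(iter(word_vectors.values())))
    weighted_sum = np.zeros(dim)
    total_weight = 0.0
    for tok in tokens:
        vec = word_vectors.get(tok)
        idf = word2idf.get(tok, 0.0)  # unknown words get weight 0
        if vec is None or idf == 0.0:
            continue
        weighted_sum += idf * np.asarray(vec, dtype=float)
        total_weight += idf
    # fall back to the zero vector when no token carried any weight
    return weighted_sum / total_weight if total_weight > 0 else weighted_sum

# hypothetical 2-dimensional vectors and IDF weights
toy_vectors = {"learn": [1.0, 0.0], "Python": [0.0, 1.0]}
toy_idf = {"learn": 1.0, "Python": 3.0}
question_vec = idf_weighted_average(["learn", "Python", "quickly"], toy_vectors, toy_idf)
print(question_vec)
```

Because "Python" carries three times the weight of "learn", the result sits three quarters of the way towards the "Python" vector, and the out-of-vocabulary token "quickly" is ignored.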

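#### **Using spaCy's pretrained word vectors**

* en_vectors_web_lg includes over 1 million unique vectors, stored as {word : vector}.
* Each of those vectors has 300 dimensions. The 'en' model loaded in the code below is smaller: its vectors have 96 dimensions, as the len(x.vector) output shows.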



In [8]:
# Download the spacy models

!python -m spacy download en_core_web_lg
# !python -m spacy download en_core_web_sm
# !python -m spacy download en_core_web_md
!python -m spacy download en

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
     |████████████████████████████████| 827.9 MB 56.1 MB/s
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-py3-none-any.whl size=829180942 sha256=4d6ed720d489ea55e66592b3d53382bab1fd9aa85d0e0240c096b24dcf5e39ab
  Stored in directory: /tmp/pip-ephem-wheel-cache-xp0niisr/wheels/11/95/ba/2c36cc368c0bd339b44a791c2c1881a1fb714b78c29a4cb8f5
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')
Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_

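> **The code below computes the TF-IDF weighted word vectors for the column 'question1':**

1. nlp is the loaded spaCy pipeline. Note that the code calls spacy.load('en'), the shortcut installed by "spacy download en", not "en_core_web_lg".
2. nlp(sentence) **(= doc1 in the code below)** returns a parsed document; each token in it exposes its vector through the .vector attribute.
3. The final weighted vector for the question is stored in **mean_vec1**.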

In [10]:
# Load the spaCy pipeline registered under the 'en' shortcut.
# Its vectors are 96-dimensional (see len(x.vector) below);
# spacy.load('en_core_web_lg') would load the large model's vectors instead.
nlp = spacy.load('en')


# List to store the weighted vector of each sentence in the 'question1' column
vecs1 = []
# https://github.com/noamraph/tqdm
# tqdm is used to print the progress bar
for qu1 in tqdm(list(df['question1'])):

    # run the spaCy pipeline; every token in doc1 carries a vector
    doc1 = nlp(qu1)
    # buffer of shape (number of tokens, vector dimension)
    mean_vec1 = np.zeros([len(doc1), len(doc1[0].vector)])
    for word1 in doc1:
        # token vector
        vec1 = word1.vector

        # fetch the IDF weight; words missing from the vocabulary get weight 0
        idf = word2tfidf.get(str(word1), 0)

        ########### COMPUTE FINAL TF IDF WEIGHTED W2V FOR COL 'question1' ######################
        # broadcasting adds the weighted vector to every row of the buffer
        mean_vec1 += vec1 * idf
    # all rows are identical, so this equals the IDF-weighted sum of the token vectors
    mean_vec1 = mean_vec1.mean(axis=0)
    vecs1.append(mean_vec1)
df['q1_feats_m'] = list(vecs1)
x=nlp('man')
len(x.vector)

100%|██████████| 404290/404290 [1:05:16<00:00, 103.22it/s]


96
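
The accumulation in the loop above is easy to misread as an average. A small check with random stand-in vectors (all names and values here are illustrative, not taken from the dataset) suggests that the buffer-plus-mean pattern yields the IDF-weighted sum of the token vectors, not a mean normalised by the total weight.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 4, 3
token_vectors = rng.normal(size=(n_tokens, dim))  # stand-ins for token.vector
idf_weights = np.array([1.5, 0.0, 2.0, 0.5])      # stand-ins for IDF weights

# Same pattern as the notebook loop: a (n_tokens, dim) buffer, each weighted
# vector broadcast onto every row, then a mean over the rows.
buffer = np.zeros([n_tokens, dim])
for vec, idf in zip(token_vectors, idf_weights):
    buffer += vec * idf
loop_result = buffer.mean(axis=0)

# Two candidate interpretations of that result
weighted_sum = idf_weights @ token_vectors
weighted_mean = weighted_sum / idf_weights.sum()

print("matches weighted sum: ", np.allclose(loop_result, weighted_sum))
print("matches weighted mean:", np.allclose(loop_result, weighted_mean))
```

This would also explain why the feature values in the tables further down are large in magnitude: they grow with the number of tokens and with the IDF weights.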

> **Similarly, computing the TF-IDF weighted vectors for the column 'question2'**

In [12]:
vecs2 = []
for qu2 in tqdm(list(df['question2'])):
    doc2 = nlp(qu2)
    # buffer of shape (number of tokens, vector dimension)
    mean_vec2 = np.zeros([len(doc2), len(doc2[0].vector)])
    for word2 in doc2:
        # token vector
        vec2 = word2.vector
        # fetch the IDF weight; words missing from the vocabulary get weight 0
        idf = word2tfidf.get(str(word2), 0)
        ########### COMPUTE FINAL TF IDF WEIGHTED W2V FOR COL 'question2' ######################
        mean_vec2 += vec2 * idf
    mean_vec2 = mean_vec2.mean(axis=0)
    vecs2.append(mean_vec2)
df['q2_feats_m'] = list(vecs2)

100%|██████████| 404290/404290 [1:08:55<00:00, 97.77it/s]


> **In the code below, we merge all the features computed so far:**
1. Simple basic features: stored in **df_fe_without_preprocessing_train.csv**
2. Advanced feature-engineering features: stored in **nlp_features_train.csv**
3. df = the original training dataset
4. q1_feats_m and q2_feats_m are the TF-IDF weighted vectors corresponding to the question1 and question2 text columns; both were added to df in the cells above.

Now all we need is to create a df with all the features and drop the text columns and any other unnecessary columns.

#### **Check whether the files mentioned above are present**

In [13]:
#df_fe_without_preprocessing_train.csv (simple basic features)
#nlp_features_train.csv (NLP features)
if os.path.isfile('/content/drive/MyDrive/Colab Notebooks/Supervised_ML_UseCase_Quora Question Pair Similarity/nlp_features_train.csv'):
    dfnlp = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Supervised_ML_UseCase_Quora Question Pair Similarity/nlp_features_train.csv",encoding='latin-1')
else:
    print("download nlp_features_train.csv from drive or run previous notebook")

if os.path.isfile('/content/drive/MyDrive/Colab Notebooks/Supervised_ML_UseCase_Quora Question Pair Similarity/df_fe_without_preprocessing_train.csv'):
    dfppro = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Supervised_ML_UseCase_Quora Question Pair Similarity/df_fe_without_preprocessing_train.csv",encoding='latin-1')
else:
    print("download df_fe_without_preprocessing_train.csv from drive or run previous notebook")

A few notations:
> 1. dfnlp = df with advanced FE features
2. dfppro = df with basic FE features
3. df3_q1 = df holding the TF-IDF weighted word vectors for 'question1', taken from df
4. df3_q2 = df holding the TF-IDF weighted word vectors for 'question2', taken from df

**The final feature df that will be fed to the model = result df**

It holds (xi, yi), where

**input features = xi = basic FE features + advanced FE features + TF-IDF weighted w2v for question 1 + TF-IDF weighted w2v for question 2**

**Target col = yi = is_duplicate.**
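
The assembly of (xi, yi) can be sketched on tiny stand-in frames. Everything below is illustrative: the frame names, values, and the q1_w2v_ / q2_w2v_ column prefixes are assumptions, not the notebook's data. Renaming the vector columns before merging is one way to avoid the _x / _y suffixes that pandas adds when both frames share the same column names.

```python
import pandas as pd

# hypothetical stand-ins for the basic, advanced, and vector feature frames
basic = pd.DataFrame({"id": [0, 1], "q1len": [66, 51]})
advanced = pd.DataFrame({"id": [0, 1], "fuzz_ratio": [93, 64], "is_duplicate": [0, 0]})
q1_vecs = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]])
q2_vecs = pd.DataFrame([[0.5, 0.6], [0.7, 0.8]])

# give the vector columns distinct names, then attach the merge key
q1_vecs.columns = [f"q1_w2v_{i}" for i in q1_vecs.columns]
q2_vecs.columns = [f"q2_w2v_{i}" for i in q2_vecs.columns]
q1_vecs["id"] = basic["id"]
q2_vecs["id"] = basic["id"]

# left-merge everything on the shared id column
result = (
    advanced.merge(basic, on="id", how="left")
            .merge(q1_vecs, on="id", how="left")
            .merge(q2_vecs, on="id", how="left")
)

# split into input features xi and target yi
y = result["is_duplicate"]
X = result.drop(columns=["id", "is_duplicate"])
print(X.columns.tolist())
```

Dropping id and is_duplicate from X keeps the row identifier and the label out of the model inputs.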


In [14]:
df1 = dfnlp.drop(['qid1','qid2','question1','question2'],axis=1)
df2 = dfppro.drop(['qid1','qid2','question1','question2','is_duplicate'],axis=1)
df3 = df.drop(['qid1','qid2','question1','question2','is_duplicate'],axis=1)
df3_q1 = pd.DataFrame(df3.q1_feats_m.values.tolist(), index= df3.index)
df3_q2 = pd.DataFrame(df3.q2_feats_m.values.tolist(), index= df3.index)

In [15]:
# dataframe of nlp features
df1.head()

Unnamed: 0,id,is_duplicate,cwc_min,cwc_max,csc_min,csc_max,ctc_min,ctc_max,last_word_eq,first_word_eq,abs_len_diff,mean_len,token_set_ratio,token_sort_ratio,fuzz_ratio,fuzz_partial_ratio,longest_substr_ratio
0,0,0,0.999983,0.857131,0.0,0.0,0.857131,0.749991,1,1,1,7.5,100,93,93,90,0.837209
1,1,0,0.833319,0.454541,0.0,0.0,0.833319,0.454541,1,0,5,8.5,91,64,64,87,0.692308
2,2,0,0.499992,0.428565,0.0,0.0,0.499992,0.428565,1,0,1,6.5,70,68,62,56,0.204545
3,3,0,0.249994,0.249994,0.0,0.0,0.249994,0.249994,1,0,0,4.0,39,39,39,44,0.241379
4,4,0,0.499992,0.272725,0.0,0.0,0.499992,0.272725,1,0,5,8.5,64,45,35,44,0.189189


In [16]:
# dataframe of basic features (computed before preprocessing)
df2.head()

Unnamed: 0,id,freq_qid1,freq_qid2,q1len,q2len,q1_n_words,q2_n_words,word_Common,word_Total,word_share,freq_q1+q2,freq_q1-q2
0,0,1,1,66,57,14,12,10.0,23.0,0.434783,2,0
1,1,4,1,51,88,8,13,4.0,20.0,0.2,5,3
2,2,1,1,73,59,14,10,4.0,24.0,0.166667,2,0
3,3,1,1,50,65,11,9,0.0,19.0,0.0,2,0
4,4,3,1,76,39,13,7,2.0,20.0,0.1,4,2


In [17]:
df3.head()

Unnamed: 0,id,q1_feats_m,q2_feats_m
0,0,"[-6.179506778717041, 37.45073118805885, -67.92...","[-14.616980731487274, 59.75548753142357, -53.2..."
1,1,"[9.236667931079865, -80.37141644954681, -45.78...","[-3.5657422859221697, -16.844570636749268, -13..."
2,2,"[97.54682850837708, 22.972195133566856, -39.55...","[156.8336295336485, 59.99189615249634, -8.4143..."
3,3,"[57.58699941635132, -22.017087638378143, -4.59...","[41.47243919968605, 56.71731689572334, 31.5306..."
4,4,"[83.1857842206955, -40.50698482990265, -83.403...","[-14.446974992752075, -4.33825546503067, -70.1..."


In [18]:
# Question 1 TF-IDF weighted word vectors
df3_q1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,86,87,88,89,90,91,92,93,94,95
0,-6.179507,37.450731,-67.929894,32.224274,143.348826,135.374574,17.865208,54.562352,81.618936,232.909839,...,-71.834689,-60.222858,-22.026407,103.33672,-68.477445,-54.976584,-67.802663,116.269999,60.515897,-12.245916
1,9.236668,-80.371416,-45.785907,78.291656,183.568221,100.894077,74.344804,48.360802,127.297421,112.987302,...,-32.130515,-98.080325,19.11379,-20.507508,-76.981011,82.665075,41.085582,129.377781,115.868467,4.383543
2,97.546829,22.972195,-39.558378,18.723416,56.92862,48.307643,8.719268,36.893737,106.899948,226.28308,...,-66.835015,87.592131,4.032431,56.851709,-43.62541,-57.580963,-50.425829,78.591986,105.714348,-33.304161
3,57.586999,-22.017088,-4.599304,-88.939273,-4.732172,-54.209038,74.614942,106.533731,15.520623,39.009711,...,28.362956,41.981221,-11.204984,16.833434,-36.372471,8.927573,-64.553194,95.054238,-34.157566,70.821932
4,83.185784,-40.506985,-83.403923,-52.648658,79.074884,-19.038248,53.728722,97.648612,160.555822,290.541356,...,-4.390959,109.604406,-91.160167,-25.739913,133.123058,-13.508816,-100.115211,208.424382,286.930889,68.027638


In [19]:
# Question 2 TF-IDF weighted word vectors
df3_q2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,86,87,88,89,90,91,92,93,94,95
0,-14.616981,59.755488,-53.263745,19.514497,113.916473,101.657056,8.561499,66.232769,32.888127,210.812733,...,-72.266625,-37.072086,-31.14273,94.064854,-45.053242,-34.155221,-76.548099,99.282776,50.791731,-17.566246
1,-3.565742,-16.844571,-130.911785,0.320254,79.350278,23.562028,79.124551,84.119839,128.684135,279.539877,...,6.193171,-65.084229,-15.654534,-3.475828,26.999802,170.172613,-57.038953,194.269546,128.207803,55.490061
2,156.83363,59.991896,-8.414311,29.251426,133.680218,112.457566,89.849781,21.613022,24.331766,171.11449,...,-26.185226,-19.283218,75.602438,24.144027,-91.874398,-178.454113,-91.471482,19.922719,21.26669,49.574858
3,41.472439,56.717317,31.530616,-5.520164,33.4548,79.596179,15.508996,40.042066,21.094017,101.998116,...,-17.779019,30.152297,49.300137,27.783795,25.937188,-32.107076,-3.817634,-14.231,4.772115,7.711628
4,-14.446975,-4.338255,-70.196208,-48.636382,18.356858,-50.807069,24.311196,60.043674,32.421993,57.148702,...,36.089472,47.193216,-49.969586,44.796028,39.740803,-33.763309,-98.282341,22.118795,68.802072,21.025373


In [20]:
print("Number of features in nlp dataframe :", df1.shape[1])
print("Number of features in preprocessed dataframe :", df2.shape[1])
print("Number of features in question1 w2v  dataframe :", df3_q1.shape[1])
print("Number of features in question2 w2v  dataframe :", df3_q2.shape[1])
print("Number of features in final dataframe  :", df1.shape[1]+df2.shape[1]+df3_q1.shape[1]+df3_q2.shape[1])

Number of features in nlp dataframe : 17
Number of features in preprocessed dataframe : 12
Number of features in question1 w2v  dataframe : 96
Number of features in question2 w2v  dataframe : 96
Number of features in final dataframe  : 221


### Merging all dfs to create the result df that will be fed to the model

In [21]:
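# store the final features in a csv file (skipped if the file already exists)
if not os.path.isfile('/content/drive/MyDrive/Colab Notebooks/Supervised_ML_UseCase_Quora Question Pair Similarity/final_features.csv'):
    # attach the id column so the vector frames can be merged on it
    df3_q1['id']=df1['id']
    df3_q2['id']=df1['id']
    # advanced + basic features
    df1  = df1.merge(df2, on='id',how='left')
    # question1 + question2 vectors; both frames use column names 0..95,
    # so pandas appends the suffixes _x and _y to tell them apart
    df2  = df3_q1.merge(df3_q2, on='id',how='left')
    result  = df1.merge(df2, on='id',how='left')
    result.to_csv('/content/drive/MyDrive/Colab Notebooks/Supervised_ML_UseCase_Quora Question Pair Similarity/final_features.csv')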