# Word2Vec for Description Columns

Goal: Use the description columns from our dataframe to train a Word2Vec model. We are trying to capture sentence similarity, so after training the model we take every description in the description columns and average its word vectors (add them, then divide by the number of words) to get one vector per description. Cosine similarity between two such vectors then tells us whether the two descriptions are similar. Dividing by the number of words keeps the stored vectors on a comparable scale, so that the words, not the length of the description, drive the comparison. (Cosine similarity itself ignores vector magnitude, so the averaging matters mainly for any other use of these vectors.)

Outline
* Create the model from the description columns: collect all the words and train Word2Vec on them
* Apply the model to every real estate listing description, averaging the vectors of all words in the description to create a single vector that represents it (sum the word vectors, then divide by the number of words)
* Export those NumPy arrays and compute cosine similarity in the feature building notebook
* The arrays are exported because my kernel kept crashing, so I wanted to save the output as soon as possible
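
The averaging-and-comparison idea in the outline can be sketched with plain NumPy. This is a minimal illustration, not the notebook's actual pipeline: the tiny `toy_vectors` dictionary and the helper names `mean_vector` and `cosine_similarity` are made up for the example and stand in for a trained Word2Vec model.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained model
toy_vectors = {
    "sea": np.array([1.0, 0.0, 0.0]),
    "view": np.array([0.0, 1.0, 0.0]),
    "garage": np.array([0.0, 0.0, 1.0]),
}

def mean_vector(tokens, vectors):
    """Average the word vectors of a tokenized description."""
    return np.mean([vectors[t] for t in tokens], axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

short_desc = mean_vector(["sea", "view"], toy_vectors)
# Same words repeated: a longer description with the same mean vector
long_desc = mean_vector(["sea", "view", "sea", "view"], toy_vectors)
other_desc = mean_vector(["garage"], toy_vectors)

sim_same = cosine_similarity(short_desc, long_desc)    # same words, different length
sim_diff = cosine_similarity(short_desc, other_desc)   # no words in common
```

In this toy setup the two descriptions that share the same words score close to 1 regardless of their length, while the unrelated one scores 0.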

Note: We have to train Word2Vec on our training set and then apply that model to our testing set. Word2Vec improves with more data, so the results from this feature could be better with a larger corpus.
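
When the trained model is applied to the testing set, some words may never have appeared in training and so have no vector. One possible way to handle this, sketched below, is to skip such words before averaging. The `toy_vectors` dictionary is a hypothetical stand-in for `model.wv`, `mean_vector_skip_unknown` is a made-up helper name, and the ones fallback simply mirrors what this notebook does for empty descriptions.

```python
import numpy as np

# Hypothetical 2-dimensional vectors standing in for model.wv
toy_vectors = {"pool": np.array([1.0, 3.0]), "garden": np.array([3.0, 1.0])}
DIM = 2  # vector size of the toy "model"

def mean_vector_skip_unknown(tokens, vectors, dim):
    """Average the vectors of known words; fall back to ones if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.ones(dim)
    return np.mean(known, axis=0)

# "jacuzzi" is not in the toy vocabulary, so it is ignored
vec = mean_vector_skip_unknown(["pool", "jacuzzi", "garden"], toy_vectors, DIM)
# No known words at all: the fallback vector is returned
fallback = mean_vector_skip_unknown(["jacuzzi"], toy_vectors, DIM)
```

With gensim, the same membership test should be expressible as `word in model.wv`, though that is worth confirming against the installed version.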

In [1]:
import pandas as pd
import re
import numpy as np
import nltk
from gensim.models import Word2Vec
from nltk.corpus import stopwords
import gensim
from collections import Counter
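# The NLTK 'stopwords' corpus and the tokenizer data used by word_tokenize
# must be downloaded beforehand with nltk.download(...)
stop_words = stopwords.words('english')
# Counter is used only for fast membership tests; a set would work equally well
stopwords_dict = Counter(stop_words)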

## Build Model

In [2]:
cols = ['description_id1' , 'description_id2']
df = pd.read_csv('cleaned_data.csv' , usecols=cols)

In [3]:
df.head(2) # Preview the two description columns used to build the Word2Vec model

,description_id1,description_id2
0,"Strength, tradition and serenity around 10,000...",Magnificent Mallorquinian Mansion of XVII cent...
1,Magnificent Mallorquinian Mansion of XVII cent...,Magnificent Mallorquinian Mansion of XVII cent...


In [4]:
# Clean string data for description_id1 column
description_id1 = []
for i in df['description_id1']:
    description_id1.append(re.sub(r'\W+', ' ', i ).lower())

In [5]:
# Tokenize description_id1
description_id1 = [nltk.word_tokenize(sentence) for sentence in description_id1]

In [6]:
# Remove Stopwords from description_id1
for i in range(len(description_id1)):
    description_id1[i] = [word for word in description_id1[i] if word not in stopwords_dict]

In [7]:
# Clean string data for description_id2 column
description_id2 = []
for i in df['description_id2']:
    description_id2.append(re.sub(r'\W+', ' ', i ).lower())

In [8]:
# Tokenize description_id2
description_id2 = [nltk.word_tokenize(sentence) for sentence in description_id2]

In [9]:
# Remove Stopwords from description_id2
for i in range(len(description_id2)):
    description_id2[i] = [word for word in description_id2[i] if word not in stopwords_dict]

In [10]:
# Combine tokenized columns
description = description_id1 + description_id2
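
Since both columns go through the same three steps (clean, tokenize, remove stop words), the steps could be wrapped in one helper. The sketch below is an illustration only: `preprocess` is a hypothetical name, `SMALL_STOP_WORDS` is a tiny made-up stand-in for the NLTK list, and `str.split` is used in place of `nltk.word_tokenize`, on the assumption that the two behave closely enough once non-word characters have been replaced by spaces.

```python
import re

# Tiny made-up stop word list; the notebook uses NLTK's English list instead
SMALL_STOP_WORDS = {"the", "a", "of", "and", "with"}

def preprocess(text, stop_words):
    """Lowercase, replace non-word characters with spaces, split, drop stop words."""
    cleaned = re.sub(r"\W+", " ", text).lower()
    return [word for word in cleaned.split() if word not in stop_words]

tokens = preprocess("Bright flat with a view of the sea!", SMALL_STOP_WORDS)
```

Applying one such function to both columns would avoid keeping two copies of each cell in sync.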

In [11]:
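# min_count=1 keeps every word, so each training token gets a vector
# The vector size is left at the default of 100
model = Word2Vec(description, min_count=1)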

In [12]:
model.save("TrainWord2vecDescription.model")

In [31]:
model.wv.most_similar('pizza')

[('bread', 0.6565971374511719),
 ('bake', 0.6351786851882935),
 ('oven', 0.6210470795631409),
 ('drawer', 0.6147385835647583),
 ('woodfired', 0.5470974445343018),
 ('alfreso', 0.540163516998291),
 ('bullerjan', 0.5209804773330688),
 ('microwave', 0.5185470581054688),
 ('grill', 0.5059233903884888),
 ('freezer', 0.4947850704193115)]

In [15]:
# Vocabulary size (gensim < 4.0; in gensim 4+ use len(model.wv.key_to_index))
len(model.wv.vocab)

22031

## Apply model on Description_id1 & Description_id2 columns from training dataframe

In [4]:
# Load Model
model = Word2Vec.load("TrainWord2vecDescription.model")

In [5]:
# Vocabulary size (gensim < 4.0; in gensim 4+ use len(model.wv.key_to_index))
len(model.wv.vocab)

22031

## Description_id1 Vectors

In [16]:
# Divide by len(vec) so that description length does not dominate; we want to focus on word similarity.
# A few descriptions are empty (length 0), so to avoid dividing by 0 we append a 100-dimensional
# vector of ones (100 is the default Word2Vec vector size) for simplicity
description_1_vector_sums = []
for i in range(len(description_id1)):
    vec = []
    for word in description_id1[i]:
        vec.append(model.wv[word])
    if len(vec) > 0:
        description_1_vector_sums.append(sum(vec)/len(vec))
    else:
        description_1_vector_sums.append(np.ones(100))

In [17]:
d1np = np.asarray(description_1_vector_sums)

In [28]:
len(d1np)

502689

In [39]:
np.save('d1np.npy' , d1np)

## Description_id2 Vectors

The cells below repeat the loading and preprocessing steps because my kernel crashed.

In [2]:
cols = ['description_id1' , 'description_id2']
df = pd.read_csv('cleaned_data.csv' , usecols=cols)

In [3]:
# Load Model
model = Word2Vec.load("TrainWord2vecDescription.model")

In [4]:
# Clean string data for description_id2 column
description_id2 = []
for i in df['description_id2']:
    description_id2.append(re.sub(r'\W+', ' ', i ).lower())

In [5]:
# Tokenize description_id2
description_id2 = [nltk.word_tokenize(sentence) for sentence in description_id2]

In [6]:
# Remove Stopwords from description_id2
for i in range(len(description_id2)):
    description_id2[i] = [word for word in description_id2[i] if word not in stopwords_dict]

In [7]:
# Divide by len(vec) so that description length does not dominate; we want to focus on word similarity.
# A few descriptions are empty (length 0), so to avoid dividing by 0 we append a 100-dimensional
# vector of ones (100 is the default Word2Vec vector size) for simplicity
description_2_vector_sums = []
for i in range(len(description_id2)):
    vec = []
    for word in description_id2[i]:
        vec.append(model.wv[word])
    if len(vec) > 0:
        description_2_vector_sums.append(sum(vec)/len(vec))
    else:
        description_2_vector_sums.append(np.ones(100))

In [8]:
d2np = np.asarray(description_2_vector_sums)

In [19]:
np.save('d2np.npy',d2np)
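
The saved arrays are meant to be compared with cosine similarity in the feature building notebook. Here is a minimal sketch of that step, assuming both arrays have the same shape with one row per listing pair. The small arrays and temporary file names are made up so the example runs on its own, and `rowwise_cosine` is a hypothetical helper, not code from that notebook.

```python
import os
import tempfile
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of a and b."""
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

# Made-up stand-ins for d1np and d2np
d1_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
d2_demo = np.array([[2.0, 0.0], [1.0, -1.0], [1.0, 1.0]])

# Round-trip through .npy files, as the two notebooks would
with tempfile.TemporaryDirectory() as tmp:
    path1 = os.path.join(tmp, "d1_demo.npy")
    path2 = os.path.join(tmp, "d2_demo.npy")
    np.save(path1, d1_demo)
    np.save(path2, d2_demo)
    loaded1 = np.load(path1)
    loaded2 = np.load(path2)

similarity = rowwise_cosine(loaded1, loaded2)
```

One thing to keep in mind: with the vector-of-ones fallback used above, two empty descriptions would score a similarity of 1, which may or may not be the desired behavior for this feature.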