####  Pre-trained Embeddings:

`1. glove-twitter-{25/50/100/200}`

`2. glove-wiki-gigaword-{50/100/200/300}`

`3. word2vec-google-news-300`

`4. word2vec-ruscorpora-300`

###### Load a pre-trained word vector model using gensim

In [1]:
import gensim.downloader as api

wiki_embeddings = api.load('glove-wiki-gigaword-100')



Explore the word vector for 'king'

In [2]:
wiki_embeddings['king']

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

Words most similar to 'king' based on the pre-trained word vectors

In [3]:
wiki_embeddings.most_similar('king')

[('prince', 0.7682329416275024),
 ('queen', 0.7507690191268921),
 ('son', 0.7020887136459351),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.6919990181922913),
 ('kingdom', 0.6811410188674927),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712857484817505),
 ('ii', 0.6676074266433716)]
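The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that ranking, the example below uses only the standard library and a few made-up 3-dimensional vectors. The names `cosine_similarity`, `rank_by_similarity`, and `toy_vectors` are illustrative and not part of gensim, and the numbers are not taken from the GloVe model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors, topn=3):
    """Return (word, score) pairs sorted by similarity to `query`, excluding the query itself."""
    scores = [(word, cosine_similarity(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, chosen by hand for illustration only
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

ranking = rank_by_similarity("king", toy_vectors)
print(ranking)  # 'queen' ranks above 'apple'
```

A score close to 1 means the two vectors point in nearly the same direction, and a score close to 0 means they are nearly orthogonal.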

#### Train Model:

Read in the data and clean up column names.

In [5]:
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ['label', 'text']
messages.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


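Clean the text with gensim's built-in `simple_preprocess`, which lowercases each message, splits it into tokens, and drops punctuation, digits, and tokens that are very short or very long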

In [6]:
messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))
messages.head()

| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
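To make the cleaning step less of a black box, here is a rough approximation of what `simple_preprocess` does with its default settings. This is a sketch, not gensim's actual implementation: the function name `rough_preprocess` and the regular expression are assumptions made for illustration, and the default length limits (2 to 15 characters) are taken from gensim's documented defaults.

```python
import re

# Runs of word characters that are not digits (so "21st" yields "st")
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+", re.UNICODE)


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and keep tokens whose length is within [min_len, max_len]."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens
            if min_len <= len(tok) <= max_len and not tok.startswith("_")]


print(rough_preprocess("Ok lar... Joking wif u oni..."))
# ['ok', 'lar', 'joking', 'wif', 'oni']
```

Single-character tokens such as "u" and "I" fall below the minimum length, which is why they are missing from `text_clean`, and "don't" is split at the apostrophe so only "don" survives.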


Split the data into train and test sets:

In [7]:
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'],
                                                    test_size=0.2)

Train the word2vec model:

In [8]:
# gensim >= 4.0 names this parameter `vector_size`; gensim 3.x called it `size`
w2v_model = gensim.models.Word2Vec(X_train, vector_size=100,
                                   window=5, min_count=2)
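The `min_count=2` argument tells Word2Vec to ignore words that appear fewer than two times in the training corpus. The sketch below imitates only that vocabulary filter with the standard library. `filter_vocabulary` and the sample `docs` are hypothetical, and this is not how gensim builds its vocabulary internally.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_count=2):
    """Return the set of tokens that occur at least `min_count` times across all documents."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {tok for tok, n in counts.items() if n >= min_count}


# Hypothetical tokenized messages, for illustration only
docs = [
    ["free", "entry", "to", "win"],
    ["ok", "lar"],
    ["win", "free", "tkts"],
]

vocab = filter_vocabulary(docs, min_count=2)
print(sorted(vocab))  # ['free', 'win']
```

Words removed by this filter get no vector at all, so looking them up in `w2v_model.wv` raises a `KeyError`.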

Explore the word vector for 'king' based on our trained model

In [9]:
w2v_model.wv['king']

array([-0.01218488,  0.02661741, -0.06106495,  0.00132741, -0.00455298,
       -0.01623209,  0.00316107, -0.05835401,  0.01884488,  0.10343473,
        0.06997734, -0.06022566,  0.05867344,  0.03619002,  0.01678955,
       -0.045303  , -0.01695218,  0.04076419, -0.01541594, -0.0044088 ,
       -0.02052639, -0.00708599,  0.11271296,  0.00745473, -0.06622133,
        0.0244447 , -0.04541319,  0.04481998, -0.00275197,  0.03056112,
        0.04065051, -0.06903173,  0.07022994, -0.00650361,  0.02979515,
        0.01432693, -0.00289312,  0.04508043, -0.01291056,  0.00577914,
       -0.01038253, -0.01044785, -0.03844113,  0.00023163, -0.00976733,
        0.09344584, -0.01371097,  0.02544733, -0.04614681, -0.05514842,
       -0.10766528,  0.02757762, -0.0368976 ,  0.03025775,  0.03070137,
        0.04129001,  0.0223324 , -0.02243043,  0.05255358,  0.03565809,
       -0.00559227,  0.01518424,  0.00809188, -0.01672367, -0.0299221 ,
       -0.06296328, -0.03199343,  0.01522661, -0.082532  , -0.00

Find the most similar words to 'king' based on word vectors from our trained model.

In [10]:
w2v_model.wv.most_similar('king')

[('drink', 0.9976893663406372),
 ('word', 0.9974783062934875),
 ('either', 0.9974765777587891),
 ('god', 0.9974689483642578),
 ('name', 0.9974657297134399),
 ('thing', 0.9974541664123535),
 ('stuff', 0.9974503517150879),
 ('crazy', 0.9974460601806641),
 ('she', 0.9974449872970581),
 ('games', 0.9974353313446045)]

Unlike the pre-trained GloVe vectors, the nearest neighbours here are unrelated words with nearly identical scores (all about 0.997), which suggests this small SMS corpus is not enough for the model to learn meaningful relationships for 'king'.