# word2vec: How To Implement word2vec

### Explore Pre-trained Embeddings

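gensim's downloader offers several pre-trained models besides the `glove-wiki-gigaword-100` vectors loaded below, including: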
- `glove-twitter-{25/50/100/200}`
- `glove-wiki-gigaword-{50/200/300}`
- `word2vec-google-news-300`
- `word2vec-ruscorpora-news-300`

In [1]:
# Install gensim
!pip install -U gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/5c/4e/afe2315e08a38967f8a3036bbe7e38b428e9b7a90e823a83d0d49df1adf5/gensim-3.8.3-cp37-cp37m-manylinux1_x86_64.whl (24.2MB)
Collecting smart-open>=1.8.1 (from gensim)
  Downloading https://files.pythonhosted.org/packages/53/9c/2ee26604648f92251f26a00a79fd164079163693f53792a3ba99f6152349/smart_open-2.2.1.tar.gz (122kB)
Building wheels for collected packages: smart-open
  Building wheel for smart-open (setup.py) ... done
  Stored in directory: /home/anshul/.cache/pip/wheels/6a/25/34/a5afefe4e3cad127e65c9bd1b6440c1916feb0bf2f744001e2
Successfully built smart-open
Installing collected packages: smart-open, gensim
Successfully installed gensim-3.8.3 smart-open-2.2.1


In [2]:
# Load pretrained word vectors using gensim
import gensim.downloader as api

wiki_embeddings = api.load('glove-wiki-gigaword-100')



In [3]:
# Explore the word vector for "king"
wiki_embeddings['king']

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [4]:
# Find the words most similar to "king" based on the pre-trained word vectors
wiki_embeddings.most_similar("king")

[('prince', 0.7682329416275024),
 ('queen', 0.7507690787315369),
 ('son', 0.7020887732505798),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.691999077796936),
 ('kingdom', 0.6811410188674927),
 ('father', 0.680202841758728),
 ('emperor', 0.6712858080863953),
 ('ii', 0.6676074266433716)]
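
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that ranking on a tiny scale. The three-dimensional vectors and the helper names (`cosine_similarity`, `most_similar`, `toy_vectors`) are invented for illustration and are not gensim's implementation or real embedding values.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors, made up for this example only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "music": [0.1, 0.2, 0.9],
}


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = most_similar("king", toy_vectors)
print(ranked)
```

Because cosine similarity depends only on direction, not vector length, a score near 1 means two vectors point almost the same way, and a score near 0 means they are close to orthogonal.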

### Train Our Own Model

In [5]:
# Read in the data and clean up column names
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
messages.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


In [6]:
# Clean the data using gensim's built-in tokenizer (lowercases the text and drops punctuation, digits, and very short tokens)
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)
messages.head()

| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
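
The cleaned column shows what `simple_preprocess` does to each message: text is lowercased, digits and punctuation are dropped, and one-character tokens such as "u" or "a" disappear. A rough approximation of that behavior, assuming the default length limits of 2 to 15 characters, could look like the sketch below. `approx_simple_preprocess` and `TOKEN_PATTERN` are hypothetical names, and the real function handles more cases (for example, optional accent removal).

```python
import re

# Runs of word characters that are not digits (letters and underscores).
TOKEN_PATTERN = re.compile(r"[^\W\d]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Approximate gensim's simple_preprocess with its assumed default limits."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    # Keep tokens within the length limits and skip ones starting with "_".
    return [
        token
        for token in tokens
        if min_len <= len(token) <= max_len and not token.startswith("_")
    ]


print(approx_simple_preprocess("Ok lar... Joking wif u oni..."))
print(approx_simple_preprocess("Nah I don't think he goes to usf, he lives around here though"))
```

Note how "don't" becomes "don": the apostrophe splits the word, and the leftover "t" is too short to keep. The same rule turns "21st" into "st".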


In [7]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'], messages['label'], test_size=0.2)

In [8]:
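# Train the word2vec model
# Note: `size` is the gensim 3.x parameter name (3.8.3 is installed above); gensim 4.0+ renamed it to `vector_size`
w2v_model = gensim.models.Word2Vec(X_train, size=100, window=5, min_count=2)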
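
To make the `min_count` and `window` arguments more concrete, here is a simplified sketch of the two ideas: words seen fewer than `min_count` times are dropped from the vocabulary, and each remaining word is paired with neighbors at most `window` positions away. The helpers `build_vocab` and `context_pairs` and the toy corpus are hypothetical, and this leaves out details of gensim's actual training loop such as subsampling and random window shrinking.

```python
from collections import Counter


def build_vocab(sentences, min_count=2):
    """Keep only tokens that appear at least `min_count` times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}


def context_pairs(sentence, vocab, window=5):
    """Pair each in-vocabulary token with its neighbors within `window` positions."""
    kept = [token for token in sentence if token in vocab]
    pairs = []
    for i, target in enumerate(kept):
        start = max(0, i - window)
        stop = min(len(kept), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, kept[j]))
    return pairs


# Toy tokenized corpus, invented for illustration.
toy_corpus = [
    ["free", "entry", "to", "win"],
    ["text", "to", "win", "free", "tickets"],
]
vocab = build_vocab(toy_corpus, min_count=2)
pairs = context_pairs(toy_corpus[0], vocab, window=1)
print(sorted(vocab))
print(pairs)
```

With `min_count=2`, "entry", "text", and "tickets" each appear once and fall out of the vocabulary, so the first sentence contributes pairs only among "free", "to", and "win".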

In [9]:
# Explore the word vector for "king" based on our trained model
w2v_model.wv['king']

array([-0.03781047,  0.0145539 , -0.03632586, -0.02183767,  0.01777887,
       -0.00835703, -0.0272942 ,  0.03455959,  0.0530992 , -0.0036507 ,
       -0.03838603, -0.00368304, -0.02005389, -0.02398076,  0.05907796,
       -0.06484535,  0.08386756,  0.00384486,  0.11756564, -0.02352135,
        0.03496149, -0.0353152 , -0.00679098,  0.088273  , -0.01122013,
       -0.09493479, -0.09922081,  0.02414066, -0.01473035,  0.02284652,
       -0.020798  , -0.0429832 ,  0.04817836,  0.02274809,  0.04857518,
       -0.02405948,  0.03943033, -0.00262282, -0.01253747,  0.00672635,
        0.08153729, -0.01959068,  0.04829797, -0.0354427 ,  0.02288593,
        0.03009036, -0.04494644,  0.04158503,  0.01176413,  0.00184266,
        0.03290795, -0.00377896, -0.02784754,  0.08658703,  0.03809892,
        0.00862864,  0.08445028, -0.02104815,  0.06501176,  0.07554223,
       -0.06267644,  0.07500016, -0.00433768,  0.0168327 , -0.05774008,
        0.0454221 ,  0.11421556, -0.00725017,  0.0218204 ,  0.09

In [11]:
# Find the most similar words to "king" based on word vectors from our trained model
w2v_model.wv.most_similar('king')

[('everyone', 0.9985462427139282),
 ('st', 0.9984534978866577),
 ('ur', 0.9984291791915894),
 ('win', 0.9984257221221924),
 ('www', 0.9984194040298462),
 ('music', 0.9984148144721985),
 ('people', 0.9984117746353149),
 ('more', 0.9984087347984314),
 ('days', 0.9984058141708374),
 ('from', 0.9984055161476135)]