# word2vec: How To Implement word2vec

### Explore Pre-trained Embeddings

Other pre-trained models available through `gensim.downloader` include:
- `glove-twitter-{25/50/100/200}`
- `glove-wiki-gigaword-{50/200/300}`
- `word2vec-google-news-300`
- `word2vec-ruscorpora-300`

In [1]:
# Install gensim
!pip install -U gensim


In [2]:
# Load pretrained word vectors using gensim
import gensim.downloader as api

# Download GloVe vectors trained on Wikipedia and Gigaword
# The trailing 100 means every word vector has 100 dimensions
wiki_embeddings = api.load('glove-wiki-gigaword-100')



In [5]:
# Explore the word vector for "king"
# As with word2vec, GloVe represents each word as a vector (100 dimensions here).
# Words that appear in similar contexts end up with vectors that are close together.
wiki_embeddings['king']

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [14]:
# Find the words most similar to "king" based on the pre-trained word vectors
wiki_embeddings.most_similar('king')

[('prince', 0.7682329416275024),
 ('queen', 0.7507690787315369),
 ('son', 0.7020887732505798),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.691999077796936),
 ('kingdom', 0.6811410188674927),
 ('father', 0.680202841758728),
 ('emperor', 0.6712858080863953),
 ('ii', 0.6676074266433716)]
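
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows the idea in plain Python on made-up 3-dimensional vectors; `toy_vectors` and the helper names are hypothetical and are not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, made up for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar("king", toy_vectors)
print(ranking)
```

With these toy values, "queen" ranks above "banana" because its vector points in nearly the same direction as the one for "king".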

### Train Our Own Model

In [15]:
# This dataset is unrelated to the pre-trained vectors used above.
# We could still use the Wikipedia embeddings to infer word meaning, but on their own they cannot classify messages.

# Read in the data and clean up column names
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('/Users/JacobRaymond 1/Desktop/spam.csv', encoding='latin-1')
messages = messages.drop(labels=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
messages.columns = ["label", "text"]
messages.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [17]:
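# Clean the text with gensim's built-in preprocessor, which lowercases and tokenizes each message
# and, by default, drops tokens shorter than 2 or longer than 15 characters
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)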
messages.head()

Unnamed: 0,label,text,text_clean
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...","[go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th..."
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,..."
3,ham,U dun say so early hor... U c already then say...,"[dun, say, so, early, hor, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, don, think, he, goes, to, usf, he, lives, around, here, though]"
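
To make the cleaning step less of a black box, here is a rough approximation of what `simple_preprocess` does with its default settings. It is a sketch and not gensim's actual implementation; `rough_preprocess` and `TOKEN_PATTERN` are hypothetical names introduced for illustration.

```python
import re

# Runs of word characters that are not digits. This approximates gensim's
# tokenizer and is written for illustration only.
TOKEN_PATTERN = re.compile(r"[^\W\d]+")

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, drop very short or very long ones."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(rough_preprocess("Ok lar... Joking wif u oni..."))
# → ['ok', 'lar', 'joking', 'wif', 'oni']
```

This explains why single-letter words such as "u" disappear and why "21st" is reduced to "st" in the `text_clean` column above.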


In [18]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [21]:
# Train the word2vec model
# size: dimensionality of each word vector (renamed `vector_size` in gensim 4.0+)
# window: consider up to five words before and after each target word
# min_count: a word must appear at least twice in the corpus to get a vector
w2v_model = gensim.models.Word2Vec(X_train, size=100, window=5, min_count=2)
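
The effect of `window` and `min_count` can be sketched without gensim. The helper below is a simplified, hypothetical illustration of how (target, context) training pairs might be formed: rare words are filtered out first, then each remaining word is paired with its neighbours. It assumes a fixed window, whereas word2vec implementations typically shrink the window at random during training.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only the tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

def context_pairs(sentences, window, min_count):
    """Return (target, context) pairs using a fixed symmetric window."""
    vocab = build_vocab(sentences, min_count)
    pairs = []
    for sentence in sentences:
        # Drop rare words before applying the window (a simplifying assumption).
        kept = [t for t in sentence if t in vocab]
        for i, target in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, kept[j]))
    return pairs

# Tiny made-up corpus, for illustration only.
toy_corpus = [
    ["free", "entry", "to", "win"],
    ["text", "to", "win", "free", "tickets"],
]
pairs = context_pairs(toy_corpus, window=1, min_count=2)
print(pairs)
```

In this toy corpus only "free", "to", and "win" appear twice, so words such as "entry" never show up in a training pair. The same filtering applies to the spam dataset: any word seen only once gets no vector at all.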

In [22]:
# Explore the word vector for "king" based on our trained model
w2v_model.wv['king']

array([ 0.10487016,  0.21510172, -0.10849935,  0.05340999, -0.03962642,
       -0.10173823,  0.06563437,  0.00585718,  0.01794886,  0.0112955 ,
        0.04267868, -0.02262603,  0.09342182,  0.03310412, -0.0303591 ,
       -0.01937971,  0.0450067 , -0.01413602,  0.00135273, -0.06122703,
       -0.00642887, -0.00122807,  0.0329018 , -0.08993886,  0.01055552,
        0.06540831,  0.01650726, -0.05483284, -0.03026311,  0.02650058,
       -0.14788489,  0.03553534, -0.08980873,  0.05408211,  0.02237782,
       -0.03798092, -0.04514375, -0.02287094,  0.0437391 , -0.07131199,
       -0.06730279,  0.08392744, -0.06748024,  0.11946463,  0.00111425,
        0.04148385, -0.05268946,  0.00820543,  0.06639598,  0.03518917,
       -0.07866311,  0.00438332, -0.00699693,  0.00619019,  0.02347072,
       -0.00595546, -0.0492793 ,  0.04189854,  0.07877913,  0.04646201,
       -0.03783966, -0.01025305, -0.05802447,  0.11006525, -0.01269659,
       -0.04809914, -0.16059797, -0.04820869,  0.07158965, -0.01

In [23]:
# Find the most similar words to "king" based on word vectors from our trained model
w2v_model.wv.most_similar('king')

[('video', 0.9989591836929321),
 ('being', 0.9989287257194519),
 ('heart', 0.9989182949066162),
 ('guys', 0.9988712668418884),
 ('for', 0.9988709688186646),
 ('draw', 0.9988632202148438),
 ('from', 0.9988629817962646),
 ('your', 0.9988593459129333),
 ('com', 0.9988566637039185),
 ('into', 0.9988530278205872)]