# word2vec: How To Implement word2vec

### Explore Pre-trained Embeddings

Other pre-trained embeddings available through `gensim.downloader` include:
- `glove-twitter-{25/50/100/200}`
- `glove-wiki-gigaword-{50/200/300}`
- `word2vec-google-news-300`
- `word2vec-ruscorpora-300`

In [1]:
# Install gensim
%pip install gensim






In [2]:
# Load pretrained word vectors using gensim
import gensim.downloader as api

wiki_embeddings = api.load('glove-wiki-gigaword-100')



In [3]:
# Explore the word vector for "king"
wiki_embeddings['king']

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [4]:
# Find the words most similar to king based on the trained word vectors
wiki_embeddings.most_similar('king')

[('prince', 0.7682328820228577),
 ('queen', 0.7507690787315369),
 ('son', 0.7020888328552246),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.6919989585876465),
 ('kingdom', 0.6811409592628479),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712858080863953),
 ('ii', 0.6676074266433716)]
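
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows the idea on a tiny hand-made example; the 3-dimensional vectors and the helper names `cosine_similarity` and `toy_most_similar` are invented for illustration and are not part of gensim.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example (real embeddings have 100+ dimensions)
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = toy_most_similar("king", toy_vectors)
print(ranked)
```

In this toy setup `queen` ranks above `banana` because its vector points in nearly the same direction as the one for `king`.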

### Train Our Own Model

In [5]:
# Install scikit-learn (the PyPI package name is scikit-learn, not sklearn)
%pip install scikit-learn
# Read in the data and clean up column names
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
messages = messages.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
messages.columns = ["label", "text"]
messages.head()





| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


In [6]:
# Clean the text with gensim's built-in preprocessor (lowercases and tokenizes each message)
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)
messages.head()

| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
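
The cleaned output above drops single-letter tokens such as `u` and `c` and strips digits and punctuation. The sketch below is a rough pure-Python approximation of that behaviour, assuming the default length limits of 2 to 15 characters; `approx_simple_preprocess` is a hypothetical helper written for illustration, not gensim's actual implementation.

```python
import re

# Runs of word characters that are not digits (an approximation of gensim's tokenizer)
TOKEN_PATTERN = re.compile(r"[^\W\d]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, and drop very short or very long ones."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [
        token
        for token in tokens
        if min_len <= len(token) <= max_len and not token.startswith("_")
    ]


print(approx_simple_preprocess("Ok lar... Joking wif u oni..."))
print(approx_simple_preprocess("U dun say so early hor... U c already then say..."))
```

Note that contractions are split at the apostrophe, which is why `don't` appears as `don` in the cleaned column.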


In [7]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [10]:
# Train the word2vec model
w2v_model = gensim.models.Word2Vec(
    X_train, 
    vector_size=100, 
    window=5, 
    min_count=2)
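
With `min_count=2`, any word that appears fewer than two times in the training data is left out of the vocabulary and gets no vector. The sketch below illustrates that filtering step on a toy corpus; `build_vocab` and the sample documents are made up for this example and only mimic the effect of the parameter, not gensim's internals.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_count=2):
    """Keep only tokens whose total frequency across all documents is at least min_count."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {token for token, n in counts.items() if n >= min_count}


# Toy corpus, invented for illustration
docs = [
    ["free", "entry", "to", "win"],
    ["ok", "lar"],
    ["free", "text", "to", "win"],
]

vocab = build_vocab(docs, min_count=2)
print(sorted(vocab))  # ['free', 'to', 'win']
```

On a small corpus this filter can remove a large share of the words, so a lookup for a rare word may fail because the word never made it into the vocabulary.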

In [11]:
# Explore the word vector for "king" based on our trained model
w2v_model.wv['king']

array([-1.8161977e-02,  5.7616390e-02,  1.3926938e-02,  2.1130444e-02,
        8.2980907e-03, -8.0111727e-02,  2.2035718e-02,  1.1189261e-01,
       -2.6967876e-02, -1.1018725e-02, -2.5438010e-03, -9.6267372e-02,
        4.7984278e-05,  2.0964524e-02, -1.0217189e-02, -3.8939990e-02,
        1.3134870e-02, -6.3317664e-02,  1.0043609e-02, -9.7543791e-02,
        2.4526289e-02,  2.2568697e-02,  3.4238897e-02, -2.6947554e-02,
       -1.4646398e-02,  2.0653252e-02, -3.4913559e-02, -2.1883029e-02,
       -5.1514685e-02,  1.3277801e-03,  4.0859573e-02,  7.7405148e-03,
        3.6399029e-02, -2.9258993e-02, -2.4418255e-02,  5.4753207e-02,
       -3.8207835e-04, -4.6696842e-02, -2.1935105e-02, -8.6616270e-02,
        2.0974584e-02, -5.4943293e-02, -1.8098596e-02, -7.0133916e-04,
        4.2300932e-02, -2.8942561e-02, -3.7437055e-02, -8.7984893e-03,
        3.4359563e-02,  2.5349237e-02,  1.3921367e-02, -6.0879402e-02,
       -1.1375098e-02, -3.2803457e-02, -2.9939182e-02,  2.3878673e-02,
      

In [12]:
# Find the most similar words to "king" based on word vectors from our trained model
w2v_model.wv.most_similar('king')

[('room', 0.989668071269989),
 ('part', 0.9895954132080078),
 ('kiss', 0.9893442988395691),
 ('ìï', 0.9893378615379333),
 ('dad', 0.9893037676811218),
 ('pic', 0.9892616868019104),
 ('text', 0.9891378283500671),
 ('needs', 0.9890959858894348),
 ('nothing', 0.9890937805175781),
 ('selected', 0.9890815615653992)]