### What is word2vec?

It is a shallow, two-layer neural network that takes a text corpus as input and returns a set of vectors (also known as embeddings). Each vector is a numeric representation of one word.

"You shall know a word by the company it keeps." Meaning: you can infer the meaning of a word by just looking at the words around it in the context of a sentence. 

The skip-gram method trains the network to predict the words that surround each word within a fixed window. The weights learned along the way become the numeric representation of that word. This is how the model learns a numeric vector for every word in the corpus it is trained on.
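To make the idea of "the words around each word" concrete, here is a minimal sketch of how (target, context) training pairs could be generated for skip-gram. The function name `skipgram_pairs` and the window size are assumptions chosen for illustration. This is not how gensim implements it internally.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["you", "shall", "know", "a", "word"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

With `window=1` each word is paired only with its immediate neighbours. A larger window captures broader context at the cost of more training pairs.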

In short, a trained word2vec model converts a list of words into a list of numeric vectors.

You can use word vectors to gauge how similar two words are. The most common measure is cosine similarity: in Python, you pass two word vectors to a cosine similarity function and it returns a score between -1 and 1. What it actually returns is the cosine of the angle between the two vectors.

Recall what a cosine curve looks like, with the angle between the two vectors on the x-axis and the similarity score on the y-axis. If the angle is very small, near zero, the score is very close to 1. If the vectors are at 90 degrees, the score is 0. If the angle is 180 degrees, the score is -1, meaning the vectors point in opposite directions.
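As a minimal sketch of that calculation, the snippet below computes the cosine of the angle between two vectors with NumPy, using the dot product divided by the product of the vector lengths. The helper name `cosine_similarity` and the toy 2-dimensional vectors are assumptions for illustration. Real word vectors have many more dimensions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 90 degrees apart
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # 180 degrees apart
print(same, orthogonal, opposite)
```

The three cases should come out close to 1, 0 and -1 respectively. Note that the length of the vectors does not matter, only their direction.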

In [1]:
# Install gensim
!pip install -U gensim

Collecting gensim
  Downloading gensim-3.8.3-cp37-cp37m-win_amd64.whl (24.2 MB)


In [8]:
# Load pretrained GloVe word vectors (100 dimensions, trained on Wikipedia + Gigaword) using gensim
import gensim.downloader as api

wiki_embeddings = api.load('glove-wiki-gigaword-100')



In [9]:
# Explore the word vector for "king"
wiki_embeddings['king']

array([-0.32307 , -0.87616 ,  0.21977 ,  0.25268 ,  0.22976 ,  0.7388  ,
       -0.37954 , -0.35307 , -0.84369 , -1.1113  , -0.30266 ,  0.33178 ,
       -0.25113 ,  0.30448 , -0.077491, -0.89815 ,  0.092496, -1.1407  ,
       -0.58324 ,  0.66869 , -0.23122 , -0.95855 ,  0.28262 , -0.078848,
        0.75315 ,  0.26584 ,  0.3422  , -0.33949 ,  0.95608 ,  0.065641,
        0.45747 ,  0.39835 ,  0.57965 ,  0.39267 , -0.21851 ,  0.58795 ,
       -0.55999 ,  0.63368 , -0.043983, -0.68731 , -0.37841 ,  0.38026 ,
        0.61641 , -0.88269 , -0.12346 , -0.37928 , -0.38318 ,  0.23868 ,
        0.6685  , -0.43321 , -0.11065 ,  0.081723,  1.1569  ,  0.78958 ,
       -0.21223 , -2.3211  , -0.67806 ,  0.44561 ,  0.65707 ,  0.1045  ,
        0.46217 ,  0.19912 ,  0.25802 ,  0.057194,  0.53443 , -0.43133 ,
       -0.34311 ,  0.59789 , -0.58417 ,  0.068995,  0.23944 , -0.85181 ,
        0.30379 , -0.34177 , -0.25746 , -0.031101, -0.16285 ,  0.45169 ,
       -0.91627 ,  0.64521 ,  0.73281 , -0.22752 , 

In [10]:
# Find the words most similar to "king" based on the pretrained word vectors
wiki_embeddings.most_similar('king')

[('prince', 0.7682329416275024),
 ('queen', 0.7507690191268921),
 ('son', 0.7020887136459351),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.6919990181922913),
 ('kingdom', 0.6811410188674927),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712857484817505),
 ('ii', 0.6676074266433716)]

In [11]:
wiki_embeddings.most_similar('love')

[('me', 0.7382813692092896),
 ('passion', 0.735213577747345),
 ('my', 0.7327208518981934),
 ('life', 0.7287957668304443),
 ('dream', 0.7267670035362244),
 ('you', 0.7181724309921265),
 ('always', 0.7111519575119019),
 ('wonder', 0.7094581127166748),
 ('i', 0.7084634304046631),
 ('dreams', 0.7067317962646484)]

In [13]:
# Train our own model
# Start by reading in the data and cleaning up the column names
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('spam.csv', encoding='latin-1')
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["label", "text"]
messages.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


In [14]:
# Clean data using the built in cleaner in gensim
messages['text_clean'] = messages['text'].apply(lambda x: gensim.utils.simple_preprocess(x))
messages.head()

| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
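The cleaned column shows what the gensim cleaner does to each message: it lowercases the text, splits it into word tokens, and drops punctuation, digits and single-character tokens. The sketch below is a rough pure-Python approximation of that behaviour, not gensim's actual implementation. The function name `rough_preprocess` is hypothetical, and the length limits of 2 and 15 are assumed to match the gensim defaults. It handles ASCII letters only.

```python
import re


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, keep runs of letters, drop very short or very long tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


first = rough_preprocess("Ok lar... Joking wif u oni...")
second = rough_preprocess("Nah I don't think he goes to usf, he lives around here though")
print(first)
print(second)
```

Note how "u" and "I" disappear because they are a single character, and "don't" becomes "don" because the apostrophe splits the word and the lone "t" is dropped.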


In [15]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [16]:
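# Train the word2vec model
# gensim trains CBOW by default (sg=0); pass sg=1 to train with skip-gram instead
# In gensim 4.x the `size` argument is named `vector_size`
w2v_model = gensim.models.Word2Vec(X_train,
                                   size=100,     # dimensionality of each word vector
                                   window=5,     # context words considered on each side
                                   min_count=2)  # ignore words seen fewer than 2 times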

In [17]:
# Explore the word vector for "king" based on the trained model
w2v_model.wv['king']

array([ 0.04663172,  0.03127421,  0.06469775, -0.00727583,  0.04726887,
       -0.00250307, -0.0503101 ,  0.01107288,  0.01183673, -0.03182893,
        0.00755053, -0.05057079,  0.002737  ,  0.01299583, -0.01232938,
        0.06443909,  0.05830617, -0.01431434, -0.05226093, -0.04732038,
       -0.03040092,  0.02699241,  0.01060657,  0.05214865, -0.00116046,
        0.00930923,  0.02095554, -0.04127254, -0.02117939, -0.00923504,
        0.01227929, -0.02265489, -0.02632529,  0.01250261,  0.00870887,
        0.02763297, -0.00019156,  0.00449696,  0.01953203,  0.03260936,
       -0.00262226,  0.02667082,  0.02193566, -0.04284675, -0.01995637,
       -0.01998122,  0.03369087, -0.08635699, -0.01910882, -0.03533529,
       -0.00825467,  0.02114196,  0.07797874, -0.02580586,  0.07644919,
        0.0197979 , -0.01137579,  0.00389541, -0.0131095 , -0.03545333,
       -0.0125977 ,  0.0209763 ,  0.01070945, -0.04768215,  0.030259  ,
        0.02032248, -0.05188434, -0.02868261, -0.02285695,  0.08

In [18]:
# Find the most similar words to "king" based on word vectors from trained model
w2v_model.wv.most_similar('king')

[('has', 0.9973573088645935),
 ('contact', 0.9973569512367249),
 ('use', 0.9973336458206177),
 ('yr', 0.9973284006118774),
 ('hl', 0.9973279237747192),
 ('before', 0.997320294380188),
 ('good', 0.9973108768463135),
 ('girl', 0.9973031878471375),
 ('until', 0.9972975254058838),
 ('st', 0.9972960352897644)]

In [19]:
# Generate a list of words the word2vec model learned word vectors for
# (gensim 4.x renames this attribute to index_to_key)
w2v_model.wv.index2word

['you',
 'to',
 'the',
 'and',
 'is',
 'in',
 'me',
 'my',
 'it',
 'for',
 'your',
 'of',
 'call',
 'that',
 'have',
 'on',
 'now',
 'are',
 'can',
 'not',
 'so',
 'but',
 'we',
 'or',
 'at',
 'do',
 'if',
 'ur',
 'get',
 'with',
 'will',
 'no',
 'be',
 'just',
 'this',
 'gt',
 'lt',
 'up',
 'how',
 'go',
 'ok',
 'when',
 'what',
 'from',
 'free',
 'll',
 'all',
 'out',
 'know',
 'then',
 'am',
 'good',
 'there',
 'like',
 'he',
 'day',
 'time',
 'got',
 'was',
 'come',
 'only',
 'its',
 'love',
 'text',
 'send',
 'txt',
 'want',
 'by',
 'as',
 'about',
 'going',
 'lor',
 'need',
 'one',
 'she',
 'sorry',
 'stop',
 'home',
 'back',
 'still',
 'see',
 'don',
 'today',
 'da',
 'our',
 'reply',
 'tell',
 'new',
 'later',
 'think',
 'hi',
 'please',
 'did',
 'week',
 'mobile',
 'any',
 'take',
 'pls',
 'they',
 'dear',
 'been',
 'dont',
 'some',
 'who',
 'her',
 're',
 'phone',
 'ì_',
 'much',
 'where',
 'hey',
 'claim',
 'oh',
 'night',
 'here',
 'give',
 'has',
 'msg',
 'great',
 'happy'

In [20]:
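# Generate aggregated sentence vectors based on the word vectors for each word in the sentence
# Build the vocabulary as a set once, so each membership test is O(1) rather than a list scan
words = set(w2v_model.wv.index2word)
# Keep a plain list: messages differ in length, so the per-message arrays are ragged
w2v_vect = [np.array([w2v_model.wv[i] for i in ls if i in words])
            for ls in X_test]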



In [21]:
# Why is the length of the sentence different from the length of the sentence vector?
# Words the model has no vector for (seen fewer than min_count times in training) are skipped
for i, v in enumerate(w2v_vect):
    print(len(X_test.iloc[i]), len(v))

34 32
16 15
22 22
7 6
11 8
3 3
19 18
13 12
19 17
9 9
12 12
22 20
19 19
27 22
7 7
4 4
5 5
26 23
4 4
14 12
23 22
30 28
15 14
19 16
16 16
34 32
16 14
4 2
16 15
4 4
5 5
21 21
5 5
10 10
6 6
6 6
23 23
5 5
18 17
26 26
25 22
27 25
30 29
27 24
3 3
4 3
11 10
11 11
17 16
7 7
18 17
30 28
15 12
14 14
12 12
28 28
10 10
5 5
25 25
20 19
49 40
47 32
4 4
29 29
6 5
18 16
22 22
7 5
13 11
9 8
6 6
21 20
9 9
6 4
6 6
6 4
10 10
8 8
11 8
27 25
6 6
6 6
22 22
24 23
19 12
12 12
5 5
0 0
22 22
6 6
17 16
6 5
5 5
22 22
13 12
5 5
8 7
13 12
12 11
23 22
15 13
17 15
10 10
8 7
16 12
4 4
14 13
8 7
6 5
30 30
23 21
13 10
9 8
23 18
21 21
14 14
2 2
6 6
17 13
12 10
50 43
6 5
7 6
19 18
11 8
20 20
30 28
3 3
16 15
20 20
11 10
9 9
6 6
6 3
9 8
7 5
6 5
16 16
29 29
5 5
23 13
5 5
26 26
4 3
7 6
27 23
5 5
24 24
6 6
16 14
8 8
12 11
3 3
10 7
5 5
31 29
16 14
8 8
9 9
9 9
14 13
6 6
1 1
8 8
21 19
52 52
23 23
10 10
11 11
17 16
25 19
16 13
17 16
6 6
25 25
9 9
6 6
6 6
7 7
6 6
5 4
25 25
7 7
9 9
56 55
26 22
24 24
14 13
9 9
25 23
8 8
21 21
5 5
12 12
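The second number is smaller than the first whenever a message contains words the model has no vector for, either because they appeared fewer than `min_count` times in the training data or because they never appeared at all. The sketch below reproduces that effect on a made-up corpus using only the standard library. The messages, the variable names and the `min_count` value are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical tokenized training messages
train_messages = [
    ["free", "entry", "to", "win"],
    ["call", "to", "win", "free", "prize"],
]
min_count = 2

# Keep only words seen at least min_count times, as the model's vocabulary would
counts = Counter(word for msg in train_messages for word in msg)
vocab = {word for word, n in counts.items() if n >= min_count}

# Words outside the vocabulary are silently dropped from a test message
test_message = ["win", "a", "free", "holiday"]
kept = [word for word in test_message if word in vocab]
print(len(test_message), len(kept))
```

Here "a" and "holiday" were never seen in training, so only two of the four words end up with a vector.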


In [22]:
# Computing sentence vectors by averaging the word vectors for the words contained in the sentence
w2v_vect_avg = []

for vect in w2v_vect:
    if len(vect)!=0:
        w2v_vect_avg.append(vect.mean(axis=0))
    else:
        w2v_vect_avg.append(np.zeros(100))
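To see what the element-wise average does, here is a small self-contained sketch with made-up 3-dimensional vectors. The dictionary `toy_vectors` and the helper `sentence_vector` are hypothetical names used only for this illustration. The real model above uses 100 dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, made up for illustration
toy_vectors = {
    "free": np.array([1.0, 0.0, 2.0]),
    "win": np.array([3.0, 2.0, 0.0]),
}


def sentence_vector(tokens, vectors, dim=3):
    """Average the vectors of the known words; fall back to zeros if none are known."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)


avg = sentence_vector(["win", "free", "holiday"], toy_vectors)
empty = sentence_vector(["holiday"], toy_vectors)
print(avg, empty)
```

Whatever the number of words in a message, the result always has the same length as a single word vector, which is what lets the sentence vectors be fed to a model as fixed-width features. The zero vector is one simple way to handle messages with no known words.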

In [23]:
# Are sentence vector lengths consistent?
for i, v in enumerate(w2v_vect_avg):
    print(len(X_test.iloc[i]), len(v))

34 100
16 100
22 100
7 100
11 100
3 100
19 100
13 100
19 100
9 100
12 100
22 100
19 100
27 100
7 100
4 100
5 100
26 100
4 100
14 100
23 100
30 100
15 100
19 100
16 100
34 100
16 100
4 100
16 100
4 100
5 100
21 100
5 100
10 100
6 100
6 100
23 100
5 100
18 100
26 100
25 100
27 100
30 100
27 100
3 100
4 100
11 100
11 100
17 100
7 100
18 100
30 100
15 100
14 100
12 100
28 100
10 100
5 100
25 100
20 100
49 100
47 100
4 100
29 100
6 100
18 100
22 100
7 100
13 100
9 100
6 100
21 100
9 100
6 100
6 100
6 100
10 100
8 100
11 100
27 100
6 100
6 100
22 100
24 100
19 100
12 100
5 100
0 100
22 100
6 100
17 100
6 100
5 100
22 100
13 100
5 100
8 100
13 100
12 100
23 100
15 100
17 100
10 100
8 100
16 100
4 100
14 100
8 100
6 100
30 100
23 100
13 100
9 100
23 100
21 100
14 100
2 100
6 100
17 100
12 100
50 100
6 100
7 100
19 100
11 100
20 100
30 100
3 100
16 100
20 100
11 100
9 100
6 100
6 100
9 100
7 100
6 100
16 100
29 100
5 100
23 100
5 100
26 100
4 100
7 100
27 100
5 100
24 100
6 100
16 100
8 100
12 