# Basics of word2vec


## Download the model
Download the <code>google-news-vectors</code> model and open it with the <code>gensim</code> library.

In [4]:
! pip install -U gensim
! wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
! gunzip GoogleNews-vectors-negative300.bin.gz
! pip install SciPy==1.5.4

--2022-03-16 09:39:52--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.113.93
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.113.93|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’

.bin.gz               3%[                    ]  61.96M   875KB/s    eta 43m 53s^C
gzip: GoogleNews-vectors-negative300.bin already exists; do you wish to overwrite (y or n)? ^C



KeyboardInterrupt



In [5]:
import warnings
warnings.filterwarnings('ignore')

import gensim
from gensim.models import KeyedVectors

w = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", 
                                      binary=True)

The structure is called <code>KeyedVectors</code>; in essence it is a mapping between keys and vectors. Each vector is identified by its lookup key, most often a short string token, so it is normally a correspondence between

<center><code>{str => 1D numpy array}</code></center><br/>



For example, let's display the first 10 coordinates of the vector corresponding to the word <code>sunrise</code>:

In [6]:
print(gensim.__version__)
print("Vector size: ", w["sunrise"].shape)
print("The first 10 coordinates of a vector: \n", w["sunrise"][:10])

4.1.2
Vector size:  (300,)
The first 10 coordinates of a vector: 
 [-0.22558594 -0.03540039 -0.21679688  0.03613281 -0.2265625  -0.09814453
  0.109375   -0.34570312  0.18652344  0.01806641]


## Task 1. Similarity. 

Build vectors for the words <code>London</code>, <code>England</code>, and <code>Moscow</code>. Compute the cosine distance between <code>London</code> and <code>England</code> and between <code>Moscow</code> and <code>England</code>. In which pair are the words more similar to each other? Hint: to compute the cosine distance, use the <code>distance()</code> method; a smaller distance means the words are more similar. The correct answer is presented in the outputs.

In [7]:
london_vec = w["London"]
england_vec = w["England"]
moscow_vec = w["Moscow"]

w.distance("London", "England")

0.5600714385509491

In [8]:
w.distance("Moscow", "England")

0.8476868271827698

In [9]:
#enter your code here

In [10]:
w.distance("professor", "student")

0.5793381929397583
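
The values returned by <code>distance()</code> above are cosine distances, that is, one minus the cosine of the angle between the two vectors. The following is a minimal, dependency-free sketch of that formula; the function name `cosine_distance` and the toy vectors are made up for illustration and are not taken from the model.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (hypothetical, for illustration only).
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_distance([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way give a distance near 0, orthogonal vectors give 1, and opposite vectors give 2, which is why the smaller value for London and England indicates the more similar pair.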

## Task 2. Analogies.
Using the <code>most_similar</code> method, solve the analogy
```London : England = Moscow : X```

The correct answer is in the outputs.

(Hint: use the <code>positive</code> and <code>negative</code> arguments.)

In [11]:
# M - L + E = X
w.most_similar(["Moscow", "England"], ["London"])

[('Russia', 0.6502718329429626),
 ('Ukraine', 0.5879061818122864),
 ('Belarus', 0.5666376352310181),
 ('Azerbaijan', 0.5418694615364075),
 ('Armenia', 0.5300518870353699),
 ('Poland', 0.5253247618675232),
 ('coach_Georgy_Yartsev', 0.5220180749893188),
 ('Russian', 0.5214669108390808),
 ('Croatia', 0.5166040658950806),
 ('Moldova', 0.5125792026519775)]

In [12]:
#enter your code here
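
The arithmetic behind the analogy (`M - L + E = X`) can be sketched without the model. The version below is a simplified illustration, not gensim's implementation: it adds the unit-normalized positive vectors, subtracts the negative ones, and ranks the remaining words by cosine similarity to the result. The `analogy` helper and the three-dimensional toy vectors are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy(vectors, positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    used = set(positive) | set(negative)  # input words are excluded from the answer
    scored = [(word, cosine(query, vec))
              for word, vec in vectors.items() if word not in used]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: axis 0 ~ "British", axis 1 ~ "Russian", axis 2 ~ "is a country".
toy = {
    "London":  [1.0, 0.0, 0.2],
    "England": [1.0, 0.0, 1.0],
    "Moscow":  [0.0, 1.0, 0.2],
    "Russia":  [0.0, 1.0, 1.0],
    "Poland":  [0.3, 0.3, 1.0],
}
result = analogy(toy, positive=["Moscow", "England"], negative=["London"])
print(result[0][0])
```

With these toy vectors the top-ranked word is `Russia`, mirroring the output of the real model above.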

## Task 3. Similarity: find the odd-one-out word.
Using the <code>doesnt_match</code> method, find the odd-one-out word among the words <code>breakfast cereal dinner lunch</code>.

The correct answer is in the outputs.

In [13]:
w.doesnt_match(["breakfast", "cereal", "dinner", "lunch"])

'cereal'

In [14]:
w.doesnt_match(["professor", "student", "smart", "wood"])

'wood'

In [15]:
#enter your code here
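
One way to think about the odd-one-out search is: normalize every word vector, average them, and pick the word whose vector is least similar to that average. The sketch below follows this idea; it is an assumption about the general approach rather than a copy of gensim's code, and the `odd_one_out` helper and two-dimensional toy vectors are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def odd_one_out(vectors, words):
    """Return the word whose unit vector has the lowest dot product with the mean."""
    normed = {word: unit(vectors[word]) for word in words}
    dim = len(next(iter(normed.values())))
    mean = [sum(vec[i] for vec in normed.values()) / len(normed) for i in range(dim)]
    scores = {word: sum(a * b for a, b in zip(vec, mean))
              for word, vec in normed.items()}
    return min(scores, key=scores.get)

# Made-up vectors: axis 0 ~ "meal", axis 1 ~ "food product".
toy = {
    "breakfast": [1.0, 0.1],
    "lunch":     [0.9, 0.2],
    "dinner":    [1.0, 0.0],
    "cereal":    [0.2, 1.0],
}
outlier = odd_one_out(toy, ["breakfast", "cereal", "dinner", "lunch"])
print(outlier)
```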

## Task 4. Sentence vector representation


Consider the sentence <code>the quick brown fox jumps over the lazy dog</code>. Represent this sentence as a single vector: look up the vector of each word in the model, then average the vectors component-wise.


In [16]:
import numpy as np

vectors = [w["the"], w["quick"], w["brown"], w["fox"], w["jumps"], w["over"], w["the"], w["lazy"], w["dog"]]
# Component-wise average over the word vectors; the result has shape (300,)
mean_vector_1 = np.mean(vectors, axis=0)
mean_vector_1

array([ 9.05558243e-02,  5.43416329e-02, -6.71386719e-02,  1.09686956e-01,
       -1.06065534e-02, -1.21066622e-01,  4.63748500e-02, -5.35685234e-02,
        7.00683594e-02,  9.72764790e-02,  2.70589199e-02, -1.16495766e-01,
        3.48307304e-02, -2.13351771e-02, -8.32519531e-02, -2.97851562e-02,
       -3.11482754e-02,  1.02077909e-01, -7.70467147e-02, -1.05170354e-01,
       -8.54492188e-04,  6.69555664e-02,  1.97482631e-02,  7.00336043e-03,
        1.32700605e-02,  2.12593079e-02, -1.07652456e-01,  1.05970591e-01,
        1.06550425e-01,  4.36740462e-03, -5.31684011e-02,  6.63248673e-02,
        3.62277552e-02, -6.70505092e-02, -2.00195312e-02, -1.75272617e-02,
       -2.21082903e-02,  6.37478312e-04,  9.91753489e-02,  1.46647140e-01,
        9.37500000e-02, -1.67263448e-01,  1.36345759e-01, -1.23155378e-02,
        5.61930351e-02, -6.90680593e-02, -5.95092773e-03, -5.62099889e-02,
        8.73616561e-02,  7.48155415e-02, -3.18332240e-02,  7.09228516e-02,
        7.61583149e-02,  

In [17]:
import numpy as np

#enter your code here

In [60]:
vectors = [w["fool"], w["his"], w["money"], w["are"], w["soon"], w["parted"]]
mean_vector_1 = np.mean(vectors, axis=0)
mean_vector_1

array([ 1.74804688e-01,  5.86751290e-02,  5.55013008e-02,  5.99772148e-02,
       -1.08093262e-01,  2.47701001e-03,  2.78828945e-02, -8.40148926e-02,
        9.70636979e-02,  3.42814135e-03,  4.46777344e-02, -7.85318986e-02,
       -1.08688988e-01,  2.02311203e-01, -2.14192703e-01,  1.14013672e-01,
        5.37109375e-02,  2.02209473e-01, -2.84830737e-03, -5.38330078e-02,
        2.13623047e-02,  6.97224960e-02,  1.08978271e-01,  7.29166642e-02,
        1.26881912e-01,  7.93457031e-02, -5.19612618e-02,  6.87255859e-02,
        4.23990898e-02, -8.61409530e-02, -4.06901054e-02,  6.40055314e-02,
       -7.80944824e-02,  7.97526073e-03,  3.03344727e-02, -5.61523438e-02,
        3.93981934e-02, -2.84016933e-02,  3.33461761e-02,  3.51969409e-03,
        1.14908852e-01, -1.01684570e-01,  1.85282394e-01, -9.69441757e-02,
        9.35872365e-03, -7.89591447e-02, -1.21515907e-01,  8.76057968e-02,
        7.56835938e-03,  8.84602889e-02, -1.29699707e-02,  6.68004379e-02,
        3.09143066e-02, -

In [61]:
vectors = [w["journey"], w["thousand"], w["miles"], w["begins"], w["with"], w["single"], w["step"]]
mean_vector_2 = np.mean(vectors, axis=0)
mean_vector_2

array([-0.00343541,  0.05318778,  0.03160749,  0.05688477, -0.07467216,
       -0.0396031 , -0.04630825, -0.10288783,  0.04419817,  0.11659459,
        0.05133929, -0.05517578,  0.0577319 ,  0.04078892,  0.0151825 ,
        0.09943063,  0.0382952 ,  0.04654367,  0.00495257, -0.07195173,
       -0.13546316,  0.09960938, -0.03876768,  0.03527396,  0.0859375 ,
        0.00854492, -0.09682246,  0.03314209,  0.06549944, -0.020595  ,
       -0.02453613,  0.0105678 , -0.07864816,  0.027274  , -0.10117885,
       -0.01447405, -0.01053292, -0.03382656, -0.01391602,  0.01940046,
       -0.02103097, -0.02934919,  0.06485421,  0.04711914, -0.09208461,
       -0.13393728, -0.06016323,  0.09922572, -0.02569144, -0.04889788,
        0.04525321,  0.06157575,  0.05718122,  0.00660924,  0.08762904,
       -0.02774266, -0.14508928, -0.10065569, -0.02283587, -0.07883998,
        0.05013602,  0.00815255,  0.00208391, -0.14137486,  0.07725307,
       -0.06021554, -0.01064628,  0.12785994, -0.08564977,  0.07

In [62]:
from numpy.linalg import norm
from numpy import dot

value1 = float(dot(mean_vector_1, mean_vector_2))
value2 = (norm(mean_vector_1)*norm(mean_vector_2))
if value2 == 0:
    print(1)
else:
    print(1 - value1/value2)

0.654121799164755


In [63]:
from gensim import matutils

1 - np.dot(matutils.unitvec(mean_vector_1), matutils.unitvec(mean_vector_2))

0.6541217863559723
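
The cells above list the words of each sentence by hand. A small helper can do the averaging for any token list and skip tokens that have no vector, which avoids a `KeyError` for out-of-vocabulary words. This is a minimal sketch: `sentence_vector` is a hypothetical helper, and the toy lookup table stands in for the real model.

```python
def sentence_vector(tokens, lookup):
    """Average the vectors of all tokens found in `lookup`, component-wise."""
    found = [lookup[token] for token in tokens if token in lookup]
    if not found:
        raise ValueError("none of the tokens has a vector")
    # zip(*found) walks the vectors column by column.
    return [sum(column) / len(found) for column in zip(*found)]

# Toy 2-dimensional lookup table (hypothetical); "the" is deliberately missing.
toy = {"quick": [1.0, 0.0], "brown": [0.0, 1.0], "fox": [1.0, 1.0]}
vec = sentence_vector("the quick brown fox".split(), toy)
print(vec)
```

Since `KeyedVectors` supports both `in` checks and indexing by key, the same function should also accept the model object `w` as `lookup`, although for real data `np.mean(..., axis=0)` as used above is the more natural choice.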

# Two models comparison

## Download one more model


In addition to the google-news-vectors model, let's use gensim to load a second model, trained on the British National Corpus: http://vectors.nlpl.eu/repository/20/0.zip


In [None]:
! wget -c http://vectors.nlpl.eu/repository/20/0.zip
! unzip 0.zip
! head -3 model.txt

Now load the model trained on the British National Corpus:

In [21]:
w_british = KeyedVectors.load_word2vec_format("model.bin", binary=True)

Note that the vector size is also 300 in this case. Keys in this model carry a part-of-speech tag appended after an underscore (for example, <code>london_NOUN</code>), and all words must be lowercased.

In [22]:
try:
    print(w_british["London_NOUN"].shape)
    print('upper is ok')
except KeyError:
    print(w_british["london_NOUN"].shape)
    print('lower is ok')

(300,)
lower is ok


In [26]:
w_british.distance("professor_NOUN", "student_NOUN")

0.5742734670639038

In [27]:
w_british.doesnt_match(["professor_NOUN", "student_NOUN", "smart", "wood_NOUN"])

'wood_NOUN'
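
Because every lookup in this model needs a lowercased word plus a part-of-speech suffix, it can help to build the key in one place. The helper below is a hypothetical convenience function, not part of gensim; the `NOUN` tag follows the keys used in the cells above.

```python
def bnc_key(word, pos="NOUN"):
    """Build a key such as 'london_NOUN' from a raw word and a POS tag."""
    return f"{word.lower()}_{pos}"

key = bnc_key("London")
print(key)
```

With the model loaded, a call such as `w_british[bnc_key("London")]` would then be equivalent to `w_british["london_NOUN"]`.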

## The dataset for the quality evaluation
Let's download the wordsim353 dataset. 

 

In [28]:
! wget -c http://alfonseca.org/pubs/ws353simrel.tar.gz 
! tar -xvf ws353simrel.tar.gz
! head -5 wordsim353_sim_rel/wordsim_similarity_goldstandard.txt

--2022-03-16 09:44:45--  http://alfonseca.org/pubs/ws353simrel.tar.gz
Resolving alfonseca.org (alfonseca.org)... 162.215.249.67
Connecting to alfonseca.org (alfonseca.org)|162.215.249.67|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5460 (5.3K) [application/x-gzip]
Saving to: ‘ws353simrel.tar.gz’


2022-03-16 09:44:46 (281 MB/s) - ‘ws353simrel.tar.gz’ saved [5460/5460]

wordsim353_sim_rel/wordsim353_agreed.txt
wordsim353_sim_rel/wordsim353_annotator1.txt
wordsim353_sim_rel/wordsim353_annotator2.txt
wordsim353_sim_rel/wordsim_relatedness_goldstandard.txt
wordsim353_sim_rel/wordsim_similarity_goldstandard.txt
tiger	cat	7.35
tiger	tiger	10.00
plane	car	5.77
train	car	6.31
television	radio	6.77


## Testing dataset preparation


Let's extract word pairs from the file `wordsim_similarity_goldstandard.txt` and compute the cosine similarity of their vectors in each model. Compute the correlation between the similarity estimates of the google-news-vectors model and the human ratings of the wordsim dataset, and then the correlation between the similarity estimates of the British National Corpus model and the same human ratings. Which model is closer to the human ratings?

(Use only those words from the wordsim dataset that have corresponding vectors in the British National Corpus model, labeled as NOUNs.)

In [96]:
import pandas as pd

df = pd.read_csv("wordsim353_sim_rel/wordsim_similarity_goldstandard.txt", 
                 sep="\t", header=None)
df.columns = ["first", "second", "score"]
df.head(3)

| | first | second | score |
|---|---|---|---|
| 0 | tiger | cat | 7.35 |
| 1 | tiger | tiger | 10.0 |
| 2 | plane | car | 5.77 |


In [97]:
df[18:118]

| | first | second | score |
|---|---|---|---|
| 18 | football | soccer | 9.03 |
| 19 | football | basketball | 6.81 |
| 20 | football | tennis | 6.63 |
| 21 | Arafat | Jackson | 2.50 |
| 22 | physics | chemistry | 7.35 |
| ... | ... | ... | ... |
| 113 | image | surface | 4.56 |
| 114 | life | term | 4.50 |
| 115 | start | match | 4.47 |
| 116 | computer | news | 4.47 |


In [98]:
df = df[18:118]

## Model similarity evaluation
Using only those words from the wordsim dataset that have corresponding vectors in the British National Corpus model labeled as nouns, build three sets of similarity measures:

1. Measures (cosine between vectors) obtained from the google-news-vectors model

2. Measures (cosine between vectors) obtained from the model based on the British National Corpus

3. Human ratings from word_sim for the words that have corresponding vectors in the British National Corpus model

The words skipped from word_sim are shown in the outputs.

In [99]:
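gn_dist, br_dist, gn_scores, br_scores = [], [], [], []

for _, row in df.iterrows():
  w1, w2 = row["first"], row["second"]
  try:
    # Look up both models first, so a missing key in either one
    # skips the pair before any list is extended (keeps the lists aligned).
    br_sim = w_british.similarity(w1.lower() + "_NOUN", w2.lower() + "_NOUN")
    gn_sim = w.similarity(w1, w2)
  except KeyError as e:
    print(e, "Skipping this word.")
    continue

  # Despite the *_dist names, these lists hold cosine similarities.
  br_dist.append(br_sim)
  br_scores.append(row["score"])
  gn_dist.append(gn_sim)
  gn_scores.append(row["score"])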

"Key 'arafat_NOUN' not present" Skipping this word.
"Key 'harvard_NOUN' not present" Skipping this word.
"Key 'mexico_NOUN' not present" Skipping this word.


In [100]:
w_british["Arafat_NOUN"]

KeyError: "Key 'Arafat_NOUN' not present"

In [101]:
w_british["japanese_NOUN"]

array([ 0.012908,  0.133877,  0.119815, -0.014262, -0.145282,  0.060865,
        0.029992,  0.029138,  0.059743, -0.039268, -0.050995,  0.008046,
        0.010828, -0.116168,  0.0298  ,  0.024223,  0.048758,  0.006173,
        0.028295,  0.114714, -0.003009, -0.00194 ,  0.041983,  0.018428,
        0.035408, -0.077806, -0.017081, -0.045228, -0.008924,  0.064999,
       -0.247234, -0.089149,  0.015466, -0.005598, -0.017188,  0.080832,
        0.046895,  0.011464, -0.010543,  0.011781, -0.050704, -0.05064 ,
        0.047088,  0.00158 ,  0.019804, -0.018575,  0.031504,  0.03258 ,
       -0.045149, -0.024635,  0.106007,  0.04806 , -0.025685, -0.001249,
       -0.120237,  0.071461,  0.030777, -0.037106, -0.00215 , -0.063987,
        0.039495,  0.041261,  0.019401,  0.065367, -0.031992, -0.023388,
       -0.00414 ,  0.104649, -0.058512, -0.069146, -0.010128,  0.066927,
        0.002017, -0.029406, -0.033663, -0.043762, -0.130771, -0.069219,
        0.016728, -0.012452,  0.081425,  0.025594, 

## Model selection: correlation with human ratings

Compute Spearman's correlation between each model and human ratings from word_sim.

The results are in the outputs.

In [102]:
len(gn_dist)

97

In [103]:
len(br_scores)

97

In [104]:
gn_dist

[0.73135483,
 0.66824675,
 0.5051179,
 0.4371983,
 0.59717494,
 0.6881493,
 0.50702006,
 0.5838368,
 0.6210811,
 0.68308526,
 0.5886159,
 0.5083667,
 0.2525393,
 0.48634958,
 0.5527407,
 0.60839105,
 0.3740926,
 0.36290243,
 0.30286193,
 0.11830647,
 0.3135657,
 0.16010122,
 0.5528684,
 0.42671448,
 0.42893752,
 0.49644744,
 0.5096533,
 0.06294279,
 0.3297567,
 0.5556288,
 0.48088345,
 0.2597164,
 0.19317105,
 0.5024792,
 0.32067436,
 0.37273747,
 0.28377137,
 0.19486345,
 0.03369916,
 0.10605283,
 0.034792215,
 0.40548652,
 0.13319406,
 0.7258478,
 0.33974788,
 0.4124885,
 0.33483714,
 0.4673331,
 0.16101654,
 0.14553328,
 0.36187765,
 0.12772588,
 0.6666412,
 0.1775303,
 0.38327035,
 0.26695922,
 0.2862988,
 0.27859154,
 0.60769564,
 0.34199455,
 0.5745356,
 0.25621173,
 0.13774039,
 0.3653528,
 0.4636158,
 0.5295588,
 0.654408,
 0.30510467,
 0.3194861,
 0.6655317,
 0.76640123,
 0.15232418,
 0.60576504,
 0.17167465,
 0.37073973,
 0.63757634,
 0.45277104,
 0.33256373,
 0.34269473,
 0.

In [105]:
gn_scores

[9.03,
 6.81,
 6.63,
 7.35,
 8.46,
 8.13,
 6.87,
 8.94,
 8.96,
 9.29,
 8.83,
 9.1,
 8.87,
 9.02,
 9.29,
 8.79,
 7.52,
 7.1,
 7.38,
 4.42,
 8.42,
 9.04,
 8.0,
 8.0,
 7.08,
 6.85,
 7.0,
 4.77,
 5.62,
 8.08,
 6.71,
 5.58,
 8.45,
 8.08,
 8.02,
 5.85,
 6.04,
 6.85,
 2.92,
 3.69,
 2.15,
 7.42,
 7.27,
 8.66,
 6.22,
 6.5,
 7.59,
 7.56,
 5.0,
 4.63,
 7.88,
 5.0,
 8.97,
 6.44,
 8.88,
 6.88,
 7.81,
 7.63,
 8.44,
 7.63,
 7.78,
 9.22,
 7.13,
 7.89,
 7.47,
 8.34,
 8.7,
 7.81,
 5.7,
 8.36,
 8.3,
 5.25,
 8.53,
 6.88,
 5.56,
 7.83,
 7.59,
 7.19,
 6.31,
 5.0,
 5.0,
 4.97,
 4.94,
 4.94,
 4.88,
 4.81,
 4.75,
 4.75,
 4.75,
 4.69,
 4.62,
 4.59,
 4.56,
 4.5,
 4.47,
 4.47,
 4.47]

In [106]:
from scipy.stats import spearmanr

#enter your code here
coef, p = spearmanr(gn_dist, gn_scores)
print("gn_dist  Spearman R: %.4f" % coef)

coef, p = spearmanr(br_dist, br_scores)
print("br_dist  Spearman R: %.4f" % coef)

gn_dist  Spearman R: 0.7186
br_dist  Spearman R: 0.6761


Note that the google-news-vectors model correlates slightly better with the human ratings in this case (0.7186 versus 0.6761).
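
Spearman's correlation compares rankings rather than raw values: both lists are replaced by their ranks (ties share the average rank), and the Pearson correlation of the ranks is computed. The following dependency-free sketch illustrates that definition; it is not the `scipy` implementation, and the helper names and toy numbers are made up.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical model similarities and human scores for four word pairs.
model_sims = [0.9, 0.5, 0.6, 0.1]
human_scores = [9.0, 6.0, 5.0, 2.0]
rho = spearman(model_sims, human_scores)
print(round(rho, 4))
```

Because only the ordering matters, the different scales of the two inputs (cosine similarities versus ratings from 0 to 10) do not affect the coefficient.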