# How to Use Word2Vec in Gensim
This tutorial is mainly based on Radim Řehůřek's blog post (https://rare-technologies.com/word2vec-tutorial/) and the gensim word2vec documentation (https://radimrehurek.com/gensim/models/word2vec.html).

## Install gensim

https://radimrehurek.com/gensim/install.html

## Using pre-trained vectors

### Download the GoogleNews vectors:

https://code.google.com/archive/p/word2vec/

In [1]:
import gensim

### Load pretrained models

In [2]:
# gzipped/bz2 input works too, no need to decompress

model_pretrained = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True) #Load pre-trained models in C binary format
#model_pretrained = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True) #use this instead on old gensim releases that predate KeyedVectors

### Use the trained/loaded model

In [4]:
#get the vocabulary: a dict mapping each word to its Vocab entry (gensim 4+ replaces this attribute with key_to_index)
model_pretrained.vocab

{'neutered': <gensim.models.keyedvectors.Vocab at 0x1e1b647f0>,
 'Dominic_Nutt': <gensim.models.keyedvectors.Vocab at 0x1f343c390>,
 'Maarif': <gensim.models.keyedvectors.Vocab at 0x1f67d6ac8>,
 'biceps_muscle': <gensim.models.keyedvectors.Vocab at 0x1eb1231d0>,
 'nattered': <gensim.models.keyedvectors.Vocab at 0x203c1db00>,
 'alkalines': <gensim.models.keyedvectors.Vocab at 0x1fafa2cc0>,
 'Ted_Poe': <gensim.models.keyedvectors.Vocab at 0x1e5d82128>,
 'Dieter_Depping': <gensim.models.keyedvectors.Vocab at 0x20286ac18>,
 'Jamestown_colonists': <gensim.models.keyedvectors.Vocab at 0x21550bb38>,
 'midweek_Coppa_Italia': <gensim.models.keyedvectors.Vocab at 0x23c12d0f0>,
 'GIST_tumors': <gensim.models.keyedvectors.Vocab at 0x22b2e9be0>,
 'Fiscal_Imbalance': <gensim.models.keyedvectors.Vocab at 0x233884208>,
 'Reactive': <gensim.models.keyedvectors.Vocab at 0x1e7af22e8>,
 'succinylcholine': <gensim.models.keyedvectors.Vocab at 0x1f057cb38>,
 'Mersey_Waste': <gensim.models.keyedvectors.Vocab

In [6]:
#get all the vectors
model_pretrained.vectors

array([[ 1.1291504e-03, -8.9645386e-04,  3.1852722e-04, ...,
        -1.5640259e-03, -1.2302399e-04, -8.6307526e-05],
       [ 7.0312500e-02,  8.6914062e-02,  8.7890625e-02, ...,
        -4.7607422e-02,  1.4465332e-02, -6.2500000e-02],
       [-1.1779785e-02, -4.7363281e-02,  4.4677734e-02, ...,
         7.1289062e-02, -3.4912109e-02,  2.4169922e-02],
       ...,
       [-1.9653320e-02, -9.0820312e-02, -1.9409180e-02, ...,
        -1.6357422e-02, -1.3427734e-02,  4.6630859e-02],
       [ 3.2714844e-02, -3.2226562e-02,  3.6132812e-02, ...,
        -8.8500977e-03,  2.6977539e-02,  1.9042969e-02],
       [ 4.5166016e-02, -4.5166016e-02, -3.9367676e-03, ...,
         7.9589844e-02,  7.2265625e-02,  1.3000488e-02]], dtype=float32)

You can access vectors on a word-by-word basis:

In [9]:
model_pretrained['computer'] #accepts a single word and returns its representation in vector space as a 1D numpy array
#model_pretrained.get_vector('computer') #this does the same thing

array([ 1.07421875e-01, -2.01171875e-01,  1.23046875e-01,  2.11914062e-01,
       -9.13085938e-02,  2.16796875e-01, -1.31835938e-01,  8.30078125e-02,
        2.02148438e-01,  4.78515625e-02,  3.66210938e-02, -2.45361328e-02,
        2.39257812e-02, -1.60156250e-01, -2.61230469e-02,  9.71679688e-02,
       -6.34765625e-02,  1.84570312e-01,  1.70898438e-01, -1.63085938e-01,
       -1.09375000e-01,  1.49414062e-01, -4.65393066e-04,  9.61914062e-02,
        1.68945312e-01,  2.60925293e-03,  8.93554688e-02,  6.49414062e-02,
        3.56445312e-02, -6.93359375e-02, -1.46484375e-01, -1.21093750e-01,
       -2.27539062e-01,  2.45361328e-02, -1.24511719e-01, -3.18359375e-01,
       -2.20703125e-01,  1.30859375e-01,  3.66210938e-02, -3.63769531e-02,
       -1.13281250e-01,  1.95312500e-01,  9.76562500e-02,  1.26953125e-01,
        6.59179688e-02,  6.93359375e-02,  1.02539062e-02,  1.75781250e-01,
       -1.68945312e-01,  1.21307373e-03, -2.98828125e-01, -1.15234375e-01,
        5.66406250e-02, -

Compute cosine distance between two words

In [3]:
model_pretrained.distance('woman', 'man')

0.2335987769004647

Compute cosine distances from a given word or vector to each word in a list of words

In [4]:
model_pretrained.distances('woman', ['man', 'boy', 'girl', 'woman', 'women'])

array([0.23359877, 0.4024092 , 0.25053585, 0.        , 0.4696222 ],
      dtype=float32)

Compute cosine similarity between two words.

In [5]:
model_pretrained.similarity('woman', 'man')

0.7664012230995353
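
The distance and similarity printed above add up to 1 (0.2336 + 0.7664), because cosine distance is one minus cosine similarity. Here is a minimal pure-Python sketch of that relationship. The 3-dimensional vectors and the helper names are made up for illustration and are not gensim's code:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Toy vectors, invented for illustration only
woman = [0.9, 0.1, 0.3]
man = [0.8, 0.3, 0.1]

sim = cosine_similarity(woman, man)
dist = cosine_distance(woman, man)
print(sim, dist, sim + dist)
```

A word compared with itself has similarity 1 and distance 0, which matches the 0 in the `distances` output for 'woman'.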

Compute cosine similarity between two sets of words

In [6]:
model_pretrained.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])

0.5983722657356549
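
To the best of our understanding, `n_similarity` averages the vectors of each word set and returns the cosine similarity of the two averages. The sketch below reproduces that idea in pure Python. The toy vectors and the function name `set_similarity` are assumptions for illustration:

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def set_similarity(words_a, words_b, vectors):
    """Cosine similarity between the mean vectors of two word sets."""
    mean_a = mean_vector([vectors[w] for w in words_a])
    mean_b = mean_vector([vectors[w] for w in words_b])
    return cosine_similarity(mean_a, mean_b)

# Toy vectors, invented for illustration only
toy = {
    'sushi': [0.9, 0.1, 0.0],
    'shop': [0.1, 0.8, 0.2],
    'japanese': [0.7, 0.1, 0.3],
    'restaurant': [0.2, 0.9, 0.1],
}

score = set_similarity(['sushi', 'shop'], ['japanese', 'restaurant'], toy)
print(score)
```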

Find the top-N most similar words to a given word.

In [7]:
model_pretrained.similar_by_word('graph', topn=5)

[('graphs', 0.6810587644577026),
 ('diagram', 0.595000684261322),
 ('y_axis', 0.5707489252090454),
 ('graph_illustrating', 0.5563781261444092),
 ('chart', 0.5365185737609863)]

Word Analogy

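i.e. find the top-N words that are most similar to the positive words and least similar to the negative ones. `most_similar` combines the words additively: positive words contribute positively towards the similarity, negative words negatively. The multiplicative combination objective proposed by Omer Levy and Yoav Goldberg, which is less susceptible to one large distance dominating the calculation, is available separately as `most_similar_cosmul`.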


In [13]:
model_pretrained.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)

[('queen', 0.7118192315101624),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581)]
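
`most_similar` uses an additive combination (often called 3CosAdd), while the multiplicative objective of Levy and Goldberg (3CosMul) is what `most_similar_cosmul` computes. The sketch below contrasts the two scoring rules on toy vectors. Everything in it (the vectors, the `analogy` helper, the `eps` value) is an illustrative assumption rather than gensim's implementation, and `math.prod` needs Python 3.8 or later:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 3-d vectors invented for illustration: [royalty, maleness, femaleness]
toy = {
    'king': [0.9, 0.8, 0.1],
    'queen': [0.9, 0.1, 0.8],
    'man': [0.1, 0.9, 0.1],
    'woman': [0.1, 0.1, 0.9],
    'prince': [0.7, 0.7, 0.1],
    'apple': [0.0, 0.3, 0.3],
}

def analogy(positive, negative, vectors, method='add', eps=1e-6):
    """Rank all words outside the query by a 3CosAdd or 3CosMul score."""
    exclude = set(positive) | set(negative)
    scores = {}
    for word, vec in vectors.items():
        if word in exclude:
            continue
        pos = [cosine(vec, vectors[p]) for p in positive]
        neg = [cosine(vec, vectors[n]) for n in negative]
        if method == 'add':
            # 3CosAdd: add similarities to positives, subtract those to negatives
            scores[word] = sum(pos) - sum(neg)
        else:
            # 3CosMul: shift cosines from [-1, 1] to [0, 1], then multiply and divide
            pos_shifted = [(1 + c) / 2 for c in pos]
            neg_shifted = [(1 + c) / 2 for c in neg]
            scores[word] = math.prod(pos_shifted) / (math.prod(neg_shifted) + eps)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked_add = analogy(['woman', 'king'], ['man'], toy, method='add')
ranked_mul = analogy(['woman', 'king'], ['man'], toy, method='mul')
print(ranked_add[0][0], ranked_mul[0][0])
```

On these toy vectors both rules rank 'queen' first; on real embeddings the two rankings can differ.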

Which word from the given list doesn't go with the others?

In [14]:
model_pretrained.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'
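
One way to picture `doesnt_match`: normalise every word vector to unit length, average them, and return the word whose vector is least similar to that average. This is a sketch of the idea on made-up 2-dimensional vectors, with a hypothetical helper name, not gensim's own code:

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    mean = unit([sum(col) / len(words) for col in zip(*units.values())])
    alignment = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(alignment, key=alignment.get)

# Toy vectors, invented for illustration only
toy = {
    'breakfast': [0.9, 0.1],
    'lunch': [0.8, 0.2],
    'dinner': [0.85, 0.15],
    'cereal': [0.2, 0.9],
}

result = odd_one_out('breakfast cereal dinner lunch'.split(), toy)
print(result)
```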

## Train our own models

In [47]:
import gensim

In [48]:
with open('data/data.txt') as f:  # the with-block closes the file once it has been read
    training_data = [line.lower().strip().split() for line in f]

In [49]:
#All gensim.models.Word2Vec requires is an iterable that yields one sentence after another, each sentence being a list of words (unicode strings)
model = gensim.models.Word2Vec(training_data)
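
Because only an iterable of token lists is required, the corpus does not have to fit in memory. The class below is a sketch of a streaming input, following the idea in the blog tutorial linked at the top. The class name `DirectorySentences` and the demo files are hypothetical. It is a class with `__iter__` rather than a plain generator because training makes several passes over the data, and a generator would be exhausted after the first one:

```python
import os
import tempfile

class DirectorySentences:
    """Hypothetical helper: streams tokenised lines from every file in a directory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            with open(path, encoding='utf-8') as f:
                for line in f:
                    tokens = line.lower().split()
                    if tokens:  # skip blank lines
                        yield tokens

# Demo on a throwaway directory with two tiny files
with tempfile.TemporaryDirectory() as tmp:
    demo_files = [('a.txt', 'Human machine interface\n\nGraph of trees\n'),
                  ('b.txt', 'Graph minors survey\n')]
    for name, text in demo_files:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write(text)
    sentences = DirectorySentences(tmp)
    first_pass = list(sentences)
    second_pass = list(sentences)  # iterating again restarts from the first file

print(first_pass)
```

An instance such as `DirectorySentences('data/')` could then be passed to `gensim.models.Word2Vec` in place of `training_data`.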

In [50]:
model = gensim.models.Word2Vec(training_data, min_count=1, size=10, window=5, workers=4)
#min_count: ignore words that appear fewer than min_count times, default value is 5
#size: the dimensionality of the word vectors, default value is 100 (renamed vector_size in gensim 4+)
#window: the maximum number of words before and after the target word that are taken as context, default value is 5
#workers: number of worker threads used to parallelize training. It only has an effect if you have Cython installed. Default value is 3
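
To make `min_count` and `window` concrete, the sketch below filters rare words and lists the (target, context) pairs that a fixed window produces. It is a simplified picture with hypothetical helper names. As far as we know, gensim additionally shrinks the window at random per target word and subsamples frequent words, which this sketch ignores:

```python
from collections import Counter

def filter_by_min_count(sentences, min_count):
    """Drop every word that occurs fewer than min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count] for sentence in sentences]

def context_pairs(sentence, window):
    """List (target, context) pairs using up to `window` words on each side."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

# Toy corpus, invented for illustration only
corpus = [['human', 'computer', 'interface'],
          ['graph', 'of', 'trees'],
          ['graph', 'minors', 'survey'],
          ['human', 'graph']]

kept = filter_by_min_count(corpus, min_count=2)
pairs = context_pairs(['a', 'b', 'c', 'd'], window=1)
print(kept)
print(pairs)
```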

In [51]:
#store the trained model
model.save('mymodel')

#The word vectors are stored in a KeyedVectors instance in model.wv.
#This separates the read-only word vector lookup operation in KeyedVectors from the training code in Word2Vec.

In [52]:
# load the trained model
mymodel = gensim.models.Word2Vec.load('mymodel')

In [61]:
mymodel.wv.similarity('graph', 'human')

-0.09071871406943281

In [62]:
mymodel.wv.distance('graph', 'human')

1.0907187140694328

In [63]:
mymodel.wv.similar_by_word('graph', topn=5)

[('relation', 0.7568781971931458),
 ('ordering', 0.7336501479148865),
 ('interface', 0.4849057197570801),
 ('and', 0.4371868968009949),
 ('in', 0.3982817232608795)]

In [64]:
mymodel.wv.vocab

{'a': <gensim.models.keyedvectors.Vocab at 0x109698cf8>,
 'abc': <gensim.models.keyedvectors.Vocab at 0x109698c88>,
 'and': <gensim.models.keyedvectors.Vocab at 0x109698f28>,
 'applications': <gensim.models.keyedvectors.Vocab at 0x109698b00>,
 'binary': <gensim.models.keyedvectors.Vocab at 0x10969c160>,
 'computer': <gensim.models.keyedvectors.Vocab at 0x10969c048>,
 'engineering': <gensim.models.keyedvectors.Vocab at 0x109698f60>,
 'eps': <gensim.models.keyedvectors.Vocab at 0x109698cc0>,
 'error': <gensim.models.keyedvectors.Vocab at 0x10969c208>,
 'for': <gensim.models.keyedvectors.Vocab at 0x10969c278>,
 'generation': <gensim.models.keyedvectors.Vocab at 0x109698fd0>,
 'graph': <gensim.models.keyedvectors.Vocab at 0x10969c080>,
 'human': <gensim.models.keyedvectors.Vocab at 0x10969c0b8>,
 'in': <gensim.models.keyedvectors.Vocab at 0x109698eb8>,
 'interface': <gensim.models.keyedvectors.Vocab at 0x109698a20>,
 'intersection': <gensim.models.keyedvectors.Vocab at 0x109698e48>,
 'iv':

In [65]:
mymodel.wv['graph']

array([-0.00351064, -0.01765375, -0.02035482, -0.03154445, -0.04427642,
        0.01508797,  0.01351037,  0.03022235, -0.03726849,  0.04310504],
      dtype=float32)

## Continue training with the loaded model

You can continue training with the loaded model!
Note: It is impossible to continue training vectors loaded from the C format, because the hidden weights, the vocabulary frequencies and the binary tree are missing.

In [67]:
# load the trained model
mymodel = gensim.models.Word2Vec.load('mymodel')

In [68]:
more_training_data = ['gensim provides us a convenient module to learn word embeddings'.split(), 
                      'gensim has some built-in method'.split(), 
                      'gensim graph helps human understand ideas'.split()]

In [69]:
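# total_examples is the number of sentences in this batch; words that are not yet in the vocabulary are skipped unless you first call mymodel.build_vocab(more_training_data, update=True)
mymodel.train(more_training_data, total_examples=len(more_training_data), epochs=mymodel.epochs)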

(6, 105)
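
The tuple returned by `train` appears to be (effective word count, raw word count). The raw count can be checked by hand, assuming the model uses 5 training epochs (an assumption here; check `mymodel.epochs` for the real value). The first number is much smaller, presumably because words outside the model's vocabulary are skipped and frequent words are subsampled.

```python
more_training_data = ['gensim provides us a convenient module to learn word embeddings'.split(),
                      'gensim has some built-in method'.split(),
                      'gensim graph helps human understand ideas'.split()]

epochs = 5  # assumption: the number of epochs stored on the model

# Every token in every sentence is seen once per epoch
raw_words_per_epoch = sum(len(sentence) for sentence in more_training_data)
raw_words = raw_words_per_epoch * epochs
print(raw_words_per_epoch, raw_words)  # 21 105
```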

## Exercise:

- Please train a word embedding model on the data provided in the folder 'pos/' (this is part of the training data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz). You need to clean up the data first, e.g. by lower-casing every character, removing punctuation, removing stop-words, etc.
- Calculate the distance between "man" and "woman" in this dataset.
- Print out the first 10 most similar words to "woman".
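
As a starting point for the clean-up step, here is a sketch of a tokeniser that lower-cases the text, strips HTML tags (the IMDB reviews contain `<br />` line breaks), removes punctuation and drops stop-words. The function name `clean_review` and the tiny stop-word set are assumptions for illustration; a real run would use a fuller stop-word list:

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, far from complete)
STOP_WORDS = {'a', 'an', 'the', 'is', 'it', 'of', 'and', 'to', 'in', 'this'}

# Translation table that turns every ASCII punctuation character into a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def clean_review(text):
    """Return a list of lower-cased tokens without tags, punctuation or stop-words."""
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)   # remove HTML tags such as <br />
    text = text.translate(PUNCT_TO_SPACE)  # replace punctuation with spaces
    return [token for token in text.split() if token not in STOP_WORDS]

tokens = clean_review('This movie is GREAT!<br />A must-see.')
print(tokens)
```

Each cleaned review is then one "sentence" (a list of tokens) that can be fed to Word2Vec.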