Once the model is loaded in Python, we can explore its built-in functions. First, we can look up the numeric vector for any word in the vocabulary; with this model the vector always has length 300.

https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/

In [24]:
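from gensim.models import KeyedVectors

# path to the pre-trained Google News vectors (300 dimensions per word)
word2vec_path = 'data/GoogleNews-vectors-negative300.bin.gz'
# binary=True because the file is stored in the binary word2vec format
model = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)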

In [22]:
import gensim

In [25]:
vector = model['easy']

# check the shape of the vector: it is always (300,) for this model
vector.shape

(300,)

### Similarities

In [26]:
# find the words most similar to a given word
model.most_similar("nice")

[('good', 0.6836091876029968),
 ('lovely', 0.6676310896873474),
 ('neat', 0.6616737246513367),
 ('fantastic', 0.6569240689277649),
 ('wonderful', 0.6561347246170044),
 ('terrific', 0.6552367806434631),
 ('great', 0.6454657912254333),
 ('awesome', 0.6404187679290771),
 ('nicer', 0.6302445530891418),
 ('decent', 0.5993332862854004)]

In [27]:
# or get the similarity score (cosine similarity) of any two words
model.similarity("nice","good")

0.6836092
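
The score above is a cosine similarity: the cosine of the angle between the two word vectors. As a minimal sketch of what that means, the function below computes it by hand in plain Python. The three-dimensional vectors are made-up toy values chosen for illustration, not vectors taken from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values, NOT from the Google News model)
nice = [0.9, 0.1, 0.3]
good = [0.8, 0.2, 0.4]
table = [-0.2, 0.9, -0.1]

# Vectors pointing in similar directions score close to 1,
# unrelated or opposing directions score near 0 or below.
print(cosine_similarity(nice, good))
print(cosine_similarity(nice, table))
```

With the real model, the same function applied to `model['nice']` and `model['good']` should agree with `model.similarity("nice", "good")` up to floating-point error.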

Interestingly, two antonyms (words with opposite meanings) often come out as highly similar in a Word2Vec model. This is because the model learns from the contexts in which words occur, and opposites can usually replace each other in the same sentence.

In [28]:
# antonyms still get a high similarity score
model.similarity("bad","good")

0.7190051
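
To illustrate why antonyms end up close together, here is a small sketch that counts the words appearing around "good" and "bad" in a toy corpus. The corpus, the helper name `context_counts`, and the window size are all assumptions made for this example; Word2Vec's actual training is more involved, but it relies on the same kind of context windows.

```python
from collections import Counter

def context_counts(sentences, target, window=2):
    """Count words that appear within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # every token in the window except the target itself
            counts.update(tokens[j] for j in range(lo, hi) if j != i)
    return counts

# Toy corpus (hypothetical): the antonyms occur in identical contexts
corpus = [
    "the movie was good and we liked it",
    "the movie was bad and we left early",
    "that is a good idea for the team",
    "that is a bad idea for the team",
]

good_ctx = context_counts(corpus, "good")
bad_ctx = context_counts(corpus, "bad")
shared = set(good_ctx) & set(bad_ctx)
print(sorted(shared))
```

In this toy corpus the two words share every context word, so a model trained only on contexts has no signal to push them apart.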

In [29]:
# We can also look for interesting relationships between words.

# king - queen = man - woman, i.e. king - man + woman ≈ queen
model.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]
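
The analogy query can be read as vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and rank the remaining words by cosine similarity to the result. The sketch below does this on a made-up two-dimensional vocabulary (roughly "royalty" and "gender" axes). The function `analogy` and the toy vectors are hypothetical, and gensim's own implementation differs in details such as normalizing the vectors first.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # the query words themselves are excluded from the results
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy embeddings (hypothetical): [royalty, gender]
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "prince": [0.8, 1.0],
    "apple": [-1.0, 0.0],
}

result = analogy(toy_vectors, positive=["woman", "king"], negative=["man"])
print(result[0][0])
```

Here the target vector `woman + king - man` lands on the toy vector for "queen", so "queen" is ranked first.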

In [None]:
# More analogies to try:
# mom --> girl, dad --> ?
# france --> paris, spain --> ?
# mother --> ?, table --> chair

In [31]:
# mom --> girl, dad --> ?
model.most_similar(positive=['dad', 'girl'], negative=['mom'])

[('boy', 0.808031439781189),
 ('teenager', 0.6755869388580322),
 ('teenage_girl', 0.6386617422103882),
 ('man', 0.6255338191986084),
 ('lad', 0.616614043712616),
 ('schoolgirl', 0.611348032951355),
 ('schoolboy', 0.6011567115783691),
 ('son', 0.593845784664154),
 ('father', 0.5887871384620667),
 ('uncle', 0.5734449028968811)]

In [32]:
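# france --> paris, spain --> ?
# note: the vocabulary is case-sensitive, so these lowercase tokens are
# different entries from 'Paris', 'Spain' and 'France', which likely
# explains the noisy neighbours below
model.most_similar(positive=['paris', 'spain'], negative=['france'])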

[('madrid', 0.5295541882514954),
 ('dubai', 0.5092597603797913),
 ('heidi', 0.48901548981666565),
 ('portugal', 0.48763689398765564),
 ('paula', 0.4855714738368988),
 ('alex', 0.4807346761226654),
 ('lohan', 0.4801103174686432),
 ('diego', 0.48010098934173584),
 ('florence', 0.47695302963256836),
 ('costa', 0.4761490523815155)]

In [33]:
# mother --> ?, table --> chair
model.most_similar(positive=['chair', 'mother'], negative=['table'])

[('daughter', 0.6066097021102905),
 ('niece', 0.5490824580192566),
 ('granddaughter', 0.5400506854057312),
 ('aunt', 0.5397382378578186),
 ('husband', 0.5387389659881592),
 ('sister', 0.5360148549079895),
 ('son', 0.5356959104537964),
 ('wife', 0.5313628911972046),
 ('father', 0.5261732935905457),
 ('grandmother', 0.5253341197967529)]