# Word2Vec Examples

First, download the pretrained embeddings file (`GoogleNews-vectors-negative300.bin.gz`) from [https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g) and place it in the same directory as this notebook.

(If you're on Colab, you'll need to upload this file to your Colab project.)

Note that this is a large download: the file is about 1.5GB!

**DO NOT unzip/decompress/double-click on this file once it's downloaded!** (On some computers, doing so deletes the original compressed file, and this notebook needs the file in its compressed format to run.)

In [2]:
#Run this every time you run the notebook 
embeddings_file = "GoogleNews-vectors-negative300.bin.gz"

Next, you'll need to install gensim by opening a terminal/command window and typing:

`conda install gensim`

## Colab and own-computer users: everyone continue here!

In [3]:
import numpy as np
from gensim.models.keyedvectors import KeyedVectors

In [4]:
#This line will load the 200,000 most common words into wv, a variable of word vectors
#Note that this will load these into memory, so if you don't have a lot of memory on your computer you may run into problems/slowness
wv = KeyedVectors.load_word2vec_format(embeddings_file, binary=True, limit=200000)
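
Because only the 200,000 most common words are loaded, looking up a word outside that set with `wv[...]` raises a `KeyError`. Here is a minimal sketch of a guard you could run before a query. The helper name `split_known` is hypothetical, and a plain Python set stands in for `wv` so the example runs without the embeddings file; the `in` operator should work the same way on `wv`.

```python
def split_known(words, vocab):
    """Split words into (known, unknown) by membership in vocab.

    vocab can be anything that supports the `in` operator.
    """
    known = [w for w in words if w in vocab]
    unknown = [w for w in words if w not in vocab]
    return known, unknown


# Stand-in vocabulary for illustration only; in the notebook you would pass wv.
toy_vocab = {"dog", "cat", "London", "Paris"}

known, unknown = split_known(["dog", "London", "Camberwell"], toy_vocab)
print("known:", known)
print("unknown:", unknown)
```

Only the `known` words are safe to pass on to functions such as `wv.most_similar`.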

In [5]:
#View the actual value of a vector for a word:
# this is a 300-dimensional vector
print("Shape of a word vector: ")
print(np.shape(wv['language'])) #play around with other words
print("Word vector for 'language':")
print(wv['language'])

Shape of a word vector: 
(300,)
Word vector for 'language':
[ 2.30712891e-02  1.68457031e-02  1.54296875e-01  1.27929688e-01
 -2.67578125e-01  3.51562500e-02  1.19140625e-01  2.48046875e-01
  1.93359375e-01 -7.95898438e-02  1.46484375e-01 -1.43554688e-01
 -3.04687500e-01  3.46679688e-02 -1.85546875e-02  1.06933594e-01
 -1.52343750e-01  2.89062500e-01  2.35595703e-02 -3.80859375e-01
  1.09863281e-01  4.41406250e-01  3.75976562e-02 -1.22680664e-02
  1.62353516e-02 -2.24609375e-01  7.61718750e-02 -3.12500000e-02
 -2.16064453e-02  1.49414062e-01 -4.02832031e-02 -4.46777344e-02
 -1.72851562e-01  3.32031250e-02  1.50390625e-01 -5.05371094e-02
  2.72216797e-02  3.00781250e-01 -1.33789062e-01 -7.56835938e-02
  1.93359375e-01 -1.98242188e-01 -1.27563477e-02  4.19921875e-01
 -2.19726562e-01  1.44531250e-01 -3.93066406e-02  1.94335938e-01
 -3.12500000e-01  1.84570312e-01  1.48773193e-04 -1.67968750e-01
 -7.37304688e-02 -3.12500000e-02  1.57226562e-01  3.30078125e-01
 -1.42578125e-01 -3.16406250e-

In [6]:
#We can compare words by computing their cosine similarity
# Try this out for different word pairs:
print(wv.similarity('dog', 'cat'))
print(wv.similarity('dog', 'terrier'))
print(wv.similarity('dog', 'dolphin'))
print(wv.similarity('dog', 'aubergine'))
print(wv.similarity('dog', 'exuberant'))
print(wv.similarity('dog', 'economics'))

0.7609457
0.65996563
0.3794182
0.033181373
0.04996342
-0.0038632825
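
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitudes. A small sketch with NumPy, using made-up 2-dimensional vectors rather than real word vectors (the function name `cosine_similarity` is our own, not part of gensim):

```python
import numpy as np


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors chosen so the three extreme cases are easy to see.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # at right angles
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # anti-parallel

print(same_direction, orthogonal, opposite)
```

In the notebook, `cosine_similarity(wv['dog'], wv['cat'])` should agree with `wv.similarity('dog', 'cat')` up to floating-point rounding.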


In [7]:
# Look for most similar words to a given word
#Try this out for different words
wv.most_similar(positive=['dog'])

[('dogs', 0.8680489659309387),
 ('puppy', 0.8106428384780884),
 ('pit_bull', 0.7803961038589478),
 ('pooch', 0.7627376914024353),
 ('cat', 0.7609457969665527),
 ('golden_retriever', 0.7500901818275452),
 ('German_shepherd', 0.7465174198150635),
 ('Rottweiler', 0.7437615394592285),
 ('beagle', 0.7418621778488159),
 ('pup', 0.7406911253929138)]

In [8]:
# Look for most similar words to a set of words
#wv.most_similar(positive=['potato', 'yam', 'yucca'])
#wv.most_similar(positive=['croissant', 'donut'])
wv.most_similar(positive=['London', 'Paris']) #Or maybe Camberwell?

[('Brussels', 0.59405916929245),
 ('Berlin', 0.592768669128418),
 ('Amsterdam', 0.5912652015686035),
 ('Madrid', 0.5895830392837524),
 ('Frankfurt', 0.55706387758255),
 ('Parisian', 0.5517574548721313),
 ('Rome', 0.5313206911087036),
 ('Stockholm', 0.5250915884971619),
 ('Marrakesh', 0.5232300758361816),
 ('Strasbourg', 0.5227499604225159)]

In [9]:
#Try it with people
wv.most_similar(positive=['Boris_Johnson','Gordon_Brown'])

[('Ken_Livingstone', 0.7916719317436218),
 ('George_Osborne', 0.7817354798316956),
 ('Tony_Blair', 0.7768580317497253),
 ('Nick_Clegg', 0.7726122736930847),
 ('Alistair_Darling', 0.7554317712783813),
 ('Mr_Clegg', 0.7268158197402954),
 ('Chancellor_Alistair_Darling', 0.7248950004577637),
 ('Ed_Miliband', 0.7125200033187866),
 ('chancellor_Alistair_Darling', 0.7029644846916199),
 ('Ed_Balls', 0.7017458081245422)]

In [10]:
# Let's try answering a question via adding vectors: What are some snacks that people like in England?
wv.most_similar(positive=['England','snack'])

[('snacks', 0.6008167266845703),
 ('crisps', 0.5351516604423523),
 ('snacking', 0.5131285786628723),
 ('snack_foods', 0.4977743625640869),
 ('Snacks', 0.48596659302711487),
 ('potato_chips', 0.4819340705871582),
 ('Snack', 0.47909489274024963),
 ('breakfast_cereal', 0.47190386056900024),
 ('lunch', 0.4588703215122223),
 ('yogurt', 0.45873087644577026)]

In [11]:
# Now let's experiment with subtracting vectors...
# the "positive" words point in the direction(s) you want to go; the "negative" words' vectors are subtracted from these

#Example from powerpoint: goose + (dogs - dog)
wv.most_similar(positive=['goose','dogs'],negative=['dog']) #It's geese!

[('geese', 0.7546983957290649),
 ('Canada_geese', 0.6291584968566895),
 ('Geese', 0.6107496619224548),
 ('waterfowl', 0.6093072295188904),
 ('birds', 0.6005197763442993),
 ('mallards', 0.5957626104354858),
 ('pheasants', 0.5831050872802734),
 ('turkeys', 0.5781581401824951),
 ('moose', 0.5723955631256104),
 ('wild_turkeys', 0.572321891784668)]
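
To make the arithmetic behind these analogy queries concrete, here is a toy re-implementation. As far as we understand gensim's behaviour, `most_similar` scales each input vector to unit length, averages the positive vectors together with the negated negative vectors, ranks all words by cosine similarity to that average, and leaves the input words out of the results. The function `toy_most_similar` and the 2-dimensional vectors below are invented for illustration and are not taken from the real model.

```python
import numpy as np


def unit(v):
    """Scale a vector to length 1."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def toy_most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to mean(+positive, -negative)."""
    signed = [unit(vectors[w]) for w in positive]
    signed += [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(signed, axis=0))

    # Input words are excluded, otherwise they would dominate the ranking.
    exclude = set(positive) | set(negative)
    scores = [
        (word, float(np.dot(unit(vec), query)))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical vectors: axis 0 separates dog from goose, axis 1 marks plural.
toy_vectors = {
    "dog": [1.0, 0.0],
    "dogs": [1.0, 1.0],
    "goose": [-1.0, 0.0],
    "geese": [-1.0, 1.0],
    "moose": [-1.0, -0.2],
}

result = toy_most_similar(toy_vectors, positive=["goose", "dogs"], negative=["dog"])
print(result)
```

With these toy vectors, "geese" ranks first because it is the only remaining word that lies on both the goose side and the plural side.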

In [12]:
#Another way of phrasing the above: Dog is to dogs as goose is to what?

#Let's try some more
# parts of speech: "Longest is to long as slowest is to what?"
wv.most_similar(positive=['slowest','long'],negative=['longest'])

[('slow', 0.5326714515686035),
 ('slower', 0.5169233679771423),
 ('painfully_slow', 0.42606601119041443),
 ('sluggish', 0.4135197401046753),
 ('fast', 0.4066060781478882),
 ('slowly', 0.38395875692367554),
 ('Slow', 0.38121846318244934),
 ('slower_pace', 0.35940873622894287),
 ('pace', 0.3565506935119629),
 ('subpar', 0.3554513156414032)]

In [13]:
# gender: "man is to king as woman is to what?"
wv.most_similar(positive=['woman','king'],negative=['man'])

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674735069275),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411403656006),
 ('royal_palace', 0.5087166428565979)]

In [14]:
#'man' minus 'woman' gives us a 'manly' vector. What happens when we subtract this vector from 'doctor'? 
wv.most_similar(positive=['woman','doctor'],negative=['man'])

[('gynecologist', 0.7093892097473145),
 ('nurse', 0.647728681564331),
 ('doctors', 0.6471461057662964),
 ('physician', 0.6438996195793152),
 ('pediatrician', 0.6249487996101379),
 ('nurse_practitioner', 0.6218313574790955),
 ('obstetrician', 0.6072014570236206),
 ('ob_gyn', 0.5986712574958801),
 ('midwife', 0.5927063226699829),
 ('dermatologist', 0.5739566683769226)]

In [15]:
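#'man' minus 'woman' gives us a 'manly' vector. What happens when we ADD this vector to 'doctor'? 
# We can do this explicitly with vector math and get nearly the same result as the most_similar function.
# Note that similar_by_vector does not exclude the input word, so 'doctor' itself tops the list.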
X = wv['man'] - wv['woman']
v = (wv['doctor'] + X)
wv.similar_by_vector(v) #grabs the most similar words to a vector

[('doctor', 0.8413018584251404),
 ('physician', 0.6823903918266296),
 ('doctors', 0.6239281892776489),
 ('surgeon', 0.5908077359199524),
 ('dentist', 0.570309042930603),
 ('cardiologist', 0.5666104555130005),
 ('neurologist', 0.5558010339736938),
 ('neurosurgeon', 0.5432174801826477),
 ('internist', 0.5405333042144775),
 ('urologist', 0.5398820042610168)]
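
The list above starts with 'doctor' because `similar_by_vector` only sees a raw vector and cannot know which words it was built from. A minimal sketch of filtering the query words out afterwards, assuming a hypothetical helper `drop_query_words` and reusing the first three rows of the output above as sample data:

```python
def drop_query_words(results, query_words):
    """Remove (word, score) pairs whose word was part of the query."""
    query_words = set(query_words)
    return [(word, score) for word, score in results if word not in query_words]


# First three rows copied from the similar_by_vector output above.
raw_results = [
    ("doctor", 0.8413018584251404),
    ("physician", 0.6823903918266296),
    ("doctors", 0.6239281892776489),
]

filtered = drop_query_words(raw_results, ["doctor", "man", "woman"])
print(filtered)
```

In the notebook you could pass the result of `wv.similar_by_vector(v)` in place of `raw_results`.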

In [16]:
#Places: "UK is to London as France is to what?"
wv.most_similar(positive=['France','London'],negative=['UK'])

[('Paris', 0.748170018196106),
 ('Marseille', 0.5542371273040771),
 ('Marseilles', 0.5350722670555115),
 ('Bordeaux', 0.5314581990242004),
 ('Reims', 0.5264857411384583),
 ('Saint_Denis', 0.5239812135696411),
 ('Madrid', 0.5148749351501465),
 ('Lyon', 0.5130605697631836),
 ('French', 0.5105733275413513),
 ('Aix_en_Provence', 0.5064885020256042)]

In [17]:
#What is the bagel of London?
wv.most_similar(positive=['London','bagel'],negative=['New_York'])

[('croissant', 0.47772619128227234),
 ('scone', 0.46981364488601685),
 ('crisps', 0.4671917259693146),
 ('muffin', 0.45740562677383423),
 ('muesli', 0.44982248544692993),
 ('marmalade', 0.4486430287361145),
 ('puddings', 0.44180625677108765),
 ('lasagne', 0.43633562326431274),
 ('veg', 0.43387067317962646),
 ('kebab', 0.43329814076423645)]

In [18]:
#Who is the Marie Curie of music?
wv.most_similar(positive=['Marie_Curie', 'music'],negative=['science'])

[('Edith_Piaf', 0.4327777922153473),
 ('Teenage_Cancer', 0.410092830657959),
 ('Nina_Simone', 0.40169721841812134),
 ('Music', 0.3946076035499573),
 ('musicians', 0.39395079016685486),
 ('tunes', 0.3932993710041046),
 ('Piaf', 0.39279234409332275),
 ('songs', 0.38816142082214355),
 ('Melodies', 0.3843925893306732),
 ('classical_music', 0.3838009536266327)]

In [19]:
#We can also search for the word that doesn't match the others:
wv.doesnt_match("dog cat lion rhino platypus".split())

'platypus'
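
One way to find the odd word out, and to our understanding roughly what `doesnt_match` does, is to average the unit-length vectors of all the words and return the word least similar to that average. The sketch below uses invented 2-dimensional vectors and a hypothetical function name `toy_doesnt_match`, so it runs without the real model.

```python
import numpy as np


def toy_doesnt_match(vectors, words):
    """Return the word whose vector is least similar to the group's mean."""
    unit_vectors = np.array(
        [np.asarray(vectors[w], dtype=float) / np.linalg.norm(vectors[w]) for w in words]
    )
    centre = unit_vectors.mean(axis=0)
    centre = centre / np.linalg.norm(centre)

    # Cosine similarity of each word to the centre (all vectors have length 1).
    similarities = unit_vectors @ centre
    return words[int(np.argmin(similarities))]


# Made-up vectors: three words point one way, 'platypus' points another.
toy_vectors = {
    "dog": [1.0, 0.1],
    "cat": [0.9, 0.2],
    "lion": [0.8, 0.3],
    "platypus": [0.1, 1.0],
}

odd_one = toy_doesnt_match(toy_vectors, ["dog", "cat", "lion", "platypus"])
print(odd_one)
```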

In [20]:
#More exploration of bias
print(wv.similarity('man', 'nurse'))
print(wv.similarity('woman','nurse'))

0.25472283
0.441356


In [21]:
print(wv.similarity('man', 'intelligent'))
print(wv.similarity('woman', 'intelligent'))

0.12355876
0.07830421


In [22]:
print(wv.similarity('man', 'competent'))
print(wv.similarity('woman', 'competent'))

0.18578778
0.12420672


In [23]:
print(wv.similarity('man', 'attractive'))
print(wv.similarity('woman', 'attractive'))

0.0008112781
0.07724926
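
The comparisons above are easier to read side by side as a single gap per word: similarity to 'man' minus similarity to 'woman'. A small sketch, assuming a hypothetical helper `similarity_gap` that accepts any similarity function. Here a lookup table of the values printed above (rounded to four decimal places) stands in for the model so the example runs on its own.

```python
def similarity_gap(similarity, word_a, word_b, targets):
    """For each target, return similarity(word_a, t) - similarity(word_b, t)."""
    return {t: similarity(word_a, t) - similarity(word_b, t) for t in targets}


# Values copied from the printed outputs above, rounded to 4 decimal places.
printed = {
    ("man", "nurse"): 0.2547, ("woman", "nurse"): 0.4414,
    ("man", "intelligent"): 0.1236, ("woman", "intelligent"): 0.0783,
    ("man", "competent"): 0.1858, ("woman", "competent"): 0.1242,
    ("man", "attractive"): 0.0008, ("woman", "attractive"): 0.0772,
}


def lookup(a, b):
    """Stand-in for wv.similarity that reads from the table above."""
    return printed[(a, b)]


targets = ["nurse", "intelligent", "competent", "attractive"]
gaps = similarity_gap(lookup, "man", "woman", targets)

# Positive gap: closer to 'man'. Negative gap: closer to 'woman'.
for target, gap in gaps.items():
    print(f"{target:12s} {gap:+.4f}")
```

In the notebook, passing `wv.similarity` instead of `lookup` lets you try the same comparison on any list of words that are in the vocabulary.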


In [None]:
#Keep playing!