# Week 4.2 Lecture: Word2Vec Examples


First, download the file from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g and place it in the same directory on your machine as this notebook.

(If you're on Colab, you'll need to follow the Moodle instructions to upload this file to your Colab project.)

Note that this is a giant 1.5 GB download!

In [4]:
# Run this every time you run the notebook
# Note that you may have to change the file path if you put the file somewhere else (e.g., on Colab)
embeddings_file = "GoogleNews-vectors-negative300.bin.gz"

Now, let's play!

In [5]:
import numpy as np
from gensim.models.keyedvectors import KeyedVectors

In [6]:
#This line will load the 200,000 most common words into wv, a variable of word vectors
#Note that this will load these into memory, so if you don't have a lot of memory on your computer you may run into problems/slowness
wv = KeyedVectors.load_word2vec_format(embeddings_file, binary=True, limit=200000)

BadGzipFile: Not a gzipped file (b'\n\n')
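
The `BadGzipFile` error above means the bytes on disk are not gzip data. One plausible cause (an assumption, not something this notebook confirms) is that the download saved a web page instead of the archive. A gzip file always starts with the two bytes `0x1f 0x8b`, so a quick check is possible before loading. The helper name `looks_like_gzip` and the temporary demo files are made up for illustration:

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream begins with these two bytes

def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

# Self-contained demo on two small temporary files (not the real embeddings)
tmp_dir = tempfile.mkdtemp()
good_path = os.path.join(tmp_dir, "good.bin.gz")
bad_path = os.path.join(tmp_dir, "bad.bin.gz")

with gzip.open(good_path, "wb") as f:
    f.write(b"real compressed payload")   # genuinely gzip-compressed
with open(bad_path, "wb") as f:
    f.write(b"\n\n<html>not an archive</html>")  # has a .gz name but is plain text

print(looks_like_gzip(good_path))
print(looks_like_gzip(bad_path))
```

In this notebook you could call `looks_like_gzip(embeddings_file)` before the loading cell; if it returns `False`, re-download the file rather than retrying the load.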

In [7]:
#View the actual value of a vector for a word:
# this is a 300-dimensional vector
print("Shape of a word vector: ")
print(np.shape(wv['language'])) #play around with other words
print("Word vector for 'language':")
print(wv['language'])

Shape of a word vector: 
(300,)
Word vector for 'language':
[ 2.30712891e-02  1.68457031e-02  1.54296875e-01  1.27929688e-01
 -2.67578125e-01  3.51562500e-02  1.19140625e-01  2.48046875e-01
  1.93359375e-01 -7.95898438e-02  1.46484375e-01 -1.43554688e-01
 -3.04687500e-01  3.46679688e-02 -1.85546875e-02  1.06933594e-01
 -1.52343750e-01  2.89062500e-01  2.35595703e-02 -3.80859375e-01
  1.09863281e-01  4.41406250e-01  3.75976562e-02 -1.22680664e-02
  1.62353516e-02 -2.24609375e-01  7.61718750e-02 -3.12500000e-02
 -2.16064453e-02  1.49414062e-01 -4.02832031e-02 -4.46777344e-02
 -1.72851562e-01  3.32031250e-02  1.50390625e-01 -5.05371094e-02
  2.72216797e-02  3.00781250e-01 -1.33789062e-01 -7.56835938e-02
  1.93359375e-01 -1.98242188e-01 -1.27563477e-02  4.19921875e-01
 -2.19726562e-01  1.44531250e-01 -3.93066406e-02  1.94335938e-01
 -3.12500000e-01  1.84570312e-01  1.48773193e-04 -1.67968750e-01
 -7.37304688e-02 -3.12500000e-02  1.57226562e-01  3.30078125e-01
 -1.42578125e-01 -3.16406250e-

In [4]:
#We can compare words by computing their cosine similarity
# Try this out for different word pairs:
print(wv.similarity('dog', 'cat'))
print(wv.similarity('dog', 'terrier'))
print(wv.similarity('dog', 'dolphin'))
print(wv.similarity('dog', 'aubergine'))
print(wv.similarity('dog', 'exuberant'))
print(wv.similarity('dog', 'economics'))

0.7609457
0.6599656
0.37941834
0.03318137
0.049963422
-0.0038632862
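
To see what `similarity` is measuring, here is a minimal sketch of cosine similarity in plain NumPy. The three-dimensional vectors are invented for illustration and are not real word2vec values; `cosine_similarity` is our own helper, not a gensim function:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: (u . v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors, made up for illustration only
toy = {
    "dog": np.array([1.0, 1.0, 0.0]),
    "cat": np.array([1.0, 0.8, 0.1]),
    "economics": np.array([0.0, 0.1, 1.0]),
}

sim_dog_cat = cosine_similarity(toy["dog"], toy["cat"])
sim_dog_econ = cosine_similarity(toy["dog"], toy["economics"])

# Vectors pointing in similar directions score close to 1,
# near-perpendicular vectors score close to 0
print(sim_dog_cat, sim_dog_econ)
```

With the real embeddings loaded, `cosine_similarity(wv['dog'], wv['cat'])` should agree with `wv.similarity('dog', 'cat')` up to floating-point rounding.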


In [9]:
# Look for most similar words to a given word
#Try this out for different words
wv.most_similar(positive=['dog'])

[('dogs', 0.8680489659309387),
 ('puppy', 0.8106428384780884),
 ('pit_bull', 0.7803961038589478),
 ('pooch', 0.7627376914024353),
 ('cat', 0.7609457969665527),
 ('golden_retriever', 0.7500901818275452),
 ('German_shepherd', 0.7465174198150635),
 ('Rottweiler', 0.7437615394592285),
 ('beagle', 0.7418621778488159),
 ('pup', 0.7406911253929138)]

In [12]:
# Look for most similar words to a set of words
#wv.most_similar(positive=['potato', 'yam', 'yucca'])
#wv.most_similar(positive=['croissant', 'donut'])
wv.most_similar(positive=['London', 'Paris']) #Or maybe Camberwell?

[('Brussels', 0.59405916929245),
 ('Berlin', 0.592768669128418),
 ('Amsterdam', 0.5912652015686035),
 ('Madrid', 0.5895830392837524),
 ('Frankfurt', 0.55706387758255),
 ('Parisian', 0.5517574548721313),
 ('Rome', 0.5313206911087036),
 ('Stockholm', 0.5250915884971619),
 ('Marrakesh', 0.5232300758361816),
 ('Strasbourg', 0.5227499604225159)]

In [11]:
#Try it with people
wv.most_similar(positive=['Boris_Johnson','Gordon_Brown'])

[('Ken_Livingstone', 0.7916719317436218),
 ('George_Osborne', 0.7817354798316956),
 ('Tony_Blair', 0.7768580317497253),
 ('Nick_Clegg', 0.7726122736930847),
 ('Alistair_Darling', 0.7554317712783813),
 ('Mr_Clegg', 0.7268158197402954),
 ('Chancellor_Alistair_Darling', 0.7248950004577637),
 ('Ed_Miliband', 0.7125200033187866),
 ('chancellor_Alistair_Darling', 0.7029644846916199),
 ('Ed_Balls', 0.7017458081245422)]

In [None]:
# Let's try answering a question via adding vectors: What is the football-like sport that Harry Potter plays?
wv.most_similar(positive=['Harry_Potter','football'])

In [None]:
# What is the name of the school in Harry Potter?
wv.most_similar(positive=['Harry_Potter','school'])

In [None]:
# Now let's experiment with subtracting vectors...
# the "positive" words point in the direction(s) you want to go; the "negative" words' vectors are subtracted from these

#Example from the PowerPoint slides: goose + (dogs - dog)
wv.most_similar(positive=['goose','dogs'],negative=['dog']) #It's geese!

In [None]:
#Another way of phrasing the above: Dog is to dogs as goose is to what?

#Let's try some more
# adjective forms (superlative to base): "Longest is to long as slowest is to what?"
wv.most_similar(positive=['slowest','long'],negative=['longest'])

In [None]:
# gender: "man is to king as woman is to what?"
wv.most_similar(positive=['woman','king'],negative=['man'])

In [None]:
#'man' minus 'woman' gives us a 'manly' vector. What happens when we subtract this vector from 'doctor'? 
wv.most_similar(positive=['woman','doctor'],negative=['man'])

In [None]:
#'man' minus 'woman' gives us a 'manly' vector. What happens when we ADD this vector to 'doctor'? 
# We can do this explicitly with vector arithmetic and get nearly the same result as the most_similar function
X = wv['man'] - wv['woman']
v = (wv['doctor'] + X)
wv.similar_by_vector(v) #grabs the most similar words to a vector
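
Why only "nearly" the same? As far as we understand gensim's behaviour, `most_similar` scales each input vector to unit length before combining them (weight +1 for positive words, -1 for negative ones) and drops the input words from the results, whereas the explicit arithmetic above uses the raw vectors and can return `'doctor'` itself. The sketch below imitates that procedure on a tiny invented vocabulary; `toy_most_similar` and the three-dimensional vectors are hypothetical and only meant to show the idea:

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def toy_most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to the mean of +unit(positive) and -unit(negative)."""
    parts = [unit(vectors[w]) for w in positive]
    parts += [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(parts, axis=0))
    exclude = set(positive) | set(negative)  # input words are never returned
    scores = {
        w: float(np.dot(query, unit(v)))
        for w, v in vectors.items()
        if w not in exclude
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# Invented vectors: axis 0 ~ "dog-like", axis 1 ~ "goose-like", axis 2 ~ "plural"
toy_vectors = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "dogs":  np.array([1.0, 0.0, 1.0]),
    "goose": np.array([0.0, 1.0, 0.0]),
    "geese": np.array([0.0, 1.0, 1.0]),
    "cat":   np.array([1.0, 0.2, 0.0]),
    "cats":  np.array([1.0, 0.2, 1.0]),
}

# goose + (dogs - dog): 'geese' ranks first for these toy values
result = toy_most_similar(toy_vectors, positive=["goose", "dogs"], negative=["dog"])
print(result)
```

The real embeddings are far noisier than this toy vocabulary, so analogies there work only some of the time.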

In [None]:
#Places: "UK is to London as France is to what?"
wv.most_similar(positive=['France','London'],negative=['UK'])

In [5]:
#What is the bagel of London?
wv.most_similar(positive=['London','bagel'],negative=['New_York'])

[('croissant', 0.47772619128227234),
 ('scone', 0.46981364488601685),
 ('crisps', 0.4671917259693146),
 ('muffin', 0.45740562677383423),
 ('muesli', 0.44982248544692993),
 ('marmalade', 0.4486430287361145),
 ('puddings', 0.44180625677108765),
 ('lasagne', 0.43633562326431274),
 ('veg', 0.43387067317962646),
 ('kebab', 0.43329814076423645)]

In [None]:
#Who is the Marie Curie of music?
wv.most_similar(positive=['Marie_Curie', 'music'],negative=['science'])

In [None]:
#Finally, we can search for the word that doesn't match the others:
wv.doesnt_match("dog cat lion rhino platypus".split())
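
One way to think about `doesnt_match` (a sketch of the idea, which we believe is close to what gensim does): normalise every word vector, average them to get a "centre" direction, and return the word whose vector is least similar to that centre. The helper `toy_doesnt_match` and the two-dimensional vectors are invented for illustration:

```python
import numpy as np

def toy_doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the group's mean direction."""
    units = np.vstack([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centre = units.mean(axis=0)
    centre = centre / np.linalg.norm(centre)
    sims = units @ centre  # cosine similarity of each word to the centre
    return words[int(np.argmin(sims))]

# Invented vectors: three words point roughly the same way, one does not
toy_vectors = {
    "apple":  np.array([1.0, 0.1]),
    "pear":   np.array([0.9, 0.2]),
    "plum":   np.array([1.0, 0.0]),
    "hammer": np.array([0.1, 1.0]),
}

odd_one = toy_doesnt_match(toy_vectors, ["apple", "pear", "plum", "hammer"])
print(odd_one)
```

Note that the odd word still pulls the centre towards itself, so with very small groups the answer can be less obvious than it looks.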

In [None]:
#More exploration of bias
print(wv.similarity('man', 'nurse'))
print(wv.similarity('woman','nurse'))

In [None]:
print(wv.similarity('man', 'intelligent'))
print(wv.similarity('woman', 'intelligent'))

In [None]:
print(wv.similarity('man', 'competent'))
print(wv.similarity('woman', 'competent'))

In [None]:
print(wv.similarity('man', 'attractive'))
print(wv.similarity('woman', 'attractive'))
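
To compare pairs like these more systematically, one option is to report the difference between the two similarities as a single number. This is a minimal sketch, not a standard bias metric; the helper `similarity_gap`, the placeholder words, and the two-dimensional vectors are all made up for illustration:

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_gap(vectors, word_a, word_b, target):
    """cos(word_a, target) - cos(word_b, target).

    Positive: target is closer to word_a. Negative: closer to word_b.
    """
    return cos(vectors[word_a], vectors[target]) - cos(vectors[word_b], vectors[target])

# Placeholder words with invented vectors (not real embeddings)
toy_vectors = {
    "group_a":        np.array([1.0, 0.2]),
    "group_b":        np.array([0.2, 1.0]),
    "target_skewed":  np.array([0.3, 0.9]),  # points mostly towards group_b
    "target_neutral": np.array([1.0, 1.0]),  # equally close to both groups
}

gap_skewed = similarity_gap(toy_vectors, "group_a", "group_b", "target_skewed")
gap_neutral = similarity_gap(toy_vectors, "group_a", "group_b", "target_neutral")
print(gap_skewed, gap_neutral)
```

With the real model, passing `wv` as `vectors` (for example `similarity_gap(wv, 'man', 'woman', 'nurse')`) should work the same way, since `wv[word]` returns the word's vector.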