## Word embeddings
*(Credit: Leon Derczynski, IT University of Copenhagen)*

Let's build some embeddings, and then use them to see which words are close to each other.
We'll use the gensim package's word2vec implementation and an NLTK corpus.

In [8]:
from gensim.models import Word2Vec
from nltk.corpus import brown, movie_reviews

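Let's generate word vectors over the Brown corpus text. We will use 100 dimensions and a window of three context words on either side of the target word (e.g. c1, c2, c3, w, c4, c5, c6); words that occur fewer than three times are dropped. Note that gensim trains a CBOW model by default; pass `sg=1` if you want skip-gram training. This might be a little slow (maybe 1-2 minutes).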
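To make the window idea concrete, here is a minimal sketch in plain Python that lists the (target, context) pairs a window produces. It is a simplification: `context_pairs` is a hypothetical helper written for this illustration, and real word2vec training also samples a smaller window at random for each target word.

```python
def context_pairs(sentence, window):
    """Yield (target, context) pairs for every word in a tokenised sentence."""
    for i, target in enumerate(sentence):
        # Take up to `window` words on each side, clipped at the sentence edges.
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield target, sentence[j]


sentence = ["the", "quick", "brown", "fox", "jumps"]
pairs = list(context_pairs(sentence, window=2))

# Context words seen for the target "brown" with two words on each side.
brown_contexts = [context for target, context in pairs if target == "brown"]
print(brown_contexts)
```

A larger window pulls in more distant words as context, which tends to capture topical similarity rather than close syntactic similarity.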

In [9]:
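# for the Brown corpus (run nltk.download('brown') first if it is not installed)
# note: gensim 4.0 and later call the `size` parameter `vector_size`
b = Word2Vec(brown.sents(), size=100, window=3, min_count=3)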

Now that we have the vectors, we can see how good they are by measuring which words are similar to each other.
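The similarity scores that `most_similar` reports are cosine similarities between word vectors. Here is a minimal sketch of that ranking in plain Python. The three-dimensional vectors in `toy` are invented for illustration and do not come from any trained model, and this `most_similar` function is a stand-in written for the sketch, not gensim's own code.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, made up for this example.
toy = {
    "company": [1.0, 0.0, 1.0],
    "firm": [0.9, 0.1, 1.0],
    "wheel": [0.0, 1.0, 0.0],
}


def most_similar(word, vectors, topn=5):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = most_similar("company", toy)
print(ranked)
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are unrelated.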

In [10]:
b.wv.most_similar('company', topn=5)

[('crowd', 0.9577046632766724),
 ('cabin', 0.9551997780799866),
 ('wheel', 0.9507994055747986),
 ('vision', 0.9496356844902039),
 ('enemy', 0.9474828243255615)]

Not great, eh? Try altering the window and the dimension size, to see if you get better results.

Try the same with the movie reviews corpus!

In [11]:
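# for the movie review corpus (run nltk.download('movie_reviews') first if needed)
# note: gensim 4.0 and later call the `size` parameter `vector_size`
mr = Word2Vec(movie_reviews.sents(), size=20, window=5, min_count=3)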

In [12]:
mr.wv.most_similar('love', topn=5)

[('constantly', 0.7110123634338379),
 ('killing', 0.6960628032684326),
 ('talking', 0.6886349320411682),
 ('lastly', 0.6733138561248779),
 ('suspect', 0.6698964834213257)]

We can also do some arithmetic with the words. In the spirit of that classical result, king - man + woman, let's try biggest - big + small, which should ideally land near "smallest".
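Roughly speaking, a query like this adds up the unit-length vectors of the `positive` words, subtracts those of the `negative` words, and then looks for the nearest remaining word by cosine similarity, skipping the query words themselves. The sketch below assumes that reading of gensim's behaviour. The two-dimensional vectors are made up, and `analogy` is a hypothetical helper, not a gensim function.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def analogy(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    # The query words themselves are excluded from the answer.
    exclude = set(positive) | set(negative)
    scores = [
        (word, dot(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical vectors: first axis is big vs small, second is superlative or not.
toy = {
    "big": [1.0, 0.0],
    "biggest": [1.0, 1.0],
    "small": [-1.0, 0.0],
    "smallest": [-1.0, 1.0],
    "tax": [0.0, -1.0],
}

best = analogy(positive=["biggest", "small"], negative=["big"], vectors=toy)
print(best[0][0])
```

With vectors this tidy the analogy works out exactly as hoped; with real embeddings it only works if the model has learned such regular directions, which takes a lot of training data.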

In [13]:
b.wv.most_similar(positive=['biggest', 'small'], negative=['big'], topn=5)

[("other's", 0.9216543436050415),
 ('chromatographic', 0.9152492880821228),
 ('lower', 0.9129438996315002),
 ('tax', 0.9011315703392029),
 ('religious', 0.9001179337501526)]

Not a perfect result with our small model! Let's try loading embeddings trained on a much bigger dataset, with a bigger vocabulary. This should give better results. You'll need the GloVe embeddings for this. Download them from the Moodle, or from www.derczynski.com/glove.twitter.27B.25d.txt.bz2 .

We can then load these in using Gensim; they might take a minute to load.
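For orientation, `load_word2vec_format` with `binary=False` reads the word2vec text format: a header line giving the vocabulary size and the number of dimensions, then one line per word with the word followed by its vector components. The toy parser below is only a sketch of that layout. `read_word2vec_text` is a hypothetical helper and the two sample vectors are invented.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format from an open text file or file-like object."""
    # Header line: "<vocabulary size> <dimensions>"
    vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError("wrong number of dimensions for " + word)
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("header does not match the number of words read")
    return vectors


# Invented two-word sample with three dimensions.
sample = io.StringIO("2 3\nmeat 0.1 0.2 0.3\nbread 0.2 0.1 0.4\n")
vectors = read_word2vec_text(sample)
print(vectors["meat"])
```

Files in the original GloVe text format lack that header line, so they normally need it added before gensim can read them this way.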

In [14]:
from gensim.models.keyedvectors import KeyedVectors
glove = KeyedVectors.load_word2vec_format("./glove.twitter.27B.25d.txt.bz2", binary=False)
print("Done loading")

Done loading


Now, try the above again. Can you find any cool word combinations? What differences are there between the datasets?

Here are some ideas to try; substitute your own words into these.

In [15]:
glove.most_similar('meat', topn=5)

[('bread', 0.9616428017616272),
 ('corn', 0.9524654150009155),
 ('egg', 0.9472206234931946),
 ('fish', 0.9398375153541565),
 ('soup', 0.927527666091919)]

In [16]:
glove.most_similar(positive=['biggest', 'small'], negative=['big'], topn=5)

[('average', 0.8820493221282959),
 ('human', 0.8792450428009033),
 ('persons', 0.8779707551002502),
 ('smallest', 0.8638321161270142),
 ('potential', 0.8624012470245361)]

In [17]:
glove.most_similar(positive=['woman', 'king'], negative=['man'])

[('meets', 0.8841923475265503),
 ('prince', 0.832163393497467),
 ('queen', 0.8257461786270142),
 ('’s', 0.8174098134040833),
 ('crow', 0.8134994506835938),
 ('hunter', 0.8131037950515747),
 ('father', 0.8115834593772888),
 ('soldier', 0.81113600730896),
 ('mercy', 0.8082392811775208),
 ('hero', 0.8082263469696045)]

In [18]:
glove.similarity('car', 'bike')

0.77646496599304926

In [19]:
glove.similarity('car', 'purple')

0.64489537486365123

In [20]:
glove.similarity('red', 'purple')

0.86647633586901729

In [21]:
glove.doesnt_match("breakfast cereal dinner lunch".split())

'cereal'

In [22]:
glove.doesnt_match("red green horse blue".split())

'horse'
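As far as we can tell, `doesnt_match` works by averaging the unit-length vectors of all the words in the list and returning the word that is least similar to that average. Here is a minimal sketch under that assumption, again with invented two-dimensional vectors; `odd_one_out` is a hypothetical name for the helper.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the group mean."""
    normed = {word: unit(vectors[word]) for word in words}
    dim = len(next(iter(normed.values())))
    mean = unit(
        [sum(vec[i] for vec in normed.values()) / len(normed) for i in range(dim)]
    )
    # Lowest dot product with the mean direction marks the outlier.
    return min(normed, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))


# Hypothetical vectors: the colours cluster together, "horse" points elsewhere.
toy = {
    "red": [1.0, 0.1],
    "green": [0.9, 0.2],
    "blue": [1.0, 0.0],
    "horse": [0.1, 1.0],
}

outlier = odd_one_out("red green horse blue".split(), toy)
print(outlier)
```

Because the outlier itself contributes to the mean, this works best when most of the words in the list really do belong together.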