## Word embeddings
*(Credit: Leon Derczynski, IT University of Copenhagen)*

Let's load some embeddings, and then use these to see which words are close to each other.
We'll use the gensim package's word2vec implementation, and an nltk corpus.

In [0]:
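import nltk
from gensim.models import Word2Vec
from nltk.corpus import brown, movie_reviews
nltk.download(['brown', 'movie_reviews'])  # fetch the corpora if they are not already installed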

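Let's train word vectors on the Brown corpus text. We will use 20 dimensions and a context window of 5. In gensim, `window` is the maximum distance between the target word and a context word, so with a window of 2 the context for w in c1, c2, w, c3, c4 would be c1, c2, c3 and c4. Note that gensim trains a CBOW model by default; pass `sg=1` if you want skip-gram. Training might be a little slow (maybe 1-2 minutes).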

In [0]:
# Train on the Brown corpus: 20 dimensions, context window of 5.
# Note: gensim 4.0+ renamed `size` to `vector_size`.
b = Word2Vec(brown.sents(), size=20, window=5, min_count=3)
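
To make the context window concrete, here is a minimal sketch of how (target, context) pairs can be drawn from a sentence. The function name `context_pairs` and the toy sentence are made up for illustration; this is not gensim's implementation, and it ignores refinements that real trainers apply.

```python
def context_pairs(tokens, window):
    """Yield (target, context) pairs within `window` positions of each target."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

# Hypothetical five-token sentence matching the c1, c2, w, c3, c4 picture.
sentence = ["c1", "c2", "w", "c3", "c4"]
pairs = list(context_pairs(sentence, window=2))

# Context words seen for the middle token "w".
w_contexts = [context for target, context in pairs if target == "w"]
print(w_contexts)
```

A larger window pulls in more distant words, which tends to favour topical similarity over syntactic similarity.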

Now we have the vectors, we can see how good they are by measuring which words are similar to each other.

In [0]:
b.wv.most_similar('company', topn=5)  # the vectors live on the model's .wv attribute
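
The ranking above is based on cosine similarity between word vectors. As a rough sketch of the idea, here is a pure-Python version run over a handful of invented 3-dimensional vectors. The names `toy_vectors` and `toy_most_similar` are hypothetical, and the numbers are made up rather than taken from any trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, invented purely for illustration.
toy_vectors = {
    "company": [0.9, 0.1, 0.0],
    "firm": [0.8, 0.2, 0.1],
    "business": [0.7, 0.3, 0.0],
    "banana": [0.0, 0.1, 0.9],
}

def toy_most_similar(word, vectors, topn=5):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = toy_most_similar("company", toy_vectors, topn=2)
print(ranking)
```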

Not great, eh? Try altering the window and the dimension size, to see if you get better results.

Try the movie reviews corpus as well!

In [0]:
# for the movie review corpus
mr = Word2Vec(movie_reviews.sents(), size=20, window=5, min_count=3)

In [0]:
mr.wv.most_similar('love', topn=5)

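We can also do some arithmetic with the words. The classic result is king - man + woman, which should land near queen. Let's try the same idea with biggest - big + small, which should ideally land near smallest.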

In [0]:
b.wv.most_similar(positive=['biggest', 'small'], negative=['big'], topn=5)
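
To see what the `positive` and `negative` arguments are doing, here is a minimal sketch of analogy arithmetic: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The function `toy_analogy` and all the vectors are invented for illustration, and gensim's own implementation may differ in its details.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def toy_analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)  # never return the input words
    scored = [(w, cosine(query, v))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors with axes (large, small, superlative).
toy_vectors = {
    "big": [1.0, 0.0, 0.0],
    "biggest": [1.0, 0.0, 1.0],
    "small": [0.0, 1.0, 0.0],
    "smallest": [0.0, 1.0, 1.0],
    "tiny": [0.0, 1.0, 0.1],
    "tall": [1.0, 0.0, 0.2],
}

result = toy_analogy(toy_vectors, positive=["biggest", "small"], negative=["big"])
print(result[0][0])
```

With real embeddings the axes are not this clean, which is one reason small models trained on small corpora often miss the expected answer.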

Not a perfect result with our small model! Why don't we try loading embeddings trained on a much bigger dataset, with a bigger vocabulary? This should give better results. You'll need the GloVe embeddings for this.

We will download these from a GitHub repository. If you are running this on your own local computer (rather than Colaboratory), you can download the file from www.derczynski.com/glove.twitter.27B.25d.txt.bz2 to your machine. In that case, there is no need to run the next cell: just replace the file name in the cell after next with the path to your downloaded file.

In [0]:
!git clone --quiet https://github.com/KCL-Health-NLP/nlp_examples.git  
from gensim.models.keyedvectors import KeyedVectors
print("Done copying files")

Now let's load the model file. This might take a few minutes. If you are using a copy on your own local machine, change the file path below to that of your file.

In [0]:
glove = KeyedVectors.load_word2vec_format("nlp_examples/ann/glove.twitter.27B.25d.txt.bz2", binary=False)
print("Done loading")
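
If you are curious what is inside such a file, the word2vec text format is simple: a header line giving the vocabulary size and the dimension, then one line per word with its vector components. Here is a minimal sketch that writes and reads back a tiny bz2-compressed file in that format. The helper `read_text_vectors` and the toy file contents are made up for illustration, and this assumes the downloaded file has the same header line, which is what `load_word2vec_format` expects by default.

```python
import bz2
import os
import tempfile

def read_text_vectors(path):
    """Parse a bz2-compressed word2vec text file into (dim, {word: vector})."""
    vectors = {}
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        # Header line: "<vocab_size> <dimension>"
        vocab_size, dim = (int(n) for n in handle.readline().split())
        for line in handle:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors

# Write a tiny made-up file so the example is self-contained.
toy_path = os.path.join(tempfile.mkdtemp(), "toy.txt.bz2")
with bz2.open(toy_path, mode="wt", encoding="utf-8") as handle:
    handle.write("2 3\n")
    handle.write("car 0.1 0.2 0.3\n")
    handle.write("bike 0.2 0.1 0.4\n")

dim, toy = read_text_vectors(toy_path)
print(dim, sorted(toy))
```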

Now, try the above again. Can you find any cool word combinations? What differences are there in the datasets?

Here are some ideas to try. Substitute your own words into these.

In [0]:
glove.most_similar('meat', topn=5)

In [0]:
glove.most_similar(positive=['biggest', 'small'], negative=['big'], topn=5)

In [0]:
glove.most_similar(positive=['woman', 'king'], negative=['man'])

In [0]:
glove.similarity('car', 'bike')

In [0]:
glove.similarity('car', 'purple')

In [0]:
glove.similarity('red', 'purple')

In [0]:
glove.doesnt_match("breakfast cereal dinner lunch".split())

In [0]:
glove.doesnt_match("red green horse blue".split())
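
One way to find the odd word out is to average the unit vectors of all the words and return the word least similar to that average. The sketch below follows that idea on invented 2-dimensional vectors. The name `toy_doesnt_match` and the numbers are hypothetical, and this is an approximation of the approach rather than gensim's exact code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def toy_doesnt_match(words, vectors):
    """Return the word whose vector is least similar to the group mean."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    # Dot product of unit vectors equals cosine similarity.
    sims = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[sims.index(min(sims))]

# Hypothetical vectors: colours point one way, the animal another.
toy_vectors = {
    "red": [1.0, 0.1],
    "green": [0.9, 0.2],
    "blue": [0.95, 0.0],
    "horse": [0.1, 1.0],
}

odd = toy_doesnt_match("red green horse blue".split(), toy_vectors)
print(odd)
```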