## Word embeddings
*(Credit: Leon Derczynski, IT University of Copenhagen)*

Let's load some embeddings, and then use these to see which words are close to each other.
We'll use the gensim package's word2vec implementation, and two nltk corpora (Brown and movie_reviews). We also need to download punkt_tab, the data for punkt, an nltk tokeniser used by the movie_reviews corpus. And we'll use seaborn to visualise embeddings as heatmaps.

In [None]:
from gensim.models import Word2Vec

from nltk.corpus import brown, movie_reviews
import nltk
nltk.download('brown')
nltk.download('movie_reviews')
nltk.download('punkt_tab')

# We'll use seaborn to visualise embeddings as heatmaps
import seaborn as sns

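Let's generate word vectors over the Brown corpus text. We will have 20 dimensions, using a window of three, i.e. up to three context words on each side of the target word (e.g. c1, c2, c3, w, c4, c5, c6). Note that gensim trains a CBOW model by default; pass `sg=1` to `Word2Vec` if you want skip-grams instead. This might be a little slow (maybe 1-2 minutes).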
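
To make the idea of a context window concrete, here is a minimal sketch in plain Python that lists the context words for each target word. The function name `context_pairs` is made up for this illustration and is not part of gensim; real implementations add details (such as randomly shrinking the window and subsampling frequent words) that are left out here.

```python
def context_pairs(tokens, window=3):
    """Return (target, context) pairs for every token in one sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clamp the window so it never runs off either end of the sentence.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]

# Context words seen for the target word "fox" with a window of three.
fox_contexts = [c for t, c in context_pairs(sentence, window=3) if t == "fox"]
print(fox_contexts)
# ['the', 'quick', 'brown', 'jumped', 'over', 'the']
```

A larger window pulls in more distant, more topical context; a smaller one tends to favour words that behave alike syntactically.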

In [None]:
# Train on the Brown corpus. min_count=3 ignores words that occur fewer than 3 times.
b = Word2Vec(brown.sents(), vector_size=20, window=3, min_count=3)

Now that we have the vectors, we can see how good they are by checking which words they place close to each other.

In [None]:
b.wv.most_similar('company', topn=5)

Not great, eh? Try altering the window and the dimension size, to see if you get better results.
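
When comparing results, it helps to know what "similar" means here: `most_similar` ranks words by the cosine similarity of their vectors. The sketch below shows that calculation in plain Python. The three-dimensional vectors in `toy` are invented for illustration (they are not taken from any trained model), and the helper names are our own.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=5):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors, for illustration only.
toy = {
    "company": [0.9, 0.1, 0.0],
    "firm":    [0.8, 0.2, 0.1],
    "banana":  [0.0, 0.1, 0.9],
}

ranked = most_similar("company", toy, topn=2)
print(ranked)
```

Because cosine similarity only looks at the angle between vectors, two words can be "close" even if their vectors have very different lengths.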

Try also with the movie reviews corpus!

In [None]:
# for the movie review corpus
mr = Word2Vec(movie_reviews.sents(), vector_size=20, window=5, min_count=3)

In [None]:
mr.wv.most_similar('love', topn=5)

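We can also do some arithmetic with the words. The classic example is king - man + woman, which should land near queen. Let's try an analogous calculation on our Brown model: biggest - big + small, which we would hope lands near smallest.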
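
To show roughly what this arithmetic involves, here is a small sketch in plain Python: add the vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The two-dimensional vectors in `toy` are invented (think of the axes as "royalty" and "gender"), and `analogy` is a hypothetical helper, not gensim's implementation, although gensim's `most_similar` follows the same general idea.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(positive, negative, vectors, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    # The input words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scores = [(word, sum(q * x for q, x in zip(query, unit(vec))))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors: first axis "royalty", second axis "gender".
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

result = analogy(positive=["king", "woman"], negative=["man"], vectors=toy)
print(result[0][0])
```

With real embeddings the directions are never this clean, which is one reason small models often give disappointing analogy results.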

In [None]:
b.wv.most_similar(positive=['biggest', 'small'], negative=['big'], topn=5)

Not a perfect result with our small model! Why don't we try loading embeddings trained on a much bigger dataset, with a bigger vocabulary? This should give better results. You'll need the GloVe embeddings for this.

We will download these from a GitHub repository. If you are running this on your own local computer (rather than Colaboratory) you can download the file from www.derczynski.com/glove.twitter.27B.25d.txt.bz2 to your machine. In this case, there is no need to run the next cell: just replace the file name in the cell after next with the path to your downloaded file.

In [None]:
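# Clone the repository that contains the GloVe embeddings file
!git clone --quiet https://github.com/KCL-Health-NLP/nlp_examples.git
print("Done copying files")

# KeyedVectors is used in the next cell to load the embeddings
from gensim.models.keyedvectors import KeyedVectors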

Now let's load the model file. This might take a few minutes. If you are using a copy on your own local machine, change the file path below to that of your file.

In [None]:
glove = KeyedVectors.load_word2vec_format("nlp_examples/representation/glove.twitter.27B.25d.txt.bz2", binary=False)
print("Done loading")

Now, try the above again. Can you find any cool word combinations? What differences are there in the datasets?

Here are some ideas to try. Substitute your own words into these.

In [None]:
glove.most_similar('meat', topn=5)

In [None]:
glove.most_similar(positive=['biggest', 'small'], negative=['big'], topn=5)

In [None]:
glove.most_similar(positive=['woman', 'king'], negative=['man'])

In [None]:
glove.similarity('car', 'bike')

In [None]:
glove.similarity('car', 'purple')

In [None]:
glove.similarity('red', 'purple')

In [None]:
glove.doesnt_match("breakfast cereal dinner lunch".split())

In [None]:
glove.doesnt_match("red green horse blue".split())
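
How might an "odd one out" be chosen? One plausible approach, sketched below in plain Python, is to average the (length-normalised) vectors of all the words and pick the word whose vector is least similar to that average. This is an illustration of the idea under that assumption rather than gensim's exact code, and the vectors in `toy` are invented.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(words, vectors):
    """Return the word whose vector is furthest from the group's mean vector."""
    units = {word: unit(vectors[word]) for word in words}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    # The lowest dot product with the mean marks the odd one out.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Made-up toy vectors: the colours point one way, the animal another.
toy = {
    "red":   [0.90, 0.10],
    "green": [0.80, 0.20],
    "blue":  [0.85, 0.15],
    "horse": [0.10, 0.90],
}

odd = doesnt_match("red green horse blue".split(), toy)
print(odd)
```

Note that the answer depends on the whole group: the same word can be the odd one out in one list and fit comfortably in another.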

What about ambiguous words? Can you think of any and try them? Past suggestions have been cancer, bank and play. Can you find any others, and explain what is going on? How does the embedding deal with ambiguity? What factors influence this?

In [None]:
glove.most_similar('word')

What do these embeddings look like? We will display embeddings for four words: two colour adjectives, and two action verbs. Each column is the embedding for one word. We have printed the values to two decimal places, using Python string formatting. Can you spot any similarities and differences?

In [None]:
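# Header row, using the same column widths as the numbers below
print("%8s%8s          %8s%8s\n" % ("red", "green", "walk", "run"))
# One row per dimension of the embedding, one column per word
for i in range(len(glove['red'])):
    print("%8.2f%8.2f          %8.2f%8.2f" % (glove['red'][i], glove['green'][i], glove['walk'][i], glove['run'][i]))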

Let's visualise this as a heatmap, using seaborn (imported as sns). Each row is one word and each column is one dimension of the embedding.

In [None]:

sns.heatmap([glove['red'], glove['green'], glove['walk'], glove['run']],
            cmap='coolwarm', vmin=-2, vmax=1.5,
            yticklabels=['red', 'green', 'walk', 'run'])

How do we use these embeddings in NLP? The usual way is to replace each occurrence of a word with its embedding, which then represents the word. The example below displays what we would pass to our algorithm for a sentence. We show one line for each word, with each value formatted to two decimal places again. The word is displayed at the start of the line for convenience only: it would not be passed to our algorithm.

In [None]:
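sentence = ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]

# Look up the embedding for each word (raises KeyError if a word is not in the vocabulary)
embeddings = [glove[word] for word in sentence]

# Print one line per word: the word itself, then its vector to two decimal places
for word, vector in zip(sentence, embeddings):
    print(word.ljust(10), ''.join("{:6.2f}".format(x) for x in vector))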
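
In practice some words in a sentence will not be in the embedding vocabulary, so the lookup needs a fallback. The sketch below uses a plain dictionary of invented two-dimensional vectors to stand in for the model, and maps unknown words to a zero vector. That fallback is only one of several possible choices, and `embed_sentence` is a hypothetical helper, not a gensim function.

```python
def embed_sentence(tokens, vectors, dim):
    """Replace each token with its vector; unknown tokens get a zero vector."""
    zero = [0.0] * dim
    return [list(vectors.get(token, zero)) for token in tokens]


# Made-up toy vectors standing in for a real embedding model.
toy = {
    "the": [0.1, 0.2],
    "fox": [0.5, -0.3],
    "dog": [0.4, -0.2],
}

# "zzz" is deliberately not in the vocabulary.
matrix = embed_sentence(["the", "fox", "zzz"], toy, dim=2)
for row in matrix:
    print(row)
```

The result has one row per word and one column per embedding dimension, which is the shape that would be passed on to a downstream model.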
