## Word embeddings
*(Credit: Leon Derczynski, IT University of Copenhagen, extended and adapted by Angus Roberts)*

We will create some embeddings from text *corpora* (collections of texts), and also load some pre-built ones. We will use these to see which words are close to each other, to get an idea of how embeddings work and how good they are.

We'll import a few Python packages first. We will use the gensim package's word2vec implementation; word2vec is a popular embedding model. We will also use nltk, a popular natural language processing toolkit. We will load a few resources from nltk:

- Brown Corpus - a collection of 500 standard American English texts, each of roughly 2000 words.
- Movie Reviews corpus - contains 1000 positive and 1000 negative movie reviews.
- punkt - a *tokeniser* model, used by nltk to split texts into sentences and their constituent words.

We will also use the Python seaborn data visualisation package to visualise embeddings as heatmaps.

In [None]:
# A word2vec implementation
from gensim.models import Word2Vec

# Load some corpora
from nltk.corpus import brown, movie_reviews
import nltk
nltk.download('brown')
nltk.download('movie_reviews')

# A tokeniser
nltk.download('punkt')

# We'll use seaborn to visualise embeddings as heatmaps
import seaborn as sns

Let's generate word vectors over the Brown corpus text. Each vector will have 20 dimensions, and we will use a window of three context words on each side of the target word w (i.e. c-3, c-2, c-1, w, c+1, c+2, c+3). This might be a little slow (maybe 1-2 minutes).

In [None]:
# Create embeddings for the Brown corpus
b = Word2Vec(brown.sents(), vector_size=20, window=3, min_count=3)
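
To make the idea of a context window concrete, here is a minimal sketch in plain Python. The function name `context_pairs` and the example sentence are our own, for illustration only. Real word2vec training does considerably more than this; the sketch only shows which neighbouring words count as context for each target word when the window is three.

```python
def context_pairs(tokens, window=3):
    """Pair every token with the words up to `window` positions either side."""
    pairs = []
    for i, target in enumerate(tokens):
        # Words before the target (fewer at the start of the sentence)
        left = tokens[max(0, i - window):i]
        # Words after the target (fewer at the end of the sentence)
        right = tokens[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()

# Print each target word followed by its context words
for target, context in context_pairs(sentence, window=3):
    print(target.ljust(8), context)
```

Words near the start or end of a sentence simply have fewer context words.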

Gensim's Word2Vec model has some useful methods to compare and manipulate embedding vectors. We will use one of these, *most_similar*, to find words that are similar to each other, and test how good our Brown embedding is.

In [None]:
# Find the five most similar embeddings to the one for a given word
b.wv.most_similar('company', topn=5)
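
What does "most similar" mean here? Similarity between embeddings is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below ranks words this way using numpy. The three-dimensional vectors in `toy` are made up for illustration (they do not come from the Brown model), and `most_similar_sketch` is our own simplified stand-in, not gensim's implementation.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1 means same direction, 0 unrelated, -1 opposite."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors, for illustration only
toy = {
    "company": np.array([0.9, 0.1, 0.3]),
    "firm": np.array([0.8, 0.2, 0.4]),
    "business": np.array([0.7, 0.0, 0.5]),
    "banana": np.array([-0.2, 0.9, 0.1]),
}

def most_similar_sketch(word, vectors, topn=5):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar_sketch("company", toy, topn=2))
```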

Not great, eh? Try altering the window and the dimension size, to see if you get better results.

Try the same with the movie reviews corpus!

In [None]:
# Build an embedding for the movie review corpus
mr = Word2Vec(movie_reviews.sents(), vector_size=20, window=5, min_count=3)

In [None]:
# Find the five most similar embeddings to the one for a given word
mr.wv.most_similar('love', topn=5)

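We can also do some arithmetic with the word vectors. The classic example is king - man + woman, which should land close to queen. Here we try a similar analogy on the Brown embedding: biggest - big + small, which we would hope lands close to smallest.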

In [None]:
b.wv.most_similar(positive=['biggest', 'small'], negative=['big'], topn=5)
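
The arithmetic behind this kind of query can be sketched with numpy. We add the vectors for the `positive` words, subtract the vectors for the `negative` words, and look for the word whose vector is closest to the result, leaving out the words in the query itself. The two-dimensional vectors below are invented so that the analogy works exactly, and `analogy_sketch` is a hypothetical helper of our own; gensim's implementation differs in details such as how it normalises the vectors.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented vectors: first value is roughly "size", second roughly "superlative"
toy = {
    "big": np.array([1.0, 0.0]),
    "biggest": np.array([1.0, 1.0]),
    "small": np.array([-1.0, 0.0]),
    "smallest": np.array([-1.0, 1.0]),
    "large": np.array([0.9, 0.1]),
    "horse": np.array([0.1, -1.0]),
}

def analogy_sketch(vectors, positive, negative, topn=1):
    # Add the positive vectors and subtract the negative ones
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    # Words used in the query are not allowed as answers
    used = set(positive) | set(negative)
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in used]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(analogy_sketch(toy, positive=["biggest", "small"], negative=["big"]))
```

With real embeddings the result is rarely this clean, which is why a small model trained on a small corpus often gives disappointing answers.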

Not a perfect result with our small model! Why don't we try loading embeddings built from a bigger dataset, with a bigger vocabulary? This should give better results. Rather than build one from scratch, we will load an embedding that has already been trained and saved to disk. You'll need the GloVe embeddings for this, which we will download from GitHub.

In [None]:
# Copy files from GitHub into the local Colab filespace.
!git clone --quiet https://github.com/KCL-Health-NLP/nlp_youth_awards.git
print("Done copying files")

Now let's load the model file. This might take a few minutes.

In [None]:
# KeyedVectors can be used to load and query the GloVe embeddings
from gensim.models.keyedvectors import KeyedVectors

# Load a pre-trained GloVe embedding from a compressed file
glove = KeyedVectors.load_word2vec_format("nlp_youth_awards/practicals/glove.twitter.27B.25d.txt.bz2", binary=False)
print("Done loading")

Now, try the above again. Can you find any cool word combinations? What differences are there in the datasets?

Here are some ideas to try. Substitute your own words into these.

In [None]:
glove.most_similar('meat', topn=5)

In [None]:
glove.most_similar(positive=['biggest', 'small'], negative=['big'], topn=5)

In [None]:
glove.most_similar(positive=['woman', 'king'], negative=['man'])

In [None]:
# Measures the similarity between two embeddings
glove.similarity('car', 'bike')

In [None]:
glove.similarity('car', 'purple')

In [None]:
glove.similarity('red', 'purple')

In [None]:
# Finds the least similar embedding in a list
glove.doesnt_match("breakfast cereal dinner lunch".split())

In [None]:
glove.doesnt_match("red green horse blue".split())
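
How might the odd one out be chosen? One simple approach, sketched below, is to average the vectors of all the words in the list and pick the word whose vector is least similar to that average. The vectors in `toy` are made up for illustration, and `doesnt_match_sketch` is our own simplified version of the idea, not a copy of gensim's code.

```python
import numpy as np

def doesnt_match_sketch(words, vectors):
    # Scale every vector to length 1 so that only direction matters
    units = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    # The "centre" of the group is the (re-scaled) average direction
    centre = units.mean(axis=0)
    centre = centre / np.linalg.norm(centre)
    # Cosine similarity of each word to the centre
    scores = units @ centre
    # The odd one out is the word furthest from the centre
    return words[int(np.argmin(scores))]

# Made-up vectors, for illustration only
toy = {
    "red": np.array([0.9, 0.1]),
    "green": np.array([0.8, 0.2]),
    "blue": np.array([0.85, 0.05]),
    "horse": np.array([0.1, 0.9]),
}

print(doesnt_match_sketch("red green horse blue".split(), toy))
```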

What about ambiguous words? Can you think of any and try them? Past suggestions have been cancer, bank and play. Can you find any others, and explain what is going on? How does the embedding deal with ambiguity? What factors influence this?

In [None]:
glove.most_similar('word')

What do these embeddings look like? We will display embeddings for four words: two colour adjectives, and two action verbs. Each column is the embedding for one word. We have printed to two decimal places, using Python string formatting. Can you spot any similarities and differences?

In [None]:
# Column headings, using the same column widths as the values below
print("%8s%8s          %8s%8s\n" % ("red", "green", "walk", "run"))

# For each number i from 0 to the length of our embeddings
for i in range(len(glove['red'])):

  # Print the value of the four embeddings at this position
  print("%8.2f%8.2f          %8.2f%8.2f" % (glove['red'][i], glove['green'][i], glove['walk'][i], glove['run'][i]))

Let's visualise this as a heatmap, using seaborn (imported as sns).

In [None]:
# Display a heatmap in which the value of the embedding at each position is
# represented by a different colour intensity
sns.heatmap([glove['red'], glove['green'], glove['walk'], glove['run']],
            cmap='coolwarm', vmin=-2, vmax=1.5,
            yticklabels=['red', 'green', 'walk', 'run'])

How do we use these embeddings in NLP? The usual way is to replace each occurrence of a word with its embedding, which then represents the word. The example below displays what we would pass to our algorithm for a sentence. We show one line for each word, with each value formatted to two decimal places again. The word is displayed at the start of the line for convenience only - this would not be passed to our algorithm.

In [None]:
# We will look at the embeddings for this sentence
sentence=["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]

# An empty list into which we will put the embeddings before printing them
embeddings = []

# For each word in the sentence
for w in sentence:
  embeddings.append(glove[w])

# For each embedding in the embeddings list, and its position i
for i, em in enumerate(embeddings):

  # Print the word at index i, and the values (x) in its embedding (em)
  print(sentence[i].ljust(10), ''.join("{:6.2f}".format(x) for x in em))
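
In practice, the embeddings for a sentence are often stacked into a single matrix with one row per word, which is a convenient input for many algorithms. The sketch below shows this with numpy. The lookup table `toy`, the dimension `DIM` and the function `embed_sentence` are all hypothetical stand-ins for illustration. Using a row of zeros for words that have no embedding is one simple choice that we have assumed here; it is not the only option.

```python
import numpy as np

DIM = 3  # our toy embeddings have 3 values each

# A made-up lookup table standing in for a real embedding model
toy = {
    "the": np.array([0.1, 0.2, 0.3]),
    "lazy": np.array([-0.5, 0.4, 0.0]),
    "dog": np.array([0.7, -0.1, 0.6]),
}

def embed_sentence(tokens, vectors, dim):
    """Return a matrix with one row per token."""
    rows = []
    for token in tokens:
        if token in vectors:
            rows.append(vectors[token])
        else:
            # Assumption: unknown words become a row of zeros
            rows.append(np.zeros(dim))
    return np.vstack(rows)

matrix = embed_sentence(["the", "lazy", "dog", "zzz"], toy, DIM)
print(matrix.shape)
print(matrix)
```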
