# Using Pre-trained Word Embeddings

In this notebook we will show some operations on pre-trained word embeddings to gain an intuition about them.

We will be using the pre-trained GloVe embeddings that can be found on the [official website](https://nlp.stanford.edu/projects/glove/). In particular, we will use the file `glove.6B.300d.txt` contained in this [zip file](https://nlp.stanford.edu/data/glove.6B.zip).

We will first load the GloVe embeddings using [Gensim](https://radimrehurek.com/gensim/). Specifically, we will use [`KeyedVectors`](https://radimrehurek.com/gensim/models/keyedvectors.html)'s [`load_word2vec_format()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.load_word2vec_format) classmethod, which supports the original word2vec file format.
However, the two file formats differ in one respect: a word2vec file begins with a header line that gives the number of embeddings and their dimensionality, while a GloVe file has no such header. We account for this by passing `no_header=True` when loading the embeddings.
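To make the format difference concrete, here is a minimal sketch that builds the header line word2vec expects for a headerless GloVe-style text. The helper name `add_word2vec_header` and the two-word sample are hypothetical and only for illustration. In this notebook, `no_header=True` makes this step unnecessary.

```python
def add_word2vec_header(glove_text):
    """Prepend a word2vec-style header ("<num_vectors> <dimensions>") to GloVe-style text.

    Hypothetical helper for illustration only; it is not part of Gensim.
    """
    lines = [line for line in glove_text.splitlines() if line.strip()]
    # each line is a token followed by its vector components, separated by spaces
    num_vectors = len(lines)
    dimensions = len(lines[0].split(" ")) - 1
    return f"{num_vectors} {dimensions}\n" + "\n".join(lines) + "\n"

# made-up sample: two words with 3-dimensional vectors
sample = "the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n"
converted = add_word2vec_header(sample)
print(converted.splitlines()[0])  # prints "2 3"
```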

Loading the embeddings may take a little bit, so hang in there!

In [2]:
from gensim.models import KeyedVectors

fname = "glove.6B.300d.txt"
glove = KeyedVectors.load_word2vec_format(fname, no_header=True)
glove.vectors.shape

(400000, 300)

## Word similarity

One attribute of word embeddings that makes them useful is the ability to compare them using cosine similarity to find how similar they are. [`KeyedVectors`](https://radimrehurek.com/gensim/models/keyedvectors.html) objects provide a method called [`most_similar()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar) that we can use to find the closest words to a particular word of interest. By default, [`most_similar()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar) returns the 10 most similar words, but this can be changed using the `topn` parameter.

Below we test this function using a few different words.

In [7]:
# common noun
glove.most_similar("cactus")


In [8]:
# common noun
glove.most_similar("cake")


In [6]:
# adjective
glove.most_similar("angry")


In [9]:
# adverb
glove.most_similar("quickly")


In [10]:
# preposition
glove.most_similar("between")


In [8]:
# determiner
glove.most_similar("the")

[('of', 0.7057957053184509),
 ('which', 0.6992015242576599),
 ('this', 0.6747025847434998),
 ('part', 0.6727458834648132),
 ('same', 0.6592389941215515),
 ('its', 0.6446540355682373),
 ('first', 0.6398991346359253),
 ('in', 0.6361348032951355),
 ('one', 0.6245333552360535),
 ('that', 0.6176422834396362)]
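The scores above are cosine similarities. As a minimal sketch of what that measure computes, the following uses made-up 3-dimensional vectors rather than real GloVe embeddings. The helper `cosine_similarity` is our own illustration, not a Gensim function.

```python
import numpy as np

def cosine_similarity(u, v):
    # dot product divided by the product of the two vector lengths
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# made-up vectors, not real GloVe embeddings
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # points in the same direction as a
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0: orthogonal
```

Only the direction of the vectors matters, not their length, which is why `a` and `b` are maximally similar even though `b` is twice as long.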

## Word analogies

Another characteristic of word embeddings is their ability to solve analogy problems.
The same [`most_similar()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar) method can be used for this task, by passing two lists of words:
a `positive` list with the words that should be added and a `negative` list with the words that should be subtracted. Using these arguments, the famous example $\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$ can be executed as follows:

In [5]:
# king - man + woman
glove.most_similar(positive=["king", "woman"], negative=["man"])

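As a rough sketch of the arithmetic behind this kind of query, the example below uses hand-made 2-dimensional vectors (an assumption for illustration, not real GloVe embeddings) in which the first axis loosely stands for "royalty" and the second for "gender". It is not how Gensim implements `most_similar()` internally, only the underlying idea.

```python
import numpy as np

# hand-made toy vectors: first axis ~ "royalty", second axis ~ "gender"
toy = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.1,  1.0]),
    "woman":  np.array([0.1, -1.0]),
    "prince": np.array([0.8,  0.9]),
}

# king - man + woman
target = toy["king"] - toy["man"] + toy["woman"]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# rank every word that was not part of the query itself
candidates = [w for w in toy if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(target, toy[w]))
print(best)  # queen
```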

Here are a few other interesting analogies:

In [10]:
# car - drive + fly
glove.most_similar(positive=["car", "fly"], negative=["drive"])

[('airplane', 0.5897148251533508),
 ('flying', 0.5675230026245117),
 ('plane', 0.5317023396492004),
 ('flies', 0.5172374248504639),
 ('flown', 0.514790415763855),
 ('airplanes', 0.5091356635093689),
 ('flew', 0.5011662244796753),
 ('planes', 0.4970923364162445),
 ('aircraft', 0.4957723915576935),
 ('helicopter', 0.45859551429748535)]

In [11]:
# berlin - germany + australia
glove.most_similar(positive=["berlin", "australia"], negative=["germany"])


In [3]:
# england - london + baghdad
glove.most_similar(positive=["england", "baghdad"], negative=["london"])


In [4]:
# japan - yen + peso
glove.most_similar(positive=["japan", "peso"], negative=["yen"])


In [12]:
# best - good + tall
glove.most_similar(positive=["best", "tall"], negative=["good"])


## Looking under the hood

Now that we are more familiar with the [`most_similar()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar) method, it is time to implement its functionality ourselves.
But first, we need to take a look at the different parts of the [`KeyedVectors`](https://radimrehurek.com/gensim/models/keyedvectors.html) object that we will need.
Obviously, we will need the vectors themselves. They are stored in the `vectors` attribute.

In [15]:
glove.vectors.shape

(400000, 300)

As we can see above, `vectors` is a 2-dimensional matrix with 400,000 rows and 300 columns.
Each row corresponds to a 300-dimensional word embedding. These embeddings are not normalized, but normalized embeddings can be obtained using the [`get_normed_vectors()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.get_normed_vectors) method.

In [16]:
normed_vectors = glove.get_normed_vectors()
normed_vectors.shape

(400000, 300)
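Normalizing matters because the dot product of two unit-length vectors equals their cosine similarity. The sketch below illustrates this on a made-up 3 x 2 matrix standing in for `glove.vectors`. It shows plain L2 row normalization with NumPy and is not Gensim's own code.

```python
import numpy as np

# made-up 3 x 2 matrix standing in for glove.vectors
vectors = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])

# divide each row by its L2 norm (keepdims keeps the shape broadcastable)
row_norms = np.linalg.norm(vectors, axis=1, keepdims=True)
normed = vectors / row_norms

# every row now has length 1 ...
print(np.linalg.norm(normed, axis=1))
# ... so a plain dot product between two rows is their cosine similarity
print(normed[0] @ normed[1])
```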

Now we need to map the words in the vocabulary to rows in the `vectors` matrix, and vice versa.
The [`KeyedVectors`](https://radimrehurek.com/gensim/models/keyedvectors.html) object has two attributes for this: `index_to_key`, a list of words ordered by row index, and `key_to_index`, a dictionary that maps each word to its row index.

In [17]:
#glove.index_to_key

In [18]:
#glove.key_to_index
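The two structures are inverses of each other. Here is a minimal sketch with a made-up three-word vocabulary; the real ones hold 400,000 entries.

```python
# made-up three-word vocabulary standing in for the GloVe one
index_to_key = ["the", "cat", "sat"]
# build the inverse mapping: word -> row index
key_to_index = {word: i for i, word in enumerate(index_to_key)}

word_id = key_to_index["cat"]
print(word_id)                # 1
print(index_to_key[word_id])  # cat
```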

## Word similarity from scratch

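Now we have everything we need to implement a `most_similar_words()` function that takes a word, the vector matrix, the `index_to_key` list, and the `key_to_index` dictionary. This function will return the `topn` most similar words to the provided word (10 by default), along with their similarity scores. We assume the vector matrix holds normalized embeddings, so that a dot product between two rows is their cosine similarity.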

In [19]:
import numpy as np

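def most_similar_words(word, vectors, index_to_key, key_to_index, topn=10):
    # retrieve word_id corresponding to given word
    word_id = key_to_index[word]
    # retrieve embedding for given word
    emb = vectors[word_id]
    # calculate similarities to all words in our vocabulary (hint: use @)
    similarities = vectors @ emb
    # get word_ids in ascending order with respect to similarity score
    ids_ascending = similarities.argsort()
    # reverse word_ids
    ids_descending = ids_ascending[::-1]
    # get boolean array with element corresponding to word_id set to false
    mask = ids_descending != word_id
    # obtain new array of indices that doesn't contain word_id
    # (otherwise the most similar word to the argument would be the argument itself)
    ids_descending = ids_descending[mask]
    # get topn word_ids
    top_ids = ids_descending[:topn]
    # retrieve topn words with their corresponding similarity score
    top_words = [(index_to_key[i], similarities[i]) for i in top_ids]
    # return results
    return top_words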

Now let's try the same example that we used above: the most similar words to "cactus".

In [1]:
vectors = glove.get_normed_vectors()
index_to_key = glove.index_to_key
key_to_index = glove.key_to_index
most_similar_words("cactus", vectors, index_to_key, key_to_index)

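Sorting all 400,000 similarity scores is simple but does more work than needed when only the top few are wanted. As a possible alternative (a swapped-in technique, not what the steps above describe), `np.argpartition` can pick out the `k` largest scores first, so that only those need sorting. The helper name `top_k_ids` is hypothetical, it assumes `k` is no larger than the number of scores, and it does not exclude the query word.

```python
import numpy as np

def top_k_ids(similarities, k):
    # indices of the k largest scores, in no particular order
    candidates = np.argpartition(similarities, -k)[-k:]
    # sort just those k candidates from most to least similar
    return candidates[np.argsort(similarities[candidates])[::-1]]

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])  # made-up similarity scores
print(top_k_ids(scores, 3))  # [1 3 4]
```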

## Analogies from scratch

The `most_similar_words()` function behaves as expected. Now let's implement a function to perform the analogy task. We will give it the very creative name `analogy`. This function takes two lists of words (one for positive words and one for negative words), just like the [`most_similar()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar) method we discussed above.

In [21]:
import numpy as np
from numpy.linalg import norm

def analogy(positive, negative, vectors, index_to_key, key_to_index, topn=10):
    # find ids for positive and negative words
    pos_ids = [key_to_index[w] for w in positive]
    neg_ids = [key_to_index[w] for w in negative]
    given_word_ids = pos_ids + neg_ids
    # get embeddings for positive and negative words
    pos_emb = vectors[pos_ids].sum(axis=0)
    neg_emb = vectors[neg_ids].sum(axis=0)
    # get embedding for analogy
    emb = pos_emb - neg_emb
    # normalize embedding
    emb = emb / norm(emb)
    # calculate similarities to all words in our vocabulary
    similarities = vectors @ emb
    # get word_ids in ascending order with respect to similarity score
    ids_ascending = similarities.argsort()
    # reverse word_ids
    ids_descending = ids_ascending[::-1]
    # get boolean array with element corresponding to any of given_word_ids set to false
    # (hint: you can use np.isin)
    given_words_mask = np.isin(ids_descending, given_word_ids, invert=True)
    # obtain new array of indices that doesn't contain any of the given_word_ids
    ids_descending = ids_descending[given_words_mask]
    # get topn word_ids
    top_ids = ids_descending[:topn]
    # retrieve topn words with their corresponding similarity score
    top_words = [(index_to_key[i], similarities[i]) for i in top_ids]
    # return results
    return top_words

Let's try this function with the $\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$ example we discussed above.

In [2]:
positive = ["king", "woman"]
negative = ["man"]
vectors = glove.get_normed_vectors()
index_to_key = glove.index_to_key
key_to_index = glove.key_to_index
analogy(positive, negative, vectors, index_to_key, key_to_index)
