# GloVe Vectors

(This notebook is derived from the terrific notebook by Lisa Zhang; I have added normalization and cosine similarity to improve some of the analogy results.)

The idea of learning an alternative representation/features/*embeddings* of data
is prevalent in machine learning. You have seen how convolutional networks will
learn generalized feature detectors. Good representations will
make downstream tasks (like generating new data, clustering, computing distances)
perform much better.

GloVe provides a similar kind of pre-trained embedding, but for **words**.

You can use **GloVe embeddings** in much the same way you might use pre-trained
network weights.  More information about GloVe is available here: https://nlp.stanford.edu/projects/glove/

## GloVe Embeddings

PyTorch makes it easy for us to use pre-trained GloVe embeddings.
There are several variations of GloVe embeddings available; they differ in the corpus (data)
that the embeddings are trained on, and the size (length) of each word embedding vector.

These embeddings were trained by the authors of GloVe (Pennington et al. 2014),
and are also available on the website https://nlp.stanford.edu/projects/glove/

To load pre-trained GloVe embeddings, we'll use a package called `torchtext`.
The package `torchtext` contains other useful tools for working with text
that we will see later in the course. The documentation for torchtext
GloVe vectors is available at: https://torchtext.readthedocs.io/en/latest/vocab.html#glove

In [1]:
import torch
import torchtext

# The first run downloads a ~823MB file
glove = torchtext.vocab.GloVe(name="6B", # trained on Wikipedia 2014 + Gigaword 5 (6B tokens)
                              dim=50)    # embedding size = 50

Let's look at what the embedding of the word "car" looks like:

In [2]:
glove['car']

tensor([ 0.4769, -0.0846,  1.4641,  0.0470,  0.1469,  0.5082, -1.2228, -0.2261,
         0.1931, -0.2976,  0.2060, -0.7128, -1.6288,  0.1710,  0.7480, -0.0619,
        -0.6577,  1.3786, -0.6804, -1.7551,  0.5832,  0.2516, -1.2114,  0.8134,
         0.0948, -1.6819, -0.6450,  0.6322,  1.1211,  0.1611,  2.5379,  0.2485,
        -0.2682,  0.3282,  1.2916,  0.2355,  0.6147, -0.1344, -0.1324,  0.2740,
        -0.1182,  0.1354,  0.0743, -0.6195,  0.4547, -0.3032, -0.2188, -0.5605,
         1.1177, -0.3659])

It is a torch tensor with shape `(50,)`. It is difficult to determine what each
number in this embedding means, if anything. However, we know that there is structure
in this embedding space. That is, distances in this embedding space are meaningful.

## Measuring Distance

To explore the structure of the embedding space, it is necessary to introduce
a notion of *distance*. You are probably already familiar with the notion
of the **Euclidean distance**. The Euclidean distance between two vectors $x = [x_1, x_2, \ldots, x_n]$ and
$y = [y_1, y_2, \ldots, y_n]$ is just the 2-norm of their difference $x - y$:

$$ \|x - y\|_2 = \sqrt{\sum_i (x_i - y_i)^2} $$

The PyTorch function `torch.norm` computes the 2-norm of a vector for us, so we 
can compute the Euclidean distance between two vectors like this:

In [3]:
x = glove['cat']
y = glove['dog']
torch.norm(y - x)

tensor(1.8846)
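
To connect the formula to the code, here is a minimal plain-Python sketch that computes the same quantity term by term. The vectors are made-up toy values rather than GloVe embeddings, and `euclidean_distance` is a helper name introduced only for this illustration.

```python
import math

def euclidean_distance(x, y):
    """2-norm of the difference x - y, written out term by term."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Toy vectors (hypothetical values, not real embeddings)
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

print(euclidean_distance(x, y))  # sqrt(3^2 + 4^2 + 0^2) = 5.0
```

On real embeddings, `torch.norm(y - x)` does the same computation on tensors.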

An alternative measure is the **cosine similarity**.
The cosine similarity is the cosine of the *angle* between two vectors,
so it depends only on the *direction* of the
vectors, not their magnitudes.

In [4]:
x = torch.tensor([1., 1., 1.])[None]
y = torch.tensor([2., 2., 2.])[None]
torch.cosine_similarity(x, y) # should be one

tensor([1.])
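
For reference, the quantity being computed is

$$ \cos(x, y) = \frac{x \cdot y}{\|x\|_2 \, \|y\|_2} $$

Rescaling either vector multiplies the numerator and the denominator by the same factor, which is why the magnitude drops out. The sketch below checks this in plain Python on toy vectors. The `cosine_similarity` defined here is a hand-written helper for illustration, not the PyTorch function, and the vectors are made-up values.

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the two 2-norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

# Toy vectors (hypothetical values, not real embeddings)
a = [1.0, 2.0, 2.0]
b = [2.0, 1.0, 2.0]
scaled_b = [10.0 * bi for bi in b]  # same direction as b, ten times longer
opposite_a = [-ai for ai in a]      # points the opposite way from a

print(cosine_similarity(a, b))           # 8 / (3 * 3)
print(cosine_similarity(a, scaled_b))    # same value, up to rounding
print(cosine_similarity(a, opposite_a))  # opposite directions give -1
```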

The cosine similarity is a *similarity* measure rather than a *distance* measure:
The larger the similarity,
the "closer" the word embeddings are to each other.

In [5]:
x = glove['cat']
y = glove['dog']
# [None] adds a leading batch dimension: shape (50,) -> (1, 50)
torch.cosine_similarity(x[None], y[None])

tensor([0.9218])

## Word Similarity

Now that we have a notion of distance in our embedding space, we can talk
about words that are "close" to each other in the embedding space.
For now, let's use Euclidean distances to look at how close various words
are to the word "cat".

In [6]:
word = 'cat'
other = ['dog', 'bike', 'kitten', 'puppy', 'kite', 'computer', 'neuron']
for w in other:
    dist = torch.norm(glove[word] - glove[w]) # euclidean distance
    print(w, float(dist))

dog 1.8846031427383423
bike 5.048375129699707
kitten 3.5068609714508057
puppy 3.0644655227661133
kite 4.210376262664795
computer 6.030652046203613
neuron 6.228669166564941


In fact, we can look through our entire vocabulary for words that are closest
to a point in the embedding space -- for example, we can look for words
that are closest to another word like "cat".

Keep in mind that GloVe vectors are trained on **word co-occurrences**, so
words with similar embeddings tend to co-occur with similar sets of other words. For example,
"cat" and "dog" tend to occur with similar other words, even more so than "cat"
and "kitten", because the latter two words tend to occur in *different contexts*!

In [7]:
def print_closest_words(vec, n=5):
    dists = torch.norm(glove.vectors - vec, dim=1)     # Euclidean distance from vec to every word
    lst = sorted(enumerate(dists.numpy()), key=lambda x: x[1]) # sort by ascending distance
    for idx, difference in lst[0:n+1]:                 # n+1 entries: a query word itself comes first at distance 0
        print(glove.itos[idx], difference)

print_closest_words(glove["cat"], n=10)

cat 0.0
dog 1.8846031
rabbit 2.4572797
monkey 2.8102052
cats 2.8972251
rat 2.9455352
beast 2.9878407
monster 3.0022194
pet 3.0396757
snake 3.0617998
puppy 3.0644655


## Exercise 1: define a closest-cosine function

Now define `print_closest_cosine` to be just like `print_closest_words`, but use the `torch.cosine_similarity` function.  ([Documentation here](https://pytorch.org/docs/stable/generated/torch.nn.functional.cosine_similarity.html).)

Hints:
 1. You will need to unsqueeze `vec` e.g., using `vec[None]`
 2. You will need to use the reverse sort order since it's a similarity instead of a distance.


In [8]:
def print_closest_cosine(vec, n=5):
    print('Your implementation of print_closest_cosine needed')

print_closest_cosine(glove["cat"], n=10)

Your implementation of print_closest_cosine needed


In [9]:
print_closest_words(glove['nurse'])

nurse 0.0
doctor 3.1274529
dentist 3.1306615
nurses 3.26872
pediatrician 3.3212206
counselor 3.3987114


In [10]:
print_closest_cosine(glove['nurse'])

Your implementation of print_closest_cosine needed


In [11]:
print_closest_words(glove['computer'])

computer 0.0
computers 2.4362664
software 2.926823
technology 3.1903508
electronic 3.5067408
computing 3.5999787


In [12]:
print_closest_cosine(glove['computer'])

Your implementation of print_closest_cosine needed


In [13]:
print_closest_words(glove['white'])

white 0.0
black 2.294861
green 2.597257
gray 2.7076583
brown 2.7215066
blue 3.1592987


In [14]:
print_closest_cosine(glove['white'])

Your implementation of print_closest_cosine needed


In [15]:
print_closest_words(glove['off-white'])

off-white 0.0
yellowish-brown 2.1355708
orange-brown 2.3458037
yellow-brown 2.422701
red-brown 2.4578924
reddish-orange 2.6069083


In [16]:
print_closest_cosine(glove['off-white'])

Your implementation of print_closest_cosine needed


We could also look at which words are closest to the midpoint between two words:

In [17]:
print_closest_words((glove['seattle'] + glove['tokyo']) / 2)

chicago 2.8645856
tokyo 2.894505
seattle 2.8945053
york 3.0259786
toronto 3.0561922
phoenix 3.175262


In [18]:
print_closest_cosine((glove['seattle'] + glove['tokyo']) / 2)

Your implementation of print_closest_cosine needed


## Analogies

One surprising aspect of GloVe vectors is that the *directions* in the
embedding space can be meaningful. The structure of the GloVe vectors is such that
certain analogy-like relationships like this one tend to hold:

$$ \text{king} - \text{man} + \text{woman} \approx \text{queen} $$
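
To see the arithmetic in isolation, here is a toy sketch in plain Python. The 2-D vectors are invented for illustration, with one axis standing for "royalty" and one for "gender"; real GloVe dimensions are not interpretable this way, and the relationship only holds approximately on real embeddings.

```python
# Hypothetical 2-D "embeddings": [royalty, gender]. Invented for illustration only.
toy = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

def sub(u, v):
    return [ui - vi for ui, vi in zip(u, v)]

def squared_distance(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

# king - man + woman
target = add(sub(toy["king"], toy["man"]), toy["woman"])

# The toy word whose vector lies closest to the target point
closest = min(toy, key=lambda w: squared_distance(toy[w], target))

print(target, closest)  # [1.0, -1.0] queen
```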

In [19]:
def normalize(v):
    return v / v.norm()
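
One way to see what normalization does (a side note, not a claim about how GloVe was trained): for unit-length vectors $a$ and $b$,

$$ \|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\, a \cdot b = 2 - 2\cos(a, b) $$

so, among unit vectors, the smallest Euclidean distance and the largest cosine similarity pick out the same neighbour. The plain-Python sketch below checks the identity on toy vectors; `norm` and `normalize_list` are helper names made up for this illustration so that they do not shadow the `normalize` defined above.

```python
import math

def norm(v):
    """2-norm of a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))

def normalize_list(v):
    """Plain-Python analogue of v / v.norm()."""
    n = norm(v)
    return [vi / n for vi in v]

# Toy vectors (hypothetical values, not real embeddings)
a = normalize_list([3.0, 4.0])
b = normalize_list([1.0, 0.0])

squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
cosine = sum(ai * bi for ai, bi in zip(a, b))  # dot product of unit vectors

print(norm(a))                           # approximately 1.0
print(squared_distance, 2 - 2 * cosine)  # the two values agree up to rounding
```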

In [20]:
print_closest_cosine(normalize(glove['king']) + normalize(glove['woman'] - glove['man']))

Your implementation of print_closest_cosine needed


We get reasonable answers like "queen".

We can likewise flip the analogy around:

In [21]:
print_closest_cosine(normalize(glove['queen']) + normalize(glove['man'] - glove['woman']))

Your implementation of print_closest_cosine needed


Or try different but related analogies along the gender axis:

In [22]:
print_closest_cosine(normalize(glove['king']) + normalize(glove['princess'] - glove['prince']))

Your implementation of print_closest_cosine needed


In [23]:
print_closest_cosine(normalize(glove['grandfather']) + normalize(glove['woman'] - glove['man']))

Your implementation of print_closest_cosine needed


## Exercise 2: make a capital city predictor

Use the analogy idea to try to make a capital city predictor, so for example, when you ask

`guess_capital('japan')` it says `tokyo` or `guess_capital('france')` it says `paris`.

Can you improve the accuracy of the predictor by averaging a few cases?

In [24]:
def guess_capital(country):
    print('Your implementation of guess_capital needed')

print(guess_capital('japan'))
print(guess_capital('france'))

Your implementation of guess_capital needed
None
Your implementation of guess_capital needed
None


## Biases in Word Vectors

Machine learning models have an air of "fairness" about them, since models
make decisions without human intervention. However, models can and do learn
whatever bias is present in the training data!

GloVe vectors seem innocuous enough: they are just representations of
words in some embedding space. Even so, we'll show that the structure
of the GloVe vectors encodes the everyday biases present in the texts
that they are trained on.

We'll start with an example analogy:

$$ \text{doctor} - \text{man} + \text{woman} \approx \text{??} $$

Let's use GloVe vectors to find the answer to the above analogy:

In [25]:
print_closest_cosine(normalize(glove['doctor']) + normalize(glove['woman'] - glove['man']))

Your implementation of print_closest_cosine needed


The $\text{doctor} - \text{man} + \text{woman} \approx \text{nurse}$ analogy is very concerning.
Just to verify, the same result does not appear if we flip the gender terms:

In [26]:
print_closest_cosine(normalize(glove['doctor']) + normalize(glove['man'] - glove['woman']))

Your implementation of print_closest_cosine needed


We see similar types of gender bias with other professions.

In [27]:
print_closest_cosine(normalize(glove['banker']) + normalize(glove['woman'] - glove['man']))

Your implementation of print_closest_cosine needed


In contrast, if we flip the gender terms, we get very
different results:

In [28]:
print_closest_cosine(normalize(glove['banker']) + normalize(glove['man'] - glove['woman']))

Your implementation of print_closest_cosine needed


You can run the same pair of queries for "engineer".

## Exercise 3: find other biases

Can you find other biases?

For example, what does GloVe say about a (woman-man) company founder?

What about a (man-woman) parent or teacher or housekeeper?

In [29]:
# TODO: explore the structure of GloVe embedding space here