# Part 2: Word Embeddings

In this part of the assignment, we'll explore a few properties of word embeddings. We'll use pre-trained GloVe ([Pennington et al. 2014](https://nlp.stanford.edu/pubs/glove.pdf)) embeddings, and evaluate on the analogy task described in ([Mikolov et al. 2013](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)).

If you haven't seen the [embeddings.ipynb](../../../materials/embeddings/embeddings.ipynb) demo notebook, we recommend you look through it; this part of the assignment will build on that material.

In [1]:
# Install a few python packages using pip
from w266_common import utils
utils.require_package("wget")      # for fetching dataset

In [2]:
# Standard python helper libraries.
import os, sys, re, json, time
import itertools, collections
from importlib import reload
from IPython.display import display

# NumPy and SciPy for matrix ops
import numpy as np
import scipy.sparse

# NLTK for NLP utils
import nltk

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz



# Fits like a GloVe

Word embeddings take a long time to train - since the goal is to provide a good representation for as many words as possible, generating good embeddings often requires making several passes over a very large corpus. 

Fortunately, it's possible to learn fairly general embeddings from large corpora that are useful for many downstream tasks. We'll use the GloVe vectors available at https://nlp.stanford.edu/projects/glove/ - specifically, a set trained with a vocabulary of 400,000 on a corpus of 6B tokens from Wikipedia and Gigaword.

The vectors are distributed as a (very) large text file, with one word per line followed by its vector:
```
the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459
```

We've implemented a helper class, `Hands` in `glove_helper.py`, that parses these files in a memory-efficient manner and provides a wrapper object over a NumPy array containing the actual vectors.
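To make the file format concrete, here is a minimal sketch of how lines in this format could be parsed into a word list and a matrix. It is not the implementation in `glove_helper.py`; the function name `parse_glove_lines` is hypothetical, and the sample rows are shortened to three dimensions (the second row is made up for illustration).

```python
import io
import numpy as np

def parse_glove_lines(lines):
    """Parse 'word v1 v2 ...' lines into a word list and a float32 matrix.

    Illustrative only: assumes space-separated fields and equal-length vectors.
    """
    words, rows = [], []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        rows.append(np.array([float(x) for x in parts[1:]], dtype=np.float32))
    return words, np.vstack(rows)

# Shortened sample input; the "cat" row is invented for this sketch.
sample = io.StringIO(
    "the -0.038194 -0.24487 0.72812\n"
    "cat 0.45281 -0.50108 -0.53714\n"
)
words, W = parse_glove_lines(sample)
print(words, W.shape, W.dtype)
```

Storing the vectors as `float32` rather than Python floats keeps the memory footprint small, which matters at 400,000 rows.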

Run the cell below; the first time, it will download an ~800 MB file to the `data/` directory. **_Please do not check this in to git!_**

In [3]:
import glove_helper; reload(glove_helper)

hands = glove_helper.Hands(ndim=100)  # 50, 100, 200, 300 dim are available

Loading vectors from data/glove/glove.6B.zip
Parsing file: data/glove/glove.6B.zip:glove.6B.100d.txt
Found 400,000 words.
Parsing vectors... Done! (W.shape = (400003, 100))


`hands` has a few properties and methods that might be useful:
- `hands.vocab` is a `vocabulary.Vocabulary` object that manages the set of available words
- `hands.W` is a matrix of shape $|V| \times d$ containing the actual vectors, one per row. Row indices are as given by `hands.vocab.word_to_id[word]`.
- `hands.get_vector(word)` returns the vector for a word (passed as a string).

Note that we let $|V| = $`hands.W.shape[0]`, which in addition to the actual words includes three special tokens: `<s>` (begin sentence), `</s>` (end sentence), and `<unk>` (unknown word).

In [4]:
hands.vocab

<w266_common.vocabulary.Vocabulary at 0x7f0bc94740f0>

In [5]:
hands.W

array([[ 0.05209883, -0.09711445, -0.1380765 , ...,  0.12381283,
        -0.23434106, -0.00925518],
       [ 0.05209883, -0.09711445, -0.1380765 , ...,  0.12381283,
        -0.23434106, -0.00925518],
       [ 0.05209883, -0.09711445, -0.1380765 , ...,  0.12381283,
        -0.23434106, -0.00925518],
       ..., 
       [ 0.36087999, -0.16919   , -0.32703999, ...,  0.27138999,
        -0.29188001,  0.16109   ],
       [-0.10461   , -0.50470001, -0.49331   , ...,  0.42526999,
        -0.51249999, -0.17054   ],
       [ 0.28365001, -0.62629998, -0.44351   , ...,  0.43678001,
        -0.82607001, -0.15701   ]], dtype=float32)

In [6]:
hands.W.shape

(400003, 100)

In [7]:
hands.vocab.word_to_id["geek"]

26312

In [8]:
hands.get_vector("geek")

array([ -1.08549997e-01,   2.05390006e-01,   9.30869997e-01,
        -1.21159995e+00,  -3.63279998e-01,   5.88349998e-01,
         1.24959998e-01,  -9.69470013e-03,   3.54460001e-01,
         9.26100016e-01,  -3.95599991e-01,  -3.21720004e-01,
        -3.35909992e-01,  -6.74260035e-02,   7.68050030e-02,
         7.65829980e-01,   6.58720016e-01,   2.09779993e-01,
        -2.53639996e-01,   5.18010020e-01,  -2.09670007e-01,
        -2.16020003e-01,  -6.55939996e-01,  -5.50639987e-01,
         2.71349996e-01,   1.54489994e-01,   2.28990003e-01,
         1.84009999e-01,  -4.96740006e-02,   1.47630006e-01,
        -1.33540004e-01,  -1.23460002e-01,  -2.11300001e-01,
        -1.41900003e-01,   4.23830003e-01,   3.04459989e-01,
        -8.78130019e-01,   1.59799993e-01,   3.66650000e-02,
        -6.43159986e-01,  -4.54999991e-02,   1.01970002e-01,
        -3.19700003e-01,  -5.69100022e-01,  -1.00000001e-01,
        -2.74030000e-01,  -3.81949991e-01,   8.14469993e-01,
         5.61450005e-01,

# Part (a): Nearest Neighbors

### Cosine Similarity

To measure the similarity of two words, we'll use the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between their representation vectors:

$$ D^{cos}_{ij} = \frac{v_i^T v_j}{||v_i||\ ||v_j||}$$

*Note that this is called cosine similarity because $D^{cos}_{ij} = \cos(\theta_{ij})$, where $\theta_{ij}$ is the angle between the two vectors.*
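As a quick check of the formula, the following sketch computes the cosine similarity of a few small hand-picked vectors. The function name `cosine_similarity` is chosen for illustration and is not part of `vector_math.py`.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors (assumes nonzero norms)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

print(cosine_similarity(a, a))      # same direction: 1.0
print(cosine_similarity(a, b))      # 45 degrees apart: about 0.707
print(cosine_similarity(a, -a))     # opposite direction: -1.0
print(cosine_similarity(a, 5 * a))  # scaling does not change the angle: 1.0
```

Because the norms are divided out, cosine similarity depends only on direction, not on vector length.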

## Part (a) Questions

1. In `vector_math.py`, implement the `find_nn_cos(...)` function. Read the docstring _carefully_ - it describes what you should return. *Hint: use NumPy functions instead of a `for` loop.*
<p>
2. Use the `show_nns(...)` function below to find the nearest neighbors for the words `"bank"`, `"plane"`, and `"flies"`. Are the neighbors dominated by one sense of these words or another? Is there evidence that the vector encodes meaning of the other senses as well?
<p>
3. Like `word2vec`, GloVe constructs representations by summarizing word-word co-occurrence statistics. Use `show_nns(...)` to find the neighbors of `"green"` and `"celadon"`, and `"orange"` and `"ochre"`. Explain what you find in terms of the distributional hypothesis and the grounding problem. Do the vectors for `"ochre"` and `"celadon"` appear to encode a notion of color? What do they represent, instead?

_(Recall that the Distributional Hypothesis is the idea that "you shall know a word by the company it keeps" (Firth, 1957) - that meaning is derived from the context in which a word is used. Grounding refers to the meaning of language in terms of external concepts, such as real-world entities or physical characteristics.)_
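For question 1, the hint about avoiding a `for` loop can be illustrated with a small sketch on a toy matrix. This is one possible vectorized approach, not the reference solution: the name `nearest_by_cosine` is hypothetical, and the return convention (indices first, then similarities, nearest first) is an assumption inferred from how `show_nns(...)` consumes the result, so the docstring in `vector_math.py` remains the authority.

```python
import numpy as np

def nearest_by_cosine(v, W, k=10):
    """Return (indices, similarities) of the k rows of W most similar to v.

    Assumes no row of W (and not v itself) has zero norm.
    """
    row_norms = np.linalg.norm(W, axis=1)            # one norm per row
    sims = (W @ v) / (row_norms * np.linalg.norm(v)) # all cosines at once
    top = np.argsort(-sims)[:k]                      # largest similarity first
    return top, sims[top]

# Toy "embedding matrix": four 2-D vectors, made up for illustration.
W_toy = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0],
                  [-1.0, 0.0]])

idx, sims = nearest_by_cosine(W_toy[0], W_toy, k=2)
print(idx)   # row 0 itself, then row 1
print(sims)
```

A single matrix-vector product computes all $|V|$ dot products at once, which is why this scales to the full 400,003-row matrix.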

In [9]:
import vector_math; reload(vector_math)

def show_nns(hands, word, k=10):
    """Helper function to print neighbors of a given word."""
    word = word.lower()
    print("Nearest neighbors for '{:s}'".format(word))
    v = hands.get_vector(word)
    for i, sim in zip(*vector_math.find_nn_cos(v, hands.W, k)):
        target_word = hands.vocab.id_to_word[i]
        print("{:.03f} : '{:s}'".format(sim, target_word))
    print("")
    
show_nns(hands, "the")

Nearest neighbors for 'the'
1.000 : 'the'
0.857 : 'this'
0.851 : 'part'
0.850 : 'one'
0.833 : 'of'
0.832 : 'same'
0.821 : 'first'
0.820 : 'on'
0.817 : 'its'
0.813 : 'as'



In [10]:
#### YOUR CODE HERE ####
# Code for Part (a).2
print (show_nns(hands, "bank"))
print (show_nns(hands, "place"))
print (show_nns(hands, "flies"))
#### END(YOUR CODE) ####

Nearest neighbors for 'bank'
1.000 : 'bank'
0.806 : 'banks'
0.753 : 'banking'
0.704 : 'credit'
0.694 : 'investment'
0.678 : 'financial'
0.669 : 'securities'
0.665 : 'lending'
0.648 : 'funds'
0.648 : 'ubs'

None
Nearest neighbors for 'place'
1.000 : 'place'
0.807 : 'time'
0.795 : 'only'
0.785 : 'one'
0.784 : 'take'
0.780 : 'next'
0.780 : 'this'
0.772 : 'the'
0.770 : 'places'
0.765 : 'where'

None
Nearest neighbors for 'flies'
1.000 : 'flies'
0.741 : 'fly'
0.644 : 'flying'
0.634 : 'insects'
0.632 : 'flew'
0.618 : 'butterflies'
0.614 : 'moths'
0.609 : 'moth'
0.581 : 'planes'
0.576 : 'plane'

None


Ans. For `"bank"`, 7 of the 9 neighbors (excluding the word itself) relate to financial services; only `"banks"` and `"ubs"` are closer to the sense of a bank as a place. For `"place"`, the neighbors' meanings are more spread out: only `"places"` and `"where"` relate to locations, and many of the others are function words such as determiners. For `"flies"`, the neighbors refer to insects, to the action of flying, or to aircraft. The senses are spread out but related. The vectors therefore appear to encode multiple word senses rather than being restricted to a single one.

In [11]:
#### YOUR CODE HERE ####
# Code for Part (a).3
print (show_nns(hands, "green"))
print (show_nns(hands, "celadon"))
print (show_nns(hands, "orange"))
print (show_nns(hands, "ochre"))
#### END(YOUR CODE) ####

Nearest neighbors for 'green'
1.000 : 'green'
0.820 : 'red'
0.787 : 'blue'
0.781 : 'brown'
0.771 : 'yellow'
0.762 : 'white'
0.749 : 'gray'
0.733 : 'black'
0.729 : 'pink'
0.728 : 'purple'

None
Nearest neighbors for 'celadon'
1.000 : 'celadon'
0.620 : 'faience'
0.602 : 'porcelains'
0.594 : 'majolica'
0.591 : 'ocher'
0.585 : 'blue-and-white'
0.575 : 'glazes'
0.563 : 'unglazed'
0.558 : 'porcelain'
0.549 : 'steatite'

None
Nearest neighbors for 'orange'
1.000 : 'orange'
0.736 : 'yellow'
0.714 : 'red'
0.712 : 'blue'
0.711 : 'green'
0.678 : 'pink'
0.677 : 'purple'
0.671 : 'black'
0.665 : 'colored'
0.625 : 'lemon'

None
Nearest neighbors for 'ochre'
1.000 : 'ochre'
0.687 : 'pigment'
0.677 : 'reddish'
0.674 : 'ocher'
0.662 : 'coloured'
0.658 : 'greenish'
0.648 : 'magenta'
0.634 : 'pigments'
0.632 : 'yellowish'
0.629 : 'mottled'

None


Ans. Some color words are used mostly in very specific contexts. For example, `"celadon"` is often used to describe porcelain, so its neighbors are mostly words related to ceramics rather than colors. `"ochre"` is often used in the context of pigments and dyeing, so its neighbors are mostly words related to pigments and coloring rather than basic color names. The problem with distributional representations, then, is that encoding a word by its context can lose precision about the word's intrinsic meaning. Word representations can be grounded in real-world external concepts, but may then be unable to adapt to new and possible concepts.
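The effect described above can be illustrated by counting context words in a tiny corpus. The sketch below is only a toy: the four sentences are invented for illustration, the function name `cooccurrence_counts` is hypothetical, and real GloVe training uses weighted counts over billions of tokens rather than raw window counts.

```python
from collections import Counter

def cooccurrence_counts(sentences, target, window=3):
    """Count words appearing within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Count every token in the window except the target position.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

# Invented sentences, for illustration only.
toy_corpus = [
    "celadon glaze on porcelain",
    "a porcelain vase with celadon glaze",
    "the green car and the red car",
    "green and red and blue paint",
]

celadon = cooccurrence_counts(toy_corpus, "celadon")
green = cooccurrence_counts(toy_corpus, "green")
print(celadon.most_common())
print(green.most_common())
```

In this toy corpus, `"celadon"` only ever co-occurs with ceramics vocabulary and never with another color word, so a model built from these counts has no evidence that it names a color.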

# Part (b): Linear Analogies

In this part, you'll implement the word analogy task described in Section 4 of ([Mikolov et al. 2013](https://arxiv.org/pdf/1301.3781.pdf)), and discussed in sections 4.8 and 4.11 of the async.

1. In `vector_math.py`, implement the `analogy(...)` function. (*Hint: this should be a very short function, given what you've already written above.*)
<p>
2. Evaluate a few analogies using the `show_analogy(...)` function below. In particular, find at least one analogy that tests each of the following relationships, and that the model gets right:<ul>
<li> Singular / plural
<li> Superlatives
<li> Verb tense
<li> Country / capital
</ul>
(See Table 1 of ([Mikolov et al. 2013](https://arxiv.org/pdf/1301.3781.pdf)) for a few ideas)
<p>
3. Evaluate the following analogies:
<ul>
<li> `"lizard" is to "reptile" as "dog" is to ____`
<li> `"finger" is to  "hand"   as "toe" is to ____`
</ul>
What types of relations do these test? (*Hint: think back to WordNet, and things that end in -nymy.*) Does our approach of linear analogies work well here? What assumption is violated by these sorts of relationships? (*Hint: what if we reversed the order, and tested "reptile" is to "lizard", and so on?*)
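To make the linear analogy idea concrete, here is a minimal sketch on hand-built 2-D vectors. It assumes the formulation from the cited paper, in which "a is to b as c is to ___" is answered by the word whose vector is closest (by cosine similarity) to $v_b - v_a + v_c$. The toy vectors and the names `analogy_target` and `cos` are made up for illustration and are not part of `vector_math.py`.

```python
import numpy as np

def analogy_target(va, vb, vc):
    """Query vector for 'a is to b as c is to ___': apply the a->b offset to c."""
    return vb - va + vc

def cos(u, v):
    """Cosine similarity of two nonzero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
}

target = analogy_target(toy["king"], toy["queen"], toy["man"])
ranked = sorted(toy, key=lambda w: cos(toy[w], target), reverse=True)
print(target)  # equals the toy vector for "woman"
print(ranked)  # "woman" ranks first
```

In real embeddings the offset is only approximately shared across word pairs, which is why the input words themselves often appear near the top of the results.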

In [12]:
import vector_math; reload(vector_math)

def show_analogy(hands, a, b, c, k=5):
    """Compute and print a vector analogy."""
    a, b, c = a.lower(), b.lower(), c.lower()
    va = hands.get_vector(a)
    vb = hands.get_vector(b)
    vc = hands.get_vector(c)
    print("'{a:s}' is to '{b:s}' as '{c:s}' is to ___".format(**locals()))
    for i, sim in zip(*vector_math.analogy(va, vb, vc, hands.W, k)):
        target_word = hands.vocab.id_to_word[i]
        print("{:.03f} : '{:s}'".format(sim, target_word))
    print("")

In [13]:
show_analogy(hands, "king", "queen", "man")

'king' is to 'queen' as 'man' is to ___
0.804 : 'woman'
0.779 : 'man'
0.735 : 'girl'
0.682 : 'she'
0.659 : 'her'



In [18]:
#### YOUR CODE HERE ####
# Code for Part (b).2
print ()
print ("----------------------------------------")
print ("Singular/Plural:")
print (show_analogy(hands, "mouse", "mice", "horse"))
print ("----------------------------------------")
print ("Superlatives:")
print (show_analogy(hands, "thin", "thinnest", "dense"))
print ("----------------------------------------")
print ("Verb tense:")
print (show_analogy(hands, "eat", "ate", "go"))
print ("----------------------------------------")
print ("Country / capital:")
print (show_analogy(hands, "China", "Beijing", "India"))
print (show_analogy(hands, "Japan", "Tokyo", "Australia"))
print (show_analogy(hands, "Peru", "Lima", "Korea"))

#### END(YOUR CODE) ####


----------------------------------------
Singular/Plural:
'mouse' is to 'mice' as 'horse' is to ___
0.823 : 'horses'
0.750 : 'horse'
0.614 : 'breeders'
0.611 : 'cows'
0.609 : 'thoroughbred'

None
----------------------------------------
Superlatives:
'thin' is to 'thinnest' as 'dense' is to ___
0.681 : 'densest'
0.644 : 'thinnest'
0.631 : 'rainforests'
0.571 : 'undergrowth'
0.562 : 'sub-tropical'

None
----------------------------------------
Verb tense:
'eat' is to 'ate' as 'go' is to ___
0.845 : 'went'
0.798 : 'came'
0.791 : 'gone'
0.772 : 'got'
0.748 : 'going'

None
----------------------------------------
Country / capital:
'china' is to 'beijing' as 'india' is to ___
0.897 : 'delhi'
0.817 : 'india'
0.714 : 'islamabad'
0.707 : 'pakistan'
0.665 : 'lahore'

None
'japan' is to 'tokyo' as 'australia' is to ___
0.833 : 'sydney'
0.768 : 'london'
0.755 : 'melbourne'
0.723 : 'australia'
0.714 : 'perth'

None
'peru' is to 'lima' as 'korea' is to ___
0.791 : 'seoul'
0.774 : 'korea'
0.771 : 

Ans. The model doesn't always get the relationships right. After all, the distributed representation of each word is learned from context rather than from stricter semantic relationships, and contextual relationships are not exactly logical relationships. Contextual relationships are less static and more diverse: multiple relationships can exist between two words, and the same relationship can hold between one word and many other candidates, whereas logical relationships are rigid.

In [13]:
#### YOUR CODE HERE ####
# Code for Part (b).3
print (show_analogy(hands, "lizard", "reptile", "dog", k=30))
print (show_analogy(hands, "finger", "hand", "toe", k=30))
print ("----------------------------------------------------")
print ("Reverse order:")
print (show_analogy(hands,  "reptile","lizard", "mammal", k=15))
print (show_analogy(hands, "hand", "finger", "foot", k=15))
#### END(YOUR CODE) ####

'lizard' is to 'reptile' as 'dog' is to ___
0.768 : 'dog'
0.682 : 'dogs'
0.662 : 'pet'
0.635 : 'puppy'
0.629 : 'animal'
0.617 : 'cat'
0.575 : 'pets'
0.574 : 'reptile'
0.556 : 'petting'
0.542 : 'animals'
0.542 : 'herd'
0.538 : 'hog'
0.531 : 'hunting'
0.526 : 'toy'
0.526 : 'horse'
0.525 : 'cow'
0.521 : 'canine'
0.521 : 'bird'
0.514 : 'dinosaur'
0.505 : 'moose'
0.503 : 'breed'
0.499 : 'pig'
0.497 : 'sled'
0.494 : 'cats'
0.492 : 'handlers'
0.490 : 'backyard'
0.490 : 'mad'
0.490 : 'meat'
0.490 : 'horses'
0.487 : 'hobby'

None
'finger' is to 'hand' as 'toe' is to ___
0.776 : 'toe'
0.625 : 'hand'
0.581 : 'shoes'
0.567 : 'back'
0.566 : 'hands'
0.562 : 'wear'
0.561 : 'face'
0.555 : 'shoulder'
0.549 : 'heel'
0.547 : 'wearing'
0.543 : 'men'
0.539 : 'take'
0.533 : 'walk'
0.531 : 'knees'
0.528 : 'right'
0.527 : 'out'
0.525 : 'full'
0.523 : 'go'
0.523 : 'fit'
0.523 : 'quadruple'
0.521 : 'under'
0.519 : 'forced'
0.517 : 'pair'
0.516 : 'boots'
0.515 : 'fight'
0.513 : 'put'
0.511 : 'leg'
0.510 : 'knee'

Ans. The expected answers are `'mammal'` for `'dog'` and `'foot'` for `'toe'`, but the model fails to return either within the top 30 candidates. The first analogy tests the relationship from hyponym to hypernym (hyponymy), and the second tests the relationship from part to whole (meronymy); our approach does not work well for either. Asking the questions in the reverse order, the model does a poor job as well.

The core assumption behind linear analogies is that a relationship is a one-to-one linear mapping in vector space: essentially, we draw a single offset vector from one point (word) to another. `man` points to `woman` as `boy` points to `girl`, and the reverse (`woman` points to `man` as `girl` points to `boy`) should hold as well. With hyponyms and hypernyms, however, the relationship is many-to-one in one direction and one-to-many in the other, which cannot be represented by a single offset vector. When `reptile` maps to `lizard`, `mammal` can map to `dog`, `fox`, or `pig`.
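The argument above can be written out as a short derivation. As a simplifying assumption, suppose a relation is captured exactly by a single offset vector $r$ (the symbol $r$ is introduced here for illustration):

$$ v_b - v_a = r \quad \text{for every pair } (a, b) \text{ in the relation} $$

Reversing the relation just negates the offset, $v_a - v_b = -r$, which is unproblematic for one-to-one relations such as `man`/`woman`. But if one word $a$ is related to two different words $b_1$ and $b_2$ (for example, `reptile` to both `lizard` and `snake`), then

$$ v_{b_1} = v_a + r = v_{b_2} $$

so the two words would have to share the same vector. The same collapse occurs in the many-to-one direction, since $v_{b_1} - r = v_a = v_{b_2} - r$. In practice the offsets hold only approximately, but this shows why a single offset cannot represent one-to-many or many-to-one relations well.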