# Lecture 6: Word Embeddings

 > The meaning of a word is its use in the language.
 
(Wittgenstein, 1953)

## Goal

Learn word representations that capture word **meanings** based on examples in a corpus.

Embed words in a **vector space**.


## Models

### Skip-gram model

Use a word to predict its context. $P(w_1 \> | \> w_2)$ is the probability that $w_1$ occurs in the vicinity of $w_2$.

Objective:

$$
\mathcal{L}(\theta; \mathbf{w}) = \sum_{t=1}^T \sum_{\Delta \in \mathcal{I}} 
\log p_\theta \left(w^{(t+\Delta)} \> | \> w^{(t)} \right).
$$

For every word in the corpus (index $t$), go through its neighbors and maximize the likelihood of the neighbor given the source word.
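The double sum can be sketched as a loop over (source word, neighbor) pairs. This is only an illustration: the notes do not define the offset set $\mathcal{I}$, so the symmetric window `window` below (offsets $-R, \dots, R$ without $0$) and the helper name `skipgram_pairs` are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (source, neighbor) pairs: one pair per term of the double sum."""
    # Assumed offset set I: {-window, ..., window} without 0.
    offsets = [d for d in range(-window, window + 1) if d != 0]
    for t, source in enumerate(tokens):
        for delta in offsets:
            # Skip neighbors that would fall outside the corpus.
            if 0 <= t + delta < len(tokens):
                yield source, tokens[t + delta]


corpus = "the cat sat on the mat".split()
pairs = list(skipgram_pairs(corpus, window=1))
print(pairs)
```

Each pair `(source, neighbor)` contributes one term $\log p_\theta(\text{neighbor} \> | \> \text{source})$ to the objective.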

Use MLE and compute $\hat{\theta} = {\arg\!\max}_\theta \mathcal{L}(\theta; \mathbf{w})$.

However, we haven't yet described (1) what $w^{(t)}$ actually is, and (2) how to model $p_\theta(w \> | \> w')$.


### Continuous bag-of-words

Use the context to predict the word (the reverse direction of skip-gram).

## Latent vector model

 1. Map each word $w$ to a vector plus a bias, $(x_w, b_w) \in \mathbb{R}^{D+1}$.
 2. Define **log-bilinear model** for the likelihoods:
 
 $$\log p_\theta(w \> | \> w') = \langle x_{w}, x_{w'} \rangle + b_w + \text{const.}$$


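Bilinear because the score $\langle x_w, x_{w'} \rangle$ is linear in each of its two arguments.

Why is the constant necessary? It is the normalizer: the log of the denominator that makes the probabilities sum to one over the vocabulary. It shows up as an additive (negative) constant because we are inside the log, where $\log(a/b) = \log(a) - \log(b)$. It is constant with respect to $w$, but it does depend on $w'$.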
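A minimal NumPy sketch of the log-bilinear model, to make the role of the constant concrete. The toy sizes, the random parameters and the helper name `log_prob` are assumptions for illustration, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 3                    # toy vocabulary size and embedding dimension
X = rng.normal(size=(V, D))    # row i holds the vector x_w of word i
b = rng.normal(size=V)         # per-word bias b_w


def log_prob(X, b, context):
    """Return log p(w | w') for every word w, with w' = context."""
    scores = X @ X[context] + b               # <x_w, x_w'> + b_w for all w
    log_Z = np.log(np.sum(np.exp(scores)))    # log of the sum over the vocabulary
    return scores - log_Z                     # the "const." is -log_Z


log_p = log_prob(X, b, context=2)
print(np.exp(log_p).sum())   # should be 1 up to floating-point error
```

Note that `log_Z` requires a sum over the whole vocabulary for each conditioning word.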

**Key insight in representation:**

$$ \angle(x_w, x_{w'}) \downarrow \quad \implies \quad p_\theta(w | w') \uparrow $$

The higher the cosine similarity (i.e. the smaller the angle between the two word vectors), the higher the likelihood under the log-bilinear model, all else being equal!
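One way to see this, assuming the vector norms and the bias are held fixed, is to write the inner product in terms of the angle:

$$
\langle x_w, x_{w'} \rangle = \|x_w\| \, \|x_{w'}\| \cos \angle(x_w, x_{w'})
$$

For a fixed $w'$ the normalizer is the same for every $w$, so a smaller angle gives a larger cosine, a larger score, and therefore a larger $p_\theta(w \> | \> w')$.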

## Screw it let's see some pre-trained ones first!

In [4]:
from gensim.models import KeyedVectors
import numpy as np
from sklearn.manifold import TSNE

# This can be downloaded from:
# https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
pretrained_w2v_file = '../../project/data/word2vec/GoogleNews-vectors-negative300.bin'
# Recent gensim versions load pre-trained vectors through KeyedVectors
# (older versions exposed this as Word2Vec.load_word2vec_format).
w2v = KeyedVectors.load_word2vec_format(fname=pretrained_w2v_file, binary=True)

In [5]:
w2v.most_similar(positive=['queen', 'man'], negative=['woman'])

[('king', 0.6958590149879456),
 ('kings', 0.5950952768325806),
 ('queens', 0.5838501453399658),
 ('monarch', 0.5398427248001099),
 ('prince', 0.5223615169525146),
 ('princess', 0.5175285339355469),
 ('princes', 0.49844634532928467),
 ('royal', 0.4924592971801758),
 ('NYC_anglophiles_aflutter', 0.4859851002693176),
 ('Eugene_Ionesco_absurdist_comedy', 0.4784241318702698)]

In [24]:
w2v.vector_size

300

In [26]:
# TODO(andrei): T-SNE visualization.

## Problems with current model

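1. Bilinearity: each word has a single vector for both roles, so the score $\langle x_w, x_{w'} \rangle$ is symmetric in $w$ and $w'$.
2. Huge denominator: normalizing means summing over the entire vocabulary for **every word.**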

## Solving the problems

1. Context vectors: use two different embeddings per word, a word vector and a context vector. This makes the model more flexible at the cost of complexity (more parameters).

2. Don't use maximum likelihood. In our case, we use GloVe, which minimizes a weighted squared loss.

What is the partition function? It is the denominator used for probability normalization. We can skip it, e.g. in GloVe, by working with an unnormalized $\tilde{p}$ instead of the normalized $p$, which still works reasonably well.
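The weighted squared loss can be sketched as follows. The notes do not spell out the weighting function, so the form of `glove_weight` (with `n_max=100` and `alpha=0.75`, as in the GloVe paper), the context biases `d` and all array names are assumptions of this sketch.

```python
import numpy as np


def glove_weight(n, n_max=100.0, alpha=0.75):
    """Weighting f(n): grows with the count n and is capped at 1."""
    return np.minimum(1.0, (n / n_max) ** alpha)


def glove_loss(N, X, Y, b, d):
    """Weighted squared loss over the observed (non-zero) co-occurrence counts.

    N: (V, C) counts; X: (V, D) word vectors; Y: (C, D) context vectors;
    b: (V,) word biases; d: (C,) context biases.
    """
    i, j = np.nonzero(N)                                # only pairs with n_ij > 0
    pred = np.sum(X[i] * Y[j], axis=1) + b[i] + d[j]    # log of the unnormalized p~
    err = np.log(N[i, j]) - pred
    return np.sum(glove_weight(N[i, j]) * err ** 2)


rng = np.random.default_rng(0)
N = rng.integers(0, 5, size=(4, 4)).astype(float)   # toy co-occurrence counts
X, Y = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
b, d = rng.normal(size=4), rng.normal(size=4)
print(glove_loss(N, X, Y, b, d))
```

No sum over the vocabulary appears inside the prediction, which is the point of dropping the partition function.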

## GloVe vs. word2vec

| | GloVe | word2vec |
|---|---|---|
| Computation | Requires a precomputed co-occurrence matrix. | Requires repeated iterations through the entire corpus. |
| Model | Weighted squared loss | Negative sampling (a contrastive objective) |
| Approach | Count-based model which essentially tries to factorize the co-occurrence matrix. | Predictive model; the goal is to lower the loss of predicting the target words from the context words given the vector representations. The vector representations are updated via SGD, as in any learning problem. |

### What's the difference between normalized and unnormalized GloVe?

| Normalized | Unnormalized |
|---|---|
| Requires computation of the partition function (normalization constant). | No need to compute the partition function. |
| Large "spikes" (large $h(w)$) are counterbalanced by normalization. | No implicit counterbalancing; needs extra attention. |
 
TODO(andrei): What is a two-sided loss function?

GloVe uses an unnormalized probability distribution $\tilde{p}_{\theta}$.

$$\tilde{p}_\theta(w_i \> | \> w_j) = \exp\left[ \langle x_i, y_j \rangle + b_i + d_j \right]
= \exp\left[ \langle \bar{x}_i, \bar{y}_j \rangle \right]
$$

Provided we augment $x$ and $y$ as follows:

$$ 
D := D + 2; \quad
\bar{x}_{w,D-1} = 1, \> \bar{x}_{w,D} = b_w; \quad
\bar{y}_{w,D-1} = d_w, \> \bar{y}_{w,D} = 1.
$$
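A quick numerical check of the augmentation trick. The concrete numbers are arbitrary and only serve as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
x, y = rng.normal(size=D), rng.normal(size=D)   # word vector and context vector
b, d = 0.5, -1.2                                # word bias b_w and context bias d_w

x_bar = np.concatenate([x, [1.0, b]])   # append (1, b_w) to the word vector
y_bar = np.concatenate([y, [d, 1.0]])   # append (d_w, 1) to the context vector

lhs = x @ y + b + d    # exponent with explicit biases
rhs = x_bar @ y_bar    # exponent after augmentation
print(np.isclose(lhs, rhs))   # should print True
```

The two extra coordinates contribute $1 \cdot d_w + b_w \cdot 1$, which is exactly the sum of the two biases.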

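We can now define the matrix $M$ of log co-occurrence counts, where $n_{i,j}$ is the number of times word $w_i$ occurs in context $w_j$, and stack the word vectors $x_w$ and the context vectors $y_w$ into the matrices $X$ and $Y$:

$$ M = (m_{i,j}), \quad m_{i,j} = \log n_{i,j} $$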

$$
X := \left[ x_{w_1} \dots x_{w_{|\mathcal{V}\,|}} \right], \quad
Y := \left[ y_{w_1} \dots y_{w_{|\mathcal{C}|}} \right]
$$

(Remember, $\mathcal{V}$ represents the vocabulary, and $\mathcal{C}$ represents the contexts.)


Using these nifty tricks, we can now show that...

## GloVe solves a matrix factorization problem

When all loss weights are one, i.e. iff the weighting function is $f := 1$, the objective becomes:

$$ \min_{X,Y} \| M - X^T Y \|_F^2 $$

This is because we try to compute $X$ and $Y$ such that their product (i.e. the pairwise inner products between the column vectors of $X$ and the column vectors of $Y$) is as close as possible to the logs in $M$.
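For the unweighted case, a rank-$D$ minimizer of the Frobenius objective can be read off a truncated SVD (Eckart-Young theorem). The sketch below assumes toy counts that are all positive (so the log is defined) and ignores the constraint that the augmented coordinates are fixed to 1; it is not how GloVe is trained in practice, which uses stochastic gradient methods on the weighted loss.

```python
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(1, 50, size=(6, 6)).astype(float)   # toy counts, all > 0
M = np.log(N)                                        # m_ij = log n_ij

D = 2                                                # embedding dimension
U, s, Vt = np.linalg.svd(M, full_matrices=False)
X = (U[:, :D] * np.sqrt(s[:D])).T       # D x |V|, columns are word vectors
Y = np.sqrt(s[:D])[:, None] * Vt[:D]    # D x |C|, columns are context vectors

err = np.linalg.norm(M - X.T @ Y, "fro") ** 2
# The squared error equals the sum of the discarded squared singular values.
print(err, np.sum(s[D:] ** 2))
```

Splitting the singular values evenly between $X$ and $Y$ is one arbitrary choice; any invertible $A$ gives another solution $(A^{-T} X, A Y)$ with the same product.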

TODO(andrei): This was covered in the exercise session and the homework. Put the proof here and explain it.