# word2vec model

Representing words by their context
* Core idea: A word's meaning is given by the words that frequently appear close-by
* When a word $w$ appears in a text, its context is the set of words that appear nearby (within a fixed-size window).
* Use the many contexts of $w$ to build up a representation of $w$

Word vectors

We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts. Word vectors are sometimes called **word embeddings** or **word representations**.

Word2vec (Mikolov et al. 2013) is a framework for learning word vectors. Idea:

* We have a large corpus of text
* Every word in a fixed vocabulary is represented by a vector
* Go through each position $t$ in the text, which has a center word $c$ and context ("outside") words $o$
* Use the similarity of the word vectors for $c$ and $o$ to calculate the probability of $o$ given $c$ (or vice versa)
* Keep adjusting the word vectors to maximize this probability
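
As a rough illustration of the loop over positions described above, the following sketch enumerates (center, context) pairs from a tokenized text. The function name `window_pairs` and the toy sentence are assumptions made for this example; they are not part of word2vec itself.

```python
def window_pairs(tokens, m):
    """Return (center, context) pairs for every position t and offset j, 0 < |j| <= m."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            # Skip the center itself and offsets that fall outside the text.
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs


tokens = "the quick brown fox".split()
pairs = window_pairs(tokens, m=1)
```

With a window of size 1, each interior word contributes two pairs and each boundary word contributes one.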

## Objective functions
For each position $t = 1, \dots, T$, predict the context words within a window of fixed size $m$, given the center word $w_{t}$.

Likelihood

$$L(\theta) = \prod_{t=1}^{T} \prod_{-m \leq j \leq m, j \neq 0} P(w_{t+j}|w_{t};\theta)$$

Average negative log-likelihood (the objective we minimize)

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \log { P(w_{t+j}|w_{t};\theta)}$$

## How to calculate $P(w_{t+j}|w_{t};\theta)$?

We will use two vectors per word $w$:
* $v_{w}$ when $w$ is a center word
* $u_{w}$ when $w$ is a context word

Then for a center word $c$ and a context word $o$:

$$P(o|c) = \frac{\exp(u_{o}^\top v_{c})}{\sum_{w \in V} \exp(u_{w}^\top v_{c})}$$
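
The softmax above can be sketched in NumPy as follows. This is a minimal example that assumes the context vectors are stacked as the rows of a matrix `U` of shape `(vocab_size, dim)`; the function name `context_probs` and the random toy data are illustrative choices. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np


def context_probs(v_c, U):
    """Return P(w | c) for every word w, given center vector v_c and context matrix U."""
    scores = U @ v_c                  # u_w^T v_c for every w
    scores = scores - scores.max()    # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))           # toy vocabulary of 5 words, 3-dimensional vectors
v_c = rng.normal(size=3)
probs = context_probs(v_c, U)         # probs[o] is P(o | c)
```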

## Costs for sampling methods
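The cost of a single context word $o$ under the full softmax (cross-entropy) and under negative sampling, where $\sigma$ is the sigmoid function and $u_{1}, \dots, u_{K}$ are the context vectors of $K$ sampled negative words:

* $J_{softmax-CE}(o, v_{c}, U) = -u_{o}^\top v_{c} + \log \sum_{w \in V} \exp(u_{w}^\top v_{c})$
* $J_{neg-sample}(o, v_{c}, U) = -\log(\sigma(u_{o}^\top v_{c})) - \sum_{k=1}^{K} \log(\sigma(-u_{k}^\top v_{c}))$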
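
Both costs can be sketched in a few lines of NumPy. The function names and the `neg_indices` argument are assumptions for this example: the indices of the $K$ negative words are supplied by the caller, and how they are drawn is not covered here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax_ce_cost(o, v_c, U):
    """-u_o^T v_c + log sum_w exp(u_w^T v_c), with a stable log-sum-exp."""
    scores = U @ v_c
    shift = scores.max()
    log_norm = shift + np.log(np.exp(scores - shift).sum())
    return -scores[o] + log_norm


def neg_sample_cost(o, v_c, U, neg_indices):
    """-log sigmoid(u_o^T v_c) - sum_k log sigmoid(-u_k^T v_c)."""
    pos_term = -np.log(sigmoid(U[o] @ v_c))
    neg_term = -np.log(sigmoid(-(U[neg_indices] @ v_c))).sum()
    return pos_term + neg_term
```

The softmax cost touches every row of `U`, while the negative-sampling cost only touches the row for `o` and the `K` sampled rows, which is the reason the latter is cheaper for large vocabularies.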

## Two model variants
* Skip-gram (SG): predict context ("outside") words (position independent) given the center word
* Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words

Both variants can be written with a single cost of the form $J(o,\hat{v},U)$, where $o$ is the word being predicted and
* $\hat{v} = v_{c}$ in SG
* $\hat{v} = \sum_{-m \leq j \leq m, j \neq 0} v_{w_{t+j}}$ in CBOW

So the cost for a window centered at position $t$ is
* $J_{skip-gram}(w_{t-m}\dots w_{t+m}) = \sum_{-m \leq j \leq m, j \neq 0} J(w_{t+j},\hat {v},U)$
* $J_{CBOW}(w_{t-m}\dots w_{t+m}) = J(w_{t},\hat {v},U)$
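
The two window costs can be sketched as thin wrappers around a single-word cost. In this example the $v$ vectors are assumed to be stored as rows of a matrix `in_vecs` and the $u$ vectors as rows of `U`; these names, and the choice of the softmax cost as the default, are illustrative assumptions.

```python
import numpy as np


def softmax_ce_cost(o, v_hat, U):
    """Single-word softmax cross-entropy cost J(o, v_hat, U)."""
    scores = U @ v_hat
    shift = scores.max()
    return -scores[o] + shift + np.log(np.exp(scores - shift).sum())


def skipgram_cost(center, context, in_vecs, U, cost=softmax_ce_cost):
    """Sum the single-word cost over every context word, with v_hat = v_c."""
    v_hat = in_vecs[center]
    return sum(cost(o, v_hat, U) for o in context)


def cbow_cost(center, context, in_vecs, U, cost=softmax_ce_cost):
    """Predict the center word from the sum of the context words' v vectors."""
    v_hat = in_vecs[context].sum(axis=0)
    return cost(center, v_hat, U)
```

Here `center` is a word index and `context` is a list of word indices for the window, for example `skipgram_cost(2, [1, 3], in_vecs, U)`.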


## Gradients
Gradients for Skip-gram model

\begin{align}
\frac{\partial {J_{skip-gram}(w_{t-m}\dots w_{t+m})}}{\partial {U}} & = \sum_{-m \leq j \leq m, j \neq 0} \frac{\partial {J(w_{t+j},\hat {v},U)}}{\partial {U}} \\
\frac{\partial {J_{skip-gram}(w_{t-m}\dots w_{t+m})}}{\partial {v_c}} & = \sum_{-m \leq j \leq m, j \neq 0} \frac{\partial {J(w_{t+j},\hat {v},U)}}{\partial {v_c}} \\
\frac{\partial {J_{skip-gram}(w_{t-m}\dots w_{t+m})}}{\partial {v_{w_{t+j}}}} & = 0, \text{for all}\; j \neq 0
\end{align}
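
As a concrete instance, the sketch below accumulates skip-gram gradients when the single-word cost is the softmax cross-entropy. It relies on the standard softmax results $\partial J / \partial \hat{v} = U^\top(\hat{y} - y)$ and $\partial J / \partial U = (\hat{y} - y)\,\hat{v}^\top$, where $\hat{y}$ is the vector of predicted probabilities and $y$ is the one-hot vector for $o$. The function names and the matrix `in_vecs` (rows are the $v$ vectors) are assumptions for this example.

```python
import numpy as np


def softmax_ce_cost_and_grads(o, v_hat, U):
    """Return the cost J(o, v_hat, U) and its gradients w.r.t. v_hat and U."""
    scores = U @ v_hat
    scores = scores - scores.max()            # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    cost = -np.log(probs[o])
    delta = probs.copy()
    delta[o] -= 1.0                           # y_hat - y
    return cost, U.T @ delta, np.outer(delta, v_hat)


def skipgram_grads(center, context, in_vecs, U):
    """Sum cost and gradients over the context words of one window."""
    total = 0.0
    grad_in = np.zeros_like(in_vecs)          # only the center row becomes nonzero
    grad_U = np.zeros_like(U)
    for o in context:
        cost, g_v, g_U = softmax_ce_cost_and_grads(o, in_vecs[center], U)
        total += cost
        grad_in[center] += g_v
        grad_U += g_U
    return total, grad_in, grad_U
```

A finite-difference comparison on a small random example is a reasonable way to sanity-check gradients like these.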

Gradients for CBOW model

\begin{align}
\frac{\partial {J_{CBOW}(w_{t-m}\dots w_{t+m})}}{\partial {U}} & = \frac{\partial {J(w_{t},\hat {v},U)}}{\partial {U}} \\
\frac{\partial {J_{CBOW}(w_{t-m}\dots w_{t+m})}}{\partial {v_{w_{t+j}}}} & = \frac{\partial {J(w_{t},\hat {v},U)}}{\partial {\hat {v}}}, \text{for all}\;j \in \{-m,\dots,-1,+1,\dots,+m\} \\
\frac{\partial {J_{CBOW}(w_{t-m}\dots w_{t+m})}}{\partial {v_{w_{t+j}}}} & = 0, \text{for all}\;j \not\in \{-m,\dots,-1,+1,\dots,+m\}
\end{align}
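
The CBOW case can be sketched the same way: because $\hat{v}$ is a plain sum, every context word's $v$ vector receives the same gradient $\partial J / \partial \hat{v}$. The example again assumes the softmax cross-entropy as the single-word cost; the function names and the matrix `in_vecs` (rows are the $v$ vectors) are illustrative. `np.add.at` is used so that a word occurring more than once in the window accumulates its gradient once per occurrence.

```python
import numpy as np


def softmax_ce_cost_and_grads(o, v_hat, U):
    """Return the cost J(o, v_hat, U) and its gradients w.r.t. v_hat and U."""
    scores = U @ v_hat
    scores = scores - scores.max()            # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    cost = -np.log(probs[o])
    delta = probs.copy()
    delta[o] -= 1.0                           # y_hat - y
    return cost, U.T @ delta, np.outer(delta, v_hat)


def cbow_grads(center, context, in_vecs, U):
    """Cost and gradients for one CBOW window."""
    v_hat = in_vecs[context].sum(axis=0)
    cost, g_vhat, grad_U = softmax_ce_cost_and_grads(center, v_hat, U)
    grad_in = np.zeros_like(in_vecs)
    np.add.at(grad_in, context, g_vhat)       # same gradient for each context row
    return cost, grad_in, grad_U
```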