# How to learn word vectors?

## Gradient Descent

### (Batch) Gradient Descent
Batch gradient descent computes the cost over the entire training set and then takes a single gradient step.

### Stochastic Gradient Descent
Instead of computing the cost over the entire training set, stochastic gradient descent computes the cost on a single training sample and takes a gradient step after each one.

### Mini-Batch Gradient Descent
Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent: we compute the cost on a batch of N samples, where N is the batch size. The three methods can be defined by batch size:

1. Batch Gradient Descent: N = the size of the training set;
2. Stochastic Gradient Descent: N = 1;
3. Mini-Batch Gradient Descent: 1 < N < the size of the training set.
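
To make the role of N concrete, here is a minimal sketch in plain Python. It assumes a toy one-dimensional linear regression with a squared-error cost; the function name `gradient_descent`, the data, the learning rate and the number of epochs are all made up for illustration. Only `batch_size` changes between the three variants.

```python
import random

def gradient_descent(xs, ys, batch_size, lr=0.1, epochs=500, seed=0):
    """Fit y = w * x + b with a squared-error cost, using batches of `batch_size` samples."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w*x + b - y)^2 over the current batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4]
ys = [2 * x + 1 for x in xs]

batch_fit = gradient_descent(xs, ys, batch_size=len(xs))  # batch GD: N = training-set size
sgd_fit = gradient_descent(xs, ys, batch_size=1)          # SGD: N = 1
mini_fit = gradient_descent(xs, ys, batch_size=4)         # mini-batch GD: 1 < N < size

print(batch_fit, sgd_fit, mini_fit)
```

On this noise-free toy problem all three settings should recover roughly w = 2 and b = 1; they differ in how many parameter updates they make per pass over the data.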

## Word2Vec Additional Efficiency in Training: Negative Sampling
The denominator of the naive softmax is computationally expensive: for every centre word we have to iterate over every word in the vocabulary and compute its dot product with the centre word. **Negative Sampling** was proposed to avoid this cost.

**Main Idea: Train binary classification**

Instead of predicting context words from a centre word (as **skip-gram** with naive softmax does, for example), negative sampling predicts whether a word pair is a genuine (centre word, context word) pair. If it is, the pair is labelled Positive; otherwise it is labelled Negative. For example, take the sentence "I like apples and oranges" with a window size of 1 and "apples" as the centre word. The Positive pairs are (apples, like) and (apples, and). We then "randomly" (not entirely randomly, as explained later) select non-context words to form pairs such as (apples, I) and (apples, oranges), which are labelled Negative.
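
The pair construction for this example can be sketched as follows. The helper `positive_pairs` is a hypothetical name, and for simplicity the Negative candidates are taken from the same sentence; in practice they would be sampled from the whole vocabulary.

```python
def positive_pairs(tokens, centre_index, window_size):
    """Return (centre, context) pairs for words within `window_size` of the centre word."""
    centre = tokens[centre_index]
    lo = max(0, centre_index - window_size)
    hi = min(len(tokens), centre_index + window_size + 1)
    return [(centre, tokens[j]) for j in range(lo, hi) if j != centre_index]

tokens = "I like apples and oranges".split()
centre_index = tokens.index("apples")

positives = positive_pairs(tokens, centre_index, window_size=1)
context = {word for _, word in positives}
# Every other word in the sentence is a candidate Negative for this centre word.
negatives = [("apples", w) for w in tokens if w != "apples" and w not in context]

print(positives)  # [('apples', 'like'), ('apples', 'and')]
print(negatives)  # [('apples', 'I'), ('apples', 'oranges')]
```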

**Objective function**
$$
\max\;\log \,\sigma(v_{w_o}^\top v_{w_c})+\sum_{i=1}^{k}\log \,\sigma(-v_{w_i}^\top v_{w_c})
$$
where $v_{w_o}$ is the word vector of the Positive context word (the "o"utside word) and $v_{w_c}$ is the word vector of the centre word. The second term, $\sum_{i=1}^{k}\log \,\sigma(-v_{w_i}^\top v_{w_c})$, runs over $k$ words sampled from the vocabulary as Negative context words. For each sampled word $w_i$, we measure its similarity to the centre word $w_c$ with a dot product. We want this similarity to be small, but the objective as a whole is **maximised**, so we negate the dot product inside the sigmoid. Negating the whole objective then gives the **loss function**:
$$
\min\;-\log \,\sigma(v_{w_o}^\top v_{w_c})-\sum_{i=1}^{k}\log\,\sigma(-v_{w_i}^\top v_{w_c})
$$
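
As a sanity check, the loss above can be evaluated by hand for a single centre word. The sketch below uses plain Python lists as vectors; the function names and the toy two-dimensional vectors are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(v_centre, v_outside, v_negatives):
    """-log sigma(v_o . v_c) - sum_i log sigma(-v_i . v_c) for one centre word."""
    loss = -math.log(sigmoid(dot(v_outside, v_centre)))
    for v_neg in v_negatives:
        loss -= math.log(sigmoid(-dot(v_neg, v_centre)))
    return loss

v_c = [1.0, 0.0]  # centre word vector (toy values)

# Positive word aligned with the centre word, Negatives not aligned: low loss.
good = negative_sampling_loss(v_c, [2.0, 0.0], [[-2.0, 0.0], [0.0, 1.0]])
# Positive word pointing away, a Negative aligned instead: high loss.
bad = negative_sampling_loss(v_c, [-2.0, 0.0], [[2.0, 0.0], [0.0, 1.0]])

print(good, bad)
```

The loss is small when the Positive pair has a large dot product and the Negative pairs have small or negative ones, which is exactly the behaviour the objective asks for.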

**Dataset**

As for **how we sample Negative context words**, the natural choice is to sample them according to their frequency in the corpus, namely the **Unigram Distribution: $U(w)$**. However, on the one hand, some highly frequent words contribute little to final model performance (even though we have presumably removed stop words). For example, the model would keep being trained to tell "and" (a highly frequent word) apart from "cat", although that distinction is obvious (a bit like overfitting). On the other hand, less frequent words are more helpful for generalisation, so sampling them more often amplifies their signal and improves generalisation. How can we achieve this? We raise $U(w)$ to the power 3/4 (an empirical value) and divide by a normalising constant $Z=\sum_{w'}U(w')^{\frac{3}{4}}$, namely $P_n(w)=\frac{U(w)^{\frac{3}{4}}}{Z}$. Because the exponent is below 1, frequent words lose probability mass relative to $U(w)$ and rare words gain some. $P_n(w)$ is the distribution from which we sample the $k$ Negative context words.
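
The effect of the 3/4 power can be seen on a small example. This is a sketch using only the standard library; the word counts are made up, and `negative_sampling_distribution` is a hypothetical helper name.

```python
import random

def negative_sampling_distribution(counts, power=0.75):
    """Return P_n(w) = U(w)^power / Z computed from raw word counts."""
    total = sum(counts.values())
    weights = {w: (c / total) ** power for w, c in counts.items()}
    z = sum(weights.values())  # normalising constant Z
    return {w: weight / z for w, weight in weights.items()}

counts = {"and": 1000, "cat": 10, "pike": 1}  # hypothetical corpus counts
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}
p_n = negative_sampling_distribution(counts)

for w in counts:
    print(f"{w:5s} U(w)={unigram[w]:.4f}  P_n(w)={p_n[w]:.4f}")

# Draw k = 5 Negative context words from P_n.
rng = random.Random(0)
words = list(p_n)
negatives = rng.choices(words, weights=[p_n[w] for w in words], k=5)
```

Compared with $U(w)$, the frequent word "and" is sampled slightly less often and the rare word "pike" noticeably more often.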

## Co-occurrence Matrix: An intuitive but raw way of obtaining word vectors
### How to get a co-occurrence matrix:

1. Similar to word2vec, use a window around each word to capture semantic and syntactic locality.
2. Or use a paragraph or the entire document as the window. This is more common in information retrieval.
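
The window-based option (item 1) can be sketched as below. The function name `cooccurrence_counts` and the three-sentence toy corpus are assumptions for illustration; counts are stored sparsely in a dictionary keyed by (centre word, context word).

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window_size):
    """Count how often each (centre, context) pair occurs within `window_size` words."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, centre in enumerate(tokens):
            lo = max(0, i - window_size)
            hi = min(len(tokens), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(centre, tokens[j])] += 1
    return dict(counts)

corpus = [
    "I like deep learning".split(),
    "I like NLP".split(),
    "I enjoy flying".split(),
]
counts = cooccurrence_counts(corpus, window_size=1)

print(counts[("I", "like")])          # 2
print(counts.get(("I", "NLP"), 0))    # 0: never within one word of each other
```

Because the window is symmetric, the resulting matrix is symmetric as well, and most of the possible pairs never appear in the dictionary at all.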

Problems with co-occurrence matrices:

1. A large vocabulary → very high dimensionality → curse of dimensionality.
2. Many word pairs in the vocabulary never co-occur even once → sparsity (low information density).

## How can we reduce dimensionality while retaining as much information as possible?
One answer is singular value decomposition (SVD).
However, running an SVD directly on the raw co-occurrence matrix doesn't work well (because ······). Several methods help a lot:

1. Scale the counts (e.g. take the log of the frequencies, use min(count, t) with t ≈ 100, or ignore function words).
2. Use a ramped window, where counts are weighted by distance so that closer words count more; this is intuitive.
3. Use Pearson correlations instead of raw counts.
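
The first two ideas are simple enough to sketch in a few lines. The exact formulas here are assumptions chosen for illustration: `log_scale` uses log(1 + count) so that zero counts stay at zero, and `ramped_weight` uses a linear ramp, which is only one possible way of weighting by distance.

```python
import math

def clip_count(count, t=100):
    """Cap very large counts at t, i.e. min(count, t)."""
    return min(count, t)

def log_scale(count):
    """Compress large counts logarithmically; log(1 + count) keeps 0 at 0."""
    return math.log(1 + count)

def ramped_weight(distance, window_size):
    """Linear ramp: adjacent words get weight 1, words at the window edge get 1/window_size."""
    return (window_size - distance + 1) / window_size

print(clip_count(5000))                                 # 100
print(log_scale(5000))                                  # far smaller than the raw count
print([ramped_weight(d, 4) for d in range(1, 5)])       # [1.0, 0.75, 0.5, 0.25]
```

With a ramped window, each co-occurrence would add `ramped_weight(distance, window_size)` to the matrix entry instead of adding 1.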

The **COALS model** integrates these methods and ideas.

## GloVe: A combination of Count-Based and Direct Prediction

There are two families of methods for obtaining word vectors: count-based methods (e.g. LSA, COALS) and prediction-based methods (e.g. Skip-gram/CBOW, NNLM). The former leverage global statistical information but are not good at word analogies, while the latter are the opposite: they do well on analogies but make less direct use of corpus-wide statistics. The GloVe algorithm aims to connect count-based methods on co-occurrence matrices with prediction-based methods.

### Model Hypotheses

GloVe starts from a co-occurrence matrix and a crucial insight: **"Ratios of co-occurrence probabilities can encode meaning components"**. Specifically, given a context word $\tilde{w}_k$ and centre words $w_i$ and $w_j$, the ratio $\frac{P(k|i)}{P(k|j)}$ can encode meaning components. (Why the probability of a context word $\tilde{w}_k$ given a centre word $w_i$ or $w_j$? Because context words depend on centre words.) For example, let the context word $\tilde{w}_k$ be "solid" and let $w_i$ and $w_j$ be "ice" and "steam" respectively. The ratio is then large, because "ice" is much more closely related to "solid" than "steam" is. Conversely, the ratio is small for a context word like "gas", and close to 1 for a context word that is related to both or to neither. So the ratio reflects which of the two centre words is more related to the context word. If we can fit this ratio with word vectors, we get word vectors that capture meaning components. So far we have used the statistics in the co-occurrence matrix to build our objective; next we fit (predict) this ratio to obtain the word vectors.
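
The ratio can be computed directly from co-occurrence counts. The counts below are invented purely to illustrate the pattern (they are not real corpus statistics), and `cooccurrence_probability` is a hypothetical helper implementing $P(k|i)=X_{i,k}/X_i$.

```python
def cooccurrence_probability(counts, centre, context):
    """P(context | centre) = X[centre, context] / X[centre]."""
    row_total = sum(c for (i, _), c in counts.items() if i == centre)
    return counts.get((centre, context), 0) / row_total

# Hypothetical co-occurrence counts, keyed by (centre word, context word).
counts = {
    ("ice", "solid"): 20, ("ice", "gas"): 1, ("ice", "water"): 78, ("ice", "fashion"): 1,
    ("steam", "solid"): 1, ("steam", "gas"): 20, ("steam", "water"): 78, ("steam", "fashion"): 1,
}

def ratio(context):
    return (cooccurrence_probability(counts, "ice", context)
            / cooccurrence_probability(counts, "steam", context))

for k in ["solid", "gas", "water", "fashion"]:
    print(k, ratio(k))
```

With these made-up numbers the ratio is much larger than 1 for "solid", much smaller than 1 for "gas", and about 1 for "water" and "fashion".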

### Objective Function

GloVe attempts to capture the ratio of probabilities $\frac{P(k|i)}{P(k|j)}$ with the word vectors $w_i$, $w_j$ and $\tilde{w}_k$, i.e.:
$$
F(w_i, w_j, \tilde{w}_k)=\frac{P(k|i)}{P(k|j)} \tag{1}
$$
However, to model the ratio of probabilities more explicitly, and to simplify the model as a bonus, the most natural choice is to fit the ratio with vector differences, given the inherently linear structure of vector spaces, i.e.:
$$
F(w_i - w_j, \tilde{w}_k)=\frac{P(k|i)}{P(k|j)} \tag{2}
$$
Moreover, note that the arguments are vectors while the output is a scalar. Although $F$ could in principle be any function, e.g. a neural network, that would obscure the linear structure we want to capture. So we take the dot product of the word vectors as the argument:
$$
F((w_i - w_j)^\top \cdot \tilde{w}_k)=\frac{P(k|i)}{P(k|j)} \tag{3}
$$
Next, note that the roles of context word and centre word are interchangeable: "ice" may occur in the context of "cold", and "cold" may equally occur in the context of "ice". The model should therefore be invariant when we exchange context words and centre words, but equation (3) is not. Concretely, even if $F((w_i - w_j)^\top \cdot \tilde{w}_k)$ fits the ratio on the right-hand side, swapping $\tilde{w}_k$ with $w_i$ or $w_j$ gives a value such as $F((\tilde{w}_k - w_j)^\top \cdot w_i)$ that need not fit the corresponding ratio. (Taking a step back, nothing so far guarantees that the expression fits the ratio well even before the swap.) Therefore we restrict $F$ to functions that map **additive operations** (an additive group, in group-theory terms) to **multiplicative operations** (a multiplicative group). The argument of $F$ expands to the difference $w_i^\top\cdot\tilde{w}_k-w_j^\top\cdot\tilde{w}_k$, and we want $F$ to turn this difference into a ratio. So we require that
$$
F(w_i^\top\cdot\tilde{w}_k-w_j^\top\cdot\tilde{w}_k) = \frac{F(w_i^\top\cdot\tilde{w}_k)}{F(w_j^\top\cdot\tilde{w}_k)} \tag{4}
$$
To find an $F$ that satisfies this equation, let $x = w_i^\top\cdot\tilde{w}_k$ and $y = w_j^\top\cdot\tilde{w}_k$, so that we are looking for a solution of $F(x-y)=\frac{F(x)}{F(y)}$. Note that $F(x)=e^{x}$ is a solution. Comparing equation (4) with equation (3), the numerators and denominators match if $e^{w_i^\top\cdot\tilde{w}_k}=P(k|i)$ and $e^{w_j^\top\cdot\tilde{w}_k}=P(k|j)$. Taking the log, we get
$$
w_i^\top\cdot\tilde{w}_k = \log\,P(k|i) = \log\,\frac{X_{i,k}}{X_i}=\log\,X_{i,k}-\log\,X_i
$$
where $X_{i,k}$ is the number of times word $k$ occurs in the context of word $i$ and $X_i=\sum_k X_{i,k}$. The annoying $\log\,X_i$ term breaks the exchange symmetry. Since $\log\,X_i$ depends on $i$ but not on $k$, we can absorb it into a bias $b_i$ and add a corresponding bias $\tilde{b}_k$ to restore the symmetry. Then we get:
$$
w_i^\top\cdot\tilde{w}_k + b_i + \tilde{b}_k = log\,X_{i,k} \tag{5}
$$
With this change the model is not only symmetric but also more flexible, because trainable bias terms take the place of the fixed term $\log\,X_i$. We have now drastically simplified equation (1) into equation (5). However, several problems remain:

One is that the log is undefined when $X_{i,k}$ is 0, which we can fix by shifting the count by 1 (i.e. $\log\,(X_{i,k}+1)$).

The other is that if we simply take the squared error $(w_i^\top\cdot\tilde{w}_k + b_i + \tilde{b}_k - \log\,(X_{i,k}+1))^2$ as the loss function, every word pair is weighted equally, whether it co-occurs frequently or rarely. Naturally, we want mis-predicting frequent word pairs to cost more, and mis-predicting rare word pairs, especially the zero entries that make up most of the co-occurrence matrix, to cost less. To weight the loss, we introduce a weighting term $f(X_{i,k})$ that depends only on the co-occurrence count of the word pair. Finally, the loss function is:
$$
J = \sum^{V}_{i,k=1} f(X_{i,k}) (w_i^\top\cdot\tilde{w}_k + b_i + \tilde{b}_k - \log\,(X_{i,k}+1))^2 \tag{6}
$$

where $f$ should satisfy the following requirements:

1. $f(0)=0$, so that word pairs that never co-occur don't contribute any loss.
2. $f(x)$ should be non-decreasing, so that more frequent co-occurrences are weighted more heavily.
3. $f(x)$ should stay relatively small for large values of $x$, so that co-occurrences with highly frequent words like "the", "and" or "a", which don't provide much information, are not overweighted.

The authors found a function that meets the requirements above and works well:
$$
f(x) =
\begin{cases}
\left(\frac{x}{x_{\max}}\right)^\alpha, & \text{if } x < x_{\max} \\
1, & \text{otherwise}
\end{cases} \tag{7}
$$
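
The weighting function and the per-pair term of the loss can be checked numerically without any framework. This is a minimal sketch: the function names are hypothetical, and the defaults `x_max=100` and `alpha=0.75` are borrowed from the PyTorch implementation later in these notes.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) from equation (7)."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_k, b_i, b_k, x_ik):
    """One term of equation (6): f(X_ik) * (w_i . w_k + b_i + b_k - log(X_ik + 1))^2."""
    prediction = sum(a * b for a, b in zip(w_i, w_k)) + b_i + b_k
    return glove_weight(x_ik) * (prediction - math.log(x_ik + 1)) ** 2

print([glove_weight(x) for x in (0, 1, 10, 100, 1000)])

# A pair that never co-occurs contributes nothing, whatever the prediction is.
print(glove_pair_loss([1.0, 2.0], [3.0, 4.0], 0.5, 0.5, x_ik=0))  # 0.0
```

The printed weights start at 0, grow with the count, and are capped at 1 once the count reaches `x_max`, matching the three requirements listed above.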

# How to evaluate word vectors?
## Extrinsic Methods
Evaluate word vectors by their performance on downstream tasks such as machine translation or sentiment analysis.
## Intrinsic Methods
Evaluate the word vectors themselves, for example with word analogies.
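
A word-analogy test can be sketched as follows: to answer "a is to b as c is to ?", take the word whose vector is closest (by cosine similarity) to $v_b - v_a + v_c$. The two-dimensional vectors below are hand-made toy values, not trained embeddings, and `analogy` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Answer 'a : b :: c : ?' with the nearest cosine neighbour of v_b - v_a + v_c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]  # exclude the query words
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors: first dimension ~ "royalty", second dimension ~ "gender" (hypothetical).
vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "prince": [1.0, 0.8],
    "apple": [-1.0, 0.0],
}

print(analogy("man", "woman", "king", vectors))  # queen
```

An intrinsic benchmark would run many such queries against trained vectors and report the fraction answered correctly.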

### Problems with static word vectors
Most words, like "pike", have more than one meaning, so it seems that a single word vector cannot encode all the different meanings of a word. Chris Manning had in fact worked on this before word2vec came out (see **"Improving Word Representations via Global Context and Multiple Word Prototypes"**).

However, the work of Sanjeev Arora et al., **"Linear Algebraic Structure of Word Senses, with Application to Polysemy"**, indicates that word vectors learnt with word2vec, GloVe, etc. actually do encode the different senses (meanings). They propose that the overall embedding of a word is a linear combination, or weighted sum, of the embeddings of its different senses. For example:
$$
v_{\text{pike}}=\alpha_1v_{\text{pike}_1} + \alpha_2v_{\text{pike}_2} + \alpha_3v_{\text{pike}_3}
$$

More surprisingly, with ideas from sparse coding, we can separate the sense components of a word vector.

## An implementation of GloVe with PyTorch

```python
import torch
import torch.nn as nn


class GloVe(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.target_word = nn.Embedding(vocab_size, embedding_dim)   # w_i
        self.context_word = nn.Embedding(vocab_size, embedding_dim)  # \tilde{w}_k

        self.target_bias = nn.Embedding(vocab_size, 1)   # b_i
        self.context_bias = nn.Embedding(vocab_size, 1)  # \tilde{b}_k

    def forward(self, t_idx, c_idx):
        word_t = self.target_word(t_idx)   # (batch_size, embedding_dim)
        word_c = self.context_word(c_idx)  # (batch_size, embedding_dim)

        # squeeze(-1) turns the bias shape (batch_size, 1) into (batch_size,).
        # A bare squeeze() would also drop the batch dimension when batch_size == 1.
        b_t = self.target_bias(t_idx).squeeze(-1)
        b_c = self.context_bias(c_idx).squeeze(-1)

        # Row-wise dot product w_i^T \tilde{w}_k plus both biases: left-hand side of equation (5).
        prediction = (word_t * word_c).sum(dim=-1) + b_t + b_c
        return prediction


class GloVeLoss(nn.Module):
    def forward(self, prediction, Xik, x_max=100, alpha=0.75):
        Xik = Xik.float()  # counts may arrive as integers; the weights and the log need floats
        # Weighting function f(X_ik) from equation (7).
        weights = torch.where(
            Xik < x_max,
            (Xik / x_max) ** alpha,
            torch.ones_like(Xik),
        )
        # Weighted squared error from equation (6), averaged over the batch instead of summed.
        loss = weights * (prediction - torch.log(Xik + 1)) ** 2
        return loss.mean()
```