<a href="https://colab.research.google.com/github/TranQuocViet26701/word2vec/blob/main/BTL_WordEmbedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Why do we need to use approximate training instead of the full softmax?

## Full Softmax
In general, according to (15.1.4)
the log conditional probability
involving any pair of the center word $w_c$ and
the context word $w_o$ is

$$\log P(w_o \mid w_c) =\mathbf{u}_o^\top \mathbf{v}_c - \log\left(\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)\right),$$

where $|\mathcal{V}|$ is the vocabulary size. The normalization term sums over every word in the vocabulary, so the cost of the full softmax per training pair is

$$O(|\mathcal{V}|)$$


For a word2vec model with a vocabulary size of $|\mathcal{V}| = 3{,}000{,}000$, each training pair requires:

- $\approx 3{,}000{,}000$ dot products per update
- $\approx 3{,}000{,}000$ output-vector gradient updates
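The cost can be seen in a minimal NumPy sketch of the log conditional probability above. The names `U`, `v_c`, and `full_softmax_log_prob`, as well as the toy sizes, are assumptions made for illustration and do not come from the notebook.

```python
import numpy as np

def full_softmax_log_prob(U, v_c, o):
    """log P(w_o | w_c) under the full softmax.

    U   : (V, d) matrix whose rows are the context vectors u_i
    v_c : (d,)   center-word vector
    o   : index of the context word w_o
    """
    scores = U @ v_c                 # one dot product per vocabulary word: O(|V|)
    m = scores.max()                 # shift scores for numerical stability
    log_norm = m + np.log(np.exp(scores - m).sum())
    return scores[o] - log_norm

rng = np.random.default_rng(0)
V, d = 10_000, 50                    # toy sizes (assumed), far below 3M
U = rng.normal(scale=0.1, size=(V, d))
v_c = rng.normal(scale=0.1, size=d)

log_p = full_softmax_log_prob(U, v_c, o=42)
```

Every call touches all `V` rows of `U`, and the gradient of the normalization term does the same, which is what approximate training avoids.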

## Negative Sampling

Replace the full softmax probability with a binary logistic classifier (negative sampling):

- For each real (center word, context word) pair:

  - Predict $D = 1$

- For each of $k$ noise words sampled from a unigram distribution:

  - Predict $D = 0$

**Cost per step:**

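$$O(k)$$

where typically $k$ ranges from $5$ to $15$.

So instead of about 3M updates, each step performs only $k + 1$ updates (one positive pair plus $k$ noise words). With $k = 5$, that is roughly $3{,}000{,}000 / 6 = 500{,}000$ times fewer output-vector updates than the full softmax.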
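The sampling step and the cost ratio can be sketched as follows. The word counts are hypothetical, and `sample_noise_words` is an illustrative helper rather than code from the notebook. The original word2vec implementation raises the counts to the power 0.75 before normalizing; that choice is exposed here as the optional `power` argument, with the plain unigram distribution as the default.

```python
import numpy as np

def sample_noise_words(counts, k, rng, power=1.0):
    """Draw k noise-word indices from a unigram distribution over word counts."""
    weights = np.asarray(counts, dtype=float) ** power
    probs = weights / weights.sum()
    return rng.choice(len(probs), size=k, p=probs)

rng = np.random.default_rng(0)
counts = [50, 30, 10, 5, 3, 2]       # hypothetical word counts
noise = sample_noise_words(counts, k=5, rng=rng)

# Output vectors touched per training pair: |V| for full softmax, k + 1 here
vocab_size, k = 3_000_000, 5
speedup = vocab_size / (k + 1)
```

With these numbers `speedup` is 500,000, since each step updates one positive and five negative output vectors.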

## Hierarchical Softmax

Use a Huffman tree to reduce the softmax cost from:

$$O(|\mathcal{V}|)$$

to:

$$O(\log_2 |\mathcal{V}|)$$

For a vocabulary size of 3M words:

$$\log_2(3{,}000{,}000) \approx 21.5$$

So each update visits about 22 tree nodes instead of millions of output vectors.
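To make the tree depth concrete, here is a small sketch that builds Huffman code lengths with the standard library `heapq` module. The word counts are hypothetical, and `huffman_code_lengths` is an illustrative helper, not the notebook's implementation. A word's code length equals the number of inner nodes on its path, which is the number of binary decisions hierarchical softmax evaluates for that word.

```python
import heapq
import math

def huffman_code_lengths(freqs):
    """Return the Huffman code length (leaf depth) of each symbol."""
    # Heap entries: (total frequency, unique tie-breaker, symbols under this node)
    heap = [(f, i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    depths = [0] * len(freqs)
    next_id = len(freqs)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            depths[s] += 1                # merged symbols move one level deeper
        heapq.heappush(heap, (f1 + f2, next_id, s1 + s2))
        next_id += 1
    return depths

freqs = [40, 30, 15, 10, 3, 1, 1]         # hypothetical word counts
depths = huffman_code_lengths(freqs)
avg_depth = sum(f * d for f, d in zip(freqs, depths)) / sum(freqs)

# Depth of a perfectly balanced binary tree over 3M words, for comparison
balanced_depth = math.log2(3_000_000)
```

Because frequent words sit closer to the root, the frequency-weighted average depth is smaller than the depth of a balanced tree over the same vocabulary.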

## Negative Sampling in Skip-Gram Model (SG)

Negative sampling transforms the multi-class classification problem into a binary classification problem.

Given:
- The center word $w_c$
- The context word $w_o$

The probability model will be:

$$P(D=1\mid w_c, w_o) = \sigma(\mathbf{u}_o^\top \mathbf{v}_c)$$

Based on (15.1.5), given:
- The text sequence of length $T$
- The word at time step $t$ is $w^{(t)}$
- The context window size is $m$

The joint probability will be:

$$ \prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(D=1\mid w^{(t)}, w^{(t+j)})$$

This formula only considers events that involve positive examples ($D = 1$). As a result, the joint probability is maximized to $1$ only if $\mathbf{u}_o^\top \mathbf{v}_c \to +\infty$ for every pair, in other words only if all the word vectors grow to infinity. Such a result is meaningless, so negative examples ($D = 0$) are added to make the objective more meaningful.
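The degenerate behaviour of a positive-only objective can be checked numerically. In this small sketch the listed scores are arbitrary illustrative values of $\mathbf{u}_o^\top \mathbf{v}_c$, not values from a trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Candidate values of the dot product u_o^T v_c (illustrative)
scores = np.array([0.0, 1.0, 5.0, 10.0, 20.0])

# Loss when only positive examples are used: -log sigma(score)
positive_only_loss = -np.log(sigmoid(scores))
```

The loss keeps shrinking toward zero as the score grows, so nothing stops the optimizer from inflating the vectors without bound.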

Given:
- $S$ is the event that a context word $w_o$ comes from the context window of a center word $w_c$
- For this event, $K$ noise words are sampled from a predefined distribution $P(w)$, and $N_k$ is the event that a noise word $w_k$ ($k = 1, \ldots, K$) does not come from the context window of $w_c$

Assume that these events involving both the positive example and the negative examples, $\{S, N_1, \ldots, N_K\}$, are mutually independent.

Negative sampling rewrites the conditional probability:

$$P(w^{(t+j)} \mid w^{(t)})$$
$$= P(S) \prod_{k=1}^{K} P(N_k)$$
$$= P(D=1\mid w^{(t)}, w^{(t+j)}) \prod_{k=1,\,w_k \sim P(w)}^K P(D=0\mid w^{(t)}, w_k) $$

Given:
- $i_t$ is the index of the word $w^{(t)}$ at time step $t$
- $h_k$ is the index of a noise word $w_k$

The logarithmic loss with respect to this conditional probability is:

$$-\log P(w^{(t+j)} \mid w^{(t)})$$
$$= -\log P(D=1\mid w^{(t)}, w^{(t+j)}) - \sum_{k=1,\ w_k \sim P(w)}^K \log P(D=0\mid w^{(t)}, w_k)$$

Because the classification is binary, $P(D=0\mid w^{(t)}, w_k) = 1 - P(D=1\mid w^{(t)}, w_k)$, so we have:

$$= -\log\, \sigma\left(\mathbf{u}_{i_{t+j}}^\top \mathbf{v}_{i_t}\right) - \sum_{k=1,\ w_k \sim P(w)}^K \log\left(1-\sigma\left(\mathbf{u}_{h_k}^\top \mathbf{v}_{i_t}\right)\right)$$

Using $\sigma(x) + \sigma(-x) = 1$, we obtain:

$$= -\log\, \sigma\left(\mathbf{u}_{i_{t+j}}^\top \mathbf{v}_{i_t}\right) - \sum_{k=1,\ w_k \sim P(w)}^K \log\sigma\left(-\mathbf{u}_{h_k}^\top \mathbf{v}_{i_t}\right)$$
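The final expression translates directly into a short NumPy function. This is a sketch for a single (center, context) pair; the function names and the random toy vectors are assumptions for illustration, not the notebook's training code.

```python
import numpy as np

def log_sigmoid(x):
    # log sigma(x) = -log(1 + exp(-x)), computed in a numerically stable way
    return -np.logaddexp(0.0, -x)

def sgns_loss(v_center, u_context, U_noise):
    """Skip-gram negative-sampling loss for one (center, context) pair.

    v_center  : (d,)   center-word vector v_{i_t}
    u_context : (d,)   context-word vector u_{i_{t+j}}
    U_noise   : (K, d) vectors u_{h_k} of the K sampled noise words
    """
    positive = log_sigmoid(u_context @ v_center)
    negative = log_sigmoid(-(U_noise @ v_center)).sum()
    return -(positive + negative)

rng = np.random.default_rng(0)
d, K = 8, 5                          # toy dimensions (assumed)
v_center = rng.normal(size=d)
u_context = rng.normal(size=d)
U_noise = rng.normal(size=(K, d))

loss = sgns_loss(v_center, u_context, U_noise)
```

Only $K + 1$ dot products are computed per pair, compared with one per vocabulary word under the full softmax.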




## Negative Sampling in Continuous Bag-of-Words Model (CBOW)

Given:
- The center word $w_c$ with vector $\mathbf{u}_c$
- The context words $w_{o_1}, \ldots, w_{o_{2m}}$ with averaged context vector $\bar{\mathbf{v}}_o = \left(\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}} \right)/(2m)$

The probability model will be:

$$P(D=1\mid w_c, w_{o_1}, \ldots, w_{o_{2m}}) = \sigma(\mathbf{u}_c^\top \bar{\mathbf{v}}_o)$$

Based on (15.1.12), given:
- The text sequence of length $T$
- The word at time step $t$ is $w^{(t)}$
- The context window size is $m$

The joint probability will be:

$$ \prod_{t=1}^{T}  P(D=1 \mid w^{(t)},  w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)})$$

Given:
- $S$ is the event that the center word $w_c$ is generated from its surrounding context words $w_{o_1}, \ldots, w_{o_{2m}}$
- $K$ noise words are sampled from a predefined distribution $P(w)$, and $N_k$ is the event that a noise word $w_k$ ($k = 1, \ldots, K$) is not the center word of this context window

Adding negative examples, negative sampling rewrites the conditional probability:

$$P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)})$$
$$= P(S) \prod_{k=1}^{K} P(N_k)$$
$$= P(D=1\mid w^{(t)},  w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}) \prod_{k=1,\,w_k \sim P(w)}^K P(D=0\mid w_k, w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}) $$

Given:
- $i_t$ is the index of the word $w^{(t)}$ at time step $t$
- $h_k$ is the index of a noise word $w_k$
- $\bar{\mathbf{v}}_t = \frac{1}{2m} \sum_{-m \leq j \leq m,\ j \neq 0} \mathbf{v}_{i_{t+j}}$ is the average of the context word vectors around time step $t$

The logarithmic loss:

$$-\log P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)})$$

$$= -\log P(D=1\mid w^{(t)},  w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}) - \sum_{k=1,\ w_k \sim P(w)}^K \log P(D=0\mid w_k, w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)})$$

Because the classification is binary, each $P(D=0 \mid \cdot) = 1 - P(D=1 \mid \cdot)$, so we have:

$$= -\log\, \sigma\left(\mathbf{u}_{i_t}^\top \bar{\mathbf{v}}_t\right) - \sum_{k=1,\ w_k \sim P(w)}^K \log\left(1-\sigma\left(\mathbf{u}_{h_k}^\top \bar{\mathbf{v}}_t\right)\right)$$

Using $\sigma(x) + \sigma(-x) = 1$, we obtain:

$$= -\log\, \sigma\left(\mathbf{u}_{i_t}^\top \bar{\mathbf{v}}_t\right) - \sum_{k=1,\ w_k \sim P(w)}^K \log\sigma\left(-\mathbf{u}_{h_k}^\top \bar{\mathbf{v}}_t\right)$$
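A NumPy sketch of the CBOW negative-sampling loss for one window follows. It assumes the convention used in this section, where context words use the $\mathbf{v}$ vectors (averaged over the window) and the center word and noise words use the $\mathbf{u}$ vectors. The function names and random toy vectors are illustrative assumptions.

```python
import numpy as np

def log_sigmoid(x):
    # log sigma(x) = -log(1 + exp(-x)), computed in a numerically stable way
    return -np.logaddexp(0.0, -x)

def cbow_ns_loss(V_context, u_center, U_noise):
    """CBOW negative-sampling loss for one context window.

    V_context : (2m, d) vectors v of the context words in the window
    u_center  : (d,)    vector u of the true center word
    U_noise   : (K, d)  vectors u of the K sampled noise words
    """
    v_bar = V_context.mean(axis=0)               # averaged context vector
    positive = log_sigmoid(u_center @ v_bar)
    negative = log_sigmoid(-(U_noise @ v_bar)).sum()
    return -(positive + negative)

rng = np.random.default_rng(0)
m, d, K = 2, 8, 5                                # toy sizes (assumed)
V_context = rng.normal(size=(2 * m, d))
u_center = rng.normal(size=d)
U_noise = rng.normal(size=(K, d))

loss = cbow_ns_loss(V_context, u_center, U_noise)
```

The only structural difference from the skip-gram version is that the single center vector is replaced by the window average.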


# Self-Supervised word2vec

1. Introduce how to embed words as vectors more effectively than one-hot encoding, using probabilities and self-supervised learning.
2. The Skip-Gram Model: explain the theory behind the model, how the probability $P$ is built from the softmax, maximum likelihood, and the loss function for the training phase (why maximum likelihood leads to that loss $L$).
3. The Continuous Bag of Words (CBOW) Model: the reverse of Skip-Gram; given the input context words, compute the probability of generating the center word. To present: the model's conditional probability formula, maximum likelihood, and the training-phase formula.

## Self-Supervised Word to Vector Method

To move beyond the limitations of one-hot encoding, word2vec takes a self-supervised approach with two architectures: the Skip-Gram model and the Continuous Bag-of-Words (CBOW) model [(Mikolov et al., 2013)](https://arxiv.org/pdf/1301.3781).

Unlike one-hot representations, which treat words as atomic units with no notion of similarity between them, both CBOW and skip-gram learn distributed representations by leveraging the local context in which words appear. The CBOW model predicts a target word from its surrounding words, aggregating contextual information to infer meaning. Conversely, the skip-gram model predicts the surrounding context words given a single target word, aiming to learn word vectors that are informative enough to generate their typical neighbors in text. Together, these two approaches capture semantic and syntactic regularities by exploiting patterns that naturally occur in large corpora.

This approach is grounded in conditional probability: the models learn word representations by estimating the likelihood of a target word given its context, or vice versa. Because the prediction targets come directly from the unlabeled text of the corpus, no manual labels are needed, which is why the method is called self-supervised.
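To illustrate how training examples come from raw text alone, the sketch below extracts skip-gram pairs and CBOW examples from a toy sentence. The helper names, the sentence, and the window size are assumptions chosen for illustration.

```python
def skipgram_pairs(tokens, m):
    """(center, context) pairs for every position and offset within window m."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

def cbow_examples(tokens, m):
    """(context words, center) examples, one per position in the sequence."""
    examples = []
    for t, center in enumerate(tokens):
        context = [tokens[t + j] for j in range(-m, m + 1)
                   if j != 0 and 0 <= t + j < len(tokens)]
        examples.append((context, center))
    return examples

tokens = "the man loves his son".split()   # toy sentence (assumed)
sg = skipgram_pairs(tokens, m=1)
cb = cbow_examples(tokens, m=1)
```

With a window of 1, the center word "loves" yields the skip-gram pairs ("loves", "man") and ("loves", "his"), and the CBOW example (["man", "his"], "loves"). The labels are simply neighboring words, so no annotation is involved.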

### Skip-Gram Model

### The Continuous Bag of Words Model