# Foundational Papers in Word Embeddings

---

## 1. A Neural Probabilistic Language Model — Bengio et al. (2003 / NIPS 2000)

### Motivation

Traditional n-gram models suffer from the **curse of dimensionality**: most word sequences encountered at test time never occur in training, so without smoothing they receive zero probability. Bengio et al. proposed a neural network that **learns distributed word representations** (embeddings), so that similar words get similar vectors and probability mass generalizes to unseen combinations.

### Core Idea

Each word \( w \) is represented by a dense vector embedding \( C(w) \in \mathbb{R}^m \).  
A neural network predicts the next word given a context of \( n-1 \) words:

$$
P(w_t | w_{t-1}, ..., w_{t-n+1}) = \text{Softmax}(f(C(w_{t-1}), ..., C(w_{t-n+1})))
$$

The embeddings and the network are learned jointly by maximizing log-likelihood over the corpus.

### Architecture

- Embedding matrix \( C \in \mathbb{R}^{|V| \times m} \)
- Context word embeddings concatenated into a single input vector
- One hidden layer with tanh activation (plus optional direct connections from the embeddings to the output)
- Output layer: softmax over vocabulary
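The architecture above can be sketched as a single forward pass in NumPy. This is a minimal illustration, not the authors' implementation: the parameter names `H`, `d`, `U`, `b`, the toy sizes, and the random initialization are assumptions, and the optional direct connections are left out.

```python
import numpy as np

def nplm_forward(context_ids, C, H, d, U, b):
    """Return P(w_t | context) for one context of n-1 word ids."""
    x = C[context_ids].reshape(-1)   # look up and concatenate the n-1 embeddings
    h = np.tanh(H @ x + d)           # single nonlinear hidden layer
    logits = U @ h + b               # one score per vocabulary word
    logits = logits - logits.max()   # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()       # softmax over the vocabulary

# Toy sizes (assumptions for illustration only).
rng = np.random.default_rng(0)
V, m, n, hidden = 10, 4, 3, 8
C = rng.normal(size=(V, m))                  # embedding matrix, |V| x m
H = rng.normal(size=(hidden, (n - 1) * m))   # hidden-layer weights
d = np.zeros(hidden)
U = rng.normal(size=(V, hidden))             # output-layer weights
b = np.zeros(V)

probs = nplm_forward(np.array([2, 7]), C, H, d, U, b)
print(probs.shape, probs.sum())
```

Training would backpropagate the negative log-likelihood into `C` as well as the network weights, which is what makes the embeddings and the language model jointly learned.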

### Significance

- Unified **embedding learning** and **language modeling**.
- Allowed **generalization to unseen n-grams**.
- Introduced **continuous-space representations** of words.
- Foundation for later neural embedding models such as Word2Vec.

### Limitations

- High computational cost for large vocabularies.
- Fixed-length context window.
- Static embeddings (one vector per word).

---

## 2. Word2Vec — Mikolov et al. (2013)

### Motivation

To make training faster and more scalable, Word2Vec simplifies Bengio’s model by dropping the nonlinear hidden layer, while retaining semantic quality.

### Models

- **Skip-gram**: predict surrounding context words from a target word.
- **CBOW (Continuous Bag-of-Words)**: predict a target word from surrounding context.

### Objective (Skip-gram with Negative Sampling)

For observed pairs \( (w, c) \):

$$
\log \sigma(v_c' \cdot v_w) + \sum_{i=1}^k \mathbb{E}_{c_i \sim P_n} [ \log \sigma(-v_{c_i}' \cdot v_w) ]
$$

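where \( v_w \) is the input (target) vector of word \( w \), \( v_c' \) is the output (context) vector of \( c \), \( k \) is the number of negative samples, and \( P_n \) is the noise distribution from which they are drawn.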
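As a rough sketch, the objective for one observed pair can be evaluated as follows, with the expectation replaced by a sum over \( k \) sampled negatives. The function names are illustrative. The 3/4 power on unigram counts is the noise distribution reported by Mikolov et al., which the text above does not specify.

```python
import numpy as np

def log_sigmoid(x):
    # log sigma(x) = -log(1 + exp(-x)), computed without overflow
    return -np.logaddexp(0.0, -x)

def sgns_objective(v_w, v_c, v_neg):
    """v_w: (d,) input vector; v_c: (d,) output vector of the true context;
    v_neg: (k, d) output vectors of k sampled negatives."""
    positive = log_sigmoid(v_c @ v_w)
    negative = log_sigmoid(-(v_neg @ v_w)).sum()
    return positive + negative   # to be maximized

def noise_distribution(counts, power=0.75):
    # Unigram counts raised to a power, then normalized (power is an assumption
    # taken from the Word2Vec paper, not from the text above).
    p = np.asarray(counts, dtype=float) ** power
    return p / p.sum()

rng = np.random.default_rng(0)
dim, k = 5, 3
W_in = rng.normal(scale=0.1, size=(4, dim))    # toy input vectors
W_out = rng.normal(scale=0.1, size=(4, dim))   # toy output vectors
P_n = noise_distribution([50, 10, 5, 1])
neg_ids = rng.choice(len(P_n), size=k, p=P_n)  # draw k negatives from P_n
print(sgns_objective(W_in[0], W_out[1], W_out[neg_ids]))
```

Because each update touches only \( k + 1 \) output vectors rather than the whole vocabulary, the cost per training pair does not grow with vocabulary size.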

### Innovations

- **Negative Sampling** and **Subsampling** of frequent words for efficiency.
- Training on billions of tokens is feasible on a single machine.
- Learned **linear semantic relationships** such as:

$$
\text{vec("king")} - \text{vec("man")} + \text{vec("woman")} \approx \text{vec("queen")}
$$
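The analogy test can be illustrated with a nearest-neighbor search by cosine similarity. The vectors below are hand-made toy values chosen so the arithmetic works out exactly. They are hypothetical and not trained embeddings, and the axis interpretation is only a reading aid.

```python
import numpy as np

# Hand-made toy vectors (hypothetical): axes read as [royalty, gender, fruit].
vectors = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c),
    excluding the three query words themselves."""
    query = vectors[a] - vectors[b] + vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

print(analogy("king", "man", "woman", vectors))  # prints "queen"
```

Excluding the query words from the candidates is the usual evaluation convention. Without it, the nearest neighbor of the query is often one of the input words.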

### Impact

- Fast, scalable, widely adopted.
- Embeddings capture syntactic and semantic analogies.

---

## 3. GloVe — Pennington, Socher & Manning (2014)

### Motivation

Combine **global co-occurrence statistics** (as in LSA) with **local predictive modeling** (as in Word2Vec).

### Model

Let \( X_{ij} \) = number of times word \( j \) appears in the context of word \( i \).

They propose:

$$
w_i^\top \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}
$$

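The parameters are fit by minimizing a weighted least-squares loss over the nonzero entries of \( X \), where the weighting function \( f(X_{ij}) \) down-weights rare pairs and caps the influence of very frequent ones.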
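A minimal sketch of that loss, \( J = \sum_{i,j} f(X_{ij}) (w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2 \), is shown below. The weighting function \( f(x) = (x / x_{\max})^{\alpha} \) for \( x < x_{\max} \) and 1 otherwise, with \( x_{\max} = 100 \) and \( \alpha = 3/4 \), is the form reported by Pennington et al. It is not stated in the text above, so treat the defaults as assumptions.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Grows with the count, then saturates at 1 for frequent pairs.
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """X: (V, V) co-occurrence counts; W, W_tilde: (V, d) word and context
    vectors; b, b_tilde: (V,) biases."""
    i, j = np.nonzero(X)              # only observed pairs contribute
    x = X[i, j].astype(float)
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    return np.sum(glove_weight(x) * (pred - np.log(x)) ** 2)

rng = np.random.default_rng(0)
X = np.array([[0, 3, 1],
              [3, 0, 120],
              [1, 120, 0]])
V, dim = X.shape[0], 4
W = rng.normal(scale=0.1, size=(V, dim))
W_tilde = rng.normal(scale=0.1, size=(V, dim))
b = np.zeros(V)
b_tilde = np.zeros(V)
print(glove_loss(X, W, W_tilde, b, b_tilde))
```

Skipping zero entries avoids \( \log 0 \) and keeps the cost proportional to the number of observed pairs rather than \( |V|^2 \).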

### Key Features

- Captures **global corpus structure**.
- Treats word and context vectors symmetrically; the co-occurrence counts are computed once, deterministically, from the corpus.
- Combines strengths of count-based and predictive models.

### Pros and Cons

| Pros | Cons |
|------|------|
| Strong global semantics | Still static embeddings |
| Simple and interpretable | Co-occurrence matrix can be large |

---

## 4. fastText — Bojanowski et al. (2017)

### Problem

Static word embeddings struggle with **rare words** and **morphologically rich languages**.

### Solution

Represent each word as the sum of its **character n-gram embeddings**:

$$
v(w) = \sum_{g \in G(w)} z_g
$$

where \( G(w) \) is the set of n-grams for word \( w \), and \( z_g \) is the embedding of subword \( g \).
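The composition can be sketched as follows. Boundary markers `<` and `>`, n-gram lengths 3 to 6, and hashing n-grams into a fixed number of buckets follow the description in Bojanowski et al. The use of `zlib.crc32` as the hash, the bucket count, and the function names are stand-in assumptions, and this sketch does not keep a separate vector for in-vocabulary words.

```python
import zlib
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' plus the whole bracketed word."""
    token = f"<{word}>"
    grams = {token}
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams

def word_vector(word, Z, n_min=3, n_max=6):
    """Sum the bucket embeddings of all n-grams of `word`. Z: (buckets, d)."""
    num_buckets = Z.shape[0]
    rows = [zlib.crc32(g.encode("utf-8")) % num_buckets
            for g in char_ngrams(word, n_min, n_max)]
    return Z[rows].sum(axis=0)

rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(1000, 8))   # toy n-gram embedding table
print(sorted(char_ngrams("where", 3, 3)))
print(word_vector("unseenword", Z).shape)
```

With `n_min = n_max = 3`, the word "where" yields `<wh`, `whe`, `her`, `ere`, `re>` plus the whole token `<where>`. Since any string decomposes into n-grams, a word never seen in training still receives a vector.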

### Advantages

- Handles OOV (out-of-vocabulary) words gracefully.
- Encodes morphological patterns.
- Integrated within Skip-gram or CBOW objectives.

### Trade-offs

- More parameters due to n-grams.
- Words that share character n-grams but not meaning can end up with misleadingly similar vectors.

---

## 5. ELMo — Peters et al. (2018)

### Paradigm Shift: Contextual Embeddings

Unlike static embeddings, ELMo produces **context-dependent vectors**.

### Model

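Pretrain a **bidirectional language model (biLM)** using LSTMs. A forward and a backward language model are trained jointly by maximizing the sum of their log-likelihoods:

$$
\sum_{t=1}^T \left[ \log P(w_t | w_{1:t-1}) + \log P(w_t | w_{t+1:T}) \right]
$$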

The ELMo embedding for a word is a task-specific weighted sum of the biLM layers, where each layer concatenates the forward and backward hidden states.
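The layer mixing can be sketched as below. Softmax-normalized layer weights and a global scale \( \gamma \) follow the formulation in Peters et al. The array layout and the function name are assumptions for illustration, and the random states stand in for real biLM outputs.

```python
import numpy as np

def elmo_mix(layer_states, scalar_logits, gamma=1.0):
    """layer_states: (L + 1, T, d) per-layer token representations, each
    already concatenating forward and backward states.
    scalar_logits: (L + 1,) unnormalized, task-specific layer weights."""
    s = np.exp(scalar_logits - np.max(scalar_logits))
    s = s / s.sum()                                   # softmax over layers
    return gamma * np.tensordot(s, layer_states, axes=1)   # (T, d)

rng = np.random.default_rng(0)
num_layers, T, dim = 3, 5, 8                 # toy sizes (assumptions)
layer_states = rng.normal(size=(num_layers, T, dim))
scalar_logits = np.zeros(num_layers)         # equal weights before any tuning
embeddings = elmo_mix(layer_states, scalar_logits)
print(embeddings.shape)
```

The downstream task learns `scalar_logits` and `gamma`, so different tasks can emphasize different layers of the same frozen biLM.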

### Features

- Learns **different embeddings for the same word** depending on context.
- Dramatically improved NLP task performance.

### Limitations

- Based on LSTMs → limited parallelism.
- Each direction sees only one side of the context; the two are combined only afterward, so the bidirectionality is shallow.

---

## 6. BERT — Devlin et al. (2018)

### Innovation

Applied the **Transformer** encoder (Vaswani et al., 2017) to learn **bidirectional contextual embeddings**.

### Pretraining Tasks

1. **Masked Language Modeling (MLM):** Predict masked tokens using context.
2. **Next Sentence Prediction (NSP):** Predict if sentence B follows sentence A.
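The MLM input corruption can be sketched as below. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token are the values reported by Devlin et al., not stated in the list above. The function name and toy vocabulary are hypothetical, and special tokens such as `[CLS]` and `[SEP]` are not handled.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels). labels[i] is the original token at positions
    the model must predict, and None elsewhere."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                        # position not selected
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"            # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog"]   # toy vocabulary
tokens = ["the", "cat", "sat", "on", "the", "mat"]
inputs, labels = mask_tokens(tokens, vocab)
print(inputs)
print(labels)
```

Leaving some selected tokens unchanged or randomized reduces the mismatch between pretraining, where `[MASK]` appears, and fine-tuning, where it does not.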

### Transformer-based Contextualization

Tokens are embedded, then refined through self-attention layers:

$$
\text{Attention}(Q, K, V) = \text{Softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) V
$$
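A single-head version of this formula in NumPy, as a minimal sketch: the real model adds learned projections for \( Q \), \( K \), \( V \), multiple heads, and padding masks, all omitted here, and the toy shapes are assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v).
    Returns the attended values (T_q, d_v) and the weights (T_q, T_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_k, d_v = 4, 8, 8                      # toy sizes
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))
```

Every row of `weights` spans all positions, left and right, which is what lets each token's representation depend on the full sentence at once.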

### Significance

- Captures **deep bidirectional context**.
- Massive performance jump across NLP benchmarks.
- Foundation for later encoder models (RoBERTa, ALBERT, DistilBERT, etc.).

---

## 7. Later Developments

- **SynGCN / SemGCN:** Integrate syntactic/semantic graphs.
- **Multilingual embeddings:** Align cross-lingual spaces.
- **Distillation:** Compress contextual models for efficiency.
- **Dynamic embeddings:** Adapt continuously as language evolves.

---

## Summary Table

| Model | Year | Contextual | Architecture | Main Contribution |
|--------|------|-------------|---------------|--------------------|
| Bengio et al. | 2003 | No | MLP | Neural LM with embeddings |
| Word2Vec | 2013 | No | Shallow NN | Scalable predictive embeddings |
| GloVe | 2014 | No | Matrix factorization | Global co-occurrence modeling |
| fastText | 2017 | No | Subword NN | Morphology-aware embeddings |
| ELMo | 2018 | Yes | BiLSTM | Contextual word vectors |
| BERT | 2018 | Yes | Transformer | Deep bidirectional contextualization |


# Chronological Evolution of Word Embedding Paradigms

---

## 1. Bengio et al. (2003) — Neural Probabilistic Language Model

**Vector Representation (Conceptual Embedding):**

Year: 2003  
Core Architecture: Feedforward Neural Language Model  
Representation Type: Distributed Static Embeddings  
Objective Function: Next-word prediction via softmax  
Context Modeling: Fixed window (n-gram)  
Innovation Vector: Introduced continuous word representations; joint training of embeddings and language model  
Limitation Vector: Expensive softmax; short context window; static embeddings  
Legacy: Foundation for Word2Vec; marked the renaissance of neural language models  
Mathematical Form:  
$$
P(w_t | w_{t-1}, \dots, w_{t-n+1}) = \text{Softmax}(W h + b)
$$  
Citation Impact: ≈ 20k+  
Conceptual Distance to BERT: 0.9  

---

## 2. Mikolov et al. (2013) — Word2Vec: Skip-Gram & CBOW

**Vector Representation (Conceptual Embedding):**

Year: 2013  
Core Architecture: Shallow Neural Network  
Representation Type: Static Distributed Embeddings  
Objective Function: Predict context (Skip-gram) or predict target (CBOW)  
Optimization Tricks: Negative Sampling, Hierarchical Softmax, Subsampling of frequent words  
Innovation Vector: Scalable training; vector arithmetic semantics  
Limitation Vector: Context-independent; ignores global corpus statistics  
Legacy: Dominant paradigm from 2013–2016; precursor to GloVe and fastText  
Key Equation:  
$$
\log \sigma(v_c' \cdot v_w) + \sum_{i=1}^k \mathbb{E}_{c_i \sim P_n} [ \log \sigma(-v_{c_i}' \cdot v_w) ]
$$  
Geometric Property: Linear analogies emerge as vector offsets  
Conceptual Distance to BERT: 0.8  
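Of the optimization tricks listed in this card, subsampling of frequent words is the simplest to sketch. The discard probability \( 1 - \sqrt{t / f(w)} \), with a threshold \( t \) around \( 10^{-5} \), is the formula given in Mikolov et al. The function names and toy frequencies below are assumptions.

```python
import numpy as np

def keep_probability(freq, t=1e-5):
    """freq: relative corpus frequency f(w). Returns the probability of
    keeping one occurrence, i.e. 1 minus the discard probability."""
    freq = np.asarray(freq, dtype=float)
    return np.minimum(1.0, np.sqrt(t / freq))

def subsample(token_ids, freq, t=1e-5, seed=0):
    """Randomly drop occurrences of frequent words from a token-id sequence."""
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    keep = rng.random(len(token_ids)) < keep_probability(freq, t)[token_ids]
    return token_ids[keep]

# Toy corpus: id 0 is a very frequent word, id 1 a rare one.
freq = np.array([0.05, 1e-6])
corpus = np.array([0] * 1000 + [1] * 10)
kept = subsample(corpus, freq)
print((kept == 0).sum(), (kept == 1).sum())
```

Words rarer than the threshold are always kept, while very frequent words are dropped most of the time, which both shortens training and widens the effective context for the remaining words.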

---

## 3. Pennington et al. (2014) — GloVe: Global Vectors for Word Representation

**Vector Representation (Conceptual Embedding):**

Year: 2014  
Core Architecture: Matrix Factorization + Log-Bilinear Model  
Representation Type: Static Global Embeddings  
Objective Function: Weighted least-squares on log co-occurrence counts  
Data Basis: Global word–word co-occurrence matrix  
Innovation Vector: Bridged LSA and Word2Vec; leveraged global corpus statistics  
Limitation Vector: Memory heavy for large vocabularies; context-independent  
Legacy: Stable pretrained embeddings for NLP tasks; mathematically interpretable  
Key Equation:  
$$
w_i^\top \tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij}
$$  
Conceptual Distance to BERT: 0.75  

---

## 4. Bojanowski et al. (2017) — fastText: Subword Information

**Vector Representation (Conceptual Embedding):**

Year: 2017  
Core Architecture: Subword-augmented Skip-gram  
Representation Type: Compositional Static Embeddings  
Objective Function: Negative sampling with character n-gram vectors  
Data Basis: Word and subword n-grams  
Innovation Vector: Handles rare and out-of-vocabulary words; models morphology  
Limitation Vector: Ignores context; increased training cost  
Legacy: Multilingual embedding robustness; de facto model for morphologically rich languages  
Composition Function:  
$$
v(w) = \sum_{g \in G(w)} z_g
$$  
Conceptual Distance to BERT: 0.65  

---

## 5. Peters et al. (2018) — ELMo: Deep Contextualized Word Representations

**Vector Representation (Conceptual Embedding):**

Year: 2018  
Core Architecture: Bidirectional LSTM Language Model  
Representation Type: Dynamic Contextual Embeddings  
Objective Function: Language modeling (forward + backward)  
Innovation Vector: Context-dependent vectors; layer-weighted embedding fusion  
Limitation Vector: Sequential RNN architecture; limited long-range context  
Legacy: Introduced contextual embeddings; bridge to the transformer era  
Embedding Function:  
$$
\text{ELMo}(w_t) = \gamma \sum_j s_j \, h_j(w_t)
$$  
Transferability: High — used across diverse NLP tasks  
Conceptual Distance to BERT: 0.35  

---

## 6. Devlin et al. (2018) — BERT: Bidirectional Encoder Representations from Transformers

**Vector Representation (Conceptual Embedding):**

Year: 2018  
Core Architecture: Transformer Encoder (Multi-Head Self-Attention)  
Representation Type: Deep Contextual Embeddings  
Objective Function: Masked Language Modeling and Next Sentence Prediction  
Innovation Vector: Bidirectional contextualization; parallel attention; massive pretraining  
Limitation Vector: Resource-intensive; context window limited to 512 tokens  
Legacy: Popularized the pretrain-then-fine-tune paradigm; basis for later encoder models such as RoBERTa  
Key Equation:  
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$  
Tokenization: Subword (WordPiece)  
Conceptual Distance to Bengio (2003): 0.9  

---

## 7. Evolution Vector Trajectory (Simplified)

| Era | Representative Paper | Representation Type | Context Handling | Main Innovation | Vector Shift Direction |
|------|----------------------|----------------------|------------------|------------------|------------------------|
| 2003 | Bengio et al. | Static | Local (n-gram) | Continuous word representations | → Distributed space |
| 2013 | Mikolov et al. | Static | Local (window) | Skip-gram & CBOW | → Efficient scaling |
| 2014 | Pennington et al. (GloVe) | Static | Global | Co-occurrence matrix factorization | → Global semantics |
| 2017 | Bojanowski et al. (fastText) | Static | Local + Subword | Morphological composition | → Morphological axis |
| 2018 | Peters et al. (ELMo) | Contextual | Sentence | BiLSTM contextual dynamics | → Contextualization axis |
| 2018 | Devlin et al. (BERT) | Contextual | Bidirectional | Transformer attention | → Deep contextual manifold |

---

## 8. Conceptual Embedding Space (Qualitative Map)

**Principal Components of Embedding Evolution:**

| Principal Component (Axis) | Interpretation | Dominant Models |
|-----------------------------|----------------|-----------------|
| PC₁ — Local → Global Semantics | From word-window context to full corpus co-occurrence | Word2Vec → GloVe |
| PC₂ — Static → Contextual | From fixed vectors to context-dependent embeddings | fastText → ELMo → BERT |
| PC₃ — Shallow → Deep | From single-layer networks to deep transformers | Word2Vec → ELMo → BERT |
| PC₄ — Whole-word → Subword | From whole-word tokens to character/subword units | fastText → BERT |
| PC₅ — Separate Directions → Masked/Bidirectional | From separate left-to-right and right-to-left models to jointly bidirectional modeling | ELMo → BERT |

Each model represents a displacement in this conceptual manifold:

$$
\overrightarrow{\text{BERT}} - \overrightarrow{\text{Word2Vec}} = \Delta_{\text{contextuality}} + \Delta_{\text{depth}} + \Delta_{\text{attention}}
$$

This equation symbolizes the paradigm shift from static, local, and shallow models to deep, bidirectional, and context-sensitive embeddings that define the foundation of modern language understanding.
