# GloVe and SWIVEL: Complete Word Embeddings Guide

*Theory + Calculations + Q&A Format + Original Structure*

## Part 1: Understanding Co-occurrence Matrices

### Q1: What is a co-occurrence matrix?

**Answer:**

A co-occurrence matrix is a table that tracks which words appear near each other in text. Both rows and columns represent words from your vocabulary, and each cell contains a count of how many times those two words appeared together within a context window (typically 5 words apart).

**Example Setup:**
Vocabulary: {dog, bark, cat, meow, tree}

| | dog | bark | cat | meow | tree |
|---|---|---|---|---|---|
| dog | 0 | 500 | 45 | 3 | 12 |
| bark | 500 | 0 | 8 | 2 | 1 |
| cat | 45 | 8 | 0 | 480 | 20 |
| meow | 3 | 2 | 480 | 0 | 5 |
| tree | 12 | 1 | 20 | 5 | 0 |

**What this means:** The value 500 at (dog, bark) means "In the corpus, 'dog' and 'bark' appear within a context window 500 times."

**Important:** This matrix is symmetric—(dog, bark) = (bark, dog) = 500.
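The counting itself can be sketched in a few lines. This is a minimal illustration, assuming a pre-tokenized corpus and a hypothetical helper `cooccurrence_counts`; the toy sentence and the window of 2 are made up for the example and are not the corpus behind the table above.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count unordered word pairs that appear within `window` positions of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only forward so each pair of positions is counted once.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            pair = tuple(sorted((word, tokens[j])))  # (a, b) and (b, a) share one key
            counts[pair] += 1
    return counts

tokens = "the dog will bark and the cat will meow".split()
counts = cooccurrence_counts(tokens, window=2)

print(counts[("bark", "dog")])   # 1
print(counts[("dog", "meow")])   # 0 (too far apart for this window)
```

Storing each pair under a sorted key is what makes the resulting table symmetric by construction.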

### Q2: Why do we need a co-occurrence matrix?

**Answer:**

Because of the distributional principle summed up by the linguist J. R. Firth: "You shall know a word by the company it keeps." Words that frequently appear together are likely to be semantically related. The co-occurrence matrix captures these statistical relationships across the entire text corpus.

**What this means:** If we can identify which words appear together often, we can infer semantic relationships. Words appearing with "dog" (like "bark", "pet", "animal") help us understand that "dog" is a noun related to animals.

## Part 2: The Problem with Large Matrices

### Q3: Why is a 400,000 × 400,000 co-occurrence matrix a problem?

**Answer:**

A vocabulary of 400,000 words creates a matrix with **160 billion cells**. This is computationally expensive to store and process. Most cells are zeros or very small numbers (sparse data), making this representation wasteful.

**What this means:**
- Storage: 160 billion numbers = huge memory requirement
- Speed: Comparing two words means comparing two 400,000-dimensional rows of this massive matrix
- Sparsity: Most word pairs don't co-occur, so most cells are zero (wasted space)
- Example: If only 1 million word pairs actually co-occur, we have 159.999 billion zeros

We need a better approach to represent this information.
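To put rough numbers on the storage problem, here is a back-of-the-envelope sketch. The 4 bytes per cell and the (row, column, count) sparse layout are assumptions made for illustration, not something the guide prescribes.

```python
vocab_size = 400_000
bytes_per_cell = 4  # assumption: 32-bit counts

cells = vocab_size ** 2
dense_bytes = cells * bytes_per_cell

# Sparse alternative (assumed layout): one (row, col, count) triple per non-zero pair.
nonzero_pairs = 1_000_000  # figure from the example above
sparse_bytes = nonzero_pairs * 3 * bytes_per_cell

print(f"{cells:,} cells")                      # 160,000,000,000 cells
print(f"dense:  {dense_bytes / 1e9:.0f} GB")   # dense:  640 GB
print(f"sparse: {sparse_bytes / 1e6:.0f} MB")  # sparse: 12 MB
```

Sparse storage fixes the memory waste, but each word is still described by a 400,000-dimensional row, which is the part factorization addresses.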

## Part 3: Matrix Factorization Solution

### Q4: How do we solve the size problem?

**Answer:**

We use **matrix factorization**—specifically Singular Value Decomposition (SVD). The algorithm breaks down the massive 400,000 × 400,000 matrix into smaller, manageable matrices that can be multiplied back together to approximate the original.

**What this means:** Instead of storing the full matrix, we store smaller matrices that, when multiplied, recreate approximately the same information.

### Q5: How exactly does matrix factorization work?

**Answer:**

The original co-occurrence matrix C is decomposed as:

$$C = U \Sigma V^T$$

Where:
- U is a 400,000 × r matrix (left singular vectors)
- Σ is an r × r diagonal matrix of singular values
- V^T is an r × 400,000 matrix (right singular vectors transposed)

**What this means:**
- Each singular value measures how much of the matrix's structure that component captures
- For real co-occurrence data, a small number of singular values typically carries most of the information
- We keep only the **k largest singular values** (e.g., k = 300) and set the rest to zero, which turns the exact equality into an approximation of C
- This reduces storage and computation dramatically while preserving most information
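A small sketch of truncation on the 5 × 5 example matrix from Part 1, assuming NumPy is available. The choice k = 2 is arbitrary and only meant to show the mechanics.

```python
import numpy as np

# Toy co-occurrence matrix from Part 1 (dog, bark, cat, meow, tree)
C = np.array([
    [0, 500, 45, 3, 12],
    [500, 0, 8, 2, 1],
    [45, 8, 0, 480, 20],
    [3, 2, 480, 0, 5],
    [12, 1, 20, 5, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(C)   # s holds the singular values, largest first

k = 2                         # keep only the k largest singular values
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relative reconstruction error in the Frobenius norm
rel_error = np.linalg.norm(C - C_k) / np.linalg.norm(C)
print(s.round(1))
print(round(rel_error, 3))
```

The rows of `U[:, :k]` (optionally scaled by the singular values) are the k-dimensional word vectors.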

### Q6: What do we end up with after factorization?

**Answer:**

After truncation to keep only 300 dimensions:
- U becomes 400,000 × 300
- Σ becomes 300 × 300
- V^T becomes 300 × 400,000

**Compression achieved:**
- Original: 400,000 × 400,000 = 160 billion cells
- Factorized: (400,000 × 300) + (300 × 300) + (300 × 400,000) ≈ 240 million cells
- **Compression ratio: ~670x smaller** (or ~1,300x if only the 400,000 × 300 embedding matrix is kept)

**What this means:** We've compressed the matrix dramatically while preserving semantic relationships.
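The cell counts can be checked directly. The `embeddings_only` case assumes that only one 400,000 × 300 factor is stored as the final word vectors.

```python
vocab_size, k = 400_000, 300

original = vocab_size * vocab_size
factors = vocab_size * k + k * k + k * vocab_size   # U, Σ and V^T together
embeddings_only = vocab_size * k                    # a single 400,000 × 300 matrix

print(f"{factors:,}")                      # 240,090,000
print(round(original / factors))           # 666
print(round(original / embeddings_only))   # 1333
```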

### Q7: Why keep 400,000 in the dimension?

**Answer:**

The 400,000 represents your **vocabulary size**—the number of unique words. Each word needs its own representation. The 300 represents the **embedding dimensions**—the compressed features that capture each word's semantic meaning.

**What this means:**
- 400,000: Each of the 400,000 words in vocabulary gets one row (their embedding)
- 300: Each embedding is a vector of 300 numbers, not 400,000
- Result: Each of the 400,000 words gets a 300-dimensional word embedding vector instead of a 400,000-dimensional co-occurrence vector

## Part 4: Pointwise Mutual Information (PMI) - The Training Target

### Q8: What is PMI and why do we need it?

**Answer:**

PMI (Pointwise Mutual Information) measures whether two words co-occur more or less often than random chance would predict. This is more meaningful than raw co-occurrence counts because it accounts for individual word frequencies.

$$\text{PMI}(w_1, w_2) = \log\left(\frac{P(w_1, w_2)}{P(w_1) \times P(w_2)}\right)$$

**What this means:**
- **Numerator P(w₁, w₂):** Joint probability—both words appearing together
- **Denominator P(w₁) × P(w₂):** Product of marginal probabilities—what would happen if words were independent
- **Ratio:** How much more (or less) often they actually appear compared to random chance
- **Log:** Compresses the ratio to a log scale (this guide uses the natural log, so a ratio of 1 maps to PMI = 0)
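As a minimal sketch, the formula translates to code as follows. The function name `pmi` and the counts in the usage line are hypothetical and chosen so the pair occurs exactly twice as often as independence predicts.

```python
import math

def pmi(pair_count, count_w1, count_w2, total):
    """PMI from raw counts, using the natural log."""
    p_joint = pair_count / total
    p_w1 = count_w1 / total
    p_w2 = count_w2 / total
    return math.log(p_joint / (p_w1 * p_w2))

# Hypothetical counts: observed 0.02 vs. 0.1 * 0.1 = 0.01 expected under independence
print(round(pmi(pair_count=20, count_w1=100, count_w2=100, total=1000), 3))  # 0.693
```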

### Q9: Why not just use P(dog) × P(bark)?

**Answer:**

P(dog) × P(bark) represents what we'd expect **if the words were independent**. It tells us what **random chance predicts**, not what's meaningful.

**Two scenarios showing why this matters:**

**Scenario 1: "dog" and "bark" (semantically related)**
- Actual co-occurrence: 500 times
- Expected by chance (total pairs × P(dog) × P(bark)): ~250 times
- Ratio: 500/250 = 2.0 (co-occur 2x more than chance predicts)
- Conclusion: Strongly related ✓

**Scenario 2: "dog" and "the" (NOT semantically related)**
- Actual co-occurrence: 5,000 times (much higher count!)
- Expected by chance: ~2,600 times
- Ratio: 5000/2600 ≈ 1.92 (most of the large count is explained by how common "the" is)
- Conclusion: Despite a 10x higher raw count, the association is no stronger than dog-bark ✓

**The key insight:** Raw count (5,000 > 500) is misleading! The ratio (dog-bark = 2.0 > dog-the ≈ 1.92), and therefore PMI (ln 2.0 ≈ 0.69 > ln 1.92 ≈ 0.65), ranks dog-bark at least as high as dog-the even though its raw count is 10x smaller.

**Why divide instead of subtract?**
- Subtraction doesn't scale well across different probability ranges
- Example: Both (0.3/0.1 = 3.0) and (0.003/0.001 = 3.0) represent the same relationship (3x more frequent than chance)
- But subtraction gives (0.3-0.1 = 0.2) vs (0.003-0.001 = 0.002), falsely suggesting they're different
- Division preserves relative relationships regardless of absolute probabilities
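The same comparison in code, using the two probability pairs from the bullets above (the variable names are illustrative only):

```python
# (observed, expected) probabilities from the example above
cases = [(0.3, 0.1), (0.003, 0.001)]

ratios = [observed / expected for observed, expected in cases]
differences = [observed - expected for observed, expected in cases]

for ratio, difference in zip(ratios, differences):
    print(f"ratio={ratio:.1f}  difference={difference:.3f}")
# ratio=3.0  difference=0.200
# ratio=3.0  difference=0.002
```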

### Q10: How do we calculate PMI? Complete step-by-step for dog-meow

**Answer:**

**Step 1: Count total unique co-occurrences**

Sum only the unique pairs from upper triangle (matrix is symmetric, count each pair once):

$$\text{Total} = 500 + 45 + 3 + 12 + 8 + 2 + 1 + 480 + 20 + 5 = 1,076$$

**What this means:** We observed 1,076 total unique word pair co-occurrences. (Not 2,152, the sum of the full symmetric matrix: we count each pair once, not twice.)

**Step 2: Calculate joint probability P(dog, meow)**

$$P(\text{dog, meow}) = \frac{3}{1076} = 0.0028$$

**What this means:** Out of every 1,076 word pair co-occurrences, only 3 are "dog" with "meow". So 0.28% chance any random pair is dog-meow. (Very rare!)

**Step 3: Calculate marginal probability P(dog)**

Dog appears with: bark (500) + cat (45) + meow (3) + tree (12) = **560 total**

$$P(\text{dog}) = \frac{560}{1076} = 0.520$$

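**What this means:** 52% of all pairs involve "dog". So "dog" is a fairly common word in co-occurrences. Because every pair involves two words, marginals defined this way sum to 2 across the vocabulary rather than 1. Normalizing by the full matrix sum (2,152) instead would add ln 2 ≈ 0.69 to every PMI value and leave the ranking of pairs unchanged.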

**Step 4: Calculate marginal probability P(meow)**

Meow appears with: dog (3) + bark (2) + cat (480) + tree (5) = **490 total**

$$P(\text{meow}) = \frac{490}{1076} = 0.455$$

**What this means:** 45.5% of all pairs involve "meow". Slightly less common than "dog".

**Step 5: Calculate probability if independent**

If dog and meow appeared together purely by random chance:

$$P(\text{dog}) \times P(\text{meow}) = 0.520 \times 0.455 = 0.237$$

**What this means:** Expected them to co-occur with probability 0.237. Out of 1,076 pairs, we'd expect about 255 of them to be "dog" with "meow" (0.237 × 1,076 = 255).

**Critical comparison:**
- **Actual:** 3 times (P(dog, meow) = 0.0028)
- **Expected by chance:** 255 times (P(dog) × P(meow) = 0.237)

Dog and meow co-occur at only 3/255 ≈ 0.012 of the chance rate (about 1/85th). They appear together **far less** often than chance would predict.

**Step 6: Calculate the ratio**

$$\frac{P(\text{dog, meow})}{P(\text{dog}) \times P(\text{meow})} = \frac{0.0028}{0.237} = 0.0118$$

**What this means:** The ratio is 0.0118 (much less than 1.0):
- Ratio > 1: Co-occur more than chance predicts (related)
- Ratio = 1: Co-occur exactly as chance predicts (independent)
- Ratio < 1: Co-occur less than chance predicts (unrelated)

In this case, ratio = 0.0118 tells us dog and meow are **strongly unrelated**—they avoid each other.

**Step 7: Apply logarithm to get PMI**

$$\text{PMI}(\text{dog, meow}) = \ln(0.0118) \approx -4.44$$

**What this means:**
- Log converts ratio to log scale (makes numbers manageable and symmetric)
- Negative PMI (-4.44) means words are **unrelated**
- Large magnitude (|-4.44| is large) means **strong unrelatedness**
- In embedding space, they should be **far apart** (even opposite directions)
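The seven steps can be reproduced in a short script. This sketch follows the guide's conventions (upper-triangle total, row sums as marginals, natural log); the helper name `pmi` and the index lookup are illustrative. Evaluating ln(0.0118) this way gives about -4.44.

```python
import math

words = ["dog", "bark", "cat", "meow", "tree"]
C = [
    [0, 500, 45, 3, 12],
    [500, 0, 8, 2, 1],
    [45, 8, 0, 480, 20],
    [3, 2, 480, 0, 5],
    [12, 1, 20, 5, 0],
]
idx = {w: i for i, w in enumerate(words)}

# Step 1: total over the upper triangle (each unordered pair counted once)
total = sum(C[i][j] for i in range(5) for j in range(i + 1, 5))

def pmi(w1, w2):
    i, j = idx[w1], idx[w2]
    p_joint = C[i][j] / total                  # Step 2
    p_w1 = sum(C[i]) / total                   # Step 3
    p_w2 = sum(C[j]) / total                   # Step 4
    return math.log(p_joint / (p_w1 * p_w2))   # Steps 5-7 (natural log)

print(total)                          # 1076
print(round(pmi("dog", "meow"), 2))   # -4.44
print(round(pmi("dog", "bark"), 2))   # 0.63
```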

### Q11: What does PMI value tell us?

**Answer:**

- **PMI > 0:** Words co-occur MORE than chance predicts → Semantically related (attracted)
- **PMI = 0:** Words co-occur EXACTLY as chance predicts → Independent, no relationship
- **PMI < 0:** Words co-occur LESS than chance predicts → Semantically unrelated (avoid)

For dog-meow: PMI ≈ -4.44 is very negative → strongly unrelated

**What this means:** PMI(dog, meow) = -4.44 is our **training target**. When we train embeddings, we want:
$$\text{embedding}_{\text{dog}} \cdot \text{embedding}_{\text{meow}} \approx -4.44$$

## Part 5: Training Word Embeddings to Match PMI

### Q12: What is GloVe's training objective?

**Answer:**

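GloVe trains word embeddings so that the dot product between two word vectors, plus one learned bias per word, approximates the log of their co-occurrence count $X_{w_1 w_2}$:

$$\text{embedding}_{w_1} \cdot \text{embedding}_{w_2} + b_{w_1} + b_{w_2} \approx \log X_{w_1 w_2}$$

Because the bias terms can absorb each word's overall log frequency, the dot product on its own ends up tracking a quantity close to PMI. This guide therefore works with the simplified form:

$$\text{embedding}_{w_1} \cdot \text{embedding}_{w_2} \approx \text{PMI}(w_1, w_2)$$

**Important note:** Tying the dot product to a log co-occurrence statistic is a **modeling choice by the GloVe authors, not a mathematical necessity.** It is attractive because:
- PMI already captures semantic relationships
- Dot product is efficient and differentiable (works well with gradient descent)
- It's simple and elegant

This could have been done differently (e.g., dot product = raw count, or dot product = probability), but GloVe uses a log-scale target.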

**What this means:** Each embedding value is discovered during optimization to satisfy this constraint. Values have no inherent meaning individually—only in combination via dot product.

### Q13: How does GloVe training work? Step-by-step

**Answer:**

**Goal:** Make $\text{embedding}_{\text{dog}} \cdot \text{embedding}_{\text{meow}} \approx -4.44$ (the PMI we calculated)

**Step 1: Initialize random embeddings**

```
embedding_dog = [0.1, 0.2, 0.3, ..., 0.15]  (300 random numbers)
embedding_meow = [0.1, 0.2, 0.3, ..., 0.20] (300 random numbers)
```

**What this means:** Start with random values. These will be optimized.

**Step 2: Calculate current dot product**

$$\text{dot product} = (0.1 \times 0.1) + (0.2 \times 0.2) + (0.3 \times 0.3) + \ldots + (0.15 \times 0.20) = 0.55$$

**What this means:** Current result is 0.55, but we need -4.44. Error = 4.99 (huge!)

**Step 3: Adjust embeddings**

The optimization algorithm analyzes error and adjusts values. To make dot product negative, use negative values and opposite signs:

```
embedding_dog = [0.3, -0.5, 0.2, 0.1, 0.4, ..., -0.25]
embedding_meow = [0.4, 0.3, 0.1, -0.2, 0.5, ..., 0.15]
```

**What this means:** Negative products like (-0.5 × 0.3) = -0.15 help achieve negative dot product.

**Step 4: Recalculate dot product**

$$\text{dot product} = (0.3 \times 0.4) + (-0.5 \times 0.3) + (0.2 \times 0.1) + (0.1 \times -0.2) + \ldots = -2.1$$

**What this means:** Better! Error reduced from 4.99 to 2.34. Still not -4.44, continue.

**Step 5: Repeat many iterations**

| Iteration | Dot Product | Target | Error |
|-----------|-------------|--------|-------|
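| 1 | 0.55 | -4.44 | 4.99 |
| 2 | -2.1 | -4.44 | 2.34 |
| 3 | -3.5 | -4.44 | 0.94 |
| 100 | -4.40 | -4.44 | 0.04 |
| Final | -4.44 | -4.44 | ≈ 0 ✓ |

**What this means:** After many iterations, the optimizer finds values whose dot product is very close to -4.44. In a real run the same vectors must also fit every other pair, so the match is approximate rather than exact.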

**Step 6: Final trained embeddings**

```
embedding_dog = [0.35, -0.85, 0.25, 0.05, 0.40, ..., -0.55]
embedding_meow = [0.25, 0.65, 0.15, -0.45, 0.30, ..., 0.35]
```

**Verification:**
$$\text{dot product} = (0.35 \times 0.25) + (-0.85 \times 0.65) + (0.25 \times 0.15) + \ldots = 0.088 - 0.553 + 0.038 + \ldots \approx -4.44 \;✓$$

**What this means:** Embeddings are "trained"—values discovered by optimization. Notice they differ completely from random start. Optimizer found these numbers work together.
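The loop in Steps 1 to 6 can be sketched as plain gradient descent on a single pair. This is an illustration under simplifying assumptions, not GloVe's actual training procedure: it fits only one target, uses 8 dimensions instead of 300, and the learning rate, seed, and step count are arbitrary choices.

```python
import random

random.seed(0)
dim, target, lr = 8, -4.44, 0.01   # small dim for readability; the guide uses 300

u = [random.uniform(-0.5, 0.5) for _ in range(dim)]   # stand-in for embedding_dog
v = [random.uniform(-0.5, 0.5) for _ in range(dim)]   # stand-in for embedding_meow

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

for step in range(500):
    error = dot(u, v) - target
    # Gradient of error**2 is 2*error*v with respect to u, and 2*error*u with respect to v.
    # Both new vectors are computed from the old u and v before either is reassigned.
    u, v = (
        [ui - lr * 2 * error * vi for ui, vi in zip(u, v)],
        [vi - lr * 2 * error * ui for ui, vi in zip(u, v)],
    )

print(round(dot(u, v), 4))
```

With a single pair the target can be hit almost exactly; with a full vocabulary every vector is pulled by many pairs at once, so each individual match is only approximate.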

### Q14: Why doesn't each embedding dimension mean anything by itself?

**Answer:**

Each value in an embedding is a parameter discovered during optimization. Its only purpose is to satisfy:

> When dot-producted with another embedding, the result should equal the target PMI

**Example:**
```
embedding_dog[5] = 0.18
```

This 0.18 does NOT represent:
- "dog is 18% fierce" ✗
- "dog has loyalty score 0.18" ✗
- "feature X with strength 0.18" ✗

It's simply: "A parameter that, when multiplied by embedding_meow[5] and combined with 299 other products, helps achieve the target dot product of -4.44" ✓

**Semantic meaning emerges from the COMBINATION via dot product, not from individual values.**
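One way to see this: rotating every embedding by the same angle changes all the individual coordinates but leaves every dot product untouched. The 2-D vectors below are hypothetical values chosen only so that their dot product is -4.44.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Hypothetical 2-D embeddings, for illustration only
dog, meow = (1.5, -2.0), (-1.0, 1.47)

dog_r, meow_r = rotate(dog, 0.8), rotate(meow, 0.8)

print(dot(dog, meow), dot(dog_r, meow_r))   # same dot product, up to rounding
print(dog[0], dog_r[0])                     # but the individual coordinates differ
```

Since both coordinate systems satisfy the training objective equally well, no single coordinate can carry a fixed meaning.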

### Q15: Why is "loosely defined" better than pre-defining what each dimension means?

**Answer:**

**Alternative: Pre-define semantic features**
```python
embedding_dog[0] = compute_animality_score("dog")
embedding_dog[1] = compute_domesticity_score("dog")
embedding_dog[2] = compute_loyalty_score("dog")
```

**Problem:** How do you compute these scores? There's no formula. This approach fails.

**GloVe's approach: Let optimizer discover values**
- No need to pre-define what dimensions mean
- Just optimize to match PMI targets
- Result: Flexible, universal embeddings
- Works for any language, domain, task
- Emerges from data, not human assumptions

**Why it works:** PMI structure naturally encodes semantic relationships. Optimization finds 300-dim representation that preserves this. Even though individual dimensions are "meaningless", together they capture semantic meaning via dot products.

## Part 6: Using the Trained Embeddings

### Q16: How do we approximate co-occurrence relationships from trained embeddings?

**Answer:**

After training, use the **dot product** between vectors to approximate **PMI**:

$$\text{dot product}(\text{embedding}_{\text{dog}}, \text{embedding}_{\text{meow}}) \approx \text{PMI}(\text{dog, meow}) = -4.44$$

**With trained embeddings:**
- embedding_dog = [0.35, -0.85, 0.25, 0.05, 0.40, ..., -0.55] (300 numbers)
- embedding_meow = [0.25, 0.65, 0.15, -0.45, 0.30, ..., 0.35] (300 numbers)

**The dot product:**
$$\text{dot product} = (0.35 \times 0.25) + (-0.85 \times 0.65) + \ldots \approx -4.44$$

**Interpretation:**
- Negative dot product (-4.44) → Unrelated words
- Positive dot product → Related words
- Magnitude → Strength of relationship

### Q17: Can we recover original counts from embeddings?

**Answer:**

Approximately, yes, provided you also keep the marginal probabilities and the total count, which the embeddings themselves do not store. Reverse the PMI calculation:

**Starting:** Trained embeddings give dot product ≈ -4.44

**Step 1:** Exponentiate
$$e^{-4.44} \approx 0.0118 = \frac{P(\text{dog, meow})}{P(\text{dog}) \times P(\text{meow})}$$

**Step 2:** Multiply by marginals
$$0.0118 \times 0.520 \times 0.455 \approx 0.00279 \approx P(\text{dog, meow})$$

**Step 3:** Multiply by total
$$0.00279 \times 1076 \approx 3.0$$
✓ Original count

**What this means:** You can recover the original information, but you don't need to. The dot product directly gives the PMI (semantic similarity), which is what matters. Note that the exponential amplifies any error in the dot product, so recovered counts are only as accurate as the fit.
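A sketch of the inversion, using the dog-meow marginals from Q10. The helper name `count_from_pmi` is illustrative, and the second call adds a hypothetical error of 0.3 to the dot product to show how sensitive the recovered count is.

```python
import math

def count_from_pmi(pmi_value, p_w1, p_w2, total):
    """Invert PMI = ln(P(w1, w2) / (P(w1) * P(w2))) to an estimated pair count."""
    p_joint = math.exp(pmi_value) * p_w1 * p_w2
    return p_joint * total

total = 1076
p_dog, p_meow = 560 / total, 490 / total
true_pmi = math.log((3 / total) / (p_dog * p_meow))

print(round(count_from_pmi(true_pmi, p_dog, p_meow, total), 3))         # 3.0
print(round(count_from_pmi(true_pmi + 0.3, p_dog, p_meow, total), 2))   # 4.05
```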

### Q18: Why does GloVe work?

**Answer:**

**Why GloVe Works:**

1. **Compression:** 300 dimensions capture essential semantic information from 400,000-dim original
2. **Efficiency:** Computing similarity is simple dot product, not matrix lookup
3. **Semantic meaning:** Dot product reflects meaningful relationships via PMI
4. **Learned representations:** Values discovered through optimization
5. **Generalization:** Embeddings work for tasks beyond co-occurrence

**Limitation:** GloVe only trains on observed (non-zero) pairs. Rare words get few training signals.

### Q19: What is SWIVEL and how does it compare with GloVe?

**Answer:**

SWIVEL also trains embeddings to match PMI, but includes **zero-count pairs** in training.

**SWIVEL Training Example: dog-meow**

Same calculation as GloVe:
$$\text{Target PMI} = \ln\left(\frac{0.0028}{0.237}\right) = \ln(0.0118) \approx -4.44$$

**Loss function for non-zero pairs:**
$$\text{Loss}_{\text{non-zero}} = f(\text{count}) \times (\text{dot product} - \text{target PMI})^2$$

where f(3) = √3 ≈ 1.7

**Example:**
- Dog-meow: Loss = 1.7 × (0.55 - (-4.44))² = 1.7 × 24.90 ≈ 42.3

**Additionally, SWIVEL processes zero-count pairs:**

**Loss function for zero-count pairs:**
$$\text{Loss}_{\text{zero}} = 0.75 \times (\text{dot product} - \text{target PMI})^2$$

**Example (dog-xylophone never co-occur):**
- Target PMI ≈ -5 (large negative)
- Random dot product ≈ 0.1
- Loss = 0.75 × (0.1 - (-5))² = 0.75 × 26.01 ≈ 19.5

**What this means:** SWIVEL has many more training constraints (both observed + unobserved) that regularize embeddings.

### Q20: GloVe vs SWIVEL Step-by-Step Comparison

**Answer:**

**Training the pair dog-meow (count=3, target PMI=-4.44):**

| Step | GloVe | SWIVEL | Note |
|------|-------|--------|------|
| **Pair type** | Non-zero: dog-meow | Non-zero: dog-meow | Both process observed pairs |
| **Count** | 3 | 3 | Same |
| **Target PMI** | -4.44 | -4.44 | Same (GloVe in the simplified view used in this guide) |
| **Weight function** | f(3) = (3/100)^0.75 ≈ 0.072 | f(3) = √3 ≈ 1.7 | Both weight by count; GloVe's published weight is capped at 1 |
| **Error (iter 1)** | (0.55-(-4.44))² = 24.90 | Same | Same error |
| **Loss (iter 1)** | 0.072 × 24.90 ≈ 1.8 | 1.7 × 24.90 ≈ 42.3 | Weight scales differ, so compare within a method, not across |
| **Add. constraints** | None for zeros | Many zeros! | SWIVEL: ~399,995 zero pairs for rare words |
| **Zero-pair loss** | N/A (ignored) | 0.75 × (0.1-(-5))² ≈ 19.5 | Small weight prevents overwhelming |
| **Final result** | dot ≈ -4.44 | dot ≈ -4.44 | Both achieve target |
| **Rare word quality** | Poor (few constraints) | Excellent (many constraints) | **SWIVEL significantly better** |

**Key difference:** SWIVEL adds hundreds of thousands of zero-pair constraints per rare word that regularize its embedding, while GloVe ignores zero pairs entirely.

### Q21: Piecewise Loss Function Explained

**Answer:**

SWIVEL uses different weights for different pair types:

**For non-zero-count pairs:**
$$\text{Loss} = f(\text{count}) \times (\text{error})^2 \text{ where } f(\text{count}) = \sqrt{\text{count}}$$

**What this means:**
- f(500) = √500 ≈ 22.4 (high-count pairs weighted heavily)
- f(3) = √3 ≈ 1.7 (low-count pairs weighted lightly)
- Intuition: Confident about frequent co-occurrences, less confident about rare ones

**For zero-count pairs:**
$$\text{Loss} = 0.75 \times (\text{error})^2$$

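**What this means:**
- Small constant weight (0.75 in this guide's simplified form)
- Zero-count pairs number in the billions, so they would overwhelm the observed pairs if weighted equally
- Small weight provides regularization without domination
- In the published SWIVEL objective, the zero-count term is a one-sided "soft hinge", log(1 + exp(dot product − PMI*)), where PMI* is the PMI the pair would have with a count of 1. It penalizes only dot products that are too high, so unobserved pairs are pushed apart without being forced to an exact target.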
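The two cases can be combined into one function. This sketch follows the simplified squared-error form used in this guide (√count weight for observed pairs, constant 0.75 for zero-count pairs); the function and constant names are illustrative.

```python
import math

ZERO_PAIR_WEIGHT = 0.75   # constant from this guide's simplified zero-count loss

def piecewise_loss(dot_product, target_pmi, count):
    """Weighted squared error, with the weight chosen by pair type."""
    error = dot_product - target_pmi
    weight = math.sqrt(count) if count > 0 else ZERO_PAIR_WEIGHT
    return weight * error ** 2

# Zero-count example from Q19 (dog-xylophone)
print(round(piecewise_loss(0.1, -5.0, count=0), 1))       # 19.5

# Weights for a frequent and a rare observed pair
print(round(math.sqrt(500), 1), round(math.sqrt(3), 1))   # 22.4 1.7
```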

### Q22: Why is SWIVEL better for rare words?

**Answer:**

**Scenario: Training "xylophone" (appears 5 times)**

**GloVe:**
- Trains on ~5 non-zero pairs (observed co-occurrences)
- Only 5 training constraints
- Under-constrained: Many possible placements
- Result: Embedding might be random

**SWIVEL:**
- Trains on ~5 non-zero pairs
- PLUS ~399,995 zero-count pairs
- Hundreds of thousands of constraints saying: "xylophone should be far from unrelated words"
- Well-constrained: Meaningful placement
- Result: Significantly better rare word embeddings

**Empirical result:** The SWIVEL authors report better results than GloVe on word similarity and analogy benchmarks, with the clearest gains on rare words

## Summary

**Key Insights:**

1. **Co-occurrence matrices** capture semantic relationships but are too large (160 billion cells)

2. **Matrix factorization** compresses to 300 dimensions (~670x smaller for all three factors, ~1,300x for the embedding matrix alone)

3. **PMI** measures semantic relationships more accurately than raw counts

4. **GloVe** trains embeddings to match PMI values via dot product (design choice, not mathematical necessity)

5. **Embedding values** are discovered parameters with no individual meaning—semantic meaning emerges from dot products

6. **SWIVEL** improves on GloVe by including zero-count pairs, significantly better for rare words

7. **Piecewise loss** balances high-count pairs (heavy weight) with zero-count pairs (light weight)