# GloVe and SWIVEL Word Embeddings: Complete Q&A Guide

*Note: All calculations corrected to account for symmetric co-occurrence matrix (counting unique pairs only)*

## Part 1: Understanding Co-occurrence Matrices

### Q1: What is a co-occurrence matrix?

A co-occurrence matrix is a table that tracks which words appear near each other in text. Both rows and columns represent words from your vocabulary, and each cell contains a count of how many times those two words appeared together within a context window (typically 5 words apart).

**Example Setup:**
Vocabulary: {dog, bark, cat, meow, tree}

| | dog | bark | cat | meow | tree |
|---|---|---|---|---|---|
| dog | 0 | 500 | 45 | 3 | 12 |
| bark | 500 | 0 | 8 | 2 | 1 |
| cat | 45 | 8 | 0 | 480 | 20 |
| meow | 3 | 2 | 480 | 0 | 5 |
| tree | 12 | 1 | 20 | 5 | 0 |

The value 500 at (dog, bark) means: "In the corpus, 'dog' and 'bark' appear within a context window 500 times."

**Important:** This matrix is **symmetric**—(dog, bark) = (bark, dog) = 500. Each unique word pair should be counted only once.
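As a rough sketch of how such counts could be gathered, the snippet below slides a window over a token list and counts each unordered pair once. The function name `cooccurrence_counts`, the toy sentence, and the window size are illustrative assumptions, not part of any particular library.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look only to the right so each pair of positions is counted once.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                # frozenset makes (dog, bark) and (bark, dog) the same key.
                counts[frozenset((left, right))] += 1
    return counts

# Tiny made-up corpus, for illustration only.
tokens = "the dog will bark and the cat will meow".split()
counts = cooccurrence_counts(tokens, window=2)

print(counts[frozenset(("dog", "bark"))])  # 1
print(counts[frozenset(("dog", "cat"))])   # 0 (never within 2 tokens)
```

Keying on an unordered pair is one way to respect the symmetry noted above: the pair is stored once, and the full symmetric table can be rebuilt from it if needed.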

### Q2: Why do we need a co-occurrence matrix?

Because of the principle: "You shall know a word by the company it keeps." Words that frequently appear together are likely to be semantically related. The co-occurrence matrix captures these statistical relationships across the entire text corpus.

## Part 2: The Problem with Large Matrices

### Q3: Why is a 400,000 × 400,000 co-occurrence matrix a problem?

A vocabulary of 400,000 words creates a matrix with **160 billion cells**. This is computationally expensive to store and process. Most cells are zeros or very small numbers (sparse data), making this representation wasteful.

## Part 3: Matrix Factorization Solution

### Q4: How do we solve the size problem?

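One classical answer is **matrix factorization**, for example Singular Value Decomposition (SVD). The algorithm breaks down the massive 400,000 × 400,000 matrix into smaller, manageable matrices that can be multiplied back together to approximate the original. (GloVe itself does not run an SVD; it learns a comparable low-dimensional factorization by gradient-based training, as Part 6 describes.)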

### Q5: How exactly does matrix factorization work?

The original co-occurrence matrix C is decomposed as:

$$C = U \Sigma V^T$$

Where:
- U is a 400,000 × r matrix
- Σ is an r × r diagonal matrix of singular values
- V^T is an r × 400,000 matrix

The key insight: most information concentrates in just a few singular values. We keep only the **k largest singular values** (e.g., k = 300) and set the rest to zero.
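Written out, the truncation keeps only the first k terms of the decomposition. The symbols $C_k$, $\sigma_i$, $u_i$ and $v_i$ are introduced here for illustration and are not used elsewhere in this guide:

$$C \approx C_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i \, u_i v_i^T, \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k$$

By the Eckart-Young theorem, $C_k$ is the closest rank-$k$ matrix to $C$ in Frobenius norm, which is the precise sense in which dropping the small singular values loses the least information.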

### Q6: What do we end up with after factorization?

After truncation to keep only 300 dimensions:
- U becomes 400,000 × 300
- Σ becomes 300 × 300
- V^T becomes 300 × 400,000

We've compressed **160 billion cells down to approximately 240 million cells total**.
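The cell counts can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the variable names are illustrative.

```python
vocab_size = 400_000
k = 300  # embedding dimensions kept after truncation

full_cells = vocab_size * vocab_size   # original co-occurrence matrix
u_cells = vocab_size * k               # U:   400,000 x 300
sigma_cells = k * k                    # Σ:   300 x 300
vt_cells = k * vocab_size              # V^T: 300 x 400,000
factored_cells = u_cells + sigma_cells + vt_cells

print(f"{full_cells:,}")      # 160,000,000,000
print(f"{factored_cells:,}")  # 240,090,000
```

Almost all of the 240 million comes from U and V^T (120 million each); the diagonal Σ block contributes only 90,000 cells.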

### Q7: Why keep 400,000 in the dimension?

The 400,000 represents your **vocabulary size**—the number of unique words. Each word needs its own representation. The 300 represents the **embedding dimensions**—the compressed features that capture each word's semantic meaning.

**Result:** Each of the 400,000 words gets a 300-dimensional word embedding vector.

## Part 4: Pointwise Mutual Information (PMI) - The Training Target

**IMPORTANT:** Before we can train embeddings, we must first calculate the PMI values from the co-occurrence matrix. These PMI values become the targets that the embeddings are trained to match.

### Q8: What is PMI and why do we need it?

PMI (Pointwise Mutual Information) measures whether two words co-occur more or less often than random chance would predict. This is more meaningful than raw co-occurrence counts because it accounts for individual word frequencies.

$$\text{PMI}(\text{dog, bark}) = \log\left(\frac{P(\text{dog, bark})}{P(\text{dog}) \times P(\text{bark})}\right)$$

### Q9: Why not just use P(dog) × P(bark)?

The product P(dog) × P(bark) on its own only describes what chance would produce; it says nothing about what was actually observed. The full PMI formula compares the two.

**The Problem with Just P(dog) × P(bark):**

If we only used P(dog) × P(bark), we'd be calculating the probability that dog and bark co-occur **if they were completely independent** (if their appearance had nothing to do with each other).

**Example to show why this matters:**

Imagine two scenarios:

**Scenario 1: "dog" and "bark" are semantically related**
- They naturally appear together because they're related concepts
- Actual co-occurrence: 500 times
- If they were random/independent: would only appear together ~266 times (based on P(dog) × P(bark), worked out in Part 5)
- Result: They appear together **more than chance** predicts

**Scenario 2: "dog" and "the" are not semantically related**
- "the" is a common word that appears everywhere
- Actual co-occurrence: 5,000 times (very high count!)
- If they were independent: would appear together ~4,900 times (an illustrative figure; "the" is so frequent that chance alone explains almost all of the count)
- Result: They appear together roughly as much as chance would predict

**The key insight:** Raw co-occurrence counts are misleading. "dog" and "the" co-occur more frequently (5,000 vs 500), but "dog" and "bark" are more semantically related. We need PMI to distinguish this difference.

PMI fixes this by comparing:
- **Actual:** How often they really co-occur
- **Expected by chance:** P(dog) × P(bark)

If actual > expected, PMI is positive (they're attracted to each other semantically).
If actual = expected, PMI is zero (they're independent).
If actual < expected, PMI is negative (they avoid each other).

## Part 5: Calculate PMI Step-by-Step (CORRECTED)

### Q10: How do we calculate PMI step-by-step?

**IMPORTANT CORRECTION:** Count only unique pairs (upper triangle of symmetric matrix), not the entire matrix.

**Step 1: Calculate total unique co-occurrences**

Sum only the unique pairs (upper triangle, excluding diagonal):

- dog-bark: 500
- dog-cat: 45
- dog-meow: 3
- dog-tree: 12
- bark-cat: 8
- bark-meow: 2
- bark-tree: 1
- cat-meow: 480
- cat-tree: 20
- meow-tree: 5

**Total unique co-occurrences = 500 + 45 + 3 + 12 + 8 + 2 + 1 + 480 + 20 + 5 = 1,076**

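*(Summing the full symmetric matrix instead would give 2,152, exactly double, because every pair appears twice. Under the unique-pair convention used here, the marginals in Steps 3 and 4 sum to 2 rather than 1 across the vocabulary, since each pair involves two words. Compared with normalizing everything by 2,152, this shifts every PMI value down by the same constant, log 2 ≈ 0.69, so comparisons between pairs are unaffected.)*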

**Step 2: Calculate the joint probability P(dog, bark)**

$$P(\text{dog, bark}) = \frac{\text{co-occurrence count}}{\text{total unique co-occurrences}} = \frac{500}{1076} \approx 0.464$$

This is the probability that any randomly selected co-occurrence involves both "dog" and "bark."

**What this means:** Out of every 1,076 unique word pair co-occurrences we observe, 500 of them are "dog" paired with "bark." So there's a 46.4% chance any randomly selected pair is dog-bark.

**Step 3: Calculate marginal probability for dog**

Dog appears in co-occurrences with: bark (500) + cat (45) + meow (3) + tree (12) = **560 total**

$$P(\text{dog}) = \frac{560}{1076} \approx 0.520$$

This is the probability that any co-occurrence involves "dog" (regardless of what it co-occurs with).

**What this means:** Out of every 1,076 unique pairs, 560 of them involve the word "dog." So there's a 52% chance any randomly selected pair includes "dog."

**Step 4: Calculate marginal probability for bark**

Bark appears in co-occurrences with: dog (500) + cat (8) + meow (2) + tree (1) = **511 total**

$$P(\text{bark}) = \frac{511}{1076} \approx 0.475$$

This is the probability that any co-occurrence involves "bark" (regardless of what it co-occurs with).

**What this means:** Out of every 1,076 pairs, 511 of them involve the word "bark." So there's a 47.5% chance any randomly selected pair includes "bark."

**Step 5: Calculate the probability if they were independent**

If "dog" and "bark" appeared together purely by random chance (completely independent), the probability would be:

$$P(\text{dog}) \times P(\text{bark}) = 0.520 \times 0.475 \approx 0.247$$

**What this means:** If dog and bark had nothing to do with each other, and simply landed in the same context windows at random, they'd co-occur with probability 0.247. Out of 1,076 unique pairs, we'd expect about 266 of them to be "dog" with "bark" (0.247 × 1,076 ≈ 266).

**This is the crucial comparison:**
- **Actual co-occurrence:** 500 times (P(dog, bark) = 0.464)
- **Expected by chance:** 266 times (P(dog) × P(bark) = 0.247)

Dog and bark co-occur 500 ÷ 266 ≈ **1.88 times as often** as we'd expect if they were independent!

**Step 6: Calculate the ratio - Why do we divide?**

Now we form the ratio to see how much MORE frequently they co-occur than chance predicts:

$$\frac{P(\text{dog, bark})}{P(\text{dog}) \times P(\text{bark})} = \frac{0.464}{0.247} \approx 1.88$$

**This ratio tells us:**
- If ratio > 1: They co-occur MORE often than chance predicts (they're attracted to each other)
- If ratio = 1: They co-occur EXACTLY as often as chance would predict (they're independent)
- If ratio < 1: They co-occur LESS often than chance predicts (they avoid each other)

In our case, the ratio is 1.88, meaning dog and bark co-occur **1.88 times more often than random chance would predict**.

**Why divide instead of subtract?**

You might ask: "Why not just calculate P(dog, bark) - P(dog) × P(bark) = 0.464 - 0.247 = 0.217?"

Answer: Because subtraction doesn't scale well with different probability ranges.

**Example:**
- Scenario A: Actual = 0.3, Expected = 0.1 → Difference = 0.2 → Ratio = 3.0
- Scenario B: Actual = 0.003, Expected = 0.001 → Difference = 0.002 → Ratio = 3.0

Both scenarios have the same ratio (3.0 times more frequent than expected), but different differences (0.2 vs 0.002). Division preserves the relative relationship regardless of absolute probability values. Subtraction would incorrectly suggest Scenario A is more significant.

**Step 7: Apply logarithm to get PMI**

$$\text{PMI}(\text{dog, bark}) = \log(1.88) \approx 0.63$$

We apply logarithm because:
1. It compresses large ratio values (prevents extreme numbers)
2. It makes ratios comparable on a symmetric scale (log(3) and log(1/3) have equal magnitude but opposite signs)
3. It's mathematically convenient for optimization algorithms

**What the 0.63 value means:**
- It's the log of how much more often they co-occur than chance predicts
- Positive 0.63 means they're semantically related (more related than when PMI = 0)
- The magnitude tells us the strength of the relationship
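Steps 1 through 7 can be reproduced in a few lines. This is a minimal sketch that follows the guide's unique-pair convention and uses the natural logarithm; the helper name `pmi` is an illustrative choice.

```python
import math

words = ["dog", "bark", "cat", "meow", "tree"]
matrix = [
    [0, 500, 45, 3, 12],
    [500, 0, 8, 2, 1],
    [45, 8, 0, 480, 20],
    [3, 2, 480, 0, 5],
    [12, 1, 20, 5, 0],
]
index = {w: i for i, w in enumerate(words)}
n = len(words)

# Step 1: total over unique pairs only (upper triangle, no diagonal).
total = sum(matrix[i][j] for i in range(n) for j in range(i + 1, n))

def pmi(a, b):
    """PMI of two different words; undefined (log of 0) if they never co-occur."""
    i, j = index[a], index[b]
    p_joint = matrix[i][j] / total   # Step 2
    p_a = sum(matrix[i]) / total     # Step 3: row sum / total
    p_b = sum(matrix[j]) / total     # Step 4
    return math.log(p_joint / (p_a * p_b))  # Steps 5-7

print(total)                         # 1076
print(round(pmi("dog", "bark"), 2))  # 0.63
```

Calling `pmi` on a pair with a zero count would raise a `ValueError`, which mirrors the fact that PMI is undefined (negative infinity) for pairs that never co-occur.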

### Q11: What does this PMI value tell us?

- **PMI > 0:** The words co-occur more often than random chance predicts. They're semantically related and attracted to each other.
- **PMI = 0:** They co-occur exactly as random chance would predict. No meaningful relationship.
- **PMI < 0:** They co-occur less than random chance predicts. They tend to avoid each other.

For our example, PMI ≈ 0.63 is positive, meaning "dog" and "bark" are semantically related—they co-occur more than chance would predict.

**This 0.63 value is now our training target for the embeddings.**

## Part 6: Training Word Embeddings to Match PMI

### Q12: What is GloVe's training objective?

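In this guide's simplified view, GloVe trains word embeddings so that the dot product between any two word vectors approximates their PMI value. Strictly speaking, the published GloVe objective is a weighted least-squares fit of $w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j$ to $\log X_{ij}$, the log co-occurrence count. The learned bias terms can absorb the log marginal frequencies, which is why the dot product ends up close to PMI, up to a constant offset.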

$$\text{dot product}(\text{embedding}_{\text{dog}}, \text{embedding}_{\text{bark}}) \approx \text{PMI}(\text{dog, bark}) = 0.63$$

### Q13: How does the training process work?

**Step 1: Initialize random embeddings**

Start with random 300-dimensional vectors for each word:
- embedding_dog = [0.1, 0.2, 0.3, 0.4, 0.5, ..., 0.15] (300 random numbers)
- embedding_bark = [0.1, 0.2, 0.3, 0.4, 0.5, ..., 0.20] (300 random numbers)

**Step 2: Calculate current dot product**

$$\text{dot product} = (0.1 \times 0.1) + (0.2 \times 0.2) + (0.3 \times 0.3) + \ldots + (0.15 \times 0.20)$$

Let's say this equals 0.55.

**Step 3: Compare to target PMI**

- Target: 0.63
- Current: 0.55
- Error: 0.63 - 0.55 = 0.08 (too low)

**Step 4: Adjust the embeddings**

The optimization algorithm (like gradient descent) adjusts the numbers in both embeddings to reduce the error. It might change:
- embedding_dog to [0.3, 0.15, 0.5, 0.2, 0.4, ..., 0.25]
- embedding_bark to [0.4, 0.2, 0.6, 0.25, 0.5, ..., 0.30]

**Step 5: Recalculate dot product**

$$\text{dot product} = (0.3 \times 0.4) + (0.15 \times 0.2) + (0.5 \times 0.6) + \ldots + (0.25 \times 0.30)$$

Now equals 0.60. Closer to 0.63, so continue adjusting.

**Step 6: Repeat iteratively**

The algorithm continues adjusting the embedding values thousands of times. After many iterations, it might reach:
- embedding_dog = [0.42, -0.18, 0.55, 0.10, 0.45, ..., -0.22]
- embedding_bark = [0.48, -0.15, 0.61, 0.08, 0.40, ..., -0.20]

$$\text{dot product} = (0.42 \times 0.48) + (-0.18 \times -0.15) + (0.55 \times 0.61) + \ldots + (-0.22 \times -0.20)$$
$$= 0.202 + 0.027 + 0.336 + \ldots + 0.044$$
$$= 0.63$$

**Success!** The dot product now matches the target PMI.
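The loop in Steps 1 through 6 can be sketched for this single pair as plain gradient descent on the squared error. The learning rate, step count, and random initialization range are illustrative assumptions, and real GloVe training adds weights, bias terms, and many pairs at once.

```python
import random

random.seed(0)
dim = 300
target = 0.63  # PMI(dog, bark) from Part 5
lr = 0.05      # illustrative learning rate, not a tuned value

# Step 1: random starting vectors.
dog = [random.uniform(-0.1, 0.1) for _ in range(dim)]
bark = [random.uniform(-0.1, 0.1) for _ in range(dim)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

for step in range(500):
    # Steps 2-3: current dot product and its error against the target.
    error = dot(dog, bark) - target
    # Step 4: the gradient of error**2 w.r.t. dog is 2*error*bark, and vice versa.
    new_dog = [d - lr * 2 * error * b for d, b in zip(dog, bark)]
    new_bark = [b - lr * 2 * error * d for d, b in zip(dog, bark)]
    dog, bark = new_dog, new_bark

print(round(dot(dog, bark), 2))  # 0.63
```

Changing the seed gives different final vectors with the same dot product, which is the point Q14 makes about the individual numbers not being special.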

### Q14: What do these embedding numbers represent?

The numbers in the embeddings (like 0.42, -0.18, 0.55, etc.) are **learned parameters discovered through optimization**. They are:

- **NOT** co-occurrence counts from the matrix
- **NOT** probabilities
- **NOT** logarithms of anything
- **NOT** mathematically derived from a formula

They are simply **values that the algorithm discovered through trial and error** to satisfy the constraint that their dot product equals the target PMI.

Think of it like solving a puzzle:
- **The puzzle:** Find two sets of 300 numbers that multiply together (dot product) to give 0.63
- **The optimizer:** Tries different number combinations repeatedly and measures the error
- **The solution:** After many iterations, finds one valid combination

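There's nothing inherently special about the specific values [0.42, -0.18, 0.55, ...]. A different optimization run might produce different numbers. **For this one pair considered in isolation, any set of numbers whose dot product equals 0.63 works**; in real training the values are also constrained by every other pair the word takes part in (see Q15).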

### Q15: Does GloVe train embeddings for just one word pair?

No! GloVe simultaneously trains embeddings to satisfy the PMI constraint for **all word pairs** in the vocabulary.

For our 5-word vocabulary, it needs to satisfy:
- dot product(dog, bark) ≈ 0.63
- dot product(dog, cat) ≈ PMI(dog, cat)
- dot product(dog, meow) ≈ PMI(dog, meow)
- dot product(dog, tree) ≈ PMI(dog, tree)
- dot product(bark, cat) ≈ PMI(bark, cat)
- ...and all other pairs

The algorithm adjusts all embeddings simultaneously to minimize the total error across all word pairs.
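A toy version of this joint training, for the 5-word example, might look like the sketch below. It uses one vector per word, unweighted squared error, full-batch gradient descent, and the unique-pair PMI targets from Part 5; these are simplifying assumptions, not GloVe's actual weighted objective with separate context vectors and biases.

```python
import math
import random

matrix = [
    [0, 500, 45, 3, 12],
    [500, 0, 8, 2, 1],
    [45, 8, 0, 480, 20],
    [3, 2, 480, 0, 5],
    [12, 1, 20, 5, 0],
]
n = len(matrix)
total = sum(matrix[i][j] for i in range(n) for j in range(i + 1, n))
row = [sum(r) for r in matrix]

# PMI target for every unique pair: log(count * total / (row_i * row_j)).
targets = {
    (i, j): math.log(matrix[i][j] * total / (row[i] * row[j]))
    for i in range(n) for j in range(i + 1, n)
}

random.seed(0)
dim, lr = 5, 0.005  # illustrative sizes, not tuned
emb = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def total_loss():
    return sum((dot(emb[i], emb[j]) - t) ** 2 for (i, j), t in targets.items())

initial_loss = total_loss()
for step in range(5000):
    grads = [[0.0] * dim for _ in range(n)]
    for (i, j), t in targets.items():
        err = dot(emb[i], emb[j]) - t
        for d in range(dim):
            grads[i][d] += 2 * err * emb[j][d]
            grads[j][d] += 2 * err * emb[i][d]
    # Every embedding is updated from the errors of all pairs it appears in.
    for i in range(n):
        for d in range(dim):
            emb[i][d] -= lr * grads[i][d]

print(initial_loss, total_loss())
```

Because each word's gradient sums contributions from all four of its pairs, no single pair can be fitted without regard to the others.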

## Part 7: Using the Trained Embeddings

### Q16: How do we approximate co-occurrence relationships from trained embeddings?

After training completes, we can use the **dot product** between two word embedding vectors to approximate their **PMI value**:

$$\text{dot product}(\text{embedding}_{\text{dog}}, \text{embedding}_{\text{bark}}) \approx \text{PMI}(\text{dog, bark}) = 0.63$$

**With the trained embeddings:**
- embedding_dog = [0.42, -0.18, 0.55, 0.10, 0.45, ..., -0.22] (300 numbers)
- embedding_bark = [0.48, -0.15, 0.61, 0.08, 0.40, ..., -0.20] (300 numbers)

**The dot product calculation:**

$$\text{dot product} = (0.42 \times 0.48) + (-0.18 \times -0.15) + (0.55 \times 0.61) + (0.10 \times 0.08) + (0.45 \times 0.40) + \ldots + (-0.22 \times -0.20)$$

$$= 0.202 + 0.027 + 0.336 + 0.008 + 0.180 + \ldots + 0.044$$

$$= 0.63$$

This 0.63 tells us that "dog" and "bark" are semantically related (positive PMI = they co-occur more than chance).

### Q17: Can we get back the original co-occurrence count?

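Approximately, yes, provided you also kept the marginal probabilities and the total count, because the embeddings alone do not store them. It requires reversing the PMI calculation: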

**Step 1: Exponentiate the dot product (PMI)**

$$e^{0.63} \approx 1.88$$

This gives us the ratio $\frac{P(\text{dog, bark})}{P(\text{dog}) \times P(\text{bark})}$

**Step 2: Multiply by marginal probabilities**

$$1.88 \times 0.520 \times 0.475 \approx 0.464$$

This gives us P(dog, bark).

**Step 3: Multiply by total co-occurrences**

$$0.464 \times 1076 \approx 500$$

We've recovered the original co-occurrence count from the co-occurrence matrix!

**However, in practice, you rarely need to do this.** For most NLP tasks, you only need the PMI (semantic similarity), which the dot product gives you directly.
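The three reversal steps are short enough to check directly. This sketch assumes the marginals and total from Part 5 are still available; the variable names are illustrative.

```python
import math

total = 1076
p_dog = 560 / total
p_bark = 511 / total

# Exact PMI from the counts (about 0.63).
pmi = math.log((500 / total) / (p_dog * p_bark))

ratio = math.exp(pmi)             # Step 1: undo the log
p_joint = ratio * p_dog * p_bark  # Step 2: multiply by the marginals
count = p_joint * total           # Step 3: multiply by the total
print(round(count))  # 500

# Starting from the rounded value 0.63 instead gives a slightly lower estimate.
approx_count = math.exp(0.63) * p_dog * p_bark * total
print(round(approx_count, 1))
```

The second calculation shows why the recovery is only approximate in practice: any error in the dot product is exponentiated before it reaches the count.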

## Part 8: Complete Summary of GloVe

### The GloVe Process, Step by Step

**Phase 1: Build the co-occurrence matrix**
Count how often words appear together in the corpus.
- Result: 400,000 × 400,000 matrix with 160 billion cells
- Example: dog and bark co-occur 500 times

**Phase 2: Calculate PMI values for all unique word pairs**
For each unique word pair, calculate how much more frequently they co-occur than chance would predict.
- Calculate total unique pairs = 1,076
- Calculate P(dog, bark) = 500 / 1,076 = 0.464
- Calculate P(dog) = 560 / 1,076 = 0.520
- Calculate P(bark) = 511 / 1,076 = 0.475
- Calculate ratio: 0.464 / (0.520 × 0.475) = 0.464 / 0.247 = 1.88
- Apply log: log(1.88) = 0.63
- Result: PMI targets for all word pairs
- Example: PMI(dog, bark) = 0.63

**Phase 3: Initialize random embeddings**
Start with random 300-dimensional vectors for each word.
- Result: 400,000 × 300 word embedding matrix with random values

**Phase 4: Train embeddings via optimization**
Iteratively adjust embedding values so that dot products match PMI targets.
- Optimization method: stochastic gradient descent (the reference GloVe implementation uses the AdaGrad variant)
- Objective: Minimize error between dot products and PMI values across all word pairs
- Result: Trained embeddings where dot product(dog, bark) ≈ 0.63

**Phase 5: Use embeddings**
After training, compute semantic relationships using simple dot products.
- Compression: 300 dimensions instead of 400,000
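- Storage: about 120 million values for one 400,000 × 300 embedding matrix (about 240 million if two such factors are kept, as in Q6) instead of 160 billion cells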
- Query: Simple dot product instead of matrix lookup

### Why GloVe Works

- **Compression:** 300 dimensions capture the essential semantic information that would require 400,000 dimensions in the original matrix.
- **Efficiency:** Computing similarity is a simple dot product, not a matrix lookup.
- **Semantic meaning:** The dot product reflects meaningful semantic relationships captured by PMI, not just raw frequency.
- **Learned representations:** The embedding values are discovered through optimization to satisfy PMI constraints across all word pairs simultaneously.
- **Generalization:** Embeddings work for tasks beyond just co-occurrence because they encode deep semantic structure in the learned parameters.

## Part 9: SWIVEL - An Alternative Approach

### Q18: What is SWIVEL and how does it differ from GloVe?

SWIVEL stands for **Submatrix-wise Vector Embedding Learner**. Like GloVe, it's a count-based method that learns word embeddings from a co-occurrence matrix, but it uses a different approach:

**Key differences:**
- **GloVe:** Trains only on *observed* co-occurrences (word pairs that appear together)
- **SWIVEL:** Trains on *both* observed AND *unobserved* co-occurrences (word pairs that rarely or never appear together)

Both methods approximate the PMI matrix, but SWIVEL's special handling of unobserved co-occurrences makes it more accurate on rare words. SWIVEL is also considerably faster to train than GloVe in practice, for the reasons covered in Q25.

### Q19: Why does considering unobserved co-occurrences matter?

Let me illustrate with our dog/bark/cat/meow/tree example.

**The co-occurrence matrix (all observed pairs):**

| | dog | bark | cat | meow | tree |
|---|---|---|---|---|---|
| dog | 0 | 500 | 45 | 3 | 12 |
| bark | 500 | 0 | 8 | 2 | 1 |
| cat | 45 | 8 | 0 | 480 | 20 |
| meow | 3 | 2 | 480 | 0 | 5 |
| tree | 12 | 1 | 20 | 5 | 0 |

**What GloVe does:**
GloVe only trains on the 10 unique pairs with non-zero values. It completely ignores:
- The 5 diagonal cells (dog-dog, bark-bark, etc.)
- Any pairs that could be zero or very small due to corpus structure

This means GloVe has no constraint on where to place unrelated words' embeddings. If two words never co-occur, GloVe doesn't care if their embeddings point in similar or opposite directions—there's no training signal telling it to push them apart.

### Q20: How does SWIVEL handle unobserved/low co-occurrences?

SWIVEL treats pairs with different co-occurrence counts differently:

**For high co-occurrences (like dog-bark = 500):**
- Target PMI: log(1.88) = 0.63
- Train embedding_dog · embedding_bark to equal 0.63
- Use strong penalty if the dot product differs significantly

**For very low co-occurrences (like dog-meow = 3):**
- Total: 1,076 pairs
- P(dog, meow) = 3 / 1,076 = 0.0028
- P(dog) = 0.520, P(meow) = (3 + 2 + 480 + 5) / 1,076 = 490 / 1,076 = 0.455
- P(dog) × P(meow) = 0.520 × 0.455 = 0.237
- Ratio = 0.0028 / 0.237 = 0.0118 (much less than 1!)
- PMI = log(0.0118) ≈ -4.44

**SWIVEL enforces this:** The dot product should be approximately -4.44, not positive. This pushes the dog and meow embeddings in opposite directions, indicating they're unrelated.

GloVe does include this pair, since its count is nonzero, but gives it very little weight. What GloVe never sees are pairs whose count is exactly zero.

### Q21: What is the piecewise loss function in SWIVEL?

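SWIVEL uses different loss calculations depending on the co-occurrence count. The two-case version below is this guide's simplified model. In the published SWIVEL formulation, every observed pair (count > 0) uses the weighted squared error, and only zero-count pairs get a separate one-sided "soft hinge" loss, $\log(1 + e^{\,\text{dot product} - \text{PMI}^*})$, where $\text{PMI}^*$ is the PMI computed as if the count were 1. That loss only penalizes dot products that overestimate $\text{PMI}^*$.

**For high-count co-occurrences (high confidence):**
$$\text{Loss}_{\text{high}} = f(\text{count}) \times (\text{dot product} - \text{target PMI})^2$$

Where $f(\text{count})$ is a weighting function that:
- Gives higher weight to co-occurrences with higher counts
- This makes sense: if two words appear together 500 times, we're very confident they're related
- If they appear together only 2 times, we're less certain
- Common choices: $f(\text{count}) = \sqrt{\text{count}}$ (SWIVEL) or $\min(1, (\text{count}/x_{\max})^{0.75})$ (GloVe's weighting, with $x_{\max} = 100$ in the original paper)

**For low/zero-count co-occurrences:**
$$\text{Loss}_{\text{low}} = c \times (\text{dot product} - \text{target PMI})^2$$

Where $c$ is a small constant (like 0.75).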

**Why the difference?**
- Low-count and zero pairs are very numerous (most word pairs are rare or non-existent)
- If we weighted them equally, they'd overwhelm the training signal from high-count pairs
- By using a smaller weight $c$, SWIVEL balances the learning:
  - Learned relationships are primarily driven by high-count co-occurrences
  - But low/zero-count pairs provide regularization to prevent rare words from being placed randomly
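The simplified two-case loss described in this answer can be written as a single function. The cutoff `LOW_COUNT_THRESHOLD` is a hypothetical value chosen for this sketch, since the text does not say where "high" ends and "low" begins, and the target for (dog, meow) uses PMI ≈ -4.44 as recomputed from the Q20 counts.

```python
import math

LOW_COUNT_THRESHOLD = 5  # hypothetical cutoff; the guide does not specify one
C = 0.75                 # constant weight for low/zero-count pairs, as in the text

def simplified_swivel_loss(count, dot_product, target_pmi):
    """Squared error weighted by sqrt(count) for high counts, by C otherwise."""
    error = dot_product - target_pmi
    if count > LOW_COUNT_THRESHOLD:
        weight = math.sqrt(count)  # f(count) = sqrt(count)
    else:
        weight = C
    return weight * error ** 2

# (dog, bark): count 500, dot product 0.55, target PMI 0.63
print(round(simplified_swivel_loss(500, 0.55, 0.63), 3))  # 0.143
# (dog, meow): count 3, dot product 0.2, target PMI -4.44
print(round(simplified_swivel_loss(3, 0.2, -4.44), 1))    # 16.1
```

Even with the smaller weight, the low-count pair dominates here because its error is so much larger, which illustrates why the weighting has to be chosen with care.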

### Q22: Let's compare SWIVEL and GloVe training on dog/bark example

**Starting point:** Same random embeddings as GloVe
- embedding_dog = [0.1, 0.2, 0.3, 0.4, 0.5, ..., 0.15] (300 random numbers)
- embedding_bark = [0.1, 0.2, 0.3, 0.4, 0.5, ..., 0.20] (300 random numbers)
- Current dot product = 0.55

**GloVe training:**
- Sees the pair (dog, bark) with count = 500
- Target PMI = 0.63
- Error = 0.63 - 0.55 = 0.08
- Loss = 0.08² = 0.0064
- Adjusts embeddings to minimize this loss

**SWIVEL training (same iteration):**
- Also sees the pair (dog, bark) with count = 500
- Also targets PMI = 0.63
- Also calculates Error = 0.08
- Uses weighting function: f(500) = √500 ≈ 22.4 (higher counts get more weight)
- Loss = 22.4 × 0.08² ≈ 0.143
- Adjusts embeddings more aggressively

**Additionally, SWIVEL also processes low-count pairs:**
- Sees the pair (dog, meow) with count = 3
- Target PMI = -4.44
- Let's say random embeddings give dot product = 0.2
- Error = 0.2 - (-4.44) = 4.64
- Uses weighting constant: c = 0.75 (small weight for low-count)
- Loss = 0.75 × 4.64² ≈ 16.1
- **Pushes the embeddings to make dot product much more negative**

GloVe also sees this pair, because its count is nonzero, but only with a very small weight, so the constraint is enforced weakly. Pairs with a count of zero never enter GloVe's training at all.

### Q23: Training approach - Shards vs Direct Matrix Factorization

**GloVe's approach:**
- Iterates through observed co-occurrences one pair at a time
- Training time proportional to the number of nonzero cells (10 unique pairs in our example, covering 1,076 total co-occurrences)
- Very efficient for sparse matrices

**SWIVEL's approach:**
- Partitions the co-occurrence matrix into "shards" (submatrices)
- Each shard contains a block of rows and columns
- For each shard, processes all embeddings in that block simultaneously using matrix multiplication

**Example with our 5×5 matrix:**

Shard 1 (rows: dog, bark | columns: dog, bark, cat):
```
       dog  bark  cat
dog     0    500   45
bark   500    0    8
```

Shard 2 (rows: dog, bark | columns: meow, tree):
```
       meow  tree
dog       3    12
bark      2     1
```

And similar shards for other row groups.

**Why shards?**
- Can compute millions of dot products at once using vectorized matrix multiplication
- Parallelizable: different shards can be trained on different computers
- Makes SWIVEL much faster in practice despite processing more data
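The partitioning itself is simple to sketch. The function name `make_shards` and the block sizes are illustrative assumptions; a real implementation would hold each block as a dense array and compute its dot products with one matrix multiplication.

```python
def make_shards(matrix, row_block, col_block):
    """Yield (row_start, col_start, submatrix) blocks that tile the matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    for r in range(0, n_rows, row_block):
        for c in range(0, n_cols, col_block):
            block = [row[c : c + col_block] for row in matrix[r : r + row_block]]
            yield r, c, block

matrix = [
    [0, 500, 45, 3, 12],
    [500, 0, 8, 2, 1],
    [45, 8, 0, 480, 20],
    [3, 2, 480, 0, 5],
    [12, 1, 20, 5, 0],
]

# 2 rows x 3 columns per shard reproduces Shard 1 and Shard 2 above.
shards = list(make_shards(matrix, row_block=2, col_block=3))
print(len(shards))   # 6
print(shards[0][2])  # [[0, 500, 45], [500, 0, 8]]
print(shards[1][2])  # [[3, 12], [2, 1]]
```

Each shard depends only on the embeddings of its own rows and columns, which is what allows different shards to be handled by different workers.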

### Q24: Why is SWIVEL better on rare words?

**Problem with GloVe on rare words:**

Imagine a rare word like "xylophone" that appears in only 5 sentences total:
- It has very few observed co-occurrences (maybe just 5-10)
- Only 5-10 training signals during GloVe training
- Its embedding is under-constrained (not enough information to determine good placement)
- Result: May end up in a random location in embedding space

**Solution in SWIVEL:**

SWIVEL considers low/zero co-occurrence constraints:
- "xylophone" doesn't appear with most words (most pairs have count = 0 or very low)
- SWIVEL adds constraints: embedding_xylophone should have negative PMI with unrelated words
- These constraints regularize where the embedding can be placed
- Even with few observed co-occurrences, SWIVEL has many signals from low-count pairs
- Result: Better embedding placement for rare words

Empirically, SWIVEL significantly outperforms GloVe on evaluation benchmarks for rare words.

### Q25: Why is SWIVEL faster despite processing more data?

**GloVe:**
- Must iterate through each observed co-occurrence individually
- With a vocabulary of 400,000 words, millions of co-occurrence pairs
- Each pair processed sequentially: slow but simple

**SWIVEL:**
- Processes entire shards using matrix multiplication
- Modern GPUs are highly optimized for matrix multiplication (can do thousands in parallel)
- Can compute thousands of dot products in a single batch operation
- Processes both high-count and low-count pairs together in vectorized operations

**Trade-off:**
- SWIVEL requires computation proportional to the entire matrix (including low-count cells)
- But the vectorized GPU operations more than make up for it
- Result: SWIVEL is typically 2-5x faster than GloVe in practice

Illustrative timing (not a measured benchmark):
- GloVe: 2 hours to train on a large corpus
- SWIVEL: roughly 24-60 minutes on the same corpus, at a 2-5x speedup

### Q26: Complete comparison: GloVe vs SWIVEL

| Aspect | GloVe | SWIVEL |
|--------|-------|--------|
| **Co-occurrence handling** | Nonzero (observed) counts only | All cells, including zero counts |
| **Target function** | Weighted squared error on nonzero cells | Piecewise loss (different weights) |
| **Training data** | Nonzero cells only (a small fraction of the matrix) | Entire matrix (~160 billion cells) |
| **Processing method** | Sequential pair iteration | Vectorized shard processing |
| **Parallelization** | Difficult | Easy (different machines handle shards) |
| **Accuracy (average)** | Baseline | Slightly better overall |
| **Accuracy (rare words)** | Poor | Significantly better |
| **Training speed** | Slow | Fast (2-5x faster) |
| **Final output** | 300D embedding per word | 300D embedding per word |
| **PMI reconstruction** | Approximates PMI for observed pairs | Approximates PMI for all pairs |

**Which to use?**
- **GloVe:** Simpler, well-established, good general-purpose embeddings
- **SWIVEL:** Better for rare words, faster training, production systems with large datasets

### Q27: SWIVEL training process step-by-step with dog/bark example

**Phase 1: Initialize embeddings** (same as GloVe)
- embedding_dog = [0.1, 0.2, 0.3, ..., 0.15]
- embedding_bark = [0.1, 0.2, 0.3, ..., 0.20]
- And embeddings for cat, meow, tree

**Phase 2: Create shards**
SWIVEL partitions the 5×5 matrix into shards. For example:

Shard A: rows [dog, bark], columns [dog, bark, cat]
Shard B: rows [dog, bark], columns [meow, tree]
Shard C: rows [cat, meow], columns [dog, bark, cat]
...and more shards

**Phase 3: Process Shard A**
- Compute all dot products for rows [dog, bark] × columns [dog, bark, cat]
- For dog×dog: count = 0 (diagonal, usually ignored or special handling)
- For dog×bark: count = 500, Target PMI = 0.63, Current dot product = 0.55, Loss = f(500) × (0.55-0.63)² = 22.4 × 0.0064 ≈ 0.143
- For dog×cat: count = 45, P(cat) = 553/1,076 ≈ 0.514, Calculate PMI = log(45/(1076 × 0.520 × 0.514)) ≈ log(0.156) ≈ -1.86, Current = 0.3, Loss = f(45) × (0.3-(-1.86))² = 6.7 × 4.67 ≈ 31.3
- For bark×dog: Same as dog×bark (symmetric)
- For bark×bark: Special handling (diagonal)
- For bark×cat: count = 8, Calculate PMI, compute loss similarly

Total loss for Shard A: 0.143 + 31.3 + ... (more terms)

**Phase 4: Backpropagate and update embeddings**
- Calculate gradients for all embedding vectors in Shard A
- Update: embedding_dog ← embedding_dog - learning_rate × gradient_dog
- Update: embedding_bark ← embedding_bark - learning_rate × gradient_bark
- Update: embedding_cat ← embedding_cat - learning_rate × gradient_cat

**Phase 5: Process Shard B**
- Compute dot products for rows [dog, bark] × columns [meow, tree]
- For dog×meow: count = 3, Target PMI = -4.44, Current = 0.2, Loss = c × (0.2 - (-4.44))² = 0.75 × 21.5 ≈ 16.1 (LOW-COUNT constraint!)
- For dog×tree: count = 12, Calculate PMI, compute loss
- For bark×meow, bark×tree: Similar calculations

**The key difference:** SWIVEL processes dog×meow (count=3) with PMI = -4.44, and would also process any other very-low-count pair with an appropriately negative PMI target.

**Phase 6: Repeat**
- Cycle through all shards multiple times (epochs)
- Due to vectorization, processing all shards is much faster than GloVe's sequential approach
- After training completes, embeddings satisfy PMI constraints across all pairs

### Q28: Final embeddings comparison

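After training, both GloVe and SWIVEL produce embeddings of the same shape. The values below are illustrative; the two methods would not land on identical numbers: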
- embedding_dog = [0.42, -0.18, 0.55, 0.10, 0.45, ..., -0.22] (300 dimensions)
- embedding_bark = [0.48, -0.15, 0.61, 0.08, 0.40, ..., -0.20] (300 dimensions)
- dot product(dog, bark) ≈ 0.63

**Differences in other embeddings:**

For rare word "xylophone" (appears only 5 times):

**GloVe embedding:**
- Trained on just 5 observed co-occurrences
- Under-constrained: could be almost anywhere
- May be placed randomly
- Example: embedding_xylophone = [0.05, 0.08, 0.02, ...] (could be anywhere)

**SWIVEL embedding:**
- Trained on 5 observed co-occurrences PLUS constraints from hundreds of low-count pairs
- Well-constrained: must have negative PMI (around -2 to -3) with most unrelated words
- More likely to end up in a meaningful location away from unrelated words
- Better performance on rare word similarity tests
- Example: embedding_xylophone = [0.35, -0.42, 0.28, ...] (placed in meaningful location)

Both methods produce 300-dimensional embeddings that compress the information, but SWIVEL uses regularization from low-count pairs to produce better rare word representations.

## Key Takeaways

1. **Co-occurrence matrices are symmetric:** Count unique pairs only once, not twice
2. **PMI captures semantic relationships:** By comparing actual to expected-by-chance co-occurrence
3. **GloVe compresses via learned embeddings:** 300 dimensions instead of 400,000, trained to match PMI targets
4. **SWIVEL adds robustness:** By including low-count pairs in training, improving rare word representations
5. **Speed vs sophistication:** GloVe is simpler; SWIVEL is faster and more accurate on rare words
6. **Embedding values are learned:** Not calculated from formulas, discovered through optimization to satisfy PMI constraints