# 🌐 Chapter 3: Word Embeddings

## 📌 Overview  
Recall that Bag of Words (BoW) and TF-IDF represent text as sparse vectors in which each dimension corresponds to a unique vocabulary word. These representations fail to capture **semantic relationships** between words: "king" and "queen" look as unrelated as any other pair of words.

**Word Embeddings** solve this problem by mapping words into dense, low-dimensional vectors where similar words have similar representations. These embeddings are learned from large text corpora.

Popular embedding methods:
- Word2Vec (Skip-gram, CBOW)
- GloVe (Global Vectors)
- FastText



### How Word2Vec Works (Skip-gram with Negative Sampling)

---

### 🟢 Step 1: Define Corpus, Vocabulary, and Input Representation

- **Example sentence**:  
  `"The cat sits"`

- **Vocabulary (V = 3)**:  
  `{the, cat, sits}`

  | Word   | Index |
  |--------|-------|
  | the    | 0     |
  | cat    | 1     |
  | sits   | 2     |

  where the vocabulary size $V = 3$.

- **One-hot encoding (for input)**:
  Each word becomes a vector with a 1 at its index and 0 elsewhere. For example, "cat" (index 1) maps to $x_{\text{cat}}$:

  $$
  x_{\text{the}} =
  \begin{bmatrix}
  1 \\
  0 \\
  0
  \end{bmatrix},
  \;
  x_{\text{cat}} =
  \begin{bmatrix}
  0 \\
  1 \\
  0
  \end{bmatrix},
  \;
  x_{\text{sits}} =
  \begin{bmatrix}
  0 \\
  0 \\
  1
  \end{bmatrix}
  $$


- Shape of each $ x $: $ V \times 1 = 3 \times 1 $
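
The one-hot encoding above can be sketched in a few lines of NumPy. The helper name `one_hot` and the `vocab` dictionary are illustrative choices, not part of any library.

```python
import numpy as np

# Vocabulary from the example sentence "The cat sits" (lowercased)
vocab = {"the": 0, "cat": 1, "sits": 2}

def one_hot(word, vocab):
    """Return a V x 1 column vector with a 1 at the word's index."""
    x = np.zeros((len(vocab), 1))
    x[vocab[word], 0] = 1.0
    return x

x_cat = one_hot("cat", vocab)
print(x_cat.ravel())  # [0. 1. 0.]
print(x_cat.shape)    # (3, 1)
```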

---

### 🟡 Step 2: Network Architecture (Embedding Layer)

- **Input weight matrix $ W $**:
- Shape: $ V \times d $, where $d$ is the embedding dimension (a hyperparameter).
- Each row of $ W $ is the embedding for one word.

- **Example when $ d = 3 $**:
$$
W = 
\begin{bmatrix}
0.2 & -0.3 & 0.1 \quad\text{← embedding of "the"}\\
0.5 & 0.1  & -0.4 \quad\text{← embedding of "cat"}\\
\vdots & \vdots & \vdots
\end{bmatrix}
$$

- **Hidden layer output (embedding lookup)**:
$$
h = W^\top x
$$
- Because $x$ is one-hot, this simply selects the row of $ W $ for the center word. Stacking several one-hot vectors as the columns of a matrix $X$ performs all lookups at once: $W^\top X$.
- Shape of $ h $: $ d \times 1 $
- In the running example ($d = 2$):

  Input weight matrix $W$ (input embeddings):
  Embedding dimension $d = 2$.  
  Shape of $W$: $3 \times 2$ (vocabulary size $V = 3$, embedding size $d = 2$).

  $$
  W =
  \begin{bmatrix}
  0.2 & -0.1 \\
  0.7 & 0.3 \\
  -0.5 & 0.6
  \end{bmatrix}
  \Rightarrow
  W^\top =
  \begin{bmatrix}
  0.2 & 0.7&-0.5 \\
  -0.1 & 0.3 & 0.6
  \end{bmatrix}
  $$


- Embedding Lookup (Hidden Layer Output):
  For the input word "cat":

  $$
  h = W^\top x_{\text{cat}}
  $$

  Result (selects the 2nd row of $W$):

  $$
  h = u_{\text{cat}} =
  \begin{bmatrix}
  0.7 \\
  0.3
  \end{bmatrix}
  $$
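
As a quick check of the lookup above, the sketch below multiplies $W^\top$ by the one-hot vector for "cat" and compares the result with plain row indexing. It reuses the example values of $W$; in practice, libraries index the row directly instead of doing the matrix product.

```python
import numpy as np

# Input embedding matrix W from the example (V = 3 rows, d = 2 columns)
W = np.array([[0.2, -0.1],
              [0.7,  0.3],
              [-0.5, 0.6]])

# One-hot column vector for "cat" (index 1)
x_cat = np.array([[0.0], [1.0], [0.0]])

h = W.T @ x_cat        # matrix form of the lookup, shape (2, 1)
u_cat = W[1]           # equivalent direct row lookup, shape (2,)

print(h.ravel())       # [0.7 0.3]
print(np.allclose(h.ravel(), u_cat))  # True
```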


---

### 🟠 Step 3: Compute Context Scores

- **Output weight matrix $ W'$**:
- Shape: $ d \times V $
- Each column of $ W'$ represents the output vector for one word.

- **Score calculation** (dot product between center and context embeddings):
$$
s_{w_o} = u_{w_t}^\top v_{w_o}
$$
- $ u_{w_t} $: input embedding of the center word.
- $ v_{w_o} $: output embedding of the context word.
- Output weight matrix $W'$ (context embeddings):
  Shape of $W'$: $2 \times 3$.

  $$
  W' =
  \begin{bmatrix}
  0.1 & -0.2 & 0.4 \\
  0.6 & 0.5 & -0.3
  \end{bmatrix}
  $$

  - Column 0 → $v_{\text{the}} = \begin{bmatrix} 0.1 \\ 0.6 \end{bmatrix}$
  - Column 1 → $v_{\text{cat}} = \begin{bmatrix} -0.2 \\ 0.5 \end{bmatrix}$
  - Column 2 → $v_{\text{sits}} = \begin{bmatrix} 0.4 \\ -0.3 \end{bmatrix}$

- Dot Product Scores:
  $$
  s_{\text{the}} = (0.7)(0.1) + (0.3)(0.6) = 0.25
  $$

  $$
  s_{\text{cat}} = (0.7)(-0.2) + (0.3)(0.5) = 0.01
  $$

  $$
  s_{\text{sits}} = (0.7)(0.4) + (0.3)(-0.3) = 0.19
  $$
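
The three dot products above can be computed in one step as $u_{\text{cat}}^\top W'$. The following is a minimal sketch using the example values; the variable name `W_out` stands in for $W'$.

```python
import numpy as np

u_cat = np.array([0.7, 0.3])            # input embedding of the center word "cat"
W_out = np.array([[0.1, -0.2, 0.4],     # output matrix W' (d = 2 rows, V = 3 columns)
                  [0.6,  0.5, -0.3]])

# One score per vocabulary word: [s_the, s_cat, s_sits]
scores = u_cat @ W_out
print(np.round(scores, 2))  # approximately [0.25 0.01 0.19]
```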

---

### 🟣 Step 4: Probability via Softmax

- **Softmax probability** of predicting context word $w_o $ given center word $w_t $:
$$
p(w_o \mid w_t) = \frac{\exp(u_{w_t}^\top v_{w_o})}{\sum_{w=1}^{V} \exp(u_{w_t}^\top v_w)}
$$
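
Applying the softmax to the scores from Step 3 gives the predicted context distribution for the center word "cat". This is a small sketch; subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # stability: avoids large exponents
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

scores = np.array([0.25, 0.01, 0.19])   # s_the, s_cat, s_sits from Step 3
probs = softmax(scores)
print(np.round(probs, 3))  # approximately [0.367 0.288 0.345]
```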

---

### 🔴 Step 5: Loss Function (Negative Log-Likelihood)

- **Loss for one true pair $(w_t, w_o)$**:
$$
L = -\log p(w_o \mid w_t)
$$
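- The loss is small when the true pair's dot product is large relative to the scores of all other words, so minimizing it pulls the embeddings of true (center, context) pairs closer together.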
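
For the running example, the loss of the true pair ("cat", "sits") can be computed directly from the Step 3 scores. The function name `nll_loss` is an illustrative choice.

```python
import numpy as np

def nll_loss(scores, target_index):
    """Negative log-likelihood of the target word under a softmax over scores."""
    shifted = scores - np.max(scores)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))  # log-softmax
    return float(-log_probs[target_index])

scores = np.array([0.25, 0.01, 0.19])    # s_the, s_cat, s_sits
loss = nll_loss(scores, target_index=2)  # true context word: "sits"
print(round(loss, 4))  # approximately 1.0637
```

With these untrained weights the three probabilities are nearly uniform, so the loss is close to $\log 3 \approx 1.0986$.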

---

### ⚡ Step 6: Negative Sampling (Efficient Training)

- **Why negative sampling?**
- Computing the full softmax is too slow when $ V $ is large, because the denominator sums over every word in the vocabulary.
- Instead, train a **binary classifier**:
  - Positive pair: $ (w_t, w_o) $ → label 1.
  - Negative pairs: $ (w_t, w_{\text{neg}}) $ → label 0.

- **Negative sampling loss**:
$$
L = -\log \sigma(u_{w_t}^\top v_{w_o}) - \sum_{i=1}^{k} \log \sigma(-u_{w_t}^\top v_{w_{\text{neg}, i}})
$$
- $ \sigma(z) = \frac{1}{1 + e^{-z}} $ is the sigmoid function.
- Only involves true context and $ k $ negative samples.
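
The negative sampling loss can be sketched with the example vectors. Here the positive pair is ("cat", "sits") and "the" plays the role of a single negative sample ($k = 1$). With a three-word vocabulary that choice is purely for illustration; real training draws negatives at random from a large vocabulary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(u, v_pos, v_negs):
    """u: center vector (d,), v_pos: true context vector (d,), v_negs: negatives (k, d)."""
    pos_term = -np.log(sigmoid(u @ v_pos))
    neg_term = -np.sum(np.log(sigmoid(-(v_negs @ u))))
    return float(pos_term + neg_term)

u_cat = np.array([0.7, 0.3])        # center word "cat"
v_sits = np.array([0.4, -0.3])      # true context word "sits"
v_negs = np.array([[0.1, 0.6]])     # one illustrative negative sample ("the")

loss = negative_sampling_loss(u_cat, v_sits, v_negs)
print(round(loss, 4))  # approximately 1.4286
```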

---

### 🟤 Step 7: Backpropagation and Updates

- **Compute gradients w.r.t.:**
- Center embedding $ u_{w_t} $
- Output embedding $ v_{w_o} $
- Negative samples $ v_{w_{\text{neg}}} $

- **Update embeddings with a gradient step** (plain SGD is shown; optimizers such as Adam use the same gradients):
$$
u_{w_t} \leftarrow u_{w_t} - \eta \frac{\partial L}{\partial u_{w_t}}
$$
$$
v_{w_o} \leftarrow v_{w_o} - \eta \frac{\partial L}{\partial v_{w_o}}
$$
$$
v_{w_{\text{neg}, i}} \leftarrow v_{w_{\text{neg}, i}} - \eta \frac{\partial L}{\partial v_{w_{\text{neg}, i}}}
$$
- $ \eta $: learning rate.
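
One SGD step for the negative sampling loss can be sketched as follows. The gradients are derived by hand from the loss in Step 6: the derivative with respect to the positive score is $\sigma(u^\top v_{w_o}) - 1$, and with respect to each negative score it is $\sigma(u^\top v_{w_{\text{neg}}})$. The example vectors, the learning rate of 0.1, and the function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ns_loss(u, v_pos, v_negs):
    return float(-np.log(sigmoid(u @ v_pos)) - np.sum(np.log(sigmoid(-(v_negs @ u)))))

def sgd_step(u, v_pos, v_negs, lr):
    """Return updated copies of (u, v_pos, v_negs) after one SGD step."""
    g_pos = sigmoid(u @ v_pos) - 1.0     # dL/d(positive score), a scalar
    g_negs = sigmoid(v_negs @ u)         # dL/d(negative scores), shape (k,)
    grad_u = g_pos * v_pos + g_negs @ v_negs
    grad_v_pos = g_pos * u
    grad_v_negs = np.outer(g_negs, u)    # one gradient row per negative sample
    return u - lr * grad_u, v_pos - lr * grad_v_pos, v_negs - lr * grad_v_negs

u = np.array([0.7, 0.3])             # center word "cat"
v_pos = np.array([0.4, -0.3])        # true context word "sits"
v_negs = np.array([[0.1, 0.6]])      # one illustrative negative sample

loss_before = ns_loss(u, v_pos, v_negs)
u, v_pos, v_negs = sgd_step(u, v_pos, v_negs, lr=0.1)
loss_after = ns_loss(u, v_pos, v_negs)
print(loss_after < loss_before)  # True
```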

---

### ✅ Step 8: Final Output — Learned Word Embeddings

- After training, **rows of $ W $** are the final learned word embeddings.
- Similar words have embeddings that are close together in the vector space.
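
"Close together" is usually measured with cosine similarity between rows of $W$. The sketch below uses the untrained example matrix, so the resulting number carries no meaning; it only shows the computation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = {"the": 0, "cat": 1, "sits": 2}
W = np.array([[0.2, -0.1],
              [0.7,  0.3],
              [-0.5, 0.6]])

sim = cosine_similarity(W[vocab["cat"]], W[vocab["sits"]])
print(sim)
```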

---

### 🌟 Summary of Key Variables

| Symbol                | Meaning                                | Shape             |
|------------------------|-----------------------------------------|-------------------|
| $ x $                | One-hot vector of center word           | $ V \times 1 $   |
| $ W $                | Input weight matrix (embeddings)        | $ V \times d $   |
| $ W' $               | Output weight matrix (context vectors)  | $ d \times V $   |
| $ u_{w_t} $          | Embedding of center word                | $ d \times 1 $   |
| $ v_{w_o} $          | Embedding of context/output word        | $ d \times 1 $   |
| $ s_{w_o} $          | Score (dot product)                     | Scalar            |
| $ p(w_o \mid w_t) $  | Probability from softmax                | Scalar            |
| $ L $                | Loss                                    | Scalar            |

---




## 1️⃣ Introduction to Word2Vec  
**Idea:** Words that appear in similar contexts have similar embeddings (distributional hypothesis).  
- **CBOW (Continuous Bag of Words):** Predicts a word from surrounding context.  
- **Skip-gram:** Predicts surrounding context from a given word.
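
The difference between the two objectives shows up in how training examples are built from a sentence. The sketch below generates both kinds of examples for "the cat sits" with a window of 1; the function names are illustrative.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: one (center, context) pair per neighbor inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW: one (context words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["the", "cat", "sits"]
print(skipgram_pairs(tokens, window=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sits'), ('sits', 'cat')]
print(cbow_examples(tokens, window=1))
# [(['cat'], 'the'), (['the', 'sits'], 'cat'), (['cat'], 'sits')]
```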

**Example using `gensim`:**

In [2]:
%pip install gensim


In [3]:
from gensim.models import Word2Vec  # Import Word2Vec
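from nltk.tokenize import word_tokenize  # Tokenizer for splitting sentences

# Note: word_tokenize needs NLTK's Punkt tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases use 'punkt_tab' instead)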

# Sample corpus
sentences = [
    "Natural language processing is fun",
    "Machine learning is a part of artificial intelligence",
    "Word embeddings capture semantic meaning"
]

# Tokenize the corpus (split each sentence into words)
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train the Word2Vec model using Skip-gram (sg=1) with vector size of 50
model = Word2Vec(sentences=tokenized_sentences, vector_size=50, window=2, min_count=1, sg=1)

# Get the embedding vector for the word 'machine'
print("Embedding for 'machine':\n", model.wv['machine'])

# Check similarity between two words
print("Similarity between 'machine' and 'learning':", model.wv.similarity('machine', 'learning'))



Embedding for 'machine':
 [-0.01648536  0.01859871 -0.00039532 -0.00393455  0.00920726 -0.00819063
  0.00548623  0.01387993  0.01213085 -0.01502159  0.0187647   0.00934362
  0.00793224 -0.01248701  0.01691996 -0.00430033  0.01765038 -0.01072401
 -0.01625884  0.01364912  0.00334239 -0.00439702  0.0190272   0.01898771
 -0.01954809  0.00501046  0.01231338  0.00774491  0.00404557  0.000861
  0.00134726 -0.00764127 -0.0142805  -0.00417774  0.0078478   0.01763737
  0.0185183  -0.01195187 -0.01880534  0.01952875  0.00685957  0.01033223
  0.01256469 -0.00560853  0.01464541  0.00566054  0.00574201 -0.00476074
 -0.0062565  -0.00474028]
Similarity between 'machine' and 'learning': 0.11255005


## 2️⃣ GloVe Embeddings

**Idea:** GloVe learns from global word co-occurrence statistics aggregated over the entire corpus, rather than from one local context window at a time.
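
To make "co-occurrence statistics" concrete, the sketch below counts how often each pair of words appears within a fixed window across a toy corpus. It only illustrates the counting step, under simplifying assumptions: plain counts, a made-up two-sentence corpus, and an illustrative function name. GloVe itself then fits word vectors whose dot products approximate the logarithm of such counts.

```python
from collections import Counter

def cooccurrence_counts(sentences, window):
    """Count (word, neighbor) pairs within `window` positions, over all sentences."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy corpus (made up for illustration)
corpus = [["the", "cat", "sits"], ["the", "dog", "sits"]]
counts = cooccurrence_counts(corpus, window=2)

print(counts[("the", "sits")])  # 2
print(counts[("the", "cat")])   # 1
```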

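**Example: using pretrained GloVe embeddings.** The `glove.6B` vectors used below were trained on Wikipedia 2014 and Gigaword 5; the GloVe project also publishes vectors trained on Common Crawl.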

In [6]:
# Download pretrained GloVe from: https://nlp.stanford.edu/projects/glove/
# Example: 'glove.6B.50d.txt' contains 50-dimensional vectors
# Each line of the text file holds a word followed by its vector values, e.g. for 'to':
# to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 .... -0.064699 -0.26044  (50 values in total)

import numpy as np

# Load the GloVe embeddings (assuming the file 'glove.6B.50d.txt' is downloaded)
glove_embeddings = {}
with open('data/glove.6B.50d.txt', 'r', encoding='utf-8') as f: # text file saved in 'data' folder
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        glove_embeddings[word] = vector

# Example: Get embedding for 'machine'
print("GloVe embedding for 'machine':\n", glove_embeddings.get('machine'))


GloVe embedding for 'machine':
 [-0.34165   -0.81267    1.4513     0.05914   -0.080801   0.39567
  0.10064   -0.5468    -0.18887    0.11364   -0.040956  -0.5637
 -0.32191    0.15968   -0.59756   -0.14571   -0.77074    1.2955
 -0.72002   -0.90818    0.76644    0.05346   -0.0031632 -0.15341
  0.22065   -1.191     -1.0775    -0.29768    1.327     -0.51359
  2.6229    -0.67411   -0.82558    0.14283   -0.014214   0.90775
  0.66828    0.48431    0.1543     0.26044    1.0191     0.015872
 -0.75325    0.58992    0.4546    -0.19678    0.42138   -0.43168
  0.11985    0.14094  ]
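
Once the vectors are in a dictionary like `glove_embeddings`, a common next step is to look up a word's nearest neighbors by cosine similarity. The sketch below is a minimal version; the `most_similar` helper and the tiny `toy_embeddings` dictionary (made-up 2-dimensional vectors) are assumptions so the example runs without the GloVe file. Passing the real `glove_embeddings` dictionary works the same way.

```python
import numpy as np

def most_similar(word, embeddings, topn=3):
    """Return the `topn` words whose vectors have the highest cosine similarity to `word`."""
    query = embeddings[word] / np.linalg.norm(embeddings[word])
    scored = []
    for other, vec in embeddings.items():
        if other == word:
            continue
        scored.append((other, float(query @ vec / np.linalg.norm(vec))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up vectors for illustration only (not real GloVe values)
toy_embeddings = {
    "machine":  np.array([1.0, 0.0], dtype="float32"),
    "computer": np.array([0.9, 0.1], dtype="float32"),
    "banana":   np.array([0.0, 1.0], dtype="float32"),
}

neighbors = most_similar("machine", toy_embeddings, topn=2)
print([w for w, _ in neighbors])  # ['computer', 'banana']
```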


## 3️⃣ FastText Embeddings

**Idea:** FastText represents each word through its subword units (character n-grams), which helps it handle out-of-vocabulary (OOV) words better than Word2Vec and GloVe.

Because a word's vector is built from its n-gram vectors, FastText can generate embeddings for unseen words from their subword components.

**Example:**

In [5]:
from gensim.models import FastText  # Import FastText

# Use the same corpus: `sentences` and `word_tokenize` come from the Word2Vec example above
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train FastText model
fasttext_model = FastText(sentences=tokenized_sentences, vector_size=50, window=2, min_count=1)

# Get vector for a word
print("FastText embedding for 'machine':\n", fasttext_model.wv['machine'])


FastText embedding for 'machine':
 [-1.8079143e-03  2.1979981e-03 -1.7691230e-03  5.6677201e-04
  1.2654356e-03  2.1495651e-03 -1.5551907e-03  4.3800026e-03
 -2.0496619e-03  5.3299527e-04 -2.5257603e-03  1.6439921e-03
  3.4959912e-03 -9.7273325e-05  2.1534071e-03  2.0341277e-03
 -5.6405651e-04  1.8103337e-03  4.1687866e-03  7.4633321e-04
 -3.0441333e-03 -3.0280279e-03  3.9849104e-03 -7.3370530e-04
 -1.7331528e-03  1.6396311e-03 -6.8702095e-04 -2.2539324e-03
  5.6145241e-04  1.4721482e-03 -2.8888162e-03 -2.2243629e-03
  2.1639713e-03 -1.2766268e-03  6.0394765e-03  4.9851830e-03
  3.3022531e-03  1.5956949e-03 -4.3668048e-03  1.5206628e-03
 -2.3396676e-03  7.1912521e-04  2.4290388e-03 -5.5817286e-03
  2.9966333e-03 -6.4665275e-03 -5.4450257e-04 -2.2184665e-03
  1.3568229e-03 -5.2718865e-03]


## 🧩 Why Use Word Embeddings?

| Method          | Sparse/Dense    | Captures Word Meaning? | Handles OOV Words?      |
|-----------------|-----------------|-----------------------|------------------------|
| BoW / TF-IDF    | Sparse           | ❌ No                  | ❌ No                   |
| Word2Vec        | Dense            | ✅ Yes                 | ❌ No                   |
| GloVe           | Dense            | ✅ Yes                 | ❌ No                   |
| FastText        | Dense            | ✅ Yes                 | ✅ Yes (via subwords)   |


## ✅ Answers to Practice Questions (Word Embeddings)

### 1️⃣ Why are word embeddings better than BoW or TF-IDF for capturing meaning?
Word embeddings (like Word2Vec, GloVe, FastText) map words into dense vectors where **semantically similar words are closer together in the vector space**. Unlike BoW or TF-IDF, which only count word occurrences and ignore word order or meaning, embeddings capture relationships between words (e.g., "king" is close to "queen", "Paris" is close to "France").

---

### 2️⃣ What is the main difference between Skip-gram and CBOW in Word2Vec?
- **CBOW (Continuous Bag of Words):** Predicts the target word based on its surrounding context words.
- **Skip-gram:** Predicts the surrounding context words given the target word.
- Typically, **Skip-gram works better for small datasets** and rare words, while **CBOW is faster on large datasets**.

---

### 3️⃣ How does FastText handle words that it has not seen during training?
FastText breaks words into **subword units (character n-grams)**. This allows it to create word vectors by combining the vectors of these subwords. Even if a word was not in the training data (out-of-vocabulary, OOV), FastText can generate a vector based on its subword pieces, making it more robust to rare or unseen words.

Example:  
With 3-character n-grams and the boundary markers `<` and `>` that FastText adds around each word, **"running"** is broken into `"<ru"`, `"run"`, `"unn"`, `"nni"`, `"nin"`, `"ing"`, `"ng>"`.
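
The subword split can be reproduced with a short helper. This is a sketch of the n-gram extraction only, assuming FastText's convention of wrapping the word in `<` and `>`; the function name `char_ngrams` is illustrative. The defaults of 3 to 6 characters match FastText's usual settings, and the example restricts the range to 3 to keep the output short.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of `<word>` with lengths n_min..n_max."""
    wrapped = f"<{word}>"   # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("running", n_min=3, n_max=3))
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>']
```

An unseen word such as "runner" shares n-grams like `<ru` and `run` with "running", which is why its vector can still be composed from known pieces.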

---
