# 🧠 Word2Vec – Skip-Gram Model (In-Depth Intuition)

---

## 📌 What is Skip-Gram?

The **Skip-Gram** model is the second architecture introduced in the original **Word2Vec** paper by Mikolov et al. Unlike CBOW, which predicts the **center word from its context**, **Skip-Gram does the opposite**:

> 🎯 **It predicts the context words given the center word.**

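Because every occurrence of a word produces several separate training pairs, even infrequent words receive a useful number of updates. This is commonly cited as the reason Skip-Gram learns high-quality embeddings for **rare or infrequent words**.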

---

## 🔄 How Does It Work?

Given a sentence like:  
`The quick brown fox jumps over the lazy dog`

Let’s assume:
- **Window size = 2**
- **Target word = "brown"**

Then Skip-Gram will try to predict the surrounding context:
- `("brown", "the")`
- `("brown", "quick")`
- `("brown", "fox")`
- `("brown", "jumps")`

So we generate **multiple (input, output) pairs**, all centered around the same target word.
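
The pair generation above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `skipgram_pairs` is made up for this example, and the sentence is lowercased and split on whitespace for simplicity.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in `tokens`."""
    pairs = []
    for i, center in enumerate(tokens):
        # The context spans up to `window` positions on each side,
        # clipped at the start and end of the sentence.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=2)

# Keep only the pairs whose center word is "brown".
brown_pairs = [p for p in pairs if p[0] == "brown"]
print(brown_pairs)
```

Words near the sentence boundaries produce fewer pairs because their window is clipped. The original word2vec tool also shrinks the window at random for each center word, which this sketch leaves out.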

---

## 🧠 Intuition Behind Skip-Gram

Skip-Gram turns one word into **many training examples**. For each word in the corpus, it creates up to `2 × window` (center → context) pairs, with fewer near the start and end of a sentence.

- **Input**: One center word (e.g., `"brown"`)
- **Output**: One of its surrounding context words (e.g., `"fox"`)

The model learns to associate the center word with the types of words that tend to appear near it.

---

## 🧠 Skip-Gram: Training Objective and Loss Function 

The **Skip-Gram** model learns word embeddings by doing one simple thing:

> Given a word (the **center word**), try to **predict the words around it** (the **context words**).

---

### 🧠 Example:

**Sentence:**  
`The quick brown fox jumps`

Let’s say the center word is `"brown"` and we use a window size of 2.  
Then we’ll try to predict the following pairs:

- `"brown"` → `"the"`
- `"brown"` → `"quick"`
- `"brown"` → `"fox"`
- `"brown"` → `"jumps"`

So for each center word, the model generates multiple training examples — one for each nearby word.

---

## 🎯 What Does the Model Learn?

The model starts with **random word vectors**. During training, it adjusts these vectors so that words that appear in similar contexts end up **close together** in the vector space.

For example:
- `"king"` and `"queen"` appear in similar contexts like `"royal"`, `"palace"`, `"throne"` → they end up with similar vectors
- `"apple"` and `"banana"` may appear in fruit-related sentences → also close together
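
"Close together" is commonly measured with cosine similarity. The sketch below uses hand-picked 3-dimensional toy vectors purely for illustration; they are not trained embeddings, and real models typically use 100 to 300 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-picked toy vectors for illustration only; not trained embeddings.
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.8, 0.9, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_mixed = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print(f"king ~ queen:  {sim_royal:.3f}")
print(f"king ~ banana: {sim_mixed:.3f}")
```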

---

## 💡 The Goal (Loss Function)

The goal is to make the model:
- **Increase the score** (the dot product of the two words' vectors) when the context word is correct (like `"brown"` → `"fox"`)
- **Decrease the score** when the context word is wrong (like `"brown"` → `"car"`)

The model repeats this process for **millions of word pairs**, slowly improving the word vectors so they reflect real-world meaning.
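
One way to make "score" concrete is the softmax formulation: take the dot product of the center word's vector with each candidate's output vector, turn the scores into probabilities, and use the negative log-probability of the true context word as the loss. The vectors below are hand-picked toy values and the three-word vocabulary is an assumption for the sake of the sketch.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values chosen by hand for illustration; a real model learns these.
center_vec = [1.0, 0.5]            # input vector for "brown"
output_vecs = {
    "fox":   [0.9, 0.4],
    "quick": [0.7, 0.6],
    "car":   [-0.8, 0.1],
}

words = list(output_vecs)
scores = [dot(center_vec, output_vecs[w]) for w in words]
probs = dict(zip(words, softmax(scores)))

# Cross-entropy loss for the true pair ("brown" -> "fox"): lower is better.
loss = -math.log(probs["fox"])
print(probs, loss)
```

Raising the score of `"fox"` relative to the others lowers the loss, which is the behavior the two bullets describe.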

---

## ⚙️ Why is Training Hard?

To turn scores into probabilities, the model has to compare the center word against **every word in the vocabulary**, and it has to do so for every training pair. With vocabularies of hundreds of thousands or even millions of words, that is too slow.
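
The cost is visible in the softmax formula used by the full model. The notation here is an assumption introduced for this note: $v_{w_I}$ is the input vector of the center word, $v'_{w}$ is the output vector of word $w$, and $|V|$ is the vocabulary size.

$$
p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{|V|} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
$$

The denominator sums over all $|V|$ words, so each training pair costs on the order of $|V|$ dot products.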

---

## ⚡ The Solution: Negative Sampling

To fix this, Word2Vec uses **negative sampling**:
- Instead of checking every possible word, the model only checks a **small number of incorrect words** (called "negative samples")
- For example: it learns that `"brown"` → `"fox"` is correct, and that randomly drawn pairs such as `"brown"` → `"car"` or `"brown"` → `"window"` are wrong

This makes training **much faster** and still very effective.
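
A minimal sketch of one negative-sampling update, assuming the usual formulation: a sigmoid on the dot product, label 1 for the true pair, and label 0 for each sampled negative. The vectors are hand-picked toy values, the negatives are hard-coded rather than sampled, and the function names are made up for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neg_sampling_loss(center, positive, negatives):
    """Loss for one true (center, context) pair plus k sampled negatives."""
    loss = -math.log(sigmoid(dot(positive, center)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(neg, center)))
    return loss

def sgd_step(center, positive, negatives, lr=0.1):
    """One gradient step; returns updated copies of all vectors."""
    grad_center = [0.0] * len(center)
    updated = []
    # The true context word has label 1; each negative sample has label 0.
    for vec, label in [(positive, 1.0)] + [(n, 0.0) for n in negatives]:
        err = sigmoid(dot(vec, center)) - label  # prediction error
        grad_center = [g + err * x for g, x in zip(grad_center, vec)]
        updated.append([x - lr * err * c for x, c in zip(vec, center)])
    new_center = [c - lr * g for c, g in zip(center, grad_center)]
    return new_center, updated[0], updated[1:]

# Toy vectors chosen by hand. Only 1 + 2 output vectors are touched,
# not the whole vocabulary.
brown = [0.2, -0.1]
fox = [0.1, 0.3]
negatives = [[0.4, 0.2], [-0.3, 0.5]]  # stand-ins for "car" and "window"

before = neg_sampling_loss(brown, fox, negatives)
brown, fox, negatives = sgd_step(brown, fox, negatives)
after = neg_sampling_loss(brown, fox, negatives)
print(before, after)
```

Each update costs `1 + k` dot products instead of one per vocabulary word. In the paper by Mikolov et al., negatives are drawn from the unigram distribution raised to the 3/4 power; that sampling step is omitted here.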

---

## ✅ Summary

- **Skip-Gram** learns by predicting nearby words from a given word
- It creates many (center → context) word pairs during training
- The model improves word vectors so that related words are closer together
- **Negative sampling** makes this training process fast and scalable


---

## 🧠 Why Use Skip-Gram?

- Works well on **small datasets**
- **Better at capturing semantic relationships for rare words**
- Slower to train than CBOW, because each context word is a separate training example, but often gives **better vectors for infrequent words**

---

## 🔍 Key Insight

If two words appear in **similar contexts**, they will learn **similar embeddings**.

For example:
- `"king"` and `"queen"` might both appear near `"royal"`, `"crown"`, or `"kingdom"`
- This results in embeddings that are close in vector space
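
The idea of "similar contexts" can be checked directly by counting which words appear near a target. The sketch below uses a made-up three-sentence corpus for illustration only; `context_counts` is a hypothetical helper, not part of any library.

```python
from collections import Counter

def context_counts(sentences, target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            if word == target:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                counts.update(tokens[j] for j in range(lo, hi) if j != i)
    return counts

# Made-up toy corpus for illustration only.
corpus = [
    "the royal king wore a crown".split(),
    "the royal queen wore a crown".split(),
    "she ate one ripe banana".split(),
]

king_ctx = context_counts(corpus, "king")
queen_ctx = context_counts(corpus, "queen")
banana_ctx = context_counts(corpus, "banana")

shared_royal = set(king_ctx) & set(queen_ctx)   # contexts in common
shared_fruit = set(king_ctx) & set(banana_ctx)  # no overlap in this corpus
print(shared_royal, shared_fruit)
```

Words whose context sets overlap heavily are pushed toward the same prediction targets during training, which is why their vectors tend to end up close.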

---

## 📌 CBOW vs. Skip-Gram at a Glance

| Feature              | CBOW                                      | Skip-Gram                                  |
|----------------------|-------------------------------------------|---------------------------------------------|
| Input                | Context words                             | Target (center) word                        |
| Output               | Center word                               | One context word at a time                  |
| Training Speed       | Faster                                    | Slower (one update per context word)        |
| Best For             | Frequent words, large datasets            | Rare/infrequent words, smaller datasets     |
| Embedding Quality    | Good for general meaning                  | Better for capturing fine-grained meaning   |
| Accuracy (Rare Words)| Lower                                     | Higher                                      |
| Memory Usage         | Similar                                   | Similar (same embedding matrices)           |
| Example Pair(s)      | `["the", "quick", "fox", "jumps"] → "brown"` | `"brown" → "quick"`, `"brown" → "fox"`   |

---

💡 **Tip:** Use **CBOW** for faster training with large corpora, and **Skip-Gram** for better results on rare words or smaller datasets.
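
The difference in input and output shape between the two models can be sketched for a single center position. The helper `training_examples` is hypothetical and assumes a window of 2 on a lowercased, whitespace-split sentence.

```python
def training_examples(tokens, position, window=2):
    """Build the CBOW example and the Skip-Gram pairs for one center position."""
    center = tokens[position]
    lo = max(0, position - window)
    hi = min(len(tokens), position + window + 1)
    context = [tokens[j] for j in range(lo, hi) if j != position]
    cbow = (context, center)                    # all context words -> center
    skipgram = [(center, c) for c in context]   # center -> one context word each
    return cbow, skipgram

tokens = "the quick brown fox jumps".split()
cbow, skipgram = training_examples(tokens, position=2)
print(cbow)
print(skipgram)
```

CBOW produces one example for this position while Skip-Gram produces four, which is one way to see why Skip-Gram takes longer to train on the same corpus.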


---

## 🚀 How to Improve CBOW or Skip-Gram Word2Vec Models

Whether you're using **CBOW** or **Skip-Gram**, there are several ways to improve the quality of your word embeddings. Here are some effective strategies:

---

### 1. 🧹 Clean and Preprocess Your Text

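- Remove punctuation and noisy symbols; consider removing stopwords too, although Word2Vec's subsampling of frequent words already reduces their influence
- Lowercase all words for consistency
- Lemmatize or stem words to reduce sparsity, at the cost of merging distinct word forms into one vector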
- Filter out extremely short or irrelevant sentences
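
A minimal preprocessing sketch using only the standard library. It covers lowercasing, punctuation removal, and dropping very short sentences; stopword removal and lemmatization are left out because they need a word list or an NLP library. The `min_tokens` threshold and the naive sentence splitter are assumptions made for this example.

```python
import re

def preprocess(text, min_tokens=3):
    """Split raw text into lowercased, punctuation-free token lists."""
    # Naive sentence split on ., !, and ?; good enough for a sketch.
    sentences = re.split(r"[.!?]+", text)
    cleaned = []
    for sentence in sentences:
        # Keep only runs of letters, digits, and apostrophes, lowercased.
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        if len(tokens) >= min_tokens:  # drop very short fragments
            cleaned.append(tokens)
    return cleaned

raw = "The quick brown fox jumps over the lazy dog! Wow. It ran away, quickly..."
corpus = preprocess(raw)
print(corpus)
```

The result is a list of token lists, which is the input shape most Word2Vec implementations expect.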

---

### 2. 📏 Tune Key Hyperparameters

- **`vector_size`**: Higher dimensions can capture more detail but need more data and training time. Try 100, 200, or 300 (Google's pretrained News vectors use 300).
- **`window`**: Try larger windows (e.g., 5 or 10) to capture broader context, especially in semantic-heavy tasks.
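- **`min_count`**: Set to 2–5 to filter out rare/noisy words. Words below the threshold get no vector at all, so keep it low if rare words matter.
- **`epochs`**: Increase the number of passes over the corpus (e.g., 10–30) for better convergence, especially on small datasets.
- **`negative`**: Tune the number of negative samples. Mikolov et al. suggest 5–20 for small datasets and 2–5 for large ones.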
- **`sg`**: 0 for CBOW, 1 for Skip-Gram — test both if unsure which fits best.
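
The parameter names above match the `Word2Vec` class in gensim 4.x, so the sketch below assumes that library is installed. The values are starting points for experimentation rather than recommendations, and `train_word2vec` is a hypothetical wrapper written for this example.

```python
# Parameter names follow gensim 4.x's Word2Vec; values are starting points only.
SKIPGRAM_PARAMS = {
    "vector_size": 100,  # embedding dimensions
    "window": 5,         # context words on each side
    "min_count": 2,      # ignore words seen fewer than 2 times
    "epochs": 10,        # passes over the corpus
    "negative": 5,       # negative samples per positive pair
    "sg": 1,             # 1 = Skip-Gram, 0 = CBOW
}

# Same settings, switched to CBOW for a side-by-side comparison.
CBOW_PARAMS = {**SKIPGRAM_PARAMS, "sg": 0}

def train_word2vec(sentences, **overrides):
    """Train a gensim Word2Vec model on an iterable of token lists."""
    # Imported lazily so the parameter dicts can be used without gensim.
    from gensim.models import Word2Vec
    params = {**SKIPGRAM_PARAMS, **overrides}
    return Word2Vec(sentences=sentences, **params)
```

With a tokenized corpus in hand, `model = train_word2vec(corpus)` trains a Skip-Gram model, `train_word2vec(corpus, sg=0)` trains CBOW with otherwise identical settings, and `model.wv.most_similar("brown")` lists the nearest neighbours of a word.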

---

### 3. 📈 Use More (and Better) Data

- Train on a **larger corpus** to improve word coverage and embedding quality.
- Domain-specific corpora yield **more relevant embeddings** for specialized tasks (e.g., legal, finance, medicine).

---

### ✅ Summary

Improving Word2Vec models (CBOW or Skip-Gram) is a mix of:
- **Better data**
- **Smarter preprocessing**
- **Hyperparameter tuning**

Always experiment to find the best setup for your specific dataset and goals.
