## Word2Vec
- CBOW: Continuous Bag-of-Words
- Skip-gram

Both are shallow neural networks.


---

### **1. Example Documents**  
- **Document 1**: "people watch techhub"  
- **Document 2**: "techhub watch techhub"  
- **Document 3**: "people write review"  
- **Document 4**: "techhub write review"  

---

### **2. Corpus Definition**  
Same corpus as before:  
```python
corpus = [
    "people watch techhub",
    "techhub watch techhub",
    "people write review",
    "techhub write review"
]
```

---

### **3. Word2Vec Overview**  
Word2Vec is a **predictive embedding model** that learns:  
- **Continuous vector representations** (typically 50-300 dimensions)  
- **Semantic relationships** via context (e.g., "king" - "man" + "woman" ≈ "queen")  
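
The analogy arithmetic can be sketched in plain Python. The vectors below are hand-picked toy values (hypothetical, not taken from any trained model), chosen so that one axis loosely means "royalty" and the other "maleness"; `toy`, `cosine`, and `analogy` are illustrative names.

```python
import math

# Hand-picked toy vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "maleness".
toy = {
    "king":  [0.95, 0.90],
    "queen": [0.95, 0.10],
    "man":   [0.10, 0.90],
    "woman": [0.10, 0.10],
    "apple": [0.05, 0.50],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman", toy))
```

With these toy values the query vector lands on "queen". Real embeddings only approximate this, which is why the relation is written with ≈.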

Two architectures:  
1. **Skip-gram**: Predicts context words given a target word.  
2. **CBOW (Continuous Bag-of-Words)**: Predicts a target word from context.  
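
The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The sketch below is a minimal illustration, assuming whitespace tokenization and a symmetric window of one word; `skipgram_pairs` and `cbow_pairs` are hypothetical helper names, not library functions.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: one (target, context_word) pair per neighbour in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    """CBOW: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, target))
    return examples

tokens = "people watch techhub".split()
print(skipgram_pairs(tokens))  # target word -> each neighbouring word
print(cbow_pairs(tokens))      # all neighbouring words -> target word
```

Skip-gram produces more, smaller examples per sentence, while CBOW combines the whole window into a single prediction.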

---

### **4. Word2Vec Training (Hypothetical Output)**  
Assume we train a **2-dimensional** Word2Vec model for illustration:  

| Word     | Vector (x, y)    | Notes                  |
|----------|------------------|------------------------|
| people   | [0.52, 0.85]     | Close to "write"       |
| watch    | [0.33, 0.94]     | Often co-occurs with "techhub" |
| techhub  | [0.91, 0.41]     | Central to documents 1,2,4 |
| write    | [0.48, 0.88]     | Close to "people"      |
| review   | [0.76, 0.65]     | Groups with "write"    |

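**Key Observations**:  
- Words the model treats as similar (e.g., "people" and "write") have **closer vectors**.  
- In these illustrative vectors, "techhub" is nearest to "review" and farthest from "watch". Real Word2Vec similarity reflects shared contexts, and a four-sentence corpus is far too small to produce stable values.  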
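
Closeness between word vectors is usually measured with cosine similarity. The following sketch ranks each word's neighbours using the hypothetical 2-D vectors from the table above; `cosine` and `neighbours` are illustrative helper names.

```python
import math

# Hypothetical 2-D vectors copied from the table above.
vectors = {
    "people":  [0.52, 0.85],
    "watch":   [0.33, 0.94],
    "techhub": [0.91, 0.41],
    "write":   [0.48, 0.88],
    "review":  [0.76, 0.65],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def neighbours(word, vectors):
    """Other words sorted from most to least similar to `word`."""
    others = [w for w in vectors if w != word]
    return sorted(others, key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)

for word in vectors:
    print(word, "->", neighbours(word, vectors))
```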

---

### **5. Document Representation**  
Unlike BoW/TF-IDF, Word2Vec requires **aggregation** for documents:  

#### **Method 1: Average Word Vectors**  
- **Doc 1**: `mean([people, watch, techhub]) = [(0.52+0.33+0.91)/3, (0.85+0.94+0.41)/3] ≈ [0.59, 0.73]`  
- **Doc 2**: `mean([techhub, watch, techhub]) ≈ [0.72, 0.59]`  
- **Doc 3**: `mean([people, write, review]) ≈ [0.59, 0.79]`  
- **Doc 4**: `mean([techhub, write, review]) ≈ [0.72, 0.65]`  
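
The averaging step can be reproduced with a few lines of plain Python. This is a minimal sketch that reuses the hypothetical 2-D vectors from section 4 and assumes whitespace tokenization; `doc_vector` is an illustrative helper name.

```python
# Hypothetical 2-D vectors from section 4.
vectors = {
    "people":  [0.52, 0.85],
    "watch":   [0.33, 0.94],
    "techhub": [0.91, 0.41],
    "write":   [0.48, 0.88],
    "review":  [0.76, 0.65],
}

corpus = [
    "people watch techhub",
    "techhub watch techhub",
    "people write review",
    "techhub write review",
]

def doc_vector(text, vectors):
    """Average the vectors of all known words in `text` (repeats count twice)."""
    words = [w for w in text.split() if w in vectors]
    dims = len(next(iter(vectors.values())))
    if not words:
        return [0.0] * dims  # assumption: an empty document maps to the zero vector
    return [sum(vectors[w][d] for w in words) / len(words) for d in range(dims)]

for doc in corpus:
    print(doc, "->", [round(x, 2) for x in doc_vector(doc, vectors)])
```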

#### **Method 2: TF-IDF Weighted Average**  
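Weight each word vector by its TF-IDF score before averaging, so that rarer, more informative words contribute more than words that appear in most documents.  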
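
A minimal sketch of the weighted average follows. It assumes term frequency is `count / document length` and uses the IDF variant `ln(N / df) + 1`; other TF-IDF definitions exist and would give slightly different weights. The vectors are the hypothetical ones from section 4, and `idf_table` and `tfidf_doc_vector` are illustrative names.

```python
import math
from collections import Counter

corpus = [
    "people watch techhub",
    "techhub watch techhub",
    "people write review",
    "techhub write review",
]

# Hypothetical 2-D vectors from section 4.
vectors = {
    "people":  [0.52, 0.85],
    "watch":   [0.33, 0.94],
    "techhub": [0.91, 0.41],
    "write":   [0.48, 0.88],
    "review":  [0.76, 0.65],
}

def idf_table(corpus):
    """IDF per word, using the assumed variant ln(N / df) + 1."""
    n_docs = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc.split()))
    return {w: math.log(n_docs / count) + 1.0 for w, count in df.items()}

def tfidf_doc_vector(text, vectors, idf):
    """Weighted mean of word vectors, with TF-IDF scores as the weights."""
    tokens = [w for w in text.split() if w in vectors and w in idf]
    counts = Counter(tokens)
    weights = {w: (c / len(tokens)) * idf[w] for w, c in counts.items()}
    total = sum(weights.values())
    dims = len(next(iter(vectors.values())))
    if total == 0:
        return [0.0] * dims  # no known words: fall back to the zero vector
    return [sum(weights[w] * vectors[w][d] for w in weights) / total
            for d in range(dims)]

idf = idf_table(corpus)
for doc in corpus:
    print(doc, "->", [round(x, 2) for x in tfidf_doc_vector(doc, vectors, idf)])
```

Because "techhub" appears in three of the four documents, it receives the lowest IDF and pulls each document vector less than it does in the plain average.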

---

### **6. Pros of Word2Vec**  
✅ **Captures Semantics**:  
   - "people" ≈ "write" (similar contexts).  
✅ **Fixed-Length Vectors**:  
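   - Every word vector has the same size (e.g., 300D), and so does any document vector built from them.  
✅ **Generalizes Across Related Words**:  
   - Words used in similar contexts get similar vectors, so what a downstream model learns about one word carries over to its neighbours. Words never seen during training still have no vector.  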

---

### **7. Cons of Word2Vec**  
❌ **No Built-In Document Representation**:  
   - Requires aggregation (plain or TF-IDF-weighted average) for documents.  
❌ **Ignores Word Order**:  
   - After averaging, "people watch" and "watch people" get the same vector (no n-gram information).  
❌ **Data-Hungry**:  
   - Needs large corpora for training.  
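
The word-order limitation is easy to check: averaging is order-independent, so reversing a phrase leaves its vector unchanged. A small sketch, using two of the hypothetical vectors from section 4 (`mean_vector` is an illustrative helper):

```python
# Two of the hypothetical 2-D vectors from section 4.
vectors = {"people": [0.52, 0.85], "watch": [0.33, 0.94]}

def mean_vector(tokens):
    """Plain average of the word vectors for `tokens`."""
    dims = len(vectors[tokens[0]])
    return [sum(vectors[w][d] for w in tokens) / len(tokens) for d in range(dims)]

forward = mean_vector(["people", "watch"])
reverse = mean_vector(["watch", "people"])
print(forward == reverse)  # the two phrases are indistinguishable after averaging
```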

---

### **8. Comparison with Other Methods**  

| Feature            | TF-IDF                        | Word2Vec                      |
|--------------------|-------------------------------|-------------------------------|
| **Semantics**      | ❌ No                         | ✅ Yes                        |
| **Dimensionality** | Sparse (vocabulary size V)    | Dense (50-300D)               |
| **OOV Handling**   | ❌ Unseen words are ignored   | ❌ No vector for unseen words (FastText adds a subword fallback) |
| **Training**       | None (corpus statistics only) | Learned from a corpus         |
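
On the OOV row: a plain Word2Vec lookup table has no entry for a word that was absent from training, so the caller has to decide what to do with it. The sketch below shows one common policy, assumed here for illustration: skip unknown words when averaging and report them. It uses the hypothetical vectors from section 4, and `doc_vector_skipping_oov` is an illustrative name.

```python
# Hypothetical 2-D vectors from section 4; any other word is out of vocabulary.
vectors = {
    "people":  [0.52, 0.85],
    "watch":   [0.33, 0.94],
    "techhub": [0.91, 0.41],
    "write":   [0.48, 0.88],
    "review":  [0.76, 0.65],
}

def doc_vector_skipping_oov(text, vectors):
    """Average the known words and return (vector_or_None, list_of_unknown_words)."""
    tokens = text.split()
    known = [w for w in tokens if w in vectors]
    unknown = [w for w in tokens if w not in vectors]
    if not known:
        return None, unknown  # nothing to average
    dims = len(vectors[known[0]])
    mean = [sum(vectors[w][d] for w in known) / len(known) for d in range(dims)]
    return mean, unknown

vec, missing = doc_vector_skipping_oov("people read blog", vectors)
print(vec, missing)
```

Here only "people" contributes to the document vector, while "read" and "blog" are dropped, which is the information loss the table refers to.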

---

### **9. Practical Notes**  
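- **Pretrained Models**: With a small dataset, start from pretrained vectors (Word2Vec, GloVe, or FastText) rather than training from scratch.  
- **Fine-Tuning**: Continue training on domain-specific text (e.g., medical journals) when general-purpose vectors miss domain vocabulary.  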
- **Extensions**:  
  - **Doc2Vec**: Direct document embeddings.  
  - **BERT**: Captures context-dependent meanings.  

---

### **Summary**  
Word2Vec excels at **semantic tasks** (analogy, clustering) but needs an **aggregation step to represent documents**. For modern NLP, **contextual embeddings (BERT)** usually perform better but are computationally heavier.  



```python
from gensim.models import Word2Vec, KeyedVectors
```

Demo: We'll use the pretrained Word2Vec vectors that were trained on part of the Google News corpus (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.

```python
import gensim.downloader as api

# Download and load the Word2Vec model (~1.6GB)
model = api.load("word2vec-google-news-300")  # Same dimensions as the original
print(model.most_similar("king"))
```

Output:

```
[('kings', 0.7138045430183411), ('queen', 0.6510956883430481), ('monarch', 0.6413194537162781), ('crown_prince', 0.6204220056533813), ('prince', 0.6159993410110474), ('sultan', 0.5864824056625366), ('ruler', 0.5797567367553711), ('princes', 0.5646552443504333), ('Prince_Paras', 0.5432944297790527), ('throne', 0.5422105193138123)]
```


```python
model.most_similar("queen")
```

Output:

```
[('queens', 0.739944338798523),
 ('princess', 0.7070532441139221),
 ('king', 0.6510956883430481),
 ('monarch', 0.6383602023124695),
 ('very_pampered_McElhatton', 0.6357026696205139),
 ('Queen', 0.6163407564163208),
 ('NYC_anglophiles_aflutter', 0.6060680150985718),
 ('Queen_Consort', 0.5923796892166138),
 ('princesses', 0.5908074975013733),
 ('royal', 0.5637185573577881)]
```