# Word Embeddings Basics

**Objective:** Understand how to represent words as continuous-valued vectors capturing meaning and relationships.

---
## What Are Word Embeddings?

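Traditional methods like **Bag of Words** or **TF-IDF** represent text with word counts or count-based weights. Each word is treated as an independent dimension, so these methods do not capture **semantic relationships** between words.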

**Word Embeddings** are dense numerical representations of words where:
- Similar words have similar vector representations.
- Word relationships can be measured with a similarity or distance metric (such as cosine similarity).

Example: *king - man + woman ≈ queen*
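To make the similarity measure and the analogy concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The values in `toy` and the helper `cosine_similarity` are illustrative assumptions, not learned embeddings; real models use tens to hundreds of dimensions whose meaning is not labeled.

```python
import numpy as np

# Hand-made toy vectors (illustrative only, not learned from data).
# The dimensions can loosely be read as [royalty, maleness, femaleness].
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.8, 0.1]),
    "woman": np.array([0.1, 0.1, 0.8]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector arithmetic for the analogy: king - man + woman
target = toy["king"] - toy["man"] + toy["woman"]

# Find the word whose vector points in the direction closest to the target.
best = max(toy, key=lambda w: cosine_similarity(target, toy[w]))

print("king vs queen:", round(cosine_similarity(toy["king"], toy["queen"]), 3))
print("king vs woman:", round(cosine_similarity(toy["king"], toy["woman"]), 3))
print("king - man + woman is closest to:", best)
```

With these toy values the arithmetic lands on the *queen* vector by construction. Learned embeddings only approximate this behavior.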

---
## Types of Embeddings
1. **Static (one vector per word):** Word2Vec, GloVe, FastText.
2. **Contextual (the vector depends on the surrounding sentence):** BERT, GPT, etc. (covered later).

---
## One-Hot Encoding Recap
Before embeddings, let’s revisit one-hot encoding — where each word is represented as a binary vector.

```python
import numpy as np

vocab = ["king", "queen", "man", "woman"]

# Map each word to its position in the vocabulary.
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1
    return vector

print("One-hot representation for 'queen':")
print(one_hot("queen"))
```

One-hot encoding is **simple but sparse**, and it does not capture similarity: every pair of distinct words has a dot product of 0, so *king* and *queen* are no closer to each other than *king* and *man*.
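To check that one-hot vectors carry no similarity information, the sketch below computes the dot product between every pair of words. It rebuilds the same four-word vocabulary with `np.eye` so it runs on its own; the variable names are illustrative.

```python
import numpy as np

vocab = ["king", "queen", "man", "woman"]

# Row i of the identity matrix is the one-hot vector for vocab[i].
one_hot_vectors = np.eye(len(vocab))

# Dot products between every pair of one-hot vectors.
# Each vector has length 1, so these are also the cosine similarities.
similarities = one_hot_vectors @ one_hot_vectors.T

print(similarities)
```

The result is the identity matrix: each word matches only itself, and all other pairs score 0.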

---
## Introducing Word2Vec
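Word2Vec (introduced by researchers at Google) learns embeddings by training a shallow **neural network** to predict words from their neighbors. The weights learned for this task become the word vectors.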

Two training models:
- **CBOW (Continuous Bag of Words):** Predicts a word from its context.
- **Skip-Gram:** Predicts context from a given word.
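The difference between the two models is easiest to see in the training examples each one builds from a sentence. The sketch below is a simplified illustration: `skip_gram_pairs` and `cbow_pairs` are hypothetical helpers, and actual Word2Vec implementations add further steps (such as sampling tricks) that are omitted here.

```python
def skip_gram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context_words, center) pairs: the neighbors jointly predict the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

sentence = ["king", "rules", "the", "kingdom"]

print("Skip-Gram:", skip_gram_pairs(sentence, window=1))
print("CBOW:     ", cbow_pairs(sentence, window=1))
```

Skip-Gram produces one example per (word, neighbor) pair, while CBOW produces one example per word with all of its neighbors grouped as input.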

Let’s use `gensim` to train a simple Word2Vec model.

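```python
from gensim.models import Word2Vec

# Tiny toy corpus: each sentence is a list of tokens.
sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["man", "is", "strong"],
    ["woman", "is", "wise"]
]

# vector_size: embedding dimension; window: max distance to a context word;
# min_count=1 keeps every word; sg=0 selects CBOW (sg=1 selects Skip-Gram).
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)

# Show the first 10 of the 50 components, then the nearest neighbors.
print("Vector for 'king':\n", model.wv['king'][:10], "...")
print("\nMost similar to 'king':", model.wv.most_similar('king'))
```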

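Each word now has a **50-dimensional vector**. With only four short sentences, these vectors stay close to their random initialization, so the similarity scores above are not meaningful. Capturing contextual meaning requires a much larger corpus.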

---
## Visualizing Word Embeddings
Let’s project embeddings into 2D using PCA for visualization.

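```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Look up the learned vectors for the four vocabulary words (shape: 4 x 50).
# Relies on `vocab` and `model` defined in the earlier cells.
X = model.wv[vocab]

# Reduce the 50 dimensions to the 2 directions of greatest variance.
pca = PCA(n_components=2)
result = pca.fit_transform(X)

plt.figure(figsize=(6, 4))
plt.scatter(result[:, 0], result[:, 1])

# Label each point with its word.
for i, word in enumerate(vocab):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))

plt.title('Word Embeddings Visualized (PCA)')
plt.show()
```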

---
## GloVe (Global Vectors for Word Representation)
GloVe (by Stanford) uses **word co-occurrence statistics** to learn embeddings.
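As a rough illustration of what "co-occurrence statistics" means, the sketch below counts how often two words appear within a fixed window of each other in the toy corpus. The helper `cooccurrence_counts` is hypothetical and covers only the counting step; GloVe itself goes on to fit vectors to these statistics, which is not shown.

```python
from collections import Counter

sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["man", "is", "strong"],
    ["woman", "is", "wise"],
]

def cooccurrence_counts(sentences, window=2):
    """Count ordered (word, neighbor) pairs that fall within the window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence_counts(sentences, window=2)

print("rules / kingdom:", counts[("rules", "kingdom")])
print("king / queen:   ", counts[("king", "queen")])
```

Note that *king* and *queen* never co-occur directly in this corpus. They become related only because they share neighbors such as *rules* and *kingdom*.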

You can load pretrained vectors (e.g., 100D GloVe) to use them directly.

```python
from gensim.models import KeyedVectors

# Example: load pretrained GloVe vectors if the file is available locally.
# GloVe text files have no header line, so gensim 4.0+ needs no_header=True.
# glove_model = KeyedVectors.load_word2vec_format('glove.6B.100d.txt', binary=False, no_header=True)
# glove_model.most_similar('king')
```
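If the pretrained file is not at hand, the text layout is simple enough to explore with a small parser: each line holds a token followed by its vector components, separated by spaces. The sketch below reads a made-up two-line sample from memory; the values and the helper `read_glove_text` are illustrative assumptions, not real GloVe data.

```python
import io
import numpy as np

# Made-up sample in the GloVe text layout: token, then space-separated values.
sample = io.StringIO(
    "king 0.5 0.1 -0.3\n"
    "queen 0.4 0.2 -0.2\n"
)

def read_glove_text(handle):
    """Parse 'token v1 v2 ...' lines into a dict of NumPy vectors."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        values = [float(x) for x in parts[1:]]
        vectors[parts[0]] = np.asarray(values, dtype=np.float32)
    return vectors

vectors = read_glove_text(sample)

print(sorted(vectors))
print(vectors["king"].shape)
```

For real files, replacing `sample` with an opened file handle should work the same way, though loading through `gensim` is usually more convenient because it also provides similarity queries.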

---
## Comparison Summary

| Method | Representation | Captures Meaning | Sparse/Dense | Example |
|---------|----------------|------------------|---------------|----------|
| One-Hot | Binary | ❌ No | Sparse | `[0,0,1,0]` |
| BoW / TF-IDF | Count-based | ❌ No | Sparse | `[1,0,2,1]` |
| Word2Vec / GloVe | Learned | ✅ Yes | Dense | `[0.21, -0.13, 0.76, ...]` |

---
## ✅ Summary
- Word embeddings capture **semantic similarity**.
- Models like **Word2Vec** and **GloVe** map words to dense vectors learned from the contexts in which they appear.
- Embeddings pave the way for deep NLP models like RNNs, LSTMs, and Transformers.

---
**Next:** `06-Word2Vec_and_GloVe.ipynb`: a deeper dive into pretrained embedding models and advanced similarity operations.