# Word Embeddings with spaCy

## Overview

**Word Embeddings** are dense, low-dimensional vector representations that capture semantic meaning. Unlike BoW/TF-IDF (sparse vectors with thousands of dimensions), embeddings typically have 100-300 dimensions.

### Why Embeddings?

| Representation | Dimensions | Captures Meaning | Example |
|:---------------|:-----------|:-----------------|:--------|
| One-Hot | 50,000+ | ❌ No | [0,0,1,0,...,0] |
| BoW/TF-IDF | 10,000+ | ❌ No | [0.2, 0, 0.5,...] |
| **Embeddings** | 300 | ✅ Yes | [0.12, -0.34, 0.78,...] |
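
To make the contrast in the table concrete, here is a minimal sketch in plain Python. The three-word vocabulary, the helper names `one_hot` and `dot`, and the dense values are all made up for illustration; none of them come from a trained model.

```python
# Toy vocabulary; the words and the dense values below are illustrative assumptions.
vocab = ["dog", "cat", "airplane"]

def one_hot(word):
    """Return a one-hot vector: 1 at the word's index, 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Any two distinct one-hot vectors are orthogonal, so they carry no notion of similarity.
print(dot(one_hot("dog"), one_hot("cat")))       # 0
print(dot(one_hot("dog"), one_hot("airplane")))  # 0

# Hypothetical dense vectors: related words can point in similar directions.
dense = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "airplane": [0.1, 0.2, 0.9],
}
print(dot(dense["dog"], dense["cat"]) > dot(dense["dog"], dense["airplane"]))  # True
```

With one-hot vectors every pair of different words looks equally unrelated, while dense vectors leave room for "dog" to sit closer to "cat" than to "airplane".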

### The Key Insight

Similar words have similar vectors:
- "dog" and "cat" are close in vector space
- "dog" and "airplane" are far apart

---

## 🔧 Setup

We need the **large** spaCy model, which includes word vectors.

In [2]:
import spacy

# Word vectors take up a lot of space, so the en_core_web_sm model does not include them.
# To get word vectors, install the medium or large English model. We use the large one here.
# Make sure you have run "python -m spacy download en_core_web_lg" first.
nlp = spacy.load("en_core_web_lg")

### spaCy Model Sizes

| Model | Size | Word Vectors | Use Case |
|:------|:-----|:-------------|:---------|
| `en_core_web_sm` | 12 MB | ❌ No | Basic NLP, fast |
| `en_core_web_md` | 43 MB | ✅ 20k words | Development |
| `en_core_web_lg` | 741 MB | ✅ 685k words | Production |

Install large model: `python -m spacy download en_core_web_lg`

In [3]:
doc = nlp("dog cat banana kem")

for token in doc:
    print(token.text, "Vector:", token.has_vector, "OOV:", token.is_oov)

dog Vector: True OOV: False
cat Vector: True OOV: False
banana Vector: True OOV: False
kem Vector: True OOV: False


---

## 📊 Checking Word Vectors

Let's examine which words have vectors:
- **has_vector**: Does the word have a vector representation?
- **is_oov**: Is it Out-Of-Vocabulary? (the token has no entry in the model's vector table)
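
The two flags can be pictured as a lookup against a vector table. The sketch below is a toy model of that idea in plain Python, not spaCy's implementation: `VECTOR_TABLE`, `has_vector`, `is_oov`, and `lookup` are hypothetical names defined here, and the values are made up.

```python
# Hypothetical 3-dimensional vector table; the spaCy vectors in this notebook have 300 dimensions.
VECTOR_TABLE = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
}
DIMENSIONS = 3

def has_vector(word):
    """True when the word has an entry in the table."""
    return word in VECTOR_TABLE

def is_oov(word):
    """In this toy model, out-of-vocabulary is the opposite of having an entry."""
    return not has_vector(word)

def lookup(word):
    """Return the word's vector, or an all-zero vector for unknown words."""
    return VECTOR_TABLE.get(word, [0.0] * DIMENSIONS)

for word in ["dog", "cat", "zzxq"]:
    print(word, "Vector:", has_vector(word), "OOV:", is_oov(word), lookup(word))
```

Here the invented string "zzxq" has no entry, so it is reported as OOV and falls back to a zero vector.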

In [4]:
doc[0].vector.shape

(300,)

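Note: in the output above, "kem" prints `Vector: True` and `OOV: False`, so this model does have a vector for it. A token with no vector would instead print `Vector: False` and `OOV: True`.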

### Vector Dimensions

Each word is represented by a 300-dimensional vector:

In [5]:
base_token = nlp("bread")
base_token.vector.shape

(300,)

---

## 🔍 Semantic Similarity

The magic of embeddings: **similar words have similar vectors**!

Let's compare various words to "bread":

In [6]:
doc = nlp("bread sandwich burger car tiger human wheat")

for token in doc:
    print(f"{token.text} <-> {base_token.text}:", token.similarity(base_token))

bread <-> bread: 1.0
sandwich <-> bread: 0.6874560117721558
burger <-> bread: 0.544037401676178
car <-> bread: 0.16441147029399872
tiger <-> bread: 0.14492356777191162
human <-> bread: 0.21103660762310028
wheat <-> bread: 0.6572456359863281


**Interpreting Results:**

- **sandwich** (0.69), **wheat** (0.66) - Closely related to bread
- **burger** (0.54) - Food-related, moderately similar
- **human** (0.21), **car** (0.16), **tiger** (0.14) - Unrelated concepts, low similarity

The embeddings capture that food items are semantically similar!
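
A quick way to read such results is to sort them. The sketch below uses the scores printed above, rounded to three decimals, so it runs without loading the model; the 0.5 cut-off is an arbitrary assumption for illustration, not a recommended threshold.

```python
# Similarity scores to "bread", copied (rounded) from the output above.
scores = {
    "sandwich": 0.687,
    "burger": 0.544,
    "car": 0.164,
    "tiger": 0.145,
    "human": 0.211,
    "wheat": 0.657,
}

# Sort from most to least similar.
ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

for word, score in ranked:
    print(f"{word:<10}{score:.3f}")

# Words above an arbitrary cut-off of 0.5 (the threshold is an assumption, not a rule).
related = [word for word, score in ranked if score > 0.5]
print(related)  # ['sandwich', 'wheat', 'burger']
```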

### Helper Function for Similarity Comparison

In [7]:
def print_similarity(base_word, words_to_compare):
    # Compare every token in `words_to_compare` against `base_word`.
    base_token = nlp(base_word)
    doc = nlp(words_to_compare)
    for token in doc:
        print(f"{token.text} <-> {base_token.text}: ", token.similarity(base_token))

In [8]:
print_similarity("iphone", "apple samsung iphone dog kitten")

apple <-> iphone:  0.6339781284332275
samsung <-> iphone:  0.6678677797317505
iphone <-> iphone:  1.0
dog <-> iphone:  0.1743103712797165
kitten <-> iphone:  0.1468581259250641


**Brand awareness**: Notice how "apple" (0.63) and "samsung" (0.67) score much higher against "iphone" than "dog" or "kitten" do. The vectors reflect that these words are used in similar tech contexts.

In [9]:
king = nlp.vocab["king"].vector
man = nlp.vocab["man"].vector
woman = nlp.vocab["woman"].vector
queen = nlp.vocab["queen"].vector

result = king - man + woman

---

## 👑 The Famous King-Queen Analogy

The best-known demonstration of word embeddings:

$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$

This works because embeddings encode semantic relationships:
- "king" = royalty + male
- "queen" = royalty + female
- Subtracting "man" and adding "woman" swaps the gender concept!
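
The idea can be sketched with hand-made two-dimensional vectors. Everything in this block is a hypothetical toy: the axes `[royalty, gender]` and their values are assumptions chosen so the arithmetic is exact, whereas real 300-dimensional embeddings only satisfy the analogy approximately. The names carry a `toy_` prefix so they do not overwrite the notebook's own variables.

```python
# Hypothetical 2-D vectors: [royalty, gender], with gender +1 = male, -1 = female.
toy_king = [1.0, 1.0]
toy_man = [0.0, 1.0]
toy_woman = [0.0, -1.0]
toy_queen = [1.0, -1.0]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]

# king - man removes the "male" component; adding woman puts "female" in its place.
toy_result = add(sub(toy_king, toy_man), toy_woman)
print(toy_result)               # [1.0, -1.0]
print(toy_result == toy_queen)  # True
```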

In [10]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([result], [queen])

array([[0.78808445]], dtype=float32)

**Result**: ~0.79 similarity! The analogy works because embeddings capture semantic relationships.

---

## 🎯 Key Takeaways

### Word Embeddings vs Sparse Representations

| Feature | BoW/TF-IDF | Word Embeddings |
|:--------|:-----------|:----------------|
| Dimensions | 10,000+ | 300 |
| Captures meaning | ❌ | ✅ |
| Similar words | Different vectors | Similar vectors |
| Math operations | Meaningless | Semantic! |

### When to Use Word Embeddings

✅ **Use embeddings for:**
- Semantic similarity
- Recommendation systems
- Transfer learning
- Small labeled datasets

✅ **Use TF-IDF for:**
- Keyword extraction
- Search ranking
- Large labeled datasets
- Interpretability needed

### Next Steps

Explore the **text_classification.ipynb** to see embeddings in action for fake news detection!

### Computing Similarity

We use **cosine similarity** to compare vectors:
- 1.0 = identical direction
- 0.0 = perpendicular (unrelated)
- -1.0 = opposite direction
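
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal plain-Python sketch of that formula; the function name `cosine` and the tiny example vectors are assumptions for illustration, and this is not how spaCy or scikit-learn implement it internally.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction, different length)
print(cosine([1.0, 0.0], [0.0, 3.0]))   # 0.0  (perpendicular)
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Because the lengths are divided out, only the direction of the vectors matters. Note that this sketch divides by zero if either input is an all-zero vector, so a real helper would need to handle that case.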