
# **Embedding Techniques in NLP**

---

## **1. Theory**

### **What is Embedding?**

* **Definition**: Embeddings (or vector representations) convert **text (words, sentences, documents)** into **numerical vectors** that can be processed by ML/DL models.
* Goal: Capture **semantic meaning** of text.

---

### **A. One-Hot Encoding**

* Each word is represented by a **binary vector**.
* Only one element is `1`, rest are `0`.
* Example:
  Vocabulary = `[apple, banana, orange]`

  * apple → `[1, 0, 0]`
  * banana → `[0, 1, 0]`
  * orange → `[0, 0, 1]`

**Pros**: Simple, easy to implement.
**Cons**:

* Sparse vectors (mostly zeros).
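* No semantic similarity: every pair of distinct words is equally far apart, so apple is no closer to banana than to any other word.
* Vector length equals vocabulary size, so dimensionality explodes with large corpora.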
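
The lack of semantic similarity can be checked directly: the dot product of any two distinct one-hot vectors is zero. The following is a minimal pure-Python sketch using the three-word vocabulary above; `one_hot` and `dot` are hypothetical helper names introduced only for illustration.

```python
# Toy vocabulary from the example above; index order defines the vector layout.
vocabulary = ["apple", "banana", "orange"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a list with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

apple, banana = one_hot("apple"), one_hot("banana")
print(apple)               # [1, 0, 0]
print(dot(apple, banana))  # 0: distinct words are always orthogonal
print(dot(apple, apple))   # 1: a word only matches itself
```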

---

### **B. Bag of Words (BoW)**

* Represent text as a **vector of word counts** (or frequencies).
* Ignores word order and grammar.
* Example:
  Corpus = [“I love NLP”, “I love AI”]
  Vocabulary = `[I, love, NLP, AI]`

  * “I love NLP” → `[1, 1, 1, 0]`
  * “I love AI” → `[1, 1, 0, 1]`

**Pros**: Simple, interpretable.
**Cons**:

* Ignores context & word order.
* Large feature space.
* Common words dominate representation.
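
The word-order limitation is easy to demonstrate. In this small sketch, two sentences with opposite meanings (made-up examples, not from the corpus above) map to the same count vector; `bag_of_words` is a hypothetical helper that splits on whitespace only.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["dog", "bites", "man"]
a = bag_of_words("dog bites man", vocabulary)
b = bag_of_words("man bites dog", vocabulary)

print(a)       # [1, 1, 1]
print(b)       # [1, 1, 1]
print(a == b)  # True: opposite meanings, identical vectors
```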

---

### **C. TF-IDF (Term Frequency – Inverse Document Frequency)**

* Improves BoW by **downweighting common words** and **highlighting important words**.

* Formula:

  * **TF (Term Frequency)** = Count of word in doc / Total words in doc
  * **IDF (Inverse Document Frequency)** = log(Total docs / Docs containing word)
  * **TF-IDF = TF × IDF**

* Example:

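  * Suppose the corpus has 1000 documents. The word “the” appears in **all 1000 documents** → IDF = log(1000/1000) = 0, so its weight is low.
  * The word “NLP” appears in **only 10 documents** → IDF = log(1000/10), so its weight is higher.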

**Pros**: Better than BoW, highlights informative words.
**Cons**: Still ignores context/semantics, vocabulary is large.
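
The formulas above can be worked through by hand. This sketch applies them literally (with the natural logarithm) to a toy, pre-tokenized corpus that is an assumption for illustration; library implementations often use smoothed variants, so their numbers can differ.

```python
import math

# Toy corpus (an assumption for illustration), already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "nlp", "model"],
]

def tf(term, doc):
    """Term frequency: count of term in doc / total words in doc."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(total docs / docs containing term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[2], docs))            # 0.0, "the" is in every document
print(round(tf_idf("nlp", docs[2], docs), 4))  # 0.3662
```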

---

## **2. Examples**

---

### **A. One-Hot Encoding**

```python
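from sklearn.preprocessing import OneHotEncoder

# Each inner list is one sample with a single categorical feature (the word).
corpus = [["apple"], ["banana"], ["orange"]]
# sparse_output requires scikit-learn >= 1.2 (older releases call it `sparse`).
encoder = OneHotEncoder(sparse_output=False)
one_hot = encoder.fit_transform(corpus)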

print("Vocabulary:", encoder.categories_)
print("One-hot vectors:\n", one_hot)
```

**Output:**

```
Vocabulary: [array(['apple', 'banana', 'orange'], dtype=object)]
One-hot vectors:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
```

---

### **B. Bag of Words (CountVectorizer)**

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "I love AI"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Vectors:\n", X.toarray())
```

**Output** (`CountVectorizer` lowercases the text and, by default, drops single-character tokens such as “I”):

```
Vocabulary: ['ai' 'love' 'nlp']
BoW Vectors:
[[0 1 1]
 [1 1 0]]
```
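
The vocabulary has no entry for “I” because `CountVectorizer` lowercases input and its default `token_pattern` only matches tokens of two or more word characters. The sketch below applies that same regular expression with the standard `re` module; it imitates the tokenization step only, not the full vectorizer.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer: 2+ word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps single-character tokens such as "i".
KEEP_SINGLE_CHARS = r"(?u)\b\w+\b"

sentence = "I love NLP".lower()  # CountVectorizer lowercases by default

print(re.findall(DEFAULT_PATTERN, sentence))    # ['love', 'nlp']
print(re.findall(KEEP_SINGLE_CHARS, sentence))  # ['i', 'love', 'nlp']
```

Passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` should therefore keep “i” in the vocabulary, if single-character words matter for the task.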

---

### **C. TF-IDF**

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love NLP", "I love AI"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Vectors:\n", X.toarray())
```

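**Output** (scikit-learn's defaults use a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2-normalize each row, so the values differ from the plain formula in the theory section):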

```
Vocabulary: ['ai' 'love' 'nlp']
TF-IDF Vectors:
[[0.         0.57973867 0.81480247]
 [0.81480247 0.57973867 0.        ]]
```
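
To see where numbers like these come from, the first row can be recomputed by hand. This sketch assumes `TfidfVectorizer`'s default settings (raw counts as TF, smoothed IDF, L2 normalization); the variable names are illustrative and not part of scikit-learn.

```python
import math

n_docs = 2
terms = ("ai", "love", "nlp")
# Document frequencies: "love" is in both documents, the others in one each.
doc_freq = {"ai": 1, "love": 2, "nlp": 1}
# Raw term counts for "I love NLP" ("I" is dropped by the default tokenizer).
counts = {"ai": 0, "love": 1, "nlp": 1}

def smoothed_idf(df, n):
    """Smoothed IDF as used by scikit-learn when smooth_idf=True."""
    return math.log((1 + n) / (1 + df)) + 1

weights = [counts[t] * smoothed_idf(doc_freq[t], n_docs) for t in terms]
norm = math.sqrt(sum(w * w for w in weights))  # L2 norm of the row
row = [w / norm for w in weights]

print([round(x, 4) for x in row])  # [0.0, 0.5797, 0.8148]
```

The shared word “love” gets the smaller weight because it appears in both documents, while “nlp” is unique to the first one.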

---

## **3. Comparison Table**

| Technique        | Representation        | Pros                                   | Cons                                     | Use Cases                                 |
| ---------------- | --------------------- | -------------------------------------- | ---------------------------------------- | ----------------------------------------- |
| **One-Hot**      | Binary vector         | Simple, interpretable                  | Sparse, no semantic meaning              | Very small datasets, toy problems         |
| **Bag of Words** | Word counts/frequency | Easy, works with traditional ML models | Ignores context, high dimensionality     | Text classification, spam detection       |
| **TF-IDF**       | Weighted frequency    | Highlights important terms             | Still ignores context, large vocab space | Information retrieval, keyword extraction |

---

## **4. Interview-Style Q&A**

### **Basic**

**Q1. What is one-hot encoding in NLP?**
*A: It represents each word as a binary vector with only one active element (1), and the rest zeros.*

**Q2. What is the limitation of one-hot encoding?**
*A: It leads to sparse, high-dimensional vectors and cannot capture semantic similarity between words.*

---

### **Intermediate**

**Q3. How does Bag of Words represent text?**
*A: BoW represents text as a vector of word counts or frequencies, ignoring grammar and word order.*

**Q4. What problem does TF-IDF solve compared to BoW?**
*A: TF-IDF reduces the weight of common words (like "the", "is") and highlights rare but informative words, improving feature representation.*

---

### **Advanced**

**Q5. What are the limitations of BoW and TF-IDF compared to word embeddings like Word2Vec or BERT?**
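*A: BoW and TF-IDF treat words as independent dimensions, so they ignore word order and semantic similarity. Word2Vec learns dense static vectors in which related words lie close together, and BERT goes further by producing context-dependent vectors for each token.*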

**Q6. If given a very large text corpus, how would you choose between BoW and TF-IDF?**
*A: Use TF-IDF because it reduces noise from frequent words, making features more informative. BoW is simpler and may be used for smaller datasets or baseline models.*

---



# **Embedding / Vectorization Techniques in NLP**

---

## **1. Classical Techniques** (covered above)

* One-Hot Encoding
* Bag of Words (BoW)
* TF-IDF

---

## **2. Advanced Techniques**

### **A. Word2Vec (Google, 2013)**

* Learns **dense, low-dimensional vectors** for words.
* Trained on large corpora using a shallow neural network.
* Two training architectures:

  * **CBOW (Continuous Bag of Words):** Predicts a word from context.
  * **Skip-gram:** Predicts context words from a word.
* Captures **semantic relationships** → *king – man + woman ≈ queen*.

👉 **Use Cases**: Semantic search, text similarity, document clustering.
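
The difference between the two architectures is clearest in the training examples they consume. The sketch below only builds those examples from a toy sentence; it does not train a network, and the helper names and `window` parameter are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: skip-gram predicts context from center."""
    pairs = []
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples: CBOW predicts target from context."""
    examples = []
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
print(cbow_examples(tokens, window=1)[1])
# (['the', 'brown'], 'quick')
```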

---

### **B. GloVe (Global Vectors, Stanford, 2014)**

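* Builds a global word **co-occurrence matrix** and fits word vectors to it with a weighted least-squares objective on log co-occurrence counts (often described as a form of matrix factorization).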
* Captures **global statistical information** (not just local context like Word2Vec).
* Pre-trained embeddings available (Wikipedia, Common Crawl).

👉 **Use Cases**: Sentiment analysis, semantic similarity, transfer learning.
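
The starting point for GloVe is the co-occurrence table itself. This is a simplified sketch of how such counts could be collected from a toy corpus (the sentences and the `window` size are assumptions); it uses plain unweighted counts and stops before the vector-fitting step.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1.0
    return counts

corpus = ["ice is cold", "steam is hot", "ice is solid"]
counts = cooccurrence_counts(corpus, window=2)

print(counts[("ice", "cold")])  # 1.0
print(counts[("ice", "is")])    # 2.0
print(counts[("ice", "hot")])   # 0.0, never seen together
```

Because the counts are aggregated over the whole corpus before any training happens, the model sees global statistics rather than one context window at a time.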

---

### **C. FastText (Facebook, 2016)**

* Extension of Word2Vec.
* Represents words as **bag of character n-grams**.
* Can generate embeddings for **out-of-vocabulary (OOV)** words → handles morphology well.

👉 **Use Cases**: Multilingual NLP, handling misspellings, domain-specific vocab.
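
The character n-gram idea can be sketched in a few lines. Here a word is wrapped in `<` and `>` boundary markers and split into n-grams; the 3 to 6 range and the helper name `char_ngrams` are assumptions for illustration, and the real model additionally learns a vector per n-gram and sums them.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of the word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Related or misspelled forms share many n-grams, which is what lets
# an unseen word borrow information from known ones.
shared = set(char_ngrams("playing")) & set(char_ngrams("playin"))
print(len(shared) > 0)  # True
```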

---

### **D. ELMo (Embeddings from Language Models, 2018)**

* Among the first widely adopted **contextual embeddings**.
* Word meaning depends on **context** → *“bank” (river vs. finance)*.
* Uses **bi-directional LSTMs**.

👉 **Use Cases**: Named Entity Recognition (NER), POS tagging, sentiment analysis.

---

### **E. BERT (Bidirectional Encoder Representations from Transformers, 2018)**

* **Transformer-based embeddings**.
* Contextual, bidirectional, subword-level.
* Pre-trained with **Masked Language Modeling (MLM)** and **Next Sentence Prediction (NSP)**.
* Set the state of the art on many NLP benchmarks when it was released.

👉 **Use Cases**: QA systems, sentence classification, entity recognition, semantic search.
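
Masked Language Modeling can be illustrated with a small data-preparation sketch. It follows the commonly described BERT recipe (select about 15% of tokens; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged). The function name, the whitespace tokenization, and the tiny vocabulary are assumptions for illustration, and no model is trained here.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original token at selected positions."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # position not selected for prediction
        labels[i] = token                  # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

tokens = "the river bank was flooded after the storm".split()
masked, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

Positions where `labels` is `None` contribute nothing to the training loss; the model is only scored on the selected positions.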

---

### **F. Sentence Transformers (SBERT, 2019)**

* Extension of BERT for **sentence-level embeddings**.
* Produces embeddings that can be compared with **cosine similarity**.
* Used widely in **semantic search, clustering, and RAG systems**.
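
Comparing sentence embeddings with cosine similarity looks roughly like this. The 4-dimensional vectors are made up for illustration (real sentence embeddings have hundreds of dimensions), and `cosine_similarity` is a plain-Python helper, not a library call.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "sentence embeddings" for a query and two candidate documents.
query = [0.9, 0.1, 0.0, 0.3]
candidates = {
    "doc_a": [0.8, 0.2, 0.1, 0.4],
    "doc_b": [0.0, 0.9, 0.7, 0.0],
}

# Rank candidates from most to least similar to the query.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)  # ['doc_a', 'doc_b']
```

Semantic search and retrieval for RAG follow the same pattern: embed the query, embed the documents, and return the highest-scoring matches.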

---

### **G. Doc2Vec (Paragraph Vectors, 2014)**

* Extension of Word2Vec for **document embeddings**.
* Learns fixed-length vector representations for variable-length text (paragraphs, documents).

👉 **Use Cases**: Document classification, clustering, topic modeling.

---

### **H. Transformer-based Large Embeddings (OpenAI, Cohere, etc.)**

* Modern APIs (e.g., **OpenAI’s text-embedding-3-large**) return **general-purpose text embeddings** from a hosted model.
* Capture deep context and generalize across tasks.

👉 **Use Cases**: Enterprise search, recommendation systems, knowledge graphs.

---

## **3. Comparison Table**

| Technique       | Dimension | Context-Aware | Handles OOV | Pros                          | Cons                   | Use Cases                          |
| --------------- | --------- | ------------- | ----------- | ----------------------------- | ---------------------- | ---------------------------------- |
| **One-Hot**     | High      | ❌             | ❌           | Simple                        | Sparse, no meaning     | Toy models                         |
| **BoW**         | High      | ❌             | ❌           | Easy, interpretable           | Ignores order/context  | Baseline classification            |
| **TF-IDF**      | High      | ❌             | ❌           | Weights important words       | Still no context       | Info retrieval, keyword extraction |
| **Word2Vec**    | 100–300   | ❌             | ❌           | Captures semantic similarity  | Same word, same vector | Semantic similarity                |
| **GloVe**       | 100–300   | ❌             | ❌           | Captures global co-occurrence | Static embeddings      | General NLP, similarity            |
| **FastText**    | 100–300   | ❌             | ✅           | Handles rare/OOV words        | Static embeddings      | Multilingual NLP                   |
| **Doc2Vec**     | 100–300   | ❌             | ❌           | Whole document representation | Unseen words ignored at inference; less popular now | Document classification            |
| **ELMo**        | 1024      | ✅             | ✅           | Contextual embeddings         | Heavy, slower          | Sequence labeling                  |
| **BERT**        | 768–1024  | ✅             | ✅           | State-of-the-art contextual   | Large compute cost     | QA, NER, classification            |
| **SBERT**       | Model-dependent (e.g. 384, 768) | ✅             | ✅           | Sentence-level meaning        | May need domain fine-tuning | Semantic search                    |
| **OpenAI Emb.** | 1536–3072 | ✅             | ✅           | General-purpose embeddings    | API cost               | Enterprise search, RAG             |

---

## **4. Interview-Style Q&A**

### **Basic**

**Q1. What is the difference between BoW and Word2Vec?**
*A: BoW is sparse, counts words, ignores context. Word2Vec learns dense embeddings where semantically similar words have similar vectors.*

**Q2. Why is TF-IDF better than BoW?**
*A: TF-IDF reduces the influence of frequent but uninformative words like “the” or “is”, giving higher importance to rare but informative words.*

---

### **Intermediate**

**Q3. What advantage does FastText have over Word2Vec?**
*A: FastText represents words as character n-grams, so it can generate embeddings for out-of-vocabulary words and handle morphology better.*

**Q4. How do contextual embeddings like BERT differ from static embeddings like Word2Vec?**
*A: Static embeddings assign the same vector to a word regardless of context, while contextual embeddings produce a different vector for the same word depending on the sentence it appears in (e.g., “bank” in a river sentence vs. a finance sentence).*

---

### **Advanced**

**Q5. Why are embeddings like BERT or SBERT preferred in modern NLP pipelines?**
*A: They capture deep semantic and syntactic context, generalize across tasks, and provide strong performance on downstream applications like semantic search, QA, and summarization.*

**Q6. How would you choose between TF-IDF and modern embeddings in a real project?**
*A: For small datasets and interpretable tasks, TF-IDF may suffice. For semantic understanding, multilingual support, or advanced tasks (chatbots, RAG, search), modern embeddings like BERT/SBERT or OpenAI embeddings are better.*

---

