## Text Representation

### Text Representation Techniques
- **ML Based**
    1. One Hot Encoding (OHE)
    2. Bag of Words (BoW)
    3. Bag of N-Grams
    4. TF-IDF
    5. Custom Features
- **Deep Learning Based**
    - Word2Vec/Vector Embeddings

### Terms Used in NLP
- Document: Each row (text sample) in a dataset is called a document.
- Corpus: The collection of all documents (all rows) is called the corpus.
- Vocabulary: The set of unique words in the corpus.


## **1. One Hot Encoding (OHE)**  

---

### **1.1. Example Documents**  
- **Document 1**: "people watch techhub"  
- **Document 2**: "techhub watch techhub"  
- **Document 3**: "people write review"  
- **Document 4**: "techhub write review"  

---

### **1.2. Corpus Definition**  
The **corpus** is the set of all documents:  
```
{
  "people watch techhub",
  "techhub watch techhub",
  "people write review",
  "techhub write review"
}
```

---

### **1.3. Vocabulary Generation**  
Unique words across all documents (case-insensitive, no duplicates):  
```
Vocabulary = {"people", "watch", "techhub", "write", "review"}
```
- **Size of Vocabulary (V)**: 5  
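
As a minimal sketch of the vocabulary step, assuming simple lowercasing and whitespace tokenization (real pipelines usually also strip punctuation), the vocabulary can be collected like this. The variable names are illustrative only.

```python
corpus = [
    "people watch techhub",
    "techhub watch techhub",
    "people write review",
    "techhub write review",
]

# Lowercase, split on whitespace, and record each word the first time it appears.
vocabulary = []
for document in corpus:
    for word in document.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

print(vocabulary)       # ['people', 'watch', 'techhub', 'write', 'review']
print(len(vocabulary))  # 5
```

The index of each word in this list is what later serves as its position in every document vector.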

---

### **1.4. One Hot Encoding (OHE) Table**  
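Strictly speaking, one-hot encoding gives each word its own length-V vector with a single `1`. At the document level, as used here, each document is represented as a binary vector over the vocabulary where:  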
- `1` = Word present in the document.  
- `0` = Word absent.  

| Word Index | Word    | Document 1 | Document 2 | Document 3 | Document 4 |
|------------|---------|------------|------------|------------|------------|
| 0          | people  | 1          | 0          | 1          | 0          |
| 1          | watch   | 1          | 1          | 0          | 0          |
| 2          | techhub | 1          | 1          | 0          | 1          |
| 3          | write   | 0          | 0          | 1          | 1          |
| 4          | review  | 0          | 0          | 1          | 1          |

**Vector Representations**:  
- **Doc 1**: `[1, 1, 1, 0, 0]`  
- **Doc 2**: `[0, 1, 1, 0, 0]`  
- **Doc 3**: `[1, 0, 0, 1, 1]`  
- **Doc 4**: `[0, 0, 1, 1, 1]`  
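
The vectors above can be reproduced with a short sketch. It assumes whitespace tokenization and the vocabulary order used in the table; `binary_vector` is a hypothetical helper name, not a library function.

```python
vocabulary = ["people", "watch", "techhub", "write", "review"]
documents = [
    "people watch techhub",
    "techhub watch techhub",
    "people write review",
    "techhub write review",
]

def binary_vector(document, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

ohe_vectors = [binary_vector(doc, vocabulary) for doc in documents]
for i, vector in enumerate(ohe_vectors, start=1):
    print(f"Doc {i}: {vector}")
```

Note that Doc 2 mentions "techhub" twice, yet its entry is still `1`: presence is recorded, not frequency.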

---

### **1.5. Pros of One Hot Encoding (OHE)**  
✅ **Simplicity**: Easy to implement and interpret.  
✅ **Preserves Word Presence**: Clearly marks which words exist in a document.  
✅ **Works with Traditional ML Models**: Compatible with algorithms like Naive Bayes, logistic regression.  
✅ **No Prior Assumptions**: Treats each word as independent (no semantic relationships).  

---

### **1.6. Cons of One Hot Encoding (OHE)**  
❌ **Sparsity**:  
   - Vectors are mostly zeros (high-dimensional but sparse).  
   - Inefficient storage/computation (e.g., a vocabulary of 10K words → 10K-dim vectors).  
❌ **Fixed Dimensionality**:  
   - Vector size depends on vocabulary size.  
   - Adding new words requires retraining (no flexibility).  
❌ **Out-of-Vocabulary (OOV) Problem**:  
   - Fails to encode unseen words (e.g., "blog" in a new document).  
❌ **No Semantic Meaning**:  
   - "techhub" and "review" are equally distant, ignoring context/relationships.  
❌ **Poor for Deep Learning**:  
   - Large vocabularies create computational bottlenecks.  
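
To make the "No Semantic Meaning" point concrete, the sketch below builds a one-hot vector per word and measures the Euclidean distance between every pair. It is only an illustration on the five-word vocabulary from this section; the names are made up for the example.

```python
import math
from itertools import combinations

vocabulary = ["people", "watch", "techhub", "write", "review"]

# One-hot vector per word: a single 1 at the word's own index.
one_hot = {
    word: [1 if i == j else 0 for j in range(len(vocabulary))]
    for i, word in enumerate(vocabulary)
}

# Euclidean distance between every pair of distinct words.
distances = {
    (a, b): math.dist(one_hot[a], one_hot[b])
    for a, b in combinations(vocabulary, 2)
}

for pair, distance in distances.items():
    print(pair, round(distance, 3))
```

Every pair comes out at the same distance (the square root of 2), so the encoding carries no notion of some words being more related than others.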

---

### **1.7. Additional Notes**  
- **Computational Inefficiency**: OHE is impractical for large vocabularies (e.g., 1M+ words).  
- **Alternative Methods**:  
  - **Word Embeddings** (Word2Vec, GloVe): Capture semantic meaning.  
  - **TF-IDF**: Weights words by importance.  
  - **Subword Tokenization** (e.g., BPE): Handles OOV words.  
- **Use Cases**:  
  - Best for small datasets or baseline models.  
  - Avoid for tasks requiring semantic understanding (e.g., translation, sentiment analysis).  

---

### **Summary**  
OHE is a **simple but limited** text representation method. While useful for basic tasks, its **sparsity** and **lack of semantics** make it unsuitable for modern NLP applications. Advanced techniques (embeddings, transformers) address these flaws.  



## **2. Bag of Words (BoW)**  

---

### **2.1. Example Documents**  
- **Document 1**: "people watch techhub"  
- **Document 2**: "techhub watch techhub"  
- **Document 3**: "people write review"  
- **Document 4**: "techhub write review"  

---

### **2.2. Corpus Definition**  
The **corpus** is the set of all documents:  
```
{
  "people watch techhub",
  "techhub watch techhub",
  "people write review",
  "techhub write review"
}
```

---

### **2.3. Vocabulary Generation**  
Unique words across all documents (case-insensitive, no duplicates):  
```
Vocabulary = {"people", "watch", "techhub", "write", "review"}
```
- **Size of Vocabulary (V)**: 5  

---

### **2.4. Bag of Words (BoW) Table**  
Each document is represented as a **frequency vector** where:  
- Each entry counts the occurrences of a word in the document.  

| Word Index | Word    | Document 1 | Document 2 | Document 3 | Document 4 |
|------------|---------|------------|------------|------------|------------|
| 0          | people  | 1          | 0          | 1          | 0          |
| 1          | watch   | 1          | 1          | 0          | 0          |
| 2          | techhub | 1          | 2          | 0          | 1          |
| 3          | write   | 0          | 0          | 1          | 1          |
| 4          | review  | 0          | 0          | 1          | 1          |

**Vector Representations**:  
- **Doc 1**: `[1, 1, 1, 0, 0]`  
- **Doc 2**: `[0, 1, 2, 0, 0]`  
- **Doc 3**: `[1, 0, 0, 1, 1]`  
- **Doc 4**: `[0, 0, 1, 1, 1]`  
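
A from-scratch sketch of the same count vectors, assuming whitespace tokenization and the vocabulary order of the table above (`count_vector` is a hypothetical helper):

```python
from collections import Counter

vocabulary = ["people", "watch", "techhub", "write", "review"]
documents = [
    "people watch techhub",
    "techhub watch techhub",
    "people write review",
    "techhub write review",
]

def count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    # Counter returns 0 for words that never occur in the document.
    return [counts[word] for word in vocabulary]

bow_vectors = [count_vector(doc, vocabulary) for doc in documents]
for i, vector in enumerate(bow_vectors, start=1):
    print(f"Doc {i}: {vector}")
```

The only difference from the binary version is Doc 2, where "techhub" now contributes `2`.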

---

### **2.5. Pros of Bag of Words (BoW)**  
✅ **Simple & Intuitive**: Easy to implement and understand.  
✅ **Preserves Word Frequency**: Captures term importance (unlike OHE).  
✅ **Works with Traditional ML Models**: Compatible with Naive Bayes, logistic regression, etc.  
✅ **No Semantic Assumptions**: Treats words as independent (no context needed).  

---

### **2.6. Cons of Bag of Words (BoW)**  
❌ **Sparsity**:  
   - High-dimensional vectors with many zeros (inefficient storage).  
❌ **Fixed Vocabulary Size**:  
   - Adding new words requires retraining.  
❌ **Out-of-Vocabulary (OOV) Problem**:  
   - Cannot handle unseen words (e.g., "blog" in a new document).  
❌ **No Semantic or Order Information**:  
   - "techhub watch" and "watch techhub" are identical in BoW.  
   - Ignores word relationships (e.g., synonyms, antonyms).  
❌ **Dominance of Common Words**:  
   - Frequent words (e.g., "the", "and") may overshadow rare but meaningful terms.  

---

### **2.7. Additional Notes**  
- **Improvements Over OHE**:  
  - BoW captures **term frequency**, while OHE only tracks presence/absence.  
- **Common Enhancements**:  
  - **TF-IDF**: Weights words by importance (reduces bias from common words).  
  - **N-grams**: Captures word sequences (e.g., "techhub watch" → bigram).  
- **Use Cases**:  
  - Best for **small datasets** or **baseline models**.  
  - Avoid for tasks needing **context** (e.g., machine translation).  

---

### **Comparison with One Hot Encoding (OHE)**  
| Feature          | OHE                          | BoW                          |
|------------------|------------------------------|------------------------------|
| **Representation** | Binary (0/1)                 | Integer counts               |
| **Term Weighting** | None (presence only)        | Captures frequency           |
| **Sparsity**      | High (mostly 0s)             | High (same non-zero positions as OHE) |
| **OOV Handling**  | Fails                       | Fails                       |
| **Use Case**      | Basic text classification   | Slightly richer frequency modeling |  

---

### **Summary**  
BoW is a **frequency-based upgrade** to OHE but still suffers from **sparsity** and **lack of semantics**. Modern NLP prefers **embeddings (Word2Vec, BERT)** or **TF-IDF** for better performance.  

### Tip
The BoW vectors of the following two documents are almost identical, differing only in the entry for `not`, even though the sentences have opposite meanings:
- **This is a very good movie**
- **This is not a very good movie**
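
A quick way to see this is to compute the cosine similarity of the two count vectors. The sketch below assumes lowercasing and whitespace tokenization (so the one-letter word "a" is kept, unlike in scikit-learn's default tokenizer).

```python
import math
from collections import Counter

doc_a = "This is a very good movie"
doc_b = "This is not a very good movie"

counts_a = Counter(doc_a.lower().split())
counts_b = Counter(doc_b.lower().split())

# Shared vocabulary, sorted alphabetically for a stable column order.
vocabulary = sorted(set(counts_a) | set(counts_b))
vec_a = [counts_a[word] for word in vocabulary]
vec_b = [counts_b[word] for word in vocabulary]

# Cosine similarity = dot product divided by the product of the vector lengths.
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = math.sqrt(sum(a * a for a in vec_a))
norm_b = math.sqrt(sum(b * b for b in vec_b))
cosine = dot / (norm_a * norm_b)

print(vocabulary)        # ['a', 'good', 'is', 'movie', 'not', 'this', 'very']
print(vec_a)             # [1, 1, 1, 1, 0, 1, 1]
print(vec_b)             # [1, 1, 1, 1, 1, 1, 1]
print(round(cosine, 3))  # 0.926
```

A similarity this close to 1 means a model relying on BoW alone sees the two reviews as nearly the same text.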





In [1]:
import numpy as np 
import pandas as pd 

In [2]:
df = pd.DataFrame(
    {
        "text":["people watch techhub","techhub watch techhub","people write review","techhub write review"],
        "output":[1,1,0,0]
    }
    )
df

Unnamed: 0,text,output
0,people watch techhub,1
1,techhub watch techhub,1
2,people write review,0
3,techhub write review,0


In [9]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [18]:
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)
# Here 0,1,2,3, and 4 are indices


{'people': 0, 'watch': 3, 'techhub': 2, 'write': 4, 'review': 1}


In [19]:
print(bow[0].toarray()) # "people watch techhub": people -> index 0, techhub -> index 2, watch -> index 3
print(bow[1].toarray()) # "techhub watch techhub": techhub (index 2) has count 2, watch (index 3) has count 1

[[1 0 1 1 0]]
[[0 0 2 1 0]]


In [17]:
cv.transform(['techhub watch and write review']).toarray()
# OOV: "and" is not in the fitted vocabulary, so it is silently ignored rather than encoded

array([[0, 1, 1, 1, 1]])

## **3. Bag of N-Grams**  

---

### **3.1. Example Documents**  
- **Document 1**: "people watch techhub"  
- **Document 2**: "techhub watch techhub"  
- **Document 3**: "people write review"  
- **Document 4**: "techhub write review"  

---

### **3.2. Corpus Definition**  
The **corpus** is the set of all documents:  
```
{
  "people watch techhub",
  "techhub watch techhub",
  "people write review",
  "techhub write review"
}
```

---

### **3.3. Vocabulary Generation (Unigrams + Bigrams)**  
Unique **1-grams (unigrams)** and **2-grams (bigrams)** across all documents:  

#### **Unigrams**:  
```
{"people", "watch", "techhub", "write", "review"}
```  

#### **Bigrams**:  
```
{"people watch", "watch techhub", "techhub watch", "people write", "write review", "techhub write"}
```  

#### **Combined Vocabulary (Unigrams + Bigrams)**:  
```
{
  "people", "watch", "techhub", "write", "review",  
  "people watch", "watch techhub", "techhub watch", "people write", "write review", "techhub write"
}
```  
- **Size of Vocabulary (V)**: 11  

---

### **3.4. Bag of N-Grams Table**  
Each document is represented as a **frequency vector** of unigrams + bigrams.  

| Index | N-Gram          | Document 1 | Document 2 | Document 3 | Document 4 |
|-------|-----------------|------------|------------|------------|------------|
| 0     | people          | 1          | 0          | 1          | 0          |
| 1     | watch           | 1          | 1          | 0          | 0          |
| 2     | techhub         | 1          | 2          | 0          | 1          |
| 3     | write           | 0          | 0          | 1          | 1          |
| 4     | review          | 0          | 0          | 1          | 1          |
| 5     | people watch    | 1          | 0          | 0          | 0          |
| 6     | watch techhub   | 1          | 1          | 0          | 0          |
| 7     | techhub watch   | 0          | 1          | 0          | 0          |
| 8     | people write    | 0          | 0          | 1          | 0          |
| 9     | write review    | 0          | 0          | 1          | 1          |
| 10    | techhub write   | 0          | 0          | 0          | 1          |

**Vector Representations**:  
- **Doc 1**: `[1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]`  
- **Doc 2**: `[0, 1, 2, 0, 0, 0, 1, 1, 0, 0, 0]`  
- **Doc 3**: `[1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0]`  
- **Doc 4**: `[0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1]`  
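
The n-gram counts for each document can be derived mechanically. The sketch below assumes whitespace tokenization; `ngrams` and `bag_of_ngrams` are hypothetical helper names, and the counts are kept as dictionaries rather than fixed-order vectors.

```python
from collections import Counter

documents = [
    "people watch techhub",
    "techhub watch techhub",
    "people write review",
    "techhub write review",
]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(document, n_values=(1, 2)):
    """Count unigrams and bigrams (by default) in one document."""
    tokens = document.lower().split()
    counts = Counter()
    for n in n_values:
        counts.update(ngrams(tokens, n))
    return counts

bags = [bag_of_ngrams(doc) for doc in documents]
for i, bag in enumerate(bags, start=1):
    print(f"Doc {i}: {dict(bag)}")
```

Sliding the window one token at a time is what keeps "techhub watch" and "watch techhub" as separate features in Doc 2.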

---

### **3.5. Pros of Bag of N-Grams**  
✅ **Captures Local Word Order**:  
   - Bigrams/trigrams preserve phrases (e.g., "techhub watch" ≠ "watch techhub").  
✅ **Better Context than BoW**:  
   - "not good" vs. "good" are distinguished.  
✅ **Works with Traditional ML**:  
   - Compatible with classifiers like SVM, Naive Bayes.  

---

### **3.6. Cons of Bag of N-Grams**  
❌ **Higher Dimensionality**:  
   - The number of possible n-grams grows as Vⁿ (e.g., 10K words → up to 100M possible bigrams), although only n-grams actually observed in the corpus enter the vocabulary.  
❌ **Still No Semantics**:  
   - Related words (e.g., "happy" and "joyful") still map to unrelated dimensions, and "happy joy" vs. "joy happy" are simply two different features.  
❌ **Sparsity Worsens**:  
   - More zeros in vectors than BoW.  
❌ **OOV Problem Persists**:  
   - New n-grams (e.g., "techhub review") are ignored.  
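
The dimensionality point can be checked on the toy corpus. This sketch, assuming whitespace tokenization, compares the number of distinct n-grams actually observed with the theoretical upper bound of V to the power n (`distinct_ngrams` is a hypothetical helper).

```python
corpus = [
    "people watch techhub",
    "techhub watch techhub",
    "people write review",
    "techhub write review",
]

def distinct_ngrams(corpus, n):
    """Collect the set of distinct n-grams observed anywhere in the corpus."""
    found = set()
    for document in corpus:
        tokens = document.lower().split()
        for i in range(len(tokens) - n + 1):
            found.add(" ".join(tokens[i:i + n]))
    return found

vocab_size = len(distinct_ngrams(corpus, 1))  # V, the number of unigrams

sizes = {}
for n in (1, 2, 3):
    observed = len(distinct_ngrams(corpus, n))
    upper_bound = vocab_size ** n  # every possible ordering of n words
    sizes[n] = (observed, upper_bound)
    print(f"n={n}: observed={observed}, upper bound={upper_bound}")
```

On a corpus this small the observed counts stay far below the bound, but on real text the observed n-gram vocabulary typically grows much faster than the unigram vocabulary, which is where the extra sparsity comes from.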

---

### **3.7. Additional Notes**  
- **Trade-off**:  
  - Higher *n* (e.g., trigrams) captures more context but increases sparsity.  
- **Common Use Cases**:  
  - Sentiment analysis (e.g., "not bad" vs. "bad").  
  - Short-text classification (e.g., tweets).  
- **Alternatives**:  
  - **TF-IDF Weighting**: Reduces bias from frequent n-grams.  
  - **Word Embeddings**: Better for semantic tasks (e.g., Word2Vec).  

---

### **Comparison with BoW and OHE**  
| Feature          | OHE               | BoW               | Bag of N-Grams          |
|------------------|-------------------|-------------------|-------------------------|
| **Represents**   | Binary presence   | Word counts       | N-gram counts           |
| **Word Order**   | ❌ No             | ❌ No             | ✅ Yes (local only)     |
| **Dimensionality**| Low (V)          | Low (V)          | High (V + Vⁿ)           |
| **Use Case**     | Baseline models  | Frequency analysis| Phrase-sensitive tasks  |

---

### **Summary**  
Bag of N-Grams improves over BoW by **capturing phrases** but suffers from **high sparsity**. For advanced NLP, **embeddings (Word2Vec, BERT)** or **TF-IDF-weighted n-grams** are preferred.  



In [20]:
# cv = CountVectorizer(ngram_range=(1,1)) # default is (1,1) which means the bag of words is the special case of ngrams.
# bag of words: unigram

In [24]:
cv = CountVectorizer(ngram_range=(2,2)) # bigram
bigram = cv.fit_transform(df['text'])
cv.vocabulary_

{'people watch': 0,
 'watch techhub': 4,
 'techhub watch': 2,
 'people write': 1,
 'write review': 5,
 'techhub write': 3}

In [25]:
cv = CountVectorizer(ngram_range=(1,2)) # unigram + bigram
bigram = cv.fit_transform(df['text'])
cv.vocabulary_

{'people': 0,
 'watch': 7,
 'techhub': 4,
 'people watch': 1,
 'watch techhub': 8,
 'techhub watch': 5,
 'write': 9,
 'review': 3,
 'people write': 2,
 'write review': 10,
 'techhub write': 6}

## **4. TF-IDF (Term Frequency-Inverse Document Frequency)**  

---

### **4.1. Example Documents**  
- **Document 1 (D1)**: "people watch techhub"  
- **Document 2 (D2)**: "techhub watch techhub"  
- **Document 3 (D3)**: "people write review"  
- **Document 4 (D4)**: "techhub write review"  

---

### **4.2. Corpus Definition**  
Same as before:  
```python
corpus = [
    "people watch techhub",
    "techhub watch techhub",
    "people write review",
    "techhub write review"
]
```

---

### **4.3. Vocabulary Generation**  
Unique words across all documents:  
```python
vocabulary = ["people", "watch", "techhub", "write", "review"]
```

---

### **4.4. TF-IDF Calculation**  

#### **Step 1: Compute Term Frequency (TF)**  
`TF(t, d) = (Number of times term t appears in document d) / (Total terms in d)`  

| Term     | D1   | D2   | D3   | D4   |
|----------|------|------|------|------|
| people   | 1/3  | 0    | 1/3  | 0    |
| watch    | 1/3  | 1/3  | 0    | 0    |
| techhub  | 1/3  | 2/3  | 0    | 1/3  |
| write    | 0    | 0    | 1/3  | 1/3  |
| review   | 0    | 0    | 1/3  | 1/3  |

#### **Step 2: Compute Inverse Document Frequency (IDF)**  
`IDF(t) = ln(Total documents / Number of documents containing t) + 1` (natural logarithm)  

| Term     | Doc Freq | IDF               |
|----------|----------|-------------------|
| people   | 2        | ln(4/2) + 1 ≈ 1.693 |
| watch    | 2        | ln(4/2) + 1 ≈ 1.693 |
| techhub  | 3        | ln(4/3) + 1 ≈ 1.288 |
| write    | 2        | ln(4/2) + 1 ≈ 1.693 |
| review   | 2        | ln(4/2) + 1 ≈ 1.693 |

#### **Step 3: Compute TF-IDF = TF × IDF**  

| Term     | D1        | D2        | D3        | D4        |
|----------|-----------|-----------|-----------|-----------|
| people   | 0.564     | 0         | 0.564     | 0         |
| watch    | 0.564     | 0.564     | 0         | 0         |
| techhub  | 0.429     | 0.858     | 0         | 0.429     |
| write    | 0         | 0         | 0.564     | 0.564     |
| review   | 0         | 0         | 0.564     | 0.564     |

**Final TF-IDF Vectors**:  
- **D1**: `[0.564, 0.564, 0.429, 0, 0]`  
- **D2**: `[0, 0.564, 0.858, 0, 0]`  
- **D3**: `[0.564, 0, 0, 0.564, 0.564]`  
- **D4**: `[0, 0, 0.429, 0.564, 0.564]`  
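
The three steps can be checked with a short from-scratch sketch. It follows the formulas stated above (length-normalized TF, natural-log IDF plus 1, no vector normalization) and assumes whitespace tokenization; the helper names `tf` and `idf` are illustrative.

```python
import math

corpus = [
    "people watch techhub",
    "techhub watch techhub",
    "people write review",
    "techhub write review",
]
vocabulary = ["people", "watch", "techhub", "write", "review"]

tokenized = [document.split() for document in corpus]
n_docs = len(tokenized)

def tf(term, tokens):
    """Term frequency: occurrences of the term divided by document length."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """Inverse document frequency with natural log and the +1 offset."""
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / doc_freq) + 1

tfidf = [
    [round(tf(term, tokens) * idf(term), 3) for term in vocabulary]
    for tokens in tokenized
]
for i, vector in enumerate(tfidf, start=1):
    print(f"D{i}: {vector}")
```

Rounded to three decimals, the rows match the final vectors listed above, for example `[0.564, 0.564, 0.429, 0.0, 0.0]` for D1.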

---

### **4.5. Pros of TF-IDF**  
✅ **Weighted Importance**:  
   - Rare terms get higher weights (e.g., "review" > "techhub").  
✅ **Reduces Dominance of Common Words**:  
   - Frequent words (e.g., "techhub") are downweighted.  
✅ **Works with Sparse Data**:  
   - Better than raw counts for ML models.  
✅ **No Semantic Assumptions**:  
   - Pure statistical weighting.  

---

### **4.6. Cons of TF-IDF**  
❌ **Still No Word Order**:  
   - "techhub watch" and "watch techhub" get identical vectors.  
❌ **OOV Problem Persists**:  
   - Unseen terms get zero weight.  
❌ **Dimensionality Issues**:  
   - Large vocabularies → high-dimensional vectors.  
❌ **Manual Tuning Needed**:  
   - Several variants exist (e.g., IDF smoothing, the `+1` offset, normalization), and the best choice may require experimentation.  

---

### **4.7. Additional Notes**  
- **N-Grams + TF-IDF**:  
  - Apply TF-IDF to bigrams/trigrams for phrase-aware weighting.  
- **Normalization**:  
  - Often, vectors are L2-normalized for cosine similarity.  
- **Use Cases**:  
  - Search engines, document clustering, and preprocessing for classifiers.  
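
As a hedged illustration of the first two notes, the sketch below applies scikit-learn's `TfidfVectorizer` to unigrams and bigrams together and relies on its default L2 normalization, so that a plain dot product between rows is their cosine similarity. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "people watch techhub",
    "techhub watch techhub",
    "people write review",
    "techhub write review",
]

# ngram_range=(1, 2) weights unigrams and bigrams in one matrix.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (4, 11): 4 documents, 5 unigrams + 6 bigrams

# Rows are L2-normalized by default, so every row has length 1.
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)

# With unit-length rows, the dot product of two rows is their cosine similarity.
similarity = (X @ X.T).toarray()
print(similarity.round(3))
```

The diagonal of `similarity` is 1 (each document compared with itself), and off-diagonal entries are larger for documents that share more weighted terms and phrases.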

---

### **Comparison with Other Methods**  

| Feature          | OHE               | BoW               | TF-IDF             |
|------------------|-------------------|-------------------|--------------------|
| **Weighting**    | Binary (0/1)      | Raw counts        | Term importance    |
| **Word Order**   | ❌ No             | ❌ No             | ❌ No              |
| **Handles Common Words** | ❌ No      | ❌ No             | ✅ Yes             |
| **Dimensionality** | Fixed (V)       | Fixed (V)         | Fixed (V)          |

---

### **Summary**  
TF-IDF improves on BoW by **emphasizing rare, informative terms** but still ignores semantics. For modern NLP, **TF-IDF + n-grams** or **embeddings (Word2Vec, BERT)** are stronger choices.  


In [26]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample DataFrame
df = pd.DataFrame(
    {
        "text": ["people watch techhub", "techhub watch techhub", "people write review", "techhub write review"],
        "output": [1, 1, 0, 0]
    }
)

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data into the TF-IDF matrix
X = vectorizer.fit_transform(df['text'])

# Convert the result to a DataFrame for better visualization
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Add the output column to the DataFrame
tfidf_df['output'] = df['output']

# Display the resulting DataFrame with TF-IDF values
print(tfidf_df)


     people    review   techhub     watch     write  output
0  0.613667  0.000000  0.496816  0.613667  0.000000       1
1  0.000000  0.000000  0.850816  0.525464  0.000000       1
2  0.577350  0.577350  0.000000  0.000000  0.577350       0
3  0.000000  0.613667  0.496816  0.000000  0.613667       0
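
These numbers differ from the hand-calculated table because `TfidfVectorizer`, with its default settings, uses raw counts for TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each row. The sketch below reproduces that recipe in plain Python, assuming whitespace tokenization and the alphabetical column order shown in the output; `smoothed_idf` is a hypothetical helper.

```python
import math

corpus = [
    "people watch techhub",
    "techhub watch techhub",
    "people write review",
    "techhub write review",
]
# Alphabetical order, matching the columns of the printed DataFrame.
vocabulary = ["people", "review", "techhub", "watch", "write"]

tokenized = [document.split() for document in corpus]
n_docs = len(tokenized)

def smoothed_idf(term):
    """Smoothed IDF: behaves as if one extra document contained every term."""
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

rows = []
for tokens in tokenized:
    # Raw count (not divided by document length) times smoothed IDF.
    raw = [tokens.count(term) * smoothed_idf(term) for term in vocabulary]
    # L2 normalization: divide by the Euclidean length of the row.
    length = math.sqrt(sum(value * value for value in raw))
    rows.append([round(value / length, 6) for value in raw])

for row in rows:
    print(row)
```

Up to rounding, the rows should agree with the DataFrame printed above. Because of the normalization, the document-length division in the manual TF step no longer matters: any constant factor per document cancels out.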
