# 📦 Chapter 2: Text Vectorization with Bag of Words (BoW) and TF-IDF

## 📌 Overview  
Machine learning models require numerical inputs, but raw text is unstructured. To bridge this gap, we use text vectorization techniques such as:
- **Bag of Words (BoW)**: Counts how often each vocabulary word appears in a document.
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: Reweights raw counts according to how common or rare a word is across the whole corpus.

These methods help transform text into structured feature vectors for machine learning models.

---

## 1️⃣ Bag of Words (BoW)  
**Goal:** Represent each document as a vector of word counts.


In [2]:
from sklearn.feature_extraction.text import CountVectorizer  # Import BoW vectorizer

# Example corpus (a list of text documents)
corpus = [
    "Natural language processing is interesting",
    "Machine learning is fun",
    "Natural language processing and machine learning"
]

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer on the corpus and transform the text data into BoW matrix
X_bow = vectorizer.fit_transform(corpus)

# Show the vocabulary (feature names)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Convert the sparse matrix to an array for readability
print("BoW Matrix:\n", X_bow.toarray())


Vocabulary: ['and' 'fun' 'interesting' 'is' 'language' 'learning' 'machine' 'natural'
 'processing']
BoW Matrix:
 [[0 0 1 1 1 0 0 1 1]
 [0 1 0 1 0 1 1 0 0]
 [1 0 0 0 1 1 1 1 1]]
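
To make the counting step concrete, here is a minimal standard-library sketch that rebuilds the matrix above by hand. It assumes a simplified tokenizer (lowercase, split on whitespace), which happens to be enough for this corpus; `CountVectorizer` itself uses a regular-expression token pattern, so results can differ on text with punctuation or one-letter words. The helper names `tokenize` and `bow_vector` and the unseen sentence are made up for illustration.

```python
from collections import Counter

corpus = [
    "Natural language processing is interesting",
    "Machine learning is fun",
    "Natural language processing and machine learning",
]

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase, then split on whitespace
    return text.lower().split()

# Vocabulary = every distinct token in the corpus, in alphabetical order
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

def bow_vector(text):
    # Count tokens, then read off one count per vocabulary entry.
    # Counter returns 0 for missing keys, and words outside the
    # vocabulary are never looked up, so they are silently dropped.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

bow_matrix = [bow_vector(doc) for doc in corpus]

# Hypothetical unseen sentence: "deep" and "rewarding" are out of vocabulary
unseen = bow_vector("Deep learning is fun and learning is rewarding")

print(vocabulary)
for row in bow_matrix:
    print(row)
print(unseen)
```

The last vector contains counts of 2 for "is" and "learning", which shows that BoW entries are frequencies rather than presence flags, while the two unknown words contribute nothing.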


## 2️⃣ Term Frequency-Inverse Document Frequency (TF-IDF)

**Goal:** Weigh words by their importance:
- **Term Frequency (TF)**: How often a word appears in a document.
- **Inverse Document Frequency (IDF)**: Downweights words that appear in many documents across the corpus.
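
For reference, the two ingredients can be written as formulas. The notation is introduced here and is not from the original text: $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$. The first line is a common textbook form; the second is the smoothed variant that scikit-learn's `TfidfVectorizer` applies with its default settings.

$$
\begin{aligned}
\text{textbook:}\quad & \mathrm{idf}(t) = \log\frac{n}{\mathrm{df}(t)} \\
\text{scikit-learn default:}\quad & \mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \\
\text{weight:}\quad & \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot \mathrm{idf}(t)
\end{aligned}
$$

With the defaults, each document vector is then divided by its L2 norm. As a small worked case for a three-document corpus: a word found in one document gets $\ln(4/2) + 1 \approx 1.693$, while a word found in two documents gets $\ln(4/3) + 1 \approx 1.288$.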

Example using TfidfVectorizer:

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer  # Import TF-IDF vectorizer

# Reusing the same corpus
corpus = [
    "Natural language processing is interesting",
    "Machine learning is fun",
    "Natural language processing and machine learning"
]

# Create an instance of TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the corpus into TF-IDF matrix
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

# Show the vocabulary (feature names)
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())

# Convert the sparse matrix to an array for readability
print("TF-IDF Matrix:\n", X_tfidf.toarray())


Vocabulary: ['and' 'fun' 'interesting' 'is' 'language' 'learning' 'machine' 'natural'
 'processing']
TF-IDF Matrix:
 [[0.         0.         0.54935123 0.41779577 0.41779577 0.
  0.         0.41779577 0.41779577]
 [0.         0.60465213 0.         0.45985353 0.         0.45985353
  0.45985353 0.         0.        ]
 [0.50689001 0.         0.         0.         0.38550292 0.38550292
  0.38550292 0.38550292 0.38550292]]
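
The numbers above can be reproduced by hand. The following is a sketch under stated assumptions, not scikit-learn's actual implementation: it uses whitespace tokenization (adequate for this corpus only) and mirrors the documented defaults of `TfidfVectorizer` (raw counts as TF, smoothed IDF, L2 normalization of each row).

```python
import math

corpus = [
    "Natural language processing is interesting",
    "Machine learning is fun",
    "Natural language processing and machine learning",
]

# Simplified tokenization (assumption): lowercase and split on whitespace
docs = [doc.lower().split() for doc in corpus]
vocabulary = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed IDF, as in scikit-learn's default: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

tfidf_matrix = []
for doc in docs:
    # Raw term count times IDF for every vocabulary entry
    raw = [doc.count(term) * idf[term] for term in vocabulary]
    # Scale the row to unit Euclidean (L2) length
    norm = math.sqrt(sum(weight * weight for weight in raw))
    tfidf_matrix.append([weight / norm for weight in raw])

for row in tfidf_matrix:
    print([round(weight, 8) for weight in row])
```

Up to rounding, the printed rows should agree with the `TfidfVectorizer` output above. Because of the normalization step, weights are comparable within a row but depend on how many other terms the document contains.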


## ✅ How to Interpret
- In BoW, the numbers are raw word counts.
- In TF-IDF, a higher value means the word is frequent in that document but appears in few other documents.
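
One way to read a TF-IDF row is to rank its terms by weight. The sketch below does this with the standard library only; the rows are copied from the TF-IDF output earlier in this chapter, and `top_terms` is a hypothetical helper name.

```python
vocabulary = ["and", "fun", "interesting", "is", "language",
              "learning", "machine", "natural", "processing"]

# Rows copied from the TfidfVectorizer output shown earlier
tfidf_rows = [
    [0.0, 0.0, 0.54935123, 0.41779577, 0.41779577, 0.0, 0.0, 0.41779577, 0.41779577],
    [0.0, 0.60465213, 0.0, 0.45985353, 0.0, 0.45985353, 0.45985353, 0.0, 0.0],
    [0.50689001, 0.0, 0.0, 0.0, 0.38550292, 0.38550292, 0.38550292, 0.38550292, 0.38550292],
]

def top_terms(weights, k=2):
    """Return the k highest-weighted (term, weight) pairs of one document."""
    pairs = [(term, w) for term, w in zip(vocabulary, weights) if w > 0]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]

# Highest-weighted term of each document
top_per_doc = [top_terms(row, k=1)[0][0] for row in tfidf_rows]
for i, term in enumerate(top_per_doc, start=1):
    print(f"Doc {i}: {term}")
```

In each document the top term is the one that occurs in no other document ("interesting", "fun", "and"). That "and" ranks first in Doc 3 is an artifact of the tiny corpus: it appears in only one of three documents, so it looks rare here even though it would be very common in a larger collection.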

## 🧩 Comparing BoW vs. TF-IDF

| Aspect          | Bag of Words                     | TF-IDF                                             |
|-----------------|----------------------------------|----------------------------------------------------|
| Values          | Raw word counts                  | Counts weighted by word rarity                     |
| Common Words    | Frequent words get large values  | Words common across documents are downweighted     |
| Sparse          | Yes                              | Yes                                                |
| Use Case        | Simple models, baseline methods  | Text classification, where distinctive words matter |


## Corpus
1. "Natural language processing is interesting"
2. "Machine learning is fun"
3. "Natural language processing and machine learning"


🧮 Example Bag of Words (BoW) Matrix:
| Document | and | fun | interesting | is | language | learning | machine | natural | processing |
|----------|-----|-----|-------------|----|----------|----------|---------|---------|------------|
| Doc 1    | 0   | 0   | 1           | 1  | 1        | 0        | 0       | 1       | 1          |
| Doc 2    | 0   | 1   | 0           | 1  | 0        | 1        | 1       | 0       | 0          |
| Doc 3    | 1   | 0   | 0           | 0  | 1        | 1        | 1       | 1       | 1          |

🧮 Example TF-IDF Matrix (values from the `TfidfVectorizer` output above, rounded to two decimals):
| Document | and   | fun   | interesting | is    | language | learning | machine | natural | processing |
|----------|-------|-------|-------------|-------|----------|----------|---------|---------|------------|
| Doc 1    | 0     | 0     | 0.55        | 0.42  | 0.42     | 0        | 0       | 0.42    | 0.42       |
| Doc 2    | 0     | 0.60  | 0           | 0.46  | 0        | 0.46     | 0.46    | 0       | 0          |
| Doc 3    | 0.51  | 0     | 0           | 0     | 0.39     | 0.39     | 0.39    | 0.39    | 0.39       |

⚠️ Note: These values reflect scikit-learn's defaults, which use a smoothed IDF and normalize each document vector to unit L2 norm. Other TF-IDF formulas will produce different numbers. Within a row, words with the same count and the same document frequency always receive the same weight.

