<center>

# NLP_06: Bag-of-Words (BoW) and TF-IDF

</center>
<br>
<br>




#### This notebook demonstrates how to convert text data into numerical features using:
- Bag-of-Words (CountVectorizer)
- TF-IDF (TfidfVectorizer)

These methods are fundamental for text mining and NLP tasks.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
corpus = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding in Python",
    "Python is great for machine learning"
]

In [3]:
# Convert to DataFrame for better visualization
df = pd.DataFrame({'Text': corpus})
df


Unnamed: 0,Text
0,I love machine learning
1,Machine learning is fun
2,I love coding in Python
3,Python is great for machine learning


## Bag of Words (BoW) in NLP

**Bag of Words (BoW)** is a fundamental technique in **text mining** and **information retrieval**.  

It represents text documents as **term-frequency vectors**, allowing **mathematical analysis** and **machine learning** applications.


## Key Components of BoW

1. **Vocabulary (Dictionary)**  
   - Contains all **unique words** in the text corpus.  
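   - Words are usually **normalized** before counting, for example:  
     - Lowercased: `Python -> python`  
     - Lemmatized: `ran -> run`  
     - Singularized: `dogs -> dog`  
   - `CountVectorizer` only lowercases by default; lemmatization and singularization require a separate preprocessing step.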

2. **Term Frequencies**  
   - Number of occurrences of each term in each document.


### Example Documents

- DOC 1: I love Python programming.  
- DOC 2: Python is easy to learn.  
- DOC 3: I am learning Python with fun projects.


### Bag of Words Table

| TERM        | DOC 1 | DOC 2 | DOC 3 |
|-------------|-------|-------|-------|
| am          | 0     | 0     | 1     |
| easy        | 0     | 1     | 0     |
| fun         | 0     | 0     | 1     |
| i           | 1     | 0     | 1     |
| is          | 0     | 1     | 0     |
| learn       | 0     | 1     | 0     |
| learning    | 0     | 0     | 1     |
| love        | 1     | 0     | 0     |
| programming | 1     | 0     | 0     |
| projects    | 0     | 0     | 1     |
| python      | 1     | 1     | 1     |
| to          | 0     | 1     | 0     |
| with        | 0     | 0     | 1     |

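In this table, each row represents a **term** and each column represents a **document**. `CountVectorizer` returns the transposed layout: one row per document and one column per term.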
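To make the counting step concrete, here is a minimal sketch that builds a term-by-document count table for the three example documents using only the standard library. The `tokenize` helper and its regular expression are illustrative assumptions (lowercase, alphabetic tokens only), not the tokenizer that `CountVectorizer` uses.

```python
import re
from collections import Counter

docs = [
    "I love Python programming.",
    "Python is easy to learn.",
    "I am learning Python with fun projects.",
]

def tokenize(text):
    # Assumption for this sketch: lowercase the text and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

# One Counter (term -> count) per document.
counts = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is the sorted set of all unique terms in the corpus.
vocabulary = sorted(set().union(*counts))

# One entry per term, holding its count in DOC 1, DOC 2, DOC 3.
table = {term: [c[term] for c in counts] for term in vocabulary}

for term, row in table.items():
    print(f"{term:<12}", *row)
```

No lemmatization is applied here, so `learn` and `learning` remain separate terms.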


### Advantages of BoW

- Simple and easy to implement.
- Useful for **text classification** and **document similarity**.
- Can be fed into **machine learning algorithms**.
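As an illustration of the document-similarity use case, the sketch below compares three sentences from the notebook corpus using cosine similarity on raw word counts. The `bow` and `cosine` helpers are hypothetical names written for this example, and whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter

def bow(text):
    # Assumption: lowercase and split on whitespace.
    return Counter(text.lower().split())

def cosine(a, b):
    # Dot product over shared terms, divided by the product of vector lengths.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

d0 = bow("I love machine learning")
d1 = bow("Machine learning is fun")
d2 = bow("I love coding in Python")

sim_01 = cosine(d0, d1)  # shares "machine" and "learning"
sim_02 = cosine(d0, d2)  # shares "i" and "love"

print(round(sim_01, 3))  # 0.5
print(round(sim_02, 3))  # 0.447
```

Both pairs share two words, but the third sentence is longer, so its similarity to the first is slightly lower.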

### Limitations

- Ignores **word order** → “I love Ironhack” vs “Ironhack love I”.  
- Ignores **semantics** → synonyms like “student” and “pupil” are different words.  
- Vocabulary can become **very large** → memory issues for big datasets.


In [4]:
# CountVectorizer converts text into a matrix of token counts.

# Initialize the vectorizer
bow_vectorizer = CountVectorizer()

In [5]:
# Fit and transform the corpus
bow_matrix = bow_vectorizer.fit_transform(corpus)

In [6]:
# Convert to DataFrame for better readability
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=bow_vectorizer.get_feature_names_out())
bow_df


Unnamed: 0,coding,for,fun,great,in,is,learning,love,machine,python
0,0,0,0,0,0,0,1,1,1,0
1,0,0,1,0,0,1,1,0,1,0
2,1,0,0,0,1,0,0,1,0,1
3,0,1,0,1,0,1,1,0,1,1


## TF-IDF (Term Frequency-Inverse Document Frequency)

**TF-IDF** is a numerical statistic that reflects how important a word is to a document in a collection (corpus).  

It improves on **Bag of Words** by **reducing the weight of common words** (like "the", "is", "a") and highlighting **important, distinctive words** in each document.


### Key Components of TF-IDF

1. **Term Frequency (TF)**  
   - Measures how often a word appears in a document.  
   - Formula:  

$$
TF(t,d) = \frac{\text{Number of times term t appears in document d}}{\text{Total number of terms in document d}}
$$

2. **Inverse Document Frequency (IDF)**  
   - Measures how rare, and therefore how distinctive, a term is across all documents.  
   - Terms that appear in few documents have a higher IDF.  
   - Formula:  
$$
IDF(t) = \log \frac{\text{Total number of documents}}{\text{Number of documents containing term t}}
$$

3. **TF-IDF Score**  
   - Combines TF and IDF:  
$$
TF\text{-}IDF(t,d) = TF(t,d) \times IDF(t)
$$

Words that appear frequently in a document but rarely in other documents get **high TF-IDF scores**.

### TF-IDF Table Explanation

- Each **row** represents a **document**.  
- Each **column** represents a **term** in the corpus.  
- The values are the **TF-IDF scores**, which tell how important a word is in that document compared to others.  

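Example insights for the three example documents (DOC 1 to DOC 3), using the formulas in this section:  
- The word **"python"** appears in all three documents → its IDF is $\log(3/3) = 0$, so its TF-IDF score is 0 in every document.  
- The word **"programming"** appears only in DOC 1 → high TF-IDF score for DOC 1.  
- In a larger corpus, common words like **"i", "is", "to"** appear in most documents and therefore get lower scores.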


#### Example Corpus
* **d1:** "I love machine learning"
* **d2:** "I love deep learning"

**Total documents ($N$):** $2$  
**Vocabulary ($V$):** $\{i, love, machine, learning, deep\}$


### Step-by-Step Manual Calculation

#### Document 1: "I love machine learning" (Total words = 4)


#### 1.1 TF values (Term Frequency)

| Word     | Count | TF         |
| -------- | ----- | ---------- |
| i        | 1     | 1/4 = 0.25 |
| love     | 1     | 0.25       |
| machine  | 1     | 0.25       |
| learning | 1     | 0.25       |
| deep     | 0     | 0          |


#### 1.2 Document Frequency (df)

| Word     | df |
| -------- | -- |
| i        | 2  |
| love     | 2  |
| machine  | 1  |
| learning | 2  |
| deep     | 1  |


#### 1.3 IDF Calculation (log base e)

$$
\text{IDF}(i) = \log(2/2) = 0
$$
$$
\text{IDF}(love) = 0
$$
$$
\text{IDF}(learning) = 0
$$
$$
\text{IDF}(machine) = \log(2/1) = 0.693
$$
$$
\text{IDF}(deep) = 0.693
$$

#### 1.4 TF-IDF (Document 1)

$$\text{TF-IDF} = \text{TF} \times \text{IDF} $$

| Word     | TF   | IDF   | TF-IDF    |
| -------- | ---- | ----- | --------- |
| i        | 0.25 | 0     | 0         |
| love     | 0.25 | 0     | 0         |
| machine  | 0.25 | 0.693 | **0.173** |
| learning | 0.25 | 0     | 0         |
| deep     | 0    | 0.693 | 0         |


#### 1.5 Document 2 (Result Only)

| Word     | TF-IDF    |
| -------- | --------- |
| i        | 0         |
| love     | 0         |
| machine  | 0         |
| learning | 0         |
| deep     | **0.173** |


#### Final TF-IDF Matrix

Rows correspond to d1 and d2; columns follow the vocabulary order (i, love, machine, learning, deep).

$$
\begin{bmatrix}
0 & 0 & 0.173 & 0 & 0 \\
0 & 0 & 0 & 0 & 0.173
\end{bmatrix}
$$
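
The manual calculation above can be reproduced in a few lines. This is a minimal sketch of the textbook formulas from this section (relative term frequency and natural-log IDF without smoothing). The helper names `tf` and `idf` are illustrative, and whitespace tokenization is an assumption.

```python
import math

docs = ["I love machine learning", "I love deep learning"]
vocab = ["i", "love", "machine", "learning", "deep"]

# Assumption: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]
n_docs = len(docs)

def tf(term, tokens):
    # Occurrences of the term divided by the document length.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Natural log of (number of documents / documents containing the term).
    df = sum(term in tokens for tokens in tokenized)
    return math.log(n_docs / df)

matrix = [
    [round(tf(term, tokens) * idf(term), 3) for term in vocab]
    for tokens in tokenized
]

for row in matrix:
    print(row)
# [0.0, 0.0, 0.173, 0.0, 0.0]
# [0.0, 0.0, 0.0, 0.0, 0.173]
```

Only "machine" and "deep" receive non-zero scores, because every other word appears in both documents and has an IDF of 0.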




### Important Interview Tip

If someone asks:

> “Why is TF-IDF better than BoW?”

Answer:

> Because TF-IDF reduces the weight of common words and increases the importance of rare but meaningful words.


### One-Line Formula to Remember

$$
\boxed{
\text{TF-IDF}(w, d) = \frac{f(w,d)}{|d|} \times \log\left(\frac{N}{df(w)}\right)
}$$

In [7]:
# TF-IDF assigns weight to words based on their frequency in a document and across all documents.

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

In [8]:
# Fit and transform the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

In [9]:
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df


Unnamed: 0,coding,for,fun,great,in,is,learning,love,machine,python
0,0.0,0.0,0.0,0.0,0.0,0.0,0.53257,0.657829,0.53257,0.0
1,0.0,0.0,0.640655,0.0,0.0,0.5051,0.408922,0.0,0.408922,0.0
2,0.555283,0.0,0.0,0.0,0.555283,0.0,0.0,0.437791,0.0,0.437791
3,0.0,0.496414,0.0,0.496414,0.0,0.391378,0.316854,0.0,0.316854,0.391378
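
These scores do not match the textbook formula, and words such as "learning" that appear in most documents still get non-zero weights. With its default settings, `TfidfVectorizer` uses raw counts as TF, a smoothed IDF of $\ln\frac{1+N}{1+df(t)} + 1$, and then L2-normalizes each row. The sketch below rebuilds the matrix from those three steps and compares it with the library output; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "I love machine learning",
    "Machine learning is fun",
    "I love coding in Python",
    "Python is great for machine learning",
]

# Step 1: raw term counts (documents x terms).
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Step 2: smoothed IDF, ln((1 + N) / (1 + df)) + 1.
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Step 3: weight the counts and L2-normalize each document row.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Reference output from the library with default settings.
reference = TfidfVectorizer().fit_transform(corpus).toarray()

print(np.allclose(manual, reference))
```

The row normalization explains why the same word, for example "learning", has a different score in each document even though its count is 1 everywhere.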


### Compare BoW and TF-IDF

In [10]:
print("Feature Names:", bow_vectorizer.get_feature_names_out())

Feature Names: ['coding' 'for' 'fun' 'great' 'in' 'is' 'learning' 'love' 'machine'
 'python']


In [11]:
print("\nBag-of-Words Matrix:\n", bow_df)


Bag-of-Words Matrix:
    coding  for  fun  great  in  is  learning  love  machine  python
0       0    0    0      0   0   0         1     1        1       0
1       0    0    1      0   0   1         1     0        1       0
2       1    0    0      0   1   0         0     1        0       1
3       0    1    0      1   0   1         1     0        1       1


In [12]:
print("\nTF-IDF Matrix:\n", tfidf_df)


TF-IDF Matrix:
      coding       for       fun     great        in        is  learning  \
0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.532570   
1  0.000000  0.000000  0.640655  0.000000  0.000000  0.505100  0.408922   
2  0.555283  0.000000  0.000000  0.000000  0.555283  0.000000  0.000000   
3  0.000000  0.496414  0.000000  0.496414  0.000000  0.391378  0.316854   

       love   machine    python  
0  0.657829  0.532570  0.000000  
1  0.000000  0.408922  0.000000  
2  0.437791  0.000000  0.437791  
3  0.000000  0.316854  0.391378  


<div style="text-align: right;">
    <b>Author:</b> Monower Hossen <br>
    <b>Date:</b> January 7, 2026
</div>
