### **TF-IDF (Term Frequency – Inverse Document Frequency)**

---

### **What is TF-IDF?**

**TF-IDF** is a text representation method that tells us:

> **How important a word is in a document**, compared to all other documents in the dataset.

It gives **higher weight** to words that occur often in one document but rarely in the rest of the dataset, and **lower weight** to words that are common everywhere.

---

### **Why Use TF-IDF?**

* Bag of Words (BoW) only counts how often each word occurs, so it treats every word as equally informative.
* But some words (like "the", "is", "and") appear in almost every document and do little to tell documents apart.
* TF-IDF **reduces the weight of common words** and **increases the weight of unique/important words**.

---

### **How Does It Work?**

TF-IDF = **TF × IDF**

* **TF (Term Frequency)** = How often a word appears in a document
* **IDF (Inverse Document Frequency)** = How rare the word is across all documents

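If a word appears in many documents → IDF becomes low → TF-IDF becomes low. Conversely, a word that appears in only a few documents gets a high IDF, so it scores high in the documents where it does occur.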
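
The product TF × IDF can be worked through by hand. The sketch below is a minimal illustration, not a reference implementation: it assumes raw counts for TF and the smoothed IDF variant `ln((1 + N) / (1 + df)) + 1` (the default in scikit-learn; other texts define IDF slightly differently). The helper names `term_frequency`, `inverse_document_frequency`, and `tf_idf` are made up for this example.

```python
import math

# Toy corpus used only for illustration.
docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)


def term_frequency(term, tokens):
    """Raw count of `term` in a single tokenized document."""
    return tokens.count(term)


def inverse_document_frequency(term):
    """Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tf_idf(term, tokens):
    """Unnormalised TF-IDF score of `term` in one document."""
    return term_frequency(term, tokens) * inverse_document_frequency(term)


# "campusx" occurs in 3 of 4 documents, "people" in 2 of 4,
# so "campusx" receives the lower IDF.
for term in ["campusx", "people"]:
    print(term, round(inverse_document_frequency(term), 4))
```

With this corpus, "campusx" gets a lower IDF than "people" because it occurs in more documents.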

---

### **Summary**

| Aspect    | Details                                          |
| --------- | ------------------------------------------------ |
| Purpose   | Weight words by importance                       |
| Pros      | Reduces noise from common words                  |
| Cons      | Still doesn’t capture word order or meaning      |
| Use Cases | Text classification, document similarity, search |
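
As a rough illustration of the document-similarity use case, the sketch below compares TF-IDF vectors with cosine similarity. It assumes scikit-learn's `TfidfVectorizer` and `cosine_similarity` with default settings; the corpus and the variable names are chosen only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus used only for illustration.
docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Each row of `matrix` is the TF-IDF vector of one document.
matrix = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity between all documents (4 x 4 array).
similarity = cosine_similarity(matrix)

# Find the document most similar to document 0, excluding itself.
scores = similarity[0].copy()
scores[0] = -1.0
best_match = int(scores.argmax())
print(best_match, docs[best_match])
```

Document 0 shares two words with document 1 and only one word with each of the others, so document 1 comes out as the closest match.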

---


In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame({'text': ['people watch campusx', 'campusx watch campusx', 'people write comment', 'campusx write comment'], 'output': [1, 1, 0, 0]})
df

|     | text                  | output |
| --- | --------------------- | ------ |
| 0   | people watch campusx  | 1      |
| 1   | campusx watch campusx | 1      |
| 2   | people write comment  | 0      |
| 3   | campusx write comment | 0      |


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [None]:
tfidf = TfidfVectorizer()  # defaults: unigrams, smoothed IDF, L2-normalised rows


In [6]:
# Learn the vocabulary and IDF weights, then return the dense document-term matrix
tfidf.fit_transform(df['text']).toarray()

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [11]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.22314355 1.51082562 1.51082562 1.51082562 1.51082562]
['campusx' 'comment' 'people' 'watch' 'write']
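
The matrix values are not plain count × IDF products, because `TfidfVectorizer` by default also scales every row to unit L2 length. The sketch below tries to reproduce the matrix step by step under that assumption (default `smooth_idf=True` and `norm='l2'`); the intermediate names `counts`, `doc_freq`, `weighted`, and `manual` are made up for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Step 1: raw term counts (the Bag of Words matrix).
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Step 2: document frequency = number of documents containing each term.
doc_freq = (counts > 0).sum(axis=0)

# Step 3: smoothed IDF, ln((1 + N) / (1 + df)) + 1.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 4: multiply counts by IDF, then scale each row to unit L2 norm.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's own result.
reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

For the first document, the three weights 1.2231, 1.5108, and 1.5108 have an L2 norm of about 2.462, which gives 0.4968 for "campusx" and 0.6137 for "people" and "watch".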


### **TF-IDF with N-Grams**

In [12]:
tfidf = TfidfVectorizer(ngram_range=(2, 2))  # bigrams only


In [13]:
tfidf.fit_transform(df['text']).toarray()

array([[0.        , 0.        , 0.78528828, 0.        , 0.6191303 ,
        0.        ],
       [0.78528828, 0.        , 0.        , 0.        , 0.6191303 ,
        0.        ],
       [0.        , 0.        , 0.        , 0.78528828, 0.        ,
        0.6191303 ],
       [0.        , 0.78528828, 0.        , 0.        , 0.        ,
        0.6191303 ]])

In [14]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.91629073 1.91629073 1.91629073 1.91629073 1.51082562 1.51082562]
['campusx watch' 'campusx write' 'people watch' 'people write'
 'watch campusx' 'write comment']
