# TF-IDF (Term Frequency-Inverse Document Frequency)

## What is TF-IDF?

**TF-IDF** stands for **Term Frequency-Inverse Document Frequency**. It is a statistical measure used to evaluate the importance of a word in a document relative to a collection (or corpus) of documents. It is widely used in text analysis, information retrieval, and natural language processing (NLP) to convert text data into numerical vectors.

TF-IDF works by balancing:
1. **Term Frequency (TF):** How often a word appears in a document.
2. **Inverse Document Frequency (IDF):** How unique or rare the word is across all documents.

---

# Components of TF-IDF

## 1. Term Frequency (TF)

This measures how frequently a term (word) appears in a specific document.

\[
\text{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of words in } d}
\]

### Example:
For the sentence **"I love data science and I love learning"**, the term frequency of the word **"love"** is:

\[
TF(\text{"love"}) = \frac{2}{8} = 0.25
\]
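The calculation above can be reproduced with a few lines of Python. This is a minimal sketch that assumes simple whitespace tokenization and lowercasing; the helper name `term_frequency` is illustrative and not part of any library.

```python
from collections import Counter


def term_frequency(term: str, document: str) -> float:
    """Relative frequency of `term` in `document`.

    Assumption: tokens are produced by lowercasing and splitting on whitespace.
    """
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    return Counter(tokens)[term.lower()] / len(tokens)


sentence = "I love data science and I love learning"
print(term_frequency("love", sentence))  # 2 occurrences / 8 words = 0.25
```

A real pipeline would usually strip punctuation and apply a proper tokenizer before counting.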

---

## 2. Inverse Document Frequency (IDF)

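This measures how informative a term is across the whole corpus. Terms that appear in many documents (e.g., "the", "is", "and") receive a low weight, while terms that appear in only a few documents receive a high weight.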

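\[
\text{idf}(t) = \log\left(\frac{N}{\text{df}(t)}\right)
\]

Here \(N\) is the total number of documents and \(\text{df}(t)\) is the number of documents that contain the term \(t\). The logarithm (natural log in this document) dampens the ratio so that very rare terms do not dominate. Some implementations add a smoothing term or a constant such as \(+1\), which changes the numeric values but not the ranking idea.

- A high IDF value means the term is rare across documents.
- A low IDF value means the term is common.

### Example:
If the term **"learning"** appears in only 1 out of 10 documents:

\[
IDF(\text{"learning"}) = \log\left(\frac{10}{1}\right) = \log(10) \approx 2.30
\]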
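As a rough check of the example, the sketch below computes IDF with the natural logarithm and no smoothing term, so \(\ln(10/1) \approx 2.30\). The function name and the ten-document toy corpus are hypothetical and exist only for illustration.

```python
import math


def inverse_document_frequency(term: str, documents: list[str]) -> float:
    """IDF as ln(N / df), where df is the number of documents containing `term`.

    Assumptions: whitespace tokenization, lowercasing, no smoothing.
    """
    term = term.lower()
    df = sum(term in doc.lower().split() for doc in documents)
    if df == 0:
        raise ValueError(f"{term!r} does not appear in any document")
    return math.log(len(documents) / df)


# Hypothetical corpus: "learning" occurs in exactly 1 of 10 documents.
toy_corpus = ["learning from data"] + [f"filler document {i}" for i in range(9)]

print(round(inverse_document_frequency("learning", toy_corpus), 2))  # ln(10), prints 2.3
print(inverse_document_frequency("document", toy_corpus))  # ln(10/9), a much smaller value
```

The second call shows the other end of the scale: a term found in 9 of 10 documents gets an IDF close to zero.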


---

## 3. TF-IDF Score
The final TF-IDF score is the product of TF and IDF:

\[
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)
\]

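A term scores high when it occurs often in a given document but in few other documents of the corpus. A term that occurs in most documents scores low, no matter how often it appears in any one of them.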
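To make the product concrete, here is a small from-scratch sketch. It assumes whitespace tokenization, lowercasing, and an unsmoothed natural-log IDF; the helper names `tf`, `idf`, and `tf_idf` are illustrative only.

```python
import math


def tf(term: str, document: str) -> float:
    """Share of tokens in `document` equal to `term`."""
    tokens = document.lower().split()
    return tokens.count(term.lower()) / len(tokens) if tokens else 0.0


def idf(term: str, corpus: list[str]) -> float:
    """ln(N / df); returns 0.0 for unseen terms (a choice made for this sketch)."""
    df = sum(term.lower() in doc.lower().split() for doc in corpus)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(term: str, document: str, corpus: list[str]) -> float:
    """Product of term frequency and inverse document frequency."""
    return tf(term, document) * idf(term, corpus)


corpus = [
    "I love data science",
    "Data science is amazing",
    "I enjoy learning data science",
]

print(tf_idf("love", corpus[0], corpus))  # 0.25 * ln(3), about 0.275
print(tf_idf("data", corpus[0], corpus))  # 0.0, because "data" occurs in every document
```

Library implementations often produce different numbers for the same corpus. scikit-learn's `TfidfVectorizer`, for example, applies a smoothed IDF and L2-normalizes each document vector by default.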

---

## Why Use TF-IDF?

1. **Feature Extraction**: Converts text data into numerical values for machine learning models.
2. **Stopword Filtering**: Words with low TF-IDF scores (e.g., "the", "is") carry little information about a document and can be down-weighted or filtered out.
3. **Relevance Ranking**: Helps prioritize terms that are unique and meaningful to the document.

---

## Advantages of TF-IDF

1. **Simple and Effective**: Easy to implement and interpret.
2. **Weights Rare Words**: Gives higher importance to unique terms in a document.
3. **Reduces Noise**: Common words (like stopwords) are de-emphasized due to low IDF values.

---

## Limitations of TF-IDF

1. **Ignores Semantic Meaning**: It treats words as independent entities and doesn’t consider their meaning or context.
2. **Sparse Representations**: Produces high-dimensional, sparse matrices for large corpora.
3. **No Synonym Handling**: It doesn’t recognize synonyms or similar words (e.g., "car" and "vehicle").
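The synonym limitation can be illustrated with a short, dependency-free sketch. It builds unsmoothed TF-IDF vectors for three hypothetical documents and compares them with cosine similarity. The helper names and the sample texts are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(documents: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return the vocabulary and one tf * ln(N / df) vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for tokens in tokenized for tok in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            counts[term] / len(tokens) * math.log(n_docs / df[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = [
    "cheap car rental",
    "affordable vehicle hire",   # same meaning as docs[0], no shared words
    "cheap car rental deals",    # shares words with docs[0]
]
vocabulary, vectors = tfidf_vectors(docs)

print(cosine(vectors[0], vectors[1]))  # 0.0: synonyms contribute nothing
print(cosine(vectors[0], vectors[2]) > 0)  # True: overlap in surface words
```

The first two documents mean nearly the same thing, yet their similarity is zero because TF-IDF only matches identical tokens. The vectors also have one dimension per vocabulary word, which is where the sparsity on large corpora comes from.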

---

## Implementation in Python

### Example 1: Basic TF-IDF Calculation
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample Corpus
corpus = [
    "I love data science",
    "Data science is amazing",
    "I enjoy learning data science"
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and Transform the Corpus
X = vectorizer.fit_transform(corpus)

# Display Vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())

# Display TF-IDF Matrix
print("TF-IDF Matrix:\n", X.toarray())
```
