# TF-IDF in NLP 

Term Frequency-Inverse Document Frequency (TF-IDF) is a popular statistical measure used in Natural Language Processing (NLP) to evaluate the importance of a word in a document relative to a collection (corpus) of documents.

## **1. What is TF-IDF?**

TF-IDF is a weighted representation of text that combines two metrics:
- **Term Frequency (TF):** Measures how frequently a term occurs in a document.
- **Inverse Document Frequency (IDF):** Measures how rare a term is across the corpus; terms that appear in fewer documents receive a higher weight.

### **TF Formula**
$$ \text{TF}(t) = \frac{f_t}{\text{total terms in the document}} $$

- $f_t$: Number of times term $t$ occurs in the document.
- Total terms in the document: Total number of terms (tokens) in the document.
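
As a quick illustration of the TF formula, the sketch below computes term frequencies for one sentence from the sample dataset used later in this notebook. The helper name `term_frequency` and the whitespace-based tokenization are assumptions made for this example, not part of any library.

```python
from collections import Counter


def term_frequency(document):
    """Return TF(t) = f_t / (total terms) for every term in the document."""
    # Assumption: lowercasing and splitting on whitespace is a good enough tokenizer here.
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("The cat sat on the mat")
print(tf["the"])  # 2 occurrences out of 6 terms
print(tf["cat"])  # 1 occurrence out of 6 terms
```

Because every count is divided by the document length, the TF values of a document always sum to 1.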

### **IDF Formula**
$$ \text{IDF}(t) = \log \frac{N}{1 + \text{df}(t)} $$

- $N$: Total number of documents in the corpus.
- $\text{df}(t)$: Number of documents containing the term $t$. The 1 added in the denominator avoids division by zero for terms that appear in no document.
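
A minimal sketch of the IDF formula above, applied to the three-sentence sample corpus used later in this notebook. The function name `inverse_document_frequency` and the choice of the natural logarithm are assumptions; the formula itself does not fix a log base. One thing the sketch makes visible: with this variant, a term that appears in every document gets a slightly negative IDF, because $N / (1 + \text{df}(t))$ falls below 1.

```python
import math

corpus = [
    "The cat sat on the mat",
    "The dog lay on the rug",
    "The cat chased the dog",
]

# Assumption: lowercase + whitespace split stands in for a real tokenizer.
tokenized = [set(doc.lower().split()) for doc in corpus]


def inverse_document_frequency(term):
    """IDF(t) = log(N / (1 + df(t))), using the natural logarithm."""
    n_docs = len(tokenized)
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / (1 + df))


print(inverse_document_frequency("mat"))  # df = 1, so log(3 / 2): positive
print(inverse_document_frequency("cat"))  # df = 2, so log(3 / 3): zero
print(inverse_document_frequency("the"))  # df = 3, so log(3 / 4): negative
```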

### **TF-IDF Formula**
$$ \text{TF-IDF}(t) = \text{TF}(t) \times \text{IDF}(t) $$

This results in higher scores for words that are frequent in a specific document but rare across the corpus.
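
Putting the two pieces together, the following sketch scores every term of the first sample sentence using exactly the TF and IDF formulas given above. The function `tf_idf`, the whitespace tokenizer, and the natural logarithm are assumptions made for this illustration.

```python
import math
from collections import Counter

corpus = [
    "The cat sat on the mat",
    "The dog lay on the rug",
    "The cat chased the dog",
]
# Assumption: lowercase + whitespace split stands in for a real tokenizer.
tokenized = [doc.lower().split() for doc in corpus]


def tf_idf(term, doc_index):
    """TF-IDF(t) = TF(t) * IDF(t) with IDF(t) = log(N / (1 + df(t)))."""
    tokens = tokenized[doc_index]
    tf = Counter(tokens)[term] / len(tokens)
    df = sum(1 for doc in tokenized if term in doc)
    idf = math.log(len(tokenized) / (1 + df))
    return tf * idf


scores = {term: tf_idf(term, 0) for term in set(tokenized[0])}
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term:>4}  {score:.4f}")
```

In this tiny corpus, the terms unique to the first sentence ("sat", "mat") score highest, the terms shared with one other sentence ("cat", "on") score zero, and "the" scores below zero.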

## **2. Why Use TF-IDF?**

TF-IDF is widely used in text-based applications for its ability to:
- Highlight important words in a document.
- Reduce the weight of common words (e.g., "the", "is").
- Provide a numerical representation of text for machine learning algorithms.

## **3. Implementing TF-IDF in Python**

### **Step 1: Import Libraries and Create a Dataset**

In [1]:
# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample dataset
data = {
    "Sentences": [
        "The cat sat on the mat",
        "The dog lay on the rug",
        "The cat chased the dog"
    ]
}

df = pd.DataFrame(data)
print("Dataset:")
print(df)

Dataset:
                Sentences
0  The cat sat on the mat
1  The dog lay on the rug
2  The cat chased the dog


### **Step 2: Apply TF-IDF Using `TfidfVectorizer`**

In [2]:
vectorizer = TfidfVectorizer()

tfidf_matrix = vectorizer.fit_transform(df['Sentences'])

print("TF-IDF Matrix:")
print(tfidf_matrix.toarray())

print("Vocabulary:", vectorizer.get_feature_names_out())

TF-IDF Matrix:
[[0.3564574  0.         0.         0.         0.46869865 0.3564574
  0.         0.46869865 0.55364194]
 [0.         0.         0.3564574  0.46869865 0.         0.3564574
  0.46869865 0.         0.55364194]
 [0.40352536 0.53058735 0.40352536 0.         0.         0.
  0.         0.         0.62674687]]
Vocabulary: ['cat' 'chased' 'dog' 'lay' 'mat' 'on' 'rug' 'sat' 'the']
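
The values in this matrix do not follow the formulas from Section 1 exactly. With its default settings, `TfidfVectorizer` uses raw term counts for TF, a smoothed IDF of $\ln\frac{1 + N}{1 + \text{df}(t)} + 1$, and then scales each row to unit Euclidean (L2) length. The sketch below reproduces the first row by hand under those defaults. The whitespace tokenizer is a simplification that happens to yield the same tokens as scikit-learn for these sentences, and the helper `tfidf_row` is a name invented for this example.

```python
import math
from collections import Counter

corpus = [
    "The cat sat on the mat",
    "The dog lay on the rug",
    "The cat chased the dog",
]
tokenized = [doc.lower().split() for doc in corpus]
vocabulary = sorted({token for doc in tokenized for token in doc})
n_docs = len(tokenized)

# Smoothed IDF (scikit-learn default): ln((1 + N) / (1 + df)) + 1
idf = {
    term: math.log((1 + n_docs) / (1 + sum(term in doc for doc in tokenized))) + 1
    for term in vocabulary
}


def tfidf_row(tokens):
    """Raw count * smoothed IDF for each vocabulary term, then L2-normalize."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


row0 = tfidf_row(tokenized[0])
for term, weight in zip(vocabulary, row0):
    print(f"{term:<7}{weight:.8f}")
```

Because the smoothed IDF never drops below 1 and "the" occurs twice in each sentence, "the" ends up with the largest weight in every row of the matrix above, even though it appears in all three documents.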


### **Step 3: Convert TF-IDF Matrix into a DataFrame for Better Visualization**

In [None]:
# Convert TF-IDF matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print("TF-IDF DataFrame:")
print(tfidf_df)

## **4. Customizing TF-IDF**

### **(a) Remove Stop Words**
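Stop words are common words (e.g., "the", "is") that carry little meaning on their own. Passing `stop_words='english'` removes the words in scikit-learn's built-in English stop-word list from the vocabulary; for the sample sentences this drops "the" and "on".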

In [3]:
# Remove stop words
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(df['Sentences'])
print("Vocabulary without Stop Words:", vectorizer.get_feature_names_out())

Vocabulary without Stop Words: ['cat' 'chased' 'dog' 'lay' 'mat' 'rug' 'sat']
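
To make the effect concrete, the following sketch filters the sample sentences against a tiny hand-picked stop list. The set `STOP_WORDS` is a hypothetical stand-in for this example only; scikit-learn's built-in English list is far longer, but for these three sentences the two entries below are the ones that matter.

```python
STOP_WORDS = {"the", "on"}  # hypothetical mini stop list, for illustration only

corpus = [
    "The cat sat on the mat",
    "The dog lay on the rug",
    "The cat chased the dog",
]

# Build the vocabulary from every token that is not a stop word.
vocabulary = sorted(
    {
        token
        for doc in corpus
        for token in doc.lower().split()
        if token not in STOP_WORDS
    }
)
print(vocabulary)
```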


### **(b) Adjust Maximum and Minimum Document Frequency**
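Exclude overly frequent terms with `max_df` and rare terms with `min_df`. Float values are read as a proportion of documents: `max_df=0.7` drops terms found in more than 70% of documents, and `min_df=0.1` drops terms found in fewer than 10%.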

In [4]:
# Adjust max_df and min_df
vectorizer = TfidfVectorizer(max_df=0.7, min_df=0.1)
tfidf_matrix = vectorizer.fit_transform(df['Sentences'])
print("Filtered Vocabulary:", vectorizer.get_feature_names_out())

Filtered Vocabulary: ['cat' 'chased' 'dog' 'lay' 'mat' 'on' 'rug' 'sat']
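
The sketch below mimics this filtering in plain Python to show why only "the" disappears from the vocabulary: it occurs in 3 of 3 documents, above the 70% ceiling, while every other term occurs in at least 1 of 3 documents, above the 10% floor. The function `filter_by_document_frequency` is a hypothetical helper written for this example, not part of scikit-learn.

```python
def filter_by_document_frequency(corpus, max_df=1.0, min_df=0.0):
    """Keep terms whose document-frequency proportion lies in [min_df, max_df]."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    tokenized = [set(doc.lower().split()) for doc in corpus]
    n_docs = len(tokenized)
    vocabulary = set().union(*tokenized)
    kept = []
    for term in sorted(vocabulary):
        proportion = sum(term in doc for doc in tokenized) / n_docs
        if min_df <= proportion <= max_df:
            kept.append(term)
    return kept


corpus = [
    "The cat sat on the mat",
    "The dog lay on the rug",
    "The cat chased the dog",
]
print(filter_by_document_frequency(corpus, max_df=0.7, min_df=0.1))
```

With only three documents, `min_df=0.1` cannot exclude anything; raising it to 0.5 in this sketch would keep only the terms that occur in at least two sentences.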


### **(c) Using N-Grams**
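Analyze combinations of consecutive words by specifying `ngram_range`. For example, `ngram_range=(1, 2)` keeps single words (unigrams) and adds every pair of adjacent words (bigrams) as extra features.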

In [5]:
# Use unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(df['Sentences'])
print("Vocabulary with N-grams:", vectorizer.get_feature_names_out())

Vocabulary with N-grams: ['cat' 'cat chased' 'cat sat' 'chased' 'chased the' 'dog' 'dog lay' 'lay'
 'lay on' 'mat' 'on' 'on the' 'rug' 'sat' 'sat on' 'the' 'the cat'
 'the dog' 'the mat' 'the rug']
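
A rough sketch of what `ngram_range=(1, 2)` does to each sentence: every single token and every pair of adjacent tokens becomes a feature. The helper `word_ngrams` is hypothetical and uses whitespace tokenization as a simplification.

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams of the text for n in [min_n, max_n]."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


corpus = [
    "The cat sat on the mat",
    "The dog lay on the rug",
    "The cat chased the dog",
]
vocabulary = sorted({gram for doc in corpus for gram in word_ngrams(doc)})
print(len(vocabulary))
print(vocabulary)
```

The vocabulary grows from 9 unigrams to 20 features here, which hints at how quickly the feature space expands as the n-gram range widens.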


## **Advantages of TF-IDF**
1. **Simple and Effective:** 
   - TF-IDF is easy to implement and widely used in text mining and NLP applications.
2. **Highlights Important Terms:** 
   - It assigns higher weights to terms that are distinctive for a document while reducing the weights of common terms like stop words.
3. **Noise Reduction:** 
   - Reduces the impact of words that appear frequently across all documents but carry little significance.
4. **Sparse Representation:** 
   - Efficiently represents text as a sparse matrix, which can be processed by many machine learning algorithms.

## **Disadvantages of TF-IDF**
1. **Ignores Context:** 
   - TF-IDF treats a document as a bag of terms: it does not capture semantic meaning, and it ignores word order beyond what n-grams provide, which limits its effectiveness in understanding text.
2. **Sensitive to Data Variability:** 
   - Rare terms, including typos and misspellings, can receive disproportionately high weights, potentially skewing the results.
3. **Computational Overhead:** 
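   - IDF needs document counts over the whole corpus, and the vocabulary (and with it the vector dimensionality) grows with corpus size, so large corpora can be expensive to process.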
4. **Vocabulary Dependency:** 
   - Relies heavily on preprocessing techniques like stemming and lemmatization to ensure consistency in vocabulary.
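
The first limitation is easy to demonstrate. With single-word terms, TF-IDF is built on per-document word counts, so two sentences made of the same words in a different order produce identical counts and therefore identical vectors. The two sentences below are made up for this illustration.

```python
from collections import Counter

# Same words, opposite meaning.
first = Counter("the dog chased the cat".split())
second = Counter("the cat chased the dog".split())

# Identical counts mean identical unigram TF-IDF vectors.
print(first == second)
```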

## **Conclusion**
TF-IDF is a foundational text representation technique in NLP that balances term frequency with term uniqueness. It is especially effective for tasks like:
- Document similarity
- Keyword extraction
- Text classification
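
As a small illustration of the document-similarity use case, the sketch below computes cosine similarity between the three rows of the TF-IDF matrix printed in Step 2. The values are copied from that output; the helper `cosine` is written out by hand for this example and is not a library function.

```python
import math

# Rows copied from the TF-IDF matrix printed in Step 2.
tfidf_rows = [
    [0.3564574, 0.0, 0.0, 0.0, 0.46869865, 0.3564574, 0.0, 0.46869865, 0.55364194],
    [0.0, 0.0, 0.3564574, 0.46869865, 0.0, 0.3564574, 0.46869865, 0.0, 0.55364194],
    [0.40352536, 0.53058735, 0.40352536, 0.0, 0.0, 0.0, 0.0, 0.0, 0.62674687],
]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


for i in range(len(tfidf_rows)):
    for j in range(i + 1, len(tfidf_rows)):
        print(f"doc {i} vs doc {j}: {cosine(tfidf_rows[i], tfidf_rows[j]):.4f}")
```

Much of the similarity in this example comes from the shared word "the", which is one reason stop-word removal is often applied before comparing documents.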

However, TF-IDF has limitations in understanding context and relationships between words. For advanced NLP applications, more sophisticated techniques like **Word2Vec**, **GloVe**, or **transformer-based models** (e.g., BERT) are often preferred.