
**TF-IDF (Term Frequency-Inverse Document Frequency)**
- TF-IDF is a text representation technique that measures how important a word is to a document relative to a collection of documents (the corpus).
- Unlike Bag of Words (BoW), which only counts word occurrences, TF-IDF gives more weight to words that are distinctive for a document and less weight to words that appear in most documents.
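
As a rough sketch of the weighting described above, one common textbook formulation is given below. The symbols $t$ (term), $d$ (document), $N$ (number of documents in the corpus), and $\mathrm{df}(t)$ (number of documents containing $t$) are notation introduced here for illustration, not taken from the notebook.

$$
\mathrm{tf}(t, d) = \text{number of times } t \text{ occurs in } d, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}
$$

$$
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Libraries often use variants of this. With default settings, scikit-learn's `TfidfVectorizer` uses the smoothed form $\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1$ and then scales each document vector to unit L2 norm, so its numbers differ from the plain textbook formula.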

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text corpus
texts = [
    "John loves to play football.",
    "Football is a great sport!",
    "John and Alex enjoy playing football."
]

# Create TF-IDF model
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

# Print the vocabulary, then the TF-IDF matrix as a dense array
print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nTF-IDF Matrix:\n", tfidf_matrix.toarray())


Vocabulary: ['alex' 'and' 'enjoy' 'football' 'great' 'is' 'john' 'loves' 'play'
 'playing' 'sport' 'to']

TF-IDF Matrix:
 [[0.         0.         0.         0.29803159 0.         0.
  0.38376993 0.50461134 0.50461134 0.         0.         0.50461134]
 [0.         0.         0.         0.32274454 0.54645401 0.54645401
  0.         0.         0.         0.         0.54645401 0.        ]
 [0.45050407 0.45050407 0.45050407 0.26607496 0.         0.
  0.34261996 0.         0.         0.45050407 0.         0.        ]]
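
The matrix above can be reproduced by hand, which shows where the numbers come from. The sketch below assumes `TfidfVectorizer`'s default settings: lowercasing, tokens of two or more word characters (which is why "a" is missing from the vocabulary), the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. Names such as `tfidf_row` are illustrative and not part of scikit-learn.

```python
import math
import re
from collections import Counter

texts = [
    "John loves to play football.",
    "Football is a great sport!",
    "John and Alex enjoy playing football.",
]

# Lowercase and keep tokens of two or more word characters
# (assumed to mirror TfidfVectorizer's default tokenization).
tokenized = [re.findall(r"\b\w\w+\b", t.lower()) for t in texts]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

print("Vocabulary:", vocab)
for row in matrix:
    print([round(x, 4) for x in row])
```

Because "football" occurs in all three sentences, its IDF is the smallest possible under this formula (exactly 1), which is why it has the lowest nonzero weight in every row.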


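**Advantages of TF-IDF**
- Reduces Weight of Common Words – Unlike BoW, TF-IDF down-weights words that appear in most documents of the corpus, which in a large corpus includes words like "is", "to", "the". In the small example above, "football" occurs in all three sentences and gets the lowest nonzero weight in each row.
- Captures Important Words – Words that appear frequently in a document but in few documents overall get a higher score.
- Improves Text Classification – Commonly used to build features for spam detection and topic modeling, and to rank results in search engines.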

**Limitations of TF-IDF**
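- Ignores Context & Word Order – Each word is scored independently, so word order and relationships between words are lost.
- Sparse Representation – Every document gets one dimension per vocabulary word, so a large vocabulary results in a large, mostly zero matrix.
- Not Well Suited to Deep Learning – TF-IDF is mostly used with traditional ML models (SVM, Naive Bayes); deep learning models usually work with dense learned embeddings instead.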
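
The word-order limitation is easy to see in a minimal sketch. The two sentences below are made up for illustration. They mean opposite things but contain the same words the same number of times, so any count-based representation (BoW or TF-IDF) assigns them identical vectors.

```python
from collections import Counter

def term_counts(text):
    """Count lowercase whitespace-separated tokens (a deliberately simple tokenizer)."""
    return Counter(text.lower().split())

counts_a = term_counts("the dog bites the man")
counts_b = term_counts("the man bites the dog")

# TF-IDF weights are computed from these counts alone,
# so equal counts imply equal TF-IDF vectors.
print(counts_a == counts_b)
```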

**When to Use TF-IDF?**
- Use TF-IDF when building search engines, extracting keywords, or training traditional ML models.
- Use Word Embeddings (Word2Vec, BERT) when you need semantic meaning or are building deep learning applications.
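
To make the search-engine use case concrete, here is a minimal sketch that ranks the three example sentences against a query by cosine similarity of TF-IDF vectors. It is a stdlib-only illustration under the same assumptions as scikit-learn's defaults (smoothed IDF, L2 normalization). The query string and the helper names `tokenize`, `vectorize`, and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter

texts = [
    "John loves to play football.",
    "Football is a great sport!",
    "John and Alex enjoy playing football.",
]

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(t) for t in texts]
n_docs = len(docs)
df = Counter(tok for doc in docs for tok in set(doc))

def vectorize(tokens):
    """Map tokens to an L2-normalized {term: weight} dict using smoothed IDF."""
    counts = Counter(t for t in tokens if t in df)  # terms unseen in the corpus are ignored
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
    return {t: w / norm for t, w in raw.items()}

def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

doc_vecs = [vectorize(d) for d in docs]
query_vec = vectorize(tokenize("football sport"))

scores = [cosine(query_vec, dv) for dv in doc_vecs]
ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

for i in ranking:
    print(f"{scores[i]:.3f}  {texts[i]}")
```

The second sentence ranks first because it is the only one containing "sport", a rarer and therefore more heavily weighted term than "football". A query with a synonym such as "soccer" would match nothing, which is the kind of gap embeddings are meant to close.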