# 8.2.2 TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure of how important a word is to a document relative to a collection of documents (the corpus). The TF-IDF score of a word in a document is the product of two metrics:
- **Term Frequency (TF)**: Measures how often a word occurs in a document.
- **Inverse Document Frequency (IDF)**: Measures how rare a word is across the corpus. Words that appear in many documents get a low IDF, and words that appear in only a few get a high IDF.

TF-IDF lowers the weight of words that are common across the corpus and raises the weight of words that are concentrated in a few documents and are therefore more informative. This makes it useful for tasks such as document retrieval, text mining, and other natural language processing work.
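To make the two metrics concrete, here is a minimal from-scratch sketch. It assumes one common textbook formulation, `tf(t, d) = count of t in d / number of words in d` and `idf(t) = log(N / df(t))`, where `N` is the number of documents and `df(t)` is the number of documents containing `t`. Libraries often use smoothed or normalized variants, so the numbers will differ from theirs. The names `tokenized`, `doc_freq`, and `tf_idf` are illustrative only.

```python
import math
from collections import Counter

texts = [
    "Machine learning is fascinating",
    "Learning algorithms are essential",
    "Machine learning is a subset of artificial intelligence",
]

# Naive tokenization (an assumption of this sketch): lowercase, split on spaces
tokenized = [text.lower().split() for text in texts]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """Return {word: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: (count / total) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

scores = [tf_idf(tokens) for tokens in tokenized]
for word, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{word:12s} {score:.4f}")
```

In the first document, "learning" appears in every document, so its IDF is `log(3/3) = 0` and its score is zero, while "fascinating" appears in only one document and scores highest.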

## Benefits of TF-IDF
- **Reduces Noise**: Down-weights words that occur in most documents, so less informative words contribute less to the representation.
- **Enhances Discrimination**: Gives higher weight to the words that set a document apart from the rest of the corpus.
- **Widely Used**: A standard baseline in text mining and information retrieval, with ready-made implementations in libraries such as scikit-learn.

## Use Cases of TF-IDF
- **Search Engines**: To rank documents based on their relevance to a query.
- **Text Classification**: As features for machine learning models.
- **Document Clustering**: To group similar documents together.
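
As an illustration of the search-engine use case, the following sketch ranks a few documents against a query by cosine similarity between TF-IDF vectors. It uses scikit-learn's `TfidfVectorizer` and `cosine_similarity`. The query string and variable names are made up for this example, and a real search engine would do considerably more than this.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Machine learning is fascinating",
    "Learning algorithms are essential",
    "Machine learning is a subset of artificial intelligence",
]
query = "artificial intelligence"  # hypothetical query for illustration

# Learn the vocabulary and IDF weights from the documents only
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Project the query into the same vector space
query_vector = vectorizer.transform([query])

# Cosine similarity between the query and every document
similarities = cosine_similarity(query_vector, doc_vectors).ravel()

# Indices of documents sorted from most to least similar
ranking = similarities.argsort()[::-1]
for idx in ranking:
    print(f"{similarities[idx]:.3f}  {documents[idx]}")
```

Only the third document shares any words with the query, so it ranks first and the other two get a similarity of zero.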


___
___
### Readings:
- [TF-IDF in NLP (Term Frequency Inverse Document Frequency)](https://medium.com/@abhishekjainindore24/tf-idf-in-nlp-term-frequency-inverse-document-frequency-e05b65932f1d)
- [TF-IDF/Term Frequency Technique](https://medium.com/analytics-vidhya/tf-idf-term-frequency-technique-easiest-explanation-for-text-classification-in-nlp-with-code-8ca3912e58c3)
- [Understanding TF-IDF in NLP: A Comprehensive Guide](https://medium.com/@er.iit.pradeep09/understanding-tf-idf-in-nlp-a-comprehensive-guide-26707db0cec5)
- [Understanding TF-IDF for Absolute Beginners](https://readmedium.com/en/https:/medium.com/analytics-vidhya/understanding-tfidf-for-absolute-beginners-f2c260b8944b)
___
___

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
texts = [
    "Machine learning is fascinating",
    "Learning algorithms are essential",
    "Machine learning is a subset of artificial intelligence"
]

# Fit the vectorizer and transform the texts into a sparse TF-IDF matrix.
# By default, TfidfVectorizer lowercases the text and ignores
# single-character tokens such as "a".
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)

# Display the vocabulary (word -> column index) and the dense TF-IDF matrix
print("Vocabulary:\n", tfidf_vectorizer.vocabulary_)
print("\nTF-IDF Matrix:\n", tfidf_matrix.toarray())
```


```text
Vocabulary:
 {'machine': 8, 'learning': 7, 'is': 6, 'fascinating': 4, 'algorithms': 0, 'are': 1, 'essential': 3, 'subset': 10, 'of': 9, 'artificial': 2, 'intelligence': 5}

TF-IDF Matrix:
 [[0.         0.         0.         0.         0.63174505 0.
  0.4804584  0.37311881 0.4804584  0.         0.        ]
 [0.54645401 0.54645401 0.         0.54645401 0.         0.
  0.         0.32274454 0.         0.         0.        ]
 [0.         0.         0.4261835  0.         0.         0.4261835
  0.32412354 0.25171084 0.32412354 0.4261835  0.4261835 ]]
```
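
The values in the matrix come from scikit-learn's default settings rather than the plain `tf * log(N / df)` formula: `TfidfVectorizer` uses raw counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces the first row by hand under those defaults. The document frequencies are read off the three sample sentences, and the variable names are illustrative.

```python
import math

n_docs = 3

# Number of sample sentences containing each word of the first sentence
doc_freq = {"machine": 2, "learning": 3, "is": 2, "fascinating": 1}

# Smoothed IDF used by scikit-learn when smooth_idf=True (the default)
idf = {
    word: math.log((1 + n_docs) / (1 + df)) + 1
    for word, df in doc_freq.items()
}

# Each word occurs once in the first sentence, so tf = 1 and weight = idf.
# Dividing by the L2 norm mirrors norm="l2" (the default).
norm = math.sqrt(sum(weight ** 2 for weight in idf.values()))
row = {word: weight / norm for word, weight in idf.items()}

for word, weight in row.items():
    print(f"{word:12s} {weight:.8f}")
```

The results should agree with the non-zero entries of the first matrix row: "learning" occurs in all three sentences and gets the smallest weight (about 0.373), while "fascinating" occurs in only one and gets the largest (about 0.632).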


___
___
## Conclusion

TF-IDF is a text representation technique that balances how often a word occurs in an individual document against how rare it is across the entire corpus. This makes it particularly useful for tasks that require distinguishing relevant documents from irrelevant ones. By reducing the weight of common words and raising the weight of distinctive ones, TF-IDF improves the quality of text analysis and retrieval.
