2. TF-IDF (Term Frequency-Inverse Document Frequency)
In Natural Language Processing (NLP), Bag-of-Words (BoW) models, such as the one we explored earlier, represent each document by how often each word occurs in it. However, BoW has a limitation: it doesn’t consider how significant a word is in a document relative to the entire corpus (the collection of documents). Words that appear frequently in many documents, such as "the" or "is", often carry little meaningful information.

Here’s where TF-IDF comes into play. TF-IDF addresses this limitation by weighting each word by its importance in a document relative to the entire corpus. A word’s weight rises with how often it appears in the document (term frequency) and falls with how many documents in the corpus contain it (inverse document frequency). This helps us identify words that are not only frequent in a document but also distinctive and informative for that document.
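To make the weighting concrete, here is one common way to write it down. The symbols are notation introduced here rather than taken from the text above: tf(t, d) is the number of times term t occurs in document d, df(t) is the number of documents containing t, and n is the total number of documents. The smoothed IDF shown is the variant scikit-learn uses with its default settings; other references define IDF slightly differently, so treat this as a sketch of one convention.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1
```

With scikit-learn’s defaults, each document’s vector of weights is then scaled to unit Euclidean (L2) length, which is why the values in the output are all between 0 and 1.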

In [5]:
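from sklearn.feature_extraction.text import TfidfVectorizer

# Default settings: lowercase the text, smooth the IDF, L2-normalise each row.
tftransformer = TfidfVectorizer()
sample_text = ["I am Aftab Mallick",
               "I am Interested in learning NLP",
               "I know Machine Learning"]

# Learn the vocabulary and IDF weights, then return a sparse document-term matrix.
x_tftrans = tftransformer.fit_transform(sample_text)

# The default tokenizer ignores one-character tokens, so "I" is not in the vocabulary.
print(f"Vocabulary: {tftransformer.vocabulary_}")
print(f"Feature Names: {tftransformer.get_feature_names_out()}")
# Convert the sparse matrix to a dense array: one row per document, one column per term.
print(f"Document terms: \n{x_tftrans.toarray()}")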

Vocabulary: {'am': 1, 'aftab': 0, 'mallick': 7, 'interested': 3, 'in': 2, 'learning': 5, 'nlp': 8, 'know': 4, 'machine': 6}
Feature Names: ['aftab' 'am' 'in' 'interested' 'know' 'learning' 'machine' 'mallick'
 'nlp']
Document terms: 
[[0.62276601 0.4736296  0.         0.         0.         0.
  0.         0.62276601 0.        ]
 [0.         0.37302199 0.49047908 0.49047908 0.         0.37302199
  0.         0.         0.49047908]
 [0.         0.         0.         0.         0.62276601 0.4736296
  0.62276601 0.         0.        ]]
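
To see where these numbers come from, the sketch below recomputes the matrix by hand with NumPy. It assumes the vectorizer’s default behaviour (lowercasing, dropping one-character tokens, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row). The whitespace tokenizer and the variable names are simplifications chosen for this example, not scikit-learn’s actual implementation.

```python
import numpy as np

# Same three documents as in the cell above.
docs = ["I am Aftab Mallick",
        "I am Interested in learning NLP",
        "I know Machine Learning"]

# Simplified tokenizer (assumption): lowercase, split on whitespace,
# and drop one-character tokens such as "i".
tokenized = [[w for w in d.lower().split() if len(w) > 1] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Raw term counts: one row per document, one column per vocabulary term.
tf = np.array([[doc.count(term) for term in vocab] for doc in tokenized],
              dtype=float)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
df = (tf > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then scale each row to unit L2 length.
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

print(vocab)
print(np.round(tfidf, 8))
```

In the first document, "am" appears in two of the three documents, so its IDF is lower than that of "aftab" and "mallick", which each appear in only one. After normalisation this gives 0.4736296 for "am" versus 0.62276601 for the two rarer words, matching the first row of the output above.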
