## Demonstration of TF-IDF

In [6]:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer
# scikit-learn provides many text-processing tools; TfidfVectorizer computes TF-IDF features from raw text
# for further help, see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [11]:
# Prepare the Corpus 
string = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'This document is the fourth document.',
        'And this is the fifth one.',
        'This document is the sixth.',
        'And this is the seventh one document.',
        'This document is the eighth.',
        'And this is the nineth one document.',
        'This document is the second.',
        'And this is the tenth one document.',
    ]

In [12]:
# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

In [13]:
# get idf values
print('idf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)

idf values:
and : 1.6931471805599454
document : 1.1823215567939547
eighth : 2.791759469228055
fifth : 2.791759469228055
first : 2.791759469228055
fourth : 2.791759469228055
is : 1.0
nineth : 2.791759469228055
one : 1.6931471805599454
second : 2.386294361119891
seventh : 2.791759469228055
sixth : 2.791759469228055
tenth : 2.791759469228055
the : 1.0
third : 2.791759469228055
this : 1.0


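One IDF value is computed for each unique word in the corpus. IDF measures how informative a word is by looking at how many documents contain it. With the default setting smooth_idf=True, scikit-learn uses idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents (11 here) and df(t) is the number of documents that contain the word t. Words that appear in every document, such as "is", "the" and "this", get the minimum score of 1.0 because they do not help tell documents apart. Words that appear in only one document, such as "first" or "tenth", get the highest score (about 2.79), which marks them as useful for distinguishing that document.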
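As a rough cross-check of the values above, the sketch below recomputes the IDF by hand using only the standard library. It assumes scikit-learn's documented defaults (smooth_idf=True, lowercasing, tokens of two or more word characters). The names corpus, tokenize and smoothed_idf are illustrative and are not part of scikit-learn.

```python
import math
import re

# Same documents as the `string` list defined earlier in the notebook.
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'This document is the fourth document.',
    'And this is the fifth one.',
    'This document is the sixth.',
    'And this is the seventh one document.',
    'This document is the eighth.',
    'And this is the nineth one document.',
    'This document is the second.',
    'And this is the tenth one document.',
]


def tokenize(text):
    # Approximation of the default analyzer: lowercase the text, then keep
    # runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def smoothed_idf(docs):
    # Document frequency: in how many documents does each term occur?
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(tokenize(doc)):
            df[term] = df.get(term, 0) + 1
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1
        for term, count in sorted(df.items())
    }


idf = smoothed_idf(corpus)
for term, value in idf.items():
    print(term, ':', value)
```

Up to floating-point rounding, the printed values should agree with tfidf.idf_ shown above.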

In [14]:
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)


Word indexes:
{'this': 15, 'is': 6, 'the': 13, 'first': 4, 'document': 1, 'second': 9, 'and': 0, 'third': 14, 'one': 8, 'fourth': 5, 'fifth': 3, 'sixth': 11, 'seventh': 10, 'eighth': 2, 'nineth': 7, 'tenth': 12}


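The vocabulary_ attribute of the TfidfVectorizer holds the word-to-index mapping. Each unique word in the corpus gets one index, assigned in alphabetical order of the words ("and" is 0, "document" is 1, and so on), even though the dictionary itself prints in the order the words were first seen. The index is the column of that word in the TF-IDF matrix and also its position in idf_ and in get_feature_names_out().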
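The mapping can be reproduced without scikit-learn by sorting the unique tokens and numbering them. This is a minimal sketch under the same tokenization assumption as the default analyzer (lowercase, tokens of two or more word characters); tokenize, terms and vocabulary are illustrative names.

```python
import re

# Same documents as the `string` list defined earlier in the notebook.
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'This document is the fourth document.',
    'And this is the fifth one.',
    'This document is the sixth.',
    'And this is the seventh one document.',
    'This document is the eighth.',
    'And this is the nineth one document.',
    'This document is the second.',
    'And this is the tenth one document.',
]


def tokenize(text):
    # Approximation of the default analyzer: lowercase, two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Collect the unique terms, sort them, and number them from 0.
terms = sorted({term for doc in corpus for term in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(terms)}

print(vocabulary['document'], vocabulary['this'])  # prints: 1 15
```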

In [15]:
# display tf-idf values
print('\ntf-idf value:')
print(result)


tf-idf value:
  (0, 1)	0.3386114196354545
  (0, 4)	0.7995469859480685
  (0, 13)	0.28639536993104564
  (0, 6)	0.28639536993104564
  (0, 15)	0.28639536993104564
  (1, 9)	0.6313492037071095
  (1, 1)	0.625620866871847
  (1, 13)	0.26457306105807354
  (1, 6)	0.26457306105807354
  (1, 15)	0.26457306105807354
  (2, 8)	0.4164781643870174
  (2, 14)	0.6867134012352383
  (2, 0)	0.4164781643870174
  (2, 13)	0.24597871299604487
  (2, 6)	0.24597871299604487
  (2, 15)	0.24597871299604487
  (3, 5)	0.6896817045790951
  (3, 1)	0.5841660469952564
  (3, 13)	0.2470419504907412
  (3, 6)	0.2470419504907412
  (3, 15)	0.2470419504907412
  (4, 3)	0.6867134012352383
  (4, 8)	0.4164781643870174
  (4, 0)	0.4164781643870174
  (4, 13)	0.24597871299604487
  :	:
  (6, 15)	0.23619287048202153
  (7, 2)	0.7995469859480685
  (7, 1)	0.3386114196354545
  (7, 13)	0.28639536993104564
  (7, 6)	0.28639536993104564
  (7, 15)	0.28639536993104564
  (8, 7)	0.6593936827323392
  (8, 8)	0.3999092927249951
  (8, 0)	0.3999092927249951
 

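The TF-IDF values are stored as a sparse matrix, where each row corresponds to a document and each column corresponds to a word index from the vocabulary. Each printed line has the form (document index, word index) followed by the TF-IDF score, and only non-zero entries are listed, so a word that does not occur in a document has no line at all. By default TfidfVectorizer also scales every row to unit Euclidean (L2) length, which is why the same word can have different scores in different documents: "this" scores about 0.286 in document 0 but about 0.265 in document 1.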
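To make the numbers less opaque, the sketch below rebuilds one row by hand: multiply each term count by its smoothed IDF, then divide the row by its L2 norm. It assumes the default TfidfVectorizer settings (smooth_idf=True, norm='l2', sublinear_tf=False); tokenize and tfidf_row are hypothetical helper names, not scikit-learn functions.

```python
import math
import re
from collections import Counter

# Same documents as the `string` list defined earlier in the notebook.
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'This document is the fourth document.',
    'And this is the fifth one.',
    'This document is the sixth.',
    'And this is the seventh one document.',
    'This document is the eighth.',
    'And this is the nineth one document.',
    'This document is the second.',
    'And this is the tenth one document.',
]


def tokenize(text):
    # Approximation of the default analyzer: lowercase, two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_row(doc, docs):
    n_docs = len(docs)
    counts = Counter(tokenize(doc))  # raw term frequency within this document
    weights = {}
    for term, tf in counts.items():
        df = sum(1 for d in docs if term in tokenize(d))  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1       # smoothed IDF
        weights[term] = tf * idf
    # Scale the row to unit Euclidean length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row0 = tfidf_row(corpus[0], corpus)
row1 = tfidf_row(corpus[1], corpus)
for term, score in sorted(row0.items()):
    print(term, ':', round(score, 8))
```

For document 0 this should give roughly 0.7995 for "first", 0.3386 for "document" and 0.2864 for each of "this", "is" and "the", in line with the first five entries printed above.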

In [16]:
# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())


tf-idf values in matrix form:
[[0.         0.33861142 0.         0.         0.79954699 0.
  0.28639537 0.         0.         0.         0.         0.
  0.         0.28639537 0.         0.28639537]
 [0.         0.62562087 0.         0.         0.         0.
  0.26457306 0.         0.         0.6313492  0.         0.
  0.         0.26457306 0.         0.26457306]
 [0.41647816 0.         0.         0.         0.         0.
  0.24597871 0.         0.41647816 0.         0.         0.
  0.         0.24597871 0.6867134  0.24597871]
 [0.         0.58416605 0.         0.         0.         0.6896817
  0.24704195 0.         0.         0.         0.         0.
  0.         0.24704195 0.         0.24704195]
 [0.41647816 0.         0.         0.6867134  0.         0.
  0.24597871 0.         0.41647816 0.         0.         0.
  0.         0.24597871 0.         0.24597871]
 [0.         0.33861142 0.         0.         0.         0.
  0.28639537 0.         0.         0.         0.         0.79954699

Here, the TF-IDF values are shown in dense matrix form (a NumPy array) for better readability. Each row corresponds to a document, and each column corresponds to a word index from the vocabulary. The values are the TF-IDF scores for each word in each document, and a zero means that the word does not appear in that document. Converting with toarray() is convenient for a small corpus like this one, but for large corpora the dense array can become very large because most of its entries are zero.
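As a small illustration of how many cells are zero even in this toy corpus, the sketch below counts the non-zero cells without building the matrix. It relies on the same tokenization assumption as the default analyzer; the variable names are illustrative.

```python
import re

# Same documents as the `string` list defined earlier in the notebook.
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'This document is the fourth document.',
    'And this is the fifth one.',
    'This document is the sixth.',
    'And this is the seventh one document.',
    'This document is the eighth.',
    'And this is the nineth one document.',
    'This document is the second.',
    'And this is the tenth one document.',
]


def tokenize(text):
    # Approximation of the default analyzer: lowercase, two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


vocabulary = sorted({term for doc in corpus for term in tokenize(doc)})

# A cell is non-zero exactly when the word occurs in the document.
nonzero = sum(len(set(tokenize(doc))) for doc in corpus)
total = len(corpus) * len(vocabulary)

print(nonzero, total, round(nonzero / total, 3))  # prints: 63 176 0.358
```

Only about a third of the cells are filled here, and with a realistic vocabulary the share is usually far smaller, which is the motivation for the sparse format.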

In [None]:
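# get TF values (raw term counts per document)
# TfidfVectorizer has no attribute holding raw term frequencies; one option
# (an assumption about the intent here) is to refit with IDF weighting and normalization turned off
tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(string)
print('\nTF values:')
print(tf.toarray())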