TF-IDF stands for Term Frequency-Inverse Document Frequency.

It is a measure of how relevant a word is to a document within a collection of documents (a corpus).

The relevance increases in proportion to the number of times the word appears in the document, but is offset by how frequently the word occurs across the corpus (data set).

**Terminology:**

1. Term Frequency: The term frequency of a word t in a document d is the number of times t occurs in d. A word therefore becomes more relevant to a document the more often it appears in it, which is intuitive. Since the ordering of terms is not significant, we can describe the document as a vector under the bag-of-words model: for each distinct term in the document there is an entry whose value is the term frequency.

The weight of a term that occurs in a document is simply proportional to the term frequency.

**tf(t,d) = count of t in d / number of words in d**
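As a minimal sketch of the formula above, the following computes tf by hand. The helper name `term_frequency` and the plain whitespace tokenization are assumptions made for illustration; real tokenizers treat punctuation and casing differently.

```python
from collections import Counter


def term_frequency(term, document):
    """Return (count of term in document) / (number of words in document)."""
    words = document.lower().split()
    if not words:
        return 0.0  # an empty document contains no terms
    return Counter(words)[term.lower()] / len(words)


doc = "Python is a language"
tf_python = term_frequency("python", doc)  # 1 occurrence out of 4 words
tf_java = term_frequency("java", doc)      # term absent from the document
print(tf_python, tf_java)
```

Here `python` gets a tf of 0.25 and the absent term `java` gets 0.0.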

2. Document Frequency: This measures how common a term is across the whole corpus, and is closely related to TF. The difference is that TF counts the occurrences of a term t within a single document d, whereas DF counts how many of the N documents in the corpus contain t. In other words, DF is the number of documents in which the word is present.

**df(t) = number of documents containing t**
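A small sketch of document frequency, again under the assumption of simple whitespace tokenization; `document_frequency` is a hypothetical helper name, not a library function.

```python
def document_frequency(term, corpus):
    """Count the documents in corpus that contain term at least once."""
    term = term.lower()
    return sum(1 for doc in corpus if term in doc.lower().split())


corpus = ["Python is a language", "Python", "doc"]
df_python = document_frequency("python", corpus)  # present in two documents
df_doc = document_frequency("doc", corpus)        # present in one document
print(df_python, df_doc)
```

Note that a term repeated several times inside one document still contributes only 1 to its document frequency.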

3. Inverse Document Frequency: This measures how informative a term is. The main aim of a search is to locate the documents that best match the query. Since tf treats all terms as equally important, term frequencies alone are not enough to weight a term in a document. First, find the document frequency of a term t by counting the number of documents containing the term:

**df(t) = N(t) where**

**df(t) = Document frequency of a term t**

**N(t) = Number of documents containing the term t**

Term frequency is the number of occurrences of a term within a single document, whereas document frequency is the number of distinct documents in which the term appears, so it depends on the entire corpus.

The IDF of a term is the number of documents in the corpus divided by the document frequency of the term.

**idf(t) = N/ df(t) = N/N(t)**

A more common word should count as less significant, but the raw ratio grows too quickly and penalizes common words too harshly. We therefore take the logarithm (with base 2) of the inverse document frequency; the choice of base only rescales every idf value by the same constant factor. So the idf of the term t becomes:

**idf(t) = log(N/ df(t))**
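To make the dampening effect of the logarithm concrete, here is a minimal sketch that prints the raw ratio N/df(t) next to the base-2 idf. The four-document corpus and the helper name `inverse_document_frequency` are made up for illustration.

```python
import math


def inverse_document_frequency(term, corpus):
    """Return log2(N / df(t)), where N is the number of documents."""
    term = term.lower()
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if term in doc.lower().split())
    if df == 0:
        # The plain formula is undefined for unseen terms (division by zero).
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log2(n_docs / df)


corpus = ["python is fun", "python is fast", "python code", "python doc"]
for term in ("python", "is", "doc"):
    df = sum(1 for doc in corpus if term in doc.split())
    ratio = len(corpus) / df
    print(term, "N/df =", ratio, "idf =", inverse_document_frequency(term, corpus))
```

With N = 4, the term `python` (in every document) gets idf 0, `is` (in two documents) gets 1, and `doc` (in one document) gets 2, while the raw ratios are 1, 2 and 4.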

4. Computation: Tf-idf is one of the most widely used metrics for determining how significant a term is to a document in a corpus. It is a weighting scheme that assigns a weight to each word in a document based on its term frequency (tf) and its inverse document frequency (idf). Words with higher weights are deemed more significant.

Usually, the tf-idf weight consists of two terms:
1. Normalized Term Frequency (tf)
2. Inverse Document Frequency (idf)

**tf-idf(t, d) = tf(t, d) * idf(t)**
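Putting the pieces together, the sketch below scores every word of one document with the textbook formulas given above (length-normalized tf, base-2 idf). The function names and the whitespace tokenization are assumptions for illustration. Libraries such as scikit-learn apply additional smoothing and normalization, so their reported values will generally differ from these.

```python
import math


def tf(term, tokens):
    # count of term in the document / number of words in the document
    return tokens.count(term) / len(tokens)


def idf(term, tokenized_corpus):
    # log2(N / df(t)); only called for terms that occur in the corpus
    df = sum(1 for tokens in tokenized_corpus if term in tokens)
    return math.log2(len(tokenized_corpus) / df)


def tf_idf(term, tokens, tokenized_corpus):
    return tf(term, tokens) * idf(term, tokenized_corpus)


corpus = ["Python is a language", "Python", "doc"]
tokenized_corpus = [doc.lower().split() for doc in corpus]

first_doc = tokenized_corpus[0]
scores = {term: tf_idf(term, first_doc, tokenized_corpus) for term in first_doc}
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.3f}")
```

In the first document, `python` scores lowest because it also occurs in a second document, whereas `is`, `a` and `language` occur nowhere else and share the highest score.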

 

In [1]:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# assign documents
d0 = 'Python is a language'
d1 = 'Python'
d2 = 'doc'
  
# merge documents into a single corpus
string = [d0, d1, d2]

In [3]:
#Get tf-idf values from fit_transform() method.

# create object
tfidf = TfidfVectorizer()
  
# get tf-idf values
result = tfidf.fit_transform(string)

In [4]:
#Display idf values of the words present in the corpus.

# get idf values (get_feature_names_out() requires scikit-learn >= 1.0)
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)


idf values:
doc : 1.6931471805599454
is : 1.6931471805599454
language : 1.6931471805599454
python : 1.2876820724517808
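
These values are not what the plain formula log(N/df(t)) would give. With its default settings (`smooth_idf=True`), `TfidfVectorizer` computes idf(t) = ln((1 + N) / (1 + df(t))) + 1, and its default tokenizer lowercases the text and drops single-character tokens such as "a". The following pure-Python sketch reproduces the printed numbers under those assumptions; the helper names are hypothetical, and the regular expression mirrors the default `token_pattern`.

```python
import math
import re


def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def smoothed_idf(term, tokenized_corpus):
    # ln((1 + N) / (1 + df)) + 1, the smoothed variant described above
    n_docs = len(tokenized_corpus)
    df = sum(1 for tokens in tokenized_corpus if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1


corpus = ["Python is a language", "Python", "doc"]
tokenized_corpus = [tokenize(doc) for doc in corpus]
vocabulary = sorted({term for tokens in tokenized_corpus for term in tokens})
idf_values = {term: smoothed_idf(term, tokenized_corpus) for term in vocabulary}
for term, value in idf_values.items():
    print(term, ":", value)
```

The extra `+ 1` means that a term occurring in every document still keeps a weight of 1 rather than being zeroed out.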


In [5]:
#Display tf-idf values along with indexing.

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)
  
# display tf-idf values
print('\ntf-idf value:')
print(result)
  
# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())



Word indexes:
{'python': 3, 'is': 1, 'language': 2, 'doc': 0}

tf-idf value:
  (0, 2)	0.6227660078332259
  (0, 1)	0.6227660078332259
  (0, 3)	0.4736296010332684
  (1, 3)	1.0
  (2, 0)	1.0

tf-idf values in matrix form:
[[0.         0.62276601 0.62276601 0.4736296 ]
 [0.         0.         0.         1.        ]
 [1.         0.         0.         0.        ]]
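
The matrix entries are not simply count times idf: by default (`norm='l2'`) `TfidfVectorizer` rescales each row to unit Euclidean length, and it uses the raw term count as tf. A minimal sketch of that last step for the first document, taking the idf values printed earlier as given:

```python
import math

# idf values copied from the printed output (column order: doc, is, language, python)
idf = [1.6931471805599454, 1.6931471805599454, 1.6931471805599454, 1.2876820724517808]

# Raw term counts for 'Python is a language' ('a' is dropped by the tokenizer)
counts = [0, 1, 1, 1]

# Unnormalized weights: count * idf for every vocabulary entry
raw = [count * weight for count, weight in zip(counts, idf)]

# Divide by the Euclidean (L2) norm so the row has length 1
norm = math.sqrt(sum(value * value for value in raw))
row = [value / norm for value in raw]
print(row)
```

The same step explains the second and third rows: a document with a single term always normalizes to a weight of 1.0 for that term.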


Below are some examples that show how to compute tf-idf values of words from a corpus.

Example 1: The complete workflow in a single cell, applied to a slightly different corpus.

In [6]:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer
  
# assign documents
d0 = 'Python is a programming language'
d1 = 'Python'
d2 = 'program'
  
# merge documents into a single corpus
string = [d0, d1, d2]
  
# create object
tfidf = TfidfVectorizer()
  
# get tf-idf values
result = tfidf.fit_transform(string)
  
# get idf values (get_feature_names_out() requires scikit-learn >= 1.0)
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)
  
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)
  
# display tf-idf values
print('\ntf-idf value:')
print(result)
  
# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())


idf values:
is : 1.6931471805599454
language : 1.6931471805599454
program : 1.6931471805599454
programming : 1.6931471805599454
python : 1.2876820724517808

Word indexes:
{'python': 4, 'is': 0, 'programming': 3, 'language': 1, 'program': 2}

tf-idf value:
  (0, 1)	0.5286346066596935
  (0, 3)	0.5286346066596935
  (0, 0)	0.5286346066596935
  (0, 4)	0.4020402441612698
  (1, 4)	1.0
  (2, 2)	1.0

tf-idf values in matrix form:
[[0.52863461 0.52863461 0.         0.52863461 0.40204024]
 [0.         0.         0.         0.         1.        ]
 [0.         0.         1.         0.         0.        ]]


Example 2: Here, tf-idf values are computed from a corpus in which every document consists of a single term that appears nowhere else.



In [7]:
# assign documents
doc0 = 'Python01'
doc1 = 'Python02'
doc2 = 'Python03'
doc3 = 'Python04'
  
# merge documents into a single corpus
string = [doc0, doc1, doc2, doc3]
  
# create object
tfidf = TfidfVectorizer()
  
# get tf-idf values
result = tfidf.fit_transform(string)
  
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)
  
# display tf-idf values
print('\ntf-idf values:')
print(result)



Word indexes:
{'python01': 0, 'python02': 1, 'python03': 2, 'python04': 3}

tf-idf values:
  (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0


Example 3: In this program, tf-idf values are computed from a corpus of identical documents.


In [8]:
# assign documents
doc0 = 'Python for coding!'
doc1 = 'Python for coding!'

# merge documents into a single corpus
string = [doc0, doc1]
  
# create object
tfidf = TfidfVectorizer()
  
# get tf-idf values
result = tfidf.fit_transform(string)
  
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)
  
# display tf-idf values
print('\ntf-idf values:')
print(result)


Word indexes:
{'python': 2, 'for': 1, 'coding': 0}

tf-idf values:
  (0, 0)	0.5773502691896258
  (0, 1)	0.5773502691896258
  (0, 2)	0.5773502691896258
  (1, 0)	0.5773502691896258
  (1, 1)	0.5773502691896258
  (1, 2)	0.5773502691896258
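
The value 0.577... is 1/sqrt(3). Assuming the default smoothed idf of `TfidfVectorizer`, ln((1 + N) / (1 + df(t))) + 1, every term that occurs in all documents gets idf 1, and three equal weights scaled to unit length come out at 1/sqrt(3) each. A short sketch of that arithmetic:

```python
import math

n_docs = 2  # both documents are 'Python for coding!'
df = 2      # each of the three terms occurs in both documents
idf = math.log((1 + n_docs) / (1 + df)) + 1  # ln(3/3) + 1

raw = [1 * idf] * 3  # one occurrence each of 'coding', 'for', 'python'
norm = math.sqrt(sum(value * value for value in raw))
weights = [value / norm for value in raw]
print(weights)
```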


Example 4: Below is a program in which we calculate the tf-idf value of a single word, python, that is repeated multiple times in multiple documents.

In [9]:
# assign corpus
string = ['Python python']*5
  
# create object
tfidf = TfidfVectorizer()
  
# get tf-idf values
result = tfidf.fit_transform(string)
  
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)
  
# display tf-idf values
print('\ntf-idf values:')
print(result)


Word indexes:
{'python': 0}

tf-idf values:
  (0, 0)	1.0
  (1, 0)	1.0
  (2, 0)	1.0
  (3, 0)	1.0
  (4, 0)	1.0
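
The word occurs twice in every document, yet the weight is 1.0 rather than 2. Assuming the default L2 row normalization of `TfidfVectorizer`, a document with only one distinct term always ends up with weight 1.0, whatever the repeat count. The hypothetical helper below sketches why:

```python
import math


def single_term_weight(count, n_docs):
    # The term occurs in every document, so df equals n_docs and idf is 1.
    idf = math.log((1 + n_docs) / (1 + n_docs)) + 1
    raw = count * idf
    norm = math.sqrt(raw * raw)  # L2 norm of a one-element row
    return raw / norm


for count in (1, 2, 10):
    print(count, single_term_weight(count, n_docs=5))
```

The repeat count cancels out in the division, so it only becomes visible when a document contains at least two distinct terms.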


--------------