What does tf-idf mean?


Tf-idf stands for <em>term frequency-inverse document frequency</em>. The tf-idf weight is a statistical measure, widely used in information retrieval and text mining, of how important a word is to a document in a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but is offset by how frequently the word appears across the corpus. Search engines often use variations of the tf-idf weighting scheme as a central tool for scoring and ranking a document's relevance to a user query.
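
In symbols, one common textbook formulation can be written as follows. This is an assumption for illustration, since libraries differ in their smoothing and normalization details. Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$:

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```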

One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.
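
The summing idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a production ranking function: the helper names (`smoothed_idf`, `score`, `rank`) are hypothetical, the tokenizer is a plain whitespace split, and the smoothed idf variant is just one of several in common use.

```python
import math
from collections import Counter

# Toy corpus; all helper names below are illustrative only.
docs = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
tokenized = [d.split() for d in docs]


def smoothed_idf(term):
    """Smoothed idf: 1 + ln((1 + N) / (1 + df)), one of several common variants."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return 1 + math.log((1 + len(tokenized)) / (1 + df))


def score(query, tokens):
    """Sum tf * idf over the query terms for one tokenized document."""
    counts = Counter(tokens)
    return sum((counts[t] / len(tokens)) * smoothed_idf(t) for t in query.split())


def rank(query):
    """Return document indices ordered from most to least relevant."""
    scores = [score(query, tokens) for tokens in tokenized]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


ranking = rank('first document')
print(ranking)
```

For the query `'first document'`, the two documents that contain both terms score highest, and the document that contains neither comes last.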

Tf-idf can also be used for stop-word filtering in tasks such as text summarization and classification.
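
One way to read this: words that occur in nearly every document receive the lowest idf, so a document-frequency threshold flags them as stop-word candidates. The sketch below assumes whitespace tokenization, and the function name `candidate_stop_words` and its `min_doc_ratio` parameter are hypothetical.

```python
docs = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]


def candidate_stop_words(documents, min_doc_ratio=1.0):
    """Return words that occur in at least min_doc_ratio of the documents."""
    doc_sets = [set(d.split()) for d in documents]
    vocabulary = sorted(set().union(*doc_sets))
    n_docs = len(documents)
    return [
        word for word in vocabulary
        if sum(word in s for s in doc_sets) / n_docs >= min_doc_ratio
    ]


stop_candidates = candidate_stop_words(docs)
print(stop_candidates)  # ['is', 'the', 'this']
```

On a corpus this small the threshold is crude; on real data a ratio below 1.0 would usually be more useful.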

### 1. Custom TF-IDF vectorizer compared with sklearn

### Corpus

In [1]:
#Collection of string documents

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

### SkLearn Implementation

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
skl_output = vectorizer.transform(corpus)

In [3]:
#sklearn feature names, sorted in alphabetic order by default.

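#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist())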

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']




In [4]:
#Here we will print the sklearn tfidf vectorizer idf values after applying the fit method

print(vectorizer.idf_)

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]


In [5]:
#shape of sklearn tfidf vectorizer after applying transform method.

skl_output.shape

(4, 9)

In [6]:
#sklearn tfidf values for first line of the above corpus.
#output is a sparse matrix

print(skl_output[0])

  (0, 8)	0.38408524091481483
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 2)	0.5802858236844359
  (0, 1)	0.46979138557992045


In [7]:
#sklearn tfidf values for first line of the above corpus.
#here we are converting the sparse output matrix to dense matrix and printing it.
#output is normalized using L2 normalization. sklearn does this by default.

print(skl_output[0].toarray())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
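
The L2 step can be reproduced by hand for the first document. The sketch below assumes sklearn's default behaviour of using raw term counts as tf, and it copies the idf values printed earlier; the variable names are illustrative.

```python
import math

# idf values for the words of 'this is the first document', copied from vectorizer.idf_.
idf_values = {'document': 1.22314355, 'first': 1.51082562, 'is': 1.0, 'the': 1.0, 'this': 1.0}

# Every word occurs exactly once in this document, so raw tf-idf = 1 * idf.
raw = {word: 1 * idf for word, idf in idf_values.items()}

# Euclidean (L2) length of the row, then divide each entry by it.
l2 = math.sqrt(sum(v * v for v in raw.values()))
normalized = {word: v / l2 for word, v in raw.items()}

for word, v in normalized.items():
    print(f"{word}: {v:.8f}")
```

The printed values should agree with the non-zero entries of the dense row above.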


### custom implementation

In [8]:
from collections import Counter
from scipy.sparse import csr_matrix
import math
from sklearn.preprocessing import normalize

In [9]:
corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

In [10]:
def fit(dataset):
    """Return Vocabulary of Unique Words"""
    unique_words = set()

    #Check if the input is a list of sentences
    if isinstance(dataset, list):
        #Iterate through each sentence in the dataset
        for row in dataset:
            #Split the sentence into words
            for word in row.split(" "):
                #Ignore words with length less than 2
                if len(word) < 2:
                    continue
                #Add unique words to the set
                unique_words.add(word)

        #Sort the unique words and create a vocabulary mapping
        unique_words = sorted(list(unique_words))
        vocab = {j: i for i, j in enumerate(unique_words)}
        return vocab
    else:
        #Fail loudly instead of silently returning None
        raise TypeError("You need to pass a list of sentences.")

In [11]:
vocab=fit(corpus)
print(vocab)

{'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}


In [12]:
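#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist())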

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
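
The two vocabularies agree here because the corpus is lowercase and has no punctuation. By default, sklearn lowercases the text and extracts tokens with the pattern `(?u)\b\w\w+\b` (two or more word characters), while the custom `fit` splits on spaces and drops words shorter than two characters. The sketch below shows where the two approaches diverge; the function names are hypothetical.

```python
import re

# sklearn's default token_pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def sklearn_like_tokens(text):
    """Lowercase, then extract tokens the way sklearn's defaults would."""
    return TOKEN_PATTERN.findall(text.lower())


def split_tokens(text):
    """Tokenize the way the custom fit does: split on spaces, drop short words."""
    return [word for word in text.split(" ") if len(word) >= 2]


sentence = "This is a Test, isn't it?"
print(sklearn_like_tokens(sentence))  # ['this', 'is', 'test', 'isn', 'it']
print(split_tokens(sentence))         # ['This', 'is', 'Test,', "isn't", 'it?']
```

On text with capital letters or punctuation, the custom vocabulary would therefore differ from sklearn's unless the tokenizer is adjusted.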


In [13]:
import math

def idf(corpus, vocab):
    """Return Inverse Document Frequency values"""
    idf_values = {}
    total_documents = len(corpus)  #Total number of documents in the collection

    #Iterate through each word in the vocabulary
    for word in vocab.keys():
        if len(word) < 2:
            continue

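        #Count the number of documents containing the word as a whole token
        #(a plain `word in row` would also match substrings, e.g. 'is' inside 'this')
        document_count = sum(1 for row in corpus if word in row.split())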

        #Calculate Inverse Document Frequency
        idf_values[word] = 1 + (math.log((1 + total_documents) / (1 + document_count)))

    return idf_values


In [14]:
idf(corpus,vocab)

{'and': 1.916290731874155,
 'document': 1.2231435513142097,
 'first': 1.5108256237659907,
 'is': 1.0,
 'one': 1.916290731874155,
 'second': 1.916290731874155,
 'the': 1.0,
 'third': 1.916290731874155,
 'this': 1.0}

In [15]:
print(vectorizer.idf_)

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]
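
The agreement is expected: with its default `smooth_idf=True`, sklearn computes idf as `ln((1 + n) / (1 + df)) + 1`, which is the same formula used in the custom `idf`. A quick hand check for a few words is shown below; the document frequencies are counted from the four-sentence corpus and the variable names are illustrative.

```python
import math

n_docs = 4
# Number of documents in the corpus that contain each word.
doc_freq = {'and': 1, 'first': 2, 'document': 3, 'is': 4}

hand_idf = {
    word: 1 + math.log((1 + n_docs) / (1 + df))
    for word, df in doc_freq.items()
}

for word, value in hand_idf.items():
    print(f"{word}: {value:.8f}")
```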


In [16]:
from collections import Counter

def tf(corpus, vocab):
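    """Return the Term Frequency values for a single-document list.

    tf_values is overwritten on every iteration, so only the last document's
    values are returned. Pass one document at a time, as transform does.
    """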
    tf_values = {}

    #Check if the input is a list of documents
    if isinstance(corpus, list):
        #Iterate through each document in the corpus
        for idx, row in enumerate(corpus):
            #Calculate word frequency in the document
            word_frequency = dict(Counter(row.split()))
            total_words = sum(word_frequency.values())

            #Iterate through each word in the vocabulary
            for word in vocab.keys():
                if len(word) < 2:
                    continue

                #Calculate Term Frequency for each word
                if word in word_frequency:
                    tf_values[word] = word_frequency[word] / total_words
                else:
                    tf_values[word] = 0

    return tf_values

In [17]:
def transform(corpus, vocab):
    """Return L2-normalized TFIDF values as a sparse matrix"""
    rows = []
    columns = []
    values = []

    #Check if the input is a list of documents
    if not isinstance(corpus, list):
        raise TypeError("You need to pass a list of sentences.")

    #IDF depends only on the corpus, so compute it once outside the loop
    idf_val = idf(corpus, vocab)

    #Iterate through each document in the corpus
    for idx, row in enumerate(corpus):
        #Calculate TF values for this document only
        tf_val = tf([row], vocab)

        #Iterate through each word in the vocabulary
        for word, column in vocab.items():
            if len(word) < 2:
                continue

            #Calculate TFIDF for each word
            tfidf = tf_val[word] * idf_val[word]

            #Add non-zero TFIDF values to the sparse matrix
            if tfidf != 0:
                rows.append(idx)
                columns.append(column)
                values.append(tfidf)

    #Create a CSR matrix and normalize each row to unit L2 norm
    csr_mat = csr_matrix((values, (rows, columns)), shape=(len(corpus), len(vocab)))
    return normalize(csr_mat, norm='l2')

In [18]:
#custom implementation output for the first document
tf_idf=transform(corpus,vocab)
print(tf_idf[0])
print("***************************************")
print(tf_idf[0].toarray())
print("***************************************")

  (0, 1)	0.4697913855799205
  (0, 2)	0.580285823684436
  (0, 3)	0.3840852409148149
  (0, 6)	0.3840852409148149
  (0, 8)	0.3840852409148149
***************************************
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
***************************************


In [19]:
print(skl_output[0])
print("*****************************")
print(skl_output[0].toarray())

  (0, 8)	0.38408524091481483
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 2)	0.5802858236844359
  (0, 1)	0.46979138557992045
*****************************
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]


#### Conclusion

Both outputs are the same, up to floating-point rounding in the last digits.
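
The match holds even though the custom `tf` divides each count by the document length while sklearn uses raw counts: multiplying a row by a positive constant does not change its L2-normalized form. The sketch below illustrates this for the first document, using the idf values printed earlier; the helper name `l2_normalize` is hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


# idf values for 'document', 'first', 'is', 'the', 'this'.
idf_row = [1.22314355, 1.51082562, 1.0, 1.0, 1.0]

raw_count_row = [1 * w for w in idf_row]       # sklearn-style tf: raw counts
relative_row = [(1 / 5) * w for w in idf_row]  # custom tf: count / document length

a = l2_normalize(raw_count_row)
b = l2_normalize(relative_row)
print(all(math.isclose(x, y) for x, y in zip(a, b)))
```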