# Term Frequency - Inverse Document Frequency (TF-IDF)
* TF-IDF is a **statistical** measure.
* It reflects how important/relevant a word is to a document in a collection or corpus.
* It is often used as a weighting factor in information retrieval, text mining, and user modeling.
* A [survey conducted in 2015](https://kops.uni-konstanz.de/handle/123456789/32348) showed that 83% of text-based recommender systems in digital libraries use tf–idf.
* The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
* Applications: search engines (to score how relevant a document is to a query) and stop-word removal (especially in text summarization and document classification).

## Term Frequency (TF)
* The first form of term weighting is due to [Hans Peter Luhn (1957)](https://ieeexplore.ieee.org/document/5392697).
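* The number of times a term occurs in a document is called its term frequency. The code in this notebook uses the relative variant, dividing that count by the number of words in the document.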
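
As a minimal sketch of term frequency, the snippet below counts the terms of one whitespace-tokenized sentence with `collections.Counter` and also divides each count by the sentence length, which is the relative variant the `transform` function in this notebook relies on. The helper name `term_frequencies` is hypothetical and only serves this illustration.

```python
from collections import Counter

# Hypothetical helper for illustration; not part of the notebook's own code
def term_frequencies(sentence):
    """Return (raw counts, relative frequencies) for a whitespace-tokenized sentence."""
    words = sentence.split()
    counts = Counter(words)
    # Relative frequency: raw count divided by the number of tokens in the sentence
    relative = {word: count / len(words) for word, count in counts.items()}
    return counts, relative

counts, relative = term_frequencies("this document is the second document")
print(counts["document"])    # raw count: 2
print(relative["document"])  # relative frequency: 2 / 6
```

Raw counts favor long documents, so dividing by the sentence length is one common way to make sentences of different lengths comparable.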

## Inverse Document Frequency (IDF)
* Motivation: TF alone tends to overemphasize documents that happen to use common words such as "the" more frequently, even though such words say little about what a document is about.
* [Karen Spärck Jones](https://en.wikipedia.org/wiki/Karen_Sp%C3%A4rck_Jones) (1972) conceived a statistical interpretation of term-specificity called Inverse Document Frequency (idf), which became a cornerstone of term weighting
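
The `IDF` function in this notebook computes a smoothed variant, idf(t) = ln((1 + N) / (1 + df(t))) + 1, where N is the number of documents and df(t) is the number of documents that contain the term t. The sketch below applies that formula to three toy documents. The name `smoothed_idf` and the example documents are assumptions made for illustration only.

```python
import math

# Hypothetical helper for illustration; mirrors the smoothed formula stated above
def smoothed_idf(term, documents):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1, with whitespace tokenization."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain the term at least once
    df = sum(1 for doc in documents if term in doc.split())
    return math.log((1 + n_docs) / (1 + df)) + 1

# Toy documents, assumed for this example
docs = ["the cat sat", "the dog ran", "the cat ran home"]
print(smoothed_idf("the", docs))   # in every document: ln(4/4) + 1 = 1.0
print(smoothed_idf("home", docs))  # in one document: ln(4/2) + 1, about 1.69
```

The added ones keep the ratio finite for a term that occurs in no document, and the trailing `+ 1` means a term found in every document still gets a weight of 1 rather than 0.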

In [9]:
%pip install scipy scikit-learn numpy

Note: you may need to restart the kernel to use updated packages.


In [29]:
from collections import Counter
from scipy.sparse import lil_matrix
import math
from sklearn.preprocessing import normalize
import numpy as np 

In [11]:
corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
] 

In [12]:
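def IDF(corpus, unique_words):
    """Smoothed IDF for each word: ln((1 + N) / (1 + df)) + 1."""
    idf_dict = {}
    N = len(corpus)
    for word in unique_words:
        # Document frequency: number of sentences that contain the word
        count = 0
        for sen in corpus:
            if word in sen.split():
                count += 1
        # Assign once per word, after all sentences have been counted
        idf_dict[word] = math.log((1 + N) / (count + 1)) + 1
    return idf_dict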

In [13]:
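def fit(whole_data):
    """Build the sorted vocabulary and per-word IDF values from a list of sentences."""
    if not isinstance(whole_data, list):
        # Fail early: otherwise vocab would be undefined at the return statement
        raise TypeError("whole_data must be a list of strings")
    unique_words = set()
    for x in whole_data:
        for y in x.split():
            # Skip single-character tokens
            if len(y) < 2:
                continue
            unique_words.add(y)
    unique_words = sorted(unique_words)
    # Map each word to its column index in the TF-IDF matrix
    vocab = {j: i for i, j in enumerate(unique_words)}
    Idf_values_of_all_unique_words = IDF(whole_data, unique_words)
    return vocab, Idf_values_of_all_unique_words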

In [14]:
Vocabulary, idf_of_vocabulary=fit(corpus) 

In [15]:
print(list(Vocabulary.keys())) 

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [35]:
def transform(dataset, vocabulary, idf_values):
    sparse_matrix = lil_matrix((len(dataset), len(vocabulary)), dtype=np.float64)
    
    for row, sentence in enumerate(dataset):
        words = sentence.split()
        number_of_words_in_sentence = Counter(words)
        sentence_length = len(words)

        # Visit each distinct word once; words outside the vocabulary are skipped
        for word, count in number_of_words_in_sentence.items():
            if word in vocabulary:
                tf = count / sentence_length  # term frequency relative to sentence length
                tf_idf_value = tf * idf_values[word]
                sparse_matrix[row, vocabulary[word]] = tf_idf_value
    
    # Convert to csr_matrix for efficient operations
    sparse_matrix = sparse_matrix.tocsr()
    
    print("VOCABULARY:")
    for word, idx in sorted(vocabulary.items()):
        print(f"  '{word}': {idx}")
    
    print("\nTF-IDF MATRIX (before normalization):")
    print(sparse_matrix.toarray())
    
    normalized_matrix = normalize(sparse_matrix, norm='l2', axis=1, copy=True, return_norm=False)
    
    print("\nTF-IDF MATRIX (after L2 normalization):")
    print(normalized_matrix.toarray())
    
    return normalized_matrix

In [33]:
final_output=transform(corpus,Vocabulary,idf_of_vocabulary)
print(final_output.shape) 

VOCABULARY:
  'and': 0
  'document': 1
  'first': 2
  'is': 3
  'one': 4
  'second': 5
  'the': 6
  'third': 7
  'this': 8

TF-IDF MATRIX (before normalization):
[[0.         0.24462871 0.30216512 0.2        0.         0.
  0.2        0.         0.2       ]
 [0.         0.40771452 0.         0.16666667 0.         0.31938179
  0.16666667 0.         0.16666667]
 [0.31938179 0.         0.         0.16666667 0.31938179 0.
  0.16666667 0.31938179 0.16666667]
 [0.         0.24462871 0.30216512 0.2        0.         0.
  0.2        0.         0.2       ]]

TF-IDF MATRIX (after L2 normalization):
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
(4, 9)


In [18]:
print(final_output[0].toarray())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
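
To see where the normalized numbers come from, the sketch below redoes the L2 normalization of the first row by hand, using only the standard library. The row values are copied from the non-normalized matrix printed above. `l2_normalize` is a hypothetical helper written for this check, not a scikit-learn function.

```python
import math

# Hypothetical helper for illustration; not a scikit-learn function
def l2_normalize(row):
    """Scale a vector to unit Euclidean length; all-zero rows are returned unchanged."""
    norm = math.sqrt(sum(value * value for value in row))
    if norm == 0:
        return list(row)
    return [value / norm for value in row]

# Non-normalized TF-IDF weights of 'this is the first document'
first_row = [0.0, 0.24462871, 0.30216512, 0.2, 0.0, 0.0, 0.2, 0.0, 0.2]
normalized = l2_normalize(first_row)
print([round(value, 4) for value in normalized])
```

Each weight is divided by the Euclidean length of its row, so every non-zero row ends up with unit length and the dot product of two rows equals their cosine similarity.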
