**TF-IDF**

1. TF (Term Frequency): How often a word appears in a document.

2. IDF (Inverse Document Frequency): How unique or rare a word is across all documents.




TF-IDF is a statistical measure of how relevant a word is to a document within a collection of documents. It yields a sparse vector representation that can serve as a basic form of word/document embedding, but unlike neural embeddings it does not capture semantic similarity between words.

**The TF-IDF score for a word increases when**

* It appears often in the document (high TF).

* It is rare across the whole corpus (high IDF).



### 🔹 How is TF-IDF calculated?

For a word `w` in a document `d` from a set of documents `D`:

**Term Frequency (TF)**  
The frequency of word `w` in document `d`:

$$
TF(w, d) = \frac{\text{Number of times } w \text{ appears in } d}{\text{Total words in } d}
$$

**Inverse Document Frequency (IDF)**  
The uniqueness of word `w` across all documents `D`:

$$
IDF(w, D) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with } w}\right)
$$

**TF-IDF Score**  
Combining TF and IDF:

$$
TF\text{-}IDF(w, d, D) = TF(w, d) \times IDF(w, D)
$$
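The three formulas above can be sketched directly in plain Python. This is a minimal illustration, not a reference implementation: the toy corpus and the helper names `tf`, `idf`, and `tf_idf` are hypothetical, the natural logarithm is assumed (the formula does not fix a base), and `idf` assumes the word occurs in at least one document.

```python
import math

# Hypothetical toy corpus, already tokenized (not taken from the notebook cells).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
]

def tf(word, doc):
    # Share of the document's tokens that are `word`.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Natural log of (total documents / documents containing `word`).
    # Assumes `word` occurs in at least one document.
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf("cat", docs[0]))            # 1 of 3 tokens
print(idf("cat", docs))              # log(3 / 2)
print(tf_idf("cat", docs[0], docs))  # (1/3) * log(3/2)
print(tf_idf("the", docs[0], docs))  # 0.0, because "the" is in every document
```

A word that occurs in every document gets an IDF of `log(1) = 0`, so its TF-IDF score is zero no matter how often it appears.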


**TF-IDF as Word Embedding**

Unlike dense neural embeddings (such as Word2Vec, GloVe, or fastText), TF-IDF is a sparse representation:

* Each document is represented as a vector in the vocabulary space.

* The vector length equals the total number of words in the vocabulary.

* Each dimension holds the TF-IDF score of one vocabulary word, or 0 if that word does not occur in the document.

* So for a document, you get a vector of shape `[vocab_size]`, where each entry corresponds to a word's importance in that document.
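To make the vector layout concrete, here is a small sketch that maps one document to a vector of length `vocab_size`. The corpus, the `index` mapping, and the `doc_vector` helper are hypothetical names chosen for illustration, and the unsmoothed `log(N / df)` form of IDF is assumed.

```python
import math

# Hypothetical tokenized corpus, used only to illustrate the vector layout.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["durian", "elderberry", "fig"],
]

# One dimension per distinct word, in alphabetical order.
vocab = sorted({word for doc in docs for word in doc})
index = {word: i for i, word in enumerate(vocab)}

def doc_vector(doc):
    # Start from all zeros; only words present in the document get a score.
    vec = [0.0] * len(vocab)
    for word in set(doc):
        tf = doc.count(word) / len(doc)
        df = sum(1 for d in docs if word in d)
        vec[index[word]] = tf * math.log(len(docs) / df)
    return vec

vec = doc_vector(docs[0])
zeros = sum(1 for v in vec if v == 0.0)
print(len(vec) == len(vocab))                      # vector length equals vocab size
print(f"{zeros} of {len(vec)} entries are zero")   # most entries stay at 0
```

With a realistic vocabulary of tens of thousands of words, nearly all entries of a document vector are zero, which is why these vectors are usually stored in a sparse format.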

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This this this  is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

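# Learn the vocabulary and IDF weights, then build the TF-IDF matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix, shape (n_documents, vocab_size)

# Column labels: the vocabulary terms in alphabetical order
feature_names = vectorizer.get_feature_names_out()

# Dense copy for display
dense_matrix = X.toarray()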




In [2]:
X.toarray()

array([[0.        , 0.31817034, 0.39300367, 0.2601251 , 0.        ,
        0.        , 0.2601251 , 0.        , 0.7803753 ],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

In [3]:
import pandas as pd
df = pd.DataFrame(dense_matrix, columns=feature_names)
print(df)

        and  document     first        is       one    second       the  \
0  0.000000  0.318170  0.393004  0.260125  0.000000  0.000000  0.260125   
1  0.000000  0.687624  0.000000  0.281089  0.000000  0.538648  0.281089   
2  0.511849  0.000000  0.000000  0.267104  0.511849  0.000000  0.267104   
3  0.000000  0.469791  0.580286  0.384085  0.000000  0.000000  0.384085   

      third      this  
0  0.000000  0.780375  
1  0.000000  0.281089  
2  0.511849  0.267104  
3  0.000000  0.384085  
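These values do not match the textbook formula directly. With its default settings, `TfidfVectorizer` uses raw term counts for TF, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces row 0 in plain Python under those assumptions; the helper names `tokenize` and `tfidf_row` are hypothetical, and the regular expression is meant to mirror the vectorizer's default token pattern (lowercased runs of two or more word characters).

```python
import math
import re
from collections import Counter

corpus = [
    'This this this  is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Lowercase, then keep runs of 2+ word characters (punctuation is dropped).
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(text) for text in corpus]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {}
for word in vocab:
    df = sum(1 for doc in docs if word in doc)
    idf[word] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    counts = Counter(doc)                      # raw counts, not divided by length
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm for v in raw]

row0 = tfidf_row(docs[0])
print(vocab)
print([round(v, 6) for v in row0])
```

Because of the `+ 1` term, words that occur in every document (`is`, `the`, `this`) keep a non-zero weight instead of dropping to 0.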


**TF-IDF from Scratch**

In [4]:
import math
from collections import Counter

# Sample documents
documents = [
    'This this this  is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


# Preprocessing: lowercase and split on whitespace (punctuation stays attached to tokens)
def tokenize(doc):
    return doc.lower().split()

# Step 1: Build vocabulary
vocab = set()
tokenized_docs = []

for doc in documents:
    tokens = tokenize(doc)
    tokenized_docs.append(tokens)
    vocab.update(tokens)

vocab = sorted(vocab)
vocab_index = {word: idx for idx, word in enumerate(vocab)}

# Step 2: Compute TF
def compute_tf(tokens):
    tf = Counter(tokens)
    total_terms = len(tokens)
    return {term: count / total_terms for term, count in tf.items()}

# Step 3: Compute IDF
def compute_idf(doc_list):
    N = len(doc_list)
    idf = {}
    for term in vocab:
        doc_count = sum(term in doc for doc in doc_list)
        idf[term] = math.log(N / (1 + doc_count)) + 1  # smoothed IDF variant: log(N / (1 + df)) + 1
    return idf

# Step 4: Compute TF-IDF
def compute_tfidf(docs):
    idf = compute_idf(docs)
    tfidf_matrix = []

    for tokens in docs:
        tf = compute_tf(tokens)
        tfidf_vector = [tf.get(term, 0) * idf[term] for term in vocab]
        tfidf_matrix.append(tfidf_vector)

    return tfidf_matrix

# Run it
tfidf_matrix = compute_tfidf(tokenized_docs)

# Display the result
print("Vocabulary:", vocab)
print("\nTF-IDF Matrix:")
for i, vec in enumerate(tfidf_matrix):
    print(f"Doc {i+1}:", [round(val, 3) for val in vec])


Vocabulary: ['and', 'document', 'document.', 'document?', 'first', 'is', 'one.', 'second', 'the', 'third', 'this']

TF-IDF Matrix:
Doc 1: [0.0, 0.0, 0.184, 0.0, 0.184, 0.111, 0.0, 0.0, 0.111, 0.0, 0.333]
Doc 2: [0.0, 0.282, 0.215, 0.0, 0.0, 0.129, 0.0, 0.282, 0.129, 0.0, 0.129]
Doc 3: [0.282, 0.0, 0.0, 0.0, 0.0, 0.129, 0.282, 0.0, 0.129, 0.282, 0.129]
Doc 4: [0.0, 0.0, 0.0, 0.339, 0.258, 0.155, 0.0, 0.0, 0.155, 0.0, 0.155]
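The vocabulary above contains `'document'`, `'document.'`, and `'document?'` as three separate terms because `split()` leaves punctuation attached to the tokens. One possible fix, sketched here as an assumption rather than as part of the original notebook, is to tokenize with a regular expression that keeps only word characters:

```python
import re

def tokenize(doc):
    # Lowercase, then keep only runs of word characters, dropping punctuation.
    return re.findall(r"\w+", doc.lower())

print(tokenize('This document is the second document.'))
# ['this', 'document', 'is', 'the', 'second', 'document']
print(tokenize('Is this the first document?'))
# ['is', 'this', 'the', 'first', 'document']
```

With this tokenizer the three variants collapse into the single term `document`, which would change the vocabulary and the scores shown above.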


In [5]:
import pandas as pd
import math

def TFIDF(corpus):

    def tokenizer(text):
        return text.lower().split()

    vocab = []
    tokenized_corpus = []

    for doc in corpus:
        tokens = tokenizer(doc)
        tokenized_corpus.append(tokens)
        vocab.extend(tokens)

    vocab = sorted(set(vocab))
    
    # Calculate Term Frequency (TF)
    tf_list = []
    for tokens in tokenized_corpus:
        tf = {word: 0 for word in vocab}
        for word in tokens:
            tf[word] += 1
        total_terms = len(tokens)
        for word in tf:
            tf[word] = tf[word] / total_terms
        tf_list.append(tf)

    # Calculate Document Frequency (DF)
    df = {word: 0 for word in vocab}
    for word in vocab:
        for tokens in tokenized_corpus:
            if word in tokens:
                df[word] += 1

    # Calculate Inverse Document Frequency (IDF)
    N = len(corpus)
    idf = {word: math.log(N / df[word]) for word in vocab}

    # Calculate TF-IDF
    tfidf_list = []
    for tf in tf_list:
        tfidf = {word: tf[word] * idf[word] for word in vocab}
        tfidf_list.append(tfidf)

    df_tfidf = pd.DataFrame(tfidf_list)
    return df_tfidf


In [6]:
print(TFIDF(corpus))

        and  document  document.  document?     first   is      one.  \
0  0.000000  0.000000   0.099021   0.000000  0.099021  0.0  0.000000   
1  0.000000  0.231049   0.115525   0.000000  0.000000  0.0  0.000000   
2  0.231049  0.000000   0.000000   0.000000  0.000000  0.0  0.231049   
3  0.000000  0.000000   0.000000   0.277259  0.138629  0.0  0.000000   

     second  the     third  this  
0  0.000000  0.0  0.000000   0.0  
1  0.231049  0.0  0.000000   0.0  
2  0.000000  0.0  0.231049   0.0  
3  0.000000  0.0  0.000000   0.0  
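The all-zero columns for `is`, `the`, and `this` follow from the unsmoothed IDF: each of these words occurs in all 4 documents, so `log(4 / 4) = 0`. The sketch below compares three IDF variants for a few document frequencies; the function names are hypothetical. `idf_plain` matches the `TFIDF` function, `idf_shifted` matches the earlier scratch version, and `idf_smooth` is the smoothed form `ln((1 + N) / (1 + df)) + 1` that scikit-learn applies by default.

```python
import math

N = 4  # number of documents in the corpus

def idf_plain(df):
    # log(N / df): zero when the word is in every document
    return math.log(N / df)

def idf_shifted(df):
    # log(N / (1 + df)) + 1: the variant used in the scratch version
    return math.log(N / (1 + df)) + 1

def idf_smooth(df):
    # log((1 + N) / (1 + df)) + 1: smoothed form, never below 1
    return math.log((1 + N) / (1 + df)) + 1

print("df  plain  shifted  smooth")
for df in (1, 2, 4):
    print(df, round(idf_plain(df), 3), round(idf_shifted(df), 3), round(idf_smooth(df), 3))
```

Only the plain variant assigns a weight of exactly 0 to words that appear everywhere; the other two keep such words in the vector with a reduced weight.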


In [7]:
corpus = [
    'This this this  is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


In [8]:
import pandas as pd
import math

In [9]:
def tokenizer(text):
    return text.lower().split()

In [10]:
vocab = []
tokenized_corpus = []

In [11]:
for doc in corpus:
    tok = tokenizer(doc)
    print(tok)
    tokenized_corpus.append(tok)
    vocab.extend(tok)
    




['this', 'this', 'this', 'is', 'the', 'first', 'document.']
['this', 'document', 'is', 'the', 'second', 'document.']
['and', 'this', 'is', 'the', 'third', 'one.']
['is', 'this', 'the', 'first', 'document?']


In [12]:
tokenized_corpus

[['this', 'this', 'this', 'is', 'the', 'first', 'document.'],
 ['this', 'document', 'is', 'the', 'second', 'document.'],
 ['and', 'this', 'is', 'the', 'third', 'one.'],
 ['is', 'this', 'the', 'first', 'document?']]

In [13]:
vocab 

['this',
 'this',
 'this',
 'is',
 'the',
 'first',
 'document.',
 'this',
 'document',
 'is',
 'the',
 'second',
 'document.',
 'and',
 'this',
 'is',
 'the',
 'third',
 'one.',
 'is',
 'this',
 'the',
 'first',
 'document?']

In [14]:
vocab = sorted(set(vocab))
vocab

['and',
 'document',
 'document.',
 'document?',
 'first',
 'is',
 'one.',
 'second',
 'the',
 'third',
 'this']

In [15]:
tf_list = []
for tokens in tokenized_corpus:
    tf = {word: 0 for word in vocab}
    # print(tf)
    for word in tokens:
        # print(word)
        tf[word] +=1
    print(tf)
    for word in tf:
        tf[word] = tf[word] / len(tokens)
    tf_list.append(tf)

{'and': 0, 'document': 0, 'document.': 1, 'document?': 0, 'first': 1, 'is': 1, 'one.': 0, 'second': 0, 'the': 1, 'third': 0, 'this': 3}
{'and': 0, 'document': 1, 'document.': 1, 'document?': 0, 'first': 0, 'is': 1, 'one.': 0, 'second': 1, 'the': 1, 'third': 0, 'this': 1}
{'and': 1, 'document': 0, 'document.': 0, 'document?': 0, 'first': 0, 'is': 1, 'one.': 1, 'second': 0, 'the': 1, 'third': 1, 'this': 1}
{'and': 0, 'document': 0, 'document.': 0, 'document?': 1, 'first': 1, 'is': 1, 'one.': 0, 'second': 0, 'the': 1, 'third': 0, 'this': 1}


In [16]:
df = {word: 0 for word in vocab}
for word in vocab:
    for tokens in tokenized_corpus:
        if word in tokens:
            df[word] += 1


In [17]:
df

{'and': 1,
 'document': 1,
 'document.': 2,
 'document?': 1,
 'first': 2,
 'is': 4,
 'one.': 1,
 'second': 1,
 'the': 4,
 'third': 1,
 'this': 4}

In [18]:
N = len(corpus)
idf = {word: math.log(N / df[word]) for word in vocab}

# Calculate TF-IDF
tfidf_list = []
for tf in tf_list:
    tfidf = {word: tf[word] * idf[word] for word in vocab}
    tfidf_list.append(tfidf)

In [19]:
df_tfidf = pd.DataFrame(tfidf_list)
df_tfidf

        and  document  document.  document?     first   is      one.  \
0  0.000000  0.000000   0.099021   0.000000  0.099021  0.0  0.000000   
1  0.000000  0.231049   0.115525   0.000000  0.000000  0.0  0.000000   
2  0.231049  0.000000   0.000000   0.000000  0.000000  0.0  0.231049   
3  0.000000  0.000000   0.000000   0.277259  0.138629  0.0  0.000000   

     second  the     third  this  
0  0.000000  0.0  0.000000   0.0  
1  0.231049  0.0  0.000000   0.0  
2  0.000000  0.0  0.231049   0.0  
3  0.000000  0.0  0.000000   0.0  
