# Building TF-IDF from Scratch
In this notebook, we will implement **Term Frequency - Inverse Document Frequency (TF-IDF)** from scratch using Python. 
This is a fundamental technique in Natural Language Processing (NLP) for converting text data into numerical vectors.

### Goals
1. Implement **TF (Term Frequency)**.
2. Implement **IDF (Inverse Document Frequency)**.
3. Combine them to create **TF-IDF**.
4. Compare our results with `sklearn`.


In [4]:
import pandas as pd
import math
import numpy as np

# Sample Corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great"
]
print("Corpus:", corpus)

Corpus: ['the cat sat on the mat', 'the dog sat on the log', 'cats and dogs are great']


## 1. Term Frequency (TF)

**Term Frequency** measures how frequently a term appears in a document. 

### Formula
$$
TF(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total number of terms in document } d}
$$
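
As a quick check of the formula, the sketch below computes TF for the first sentence of our corpus using `collections.Counter`. The names `tokens`, `counts`, and `tf_example` are illustrative only and are not part of the exercise.

```python
from collections import Counter

# Worked example of the TF formula on a single document.
document = "the cat sat on the mat"
tokens = document.split()   # 6 tokens in total, repeats included
counts = Counter(tokens)    # 'the' appears twice, every other word once

# TF = (count of term in document) / (total number of terms in document)
tf_example = {term: count / len(tokens) for term, count in counts.items()}

print(tf_example["the"])  # 2 / 6
print(tf_example["cat"])  # 1 / 6
```

Because the denominator counts every token, the TF values of a document always sum to 1.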

### Exercise 1
Complete the function `compute_tf` below.


In [37]:
def compute_tf(document):
    """
    Computes TF for a single document (string).
    Returns a dictionary: {term: tf_value}
    """
    # Split the document into words (tokens)
    words = document.split()
    # Total number of terms, counting repeats (the denominator in the formula)
    total_words = len(words)

    # 1. Count the frequency of each word
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1

    # 2. Calculate TF = (count of term) / (total terms)
    tf_dict = {}
    for word, count in word_counts.items():
        tf_dict[word] = count / total_words

    return tf_dict

# Test with the first document
print("TF for doc 0:", compute_tf(corpus[0]))


TF for doc 0: {'the': 0.3333333333333333, 'cat': 0.16666666666666666, 'sat': 0.16666666666666666, 'on': 0.16666666666666666, 'mat': 0.16666666666666666}


## 2. Inverse Document Frequency (IDF)

**IDF** measures how informative a term is across the whole corpus. TF alone treats every term as equally important, so IDF down-weights terms that appear in many documents (like "the", "is") and up-weights terms that appear in only a few.

### Formula
$$
IDF(t) = \log \left( \frac{N}{DF(t)} \right)
$$
Where:
* $N$ = Total number of documents.
* $DF(t)$ = Number of documents containing term $t$.

*Note: Either `math.log10` or `math.log` (natural log) works, since changing the base only rescales every IDF value by the same constant. The code below uses the natural log.*
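
To make the note concrete, the sketch below evaluates the IDF formula for a term found in two of three documents and for a term found in only one, using both log bases. The document counts are taken from our sample corpus; the helper name `idf` is hypothetical.

```python
import math

N = 3  # total number of documents in the sample corpus

def idf(df, log_fn=math.log):
    """IDF(t) = log(N / DF(t)) for a term with document frequency df."""
    return log_fn(N / df)

# "the" appears in 2 of 3 documents, "cat" in 1 of 3.
idf_the_ln, idf_cat_ln = idf(2), idf(1)
idf_the_log10, idf_cat_log10 = idf(2, math.log10), idf(1, math.log10)

# Changing the base rescales every IDF by the same constant, ln(10),
# so the relative ordering of terms is unchanged.
print(idf_the_ln, idf_cat_ln)
print(idf_cat_ln / idf_cat_log10, math.log(10))
```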

### Exercise 2
Complete the function `compute_idf`.


In [16]:
from math import log


In [38]:

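def compute_idf(corpus):
    """
    Computes IDF for every term in the corpus (list of strings).
    Returns a dictionary: {term: idf_value}
    """
    N = len(corpus)

    # 1. Document frequency: count each term once per document
    all_words_df = {}
    for doc in corpus:
        words = set(doc.split())
        for word in words:
            all_words_df[word] = all_words_df.get(word, 0) + 1

    # 2. IDF = log(N / DF), using the natural log
    idf_dict = {}
    for word, df_count in all_words_df.items():
        idf_dict[word] = log(N / df_count)

    return idf_dict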

# Test it
idf_result = compute_idf(corpus)
print("IDF Result:", idf_result)


IDF Result: {'sat': 0.4054651081081644, 'on': 0.4054651081081644, 'the': 0.4054651081081644, 'cat': 1.0986122886681098, 'mat': 1.0986122886681098, 'log': 1.0986122886681098, 'dog': 1.0986122886681098, 'are': 1.0986122886681098, 'great': 1.0986122886681098, 'cats': 1.0986122886681098, 'dogs': 1.0986122886681098, 'and': 1.0986122886681098}


## 3. TF-IDF

Now we multiply them together:
$$
TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)
$$
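
As an illustration of the product, the sketch below computes the TF-IDF weight of a single term by hand, assuming whitespace tokens, the natural log, and the sample corpus from the top of the notebook. The names `corpus_example` and `tfidf_single` are hypothetical.

```python
import math

corpus_example = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great",
]

def tfidf_single(term, doc_index, docs):
    """TF-IDF of one term in one document (whitespace tokens, natural log)."""
    tokens = docs[doc_index].split()
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for d in docs if term in d.split())
    if df == 0:
        return 0.0  # term not in the corpus: treat its weight as zero
    return tf * math.log(len(docs) / df)

# "cat" is rare (1 of 3 documents); "the" is common (2 of 3 documents).
w_cat = tfidf_single("cat", 0, corpus_example)
w_the = tfidf_single("the", 0, corpus_example)
print(w_cat, w_the)
```

Although "the" occurs twice in the first document, its low IDF leaves it with a smaller weight than "cat".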

### Exercise 3
Create the full TF-IDF matrix for our corpus.


In [40]:
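def compute_tfidf(corpus):
    """
    Computes TF-IDF for every document in the corpus.
    Returns a list of dictionaries, one {term: tfidf_value} per document.
    """
    # IDF is a corpus-level quantity, so compute it once
    idf_dict = compute_idf(corpus)
    vectors = []

    for doc in corpus:
        tf_dict = compute_tf(doc)
        doc_tfidf = {}
        for word, tf_val in tf_dict.items():
            idf_val = idf_dict.get(word, 0)
            doc_tfidf[word] = tf_val * idf_val
        vectors.append(doc_tfidf)

    return vectors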

tfidf_vectors = compute_tfidf(corpus)
df = pd.DataFrame(tfidf_vectors).fillna(0)
print("My TF-IDF Matrix:")
print(df)


My TF-IDF Matrix:
        the       cat       sat        on       mat       dog       log  \
0  0.135155  0.183102  0.067578  0.067578  0.183102  0.000000  0.000000   
1  0.135155  0.000000  0.067578  0.067578  0.000000  0.183102  0.183102   
2  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   

       cats       and      dogs       are     great  
0  0.000000  0.000000  0.000000  0.000000  0.000000  
1  0.000000  0.000000  0.000000  0.000000  0.000000  
2  0.219722  0.219722  0.219722  0.219722  0.219722  


## 4. Comparison with Scikit-Learn
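Let's see how our results compare with scikit-learn's `TfidfVectorizer`.
Note: scikit-learn's formula differs from ours in several ways, so the numbers won't be identical, although for this corpus the pattern of zero and nonzero entries matches. It uses raw term counts for TF instead of dividing by document length, adds 1 to every IDF value (and, by default, also smooths the document counts with `smooth_idf=True`), and by default L2-normalizes each row (`norm='l2'`).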
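
The normalization mentioned in the note above is, by default, an L2 scaling of each document row. A minimal sketch of that step in plain Python, applied to a hypothetical weight vector (the helper `l2_normalize` is illustrative and is not scikit-learn's code):

```python
import math

def l2_normalize(weights):
    """Scale a {term: weight} row so its Euclidean length is 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)  # leave an all-zero row unchanged
    return {term: w / norm for term, w in weights.items()}

# Hypothetical unnormalized row, for illustration only
row = {"cat": 3.0, "mat": 4.0}
unit_row = l2_normalize(row)
print(unit_row)  # {'cat': 0.6, 'mat': 0.8}
```

After this step, long and short documents become comparable because every row has the same length.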


In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Set up TfidfVectorizer. By default it L2-normalizes rows and smooths IDF;
# we turn both off here to stay closer to our simple implementation.
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)

sklearn_tfidf = vectorizer.fit_transform(corpus)
df_sklearn = pd.DataFrame(sklearn_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

print("Sklearn TF-IDF Matrix:")
print(df_sklearn)


Sklearn TF-IDF Matrix:
        and       are       cat      cats       dog      dogs     great  \
0  0.000000  0.000000  2.098612  0.000000  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  2.098612  0.000000  0.000000   
2  2.098612  2.098612  0.000000  2.098612  0.000000  2.098612  2.098612   

        log       mat        on       sat      the  
0  0.000000  2.098612  1.405465  1.405465  2.81093  
1  2.098612  0.000000  1.405465  1.405465  2.81093  
2  0.000000  0.000000  0.000000  0.000000  0.00000  
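
The numbers above can be reproduced by hand. The sketch below assumes that `TfidfVectorizer(norm=None, smooth_idf=False)` weights each term as its raw count times `ln(N / DF) + 1`, as described in scikit-learn's documentation. It is a plain-Python approximation with a whitespace tokenizer, not scikit-learn's implementation, and `sklearn_style_tfidf` is a hypothetical name.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are great",
]

def sklearn_style_tfidf(docs):
    """Raw count * (ln(N / DF) + 1), with no row normalization."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append({t: c * (math.log(n_docs / df[t]) + 1) for t, c in counts.items()})
    return rows

rows = sklearn_style_tfidf(docs)
# Compare with the 'cat', 'on', and 'the' columns of row 0 in the matrix above
print(round(rows[0]["cat"], 6), round(rows[0]["on"], 6), round(rows[0]["the"], 6))
```

The added 1 keeps terms that appear in every document from being zeroed out, and the raw counts explain why "the" ends up with the largest weight in the first two rows.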
