### TF-IDF from Scratch

#### **Formulas**

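1. **Term Frequency (TF)**  
   Measures how often a term \( t \) appears in a document \( d \), normalized by the document's length:  
   \( \mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of words in } d} \)

2. **Document Frequency (DF)**  
   Counts the number of documents that contain the term \( t \):  
   \( \mathrm{df}(t) = \text{number of documents containing } t \)

3. **Inverse Document Frequency (IDF)**  
   Reduces the weight of terms that appear in many documents. Here \( N \) is the total number of documents, and the \( +1 \) in the denominator guards against division by zero for a term that appears in no document:  
   \( \mathrm{idf}(t) = \log\left(\frac{N}{\mathrm{df}(t) + 1}\right) \)

4. **TF-IDF**  
   Combines TF and IDF to score the importance of a term \( t \) in a document \( d \):  
   \( \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \log\left(\frac{N}{\mathrm{df}(t) + 1}\right) \)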
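
As a quick check of the formulas above, the following sketch evaluates them for a single term using made-up counts: a term that occurs 2 times in a 5-word document and appears in 3 of 8 documents. The variable names are illustrative, and the natural logarithm is assumed.

```python
import math

# Illustrative counts (assumed for this example)
count_in_doc = 2   # occurrences of the term in the document
doc_length = 5     # total number of words in the document
n_docs = 8         # N: total number of documents
df = 3             # number of documents containing the term

tf = count_in_doc / doc_length
idf = math.log(n_docs / (df + 1))   # natural log
tfidf = tf * idf

print(f"tf={tf:.3f}, idf={idf:.3f}, tf-idf={tfidf:.3f}")
```

With these numbers, tf is 0.4, idf is log 2 (about 0.693), and the product is about 0.277.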

---

#### **Steps to Compute TF-IDF**

1. **Calculate TF (Term Frequency)**  
   For each term in a document, divide its frequency by the total number of words in the document.

2. **Calculate DF (Document Frequency)**  
   Count the number of documents in which each term appears.

3. **Calculate IDF (Inverse Document Frequency)**  
   Use the formula \( \mathrm{idf}(t) = \log\left(\frac{N}{\mathrm{df}(t) + 1}\right) \), where \( N \) is the total number of documents.

4. **Compute TF-IDF**  
   Multiply the TF of a term by its corresponding IDF value.

---



In [1]:
import numpy as np  # used for np.log in the IDF computation
import pandas as pd

In [2]:
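import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# text_cleaning needs the stop-word list; quiet=True suppresses the download log
nltk.download('stopwords', quiet=True)
nltk.download('punkt')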

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
corpus = [
    "The sky is blue and beautiful",
    "Love this blue and beautiful sky",
    "The quick brown fox jumps over the lazy dog",
    "A king's breakfast has sausages, ham, bacon, eggs, toast, and beans",
    "I love green eggs, ham, sausages, and bacon",
    "The brown fox is quick and the blue dog is lazy",
    "The sky is very blue and the sky is very beautiful today",
    "The dog is lazy but the brown fox is quick",
]

In [4]:
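stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once instead of once per word

def text_cleaning(corpus):
    """Strip non-letters, lowercase, drop stop words, and stem each document."""
    cleaned_corpus = []
    for review in corpus:
        review = re.sub('[^a-zA-Z]', ' ', review)  # replace every non-letter with a space
        words = review.lower().split()
        words = [stemmer.stem(word) for word in words if word not in stop_words]
        cleaned_corpus.append(' '.join(words))
    return cleaned_corpus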

In [5]:
corpus=text_cleaning(corpus)
corpus

['sky blue beauti',
 'love blue beauti sky',
 'quick brown fox jump lazi dog',
 'king breakfast sausag ham bacon egg toast bean',
 'love green egg ham sausag bacon',
 'brown fox quick blue dog lazi',
 'sky blue sky beauti today',
 'dog lazi brown fox quick']
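
The cleaned output contains `king` but no trace of the possessive `s`. A minimal sketch of only the regex and lowercasing steps (no stop-word removal or stemming, so it needs just the standard library) shows why: the apostrophe is replaced by a space, which splits `king's` into `king` and `s`. The stray `s` is presumably then removed by the stop-word filter, since it is absent from the output above.

```python
import re

sentence = "A king's breakfast has sausages, ham, bacon, eggs, toast, and beans"

# Same substitution as in text_cleaning: every non-letter becomes a space
letters_only = re.sub('[^a-zA-Z]', ' ', sentence)
tokens = letters_only.lower().split()

print(tokens)
```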

## Computing Term Frequency

In [6]:
tokenized_corpus = [sentence.split() for sentence in corpus]
tokenized_corpus

[['sky', 'blue', 'beauti'],
 ['love', 'blue', 'beauti', 'sky'],
 ['quick', 'brown', 'fox', 'jump', 'lazi', 'dog'],
 ['king', 'breakfast', 'sausag', 'ham', 'bacon', 'egg', 'toast', 'bean'],
 ['love', 'green', 'egg', 'ham', 'sausag', 'bacon'],
 ['brown', 'fox', 'quick', 'blue', 'dog', 'lazi'],
 ['sky', 'blue', 'sky', 'beauti', 'today'],
 ['dog', 'lazi', 'brown', 'fox', 'quick']]

In [7]:
for sentence in tokenized_corpus:
    print(Counter(sentence))

Counter({'sky': 1, 'blue': 1, 'beauti': 1})
Counter({'love': 1, 'blue': 1, 'beauti': 1, 'sky': 1})
Counter({'quick': 1, 'brown': 1, 'fox': 1, 'jump': 1, 'lazi': 1, 'dog': 1})
Counter({'king': 1, 'breakfast': 1, 'sausag': 1, 'ham': 1, 'bacon': 1, 'egg': 1, 'toast': 1, 'bean': 1})
Counter({'love': 1, 'green': 1, 'egg': 1, 'ham': 1, 'sausag': 1, 'bacon': 1})
Counter({'brown': 1, 'fox': 1, 'quick': 1, 'blue': 1, 'dog': 1, 'lazi': 1})
Counter({'sky': 2, 'blue': 1, 'beauti': 1, 'today': 1})
Counter({'dog': 1, 'lazi': 1, 'brown': 1, 'fox': 1, 'quick': 1})


In [8]:
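def compute_tf(doc):
    """Return each term's count divided by the number of tokens in doc."""
    counts = Counter(doc)
    total_terms = len(doc)  # assumes a non-empty document
    return {word: count / total_terms for word, count in counts.items()}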

In [9]:
tf_list = [compute_tf(doc) for doc in tokenized_corpus]
tf_list

[{'sky': 0.3333333333333333,
  'blue': 0.3333333333333333,
  'beauti': 0.3333333333333333},
 {'love': 0.25, 'blue': 0.25, 'beauti': 0.25, 'sky': 0.25},
 {'quick': 0.16666666666666666,
  'brown': 0.16666666666666666,
  'fox': 0.16666666666666666,
  'jump': 0.16666666666666666,
  'lazi': 0.16666666666666666,
  'dog': 0.16666666666666666},
 {'king': 0.125,
  'breakfast': 0.125,
  'sausag': 0.125,
  'ham': 0.125,
  'bacon': 0.125,
  'egg': 0.125,
  'toast': 0.125,
  'bean': 0.125},
 {'love': 0.16666666666666666,
  'green': 0.16666666666666666,
  'egg': 0.16666666666666666,
  'ham': 0.16666666666666666,
  'sausag': 0.16666666666666666,
  'bacon': 0.16666666666666666},
 {'brown': 0.16666666666666666,
  'fox': 0.16666666666666666,
  'quick': 0.16666666666666666,
  'blue': 0.16666666666666666,
  'dog': 0.16666666666666666,
  'lazi': 0.16666666666666666},
 {'sky': 0.4, 'blue': 0.2, 'beauti': 0.2, 'today': 0.2},
 {'dog': 0.2, 'lazi': 0.2, 'brown': 0.2, 'fox': 0.2, 'quick': 0.2}]
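
Because each count is divided by the document length, the TF values of a document should sum to 1. The sketch below checks this on the seventh document's tokens using a standalone copy of the TF computation. The name `term_frequencies` is hypothetical, chosen to avoid clashing with `compute_tf`.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Relative frequency of each token in a non-empty token list."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

doc = ['sky', 'blue', 'sky', 'beauti', 'today']
tf = term_frequencies(doc)

print(tf['sky'])                            # 'sky' is 2 of the 5 tokens
print(math.isclose(sum(tf.values()), 1.0))  # relative frequencies sum to 1
```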

## Computing IDF

In [10]:
all_words=set(word for doc in tokenized_corpus for word in doc)
all_words

{'bacon',
 'bean',
 'beauti',
 'blue',
 'breakfast',
 'brown',
 'dog',
 'egg',
 'fox',
 'green',
 'ham',
 'jump',
 'king',
 'lazi',
 'love',
 'quick',
 'sausag',
 'sky',
 'toast',
 'today'}

In [11]:
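def compute_idf(corpus):
    """Return idf(t) = log(N / (df(t) + 1)) for every term in a tokenized corpus."""
    n_docs = len(corpus)
    all_words = set(word for doc in corpus for word in doc)
    idf = {}
    for word in all_words:
        # Document frequency: number of documents that contain the word
        containing_docs = sum(1 for doc in corpus if word in doc)
        idf[word] = np.log(n_docs / (1 + containing_docs))
    return idf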

In [12]:
idf=compute_idf(tokenized_corpus)
idf

{'ham': 0.9808292530117262,
 'king': 1.3862943611198906,
 'today': 1.3862943611198906,
 'blue': 0.47000362924573563,
 'bacon': 0.9808292530117262,
 'toast': 1.3862943611198906,
 'beauti': 0.6931471805599453,
 'green': 1.3862943611198906,
 'dog': 0.6931471805599453,
 'sky': 0.6931471805599453,
 'egg': 0.9808292530117262,
 'breakfast': 1.3862943611198906,
 'brown': 0.6931471805599453,
 'sausag': 0.9808292530117262,
 'quick': 0.6931471805599453,
 'fox': 0.6931471805599453,
 'love': 0.9808292530117262,
 'jump': 1.3862943611198906,
 'lazi': 0.6931471805599453,
 'bean': 1.3862943611198906}
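
One property of this IDF variant is worth noting. Because of the `+1` in the denominator, a term that appears in all but one document gets an IDF of exactly 0, and a term that appears in every document gets a negative IDF. No term in this corpus is that common, so all values above are positive. The sketch below illustrates the effect for a hypothetical 8-document corpus and compares it with the smoothed form \( \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1 \), which is the formula scikit-learn documents for `TfidfTransformer` with `smooth_idf=True`. The function names are illustrative.

```python
import math

def idf_plus_one(n_docs, df):
    """IDF as used in this notebook: log(N / (df + 1))."""
    return math.log(n_docs / (df + 1))

def idf_smoothed(n_docs, df):
    """Smoothed variant: log((1 + N) / (1 + df)) + 1, at least 1 when df <= N."""
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 8
for df in (1, 4, 7, 8):
    print(df, round(idf_plus_one(n_docs, df), 4), round(idf_smoothed(n_docs, df), 4))
```

Which variant is preferable depends on whether zero or negative weights for very common terms are acceptable in the downstream task.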

## Computing TF-IDF

In [13]:
def compute_tfidf(tf, idf):
    """Multiply each term's TF in one document by its corpus-wide IDF."""
    return {word: tf_value * idf[word] for word, tf_value in tf.items()}
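
A common use of these scores is to rank a document's terms from most to least distinctive. The sketch below does this for the second document, with TF and IDF values copied (rounded) from the outputs above. The inline multiplication mirrors `compute_tfidf`, and the variable names are illustrative.

```python
# TF of the second document and the matching IDF values, rounded from the outputs above
tf_doc2 = {'love': 0.25, 'blue': 0.25, 'beauti': 0.25, 'sky': 0.25}
idf_subset = {'love': 0.9808, 'blue': 0.4700, 'beauti': 0.6931, 'sky': 0.6931}

scores = {word: tf_value * idf_subset[word] for word, tf_value in tf_doc2.items()}

# Sort terms from highest to lowest TF-IDF score
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for word, score in ranked:
    print(f"{word}: {score:.4f}")
```

Here `love` ranks first because it appears in only two documents, while `blue`, which appears in four, ranks last.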

In [14]:
# Build a list rather than a generator so the results can be iterated more than once
tfidf_list = [compute_tfidf(tf, idf) for tf in tf_list]

In [15]:
for i, tfidf in enumerate(tfidf_list):
    print(f"Document {i+1} TF-IDF:\n{tfidf}\n")

Document 1 TF-IDF:
{'sky': 0.23104906018664842, 'blue': 0.1566678764152452, 'beauti': 0.23104906018664842}

Document 2 TF-IDF:
{'love': 0.24520731325293155, 'blue': 0.11750090731143391, 'beauti': 0.17328679513998632, 'sky': 0.17328679513998632}

Document 3 TF-IDF:
{'quick': 0.11552453009332421, 'brown': 0.11552453009332421, 'fox': 0.11552453009332421, 'jump': 0.23104906018664842, 'lazi': 0.11552453009332421, 'dog': 0.11552453009332421}

Document 4 TF-IDF:
{'king': 0.17328679513998632, 'breakfast': 0.17328679513998632, 'sausag': 0.12260365662646577, 'ham': 0.12260365662646577, 'bacon': 0.12260365662646577, 'egg': 0.12260365662646577, 'toast': 0.17328679513998632, 'bean': 0.17328679513998632}

Document 5 TF-IDF:
{'love': 0.16347154216862103, 'green': 0.23104906018664842, 'egg': 0.16347154216862103, 'ham': 0.16347154216862103, 'sausag': 0.16347154216862103, 'bacon': 0.16347154216862103}

Document 6 TF-IDF:
{'brown': 0.11552453009332421, 'fox': 0.11552453009332421, 'quick': 0.1155245300933