# TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF scores how important a term is to a single document relative to the rest of a corpus. In chatbots and RAG systems, it is often used for:

- Intent classification

- Document retrieval

- Keyword extraction

📈 Use Cases of High TF-IDF Terms:

- Pick keywords for summaries

- Rank documents in search results

- Retrieve relevant context in RAG

- Feature selection for ML models

## TF (Term Frequency)


$$
TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
$$


In [200]:
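def compute_tf(term, doc):
    tokens = doc.lower().split()  # tokenize once on whitespace
    # An empty document has no tokens; return 0.0 rather than divide by zero.
    return tokens.count(term.lower()) / len(tokens) if tokens else 0.0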

In [201]:
term = "I"
doc = "I am kenji"

result = compute_tf(term, doc)

print(f'Term Frequency of {term} in\n"{doc}"\n{result}')

Term Frequency of I in
"I am kenji"
0.3333333333333333


## IDF (Inverse Document Frequency)
IDF reduces the weight of common words and increases the weight of **rare**, informative ones.

$$
IDF(t) = \log \left( \frac{N}{df(t)} \right)
$$

Where:
- $N$ is the total number of documents
- $df(t)$ is the number of documents containing term $t$
- $\log$ is the natural logarithm (as computed by `math.log`)

In [202]:
import math

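def compute_idf(term, docs, smooth=True):
    term = term.lower()
    # df(t): number of documents whose whitespace tokens include the term
    doc_count = sum(term in doc.lower().split() for doc in docs)
    if smooth:
        # Add 1 to numerator and denominator so an unseen term cannot divide by zero
        return math.log((1 + len(docs)) / (1 + doc_count))

    # Unsmoothed IDF is undefined when df(t) = 0; return 0 by convention here
    if doc_count == 0:
        return 0

    return math.log(len(docs) / doc_count)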


In [203]:
docs = [
    "I want to visit Japan for tourism. What kind of visa do I need?",
    "Can I apply for a working holiday visa if I'm from Australia?",
    "I have a Japanese spouse. How do I apply for a spouse visa?",
    "What documents are needed to get a short-term business visa for Japan?",
    "I’m attending a 2-week academic conference in Tokyo. Which visa is suitable?",
    "How long can I stay in Japan on a tourist visa?",
    "Do I need a visa if I’m transiting through Narita Airport for 5 hours?",
    "I'm planning to study in Japan for one year. What visa do I need?",
    "What is the difference between a temporary visitor visa and a multiple-entry visa?",
    "Can I convert a tourist visa into a student visa while in Japan?"
]

term = "fasddsaf"
result = compute_idf(term, docs, smooth=False)
result

0

## Smoothed IDF

$$
IDF(t) = \log \left( \frac{1+N}{1+df(t)} \right)
$$

✅ This ensures:

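- The denominator is never zero, even when no document contains the term

- The value stays finite, and a term that appears in every document gets an IDF of exactly 0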
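The two properties above can be checked with a small sketch. The helper names `idf_unsmoothed` and `idf_smoothed` are hypothetical and exist only for this illustration; they take the counts $N$ and $df(t)$ directly instead of a list of documents.

```python
import math

def idf_unsmoothed(n_docs, df):
    # Plain IDF: raises ZeroDivisionError when df is 0
    return math.log(n_docs / df)

def idf_smoothed(n_docs, df):
    # Add-one smoothing on both counts keeps the ratio defined for df = 0
    return math.log((1 + n_docs) / (1 + df))

n_docs = 10
for df in (0, 1, 5, 10):
    try:
        raw = idf_unsmoothed(n_docs, df)
    except ZeroDivisionError:
        raw = None  # undefined without smoothing
    print(f"df={df:2d}  unsmoothed={raw}  smoothed={idf_smoothed(n_docs, df):.4f}")
```

With `df = 0` the unsmoothed version fails, while the smoothed version returns `log(11)`, roughly 2.3979.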

In [204]:
result = compute_idf("2-wesfdek", docs, smooth=True)
result

2.3978952727983707

## Unsmoothed vs Smoothed IDF

In [205]:
term = "japanese"
idf_result = compute_idf(term, docs, smooth=True)
unsmoothed_idf_result = compute_idf(term, docs, smooth=False)

print(f"normal: {unsmoothed_idf_result}")
print(f"smoothed: {idf_result}")

normal: 2.302585092994046
smoothed: 1.7047480922384253


## TF-IDF (Single Term)

$$
TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)
$$
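As a worked instance, take the term `japanese` in the third document of the corpus defined earlier ("I have a Japanese spouse. How do I apply for a spouse visa?"). Assuming whitespace tokenization, that document has 13 tokens, one of which is `japanese`, and 1 of the 10 documents contains the term. With the smoothed IDF:

$$
TF\text{-}IDF(\text{japanese}, d_3) = \frac{1}{13} \times \log \left( \frac{1 + 10}{1 + 1} \right) \approx 0.0769 \times 1.7047 \approx 0.1311
$$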


In [206]:
def compute_tfidf(term, doc, docs, smooth=True):
    return compute_tf(term, doc) * compute_idf(term, docs, smooth=smooth)

In [207]:
term = "japanese"
target_doc = docs[2]

result = compute_tfidf(term, target_doc, docs, smooth=True)

print(target_doc)
print("tf-idf")
print(f"{term}: {result}")

I have a Japanese spouse. How do I apply for a spouse visa?
tf-idf
japanese: 0.13113446863372502


## TF-IDF (Matrix)

In [208]:
def get_vocabulary(docs):
    tokenized_docs = [doc.lower().split() for doc in docs]
    return sorted(set(word.lower() for doc in tokenized_docs for word in doc))

In [209]:
def compute_tfidf_matrix(docs):
    tfidf_matrix = []
    vocabulary = get_vocabulary(docs)
    for doc in docs:
        tfidf_vector = []
        for term in vocabulary:
            tf = compute_tf(term, doc)
            idf = compute_idf(term, docs, smooth=True)
            tfidf = tf * idf
            tfidf_vector.append(tfidf)
        tfidf_matrix.append(tfidf_vector)
    
    return tfidf_matrix
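The function above recomputes the IDF of every term once per document, which rescans the whole corpus each time. A minimal sketch of one alternative, assuming the same whitespace tokenization and smoothed IDF, counts document frequencies once and reuses them. The name `compute_tfidf_matrix_fast` is hypothetical and not part of the cells above.

```python
import math
from collections import Counter

def compute_tfidf_matrix_fast(docs):
    # Tokenize each document once (lowercase, split on whitespace)
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: count each term at most once per document
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, computed once per term
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) for t in vocabulary}

    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        # Empty documents get an all-zero row instead of dividing by zero
        matrix.append([counts[t] / total * idf[t] if total else 0.0 for t in vocabulary])
    return vocabulary, matrix

vocab, matrix = compute_tfidf_matrix_fast(["a b", "a c"])
print(vocab)
print(matrix[0])
```

Returning the vocabulary together with the matrix keeps the column order and the scores in sync, at the cost of a slightly different return type.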

The following cell picks the top keyword for each document: the term with the highest TF-IDF score in that document's row of the matrix.

In [210]:
tfidf_result = compute_tfidf_matrix(docs)
vocabulary = get_vocabulary(docs)

for i, row in enumerate(tfidf_result):
    print(f"\nDocument {i+1}")
    # Pair each vocabulary term with its score and keep the highest-scoring pair
    top_term, top_score = max(zip(vocabulary, row), key=lambda pair: pair[1])
    print(f'{top_term}: {top_score}')


Document 1
kind: 0.12176772087417323

Document 2
australia?: 0.14206234101986875

Document 3
have: 0.13113446863372502

Document 4
are: 0.14206234101986875

Document 5
2-week: 0.14206234101986875

Document 6
long: 0.15497709929440232

Document 7
5: 0.12176772087417323

Document 8
one: 0.12176772087417323

Document 9
and: 0.13113446863372502

Document 10
convert: 0.13113446863372502


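## Learned

1. Text Cleaning: whitespace splitting leaves punctuation attached to words (`australia?`), and uncleaned function words such as `and`, `are`, and `have` can end up as the top keyword.
2. Tokenization: how the text is split determines the vocabulary, and therefore every TF and IDF value.
3. Additive Smoothing: adding 1 to both counts keeps IDF defined for terms that appear in no document.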
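The text cleaning and tokenization points can be sketched with a small regex tokenizer. `tokenize` is a hypothetical helper, and the pattern is one assumption among many reasonable choices: it keeps runs of ASCII letters and digits, and allows apostrophes or hyphens inside a token so that `i'm` and `2-week` survive intact.

```python
import re

# Letters/digits, optionally joined by an apostrophe (straight or curly) or hyphen
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:['’-][a-z0-9]+)*")

def tokenize(text):
    # Lowercase first, then drop surrounding punctuation by matching tokens only
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("Can I apply for a working holiday visa if I'm from Australia?"))
# ['can', 'i', 'apply', 'for', 'a', 'working', 'holiday', 'visa', 'if', "i'm", 'from', 'australia']
```

With a tokenizer like this, `australia?` and `australia` count as the same term, which changes both the vocabulary and the document frequencies.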

# Using Scikit-learn

In [None]:
!pip install scikit-learn


In [218]:
from sklearn.feature_extraction.text import TfidfVectorizer

# 📝 Sample corpus
documents = [
    "Japan visa requirements for Filipino travelers.",
    "How to apply for a Japan tourist visa?",
    "Japan embassy visa checklist and required documents.",
    "Schengen visa versus Japan visa: what's the difference?",
    "Best travel agencies that assist with Japan visa application."
]

# 🛠️ Create the vectorizer
vectorizer = TfidfVectorizer()

# 📊 Compute TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)

# 📚 Get feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

# 🔍 Show results
for doc_idx, doc_vector in enumerate(tfidf_matrix.toarray()):
    print(f"\n📄 Document {doc_idx + 1}")
    keyword_scores = list(zip(feature_names, doc_vector))
    top_keyword = max(keyword_scores, key=lambda x: x[1])
    print(f"Top keyword: '{top_keyword[0]}' (score: {top_keyword[1]:.4f})")



📄 Document 1
Top keyword: 'filipino' (score: 0.4936)

📄 Document 2
Top keyword: 'apply' (score: 0.4426)

📄 Document 3
Top keyword: 'and' (score: 0.4282)

📄 Document 4
Top keyword: 'difference' (score: 0.4037)

📄 Document 5
Top keyword: 'agencies' (score: 0.3663)
