Mathematical Understanding of TF-IDF

TF-IDF (Term Frequency – Inverse Document Frequency) is a numerical representation of text used to measure how important a word is in a document relative to the entire corpus.

The formula is:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right)$$

Where:

TF(t,d) = number of times term t appears in document d

DF(t) = number of documents that contain term t

N = total number of documents in the corpus

Interpretation

Term Frequency (TF) measures how often a word occurs within a specific document.

Inverse Document Frequency (IDF) reduces the weight of very common words and increases the weight of rare words.

If a word appears in many documents, it is less informative → lower score.
If a word appears frequently in one document but rarely in others, it is more informative → higher score.
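
To make the formula concrete, here is a minimal sketch that applies it by hand to a tiny corpus. The three sentences are made-up illustration data, and the sketch assumes raw counts for TF, whitespace tokenization, and the natural logarithm.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data, for illustration only)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

tokenized = [doc.split() for doc in corpus]
N = len(tokenized)  # total number of documents

# DF(t): number of documents that contain term t
df = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(term, tokens):
    """TF(t, d) * log(N / DF(t)) with raw counts and the natural log."""
    tf = tokens.count(term)
    return tf * math.log(N / df[term])


# Score every distinct term of the first document
scores = {term: tf_idf(term, tokenized[0]) for term in set(tokenized[0])}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  {score:.3f}")
```

In this toy corpus "the" occurs in every document, so log(N/DF) is log(1) = 0 and its score is zero even though it is the most frequent word in the document. "mat" occurs in only one document and gets the highest score.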

![Diagram](images/tf-idf.png)

TF-IDF Implementation

1. Required Libraries
We use scikit-learn's TF-IDF implementation, TfidfVectorizer.

In [3]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from tqdm import tqdm
import json

In [5]:
# Load the documents and the training queries

with open("dataset/docs.json", "r", encoding="utf-8") as doc, open("dataset/queries_train.json", "r", encoding="utf-8") as query:

    documents = json.load(doc)
    queries = json.load(query)

df_doc = pd.DataFrame(documents)
df_query = pd.DataFrame(queries)

2. TF-IDF Pipeline
The goal:
For each query → find the most similar documents.

The pipeline has 7 steps:

1 Prepare the text data → 2 Create the TF-IDF Vectorizer → 3 Fit on Documents → 4 Transform Queries → 5 Compute Similarity → 6 Retrieve Top-k Documents → 7 Build Results DataFrame

Step 1 — Prepare the text data

We need two things:

1. a list of document texts
2. a list of query texts

In [6]:
doc_texts = df_doc["text"].tolist()
query_texts = df_query["text"].tolist()

Step 2 — Create the TF-IDF Vectorizer

TF-IDF converts text into numerical vectors.

In [8]:
vectorizer = TfidfVectorizer(
    lowercase=True,           # convert all text to lowercase
    stop_words="english",     # remove common English stop words
    max_df=0.95,              # ignore terms that appear in more than 95% of documents
    min_df=2,                 # ignore terms that appear in fewer than 2 documents
    ngram_range=(1, 1)        # use single words only
)
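
To see what these filtering parameters do to the vocabulary, the sketch below fits two vectorizers on a made-up four-sentence corpus: one with default filtering and one with the same stop_words, max_df, and min_df settings as above. The sentences and variable names are hypothetical and only serve the illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical data, for illustration only)
toy_docs = [
    "the engine failed during the test flight",
    "the engine passed the second test",
    "the pilot reported an engine vibration",
    "the test pilot landed safely",
]

# No filtering beyond the default tokenization
unfiltered = TfidfVectorizer(lowercase=True)

# Same filtering settings as the vectorizer defined above
filtered = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    max_df=0.95,
    min_df=2,
)

unfiltered.fit(toy_docs)
filtered.fit(toy_docs)

print("unfiltered:", sorted(unfiltered.vocabulary_))
print("filtered:  ", sorted(filtered.vocabulary_))
```

On this toy corpus only the terms that occur in at least two documents and are not stop words survive, which is why the filtered vocabulary is much smaller. On a real corpus, min_df=2 mainly removes typos and one-off tokens.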

Step 3 — Fit on Documents

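We learn the vocabulary and the IDF weights from the documents, then convert the documents to TF-IDF vectors in a single call.

This creates a sparse matrix of shape (number_of_documents, vocabulary_size).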

In [9]:
doc_tfidf = vectorizer.fit_transform(doc_texts)

Step 4 — Transform Queries

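We reuse the vocabulary and IDF weights learned from the documents, so queries and documents live in the same vector space. Query words that are not in that vocabulary are ignored.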

In [10]:
query_tfidf = vectorizer.transform(query_texts)
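
A small sketch of what transform does with words the vectorizer has never seen, using made-up documents and queries. The first query shares one word with the documents; the second shares none.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (hypothetical, for illustration only)
toy_docs = ["solar panels convert sunlight", "wind turbines convert wind"]
toy_queries = ["solar energy", "quantum computing"]

toy_vectorizer = TfidfVectorizer()
toy_doc_matrix = toy_vectorizer.fit_transform(toy_docs)    # learns the vocabulary
toy_query_matrix = toy_vectorizer.transform(toy_queries)   # reuses it, learns nothing

# Both matrices have the same number of columns (the vocabulary size)
print(toy_doc_matrix.shape, toy_query_matrix.shape)

# Number of non-zero weights per query: unseen words contribute nothing
print(toy_query_matrix.getnnz(axis=1))
```

A query made up entirely of unseen words becomes an all-zero vector, so its similarity to every document is 0. This is one reason the similarity matrix in the next step contains so many zeros.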

Step 5 — Compute Similarity

We compare each query to all documents using cosine similarity.

Entry (i, j) of the resulting matrix is the similarity score between query i and document j.

Shape: (number_of_queries, number_of_documents)

In [None]:
similarity_matrix = cosine_similarity(query_tfidf, doc_tfidf)
print(similarity_matrix)

[[0.03640876 0.00597895 0.         ... 0.         0.         0.        ]
 [0.         0.00742215 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
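
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a plain NumPy version on small dense arrays, not the scikit-learn implementation; the function name cosine_sim and the toy vectors are illustrative.

```python
import numpy as np


def cosine_sim(queries, docs):
    """Pairwise cosine similarity between the rows of two dense 2-D arrays."""
    q_norm = np.linalg.norm(queries, axis=1, keepdims=True)
    d_norm = np.linalg.norm(docs, axis=1, keepdims=True)
    # Guard against division by zero for all-zero vectors
    q_unit = queries / np.where(q_norm == 0, 1.0, q_norm)
    d_unit = docs / np.where(d_norm == 0, 1.0, d_norm)
    return q_unit @ d_unit.T


# Toy vectors (hypothetical, for illustration only)
toy_queries = np.array([[1.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
toy_docs = np.array([[2.0, 0.0, 2.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]])

sims = cosine_sim(toy_queries, toy_docs)
print(sims.shape)  # (number_of_queries, number_of_documents)
print(sims.round(3))
```

Two vectors that point in the same direction score 1 regardless of their length, and two vectors with no non-zero dimension in common score 0. With TF-IDF vectors, a score of 0 means the query and the document share no vocabulary term.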


Step 6 — Retrieve Top-k Documents

For each query, we keep the 5 documents with the highest similarity scores.

In [17]:
k = 5

# Sort each row by descending similarity and keep the first k column indices
top_k_indices = np.argsort(-similarity_matrix, axis=1)[:, :k]
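
Sorting every row in full is fine for small collections. When there are many documents, a common alternative is to select the k best candidates with np.argpartition and sort only those. The helper top_k below is a sketch of that idea on a made-up similarity matrix, not part of the original notebook.

```python
import numpy as np


def top_k(similarity, k):
    """Return the column indices of the k largest scores per row, best first."""
    k = min(k, similarity.shape[1])
    # Unordered top-k candidates per row (no full sort)
    candidates = np.argpartition(-similarity, k - 1, axis=1)[:, :k]
    # Sort only those k candidates by descending score
    candidate_scores = np.take_along_axis(similarity, candidates, axis=1)
    order = np.argsort(-candidate_scores, axis=1)
    return np.take_along_axis(candidates, order, axis=1)


# Toy similarity matrix (hypothetical, for illustration only)
toy_similarity = np.array([
    [0.1, 0.9, 0.3, 0.7],
    [0.5, 0.2, 0.8, 0.0],
])

print(top_k(toy_similarity, k=2))
```

When scores are tied, the order among the tied documents may differ from the full argsort, which matters for rows that are mostly zeros.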

Step 7 — Build Results DataFrame

In [18]:
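# Map each query id to the ids of its top-k documents, best match first
tfidf_results = {}

for i, query_id in enumerate(df_query["id"]):
    # top_k_indices[i] holds row positions in df_doc, so use positional indexing
    retrieved_docs = df_doc.iloc[top_k_indices[i]]["id"].tolist()
    tfidf_results[query_id] = retrieved_docs

# One column per query id, one row per rank (0 = most similar document)
result = pd.DataFrame(tfidf_results)
result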

Unnamed: 0,961c4349-8cf1-4ef1-89cc-24d20bb9d000_67878,4008ed78-e66e-4d89-9c3b-c79bd1cf6fc9_366,d5a95b09-e8ea-44dd-993d-347ed418e1f1_15138,3e66798a-b7fd-41b5-8bc0-33b3d7ce2aca_177487,f5f944d2-277a-481d-ab09-612890402ded_137489,da11c342-a2b4-442b-a95a-b2c26f78bf7a_194764,980c09bc-14a0-4832-ad0b-ac1a79ca51cd_55670,e655782b-869d-4d80-9513-fda43a2bbf84_54069,cc1235a9-a603-4891-8474-6cc0a860f159_84759,534caa94-be3a-4a61-b39e-5319f0f9425c_28551,...,cbd9045d-9abb-4103-9b45-64678a9b9262_185783,553912ab-4932-4d42-a047-b68aa507ee73_104362,1ed89f09-9d35-4668-8e05-7ffa10d0bd6c_206755,f73b0a9e-63c1-44cd-b36f-b41b696d4131_37581,46cddbe6-646c-4845-a699-7cdfe3b21721_65261,6f98ccbb-db6f-4646-ac1e-08d8d8bda71a_247103,345e8385-635c-44ab-abd1-f9fbcccaf774_159755,c38ac583-a824-46d9-ad00-105571c0c8fa_120087,3ed0fa3e-af7c-40b2-b44e-bf16de45051e_7772,7878104c-6dfa-42fe-badd-8fc53b2314ac_163896
0,6260dea3-8a5e-4e28-8c8e-9340f9352887_23121,010468c5-dd65-4827-bcb3-1f9669de84ec_144757,a1a1b7b9-48b4-4400-898f-6b1e1f310602_20625,68427169-a1c7-4b4c-b728-075cede75074_154217,77294dfe-cbbb-4802-99ef-b86c2bd637e9_137264,7c394108-1df1-46a5-910f-b4e8e17bb762_55469,7cc50259-6520-4be8-bdf3-1abfa4af9e3f_171352,8be3ce66-f518-4ad8-bb7b-87d755340368_7224,c96c87f4-5a5c-4ef3-b903-5d3018cbb3e5_105733,7b8283aa-a41e-41f9-bf7d-901ce326ae97_4418,...,ce2e8668-73a6-49a6-94f5-35a4138430dd_100571,72f99646-8469-4453-b760-26a84bd0ffd3_38593,28332c8d-5d96-4877-8429-d602a3735c30_20950,417ba01a-dc1e-42f4-b554-9894a92b152b_65625,10c63d70-c6f1-4d26-b447-2df6ac6bb834_14726,1dc6a087-d138-4401-ac10-32f50fd9859f_147083,fa87df58-bd54-4bbd-a5c1-c3fec8d187dc_151745,d5bf8830-071c-4cc1-ac63-d2439d2dd3aa_147583,0047de08-4486-4ff3-a3c2-01b2642d9a05_13746,24c60ad1-50da-4d97-999e-2500646dae63_161734
1,3dc818a3-141d-4352-8e54-db21dc85861e_67816,012ba6ce-e657-4285-b29b-4ad830598ef2_70387,a315e0ba-3022-43ef-90e6-120d478b5a39_124624,20328019-1e5c-4177-ab5c-4634bb992f14_51847,87364de1-e17d-4fbe-90ea-1be9b1b5e878_122467,90b78e53-74e2-43fc-bac2-3b8bf7be21de_183576,8d77f4bd-b169-414d-bdb8-c55d45628f8c_112015,cf97207f-058a-46b5-9db4-0fe8812df73f_2207,930ff7cb-17f4-4719-91c7-5bd908fac8fa_6294,221ad8c9-f925-451a-8c17-e48f32219f74_37735,...,7c5351dc-f0df-4b70-be91-6bc64c3c10ec_162660,331c55c7-0454-4063-a42c-e16138775f87_35895,246b9bb6-1f0d-434f-841e-a93699d8b440_102018,754f6f19-0e38-4ab7-a3f4-dc9da509b3f9_43869,d4faf0c2-2a43-40ff-b7ab-4aec42b4c7a8_3359,6a0665b7-728c-4409-a18b-a33906317ded_150426,44b6c151-9c6b-4199-ac62-17038a44df35_164717,f6708161-da4f-488a-98cb-5932cbd01a01_138608,dae27d32-4238-402c-b7ae-90525ac6b83a_45956,018c3ff0-0279-4589-bbb7-2c45b86039c6_159388
2,cbac6e51-ea52-4824-be99-93db1abbe35a_62218,a4aeeb9f-dcef-4cf0-9e01-1ef6a86a2805_50244,41af90fd-80b5-44e3-8ff9-4354855d6751_46905,cb2e75fe-6610-44d9-b35f-f88472e4ab36_79346,e2b70c43-f3dc-478b-8dac-9c989b43ae2d_37170,f0da5ab1-7012-4ea0-985b-88f69087d76a_85038,fb808b6a-9908-4c52-a37e-9dad00307a63_125820,c4094b97-6e2b-4c3d-a194-87b3e2e4378d_6009,c1a859b0-cf20-4dca-a0d3-005fc3e85656_129607,0064924c-4dc4-4a79-9c96-420bcdb3e889_37604,...,b54c0dd2-a75a-475a-a2f4-f4108efe4302_67248,22a209e4-1a3f-4115-af5e-01d2ffda42c2_92008,d468f97c-a03a-44e9-a110-93962bccca95_189582,9804d800-5f43-4533-9ad3-f050809f058d_43995,e2c22a56-2863-42b5-80a1-057a795a61a3_26204,64b29cba-fae3-4cef-8930-8dbbab76503c_688,754f6f19-0e38-4ab7-a3f4-dc9da509b3f9_43869,e753f7a7-21c0-421a-b75f-1b215a043a73_174939,5afba39e-0cec-4a99-8387-1d5215e6db3e_2533,f1cbc4a0-cbd9-4845-9940-74519139a535_125199
3,020acc87-e0cb-43bc-a38a-b6b2872a154c_7289,fb4558fa-e0db-4dc6-8bef-28905c358520_55298,808e05a0-9afc-41e2-9b9c-b8d941cf25dd_188253,2c874fe5-0af8-48cf-87cb-f9b1bcead0e7_162016,72e40f5d-9c31-4bc6-ad44-b10008b6bb1d_122466,ee1954ad-8811-4d7f-9e3a-f3b7dc06a14e_176297,a8d1bfd6-1c12-4a4a-b168-714a1153ca58_50637,486ea47a-a220-47cd-afa0-f1fdd80d20ed_21042,1b875ffb-ad1b-454e-9c9b-ca971a6af665_48375,6a4d16fc-d7b2-4769-b48d-2b09c596505d_26621,...,20e09e28-2a1c-46c0-a1c4-85553c8739ae_131184,b4b3885a-e5ab-40b5-a3b0-9efb57dfd2f0_49253,1a403783-c370-4fc9-a57f-a80113e78012_532,e5a97952-1562-4722-ac99-8d32af1432a3_113632,e19c1283-2d75-4e29-ae0a-76a7140e95eb_57397,b3e9cac4-2510-4198-9732-0230e24bf2da_18308,417ba01a-dc1e-42f4-b554-9894a92b152b_65625,2a035630-a1e7-4059-baa6-e933db5f7e7a_116186,fb560889-9347-4c9e-9882-9d486d60a2cb_103487,31352231-ffe5-41b5-bd0c-e968e4ee1c6a_72031
4,a1fdd2bb-79eb-4dc5-8671-afca8cd3dac3_44409,3dd83b8c-78c2-4751-915f-9300f9d4889c_19675,3eb11903-de7d-4e1f-853e-132fcc74821d_57991,7380755d-856c-4dfb-b8d4-7ca59a739f85_198271,3dc3767d-b121-48e5-bbd4-7d7dd39909d0_61608,107a0871-e25f-490c-ad4a-bc78e82db738_85532,43a8c810-23fa-43b8-a8d0-3bf57ab29e97_9550,58bc0e12-7dd4-41f3-a42a-cfd055a885de_44033,d26a5daf-18a6-4e3c-8219-35a9d83c1791_112268,c2317a0e-d362-40ef-8cf6-7695bf9a40d7_59548,...,5298378c-39cb-4f1e-8eef-a3e10469f325_24914,981f0409-40db-4f21-878d-b1fbbb9ebd3c_43995,0a65376c-29eb-418d-a1d8-e2926797a646_183899,6126ed4f-af47-4523-abde-f3ac36b9ba3e_193487,bc15e703-d1c1-4369-8f11-c1ed4c7cecf7_81696,de92785f-cb2a-4f3d-976e-b33640c06572_106521,e5a97952-1562-4722-ac99-8d32af1432a3_113632,2ec01b01-8f25-430b-98ec-70a47707380e_94415,9928b762-5709-4347-935b-50462f58f63f_14728,2d54b3cf-48d7-4b54-a084-34f689ffb34f_35296
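
The wide table above has one column per query, which is hard to read and drops the similarity scores. A long format with one row per (query, rank) pair is often easier to inspect or evaluate. The sketch below builds such a table from small made-up stand-ins for df_query["id"], df_doc["id"], and similarity_matrix; the column names are an assumption, not a required schema.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df_query["id"], df_doc["id"], and similarity_matrix
toy_query_ids = ["q1", "q2"]
toy_doc_ids = np.array(["d1", "d2", "d3"])
toy_similarity = np.array([
    [0.2, 0.0, 0.6],
    [0.0, 0.4, 0.1],
])

k = 2
toy_top_k = np.argsort(-toy_similarity, axis=1)[:, :k]

rows = []
for i, query_id in enumerate(toy_query_ids):
    for rank, doc_idx in enumerate(toy_top_k[i], start=1):
        rows.append({
            "query_id": query_id,
            "rank": rank,                            # 1 = most similar document
            "doc_id": toy_doc_ids[doc_idx],
            "score": toy_similarity[i, doc_idx],     # cosine similarity
        })

long_results = pd.DataFrame(rows)
print(long_results)
```

Keeping the score next to each document id also makes it easy to spot queries whose best match has a similarity of 0, meaning nothing relevant was actually retrieved.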
