Mathematical Understanding of BM25

BM25 (Best Matching 25) is a ranking function used in Information Retrieval to estimate the relevance of a document to a query.

The formula is:

![BM25 formula](images/BM25.png)

Where:

TF(t,d) = number of times term t appears in document d

|d| = length of document d (number of words)

avgdl = average document length in the corpus

k₁ = term frequency scaling parameter (usually ≈ 1.2–2.0)

b = length normalization parameter (usually ≈ 0.75)

IDF(t) = inverse document frequency of term t
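To make the role of each symbol concrete, here is a minimal from-scratch sketch of the scoring in plain Python. It assumes the common textbook form of BM25 and one particular IDF variant (with `+1` inside the logarithm); libraries differ in the exact IDF they use, so the numbers need not match `rank_bm25`. The function name `bm25_score` and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query (textbook BM25)."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs      # average document length
    tf = Counter(doc_tokens)                          # TF(t, d) for every term in d
    score = 0.0
    for term in query_tokens:
        if tf[term] == 0:
            continue                                  # absent terms contribute nothing
        df = sum(1 for d in corpus if term in d)      # number of documents containing the term
        # Assumed IDF variant: the +1 keeps the value positive for very common terms
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Length normalization: longer-than-average documents are penalized
        norm = 1.0 - b + b * len(doc_tokens) / avgdl
        score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * norm)
    return score

# Toy corpus, already tokenized
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["birds", "sing", "at", "dawn"],
]
query = ["cat", "mat"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
```

The first document matches both query terms and scores highest, the second matches only "cat", and the third matches nothing and scores zero.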

BM25 Implementation

The goal: for each query, rank the documents by relevance score.

BM25 is a probabilistic ranking model.

The pipeline has 6 steps:

1. Prepare the Text Data → 2. Build the BM25 Model → 3. Compute Scores for Each Query → 4. Rank Documents → 5. Retrieve Document IDs → 6. Apply to All Queries

In [2]:
# Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from rank_bm25 import BM25Okapi
import json

In [3]:
# Load the documents and the training queries from their JSON files
with open("dataset/docs.json", "r", encoding="utf-8") as docs_file, \
        open("dataset/queries_train.json", "r", encoding="utf-8") as queries_file:
    documents = json.load(docs_file)
    queries = json.load(queries_file)

df_doc = pd.DataFrame(documents)
df_query = pd.DataFrame(queries)

1. Prepare the Text Data

BM25 requires tokenized text (split into words).

So first, convert documents and queries into lists of tokens.

In [4]:
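# Pull the raw text columns out of the DataFrames
doc_texts = df_doc["text"].tolist()
query_texts = df_query["text"].tolist()

# Whitespace tokenization: case and punctuation are left untouched
tokenized_docs = [doc.split() for doc in doc_texts]
tokenized_queries = [query.split() for query in query_texts]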
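Splitting on whitespace keeps case and punctuation, so "BM25?" and "bm25" become different tokens and never match. As a hedged alternative, the sketch below lowercases the text and keeps only runs of word characters. The `tokenize` helper and the sample sentence are illustrative assumptions, not part of the original pipeline.

```python
import re

TOKEN_PATTERN = re.compile(r"\w+")

def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return TOKEN_PATTERN.findall(text.lower())

sample = "What is BM25? BM25 ranks documents."
whitespace_tokens = sample.split()     # punctuation stays attached to the words
normalized_tokens = tokenize(sample)   # lowercase, punctuation dropped
```

Whichever tokenizer is chosen, it has to be applied identically to the documents and the queries. Otherwise the query terms will not match the indexed terms.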

2. Build the BM25 Model

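We use the BM25Okapi implementation from the rank-bm25 package. The constructor indexes the tokenized corpus; when k1 and b are not passed, it uses its defaults (1.5 and 0.75).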

In [5]:
bm25 = BM25Okapi(tokenized_docs)

3. Compute Scores for Each Query

In [7]:
# One BM25 score per document, in the same order as tokenized_docs
scores = bm25.get_scores(tokenized_queries[0])

4. Rank Documents

Sort the scores, take the highest k, and reverse the order so the highest score comes first.

In [8]:
k = 5
# argsort is ascending: keep the last k indices, then reverse them
top_k_indices = np.argsort(scores)[-k:][::-1]
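To see what each slice in that expression does, here is a plain-Python sketch of the same sort, slice, and reverse logic on a toy score list. The function `top_k_indices` and the scores are made up for illustration, and tie-breaking is not guaranteed to match NumPy.

```python
def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    # Indices sorted by ascending score, which is the role argsort plays
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Keep the last k indices (the largest scores), then put the best first
    return order[-k:][::-1]

toy_scores = [0.2, 3.1, 0.0, 1.7, 2.4]
best = top_k_indices(toy_scores, k=3)   # indices of 3.1, 2.4, 1.7
```

One caveat applies to both this sketch and the NumPy version: with k = 0 the slice `[-0:]` selects every index rather than none, so k should be at least 1.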

5. Retrieve Document IDs

In [9]:
# Map positional indices back to document IDs
retrieved_docs = df_doc.iloc[top_k_indices]["id"].tolist()

6. Apply to All Queries

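Loop over all queries and store the top-k document IDs under each query ID:

In [10]:
bm25_results = {}

for i, query_id in enumerate(df_query["id"]):
    # Score every document against the i-th query
    scores = bm25.get_scores(tokenized_queries[i])
    # Positions of the k best-scoring documents, best first
    top_k = np.argsort(scores)[-k:][::-1]
    # Translate positions into document IDs
    retrieved_docs = df_doc.iloc[top_k]["id"].tolist()
    bm25_results[query_id] = retrieved_docs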

In [12]:
# Columns are query IDs; row i holds the document ranked at position i
result = pd.DataFrame(bm25_results)
result

Unnamed: 0,961c4349-8cf1-4ef1-89cc-24d20bb9d000_67878,4008ed78-e66e-4d89-9c3b-c79bd1cf6fc9_366,d5a95b09-e8ea-44dd-993d-347ed418e1f1_15138,3e66798a-b7fd-41b5-8bc0-33b3d7ce2aca_177487,f5f944d2-277a-481d-ab09-612890402ded_137489,da11c342-a2b4-442b-a95a-b2c26f78bf7a_194764,980c09bc-14a0-4832-ad0b-ac1a79ca51cd_55670,e655782b-869d-4d80-9513-fda43a2bbf84_54069,cc1235a9-a603-4891-8474-6cc0a860f159_84759,534caa94-be3a-4a61-b39e-5319f0f9425c_28551,...,cbd9045d-9abb-4103-9b45-64678a9b9262_185783,553912ab-4932-4d42-a047-b68aa507ee73_104362,1ed89f09-9d35-4668-8e05-7ffa10d0bd6c_206755,f73b0a9e-63c1-44cd-b36f-b41b696d4131_37581,46cddbe6-646c-4845-a699-7cdfe3b21721_65261,6f98ccbb-db6f-4646-ac1e-08d8d8bda71a_247103,345e8385-635c-44ab-abd1-f9fbcccaf774_159755,c38ac583-a824-46d9-ad00-105571c0c8fa_120087,3ed0fa3e-af7c-40b2-b44e-bf16de45051e_7772,7878104c-6dfa-42fe-badd-8fc53b2314ac_163896
0,bddbff82-ecf7-4d78-97fc-e2b16c3639f8_13568,7600894f-be57-4d6f-9e14-b3994c51b4fe_85789,a1a1b7b9-48b4-4400-898f-6b1e1f310602_20625,902438c5-edc4-4175-bf47-faf5d1a30616_60478,1e2477c4-2e9e-40f9-ad25-5d5423874f88_2495,e76b7d5d-c8a8-4fce-80b2-c97b3fcf14e8_9841,2224e893-9c1a-458c-b68e-1db0f6b110e8_18664,6017f1fb-ed05-47d9-b2c5-a8b6d317c72d_13644,c96c87f4-5a5c-4ef3-b903-5d3018cbb3e5_105733,6a4d16fc-d7b2-4769-b48d-2b09c596505d_26621,...,ad0f007a-2cd5-4787-a064-0beb8dcb2283_126176,72f99646-8469-4453-b760-26a84bd0ffd3_38593,d2602cc7-1fd2-4dde-84a9-ccc42fff19a9_116371,111988b6-a2fb-4369-9634-d46981024e24_138940,6b5a751c-908e-4a1d-9324-b32f7f18bbc2_132336,0ee05b38-fdea-4e50-b242-0b6d536569da_106013,70820a39-17a6-4a17-b800-3be56dd2fe38_41509,db060367-7420-4ca1-b38e-b7a0c232c481_80701,f79eb400-3225-45fd-bec7-ecef8f8bdf4f_131737,31352231-ffe5-41b5-bd0c-e968e4ee1c6a_72031
1,9d76000e-bc1a-4f04-bc3b-c182abff1f7d_144282,7782b936-b9ae-4761-95a3-b83c771f395a_36567,7ec3ae55-6bd1-4289-8183-3f71b2b6862b_150748,5772b056-0bf0-4b88-b896-9f141babcfc5_71560,87364de1-e17d-4fbe-90ea-1be9b1b5e878_122467,bfa7bca5-8b05-4cb1-a451-a04a6d937c2a_105972,d24e43b9-a2b4-4215-836e-612a1a36e626_151802,9b9c36fb-b4b4-4532-932f-d170529263e6_181546,ee9b42b1-51ba-42ea-a931-f2bb95bcd735_166875,50878f6c-4250-4385-81ee-3ca51f330cec_41706,...,0b563e4d-6f14-46bb-bf01-d27e7be5bd88_181142,76c87632-d441-451e-879f-0e24fe2ab4ee_42976,9f170f33-eb2e-4f46-83b7-99d26eee5930_121547,8c44daa7-ff26-4ecc-8a48-3912d842262a_16868,d4faf0c2-2a43-40ff-b7ab-4aec42b4c7a8_3359,afa2ea26-308b-4c8b-9843-2dc6823702c8_90163,e99925a1-42fe-4ab3-9a76-b32c5789f3e6_153226,cc77393c-ac1c-4b2d-8e8d-addb200cb1aa_45081,6cd3c5ac-4e31-4d19-a395-7d14d6c86847_132328,849eb92b-1401-430a-86ef-9fdfb09e2da1_67598
2,5d8c2807-033f-4ee2-b396-3fe5ff27e9e7_56711,c0bf6384-be33-4316-a8bd-04e79caff4d0_96021,551b2d3b-7b8d-485a-8acc-dfdcf2cc7371_73096,e8161fe8-7bb8-45b4-8321-a44cb7007d55_176792,64864071-132e-4cf6-83c1-a2b89ef6e286_25761,f4e35446-2f22-4965-aa1e-88e1768a2cf8_43563,8ae452ad-be1c-4a72-8384-12ef2ee95039_146580,6caa6e78-727c-48f6-80a6-c961772f5eea_50456,1b875ffb-ad1b-454e-9c9b-ca971a6af665_48375,90f794b0-f0f5-4d44-bc88-5ab6156e3992_22334,...,33e84b8e-9c48-48b0-9197-b97bbca2595a_30244,331c55c7-0454-4063-a42c-e16138775f87_35895,051f9e60-1982-4f45-98e2-7af19ae928e3_168114,fcc0717d-158b-4e57-b192-2abd1e709123_176652,f050fef0-702e-40db-8067-041dc11786dc_40412,6435a61c-6f44-4a62-a153-53e4b5fe2e57_53816,62351b20-7ab9-4d56-96e3-b6a3a3b0e374_118589,38e2db8b-ba3c-4c36-a704-d6694a1029a0_18223,d47345a4-2bac-4603-a76c-0bed23fc02f4_79368,f1cbc4a0-cbd9-4845-9940-74519139a535_125199
3,a1f05a38-a2e1-4e9f-8827-b8bda2207ebd_40536,cd98e44d-a8e5-448a-957d-aa8835cc5867_39013,19efdb3c-11c6-4ea6-8ca8-4c403d9cb2f4_7224,fe7c6d69-12d5-498d-bb39-d00380765b85_132952,4dec8d6b-1825-4c91-9dc2-6e96c19127ea_103590,67fbeb04-978d-4ba4-b59f-84ba5f12f2c2_66394,a0bcec29-1e93-4055-8bf1-35e5fb1552b9_155009,58bc0e12-7dd4-41f3-a42a-cfd055a885de_44033,b0eb479b-2a1c-478b-ab00-b95d3a176dc2_42810,60d4ea54-c02b-4a89-afbb-09aaa665c67a_46128,...,bed9f7fc-9e87-4fe5-b092-70a84997184e_156436,b4b3885a-e5ab-40b5-a3b0-9efb57dfd2f0_49253,582b9271-f635-4888-abea-f032131b2199_98737,7fc07783-a500-4b2c-b3df-2c6f47e1eed7_173145,f744bb56-6b61-4b77-8dc0-4be0e41ce462_59594,7ae6f7f5-b4ca-477a-b71e-8a8ae204c340_59713,adfcb335-0b7b-4844-918a-1f66042a3645_49577,760098f1-026c-43aa-9b9d-3f26593a3372_44174,3ebb3ceb-9ce1-4761-947f-83b3dd63576c_8268,787a2837-21f1-473a-9267-b4f177994a33_114703
4,1b89db44-8609-4e52-96ef-d293a47d34ba_9916,efe0674f-457c-4892-a453-26bacbfb1425_49931,fd734315-d16c-479b-8d27-dc30c40e458f_177505,6bcf043a-67aa-42a7-b1e4-0c88de7b174a_34937,77294dfe-cbbb-4802-99ef-b86c2bd637e9_137264,aaf6b6db-6d6e-46bb-ada4-5780c275f370_125107,f99fc023-d3d9-413d-b06f-9ff2324fc7b7_154734,f3db2d1c-7cc7-416d-a292-9df727f5dbbe_22884,1b072d43-5e40-48f1-98d3-8a9c71648a9f_32449,3cbf08f9-d4da-44e5-9e4b-8ade04206af9_7790,...,a905d7a6-07e2-4e84-be08-cd84b3367315_175505,22a209e4-1a3f-4115-af5e-01d2ffda42c2_92008,97c2e020-6289-4cfd-92d0-0bcad7edf154_169822,9064c511-c5a3-4c77-940d-76732424783d_187323,6f418071-fd89-4f6d-8561-791f92290887_147587,62e673c4-5447-4c1b-a155-e7bf35bbd45c_145016,3945a330-3fd5-4486-a016-bd834391ef73_56270,8dc24e3c-650c-446b-a1f9-854e678d7178_18446,83ba6e76-dee3-4a5a-8ca1-5353306decd7_39428,03643044-973b-4638-bb89-193ffa3efd27_77109
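The table above has one column per query and one row per rank, which gets very wide as the number of queries grows. A possible alternative, sketched here on a small hypothetical stand-in for `bm25_results`, is to build the frame with one row per query instead. The IDs `q1`, `d3`, and so on, and the `rank_N` column names, are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for bm25_results: query ID -> ranked document IDs
toy_results = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
}

# Layout used above: columns are query IDs, rows are rank positions
wide = pd.DataFrame(toy_results)

# Alternative layout: one row per query, one column per rank
per_query = pd.DataFrame.from_dict(toy_results, orient="index")
per_query.columns = [f"rank_{i + 1}" for i in per_query.columns]
per_query.index.name = "query_id"
```

With this layout, `per_query.loc["q1"]` returns the full ranking for a single query, and the frame grows downward rather than sideways.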
