# BM25 Retrieval

BM25 is a widely used ranking function in information retrieval. Search engines use it to estimate how relevant each document is to a given query. It derives from the probabilistic information retrieval model and is a bag-of-words function: it ranks a set of documents by the query terms that appear in each one, regardless of how those terms relate to each other within the document.
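To make the scoring concrete, here is a minimal sketch of the Okapi BM25 formula in plain Python. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the function name `bm25_scores` are assumptions chosen for illustration. This is not the `BM25Model` used in this notebook, whose internals may differ.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative sketch)."""
    # Naive lowercase whitespace tokenizer (an assumption, not the notebook's tokenizer).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the vaccine causes no harm",
    "a new vaccine study was published",
    "football results from the weekend",
]
scores = bm25_scores("vaccine study", docs)
```

The second document matches both query terms and gets the highest score, while the third shares no term with the query and scores zero. Word order and the relationship between the terms play no role, which is the bag-of-words property described above.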

In the experiments below, unprocessed BM25 scores well below the dense retrieval model, but it rewards exact term matches independently of semantic similarity. Fusion ranking methods can combine BM25 with dense retrieval to improve overall performance.
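One common fusion method is reciprocal rank fusion (RRF), which only needs the rank of each document in each list. The sketch below is a generic illustration, not something this notebook runs: the function name, the document ids, and the constant `k=60` (a frequently used default) are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it ranks.
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))

bm25_ranking = [101, 205, 333, 412]   # hypothetical fact-check ids ranked by BM25
dense_ranking = [205, 412, 101, 999]  # hypothetical ids ranked by a dense model
fused_ranking = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that appear near the top of both lists, such as `205` and `101` here, end up ahead of documents that only one retriever found.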

For dense retrieval, stopwords, emojis, and other non-informative characters can stay in the text without seriously affecting the results. For BM25, however, it is important to remove them, because they add noise to the term matching.

In addition, because BM25 is based on the bag-of-words model, more text per document gives it more terms to match. BM25 retrieval can therefore be improved by using the whole text of the documents instead of just the title and the body.

In [5]:
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt

from src import config
from src.datasets import TextConcatFactCheck, TextConcatPosts
from src.models import BM25Model

tasks_path = config.TASKS_PATH
posts_path = config.POSTS_PATH
fact_checks_path = config.FACT_CHECKS_PATH
gs_path = config.GS_PATH
lang = 'deu'
task_name = "monolingual"

fc = TextConcatFactCheck(fact_checks_path, tasks_path=tasks_path, task_name=task_name, lang=lang, version="english")
posts = TextConcatPosts(posts_path, tasks_path=tasks_path, task_name=task_name, lang=lang, gs_path=gs_path, version="english")


In [6]:
df_train_posts = posts.df_train
df_dev_posts = posts.df_dev
df_fc = fc.df

First, BM25 is run without any preprocessing.

In [7]:
bm25_model = BM25Model(df_fc=df_fc, batch_size=512, k=100000, normalize_embeddings=True)
df_bm25_dev = df_dev_posts.copy()
df_bm25_dev["preds"] = bm25_model.predict(df_dev_posts["full_text"].values).tolist()
bm25_model.evaluate(df_bm25_dev, task_name=task_name, lang="deu", ls_k=[10, 50, 100])

Processing texts:   0%|          | 0/61 [00:00<?, ?it/s]

Processing texts: 100%|██████████| 61/61 [00:03<00:00, 15.92it/s]


{'monolingual': {'deu': {10: np.float64(0.4918032786885246),
   50: np.float64(0.5245901639344263),
   100: np.float64(0.5409836065573771)}}}
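The notebook does not show what `evaluate` computes. The values are consistent with a success-at-k style metric, that is, the share of posts for which at least one correct fact-check appears among the top k predictions. Under that assumption, such a metric could be sketched as follows; the function name and the toy ids are hypothetical.

```python
def success_at_k(preds, gold, k):
    """Fraction of queries with at least one gold id among the top-k predictions."""
    hits = sum(
        1
        for ranked, relevant in zip(preds, gold)
        if set(ranked[:k]) & set(relevant)
    )
    return hits / len(preds)

# Hypothetical ranked predictions and gold fact-check ids for three posts.
preds = [[7, 3, 9], [4, 8, 1], [2, 5, 6]]
gold = [[3], [1], [0]]
s1 = success_at_k(preds, gold, k=1)
s3 = success_at_k(preds, gold, k=3)
```

In this toy case no post has a gold id at rank 1, and two of the three posts have one within the top 3.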

# Dense Retrieval

In [8]:
from src.models import EmbeddingModel
teacher_model_path = '/home/bsc/bsc830651/.cache/huggingface/hub/models--intfloat--multilingual-e5-large/snapshots/ab10c1a7f42e74530fe7ae5be82e6d4f11a719eb'
teacher_model = EmbeddingModel(model_name=teacher_model_path, df_fc=df_fc, batch_size=512, k=100000)
df_teacher_dev = df_dev_posts.copy()
df_teacher_dev["preds"] = teacher_model.predict(df_dev_posts["full_text"].values).tolist()
teacher_model.evaluate(df_teacher_dev, task_name=task_name, lang="deu", ls_k=[10, 50, 100])

Batches: 100%|██████████| 10/10 [00:08<00:00,  1.16it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00,  1.95it/s]


{'monolingual': {'deu': {10: np.float64(0.7377049180327869),
   50: np.float64(0.8032786885245902),
   100: np.float64(0.8360655737704918)}}}

# Clean and try again

In [9]:
import spacy
# spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")
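def cleaning_spacy_cased(text):
    """Lemmatize and drop stopwords and punctuation, keeping the original casing."""
    return " ".join([token.lemma_ for token in nlp(text) if not token.is_stop and not token.is_punct])

def cleaning_spacy(text):
    """Same as cleaning_spacy_cased, but lowercase every lemma."""
    return " ".join([token.lemma_.lower() for token in nlp(text) if not token.is_stop and not token.is_punct])

def only_entities(text):
    """Keep only the named entities that spaCy detects in the text."""
    return " ".join([ent.text for ent in nlp(text).ents])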

In [10]:
cleaning_spacy_cased("I am a student at the University of Mannheim and I am studying computer science.")

'student University Mannheim study computer science'

Cleaning the text substantially improves BM25 retrieval. Here, cleaning means lemmatizing each token and removing stopwords and punctuation with spaCy, using the functions defined above. Two variants are compared in the next two cells:


Cased: Worse than lowercased, but still better than no cleaning

Lowercased: Better
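One plausible reason lowercasing helps is that BM25 matches tokens as exact strings, so `University` and `university` count as different terms. The toy check below illustrates this with a plain whitespace tokenizer, which is an assumption: the tokenizer inside `BM25Model` is not shown in this notebook. The helper name `shared_terms` is hypothetical.

```python
def shared_terms(query, doc, lowercase):
    """Return the set of whitespace tokens that query and doc have in common."""
    if lowercase:
        query, doc = query.lower(), doc.lower()
    return set(query.split()) & set(doc.split())

query = "student university mannheim"
doc = "Student University Mannheim study computer science"
cased_overlap = shared_terms(query, doc, lowercase=False)
lowered_overlap = shared_terms(query, doc, lowercase=True)
```

With casing preserved the two texts share no token at all, while after lowercasing all three query terms match.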

In [None]:
from tqdm import tqdm
tqdm.pandas()

df_dev_clean = df_dev_posts.copy()
df_fc_clean = df_fc.copy()

df_dev_clean["full_text"] = df_dev_clean["full_text"].progress_apply(cleaning_spacy_cased)
df_fc_clean["full_text"] = df_fc_clean["full_text"].progress_apply(cleaning_spacy_cased)

bm25_model = BM25Model(df_fc=df_fc_clean, batch_size=512, k=100000, normalize_embeddings=True)
df_bm25_dev = df_dev_clean.copy()
df_bm25_dev["preds"] = bm25_model.predict(df_dev_clean["full_text"].values).tolist()
bm25_model.evaluate(df_bm25_dev, task_name=task_name, lang="deu", ls_k=[10, 50, 100])

Processing texts:   0%|          | 0/61 [00:00<?, ?it/s]

Processing texts: 100%|██████████| 61/61 [00:01<00:00, 41.62it/s]


{'monolingual': {'deu': {10: np.float64(0.639344262295082),
   50: np.float64(0.7540983606557377),
   100: np.float64(0.7704918032786885)}}}

In [14]:
df_dev_clean = df_dev_posts.copy()
df_fc_clean = df_fc.copy()

df_dev_clean["full_text"] = df_dev_clean["full_text"].progress_apply(cleaning_spacy)
df_fc_clean["full_text"] = df_fc_clean["full_text"].progress_apply(cleaning_spacy)

bm25_model = BM25Model(df_fc=df_fc_clean, batch_size=512, k=100000, normalize_embeddings=True)
df_bm25_dev = df_dev_clean.copy()
df_bm25_dev["preds"] = bm25_model.predict(df_dev_clean["full_text"].values).tolist()
bm25_model.evaluate(df_bm25_dev, task_name=task_name, lang="deu", ls_k=[10, 50, 100])

100%|██████████| 61/61 [00:00<00:00, 94.45it/s] 
100%|██████████| 4996/4996 [00:29<00:00, 169.50it/s]
Processing texts: 100%|██████████| 61/61 [00:01<00:00, 40.17it/s]


{'monolingual': {'deu': {10: np.float64(0.6557377049180327),
   50: np.float64(0.819672131147541),
   100: np.float64(0.819672131147541)}}}

For dense retrieval, however, the same cleaning is not clearly beneficial and can be detrimental. Dense retrieval methods rely on the semantic meaning of the full text, so removing stopwords and other characters can discard information the model uses. In the two runs below, the score at k=10 drops from 0.738 to 0.689 (cased) and 0.639 (lowercased), while the scores at k=50 and k=100 stay the same or rise slightly.

In [15]:
from src.models import EmbeddingModel

df_dev_clean = df_dev_posts.copy()
df_fc_clean = df_fc.copy()

df_dev_clean["full_text"] = df_dev_clean["full_text"].progress_apply(cleaning_spacy_cased)
df_fc_clean["full_text"] = df_fc_clean["full_text"].progress_apply(cleaning_spacy_cased)

teacher_model_path = '/home/bsc/bsc830651/.cache/huggingface/hub/models--intfloat--multilingual-e5-large/snapshots/ab10c1a7f42e74530fe7ae5be82e6d4f11a719eb'
teacher_model = EmbeddingModel(model_name=teacher_model_path, df_fc=df_fc_clean, batch_size=512, k=100000)
df_teacher_dev = df_dev_clean.copy()
df_teacher_dev["preds"] = teacher_model.predict(df_dev_clean["full_text"].values).tolist()
teacher_model.evaluate(df_teacher_dev, task_name=task_name, lang="deu", ls_k=[10, 50, 100])

100%|██████████| 61/61 [00:00<00:00, 94.14it/s] 
100%|██████████| 4996/4996 [00:29<00:00, 171.27it/s]
Batches: 100%|██████████| 10/10 [00:06<00:00,  1.58it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00,  2.20it/s]


{'monolingual': {'deu': {10: np.float64(0.6885245901639344),
   50: np.float64(0.8032786885245902),
   100: np.float64(0.8524590163934426)}}}

In [16]:
df_dev_clean = df_dev_posts.copy()
df_fc_clean = df_fc.copy()

df_dev_clean["full_text"] = df_dev_clean["full_text"].progress_apply(cleaning_spacy)
df_fc_clean["full_text"] = df_fc_clean["full_text"].progress_apply(cleaning_spacy)

teacher_model_path = '/home/bsc/bsc830651/.cache/huggingface/hub/models--intfloat--multilingual-e5-large/snapshots/ab10c1a7f42e74530fe7ae5be82e6d4f11a719eb'
teacher_model = EmbeddingModel(model_name=teacher_model_path, df_fc=df_fc_clean, batch_size=512, k=100000)
df_teacher_dev = df_dev_clean.copy()
df_teacher_dev["preds"] = teacher_model.predict(df_dev_clean["full_text"].values).tolist()
teacher_model.evaluate(df_teacher_dev, task_name=task_name, lang="deu", ls_k=[10, 50, 100])

100%|██████████| 61/61 [00:00<00:00, 61.86it/s] 
100%|██████████| 4996/4996 [00:30<00:00, 166.14it/s]
Batches: 100%|██████████| 10/10 [00:06<00:00,  1.53it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00,  2.20it/s]


{'monolingual': {'deu': {10: np.float64(0.639344262295082),
   50: np.float64(0.8360655737704918),
   100: np.float64(0.8524590163934426)}}}
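To compare the runs side by side, the scores reported above can be collected in one place. The snippet below only re-enters the values printed by the cells in this notebook, rounded to three decimals, and prints them as a plain-text table. The run labels are shorthand introduced here.

```python
# Scores copied from the evaluation outputs above, rounded to three decimals.
results = {
    "BM25 raw":            {10: 0.492, 50: 0.525, 100: 0.541},
    "BM25 cleaned cased":  {10: 0.639, 50: 0.754, 100: 0.770},
    "BM25 cleaned lower":  {10: 0.656, 50: 0.820, 100: 0.820},
    "Dense raw":           {10: 0.738, 50: 0.803, 100: 0.836},
    "Dense cleaned cased": {10: 0.689, 50: 0.803, 100: 0.852},
    "Dense cleaned lower": {10: 0.639, 50: 0.836, 100: 0.852},
}

# Print one row per run with the scores at k = 10, 50, and 100.
print(f"{'run':<22}{'@10':>7}{'@50':>7}{'@100':>7}")
for name, scores in results.items():
    print(f"{name:<22}{scores[10]:>7.3f}{scores[50]:>7.3f}{scores[100]:>7.3f}")

# Run with the highest score at k = 10.
best_at_10 = max(results, key=lambda name: results[name][10])
```

Cleaning closes most of the gap between BM25 and the dense model at k=50 and k=100, while the uncleaned dense model remains the best run at k=10.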