## Evaluate a text retrieval feature

This notebook evaluates a text retrieval feature with two metrics: Mean Reciprocal Rank (MRR) and Hit Rate (HR).

MRR is the mean, over all queries, of the reciprocal rank of the _first_ relevant document. A query for which no relevant document is retrieved contributes 0.

HR is the proportion of queries for which the feature returns at least one relevant document.

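**Mean Reciprocal Rank (MRR)**:
   - Evaluates the rank position of the _first_ relevant document.
   - Formula: `MRR = (1 / |Q|) * Σ (1 / rank_i)` for i = 1 to |Q|, where `|Q|` is the number of queries and `rank_i` is the 1-based position of the first relevant document for query i. The term is 0 when no relevant document is retrieved.

**Hit Rate (HR) at k**:
   - Measures the proportion of queries for which at least _one_ relevant document is retrieved in the top k results.
   - Formula: `HR@k = (Number of queries with at least one relevant document in top k) / |Q|`
   - HR@k equals Recall@k only when each query has exactly one relevant document.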
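
As a quick illustration of both formulas, the sketch below computes MRR and HR@k from the 1-based rank of the first relevant document per query. The `first_relevant_ranks` list and both helper names are made up for this example; `None` stands for a query with no relevant result.

```python
from typing import List, Optional

# Hypothetical input: 1-based rank of the first relevant document per query,
# or None when no relevant document was retrieved.
first_relevant_ranks: List[Optional[int]] = [1, None, 5, 2]


def mrr_from_ranks(ranks: List[Optional[int]]) -> float:
    # A query without a relevant result contributes 0 to the sum.
    return sum(1 / r for r in ranks if r is not None) / len(ranks)


def hr_at_k_from_ranks(ranks: List[Optional[int]], k: int) -> float:
    # A query counts as a hit when its first relevant rank is within the top k.
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


print(mrr_from_ranks(first_relevant_ranks))
print(hr_at_k_from_ranks(first_relevant_ranks, 3))
```

Working from ranks rather than boolean lists makes the link to the formulas explicit: only the position of the first relevant document matters for either metric.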

In [1]:
from services import *
from common.settings import Settings
from common.client_factory import ClientFactory
from services.retrieval_service import RetrievalService
from services.reciprocal_rank_fusion_service import ReciprocalRankFusionService
from common.sentence_transformer_model_factory import SentenceTransformerModelFactory
import pandas as pd
from pandas import DataFrame
import json
import os
from typing import Any, Dict, List
from pprint import pprint
from tqdm import tqdm

import importlib
import retrieval_evaluation_utils as ev_utils



In [2]:
# Reload the module
importlib.reload(ev_utils)

<module 'retrieval_evaluation_utils' from '/home/jovyan/work/notebook/utils/retrieval_evaluation_utils.py'>

In [3]:
doc_with_ids_path = "/home/jovyan/work/notebook/retrieval_evaluation/dataset_with_doc_ids.csv"
ground_truth_path = "/home/jovyan/work/notebook/retrieval_evaluation/ground_truth.csv"
evaluation_results_path = "/home/jovyan/work/notebook/retrieval_evaluation/evaluation_results.csv"
test_name = "text"

### Evaluate the text retrieval feature using MRR and HR@k on a test sample

In [4]:
example_retrieval_result =  [
                                         # MRR        # HR@k=5  # HR@k=1   # HR@k=3
    [True, False, False, False, False],  # 1/1 = 1    # 1       # 1        # 1
    [False, False, False, False, False], # 0          # 0       # 0        # 0
    [False, False, False, False, True],  # 1/5 = 0.2  # 1       # 0        # 0
    [False, True, False, False, True]    # 1/2 = 0.5  # 1       # 0        # 1
]

# MRR   : (1 + 0 + 0.2 + 0.5) / 4 = 1.7 / 4 = 0.425
# HR@k=1:  (1 + 0 + 0 + 0) / 4 = 1/4 = 0.25
# HR@k=3:  (1 + 0 + 0 + 1) / 4 = 2/4 = 0.5
# HR@k=5:  (1 + 0 + 1 + 1) / 4 = 3/4 = 0.75

MRR measures how early the _first_ relevant item appears in a ranked list of results. It calculates the reciprocal of the rank of the first relevant item and averages this across all queries.

In [5]:
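def calculate_mrr(dataset: List[List[bool]]) -> float:
    """Return the mean reciprocal rank of the first relevant (True) item per query."""
    if not dataset:
        # Avoid a division by zero when there are no queries
        return 0.0

    total_score = 0.0

    for row in dataset:
        for idx, value in enumerate(row):
            if value:
                # idx is 0-based, ranks are 1-based
                total_score += 1 / (idx + 1)
                break

    return total_score / len(dataset)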

In [6]:
mrr_value = calculate_mrr(example_retrieval_result)
print(mrr_value)

0.425


Hit Rate at k (HR@k) is computed in a similar way: for each query, check whether at least _one_ relevant (True) item appears within the top k positions, then average the result over all queries.

In [7]:
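def calculate_hr_at_k(dataset: List[List[bool]], k: int) -> float:
    """Return the share of queries with at least one relevant (True) item in the top k."""
    if not dataset:
        # Avoid a division by zero when there are no queries
        return 0.0

    total_score = 0.0

    for row in dataset:
        # Slicing tolerates rows shorter than k, whereas row[i] would raise IndexError
        if any(row[:k]):
            total_score += 1

    return total_score / len(dataset)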

In [8]:
hr_at_k3 = calculate_hr_at_k(example_retrieval_result, 3)
print(hr_at_k3)

0.5


In [9]:
hr_at_k5 = calculate_hr_at_k(example_retrieval_result, 5)
print(hr_at_k5)

0.75
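
To see how the hit rate changes with the cutoff, the sketch below sweeps k from 1 to 5 over the same toy relevance matrix. The compact `hit_rate_at_k` helper is a stand-in written for this example so the block runs on its own.

```python
from typing import List

# Same toy relevance matrix as in the example above (one row per query).
example: List[List[bool]] = [
    [True, False, False, False, False],
    [False, False, False, False, False],
    [False, False, False, False, True],
    [False, True, False, False, True],
]


def hit_rate_at_k(dataset: List[List[bool]], k: int) -> float:
    # any() over a slice also works for rows shorter than k
    return sum(1 for row in dataset if any(row[:k])) / len(dataset)


hr_by_k = {k: hit_rate_at_k(example, k) for k in range(1, 6)}
for k, value in hr_by_k.items():
    print(f"HR@{k} = {value}")
```

HR@k can only stay the same or grow as k increases, because a hit within the top k is also a hit within the top k + 1.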


### Evaluate the text retrieval feature using MRR and HR@k on the real dataset

#### Load the dataset and ground truth

In [10]:
dataset_df: DataFrame

if os.path.exists(doc_with_ids_path):
    dataset_df = pd.read_csv(doc_with_ids_path, delimiter=";")
else:
    columns = ['source_system', 'category', 'question', 'document_id']
    dataset_df = pd.DataFrame(columns=columns)

dataset_df[:2]

| Unnamed: 0 | source_system | category | question | answer | document_id |
|---|---|---|---|---|---|
| 0 | evdi | Analysekonzept | Wie läuft der Analyseprozess für Immobilienpro... | Der Analyseprozess bei Engel & Völkers Digital... | f2624f5125f9 |
| 1 | evdi | Analysekonzept | Wie werden die Anlageprojekte bewertet und wie... | Die Bewertung der Anlageprojekte bei Engel & V... | 14e6c2e22916 |


In [11]:
groud_truth_df: DataFrame
if os.path.exists(ground_truth_path):
    groud_truth_df = pd.read_csv(ground_truth_path, delimiter=";")
else:
    columns = ['source_system', 'category', 'question', 'document_id']
    groud_truth_df = pd.DataFrame(columns=columns)

groud_truth_df[:2]

| Unnamed: 0 | source_system | category | question | document_id |
|---|---|---|---|---|
| 0 | evdi | Analysekonzept | Was sind die wichtigsten Schritte des Analysep... | f2624f5125f9 |
| 1 | evdi | Analysekonzept | Welche externen Partner sind an der Analyse vo... | f2624f5125f9 |


In [12]:
settings = Settings()
settings.index_name = test_name
client_factory = ClientFactory(settings)
es_client = client_factory.create_elasticsearch_client()
es_client.info()

ObjectApiResponse({'name': 'f10eb90c56e6', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'iYx9l4wDS4q30JPqorqswg', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

#### Index questions and answers

In [13]:
model_factory = SentenceTransformerModelFactory(settings)
embedding_model = model_factory.create_model()
retrieval_service = RetrievalService(es_client, embedding_model, settings)



In [14]:
first_dataset_item = dataset_df.iloc[0]
question = first_dataset_item["question"]
# Derive the embedding dimension from a sample vector so the index mapping
# stays compatible with the model used by the existing retrieval_service
vector_value = embedding_model.encode(question)
dimensions = len(vector_value)

In [15]:
index_settings = dict(
        settings=dict(
            number_of_shards=1,
            number_of_replicas=0,
        ),
        mappings=dict(
            properties=dict(
                answer=dict(type="text"),
                question=dict(type="text"),
                category=dict(type="text"),
                document_id=dict(type="text"),
                answer_instructions=dict(type="text"),
                source_system=dict(type="keyword"),
                vector_question_answer=dict(
                    type="dense_vector",
                    dims=dimensions,
                    index=True,
                    similarity="cosine",
                ),
            ),
        ),
    )

if es_client.indices.exists(index=test_name):
    es_client.indices.delete(index=test_name)

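# Pass settings and mappings as keyword arguments; `body=` is deprecated in recent clients
es_client.indices.create(index=test_name, **index_settings)

for idx, row in dataset_df.iterrows():
    document_to_index = row.to_dict()
    es_client.index(index=test_name, document=document_to_index)

# Make the newly indexed documents visible to search before evaluating
es_client.indices.refresh(index=test_name)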
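
Indexing one document per request is fine for a few hundred rows. For larger datasets, a common alternative is the bulk helper of the Elasticsearch Python client. The sketch below only builds the action dictionaries, so it runs without a cluster. `build_bulk_actions` and the sample rows are hypothetical, and the commented call assumes `elasticsearch.helpers.bulk` is available in the installed client.

```python
from typing import Any, Dict, Iterable, Iterator


def build_bulk_actions(
    rows: Iterable[Dict[str, Any]], index_name: str
) -> Iterator[Dict[str, Any]]:
    # Each action names the target index and carries the document as _source.
    for row in rows:
        yield {"_index": index_name, "_source": row}


# Hypothetical rows standing in for dataset_df.to_dict(orient="records")
sample_rows = [
    {"question": "q1", "answer": "a1", "document_id": "doc-1"},
    {"question": "q2", "answer": "a2", "document_id": "doc-2"},
]
actions = list(build_bulk_actions(sample_rows, "text"))
print(actions[0])

# With a live client, the actions could be sent in one request, for example:
# from elasticsearch import helpers
# helpers.bulk(es_client, build_bulk_actions(sample_rows, "text"))
```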

In [16]:
relevance_total: List[List[bool]] = []

In [17]:
for idx, row in tqdm(groud_truth_df.iterrows(), total=groud_truth_df.shape[0]):
    doc = row.to_dict()
    retrieval_result = retrieval_service.search(row["question"], 5)
    text_result = retrieval_result.text_result_items
    
    relevance: List[bool] = []
    for item in text_result:
        relevance.append(item.document_id == doc["document_id"])
    
    relevance_total.append(relevance)

100%|██████████| 435/435 [01:52<00:00,  3.86it/s]


In [18]:
pprint(relevance_total[:16])

[[False, True],
 [True, False],
 [False, False],
 [True, False],
 [True, False],
 [False, False],
 [False, False],
 [False, True],
 [False, False],
 [False, False],
 [False],
 [False, False],
 [],
 [],
 [],
 [True, False]]
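
The output above contains lists shorter than the 5 requested results, including empty ones. Those queries still count in the denominator of both metrics, so it can help to see how many there are. A minimal sketch, using a made-up `relevance_sample` in place of `relevance_total`:

```python
from collections import Counter
from typing import List

# Hypothetical sample shaped like relevance_total
relevance_sample: List[List[bool]] = [
    [False, True],
    [True, False],
    [False],
    [],
    [],
]

# Number of queries per result-list length
length_counts = Counter(len(row) for row in relevance_sample)
# Share of queries for which the retrieval returned nothing at all
empty_share = length_counts[0] / len(relevance_sample)

print(dict(length_counts))
print(f"Queries without any result: {empty_share:.0%}")
```

A high share of empty result lists points at the retrieval step itself (for example, query parsing or analyzer settings) rather than at ranking quality.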


In [19]:
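def calculate_hit_rate(dataset: List[List[bool]]) -> float:
    """Return the share of queries with at least one relevant (True) retrieved item."""
    if not dataset:
        # Avoid a division by zero when there are no queries
        return 0.0

    hits = sum(1 for row in dataset if any(row))
    return hits / len(dataset)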

In [20]:
calculated_hit_rate = calculate_hit_rate(relevance_total)
print(f"Hit Rate value: {calculated_hit_rate}")

Hit Rate value: 0.6436781609195402


In [21]:
calculated_mrr = calculate_mrr(relevance_total)
print(f"MRR value: {calculated_mrr}")

MRR value: 0.5620689655172414
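
A single aggregate can hide weak spots. Since the ground truth carries a `category` column, the reciprocal rank could also be averaged per category. The sketch below uses made-up category labels and relevance lists; `reciprocal_rank` and `mrr_by_group` are illustrative helpers, not part of the notebook's utilities.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank(row: List[bool]) -> float:
    # 1 / rank of the first relevant item, or 0.0 if none is relevant
    for idx, value in enumerate(row):
        if value:
            return 1 / (idx + 1)
    return 0.0


def mrr_by_group(groups: List[str], relevance: List[List[bool]]) -> Dict[str, float]:
    # Collect per-query reciprocal ranks under their group label, then average
    scores: Dict[str, List[float]] = defaultdict(list)
    for group, row in zip(groups, relevance):
        scores[group].append(reciprocal_rank(row))
    return {group: sum(values) / len(values) for group, values in scores.items()}


# Hypothetical data: one category label per query, aligned with the relevance lists
categories = ["A", "A", "B", "B"]
relevance = [[True, False], [False, True], [], [False, False]]
print(mrr_by_group(categories, relevance))
```

In the notebook, the labels could come from the `category` column of the ground truth, in the same row order as `relevance_total`.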


#### Save the results

In [22]:
df = pd.DataFrame({
    "source_system": [first_dataset_item["source_system"], first_dataset_item["source_system"]],
    "method": [test_name, test_name],
    "metric": ["mrr", "HR@K5"],
    "value": [calculated_mrr, calculated_hit_rate],
    "model": [settings.embedding_model_name, settings.embedding_model_name],
    "description": ["text evaluation", "text evaluation"]
})

ev_utils.add_evaluation_results(df, evaluation_results_path)

#### Clean up Elasticsearch index

In [23]:
if es_client.indices.exists(index=test_name):
    es_client.indices.delete(index=test_name)