# Evaluation of Text Retrieval Techniques for RAG

## Load the documents with ids and index them in Elasticsearch

In [2]:
import json

with open('documents-with-ids.json', 'rt') as f_in:
    documents = json.load(f_in)

In [3]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'id': 'c02e79ef'}

In [4]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200') 

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
        }
    }
}

index_name = "course-questions-with-ids"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions-with-ids'})

In [5]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|██████████| 948/948 [00:07<00:00, 133.50it/s]
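
Indexing one document per request is fine for 948 documents. For larger collections, a bulk request is a common alternative. The following is a minimal sketch, not what the notebook does: the helper name `to_bulk_actions` is hypothetical, and it assumes each document's `id` is unique and can double as the Elasticsearch `_id`.

```python
def to_bulk_actions(documents, index_name):
    """Yield one bulk 'index' action per document (hypothetical helper)."""
    for doc in documents:
        yield {
            "_index": index_name,
            "_id": doc["id"],  # assumption: ids are unique across documents
            "_source": doc,
        }


# Made-up single-document sample, shaped like the records loaded above
sample_docs = [
    {
        "id": "c02e79ef",
        "course": "data-engineering-zoomcamp",
        "question": "Course - When will the course start?",
    },
]
actions = list(to_bulk_actions(sample_docs, "course-questions-with-ids"))
print(actions[0]["_index"], actions[0]["_id"])

# With a running cluster, the actions could be sent in one call:
#   from elasticsearch import helpers
#   helpers.bulk(es_client, to_bulk_actions(documents, index_name))
```

Reusing `id` as `_id` makes re-indexing overwrite existing documents instead of duplicating them, but it would also silently merge any documents that share an id.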


In [6]:
def elastic_search(query, course):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [7]:
elastic_search(
    query="I just discovered the course. Can I still join?",
    course="data-engineering-zoomcamp"
)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp',
  'id': '63394d91'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it fin

In [8]:
elastic_search(
    query="I just discovered the course. Can I still join?",
    course="mlops-zoomcamp"
)

[{'text': 'In order to obtain the certificate, completion of the final capstone project is mandatory. The completion of weekly homework assignments is optional, but they can contribute to your overall progress and ranking on the top 100 leaderboard.',
  'section': '+-General course questions',
  'question': 'Can I still graduate when I didn’t complete homework for week x?',
  'course': 'mlops-zoomcamp',
  'id': '7f93c032'},
 {'text': 'Problem: Max_depth is not recognize even when I add the mlflow.log_params\nSolution: the mlflow.log_params(params) should be added to the hpo.py script, but if you run it it will append the new model to the previous run that doesn’t contain the parameters, you should either remove the previous experiment or change it\nPastor Soto',
  'section': 'Module 2: Experiment tracking',
  'question': 'Max_depth is not recognize even when I add the mlflow.log_params',
  'course': 'mlops-zoomcamp',
  'id': 'f69fb077'},
 {'text': 'The difference is the Orchestration a

In [15]:
elastic_search(
    query="I just discovered the course. Can I still join?",
    course="machine-learning-zoomcamp"
)

[{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
  'section': 'General course-related questions',
  'question': 'The course has already started. Can I still join it?',
  'course': 'machine-learning-zoomcamp',
  'id': 'ee58a693'},
 {'text': 'Welcome to the course! Go to the course page (http://mlzoomcamp.com/), scroll down and start going through the course materials. Then read everything in the cohort folder for your cohort’s year.\nClick on the links and start watching the videos. Also watch office hours from previous cohorts. Go to DTC youtube channel and click on Playlists and search for {course yyyy}. ML Zoomcamp was first launched in 2021.\nOr you c

## Reading the ground truth data and computing relevance against the documents in Elasticsearch

In [16]:
import pandas as pd

In [45]:
import os

# Check the size of the file
print(f"File size: {os.path.getsize('ground-truth-data.csv')} bytes")

File size: 532504 bytes


In [42]:
df_ground_truth = pd.read_csv('ground-truth-data.csv', delimiter=',', encoding='utf-8')

In [43]:
df_ground_truth.head()

|   | question | course | document |
|---|---|---|---|
| 0 | Can you tell me when the course is starting? | data-engineering-zoomcamp | c02e79ef |
| 1 | What is the exact day and hour of the course? | data-engineering-zoomcamp | c02e79ef |
| 2 | Is there a specific calendar I should subscrib... | data-engineering-zoomcamp | c02e79ef |
| 3 | Can you guide me on how to register for the co... | data-engineering-zoomcamp | c02e79ef |
| 4 | What platforms can I use to stay updated on co... | data-engineering-zoomcamp | c02e79ef |


In [44]:
len(df_ground_truth)

264445

In [36]:
df_ground_truth.shape

(264445, 3)

### Checking for null values
It looks odd to see a bunch of null values here: 260,238 of the 264,445 rows contain at least one. It is not clear where they come from.

In [31]:
null_mask = df_ground_truth.isnull().any(axis=1)
null_rows = df_ground_truth[null_mask]
null_rows.shape

(260238, 3)

In [32]:
not_null_mask = df_ground_truth.notnull().all(axis=1)
not_null_rows = df_ground_truth[not_null_mask]
not_null_rows.shape

(4207, 3)
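
One way to narrow down where the nulls come from is to look at the raw CSV rows rather than the parsed DataFrame. The sketch below uses only the standard `csv` module and counts rows by how many non-empty fields they have. The function name `count_filled_fields` and the inline sample rows are made up for illustration; they are not taken from `ground-truth-data.csv`.

```python
import csv
import io
from collections import Counter


def count_filled_fields(csv_file):
    """Count data rows by how many non-empty fields they contain."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    counts = Counter()
    for row in reader:
        filled = sum(1 for value in row if value.strip())
        counts[filled] += 1
    return counts


# Toy stand-in for the real file (made-up rows, for illustration only)
sample = io.StringIO(
    "question,course,document\n"
    "When does the course start?,data-engineering-zoomcamp,c02e79ef\n"
    "A stray line with only one field,,\n"
    ",,\n"
)
counts = count_filled_fields(sample)
print(dict(counts))
```

For the real file, the same function could be called on `open('ground-truth-data.csv', newline='', encoding='utf-8')`. A large count for zero or one filled fields would suggest that the file contains empty or partially written rows, rather than that pandas is parsing it incorrectly.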

### Dropping all the null values
(No null values were expected here.)

In [46]:
df_ground_truth.dropna(inplace=True)
df_ground_truth.shape

(4207, 3)

In [47]:
ground_truth = df_ground_truth.to_dict(orient='records')

In [48]:
ground_truth[0]


{'question': 'Can you tell me when the course is starting?',
 'course': 'data-engineering-zoomcamp',
 'document': 'c02e79ef'}

In [49]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = elastic_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

100%|██████████| 4207/4207 [00:16<00:00, 255.47it/s]


In [50]:
len(relevance_total)

4207

In [51]:
relevance_total[0]

[True, False, False, False, False]

In [53]:
relevance_total[100]

[True, False, False, False, False]

In [52]:
relevance_total[-1]

[False, False, True, False, False]

## Evaluation Metrics: Hit Rate (HR, or Recall) and Mean Reciprocal Rank (MRR)

See the document [here](./evaluation-metrics.md) for a reference on the metrics considered.

In [54]:
example = [
    [True, False, False, False, False],   # 1
    [False, False, False, False, False],  # 0
    [False, False, False, False, False],  # 0
    [False, False, False, False, False],  # 0
    [False, False, False, False, False],  # 0
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [False, False, True, False, False],   # 1/3
    [False, False, False, False, False],  # 0
]

# 1 => 1
# 2 => 1 / 2 = 0.5
# 3 => 1 / 3 = 0.3333
# 4 => 0.25
# 5 => 0.2
# rank => 1 / rank
# none => 0
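
The rank-to-score mapping in the comments above matches the usual definitions of the two metrics. Written out for $N$ queries and the top $k = 5$ results, with $\mathrm{rank}_i$ denoting the position of the ground-truth document for query $i$ (this notation is an assumption introduced here, not taken from the notebook):

$$
\mathrm{HR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\text{ground-truth document of query } i \text{ is in the top } k\right]
$$

$$
\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i},
\qquad \frac{1}{\mathrm{rank}_i} := 0 \text{ if the document is not in the top } k
$$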

#### Hit Rate (recall)

In [55]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

#### Mean Reciprocal Rank (MRR)

In [56]:
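def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        # Each query has a single ground-truth document, so at most one
        # position per line is expected to be True.
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total_score = total_score + 1 / rank

    return total_score / len(relevance_total)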

In [57]:
hit_rate(example)

0.5833333333333334

In [58]:
mrr(example)

0.5277777777777778
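
Both example values can be checked by hand: 7 of the 12 rows contain a hit, and the reciprocal ranks sum to 6 + 1/3. A small sketch of that arithmetic with exact fractions (the variable names are illustrative):

```python
from fractions import Fraction

# Per-query reciprocal ranks for the 12 example rows:
# six hits at rank 1, one hit at rank 3, five misses.
reciprocal_ranks = [Fraction(1)] * 6 + [Fraction(1, 3)] + [Fraction(0)] * 5

hits = sum(1 for rr in reciprocal_ranks if rr > 0)
hit_rate_exact = Fraction(hits, len(reciprocal_ranks))
mrr_exact = sum(reciprocal_ranks) / len(reciprocal_ranks)

print(hit_rate_exact, float(hit_rate_exact))  # 7/12
print(mrr_exact, float(mrr_exact))            # 19/36
```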

In [59]:
hit_rate(relevance_total), mrr(relevance_total)

(0.6978844782505348, 0.5679225101022105)

## Using our minsearch module and comparing with the ES metrics above

In [60]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

<minsearch.Index at 0x1288f0810>

In [61]:
def minsearch_search(query, course):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

In [62]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

100%|██████████| 4207/4207 [00:19<00:00, 211.06it/s]


In [63]:
hit_rate(relevance_total), mrr(relevance_total)

(0.7454242928452579, 0.6350487283099601)

Compare with the ES results obtained above:

```
(0.6978844782505348, 0.5679225101022105)
```

### Generic evaluation function for the different search engines

In [64]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
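
The only contract `evaluate` places on `search_function` is that it accepts one ground-truth record and returns a ranked list of dictionaries with an `id` key. The sketch below illustrates this with a stub in place of a real search engine. It redefines compact versions of the metric functions so that it runs on its own; the toy records and `stub_search` are made up for illustration.

```python
def hit_rate(relevance_total):
    return sum(1 for line in relevance_total if True in line) / len(relevance_total)


def mrr(relevance_total):
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
    return total / len(relevance_total)


def evaluate(ground_truth, search_function):
    relevance_total = []
    for q in ground_truth:
        results = search_function(q)
        relevance_total.append([d["id"] == q["document"] for d in results])
    return {"hit_rate": hit_rate(relevance_total), "mrr": mrr(relevance_total)}


# Made-up ground-truth records: two are findable, one ("z") is not
toy_ground_truth = [
    {"question": "q1", "course": "c", "document": "a"},
    {"question": "q2", "course": "c", "document": "b"},
    {"question": "q3", "course": "c", "document": "z"},
]


def stub_search(q):
    # Stub "search engine": always returns the same ranking
    return [{"id": "a"}, {"id": "b"}, {"id": "c"}]


scores = evaluate(toy_ground_truth, stub_search)
print(scores)
```

Here the hit rate is 2/3 (two of three documents are found) and the MRR is (1 + 1/2 + 0) / 3 = 0.5.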

#### Evaluation of relevance with Elasticsearch

In [65]:
evaluate(ground_truth, lambda q: elastic_search(q['question'], q['course']))

100%|██████████| 4207/4207 [00:16<00:00, 257.03it/s]


{'hit_rate': 0.6978844782505348, 'mrr': 0.5679225101022105}

#### Evaluation of relevance with minsearch

In [66]:
evaluate(ground_truth, lambda q: minsearch_search(q['question'], q['course']))

100%|██████████| 4207/4207 [00:16<00:00, 258.59it/s]


{'hit_rate': 0.7454242928452579, 'mrr': 0.6350487283099601}
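
To make the gap between the two engines easier to read, the two result dictionaries above can be compared directly. A small sketch, with the values copied from the outputs of the two `evaluate` calls (the variable names are illustrative):

```python
es_scores = {"hit_rate": 0.6978844782505348, "mrr": 0.5679225101022105}
minsearch_scores = {"hit_rate": 0.7454242928452579, "mrr": 0.6350487283099601}

# Absolute difference per metric (minsearch minus Elasticsearch)
differences = {
    metric: minsearch_scores[metric] - es_scores[metric]
    for metric in es_scores
}

for metric, diff in differences.items():
    relative = diff / es_scores[metric]
    print(f"{metric}: {diff:+.4f} absolute, {relative:+.1%} relative")
```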

## Deductions

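On this data, minsearch scores higher than Elasticsearch on both hit rate (0.745 vs. 0.698) and MRR (0.635 vs. 0.568). These results may indicate that our ground_truth dataset still needs further cleaning before the comparison can be trusted.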
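
One possible cleaning check is to confirm that every ground-truth record points at a document that actually exists and belongs to the same course. The following is a sketch under assumptions: the helper name `find_invalid_records` is hypothetical, the toy data is made up, and document ids are assumed to be unique.

```python
def find_invalid_records(ground_truth, documents):
    """Return ground-truth records whose document id is unknown
    or whose course differs from the referenced document's course."""
    course_by_id = {doc["id"]: doc["course"] for doc in documents}
    invalid = []
    for record in ground_truth:
        expected_course = course_by_id.get(record["document"])
        if expected_course is None or expected_course != record["course"]:
            invalid.append(record)
    return invalid


# Made-up data: one valid record, one unknown id, one course mismatch
toy_documents = [
    {"id": "a1", "course": "data-engineering-zoomcamp"},
    {"id": "b2", "course": "mlops-zoomcamp"},
]
toy_ground_truth = [
    {"question": "q1", "course": "data-engineering-zoomcamp", "document": "a1"},
    {"question": "q2", "course": "mlops-zoomcamp", "document": "zz"},
    {"question": "q3", "course": "data-engineering-zoomcamp", "document": "b2"},
]

invalid = find_invalid_records(toy_ground_truth, toy_documents)
print(len(invalid), [r["document"] for r in invalid])
```

Records flagged this way can never count as a hit, because the search is filtered by course and matched by id, so they lower both metrics for every engine equally.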