# Vector Search Evaluation

1. Generation of **ground truth** dataset. 

✅ This step was completed in the `01_evaluation_generation_ground_truth.ipynb` notebook.

  This can be done:
  * Manually, by annotators or domain experts
  * By collecting real user queries
  * By generating the data with an LLM


In general, one query may have multiple relevant documents, but in this use case each query (user question) has exactly 1 relevant document (answer).

The automatic generation of the dataset works as follows:
1. For every user query (question), the LLM is prompted to generate 5 similar questions.
2. Vector search is applied, using the LLM-generated questions as queries, to find the relevant document in the knowledge base.
3. During the test phase, we evaluate whether our vector search can detect the relevant document for the similar (i.e. generated) queries.
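
As a rough sketch of how the generated questions can be linked back to their source document: the function below assumes (hypothetically) that the LLM returns its questions as a JSON list of strings. The real prompt and response format live in the generation notebook and may differ. `build_ground_truth_rows` is an illustrative name, and the output keys mirror the `question`, `course` and `document` columns of `ground-truth-data.csv`.

```python
import json


def build_ground_truth_rows(doc, llm_response):
    """Map LLM-generated questions back to the document they came from.

    Assumption: `llm_response` is a JSON list of question strings.
    """
    questions = json.loads(llm_response)
    return [
        {"question": q, "course": doc["course"], "document": doc["id"]}
        for q in questions
    ]


# Toy input: a trimmed knowledge-base record and a made-up LLM response
doc = {"id": "c02e79ef", "course": "data-engineering-zoomcamp"}
llm_response = '["When does the course begin?", "How can I get the course schedule?"]'

rows = build_ground_truth_rows(doc, llm_response)
print(rows[0])
```

Every generated question carries the `id` of its source document, which is what makes the later relevance check a simple id comparison.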

-------------
2. Evaluation of **Text Retrieval** techniques.

For every record in the ground truth dataset we will:
  1. Execute the query (perform a text search in our vector database)
  2. Check whether the retrieved results contain the answer assigned to the original query (from which we generated our artificial queries)
  3. Calculate the metrics
      * **Hit Rate** (or Recall)
      * **MRR** (Mean Reciprocal Rank)


In [1]:
import json
import re

In [2]:
with open('documents-with-ids.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [4]:
docs_raw[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'id': 'c02e79ef'}

## 1. Indexing our KnowledgeBase data into ElasticSearch

### 1.1 Setup ElasticSearch connection

Running ES with Docker
```
docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```

In [11]:
from elasticsearch import Elasticsearch

In [12]:
!curl localhost:9200

{
  "name" : "675ce062f461",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "4gR5tzWATWyRv8KHoGRj7Q",
  "version" : {
    "number" : "8.4.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "42f05b9372a9a4a470db3b52817899b99a76ee73",
    "build_date" : "2022-10-04T07:17:24.662462378Z",
    "build_snapshot" : false,
    "lucene_version" : "9.3.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}


In [13]:
es_client = Elasticsearch('http://localhost:9200')
es_client.info()

ObjectApiResponse({'name': '675ce062f461', 'cluster_name': 'docker-cluster', 'cluster_uuid': '4gR5tzWATWyRv8KHoGRj7Q', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [14]:
health = es_client.cluster.health()
print(health)

{'cluster_name': 'docker-cluster', 'status': 'green', 'timed_out': False, 'number_of_nodes': 1, 'number_of_data_nodes': 1, 'active_primary_shards': 2, 'active_shards': 2, 'relocating_shards': 0, 'initializing_shards': 0, 'unassigned_shards': 0, 'delayed_unassigned_shards': 0, 'number_of_pending_tasks': 0, 'number_of_in_flight_fetch': 0, 'task_max_waiting_in_queue_millis': 0, 'active_shards_percent_as_number': 100.0}


In [None]:
# Adjust high and low watermarks temporarily
es_client.cluster.put_settings(body={
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "95%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
    }
})

ObjectApiResponse({'acknowledged': True, 'persistent': {}, 'transient': {'cluster': {'routing': {'allocation': {'disk': {'watermark': {'low': '85%', 'flood_stage': '98%', 'high': '95%'}}}}}}})

Using OpenAI API for generation task.

### 1.2 Indexing data

In [15]:
index_name = "course-questions"
# Check if index exists
if es_client.indices.exists(index=index_name):
    print(f"Index '{index_name}' exists.")
else:
    print(f"Index '{index_name}' does not exist.")
    # You might want to create the index here if it doesn't exist

Index 'course-questions' exists.


In [23]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
        }
    }
}

index_name = "course-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [27]:
from tqdm.auto import tqdm

In [29]:
for doc in tqdm(docs_raw):
    es_client.index(index=index_name, document=doc)

100%|██████████| 948/948 [00:06<00:00, 151.10it/s]


Indexing data. Our setup is suitable for a small-scale application: full-text search on the `text`, `section` and `question` fields, and exact keyword matching on `course`.

A single shard is enough for our small dataset. With a large number of documents, a single shard would severely limit scalability.

For our small experiment we don't need to create any shard replicas.
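
Indexing one document per request is fine for 948 records. For larger datasets, a common alternative is the `bulk` helper from `elasticsearch.helpers`. The sketch below is not what this notebook ran: `build_bulk_actions` and `bulk_index` are hypothetical helper names, and reusing the document `id` as the Elasticsearch `_id` is an assumption made here for illustration.

```python
def build_bulk_actions(docs, index_name):
    """Yield one bulk action per document.

    Assumption: the document's own `id` is reused as the Elasticsearch `_id`.
    """
    for doc in docs:
        yield {"_index": index_name, "_id": doc["id"], "_source": doc}


def bulk_index(es_client, docs, index_name):
    """Send all documents through the bulk helper (needs a running cluster)."""
    # Imported lazily so the action builder can be used without the client installed
    from elasticsearch.helpers import bulk

    return bulk(es_client, build_bulk_actions(docs, index_name))


# Inspect the generated actions without touching Elasticsearch
sample_docs = [
    {"id": "c02e79ef", "course": "data-engineering-zoomcamp", "question": "Course - When will the course start?"},
]
actions = list(build_bulk_actions(sample_docs, "course-questions"))
print(actions[0]["_id"])
```

With a live client, `bulk_index(es_client, docs_raw, index_name)` would replace the indexing loop above.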

### 1.3 Text search mechanism

In [30]:

def elastic_search(query, course):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)

    result_docs = []

    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs
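
In the query above, the `multi_match` clause under `must` does the scoring: `question^3` weights matches in the `question` field three times as heavily, and `best_fields` takes the score from the best-matching field. The `term` clause under `filter` only restricts results to one course and does not affect the score. To evaluate other boosts or result sizes later, the query body could be built by a small function. This is a sketch, and `build_search_query`, `size` and `question_boost` are illustrative names rather than part of the original notebook.

```python
def build_search_query(query, course, size=5, question_boost=3):
    """Build the same query body as `elastic_search`, with tunable size and boost."""
    return {
        "size": size,
        "query": {
            "bool": {
                # scoring clause: full-text match across three fields
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": [f"question^{question_boost}", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # non-scoring clause: exact match on the keyword field
                "filter": {"term": {"course": course}},
            }
        },
    }


search_query = build_search_query(
    "can I still join?", "data-engineering-zoomcamp", size=10, question_boost=2
)
print(search_query["query"]["bool"]["must"]["multi_match"]["fields"])
```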

In [31]:
elastic_search(
    query="I just discivered the course. can I still join?",
    course="data-engineering-zoomcamp"
)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp',
  'id': '63394d91'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it fin

## 2. Ground truth data

### 2.1 Getting ground truth data

In [94]:
import pandas as pd

In [95]:
df_ground_truth = pd.read_csv('ground-truth-data.csv')
df_ground_truth.head()

Unnamed: 0,question,course,document
0,When does the course begin?,data-engineering-zoomcamp,c02e79ef
1,How can I get the course schedule?,data-engineering-zoomcamp,c02e79ef
2,What is the link for course registration?,data-engineering-zoomcamp,c02e79ef
3,How can I receive course announcements?,data-engineering-zoomcamp,c02e79ef
4,Where do I join the Slack channel?,data-engineering-zoomcamp,c02e79ef


There are 77 rows where no real question was generated.

In [96]:
df_ground_truth.shape

(4627, 3)

In [97]:
df_no_question = df_ground_truth[df_ground_truth.question.str.contains(r'^question')]
df_no_question.head()

Unnamed: 0,question,course,document
20,question1,data-engineering-zoomcamp,63394d91
21,question2,data-engineering-zoomcamp,63394d91
22,question3,data-engineering-zoomcamp,63394d91
23,question4,data-engineering-zoomcamp,63394d91
24,question5,data-engineering-zoomcamp,63394d91


In [98]:
df_no_question.course.value_counts()

course
machine-learning-zoomcamp    35
data-engineering-zoomcamp    33
mlops-zoomcamp                9
Name: count, dtype: int64

Even though a record exists for the initial question, we can see that the LLM lost it for 77 rows during generation and returned placeholders instead.

In [99]:
for d in docs_raw:
    if d['id'] == 'c91b6b57':
        print(d)

{'text': 'pd.get_dummies and DictVectorizer both create a one-hot encoding on string values. Therefore you need to convert the values in PUlocationID and DOlocationID to string.\nIf you convert the values in PUlocationID and DOlocationID from numeric to string, the NaN values get converted to the string "nan".  With DictVectorizer the RMSE is the same whether you use "nan" or "-1" as string representation for the NaN values. Therefore the representation doesn\'t have to be "-1" specifically, it could also be some other string.', 'section': 'Module 1: Introduction', 'question': 'Replacing NaNs for pickup location and drop off location with -1 for One-Hot Encoding', 'course': 'mlops-zoomcamp', 'id': 'c91b6b57'}


We will delete the records with no generated questions.

In [100]:
df_ground_truth = df_ground_truth[~df_ground_truth.question.str.contains(r'^question')]
df_ground_truth.shape

(4550, 3)
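
Note that the pattern `^question` would also drop any genuine question that happens to start with the lowercase word "question". Whether such rows exist in this dataset is not checked here. A stricter check, assuming the placeholders always look like `question1` to `question5` as in the rows shown above, could be sketched as follows (`is_placeholder` is an illustrative name):

```python
import re

# Assumption: placeholders are exactly "question" followed by digits
PLACEHOLDER_PATTERN = re.compile(r"^question\d+$")


def is_placeholder(question):
    """Return True only for strings such as 'question1' or 'question5'."""
    return bool(PLACEHOLDER_PATTERN.match(question.strip()))


questions = [
    "question1",
    "question about the deadline?",
    "When does the course begin?",
]
kept = [q for q in questions if not is_placeholder(q)]
print(kept)
```

The same regular expression can be passed to `str.contains` in the pandas filter.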

In [101]:
dict_ground_truth = df_ground_truth.to_dict(orient='records')
dict_ground_truth[:3]

[{'question': 'When does the course begin?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'How can I get the course schedule?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'What is the link for course registration?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'}]

### 2.2 Querying ElasticSearch with our ground truth data 

At this stage we perform a text search using our artificially generated questions.

Then we iterate over the retrieved results to check whether the right record was retrieved from our knowledge base.

In [102]:
relevance_total = []

for query in tqdm(dict_ground_truth):
    # ground truth
    q = query['question']
    course = query['course']
    doc_id = query['document']

    # querying Elasticsearch
    results = elastic_search(q, course)

    # check whether the search results contain the document id assigned to the generated question
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)


100%|██████████| 4550/4550 [00:16<00:00, 272.85it/s]


In [111]:
relevance_total[:5]

[[True, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False]]

## 3. Evaluation Metrics

### 3.1 Hit Rate (or Recall at k)

A metric commonly used to evaluate recommendation and retrieval systems.

Measures the proportion of queries for which at least one relevant document is retrieved in the top-k results.

\[
  \text{Hit Rate} = \left(\frac{\text{Number of Hits}}{\text{Total Number of Attempts}} \right) \times 100
  \]

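Where:
- **Number of Hits**: The number of queries for which the relevant document appears in the top-k results.
- **Total Number of Attempts**: The total number of queries executed.

The formula expresses the result as a percentage; the functions below return the plain fraction (without multiplying by 100).

In our case there is only one relevant document per query.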

In [112]:
def hit_rate(results):
    """Computing hit rate using lists"""
    cnt = 0

    for line in results:
        if True in line:
            cnt += 1

    return cnt / len(results)

In [113]:
hit_rate(relevance_total)

0.752087912087912

In [104]:
df_relevance = pd.DataFrame(relevance_total)
df_relevance.head()

Unnamed: 0,0,1,2,3,4
0,True,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False


In [105]:
df_relevance.isnull().sum()

0    0
1    0
2    0
3    0
4    0
dtype: int64

Computing the Hit-Rate

In [120]:
# some rows have 2 True values = the same document was retrieved twice
df_relevance.sum(axis=1).value_counts()

1    3417
0    1128
2       5
Name: count, dtype: int64

In [129]:
# check if any row contains True and sum over it
df_relevance.any(axis=1).sum()

3422

In [130]:
df_relevance.any(axis=1).sum() / df_relevance.shape[0]

0.752087912087912

In [132]:
def df_hit_rate(results):
    """Computing hit rate using pandas"""
    df_relevance = pd.DataFrame(results)

    # check if any row contains True and sum over it
    row_results = df_relevance.any(axis=1).sum()
    return row_results / df_relevance.shape[0]

In [134]:
df_hit_rate(relevance_total) == hit_rate(relevance_total)

True
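
Since the search returns 5 results per query, the numbers above are effectively Hit Rate at k=5. As a sketch, the same relevance lists can be truncated to look at smaller values of k. `hit_rate_at_k` is an illustrative helper, and the toy data below is made up rather than taken from the notebook results.

```python
def hit_rate_at_k(results, k):
    """Fraction of queries whose relevant document appears in the first k results."""
    hits = sum(1 for line in results if any(line[:k]))
    return hits / len(results)


# Toy relevance lists: one row per query, one boolean per retrieved result
toy_relevance = [
    [True, False, False],
    [False, False, True],
    [False, False, False],
    [False, True, False],
]

for k in (1, 2, 3):
    print(k, hit_rate_at_k(toy_relevance, k))
```

Hit rate can only stay the same or grow as k increases, because a larger k keeps all earlier hits.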

### 3.2 Mean Reciprocal Rank (MRR)

Evaluates the rank position of the first relevant document.
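
Using the standard definition, with \(Q\) the set of queries and \(\text{rank}_i\) the 1-based position of the first relevant document for query \(i\):

\[
  \text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}
  \]

A query whose relevant document is not retrieved contributes 0. A minimal sketch over relevance lists shaped like `relevance_total` could look like this (the function name `mrr` and the toy data are illustrative):

```python
def mrr(results):
    """Mean Reciprocal Rank over a list of boolean relevance lists."""
    total_score = 0.0

    for line in results:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                # only the first relevant result counts
                total_score += 1 / rank
                break

    return total_score / len(results)


# Toy relevance lists: hit at rank 1, hit at rank 2, no hit
toy_relevance = [
    [True, False, False],
    [False, True, False],
    [False, False, False],
]
print(mrr(toy_relevance))
```

Unlike hit rate, MRR rewards placing the relevant document near the top: a hit at rank 1 scores 1, while a hit at rank 5 scores only 0.2.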
