# Retrieval approach with Elasticsearch

### Imports

In [1]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='tqdm')
from tqdm.auto import tqdm
import json
import pandas as pd

### Pretrained Model used for creation of embeddings

The model used to create the embeddings can be found in this website
https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#semantic-search-models

In [2]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-mpnet-base-v2")

In [3]:
print(f'The output of the model has {len(model.encode("How many features or dimensions the model uses to represent the input text?"))} dimensional embeddings')

The output of the model has 768 dimensional embeddings


### Load the book

In [4]:
with open('../../data/documents_with_ids.json', 'rt') as f_in:
    documents = json.load(f_in)

In [5]:
documents[0]

{'chapter': 'CHAPTER 1',
 'title': 'Machine Learning Roles and the Interview Process',
 'section': 'Overview of This Book',
 'text': 'In the first part of this chapter, I’ll walk through the structure of this book. Then, I’ll discuss the various job titles and roles that use ML skills in industry. 1 I’ll also clarify the responsibilities of various job titles, such as data scientist, machine learning engineer, and so on, as this is a common point of confusion for job seekers. These will be illustrated with an ML skills matrix and ML lifecycle that will be referenced throughout the book. The second part of this chapter walks through the interview process, from beginning to end. I’ve mentored candidates who appreciated this overview since online resources often focus on specific pieces of the interview but not how they all connect together and result in an offer. Especially for new graduates 2 and readers coming from different industries, this chapter helps get everyone on the same page 

In [6]:
# documents = []

# for chapter in book_raw:
#     chapter_name = chapter['chapter']
#     title = chapter['title']

#     for doc in chapter['content']:
#         new_doc = {
#             'chapter': chapter_name,
#             'title': title,
#             'section': doc['section'],
#             'text': doc['text']
#         }
#         documents.append(new_doc) 

# Setup Elasticsearch connection

#TODO - docker config

### run on the console (linux)

sudo docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3

In [7]:
from elasticsearch import Elasticsearch
es_client = Elasticsearch('http://localhost:9200') 

es_client.info()

ObjectApiResponse({'name': '0c3b70f23821', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'l4X7MPHXRhqR6jR-CbMxdw', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

# Create mappings and Index

Imagine that you need to create a schema. what do you need? I would say the column names, the table name, the type of data you are going to introduce...

The mapping will set this metadata.

In [8]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "standard_analyzer": {
                "type": "standard"
                }
            }
        }
    },
    "mappings": {
    "properties": {        
        "chapter": {
            "type": "text",
        },
        "title": {
            "type": "text",
        },
        "section": {
            "type": "text",
        },
        "text": {
            "type": "text",
            "analyzer": "standard_analyzer"  
        },
        "id":{
            "type": "keyword",
        },
        "text_vector": {
            "type": "dense_vector",
            "dims": 768, # got them above
            "index": True,
            "similarity": "cosine"
        }
    }
}

}

In [9]:
index_name = "ds-interview-questions"

# it is better to delete the index every time when experimenting
es_client.indices.delete(index=index_name, ignore_unavailable=True) 
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'ds-interview-questions'})

### Add documents to the index

In [10]:
for doc in tqdm(documents):
    try:
        es_client.index(index=index_name, document=doc)
    except Exception as e:
        print(f"Error when indexing the document: {e}")

  0%|          | 0/48 [00:00<?, ?it/s]

### Create user query

In [11]:
search_term = "which are the steps of the data science interview process?"
vector_search_term = model.encode(search_term)

### Create search function

In [12]:
def execute_search(query, index=index_name):
    """
    Execute a search query on the specified index.

    Parameters:
        query (dict): The search query to execute.
        index (str): The name of the index to search.

    Returns:
        None: Prints the search results.
    """
    try:
        response = es_client.search(index=index, body=query)
        return response
    except Exception as e:
        print(f"Error during search: {e}")

# Full-text search

In [13]:
def full_text_search(search_term):
    full_text_query = {
        "size": 15,
        "query": {
            "multi_match": {
                "query": search_term,
                "fields": ["text^3", "section", "title"],
                "type": "best_fields"
            }
        }
    }
    
    full_text_results = execute_search(full_text_query)

    return full_text_results


In [14]:
# print("Full Text Search Results:")
# print(full_text_search(search_term)['hits']['hits'][0])

# Semantic search

### Create the dense vector using the pre-trained model

A dense vector typically represents a word, sentence, or document as a fixed-length array of numbers, also known as an embedding. Dense vectors are crucial for Elasticsearch, when we want to perform tasks where understanding the meaning behind the words is more important than just matching exact terms.

In [15]:
operations = []
for doc in documents:
    doc["text_vector"] = model.encode(doc["text"]).tolist()
    operations.append(doc)

In [16]:
operations[1]

{'chapter': 'CHAPTER 1',
 'title': 'Machine Learning Roles and the Interview Process',
 'section': 'A Brief History of Machine Learning and Data Science Job Titles',
 'text': 'First, let’s walk through a brief history of job titles. I decided to start with this section to dispel some myths about the “data scientist” job title and shed some light on why there are so many ML-related job titles. After understanding this history, you should be more aware of what job titles to aim for yourself. If you’ve ever been confused about the litany of titles such as machine learning engineer (MLE), product data sci‐ entist, MLOps engineer, and more, this section is for you. ML techniques aren’t a new thing; in 1985, David Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski popularized the Boltzmann Machine algorithm. 3 Even before that, regression techniques 4 had early developments in the 1800s. There have long been jobs and roles that use modeling techniques to forecast and predict. Econome‐ tri

### Add documents to the index

In [46]:
for doc in operations:
    try:
        es_client.index(index=index_name, body=doc)
    except Exception as e:
        print(f"Error when indexing the document: {e}")

In [47]:
def semantic_search(vector_search_term):
    semantic_query = {
    "size": 15,
    "knn": {
        "field": "text_vector",
        "query_vector": vector_search_term,
        "k": 4,  
        "num_candidates": 10000  
    },
    "_source": ["id", "text", "section", "title"] 
    }

    semantic_results = execute_search(semantic_query)
    return semantic_results



In [48]:
# print("\nSemantic Search Results:")
# print(semantic_search(vector_search_term)['hits']['hits'][0])

# Hybrid Search

Combination of both full-text and vector search

In [61]:
def hybrid_search(search_term, vector_search_term):
    hybrid_query = {
        "size": 15,
        "query": {
            "multi_match": {
                "query": search_term,
                "fields": ["text^3", "section", "title"],
                "type": "best_fields"
            }
        },
        "knn": {
            "field": "text_vector",
            "query_vector": vector_search_term,
            "k": 4,
            "num_candidates": 10000
        }
    }

    # Ejecuta la búsqueda híbrida usando la consulta modificada
    hybrid_results = execute_search(hybrid_query)
    return hybrid_results


In [62]:
# print("Text-Vector Search Results:")
# print(hybrid_search(search_term, vector_search_term)['hits']['hits'][0])

# Evaluation

In [63]:
gt_df = pd.read_csv('../../data/ground_truth_data.csv')

In [64]:
ground_truth =  gt_df.to_dict(orient ='records')

## Metrics

In [65]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

In [66]:
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)

## Evaluation full-text search

In [67]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['text_id']
    text_results = full_text_search(q['question'])
    hits = text_results.get('hits', {}).get('hits', [])
    relevance = [doc['_source']['id'] == doc_id for doc in hits]
    relevance_total.append(relevance)
    

  0%|          | 0/240 [00:00<?, ?it/s]

In [68]:
hit_rate(relevance_total), mrr(relevance_total)

(0.6625, 0.9441591741591752)

## Evalutation Sematinc search

In [69]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['text_id']
    vector = model.encode(q["question"]).tolist()
    text_results = semantic_search(vector)
    hits = text_results.get('hits', {}).get('hits', [])
    relevance = [doc['_source']['id'] == doc_id for doc in hits]
    relevance_total.append(relevance)
    


  0%|          | 0/240 [00:00<?, ?it/s]

In [70]:
hit_rate(relevance_total), mrr(relevance_total)

(0.3958333333333333, 0.509722222222222)

## Evaluation Hybrid search

In [71]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['text_id']
    vector = model.encode(q["question"]).tolist()
    text_results = hybrid_search(search_term, vector)
    hits = text_results.get('hits', {}).get('hits', [])
    relevance = [doc['_source']['id'] == doc_id for doc in hits]
    relevance_total.append(relevance)

  0%|          | 0/240 [00:00<?, ?it/s]

In [72]:
hit_rate(relevance_total), mrr(relevance_total)

(0.10416666666666667, 0.06854816479816478)

Conclusion:

The best search type is full-text search within elasticsearch.

Observations: 

We have to take into consideration that this book is technical yet very general so some keywords could be found in several parts of the book. 

When applying retrieval evaluation we use the ground truth data created by llama2 (5 questions per section) but the content of our book is very heterogeneus i.e. some sections are very generalistic, others mostly introductory, some are also very short or with a lot of diagrams... this can lead in our case to worse evaluation performance. 