# Retrieval approach with Elasticsearch

### Imports

In [1]:
import json

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-mpnet-base-v2")



In [2]:
with open('../../data/parsed_book.json', 'rt') as f_in:
    book_raw = json.load(f_in)

In [3]:
book_raw[0]

{'chapter': 'CHAPTER 1',
 'title': 'Machine Learning Roles and the Interview Process',
 'content': [{'section': 'Overview of This Book',
   'text': 'In the first part of this chapter, I’ll walk through the structure of this book. Then, I’ll discuss the various job titles and roles that use ML skills in industry. 1 I’ll also clarify the responsibilities of various job titles, such as data scientist, machine learning engineer, and so on, as this is a common point of confusion for job seekers. These will be illustrated with an ML skills matrix and ML lifecycle that will be referenced throughout the book. The second part of this chapter walks through the interview process, from beginning to end. I’ve mentored candidates who appreciated this overview since online resources often focus on specific pieces of the interview but not how they all connect together and result in an offer. Especially for new graduates 2 and readers coming from different industries, this chapter helps get everyone on

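Elasticsearch must be running before the client can connect. The following command starts a single-node instance in Docker with security disabled:

sudo docker run -it \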
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3

## Create embeddings using pretrained model

The model used to create the embeddings (`all-mpnet-base-v2`) is listed on the Sentence Transformers pretrained models page:
https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#semantic-search-models

### Flattening the documents

In [4]:
documents = []

for chapter in book_raw:
    chapter_name = chapter['chapter']
    title = chapter['title']

    for doc in chapter['content']:
        new_doc = {
            'chapter': chapter_name,
            'title': title,
            'section': doc['section'],
            'text': doc['text']
        }
        documents.append(new_doc) 

In [5]:
documents[0]

{'chapter': 'CHAPTER 1',
 'title': 'Machine Learning Roles and the Interview Process',
 'section': 'Overview of This Book',
 'text': 'In the first part of this chapter, I’ll walk through the structure of this book. Then, I’ll discuss the various job titles and roles that use ML skills in industry. 1 I’ll also clarify the responsibilities of various job titles, such as data scientist, machine learning engineer, and so on, as this is a common point of confusion for job seekers. These will be illustrated with an ML skills matrix and ML lifecycle that will be referenced throughout the book. The second part of this chapter walks through the interview process, from beginning to end. I’ve mentored candidates who appreciated this overview since online resources often focus on specific pieces of the interview but not how they all connect together and result in an offer. Especially for new graduates 2 and readers coming from different industries, this chapter helps get everyone on the same page 

In [6]:
len(model.encode("Hello I am just checking that you are working properly"))

768

### Creating the dense vector using the pre-trained model

A dense vector represents a word, sentence, or document as a fixed-length array of numbers, also known as an embedding. Elasticsearch uses dense vectors for semantic search, where understanding the meaning behind the words matters more than matching exact terms.
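As a rough illustration of why dense vectors capture meaning, the sketch below compares toy 3-dimensional vectors with cosine similarity, the same measure set in the index mapping further down. The vectors and the `cosine_similarity` helper are made up for this example; real embeddings from the model have 768 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by the model)
interview_steps = [0.9, 0.1, 0.0]
hiring_process = [0.8, 0.2, 0.1]
cooking_recipe = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(interview_steps, hiring_process)
sim_unrelated = cosine_similarity(interview_steps, cooking_recipe)

# Vectors pointing in similar directions score higher
print(sim_related > sim_unrelated)
```

Texts with similar meaning end up with vectors that point in similar directions, so their cosine similarity is close to 1.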

In [7]:
operations = []
for doc in documents:
    doc["text_vector"] = model.encode(doc["text"]).tolist()
    operations.append(doc)

In [8]:
operations[1]

{'chapter': 'CHAPTER 1',
 'title': 'Machine Learning Roles and the Interview Process',
 'section': 'A Brief History of Machine Learning and Data Science Job Titles',
 'text': 'First, let’s walk through a brief history of job titles. I decided to start with this section to dispel some myths about the “data scientist” job title and shed some light on why there are so many ML-related job titles. After understanding this history, you should be more aware of what job titles to aim for yourself. If you’ve ever been confused about the litany of titles such as machine learning engineer (MLE), product data sci‐ entist, MLOps engineer, and more, this section is for you. ML techniques aren’t a new thing; in 1985, David Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski popularized the Boltzmann Machine algorithm. 3 Even before that, regression techniques 4 had early developments in the 1800s. There have long been jobs and roles that use modeling techniques to forecast and predict. Econome‐ tri

# Set up the Elasticsearch connection

In [9]:
from elasticsearch import Elasticsearch
es_client = Elasticsearch('http://localhost:9200') 

es_client.info()

ObjectApiResponse({'name': 'c4b9a141cb9f', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'zvY8omxTS82s8jtkdJwk3w', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

# Creating the mappings and index

Imagine that you need to create a database schema. What do you need? At least the table name, the column names, and the type of data each column will hold.

In Elasticsearch, the mapping sets this metadata: it declares the fields of the index and their types.

In [10]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
            },
            "section": {
                "type": "keyword",
            },
            "chapter": {
                "type": "text",
            },
            "title": {
                "type": "text",
            },
            "text_vector": {
                "type": "dense_vector",
                "dims": 768,  # got them above
                "index": True,
                "similarity": "cosine"
            }
        }
    }
}
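Elasticsearch rejects a document whose vector length differs from the `dims` value in the mapping, so a quick check before indexing can save a failed round trip. The helper below is a minimal sketch; `check_vector_dims` and the sample data are hypothetical names used only for illustration, with 3 dimensions instead of 768 for brevity.

```python
def check_vector_dims(docs, mapping, field="text_vector"):
    """Return positions of documents whose vector is missing or has the wrong length."""
    expected = mapping["mappings"]["properties"][field]["dims"]
    return [i for i, doc in enumerate(docs) if len(doc.get(field, [])) != expected]

# Tiny stand-in mapping and documents (hypothetical values)
sample_mapping = {
    "mappings": {
        "properties": {
            "text_vector": {"type": "dense_vector", "dims": 3}
        }
    }
}
sample_docs = [
    {"text": "ok", "text_vector": [0.1, 0.2, 0.3]},
    {"text": "too short", "text_vector": [0.1, 0.2]},
    {"text": "missing vector"},
]

bad = check_vector_dims(sample_docs, sample_mapping)
print(bad)  # positions of documents with a missing or wrongly sized vector
```

In this notebook the equivalent call would be `check_vector_dims(documents, index_settings)`, run after the vectors have been computed.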


In [11]:
index_name = "ml-interview-questions"

# delete any existing index first so that each experiment starts from a clean state
es_client.indices.delete(index=index_name, ignore_unavailable=True) 
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'ml-interview-questions'})

## Add documents to the Index

In [12]:
for doc in documents:
    try:
        es_client.index(index=index_name, body=doc)
    except Exception as e:
        print(f"Error when indexing the document: {e}")
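Indexing one document per request is fine for a single book, but the Python client also provides `helpers.bulk`, which sends many documents in one request. The sketch below only builds the action dictionaries that `helpers.bulk` accepts; `build_bulk_actions` is a hypothetical helper, the sample documents are placeholders, and the actual call is left as a comment because it needs a running cluster.

```python
def build_bulk_actions(docs, index_name):
    """Yield one bulk action per document, targeting the given index."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}

# Placeholder documents standing in for the flattened book sections
sample_docs = [
    {"section": "Overview of This Book", "text": "..."},
    {"section": "Interview Preparation Checklist", "text": "..."},
]

actions = list(build_bulk_actions(sample_docs, "ml-interview-questions"))
print(len(actions), actions[0]["_index"])

# With a running cluster (assumed, not executed here):
# from elasticsearch import helpers
# helpers.bulk(es_client, build_bulk_actions(documents, index_name))
```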

## Create user query

In [13]:
search_term = "what are the steps of ML interviews?"
vector_search_term = model.encode(search_term)

In [14]:
res = es_client.search(
    index="ml-interview-questions",
    body={
        "size": 5,  
        "knn": {
            "field": "text_vector",  
            "query_vector": vector_search_term,
            "k": 5,  
            "num_candidates": 1000  
        },
        "_source": ["text", "section", "title", "chapter"]
    }
)
res['hits']['hits']

[]

# Semantic Search with Elasticsearch

In [17]:
knn_query = {
    "field": "text_vector",
    "query_vector": vector_search_term,
    "k": 5,
    "num_candidates": 10000
}

In [18]:
response = es_client.search(
    index=index_name,
    body={
        "size": 5,
        # reuse the kNN clause defined in the previous cell
        "knn": knn_query,
        # an empty bool query matches every document;
        # add must/filter clauses here to narrow the results
        "query": {
            "bool": {}
        }
    }
)

response['hits']['hits']


[{'_index': 'ml-interview-questions',
  '_id': 'g8XylZIByzqUH6XnCljF',
  '_score': 1.8904558,
  '_source': {'chapter': 'CHAPTER 8',
   'title': 'Tying It All Together: Your Interview Roadmap',
   'section': 'Interview Preparation Checklist',
   'text': 'Now that you’ve gone through the entire ML interview process, it’s time to create a plan. In Chapters 1 and 2 , you learned about the many types of ML jobs and did a self-assessment of which one(s) might be more suitable for you. Based on that, you also learned about the skills you are expected to be stronger in. In the subsequent chapters, you learned about what types of questions are commonly asked in inter‐ views. Are there any types that you need to prepare more for? The goal of this book is for you to start bridging the gap, not just read about bridging the gap. To succeed in interviews and land the job, taking action will help you—not just thinking about taking action. Follow this checklist to create a plan for your interview proc

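#### Score == 2.72 (1.89 in the run above). With semantic search in Elasticsearch the score can be greater than 1, because the `knn` score and the `query` score are added together.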
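To make the score arithmetic concrete: for a `dense_vector` field with `cosine` similarity, Elasticsearch documents the kNN `_score` as `(1 + cosine) / 2`, which stays between 0 and 1, and when a `knn` clause is combined with a `query`, the two scores are summed. The sketch below reproduces that arithmetic. The function names, the cosine value, and the assumption that the empty `bool` query contributes 1.0 to every hit are illustrative and not taken from the notebook output.

```python
def knn_score_from_cosine(cosine):
    """Map a cosine similarity in [-1, 1] to a kNN score in [0, 1]."""
    return (1.0 + cosine) / 2.0

def hybrid_score(knn_score, query_score, knn_boost=1.0, query_boost=1.0):
    """Combine kNN and query scores as a boosted sum."""
    return knn_boost * knn_score + query_boost * query_score

cosine = 0.78  # hypothetical similarity between the query and a document
knn_part = knn_score_from_cosine(cosine)         # 0.89
total = hybrid_score(knn_part, query_score=1.0)  # assumed match-all score of 1.0

print(round(knn_part, 2), round(total, 2))
```

Under these assumptions the kNN part alone never exceeds 1, so any score above 1 comes from the `query` part of the request.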