Step 1- Prepare the documents

In [2]:
import json
with open('documents.json','rt') as f_in:
    docs_raw=json.load(f_in)

In [3]:
docs_raw

[{'course': 'data-engineering-zoomcamp',
  'documents': [{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
    'section': 'General course-related questions',
    'question': 'Course - When will the course start?'},
   {'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
    'section': 'General course-related questions',
    'question': 'Course - What are the prerequisites for this course?'},
   {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in

In [4]:
documents=[]

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course']=course_dict['course']
        documents.append(doc)

documents[1]

{'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
 'section': 'General course-related questions',
 'question': 'Course - What are the prerequisites for this course?',
 'course': 'data-engineering-zoomcamp'}
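
The flattening loop above can also be written as a small function that copies each record instead of mutating docs_raw in place. This is only a sketch: the name flatten_courses and the sample data are hypothetical and not part of the original notebook.

```python
def flatten_courses(docs_raw):
    """Return one flat list of documents, each tagged with its course name."""
    documents = []
    for course_dict in docs_raw:
        for doc in course_dict["documents"]:
            # Copy the record so the nested input structure stays untouched.
            documents.append({**doc, "course": course_dict["course"]})
    return documents


# Tiny made-up input, used only to illustrate the shape of the data.
sample = [
    {
        "course": "data-engineering-zoomcamp",
        "documents": [
            {"text": "t1", "section": "s1", "question": "q1"},
            {"text": "t2", "section": "s1", "question": "q2"},
        ],
    }
]
flat = flatten_courses(sample)
print(flat[1]["course"])
```

Copying keeps docs_raw reusable, for example if the cell is re-run or the raw structure is needed later.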

Step 2- Create embeddings using pretrained models

In [5]:
from sentence_transformers import SentenceTransformer


In [6]:
model=SentenceTransformer('multi-qa-distilbert-cos-v1')

In [7]:
len(model.encode('a very simple sentence'))

768

In [12]:
operations=[]
for doc in documents:
    doc['text_vector']=model.encode(doc['text']).tolist()
    operations.append(doc)

Step 3- Set up Elasticsearch connection

In [13]:
from elasticsearch import Elasticsearch

es_client=Elasticsearch('http://localhost:9200')

es_client.info()

ObjectApiResponse({'name': '255251115fe4', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'PzRk58DrTgK7QNPLEXGoIw', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

Step 4- Create mappings and index

In [14]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "text_vector": {"type":"dense_vector","dims":768,"index":True,"similarity":"cosine"} 
        }
    }
}
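
The dims value in the mapping has to match the length of the vectors the model produces (768 for the model used here). A minimal sketch of building the same settings from a dimension argument follows; build_index_settings is a hypothetical helper. With sentence_transformers the dimension could be read from model.get_sentence_embedding_dimension() instead of being hard-coded.

```python
def build_index_settings(dims, similarity="cosine"):
    """Build index settings whose dense_vector size matches the embedding model."""
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "section": {"type": "text"},
                "question": {"type": "text"},
                "course": {"type": "keyword"},
                "text_vector": {
                    "type": "dense_vector",
                    "dims": dims,  # must equal the model's embedding length
                    "index": True,
                    "similarity": similarity,
                },
            }
        },
    }


settings_768 = build_index_settings(768)
print(settings_768["mappings"]["properties"]["text_vector"]["dims"])
```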

In [15]:
index_name='course_questions'

es_client.indices.delete(index=index_name,ignore_unavailable=True)
es_client.indices.create(index=index_name,body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course_questions'})

Step 5- Add documents to index

In [16]:
for doc in operations:
    try:
        es_client.index(index=index_name,document=doc)
    except Exception as e:
        print(e)
        
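
Indexing one document per request works for a small FAQ but gets slow for larger collections. The elasticsearch Python client ships a bulk helper (elasticsearch.helpers.bulk) that accepts an iterable of actions. The sketch below only builds those actions, so it runs without a cluster. build_bulk_actions and the sample documents are hypothetical, and the commented call assumes the es_client, operations and index_name defined earlier.

```python
def build_bulk_actions(docs, index_name):
    """Yield one bulk 'index' action per document."""
    for doc in docs:
        # "_index" names the target index; "_source" carries the document body.
        yield {"_index": index_name, "_source": doc}


# Made-up documents, only to show the shape of the actions.
sample_docs = [
    {"text": "a", "text_vector": [0.1, 0.2]},
    {"text": "b", "text_vector": [0.3, 0.4]},
]
actions = list(build_bulk_actions(sample_docs, "course_questions"))
print(len(actions))

# With a running cluster (assumption), the actions could be sent like this:
# from elasticsearch.helpers import bulk
# bulk(es_client, build_bulk_actions(operations, index_name))
```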

Step 6- Create end user query

In [17]:
search_term='I just discovered the course. Can I still join it?'
vector_Search_term=model.encode(search_term)

In [18]:
query={
    "field":"text_vector",
    "query_vector":vector_Search_term,
    "k":5,
    "num_candidates":10000
}

In [19]:
res=es_client.search(index=index_name,knn=query,source=['text','section','question','course'])
res['hits']['hits']

[{'_index': 'course_questions',
  '_id': 'RJd_zJABxMuSB7Nk15Ek',
  '_score': 0.73748606,
  '_source': {'question': 'When does the next iteration start?',
   'course': 'machine-learning-zoomcamp',
   'section': 'General course-related questions',
   'text': 'The course is available in the self-paced mode too, so you can go through the materials at any time. But if you want to do it as a cohort with other students, the next iterations will happen in September 2023, September 2024 (and potentially other Septembers as well).'}},
 {'_index': 'course_questions',
  '_id': 'Opd_zJABxMuSB7Nk1JHM',
  '_score': 0.73611045,
  '_source': {'question': "I filled the form, but haven't received a confirmation email. Is it normal?",
   'course': 'machine-learning-zoomcamp',
   'section': 'General course-related questions',
   'text': "The process is automated now, so you should receive the email eventually. If you haven’t, check your promotions tab in Gmail as well as spam.\nIf you unsubscribed from our
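
For a dense_vector field indexed with cosine similarity, Elasticsearch reports the kNN _score as (1 + cosine) / 2, so these scores fall between 0 and 1. A minimal sketch of converting a score back to the raw cosine similarity follows. The helper names are hypothetical, and the pure-Python cosine function is for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_score_from_cosine(cos):
    """Score Elasticsearch assigns to a kNN hit under cosine similarity."""
    return (1.0 + cos) / 2.0


def cosine_from_knn_score(score):
    """Invert the mapping above to recover the raw cosine similarity."""
    return 2.0 * score - 1.0


# The top hit above has _score 0.73748606.
print(cosine_from_knn_score(0.73748606))
```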

Step 7- Perform Semantic Search & Advanced Search

In [19]:
# Keyword search, restricted to one course by a term filter
response=es_client.search(
    index=index_name,
    query={
        "bool":{
            "must":{
                "multi_match":
                {
                    "query":"windows or python?",
                    # "title" is not in the index mapping, so it matches nothing
                    "fields":["text","question","course","title"],
                    "type":"best_fields"
                }
            },
            "filter":{
                "term":{
                    "course":"data-engineering-zoomcamp"
                }
            }
        }
    }
)

In [20]:
response['hits']['hits']

[{'_index': 'course_questions',
  '_id': 'N5ckx5ABxMuSB7NkgIzZ',
  '_score': 7.728908,
  '_source': {'text': 'Problem: If you have already installed pgcli but bash doesn\'t recognize pgcli\nOn Git bash: bash: pgcli: command not found\nOn Windows Terminal: pgcli: The term \'pgcli\' is not recognized…\nSolution: Try adding a Python path C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\Scripts to Windows PATH\nFor details:\nGet the location: pip list -v\nCopy C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\site-packages\n3. Replace site-packages with Scripts: C:\\Users\\...\\AppData\\Roaming\\Python\\Python39\\Scripts\nIt can also be that you have Python installed elsewhere.\nFor me it was under c:\\python310\\lib\\site-packages\nSo I had to add c:\\python310\\lib\\Scripts to PATH, as shown below.\nPut the above path in "Path" (or "PATH") in System Variables\nReference: https://stackoverflow.com/a/68233660',
   'section': 'Module 1: Docker and Terraform',
   'question': 'PGCLI - pg

The results now come only from data-engineering-zoomcamp, because the term filter on course excludes every other course.

In [24]:
knn_query={
    "field":"text_vector",
    "query_vector":vector_Search_term,
    "k":5,
    "num_candidates":10000
}

response=es_client.search(
    index=index_name,
    query={
        "match":{
            "course":"data-engineering-zoomcamp"
        }
    },
    knn=knn_query,
    size=5
)

In [25]:
response['hits']['hits']

[{'_index': 'course_questions',
  '_id': '8Zckx5ABxMuSB7NkcItS',
  '_score': 1.4937057,
  '_source': {'text': 'Yes! Linux is ideal but technically it should not matter. Students last year used all 3 OSes successfully',
   'section': 'General course-related questions',
   'question': 'Environment - Is the course [Windows/mac/Linux/...] friendly?',
   'course': 'data-engineering-zoomcamp',
   'text_vector': [-0.02696547470986843,
    -0.0006259630899876356,
    -0.016629502177238464,
    0.05285143107175827,
    0.054765358567237854,
    -0.0313398577272892,
    0.029942670837044716,
    -0.048085663467645645,
    0.044675469398498535,
    0.005839445628225803,
    0.01623309962451458,
    0.012001149356365204,
    -0.031222347170114517,
    0.01660061441361904,
    -0.04886896535754204,
    -0.06496300548315048,
    0.046434175223112106,
    -0.009297670796513557,
    -0.06425278633832932,
    -0.013732721097767353,
    -0.015976209193468094,
    0.008629568852484226,
    -0.02447899803

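This is a hybrid search: the query embedding is passed in the knn clause together with a match query on course. Elasticsearch combines the two as a disjunction and adds their scores, which is why the top hit scores above 1. The match clause therefore boosts documents from data-engineering-zoomcamp rather than strictly excluding other courses. A strict restriction requires a filter inside the knn clause itself.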
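
A knn clause can carry its own filter, which Elasticsearch applies during the approximate kNN search so that only documents from one course are considered. A minimal sketch follows. build_filtered_knn and the short sample vector are hypothetical, and the commented call assumes the es_client, index_name and vector_Search_term defined earlier.

```python
def build_filtered_knn(query_vector, course, k=5, num_candidates=10000):
    """Build a knn clause that only considers documents from one course."""
    return {
        "field": "text_vector",
        # Convert to plain floats so the clause is JSON-serializable as-is.
        "query_vector": [float(x) for x in query_vector],
        "k": k,
        "num_candidates": num_candidates,
        # Applied during the kNN search, not as a separate scoring query.
        "filter": {"term": {"course": course}},
    }


# Short made-up vector; a real query vector would have 768 entries.
knn_clause = build_filtered_knn([0.1, 0.2, 0.3], "data-engineering-zoomcamp")
print(knn_clause["filter"])

# With a running cluster (assumption), the search could look like this:
# es_client.search(
#     index=index_name,
#     knn=build_filtered_knn(vector_Search_term, "data-engineering-zoomcamp"),
#     source=["text", "section", "question", "course"],
#     size=5,
# )
```

Listing the wanted fields in source also keeps the long text_vector array out of the returned hits.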