# This Notebook aims to explain the code step by step for easy reference later on

### Setup step for Docker:
- Run this command in your terminal to start Elasticsearch in Docker so that we can connect to it locally:
  
      docker run -it \
        --rm \
        --name elasticsearch \
        -p 9200:9200 \
        -p 9300:9300 \
        -e "discovery.type=single-node" \
        -e "xpack.security.enabled=false" \
        docker.elastic.co/elasticsearch/elasticsearch:8.4.3
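
The container usually needs a little time before it accepts connections. A minimal sketch of a readiness check is shown below. `wait_until_ready` is a hypothetical helper (not part of any library); it only assumes you pass it a callable that returns `True` once the service answers, for example the client's `ping` method.

```python
import time


def wait_until_ready(ping, retries=30, delay=1.0, sleep=time.sleep):
    """Call `ping()` until it returns True or `retries` attempts are used up.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, retries + 1):
        try:
            if ping():
                return attempt
        except Exception:
            # Treat connection errors as "not ready yet" and keep trying.
            pass
        sleep(delay)
    raise TimeoutError(f"service not ready after {retries} attempts")


# Possible usage once the client from section 3 exists (assumption):
# wait_until_ready(es_client.ping)
```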


In [13]:
# If an error occurs while installing sentence_transformers,
# uninstall the two packages below first, then rerun the install:
# !pip uninstall numpy
# !pip uninstall torch
!pip install sentence_transformers==2.7.0  numpy==1.26.4 torch elasticsearch

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting elasticsearch
  Downloading elasticsearch-8.14.0-py3-none-any.whl.metadata (7.2 kB)
Collecting elastic-transport<9,>=8.13 (from elasticsearch)
  Downloading elastic_transport-8.13.1-py3-none-any.whl.metadata (3.7 kB)
Downloading elasticsearch-8.14.0-py3-none-any.whl (480 kB)
Downloading elastic_transport-8.13.1-py3-none-any.whl (64 kB)
Installing collected packages: elastic-transport, elasticsearch
Successfully installed elastic-transport-8.13.1 elasticsearch-8.14.0

In [15]:
import json
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch

### 1. Prepare the documents 

In [6]:
with open("documents.json", "rt") as f_in:
    docs_raw = json.load(f_in)

In [7]:
# Elasticsearch expects flat documents, so we flatten the nested structure.
# docs_raw currently looks like:
# [{"course": "...", "documents": [...]}, {"course": "...", "documents": [...]}, ...]

documents = []

for course_dict in docs_raw:
    for doc in course_dict["documents"]:
        # add the course name to each document's dictionary
        # under a new key called "course"
        doc["course"] = course_dict["course"]
        documents.append(doc)

# Just for the purpose of understanding the final format of the documents 
# it is now a list of dicts that has the 4 keys seen below 
for key in documents[0].keys():
    print(f"{key} : {documents[0][key]} \n\n")


text : The purpose of this document is to capture frequently asked technical questions
The exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1
Subscribe to course public Google Calendar (it works from Desktop only).
Register before the course starts using this link.
Join the course Telegram channel with announcements.
Don’t forget to register in DataTalks.Club's Slack and join the channel. 


section : General course-related questions 


question : Course - When will the course start? 


course : data-engineering-zoomcamp 
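
The same flattening can be tried on a tiny made-up dataset to see the shape change. This is only a sketch: `flatten_courses` and the sample records are illustrative and not taken from `documents.json`. Unlike the loop above, this version builds new dicts instead of modifying the input in place.

```python
def flatten_courses(docs_raw):
    """Return one flat list of FAQ dicts, each tagged with its course name."""
    return [
        {**doc, "course": course["course"]}
        for course in docs_raw
        for doc in course["documents"]
    ]


# Made-up sample with the same nesting as docs_raw
sample = [
    {"course": "course-a",
     "documents": [{"question": "Q1", "text": "A1", "section": "S"}]},
    {"course": "course-b",
     "documents": [{"question": "Q2", "text": "A2", "section": "S"},
                   {"question": "Q3", "text": "A3", "section": "S"}]},
]

flat = flatten_courses(sample)
print(len(flat))           # 3
print(flat[0]["course"])   # course-a
```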




### 2. Create Embeddings using Pretrained Models

In [8]:
# sentence_transformers lets us load and use pretrained embedding models
model = SentenceTransformer("all-mpnet-base-v2")

In [9]:
# Run inference: encode() returns the dense vector (embedding) of the sentence
model.encode("This is a simple sentence")

array([ 4.44875564e-03, -7.61314258e-02, -3.77468328e-04,  7.52523402e-03,
       -3.80979776e-02,  3.80131453e-02, -9.73008294e-03, -5.05397702e-03,
       -9.37976502e-03,  1.23887584e-02,  4.91276123e-02,  1.52210230e-02,
        3.80008705e-02, -6.41802400e-02,  9.42127407e-03, -5.19749001e-02,
        9.08066332e-02,  1.71115622e-02,  1.62125528e-02,  2.98865885e-02,
        1.50541600e-03,  8.35078582e-03,  3.78841944e-02, -1.01192761e-02,
        6.46108761e-03,  3.97424155e-05, -1.45217031e-02, -1.88468415e-02,
       -3.74039710e-02, -1.51667662e-03, -1.02680055e-02, -3.68062854e-02,
        2.36677658e-02, -6.46023452e-02,  1.96967039e-06, -5.01107657e-03,
       -2.80828192e-03, -1.92073956e-02, -8.65119696e-02,  2.83465385e-02,
       -5.38667664e-02,  3.63705941e-02, -2.26468481e-02,  2.87367962e-02,
       -1.32342121e-02,  1.08689629e-01,  3.70518453e-02,  3.38802189e-02,
       -5.30679226e-02,  3.61782461e-02, -1.35725585e-03, -3.63483503e-02,
       -2.78346464e-02, -
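
Dense vectors are useful because texts with similar meaning end up with vectors that point in similar directions. A common way to measure this is cosine similarity. The sketch below uses plain Python and tiny made-up 2-dimensional vectors instead of real 768-dimensional embeddings; `cosine_similarity` is an illustrative helper, not a library function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v1 = [1.0, 0.0]
v2 = [2.0, 0.0]   # same direction as v1, different length
v3 = [0.0, 3.0]   # orthogonal to v1

print(cosine_similarity(v1, v2))  # 1.0
print(cosine_similarity(v1, v3))  # 0.0
```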

In [11]:
# create the embeddings for our dataset
dense_vectors = []

for doc in documents:
    # .tolist() converts the NumPy array into a JSON-serializable list
    doc["text_vector"] = model.encode(doc["text"]).tolist()
    dense_vectors.append(doc)

### 3. Set Up the Elasticsearch Connection

In [17]:
es_client = Elasticsearch("http://localhost:9200")
es_client.info()

ObjectApiResponse({'name': '44e3d539d666', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'NxoZwPyDTqu4Y1GDK0Ik8A', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

### 4. Create Mappings & Index

- Mapping: the process of defining how a document and the fields it contains are stored and indexed.

In [24]:
len(documents[0]["text_vector"])

768

In [25]:
# In order to create an index, you first need to define a mapping.
# Think of it like a database: there you provide metadata
# (e.g. column names, types, etc.) in order to create a schema.
# Likewise, here you define mappings that hold all of that metadata.

index_settings= {
    "settings":{
        "number_of_shards":1,
        "number_of_replicas":0
    },
    "mappings":{
        "properties":{
            "text":{"type":"text"},
            "section":{"type":"text"},
            "question":{"type":"text"},
            "course":{"type":"text"},
            "text_vector":{"type":"dense_vector",
                          "dims":768,
                          "index":True,
                          "similarity":"cosine"},
        }
    }  
}
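
Two details of the mapping are easy to get wrong: `dims` must equal the embedding length (768 for this model), and exact-match `term` filters only work on `keyword` fields, not on analyzed `text` fields. The sketch below builds the same settings from parameters. `build_index_settings` is a hypothetical helper, and mapping `course` as `keyword` is an assumption that differs from the mapping used in this notebook.

```python
def build_index_settings(dims, keyword_fields=("course",)):
    """Build index settings whose vector size and exact-match fields are explicit."""
    properties = {
        "text": {"type": "text"},
        "section": {"type": "text"},
        "question": {"type": "text"},
        "course": {"type": "text"},
        "text_vector": {
            "type": "dense_vector",
            "dims": dims,
            "index": True,
            "similarity": "cosine",
        },
    }
    # Fields that should support exact-match `term` filters become `keyword`.
    for field in keyword_fields:
        properties[field] = {"type": "keyword"}
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {"properties": properties},
    }


settings = build_index_settings(dims=768)
print(settings["mappings"]["properties"]["course"])  # {'type': 'keyword'}
```

With the model loaded, the dimension could be read from `model.get_sentence_embedding_dimension()` instead of being hard-coded.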

In [27]:
# create the index using the settings and mappings defined above
index_name = "course-questions"
es_client.indices.delete(index=index_name, ignore_unavailable=True)  # delete the index if it already exists
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

### 5. Add documents into index 


In [28]:
for doc in dense_vectors:
    try:
        es_client.index(index=index_name , document=doc)
    except Exception as e:
        print(e)
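
Indexing one document per request is fine for a small dataset. For larger ones, the client's `elasticsearch.helpers.bulk` function can send many documents per request. The sketch below shows one way to prepare the actions; `make_bulk_actions` and `bulk_index` are illustrative names, and the import is done lazily so the generator can be tried without a running cluster.

```python
def make_bulk_actions(docs, index_name):
    """Yield one bulk action per document, targeting `index_name`."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es_client, docs, index_name):
    """Index all docs with the bulk helper; returns (success_count, errors)."""
    from elasticsearch import helpers  # imported here so the sketch loads without the package
    return helpers.bulk(es_client, make_bulk_actions(docs, index_name))


actions = list(make_bulk_actions([{"text": "hello"}], "course-questions"))
print(actions)  # [{'_index': 'course-questions', '_source': {'text': 'hello'}}]
```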

### 6. Create the end-user query

- Now the entire dataset is in the vector store
- Next we walk through the stages that follow when a user enters a query/question

In [30]:
# First, generate the embedding for the search query
# with the same model that encoded the documents
user_query = "windows or mac?"
vector_search_term = model.encode(user_query)

In [33]:
# build the kNN query
# field: the document field that holds the vectors to search
# k: number of nearest documents to return
# num_candidates: number of candidate documents examined on each shard
#                 before the top k are chosen
query = {
    "field": "text_vector",
    "query_vector": vector_search_term,
    "k": 5,
    "num_candidates": 10000,
}
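
To make `k` concrete: the search keeps only the `k` most similar documents. For a `dense_vector` field with `cosine` similarity, Elasticsearch reports `_score = (1 + cosine) / 2`, so scores fall between 0 and 1. The sketch below reproduces that selection on made-up similarity values. `top_k_hits` is an illustrative helper, and it ranks every document exactly, whereas Elasticsearch runs an approximate search that examines only `num_candidates` candidates per shard.

```python
import heapq


def top_k_hits(similarities, k):
    """Pick the k most similar documents and convert cosine to an ES-style score.

    similarities: mapping of doc_id -> cosine similarity with the query vector.
    """
    best = heapq.nlargest(k, similarities.items(), key=lambda item: item[1])
    return [(doc_id, (1 + sim) / 2) for doc_id, sim in best]


# Made-up cosine similarities between a query and three documents
sims = {"a": 1.0, "b": 0.0, "c": -1.0}
print(top_k_hits(sims, k=2))  # [('a', 1.0), ('b', 0.5)]
```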

In [50]:
# we are ready to search for the user query in the vector index
# source: the fields to return for each hit
res = es_client.search(index=index_name, knn=query,
                       source=["text", "section", "course"])

res['hits']['hits']  # the matching documents are stored here

[{'_index': 'course-questions',
  '_id': '70fjopABIeV2n38tj-XK',
  '_score': 0.7147919,
  '_source': {'course': 'data-engineering-zoomcamp',
   'section': 'General course-related questions',
   'text': 'Yes! Linux is ideal but technically it should not matter. Students last year used all 3 OSes successfully'}},
 {'_index': 'course-questions',
  '_id': 'AkfjopABIeV2n38t1-m3',
  '_score': 0.61347336,
  '_source': {'course': 'mlops-zoomcamp',
   'section': 'Module 1: Introduction',
   'text': 'If you wish to use WSL on your windows machine, here are the setup instructions:\nCommand: Sudo apt install wget\nGet Anaconda download address here. wget <download address>\nTurn on Docker Desktop WFree Download | AnacondaSL2\nCommand: git clone <github repository address>\nVSCODE on WSL\nJupyter: pip3 install jupyter\nAdded by Gregory Morris (gwm1980@gmail.com)\nAll in all softwares at one shop:\nYou can use anaconda which has all built in services like pycharm, jupyter\nAdded by Khaja Zaffer (kha

### 7. Perform Semantic Search & Advanced Search

In [64]:
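# You can add a filter on any field you need,
# and you can also run a regular keyword query.

# NOTE: THIS IS NOT A SEMANTIC SEARCH BUT A KEYWORD SEARCH,
# because the query text is not encoded into a vector.
# The `term` filter compares the exact, unanalyzed value. `course` is mapped
# as `text`, so its value is stored as separate tokens and the filter matches
# nothing (hence the empty result). Mapping `course` as `keyword` makes the
# filter work.
response = es_client.search(
    index=index_name,
    query={
        "bool": {
            "must": {
                "multi_match": {
                    "query": "windows or python?",
                    # the documents have no "title" field, so it is not listed
                    "fields": ["text", "question", "course"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {"course": "data-engineering-zoomcamp"}
            }
        }
    }
)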

response

ObjectApiResponse({'took': 4, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 0, 'relation': 'eq'}, 'max_score': None, 'hits': []}})

In [68]:
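# Now we run an advanced search that combines a keyword query
# with the kNN (semantic) query

query = {
    "field": "text_vector",
    "query_vector": vector_search_term,
    "k": 5,
    "num_candidates": 10000,
}

# The result must be assigned to `response`; otherwise the last line
# would show the hits of the previous keyword search.
response = es_client.search(
    index=index_name,
    query={
        "match": {"course": "data-engineering-zoomcamp"}
    },
    knn=query,
    size=5,
    explain=True
)

response["hits"]["hits"]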
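
Another way to restrict a semantic search to one course is to put the filter inside the kNN clause itself, so only documents from that course are considered as neighbours. The sketch below only builds the request body. `build_filtered_knn` is a hypothetical helper, and the `term` filter assumes `course` is mapped as `keyword`, which is not the mapping used in this notebook.

```python
def build_filtered_knn(query_vector, course, k=5, num_candidates=10000):
    """Build a kNN clause that only considers documents of one course."""
    return {
        "field": "text_vector",
        # plain floats keep the body JSON-serializable
        "query_vector": [float(x) for x in query_vector],
        "k": k,
        "num_candidates": num_candidates,
        # assumes `course` is a `keyword` field (exact match)
        "filter": {"term": {"course": course}},
    }


knn_query = build_filtered_knn([0.1, 0.2, 0.3], "data-engineering-zoomcamp")
print(knn_query["filter"])  # {'term': {'course': 'data-engineering-zoomcamp'}}

# Possible usage with the client from section 3 (assumption):
# es_client.search(index=index_name, knn=knn_query,
#                  source=["text", "section", "question", "course"])
```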