Q1. Running Elastic
Run Elasticsearch 8.4.3 and get the cluster information. If you run it on localhost, this is how you do it:

curl localhost:9200
What's the version.build_hash value?
Answer: "build_hash" : "42f05b9372a9a4a470db3b52817899b99a76ee73"
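The same lookup can be done from Python instead of curl. This is a minimal sketch, assuming Elasticsearch listens on http://localhost:9200 without authentication; `extract_build_hash` and `fetch_build_hash` are illustrative helper names, not part of any library.

```python
import json
from urllib.request import urlopen


def extract_build_hash(cluster_info: dict) -> str:
    """Return version.build_hash from the root endpoint's JSON response."""
    return cluster_info["version"]["build_hash"]


def fetch_build_hash(base_url: str = "http://localhost:9200", timeout: float = 5.0) -> str:
    """Query the cluster root endpoint (assumed unauthenticated) and return the build hash."""
    with urlopen(base_url, timeout=timeout) as response:
        cluster_info = json.load(response)
    return extract_build_hash(cluster_info)


# Usage (needs a running cluster):
#   print(fetch_build_hash())
```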

In [1]:
import requests 

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [2]:
pip install requests


[notice] A new release of pip is available: 24.0 -> 24.1
[notice] To update, run: python3 -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


Q2. Indexing the data
Index the data in the same way as was shown in the course videos. Make the course field a keyword and the rest should be text.

Don't forget to install the Elasticsearch client for Python:

pip install elasticsearch
Which function do you use for adding your data to Elasticsearch?

insert
index
put
add

Answer: index

In [3]:
pip install elasticsearch


In [13]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200') 



In [14]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"

es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [15]:
documents[0]


{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [16]:
from tqdm.auto import tqdm




In [17]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|████████████████████████████████████████████████| 948/948 [00:22<00:00, 42.63it/s]
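Indexing one document per request took about 22 seconds in the run above. As an alternative, the Python client ships a `bulk` helper in `elasticsearch.helpers` that batches documents into fewer requests. This is a minimal sketch, assuming the `es_client`, `index_name` and `documents` defined earlier; `build_bulk_actions` and `bulk_index` are hypothetical helper names introduced here.

```python
from typing import Iterable, Iterator


def build_bulk_actions(documents: Iterable[dict], index_name: str) -> Iterator[dict]:
    """Wrap each document in the action format expected by the bulk helper."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es_client, documents: Iterable[dict], index_name: str) -> int:
    """Send all documents through the bulk helper and return how many were indexed."""
    # Imported lazily so the action builder works without the client installed.
    from elasticsearch.helpers import bulk

    success_count, _errors = bulk(es_client, build_bulk_actions(documents, index_name))
    return success_count


# Usage (needs a running cluster):
#   bulk_index(es_client, documents, index_name)
```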


In [22]:
search_query = {
    "query": {
        "multi_match": {
            "query": "How do I execute a command in a running docker container?",
            "fields": ["question^4", "text"],
            "type": "best_fields"
        }
    }
}

# Execute the search query
response = es_client.search(index=index_name, body=search_query)

# Get the score of the top-ranking result
top_score = response['hits']['hits'][0]['_score']
print(f"Top score: {top_score}")

Top score: 84.050095


Q3. Searching
Now let's search in our index.

We will execute a query "How do I execute a command in a running docker container?".

Use only question and text fields and give question a boost of 4, and use "type": "best_fields".

What's the score for the top ranking result?

94.05
84.05
74.05
64.05
Look at the _score field.
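To interpret the `_score`: with `"type": "best_fields"`, Elasticsearch scores each field separately, applies the field boost, and keeps the score of the best field, adding `tie_breaker` times the scores of the other fields (the default `tie_breaker` is 0.0). The sketch below only illustrates that combination step with made-up per-field scores; `best_fields_score` is a hypothetical helper, not an Elasticsearch API, and it does not compute the underlying relevance scores.

```python
def best_fields_score(field_scores: dict, boosts: dict, tie_breaker: float = 0.0) -> float:
    """Combine per-field scores the way a best_fields multi_match does.

    field_scores: unboosted score per field (illustrative numbers only).
    boosts: multiplier per field; fields not listed get a boost of 1.0.
    """
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    # Non-best fields only contribute when tie_breaker is above zero.
    return best + tie_breaker * (sum(boosted) - best)


# "question^4" means the question field's score is multiplied by 4.
example = best_fields_score({"question": 10.0, "text": 25.0}, {"question": 4.0})
```

With these illustrative inputs the boosted question score (40.0) beats the text score (25.0), so the question field determines the result.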



Answer: 84.05 (top score: 84.050095)


In [23]:
search_query = {
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "How do I execute a command in a running docker container?",
                        "fields": ["question^4", "text"],
                        "type": "best_fields"
                    }
                }
            ],
            "filter": [
                {
                    "term": {
                        "course": "machine-learning-zoomcamp"
                    }
                }
            ]
        }
    },
    "size": 3
}

# Execute the search query
response = es_client.search(index=index_name, body=search_query)

# Get the 3rd question from the search results
third_question = response['hits']['hits'][2]['_source']['question']
print(f"3rd question: {third_question}")


3rd question: How do I copy files from a different folder into docker container’s working directory?


Q4. Filtering
Now let's only limit the questions to machine-learning-zoomcamp.

Return 3 results. What's the 3rd question returned by the search engine?

How do I debug a docker container?
How do I copy files from a different folder into docker container’s working directory?
How do Lambda container images work?
How can I annotate a graph?

Answer: How do I copy files from a different folder into docker container’s working directory?
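The Q3 and Q4 queries differ only in the course filter and the result size, so one function can build both. This is a minimal sketch; `build_search_query` is a hypothetical helper introduced here, and it simply reproduces the query bodies used in this notebook.

```python
from typing import Optional


def build_search_query(query: str, course: Optional[str] = None, size: Optional[int] = None) -> dict:
    """Build the multi_match query body, optionally filtered by course."""
    multi_match = {
        "multi_match": {
            "query": query,
            "fields": ["question^4", "text"],  # boost the question field by 4
            "type": "best_fields",
        }
    }
    if course is None:
        body = {"query": multi_match}
    else:
        # The term filter restricts results without affecting the score.
        body = {
            "query": {
                "bool": {
                    "must": [multi_match],
                    "filter": [{"term": {"course": course}}],
                }
            }
        }
    if size is not None:
        body["size"] = size
    return body


# Usage (needs a running cluster and the es_client from above):
#   es_client.search(index=index_name, body=build_search_query(q, "machine-learning-zoomcamp", 3))
```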

In [24]:
search_query = {
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "How do I execute a command in a running docker container?",
                        "fields": ["question^4", "text"],
                        "type": "best_fields"
                    }
                }
            ],
            "filter": [
                {
                    "term": {
                        "course": "machine-learning-zoomcamp"
                    }
                }
            ]
        }
    },
    "size": 3
}

# Execute the search query
response = es_client.search(index=index_name, body=search_query)

# Use the context template to format each record
context_entries = []
context_template = """
Q: {question}
A: {text}
""".strip()

for hit in response['hits']['hits']:
    question = hit['_source']['question']
    text = hit['_source']['text']
    context_entry = context_template.format(question=question, text=text)
    context_entries.append(context_entry)

# Combine context entries separated by two linebreaks
context = "\n\n".join(context_entries)

# Define the final prompt using the prompt template
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

final_prompt = prompt_template.format(
    question="How do I execute a command in a running docker container?",
    context=context
)

# Calculate the length of the resulting prompt
prompt_length = len(final_prompt)
print(f"Length of the resulting prompt: {prompt_length}")

Length of the resulting prompt: 1462


Q5. Building a prompt
Answer: Length of the resulting prompt: 1462
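The prompt construction above can be wrapped in two small functions so it can be reused for other questions. This is a minimal sketch using the same templates as the cell above; `build_context` and `build_prompt` are hypothetical names, and `hits` is assumed to be the `response['hits']['hits']` list returned by the search.

```python
CONTEXT_TEMPLATE = """
Q: {question}
A: {text}
""".strip()

PROMPT_TEMPLATE = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()


def build_context(hits: list) -> str:
    """Format each search hit as a Q/A entry, separated by two linebreaks."""
    entries = [
        CONTEXT_TEMPLATE.format(
            question=hit["_source"]["question"],
            text=hit["_source"]["text"],
        )
        for hit in hits
    ]
    return "\n\n".join(entries)


def build_prompt(question: str, hits: list) -> str:
    """Fill the prompt template with the user question and the formatted context."""
    return PROMPT_TEMPLATE.format(question=question, context=build_context(hits))


# Usage, given a search response:
#   len(build_prompt("How do I execute a command in a running docker container?", response["hits"]["hits"]))
```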

In [25]:
pip install tiktoken


Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting regex>=2022.1.18 (from tiktoken)
  Downloading regex-2024.5.15-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Downloading regex-2024.5.15-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (775 kB)
Installing collected packages: regex, tiktoken
Successfully installed regex-202

In [27]:
import tiktoken

# Hand-written sample entries (not the Elasticsearch hits retrieved above)
context_entries = [
    {
        "question": "How do I debug a docker container?",
        "text": "First, ensure that your container is running. Then, use the docker exec command to attach to the running container and debug."
    },
    {
        "question": "How do I copy files from a different folder into docker container’s working directory?",
        "text": "Use the docker cp command followed by the source path and the destination path to copy files into the container."
    },
    {
        "question": "How do Lambda container images work?",
        "text": "Lambda container images allow you to package and deploy your code and dependencies as a container image."
    }
]

# Use the context template to format each record
context_template = """
Q: {question}
A: {text}
""".strip()

context_entries_formatted = [context_template.format(**entry) for entry in context_entries]

# Combine context entries separated by two linebreaks
context = "\n\n".join(context_entries_formatted)

# Define the final prompt using the prompt template
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: How do I execute a command in a running docker container?

CONTEXT:
{context}
""".strip()

final_prompt = prompt_template.format(
    context=context
)

# Use tiktoken for tokenization
encoding = tiktoken.encoding_for_model("gpt-4o")
tokens = encoding.encode(final_prompt)

# Calculate the number of tokens
token_count = len(tokens)
print(f"Number of tokens in the prompt: {token_count}")

Number of tokens in the prompt: 156


Q6. Tokens
Answer: Number of tokens in the prompt: 156
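To see what the tokenizer actually produced, each token id can be decoded back to its bytes with the encoding's `decode_single_token_bytes` method. This is a minimal sketch; `count_tokens`, `describe_tokens` and `load_gpt4o_encoding` are hypothetical helper names, and loading the encoding assumes `tiktoken` is installed and can fetch its encoding data.

```python
def count_tokens(text: str, encoding) -> int:
    """Return the number of tokens the given encoding produces for text."""
    return len(encoding.encode(text))


def describe_tokens(text: str, encoding, limit: int = 10) -> list:
    """Return (token_id, token_bytes) pairs for the first `limit` tokens."""
    token_ids = encoding.encode(text)[:limit]
    return [(token_id, encoding.decode_single_token_bytes(token_id)) for token_id in token_ids]


def load_gpt4o_encoding():
    """Load the encoding used in the cell above (imported lazily)."""
    import tiktoken

    return tiktoken.encoding_for_model("gpt-4o")


# Usage (needs tiktoken):
#   encoding = load_gpt4o_encoding()
#   print(count_tokens(final_prompt, encoding))
#   print(describe_tokens(final_prompt, encoding, limit=5))
```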


In [33]:
pip install --upgrade openai

Collecting openai
  Downloading openai-1.35.6-py3-none-any.whl.metadata (21 kB)
Downloading openai-1.35.6-py3-none-any.whl (327 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.35.4
    Uninstalling openai-1.35.4:
      Successfully uninstalled openai-1.35.4
Successfully installed openai-1.35.6
