# ***`Homework: Introduction`***
---

In this homework, we'll learn more about search and use Elasticsearch for practice.

## ***`Q1. Running Elastic`***
---

Run Elasticsearch 8.4.3 and get the cluster information. If you run it on localhost, this is how you do it:

```bash
curl localhost:9200
```

What's the `version.build_hash` value?

In [1]:
from elasticsearch import Elasticsearch

First, run the following command to get Elasticsearch up and running:

```bash
docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```

### *Why Use Two Ports: 9200 and 9300?*

1. ***`Port 9200:`***

- This is the default port for the HTTP REST API.
- It is used for communication between clients and Elasticsearch. Tools like `curl` and applications that interact with Elasticsearch typically communicate over this port.

2. ***`Port 9300:`***

- This is the default port for the transport layer, which carries internal communication between Elasticsearch nodes.
- It is used for node-to-node communication within the Elasticsearch cluster. A single-node setup like this one has no other nodes to talk to, so publishing the port mainly keeps the container ready for future cluster expansion.
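
To see which of the two published ports is reachable from the host, a quick TCP probe can help. This is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper name, and it assumes the container from the `docker run` command above is listening on localhost.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


# Probe the HTTP (9200) and transport (9300) ports
for port in (9200, 9300):
    state = "open" if port_is_open("localhost", port) else "closed"
    print(f"localhost:{port} is {state}")
```

An open port only shows that something accepts TCP connections there; `curl localhost:9200` is still the way to confirm that it is Elasticsearch answering.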

### *What is a Cluster in Elasticsearch?*

A **cluster** in Elasticsearch is a collection of one or more nodes (servers) that together store all of your data and provide federated indexing and search capabilities across all nodes.

#### Key Concepts of an Elasticsearch Cluster:

1. **Node**:
   - A single server that is part of the cluster. Each node stores data and participates in the cluster's indexing and search capabilities.
   - Nodes can join or leave the cluster dynamically.

2. **Cluster**:
   - A group of nodes with the same `cluster.name` setting, working together to share the workload.
   - The cluster's health status and the distribution of tasks and data are managed collectively.

3. **Cluster Name**:
   - A unique name to identify a specific cluster. Nodes in the same cluster must have the same cluster name.
   - The default name is "elasticsearch".

4. **Master Node**:
   - Manages the cluster by handling tasks such as creating/deleting indices, tracking which nodes are part of the cluster, and deciding where to allocate shards.
   - Every cluster has one elected master node at a time, and any master-eligible node can be elected.

5. **Data Node**:
   - Stores the data and performs data-related operations like CRUD, search, and aggregations.
   - In a multi-node cluster, you can have nodes dedicated solely to data handling.

6. **Cluster State**:
   - Maintained by the master node, it includes information about all the nodes, indices, and shards within the cluster.
   - It ensures that all nodes have a consistent view of the cluster.

In [2]:
!curl localhost:9200

{
  "name" : "10b41fd5905e",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "rEmcivvsQrOvQuQNzXH8Tw",
  "version" : {
    "number" : "8.4.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "42f05b9372a9a4a470db3b52817899b99a76ee73",
    "build_date" : "2022-10-04T07:17:24.662462378Z",
    "build_snapshot" : false,
    "lucene_version" : "9.3.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}


In [3]:
import subprocess
import json

# Run the curl command silently (-s hides the progress meter) and capture the output
result = subprocess.run(['curl', '-s', 'localhost:9200'], stdout=subprocess.PIPE)
response = result.stdout.decode('utf-8')

# Parse the JSON response
response_json = json.loads(response)

# Extract the version.build_hash value
build_hash = response_json['version']['build_hash']


# answer
print(f"\nAnswer: {build_hash}\n")


Answer: 42f05b9372a9a4a470db3b52817899b99a76ee73


## ***`Getting the data`***
---

Now let's get the FAQ data. You can run this snippet:

In [4]:
import requests
from pprint import pprint

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)



# see the first document
pprint(documents[0])

{'course': 'data-engineering-zoomcamp',
 'question': 'Course - When will the course start?',
 'section': 'General course-related questions',
 'text': 'The purpose of this document is to capture frequently asked '
         'technical questions\n'
         'The exact day and hour of the course will be 15th Jan 2024 at 17h00. '
         "The course will start with the first  “Office Hours'' live.1\n"
         'Subscribe to course public Google Calendar (it works from Desktop '
         'only).\n'
         'Register before the course starts using this link.\n'
         'Join the course Telegram channel with announcements.\n'
         "Don’t forget to register in DataTalks.Club's Slack and join the "
         'channel.'}
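
As a quick sanity check on the flattening step, it can help to count how many documents each course contributes. The snippet below is a minimal sketch on made-up data: `sample_raw` is a hypothetical stand-in with the same nested shape as `documents.json`, not the real FAQ content.

```python
from collections import Counter

# Hypothetical sample in the same nested shape as documents.json
sample_raw = [
    {"course": "course-a", "documents": [
        {"text": "t1", "section": "s1", "question": "q1"},
        {"text": "t2", "section": "s1", "question": "q2"},
    ]},
    {"course": "course-b", "documents": [
        {"text": "t3", "section": "s2", "question": "q3"},
    ]},
]

# Flatten: copy each document and tag it with its course name
sample_docs = [
    {**doc, "course": course["course"]}
    for course in sample_raw
    for doc in course["documents"]
]

# Count documents per course
docs_per_course = Counter(doc["course"] for doc in sample_docs)
print(docs_per_course)
```

Unlike the loop above, the comprehension copies each document instead of adding the `course` key to the original dictionaries in place.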


## ***`Q2. Indexing the data`***
---

Index the data in the same way as was shown in the course videos. Make the course field a keyword and the rest should be text.

Which function do you use for adding your data to elastic?

- insert
- index
- put
- add

In [5]:
from tqdm.auto import tqdm

index_settings = {
    "settings": {
        "number_of_shards": 1,    # shards are the basic units of storage and can be distributed across nodes; one is enough on a single node
        "number_of_replicas": 0   # replicas are copies of shards that provide high availability and fault tolerance if a node fails
    },
    # mapping defines how documents and fields are indexed and stored, specifying the data type of each field
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} # add course as a keyword field for filtering
        }
    }
}

index_name = "course-questions"

es_client = Elasticsearch('http://localhost:9200')

# Drop any index left over from a previous run: creating an index that already
# exists raises a 400 resource_already_exists_exception
es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)


# Index the documents; `index` is the client method that adds a document
for doc in tqdm(documents):

    es_client.index(
        index    = index_name, # = where the document will be stored
        document = doc
    )

# Make the newly indexed documents visible to search before querying
es_client.indices.refresh(index=index_name)

# Check if the documents are indexed
pprint(
    es_client.search(index=index_name, body={"query": {"match_all": {}}})
)
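
Indexing one document per request works for a small FAQ, but it can be slow for larger datasets. A minimal sketch of a batched alternative, assuming the `bulk` helper from `elasticsearch.helpers` in the official Python client; `bulk_actions` and `bulk_index` are hypothetical names introduced here for illustration.

```python
def bulk_actions(docs, index_name):
    """Yield one bulk action per document, in the shape the bulk helper expects."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es_client, docs, index_name):
    """Send all documents in batched requests instead of one request each."""
    # Imported lazily so the action generator can be used without the client installed
    from elasticsearch import helpers

    return helpers.bulk(es_client, bulk_actions(docs, index_name))


# Inspect the actions for a made-up document
print(list(bulk_actions([{"question": "q1", "course": "course-a"}], "course-questions")))
```

With a running cluster, `bulk_index(es_client, documents, index_name)` would take the place of the `for` loop over `es_client.index`.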

## ***`Q3. Searching`***
---

Now let's search in our index. 

We will execute a query "How do I execute a command in a running docker container?". 

Use only `question` and `text` fields and give `question` a boost of 4, and use `"type": "best_fields"`.

What's the score for the top ranking result?

* 94.05
* 84.05
* 74.05
* 64.05

Hint: Look at the _score field.


- Helpers:

In [29]:
# 1)
def elastic_search(query, course="data-engineering-zoomcamp"):
    search_query = {
        "size": 3,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^4", "text"], # !! boost the question field by a factor of 4
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)

    result_docs = []

    for hit in response['hits']['hits']:
        result_docs.append({
            'score':    hit['_score'],  # ! Include the score
            'text':     hit['_source']['text'],
            'section':  hit['_source']['section'],
            'question': hit['_source']['question'],
            'course':   hit['_source']['course']
        })

    return result_docs

- Execute query:

In [16]:
query = "How do I execute a command in a running docker container?"

# Execute query
response = elastic_search(query, course="data-engineering-zoomcamp")

# Print the top score
if response:
    top_score = response[0]['score']
    print(f"The score for the top-ranking result is: {top_score}")
else:
    print("No results found")

The score for the top-ranking result is: 75.54128
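
The run above restricts results to `data-engineering-zoomcamp`, while Q3 as worded only specifies the fields, the boost, and the match type. That may be why the score is not among the listed options. A minimal sketch of a query builder that makes the course filter optional; `build_search_query` is a hypothetical helper, not part of the course code.

```python
from typing import Optional


def build_search_query(query: str, course: Optional[str] = None, size: int = 3) -> dict:
    """Build a multi_match query body; add a course filter only when one is given."""
    bool_query = {
        "must": {
            "multi_match": {
                "query": query,
                "fields": ["question^4", "text"],  # boost the question field by 4
                "type": "best_fields",
            }
        }
    }
    if course is not None:
        # Exact match on the keyword field, without affecting the score
        bool_query["filter"] = {"term": {"course": course}}
    return {"size": size, "query": {"bool": bool_query}}


question = "How do I execute a command in a running docker container?"
unfiltered = build_search_query(question)
filtered = build_search_query(question, course="machine-learning-zoomcamp")
print(unfiltered)
```

Either body can be passed to `es_client.search(index=index_name, body=...)` in the same way as in the helper above.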


In [18]:
pprint(response)

[{'course': 'data-engineering-zoomcamp',
  'question': 'PGCLI - running in a Docker container',
  'score': 75.54128,
  'section': 'Module 1: Docker and Terraform',
  'text': 'In case running pgcli  locally causes issues or you do not want to '
          'install it locally you can use it running in a Docker container '
          'instead.\n'
          'Below the usage with values used in the videos of the course for:\n'
          'network name (docker network)\n'
          'postgres related variables for pgcli\n'
          'Hostname\n'
          'Username\n'
          'Port\n'
          'Database name\n'
          '$ docker run -it --rm --network pg-network '
          'ai2ys/dockerized-pgcli:4.0.1\n'
          '175dd47cda07:/# pgcli -h pg-database -U root -p 5432 -d ny_taxi\n'
          'Password for root:\n'
          'Server: PostgreSQL 16.1 (Debian 16.1-1.pgdg120+1)\n'
          'Version: 4.0.1\n'
          'Home: http://pgcli.com\n'
          'root@pg-database:ny_taxi> \\dt\n'
   

## ***`Q4. Filtering`***
---

Now let's only limit the questions to `machine-learning-zoomcamp`.

Return 3 results. What's the 3rd question returned by the search engine?

* How do I debug a docker container?
* How do I copy files from a different folder into docker container’s working directory?
* How do Lambda container images work?
* How can I annotate a graph?

In [31]:
query  = "How do I execute a command in a running docker container?"
course = "machine-learning-zoomcamp"

response = elastic_search(
    query=query,
    course=course
)

# display 3rd question:
print(response[2]['question'])

How do I copy files from a different folder into docker container’s working directory?


## ***`Q5. Building a prompt`***
---

Now we're ready to build a prompt to send to an LLM. 

Take the records returned from Elasticsearch in Q4 and use this template to build the context. Separate context entries by two linebreaks (`\n\n`).
```python
context_template = """
Q: {question}
A: {text}
""".strip()
```

Now use the context you just created along with the "How do I execute a command in a running docker container?" question 
to construct a prompt using the template below:

```python
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()
```

What's the length of the resulting prompt? (use the `len` function)

* 962
* 1462
* 1962
* 2462


In [32]:
len(response)

3

In [34]:
from typing import Any


def build_prompt(query: str, search_results: list[dict[str, Any]]) -> str:
    """
    Build a prompt for the chatbot based on the search results.
    """

    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    context = ""

    for doc in search_results:
        context = context + f"Q: {doc['question']}\nA: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt


# the Q4 search already returns only 3 results (size=3), so pass them all as context
context = response

prompt = build_prompt(
    query          = query,
    search_results = context
)

# Display the prompt length
print(f"The prompt length is: {len(prompt)}")

The prompt length is: 1462
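
The instructions suggest formatting each record with `context_template` and separating entries with two line breaks. A minimal sketch of that approach; `build_context` and the sample records are hypothetical and stand in for the Elasticsearch hits.

```python
context_template = """
Q: {question}
A: {text}
""".strip()


def build_context(search_results: list[dict]) -> str:
    """Format each record with the template and join entries with a blank line."""
    entries = [
        context_template.format(question=doc["question"], text=doc["text"])
        for doc in search_results
    ]
    return "\n\n".join(entries)


# Hypothetical records in the shape returned by the search helper
sample_results = [
    {"question": "q1", "text": "a1"},
    {"question": "q2", "text": "a2"},
]
print(build_context(sample_results))
```

Joining the entries avoids the trailing `\n\n` that the loop above leaves for the final `.strip()` to remove.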


## ***`Q6. Tokens`***
---

When we use the OpenAI Platform, we're charged by the number of 
tokens we send in our prompt and receive in the response.

The OpenAI python package uses `tiktoken` for tokenization:

```bash
pip install tiktoken
```

Let's calculate the number of tokens in our prompt:

```python
encoding = tiktoken.encoding_for_model("gpt-4o")
```

Use the `encode` function. How many tokens does our prompt have?

* 122
* 222
* 322
* 422

Note: to decode a token back into a word, you can use the `decode_single_token_bytes` function:

```python
encoding.decode_single_token_bytes(63842)
```

In [35]:
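from openai import OpenAI
from dotenv import load_dotenv
import os

# Load OPENAI_API_KEY from a local .env file into the environment
load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")

# Pass the key to the client directly; the module-level openai.api_key is not needed
client = OpenAI(api_key=api_key)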

In [39]:
import tiktoken

# calculate tokens in the prompt
encoding = tiktoken.encoding_for_model("gpt-4o") # encoding method for specific model

# encode the prompt into tokens (list of integers)
tokens = encoding.encode(prompt)

# answer
print(f"\nAnswer: {len(tokens)}\n")


Answer: 322



## ***`Bonus: generating the answer (ungraded)`***
---

Let's send the prompt to OpenAI. What's the response?

Note: you can replace OpenAI with Ollama. See module 2.

In [40]:
def llm(prompt):
    """
    Run the prompt through the OpenAI language model and return the response (Inference).
    """
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content

def rag(query: str) -> str:
    """
    Combined function to search the knowledge base, build the prompt, and generate the answer.
    """
    # retrieve search results from knowledge base
    search_results = elastic_search(query)
    # build the prompt with the search results, from which the model will generate the answer
    prompt = build_prompt(query, search_results)
    # inference
    return llm(prompt)


# apply
rag(query)

"To execute a command in a running Docker container, you can use the `docker exec` command followed by the container ID or name and the command you want to execute. Here's the syntax:\n\n```sh\ndocker exec -it <container_id_or_name> <command>\n```\n\nFor example, if you wanted to run a shell command within a container, you would replace `<container_id_or_name>` with the actual ID or name of the container, and `<command>` with the command you wish to execute. Here’s a specific example assuming you want to start a bash shell session:\n\n```sh\ndocker exec -it <container_id_or_name> /bin/bash\n```\n\nIf you want to execute a specific command without opening an interactive session:\n\n```sh\ndocker exec -it <container_id_or_name> <command>\n```\n\n Simply replace `<command>` with your desired command."

## ***`Bonus: calculating the costs (ungraded)`***
---

Suppose that on average per request we send 150 tokens and receive back 250 tokens.

How much will it cost to run 1000 requests?

You can see the prices [here](https://openai.com/api/pricing/)

On June 17, the prices for gpt-4o were:

* Input: $0.005 / 1K tokens
* Output: $0.015 / 1K tokens

You can redo the calculations with the values you got in Q6 and Q7.

In [41]:
input_tokens_per_request  = 150
output_tokens_per_request = 250

run_n_requests = 1000

# Input price:  $0.005 / 1K tokens
# Output price: $0.015 / 1K tokens
price_input_150_tokens  = 0.005 / 1000 * input_tokens_per_request
price_output_250_tokens = 0.015 / 1000 * output_tokens_per_request

# Cost of a single request (input + output)
cost_per_request = price_input_150_tokens + price_output_250_tokens

# How much does it cost to process 1000 requests?
total_cost = cost_per_request * run_n_requests

# answer
print(f"\nAnswer: ${total_cost}\n")


Answer: $4.5
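
To redo the calculation with other token counts, the arithmetic can be wrapped in a small function. This is a minimal sketch; `estimate_cost` is a hypothetical helper, and its default prices are the June 17 gpt-4o values quoted above, which may have changed since.

```python
def estimate_cost(n_requests, input_tokens, output_tokens,
                  input_price_per_1k=0.005, output_price_per_1k=0.015):
    """Estimate the total cost in dollars for n_requests of the given average size."""
    per_request = (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )
    return per_request * n_requests


# The scenario from the question: 150 tokens in, 250 tokens out, 1000 requests
print(round(estimate_cost(1000, 150, 250), 2))  # 4.5
# Input side only, using the 322-token prompt from Q6
print(round(estimate_cost(1000, 322, 0), 2))    # 1.61
```

Rounding to cents hides the small floating-point error that repeated multiplication can introduce.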

