# Searching in the documents
Create a notebook called "elastic-rag" (or something similar). We will use it for our experiments.

First, we need to download the docs:

```
wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json
```

Let's load the documents:

In [1]:
import json

with open('./documents.json', 'rt') as f_in:
    documents_file = json.load(f_in)

documents = []

for course in documents_file:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

Now we'll index these documents with Elasticsearch.

First, initiate the connection and check that it's working:

In [2]:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.info()

ObjectApiResponse({'name': '101dc291a5e6', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'i6Ev5IAmQgaTtcxcMIry_Q', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

We see the same response as earlier with `curl`.

Before we can index the documents, we need to create an index (an index in Elasticsearch is like a table in a "usual" database):

In [3]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"
response = es.indices.create(index=index_name, body=index_settings)

response

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})
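
Running the cell above a second time fails, because the index already exists. One way around this, shown only as a minimal sketch and assuming the 8.x Python client, is to drop the index before creating it. The `recreate_index` helper is a hypothetical name, not part of the workshop code:

```python
def recreate_index(es, index_name, index_settings):
    """Drop the index if it exists, then create it from scratch.

    `es` is expected to be an `elasticsearch.Elasticsearch` client (8.x API assumed).
    Deleting an index also deletes every document stored in it.
    """
    if es.indices.exists(index=index_name):
        es.indices.delete(index=index_name)

    # The 8.x client accepts settings and mappings as separate keyword arguments.
    return es.indices.create(
        index=index_name,
        settings=index_settings["settings"],
        mappings=index_settings["mappings"],
    )
```

With the variables defined above, this would be called as `recreate_index(es, index_name, index_settings)`.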

Now we're ready to index all the documents:

In [4]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es.index(index=index_name, document=doc)

100%|██████████| 948/948 [00:21<00:00, 43.75it/s]
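
Indexing one document per request works for about a thousand documents, but every call is a separate HTTP round trip. A possible alternative is sketched below, assuming the `bulk` helper from `elasticsearch.helpers` in the official Python client. The names `generate_actions` and `bulk_index` are hypothetical and introduced only for illustration:

```python
def generate_actions(documents, index_name):
    """Yield one bulk action per document, in the dict format the bulk helper accepts."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es, documents, index_name):
    """Index all documents in batched requests.

    Returns the helper's result: a (success_count, errors) tuple.
    """
    # Imported here so generate_actions can be used without the client installed.
    from elasticsearch.helpers import bulk

    return bulk(es, generate_actions(documents, index_name))


# Usage, with the variables from the cells above:
# bulk_index(es, documents, index_name)
```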


## Retrieving the documents

In [5]:
user_question = "How do I join the course after it has started?"

search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": user_question,
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}

This query:

- Retrieves the top 5 matching documents.
- Searches the "question", "text", and "section" fields, boosting "question" by a factor of 3 (`question^3`), using a `multi_match` query of type `best_fields` (see [here](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/elastic-search.md) for more information).
- Matches the user query "How do I join the course after it has started?".
- Shows results only for the "data-engineering-zoomcamp" course.

Let's see the output:

In [6]:
response = es.search(index=index_name, body=search_query)

for hit in response['hits']['hits']:
    doc = hit['_source']
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to su...

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishe...

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependenc...

Section: General course-related questions
Question: How do I use Git / GitHub for this course?
Answer: After you create a GitHub account, you should clone the cour...

Section: Workshop 1 - dlthub
Question: How do I install the necessary dependencies to run the code?
Answer: Answer: To run the provided code, ensure that the 'dlt[duckd...
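
To make `best_fields` and the `question^3` boost in the query above more concrete: Elasticsearch scores each field separately, multiplies the score by the field's boost, and for `best_fields` keeps the highest result (plus an optional `tie_breaker` fraction of the other fields, which defaults to 0). The toy function below illustrates only that combination rule. The per-field scores are made-up numbers, not real relevance scores, and `best_fields_score` is a hypothetical name:

```python
def best_fields_score(field_scores, boosts, tie_breaker=0.0):
    """Combine per-field scores the way a best_fields multi_match does."""
    # Apply the per-field boost; fields without an explicit boost use 1.0.
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0

    best = max(boosted)
    # With the default tie_breaker of 0, only the best field counts.
    return best + tie_breaker * (sum(boosted) - best)


boosts = {"question": 3.0}
scores = {"question": 2.0, "text": 4.0, "section": 0.5}  # illustrative values only

print(best_fields_score(scores, boosts))  # 6.0
```

Without the boost, the "text" field would win with 4.0. With `question^3`, the "question" field wins with 6.0, which is why documents whose question matches the user query tend to rank first.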



### Cleaned up version
Run this separately


In [13]:
def retrieve_documents(query, index_name="course-questions", max_results=5):
    es = Elasticsearch("http://localhost:9200")
    
    search_query = {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es.search(index=index_name, body=search_query)
    documents = [hit['_source'] for hit in response['hits']['hits']]
    return documents
                

And print the answers:

In [14]:
user_question = "How do I join the course after it has started?"

response = retrieve_documents(user_question)

for doc in response:
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to su...

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishe...

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependenc...

Section: General course-related questions
Question: How do I use Git / GitHub for this course?
Answer: After you create a GitHub account, you should clone the cour...

Section: Workshop 1 - dlthub
Question: How do I install the necessary dependencies to run the code?
Answer: Answer: To run the provided code, ensure that the 'dlt[duckd...
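
`retrieve_documents` hardcodes the course in the filter. One possible variation, shown here only as a sketch, is to build the query in a separate function that takes the course as a parameter. `build_search_query` and its `course` argument are assumptions, not part of the workshop code:

```python
def build_search_query(query, course="data-engineering-zoomcamp", max_results=5):
    """Return the same search body as retrieve_documents, with a configurable course."""
    return {
        "size": max_results,
        "query": {
            "bool": {
                # Full-text match across three fields, with "question" boosted.
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Exact match on the keyword field; does not affect the score.
                "filter": {"term": {"course": course}},
            }
        },
    }
```

The result can be passed to the client in the same way as before, for example `es.search(index=index_name, body=build_search_query(user_question))`.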



# Generation - Making up answers to questions

Time for the "G" part (generation), which builds on the "R" part (retrieval).

### OpenAI
The easiest way to get started is the OpenAI API. A little later we will learn how to use open-source models.

Make sure the SDK is installed and a key is set.

This is how we communicate with a chatty GPT:
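
A minimal sketch, assuming the `openai` Python SDK (version 1 or later) and an `OPENAI_API_KEY` environment variable. The later cells rely on a `client` object like the one created here; `make_client` and `ask` are hypothetical helper names:

```python
def make_client():
    """Create an OpenAI client; the SDK reads OPENAI_API_KEY from the environment."""
    from openai import OpenAI  # requires `pip install openai`

    return OpenAI()


def ask(client, question, model="gpt-4o"):
    """Send a single user message and return the text of the first choice."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


# Usage (needs network access and a valid key):
# client = make_client()
# print(ask(client, "Is it too late to join the course?"))
```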


### Building a Prompt
---

We format each retrieved document with a template and join the results into a single context string:


In [22]:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()

context_docs = retrieve_documents(user_question)

context_result = ""

for doc in context_docs:
    doc_str = context_template.format(**doc)
    context_result += ("\n\n" + doc_str)

context = context_result.strip()
print(context)

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependencies and requirements:
Google cloud account
Google Cloud SDK
Python 3 (installed with Anaconda)
Terrafo

Now build the actual prompt:

In [25]:
prompt = f"""
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT - the documents retrieved from our FAQ database.
Only use the facts from the CONTEXT. If the CONTEXT doesn't contain the answer, return "NONE"

QUESTION: {user_question}

CONTEXT:

{context}
""".strip()

Now we can send it to the OpenAI API:

In [26]:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content
answer

'NONE'

Note: There are system and user prompts; we can experiment with them to make the design of the prompt cleaner.
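
As a sketch of that idea, the fixed instructions could go into a system message and the question plus context into the user message. The `build_messages` helper and the `SYSTEM_PROMPT` constant are hypothetical; the instruction text is copied from the prompt above:

```python
SYSTEM_PROMPT = (
    "You're a course teaching assistant. Answer the QUESTION based on the CONTEXT - "
    "the documents retrieved from our FAQ database. "
    "Only use the facts from the CONTEXT. "
    "If the CONTEXT doesn't contain the answer, return \"NONE\""
)


def build_messages(user_question, context):
    """Split the prompt into a system message (rules) and a user message (data)."""
    user_content = f"QUESTION: {user_question}\n\nCONTEXT:\n\n{context}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]
```

The returned list can be passed as `messages=build_messages(user_question, context)` to `client.chat.completions.create`.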

### Cleaning it up
---
Now let's put everything together in one function:

In [43]:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()

prompt_template = """
You're a course teaching assistant and salty sailor.
Answer the QUESTION based on the CONTEXT - the documents retrieved from our FAQ database.
Only use the facts from the CONTEXT. If the CONTEXT doesn't contain the answer, then tell a joke.
Never say NONE

QUESTION: {user_question}

CONTEXT:

{context}
""".strip()

def build_context(documents):
    context_result = ""
    
    for doc in documents:
        doc_str = context_template.format(**doc)
        context_result += ("\n\n" + doc_str)
    
    return context_result.strip()


def build_prompt(user_question, documents):
    context = build_context(documents)
    prompt = prompt_template.format(
        user_question=user_question,
        context=context
    )
    return prompt

def ask_openai(prompt, model="gpt-4o"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content
    return answer

def qa_bot(user_question):
    context_docs = retrieve_documents(user_question)
    prompt = build_prompt(user_question, context_docs)
    answer = ask_openai(prompt)
    return answer

Let's try asking some questions:

In [44]:
print(context_template)
print(prompt_template)

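# Note: the return value of qa_bot is not assigned here, so print(answer)
# shows the value left over from an earlier cell, not the new answer.
# Use `answer = qa_bot(...)` to print the fresh result.
qa_bot("Has the course started already?")
print(answer)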

Section: {section}
Question: {question}
Answer: {text}
You're a course teaching assistant and salty sailor.
Answer the QUESTION based on the CONTEXT - the documents retrieved from our FAQ database.
Only use the facts from the CONTEXT. If the CONTEXT doesn't contain the answer, then tell a joke.
Never say NONE

QUESTION: {user_question}

CONTEXT:

{context}
NONE


In [45]:
retrieve_documents(user_question)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (insta

In [46]:
print(answer)

NONE


In [47]:
qa_bot("how can I run kafka?")

"To run Kafka Java producer, consumer, or KStreams in the terminal, follow these steps in the project directory:\n\n```bash\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java\n``` \n\nThat's how you set sail with Kafka, matey! 🚢 If ye need more help, I can also teach ye how to navigate the high seas. 🌊"

NONE


In [20]:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "How to do I put a little line under a heading in jupyter notebook markdown?"}]
)
print(response.choices[0].message.content)

In a Jupyter Notebook, you can add a horizontal line under a heading using Markdown syntax. Here's how you can do it:

1. **Using Horizontal Rule (`---` or `___` or `***`)**:

   You can use three or more hyphens (`-`), underscores (`_`), or asterisks (`*`) in a row to create a horizontal line.

   ```markdown
   # Heading Level 1
   ---

   ## Heading Level 2
   ***

   ### Heading Level 3
   ___
   ```

2. **Using HTML (`<hr>` Tag)**:

   Alternatively, you can use the HTML `<hr>` tag to insert a horizontal line.

   ```markdown
   # Heading Level 1
   <hr>

   ## Heading Level 2
   <hr>

   ### Heading Level 3
   <hr>
   ```

### Example:

```markdown
# Heading Level 1
---

Here is some text under the heading.

## Heading Level 2
***

More text under the second heading.

### Heading Level 3
___

And yet more text under the third heading.
```

When you run the cell, you'll see the headings followed by horizontal lines.

### Preview:

```
# Heading Level 1
---

Here is some text under