# Imports

In [13]:
import json
import pandas as pd
import numpy as np
import torch
from tqdm.auto import tqdm
from elasticsearch import Elasticsearch

# Variables

In [16]:
# Documents path
documents_file = '../data/input/faqs/documents.json'

# Elasticsearch configs
elastic_host = "http://localhost:9200" #Should be in .env
index_name = "course-questions"
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

# OpenAI configs
openai_model = "gpt-4o-mini"
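
The `#Should be in .env` note on `elastic_host` could be handled roughly as follows. This is a minimal sketch, assuming the hypothetical environment variable names `ELASTIC_HOST` and `OPENAI_MODEL` (neither appears in the original notebook); the defaults mirror the values set above.

```python
import os


def load_config(env=None):
    """Read settings from environment variables, falling back to the notebook defaults.

    ELASTIC_HOST and OPENAI_MODEL are hypothetical variable names chosen for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "elastic_host": env.get("ELASTIC_HOST", "http://localhost:9200"),
        "openai_model": env.get("OPENAI_MODEL", "gpt-4o-mini"),
    }


config = load_config()
elastic_host = config["elastic_host"]
openai_model = config["openai_model"]
```

If the values live in a `.env` file, a loader such as `python-dotenv` can copy them into the environment before `load_config()` is called.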


# Documents loading and indexing with Elasticsearch

### Load Docs


Before running this cell, check that the path is correct and that the file contains the expected data.

In [17]:
with open(documents_file, 'rt') as f_in:
    # Use a separate name so `documents_file` keeps holding the path and the cell can be re-run
    documents_raw = json.load(f_in)

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
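
One way to check the data is to confirm that every flattened document carries the fields the index mapping expects. The following is a minimal sketch that uses a small made-up sample in place of `documents.json`; the helper names `flatten_courses` and `find_incomplete` are hypothetical.

```python
REQUIRED_FIELDS = ("text", "section", "question", "course")


def flatten_courses(raw):
    """Flatten [{'course': ..., 'documents': [...]}] into docs tagged with their course."""
    flat = []
    for course in raw:
        for doc in course["documents"]:
            # Copy each document so the input structure is not mutated.
            flat.append({**doc, "course": course["course"]})
    return flat


def find_incomplete(docs, required=REQUIRED_FIELDS):
    """Return the positions of documents that miss at least one required field."""
    return [i for i, doc in enumerate(docs) if any(f not in doc for f in required)]


# Made-up sample data, for illustration only.
sample = [
    {
        "course": "demo-course",
        "documents": [
            {"text": "Yes.", "section": "General", "question": "Can I join late?"},
            {"text": "No section here.", "question": "Where is the section?"},
        ],
    }
]

docs = flatten_courses(sample)
print(len(docs), find_incomplete(docs))
```

An empty list from `find_incomplete` means every document can be indexed against the mapping without missing fields.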

### Indexing

Initiate the connection to the Elasticsearch instance and check the connection. The results should be similar to the following:

```
{
  "name" : "4638448291a2",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "Rhp-8RSATb6Ekwwom3C0jw",
  "version" : {
    "number" : "8.4.3",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "42f05b9372a9a4a470db3b52817899b99a76ee73",
    "build_date" : "2022-10-04T07:17:24.662462378Z",
    "build_snapshot" : false,
    "lucene_version" : "9.3.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}
```

In [18]:
es = Elasticsearch(elastic_host)
es.info()

ObjectApiResponse({'name': '1b0c191eee1e', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'KKo6Jp9WSGyPt7ljpe5duA', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

Create the index. Its settings and mappings are defined in the Variables section at the beginning of the notebook.

In [None]:
index_name = "course-questions"
response = es.indices.create(
    index=index_name, 
    body=index_settings)

response

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})
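
Running the create call a second time fails, because Elasticsearch refuses to create an index whose name is already taken. A minimal sketch of a guard follows, assuming a hypothetical helper `ensure_index` that relies only on the client's `indices.exists` and `indices.create` calls. `FakeIndices` and `FakeClient` are stand-ins that exist only so the sketch runs without a cluster.

```python
def ensure_index(client, name, settings):
    """Create the index only if it does not exist yet. Return True if it was created."""
    if client.indices.exists(index=name):
        return False
    client.indices.create(index=name, body=settings)
    return True


class FakeIndices:
    """In-memory stand-in for `es.indices`, for illustration only."""

    def __init__(self):
        self.created = {}

    def exists(self, index):
        return index in self.created

    def create(self, index, body):
        self.created[index] = body


class FakeClient:
    """Stand-in for the Elasticsearch client, exposing only `.indices`."""

    def __init__(self):
        self.indices = FakeIndices()


fake = FakeClient()
demo_settings = {"settings": {"number_of_shards": 1, "number_of_replicas": 0}}
first = ensure_index(fake, "course-questions", demo_settings)
second = ensure_index(fake, "course-questions", demo_settings)
print(first, second)
```

With a live cluster the call would be `ensure_index(es, index_name, index_settings)`.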

Last step: index the documents.

In [20]:
for doc in tqdm(documents):
    es.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]
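
Indexing one document per request works for the 948 documents here, but it costs one round trip per document. The Python client also ships `elasticsearch.helpers.bulk`, which accepts an iterable of actions. Below is a minimal sketch of building those actions; `generate_actions` is a hypothetical helper name and the sample documents are made up.

```python
def generate_actions(docs, index):
    """Yield one bulk action per document, using the `_index` / `_source` action shape."""
    for doc in docs:
        yield {"_index": index, "_source": doc}


# Made-up documents, for illustration only.
sample_docs = [
    {"text": "Yes.", "section": "General", "question": "Can I join late?", "course": "demo-course"},
    {"text": "Use pip.", "section": "Setup", "question": "How do I install?", "course": "demo-course"},
]

actions = list(generate_actions(sample_docs, "course-questions"))
print(actions[0]["_index"], len(actions))
```

With a live client, the generator would be passed straight to the helper, for example `bulk(es, generate_actions(documents, index_name))` after `from elasticsearch.helpers import bulk`.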

Now let's ask a question and look at the results!

In [21]:
user_question = "How do I join the course after it has started?"

search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": user_question,
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}

In [22]:
search_query

{'size': 5,
 'query': {'bool': {'must': {'multi_match': {'query': 'How do I join the course after it has started?',
     'fields': ['question^3', 'text', 'section'],
     'type': 'best_fields'}},
   'filter': {'term': {'course': 'data-engineering-zoomcamp'}}}}}
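
The query has two parts. The `multi_match` clause scores documents against the question text, and `question^3` boosts matches in the `question` field by a factor of 3. The `term` filter restricts results to one course without contributing to the score. The sketch below builds the same structure from parameters; `build_search_query` is a hypothetical helper, not part of the notebook.

```python
def build_search_query(question, course, size=5):
    """Build a bool query: scored multi_match on the text fields, unscored filter on course."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": question,
                        # ^3 boosts matches in the `question` field.
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # `course` is mapped as keyword, so `term` matches the exact value.
                "filter": {"term": {"course": course}},
            }
        },
    }


demo_query = build_search_query(
    "How do I join the course after it has started?",
    course="data-engineering-zoomcamp",
)
print(demo_query["query"]["bool"]["filter"])
```

Making the course a parameter allows the same query to be reused for the other courses in the FAQ file.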

In [23]:
response = es.search(index=index_name, body=search_query)

print(f"user_question: {user_question}\n")
for hit in response['hits']['hits']:
    doc = hit['_source']
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")

user_question: How do I join the course after it has started?

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to su...

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishe...

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependenc...

Section: General course-related questions
Question: How do I use Git / GitHub for this course?
Answer: After you create a GitHub account, you should clone the cour...

Section: Workshop 1 - dlthub
Question: How do I install the necessary dependencies to run the code?
Answer: Answer: To run the provided code, ensure that the 'dlt[duckd...



A simple function to retrieve documents from the Elasticsearch index. The rest of the notebook uses it.

In [13]:
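def retrieve_documents(query, index_name="course-questions", max_results=5):
    """Return the top `max_results` FAQ entries matching `query` (data-engineering-zoomcamp only)."""
    # Note: a new client is created on every call; the host is hard-coded here
    es = Elasticsearch("http://localhost:9200")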
    
    search_query = {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }
    
    response = es.search(index=index_name, body=search_query)
    documents = [hit['_source'] for hit in response['hits']['hits']]
    return documents

In [15]:
user_question = "How do I join the course after it has started?"

response = retrieve_documents(user_question)

print(f"user_question: {user_question}\n")
for doc in response:
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")

user_question: How do I join the course after it has started?

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to su...

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishe...

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependenc...

Section: General course-related questions
Question: How do I use Git / GitHub for this course?
Answer: After you create a GitHub account, you should clone the cour...

Section: Workshop 1 - dlthub
Question: How do I install the necessary dependencies to run the code?
Answer: Answer: To run the provided code, ensure that the 'dlt[duckd...



# Let's use an LLM to answer the question.

Load the OpenAI client. (Later, try using a local model instead.)

In [16]:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "The course already started. Can I still join?"}]
)

print(response.choices[0].message.content)

It depends on the specific course and the policies of the institution or organization offering it. Some courses may allow late enrollment, while others may have strict deadlines. It's best to contact the course instructor or the administration directly to inquire about the possibility of joining the course after it has already started. They can provide you with the most accurate information regarding your options.


OK, so now we retrieve the documents and format them as context for the LLM. No embeddings are involved here; the retrieved documents are simply pasted into the prompt as text.

In [17]:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()

context_docs = retrieve_documents(user_question)

context_result = ""

for doc in context_docs:
    doc_str = context_template.format(**doc)
    context_result += ("\n\n" + doc_str)

context = context_result.strip()
print(context)

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependencies and requirements:
Google cloud account
Google Cloud SDK
Python 3 (installed with Anaconda)
Terrafo

Build the prompt for the LLM. Note that the user question is already included in the prompt.

In [18]:
prompt = f"""
You're a course teaching assistant. Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database. 
Only use the facts from the CONTEXT. If the CONTEXT doesn't contain the answer, return "NONE"

QUESTION: {user_question}

CONTEXT:

{context}
""".strip()

In [None]:
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)
print(f"User question: {user_question}\n")
answer = response.choices[0].message.content
answer

User question: How do I join the course after it has started?



"Yes, even if you don't register, you're still eligible to submit the homeworks. Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute."

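Because the prompt tells the model to return "NONE" when the context lacks an answer, the calling code can treat that string as a sentinel. The sketch below assumes the model follows the instruction closely, possibly adding quotes, whitespace, or a trailing period; `interpret_answer` and the fallback text are hypothetical.

```python
DEFAULT_FALLBACK = "Sorry, the FAQ does not cover this question."


def interpret_answer(raw_answer, fallback=DEFAULT_FALLBACK):
    """Map the "NONE" sentinel to a fallback message; pass real answers through unchanged."""
    # Drop surrounding whitespace, quotes, and periods before comparing.
    normalized = raw_answer.strip().strip("\"'.").upper()
    if normalized == "NONE":
        return fallback
    return raw_answer


print(interpret_answer('"NONE"'))
print(interpret_answer("Yes, you can still join."))
```

This only catches answers that consist of the sentinel alone; an answer that merely mentions the word is passed through as is.
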
Putting everything together.

In [None]:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()

prompt_template = """
You're a course teaching assistant.
Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Don't use other information outside of the provided CONTEXT.  

QUESTION: {user_question}

CONTEXT:

{context}
""".strip()


def build_context(documents):
    context_result = ""
    
    for doc in documents:
        doc_str = context_template.format(**doc)
        context_result += ("\n\n" + doc_str)
    
    return context_result.strip()


def build_prompt(user_question, documents):
    context = build_context(documents)
    prompt = prompt_template.format(
        user_question=user_question,
        context=context
    )
    return prompt

def ask_openai(prompt, model="gpt-4o-mini"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content
    return answer


def qa_bot(user_question):
    context_docs = retrieve_documents(user_question)
    prompt = build_prompt(user_question, context_docs)
    answer = ask_openai(prompt)
    return answer
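
The pipeline can be exercised without Elasticsearch or an API key if the retrieval and LLM steps are passed in as functions. The following is a minimal sketch under that assumption; `qa_bot_with`, `fake_retrieve`, and `fake_llm` are hypothetical names, and the sample document is made up.

```python
CONTEXT_TEMPLATE = "Section: {section}\nQuestion: {question}\nAnswer: {text}"

PROMPT_TEMPLATE = (
    "You're a course teaching assistant.\n"
    "Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.\n"
    "Don't use other information outside of the provided CONTEXT.\n\n"
    "QUESTION: {user_question}\n\n"
    "CONTEXT:\n\n"
    "{context}"
)


def build_context(documents):
    """Render each document with the template and separate them with blank lines."""
    return "\n\n".join(CONTEXT_TEMPLATE.format(**doc) for doc in documents)


def qa_bot_with(user_question, retrieve, ask):
    """Same flow as qa_bot, with the retriever and the LLM call injected as functions."""
    docs = retrieve(user_question)
    prompt = PROMPT_TEMPLATE.format(user_question=user_question, context=build_context(docs))
    return ask(prompt)


def fake_retrieve(query):
    # Made-up document, for illustration only.
    return [{"section": "General", "question": "Can I join late?", "text": "Yes.", "course": "demo-course"}]


def fake_llm(prompt):
    # Echo the prompt so the caller can inspect what the model would receive.
    return prompt


result = qa_bot_with("Can I still join?", fake_retrieve, fake_llm)
print(result)
```

Passing `retrieve_documents` and `ask_openai` in place of the fakes would give the same behavior as `qa_bot`.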

In [25]:
qa_bot("I'm getting invalid reference format: repository name must be lowercase")

'The error "invalid reference format: repository name must be lowercase" often occurs due to improper formatting in your Docker command, particularly with the volume mapping. Here are some suggestions to resolve this issue:\n\n1. Ensure that your paths are formatted correctly:\n   - Move your data to a folder without spaces. For example, instead of using “C:/Users/Alexey Grigorev/git/…”, try “C:/git/…”.\n   \n2. Use lowercase names for your repository and volume names throughout your command. Docker requires these to be lowercase.\n\n3. When specifying the volume, try one of the following syntax options:\n   - `-v /c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data`\n   - `-v //c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data`\n   - Use winpty in front of your command if you are on Windows: `winpty docker run [...]`.\n\n4. If issues persist, consider using quotes around your volume paths:\n   - For example: `-v "/c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/d

In [26]:
qa_bot("I can't connect to postgres port 5432, my password doesn't work")

'It seems that your issue is related to the PostgreSQL connection on port 5432. Based on the context provided, the error "password authentication failed for user \'root\'" typically occurs when:\n\n1. The password you are using might not be correct. Ensure you are using the right credentials for the user "root" when connecting.\n\n2. The port 5432 is potentially occupied by another PostgreSQL service on your local machine. In this case, try using a different port, such as 5431, if you have mapped your Docker container to that port. You can change your connection string accordingly, for example:\n   ```\n   engine = create_engine(\'postgresql://root:root@localhost:5431/ny_taxi\')\n   ```\n\n3. If you have a local PostgreSQL installation, it\'s advisable to verify if it’s running and potentially causing conflicts. Stopping that service should help resolve the issue. \n\nMake sure to check if the correct PostgreSQL service is running by executing `docker ps` to see if your container is up