In [2]:
!wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json

--2024-06-23 04:04:33--  https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json
Resolving github.com (github.com)... 20.248.137.48
Connecting to github.com (github.com)|20.248.137.48|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/alexeygrigorev/llm-rag-workshop/main/notebooks/documents.json [following]
--2024-06-23 04:04:33--  https://raw.githubusercontent.com/alexeygrigorev/llm-rag-workshop/main/notebooks/documents.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 658332 (643K) [text/plain]
Saving to: ‘documents.json’


2024-06-23 04:04:34 (50.7 MB/s) - ‘documents.json’ saved [658332/658332]



In [3]:
!head documents.json

[
  {
    "course": "data-engineering-zoomcamp",
    "documents": [
      {
        "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
        "section": "General course-related questions",
        "question": "Course - When will the course start?"
      },
      {


# Loading the documents into memory and flattening them

In [1]:
import json

In [2]:
with open('./documents.json', 'rt') as f_in:
    documents_file = json.load(f_in)

documents = []

for course in documents_file:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [3]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}
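
The flattening step above can also be written as a function that copies each document instead of mutating the parsed JSON. This is a minimal sketch on made-up data: `sample_file`, its course names, and the helper name `flatten_documents` are assumptions for illustration, not part of the notebook.

```python
# Hypothetical two-course sample that mirrors the shape of documents.json.
sample_file = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Q1"},
            {"text": "Answer 2", "section": "General", "question": "Q2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "Setup", "question": "Q3"},
        ],
    },
]


def flatten_documents(documents_file):
    """Return one flat list of documents, each tagged with its course name."""
    return [
        {**doc, "course": course["course"]}  # copy, so the input is not mutated
        for course in documents_file
        for doc in course["documents"]
    ]


flat = flatten_documents(sample_file)
print(len(flat))
print(flat[2])
```

Copying each document keeps the parsed file unchanged, which may help if the cell is re-run or the raw structure is needed later.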

## Let's index the documents in Elasticsearch

In [4]:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.info()

ObjectApiResponse({'name': 'c829d6cdc1e8', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'MP1mGP6-Q7WrWxpXDQ_OZw', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [5]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

In [6]:
index_name = "course-questions"
response = es.indices.create(index=index_name, body=index_settings)
response

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [7]:
from tqdm.auto import tqdm



In [8]:
# Wrap the loop in tqdm (imported above) to show indexing progress
for doc in tqdm(documents):
    es.index(index=index_name, document=doc)
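
Indexing one document per request, as above, costs one HTTP round trip per document. The Python client also provides `elasticsearch.helpers.bulk`, which takes an iterable of action dicts. The sketch below only builds those actions in plain Python so it runs without a cluster; the helper name `to_bulk_actions` and the sample documents are assumptions.

```python
def to_bulk_actions(documents, index_name):
    """Yield one bulk action per document (hypothetical helper)."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


# Made-up documents in the flattened shape used by this notebook.
sample_docs = [
    {"text": "Answer 1", "section": "General", "question": "Q1", "course": "course-a"},
    {"text": "Answer 2", "section": "Setup", "question": "Q2", "course": "course-a"},
]
actions = list(to_bulk_actions(sample_docs, "course-questions"))
print(actions[0])

# With a running cluster, the actions could be sent in one call (not run here):
#   from elasticsearch.helpers import bulk
#   bulk(es, actions)
```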

## Now let's try to retrieve the documents

In [9]:
user_question = "How do I join the course after it has started?"

search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": user_question,
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}

In [10]:
response = es.search(index=index_name, body=search_query)

for hit in response['hits']['hits']:
    doc = hit['_source']
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to su...

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishe...

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependenc...

Section: General course-related questions
Question: How do I use Git / GitHub for this course?
Answer: After you create a GitHub account, you should clone the cour...

Section: Workshop 1 - dlthub
Question: How do I install the necessary dependencies to run the code?
Answer: Answer: To run the provided code, ensure that the 'dlt[duckd...



In [11]:
class Retrieval:
    def __init__(self, index_name: str):
        self.index_name = index_name
        self._host = "http://localhost:9200"
        self.es = Elasticsearch(self._host)
        

    # Returns a list of matching documents (the `_source` of each hit)
    def retrieve_documents(self, query: str, max_results: int = 5) -> list:
        
        search_query = {
            "size": max_results,
            "query": {
                "bool": {
                    "must": {
                        "multi_match": {
                            "query": query,
                            "fields": ["question^3", "text", "section"],
                            "type": "best_fields"
                        }
                    },
                    "filter": {
                        "term": {
                            "course": "data-engineering-zoomcamp"
                        }
                    }
                }
            }
        }
        
        response = self.es.search(index=self.index_name, body=search_query)
        documents = [hit['_source'] for hit in response['hits']['hits']]
        return documents

In [12]:
ret = Retrieval(index_name=index_name)
response = ret.retrieve_documents(query=user_question)

In [13]:
for doc in response:
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to su...

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishe...

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependenc...

Section: General course-related questions
Question: How do I use Git / GitHub for this course?
Answer: After you create a GitHub account, you should clone the cour...

Section: Workshop 1 - dlthub
Question: How do I install the necessary dependencies to run the code?
Answer: Answer: To run the provided code, ensure that the 'dlt[duckd...
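
The `Retrieval` class above fixes the course filter to `data-engineering-zoomcamp`. One possible variation is to build the query body in a separate function that takes the course as a parameter. This is a sketch only: `build_search_query` is a hypothetical helper and `another-course` is a placeholder value, not a course confirmed to exist in the index.

```python
def build_search_query(query, course, max_results=5):
    """Build the request body for a course-scoped FAQ search (hypothetical helper)."""
    return {
        "size": max_results,
        "query": {
            "bool": {
                # Scored full-text match; the question field is boosted by a factor of 3.
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Exact, unscored match on the keyword field.
                "filter": {"term": {"course": course}},
            }
        },
    }


search_query = build_search_query(
    "How do I join the course after it has started?",
    course="another-course",  # placeholder course name
)
print(search_query["query"]["bool"]["filter"])
```

The resulting dict has the same shape as the `search_query` used earlier, so it could be passed to `es.search` in the same way.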



## Now let's move to OpenAI

In [14]:
from openai import OpenAI

In [15]:
client = OpenAI()

In [16]:
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's the formula for energy?"}]
)

In [21]:
print(response.choices[0].message.content)

The formula for energy can vary depending on the type of energy being considered. However, the most general formula for energy is:

E = mc^2 

where E represents energy, m represents mass, and c represents the speed of light in a vacuum (~3.00 x 10^8 m/s). This formula is known as the mass-energy equivalence formula derived by Albert Einstein as part of his theory of relativity.


In [22]:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()

context_docs = ret.retrieve_documents(query=user_question)
context_result = ""

for doc in context_docs:
    doc_str = context_template.format(**doc)
    context_result += ("\n\n" + doc_str)

context = context_result.strip()
print(context)

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependencies and requirements:
Google cloud account
Google Cloud SDK
Python 3 (installed with Anaconda)
Terrafo
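
The context-building loop above can be expressed with `str.join`, which avoids the leading separator and the final `strip()`. A minimal sketch with made-up documents follows; the function name `build_context` and the sample data are assumptions. The result should match the loop version except when the first or last formatted block carries surrounding whitespace, which `strip()` would remove.

```python
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()


def build_context(documents, template=context_template):
    """Format each document with the template and separate blocks by a blank line."""
    # str.format ignores unused keys such as "course".
    return "\n\n".join(template.format(**doc) for doc in documents)


# Made-up documents in the flattened shape used by this notebook.
sample_docs = [
    {"section": "General", "question": "Q1", "text": "A1", "course": "course-a"},
    {"section": "Setup", "question": "Q2", "text": "A2", "course": "course-a"},
]
context = build_context(sample_docs)
print(context)
```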

In [23]:
prompt = f"""
You're a course teaching assistant. Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Only use the facts from the CONTEXT. If the CONTEXT doesn't contain the answer, return "NONE".

QUESTION: {user_question}

CONTEXT:

{context}
""".strip()

In [24]:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content
answer

'You can still join the course after it has started. You are eligible to submit the homeworks, but please be mindful of the deadlines for turning in the final projects to avoid leaving everything until the last minute.'
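
The prompt in In [23] asks the model to return "NONE" when the context has no answer, so calling code may want to detect that sentinel before showing the reply. The sketch below is one possible way to do it; `present_answer`, `NO_ANSWER`, and the fallback wording are assumptions, and a model may not always return the sentinel in exactly this form.

```python
NO_ANSWER = "NONE"  # sentinel requested in the prompt


def present_answer(raw_answer, fallback="Sorry, the FAQ does not cover this question."):
    """Return the model's answer, or a fallback when it replied with the sentinel."""
    # Tolerate surrounding whitespace, quotes, and case differences.
    cleaned = raw_answer.strip().strip('"').strip()
    if cleaned.upper() == NO_ANSWER:
        return fallback
    return raw_answer


print(present_answer('"NONE"'))
print(present_answer("You can still join the course after it has started."))
```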

## Cleaning up and putting it all together

In [31]:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()

prompt_template = """
You're a course teaching assistant.
Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Don't use other information outside of the provided CONTEXT.  

QUESTION: {user_question}

CONTEXT:

{context}
""".strip()


class OpenAIRetrieval:
    def __init__(self, model_name: str = "gpt-3.5-turbo", index_name: str = "course-questions"):
        self.model_name = model_name
        self.index_name = index_name
        self.client = OpenAI()
        self.ret = Retrieval(index_name=self.index_name)
        

    def build_context(self, documents: list = None, context_template: str = "") -> str:
        # documents is a list of FAQ dicts; None avoids a shared mutable default
        context_result = ""
        for doc in documents or []:
            doc_str = context_template.format(**doc)
            context_result += ("\n\n" + doc_str)

        return context_result.strip()

    def build_prompt(self,
                     user_question: str,
                     documents: list = None,
                     context_template: str = "",
                     prompt_template: str = "") -> str:

        context = self.build_context(documents, context_template=context_template)
        prompt = prompt_template.format(
            user_question=user_question,
            context=context
        )
        
        return prompt

    def ask_openai(self, prompt: str) -> str:
        
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}]
        )
        answer = response.choices[0].message.content
        return answer

    def qa_bot(self,
               user_question: str,
               context_template: str=context_template,
               prompt_template: str=prompt_template) -> str:

        context_docs = self.ret.retrieve_documents(query=user_question)
        prompt = self.build_prompt(user_question, context_docs, context_template, prompt_template)
        answer = self.ask_openai(prompt)
        return answer
            

In [32]:
oar = OpenAIRetrieval()
oar.qa_bot("I'm getting invalid reference format: repository name must be lowercase")

'You may be receiving the error "invalid reference format: repository name must be lowercase" because you are not using the correct formatting for repository names in your Docker command. Make sure the repository name is all lowercase in your command. If you are still facing issues, try the various options provided in the course materials or consult the specific Docker documentation for Windows.'

In [33]:
oar.qa_bot("I can't connect to postgres port 5432, my password doesn't work")

'Based on the provided context, it seems that the issue you are encountering with connecting to the Postgres port 5432 could be due to multiple reasons such as password authentication failure, role not existing, or the database not existing. \n\nTo troubleshoot this, you can try changing the port from 5432 to another port like 5431 when creating the docker container. Additionally, you can use pgcli to connect to the Postgres docker container with the new port and the correct user and database information.\n\nAlso, ensure that there are no conflicting services running Postgres on your local machine. You can check this by using commands like `docker ps`, `lsof -i :5432`, or `launchctl list` to identify and possibly stop any conflicting services.'

In [35]:
oar.qa_bot("how can I run kafka?")

'To run Kafka, you can follow the instructions provided in the answer for the question "Java Kafka: How to run producer/consumer/kstreams/etc in terminal." In the project directory, you should run the command: java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java'