## Querying the Q&A documents using a search engine

In [1]:
!wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json

--2024-06-24 15:20:37--  https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json
Resolving github.com (github.com)... 20.207.73.82
Connecting to github.com (github.com)|20.207.73.82|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/alexeygrigorev/llm-rag-workshop/main/notebooks/documents.json [following]
--2024-06-24 15:20:38--  https://raw.githubusercontent.com/alexeygrigorev/llm-rag-workshop/main/notebooks/documents.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 658332 (643K) [text/plain]
Saving to: ‘documents.json.1’


2024-06-24 15:20:39 (2.37 MB/s) - ‘documents.json.1’ saved [658332/658332]



In [2]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-06-24 15:20:39--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py.1’


2024-06-24 15:20:39 (43.6 MB/s) - ‘minsearch.py.1’ saved [3832/3832]



In [3]:
import minsearch
import json

In [4]:
with open('documents.json', 'rt') as file:
    docs_raw = json.load(file)

In [None]:
docs_raw

In [6]:
documents = []
for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)
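The flattening step above can be tried on a small toy structure. This is only a sketch: the course names and FAQ entries below are made up, and the shape of the data is assumed to match what the loop above expects (a list of courses, each with a `documents` list). Unlike the cell above, this variant copies each entry instead of modifying it in place.

```python
# Toy data mirroring the assumed shape of documents.json:
# a list of courses, each holding its own list of FAQ entries.
docs_raw = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Question 1"},
            {"text": "Answer 2", "section": "Setup", "question": "Question 2"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "General", "question": "Question 3"},
        ],
    },
]

documents = []
for course_dict in docs_raw:
    for doc in course_dict["documents"]:
        # Copy the entry so the raw data stays untouched, then tag it with its course.
        documents.append({**doc, "course": course_dict["course"]})

print(len(documents))          # 3
print(documents[0]["course"])  # course-a
```

Each flattened entry now carries its own `course` value, which is what allows a field like `course` to be used for filtering later on.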

In [None]:
documents

In [8]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

In [9]:
index.fit(documents)

<minsearch.Index at 0x7000ce5b3c80>

In [10]:
q = "course has already started, can I still enroll?"

In [11]:
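# Boost weights: matches in the question field count more (3.0) and matches in the
# section field count less (0.5); fields not listed keep their default weight.
boost = {'question': 3.0, 'section': 0.5}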

results = index.search(
    query=q,
    boost_dict=boost,
    num_results=5
)
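To build some intuition for what a boost does, here is a toy sketch that combines per-field scores with boost weights. It is not how `minsearch` is implemented: the word-overlap scorer and the helper names `field_score` and `boosted_score` are hypothetical, chosen only to show how a weighted sum changes the ranking.

```python
def field_score(query, text):
    """Count the query words that also appear in the text (a crude relevance score)."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words)


def boosted_score(query, doc, text_fields, boost):
    """Sum the per-field scores, each multiplied by its boost (1.0 if not listed)."""
    return sum(
        boost.get(field, 1.0) * field_score(query, doc[field])
        for field in text_fields
    )


text_fields = ["question", "text", "section"]
boost = {"question": 3.0, "section": 0.5}
query = "can I still enroll"

# doc_a matches the query in its question, doc_b only in its answer text.
doc_a = {"question": "can I still enroll", "text": "yes", "section": "general"}
doc_b = {"question": "what is kafka", "text": "you can still enroll late", "section": "general"}

print(boosted_score(query, doc_a, text_fields, boost))  # 12.0
print(boosted_score(query, doc_b, text_fields, boost))  # 3.0
```

Without the boost the two documents would score 4 and 3, so the weight on `question` widens the gap in favour of the entry whose question matches the query.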

In [12]:
results

[{'text': 'Yes, you can. You won’t be able to submit some of the homeworks, but you can still take part in the course.\nIn order to get a certificate, you need to submit 2 out of 3 course projects and review 3 peers’ Projects by the deadline. It means that if you join the course at the end of November and manage to work on two projects, you will still be eligible for a certificate.',
  'section': 'General course-related questions',
  'question': 'The course has already started. Can I still join it?',
  'course': 'machine-learning-zoomcamp'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the cour

In [13]:
results = index.search(
    query=q,
    boost_dict=boost,
    filter_dict={'course': 'data-engineering-zoomcamp'},  # restrict results to the data-engineering-zoomcamp course
    num_results=5
)

results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (insta

## Using LLMs

In [18]:
from together import Together
from dotenv import load_dotenv
import os
load_dotenv()

True

In [19]:
client = Together(api_key=os.getenv("TOGETHER_API_KEY"))

In [20]:
q

'course has already started, can I still enroll?'

In [21]:
response = client.chat.completions.create(
    model="meta-llama/Llama-3-8b-chat-hf",
    messages=[{"role": "user", "content": q}],
)

In [23]:
print(response.choices[0].message.content)

It's possible to enroll in a course that has already started, but it depends on the course and the institution offering it. Here are a few scenarios:

1. **Check with the institution**: Reach out to the course provider or institution and ask if it's possible to enroll in the course that has already started. They may have a late enrollment policy or be able to accommodate you in certain circumstances.
2. **Late enrollment allowed**: Some courses may allow late enrollment, especially if there are still seats available. In this case, you may be able to enroll and start attending classes immediately.
3. **Catch-up option**: If the course has already started, the instructor or institution may offer a catch-up option, where you can complete missed assignments or attend make-up classes to get up to speed.
4. **Wait for the next session**: If the course is already in progress, you may need to wait until the next session or semester to enroll. This is often the case for courses that have a fixe

In [35]:
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.
    
QUESTION: {question}
    
CONTEXT: 
{context}
""".strip()

In [36]:
context = ""
    
for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

In [37]:
print(context)

section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

section: General course-related questions
question: Course - What can I do before the course starts?
answer: You can start by installing and setting up all the dependencies and requirements:
Google cloud account
Google Cloud SDK
Python 3 (installed with Anaconda)
Terrafo

In [38]:
prompt = prompt_template.format(question=q, context=context).strip()
print(prompt)

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.
    
QUESTION: course has already started, can I still enroll?
    
CONTEXT: 
section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

section: General course-rela

In [40]:
response = client.chat.completions.create(
    model="meta-llama/Llama-3-8b-chat-hf",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)

Based on the provided context, the answer to the question "course has already started, can I still enroll?" is:

Yes, even if you don't register, you're still eligible to submit the homeworks.


## Cleaning up the code

In [41]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )
    return results

In [42]:
def build_prompt(query, search_results):
    # Keep the template flush left: indentation inside a triple-quoted string
    # would otherwise become part of the prompt.
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    # Format each search result as a section/question/answer block.
    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [43]:
def llm(prompt):
    response = client.chat.completions.create(
        model="meta-llama/Llama-3-8b-chat-hf",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

In [44]:
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer
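The pipeline can also be exercised without an index or an API key by passing the search and LLM steps in as arguments. The sketch below is one possible variation, not the notebook's code: `search_fn`, `llm_fn`, `fake_search` and `fake_llm` are hypothetical names, the FAQ entry is made up, and the prompt template is shortened.

```python
def build_prompt(query, search_results):
    """Assemble a short prompt from the query and the retrieved FAQ entries."""
    context = "\n\n".join(
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    )
    return f"QUESTION: {query}\n\nCONTEXT:\n{context}"


def rag(query, search_fn, llm_fn):
    """Retrieve, build the prompt, generate: each step is swappable."""
    search_results = search_fn(query)
    prompt = build_prompt(query, search_results)
    return llm_fn(prompt)


seen_prompts = []  # records what the stub LLM receives, for inspection


def fake_search(query):
    # Stand-in for the index lookup: always returns one made-up FAQ entry.
    return [{"section": "Module X", "question": "How do I start?", "text": "Read the README."}]


def fake_llm(prompt):
    # Stand-in for the chat-completion call: stores the prompt, returns a fixed answer.
    seen_prompts.append(prompt)
    return "stub answer"


answer = rag("how do I run Kafka?", fake_search, fake_llm)
print(answer)
print(seen_prompts[0])
```

Swapping `fake_search` and `fake_llm` for the real `search` and `llm` functions above would give the same flow as `rag(query)`, while the stubs make it easy to inspect exactly which prompt the model would receive.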

In [46]:
print(rag('how do I run Kafka?'))

Based on the provided context, I can answer the question "how do I run Kafka?" as follows:

Since the context mentions Kafka in the context of Module 6: streaming with Kafka, I will look for answers related to running Kafka in the terminal.

From the context, I found the following answers:

* For running a Java Kafka producer/consumer/kstreams, the command is: `java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java`
* For running a Python Kafka producer, the command is not explicitly mentioned. However, it is mentioned that the `./build.sh: Permission denied Error` can be fixed by running `chmod +x build.sh` in the same directory.

Therefore, the answer to the question "how do I run Kafka?" is:

* For Java Kafka, run the command `java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java`
* For Python Kafka, you may need to run the `build.sh` script after fixing the permission issue by running `chmod +x build