In [1]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-06-22 11:33:25--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py’


2024-06-22 11:33:25 (18.2 MB/s) - ‘minsearch.py’ saved [3832/3832]



In [10]:
import minsearch

In [4]:
import json

In [5]:
with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [6]:
# Flatten the per-course lists into one list, tagging each document with its course
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)

In [7]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [11]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],  # fields searched as free text
    keyword_fields=["course"]  # fields used for exact-match filtering
)

In [None]:
# Filtering on a keyword field is analogous to:
# SELECT * WHERE course = 'data-engineering-zoomcamp';
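
The comment above describes the keyword filter in SQL terms. A minimal sketch of the same idea in plain Python follows. The `filter_by_keywords` helper and the sample records are hypothetical and only illustrate exact-match filtering; this is not minsearch's own implementation.

```python
def filter_by_keywords(docs, filter_dict):
    """Keep only documents whose fields exactly match every key/value in filter_dict."""
    return [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in filter_dict.items())
    ]


# Made-up sample records, for illustration only
sample_docs = [
    {"question": "How do I install Docker?", "course": "data-engineering-zoomcamp"},
    {"question": "How do I install scikit-learn?", "course": "machine-learning-zoomcamp"},
]

matches = filter_by_keywords(sample_docs, {"course": "data-engineering-zoomcamp"})
```

Only the first record survives the filter, which is the behaviour the `WHERE course = ...` clause expresses.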

In [12]:
q = 'the course has already started, can I still enroll?'

In [13]:
index.fit(documents)

<minsearch.Index at 0x7f6285082650>

In [14]:
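def search(query):
    """Return the 5 best-matching FAQ entries for the data engineering course."""
    # Weight matches in 'question' more and matches in 'section' less;
    # 'text' is not listed here, so it keeps its default weight
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return results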
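
To make the effect of `boost_dict` more concrete, here is a rough sketch of weighting per-field scores. It uses a simple token-overlap count as a stand-in for a real similarity measure, so it should not be read as how minsearch scores documents. The names `field_overlap` and `boosted_score` and the sample record are hypothetical.

```python
def field_overlap(query, text):
    """Toy relevance score: number of distinct query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens)


def boosted_score(query, doc, text_fields, boost_dict):
    """Sum per-field scores, multiplying each by its boost (1.0 when not listed)."""
    return sum(
        boost_dict.get(field, 1.0) * field_overlap(query, doc.get(field, ""))
        for field in text_fields
    )


# Made-up record, for illustration only
doc = {
    "question": "how to run kafka producer",
    "text": "run the script after starting docker",
    "section": "streaming with kafka",
}
boost = {"question": 3.0, "section": 0.5}

score = boosted_score("run kafka", doc, ["question", "text", "section"], boost)
# question: 2 matches * 3.0, text: 1 match * 1.0, section: 1 match * 0.5
```

With these weights, a match in the question counts six times as much as the same match in the section name.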

In [15]:
query = 'how do I run kafka?'
search_results = search(query)

search_results

[{'text': "Solution from Alexey: create a virtual environment and run requirements.txt and the python files in that environment.\nTo create a virtual env and install packages (run only once)\npython -m venv env\nsource env/bin/activate\npip install -r ../requirements.txt\nTo activate it (you'll need to run it every time you need the virtual env):\nsource env/bin/activate\nTo deactivate it:\ndeactivate\nThis works on MacOS, Linux and Windows - but for Windows the path is slightly different (it's env/Scripts/activate)\nAlso the virtual environment should be created only to run the python file. Docker images should first all be up and running.",
  'section': 'Module 6: streaming with kafka',
  'question': 'Module “kafka” not found when trying to run producer.py',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'In the project directory, run:\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java',
  'section': 'Module 6: streaming with kafka',

In [16]:
def build_prompt(query, search_results):
    prompt_template = """
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
    Use only the facts from the CONTEXT when answering the QUESTION.
    
    QUESTION: {question}
    
    CONTEXT: 
    {context}
    """.strip()

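    # Concatenate the retrieved FAQ entries into one context string
    context = ""
    
    for doc in search_results:
        context += f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    # Fill in the template; strip() drops the blank lines left after the last entry
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt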
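
Because the template is written inside an indented function body, its lines keep their leading spaces in the final prompt. One possible way to avoid that is sketched below using `textwrap.dedent` from the standard library. The name `build_prompt_dedented` and the sample record are hypothetical, and this is an alternative under those assumptions, not the notebook's own function.

```python
import textwrap

# Dedent once, before formatting, so the inserted context cannot affect the result
PROMPT_TEMPLATE = textwrap.dedent("""
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
    Use only the facts from the CONTEXT when answering the QUESTION.

    QUESTION: {question}

    CONTEXT:
    {context}
""").strip()


def build_prompt_dedented(query, search_results):
    """Hypothetical variant of build_prompt that avoids indented template lines."""
    context = "\n\n".join(
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    )
    return PROMPT_TEMPLATE.format(question=query, context=context)


# Made-up record, for illustration only
example_results = [
    {"section": "Module 6", "question": "How do I start?", "text": "Run the script."}
]
example_prompt = build_prompt_dedented("how do I run kafka?", example_results)
```

Joining the entries with `"\n\n"` also removes the need for a final `strip()`, since no trailing separator is produced.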

In [17]:
prompt = build_prompt(query, search_results)

In [18]:
prompt

'You\'re a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.\n    Use only the facts from the CONTEXT when answering the QUESTION.\n    \n    QUESTION: how do I run kafka?\n    \n    CONTEXT: \n    section: Module 6: streaming with kafka\nquestion: Module “kafka” not found when trying to run producer.py\nanswer: Solution from Alexey: create a virtual environment and run requirements.txt and the python files in that environment.\nTo create a virtual env and install packages (run only once)\npython -m venv env\nsource env/bin/activate\npip install -r ../requirements.txt\nTo activate it (you\'ll need to run it every time you need the virtual env):\nsource env/bin/activate\nTo deactivate it:\ndeactivate\nThis works on MacOS, Linux and Windows - but for Windows the path is slightly different (it\'s env/Scripts/activate)\nAlso the virtual environment should be created only to run the python file. Docker images should first all be up and running.\n\nse

In [20]:
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

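# No api_key is passed here, so the key is assumed to come from the
# MISTRAL_API_KEY environment variable
client = MistralClient()


def llm(prompt):
    """Send the prompt as a single user message and return the model's reply text."""
    response = client.chat(
        model='open-mistral-7b',
        messages=[
            ChatMessage(role="user", content=prompt)
        ]
    )

    return response.choices[0].message.content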

In [21]:
response = llm(prompt)

In [22]:
response

"To run Kafka in the context of the provided course, follow these steps:\n\n1. Create a virtual environment and install the required packages.\n   - Run `python -m venv env` to create a virtual environment.\n   - Activate the virtual environment using `source env/bin/activate` (on MacOS, Linux) or `env\\Scripts\\activate` (on Windows).\n   - Install the required packages by running `pip install -r ../requirements.txt`.\n\n2. Install the necessary dependencies for the code.\n   - If you're working with Python, ensure that the 'dlt[duckdb]' package is installed. You can do this by executing `!pip install dlt[duckdb]`.\n\n3. Run the provided Python scripts.\n   - For example, to run a Kafka producer, navigate to the directory containing the script and run it using `python <script_name>.py`.\n\n4. If you encounter a permission error while running the build script (`.sh` files), use the command `chmod +x build.sh` to grant execution permissions.\n\n5. For Java-based Kafka applications, navi

In [23]:
query = 'the course has already started, can I still enroll?'

def rag(query):
    """Retrieve FAQ entries, build a prompt from them, and ask the LLM for an answer."""
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer
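
The three steps in `rag` can also be exercised without an index or an API key by passing each step in as a function. The sketch below assumes a hypothetical `rag_with` helper and stand-in components (`fake_search`, `fake_build_prompt`, `fake_llm`); none of them are part of the notebook.

```python
def rag_with(query, search_fn, build_prompt_fn, llm_fn):
    """Same three steps as rag, with each step supplied by the caller."""
    search_results = search_fn(query)
    prompt = build_prompt_fn(query, search_results)
    return llm_fn(prompt)


# Stand-in components (hypothetical, for illustration only)
def fake_search(query):
    return [{"section": "General", "question": "Can I still join?", "text": "Yes."}]


def fake_build_prompt(query, search_results):
    context = "\n".join(doc["text"] for doc in search_results)
    return f"QUESTION: {query}\nCONTEXT: {context}"


def fake_llm(prompt):
    # Reports the prompt length instead of calling a model
    return f"received {len(prompt)} characters"


answer = rag_with("can I still enroll?", fake_search, fake_build_prompt, fake_llm)
```

Calling `rag_with(query, search, build_prompt, llm)` with the notebook's own functions would run the same pipeline as `rag(query)`.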

In [24]:
rag(query)

"Based on the provided context, it is stated that even if you don't register for the course after the start date, you are still eligible to submit the homeworks. However, you should be aware of the deadlines for turning in the final projects to avoid leaving everything for the last minute.\n\nIf you miss the start date, you won't be able to participate in the live sessions, but you can still follow the course materials after it finishes. The course materials will be kept available, and you can work on your final capstone project at your own pace.\n\nBefore the course starts, you can prepare by installing and setting up all the dependencies and requirements, such as a Google cloud account, Google Cloud SDK, Python 3 (installed with Anaconda), Terraform, and Git. You can also look over the prerequisites and syllabus to see if you are comfortable with these subjects.\n\nFor support, it is mentioned that the slack channel remains open, and you can ask questions there. It is also suggested 