### Downloading [MinSearch](https://github.com/alexeygrigorev/minsearch), a minimalistic text search engine that uses TF-IDF and cosine similarity for text fields and exact matching for keyword fields

In [1]:
!pip3 install python-dotenv --quiet
!pip3 install tqdm notebook==7.1.2 openai elasticsearch==8.13.0 pandas scikit-learn ipywidgets --quiet

!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch/minsearch.py

--2025-06-01 11:14:39--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5488 (5.4K) [text/plain]
Saving to: ‘minsearch.py.3’


2025-06-01 11:14:39 (57.1 MB/s) - ‘minsearch.py.3’ saved [5488/5488]



### Importing Necessary Libraries & Opening FAQ Parsed Documents

In [2]:
import json
import minsearch
from openai import OpenAI

with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

### High-Level Summary
This code flattens a nested list of FAQ documents. It takes a list of course-related document groups (`docs_raw`) and produces a single flat list (`documents`) where each FAQ entry includes its course name.

#### Step-by-Step Breakdown
Assume `docs_raw` looks like this (e.g., loaded from `documents.json`):

```python
docs_raw = [
    {
        'course': 'data-engineering-zoomcamp',
        'documents': [
            { 'section': 'Intro', 'question': 'What is DE?', 'text': '...' },
            { 'section': 'Tools', 'question': 'What tools are used?', 'text': '...' }
        ]
    },
    {
        'course': 'mlops-zoomcamp',
        'documents': [
            { 'section': 'Setup', 'question': 'How to set up?', 'text': '...' }
        ]
    }
]

```

The code below does the following:

```python
# Initialize an empty list to hold all the individual FAQ entries across all courses.
documents = []
# Iterate over each course dictionary in docs_raw. Each course_dict contains
# a course name ('course') and a list of FAQ documents for that course ('documents').
for course_dict in docs_raw:
    # Iterate through each individual FAQ document (doc) in the current course's documents list.
    for doc in course_dict['documents']:
        # Add a new key 'course' to each doc so that each FAQ entry includes the course it belongs to.
        # Note: this modifies the dictionaries inside docs_raw in place.
        doc['course'] = course_dict['course']
        # Append the modified doc to the documents list.
        documents.append(doc)
```

#### Final Result
The resulting `documents` list is a flat list of all FAQ entries, each of which now includes the course name:

```python
[
    {
        'section': 'Intro',
        'question': 'What is DE?',
        'text': '...',
        'course': 'data-engineering-zoomcamp'
    },
    ...
]
```
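The loop above adds the `course` key to the dictionaries inside `docs_raw` itself. As a minimal sketch of an alternative, assuming you would rather leave `docs_raw` untouched, the same flattening can be written as a list comprehension that copies each entry. The function name `flatten_documents` and the `sample` data are hypothetical and only serve the illustration.

```python
def flatten_documents(docs_raw):
    """Return a flat list of FAQ entries, each tagged with its course name.

    Each entry is copied, so the dictionaries inside docs_raw are not modified.
    """
    return [
        {**doc, "course": course_dict["course"]}
        for course_dict in docs_raw
        for doc in course_dict["documents"]
    ]


# Small sample in the same shape as documents.json (illustrative data only).
sample = [
    {
        "course": "data-engineering-zoomcamp",
        "documents": [
            {"section": "Intro", "question": "What is DE?", "text": "..."},
            {"section": "Tools", "question": "What tools are used?", "text": "..."},
        ],
    },
    {
        "course": "mlops-zoomcamp",
        "documents": [
            {"section": "Setup", "question": "How to set up?", "text": "..."},
        ],
    },
]

flat = flatten_documents(sample)
print(len(flat), flat[0]["course"])
```

Copying costs a little extra memory, but re-running the cell cannot leave stale `course` keys in the raw data.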

In [3]:
# Initialize an empty list to hold all the individual FAQ entries across all courses.
documents = []
# Iterate over each course dictionary in docs_raw. Each course_dict contains
# a course name ('course') and a list of FAQ documents ('documents').
for course_dict in docs_raw:
    # Iterate through each FAQ document (doc) in the current course's documents list.
    for doc in course_dict['documents']:
        # Add a new key 'course' to each doc so that each FAQ entry includes the course it belongs to.
        doc['course'] = course_dict['course']
        # Append the modified doc to the documents list.
        documents.append(doc)

documents[0]

{'text': "Data Engineering Zoomcamp FAQ\nData Engineering Zoomcamp FAQ\nThe purpose of this document is to capture Frequently asked technical questions\nEditing guidelines:\nWhen adding a new FAQ entry, make sure the question is “Heading 2”\nFeel free to improve if you see something is off\nDon’t change the formatting in the Data document or add any visual “improvements” (make a copy for yourself first if you need to do it for whatever reason)\nDon’t change the pages format (it should be “pageless”)\nAdd name and date for reference, if possible\nThe next cohort starts January 13th 2025. More info at DTC.\nRegister before the course starts using this link.\nJoint the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When does the course start?',
 'course': 'data-engineering-zoomcamp'}

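This code initializes and fits a **search index** using the `minsearch` library. The index applies **TF-IDF with cosine similarity** to the text fields (`question`, `text`, `section`) and **exact matching** to the keyword field (`course`), which makes it a good fit for structured documents such as FAQs.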

In [4]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

q = 'the course has already started, can I still enroll?'

index.fit(documents)

<minsearch.Index at 0x727b7aec8290>
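To make the TF-IDF and cosine-similarity idea more concrete, here is a minimal pure-Python sketch of that scoring scheme. It is not MinSearch's actual implementation: the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions chosen for the illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): lowercase, split on whitespace.
    return text.lower().split()


def build_tfidf(texts):
    """Compute smoothed IDF weights and one sparse TF-IDF vector per text."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    idf = {
        term: math.log((1 + n_docs) / (1 + count)) + 1
        for term, count in doc_freq.items()
    }
    vectors = [
        {term: freq * idf[term] for term, freq in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return idf, vectors


def vectorize(text, idf):
    # Terms that never appeared in the indexed texts are ignored.
    counts = Counter(tokenize(text))
    return {term: freq * idf[term] for term, freq in counts.items() if term in idf}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


texts = [
    "how do I install docker",
    "when does the course start",
    "can I still join the course after it starts",
]
idf, vectors = build_tfidf(texts)
query_vec = vectorize("when will the course start", idf)
scores = [cosine(query_vec, vec) for vec in vectors]
best = max(range(len(texts)), key=scores.__getitem__)
print(texts[best])
```

The query shares the most weighted terms with the second text, so that text receives the highest score, while the first text shares none and scores zero.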

In [5]:
%%writefile .env
OPENAI_API_KEY = "<ADD_YOUR_OPENAI_API_KEY>"

Writing .env


In [6]:
import os
from dotenv import load_dotenv

# Load variables from the `.env` file into the environment
load_dotenv()

# Access the key
api_key = os.getenv("OPENAI_API_KEY")

client = OpenAI(api_key=api_key)
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{"role": "user", "content": q}]
)

response.choices[0].message.content

"Whether you can still enroll in a course that has already started depends on several factors, such as the institution's policies, the specific course, and how far along the course is. Here are a few steps you can take:\n\n1. **Check the Institution's Guidelines**: Review the official website or contact the admissions office to understand their policy on late enrollments.\n\n2. **Contact the Instructor**: Often, instructors have some flexibility regarding late enrollments, especially if the course has just started. They might be willing to accommodate you.\n\n3. **Consider the Impact**: Think about how missing the start might affect your ability to catch up on missed content, assignments, and exams.\n\n4. **Online vs. In-Person Courses**: Online courses might offer more flexibility for late enrollment compared to traditional in-person classes.\n\n5. **Audit the Course**: If enrollment is not possible, ask if you can audit the course, allowing you to attend the classes without receiving

### High-Level Documentation
This code implements a simple **Retrieval-Augmented Generation (RAG)** pipeline to answer user queries about the `"data-engineering-zoomcamp"` course. The process works as follows:
1. **Search** for relevant FAQ entries using the text search index (`minsearch`).
2. **Build a prompt** by formatting the retrieved FAQs into a natural language instruction.
3. **Query an LLM** (like GPT-4o) with that prompt to generate a precise, context-aware answer.
4. **Return** the answer to the user.

This is useful for building a course assistant that answers questions using only the information in a predefined FAQ knowledge base.
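The four steps can be sketched independently of any particular search backend or LLM. The snippet below is a hedged illustration: `make_rag` and the three stand-in functions are hypothetical names, and the fake components only exist so the wiring can be run without an index or an API key.

```python
def make_rag(search_fn, build_prompt_fn, llm_fn):
    """Wire three pluggable steps into a single rag(query) function."""
    def rag(query):
        results = search_fn(query)                # 1. retrieve
        prompt = build_prompt_fn(query, results)  # 2. build the prompt
        return llm_fn(prompt)                     # 3. generate, 4. return
    return rag


# Stand-ins for illustration only; they are not real search or LLM calls.
def fake_search(query):
    return [{"section": "General", "question": "Can I join late?", "text": "Yes."}]


def fake_build_prompt(query, results):
    context = "\n".join(doc["text"] for doc in results)
    return f"QUESTION: {query}\nCONTEXT: {context}"


def fake_llm(prompt):
    # Echoes the prompt in upper case instead of calling a model.
    return prompt.upper()


demo_rag = make_rag(fake_search, fake_build_prompt, fake_llm)
demo_answer = demo_rag("can I still enroll?")
print(demo_answer)
```

Passing the steps in as functions means the retrieval backend can be swapped without rewriting the pipeline itself.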

In [7]:
def search(query):
    """
    Searches the minsearch index for relevant FAQ documents based on a user query.

    Args:
        query (str): The user's natural language question.

    Returns:
        list: A list of the top 5 matched documents from the index.
    """

    # Define a boosting dictionary to prioritize certain fields more than others
    boost = {
        'question': 3.0,  # Give higher weight to matches in the 'question' field
        'section': 0.5    # Give lower weight to matches in the 'section' field
    }

    # Perform the search using the index
    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},  # Restrict to this course only
        boost_dict=boost,     # Apply boost values to fields
        num_results=5         # Return the top 5 matches
    )

    return results


def build_prompt(query, search_results):
    """
    Builds a prompt string for the LLM using retrieved FAQ entries as context.

    Args:
        query (str): The user's natural language question.
        search_results (list): A list of relevant documents returned by `search()`.

    Returns:
        str: A formatted prompt to send to the LLM.
    """

    # Define the prompt template with placeholders for the question and context
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT: 
{context}
""".strip()

    context = ""

    # Loop over the retrieved documents and construct a context block
    for doc in search_results:
        # Append structured info from each document to the context
        context += (
            f"section: {doc['section']}\n"
            f"question: {doc['question']}\n"
            f"answer: {doc['text']}\n\n"
        )

    # Fill in the prompt template with the user question and constructed context
    prompt = prompt_template.format(question=query, context=context).strip()

    return prompt


def llm(prompt):
    """
    Sends the prompt to the OpenAI Chat API using GPT-4o and returns the model's response.

    Args:
        prompt (str): A complete natural language instruction including context and question.

    Returns:
        str: The generated answer from the model.
    """

    # Call the OpenAI Chat API with the GPT-4o model and a user message
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": prompt}]
    )

    # Extract the generated message text from the response object
    return response.choices[0].message.content


def rag(query):
    """
    Runs the full RAG (Retrieval-Augmented Generation) pipeline:
    1. Retrieves relevant documents
    2. Builds a prompt with them
    3. Sends the prompt to the LLM
    4. Returns the answer

    Args:
        query (str): The user's input question.

    Returns:
        str: The answer generated by the LLM.
    """

    # Step 1: Search relevant FAQ entries using the query
    search_results = search(query)

    # Step 2: Create a prompt using the matched documents
    prompt = build_prompt(query, search_results)

    # Step 3: Generate a response from the LLM using the prompt
    answer = llm(prompt)

    return answer


query_1 = 'how do I run kafka?'
query_2 = 'the course has already started, can I still enroll?'

print(f"Query -> {query_1}, and it's Response -> {rag(query_1)}")
print("\n")
print(f"Query -> {query_2}, and it's Response -> {rag(query_2)}")

Query -> how do I run kafka?, and it's Response -> To run Kafka, you should ensure that all necessary components, including Docker images, are up and running first. If working in Python, create and activate a virtual environment to run the python files related to Kafka:

1. Create and activate a virtual environment:
   ```
   python -m venv env
   source env/bin/activate    # For MacOS/Linux
   env\Scripts\activate       # For Windows
   ```

2. Install the necessary packages:
   ```
   pip install -r ../requirements.txt
   ```

3. To deactivate the virtual environment once you are done, use:
   ```
   deactivate
   ```

For running Java Kafka components (like producer/consumer/kstreams), use the following command in the terminal within the project directory:
```
java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java
```


Query -> the course has already started, can I still enroll?, and it's Response -> Yes, you can still enroll in the course a

### Running Elasticsearch in Docker

The command below launches `Elasticsearch 8.4.3` in **single-node mode**, disables security features, maps ports 9200 and 9300, and automatically removes the container when it exits. This setup is useful for local development, testing, or short-term usage that does not need persistent storage or a cluster.

```bash
# Run Elasticsearch in a Docker container.
# Comments cannot be placed between backslash-continued lines (they would
# cut the command short), so each flag is explained here instead:
#   --rm                               remove the container automatically when it exits
#   --name elasticsearch               give the container a name for easier reference
#   -m 4GB                             cap the container's memory at 4 GB
#   -p 9200:9200                       map host port 9200 to the container (Elasticsearch REST API)
#   -p 9300:9300                       map port 9300 (node-to-node transport; not used in single-node mode)
#   -e discovery.type=single-node      run Elasticsearch in single-node mode (no clustering)
#   -e xpack.security.enabled=false    disable security features (no auth; for local testing only)
#   docker.elastic.co/...:8.4.3        the official Elasticsearch image, version 8.4.3
docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```
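Elasticsearch can take a little while to accept connections after the container starts, so running the next notebook cell too early may fail with a connection error. The helper below is a minimal sketch that polls the REST endpoint until it answers. The function name `wait_for_elasticsearch` and its default timeout and interval are assumptions, and it relies on security being disabled as in the command above.

```python
import time
import urllib.request


def wait_for_elasticsearch(url="http://localhost:9200", timeout=60.0, interval=2.0):
    """Poll `url` until it returns HTTP 200 or `timeout` seconds have passed.

    Returns True if the server answered in time, otherwise False.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or HTTP error: the server is not ready yet.
            pass
        time.sleep(interval)
    return False
```

A typical use would be `wait_for_elasticsearch()` right before creating the Python client, raising an error if it returns `False`.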

### Connecting to Elasticsearch and creating the index for search & retrieval

This code sets up an **Elasticsearch** index using Python. It connects to a local Elasticsearch instance, defines how the index should be structured (settings + field types), and then creates the index.

#### High-Level Summary
- Connects to a local Elasticsearch instance (`http://localhost:9200`)
- Defines an index called `course-questions`
- Sets basic configuration for performance (e.g., 1 shard, 0 replicas)
- Maps four fields: `text`, `section`, `question`, and `course` — with appropriate data types
- Creates the index using the given schema

In [10]:
from tqdm.auto import tqdm
from elasticsearch import Elasticsearch

# The Python client to connect to an Elasticsearch cluster. Connects to an Elasticsearch server running locally on port 9200.
es_client = Elasticsearch('http://localhost:9200')

index_settings = {
    "settings": {
        "number_of_shards": 1, # Splits the index into 1 shard (good for small datasets or local development).
        "number_of_replicas": 0 # Sets 0 replicas (no redundancy, also okay for dev/test environments).
    },
    "mappings": { # Mappings: Define how each field in your documents will be indexed and queried. 
        # "text" fields are full-text searchable (analyzed into tokens).
        # "keyword" fields (like course) are not analyzed — used for exact matches, filters, and aggregations.
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

# Sets the name of the index to "course-questions".
index_name = "course-questions"
# Sends a request to the Elasticsearch server to create the index with the specified name and settings.
es_client.indices.create(index=index_name, body=index_settings)
# After running this code: An index called course-questions will exist in your Elasticsearch instance.

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})
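Re-running the cell above against the same Elasticsearch instance will typically fail, because the index already exists. A minimal sketch of a guard is shown below. The helper name `ensure_index` is hypothetical; it assumes the 8.x Python client, where `indices.exists` and `indices.create` accept the keyword arguments used here.

```python
def ensure_index(client, name, index_settings):
    """Create the index only if it does not exist yet.

    Returns True if the index was created, False if it was already there.
    """
    if client.indices.exists(index=name):
        return False
    client.indices.create(
        index=name,
        settings=index_settings.get("settings"),
        mappings=index_settings.get("mappings"),
    )
    return True
```

With the names from this notebook, the call would be `ensure_index(es_client, index_name, index_settings)`.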

In [11]:
## Printing one document
print(documents[0])

## Indexing the documents in Elasticsearch (one request per document)
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

{'text': "Data Engineering Zoomcamp FAQ\nData Engineering Zoomcamp FAQ\nThe purpose of this document is to capture Frequently asked technical questions\nEditing guidelines:\nWhen adding a new FAQ entry, make sure the question is “Heading 2”\nFeel free to improve if you see something is off\nDon’t change the formatting in the Data document or add any visual “improvements” (make a copy for yourself first if you need to do it for whatever reason)\nDon’t change the pages format (it should be “pageless”)\nAdd name and date for reference, if possible\nThe next cohort starts January 13th 2025. More info at DTC.\nRegister before the course starts using this link.\nJoint the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.", 'section': 'General course-related questions', 'question': 'Course - When does the course start?', 'course': 'data-engineering-zoomcamp'}


  0%|          | 0/1119 [00:00<?, ?it/s]
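Indexing one document per request is fine for about a thousand FAQ entries, but it means one HTTP round trip per document. As a hedged alternative, the sketch below batches the documents with the client's `elasticsearch.helpers.bulk` helper. The function names `generate_actions` and `bulk_index` are hypothetical, and the import is done lazily so the action generator can be inspected without the client installed.

```python
def generate_actions(documents, index_name):
    """Yield one bulk action per document, targeting `index_name`."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es_client, documents, index_name):
    """Send all documents in batches using the Elasticsearch bulk helper."""
    # Imported lazily so this module loads even without the client installed.
    from elasticsearch.helpers import bulk

    # By default, bulk() returns (number_of_successes, list_of_errors).
    return bulk(es_client, generate_actions(documents, index_name))
```

With the names from this notebook, the call would be `bulk_index(es_client, documents, index_name)`.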

In [12]:
query_3 = 'I just discovered the course. Can I still join it?'

def elastic_search(query):
    # Construct the Elasticsearch query object
    search_query = {
        "size": 5,  # Limit the search to return only the top 5 results
        "query": {
            "bool": {
                # 'must' clause: defines conditions that must match
                "must": {
                    "multi_match": {
                        "query": query,  # The user's input query string

                        # Search across multiple fields in the index
                        "fields": [
                            "question^3",  # Boost the 'question' field by 3x to increase its weight in scoring
                            "text",        # Search the main answer text
                            "section"      # Also search section titles
                        ],
                        "type": "best_fields"  # Use the best matching field to score each document
                    }
                },
                # 'filter' clause: restrict results without affecting score
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"  # Only include results from this specific course
                    }
                }
            }
        }
    }
    # Send the query to Elasticsearch and store the response
    response = es_client.search(index=index_name, body=search_query)
    result_docs = []  # Prepare a list to collect the final matched documents
    # Iterate over each hit in the search result
    for hit in response['hits']['hits']:
        # '_source' contains the actual content of the document
        result_docs.append(hit['_source'])
    # Return the list of matched documents (each is a dict with question, text, section, and course)
    return result_docs

def rag(query):
    # Step 1: Search the FAQ index for relevant documents based on the input query.
    # This uses full-text search with field boosts and filtering to find top matches.
    search_results = elastic_search(query)

    # Step 2: Construct a prompt for the language model.
    # This combines the original user query and the context retrieved from the search results.
    # The prompt follows a predefined format, instructing the LLM to answer only using the provided context.
    prompt = build_prompt(query, search_results)

    # Step 3: Send the prompt to the LLM (GPT-4o, as configured in llm()) and get the generated response.
    # The model processes the question and context and returns an answer.
    answer = llm(prompt)

    # Step 4: Return the model's response to the caller.
    return answer


rag(query_3)

"Yes, you can still join the course even if it has already started. You are eligible to submit the homework without registering. However, please be aware that there are deadlines for submitting homework and the final projects, so it's important not to leave everything until the last minute."