In [1]:
from mistralai import Mistral, UserMessage
import os
import json
import wget
import minsearch

# Replacing MinSearch with ElasticSearch for Scalable Search

### 1. Why Replace MinSearch with ElasticSearch?
MinSearch is a simple search solution:

- It stores everything in RAM (memory).

- It’s good for small projects and quick prototyping.

- It cannot scale for large document collections or real-world applications.


ElasticSearch, on the other hand:

- Is a distributed, scalable search engine.

- Stores data persistently on disk (doesn't lose data after shutdown).

- Provides full-text search, advanced filtering, boosting, and more.

- Is used in production environments at scale.

👉 Goal: Migrate from MinSearch to ElasticSearch for better performance, scalability, and flexibility.

### 2. How to Run ElasticSearch with Docker
We will use Docker to quickly start ElasticSearch locally.

```bash
docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```

#### Breakdown:
- `-it`: Interactive mode to see logs.

- `--rm`: Remove container when stopped.

- `--name elasticsearch`: Name the container.

- `-m 4GB`: Allocate enough memory (ElasticSearch needs at least 2–4 GB).

- `-p 9200:9200`: HTTP communication (for queries).

- `-p 9300:9300`: Transport communication (internal ES nodes, not used here).

- `discovery.type=single-node`: Run as a standalone server.

- `xpack.security.enabled=false`: Turn off username/password authentication (good for local testing).
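
The container usually needs some time to boot before it accepts requests. Below is a minimal sketch of a readiness check, assuming the Python client's `ping()` method is used for polling. The helper name `wait_until_ready` and its `retries` and `delay` parameters are hypothetical, not part of the original notebook.

```python
import time


def wait_until_ready(es, retries=30, delay=2.0):
    """Poll `es.ping()` until the node answers or the retries run out.

    `es` is expected to be an `elasticsearch.Elasticsearch` client, but any
    object with a `ping()` method returning True/False works as well.
    Returns the number of attempts it took.
    """
    for attempt in range(1, retries + 1):
        try:
            if es.ping():
                return attempt
        except Exception:
            # The container may still be booting and refusing connections.
            pass
        time.sleep(delay)
    raise TimeoutError(f"ElasticSearch did not respond after {retries} attempts")
```

Once the client is created, calling `wait_until_ready(es)` before any indexing code should avoid connection errors during startup.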


### 3. Indexing the Documents in ElasticSearch
Now that ElasticSearch is running, the next steps are to connect to it from Python, create an index, and load the documents into it.


In [2]:
# pip install elasticsearch 
#     or
# pip install elasticsearch==8.11.0
from elasticsearch import Elasticsearch

In [3]:
es = Elasticsearch('http://localhost:9200') # localhost can be replaced with url if deployed in the cloud
es.info()

ObjectApiResponse({'name': '4599cffbb7e3', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'og0eFxe5QqODoI5tdDeCIQ', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

### 3.2 Create an Index with Settings and Mappings
An index in ElasticSearch is like a table in SQL — it defines how data is stored.

Here’s the index definition:

In [4]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"}
        }
    }
}

#### Explanation:

- `number_of_shards: 1`: one shard (splitting is not needed for a small project).

- `number_of_replicas: 0`: no replicas are needed locally.

- `text`, `section`, `question` fields are text type (full-text searchable).

- `course` field is a keyword (for exact matching, e.g., filtering by course)
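
To make the difference between `text` and `keyword` concrete, here is a toy sketch in plain Python. It is not how ElasticSearch is implemented: `analyze`, `text_field_matches`, and `keyword_field_matches` are hypothetical helpers that only imitate the idea that `text` values are broken into lowercase tokens while `keyword` values are compared as a whole.

```python
import re


def analyze(value):
    """Rough stand-in for a text analyzer: lowercase and split on non-alphanumerics."""
    return [token for token in re.split(r"[^a-z0-9]+", value.lower()) if token]


def text_field_matches(stored, query):
    """A `text`-like match: true if any query token appears among the stored tokens."""
    stored_tokens = set(analyze(stored))
    return any(token in stored_tokens for token in analyze(query))


def keyword_field_matches(stored, query):
    """A `keyword`-like match: the whole value must be identical."""
    return stored == query


question = "Course - When will the course start?"
course = "data-engineering-zoomcamp"

print(text_field_matches(question, "START date"))          # partial, case-insensitive match
print(keyword_field_matches(course, "data-engineering"))   # partial value does not match
print(keyword_field_matches(course, "data-engineering-zoomcamp"))
```

This is why `course` works well as a filter (exact value) while `question`, `text`, and `section` are better suited to free-text queries.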

### 3.3 Create or Reset the Index

In [5]:
index_name = 'course-questions'

if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)  # Delete the old index if it already exists.

# Create a new, clean index from the settings and mappings defined above.
es.indices.create(
    index=index_name,
    settings=index_settings["settings"],
    mappings=index_settings["mappings"],
)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

##  📚 Why Do We Have Indexes in ElasticSearch?

In **ElasticSearch** (and in general search engines and databases), an **index** is **essential** because:

| Reason | Explanation |
|:---|:---|
| **Fast Search** | Searching raw documents one by one would be **too slow**, especially with millions of records. Indexes **organize** the data smartly for **fast retrieval**. |
| **Efficient Storage** | Indexes store **only the important parts** (like tokenized words and IDs), reducing what ElasticSearch needs to look at during search. |
| **Ranking and Scoring** | Indexes allow ElasticSearch to calculate **relevance scores** (e.g., BM25, TF-IDF) quickly, deciding which documents match best. |
| **Filtering and Aggregation** | Without indexes, advanced features like **filters** (e.g., course = "data-engineering-zoomcamp") and **analytics** would be much slower or impossible. |
| **Distributed Search** | In ElasticSearch, an index can be **split into shards**, allowing **parallel search** across multiple servers — making huge datasets searchable in seconds. |

---

### 🧠 Think of it like this:

- **Without an index** → You search for a book by **reading every page of every book** one by one. 🐢
- **With an index** → You first check the **library catalog**, find the exact shelf, and jump directly to the book and page. 🚀

---

### 🔥 How ElasticSearch Builds Indexes

When you add a document:
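1. ElasticSearch **analyzes** the text: it splits it into words and lowercases them. Stopword removal and stemming are optional and depend on the configured analyzer (the default `standard` analyzer does not remove stopwords).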
2. It builds an **inverted index** — like a giant map:
   - For every **word**, it stores a list of **document IDs** where the word appears.
   
Example (tiny version):

| Word | Document IDs |
|:---|:---|
| docker | [doc1, doc2] |
| container | [doc1] |
| system | [doc1, doc3] |

✅ So if you search for "docker system", ElasticSearch **instantly** knows which documents to fetch.
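
The inverted index from the table can be sketched in a few lines of plain Python. This is a toy illustration only: the document texts are made up to reproduce the table, and ranking by the number of matching tokens is a simplification, not the BM25 scoring ElasticSearch actually uses.

```python
from collections import defaultdict


def tokenize(text):
    """Very rough analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map every token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query tokens they contain."""
    counts = defaultdict(int)
    for token in set(tokenize(query)):
        for doc_id in index.get(token, ()):
            counts[doc_id] += 1
    # Most matching tokens first; ties are broken by document ID.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical documents chosen to match the table above.
docs = {
    "doc1": "docker container system",
    "doc2": "docker",
    "doc3": "system",
}
index = build_inverted_index(docs)
results = search(index, "docker system")
print(results)
```

Only the postings lists for `docker` and `system` are touched; no document text is scanned at query time.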

---

### 🛠️ In ElasticSearch, an "Index" contains:

- Documents (actual data)
- Inverted index (token-to-doc lookup)
- Metadata (shards, mappings, settings)

And an Index can be **further split into Shards** to allow **horizontal scaling** across machines!

---

### 📌 TL;DR

✅ **Indexes** make search **fast, efficient, and scalable**.  
✅ Without indexes, ElasticSearch would be **useless for large datasets**.  
✅ **Everything** — full-text search, filters, scoring — is **built on top of indexes**.

---



In [6]:
# Opening the document 
with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [7]:
# Initialize an empty list to store all documents
documents = []

# Loop over each course dictionary in the raw documents
for course_dict in docs_raw:
    # Inside each course_dict, there is a list of documents under the key 'documents'
    for doc in course_dict['documents']:
        # Add the course name to each individual document
        # This helps us later filter or search by course in ElasticSearch
        doc['course'] = course_dict['course']
        
        # Append the modified document to the final documents list
        documents.append(doc)
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

### Quick Summary: ElasticSearch Document Structure

#### 1. Example Document

```python
{
    'text': "...",         # Main body content
    'section': "...",      # Section label
    'question': "...",     # Specific FAQ-style question
    'course': "..."        # Course name (e.g., data-engineering-zoomcamp)
}
```
#### 2. Key Points

Documents have 4 fields: text, section, question, course.

Index name (e.g., `course-questions`) is separate; it is not stored inside the document.

Documents are stored inside an index.

```
Index: course-questions
|
|-- Document 0: {text, section, question, course}
|-- Document 1: {text, section, question, course}
|-- Document 2: {text, section, question, course}
|-- ...
```

In [8]:
from tqdm.notebook import tqdm

In [9]:
for doc in tqdm(documents):
    es.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]
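
Indexing one document per request is fine for about a thousand documents, but it sends one HTTP call per document. As a possible alternative, the sketch below uses the client's `helpers.bulk` function to send documents in batches. The function names `generate_actions` and `bulk_index` are hypothetical; the import is done lazily so the generator can be inspected without the client installed.

```python
def generate_actions(documents, index_name):
    """Yield one bulk action per document, in the shape `helpers.bulk` accepts."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}


def bulk_index(es, documents, index_name):
    """Index all documents in batches and return (success_count, errors)."""
    # Imported here so that generate_actions() works without the package.
    from elasticsearch import helpers

    return helpers.bulk(es, generate_actions(documents, index_name))
```

With the notebook's variables, this would be called as `bulk_index(es, documents, index_name)` instead of the loop above.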

In [10]:
def search_elasticsearch(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }
    
    response = es.search(index=index_name, body=search_query)
    
    hits = response["hits"]["hits"]
    result_docs = [hit["_source"] for hit in hits]
    
    return result_docs


In [11]:
search_elasticsearch('can i still join the course')

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at

## Query Explanation:

| Part                             | Purpose                                                           |
|----------------------------------|-------------------------------------------------------------------|
| **size: 5**                      | Return only the top 5 best results                                |
| **multi_match**                  | Search across multiple fields at once                             |
| **fields: ["question^3", "text", "section"]** | Search in `question`, `text`, and `section`; matches in `question` are boosted by a factor of 3 |
| **type: best_fields**            | Score by the field with the best match                            |
| **filter: term (course)**        | Only include documents that belong to the specific course        |
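
The query above hardcodes the course name and the result size. A small sketch of a parameterized variant follows; `build_search_query` is a hypothetical helper that only builds the same request body as a dictionary, so it can be inspected without a running cluster.

```python
def build_search_query(query, course, size=5):
    """Build the same bool query as above, with the course and size as parameters."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text part: contributes to the relevance score.
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Exact-match part: restricts results without affecting the score.
                "filter": {"term": {"course": course}},
            }
        },
    }


body = build_search_query("can i still join the course", "data-engineering-zoomcamp")
print(body["query"]["bool"]["filter"])
```

The resulting dictionary can be passed to `es.search(index=index_name, body=body)` in the same way as `search_query` in the function above.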

---

## Why `best_fields`?

ElasticSearch offers different `multi_match` types:

| Type            | Description                                                    |
|-----------------|---------------------------------------------------------------|
| **best_fields** | (✅ What we use) Scores each document by its single best-matching field. |
| **most_fields** | Combines the scores of all matching fields.                    |
| **cross_fields**| Treats multiple fields as one big field.                      |

You can read more about these in the official [ElasticSearch Multi-Match documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html) (recommended).
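
To see how the first two types differ, here is a toy calculation in plain Python. It assumes the per-field scores are already known (the numbers are invented) and simplifies the real scoring: `best_fields` is modeled as the best field's score plus an optional `tie_breaker` share of the others, and `most_fields` as a plain sum.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Best field's score plus tie_breaker times the scores of the other fields."""
    if not field_scores:
        return 0.0
    scores = sorted(field_scores.values(), reverse=True)
    return scores[0] + tie_breaker * sum(scores[1:])


def most_fields_score(field_scores):
    """Sum of the scores of all matching fields."""
    return sum(field_scores.values())


# Invented per-field scores for a single document.
field_scores = {"question": 6.0, "text": 1.5, "section": 0.5}

print(best_fields_score(field_scores))                   # 6.0
print(best_fields_score(field_scores, tie_breaker=0.5))  # 7.0
print(most_fields_score(field_scores))                   # 8.0
```

With `best_fields`, a document that matches strongly in `question` alone can outrank one with weak matches spread across several fields, which suits FAQ-style lookups.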

---

## 5. Summary: End-to-End Pipeline

- ✅ ElasticSearch runs locally using Docker.
- ✅ Indexes are created with correct mappings.
- ✅ Documents are uploaded (indexed).
- ✅ Full-text search across `question`, `text`, `section`.
- ✅ Filtered by `course`.
- ✅ Only top 5 best-matching documents are returned.

---

## 6. Quick Comparison: MinSearch vs ElasticSearch

| Feature               | **MinSearch**        | **ElasticSearch**       |
|-----------------------|----------------------|-------------------------|
| **Storage**           | Only in memory (RAM) | Persistent on disk      |
| **Scale**             | Small, prototype-level | Production, scales to millions of docs |
| **Search Types**      | Simple text search with basic filters and boosts | Full-text, filters, boosting, aggregations |
| **Authentication**    | No                   | Supported (optional)    |
| **Deployment**        | Local scripts        | Local, cloud, clusters  |


In [15]:
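# Assumption: the API key is stored in the MISTRAL_API_KEY environment variable.
API_KEY = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=API_KEY)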

def build_prompt(query, search_results):
    prompt_template = """
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
    Use only the facts from the CONTEXT when answering the QUESTION.

    QUESTION: {question}

    CONTEXT: 
    {context}
    """.strip()

    context = ""
    for doc in search_results:
        context += f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt


def llm(prompt):
    response = client.chat.complete(
        model='open-mistral-7b',
        messages=[UserMessage(content=prompt)]
        )
    return response.choices[0].message.content


def rag(query):
    search_results = search_elasticsearch(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [16]:
query = "What are the modules will we complete?"
print(rag(query))

Based on the provided context, the course modules are not explicitly mentioned. However, it is mentioned that the course starts on January 15, 2024, and the first module seems to be related to Docker and Terraform as there is a section titled "Module 1: Docker and Terraform" in the FAQ database. The rest of the sections appear to cover course-related questions such as prerequisites, data, homework deadlines, and the start date of the course. For more detailed information about the course modules, it might be best to refer to the course syllabus or the course instructor directly.


### What's happening?
1. ElasticSearch search:

    - `search_elasticsearch(query)` returns the **5 top-matching documents**, ranked by relevance score.

    - These 5 documents are not answers yet; **they are pieces of information** (sections, questions, text).

2. `build_prompt` function:

    - It combines the 5 documents into one big CONTEXT string.

    - It writes them like:
  
    ```
        section: ...
        question: ...
        answer: ...
    ```
    - Then it attaches your original user QUESTION on top.

    - The final prompt looks like:
  
      ```
      You're a course assistant.
      QUESTION: (your question)
      CONTEXT:
      (5 documents combined here)
      ```
3. `llm(prompt)`:

    - This entire prompt (your question + 5 documents) is sent to the LLM (e.g., Mistral 7B).

    - The LLM reads everything and generates ONE final answer based on the context you provided.
  

```
User query ---> ElasticSearch ---> Top 5 docs
                   |
                   v
          build_prompt(query, docs)
                   |
                   v
              big prompt
                   |
                   v
                 LLM
                   |
                   v
             One final answer
```
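
The flow in the diagram can also be written with the search and LLM steps passed in as functions, which makes it possible to try the pipeline without ElasticSearch or an API key. This is a sketch under that assumption: `rag_pipeline`, `build_context`, `fake_search`, and `fake_llm` are hypothetical names, and the prompt wording is shortened from the notebook's template.

```python
def build_context(search_results):
    """Join the retrieved documents into one CONTEXT string."""
    return "\n\n".join(
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    )


def rag_pipeline(query, search_fn, llm_fn):
    """Retrieve documents, build one prompt, and return the LLM's single answer."""
    search_results = search_fn(query)
    prompt = (
        "Answer the QUESTION using only the facts from the CONTEXT.\n\n"
        f"QUESTION: {query}\n\n"
        f"CONTEXT:\n{build_context(search_results)}"
    )
    return llm_fn(prompt)


# Stand-ins for ElasticSearch and the LLM, used only for this illustration.
def fake_search(query):
    return [{"section": "General", "question": "Can I join late?", "text": "Yes."}]


def fake_llm(prompt):
    return f"(answer based on a prompt of {len(prompt)} characters)"


print(rag_pipeline("can i still join the course", fake_search, fake_llm))
```

In the notebook, the real functions would be plugged in as `rag_pipeline(query, search_elasticsearch, llm)`.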

### Why only one final answer?
Because the LLM does the reasoning: it reads the context (the 5 documents) and writes one summarized answer based on everything it read.

ElasticSearch only retrieves the relevant information.
The LLM is responsible for reading, reasoning, and answering.