# Task 2: RAG + LLM System Design

## 1. System Architecture

Here is a proposed architecture for the RAG system, including the key components:

### Document Ingestion & Preprocessing
*   **Purpose:** To load raw technical documentation PDFs and convert them into a structured format suitable for processing.
*   **Components:**
    *   **PDF Reader:** Reads text content from PDF files. Libraries like `PyMuPDF` or `pdfminer.six` can be used.
    *   **Text Extractor:** Extracts plain text from the PDF content.
    *   **Cleaner:** Removes irrelevant characters, headers, and footers, and performs basic text cleaning (e.g., rejoining hyphenated words, normalizing special characters).
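
The cleaning step can be sketched as below. This is a minimal illustration, not the prototype's code: the function name `clean_page_text` and the `Page N of M` footer pattern are assumptions, and the hyphenation rule can wrongly merge genuine compounds that happen to break across a line.

```python
import re

# Assumed footer pattern: standalone lines such as "Page 7 of 25" or "Page 4".
PAGE_LINE = re.compile(r"page \d+( of \d+)?", re.IGNORECASE)

def clean_page_text(text: str) -> str:
    """Apply basic cleanup to text extracted from a PDF."""
    # Rejoin words hyphenated across a line break: "separa-\ntor" -> "separator".
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)
    cleaned_lines = []
    for line in text.splitlines():
        # Collapse runs of spaces/tabs and trim the line.
        line = re.sub(r"[ \t]+", " ", line).strip()
        if PAGE_LINE.fullmatch(line):
            continue  # drop page-number footers
        cleaned_lines.append(line)
    # Collapse runs of blank lines into a single paragraph break.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(cleaned_lines)).strip()

sample = "Cyclone separa-\ntor manual \n \n \nPage 7 of 25 \n \nInlet   velocity"
print(clean_page_text(sample))
```

Real manuals usually need document-specific rules (for example, repeated copyright lines), so the patterns above are a starting point to adapt.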

### Chunking Strategy
*   **Purpose:** To break down large documents into smaller, manageable chunks that retain local context.
*   **Components:**
    *   **Text Splitter:** Splits the cleaned text based on predefined rules (e.g., fixed token size with overlap, paragraph breaks, section headers). Frameworks like `LangChain` or `LlamaIndex` provide various text splitting strategies.

### Embeddings & Indexing (Vector Database)
*   **Purpose:** To convert text chunks into numerical vector representations (embeddings) and store them in a searchable index.
*   **Components:**
    *   **Embedding Model:** An open-source model (e.g., from Hugging Face `sentence-transformers`) that converts text chunks into dense vectors.
    *   **Vector Database:** A database optimized for storing and searching vector embeddings (e.g., `FAISS`, `Chroma`, `Weaviate` in local mode).

### Retrieval Layer
*   **Purpose:** To retrieve the most relevant document chunks based on a user's natural language query.
*   **Components:**
    *   **Query Embedding:** Converts the user's query into a vector using the same embedding model used for document chunks.
    *   **Vector Search:** Performs a similarity search in the vector database to find the top-k most similar document chunk embeddings to the query embedding.
    *   **Retriever:** Orchestrates the query embedding and vector search to return the relevant document chunks.

### LLM Layer for Answer Generation
*   **Purpose:** To generate a natural language answer based on the user's query and the retrieved document chunks.
*   **Components:**
    *   **Open-source LLM:** A free and open-source language model (e.g., `flan-t5-small`, `opt-125m`, Llama 2 7B, or a Hugging Face hosted model).
    *   **Prompt Engineering:** Constructs a prompt for the LLM that includes the user's query, the retrieved document chunks (as context), and instructions for generating a concise and faithful answer.
    *   **Answer Generator:** Feeds the prompt to the LLM and processes the generated response.

### Guardrails for Safe and Faithful Responses
*   **Purpose:** To prevent the LLM from generating irrelevant, hallucinated, or sensitive content and to ensure answers are supported by the retrieved sources.
*   **Components:**
    *   **Relevance Checker:** Filters retrieved chunks to ensure they are highly relevant to the query before passing them to the LLM.
    *   **Citation Enforcer:** Uses a prompt structure or a post-processing step to make the LLM cite the source document chunks that support its answer.
    *   **Sensitivity Filter:** Identifies and potentially blocks or modifies queries or generated answers that contain sensitive or inappropriate content. This could involve keyword matching or a separate classification model.
    *   **Fallback Mechanism:** Provides a predefined message when no relevant documents are found for a query.
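
A minimal sketch of the relevance check and fallback, assuming retrieval returns one distance per chunk (lower means closer). The names `filter_relevant`, `answer_or_fallback`, and `stub_generate`, the sample chunks, and the `max_distance=1.0` threshold are all hypothetical; a real threshold would have to be tuned on sample queries.

```python
FALLBACK_MESSAGE = "No relevant information was found in the indexed documents."

def filter_relevant(chunks, distances, max_distance=1.0):
    """Keep chunks whose retrieval distance is at or below the threshold."""
    return [c for c, d in zip(chunks, distances) if d <= max_distance]

def answer_or_fallback(query, chunks, distances, generate, max_distance=1.0):
    """Call `generate` only when at least one chunk passes the relevance check."""
    relevant = filter_relevant(chunks, distances, max_distance)
    if not relevant:
        return FALLBACK_MESSAGE
    return generate(query, relevant)

# Stand-in for the LLM call, used only to make the sketch runnable.
def stub_generate(query, chunks):
    sources = ", ".join(c["source"] for c in chunks)
    return f"(answer to '{query}' grounded in: {sources})"

toy_chunks = [
    {"source": "doc_a_chunk_1", "content": "placeholder text"},
    {"source": "doc_b_chunk_7", "content": "placeholder text"},
]
print(answer_or_fallback("example query", toy_chunks, [0.4, 1.7], stub_generate))
print(answer_or_fallback("example query", toy_chunks, [1.5, 1.7], stub_generate))
```

Filtering before generation keeps weakly related chunks out of the prompt, which is one way to reduce unsupported answers.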

This architecture provides a modular approach to building the RAG system, allowing for flexibility and potential future improvements to individual components. The next steps will detail the specific strategies and tools for each layer.

## 2. Retrieval Strategy

This section details the strategies for effectively retrieving relevant document chunks based on a user query.

### Document Chunking Approach

*   **Strategy:** A balance between retaining local context and managing chunk size for efficient embedding and retrieval is crucial. A **fixed token size with overlap** is a robust starting point.
    *   **Size:** Chunks of around 200-500 tokens are generally effective. Smaller chunks might lose context, while larger chunks can introduce irrelevant information and increase computational cost.
    *   **Overlap:** An overlap of 10-20% between consecutive chunks helps maintain context across chunk boundaries and prevents information loss.
    *   **Granularity:** While fixed-size chunks are simple, exploring **semantic chunking** (splitting based on meaning or topic) using techniques like LangChain's `SemanticChunker` could improve relevance, especially for complex technical documentation. However, for a minimal prototype, fixed-size chunking is sufficient.
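
Fixed-size chunking with overlap can be sketched in a few lines. The function name `chunk_tokens` is hypothetical, and the toy token list stands in for the output of a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Toy tokens; in practice these would come from the embedding model's tokenizer.
tokens = [f"t{i}" for i in range(10)]
for chunk in chunk_tokens(tokens, chunk_size=4, overlap=1):
    print(chunk)
```

Each window starts `chunk_size - overlap` tokens after the previous one, so the last `overlap` tokens of a chunk reappear at the start of the next.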

### Choice of Embedding Model

*   **Requirement:** Use a free and open-source model from Hugging Face `sentence-transformers`.
*   **Suggested Model:** `all-MiniLM-L6-v2`.
    *   **Reasoning:** This model balances retrieval quality and size. It supports semantic similarity search and is fast enough for local use, which fits the "free and open-source" and "minimal prototype" requirements. `paraphrase-MiniLM-L3-v2` is smaller and faster but may retrieve less accurately, while `all-mpnet-base-v2` is more accurate but larger and slower.

### Retrieval Method

*   **Method:** **Dense vector search** using a vector database is the primary retrieval method.
    *   **Process:** The user query is embedded with the same embedding model as the document chunks. A similarity search in the vector database then finds the top-k chunks whose vectors are closest to the query vector, using cosine similarity or L2 distance. The two measures produce the same ranking when embeddings are normalized to unit length.
*   **Consideration:** For a more robust system, **hybrid search** combining dense vector search with a lexical search method like **BM25** could be beneficial. This helps capture both semantic meaning and keyword matching. However, for the minimal prototype, dense vector search is sufficient.
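
To make the dense search step concrete, here is a dependency-free sketch of top-k retrieval. The helper names and the two-dimensional toy vectors are illustrative assumptions; a real system would delegate this search to the vector database. It also shows that cosine similarity and L2 distance rank unit-length vectors identically.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine_top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks with the highest cosine similarity."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, normalize(v))) for v in chunk_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def l2_top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks with the smallest L2 distance after normalization."""
    q = normalize(query_vec)
    dists = [sum((a - b) ** 2 for a, b in zip(q, normalize(v))) for v in chunk_vecs]
    order = sorted(range(len(dists)), key=lambda i: dists[i])
    return order[:k]

query_vec = [1.0, 0.0]
chunk_vecs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
print(cosine_top_k(query_vec, chunk_vecs, k=2))
print(l2_top_k(query_vec, chunk_vecs, k=2))
```

For unit vectors, the squared L2 distance equals `2 - 2 * cosine`, which is why both functions return the same order.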

### Ensuring Relevance and Faithfulness

*   **Re-ranking:** After initial retrieval, a re-ranking step can improve the order of the retrieved chunks. Techniques like using a cross-encoder model (e.g., `cross-encoder/ms-marco-MiniLM-L-6-v2`) can score the relevance of each retrieved chunk to the query more precisely.
*   **Enforcing Citations:** This can be done through prompt engineering. Instructing the LLM to cite the source document chunks when generating the answer is a key guardrail against hallucinations. The prompt should clearly present the retrieved chunks and ask the LLM to refer to them.
*   **Returning Source Snippets:** The prototype should return the retrieved document chunks alongside the generated answer. This allows users to verify the information and provides transparency.
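
One possible way to combine prompt-based citation with a post-processing check is sketched below. The function names, the bracketed citation format, and the sample chunk text are assumptions made for illustration; the check only confirms that cited IDs exist among the retrieved chunks, not that the cited text actually supports the claim.

```python
import re

def build_cited_prompt(query, chunks):
    """Build a prompt that labels each chunk with its source ID and asks for citations."""
    context = "\n\n".join(f"[{c['source']}]\n{c['content']}" for c in chunks)
    return (
        "Answer the question using only the context below. "
        "After each claim, cite the supporting source ID in square brackets. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

def unknown_citations(answer, chunks):
    """Return cited source IDs that do not match any retrieved chunk."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    known = {c["source"] for c in chunks}
    return sorted(cited - known)

# Made-up chunk and answer, used only to exercise the functions.
example_chunks = [{"source": "manual.pdf_chunk_3", "content": "Inspect the inlet for blockage."}]
example_answer = (
    "Inspect the inlet for blockage [manual.pdf_chunk_3]. "
    "Replace the gasket [manual.pdf_chunk_9]."
)
print(build_cited_prompt("What should be inspected?", example_chunks))
print(unknown_citations(example_answer, example_chunks))
```

An answer that cites an ID outside the retrieved set can be rejected or regenerated, and the retrieved snippets can be returned next to the answer for manual verification.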

## 4. Scalability Considerations

This section addresses how the proposed RAG system design would handle increased scale in terms of documents, users, and deployment environments.

### Handling a 10x Increase in Documents

*   **Vector Database:** The primary component to consider is the vector database. The prototype's flat FAISS index performs an exhaustive search, so query time grows linearly with the number of chunks. At 10x scale, an approximate nearest-neighbor index (e.g., IVF or HNSW) or a scalable vector database (cloud-managed or distributed on-premise) keeps query latency low. Indexing will take longer, but that cost is paid once per document.
*   **Ingestion Pipeline:** The document ingestion and preprocessing pipeline should be designed to handle parallel processing of documents. This can be achieved using distributed processing frameworks.
*   **Embedding Model:** While `all-MiniLM-L6-v2` is suitable for the prototype, for a 10x increase, a more efficient or distributed embedding process might be needed. Cloud-based embedding APIs or running the model on more powerful hardware can help.
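
As a small illustration of scaling the embedding step, the sketch below embeds chunks batch by batch so that memory use stays bounded and batches could later be spread across workers. `embed_in_batches` and `fake_embed` are hypothetical names; `fake_embed` is a stand-in for a real embedding model.

```python
def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed `texts` batch by batch and return one vector per text, in order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in embedder for illustration: one-dimensional "vectors" holding the text length.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(len(vectors))
```

Because each batch is independent, the same loop can feed a job queue or a pool of workers when a single machine is no longer enough.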

### Handling 100+ Concurrent Users

*   **Retrieval Layer:** The retrieval layer needs to be able to handle a high volume of concurrent queries. This primarily depends on the performance of the vector database and the efficiency of the query embedding process.
*   **LLM Layer:** The LLM layer can become a bottleneck with many concurrent users, especially if using a single instance of a large model.
    *   **Scaling LLM Inference:** Several strategies can be employed:
        *   **Model Serving Frameworks:** Use frameworks like TensorFlow Serving, TorchServe, or NVIDIA Triton Inference Server for efficient model deployment and scaling.
        *   **Load Balancing:** Distribute incoming user queries across multiple instances of the LLM.
        *   **Quantization and Distillation:** Explore using smaller, more efficient versions of the LLM through techniques like quantization and distillation to reduce inference time and resource requirements.
        *   **Batching:** Process multiple user queries in batches to improve efficiency.
*   **Caching:** Implement caching at various levels (e.g., caching popular query embeddings or LLM responses) to reduce the load on the system.
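
A response cache can be sketched with the standard library. The names below are hypothetical, and `expensive_answer` stands in for the full retrieve-and-generate pipeline; the cache would also need to be cleared whenever the index is rebuilt.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive path actually runs

def normalize_query(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries share a cache entry."""
    return " ".join(query.lower().split())

def expensive_answer(query: str) -> str:
    """Stand-in for retrieval plus LLM generation."""
    calls["count"] += 1
    return f"answer for: {query}"

@lru_cache(maxsize=1024)
def _cached_answer(normalized_query: str) -> str:
    return expensive_answer(normalized_query)

def answer(query: str) -> str:
    return _cached_answer(normalize_query(query))

print(answer("What is a cyclone?"))
print(answer("  what is A cyclone? "))  # served from the cache
print(calls["count"])
```

Exact-match caching only helps with repeated queries; it does nothing for paraphrases, which would need a semantic cache keyed on embeddings.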

### Cloud Deployment Under Cost Constraints (Serverless or GPU-Efficient Scaling)

*   **Serverless Functions:** For the ingestion and preprocessing pipeline, serverless functions (e.g., AWS Lambda, Google Cloud Functions) can be cost-effective as you only pay for the compute time used.
*   **Managed Vector Database:** Cloud providers offer managed vector database services that handle scaling and infrastructure management, reducing operational overhead. Choose options that fit within cost constraints.
*   **GPU-Efficient LLM Scaling:**
    *   **Choose Efficient Models:** Select LLMs known for their efficiency and lower hardware requirements (e.g., smaller models, quantized models).
    *   **Optimize Inference:** Utilize techniques like model quantization, pruning, and efficient inference libraries (e.g., FasterTransformer, ONNX Runtime) to reduce GPU usage and cost.
    *   **Spot Instances or Preemptible VMs:** For non-critical or batch processing tasks (like initial indexing), using spot instances or preemptible VMs can significantly reduce compute costs.
    *   **Autoscaling:** Configure autoscaling for LLM inference endpoints to dynamically adjust the number of instances based on demand, optimizing cost.

By considering these factors, the RAG system can be designed to scale effectively to handle increasing data and user loads within defined cost constraints in a cloud environment.

In [1]:
!pip install PyMuPDF sentence-transformers faiss-cpu transformers torch langchain


### Document Ingestion and Chunking

We will now load the documents, extract the text, and split it into chunks. For this prototype, you can place your sample PDF documents in a folder named `docs` in the same directory as this notebook.

In [3]:
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os

# Define the directory where sample documents are stored
docs_dir = "docs"
if not os.path.exists(docs_dir):
    os.makedirs(docs_dir)
    print(f"Created directory: {docs_dir}. Please place your sample PDF documents here.")
else:
    print(f"Using document directory: {docs_dir}")

# List to hold the text content of documents
documents = []

# Iterate through files in the docs directory and extract text from PDFs
for filename in os.listdir(docs_dir):
    if filename.lower().endswith(".pdf"):  # also match upper-case extensions such as .PDF
        filepath = os.path.join(docs_dir, filename)
        try:
            with fitz.open(filepath) as doc:
                text = ""
                for page in doc:
                    text += page.get_text()
                documents.append({"filename": filename, "content": text})
            print(f"Processed document: {filename}")
        except Exception as e:
            print(f"Error processing {filename}: {e}")

if not documents:
    print("No PDF documents found in the 'docs' directory. Please add some sample PDFs to proceed with chunking and embedding.")

# Initialize the text splitter
# Using RecursiveCharacterTextSplitter as it's a good general-purpose splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # Measured in characters (the splitter's default length function is len), not tokens
    chunk_overlap=100  # 20% overlap, within the 10-20% range from the retrieval strategy
)

# Split the documents into chunks
chunks = []
for doc in documents:
    doc_chunks = text_splitter.create_documents([doc["content"]])
    for i, chunk in enumerate(doc_chunks):
        chunks.append({
            "filename": doc["filename"],
            "chunk_id": f"{doc['filename']}_chunk_{i+1}",
            "content": chunk.page_content,
            "source": f"{doc['filename']}_chunk_{i+1}" # Add source for citation
        })

print(f"\nCreated {len(chunks)} chunks from {len(documents)} documents.")
# Display the first chunk as an example
if chunks:
    print("\nFirst chunk example:")
    print(chunks[0]['content'])
    print("Source:", chunks[0]['source'])

Using document directory: docs
Processed document: 172-65231ma.pdf
Processed document: 43 86_Cyclone Separator_IandO.pdf
Processed document: 654b45bb8292f586313470.pdf
Processed document: B_12a.pdf
Processed document: Cyclone_Emission_Control.pdf
Processed document: Cyclone_Manual.pdf
Processed document: D13.pdf
Processed document: Operating-and-maintenance-manual-for-a-typical-cylcone-seperator-.pdf
Processed document: Parker_DustHog_C_Series_Cyclone_Dust_Collector_User_Guide.pdf
Processed document: sect7a.pdf
Processed document: SOP Cyclone-22022023.pdf

Created 695 chunks from 11 documents.

First chunk example:
172-65231MA-03 (DC7)   6 October 2021 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Cyclone Separator 
DC7 
 
 
 
 
 
 
 
 
Copyright © 2021 by TLV CO., LTD. 
All rights reserved
 
 
172-65231MA-03 (DC7)   6 Oct 2021 
1
Contents 
 
Introduction ....................................................................... 1 
Safety Considerations ........................................

### Embedding and Indexing (Vector Database)

Now we will convert the document chunks into vector embeddings using a pre-trained model and index them in a local FAISS database for efficient similarity search.

In [4]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Initialize the embedding model
# Using the suggested model from the retrieval strategy
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print("Embedding model loaded.")

# Create embeddings for the chunks
chunk_contents = [chunk["content"] for chunk in chunks]
chunk_embeddings = embedding_model.encode(chunk_contents)
print(f"Created embeddings for {len(chunk_embeddings)} chunks.")

# Convert embeddings to a numpy array with float32 datatype
chunk_embeddings = np.array(chunk_embeddings).astype('float32')

# Create a FAISS index
# Using IndexFlatL2 for a simple L2 distance-based index
index = faiss.IndexFlatL2(chunk_embeddings.shape[1])
print(f"FAISS index created with dimension: {chunk_embeddings.shape[1]}")

# Add the embeddings to the index
index.add(chunk_embeddings)
print(f"Added {index.ntotal} embeddings to the index.")

# Store chunks and index for later use
# In a real application, you would save the index and chunk metadata to disk
# and load them when needed. For this prototype, we keep them in memory.
indexed_chunks = chunks
faiss_index = index

print("\nEmbedding and indexing complete.")

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

Embedding model loaded.
Created embeddings for 695 chunks.
FAISS index created with dimension: 384
Added 695 embeddings to the index.

Embedding and indexing complete.


### Retrieval

Now we will implement the retrieval step. Given a user query, we will embed it and search the FAISS index to find the most relevant document chunks.

In [5]:
def retrieve_chunks(query, embedding_model, faiss_index, indexed_chunks, k=5):
    """
    Retrieves the top-k most relevant chunks for a given query.

    Args:
        query (str): The user's query.
        embedding_model: The sentence transformer model for embedding.
        faiss_index: The FAISS index containing chunk embeddings.
        indexed_chunks (list): A list of dictionaries, where each dictionary
                                contains the metadata and content of a chunk.
        k (int): The number of top relevant chunks to retrieve.

    Returns:
        list: A list of the top-k most relevant chunk dictionaries.
    """
    # Embed the query
    query_embedding = embedding_model.encode([query]).astype('float32')

    # Perform similarity search in the FAISS index
    distances, indices = faiss_index.search(query_embedding, k)

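    # Retrieve the corresponding chunks. FAISS pads results with -1 when fewer
    # than k vectors are indexed, so skip those placeholder indices.
    retrieved_chunks = [indexed_chunks[i] for i in indices[0] if i >= 0]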

    return retrieved_chunks

# Example usage:
query = "What does a sudden draft drop indicate?"
retrieved_chunks = retrieve_chunks(query, embedding_model, faiss_index, indexed_chunks)

print(f"Top {len(retrieved_chunks)} retrieved chunks for the query: '{query}'")
for i, chunk in enumerate(retrieved_chunks):
    print(f"\n--- Chunk {i+1} (Source: {chunk['source']}) ---")
    print(chunk['content'])

Top 5 retrieved chunks for the query: 'What does a sudden draft drop indicate?'

--- Chunk 1 (Source: SOP Cyclone-22022023.pdf_chunk_31) ---
IMD. 
 
 
 
 
 
 
Page 7 of 25 
 
7. 
Flow Chart of actions in case of Cyclone / Storm 
 
 
Figure 1 – Work Flow Chart 
Page 8 of 25 
 
8. 
Action of D.G Comm Centre and/or L.R.I.T in case of Cyclone 
8.1 
‘depression’ which is likely to develop into a Cyclone on a given date and time. 
8.2 
Inform N.A and N.S (Casualty & Response) by phone. 
8.3

--- Chunk 2 (Source: sect7a.pdf_chunk_19) ---
particle which has a 50:50 probability of reporting to either the underflow or overflow. 
 
Please note that the d50 cutpoint only describes the cyclone.  It has nothing to do with the size grading of 
the cyclone’s overflow or underflow streams (80% minus 75 microns etc.).  Those gradings are dependent 
on the size distribution of the feed to the cyclone, and are commonly described as p80, p50, p20 etc. 
 
 
Intro Cyclones And Separators.Doc 
Page 4  
Copyri

### Answer Generation with LLM

Now we will use a free and open-source LLM to generate an answer based on the retrieved document chunks. This step will also demonstrate how to incorporate citations.

**Note:** Running a local LLM might require significant computational resources. For this prototype, we will outline the process and use a placeholder for the actual LLM interaction. You can replace this with code to interact with a locally running model (e.g., using the `transformers` library with a downloaded model) or a free/open-source LLM API if available and suitable.

In [6]:
# This is a placeholder for LLM interaction.
# In a real implementation, you would load your chosen free/open LLM here
# and use the retrieved_chunks as context for generation.

def generate_answer_with_citations(query, retrieved_chunks, llm_model):
    """
    Generates an answer based on the query and retrieved chunks using an LLM,
    with an attempt to include citations.

    Args:
        query (str): The user's query.
        retrieved_chunks (list): A list of relevant chunk dictionaries.
        llm_model: The LLM model to use for generation (placeholder).

    Returns:
        str: The generated answer with citations.
    """
    # Construct a prompt for the LLM.
    # This prompt instructs the LLM to use the provided context
    # and cite the sources.
    context = "\n\n".join([f"Source: {chunk['source']}\n{chunk['content']}" for chunk in retrieved_chunks])
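    prompt = (
        "Answer the query using only the technical documentation below. "
        "Cite the source identifier of each chunk you rely on, and say so if the "
        "documentation does not contain the answer.\n\n"
        f"{context}\n\nQuery: {query}\n\nAnswer:"
    )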

    # --- Placeholder for LLM interaction ---
    # Replace this with code to call your LLM.
    # The LLM should be instructed to generate an answer based *only* on the context
    # and to include the source identifier (e.g., filename_chunk_id) when referencing information.
    generated_text = f"Placeholder Answer: (Replace this with LLM generated text based on the context)\n\nRetrieved context used:\n{context}"
    # --- End of Placeholder ---

    # In a real scenario, you would process the LLM's output
    # to ensure citations are correctly formatted and included.
    # For this placeholder, we just return the placeholder text and the context.

    return generated_text

# Example usage (using the placeholder):
# Replace 'None' with your actual loaded LLM model
llm_model = None # Placeholder

# Assuming 'retrieved_chunks' is available from the previous step
if 'retrieved_chunks' in locals() and retrieved_chunks:
    answer = generate_answer_with_citations(query, retrieved_chunks, llm_model)
    print("\n--- Generated Answer ---")
    print(answer)
else:
    print("\nNo chunks retrieved in the previous step to generate an answer from.")


--- Generated Answer ---
Placeholder Answer: (Replace this with LLM generated text based on the context)

Retrieved context used:
Source: SOP Cyclone-22022023.pdf_chunk_31
IMD. 
 
 
 
 
 
 
Page 7 of 25 
 
7. 
Flow Chart of actions in case of Cyclone / Storm 
 
 
Figure 1 – Work Flow Chart 
Page 8 of 25 
 
8. 
Action of D.G Comm Centre and/or L.R.I.T in case of Cyclone 
8.1 
‘depression’ which is likely to develop into a Cyclone on a given date and time. 
8.2 
Inform N.A and N.S (Casualty & Response) by phone. 
8.3

Source: sect7a.pdf_chunk_19
particle which has a 50:50 probability of reporting to either the underflow or overflow. 
 
Please note that the d50 cutpoint only describes the cyclone.  It has nothing to do with the size grading of 
the cyclone’s overflow or underflow streams (80% minus 75 microns etc.).  Those gradings are dependent 
on the size distribution of the feed to the cyclone, and are commonly described as p80, p50, p20 etc. 
 
 
Intro Cyclones And Separators.Doc 
P

------------------------------

# **Full Code**

In [None]:
# This cell contains the complete code for the minimal runnable RAG prototype.
# You can copy this code and save it as a Python notebook (e.g., rag_demo.ipynb)
# in the 'prototype/' folder.

import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import os

# --- Configuration ---
docs_dir = "docs"
chunk_size = 500
chunk_overlap = 100
embedding_model_name = 'all-MiniLM-L6-v2'
retrieval_k = 5

# --- 1. Document Ingestion and Chunking ---

print("--- Document Ingestion and Chunking ---")

if not os.path.exists(docs_dir):
    os.makedirs(docs_dir)
    print(f"Created directory: {docs_dir}. Please place your sample PDF documents here.")
    documents = [] # No documents to process if directory was just created
else:
    print(f"Using document directory: {docs_dir}")
    documents = []
    for filename in os.listdir(docs_dir):
        if filename.lower().endswith(".pdf"):  # also match upper-case extensions such as .PDF
            filepath = os.path.join(docs_dir, filename)
            try:
                with fitz.open(filepath) as doc:
                    text = ""
                    for page in doc:
                        text += page.get_text()
                    documents.append({"filename": filename, "content": text})
                print(f"Processed document: {filename}")
            except Exception as e:
                print(f"Error processing {filename}: {e}")

if not documents:
    print("No PDF documents found in the 'docs' directory. Please add some sample PDFs to proceed with chunking and embedding.")
    chunks = [] # No chunks if no documents
else:
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    chunks = []
    for doc in documents:
        doc_chunks = text_splitter.create_documents([doc["content"]])
        for i, chunk in enumerate(doc_chunks):
            chunks.append({
                "filename": doc["filename"],
                "chunk_id": f"{doc['filename']}_chunk_{i+1}",
                "content": chunk.page_content,
                "source": f"{doc['filename']}_chunk_{i+1}"
            })
    print(f"Created {len(chunks)} chunks from {len(documents)} documents.")
    if chunks:
        print("First chunk example:")
        print(chunks[0]['content'])
        print("Source:", chunks[0]['source'])

# --- 2. Embedding and Indexing (Vector Database) ---

print("\n--- Embedding and Indexing ---")

if chunks:
    embedding_model = SentenceTransformer(embedding_model_name)
    print(f"Embedding model '{embedding_model_name}' loaded.")

    chunk_contents = [chunk["content"] for chunk in chunks]
    chunk_embeddings = embedding_model.encode(chunk_contents)
    print(f"Created embeddings for {len(chunk_embeddings)} chunks.")

    chunk_embeddings = np.array(chunk_embeddings).astype('float32')

    faiss_index = faiss.IndexFlatL2(chunk_embeddings.shape[1])
    print(f"FAISS index created with dimension: {chunk_embeddings.shape[1]}")

    faiss_index.add(chunk_embeddings)
    print(f"Added {faiss_index.ntotal} embeddings to the index.")

    indexed_chunks = chunks # Store chunks with index
    print("Embedding and indexing complete.")
else:
    print("Skipping embedding and indexing as no chunks were created.")
    faiss_index = None
    indexed_chunks = []

# --- 3. Retrieval ---

print("\n--- Retrieval ---")

def retrieve_chunks(query, embedding_model, faiss_index, indexed_chunks, k=5):
    if faiss_index is None or not indexed_chunks:
        print("Index is not available or no chunks are indexed.")
        return []

    query_embedding = embedding_model.encode([query]).astype('float32')
    distances, indices = faiss_index.search(query_embedding, k)
    # FAISS pads results with -1 when fewer than k vectors are indexed; skip those.
    retrieved_chunks = [indexed_chunks[i] for i in indices[0] if i >= 0]
    return retrieved_chunks

# --- 4. Answer Generation with LLM (Placeholder) ---

print("\n--- Answer Generation (Placeholder) ---")

def generate_answer_with_citations(query, retrieved_chunks, llm_model):
    if not retrieved_chunks:
        return "No relevant information found in the documents."

    context = "\n\n".join([f"Source: {chunk['source']}\n{chunk['content']}" for chunk in retrieved_chunks])
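    prompt = (
        "Answer the query using only the technical documentation below. "
        "Cite the source identifier of each chunk you rely on, and say so if the "
        "documentation does not contain the answer.\n\n"
        f"{context}\n\nQuery: {query}\n\nAnswer:"
    )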

    # --- Placeholder for LLM interaction ---
    # Replace this with code to call your LLM.
    # Example using a hypothetical local LLM:
    # from transformers import pipeline
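    # llm = pipeline("text-generation", model="your-local-llm-name")
    # max_new_tokens limits only the generated text; max_length would also count the prompt tokens.
    # generated_text = llm(prompt, max_new_tokens=256, return_full_text=False)[0]['generated_text']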
    # --- End of Placeholder ---

    # For the prototype, we'll just show the context that would be used
    generated_text = f"Placeholder Answer: (Replace this with LLM generated text based on the context)\n\nRetrieved context used:\n{context}"

    return generated_text

# --- Example Usage ---

if faiss_index is not None and indexed_chunks:
    query = "What does a sudden draft drop indicate?" # replace it with input() for dynamic queries
    print(f"Query: '{query}'")
    retrieved_chunks = retrieve_chunks(query, embedding_model, faiss_index, indexed_chunks, k=retrieval_k)

    print(f"\nTop {len(retrieved_chunks)} retrieved chunks:")
    for i, chunk in enumerate(retrieved_chunks):
        print(f"--- Chunk {i+1} (Source: {chunk['source']}) ---")
        print(chunk['content'])

    # Example LLM interaction (using placeholder)
    llm_model = None # Replace with your loaded LLM
    answer = generate_answer_with_citations(query, retrieved_chunks, llm_model)
    print("\n--- Generated Answer ---")
    print(answer)
else:
    print("\nSkipping retrieval and answer generation as no documents were processed.")

# --- 5. Evaluation Outline ---
print("\n--- Evaluation Outline ---")
print("Refer to the 'Evaluation (Conceptual Outline)' section in the notebook/notes.md for details on how to evaluate the prototype.")