# Executive Summary

This Inference notebook represents the second stage of the RAG pipeline, relying on the assets produced by the Data Preparation step. Created on 09.12.2024, it reads the previously generated LlamaIndex artifacts—namely default__vector_store.faiss, docstore.json, graph_store.json, and index_store.json—from /data/faiss_index/ in Google Drive.

The notebook orchestrates hybrid search functionality. First, it uses the Faiss index for vector-based similarity matching, retrieving chunks of text that semantically match the query. Next, it uses the Whoosh index for BM25-based keyword matching. These results are combined and subsequently reranked by a cross-encoder (e.g., a Sentence Transformers model). The final set of top-ranked chunks is then assembled into a “context” string fed into an LLM (here, a Nebius-hosted model).

By running the user's query against these multiple retrieval layers, the pipeline maximizes the likelihood of returning accurate, contextually relevant information. After context assembly, the LLM is prompted to generate an answer and to optionally list the sources. This approach provides a transparent chain of evidence, showing how the final response was derived.

Additionally, the notebook employs Gradio to create a simple chat interface where a user can type queries and see both an answer and the relevant reference snippets. This user-facing UI, along with robust logging, closes the loop on an effective knowledge retrieval and answering system.

The Inference pipeline takes as input the files created by the Data Preparation pipeline:
FAISS
1) default__vector_store.faiss
2) docstore.json
3) graph_store.json
4) index_store.json
WHOOSH
5) MAIN (Whoosh index)
The FAISS files are native to the StorageContext class of LlamaIndex; all of the files above are consumed by this Inference pipeline.
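
Because the inference stage depends entirely on these files, it can help to confirm that they exist before anything is loaded. The following is a minimal sketch, assuming the file names listed above; `find_missing_artifacts` and `EXPECTED_FAISS_FILES` are hypothetical names introduced for illustration and are not part of the notebook.

```python
from pathlib import Path

# File names taken from the list above; the helper itself is hypothetical.
EXPECTED_FAISS_FILES = (
    "default__vector_store.faiss",
    "docstore.json",
    "graph_store.json",
    "index_store.json",
)


def find_missing_artifacts(index_dir: Path, expected=EXPECTED_FAISS_FILES) -> list:
    """Return the expected file names that are absent from index_dir."""
    return [name for name in expected if not (index_dir / name).exists()]
```

An empty list means all four FAISS artifacts are present; a non-empty list names the files that the Data Preparation pipeline still has to produce.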

# History of updates:
V08 - added a configuration section, timestamps in logs, batching in the reranking function, and a threshold in the combined results function.

V07 - nothing changed on the inference side of the pipeline.

V06 - I added hybrid search (vector search + lexical search instead of vector search alone), and then a cross-encoder reranker. The top-k documents from the cross-encoder reranker now go into the LLM prompt.

V05 - An LLM was incorporated as the generator. Before that we only used index.as_retriever, so no LLM was involved in the pipeline.

V03 - I added a configurable similarity threshold, so retrieved chunks can be filtered by similarity score (cosine similarity). LlamaIndex abstractions were very useful here as well. I also fixed a bug where the index called the OpenAI embedding model instead of Nebius (I set the Nebius embedding model and LLM as the Settings defaults and also passed the Nebius embedding model explicitly in the index parameters).

V02 - in fact, most of the v02 updates were made in the upstream pipeline. Here I only integrated the upstream changes.

# Backlog

ISSUES TO ADDRESS:

1) Lexical search returns only the file name as a source, without the row, page, or slide that vector search provides; this logic needs to be added.  
2) The retriever has not been evaluated yet with metrics such as Hit Rate, Mean Reciprocal Rank (MRR), or Normalized Discounted Cumulative Gain (NDCG).
3) No prompt engineering has been done so far (multi-shot prompts, for example); the architecture and prompts are to be chosen based on the problems the RAG system is solving.
4) Caching Mechanism: Implement caching for frequently asked questions to reduce redundant computations and API calls.
5) Security Measures: Ensure that user inputs are sanitized to prevent potential injection attacks, especially if integrating with other systems or databases.
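
For the evaluation item, Hit Rate and MRR are simple enough to compute by hand once a small labelled set of queries exists. The sketch below assumes each query has a ranked list of retrieved document ids and a list of ids judged relevant; the function names and the toy data are hypothetical and only illustrate the two metrics.

```python
def hit_rate(retrieved: list, relevant: list, k: int = 10) -> float:
    """Fraction of queries with at least one relevant id among the top-k retrieved ids."""
    if not retrieved:
        return 0.0
    hits = sum(1 for ids, rel in zip(retrieved, relevant) if set(ids[:k]) & set(rel))
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: list, relevant: list) -> float:
    """Average of 1 / rank of the first relevant id per query (0 if none is found)."""
    if not retrieved:
        return 0.0
    total = 0.0
    for ids, rel in zip(retrieved, relevant):
        rel_set = set(rel)
        for rank, doc_id in enumerate(ids, start=1):
            if doc_id in rel_set:
                total += 1.0 / rank
                break
    return total / len(retrieved)


# Toy data: three queries, ranked retrieved ids, and the ids judged relevant.
toy_retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
toy_relevant = [["b"], ["d"], ["z"]]

print(hit_rate(toy_retrieved, toy_relevant, k=3))         # 2 of 3 queries have a hit
print(mean_reciprocal_rank(toy_retrieved, toy_relevant))  # (1/2 + 1 + 0) / 3 = 0.5
```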

# 1. Install required libraries

In [1]:
# 1. Install required libraries
!pip install openai gradio httpx
!pip install llama-index llama-index-core llama-parse llama-index-readers-file
!pip install llama-index-embeddings-nebius llama-index-llms-nebius
!pip install llama-index-vector-stores-faiss
!pip install faiss-cpu
!pip install whoosh
!pip install -U sentence-transformers


# 2. Configuration: Mount Google Drive, Load API keys and Config Environment

Mounts your Google Drive for access to the FAISS/Whoosh indexes and loads the API key for Nebius AI from a JSON config. Sets the environment for inference, including which LLM model to use and any threshold parameters (like reranking thresholds).
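
A missing or empty key in the JSON config otherwise only surfaces later as an authentication error, so failing early can save debugging time. The following is a minimal sketch, assuming the config is a flat JSON object with an `API_KEY` entry as in the cell below; `load_config` is a hypothetical helper and not part of the notebook.

```python
import json
from pathlib import Path


def load_config(config_path: Path, required_keys=("API_KEY",)) -> dict:
    """Load a JSON config and fail early if a required key is missing or empty."""
    # utf-8-sig tolerates a byte-order mark, matching the encoding used in the notebook.
    with open(config_path, encoding="utf-8-sig") as config_file:
        config = json.load(config_file)
    missing = [key for key in required_keys if not config.get(key)]
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    return config
```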


In [None]:
# 2. Configuration: Mount Google Drive, Load API keys and config environment

from google.colab import drive
import os
import json
from pathlib import Path
import openai

# Mount Google Drive
drive.mount("/content/gdrive")

# Define the data directory path
data_directory = Path("/content/gdrive/MyDrive/RAG_Project_5/data/")

# Load the API key
config_path = Path("/content/gdrive/MyDrive/Colab_Notebooks/config.json")
with open(config_path, encoding="utf-8-sig") as config_file:
    config = json.load(config_file)
    os.environ["API_KEY"] = config["API_KEY"]

# Set the API key and endpoint globally
openai.api_key = os.environ["API_KEY"]
openai.api_base = "https://api.studio.nebius.ai/v1/"  # Nebius AI endpoint


nebius_model = "meta-llama/Meta-Llama-3.1-405B-Instruct"  # Name of the Nebius-hosted LLM

reranking_treshold = 0.2  # Threshold for reranked retrieval results (has to be tuned manually)

Mounted at /content/gdrive


# 3. Logging configuration

Enables logging during inference with a rotating file handler to record user queries and pipeline activities. This helps monitor query handling, time taken for searches, and any issues with model responses or indexing lookups.

In [None]:
# 3. Setup Logging with log rotation

import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path
import os

def setup_logging(
    log_folder: Path,
    log_file: str = "inference_pipeline.log",
    max_bytes: int = 5 * 1024 * 1024,  # 5 MB
    backup_count: int = 3,
) -> None:
    """
    Configures logging with log rotation to write logs to both a file and the console.

    Args:
        log_folder (Path): The directory where the log file will be stored.
        log_file (str, optional): The name of the log file. Defaults to "inference_pipeline.log".
        max_bytes (int, optional): Maximum size of the log file in bytes before it is rotated. Defaults to 5 MB.
        backup_count (int, optional): Number of backup files to keep. Defaults to 3.

    Returns:
        None
    """
    os.makedirs(log_folder, exist_ok=True)  # Ensure the folder exists
    log_file_path = log_folder / log_file

    # Create a RotatingFileHandler
    rotating_handler = RotatingFileHandler(
        filename=log_file_path,
        maxBytes=max_bytes,
        backupCount=backup_count,
        encoding="utf-8",
    )

    # Configure logging
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s",
        handlers=[
            rotating_handler,  # Handles log rotation
            logging.StreamHandler(),  # Logs also appear in Colab's output
        ],
        force=True,  # Force the configuration to apply even if logging was already configured
    )

    logging.info("Logging with rotation has been successfully configured.")

# Initialize logging with log rotation
setup_logging(
    log_folder=Path("/content/gdrive/MyDrive/RAG_Project_5/inference/"),
    log_file="inference_pipeline.log",
    max_bytes=5 * 1024 * 1024,  # 5 MB
    backup_count=3,  # Keep up to 3 backup files
)

2025-01-14 17:35:05,596 - INFO - Logging with rotation has been successfully configured.


# 4. Load the LlamaIndex-based FAISS Index, Whoosh Index, Initialize LLM and Embedding Model

Loads existing FAISS and Whoosh indexes from disk. Initializes the Nebius-based embedding and LLM (set as defaults in Settings) for both vector and language tasks. Checks for potential missing indexes to prevent query failures at inference.

In [None]:
# 4. Load the LlamaIndex-based FAISS Index, Whoosh Index, Initialize LLM and Embedding Model

import httpx  # For custom HTTP client if needed
from whoosh.index import open_dir
from whoosh.qparser import QueryParser
from llama_index.embeddings.nebius import NebiusEmbedding
from llama_index.llms.nebius import NebiusLLM
from llama_index.core import Settings, StorageContext, load_index_from_storage
from llama_index.vector_stores.faiss import FaissVectorStore
from pathlib import Path
import logging

# Setup NebiusEmbedding (same as in data preparation pipeline)
custom_http_client = httpx.Client(timeout=60.0)  # Fighting a bug in NebiusEmbedding library. More details in data preparation pipeline
embedding_model = NebiusEmbedding(
    api_key=os.environ["API_KEY"],
    model_name="BAAI/bge-en-icl",
    http_client=custom_http_client,
    api_base="https://api.studio.nebius.ai/v1/"  # Explicitly specifying api_base!!! It took me 2 hours to debug!
    # UPDATE: Check in the future if the Nebius class in the library is fixed, so custom_http_client and api_base may no longer be needed.
)

# Setup Nebius LLM
llm = NebiusLLM(
    api_key=os.environ["API_KEY"],
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",  # Hard-coded for now; the model name is also defined in the configuration section and should eventually be taken from there.
    temperature=0.2  # Set low temperature since we're dealing with engineering data and don't need too much creativity.
)

# Setup Nebius models as default
# Setting up Nebius models as default early ensures methods using the Embedding Model don't default to LlamaIndex's native OpenAI models (avoids conflicts).
Settings.embed_model = embedding_model
Settings.llm_model = llm

# FAISS Index Path
faiss_index_path = Path(data_directory/'faiss_index')

if (faiss_index_path / "index_store.json").exists():
    # Load the FAISS vector store from the .faiss file
    vector_store = FaissVectorStore.from_persist_path(str(faiss_index_path / "default__vector_store.faiss"))  # Note double underscore `__` in `default__vector_store.faiss`.
    # Create a StorageContext with the vector_store
    storage_context = StorageContext.from_defaults(
        persist_dir=str(faiss_index_path),
        vector_store=vector_store,
    )
    # Load the FAISS index from storage
    faiss_index = load_index_from_storage(storage_context, embed_model=embedding_model)
    logging.info("FAISS index loaded successfully (LlamaIndex).")
else:
    logging.error(f"LlamaIndex FAISS index not found at {faiss_index_path}. Ensure the Data Preparation pipeline has been run.")
    raise FileNotFoundError(f"No LlamaIndex index found at {faiss_index_path}.")

# Whoosh Index Path
whoosh_index_path = Path(data_directory/'whoosh_index')

if whoosh_index_path.exists():
    # Load the Whoosh index
    whoosh_index = open_dir(str(whoosh_index_path))
    logging.info("Whoosh index loaded successfully.")
else:
    logging.error(f"Whoosh index not found at {whoosh_index_path}. Ensure the Data Preparation pipeline has been run.")
    raise FileNotFoundError(f"No Whoosh index found at {whoosh_index_path}.")


2025-01-14 17:35:25,766 - INFO - NumExpr defaulting to 2 threads.
2025-01-14 17:35:45,396 - INFO - Loading faiss with AVX2 support.
2025-01-14 17:35:45,765 - INFO - Successfully loaded faiss with AVX2 support.
2025-01-14 17:35:45,785 - INFO - Loading llama_index.vector_stores.faiss.base from /content/gdrive/MyDrive/RAG_Project_5/data/faiss_index/default__vector_store.faiss.
2025-01-14 17:35:47,010 - INFO - Loading all indices.
2025-01-14 17:35:47,270 - INFO - FAISS index loaded successfully (LlamaIndex).
2025-01-14 17:35:47,594 - INFO - Whoosh index loaded successfully.


# 5. Define Hybrid Search Logic

Implements a hybrid retrieval mechanism: FAISS-based semantic/vector search combined with Whoosh-based BM25 for text-based matching. Afterwards, a cross-encoder model reranks the combined results. This step ensures relevant documents are extracted from multiple angles (semantic similarity + text matching) and then refined by the cross-encoder for final relevance ordering.
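
The way the two ranked lists are merged can be illustrated on toy data. The sketch below mirrors the rank-based weighting used by `combine_results` in the cell that follows (each hit contributes its list weight divided by its rank); `fuse_by_rank` and the document labels are hypothetical. Note that textbook Reciprocal Rank Fusion usually adds a smoothing constant to the rank, which this variant does not.

```python
def fuse_by_rank(vector_ranked: list, bm25_ranked: list, alpha: float = 0.5) -> list:
    """Merge two ranked lists; each hit contributes weight / rank to its document."""
    scores = {}
    for rank, doc in enumerate(vector_ranked, start=1):
        scores[doc] = scores.get(doc, 0.0) + alpha / rank
    for rank, doc in enumerate(bm25_ranked, start=1):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / rank
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# "B" appears in both lists, so it overtakes "A" despite ranking second in vector search.
fused = fuse_by_rank(["A", "B", "C"], ["B", "D"], alpha=0.5)
for doc, score in fused:
    print(doc, round(score, 3))
```

With `alpha=0.5` the scores are B = 0.5/2 + 0.5/1 = 0.75, A = 0.5, D = 0.25, and C = 0.5/3, so documents found by both retrievers rise to the top.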

In [None]:
# 5 Define Hybrid Search Logic

import numpy as np
from sentence_transformers import CrossEncoder

def perform_vector_search(query: str, top_k: int = 10) -> list:
    """
    Retrieve top-k results from FAISS vector index, including metadata.

    Args:
        query (str): The user's query.
        top_k (int, optional): Number of top results to retrieve. Defaults to 10.

    Returns:
        list: List of tuples containing (text, score, metadata).
    """
    retriever = faiss_index.as_retriever(similarity_top_k=top_k)
    results = retriever.retrieve(query)
    # Return (text, score, metadata)
    return [(res.node.text, res.score if res.score is not None else 0.0, res.node.metadata) for res in results]

def perform_bm25_search(query: str, top_k: int = 10) -> list:
    """
    Retrieve top-k results from Whoosh BM25 search, including metadata.

    Args:
        query (str): The user's query.
        top_k (int, optional): Number of top results to retrieve. Defaults to 10.

    Returns:
        list: List of tuples containing (text, score, metadata).
    """
    with whoosh_index.searcher() as searcher:
        parser = QueryParser("content", schema=whoosh_index.schema)
        try:
            parsed_query = parser.parse(query)
            results = searcher.search(parsed_query, limit=top_k)
            bm25_results = []
            for hit in results:
                content = hit["content"]
                score = hit.score
                metadata = {
                    "source": hit.get("source", "Unknown Source"),
                    "sheet": hit.get("sheet", None),
                    "row_number": hit.get("row_number", None),
                    "slide_number": hit.get("slide_number", None),
                    "section": hit.get("section", None),
                    "chunk_number": hit.get("chunk_number", None),
                    "total_chunks_in_section": hit.get("total_chunks_in_section", None),
                    "page_number": hit.get("page_number", None),
                    "doc_id": hit.get("doc_id", None),
                    # Add other metadata fields as needed
                }
                bm25_results.append((content, score, metadata))
            logging.info(f"BM25 search results: {bm25_results}")
            # Return (text, score, metadata)
            return bm25_results
        except Exception as e:
            logging.error(f"Whoosh search error: {e}")
            return []

def combine_results(vector_results, bm25_results, alpha: float = 0.5) -> list:
    """
    Combines FAISS and Whoosh results using a weighted reciprocal-rank sum
    (each hit contributes weight / rank; raw scores are ignored).

    Args:
        vector_results (list): List of tuples (text, score, metadata) from FAISS.
        bm25_results (list): List of tuples (text, score, metadata) from Whoosh.
        alpha (float, optional): Weight for FAISS results; Whoosh results get (1 - alpha). Defaults to 0.5.

    Returns:
        list: Combined list of tuples (text, combined_score, metadata).
    """
    combined_scores = {}
    metadata_mapping = {}

    # Process FAISS results
    for idx, (text, score, metadata) in enumerate(vector_results):
        if text:
            combined_scores[text] = combined_scores.get(text, 0) + alpha / (idx + 1)
            metadata_mapping[text] = metadata  # Store metadata from FAISS

    # Process BM25 results
    for idx, (text, score, metadata) in enumerate(bm25_results):
        if text:
            combined_scores[text] = combined_scores.get(text, 0) + (1 - alpha) / (idx + 1)
            if text not in metadata_mapping:
                metadata_mapping[text] = metadata  # Store metadata from Whoosh if not already present

    # Combine scores and retrieve metadata
    combined_results = [(text, score, metadata_mapping[text]) for text, score in combined_scores.items()]

    # Sort results by combined score in descending order
    sorted_results = sorted(combined_results, key=lambda x: x[1], reverse=True)
    return sorted_results


# Load a pretrained cross-encoder model
cross_encoder = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-6")

def rerank_with_cross_encoder(query: str, documents: list, top_k: int = 10, batch_size: int = 16) -> list:
    """
    Rerank documents using a cross-encoder model with batching.

    Args:
        query (str): The user's query.
        documents (list): List of tuples (text, combined_score, metadata).
        top_k (int, optional): Number of top documents to return after reranking. Defaults to 10.
        batch_size (int, optional): Number of document-query pairs to process in a batch. Defaults to 16.

    Returns:
        list: Reranked list of top-k documents with their scores and metadata.
    """
    try:
        # Prepare input pairs for the cross-encoder
        input_pairs = [(query, doc[0]) for doc in documents]  # doc[0] is the document text
        scores = []

        # Process input pairs in batches
        for i in range(0, len(input_pairs), batch_size):
            batch = input_pairs[i:i + batch_size]
            try:
                batch_scores = cross_encoder.predict(batch)  # Predict scores for the batch
                scores.extend(batch_scores)
            except Exception as e:
                logging.error(f"Cross-encoder prediction failed for batch {i // batch_size}: {e}")
                scores.extend([0.0] * len(batch))  # Fallback to zero scores for this batch

        # Combine scores with documents
        scored_docs = [(doc[0], score, doc[2]) for doc, score in zip(documents, scores)]

        # Sort documents by relevance score in descending order
        reranked_docs = sorted(scored_docs, key=lambda x: x[1], reverse=True)

        # Filter by relevance threshold
        filtered_reranked_docs = [doc for doc in reranked_docs if doc[1] >= reranking_treshold]

        return filtered_reranked_docs[:top_k]  # Return documents from top-k, filtered by threshold

    except Exception as e:
        logging.error(f"Unexpected error in rerank_with_cross_encoder: {e}")
        return []  # Return an empty list in case of failure



2025-01-14 17:36:04,696 - INFO - Use pytorch device: cpu


# 6. Create an LLM-based Retriever Based on Hybrid Search Logic

Implements a retrieval pipeline that uses both vector search and BM25, then reranks with a cross-encoder. Once the top results are determined, the system assembles a context for the language model. The LLM uses that context to produce a final answer. This function returns both the updated chat history and the system response.
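
The context-assembly step can be sketched in isolation. The example below assumes the reranked results are `(text, score, metadata)` tuples already sorted by score, as produced by the reranker; `build_prompt`, its default threshold, and the simplified prompt wording are hypothetical and only illustrate how the filtered chunks end up in a single prompt string.

```python
def build_prompt(query: str, reranked: list, threshold: float = 0.2, top_k: int = 10) -> str:
    """Keep chunks scoring at or above the threshold, cap at top_k, and build the prompt."""
    kept = [text for text, score, _metadata in reranked if score >= threshold][:top_k]
    context = "\n\n".join(kept)
    return (
        f"Question: {query}\n\n"
        f"Context:\n{context}\n\n"
        "Answer using only the context above."
    )


# Toy reranked results: the last chunk falls below the threshold and is dropped.
toy_results = [
    ("Mud weight must exceed pore pressure.", 0.9, {"source": "manual.pdf"}),
    ("Casing is set before drilling ahead.", 0.4, {"source": "slides.pptx"}),
    ("Unrelated cafeteria menu.", 0.05, {"source": "misc.txt"}),
]
prompt = build_prompt("Why control mud weight?", toy_results)
print(prompt)
```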


In [None]:
# 6 Create a LLM-based Retriever based on Hybrid Search Logic

import time  # Import time for timestamping

def generate_llm_response(prompt: str) -> str:
    """
    Generates a response from the LLM based on the given prompt.

    Args:
        prompt (str): The input prompt for the LLM.

    Returns:
        str: The response generated by the LLM.
    """
    try:
        response = llm.complete(prompt=prompt)
        return response.text.strip()
    except Exception as e:
        logging.error(f"LLM generation failed: {e}")
        return "An error occurred while generating the response. Please try again later."

def get_answer(query: str, top_k: int = 10, alpha: float = 0.5, history=None):
    """
    Retrieves answers for a given query using FAISS, Whoosh indices, and a Cross-Encoder Reranker.
    Enhances the references with metadata attributes from both FAISS and Whoosh.

    Logs the time taken for each significant step.
    """
    try:
        start_time = time.time()  # Start timing
        logging.info("Pipeline started.")

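        if history is None:
            history = []

        if not query.strip():
            logging.warning("Received an empty query.")
            # Keep the return shape consistent with the normal path: (history, history)
            history.append((query, "Please enter a valid question."))
            return history, history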

        # Step 1: Perform FAISS Vector Search
        vector_search_start = time.time()
        vector_results = perform_vector_search(query, top_k * 2)
        vector_search_time = time.time() - vector_search_start
        logging.info(f"Vector search completed in {vector_search_time:.2f} seconds.")

        # Step 2: Perform BM25 Search
        bm25_search_start = time.time()
        bm25_results = perform_bm25_search(query, top_k * 2)
        bm25_search_time = time.time() - bm25_search_start
        logging.info(f"BM25 search completed in {bm25_search_time:.2f} seconds.")

        # Step 3: Combine Results
        combine_results_start = time.time()
        hybrid_results = combine_results(vector_results, bm25_results, alpha)
        combine_results_time = time.time() - combine_results_start
        logging.info(f"Combining results completed in {combine_results_time:.2f} seconds.")

        # Step 4: Rerank with Cross-Encoder
        rerank_start = time.time()
        reranked_results = rerank_with_cross_encoder(query, hybrid_results, top_k=top_k, batch_size=top_k)  # Batch_size is also equal top_k
        rerank_time = time.time() - rerank_start
        logging.info(f"Reranking completed in {rerank_time:.2f} seconds.")

        # Step 5: Prepare Context for LLM
        context_start = time.time()
        context = "\n\n".join([text for text, score, metadata in reranked_results])
        context_time = time.time() - context_start
        logging.info(f"Context preparation completed in {context_time:.2f} seconds.")

        # Step 6: Generate LLM Response
        llm_start = time.time()
        llm_prompt = (
            f"Question: {query}\n\n"
            f"Context:\n{context}\n\n"
            "Based on the provided context, is there enough relevant information to answer the question? "
            "Respond with 'Yes' or 'No'. If yes, provide the answer; otherwise, say 'No relevant information available.'"
        )
        llm_response = generate_llm_response(llm_prompt)
        llm_time = time.time() - llm_start
        logging.info(f"LLM response generated in {llm_time:.2f} seconds.")

        # Step 7: Format Final Response
        response_format_start = time.time()
        references = []
        for text, score, metadata in reranked_results:
            if metadata:
                source = metadata.get("source", "Unknown Source")
                sheet = f"Sheet: {metadata['sheet']}" if metadata.get("sheet") else ""
                row = f"Row: {metadata['row_number']}" if metadata.get("row_number") else ""
                slide = f"Slide: {metadata['slide_number']}" if metadata.get("slide_number") else ""
                section = f"Section: {metadata['section']}" if metadata.get("section") else ""
                page = f"Page: {metadata['page_number']}" if metadata.get("page_number") else ""
                chunk = f"Chunk: {metadata['chunk_number']}" if metadata.get("chunk_number") else ""
                total_chunks = f"Total Chunks: {metadata['total_chunks_in_section']}" if metadata.get("total_chunks_in_section") else ""

                # Combine all available metadata
                metadata_str = ", ".join(filter(None, [source, sheet, row, slide, section, page, chunk, total_chunks]))
                references.append(f"{metadata_str}: {text[:50]}...")
            else:
                references.append(f"{text[:50]}...")

        references_text = "\n".join([f"- {ref}" for ref in references])
        if "no relevant information available" in llm_response.lower():  # Tolerates a leading "No." and case differences
            final_answer = llm_response
        else:
            final_answer = f"{llm_response}\n\nSources:\n{references_text}"

        response_format_time = time.time() - response_format_start
        logging.info(f"Response formatting completed in {response_format_time:.2f} seconds.")

        # Update chat history
        history.append((query, final_answer))
        total_time = time.time() - start_time
        logging.info(f"Pipeline completed in {total_time:.2f} seconds.")

        return history, history

    except Exception as e:
        logging.error(f"Unexpected Error in get_answer: {e}")
        response = "An unexpected error occurred. Please try again later."
        if history is not None:
            history.append((query, response))
        return history, history

# 7. Building User Interface with Gradio

Creates a simple Gradio interface with a text input for user queries and a chatbot-like response area. Users can type questions, and the hybrid search pipeline plus LLM response is triggered. The interface displays the final answer, along with any references used. Additionally, there’s a placeholder “Settings” tab for future configuration options.

In [None]:
# 7. Build Gradio Interface

import gradio as gr

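def chatbot_interface(user_input, history):
    """
    Handles the Gradio interface interaction by invoking the get_answer function.

    Args:
        user_input (str): The user's input query.
        history (list): The chat history (state).

    Returns:
        tuple: The updated chat history twice, once for the Chatbot component
        and once for the State component (matching outputs=[chatbot, state]).
    """
    history = history or []
    try:
        logging.info(f"Received user input: {user_input}")
        if not user_input.strip():
            # Show the hint in the chat window instead of overwriting the state with a string
            history.append((user_input, "Please enter a valid query."))
            return history, history

        # Call the get_answer function; it returns the updated history twice
        updated_history, updated_state = get_answer(user_input, history=history)

        # Log the latest answer for debugging
        latest_answer = updated_history[-1][1] if updated_history else ""
        logging.info(f"LLM Response: {latest_answer}")
        return updated_history, updated_state

    except Exception as e:
        logging.error(f"Error in chatbot_interface: {e}")
        history.append((user_input, "An unexpected error occurred. Please try again later."))
        return history, history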

# Initialize Gradio Blocks
with gr.Blocks(css="""
    #user-input {
        padding-top: 2px !important;
        font-size: 24px !important;
        line-height: 1.2 !important;
        background-color: #f9f9f9 !important;
    } """) as demo:

    gr.Markdown("### RAG Chatbot for Oil & Gas Drilling Engineers")

    # Chatbot Tab
    with gr.Tab("Chatbot"):
        with gr.Row():
            chatbot = gr.Chatbot(label="Chat Window", show_label=True)
        with gr.Row(equal_height=True):
            user_input = gr.Textbox(
                lines=2,
                placeholder="Enter your question here...",
                label="Your Message",
                elem_id="user-input"
            )
            send_button = gr.Button("Send", scale=0.25)
        state = gr.State(value=[])  # Initialize state as an empty list

        # Define interaction logic
        send_button.click(
            fn=chatbot_interface,
            inputs=[user_input, state],
            outputs=[chatbot, state]
        )

    # Settings Tab (Future Enhancements Placeholder)
    with gr.Tab("Settings"):
        gr.Markdown("### Settings")
        gr.Markdown("Settings can be configured here in the future.")

    # Footer
    gr.Markdown("© Noname Company")

# Launch Gradio interface
demo.launch()

2025-01-14 17:36:15,125 - INFO - HTTP Request: GET https://api.gradio.app/gradio-messaging/en "HTTP/1.1 200 OK"
2025-01-14 17:36:17,176 - INFO - HTTP Request: GET https://api.gradio.app/pkg-version "HTTP/1.1 200 OK"
2025-01-14 17:36:17,392 - INFO - HTTP Request: GET http://127.0.0.1:7860/gradio_api/startup-events "HTTP/1.1 200 OK"
2025-01-14 17:36:17,444 - INFO - HTTP Request: HEAD http://127.0.0.1:7860/ "HTTP/1.1 200 OK"


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()


2025-01-14 17:36:17,700 - INFO - HTTP Request: GET https://api.gradio.app/v3/tunnel-request "HTTP/1.1 200 OK"
2025-01-14 17:36:17,879 - INFO - HTTP Request: GET https://cdn-media.huggingface.co/frpc-gradio-0.3/frpc_linux_amd64 "HTTP/1.1 200 OK"


* Running on public URL: https://72575f3d2c8c9fcc66.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


2025-01-14 17:36:18,550 - INFO - HTTP Request: HEAD https://72575f3d2c8c9fcc66.gradio.live "HTTP/1.1 200 OK"




# APPENDIXES:

In [2]:
# SKIP ME

# APPENDIXES: Print all the codecells from the pipeline above (skips cells marked "# SKIP ME")

import json

# Fetch the current notebook's metadata and contents
from google.colab import _message

def get_notebook_content():
    try:
        notebook = _message.blocking_request('get_ipynb', timeout_sec=5)
        return notebook['ipynb']
    except Exception as e:
        print(f"Error fetching notebook content: {e}")
        return None

# Get the notebook content
notebook_data = get_notebook_content()

if notebook_data:
    code_cells = [
        "".join(cell["source"])
        for cell in notebook_data.get("cells", [])
        if cell["cell_type"] == "code" and "# SKIP ME" not in "".join(cell["source"])
    ]

    # Join the code cells with three empty lines between them
    readable_code = "\n\n\n".join(code_cells)

#    print("Extracted Code Cells:\n\n\n")
    print(readable_code)
else:
    print("Could not fetch notebook content.")
