# Retrieval-Augmented Generation on Scientific Publications

### The goal of the project is to create a RAG system specifically for my own scientific publications and area of interest. To stay in line with ownership rights, only open access publications are used.

The first task is to download the open access documents. For now the following papers are used:

[1] Mihály Katona, Tamás Orosz, Robustness of a flux-intensifying permanent magnet-assisted synchronous reluctance machine focusing on shifted surface-inset ferrite magnets, Computers & Structures, Volume 316, 2025, 107845, ISSN 0045-7949, https://doi.org/10.1016/j.compstruc.2025.107845

[2] Mihály Katona, Miklós Kuczmann, Tamás Orosz, Accuracy of the robust design analysis for the flux barrier modelling of an interior permanent magnet synchronous motor, Journal of Computational and Applied Mathematics, Volume 429, 2023, 115228, ISSN 0377-0427, https://doi.org/10.1016/j.cam.2023.115228

[3] Mihály Katona, Tamás Orosz, Cogging Torque Reduction of a Flux-Intensifying Permanent Magnet-Assisted Synchronous Reluctance Machine with Surface-Inset Magnet Displacement, Energies 18, no. 20: 5492. https://doi.org/10.3390/en18205492

### 1) The first step of data ingestion (bronze table) is parsing: creating a structured table from the papers. (NB_000_parsing)

This table currently has one row per PDF. The most important column is parsed_output, which contains a complex VARIANT object (similar to a giant JSON tree) that holds the entire text and layout of the paper.

In [0]:
%sql
-- This creates a 'Bronze' table containing the structured content of the papers using ai_parse_document function.
CREATE OR REPLACE TABLE workspace.default.parsed_papers AS
SELECT
  path,
  ai_parse_document(content, map('version', '2.0')) as parsed_output,
  modificationTime
FROM READ_FILES('/Volumes/workspace/default/publications', format => 'binaryFile');

In [0]:
%sql
-- Return the parsed_papers table
SELECT * FROM workspace.default.parsed_papers

### 2) The second step is to transform the data (silver table) by slicing it into smaller pieces, called chunking.

A single academic paper can contain thousands of words. If the whole paper were fed into a chatbot, the bot would get lost in the data. The text therefore needs to be broken into smaller pieces (chunks) while keeping the academic context (such as which page or section the text came from).

In PySpark, expr stands for expression. Standard Python code doesn't understand the special colon (:) syntax used to navigate VARIANT data. expr allows Spark to evaluate the string as a SQL expression.

parsed_output:document.elements is the specific address of the research data inside the Variant.

parsed_output is the main column where the parser stored everything. (:) is the key that opens the Variant data type. document.elements is the specific folder inside the paper that contains the list of paragraphs, tables, and headers (NB_010_chunking).

The next two steps explode the parsed table to element level, that is, into chunks (NB_011_chunking and NB_012_chunking). It is important to check how the parser sliced the documents and whether the slices are sound content-wise. In this case the parser sliced the documents by paragraph, which is a good start. Another important step in creating the silver table is deleting the NULL chunks and the chunks that are not longer than X characters to reduce noise (the code below uses X = 50).
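
To make the navigate, explode and filter sequence concrete without a Spark session, here is a minimal pure-Python sketch. The nested dictionary only mimics the `parsed_output:document.elements` path described above; the file name, the element values and the helper `explode_elements` are hypothetical and are not part of the notebooks.

```python
# Toy stand-in for one row of parsed_papers. The nesting mirrors the path
# parsed_output:document.elements; all values are made up for illustration.
parsed_row = {
    "path": "/Volumes/workspace/default/publications/example.pdf",  # hypothetical file
    "parsed_output": {
        "document": {
            "elements": [
                {"type": "title", "content": "A short title"},
                {"type": "text", "content": "A paragraph that is long enough to survive the length filter applied below."},
                {"type": "figure", "content": None},
            ]
        }
    },
}


def explode_elements(row, min_chars=50):
    """Return one dict per element, skipping NULL chunks and chunks of at most min_chars characters."""
    elements = row["parsed_output"]["document"]["elements"]
    chunks = []
    for element in elements:
        text = element.get("content")
        if text is None or len(text) <= min_chars:
            continue  # same rule as: chunk_text IS NOT NULL AND length(chunk_text) > 50
        chunks.append(
            {"path": row["path"], "chunk_text": text, "element_type": element["type"]}
        )
    return chunks


chunks = explode_elements(parsed_row)
print(len(chunks), chunks[0]["element_type"])
```

Only the paragraph survives here: the title is too short and the figure has no text content.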

### The minimum chunk size is one parameter to optimise in the chunking!

In [0]:
from pyspark.sql.functions import col, explode, expr

# Loading the bronze table with Spark.
df_parsed = spark.table("workspace.default.parsed_papers")

# Adding the elements array column
df_with_elements = df_parsed.select(
    "*", 
    expr("parsed_output:document.elements").alias("elements_array")
)

display(df_with_elements)

In [0]:
# Exploding the elements array
df_exploded = df_with_elements.lateralJoin(
    spark.tvf.variant_explode(col("elements_array"))
)

display(df_exploded)

In [0]:
# Creating the chunks
df_chunks = df_exploded.select(
    "path",
    expr("value:content").alias("chunk_text"),
    expr("value:type").alias("element_type"),
)

display(df_chunks)

print(f"Total chunks created: {df_chunks.count()}")

In [0]:
from pyspark.sql import functions as sf
from pyspark.sql.functions import expr

# The filtering step for NULL and chunk size.
df_no_index = df_chunks.filter("chunk_text IS NOT NULL AND length(chunk_text) > 50")

# Adding a unique ID and casting chunk_text to string.
df_silver = (
    df_no_index
    .withColumn("chunk_id", sf.uuid())
    .withColumn("chunk_text_string", expr("chunk_text::string"))
    .withColumn("element_type_text", expr("element_type::string"))
)

display(df_silver)

print(f"Total chunks after filtering: {df_silver.count()}")

df_silver.write.option('mergeSchema', 'true').mode("overwrite").saveAsTable("workspace.default.chunked_papers")

The next step is to turn that static table into a Living Search Engine. We do this by creating a Mosaic AI Vector Search Index. This is the process where Databricks takes the table, converts it into high-dimensional math (vectors), and stores it so that the AI model can find the right needle in the academic haystack of papers in milliseconds.

In the FREE version of Databricks follow these steps:

1. Open Catalog Explorer: Navigate to chunked_papers table.
2. In the top-right corner, click the Create button and select Vector Search Index.
3. You should see the following UI.

![image_1770057989192.png](./image_1770057989192.png "image_1770057989192.png")

a) For the index name I chose publication_index.

b) Primary key: Select the column that uniquely identifies each chunk (chunk_id).

c) Columns to sync: Leave this blank to sync all columns. This ensures that the index returns the original chunk text and the path (for citations).

d) Embedding source: Keep "Compute embeddings" selected. This tells Databricks to automatically turn your text into numbers using its built-in models.

e) Embedding source column: Select chunk_text_string. This is the actual content the AI will "read" and vectorize for semantic searching. The VARIANT column chunk_text was cast to STRING for this purpose, because only a STRING column can be an input for embedding.

f) Embedding model: Choose databricks-gte-large-en. The databricks-gte-large-en is a text embedding model designed to map text into a 1024-dimensional vector representation. It supports an embedding window of up to 8192 tokens, making it suitable for tasks like retrieval, classification, clustering, semantic search, and question-answering. This model is particularly effective when paired with large language models (LLMs) for retrieval-augmented generation (RAG) use cases.

g) Sync computed embeddings: Toggle this to ON if you want to save the generated numerical vectors into a separate table in Unity Catalog. This is very useful for advanced operations to perform similarity analysis or clustering later without re-computing the vectors.

h) Vector search endpoint: Create one if you don't have it already. You should see the following:

![image_1770059266585.png](./image_1770059266585.png "image_1770059266585.png")

i) Sync mode: I recommend choosing Triggered sync mode, unless you are uploading new research papers every few minutes and need the chatbot to know about them within seconds.
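
The same settings can, as far as I know, also be expressed in code through the `databricks-vectorsearch` SDK. The sketch below is a hedged illustration and not the route taken in this project, which used the UI: it assumes the SDK exposes `create_delta_sync_index` with these keyword arguments (worth checking against the installed version), and the helper name `create_publication_index` is hypothetical. The client is passed in as a parameter so the function itself imports nothing.

```python
def create_publication_index(vsc):
    """Create a Delta Sync index mirroring the UI choices a) to i).

    `vsc` is expected to be a databricks.vector_search.client.VectorSearchClient;
    the keyword arguments below are an assumption about that SDK's signature.
    """
    return vsc.create_delta_sync_index(
        endpoint_name="academic_search_endpoint",           # h) vector search endpoint
        index_name="workspace.default.publication_index",   # a) index name
        source_table_name="workspace.default.chunked_papers",
        primary_key="chunk_id",                             # b) primary key
        pipeline_type="TRIGGERED",                          # i) sync mode
        embedding_source_column="chunk_text_string",        # e) STRING column to embed
        embedding_model_endpoint_name="databricks-gte-large-en",  # f) embedding model
    )
```

In a notebook this would be called as `create_publication_index(VectorSearchClient())` once the endpoint is online and Change Data Feed is enabled on the source table.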

You might encounter the following:

Change Data Feed must be enabled to create an index from this table. See more details on Change Data Feed in the Delta docs.
Enabling Change Data Feed requires active compute. Please start a cluster with DBR version 11.3 and above or Pro/Serverless warehouse.

Because Vector Search is a live system, it needs a way to watch your Silver Table for any new papers or updated paragraphs. Change Data Feed (CDF) is the security camera that allows Databricks to track every single row-level change without re-reading the entire table. Enable it with NB_014_chunking.

In [0]:
# Enable Change Data Feed
spark.sql("ALTER TABLE workspace.default.chunked_papers SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Check if 'delta.enableChangeDataFeed' is now 'true'
display(spark.sql("SHOW TBLPROPERTIES workspace.default.chunked_papers"))

### While the endpoint is being provisioned (getting ready), it is recommended to create a TRUTH TABLE.
The Truth Table is the source of truth for your AI. In the world of RAG (Retrieval-Augmented Generation), it is a carefully curated collection of questions and their perfect answers, validated by a human expert. With it, the model's accuracy can be estimated quantitatively later.

In MLflow, a Truth Table isn't just a list of questions. Each row in your table should contain these specific fields:
- inputs: The exact question a user might ask.
- expected_retrieved_context: The specific chunk or page number where the answer is found.
- expected_facts: A list of key facts that must appear in the answer for it to be considered correct.

#### How to Build Your Truth Table Manually
Open your research papers, pick a paragraph, and write a question that only that paragraph can answer. Also include unanswerable questions: ask something that is not in your papers. The AI should know how to say, "I'm sorry, my research papers do not contain information on that topic," rather than making it up.
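
A hand-written truth table could be drafted as a plain list of dictionaries before it is saved to a table. The sketch below is only an illustration: the questions and facts are placeholders rather than real entries, the `inputs`/`expectations` layout follows the converted table built later in this notebook, `expected_retrieved_context` is left out for brevity, and `validate_truth_table` is a hypothetical helper.

```python
# Placeholder rows; replace the strings with real questions and facts from the papers.
truth_table = [
    {
        "inputs": {"query": "Hypothetical question that only one specific paragraph answers?"},
        "expectations": {
            "expected_facts": ["Hypothetical key fact copied from that paragraph."],
        },
    },
    {
        # Deliberately unanswerable: the expected behaviour is an honest refusal.
        "inputs": {"query": "Hypothetical question about a topic the papers do not cover?"},
        "expectations": {
            "expected_facts": ["The documents do not contain information on this topic."],
        },
    },
]


def validate_truth_table(rows):
    """Return a list of human-readable problems; an empty list means the rows look usable."""
    problems = []
    for i, row in enumerate(rows):
        query = row.get("inputs", {}).get("query", "")
        if not query.strip():
            problems.append(f"row {i}: empty query")
        if not row.get("expectations", {}).get("expected_facts"):
            problems.append(f"row {i}: no expected_facts")
    return problems


print(validate_truth_table(truth_table))
```

Running a check like this before evaluation catches empty questions or missing facts early, which is cheaper than discovering them in a failed evaluation run.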

#### How to Build Your Truth Table Synthetically
This approach uses the Mosaic AI Agent Evaluation library to automatically generate a Truth Table. Instead of you manually writing 20 questions and answers, the code uses an LLM to read the documents and invent the test cases. Of course, these synthetic test cases should be reviewed manually, and corrected or deleted if necessary, before evaluating correctness or relevance.

WARNING (the synthetic generation runs either way): `Failed to count tokens for text: Robustness of a flux-intensifying permanent magnet-assisted synchronous reluctance machine focusing on shifted surface-inset ferrite magnets. Error: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))`

Cause: the cluster is trying to reach an external URL to download the tiktoken vocabulary file and is being cut off by a Databricks network security policy. Even though token counting seems like a local task, the tiktoken library (used by databricks-agents) does not ship with the vocabulary included. It tries to fetch it from openaipublic.blob.core.windows.net on its first run and fails.

In [0]:
# When building complex AI agents, several cutting-edge libraries are updated almost weekly. The -U (or --upgrade) flag is critical because the Mosaic AI Agent Framework is evolving rapidly. The databricks-agents package is used to create the synthetic truth table.
%pip install -U databricks-agents

# The %pip commands do not automatically restart the Python interpreter. When a library gets installed, the files are downloaded to the cluster, but the Python kernel in memory is still holding onto the old versions. Restarting forces the interpreter to re-scan the packages and pick up the new versions.
dbutils.library.restartPython()

In [0]:
from databricks.agents.evals import generate_evals_df
from pyspark.sql.functions import col, regexp_replace
import warnings

# Prepare the table for the generator
df_for_eval = spark.table("workspace.default.chunked_papers") \
    .filter(col("chunk_text_string").isNotNull()) \
    .withColumn("content_clean", regexp_replace(col("chunk_text_string"), r'[^\x00-\x7F]+', ' ')) \
    .select(
        col("content_clean").alias("content"),
        col("chunk_id").alias("doc_uri")
    ).toPandas()

# Run the synthetic generator
synthetic_evals = generate_evals_df(
    df_for_eval,
    # The number of evals to generate
    num_evals=20,
    # The agent description is used to specify the behaviour of the agent
    agent_description="You are an expert in robust design analysis of electric machines and circular economy principles. You are also an expert in retrieval-augmented generation, so you are able to generate synthetic truth tables for a given prompt.",
    # The question guidelines are used to specify the type of questions to ask
    question_guidelines="Ask technical questions that require specific values. Ask questions about the aim of the research and the conclusions too. Do not ask questions that are too broad or subjective."
)

# Convert to spark dataframe to access .write method
df_truth = spark.createDataFrame(synthetic_evals)

# Save to Unity Catalog
df_truth.write.option("overwriteSchema", "true").mode("overwrite").saveAsTable("workspace.default.truth_table")

display(df_truth)

In [0]:
from pyspark.sql.functions import col

# Define the list of IDs to remove
rows_to_remove = [
    '717e211b5073788ee11854d1d10e79b9eb805275ccf8718f710da1c79746e455',
    '736adc2c7cc82164bc51f276e95af55ebaec8bab994b891fe7642f470ab26a8e',
    '311d9ec365f1371236c949bd990c5b60b58182fefd6d0e1da9ba8434dd24c045',
    'f934e33ca2b303b39edbbd53e8cf7f0cf13caa1220c3ea566850b8a3cb38e270'
]

# Load the full table
df_truth = spark.table("workspace.default.truth_table")

# Apply the filter (WHERE request_id NOT IN ...)
df_truth_filtered = df_truth.filter(~col("request_id").isin(rows_to_remove))

# Save to Unity Catalog
df_truth_filtered.write.option("overwriteSchema", "true").mode("overwrite").saveAsTable("workspace.default.truth_table_filtered")

display(df_truth_filtered)

In [0]:
# Here the synthetic truth table is converted into a digestible format for MLflow to evaluate.
# The generate_evals_df function creates data in a complex chat protocol format (simulating a full conversation history), but MLflow evaluation setup requires a simple input map (a dictionary with a single "query" key).
from pyspark.sql.functions import col, create_map, lit

df_truth_filtered = spark.table("workspace.default.truth_table_filtered")

# Extracting the clean question and expectations
df_flat = df_truth_filtered.select(
    create_map(
        lit("query"), col("inputs").messages.content[0]
    ).alias("inputs"),
    col("expectations")
)

df_flat.write.option("overwriteSchema", "true").mode("overwrite").saveAsTable("workspace.default.truth_table_converted")

display(df_flat)

### The following snippet is the evaluation for correctness and relevance.

Correctness: "Is the answer factually accurate?" This metric measures accuracy by comparing your bot's answer against the Ground Truth (the expected_facts you defined in your truth table). Input: The bot's response + The expected_facts. The Judge asks: "Does the response contain all the critical facts listed in the expectation? Is it factually true?" Fails if: The bot says "300 Nm torque" when the truth table says "350 Nm torque". Passes if: The bot says "The torque is 350 Nm," even if the wording is different from your expectation.

Relevance (relevance_to_query): "Did the bot actually answer the question?" This metric measures the quality of the response in relation to the User's Query, ignoring the ground truth. It checks whether the bot stayed on topic. Input: The bot's response + The original query. The Judge asks: "Is this response helpful? Does it directly address the user's intent? Does it avoid rambling about unrelated topics?" Fails if: The user asks "What is the flux density?" and the bot replies, "Here is a summary of the circular economy." (This is irrelevant, even if the summary is factually true). Passes if: The bot provides a direct, concise answer to the specific question asked.

| Scenario | Correctness | Relevance | Diagnosis |
| :--- | :--- | :--- | :--- |
| **The Gold Standard** | ✅ High | ✅ High | The bot gave the right answer to the right question. |
| **The "Hallucination"** | ❌ Low | ✅ High | The bot gave a direct answer ("The value is 5"), but it was **wrong** (Ground truth was 10). |
| **The "Politician"** | ✅ High | ❌ Low | The bot said something true ("The sky is blue"), but it **didn't answer the question** about electric motors. |
| **The "Refusal"** | ❌ Low | ❌ Low | The bot said "I don't know" or gave a completely broken response. |
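
The four scenarios in the table can be written down as a small lookup, which is handy when tallying evaluation results by failure type. This is a minimal sketch; the function name `diagnose` and the boolean pass/fail inputs are assumptions for illustration and are not part of the MLflow API.

```python
from collections import Counter


def diagnose(correct: bool, relevant: bool) -> str:
    """Map a (correctness, relevance) verdict pair to the scenario names in the table."""
    if correct and relevant:
        return "gold standard"
    if relevant:
        return "hallucination"  # direct answer, but factually wrong
    if correct:
        return "politician"     # true statement, but not an answer to the question
    return "refusal"            # neither correct nor relevant


# Hypothetical verdicts for four evaluated rows: (correct, relevant)
verdicts = [(True, True), (False, True), (True, True), (False, False)]
summary = Counter(diagnose(c, r) for c, r in verdicts)
print(summary)
```

Grouping rows this way points to where the effort should go: many "hallucination" rows suggest a retrieval or grounding problem, while "politician" rows suggest the prompt lets the model drift.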

### --- PART 1: The Retrieval & Generation Logic ---
#### Initialize the Vector Search Client
This block establishes the **"Handshake"** with the Vector Database. Before we can search for answers, we must authenticate with the `VectorSearchClient` and point it to the exact location of our embedded knowledge (`publication_index`). This creates an `index` object that acts as the **Retriever**, allowing the bot to access the stored PDF chunks during the query phase.

#### A. RETRIEVAL: Find the top 10 most similar chunks
This is the core "Search" mechanism of the RAG pipeline. Instead of matching exact keywords (like Ctrl+F), we perform a **Similarity Search**. The `query_text` (user's question) is converted into a vector, and the database finds the top 10 chunks of text that are mathematically "closest" in meaning. We request both the content (`chunk_text_string`) and the source filename (`path`) so that we know *what* to say and *where* it came from.

The raw search results come back as a list of separate text chunks. To feed this information into the LLM, we must **flatten** them into a single string. A one-liner extracts just the text content (ignoring metadata for now) from the top 10 results and joins them together with newlines. This creates the `context` block, a unified block of background knowledge, that we will paste into the prompt so the LLM has the "facts" it needs to answer the question.
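
What "closest in meaning" and "flatten" amount to can be shown on toy data. The sketch below uses cosine similarity as a stand-in for whatever metric the vector index applies internally, and the three-dimensional vectors and chunk texts are invented; real embeddings from databricks-gte-large-en have 1024 dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, indexed_chunks, k):
    """Return the k (score, text) pairs with the highest similarity to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up embeddings: (chunk text, vector)
indexed_chunks = [
    ("chunk about cogging torque", [0.9, 0.1, 0.0]),
    ("chunk about circular economy", [0.0, 0.2, 0.9]),
    ("chunk about torque ripple", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of the user's question

hits = top_k(query_vec, indexed_chunks, k=2)
# Flatten: keep only the text and join with newlines to form the prompt context.
context = "\n".join(text for _, text in hits)
print(context)
```

The unrelated "circular economy" chunk points in a different direction and is dropped, which is the behaviour the retrieval step relies on.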

#### B. GENERATION: Send context + question to the LLM
This is where the retrieved information is synthesized into a human-readable answer. We construct a prompt that strictly combines the **Context** (our PDF data) and the **Question** (the user's query). By sending this to the **Llama 3.3 70B Instruct** model (`databricks-meta-llama-3-3-70b-instruct`) via the Databricks Serving Endpoint, we force the AI to "ground" its answer in our specific documents rather than its general training data. Finally, we return **both** the generated answer and the original context so the automated Judge can later verify if the answer was faithful to the source material.

#### WRAPPER: This is the function MLflow will call to evaluate
This function acts as the **"Bridge"** between our custom logic and the MLflow evaluation harness. MLflow expects a specific input/output format to run its automated tests. We wrap our `research_assistant` function here to:
1.  **Standardize Inputs:** Accept a single `query` argument (which MLflow unpacks automatically).
2.  **Format Outputs:** Return a dictionary with clearly labeled keys (`response`, `retrieved_context`) so the Judge knows exactly which text to grade.
3.  **Enable Tracing:** The `@mlflow.trace` decorator turns on detailed logging, allowing us to see every step of the retrieval and generation process in the MLflow UI for debugging.

### --- PART 2: The Data Sanitizer ---
This block is a critical safety mechanism against serialization errors. When we convert Spark tables to Pandas, data is often stored as NumPy arrays (e.g., `np.ndarray`). However, MLflow's evaluation tools rely on standard JSON for logging, which **crashes** if it encounters NumPy types. This recursive function walks through every item in our dataset and converts any arrays into standard Python lists, ensuring the evaluation run completes without a serialization error.

### --- PART 3: The Evaluation Execution ---
This is the final step where we actually run the test. 

By calling `mlflow.genai.evaluate`, we orchestrate the entire test:
1.  **Feed the Data:** It takes every question from our sanitized truth table (`eval_data`).
2.  **Run the Agent:** It passes each question to our `eval_predict_fn` to get the real-time answer and context.
3.  **Grade the Results:** It employs an "LLM-as-a-Judge" (a separate AI model) to score the answers based on **Correctness** (accuracy against ground truth) and **Relevance** (adherence to the query).
4.  **Log Everything:** The `start_run()` context ensures every input, output, and score is recorded in the MLflow Experiment for future analysis.

In [0]:
%pip install databricks-vectorsearch
dbutils.library.restartPython()

In [0]:
import mlflow
from mlflow.deployments import get_deploy_client
from databricks.vector_search.client import VectorSearchClient
import numpy as np

# --- PART 1: The Retrieval & Generation Logic ---
def research_assistant(question):
    """
    This is the core logic of your RAG bot.
    1. It searches your Vector Index for relevant PDF chunks.
    2. It sends those chunks + the question to Llama 3.
    """
    # Initialize the Vector Search Client
    vsc = VectorSearchClient(disable_notice=True)
    index = vsc.get_index(
        endpoint_name="academic_search_endpoint", 
        index_name="workspace.default.publication_index"
    )
    
    # A. RETRIEVAL: Find the top 10 most similar chunks
    results = index.similarity_search(
        query_text=question,
        columns=["chunk_text_string", "path"],
        num_results=10
    )

    # Combine the retrieved text into a single string for the prompt
    context = "\n".join([row[0] for row in results['result']['data_array']])
    
    # B. GENERATION: Send context + question to the LLM
    client = get_deploy_client("databricks")

    # 1. Define the System Instructions (Guardrails + CoT)
    system_prompt = """You are a technical research assistant in robust design optimisation, tolerance analysis and circular economy related topics.  Answer the question ONLY using the provided Context. If the answer is not in the context, say 'I do not have enough information in the provided documents to answer this.'

    Follow this thinking process:
    1. Extract relevant technical quotes from the context.
    2. Reason step-by-step how these facts answer the user's question.
    3. Provide a final concise answer."""

    # 2. Define Few-Shot Examples (Teaches the model the 'style' of answer)
    few_shot_examples = [
        {"role": "user", "content": "Context: [Chunk 1: Skewing reduces Cogging Torque by 15%] Question: How to fix cogging?"},
        {"role": "assistant", "content": "Thinking: Context mentions skewing as a solution. \nAnswer: According to the documents, skewing can reduce cogging torque by 15%."}
    ]

    # 3. Build the final message list
    messages = [
        {"role": "system", "content": system_prompt},
        *few_shot_examples,
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}
    ]

    response = client.predict(
        endpoint="databricks-meta-llama-3-3-70b-instruct",
        inputs={"messages": messages}
    )

    answer = response['choices'][0]['message']['content']
    
    # Return both so the Judge can check if the answer matches the context
    return answer, context

# C. WRAPPER: This is the function MLflow will call to evaluate
@mlflow.trace # Enables detailed tracing in the MLflow UI
def eval_predict_fn(query):
    """
    The 'predict_fn' for mlflow.genai.evaluate receives the keys of each
    row's 'inputs' dictionary as keyword arguments, so the 'query' key
    from your table is passed here as the `query` parameter.
    """

    answer, context = research_assistant(query)
    
    # We return a dictionary so MLflow can map these to metrics
    return {
        "response": answer,
        "retrieved_context": context
    }

# --- PART 2: The Data Sanitizer ---
def sanitize_for_json(obj):
    """
    Recursively converts Numpy arrays and other non-JSON types 
    into standard Python lists and dictionaries.
    Essential for preventing 'Not JSON Serializable' errors in MLflow.
    """
    # If it's a Numpy array (common in Pandas), turn it into a list
    if isinstance(obj, np.ndarray):
        return [sanitize_for_json(i) for i in obj.tolist()]
    
    # If it's a list, check every item inside it
    if isinstance(obj, list):
        return [sanitize_for_json(i) for i in obj]
    
    # If it's a dictionary, check every value inside it
    if isinstance(obj, dict):
        return {k: sanitize_for_json(v) for k, v in obj.items()}
    
    # Otherwise, it's safe (string, int, float), just return it
    return obj

# Load the flattened table we created earlier
eval_data = spark.table("workspace.default.truth_table_converted").toPandas()

# Apply the sanitizer to the entire Pandas DataFrame
eval_data["expectations"] = eval_data["expectations"].apply(sanitize_for_json)

# --- PART 3: The Evaluation Execution ---
with mlflow.start_run():
    
    # Run the evaluation harness
    eval_results = mlflow.genai.evaluate(
        data=eval_data,              # The sanitized Truth Table
        predict_fn=eval_predict_fn,  # Your wrapped agent function
        scorers=[
            # 1. Correctness: Does the answer match the expected_facts?
            mlflow.genai.scorers.Correctness(metric_name="answer_correctness"),
            
            # 2. Relevance: Does the answer directly address the user's query?
            mlflow.genai.scorers.RelevanceToQuery(metric_name="context_relevance")
        ],
    )

#### Analysis of Evaluation Traces
The trace table provides a row-by-row breakdown of how the agent handled specific queries:

Consistency in Relevance: The system has a 100% pass rate for Relevance across all visible rows (e.g., tr-ff14bae..., tr-9bb01b...). This indicates that your prompt engineering is successful at keeping the LLM focused on the questions asked.

Correctness Failures: Several traces are marked as Fail for Correctness (shown in red), such as:

tr-9bb01b... ("What are the effects of combining...")

tr-d947ef2... ("What specific factors are isolated...")

tr-7f37a08... ("What is the range of torque-ripple...")

Performance Stability: Execution times are mostly stable, ranging from 1.1s to 5.8s. The longest execution (tr-76a47ab... at 5.8s) passed both metrics, suggesting that complexity doesn't necessarily correlate with failure.

![image_1771269233157.png](./image_1771269233157.png "image_1771269233157.png")

#### To improve Correctness in a RAG system, you must systematically address two areas: Retrieval (finding the right facts) and Generation (using those facts correctly). A failure in correctness often means the model is either missing the necessary information or ignoring the context to rely on its own training data.

1. Optimize Retrieval (Fixing the "Facts")
If the correctness score is low, the first step is ensuring the LLM is actually receiving the correct information.

Adjust Chunking Strategy: Increase the chunk size or add chunk overlap to ensure sentences aren't cut off mid-thought, which can lead to incomplete facts being fed to the model. (`--> The ai_parse_document function itself does not have native parameters for chunk size or overlap. This is because its primary job is structural parsing (extracting tables, text, and layout) rather than chunking. So chunk size and overlap must be handled in a post-processing step, for example using LangChain.`)
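
To show what chunk size and overlap mean in practice, here is a dependency-free sliding-window splitter. It is a simplified stand-in for a library text splitter, not LangChain's implementation: it cuts on raw character positions and ignores sentence boundaries, and the function name and defaults are hypothetical.

```python
def split_with_overlap(text, chunk_size=200, overlap=50):
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Tiny example so the overlap is easy to see: each chunk repeats
# the last two characters of the previous one.
pieces = split_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(pieces)
```

With real paragraphs the overlap means a sentence cut at a chunk border still appears whole in the neighbouring chunk, at the cost of storing some text twice.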

Increase top-k: Increasing the number of retrieved chunks (e.g., from 5 to 10) can improve "context sufficiency," giving the model a better chance of finding the specific detail needed for the answer. (`--> I implemented this first, and it increased correctness to 75%.`)

Implement Re-ranking: Use a cross-encoder re-ranker after your initial vector search. This secondary step re-scores the retrieved chunks more precisely, ensuring the most relevant facts are at the top of the context. (`--> If you are using Mosaic AI Vector Search, you can implement reranking with a single line of code using the built-in DatabricksReranker. In this workspace the attempt failed with: *Error: Reranking is not yet enabled for this workspace. Please contact support if you are interested in this feature.* Hybrid search is a separate option: when query_type is set to hybrid, Databricks runs two searches in parallel and then fuses the results. Semantic Search (Dense Vector): The AI understands the meaning of your query. For "How to reduce torque ripple," it might find documents about "harmonics mitigation" or "skewing techniques," even if they don't use the exact word "ripple". Keyword Search (Lexical): It looks for exact word matches. This is critical for technical terms like "PMSM," "LMP-9R2," or specific error codes that a vector model might find "similar" to other unrelated terms.`)
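
How two rankings can be fused into one is easiest to see with Reciprocal Rank Fusion (RRF), a common fusion method. Whether Databricks applies exactly this formula and constant is an assumption here; the sketch only illustrates the idea, and the chunk ids are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


semantic_hits = ["c2", "c1", "c3"]  # hypothetical dense-vector ranking
keyword_hits = ["c3", "c2", "c4"]   # hypothetical lexical ranking

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Chunks that appear in both lists (c2 and c3) rise above chunks found by only one search, which is why hybrid search helps with exact technical terms without giving up semantic matches.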

2. Refine Generation (Fixing the "Response") (`--> I replaced the naive prompt with a structured system prompt that includes instructions, few-shot examples, and Chain-of-Thought (CoT) triggers. This did not change the correctness score; however, with the new system prompt the model uncovered a genuine misinterpretation in the paper.`)

Even with perfect retrieval, the LLM may struggle to process the information accurately.

Prompt Engineering: Use explicit instructions such as "Answer only using the provided context" or "If the information is not present, say you do not know" to reduce guessing and hallucinations.

Few-Shot Prompting: Provide 2–3 examples of "Question + Context + Correct Answer" within your prompt to show the model exactly how you want it to extract and format technical data.

Chain-of-Thought (CoT): Instruct the model to "Think step-by-step" or "Extract relevant quotes first before answering." This forces the model to reason through the context before finalizing its response.

3. Systematic Debugging with MLflow
Since you are using MLflow, leverage its diagnostic tools to pinpoint the failure:

Inspect Traces: Use the MLflow UI to look at the retrieved context for a failed row. If the answer isn't in those chunks, the problem is Retrieval; if the answer is there but the model missed it, the problem is Generation. (`--> With the new system prompts the AI discovered a genuine misinterpretation in the paper. Correctness should always be checked manually!`)
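
The retrieval-versus-generation triage described above can be roughed out in code. The sketch below uses naive case-insensitive substring matching, which is an assumption made for simplicity: expected facts are often paraphrased, so a "retrieval" verdict here only flags a row for manual inspection. The function name and the sample strings are hypothetical.

```python
def triage_failure(expected_facts, retrieved_context):
    """Guess where a failed row went wrong.

    Returns ("retrieval", missing_facts) if some expected fact is absent from
    the retrieved context, otherwise ("generation", []).
    """
    context = retrieved_context.lower()
    missing = [fact for fact in expected_facts if fact.lower() not in context]
    if missing:
        return "retrieval", missing
    return "generation", []


# Hypothetical failed row: the key phrase never reached the prompt context.
stage, missing = triage_failure(
    expected_facts=["skewing"],
    retrieved_context="This chunk only discusses magnet displacement.",
)
print(stage, missing)
```

For rows flagged as "generation", the facts were available to the model, so the prompt or the model's reading of the context is the next thing to examine.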

In [0]:
import mlflow
import pandas as pd
from mlflow.deployments import get_deploy_client
from databricks.vector_search.client import VectorSearchClient

def research_assistant(question):
    """
    Returns a DataFrame where each row is a retrieved chunk, 
    allowing you to see exactly which text block came from which file.
    """
    # Initialize Clients
    vsc = VectorSearchClient(disable_notice=True)
    index = vsc.get_index(
        endpoint_name="academic_search_endpoint", 
        index_name="workspace.default.publication_index"
    )
    
    # A. RETRIEVAL: Find top 10 chunks
    results = index.similarity_search(
        query_text=question,
        columns=["chunk_text_string", "path"],
        num_results=10,
        query_type="hybrid"
    )
    
    # raw_data is a list of lists: [[text, path, score], [text, path, score], ...]
    raw_data = results['result']['data_array']
    
    # Prepare single string context for the LLM
    context_text = "\n\n".join([row[0] for row in raw_data])
    
    # B. GENERATION
    client = get_deploy_client("databricks")

    system_prompt = """You are a technical research assistant in robust design optimisation. 
    Be creative based on the given context."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context: {context_text}\n\nQuestion: {question}"}
    ]

    response = client.predict(
        endpoint="databricks-meta-llama-3-3-70b-instruct",
        inputs={"messages": messages}
    )

    answer = response['choices'][0]['message']['content']
    
    # C. FORMAT AS TABLE (One row per chunk)
    # 1. Create DataFrame from the raw list of chunks and paths
    # The similarity score is a numerical metric that tells you how closely a retrieved chunk matches your query.
    # Vector Search returns it as the last value of each result row, which is why a third column is named here.
    df = pd.DataFrame(raw_data, columns=["Chunks", "Path", "Similarity"])
    
    # 2. Add the generated Answer as a new column (repeated for every row)
    df.insert(0, "Answer", answer)
    
    return df

# Test it
df_result = research_assistant("How is the Taguchi method used for robust design?")
display(df_result)