## Problem Statement

### Business Context

The healthcare industry is evolving quickly, and professionals must manage large amounts of medical data while making fast, accurate decisions. Information overload makes it difficult to identify the most relevant research, especially during emergencies when time is critical. Access to trusted, up-to-date medical knowledge is essential for accurate diagnoses and effective treatment plans.

Healthcare centers can improve care and efficiency by integrating systems that centralize medical information, support rapid decision-making, and give providers continuous access to reliable resources. These tools help streamline workflows, reduce errors, and enhance patient outcomes.

**Common Questions to Answer**

1. What is the protocol for managing sepsis in a critical care unit?

2. What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

3. What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

4. What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

5. What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

### Objective

The task is to develop a RAG-based AI solution that uses trusted medical manuals to tackle key healthcare challenges. The goal is to understand issues like information overload, apply AI to streamline decision-making, analyze its impact on diagnostics and patient outcomes, evaluate its potential to standardize care, and create a functional prototype that demonstrates its feasibility and effectiveness.

### Data Description

The **Merck Manuals** are medical references published by the American pharmaceutical company Merck & Co. that cover a wide range of medical topics, including disorders, tests, diagnoses, and drugs. The manuals have been published since 1899, when Merck & Co. was still a subsidiary of the German company Merck.

The manual is provided as a PDF with over 4,000 pages divided into 23 sections.

## Installing and Importing Necessary Libraries and Dependencies

In [None]:
# Installation for GPU llama-cpp-python
# Run the following line if a GPU is being used (comment it out otherwise)
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

# Installation for CPU llama-cpp-python
# Uncomment and run the following line instead if a GPU is not being used
# !CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

In [None]:
# For installing the libraries & downloading models from HF Hub
!pip install huggingface_hub==0.35.3 pandas==2.2.2 tiktoken==0.12.0 pymupdf==1.26.5 langchain==0.3.27 langchain-community==0.3.31 chromadb==1.1.1 sentence-transformers==5.1.1 numpy==2.3.3 -q

In [None]:
# Libraries for processing dataframes and text
import json
import os
import tiktoken
import pandas as pd

#Libraries for Loading Data, Chunking, Embedding, and Vector Databases
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader, PyPDFLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

#Libraries for downloading and loading the llm
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

## Question Answering using LLM

#### Downloading and Loading the model

In [None]:
# Defining the Hugging Face model repository to download from
# This specifies which pre-trained model to use from TheBloke's collection
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"

# Specifying the exact GGUF quantization file to load
# The ".Q6_K" indicates the quantization level (6-bit K-quants - good balance of quality/size)
# GGUF is the file format used by llama.cpp and compatible loaders
model_basename = "mistral-7b-instruct-v0.2.Q6_K.gguf"

In [None]:
# Downloading the specified GGUF model file from Hugging Face to local cache
model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename
)

In [None]:
# Initializing the Llama model with GPU offloading and batch processing
llm = Llama(
    model_path=model_path,
    n_ctx=2300,
    n_gpu_layers=38,
    n_batch=512
)

#### Response

In [None]:
# Generating a text completion from the LLM with specified decoding parameters
def response(query,max_tokens=512,temperature=0,top_p=0.95,top_k=50):
    model_output = llm(
      prompt=query,
      max_tokens=max_tokens,
      temperature=temperature,
      top_p=top_p,
      top_k=top_k
    )

    return model_output['choices'][0]['text']

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
# Assigning query to a variable
query1 = "What is the protocol for managing sepsis in a critical care unit?"
# Printing the response
print(response(query1))

The query was about sepsis, and the model provided a correct and detailed response explaining the general management steps in a critical care unit.

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
# Assigning query to a variable
query2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
# Printing the response
print(response(query2))

The query contains three questions:
1. Symptoms of appendicitis
2. Whether it can be cured with medicine
3. The recommended surgical procedure

The model provided correct, detailed answers to all three questions.

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
# Assigning query to a variable
query3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
# Printing the response
print(response(query3))

The query contains two questions — the treatment for patchy hair loss and the cause of patchy hair loss. The model provided correct and detailed answers to both questions, including appropriate medical terminology and treatment names.

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
# Assigning query to a variable
query4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
# Printing the response
print(response(query4))

The query asks about treatment for a physical injury to brain tissue, and the model gave an accurate, descriptive response.

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
# Assigning query to a variable
query5 = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
# Printing the response
print(response(query5))

The query contains two questions—treatment steps for a fractured leg and the care and recovery process. The model provided accurate and complete answers to both, with an appropriate emphasis on the recovery and care aspects, which is expected.

## Question Answering using LLM with Prompt Engineering

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
# Assigning the query to a variable
query1a = "What is the protocol for managing sepsis in a critical care unit?"
# Printing the response of query by changing the temperature parameter
print(response(query1a,temperature=0.5))

With the temperature raised from 0 to 0.5, the response became more creative and descriptive.
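To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax over a toy set of logits. The function name `softmax_with_temperature` and the logit values are illustrative assumptions, not part of llama-cpp-python, and treating a temperature of 0 as greedy selection is a simplification made for this sketch.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return token probabilities after dividing the logits by the temperature."""
    if temperature <= 0:
        # Assumption: temperature 0 means greedy decoding (all mass on the best token).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
greedy = softmax_with_temperature(toy_logits, 0)
warm = softmax_with_temperature(toy_logits, 0.5)
hot = softmax_with_temperature(toy_logits, 1.0)
print(greedy, warm, hot)
```

Lower temperatures concentrate probability on the top token, while higher values flatten the distribution, which is consistent with more varied wording at 0.5 than at 0.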

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
# Assigning the query to a variable
query2a = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"

# Printing the response of query by changing the max_tokens parameter
print(response(query2a, max_tokens=1024))

Increasing max_tokens from 512 to 1024 produces a more descriptive response.

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
# Assigning the query to a variable
query3a = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
# Printing the response of query by changing the top_p parameter
print(response(query3a,top_p=0.8))

With top_p lowered from 0.95 to 0.8, the response is more predictable but potentially incomplete, which suits fact retrieval better than nuanced explanations.
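As a rough illustration of what top_p controls, the sketch below implements nucleus filtering on a toy probability list: it keeps the smallest set of most probable tokens whose cumulative mass reaches top_p. The helper `top_p_filter` and the probabilities are hypothetical and only approximate what a real sampler does internally.

```python
def top_p_filter(probs, top_p):
    """Return indices of the smallest high-probability set whose mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append(index)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

toy_probs = [0.6, 0.25, 0.12, 0.03]  # hypothetical next-token probabilities
kept_095 = top_p_filter(toy_probs, 0.95)
kept_080 = top_p_filter(toy_probs, 0.80)
print(kept_095, kept_080)
```

With the smaller top_p, fewer candidate tokens survive the filter, so sampling has less room to vary.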

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
# Assigning the query to a variable
query4a = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
# Printing the response of query by changing the top_k parameter
print(response(query4a,top_k=30))

No major change in the response is observed after reducing top_k from 50 to 30.
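One plausible explanation, assuming decoding is greedy at the default temperature of 0, is that the most probable token stays the same whether the candidate pool holds 50 or 30 tokens. The sketch below shows this on toy data; `top_k_filter` and `greedy_pick` are hypothetical helpers, not library functions.

```python
def top_k_filter(probs, top_k):
    """Return indices of the top_k most probable tokens, most probable first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:top_k]

def greedy_pick(probs, top_k):
    """Pick the single most probable token from the top_k candidates."""
    return top_k_filter(probs, top_k)[0]

toy_probs = [0.02, 0.4, 0.08, 0.5]  # hypothetical next-token probabilities
pick_k50 = greedy_pick(toy_probs, 50)
pick_k30 = greedy_pick(toy_probs, 30)
print(pick_k50, pick_k30)
```

Under this assumption, top_k would only start to matter once the temperature is raised above 0 and tokens are actually sampled.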

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
# Assigning the query to a variable
query5a = "List essential first-aid steps for a leg fracture during a hike, and key care & recovery guidelines after professional treatment."
# Printing the response of query by modifying the query
print(response(query5a))

Rephrasing the query as an explicit instruction yields a well-structured, detailed response.

## Data Preparation for RAG

### Loading the Data

In [None]:
# Import the Google Colab drive module
from google.colab import drive

# Mount Google Drive to the specified directory in the Colab environment.
# This will prompt you to authorize access.
drive.mount('/content/drive')


In [None]:
# Path to the PDF file stored in your Google Drive
pdf_file = '/content/drive/My Drive/medical_diagnosis_manual.pdf'

In [None]:
# Create a PDF loader object to read and process the specified PDF file
pdf_loader = PyMuPDFLoader(pdf_file)

### Data Overview

#### Checking the first 5 pages

In [None]:
# Load all pages from the PDF
docs = pdf_loader.load()

# Display the first five pages (indexes 0–4)
for i in range(5):
    print(f"--- Page {i+1} ---")
    print(docs[i].page_content)   # or simply: print(docs[i])
    print("\n")

#### Checking the number of pages

In [None]:
# Check how many pages (documents) were loaded from the PDF
len(docs)

There are 4114 pages in the PDF document.

### Data Chunking

In [None]:
# Create a text splitter that breaks the text into smaller chunks.
# Uses the tiktoken encoder for accurate token-based splitting.
# chunk_size = 512 tokens per chunk
# chunk_overlap = 16 tokens of overlap between consecutive chunks
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=512,
    chunk_overlap=16
)
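The role of `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window over a token list. This is a simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, which the hypothetical `chunk_with_overlap` helper below ignores.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows that share chunk_overlap tokens with their neighbor."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

toy_tokens = list(range(1200))  # stand-in for 1,200 token ids
chunks = chunk_with_overlap(toy_tokens, chunk_size=512, chunk_overlap=16)
print([len(c) for c in chunks])
```

The overlap means the last 16 tokens of one chunk reappear at the start of the next, which reduces the chance that a sentence is cut off from its context at a chunk boundary.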

In [None]:
# Load the PDF and split it into smaller text chunks using the configured text splitter
document_chunks = pdf_loader.load_and_split(text_splitter)

In [None]:
# View the 7th text chunk (index 6) produced by the text splitter
document_chunks[6]

In [None]:
# Installing required dependencies (using numpy<2 for compatibility with sentence-transformers)
!pip install -U "numpy<2" sentence-transformers langchain-huggingface

### Embedding

In [None]:
# Initialize the embedding model that converts text chunks into numerical vectors.
# Using the 'thenlper/gte-large' model for high-quality semantic embeddings.
embedding_model = SentenceTransformerEmbeddings(model_name='thenlper/gte-large')

### Vector Database

In [None]:
# Name of the collection where embeddings will be stored
report = 'Article_QnA'

# Create a vector store using Chroma to store embeddings of the document chunks.
# - document_chunks: the text chunks to be embedded
# - embedding_model: the model used to generate embeddings
# - collection_name: the name of the collection in Chroma
vectorstore = Chroma.from_documents(
    document_chunks,
    embedding_model,
    collection_name=report
)


### Retriever

In [None]:
# Convert the Chroma vector store into a retriever object for semantic search
# - search_type='similarity': retrieves documents most similar to a query
# - search_kwargs={'k':2}: return the top 2 most relevant chunks for each query
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k':2}
)
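Conceptually, similarity retrieval ranks stored chunk vectors against the query vector and returns the best `k`. The sketch below uses cosine similarity on hand-made three-dimensional vectors; the vectors, names, and the choice of cosine are assumptions for illustration, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# Hypothetical embeddings; real gte-large vectors have far more dimensions.
toy_chunks = {
    "sepsis": [0.9, 0.1, 0.0],
    "fracture": [0.1, 0.9, 0.1],
    "alopecia": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.3, 0.0]
top_two = retrieve_top_k(toy_query, toy_chunks, k=2)
print(top_two)
```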


### System and User Prompt Template

In [None]:
# Defining  a controlled Q&A behavior where the model must answer only using the given context
qna_system_message = """
You are an assistant whose work is to give answers to questions with respect to a context.
User input will have the context required by you to answer user questions.

This context will begin with the token: ###Context.
The context contains references to specific portions of a document relevant to the user query.

User questions will begin with the token: ###Question.

Strictly answer only using the information provided in the ###Context.
Do not mention anything about the information in ###Context or the question in ###Question in your final answer.

If the answer to ###Question cannot be derived from the ###Context, just respond by saying "I don't know".

Remember that the answer to ###Question might not always be directly present in the information provided in the ###Context.
The answer can be indirectly derived from the information in ###Context.

"""

In [None]:
# This template formats the user message sent to the model.
# It inserts the context and the question under their respective tokens (###Context and ###Question),
# ensuring the model receives them in a structured and predictable format.
qna_user_message_template = """
Consider the following ###Context and ###Question
###Context
{context}

###Question
{question}
"""

### Response Function

In [None]:
import tiktoken

# Initialize tokenizer for counting and trimming tokens
enc = tiktoken.get_encoding("cl100k_base")

def trim_to_token_limit(text, max_tokens=1000):
    """
    Encodes the input text into tokens and trims it so that
    the total number of tokens does not exceed max_tokens.
    This prevents overly long contexts from breaking the model prompt.
    """
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])


def generate_rag_response(
        user_input,
        k=3,
        max_tokens=128,
        temperature=0,
        top_p=0.95,
        top_k=50
    ):
    global qna_system_message, qna_user_message_template

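    # Retrieve relevant document chunks.
    # The retriever was created with search_kwargs={'k': 2}, so it returns at most
    # two chunks; the slice below can only lower that number, not raise it.
    relevant_document_chunks = retriever.invoke(user_input)
    context_list = [d.page_content for d in relevant_document_chunks[:k]]

    # Combine document chunks into a single context
    context_for_query = ". ".join(context_list)

    # Trim to a safe token limit (counted with cl100k_base, which only
    # approximates the token count of the Mistral tokenizer)
    context_for_query = trim_to_token_limit(context_for_query, 1000)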

    user_message = qna_user_message_template.replace('{context}', context_for_query)
    user_message = user_message.replace('{question}', user_input)

    prompt = qna_system_message + "\n" + user_message

    try:
        response = llm(
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k
        )

        response = response["choices"][0]["text"].strip()
    except Exception as e:
        response = f"Sorry, I encountered the following error:\n{e}"

    return response
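The 1,000-token context cap has to fit, together with the prompt text and the generated answer, inside the `n_ctx=2300` window set when the model was loaded. The sketch below works through that arithmetic. The 250-token overhead for the system message, template, and question is a made-up figure, and actual counts depend on the Mistral tokenizer rather than `cl100k_base`.

```python
def remaining_headroom(n_ctx, prompt_tokens, max_new_tokens):
    """Tokens left in the context window after the prompt and the requested completion."""
    return n_ctx - prompt_tokens - max_new_tokens

N_CTX = 2300                   # context window used when loading the model
CONTEXT_TOKENS = 1000          # cap applied by trim_to_token_limit
ASSUMED_OVERHEAD_TOKENS = 250  # hypothetical: system message + template + question

prompt_tokens = CONTEXT_TOKENS + ASSUMED_OVERHEAD_TOKENS
headroom_default = remaining_headroom(N_CTX, prompt_tokens, max_new_tokens=128)
headroom_long = remaining_headroom(N_CTX, prompt_tokens, max_new_tokens=1024)
print(headroom_default, headroom_long)
```

Under these assumed numbers, the default `max_tokens=128` leaves ample room, while `max_tokens=1024` leaves only a small margin, so a longer question or a tokenizer mismatch could exceed the window.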


## Question Answering using RAG

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
# Assigning query to variable
Rquery1 = "What is the protocol for managing sepsis in a critical care unit?"
# Printing the response
print(generate_rag_response(Rquery1))

Since this query was processed using a RAG setup with the supporting PDF document, the model’s response is expressed in precise medical terminology suitable for domain experts, which is exactly the expected behavior.

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
# Assigning query to variable
Rquery2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
# Printing the response
print(generate_rag_response(Rquery2))

For this query, the symptoms and the possibility of a cure with medicine are not directly stated in the retrieved context, so the model responds with “I don’t know” for the symptoms part. However, it indirectly gives the symptoms by stating “typically results in abdominal pain, anorexia, and abdominal tenderness.” For the rest of the question, the model provides accurate answers based on the available context.

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
# Assigning query to variable
Rquery3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
# Printing the response
print(generate_rag_response(Rquery3))

Both parts of the question, namely treatment and cause, are accurately explained by the model based on the information available in the PDF.

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
# Assigning query to variable
Rquery4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
# Printing the response
print(generate_rag_response(Rquery4))

The model is correctly retrieving the required answer for this query from the PDF context, providing appropriate recommendations for the treatment of a person suffering from brain tissue injury.

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
# Assigning query to variable
Rquery5 = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
# Printing the response
print(generate_rag_response(Rquery5))

The model is accurately explaining the precautions and treatment steps for a person with a fractured leg based on the information available in the PDF context.

## Inference:
When using RAG, the model’s responses are grounded in the specific context provided by the PDF, allowing it to give precise, accurate, and context-specific answers. Without RAG, the model generates responses based solely on its pre-trained knowledge, which may be less specific, potentially incomplete, or not aligned with the information in the PDF.

### Parameter Tuning

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
# Assigning query to variable
Rquery1a = "What is the protocol for managing sepsis in a critical care unit?"
# Printing the response after changing the temperature parameter
print(generate_rag_response(Rquery1a, temperature=0.5))

With the temperature raised to 0.5, the response becomes more conversational and less exact in structure, but its meaning remains correct.

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
# Assigning query to variable
Rquery2b = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
# Printing the response after changing the k parameter
print(generate_rag_response(Rquery2b, k=6))

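Increasing k to 6 leaves the response unchanged. The retriever was created with search_kwargs={'k': 2}, so it returns at most two chunks, and the slice inside generate_rag_response cannot add more; the combined context is also trimmed to 1,000 tokens. To test a larger k, the retriever itself would need to return more chunks.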
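Because the retriever in this notebook is built with `search_kwargs={'k':2}` and `generate_rag_response` only slices its output with `[:k]`, a toy stand-in shows why `k=3` and `k=6` yield the same context. The function `fixed_retriever` and the chunk labels are hypothetical placeholders for the real retriever.

```python
def fixed_retriever(query, limit=2):
    """Stand-in for a retriever configured to return at most `limit` chunks."""
    ranked_chunks = ["chunk A", "chunk B", "chunk C", "chunk D", "chunk E", "chunk F"]
    return ranked_chunks[:limit]

def context_for(query, k):
    """Mimic slicing the retriever output with [:k] after retrieval."""
    return fixed_retriever(query)[:k]

with_k3 = context_for("appendicitis", k=3)
with_k6 = context_for("appendicitis", k=6)
print(with_k3, with_k6)
```

Slicing can shorten a list but never lengthen it, so the number of chunks is bounded by what the retriever returns in the first place.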

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
# Assigning query to variable
Rquery3c = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
# Printing the response after changing the top_k parameter
print(generate_rag_response(Rquery3c, top_k=30))

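No change in the response is observed after reducing top_k from 50 to 30. Note that top_k here is a decoding parameter that limits how many candidate tokens are considered at each generation step; it does not control how many passages are retrieved. With temperature left at 0, the most probable token is expected to be chosen either way.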

### Query 4:  What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
# Assigning query to variable
Rquery4d = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
# Printing the response after changing the top_p parameter
print(generate_rag_response(Rquery4d,top_p=0.8))

Lowering top_p from 0.95 to 0.8 made the model more conservative, producing a slightly more precise but less varied response.

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
# Assigning query to variable
Rquery5e = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
# Printing the response after changing the max_tokens parameter
print(generate_rag_response(Rquery5e, max_tokens=1024))

Increasing max_tokens allowed the model to generate a much longer, more detailed continuation, adding explanations about complications, hygiene, early mobilization, and care instructions that were previously cut off.
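One way to detect that an answer was cut off by `max_tokens` is to inspect the completion's finish reason. The sketch below assumes the completion dictionary carries an OpenAI-style `finish_reason` field next to `text`, with `"length"` meaning the token limit was hit; this layout should be verified against the installed llama-cpp-python version. The sample dictionaries are invented.

```python
def was_truncated(completion):
    """Return True if the completion stopped because it hit the token limit."""
    # Assumption: choices[0] holds a 'finish_reason' of 'length' or 'stop'.
    return completion["choices"][0].get("finish_reason") == "length"

# Invented examples that mimic the assumed completion layout
cut_off = {"choices": [{"text": "Immobilize the leg and", "finish_reason": "length"}]}
complete = {"choices": [{"text": "Immobilize the leg.", "finish_reason": "stop"}]}

print(was_truncated(cut_off), was_truncated(complete))
```

A check like this could be used to retry with a larger `max_tokens` only when needed, rather than raising the limit for every query.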

## Output Evaluation

Let us now use the LLM-as-a-judge method to check the quality of the RAG system on two parameters - retrieval and generation. We illustrate this evaluation using the answers generated for the questions from the previous section.

- We are using the same Mistral model for evaluation, so here the LLM is effectively rating its own performance on the task.

In [None]:
# This system message is designed for a groundedness rater.
# The model or evaluator will be given a question, the context used by the AI system,
# and the AI-generated answer. The task is to judge how much the answer is supported by the context.

groundedness_rater_system_message = """

You will be presented a ###Question, ###Context used by the AI system and AI generated ###Answer.

Your task is to judge the extent to which the ###Answer is derived from ###Context.

Rate it 1 - if The ###Answer is not derived from the ###Context at all
Rate it 2 - if The ###Answer is derived from the ###Context only to a limited extent
Rate it 3 - if The ###Answer is derived from ###Context to a good extent
Rate it 4 - if The ###Answer is derived from ###Context mostly
Rate it 5 - if The ###Answer is derived from ###Context completely

Please note: Make sure you give a single overall rating in the range of 1 to 5 along with an overall explanation.

"""

In [None]:
# This system message is designed for a relevance rater.
# The model or evaluator will be given a question, the context used by the AI system,
# and the AI-generated answer. The task is to judge how relevant the answer is to the question.

relevance_rater_system_message = """

You will be presented with a ###Question, the ###Context used by the AI system to generate a response, and the AI-generated ###Answer.

Your task is to judge the extent to which the ###Answer is relevant to the ###Question, considering whether it directly addresses the key aspects of the ###Question based on the provided ###Context.

Rate the relevance as follows:
- Rate 1 – The ###Answer is not relevant to the ###Question at all.
- Rate 2 – The ###Answer is only slightly relevant to the ###Question, missing key aspects.
- Rate 3 – The ###Answer is moderately relevant, addressing some parts of the ###Question but leaving out important details.
- Rate 4 – The ###Answer is mostly relevant, covering key aspects but with minor gaps.
- Rate 5 – The ###Answer is fully relevant, directly answering all important aspects of the ###Question with appropriate details from the ###Context.

Note: Provide a single overall rating in the range of 1 to 5, along with a brief explanation of why you assigned that score.
"""

In [None]:
# This template formats the input for evaluation by a rater (groundedness or relevance).
# It structures the message by placing the question, context, and AI-generated answer
# under clearly labeled sections (###Question, ###Context, ###Answer).
# The placeholders {question}, {context}, and {answer} are replaced with the actual values
# before sending to the model or evaluator.

user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""

In [None]:
# This function generates a response from the model specifically for evaluating
# groundedness or relevance of an AI-generated answer.
def generate_ground_relevance_response(user_input, k=3, max_tokens=128, temperature=0, top_p=0.95, top_k=50):
    global qna_system_message, qna_user_message_template

    # Retrieve relevant document chunks using invoke()
    # (the k argument of this function is not used; the retriever's own k applies)
    relevant_document_chunks = retriever.invoke(user_input)
    context_list = [d.page_content for d in relevant_document_chunks]
    context_for_query = ". ".join(context_list)

    # Combine user_prompt and system_message to create the prompt
    prompt = f"""[INST]{qna_system_message}\n
                {'user'}: {qna_user_message_template.format(context=context_for_query, question=user_input)}
                [/INST]"""

    response = llm(
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            )

    answer = response["choices"][0]["text"]

    # Combine user_prompt and system_message to create the prompt
    groundedness_prompt = f"""[INST]{groundedness_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    relevance_prompt = f"""[INST]{relevance_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    response_1 = llm(
            prompt=groundedness_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            )

    response_2 = llm(
            prompt=relevance_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            )

    return response_1['choices'][0]['text'], response_2['choices'][0]['text']

### Query 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
# Assigning variable to the query
Equery1 = "What is the protocol for managing sepsis in a critical care unit?"
# Printing the response
print(generate_ground_relevance_response(Equery1))

Both explanations give a rating of 5, agreeing that the answer is fully derived from the context. They note that the answer accurately reflects all key steps in managing sepsis in a critical care unit—such as prompt empiric antibiotics, culture-guided adjustments, and specific drug options—exactly as described in the provided context.
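Since the raters return free text, the numeric score has to be read out of each explanation. The sketch below pulls a 1 to 5 rating out of such text with a regular expression; `extract_rating` is a hypothetical helper, the sample sentences are invented, and real rater output may need a more forgiving pattern.

```python
import re

def extract_rating(rater_text):
    """Return the first 1-5 rating that follows a word like 'rating', 'rate', or 'score'."""
    match = re.search(
        r"\b(?:rating|rate|score)\D{0,20}([1-5])\b",
        rater_text,
        flags=re.IGNORECASE,
    )
    return int(match.group(1)) if match else None

# Invented rater outputs for illustration
sample_a = "Rating: 5. The answer is fully derived from the context."
sample_b = "I would rate it a 4 out of 5 because of minor gaps."
sample_c = "The answer looks reasonable."

print(extract_rating(sample_a), extract_rating(sample_b), extract_rating(sample_c))
```

Returning `None` when no rating is found makes it easy to flag outputs that need manual review.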

### Query 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
# Assigning variable to the query
Equery2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
# Printing the response
print(generate_ground_relevance_response(Equery2))

Both explanations agree that the response fully reflects the context, correctly identifying symptoms and surgical treatment, and they both give a rating of 5 because the answer is accurate, complete, and directly supported by the context.

### Query 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
# Assigning variable to the query
Equery3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
# Printing the response
print(generate_ground_relevance_response(Equery3))

Both explanations agree that the response fully addresses the question, correctly identifying alopecia areata as an autoimmune disorder and listing all relevant treatments, resulting in a rating of 5 for being complete and contextually accurate.

### Query 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
# Assigning variable to the query
Equery4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
# Printing the response
print(generate_ground_relevance_response(Equery4))

Both explanations agree that the response fully and accurately reflects the context, covering initial treatment, surgical interventions, early supportive care, and ongoing management for traumatic brain injury, resulting in a rating of 5 for completeness and relevance.

### Query 5: What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?

In [None]:
# Assigning variable to the query
Equery5 = "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
# Printing the response
print(generate_ground_relevance_response(Equery5))

The first explanation rates the answer 4/5 because it includes some extra information not explicitly in the context, while the second rates it 5/5 since it views all details—including early mobilization and cast care—as fully supported by the context. The discrepancy arises from whether additional inferred details are considered outside the context.
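The ratings reported for the five queries can be tabulated to summarize the evaluation. In the sketch below the tuples are (groundedness, relevance); treating the "first explanation" for Query 5 as the groundedness rating is an assumption based on the order in which the function returns its two results.

```python
from statistics import mean

# (groundedness, relevance) ratings as described in the observations above
ratings = {
    "Query 1": (5, 5),
    "Query 2": (5, 5),
    "Query 3": (5, 5),
    "Query 4": (5, 5),
    "Query 5": (4, 5),  # assumption: the 4 belongs to the groundedness rater
}

groundedness_mean = mean(g for g, _ in ratings.values())
relevance_mean = mean(r for _, r in ratings.values())
disagreements = [query for query, (g, r) in ratings.items() if g != r]

print(groundedness_mean, relevance_mean, disagreements)
```

With only five queries and a model rating its own answers, these averages are best read as a rough sanity check rather than a benchmark.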

## Actionable Insights

1. The AI system scores highly (ratings of 5 out of 5) on standard clinical queries but shows a slight drop (a rating of 4 from one of the two raters) for complex, multi-part, situational queries like the hiking fracture scenario.

2. The system excels at factual retrieval but struggles slightly with multi-phase, scenario-based questions requiring synthesized guidance.

3. The system is production-ready for standard clinical use, with a clear path to handling complex, scenario-based queries well through targeted enhancements.

## Business Recommendations

1. Healthcare centers can deploy the model in their facilities for emergency use once it reaches the production phase.

2. Centers can display confidence levels per answer section and allow natural follow-up questions so users can drill deeper into areas like post-fracture recovery or specific precautions.

3. Prioritize usage in urgent care centers and remote-consultation platforms, where complex, situational questions are common, and use their feedback to refine scenario-based responses before a wider rollout.