<a href="https://colab.research.google.com/github/EvagAIML/014-NLP-Model-v1/blob/main/Medical_Diagnostic_Tool_v16.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

PROTOTYPE
## Medical Diagnostic Assistant: AI-Powered Question Answering



---

## Executive Summary

This initiative delivers a clinically focused medical diagnostic support system designed to improve the reliability of AI-generated medical information. The solution employs a Retrieval-Augmented Generation (RAG) architecture that constrains all model outputs to content retrieved from a single authoritative reference source, The Merck Manual of Diagnosis & Therapy. By grounding responses in cited source material, the system addresses a key limitation of large language models in healthcare contexts: the risk of producing fluent but unverifiable or unsupported medical guidance.

The current implementation demonstrates measurable value. Compared to an ungrounded baseline language model, the RAG-based system consistently produces answers that are verifiable, traceable to source pages, and aligned with accepted clinical standards. Automated evaluation using a structured LLM-as-a-Judge framework reports average relevance and groundedness scores of approximately 4.8 out of 5. Collectively, these results indicate that the system can reduce information retrieval time, promote consistency in diagnostic reasoning, and serve as a reliable clinical decision-support aid rather than an unconstrained generative tool.

## Problem Statement

### Business Context
The healthcare industry is rapidly evolving, with professionals facing increasing challenges in managing vast volumes of medical data while delivering accurate and timely diagnoses. The need for quick access to comprehensive, reliable, and up-to-date medical knowledge is critical for improving patient outcomes and ensuring informed decision-making in a fast-paced environment.

Healthcare professionals often encounter information overload, struggling to sift through extensive research and data to create accurate diagnoses and treatment plans. This challenge is amplified by the need for efficiency, particularly in emergencies, where time-sensitive decisions are vital. Furthermore, access to trusted, current medical information from renowned manuals and research papers is essential for maintaining high standards of care.

### Objective
The task is to develop a **Retrieval-Augmented Generation (RAG)** solution using renowned medical manuals to address healthcare challenges. The objective is to **understand** issues like information overload, **apply** AI techniques to streamline decision-making, **analyze** the solution's impact on diagnostics and patient outcomes, **evaluate** its potential to standardize care practices, and **create** a functional prototype demonstrating its feasibility and effectiveness.

**Common Questions to Answer**

1. **Diagnostic Assistance**: "What are the common symptoms and treatments for pulmonary embolism?"
2. **Surgical Inquiry**: "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
3. **Drug Information**: "Can you provide the trade names of medications used for treating hypertension?"
4. **Treatment Plans**: "What are the first-line options and alternatives for managing rheumatoid arthritis?"
5. **Dermatology**: "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
6. **Neurology**: "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
7. **Emergency Care**: "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
8. **Specialty Knowledge**: "What are the diagnostic steps for suspected endocrine disorders?"
9. **Critical Care Protocols**: "What is the protocol for managing sepsis in a critical care unit?"

### Data Description
The **Merck Manuals** are medical references published by the American pharmaceutical company Merck & Co. that cover a wide range of medical topics, including disorders, tests, diagnoses, and drugs.

The manual is provided as a PDF: `014-NLP-PROJ-medical_diagnosis_manual_19.pdf`

## Environment Initialization

### Environment Setup and Dependency Configuration

**Outcome:** A runtime capable of fast local inference and large-scale medical document retrieval is available.

**Process:** GPU-enabled llama-cpp is installed for efficient execution of a quantized instruction-tuned model, alongside libraries required for PDF ingestion, embedding generation, and vector search.

In [None]:
# ============================================================
# Installation for GPU llama-cpp-python
# ============================================================
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.28 --force-reinstall --no-cache-dir -q 2>/dev/null

# Installing other dependencies
!pip install huggingface_hub==0.35.3 pandas==2.2.2 tiktoken==0.12.0 pymupdf==1.26.5 langchain==0.3.27 langchain-community==0.3.31 chromadb==1.1.1 sentence-transformers==5.1.1 numpy==1.26.4 -q 2>/dev/null

**Note:** Please restart the runtime after installation.

### Core Imports and Shared Configuration

**Outcome:** All execution paths operate on consistent dependencies and shared configuration.

**Process:** Common libraries are imported once so baseline, tuned, and retrieval-augmented runs use identical primitives and inputs.

In [None]:
# ============================================================
# Core Imports and Shared Configuration
# ============================================================

# Dependencies
import os
import pandas as pd
from huggingface_hub import hf_hub_download
from llama_cpp import Llama
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

# Global Qs
questions = [
    "What are the common symptoms and treatments for pulmonary embolism?",
    "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?",
    "Can you provide the trade names of medications used for treating hypertension?",
    "What are the first-line options and alternatives for managing rheumatoid arthritis?",
    "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?",
    "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?",
    "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?",
    "What are the diagnostic steps for suspected endocrine disorders?",
    "What is the protocol for managing sepsis in a critical care unit?"
]

## Baseline Medical Question Answering (Ungrounded)

### Baseline Model Initialization

**Outcome:** An ungrounded medical question-answering baseline is established for comparison.

**Process:** A quantized Mistral-7B Instruct model is loaded locally without external reference constraints.

In [None]:
# ============================================================
# Load and Initialize LLM
# ============================================================


# Load LLM
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
model_basename = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"

model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename
)

# Initialize LLM
llm = Llama(
    model_path=model_path,
    n_ctx=4096,        # Context window
    n_gpu_layers=-1,   # Offload all layers to the GPU (suitable for a T4)
    n_batch=512,
    verbose=True
)


### Inference Wrapper

**Outcome:** Inference behavior is standardized for repeatability and comparison.

**Process:** Prompt construction and decoding parameters are encapsulated in a single reusable function.

In [None]:
# ============================================================
# Inference Wrapper
# ============================================================

def generate_response(query, max_tokens=2048, temperature=0.1, top_p=0.95, top_k=50, system_prompt=None):
    if system_prompt:
        prompt = f"[INST] {system_prompt}\n{query} [/INST]"
    else:
        prompt = f"[INST] {query} [/INST]"

    model_output = llm(
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k
    )
    return model_output['choices'][0]['text'].strip()

### Baseline Question Execution

**Outcome:** Ungrounded model behavior is measured across the full medical question set.

**Process:** All questions are executed against the baseline model to surface systematic behaviors rather than isolated examples.

In [None]:
# ============================================================
# Baseline Question Execution
# ============================================================

vanilla_results = []
for i, q in enumerate(questions, 1):
    print(f"--- Question {i} ---")
    print(f"Q: {q}")
    ans = generate_response(q)
    print(f"A: {ans}\n")
    vanilla_results.append(ans)

### Baseline Analysis

Ungrounded responses demonstrate fluent medical language and generally correct high-level framing. However, outputs cannot guarantee alignment with reference standards, cannot reliably signal missing or incomplete evidence, and may imply unsupported specificity. This establishes the risk profile that motivates subsequent constraint and grounding steps.

## Prompt Engineering and Decoding Control (Parameter Tuning)

### Prompt Configuration Evaluation

**Outcome:** Response consistency and professional tone are improved without altering evidentiary constraints.

**Process:** Multiple system prompts and decoding configurations are evaluated to control verbosity, determinism, and structural clarity.

In [None]:
# ============================================================
# Prompt Configuration Evaluation
# ============================================================

# Define 5 Configurations

configs = [
    {"name": "High Creativity", "temp": 0.9, "sys": "You are a helpful medical assistant.", "top_k": 100},
    {"name": "Strict Professional", "temp": 0.0, "sys": "You are a concise, professional doctor. Answer only what is asked.", "top_k": 10},
    {"name": "Chain of Thought", "temp": 0.3, "sys": "Think step-by-step. Explain the reasoning before giving the final answer.", "top_k": 40},
    {"name": "ELI5", "temp": 0.7, "sys": "Explain like I am 5 years old.", "top_k": 50},
    {"name": "Balanced", "temp": 0.4, "sys": "You are a knowledgeable assistant. Provide detailed clinical information.", "top_k": 50}
]

# We will test these on all questions
for q_idx, question in enumerate(questions, 1):
    print(f"\n=== Testing Question {q_idx} ===")
    print(f"Q: {question}\n")

    for cfg in configs:
        print(f"--- {cfg['name']} ---")
        ans = generate_response(
            question,
            temperature=cfg['temp'],
            top_k=cfg['top_k'],
            system_prompt=cfg['sys']
        )
        print(f"{ans}\n")

### Prompting Analysis

Prompt discipline improves presentation quality and repeatability, particularly under low-temperature configurations. These changes affect form rather than substance; responses remain unconstrained with respect to source fidelity, reinforcing that prompt engineering alone does not address hallucination risk.
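
The repeatability claim above can be checked rather than judged by eye. The sketch below is one possible approach, not part of the original notebook: `count_distinct_outputs` is a hypothetical helper that calls a generation function several times with the same query and counts the unique answers. The two stub generators are stand-ins (assumptions) so the example runs without loading the model.

```python
import itertools


def count_distinct_outputs(generate_fn, query, runs=3):
    """Call generate_fn(query) `runs` times and count the unique answers."""
    outputs = [generate_fn(query) for _ in range(runs)]
    return len(set(outputs))


def deterministic_stub(query):
    """Stand-in for a low-temperature configuration: same text every call."""
    return f"Answer to: {query}"


def make_varying_stub():
    """Stand-in for a high-temperature configuration: new text every call."""
    counter = itertools.count()
    return lambda query: f"Variant {next(counter)}: {query}"


print(count_distinct_outputs(deterministic_stub, "What is sepsis?"))   # 1
print(count_distinct_outputs(make_varying_stub(), "What is sepsis?"))  # 3
```

In the notebook, the stub could be replaced with something like `lambda q: generate_response(q, temperature=cfg['temp'], top_k=cfg['top_k'], system_prompt=cfg['sys'])`. A count of 1 would suggest the configuration is repeatable for that query, although identical wording says nothing about whether the content is correct.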

## Reference Preparation for Retrieval-Augmented Generation

### Reference Document Ingestion

**Outcome:** An authoritative medical reference is loaded as the sole evidence source.

**Process:** The Merck Manual PDF is ingested and validated for downstream processing.

In [None]:
# ============================================================
# Reference Document Ingestion
# ============================================================

# Dependencies
import requests
import os
from langchain_community.document_loaders import PyMuPDFLoader

doc_path = "014-NLP-PROJ-medical_diagnosis_manual_19.pdf"
github_url = "https://github.com/EvagAIML/014-NLP-Model-v1/blob/main/014-NLP-PROJ-medical_diagnosis_manual_19.pdf?raw=true"

# Download the PDF file from GitHub
if not os.path.exists(doc_path):
    print(f"Downloading {doc_path} from GitHub...")
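    try:
        # A timeout prevents the cell from hanging indefinitely on a stalled connection
        response = requests.get(github_url, stream=True, timeout=60)
        response.raise_for_status() # Raise an HTTPError for bad responses (4xx or 5xx)
        with open(doc_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print("Download complete.")
    except requests.exceptions.RequestException as e:
        # Remove any partially written file so the existence check below does not load a truncated PDF
        if os.path.exists(doc_path):
            os.remove(doc_path)
        print(f"Error downloading file from GitHub: {e}")
        print("Please ensure the URL is correct and accessible.")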

if os.path.exists(doc_path):
    loader = PyMuPDFLoader(doc_path)
    docs = loader.load()
    print(f"Loaded {len(docs)} pages.")

    # Check first 5 pages
    for i in range(min(5, len(docs))):
        print(f"--- Page {i+1} ---")
        print(docs[i].page_content[:500])
        print("...")
else:
    print(f"File {doc_path} not found even after attempted download. Please ensure it is in the working directory or the download path is correct.")

### Reference Text Chunking

**Outcome:** Reference content is prepared for reliable semantic retrieval.

**Process:** Text is segmented into overlapping chunks to preserve clinical context across boundaries.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(docs)
print(f"Total Chunks created: {len(chunks)}")
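
To make the effect of `chunk_size` and `chunk_overlap` concrete, the sketch below uses a plain fixed-width sliding window on a short string. This is a deliberate simplification and an assumption for illustration: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, so its real chunk boundaries will differ. The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each chunk repeats the last four characters of its predecessor. The same idea, at the scale of 200 characters out of 1000, is what keeps a sentence that straddles a boundary retrievable from at least one chunk.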

### Vector Index Construction

**Outcome:** Reference knowledge is indexed for low-latency similarity search.

**Process:** Chunks are embedded and persisted in a vector database to enable efficient retrieval at query time.

In [None]:
# ============================================================
# Embedding & Vector Store
# ============================================================

from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings

embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./chroma_db_medical_final"
)

# Default Retriever
retriever = vector_db.as_retriever(search_kwargs={"k": 3})

### Retrieval Preparation Analysis

Retrieval quality becomes the dominant factor in answer reliability. Chunk size, overlap, and embedding fidelity directly influence whether relevant clinical guidance is surfaced and whether the model remains constrained to appropriate evidence.
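
As a rough illustration of what similarity search does at query time, the sketch below ranks a few toy vectors against a query vector by cosine similarity and returns the top `k` labels. The three-dimensional vectors and labels are invented for the example, and cosine similarity is an assumption here: the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the labels of the k indexed vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


# Toy "embeddings" (illustrative values only, not real model output)
toy_index = [
    ("chunk about pulmonary embolism", [0.9, 0.1, 0.0]),
    ("chunk about appendicitis", [0.1, 0.9, 0.0]),
    ("chunk about hypertension", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]

print(top_k(query, toy_index, k=2))
# ['chunk about pulmonary embolism', 'chunk about appendicitis']
```

With real embeddings the vectors have hundreds of dimensions, but the ranking step is conceptually the same, which is why embedding quality and `k` have such a direct effect on which evidence the model sees.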

## Retrieval-Augmented Medical Question Answering

### RAG Answer Generation

**Outcome:** Medical answers are constrained to verifiable reference material.

**Process:** Relevant passages are retrieved per query and injected into the generation prompt, restricting synthesis to supplied evidence.

In [None]:
# ============================================================
# RAG Generation Function
# ============================================================

def generate_rag_answer(query, k=3, temperature=0.1, max_tokens=2048):
    # Retrieve
    retrieved_docs = vector_db.similarity_search(query, k=k)

    # Format context and extract sources
    context_list = []
    sources_info = []
    for d in retrieved_docs:
        page_num = d.metadata.get('page', 'Unknown')
        # Assuming page is 0-indexed integer from PyMuPDF
        if isinstance(page_num, int):
            page_num += 1
        src_str = f"Page {page_num}"
        sources_info.append(src_str)
        context_list.append(f"[{src_str}] {d.page_content}")

    context = "\n\n".join(context_list)
    unique_sources = ", ".join(sorted(list(set(sources_info))))

    # Construct Prompt
    sys_msg = "You are a professional medical assistant for healthcare professionals. Use the following Context to answer the Question. If the answer is not in the context, say so. Do not advise consulting a healthcare professional."
    user_msg = f"Context:\n{context}\n\nQuestion: {query}\n\nPlease cite specific Page numbers and Sections from the text in your answer."

    prompt = f"[INST] {sys_msg}\n{user_msg} [/INST]"

    # Generate
    output = llm(
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        stop=["[/INST]"]
    )
    return output['choices'][0]['text'].strip(), context
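
Because `k` controls how much retrieved text is injected into the prompt, and the model was loaded with `n_ctx=4096`, it may be worth estimating whether the prompt plus the `max_tokens` generation budget fits in the context window. The sketch below is a rough heuristic only: `fits_context_window` is a hypothetical helper, and the figure of four characters per token is a common rule of thumb assumed for illustration, not a property of the Mistral tokenizer.

```python
def fits_context_window(prompt, max_tokens, n_ctx=4096, chars_per_token=4):
    """Rough check that estimated prompt tokens plus the generation budget fit in n_ctx."""
    # Ceiling division: a partial token still occupies a full slot
    estimated_prompt_tokens = -(-len(prompt) // chars_per_token)
    return estimated_prompt_tokens + max_tokens <= n_ctx


short_prompt = "x" * 4000   # about 1000 estimated tokens
long_prompt = "x" * 12000   # about 3000 estimated tokens

print(fits_context_window(short_prompt, max_tokens=2048))  # True
print(fits_context_window(long_prompt, max_tokens=2048))   # False
```

For an exact count, the loaded model's own tokenizer would be the better source; llama-cpp-python exposes a `tokenize` method on `Llama` that could be used for this, though that usage is not shown in the original notebook.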

In [None]:
# ============================================================
# RAG Parameter Tuning (5 Combinations)
# ============================================================

rag_configs = [
    {"name": "C1: Low Context, Precise", "k": 2, "temp": 0.0},
    {"name": "C2: Standard", "k": 3, "temp": 0.1},
    {"name": "C3: High Context", "k": 5, "temp": 0.1},
    {"name": "C4: Creative", "k": 3, "temp": 0.7},
    {"name": "C5: Max Context, Strict", "k": 7, "temp": 0.0}
]

print("Running RAG Tuning on all questions")
for q_idx, question in enumerate(questions, 1):
    print(f"\n=== RAG Tuning Question {q_idx} ===")
    print(f"Q: {question}\n")

    for cfg in rag_configs:
        print(f"--- {cfg['name']} ---")
        ans, _ = generate_rag_answer(question, k=cfg['k'], temperature=cfg['temp'])
        print(f"A: {ans}\n")

### Grounded Question Execution

**Outcome:** Evidence-backed answers and supporting context are produced for all queries to determine the best `k` and temperature values.

**Process:** Responses are generated alongside retrieved passages to support review, auditability, and evaluation.

### Final RAG Answers (Best Configuration: k=3, temp=0.1)

In [None]:
# ============================================================
# Final RAG Configuration -> Answers
# ============================================================

rag_responses = []
rag_contexts = []

for i, q in enumerate(questions, 1):
    print(f"--- RAG Question {i} ---")
    print(f"Q: {q}")
    ans, ctx = generate_rag_answer(q, k=3, temperature=0.1)
    print(f"A: {ans}\n")
    rag_responses.append(ans)
    rag_contexts.append(ctx)

### RAG Analysis

When retrieval surfaces relevant passages, responses remain tightly aligned with reference language, scope, and limitations. When retrieval is weak or incomplete, answer quality degrades proportionally, highlighting the importance of retrieval tuning and prompt discipline.

Reason for selecting k=3, temp=0.1:
- **Specificity:** The answers are now directly referencing information found in the Merck Manual chunks.
- **Reduced Hallucination:** When asked about trade names (Q3), if the chunks do not contain them, the model is more likely to stick to what is present (generic names) or state limitations, rather than inventing names.
- **Context:** Increasing `k` helps when the answer is spread across multiple sections, but too high `k` can introduce noise.

## Automated Output Evaluation (LLM as the judge)

### Response Scoring

**Outcome:** Answer quality is measured objectively and at scale for groundedness and relevance (1-5).

**Process:** Using structured LLM prompts, responses are scored for groundedness and relevance, with machine-parseable output that explains the reason for each score.
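
The machine-parseable convention described above, a final `SCORE: X` line, can be exercised in isolation before running the judge. The sketch below is a minimal, stricter variant written for illustration: `extract_score` is a hypothetical helper that returns `None` when no explicit score line is found instead of guessing from other digits in the text. The sample judge outputs are invented.

```python
import re


def extract_score(judge_output):
    """Return the last 'SCORE: X' value (1-5) in the text, or None if absent."""
    matches = re.findall(r"SCORE:\s*([1-5])\b", judge_output, re.IGNORECASE)
    return int(matches[-1]) if matches else None


sample = "Step 1: compare claims to context.\nAll claims are supported.\nSCORE: 5"
print(extract_score(sample))                         # 5
print(extract_score("The answer is mostly fine."))   # None
```

Returning `None` makes a formatting failure visible to the caller, so it can be retried or excluded rather than silently counted as a score.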

In [None]:
# ============================================================
# Response Evaluation (Refined 1-5 Scoring with evidence)
# ============================================================

import re

groundedness_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
The answer should be derived only from the information presented in the context

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.
5. IMPORTANT: End your response with a single line: SCORE: X (where X is 1, 2, 3, 4, or 5)
"""

relevance_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
Relevance measures how well the answer addresses the main aspects of the question, based on the context.
Consider whether all and only the important aspects are contained in the answer when evaluating relevance.

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.
5. IMPORTANT: End your response with a single line: SCORE: X (where X is 1, 2, 3, 4, or 5)
"""

def evaluate_response(query, answer, context):
    # Groundedness Prompt
    ground_prompt = f"[INST] {groundedness_rater_system_message}\n\n###Question\n{query}\n###Context\n{context}\n###Answer\n{answer}\n[/INST]"

    # Use Main LLM (Mistral)
    ground_out = llm(ground_prompt, max_tokens=512, temperature=0.1)['choices'][0]['text'].strip()

    # Relevance Prompt
    rel_prompt = f"[INST] {relevance_rater_system_message}\n\n###Question\n{query}\n###Context\n{context}\n###Answer\n{answer}\n[/INST]"

    # Use Main LLM (Mistral)
    rel_out = llm(rel_prompt, max_tokens=512, temperature=0.1)['choices'][0]['text'].strip()

    # Parse Scores using Regex (Prioritize explicit format)
    def parse_score(text):
        # Try strict format first: SCORE: 5
        strict_matches = re.findall(r'SCORE:\s*([1-5])', text, re.IGNORECASE)
        if strict_matches:
            return int(strict_matches[-1])

        # Fallback: Look for last digit 1-5
        matches = re.findall(r'\b([1-5])\b', text)
        if matches:
            return int(matches[-1])
        return 0 # Error or not found

    ground_score = parse_score(ground_out)
    rel_score = parse_score(rel_out)

    return ground_score, ground_out, rel_score, rel_out

print("--- Evaluation Results ---\n")
for i, (q, a, c) in enumerate(zip(questions, rag_responses, rag_contexts), 1):
    gs, ge, rs, re_text = evaluate_response(q, a, c)
    print(f"Q{i}: {q}")
    print(f"Groundedness Score: {gs}/5")
    print(f"Reason: {ge}")
    print(f"Relevance Score: {rs}/5")
    print(f"Reason: {re_text}")
    print("-" * 50)
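
The loop above prints per-question scores but does not aggregate them. One possible way to produce averages is sketched below. It is an assumption-laden example rather than the notebook's method: `summarize_scores` is a hypothetical helper that treats any value outside 1-5 (such as the `0` returned on a parse failure) as invalid and excludes it from the mean, and the input list holds illustrative values only, not results from this notebook.

```python
from statistics import mean


def summarize_scores(scores):
    """Average 1-5 judge scores, excluding values that signal a parse failure."""
    valid = [s for s in scores if 1 <= s <= 5]
    return {
        "mean": round(mean(valid), 2) if valid else None,
        "n_valid": len(valid),
        "n_failed": len(scores) - len(valid),
    }


# Illustrative values only, not results from the notebook
example_scores = [5, 4, 0, 5]
print(summarize_scores(example_scores))
# {'mean': 4.67, 'n_valid': 3, 'n_failed': 1}
```

In the notebook, the `gs` and `rs` values from the evaluation loop could be appended to two lists and passed to this helper separately. Reporting the failure count alongside the mean keeps unparsed judge outputs from silently lowering or inflating the average.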


## Summary: Business value and deployment strategy defined

### Next Steps: Model and Data Enhancements

**Broaden the reference corpus**:
Incorporate additional peer-reviewed medical literature and specialty-specific guidelines to expand clinical coverage and reduce dependency on a single source.

**Improve retrieval fidelity**:
Further optimize chunking strategy, embedding selection, and top-k retrieval parameters to ensure critical clinical context is consistently surfaced, particularly for complex or cross-sectional queries.

**Adopt domain-specialized embeddings**:
Evaluate medical-domain embedding models to improve semantic accuracy for nuanced clinical terminology.

**Strengthen evaluation rigor**:
Expand automated and human-in-the-loop evaluation using real-world clinical queries to monitor performance over time and identify degradation or bias.

**Productization and System Readiness**

**Develop a front-end dashboard**:
Implement a secure, user-facing interface that enables clinicians and reviewers to:

- Submit queries
- View generated answers alongside cited source excerpts
- Inspect relevance and groundedness scores

This capability is essential for transparency, trust, and executive oversight.

**Enable auditability and governance**:
Persist query logs, retrieved references, and model outputs to support compliance, internal review, and regulatory readiness.

**Optimize deployment architecture**:
Assess inference performance, cost, and scalability trade-offs to support broader adoption across clinical environments.

**Strategic and Business Considerations**

**Pilot deployment**:
Introduce the system to a limited cohort of healthcare professionals to validate usability, workflow impact, and time savings.

**Clinical domain expansion**:
Extend the solution into high-impact specialties (e.g., emergency medicine, critical care) where rapid access to grounded information is especially valuable.

**Clear positioning and risk management**:
Maintain explicit positioning as a clinical decision-support tool, reinforcing that it augments—not replaces—professional medical judgment.

### Expanded Executive Summary

Healthcare organizations increasingly seek to leverage AI to improve efficiency and decision-making; however, the use of unconstrained generative models in clinical settings introduces unacceptable risk. This project directly addresses that concern by implementing a system architecture that prioritizes traceability, evidentiary grounding, and professional rigor. Rather than generating responses solely from a model’s internal knowledge, the system retrieves relevant passages from a trusted medical reference and restricts generation to that retrieved content.

From a technical perspective, the system processes more than 4,000 pages of structured medical content, converts them into semantically indexed representations, and retrieves the most relevant sections for each query. These sections are incorporated into carefully controlled prompts that enforce a professional medical tone and explicit citation behavior. This design significantly reduces hallucination risk and ensures that outputs remain aligned with established clinical guidance.

The system’s value is already evident. Automated evaluations consistently confirm that responses are both relevant to the question posed and grounded in source material. This translates into practical benefits for healthcare professionals, including reduced manual search effort, more standardized diagnostic information, and increased confidence in AI-assisted insights.

Looking ahead, the platform provides a strong foundation for scalable clinical decision support. By expanding reference sources, improving retrieval precision, and introducing a transparent front-end interface, the system can evolve into a production-grade solution aligned with organizational governance requirements. Importantly, the architecture supports a balanced approach to innovation—enabling leadership to capture the benefits of AI while maintaining control over accuracy, risk, and accountability.