# Medical Chatbot- G Version
This pipeline answers questions strictly from a provided PDF or text file. It loads the document, counts the tokens in the query plus the document text, and compresses the text via LLM summarisation if the total exceeds the model’s context limit. An input guardrail blocks unsafe or disallowed queries before the question is sent to the model. A strict prompt instructs the LLM to answer only from the document or reply “Not found in document.” The output guardrail then checks numeric grounding, token overlap, and entity presence to reduce hallucinations. Modular functions, logging, and error handling are intended to keep it safe and maintainable for enterprise‑level document‑based QA.

![Workflow](workflow.jpg)

## Importing the Packages

In [1]:
import os
import sys
import re
import math
import time
import json
import logging
from typing import Tuple, Optional, List
import tiktoken
from PyPDF2 import PdfReader
from openai import OpenAI
from dotenv import load_dotenv

## Logging Configuration

A logger records messages about what the program is doing while it runs, like a running diary for the code.

In our case, it is configured to:

- capture events (info, warnings, errors) from the “Medical Chatbot G‑Version” pipeline;
- format them with a timestamp, severity level, and message so they are easy to read;
- output them to the console in real time via `StreamHandler`;
- avoid duplicate output by adding a handler only if none exists yet.

In [2]:
logger = logging.getLogger("Medical Chatbot G-Version")
if not logger.handlers:
    logger.setLevel(logging.INFO)
    ch = logging.StreamHandler(stream=sys.stdout)
    ch.setLevel(logging.INFO)
    formatter = logging.Formatter("[%(levelname)s] %(asctime)s - %(name)s - %(message)s")
    ch.setFormatter(formatter)
    logger.addHandler(ch)

In [3]:
load_dotenv(override=True)

True

## Configuration

In [4]:
# You can adjust the model and context window here. gpt-4o-mini supports large contexts (up to ~128k).
OPENAI_MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")
MODEL_CONTEXT_TOKENS = int(os.environ.get("MODEL_CONTEXT_TOKENS", "128000"))  # Full gpt-4o-mini window; lower it to leave headroom for the prompt and reply
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", None)

if not OPENAI_API_KEY:
    raise EnvironmentError(
        "OPENAI_API_KEY is not set. Please set it in your environment to use the LLM."
    )

In [5]:
# Initialize OpenAI client once
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

## Tokenization Utilities
This sets up token counting for your LLM pipeline.

`get_encoder(model_name)` tries to load a tiktoken encoder optimised for the given model (e.g., gpt-4). If the model isn’t recognised, it logs a warning and falls back to the generic "cl100k_base" encoding.

`ENCODER` stores the chosen encoder for reuse.

`count_tokens(text)` then uses this encoder to convert the input string into tokens and returns the token count. This is essential for checking whether your document + query fit within the model’s context window before sending them to the LLM.

In [6]:
def get_encoder(model_name: str):
    """
    Get a tiktoken encoder for the provided model.
    Falls back to 'cl100k_base' if the specific model mapping is unknown.
    """
    try:
        return tiktoken.encoding_for_model(model_name)
    except KeyError:
        logger.warning("Unknown model for tiktoken; falling back to cl100k_base.")
        return tiktoken.get_encoding("cl100k_base")

ENCODER = get_encoder(OPENAI_MODEL)

def count_tokens(text: str) -> int:
    """Count tokens in a string using tiktoken."""
    return len(ENCODER.encode(text))

## Load the Document

In [7]:
def load_document(file_path: str) -> str:
    """
    Load a document from a .pdf or .txt file and return its full text content.
    - For PDFs, it extracts text page-by-page.
    - For text files, it reads directly.
    Raises:
        FileNotFoundError, ValueError, Exception
    """
    if not os.path.exists(file_path):
        logger.error(f"File not found: {file_path}")
        raise FileNotFoundError(f"File not found: {file_path}")

    ext = os.path.splitext(file_path)[1].lower()
    try:
        if ext == ".pdf":
            logger.info(f"Loading PDF: {file_path}")
            reader = PdfReader(file_path)
            pages_text = []
            for i, page in enumerate(reader.pages):
                try:
                    page_text = page.extract_text() or ""
                except Exception as e:
                    logger.warning(f"Failed to extract text from page {i}: {e}")
                    page_text = ""
                pages_text.append(page_text)
            document_text = "\n".join(pages_text).strip()
        elif ext in [".txt", ".md"]:
            logger.info(f"Loading text file: {file_path}")
            with open(file_path, "r", encoding="utf-8") as f:
                document_text = f.read().strip()
        else:
            raise ValueError("Unsupported file type. Only .pdf and .txt/.md are supported.")
    except Exception:
        # Log with traceback, then re-raise the original exception unchanged
        logger.exception("Error while loading document.")
        raise

    if not document_text:
        raise ValueError("Document appears to be empty after extraction.")
    return document_text

## Token Budget Check

This function checks whether the query plus the document text fits inside the model’s context window.

It works by:

- counting tokens in the query and document using `count_tokens()`;
- adding them to get `total_tokens`;
- comparing `total_tokens` to `budget_tokens` (the model’s max context size) to set `fits` to `True` or `False`;
- logging the result for debugging and monitoring.

It returns a tuple:

- `fits`: whether the pair is within budget
- `total_tokens`: the actual combined token count

This helps avoid sending more text than the model can handle, which would otherwise cause truncation or API errors.

In [8]:
def check_token_budget(query: str, document_text: str, budget_tokens: int) -> Tuple[bool, int]:
    """
    Return whether query + document_text fits within the model's context window,
    and the total tokens used for that pair.
    """
    total_tokens = count_tokens(query) + count_tokens(document_text)
    fits = total_tokens <= budget_tokens
    logger.info(f"Token check: total={total_tokens}, budget={budget_tokens}, fits={fits}")
    return fits, total_tokens
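
The check above counts only the query and the document, while the instruction text in the prompt and the model’s reply also consume context. The following is a minimal sketch of a budget check that reserves headroom for both. The names `fits_with_headroom`, `prompt_overhead`, and `reply_reserve` are hypothetical, the reserve sizes are arbitrary, and a whitespace word count stands in for `count_tokens()` so the example runs without tiktoken.

```python
from typing import Callable, Tuple


def approx_count(text: str) -> int:
    """Stand-in counter (assumption): whitespace-separated words, not real tokens."""
    return len(text.split())


def fits_with_headroom(
    query: str,
    document_text: str,
    budget_tokens: int,
    prompt_overhead: int = 100,  # hypothetical allowance for instructions and separators
    reply_reserve: int = 500,    # hypothetical allowance for the model's answer
    counter: Callable[[str], int] = approx_count,
) -> Tuple[bool, int]:
    """Return (fits, total), where total includes the reserved headroom."""
    total = counter(query) + counter(document_text) + prompt_overhead + reply_reserve
    return total <= budget_tokens, total


doc = "remdesivir 100 mg IV daily " * 100  # 500 words
fits_small, total_small = fits_with_headroom("What is the dose?", doc, budget_tokens=1000)
fits_large, total_large = fits_with_headroom("What is the dose?", doc, budget_tokens=2000)
print(fits_small, total_small)  # False 1104
print(fits_large, total_large)  # True 1104
```

In the pipeline itself, passing `counter=count_tokens` would swap the stand-in for the real tokenizer.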

## Input Guardrail

In [9]:
def apply_input_guardrail(query: str, document_text: str) -> Tuple[bool, Optional[str]]:
    """
    Validate the input for safety and basic relevance. Returns:
      - allowed: bool
      - message: None if allowed, else safe explanatory message

    Rules:
      - Block explicit requests for harmful instructions (e.g., how to create biological agents, weapons).
      - Block requests unrelated to the document if they also ask for dangerous/illegal guidance.
      - Allow neutral/medical queries relevant to the document.
    """
    q_lower = query.lower()

    # Disallowed content patterns (expand as needed)
    disallowed_patterns = [
        r"make\s+(a|an)\s+(bomb|weapon|explosive)",
        r"how\s+to\s+(manufacture|create)\s+(biological|chemical)\s+(weapon|agent)",
        r"bypass\s+(security|authentication)",
        r"exploit\s+(a|the)\s+vulnerability",
        r"harm\s+(someone|people)",
        r"kill\s+(someone|people)",
    ]
    for pat in disallowed_patterns:
        if re.search(pat, q_lower):
            logger.warning("Input guardrail: disallowed/harmful request detected.")
            return False, "Your request cannot be processed because it seeks unsafe or disallowed information."

    # Basic irrelevance detection (soft): If the query appears entirely unrelated to the document's domain,
    # we still allow but the pipeline will likely yield "Not found in document." We only block if it's both
    # irrelevant and seeking risky info (already handled). So we proceed.

    # Medical safety: Disclaimers are handled by answering strictly from the document.
    return True, None
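
To see what the pattern list does and does not catch, the sketch below runs a subset of the same regular expressions over a few sample queries. The helper name `is_blocked` and the sample queries are illustrative assumptions. The last sample shows that a paraphrase outside the listed patterns passes through, which is why the list is marked "expand as needed".

```python
import re

# Subset of the patterns used in apply_input_guardrail
PATTERNS = [
    r"make\s+(a|an)\s+(bomb|weapon|explosive)",
    r"bypass\s+(security|authentication)",
    r"kill\s+(someone|people)",
]


def is_blocked(query: str) -> bool:
    """Return True if any disallowed pattern matches the lowercased query."""
    q = query.lower()
    return any(re.search(p, q) for p in PATTERNS)


samples = [
    "Give me instructions to kill someone",         # matches the third pattern
    "How do I MAKE   a bomb?",                      # case and extra whitespace are tolerated
    "What is the maintenance dose of remdesivir?",  # ordinary medical question
    "How could I kill a person?",                   # paraphrase that no pattern covers
]
results = {q: is_blocked(q) for q in samples}
for q, blocked in results.items():
    print(f"{blocked!s:5}  {q}")
```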

## Helper: Chunking
This function splits long text into token‑bounded chunks so they fit within an LLM’s context window.

It first encodes the text into tokens using the global ENCODER. It calculates an overlap — the smaller of the given overlap or 10% of max_tokens_per_chunk — to preserve context between chunks. In a loop, it slices the token list from start to end (capped at max_len), decodes that slice back into text, and appends it to chunks. If the end of the token list is reached, it stops; otherwise, it moves start forward but steps back by overlap tokens to create the overlap.

This ensures each chunk is small enough for the model but still retains continuity for summarisation or sequential processing.

In [10]:
def chunk_text(text: str, max_tokens_per_chunk: int, overlap: int = 100) -> List[str]:
    """
    Chunk text into token-bounded segments with overlap to preserve context for summarization.
    """
    tokens = ENCODER.encode(text)
    chunks = []
    start = 0
    max_len = max_tokens_per_chunk
    ov = min(overlap, max(0, max_len // 10))

    while start < len(tokens):
        end = min(start + max_len, len(tokens))
        chunk = ENCODER.decode(tokens[start:end])
        chunks.append(chunk)
        if end == len(tokens):
            break
        start = end - ov  # overlap

    return chunks
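
The windowing is easier to follow on a small input. The sketch below mirrors the loop in `chunk_text` but works on a plain list of integers standing in for token ids, so it needs no encoder. The function name `chunk_tokens` and the `max_len <= 0` guard are additions for illustration, not part of the original function.

```python
from typing import List


def chunk_tokens(tokens: List[int], max_len: int, overlap: int = 100) -> List[List[int]]:
    """Mirror of the windowing logic in chunk_text, operating on token ids directly."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")  # added guard: avoids an endless loop
    ov = min(overlap, max(0, max_len // 10))  # overlap is capped at 10% of the window
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + max_len, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break
        start = end - ov  # step back so neighbouring chunks share ov tokens
    return chunks


tokens = list(range(25))  # stand-in for 25 token ids
chunks = chunk_tokens(tokens, max_len=10, overlap=100)
spans = [(c[0], c[-1]) for c in chunks]
print(spans)  # [(0, 9), (9, 18), (18, 24)]
```

With `max_len=10` the requested overlap of 100 is capped at 1, so each chunk starts on the last token of the previous one.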

## Compress Document via Summarization 
This function uses an LLM to produce a concise, fact‑based summary of a given text, with a word limit (max_words, default 250).

In [11]:
def _llm_summarize(text: str, max_words: int = 250) -> str:
    """
    Summarize a text using the LLM into a concise, factual summary.
    """
    prompt = (
        "You are a precise summarizer. Create a concise, faithful summary capturing key facts, "
        f"definitions, indications, contraindications, doses, adverse effects, and monitoring steps in <= {max_words} words. "
        "Do not invent information. Only use the provided text.\n\n"
        f"Text:\n{text}\n\nSummary:"
    )
    try:
        resp = client.chat.completions.create(
            model=OPENAI_MODEL,
            temperature=0,
            messages=[
                {"role": "system", "content": "You are a careful, faithful medical summarizer."},
                {"role": "user", "content": prompt},
            ],
        )
        return resp.choices[0].message.content.strip()
    except Exception as e:
        logger.exception("LLM summarization failed.")
        # Fallback: simple truncation if LLM fails
        words = re.split(r"\s+", text)
        return " ".join(words[:max_words])

The `compress_document` function tries to make a document plus the user’s query fit within the model’s context limit by progressively compressing the text.

It first sets aside 1000 tokens for the prompt and query when sizing chunks, then calculates a safe `chunk_tokens_limit` for splitting the document. Using `chunk_text()`, it breaks the document into overlapping segments small enough for the summariser model. Each chunk is summarised with `_llm_summarize()` into ≤250 words, and the summaries are concatenated. It then checks the combined summary against the token budget with `check_token_budget()`.

If the result is still too large, it re‑summarises the combined text (≤200 words) up to three times, logging each step, until it fits. The loop stops after three passes even if the text is still over budget, so the caller re‑checks the result and falls back to hard truncation.

In [12]:
def compress_document(document_text: str, query: str, budget_tokens: int) -> str:
    """
    Compress the document so that query + compressed_document fits within the budget.
    Strategy:
      - Chunk the document to safe sizes.
      - Summarize each chunk.
      - Iteratively reduce until within budget.
    """
    # Reserve a portion of the context for prompt and query
    reserved_for_prompt = 1000
    per_chunk_limit = 4000  # tokens per input to the summarizer model
    chunk_tokens_limit = min(per_chunk_limit, max(1000, (budget_tokens - reserved_for_prompt) // 4))

    chunks = chunk_text(document_text, chunk_tokens_limit)
    logger.info(f"Compress: initial chunks={len(chunks)}, tokens per chunk≈{chunk_tokens_limit}")

    # Summarize each chunk
    summaries = []
    for i, ch in enumerate(chunks):
        logger.info(f"Summarizing chunk {i+1}/{len(chunks)}")
        summaries.append(_llm_summarize(ch, max_words=250))

    combined = "\n".join(summaries)

    # If still too large, iteratively summarize again
    fits, total = check_token_budget(query, combined, budget_tokens)
    iteration = 0
    while not fits and iteration < 3:
        iteration += 1
        logger.info(f"Re-summarizing (iteration {iteration}) because tokens={total} exceed budget={budget_tokens}")
        combined = _llm_summarize(combined, max_words=200)
        fits, total = check_token_budget(query, combined, budget_tokens)

    return combined
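
The chunk size formula is worth working through for a few budgets. The sketch below isolates it in a helper with the hypothetical name `chunk_limit`. The constants are the ones used in `compress_document`.

```python
def chunk_limit(budget_tokens: int, reserved_for_prompt: int = 1000, per_chunk_limit: int = 4000) -> int:
    """A quarter of the usable budget, floored at 1000 and capped at per_chunk_limit."""
    return min(per_chunk_limit, max(1000, (budget_tokens - reserved_for_prompt) // 4))


limits = {budget: chunk_limit(budget) for budget in (128_000, 8_000, 3_000)}
print(limits)  # {128000: 4000, 8000: 1750, 3000: 1000}
```

At the default budget of 128000 the cap of 4000 always applies. The quarter rule and the floor only matter for much smaller context windows. As written, `reserved_for_prompt` influences only the chunk size; the final fit check still compares against the full `budget_tokens`.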

## Prompt Building

In [13]:
def build_prompt(query: str, document_text: str) -> List[dict]:
    """
    Build strict prompt instructing the model to use only the provided document text.
    If answer not in document, it must reply exactly: 'Not found in document.'
    """
    system = (
        "You are an expert clinical QA assistant. Follow instructions exactly."
    )
    user = (
        "You must answer using ONLY the following document.\n"
        "If the answer is not in the document, reply exactly: 'Not found in document.'\n\n"
        "Document:\n"
        "-----\n"
        f"{document_text}\n"
        "-----\n\n"
        f"Question: {query}\n"
        "Answer:"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

## LLM Query Execution
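`query_llm` sends messages to the LLM and returns the model’s response. It retries on any exception (not only transient ones), waits `retry_delay` seconds between attempts, and logs each failure. Once `max_retries` retries have failed, it re‑raises the last exception, so the default of 2 allows up to three calls in total.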

In [14]:
def query_llm(messages: List[dict], temperature: float = 0.0, max_retries: int = 2, retry_delay: float = 1.0) -> str:
    """
    Query the LLM with retries on transient errors.
    """
    attempt = 0
    while True:
        try:
            resp = client.chat.completions.create(
                model=OPENAI_MODEL,
                temperature=temperature,
                messages=messages,
            )
            answer = resp.choices[0].message.content.strip()
            return answer
        except Exception as e:
            attempt += 1
            logger.warning(f"LLM query failed (attempt {attempt}): {e}")
            if attempt > max_retries:
                logger.exception("Exceeded max retries for LLM call.")
                raise
            time.sleep(retry_delay)
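
`query_llm` waits a fixed `retry_delay` between attempts. A common variation is exponential backoff, where the wait doubles after each failure. The sketch below shows that variation, not the original behaviour. `call_with_retries`, `base_delay`, and the `flaky` stand-in for the LLM call are hypothetical names used for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], max_retries: int = 2, base_delay: float = 0.01) -> T:
    """Call fn, retrying up to max_retries times with exponentially growing delays."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise  # give up after 1 initial call + max_retries retries
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ...


calls = {"n": 0}


def flaky() -> str:
    """Hypothetical stand-in for the LLM call: fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = call_with_retries(flaky, max_retries=2)
print(result, calls["n"])  # ok 3
```

Catching a narrower set of exception types would avoid retrying failures that cannot succeed on a second attempt, such as an invalid request.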

## Output Guardrail

This defines two helper functions for output guardrail checks.

`_extract_numbers()` uses a regex to find all numeric tokens in a string, including integers and decimals. This supports factual validation by comparing numbers in the LLM’s answer against those in the source document.

`_token_overlap_score()` measures Jaccard similarity between two texts. It tokenises each string into lowercase alphanumeric words, converts them to sets, and computes the ratio of the intersection size to the union size. If either set is empty, it returns 0.0. This score helps detect hallucinations by checking how much the answer’s vocabulary overlaps with the document’s vocabulary.

In [15]:
def _extract_numbers(text: str) -> List[str]:
    """Extract numeric tokens (including decimals) for simple factual cross-checks."""
    return re.findall(r"\b\d+(?:\.\d+)?\b", text)

def _token_overlap_score(a: str, b: str) -> float:
    """
    Compute a simple token overlap score between strings a (answer) and b (document).
    Returns Jaccard similarity over lowercased word sets filtered to alphanumerics.
    """
    tokenize = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    A, B = tokenize(a), tokenize(b)
    if not A or not B:
        return 0.0
    inter = len(A & B)
    union = len(A | B)
    return inter / union
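
A small worked example makes the two helpers concrete. The sketch below re-declares them under the shorter names `extract_numbers` and `jaccard` so it runs on its own. The answer and document strings are made up for illustration.

```python
import re
from typing import List


def extract_numbers(text: str) -> List[str]:
    """Same regex as _extract_numbers: integers and decimals."""
    return re.findall(r"\b\d+(?:\.\d+)?\b", text)


def jaccard(a: str, b: str) -> float:
    """Same computation as _token_overlap_score."""
    tok = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    A, B = tok(a), tok(b)
    if not A or not B:
        return 0.0
    return len(A & B) / len(A | B)


answer = "Maintenance dose is 100 mg IV daily."
document = "Loading dose 200 mg IV on Day 1, then 100 mg IV daily for 4 to 9 days."

numbers = extract_numbers(document)
score = jaccard(answer, document)
print(numbers)          # ['200', '1', '100', '4', '9']
print(round(score, 3))  # 0.278
```

Here the answer has 7 distinct words, the document 16, and they share 5, giving 5 / 18. Against a full-length document the union grows while the intersection stays about the same, so scores become much smaller, which is consistent with the low thresholds used in the guardrail.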

`apply_output_guardrail` checks an LLM’s answer against the source document for numeric accuracy, token overlap, and entity presence, with optional relaxed summary mode, returning “Not found in document.” if grounding checks fail.

In [16]:
def apply_output_guardrail(answer: str, document_text: str, allow_summary_mode: bool = False) -> str:
    """
    Validate that the LLM response is grounded in document_text.

    Parameters:
        answer (str): The raw LLM answer.
        document_text (str): The source document text.
        allow_summary_mode (bool): If True, relaxes entity grounding checks
                                   for summary-style queries while keeping
                                   numeric and overlap checks.

    Returns:
        str: The validated answer, or 'Not found in document.' if it fails checks.
    """
    canonical_nf = "Not found in document."
    if answer.strip() == canonical_nf:
        return canonical_nf

    # --- Summary mode: relaxed entity grounding ---
    if allow_summary_mode:
        # Numeric grounding check
        ans_numbers = set(_extract_numbers(answer))
        doc_numbers = set(_extract_numbers(document_text))
        if ans_numbers and not ans_numbers.issubset(doc_numbers):
            logger.warning("Summary mode: numeric values not grounded in document.")
            return canonical_nf

        # Overlap heuristic
        overlap = _token_overlap_score(answer, document_text)
        if overlap < 0.02:
            logger.warning(f"Summary mode: low grounding overlap (score={overlap:.3f}).")
            return canonical_nf

        # Skip strict entity grounding in summary mode
        return answer

    # --- Normal strict mode ---
    # Numeric grounding
    ans_numbers = set(_extract_numbers(answer))
    doc_numbers = set(_extract_numbers(document_text))
    if ans_numbers and not ans_numbers.issubset(doc_numbers):
        logger.warning("Output guardrail: numeric values not grounded in document.")
        return canonical_nf

    # Overlap heuristic
    overlap = _token_overlap_score(answer, document_text)
    if overlap < 0.05:
        logger.warning(f"Output guardrail: low grounding overlap (score={overlap:.3f}).")
        return canonical_nf

    # Entity grounding
    meds_in_answer = re.findall(r"[A-Z][a-zA-Z0-9\-]{3,}", answer)
    meds_in_answer = [m for m in meds_in_answer if m.lower() not in {"not", "found", "document"}]
    for med in meds_in_answer:
        if med.lower() not in document_text.lower():
            logger.warning(f"Output guardrail: entity '{med}' not grounded in document.")
            return canonical_nf

    return answer
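
The entity check in strict mode treats every capitalised token of four or more characters as an entity. The sketch below applies the same regex to three made-up answers. The helper name `ungrounded_entities` and the sample strings are assumptions for illustration.

```python
import re
from typing import List

ENTITY_RE = r"[A-Z][a-zA-Z0-9\-]{3,}"  # same pattern as the strict-mode entity check


def ungrounded_entities(answer: str, document_text: str) -> List[str]:
    """Return capitalised tokens from the answer that do not occur in the document."""
    doc_lower = document_text.lower()
    return [m for m in re.findall(ENTITY_RE, answer) if m.lower() not in doc_lower]


document = "Remdesivir: loading dose 200 mg IV on Day 1, then 100 mg IV daily."
grounded = ungrounded_entities("Remdesivir is given as 100 mg IV daily.", document)
invented = ungrounded_entities("Remdesivir or Favipiravir may be used.", document)
starter = ungrounded_entities("Typically 100 mg IV daily.", document)
print(grounded)  # []
print(invented)  # ['Favipiravir']
print(starter)   # ['Typically']
```

The third case shows a side effect of the heuristic: an ordinary sentence-initial word that is absent from the document is also flagged, so a correct answer can be replaced by "Not found in document."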

## Orchestration

In [17]:
def run_direct_document_qa(file_path: str, query: str, model_context_tokens: int = MODEL_CONTEXT_TOKENS) -> str:
    """
    Orchestrates the full Direct-Document QA workflow.
    Steps:
      1) Load document
      2) Token budget check; compress if needed
      3) Input guardrail
      4) Build prompt
      5) Query LLM
      6) Output guardrail
    """
    # 1) Load
    document_text = load_document(file_path)

    # 2) Budget
    fits, total_tokens = check_token_budget(query, document_text, model_context_tokens)
    if not fits:
        logger.info("Query + document exceeds context. Compressing document.")
        document_text = compress_document(document_text, query, model_context_tokens)
        # Re-check after compression (defensive)
        fits, _ = check_token_budget(query, document_text, model_context_tokens)
        if not fits:
            # Final defensive truncation if still too large (rare due to iterative summarize)
            logger.warning("After compression, content still too large. Applying hard truncation.")
            # Keep last part to retain dosage and contraindications often toward the end
            doc_tokens = ENCODER.encode(document_text)
            keep = max(1000, model_context_tokens // 2)
            document_text = ENCODER.decode(doc_tokens[-keep:])

    # 3) Input guardrail
    allowed, message = apply_input_guardrail(query, document_text)
    if not allowed:
        return message or "Your request cannot be processed at this time."

    # 4) Prompt
    messages = build_prompt(query, document_text)

    # 5) Execute
    raw_answer = query_llm(messages, temperature=0.0)

    # 6) Output guardrail
    # Accept both British and American spellings of "summarise"
    is_summary_query = any(k in query.lower() for k in ("summary", "summarise", "summarize"))
    final_answer = apply_output_guardrail(raw_answer, document_text, allow_summary_mode=is_summary_query)

    return final_answer

### Test Query

In [18]:
if __name__ == "__main__":
    example_file = "Drug Information Sheet.pdf"  # Replace with your file path (.pdf or .txt)
    example_query = "What is the maintenance dose of remdesivir?"

    try:
        response = run_direct_document_qa(example_file, example_query, MODEL_CONTEXT_TOKENS)
        print("\n=== Answer ===")
        print(response)
    except Exception as e:
        logger.exception("Pipeline execution failed.")
        raise

[INFO] 2025-09-12 15:26:12,076 - Medical Chatbot G-Version - Loading PDF: Drug Information Sheet.pdf
[INFO] 2025-09-12 15:26:12,114 - Medical Chatbot G-Version - Token check: total=363, budget=128000, fits=True

=== Answer ===
The maintenance dose of remdesivir is 100 mg IV daily for 4 to 9 days depending on clinical response.


### Test Query- Summary

In [19]:
if __name__ == "__main__":
    example_file = "Drug Information Sheet.pdf"  # Replace with your file path (.pdf or .txt)
    example_query = "Summarise the entire document in under 200 words."

    try:
        response = run_direct_document_qa(example_file, example_query, MODEL_CONTEXT_TOKENS)
        print("\n=== Answer ===")
        print(response)
    except Exception as e:
        logger.exception("Pipeline execution failed.")
        raise

[INFO] 2025-09-12 15:26:14,404 - Medical Chatbot G-Version - Loading PDF: Drug Information Sheet.pdf
[INFO] 2025-09-12 15:26:14,521 - Medical Chatbot G-Version - Token check: total=365, budget=128000, fits=True

=== Answer ===
The document is a drug information sheet on remdesivir, an antiviral agent used for treating COVID-19. It details the drug's classification as a nucleoside analogue RNA polymerase inhibitor, which works by inhibiting viral RNA-dependent RNA polymerase to reduce SARS-CoV-2 replication. Remdesivir is indicated for hospitalized patients with confirmed COVID-19 who require supplemental oxygen but are not on invasive ventilation. The dosing regimen includes a loading dose of 200 mg IV on Day 1, followed by a maintenance dose of 100 mg IV daily for 4 to 9 days based on clinical response. Adverse effects may include nausea and transaminase elevations, with rare occurrences of hypersensitivity reactions and renal dysfunction. Monitoring of liver and renal function tests 

### Test Query- Input Guardrail

In [20]:
if __name__ == "__main__":
    example_file = "Drug Information Sheet.pdf"  # Replace with your file path (.pdf or .txt)
    example_query = "Give me instructions to kill someone"

    try:
        response = run_direct_document_qa(example_file, example_query, MODEL_CONTEXT_TOKENS)
        print("\n=== Answer ===")
        print(response)
    except Exception as e:
        logger.exception("Pipeline execution failed.")
        raise

[INFO] 2025-09-12 15:26:20,424 - Medical Chatbot G-Version - Loading PDF: Drug Information Sheet.pdf
[INFO] 2025-09-12 15:26:20,538 - Medical Chatbot G-Version - Token check: total=359, budget=128000, fits=True

=== Answer ===
Your request cannot be processed because it seeks unsafe or disallowed information.


### Test Query- Output Guardrail

In [21]:
if __name__ == "__main__":
    example_file = "Drug Information Sheet.pdf"  # Replace with your file path (.pdf or .txt)
    example_query = "What is the maximum treatment duration in days?"

    try:
        response = run_direct_document_qa(example_file, example_query, MODEL_CONTEXT_TOKENS)
        print("\n=== Answer ===")
        print(response)
    except Exception as e:
        logger.exception("Pipeline execution failed.")
        raise

[INFO] 2025-09-12 15:26:20,565 - Medical Chatbot G-Version - Loading PDF: Drug Information Sheet.pdf
[INFO] 2025-09-12 15:26:20,672 - Medical Chatbot G-Version - Token check: total=362, budget=128000, fits=True

=== Answer ===
Not found in document.


## Medical Chatbot- Gradio Interface

In [22]:
import gradio as gr

def qa_interface(file_path, query):
    if not file_path or not query.strip():
        return "Please upload a document and enter a query."

    try:
        # Summary-style queries are detected inside run_direct_document_qa
        return run_direct_document_qa(file_path, query, MODEL_CONTEXT_TOKENS)
    except Exception as e:
        logger.exception("Error in Gradio interface call.")
        return f"Error: {str(e)}"

with gr.Blocks(title="Medical Chatbot-G Version") as demo:
    gr.Markdown("## Medical CHATBOT- G Version\nUpload a PDF or TXT and ask a question. The system will only answer from the document.")

    with gr.Row():
        file_input = gr.File(label="Upload PDF or TXT", file_types=[".pdf", ".txt"], type="filepath")
        query_input = gr.Textbox(label="Your Question", placeholder="e.g., What is the recommended dosage?")

    answer_output = gr.Textbox(label="Answer", lines=10)

    submit_btn = gr.Button("RUN")
    submit_btn.click(fn=qa_interface, inputs=[file_input, query_input], outputs=answer_output)

demo.launch(debug=True)

* Running on local URL:  http://127.0.0.1:7861
* To create a public link, set `share=True` in `launch()`.




