<a href="https://colab.research.google.com/github/SaiSatyamJena/Project-Nightingale/blob/main/Hallucination_Survey_Medical_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1.1. Introduction: The Criticality of Accuracy in Medical AI

Large Language Models (LLMs) have shown remarkable capabilities in text generation, including summarization. However, a significant challenge, especially in high-stakes domains like medicine, is the phenomenon of "hallucinations."

**What are Hallucinations?**
In the context of LLM-generated text, hallucinations refer to content that is:
*   **Factually incorrect:** Stating information that contradicts the source document or established medical knowledge.
*   **Fabricated:** Introducing information not present in the source document.
*   **Nonsensical or irrelevant:** Generating text that does not cohere with the input or the task.

**Why are Hallucinations Critical in Medicine?**
In medical report summarization, hallucinations can have severe consequences:
*   **Patient Safety Risks:** A summary suggesting a wrong diagnosis, incorrect dosage, or a non-existent allergy can lead to harmful medical decisions.
*   **Misinformation for Clinicians:** Doctors rely on accurate summaries for quick understanding. Hallucinated information can mislead them, waste time, and erode trust in AI tools.
*   **Legal and Ethical Implications:** Inaccurate medical records or summaries can lead to significant legal and ethical issues.

Controlling and evaluating hallucinations is paramount for the responsible deployment of LLMs in healthcare.

### 1.2. Project Goal: The "Hallucination Survey Function"

This project aims to develop and showcase a "Hallucination Survey Function" specifically designed for evaluating summaries of medical reports. The goal is not just to detect hallucinations but to provide a *survey* of potential issues, offering insights into the summary's faithfulness to the source document.

This function will serve as a practical tool to:
*   Quantify different aspects of potential hallucinations.
*   Help developers and researchers understand the types of errors their summarization models are making.
*   Provide a basis for improving model factuality and reliability.
*   Demonstrate practical techniques for checking and controlling LLM outputs in critical applications.

### 1.3. Our Approach: A Multi-Faceted Survey

Instead of relying on a single metric, our Hallucination Survey Function will employ a battery of checks, each targeting different potential failure modes. This multi-faceted approach provides a more nuanced and comprehensive assessment. The checks will include:

*   **Lexical and Entity-Level Analysis:** Ensuring key medical terms, numerical values (like dosages), and named entities are consistent between source and summary.
*   **Negation Consistency:** Verifying that statements of absence or presence are correctly maintained.
*   **Phrase-Level Coherence:** Using n-gram analysis to check for congruent phrasing of important concepts.
*   **Content Overlap and Abstractiveness:** Utilizing metrics like ROUGE to understand how much the summary relies on verbatim extraction versus abstractive generation, and whether it adequately covers source content.
*   **(Potentially) Basic Semantic Outlier Detection:** Identifying summary sentences that might be semantically distant from the source content.

By combining these checks, we aim to build a robust tool that provides actionable feedback on summary quality, focusing squarely on the critical aspect of factual accuracy.
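
To make the entity-level idea concrete, here is a minimal sketch of a numerical consistency check: collect number-plus-unit mentions from both texts and report those that appear only in the summary. The helper names (`numeric_mentions`, `unsupported_numbers`) and the short unit list are assumptions for illustration only; the surveyor defined later in the notebook uses its own, more thorough extractor.

```python
import re

# Hypothetical pattern for illustration: a number, optionally followed by one of a few units.
NUMBER_WITH_UNIT = re.compile(r"(\d+(?:\.\d+)?)(?:\s*(mg|mcg|ml|g)\b)?", re.IGNORECASE)

def numeric_mentions(text):
    """Return the set of 'number' or 'number+unit' strings found in text, lowercased."""
    return {f"{num}{unit.lower()}" for num, unit in NUMBER_WITH_UNIT.findall(text)}

def unsupported_numbers(source, summary):
    """Numeric mentions that occur in the summary but nowhere in the source."""
    return sorted(numeric_mentions(summary) - numeric_mentions(source))

source = "Patient takes lisinopril 10 mg daily. Potassium is 4.0."
summary = "Patient takes lisinopril 20mg daily. Potassium is 4.0."
print(unsupported_numbers(source, summary))  # ['20mg']
```

Normalizing "10 mg" and "10mg" to the same string before comparing is the key design choice here: without it, harmless spacing differences would be reported as mismatches.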

### 2.1. Essential Imports

We'll start by importing the Python libraries required for our analysis. We will prioritize standard libraries and those readily available in Google Colab to ensure smooth execution.

In [None]:
# Standard Python Libraries
import re
from collections import Counter, defaultdict
import string

# NLTK for text processing
print("Importing NLTK and downloading resources...")
NLTK_AVAILABLE = False  # Default to False
try:
    import nltk
    print("NLTK module imported. Downloading 'punkt', 'punkt_tab', and 'stopwords' resources...")
    # Download both 'punkt' and 'punkt_tab' to ensure tokenizer compatibility
    nltk.download('punkt')
    nltk.download('punkt_tab')  # Explicitly download punkt_tab
    nltk.download('stopwords')
    nltk.download('averaged_perceptron_tagger', quiet=True)

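    from nltk.tokenize import word_tokenize, sent_tokenize
    from nltk.corpus import stopwords
    from nltk.util import ngrams

    # Materialize the English stopword list under the same name the fallback branch defines,
    # so later cells can rely on `stopwords_english` whether or not NLTK is available.
    stopwords_english = set(stopwords.words('english'))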

    # Test tokenization immediately after import and download
    try:
        print(f"  Attempting NLTK word_tokenize on 'Test sentence.': {word_tokenize('Test sentence.')}")
        NLTK_AVAILABLE = True
        print("NLTK imported, resources downloaded, and word_tokenize test successful.")
    except LookupError as e:
        print(f"NLTK LookupError after initial download: {e}")
        print("Re-attempting to download 'punkt_tab' as a fallback...")
        nltk.download('punkt_tab', quiet=True)  # Retry punkt_tab specifically
        print(f"  Re-attempting NLTK word_tokenize: {word_tokenize('Test sentence.')}")
        NLTK_AVAILABLE = True
        print("NLTK word_tokenize successful after re-confirmation.")
    except Exception as e:
        print(f"An unexpected error occurred during NLTK setup or test: {e}")
        NLTK_AVAILABLE = False

except ImportError:
    NLTK_AVAILABLE = False
    print("NLTK Import Error: NLTK module is not available. Some functionality will be limited.")
except Exception as e:
    NLTK_AVAILABLE = False
    print(f"An unexpected error occurred during NLTK import/download: {e}")

if not NLTK_AVAILABLE:
    print("NLTK setup failed. Defining fallback functions.")
    # Define dummy functions if NLTK fails
    def word_tokenize(text): return re.findall(r'\b\w+\b', text.lower())
    def sent_tokenize(text): return re.split(r'(?<=[.!?])\s+', text)
    stopwords_english = set(["a", "an", "the", "is", "are", "was", "were"])
    def ngrams(sequence, n):
        if not sequence or n < 1: return []
        return [tuple(sequence[i:i+n]) for i in range(len(sequence)-n+1)]

# Scikit-learn for TF-IDF and Cosine Similarity
print("\nImporting scikit-learn components...")
try:
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    SKLEARN_AVAILABLE = True
    print("Scikit-learn (TfidfVectorizer, cosine_similarity) imported successfully.")
except ImportError:
    SKLEARN_AVAILABLE = False
    print("Scikit-learn Import Error: TfidfVectorizer or cosine_similarity not available.")

# ROUGE score for summary evaluation
ROUGE_AVAILABLE = False
print("\nROUGE library (rouge_score) will be installed and imported in the next cell.")

# Confirmation print
print("\n--- Initial Imports Configuration ---")
print(f"NLTK Available: {NLTK_AVAILABLE}")
if NLTK_AVAILABLE:
    print(f"  Final check NLTK word_tokenize: {word_tokenize('Test sentence.')}")
else:
    print(f"  Using Fallback word_tokenize: {word_tokenize('Test sentence.')}")
print(f"Scikit-learn (for TF-IDF) Available: {SKLEARN_AVAILABLE}")
print(f"ROUGE Scorer status: To be determined in the next cell.")

Importing NLTK and downloading resources...
NLTK module imported. Downloading 'punkt', 'punkt_tab', and 'stopwords' resources...
  Attempting NLTK word_tokenize on 'Test sentence.': ['Test', 'sentence', '.']
NLTK imported, resources downloaded, and word_tokenize test successful.

Importing scikit-learn components...
Scikit-learn (TfidfVectorizer, cosine_similarity) imported successfully.

ROUGE library (rouge_score) will be installed and imported in the next cell.

--- Initial Imports Configuration ---
NLTK Available: True
  Final check NLTK word_tokenize: ['Test', 'sentence', '.']
Scikit-learn (for TF-IDF) Available: True
ROUGE Scorer status: To be determined in the next cell.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Install and import rouge-score
print("Attempting to install and import rouge_score package...")
ROUGE_AVAILABLE = False  # Initialize/reset
import subprocess
import sys

try:
    # Run pip through the interpreter executing this notebook so the package lands in the active environment
    result = subprocess.run([sys.executable, '-m', 'pip', 'install', 'rouge-score', '--quiet'], capture_output=True, text=True)

    if result.returncode == 0:
        print("pip install rouge-score executed successfully (or package already present).")
        from rouge_score import rouge_scorer
        ROUGE_AVAILABLE = True
        print("rouge_scorer imported successfully from rouge_score.")
    else:
        print("pip install rouge-score failed.")
        print(f"Error output: {result.stderr}")
except ImportError:
    print("ImportError: Even after attempting install, rouge_score could not be imported. ROUGE metrics will be skipped.")
except Exception as e:
    print(f"An unexpected error occurred during rouge-score installation/import: {e}")

print(f"\nFinal ROUGE Scorer Available: {ROUGE_AVAILABLE}")
if ROUGE_AVAILABLE:
    # Quick test of the imported scorer
    try:
        scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
        scores = scorer.score('This is a test.', 'This is a test.')
        print("ROUGE scorer test successful:", scores['rouge1'])
    except Exception as e:
        print(f"Error testing ROUGE scorer: {e}")
        ROUGE_AVAILABLE = False  # If test fails, mark as unavailable
        print(f"Final ROUGE Scorer Available (after test): {ROUGE_AVAILABLE}")

Attempting to install and import rouge_score package...
pip install rouge-score executed successfully (or package already present).
rouge_scorer imported successfully from rouge_score.

Final ROUGE Scorer Available: True
ROUGE scorer test successful: Score(precision=1.0, recall=1.0, fmeasure=1.0)
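
For intuition about the precision and recall values printed above, ROUGE-1 can be read as clipped unigram overlap between a reference and a candidate. The following is a simplified sketch under stated assumptions: it tokenizes on whitespace and applies no stemming, so its numbers will not match `rouge_score` exactly on real text. The function name `unigram_overlap` is hypothetical.

```python
from collections import Counter

def unigram_overlap(reference, candidate):
    """Return (precision, recall) of clipped unigram overlap, in the spirit of ROUGE-1."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)  # share of candidate tokens found in reference
    recall = overlap / max(sum(ref_counts.values()), 1)      # share of reference tokens covered by candidate
    return precision, recall

precision, recall = unigram_overlap("the patient denies chest pain", "patient denies pain")
print(precision, recall)  # 1.0 0.6
```

When the source document is the reference and the summary is the candidate, high precision with low recall is the expected shape for a short, faithful summary; low precision suggests the summary uses many words the source never did.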


### 2.2. Helper Functions

We'll define some helper functions that will be used throughout the notebook.

In [None]:
# Helper functions: text cleaning plus placeholders for the LLM steps
def basic_text_clean(text):
    """A very simple text cleaner: lowercase and remove extra whitespace."""
    text = str(text).lower() # Ensure string and lowercase
    text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
    # Consider removing all punctuation if it interferes with specific checks,
    # but sometimes punctuation (like in "mg/dL") is important.
    # text = text.translate(str.maketrans('', '', string.punctuation)) # Example: remove all punctuation
    return text

# Placeholder for LLM related functions - will be fully implemented in Chapter 5
def load_llm_model_and_tokenizer(model_name_path):
    print(f"INFO: Placeholder for loading LLM model and tokenizer: {model_name_path}")
    print("      Actual implementation will be in Chapter 5 using Hugging Face transformers.")
    # from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    # tokenizer = AutoTokenizer.from_pretrained(model_name_path)
    # model = AutoModelForSeq2SeqLM.from_pretrained(model_name_path)
    # return model, tokenizer
    return None, None # Placeholder

def generate_llm_summary(model, tokenizer, source_text, max_summary_length=150, min_summary_length=25):
    print(f"INFO: Placeholder for generating LLM summary for text starting with: '{source_text[:70]}...'")
    print("      Actual implementation will be in Chapter 5.")
    # inputs = tokenizer(source_text, return_tensors="pt", max_length=1024, truncation=True)
    # summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=max_summary_length, min_length=min_summary_length, early_stopping=True)
    # summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    # return summary
    return "This is a placeholder LLM summary. Full generation logic will be in Chapter 5."


print("Core helper functions defined (basic_text_clean, and placeholders for LLM functions).")

Core helper functions defined (basic_text_clean, and placeholders for LLM functions).


## Chapter 3: Architecting the "HallucinationSurveyor"

### 3.1. Core Design Philosophy

Our `HallucinationSurveyor` is designed as a comprehensive tool to scrutinize summaries against their source documents, focusing on potential factual discrepancies indicative of hallucinations. The core philosophy is a "survey" approach: instead of a single score, the surveyor performs a battery of distinct checks, each targeting a specific type of potential issue common in medical text summarization.

This allows for:
*   **Granular Feedback:** Pinpointing *what kind* of hallucination might be present (e.g., entity fabrication, numerical error, negation flip).
*   **Interpretability:** Understanding *why* a summary is flagged.
*   **Targeted Improvement:** Providing insights that can guide efforts to improve LLM factuality.

The surveyor will primarily use lexical, rule-based, and basic semantic techniques that are robust and don't require massive computational resources or complex dependencies, making it practical for various environments.
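
The "battery of checks" idea can be sketched as a dictionary of independent check functions whose findings are collected under their own keys, so each finding stays attributable to the check that raised it. The two toy checks and the `run_checks` helper below are hypothetical and deliberately crude (whitespace tokens only); they illustrate the report shape, not the surveyor's actual logic.

```python
from typing import Callable, Dict, List

Check = Callable[[str, str], List[str]]

def new_numeric_tokens(source: str, summary: str) -> List[str]:
    """Summary tokens containing a digit that never occur in the source."""
    source_tokens = set(source.lower().split())
    return sorted(
        tok for tok in set(summary.lower().split())
        if any(ch.isdigit() for ch in tok) and tok not in source_tokens
    )

def new_negation_markers(source: str, summary: str) -> List[str]:
    """Negation words used in the summary but absent from the source."""
    markers = {"no", "not", "denies", "without"}
    source_tokens = set(source.lower().split())
    summary_tokens = set(summary.lower().split())
    return sorted((markers & summary_tokens) - source_tokens)

def run_checks(source: str, summary: str, checks: Dict[str, Check]) -> Dict[str, List[str]]:
    """Run every check independently and key its findings by check name."""
    return {name: check(source, summary) for name, check in checks.items()}

report = run_checks(
    "BP 120/80 mmHg and fever present",
    "BP 130/80 mmHg and no fever",
    {"numbers": new_numeric_tokens, "negations": new_negation_markers},
)
print(report)  # {'numbers': ['130/80'], 'negations': ['no']}
```

Keeping the checks independent means a new failure mode can be covered by adding one function and one dictionary entry, without touching the others.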

In [None]:
# Ensure basic_text_clean, word_tokenize, sent_tokenize, stopwords_english, ngrams,
# ROUGE_AVAILABLE, rouge_scorer, and NLTK_AVAILABLE are defined from previous cells (Chapter 2).
import re

class HallucinationSurveyor:
    def __init__(self, critical_medical_terms=None, common_units=None):
        if critical_medical_terms:
            self.critical_medical_terms = set(term.lower() for term in critical_medical_terms)
        else:
            self.critical_medical_terms = {
                "cancer", "tumor", "malignant", "benign", "metastasis", "infection", "sepsis",
                "fracture", "stroke", "infarction", "embolism", "diabetes", "hypertension",
                "allergy", "anaphylaxis", "pneumonia", "hiv", "covid-19", "syndrome"
            }

        if common_units:
            # Ensure '%' is not treated as a standalone unit here if it's in common_units,
            # as it's handled specially with numbers.
            self.common_units = set(unit.lower() for unit in common_units if unit != '%')
        else:
            self.common_units = {"mg", "ml", "g", "mcg", "iu", "l", "dl",
                                 "bpm", "mmhg", "kg", "cm", "mmol/l", "meq/l", "c", "f", "μg"} # '%' removed, handled separately

        # Add 'ug' as an ASCII alias for micrograms when either 'mcg' or 'μg' is configured
        if 'mcg' in self.common_units or 'μg' in self.common_units:
            self.common_units.add('ug')


        self.stopwords = stopwords_english

    def _preprocess_text_for_word_analysis(self, text):
        cleaned_text = basic_text_clean(str(text))
        tokens = word_tokenize(cleaned_text)
        processed_tokens = []
        for token in tokens:
            token_lower = token.lower()
            is_number_token = re.fullmatch(r"^-?(?:\d+(?:\.\d*)?|\.\d+)(?:(?:[/\-])(?:-?(?:\d+(?:\.\d*)?|\.\d+)))*%?$", token_lower)

            if is_number_token:
                processed_tokens.append(token_lower)
            elif token_lower.isalnum() and token_lower not in self.stopwords:
                if len(token_lower) > 1 or token_lower.isdigit():
                    processed_tokens.append(token_lower)
            elif token_lower in self.critical_medical_terms:
                 processed_tokens.append(token_lower)
            elif token_lower in self.common_units: # '%' is not in self.common_units anymore
                 processed_tokens.append(token_lower)
        return processed_tokens

    def _extract_numerical_phrases(self, text):
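        # Normalize "μg" to "mcg" for both Unicode micro glyphs (U+00B5 micro sign, U+03BC Greek mu).
        # A bare micro symbol is left alone: rewriting it to "mcg" would turn "μL" into "mcgL".
        text_normalized = re.sub('[\u00b5\u03bc]g', 'mcg', str(text))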
        cleaned_text = basic_text_clean(text_normalized)
        numerical_phrases_found = set()
        number_pattern = r"-?(?:\d+(?:\.\d*)?|\.\d+)(?:(?:[\s]*[./-][\s]*)(?:-?(?:\d+(?:\.\d*)?|\.\d+)))*"
        sorted_units = sorted(list(self.common_units), key=len, reverse=True) # self.common_units no longer contains '%'

        for match in re.finditer(number_pattern, cleaned_text):
            # Drop spaces inside the match, then any sentence-final dot:
            # "100." -> "100", "120/80." -> "120/80", "2.5-3." -> "2.5-3". Interior decimals such as "2.5" are untouched.
            number_str = match.group(0).replace(" ", "").rstrip(".")

            current_pos = match.end()
            found_unit_for_this_number = False

            # First, explicitly check for '%'
            if current_pos < len(cleaned_text) and cleaned_text[current_pos] == '%':
                numerical_phrases_found.add(number_str + '%')
                found_unit_for_this_number = True
            else:
                # Then, check for other units
                max_len_unit = max(len(u) for u in sorted_units) if sorted_units else 0
                lookahead_window = cleaned_text[current_pos : current_pos + max_len_unit + 2]

                for unit in sorted_units: # sorted_units does not contain '%'
                    # Case 1: Unit is directly attached
                    if lookahead_window.startswith(unit):
                        idx_after_unit = len(unit)
                        if idx_after_unit == len(lookahead_window) or not lookahead_window[idx_after_unit].isalpha():
                            numerical_phrases_found.add(number_str + unit)
                            found_unit_for_this_number = True
                            break
                    # Case 2: Unit is separated by a single space
                    if lookahead_window.startswith(" " + unit):
                        idx_after_unit_and_space = 1 + len(unit)
                        if idx_after_unit_and_space == len(lookahead_window) or \
                           not lookahead_window[idx_after_unit_and_space].isalpha():
                            numerical_phrases_found.add(number_str + unit)
                            found_unit_for_this_number = True
                            break

            if not found_unit_for_this_number:
                numerical_phrases_found.add(number_str)

        return numerical_phrases_found

    def check_entity_coherence(self, source_text, summary_text):
        source_tokens = self._preprocess_text_for_word_analysis(source_text)
        summary_tokens = self._preprocess_text_for_word_analysis(summary_text)
        source_set = set(source_tokens)
        summary_set = set(summary_tokens)
        hallucinated_critical = [term for term in self.critical_medical_terms if term in summary_set and term not in source_set]
        omitted_critical = [term for term in self.critical_medical_terms if term in source_set and term not in summary_set]
        num_pattern_for_filter = r"^-?(?:\d+(?:\.\d*)?|\.\d+)(?:(?:[/\-])(?:-?(?:\d+(?:\.\d*)?|\.\d+)))*%?$"
        summary_words_only = {token for token in summary_set if not re.fullmatch(num_pattern_for_filter, token)}
        source_words_only = {token for token in source_set if not re.fullmatch(num_pattern_for_filter, token)}
        fabricated_general = list(summary_words_only - source_words_only - self.critical_medical_terms)
        return {
            "hallucinated_critical_entities": sorted(list(set(hallucinated_critical))),
            "omitted_critical_entities": sorted(list(set(omitted_critical))),
            "potentially_fabricated_general_terms": sorted(fabricated_general)[:10]
        }

    def check_numerical_consistency(self, source_text, summary_text):
        source_numbers = self._extract_numerical_phrases(source_text)
        summary_numbers = self._extract_numerical_phrases(summary_text)
        summary_not_in_source = list(summary_numbers - source_numbers)
        source_not_in_summary = list(source_numbers - summary_numbers)
        return {
            "numbers_in_summary_not_in_source": sorted(summary_not_in_source),
            "numbers_in_source_not_in_summary": sorted(source_not_in_summary)
        }

    def _extract_negations_with_context(self, text, terms_to_check):
        negated_phrases = []
        negation_markers = ["no", "not", "denies", "denied", "negative for", "absence of", "without",
                            "failed to reveal", "rules out", "ruled out", "free of", "clear of", "unremarkable for"]
        cleaned_text = basic_text_clean(str(text))
        sentences = sent_tokenize(cleaned_text)
        for sentence in sentences:
            sentence_lower = sentence.lower()
            for neg_marker in negation_markers:
                # Match the marker as a whole word or phrase; a plain substring search
                # would also fire on "no" inside "diagnosis" or "not" inside "noted".
                marker_pattern = r"\b" + re.escape(neg_marker) + r"\b"
                for marker_match in re.finditer(marker_pattern, sentence_lower):
                    marker_idx = marker_match.start()
                    window_start = marker_match.end()
                    if window_start < len(sentence_lower) and sentence_lower[window_start] == ' ': window_start += 1
                    window_end = window_start + 40
                    text_after_negation = sentence_lower[window_start:window_end]
                    for term in terms_to_check:
                        if term in text_after_negation:
                            context_phrase_start = max(0, marker_idx - 20)
                            term_pos_in_window = text_after_negation.find(term)
                            actual_term_end_in_sentence = window_start + term_pos_in_window + len(term)
                            context_phrase_end = min(len(sentence_lower), actual_term_end_in_sentence + 20)
                            contextual_phrase = sentence_lower[context_phrase_start:context_phrase_end].replace("\n", " ").strip()
                            negated_phrases.append(f"...{contextual_phrase}...")
                            break
        return sorted(list(set(negated_phrases)))

    def check_negation_consistency(self, source_text, summary_text):
        source_negated_phrases = self._extract_negations_with_context(source_text, self.critical_medical_terms)
        summary_negated_phrases = self._extract_negations_with_context(summary_text, self.critical_medical_terms)
        issues = []
        for snp in summary_negated_phrases:
            if snp not in source_negated_phrases:
                issues.append(f"Summary has distinct negation context not found in source: '{snp}'")
        # Check if source had a negation that summary seems to affirm (harder, needs affirmation detection)
        # For now, this check is primarily about differing explicit negations.
        return {
            "potential_negation_issues": issues,
            "source_negated_phrases_critical": source_negated_phrases,
            "summary_negated_phrases_critical": summary_negated_phrases
        }

    def check_n_gram_overlap(self, source_text, summary_text, n_values=None):
        if n_values is None: n_values = [2, 3]
        results = {}
        source_cleaned = basic_text_clean(str(source_text))
        summary_cleaned = basic_text_clean(str(summary_text))
        source_tokens_for_ngram = [t for t in word_tokenize(source_cleaned) if t.isalnum()]
        summary_tokens_for_ngram = [t for t in word_tokenize(summary_cleaned) if t.isalnum()]
        for n in n_values:
            source_ngrams_set = set(ngrams(source_tokens_for_ngram, n))
            summary_ngrams_set = set(ngrams(summary_tokens_for_ngram, n))
            if not summary_ngrams_set:
                overlap_score = 1.0 if not source_ngrams_set else 0.0
                summary_ngrams_not_in_source_list = []
            elif not source_ngrams_set:
                 overlap_score = 0.0
                 summary_ngrams_not_in_source_list = list(summary_ngrams_set)
            else:
                common_ngrams = source_ngrams_set.intersection(summary_ngrams_set)
                overlap_score = len(common_ngrams) / len(summary_ngrams_set) if len(summary_ngrams_set) > 0 else 0.0
                summary_ngrams_not_in_source_list = list(summary_ngrams_set - source_ngrams_set)
            results[f'{n}-gram'] = {
                'overlap_score': round(overlap_score, 3),
                # Sort before truncating so the five reported n-grams are deterministic
                'summary_ngrams_not_in_source': sorted(" ".join(gram) for gram in summary_ngrams_not_in_source_list)[:5]
            }
        return {"n_gram_analysis": results}

    def check_abstractiveness_metrics(self, source_text, summary_text):
        rouge_results = {}
        # Ensure ROUGE_AVAILABLE and rouge_scorer are defined and accessible from global scope
        g = globals()
        if 'ROUGE_AVAILABLE' in g and g['ROUGE_AVAILABLE'] and 'rouge_scorer' in g:
            try:
                s_text = str(source_text); summ_text = str(summary_text)
                if s_text and summ_text:
                    scorer_instance = g['rouge_scorer'].RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
                    scores = scorer_instance.score(s_text, summ_text)
                    for key_rouge in scores:
                        rouge_results[key_rouge] = {
                            "precision": round(scores[key_rouge].precision, 3),
                            "recall": round(scores[key_rouge].recall, 3),
                            "fmeasure": round(scores[key_rouge].fmeasure, 3)
                        }
                else: rouge_results = {"error": "Source or summary text is empty for ROUGE."}
            except Exception as e: rouge_results = {"error": f"ROUGE calculation failed: {e}"}
        else: rouge_results = {"status": "ROUGE scorer not available or not properly initialized."}
        src_tokens_len = len(word_tokenize(basic_text_clean(str(source_text))))
        summ_tokens_len = len(word_tokenize(basic_text_clean(str(summary_text))))
        length_ratio = summ_tokens_len / src_tokens_len if src_tokens_len > 0 else 0.0
        return {"rouge_scores": rouge_results, "length_ratio_summary_vs_source": round(length_ratio, 3)}

    def calculate_heuristic_score(self, report):
        score = 0
        weights = {
            "hallucinated_critical": 10, "omitted_critical": 7, "fabricated_general": 1,
            "numbers_summary_not_source": 8, "numbers_source_not_summary": 4, "negation_issues": 10
        }
        if "entity_coherence" in report:
            score += len(report["entity_coherence"].get("hallucinated_critical_entities", [])) * weights["hallucinated_critical"]
            score += len(report["entity_coherence"].get("omitted_critical_entities", [])) * weights["omitted_critical"]
            score += len(report["entity_coherence"].get("potentially_fabricated_general_terms", [])) * weights["fabricated_general"]
        if "numerical_consistency" in report:
            score += len(report["numerical_consistency"].get("numbers_in_summary_not_in_source", [])) * weights["numbers_summary_not_source"]
            score += len(report["numerical_consistency"].get("numbers_in_source_not_in_summary", [])) * weights["numbers_source_not_summary"]
        if "negation_consistency" in report:
             score += len(report["negation_consistency"].get("potential_negation_issues", [])) * weights["negation_issues"]
        return score

    def survey(self, source_text, summary_text):
        g = globals() # For accessing NLTK_AVAILABLE, ROUGE_AVAILABLE
        src_preview = str(source_text)[:70] + "..." if len(str(source_text)) > 70 else str(source_text)
        sum_preview = str(summary_text)[:70] + "..." if len(str(summary_text)) > 70 else str(summary_text)
        print(f"\n--- Starting Hallucination Survey ---")
        print(f"Source: {src_preview}")
        print(f"Summary: {sum_preview}")
        report = {}; current_score = 0
        try:
            report["entity_coherence"] = self.check_entity_coherence(source_text, summary_text)
            report["numerical_consistency"] = self.check_numerical_consistency(source_text, summary_text)
            report["negation_consistency"] = self.check_negation_consistency(source_text, summary_text)
            report["n_gram_overlap"] = self.check_n_gram_overlap(source_text, summary_text)
            report["abstractiveness_metrics"] = self.check_abstractiveness_metrics(source_text, summary_text)
            current_score = self.calculate_heuristic_score(report)
            report["overall_hallucination_heuristic_score"] = current_score
        except Exception as e:
            print(f"ERROR during survey execution: {e}")
            current_score = -1  # Keep the score quoted in survey_notes consistent with the report
            report["error"] = str(e); report["overall_hallucination_heuristic_score"] = current_score
        report["survey_notes"] = [
            "This is a heuristic survey. Manual review is crucial.",
            f"Heuristic score (higher indicates more potential issues): {current_score}",
            f"NLTK tokenizers used: {g.get('NLTK_AVAILABLE', 'Unknown')}",
            f"ROUGE metrics available: {g.get('ROUGE_AVAILABLE', False) and 'rouge_scorer' in g}"
        ]
        print("--- Hallucination Survey Complete ---")
        return report

# print("HallucinationSurveyor class defined.")
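
One limitation worth noting: `check_negation_consistency` compares context strings exactly, so a faithful rephrasing (source "no signs of infection", summary "no infection noted") can still be reported as a distinct negation. A possible refinement, sketched below under stated assumptions, is to compare which critical terms are negated rather than the surrounding text. The marker list is a shortened assumption, the sentence splitter is a simple regex, and `negated_terms` and `negation_term_diff` are hypothetical names that are not part of the class above.

```python
import re

NEGATION_MARKERS = ["no", "not", "denies", "without", "negative for"]  # assumed subset

def negated_terms(text, terms):
    """Return the terms that appear after a negation marker within the same sentence."""
    found = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text.lower()):
        for marker in NEGATION_MARKERS:
            # Whole-word match so "no" does not fire inside "noted" or "diagnosis".
            for match in re.finditer(r"\b" + re.escape(marker) + r"\b", sentence):
                following = sentence[match.end():]
                found.update(term for term in terms if term in following)
    return found

def negation_term_diff(source, summary, terms):
    """Compare the sets of negated terms instead of exact context strings."""
    src, summ = negated_terms(source, terms), negated_terms(summary, terms)
    return {
        "negated_only_in_summary": sorted(summ - src),
        "negated_only_in_source": sorted(src - summ),
    }

terms = {"infection", "fracture"}
source = "No signs of infection. X-ray shows a fracture."
summary = "No infection noted. No fracture."
print(negation_term_diff(source, summary, terms))
# {'negated_only_in_summary': ['fracture'], 'negated_only_in_source': []}
```

Here the rephrased infection statement passes, while the fracture that the source affirms and the summary negates is surfaced. This remains a heuristic: it treats everything after a marker in the same sentence as negated, which over-triggers on long sentences.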

In [None]:
# Ensure the HallucinationSurveyor class is defined in a previous cell.

# Instantiate the surveyor
try:
    surveyor_instance = HallucinationSurveyor()
    print("\n--- Surveyor Instance Details ---")
    print(f"Surveyor critical medical terms count: {len(surveyor_instance.critical_medical_terms)}")
    print(f"Surveyor common units count: {len(surveyor_instance.common_units)}")
    print(f"Surveyor using {len(surveyor_instance.stopwords)} stopwords (from NLTK: {NLTK_AVAILABLE}).") # Assumes NLTK_AVAILABLE is global
    SURVEYOR_READY = True
except Exception as e:
    print(f"Error instantiating HallucinationSurveyor: {e}")
    SURVEYOR_READY = False

if SURVEYOR_READY:
    print("\n--- Testing _extract_numerical_phrases ---")
    test_texts_for_numbers = {
        "Test 1 (Mixed)": "Patient given Lisinopril 10mg daily. BP 120/80 mmHg. Glucose level at 5.5 mmol/L. Improvement of 20%.",
        "Test 2 (Temp & Weight)": "No fever noted. Temperature 37.5 C. Weight 70kg.",
        "Test 3 (Dose, BP, Range)": "Dose is 2.5g. Pressure 130 / 90. Range 5-10 units.",
        "Test 4 (mcg, bpm)": "Administer 50mcg of medication. Heart rate is 70 bpm.",
        "Test 5 (Spaced Units)": "The value is 100. Another value 200ml. Then 300 mg."
    }
    for name, text in test_texts_for_numbers.items():
        extracted = surveyor_instance._extract_numerical_phrases(text)
        print(f"\n{name}: '{text}'")
        print(f"  Extracted: {sorted(list(extracted))}")

    print("\n--- Testing _preprocess_text_for_word_analysis ---")
    test_text_preprocess = "Patient reports: NO fever, but denies headache. BP: 120/80 mmHg."
    processed_tokens = surveyor_instance._preprocess_text_for_word_analysis(test_text_preprocess)
    print(f"Original: '{test_text_preprocess}'")
    print(f"Processed Tokens: {processed_tokens}")

    # End-to-end test: run every check on one source/summary pair
    print("\n--- Full Survey Test ---")
    source_example = "Patient is a 45-year-old male with a history of hypertension, currently managed with lisinopril 10mg daily. Denies chest pain or shortness of breath. Recent labs show potassium at 4.0 meq/l. No signs of infection."
    summary_good_example = "A 45-year-old male with hypertension takes lisinopril 10mg. He denies chest pain. Potassium is 4.0 meq/l. No infection noted."

    if ROUGE_AVAILABLE:  # Quick check if ROUGE is globally available for the test
        print("ROUGE scorer expected to be available for full survey.")
    else:
        print("Warning: ROUGE scorer NOT available globally, abstractiveness metrics will be limited.")

    report_good = surveyor_instance.survey(source_example, summary_good_example)
    print("\nReport for GOOD summary:")
    # Basic print for now, will pretty-print later
    for k, v in report_good.items():
        if k != "survey_notes":
            print(f"  {k}: {v}")
    print("  Survey Notes:")
    for note in report_good["survey_notes"]:
        print(f"    - {note}")

else:
    print("\nSURVEYOR NOT READY. Skipping further tests.")

print("Surveyor instantiation and initial tests complete.") # This will print if cell runs without error


--- Surveyor Instance Details ---
Surveyor critical medical terms count: 19
Surveyor common units count: 17
Surveyor using 127 stopwords (from NLTK: True).

--- Testing _extract_numerical_phrases ---

Test 1 (Mixed): 'Patient given Lisinopril 10mg daily. BP 120/80 mmHg. Glucose level at 5.5 mmol/L. Improvement of 20%.'
  Extracted: ['10mg', '120/80mmhg', '20%', '5.5mmol/l']

Test 2 (Temp & Weight): 'No fever noted. Temperature 37.5 C. Weight 70kg.'
  Extracted: ['37.5c', '70kg']

Test 3 (Dose, BP, Range): 'Dose is 2.5g. Pressure 130 / 90. Range 5-10 units.'
  Extracted: ['130/90', '2.5g', '5-10']

Test 4 (mcg, bpm): 'Administer 50mcg of medication. Heart rate is 70 bpm.'
  Extracted: ['50mcg', '70bpm']

Test 5 (Spaced Units): 'The value is 100. Another value 200ml. Then 300 mg.'
  Extracted: ['100', '200ml', '300mg']

--- Testing _preprocess_text_for_word_analysis ---
Original: 'Patient reports: NO fever, but denies headache. BP: 120/80 mmHg.'
Processed Tokens: ['patient', 'reports', 

## Chapter 4: Test Cases - Proving the Surveyor's Mettle

### 4.1. Defining Diverse Source Texts for Evaluation

To thoroughly evaluate our `HallucinationSurveyor`, we need a set of realistic (though mock) medical source texts. These will serve as the ground truth against which we test various summaries.

In [None]:
# Define Source Texts for Chapter 4 Evaluation

source_text_1_cardiology = """
Patient: Mr. John Smith, DOB: 1970-03-15, MRN: JS001
Chief Complaint: Follow-up, hypertension and hyperlipidemia.
History of Present Illness: Mr. Smith is a 54-year-old male with a known history of hypertension for 10 years and hyperlipidemia for 5 years.
He is currently prescribed Lisinopril 20mg daily and Atorvastatin 40mg daily.
He reports good medication adherence and denies any new symptoms such as chest pain, shortness of breath, palpitations, or dizziness.
He monitors his blood pressure at home; readings are generally around 135/85 mmHg.
He had a routine lipid panel last month: Total Cholesterol 180 mg/dL, LDL 100 mg/dL, HDL 45 mg/dL, Triglycerides 170 mg/dL.
Social History: Non-smoker, drinks alcohol occasionally (2-3 drinks per week). Exercises 3 times a week.
Allergies: Penicillin (causes rash).
Physical Examination: BP 138/88 mmHg, HR 72 bpm, regular. Lungs clear. Heart RRR, no murmurs.
Assessment: Hypertension, stable. Hyperlipidemia, LDL at goal but triglycerides slightly elevated.
Plan:
1. Continue Lisinopril 20mg daily.
2. Continue Atorvastatin 40mg daily.
3. Discussed lifestyle modifications for triglycerides, including reducing alcohol and refined carbohydrates.
4. No medication changes today.
5. Follow up in 6 months. Labs (lipid panel, CMP) to be done 1 week prior.
"""

source_text_2_diabetes = """
Patient: Ms. Emily White, DOB: 1965-09-20, MRN: EW002
Chief Complaint: Annual diabetes check-up.
History of Present Illness: Ms. White is a 59-year-old female diagnosed with Type 2 Diabetes Mellitus 8 years ago.
She is currently on Metformin 1000mg BID and Glipizide 5mg daily. She reports variable glucose readings, mostly between 150-200 mg/dL fasting.
She admits to occasional non-adherence with diet, especially around holidays. Denies symptoms of hypoglycemia.
Recent HbA1c (2 weeks ago) was 7.8%. Last eye exam was 1 year ago, showed mild non-proliferative diabetic retinopathy.
Last foot exam was 6 months ago, no neuropathy detected.
Medications: Metformin 1000mg BID, Glipizide 5mg daily, Lisinopril 10mg daily (for HTN), Aspirin 81mg daily.
Allergies: No Known Drug Allergies (NKDA).
Physical Examination: Weight 175 lbs, BP 130/78 mmHg. Feet: Sensation intact to monofilament, pulses good.
Assessment: Type 2 Diabetes Mellitus, suboptimally controlled (HbA1c 7.8%). Mild diabetic retinopathy. Hypertension, controlled.
Plan:
1. Emphasize importance of diet and consistent medication use.
2. Increase Metformin to 1000mg with breakfast and 1500mg with dinner.
3. Continue Glipizide 5mg daily.
4. Continue Lisinopril and Aspirin.
5. Refer to dietitian for counseling.
6. Schedule follow-up in 3 months with repeat HbA1c.
7. Annual eye exam due now, will place referral.
"""

source_text_3_ortho = """
Patient: Master Alex Green, DOB: 2014-07-01, MRN: AG003
Chief Complaint: Right wrist pain after fall.
History of Present Illness: Alex is a 10-year-old boy who fell off his skateboard 2 days ago, landing on his outstretched right hand.
He had immediate pain and swelling in the right wrist. Mother gave Ibuprofen 200mg which provided some relief.
No loss of consciousness, no other injuries reported. Pain is worse with movement. No numbness or tingling in fingers.
Past Medical History: Unremarkable. No previous fractures.
Allergies: None.
Physical Examination: Right wrist shows moderate swelling and tenderness, maximal over the distal radius.
Range of motion is limited by pain, especially supination and dorsiflexion. Snuffbox tenderness is equivocal.
Distal neurovascular status is intact.
X-ray Right Wrist (AP, Lateral, Oblique): Shows a non-displaced buckle fracture of the distal radius. Ulna appears intact.
Assessment: Acute non-displaced buckle fracture, distal radius, right.
Plan:
1. Immobilization in a volar wrist splint for 3-4 weeks.
2. Ibuprofen 200-400mg every 6-8 hours as needed for pain. Max 1200mg/day.
3. Ice and elevation.
4. No contact sports or activities that risk re-injury during healing.
5. Follow up with orthopedics in 1 week for splint check and further management.
If pain worsens or new symptoms (numbness, finger discoloration) develop, contact clinic sooner.
"""
print("Source texts for Chapter 4 defined: source_text_1_cardiology, source_text_2_diabetes, source_text_3_ortho")

Source texts for Chapter 4 defined: source_text_1_cardiology, source_text_2_diabetes, source_text_3_ortho


### 4.2. Evaluating Summaries with the `HallucinationSurveyor`

We will now test our `HallucinationSurveyor` against various types of summaries generated from `source_text_1_cardiology`. We'll instantiate our surveyor if not already done, or use the existing `surveyor_instance`.

In [None]:
# Ensure surveyor_instance is available
if 'surveyor_instance' not in globals() or not isinstance(surveyor_instance, HallucinationSurveyor):
    print("Re-initializing HallucinationSurveyor...")
    surveyor_instance = HallucinationSurveyor()
    print("HallucinationSurveyor instance ready.")
else:
    print("HallucinationSurveyor instance is already available.")

# Helper function to pretty print the survey report
import json
def pprint_survey_report(report):
    print(json.dumps(report, indent=2))

print("Surveyor instance and pretty-print helper ready.")

HallucinationSurveyor instance is already available.
Surveyor instance and pretty-print helper ready.


#### 4.2.1. Scenario 1: Good, Factual Summary

This summary aims to be factually accurate and well-grounded in `source_text_1_cardiology`. We expect a low hallucination score.
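
One way to turn that expectation into a check is to count the entries in the report lists that point towards hallucination. The sketch below is not part of `HallucinationSurveyor`: the helper name `count_red_flags` is hypothetical, and the key names are assumed from the survey report printed in this section.

```python
def count_red_flags(report):
    """Count entries in the report lists that suggest hallucinated content.

    Hypothetical helper: key names are assumed from the printed survey report,
    and missing sections are treated as empty.
    """
    entity = report.get("entity_coherence", {})
    numbers = report.get("numerical_consistency", {})
    negation = report.get("negation_consistency", {})
    flagged = (
        list(entity.get("hallucinated_critical_entities", []))
        + list(numbers.get("numbers_in_summary_not_in_source", []))
        + list(negation.get("potential_negation_issues", []))
    )
    return len(flagged)


# Toy report shaped like the survey output (values are illustrative only).
example_report = {
    "entity_coherence": {"hallucinated_critical_entities": ["diabetes"]},
    "numerical_consistency": {"numbers_in_summary_not_in_source": ["200mg"]},
    "negation_consistency": {"potential_negation_issues": []},
}
print(count_red_flags(example_report))
```

A well-grounded summary would be expected to yield zero here, while omissions (`numbers_in_source_not_in_summary`) are deliberately left out of the count because a summary is supposed to drop detail.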

In [None]:
# Scenario 4.2.1: Good Factual Summary for source_text_1_cardiology

summary_1_good = """
Mr. Smith, a 54-year-old male with hypertension and hyperlipidemia, is on Lisinopril 20mg and Atorvastatin 40mg daily.
He denies new symptoms. His home BP is around 135/85 mmHg, and clinic BP was 138/88 mmHg.
Recent lipids showed Total Cholesterol 180 mg/dL, LDL 100 mg/dL, HDL 45 mg/dL, and slightly elevated Triglycerides at 170 mg/dL.
He is a non-smoker, drinks occasionally, and exercises. He is allergic to Penicillin.
The plan is to continue current medications and discuss lifestyle changes for triglycerides, with a follow-up in 6 months.
"""

print("--- Running Survey for Scenario 4.2.1 (Good Summary) ---")
report_1_good = surveyor_instance.survey(source_text_1_cardiology, summary_1_good)

print("\n--- Survey Report for Good Summary (source_text_1_cardiology) ---")
pprint_survey_report(report_1_good)

--- Running Survey for Scenario 4.2.1 (Good Summary) ---

--- Starting Hallucination Survey ---
Source: 
Patient: Mr. John Smith, DOB: 1970-03-15, MRN: JS001
Chief Complaint:...
Summary: 
Mr. Smith, a 54-year-old male with hypertension and hyperlipidemia, i...
--- Hallucination Survey Complete ---

--- Survey Report for Good Summary (source_text_1_cardiology) ---
{
  "entity_coherence": {
    "hallucinated_critical_entities": [],
    "omitted_critical_entities": [],
    "potentially_fabricated_general_terms": [
      "allergic",
      "clinic",
      "current",
      "discuss",
      "lipids",
      "medications",
      "recent",
      "showed"
    ]
  },
  "numerical_consistency": {
    "numbers_in_summary_not_in_source": [],
    "numbers_in_source_not_in_summary": [
      "001",
      "1",
      "10",
      "1970-03-15",
      "2",
      "2-3",
      "3",
      "4",
      "5",
      "72bpm"
    ]
  },
  "negation_consistency": {
    "potential_negation_issues": [],
    "source_negate

#### 4.2.2. Scenario 2: Summary with Fabricated Critical Entity

This summary introduces a critical medical condition (`diabetes`) not mentioned in `source_text_1_cardiology`. We expect the `entity_coherence` check to flag this clearly.
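
To make the expected behaviour concrete, here is a minimal sketch of a critical-entity check as a set difference. It is an assumption about how such a check can work, not the notebook's actual implementation; `find_hallucinated_terms` and the tokenization rule are hypothetical.

```python
import re


def find_hallucinated_terms(source, summary, critical_terms):
    """Return critical terms that appear in the summary but not in the source.

    Hypothetical sketch: matching is done on lowercase word tokens, so
    multi-word terms are not handled.
    """
    def present(text):
        tokens = set(re.findall(r"[a-z0-9-]+", text.lower()))
        return {term for term in critical_terms if term in tokens}

    return sorted(present(summary) - present(source))


demo_source = "Patient has hypertension and hyperlipidemia."
demo_summary = "Patient has hypertension and newly diagnosed diabetes."
demo_terms = {"hypertension", "diabetes", "cancer"}
print(find_hallucinated_terms(demo_source, demo_summary, demo_terms))
```

Swapping the two arguments gives the complementary check for omitted critical terms.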

In [None]:
# Scenario 4.2.2: Fabricated Critical Entity (Diabetes)

summary_2_fabricated_entity = """
Mr. Smith, a 54-year-old male with hypertension, hyperlipidemia, and newly diagnosed diabetes, is on Lisinopril 20mg and Atorvastatin 40mg daily.
He denies chest pain. His clinic BP was 138/88 mmHg.
Recent lipids were LDL 100 mg/dL. He is allergic to Penicillin.
Plan is to continue current medications and manage his diabetes. Follow up in 6 months.
"""

print("--- Running Survey for Scenario 4.2.2 (Fabricated Entity) ---")
report_2_fabricated = surveyor_instance.survey(source_text_1_cardiology, summary_2_fabricated_entity)

print("\n--- Survey Report for Fabricated Entity Summary (source_text_1_cardiology) ---")
pprint_survey_report(report_2_fabricated)

--- Running Survey for Scenario 4.2.2 (Fabricated Entity) ---

--- Starting Hallucination Survey ---
Source: 
Patient: Mr. John Smith, DOB: 1970-03-15, MRN: JS001
Chief Complaint:...
Summary: 
Mr. Smith, a 54-year-old male with hypertension, hyperlipidemia, and ...
--- Hallucination Survey Complete ---

--- Survey Report for Fabricated Entity Summary (source_text_1_cardiology) ---
{
  "entity_coherence": {
    "hallucinated_critical_entities": [
      "diabetes"
    ],
    "omitted_critical_entities": [],
    "potentially_fabricated_general_terms": [
      "allergic",
      "clinic",
      "current",
      "diagnosed",
      "lipids",
      "manage",
      "medications",
      "newly",
      "recent"
    ]
  },
  "numerical_consistency": {
    "numbers_in_summary_not_in_source": [],
    "numbers_in_source_not_in_summary": [
      "001",
      "1",
      "10",
      "135/85mmhg",
      "170mg",
      "180mg",
      "1970-03-15",
      "2",
      "2-3",
      "3",
      "4",
      "45mg"

#### 4.2.3. Scenario 3: Summary with Incorrect Numerical Value

This summary alters a key numerical value from `source_text_1_cardiology` (Lisinopril dosage). We expect the `numerical_consistency` check to highlight this discrepancy.
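
The idea behind a numerical consistency check can be sketched as follows: extract normalized number-plus-unit phrases from both texts and compare the sets. This is a simplified illustration, not the notebook's `_extract_numerical_phrases`; the unit list, the regular expression, and the function names are assumptions.

```python
import re

# Assumed, deliberately short unit list; longer alternatives come first.
UNITS = ("mg/dl", "mmhg", "mcg", "mg", "ml", "bpm", "kg", "g", "%")
_UNIT_ALT = "|".join(re.escape(u) for u in UNITS)
# A number (optionally decimal, optionally "a/b"), then an optional unit
# that must not run into further letters.
_NUMBER = re.compile(
    rf"(\d+(?:\.\d+)?(?:/\d+(?:\.\d+)?)?)(?:\s*({_UNIT_ALT})(?![a-z]))?"
)


def extract_numbers(text):
    """Return normalized phrases such as '20mg' or '138/88mmhg'."""
    return {value + unit for value, unit in _NUMBER.findall(text.lower())}


def numbers_not_in_source(source, summary):
    """Numerical phrases that the summary contains but the source does not."""
    return sorted(extract_numbers(summary) - extract_numbers(source))


demo_source = "Prescribed Lisinopril 20mg daily and Atorvastatin 40mg daily. BP 138/88 mmHg."
demo_summary = "Managed with Lisinopril 200mg daily and Atorvastatin 40mg daily."
print(numbers_not_in_source(demo_source, demo_summary))
```

Because the comparison is on whole normalized phrases, `200mg` does not match `20mg`, which is exactly the kind of dosage error this scenario targets.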

In [None]:
# Scenario 4.2.3: Incorrect Numerical Value (Lisinopril dosage)

summary_3_incorrect_number = """
Mr. Smith, 54, has hypertension and hyperlipidemia, managed with Lisinopril 200mg daily and Atorvastatin 40mg daily.
He denies new symptoms. His home BP is around 135/85 mmHg.
Recent lipids showed LDL 100 mg/dL. He is allergic to Penicillin.
Plan: Continue medications. Follow up in 6 months.
"""

print("--- Running Survey for Scenario 4.2.3 (Incorrect Number) ---")
report_3_incorrect_number = surveyor_instance.survey(source_text_1_cardiology, summary_3_incorrect_number)

print("\n--- Survey Report for Incorrect Number Summary (source_text_1_cardiology) ---")
pprint_survey_report(report_3_incorrect_number)

--- Running Survey for Scenario 4.2.3 (Incorrect Number) ---

--- Starting Hallucination Survey ---
Source: 
Patient: Mr. John Smith, DOB: 1970-03-15, MRN: JS001
Chief Complaint:...
Summary: 
Mr. Smith, 54, has hypertension and hyperlipidemia, managed with Lisi...
--- Hallucination Survey Complete ---

--- Survey Report for Incorrect Number Summary (source_text_1_cardiology) ---
{
  "entity_coherence": {
    "hallucinated_critical_entities": [],
    "omitted_critical_entities": [],
    "potentially_fabricated_general_terms": [
      "200mg",
      "allergic",
      "lipids",
      "managed",
      "medications",
      "recent",
      "showed"
    ]
  },
  "numerical_consistency": {
    "numbers_in_summary_not_in_source": [
      "200mg"
    ],
    "numbers_in_source_not_in_summary": [
      "001",
      "1",
      "10",
      "138/88mmhg",
      "170mg",
      "180mg",
      "1970-03-15",
      "2",
      "2-3",
      "20mg",
      "3",
      "4",
      "45mg",
      "5",
      "72bpm"

#### 4.2.4. Scenario 4: Summary with Flipped Negation

This summary flips a negation from `source_text_1_cardiology`: it states that the patient has *no* Penicillin allergy, whereas the source records a Penicillin allergy (rash). We expect `negation_consistency` to flag this.
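
A simple way to picture a negation check is a window search: a term counts as negated if a negation cue appears within a few tokens before it in the same sentence. The sketch below illustrates that idea only; the cue list, the window size, and both function names are assumptions rather than the surveyor's real logic.

```python
import re

# Assumed cue list; a real checker would need a richer one.
NEGATION_CUES = {"no", "not", "denies", "without", "negative"}


def is_negated(text, term, window=4):
    """True if `term` has a negation cue among the `window` tokens before it."""
    term = term.lower()
    for sentence in re.split(r"[.!?\n]+", text.lower()):
        tokens = re.findall(r"[a-z0-9-]+", sentence)
        for i, token in enumerate(tokens):
            if token == term and NEGATION_CUES & set(tokens[max(0, i - window):i]):
                return True
    return False


def negation_flipped(source, summary, term):
    """True if the term occurs in both texts but is negated in only one."""
    term = term.lower()
    in_both = term in source.lower() and term in summary.lower()
    return in_both and is_negated(source, term) != is_negated(summary, term)


demo_source = "Allergies: Penicillin (causes rash)."
demo_flipped = "He has no allergy to Penicillin."
demo_faithful = "He is allergic to Penicillin."
print(negation_flipped(demo_source, demo_flipped, "penicillin"))
print(negation_flipped(demo_source, demo_faithful, "penicillin"))
```

A window-based rule is easy to reason about but will miss long-range or implicit negation, so its findings are best treated as candidates for review.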

In [None]:
# Scenario 4.2.4: Flipped Negation (Allergy status)

summary_4_flipped_negation = """
Mr. Smith, 54, has hypertension and hyperlipidemia, on Lisinopril 20mg and Atorvastatin 40mg.
He denies new symptoms. His home BP is around 135/85 mmHg.
Recent lipids were LDL 100 mg/dL. He has no allergy to Penicillin.
Plan: Continue medications. Follow up in 6 months.
"""
# Source says: Allergies: Penicillin (causes rash).
# Summary says: He has no allergy to Penicillin.
# "Penicillin" is not in self.critical_medical_terms by default. Let's add it for this test.

# Temporarily add 'penicillin' to critical_medical_terms for this specific test
# to ensure the negation check focuses on it.
original_critical_terms = surveyor_instance.critical_medical_terms.copy()
surveyor_instance.critical_medical_terms.add("penicillin")
print(f"Temporarily added 'penicillin' to critical terms. Current critical terms: {surveyor_instance.critical_medical_terms}")

print("\n--- Running Survey for Scenario 4.2.4 (Flipped Negation) ---")
try:
    report_4_flipped_negation = surveyor_instance.survey(source_text_1_cardiology, summary_4_flipped_negation)
finally:
    # Restore the original critical terms even if the survey raises
    surveyor_instance.critical_medical_terms = original_critical_terms
print(f"\nRestored original critical terms. Current critical terms: {surveyor_instance.critical_medical_terms}")

print("\n--- Survey Report for Flipped Negation Summary (source_text_1_cardiology) ---")
pprint_survey_report(report_4_flipped_negation)

Temporarily added 'penicillin' to critical terms. Current critical terms: {'covid-19', 'pneumonia', 'diabetes', 'malignant', 'syndrome', 'infection', 'infarction', 'sepsis', 'embolism', 'benign', 'fracture', 'allergy', 'hypertension', 'tumor', 'hiv', 'penicillin', 'cancer', 'stroke', 'anaphylaxis', 'metastasis'}

--- Running Survey for Scenario 4.2.4 (Flipped Negation) ---

--- Starting Hallucination Survey ---
Source: 
Patient: Mr. John Smith, DOB: 1970-03-15, MRN: JS001
Chief Complaint:...
Summary: 
Mr. Smith, 54, has hypertension and hyperlipidemia, on Lisinopril 20m...
--- Hallucination Survey Complete ---

Restored original critical terms. Current critical terms: {'covid-19', 'pneumonia', 'diabetes', 'malignant', 'syndrome', 'infection', 'infarction', 'sepsis', 'embolism', 'benign', 'fracture', 'allergy', 'hypertension', 'tumor', 'hiv', 'cancer', 'stroke', 'anaphylaxis', 'metastasis'}

--- Survey Report for Flipped Negation Summary (source_text_1_cardiology) ---
{
  "entity_cohere

#### 4.2.5. Scenario 5: Summary with Omission of Critical Information

This summary omits a critical piece of information from `source_text_1_cardiology` – the fact that the patient is on Atorvastatin for hyperlipidemia. We expect `omitted_critical_entities` to flag "atorvastatin" (if we consider it critical) or `numbers_in_source_not_in_summary` to flag its dosage ("40mg").
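
Since whether a drug name counts as critical is a per-test decision, it can help to scope that decision explicitly. The following is a sketch of a context manager that adds terms for the duration of a block and always restores the original set. The name `temporary_critical_terms` is hypothetical, and the only assumption about the surveyor is that it exposes a `critical_medical_terms` set, as the test cells in this chapter use it.

```python
from contextlib import contextmanager


@contextmanager
def temporary_critical_terms(surveyor, *extra_terms):
    """Temporarily extend surveyor.critical_medical_terms (hypothetical helper)."""
    original = surveyor.critical_medical_terms
    surveyor.critical_medical_terms = set(original) | {t.lower() for t in extra_terms}
    try:
        yield surveyor
    finally:
        # Restore the untouched original set, even if the body raised.
        surveyor.critical_medical_terms = original


class _StubSurveyor:
    """Stand-in with the one attribute the helper relies on."""
    def __init__(self):
        self.critical_medical_terms = {"hypertension", "diabetes"}


stub = _StubSurveyor()
with temporary_critical_terms(stub, "Atorvastatin"):
    inside = sorted(stub.critical_medical_terms)
after = sorted(stub.critical_medical_terms)
print(inside)
print(after)
```

With a real instance, the survey call would go inside the `with` block in place of the line that records `inside`.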

In [None]:
# Scenario 4.2.5: Omission of Critical Information (Atorvastatin)

summary_5_omission = """
Mr. Smith, a 54-year-old male with hypertension, is managed with Lisinopril 20mg daily.
He denies new symptoms and his home BP is around 135/85 mmHg.
He is allergic to Penicillin.
Plan: Continue Lisinopril. Follow up in 6 months.
"""
# Source states: "currently prescribed Lisinopril 20mg daily AND Atorvastatin 40mg daily."
# The summary omits Atorvastatin entirely.

# To make this test more direct for entity omission, let's ensure 'atorvastatin'
# is a critical term for this run.
original_critical_terms = surveyor_instance.critical_medical_terms.copy()
surveyor_instance.critical_medical_terms.add("atorvastatin") # Add specific drug
print(f"Temporarily added 'atorvastatin' to critical terms. Current: {surveyor_instance.critical_medical_terms}")

print("\n--- Running Survey for Scenario 4.2.5 (Omission) ---")
try:
    report_5_omission = surveyor_instance.survey(source_text_1_cardiology, summary_5_omission)
finally:
    # Restore the original critical terms even if the survey raises
    surveyor_instance.critical_medical_terms = original_critical_terms
print(f"\nRestored original critical terms. Current: {surveyor_instance.critical_medical_terms}")

print("\n--- Survey Report for Omission Summary (source_text_1_cardiology) ---")
pprint_survey_report(report_5_omission)

Temporarily added 'atorvastatin' to critical terms. Current: {'covid-19', 'pneumonia', 'diabetes', 'malignant', 'syndrome', 'infection', 'atorvastatin', 'infarction', 'sepsis', 'embolism', 'benign', 'fracture', 'allergy', 'hypertension', 'tumor', 'hiv', 'cancer', 'stroke', 'anaphylaxis', 'metastasis'}

--- Running Survey for Scenario 4.2.5 (Omission) ---

--- Starting Hallucination Survey ---
Source: 
Patient: Mr. John Smith, DOB: 1970-03-15, MRN: JS001
Chief Complaint:...
Summary: 
Mr. Smith, a 54-year-old male with hypertension, is managed with Lisi...
--- Hallucination Survey Complete ---

Restored original critical terms. Current: {'covid-19', 'pneumonia', 'diabetes', 'malignant', 'syndrome', 'infection', 'infarction', 'sepsis', 'embolism', 'benign', 'fracture', 'allergy', 'hypertension', 'tumor', 'hiv', 'cancer', 'stroke', 'anaphylaxis', 'metastasis'}

--- Survey Report for Omission Summary (source_text_1_cardiology) ---
{
  "entity_coherence": {
    "hallucinated_critical_entitie

### 5.1. Selecting Lightweight LLMs and Setting up for Summarization

To demonstrate the `HallucinationSurveyor` in a practical model evaluation context, we will use a few lightweight pre-trained language models from the Hugging Face Hub. We'll select models known for their summarization capabilities (or general text generation that can be prompted for summaries) and that are manageable within the Colab free tier.

We need to install `transformers`, `sentencepiece`, and `torch`, then import `transformers` and `torch`.

In [None]:
print("Installing Hugging Face libraries: transformers, sentencepiece, and torch...")
# Using subprocess for cleaner install messages if preferred, or direct !pip

# Direct pip install is fine for Colab
!pip install transformers sentencepiece --quiet
!pip install torch --quiet # PyTorch; often a dependency or useful with transformers

print("Installation complete (or packages already present).")

try:
    import transformers
    import torch
    print(f"Transformers version: {transformers.__version__}")
    print(f"PyTorch version: {torch.__version__}")
    print("Successfully imported transformers and torch.")
    HF_LIBS_AVAILABLE = True
except ImportError:
    print("ERROR: Could not import transformers or torch. LLM summarization will not be possible.")
    HF_LIBS_AVAILABLE = False

# Define the models we'll try (lightweight ones).
# Pegasus is good for summarization but can be larger;
# DistilBART is smaller, and t5-small is also a good option.

# For a quick demo we use one very small decoder-only model and one small seq2seq model:
# - EleutherAI/gpt-neo-125m is decoder-only and can be prompted for a summary.
# - sshleifer/distilbart-cnn-6-6 is a smaller version of BART, good for summarization.
MODELS_TO_TEST = {
    "GPT-Neo-125M (Decoder-Only)": "EleutherAI/gpt-neo-125m",
    "DistilBART-CNN-6-6 (Seq2Seq)": "sshleifer/distilbart-cnn-6-6"
    # Can add T5-small if time permits and it loads quickly: "t5-small"
}

# We will use the summarization pipeline for seq2seq models
# and manual prompting for decoder-only models.
if HF_LIBS_AVAILABLE:
    from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM
    print("Hugging Face pipeline and model/tokenizer classes imported.")

print(f"\nHF_LIBS_AVAILABLE: {HF_LIBS_AVAILABLE}")
print(f"Models selected for testing: {list(MODELS_TO_TEST.keys())}")

Installing Hugging Face libraries: transformers, sentencepiece, and torch...
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m21.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m16.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m27.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m827.9 kB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m10.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m127.9/127.9 MB[0m [3

### 5.2. Implementing Summary Generation with Selected LLMs

We'll now define functions to load the selected models and their tokenizers, and to generate summaries. For sequence-to-sequence models like DistilBART, we can use the `summarization` pipeline. For decoder-only models like GPT-Neo, we'll construct a prompt and use its text generation capabilities.
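
For the decoder-only case, the prompt, the document, and the generated summary all share one context window, so the document should be trimmed by token count rather than by word count. The sketch below shows one way to do that under stated assumptions: `build_budgeted_prompt` is a hypothetical helper, `encode` and `decode` are any pair of callables that map text to token ids and back, and the prompt wording simply mirrors the one used for GPT-Neo in this notebook.

```python
def build_budgeted_prompt(document, encode, decode, context_len, max_new_tokens,
                          header="Summarize the following medical report:\n\nReport:\n",
                          footer="\n\nSummary:\n"):
    """Trim `document` so header + document + footer + generated tokens fit.

    Hypothetical helper: only the document is shortened, so the trailing
    "Summary:" cue is never cut off.
    """
    overhead = len(encode(header)) + len(encode(footer))
    budget = context_len - max_new_tokens - overhead
    if budget <= 0:
        raise ValueError("No room left for the document; lower max_new_tokens.")
    doc_ids = encode(document)
    if len(doc_ids) > budget:
        document = decode(doc_ids[:budget])
    return header + document + footer


# Demo with a whitespace "tokenizer" so the example runs without any model.
toy_encode = str.split
toy_decode = " ".join
toy_document = " ".join(f"w{i}" for i in range(1, 13))  # 12 "tokens"
toy_prompt = build_budgeted_prompt(toy_document, toy_encode, toy_decode,
                                   context_len=20, max_new_tokens=5)
print(toy_prompt)
```

With a Hugging Face tokenizer, `encode` could be `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `decode` could be `tokenizer.decode`. Token counts of the concatenated prompt can differ slightly from the sum of the parts, so leaving a few tokens of slack is a reasonable precaution.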

In [None]:
# Ensure HF_LIBS_AVAILABLE is True before using these functions
# from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM, pipeline
# import torch # Should already be imported

def load_hf_model_and_tokenizer(model_name_path, model_type="seq2seq"):
    """Loads a Hugging Face model and tokenizer."""
    if not HF_LIBS_AVAILABLE:
        print(f"Hugging Face libraries not available. Cannot load {model_name_path}.")
        return None, None

    print(f"Attempting to load model and tokenizer for: {model_name_path} (type: {model_type})")
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name_path)

        if model_type == "seq2seq":
            model = AutoModelForSeq2SeqLM.from_pretrained(model_name_path)
        elif model_type == "decoder-only":
            model = AutoModelForCausalLM.from_pretrained(model_name_path)
        else:
            print(f"Unsupported model_type: {model_type}. Please use 'seq2seq' or 'decoder-only'.")
            return None, None

        print(f"Successfully loaded tokenizer and model for {model_name_path}.")
        return model, tokenizer
    except Exception as e:
        print(f"Error loading model/tokenizer {model_name_path}: {e}")
        return None, None

def generate_summary_with_hf_llm(model_info, text_to_summarize, max_length=150, min_length=30):
    """
    Generates a summary using a loaded Hugging Face model.
    model_info can be a tuple (model, tokenizer, model_type_str) or a pipeline.
    """
    if not HF_LIBS_AVAILABLE:
        print("Hugging Face libraries not available. Cannot generate summary.")
        return "Error: HF Libraries not available."

    model, tokenizer, model_type_str = None, None, None
    summarizer_pipeline = None

    if isinstance(model_info, tuple) and len(model_info) == 3:
        model, tokenizer, model_type_str = model_info
        if model is None or tokenizer is None:
            print("Model or tokenizer not loaded properly.")
            return "Error: Model/Tokenizer not loaded."
    elif isinstance(model_info, transformers.pipelines.base.Pipeline):
        summarizer_pipeline = model_info
        # Attempt to infer model_type_str from pipeline task or model name for logging
        if hasattr(summarizer_pipeline.model.config, '_name_or_path'):
            model_type_str = f"pipeline based on {summarizer_pipeline.model.config._name_or_path}"
        else:
            model_type_str = "pipeline (model type unknown)"

    else:
        print("Invalid model_info provided. Must be (model, tokenizer, type) or a pipeline.")
        return "Error: Invalid model setup."

    print(f"\nGenerating summary for text (first 50 chars): '{text_to_summarize[:50]}...' using {model_type_str}")

    summary_text = ""
    try:
        if summarizer_pipeline: # If it's a summarization pipeline
            # Pipelines handle tokenization and decoding internally
            # Some pipelines might not support min_length directly in the call in older versions
            # The device was fixed when the pipeline was created, so report it rather than recomputing it
            print(f"Using device: {summarizer_pipeline.device} for pipeline.")

            # Ensure text is not overly long for the pipeline's default model max length
            # This is a general precaution; specific models have different limits.
            # DistilBART's tokenizer has model_max_length often around 1024

            # Truncate input text if it's excessively long to prevent OOM or very slow processing
            # This is a heuristic limit for these small models.
            max_input_tokens = 1024
            # A simple way to check, though not perfectly accurate for token count
            if len(text_to_summarize.split()) > max_input_tokens * 0.7: # Heuristic
                 print(f"Warning: Input text is long, truncating for summarizer pipeline to approx first {int(max_input_tokens*0.7)} words.")
                 text_to_summarize = " ".join(text_to_summarize.split()[:int(max_input_tokens*0.7)])

            result = summarizer_pipeline(text_to_summarize, max_length=max_length, min_length=min_length, truncation=True)
            summary_text = result[0]['summary_text']

        elif model_type_str == "decoder-only" and model and tokenizer:
            # For decoder-only models, we need to prompt for a summary
            prompt = f"Summarize the following medical report:\n\nReport:\n{text_to_summarize}\n\nSummary:\n"

            # Truncate prompt if too long for the model
            # GPT-Neo-125M has a context length of 2048 tokens
            # We need to leave space for the generated summary (max_length)
            # tokenizer.model_max_length usually gives the context window size, but some
            # tokenizers report a huge placeholder value when no limit is configured, so cap it.
            model_max_len = min(getattr(tokenizer, 'model_max_length', 2048), 2048)

            # Calculate available length for text_to_summarize within the prompt
            # Simplified: Assume prompt structure takes ~50 tokens. Summary output ~max_length tokens.
            # Available for document = model_max_len - 50 - max_length
            # This is very rough. A better way is to tokenize parts and assemble.

            # Simpler truncation: just ensure prompt isn't excessively long
            if len(prompt.split()) > model_max_len * 0.6: # Rough estimate
                 print(f"Warning: Prompt is long, truncating for decoder-only model.")
                 # A very crude way to truncate the document part of the prompt
                 doc_part = text_to_summarize
                 if len(doc_part.split()) > model_max_len * 0.4:
                     doc_part = " ".join(doc_part.split()[:int(model_max_len*0.4)])
                 prompt = f"Summarize the following medical report:\n\nReport:\n{doc_part}\n\nSummary:\n"


            inputs = tokenizer.encode(prompt, return_tensors="pt", truncation=True, max_length=model_max_len - max_length - 5) # Leave space for generation

            if torch.cuda.is_available():
                inputs = inputs.to('cuda')
                model.to('cuda')
                print("Using CUDA for decoder-only model generation.")
            else:
                print("Using CPU for decoder-only model generation.")

            # Generate summary
            # For decoder-only, max_length for generate is the total length of prompt+output
            # So we need output_ids = model.generate(inputs, max_length=len(inputs[0]) + max_length, ...)
            summary_ids = model.generate(
                inputs,
                max_length=len(inputs[0]) + max_length, # Desired total length
                min_length=len(inputs[0]) + min_length, # Ensure some new text is generated
                num_beams=4,
                early_stopping=True,
                no_repeat_ngram_size=3, # To reduce repetition
                pad_token_id=tokenizer.eos_token_id # Important for open-ended generation
            )
            # Decode only the generated part (after the prompt)
            summary_text = tokenizer.decode(summary_ids[0][inputs.shape[-1]:], skip_special_tokens=True)
        else:
            return f"Error: Model type {model_type_str} not handled or model/tokenizer missing."

        print(f"Generated summary: {summary_text[:100]}...") # Print first 100 chars of summary
        return summary_text.strip()

    except Exception as e:
        print(f"Error during summary generation with {model_type_str}: {e}")
        # Try to print some GPU memory info if it's an OOM error
        if "CUDA out of memory" in str(e) and torch.cuda.is_available():
            print(f"CUDA OOM: Total Memory: {torch.cuda.get_device_properties(0).total_memory/1e9:.2f} GB, "
                  f"Allocated: {torch.cuda.memory_allocated(0)/1e9:.2f} GB, "
                  f"Reserved: {torch.cuda.memory_reserved(0)/1e9:.2f} GB")
        return f"Error generating summary: {e}"

print("Helper functions for LLM loading and summarization defined.")
if not HF_LIBS_AVAILABLE:
    print("WARNING: Hugging Face libraries are not available, these functions will not work.")

Helper functions for LLM loading and summarization defined.


### 5.3. Generating and Evaluating Summaries - Model 1: DistilBART-CNN-6-6

Our first model for testing is `sshleifer/distilbart-cnn-6-6`, a distilled version of BART fine-tuned on the CNN/DailyMail dataset for summarization. We'll use the Hugging Face `pipeline` for this.

In [None]:
# Ensure HF_LIBS_AVAILABLE is True from previous cell
# from transformers import pipeline
# import torch # Should already be imported

llm_summaries = {} # Dictionary to store summaries from different models

if HF_LIBS_AVAILABLE:
    model_name_distilbart = MODELS_TO_TEST["DistilBART-CNN-6-6 (Seq2Seq)"] # "sshleifer/distilbart-cnn-6-6"
    print(f"--- Attempting to load Summarization Pipeline for: {model_name_distilbart} ---")

    summarizer_distilbart = None
    try:
        # Load the summarization pipeline
        # Explicitly setting device if GPU is available
        device = 0 if torch.cuda.is_available() else -1
        summarizer_distilbart = pipeline("summarization", model=model_name_distilbart, tokenizer=model_name_distilbart, device=device)
        print(f"Summarization pipeline for {model_name_distilbart} loaded successfully on device {'cuda:0' if device == 0 else 'cpu'}.")
    except Exception as e:
        print(f"Error loading summarization pipeline for {model_name_distilbart}: {e}")
        # Try to print some GPU memory info if it's an OOM error
        if "CUDA out of memory" in str(e) and torch.cuda.is_available():
            print(f"CUDA OOM: Total Memory: {torch.cuda.get_device_properties(0).total_memory/1e9:.2f} GB, "
                  f"Allocated: {torch.cuda.memory_allocated(0)/1e9:.2f} GB, "
                  f"Reserved: {torch.cuda.memory_reserved(0)/1e9:.2f} GB")

    if summarizer_distilbart:
        print(f"\n--- Generating summary for source_text_1_cardiology with {model_name_distilbart} ---")
        # Using our generic helper function which can now accept a pipeline
        # The helper function 'generate_summary_with_hf_llm' itself has truncation logic
        summary_distilbart = generate_summary_with_hf_llm(
            model_info=summarizer_distilbart,
            text_to_summarize=source_text_1_cardiology,
            max_length=180, # Slightly longer max length for medical text
            min_length=50   # Reasonably long min length
        )

        llm_summaries["DistilBART-CNN-6-6"] = summary_distilbart
        print("\n--- Summary from DistilBART-CNN-6-6: ---")
        print(summary_distilbart)
    else:
        print(f"Skipping summary generation as {model_name_distilbart} pipeline failed to load.")
        llm_summaries["DistilBART-CNN-6-6"] = "Error: Model pipeline failed to load."
else:
    print("Skipping DistilBART summary generation as Hugging Face libraries are not available.")
    llm_summaries["DistilBART-CNN-6-6"] = "Error: HF Libraries not available."

--- Attempting to load Summarization Pipeline for: sshleifer/distilbart-cnn-6-6 ---


Device set to use cpu


Summarization pipeline for sshleifer/distilbart-cnn-6-6 loaded successfully on device cpu.

--- Generating summary for source_text_1_cardiology with sshleifer/distilbart-cnn-6-6 ---

Generating summary for text (first 50 chars): '
Patient: Mr. John Smith, DOB: 1970-03-15, MRN: JS...' using pipeline based on sshleifer/distilbart-cnn-6-6
Using device: cpu for pipeline.
Generated summary:  Mr. John Smith, 54, is a 54-year-old male with a known history of hypertension for 10 years and hyp...

--- Summary from DistilBART-CNN-6-6: ---
Mr. John Smith, 54, is a 54-year-old male with a known history of hypertension for 10 years and hyperlipidemia for 5 years . He is currently prescribed Lisinopril 20mg daily and Atorvastatin 40mg daily .


In [None]:
# Ensure surveyor_instance and llm_summaries are available from previous cells

if 'surveyor_instance' in globals() and 'llm_summaries' in globals() and "DistilBART-CNN-6-6" in llm_summaries:
    summary_to_evaluate = llm_summaries["DistilBART-CNN-6-6"]
    # Failure strings start with "Error" but not always "Error:" (e.g. "Error generating summary: ...")
    if not summary_to_evaluate.startswith("Error"): # Check if summary generation was successful
        print(f"\n--- Applying HallucinationSurveyor to DistilBART summary for source_text_1_cardiology ---")

        # We won't modify critical_medical_terms for this general LLM evaluation
        # unless a specific hypothesis about a term needs testing.
        # Default critical terms will be used.

        report_distilbart = surveyor_instance.survey(source_text_1_cardiology, summary_to_evaluate)

        print("\n--- Survey Report for DistilBART-CNN-6-6 Summary ---")
        pprint_survey_report(report_distilbart)
    else:
        print(f"Skipping survey for DistilBART as summary generation failed: {summary_to_evaluate}")
else:
    print("Skipping survey for DistilBART: surveyor_instance or summary not found.")


--- Applying HallucinationSurveyor to DistilBART summary for source_text_1_cardiology ---

--- Starting Hallucination Survey ---
Source: 
Patient: Mr. John Smith, DOB: 1970-03-15, MRN: JS001
Chief Complaint:...
Summary: Mr. John Smith, 54, is a 54-year-old male with a known history of hype...
--- Hallucination Survey Complete ---

--- Survey Report for DistilBART-CNN-6-6 Summary ---
{
  "entity_coherence": {
    "hallucinated_critical_entities": [],
    "omitted_critical_entities": [],
    "potentially_fabricated_general_terms": []
  },
  "numerical_consistency": {
    "numbers_in_summary_not_in_source": [],
    "numbers_in_source_not_in_summary": [
      "001",
      "1",
      "100mg",
      "135/85mmhg",
      "138/88mmhg",
      "170mg",
      "180mg",
      "1970-03-15",
      "2",
      "2-3",
      "3",
      "4",
      "45mg",
      "6",
      "72bpm"
    ]
  },
  "negation_consistency": {
    "potential_negation_issues": [],
    "source_negated_phrases_critical": [
      "...

### 5.4. Generating and Evaluating Summaries - Model 2: GPT-Neo-125M

Next, we test `EleutherAI/gpt-neo-125m`, a decoder-only model. We will prompt it to generate a summary for `source_text_1_cardiology`. Decoder-only models can sometimes be more prone to rambling or less structured output if not carefully prompted and constrained, making this an interesting test for our surveyor.

In [None]:
# Ensure HF_LIBS_AVAILABLE, surveyor_instance, source_text_1_cardiology,
# load_hf_model_and_tokenizer, generate_summary_with_hf_llm, and pprint_survey_report are available.

if HF_LIBS_AVAILABLE:
    model_name_gpt_neo = MODELS_TO_TEST["GPT-Neo-125M (Decoder-Only)"] # "EleutherAI/gpt-neo-125m"
    print(f"--- Attempting to load Model & Tokenizer for: {model_name_gpt_neo} ---")

    # Manually load model and tokenizer for decoder-only
    gpt_neo_model, gpt_neo_tokenizer = load_hf_model_and_tokenizer(model_name_gpt_neo, model_type="decoder-only")

    if gpt_neo_model and gpt_neo_tokenizer:
        # Set pad_token_id to eos_token_id if not already set (important for open-ended generation)
        if gpt_neo_tokenizer.pad_token is None:
            gpt_neo_tokenizer.pad_token = gpt_neo_tokenizer.eos_token
            gpt_neo_model.config.pad_token_id = gpt_neo_model.config.eos_token_id
            print(f"Set pad_token_id to eos_token_id ({gpt_neo_tokenizer.eos_token_id}) for {model_name_gpt_neo}")

        print(f"\n--- Generating summary for source_text_1_cardiology with {model_name_gpt_neo} ---")
        summary_gpt_neo = generate_summary_with_hf_llm(
            model_info=(gpt_neo_model, gpt_neo_tokenizer, "decoder-only"),
            text_to_summarize=source_text_1_cardiology,
            max_length=200, # Allow a bit more length for decoder-only
            min_length=60
        )

        llm_summaries["GPT-Neo-125M"] = summary_gpt_neo
        print("\n--- Summary from GPT-Neo-125M: ---")
        print(summary_gpt_neo)

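        # Failure strings start with "Error" but not always "Error:" (e.g. "Error generating summary: ...")
        if not summary_gpt_neo.startswith("Error"):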
            print(f"\n--- Applying HallucinationSurveyor to GPT-Neo-125M summary ---")
            report_gpt_neo = surveyor_instance.survey(source_text_1_cardiology, summary_gpt_neo)

            print("\n--- Survey Report for GPT-Neo-125M Summary ---")
            pprint_survey_report(report_gpt_neo)
        else:
            print(f"Skipping survey for GPT-Neo as summary generation failed: {summary_gpt_neo}")
    else:
        print(f"Skipping summary generation and survey as {model_name_gpt_neo} model/tokenizer failed to load.")
        llm_summaries["GPT-Neo-125M"] = "Error: Model/tokenizer failed to load."
else:
    print("Skipping GPT-Neo summary generation and survey as Hugging Face libraries are not available.")
    llm_summaries["GPT-Neo-125M"] = "Error: HF Libraries not available."

--- Attempting to load Model & Tokenizer for: EleutherAI/gpt-neo-125m ---
Attempting to load model and tokenizer for: EleutherAI/gpt-neo-125m (type: decoder-only)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Successfully loaded tokenizer and model for EleutherAI/gpt-neo-125m.
Set pad_token_id to eos_token_id (50256) for EleutherAI/gpt-neo-125m

--- Generating summary for source_text_1_cardiology with EleutherAI/gpt-neo-125m ---

Generating summary for text (first 50 chars): '
Patient: Mr. John Smith, DOB: 1970-03-15, MRN: JS...' using decoder-only
Using CPU for decoder-only model generation.
Generated summary: This is the first report of a patient who has been diagnosed with hypertension. He has been treated ...

--- Summary from GPT-Neo-125M: ---
This is the first report of a patient who has been diagnosed with hypertension. He has been treated with Lisinoplatin 20mg for 2 years. He is on Lisinotecan 40mg for 3 years. His blood pressure has been stable for the past 3 months. He reports no new symptoms. He does not have any new complaints. He continues to have a regular lipid panel. He also has a regular blood pressure monitor.
The patient has been taking Lisinopaol 20mg twice a day for 3 

## Chapter 6: Discussion - Insights and Future Horizons

### 6.1. Strengths of the `HallucinationSurveyor`

The `HallucinationSurveyor` developed in this project demonstrates several key strengths for evaluating medical report summaries:

*   **Multi-Faceted Evaluation:** It moves beyond a single score, employing a suite of checks (entity coherence, numerical consistency, negation analysis, n-gram overlap, abstractiveness metrics) to provide a granular understanding of summary quality.
*   **Detection of Critical Errors:** As shown in Chapter 4, the surveyor effectively identifies critical hallucinations such as fabricated medical entities (e.g., "diabetes"), incorrect numerical values (e.g., wrong dosage), flipped negations (e.g., allergy status), and significant omissions of critical information (e.g., missing medication).
*   **Interpretability:** The detailed reports generated by the surveyor pinpoint *where* and *what kind* of potential issues exist, making the feedback actionable for human reviewers and model developers.
*   **Quantitative Heuristics:** The heuristic score, while needing context, provides a quick comparative measure of potential hallucination risk.
*   **Insight into Model Behavior:** When applied to LLM-generated summaries (Chapter 5), the surveyor reveals distinct model characteristics, such as DistilBART's extractiveness versus GPT-Neo's tendency towards creative fabrication and semantic divergence.
*   **Practical Implementation:** The surveyor is built using readily available Python libraries and techniques, making it relatively lightweight and deployable without excessive dependencies.
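
The numerical-consistency check named above can be pictured as a set comparison between the number strings found in the source and those found in the summary. The sketch below is a minimal illustration under stated assumptions, not the surveyor's actual implementation: the regular expression, the unit list, and the function names `extract_numbers` and `numerical_consistency` are hypothetical. Only the two report keys mirror the survey output shown in Chapter 5.

```python
import re

# Hypothetical pattern (an assumption for this sketch): a number, optionally
# extended into a range/ratio/date-like run, optionally followed by a unit.
NUMBER_PATTERN = re.compile(
    r"\d+(?:[./-]\d+)*(?:\s?(?:mg|mmhg|bpm|ml|mcg)\b)?",
    re.IGNORECASE,
)


def extract_numbers(text):
    """Return the set of normalized number(+unit) strings found in text."""
    return {
        re.sub(r"\s+", "", match.group(0)).lower()
        for match in NUMBER_PATTERN.finditer(text)
    }


def numerical_consistency(source, summary):
    """Compare number strings in source and summary via set differences."""
    source_numbers = extract_numbers(source)
    summary_numbers = extract_numbers(summary)
    return {
        "numbers_in_summary_not_in_source": sorted(summary_numbers - source_numbers),
        "numbers_in_source_not_in_summary": sorted(source_numbers - summary_numbers),
    }


# Usage with made-up example text (not taken from the project's source reports)
example_source = "Lisinopril 20mg daily. BP 135/85 mmHg. Hypertension for 10 years."
example_summary = "Lisinopril 40mg daily for 10 years."
print(numerical_consistency(example_source, example_summary))
```

Because the comparison is purely string-based, a dosage changed from `20mg` to `40mg` surfaces as a summary-only number, while a bare number such as `10` matches regardless of what it measures.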

### 6.2. Current Limitations and Future Enhancements

While effective, the current `HallucinationSurveyor` has limitations inherent in its lexical and rule-based approach, opening avenues for future enhancements:

*   **Novel Entity Fabrications:** The surveyor relies on a predefined list of critical terms. It may not flag newly fabricated entities (e.g., misspelled or entirely invented drug names like "Lisinoplatin") unless these fabrications coincidentally match existing general vocabulary or are very similar to known terms.
    *   *Future Work:* Integrate fuzzy matching against medical ontologies (e.g., RxNorm, SNOMED CT) or use more advanced Named Entity Recognition (NER) models fine-tuned on medical text to identify and validate medical entities more robustly.

*   **Contextual Understanding of Numbers:** While numerical values are extracted, their full clinical context (e.g., "2 years" duration vs. "2mg" dosage) is not deeply analyzed for consistency beyond presence/absence of the number-unit string.
    *   *Future Work:* Develop methods to link numerical values to their associated medical concepts and check for plausible ranges or consistency with related information in the source.

*   **Semantic Contradictions & Complex Reasoning:** The current negation check is rule-based. Deeper semantic contradictions (e.g., "patient improved" vs. "patient's condition worsened") or errors requiring multi-step reasoning are beyond its scope.
    *   *Future Work:* Explore the integration of Natural Language Inference (NLI) models to assess entailment, neutrality, or contradiction between source and summary statements. Question Answering (QA)-based validation (generating questions from the summary and answering them against the source) could also be employed.

*   **Sophistication of Heuristic Score:** The overall score is a weighted sum. Its sensitivity and weightings could be further refined based on larger-scale evaluations and user feedback.
    *   *Future Work:* Explore machine learning models trained on human-annotated summaries to predict a more nuanced hallucination risk score.

*   **Scalability for Full Medical Knowledge:** Relying on curated lists of critical terms and units is practical for a demo but doesn't scale to the entirety of medical knowledge.
    *   *Future Work:* Leverage large medical knowledge graphs and ontologies more dynamically.
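
To make the fuzzy-matching idea from the first limitation concrete, here is a minimal sketch using only Python's standard-library `difflib`. It assumes a tiny hand-written vocabulary (`KNOWN_DRUGS`) standing in for a real ontology such as RxNorm. The function name `flag_near_miss_terms`, the five-letter token filter, and the cutoff value are likewise illustrative assumptions, not part of the current `HallucinationSurveyor`.

```python
import re
from difflib import get_close_matches

# Hypothetical mini-vocabulary (assumption): a real system would query an
# ontology such as RxNorm instead of a hard-coded set.
KNOWN_DRUGS = {"lisinopril", "atorvastatin"}


def flag_near_miss_terms(summary, vocabulary=KNOWN_DRUGS, cutoff=0.6):
    """Map each unknown summary token to the known term it most resembles.

    Tokens that match the vocabulary exactly are treated as legitimate.
    Tokens that are merely similar to a known term are returned as suspects.
    """
    flagged = {}
    candidates = sorted(vocabulary)
    # Only alphabetic tokens of 5+ letters are considered, to skip short words
    for token in set(re.findall(r"[a-z]{5,}", summary.lower())):
        if token in vocabulary:
            continue
        closest = get_close_matches(token, candidates, n=1, cutoff=cutoff)
        if closest:
            flagged[token] = closest[0]
    return flagged


# Usage: an invented drug name resembling a known one is flagged as a suspect
print(flag_near_miss_terms("He has been treated with Lisinoplatin 20mg daily."))
```

A similarity cutoff trades recall against false alarms: a lower value catches more distorted names but also starts flagging ordinary words, so any threshold would need tuning against annotated summaries.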

By addressing these areas, the `HallucinationSurveyor` can evolve into an even more powerful tool for ensuring the reliability of AI in medical applications.
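
For reference, the weighted-sum scoring mentioned among the limitations can be sketched as below. The weights are purely illustrative assumptions, as is the function name `heuristic_risk_score`; the surveyor's actual weights are not reproduced here. The issue names reuse keys that appear in the survey report.

```python
# Hypothetical weights (assumption): illustrative values only.
DEFAULT_WEIGHTS = {
    "hallucinated_critical_entities": 3.0,
    "potential_negation_issues": 3.0,
    "numbers_in_summary_not_in_source": 2.0,
    "omitted_critical_entities": 1.0,
}


def heuristic_risk_score(issue_counts, weights=DEFAULT_WEIGHTS):
    """Return a weighted sum of issue counts; higher means more potential risk.

    issue_counts maps an issue name to how many times it was found.
    Issue names without a configured weight are ignored.
    """
    return sum(
        weight * issue_counts.get(name, 0)
        for name, weight in weights.items()
    )


# Usage: one fabricated critical entity and two unsupported numbers
counts = {
    "hallucinated_critical_entities": 1,
    "numbers_in_summary_not_in_source": 2,
}
print(heuristic_risk_score(counts))
```

Keeping the weights in a plain dictionary makes them easy to re-fit later, for example against human-annotated summaries as suggested in the future work above.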

## Chapter 7: Conclusion - Towards Reliable Medical Summarization

This project developed and demonstrated the `HallucinationSurveyor`, a tool for scrutinizing summaries of medical reports for factual accuracy and consistency. Through a series of targeted checks, the surveyor provides interpretable, multi-faceted feedback and identifies critical errors such as fabricated entities, numerical inaccuracies, flipped negations, and critical omissions, as shown in Chapter 4.

Applying the surveyor to LLM-generated summaries in Chapter 5 further highlighted its utility in characterizing different model behaviors, from overly extractive outputs to more creative but factually flawed generations. The insights derived from such an analysis are crucial for guiding the development and safe deployment of LLMs in the sensitive medical domain.

While the current implementation showcases a strong foundation using practical techniques, the discussion on limitations and future enhancements (Chapter 6) charts a path towards even greater sophistication, incorporating advanced NLP methods and deeper medical knowledge integration.

Ultimately, the `HallucinationSurveyor` is a step toward keeping LLM outputs in check, building trust, and ensuring that AI-driven summarization in healthcare is not only efficient but, most importantly, reliable and safe for patient care. The ability to systematically "survey" for hallucinations is fundamental to this endeavor.