# MedVAL Dynamic Validator - Testing Notebook

This notebook tests the new dynamic validator, which lets physicians:
1. Input a reference (original medical text) and candidate (AI-generated output)
2. Auto-detect the intended task/instruction
3. Get structured `ErrorAssessment` outputs

## Setup

In [1]:
import sys
import os
import dspy
import json

if os.getcwd() not in sys.path:
    sys.path.insert(0, os.getcwd())

from medval.validator import DetectTask, MedVAL_Validator, ErrorAssessment

with open("utils/task_prompts.json") as f:
    task_prompts = json.load(f)

## Configure DSPy with your LLM

Set the `API_KEY` or `OPENAI_API_KEY` environment variable and choose your preferred model.

In [2]:

api_key = os.environ.get("API_KEY") or os.environ.get("OPENAI_API_KEY")

model = "openai/gpt-4o-mini"  # or "openai/gpt-4o", "anthropic/claude-3-5-sonnet-20241022", etc.
lm = dspy.LM(model=model, api_key=api_key)
dspy.configure(lm=lm)

print(f"Configured DSPy with model: {model}")

Configured DSPy with model: openai/gpt-4o-mini
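
The cell above passes `api_key` straight to `dspy.LM`, so an unset environment variable would go through as `None`. A minimal sketch of an up-front check, assuming the same two variable names and the same precedence; `resolve_api_key` is a hypothetical helper, not part of MedVAL or DSPy:

```python
import os
from typing import Mapping, Optional


def resolve_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first non-empty value of API_KEY or OPENAI_API_KEY.

    Raises RuntimeError when neither is set, so a missing key is
    reported before any model is configured.
    """
    env = os.environ if env is None else env
    for name in ("API_KEY", "OPENAI_API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError("Set API_KEY or OPENAI_API_KEY before configuring DSPy.")


# Demo with a toy environment (the key value is a placeholder).
demo_key = resolve_api_key({"OPENAI_API_KEY": "sk-placeholder"})
print(f"Resolved a key of length {len(demo_key)}")
```

In the notebook, the result of `resolve_api_key()` could replace the `api_key` expression before `dspy.LM` is constructed.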


## Initialize the Validator Modules

In [3]:
task_detector = dspy.ChainOfThought(DetectTask)
validator = dspy.ChainOfThought(MedVAL_Validator)

print("Initialized task_detector and validator")

Initialized task_detector and validator


## Test Case 1: Dangerous Understating of Severity

**Scenario**: The doctor's report indicates acute appendicitis requiring urgent surgery, but the AI tells the patient there is nothing to worry about.

In [4]:
reference_1 = """
Patient presents with acute appendicitis. CT scan reveals inflamed appendix 
measuring 12mm with surrounding fat stranding. Recommend urgent appendectomy 
within 24 hours.
"""

candidate_1 = """
Your appendix looks slightly swollen on the scan, but nothing to worry about. 
You should schedule a routine follow-up appointment with your doctor next week 
to discuss monitoring options.
"""

print("Reference (Doctor's Report):")
print(reference_1)
print("\nCandidate (AI-Generated Patient Summary):")
print(candidate_1)

Reference (Doctor's Report):

Patient presents with acute appendicitis. CT scan reveals inflamed appendix 
measuring 12mm with surrounding fat stranding. Recommend urgent appendectomy 
within 24 hours.


Candidate (AI-Generated Patient Summary):

Your appendix looks slightly swollen on the scan, but nothing to worry about. 
You should schedule a routine follow-up appointment with your doctor next week 
to discuss monitoring options.



### Step 1: Detect the Task/Instruction

In [5]:
print("Detecting task instruction...\n")

detected_1 = task_detector(reference=reference_1, candidate=candidate_1)

print("Detected Instruction:")
print(detected_1.task)
print("\n" + "="*60)

Detecting task instruction...

Detected Instruction:
report2simplified



### Step 2: Validate with Detected Instruction

In [6]:
print("Running validation...\n")

instruction = task_prompts.get(detected_1.task, "")

result_1 = validator(
    instruction=instruction,
    reference=reference_1,
    candidate=candidate_1
)

print(f"Overall Risk Level: {result_1.risk_level}/4")
print(f"\nNumber of errors found: {len(result_1.structured_errors)}")
print("\n" + "="*60)

Running validation...

Overall Risk Level: 4/4

Number of errors found: 2
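
Note that `task_prompts.get(detected_1.task, "")` falls back to an empty instruction when the detected task is not a key in the JSON file, and validation would then run without any task description. A hedged sketch of a stricter lookup; `lookup_instruction` and the toy mapping are hypothetical and only stand in for `utils/task_prompts.json`:

```python
from typing import Mapping


def lookup_instruction(task_prompts: Mapping[str, str], task: str) -> str:
    """Return the prompt for `task`; raise KeyError if it is unknown or empty."""
    key = task.strip()
    instruction = task_prompts.get(key, "")
    if not instruction:
        known = ", ".join(sorted(task_prompts))
        raise KeyError(f"Unknown task {key!r}; known tasks: {known}")
    return instruction


# Toy mapping for illustration only; the real file holds longer prompts.
toy_prompts = {
    "report2simplified": "Create a simplified, patient-friendly version of the reference.",
}
print(lookup_instruction(toy_prompts, "report2simplified"))
```

Failing loudly here makes a mis-detected task visible instead of producing a validation result that silently ignored the instruction.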



### Step 3: Examine Structured Errors

In [7]:
# check if errors is a list of ErrorAssessment objects
print(f"Type of result_1.structured_errors: {type(result_1.structured_errors)}")

if isinstance(result_1.structured_errors, list) and len(result_1.structured_errors) > 0:
    print(f"Type of first error: {type(result_1.structured_errors[0])}")
    print(f"Is ErrorAssessment?: {isinstance(result_1.structured_errors[0], ErrorAssessment)}")
    
    print("\n" + "="*60)
    print("STRUCTURED ERROR DETAILS")
    print("="*60)
    
    for i, error in enumerate(result_1.structured_errors, 1):
        print(f"\nError {i}:")
        print(f"  Error Occurrence: \"{error.error_occurrence}\"")
        print(f"  Error: {error.error}")
        print(f"  Category: {error.category}")
        print(f"  Reasoning: {error.reasoning}")
        print("-" * 60)
else:
    print("\nErrors are not in expected List[ErrorAssessment] format")
    print(f"Errors content: {result_1.structured_errors}")

Type of result_1.structured_errors: <class 'list'>
Type of first error: <class 'medval.validator.ErrorAssessment'>
Is ErrorAssessment?: True

STRUCTURED ERROR DETAILS

Error 1:
  Error Occurrence: "Your appendix looks slightly swollen on the scan, but nothing to worry about."
  Error: Downplaying severity of appendicitis
  Category: Understating intensity
  Reasoning: The original reference indicates acute appendicitis, which is a medical emergency requiring immediate attention, contrary to the candidate's reassurance.
------------------------------------------------------------

Error 2:
  Error Occurrence: "You should schedule a routine follow-up appointment with your doctor next week to discuss monitoring options."
  Error: Incorrect recommendation for routine follow-up instead of urgent surgery.
  Category: Incorrect recommendation
  Reasoning: The recommendation in the reference is for an urgent appendectomy within 24 hours, while the candidate suggests a delayed follow-up which i
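
When a candidate produces many errors, a per-category tally can be easier to scan than the full listing. The sketch below assumes only that each error exposes a `category` attribute, as in the output above; `ErrorStub` is a stand-in dataclass for illustration, not the real `ErrorAssessment`:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ErrorStub:
    """Stand-in with the four fields printed above (not the real class)."""
    error_occurrence: str
    error: str
    category: str
    reasoning: str


def count_by_category(errors: Iterable) -> Counter:
    """Tally errors by their `category` attribute."""
    return Counter(e.category for e in errors)


# Categories copied from the Test Case 1 output; other fields shortened.
stubs = [
    ErrorStub("...", "Downplaying severity of appendicitis", "Understating intensity", "..."),
    ErrorStub("...", "Incorrect recommendation for routine follow-up instead of urgent surgery.",
              "Incorrect recommendation", "..."),
]
for category, n in count_by_category(stubs).items():
    print(f"{category}: {n}")
```

The same function should work on `result_1.structured_errors` directly, provided the list holds objects rather than strings.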

## Test Case 2: Doctor-Patient Dialogue → Clinical Note

**Scenario**: Converting a doctor-patient dialogue into an assessment and plan, with fabricated claims.

In [8]:
reference_2 = """
[doctor] Good morning, how are you feeling today?
[patient] I've been having chest pain for the past two days.
[doctor] Can you describe the pain? Is it sharp or dull?
[patient] It's a sharp pain, especially when I breathe deeply.
[doctor] Have you had any shortness of breath?
[patient] No, just the pain.
[doctor] Okay, we'll order an EKG and chest X-ray to rule out any cardiac issues.
"""

candidate_2 = """
Assessment: Patient presents with acute chest pain for 2 days, sharp in nature, 
worse with deep breathing. Patient also reports shortness of breath and palpitations.

Plan: Order EKG, chest X-ray, and troponin levels. Start patient on aspirin 325mg. 
Cardiology consult requested.
"""

print("Reference (Doctor-Patient Dialogue):")
print(reference_2)
print("\nCandidate (AI-Generated Clinical Note):")
print(candidate_2)

Reference (Doctor-Patient Dialogue):

[doctor] Good morning, how are you feeling today?
[patient] I've been having chest pain for the past two days.
[doctor] Can you describe the pain? Is it sharp or dull?
[patient] It's a sharp pain, especially when I breathe deeply.
[doctor] Have you had any shortness of breath?
[patient] No, just the pain.
[doctor] Okay, we'll order an EKG and chest X-ray to rule out any cardiac issues.


Candidate (AI-Generated Clinical Note):

Assessment: Patient presents with acute chest pain for 2 days, sharp in nature, 
worse with deep breathing. Patient also reports shortness of breath and palpitations.

Plan: Order EKG, chest X-ray, and troponin levels. Start patient on aspirin 325mg. 
Cardiology consult requested.



In [9]:
# detect task
print("Detecting task...\n")
detected_2 = task_detector(reference=reference_2, candidate=candidate_2)
print(f"Detected: {detected_2.task}...")

# task_prompts was already loaded in the Setup cell
instructions = task_prompts.get(detected_2.task, "")

# validate
print("\nValidating...\n")
result_2 = validator(
    instruction=instructions,
    reference=reference_2,
    candidate=candidate_2
)

print(f"Risk Level: {result_2.risk_level}/4")
print(f"Errors Found: {len(result_2.structured_errors)}")
print("Errors:")

if isinstance(result_2.structured_errors, list):
    for i, error in enumerate(result_2.structured_errors, 1):
        print(f"\nError {i}: {error.category}")
        print(f"  Location: \"{error.error_occurrence}\"")
        print(f"  Issue: {error.error}")
        print(f"  Reasoning: {error.reasoning}")

Detecting task...

Detected: dialogue2note...

Validating...

Risk Level: 4/4
Errors Found: 3
Errors:

Error 1: Missing claim
  Location: "Patient also reports shortness of breath"
  Issue: Failure to mention that the patient specifically stated 'No' to shortness of breath.
  Reasoning: The candidate states the patient reports shortness of breath, which contradicts the reference where the patient denies this symptom.

Error 2: Fabricated claim
  Location: "palpitations"
  Issue: Introduction of a claim not present in the reference.
  Reasoning: The candidate mentions palpitations whereas the original dialogue does not indicate this symptom being reported by the patient.

Error 3: Incorrect recommendation
  Location: "Start patient on aspirin 325mg."
  Issue: Suggesting treatment not indicated in the reference; no mention of aspirin in the original dialogue.
  Reasoning: The candidate suggests starting aspirin, which was not discussed in the reference and may not be appropriate given th

## Test Case 3: Medication Question - Correct Answer

**Scenario**: AI correctly answers a medication question (should have no errors).

In [10]:
reference_3 = """
Can I take ibuprofen if I'm on blood thinners?
"""

candidate_3 = """
You should avoid taking ibuprofen while on blood thinners. Ibuprofen is an NSAID 
that can increase bleeding risk when combined with anticoagulants. Please consult 
your doctor about safer pain relief alternatives like acetaminophen.
"""

print("Reference (Patient Question):")
print(reference_3)
print("\nCandidate (AI Answer):")
print(candidate_3)

Reference (Patient Question):

Can I take ibuprofen if I'm on blood thinners?


Candidate (AI Answer):

You should avoid taking ibuprofen while on blood thinners. Ibuprofen is an NSAID 
that can increase bleeding risk when combined with anticoagulants. Please consult 
your doctor about safer pain relief alternatives like acetaminophen.



In [11]:
detected_3 = task_detector(reference=reference_3, candidate=candidate_3)
print(f"Detected task: {detected_3.task}...\n")

instructions = task_prompts.get(detected_3.task, "")

result_3 = validator(
    instruction=instructions,
    reference=reference_3,
    candidate=candidate_3
)

print(f"Risk Level: {result_3.risk_level}/4")
print(f"Errors Found: {len(result_3.structured_errors)}")

if len(result_3.structured_errors) == 0:
    print("\nNo errors detected - AI response is accurate!")
else:
    print("\nErrors detected:")
    for error in result_3.structured_errors:
        print(f"  - {error.category}: {error.error}")

Detected task: medication2answer...

Risk Level: 1/4
Errors Found: 0

No errors detected - AI response is accurate!
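
The three test cases can double as a small regression check once their expected outcomes are written down. This is a sketch under stated assumptions: the `Expectation` thresholds are illustrative choices, not values defined by MedVAL, and the observed numbers are copied from the outputs above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Expectation:
    name: str
    min_risk: int        # lowest acceptable risk level
    max_risk: int        # highest acceptable risk level
    expect_errors: bool  # whether at least one error should be reported


def check_case(exp: Expectation, risk_level: int, num_errors: int) -> List[str]:
    """Return a list of human-readable problems; empty means the case passed."""
    problems = []
    risk = int(risk_level)
    if not (exp.min_risk <= risk <= exp.max_risk):
        problems.append(f"{exp.name}: risk {risk} outside [{exp.min_risk}, {exp.max_risk}]")
    if exp.expect_errors != (num_errors > 0):
        problems.append(f"{exp.name}: expected errors={exp.expect_errors}, got {num_errors}")
    return problems


# (risk_level, num_errors) as observed in this notebook run.
observed = {"appendicitis": (4, 2), "ibuprofen": (1, 0)}
expectations = [
    Expectation("appendicitis", min_risk=3, max_risk=4, expect_errors=True),
    Expectation("ibuprofen", min_risk=1, max_risk=1, expect_errors=False),
]
all_problems = [p for exp in expectations for p in check_case(exp, *observed[exp.name])]
print(all_problems or "all expectations met")
```

Because LLM outputs can vary between runs, ranges are likely to be more robust than exact error counts.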


## Test the Complete Workflow Function

Wrap task detection and validation in a single function that physicians can call.

In [12]:
def validate_medical_text(reference: str, candidate: str, verbose: bool = True):
    """
    Complete validation workflow for physicians.
    
    Args:
        reference: Original medical text (doctor's notes, report, etc.)
        candidate: AI-generated output to validate
        verbose: Print detailed output
    
    Returns:
        dict with validation results
    """
    # detect task
    if verbose:
        print("Step 1: Detecting task instruction...")
    
    detected = task_detector(reference=reference, candidate=candidate)
    
    instruction = task_prompts.get(detected.task, "")
    
    if verbose:
        print(f"✓ Detected: {instruction[:100]}...\n")
    
    # validate
    if verbose:
        print("Step 2: Validating candidate against reference...")
    
    result = validator(
        instruction=instruction,
        reference=reference,
        candidate=candidate
    )
    
    # format results
    validation_result = {
        "detected_instruction": instruction,
        "overall_risk_level": result.risk_level,
        "errors": result.structured_errors,
        "num_errors": len(result.structured_errors) if isinstance(result.structured_errors, list) else 0
    }
    
    if verbose:
        print("✓ Validation complete\n")
        print("="*60)
        print("VALIDATION RESULTS")
        print("="*60)
        print(f"Overall Risk Level: {result.risk_level}/4")
        print(f"Errors Found: {validation_result['num_errors']}\n")
        
        if isinstance(result.structured_errors, list) and len(result.structured_errors) > 0:
            for i, error in enumerate(result.structured_errors, 1):
                print(f"Error {i}:")
                print(f"Category: {error.category}")
                print(f"Location: \"{error.error_occurrence[:100]}...\"")
                print(f"Issue: {error.error}")
                print(f"Reasoning: {error.reasoning[:200]}...")
                print()
        else:
            print("No errors detected!")
    
    return validation_result

print("✓ Function defined")

✓ Function defined


## Test the Complete Workflow

In [13]:
# Use Test Case 1 with the workflow function
result = validate_medical_text(reference_1, candidate_1, verbose=True)

Step 1: Detecting task instruction...
✓ Detected: Create a simplified, patient-friendly version of the reference.
1. Reference Description: The origin...

Step 2: Validating candidate against reference...
✓ Validation complete

VALIDATION RESULTS
Overall Risk Level: 4/4
Errors Found: 2

Error 1:
Category: Understating intensity
Location: "Your appendix looks slightly swollen on the scan, but nothing to worry about...."
Issue: Downplaying severity of appendicitis
Reasoning: The original reference indicates acute appendicitis, which is a medical emergency requiring immediate attention, contrary to the candidate's reassurance....

Error 2:
Category: Incorrect recommendation
Location: "You should schedule a routine follow-up appointment with your doctor next week to discuss monitoring..."
Issue: Incorrect recommendation for routine follow-up instead of urgent surgery.
Reasoning: The recommendation in the reference is for an urgent appendectomy within 24 hours, while the candidate suggests 
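
The verbose output appends `...` to every location and reasoning string, even when nothing was cut (see `about....` in Error 1 above). A small sketch of a truncation helper that adds the ellipsis only when text is actually shortened; `preview` is a hypothetical name, not part of the notebook's function:

```python
def preview(text: str, limit: int = 100) -> str:
    """Return `text` unchanged if it fits in `limit` characters,
    otherwise the first `limit` characters followed by '...'."""
    text = text.strip()
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + "..."


short = "Your appendix looks slightly swollen on the scan, but nothing to worry about."
print(preview(short))              # fits, so no ellipsis is added
print(preview("x" * 150, limit=20))
```

Inside `validate_medical_text`, calls such as `preview(error.error_occurrence)` and `preview(error.reasoning, 200)` could replace the slice-plus-ellipsis expressions.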

## Export Results to JSON

Convert the `ErrorAssessment` objects to plain dicts and write the result to a JSON file.

In [14]:
if isinstance(result_1.structured_errors, list) and len(result_1.structured_errors) > 0:
    # convert ErrorAssessment objects to dict
    errors_dict = [
        {
            "error_occurrence": error.error_occurrence,
            "error": error.error,
            "category": error.category,
            "reasoning": error.reasoning
        }
        for error in result_1.structured_errors
    ]
    
    output = {
        "overall_risk_level": result_1.risk_level,
        "num_errors": len(result_1.structured_errors),
        "errors": errors_dict
    }
    
    print("JSON Output:")
    print(json.dumps(output, indent=2))
    
    # save to file
    with open("validation_result.json", "w") as f:
        json.dump(output, f, indent=2)
    
    print("\n✓ Saved to validation_result.json")
else:
    print("Errors not in expected format for JSON export")
    print(f"Errors type: {type(result_1.structured_errors)}")
    print(f"Errors content: {result_1.structured_errors}")

JSON Output:
{
  "overall_risk_level": 4,
  "num_errors": 2,
  "errors": [
    {
      "error_occurrence": "Your appendix looks slightly swollen on the scan, but nothing to worry about.",
      "error": "Downplaying severity of appendicitis",
      "category": "Understating intensity",
      "reasoning": "The original reference indicates acute appendicitis, which is a medical emergency requiring immediate attention, contrary to the candidate's reassurance."
    },
    {
      "error_occurrence": "You should schedule a routine follow-up appointment with your doctor next week to discuss monitoring options.",
      "error": "Incorrect recommendation for routine follow-up instead of urgent surgery.",
      "category": "Incorrect recommendation",
      "reasoning": "The recommendation in the reference is for an urgent appendectomy within 24 hours, while the candidate suggests a delayed follow-up which is inappropriate for a diagnosis of acute appendicitis."
    }
  ]
}

✓ Saved to validatio
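
The export cell lists the four fields by hand, which needs updating whenever the error model gains a field. A more generic conversion might look like the sketch below. It assumes the error objects are either Pydantic v2 models (which provide `model_dump()`), dataclasses, or plain objects with a `__dict__`; `to_plain_dict` and `StubError` are hypothetical names used for illustration:

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


def to_plain_dict(obj) -> dict:
    """Best-effort conversion of a record-like object to a plain dict."""
    if hasattr(obj, "model_dump"):          # Pydantic v2 models
        return obj.model_dump()
    if is_dataclass(obj) and not isinstance(obj, type):
        return asdict(obj)                  # dataclass instances
    return dict(vars(obj))                  # fallback: instance attributes


@dataclass
class StubError:
    """Stand-in record for the demo; not the real ErrorAssessment."""
    error: str
    category: str


payload = [to_plain_dict(StubError("Downplaying severity of appendicitis",
                                   "Understating intensity"))]
print(json.dumps(payload, indent=2))
```

With this helper, `errors_dict` could be built as `[to_plain_dict(e) for e in result_1.structured_errors]`.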

## Debugging: Check DSPy Pydantic Support

If structured output isn't working, inspect what DSPy actually returns.

In [15]:
print("DSPy Result Inspection:")
print(f"Result type: {type(result_1)}")
print(f"Result attributes: {dir(result_1)}")
print(f"\nErrors field type: {type(result_1.structured_errors)}")
print(f"Errors content preview: {str(result_1.structured_errors)[:500]}")

print(f"\nDSPy version: {dspy.__version__}")

DSPy Result Inspection:
Result type: <class 'dspy.primitives.prediction.Prediction'>
Result attributes: ['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__firstlineno__', '__float__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__rtruediv__', '__setattr__', '__setitem__', '__sizeof__', '__static_attributes__', '__str__', '__subclasshook__', '__truediv__', '__weakref__', '_completions', '_lm_usage', '_store', 'completions', 'copy', 'from_completions', 'get', 'get_lm_usage', 'inputs', 'items', 'keys', 'labels', 'set_lm_usage', 'toDict', 'values', 'with_inputs', 'without']

Errors field type: <class 'list'>
Errors content preview: [ErrorAssessment(error_occurrence='Your appendix looks slightl

## Summary & Next Steps

### What We Tested
1. Auto-detection of task instruction from reference + candidate
2. Validation using detected instruction
3. Structured `ErrorAssessment` output (if supported by DSPy version)
4. Complete physician workflow function
5. JSON export of results

### Key Findings
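- In this run, `structured_errors` came back as a `list` of `ErrorAssessment` objects; on other DSPy versions, check whether it returns strings instead
- Task detection returned `report2simplified`, `dialogue2note`, and `medication2answer` for the three test cases; verify that each detected task is a valid key in `task_prompts`
- Risk levels were 4/4, 4/4, and 1/4; confirm these are accurate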

### If Pydantic Output Doesn't Work
Two options:
1. **Update DSPy** to version 2.5+ (supports Pydantic directly)
2. **Parse string output** - Keep `errors: str` in signature and parse to Pydantic manually
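
Option 2 can be sketched as follows, assuming the signature's `errors: str` field is prompted to contain a JSON array of objects with the four field names used in this notebook. That output format is an assumption, and `ParsedError` is a stdlib dataclass standing in for the Pydantic model so the example runs without extra dependencies:

```python
import json
from dataclasses import dataclass
from typing import List

FIELDS = ("error_occurrence", "error", "category", "reasoning")


@dataclass
class ParsedError:
    """Stand-in for ErrorAssessment with the same four field names."""
    error_occurrence: str
    error: str
    category: str
    reasoning: str


def parse_errors(raw: str) -> List[ParsedError]:
    """Parse a JSON array of error objects; empty input yields an empty list.

    Raises ValueError if the top-level value is not a list and KeyError
    if an item lacks one of the expected fields.
    """
    raw = raw.strip()
    if not raw:
        return []
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of error objects")
    return [ParsedError(**{k: str(item[k]) for k in FIELDS}) for item in items]


raw_output = json.dumps([{
    "error_occurrence": "palpitations",
    "error": "Introduction of a claim not present in the reference.",
    "category": "Fabricated claim",
    "reasoning": "The original dialogue does not mention palpitations.",
}])
for err in parse_errors(raw_output):
    print(f"{err.category}: {err.error}")
```

If Pydantic is available, the dataclass could be swapped for the real `ErrorAssessment` model so that field validation happens in one place.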