# PSDL: Patient Scenario Definition Language

## Early Sepsis Detection Demo with PhysioNet Challenge 2019 Data

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Chesterguan/PSDL/blob/main/examples/notebooks/PSDL_PhysioNet_Demo.ipynb)

---

**What is PSDL?**

PSDL is an open, vendor-neutral standard for expressing clinical detection scenarios.

> *What SQL became for data queries, PSDL aims to become for clinical logic.*

**This notebook demonstrates:**
1. Loading PhysioNet Challenge 2019 sepsis data
2. Defining a sepsis detection scenario in YAML
3. Evaluating patients against SIRS + organ dysfunction criteria
4. Comparing PSDL detection vs ground truth labels

---

## Setup

First, let's install PSDL and download the sample data.

In [None]:
# Install the psdl-lang package with pip (if running in Colab)
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install -q psdl-lang
    
    # Download the sample data from GitHub
    !wget -q https://github.com/Chesterguan/PSDL/raw/main/examples/data/sepsis_sample_500.tar.gz
    !tar -xzf sepsis_sample_500.tar.gz
    DATA_PATH = 'sepsis_sample_500'
    SCENARIO_PATH = None  # Will use embedded scenario
else:
    # Running locally
    sys.path.insert(0, '../..')
    DATA_PATH = '../data/sepsis_sample_500'
    SCENARIO_PATH = None  # Use embedded scenario

print(f"Data path: {DATA_PATH}")

In [None]:
from datetime import datetime, timedelta
import os

# PSDL imports - using canonical paths
from psdl.core import parse_scenario
from psdl.adapters import PhysioNetBackend, load_physionet_dataset
from psdl.runtimes.single import SinglePatientEvaluator

# Data analysis
try:
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    HAS_PLOTTING = True
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams['figure.figsize'] = (10, 6)
except ImportError:
    HAS_PLOTTING = False
    print("Note: matplotlib/pandas not available, skipping visualizations")

print("PSDL PhysioNet Demo Ready!")

## 1. Load PhysioNet Data

The PhysioNet Computing in Cardiology Challenge 2019 focused on early prediction of sepsis from clinical data.

**Dataset characteristics:**
- Hourly ICU measurements (vitals + labs)
- 40,336 patients total (we use a 500-patient sample)
- Binary sepsis label based on Sepsis-3 criteria
- 34 clinical variables per hour
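
For orientation, each patient in the Challenge data is stored as one pipe-delimited `.psv` file with one row per ICU hour, `NaN` for missing values, and a `SepsisLabel` column. The sketch below parses a tiny inline sample using only the standard library. The sample values and the helper name `summarize_patient` are made up for illustration, and this is not a description of how `load_physionet_dataset` works internally.

```python
import csv
import io
import math

# Made-up three-hour excerpt in the Challenge's pipe-delimited layout
# (real files contain many more columns; only a few are shown here).
SAMPLE_PSV = """HR|Temp|Resp|MAP|SepsisLabel
88|NaN|18|75|0
95|38.4|24|NaN|0
104|38.6|26|62|1
"""

def summarize_patient(psv_text):
    """Return per-signal measurement counts and the first hour labeled septic."""
    rows = list(csv.DictReader(io.StringIO(psv_text), delimiter="|"))
    counts = {}
    for row in rows:
        for name, raw in row.items():
            if name == "SepsisLabel":
                continue
            if not math.isnan(float(raw)):  # float("NaN") parses to nan
                counts[name] = counts.get(name, 0) + 1
    # Index of the first row whose label is 1, or None if never septic
    onset_hour = next(
        (hour for hour, row in enumerate(rows) if row["SepsisLabel"] == "1"),
        None,
    )
    return {"hours": len(rows), "counts": counts, "onset_hour": onset_hour}

summary = summarize_patient(SAMPLE_PSV)
print(summary)
```

Counting non-missing values per signal is a quick way to see how sparse the labs are relative to the vitals before writing any detection logic.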

In [None]:
# Load the PhysioNet data
backend = load_physionet_dataset(DATA_PATH, max_patients=500)
patients = backend.get_patient_ids()

print(f"Loaded {len(patients)} patients")

# Count sepsis vs non-sepsis cases
sepsis_count = 0
non_sepsis_count = 0
for pid in patients:
    meta = backend.get_patient_metadata(pid)
    if meta.get('has_sepsis', False):
        sepsis_count += 1
    else:
        non_sepsis_count += 1

print(f"\nDataset composition:")
print(f"  Sepsis cases: {sepsis_count} ({sepsis_count/len(patients)*100:.1f}%)")
print(f"  Non-sepsis cases: {non_sepsis_count} ({non_sepsis_count/len(patients)*100:.1f}%)")

In [None]:
# Examine a sample patient's data
sample_pid = patients[0]
backend.load_patient(sample_pid)

print(f"Sample patient: {sample_pid}")
print(f"\nAvailable signals ({len(backend.list_signals(sample_pid))} total):")
for sig in sorted(backend.list_signals(sample_pid))[:15]:
    data = backend.get_signal_data(sig, sample_pid)
    print(f"  {sig}: {len(data)} measurements")
print("  ...")

## 2. Define the Sepsis Detection Scenario

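Our PSDL scenario implements a simplified, SIRS-style sepsis screen. The temperature and respiratory-rate cutoffs used here (38.3 °C, 22/min) are slightly stricter than the classic SIRS definition (38 °C, 20/min).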

**SIRS Criteria (2+ of 4 required):**
- Heart rate > 90 bpm
- Temperature > 38.3 °C or < 36 °C
- Respiratory rate > 22/min
- WBC > 12 or < 4 (×10^9/L)

**Plus organ dysfunction or elevated lactate (any one):**
- MAP < 65 mmHg (hypotension)
- Creatinine rise >= 0.3 mg/dL (kidney)
- Lactate > 2 mmol/L
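
The "2+ of 4" rule can be read as a simple count. The sketch below is plain Python, not PSDL. The function names are hypothetical and the thresholds are the ones listed above. It also checks that an OR over every pair of criteria is equivalent to counting at least two positives, which is how the rule can be written in a language that only offers AND and OR.

```python
from itertools import combinations, product

def sirs_flags(hr, temp, resp, wbc):
    """Evaluate the four SIRS criteria with the thresholds listed above."""
    return {
        "hr": hr > 90,
        "temp": temp > 38.3 or temp < 36,
        "resp": resp > 22,
        "wbc": wbc > 12 or wbc < 4,
    }

def sirs_count(flags):
    """SIRS is positive when at least two of the four criteria are met."""
    return sum(flags.values()) >= 2

def sirs_pairwise(flags):
    """Same rule written as an OR over every pair of criteria."""
    return any(flags[a] and flags[b] for a, b in combinations(flags, 2))

# The two formulations agree on all 16 combinations of the four flags.
names = ["hr", "temp", "resp", "wbc"]
for values in product([False, True], repeat=4):
    flags = dict(zip(names, values))
    assert sirs_count(flags) == sirs_pairwise(flags)

# Made-up readings: tachycardia plus fever, normal respiration and WBC.
example = sirs_flags(hr=104, temp=38.6, resp=18, wbc=9.0)
print(example, sirs_count(example))
```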

In [None]:
# Define the sepsis scenario in YAML
# parse_scenario() handles both file paths (ending in .yaml) and YAML content strings
SEPSIS_SCENARIO_YAML = '''
scenario: physionet_sepsis_detection
version: "0.2.0"
name: PhysioNet Sepsis Early Detection
description: |
  Early sepsis detection scenario for PhysioNet Challenge 2019 data.
  Implements SIRS criteria plus organ dysfunction markers.

signals:
  HR:
    source: HeartRate
    type: numeric
    unit: bpm
  Temp:
    source: Temperature
    type: numeric
    unit: celsius
  Resp:
    source: RespiratoryRate
    type: numeric
    unit: breaths/min
  MAP:
    source: MeanArterialPressure
    type: numeric
    unit: mmHg
  WBC:
    source: WBC
    type: numeric
    unit: "10^9/L"
  Lactate:
    source: Lactate
    type: numeric
    unit: mmol/L
  Creatinine:
    source: Creatinine
    type: numeric
    unit: mg/dL

trends:
  # SIRS criteria
  hr_elevated:
    expr: last(HR) > 90
    description: Tachycardia (HR > 90)
  temp_abnormal:
    expr: last(Temp) > 38.3
    description: Fever (Temp > 38.3C)
  temp_low:
    expr: last(Temp) < 36
    description: Hypothermia (Temp < 36C)
  resp_elevated:
    expr: last(Resp) > 22
    description: Tachypnea (RR > 22)
  wbc_high:
    expr: last(WBC) > 12
    description: Leukocytosis (WBC > 12)
  wbc_low:
    expr: last(WBC) < 4
    description: Leukopenia (WBC < 4)
  
  # Organ dysfunction markers
  lactate_elevated:
    expr: last(Lactate) > 2.0
    description: Elevated lactate (> 2 mmol/L)
  map_low:
    expr: last(MAP) < 65
    description: Hypotension (MAP < 65)
  creatinine_rising:
    expr: delta(Creatinine, 6h) >= 0.3
    description: Acute kidney injury

logic:
  sirs_temp:
    expr: temp_abnormal OR temp_low
    description: Temperature criterion for SIRS
  sirs_wbc:
    expr: wbc_high OR wbc_low
    description: WBC criterion for SIRS
  sirs_positive:
    expr: (hr_elevated AND resp_elevated) OR (hr_elevated AND sirs_temp) OR (hr_elevated AND sirs_wbc) OR (resp_elevated AND sirs_temp) OR (resp_elevated AND sirs_wbc) OR (sirs_temp AND sirs_wbc)
    severity: medium
    description: SIRS criteria met (2+ positive)
  organ_dysfunction:
    expr: map_low OR creatinine_rising
    severity: high
    description: Evidence of organ dysfunction
  sepsis_suspected:
    expr: sirs_positive AND (lactate_elevated OR organ_dysfunction)
    severity: high
    description: Suspected sepsis - SIRS + organ dysfunction or lactate
    recommendation: Initiate sepsis bundle, obtain blood cultures
'''

# Parse the scenario - use file if available, otherwise parse the embedded YAML
if SCENARIO_PATH and os.path.exists(SCENARIO_PATH):
    scenario = parse_scenario(SCENARIO_PATH)
else:
    # parse_scenario handles YAML strings when they don't end in .yaml/.yml
    scenario = parse_scenario(SEPSIS_SCENARIO_YAML)

print(f"Scenario: {scenario.name}")
print(f"\nSignals: {list(scenario.signals.keys())}")
print(f"\nTrends ({len(scenario.trends)}):")
for name, trend in scenario.trends.items():
    print(f"  {name}")
print(f"\nLogic rules ({len(scenario.logic)}):")
for name, logic in scenario.logic.items():
    print(f"  {name}: {logic.expr}")

## 3. Evaluate Patients

Now let's run the PSDL evaluator on our patient cohort and compare results to ground truth.

In [None]:
# Create the evaluator
evaluator = SinglePatientEvaluator(scenario, backend)

# Evaluate all patients
results = []
for i, pid in enumerate(patients):
    meta = backend.get_patient_metadata(pid)
    ground_truth = meta.get('has_sepsis', False)
    
    # Use sepsis onset time if available, otherwise end of stay
    onset_time = backend.get_sepsis_onset_time(pid)
    if onset_time:
        ref_time = onset_time
    else:
        ref_time = backend.base_datetime + timedelta(hours=meta.get('total_hours', 24))
    
    # Run PSDL evaluation
    result = evaluator.evaluate(pid, ref_time)
    
    psdl_detected = result.logic_results.get('sepsis_suspected', False)
    sirs_positive = result.logic_results.get('sirs_positive', False)
    organ_dysfunc = result.logic_results.get('organ_dysfunction', False)
    
    results.append({
        'patient_id': pid,
        'ground_truth': ground_truth,
        'psdl_detected': psdl_detected,
        'sirs_positive': sirs_positive,
        'organ_dysfunction': organ_dysfunc,
        'trend_values': result.trend_values,
    })
    
    if (i + 1) % 100 == 0:
        print(f"Evaluated {i + 1}/{len(patients)} patients...")

print(f"\nCompleted evaluation of {len(results)} patients")

In [None]:
# Calculate metrics
tp = sum(1 for r in results if r['ground_truth'] and r['psdl_detected'])
fp = sum(1 for r in results if not r['ground_truth'] and r['psdl_detected'])
tn = sum(1 for r in results if not r['ground_truth'] and not r['psdl_detected'])
fn = sum(1 for r in results if r['ground_truth'] and not r['psdl_detected'])

sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0
specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
ppv = tp / (tp + fp) if (tp + fp) > 0 else 0
npv = tn / (tn + fn) if (tn + fn) > 0 else 0

print("=== PSDL Sepsis Detection Results ===")
print(f"\nConfusion Matrix:")
print(f"                    Ground Truth")
print(f"                    Sepsis    No Sepsis")
print(f"  PSDL Positive      {tp:3d}        {fp:3d}")
print(f"  PSDL Negative      {fn:3d}        {tn:3d}")

print(f"\nPerformance Metrics:")
print(f"  Sensitivity (Recall): {sensitivity:.1%}")
print(f"  Specificity:          {specificity:.1%}")
print(f"  PPV (Precision):      {ppv:.1%}")
print(f"  NPV:                  {npv:.1%}")

# Additional insights
sirs_only = sum(1 for r in results if r['sirs_positive'] and not r['organ_dysfunction'])
organ_only = sum(1 for r in results if r['organ_dysfunction'] and not r['sirs_positive'])
both = sum(1 for r in results if r['sirs_positive'] and r['organ_dysfunction'])

print(f"\nCriteria Distribution:")
print(f"  SIRS only (no organ dysfunction): {sirs_only}")
print(f"  Organ dysfunction only (no SIRS): {organ_only}")
print(f"  Both SIRS + organ dysfunction:    {both}")

## 4. Detailed Case Analysis

Let's examine some specific cases to understand the detection patterns.

In [None]:
# Find true positives and false negatives for analysis
true_positives = [r for r in results if r['ground_truth'] and r['psdl_detected']]
false_negatives = [r for r in results if r['ground_truth'] and not r['psdl_detected']]

print("=== True Positive Cases (PSDL correctly detected sepsis) ===")
for case in true_positives[:3]:
    print(f"\nPatient {case['patient_id']}:")
    tv = case['trend_values']
    hr = tv.get('hr_elevated')  # may be missing or None when no HR was recorded
    print(f"  HR: {hr if hr is not None else 'N/A'} (elevated: {hr is not None and hr > 90})")
    print(f"  Temp: {tv.get('temp_abnormal', 'N/A')}")
    print(f"  MAP: {tv.get('map_low', 'N/A')}")
    print(f"  Lactate: {tv.get('lactate_elevated', 'N/A')}")
    print(f"  SIRS: {case['sirs_positive']}, Organ dysfunction: {case['organ_dysfunction']}")

if false_negatives:
    print("\n=== False Negative Cases (PSDL missed sepsis) ===")
    for case in false_negatives[:3]:
        print(f"\nPatient {case['patient_id']}:")
        tv = case['trend_values']
        print(f"  HR: {tv.get('hr_elevated', 'N/A')}")
        print(f"  Temp: {tv.get('temp_abnormal', 'N/A')}")
        print(f"  MAP: {tv.get('map_low', 'N/A')}")
        print(f"  Lactate: {tv.get('lactate_elevated', 'N/A')}")
        print(f"  SIRS: {case['sirs_positive']}, Organ dysfunction: {case['organ_dysfunction']}")
else:
    print("\nNo false negatives - all sepsis cases were detected!")

In [None]:
# Visualization (if matplotlib is available)
if HAS_PLOTTING:
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Confusion matrix heatmap
    # Rows = PSDL prediction, columns = ground truth (matches the axis labels below)
    cm = np.array([[tp, fp], [fn, tn]])
    im = axes[0].imshow(cm, cmap='Blues')
    axes[0].set_xticks([0, 1])
    axes[0].set_yticks([0, 1])
    axes[0].set_xticklabels(['Positive', 'Negative'])
    axes[0].set_yticklabels(['Positive', 'Negative'])
    axes[0].set_xlabel('Ground Truth')
    axes[0].set_ylabel('PSDL Prediction')
    axes[0].set_title('Confusion Matrix')
    
    # Add text annotations
    for i in range(2):
        for j in range(2):
            text = axes[0].text(j, i, cm[i, j], ha='center', va='center', 
                               color='white' if cm[i, j] > cm.max()/2 else 'black',
                               fontsize=14, fontweight='bold')
    
    # Performance metrics bar chart
    metrics = ['Sensitivity', 'Specificity', 'PPV', 'NPV']
    values = [sensitivity, specificity, ppv, npv]
    colors = ['#27ae60' if v > 0.7 else '#f39c12' if v > 0.5 else '#e74c3c' for v in values]
    
    bars = axes[1].bar(metrics, [v * 100 for v in values], color=colors, edgecolor='#2c3e50')
    axes[1].set_ylabel('Percentage (%)')
    axes[1].set_title('PSDL Performance Metrics')
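    axes[1].set_ylim(0, 110)  # headroom so value labels above tall bars are not clipped
    axes[1].axhline(y=70, color='gray', linestyle='--', alpha=0.5, label='70% threshold')
    axes[1].legend(loc='lower right')  # show the label of the threshold line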
    
    # Add value labels
    for bar, v in zip(bars, values):
        axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2, 
                    f'{v:.1%}', ha='center', fontsize=10, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
else:
    print("Visualization skipped (matplotlib not available)")

## 5. Understanding the Results

### Why might PSDL detection differ from PhysioNet labels?

1. **Different criteria definitions**: PhysioNet labels were derived using the Sepsis-3 criteria applied retrospectively. Our PSDL scenario implements simplified SIRS + organ dysfunction criteria.

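2. **Timing differences**: PhysioNet labels mark sepsis onset at a specific hour. This demo evaluates each sepsis patient once at that hour (and each non-sepsis patient once at the end of the stay), so criteria that are met only earlier or later, or whose inputs are sparse at that moment, are missed.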

3. **Data availability**: Some patients may have missing vital signs or lab values at the reference time.
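
The data-availability point is easy to reproduce in plain Python: any comparison against `NaN` is false, so a criterion whose input is missing silently reads as "not met". The sketch below illustrates that general failure mode only. The helper names are hypothetical, and how PSDL itself treats missing values is not shown here.

```python
import math

def map_low(map_value):
    """Hypotension check; a missing reading (NaN) can never satisfy it."""
    return map_value < 65

observed = map_low(58.0)          # reading present and low
missing = map_low(float("nan"))   # reading absent at the reference time

print(observed, missing)

def map_low_tristate(map_value):
    """Return None for 'unknown' so missing data is not mistaken for normal."""
    if map_value is None or math.isnan(map_value):
        return None
    return map_value < 65

print(map_low_tristate(float("nan")))
```

Keeping "unknown" distinct from "false" makes it possible to separate patients who did not meet a criterion from patients who were never measured.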

### Key Insights

- PSDL provides **transparent, auditable** detection logic
- The scenario can be easily **modified** to adjust thresholds or add criteria
- Same scenario definition works across **different data sources** (OMOP, FHIR, flat files)

---

## Next Steps

1. **Tune thresholds**: Adjust the SIRS thresholds to fit the local patient population
2. **Add more signals**: Include additional organ dysfunction markers (platelets, bilirubin)
3. **Temporal patterns**: Add trending indicators (rising HR, falling MAP)
4. **ML integration**: Use PSDL features as inputs to machine learning models
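
As a rough illustration of a temporal indicator such as "falling MAP", the sketch below computes the change in a signal over a trailing window. It is plain Python with a hypothetical helper `window_delta`, and it assumes "change" means the latest value minus the earliest value inside the window. PSDL's own operators may define this differently.

```python
from datetime import datetime, timedelta

def window_delta(samples, ref_time, window):
    """Latest minus earliest value among samples in [ref_time - window, ref_time].

    `samples` is a list of (timestamp, value) pairs. Returns None when fewer
    than two samples fall inside the window.
    """
    start = ref_time - window
    in_window = sorted((t, v) for t, v in samples if start <= t <= ref_time)
    if len(in_window) < 2:
        return None
    return in_window[-1][1] - in_window[0][1]

t0 = datetime(2019, 1, 1, 0, 0)
# Made-up hourly MAP readings (mmHg) for illustration.
map_samples = [(t0 + timedelta(hours=h), v)
               for h, v in enumerate([82, 80, 77, 74, 70, 66, 63])]

ref = t0 + timedelta(hours=6)
falling = window_delta(map_samples, ref, timedelta(hours=3))
print(falling)
```

A negative result indicates a falling signal. Returning `None` when the window holds fewer than two samples avoids reporting a trend from a single measurement.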

In [None]:
# Summary
print("=" * 50)
print("PSDL PhysioNet Sepsis Detection Demo - Summary")
print("=" * 50)
print(f"\nDataset: PhysioNet Challenge 2019 ({len(patients)} patients)")
print(f"Scenario: {scenario.name}")
print(f"\nResults:")
print(f"  Ground truth sepsis cases: {sum(1 for r in results if r['ground_truth'])}")
print(f"  PSDL detected sepsis:      {sum(1 for r in results if r['psdl_detected'])}")
print(f"  Sensitivity: {sensitivity:.1%}")
print(f"  Specificity: {specificity:.1%}")
print(f"\nPSDL enables transparent, portable clinical detection logic.")
print("Learn more at: https://github.com/Chesterguan/PSDL")