# Clinical Trials LLM Annotation Pipeline

## Overview

This notebook is **Step 2** in our educational pipeline demonstrating how Large Language Models (LLMs) can enhance bioinformatics workflows. We'll use an LLM to intelligently annotate the clinical trials data we cleaned in `data_retriever.ipynb`.

## What is an "Agentic Workflow"?

You may have heard terms like "AI agents" or "agentic systems" - often associated with autonomous AI that can take actions independently. **This notebook demonstrates a different, more controlled approach**: a **circumscribed agentic workflow**.

### Key Characteristics:

**🧠 Intelligent Decision-Making**
- The LLM interprets medical terminology, handles synonyms, and extracts meaning from unstructured text
- Goes beyond simple string matching to understand context

**🛡️ Strict Guardrails**
- All LLM outputs are constrained to predefined formats (structured schemas)
- Multiple validation layers guard against hallucinations
- Original data is never modified

**🔍 Audit Trail**
- Every LLM decision is logged
- Results exported to CSV for human review
- Fully transparent and reproducible

**👤 Human Oversight**
- Optional human review and override capability
- Red flags logged for manual inspection

### What This Is NOT:

❌ **Not** an autonomous agent that can take arbitrary actions  
❌ **Not** a system that modifies or corrupts your raw data  
❌ **Not** a black box - every decision is traceable  

### What We'll Accomplish:

This notebook enriches our clinical trials dataset by:

1. **Mapping unmapped conditions** → Finding MeSH terms for conditions that couldn't be automatically matched
2. **Extracting primary conditions** → Identifying the main disease studied in each trial from context
3. **Categorizing therapeutically** → Assigning trials to therapeutic areas (ONCOLOGY, CARDIOVASCULAR, etc.)

By the end, each trial will have:
- Standardized MeSH condition terms
- Therapeutic category classification
- Confidence scores for LLM decisions
- Audit trail of all annotations

Let's explore how to integrate LLM intelligence **purposefully and safely** into a bioinformatics data pipeline.

In [None]:
import os
import pickle
from pathlib import Path
from langchain_openai import ChatOpenAI
import pandas as pd
from dotenv import load_dotenv, find_dotenv
from services import logging_config, mesh_mapper, annotator


## The Spectrum: Deterministic → Circumscribed Agent → Autonomous AI

To understand what makes this workflow "agentic but safe," let's compare three approaches to data processing:

### Approach 1: Fully Deterministic Script

**How it works:**
```python
if condition == "T2DM":
    standardized = "Type 2 Diabetes"
elif condition == "Type II Diabetes":
    standardized = "Type 2 Diabetes"
# ... must hardcode every possible variant
```

**Pros:**
- ✅ Predictable and fast
- ✅ No risk of hallucination
- ✅ Easy to debug

**Cons:**
- ❌ Brittle - fails on unexpected inputs
- ❌ Can't handle abbreviations or synonyms it wasn't programmed for
- ❌ Requires exhaustive enumeration of all possibilities
- ❌ Can't understand context

---

### Approach 2: Circumscribed Agent (This Notebook)

**How it works:**
```python
# LLM interprets context and maps to structured output
result = llm.ask(
    "Map this condition to a MeSH term",
    structured_output=ConditionMapping  # Forces specific format
)
# Then validate against external database
if validate_mesh_term(result.mesh_id):
    accept(result)
else:
    reject_and_log(result)
```

**Pros:**
- ✅ Handles synonyms, abbreviations, typos gracefully
- ✅ Understands context (e.g., "diabetes in patients with..." → extracts "diabetes")
- ✅ Adapts to variations without reprogramming
- ✅ Structured outputs prevent free-form hallucination
- ✅ Validation layers catch errors
- ✅ Audit trail for every decision

**Cons:**
- ⚠️ Requires validation infrastructure
- ⚠️ Slower than pure rule-based systems
- ⚠️ Small risk of misinterpretation (mitigated by confidence scores)

---

### Approach 3: Fully Autonomous Agent

**How it works:**
```python
# Agent has freedom to query databases, modify data, take actions
agent.run("Improve the quality of this clinical trials dataset")
# Agent decides what to do with no constraints
```

**Pros:**
- ✅ Maximum flexibility
- ✅ Can discover novel approaches

**Cons:**
- ❌ Unpredictable behavior
- ❌ High risk of data corruption
- ❌ Opaque decision-making
- ❌ Difficult to reproduce results
- ❌ May take unintended actions

---

## Why We Choose the Middle Ground

For **bioinformatics workflows**, the circumscribed agent approach can offer the safest trade-off:

| Requirement | Deterministic | Circumscribed Agent | Autonomous |
|-------------|---------------|---------------------|------------|
| Flexibility | ❌ | ✅ | ✅ |
| Data Safety | ✅ | ✅ | ❌ |
| Reproducibility | ✅ | ✅ | ❌ |
| Context Understanding | ❌ | ✅ | ✅ |
| Auditability | ✅ | ✅ | ❌ |
| Validation | ✅ | ✅ | ⚠️ |

**Result:** We get LLM intelligence (context understanding, synonym handling) **with** the safety and auditability of traditional scripts.

---

## Local LLM Configuration

Notice in the code below that we're using a **local LLM** (configured via `LOCAL_LLM_URL` environment variable) rather than a cloud API like OpenAI or Anthropic.

**Why local?**

🔒 **Data Privacy**
- Clinical trial data may include proprietary information
- Local inference means data never leaves your infrastructure
- Critical for pharmaceutical/biotech companies with IP concerns

💰 **Cost Control**
- Annotating hundreds or thousands of trials requires many API calls
- Local models (e.g., LLaMA, Mistral) have no per-call API fees once the hardware is in place

⚡ **Speed**
- No network round trip to an external API (actual throughput still depends on your local hardware)

**Trade-off:**
- Local models may be slightly (or significantly) less capable than frontier models (GPT-5, Claude Sonnet)
- But for structured tasks with validation, mid-tier models can perform well
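
Before constructing the client, it can help to confirm that the two environment variables this notebook reads (`LOCAL_LLM_URL` and `LOCAL_LLM`) are actually set. The sketch below is one possible pre-flight check; the helper name `missing_settings` is hypothetical and not part of this project.

```python
import os
from typing import Mapping, Sequence

# Environment variables read by the configuration cell in this notebook
REQUIRED_VARS = ("LOCAL_LLM_URL", "LOCAL_LLM")


def missing_settings(env: Mapping[str, str], required: Sequence[str] = REQUIRED_VARS) -> list[str]:
    """Return the names of required variables that are unset or blank (hypothetical helper)."""
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_settings(os.environ)
if missing:
    print(f"Set these variables in your .env file before continuing: {missing}")
else:
    print("Local LLM settings found.")
```

Passing the environment in as a mapping keeps the check easy to exercise with a plain dictionary, without touching the real environment.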

In [7]:
load_dotenv(find_dotenv())

logger = logging_config.get_logger(__name__)

DATA_STORAGE = os.getenv("DATA_LOC", None)

if DATA_STORAGE and Path(DATA_STORAGE).exists():
    logger.info(f"Data will be saved at:{DATA_STORAGE}")
else:
    DATA_STORAGE = Path.cwd()  # __file__ is not defined inside a notebook, so fall back to the working directory
    logger.warning(f"Warning: Data storage path in environment does not exist or was not set, saving data here: {DATA_STORAGE}")

cleaned_trials_loc = f"{DATA_STORAGE}/cleaned_trials.pkl"
if Path(cleaned_trials_loc).exists():
    with open(cleaned_trials_loc, "rb") as f:
        cleaned_trials = pickle.load(f)
else:
    logger.warning("No pkl file found: You must run the data retriever workflow before executing this notebook")

annotator_llm = ChatOpenAI(base_url=os.getenv("LOCAL_LLM_URL"), model = os.getenv("LOCAL_LLM"))


[2025-11-13 08:30:03] INFO     - __main__ - Data will be saved at:/Users/joshuaziel/Documents/Coding/glp-1_landscape/data


## Data Safety Architecture

Before we run the LLM annotation workflow, let's understand how this system **guarantees** your original data remains intact.

### The "Working Copy" Pattern

```
Original Data (cleaned_trials.pkl)
         ↓
    [Load into memory]
         ↓
  Working Copy (in AnnotatorWorkflow)
         ↓
   [LLM annotations applied]
         ↓
 Annotated Data (new output file)
         ↓
Original file UNCHANGED ✅
```

### Protective Mechanisms

**1. Immutable Original**
- The `cleaned_trials.pkl` file from `data_retriever.ipynb` is **never modified**
- It remains as a permanent checkpoint you can always return to

**2. `original_data` Attribute**
- When `AnnotatorWorkflow` loads your data, it stores a copy in `self.original_data`
- All operations happen on `self.working_data` 
- You can always compare original vs annotated to see what changed
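
The working-copy idea can be sketched in a few lines. This is an illustration only: the class name `WorkingCopy` and the sample record are hypothetical, the sketch uses a plain list of dicts rather than the DataFrame that `AnnotatorWorkflow` holds, and the real implementation in `annotator.py` may differ.

```python
import copy


class WorkingCopy:
    """Keep an untouched snapshot next to the data that gets annotated (hypothetical sketch)."""

    def __init__(self, records: list[dict]):
        # Deep copies so nested lists are not shared with the caller or with each other
        self.original_data = copy.deepcopy(records)
        self.working_data = copy.deepcopy(records)

    def annotate(self, index: int, field: str, value) -> None:
        """Write an annotation to the working copy only."""
        self.working_data[index][field] = value

    def changed_fields(self, index: int) -> set[str]:
        """Fields whose value differs between the snapshot and the working copy."""
        before, after = self.original_data[index], self.working_data[index]
        return {key for key in after if before.get(key) != after[key]}


trials = [{"nct_id": "NCT00000001", "matched_conditions": []}]  # made-up record
session_sketch = WorkingCopy(trials)
session_sketch.annotate(0, "tx_category", "ENDOCRINOLOGY")
print(session_sketch.changed_fields(0))  # {'tx_category'}
```

One caveat when applying the same idea to pandas: `DataFrame.copy(deep=True)` does not recursively copy Python objects such as lists stored in cells, so list-valued columns are safest updated by assigning new lists rather than mutating existing ones in place.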

**3. Non-Destructive Additions**
- LLM doesn't overwrite existing columns
- New information is **added** to:
  - `matched_conditions` (appends MeSH terms)
  - `tx_category` (new column)
  - `tx_category_confidence` (new column)
  - `llm_annotations` (tracks which fields were LLM-modified)

**4. Timestamped Outputs**
- All exports include timestamps in filenames (e.g., `annotated_trials_20231113_143022.pkl`)
- Never overwrites previous runs
- Full versioning history maintained
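
A timestamped filename of the form shown above can be produced with the standard library. The helper name `timestamped_path` is hypothetical; the workflow's own export code may build the name differently.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def timestamped_path(directory: str, stem: str, suffix: str, now: Optional[datetime] = None) -> Path:
    """Build a path such as <directory>/annotated_trials_20231113_143022.pkl (hypothetical helper)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(directory) / f"{stem}_{stamp}{suffix}"


example = timestamped_path("data", "annotated_trials", ".pkl", now=datetime(2023, 11, 13, 14, 30, 22))
print(example.name)  # annotated_trials_20231113_143022.pkl
```

Because the stamp has one-second resolution, two exports within the same second would collide; a run started by hand is unlikely to hit that.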

**5. CSV Exports for Audit**
- Every mapping decision exported to human-readable CSV:
  - `mapped_to_existing_conditions.csv` - Synonym mappings
  - `searched_conditions.csv` - New MeSH term searches
  - `categorized_mesh_terms.csv` - Therapeutic classifications
- These allow manual review of every LLM decision

### Verification After Running

After the workflow completes, you can verify data integrity:

```python
# Compare trial counts
assert len(cleaned_trials) == len(annotated_trials)

# Check original columns still exist
assert all(col in annotated_trials.columns for col in cleaned_trials.columns)

# Verify NCT IDs unchanged (primary key)
assert (cleaned_trials['nct_id'] == annotated_trials['nct_id']).all()
```

### What If Something Goes Wrong?

If LLM annotations are unsatisfactory:
1. **Revert:** Delete the output pickle and re-run from `cleaned_trials.pkl`
2. **Adjust:** Modify prompts or validation rules in `annotator.py`
3. **Override:** Use the human review CSV (explained later)

**Your original data is always safe.**

## Three-Stage Annotation Workflow

The `AnnotatorWorkflow.run_annotation_workflow()` method below executes three sequential stages. Each builds on the previous stage's results:

---

### Stage 1: Map Unmapped Conditions to Existing MeSH Terms

**Problem:**  
In `data_retriever.ipynb`, we used the `mesh_mapper` service to map conditions to MeSH terms via API lookup. However, some conditions failed to map because:
- They use non-standard abbreviations (e.g., "T2DM", "CAD")
- They're informal terms (e.g., "high blood sugar" instead of "hyperglycemia")
- They include study parameters mixed with conditions (e.g., "diabetes with A1C > 7%")

**LLM Solution:**  
The LLM reviews unmapped conditions and attempts to match them to **existing MeSH terms already in the dataset**.

**How it works:**
1. Extract all conditions that lack MeSH mappings
2. Get list of all successfully mapped MeSH terms from previous step
3. Ask LLM: "Which existing MeSH term best matches this unmapped condition?"
4. LLM returns: Matched term + confidence level (HIGH/MEDIUM/LOW)
5. **Validation:** Check that LLM's suggested term actually exists in our MeSH list
6. Accept valid matches; reject and log invalid ones

**Example:**
```
Unmapped: "T2DM"
LLM matches to existing: "Diabetes Mellitus, Type 2 (MeSH ID:D003924)"
Confidence: HIGH
Status: ✅ Accepted (term exists in our dataset)
```

**Output:** `mapped_to_existing_conditions.csv` for review

---

### Stage 2: Search for New MeSH Terms

**Problem:**  
Some unmapped conditions don't match any existing MeSH term in our dataset (e.g., trial studying a rare disease). These need fresh MeSH lookups.

**LLM + API Solution:**  
For remaining unmapped conditions, we use a **two-step process**:

**Step 2a: Extract Primary Condition from Context**
- Many trial descriptions are complex: "Type 2 Diabetes in patients with chronic kidney disease and obesity"
- LLM analyzes the full trial record (title, summary, outcomes, conditions list)
- Extracts the **primary medical condition** being studied
- Returns: Clean condition string suitable for MeSH database search

**Step 2b: Verify via NCBI API**
- Take LLM-extracted condition and query NCBI's MeSH database
- Use the same `mesh_mapper` service from the retriever notebook
- Filter to disease/disorder categories only (tree codes C or F)
- If found: Accept; If not found: Mark as "NOT DETERMINED"

**Why two-step?**
- LLM is great at understanding context and extracting meaning
- But only the authoritative NCBI database can confirm valid MeSH terms
- Combining both gives flexibility + accuracy
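
The two-step idea can be sketched with the LLM and the NCBI lookup passed in as plain functions. Everything named here (`extract_then_verify`, `fake_extract`, `fake_lookup`) is hypothetical and exists only so the sketch runs offline; the dictionary keys `mesh_term` and `mesh_id` are assumed to match what the lookup service returns.

```python
from typing import Callable, Optional

NOT_DETERMINED = "NOT DETERMINED"


def extract_then_verify(
    trial_text: str,
    extract: Callable[[str], str],
    lookup: Callable[[str], Optional[dict]],
) -> str:
    """Step 2a proposes a condition; Step 2b accepts it only if the lookup confirms it."""
    candidate = extract(trial_text)   # stand-in for the LLM extraction call
    record = lookup(candidate)        # stand-in for the NCBI MeSH lookup
    if record is None:
        return NOT_DETERMINED
    return f"{record['mesh_term']} (MeSH ID:{record['mesh_id']})"


# Stubs so the sketch runs offline; real calls would go to the LLM and to NCBI
def fake_extract(text: str) -> str:
    return "Diabetes Mellitus" if "diabetes" in text.lower() else text


def fake_lookup(term: str) -> Optional[dict]:
    known = {"Diabetes Mellitus": {"mesh_term": "Diabetes Mellitus", "mesh_id": "D003920"}}
    return known.get(term)


print(extract_then_verify("Diabetes in patients with renal impairment", fake_extract, fake_lookup))
print(extract_then_verify("Study of an unlisted condition", fake_extract, fake_lookup))
```

The point of the design is that the LLM's output never reaches the dataset directly: only the string built from the lookup record is returned.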

**Example:**
```
Original condition: "Diabetes in patients with renal impairment"
LLM extracts: "Diabetes Mellitus"
NCBI API returns: "Diabetes Mellitus (MeSH ID:D003920)"
Status: ✅ Accepted (verified by NCBI)
```

**Output:** `searched_conditions.csv` for review

---

### Stage 3: Classify Trials into Therapeutic Categories

**Problem:**  
For analysis and visualization, it's useful to group trials by therapeutic area:
- ONCOLOGY (cancer)
- ENDOCRINOLOGY (diabetes, thyroid)
- CARDIOVASCULAR (heart disease, hypertension)
- NEUROLOGY (Alzheimer's, Parkinson's)
- etc.

MeSH terms are very specific (e.g., "Diabetes Mellitus, Type 2"), but we want broad categories.

**LLM Solution:**  
For each trial's MeSH condition term(s), the LLM assigns a therapeutic category from a predefined list of 22 categories.

**How it works:**
1. Provide LLM with the trial's MeSH-standardized condition
2. Provide definitions of all 22 therapeutic categories
3. LLM selects the **single best-fit category**
4. Returns: Category + confidence + reasoning
5. **Constraint:** Must choose from enum (no free-form categories)

**Categories available:**
- ONCOLOGY
- ENDOCRINOLOGY  
- CARDIOVASCULAR
- NEUROLOGY
- IMMUNOLOGY
- INFECTIOUS_DISEASE
- RESPIRATORY
- GASTROENTEROLOGY
- NEPHROLOGY
- RHEUMATOLOGY
- DERMATOLOGY
- OPHTHALMOLOGY
- OTOLARYNGOLOGY
- PSYCHIATRY
- HEMATOLOGY
- SURGERY
- OBSTETRICS_GYNECOLOGY
- PEDIATRICS
- GERIATRICS
- PAIN_MANAGEMENT
- CRITICAL_CARE
- OTHER

**Example:**
```
MeSH term: "Diabetes Mellitus, Type 2 (MeSH ID:D003924)"
LLM assigns: ENDOCRINOLOGY
Confidence: HIGH
Reasoning: "Diabetes is a metabolic/endocrine disorder"
```

**Output:** `categorized_mesh_terms.csv` for review

---

### Progressive Refinement

Notice the workflow is **progressive**:
1. Try to match existing terms (fastest, most reliable)
2. If that fails, extract and search for new terms (slower, but comprehensive)
3. Finally, categorize everything for high-level analysis

This minimizes API calls and ensures maximum data coverage.

---

### Execution Time

Expect the workflow below to take **several minutes** depending on:
- Number of trials
- Number of unmapped conditions  
- LLM inference speed
- API rate limits

Progress will be logged in real-time. Be patient!

In [8]:
session = annotator.AnnotatorWorkflow(df = cleaned_trials, llm = annotator_llm, data_loc = DATA_STORAGE )
session.run_annotation_workflow()
annotated_trials = session.annotated_data.copy()

[2025-11-13 08:30:06] INFO     - services.annotator - Loaded existing MeSH map containing 298 mappings as a <class 'dict'>


## How LLM Decisions Are Constrained: Structured Outputs

The workflow above has completed (or is running). Let's examine **how** the LLM is prevented from "hallucinating" or producing invalid outputs.

### The Problem with Free-Form LLM Responses

If we asked an LLM a simple question without constraints:

```python
# ❌ DANGEROUS - No constraints
response = annotator_llm.invoke("What MeSH term matches 'T2DM'?")
# response.content might be:
# "The term T2DM refers to Type 2 Diabetes Mellitus, which is..."
# (free-form text, hard to parse, may include errors)
```

**Issues:**
- Response format is unpredictable
- No way to programmatically extract the MeSH ID
- No confidence level
- May include explanations mixed with data
- Difficult to validate

---

### The Solution: Pydantic Schemas (Structured Outputs)

Instead, we define **strict data schemas** using Pydantic that the LLM **must** conform to:

```python
# ✅ SAFE - Structured output
from typing import Literal
from pydantic import BaseModel

class ConditionMapping(BaseModel):
    """Schema that LLM must follow"""
    original_condition: str
    matched_mesh_term: str
    confidence: Literal["HIGH", "MEDIUM", "LOW"]
    reasoning: str

# One way to bind a schema with LangChain chat models (annotator.py may do this differently)
structured_llm = annotator_llm.with_structured_output(ConditionMapping)
response = structured_llm.invoke("What MeSH term matches 'T2DM'?")

# A successfully parsed response has these fields (values are illustrative):
print(response.original_condition)  # "T2DM"
print(response.matched_mesh_term)   # "Diabetes Mellitus, Type 2 (MeSH ID:D003924)"
print(response.confidence)          # "HIGH"
print(response.reasoning)           # "T2DM is standard abbreviation for..."
```

### Schemas Used in This Workflow

**1. `MedicalConditionFilter`** (Stage 1 filtering)
```python
class MedicalConditionFilter(BaseModel):
    medical_conditions: list[str]  # Only actual diseases
    non_medical: list[str]         # Study parameters to exclude
```

**2. `ConditionMapping`** (Stage 1 synonym matching)
```python
class ConditionMapping(BaseModel):
    original_condition: str
    matched_mesh_term: str
    confidence: ConfidenceLevel  # Enum: HIGH/MEDIUM/LOW
```

**3. `ConditionExtraction`** (Stage 2 context extraction)
```python
class ConditionExtraction(BaseModel):
    primary_condition: str
    confidence: ConfidenceLevel
    reasoning: str
```

**4. `TxCategoryAnnotation`** (Stage 3 categorization)
```python
class TxCategoryAnnotation(BaseModel):
    therapeutic_category: Optional[TherapeuticCategory]  # Enum of 22 categories
    confidence: Optional[ConfidenceLevel]
    explanation: Optional[str]
```

### Why This Works

**Benefits of Structured Outputs:**

✅ **Predictable Format**
- Every response has the same structure
- Easy to parse programmatically
- No need for regex or string manipulation

✅ **Type Safety**
- `confidence` must be exactly one of the values the schema allows: "HIGH", "MEDIUM", or "LOW" (not "pretty sure" or "maybe")
- `therapeutic_category` must come from the predefined enum (not an arbitrary category)

✅ **Validation Built-In**
- Pydantic automatically validates types
- Missing required fields cause errors (caught immediately)
- Invalid enum values are rejected

✅ **Self-Documentation**
- Schema serves as clear specification for LLM
- Reduces ambiguity in prompts

---

### Example: Enum Constraints

The therapeutic category must be one of exactly 22 options:

```python
class TherapeuticCategory(str, Enum):
    ONCOLOGY = "ONCOLOGY"
    ENDOCRINOLOGY = "ENDOCRINOLOGY"
    CARDIOVASCULAR = "CARDIOVASCULAR"
    # ... 19 more categories
```

If LLM tries to return `"METABOLISM"` (not in enum) → **Validation error**, not accepted.

This prevents the LLM from inventing categories.
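
The enum check itself can be tried with the standard library alone. In the pipeline Pydantic surfaces the failure as a `ValidationError`; the sketch below shows the underlying lookup, which raises `ValueError` for an unknown value. Only four of the 22 members are listed, and the helper `parse_category` is hypothetical.

```python
from enum import Enum
from typing import Optional


class TherapeuticCategory(str, Enum):
    # Subset of the 22 categories, enough to show the behaviour
    ONCOLOGY = "ONCOLOGY"
    ENDOCRINOLOGY = "ENDOCRINOLOGY"
    CARDIOVASCULAR = "CARDIOVASCULAR"
    OTHER = "OTHER"


def parse_category(raw: str) -> Optional[TherapeuticCategory]:
    """Return the matching enum member, or None when the value is not a defined category."""
    try:
        return TherapeuticCategory(raw)
    except ValueError:
        return None


print(parse_category("ENDOCRINOLOGY") is TherapeuticCategory.ENDOCRINOLOGY)  # True
print(parse_category("METABOLISM"))  # None
```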

---

### Confidence Scoring

Notice that most of the schemas include a `confidence` field. This is crucial because:

**LLMs can report a self-assessed level of certainty**, for example:
- `HIGH`: Strong evidence, clear match
- `MEDIUM`: Reasonable match but some ambiguity  
- `LOW`: Weak match, manual review recommended

While we haven't fully implemented this here, it could allow for more nuanced downstream filtering:
```python
# Only use high-confidence annotations
high_conf = annotated_trials[annotated_trials['tx_category_confidence'] == 'HIGH']
```

**Manual review prioritization:**
- Focus human review on LOW confidence items
- Trust HIGH confidence items (but still audit-able)

---

### What This Means for Data Quality

Structured outputs + validation layers = **controlled intelligence**

The LLM brings:
- Context understanding
- Synonym recognition  
- Reasoning capability

But is constrained by:
- Predefined schemas
- Type validation
- External API verification
- Confidence scoring (if implemented)

**Result:** Flexible enough to handle edge cases, safe enough for production pipelines.

## Multiple Validation Layers: Defense Against Hallucination

Structured outputs are the first line of defense, but we employ **multiple validation layers** to catch errors:

### Layer 1: Schema Validation (Pydantic)

**What it catches:**
- Wrong data types (e.g., string instead of list)
- Missing required fields
- Invalid enum values

**Example:**
```python
# LLM returns an invalid confidence level
ConditionExtraction(primary_condition="T2DM", confidence="very confident", reasoning="...")  # ❌ Not in enum
# Pydantic raises ValidationError → response rejected
```
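
A self-contained version of this check, runnable without an LLM, might look like the following. It assumes `ConfidenceLevel` is a string enum with the three values described in this notebook; the helper `parse_extraction` and the sample payloads are hypothetical.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, ValidationError


class ConfidenceLevel(str, Enum):
    # Assumed definition, based on the HIGH/MEDIUM/LOW levels described in this notebook
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


class ConditionExtraction(BaseModel):
    primary_condition: str
    confidence: ConfidenceLevel
    reasoning: str


def parse_extraction(payload: dict) -> Optional[ConditionExtraction]:
    """Return a validated object, or None if the payload violates the schema."""
    try:
        return ConditionExtraction(**payload)
    except ValidationError:
        return None


good = {"primary_condition": "Diabetes Mellitus", "confidence": "HIGH", "reasoning": "Main condition studied"}
bad_enum = {**good, "confidence": "very confident"}
missing_field = {"primary_condition": "Diabetes Mellitus", "confidence": "HIGH"}

print(parse_extraction(good) is not None)  # True
print(parse_extraction(bad_enum))          # None
print(parse_extraction(missing_field))     # None
```

Both failure modes listed above (invalid enum value, missing required field) end up on the same rejection path, which keeps downstream code simple.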

---

### Layer 2: Cross-Validation Against Existing Data

**Stage 1 (Synonym Matching):** When LLM suggests a MeSH term match, we verify it exists in our dataset.

```python
if mapping.matched_mesh_term in valid_mesh_terms:
    validated_mappings[mapping.original_condition] = mapping.matched_mesh_term
else:
    rejected_count += 1
    logger.info(f"Rejected invalid mapping: '{mapping.original_condition}' -> '{mapping.matched_mesh_term}' (not in existing MeSH terms)")

# After all mappings have been checked, report the total once
if rejected_count > 0:
    logger.info(f"Total rejected mappings: {rejected_count}")
```

**Result:** LLM cannot invent MeSH terms that don't exist in our validated dataset.

---

### Layer 3: External API Verification (NCBI)

**Stage 2 (New MeSH Searches):** For conditions requiring fresh lookups, we don't trust LLM alone.

**Two-step verification:**

1. **LLM extracts** the primary condition from context
2. **NCBI API verifies** that the extracted condition corresponds to a real MeSH term before anything is added

```python
# Excerpt from the Stage 2 lookup (runs inside a function, hence the return statements)
try:
    result = mesh_mapper.search_mesh_term(condition, filter_diseases_only=True)

    if result:
        mesh_term = f"{result['mesh_term']} (MeSH ID:{result['mesh_id']})"
        logger.info(f"Found MeSH term for '{condition}': {mesh_term}")
        return mesh_term
    else:
        logger.warning(f"No MeSH term found for '{condition}'")
        return "NOT DETERMINED"

except Exception as e:
    logger.error(f"Error searching for MeSH term for '{condition}': {e}")
    return "NOT DETERMINED"
```

**Why this matters:**
- LLM might extract a reasonable-sounding but non-existent term
- NCBI is the **authoritative source** for MeSH terms
- Only conditions that can be clearly matched to a MeSH term in NCBI's official database result in a mapping

---

### Layer 4: Retry Logic with Error Handling

**Stage 3 (Categorization):** If the LLM fails to return valid structured output, we retry up to 4 times.

**Why retry?**
- LLMs occasionally have transient failures (parsing errors, malformed JSON)
- Retrying gives the LLM another chance before the item is marked as failed
- After 4 attempts, we fail gracefully (mark as None) rather than crash
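
A retry wrapper along these lines could look like the sketch below. The names `call_with_retries` and `flaky_categorize` are hypothetical, and the stand-in task fails twice on purpose so the behaviour is visible without an LLM; the actual retry code in `annotator.py` may be structured differently.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(task: Callable[[], T], max_attempts: int = 4) -> Optional[T]:
    """Run task until it succeeds or max_attempts is reached; return None on final failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:  # in practice, catch the specific parsing/validation errors
            print(f"Attempt {attempt}/{max_attempts} failed: {exc}")
    return None


# Stand-in for an LLM call that returns malformed output twice before succeeding
attempts = {"count": 0}


def flaky_categorize() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("malformed JSON")
    return "ENDOCRINOLOGY"


print(call_with_retries(flaky_categorize))  # ENDOCRINOLOGY (after two logged failures)
```

Returning `None` rather than re-raising is what lets the rest of the batch continue when a single trial cannot be categorized.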

---

### Layer 5: Confidence-Based Filtering

Every LLM decision includes a confidence score. To take the notebook further, you could capture these in the dataframe and filter post-hoc:

```python
# Only accept high-confidence therapeutic categories
reliable = annotated_trials[
    annotated_trials['tx_category_confidence'] == 'HIGH'
]

# Flag low-confidence items for human review
needs_review = annotated_trials[
    annotated_trials['tx_category_confidence'] == 'LOW'
]
```

This allows graduated trust, for example:
- HIGH confidence: Use automatically
- MEDIUM confidence: Spot-check a sample
- LOW confidence: Manual review required

---

### Layer 6: Comprehensive Logging

Every validation failure is logged:

```python
logger.warning(f"Mapping could not be validated for {condition}")
logger.info(f"Successfully matched {count} conditions")
logger.error(f"Failed to extract condition from trial {nct_id}")
```

**Benefits:**
- Audit trail of all decisions and failures
- Debugging: Identify systematic issues
- Quality metrics: Track error rates over time

Check the logs after running to see:
- How many conditions were successfully mapped
- Which mappings were rejected
- Why certain trials couldn't be annotated

---

### Layer 7: CSV Exports for Human Review

**How to use:**
1. Open CSV in Excel/spreadsheet software
2. Review as the process goes along to identify any systematic errors
3. Override via human review CSV (see next section)
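
For orientation, an export of this kind can be written with the standard `csv` module. The column layout in `FIELDNAMES` and the helper `decisions_to_csv` are hypothetical; the files produced by the workflow may use different headers.

```python
import csv
import io

# Hypothetical column layout; the real export files may use different headers
FIELDNAMES = ["original_condition", "matched_mesh_term", "confidence", "status"]


def decisions_to_csv(decisions: list[dict]) -> str:
    """Serialise one row per LLM decision so a reviewer can open it in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(decisions)
    return buffer.getvalue()


decisions = [
    {
        "original_condition": "T2DM",
        "matched_mesh_term": "Diabetes Mellitus, Type 2 (MeSH ID:D003924)",
        "confidence": "HIGH",
        "status": "accepted",
    },
]
print(decisions_to_csv(decisions))
```

Note that the MeSH term contains a comma, so the writer quotes that field automatically; building CSV lines by hand with string joins would break on such values.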

---

## Summary: Defense in Depth

| Layer | What It Does | What It Catches |
|-------|--------------|-----------------|
| 1. Schema validation | Enforces structure | Type errors, missing fields, invalid enums |
| 2. Cross-validation | Checks against existing data | Hallucinated MeSH terms not in dataset |
| 3. API verification | Checks against NCBI | Non-existent MeSH terms |
| 4. Retry logic | Multiple attempts | Transient LLM failures |
| 5. Confidence scoring (if implemented) | Self-assessment | Uncertain mappings |
| 6. Logging | Records all events | Systematic errors, trends |
| 7. CSV exports | Human review | Edge cases, quality assurance |

**No single layer is perfect, but together they create a robust system.**

Even if an error slips through one layer, a later layer has a chance to catch it. This "defense in depth" approach aims for:
- High accuracy
- Transparent failures
- Auditable decisions
- Graceful degradation (failures are logged instead of crashing the run)

## Inspecting Results: Quality Assurance

The workflow has completed and created the `annotated_trials` DataFrame. Now let's verify the quality of the LLM annotations.

### Quick Quality Checks

**1. Verify Data Integrity**
```python
# Check no trials were lost
print(f"Original trials: {len(cleaned_trials)}")
print(f"Annotated trials: {len(annotated_trials)}")
assert len(cleaned_trials) == len(annotated_trials)

# Check NCT IDs match (primary key unchanged)
assert (cleaned_trials['nct_id'] == annotated_trials['nct_id']).all()
```

**2. Check Annotation Coverage**
```python
# How many trials have MeSH-mapped conditions?
trials_with_mesh = annotated_trials['matched_conditions'].apply(
    lambda x: len(x) > 0 if isinstance(x, list) else False
)
coverage = trials_with_mesh.sum() / len(annotated_trials) * 100
print(f"Trials with MeSH conditions: {coverage:.1f}%")

# How many have therapeutic categories? (handles string or list values; missing values count as uncategorized)
trials_with_category = annotated_trials['tx_category'].apply(
    lambda x: len(x) > 0 if isinstance(x, (list, str)) else False
)
print(f"Trials with therapeutic category: {trials_with_category.sum()} "
      f"({trials_with_category.mean()*100:.1f}%)")

# Trials missing a category or any MeSH condition need manual review
needs_review = annotated_trials[~trials_with_category | ~trials_with_mesh]
print(f"\nTrials needing manual review: {len(needs_review)}")
print("\nReasons:")
print(f"  - No category assigned: {(~trials_with_category).sum()}")
print(f"  - No MeSH conditions: {(~trials_with_mesh).sum()}")
```

---

### Review CSV Exports
**These files are ready for human review and verification.**

In [None]:
annotated_trials_loc = f"{DATA_STORAGE}/<fill in pkl filename to load>"
if Path(annotated_trials_loc).exists():
    with open(annotated_trials_loc, "rb") as f:
        annotated_trials = pickle.load(f)
else:
    logger.error("Check the path to the pickle!")

In [None]:
human_review_path = f"{DATA_STORAGE}/human_review.csv"
if Path(human_review_path).exists():
    changes_df = pd.read_csv(human_review_path, header=0)
    logger.info("Loaded changes file for human review")
else: 
    changes_df = None
    logger.info("Store your changes file for human review at the appropriate location")

if changes_df is not None:
    try:
        for change in changes_df.itertuples():
            row = change.row
            column = change.column
            new_assignment = change.new_assignment
            annotated_trials.loc[row, column] = new_assignment
            # Record each applied override so the change is traceable
            logger.info(f"Applied override: row {row}, column '{column}' -> {new_assignment}")
    except Exception as e:
        logger.error(f"Your changes could not be applied. Perhaps the file was not appropriately formatted: {e}")

## Human-in-the-Loop: Override LLM Decisions

Even with multiple validation layers, you may identify incorrect LLM annotations during manual review. The human-in-the-loop override system allows you to correct these without re-running the entire workflow.

---

### Why Human Override Is Important

**LLMs are probabilistic, not perfect:**
- May misinterpret context (e.g., trial about diabetes complications → wrongly categorizes as NEPHROLOGY instead of ENDOCRINOLOGY)
- May struggle with edge cases (e.g., trials involving multiple conditions)
- May lack domain expertise (e.g., specific knowledge about rare diseases)

**Human experts bring:**
- Domain knowledge
- Understanding of research context
- Ability to resolve ambiguity based on trial purpose

**The goal:** Combine LLM efficiency (handling the bulk of annotations automatically) with human expertise (correcting the minority that need it)

---

### How to Override LLM Decisions

**Step 1: Identify Issues During Review**

While reviewing the annotated trials or CSV exports, you may find errors:

```python
# Example: Review trials categorized as NEPHROLOGY
nephrology_trials = annotated_trials[annotated_trials['tx_category'] == 'NEPHROLOGY']

for idx, trial in nephrology_trials.iterrows():
    print(f"Row {idx}: {trial['nct_id']} - {trial['brief_title']}")
    print(f"  Category: {trial['tx_category']} (confidence: {trial['tx_category_confidence']})")
    print(f"  Conditions: {trial['matched_conditions']}")
    
# You notice: Row 245 is a diabetes trial, should be ENDOCRINOLOGY, not NEPHROLOGY
```

**Step 2: Create a Human Review CSV**

Create a file called `human_review.csv` in your `DATA_STORAGE` directory with this format:

| row | column | new_assignment |
|-----|--------|----------------|
| 245 | tx_category | ENDOCRINOLOGY |
| 367 | tx_category | CARDIOVASCULAR |
| 412 | matched_conditions | ["Diabetes Mellitus, Type 2 (MeSH ID:D003924)"] |

**CSV format:**
```csv
row,column,new_assignment
245,tx_category,ENDOCRINOLOGY
367,tx_category,CARDIOVASCULAR
412,matched_conditions,"[""Diabetes Mellitus, Type 2 (MeSH ID:D003924)""]"
```

**Column explanations:**
- `row`: The DataFrame index of the trial to modify (find using `annotated_trials[annotated_trials['nct_id'] == 'NCT12345678'].index[0]`)
- `column`: Which column to modify (e.g., `tx_category`, `matched_conditions`, `tx_category_confidence`)
- `new_assignment`: The corrected value (must match column data type)

**Step 3: Run the Override Cell**

The override cell (the one that loads `human_review.csv`) will:
1. Load your `human_review.csv` file
2. Apply each correction to `annotated_trials`
3. Log which changes were made

```python
# This is what the override cell does:
for change in changes_df.itertuples():
    row = change.row
    column = change.column
    new_assignment = change.new_assignment
    
    # Apply override
    annotated_trials.loc[row, column] = new_assignment
```

---

### Example Workflow

**Scenario:** You find that trial NCT04856789 (row 245) was categorized as NEPHROLOGY, but it's primarily about diabetes management in patients with kidney complications. The correct category should be ENDOCRINOLOGY.

**Step-by-step:**

1. **Find the row index:**
```python
row_idx = annotated_trials[annotated_trials['nct_id'] == 'NCT04856789'].index[0]
print(f"Row index: {row_idx}")  # Output: 245
```

2. **Create `human_review.csv` in your data directory:**
```csv
row,column,new_assignment
245,tx_category,ENDOCRINOLOGY
245,tx_category_confidence,HIGH
```
(Note: You can update multiple columns for the same row, or multiple rows at once)

3. **Save the CSV to:** `{DATA_STORAGE}/human_review.csv`

4. **Run the override cell** → Changes will be applied automatically

5. **Verify the change:**
```python
print(annotated_trials.loc[245, 'tx_category'])  # Output: ENDOCRINOLOGY
```

---

### Best Practices

**✅ DO:**
- Document your rationale for overrides (add comments in CSV or separate notes)
- Save a copy of `human_review.csv` for reproducibility

**❌ DON'T:**
- Override without inspecting the trial details
- Batch-modify without verifying each case
- Delete `human_review.csv` after applying (keep for audit trail)
- Modify original data files directly

---

### What Can You Override?

You can modify any column in the DataFrame, but most commonly:

| Column | Type | Example Values |
|--------|------|----------------|
| `tx_category` | str | `ENDOCRINOLOGY` |
| `matched_conditions` | list | `["Diabetes Mellitus, Type 2 (MeSH ID:D003924)"]` |

**Note:** When modifying list columns like `matched_conditions`, use proper JSON array syntax with escaped quotes in CSV:
```csv
row,column,new_assignment
123,matched_conditions,"[""Diabetes Mellitus, Type 2 (MeSH ID:D003924)"", ""Obesity (MeSH ID:D009765)""]"
```
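
Values read from a CSV arrive as strings, so a JSON array written this way needs to be parsed before it is stored as a list. The sketch below shows one way to do that; the helper `parse_assignment` is hypothetical and is not part of the override cell.

```python
import csv
import io
import json


def parse_assignment(raw: str):
    """Turn a JSON array cell into a Python list; leave every other value as a plain string."""
    text = raw.strip()
    if text.startswith("["):
        return json.loads(text)
    return text


csv_text = '''row,column,new_assignment
245,tx_category,ENDOCRINOLOGY
123,matched_conditions,"[""Diabetes Mellitus, Type 2 (MeSH ID:D003924)"", ""Obesity (MeSH ID:D009765)""]"
'''

for record in csv.DictReader(io.StringIO(csv_text)):
    value = parse_assignment(record["new_assignment"])
    print(record["row"], record["column"], type(value).__name__, value)
```

When writing a parsed list into a single DataFrame cell, `annotated_trials.at[row, column] = value` is generally the safer choice, since `.at` targets exactly one cell while `.loc` may try to spread a list across the selection.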

---

### Tracking Overrides

After applying human overrides, you can track which trials were modified:

```python
# Load original automated annotations
original = session.annotated_data.copy()

# Compare to human-corrected version
differences = []
for idx in annotated_trials.index:
    for col in ['tx_category', 'matched_conditions']:
        if original.loc[idx, col] != annotated_trials.loc[idx, col]:
            differences.append({
                'nct_id': annotated_trials.loc[idx, 'nct_id'],
                'column': col,
                'llm_value': original.loc[idx, col],
                'human_value': annotated_trials.loc[idx, col]
            })

override_log = pd.DataFrame(differences)
override_log.to_csv(f"{DATA_STORAGE}/human_overrides_log.csv", index=False)
print(f"Logged {len(differences)} human overrides")
```

This creates an audit trail showing:
- Which trials were overridden
- What the LLM originally assigned
- What humans corrected it to

---

### When to Override vs Improve the Workflow

**Override individual errors** when:
- Small number of mistakes (<5% of trials)
- Edge cases that are inherently ambiguous
- Domain-specific judgments that LLM can't know

**Improve LLM prompts/workflow** when:
- Systematic errors (e.g., all diabetes trials miscategorized)
- >10% of trials need correction
- Clear pattern in mistakes (e.g., LLM consistently confuses two categories)

In the latter case, modify prompts in `annotator.py` and re-run the workflow rather than manually fixing hundreds of trials.

---

## The Complete Human-LLM Partnership

This override system exemplifies the "circumscribed agentic" philosophy:

1. **LLM handles the bulk work** (synonym matching, context extraction, categorization)
2. **Validation layers catch most errors** (schema checks, API verification, confidence scoring)
3. **Humans review and correct edge cases** (domain expertise, nuanced judgment)
4. **All decisions are auditable** (CSV exports, override logs, confidence scores)

**Result:** A scalable, accurate, and trustworthy annotation pipeline that combines machine efficiency with human expertise.