# Clinical Trials LLM Annotation Pipeline

## Overview

This notebook is **Step 2** in our educational pipeline demonstrating how AI can be both informative and useful. We'll use an LLM to intelligently annotate the clinical trials data we cleaned in `data_retriever.ipynb`.

## What is an "Agentic Workflow"?

You may have heard terms like "AI agents" or "agentic systems", often associated with autonomous AI that can take actions independently. **This notebook demonstrates a different, more controlled approach**: a much simpler, **circumscribed**, but still agentic workflow. We use it to automate a straightforward task: annotating and classifying our clinical trial dataset so that we can better answer the question we care about: in which therapeutic categories have drugs targeting GLP-1 been investigated?

### Key Characteristics:

**Intelligent Decision-Making**
- The LLM interprets medical terminology, handles synonyms, and extracts meaning from unstructured text
- Goes beyond simple string matching to understand context

**Strict Guardrails**
- All LLM outputs are constrained to predefined formats (structured schemas)
- Multiple validation layers designed to reduce hallucinations
- Original data is never modified

**Audit Trail**
- Every LLM decision is logged
- Results exported to CSV for human review
- Fully transparent and traceable

**Human Oversight**
- Reviewers can inspect the LLM's assignments and override any of them

In [None]:
#Import the required libraries
import os
import pickle
from pathlib import Path
from langchain_openai import ChatOpenAI
import pandas as pd
from dotenv import load_dotenv, find_dotenv
from services import logging_config, mesh_mapper, annotator


[2025-11-13 11:30:10] INFO     - root - Logging initialized
[2025-11-13 11:30:10] INFO     - root - Log level: INFO
[2025-11-13 11:30:10] INFO     - root - Log directory: /Users/joshuaziel/Documents/Coding/glp-1_landscape/logs
[2025-11-13 11:30:10] INFO     - root - Console logging: True


## The Spectrum: Deterministic → Circumscribed Agent → Autonomous AI

To understand what makes this workflow "agentic but safe," let's compare three approaches to data processing:

### Approach 1: Fully Deterministic Script

**How it works:**
```python
if condition == "T2DM":
    standardized = "Type 2 Diabetes"
elif condition == "Type II Diabetes":
    standardized = "Type 2 Diabetes"
# ... must hardcode every possible variant or implement a more complicated natural language processing (NLP) workflow with things like fuzzy matching
```

**Pros:**
- ✅ Predictable and fast
- ✅ No risk of hallucination
- ✅ Easy to debug

**Cons:**
- ❌ Brittle without a more complex NLP workflow; fails on unexpected inputs
- ❌ Can't handle abbreviations or synonyms it wasn't programmed for
- ❌ Can't understand context

---

### Approach 2: Circumscribed Agent (This Notebook)

**How it works:**
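
In rough terms, the model proposes and plain code decides whether to accept. A minimal sketch, assuming a hypothetical `propose_term` function that stands in for the LLM call and an illustrative `ALLOWED_TERMS` whitelist (neither is part of this notebook's services):

```python
# Illustrative whitelist of acceptable outputs (assumed for this sketch)
ALLOWED_TERMS = {"Diabetes Mellitus, Type 2", "Obesity"}

def propose_term(condition: str) -> str:
    """Hypothetical stand-in for an LLM call that suggests a standardized term."""
    canned_guesses = {
        "T2DM": "Diabetes Mellitus, Type 2",
        "overweight": "Adiposity Syndrome",  # deliberately not in the whitelist
    }
    return canned_guesses.get(condition, "UNKNOWN")

def annotate(condition: str) -> dict:
    """Accept the proposal only if it passes validation; record the decision either way."""
    proposal = propose_term(condition)
    accepted = proposal in ALLOWED_TERMS
    return {"input": condition, "proposal": proposal, "accepted": accepted}

for cond in ["T2DM", "overweight"]:
    print(annotate(cond))
```

The model never writes to the data directly: every suggestion passes through `annotate`, which is ordinary, testable code.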


**Pros:**
- ✅ Handles synonyms, abbreviations, typos gracefully
- ✅ Understands context (e.g., "diabetes in patients with..." → extracts "diabetes")
- ✅ Adapts to variations without reprogramming
- ✅ Structured outputs prevent free-form hallucination
- ✅ Validation layers catch errors
- ✅ Audit trail for every decision

**Cons:**
- ⚠️ Requires validation infrastructure
- ⚠️ Slower than pure rule-based systems
- ⚠️ Still carries a risk of misinterpretation, which is why human review remains important

---

### Approach 3: Fully Autonomous Agent

**How it works:**
```python
# Agent has freedom to query databases, modify data, take actions
agent.run("Improve the quality of this clinical trials dataset by filling in missing data from the context and/or using your tools")
# Agent decides what to do with no constraints
```

**Pros:**
- ✅ Maximum flexibility
- ✅ Can discover novel approaches

**Cons:**
- ❌ Unpredictable behavior
- ❌ High risk of data corruption
- ❌ Opaque decision-making
- ❌ Difficult to reproduce results
- ❌ May take unintended actions

---

## Why We Choose the Middle Ground

For our purposes, we wanted to get to an answer fast, without overbuilding or extensive testing. For this use case, the circumscribed agentic approach offers an attractive trade-off: we benefit from built-in intelligence while keeping control over how our data is updated and a clear record of what has changed:

| Requirement | Deterministic | Circumscribed Agent | Autonomous |
|-------------|---------------|---------------------|------------|
| Flexibility | ❌ | ✅ | ✅ |
| Data Safety | ✅ | ✅ | ❌ |
| Reproducibility | ✅ | ✅ | ❌ |
| Context Understanding | ❌ | ✅ | ✅ |
| Auditability | ✅ | ✅ | ❌ |
| Validation | ✅ | ✅ | ⚠️ |


---

## Local LLM Configuration

Notice in the code below that we ran the system ourselves using a **local LLM** (configured via the `LOCAL_LLM_URL` environment variable) running in LM Studio rather than a cloud API like OpenAI or Anthropic. We've included commented-out code for GPT-5 in case you don't have a local LLM running on your machine. Keep in mind that you will then need to set an API key in the `.env` file (see `.env.example`) and have funds available to cover the run.

**Trade-off:**
- Local models may be slightly (or significantly) less capable than frontier models (GPT-5, Claude Sonnet)
- But for structured tasks with validation, mid-tier models can perform well and reduce costs when testing and iterating

In [None]:
load_dotenv(find_dotenv())

logger = logging_config.get_logger(__name__)

DATA_STORAGE = os.getenv("DATA_LOC", None)

if DATA_STORAGE and Path(DATA_STORAGE).exists():
    logger.info(f"Data will be saved at:{DATA_STORAGE}")
else:
    # __file__ is not defined inside a notebook, so fall back to the current working directory
    DATA_STORAGE = Path.cwd()
    logger.warning(f"Warning: Data storage path in environment does not exist or was not set, saving data here: {DATA_STORAGE}")

cleaned_trials_loc = f"{DATA_STORAGE}/cleaned_trials.pkl"
if Path(cleaned_trials_loc).exists():
    with open(cleaned_trials_loc, "rb") as f:
        cleaned_trials = pickle.load(f)
else:
    logger.warning("No pkl file found: You must run the data retriever workflow before executing this notebook")

# To switch providers, comment out one of the two lines below and uncomment the other
annotator_llm = ChatOpenAI(base_url=os.getenv("LOCAL_LLM_URL"), model=os.getenv("LOCAL_LLM"))  # Local LLM configuration
# annotator_llm = ChatOpenAI(model="gpt-5")  # OpenAI LLM configuration

[2025-11-13 11:30:14] INFO     - __main__ - Data will be saved at:/Users/joshuaziel/Documents/Coding/glp-1_landscape/data


## Data Safety Architecture

Before we run the LLM annotation workflow, let's understand how this system **guarantees** your original data remains intact.

### The "Working Copy" Pattern

```
Original Data (cleaned_trials.pkl)
         ↓
    [Load into memory]
         ↓
  Working Copy (in AnnotatorWorkflow)
         ↓
   [LLM annotations applied]
         ↓
 Annotated Data (new output file)
         ↓
Original file UNCHANGED ✅
```
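
A minimal sketch of the working-copy idea, using plain Python dictionaries rather than a DataFrame. `WorkingCopySession` and the sample record (including its NCT ID) are hypothetical and only illustrate the pattern; the real logic lives in `AnnotatorWorkflow`.

```python
import copy

class WorkingCopySession:
    """Hypothetical illustration of the pattern; not the real AnnotatorWorkflow."""

    def __init__(self, records: list):
        self.original_data = copy.deepcopy(records)  # reference snapshot, never edited
        self.working_data = copy.deepcopy(records)   # all annotations land here

    def add_annotation(self, index: int, column: str, value) -> None:
        """Add a new column value; refuse to overwrite an existing one."""
        if column in self.working_data[index]:
            raise KeyError(f"Refusing to overwrite existing column: {column}")
        self.working_data[index][column] = value

# Made-up record for demonstration only
trials = [{"nct_id": "NCT00000001", "conditions": ["T2DM"]}]
session = WorkingCopySession(trials)
session.add_annotation(0, "tx_category", "ENDOCRINOLOGY")

print(session.original_data[0])  # snapshot without the new column
print(session.working_data[0])   # working copy with the new column
```

Because both attributes are deep copies, neither the caller's `trials` list nor the snapshot changes when annotations are added.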

### Protective Mechanisms

**1. Immutable Original**
- The `cleaned_trials.pkl` file from `data_retriever.ipynb` is **never modified**
- It remains as a permanent checkpoint you can always return to

**2. `original_data` Attribute**
- When `AnnotatorWorkflow` loads your data, it stores a copy in `self.original_data`
- All operations happen on `self.working_data` 
- You can always compare original vs annotated to see what changed

**3. Non-Destructive Additions**
- LLM doesn't overwrite existing columns
- New information is **added** to:
  - `matched_conditions` (appends MeSH terms)
  - `tx_category` (new column)
  - `tx_category_confidence` (new column)
  - `llm_annotations` (tracks which fields were LLM-modified)

**4. Timestamped Outputs**
- All exports include timestamps in filenames (e.g., `annotated_trials_20231113_143022.pkl`)
- Never overwrites previous runs
- Full versioning history maintained

**5. CSV Exports for Audit**
- Every mapping decision exported to human-readable CSV:
  - `mapped_to_existing_conditions.csv` - Synonym mappings
  - `searched_conditions.csv` - New MeSH term searches
  - `categorized_mesh_terms.csv` - Therapeutic classifications
- These allow manual review of every LLM decision
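
Mechanism 4 (timestamped outputs) can be sketched in a few lines. The helper name `timestamped_path` is an assumption made for illustration; the format string simply reproduces the `YYYYMMDD_HHMMSS` pattern shown in the example filename above.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

def timestamped_path(directory: str, stem: str, suffix: str,
                     now: Optional[datetime] = None) -> Path:
    """Build a filename like annotated_trials_20231113_143022.pkl.

    Names are distinct to one-second resolution, so successive runs
    do not overwrite each other.
    """
    now = now or datetime.now()
    return Path(directory) / f"{stem}_{now.strftime('%Y%m%d_%H%M%S')}{suffix}"

# A fixed datetime is passed here only to make the example reproducible
example = timestamped_path("data", "annotated_trials", ".pkl",
                           now=datetime(2023, 11, 13, 14, 30, 22))
print(example.name)
```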

### Verification After Running

After the workflow completes, you can verify data integrity:

```python
# Compare trial counts
assert len(cleaned_trials) == len(annotated_trials)

# Check original columns still exist
assert all(col in annotated_trials.columns for col in cleaned_trials.columns)

# Verify NCT IDs unchanged (primary key)
assert (cleaned_trials['nct_id'] == annotated_trials['nct_id']).all()
```

### What If Something Goes Wrong?

If LLM annotations are unsatisfactory:
1. **Revert:** Delete the output pickle and re-run from `cleaned_trials.pkl`
2. **Adjust:** Modify prompts or validation rules in `annotator.py`
3. **Override:** Use the human review CSV (explained later)

**Your original data is always safe.**

## Three-Stage Annotation Workflow

The `AnnotatorWorkflow.run_annotation_workflow()` method below executes three sequential stages. Each builds on the previous stage's results:

---

### Stage 1: Map Unmapped Conditions to Existing MeSH Terms

**Problem:**  
In `data_retriever.ipynb`, we used the `mesh_mapper` service to map conditions to MeSH terms via API lookup. However, some conditions failed to map because:
- They use non-standard abbreviations (e.g., "T2DM", "CAD")
- They're informal terms (e.g., "high blood sugar" instead of "hyperglycemia")
- They include study parameters mixed with conditions (e.g., "diabetes with A1C > 7%")

**LLM Solution:**  
The LLM reviews unmapped conditions and attempts to match them to **existing MeSH terms already in the dataset**.

**How it works:**
1. Extract all conditions that lack MeSH mappings
2. Get list of all successfully mapped MeSH terms from previous step
3. Ask LLM: "Which existing MeSH term best matches this unmapped condition?"
4. LLM returns: Matched term + confidence level 
5. **Validation:** Check that LLM's suggested term actually exists in our MeSH list
6. Accept valid matches; reject and log invalid ones

**Example:**
```
Unmapped: "T2DM"
LLM matches to existing: "Diabetes Mellitus, Type 2 (MeSH ID:D003924)"
Confidence: HIGH
Status: ✅ Accepted (term exists in our dataset)
```

**Output:** `mapped_to_existing_conditions.csv` for review
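
Steps 5 and 6 above amount to a membership check. A minimal sketch, assuming the LLM's suggestions arrive as a dictionary from condition to suggested term; the function name `split_valid_mappings` is hypothetical, and the rejected term is taken from the log output shown later in this notebook.

```python
def split_valid_mappings(llm_suggestions: dict, existing_terms: set):
    """Separate suggestions that exist in our MeSH list from those that do not."""
    accepted, rejected = {}, {}
    for condition, term in llm_suggestions.items():
        if term in existing_terms:
            accepted[condition] = term
        else:
            rejected[condition] = term  # logged for review, never applied
    return accepted, rejected

existing_mesh_terms = {
    "Diabetes Mellitus, Type 2 (MeSH ID:D003924)",
    "Diabetes Mellitus (MeSH ID:D003920)",
}
suggestions = {
    "T2DM": "Diabetes Mellitus, Type 2 (MeSH ID:D003924)",
    "Children and Adolescent With Type 2 Diabetes": "Type 2 Diabetes Mellitus (MeSH ID:68003924)",
}

accepted, rejected = split_valid_mappings(suggestions, existing_mesh_terms)
print("Accepted:", accepted)
print("Rejected:", rejected)
```

An exact-match check is strict on purpose: a plausible-looking but slightly different term string is treated as invalid rather than silently accepted.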

---

### Stage 2: Search for New MeSH Terms

**Problem:**  
The LLM may not be able to map an entry in the "conditions" column to an existing MeSH term in our dataset (e.g., when the entry isn't actually a disorder; there are more than a few such examples in "cleaned_data"). These entries need fresh MeSH lookups informed by additional context from the trial.

**LLM + API Solution:**  
For remaining unmapped conditions, we use a **two-step process**:

**Step 2a: Extract Primary Condition from Context**
- Many trial descriptions are complex (e.g., "Type 2 Diabetes in patients with chronic kidney disease and obesity") or inaccurately coded in "conditions" (e.g., "incretins")
- LLM analyzes the full trial record (title, summary, outcomes, conditions list)
- Extracts the **primary medical condition** being studied
- Returns: Clean condition string suitable for MeSH database search

**Step 2b: Verify via NCBI API**
- Take LLM-extracted condition and query NCBI's MeSH database
- Use the same `mesh_mapper` service from the retriever notebook
- Filter to disease/disorder categories only (tree codes C or F)
- If found: Accept; If not found: Mark as "NOT DETERMINED"

**Why two-step?**
- LLM is great at understanding context and extracting meaning
- But only the authoritative NCBI database can confirm valid MeSH terms
- Combining both gives flexibility + accuracy

**Example:**
```
Original condition: "Diabetes in patients with renal impairment"
LLM extracts: "Diabetes Mellitus"
NCBI API returns: "Diabetes Mellitus (MeSH ID:D003920)"
Status: ✅ Accepted (verified by NCBI)
```

**Output:** `searched_conditions.csv` for review
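
The extract-then-verify flow can be sketched without calling either the LLM or NCBI. In the code below, `extract` and `lookup` are injected callables; `fake_extract`, `fake_lookup`, the `FAKE_MESH` table, and its tree numbers are all stand-ins invented for this illustration, not the behavior of `mesh_mapper`.

```python
from typing import Callable, Optional

NOT_DETERMINED = "NOT DETERMINED"
DISEASE_TREE_PREFIXES = ("C", "F")  # MeSH branches kept by the filter in Step 2b

def resolve_condition(trial: dict,
                      extract: Callable[[dict], str],
                      lookup: Callable[[str], Optional[dict]]) -> str:
    """Step 2a: extract a candidate; Step 2b: keep it only if the lookup confirms it."""
    candidate = extract(trial)
    hit = lookup(candidate)
    if hit is None:
        return NOT_DETERMINED
    if not any(code.startswith(DISEASE_TREE_PREFIXES) for code in hit["tree_numbers"]):
        return NOT_DETERMINED  # a real MeSH term, but not a disease/disorder
    return hit["term"]

# Stand-ins for the LLM and the NCBI lookup (illustrative data only)
FAKE_MESH = {
    "Diabetes Mellitus": {"term": "Diabetes Mellitus (MeSH ID:D003920)",
                          "tree_numbers": ["C18.452.394.750"]},
    "incretins": {"term": "Incretins", "tree_numbers": ["D06.472"]},
}

def fake_extract(trial: dict) -> str:
    text = trial["conditions"]
    return "Diabetes Mellitus" if "Diabetes" in text else text

def fake_lookup(term: str) -> Optional[dict]:
    return FAKE_MESH.get(term)

for conditions in ["Diabetes in patients with renal impairment", "incretins", "xyz"]:
    print(conditions, "->", resolve_condition({"conditions": conditions}, fake_extract, fake_lookup))
```

Passing the two steps in as callables keeps the acceptance rule testable on its own, independent of model quality or network access.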

---

### Stage 3: Classify Trials into Therapeutic Categories

**Problem:**  
For analysis and visualization, it's useful to group trials by therapeutic area:
- ONCOLOGY (cancer)
- ENDOCRINOLOGY (diabetes, thyroid)
- CARDIOVASCULAR (heart disease, hypertension)
- NEUROLOGY (Alzheimer's, Parkinson's)
- etc.

MeSH terms are very specific (e.g., "Diabetes Mellitus, Type 2"), but we want broad categories.

**LLM Solution:**  
For each trial's MeSH condition term(s), the LLM assigns a therapeutic category from a predefined list of 22 categories.

**How it works:**
1. Show the LLM each of the MeSH-standardized conditions now in our dataset
2. Provide definitions of therapeutic categories
3. LLM selects the **single best-fit category**
4. Returns: Category + confidence + reasoning
5. **Constraint:** Must choose from enum (no free-form categories)

**Categories available:**
- ONCOLOGY
- ENDOCRINOLOGY  
- CARDIOVASCULAR
- NEUROLOGY
- IMMUNOLOGY
- INFECTIOUS_DISEASE
- RESPIRATORY
- GASTROENTEROLOGY
- NEPHROLOGY
- RHEUMATOLOGY
- DERMATOLOGY
- OPHTHALMOLOGY
- OTOLARYNGOLOGY
- PSYCHIATRY
- HEMATOLOGY
- SURGERY
- OBSTETRICS_GYNECOLOGY
- PEDIATRICS
- GERIATRICS
- PAIN_MANAGEMENT
- CRITICAL_CARE
- OTHER

**Example:**
```
MeSH term: "Diabetes Mellitus, Type 2 (MeSH ID:D003924)"
LLM assigns: ENDOCRINOLOGY
Confidence: HIGH
Reasoning: "Diabetes is a metabolic/endocrine disorder"
```

**Output:** `llm_mesh_term_mappings_{datestamp}.csv` for review
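
The enum constraint in step 5 can be illustrated with the standard library alone. This sketch lists only four of the 22 categories for brevity, and `parse_category` is a hypothetical helper; the notebook itself enforces the constraint through its structured-output schema.

```python
from enum import Enum

class TherapeuticCategory(str, Enum):
    """Subset of the 22 categories, shortened for this illustration."""
    ONCOLOGY = "ONCOLOGY"
    ENDOCRINOLOGY = "ENDOCRINOLOGY"
    CARDIOVASCULAR = "CARDIOVASCULAR"
    OTHER = "OTHER"

def parse_category(raw: str) -> TherapeuticCategory:
    """Normalize the model's answer and reject anything outside the enum."""
    return TherapeuticCategory(raw.strip().upper())

print(parse_category("endocrinology").value)

try:
    parse_category("METABOLIC_MEDICINE")  # plausible, but not an allowed category
except ValueError as err:
    print(f"Rejected: {err}")
```

A free-form answer such as "METABOLIC_MEDICINE" raises `ValueError` instead of creating a new, unplanned category in the dataset.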

---

### Progressive Refinement

Notice the workflow is **progressive**:
1. Try to match existing terms (fastest, most reliable)
2. If that fails, extract the primary condition and search for new terms (slower, but comprehensive)
3. Finally, categorize everything for high-level analysis

This keeps API calls to a minimum, because only conditions that fail the cheaper step move on to the next one, while still maximizing data coverage.

---

### Execution Time

Expect the workflow below to take **several minutes** depending on:
- Number of trials
- Number of unmapped conditions  
- LLM inference speed
- API rate limits

Progress will be logged in real-time. Be patient!

In [3]:
session = annotator.AnnotatorWorkflow(df = cleaned_trials, llm = annotator_llm, data_loc = DATA_STORAGE )
session.run_annotation_workflow()
annotated_trials = session.annotated_data.copy()

[2025-11-13 11:30:19] INFO     - services.annotator - Loaded existing MeSH map containing 297 mappings as a <class 'dict'>
[2025-11-13 11:30:19] INFO     - services.annotator - Trying to update MeSH mappings using existing terms
[2025-11-13 11:30:19] INFO     - services.annotator - Found 82 unique unmapped conditions
[2025-11-13 11:30:36] INFO     - httpx - HTTP Request: POST http://127.0.0.1:1234/v1/chat/completions "HTTP/1.1 200 OK"
[2025-11-13 11:30:36] INFO     - services.annotator - After filtering: 52 legitimate medical conditions
[2025-11-13 11:30:36] INFO     - services.annotator - Found 170 unique existing MeSH terms
[2025-11-13 11:31:24] INFO     - httpx - HTTP Request: POST http://127.0.0.1:1234/v1/chat/completions "HTTP/1.1 200 OK"
[2025-11-13 11:31:24] INFO     - services.annotator - Rejected invalid mapping: 'Children and Adolescent With Type 2 Diabetes' -> 'Type 2 Diabetes Mellitus (MeSH ID:68003924)' (not in existing MeSH terms)
[2025-11-13 11:31:24] INFO     - services

## How LLM Decisions Are Constrained: Structured Outputs

The workflow above has completed (or is running). Let's examine **how** the LLM is prevented from "hallucinating" or producing invalid outputs.

### The Problem with Free-Form LLM Responses

If we asked an LLM a simple question without constraints:

```python
# UNPREDICTABLE - No constraints
response = llm.invoke("What MeSH term matches 'T2DM'?")
# LLM might return:
# "The term T2DM refers to Type 2 Diabetes Mellitus, which is..."
# (free-form text, hard to parse, may include errors)
```

**Issues:**
- Response format is unpredictable
- No way to programmatically extract the MeSH ID
- No confidence level
- May include explanations mixed with data
- Difficult to validate

---

### The Solution: Pydantic Schemas (Structured Outputs)

Instead, we define **strict data schemas** using Pydantic that the LLM **must** conform to.  Here's an example:

```python
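# Much safer - structured output
from pydantic import BaseModel

class ConditionMapping(BaseModel):
    """Schema that the LLM must follow"""
    original_condition: str
    matched_mesh_term: str
    confidence: ConfidenceLevel  # enum of allowed confidence levels (e.g., HIGH), assumed to be defined elsewhere
    reasoning: str

# with_structured_output binds the schema so the reply is parsed into a ConditionMapping
response = llm.with_structured_output(ConditionMapping).invoke(
    "What MeSH term matches 'T2DM'?",
)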

# Response is guaranteed to have these fields:
print(response.original_condition)  # "T2DM"
print(response.matched_mesh_term)   # "Diabetes Mellitus, Type 2 (MeSH ID:D003924)"
print(response.confidence)          # "HIGH"
print(response.reasoning)           # "T2DM is standard abbreviation for..."
```
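
To see what schema validation catches, here is a standard-library sketch that parses a raw JSON reply into a fixed record and fails on anything malformed. It mimics the idea only; `ConditionMappingRecord`, `parse_reply`, and the `ALLOWED_CONFIDENCE` values are assumptions for illustration, and Pydantic performs richer checks than this.

```python
import json
from dataclasses import dataclass

ALLOWED_CONFIDENCE = {"HIGH", "MEDIUM", "LOW"}  # assumed levels for this sketch

@dataclass(frozen=True)
class ConditionMappingRecord:
    original_condition: str
    matched_mesh_term: str
    confidence: str
    reasoning: str

def parse_reply(raw_reply: str) -> ConditionMappingRecord:
    """Turn a JSON reply into a record, raising on free text or bad fields."""
    data = json.loads(raw_reply)              # free-form prose fails here
    record = ConditionMappingRecord(**data)   # missing or extra fields fail here
    if record.confidence not in ALLOWED_CONFIDENCE:
        raise ValueError(f"Unknown confidence level: {record.confidence!r}")
    return record

good_reply = json.dumps({
    "original_condition": "T2DM",
    "matched_mesh_term": "Diabetes Mellitus, Type 2 (MeSH ID:D003924)",
    "confidence": "HIGH",
    "reasoning": "T2DM is a standard abbreviation for type 2 diabetes mellitus",
})
print(parse_reply(good_reply).matched_mesh_term)

for bad_reply in ["The term T2DM refers to...", '{"original_condition": "T2DM"}']:
    try:
        parse_reply(bad_reply)
    except (ValueError, TypeError) as err:
        print(f"Rejected: {type(err).__name__}")
```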


### What This Means for Data Quality

Structured outputs + validation layers = **controlled intelligence**

The LLM brings:
- Context understanding
- Synonym recognition  
- Reasoning capability

But is constrained by:
- Predefined schemas
- Type validation
- External API verification
- Confidence scoring (if implemented)

**Result:** Flexible enough to handle edge cases while remaining clearly explainable and predictable.

## Inspecting Results: Quality Assurance

The workflow has completed and created the `annotated_trials` DataFrame. Now let's verify the quality of the LLM annotations.

### Quick Quality Checks

**1. Verify Data Integrity**
```python
# Check no trials were lost
print(f"Original trials: {len(cleaned_trials)}")
print(f"Annotated trials: {len(annotated_trials)}")
assert len(cleaned_trials) == len(annotated_trials)

# Check NCT IDs match (primary key unchanged)
assert (cleaned_trials['nct_id'] == annotated_trials['nct_id']).all()
```

**2. Check Annotation Coverage**
```python
# How many trials have MeSH-mapped conditions?
trials_with_mesh = annotated_trials['matched_conditions'].apply(
    lambda x: len(x) > 0 if isinstance(x, list) else False
)
coverage = trials_with_mesh.sum() / len(annotated_trials) * 100
print(f"Trials with MeSH conditions: {coverage:.1f}%")

# How many have therapeutic categories?
trials_with_category = annotated_trials['tx_category'].notna()
print(f"Trials with therapeutic category: {trials_with_category.sum()} "
      f"({trials_with_category.mean()*100:.1f}%)")

# Flag trials missing a category or any MeSH-mapped condition (our working definition of "needs review")
needs_review = annotated_trials[~trials_with_category | ~trials_with_mesh]
print(f"\nTrials needing manual review: {len(needs_review)}")
print("\nReasons:")
print(f"  - No category assigned: {annotated_trials['tx_category'].isna().sum()}")
print(f"  - No MeSH conditions: {(~trials_with_mesh).sum()}")
```


In [None]:
## Uncomment this code if you've taken a break and need to pick back up: it loads the annotated_trials pickle
#annotated_trials_loc = f"{DATA_STORAGE}/<fill in pkl filename to load>"
#if Path(annotated_trials_loc).exists():
#    with open(annotated_trials_loc, "rb") as f:
#        annotated_trials = pickle.load(f)
#else:
#    logger.error("Check the path to the pickle!")

In [None]:
# Check no trials were lost
print(f"Original trials: {len(cleaned_trials)}")
print(f"Annotated trials: {len(annotated_trials)}")
assert len(cleaned_trials) == len(annotated_trials)

# Check NCT IDs match (primary key unchanged)
assert (cleaned_trials['nct_id'] == annotated_trials['nct_id']).all()

In [None]:
# How many trials have MeSH-mapped conditions?
trials_with_mesh = annotated_trials['matched_conditions'].apply(
    lambda x: len(x) > 0 if isinstance(x, list) else False
)
coverage = trials_with_mesh.sum() / len(annotated_trials) * 100
print(f"Trials with MeSH conditions: {coverage:.1f}%")

# How many have therapeutic categories?
trials_with_category = annotated_trials['tx_category'].notna()
print(f"Trials with therapeutic category: {trials_with_category.sum()} "
      f"({trials_with_category.mean()*100:.1f}%)")

# Flag trials missing a category or any MeSH-mapped condition (our working definition of "needs review")
needs_review = annotated_trials[~trials_with_category | ~trials_with_mesh]
print(f"\nTrials needing manual review: {len(needs_review)}")
print("\nReasons:")
print(f"  - No category assigned: {annotated_trials['tx_category'].isna().sum()}")
print(f"  - No MeSH conditions: {(~trials_with_mesh).sum()}")

In [None]:
human_review_path = f"{DATA_STORAGE}/human_review.csv"
reviewed_trials = annotated_trials.copy()
if Path(human_review_path).exists():
    changes_df = pd.read_csv(human_review_path, header=0)
    logger.info("Loaded changes file for human review")
else: 
    changes_df = None
    logger.info("Store your changes file for human review at the appropriate location")

if changes_df is not None:
    try:
        for change in changes_df.itertuples():    
            row = change.row
            column = change.column
            new_assignment = change.new_assignment
            reviewed_trials.loc[row, column] = new_assignment
    except Exception as e:
        logger.error(f"Warning - your changes could not be applied.  Perhaps the file was not appropriately formatted: {e}")
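
The cell above expects `human_review.csv` to contain the columns `row`, `column`, and `new_assignment`, since those are the attributes it reads from each row. As a hedged sketch, the file can be checked before any override is applied; `load_review_rows` and the `editable_columns` whitelist are hypothetical additions for illustration, not part of the notebook's services.

```python
import csv
import io

REQUIRED_COLUMNS = {"row", "column", "new_assignment"}

def load_review_rows(handle, editable_columns: set) -> list:
    """Read review overrides, failing early on a malformed file."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Review file is missing columns: {sorted(missing)}")
    rows = []
    for line_no, record in enumerate(reader, start=2):  # line 1 is the header
        if record["column"] not in editable_columns:
            raise ValueError(f"Line {line_no}: column {record['column']!r} is not editable")
        rows.append({
            "row": int(record["row"]),  # DataFrame index label of the trial
            "column": record["column"],
            "new_assignment": record["new_assignment"],
        })
    return rows

# In-memory stand-in for a real human_review.csv
sample_file = io.StringIO("row,column,new_assignment\n0,tx_category,ENDOCRINOLOGY\n")
overrides = load_review_rows(sample_file, editable_columns={"tx_category"})
print(overrides)
```

Restricting edits to a whitelist of columns keeps a typo in the review file from creating a new column or overwriting a key such as `nct_id`.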