# mCODE Pipeline Demo - Complete Workflow

This notebook demonstrates the complete mCODE (minimal Common Oncology Data Elements) pipeline,
showing how to fetch clinical trials and patient data, process them with mCODE mapping,
and store the results in CORE Memory.

## What is mCODE?

mCODE is a standardized data model for oncology data that enables:
- **Interoperability**: Consistent representation of cancer data across systems
- **Research**: Facilitates clinical trial matching and patient recruitment
- **Analytics**: Enables advanced analysis of cancer treatment patterns

## Pipeline Overview

The pipeline consists of several CLI tools that work together:

1. **`trials_fetcher`**: Fetches clinical trial data from ClinicalTrials.gov
2. **`trials_processor`**: Processes trials with mCODE mapping using LLMs
3. **`patients_fetcher`**: Fetches synthetic patient data from archives
4. **`patients_processor`**: Processes patients with mCODE mapping
5. **`end_to_end_processor`**: Runs the complete pipeline in one command

## Prerequisites

- Python environment with mCODE translator installed
- CORE Memory API key (for storage)
- Internet connection (for fetching trial data)

## Generated Files

This notebook will generate several intermediate files:
- `demo_trials_raw.json` - Raw clinical trial data
- `demo_trials_mcode.ndjson` - Trials processed with mCODE mapping
- `demo_patients_raw.json` - Raw synthetic patient data
- `demo_patients_mcode.ndjson` - Patients processed with mCODE mapping

All data will be stored in CORE Memory spaces for semantic search and matching.
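
The two file types above are read differently: the `.json` files hold a single JSON document, while the `.ndjson` files hold one JSON object per line. As a minimal sketch, the helper below reads an NDJSON file; `read_ndjson` is a hypothetical name, not part of the project, and the sample records are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def read_ndjson(path):
    """Parse one JSON object per non-blank line (hypothetical helper)."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records


# Round-trip two made-up records through a temporary file.
sample = [{"trial_id": "EXAMPLE-1"}, {"trial_id": "EXAMPLE-2"}]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.ndjson"
    path.write_text("\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8")
    loaded = read_ndjson(path)

print(f"Loaded {len(loaded)} records")
```

Reading line by line keeps memory use flat for large files, which is the usual reason to prefer NDJSON for processed output.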

In [None]:
# =============================================================================
# SETUP - Environment and Imports
# =============================================================================

import os
import sys
from pathlib import Path

# Add project root to path for imports
current_dir = Path.cwd()
PROJECT_ROOT = current_dir.parent
sys.path.insert(0, str(PROJECT_ROOT))

# Environment setup
os.environ['PYTHONPATH'] = str(PROJECT_ROOT)

# Verify environment
print(f"Project root: {PROJECT_ROOT}")
print(f"Python path includes project root: {str(PROJECT_ROOT) in sys.path}")
print(f"Working directory: {current_dir}")

# Check for required environment variables
core_api_key = os.getenv('COREAI_API_KEY')
if core_api_key:
    print("✅ CORE API key found")
else:
    print("⚠️  CORE API key not found - set COREAI_API_KEY environment variable")

# Step 1: Fetch Clinical Trials

## What does `trials_fetcher` do?

The `trials_fetcher` CLI tool:
- **Fetches** clinical trial data from ClinicalTrials.gov API
- **Supports** searching by condition (e.g., "breast cancer")
- **Handles** single trials by NCT ID or multiple trials
- **Provides** concurrent fetching for better performance
- **Outputs** raw JSON data (no processing)

## Command Structure
```bash
python -m src.cli.trials_fetcher --condition "breast cancer" --limit 3 --output demo_trials_raw.json
```
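
If you would rather launch the tool from Python than from a shell escape, the command above can be assembled as an argument list. This is only a sketch: `build_trials_fetch_cmd` is a hypothetical helper, and the flags are copied from the command shown above rather than from the tool's own help output.

```python
import sys


def build_trials_fetch_cmd(condition, limit, output):
    """Return the argv list for the trials_fetcher command (hypothetical helper)."""
    return [
        sys.executable, "-m", "src.cli.trials_fetcher",
        "--condition", condition,
        "--limit", str(limit),   # argv entries must be strings
        "--output", output,
    ]


cmd = build_trials_fetch_cmd("breast cancer", 3, "demo_trials_raw.json")
print(" ".join(cmd[1:]))
```

The list can then be passed to `subprocess.run(cmd, cwd=PROJECT_ROOT, check=True)`, assuming the project environment is already active in the running kernel. Passing a list avoids shell quoting problems with values such as `"breast cancer"`.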

## What it generates
- Raw clinical trial data in ClinicalTrials.gov format
- JSON array of trial objects
- No mCODE processing yet
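
Raw trial objects are deeply nested, so reading a field means walking several levels of dictionaries. The sketch below shows one way to do that safely; `dig` is a hypothetical helper, and the sample record is made up, containing only the fields this notebook reads.

```python
def dig(data, *keys, default=None):
    """Walk nested dicts, returning `default` if any key is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up record shaped like the fields this notebook inspects; not real trial data.
sample_trial = {
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000000", "briefTitle": "Example title"},
        "conditionsModule": {"conditions": ["Breast Cancer"]},
    }
}

nct_id = dig(sample_trial, "protocolSection", "identificationModule", "nctId")
conditions = dig(sample_trial, "protocolSection", "conditionsModule", "conditions", default=[])
missing = dig(sample_trial, "protocolSection", "noSuchModule", "field", default="N/A")

print(nct_id, conditions, missing)
```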

In [None]:
# Fetch clinical trials for breast cancer
!source activate mcode_translator && cd .. && python -m src.cli.trials_fetcher --condition "breast cancer" --limit 3 --output examples/demo_trials_raw.json --verbose

In [None]:
# Inspect the raw trial data
import json

# Try current directory first, then examples/
trial_file_path = 'demo_trials_raw.json'
if not os.path.exists(trial_file_path):
    trial_file_path = 'examples/demo_trials_raw.json'

if os.path.exists(trial_file_path):
    with open(trial_file_path, 'r') as f:
        trials_raw = json.load(f)
else:
    print("Trial data file not found. Please run the trials fetcher step first.")
    trials_raw = []

print(f"Fetched {len(trials_raw)} trials")
print("\nSample trial structure:")
if trials_raw:
    trial = trials_raw[0]
    print(f"NCT ID: {trial.get('protocolSection', {}).get('identificationModule', {}).get('nctId')}")
    print(f"Title: {trial.get('protocolSection', {}).get('identificationModule', {}).get('briefTitle', 'N/A')[:100]}...")
    print(f"Conditions: {trial.get('protocolSection', {}).get('conditionsModule', {}).get('conditions', [])}")
    print(f"Raw data keys: {list(trial.keys())}")

# Step 2: Process Trials with mCODE Mapping

## What does `trials_processor` do?

The `trials_processor` CLI tool:
- **Processes** clinical trial data with LLM-powered mCODE mapping
- **Extracts** eligibility criteria, treatment protocols, and study details
- **Generates** structured mCODE elements (diagnoses, treatments, etc.)
- **Supports** multiple LLM models and prompt templates
- **Provides** concurrent processing for performance
- **Can store** results directly in CORE Memory

## Key Features
- Uses advanced prompts to extract mCODE elements
- Handles complex eligibility criteria parsing
- Generates structured data for clinical matching
- Supports dry-run mode for testing

## Command Structure
```bash
python -m src.cli.trials_processor demo_trials_raw.json --output demo_trials_mcode.ndjson --model deepseek-coder
```

## What it generates
- mCODE-mapped trial data with structured elements
- Eligibility criteria in mCODE format
- Treatment protocols and study details
- Ready for CORE Memory storage
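
To get a quick sense of what a processed trial contains, you can tally its mappings by element type. The sketch below assumes the record layout that the inspection cell in this notebook reads (`trial_id`, `mcode_elements`, `mcode_mappings`, `mcode_element`, `value`); the element names and values in the sample are invented for illustration, and `count_mapped_elements` is a hypothetical helper.

```python
from collections import Counter

# Made-up record; field names follow the inspection cell, values are illustrative only.
sample_record = {
    "trial_id": "NCT00000000",
    "mcode_elements": {
        "mcode_mappings": [
            {"mcode_element": "CancerCondition", "value": "Breast cancer"},
            {"mcode_element": "CancerRelatedMedicationStatement", "value": "Tamoxifen"},
            {"mcode_element": "CancerCondition", "value": "Metastatic disease"},
        ]
    },
}


def count_mapped_elements(record):
    """Count how many mappings a record has per mCODE element name."""
    mappings = record.get("mcode_elements", {}).get("mcode_mappings", [])
    return Counter(m.get("mcode_element") for m in mappings)


counts = count_mapped_elements(sample_record)
print(counts.most_common())
```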

In [None]:
# Process trials with mCODE mapping (save to file)
!source activate mcode_translator && cd .. && python -m src.cli.trials_processor examples/demo_trials_raw.json --output examples/demo_trials_mcode.ndjson --model deepseek-coder --verbose

In [None]:
# Inspect the mCODE-processed trial data
import json

# Read NDJSON format (one JSON object per line)
mcode_trials = []
trial_mcode_file = 'demo_trials_mcode.ndjson'
if not os.path.exists(trial_mcode_file):
    trial_mcode_file = 'examples/demo_trials_mcode.ndjson'

if os.path.exists(trial_mcode_file):
    with open(trial_mcode_file, 'r') as f:
        for line in f:
            if line.strip():
                mcode_trials.append(json.loads(line))
else:
    print("mCODE trials file not found. Please run the trials processor step first.")

print(f"Processed {len(mcode_trials)} trials with mCODE mapping")
print("\nSample mCODE trial structure:")
if mcode_trials:
    trial = mcode_trials[0]
    print(f"Trial ID: {trial.get('trial_id')}")
    mcode_elements = trial.get('mcode_elements', {})
    print(f"mCODE elements found: {list(mcode_elements.keys())}")
    
    # Show mCODE mappings
    mappings = mcode_elements.get('mcode_mappings', [])
    if mappings:
        print(f"\nFirst few mCODE mappings:")
        for mapping in mappings[:3]:
            print(f"  - {mapping.get('mcode_element')}: {mapping.get('value')}")
    
    # Show processing metadata
    if 'processing_metadata' in mcode_elements:
        meta = mcode_elements['processing_metadata']
        print(f"\nProcessing metadata:")
        print(f"  Model used: {meta.get('model')}")
        print(f"  Prompt used: {meta.get('prompt')}")
        print(f"  Processing time: {meta.get('processing_time_seconds', 'N/A')}s")

# Step 3: Fetch Synthetic Patient Data

## What does `patients_fetcher` do?

The `patients_fetcher` CLI tool:
- **Fetches** synthetic patient data from pre-generated archives
- **Supports** different cancer types and time horizons
- **Provides** realistic oncology data for testing
- **Handles** single patients or bulk fetching
- **Outputs** raw FHIR Bundle format data

## Available Archives
- `breast_cancer_10_years` - Breast cancer patients over 10 years
- `breast_cancer_lifetime` - Breast cancer patients lifetime data
- `mixed_cancer_10_years` - Mixed cancer types over 10 years
- `mixed_cancer_lifetime` - Mixed cancer types lifetime data

## Command Structure
```bash
python -m src.cli.patients_fetcher --archive breast_cancer_10_years --limit 3 --output demo_patients_raw.json
```

## What it generates
- Raw patient data in FHIR Bundle format
- Clinical observations, conditions, medications
- Realistic oncology patient profiles
- No mCODE processing yet
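
A FHIR Bundle keeps its resources in an `entry` list, with each entry holding a `resource` that names its own `resourceType`. As a small sketch, the snippet below counts resource types in a bundle; the sample bundle is made up and far smaller than a real patient record, and `count_resource_types` is a hypothetical helper.

```python
from collections import Counter

# Made-up bundle with the minimal FHIR structure needed for the example.
sample_bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Patient", "gender": "female"}},
        {"resource": {"resourceType": "Condition"}},
        {"resource": {"resourceType": "Observation"}},
        {"resource": {"resourceType": "Observation"}},
    ],
}


def count_resource_types(bundle):
    """Count bundle entries per FHIR resourceType, skipping entries without one."""
    return Counter(
        entry["resource"]["resourceType"]
        for entry in bundle.get("entry", [])
        if entry.get("resource", {}).get("resourceType")
    )


resource_counts = count_resource_types(sample_bundle)
print(dict(resource_counts))
```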

In [None]:
# Fetch synthetic patient data
!source activate mcode_translator && cd .. && python -m src.cli.patients_fetcher --archive breast_cancer_10_years --limit 3 --output examples/demo_patients_raw.json --verbose

In [None]:
# Inspect the raw patient data
import json

# Try current directory first, then examples/
patient_file_path = 'demo_patients_raw.json'
if not os.path.exists(patient_file_path):
    patient_file_path = 'examples/demo_patients_raw.json'

if os.path.exists(patient_file_path):
    with open(patient_file_path, 'r') as f:
        patients_raw = json.load(f)
else:
    print("Patient data file not found. Please run the patients fetcher step first.")
    patients_raw = []

print(f"Fetched {len(patients_raw)} patients")
print("\nSample patient structure:")
if patients_raw:
    patient = patients_raw[0]
    entries = patient.get('entry', [])
    print(f"Patient bundle has {len(entries)} entries")
    
    # Show different resource types
    resource_types = {}
    for entry in entries:
        resource = entry.get('resource', {})
        rtype = resource.get('resourceType')
        if rtype:
            resource_types[rtype] = resource_types.get(rtype, 0) + 1
    
    print(f"Resource types: {resource_types}")
    
    # Show patient demographics
    for entry in entries:
        resource = entry.get('resource', {})
        if resource.get('resourceType') == 'Patient':
            name = resource.get('name', [{}])[0]
            birth_date = resource.get('birthDate')
            gender = resource.get('gender')
            print(f"Patient: {name.get('given', ['Unknown'])[0]} {name.get('family', 'Unknown')}")
            print(f"Birth date: {birth_date}")
            print(f"Gender: {gender}")
            break

# Step 4: Process Patients with mCODE Mapping

## What does `patients_processor` do?

The `patients_processor` CLI tool:
- **Processes** FHIR patient bundles with mCODE mapping
- **Extracts** cancer diagnoses, treatments, biomarkers, etc.
- **Generates** structured mCODE patient profiles
- **Supports** trial eligibility filtering
- **Can store** results directly in CORE Memory

## Key Features
- Converts FHIR resources to mCODE elements
- Handles complex clinical data relationships
- Supports filtering based on trial criteria
- Generates patient summaries for matching

## Command Structure
```bash
python -m src.cli.patients_processor --patients demo_patients_raw.json --trials demo_trials_mcode.ndjson --output demo_patients_mcode.ndjson
```

## What it generates
- mCODE-mapped patient data
- Structured clinical profiles
- Trial matching eligibility data
- Ready for CORE Memory storage

In [None]:
# Process patients with mCODE mapping (save to file)
!source activate mcode_translator && cd .. && python -m src.cli.patients_processor --patients examples/demo_patients_raw.json --trials examples/demo_trials_mcode.ndjson --output examples/demo_patients_mcode.ndjson --verbose

In [None]:
# Inspect the mCODE-processed patient data
import json

# Read NDJSON format
mcode_patients = []
patient_mcode_file = 'demo_patients_mcode.ndjson'
if not os.path.exists(patient_mcode_file):
    patient_mcode_file = 'examples/demo_patients_mcode.ndjson'

if os.path.exists(patient_mcode_file):
    with open(patient_mcode_file, 'r') as f:
        for line in f:
            if line.strip():
                mcode_patients.append(json.loads(line))
else:
    print("mCODE patients file not found. Please run the patients processor step first.")

print(f"Processed {len(mcode_patients)} patients with mCODE mapping")
print("\nSample mCODE patient structure:")
if mcode_patients:
    patient = mcode_patients[0]
    bundle = patient.get('patient_bundle', [])
    print(f"Patient bundle has {len(bundle)} mCODE entries")
    
    # Show different mCODE resource types
    resource_types = {}
    for entry in bundle:
        rtype = entry.get('resource_type')
        if rtype:
            resource_types[rtype] = resource_types.get(rtype, 0) + 1
    
    print(f"mCODE resource types: {resource_types}")
    
    # Show sample clinical data
    for entry in bundle[:3]:  # Show first 3 entries
        rtype = entry.get('resource_type')
        if rtype == 'Patient':
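            name = entry.get('name', {})
            given = name.get('given', ['Unknown'])
            # 'given' may be a list of name parts; join it for display
            given_text = ' '.join(given) if isinstance(given, list) else given
            print(f"Patient: {given_text} {name.get('family', 'Unknown')}")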
        elif rtype == 'Condition':
            clinical_data = entry.get('clinical_data', {})
            code = clinical_data.get('code', {}).get('text', 'Unknown')
            print(f"Condition: {code}")
        elif rtype in ['Observation', 'MedicationStatement']:
            clinical_data = entry.get('clinical_data', {})
            code_text = clinical_data.get('code', {}).get('text', 'Unknown')
            print(f"{rtype}: {code_text}")

# Step 5: Store in CORE Memory

## What happens in CORE Memory storage?

Both `trials_processor` and `patients_processor` can store data directly in CORE Memory:
- **Creates** dedicated spaces for trials and patients
- **Indexes** mCODE elements for semantic search
- **Enables** cross-space queries and matching
- **Supports** advanced clinical correlations

## CORE Memory Spaces
- `mCODE Patients` - Patient clinical data
- `mCODE Research Protocols` - Clinical trial protocols

## Storage Process
1. Initialize CORE Memory client with API key
2. Create or reuse mCODE-aligned spaces
3. Seed comprehensive mCODE ontology
4. Ingest structured mCODE data
5. Enable semantic search and matching
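
Step 1 depends on an API key being present, so it can be worth checking for it before starting a long ingest run. The sketch below uses the `COREAI_API_KEY` variable that the setup cell checks; `core_api_key_available` is a hypothetical helper and does not contact CORE Memory or validate the key itself.

```python
import os


def core_api_key_available(env=None):
    """Return True if COREAI_API_KEY is set to a non-blank value."""
    env = os.environ if env is None else env
    return bool(env.get("COREAI_API_KEY", "").strip())


# Passing a dict makes the check easy to try without touching the real environment.
print(core_api_key_available({"COREAI_API_KEY": "example-key"}))
print(core_api_key_available({}))
```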

## Command Structure
```bash
# Store trials
python -m src.cli.trials_processor demo_trials_raw.json --ingest --model deepseek-coder

# Store patients
python -m src.cli.patients_processor --patients demo_patients_raw.json --trials demo_trials_mcode.ndjson --ingest
```

## Benefits
- Persistent storage with semantic indexing
- Cross-referencing between patients and trials
- Advanced search capabilities
- Scalable for large datasets

In [None]:
# Store trials in CORE Memory
!source activate mcode_translator && cd .. && python -m src.cli.trials_processor examples/demo_trials_raw.json --ingest --model deepseek-coder --verbose

In [None]:
# Store patients in CORE Memory
!source activate mcode_translator && cd .. && python -m src.cli.patients_processor --patients examples/demo_patients_raw.json --trials examples/demo_trials_mcode.ndjson --ingest --verbose

# Step 6: Verify CORE Memory Storage

## Testing Semantic Search

Once data is stored in CORE Memory, we can test semantic search capabilities:
- **Cross-space queries** between patients and trials
- **Clinical concept matching** (diagnoses, treatments, etc.)
- **Eligibility matching** for clinical trials

## What to expect
- Searches return relevant clinical data
- mCODE ontology enables intelligent matching
- Cross-references between patient and trial data

## Manual Verification

You can verify storage by:
1. Checking the CORE Memory web interface
2. Using the search API directly
3. Running additional queries

## Next Steps

With data stored in CORE Memory, you can:
- Build patient-trial matching systems
- Perform advanced clinical analytics
- Develop AI-powered oncology applications
- Scale to larger datasets

In [None]:
# Verify files were created
import os

files_to_check = [
    'demo_trials_raw.json',
    'demo_trials_mcode.ndjson',
    'demo_patients_raw.json',
    'demo_patients_mcode.ndjson'
]

print("Generated files:")
for file in files_to_check:
    # Check both current directory and examples/
    exists = os.path.exists(file) or os.path.exists(f'examples/{file}')
    if exists:
        # Get size from whichever location exists
        file_path = file if os.path.exists(file) else f'examples/{file}'
        size = os.path.getsize(file_path)
    else:
        size = 0
    print(f"  {'✅' if exists else '❌'} {file} ({size} bytes)")

print("\n📊 Pipeline Summary:")
print(f"  Trials fetched: {len(trials_raw) if 'trials_raw' in locals() else 'N/A'}")
print(f"  Trials processed: {len(mcode_trials) if 'mcode_trials' in locals() else 'N/A'}")
print(f"  Patients fetched: {len(patients_raw) if 'patients_raw' in locals() else 'N/A'}")
print(f"  Patients processed: {len(mcode_patients) if 'mcode_patients' in locals() else 'N/A'}")
print("\n🎉 Pipeline complete! If the ingest steps succeeded, the data is now stored in CORE Memory spaces.")

# Alternative: End-to-End Processor

## What does `end_to_end_processor` do?

The `end_to_end_processor` CLI tool runs the complete pipeline in one command:
- **Fetches** both trials and patients
- **Processes** with mCODE mapping
- **Stores** everything in CORE Memory
- **Provides** comprehensive workflow management

## When to use
- For complete automation
- When you want all data in one operation
- For production workflows

## Command Structure
```bash
python -m src.cli.end_to_end_processor --condition "breast cancer" --trials-limit 3 --patients-limit 3 --ingest
```

## Benefits
- Single command execution
- Optimized resource usage
- Comprehensive error handling
- Token usage tracking

In [None]:
# Example of end-to-end processor (commented out to avoid duplicate processing)
# !source activate mcode_translator && cd .. && python -m src.cli.end_to_end_processor --condition "breast cancer" --trials-limit 2 --patients-limit 2 --ingest --verbose

# Cleanup

## Optional: Remove Generated Files

If you want to clean up the intermediate files:

```bash
rm demo_trials_raw.json demo_trials_mcode.ndjson demo_patients_raw.json demo_patients_mcode.ndjson
```
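
The same cleanup can be done from a notebook cell in Python, which also works on systems without `rm`. This is a sketch: `remove_generated_files` is a hypothetical helper, and the demonstration runs against a temporary directory so that no real files are touched.

```python
import tempfile
from pathlib import Path

GENERATED_FILES = [
    "demo_trials_raw.json",
    "demo_trials_mcode.ndjson",
    "demo_patients_raw.json",
    "demo_patients_mcode.ndjson",
]


def remove_generated_files(directory):
    """Delete the intermediate files found in `directory`; return the names removed."""
    removed = []
    for name in GENERATED_FILES:
        path = Path(directory) / name
        if path.exists():
            path.unlink()
            removed.append(name)
    return removed


# Demonstrate with a throwaway directory containing one of the four files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo_trials_raw.json").write_text("[]", encoding="utf-8")
    removed = remove_generated_files(tmp)

print(removed)
```

To clean up the real files, call `remove_generated_files(".")` from the directory that holds them.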

## Data Persistence

- Intermediate files can be deleted after processing
- CORE Memory retains all stored data
- Spaces persist across sessions

## Re-running the Pipeline

You can re-run any step independently:
- Fetch new data with different parameters
- Re-process with different models/prompts
- Add more data to existing CORE Memory spaces