# 🚀 Enhanced mCODE Translator Demo

**Transform clinical trial data into standardized mCODE elements with AI-powered precision**

This enhanced notebook demonstrates advanced features of the mCODE Translator framework:
- **Concurrent Processing**: Multi-worker processing for improved performance
- **Optimized Prompts**: Breast cancer-specific prompt optimization
- **CORE Memory Integration**: Direct storage of patient and trial summaries
- **Process Management**: Advanced monitoring and control techniques
- **Native IPython Commands**: Streamlined execution using `!python` magic

## 🎯 What is mCODE?

mCODE (Minimal Common Oncology Data Elements) is a standardized data model that enables:
- **Interoperability**: Consistent representation of cancer data across healthcare systems
- **Research**: Facilitates clinical trial matching and patient recruitment
- **Analytics**: Enables advanced analysis of cancer treatment patterns
- **AI Integration**: Provides structured data for machine learning applications

## 📋 Prerequisites

- Python 3.10+ in the `mcode_translator` conda environment
- ClinicalTrials.gov API key (optional for demo)
- CORE Memory API key (optional for demo)
- Internet connection for data fetching

## 🏗️ Enhanced Pipeline Overview

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Fetch Data    │ -> │   Process with   │ -> │  Store Results  │
│  (Concurrent)   │    │   AI & mCODE     │    │                 │
│ • Multi-worker  │    │ • Optimized      │    │ • CORE Memory   │
│   Processing    │    │   Prompts        │    │ • Summaries     │
│ • Breast Cancer │    │ • Concurrent     │    │ • Searchable    │
│   Optimized     │    │   Workers        │    │ • Persistent    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```
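
The three stages in the diagram can be read as a simple data flow. The sketch below is a toy, in-memory illustration and not the framework's implementation: `fetch_trials`, `process_trial`, and `store_summary` are hypothetical stand-ins for the fetcher, the AI-backed processor, and CORE Memory storage, and the element name is only an illustrative value.

```python
from typing import Any, Dict, List


def fetch_trials(limit: int) -> List[Dict[str, Any]]:
    """Hypothetical fetch stage: returns canned records instead of calling an API."""
    return [{"trial_id": f"TRIAL-{i}", "text": "breast cancer study"} for i in range(limit)]


def process_trial(trial: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical processing stage: a keyword check stands in for the AI model."""
    mappings = []
    if "breast cancer" in trial["text"]:
        # Illustrative mapping only; real mappings come from the processor.
        mappings.append({"mcode_element": "PrimaryCancerCondition", "value": "breast cancer"})
    return {"trial_id": trial["trial_id"], "mcode_elements": {"mcode_mappings": mappings}}


def store_summary(store: Dict[str, Any], record: Dict[str, Any]) -> None:
    """Hypothetical storage stage: a dict stands in for CORE Memory."""
    store[record["trial_id"]] = record


store: Dict[str, Any] = {}
for raw in fetch_trials(limit=3):
    store_summary(store, process_trial(raw))

print(sorted(store))
```

Each stage only depends on the output of the previous one, which is what lets the real pipeline run the stages as separate CLI commands.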

## 📊 Expected Outcomes

By the end of this notebook, you will have:
- Processed clinical trial data with optimized breast cancer prompts
- Utilized concurrent processing for improved performance
- Stored patient and trial summaries directly in CORE Memory
- Demonstrated advanced process management techniques
- Visualized key insights from the enhanced processing pipeline

## 🛠️ Setup and Environment Configuration

First, let's set up the environment and verify our configuration.

In [None]:
# Environment setup and imports

import os
import json
import sys
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
from typing import Dict, List, Any

# Add project root to path
current_dir = Path.cwd()
PROJECT_ROOT = current_dir.parent
sys.path.insert(0, str(PROJECT_ROOT))

print(f"Project root: {PROJECT_ROOT}")
print(f"Working directory: {current_dir}")
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")

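# Set up environment variables without overwriting keys that are already set
# (replace the placeholders with your actual keys, or export them beforehand)
PLACEHOLDER_KEYS = {
    'CLINICAL_TRIALS_API_KEY': 'your_clinical_trials_key_here',
    'COREAI_API_KEY': 'your_core_memory_key_here',
}
for key_name, placeholder in PLACEHOLDER_KEYS.items():
    os.environ.setdefault(key_name, placeholder)

# Verify environment: a placeholder value does not count as configured
core_api_key = os.getenv('COREAI_API_KEY')
if core_api_key and core_api_key != PLACEHOLDER_KEYS['COREAI_API_KEY']:
    print("✅ CORE API key configured")
else:
    print("⚠️  CORE API key not configured - some features will be limited")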

## 📥 Step 1: Fetch Clinical Trial Data with Concurrency

Let's fetch clinical trial data using concurrent processing for improved performance.
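
As a rough illustration of what a worker pool does, the sketch below fetches several items in parallel and keeps one failed item from aborting the rest. It is a minimal sketch under stated assumptions, not how `trials_fetcher` is implemented: `fetch_one` and `fetch_all` are hypothetical names, and the fetch itself is simulated.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List, Tuple


def fetch_one(trial_id: str) -> dict:
    """Hypothetical fetch; a real version would call the ClinicalTrials.gov API."""
    if trial_id == "BAD":
        raise ValueError("simulated fetch failure")
    return {"id": trial_id}


def fetch_all(trial_ids: List[str], workers: int = 4) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Fetch all IDs with a bounded thread pool, collecting errors per item."""
    results: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_one, tid): tid for tid in trial_ids}
        for future in as_completed(futures):
            tid = futures[future]
            try:
                results[tid] = future.result()
            except Exception as exc:  # isolate the failure to this one item
                errors[tid] = str(exc)
    return results, errors


results, errors = fetch_all(["A", "B", "BAD", "C"])
print(sorted(results), errors)
```

Threads suit this kind of work because fetching is dominated by waiting on the network rather than by CPU time.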

In [None]:
# Fetch clinical trials data with concurrent processing
# Using IPython magic commands for streamlined execution

print("🚀 Fetching clinical trials with concurrent processing (4 workers)...")

# Use IPython magic for shell commands - more concise and native to Jupyter
!cd {PROJECT_ROOT} && python -m src.cli.trials_fetcher --condition "breast cancer" --limit 5 --output examples/demo_trials_raw.json --workers 4 --verbose

In [None]:
# Inspect the fetched trial data

trials_file = PROJECT_ROOT / "examples" / "demo_trials_raw.json"

if trials_file.exists():
    with open(trials_file, 'r') as f:
        trials_raw = json.load(f)
    
    print(f"📊 Fetched {len(trials_raw)} clinical trials")
    
    # Display sample trial information
    if trials_raw:
        trial = trials_raw[0]
        print("\n📋 Sample Trial:")
        print(f"NCT ID: {trial.get('protocolSection', {}).get('identificationModule', {}).get('nctId')}")
        print(f"Title: {trial.get('protocolSection', {}).get('identificationModule', {}).get('briefTitle', 'N/A')[:100]}...")
        conditions = trial.get('protocolSection', {}).get('conditionsModule', {}).get('conditions', [])
        print(f"Conditions: {conditions}")
        
        # Show data structure
        print(f"\n🔍 Raw data structure:")
        print(f"Top-level keys: {list(trial.keys())}")
        
else:
    print("❌ Trial data file not found")
    trials_raw = []

## 🧪 Step 2: Process Trials with Optimized Breast Cancer Prompts

Now we'll process the raw trial data using AI with breast cancer-optimized prompts and concurrent workers.
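
The processor writes its output as NDJSON, one JSON object per line. The sketch below shows a round trip through that format using only the record fields this notebook reads later (`trial_id`, `mcode_elements`, `mcode_mappings`, `mcode_element`, `value`). The helper names `write_ndjson` and `read_ndjson` and the sample values are assumptions for illustration.

```python
import io
import json
from typing import Any, Dict, List, TextIO


def write_ndjson(records: List[Dict[str, Any]], handle: TextIO) -> None:
    """Write one compact JSON object per line."""
    for record in records:
        handle.write(json.dumps(record) + "\n")


def read_ndjson(handle: TextIO) -> List[Dict[str, Any]]:
    """Read NDJSON, skipping blank lines and reporting malformed ones."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            print(f"Skipping malformed line {line_number}: {exc}")
    return records


# Illustrative record; real values come from the processor output.
sample = [{
    "trial_id": "TRIAL-0",
    "mcode_elements": {
        "mcode_mappings": [{"mcode_element": "PrimaryCancerCondition", "value": "breast cancer"}]
    },
}]

buffer = io.StringIO()
write_ndjson(sample, buffer)
buffer.write("{not valid json\n")  # simulate a partially written line
buffer.seek(0)
loaded = read_ndjson(buffer)
print(len(loaded))
```

Because every line is independent, a truncated final line (for example from an interrupted run) costs one record rather than the whole file.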

In [None]:
# Process trials with optimized breast cancer prompts and concurrent processing
# Using IPython magic for streamlined execution

print("🧪 Processing trials with breast cancer-optimized prompts (4 concurrent workers)...")

# Use optimized prompt for breast cancer trials
!cd {PROJECT_ROOT} && python -m src.cli.trials_processor examples/demo_trials_raw.json --output examples/demo_trials_mcode.ndjson --model deepseek-coder --prompt direct_mcode_evidence_based_concise --workers 4 --verbose

In [None]:
# Load and analyze the mCODE-processed trial data

mcode_trials_file = PROJECT_ROOT / "examples" / "demo_trials_mcode.ndjson"
mcode_trials = []

if mcode_trials_file.exists():
    with open(mcode_trials_file, 'r') as f:
        for line in f:
            if line.strip():
                mcode_trials.append(json.loads(line))
    
    print(f"📊 Processed {len(mcode_trials)} trials with mCODE mapping")
    
    if mcode_trials:
        trial = mcode_trials[0]
        print("\n🔬 Sample mCODE Trial:")
        print(f"Trial ID: {trial.get('trial_id')}")
        
        mcode_elements = trial.get('mcode_elements', {})
        print(f"mCODE elements found: {list(mcode_elements.keys())}")
        
        # Show mCODE mappings
        mappings = mcode_elements.get('mcode_mappings', [])
        if mappings:
            print("\n📋 First few mCODE mappings:")
            for mapping in mappings[:3]:
                print(f"  • {mapping.get('mcode_element')}: {mapping.get('value')}")
        
        # Show processing metadata
        if 'processing_metadata' in mcode_elements:
            meta = mcode_elements['processing_metadata']
            print(f"\n⚙️ Processing metadata:")
            print(f"  Model: {meta.get('model')}")
            print(f"  Processing time: {meta.get('processing_time_seconds', 'N/A')}s")
            
else:
    print("❌ mCODE trials file not found")

## 🏥 Step 3: Fetch and Process Patient Data

Let's fetch synthetic patient data and process it with mCODE mapping.

In [None]:
# Fetch synthetic patient data

print("🏥 Fetching synthetic breast cancer patient data...")

!cd {PROJECT_ROOT} && python -m src.cli.patients_fetcher --archive breast_cancer_10_years --limit 5 --output examples/demo_patients_raw.json --verbose

In [None]:
# Process patients with mCODE mapping and concurrent processing

print("🧪 Processing patients with mCODE mapping (4 concurrent workers)...")

!cd {PROJECT_ROOT} && python -m src.cli.patients_processor --patients examples/demo_patients_raw.json --trials examples/demo_trials_mcode.ndjson --output examples/demo_patients_mcode.ndjson --workers 4 --verbose

In [None]:
# Analyze the processed patient data

mcode_patients_file = PROJECT_ROOT / "examples" / "demo_patients_mcode.ndjson"
mcode_patients = []

if mcode_patients_file.exists():
    with open(mcode_patients_file, 'r') as f:
        for line in f:
            if line.strip():
                mcode_patients.append(json.loads(line))
    
    print(f"📊 Processed {len(mcode_patients)} patients with mCODE mapping")
    
    if mcode_patients:
        patient = mcode_patients[0]
        bundle = patient.get('patient_bundle', [])
        
        print(f"\n🏥 Sample mCODE Patient:")
        print(f"Patient bundle has {len(bundle)} mCODE entries")
        
        # Count resource types
        resource_types = {}
        for entry in bundle:
            rtype = entry.get('resource_type')
            if rtype:
                resource_types[rtype] = resource_types.get(rtype, 0) + 1
        
        print(f"Resource types: {resource_types}")
        
        # Show sample clinical data
        print("\n📋 Sample clinical entries:")
        for entry in bundle[:3]:
            rtype = entry.get('resource_type')
            if rtype == 'Patient':
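                name = entry.get('name', {})
                # 'given' may be a list of names; join it so it prints as plain text
                given = name.get('given', ['Unknown'])
                given_text = ' '.join(given) if isinstance(given, list) else str(given)
                print(f"  • Patient: {given_text} {name.get('family', 'Unknown')}")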
            elif rtype == 'Condition':
                clinical_data = entry.get('clinical_data', {})
                code = clinical_data.get('code', {}).get('text', 'Unknown')
                print(f"  • Condition: {code}")
            elif rtype in ['Observation', 'MedicationStatement']:
                clinical_data = entry.get('clinical_data', {})
                code_text = clinical_data.get('code', {}).get('text', 'Unknown')
                print(f"  • {rtype}: {code_text}")
                
else:
    print("❌ mCODE patients file not found")

## 🧠 Step 4: Store Patient and Trial Summaries in CORE Memory

Let's store both patient and trial summaries directly in CORE Memory spaces.

In [None]:
# Store trials in CORE Memory with summaries

print("🧠 Storing trials with summaries in CORE Memory...")

!cd {PROJECT_ROOT} && python -m src.cli.trials_processor examples/demo_trials_raw.json --ingest --model deepseek-coder --prompt direct_mcode_evidence_based_concise --workers 4 --verbose

In [None]:
# Store patients in CORE Memory with summaries

print("🏥 Storing patients with summaries in CORE Memory...")

!cd {PROJECT_ROOT} && python -m src.cli.patients_processor --patients examples/demo_patients_raw.json --trials examples/demo_trials_mcode.ndjson --ingest --workers 4 --verbose

## 📊 Step 5: Data Visualization and Analysis

Let's create visualizations to understand the enhanced processing results.

In [None]:
# Create visualizations of the processed data

import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

# Analyze trial data
if mcode_trials:
    # Extract mCODE element types
    element_types = []
    for trial in mcode_trials:
        mappings = trial.get('mcode_elements', {}).get('mcode_mappings', [])
        for mapping in mappings:
            element_types.append(mapping.get('mcode_element', 'Unknown'))
    
    # Count element types
    element_counts = Counter(element_types)
    
    # Plot (a single chart, so no subplot grid is needed)
    plt.figure(figsize=(12, 6))
    plt.bar(element_counts.keys(), element_counts.values())
    plt.title('mCODE Elements in Clinical Trials (Optimized)')
    plt.xlabel('Element Type')
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()
    
    print(f"📈 Found {len(element_counts)} different mCODE element types")
    print("Top elements:", dict(element_counts.most_common(5)))
else:
    print("No trial data available for visualization")

In [None]:
# Analyze patient data

if mcode_patients:
    # Extract resource types across all patients
    all_resource_types = []
    for patient in mcode_patients:
        bundle = patient.get('patient_bundle', [])
        for entry in bundle:
            rtype = entry.get('resource_type')
            if rtype:
                all_resource_types.append(rtype)
    
    # Count resource types
    resource_counts = Counter(all_resource_types)
    
    # Plot
    plt.figure(figsize=(10, 6))
    plt.bar(resource_counts.keys(), resource_counts.values())
    plt.title('FHIR Resource Types in Patient Data (Enhanced)')
    plt.xlabel('Resource Type')
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    print(f"📊 Found {len(resource_counts)} different FHIR resource types")
    print("Resource distribution:", dict(resource_counts.most_common()))
else:
    print("No patient data available for visualization")

## 🔍 Step 6: Semantic Search and Analysis

Now let's demonstrate semantic search capabilities across the stored data.

In [None]:
# Demonstrate semantic search capabilities
# Note: This requires CORE Memory to be properly configured

try:
    from src.utils.core_memory_client import CoreMemoryClient
    
    # Initialize client
    client = CoreMemoryClient(
        api_key=os.getenv('COREAI_API_KEY'),
        base_url="https://core.heysol.ai/api/v1/mcp"
    )
    
    # Get space IDs
    patients_space_id = client.get_patients_space_id()
    trials_space_id = client.get_clinical_trials_space_id()
    
    print("🔍 Performing semantic searches...")
    
    # Search for breast cancer in trials
    print("\n📋 Searching for 'breast cancer' in clinical trials:")
    try:
        results = client.search("breast cancer", space_id=trials_space_id)
        print(f"Found {len(results.get('results', []))} results")
        if results.get('results'):
            print(f"Sample result: {results['results'][0].get('content', '')[:200]}...")
    except Exception as e:
        print(f"Search failed: {e}")
    
    # Search for treatment in patients
    print("\n🏥 Searching for 'chemotherapy' in patient data:")
    try:
        results = client.search("chemotherapy", space_id=patients_space_id)
        print(f"Found {len(results.get('results', []))} results")
        if results.get('results'):
            print(f"Sample result: {results['results'][0].get('content', '')[:200]}...")
    except Exception as e:
        print(f"Search failed: {e}")
        
except ImportError:
    print("❌ CORE Memory client not available")
except Exception as e:
    print(f"❌ CORE Memory search failed: {e}")

## 📈 Step 7: Performance Analysis and Process Management

Let's analyze the performance and demonstrate process management techniques.
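
One way to check whether extra workers actually help is to time the same workload sequentially and with a pool. The sketch below does this with a simulated task; `simulated_task` and `timed` are hypothetical helpers, and the `time.sleep` call stands in for network or model latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_task(item: int) -> int:
    """Stand-in for an I/O-bound call such as an API or model request."""
    time.sleep(0.05)
    return item * 2


def timed(label, func):
    """Run func(), print its wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed


items = list(range(8))

sequential, sequential_s = timed(
    "sequential", lambda: [simulated_task(i) for i in items]
)

with ThreadPoolExecutor(max_workers=4) as pool:
    pooled, pooled_s = timed(
        "4 workers", lambda: list(pool.map(simulated_task, items))
    )
```

The gain shown here applies to waiting-dominated work; CPU-bound steps would not speed up the same way with threads.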

In [None]:
# Performance analysis and process management demonstration

import time
from datetime import datetime

# Calculate processing statistics
stats = {
    'trials_fetched': len(trials_raw) if 'trials_raw' in locals() else 0,
    'trials_processed': len(mcode_trials) if 'mcode_trials' in locals() else 0,
    'patients_processed': len(mcode_patients) if 'mcode_patients' in locals() else 0,
    'processing_timestamp': datetime.now().isoformat(),
    'concurrent_workers': 4,
    'optimized_prompts': True
}

# Calculate success rates
if stats['trials_fetched'] > 0:
    stats['trial_processing_success_rate'] = stats['trials_processed'] / stats['trials_fetched']
else:
    stats['trial_processing_success_rate'] = 0

# Display statistics
print("📊 Enhanced Processing Statistics:")
print(f"  Clinical Trials Fetched: {stats['trials_fetched']}")
print(f"  Trials Processed with mCODE: {stats['trials_processed']}")
print(f"  Patients Processed: {stats['patients_processed']}")
print(f"  Trial Processing Success Rate: {stats['trial_processing_success_rate']:.1%}")
print(f"  Concurrent Workers Used: {stats['concurrent_workers']}")
print(f"  Breast Cancer Optimized Prompts: {stats['optimized_prompts']}")
print(f"  Processing Timestamp: {stats['processing_timestamp']}")

# Quality metrics
if mcode_trials:
    total_mappings = sum(len(trial.get('mcode_elements', {}).get('mcode_mappings', [])) 
                       for trial in mcode_trials)
    avg_mappings_per_trial = total_mappings / len(mcode_trials)
    print(f"\n🔬 Quality Metrics:")
    print(f"  Total mCODE Mappings: {total_mappings}")
    print(f"  Average Mappings per Trial: {avg_mappings_per_trial:.1f}")

# File sizes
files_to_check = [
    'examples/demo_trials_raw.json',
    'examples/demo_trials_mcode.ndjson',
    'examples/demo_patients_raw.json',
    'examples/demo_patients_mcode.ndjson'
]

print(f"\n💾 Generated Files:")
for file_path in files_to_check:
    full_path = PROJECT_ROOT / file_path
    if full_path.exists():
        size = full_path.stat().st_size
        print(f"  ✅ {file_path}: {size} bytes")
    else:
        print(f"  ❌ {file_path}: not found")

In [None]:
# Process management demonstration

print("🔧 Process Management Demonstration:")
print("\n1. Check running processes:")
!ps aux | grep trials_processor | grep -v grep

print("\n2. Check file sizes:")
!ls -la {PROJECT_ROOT}/examples/demo_trials_mcode.ndjson

print("\n3. Check disk usage:")
!df -h | head -5

print("\n4. Show recent log activity:")
!tail -10 /tmp/mcode_translator.log 2>/dev/null || echo "No log file found"

## 🎉 Conclusion and Advanced Features

Congratulations! You've successfully completed an enhanced mCODE translation pipeline with advanced features.

### ✅ What We Achieved

1. **Concurrent Processing**: Utilized 4 workers for improved performance
2. **Optimized Prompts**: Used breast cancer-specific prompt optimization
3. **CORE Memory Integration**: Stored patient and trial summaries directly
4. **Native IPython Commands**: Streamlined execution with `!python` magic
5. **Process Management**: Demonstrated advanced monitoring techniques
6. **Enhanced Visualization**: Created insights from optimized processing

### 🚀 Advanced Features Demonstrated

- **Multi-worker Processing**: `python -m src.cli.trials_processor --workers 4`
- **Breast Cancer Optimization**: Specialized prompts for oncology data
- **Direct CORE Memory Storage**: `--ingest` flag for immediate storage
- **IPython Integration**: Native Jupyter magic commands
- **Process Monitoring**: Snapshot checks with `ps`, `ls`, `df`, and `tail`

### 📚 Key Commands Used

```bash
# Concurrent fetching
python -m src.cli.trials_fetcher --condition "breast cancer" --workers 4

# Optimized processing
python -m src.cli.trials_processor examples/demo_trials_raw.json --workers 4 --prompt direct_mcode_evidence_based_concise

# Direct CORE Memory storage
python -m src.cli.trials_processor examples/demo_trials_raw.json --ingest --workers 4

# Process monitoring
ps aux | grep trials_processor
```

### 🔧 Production Considerations

- **Worker Scaling**: Adjust `--workers` based on system resources
- **Memory Management**: Monitor RAM usage with concurrent processing
- **API Rate Limits**: Respect ClinicalTrials.gov API limits
- **Error Handling**: Implement retry logic for production deployments
- **Monitoring**: Set up comprehensive logging and alerting
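
For the error-handling point above, retry with exponential backoff is a common pattern. The following is a minimal sketch, assuming transient failures surface as exceptions; `retry_with_backoff` and `flaky` are hypothetical names and not part of the mCODE Translator CLI.

```python
import time


def retry_with_backoff(func, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on any exception with delays of base_delay * 2**n."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


calls = {"count": 0}


def flaky():
    """Simulated call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


delays = []
# Inject delays.append in place of time.sleep so the demo runs instantly.
result = retry_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

In a real deployment you would likely retry only on specific exception types and cap the total delay so that rate-limited APIs are not hammered.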

### 📞 Support and Resources

- **Documentation**: [mCODE Translator Docs](https://github.com/yourusername/mcode-translator)
- **API Reference**: [ClinicalTrials.gov API](https://clinicaltrials.gov/data-api/)
- **CORE Memory**: [Documentation](https://core.heysol.ai/)
- **Issues**: [GitHub Issues](https://github.com/yourusername/mcode-translator/issues)

---

**🎯 Ready to harness the power of concurrent, optimized mCODE processing!**