# 🚀 **Complete mCODE Translation Workflow**

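Transform clinical trial and synthetic patient data into the standardized mCODE (minimal Common Oncology Data Elements) format, using the best-performing AI model and prompt combination.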

**🎯 What you'll accomplish:**
- Fetch real clinical trial data from ClinicalTrials.gov
- Test every AI model and prompt combination to find the best performer
- Convert data to mCODE format
- Generate summaries and store in CORE Memory

**⚡ Features:**
- Concurrent processing with 8-15 workers
- Dynamic model/prompt discovery
- Cross-validation optimization
- Real-time progress tracking


## 🔧 **Step 1: Environment Setup**

Configure your environment and API keys to get started.

**📋 Prerequisites:**
- Python environment with dependencies installed
- API keys for LLM providers (OpenAI, DeepSeek, etc.)
- Optional: CORE Memory API key for data storage
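
Before running the cells below, it can help to confirm which keys are actually configured. The following is a minimal sketch, assuming the variable names used in the setup cell (`CLINICAL_TRIALS_API_KEY`, `COREAI_API_KEY`). The helper `missing_keys` and the convention that placeholder values start with `your_` are illustrative assumptions, not part of the project.

```python
import os

def missing_keys(names, env=None):
    """Return the names that are unset, empty, or still hold a placeholder value."""
    env = os.environ if env is None else env
    missing = []
    for name in names:
        value = env.get(name, "")
        # Assumption: placeholders follow the 'your_..._here' pattern used in this notebook
        if not value or value.startswith("your_"):
            missing.append(name)
    return missing

# Variable names taken from the setup cell; the CORE Memory key is optional
for name in missing_keys(["CLINICAL_TRIALS_API_KEY", "COREAI_API_KEY"]):
    print(f"Not configured: {name}")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.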


In [None]:
# Configure environment and API keys
import os
import sys
from pathlib import Path
# Change to project root directory
current_dir = Path.cwd()
if current_dir.name == 'examples':
    project_root = current_dir.parent
    os.chdir(project_root)
    print(f"📁 Changed working directory to: {project_root}")
else:
    project_root = current_dir
# Add project root to Python path
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))
    print("✅ Added project root to Python path")
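# Configure API keys: export real keys in your shell before starting the notebook.
# The placeholders below are applied only when a variable is not already defined.
os.environ.setdefault('CLINICAL_TRIALS_API_KEY', 'your_clinical_trials_api_key_here')
os.environ.setdefault('COREAI_API_KEY', 'your_core_memory_api_key_here')
print("✅ API key variables set (replace any placeholder values with real keys)")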
print("✅ Environment ready")
print(f"📍 Current working directory: {Path.cwd()}")


## 📊 **Step 2: Data Acquisition**

Fetch clinical trial data and download synthetic patient records.

**🎯 Goals:**
- Get 5 breast cancer clinical trials from ClinicalTrials.gov
- Download synthetic patient data archives
- Prepare data for mCODE conversion
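
The fetch commands below write `.ndjson` files. Assuming they follow the usual NDJSON convention of one JSON object per line, a small reader like this sketch can be used to inspect them. The helper name `read_ndjson` and the sample records are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def read_ndjson(path):
    """Yield one parsed JSON object per non-empty line of an NDJSON file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo on a throwaway file with made-up records
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.ndjson"
    sample.write_text('{"id": "A"}\n\n{"id": "B"}\n', encoding="utf-8")
    records = list(read_ndjson(sample))

print(f"Loaded {len(records)} records")
```

After the fetch steps have run, the same helper can be pointed at `raw_trials.ndjson` or `raw_patients.ndjson`.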


In [None]:
# Fetch 5 breast cancer clinical trials from ClinicalTrials.gov
# This provides real-world treatment data for mCODE conversion
# Using 8 concurrent workers for faster fetching
!python -m src.cli.trials_fetcher --condition "breast cancer" --limit 5 --out raw_trials.ndjson --workers 8 --verbose


In [None]:
# Download all available synthetic patient data archives
# This provides comprehensive test data for the workflow
!python -m scripts.download_data --all


### 2a: Fetch Synthetic Patient Data

Download synthetic patient records that mimic real breast cancer patients.

**🛡️ Why synthetic patients?** They protect patient privacy while providing realistic data for testing.


In [None]:
# Fetch 5 synthetic breast cancer patients from the 10-year archive
# These are artificially generated but realistic patient records
!python -m src.cli.patients_fetcher --archive breast_cancer_10_years --limit 5 --out raw_patients.ndjson --verbose


## 🧪 **Step 3: AI Model Optimization**

Find the best AI model and prompt combination for your data.

**🔬 Process:**
1. Discover all available models and prompts
2. Test all combinations (8 models × 4 prompts = 32 tests)
3. Use cross-validation for reliable results
4. Save optimal configuration for production

**⚡ Performance:** 15 concurrent workers for maximum speed
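
To make the arithmetic concrete, the sketch below enumerates a model × prompt grid and builds simple k-fold splits. It is an illustration only: the model and prompt names are placeholders, `k_fold_indices` is a hypothetical helper, and how the optimizer actually splits the data is not shown in this notebook.

```python
from itertools import product

def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds; return (train, test) index lists."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# Placeholder names standing in for the discovered models and prompts
models = [f"model_{i}" for i in range(8)]
prompts = [f"prompt_{i}" for i in range(4)]
combinations = list(product(models, prompts))

# 5 trials split into 3 folds, matching --limit 5 and --cv-folds 3
splits = k_fold_indices(5, 3)
print(f"{len(combinations)} combinations, {len(splits)} folds each")
```

If every combination were evaluated once per fold, 32 combinations with 3 folds would mean 96 evaluation runs, which is one reason a high worker count helps here.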


In [None]:
# Get available models and prompts dynamically
# This demonstrates how to programmatically get the available options
import subprocess
import sys
# Get available models (sys.executable reuses the notebook's interpreter)
models_output = subprocess.run([
    sys.executable, "-m", "src.cli.trials_optimizer", "--list-models"
], capture_output=True, text=True)
# Get available prompts
prompts_output = subprocess.run([
    sys.executable, "-m", "src.cli.trials_optimizer", "--list-prompts"
], capture_output=True, text=True)
# Parse the output to extract model and prompt names
models_lines = models_output.stdout.strip().split('\n')
prompts_lines = prompts_output.stdout.strip().split('\n')
# Keep only bullet lines ('• name'); header and blank lines are ignored
AVAILABLE_MODELS = [line.split('• ', 1)[1].strip() for line in models_lines if '• ' in line]
AVAILABLE_PROMPTS = [line.split('• ', 1)[1].strip() for line in prompts_lines if '• ' in line]
print(f"🤖 Available models ({len(AVAILABLE_MODELS)}): {', '.join(AVAILABLE_MODELS)}")
print(f"📝 Available prompts ({len(AVAILABLE_PROMPTS)}): {', '.join(AVAILABLE_PROMPTS)}")
# Create comma-separated strings for command-line usage
MODELS_STR = ','.join(AVAILABLE_MODELS)
PROMPTS_STR = ','.join(AVAILABLE_PROMPTS)
print(f"\n📋 Command-ready strings:")
print(f"   Models: {MODELS_STR}")
print(f"   Prompts: {PROMPTS_STR}")


In [ ]:
# Demo: List all available models
!python -m src.cli.trials_optimizer --list-models


In [ ]:
# Demo: List all available prompts
!python -m src.cli.trials_optimizer --list-prompts


In [ ]:
# Full optimization using ALL available models and prompts
# This tests all combinations: 8 models × 4 prompts = 32 total combinations
# The optimizer reads trial data from the input file only; it never fetches trials from an API
# Using 15 concurrent workers for maximum speed
!python -m src.cli.trials_optimizer --trials-file raw_trials.ndjson --cv-folds 3 --prompts direct_mcode_evidence_based_concise,direct_mcode_evidence_based,direct_mcode_minimal,direct_mcode_structured --models deepseek-coder,deepseek-chat,deepseek-reasoner,gpt-4-turbo,gpt-4o,gpt-4o-mini,gpt-3.5-turbo,claude-3 --max-combinations 0 --save-config optimal_config.json --workers 15 --verbose


### 3a: Load Optimal Configuration

Read the optimization results and store the best model/prompt combination for later use.
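
The loading cell below reads three keys under `optimal_settings`. As a rough sketch, the file is therefore assumed to look like the dictionary here. The shape is inferred only from those keys, the values are made up, and `extract_best` is a hypothetical helper that tolerates missing fields.

```python
import json

# Shape inferred from the keys read by the loading cell; values are illustrative
example_config = {
    "optimal_settings": {
        "model": "deepseek-coder",
        "prompt": "direct_mcode_evidence_based_concise",
        "cv_score": 0.0,
    }
}

def extract_best(config, default_model, default_prompt):
    """Return (model, prompt), falling back to defaults when a key is absent."""
    settings = config.get("optimal_settings", {})
    return (
        settings.get("model", default_model),
        settings.get("prompt", default_prompt),
    )

# Round-trip through JSON to mimic reading the file from disk
loaded = json.loads(json.dumps(example_config))
best = extract_best(loaded, "deepseek-coder", "direct_mcode_minimal")
print(best)
```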


In [None]:
# Load the optimal AI configuration found by the optimizer
import json
from pathlib import Path
config_file = Path('optimal_config.json')
if config_file.exists():
    with open(config_file, 'r') as f:
        optimal_config = json.load(f)
    # Store the best settings for use in later steps
    BEST_MODEL = optimal_config['optimal_settings']['model']
    BEST_PROMPT = optimal_config['optimal_settings']['prompt']
    print(f"🎯 Optimal configuration:")
    print(f"   Model: {BEST_MODEL}")
    print(f"   Prompt: {BEST_PROMPT}")
    print(f"   CV score: {optimal_config['optimal_settings']['cv_score']:.3f}")
else:
    # Fall back to defaults if the config file is missing
    BEST_MODEL = 'deepseek-coder'
    BEST_PROMPT = 'direct_mcode_evidence_based_concise'
    print("⚠️  Using default configuration (optimization may have failed)")
    print(f"   Model: {BEST_MODEL}")
    print(f"   Prompt: {BEST_PROMPT}")


## ⚙️ **Step 4: Data Processing**

Convert raw clinical data to standardized mCODE format.

**🔄 Transformations:**
- Clinical trials → mCODE format
- Patient records → mCODE format
- Apply optimal AI model configuration

**⚡ Performance:** 12 concurrent workers for efficient processing
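
For readers curious what a `--workers` setting typically implies, here is a minimal sketch of fanning records out to a thread pool. It does not reproduce the project's implementation: `process_all` and `fake_convert` are hypothetical, and threads are assumed to be a good fit only because LLM calls are I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(records, convert, workers=12):
    """Apply `convert` to every record on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of the inputs
        return list(pool.map(convert, records))

def fake_convert(record):
    # Placeholder for a real, network-bound conversion call
    return {"id": record["id"], "status": "converted"}

results = process_all([{"id": i} for i in range(5)], fake_convert, workers=12)
print(f"Converted {len(results)} records")
```

With only five records, most of the 12 workers stay idle, so the worker count matters mainly for larger inputs.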


In [None]:
# Convert raw trial data to mCODE format using the best AI model
# This structures the trial information according to mCODE standards
# Using 12 concurrent workers for faster processing
!python -m src.cli.trials_processor raw_trials.ndjson --out mcode_trials.ndjson --model {BEST_MODEL} --prompt {BEST_PROMPT} --workers 12 --verbose


### 4a: Process Patients to mCODE Format

Convert raw patient data to structured mCODE format.

**📋 Patient processing** focuses on individual medical histories, diagnoses, and treatments.


In [None]:
# Convert raw patient data to mCODE format
# This structures patient medical histories according to mCODE standards
# Using 12 concurrent workers for faster processing
!python -m src.cli.patients_processor --in raw_patients.ndjson --out mcode_patients.ndjson --workers 12 --verbose


## 📝 **Step 5: Generate Summaries & Store**

Create human-readable summaries and store everything permanently.

**📋 Outputs:**
- Readable summaries of all processed data
- Permanent storage in CORE Memory
- Searchable clinical information database


In [None]:
# Generate human-readable summaries of the mCODE trial data
# Store summaries in CORE Memory for permanent access and search
!python -m src.cli.trials_summarizer --in mcode_trials.ndjson --model {BEST_MODEL} --ingest --verbose


In [None]:
# Generate human-readable summaries of the mCODE patient data
# Store summaries in CORE Memory for permanent access and search
!python -m src.cli.patients_summarizer --in mcode_patients.ndjson --ingest --verbose


## ✅ **Step 6: Verify Results**

Check that everything worked correctly and celebrate success!

**📊 Validation:**
- Confirm all files were created
- Verify data integrity
- Check processing statistics
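
The verification cell below checks only that each file exists and reports its size. A basic integrity check could also confirm that every line parses as JSON. The sketch below assumes the one-object-per-line NDJSON layout, and `check_ndjson` is a hypothetical helper rather than a project function.

```python
import json
import tempfile
from pathlib import Path

def check_ndjson(path):
    """Return (valid_record_count, line_numbers_that_failed_to_parse)."""
    valid, bad_lines = 0, []
    with Path(path).open("r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are skipped, not counted as errors
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                bad_lines.append(line_number)
    return valid, bad_lines

# Demo on a throwaway file containing one deliberately broken line
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.ndjson"
    sample.write_text('{"ok": 1}\nnot json\n\n{"ok": 2}\n', encoding="utf-8")
    valid, bad_lines = check_ndjson(sample)

print(f"{valid} valid records, bad lines: {bad_lines}")
```

Run against `mcode_trials.ndjson` and `mcode_patients.ndjson`, an empty list of bad lines would indicate that every record is at least well-formed JSON.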


In [None]:
# Check which files were created and their sizes
import os
from pathlib import Path
files_to_check = [
    'optimal_config.json',
    'raw_trials.ndjson',
    'raw_patients.ndjson',
    'mcode_trials.ndjson',
    'mcode_patients.ndjson'
]
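print("📁 Generated files:")
missing = []
for filename in files_to_check:
    file_path = Path(filename)
    if file_path.exists():
        size = file_path.stat().st_size
        print(f"   ✅ {filename} ({size:,} bytes)")
    else:
        missing.append(filename)
        print(f"   ❌ {filename} (not found)")
# Report success only when every expected file is present
if missing:
    print(f"\n⚠️  {len(missing)} expected file(s) missing; re-run the failed steps above.")
else:
    print("\n🎉 mCODE translation workflow completed!")
    print("All expected output files were created.")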


## 🎉 **Workflow Complete!**

You've successfully transformed clinical trial data into standardized mCODE format!

**🏆 Achievements:**
- ✅ Fetched clinical trial data
- ✅ Optimized AI model performance
- ✅ Converted to mCODE format
- ✅ Generated readable summaries
- ✅ Stored in CORE Memory

**🚀 Next Steps:**
- Explore your processed mCODE data
- Search CORE Memory for specific information
- Run with different conditions or datasets
- Integrate with other clinical systems
