# Complete mCODE Translation Workflow with Concurrency

This notebook demonstrates the complete mCODE translation pipeline using individual component scripts with **concurrent processing**.

**Prerequisites:**
- The `mcode_translator` conda environment is active in the notebook kernel
- API keys are configured (ClinicalTrials.gov and CORE Memory)
- All dependencies are installed

**What this notebook does:**
1. **Concurrent Fetching**: Fetches clinical trial and patient data using multiple workers
2. **Concurrent Optimization**: Tests AI model combinations in parallel
3. **Concurrent Processing**: Converts data to mCODE format with multiple workers
4. **Concurrent Summarization**: Generates summaries and stores them in CORE Memory

**Concurrency Features:**
- **Fetcher Pool**: 4 workers for concurrent API calls
- **Processor Pool**: 8 workers for parallel data processing
- **Optimizer Pool**: 2 workers for parallel model testing
- **Task Queues**: Priority-based task execution with progress tracking
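
As a rough illustration of the pool sizes and priority queues listed above, the sketch below runs prioritized tasks on a fixed-size thread pool using only the standard library. It is not the project's implementation: `POOL_SIZES`, `run_prioritized`, and the demo tasks are hypothetical names chosen for this example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Pool sizes quoted in this notebook; the dictionary itself is hypothetical.
POOL_SIZES = {"fetcher": 4, "processor": 8, "optimizer": 2}


def run_prioritized(tasks, workers):
    """Run (priority, name, func) tasks on a thread pool.

    Lower priority numbers are submitted first. Returns a dict that maps
    each task name to its result.
    """
    # Keying the heap on (priority, submission order) keeps ties stable
    # and means the callables themselves are never compared.
    heap = [(priority, order, name, func)
            for order, (priority, name, func) in enumerate(tasks)]
    heapq.heapify(heap)

    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {}
        while heap:
            _, _, name, func = heapq.heappop(heap)
            futures[name] = pool.submit(func)
        # Simple progress tracking: report each task as its result is collected.
        for done, (name, future) in enumerate(futures.items(), start=1):
            results[name] = future.result()
            print(f"[{done}/{len(futures)}] finished {name}")
    return results


demo_tasks = [
    (2, "summarize", lambda: "summary"),
    (1, "fetch", lambda: "raw data"),
]
demo_results = run_prioritized(demo_tasks, POOL_SIZES["fetcher"])
```

Threads suit this kind of work when tasks spend most of their time waiting on network responses; whether the component scripts use threads, processes, or async I/O is not stated in this notebook.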

## Step 1: Environment Setup

Configure API keys and verify the environment is ready.

In [None]:
# Configure environment and API keys
import os
import sys
from pathlib import Path

# Change to project root directory
current_dir = Path.cwd()
if current_dir.name == 'examples':
    project_root = current_dir.parent
    os.chdir(project_root)
    print(f"📁 Changed working directory to: {project_root}")
else:
    project_root = current_dir

# Add project root to Python path
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))
    print("✅ Added project root to Python path")

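# Configure API keys. setdefault keeps any key already exported in the
# environment; replace the placeholders below if yours are not set yet.
os.environ.setdefault('CLINICAL_TRIALS_API_KEY', 'your_clinical_trials_api_key_here')
os.environ.setdefault('COREAI_API_KEY', 'your_core_memory_api_key_here')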

print("✅ API keys configured")
print("✅ Environment ready")
print(f"📍 Current working directory: {Path.cwd()}")
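
The setup cell prints a success message even when the keys are still placeholders. A minimal sketch of a stricter check follows, assuming the two environment variable names and placeholder strings used in the cell above; `PLACEHOLDERS` and `unconfigured_keys` are hypothetical names for this example.

```python
import os

# Placeholder values copied from the setup cell above.
PLACEHOLDERS = {
    "CLINICAL_TRIALS_API_KEY": "your_clinical_trials_api_key_here",
    "COREAI_API_KEY": "your_core_memory_api_key_here",
}


def unconfigured_keys(env):
    """Return the names of keys that are missing, empty, or still placeholders."""
    return [
        name
        for name, placeholder in PLACEHOLDERS.items()
        if env.get(name, "") in ("", placeholder)
    ]


problems = unconfigured_keys(os.environ)
if problems:
    print("⚠️  Set these keys before continuing:", ", ".join(problems))
else:
    print("✅ Both API keys look configured")
```

The check only confirms that a value is present; it does not confirm that either service accepts the key.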

## Step 2: Fetch Clinical Trial Data

Download clinical trial data from ClinicalTrials.gov for breast cancer studies.

**What are clinical trials?** Research studies that test new treatments on human volunteers.

In [None]:
# Fetch 5 breast cancer clinical trials from ClinicalTrials.gov
# This provides real-world treatment data for mCODE conversion
# Using 4 concurrent workers for faster fetching
!python -m src.cli.trials_fetcher --condition "breast cancer" --limit 5 --out raw_trials.ndjson --workers 4 --verbose
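
To inspect what the fetcher wrote, the sketch below loads `raw_trials.ndjson` and prints the top-level keys of the first record. It assumes the `.ndjson` extension means newline-delimited JSON with one record per line; the record schema is not shown in this notebook, and `read_ndjson` is a hypothetical helper.

```python
import json
from pathlib import Path


def read_ndjson(path):
    """Parse a newline-delimited JSON file into a list, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


trials_path = Path("raw_trials.ndjson")
if trials_path.exists():
    trials = read_ndjson(trials_path)
    print(f"Loaded {len(trials)} trial records")
    if trials and isinstance(trials[0], dict):
        print("Top-level keys of the first record:", sorted(trials[0]))
else:
    print(f"{trials_path} not found; run the fetch cell first")
```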

## Step 3: Fetch Synthetic Patient Data

Download synthetic patient records that mimic real breast cancer patients.

**Why synthetic patients?** They protect patient privacy while providing realistic data for testing.

In [None]:
# Fetch 5 synthetic breast cancer patients from the 10-year archive
# These are artificially generated but realistic patient records
!python -m src.cli.patients_fetcher --archive breast_cancer_10_years --limit 5 --out raw_patients.ndjson --verbose

## Step 4: Optimize AI Model Parameters

Find the best AI model and prompt combination for processing breast cancer data.

**Why optimize?** Different AI models work better with different types of medical data.

**Note:** The optimizer works only with existing files; it never fetches new data via APIs.

In [None]:
# Test up to 3 model+prompt combinations on the fetched trial data,
# scoring each with 3-fold cross-validation, to find the best
# configuration for processing breast cancer trials
# The optimizer reads only local files; it never fetches data via APIs
# Using 2 concurrent workers for parallel optimization
!python -m src.cli.trials_optimizer --trials-file raw_trials.ndjson --cv-folds 3 --max-combinations 3 --save-config optimal_config.json --workers 2 --verbose
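
To make `--cv-folds 3` concrete: in k-fold cross-validation the records are divided into k folds, and each fold is held out once for evaluation while the rest are used for fitting or tuning. The sketch below shows how 5 trials could be divided into 3 folds. How the optimizer actually splits and scores the data is not shown in this notebook, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_items, n_folds):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Fold sizes differ by at most one item, and every index is held out
    exactly once.
    """
    if not 1 < n_folds <= n_items:
        raise ValueError("n_folds must be between 2 and n_items")
    base, extra = divmod(n_items, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds absorb the remainder, one item each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


# 5 trials split into 3 folds, matching --limit 5 and --cv-folds 3 above
folds = list(k_fold_indices(5, 3))
for train, test in folds:
    print(f"train on {train}, evaluate on {test}")
```

With only 5 trials the held-out folds contain one or two records each, so the resulting scores are likely to be noisy.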

### 4a: Load Optimal Configuration

Read the optimization results and store the best model/prompt combination for later use.

In [None]:
# Load the optimal AI configuration found by the optimizer
import json
from pathlib import Path

config_file = Path('optimal_config.json')
if config_file.exists():
    with open(config_file, 'r') as f:
        optimal_config = json.load(f)
    
    # Store the best settings for use in later steps
    BEST_MODEL = optimal_config['optimal_settings']['model']
    BEST_PROMPT = optimal_config['optimal_settings']['prompt']
    
    print("🎯 Optimal configuration:")
    print(f"   Model: {BEST_MODEL}")
    print(f"   Prompt: {BEST_PROMPT}")
    print(f"   CV score: {optimal_config['optimal_settings']['cv_score']:.3f}")
else:
    # Fallback to defaults if optimization failed
    BEST_MODEL = 'deepseek-coder'
    BEST_PROMPT = 'direct_mcode_evidence_based_concise'
    print("⚠️  Using default configuration (optimization may have failed)")
    print(f"   Model: {BEST_MODEL}")
    print(f"   Prompt: {BEST_PROMPT}")
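
The cell above falls back to defaults only when the file is missing; a malformed file or a missing key would raise an exception. A minimal sketch of a more defensive loader follows. The JSON layout is inferred solely from the keys the cell reads (`optimal_settings` containing `model` and `prompt`), and `load_best_settings` is a hypothetical helper.

```python
import json
from pathlib import Path

# Defaults copied from the fallback branch of the cell above.
DEFAULT_MODEL = "deepseek-coder"
DEFAULT_PROMPT = "direct_mcode_evidence_based_concise"


def load_best_settings(path):
    """Return (model, prompt) from the optimizer output, or the defaults.

    Falls back when the file is missing, is not valid JSON, or lacks the
    expected keys.
    """
    try:
        with open(path, "r", encoding="utf-8") as handle:
            settings = json.load(handle)["optimal_settings"]
        return settings["model"], settings["prompt"]
    except (OSError, json.JSONDecodeError, KeyError, TypeError):
        return DEFAULT_MODEL, DEFAULT_PROMPT


BEST_MODEL, BEST_PROMPT = load_best_settings(Path("optimal_config.json"))
print(f"Model: {BEST_MODEL}, prompt: {BEST_PROMPT}")
```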

## Step 5: Process Trials to mCODE Format

Convert raw clinical trial data to structured mCODE format using the optimized AI model.

**What is mCODE?** Minimal Common Oncology Data Elements, a standardized format for cancer data.

In [None]:
# Convert raw trial data to mCODE format using the best AI model
# This structures the trial information according to mCODE standards
# Using 8 concurrent workers for faster processing
!python -m src.cli.trials_processor raw_trials.ndjson --out mcode_trials.ndjson --model {BEST_MODEL} --prompt {BEST_PROMPT} --workers 8 --verbose
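
The `{BEST_MODEL}` and `{BEST_PROMPT}` placeholders in the command above are expanded by IPython, so that line works only inside a notebook. For a plain Python script, one possible equivalent is sketched below using `subprocess`. The module path and flags are copied from the cell above; `build_processor_command` is a hypothetical helper, and the call that would launch the CLI is left commented out.

```python
import shlex
import subprocess
import sys


def build_processor_command(model, prompt, workers=8):
    """Build the trials_processor argument list used in the cell above."""
    return [
        sys.executable, "-m", "src.cli.trials_processor",
        "raw_trials.ndjson",
        "--out", "mcode_trials.ndjson",
        "--model", model,
        "--prompt", prompt,
        "--workers", str(workers),
        "--verbose",
    ]


command = build_processor_command("deepseek-coder", "direct_mcode_evidence_based_concise")
# shlex.join shows the command as it would be typed in a shell
print(shlex.join(command))

# Uncomment to run from the project root; check=True raises if the CLI exits non-zero.
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a model or prompt name ever contains spaces.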

## Step 6: Process Patients to mCODE Format

Convert raw patient data to structured mCODE format.

**Patient processing** focuses on individual medical histories, diagnoses, and treatments.

In [None]:
# Convert raw patient data to mCODE format
# This structures patient medical histories according to mCODE standards
# Using 8 concurrent workers for faster processing
!python -m src.cli.patients_processor --in raw_patients.ndjson --out mcode_patients.ndjson --workers 8 --verbose

## Step 7: Generate Summaries and Store in CORE Memory

Create human-readable summaries of all processed data and store permanently.

**CORE Memory** provides persistent storage and search capabilities for all processed data.

### 7a: Generate Trial Summaries

Create readable summaries of the clinical trials and store them in CORE Memory.

In [None]:
# Generate human-readable summaries of the mCODE trial data
# Store summaries in CORE Memory for permanent access and search
!python -m src.cli.trials_summarizer --in mcode_trials.ndjson --model {BEST_MODEL} --ingest --verbose

### 7b: Generate Patient Summaries

Create readable summaries of the patient data and store them in CORE Memory.

In [None]:
# Generate human-readable summaries of the mCODE patient data
# Store summaries in CORE Memory for permanent access and search
!python -m src.cli.patients_summarizer --in mcode_patients.ndjson --ingest --verbose

## Step 8: Verify Results

Check that all files were created successfully and show a summary of the work completed.

In [None]:
# Check which files were created and their sizes
from pathlib import Path

files_to_check = [
    'optimal_config.json',
    'raw_trials.ndjson',
    'raw_patients.ndjson',
    'mcode_trials.ndjson',
    'mcode_patients.ndjson'
]

print("📁 Generated files:")
for filename in files_to_check:
    file_path = Path(filename)
    if file_path.exists():
        size = file_path.stat().st_size
        print(f"   ✅ {filename} ({size} bytes)")
    else:
        print(f"   ❌ {filename} (not found)")

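# Report success only when every expected file is present
missing = [name for name in files_to_check if not Path(name).exists()]
if missing:
    print(f"\n⚠️  Workflow incomplete; missing files: {', '.join(missing)}")
else:
    print("\n🎉 mCODE translation workflow completed!")
    print("All expected output files are present.")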
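
File sizes confirm that the outputs exist, but not that every record survived processing. Assuming each processor writes one output line per input line (this notebook does not confirm that), the sketch below compares record counts between the raw and mCODE files. `count_records` is a hypothetical helper.

```python
from pathlib import Path


def count_records(path):
    """Count non-blank lines in an NDJSON file; return None if the file is missing."""
    path = Path(path)
    if not path.exists():
        return None
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


pairs = [
    ("raw_trials.ndjson", "mcode_trials.ndjson"),
    ("raw_patients.ndjson", "mcode_patients.ndjson"),
]
for raw_name, mcode_name in pairs:
    raw_count, mcode_count = count_records(raw_name), count_records(mcode_name)
    if raw_count is None or mcode_count is None:
        status = "missing file"
    elif raw_count == mcode_count:
        status = "counts match"
    else:
        status = "counts differ"
    print(f"{raw_name}: {raw_count} -> {mcode_name}: {mcode_count} ({status})")
```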

## Summary

This notebook ran the complete mCODE translation workflow with **concurrent processing**:

- ✅ **Fetched** 5 clinical trials and 5 synthetic patients for breast cancer
- ✅ **Optimized** AI model parameters using existing files (no API calls)
- ✅ **Processed** all data into structured mCODE format
- ✅ **Generated** human-readable summaries
- ✅ **Stored** everything permanently in CORE Memory

### Generated Files:
- `raw_trials.ndjson` - Original clinical trial data
- `raw_patients.ndjson` - Original synthetic patient data
- `mcode_trials.ndjson` - Structured trial mCODE data
- `mcode_patients.ndjson` - Structured patient mCODE data
- `optimal_config.json` - Best AI model configuration

### Key Features:
- **Concurrent fetching**: 4 workers for API calls
- **Concurrent optimization**: 2 workers for model testing
- **Concurrent processing**: 8 workers for data conversion
- **File-only optimization**: Optimizer never uses APIs, only cross-validates existing files
- **Component-based**: Uses individual scripts for each step
- **Educational**: Each command is explained with context

### Next Steps:
- Explore the generated mCODE data
- Search CORE Memory for specific information
- Run the workflow with different conditions or parameters
- Integrate the processed data into other applications