
# mCODE CORE Ingestion - Patient Translator

This notebook ingests **mCODE-compliant oncology data** into two CORE Memory spaces:

- **mCODE Patients**: Patient clinical data following mCODE standards
- **mCODE Research Protocols**: Clinical trial protocols with mCODE-aligned eligibility criteria

The notebook runs these steps:

1. Creates **mCODE-aligned CORE spaces** (reuses existing spaces, or creates them if missing)
2. Seeds an **mCODE ontology** that defines the entity types and predicates used for clinical feature matching
3. Ingests **mCODE patients directly** (structured records, no summarization)
4. Ingests **mCODE research protocols directly** (structured trial data with eligibility mappings)
5. Runs **semantic searches against both spaces** to spot-check clinical matching

> Reads the CORE API key from the `COREAI_API_KEY` environment variable (loaded from `.env` via `dotenv`). Spaces are created automatically if missing.
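
Because the key comes from the environment, a missing or blank value only surfaces later as an authentication failure. The sketch below is one way to fail early. The helper name `require_api_key` is hypothetical and not part of the notebook; only the variable name `COREAI_API_KEY` comes from the setup cell.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None,
                    name: str = "COREAI_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is missing or blank.

    `require_api_key` is an illustrative helper, not part of the notebook.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your environment or .env file")
    return key
```

Calling `require_api_key()` right after `dotenv.load_dotenv()` would stop the notebook before any network call is attempted.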


In [12]:
# =============================================================================
# ALL NON-INTERACTIVE CODE - Setup, Imports, and Helper Functions
# =============================================================================

import os, json, uuid, datetime, typing, textwrap, base64, decimal
import dotenv
import ijson
from typing import Any, Dict, List, Optional
from src.utils.core_memory_client import CoreMemoryClient, CoreMemoryError

# --- Configuration ---
dotenv.load_dotenv()
CORE_API_KEY = os.getenv("COREAI_API_KEY")
CORE_URL = "https://core.heysol.ai/api/v1/mcp"
SPACE_ID = None  # paste an existing space id to reuse, else leave as None to create a new Space

# Behavior
ATTACH_SOURCE_JSON = True
VERBOSE = True
TIMEOUT = 60

# =============================================================================
# HELPER FUNCTIONS
# =============================================================================

def safe(val, default=""):
    """Return val, or default when val is None."""
    return val if val is not None else default

def decimal_converter(obj):
    """Convert Decimal objects to floats for JSON serialization."""
    if isinstance(obj, decimal.Decimal):
        return float(obj)
    raise TypeError(f"Object of type {obj.__class__.__name__} is not JSON serializable")

def ingest_patients_from_path(client: CoreMemoryClient, space_id: str, path: str):
    """Ingest mCODE patient data directly without summarization."""
    count = 0
    with open(path, 'rb') as fh:
        # Use ijson to parse the JSON stream from the root of the array
        for patient_mcode in ijson.items(fh, 'item'):
            # Ingest the mCODE data structure directly for optimal clinical feature matching
            client.ingest(json.dumps(patient_mcode, default=decimal_converter), space_id=space_id)
            count += 1
    return f"Ingested {count} mCODE patient records directly."

def ingest_trials_from_path(client: CoreMemoryClient, space_id: str, path: str):
    """Ingest mCODE trial data directly without summarization."""
    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    # Case A: simple list of our trial dicts
    if isinstance(data, list) and all(isinstance(x, dict) for x in data) and any("id" in x or "title" in x for x in data):
        for trial_mcode in data:
            # Ingest the mCODE trial data structure directly
            client.ingest(json.dumps(trial_mcode), space_id=space_id)
        return f"Ingested {len(data)} mCODE trial records directly."
    # Case B: CT.gov wrapper with our mCODE translations
    if isinstance(data, dict) and "successful_trials" in data:
        trials = data.get("successful_trials", [])
        for trial_mcode in trials:
            # Ingest the mCODE trial data structure directly (contains our translations)
            client.ingest(json.dumps(trial_mcode), space_id=space_id)
        return f"Ingested {len(trials)} mCODE trial records directly."
    raise ValueError("Unsupported trials JSON shape.")

In [13]:
# =============================================================================
# ONTOLOGY DEFINITION
# =============================================================================

ONTOLOGY_MD = '''# Ontology: mCODE Core Concepts (STU4-inspired)

- EntityType: Person attrs={internalId, birthDate}
- EntityType: CancerDiagnosis attrs={id, site, histologyMorphologyBehavior, onsetDateTime} codedAs SNOMED|ICD-O-3
- EntityType: CancerStage attrs={system, version, overallStage, date} codedAs NCIt
- EntityType: BiomarkerResult attrs={analyte, method, value, unit, interpretation} codedAs LOINC
- EntityType: GenomicVariant attrs={gene, hgvs, zygosity, consequence} codedAs NCIt
- EntityType: TreatmentEvent attrs={type, agent, dose, unit, route, startDate, endDate}
- EntityType: Regimen attrs={name, components}
- EntityType: DiseaseStatus attrs={status, date} codedAs NCIt
- EntityType: Specimen attrs={id, collectionDate, site}
- EntityType: Encounter attrs={id, date, type}
- EntityType: Organization attrs={id, name}
- EntityType: Practitioner attrs={id, name}

- Predicate: hasDiagnosis domain=Person range=CancerDiagnosis
- Predicate: stagedAs domain=CancerDiagnosis range=CancerStage
- Predicate: hasTNM_T domain=CancerStage range=Code(T)
- Predicate: hasTNM_N domain=CancerStage range=Code(N)
- Predicate: hasTNM_M domain=CancerStage range=Code(M)
- Predicate: hasBiomarker domain=Person range=BiomarkerResult
- Predicate: hasVariant domain=Person range=GenomicVariant
- Predicate: receivedTreatment domain=Person range=TreatmentEvent
- Predicate: partOfRegimen domain=TreatmentEvent range=Regimen
- Predicate: hasDiseaseStatus domain=Person range=DiseaseStatus
- Predicate: codedAs domain=Any range=Code(system,code,display)
- Predicate: recordedAt domain=Any range=DateTime
- Predicate: performedBy domain=TreatmentEvent range=Practitioner
'''
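
The ontology above is plain text, so a typo in an entity name would go unnoticed until matching misbehaves. As a rough sanity check, the sketch below parses the `EntityType` and `Predicate` lines and lists predicate ranges that no entity type defines. The function names, the regular expressions, and the choice to treat `Code(...)` and `DateTime` as built-ins are assumptions made for illustration; CORE itself may interpret the text differently.

```python
import re
from typing import Dict, List, Set, Tuple

# Patterns for the two line kinds used in ONTOLOGY_MD (assumed grammar).
ENTITY_RE = re.compile(r"^- EntityType: (\w+) attrs=\{([^}]*)\}(?: codedAs (\S+))?$")
PREDICATE_RE = re.compile(r"^- Predicate: (\w+) domain=(\w+) range=(\S+)$")

# Ranges treated as built-in value types rather than entity types (assumption).
BUILTIN_RANGES = {"DateTime"}


def parse_ontology(md: str) -> Tuple[Dict[str, List[str]], List[Tuple[str, str, str]]]:
    """Return ({entity: [attrs]}, [(predicate, domain, range)]) from ontology text."""
    entities: Dict[str, List[str]] = {}
    predicates: List[Tuple[str, str, str]] = []
    for raw in md.splitlines():
        line = raw.strip()
        match = ENTITY_RE.match(line)
        if match:
            name, attrs, _systems = match.groups()
            entities[name] = [a.strip() for a in attrs.split(",") if a.strip()]
            continue
        match = PREDICATE_RE.match(line)
        if match:
            predicates.append(match.groups())
    return entities, predicates


def undefined_ranges(entities: Dict[str, List[str]],
                     predicates: List[Tuple[str, str, str]]) -> Set[str]:
    """Return predicate ranges that are neither entity types nor built-ins."""
    missing: Set[str] = set()
    for _name, _domain, rng in predicates:
        if rng in entities or rng in BUILTIN_RANGES or rng.startswith("Code("):
            continue
        missing.add(rng)
    return missing
```

In the notebook, `undefined_ranges(*parse_ontology(ONTOLOGY_MD))` could be run before seeding; a non-empty result would point at a predicate whose range has no matching entity type.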

In [7]:
!python -m src.cli.trials_fetcher --condition "breast cancer" --limit 2 --output data/my_trials.json
!python -m src.cli.trials_processor data/my_trials.json --output data/my_trials_mcode.json --dry-run
TRIALS_PATH   = "data/my_trials_mcode.json"


2025-09-16 13:53:43,602 - TrialsFetcherWorkflow - INFO - 🔍 Searching for trials: 'breast cancer' (limit: 2)
2025-09-16 13:53:43,602 - src.utils.api_manager - INFO - Cache HIT with key 0bd74d31... in namespace 'clinical_trials'
2025-09-16 13:53:43,602 - src.pipeline.fetcher - INFO - Cache HIT for search_trials
2025-09-16 13:53:43,602 - TrialsFetcherWorkflow - INFO - 📋 Found 2 trials
2025-09-16 13:53:43,603 - TrialsFetcherWorkflow - INFO - 💾 Results saved to: data/my_trials.json
✅ Trials fetch completed successfully!
📊 Total trials fetched: 2
💾 Results saved to: data/my_trials.json
🔍 Fetch type: condition_search
2025-09-16 13:53:44,919 - __main__ - INFO - 🔬 Processing 2 trials...
2025-09-16 13:53:44,919 - __main__ - INFO - Initializing trials processor workflow...
2025-09-16 13:53:44,920 - src.utils.api_manager - INFO - APIManager initialized with config TTL: 0 seconds
2025-09-16 13:53:44,920 - src.utils.api_manager - INFO - Initialized API cache for names

In [8]:

!python -m src.cli.patients_fetcher --archive breast_cancer_10_years --limit 2 --output data/my_patients.json
!python -m src.cli.patients_processor --patients data/my_patients.json --output data/my_patients_mcode.json --dry-run
PATIENTS_PATH = "data/my_patients_mcode.json"

2025-09-16 13:53:46,390 - PatientsFetcherWorkflow - INFO - 📥 Fetching up to 2 patients from breast_cancer_10_years
2025-09-16 13:53:46,390 - src.utils.patient_generator - INFO - Resolved named archive 'breast_cancer_10_years' to: data/synthetic_patients/breast_cancer/10_years/breast_cancer_10_years.zip
2025-09-16 13:53:46,390 - src.utils.patient_generator - INFO - Scanning patient data archive: data/synthetic_patients/breast_cancer/10_years/breast_cancer_10_years.zip
2025-09-16 13:53:46,406 - src.utils.patient_generator - INFO - Found 1115 patient data files
2025-09-16 13:53:46,485 - PatientsFetcherWorkflow - INFO - ✅ Successfully fetched 2 patients
2025-09-16 13:53:46,631 - PatientsFetcherWorkflow - INFO - 💾 Patient data saved to: data/my_patients.json
✅ Patients fetch completed successfully!
📊 Total patients fetched: 2
💾 Results saved to: data/my_patients.json
🔍 Fetch type: multiple_patients
📁 Archive: breast_cancer_10_years
🔬 Processing 2 patient records...
P

# 🚀 STEP 1: Initialize CORE Memory Client

**What happens here:**
- Creates a secure connection to CORE Memory using your API key
- Initializes the MCP (Model Context Protocol) session
- Sets up the client for ingesting mCODE data

**Why this matters:**
- Ensures secure, authenticated access to CORE Memory
- Prepares the connection for high-volume mCODE data ingestion
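
For high-volume ingestion, a transient network failure during setup or mid-run is plausible. The sketch below shows a generic retry wrapper with exponential backoff. The name `with_retries` and the default exception types are assumptions; whether `CoreMemoryError` signals a retryable condition is not stated in this notebook.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T],
                 attempts: int = 3,
                 base_delay: float = 1.0,
                 retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying on the given exceptions with exponential backoff.

    Illustrative helper, not part of the notebook or the CORE client.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ...
    raise ValueError("attempts must be at least 1")
```

A call such as `with_retries(lambda: client.ingest(payload, space_id=space_id))` would wrap a single ingestion request; `payload` and `space_id` stand for whatever the caller already has.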

In [15]:
# Initialize CORE Memory Client
try:
    client = CoreMemoryClient(api_key=CORE_API_KEY, base_url=CORE_URL)
    print("✅ CORE Memory client initialized successfully")
    print(f"🔗 Connected to: {CORE_URL}")
except CoreMemoryError as e:
    print(f"❌ Failed to initialize CORE client: {e}")
    raise

✅ CORE Memory client initialized successfully
🔗 Connected to: https://core.heysol.ai/api/v1/mcp


# 🏗️ STEP 2: Create mCODE-Aligned Spaces

**What happens here:**
- Checks for existing "mCODE Patients" and "mCODE Research Protocols" spaces
- Creates them automatically if they don't exist
- Returns the space IDs for data ingestion

**Why this matters:**
- Organizes data by clinical purpose (patients vs research protocols)
- Enables separate indexing and searching of patient data vs trial data
- mCODE alignment ensures consistent clinical terminology across spaces
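
The get-or-create behaviour described above can be sketched independently of the client. How `get_patients_space_id()` and `get_clinical_trials_space_id()` are implemented is not shown in this notebook, so the two callables below (`list_spaces`, `create_space`) and the `{"id", "name"}` record shape are hypothetical stand-ins.

```python
from typing import Callable, Dict, Iterable


def get_or_create_space(name: str,
                        list_spaces: Callable[[], Iterable[Dict[str, str]]],
                        create_space: Callable[[str], str]) -> str:
    """Return the id of the space called `name`, creating it when absent.

    `list_spaces` yields dicts with "id" and "name"; `create_space` returns the
    new id. Both are hypothetical stand-ins for whatever the client exposes.
    """
    for space in list_spaces():
        if space.get("name") == name:
            return space["id"]
    return create_space(name)
```

Looking up by name keeps the notebook safe to re-run: a second run finds the existing space instead of creating a duplicate.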

In [16]:
# Get or create mCODE-aligned space IDs
try:
    patients_space_id = client.get_patients_space_id()
    trials_space_id = client.get_clinical_trials_space_id()
    
    print(f"🏥 mCODE Patients Space ID: {patients_space_id}")
    print(f"🧬 mCODE Research Protocols Space ID: {trials_space_id}")
    print("✅ Spaces ready for mCODE data ingestion")
    
except CoreMemoryError as e:
    print(f"❌ Failed to setup spaces: {e}")
    raise

🏥 mCODE Patients Space ID: cmfm0gb9p010rqf1vcfpcxcti
🧬 mCODE Research Protocols Space ID: cmfm0gblw010tqf1v2ang7w4h
✅ Spaces ready for mCODE data ingestion


# 📚 STEP 3: Seed mCODE Ontology

**What happens here:**
- Ingests the comprehensive mCODE ontology into both spaces
- Defines all entity types (Person, CancerDiagnosis, etc.) and relationships
- Establishes the knowledge graph structure for clinical data

**Why this matters:**
- Provides the semantic foundation for understanding clinical relationships
- Enables intelligent matching between patients and clinical trials
- Standardizes clinical terminology across all ingested data
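
The seeding cell sends the ontology text on every run. Whether CORE deduplicates identical documents is not stated here, so one cautious option is to fingerprint the text and skip seeding when nothing changed. The helper names and the idea of keeping a local set of fingerprints are assumptions for illustration.

```python
import hashlib
from typing import Set


def ontology_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the ontology, ignoring trailing whitespace."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_seeding(text: str, seeded_fingerprints: Set[str]) -> bool:
    """True when this exact ontology text has not been recorded as seeded."""
    return ontology_fingerprint(text) not in seeded_fingerprints
```

The set of fingerprints would have to be persisted somewhere (for example a small JSON file next to the notebook) for the check to survive kernel restarts.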

In [17]:
# Seed mCODE ontology in both spaces
try:
    client.ingest(ONTOLOGY_MD, space_id=patients_space_id)
    client.ingest(ONTOLOGY_MD, space_id=trials_space_id)
    
    print("📚 mCODE ontology seeded in both spaces")
    print("🔗 Knowledge graph structure established")
    print("✅ Ready for clinical data ingestion")
    
except CoreMemoryError as e:
    print(f"❌ Failed to seed ontology: {e}")
    raise

📚 mCODE ontology seeded in both spaces
🔗 Knowledge graph structure established
✅ Ready for clinical data ingestion


# 🏥 STEP 4: Ingest mCODE Patient Data

**What happens here:**
- Reads patient data from `PATIENTS_PATH` (`data/my_patients_mcode.json`, set in the patient fetch cell above)
- Ingests each patient's mCODE data structure directly (no summarization)
- Preserves all clinical codes, relationships, and structured data

**Why this matters:**
- Direct ingestion maintains full clinical fidelity
- Structured mCODE data enables precise clinical matching
- Preserves coded values for intelligent trial matching
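
The `decimal_converter` helper exists because `ijson`, by default, yields non-integer JSON numbers as `decimal.Decimal`, which `json.dumps` cannot serialize on its own. The sketch below reproduces that situation with the standard library only, using `parse_float=decimal.Decimal` to stand in for `ijson`. The record fields are made up for illustration.

```python
import decimal
import json


def decimal_converter(obj):
    """Convert Decimal objects to floats for JSON serialization."""
    if isinstance(obj, decimal.Decimal):
        return float(obj)
    raise TypeError(f"Object of type {obj.__class__.__name__} is not JSON serializable")


# Parse floats as Decimal, mimicking what a streaming parser may hand back.
record = json.loads('{"id": "p1", "value": 2.5}', parse_float=decimal.Decimal)

try:
    json.dumps(record)  # fails: Decimal is not serializable by default
    plain_dump_failed = False
except TypeError:
    plain_dump_failed = True

# With the converter, the record round-trips to a JSON string.
payload = json.dumps(record, default=decimal_converter)
```

Converting to `float` can lose precision for values with many significant digits; serializing as `str(obj)` is an alternative when exact values matter more than numeric typing.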

In [18]:
# Ingest mCODE patient data directly
try:
    result = ingest_patients_from_path(client, patients_space_id, PATIENTS_PATH)
    print(f"✅ {result}")
    print("📊 Patient data queued for processing")
    
except Exception as e:
    print(f"❌ Failed to ingest patient data: {e}")
    raise

✅ Ingested 2 mCODE patient records directly.
📊 Patient data queued for processing


# 🧬 STEP 5: Ingest mCODE Research Protocols

**What happens here:**
- Reads trial data from `TRIALS_PATH` (`data/my_trials_mcode.json`, set in the trials fetch cell above)
- Ingests each trial's mCODE data structure directly
- Preserves eligibility criteria, treatment protocols, and clinical codes

**Why this matters:**
- Maintains structured eligibility criteria for patient matching
- Preserves clinical trial protocols in mCODE format
- Enables semantic matching between patient data and trial requirements
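
`ingest_trials_from_path` accepts two JSON shapes: a list of trial dicts, or a dict with a `successful_trials` key. When neither matches, the error message alone does not reveal what the file contained. The sketch below separates shape detection into a pure function and reports the shape it found. The name `extract_trials` and the error wording are assumptions; the two accepted shapes mirror the helper defined earlier.

```python
from typing import Any, Dict, List


def extract_trials(data: Any) -> List[Dict[str, Any]]:
    """Return the list of trial dicts from either supported JSON shape."""
    # Case A: a non-empty list of dicts where at least one has "id" or "title".
    if (isinstance(data, list)
            and all(isinstance(x, dict) for x in data)
            and any("id" in x or "title" in x for x in data)):
        return data
    # Case B: wrapper dict holding the trials under "successful_trials".
    if isinstance(data, dict) and "successful_trials" in data:
        return data.get("successful_trials") or []
    # Otherwise describe what was found so the failure is easier to diagnose.
    if isinstance(data, dict):
        shape = f"dict with keys {sorted(data)[:10]}"
    elif isinstance(data, list):
        shape = f"list of {len(data)} item(s)"
    else:
        shape = type(data).__name__
    raise ValueError(f"Unsupported trials JSON shape: {shape}")
```

Running this on the loaded file before ingestion would show, for instance, which top-level keys are present when the shape is rejected.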

In [19]:
# Ingest mCODE trial data directly
try:
    result = ingest_trials_from_path(client, trials_space_id, TRIALS_PATH)
    print(f"✅ {result}")
    print("📊 Trial data queued for processing")
    
except Exception as e:
    print(f"❌ Failed to ingest trial data: {e}")
    raise

❌ Failed to ingest trial data: Unsupported trials JSON shape.


ValueError: Unsupported trials JSON shape.

# 📈 STEP 6: Check Ingestion Status

**What happens here:**
- Prints a fixed reminder that both spaces are queued; the cell does not query CORE for live status
- Explains that ingestion is processed asynchronously
- Points to the CORE web interface for monitoring completion

**Why this matters:**
- CORE Memory processes large datasets asynchronously
- Status tracking helps manage expectations
- Ensures users know where to check for completion
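
Since processing is asynchronous, a script that depends on the data may want to wait rather than check the web interface by hand. The sketch below is a generic polling loop. The `is_done` callable is hypothetical: this notebook does not show a status endpoint, so what it checks (for example, whether a search returns any result) is left to the caller.

```python
import time
from typing import Callable


def wait_until(is_done: Callable[[], bool],
               timeout: float = 300.0,
               interval: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll is_done() until it returns True or `timeout` seconds elapse.

    Returns True on success and False on timeout. Illustrative helper only.
    """
    deadline = clock() + timeout
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The `sleep` and `clock` parameters exist so the loop can be exercised in tests without real waiting.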

In [None]:
# Check ingestion status
print("\n=== 📈 INGESTION STATUS ===")
print("Patients space status: queued")
print("Clinical Trials space status: queued")
print("\n💡 Note: Ingestion is queued and will be processed asynchronously.")
print("🔍 Check the CORE web interface for completion status.")
print("⏱️ Large datasets may take several minutes to fully process.")

# 🔍 STEP 7: Test Semantic Searches

**What happens here:**
- Performs sample searches across both spaces
- Tests clinical concept recognition and matching
- Demonstrates cross-space semantic capabilities

**Why this matters:**
- Checks that ingested records can be retrieved by clinical terms (results may be empty while processing is still underway)
- Tests clinical feature matching capabilities
- Shows how patient and trial data can be semantically linked
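
The search cell prints the first 1200 characters of each result followed by `...`, whether or not anything was cut. A small helper can add the ellipsis only when the text was actually truncated. The name `preview` is hypothetical, and `default=str` is an assumption to keep the helper from failing on values `json` cannot serialize.

```python
import json
from typing import Any


def preview(obj: Any, limit: int = 1200) -> str:
    """Pretty-print obj as JSON, truncated to `limit` characters when longer."""
    text = json.dumps(obj, indent=2, default=str)
    if len(text) <= limit:
        return text
    return text[:limit] + " ..."
```

With this helper, `print(preview(res))` would replace the slice-and-print pattern in the search cell.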

In [None]:
# Test semantic searches (may return empty initially while processing)
print("\n=== 🔍 PATIENTS SPACE SEARCH ===")
try:
    res = client.search("Breast", space_id=patients_space_id)
    print("Search 'Breast' in Patients space (first 1200 chars):")
    print(json.dumps(res, indent=2)[:1200], "...")
except Exception as e:
    print(f"Search failed (expected during processing): {e}")

print("\n=== 🔍 CLINICAL TRIALS SPACE SEARCH ===")
try:
    res = client.search("Breast", space_id=trials_space_id)
    print("Search 'Breast' in Clinical Trials space (first 1200 chars):")
    print(json.dumps(res, indent=2)[:1200], "...")
    
    res = client.search("Osimertinib", space_id=trials_space_id)
    print("\nSearch 'Osimertinib' in Clinical Trials space (first 1200 chars):")
    print(json.dumps(res, indent=2)[:1200], "...")
except Exception as e:
    print(f"Search failed (expected during processing): {e}")

print("\n🎉 mCODE data ingestion complete!")
print("📊 Data is being processed asynchronously in CORE Memory.")
print("🔬 Ready for clinical feature matching and patient-trial correlations.")