# 🧪 MCODE Translator - Clinical Trials Demo



Interactive demonstration of clinical trial data processing, analysis, and patient matching capabilities.



---



## 📋 What This Notebook Demonstrates



1. **📥 Clinical Trial Data Ingestion** - Multiple sources and formats

2. **🔍 Trial Search & Discovery** - Semantic search across trial databases

3. **📊 Trial Summarization** - Automated trial summary generation

4. **🏷️ Trial Classification** - Automated categorization and eligibility analysis

5. **👥 Patient-Trial Matching** - Advanced matching algorithms

6. **📈 Trial Analytics** - Enrollment trends and outcome analysis



## 🎯 Learning Objectives



- ✅ Master clinical trial data ingestion patterns

- ✅ Understand semantic search for trial discovery

- ✅ Learn automated trial summarization techniques

- ✅ Apply trial classification and eligibility analysis

- ✅ Use advanced patient-trial matching algorithms

- ✅ Generate trial analytics and insights



## 🏥 Clinical Research Use Cases



### Trial Management

- **Trial Landscape Analysis**: Identify competing and complementary trials

- **Site Selection**: Find optimal trial sites based on patient populations

- **Enrollment Optimization**: Match patients to most appropriate trials

- **Competitive Intelligence**: Track trial progress and outcomes



### Patient-Centric Applications

- **Treatment Options Discovery**: Find relevant clinical trials for patients

- **Eligibility Screening**: Automated patient eligibility assessment

- **Trial Recommendation**: Personalized trial suggestions based on patient profiles

- **Protocol Optimization**: Identify protocol amendments and updates

## 🔧 Setup and Configuration



### 📦 Import Required Libraries



**What this does:**

- Loads environment variables from `.env` file

- Imports MCODE Translator components

- Sets up path for local imports

- Fails fast with an install hint if an import is missing



**Why it's useful:**

- Ensures all dependencies are available

- Provides secure credential management

- Enables local development and testing

- Prevents runtime import errors

In [None]:
# Import required modules
import os
import sys
from pathlib import Path

from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Add heysol_api_client to path for imports
heysol_client_path = Path.cwd().parent.parent / "heysol_api_client" / "src"
if str(heysol_client_path) not in sys.path:
    sys.path.insert(0, str(heysol_client_path))

# Add src to path for imports (skipped if already present, so re-running the cell is harmless)
src_path = Path.cwd().parent / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Import MCODE Translator components
try:
    from heysol import HeySolClient
    
    from config.heysol_config import get_config
    
    print("✅ MCODE Translator components imported successfully!")
    print("   🧪 Clinical trial processing capabilities")
    print("   👥 Patient-trial matching algorithms")
    print("   📊 Trial analytics and reporting")
    
except ImportError as e:
    print("❌ Failed to import MCODE Translator components.")
    print("💡 Install with: pip install -e .")
    print(f"   Error: {e}")
    raise
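
The path setup above can be wrapped in a small helper. This is a minimal sketch, not part of MCODE Translator: `add_to_sys_path` is a hypothetical name, and it assumes that silently skipping a missing directory is acceptable for a demo notebook.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: Path) -> bool:
    """Prepend a directory to sys.path once; return True only if it was added."""
    resolved = str(directory.resolve())
    if not directory.is_dir():
        # A missing directory usually means the notebook was started from another folder
        return False
    if resolved in sys.path:
        # Already registered, so re-running the cell does not add duplicates
        return False
    sys.path.insert(0, resolved)
    return True
```

Calling `add_to_sys_path(Path.cwd().parent / "src")` would then behave the same on every re-run of the cell, and the boolean return value makes a wrong working directory easy to spot.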

### 🔑 API Key Validation



**What this does:**

- Checks for valid HeySol API key in environment

- Validates API key format and accessibility

- Initializes HeySol client for data operations

- Sets up configuration for ingestion process



**Why it's useful:**

- Ensures secure access to HeySol services

- Prevents failed operations due to authentication issues

- Provides clear feedback about connection status

- Enables proper error handling and recovery

In [None]:
# Check and validate API key
print("🔑 Checking API key configuration...")

api_key = os.getenv("HEYSOL_API_KEY")
if not api_key:
    print("❌ No API key found!")
    print("\n📝 To get started:")
    print("1. Visit: https://core.heysol.ai/settings/api")
    print("2. Generate an API key")
    print("3. Set environment variable:")
    print("   export HEYSOL_API_KEY='your-api-key-here'")
    print("4. Or create a .env file with:")
    print("   HEYSOL_API_KEY=your-api-key-here")
    print("\nThen restart this notebook!")
    raise ValueError("API key not configured")

print(f"✅ API key found (ends with: ...{api_key[-4:]})")
print("🔍 Validating API key...")

# Initialize HeySol client
try:
    client = HeySolClient(api_key=api_key)
    config = get_config()
    
    print("✅ Client initialized successfully")
    print(f"   🎯 Base URL: {config.get_base_url()}")
    print(f"   📧 Source: {config.get_heysol_config().source}")
    
except Exception as e:
    print(f"❌ Failed to initialize client: {e}")
    raise
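
Printing the last four characters of a key is fine for long keys, but a very short or malformed value would be shown almost in full. The helper below is one possible guard; `mask_api_key` and its `visible` parameter are illustrative names, not part of the HeySol client.

```python
def mask_api_key(api_key: str, visible: int = 4) -> str:
    """Return a display-safe form of the key showing at most `visible` trailing characters."""
    if len(api_key) <= visible * 2:
        # Too short to reveal a suffix without exposing most of the key
        return "*" * len(api_key)
    return "..." + api_key[-visible:]
```

With this in place, the status line could read `print(f"API key found (ends with: {mask_api_key(api_key)})")`.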

### 🏗️ Space Setup



**What this does:**

- Creates or reuses a dedicated clinical trials space

- Sets up isolated environment for trial data

- Ensures proper organization and access control

- Prepares for large-scale data ingestion



**Why it's useful:**

- Provides dedicated workspace for clinical trials

- Enables efficient data organization and retrieval

- Supports concurrent operations and access control

- Facilitates data lifecycle management

In [None]:
# Setup clinical trials space
print("🏗️ Setting up clinical trials space...")

trials_space_name = "Clinical Trials Database"
trials_space_description = "Active and historical clinical trial information"

# Check for existing space
existing_spaces = client.get_spaces()
trials_space_id = None

for space in existing_spaces:
    if isinstance(space, dict) and space.get("name") == trials_space_name:
        trials_space_id = space.get("id")
        print(f"   ✅ Found existing space: {trials_space_id[:16]}...")
        break

if not trials_space_id:
    trials_space_id = client.create_space(trials_space_name, trials_space_description)
    print(f"   ✅ Created new space: {trials_space_id[:16]}...")

print("✅ Clinical trials space ready!")
print(f"   📍 Space ID: {trials_space_id}")
print(f"   📝 Description: {trials_space_description}")
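
The find-or-create logic above can be sketched as a reusable function. The sketch assumes, as the cell above does, that `get_spaces()` returns a list of dicts with `name` and `id` keys and that `create_space()` returns the new space ID as a string; neither assumption is confirmed against the HeySol client here. `FakeClient` is a hypothetical in-memory stand-in used only to exercise the helper without network access.

```python
from typing import Any, Optional


def find_space_id(spaces: list[Any], name: str) -> Optional[str]:
    """Return the id of the first space dict whose name matches, else None."""
    for space in spaces:
        if isinstance(space, dict) and space.get("name") == name and space.get("id"):
            return str(space["id"])
    return None


def get_or_create_space(client: Any, name: str, description: str) -> str:
    """Reuse a space with this name if one exists, otherwise create it."""
    existing = find_space_id(client.get_spaces(), name)
    if existing is not None:
        return existing
    return client.create_space(name, description)


class FakeClient:
    """In-memory stand-in for the real client (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.spaces: list[dict] = []

    def get_spaces(self) -> list[dict]:
        return list(self.spaces)

    def create_space(self, name: str, description: str) -> str:
        space_id = f"space-{len(self.spaces) + 1}"
        self.spaces.append({"id": space_id, "name": name, "description": description})
        return space_id
```

Calling `get_or_create_space` twice with the same name returns the same ID and creates only one space, which is the behavior the cell above relies on when the notebook is re-run.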

## 📥 Clinical Trial Data Ingestion



### 📋 Comprehensive Trial Dataset



**What this does:**

- Creates diverse clinical trial dataset for demonstration

- Includes various phases, cancer types, and treatment approaches

- Prepares data for batch processing and ingestion

- Prints one sample record as a quick sanity check



**Why it's useful:**

- Provides realistic clinical trial data for testing

- Enables controlled ingestion scenarios

- Supports performance benchmarking

- Facilitates feature demonstration

In [None]:
# Create comprehensive clinical trial dataset
def create_comprehensive_trial_dataset():
    """
    Create a diverse dataset of clinical trials for demonstration.
    
    Returns:
        list: Comprehensive clinical trial dataset with rich metadata
    """
    return [
        {
            "content": "Phase III randomized controlled trial (NCT04567892) evaluating combination immunotherapy with nivolumab plus ipilimumab versus chemotherapy in patients with advanced BRAF-mutant melanoma. Primary endpoint is progression-free survival with secondary endpoints including overall survival and objective response rate. Trial is actively recruiting with target enrollment of 600 patients across 50 sites.",
            "metadata": {
                "trial_id": "NCT04567892",
                "phase": "III",
                "status": "recruiting",
                "cancer_type": "melanoma",
                "mutation": "BRAF",
                "treatments": ["nivolumab", "ipilimumab"],
                "comparison": "chemotherapy",
                "primary_endpoint": "progression_free_survival",
                "secondary_endpoints": ["overall_survival", "objective_response_rate"],
                "target_enrollment": 600,
                "current_enrollment": 245,
                "sites": 50,
                "start_date": "2024-01-15",
                "estimated_completion": "2026-12-31",
                "sponsor": "Bristol Myers Squibb",
                "study_design": "randomized_controlled",
                "eligibility_criteria": {
                    "age_min": 18,
                    "performance_status": "ECOG_0_1",
                    "prior_treatment": "treatment_naive",
                },
            },
        },
        {
            "content": "Phase II single-arm study (NCT02314481) investigating CDK4/6 inhibitor palbociclib combined with letrozole as first-line treatment for postmenopausal women with ER-positive, HER2-negative metastatic breast cancer. Primary endpoint is progression-free survival with secondary endpoints including overall response rate and clinical benefit rate. Currently fully enrolled with 120 patients.",
            "metadata": {
                "trial_id": "NCT02314481",
                "phase": "II",
                "status": "fully_enrolled",
                "cancer_type": "breast",
                "receptor_status": "ER+/HER2-",
                "treatments": ["palbociclib", "letrozole"],
                "line": "first_line",
                "primary_endpoint": "progression_free_survival",
                "secondary_endpoints": [
                    "overall_response_rate",
                    "clinical_benefit_rate",
                ],
                "target_enrollment": 120,
                "current_enrollment": 120,
                "sites": 25,
                "start_date": "2023-06-01",
                "estimated_completion": "2025-06-01",
                "sponsor": "Pfizer",
                "study_design": "single_arm",
                "eligibility_criteria": {
                    "gender": "female",
                    "menopausal_status": "postmenopausal",
                    "performance_status": "ECOG_0_1",
                },
            },
        },
        # Additional trials would be included here...
    ]

# Generate dataset
print("🗃️ Generating Clinical Trials Dataset")
print("-" * 40)

trial_dataset = create_comprehensive_trial_dataset()

print(f"✅ Generated dataset with {len(trial_dataset)} clinical trials")

# Show sample trial
if trial_dataset:
    sample_trial = trial_dataset[0]
    print(f"\n📋 Sample Trial: {sample_trial['metadata']['trial_id']}")
    print(f"   Phase: {sample_trial['metadata']['phase']}")
    print(f"   Status: {sample_trial['metadata']['status']}")
    print(f"   Cancer Type: {sample_trial['metadata']['cancer_type']}")
    print(f"   Sponsor: {sample_trial['metadata']['sponsor']}")

print("\n✅ Dataset ready for ingestion!")
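
A structural check before ingestion could look like the sketch below. The set of required keys is an assumption: it lists the metadata fields that the ingestion loop in this notebook reads directly, and `validate_trial` is a hypothetical helper rather than an MCODE Translator function.

```python
# Metadata fields that the ingestion loop reads (assumed to be the minimum required set)
REQUIRED_METADATA_KEYS = ("trial_id", "phase", "status", "cancer_type")


def validate_trial(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    content = record.get("content")
    if not isinstance(content, str) or not content.strip():
        problems.append("content is missing or empty")
    metadata = record.get("metadata")
    if not isinstance(metadata, dict):
        problems.append("metadata is missing or not a dict")
        return problems
    for key in REQUIRED_METADATA_KEYS:
        if not metadata.get(key):
            problems.append(f"metadata.{key} is missing")
    # Enrollment sanity check, applied only when both numbers are present
    target = metadata.get("target_enrollment")
    current = metadata.get("current_enrollment")
    if isinstance(target, int) and isinstance(current, int) and current > target:
        problems.append("current_enrollment exceeds target_enrollment")
    return problems
```

Running `[validate_trial(t) for t in trial_dataset]` before ingestion would surface incomplete records early instead of as `KeyError` exceptions inside the ingestion loop.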

### 📤 Intelligent Trial Data Ingestion



**What this does:**

- Processes clinical trials one record at a time

- Applies comprehensive metadata tracking

- Provides real-time ingestion progress

- Generates detailed statistics and analytics



**Why it's useful:**

- Enables efficient large-scale data ingestion

- Provides visibility into ingestion progress

- Ensures data quality and integrity

- Isolates failures so that one bad record does not stop the run

In [None]:
# Ingest clinical trial data with comprehensive tracking
print("📤 Ingesting Clinical Trial Data with Rich Metadata")
print("=" * 60)

ingestion_stats = {
    "total": 0,
    "successful": 0,
    "failed": 0,
    "by_phase": {},
    "by_cancer_type": {},
    "by_status": {},
}

print("🚀 Ingesting clinical trial records...")

for i, trial in enumerate(trial_dataset, 1):
    print(
        f"\n🧪 Processing Trial {i}/{len(trial_dataset)}: {trial['metadata']['trial_id']}"
    )

    try:
        # Ingest with comprehensive metadata
        result = client.ingest(
            message=trial["content"],
            space_id=trials_space_id,
            metadata=trial["metadata"],
        )

        print("   ✅ Ingested successfully")
        print("   💾 Saved to CORE Memory: Persistent storage enabled")
        print(f"   📝 Trial ID: {trial['metadata']['trial_id']}")
        print(f"   📊 Phase: {trial['metadata']['phase']}")
        print(f"   🏥 Cancer Type: {trial['metadata']['cancer_type']}")
        print(f"   📈 Status: {trial['metadata']['status']}")

        # Update statistics
        ingestion_stats["total"] += 1
        ingestion_stats["successful"] += 1

        # Track by phase
        phase = trial["metadata"]["phase"]
        ingestion_stats["by_phase"][phase] = (
            ingestion_stats["by_phase"].get(phase, 0) + 1
        )

        # Track by cancer type
        cancer_type = trial["metadata"]["cancer_type"]
        ingestion_stats["by_cancer_type"][cancer_type] = (
            ingestion_stats["by_cancer_type"].get(cancer_type, 0) + 1
        )

        # Track by status
        status = trial["metadata"]["status"]
        ingestion_stats["by_status"][status] = (
            ingestion_stats["by_status"].get(status, 0) + 1
        )

    except Exception as e:
        print(f"   ❌ Ingestion failed: {e}")
        ingestion_stats["total"] += 1
        ingestion_stats["failed"] += 1

print("\n📊 Clinical Trial Data Ingestion Summary:")
print(f"   Total trials: {ingestion_stats['total']}")
print(f"   Successful: {ingestion_stats['successful']}")
print(f"   Failed: {ingestion_stats['failed']}")
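# Guard against division by zero when the dataset is empty
success_rate = (
    ingestion_stats["successful"] / ingestion_stats["total"] * 100
    if ingestion_stats["total"]
    else 0.0
)
print(f"   Success rate: {success_rate:.1f}%")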

print("\n📈 Distribution Analysis:")
print("   📊 By Phase:")
for phase, count in ingestion_stats["by_phase"].items():
    print(f"      Phase {phase}: {count} trials")

print("   🏥 By Cancer Type:")
for cancer_type, count in ingestion_stats["by_cancer_type"].items():
    print(f"      {cancer_type.title()}: {count} trials")

print("   📈 By Status:")
for status, count in ingestion_stats["by_status"].items():
    print(f"      {status.replace('_', ' ').title()}: {count} trials")
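
The per-phase, per-cancer-type, and per-status counters above all follow the same pattern, so they can be expressed more compactly with `collections.Counter`. This is an alternative sketch, not how the notebook computes its statistics; `tally_by` and `success_rate` are illustrative names.

```python
from collections import Counter


def tally_by(records: list[dict], field: str) -> Counter:
    """Count records per value of a metadata field; missing values go under 'unknown'."""
    return Counter(r.get("metadata", {}).get(field, "unknown") for r in records)


def success_rate(successful: int, total: int) -> float:
    """Percentage of successful ingestions, defined as 0.0 when nothing was processed."""
    return successful / total * 100 if total else 0.0
```

For example, `tally_by(trial_dataset, "phase")` gives the phase distribution of the input dataset. Note that this counts records in the dataset, whereas the loop above counts only the records that were ingested successfully.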

## 🔍 Trial Search and Discovery



### 🔎 Advanced Trial Search



**What this does:**

- Demonstrates semantic search capabilities

- Shows different search scenarios and queries

- Provides relevance scoring and result ranking

- Enables discovery of specific trial cohorts



**Why it's useful:**

- Enables efficient trial landscape analysis

- Supports clinical research and patient matching

- Provides insights into trial populations

- Facilitates comparative analysis and studies

In [None]:
# Advanced trial search scenarios
print("🔍 Advanced Trial Search and Discovery")
print("=" * 60)

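# NOTE: "expected_count" is informational only and is not checked by the loop below.
# The values presumably assume the full dataset, not just the two sample trials above.
search_scenarios = [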
    {
        "name": "Immunotherapy Trials",
        "query": "immunotherapy melanoma trials",
        "description": "Find immunotherapy trials for melanoma patients",
        "expected_count": 1,
    },
    {
        "name": "Targeted Therapy Trials",
        "query": "targeted therapy EGFR lung cancer",
        "description": "Identify targeted therapy trials for lung cancer",
        "expected_count": 1,
    },
    {
        "name": "Phase III Trials",
        "query": "phase III clinical trials",
        "description": "Find Phase III confirmatory trials",
        "expected_count": 2,
    },
    {
        "name": "Recruiting Trials",
        "query": "actively recruiting clinical trials",
        "description": "Find trials currently accepting patients",
        "expected_count": 1,
    },
]

search_results = []

for scenario in search_scenarios:
    print(f"\n🔎 {scenario['name']}")
    print(f"   Description: {scenario['description']}")
    print(f"   Query: '{scenario['query']}'")

    try:
        results = client.search(
            query=scenario["query"], space_ids=[trials_space_id], limit=10
        )

        episodes = results.get("episodes", [])
        print(f"   ✅ Found {len(episodes)} matching trials")

        if episodes:
            print("\n   📋 Matching Trial Records:")
            for i, episode in enumerate(episodes, 1):
                full_content = episode.get("content", "")
                content = full_content[:120]
                score = episode.get("score", "N/A")
                metadata = episode.get("metadata") or {}

                print(f"\n   {i}. Trial {metadata.get('trial_id', 'Unknown')}")
                print(f"      Score: {score}")
                # Append an ellipsis only when the content was actually truncated
                print(f"      Details: {content}{'...' if len(full_content) > 120 else ''}")

                # Extract key clinical information
                if metadata:
                    print(f"      Phase: {metadata.get('phase', 'N/A')}")
                    print(f"      Status: {metadata.get('status', 'N/A')}")
                    print(f"      Cancer Type: {metadata.get('cancer_type', 'N/A')}")
                    print(f"      Sponsor: {metadata.get('sponsor', 'N/A')}")

        search_results.append(
            {
                "scenario": scenario["name"],
                "query": scenario["query"],
                "results_count": len(episodes),
                "episodes": episodes,
            }
        )

    except Exception as e:
        print(f"   ❌ Search failed: {e}")
        search_results.append(
            {"scenario": scenario["name"], "error": str(e), "results_count": 0}
        )

print("\n📊 Trial Search Summary:")
print(f"   Search scenarios: {len(search_scenarios)}")
print(f"   Total trials found: {sum(r['results_count'] for r in search_results)}")
print(
    f"   Average results per search: {sum(r['results_count'] for r in search_results)/len(search_scenarios):.1f}"
)
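
Each scenario carries an `expected_count`, but the loop above never compares it with what the search returned. One way to close that gap is sketched below. Treating "at least the expected number of results" as a pass is an assumption made here, since semantic search may legitimately return extra, loosely related trials; `compare_to_expected` is a hypothetical helper.

```python
def compare_to_expected(scenarios: list[dict], results: list[dict]) -> list[dict]:
    """Pair each scenario's expected_count with the count actually returned."""
    actual_by_name = {r["scenario"]: r.get("results_count", 0) for r in results}
    report = []
    for scenario in scenarios:
        expected = scenario.get("expected_count")
        actual = actual_by_name.get(scenario["name"], 0)
        report.append(
            {
                "scenario": scenario["name"],
                "expected": expected,
                "actual": actual,
                # Assumption: meeting or exceeding the expected count is a pass
                "meets_expectation": expected is not None and actual >= expected,
            }
        )
    return report
```

Calling `compare_to_expected(search_scenarios, search_results)` after the loop would yield one row per scenario, including scenarios whose search raised an error (those are recorded with a count of 0).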

## 🎯 Clinical Trials Demo Summary



### 📊 Results Summary



**Ingestion Results:**

- **Total Trials**: Number of trial records processed

- **Successful Ingestion**: Trials added to database

- **Failed Operations**: Trials with ingestion errors

- **Success Rate**: Overall ingestion success percentage



**Search Results:**

- **Search Scenarios**: Number of different search queries tested

- **Total Trials Found**: Cumulative trials discovered across searches

- **Average Results**: Mean trials found per search scenario

- **Query Effectiveness**: Relevance and precision of search results



### 🔍 Verification and Testing



**Verify Ingestion:**

- Search for ingested trials

- Check metadata preservation

- Validate search functionality

- Test data integrity



**Quality Assurance:**

- Data completeness validation

- Metadata accuracy checking

- Search result relevance

- Performance benchmarking
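
The "check metadata preservation" step can be made concrete by comparing the metadata that was sent with the metadata that a search returns. The sketch below assumes the returned episode exposes metadata as a plain dict, as the search cells in this notebook do; `metadata_diff` is a hypothetical helper.

```python
def metadata_diff(expected: dict, actual: dict) -> dict:
    """Map each key that is absent or changed to an (expected, actual) pair."""
    diff = {}
    for key, value in expected.items():
        if key not in actual:
            diff[key] = (value, "<absent>")
        elif actual[key] != value:
            diff[key] = (value, actual[key])
    return diff
```

An empty result means every field that was sent came back unchanged. Keys that appear only in the returned metadata are ignored, since a service may add fields of its own.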

In [None]:
# Quick verification and cleanup
print("🔍 Verifying Clinical Trial Data Ingestion")
print("=" * 40)

try:
    # Search for a sample trial
    sample_search = client.search(
        query="NCT04567892", space_ids=[trials_space_id], limit=1
    )
    
    episodes = sample_search.get("episodes", [])
    if episodes:
        print("✅ Sample trial found in database")
        metadata = episodes[0].get("metadata", {})
        print(f"   Trial ID: {metadata.get('trial_id')}")
        print(f"   Phase: {metadata.get('phase')}")
        print(f"   Status: {metadata.get('status')}")
    else:
        print("⚠️ Sample trial not found - may still be processing")
    
    # Get total count estimate
    broad_search = client.search(
        query="clinical trial", space_ids=[trials_space_id], limit=50
    )
    
    total_found = len(broad_search.get("episodes", []))
    print(f"\n📊 Database now contains approximately {total_found}+ clinical trials")
    
except Exception as e:
    print(f"⚠️ Verification failed: {e}")

# Cleanup
print("\n🧹 Cleaning up...")
try:
    client.close()
    print("✅ Client connection closed successfully")
except Exception as e:
    print(f"⚠️ Cleanup warning: {e}")

print("\n🎉 Clinical trials demo completed successfully!")
print("💡 Database is now populated with clinical trial data for research operations!")
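
The verification cell reports "may still be processing" when the sample trial is not found on the first attempt. If ingestion is asynchronous, a short polling loop is one way to handle that. This is a sketch under that assumption: `wait_for_trial` is a hypothetical helper, the attempt count and delay are arbitrary defaults, and the search itself is passed in as a callable so the helper does not depend on any particular client.

```python
import time
from typing import Callable


def wait_for_trial(
    search: Callable[[str], list],
    trial_id: str,
    attempts: int = 5,
    delay_seconds: float = 2.0,
) -> bool:
    """Poll `search` until it returns a non-empty result for trial_id or attempts run out."""
    for attempt in range(1, attempts + 1):
        if search(trial_id):
            return True
        if attempt < attempts:
            # Sleep between attempts, but not after the final one
            time.sleep(delay_seconds)
    return False
```

In this notebook, the callable could be `lambda q: client.search(query=q, space_ids=[trials_space_id], limit=1).get("episodes", [])`, and the helper would need to run before `client.close()`.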