# CERT SDK - Advanced Features

This notebook demonstrates advanced CERT SDK features:

1. **Custom models** not in the validated registry
2. **Domain-specific baselines** (Healthcare, Legal, Finance)
3. **Custom quality scoring** with domain keywords
4. **Baseline registration** for reuse

**When to use this:**
- Using models not in the validated registry (e.g., `gpt-4-turbo`, `llama-3`)
- Building domain-specific agentic systems (Healthcare, Legal, Finance)
- Needing baselines tailored to your specific use case

**Estimated time:** 5-10 minutes for full baseline measurement

## Setup and Imports

In [None]:
# Install CERT SDK if not already installed
# !pip install cert-sdk

import asyncio
import cert
from cert.models import ModelRegistry, ModelBaseline
from cert.providers import OpenAIProvider, GoogleProvider
from cert.providers.base import ProviderConfig
from cert.analysis.semantic import SemanticAnalyzer
from cert.analysis.quality import QualityScorer
from cert.core.metrics import (
    behavioral_consistency,
    empirical_performance_distribution,
)
import numpy as np

## Core Function: Measure Custom Baseline

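This function measures a complete baseline for any model in two steps: behavioral consistency (repeated responses to a single prompt) and the performance distribution (quality scores across all prompts). Domain-specific keywords are optional.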

In [None]:
async def measure_custom_baseline(
    provider,
    prompts,
    n_consistency_trials=20,
    domain_keywords=None
):
    """
    Measure complete custom baseline for any model.
    
    Args:
        provider: Initialized provider with any model
        prompts: List of domain-specific prompts for your use case
        n_consistency_trials: Number of trials for consistency (20+ recommended)
        domain_keywords: Optional set of domain-specific keywords for quality scoring
    
    Returns:
        Tuple of (consistency, mean_performance, std_performance)
    """
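    # Fail early: the consistency step needs a prompt and at least two responses
    if not prompts:
        raise ValueError("prompts must contain at least one prompt")
    if n_consistency_trials < 2:
        raise ValueError("n_consistency_trials must be at least 2 to compute pairwise distances")

    print("\n" + "="*70)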
    print("Custom Baseline Measurement")
    print("="*70)
    print(f"\nModel: {provider.config.model_name}")
    print(f"Consistency trials: {n_consistency_trials}")
    print(f"Performance prompts: {len(prompts)}")
    
    # Step 1: Measure Behavioral Consistency
    print("\n[1/2] Measuring behavioral consistency...")
    print(f"  Generating {n_consistency_trials} responses to same prompt...")
    
    consistency_prompt = prompts[0]  # Use first prompt
    responses = []
    
    for i in range(n_consistency_trials):
        response = await provider.generate_response(
            prompt=consistency_prompt,
            temperature=0.7,
        )
        responses.append(response)
        if (i + 1) % 5 == 0:
            print(f"    Progress: {i+1}/{n_consistency_trials}")
    
    # Calculate consistency
    analyzer = SemanticAnalyzer()
    distances = analyzer.pairwise_distances(responses)
    consistency = behavioral_consistency(distances)
    
    print(f"  ✓ Behavioral Consistency: C = {consistency:.3f}")
    
    # Step 2: Measure Performance Distribution
    print("\n[2/2] Measuring performance distribution...")
    print(f"  Generating and scoring responses for {len(prompts)} prompts...")
    
    # Create custom scorer if domain keywords provided
    if domain_keywords:
        scorer = QualityScorer(domain_keywords=domain_keywords)
        print(f"  Using {len(domain_keywords)} domain-specific keywords")
    else:
        scorer = QualityScorer()
        print("  Using default analytical keywords")
    
    quality_scores = []
    for i, prompt in enumerate(prompts):
        response = await provider.generate_response(
            prompt=prompt,
            temperature=0.7,
        )
        
        components = scorer.score(prompt, response)
        quality_scores.append(components.composite_score)
        
        print(f"    Prompt {i+1}/{len(prompts)}: Q = {components.composite_score:.3f}")
    
    # Calculate distribution
    mu, sigma = empirical_performance_distribution(np.array(quality_scores))
    
    print(f"  ✓ Performance: μ = {mu:.3f}, σ = {sigma:.3f}")
    
    # Summary
    print("\n" + "="*70)
    print("Custom Baseline Results")
    print("="*70)
    print(f"Consistency:   C = {consistency:.3f}")
    print(f"Mean:          μ = {mu:.3f}")
    print(f"Std Dev:       σ = {sigma:.3f}")
    print("="*70)
    
    return consistency, mu, sigma
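
The two loops in `measure_custom_baseline` issue one request at a time, which accounts for most of the runtime. The sketch below shows one way the requests could be issued with bounded concurrency. It assumes the provider's `generate_response` coroutine is safe to call concurrently, which this notebook does not confirm; `StubProvider`, `generate_many`, and `max_concurrency` are hypothetical names used only for illustration.

```python
import asyncio


class StubProvider:
    """Hypothetical stand-in exposing the same async method the notebook calls."""

    async def generate_response(self, prompt, temperature=0.7):
        await asyncio.sleep(0.01)  # simulate network latency
        return f"response to: {prompt}"


async def generate_many(provider, prompts, temperature=0.7, max_concurrency=5):
    """Call provider.generate_response for every prompt, at most max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with semaphore:
            return await provider.generate_response(
                prompt=prompt, temperature=temperature
            )

    # gather preserves input order, so results[i] belongs to prompts[i]
    return await asyncio.gather(*(one(p) for p in prompts))
```

In a notebook cell, the consistency step might then become `responses = await generate_many(provider, [consistency_prompt] * n_consistency_trials)`. Keep `max_concurrency` low enough to stay within your provider's rate limits.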

## Example 1: Healthcare Domain Custom Baseline

This example shows how to measure baselines for healthcare-specific tasks with custom prompts and domain keywords.

### Define Healthcare-Specific Prompts

In [None]:
healthcare_prompts = [
    "Analyze the key factors in patient care quality improvement.",
    "Evaluate the main considerations for clinical decision support systems.",
    "Assess the critical elements in healthcare data privacy and security.",
    "Identify the primary aspects of telemedicine implementation.",
    "Examine the essential components of medical staff coordination.",
    "Analyze the challenges in healthcare resource allocation.",
    "Evaluate diagnostic workflow optimization strategies.",
    "Assess patient safety protocols and risk management.",
    "Identify barriers to electronic health record adoption.",
    "Examine factors in healthcare cost reduction.",
]

print(f"Healthcare prompts defined: {len(healthcare_prompts)} prompts")

### Define Healthcare Domain Keywords

These keywords feed the quality scorer: responses that use domain-relevant terminology receive higher scores.

In [None]:
healthcare_keywords = {
    # Clinical terms
    "patient", "clinical", "diagnosis", "treatment", "care",
    "medical", "physician", "nurse", "provider", "practitioner",
    # Healthcare operations
    "hospital", "clinic", "facility", "healthcare", "health",
    "quality", "safety", "protocol", "procedure", "guideline",
    # Technology
    "ehr", "emr", "telemedicine", "telehealth", "digital",
    "system", "technology", "data", "record", "information",
    # Management
    "workflow", "process", "management", "coordination", "efficiency",
    "resource", "allocation", "optimization", "improvement",
    # Compliance
    "hipaa", "compliance", "privacy", "security", "regulation",
    "policy", "standard", "certification", "accreditation",
}

print(f"Healthcare keywords defined: {len(healthcare_keywords)} keywords")
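
To build intuition for how a keyword set can influence scoring, here is a toy coverage measure. It is not CERT's `QualityScorer` implementation, whose internals this notebook does not show; `keyword_hits` and `keyword_coverage` are hypothetical helpers, and the sample sentence is made up.

```python
import re


def keyword_hits(response, keywords):
    """Return the keywords that occur in the response as whole words or phrases."""
    text = response.lower()
    hits = set()
    for keyword in keywords:
        # \b anchors keep "care" from matching inside "career"
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        if re.search(pattern, text):
            hits.add(keyword)
    return hits


def keyword_coverage(response, keywords):
    """Fraction of the keyword set that appears at least once in the response."""
    if not keywords:
        return 0.0
    return len(keyword_hits(response, keywords)) / len(keywords)


sample_keywords = {"patient", "care", "hipaa", "telemedicine"}
sample_response = "Patient care improves when HIPAA rules are followed in her career."
print(sorted(keyword_hits(sample_response, sample_keywords)))
print(keyword_coverage(sample_response, sample_keywords))
```

Whole-word matching of this kind misses inflected forms such as "patients", so a real scorer may normalize words differently.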

### Configure Model and Provider

In [None]:
# Example: Using a model not in the registry
model_id = "gpt-4-turbo"  # Not in validated registry

# First, let's see what models are available
print("Available validated models:")
cert.print_models(detailed=False)

print("\n" + "="*70)
# Check if our model is validated
if ModelRegistry.is_validated(model_id):
    print(f"✓ {model_id} is in validated registry")
    cert.get_model_info(model_id)
else:
    print(f"⚠ {model_id} is NOT in validated registry")
    print("  We need to measure a custom baseline")

In [None]:
# Get API key
from getpass import getpass

api_key = getpass("Enter your OpenAI API key: ")
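
For unattended runs, prompting with `getpass` is inconvenient. A minimal sketch of an alternative reads the key from an environment variable first and only prompts as a fallback. `load_api_key` is a hypothetical helper; `OPENAI_API_KEY` is the variable name conventionally used by OpenAI tooling, and whether CERT reads it on its own is not stated here.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY", prompt="Enter your OpenAI API key: "):
    """Return the key from the environment if set, otherwise prompt for it."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    return getpass(prompt)
```

With this in place, the cell above could be written as `api_key = load_api_key()`.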

In [None]:
# Initialize provider
config = ProviderConfig(
    api_key=api_key,
    model_name=model_id,
    temperature=0.7,
    max_tokens=1024,
)

provider = OpenAIProvider(config)
print(f"✓ Provider initialized: {provider}")
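
A full measurement makes dozens of API calls, so a single transient failure can discard minutes of progress. The following is a sketch of a generic retry wrapper with exponential backoff. `with_retries` is a hypothetical helper, not part of CERT, and it catches `Exception` broadly because this notebook does not say which errors the provider raises.

```python
import asyncio
import random


async def with_retries(make_call, max_attempts=3, base_delay=1.0):
    """Await make_call(); on failure, back off exponentially and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries
            delay = base_delay * 2 ** (attempt - 1)
            delay += random.uniform(0, base_delay / 10)
            await asyncio.sleep(delay)
```

A call site might look like `await with_retries(lambda: provider.generate_response(prompt=p, temperature=0.7))`. Narrow the `except` clause once you know which exceptions are worth retrying.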

### Measure Healthcare-Specific Baseline

**Note:** This will take 5-10 minutes for full measurement with 20 consistency trials.

For quick testing, you can reduce `n_consistency_trials` to 10.
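
The runtime follows from the number of API calls: `measure_custom_baseline` makes one call per consistency trial plus one call per performance prompt. The arithmetic below uses the 10 healthcare prompts defined earlier; the helper names are illustrative only.

```python
def total_api_calls(n_consistency_trials, n_prompts):
    """One call per consistency trial plus one call per performance prompt."""
    return n_consistency_trials + n_prompts


def implied_seconds_per_call(total_minutes, n_calls):
    """Average seconds per call implied by a total runtime estimate."""
    return total_minutes * 60 / n_calls


calls_full = total_api_calls(20, 10)   # 20 trials + 10 prompts
calls_quick = total_api_calls(10, 10)  # 10 trials + 10 prompts
print(calls_full, calls_quick)
print(implied_seconds_per_call(5, calls_full), implied_seconds_per_call(10, calls_full))
```

The 5-10 minute estimate for 30 calls therefore corresponds to roughly 10-20 seconds per call, and cutting the trials to 10 removes a third of the calls.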

In [None]:
# Measure baseline
consistency, mu, sigma = await measure_custom_baseline(
    provider=provider,
    prompts=healthcare_prompts,
    n_consistency_trials=20,  # Reduce to 10 for faster testing
    domain_keywords=healthcare_keywords,
)

### Register Custom Baseline

Save your measured baseline in the registry for future use.

In [None]:
custom_baseline = ModelRegistry.register_custom_baseline(
    model_id=model_id,
    provider="openai",
    model_family=f"{model_id} (Healthcare)",
    consistency=consistency,
    mean_performance=mu,
    std_performance=sigma,
)

print(f"\n✓ Registered: {custom_baseline}")
print("\nThis baseline is now available for use:")
print(f"  ModelRegistry.get_model('{model_id}')")
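
This notebook does not say whether a registered baseline survives a kernel restart. If it does not, one option is to keep the measured values in a small JSON file and re-register them at startup. The sketch below assumes exactly that; `save_baseline`, `load_baseline`, and the file layout are hypothetical and not part of CERT.

```python
import json
from pathlib import Path


def save_baseline(path, model_id, provider, model_family,
                  consistency, mean_performance, std_performance):
    """Write the measured baseline to a JSON file and return the stored record."""
    record = {
        "model_id": model_id,
        "provider": provider,
        "model_family": model_family,
        # float() converts NumPy scalars into plain JSON-serializable numbers
        "consistency": float(consistency),
        "mean_performance": float(mean_performance),
        "std_performance": float(std_performance),
    }
    Path(path).write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record


def load_baseline(path):
    """Read a baseline record previously written by save_baseline."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Because the keys match the keyword arguments used above, a loaded record could be passed back as `ModelRegistry.register_custom_baseline(**load_baseline(path))`.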

### Compare to Paper Baselines

In [None]:
# Compare to similar validated model if available
gpt4o_baseline = ModelRegistry.get_model("gpt-4o")
if gpt4o_baseline:
    print("\n" + "="*70)
    print("Comparison to Paper Baselines")
    print("="*70)
    
    print("\nGPT-4o (from paper - general analytical):")
    print(f"  C = {gpt4o_baseline.consistency:.3f}")
    print(f"  μ = {gpt4o_baseline.mean_performance:.3f}")
    print(f"  σ = {gpt4o_baseline.std_performance:.3f}")
    
    print(f"\n{model_id} (Healthcare custom):")
    print(f"  C = {consistency:.3f} ({consistency - gpt4o_baseline.consistency:+.3f})")
    print(f"  μ = {mu:.3f} ({mu - gpt4o_baseline.mean_performance:+.3f})")
    print(f"  σ = {sigma:.3f} ({sigma - gpt4o_baseline.std_performance:+.3f})")
    
    print("\nNote: Differences expected due to:")
    print("  - Different model version")
    print("  - Different domain (Healthcare vs General Analytical)")
    print("  - Custom keyword scoring")

## Example 2: Legal Domain

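A quick example of a Legal domain configuration. It uses five prompts for brevity; the best practices at the end of this notebook recommend 15+ for a real measurement.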

In [None]:
# Legal-specific prompts
legal_prompts = [
    "Analyze the key factors in contract negotiation strategy.",
    "Evaluate risk management in corporate governance.",
    "Assess compliance requirements for data protection.",
    "Identify critical elements in intellectual property protection.",
    "Examine due diligence processes in mergers and acquisitions.",
]

# Legal-specific keywords
legal_keywords = {
    "legal", "law", "regulation", "compliance", "contract",
    "agreement", "litigation", "court", "judge", "attorney",
    "liability", "obligation", "rights", "statute", "jurisdiction",
    "evidence", "precedent", "case", "ruling", "counsel",
    "due diligence", "intellectual property", "patent", "copyright",
    "governance", "regulatory", "policy", "framework", "standard",
}

print("Legal Domain Configuration:")
print(f"  - {len(legal_prompts)} legal-specific prompts")
print(f"  - {len(legal_keywords)} legal keywords")
print("\nTo measure:")
print("  consistency, mu, sigma = await measure_custom_baseline(")
print("      provider=your_provider,")
print("      prompts=legal_prompts,")
print("      n_consistency_trials=20,")
print("      domain_keywords=legal_keywords,")
print("  )")

## Example 3: Finance Domain

In [None]:
# Finance-specific prompts
finance_prompts = [
    "Analyze key factors in portfolio risk management.",
    "Evaluate strategies for market volatility assessment.",
    "Assess critical elements in financial forecasting.",
    "Identify primary aspects of investment diversification.",
    "Examine components of credit risk evaluation.",
]

# Finance-specific keywords
finance_keywords = {
    "finance", "financial", "investment", "portfolio", "risk",
    "return", "capital", "asset", "liability", "equity",
    "market", "trading", "valuation", "pricing", "volatility",
    "liquidity", "credit", "debt", "bond", "stock",
    "diversification", "hedge", "derivative", "option", "futures",
    "compliance", "regulatory", "audit", "disclosure", "reporting",
}

print("Finance Domain Configuration:")
print(f"  - {len(finance_prompts)} finance-specific prompts")
print(f"  - {len(finance_keywords)} finance keywords")
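
Before measuring with several domain sets, it can help to audit them. The sketch below checks a set's size against the 30-50 guideline from the best practices, lists multi-word entries, and finds terms shared between two domains. Both helpers are hypothetical. Multi-word entries only matter if the scorer matches single tokens, which is an assumption this notebook does not confirm.

```python
def audit_keywords(keywords, min_size=30, max_size=50):
    """Summarize a keyword set: size, range check, and multi-word entries."""
    return {
        "size": len(keywords),
        "in_recommended_range": min_size <= len(keywords) <= max_size,
        "multi_word": sorted(k for k in keywords if " " in k),
    }


def shared_keywords(first, second):
    """Keywords that appear in both domain sets, sorted alphabetically."""
    return sorted(set(first) & set(second))


# Small samples taken from the legal and finance sets above
legal_sample = {"compliance", "regulatory", "contract", "due diligence"}
finance_sample = {"compliance", "regulatory", "portfolio"}
print(audit_keywords(legal_sample))
print(shared_keywords(legal_sample, finance_sample))
```

Run against the full sets, this would report 29 entries for `legal_keywords`, just under the recommended range, and three terms shared with `finance_keywords`: "compliance", "liability", and "regulatory".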

## Custom Quality Scoring Configuration

You can also customize the quality scoring weights for different task types, such as creative writing or technical documentation.

In [None]:
print("Quality Scoring Configuration Options")
print("="*70)

print("\nDefault (from paper - analytical tasks):")
print("  Semantic Relevance:    30%")
print("  Linguistic Coherence:  30%")
print("  Content Density:       40%")
print("  scorer = QualityScorer()")

print("\nCreative writing:")
print("  Semantic Relevance:    20% (less critical)")
print("  Linguistic Coherence:  50% (very important)")
print("  Content Density:       30% (moderate)")
print("  scorer = QualityScorer(")
print("      semantic_weight=0.2,")
print("      coherence_weight=0.5,")
print("      density_weight=0.3,")
print("  )")

print("\nTechnical documentation:")
print("  Semantic Relevance:    40% (very important)")
print("  Linguistic Coherence:  20% (less critical)")
print("  Content Density:       40% (very important)")
print("  scorer = QualityScorer(")
print("      semantic_weight=0.4,")
print("      coherence_weight=0.2,")
print("      density_weight=0.4,")
print("  )")
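
The percentages above suggest that the composite score is a weighted sum of the three components, although the exact formula is not shown in this notebook. Under that assumption, the effect of each weight profile can be sketched as follows. `composite_score` is a hypothetical function, and the component values are made up for illustration.

```python
import math


def composite_score(semantic, coherence, density,
                    semantic_weight=0.3, coherence_weight=0.3, density_weight=0.4):
    """Weighted sum of three component scores, each assumed to lie in [0, 1]."""
    total = semantic_weight + coherence_weight + density_weight
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights must sum to 1.0, got {total}")
    return (semantic_weight * semantic
            + coherence_weight * coherence
            + density_weight * density)


# Made-up component scores for one response
components = {"semantic": 0.8, "coherence": 0.6, "density": 0.5}

default = composite_score(**components)
creative = composite_score(**components, semantic_weight=0.2,
                           coherence_weight=0.5, density_weight=0.3)
technical = composite_score(**components, semantic_weight=0.4,
                            coherence_weight=0.2, density_weight=0.4)
print(round(default, 2), round(creative, 2), round(technical, 2))
```

The same response ranks differently under each profile, which is why the weights should be fixed before measuring a baseline and kept unchanged for later comparisons.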

## Summary: When to Use Advanced Features

### Use Custom Baselines When:
1. **Model not in registry** - New models, fine-tuned models, or specific versions
2. **Domain-specific tasks** - Healthcare, Legal, and Finance tasks call for different evaluation criteria
3. **Custom prompts** - Your use case has unique prompt patterns
4. **Quality criteria** - Default quality scoring doesn't match your needs

### Best Practices:
1. **Consistency trials:** Use 20+ for reliable measurements (paper standard)
2. **Performance prompts:** Use 15+ diverse prompts (paper standard)
3. **Domain keywords:** Include 30-50 relevant terms
4. **Register baselines:** Save for reuse and team sharing
5. **Compare to paper:** Use validated models as reference points
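
The prompt-count guideline can be motivated by the standard error of the mean, which shrinks with the square root of the sample size. The sketch below uses only the standard library. The function names are illustrative, and the value of `sigma` is a placeholder rather than a measured result.

```python
import math
import statistics


def standard_error(scores):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    return statistics.stdev(scores) / math.sqrt(len(scores))


def se_for_sample_size(sigma, n):
    """Expected standard error of the mean for n scores with spread sigma."""
    return sigma / math.sqrt(n)


sigma = 0.1  # placeholder spread, not a measured value
for n in (5, 10, 15, 30):
    print(n, round(se_for_sample_size(sigma, n), 4))
```

Going from 5 to 15 prompts cuts the uncertainty in μ by a factor of about 1.7 (the square root of 3), while further gains require disproportionately more prompts.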

### Production Workflow:
1. Measure custom baseline during development
2. Register in your model registry
3. Use for ongoing monitoring and comparison
4. Re-measure when model versions change
5. Consider contributing validated baselines back to CERT
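
For the monitoring step, one simple approach is to flag quality scores that fall outside a band around the registered baseline mean. The sketch below assumes such a band of `k` standard deviations; the helpers and the default `k=2.0` are hypothetical choices rather than a CERT recommendation, and the numbers are illustrative.

```python
def is_within_baseline(score, mean_performance, std_performance, k=2.0):
    """True if score lies within k standard deviations of the baseline mean."""
    return abs(score - mean_performance) <= k * std_performance


def flag_outliers(scores, mean_performance, std_performance, k=2.0):
    """Return the scores that fall outside the baseline band, in input order."""
    return [
        s for s in scores
        if not is_within_baseline(s, mean_performance, std_performance, k)
    ]


# Illustrative baseline and production scores
baseline_mu, baseline_sigma = 0.60, 0.05
recent_scores = [0.62, 0.45, 0.71, 0.58]
print(flag_outliers(recent_scores, baseline_mu, baseline_sigma))
```

A sustained run of flagged scores is a reasonable trigger for re-measuring the baseline, as in step 4 above.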