# CERT SDK - Basic Usage with Validated Models

This notebook demonstrates the recommended workflow for using CERT SDK:

1. **Select a validated model** from the registry
2. **Initialize provider** with your API key
3. **Measure behavioral consistency** to check agent predictability
4. **Measure performance distribution** to understand output quality
5. **Review the predicted coordination effect** and compare your results to the validated baselines from the paper

**Estimated time:** 2-3 minutes

**What you'll need:**
- API key for one of the validated models (OpenAI, Google, xAI, or Anthropic)
- Python 3.9+

## Setup and Imports

In [None]:
# Install CERT SDK if not already installed
# !pip install cert-sdk

import asyncio
import cert
from cert.models import ModelRegistry
from cert.providers import OpenAIProvider, GoogleProvider, XAIProvider
from cert.providers.base import ProviderConfig
from cert.analysis.semantic import SemanticAnalyzer
from cert.analysis.quality import QualityScorer
from cert.core.metrics import (
    behavioral_consistency,
    empirical_performance_distribution,
)
import numpy as np

## Step 1: Browse Available Validated Models

CERT SDK includes pre-validated baselines from the paper for these models. Let's see what's available:

In [None]:
# Use the convenient utils function to display all validated models
cert.print_models()

# You can also filter by provider:
# cert.print_models(provider="openai")

# Or get detailed info about a specific model:
# cert.get_model_info("gpt-4o")

## Step 2: Select Your Model and Configure Provider

Choose a model you have API access to and configure the provider:

In [None]:
# Select a model from the registry
# Options: "gpt-4o", "gpt-4o-mini", "grok-3", "gemini-3.5-pro", "claude-3-5-haiku-20241022"
selected_model_id = "gpt-4o"  # Change this to your model

# Get baseline from registry
model_baseline = ModelRegistry.get_model(selected_model_id)

print(f"✓ Selected: {model_baseline.model_family} ({model_baseline.model_id})")
print(f"  Using validated baseline from paper:")
print(f"  C={model_baseline.consistency:.3f}, μ={model_baseline.mean_performance:.3f}, σ={model_baseline.std_performance:.3f}")
print()

In [None]:
# Enter your API key (it will be hidden in Jupyter)
from getpass import getpass

api_key = getpass(f"Enter your {model_baseline.provider} API key: ")

In [None]:
# Initialize provider
config = ProviderConfig(
    api_key=api_key,
    model_name=model_baseline.model_id,
    temperature=0.7,
    max_tokens=1024,
)

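# Map provider name to class.
# Only the provider classes imported above are mapped here; a model from any
# other provider needs its provider class added to this map first.
provider_map = {
    "openai": OpenAIProvider,
    "google": GoogleProvider,
    "xai": XAIProvider,
}

# Fail with a readable message instead of a bare KeyError
if model_baseline.provider not in provider_map:
    raise ValueError(
        f"No provider class mapped for '{model_baseline.provider}'. "
        f"Mapped providers: {sorted(provider_map)}"
    )

ProviderClass = provider_map[model_baseline.provider]
provider = ProviderClass(config)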

print(f"✓ Provider initialized: {provider}")

## Step 3: Measure Behavioral Consistency

**What is it?** Behavioral consistency measures how predictable your agent is when given the same prompt multiple times.

**Why it matters:** An agent whose answers vary widely for the same input is hard to test, monitor, and rely on in production.

**How it works:** We generate multiple responses to the same prompt and measure semantic distances between them. Lower distance = higher consistency.
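
As a rough illustration of the distance step, the sketch below computes pairwise cosine distances between toy embedding vectors with NumPy. The helper name `pairwise_cosine_distances` and the vectors are hypothetical; `SemanticAnalyzer` may use a different embedding model or distance, and the exact formula inside `behavioral_consistency` is not reproduced here.

```python
import numpy as np

def pairwise_cosine_distances(embeddings: np.ndarray) -> np.ndarray:
    """Return the cosine distance for every unordered pair of rows."""
    # Normalize rows to unit length so the dot product equals cosine similarity
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T
    # Keep the upper triangle only (i < j) so each pair is counted once
    i, j = np.triu_indices(len(embeddings), k=1)
    return 1.0 - similarity[i, j]

# Toy "embeddings": three near-identical responses vs. three scattered ones
tight = np.array([[1.0, 0.0], [0.99, 0.05], [0.98, 0.1]])
spread = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])

tight_d = pairwise_cosine_distances(tight)
spread_d = pairwise_cosine_distances(spread)
print(f"mean distance (tight):  {tight_d.mean():.3f}")
print(f"mean distance (spread): {spread_d.mean():.3f}")
```

The tightly clustered set yields a much smaller mean distance, which is the situation the consistency metric rewards.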

In [None]:
async def measure_consistency(provider, model_baseline, n_trials=10):
    print(f"\n{'='*70}")
    print(f"Measuring Behavioral Consistency")
    print(f"{'='*70}")
    
    # Standard prompt from paper's experiments
    prompt = "Analyze the key factors in effective business strategy implementation."
    
    print(f"\nGenerating {n_trials} responses to measure consistency...")
    print(f"Prompt: '{prompt}'")
    
    # Generate responses
    responses = []
    for i in range(n_trials):
        response = await provider.generate_response(
            prompt=prompt,
            temperature=0.7,
        )
        responses.append(response)
        print(f"  Response {i+1}/{n_trials} generated ({len(response)} chars)")
    
    # Calculate semantic distances
    print(f"\nCalculating semantic distances...")
    analyzer = SemanticAnalyzer()
    distances = analyzer.pairwise_distances(responses)
    
    # Calculate consistency
    consistency = behavioral_consistency(distances)
    
    # Compare to paper baseline
    print(f"\n{'='*70}")
    print("Results:")
    print(f"{'='*70}")
    print(f"Measured Consistency: C = {consistency:.3f}")
    print(f"Paper Baseline:       C = {model_baseline.consistency:.3f}")
    
    diff = consistency - model_baseline.consistency
    if abs(diff) < 0.05:
        status = "✓ Within expected range"
    elif diff > 0:
        status = "↑ Higher than baseline (more consistent)"
    else:
        status = "↓ Lower than baseline (less consistent)"
    
    print(f"Difference:           {diff:+.3f} ({status})")
    
    return consistency

# Run the measurement
consistency = await measure_consistency(provider, model_baseline, n_trials=10)

## Step 4: Measure Performance Distribution

**What is it?** Performance distribution measures the quality of your agent's outputs across different prompts.

**Why it matters:** Understanding mean (μ) and variability (σ) helps predict production behavior.

**How it works:** We use multidimensional quality scoring (semantic relevance, linguistic coherence, content density) across multiple prompts.
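
To make μ and σ concrete, the sketch below computes them from a handful of made-up composite scores with NumPy. The scores are illustrative, not measurements, and whether `empirical_performance_distribution` uses the population or the sample standard deviation is not confirmed here, so both are shown.

```python
import numpy as np

# Hypothetical composite quality scores, one per prompt
toy_scores = np.array([0.62, 0.58, 0.71, 0.66, 0.60])

toy_mu = toy_scores.mean()
toy_sigma_population = toy_scores.std(ddof=0)  # divides by n
toy_sigma_sample = toy_scores.std(ddof=1)      # divides by n - 1

print(f"mean = {toy_mu:.3f}")
print(f"std (population) = {toy_sigma_population:.3f}")
print(f"std (sample)     = {toy_sigma_sample:.3f}")
```

With only five scores the two standard deviations differ noticeably, which is one reason small runs can drift from the baseline σ.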

In [None]:
async def measure_performance(provider, model_baseline, n_trials=5):
    print(f"\n{'='*70}")
    print(f"Measuring Performance Distribution")
    print(f"{'='*70}")
    
    # Standard prompts from paper's experiments
    prompts = [
        "Analyze the key factors in business strategy.",
        "Evaluate the main considerations for project management.",
        "Assess the critical elements in organizational change.",
        "Identify the primary aspects of market analysis.",
        "Examine the essential components of risk assessment.",
    ]
    
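    # Only len(prompts) prompts are defined, so a larger n_trials is capped
    selected_prompts = prompts[:n_trials]
    print(f"\nGenerating responses for {len(selected_prompts)} prompts...")
    
    # Generate and score responses
    scorer = QualityScorer()
    quality_scores = []
    
    for i, prompt in enumerate(selected_prompts):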
        response = await provider.generate_response(
            prompt=prompt,
            temperature=0.7,
        )
        
        # Score using paper's quality metrics
        components = scorer.score(prompt, response)
        quality_scores.append(components.composite_score)
        
        print(f"  Prompt {i+1}: Q = {components.composite_score:.3f}")
        print(f"    (semantic: {components.semantic_relevance:.3f}, "
              f"coherence: {components.linguistic_coherence:.3f}, "
              f"density: {components.content_density:.3f})")
    
    # Calculate distribution
    mu, sigma = empirical_performance_distribution(np.array(quality_scores))
    
    # Compare to paper baseline
    print(f"\n{'='*70}")
    print("Results:")
    print(f"{'='*70}")
    print(f"Measured Performance: μ = {mu:.3f}, σ = {sigma:.3f}")
    print(f"Paper Baseline:       μ = {model_baseline.mean_performance:.3f}, σ = {model_baseline.std_performance:.3f}")
    
    mu_diff = mu - model_baseline.mean_performance
    print(f"Mean difference:      {mu_diff:+.3f}")
    
    return mu, sigma

# Run the measurement
mu, sigma = await measure_performance(provider, model_baseline, n_trials=5)

## Step 5: Coordination Effect Prediction

**What is it?** Coordination effect (γ) predicts how much agents improve when working together vs independently.

**Why it matters:** It tells you whether adding more agents actually helps or just adds latency.

**How to interpret:**
- γ > 1: Agents coordinate well (synergy)
- γ = 1: No benefit from coordination
- γ < 1: Agents interfere with each other
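
These three cases can be written as a small helper. The function name `interpret_gamma` and the tolerance `tol` are hypothetical and not part of CERT SDK; the tolerance exists only because a measured γ will rarely equal 1 exactly.

```python
def interpret_gamma(gamma: float, tol: float = 1e-9) -> str:
    """Map a coordination effect value to the interpretation listed above."""
    if gamma > 1 + tol:
        return "synergy"
    if gamma < 1 - tol:
        return "interference"
    return "no benefit"

for value in (1.25, 1.0, 0.8):
    print(f"gamma = {value:.2f}: {interpret_gamma(value)}")
```

In practice you may want a wider tolerance so that values very close to 1 are not over-interpreted.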

In [None]:
print(f"\n{'='*70}")
print(f"Coordination Effect Prediction (from Paper)")
print(f"{'='*70}")

if model_baseline.coordination_2agent:
    print(f"\nValidated 2-agent coordination effect from paper:")
    print(f"  γ = {model_baseline.coordination_2agent:.3f}")
    
    # Expected 2-agent performance as computed here: the product of the two
    # agents' independent performance (μ × μ), scaled by γ
    independent_perf = model_baseline.mean_performance
    coordinated_perf = independent_perf * independent_perf * model_baseline.coordination_2agent
    
    print(f"\nPrediction for 2-agent sequential pipeline:")
    print(f"  Independent performance: {independent_perf:.3f}")
    print(f"  Expected coordinated:    {coordinated_perf:.3f}")
    print(f"  Improvement:             {(coordinated_perf/independent_perf - 1)*100:+.1f}%")
else:
    print("\n⚠ 2-agent coordination baseline not available for this model.")
    print("  You can measure it using coordination experiments.")

## Summary

Here's a complete comparison of your measurements vs the paper baselines:

In [None]:
print(f"\n{'='*70}")
print("Summary")
print(f"{'='*70}")
print(f"\nModel: {model_baseline.model_family} ({model_baseline.model_id})")
print(f"\nYour Measurements:")
print(f"  Consistency:   C = {consistency:.3f}")
print(f"  Performance:   μ = {mu:.3f}, σ = {sigma:.3f}")
print(f"\nPaper Baselines:")
print(f"  Consistency:   C = {model_baseline.consistency:.3f}")
print(f"  Performance:   μ = {model_baseline.mean_performance:.3f}, σ = {model_baseline.std_performance:.3f}")

if model_baseline.coordination_2agent:
    print(f"  2-agent γ:     {model_baseline.coordination_2agent:.3f}")

print(f"\n{'='*70}")
print("✓ Basic measurements complete!")
print(f"{'='*70}")
print("\nNext steps:")
print("  - Run more trials for statistical significance (20+ recommended)")
print("  - Measure coordination effects with multi-agent pipelines")
print("  - See advanced_usage.ipynb for custom models and domain-specific tasks")
print("  - See examples/two_agent_coordination.ipynb for coordination measurement")

## Interpreting Your Results

### Behavioral Consistency (C)
- **C > 0.85**: Highly consistent - safe for production
- **0.7 ≤ C ≤ 0.85**: Moderately consistent - acceptable with monitoring
- **C < 0.7**: Inconsistent - needs prompt engineering or a different model
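
The bands above can be applied programmatically with a helper like the one below. The function name `consistency_band` is hypothetical, and placing values of exactly 0.7 or 0.85 in the moderate band is an assumption.

```python
def consistency_band(c: float) -> str:
    """Classify a consistency score using the thresholds listed above."""
    if c > 0.85:
        return "high"
    if c >= 0.7:
        return "moderate"
    return "low"

for score in (0.91, 0.78, 0.64):
    print(f"C = {score:.2f}: {consistency_band(score)}")
```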

### Performance (μ, σ)
- **μ**: Higher mean = better quality outputs
- **σ**: Lower std dev = more predictable quality

### What if my results differ from baseline?
- Small differences (within ±0.05) are expected from sampling variation, especially with only 5 to 10 trials
- Larger differences may indicate:
  - Different prompt distributions
  - Model version changes
  - Temperature/parameter differences
  - Domain-specific behavior
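
A minimal sketch of this check, reusing the 0.05 tolerance from the consistency measurement cell. The function name `compare_to_baseline` is hypothetical, and the numbers in the usage lines are made up for illustration.

```python
def compare_to_baseline(measured: float, baseline: float, tolerance: float = 0.05) -> str:
    """Describe how a measured metric relates to its paper baseline."""
    diff = measured - baseline
    if abs(diff) < tolerance:
        return "within expected range"
    return "above baseline" if diff > 0 else "below baseline"

print(compare_to_baseline(0.82, 0.80))
print(compare_to_baseline(0.70, 0.80))
```

A result outside the tolerance is a prompt to check the causes listed above, not a verdict on the model.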

### Production Recommendations
1. **High consistency + High performance**: Deploy with confidence
2. **High consistency + Low performance**: Consider prompt engineering
3. **Low consistency**: Investigate before production deployment