# CERT SDK - Basic Usage with Validated Models

This notebook demonstrates the recommended workflow for using the CERT SDK:

1. **Select a validated model** from the registry
2. **Initialize a provider** with your API key
3. **Measure behavioral consistency** to check output predictability
4. **Measure performance distribution** to understand output quality
5. **Predict the context propagation effect** (γ) from the validated baseline
6. **Compare results** to validated baselines from the paper

**Estimated time:** 2-3 minutes

**What you'll need:**
- API key for one of the validated models (OpenAI, Google, xAI, or Anthropic)
- Python 3.9+

## Setup and Imports

In [None]:
# Install CERT SDK
# Option 1: From PyPI (when available)
# !pip install cert-sdk

# Option 2: Directly from GitHub repository (development version)
# !pip install git+https://github.com/Javihaus/CERT.git

import asyncio
import cert

## Step 1: Browse Available Validated Models

CERT SDK includes pre-validated baselines from the paper for a fixed set of models. List what's available:

In [None]:
# Use the convenient utils function to display all validated models
cert.print_models()

# You can also filter by provider:
# cert.print_models(provider="openai")

# Or get detailed info about a specific model:
# cert.get_model_info("gpt-4o")

## Step 2: Select Your Model and Configure Provider

Choose a model you have API access to and configure the provider:

In [None]:
# Select a model from the registry
# Options: "gpt-4o", "gpt-4o-mini", "grok-3", "gemini-3.5-pro", "claude-3-5-haiku-20241022"
selected_model_id = "gpt-4o"  # Change this to your model

# Get baseline from registry
model_baseline = cert.ModelRegistry.get_model(selected_model_id)

print(f"✓ Selected: {model_baseline.model_family} ({model_baseline.model_id})")
print(f"  Using validated baseline from paper:")
print(f"  C={model_baseline.consistency:.3f}, μ={model_baseline.mean_performance:.3f}, σ={model_baseline.std_performance:.3f}")
print()

In [None]:
# Enter your API key (it will be hidden in Jupyter)
from getpass import getpass

api_key = getpass("Enter your API key: ")

In [None]:
# Initialize provider - simple and direct!
provider = cert.create_provider(
    api_key=api_key,
    model_name=selected_model_id,
    temperature=0.7,
    max_tokens=1024,
)

print(f"✓ Provider initialized: {provider}")

## Step 3: Measure Behavioral Consistency

**What is it?** Behavioral consistency measures how predictable a model is when given the same prompt multiple times.

**Why it matters:** A model that answers the same prompt very differently from run to run is hard to test and monitor in production.

**How it works:** We generate multiple responses to the same prompt and measure the semantic distances between them. The less those distances vary relative to their mean, the higher the consistency.

**Formula:** `C = 1 - (σ(d) / μ(d))`, where `d` is the set of semantic distances between responses, `σ(d)` is its standard deviation, and `μ(d)` is its mean
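
The formula can be checked by hand on a few toy vectors. The sketch below is not the SDK's implementation: the embedding vectors are made up, cosine distance is assumed as the semantic distance, and the population standard deviation is assumed for σ(d).

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); assumed here as the semantic distance."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def consistency_from_distances(distances):
    """C = 1 - sigma(d) / mu(d) over pairwise distances."""
    mu_d = mean(distances)
    if mu_d == 0:
        # All responses identical; treated as perfectly consistent (assumption).
        return 1.0
    return 1.0 - pstdev(distances) / mu_d


# Hypothetical embeddings of three responses to the same prompt.
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
distances = [cosine_distance(u, v) for u, v in combinations(embeddings, 2)]
print(f"C = {consistency_from_distances(distances):.3f}")
```

Note that with `n_trials=10` there are 45 response pairs, so `d` contains 45 distances under this pairwise reading of the formula.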

In [None]:
# Measure consistency - simple one-line call!
consistency = await cert.measure_consistency(
    provider=provider,
    n_trials=10,
    baseline=model_baseline,
)

## Step 4: Measure Performance Distribution

**What is it?** Performance distribution measures the quality of model outputs across different prompts.

**Why it matters:** Understanding mean (μ) and variability (σ) helps predict production behavior.

**How it works:** We use multidimensional quality scoring (semantic relevance, linguistic coherence, content density) across multiple prompts.
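
As a rough illustration of how per-prompt scores turn into μ and σ, the sketch below combines three sub-scores into one quality score per prompt and then summarises the scores. The sub-score values, the equal weighting, and the use of the sample standard deviation are assumptions for illustration; the SDK's actual scoring may differ.

```python
from statistics import mean, stdev


def quality_score(relevance, coherence, density, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted combination of the three quality dimensions (equal weights assumed)."""
    w_rel, w_coh, w_den = weights
    return w_rel * relevance + w_coh * coherence + w_den * density


def performance_distribution(scores):
    """Return (mean, sample standard deviation) of per-prompt quality scores."""
    return mean(scores), stdev(scores)


# Hypothetical (relevance, coherence, density) sub-scores for four prompts.
per_prompt = [(0.9, 0.8, 0.6), (0.7, 0.9, 0.5), (0.8, 0.7, 0.7), (0.6, 0.8, 0.4)]
scores = [quality_score(*sub_scores) for sub_scores in per_prompt]
mu_demo, sigma_demo = performance_distribution(scores)
print(f"μ = {mu_demo:.3f}, σ = {sigma_demo:.3f}")
```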

In [None]:
# Measure performance - simple one-line call!
mu, sigma = await cert.measure_performance(
    provider=provider,
    baseline=model_baseline,
)
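
The `await` calls in Steps 3 and 4 run at the top level, which works in Jupyter but not in a plain Python script. Outside a notebook, the calls would need to be wrapped in a coroutine and driven with `asyncio.run`. In the sketch below, `fake_measure_consistency` and `fake_measure_performance` are stand-ins with made-up return values so the example runs without an API key; in real use they would be replaced by the `cert.measure_consistency` and `cert.measure_performance` calls shown above.

```python
import asyncio


async def fake_measure_consistency():
    """Stand-in for cert.measure_consistency(...); returns a made-up C."""
    await asyncio.sleep(0)
    return 0.85


async def fake_measure_performance():
    """Stand-in for cert.measure_performance(...); returns made-up (mu, sigma)."""
    await asyncio.sleep(0)
    return 0.60, 0.10


async def main():
    # Run the two measurements one after the other, as in the notebook.
    c_value = await fake_measure_consistency()
    mu_value, sigma_value = await fake_measure_performance()
    return c_value, mu_value, sigma_value


if __name__ == "__main__":
    # asyncio.run starts an event loop; do not call it from inside Jupyter,
    # where a loop is already running.
    print(asyncio.run(main()))
```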

## Step 5: Context Propagation Effect Prediction

**What is it?** Context propagation effect (γ) measures how performance changes when models process accumulated context in sequential pipelines.

**What it measures:** When a later model sees the output of an earlier model, does quality improve due to extended context?

**How to interpret:**
- γ > 1: Sequential context accumulation improves performance
- γ = 1: No benefit from accumulated context
- γ < 1: Context accumulation degrades performance (attention dilution, context window issues)

**What this does NOT measure:**
- ❌ Not measuring "agent intelligence" or "coordination principles"
- ❌ Not detecting genuine collaboration or planning
- ❌ Not explaining WHY context helps (black-box measurement)

**What it IS:**
- ✅ Statistical characterization of attention mechanism behavior
- ✅ Operational metric for pipeline architecture decisions
- ✅ Engineering measurement: "which configurations show reliable patterns?"
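
To estimate γ from your own pipeline runs rather than from the paper baseline, the relation used in this notebook's prediction cell (sequential ≈ P₁ · P₂ · γ) can be inverted. This is a sketch: the function name and the input numbers are hypothetical, and it assumes all three performance values are measured on the same quality scale.

```python
def context_effect(sequential_perf, perf_first, perf_second):
    """gamma = P_sequential / (P_first * P_second)."""
    independent_product = perf_first * perf_second
    if independent_product <= 0:
        raise ValueError("independent performances must be positive")
    return sequential_perf / independent_product


# Made-up numbers: two stages scoring 0.6 on their own, 0.45 when chained.
gamma_demo = context_effect(sequential_perf=0.45, perf_first=0.6, perf_second=0.6)
print(f"γ = {gamma_demo:.3f}")
```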

In [None]:
print(f"\n{'='*70}")
print(f"Context Propagation Effect Prediction (from Paper)")
print(f"{'='*70}")

if model_baseline.coordination_2agent:
    gamma = model_baseline.coordination_2agent
    print(f"\nValidated 2-model sequential context effect from paper:")
    print(f"  γ = {gamma:.3f}")
    
    # Predicted sequential performance: P1 * P2 * γ, with the same model at both stages
    independent_perf = model_baseline.mean_performance
    sequential_perf = independent_perf * independent_perf * gamma
    
    print(f"\nPrediction for 2-model sequential pipeline:")
    print(f"  Independent performance: {independent_perf:.3f}")
    print(f"  Expected sequential:     {sequential_perf:.3f}")
    print(f"  Improvement:             {(sequential_perf/independent_perf - 1)*100:+.1f}%")
    
    print(f"\nOperational Interpretation:")
    if gamma > 1.2:
        print(f"  → Strong context propagation benefit")
        print(f"  → Sequential processing improves quality substantially")
    elif gamma > 1.0:
        print(f"  → Moderate context propagation benefit")
        print(f"  → Sequential processing helps but gains are modest")
    else:
        print(f"  → Context accumulation does not improve performance")
        print(f"  → Consider single-model or parallel architectures")
else:
    print("\n⚠ 2-model context effect baseline not available for this model.")
    print("  You can measure it using sequential pipeline experiments.")

## Summary

Here's a complete comparison of your measurements vs the paper baselines:

In [None]:
print(f"\n{'='*70}")
print("Summary")
print(f"{'='*70}")
print(f"\nModel: {model_baseline.model_family} ({model_baseline.model_id})")
print(f"\nYour Measurements:")
print(f"  Consistency:   C = {consistency:.3f}")
print(f"  Performance:   μ = {mu:.3f}, σ = {sigma:.3f}")
print(f"\nPaper Baselines:")
print(f"  Consistency:   C = {model_baseline.consistency:.3f}")
print(f"  Performance:   μ = {model_baseline.mean_performance:.3f}, σ = {model_baseline.std_performance:.3f}")

if model_baseline.coordination_2agent:
    print(f"  2-model γ:     {model_baseline.coordination_2agent:.3f}")

print(f"\n{'='*70}")
print("✓ Basic measurements complete!")
print(f"{'='*70}")
print("\nNext steps:")
print("  - Run more trials for more stable estimates (20+ recommended)")
print("  - Measure context effects with sequential pipelines")
print("  - See advanced_usage.ipynb for custom models and domain-specific tasks")
print("  - See langchain_research_writer_pipeline.ipynb for real pipeline example")

## Interpreting Your Results

### Behavioral Consistency (C)
- **C > 0.85**: Highly consistent - safe for production
- **0.7 ≤ C ≤ 0.85**: Moderately consistent - acceptable with monitoring
- **C < 0.7**: Inconsistent - needs prompt engineering or different model

### Performance (μ, σ)
- **μ**: Higher mean = better quality outputs
- **σ**: Lower std dev = more predictable quality

### Context Propagation Effect (γ)
- **γ > 1.2**: Strong benefit from sequential processing
- **1.0 < γ ≤ 1.2**: Moderate benefit
- **γ ≤ 1.0**: No benefit (γ = 1) or degraded performance (γ < 1) from context accumulation

### What if my results differ from baseline?
- Small differences (±0.05) are normal due to sampling
- Larger differences may indicate:
  - Different prompt distributions
  - Model version changes
  - Temperature/parameter differences
  - Domain-specific behavior
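
The ±0.05 rule of thumb can be applied mechanically. A minimal sketch, with a hypothetical helper name and made-up numbers:

```python
def compare_to_baseline(measured, baseline, tolerance=0.05):
    """Return (delta, within_tolerance) for one metric against its baseline."""
    delta = measured - baseline
    return delta, abs(delta) <= tolerance


# Made-up example: measured C = 0.83 against a baseline of 0.80.
delta, ok = compare_to_baseline(0.83, 0.80)
print(f"Δ = {delta:+.3f}, within tolerance: {ok}")
```

The same helper works for C, μ, and σ; a failed check is a prompt to look at the causes listed above, not proof that something is wrong.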

### Production Recommendations
1. **High consistency + High performance**: Deploy with confidence
2. **High consistency + Low performance**: Consider prompt engineering
3. **Low consistency**: Investigate before production deployment
4. **γ > 1.2**: Sequential pipelines are beneficial
5. **γ < 1.0**: Avoid sequential processing, use single model or parallel architecture
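
These rules can be collected into a small helper. The sketch below uses the consistency and γ thresholds from this notebook, but the function name and the `perf_threshold` cut-off for "high performance" are hypothetical, since this notebook does not define one.

```python
def recommend(consistency, mean_perf, gamma=None, perf_threshold=0.6):
    """Map measurements to the recommendations listed above.

    perf_threshold is an assumed cut-off for "high performance".
    """
    notes = []
    if consistency < 0.7:
        notes.append("Low consistency: investigate before production deployment")
    elif consistency > 0.85:
        if mean_perf >= perf_threshold:
            notes.append("High consistency + high performance: deploy with confidence")
        else:
            notes.append("High consistency + low performance: consider prompt engineering")
    else:
        notes.append("Moderate consistency: acceptable with monitoring")

    if gamma is not None:
        if gamma > 1.2:
            notes.append("Sequential pipelines are beneficial")
        elif gamma < 1.0:
            notes.append("Avoid sequential processing; use a single model or parallel architecture")
    return notes


# Made-up measurements for illustration.
for line in recommend(consistency=0.90, mean_perf=0.70, gamma=1.3):
    print(f"- {line}")
```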

### What CERT Measures
**Statistical Characterization:**
- Behavioral variance in token generation (C)
- Performance changes from context accumulation (γ)
- Output quality distribution (μ, σ)

**Engineering Decisions:**
- Which pipeline configurations are reliable?
- Does sequential processing help or hurt?
- How predictable is production behavior?

**NOT Measured:**
- ❌ Agent intelligence or reasoning
- ❌ Coordination principles or collaboration
- ❌ Why context helps (attention mechanisms are a black box)

This is **engineering characterization** for deployment decisions, not coordination science.