# CERT Framework for Agentic AI Applications: Model Provider Benchmarking Suite

**Consistency, Effect, Resilience, Trustworthiness Analysis**

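This notebook implements a reproducible benchmarking framework to evaluate and compare language models across multiple providers (Anthropic, OpenAI, Google, xAI). The current version measures two dimensions: behavioral consistency (how semantically similar repeated responses to the same prompt are) and output quality (a heuristic score over a set of business-analysis prompts).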

**Framework Version:** 1.0

**Last Updated:** October 2025

## Table of Contents

1. [Environment Setup](#environment-setup)
2. [Configuration](#configuration)
3. [Core Framework](#core-framework)
4. [Run Benchmark](#run-benchmark)
5. [Results Analysis](#results-analysis)
6. [Export & Visualization](#export--visualization)

## 1. Environment Setup

In [None]:
# Install required packages
%pip install anthropic openai google-generativeai requests python-dotenv sentence-transformers torch pandas numpy scipy scikit-learn matplotlib seaborn -q

In [None]:
import asyncio
import json
import logging
import os
import warnings
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.spatial.distance import pdist, squareform
from sentence_transformers import SentenceTransformer

# Suppress warnings
warnings.filterwarnings('ignore')

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    force=True  # replace any pre-existing root handlers; otherwise this call is a no-op
)
logger = logging.getLogger(__name__)

print("Environment setup complete.")

In [None]:
# Check GPU availability for embeddings
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if device == "cuda":
        print(f"GPU available: {torch.cuda.get_device_name(0)}")
    else:
        print("Running on CPU. Consider enabling GPU in Colab for faster embeddings.")
except Exception as e:
    print(f"Error checking GPU: {e}")
    device = "cpu"

## 2. Configuration

In [None]:
# Store API keys securely: Colab secrets when available, environment variables otherwise
try:
    from google.colab import userdata
except ImportError:
    userdata = None  # not running in Colab

# Retrieve API keys
API_KEYS = {}

providers_to_test = ['ANTHROPIC', 'OPENAI', 'GOOGLE', 'XAI']

for provider in providers_to_test:
    key_name = f"{provider}_API_KEY"
    try:
        # Colab secrets first; fall back to an environment variable of the same name
        api_key = userdata.get(key_name) if userdata is not None else os.environ.get(key_name)
        if api_key:
            API_KEYS[provider] = api_key
            print(f"Loaded {provider} API key")
        else:
            print(f"Skipping {provider}: API key not found")
    except Exception as e:
        # Colab raises if the secret is missing or notebook access is not granted
        print(f"Skipping {provider}: could not load API key ({e})")

print(f"\nConfigured providers: {list(API_KEYS.keys())}")

In [None]:
# BENCHMARK CONFIGURATION

@dataclass
class BenchmarkConfig:
    """Configuration for benchmark execution."""

    # Trial configurations
    consistency_trials: int = 20
    performance_trials: int = 15
    coordination_trials: int = 15

    # Model selection
    providers: Dict[str, List[str]] = field(default_factory=lambda: {
        'anthropic': ['claude-3-5-haiku-20241022', 'claude-3-5-sonnet-20241022'],
        'openai': ['gpt-4o-mini'],
        'google': ['gemini-2.5-flash', 'gemini-2.0-flash'],
        'xai': ['grok-2']
    })

    # Embedding model
    embedding_model_name: str = 'all-MiniLM-L6-v2'

    # API parameters
    max_tokens: int = 1024
    temperature: float = 0.7
    timeout: int = 30

    # Test prompts
    consistency_prompt: str = (
        "Analyze the key factors in effective business strategy implementation. "
        "Provide a concise, structured response."
    )

    performance_prompts: List[str] = field(default_factory=lambda: [
        "Analyze the key factors in business strategy",
        "Evaluate the main considerations for project management",
        "Assess the critical elements in organizational change",
        "Identify the primary aspects of market analysis",
        "Examine the essential components of risk assessment"
    ])

    # Output configuration
    output_dir: str = '/content/benchmark_results'
    random_seed: int = 42

    def __post_init__(self):
        """Validate configuration after initialization."""
        if self.consistency_trials < 10:
            raise ValueError("consistency_trials must be >= 10")
        if self.performance_trials < 5:
            raise ValueError("performance_trials must be >= 5")

        # Create output directory
        os.makedirs(self.output_dir, exist_ok=True)

        # Seed NumPy for local reproducibility; provider-side sampling
        # (temperature > 0) remains nondeterministic
        np.random.seed(self.random_seed)


# Instantiate configuration
config = BenchmarkConfig()
print(f"Configuration loaded. Output directory: {config.output_dir}")
print(f"Configured providers and models: {config.providers}")
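
Variants of a configuration can also be derived from an existing instance rather than written out in full. The sketch below uses a cut-down stand-in class (`MiniConfig`, hypothetical, not the notebook's `BenchmarkConfig`) to show that `dataclasses.replace` builds a new instance and re-runs `__post_init__`, so the same validation applies to the derived copy.

```python
from dataclasses import dataclass, replace


@dataclass
class MiniConfig:
    """Hypothetical stand-in mirroring two BenchmarkConfig fields."""
    consistency_trials: int = 20
    performance_trials: int = 15

    def __post_init__(self):
        # Same lower bounds as the validation in the cell above
        if self.consistency_trials < 10:
            raise ValueError("consistency_trials must be >= 10")
        if self.performance_trials < 5:
            raise ValueError("performance_trials must be >= 5")


base = MiniConfig()

# replace() calls __init__ again, so __post_init__ validates the copy
longer = replace(base, consistency_trials=50)

try:
    replace(base, consistency_trials=5)
    rejected = False
except ValueError:
    rejected = True  # 5 is below the minimum of 10
```

With the real class, each new instance would also re-create the output directory and re-seed NumPy, since both happen in `__post_init__`.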

## 3. Core Framework

In [None]:
# Data structures for storing results

@dataclass
class ConsistencyResult:
    """Results from consistency testing."""
    provider: str
    model: str
    consistency_score: float
    mean_distance: float
    std_distance: float
    num_trials: int
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())


@dataclass
class PerformanceResult:
    """Results from performance testing."""
    provider: str
    model: str
    mean_score: float
    std_score: float
    min_score: float
    max_score: float
    num_trials: int
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())


@dataclass
class CoordinationResult:
    """Results from coordination testing."""
    provider: str
    model: str
    mean_performance: float
    std_performance: float
    num_trials: int
    degradation_factor: float
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())


print("Data structures defined.")
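
Because the result types are plain dataclasses, `asdict` turns each one into a dictionary that pandas or `json` can consume directly. A minimal sketch with a cut-down stand-in (`MiniResult` is hypothetical, and the values are made up for illustration):

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime


@dataclass
class MiniResult:
    """Hypothetical stand-in with the same shape as ConsistencyResult."""
    provider: str
    model: str
    consistency_score: float
    num_trials: int
    # default_factory runs per instance, so each result gets its own timestamp
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())


# Illustrative values only, not measured results
result = MiniResult("example-provider", "example-model", 0.5, 20)

row = asdict(result)        # plain dict, one key per field, in declaration order
payload = json.dumps(row)   # works because every value is a str, float, or int
restored = json.loads(payload)
```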

In [None]:
# Abstract base class for providers

class ProviderInterface(ABC):
    """Abstract interface for language model providers."""

    def __init__(self, api_key: str, timeout: int = 30):
        """Initialize provider with API key.

        Args:
            api_key: API key for the provider
            timeout: Request timeout in seconds
        """
        self.api_key = api_key
        self.timeout = timeout
        self.logger = logging.getLogger(self.__class__.__name__)

    @abstractmethod
    async def call_model(
        self,
        model: str,
        prompt: str,
        max_tokens: int = 1024,
        temperature: float = 0.7
    ) -> str:
        """Call the language model.

        Args:
            model: Model name/identifier
            prompt: Input prompt
            max_tokens: Maximum tokens in response
            temperature: Sampling temperature

        Returns:
            Model response text

        Raises:
            Exception: If API call fails
        """
        pass

    @abstractmethod
    def get_provider_name(self) -> str:
        """Get provider name."""
        pass


print("Provider interface defined.")
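
Any class that implements the two abstract methods can stand in for a real provider, which makes it possible to exercise the pipeline without API keys. The sketch below redefines a trimmed copy of the interface so that it runs on its own; `CannedProvider` and its replies are hypothetical.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import List


class MiniProviderInterface(ABC):
    """Trimmed copy of ProviderInterface, repeated so this example is standalone."""

    @abstractmethod
    async def call_model(self, model: str, prompt: str,
                         max_tokens: int = 1024, temperature: float = 0.7) -> str:
        ...

    @abstractmethod
    def get_provider_name(self) -> str:
        ...


class CannedProvider(MiniProviderInterface):
    """Hypothetical offline provider that cycles through fixed replies."""

    def __init__(self, replies: List[str]):
        self.replies = replies
        self.calls = 0

    async def call_model(self, model, prompt, max_tokens=1024, temperature=0.7):
        reply = self.replies[self.calls % len(self.replies)]
        self.calls += 1
        return reply

    def get_provider_name(self) -> str:
        return "canned"


async def demo() -> List[str]:
    provider = CannedProvider(["first reply", "second reply"])
    return [await provider.call_model("any-model", "hello") for _ in range(3)]


if __name__ == "__main__":
    # In a notebook, use `await demo()` instead of asyncio.run
    print(asyncio.run(demo()))
```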

In [None]:
# Anthropic provider implementation

class AnthropicProvider(ProviderInterface):
    """Anthropic Claude provider."""

    def __init__(self, api_key: str, timeout: int = 30):
        super().__init__(api_key, timeout)
        try:
            from anthropic import Anthropic
            self.client = Anthropic(api_key=api_key, timeout=timeout)
            self.logger.info("Anthropic client initialized")
        except ImportError:
            raise ImportError("anthropic package not installed")
        except Exception as e:
            self.logger.error(f"Failed to initialize Anthropic client: {e}")
            raise

    async def call_model(
        self,
        model: str,
        prompt: str,
        max_tokens: int = 1024,
        temperature: float = 0.7
    ) -> str:
        """Call Claude model."""
        try:
            message = self.client.messages.create(
                model=model,
                max_tokens=max_tokens,
                temperature=temperature,
                messages=[
                    {"role": "user", "content": prompt}
                ]
            )
            response = message.content[0].text
            self.logger.debug(f"Claude {model} response: {len(response)} chars")
            return response
        except Exception as e:
            self.logger.error(f"Anthropic API error: {e}")
            raise

    def get_provider_name(self) -> str:
        return "anthropic"


print("Anthropic provider implemented.")

In [None]:
# OpenAI provider implementation

class OpenAIProvider(ProviderInterface):
    """OpenAI GPT provider."""

    def __init__(self, api_key: str, timeout: int = 30):
        super().__init__(api_key, timeout)
        try:
            from openai import OpenAI
            self.client = OpenAI(api_key=api_key, timeout=timeout)
            self.logger.info("OpenAI client initialized")
        except ImportError:
            raise ImportError("openai package not installed")
        except Exception as e:
            self.logger.error(f"Failed to initialize OpenAI client: {e}")
            raise

    async def call_model(
        self,
        model: str,
        prompt: str,
        max_tokens: int = 1024,
        temperature: float = 0.7
    ) -> str:
        """Call GPT model."""
        try:
            response = self.client.chat.completions.create(
                model=model,
                max_tokens=max_tokens,
                temperature=temperature,
                messages=[
                    {"role": "user", "content": prompt}
                ]
            )
            # content can be None; fall back to an empty string so len() below is safe
            text = response.choices[0].message.content or ""
            self.logger.debug(f"GPT {model} response: {len(text)} chars")
            return text
        except Exception as e:
            self.logger.error(f"OpenAI API error: {e}")
            raise

    def get_provider_name(self) -> str:
        return "openai"


print("OpenAI provider implemented.")

In [None]:
# Google Gemini provider implementation

class GoogleProvider(ProviderInterface):
    """Google Gemini provider."""

    def __init__(self, api_key: str, timeout: int = 30):
        super().__init__(api_key, timeout)
        try:
            import google.generativeai as genai
            genai.configure(api_key=api_key)
            self.genai = genai
            self.logger.info("Google Gemini client initialized")
        except ImportError:
            raise ImportError("google-generativeai package not installed")
        except Exception as e:
            self.logger.error(f"Failed to initialize Google client: {e}")
            raise

    async def call_model(
        self,
        model: str,
        prompt: str,
        max_tokens: int = 1024,
        temperature: float = 0.7
    ) -> str:
        """Call Gemini model."""
        try:
            model_obj = self.genai.GenerativeModel(model)
            response = model_obj.generate_content(
                prompt,
                generation_config=self.genai.types.GenerationConfig(
                    max_output_tokens=max_tokens,
                    temperature=temperature
                )
            )
            text = response.text
            self.logger.debug(f"Gemini {model} response: {len(text)} chars")
            return text
        except Exception as e:
            self.logger.error(f"Google Gemini API error: {e}")
            raise

    def get_provider_name(self) -> str:
        return "google"


print("Google provider implemented.")

In [None]:
# xAI Grok provider implementation

class XAIProvider(ProviderInterface):
    """xAI Grok provider."""

    def __init__(self, api_key: str, timeout: int = 30):
        super().__init__(api_key, timeout)
        try:
            from openai import OpenAI
            # Grok uses OpenAI-compatible API
            self.client = OpenAI(
                api_key=api_key,
                base_url="https://api.x.ai/v1",
                timeout=timeout
            )
            self.logger.info("xAI Grok client initialized")
        except ImportError:
            raise ImportError("openai package not installed")
        except Exception as e:
            self.logger.error(f"Failed to initialize xAI client: {e}")
            raise

    async def call_model(
        self,
        model: str,
        prompt: str,
        max_tokens: int = 1024,
        temperature: float = 0.7
    ) -> str:
        """Call Grok model."""
        try:
            response = self.client.chat.completions.create(
                model=model,
                max_tokens=max_tokens,
                temperature=temperature,
                messages=[
                    {"role": "user", "content": prompt}
                ]
            )
            # content can be None; fall back to an empty string so len() below is safe
            text = response.choices[0].message.content or ""
            self.logger.debug(f"Grok {model} response: {len(text)} chars")
            return text
        except Exception as e:
            self.logger.error(f"xAI API error: {e}")
            raise

    def get_provider_name(self) -> str:
        return "xai"


print("xAI provider implemented.")
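
Each `call_model` above is declared `async` but invokes a synchronous SDK method, so the event loop is blocked for the duration of the request. That is harmless while trials run one at a time. If calls are ever issued concurrently, one option is to push the blocking call onto a worker thread with `asyncio.to_thread` (Python 3.9+). The sketch uses a hypothetical `blocking_call` in place of a real SDK request.

```python
import asyncio
import time
from typing import List


def blocking_call(prompt: str) -> str:
    """Hypothetical stand-in for a synchronous SDK request."""
    time.sleep(0.05)  # simulate network latency
    return f"echo: {prompt}"


async def call_model(prompt: str) -> str:
    # Run the blocking function in a worker thread so the loop stays responsive
    return await asyncio.to_thread(blocking_call, prompt)


async def run_concurrently(prompts: List[str]) -> List[str]:
    # gather returns results in the same order as its arguments
    return await asyncio.gather(*(call_model(p) for p in prompts))


if __name__ == "__main__":
    # In a notebook, use `await run_concurrently([...])` instead of asyncio.run
    print(asyncio.run(run_concurrently(["a", "b", "c"])))
```

Running requests in parallel would also make provider rate limits more likely to trigger, so the sequential loop in the engine is a reasonable default.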

In [None]:
# Main benchmarking engine

class CERTBenchmarkEngine:
    """CERT Framework benchmarking engine.

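    CERT dimensions:
    - Consistency: Behavioral reliability across trials
    - Effect: Multi-agent coordination performance
    - Resilience: Performance under stress
    - Trustworthiness: Reliability and explicability

    This version implements the consistency and performance tests only;
    coordination results are declared but not yet populated.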
    """

    def __init__(self, config: BenchmarkConfig, providers: Dict[str, ProviderInterface]):
        """Initialize benchmark engine.

        Args:
            config: Benchmark configuration
            providers: Dictionary of initialized providers
        """
        self.config = config
        self.providers = providers
        self.logger = logging.getLogger(self.__class__.__name__)

        # Initialize embedding model
        self.logger.info(f"Loading embedding model: {config.embedding_model_name}")
        self.embedding_model = SentenceTransformer(
            config.embedding_model_name,
            device=device
        )

        # Results storage
        self.consistency_results: List[ConsistencyResult] = []
        self.performance_results: List[PerformanceResult] = []
        self.coordination_results: List[CoordinationResult] = []
        self.execution_log: List[Dict[str, Any]] = []

    def _calculate_consistency(
        self,
        responses: List[str]
    ) -> Tuple[float, float, float]:
        """Calculate semantic consistency from responses.

        Args:
            responses: List of model responses

        Returns:
            Tuple of (consistency_score, mean_distance, std_distance)
        """
        if len(responses) < 2:
            return 1.0, 0.0, 0.0

        # Filter empty responses
        valid_responses = [r for r in responses if r and len(r.strip()) > 0]
        if len(valid_responses) < 2:
            return 0.0, np.inf, 0.0

        # Encode responses
        embeddings = self.embedding_model.encode(
            valid_responses,
            show_progress_bar=False,
            convert_to_tensor=False
        )

        # Calculate pairwise distances
        distances = pdist(embeddings, metric='cosine')

        if len(distances) == 0:
            return 1.0, 0.0, 0.0

        mean_distance = np.mean(distances)
        std_distance = np.std(distances)

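        # Consistency: 1 - coefficient of variation (std / mean) of the
        # pairwise cosine distances, floored at 0
        if mean_distance == 0:
            consistency = 1.0
        else:
            consistency = max(0.0, 1.0 - (std_distance / mean_distance))

        # Cast NumPy scalars to plain floats for clean serialization
        return float(consistency), float(mean_distance), float(std_distance)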

    def _score_response(
        self,
        prompt: str,
        response: str
    ) -> float:
        """Score response quality (0-1).

        Evaluates:
        - Semantic relevance to prompt
        - Response length (completeness)
        - Presence of structured content
        """
        if not response or len(response.strip()) < 10:
            return 0.0

        try:
            # Semantic relevance (50%)
            prompt_embedding = self.embedding_model.encode(prompt, show_progress_bar=False)
            response_embedding = self.embedding_model.encode(response, show_progress_bar=False)

            relevance = float(np.dot(prompt_embedding, response_embedding) /
                            (np.linalg.norm(prompt_embedding) * np.linalg.norm(response_embedding)))
            relevance = max(0.0, min(1.0, (relevance + 1) / 2))  # Normalize [-1, 1] to [0, 1]

            # Completeness (30%): based on response length
            word_count = len(response.split())
            completeness = min(1.0, word_count / 200)  # 200 words = excellent

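            # Structure (20%): bullets, numbering, headings, or paragraph breaks
            # (a bare '.' or ':' would match almost every response)
            import re
            has_structure = 0.5
            if re.search(r'^\s*(?:[-*•#]|\d+[.)])\s+', response, re.MULTILINE) or '\n\n' in response:
                has_structure = 1.0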

            # Weighted score
            score = (relevance * 0.5 + completeness * 0.3 + has_structure * 0.2)

            return float(score)
        except Exception as e:
            self.logger.warning(f"Error scoring response: {e}")
            return 0.5  # Default score

    async def test_consistency(
        self,
        provider_name: str,
        model: str
    ) -> Optional[ConsistencyResult]:
        """Test model consistency.

        Args:
            provider_name: Provider identifier
            model: Model name

        Returns:
            ConsistencyResult or None if failed
        """
        if provider_name not in self.providers:
            self.logger.error(f"Provider {provider_name} not available")
            return None

        provider = self.providers[provider_name]
        responses = []

        self.logger.info(
            f"Testing consistency: {provider_name}/{model} "
            f"({self.config.consistency_trials} trials)"
        )

        for trial in range(self.config.consistency_trials):
            try:
                response = await provider.call_model(
                    model,
                    self.config.consistency_prompt,
                    max_tokens=self.config.max_tokens,
                    temperature=self.config.temperature
                )
                responses.append(response)

                if (trial + 1) % 5 == 0:
                    self.logger.info(f"  Completed trial {trial + 1}/{self.config.consistency_trials}")
            except Exception as e:
                self.logger.warning(f"Trial {trial + 1} failed: {e}")
                continue

        if not responses:
            self.logger.error(f"No successful responses for {provider_name}/{model}")
            return None

        consistency, mean_dist, std_dist = self._calculate_consistency(responses)

        result = ConsistencyResult(
            provider=provider_name,
            model=model,
            consistency_score=consistency,
            mean_distance=mean_dist,
            std_distance=std_dist,
            num_trials=len(responses)
        )

        self.logger.info(
            f"Consistency: {consistency:.3f} "
            f"(mean_dist={mean_dist:.3f}, std_dist={std_dist:.3f})"
        )

        self.consistency_results.append(result)
        return result

    async def test_performance(
        self,
        provider_name: str,
        model: str
    ) -> Optional[PerformanceResult]:
        """Test model performance.

        Args:
            provider_name: Provider identifier
            model: Model name

        Returns:
            PerformanceResult or None if failed
        """
        if provider_name not in self.providers:
            self.logger.error(f"Provider {provider_name} not available")
            return None

        provider = self.providers[provider_name]
        scores = []

        self.logger.info(
            f"Testing performance: {provider_name}/{model} "
            f"({self.config.performance_trials} trials)"
        )

        for trial in range(self.config.performance_trials):
            prompt = self.config.performance_prompts[
                trial % len(self.config.performance_prompts)
            ]

            try:
                response = await provider.call_model(
                    model,
                    prompt,
                    max_tokens=self.config.max_tokens,
                    temperature=self.config.temperature
                )
                score = self._score_response(prompt, response)
                scores.append(score)

                if (trial + 1) % 5 == 0:
                    self.logger.info(
                        f"  Completed trial {trial + 1}/{self.config.performance_trials} "
                        f"(avg score: {np.mean(scores):.3f})"
                    )
            except Exception as e:
                self.logger.warning(f"Trial {trial + 1} failed: {e}")
                continue

        if not scores:
            self.logger.error(f"No successful scores for {provider_name}/{model}")
            return None

        result = PerformanceResult(
            provider=provider_name,
            model=model,
            mean_score=float(np.mean(scores)),
            std_score=float(np.std(scores)),
            min_score=float(np.min(scores)),
            max_score=float(np.max(scores)),
            num_trials=len(scores)
        )

        self.logger.info(
            f"Performance: mean={result.mean_score:.3f}, "
            f"std={result.std_score:.3f}"
        )

        self.performance_results.append(result)
        return result

    async def run_full_benchmark(
        self,
        test_consistency: bool = True,
        test_performance: bool = True,
        test_coordination: bool = False
    ) -> Dict[str, List[Any]]:
        """Run complete benchmark suite.

        Args:
            test_consistency: Whether to run consistency tests
            test_performance: Whether to run performance tests
            test_coordination: Whether to run coordination tests
                (not yet implemented; currently ignored)

        Returns:
            Dictionary with all results
        """
        start_time = datetime.now()
        self.logger.info("Starting full benchmark suite")

        # Iterate over all configured providers and models
        for provider_name, models in self.config.providers.items():
            if provider_name not in self.providers:
                self.logger.warning(f"Provider {provider_name} not available, skipping")
                continue

            for model in models:
                self.logger.info(f"\nTesting {provider_name}/{model}")

                if test_consistency:
                    await self.test_consistency(provider_name, model)

                if test_performance:
                    await self.test_performance(provider_name, model)

        end_time = datetime.now()
        duration = (end_time - start_time).total_seconds()

        self.logger.info(f"Benchmark completed in {duration:.1f} seconds")

        return {
            'consistency': self.consistency_results,
            'performance': self.performance_results,
            'coordination': self.coordination_results,
            'duration_seconds': duration,
            'start_time': start_time.isoformat(),
            'end_time': end_time.isoformat()
        }


print("Benchmark engine implemented.")
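
The consistency score is one minus the coefficient of variation of the pairwise cosine distances, floored at zero. The standalone sketch below reproduces that arithmetic with the standard library on hand-made vectors (stand-ins for sentence embeddings; the helper names are hypothetical). It makes one property visible: the score rewards *uniform* spacing between responses rather than small distances.

```python
import math
from itertools import combinations
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity, the same quantity as scipy's 'cosine' metric."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def consistency(vectors: List[Sequence[float]]) -> Tuple[float, float, float]:
    """Return (score, mean_distance, std_distance) over all unordered pairs."""
    distances = [cosine_distance(u, v) for u, v in combinations(vectors, 2)]
    mean_d = fmean(distances)
    std_d = pstdev(distances)  # population std, like np.std's default
    if mean_d == 0:
        return 1.0, 0.0, 0.0
    return max(0.0, 1.0 - std_d / mean_d), mean_d, std_d


# Three mutually orthogonal vectors: every pairwise distance is 1.0
evenly_spread = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
# Two identical vectors plus one orthogonal outlier: distances are 0, 1, 1
one_outlier = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

score_even, mean_even, std_even = consistency(evenly_spread)
score_outlier, mean_outlier, std_outlier = consistency(one_outlier)
```

Here the mutually orthogonal set scores 1.0 while the set containing two identical responses scores about 0.29, so `mean_distance` is worth reading alongside the score.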

## 4. Run Benchmark

In [None]:
# Initialize providers based on available API keys

provider_map = {
    'ANTHROPIC': AnthropicProvider,
    'OPENAI': OpenAIProvider,
    'GOOGLE': GoogleProvider,
    'XAI': XAIProvider
}

initialized_providers = {}

for provider_key, provider_class in provider_map.items():
    if provider_key in API_KEYS:
        try:
            provider_name = provider_key.lower()
            provider_instance = provider_class(
                api_key=API_KEYS[provider_key],
                timeout=config.timeout
            )
            initialized_providers[provider_name] = provider_instance
            print(f"Initialized {provider_name} provider")
        except Exception as e:
            print(f"Failed to initialize {provider_key}: {e}")
    else:
        print(f"Skipping {provider_key}: API key not available")

if not initialized_providers:
    raise RuntimeError(
        "No providers initialized. Please add API keys to Colab secrets: "
        "ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, XAI_API_KEY"
    )

print(f"\nProviders ready: {list(initialized_providers.keys())}")
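
Note that `API_KEYS` is keyed by upper-case names while `BenchmarkConfig.providers` uses lower-case ones; the loop above bridges the two with `.lower()`. A small pre-flight check along the following lines (a sketch; `plan_runs` is a hypothetical helper, not part of the notebook) lists which configured models will run and which will be skipped.

```python
from typing import Dict, Iterable, List, Tuple


def plan_runs(
    configured: Dict[str, List[str]],
    initialized: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Split configured models into (will_run, will_skip) as 'provider/model'."""
    ready = {name.lower() for name in initialized}
    will_run: List[str] = []
    will_skip: List[str] = []
    for provider, models in configured.items():
        target = will_run if provider.lower() in ready else will_skip
        target.extend(f"{provider}/{model}" for model in models)
    return will_run, will_skip


# Placeholder names for illustration only
configured = {"alpha": ["alpha-small", "alpha-large"], "beta": ["beta-1"]}
will_run, will_skip = plan_runs(configured, initialized=["ALPHA"])
```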

In [None]:
# CUSTOMIZE BENCHMARK CONFIGURATION

# You can modify these settings before running the benchmark

# Option 1: Test all providers (one model each)
config_all_models = BenchmarkConfig(
    consistency_trials=20,
    performance_trials=15,
    providers={
        'anthropic': ['claude-3-5-haiku-20241022'],
        'openai': ['gpt-4o-mini'],
        'google': ['gemini-2.5-flash'],
        'xai': ['grok-2']
    }
)

# Option 2: Quick test (fewer trials)
config_quick = BenchmarkConfig(
    consistency_trials=10,  # minimum accepted by BenchmarkConfig validation
    performance_trials=5,
    providers={
        'anthropic': ['claude-3-5-haiku-20241022'],
        'openai': ['gpt-4o-mini']
    }
)

# Option 3: Comprehensive test
config_comprehensive = BenchmarkConfig(
    consistency_trials=50,
    performance_trials=30,
    providers={
        'anthropic': ['claude-3-5-haiku-20241022', 'claude-3-5-sonnet-20241022'],
        'openai': ['gpt-4o-mini'],
        'google': ['gemini-2.5-flash', 'gemini-2.0-flash'],
        'xai': ['grok-2']
    }
)

# SELECT CONFIGURATION
# Change this to config_quick, config_all_models, or config_comprehensive
selected_config = config_all_models

print(f"Selected configuration: {selected_config}")
print(f"Providers to test: {selected_config.providers}")

In [None]:
# Initialize benchmark engine

engine = CERTBenchmarkEngine(
    config=selected_config,
    providers=initialized_providers
)

print("Benchmark engine initialized")
print(f"Output directory: {selected_config.output_dir}")

In [None]:
# Run the benchmark

results = await engine.run_full_benchmark(
    test_consistency=True,
    test_performance=True,
    test_coordination=False
)

print("\nBenchmark execution complete.")
print(f"Total duration: {results['duration_seconds']:.1f} seconds")
print(f"Consistency tests: {len(results['consistency'])}")
print(f"Performance tests: {len(results['performance'])}")
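
Each model receives one API call per trial, so the request count for a run follows directly from the configuration. A rough sketch of that arithmetic (`estimate_calls` is a hypothetical helper; it counts attempts, including any that fail):

```python
from typing import Dict, List


def estimate_calls(
    providers: Dict[str, List[str]],
    consistency_trials: int,
    performance_trials: int,
) -> int:
    """Total API calls when both test types run for every configured model."""
    num_models = sum(len(models) for models in providers.values())
    return num_models * (consistency_trials + performance_trials)


# Same shape as config_all_models: four providers, one model each
providers = {
    "anthropic": ["claude-3-5-haiku-20241022"],
    "openai": ["gpt-4o-mini"],
    "google": ["gemini-2.5-flash"],
    "xai": ["grok-2"],
}
total = estimate_calls(providers, consistency_trials=20, performance_trials=15)
print(total)  # 4 models * (20 + 15) trials = 140
```

By the same formula, the comprehensive configuration (six models, 50 + 30 trials) comes to 480 calls.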

## 5. Results Analysis

In [None]:
# Convert results to DataFrames for analysis

consistency_df = pd.DataFrame([
    asdict(r) for r in results['consistency']
])

performance_df = pd.DataFrame([
    asdict(r) for r in results['performance']
])

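# Display summary statistics (an empty DataFrame has no columns to select)
print("CONSISTENCY RESULTS")
print("="*60)
if consistency_df.empty:
    print("No consistency results recorded.")
else:
    print(consistency_df[['provider', 'model', 'consistency_score', 'num_trials']].to_string(index=False))

print("\nPERFORMANCE RESULTS")
print("="*60)
if performance_df.empty:
    print("No performance results recorded.")
else:
    print(performance_df[['provider', 'model', 'mean_score', 'std_score', 'num_trials']].to_string(index=False))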

In [None]:
# Statistical analysis

if len(consistency_df) > 1:
    print("CONSISTENCY ANALYSIS")
    print("="*60)
    consistency_summary = consistency_df.groupby('provider').agg({
        'consistency_score': ['mean', 'std', 'min', 'max']
    }).round(3)
    print(consistency_summary)

if len(performance_df) > 1:
    print("\nPERFORMANCE ANALYSIS")
    print("="*60)
    performance_summary = performance_df.groupby('provider').agg({
        'mean_score': ['mean', 'std', 'min', 'max']
    }).round(3)
    print(performance_summary)
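
With only a handful of trials per model, differences between mean scores can fall within sampling noise. One hedged way to gauge this is a normal-approximation 95% interval built from the exported `mean_score`, `std_score`, and `num_trials`. This is a sketch, not part of the framework: `mean_ci` is a hypothetical helper, the inputs are made-up numbers, and because `std_score` comes from `np.std` (population form) the helper rescales it to the sample form first.

```python
import math
from typing import Tuple


def mean_ci(mean: float, pop_std: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a mean.

    pop_std is the population standard deviation (ddof=0); it is converted
    to the sample standard deviation (ddof=1) before computing the margin.
    """
    if n < 2:
        raise ValueError("need at least two trials")
    sample_std = pop_std * math.sqrt(n / (n - 1))
    margin = z * sample_std / math.sqrt(n)
    return mean - margin, mean + margin


# Made-up inputs for illustration
low, high = mean_ci(mean=0.70, pop_std=0.05, n=15)
```

For small trial counts a t-based interval would be somewhat wider, and trials that cycle through different prompts are not strictly independent draws, so the result is best read as a rough guide.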

In [None]:
# Combine consistency and performance for overall comparison

if len(consistency_df) > 0 and len(performance_df) > 0:
    comparison_df = consistency_df[['provider', 'model', 'consistency_score']].copy()

    # Join with performance scores
    perf_merged = performance_df[['provider', 'model', 'mean_score']].copy()
    perf_merged.columns = ['provider', 'model', 'performance_score']

    comparison_df = comparison_df.merge(
        perf_merged,
        on=['provider', 'model'],
        how='inner'
    )

    # Normalize scores to 0-100 scale for readability
    comparison_df['consistency_normalized'] = (comparison_df['consistency_score'] * 100).round(1)
    comparison_df['performance_normalized'] = (comparison_df['performance_score'] * 100).round(1)

    print("COMBINED COMPARISON")
    print("="*80)
    print(comparison_df[[
        'provider', 'model', 'consistency_normalized', 'performance_normalized'
    ]].to_string(index=False))

    print("\nNote: Consistency (0-100) and Performance (0-100) normalized scores")
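
The performance score behind these numbers is a heuristic that combines three terms: relevance (cosine similarity rescaled from [-1, 1] to [0, 1], weight 0.5), completeness (word count over 200, capped at 1, weight 0.3), and structure (0.5 or 1.0, weight 0.2). The sketch below replays that weighting for hand-picked inputs; the function name and the example numbers are illustrative, not measured values.

```python
def weighted_score(cosine_similarity: float, word_count: int, structured: bool) -> float:
    """Replay the 50/30/20 weighting used by the engine's response scorer."""
    relevance = max(0.0, min(1.0, (cosine_similarity + 1) / 2))
    completeness = min(1.0, word_count / 200)
    structure = 1.0 if structured else 0.5
    return relevance * 0.5 + completeness * 0.3 + structure * 0.2


# cosine 0.6 -> relevance 0.8; 100 words -> completeness 0.5; structured -> 1.0
example = weighted_score(0.6, 100, True)     # 0.40 + 0.15 + 0.20 = 0.75
# An unrelated (cosine 0), ten-word, unstructured reply still keeps a floor
floor_case = weighted_score(0.0, 10, False)  # 0.25 + 0.015 + 0.10 = 0.365
```

Because a cosine similarity of 0 maps to a relevance of 0.5, scores tend to sit in the upper part of the range, so differences between models are likely more informative than absolute values.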

## 6. Export & Visualization

In [None]:
# Export results to CSV

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

consistency_path = os.path.join(
    selected_config.output_dir,
    f"consistency_results_{timestamp}.csv"
)
consistency_df.to_csv(consistency_path, index=False)
print(f"Consistency results exported to: {consistency_path}")

performance_path = os.path.join(
    selected_config.output_dir,
    f"performance_results_{timestamp}.csv"
)
performance_df.to_csv(performance_path, index=False)
print(f"Performance results exported to: {performance_path}")

if 'comparison_df' in locals():
    comparison_path = os.path.join(
        selected_config.output_dir,
        f"comparison_results_{timestamp}.csv"
    )
    comparison_df.to_csv(comparison_path, index=False)
    print(f"Comparison results exported to: {comparison_path}")

In [None]:
# Export results to JSON with metadata

export_data = {
    'metadata': {
        'framework': 'CERT',
        'version': '1.0',
        'timestamp': results['start_time'],
        'duration_seconds': results['duration_seconds'],
        'configuration': {
            'consistency_trials': selected_config.consistency_trials,
            'performance_trials': selected_config.performance_trials,
            'embedding_model': selected_config.embedding_model_name
        }
    },
    'results': {
        'consistency': [asdict(r) for r in results['consistency']],
        'performance': [asdict(r) for r in results['performance']]
    }
}

json_path = os.path.join(
    selected_config.output_dir,
    f"benchmark_results_{timestamp}.json"
)

with open(json_path, 'w') as f:
    json.dump(export_data, f, indent=2)

print(f"Full results exported to: {json_path}")
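
One edge case is worth knowing about: `_calculate_consistency` returns `np.inf` as the mean distance when fewer than two non-empty responses were collected, and Python's `json` module writes non-finite floats as `Infinity` or `NaN`, which strict JSON parsers reject. A possible guard (sketch only; `sanitize` is a hypothetical helper) replaces such values with `null` before dumping.

```python
import json
import math
from typing import Any


def sanitize(value: Any) -> Any:
    """Recursively replace non-finite floats with None (JSON null)."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(item) for item in value]
    return value


# Made-up record for illustration
record = {
    "consistency_score": 0.0,
    "mean_distance": float("inf"),
    "scores": [0.5, float("nan")],
}

raw = json.dumps(record)  # contains the non-standard tokens Infinity and NaN
# allow_nan=False makes json.dumps raise if a non-finite value slips through
clean = json.dumps(sanitize(record), allow_nan=False)
```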

In [None]:
# Visualization: Consistency comparison

if len(consistency_df) > 0:
    fig, ax = plt.subplots(figsize=(12, 6))

    # Prepare data
    consistency_df_sorted = consistency_df.sort_values('consistency_score', ascending=False)
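    # Use the full model name: the last hyphen-separated token is not unique
    # (both Claude models end in '20241022', both Gemini models in 'flash')
    x_labels = [f"{row['provider']}/{row['model']}"
                 for _, row in consistency_df_sorted.iterrows()]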
    y_values = consistency_df_sorted['consistency_score'].values

    # Create bar chart
    colors = plt.cm.Blues(np.linspace(0.4, 0.8, len(y_values)))
    bars = ax.bar(range(len(y_values)), y_values, color=colors, edgecolor='black', linewidth=1.5)

    # Add value labels on bars
    for bar, value in zip(bars, y_values):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
               f'{value:.3f}',
               ha='center', va='bottom', fontsize=10, fontweight='bold')

    ax.set_xticks(range(len(x_labels)))
    ax.set_xticklabels(x_labels, rotation=45, ha='right')
    ax.set_ylabel('Consistency Score', fontsize=12, fontweight='bold')
    ax.set_title('CERT Framework: Behavioral Consistency Comparison',
                 fontsize=14, fontweight='bold', pad=20)
    ax.set_ylim([0, 1.0])
    ax.grid(axis='y', alpha=0.3, linestyle='--')
    ax.axhline(y=0.85, color='green', linestyle='--', alpha=0.5, label='Excellent (>0.85)')
    ax.axhline(y=0.70, color='orange', linestyle='--', alpha=0.5, label='Fair (>0.70)')
    ax.legend(loc='lower right')

    plt.tight_layout()

    viz_path = os.path.join(
        selected_config.output_dir,
        f"consistency_comparison_{timestamp}.png"
    )
    plt.savefig(viz_path, dpi=300, bbox_inches='tight')
    print(f"Visualization saved to: {viz_path}")
    plt.show()
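
Tick labels must stay unique per model for the bars to be distinguishable. If full identifiers crowd the axis, one option is to drop only a trailing eight-digit date stamp. The helper below is a hypothetical sketch of that idea:

```python
import re


def short_label(provider: str, model: str) -> str:
    """Drop a trailing -YYYYMMDD suffix, keeping the rest of the model name."""
    trimmed = re.sub(r"-\d{8}$", "", model)
    return f"{provider}/{trimmed}"


labels = [
    short_label("anthropic", "claude-3-5-haiku-20241022"),
    short_label("anthropic", "claude-3-5-sonnet-20241022"),
    short_label("google", "gemini-2.5-flash"),
    short_label("google", "gemini-2.0-flash"),
]
```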

In [None]:
# Visualization: Performance comparison

if len(performance_df) > 0:
    fig, ax = plt.subplots(figsize=(12, 6))

    # Prepare data
    performance_df_sorted = performance_df.sort_values('mean_score', ascending=False)
    x_labels = [f"{row['provider']}/{row['model'].split('-')[-1]}"
                 for _, row in performance_df_sorted.iterrows()]
    y_values = performance_df_sorted['mean_score'].values
    y_errors = performance_df_sorted['std_score'].values

    # Create bar chart with error bars
    colors = plt.cm.Oranges(np.linspace(0.4, 0.8, len(y_values)))
    bars = ax.bar(range(len(y_values)), y_values, yerr=y_errors,
                   color=colors, edgecolor='black', linewidth=1.5,
                   capsize=5, error_kw={'linewidth': 2})

    # Add value labels on bars
    for bar, value in zip(bars, y_values):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
               f'{value:.3f}',
               ha='center', va='bottom', fontsize=10, fontweight='bold')

    ax.set_xticks(range(len(x_labels)))
    ax.set_xticklabels(x_labels, rotation=45, ha='right')
    ax.set_ylabel('Performance Score', fontsize=12, fontweight='bold')
    ax.set_title('CERT Framework: Output Quality & Performance Comparison',
                 fontsize=14, fontweight='bold', pad=20)
    ax.set_ylim([0, 1.0])
    ax.grid(axis='y', alpha=0.3, linestyle='--')

    plt.tight_layout()

    viz_path = os.path.join(
        selected_config.output_dir,
        f"performance_comparison_{timestamp}.png"
    )
    plt.savefig(viz_path, dpi=300, bbox_inches='tight')
    print(f"Visualization saved to: {viz_path}")
    plt.show()

In [None]:
# Visualization: Scatter plot (Consistency vs Performance)

if 'comparison_df' in locals() and len(comparison_df) > 1:
    fig, ax = plt.subplots(figsize=(10, 8))

    # Create scatter plot
    colors_map = {'anthropic': '#1f77b4', 'openai': '#ff7f0e',
                   'google': '#2ca02c', 'xai': '#d62728'}

    for provider in comparison_df['provider'].unique():
        provider_data = comparison_df[comparison_df['provider'] == provider]
        ax.scatter(provider_data['consistency_normalized'],
                  provider_data['performance_normalized'],
                  s=300, alpha=0.7, label=provider.upper(),
                  color=colors_map.get(provider, 'gray'),
                  edgecolor='black', linewidth=1.5)

        # Add model labels
        for _, row in provider_data.iterrows():
            ax.annotate(row['model'].split('-')[-1],
                        (row['consistency_normalized'], row['performance_normalized']),
                        xytext=(5, 5), textcoords='offset points',
                        fontsize=9, fontweight='bold')

    ax.set_xlabel('Consistency Score (0-100)', fontsize=12, fontweight='bold')
    ax.set_ylabel('Performance Score (0-100)', fontsize=12, fontweight='bold')
    ax.set_title('CERT Framework: Consistency vs Performance Trade-off',
                 fontsize=14, fontweight='bold', pad=20)
    ax.set_xlim([0, 105])
    ax.set_ylim([0, 105])
    ax.grid(True, alpha=0.3, linestyle='--')
    ax.legend(loc='best', fontsize=11)

    plt.tight_layout()

    viz_path = os.path.join(
        selected_config.output_dir,
        f"consistency_vs_performance_{timestamp}.png"
    )
    plt.savefig(viz_path, dpi=300, bbox_inches='tight')
    print(f"Scatter plot saved to: {viz_path}")
    plt.show()

In [None]:
# Create summary report

# Render the result tables up front so that an empty DataFrame (no successful
# trials) does not raise a KeyError inside the f-string below.
consistency_table = (
    consistency_df[['provider', 'model', 'consistency_score', 'num_trials']].to_string(index=False)
    if len(consistency_df) > 0 else "No consistency results available."
)
performance_table = (
    performance_df[['provider', 'model', 'mean_score', 'std_score', 'num_trials']].to_string(index=False)
    if len(performance_df) > 0 else "No performance results available."
)

report_content = f"""CERT FRAMEWORK BENCHMARK REPORT
{'='*80}

Execution Details:
- Timestamp: {results['start_time']}
- Duration: {results['duration_seconds']:.1f} seconds
- Framework: CERT (Consistency, Effect, Resilience, Trustworthiness)

Configuration:
- Consistency Trials: {selected_config.consistency_trials}
- Performance Trials: {selected_config.performance_trials}
- Embedding Model: {selected_config.embedding_model_name}

Providers Tested:
{json.dumps(selected_config.providers, indent=2)}

CONSISTENCY RESULTS
{'-'*80}
{consistency_table}

PERFORMANCE RESULTS
{'-'*80}
{performance_table}

Results exported to: {selected_config.output_dir}

Files generated:
- consistency_results_{timestamp}.csv
- performance_results_{timestamp}.csv
- comparison_results_{timestamp}.csv
- benchmark_results_{timestamp}.json
- consistency_comparison_{timestamp}.png
- performance_comparison_{timestamp}.png
- consistency_vs_performance_{timestamp}.png
"""

report_path = os.path.join(
    selected_config.output_dir,
    f"benchmark_report_{timestamp}.txt"
)

with open(report_path, 'w') as f:
    f.write(report_content)

print("BENCHMARK SUMMARY")
print("="*80)
print(report_content)
print(f"\nReport saved to: {report_path}")

In [None]:
# List all generated files

print(f"\nGenerated files in {selected_config.output_dir}:")
print("="*80)

files = os.listdir(selected_config.output_dir)
for file in sorted(files):
    file_path = os.path.join(selected_config.output_dir, file)
    file_size = os.path.getsize(file_path)
    print(f"  {file:<50} ({file_size:,} bytes)")

print(f"\nTotal files: {len(files)}")
print("\nTo download files in Colab, use:")
print("  from google.colab import files")
print(f"  files.download('{report_path}')")

## Additional Notes

### Framework Description

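The **CERT Framework** measures AI model reliability across four dimensions. Of these, only Consistency is implemented in this basic version; the performance benchmark reported above is a separate output-quality measurement.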

1. **Consistency (C)**: Behavioral reliability across multiple trials using semantic embeddings
   - Measures: How predictable is the model's behavior?
   - Score: 0.0 (completely inconsistent) to 1.0 (perfectly consistent)
   - Use case: Compliance, audit, regulated workflows

2. **Effect (E)**: Multi-agent coordination performance
   - Measures: How well do models work together in larger systems?
   - Not implemented in this basic version

3. **Resilience (R)**: Performance under stress and edge cases
   - Not implemented in this basic version

4. **Trustworthiness (T)**: Reliability and explicability
   - Not implemented in this basic version
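
As a rough illustration of the Consistency dimension, the sketch below scores a set of response embeddings by their mean pairwise cosine similarity, clipped to the 0.0 to 1.0 range. This is an assumed formulation for illustration only and may differ from the formula used inside `CERTBenchmarkEngine`. The function name `consistency_from_embeddings` and the toy vectors are hypothetical; in the notebook, the vectors would come from the configured embedding model.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def consistency_from_embeddings(embeddings: Sequence[Sequence[float]]) -> float:
    """Mean pairwise cosine similarity, clipped to [0.0, 1.0] (assumed formulation)."""
    if len(embeddings) < 2:
        raise ValueError("Need at least two responses to measure consistency")
    sims = [cosine_similarity(a, b) for a, b in combinations(embeddings, 2)]
    mean_sim = sum(sims) / len(sims)
    return max(0.0, min(1.0, mean_sim))


# Toy vectors standing in for sentence embeddings of repeated responses
identical = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(consistency_from_embeddings(identical))
print(consistency_from_embeddings(scattered))
```

Identical responses land at the top of the scale, while responses whose embeddings point in unrelated or opposite directions are pushed toward the bottom.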

### Extending the Framework

To add more tests:

1. Add a new test method to the `CERTBenchmarkEngine` class
2. Create a corresponding result dataclass
3. Call the new method from the `run_full_benchmark` method
4. Export the results to CSV/JSON
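
Steps 2 and 4 could look roughly like the sketch below. The `LatencyResult` dataclass, the `export_results` helper, and the sample values are all hypothetical and are not part of the notebook; they only show the shape of a result record and a CSV/JSON export using the standard library.

```python
import csv
import json
import os
import tempfile
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class LatencyResult:
    """Hypothetical result record for a new latency test."""
    provider: str
    model: str
    mean_latency_s: float
    num_trials: int


def export_results(results: List[LatencyResult], output_dir: str, stem: str) -> List[str]:
    """Write the results to <stem>.csv and <stem>.json and return both paths."""
    rows = [asdict(r) for r in results]
    csv_path = os.path.join(output_dir, f"{stem}.csv")
    json_path = os.path.join(output_dir, f"{stem}.json")

    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(LatencyResult)])
        writer.writeheader()
        writer.writerows(rows)

    with open(json_path, "w") as f:
        json.dump(rows, f, indent=2)

    return [csv_path, json_path]


# Placeholder values, not measured results
demo = [LatencyResult("example-provider", "example-model", 1.25, 5)]
out_dir = tempfile.mkdtemp()
paths = export_results(demo, out_dir, "latency_results")
print(paths)
```

Keeping each result type as a dataclass means `asdict` can feed both export formats from the same records.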

### Customization Examples

**Add custom prompts:**
```python
config.consistency_prompt = "Your custom prompt here"
config.performance_prompts = ["Prompt 1", "Prompt 2", ...]
```

**Modify scoring logic:**
Edit the `_score_response` method in the `CERTBenchmarkEngine` class.

**Add new provider:**
Create a new class that inherits from `ProviderInterface` and implements the `call_model` method.
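
A minimal sketch of that pattern follows. The `ProviderInterface` defined here is a stand-in so the example runs on its own; the constructor arguments and the `call_model` signature (an async method taking a prompt string and returning text) are assumptions and should be matched to the notebook's actual base class. `EchoProvider` is a hypothetical provider that makes no network calls.

```python
import asyncio
from abc import ABC, abstractmethod


class ProviderInterface(ABC):
    """Stand-in for the notebook's base class; the real signature may differ."""

    def __init__(self, model: str):
        self.model = model

    @abstractmethod
    async def call_model(self, prompt: str) -> str:
        """Send the prompt to the model and return its text response."""


class EchoProvider(ProviderInterface):
    """Hypothetical provider that echoes the prompt instead of calling an API."""

    async def call_model(self, prompt: str) -> str:
        await asyncio.sleep(0)  # placeholder for the real network request
        return f"[{self.model}] {prompt}"


response = asyncio.run(EchoProvider("echo-1").call_model("Hello"))
print(response)
```

Inside a Jupyter or Colab cell an event loop is already running, so use `await EchoProvider("echo-1").call_model("Hello")` there instead of `asyncio.run(...)`.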

### Troubleshooting

**API Key issues:**
- Store API keys in Google Colab Secrets (key icon in the left sidebar)
- Ensure key names match: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`, `XAI_API_KEY`
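
To check which keys are visible before running the benchmark, something like the following can help. It is a sketch, assuming keys live either in Colab Secrets (read through `google.colab.userdata.get`) or in environment variables; how the notebook itself loads keys may differ, and the helper name `get_api_key` is hypothetical.

```python
import os
from typing import Optional


def get_api_key(name: str) -> Optional[str]:
    """Return the key from Colab Secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            value = userdata.get(name)
            if value:
                return value
        except Exception:
            # Secret missing or notebook access not granted; fall back below
            pass
    return os.environ.get(name)


for key_name in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY", "XAI_API_KEY"):
    status = "found" if get_api_key(key_name) else "MISSING"
    print(f"{key_name}: {status}")
```

The loop prints only whether each key was found, never the key itself, so the output is safe to share when asking for help.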

**Memory issues:**
- Reduce trial counts in configuration
- Use a smaller embedding model
- Run one provider at a time

**Rate limiting:**
- The benchmark respects provider rate limits
- Adjust delays between requests as needed
- Start with fewer trials
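
If rate-limit errors persist, one common approach is to wrap each request in a retry loop with exponential backoff. The sketch below is not the notebook's own mechanism; `call_with_backoff` and its parameters are hypothetical, and it catches every `Exception` for brevity, whereas real code would normally catch only the provider's rate-limit error.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_backoff(
    make_call: Callable[[], Awaitable[T]],
    max_retries: int = 4,
    base_delay: float = 1.0,
) -> T:
    """Retry an async call with exponential backoff and random jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == max_retries:
                raise
            # Delay doubles each attempt (base, 2*base, 4*base, ...) plus jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
    # Only reached if max_retries is negative
    raise RuntimeError("max_retries must be >= 0")


# Simulated call that fails twice before succeeding
attempts = {"count": 0}


async def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


result = asyncio.run(call_with_backoff(flaky_call, base_delay=0.01))
print(result, attempts["count"])
```

Inside a notebook cell, replace the `asyncio.run(...)` call with `await call_with_backoff(...)`, since an event loop is already running there.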

### Reproducibility

The notebook supports reproducibility in the following ways:
- The random seed is set to 42 (configurable)
- All results are timestamped
- The configuration is saved with the results
- The embedding model version is recorded
- The run timestamp preserves execution order
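
The idea of seeding and saving the configuration alongside the results can be sketched as follows. `RunSettings` and `build_manifest` are hypothetical names, and the field values are placeholders rather than the notebook's defaults (apart from the seed of 42 mentioned above).

```python
import json
import random
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class RunSettings:
    """Hypothetical subset of the benchmark configuration."""
    random_seed: int = 42
    consistency_trials: int = 10  # placeholder value
    embedding_model_name: str = "example-embedding-model"  # placeholder value


def build_manifest(settings: RunSettings) -> dict:
    """Seed the local RNG and bundle the settings with a UTC timestamp."""
    random.seed(settings.random_seed)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "config": asdict(settings),
    }


manifest = build_manifest(RunSettings())
print(json.dumps(manifest, indent=2))
```

Seeding in this way fixes only local randomness, such as any shuffling done in the notebook; sampling performed on the provider's side is not controlled by it.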

### Citation

If using this framework in research, please cite:

```
CERT Benchmarking Framework (2025)
Consistency, Effect, Resilience, Trustworthiness Analysis
for Language Model Providers
```
