##**Claude-Only CERT Agents Pipeline Coordination Demo**
---

This framework provides systematic measurement of coordination patterns in current LLM systems. While these systems manipulate discrete tokens based on statistical correlations rather than learning continuous representations of the physical world, this infrastructure will be essential when we develop architectures based on learned world models.

**What this measures:**
Coordination behaviors between sophisticated pattern-matching systems.

**What this enables:**
Deployment scaffolding for current technology, and an experimental apparatus for studying genuine coordination when it emerges from proper architectures.

###**Basic description**

>Predetermined sequential flow:

>> Agent 1 → Agent 2 → ... → Agent N, with 2 ≤ N ≤ 10

>Fixed processing chain with no autonomous decision-making

>Each "agent" is a role-specialized LLM instance processing predetermined inputs

>No dynamic coordination or emergent communication patterns

###**Setup Instructions**
####1. API Configuration

    # Set your Anthropic API key
    api_key = "sk-ant-api03-your-key-here"

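
A key hard-coded in a notebook is easy to leak when the notebook is shared. As a minimal sketch, assuming the key is stored in an environment variable named `ANTHROPIC_API_KEY`, it could be read at runtime instead. The helper name `load_api_key` is hypothetical and not part of CERT.

```python
import os

def load_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With the variable set, `api_key = load_api_key()` replaces the literal assignment.
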
####2. Agent Configuration (2-10 agents)
Each agent requires four parameters:

>Agent ID: Unique identifier (agent_1, summarizer, etc.)

>Model: Choose from available Claude models

>Role: Agent's specialized function (Document Analyst, Critical Reviewer)

>Task: Specific instructions for this agent's analysis approach

Available Models:
```
claude-opus-4-20250514 - #Most powerful, complex reasoning
claude-sonnet-4-20250514 - #Balanced performance (recommended)
claude-3-5-haiku-20241022 - #Fastest response times
claude-3-haiku-20240307 - #Legacy model for comparison
```
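
As an illustration, the four parameters could be captured in a small dataclass with basic validation. The names `AgentConfig` and `validate_agents` are hypothetical; the notebook itself passes these values directly to `ClaudeAgent`.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class AgentConfig:
    agent_id: str   # unique identifier, e.g. "summarizer"
    model: str      # one of the Claude model names listed above
    role: str       # specialized function, e.g. "Document Analyst"
    task: str       # instructions for this agent's analysis approach

def validate_agents(configs: List[AgentConfig]) -> None:
    """Check the 2-10 agent limit and that agent IDs are unique."""
    if not 2 <= len(configs) <= 10:
        raise ValueError("Expected between 2 and 10 agents")
    ids = [c.agent_id for c in configs]
    if len(set(ids)) != len(ids):
        raise ValueError("Agent IDs must be unique")

agents = [
    AgentConfig("analyst", "claude-sonnet-4-20250514", "Document Analyst",
                "Extract the three main points of the document."),
    AgentConfig("reviewer", "claude-3-5-haiku-20241022", "Critical Reviewer",
                "Challenge weak claims in the previous analysis."),
]
validate_agents(agents)  # raises ValueError on a bad configuration
```
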
####3. Coordination Configuration
Global Task: Overall objective for the Agents Pipeline system

Coordination Pattern:
>Sequential: Agents process in order, building on previous outputs

>Parallel: All agents analyze simultaneously, independent perspectives
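
The difference between the two patterns can be sketched with plain `asyncio`. The `fake_agent` coroutine below is a stand-in for a real model call and is purely illustrative; `run_sequential` and `run_parallel` are hypothetical names, not the notebook's orchestrator methods.

```python
import asyncio
from typing import List

async def fake_agent(name: str, context: str) -> str:
    """Stand-in for an LLM call: tags the context with the agent name."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{name}({context})"

async def run_sequential(names: List[str], task: str) -> List[str]:
    """Each agent receives the previous agent's output as its context."""
    outputs, context = [], task
    for name in names:
        context = await fake_agent(name, context)
        outputs.append(context)
    return outputs

async def run_parallel(names: List[str], task: str) -> List[str]:
    """All agents receive the same task and run concurrently."""
    return list(await asyncio.gather(*(fake_agent(n, task) for n in names)))

sequential = asyncio.run(run_sequential(["a1", "a2"], "task"))
parallel = asyncio.run(run_parallel(["a1", "a2"], "task"))
print(sequential)  # ['a1(task)', 'a2(a1(task))']
print(parallel)    # ['a1(task)', 'a2(task)']
```

Inside a notebook, where an event loop is already running, call `await run_sequential(...)` directly instead of `asyncio.run(...)`.
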

####4. Document Upload
Upload PDF documents for analysis. The system extracts text content and uses it as context for agent coordination.
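
A minimal sketch of the extraction step, assuming the PyPDF2 `PdfReader` interface (`reader.pages` and `page.extract_text()`). The helper names are hypothetical, and the PyPDF2 import is deferred so the text-joining logic can be exercised without the library installed.

```python
import io
from typing import Iterable, Optional

def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages that yielded nothing."""
    cleaned = [t.strip() for t in page_texts if t and t.strip()]
    return "\n\n".join(cleaned)

def extract_pdf_text(pdf_bytes: bytes) -> str:
    """Extract text from an in-memory PDF (requires PyPDF2)."""
    import PyPDF2  # deferred import: only needed when a PDF is processed
    reader = PyPDF2.PdfReader(io.BytesIO(pdf_bytes))
    return join_page_texts(page.extract_text() for page in reader.pages)

# Pages without extractable text (None or empty) are dropped.
print(join_page_texts(["Page one. ", None, "", "Page two."]))
```

Scanned PDFs without a text layer produce empty page text, so the resulting context may be empty for such documents.
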

###**Execution Process**

####Phase 1: Individual Analysis

>**Behavioral Consistency Score ($C$)**

>How reliably an agent produces similar responses to identical tasks (labelled β in the code and dashboard).

>$0.9 \le C \le 1.0$: Highly reliable

>$0.7 \le C < 0.9$: Moderately consistent

>$C < 0.7$: Unreliable for deployment
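
To make the score concrete, here is a simplified sketch that computes the mean pairwise Jaccard overlap of word sets and maps it to the bands above. It omits the stopword filtering and the boost that the notebook's `calculate_behavioral_consistency` applies; `jaccard_consistency` and `reliability_band` are hypothetical names.

```python
from itertools import combinations
from typing import List

def jaccard_consistency(responses: List[str]) -> float:
    """Mean pairwise Jaccard similarity of the responses' word sets."""
    if len(responses) < 2:
        return 1.0  # a single response is trivially consistent
    word_sets = [set(r.lower().split()) for r in responses]
    scores = []
    for a, b in combinations(word_sets, 2):
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

def reliability_band(c: float) -> str:
    """Map a consistency score to the bands listed above."""
    if c >= 0.9:
        return "Highly reliable"
    if c >= 0.7:
        return "Moderately consistent"
    return "Unreliable for deployment"

c = jaccard_consistency(["red green blue", "red green blue", "red green yellow"])
print(round(c, 3), reliability_band(c))  # 0.667 Unreliable for deployment
```

In this example two of the three pairs share only two of four distinct words, so the mean overlap is (1 + 0.5 + 0.5) / 3.
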


####Phase 2: Agents Pipeline Coordination
Agents coordinate according to the selected pattern while the system tracks:

>Conversation flow: Complete interaction sequence

>Response quality: Success/failure rates

>Timing patterns: Response latencies and bottlenecks
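
As a sketch of how the tracked quantities could be summarized, the function below computes the success rate, mean latency, and slowest agent from a list of log entries. The entry keys (`agent`, `success`, `response_time`) mirror the notebook's conversation log; `summarize_log` itself is a hypothetical helper.

```python
from typing import Any, Dict, List

def summarize_log(log: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize success rate and latency from conversation log entries."""
    if not log:
        return {"success_rate": 0.0, "mean_latency": 0.0, "bottleneck": None}
    successes = sum(1 for e in log if e["success"])
    latencies = [e["response_time"] for e in log]
    slowest = max(log, key=lambda e: e["response_time"])
    return {
        "success_rate": successes / len(log),
        "mean_latency": sum(latencies) / len(latencies),
        "bottleneck": slowest["agent"],  # agent with the longest response time
    }

log = [
    {"agent": "analyst", "success": True, "response_time": 2.0},
    {"agent": "reviewer", "success": False, "response_time": 4.0},
]
summary = summarize_log(log)
print(summary)
```
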

####Phase 3: Coordination Effect Measurement
>**Coordination Effect ($\gamma$)**

>Performance change when agents work together vs. alone.

<center>$\gamma = \frac{\textrm{Coordinated Performance}}{\textrm{Individual Performance}}$</center>

> $\gamma$ > 1.0: Agents help each other

> $\gamma$ = 1.0: No coordination benefit

> $\gamma$ < 1.0: Agents interfere with each other
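
A worked example under the definition above, taking individual performance as the mean of the agents' solo scores, as the notebook's `calculate_coordination_effect` does. The function name `coordination_effect` and the sample scores are illustrative. Unlike the notebook, this sketch does not clamp γ to [0.1, 3.0] and raises on a zero baseline instead of returning a fixed value.

```python
from statistics import mean
from typing import List

def coordination_effect(individual: List[float], coordinated: float) -> float:
    """gamma = coordinated performance / mean individual performance."""
    if not individual:
        return 1.0  # no baseline: treat as no measurable effect
    baseline = mean(individual)
    if baseline == 0:
        raise ValueError("Baseline performance is zero; gamma is undefined")
    return coordinated / baseline

# Two agents score 0.6 and 0.8 alone (mean 0.7); together they score 0.84.
gamma = coordination_effect([0.6, 0.8], coordinated=0.84)
print(round(gamma, 2))  # 1.2, i.e. agents help each other
```
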

###**Interactive Visualization**
Conversation Timeline: Real-time tracking of agent interactions with step-by-step conversation flow.

Performance Dashboard: Four-panel analysis showing:

> Agent consistency scores over time

> Response time patterns by agent

> Coordination effects across experiments

> Success rates and error analysis




## Installs and Imports

In [None]:
##Install required packages and clone the CERT repository##
#!pip install anthropic transformers torch dotenv pycryptodome PyPDF2
#!pip install -q watermark
## Clone CERT repository##
#!git clone https://github.com/Javihaus/cert-coordination-observability.git
#!cd cert-coordination-observability && pip install -e .

In [None]:
%load_ext watermark
%watermark

**Test environment**

Python implementation: CPython <br>
Python version       : 3.11.13<br>
IPython version      : 7.34.0<br>

Compiler    : GCC 11.4.0<br>
OS          : Linux<br>
Release     : 6.1.123+<br>
Machine     : x86_64<br>
Processor   : x86_64<br>
CPU cores   : 2<br>
Architecture: 64bit<br>

In [None]:
import os
import asyncio
import time
import json
import numpy as np
import pandas as pd
from datetime import datetime
from typing import Dict, List, Any, Optional
from dataclasses import dataclass
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import anthropic
import PyPDF2
import io
from google.colab import files
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets


In [None]:
ANTHROPIC_API_KEY="xxxxxxxx"

## Code

In [None]:
# CERT Core Measurement Components
@dataclass
class AgentInteraction:
    timestamp: datetime
    agent_id: str
    model: str
    role: str
    task: str
    prompt: str
    response: str
    response_time: float
    success: bool
    error: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None

@dataclass
class CoordinationStep:
    step_number: int
    agent_id: str
    input_context: str
    output: str
    reasoning: str
    timestamp: datetime

class CERTMeasurement:
    """Core measurement logic for behavioral consistency and coordination effects"""

    @staticmethod
    def calculate_behavioral_consistency(responses: List[str]) -> float:
        """Calculate consistency score (β) from multiple responses - IMPROVED VERSION"""
        if len(responses) < 2:
            return 1.0

        # Consistency is measured as lexical overlap (Jaccard) between content words,
        # which only approximates semantic similarity.
        # Filter out common stopwords that don't contribute to content meaning
        stopwords = {
            'the', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by',
            'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has',
            'had', 'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
            'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it', 'we', 'they'
        }

        # Extract meaningful content words from each response
        import re  # local import: `re` is not imported at the top of the notebook
        content_sets = []
        for response in responses:
            # Clean and tokenize
            cleaned = response.lower()
            # Remove punctuation and numbers that might vary between responses
            cleaned = re.sub(r'[^\w\s]', ' ', cleaned)
            cleaned = re.sub(r'\d+', '', cleaned)  # Remove numbers like "1.", "2.", etc.

            # Extract meaningful words (length > 2, not stopwords)
            words = cleaned.split()
            content_words = [w for w in words if len(w) > 2 and w not in stopwords]
            content_sets.append(set(content_words))

        # Calculate pairwise content similarity
        similarities = []
        for i in range(len(content_sets)):
            for j in range(i + 1, len(content_sets)):
                intersection = len(content_sets[i].intersection(content_sets[j]))
                union = len(content_sets[i].union(content_sets[j]))

                if union == 0:
                    similarity = 1.0  # Both responses had no meaningful content
                else:
                    # Use Jaccard similarity but boost it for structured tasks
                    jaccard = intersection / union

                    # For extraction tasks, high content overlap should score higher
                    # Apply a boost function that rewards high content overlap
                    if jaccard > 0.5:
                        similarity = 0.5 + (jaccard - 0.5) * 1.5  # Boost high similarity
                    else:
                        similarity = jaccard

                    similarity = min(1.0, similarity)  # Cap at 1.0

                similarities.append(similarity)

        # Cast to a plain float to match the declared return type
        return float(np.mean(similarities)) if similarities else 0.0

    @staticmethod
    def calculate_coordination_effect(individual_performances: List[float],
                                    coordinated_performance: float) -> float:
        """Calculate coordination effect (γ) - IMPROVED VERSION"""

        # Handle edge cases
        if not individual_performances:
            return 1.0

        expected_performance = np.mean(individual_performances)

        # Avoid division by zero
        if expected_performance == 0:
            if coordinated_performance > 0:
                return 2.0  # Some improvement over zero baseline
            else:
                return 1.0  # No change from zero baseline

        gamma = coordinated_performance / expected_performance

        # Apply reasonable bounds to prevent extreme values due to measurement noise
        # Real coordination effects typically range from 0.5 to 2.0
        gamma = max(0.1, min(gamma, 3.0))

        return gamma

    @staticmethod
    def calculate_response_quality(response: str, task_type: str = "extraction") -> float:
        """Calculate quality score for individual responses"""

        if not response or len(response.strip()) == 0:
            return 0.0

        if task_type == "extraction":
            # For extraction tasks like "list three main points"
            # Quality = completeness + structure + content richness

            # Check for list structure (numbered, bulleted, or clear separation)
            structure_score = 0.0
            response_lower = response.lower()

            # Look for list indicators
            list_indicators = ['1.', '2.', '3.', '•', '-', 'first', 'second', 'third', 'one:', 'two:', 'three:']
            structure_count = sum(1 for indicator in list_indicators if indicator in response_lower)

            if structure_count >= 2:
                structure_score = 0.3  # Well-structured response
            elif structure_count >= 1:
                structure_score = 0.1  # Some structure

            # Content richness (meaningful words)
            words = response.split()
            meaningful_words = [w for w in words if len(w) > 3]
            content_score = min(0.4, len(meaningful_words) / 20.0)  # Cap at 0.4

            # Completeness (length indicator)
            length_score = min(0.3, len(response) / 200.0)  # Cap at 0.3

            total_quality = structure_score + content_score + length_score
            return min(1.0, total_quality)

        elif task_type == "analysis":
            # For analytical tasks - different quality metrics
            words = response.split()

            # Analytical depth indicators
            analytical_terms = ['because', 'therefore', 'however', 'moreover', 'analysis', 'conclusion', 'evidence']
            analytical_score = sum(1 for term in analytical_terms if term in response.lower()) / 10.0

            # Content richness
            content_score = min(0.6, len(words) / 50.0)

            total_quality = min(1.0, analytical_score + content_score)
            return total_quality

        else:
            # Default quality metric
            word_count = len(response.split())
            if word_count < 10:
                return 0.2
            elif word_count < 50:
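                # Linear increase from 0.5 (10 words) to 0.8 (50 words), so the score
                # stays continuous with the next branch instead of dropping at 50 words
                return 0.5 + 0.3 * (word_count - 10) / 40.0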
            else:
                return 0.8 + min(0.2, (word_count - 50) / 200.0)  # Diminishing returns

    @staticmethod
    def calculate_pipeline_performance(responses: List[str], task_type: str = "extraction") -> float:
        """Calculate overall performance of a pipeline of responses"""

        if not responses:
            return 0.0

        # For pipeline, the final response is most important
        # But we also consider the progression quality

        individual_qualities = [
            CERTMeasurement.calculate_response_quality(response, task_type)
            for response in responses
        ]

        if len(individual_qualities) == 1:
            return individual_qualities[0]

        # Weight the final response more heavily, but consider progression
        final_weight = 0.6
        progression_weight = 0.4

        final_quality = individual_qualities[-1]
        progression_quality = np.mean(individual_qualities[:-1]) if len(individual_qualities) > 1 else 0

        pipeline_performance = (final_weight * final_quality) + (progression_weight * progression_quality)

        return pipeline_performance

class ClaudeAgent:
    """Individual Claude agent with specific role and model"""

    def __init__(self, agent_id: str, model: str, role: str, task_description: str, api_key: str):
        self.agent_id = agent_id
        self.model = model
        self.role = role
        self.task_description = task_description
        self.client = anthropic.Anthropic(api_key=api_key)
        self.interaction_history = []
        self.performance_metrics = {
            "consistency_scores": [],
            "response_times": [],
            "error_count": 0,
            "success_count": 0
        }

    async def generate_response(self, prompt: str, context: str = "",
                              max_tokens: int = 500) -> AgentInteraction:
        """Generate response and track performance metrics"""
        start_time = time.time()

        try:
            full_prompt = f"""Role: {self.role}
Task: {self.task_description}

Context: {context}

Request: {prompt}

Respond according to your role and task. Be concise but thorough."""

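            # The Anthropic client call is synchronous; run it in a worker thread
            # so it does not block the event loop (required for parallel coordination).
            response = await asyncio.to_thread(
                self.client.messages.create,
                model=self.model,
                max_tokens=max_tokens,
                messages=[{"role": "user", "content": full_prompt}]
            )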

            response_time = time.time() - start_time
            response_text = response.content[0].text

            interaction = AgentInteraction(
                timestamp=datetime.now(),
                agent_id=self.agent_id,
                model=self.model,
                role=self.role,
                task=self.task_description,
                prompt=prompt,
                response=response_text,
                response_time=response_time,
                success=True,
                metadata={"context": context, "max_tokens": max_tokens}
            )

            self.interaction_history.append(interaction)
            self.performance_metrics["success_count"] += 1
            self.performance_metrics["response_times"].append(response_time)

            return interaction

        except Exception as e:
            response_time = time.time() - start_time

            interaction = AgentInteraction(
                timestamp=datetime.now(),
                agent_id=self.agent_id,
                model=self.model,
                role=self.role,
                task=self.task_description,
                prompt=prompt,
                response="",
                response_time=response_time,
                success=False,
                error=str(e)
            )

            self.interaction_history.append(interaction)
            self.performance_metrics["error_count"] += 1

            return interaction

    async def measure_consistency(self, prompt: str, trials: int = 3) -> float:
        """Measure behavioral consistency across multiple trials"""
        responses = []

        for _ in range(trials):
            interaction = await self.generate_response(prompt)
            if interaction.success:
                responses.append(interaction.response)
            await asyncio.sleep(0.5)  # Rate limiting

        if len(responses) >= 2:
            consistency = CERTMeasurement.calculate_behavioral_consistency(responses)
            self.performance_metrics["consistency_scores"].append(consistency)
            return consistency

        return 0.0

class CoordinationOrchestrator:
    """Manages Agents Pipeline coordination and conversation tracking"""

    def __init__(self, agents: List[ClaudeAgent]):
        self.agents = agents
        self.coordination_history = []
        self.conversation_log = []
        self.coordination_effects = []

    async def run_sequential_coordination(self, initial_prompt: str,
                                        document_content: str = "") -> List[CoordinationStep]:
        """Run sequential coordination between agents"""
        coordination_steps = []
        current_context = f"Document: {document_content}\n\nInitial Task: {initial_prompt}"

        for i, agent in enumerate(self.agents):
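            # Pass only the task as the request; current_context is already sent as
            # the context argument, so embedding it here would duplicate it in the prompt.
            step_prompt = f"Step {i+1}: {initial_prompt}"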

            interaction = await agent.generate_response(step_prompt, current_context)

            step = CoordinationStep(
                step_number=i + 1,
                agent_id=agent.agent_id,
                input_context=current_context,
                output=interaction.response if interaction.success else f"ERROR: {interaction.error}",
                reasoning=f"Agent {agent.agent_id} ({agent.role}) processing step {i+1}",
                timestamp=interaction.timestamp
            )

            coordination_steps.append(step)
            self.conversation_log.append({
                "step": i + 1,
                "agent": agent.agent_id,
                "role": agent.role,
                "model": agent.model,
                "task": agent.task_description,
                "input": current_context[:200] + "..." if len(current_context) > 200 else current_context,
                "output": step.output,
                "timestamp": step.timestamp,
                "success": interaction.success,
                "response_time": interaction.response_time
            })

            # Update context for next agent
            if interaction.success:
                current_context = f"Previous analysis: {interaction.response}\n\nContinue the analysis:"

            await asyncio.sleep(1)  # Rate limiting between agents

        self.coordination_history.extend(coordination_steps)
        return coordination_steps

    async def run_parallel_coordination(self, task_prompt: str,
                                      document_content: str = "") -> List[CoordinationStep]:
        """Run parallel coordination where all agents work simultaneously"""
        coordination_steps = []
        context = f"Document: {document_content}\n\nTask: {task_prompt}"

        # All agents process the same prompt simultaneously
        tasks = []
        for i, agent in enumerate(self.agents):
            tasks.append(agent.generate_response(task_prompt, context))

        interactions = await asyncio.gather(*tasks)

        for i, (agent, interaction) in enumerate(zip(self.agents, interactions)):
            step = CoordinationStep(
                step_number=i + 1,
                agent_id=agent.agent_id,
                input_context=context,
                output=interaction.response if interaction.success else f"ERROR: {interaction.error}",
                reasoning=f"Agent {agent.agent_id} ({agent.role}) parallel processing",
                timestamp=interaction.timestamp
            )

            coordination_steps.append(step)
            self.conversation_log.append({
                "step": i + 1,
                "agent": agent.agent_id,
                "role": agent.role,
                "model": agent.model,
                "task": agent.task_description,
                "input": context[:200] + "..." if len(context) > 200 else context,
                "output": step.output,
                "timestamp": step.timestamp,
                "success": interaction.success,
                "response_time": interaction.response_time
            })

        self.coordination_history.extend(coordination_steps)
        return coordination_steps

    def measure_coordination_effect(self, coordination_steps: List[CoordinationStep]) -> float:
        """Measure overall coordination effect"""
        # Get individual baseline performances
        individual_performances = []
        for agent in self.agents:
            if agent.performance_metrics["consistency_scores"]:
                individual_performances.append(np.mean(agent.performance_metrics["consistency_scores"]))
            else:
                individual_performances.append(0.5)  # Default baseline

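        # Approximate coordinated performance as the fraction of steps that succeeded.
        # Failed steps are recorded with an "ERROR: " prefix by the run_* methods, so
        # match the prefix instead of searching for "ERROR" anywhere in the output.
        successful_steps = [step for step in coordination_steps if not step.output.startswith("ERROR: ")]
        coordinated_performance = len(successful_steps) / len(coordination_steps) if coordination_steps else 0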

        gamma = CERTMeasurement.calculate_coordination_effect(individual_performances, coordinated_performance)

        self.coordination_effects.append({
            "timestamp": datetime.now(),
            "individual_performances": individual_performances,
            "coordinated_performance": coordinated_performance,
            "coordination_effect": gamma,
            "successful_steps": len(successful_steps),
            "total_steps": len(coordination_steps)
        })

        return gamma

class CERTVisualizer:
    """Creates interactive visualizations of coordination behavior"""

    @staticmethod
    def get_agent_color(agent_index: int, total_agents: int) -> str:
        """Generate distinct colors for any number of agents"""
        # Use a color palette that works well for 2-10 agents
        color_palette = [
            "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
            "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"
        ]
        return color_palette[agent_index % len(color_palette)]

    @staticmethod
    def get_agent_icon(agent_index: int) -> str:
        """Generate distinct icons for any number of agents"""
        icons = ["🔍", "⚡", "🎯", "🧠", "📊", "🔬", "💡", "🎨", "📈", "🔧"]
        return icons[agent_index % len(icons)]

    @staticmethod
    def display_agent_conversation(conversation_log: List[Dict]):
        """Display clear agent conversation flow with formatted text"""
        print("\n" + "="*80)
        print("🤝 AGENT CONVERSATION FLOW")
        print("="*80)

        # Create agent-to-index mapping for consistent coloring
        unique_agents = []
        for entry in conversation_log:
            if entry["agent"] not in unique_agents:
                unique_agents.append(entry["agent"])

        for entry in conversation_log:
            step_num = entry["step"]
            agent_id = entry["agent"]
            role = entry["role"]
            output = entry["output"]
            success = entry["success"]

            # Get consistent color and icon for this agent
            agent_index = unique_agents.index(agent_id)
            icon = CERTVisualizer.get_agent_icon(agent_index)

            # ANSI color codes for terminal output
            color_codes = [
                "\033[94m",  # Blue
                "\033[91m",  # Red
                "\033[92m",  # Green
                "\033[93m",  # Yellow
                "\033[95m",  # Magenta
                "\033[96m",  # Cyan
                "\033[97m",  # White
                "\033[90m",  # Gray
                "\033[94m",  # Blue (repeat for >8 agents)
                "\033[91m"   # Red
            ]

            color = color_codes[agent_index % len(color_codes)]
            reset_color = "\033[0m"

            print(f"\n{color}{'='*60}{reset_color}")
            print(f"{color}{icon} STEP {step_num}: {agent_id.upper()} ({role}){reset_color}")
            print(f"{color}{'='*60}{reset_color}")

            if success:
                # Clean and format the output
                cleaned_output = output.strip()
                if len(cleaned_output) > 1000:
                    # Show first part and indicate truncation
                    print(f"{cleaned_output[:1000]}...")
                    print(f"\n{color}[Output truncated - showing first 1000 characters]{reset_color}")
                else:
                    print(cleaned_output)
            else:
                print(f"❌ ERROR: {output}")

            print(f"{color}{'='*60}{reset_color}")

    @staticmethod
    def display_complete_responses_text(conversation_log: List[Dict]):
        """Display complete agent responses in clean text format for easy reading"""
        print("\n" + "📖" * 60)
        print("📖 COMPLETE AGENT RESPONSES - FULL TEXT")
        print("📖" * 60)

        # Print every step with its metadata and the untruncated response
        for entry in conversation_log:
            step_num = entry["step"]
            agent_id = entry["agent"]
            role = entry["role"]
            model = entry.get("model", "Unknown")
            task = entry.get("task", "Analysis and processing")
            output = entry["output"]
            success = entry["success"]
            response_time = entry.get("response_time", 0)

            print(f"\n{'='*100}")
            print(f"STEP {step_num}: {agent_id.upper()}")
            print(f"Role: {role}")
            print(f"Model: {model}")
            print(f"Task: {task}")
            print(f"Response Time: {response_time:.2f}s")
            print(f"Status: {'✅ Success' if success else '❌ Error'}")
            print(f"{'='*100}")

            if success:
                print("\nCOMPLETE RESPONSE:")
                print(f"{'-'*100}")
                print(output)
                print(f"{'-'*100}")
            else:
                print(f"\n❌ ERROR: {output}")

            print()  # Extra space between agents

        print("📖" * 60)

    @staticmethod
    def display_final_synthesis(conversation_log: List[Dict]):
        """Highlight the final output (last successful agent response)"""
        final_entry = None
        for entry in reversed(conversation_log):
            if entry["success"]:
                final_entry = entry
                break

        if final_entry:
            print("\n" + "🎯" * 30)
            print("🎯 FINAL PIPELINE RESULT")
            print("🎯" * 30)
            print(f"\nFinal Agent: {final_entry['agent']} ({final_entry['role']})")
            print(f"Model: {final_entry.get('model', 'Unknown')}")
            print("\n" + "-" * 80)
            print(final_entry["output"])
            print("-" * 80)
            print("🎯" * 30)
        else:
            print("\n❌ No successful agent responses found in conversation log")

    @staticmethod
    def create_conversation_table(conversation_log: List[Dict]) -> go.Figure:
        """Create comprehensive table showing complete agent responses"""
        if not conversation_log:
            return go.Figure().add_annotation(text="No conversation data available")

        # Prepare table data with COMPLETE responses
        steps = []
        agents = []
        roles = []
        models = []
        tasks = []
        responses = []
        status = []
        response_times = []

        for entry in conversation_log:
            steps.append(f"Step {entry['step']}")
            agents.append(entry['agent'])
            roles.append(entry['role'])
            models.append(entry.get('model', 'Unknown'))

            # Get complete task description
            task_desc = entry.get('task', 'Analysis and processing')
            tasks.append(task_desc)

            # COMPLETE response text - NO TRUNCATION
            response_text = entry['output']
            responses.append(response_text)

            status.append("✅ Success" if entry['success'] else "❌ Error")
            response_times.append(f"{entry.get('response_time', 0):.2f}s")

        # Create comprehensive table with complete responses
        fig = go.Figure(data=[go.Table(
            columnwidth=[60, 120, 150, 120, 150, 800, 80, 100],  # Wider response column
            header=dict(
                values=[
                    '<b>Step</b>',
                    '<b>Agent ID</b>',
                    '<b>Role</b>',
                    '<b>Model</b>',
                    '<b>Task</b>',
                    '<b>Complete Response</b>',
                    '<b>Status</b>',
                    '<b>Time</b>'
                ],
                fill_color='lightblue',
                align='left',
                font=dict(size=12, color='black'),
                height=50
            ),
            cells=dict(
                values=[steps, agents, roles, models, tasks, responses, status, response_times],
                fill_color=[['white', 'lightgray'] * len(steps)],
                align='left',
                font=dict(size=11),
                height=200,  # Taller cells to accommodate full responses
                line=dict(color='darkslategray', width=1)
            )
        )])

        fig.update_layout(
            title=f"Complete Agent Pipeline Responses - {len(conversation_log)} Processing Steps",
            height=max(600, len(conversation_log) * 220 + 200),  # Dynamic height for readability
            margin=dict(l=20, r=20, t=80, b=20),
            font=dict(family="Arial, sans-serif")
        )

        return fig

    @staticmethod
    def create_conversation_timeline(conversation_log: List[Dict]) -> go.Figure:
        """Create clear step-by-step conversation timeline for any number of agents"""
        if not conversation_log:
            return go.Figure().add_annotation(text="No conversation data available")

        # Extract data from conversation log
        steps = [entry['step'] for entry in conversation_log]
        agents = [entry['agent'] for entry in conversation_log]
        roles = [entry['role'] for entry in conversation_log]
        outputs = [entry['output'][:150] + "..." if len(entry['output']) > 150 else entry['output'] for entry in conversation_log]
        success_status = ["Success" if entry['success'] else "Error" for entry in conversation_log]

        # Create agent-to-index mapping for consistent coloring
        unique_agents = []
        for agent in agents:
            if agent not in unique_agents:
                unique_agents.append(agent)

        # Generate colors for each agent
        colors = []
        for agent in agents:
            agent_index = unique_agents.index(agent)
            colors.append(CERTVisualizer.get_agent_color(agent_index, len(unique_agents)))

        fig = go.Figure()

        # Add scatter plot points for each step
        fig.add_trace(go.Scatter(
            x=steps,
            y=[1] * len(steps),  # All on same horizontal line
            mode='markers+text',
            marker=dict(
                size=25,
                color=colors,
                line=dict(width=2, color='white'),
                opacity=0.8
            ),
            text=[f"Step {step}<br>{agent}" for step, agent in zip(steps, agents)],
            textposition="top center",
            hovertemplate="<b>Step %{x}</b><br>Agent: %{customdata[0]}<br>Role: %{customdata[1]}<br>Status: %{customdata[2]}<br>Output: %{customdata[3]}<extra></extra>",
            customdata=list(zip(agents, roles, success_status, outputs)),
            name="Processing Pipeline"
        ))

        # Add connecting lines between steps
        if len(steps) > 1:
            fig.add_trace(go.Scatter(
                x=steps,
                y=[1] * len(steps),
                mode='lines',
                line=dict(color='gray', width=3, dash='dash'),
                showlegend=False,
                hoverinfo='skip'
            ))

        # Add agent legend
        for i, agent in enumerate(unique_agents):
            fig.add_trace(go.Scatter(
                x=[None], y=[None],
                mode='markers',
                marker=dict(
                    size=15,
                    color=CERTVisualizer.get_agent_color(i, len(unique_agents))
                ),
                name=agent,
                showlegend=True
            ))

        fig.update_layout(
            title=f"Agent Processing Pipeline - {len(unique_agents)} Agents, {len(steps)} Sequential Steps",
            xaxis_title="Processing Step",
            yaxis=dict(visible=False),  # Hide y-axis since all points are on same line
            height=400,
            xaxis=dict(
                tickmode='linear',
                tick0=1,
                dtick=1,
                range=[0.5, max(steps) + 0.5]
            ),
            legend=dict(
                orientation="h",
                yanchor="bottom",
                y=1.02,
                xanchor="right",
                x=1
            )
        )

        return fig

    @staticmethod
    def create_performance_dashboard(agents: List[ClaudeAgent],
                                   coordination_effects: List[Dict]) -> go.Figure:
        """Create comprehensive performance dashboard"""

        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=(
                "Agent Consistency Scores (β)",
                "Response Times",
                "Coordination Effects (γ)",
                "Success Rates"
            ),
            specs=[
                [{"type": "bar"}, {"type": "scatter"}],
                [{"type": "scatter"}, {"type": "bar"}]
            ]
        )

        # Consistency scores
        agent_names = [agent.agent_id for agent in agents]
        consistency_scores = []

        for agent in agents:
            if agent.performance_metrics["consistency_scores"]:
                consistency_scores.append(np.mean(agent.performance_metrics["consistency_scores"]))
            else:
                consistency_scores.append(0.0)

        fig.add_trace(
            go.Bar(
                x=agent_names,
                y=consistency_scores,
                name="Consistency (β)",
                marker=dict(color=consistency_scores, colorscale="Viridis")
            ),
            row=1, col=1
        )

        # Response times
        for agent in agents:
            if agent.performance_metrics["response_times"]:
                fig.add_trace(
                    go.Scatter(
                        x=list(range(len(agent.performance_metrics["response_times"]))),
                        y=agent.performance_metrics["response_times"],
                        mode="lines+markers",
                        name=f"{agent.agent_id} Response Time"
                    ),
                    row=1, col=2
                )

        # Coordination effects
        if coordination_effects:
            gammas = [ce["coordination_effect"] for ce in coordination_effects]
            timestamps = [ce["timestamp"] for ce in coordination_effects]

            fig.add_trace(
                go.Scatter(
                    x=timestamps,
                    y=gammas,
                    mode="lines+markers",
                    name="Coordination Effect (γ)",
                    line=dict(color="red")
                ),
                row=2, col=1
            )

        # Success rates
        success_rates = []
        for agent in agents:
            total = agent.performance_metrics["success_count"] + agent.performance_metrics["error_count"]
            if total > 0:
                success_rates.append(agent.performance_metrics["success_count"] / total)
            else:
                success_rates.append(0.0)

        fig.add_trace(
            go.Bar(
                x=agent_names,
                y=success_rates,
                name="Success Rate",
                marker=dict(color=success_rates, colorscale="RdYlGn")
            ),
            row=2, col=2
        )

        fig.update_layout(height=800, title_text="CERT Agents Pipeline Performance Dashboard")
        return fig

class PDFProcessor:
    """Handles PDF upload and text extraction"""

    @staticmethod
    def upload_and_extract() -> Dict[str, str]:
        """Upload PDF and extract text content"""
        print("📄 Upload PDF documents for analysis...")
        uploaded = files.upload()

        documents = {}
        for filename, content in uploaded.items():
            if filename.lower().endswith('.pdf'):
                try:
                    pdf_reader = PyPDF2.PdfReader(io.BytesIO(content))
                    text = ""
                    for page in pdf_reader.pages:
                        # Guard against pages that yield no text (e.g. scanned images)
                        text += (page.extract_text() or "") + "\n"

                    documents[filename] = text
                    print(f"✅ Extracted {len(text)} characters from {filename}")

                except Exception as e:
                    print(f"❌ Error processing {filename}: {str(e)}")
            else:
                print(f"⚠️ Skipping {filename} (not a PDF)")

        return documents

# Available Claude Models
CLAUDE_MODELS = {
    "claude-opus-4-20250514": "Claude Opus 4 (Latest, Most Powerful)",
    "claude-sonnet-4-20250514": "Claude Sonnet 4 (Latest, Balanced)",
    "claude-3-5-haiku-20241022": "Claude 3.5 Haiku (Fast)",
    "claude-3-7-sonnet-20250219": "Claude 3.7 Sonnet",
    "claude-3-5-sonnet-20241022": "Claude 3.5 Sonnet",
    "claude-3-5-sonnet-20240620": "Claude 3.5 Sonnet (June)",
    "claude-3-haiku-20240307": "Claude 3 Haiku (Legacy)"
}

# Configuration Interface
def create_agent_config_interface(max_agents: int = 10):
    """Create interactive interface for configuring agents"""

    def create_agent_widgets(agent_num: int):
        agent_id = widgets.Text(
            value=f"agent_{agent_num}",
            description=f"Agent {agent_num} ID:",
            style={'description_width': 'initial'}
        )

        model = widgets.Dropdown(
            options=list(CLAUDE_MODELS.keys()),
            value="claude-sonnet-4-20250514",
            description="Model:",
            style={'description_width': 'initial'}
        )

        role = widgets.Text(
            value=f"Analyst {agent_num}",
            description="Role:",
            style={'description_width': 'initial'}
        )

        task = widgets.Textarea(
            value="Analyze documents and provide insights based on your specialized perspective",
            description="Task:",
            style={'description_width': 'initial'}
        )

        return {
            "id": agent_id,
            "model": model,
            "role": role,
            "task": task
        }

    # Number of agents selector
    num_agents = widgets.IntSlider(
        value=3,
        min=2,
        max=max_agents,
        description="Number of Agents:",
        style={'description_width': 'initial'}
    )

    # Global task configuration
    global_task = widgets.Textarea(
        value="Analyze the uploaded PDF document and provide comprehensive insights",
        description="Global Task:",
        style={'description_width': 'initial'}
    )

    coordination_pattern = widgets.Dropdown(
        options=["sequential", "parallel"],
        value="sequential",
        description="Coordination Pattern:",
        style={'description_width': 'initial'}
    )

    return {
        "num_agents": num_agents,
        "global_task": global_task,
        "coordination_pattern": coordination_pattern,
        "create_agent_widgets": create_agent_widgets
    }



In [None]:
async def run_claude_cert_demo(
    api_key: str,
    agent_configs: List[Dict[str, str]],
    global_task: str,
    coordination_pattern: str = "sequential",
    consistency_trials: int = 3
):
    """
    Run the complete Claude-only CERT coordination demonstration

    Args:
        api_key: Anthropic API key
        agent_configs: List of agent configurations with id, model, role, task
        global_task: Overall coordination task
        coordination_pattern: "sequential" or "parallel"
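        consistency_trials: Number of trials for consistency measurement

    Returns:
        Dict with agents, orchestrator, coordination_steps, coordination_effect,
        conversation_log and documents, or None if no document was uploaded.
    """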

    print("🎯 Claude-Only CERT Agents Pipeline Coordination Analysis")
    print("=" * 60)
    print(f"Agents: {len(agent_configs)}")
    print(f"Pattern: {coordination_pattern}")
    print(f"Task: {global_task}")
    print()

    # Create agents
    agents = []
    for config in agent_configs:
        agent = ClaudeAgent(
            agent_id=config["id"],
            model=config["model"],
            role=config["role"],
            task_description=config["task"],
            api_key=api_key
        )
        agents.append(agent)
        # Fall back to the raw model ID if it is not listed in CLAUDE_MODELS
        print(f"✅ Created {config['id']} using {CLAUDE_MODELS.get(config['model'], config['model'])}")

    # Upload and process documents
    documents = PDFProcessor.upload_and_extract()

    if not documents:
        print("❌ No documents uploaded")
        return None

    # Take first document for analysis
    doc_name, doc_content = list(documents.items())[0]
    print(f"\n📄 Analyzing: {doc_name}")

    # Phase 1: Individual agent consistency measurement
    print("\n🔍 Phase 1: Individual Agent Analysis")
    for agent in agents:
        print(f"  Testing {agent.agent_id}...")
        consistency = await agent.measure_consistency(global_task, consistency_trials)
        print(f"    Consistency (β): {consistency:.3f}")

    # Phase 2: Coordination measurement
    print(f"\n🤝 Phase 2: Agents Pipeline Coordination ({coordination_pattern})")

    orchestrator = CoordinationOrchestrator(agents)

    if coordination_pattern == "sequential":
        coordination_steps = await orchestrator.run_sequential_coordination(global_task, doc_content)
    else:
        coordination_steps = await orchestrator.run_parallel_coordination(global_task, doc_content)

    # Calculate coordination effect
    gamma = orchestrator.measure_coordination_effect(coordination_steps)
    print(f"Coordination Effect (γ): {gamma:.3f}")

    # Phase 3: Generate visualizations
    print("\n📊 Phase 3: Generating Results & Visualizations")

    # Display complete responses in clean text format
    CERTVisualizer.display_complete_responses_text(orchestrator.conversation_log)

    # Highlight the final result
    CERTVisualizer.display_final_synthesis(orchestrator.conversation_log)

    # Create interactive table for complete responses
    conversation_table = CERTVisualizer.create_conversation_table(orchestrator.conversation_log)
    dashboard_fig = CERTVisualizer.create_performance_dashboard(agents, orchestrator.coordination_effects)

    # Display results
    display(HTML("<h2>🎯 CERT Claude-Only Pipeline Analysis Results</h2>"))
    display(HTML(f"<p><b>Document:</b> {doc_name}</p>"))
    display(HTML(f"<p><b>Processing Pipeline:</b> {len(agents)} agents in {coordination_pattern} mode</p>"))

    print("\n📋 INTERACTIVE COMPLETE RESPONSES TABLE:")
    print("📖 Scrollable table with full agent responses - no truncation")
    display(conversation_table)

    print("\n📊 Performance Metrics Dashboard:")
    display(dashboard_fig)

    # Summary statistics
    print("\n📋 Summary Statistics:")
    print("=" * 30)

    for agent in agents:
        if agent.performance_metrics["consistency_scores"]:
            avg_consistency = np.mean(agent.performance_metrics["consistency_scores"])
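            response_times = agent.performance_metrics["response_times"]
            avg_response_time = np.mean(response_times) if response_times else 0.0
            total_calls = (
                agent.performance_metrics["success_count"] + agent.performance_metrics["error_count"]
            )
            # Avoid ZeroDivisionError when no calls were recorded
            success_rate = agent.performance_metrics["success_count"] / total_calls if total_calls else 0.0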

            print(f"{agent.agent_id}:")
            print(f"  • Model: {CLAUDE_MODELS.get(agent.model, agent.model)}")
            print(f"  • Consistency (β): {avg_consistency:.3f}")
            print(f"  • Avg Response Time: {avg_response_time:.2f}s")
            print(f"  • Success Rate: {success_rate:.2%}")

    print(f"\nOverall Coordination Effect (γ): {gamma:.3f}")

    # Enhanced coordination analysis
    print("\n" + "="*50)
    print("📊 PROCESSING PIPELINE ANALYSIS")
    print("="*50)

    # Label the messages with the pattern that actually ran (sequential or parallel)
    pattern_label = coordination_pattern.capitalize()
    if gamma > 1.2:
        print(f"🚀 STRONG PIPELINE BENEFIT - {pattern_label} processing significantly improves output quality")
        print(f"   Pipeline improvement: {(gamma-1)*100:.1f}% above individual agent baseline")
    elif gamma > 1.0:
        print(f"✅ POSITIVE PIPELINE EFFECT - {pattern_label} processing improves output quality")
        print(f"   Pipeline improvement: {(gamma-1)*100:.1f}% above individual agent baseline")
    elif gamma > 0.8:
        print(f"➡️ NEUTRAL PIPELINE EFFECT - Minimal benefit from {coordination_pattern} processing")
        print(f"   Pipeline change: {(gamma-1)*100:.1f}% from individual baseline")
    else:
        print(f"⚠️ NEGATIVE PIPELINE EFFECT - {pattern_label} processing may degrade quality")
        print(f"   Pipeline degradation: {(1-gamma)*100:.1f}% below individual baseline")

    print("\n💡 SCIENTIFIC INTERPRETATION:")
    print("This measures coordination effects in a predetermined processing pipeline where")
    print("each Claude instance processes the output of the previous one. This is valuable")
    print("infrastructure for current token-manipulation systems, not genuine agentic AI.")

    # Agent performance ranking
    agent_rankings = []
    for agent in agents:
        if agent.performance_metrics["consistency_scores"]:
            avg_consistency = np.mean(agent.performance_metrics["consistency_scores"])
            agent_rankings.append((agent.agent_id, avg_consistency, agent.model))

    agent_rankings.sort(key=lambda x: x[1], reverse=True)

    print("\n🏆 AGENT PERFORMANCE RANKING:")
    for rank, (agent_id, consistency, model) in enumerate(agent_rankings, start=1):
        print(f"   {rank}. {agent_id}: β={consistency:.3f} ({CLAUDE_MODELS.get(model, model)})")

    print("="*50)

    # Return complete results
    return {
        "agents": agents,
        "orchestrator": orchestrator,
        "coordination_steps": coordination_steps,
        "coordination_effect": gamma,
        "conversation_log": orchestrator.conversation_log,
        "documents": documents
    }

## Run the Demo

**Example Configuration**
```
agent_configs = [
    {
        "id": "primary_analyst",
        "model": "claude-sonnet-4-20250514",
        "role": "Primary Document Analyst",
        "task": "Extract key themes and main arguments from documents"
    },
    {
        "id": "critical_reviewer",
        "model": "claude-3-5-haiku-20241022",
        "role": "Critical Reviewer",
        "task": "Identify gaps, contradictions, and areas needing clarification"
    },
    {
        "id": "synthesizer",
        "model": "claude-opus-4-20250514",
        "role": "Synthesis Specialist",
        "task": "Integrate multiple perspectives into coherent conclusions"
    }
]

global_task = "Analyze the uploaded document for strategic insights and actionable recommendations"
coordination_pattern = "sequential"  # or "parallel"
consistency_trials = 3

results = await run_claude_cert_demo(
    api_key=ANTHROPIC_API_KEY,
    agent_configs=agent_configs,
    global_task=global_task,
    coordination_pattern=coordination_pattern,
    consistency_trials=consistency_trials
)
```
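
Before starting a run, it can help to check the configuration list, since a missing key or an unlisted model ID only surfaces once agents are being created. The following is a minimal sketch, not part of the notebook: `validate_agent_configs` and `REQUIRED_KEYS` are hypothetical names, the 2-10 agent range is taken from the setup instructions, and `known_models` is assumed to be the set of keys in `CLAUDE_MODELS`.

```python
# Hypothetical helper: not defined anywhere in the notebook.
REQUIRED_KEYS = {"id", "model", "role", "task"}


def validate_agent_configs(configs, known_models, min_agents=2, max_agents=10):
    """Return a list of problems; an empty list means the configs look usable."""
    problems = []
    if not min_agents <= len(configs) <= max_agents:
        problems.append(f"expected {min_agents}-{max_agents} agents, got {len(configs)}")
    seen_ids = set()
    for index, config in enumerate(configs):
        missing = REQUIRED_KEYS - set(config)
        if missing:
            problems.append(f"config {index}: missing keys {sorted(missing)}")
            continue
        if config["id"] in seen_ids:
            problems.append(f"config {index}: duplicate id {config['id']!r}")
        seen_ids.add(config["id"])
        if config["model"] not in known_models:
            problems.append(f"config {index}: unknown model {config['model']!r}")
    return problems


# Illustrative input only; in the notebook, pass set(CLAUDE_MODELS) instead.
known = {"claude-sonnet-4-20250514", "claude-opus-4-20250514"}
configs = [
    {"id": "a", "model": "claude-sonnet-4-20250514", "role": "Analyst", "task": "Summarize"},
    {"id": "a", "model": "not-a-model", "role": "Reviewer", "task": "Critique"},
]
problems = validate_agent_configs(configs, known)
for problem in problems:
    print(problem)
```

Returning a list of problems rather than raising on the first one lets all configuration mistakes be reported in a single pass.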

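The dictionary returned by `run_claude_cert_demo` can be inspected after the run. The sketch below assumes only the `agents` and `coordination_effect` keys of that dictionary and the `agent_id` and `performance_metrics["consistency_scores"]` attributes used elsewhere in the notebook. `interpret_gamma` and `summarize_results` are hypothetical helpers, and the stand-in objects and numbers are made up so the snippet runs without an API call. The thresholds mirror the ones the demo prints.

```python
from statistics import mean
from types import SimpleNamespace


def interpret_gamma(gamma: float) -> str:
    """Bucket a coordination effect using the same thresholds as the demo output."""
    if gamma > 1.2:
        return "strong"
    if gamma > 1.0:
        return "positive"
    if gamma > 0.8:
        return "neutral"
    return "negative"


def summarize_results(results: dict) -> dict:
    """Collect per-agent mean consistency and the overall gamma bucket."""
    per_agent = {}
    for agent in results["agents"]:
        scores = agent.performance_metrics["consistency_scores"]
        # Agents with no recorded trials are reported as None rather than 0.0
        per_agent[agent.agent_id] = mean(scores) if scores else None
    gamma = results["coordination_effect"]
    return {"gamma": gamma, "effect": interpret_gamma(gamma), "consistency": per_agent}


# Stand-in for the real return value (illustrative numbers, not measured results)
fake_results = {
    "agents": [
        SimpleNamespace(agent_id="primary_analyst",
                        performance_metrics={"consistency_scores": [0.5, 0.75]}),
        SimpleNamespace(agent_id="synthesizer",
                        performance_metrics={"consistency_scores": []}),
    ],
    "coordination_effect": 1.1,
}
summary = summarize_results(fake_results)
print(summary)
```

With a real run, `summarize_results(results)` would be called on the value returned by the demo, after checking that it is not `None`.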


In [None]:
agent_configs = [
    {
        "id": "primary_analyst",
        "model": "claude-sonnet-4-20250514",
        "role": "Primary Document Analyst",
        "task": "Extract key themes and main arguments from documents"
    },
    {
        "id": "critical_reviewer",
        "model": "claude-3-5-haiku-20241022",
        "role": "Critical Reviewer",
        "task": "Identify gaps, contradictions, and areas needing clarification"
    },
    {
        "id": "synthesizer",
        "model": "claude-opus-4-20250514",
        "role": "Synthesis Specialist",
        "task": "Integrate multiple perspectives into coherent conclusions"
    }
]

global_task = "List three main points from this document"
coordination_pattern = "sequential"  # or "parallel"
consistency_trials = 3

results = await run_claude_cert_demo(
    api_key=ANTHROPIC_API_KEY,
    agent_configs=agent_configs,
    global_task=global_task,
    coordination_pattern=coordination_pattern,
    consistency_trials=consistency_trials
)