# Systematic Multi-Agent Distributive Justice Analysis

This notebook systematically analyzes which distributive justice principles groups of agents agree on across different agent configurations.

## Experiment Design
- **Models**: "gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano" 
- **Temperatures**: 0, 1, 2
- **Prompts**: US, Canadian, Polish college students  
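- **Income Distributions**: 4 distributions from new_logic.md, all included in every configuration
- **Total Configurations**: 3 × 3 × 3 = **27 unique combinations** of model, temperature, and prompt (9 when `MODELS` lists a single model)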

## Configuration Generation Method
- **SYSTEMATIC**: Uses `itertools.product()` to generate each unique combination exactly once
- **NOT RANDOM**: Completely deterministic, exhaustive coverage
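
As a minimal sketch of the generation method, the snippet below enumerates the model, temperature, and prompt axes with `itertools.product()` and checks that no combination repeats. The values mirror the experiment design; the variable names are illustrative only.

```python
from itertools import product

models = ["gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"]
temperatures = [0, 1, 2]
prompts = ["us_student", "canadian_student", "polish_student"]

# product() walks the Cartesian product in a fixed, deterministic order:
# the rightmost axis (prompts) varies fastest.
combinations = list(product(models, temperatures, prompts))

print(len(combinations))       # 3 * 3 * 3 combinations
print(len(set(combinations)))  # same count, so there are no duplicates
print(combinations[0])         # first combination in iteration order
```

Because the order depends only on the input lists, rerunning the cell yields the same combinations with the same indices.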

## Analysis Sections
1. Configuration Setup & Generation
2. Parallel Batch Execution  
3. Results Analysis & Visualization
4. Parameter Impact Analysis
5. Satisfaction Analysis

## 1. Setup and Imports

In [1]:
# Setup paths and imports
import sys
import os
from pathlib import Path
import json
import yaml
from datetime import datetime
from itertools import product
import nest_asyncio

# Data analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Any

# Enable nested event loops for Jupyter
nest_asyncio.apply()

# Add MAAI to path
sys.path.insert(0, 'src')

# Import MAAI modules
try:
    from maai.runners import run_batch
    from maai.config.manager import load_config_from_file
    from maai.core.models import IncomeDistribution, IncomeClass
    print("✅ MAAI modules imported successfully")
except ImportError as e:
    print(f"❌ Error importing MAAI modules: {e}")
    print("💡 Make sure you're running from the correct directory")

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("🎯 Systematic Analysis Notebook Ready")
print("📊 Configuration: All models included (gpt-4.1, gpt-4.1-mini, gpt-4.1-nano)")

✅ MAAI modules imported successfully
🎯 Systematic Analysis Notebook Ready
📊 Configuration: All models included (gpt-4.1, gpt-4.1-mini, gpt-4.1-nano)


## 2. Configuration Setup

In [None]:
# Define experiment parameters
MODELS = ["gpt-4.1-mini"]
TEMPERATURES = [0, 1, 2]  # Removed 0.5 for simplicity
PROMPTS = {
    "us_student": "You are a US college student participating in an economic experiment about distributive justice principles. You will receive real monetary payouts based on your income class assignment after the group chooses a principle.",
    "canadian_student": "You are a Canadian college student participating in an economic experiment about distributive justice principles. You will receive real monetary payouts based on your income class assignment after the group chooses a principle.",
    "polish_student": "You are a Polish college student participating in an economic experiment about distributive justice principles. You will receive real monetary payouts based on your income class assignment after the group chooses a principle."
}

# Income distributions from new_logic.md specification
# IMPORTANT: Keys must match IncomeClass enum values exactly
INCOME_DISTRIBUTIONS = [
    {
        'distribution_id': 1,
        'name': 'Distribution 1',
        'income_by_class': {
            'High': 32000,           # IncomeClass.HIGH = "High"
            'Medium high': 27000,    # IncomeClass.MEDIUM_HIGH = "Medium high"
            'Medium': 24000,         # IncomeClass.MEDIUM = "Medium"
            'Medium low': 13000,     # IncomeClass.MEDIUM_LOW = "Medium low"
            'Low': 12000             # IncomeClass.LOW = "Low"
        }
    },
    {
        'distribution_id': 2,
        'name': 'Distribution 2',
        'income_by_class': {
            'High': 28000,
            'Medium high': 22000,
            'Medium': 20000,
            'Medium low': 17000,
            'Low': 13000
        }
    },
    {
        'distribution_id': 3,
        'name': 'Distribution 3',
        'income_by_class': {
            'High': 31000,
            'Medium high': 24000,
            'Medium': 21000,
            'Medium low': 16000,
            'Low': 14000
        }
    },
    {
        'distribution_id': 4,
        'name': 'Distribution 4',
        'income_by_class': {
            'High': 21000,
            'Medium high': 20000,
            'Medium': 19000,
            'Medium low': 16000,
            'Low': 15000
        }
    }
]

# Calculate total combinations
total_combinations = len(MODELS) * len(TEMPERATURES) * len(PROMPTS)

print(f"📋 Experiment Parameters:")
print(f"   Models: {len(MODELS)} ({', '.join(MODELS)})")
print(f"   Temperatures: {len(TEMPERATURES)} ({TEMPERATURES})")
print(f"   Prompts: {len(PROMPTS)} ({', '.join(PROMPTS.keys())})")
print(f"   Distributions: {len(INCOME_DISTRIBUTIONS)}")
print(f"   ")
print(f"🔢 Total Configurations: {total_combinations}")
print(f"   Generation method: SYSTEMATIC (each unique combination created exactly once)")
print(f"✅ Income class keys fixed to match IncomeClass enum values")
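
The requirement that `income_by_class` keys match the `IncomeClass` enum values exactly can be checked before any experiment runs. The sketch below uses a stand-in enum (`IncomeClassStandIn`) whose values are copied from the comments in the configuration cell; the real class is imported from `maai.core.models`, and the helper `find_income_key_problems` is hypothetical.

```python
from enum import Enum


class IncomeClassStandIn(Enum):
    """Stand-in for maai.core.models.IncomeClass (values taken from the cell comments)."""
    HIGH = "High"
    MEDIUM_HIGH = "Medium high"
    MEDIUM = "Medium"
    MEDIUM_LOW = "Medium low"
    LOW = "Low"


def find_income_key_problems(distribution: dict) -> list:
    """Return a list of key mismatches; an empty list means the keys match exactly."""
    expected = {member.value for member in IncomeClassStandIn}
    actual = set(distribution["income_by_class"])
    problems = [f"unexpected key: {key!r}" for key in sorted(actual - expected)]
    problems += [f"missing key: {key!r}" for key in sorted(expected - actual)]
    return problems


good = {"income_by_class": {"High": 32000, "Medium high": 27000, "Medium": 24000,
                            "Medium low": 13000, "Low": 12000}}
bad = {"income_by_class": {"High": 32000, "medium high": 27000, "Medium": 24000,
                           "Medium low": 13000, "Low": 12000}}

print(find_income_key_problems(good))  # no problems reported
print(find_income_key_problems(bad))   # flags the lowercase 'medium high' key
```

Running such a check over every entry of `INCOME_DISTRIBUTIONS` would surface a key typo before any API calls are made.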

## 3. Configuration Generation

In [None]:
def create_systematic_config(model: str, temperature: float, prompt_key: str, distribution_set: List[Dict], config_id: str) -> Dict:
    """
    Create a systematic experiment configuration.
    
    Args:
        model: Model name (e.g., "gpt-4.1-mini")
        temperature: Temperature value (0, 1, 2)
        prompt_key: Prompt identifier ("us_student", "canadian_student", "polish_student")
        distribution_set: List of income distributions
        config_id: Unique configuration identifier
    
    Returns:
        Configuration dictionary
    """
    prompt_text = PROMPTS[prompt_key]
    
    config = {
        'experiment_id': config_id,
        'global_temperature': temperature,
        
        # Metadata for analysis
        'systematic_analysis': {
            'model': model,
            'temperature': temperature,
            'prompt_type': prompt_key,
            'distribution_count': len(distribution_set)
        },
        
        'experiment': {
            'max_rounds': 6,  # Sufficient for consensus
            'decision_rule': 'unanimity',
            'timeout_seconds': 300
        },
        
        # New game logic settings
        'individual_rounds': 4,
        'payout_ratio': 0.0001,
        'enable_detailed_examples': True,
        'enable_secret_ballot': True,
        
        # Income distribution scenarios
        'income_distributions': distribution_set,
        
        # Memory strategy
        'memory_strategy': 'decomposed',
        
        # Agent configuration - 3 agents with same prompt
        'agents': [
            {
                'name': f'Agent_1_{prompt_key}',
                'model': model,
                'personality': prompt_text,
                'temperature': temperature
            },
            {
                'name': f'Agent_2_{prompt_key}',
                'model': model,
                'personality': prompt_text,
                'temperature': temperature
            },
            {
                'name': f'Agent_3_{prompt_key}',
                'model': model,
                'personality': prompt_text,
                'temperature': temperature
            }
        ],
        
        'defaults': {
            'personality': prompt_text,
            'model': model,
            'temperature': temperature
        },
        
        'output': {
            'directory': 'systematic_analysis_results',
            'formats': ['json']
        }
    }
    
    return config

def generate_all_configurations() -> List[Dict]:
    """
    Generate ALL unique experiment configurations systematically.
    Uses itertools.product() to ensure each combination appears exactly once.
    
    Returns:
        List of configuration dictionaries, one per unique combination
    """
    configurations = []
    config_index = 1
    
    # Create all combinations systematically - each appears exactly once
    for model, temperature, prompt_key in product(MODELS, TEMPERATURES, PROMPTS.keys()):
        config_id = f"systematic_{config_index:03d}_{model.replace('-', '_')}_{temperature}_{prompt_key}"
        
        config = create_systematic_config(
            model=model,
            temperature=temperature,
            prompt_key=prompt_key,
            distribution_set=INCOME_DISTRIBUTIONS,
            config_id=config_id
        )
        
        configurations.append(config)
        config_index += 1
    
    return configurations

# Generate configurations systematically
all_configs = generate_all_configurations()

print(f"✅ Generated {len(all_configs)} unique configurations systematically")
print(f"📊 Sample configuration IDs:")
for i, config in enumerate(all_configs[:5]):
    print(f"   {i+1}. {config['experiment_id']}")
if len(all_configs) > 5:
    print(f"   ... and {len(all_configs) - 5} more")
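
The configuration ID encodes the model, temperature, and prompt key, and the results-loading step later recovers them from the ID. Since both model names and prompt keys contain separators, a round-trip check is worthwhile. The sketch below is one possible parser, not part of MAAI; `build_config_id` and `parse_config_id` are hypothetical helper names, and the approach assumes model names contain no underscores of their own.

```python
def build_config_id(index: int, model: str, temperature, prompt_key: str) -> str:
    """Same ID layout as generate_all_configurations()."""
    return f"systematic_{index:03d}_{model.replace('-', '_')}_{temperature}_{prompt_key}"


def parse_config_id(config_id: str, prompt_keys) -> dict:
    """Recover the parameters from an ID by matching the prompt key as a suffix."""
    # Prompt keys contain underscores themselves, so match them whole
    # (longest first) instead of relying on a fixed split position.
    for key in sorted(prompt_keys, key=len, reverse=True):
        suffix = "_" + key
        if config_id.endswith(suffix):
            head = config_id[: -len(suffix)]
            break
    else:
        raise ValueError(f"no known prompt key in {config_id!r}")

    prefix, index, *model_parts, temperature = head.split("_")
    if prefix != "systematic" or not model_parts:
        raise ValueError(f"unexpected ID layout: {config_id!r}")
    return {
        "index": int(index),
        "model": "-".join(model_parts),  # undo the '-' -> '_' replacement
        "temperature": float(temperature),
        "prompt_type": key,
    }


prompt_keys = ["us_student", "canadian_student", "polish_student"]
config_id = build_config_id(1, "gpt-4.1-mini", 0, "us_student")
print(config_id)
print(parse_config_id(config_id, prompt_keys))
```

Storing the parameters in the result metadata, rather than parsing them back out of the ID, would avoid the ambiguity altogether if the result format allows it.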

## 4. Save Configurations

In [None]:
# Create directories
systematic_dir = Path("systematic_analysis")
config_dir = systematic_dir / "configs"
results_dir = systematic_dir / "results"

config_dir.mkdir(parents=True, exist_ok=True)
results_dir.mkdir(parents=True, exist_ok=True)

# Save all configurations
config_files = []
for config in all_configs:
    config_file = config_dir / f"{config['experiment_id']}.yaml"
    
    with open(config_file, 'w') as f:
        yaml.dump(config, f, indent=2, default_flow_style=False)
    
    config_files.append(config['experiment_id'])

print(f"✅ Saved {len(config_files)} configuration files to {config_dir}")
print(f"📁 Results will be saved to: {results_dir}")

# Display configuration summary
summary_df = pd.DataFrame([
    {
        'Config ID': config['experiment_id'],
        'Model': config['systematic_analysis']['model'],
        'Temperature': config['systematic_analysis']['temperature'],
        'Prompt Type': config['systematic_analysis']['prompt_type'],
        'Agents': len(config['agents']),
        'Distributions': len(config['income_distributions'])
    }
    for config in all_configs
])

print("\n📊 Configuration Summary:")
print(summary_df.head(10))
print(f"\nTotal configurations: {len(summary_df)}")

## 5. Parallel Batch Execution

This section runs all configurations through the existing batch execution system (`run_batch`), with up to `MAX_CONCURRENT` experiments running at once.
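
How `run_batch` limits concurrency internally is not shown in this notebook. As a generic illustration of the idea (not MAAI's implementation), the sketch below caps the number of in-flight tasks with an `asyncio.Semaphore`. The names `run_with_limit` and `fake_experiment` are hypothetical; the result keys `experiment_id`, `success`, and `error` mirror the fields this notebook reads from the batch results.

```python
import asyncio


async def run_with_limit(items, worker, max_concurrent=4):
    """Run worker(item) for every item, with at most max_concurrent running at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        async with semaphore:  # waits here while max_concurrent tasks are active
            try:
                return {"experiment_id": item, "success": True, "result": await worker(item)}
            except Exception as exc:
                # Record the failure instead of cancelling the whole batch
                return {"experiment_id": item, "success": False, "error": str(exc)}

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_experiment(config_id):
    """Placeholder for a real experiment run."""
    await asyncio.sleep(0.01)
    if config_id.endswith("bad"):
        raise RuntimeError("simulated failure")
    return f"done:{config_id}"
```

In a notebook cell this could be called as `results = await run_with_limit(config_ids, fake_experiment, max_concurrent=4)`; in a plain script, wrap the call in `asyncio.run(...)`.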

In [None]:
# Configuration for batch execution
MAX_CONCURRENT = 4  # Conservative for API limits
BATCH_SIZE = 12     # Used only for the batch-count estimate below; run_batch receives all configs in one call

print(f"🚀 Starting systematic batch execution")
print(f"   Total configurations: {len(config_files)}")
print(f"   Max concurrent: {MAX_CONCURRENT}")
print(f"   Batch size: {BATCH_SIZE}")
print(f"   Estimated batches: {(len(config_files) + BATCH_SIZE - 1) // BATCH_SIZE}")

# All configurations will be run (no test mode)
config_files_to_run = config_files

print(f"\n📋 Configurations to execute: {len(config_files_to_run)}")

In [None]:
# Execute batch experiments
import time

async def run_systematic_experiments():
    """
    Run all systematic experiments using the existing batch system.
    """
    batch_start_time = time.time()
    
    # Run all experiments in parallel
    results = await run_batch(
        config_files_to_run,
        max_concurrent=MAX_CONCURRENT,
        output_dir=str(results_dir),
        config_dir=str(config_dir)
    )
    
    batch_duration = time.time() - batch_start_time
    
    # Analyze results
    successful = [r for r in results if r.get('success', False)]
    failed = [r for r in results if not r.get('success', False)]
    
    print(f"\n🎯 Systematic Execution Complete!")
    print(f"   Total time: {batch_duration:.1f}s ({batch_duration/60:.1f} minutes)")
    print(f"   Successful: {len(successful)}/{len(results)}")
    print(f"   Failed: {len(failed)}/{len(results)}")
    print(f"   Average per experiment: {batch_duration/max(len(results), 1):.1f}s")  # max() avoids dividing by zero on an empty result list
    
    if failed:
        print(f"\n❌ Failed experiments:")
        for result in failed[:5]:  # Show first 5 failures
            print(f"   - {result.get('experiment_id', 'unknown')}: {result.get('error', 'unknown error')}")
    
    return results

# Run the experiments
print("⏳ Starting batch execution...")
experiment_results = await run_systematic_experiments()

## 6. Results Loading and Processing

In [None]:
def load_experiment_data(results_dir: Path) -> pd.DataFrame:
    """
    Load and process all experiment results into a structured DataFrame.
    
    Args:
        results_dir: Directory containing result JSON files
    
    Returns:
        DataFrame with experiment results and metadata
    """
    result_files = list(results_dir.glob("*.json"))
    
    if not result_files:
        print("❌ No result files found")
        return pd.DataFrame()
    
    print(f"📊 Loading {len(result_files)} experiment results...")
    
    all_data = []
    
    for result_file in result_files:
        try:
            with open(result_file, 'r') as f:
                data = json.load(f)
            
            # Extract experiment metadata
            metadata = data.get('experiment_metadata', {})
            consensus = metadata.get('final_consensus', {})
            
            # Parse config ID for systematic analysis metadata
            experiment_id = metadata.get('experiment_id', result_file.stem)
            
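            # Extract systematic analysis parameters from config ID.
            # Layout: systematic_<index>_<model, '-' replaced by '_'>_<temperature>_<prompt_key>.
            # Each prompt key is two tokens (e.g. "us_student"), so the prompt is the
            # last two parts and the temperature is the part just before them.
            id_parts = experiment_id.split('_')
            if len(id_parts) >= 6:
                model = '-'.join(id_parts[2:-3])
                temperature = float(id_parts[-3])
                prompt_type = '_'.join(id_parts[-2:])
            else:
                model = 'unknown'
                temperature = 0.0
                prompt_type = 'unknown'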
            
            # Extract agent data
            agent_data = {k: v for k, v in data.items() if k != 'experiment_metadata'}
            
            # Process each agent's data
            for agent_name, agent_info in agent_data.items():
                agent_record = {
                    'experiment_id': experiment_id,
                    'agent_name': agent_name,
                    'model': model,
                    'temperature': temperature,
                    'prompt_type': prompt_type,
                    'consensus_reached': consensus.get('agreement_reached', False),
                    'agreed_principle': consensus.get('agreed_principle', None),
                    'total_duration': metadata.get('total_duration_seconds', 0),
                    'final_principle_id': consensus.get('principle_choice', {}).get('principle_id', None),
                }
                
                # Extract preference rankings
                rankings = agent_info.get('preference_rankings', [])
                if rankings:
                    # Get initial, post-individual, and final rankings
                    initial_rankings = [r for r in rankings if r.get('phase') == 'initial']
                    post_individual_rankings = [r for r in rankings if r.get('phase') == 'post_individual']
                    final_rankings = [r for r in rankings if r.get('phase') == 'final']
                    
                    if initial_rankings:
                        agent_record['initial_ranking'] = initial_rankings[-1].get('rankings', [])
                        agent_record['initial_certainty'] = initial_rankings[-1].get('certainty_level', 'unknown')
                    
                    if post_individual_rankings:
                        agent_record['post_individual_ranking'] = post_individual_rankings[-1].get('rankings', [])
                        agent_record['post_individual_certainty'] = post_individual_rankings[-1].get('certainty_level', 'unknown')
                    
                    if final_rankings:
                        agent_record['final_ranking'] = final_rankings[-1].get('rankings', [])
                        agent_record['final_certainty'] = final_rankings[-1].get('certainty_level', 'unknown')
                
                # Extract economic outcomes
                outcomes = agent_info.get('economic_outcomes', [])
                total_payout = sum(outcome.get('payout_amount', 0) for outcome in outcomes)
                agent_record['total_payout'] = total_payout
                agent_record['num_outcomes'] = len(outcomes)
                
                all_data.append(agent_record)
        
        except Exception as e:
            print(f"⚠️ Error processing {result_file}: {e}")
            continue
    
    df = pd.DataFrame(all_data)
    print(f"✅ Loaded {len(df)} agent records from {len(result_files)} experiments")
    
    return df

# Load the data
results_df = load_experiment_data(results_dir)

if not results_df.empty:
    print(f"\n📊 Dataset Overview:")
    print(f"   Total agent records: {len(results_df)}")
    print(f"   Unique experiments: {results_df['experiment_id'].nunique()}")
    print(f"   Models: {results_df['model'].unique()}")
    print(f"   Temperatures: {sorted(results_df['temperature'].unique())}")
    print(f"   Prompt types: {results_df['prompt_type'].unique()}")
    print(f"   Consensus rate: {results_df['consensus_reached'].mean():.1%}")
    
    # Show sample data
    print("\n🔍 Sample data:")
    print(results_df[['experiment_id', 'model', 'temperature', 'prompt_type', 'consensus_reached', 'agreed_principle']].head())
else:
    print("❌ No data loaded. Check if experiments completed successfully.")

## 7. Final Justice Criterion Summary

In [None]:
if not results_df.empty:
    # Overall distribution of justice criteria
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Final Justice Criterion Analysis', fontsize=16, fontweight='bold')
    
    # 1. Overall distribution
    ax1 = axes[0, 0]
    experiment_consensus = results_df.groupby('experiment_id')['agreed_principle'].first()
    principle_counts = experiment_consensus.value_counts()
    
    if not principle_counts.empty:
        principle_counts.plot(kind='bar', ax=ax1, color='skyblue')
        ax1.set_title('Distribution of Agreed Justice Principles')
        ax1.set_ylabel('Number of Experiments')
        ax1.tick_params(axis='x', rotation=45)
        
        # Add value labels
        for i, v in enumerate(principle_counts.values):
            ax1.text(i, v + 0.1, str(v), ha='center', va='bottom')
    
    # 2. Consensus achievement rate
    ax2 = axes[0, 1]
    consensus_rate = results_df.groupby('experiment_id')['consensus_reached'].first().value_counts()
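    # value_counts() orders by frequency, so pick each color from the index value
    # (green = consensus reached, red = no consensus) rather than by position
    consensus_rate.plot(kind='pie', ax=ax2, autopct='%1.1f%%',
                        colors=['#7fbf7f' if reached else '#ff7f7f' for reached in consensus_rate.index])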
    ax2.set_title('Consensus Achievement Rate')
    ax2.set_ylabel('')
    
    # 3. Principles by model
    ax3 = axes[1, 0]
    model_principles = results_df.groupby(['experiment_id', 'model'])['agreed_principle'].first().reset_index()
    model_crosstab = pd.crosstab(model_principles['model'], model_principles['agreed_principle'])
    
    if not model_crosstab.empty:
        model_crosstab.plot(kind='bar', ax=ax3, stacked=True)
        ax3.set_title('Justice Principles by Model')
        ax3.set_ylabel('Number of Experiments')
        ax3.tick_params(axis='x', rotation=45)
        ax3.legend(title='Principle', bbox_to_anchor=(1.05, 1), loc='upper left')
    
    # 4. Principles by prompt type
    ax4 = axes[1, 1]
    prompt_principles = results_df.groupby(['experiment_id', 'prompt_type'])['agreed_principle'].first().reset_index()
    prompt_crosstab = pd.crosstab(prompt_principles['prompt_type'], prompt_principles['agreed_principle'])
    
    if not prompt_crosstab.empty:
        prompt_crosstab.plot(kind='bar', ax=ax4, stacked=True)
        ax4.set_title('Justice Principles by Prompt Type')
        ax4.set_ylabel('Number of Experiments')
        ax4.tick_params(axis='x', rotation=45)
        ax4.legend(title='Principle', bbox_to_anchor=(1.05, 1), loc='upper left')
    
    plt.tight_layout()
    plt.show()
    
    # Summary statistics
    print("\n📊 Justice Criterion Summary:")
    print("=" * 50)
    
    if not principle_counts.empty:
        for principle, count in principle_counts.items():
            percentage = (count / len(experiment_consensus)) * 100
            print(f"   {principle}: {count} experiments ({percentage:.1f}%)")
    
    consensus_achieved = results_df.groupby('experiment_id')['consensus_reached'].first().sum()
    total_experiments = results_df['experiment_id'].nunique()
    print(f"\n   Consensus achieved: {consensus_achieved}/{total_experiments} experiments ({consensus_achieved/total_experiments:.1%})")
else:
    print("⚠️ No data available for justice criterion analysis")

## 8. Parameter Impact Analysis

In [None]:
if not results_df.empty:
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Parameter Impact Analysis', fontsize=16, fontweight='bold')
    
    # Prepare experiment-level data
    exp_data = results_df.groupby('experiment_id').agg({
        'model': 'first',
        'temperature': 'first', 
        'prompt_type': 'first',
        'consensus_reached': 'first',
        'agreed_principle': 'first',
        'total_duration': 'first',
        'total_payout': 'mean'  # Average payout across agents
    }).reset_index()
    
    # 1. Model impact on consensus
    ax1 = axes[0, 0]
    model_consensus = exp_data.groupby('model')['consensus_reached'].agg(['mean', 'count']).reset_index()
    bars1 = ax1.bar(model_consensus['model'], model_consensus['mean'], 
                    color=['lightblue', 'lightcoral'])
    ax1.set_title('Consensus Rate by Model')
    ax1.set_ylabel('Consensus Rate')
    ax1.set_ylim(0, 1)
    
    # Add value labels and counts
    for i, (bar, rate, count) in enumerate(zip(bars1, model_consensus['mean'], model_consensus['count'])):
        ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                f'{rate:.1%}\n(n={count})', ha='center', va='bottom')
    
    # 2. Temperature impact on consensus
    ax2 = axes[0, 1]
    temp_consensus = exp_data.groupby('temperature')['consensus_reached'].agg(['mean', 'count']).reset_index()
    ax2.plot(temp_consensus['temperature'], temp_consensus['mean'], 'o-', linewidth=2, markersize=8)
    ax2.set_title('Consensus Rate by Temperature')
    ax2.set_xlabel('Temperature')
    ax2.set_ylabel('Consensus Rate')
    ax2.set_ylim(0, 1)
    ax2.grid(True, alpha=0.3)
    
    # Add value labels
    for temp, rate, count in zip(temp_consensus['temperature'], temp_consensus['mean'], temp_consensus['count']):
        ax2.annotate(f'{rate:.1%}\n(n={count})', (temp, rate), 
                    textcoords="offset points", xytext=(0,10), ha='center')
    
    # 3. Prompt type impact on consensus
    ax3 = axes[1, 0]
    prompt_consensus = exp_data.groupby('prompt_type')['consensus_reached'].agg(['mean', 'count']).reset_index()
    bars3 = ax3.bar(prompt_consensus['prompt_type'], prompt_consensus['mean'],
                    color=['lightgreen', 'lightyellow', 'lightpink'])
    ax3.set_title('Consensus Rate by Prompt Type')
    ax3.set_ylabel('Consensus Rate')
    ax3.set_ylim(0, 1)
    ax3.tick_params(axis='x', rotation=45)
    
    # Add value labels
    for bar, rate, count in zip(bars3, prompt_consensus['mean'], prompt_consensus['count']):
        ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                f'{rate:.1%}\n(n={count})', ha='center', va='bottom')
    
    # 4. Duration by parameters
    ax4 = axes[1, 1]
    duration_by_temp = exp_data.groupby(['temperature', 'consensus_reached'])['total_duration'].mean().unstack()
    
    if duration_by_temp.shape[1] > 1:
        duration_by_temp.plot(kind='bar', ax=ax4, color=['red', 'green'])
        ax4.set_title('Average Duration by Temperature & Consensus')
        ax4.set_xlabel('Temperature')
        ax4.set_ylabel('Duration (seconds)')
        ax4.legend(['No Consensus', 'Consensus Reached'])
        ax4.tick_params(axis='x', rotation=0)
    else:
        # Fallback if no variation in consensus
        temp_duration = exp_data.groupby('temperature')['total_duration'].mean()
        temp_duration.plot(kind='bar', ax=ax4, color='steelblue')
        ax4.set_title('Average Duration by Temperature')
        ax4.set_xlabel('Temperature')
        ax4.set_ylabel('Duration (seconds)')
    
    plt.tight_layout()
    plt.show()
    
    # Statistical summary
    print("\n📈 Parameter Impact Summary:")
    print("=" * 50)
    
    print("\n🤖 Model Impact:")
    for _, row in model_consensus.iterrows():
        print(f"   {row['model']}: {row['mean']:.1%} consensus rate (n={row['count']})")
    
    print("\n🌡️ Temperature Impact:")
    for _, row in temp_consensus.iterrows():
        print(f"   Temperature {row['temperature']}: {row['mean']:.1%} consensus rate (n={row['count']})")
    
    print("\n💬 Prompt Type Impact:")
    for _, row in prompt_consensus.iterrows():
        print(f"   {row['prompt_type']}: {row['mean']:.1%} consensus rate (n={row['count']})")
else:
    print("⚠️ No data available for parameter impact analysis")
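
The consensus rates in this section are computed per experiment, not per agent: the agent rows are first collapsed to one row per `experiment_id`, and only then averaged by parameter. The sketch below reproduces that two-step logic in plain Python so it runs without pandas; `consensus_rate_by` is a hypothetical helper and the sample rows are invented for illustration.

```python
from collections import defaultdict


def consensus_rate_by(agent_rows, parameter):
    """Return {parameter value: (consensus rate, number of experiments)}."""
    # Step 1: keep the first row seen for each experiment (like groupby(...).first())
    experiments = {}
    for row in agent_rows:
        experiments.setdefault(row["experiment_id"], row)

    # Step 2: tally consensus outcomes per parameter value
    totals = defaultdict(lambda: [0, 0])  # value -> [experiments with consensus, experiments]
    for row in experiments.values():
        tally = totals[row[parameter]]
        tally[0] += bool(row["consensus_reached"])
        tally[1] += 1
    return {value: (reached / count, count) for value, (reached, count) in totals.items()}


# Three agents per experiment, as in the generated configurations (illustrative data)
rows = []
for exp_id, temp, reached in [("exp1", 0, True), ("exp2", 0, False), ("exp3", 1, True)]:
    for agent in ("Agent_1", "Agent_2", "Agent_3"):
        rows.append({"experiment_id": exp_id, "agent_name": agent,
                     "temperature": temp, "consensus_reached": reached})

print(consensus_rate_by(rows, "temperature"))
```

Averaging directly over agent rows gives the same rates only while every experiment contributes the same number of agents, which is why the collapse step comes first.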

## 9. Satisfaction Analysis

In [None]:
if not results_df.empty:
    # Analyze preference rankings evolution
    def analyze_preference_evolution(df):
        """
        Analyze how agent preferences evolve through the experiment phases.
        """
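        # Filter to agents with complete ranking data. The ranking columns exist
        # only if at least one agent reported rankings, so check for them first.
        required_cols = ['initial_ranking', 'final_ranking']
        if not all(col in df.columns for col in required_cols):
            print("⚠️ No complete preference ranking data available")
            return None
        # Copy so that adding the 'stability' column does not modify a slice of df
        complete_agents = df.dropna(subset=required_cols).copy()
        
        if complete_agents.empty:
            print("⚠️ No complete preference ranking data available")
            return None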
        
        print(f"\n📊 Preference Evolution Analysis (n={len(complete_agents)} agents)")
        print("=" * 60)
        
        # Calculate preference stability
        def ranking_stability(initial, final):
            """Calculate how much rankings changed between phases."""
            if not isinstance(initial, list) or not isinstance(final, list):
                return None
            if len(initial) != len(final) or not initial:  # also skip empty rankings (avoids division by zero)
                return None
            return sum(1 for i, f in zip(initial, final) if i == f) / len(initial)
        
        complete_agents['stability'] = complete_agents.apply(
            lambda row: ranking_stability(row['initial_ranking'], row['final_ranking']), axis=1
        )
        
        # Remove rows where stability couldn't be calculated
        stable_agents = complete_agents.dropna(subset=['stability'])
        
        if not stable_agents.empty:
            print(f"\n🎯 Preference Stability:")
            print(f"   Mean stability: {stable_agents['stability'].mean():.1%}")
            print(f"   Stability by temperature:")
            temp_stability = stable_agents.groupby('temperature')['stability'].mean()
            for temp, stability in temp_stability.items():
                print(f"     Temperature {temp}: {stability:.1%}")
        
        return stable_agents
    
    # Run preference evolution analysis
    stable_data = analyze_preference_evolution(results_df)
    
    # Visualization of preference analysis
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Satisfaction and Preference Analysis', fontsize=16, fontweight='bold')
    
    # 1. Preference stability by temperature
    ax1 = axes[0, 0]
    if stable_data is not None and not stable_data.empty:
        temp_stability = stable_data.groupby('temperature')['stability'].agg(['mean', 'std', 'count']).reset_index()
        ax1.bar(temp_stability['temperature'], temp_stability['mean'], 
                yerr=temp_stability['std'], capsize=5, color='lightblue')
        ax1.set_title('Preference Stability by Temperature')
        ax1.set_xlabel('Temperature')
        ax1.set_ylabel('Stability (proportion same rankings)')
        ax1.set_ylim(0, 1)
        
        # Add value labels
        for temp, mean, count in zip(temp_stability['temperature'], temp_stability['mean'], temp_stability['count']):
            ax1.text(temp, mean + 0.05, f'{mean:.1%}\n(n={count})', ha='center', va='bottom')
    
    # 2. Certainty levels distribution
    ax2 = axes[0, 1]
    certainty_cols = ['initial_certainty', 'final_certainty']
    certainty_data = []
    
    for col in certainty_cols:
        if col in results_df.columns:
            certainty_counts = results_df[col].value_counts()
            for cert, count in certainty_counts.items():
                certainty_data.append({'Phase': col.replace('_certainty', ''), 'Certainty': cert, 'Count': count})
    
    if certainty_data:
        cert_df = pd.DataFrame(certainty_data)
        cert_pivot = cert_df.pivot(index='Certainty', columns='Phase', values='Count').fillna(0)
        cert_pivot.plot(kind='bar', ax=ax2, stacked=True)
        ax2.set_title('Certainty Levels Distribution')
        ax2.set_ylabel('Number of Agents')
        ax2.tick_params(axis='x', rotation=45)
        ax2.legend(title='Phase')
    
    # 3. Payout distribution by consensus
    ax3 = axes[1, 0]
    if 'total_payout' in results_df.columns:
        consensus_groups = results_df.groupby('consensus_reached')['total_payout']
        
        # Box plot
        box_data = [group.values for name, group in consensus_groups]
        box_labels = [f"{'Consensus' if name else 'No Consensus'}\n(n={len(group)})" 
                     for name, group in consensus_groups]
        
        ax3.boxplot(box_data, labels=box_labels)
        ax3.set_title('Payout Distribution by Consensus')
        ax3.set_ylabel('Total Payout ($)')
    
    # 4. Model vs Prompt interaction
    ax4 = axes[1, 1]
    model_prompt_consensus = results_df.groupby(['model', 'prompt_type'])['consensus_reached'].mean().unstack()
    
    if not model_prompt_consensus.empty:
        sns.heatmap(model_prompt_consensus, annot=True, fmt='.1%', ax=ax4, cmap='RdYlGn')
        ax4.set_title('Consensus Rate: Model × Prompt Type')
        ax4.set_xlabel('Prompt Type')
        ax4.set_ylabel('Model')
    
    plt.tight_layout()
    plt.show()
    
    # Additional satisfaction metrics
    print("\n💰 Economic Outcome Analysis:")
    print("=" * 40)
    
    if 'total_payout' in results_df.columns:
        payout_stats = results_df.groupby('consensus_reached')['total_payout'].describe()
        print("\n📊 Payout Statistics by Consensus:")
        print(payout_stats.round(4))
        
        # Payout by parameters
        print("\n💎 Average Payout by Parameters:")
        temp_payout = results_df.groupby('temperature')['total_payout'].mean()
        for temp, payout in temp_payout.items():
            print(f"   Temperature {temp}: ${payout:.4f}")
else:
    print("⚠️ No data available for satisfaction analysis")
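
The stability metric used in this section is the share of rank positions that hold the same principle in the initial and final rankings. A minimal worked example follows, using placeholder principle labels (`P1` to `P4`) that are not the experiment's actual principle names:

```python
def ranking_stability(initial, final):
    """Fraction of positions whose entry is identical in both rankings."""
    if not isinstance(initial, list) or not isinstance(final, list):
        return None
    if len(initial) != len(final) or not initial:
        return None  # rankings of different (or zero) length are not comparable
    matches = sum(1 for before, after in zip(initial, final) if before == after)
    return matches / len(initial)


initial = ["P1", "P2", "P3", "P4"]
final = ["P1", "P3", "P2", "P4"]  # the middle two principles swapped places

print(ranking_stability(initial, final))  # 0.5
```

Note that this position-wise measure counts a single adjacent swap as two changed positions, so it is stricter than a rank-correlation measure would be.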

## 10. Comprehensive Summary Report

In [None]:
if not results_df.empty:
    print("\n" + "=" * 80)
    print("📋 SYSTEMATIC ANALYSIS COMPREHENSIVE REPORT")
    print("=" * 80)
    
    # Overall statistics
    total_experiments = results_df['experiment_id'].nunique()
    total_agents = len(results_df)
    consensus_rate = results_df.groupby('experiment_id')['consensus_reached'].first().mean()
    
    print(f"\n🎯 EXPERIMENT OVERVIEW:")
    print(f"   Total experiments completed: {total_experiments}")
    print(f"   Total agent records: {total_agents}")
    print(f"   Overall consensus rate: {consensus_rate:.1%}")
    print(f"   Average experiment duration: {results_df['total_duration'].mean():.1f} seconds")
    
    # Parameter breakdown
    print(f"\n🔬 PARAMETER COVERAGE:")
    print(f"   Models tested: {', '.join(results_df['model'].unique())}")
    print(f"   Temperatures tested: {sorted(results_df['temperature'].unique())}")
    print(f"   Prompt types tested: {', '.join(results_df['prompt_type'].unique())}")
    
    # Key findings
    print(f"\n🏆 KEY FINDINGS:")
    
    # Best performing configurations
    exp_summary = results_df.groupby(['model', 'temperature', 'prompt_type']).agg({
        'consensus_reached': 'mean',
        'experiment_id': 'nunique',  # count experiments, not agent rows
        'total_duration': 'mean'
    }).round(3).reset_index()
    
    best_consensus = exp_summary.loc[exp_summary['consensus_reached'].idxmax()]
    print(f"   Best consensus rate: {best_consensus['consensus_reached']:.1%}")
    print(f"     Model: {best_consensus['model']}")
    print(f"     Temperature: {best_consensus['temperature']}")
    print(f"     Prompt: {best_consensus['prompt_type']}")
    
    # Most common principle
    principle_counts = results_df.groupby('experiment_id')['agreed_principle'].first().value_counts()
    if not principle_counts.empty:
        most_common = principle_counts.index[0]
        most_common_count = principle_counts.iloc[0]
        print(f"\n   Most agreed upon principle: {most_common}")
        print(f"     Chosen in {most_common_count}/{total_experiments} experiments ({most_common_count/total_experiments:.1%})")
    
    # Temperature effects
    temp_effects = results_df.groupby('temperature')['consensus_reached'].mean()
    best_temp = temp_effects.idxmax()
    print(f"\n   Most effective temperature: {best_temp} ({temp_effects[best_temp]:.1%} consensus rate)")
    
    # Model comparison
    model_comparison = results_df.groupby('model')['consensus_reached'].mean()
    print(f"\n   Model performance:")
    for model, rate in model_comparison.items():
        print(f"     {model}: {rate:.1%} consensus rate")
    
    # Prompt type effects
    prompt_effects = results_df.groupby('prompt_type')['consensus_reached'].mean()
    best_prompt = prompt_effects.idxmax()
    print(f"\n   Most effective prompt type: {best_prompt} ({prompt_effects[best_prompt]:.1%} consensus rate)")
    
    print(f"\n📊 DATA EXPORT:")
    
    # Save summary data
    summary_file = systematic_dir / "analysis_summary.csv"
    results_df.to_csv(summary_file, index=False)
    print(f"   Full dataset saved to: {summary_file}")
    
    # Save experiment-level summary
    exp_level_summary = results_df.groupby('experiment_id').agg({
        'model': 'first',
        'temperature': 'first',
        'prompt_type': 'first',
        'consensus_reached': 'first',
        'agreed_principle': 'first',
        'total_duration': 'first',
        'total_payout': 'mean'
    }).reset_index()
    
    exp_summary_file = systematic_dir / "experiment_summary.csv"
    exp_level_summary.to_csv(exp_summary_file, index=False)
    print(f"   Experiment summary saved to: {exp_summary_file}")
    
    print(f"\n✅ SYSTEMATIC ANALYSIS COMPLETE")
    print(f"   Results available in: {systematic_dir}")
    # Use the per-experiment table so a duration is not counted once per agent row
    print(f"   Total execution time: {exp_level_summary['total_duration'].sum():.0f} seconds")
    print(f"   Average per experiment: {exp_level_summary['total_duration'].mean():.1f} seconds")
    
else:
    print("❌ No data available for comprehensive report")
    print("💡 Make sure experiments completed successfully before running analysis")

## 11. Next Steps and Configuration Options

### To Run Full Analysis:
1. Set `TEST_MODE = False` in the execution section
2. Increase `MAX_CONCURRENT` if your system can handle more parallel experiments
3. Confirm that `gpt-4.1` is in the `MODELS` list for complete coverage

### To Modify Parameters:
- **Models**: Edit the `MODELS` list
- **Temperatures**: Modify the `TEMPERATURES` list  
- **Prompts**: Update the `PROMPTS` dictionary
- **Distributions**: Modify `INCOME_DISTRIBUTIONS` (currently matches new_logic.md)
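
To see how editing these lists changes the number of experiments, the grid can be rebuilt with `itertools.product()`. This is a minimal sketch: the list names follow the notebook, but the prompt keys other than `us_student`, the prompt texts, and all distribution labels are placeholders, and `build_grid` is a hypothetical helper rather than the notebook's own function.

```python
from itertools import product

# Names mirror the notebook; values marked "placeholder" are assumptions.
MODELS = ["gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"]
TEMPERATURES = [0, 1, 2]
PROMPTS = {
    "us_student": "...",        # key seen in the experiment ID; prompt text elided
    "canadian_student": "...",  # placeholder key
    "polish_student": "...",    # placeholder key
}
INCOME_DISTRIBUTIONS = ["dist_1", "dist_2", "dist_3", "dist_4"]  # placeholder labels


def build_grid(models, temperatures, prompts, distributions):
    """Return one dict per unique (model, temperature, prompt, distribution)."""
    return [
        {"model": m, "temperature": t, "prompt_type": p, "distribution": d}
        for m, t, p, d in product(models, temperatures, prompts, distributions)
    ]


grid = build_grid(MODELS, TEMPERATURES, PROMPTS, INCOME_DISTRIBUTIONS)
print(len(grid))  # 3 * 3 * 3 * 4 = 108
```

Removing one entry from any list shrinks the grid by that dimension's share; for example, two temperatures instead of three would give 72 configurations.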

### Analysis Extensions:
- Add more detailed preference evolution tracking
- Include individual agent conversation analysis
- Add statistical significance testing
- Export results for external analysis tools
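
One possible starting point for the significance-testing extension is a permutation test on consensus rates between two parameter settings. The sketch below uses only the standard library; the function name `permutation_test` and the sample outcomes are illustrative assumptions, not part of MAAI or of this notebook's results.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in consensus rates.

    group_a, group_b: lists of booleans, one entry per experiment.
    Returns the observed rate difference and an approximate p-value.
    """
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate away from an exact zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up outcomes for illustration only, not results from this notebook.
temp_0 = [True] * 10 + [False] * 2
temp_2 = [True] * 4 + [False] * 8
diff, p = permutation_test(temp_0, temp_2)
print(f"rate difference: {diff:.3f}, p = {p:.4f}")
```

Feeding it one boolean per experiment (for instance the per-experiment `consensus_reached` values) rather than one per agent row avoids inflating the sample size.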

This notebook provides a complete framework for systematic analysis of multi-agent distributive justice experiments with parallel execution and comprehensive visualization.

In [2]:
# Run specific configuration: configs/lucas_new.yaml
print("🧪 Running specific configuration: configs/lucas_new.yaml")

async def run_specific_config():
    """Run the lucas_new.yaml configuration specifically."""
    
    # Run single experiment
    from maai.runners import run_experiment
    
    try:
        result = await run_experiment(
            'lucas_new',  # Configuration name (without .yaml extension)
            output_dir='systematic_analysis/specific_results',  # Custom output directory
            config_dir='configs'  # Standard configs directory
        )
        
        if result.get('success', False):
            print(f"✅ SUCCESS: {result['experiment_id']}")
            print(f"   Output saved to: {result['output_path']}")
            print(f"   Duration: {result.get('duration_seconds', 0):.1f} seconds")
            
            # Load and display basic results
            try:
                import json
                with open(result['output_path'], 'r') as f:
                    data = json.load(f)
                
                metadata = data.get('experiment_metadata', {})
                consensus = metadata.get('final_consensus', {})
                
                print("\n📊 Quick Results:")
                print(f"   Consensus reached: {consensus.get('agreement_reached', 'Unknown')}")
                if consensus.get('agreed_principle'):
                    print(f"   Agreed principle: {consensus.get('agreed_principle')}")
                print(f"   Total duration: {metadata.get('total_duration_seconds', 0):.1f}s")
                
                # Count agents
                agent_count = len([k for k in data.keys() if k != 'experiment_metadata'])
                print(f"   Agents: {agent_count}")
                
            except Exception as e:
                print(f"   ⚠️ Could not load detailed results: {e}")
                
        else:
            print(f"❌ FAILED: {result.get('error', 'Unknown error')}")
            
    except Exception as e:
        print(f"❌ Error running experiment: {e}")
        import traceback
        traceback.print_exc()

# Run the specific configuration
await run_specific_config()

🧪 Running specific configuration: configs/lucas_new.yaml
Loaded 3 agents:
  - Agent_1_us_student: gpt-4.1 (custom personality)
  - Agent_2_us_student: gpt-4.1 (custom personality)
  - Agent_3_us_student: gpt-4.1 (custom personality)

=== Starting New Game Logic Experiment ===
Experiment ID: systematic_001_gpt_4.1_0_us_student
Agents: 3
Individual Rounds: 4 (fixed per new_logic.md)
Group Deliberation Max Rounds: 6
Income Distributions: 4
Payout Ratio: $0.0001 per $1

--- Initializing Agents ---
Created 3 deliberation agents
  - Agent_1_us_student (agent_1)
  - Agent_2_us_student (agent_2)
  - Agent_3_us_student (agent_3)

=== PHASE 1: Individual Familiarization ===

--- Phase 1.1: Initial Preference Ranking ---
Collected initial rankings from 3 agents

--- Phase 1.2: Detailed Examples ---
Presenting detailed examples of principle outcomes to agents...
  Agent_1_us_student: RunResult:
- Last agent: Agent(name="Agent_1_us_student", ...)
- Final output (str):
    Certainly! Here’s my summa

## 12. Run Specific Configuration

Run a specific existing configuration file for testing or individual analysis.
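
Before launching a single run, it can help to check that the configuration file actually exists. The helper below is a hypothetical pre-flight check, not an MAAI function. It assumes the layout implied by the `configs/lucas_new.yaml` example, where the configuration name is the file name without the `.yaml` extension.

```python
from pathlib import Path


def resolve_config_path(config_name, config_dir="configs"):
    """Return the YAML path for a config name, or raise if it is missing.

    Hypothetical helper: it assumes the name-plus-.yaml layout implied by
    the `configs/lucas_new.yaml` example, not a documented MAAI rule.
    """
    path = Path(config_dir) / f"{config_name}.yaml"
    if not path.is_file():
        raise FileNotFoundError(f"No configuration file at {path}")
    return path


try:
    print(resolve_config_path("lucas_new"))
except FileNotFoundError as exc:
    print(f"Skipping run: {exc}")
```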