# AutoGen Framework Baseline for Clinical Trial NLP

## Overview

This notebook demonstrates how to use Microsoft AutoGen to build a multi-agent system for clinical trial natural language inference (NLI). AutoGen excels at multi-agent collaboration through conversation patterns.

### Why AutoGen?
- **Multi-agent dialogue**: Natural conversation between specialized agents
- **Role specialization**: Each agent has specific expertise
- **Flexible coordination**: Supervisor manages task distribution
- **Robust reasoning**: Multiple perspectives can improve accuracy

### Agent Architecture
Based on the teaching material, we implement:
1. **Supervisor**: Coordinates tasks and manages workflow
2. **Medical Expert**: Understands medical terminology and concepts
3. **Numerical Analyzer**: Processes quantitative data and statistics
4. **Logic Checker**: Validates logical relationships
5. **Aggregator**: Makes final decisions based on all inputs

## Setup and Installation

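First, let's load the environment variables and import the required libraries. The cells below assume the packages are already installed (`pyautogen`, which provides the `autogen` module used here, plus `python-dotenv`, `pandas`, and `tqdm`) and that `OPENAI_API_KEY` is defined in the environment or in a `.env` file: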

In [None]:
# Load environment variables
from dotenv import load_dotenv
import os

load_dotenv()
print("✅ Environment loaded")

In [None]:
# Import required libraries
import json
import pandas as pd
from tqdm import tqdm
import autogen
from typing import Dict, List, Any, Optional
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully")

## Data Loading and Utilities

Let's create utility functions to load clinical trial data:

In [None]:
def load_clinical_trial(trial_id: str) -> Dict[str, Any]:
    """Load clinical trial data from JSON file.
    
    Args:
        trial_id: The NCT identifier for the clinical trial
        
    Returns:
        Dictionary containing trial data or error information
    """
    try:
        file_path = os.path.join("training_data", "CT json", f"{trial_id}.json")
        with open(file_path, "r", encoding="utf-8") as f:
            data = json.load(f)
        return data
    except FileNotFoundError:
        return {"error": f"Clinical trial {trial_id} not found"}
    except Exception as e:
        return {"error": f"Error loading {trial_id}: {str(e)}"}

def load_dataset(filepath: str) -> Dict[str, Any]:
    """Load training or test dataset.
    
    Args:
        filepath: Path to the JSON dataset file
        
    Returns:
        Dictionary containing the dataset
    """
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            return json.load(f)
    except Exception as e:
        print(f"Error loading dataset: {e}")
        return {}

# Test data loading
sample_trial = load_clinical_trial("NCT00066573")
print(f"✅ Data loading functions ready. Sample trial loaded: {sample_trial.get('Clinical Trial ID', 'Error')}")
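
The loader returns the whole trial record, which can make prompts long. A minimal sketch of trimming a record to the section named by `Section_id` follows. It assumes the trial JSON is keyed by section names such as "Eligibility" or "Results"; `select_section` and the toy record are hypothetical and not part of the original notebook.

```python
from typing import Any, Dict, Optional


def select_section(trial: Dict[str, Any], section_id: Optional[str]) -> Dict[str, Any]:
    """Return only the requested section when it exists, else the whole record."""
    if "error" in trial:
        # Pass the loader's error payload through unchanged
        return trial
    if section_id and section_id in trial:
        return {
            "Clinical Trial ID": trial.get("Clinical Trial ID"),
            section_id: trial[section_id],
        }
    # Unknown or missing section: fall back to the full record
    return trial


# Made-up record for illustration only (not real trial data)
toy_trial = {
    "Clinical Trial ID": "NCT00000000",
    "Eligibility": ["Adults aged 18 or older"],
    "Results": ["Outcome measurement: example"],
}
print(select_section(toy_trial, "Results"))
```

Falling back to the full record keeps the helper safe when a section name does not match the file's keys.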

## AutoGen Agent Configuration

Now let's configure our multi-agent system. Each agent has a specialized role and specific instructions:

In [None]:
# Configuration for OpenAI models
config_list = [
    {
        "model": "gpt-4o-mini",
        "api_key": os.getenv("OPENAI_API_KEY")
    }
]

# LLM configuration
llm_config = {
    "config_list": config_list,
    "temperature": 0.1,  # Low temperature for consistent results
    "timeout": 120,
}

print("✅ AutoGen configuration ready")
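
If `OPENAI_API_KEY` is missing, `os.getenv` returns `None` and the failure only surfaces on the first model call. The sketch below builds the same configuration dictionary but fails early. `build_llm_config` is a hypothetical helper, and the key in the example is a placeholder.

```python
from typing import Any, Dict, Optional


def build_llm_config(
    api_key: Optional[str],
    model: str = "gpt-4o-mini",
    temperature: float = 0.1,
    timeout: int = 120,
) -> Dict[str, Any]:
    """Build the LLM configuration dictionary, failing early when the key is missing."""
    if not api_key:
        raise ValueError("OPENAI_API_KEY is not set; add it to the environment or a .env file")
    return {
        "config_list": [{"model": model, "api_key": api_key}],
        "temperature": temperature,
        "timeout": timeout,
    }


# In the notebook this would be: build_llm_config(os.getenv("OPENAI_API_KEY"))
example_config = build_llm_config("sk-placeholder")
print(example_config["config_list"][0]["model"])
```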

## Agent Definitions

Let's define each agent with their specific roles and capabilities:

In [None]:
# 1. Supervisor Agent - Coordinates the overall workflow
supervisor = autogen.AssistantAgent(
    name="supervisor",
    system_message="""You are the Supervisor Agent responsible for coordinating clinical trial analysis.
    
    Your responsibilities:
    1. Break down complex clinical trial questions into manageable tasks
    2. Assign tasks to appropriate specialist agents
    3. Coordinate information flow between agents
    4. Ensure all aspects of the statement are thoroughly analyzed
    5. Guide the final decision-making process
    
    When given a statement and clinical trial data, you should:
    - First understand what the statement claims
    - Identify which sections of the trial are relevant
    - Delegate specific analysis tasks to specialist agents
    - Collect and synthesize their findings
    
    Always maintain a systematic approach and ensure thorough analysis.
    """,
    llm_config=llm_config,
)

# 2. Medical Expert Agent - Handles medical terminology and concepts
medical_expert = autogen.AssistantAgent(
    name="medical_expert",
    system_message="""You are the Medical Expert Agent specializing in clinical trial analysis.
    
    Your expertise includes:
    1. Medical terminology and clinical concepts
    2. Understanding of trial phases, interventions, and outcomes
    3. Clinical significance of findings
    4. Medical relationships and causations
    5. Safety profiles and adverse events
    
    When analyzing statements:
    - Focus on the medical accuracy and clinical relevance
    - Identify key medical terms and their implications
    - Assess whether medical claims align with trial evidence
    - Consider clinical context and medical plausibility
    
    Provide clear medical reasoning for your analysis.
    """,
    llm_config=llm_config,
)

# 3. Numerical Analyzer Agent - Processes quantitative data
numerical_analyzer = autogen.AssistantAgent(
    name="numerical_analyzer",
    system_message="""You are the Numerical Analyzer Agent specializing in quantitative analysis.
    
    Your expertise includes:
    1. Statistical analysis and interpretation
    2. Numerical comparisons and calculations
    3. Percentage, ratios, and statistical significance
    4. Data ranges, confidence intervals, and error margins
    5. Trend analysis and numerical patterns
    
    When analyzing statements:
    - Extract and verify all numerical claims
    - Perform calculations to validate numerical relationships
    - Compare stated numbers with trial data
    - Assess statistical significance and clinical relevance
    - Identify any numerical inconsistencies or errors
    
    Be precise and show your calculations when relevant.
    """,
    llm_config=llm_config,
)

# 4. Logic Checker Agent - Validates logical relationships
logic_checker = autogen.AssistantAgent(
    name="logic_checker",
    system_message="""You are the Logic Checker Agent responsible for validating logical reasoning.
    
    Your expertise includes:
    1. Logical consistency and coherence
    2. Cause-and-effect relationships
    3. Conditional logic and implications
    4. Contradiction detection
    5. Multi-step reasoning validation
    
    When analyzing statements:
    - Evaluate the logical structure of claims
    - Check for internal consistency
    - Identify logical fallacies or contradictions
    - Assess the validity of inferences
    - Ensure conclusions follow from premises
    
    Focus on the logical soundness rather than medical or numerical details.
    """,
    llm_config=llm_config,
)

# 5. Aggregator Agent - Makes final decisions
aggregator = autogen.AssistantAgent(
    name="aggregator",
    system_message="""You are the Aggregator Agent responsible for making final entailment decisions.
    
    Your responsibilities:
    1. Synthesize findings from all specialist agents
    2. Weigh different types of evidence appropriately
    3. Resolve conflicts between agent analyses
    4. Make the final Entailment vs Contradiction decision
    5. Provide confidence assessment for the decision
    
    Decision criteria:
    - ENTAILMENT: Statement is directly supported by trial data
    - CONTRADICTION: Statement is refuted by trial data
    
    When making decisions:
    - Consider input from Medical Expert, Numerical Analyzer, and Logic Checker
    - Prioritize evidence that directly addresses the statement
    - Be conservative: if evidence is unclear, consider context carefully
    - Always provide reasoning for your final decision
    
    Output format: "FINAL_DECISION: [Entailment/Contradiction]"
    """,
    llm_config=llm_config,
)

# User proxy for managing the conversation
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    # "content" can be present but None, so coerce to an empty string before searching
    is_termination_msg=lambda x: "FINAL_DECISION:" in (x.get("content") or ""),
    code_execution_config=False,
)

print("✅ All agents created successfully")
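
The termination check can also be written as a named function, which makes it easy to test without calling a model. This is a sketch: `is_final_decision` is a hypothetical name, and it assumes messages are dictionaries whose "content" value may be a string, missing, or `None`.

```python
from typing import Any, Dict


def is_final_decision(message: Dict[str, Any]) -> bool:
    """Return True when the message text contains the FINAL_DECISION marker."""
    # "content" may be absent or None, so coerce to an empty string first
    content = message.get("content") or ""
    return "FINAL_DECISION:" in content


print(is_final_decision({"content": "FINAL_DECISION: Entailment"}))
print(is_final_decision({"content": None}))
```

A function like this could be passed as `is_termination_msg=is_final_decision` in place of a lambda.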

## Group Chat Setup

We'll set up a group chat where agents can collaborate on analyzing each statement:

In [None]:
# Create group chat with all agents
groupchat = autogen.GroupChat(
    agents=[supervisor, medical_expert, numerical_analyzer, logic_checker, aggregator, user_proxy],
    messages=[],
    max_round=15,  # Allow sufficient rounds for thorough analysis
    speaker_selection_method="round_robin",  # Ensure all agents participate
)

# Create group chat manager
manager = autogen.GroupChatManager(
    groupchat=groupchat,
    llm_config=llm_config
)

print("✅ Group chat configured")
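
To see what round-robin selection implies for this agent list, the sketch below models the speaking order in plain Python. It is not AutoGen's implementation. It assumes the speaker after the initiating agent is simply the next agent in list order, wrapping around at the end; `round_robin_order` is a hypothetical helper.

```python
from itertools import cycle, islice
from typing import List


def round_robin_order(agent_names: List[str], starter: str, turns: int) -> List[str]:
    """Model the speaking order when turns follow list order after the starter."""
    start = (agent_names.index(starter) + 1) % len(agent_names)
    rotated = agent_names[start:] + agent_names[:start]
    return list(islice(cycle(rotated), turns))


names = [
    "supervisor",
    "medical_expert",
    "numerical_analyzer",
    "logic_checker",
    "aggregator",
    "user_proxy",
]
order = round_robin_order(names, starter="user_proxy", turns=6)
print(order)
# ['supervisor', 'medical_expert', 'numerical_analyzer', 'logic_checker', 'aggregator', 'user_proxy']
```

Under this model the aggregator speaks fifth, so a `FINAL_DECISION:` message can appear within the first pass, well inside `max_round=15`.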

## Clinical Trial Analysis Pipeline

Now let's create our main analysis pipeline that coordinates the multi-agent analysis:

In [None]:
def analyze_statement_with_agents(statement: str, primary_id: str, secondary_id: Optional[str] = None, 
                                section_id: Optional[str] = None) -> str:
    """
    Analyze a statement using the multi-agent system.
    
    Args:
        statement: The natural language statement to analyze
        primary_id: Primary clinical trial ID
        secondary_id: Secondary trial ID for comparison statements
        section_id: Relevant section of the trial (Eligibility, Intervention, Results, Adverse Events)
        
    Returns:
        Final decision: 'Entailment' or 'Contradiction'
    """
    
    # Load clinical trial data
    primary_data = load_clinical_trial(primary_id)
    secondary_data = None
    if secondary_id:
        secondary_data = load_clinical_trial(secondary_id)
    
    # Build the optional secondary-trial block up front: a backslash inside an
    # f-string expression is a SyntaxError before Python 3.12
    secondary_block = ""
    if secondary_data:
        secondary_block = "SECONDARY TRIAL DATA:\n" + json.dumps(secondary_data, indent=2)
    
    # Prepare the analysis context
    context = f"""
    CLINICAL TRIAL ANALYSIS TASK
    
    Statement to analyze: "{statement}"
    
    Primary Trial ID: {primary_id}
    Secondary Trial ID: {secondary_id if secondary_id else "N/A"}
    Relevant Section: {section_id if section_id else "All sections"}
    
    PRIMARY TRIAL DATA:
    {json.dumps(primary_data, indent=2)}
    
    {secondary_block}
    
    TASK: Determine whether the statement represents ENTAILMENT (supported by data) or CONTRADICTION (refuted by data).
    
    Supervisor: Please coordinate the analysis. Each specialist should examine the statement from their domain perspective.
    """
    
    # Reset group chat for new analysis
    groupchat.messages = []
    
    # Start the multi-agent analysis
    try:
        user_proxy.initiate_chat(
            manager,
            message=context
        )
        
        # Extract final decision from the last message
        final_message = groupchat.messages[-1]["content"] if groupchat.messages else ""
        
        # Parse the final decision
        if "FINAL_DECISION: Entailment" in final_message:
            return "Entailment"
        elif "FINAL_DECISION: Contradiction" in final_message:
            return "Contradiction"
        else:
            # Fallback: look for decision keywords in the final message
            if "entailment" in final_message.lower():
                return "Entailment"
            else:
                return "Contradiction"
                
    except Exception as e:
        print(f"Error in analysis: {e}")
        return "Contradiction"  # Conservative fallback

print("✅ Analysis pipeline ready")
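
The keyword fallback in the pipeline treats any message containing "entailment" as a positive decision, including text such as "this is not an entailment". A stricter parser is sketched below as one possible alternative. `parse_final_decision` is a hypothetical helper; it only accepts a label that follows the `FINAL_DECISION:` marker and returns `None` otherwise, leaving the fallback policy to the caller.

```python
import re
from typing import Optional

# Accept optional whitespace, "[" or "*" between the marker and the label
_DECISION_RE = re.compile(
    r"FINAL_DECISION:[\s\[*]*(entailment|contradiction)", re.IGNORECASE
)


def parse_final_decision(text: Optional[str]) -> Optional[str]:
    """Return 'Entailment' or 'Contradiction' from the last marker, else None."""
    if not text:
        return None
    matches = _DECISION_RE.findall(text)
    if not matches:
        return None
    # Use the last occurrence and normalize the capitalization
    return matches[-1].capitalize()


print(parse_final_decision("Reasoning... FINAL_DECISION: [CONTRADICTION]"))
print(parse_final_decision("This is not an entailment."))
```

Taking the last match means a later decision overrides an earlier one in the same message.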

## Test Example

Let's test our multi-agent system with a sample case:

In [None]:
# Test with a sample statement
test_statement = "there is a 13.2% difference between the results from the two the primary trial cohorts"
test_primary_id = "NCT00066573"

print(f"Testing with statement: '{test_statement}'")
print(f"Primary trial: {test_primary_id}")
print("\n" + "="*80 + "\n")

# Run the analysis
result = analyze_statement_with_agents(
    statement=test_statement,
    primary_id=test_primary_id,
    section_id="Results"
)

print("\n" + "="*80)
print(f"FINAL RESULT: {result}")
print("="*80)

## Evaluation on Training Data

Let's evaluate our AutoGen system on a subset of training data:

In [None]:
# Load training data
train_data = load_dataset("training_data/train.json")
print(f"Loaded {len(train_data)} training examples")

# Evaluate on first 20 examples (adjust as needed for testing)
sample_size = 20
examples = list(train_data.items())[:sample_size]

print(f"\nEvaluating on {len(examples)} examples...")

results = []
correct = 0

for i, (uuid, example) in enumerate(tqdm(examples, desc="Processing")):
    try:
        statement = example.get("Statement")
        primary_id = example.get("Primary_id")
        secondary_id = example.get("Secondary_id")
        section_id = example.get("Section_id")
        expected = example.get("Label")
        
        if not statement or not primary_id:
            results.append({
                "uuid": uuid,
                "expected": expected,
                "predicted": "SKIPPED",
                "correct": False
            })
            continue
        
        # Get prediction from multi-agent system
        predicted = analyze_statement_with_agents(
            statement=statement,
            primary_id=primary_id,
            secondary_id=secondary_id,
            section_id=section_id
        )
        
        # Check if correct
        is_correct = (predicted.strip() == expected.strip())
        if is_correct:
            correct += 1
            
        results.append({
            "uuid": uuid,
            "statement": statement[:100] + "..." if len(statement) > 100 else statement,
            "expected": expected,
            "predicted": predicted,
            "correct": is_correct
        })
        
        print(f"Example {i+1}: {expected} -> {predicted} {'✅' if is_correct else '❌'}")
        
    except Exception as e:
        print(f"Error processing example {i+1}: {e}")
        results.append({
            "uuid": uuid,
            "expected": expected,
            "predicted": "ERROR",
            "correct": False
        })

# Calculate accuracy
accuracy = correct / len(examples) if examples else 0
print(f"\n📊 AutoGen Results:")
print(f"Accuracy: {accuracy:.2%} ({correct}/{len(examples)})")
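
The accuracy above counts skipped and failed examples as wrong. To separate model errors from pipeline failures, the results list can be summarized as sketched below. `summarize_results` is a hypothetical helper that assumes the `predicted` and `correct` keys used in the loop; the sample rows are made up.

```python
from typing import Any, Dict, List


def summarize_results(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Report accuracy over all examples and over those that produced a prediction."""
    scored = [r for r in results if r["predicted"] not in ("SKIPPED", "ERROR")]
    correct = sum(1 for r in scored if r["correct"])
    return {
        "total": len(results),
        "scored": len(scored),
        "skipped_or_error": len(results) - len(scored),
        "accuracy_all": correct / len(results) if results else 0.0,
        "accuracy_scored": correct / len(scored) if scored else 0.0,
    }


# Made-up rows for illustration only
toy_results = [
    {"predicted": "Entailment", "correct": True},
    {"predicted": "Contradiction", "correct": True},
    {"predicted": "Entailment", "correct": False},
    {"predicted": "ERROR", "correct": False},
]
summary = summarize_results(toy_results)
print(summary)
```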

## Error Analysis

Let's examine some examples where our system made mistakes:

In [None]:
# Show incorrect predictions for analysis
incorrect_results = [r for r in results if not r["correct"] and r["predicted"] not in ["SKIPPED", "ERROR"]]

print(f"\n🔍 Error Analysis ({len(incorrect_results)} incorrect predictions):")
print("=" * 80)

for i, result in enumerate(incorrect_results[:5]):  # Show first 5 errors
    print(f"\nError #{i+1}:")
    print(f"Statement: {result['statement']}")
    print(f"Expected: {result['expected']}")
    print(f"Predicted: {result['predicted']}")
    print("-" * 40)
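
Besides reading individual mistakes, it can help to count which direction the errors go. The sketch below tallies (expected, predicted) pairs. `confusion_counts` is a hypothetical helper that assumes the `expected` and `predicted` keys used in `results`; the sample rows are made up.

```python
from collections import Counter
from typing import Any, Dict, List


def confusion_counts(results: List[Dict[str, Any]]) -> Counter:
    """Count (expected, predicted) pairs, ignoring skipped and failed examples."""
    return Counter(
        (r["expected"], r["predicted"])
        for r in results
        if r["predicted"] not in ("SKIPPED", "ERROR")
    )


# Made-up rows for illustration only
toy_rows = [
    {"expected": "Entailment", "predicted": "Entailment"},
    {"expected": "Entailment", "predicted": "Contradiction"},
    {"expected": "Entailment", "predicted": "Contradiction"},
    {"expected": "Contradiction", "predicted": "ERROR"},
]
counts = confusion_counts(toy_rows)
for (expected, predicted), n in counts.most_common():
    print(f"{expected} -> {predicted}: {n}")
```

A large count for one off-diagonal pair would suggest a systematic bias, for example toward the conservative "Contradiction" fallback.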

## Generate Submission File

Let's create a submission file for the test data:

In [None]:
def generate_autogen_submission(test_file="test.json", output_file="autogen_submission.json", sample_size=None):
    """
    Generate submission file using AutoGen multi-agent system.
    
    Args:
        test_file: Path to test data
        output_file: Output submission file
        sample_size: Number of examples to process (None for all)
    """
    
    # Load test data
    test_data = load_dataset(test_file)
    if not test_data:
        print(f"❌ Could not load test data from {test_file}")
        return {}  # Return an empty dict so callers can still call len() on the result
    
    examples = list(test_data.items())
    if sample_size:
        examples = examples[:sample_size]
        
    print(f"🚀 Generating AutoGen predictions for {len(examples)} examples...")
    
    submission = {}
    
    for i, (uuid, example) in enumerate(tqdm(examples, desc="AutoGen Processing")):
        try:
            statement = example.get("Statement")
            primary_id = example.get("Primary_id")
            secondary_id = example.get("Secondary_id")
            section_id = example.get("Section_id")
            
            if not statement or not primary_id:
                submission[uuid] = {"Prediction": "Contradiction"}  # Default fallback
                continue
                
            # Get prediction from multi-agent system
            prediction = analyze_statement_with_agents(
                statement=statement,
                primary_id=primary_id,
                secondary_id=secondary_id,
                section_id=section_id
            )
            
            submission[uuid] = {"Prediction": prediction}
            
        except Exception as e:
            print(f"Error processing {uuid}: {e}")
            submission[uuid] = {"Prediction": "Contradiction"}  # Conservative fallback
    
    # Save submission file
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(submission, f, indent=2)
    
    print(f"✅ AutoGen submission saved to {output_file}")
    return submission

# Generate submission for a small sample (change sample_size as needed)
autogen_submission = generate_autogen_submission(
    test_file="test.json", 
    output_file="autogen_submission.json",
    sample_size=10  # Process only 10 examples for testing
)

print(f"Generated predictions for {len(autogen_submission)} examples")
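
Before uploading, the submission dictionary can be checked for malformed entries. The sketch below assumes the structure produced above, where each UUID maps to a dictionary with a "Prediction" key, and that only the two labels are valid. `validate_submission` is a hypothetical helper.

```python
from typing import Any, Dict, List

VALID_LABELS = {"Entailment", "Contradiction"}


def validate_submission(submission: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file looks valid."""
    problems = []
    for uuid, entry in submission.items():
        if not isinstance(entry, dict) or "Prediction" not in entry:
            problems.append(f"{uuid}: missing 'Prediction' field")
        elif entry["Prediction"] not in VALID_LABELS:
            problems.append(f"{uuid}: unexpected label {entry['Prediction']!r}")
    return problems


# Made-up entries for illustration only
toy_submission = {
    "id-1": {"Prediction": "Entailment"},
    "id-2": {"Prediction": "entailment"},
    "id-3": {},
}
issues = validate_submission(toy_submission)
print(issues)
```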

## Conclusion and Insights

### AutoGen Framework Strengths:
1. **Multi-perspective analysis**: Each agent brings specialized expertise
2. **Structured collaboration**: Clear roles and responsibilities
3. **Flexible coordination**: Supervisor manages complex workflows
4. **Robust reasoning**: Multiple viewpoints can improve decision quality
5. **Interpretability**: Agent conversations provide insight into reasoning

### Key Features Demonstrated:
- **Role specialization**: Medical, numerical, and logical analysis
- **Group chat coordination**: Agents collaborate through structured dialogue
- **Systematic approach**: Supervisor ensures comprehensive analysis
- **Error handling**: Graceful fallbacks for edge cases
- **Scalable architecture**: Easy to add new specialized agents

### Optimization Opportunities:
1. **Prompt engineering**: Fine-tune agent instructions for better performance
2. **Agent coordination**: Improve conversation flow and decision-making
3. **Error analysis**: Use failed cases to improve agent reasoning
4. **Performance tuning**: Optimize model parameters and conversation patterns
5. **Domain knowledge**: Incorporate more medical domain expertise

### When to Use AutoGen:
- Complex reasoning tasks requiring multiple perspectives
- Problems that benefit from role specialization
- Scenarios where interpretability is important
- Applications needing robust collaborative decision-making

AutoGen is well suited to structured multi-agent collaboration, and the agent conversations make the reasoning process easy to inspect, which makes it a good fit for complex clinical analysis tasks.