# Multi-LLM Collaborative Debate System - Demo Playground

Interactive notebook for running single debate sessions and observing agent reasoning in real time.

## Features
- Run debates on individual problems
- Observe each agent's reasoning process
- See peer reviews and refinements
- Watch the judge make the final decision

In [1]:
import json
import sys
from pathlib import Path

# Add parent directory to path for imports
sys.path.insert(0, str(Path(".").resolve().parent))

from dotenv import load_dotenv
from openai import OpenAI

from src.agents import PERSONAS
from src.orchestrator import (
    get_role_preference,
    generate_solution,
    generate_critique,
    refine_solution,
    judge_verdict,
    grade_answer,
)

# Load environment variables
load_dotenv()

# Initialize OpenAI client
client = OpenAI()
print("OpenAI client initialized successfully!")

OpenAI client initialized successfully!


## 1. Load Problem Set

In [2]:
PROBLEMS_PATH = Path("../data/problems.json")

with open(PROBLEMS_PATH, "r", encoding="utf-8") as f:
    problems = json.load(f)

print(f"Loaded {len(problems)} problems")
print("\nAvailable problems:")
for p in problems[:10]:  # Show first 10
    print(f"  [{p['id']}] {p['category']}/{p['difficulty']}: {p['question'][:60]}...")

Loaded 25 problems

Available problems:
  [1] Logic/Hard: Three gods A, B, and C are called, in no particular order, T...
  [2] Math/Medium: Find the last two digits of 7^2024....
  [3] Physics/Medium: A ladder of length L and mass m leans against a frictionless...
  [4] Game Theory/Hard: In a sealed-bid second-price auction (Vickrey auction) with ...
  [5] Logic/Hard: Cheryl gives Albert and Bernard ten possible dates for her b...
  [6] Math/Hard: A standard 12-hour clock has an hour hand and a minute hand....
  [7] Probability/Medium: You are on a game show with 3 doors. Behind one is a car; be...
  [8] Math/Medium: If x^x^x... (infinite power tower) = 2, what is x?...
  [9] Physics/Medium: A car accelerates from rest at a constant rate 'a' for time ...
  [10] Logic/Hard: 100 prisoners are lined up. Each wears a red or blue hat. Th...
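
Later cells read the `id`, `category`, `difficulty`, `question`, and `ground_truth` keys of each problem, so a quick check for missing keys can save a failed run. The following is a minimal sketch, not part of the project: `REQUIRED_FIELDS` and `find_invalid_problems` are hypothetical names, and the field list is inferred from the keys this notebook uses.

```python
# Assumption: these are the only keys the notebook relies on.
REQUIRED_FIELDS = ("id", "category", "difficulty", "question", "ground_truth")


def find_invalid_problems(problems):
    """Return (problem id, missing field names) for every incomplete record."""
    invalid = []
    for index, problem in enumerate(problems):
        missing = [field for field in REQUIRED_FIELDS if field not in problem]
        if missing:
            # Fall back to the list position when the record has no id.
            invalid.append((problem.get("id", f"index {index}"), missing))
    return invalid


# Hypothetical sample records; the second one is deliberately incomplete.
sample = [
    {"id": 999, "category": "Custom", "difficulty": "Medium",
     "question": "What is 2 + 2?", "ground_truth": "4"},
    {"id": 1000, "category": "Custom", "question": "Incomplete record"},
]
print(find_invalid_problems(sample))  # [(1000, ['difficulty', 'ground_truth'])]
```

Calling `find_invalid_problems(problems)` on the loaded list should return an empty list if every record is complete.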


## 2. Select a Problem

Choose a problem ID or define a custom problem.

In [3]:
# Option 1: Select by ID
PROBLEM_ID = 1  # Change this to test different problems

problem = next((p for p in problems if p["id"] == PROBLEM_ID), None)

# Option 2: Define a custom problem (uncomment to use)
# problem = {
#     "id": 999,
#     "category": "Custom",
#     "difficulty": "Medium",
#     "question": "What is 2 + 2?",
#     "ground_truth": "4",
# }

if problem:
    print(f"Selected Problem {problem['id']}")
    print(f"Category: {problem['category']}")
    print(f"Difficulty: {problem['difficulty']}")
    print(f"\nQuestion:\n{problem['question']}")
    print(f"\nGround Truth: {problem['ground_truth']}")
else:
    print(f"Problem ID {PROBLEM_ID} not found!")

Selected Problem 1
Category: Logic
Difficulty: Hard

Question:
Three gods A, B, and C are called, in no particular order, True, False, and Random. True always speaks truly, False always speaks falsely, but whether Random speaks truly or falsely is a completely random matter. You must determine the identities of A, B, and C by asking three yes-no questions; each question must be put to exactly one god. The gods understand English, but will answer all questions in their own language, in which the words for yes and no are 'da' and 'ja', in some order. You do not know which word means which. What is the first question you should ask to ensure you can solve the puzzle?

Ground Truth: Ask god B: 'If I asked you 'Is A Random?', would you say 'ja'?' (If B answers 'ja', then C is not Random. If B answers 'da', then A is not Random. This isolates a non-random god to ask the next questions to.)


## 3. Helper Functions for Pretty Printing

In [4]:
def print_header(title: str) -> None:
    """Print a formatted header."""
    print("\n" + "=" * 80)
    print(f" {title}")
    print("=" * 80)


def print_subheader(title: str) -> None:
    """Print a formatted subheader."""
    print(f"\n--- {title} ---")


def print_agent_info(agent_id: str) -> None:
    """Print agent persona information."""
    print(f"\nAgent {agent_id} Persona: {PERSONAS[agent_id][:100]}...")

## 4. Stage 0: Role Assignment

Each agent states whether it prefers the Solver or Judge role, along with a confidence score and its reasoning.

In [5]:
print_header("STAGE 0: ROLE ASSIGNMENT")

question = problem["question"]
agent_ids = list(PERSONAS.keys())
preferences = []

for agent_id in agent_ids:
    print_subheader(f"Agent {agent_id}")
    print_agent_info(agent_id)
    
    pref = get_role_preference(client, agent_id, question)
    preferences.append(pref)
    
    print(f"\nPreferred Role: {pref.role_priority}")
    print(f"Confidence: {pref.confidence:.2f}")
    print(f"Reasoning: {pref.reasoning[:200]}...")


 STAGE 0: ROLE ASSIGNMENT

--- Agent A ---

Agent A Persona: You are a rigid logician and scientist. You prioritize formal proofs, step-by-step derivation, and e...

Preferred Role: Solver
Confidence: 0.90
Reasoning: Given the complexity and the logical nature of the problem, it requires a structured approach to formulating questions and deducing the identities of the gods based on their responses. As a Solver, I ...

--- Agent B ---

Agent B Persona: You are a lateral thinker. You look for alternative interpretations, trick wording, or creative solu...

Preferred Role: Solver
Confidence: 0.85
Reasoning: I believe I would be better suited as a Solver for this problem because it requires creative reasoning and lateral thinking to derive an effective first question. This involves understanding the nuanc...

--- Agent C ---

Agent C Persona: You are a pragmatic engineer. You focus on probability, heuristics, and real-world constraints. You ...

Preferred Role: Solver
Confidence: 0.85
Rea

In [6]:
# Assign roles based on the stated preferences
import random

judge_candidates = [p for p in preferences if p.role_priority == "Judge"]

if judge_candidates:
    # Sample among the volunteers, weighted by their stated confidence
    weights = [p.confidence for p in judge_candidates]
    judge_pref = random.choices(judge_candidates, weights=weights, k=1)[0]
else:
    # Nobody asked to judge: pick any agent uniformly at random
    judge_pref = random.choice(preferences)

judge_id = judge_pref.agent_id

# Every agent that is not the judge becomes a solver
solver_ids = [p.agent_id for p in preferences if p.agent_id != judge_id]

print_subheader("ROLE ASSIGNMENT RESULT")
print(f"Judge: Agent {judge_id}")
print(f"Solvers: Agents {', '.join(solver_ids)}")


--- ROLE ASSIGNMENT RESULT ---
Judge: Agent D
Solvers: Agents A, B, C
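
In the run above all three displayed agents asked to be Solvers, so the judge may have come from the uniform fallback rather than the weighted draw. To make that selection testable and reproducible, it could be factored into a function that accepts a seeded random generator. This is a sketch under assumptions: `select_judge` is a hypothetical helper, the `SimpleNamespace` objects are stand-ins for the real preference objects, and the zero-weight guard is an addition that the cell above does not have.

```python
import random
from types import SimpleNamespace


def select_judge(preferences, rng=None):
    """Return the judge's agent_id.

    Volunteers (role_priority == "Judge") are sampled in proportion to their
    confidence; if nobody volunteers, any agent is picked uniformly.
    """
    if rng is None:
        rng = random.Random()
    candidates = [p for p in preferences if p.role_priority == "Judge"]
    weights = [p.confidence for p in candidates]
    if candidates and sum(weights) > 0:
        return rng.choices(candidates, weights=weights, k=1)[0].agent_id
    # No volunteers, or every volunteer reported zero confidence
    # (random.choices rejects weights that sum to zero).
    pool = candidates or preferences
    return rng.choice(pool).agent_id


# Hypothetical stand-ins for the preference objects (no API calls needed).
all_solvers = [
    SimpleNamespace(agent_id=a, role_priority="Solver", confidence=0.9)
    for a in "ABCD"
]
one_volunteer = all_solvers[:3] + [
    SimpleNamespace(agent_id="D", role_priority="Judge", confidence=0.7)
]

print(select_judge(one_volunteer, random.Random(0)))  # D, the only volunteer
print(select_judge(all_solvers, random.Random(0)) in set("ABCD"))  # True
```

Passing `random.Random(seed)` keeps the module-level random state untouched, so repeated runs with the same seed assign the same judge.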


## 5. Stage 1: Independent Solutions

Each solver generates its solution independently, without seeing the other solvers' work.

In [7]:
print_header("STAGE 1: INDEPENDENT SOLUTIONS")

initial_solutions = {}

for solver_id in solver_ids:
    print_subheader(f"Solver {solver_id}")
    print_agent_info(solver_id)
    
    solution = generate_solution(client, solver_id, question)
    initial_solutions[solver_id] = solution
    
    print("\nSolution:")
    # Truncate long solutions to keep the output readable
    text = solution.solution_text
    print(f"{text[:500]}..." if len(text) > 500 else text)
    print(f"\nFINAL ANSWER: {solution.final_answer}")


 STAGE 1: INDEPENDENT SOLUTIONS

--- Solver A ---

Agent A Persona: You are a rigid logician and scientist. You prioritize formal proofs, step-by-step derivation, and e...

Solution:
To solve the problem of identifying the three gods (True, False, and Random) with the constraints that they answer only in ‘da’ and ‘ja’ (which we don't know the meanings of), we need to construct our first question carefully. 

1. **Identifying the Types of Gods**: We note that True will always tell the truth, False will always lie, and Random can give any answer (honestly or dishonestly). 

2. **Understanding the Responses**: The two words 'da' and 'ja' can mean either 'yes' or 'no'. We cannot...

FINAL ANSWER: Ask god A: "If I asked you whether god B is Random, would you say 'da'?"

--- Solver B ---

Agent B Persona: You are a lateral thinker. You look for alternative interpretations, trick wording, or creative solu...

Solution:
To tackle this problem, we must devise a first question that enables us t

## 6. Stage 2: Peer Review (Round Robin)

Each solver critiques the other two solvers' work.
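
The round-robin pairing can be sketched with `itertools.permutations`, which yields every ordered (reviewer, target) pair with reviewer and target distinct. The names `example_solvers` and `review_pairs` are illustrative only; the review cell itself uses nested loops, which produce the same pairs in the same order.

```python
from itertools import permutations

# Illustrative solver ids; in the notebook these come from solver_ids.
example_solvers = ["A", "B", "C"]

# Each ordered pair is (reviewer, target), never pairing a solver with itself.
review_pairs = list(permutations(example_solvers, 2))
print(review_pairs)
# [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]

# Count how many reviews each target receives.
reviews_received = {sid: 0 for sid in example_solvers}
for _reviewer, target in review_pairs:
    reviews_received[target] += 1
print(reviews_received)  # {'A': 2, 'B': 2, 'C': 2}
```

With n solvers this gives n * (n - 1) critiques, so three solvers mean six `generate_critique` calls per debate.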

In [8]:
print_header("STAGE 2: PEER REVIEW")

all_reviews = {sid: [] for sid in solver_ids}

for reviewer_id in solver_ids:
    for target_id in solver_ids:
        if reviewer_id != target_id:
            print_subheader(f"Reviewer {reviewer_id} -> Target {target_id}")
            
            review = generate_critique(
                client,
                reviewer_id,
                target_id,
                question,
                initial_solutions[target_id],
            )
            all_reviews[target_id].append(review)
            
            print(f"\nScore: {review.score}/10")
            print(f"Strengths: {', '.join(review.strengths[:3])}")
            print(f"Weaknesses: {', '.join(review.weaknesses[:3])}")
            if review.errors:
                print(f"Errors Found: {len(review.errors)}")
                for err in review.errors[:2]:
                    print(f"  - [{err.severity}] {err.location}: {err.description[:100]}")


 STAGE 2: PEER REVIEW

--- Reviewer A -> Target B ---

Score: 5/10
Strengths: The solver correctly identified the challenges posed by the nature of the gods and the ambiguity of their language., The proposed question is structured to gather information about both the identity of a god and the interpretation of their responses.
Weaknesses: The solution lacks clarity in the evaluation of responses, particularly in the case when Random is asked., The impact of asking Random is downplayed – its unpredictability should have been more thoroughly explored, as it complicates the derivation of reliable knowledge.
Errors Found: 2
  - [critical] Step 4: The analysis does not adequately consider that responses from Random may confuse the interpretation 
  - [critical] Step 5: The conclusion implies that any answer could lead towards identifying True or False; however, the ro

--- Reviewer A -> Target C ---

Score: 6/10
Strengths: The solver correctly identified the need for a self-referential que

## 7. Stage 3: Refinement

Solvers improve their solutions based on peer feedback.

In [9]:
print_header("STAGE 3: REFINEMENT")

refined_solutions = {}

for solver_id in solver_ids:
    print_subheader(f"Solver {solver_id} Refining")
    
    refined = refine_solution(
        client,
        solver_id,
        question,
        initial_solutions[solver_id],
        all_reviews[solver_id],
    )
    refined_solutions[solver_id] = refined
    
    # Truncate long change summaries to keep the output readable
    changes = refined.changes_made
    if len(changes) > 300:
        changes = f"{changes[:300]}..."
    print(f"\nChanges Made: {changes}")
    print(f"\nOriginal Answer: {initial_solutions[solver_id].final_answer}")
    print(f"Refined Answer:  {refined.final_answer}")
    
    # Check if answer changed
    if initial_solutions[solver_id].final_answer != refined.final_answer:
        print(">>> ANSWER CHANGED!")


 STAGE 3: REFINEMENT

--- Solver A Refining ---

Changes Made: Clarified the analysis of Random's responses and specified how to handle ambiguous responses. Enhanced the explanation of the consequences of each answer from the first question and provided a follow-up question strategy.

Original Answer: Ask god A: "If I asked you whether god B is Random, would you say 'da'?"
Refined Answer:  Ask god A: "If I asked you whether god B is Random, would you say 'da'?"

--- Solver B Refining ---

Changes Made: Clarified the analysis of responses to include the unpredictability of Random more explicitly. Explained the implications of each possible answer more clearly and reinforced that identifying Random complicates the deduction process. Revised the question phrasing to ensure it directly targets the ide...

Original Answer: "If I asked you 'Is A True?', would you say 'da'?"
Refined Answer:  If I asked you whether 'You are Random', would you say 'da'?
>>> ANSWER CHANGED!

--- Solver C Refini
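
The "ANSWER CHANGED" check in the refinement cell compares the two answer strings exactly, so a purely cosmetic edit (extra whitespace, a trailing period, added quotes) would also be flagged. One possible alternative, shown here only as a sketch, is to normalize both strings before comparing. `normalize_answer` and `answer_changed` are hypothetical helpers, and the normalization rules are an assumption about which differences are cosmetic.

```python
import re


def normalize_answer(answer: str) -> str:
    """Lowercase, collapse whitespace, and trim surrounding quotes and periods."""
    text = answer.strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip(" \"'.")


def answer_changed(original: str, refined: str) -> bool:
    """Return True only when the answers differ after normalization."""
    return normalize_answer(original) != normalize_answer(refined)


print(answer_changed("2/3", " 2/3. "))  # False: cosmetic difference only
print(answer_changed("2/3", "1/3"))     # True: the content differs
```

This does not detect rephrasings that keep the same meaning; that kind of comparison would need the LLM-based grader or a similar semantic check.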

## 8. Stage 4: Judge Verdict

The judge evaluates all solutions and selects the winner.

In [10]:
print_header("STAGE 4: JUDGE VERDICT")

print(f"\nJudge: Agent {judge_id}")
print_agent_info(judge_id)

verdict = judge_verdict(
    client,
    judge_id,
    question,
    solver_ids,
    initial_solutions,
    all_reviews,
    refined_solutions,
)

print(f"\nWINNER: Solver {verdict.best_solver_id}")
print("\nRationale:")
print(verdict.rationale)
print(f"\nFINAL ANSWER TO USER: {verdict.final_answer_to_user}")


 STAGE 4: JUDGE VERDICT

Judge: Agent D

Agent D Persona: You are a balanced synthesizer. You strive for clarity and consensus. You try to bridge the gap betw...

WINNER: Solver A

Rationale:
Solver A's final answer is the strongest due to its clear, logical construction of the first question that directly addresses the challenge of identifying the identities of the gods under the constraints given. Despite receiving some critiques about how Random's response is addressed, A's explanation showcases a consistent and solid understanding of the deductive reasoning necessary to navigate the complexities of the problem. Additionally, their response shows an awareness of the need for follow-up questions based on the answers given, allowing for a more robust strategy in tackling the puzzle. The overall clarity and logical progression of thoughts in their refined solution contribute to its effectiveness in answering the problem at hand.

FINAL ANSWER TO USER: Ask god A: "If I asked you whethe

## 9. Evaluation

Compare the final answer against the ground truth.

In [11]:
print_header("EVALUATION")

ground_truth = problem["ground_truth"]

evaluation = grade_answer(
    client,
    question,
    ground_truth,
    verdict.final_answer_to_user,
)

print(f"\nGround Truth: {ground_truth}")
print(f"System Answer: {verdict.final_answer_to_user}")
print(f"\nCorrect: {'YES' if evaluation.is_correct else 'NO'}")
print("\nEvaluation Reasoning:")
print(evaluation.reasoning)


 EVALUATION

Ground Truth: Ask god B: 'If I asked you 'Is A Random?', would you say 'ja'?' (If B answers 'ja', then C is not Random. If B answers 'da', then A is not Random. This isolates a non-random god to ask the next questions to.)
System Answer: Ask god A: "If I asked you whether god B is Random, would you say 'da'?"

Correct: NO

Evaluation Reasoning:
The proposed question to god A does not effectively isolate the identity of the gods in the same manner as the ground truth. While asking about god B is a step, the structure of the question does not allow us to deduce the status of god B in relation to true, false, or random as effectively as the ground truth does. The ground truth question is designed to specifically identify one of the gods as either True or False by asking about god A's identity in relation to Random. The answer provided does not accomplish this same level of clarity or isolation.


## 10. Summary

In [12]:
print_header("DEBATE SUMMARY")

print(f"""
Problem ID: {problem['id']}
Category: {problem['category']}
Difficulty: {problem['difficulty']}

Roles:
  - Judge: Agent {judge_id}
  - Solvers: {', '.join(solver_ids)}

Initial Answers:
""")
for sid in solver_ids:
    print(f"  - Solver {sid}: {initial_solutions[sid].final_answer}")

print("""
Refined Answers:
""")
for sid in solver_ids:
    changed = "(changed)" if initial_solutions[sid].final_answer != refined_solutions[sid].final_answer else ""
    print(f"  - Solver {sid}: {refined_solutions[sid].final_answer} {changed}")

print(f"""
Judge Selected: Solver {verdict.best_solver_id}
Final Answer: {verdict.final_answer_to_user}

Ground Truth: {ground_truth}
Result: {'CORRECT' if evaluation.is_correct else 'INCORRECT'}
""")


 DEBATE SUMMARY

Problem ID: 1
Category: Logic
Difficulty: Hard

Roles:
  - Judge: Agent D
  - Solvers: A, B, C

Initial Answers:

  - Solver A: Ask god A: "If I asked you whether god B is Random, would you say 'da'?"
  - Solver B: "If I asked you 'Is A True?', would you say 'da'?"
  - Solver C: Ask: "If I asked you 'Is God B Random?', would you say 'da'?"

Refined Answers:

  - Solver A: Ask god A: "If I asked you whether god B is Random, would you say 'da'?" 
  - Solver B: If I asked you whether 'You are Random', would you say 'da'? (changed)
  - Solver C: Ask: "If I asked you 'Is God B Random?', would you say 'da'?" 

Judge Selected: Solver A
Final Answer: Ask god A: "If I asked you whether god B is Random, would you say 'da'?"

Ground Truth: Ask god B: 'If I asked you 'Is A Random?', would you say 'ja'?' (If B answers 'ja', then C is not Random. If B answers 'da', then A is not Random. This isolates a non-random god to ask the next questions to.)
Result: INCORRECT



---

## Quick Run: Full Debate Pipeline

Use this cell to run a complete debate in one go. It relies on `client`, `problems`, and `PERSONAS` from the setup and problem-loading cells above.

In [14]:
import random  # imported here so this cell does not depend on the role-assignment cell


def run_full_debate(problem_id: int) -> dict:
    """Run a complete debate for a given problem ID."""
    problem = next((p for p in problems if p["id"] == problem_id), None)
    if not problem:
        raise ValueError(f"Problem ID {problem_id} not found")
    
    question = problem["question"]
    print(f"Running debate for Problem {problem_id}: {question[:80]}...")
    
    # Stage 0: Role assignment
    preferences = [get_role_preference(client, aid, question) for aid in PERSONAS.keys()]
    judge_candidates = [p for p in preferences if p.role_priority == "Judge"]
    if judge_candidates:
        judge_id = random.choices(judge_candidates, weights=[p.confidence for p in judge_candidates], k=1)[0].agent_id
    else:
        judge_id = random.choice(preferences).agent_id
    solver_ids = [p.agent_id for p in preferences if p.agent_id != judge_id]
    print(f"  Roles: Judge={judge_id}, Solvers={solver_ids}")
    
    # Stage 1: Solutions
    initial_solutions = {sid: generate_solution(client, sid, question) for sid in solver_ids}
    print(f"  Initial answers: {[initial_solutions[s].final_answer[:30] for s in solver_ids]}")
    
    # Stage 2: Reviews
    all_reviews = {sid: [] for sid in solver_ids}
    for reviewer in solver_ids:
        for target in solver_ids:
            if reviewer != target:
                all_reviews[target].append(
                    generate_critique(client, reviewer, target, question, initial_solutions[target])
                )
    print("  Peer reviews completed")
    
    # Stage 3: Refinement
    refined_solutions = {
        sid: refine_solution(client, sid, question, initial_solutions[sid], all_reviews[sid])
        for sid in solver_ids
    }
    print(f"  Refined answers: {[refined_solutions[s].final_answer[:30] for s in solver_ids]}")
    
    # Stage 4: Verdict
    verdict = judge_verdict(client, judge_id, question, solver_ids, initial_solutions, all_reviews, refined_solutions)
    print(f"  Judge selected: Solver {verdict.best_solver_id}")
    
    # Evaluation
    evaluation = grade_answer(client, question, problem["ground_truth"], verdict.final_answer_to_user)
    print(f"  Final answer: {verdict.final_answer_to_user}")
    print(f"  Ground truth: {problem['ground_truth']}")
    print(f"  Result: {'CORRECT' if evaluation.is_correct else 'INCORRECT'}")
    
    return {
        "problem": problem,
        "verdict": verdict,
        "evaluation": evaluation,
    }


# Example: Run on problem 7 (Monty Hall)
result = run_full_debate(7)

Running debate for Problem 7: You are on a game show with 3 doors. Behind one is a car; behind the others, goa...
  Roles: Judge=D, Solvers=['A', 'B', 'C']
  Initial answers: ['2/3', '\\frac{2}{3}', '2/3']
  Peer reviews completed
  Refined answers: ['2/3', 'The probability of winning if ', '2/3']
  Judge selected: Solver A
  Final answer: The probability of winning the car if you switch to Door 2 is 2/3.
  Ground truth: 2/3
  Result: CORRECT
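
Because `run_full_debate` returns a dict with `problem`, `verdict`, and `evaluation` keys, the results of several runs can be aggregated into a simple accuracy figure. The sketch below uses a hypothetical `summarize_results` helper and `SimpleNamespace` stand-ins that mirror the two outcomes recorded in this notebook (problem 1 incorrect, problem 7 correct), so it runs without any API calls.

```python
from types import SimpleNamespace


def summarize_results(results):
    """Aggregate run_full_debate-style result dicts into overall accuracy."""
    total = len(results)
    correct = sum(1 for r in results if r["evaluation"].is_correct)
    return {
        "total": total,
        "correct": correct,
        # Avoid dividing by zero when no debates were run.
        "accuracy": correct / total if total else 0.0,
    }


# Stand-ins mirroring the two runs shown in this notebook.
example_results = [
    {"problem": {"id": 1}, "evaluation": SimpleNamespace(is_correct=False)},
    {"problem": {"id": 7}, "evaluation": SimpleNamespace(is_correct=True)},
]
print(summarize_results(example_results))
# {'total': 2, 'correct': 1, 'accuracy': 0.5}
```

In practice the list would be built with something like `[run_full_debate(pid) for pid in (1, 7)]`, keeping in mind that each call makes several model requests.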
