EVIEHub/PEAR

PEAR: Permutation-Equivariant Adaptive Routing Multi-Agent Debate

PEAR is an experiment harness for Permutation-Equivariant Adaptive Routing Multi-Agent Debate. It studies multi-agent reasoning systems where agents first produce independent answers and then revise them through sparse, dynamically routed critiques.

The current method uses a two-phase debate loop:

  1. Answer phase: each agent proposes or updates an answer with a confidence score.
  2. Critique phase: each agent receives a fixed number of critiques from other agents. For pear_full, the routing graph is selected using state-aware signals rather than a fixed topology.
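The two phases above can be sketched as a simple loop. This is a minimal illustration rather than the repository's implementation: `run_debate`, `answer_fn`, `critique_fn`, and `route_fn` are hypothetical names standing in for the LLM calls and the routing policy.

```python
from typing import Callable, Dict, List, Tuple

Answer = Tuple[str, int]  # (answer text, confidence score)


def run_debate(
    n_agents: int,
    rounds: int,
    k: int,
    answer_fn: Callable[[int, List[str]], Answer],
    critique_fn: Callable[[int, int, Answer], str],
    route_fn: Callable[[List[Answer], int], Dict[int, List[int]]],
) -> List[Answer]:
    """Alternate answer and critique phases for a fixed number of rounds."""
    # Critiques each agent received in the previous round (empty at the start).
    inbox: Dict[int, List[str]] = {i: [] for i in range(n_agents)}
    answers: List[Answer] = []
    for _ in range(rounds):
        # Answer phase: every agent proposes or updates an answer with a confidence.
        answers = [answer_fn(i, inbox[i]) for i in range(n_agents)]
        # Critique phase: route_fn maps each target to the k sources that critique it.
        routes = route_fn(answers, k)
        inbox = {
            target: [critique_fn(src, target, answers[target]) for src in sources]
            for target, sources in routes.items()
        }
    return answers
```

A fixed-topology baseline corresponds to a `route_fn` that ignores `answers`, while an adaptive policy would read the answers and confidences before choosing sources.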

The full PEAR routing objective combines three components:

  • Targeted cross-answer routing: prioritize high-confidence agents with different answers as sources for lower-confidence targets.
  • Influence balancing: avoid repeatedly over-exposing the group to one historically influential source.
  • Low-confidence filtering: reduce routing exposure from low-confidence sources.
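One way to picture how the three components could combine is an additive score per (source, target) pair, followed by a top-k selection for each target. The sketch below is an assumption, not the formula used in core/topology.py: the additive form, the deterministic top-k step, the `<=` comparison against the low-confidence threshold, and the function names are all illustrative. The default values are borrowed from configs/default.yaml.

```python
from typing import Dict, List


def routing_score(
    src: int,
    tgt: int,
    answers: List[str],
    conf: List[int],
    influence: List[float],
    alpha_targeted_cross: float = 0.2,
    alpha_influence: float = 0.7,
    alpha_low_confidence: float = 0.7,
    low_confidence_threshold: int = 3,
    source_confidence_min: int = 4,
    target_confidence_max: int = 3,
) -> float:
    """Hypothetical additive score for routing a critique from src to tgt."""
    # Targeted cross-answer routing: confident source, different answer, unsure target.
    targeted = float(
        answers[src] != answers[tgt]
        and conf[src] >= source_confidence_min
        and conf[tgt] <= target_confidence_max
    )
    # Low-confidence filtering: penalize sources at or below the threshold (assumed <=).
    low_conf = float(conf[src] <= low_confidence_threshold)
    # Influence balancing: penalize sources with a high historical influence value.
    return (
        alpha_targeted_cross * targeted
        - alpha_influence * influence[src]
        - alpha_low_confidence * low_conf
    )


def select_sources(
    k: int, answers: List[str], conf: List[int], influence: List[float]
) -> Dict[int, List[int]]:
    """Pick the k highest-scoring sources for every target (no self-critique)."""
    n = len(answers)
    routes: Dict[int, List[int]] = {}
    for tgt in range(n):
        candidates = [s for s in range(n) if s != tgt]
        candidates.sort(
            key=lambda s: routing_score(s, tgt, answers, conf, influence),
            reverse=True,
        )
        routes[tgt] = candidates[:k]
    return routes
```

The configuration also exposes a `routing_temperature`, which suggests the real selection may be stochastic; the deterministic top-k here is only the simplest stand-in.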

This repository includes fixed-topology baselines, CoT / CoT-SC baselines, random sparse routing controls, ablations, robustness experiments, local vLLM inference, and OpenAI-compatible closed-model inference.


Repository Structure

The repository root is the source directory. main.py is the canonical CLI entry point, and each subdirectory is importable as a top-level package. Generated run directories and paper notes are intentionally not expanded here.

PEAR/
|- main.py                         # CLI entry point; loads YAML config and calls runner.run_experiment
|- prompts.py                      # all LLM-facing prompt templates, confidence rubric, robustness prompts
|- requirements.txt                # Python dependencies, including vLLM / transformers / API clients
|- .env.example                    # template for API keys and OpenAI-compatible base URLs
|- configs/                        # agents/debate/dataset/replication/logging settings
|- core/
|  |- state.py                     # shared debate state schema and mode/type definitions
|  |- topology.py                  # topology construction, routing scores, random k-regular logic
|  `- views.py                     # per-agent visibility/view helpers
|- nodes/
|  |- agent_runner.py              # calls LLM agents for initial answers and updates
|  |- aggregator.py                # final answer aggregation, e.g. majority vote / judge
|  |- scorer.py                    # example-level scoring node
|  |- topology_scheduler.py        # topology/routing schedule helpers
|  `- turn_scheduler.py            # turn order / round scheduling helpers
|- graph/
|  `- builder.py                   # LangGraph StateGraph construction for PEAR execution
|- runner/
|  `- experiment.py                # high-level experiment loop, condition sweeps, summaries, transcripts
|- models/
|  |- base.py                      # BaseLLM and Generation abstractions
|  |- model.py                     # OpenAI-compatible, Anthropic, HF, and vLLM backend implementations
|  `- factory.py                   # loads configs/models.yaml and instantiates the selected backend
|- data/
|  |- loaders.py                   # dataset loading utilities
|  |- tasks.py                     # dataset adapters: formatting, answer parsing, scoring
|  |- gsm8k/                       # GSM8K local JSON files
|  |- mmlu_pro/                    # MMLU-Pro local JSON files
|  |- math500/                     # MATH-500 local JSON files
|  `- truthful_qa/                 # TruthfulQA local JSON files
|- metrics/
|  |- scorers.py                   # accuracy and task-level scoring helpers
|  |- diagnostics.py               # W2R/R2W, routing, confidence, influence diagnostics
|  `- stability.py                 # stability / tail-risk / influence-distribution metrics
|- utils/
|  |- budget.py                    # optional call/token budget accounting
|  |- logging.py                   # run directory creation, logger setup, timestamp handling
|  |- seed.py                      # reproducibility helpers
|  `- tracing.py                   # JSONL trace writing
|- scripts/
`- analysis/

Installation

The project can be run with a regular Python environment, but local open-model experiments should use a dedicated vLLM environment. The scripts default to .venv-vllm/bin/python.

Create the vLLM environment

Using uv:

uv venv .venv-vllm --python 3.12
source .venv-vllm/bin/activate
uv pip install -r requirements.txt

Or using standard venv / pip:

python3.12 -m venv .venv-vllm
source .venv-vllm/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

For closed-source or OpenAI-compatible APIs, create .env from the template:

cp .env.example .env
# Fill in OPENAI_API_KEY / OPENAI_BASE_URL as needed.

Quickstart

Run the default config:

python main.py --config configs/default.yaml

Override common fields from the CLI:

python main.py \
  --config configs/main_large.yaml \
  --model qwen2.5-7b-vllm \
  --dataset mmlu_pro \
  --num-examples 50

Run with explicit seeds:

python main.py \
  --config configs/main_large.yaml \
  --seeds 1 2 3 \
  --perm-seeds 10
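The flags used above can be layered over the YAML config roughly as follows. This is a sketch with the standard library's argparse, not the code in main.py: `build_parser` and `apply_overrides` are hypothetical names, and the mapping from each flag to a config key (for example `--perm-seeds` to `replication.agent_perm_seeds`) is an assumption based on the field names in configs/default.yaml.

```python
import argparse
from typing import Any, Dict


def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI flags shown in the Quickstart commands."""
    p = argparse.ArgumentParser(description="Illustrative PEAR-style CLI")
    p.add_argument("--config", required=True)
    p.add_argument("--model")
    p.add_argument("--dataset")
    p.add_argument("--num-examples", type=int)
    p.add_argument("--seeds", type=int, nargs="+")
    p.add_argument("--perm-seeds", type=int, nargs="+")
    return p


def apply_overrides(cfg: Dict[str, Any], args: argparse.Namespace) -> Dict[str, Any]:
    """Return a copy of cfg in which CLI-supplied values replace YAML values."""
    out = {section: dict(values) for section, values in cfg.items()}
    overrides = [
        ("agents", "model", args.model),
        ("dataset", "name", args.dataset),
        ("dataset", "num_examples", args.num_examples),
        ("replication", "seeds", args.seeds),
        ("replication", "agent_perm_seeds", args.perm_seeds),
    ]
    for section, key, value in overrides:
        if value is not None:  # flags that were not passed leave the config untouched
            out.setdefault(section, {})[key] = value
    return out
```

Flags that are omitted stay `None`, so the values loaded from the YAML file are preserved.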

Run the parameterized vLLM shell entry point:

scripts/run_vllm.sh qwen7b
scripts/run_vllm.sh llama
scripts/run_vllm.sh gemma
scripts/run_vllm.sh qwen3
scripts/run_vllm.sh open4

Useful shell overrides:

VLLM_DATASETS="gsm8k mmlu_pro" \
VLLM_NUM_EXAMPLES=200 \
PEAR_SEEDS="1 2 3" \
PEAR_PERM_SEEDS="10" \
scripts/run_vllm.sh qwen7b

Backward-compatible wrappers are also kept:

scripts/run_qwen_vllm.sh
scripts/run_llama_vllm.sh
scripts/run_gemma_vllm.sh
scripts/run_qwen3_main_vllm.sh

Configuration

configs/default.yaml contains the main knobs. Important fields:

agents:
  model: qwen2.5-7b-vllm

debate:
  n_agents: 5
  rounds: 3
  base_topology: k_regular
  k_regular_degree: 2
  mode: pear_full
  agg_mode: majority_vote
  max_tokens_per_call: 512
  mc_permutations: 100
  routing_temperature: 0.7
  alpha_targeted_cross: 0.2
  alpha_influence: 0.7
  alpha_low_confidence: 0.7
  normalize_routing_terms: true
  low_confidence_threshold: 3
  targeted_cross_source_confidence_min: 4
  targeted_cross_target_confidence_max: 3
  influence_beta: 0.6

dataset:
  name: mmlu_pro
  num_examples: 100

replication:
  seeds: [0]
  agent_perm_seeds: [10]

configs/main_large.yaml defines the main comparison conditions:

| condition | meaning |
| --- | --- |
| cot | single-agent chain-of-thought baseline |
| cot_sc | self-consistency over multiple independent agents/samples |
| fixed_clique | fixed fully connected debate |
| fixed_star | fixed hub-and-spoke debate |
| fixed_chain | fixed sequential chain debate |
| fixed_ring | fixed local-neighbor ring debate |
| random_k_regular | random sparse k-regular routing baseline |
| pear_full | full adaptive routing method |
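For intuition, the fixed and random baselines can be written as maps from each target agent to the sources whose critiques it receives. The sketch below is illustrative and does not reproduce core/topology.py: the function names, the choice of agent 0 as the star hub, and the chain direction are assumptions. The random variant fixes only the in-degree at k, which is weaker than a strictly k-regular graph.

```python
import random
from typing import Dict, List


def fixed_topology(name: str, n: int) -> Dict[int, List[int]]:
    """Map each target agent to the source agents that critique it."""
    if name == "clique":  # everyone hears from everyone else
        return {t: [s for s in range(n) if s != t] for t in range(n)}
    if name == "star":  # agent 0 is the hub (assumed)
        return {t: (list(range(1, n)) if t == 0 else [0]) for t in range(n)}
    if name == "chain":  # each agent hears only from its predecessor (assumed)
        return {t: ([t - 1] if t > 0 else []) for t in range(n)}
    if name == "ring":  # each agent hears from its two neighbors
        return {t: sorted({(t - 1) % n, (t + 1) % n}) for t in range(n)}
    raise ValueError(f"unknown topology: {name}")


def random_fixed_in_degree(n: int, k: int, seed: int) -> Dict[int, List[int]]:
    """Sample k distinct sources per target, reproducibly for a given seed."""
    rng = random.Random(seed)
    return {
        t: sorted(rng.sample([s for s in range(n) if s != t], k))
        for t in range(n)
    }
```

With `n_agents: 5` and `k_regular_degree: 2`, the ring and the random variant both deliver two critiques per agent, so the sparse conditions are compared at the same critique budget.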

configs/ablation.yaml isolates PEAR routing components:

| condition | targeted cross | influence balancing | low-confidence filtering |
| --- | --- | --- | --- |
| pear_targeted_cross | yes | no | no |
| pear_influence | no | yes | no |
| pear_low_confidence | no | no | yes |
| pear_targeted_influence | yes | yes | no |
| pear_targeted_low_confidence | yes | no | yes |
| pear_influence_low_confidence | no | yes | yes |
| pear_full | yes | yes | yes |
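The ablation grid can be encoded as a lookup from condition name to enabled components. In this sketch the flags are copied from the table above, while the idea that disabling a component amounts to zeroing its alpha weight is an assumption; `RoutingComponents`, `ABLATIONS`, and `effective_alphas` are hypothetical names.

```python
from typing import Dict, NamedTuple


class RoutingComponents(NamedTuple):
    targeted_cross: bool
    influence: bool
    low_confidence: bool


# One entry per ablation condition, in the same order as the table.
ABLATIONS: Dict[str, RoutingComponents] = {
    "pear_targeted_cross": RoutingComponents(True, False, False),
    "pear_influence": RoutingComponents(False, True, False),
    "pear_low_confidence": RoutingComponents(False, False, True),
    "pear_targeted_influence": RoutingComponents(True, True, False),
    "pear_targeted_low_confidence": RoutingComponents(True, False, True),
    "pear_influence_low_confidence": RoutingComponents(False, True, True),
    "pear_full": RoutingComponents(True, True, True),
}


def effective_alphas(
    condition: str,
    alpha_targeted_cross: float = 0.2,
    alpha_influence: float = 0.7,
    alpha_low_confidence: float = 0.7,
) -> Dict[str, float]:
    """Zero the weight of every component the condition disables (assumed mechanism)."""
    c = ABLATIONS[condition]
    return {
        "alpha_targeted_cross": alpha_targeted_cross if c.targeted_cross else 0.0,
        "alpha_influence": alpha_influence if c.influence else 0.0,
        "alpha_low_confidence": alpha_low_confidence if c.low_confidence else 0.0,
    }
```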

Datasets

All datasets are loaded from local JSON files under data/; the runner does not download data at runtime.

| name | source files | answer format |
| --- | --- | --- |
| mmlu_pro | data/mmlu_pro/{test,validation}.json | A-J multiple choice |
| gsm8k | data/gsm8k/{test,train}.json | numeric exact match |
| truthful_qa | data/truthful_qa/mc_task.json | multiple choice letter |
| math_500 | data/math500/test.json | math answer exact match |

data/tasks.py normalizes each dataset into a shared Example format and provides benchmark-specific format_question, parse_answer, and score methods.
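A multiple-choice adapter in this style might look like the sketch below. Only the three method names come from the description above; the `Example` fields, the prompt wording, and the answer-extraction regex are assumptions for illustration, not the contents of data/tasks.py.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """Hypothetical shared example format."""
    question: str
    choices: List[str]
    answer: str  # gold letter, e.g. "B"


class MultipleChoiceTask:
    LETTERS = "ABCDEFGHIJ"  # A-J, as used by MMLU-Pro

    def format_question(self, ex: Example) -> str:
        """Render the question followed by lettered options."""
        options = "\n".join(
            f"{self.LETTERS[i]}. {choice}" for i, choice in enumerate(ex.choices)
        )
        return f"{ex.question}\n{options}\nAnswer with a single letter."

    def parse_answer(self, text: str) -> Optional[str]:
        """Return the last 'answer is X' / 'Answer: (X)' letter, or None."""
        matches = re.findall(r"[Aa]nswer(?:\s+is)?\s*:?\s*\(?([A-J])\b", text)
        return matches[-1] if matches else None

    def score(self, ex: Example, prediction: str) -> bool:
        """Exact match between the parsed letter and the gold letter."""
        return self.parse_answer(prediction) == ex.answer
```

Taking the last match lets a response that revises itself mid-reasoning be scored on its final stated answer.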


Models

Models are registered in configs/models.yaml. Each entry is selected by agents.model or --model.

Common local vLLM entries include:

| model key | backend |
| --- | --- |
| qwen2.5-7b-vllm | vLLM |
| llama-3.1-8b-vllm | vLLM |
| gemma-3-12b-vllm | vLLM |
| qwen2.5-14b-vllm | vLLM |
| qwen3-30b-a3b-local-vllm | vLLM |

Closed / API models can use the OpenAI-compatible backend through OPENAI_API_KEY and OPENAI_BASE_URL.


Running Experiments

Main open-model experiment:

scripts/run_main_large_vllm.sh all

Parameterized vLLM runner:

scripts/run_vllm.sh qwen7b llama gemma qwen3

Routing ablation:

scripts/run_ablation_vllm.sh all

Random k-regular baseline:

scripts/run_random_k_regular_vllm.sh open4

Closed-model main experiment:

scripts/run_closed_main.sh gpt
scripts/run_closed_main.sh claude

Metrics

The primary metric is accuracy. The code also records diagnostics that are useful for understanding debate dynamics:

| metric | meaning |
| --- | --- |
| accuracy | final answer correctness |
| w2r_rate | wrong-to-right answer transition rate |
| r2w_rate | right-to-wrong answer transition rate |
| answer_entropy | diversity of agent answers |
| influence_entropy | concentration or balance of influence |
| targeted_cross_rate | frequency of targeted high-confidence dissent routing |
| critique_acceptance_rate | how often critiques are adopted |
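As a rough guide to how the transition and entropy diagnostics could be computed, here is a minimal sketch. The normalization is an assumption (W2R over initially wrong agents, R2W over initially right agents, natural-log Shannon entropy) and may differ from metrics/diagnostics.py; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence


def transition_rates(
    initial_correct: Sequence[bool], final_correct: Sequence[bool]
) -> Dict[str, float]:
    """Wrong-to-right and right-to-wrong rates over paired agent outcomes."""
    pairs = list(zip(initial_correct, final_correct))
    finals_of_wrong = [final for init, final in pairs if not init]
    finals_of_right = [final for init, final in pairs if init]
    w2r = sum(finals_of_wrong) / len(finals_of_wrong) if finals_of_wrong else 0.0
    r2w = (
        sum(not final for final in finals_of_right) / len(finals_of_right)
        if finals_of_right
        else 0.0
    )
    return {"w2r_rate": w2r, "r2w_rate": r2w}


def answer_entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of the empirical answer distribution."""
    if not labels:
        return 0.0
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log(p) for p in probs)
```

Entropy is zero when all agents agree and reaches log(n) when n agents all give distinct answers.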

Analysis scripts live in analysis/. Generated figures and reports should be written under analysis/figures/ or dedicated report subfolders.


Citation

@article{pear2026,
  title={PEAR: Permutation-Equivariant Adaptive Routing Multi-Agent Debate},
  author={Feng, Yang and Xu, Ziwei and Hu, Xia and He, Fengxiang},
  year={2026}
}

License

See LICENSE.
