
ProbeLLM

Automating Principled Diagnosis of LLM Failures

Python 3.8+ · License: MIT · Docs · MCP Tools · OpenAI · OpenRouter

An automated probing framework that discovers structured failure modes of LLMs
via hierarchical Monte Carlo Tree Search.

Getting Started  ·  Library API  ·  Tool System  ·  Models  ·  Citation


> **Note**
> ProbeLLM goes beyond isolated error cases. It discovers failure modes — recurring, structured patterns that explain how and why models fail — by combining hierarchical MCTS exploration with tool-augmented test generation and failure-aware clustering.


Overview

Static benchmarks provide only a fixed snapshot of model behavior — as LLMs evolve, their error distributions shift, leaving emerging weaknesses undetected. ProbeLLM formulates probing as a budget-aware hierarchical search that automatically:

  1. Explores new failure regions (Macro) and refines known ones (Micro) via UCB-guided MCTS (see the sketch after this list)
  2. Generates verifiable test cases with tool-augmented LLMs (code execution, web retrieval, perturbation)
  3. Synthesizes individual failures into interpretable failure modes via failure-aware embeddings and boundary-aware induction
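
As a rough illustration of step 1, the sketch below picks between the Macro and Micro regimes with a standard UCB1 score: average failure yield plus an exploration bonus. The function, reward definition, and statistics are illustrative assumptions, not ProbeLLM's internal implementation.

```python
import math

def select_regime(stats):
    """Pick the regime with the highest UCB1 score (illustrative sketch)."""
    total_pulls = sum(s["pulls"] for s in stats.values())

    def ucb(s):
        if s["pulls"] == 0:
            return float("inf")            # try each regime at least once
        mean = s["reward"] / s["pulls"]    # average failures found per probe
        return mean + math.sqrt(2 * math.log(total_pulls) / s["pulls"])

    return max(stats, key=lambda regime: ucb(stats[regime]))

# Hypothetical running statistics: Macro probed 12 times (4 failures found),
# Micro probed 8 times (5 failures found).
stats = {
    "macro": {"pulls": 12, "reward": 4.0},
    "micro": {"pulls": 8, "reward": 5.0},
}
print(select_regime(stats))   # prints "micro" for these numbers
```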


Overview of ProbeLLM. (I) Probes a target model and initializes the search with seed test cases from existing benchmarks. (II) Hierarchical search selects between Macro and Micro regimes, performs tool-augmented generation, and verifies target model responses. (III) Failure-aware embeddings, clustering, and boundary-aware induction produce interpretable failure modes.

Macro vs. Micro Search

ProbeLLM allocates its probing budget between two complementary strategies:

| | Macro (Coverage) | Micro (Refinement) |
|---|---|---|
| Goal | Surface novel failure regions | Densify evidence around known failures |
| Mechanism | Greedy k-center diversity sampling | Local perturbation of seed queries |
| Tools | Web search, Python execution | Perturbation, Python execution |
| MCTS Role | Expands across clusters | Deepens within clusters |


Macro vs. Micro search strategy. Macro explores new clusters to diversify topics, while Micro refines within a cluster to deepen local evidence around known failure regions.
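
The greedy k-center mechanism in the Macro row can be pictured with a short sketch: start from one seed embedding and repeatedly add the query whose embedding is farthest from every center selected so far. The embedding shapes and Euclidean metric below are placeholder assumptions, not the library's actual sampler.

```python
import numpy as np

def greedy_k_center(embeddings, k):
    """Return indices of k mutually distant embeddings (illustrative sketch)."""
    selected = [0]                                           # arbitrary first center
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(1, k):
        nxt = int(dists.argmax())                            # farthest from all chosen centers
        selected.append(nxt)
        new_d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dists = np.minimum(dists, new_d)                     # distance to nearest chosen center
    return selected

rng = np.random.default_rng(0)
query_embeddings = rng.normal(size=(100, 32))                # 100 placeholder query embeddings
print(greedy_k_center(query_embeddings, k=5))                # indices of 5 diverse seeds
```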

Failure Mode Evolution



UMAP visualization of failure samples for the Mistral (left) and GPT (right) model families. Small nodes represent individual failed queries, colored by version (cool → warm = older → newer). Directed edges connect semantically similar failures across generations, tracing the evolution of model deficits. Large purple centroids indicate weakness clusters persisting across versions; dark blue centroids are isolated to the latest model.

Highlights

- **Hierarchical MCTS**: Principled budget allocation between Macro & Micro, guided by UCB selection
- **Tool-Augmented Generation**: MCP-style pluggable tools (web retrieval, code execution, perturbation)
- **Failure Mode Synthesis**: Failure-aware embeddings + HDBSCAN + boundary-aware induction
- **Benchmark-Agnostic**: Seed queries are optional; works with any user-provided prompts
- **Verified Test Cases**: Verifiable ground truths + LLM-as-Judge reduce spurious failures
- **Multi-Model Concurrent**: Async probing of 12+ models across 5 benchmarks in parallel


Getting Started

Installation

```bash
pip install -r requirements.txt
```

Environment Variables

| Variable | Purpose |
|---|---|
| OPENAI_API_KEY | Judge model, embedding model, web retrieval tool |
| OPENROUTER_API_KEY | Target / generator models via OpenRouter |
| OPENROUTER_BASE_URL | (Optional) Defaults to https://openrouter.ai/api/v1 |
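
A quick preflight check that the required keys are present can save a failed run; the snippet below is a minimal illustration (the package also ships its own preflight validation in probellm/validate.py).

```python
import os

# Minimal check for required API keys before starting a run (illustrative only).
missing = [name for name in ("OPENAI_API_KEY", "OPENROUTER_API_KEY")
           if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```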

Quick Start

```python
from probellm import VulnerabilityPipelineAsync

pipeline = VulnerabilityPipelineAsync(
    model_name="anthropic/claude-sonnet-4.5",    # generator model
    test_model="openai/gpt-4o-mini",             # target model under test
    judge_model="gpt-4.1",                       # verification judge
    max_depth=1000,
    num_simulations=100,
    num_samples=5,
)

pipeline.add_datasets_batch(["mbpp", "mmlu", "hellaswag", "truthful_qa", "super_glue"])
pipeline.run()
```

Library API

Everything is importable directly from probellm:

```python
from probellm import VulnerabilityPipelineAsync                  # Core search pipeline
from probellm import create_checkpoints                          # Checkpoint creation
from probellm import resume_from_checkpoint                      # Resume interrupted search
from probellm import run_analysis                                # Post-hoc analysis
from probellm import ToolRegistry, build_default_tool_registry   # Tool system
```

Multi-Model Concurrent Search

Probe multiple target models in parallel under the same configuration:

```python
from probellm import VulnerabilityPipelineAsync

# Each tuple is a (generator model, target model) pair.
model_configs = [
    ("anthropic/claude-sonnet-4.5", "openai/gpt-4o-mini"),
    ("anthropic/claude-sonnet-4.5", "openai/gpt-4.1"),
    ("anthropic/claude-sonnet-4.5", "meta-llama/llama-3.1-8b-instruct"),
    ("anthropic/claude-sonnet-4.5", "deepseek/deepseek-v3.2"),
]

results = VulnerabilityPipelineAsync.run_multiple_pipelines(
    model_configs=model_configs,
    dataset_ids=["mbpp", "mmlu", "hellaswag", "truthful_qa", "super_glue"],
    judge_model="gpt-5.2",
    max_depth=1000,
    num_simulations=100,
    num_samples=5,
)
```

Checkpoints & Resume
```python
from probellm import create_checkpoints, resume_from_checkpoint

create_checkpoints("results/run_xxx")                         # save checkpoint
resume_from_checkpoint("results/run_xxx")                     # resume search
resume_from_checkpoint("results/run_xxx", verify_only=True)   # verify only
```

Analysis Pipeline

Run error extraction, benchmark statistics, and PCA + HDBSCAN failure-mode clustering:

```python
from probellm import run_analysis

run_analysis("results/run_xxx")                    # global analysis
run_analysis("results/run_xxx", by_dataset=True)   # per-dataset analysis
```

Outputs: errorSamples.json · benchmark_results.csv/.xlsx · enhanced_analysis/
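
Conceptually, the clustering stage works like the sketch below: reduce failure embeddings with PCA, then group them with HDBSCAN, which labels noise points -1. The dimensions and parameters are assumptions for illustration rather than the values run_analysis uses, and the example relies on scikit-learn >= 1.3 for its HDBSCAN implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import HDBSCAN   # requires scikit-learn >= 1.3

# Stand-in for failure-aware embeddings: two loose groups of failed queries.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(100, 1536)),
    rng.normal(loc=1.0, scale=0.05, size=(100, 1536)),
])

reduced = PCA(n_components=20).fit_transform(embeddings)     # dimensionality reduction
labels = HDBSCAN(min_cluster_size=5).fit_predict(reduced)    # density-based clustering
print(f"{labels.max() + 1} candidate failure-mode clusters, {(labels == -1).sum()} outliers")
```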

CLI Alternatives

```bash
python -m probellm.checkpoint results/run_xxx
python -m probellm.resume results/run_xxx
python -m probellm.resume results/run_xxx --verify-only
python -m probellm.analysis results/run_xxx
python -m probellm.analysis results/run_xxx --by_dataset
```

Tool System

ProbeLLM uses a pluggable MCP-style tool registry. During MCTS expansion, the generator LLM autonomously selects which tool to invoke for test-case construction and answer verification.

| Tool | Purpose | Typical Use |
|---|---|---|
| perturbation | Semantic-preserving rewording | Micro search (local exploration) |
| python_exec | Python code execution with LLM repair loop | Math / algorithmic questions |
| web_search | Evidence retrieval via OpenAI Responses API | Factual questions, Macro search |

> **Tip**
> You can add custom tools to extend ProbeLLM to new domains. See examples/chemistry_tool_integration.py for a real-world integration with ChemOrch.

Custom Tool Example

```python
from probellm.tools import ToolRegistry, LocalMCPTool, ToolSpec

registry = ToolRegistry()

spec = ToolSpec(
    name="my_domain_tool",
    description="Domain-specific test case generator.",
    input_schema={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
)

def handler(args):
    # Return a generated test case together with its verifiable ground truth.
    return {"generated_question": "...", "ground_truth": "..."}

registry.register(LocalMCPTool(spec, handler))
```

Default Models & Benchmarks

Target Models (12 tested)

| Proprietary | Open-Weight |
|---|---|
| GPT-4o-mini | Llama-3.1-8B-Instruct |
| Claude-3.5-Sonnet | Deepseek-v3.2 |
| Gemini-2.5-Flash | Phi-4 |
| Grok-4.1-Fast | Ministral-14B |
| | Devstral |
| | OLMo-3-7B-Instruct |
| | Granite-4.0 |
| | GPT-oss-20B |

Seed Benchmarks

| Benchmark | Aspect |
|---|---|
| MMLU | Multi-task knowledge |
| SuperGLUE | Language understanding |
| MBPP | Code generation |
| HellaSwag | Commonsense reasoning |
| TruthfulQA | Truthfulness |

Package Structure

```
probellm/
├── __init__.py              # Public API (lazy imports)
├── config.py                # Model name configuration
├── search.py                # Hierarchical MCTS pipeline
├── checkpoint.py            # Checkpoint creation
├── resume.py                # Resume from checkpoints
├── analysis.py              # Post-hoc analysis pipeline
├── validate.py              # Preflight validation
├── tools/
│   ├── mcp.py               # MCP protocol (ToolSpec, MCPTool)
│   ├── registry.py          # ToolRegistry & LocalMCPTool
│   ├── builtin.py           # Default tool registry factory
│   └── python_exec.py       # Python execution + LLM feedback
├── dataloader/
│   ├── dataset_loader.py    # YAML-driven dataset loader
│   ├── datasets_config.yaml # 5 benchmark configurations
│   └── sampler.py           # Hierarchical & sequential samplers
└── utils/
    ├── testcase_gen.py      # Tool-augmented test case generation
    ├── answer_gen.py        # Ground-truth answer generation
    ├── perturbation.py      # Perturbation strategies
    ├── web_retrieval.py     # Web retrieval tool
    ├── embedding.py         # Text embedding
    ├── pcaAnalysisEnhanced.py  # PCA + HDBSCAN clustering
    └── ...
```

Citation

If you find ProbeLLM useful in your research, please consider citing:

```bibtex
@article{huang2025probellm,
  title     = {ProbeLLM: Automating Principled Diagnosis of LLM Failures},
  author    = {Huang, Yue and Jiang, Zhengzhe and Ma, Yuchen and Jiang, Yu and Wang, Xiangqi and Zhou, Yujun and Hao, Yuexing and Guo, Kehan and Chen, Pin-Yu and Feuerriegel, Stefan and Zhang, Xiangliang},
  year      = {2025}
}
```

License

MIT License — see LICENSE for details.


Built with research from University of Notre Dame, LMU Munich, MIT, and IBM Research.
