An automated probing framework that discovers structured failure modes of LLMs
via hierarchical Monte Carlo Tree Search.
Getting Started · Library API · Tool System · Models · Citation
> [!NOTE]
> ProbeLLM goes beyond isolated error cases. It discovers failure modes — recurring, structured patterns that explain how and why models fail — by combining hierarchical MCTS exploration with tool-augmented test generation and failure-aware clustering.
Static benchmarks provide only a fixed snapshot of model behavior — as LLMs evolve, their error distributions shift, leaving emerging weaknesses undetected. ProbeLLM formulates probing as a budget-aware hierarchical search that automatically:
- Explores new failure regions (Macro) and refines known ones (Micro) via UCB-guided MCTS
- Generates verifiable test cases with tool-augmented LLMs (code execution, web retrieval, perturbation)
- Synthesizes individual failures into interpretable failure modes via failure-aware embeddings and boundary-aware induction
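The budget split between the Macro and Micro regimes follows a standard UCB rule. Below is a minimal sketch of the idea only, not ProbeLLM's implementation; `probe_once` is a stand-in for one round of tool-augmented generation plus verification:

```python
import math, random

def probe_once(regime: str) -> bool:
    # Stand-in for generating one test case and checking whether the target fails;
    # a random stub here so the sketch runs end to end.
    return random.random() < (0.3 if regime == "Micro" else 0.1)

def ucb(arm: dict, t: int, c: float = 1.4) -> float:
    """UCB1 score: average reward plus an exploration bonus."""
    if arm["visits"] == 0:
        return float("inf")  # try each regime at least once
    return arm["reward"] / arm["visits"] + c * math.sqrt(math.log(t) / arm["visits"])

arms = {"Macro": {"visits": 0, "reward": 0.0},
        "Micro": {"visits": 0, "reward": 0.0}}

for t in range(1, 101):  # probing budget of 100 rounds
    regime = max(arms, key=lambda r: ucb(arms[r], t))
    arms[regime]["visits"] += 1
    arms[regime]["reward"] += 1.0 if probe_once(regime) else 0.0
```

A regime that keeps surfacing failures accumulates reward and is selected more often, while the exploration bonus keeps the other regime from being starved.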
Overview of ProbeLLM. (I) Probes a target model and initializes the search with seed test cases from existing benchmarks. (II) Hierarchical search selects between Macro and Micro regimes, performs tool-augmented generation, and verifies target model responses. (III) Failure-aware embeddings, clustering, and boundary-aware induction produce interpretable failure modes.
ProbeLLM allocates its probing budget between two complementary strategies:
| | Macro (Coverage) | Micro (Refinement) |
|---|---|---|
| Goal | Surface novel failure regions | Densify evidence around known failures |
| Mechanism | Greedy k-center diversity sampling | Local perturbation of seed queries |
| Tools | Web search, Python execution | Perturbation, Python execution |
| MCTS Role | Expands across clusters | Deepens within clusters |
Macro vs. Micro search strategy. Macro explores new clusters to diversify topics, while Micro refines within a cluster to deepen local evidence around known failure regions.
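The Macro mechanism in the table, greedy k-center sampling, picks maximally spread-out queries in embedding space. A self-contained sketch of the technique (the embedding matrix here is a random stand-in):

```python
import numpy as np

def greedy_k_center(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Greedily pick k points, each one farthest from the centers chosen so far."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(embeddings)))]
    # distance from every point to its nearest chosen center
    dists = np.linalg.norm(embeddings - embeddings[chosen[0]], axis=1)
    while len(chosen) < k:
        nxt = int(dists.argmax())  # the most novel remaining point
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen

# e.g. choose 5 diverse queries out of 1000 embedded candidates
picks = greedy_k_center(np.random.rand(1000, 384), k=5)
```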
UMAP visualization of failure samples for the Mistral (left) and GPT (right) model families. Small nodes represent individual failed queries, colored by version (cool → warm = older → newer). Directed edges connect semantically similar failures across generations, tracing the evolution of model deficits. Large purple centroids indicate weakness clusters persisting across versions; dark blue centroids are isolated to the latest model.
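A plot of this kind can be approximated with `umap-learn`. The sketch below also shows one plausible way to draw the cross-version edges; the file names, the version index array, and the 0.8 similarity threshold are all illustrative assumptions, not ProbeLLM's code:

```python
import numpy as np
import umap  # pip install umap-learn

# Hypothetical inputs: one embedding per failed query, plus its model-version index.
emb = np.load("failure_embeddings.npy")
versions = np.load("failure_versions.npy")

# 2-D layout for plotting the failure samples
coords = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(emb)

# Directed edges: link each failure to its most similar failure from the
# previous version, when cosine similarity is high enough.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
edges = []
for i in np.where(versions > versions.min())[0]:
    prev = np.where(versions == versions[i] - 1)[0]
    if len(prev) == 0:
        continue
    sims = unit[prev] @ unit[i]
    if sims.max() > 0.8:
        edges.append((int(prev[int(sims.argmax())]), int(i)))
```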
### Key Features

| Feature | Description |
|---|---|
| **Hierarchical MCTS** | Principled budget allocation between Macro & Micro, guided by UCB selection |
| **Tool-Augmented Generation** | MCP-style pluggable tools: web retrieval, code execution, perturbation |
| **Failure Mode Synthesis** | Failure-aware embeddings + HDBSCAN + boundary-aware induction |
| **Benchmark-Agnostic** | Seed queries are optional; works with any user-provided prompts |
| **Verified Test Cases** | Verifiable ground truths + LLM-as-Judge reduce spurious failures |
| **Multi-Model Concurrent** | Async probing of 12+ models across 5 benchmarks in parallel |
## Getting Started

Install dependencies:

```bash
pip install -r requirements.txt
```

Then set the following environment variables:

| Variable | Purpose |
|---|---|
| `OPENAI_API_KEY` | Judge model, embedding model, web retrieval tool |
| `OPENROUTER_API_KEY` | Target / generator models via OpenRouter |
| `OPENROUTER_BASE_URL` | (Optional) Defaults to `https://openrouter.ai/api/v1` |
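For example, in a POSIX shell (the key values are placeholders):

```bash
export OPENAI_API_KEY="sk-..."
export OPENROUTER_API_KEY="sk-or-..."
export OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"  # optional; this is the default
```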
Run a probing search end to end:

```python
from probellm import VulnerabilityPipelineAsync

pipeline = VulnerabilityPipelineAsync(
    model_name="anthropic/claude-sonnet-4.5",  # generator model
    test_model="openai/gpt-4o-mini",           # target model under test
    judge_model="gpt-4.1",                     # verification judge
    max_depth=1000,
    num_simulations=100,
    num_samples=5,
)
pipeline.add_datasets_batch(["mbpp", "mmlu", "hellaswag", "truthful_qa", "super_glue"])
pipeline.run()
```

## Library API

Everything is importable directly from `probellm`:
```python
from probellm import VulnerabilityPipelineAsync                  # Core search pipeline
from probellm import create_checkpoints                          # Checkpoint creation
from probellm import resume_from_checkpoint                      # Resume interrupted search
from probellm import run_analysis                                # Post-hoc analysis
from probellm import ToolRegistry, build_default_tool_registry   # Tool system
```

### Multi-Model Concurrent Search
Probe multiple target models in parallel under the same configuration:
```python
from probellm import VulnerabilityPipelineAsync

model_configs = [  # (generator model, target model) pairs
    ("anthropic/claude-sonnet-4.5", "openai/gpt-4o-mini"),
    ("anthropic/claude-sonnet-4.5", "openai/gpt-4.1"),
    ("anthropic/claude-sonnet-4.5", "meta-llama/llama-3.1-8b-instruct"),
    ("anthropic/claude-sonnet-4.5", "deepseek/deepseek-v3.2"),
]

results = VulnerabilityPipelineAsync.run_multiple_pipelines(
    model_configs=model_configs,
    dataset_ids=["mbpp", "mmlu", "hellaswag", "truthful_qa", "super_glue"],
    judge_model="gpt-5.2",
    max_depth=1000,
    num_simulations=100,
    num_samples=5,
)
```
### Checkpoints & Resume

```python
from probellm import create_checkpoints, resume_from_checkpoint

create_checkpoints("results/run_xxx")                         # save checkpoint
resume_from_checkpoint("results/run_xxx")                     # resume search
resume_from_checkpoint("results/run_xxx", verify_only=True)   # verify only
```

### Analysis Pipeline
Run error extraction, benchmark statistics, and PCA + HDBSCAN failure-mode clustering:
```python
from probellm import run_analysis

run_analysis("results/run_xxx")                   # global analysis
run_analysis("results/run_xxx", by_dataset=True)  # per-dataset analysis
```

Outputs: `errorSamples.json` · `benchmark_results.csv/.xlsx` · `enhanced_analysis/`
### CLI Alternatives
```bash
python -m probellm.checkpoint results/run_xxx              # save checkpoint
python -m probellm.resume results/run_xxx                  # resume search
python -m probellm.resume results/run_xxx --verify-only    # verify only
python -m probellm.analysis results/run_xxx                # global analysis
python -m probellm.analysis results/run_xxx --by_dataset   # per-dataset analysis
```

## Tool System

ProbeLLM uses a pluggable MCP-style tool registry. During MCTS expansion, the generator LLM autonomously selects which tool to invoke for test-case construction and answer verification.
| Tool | Purpose | Typical Use |
|---|---|---|
| `perturbation` | Semantic-preserving rewording | Micro search (local exploration) |
| `python_exec` | Python code execution with LLM repair loop | Math / algorithmic questions |
| `web_search` | Evidence retrieval via OpenAI Responses API | Factual questions, Macro search |
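The `python_exec` repair loop follows a familiar pattern: run the code, and on failure hand the traceback back to an LLM for a fix. A generic sketch of that pattern, not ProbeLLM's internal code; `fix_with_llm` is a caller-supplied stand-in:

```python
import subprocess, sys

def run_with_repair(code: str, fix_with_llm, max_attempts: int = 3) -> str:
    """Execute Python code; on error, ask an LLM to repair it and retry."""
    for _ in range(max_attempts):
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=30)
        if proc.returncode == 0:
            return proc.stdout
        code = fix_with_llm(code, proc.stderr)  # repair using the error message
    raise RuntimeError("code still failing after repair budget exhausted")
```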
> [!TIP]
> You can add custom tools to extend ProbeLLM to new domains. See `examples/chemistry_tool_integration.py` for a real-world integration with ChemOrch.
### Custom Tool Example
```python
from probellm.tools import ToolRegistry, LocalMCPTool, ToolSpec

registry = ToolRegistry()

spec = ToolSpec(
    name="my_domain_tool",
    description="Domain-specific test case generator.",
    input_schema={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
)

def handler(args):
    # Return a generated test case together with its verifiable ground truth.
    return {"generated_question": "...", "ground_truth": "..."}

registry.register(LocalMCPTool(spec, handler))
```
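To use the custom tool alongside the built-ins, one plausible pattern is to extend the default registry instead of an empty one (assuming `build_default_tool_registry` takes no required arguments):

```python
from probellm import build_default_tool_registry

registry = build_default_tool_registry()        # perturbation, python_exec, web_search
registry.register(LocalMCPTool(spec, handler))  # plus the custom domain tool
```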
## Models

### Target Models (12 tested)

The examples in this README probe `openai/gpt-4o-mini`, `openai/gpt-4.1`, `meta-llama/llama-3.1-8b-instruct`, and `deepseek/deepseek-v3.2`; model names are configured in `probellm/config.py`.

### Seed Benchmarks

`mbpp` · `mmlu` · `hellaswag` · `truthful_qa` · `super_glue` (configured in `probellm/dataloader/datasets_config.yaml`)
## Package Structure
```
probellm/
├── __init__.py               # Public API (lazy imports)
├── config.py                 # Model name configuration
├── search.py                 # Hierarchical MCTS pipeline
├── checkpoint.py             # Checkpoint creation
├── resume.py                 # Resume from checkpoints
├── analysis.py               # Post-hoc analysis pipeline
├── validate.py               # Preflight validation
├── tools/
│   ├── mcp.py                # MCP protocol (ToolSpec, MCPTool)
│   ├── registry.py           # ToolRegistry & LocalMCPTool
│   ├── builtin.py            # Default tool registry factory
│   └── python_exec.py        # Python execution + LLM feedback
├── dataloader/
│   ├── dataset_loader.py     # YAML-driven dataset loader
│   ├── datasets_config.yaml  # 5 benchmark configurations
│   └── sampler.py            # Hierarchical & sequential samplers
└── utils/
    ├── testcase_gen.py       # Tool-augmented test case generation
    ├── answer_gen.py         # Ground-truth answer generation
    ├── perturbation.py       # Perturbation strategies
    ├── web_retrieval.py      # Web retrieval tool
    ├── embedding.py          # Text embedding
    ├── pcaAnalysisEnhanced.py  # PCA + HDBSCAN clustering
    └── ...
```
## Citation

If you find ProbeLLM useful in your research, please consider citing:
```bibtex
@article{huang2025probellm,
  title  = {ProbeLLM: Automating Principled Diagnosis of LLM Failures},
  author = {Huang, Yue and Jiang, Zhengzhe and Ma, Yuchen and Jiang, Yu and Wang, Xiangqi and Zhou, Yujun and Hao, Yuexing and Guo, Kehan and Chen, Pin-Yu and Feuerriegel, Stefan and Zhang, Xiangliang},
  year   = {2025}
}
```

MIT License — see `LICENSE` for details.
Built with research from University of Notre Dame, LMU Munich, MIT, and IBM Research.