
ProbeLLM

Automating Principled Diagnosis of LLM Failures

Python 3.8+ · License: MIT · Docs · MCP Tools · OpenAI · OpenRouter

An automated probing framework that discovers structured failure modes of LLMs
via hierarchical Monte Carlo Tree Search.

Getting Started  ·  Library API  ·  Tool System  ·  Models  ·  Citation


> **Note**
> ProbeLLM goes beyond isolated error cases. It discovers failure modes — recurring, structured patterns that explain how and why models fail — by combining hierarchical MCTS exploration with tool-augmented test generation and failure-aware clustering.


Overview

Static benchmarks provide only a fixed snapshot of model behavior — as LLMs evolve, their error distributions shift, leaving emerging weaknesses undetected. ProbeLLM formulates probing as a budget-aware hierarchical search that automatically:

  1. Explores new failure regions (Macro) and refines known ones (Micro) via UCB-guided MCTS (see the sketch after this list)
  2. Generates verifiable test cases with tool-augmented LLMs (code execution, web retrieval, perturbation)
  3. Synthesizes individual failures into interpretable failure modes via failure-aware embeddings and boundary-aware induction
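
As a rough illustration of step 1, the sketch below picks between the Macro and Micro regimes with a standard UCB1 score: average failure yield plus an exploration bonus. The function, reward definition, and statistics are illustrative assumptions, not ProbeLLM's internal implementation.

```python
import math

def select_regime(stats):
    """Pick the regime with the highest UCB1 score (illustrative sketch)."""
    total_pulls = sum(s["pulls"] for s in stats.values())

    def ucb(s):
        if s["pulls"] == 0:
            return float("inf")            # try each regime at least once
        mean = s["reward"] / s["pulls"]    # average failures found per probe
        return mean + math.sqrt(2 * math.log(total_pulls) / s["pulls"])

    return max(stats, key=lambda regime: ucb(stats[regime]))

# Hypothetical running statistics: Macro probed 12 times (4 failures found),
# Micro probed 8 times (5 failures found).
stats = {
    "macro": {"pulls": 12, "reward": 4.0},
    "micro": {"pulls": 8, "reward": 5.0},
}
print(select_regime(stats))   # prints "micro" for these numbers
```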


Overview of ProbeLLM. (I) Probes a target model and initializes the search with seed test cases from existing benchmarks. (II) Hierarchical search selects between Macro and Micro regimes, performs tool-augmented generation, and verifies target model responses. (III) Failure-aware embeddings, clustering, and boundary-aware induction produce interpretable failure modes.

Macro vs. Micro Search

ProbeLLM allocates its probing budget between two complementary strategies:

| | Macro (Coverage) | Micro (Refinement) |
|---|---|---|
| Goal | Surface novel failure regions | Densify evidence around known failures |
| Mechanism | Greedy k-center diversity sampling | Local perturbation of seed queries |
| Tools | Web search, Python execution | Perturbation, Python execution |
| MCTS Role | Expands across clusters | Deepens within clusters |


Macro vs. Micro search strategy. Macro explores new clusters to diversify topics, while Micro refines within a cluster to deepen local evidence around known failure regions.
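
The greedy k-center mechanism in the Macro row can be pictured with a short sketch: start from one seed embedding and repeatedly add the query whose embedding is farthest from every center selected so far. The embedding shapes and Euclidean metric below are placeholder assumptions, not the library's actual sampler.

```python
import numpy as np

def greedy_k_center(embeddings, k):
    """Return indices of k mutually distant embeddings (illustrative sketch)."""
    selected = [0]                                           # arbitrary first center
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(1, k):
        nxt = int(dists.argmax())                            # farthest from all chosen centers
        selected.append(nxt)
        new_d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dists = np.minimum(dists, new_d)                     # distance to nearest chosen center
    return selected

rng = np.random.default_rng(0)
query_embeddings = rng.normal(size=(100, 32))                # 100 placeholder query embeddings
print(greedy_k_center(query_embeddings, k=5))                # indices of 5 diverse seeds
```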

Failure Mode Evolution



UMAP visualization of failure samples for the Mistral (left) and GPT (right) model families. Small nodes represent individual failed queries, colored by version (cool → warm = older → newer). Directed edges connect semantically similar failures across generations, tracing the evolution of model deficits. Large purple centroids indicate weakness clusters persisting across versions; dark blue centroids are isolated to the latest model.

Highlights

- **Hierarchical MCTS**: Principled budget allocation between Macro & Micro, guided by UCB selection
- **Tool-Augmented Generation**: MCP-style pluggable tools (web retrieval, code execution, perturbation)
- **Failure Mode Synthesis**: Failure-aware embeddings + HDBSCAN + boundary-aware induction
- **Benchmark-Agnostic**: Seed queries are optional; works with any user-provided prompts
- **Verified Test Cases**: Verifiable ground truths + LLM-as-Judge reduce spurious failures
- **Multi-Model Concurrent**: Async probing of 12+ models across 5 benchmarks in parallel


Getting Started

Installation

```bash
pip install -r requirements.txt
```

Environment Variables

| Variable | Purpose |
|---|---|
| OPENAI_API_KEY | Judge model, embedding model, web retrieval tool |
| OPENROUTER_API_KEY | Target / generator models via OpenRouter |
| OPENROUTER_BASE_URL | (Optional) Defaults to https://openrouter.ai/api/v1 |
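
A quick preflight check that the required keys are present can save a failed run; the snippet below is a minimal illustration (the package also ships its own preflight validation in probellm/validate.py).

```python
import os

# Minimal check for required API keys before starting a run (illustrative only).
missing = [name for name in ("OPENAI_API_KEY", "OPENROUTER_API_KEY")
           if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```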

Quick Start

```python
from probellm import VulnerabilityPipelineAsync

pipeline = VulnerabilityPipelineAsync(
    model_name="anthropic/claude-sonnet-4.5",    # generator model
    test_model="openai/gpt-4o-mini",             # target model under test
    judge_model="gpt-4.1",                       # verification judge
    max_depth=1000,
    num_simulations=100,
    num_samples=5,
)

pipeline.add_datasets_batch(["mbpp", "mmlu", "hellaswag", "truthful_qa", "super_glue"])
pipeline.run()
```

Library API

Everything is importable directly from probellm:

```python
from probellm import VulnerabilityPipelineAsync                  # Core search pipeline
from probellm import create_checkpoints                          # Checkpoint creation
from probellm import resume_from_checkpoint                      # Resume interrupted search
from probellm import run_analysis                                # Post-hoc analysis
from probellm import ToolRegistry, build_default_tool_registry   # Tool system
```

Multi-Model Concurrent Search

Probe multiple target models in parallel under the same configuration:

```python
from probellm import VulnerabilityPipelineAsync

# Each tuple is a (generator model, target model) pair.
model_configs = [
    ("anthropic/claude-sonnet-4.5", "openai/gpt-4o-mini"),
    ("anthropic/claude-sonnet-4.5", "openai/gpt-4.1"),
    ("anthropic/claude-sonnet-4.5", "meta-llama/llama-3.1-8b-instruct"),
    ("anthropic/claude-sonnet-4.5", "deepseek/deepseek-v3.2"),
]

results = VulnerabilityPipelineAsync.run_multiple_pipelines(
    model_configs=model_configs,
    dataset_ids=["mbpp", "mmlu", "hellaswag", "truthful_qa", "super_glue"],
    judge_model="gpt-5.2",
    max_depth=1000,
    num_simulations=100,
    num_samples=5,
)
```

Checkpoints & Resume
```python
from probellm import create_checkpoints, resume_from_checkpoint

create_checkpoints("results/run_xxx")                         # save checkpoint
resume_from_checkpoint("results/run_xxx")                     # resume search
resume_from_checkpoint("results/run_xxx", verify_only=True)   # verify only
```

Analysis Pipeline

Run error extraction, benchmark statistics, and PCA + HDBSCAN failure-mode clustering:

```python
from probellm import run_analysis

run_analysis("results/run_xxx")                    # global analysis
run_analysis("results/run_xxx", by_dataset=True)   # per-dataset analysis
```

Outputs: errorSamples.json · benchmark_results.csv/.xlsx · enhanced_analysis/
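
Conceptually, the clustering stage works like the sketch below: reduce failure embeddings with PCA, then group them with HDBSCAN, which labels noise points -1. The dimensions and parameters are assumptions for illustration rather than the values run_analysis uses, and the example relies on scikit-learn >= 1.3 for its HDBSCAN implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import HDBSCAN   # requires scikit-learn >= 1.3

# Stand-in for failure-aware embeddings: two loose groups of failed queries.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(100, 1536)),
    rng.normal(loc=1.0, scale=0.05, size=(100, 1536)),
])

reduced = PCA(n_components=20).fit_transform(embeddings)     # dimensionality reduction
labels = HDBSCAN(min_cluster_size=5).fit_predict(reduced)    # density-based clustering
print(f"{labels.max() + 1} candidate failure-mode clusters, {(labels == -1).sum()} outliers")
```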

CLI Alternatives

```bash
python -m probellm.checkpoint results/run_xxx
python -m probellm.resume results/run_xxx
python -m probellm.resume results/run_xxx --verify-only
python -m probellm.analysis results/run_xxx
python -m probellm.analysis results/run_xxx --by_dataset
```

Tool System

ProbeLLM uses a pluggable MCP-style tool registry. During MCTS expansion, the generator LLM autonomously selects which tool to invoke for test-case construction and answer verification.

| Tool | Purpose | Typical Use |
|---|---|---|
| perturbation | Semantic-preserving rewording | Micro search (local exploration) |
| python_exec | Python code execution with LLM repair loop | Math / algorithmic questions |
| web_search | Evidence retrieval via OpenAI Responses API | Factual questions, Macro search |

> **Tip**
> You can add custom tools to extend ProbeLLM to new domains. See examples/chemistry_tool_integration.py for a real-world integration with ChemOrch.

Custom Tool Example

```python
from probellm.tools import ToolRegistry, LocalMCPTool, ToolSpec

registry = ToolRegistry()

spec = ToolSpec(
    name="my_domain_tool",
    description="Domain-specific test case generator.",
    input_schema={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
)

def handler(args):
    # Return a generated test case together with its verifiable ground truth.
    return {"generated_question": "...", "ground_truth": "..."}

registry.register(LocalMCPTool(spec, handler))
```

Default Models & Benchmarks

Target Models (12 tested)

| Proprietary | Open-Weight |
|---|---|
| GPT-4o-mini | Llama-3.1-8B-Instruct |
| Claude-3.5-Sonnet | Deepseek-v3.2 |
| Gemini-2.5-Flash | Phi-4 |
| Grok-4.1-Fast | Ministral-14B |
| | Devstral |
| | OLMo-3-7B-Instruct |
| | Granite-4.0 |
| | GPT-oss-20B |

Seed Benchmarks

| Benchmark | Aspect |
|---|---|
| MMLU | Multi-task knowledge |
| SuperGLUE | Language understanding |
| MBPP | Code generation |
| HellaSwag | Commonsense reasoning |
| TruthfulQA | Truthfulness |

Package Structure

```
probellm/
├── __init__.py              # Public API (lazy imports)
├── config.py                # Model name configuration
├── search.py                # Hierarchical MCTS pipeline
├── checkpoint.py            # Checkpoint creation
├── resume.py                # Resume from checkpoints
├── analysis.py              # Post-hoc analysis pipeline
├── validate.py              # Preflight validation
├── tools/
│   ├── mcp.py               # MCP protocol (ToolSpec, MCPTool)
│   ├── registry.py          # ToolRegistry & LocalMCPTool
│   ├── builtin.py           # Default tool registry factory
│   └── python_exec.py       # Python execution + LLM feedback
├── dataloader/
│   ├── dataset_loader.py    # YAML-driven dataset loader
│   ├── datasets_config.yaml # 5 benchmark configurations
│   └── sampler.py           # Hierarchical & sequential samplers
└── utils/
    ├── testcase_gen.py      # Tool-augmented test case generation
    ├── answer_gen.py        # Ground-truth answer generation
    ├── perturbation.py      # Perturbation strategies
    ├── web_retrieval.py     # Web retrieval tool
    ├── embedding.py         # Text embedding
    ├── pcaAnalysisEnhanced.py  # PCA + HDBSCAN clustering
    └── ...
```

Citation

If you find ProbeLLM useful in your research, please consider citing:

```bibtex
@article{huang2025probellm,
  title     = {ProbeLLM: Automating Principled Diagnosis of LLM Failures},
  author    = {Huang, Yue and Jiang, Zhengzhe and Ma, Yuchen and Jiang, Yu and Wang, Xiangqi and Zhou, Yujun and Hao, Yuexing and Guo, Kehan and Chen, Pin-Yu and Feuerriegel, Stefan and Zhang, Xiangliang},
  year      = {2025}
}
```

License

MIT License — see LICENSE for details.


Built with research from University of Notre Dame, LMU Munich, MIT, and IBM Research.
