ToolChoiceConfusion: Causal Minimal Tool Filtering

This repository contains the experimental runner for studying whether larger tool menus make LLM agents less reliable.

Overview

The experiment compares several tool filtering methods:

all tools
keyword top-5
keyword top-10
state-aware filtering
full causal path exposure
causal minimal tool filtering (CMTF)
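To make the simplest baselines concrete, the keyword top-k filters above can be sketched as a word-overlap ranking. This is only an illustrative sketch, not the repository's implementation: the function name `keyword_top_k`, the `TOOLS` menu, and the overlap scoring rule are all assumptions made for the example.

```python
import re


def _words(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_top_k(task, tools, k=5):
    """Rank tools by word overlap with the task text and keep the top k.

    `tools` maps a tool name to a short description (hypothetical schema).
    """
    task_words = _words(task)
    scored = []
    for name, description in tools.items():
        overlap = len(task_words & _words(f"{name} {description}"))
        scored.append((overlap, name))
    # Highest overlap first; ties are broken by name so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical tool menu spanning the three benchmark domains.
TOOLS = {
    "calendar_create_event": "create a calendar event with a title and time",
    "email_send": "send an email message to a recipient",
    "file_read": "read the contents of a file or document",
    "file_delete": "delete a file",
}

if __name__ == "__main__":
    print(keyword_top_k("Send an email to Bob about the meeting", TOOLS, k=2))
```

With k=5 or k=10 this corresponds to the "keyword top-5" and "keyword top-10" conditions; the state-aware and causal methods would need task state that this sketch does not model.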

The benchmark currently uses synthetic multi-step tool-use tasks across calendar, email, and file/document domains.

Setup

Install dependencies:

python3 -m pip install --user --upgrade boto3

Set the AWS region:

export AWS_REGION=us-east-1
export AWS_DEFAULT_REGION=us-east-1

Run

Set the model list:

export BEDROCK_MODEL_IDS="amazon.nova-lite-v1:0,amazon.nova-pro-v1:0,anthropic.claude-3-haiku-20240307-v1:0,anthropic.claude-3-sonnet-20240229-v1:0"
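The variable is a comma-separated list, and each model ID itself contains a colon, so it should be split on commas only. A minimal sketch of how such a value could be parsed is shown here; `parse_model_ids` is a hypothetical helper, and the runner may read the variable differently.

```python
import os


def parse_model_ids(raw):
    """Split a comma-separated model list, dropping blanks and stray spaces."""
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    # Falls back to an empty list when the variable is not set.
    model_ids = parse_model_ids(os.environ.get("BEDROCK_MODEL_IDS", ""))
    for model_id in model_ids:
        print(model_id)
```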

Run the experiment:

python3 scaledExperiment.py

Analysis

After a run completes, copy the generated results into an analysis/ folder:

analysis/task_metrics.csv
analysis/raw_traces.jsonl
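The copy step can also be scripted. The sketch below assumes the run wrote its outputs to results_scaled/, as listed later in this README; `copy_results` is a hypothetical helper and not part of the repository.

```python
import shutil
from pathlib import Path

RESULT_FILES = ("task_metrics.csv", "raw_traces.jsonl")


def copy_results(src_dir="results_scaled", dst_dir="analysis"):
    """Copy the run outputs into the analysis folder; return the names copied."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in RESULT_FILES:
        source = src / name
        if source.exists():  # skip files the run did not produce
            shutil.copy2(source, dst / name)
            copied.append(name)
    return copied


if __name__ == "__main__":
    print(copy_results())
```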

Then generate summary tables:

python3 analyze_results.py

This writes:

tables/summary_by_model_method.csv
tables/summary_aggregate.csv
tables/summary_aggregate.tex

To generate plots:

python3 plot_results.py

This writes PNG and PDF figures into the figures/ folder.

The experiment run itself writes its raw outputs to:

results_scaled/raw_traces.jsonl
results_scaled/task_metrics.csv

These output files are ignored by Git because they can be regenerated.

Notes

Do not commit AWS keys, PEM files, logs, or generated result files.
