200 changes: 198 additions & 2 deletions README.md
@@ -1,2 +1,198 @@
# SPIRAL
This is an AAAI-2026 conference paper repo.
# SPIRAL: Symbolic LLM Planning via Grounded and Reflective Search

This repository contains the code and analysis notebooks for **SPIRAL**, a framework that embeds a tri-agent cognitive architecture (Planner, Simulator, Critic) into a Monte Carlo Tree Search (MCTS) loop to enable more robust, grounded, and reflective planning with large language models. Full details are in our paper and appendix.

## Table of Contents

- [Repository Structure](#repository-structure)
- [Getting Started](#getting-started)
- [Dependencies](#dependencies)
- [Data](#data)
- [Running Experiments](#running-experiments)
- [Scripts & Agents](#scripts--agents)
- [Analysis Notebooks](#analysis-notebooks)
- [Configuration & Hyperparameters](#configuration--hyperparameters)
- [License](#license)

---

## Repository Structure

Note: All paths below are relative to the `SPIRAL/` folder; change into it before following the remaining steps.

```
├── analysis/
│ ├── ablation/
│ │ └── analysis_ablations.ipynb
│ ├── baseline/
│ │ ├── cot_k1/
│ │ ├── cot_k3/
│ │ ├── cot_k5/
│ │ ├── spiral/
│ │ ├── analysis_baseline_performance.ipynb
│ │ ├── analysis_cost_benefit.ipynb
│ │ ├── cost_comparison_api_calls.pdf
│ │ └── cost_comparison_tokens.pdf
│ └── sota/
│ ├── sota_performance/
│ └── tot_hyper_params_performance/
├── scripts/
│ ├── taskbench_ablation.py
│ ├── taskbench_cot_baseline.py
│ ├── taskbench_lats_baseline.py
│ ├── taskbench_rafa_baseline.py
│ ├── taskbench_react_baseline.py
│ ├── taskbench_react_rafa_baseline.py
│ ├── taskbench_spiral.py
│ └── taskbench_tot_baseline.py
├── Taskbench/
│ ├── data_dailylifeapis/
│ └── data_huggingface/
├── utils/
│ └── generic_client.py
├── environment.yml
├── run_all_baseline_experiments.sh
├── run_all_ablation_experiments.sh
├── run_all_sota_experiments.sh
└── LICENSE
```

---

## Getting Started

1. **Clone the repository**
```bash
git clone https://github.com/<your-org-or-username>/SPIRAL.git
cd SPIRAL
```

2. **Create the Conda environment**
```bash
conda create -n spiral python=3.11
conda activate spiral
pip install -r requirements.txt
```

3. **Download TaskBench datasets**
Place the `dailylifeapis` and `huggingface` benchmark data under `Taskbench/data_dailylifeapis/` and `Taskbench/data_huggingface/`, respectively.
```bash
cd Taskbench/
huggingface-cli download microsoft/Taskbench --local-dir . --repo-type dataset
```
---

## Dependencies


All required packages are listed in `requirements.txt`.

---

## Data

We evaluate on two TaskBench tool-use benchmarks:

- **DailyLifeAPIs** (`Taskbench/data_dailylifeapis/`)
- **HuggingFace** (`Taskbench/data_huggingface/`)

Each dataset should be organized as in the original TaskBench release:

```
Taskbench/data_dailylifeapis/
└── problems.jsonl

Taskbench/data_huggingface/
└── problems.jsonl
```
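
As a quick sanity check after downloading, a `problems.jsonl` file can be read one record per line. This is a minimal sketch: the helper name `iter_problems` is hypothetical, and it makes no assumption about the fields inside each record beyond each line being a JSON object.

```python
import json
from pathlib import Path
from typing import Iterator, Union


def iter_problems(path: Union[str, Path]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report the offending line so a truncated download is easy to spot.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from err
```

For example, `sum(1 for _ in iter_problems("Taskbench/data_dailylifeapis/problems.jsonl"))` counts the records in the DailyLifeAPIs split.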

---

## Running Experiments

### 1. Baseline Methods
NOTE: The baseline scripts do not currently run as-is; they require a different library structure.

```bash
./run_all_baseline_experiments.sh
```

This will run Chain-of-Thought (k=1,3,5), ReAct, RAFA, ToT, LATS, etc., via the corresponding `taskbench_*_baseline.py` scripts.

### 2. SPIRAL Agent
NOTE: Relative imports still need a proper refactor. Until then, run the script from inside `scripts/`:
```bash
cd scripts/
python taskbench_spiral.py --run_name test --api_family dailylifeapis --num_problems 10 --seed 50 --model_name mistral --debug_llm_output
```

Or run both benchmarks end-to-end:

```bash
./run_all_sota_experiments.sh
```
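
For orientation, the flags used in the `taskbench_spiral.py` command in this subsection could be parsed roughly as follows. This is a hedged sketch, not the script's actual parser: it assumes the intended flag spellings are `--num_problems` and `--model_name`, and the types, `choices`, and `required` settings are guesses made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags in the example command."""
    parser = argparse.ArgumentParser(description="Sketch of a SPIRAL TaskBench CLI")
    parser.add_argument("--run_name", required=True)  # label for this run's outputs
    # Assumed choices: the two TaskBench families named in this README.
    parser.add_argument("--api_family", choices=["dailylifeapis", "huggingface"])
    parser.add_argument("--num_problems", type=int)  # how many problems to evaluate
    parser.add_argument("--seed", type=int)
    parser.add_argument("--model_name")
    parser.add_argument("--debug_llm_output", action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With a parser like this, a flag written without its leading `--` (for example `num_problems 10`) is rejected as an unrecognized argument rather than silently ignored.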

### 3. Ablation Studies

```bash
./run_all_ablation_experiments.sh
```

This will sweep over standard MCTS budgets and disable components (Planner, Simulator, Critic) to quantify their impact.
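
The shape of such a sweep can be sketched as a small grid of configurations. Everything here is illustrative: the budget values are placeholders rather than the ones used in the paper, and the dictionary keys are hypothetical names, not options read by `taskbench_ablation.py`.

```python
from itertools import product

COMPONENTS = ("planner", "simulator", "critic")
BUDGETS = (10, 50)  # placeholder values, NOT the paper's sweep


def ablation_grid(budgets=BUDGETS, components=COMPONENTS):
    """Return one config per (budget, disabled component) pair.

    `disabled=None` denotes the full system, which serves as the reference run.
    """
    return [
        {"mcts_budget": budget, "disabled": disabled}
        for budget, disabled in product(budgets, (None, *components))
    ]


if __name__ == "__main__":
    for config in ablation_grid():
        print(config)
```

Comparing each ablated configuration against the `disabled=None` run at the same budget isolates the contribution of a single component.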

---

## Scripts & Agents

- **`scripts/taskbench_spiral.py`**
  Implements the SPIRAL agent:
  - **Planner**: proposes actions via LLM prompts
  - **Simulator**: predicts the next observation
  - **Critic**: scores plan progress

- **Baseline scripts** (`taskbench_cot_baseline.py`, `taskbench_react_baseline.py`, etc.)
  Wrap existing state-of-the-art methods for fair comparison.

- **`utils/generic_client.py`**
  A drop-in replacement for internal APIs, providing a `HuggingFaceChatClient` to interface with any Hugging Face LLM.
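
The division of labor among the three agents listed above can be sketched as a single expansion step. This is only an illustration under stated assumptions: the type aliases, the `expand_once` function, and the string-based state are hypothetical, and in the real script each role would be backed by LLM calls rather than plain callables.

```python
from typing import Callable, List, Tuple

Planner = Callable[[str], List[str]]   # state -> candidate actions
Simulator = Callable[[str, str], str]  # (state, action) -> predicted observation
Critic = Callable[[str], float]        # state -> progress score


def expand_once(
    state: str, planner: Planner, simulator: Simulator, critic: Critic
) -> List[Tuple[str, str, float]]:
    """Propose actions, simulate each one, and score the resulting state."""
    children = []
    for action in planner(state):
        observation = simulator(state, action)
        # Assumed state encoding: append the action and its predicted outcome.
        next_state = f"{state}\n{action} -> {observation}"
        children.append((action, next_state, critic(next_state)))
    return children


if __name__ == "__main__":
    # Stub agents standing in for LLM-backed ones.
    demo = expand_once(
        "task: book a table",
        planner=lambda s: ["search_restaurants", "check_calendar"],
        simulator=lambda s, a: "ok",
        critic=lambda s: 0.5,
    )
    for action, _, score in demo:
        print(action, score)
```

In an MCTS loop, a step like this would run at the selected leaf, with the Critic's scores backed up along the path to the root.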

---

## Analysis Notebooks

All result aggregation, tables, and figures are in `analysis/`:

- **`analysis_baseline_performance.ipynb`**
- **`analysis_cost_benefit.ipynb`**
- **`analysis_ablations.ipynb`**
- **`analysis/sota_*`**

Use these notebooks to reproduce the tables and plots in the paper and appendix.

---

## Configuration & Hyperparameters

Detailed hyperparameters are in Appendix B of the paper:

| Component | Default Value |
|-------------------------|--------------------------|
| MCTS Budget (K) | 50 iterations |
| Exploration Constant C | 1.5 |
| Planner Temperature | 0.1 |
| Simulator Temperature | 0.0 |
| Critic Temperature | 0.0 |
| Reward Weight α | 0.5 |
| Random Seeds | [42, 101, 1234, 2024, 12345] |

See **Appendix B** (`SPIRAL_Technical_Appendix.pdf`) for full details.
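
For reference, the defaults in the table can be collected in one place, together with the role the exploration constant plays in node selection. This is a sketch under assumptions: the class and field names are hypothetical, and `uct_score` is the textbook UCT rule, which may differ from the exact selection rule in the paper. The role of the reward weight α is not described here, so it is only stored.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SpiralDefaults:
    """Default values copied from the hyperparameter table."""
    mcts_budget: int = 50
    exploration_c: float = 1.5
    planner_temperature: float = 0.1
    simulator_temperature: float = 0.0
    critic_temperature: float = 0.0
    reward_weight_alpha: float = 0.5
    seeds: tuple = (42, 101, 1234, 2024, 12345)


def uct_score(total_value: float, visits: int, parent_visits: int,
              c: float = SpiralDefaults().exploration_c) -> float:
    """Textbook UCT: mean value plus an exploration bonus scaled by c.

    Assumes parent_visits >= 1. Unvisited children get +inf so that they
    are tried before any visited sibling is revisited.
    """
    if visits == 0:
        return math.inf
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)
```

A larger `c` weights the exploration bonus more heavily, so rarely visited children are selected more often within the fixed budget of iterations.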

---

## License

This project is released under the **MIT License**. See [LICENSE](LICENSE) for details.
Binary file added analysis/RQ1/cost_comparison_api_calls.pdf
Binary file not shown.
Binary file added analysis/RQ1/cost_comparison_tokens.pdf
Binary file not shown.
Empty file added analysis/RQ1/cot_k5/sample.txt
Empty file.
83 changes: 83 additions & 0 deletions analysis/RQ1/rq1_analysis.ipynb
@@ -0,0 +1,83 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "1d9f42be",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "87732b5c",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7069963",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "834bf0d8",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "a40da16c",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "c67ce02e",
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": []
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
2,199 changes: 2,199 additions & 0 deletions analysis/RQ1/rq1_analysis_rq1_1_0729.ipynb

Large diffs are not rendered by default.

2,663 changes: 2,663 additions & 0 deletions analysis/RQ1/rq1_analysis_rq1_2_0729.ipynb

Large diffs are not rendered by default.
