Behari, N., Zhang, E., Zhao, Y., Taneja, A., Nagaraj, D., & Tambe, M. Decision-Language Model (DLM) for Dynamic Restless Multi-Armed Bandit Tasks in Public Health. NeurIPS 2024.
Restless multi-armed bandits (RMABs) are widely used to allocate scarce resources in public-health programs, but classical formulations commit to a single reward function and cannot adapt to evolving policy priorities. DLM treats a large language model as an automated planner that proposes reward functions as code, trains a budget-constrained policy under each candidate, evaluates the resulting behavior against an operator-specified target, and iterates. The result is rapid policy adjustment from natural-language commands without manual reward engineering.
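Under the hood, `run_task` drives a propose-train-evaluate loop. Below is a minimal sketch of that loop, not the package's actual implementation: `propose_reward` and `train_and_score` are stubbed placeholders standing in for the internal steps.

```python
# Illustrative sketch of the DLM loop; the real steps live inside dlm.run_task.

def propose_reward(goal: str, feedback: str | None) -> str:
    """Ask the LLM for a reward function as code (stubbed placeholder)."""
    return "state + state * agent_feats[2]"

def train_and_score(reward_src: str) -> float:
    """Train a budget-constrained policy under the candidate reward and
    score its behavior against the operator's target (stubbed placeholder)."""
    return 0.0

def dlm_loop(goal: str, n_iters: int = 10) -> str | None:
    best_src, best_score, feedback = None, float("inf"), None
    for _ in range(n_iters):
        src = propose_reward(goal, feedback)    # 1. LLM proposes a reward as code
        score = train_and_score(src)            # 2-3. train a policy, evaluate it
        if score < best_score:                  # keep the best candidate so far
            best_src, best_score = src, score
        feedback = f"last candidate scored {score:.3f}"  # 4. reflect, iterate
    return best_src
```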
Requires Python 3.11+ with numpy, scipy, and torch.
```bash
git clone https://github.com/NikhilBehari/dlm.git
cd dlm
conda env create -f environment.yml
conda activate dlm
```

Or with pip:
```bash
pip install -e .
```

Each LLM backend installs as an optional extra:
pip install -e ".[openai]" # OpenAI models
pip install -e ".[anthropic]" # Anthropic models
pip install -e ".[gemini]" # Google modelsCopy llm_config.example.toml to
llm_config.toml and fill in your provider, model, and API key. The
populated file is gitignored.
```toml
provider = "gemini"
model = "gemini-2.5-flash-lite"
api_key = "..."
temperature = 1.0
```

End-to-end DLM loop:
```python
from dlm import (
    DLMTask, compile_reward, default_output_dir,
    load_provider, run_task, synthetic_dataset,
)

dataset = synthetic_dataset(n_arms=12, n_arm_types=3)
provider = load_provider("llm_config.toml")

task = DLMTask(
    name="prioritize_type_2",
    goal="Focus the budget on arms whose type is 2.",
    target_reward=compile_reward("state + state * agent_feats[2]"),
)

result = run_task(
    task, dataset, budget=4.0, provider=provider,
    output_dir=default_output_dir(task.name),
    progress=print,
)
print(result.best_candidate.source)
```
Standalone Lagrangian-PPO training under a fixed reward:

```python
from dlm import RMABEnv, ScriptedReward, Trainer, TrainerConfig, synthetic_dataset

env = RMABEnv(
    dataset=synthetic_dataset(n_arms=12, n_arm_types=3),
    budget=4.0,
    reward_fn=ScriptedReward("state + state * agent_feats[2]"),
)
Trainer(env, TrainerConfig(epochs=30)).train()
```
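The same trainer can consume a reward discovered by `run_task`. A sketch, assuming (as the quickstart suggests) that a candidate's `source` is an expression in the format `ScriptedReward` accepts:

```python
# Retrain under the best discovered reward from the quickstart's `result`
# (assumes candidate sources use the same expression format as ScriptedReward).
from dlm import RMABEnv, ScriptedReward, Trainer, TrainerConfig

env = RMABEnv(
    dataset=dataset,  # the dataset from the run_task example above
    budget=4.0,
    reward_fn=ScriptedReward(result.best_candidate.source),
)
Trainer(env, TrainerConfig(epochs=30)).train()
```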
There are three ways to define your RMAB environment, depending on the form of your input data.

From raw arrays:

```python
from dlm import RMABDataset

dataset = RMABDataset(
    transition_matrices=T,         # (N, S, A, S), rows summing to 1
    features=X,                    # (N, F)
    feature_names=("age", "..."),  # optional
    action_costs=[0, 1],           # optional; default arange(A)
)
```
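Here `T` and `X` are whatever arrays you already have; purely as a sketch, valid placeholders with the required shapes can be generated with numpy:

```python
import numpy as np

N, S, A, F = 12, 2, 2, 3   # arms, states, actions, feature dimension
rng = np.random.default_rng(0)

# Random transitions, normalized so each (arm, state, action) row sums to 1.
T = rng.random((N, S, A, S))
T /= T.sum(axis=-1, keepdims=True)

X = rng.random((N, F))     # per-arm feature vectors
```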
From a list of per-arm specs (works for any S, A):

```python
from dlm import Arm, binary_arm, from_arms

arms = [
    binary_arm([1.0, 0.0], p_pull_at_0=0.8, p_pass_at_0=0.2),
    binary_arm([0.0, 1.0], p_pull_at_0=0.5, p_pass_at_0=0.3),
]
dataset = from_arms(arms, feature_names=("type_a", "type_b"))
```
From a standardized directory on disk:

```
my_env/
├── env.json         {"feature_names": [...], "action_costs": [...]}
├── transitions.npy  (N, S, A, S)
└── features.npy     (N, F)
```

```python
dataset = RMABDataset.load("my_env/")
dataset.save("my_env/")  # round-trip back to disk
```

Runnable scripts live in `examples/`.
Runs persist to `outputs/<timestamp>_<task>/` (gitignored): `config.json`, `run.log`, `transcript.md`, `training.jsonl`, `prompts.jsonl`, `candidates.jsonl`, `result.json`.
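Because the logs are JSON/JSONL, runs are easy to inspect after the fact. A minimal sketch (the record schema is whatever the run wrote, so this just prints raw records):

```python
import json
from pathlib import Path

run_dir = Path("outputs") / "<timestamp>_<task>"  # substitute an actual run directory
for line in (run_dir / "candidates.jsonl").read_text().splitlines():
    print(json.loads(line))  # one LLM-proposed candidate per record
```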
```bibtex
@inproceedings{behari2024dlm,
  title     = {Decision-Language Model {(DLM)} for Dynamic Restless Multi-Armed Bandit Tasks in Public Health},
  author    = {Behari, Nikhil and Zhang, Edwin and Zhao, Yunfan and Taneja, Aparna and Nagaraj, Dheeraj and Tambe, Milind},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2024},
}
```

Released under the MIT License.
