This repository provides the implementation used to reproduce the numerical results in *Generalized Priority-Aware Shapley Value* (GPASV).
- Python 3.11+
- Required libraries (pinned in `requirements.txt`):
  - `numpy`, `pandas`, `scipy`, `scikit-learn`
  - `numba` (sampler acceleration)
  - `matplotlib`, `seaborn`
  - `torch`, `torchvision`, `torchaudio` (LLM evaluation only)
  - `transformers`, `huggingface_hub`, `datasets` (LLM evaluation only)
  - `vllm` (batched local inference for the LLM evaluation)
Install with:

```bash
pip install -r requirements.txt
```

If you are using a CUDA-enabled GPU, install PyTorch with the appropriate CUDA build (the LLM evaluation in `exp2_llm_evaluation/` requires a GPU):

```bash
pip install torch==2.10.0 torchvision==0.25.0 --index-url https://download.pytorch.org/whl/cu126
```

The LLM evaluation downloads MT-Bench and Chatbot Arena from HuggingFace. Set your HuggingFace token before running:
```bash
export HF_TOKEN=hf_xxx
```

The repository contains two experiment directories:

- `exp1_simulation/` — synthetic-graph simulations on cyclic directed graphs (Section 5).
- `exp2_llm_evaluation/` — LLM ensemble valuation on MT-Bench with the Chatbot Arena pairwise-preference graph (Section 6).
Three simulations:
- Sim 1: mixing time of MCMC — `main_mixing.py` runs the adjacent-swap Metropolis–Hastings sampler under random and greedy initialization across graph families and grid sizes; `plot_mixing.py` produces the mixing-time figure.
- Sim 2: Monte Carlo accuracy — `main_accuracy.py` measures the absolute relative error of the direct permutation estimator across two scenarios; `main_surrogate.py` compares it against linear and quadratic surrogate-assisted estimators under matched utility-evaluation budgets. `plot_accuracy.py` and `plot_surrogate.py` render the corresponding figures.
- Sim 3: priority sweeping — `main_sweep.py` sweeps soft and hard priority temperatures and tracks group-sum GPASV under two utility families; `plot_sweep.py` renders the sweep figure.
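The adjacent-swap sampler in Sim 1 can be sketched as follows. This is an illustrative re-implementation, not the repository's code: the target density `exp(-beta * inversions)`, the edge-inversion count, and the temperature `beta` are all assumptions made for the sketch. Because an adjacent swap is a symmetric proposal, the acceptance probability reduces to the ratio of target weights.

```python
import math
import random

def inversions(perm, edges):
    """Count priority edges (u, v) that perm orders backwards (v before u)."""
    pos = {p: i for i, p in enumerate(perm)}
    return sum(1 for u, v in edges if pos[u] > pos[v])

def adjacent_swap_mh(n, edges, beta, steps, seed=0):
    """Metropolis-Hastings over permutations of range(n) with adjacent-swap
    proposals, targeting pi(perm) proportional to exp(-beta * inversions)."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)                          # random initialization
    cur = -beta * inversions(perm, edges)      # current log-weight
    for _ in range(steps):
        i = rng.randrange(n - 1)               # propose swapping positions i, i+1
        perm[i], perm[i + 1] = perm[i + 1], perm[i]
        new = -beta * inversions(perm, edges)
        if rng.random() < math.exp(min(0.0, new - cur)):
            cur = new                          # accept (symmetric proposal)
        else:
            perm[i], perm[i + 1] = perm[i + 1], perm[i]  # reject: undo the swap
    return perm
```

At large `beta` the chain concentrates on orderings with few violated edges, which is why initialization (random vs. greedy) matters for mixing time.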
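The direct permutation estimator evaluated in Sim 2 averages marginal contributions over sampled orderings. A minimal sketch, using uniform permutation sampling and a toy utility for illustration (both assumptions — GPASV draws orderings from a priority-tilted distribution, and the repository's utility families differ):

```python
import random

def permutation_estimator(players, utility, num_perms, seed=0):
    """Average each player's marginal contribution utility(prefix + [p]) -
    utility(prefix) over sampled permutations of the player set."""
    rng = random.Random(seed)
    values = {p: 0.0 for p in players}
    for _ in range(num_perms):
        perm = list(players)
        rng.shuffle(perm)                # uniform sampling (assumption)
        prefix = []
        prev = utility(prefix)
        for p in perm:                   # walk the prefix coalitions
            prefix.append(p)
            cur = utility(prefix)
            values[p] += cur - prev
            prev = cur
    return {p: total / num_perms for p, total in values.items()}
```

For an additive utility the marginal contribution of each player is constant, so the estimator is exact after any number of permutations; the simulations measure how error behaves for non-additive utilities.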
To run everything end-to-end:

```bash
cd exp1_simulation
bash run_simulation.sh
```

`run_simulation.sh` runs the four `main_*.py` scripts (long-running) and then invokes the four `plot_*.py` scripts to write figures into `figure/`.
This experiment values 20 LLMs on MT-Bench using the Chatbot Arena pairwise-preference graph as a hard cyclic priority and an open-source vs. paid label as a soft priority.
Pipeline:
- `main_subset.py` populates a (subset, prompt) cache by running an aggregator–judge pipeline (Qwen3.5-35B-A3B-FP8 served locally via vLLM) over every prefix coalition of every sampled permutation. One question per invocation; parallelise across the 80 MT-Bench prompts if you have multiple GPUs.
- `main_value.py` computes per-question GPASV value pickles from the cache for all 19 priority regimes along the three sweeps.
- `main_ess.py` computes the post-hoc ESS matrix used for the SNIS reuse table.
- `plot_llm.py` renders the five figures (two for the main text, three for the appendix).
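The post-hoc ESS for self-normalized importance sampling is commonly computed with the Kish formula, ESS = (Σᵢ wᵢ)² / Σᵢ wᵢ², where wᵢ are the importance weights of the cached samples under a new priority regime. A sketch under that standard definition — the weight construction and the `log_dens` layout are assumptions for illustration, not the repository's code:

```python
import math

def snis_ess(log_weights):
    """Kish effective sample size (sum w)^2 / (sum w^2) from log-weights."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]  # subtract max for stability
    s = sum(w)
    return s * s / sum(x * x for x in w)

def ess_matrix(log_dens):
    """ESS of reusing one shared sample cache across regimes.

    log_dens[r][k] = log density of cached sample k under regime r (assumed
    layout). Entry (i, j) is the ESS when samples treated as drawn under
    regime i are reweighted to regime j via w_k = dens_j(k) / dens_i(k)."""
    n = len(log_dens[0])
    return [[snis_ess([log_dens[j][k] - log_dens[i][k] for k in range(n)])
             for j in range(len(log_dens))] for i in range(len(log_dens))]
```

Identical weights give ESS equal to the sample count, while one dominant weight drives ESS toward 1; the diagonal of the matrix is always the full sample count.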
To run end-to-end (GPU + HF_TOKEN required):
```bash
export HF_TOKEN=hf_xxx
cd exp2_llm_evaluation
bash run_llm.sh
```