LLMs are increasingly used to simulate human participants in social science research, but existing evaluations conflate base model capabilities with agent design choices, making it unclear whether results reflect the model or the configuration.
HumanStudy-Bench treats participant simulation as an agent design problem. It provides a standardized testbed for replaying human-subject experiments end to end: an Execution Engine reconstructs full experimental protocols from published studies, and a Benchmark scores alignment at the level of scientific inference with standardized evaluation metrics.
With HumanStudy-Bench You Can:
- Test different agent designs on the same experiments to find what works best
- Run agents through real studies reconstructed from published human-subject research
- Compare results rigorously using inferential-level metrics that measure whether agents reach the same scientific conclusions as humans
We include 12 foundational studies (cognition, strategic interaction, social psychology) covering more than 6,000 trials with human samples ranging from tens to over 2,100 participants.
💡 You can also add your own studies using our automated pipeline to test custom research questions.
```bash
pip install -r requirements.txt
```

You can run an AI agent through a specific study (e.g., the "False Consensus Effect") or the entire benchmark suite. The engine handles the interaction, data collection, and statistical comparison against human ground truth.
```bash
# Run a specific study with a specific agent design (e.g., Mistral with a demographic profile)
python scripts/run_baseline_pipeline.py \
    --study-id study_001 \
    --real-llm \
    --model mistralai/mistral-nemo \
    --presets v3_human_plus_demo
```

**Probability Alignment Score (PAS):** Measures whether agents reach the same scientific conclusions as humans at the phenomenon level. It quantifies the probability that agent and human populations exhibit behavior consistent with the same hypothesis, accounting for statistical uncertainty in human baselines.
**Effect Consistency Score (ECS):** Measures how closely agents reproduce the magnitude and pattern of human behavioral effects at the data level. It assesses both the precision (capturing the pattern) and accuracy (matching the magnitude) of agent responses compared to human ground truth.
→ See detailed metric derivations and explanations
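For intuition only, here is a toy sketch in the spirit of these two scores. It is not the benchmark's actual derivation (see the link above for that); the sign check, the normal approximation for human uncertainty, and the equal weighting of pattern and magnitude are all assumptions made for illustration.

```python
# Toy illustrations of PAS/ECS-style scoring -- NOT the benchmark's real formulas.
import numpy as np
from scipy import stats

def toy_pas(human_effect, human_se, agent_effect):
    """Score 1 if the agent effect points the same way as the human effect,
    discounted by how uncertain the human baseline itself is."""
    same_direction = float(np.sign(agent_effect) == np.sign(human_effect))
    human_confidence = 1.0 - 2.0 * stats.norm.sf(abs(human_effect) / human_se)
    return same_direction * max(human_confidence, 0.0)

def toy_ecs(human_means, agent_means):
    """Blend a pattern term (do conditions order the same way?) with a
    magnitude term (are the effect sizes comparable?)."""
    h, a = np.asarray(human_means, float), np.asarray(agent_means, float)
    pattern = np.corrcoef(h, a)[0, 1]                              # "precision"
    magnitude = 1.0 - np.mean(np.abs(h - a)) / (np.ptp(h) + 1e-9)  # "accuracy"
    return 0.5 * (pattern + magnitude)

# Example: three conditions, agent slightly underestimates the human effect.
print(toy_pas(human_effect=0.30, human_se=0.05, agent_effect=0.22))
print(toy_ecs(human_means=[0.2, 0.5, 0.8], agent_means=[0.25, 0.45, 0.7]))
```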
After running simulations, get a summary of all runs (PAS, ECS, tokens, cost):
```bash
python scripts/simple_results.py
```

Outputs are written to `results/benchmark/`: `simple_summary.md`, `simple_studies.csv`, and `simple_findings.csv`.
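If you prefer to work with the results programmatically rather than read `simple_summary.md`, a minimal pandas sketch (the exact column names depend on the pipeline version, so it just prints whatever is there):

```python
# Load the benchmark output tables written by scripts/simple_results.py.
import pandas as pd

studies = pd.read_csv("results/benchmark/simple_studies.csv")
findings = pd.read_csv("results/benchmark/simple_findings.csv")

print(studies.head())    # study-level summary rows
print(findings.head())   # finding-level rows
```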
You can easily test new behavioral hypotheses by defining custom agent specifications. Simply create a new method file in `src/agents/custom_methods/` to control how your agent presents itself to the experiment.
Example: `src/agents/custom_methods/my_persona.py`
```python
def generate_prompt(profile):
    # Build a persona-style system prompt from the sampled participant profile.
    return f"You are a {profile['age']}-year-old {profile['occupation']}. Please answer naturally."
```

Run your new design:
```bash
python scripts/run_baseline_pipeline.py --study-id study_001 --real-llm --system-prompt-preset my_persona
```

- Adding New Studies – Parse research PDFs and auto-generate simulation code
- Model Configuration – Set up API keys for OpenAI, Anthropic, Google, or OpenRouter
If you use HumanStudy-Bench, please cite:
```bibtex
@misc{liu2026humanstudybenchaiagentdesign,
  title={HumanStudy-Bench: Towards AI Agent Design for Participant Simulation},
  author={Xuan Liu and Haoyang Shang and Zizhang Liu and Xinyan Liu and Yunze Xiao and Yiwen Tu and Haojian Jin},
  year={2026},
  eprint={2602.00685},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.00685},
}
```

Hugging Face: Benchmark and resources are available on the Hugging Face Hub at fuyyckwhy/HS-Bench-results.
This project is licensed under the MIT License - see the LICENSE file for details.
