Yifan Shen1,2,*, Jiawen Zhang2,4,*, Jian Xu2, Junho Kim2, Ismini Lourentzou2, Xu Cao1,2,†, Meihuan Huang1,3,5,†
1PediaMed AI |
2University of Illinois Urbana-Champaign |
3Shenzhen Children's Hospital |
4Peking University |
5Hong Kong Polytechnic University
*Equal contribution |
†Corresponding author
ChildAgentEval is an interactive evaluation framework for studying whether multimodal large language model (MLLM) agents can align their reasoning, memory, language, and error patterns with human developmental stages.
The benchmark is inspired by the Wechsler Intelligence Scale for Children (WISC), but it does not reproduce protected clinical test items. Instead, it translates psychometrically grounded cognitive constructs into web-based interactive tasks that agents must complete through browser actions such as clicking, selecting, typing, and responding under time or memory constraints.
Most agent benchmarks reward maximal task performance. Child-facing AI systems require a different evaluation target: developmental appropriateness. A tutor or assistant that gives technically correct but adult-level explanations may fail to meet the needs of a younger user.
ChildAgentEval studies cognitive age alignment: whether an agent can exhibit age-appropriate behavior across language abstraction, working memory, visual and fluid reasoning, processing speed, and social explanation style.
ChildAgentEval is designed as a model-agnostic evaluation infrastructure rather than a static public dataset. It administers controlled web tasks, records detailed interaction traces, and reports both task-level and factor-level results.
| Cognitive factor | Public high-level description |
|---|---|
| Gc | Verbal abstraction, vocabulary, and comprehension |
| Gf/Gv | Fluid reasoning, visual reasoning, and spatial problem solving |
| WM | Information retention and manipulation across interaction steps |
| PSI | Time-constrained visual-symbolic execution and response efficiency |
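
The relationship between task-level and factor-level results can be illustrated with a minimal sketch. This is not the ChildAgentEval scoring code, which is restricted. The task names, the scores, and the use of a simple mean are all assumptions made for illustration; only the factor labels come from the table above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical task records: (task_id, factor, normalized score in [0, 1]).
# Task names and scores are invented for illustration only.
TASK_RESULTS = [
    ("vocab_01", "Gc", 0.80),
    ("vocab_02", "Gc", 0.60),
    ("matrix_01", "Gf/Gv", 0.50),
    ("span_01", "WM", 0.40),
    ("span_02", "WM", 0.60),
    ("coding_01", "PSI", 0.70),
]


def factor_level_scores(task_results):
    """Average task-level scores within each cognitive factor.

    Assumption: a plain mean; the actual benchmark may aggregate differently.
    """
    by_factor = defaultdict(list)
    for _task_id, factor, score in task_results:
        by_factor[factor].append(score)
    return {factor: mean(scores) for factor, scores in by_factor.items()}


report = factor_level_scores(TASK_RESULTS)
for factor, score in report.items():
    print(f"{factor}: {score:.2f}")
```

Keeping the task-level records alongside the aggregated report makes it possible to trace any factor-level number back to the individual tasks that produced it.
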
The paper evaluates agents across representative developmental anchors: ages 7, 10, 13, and 16. The skill-guided setting uses age bands of 6-8, 9-11, 12-14, and 15-17.
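
As a small illustration, each anchor age sits at the midpoint of one skill-guided age band. The sketch below encodes that reading; the function name `age_band` and the pairing of anchors to bands are our assumptions, not an API or mapping published with the benchmark.

```python
# Age bands used in the skill-guided setting (inclusive bounds).
AGE_BANDS = [(6, 8), (9, 11), (12, 14), (15, 17)]

# Representative developmental anchors evaluated in the paper.
ANCHOR_AGES = [7, 10, 13, 16]


def age_band(age: int) -> str:
    """Return the band label (e.g. "6-8") containing the given age.

    Hypothetical helper for illustration; raises ValueError outside 6-17.
    """
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age} is outside the supported range 6-17")


anchor_to_band = {age: age_band(age) for age in ANCHOR_AGES}
print(anchor_to_band)
```
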
This figure is a public overview from the paper. It is not a release of the complete item set, protected administration protocol, answer keys, or scoring materials.
To avoid relying on subjective role prompts such as "act like a child," ChildAgentEval introduces a data-grounded skill distillation pipeline. The method extracts developmental markers from age-stratified child and adolescent corpora, then distills them into structured cognitive skill cards.
These skill cards constrain the agent through modules for vocabulary abstraction, working memory, reasoning budget, visual reliance, and social perspective.
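
One way to picture a skill card is as a small structured record with one field per module. The sketch below is an assumption about shape only: the class name `SkillCard`, the field names, and the `to_prompt` rendering are hypothetical, and the field values are deliberate placeholders rather than distilled developmental markers.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SkillCard:
    """Hypothetical skill card: one constraint per module named in the text."""

    age_band: str
    vocabulary_abstraction: str
    working_memory: str
    reasoning_budget: str
    visual_reliance: str
    social_perspective: str

    def to_prompt(self) -> str:
        """Render the card as plain text that could be given to an agent."""
        lines = [f"Target age band: {self.age_band}"]
        for name, value in asdict(self).items():
            if name != "age_band":
                lines.append(f"- {name.replace('_', ' ')}: {value}")
        return "\n".join(lines)


# Placeholder values only; real cards are distilled from age-stratified corpora.
card = SkillCard(
    age_band="6-8",
    vocabulary_abstraction="<distilled constraint>",
    working_memory="<distilled constraint>",
    reasoning_budget="<distilled constraint>",
    visual_reliance="<distilled constraint>",
    social_perspective="<distilled constraint>",
)
prompt = card.to_prompt()
print(prompt)
```

A structured record, as opposed to a free-form role prompt, makes each constraint explicit and lets individual modules be varied or ablated independently.
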
Standard age prompting does not reliably produce age-ordered behavior. General-purpose agents tend to default to their strongest available capabilities, even when assigned a younger target age.
Skill-guided agents show clearer developmental differentiation in stronger models. In the reported experiments, targeted cognitive filters produce more monotonic score trajectories from younger to older age bands, especially in language-mediated dimensions.
Alignment remains uneven across cognitive domains. Language and crystallized knowledge are easier to calibrate, while working memory, perceptual reasoning, and processing speed remain harder to align with human developmental norms.
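
The notion of a monotonic score trajectory in the findings above can be made concrete with a short check. This is an illustrative sketch, not the paper's metric: the function names are ours, and the example trajectories are invented numbers, not reported results.

```python
def is_monotonic_non_decreasing(scores):
    """True if each score is at least as high as the one before it."""
    return all(a <= b for a, b in zip(scores, scores[1:]))


def adjacent_inversions(scores):
    """Count adjacent age-band pairs where the older band scores lower."""
    return sum(1 for a, b in zip(scores, scores[1:]) if a > b)


# Invented trajectories over the bands 6-8, 9-11, 12-14, 15-17.
ordered = [0.30, 0.45, 0.45, 0.70]
unordered = [0.60, 0.50, 0.70, 0.65]

print(is_monotonic_non_decreasing(ordered), adjacent_inversions(ordered))
print(is_monotonic_non_decreasing(unordered), adjacent_inversions(unordered))
```

The inversion count gives a graded view: a trajectory with one inversion is closer to age-ordered behavior than one with several, even though both fail the strict monotonicity test.
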
Because parts of the evaluation protocol are derived from or constrained by copyrighted Wechsler scale materials, we cannot publicly release the complete evaluation protocol, protected item administration details, answer keys, scoring rubrics, or materials that could be used to reconstruct the original clinical assessment.
If you would like to evaluate your model with ChildAgentEval, please contact yifan26@illinois.edu.
| Material | Public status |
|---|---|
| Paper summary and figures | Released in this repository |
| High-level benchmark design | Released in this repository |
| Complete evaluation protocol | Restricted |
| Protected item content and answer keys | Restricted |
| External model evaluation | Available by controlled request |
We can coordinate a controlled evaluation of your model under the standardized ChildAgentEval environment. Depending on your setup, this may involve an API endpoint, hosted model access, or a secure checkpoint-sharing arrangement. Please do not post API keys, private model weights, or credentials in GitHub issues.
See docs/evaluation_access.md for the recommended request format.
If you use ChildAgentEval or discuss the benchmark in your work, please cite:
@misc{shen2026childagenteval,
title = {Evaluating Cognitive Age Alignment in Interactive AI Agents},
author = {Yifan Shen and Jiawen Zhang and Jian Xu and Junho Kim and Ismini Lourentzou and Xu Cao and Meihuan Huang},
year = {2026},
note = {Preprint}
}

For evaluation requests and collaboration inquiries, contact yifan26@illinois.edu.




