PediaMedAI/ChildAgentEval

ChildAgentEval: Evaluating Cognitive Age Alignment in Interactive AI Agents

Yifan Shen1,2,*, Jiawen Zhang2,4,*, Jian Xu2, Junho Kim2, Ismini Lourentzou2, Xu Cao1,2,†, Meihuan Huang1,3,5,†

1PediaMed AI   |   2University of Illinois Urbana-Champaign   |   3Shenzhen Children's Hospital   |   4Peking University   |   5Hong Kong Polytechnic University
*Equal contribution   |   †Corresponding author

Overview

ChildAgentEval is an interactive evaluation framework for studying whether multimodal large language model (MLLM) agents can align their reasoning, memory, language, and error patterns with human developmental stages.

The benchmark is inspired by the Wechsler Intelligence Scale for Children (WISC), but it does not reproduce protected clinical test items. Instead, it translates psychometrically grounded cognitive constructs into web-based interactive tasks that agents must complete through browser actions such as clicking, selecting, typing, and responding under time or memory constraints.

Figure: ChildAgentEval framework

Why Cognitive Age Alignment?

Most agent benchmarks reward maximal task performance. Child-facing AI systems require a different evaluation target: developmental appropriateness. A tutor or assistant that gives technically correct but adult-level explanations may fail to meet the needs of a younger user.

ChildAgentEval studies cognitive age alignment: whether an agent can exhibit age-appropriate behavior across language abstraction, working memory, visual and fluid reasoning, processing speed, and social explanation style.

Benchmark Design

ChildAgentEval is designed as a model-agnostic evaluation infrastructure rather than a static public dataset. It administers controlled web tasks, records detailed interaction traces, and reports both task-level and factor-level results.

| Cognitive factor | Public high-level description |
| --- | --- |
| Gc | Verbal abstraction, vocabulary, and comprehension |
| Gf/Gv | Fluid reasoning, visual reasoning, and spatial problem solving |
| WM | Information retention and manipulation across interaction steps |
| PSI | Time-constrained visual-symbolic execution and response efficiency |
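
As a rough illustration of how task-level results might roll up into factor-level results, the sketch below averages task scores within each of the four factors listed above. The function name, the tuple layout, the task names, and the use of a plain mean are all assumptions made for this example; the benchmark's actual scoring procedure is restricted and is not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# Factor labels come from the table above; everything else is hypothetical.
FACTORS = ("Gc", "Gf/Gv", "WM", "PSI")


def factor_scores(task_results):
    """Average task-level scores within each cognitive factor.

    task_results: iterable of (task_name, factor, score) tuples.
    Returns a dict mapping each factor that has results to its mean score.
    """
    by_factor = defaultdict(list)
    for task_name, factor, score in task_results:
        if factor not in FACTORS:
            raise ValueError(f"unknown factor {factor!r} for task {task_name!r}")
        by_factor[factor].append(score)
    return {factor: mean(scores) for factor, scores in by_factor.items()}


# Made-up task names and scores, for illustration only.
example_results = [
    ("word_similarity", "Gc", 0.8),
    ("vocabulary", "Gc", 0.6),
    ("digit_recall", "WM", 0.5),
]
print(factor_scores(example_results))
```

A simple mean is only one possible aggregation; a real pipeline could just as well use scaled or norm-referenced scores.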

The paper evaluates agents across representative developmental anchors: ages 7, 10, 13, and 16. The skill-guided setting uses age bands of 6-8, 9-11, 12-14, and 15-17.
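
The anchors and bands above line up neatly: each anchor age is the midpoint of one band. The small sketch below encodes that relationship. The constant and function names are hypothetical, and pairing each anchor with a band is our reading of the numbers rather than something the text states.

```python
# Age bands and anchor ages as listed in the text above.
AGE_BANDS = [(6, 8), (9, 11), (12, 14), (15, 17)]
ANCHOR_AGES = [7, 10, 13, 16]


def band_for_age(age):
    """Return the (low, high) band containing `age` (hypothetical helper)."""
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return (low, high)
    raise ValueError(f"age {age} is outside the covered range 6-17")


# Each anchor age is the midpoint of the band it falls in.
for anchor in ANCHOR_AGES:
    low, high = band_for_age(anchor)
    assert anchor == (low + high) // 2
```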

Figure: Overview of interactive subtests

This figure is a public overview from the paper. It is not a release of the complete item set, protected administration protocol, answer keys, or scoring materials.

Age-Specific Cognitive Skill Distillation

To avoid relying on subjective role prompts such as "act like a child," ChildAgentEval introduces a data-grounded skill distillation pipeline. The method extracts developmental markers from age-stratified child and adolescent corpora, then distills them into structured cognitive skill cards.

These skill cards constrain the agent through modules for vocabulary abstraction, working memory, reasoning budget, visual reliance, and social perspective.
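
To make the idea of a structured skill card concrete, here is a minimal sketch of one possible representation. The five module names come from the description above; the `SkillCard` class, its field types, the `render` method, and the placeholder values are assumptions for illustration and do not reflect the authors' actual card format or content.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SkillCard:
    """Hypothetical container for one age band's distilled constraints."""

    age_band: tuple  # (low, high), e.g. (6, 8)
    vocabulary_abstraction: str
    working_memory: str
    reasoning_budget: str
    visual_reliance: str
    social_perspective: str

    def render(self) -> str:
        """Format the card as plain text that could be placed in an agent prompt."""
        low, high = self.age_band
        lines = [f"Target age band: {low}-{high}"]
        for field in fields(self):
            if field.name == "age_band":
                continue
            label = field.name.replace("_", " ")
            lines.append(f"- {label}: {getattr(self, field.name)}")
        return "\n".join(lines)


# Placeholder values only; real cards would be distilled from age-stratified corpora.
card = SkillCard(
    age_band=(6, 8),
    vocabulary_abstraction="PLACEHOLDER",
    working_memory="PLACEHOLDER",
    reasoning_budget="PLACEHOLDER",
    visual_reliance="PLACEHOLDER",
    social_perspective="PLACEHOLDER",
)
print(card.render())
```

Keeping each module as a separate field makes it straightforward to vary or ablate one constraint at a time.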

Figure: Skill distillation pipeline

Main Findings

Standard age prompting does not reliably produce age-ordered behavior. General-purpose agents tend to default to their strongest available capabilities, even when assigned a younger target age.

Skill-guided agents show clearer developmental differentiation in stronger models. In the reported experiments, targeted cognitive filters produce more monotonic score trajectories from younger to older age bands, especially in language-mediated dimensions.
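
One simple way to quantify how monotonic a score trajectory is across the four age bands is the fraction of adjacent band pairs in which the score does not decrease. The sketch below implements that measure. It is an illustrative stand-in, not the metric used in the paper, and the function name is hypothetical.

```python
def monotonic_fraction(scores):
    """Fraction of adjacent pairs (younger band, older band) with a non-decreasing score.

    scores: sequence of scores ordered from the youngest to the oldest age band.
    Returns 1.0 for a fully age-ordered trajectory and 0.0 for a strictly reversed one.
    """
    if len(scores) < 2:
        raise ValueError("need at least two age bands to compare")
    pairs = list(zip(scores, scores[1:]))
    return sum(older >= younger for younger, older in pairs) / len(pairs)
```

For example, `monotonic_fraction([1, 2, 3, 4])` is 1.0, while a trajectory that dips once across four bands scores 2/3.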

Alignment remains uneven across cognitive domains. Language and crystallized knowledge are easier to calibrate, while working memory, perceptual reasoning, and processing speed remain harder to align with human developmental norms.

Figure: Developmental trajectories

Figure: Developmental metrics

Evaluation Access

Because parts of the evaluation protocol are derived from or constrained by copyrighted Wechsler scale materials, we cannot publicly release the complete evaluation protocol, protected item administration details, answer keys, scoring rubrics, or materials that could be used to reconstruct the original clinical assessment.

If you would like to evaluate your model with ChildAgentEval, please contact:

yifan26@illinois.edu

| Material | Public status |
| --- | --- |
| Paper summary and figures | Released in this repository |
| High-level benchmark design | Released in this repository |
| Complete evaluation protocol | Restricted |
| Protected item content and answer keys | Restricted |
| External model evaluation | Available by controlled request |

We can coordinate a controlled evaluation of your model under the standardized ChildAgentEval environment. Depending on your setup, this may involve an API endpoint, hosted model access, or a secure checkpoint-sharing arrangement. Please do not post API keys, private model weights, or credentials in GitHub issues.

See docs/evaluation_access.md for the recommended request format.

Citation

If you use ChildAgentEval or discuss the benchmark in your work, please cite:

@misc{shen2026childagenteval,
  title        = {Evaluating Cognitive Age Alignment in Interactive AI Agents},
  author       = {Yifan Shen and Jiawen Zhang and Jian Xu and Junho Kim and Ismini Lourentzou and Xu Cao and Meihuan Huang},
  year         = {2026},
  note         = {Preprint}
}

Contact

For evaluation requests and collaboration inquiries, contact yifan26@illinois.edu.
