This repository is a cleaned release of the CASS framework from the AAAI 2026 paper *From Diagnosis to Generalization: A Cognitive Approach to Data Selection for Educational LLMs*.
This release keeps only the core CASS code:
- MIRT-based cognitive diagnosis
- parameter export from the fitted MIRT model
- item ranking with Fisher-style or entropy-based criteria
- rebuilding selected training subsets from externally provided public data
This release does not include:
- any dataset files
- any merged models, LoRA checkpoints, or inference outputs
- LLM training or inference code
- LLaMA-Factory config files
In the paper workflow, downstream LLM fine-tuning and inference were run with LLaMA-Factory; that part is intentionally omitted here.
```
CASS_release/
├── scripts/
├── src/cass_release/
├── pyproject.toml
└── requirements.txt
```
```bash
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .
pip install -r requirements.txt
```

To fit the CASS selector, prepare a CSV file with:
- `user_id`: identifier of a specialized LLM or probing agent
- `item_id`: question identifier
- `score`: model response score in `[0, 1]`
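As a concrete illustration, the snippet below writes a tiny response matrix in this format. The model and question identifiers are made up; only the column names (`user_id`, `item_id`, `score`) are fixed by the expected input.

```python
import csv

# Hypothetical toy response matrix. The IDs are placeholders;
# the column names match what the CASS selector expects.
rows = [
    {"user_id": "llm_math", "item_id": "q001", "score": 1.0},
    {"user_id": "llm_math", "item_id": "q002", "score": 0.0},
    {"user_id": "llm_code", "item_id": "q001", "score": 0.5},
]

with open("response_matrix.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["user_id", "item_id", "score"])
    writer.writeheader()
    writer.writerows(rows)
```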
To rebuild selected training subsets, prepare either:
- a CSV with columns `new_stem`, `new_answer`, `question_ID`, or
- a JSON list with fields `instruction`, `input`, `output`, `id`
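For reference, a minimal JSON training pool in the second format can be produced like this; the question text and answer are invented placeholders, not data from the paper.

```python
import json

# Hypothetical single-entry pool in the instruction/input/output/id shape.
pool = [
    {
        "instruction": "Solve the equation 2x + 3 = 7.",  # placeholder question
        "input": "",
        "output": "x = 2",                                # placeholder answer
        "id": "q001",
    }
]

with open("public_training_pool.json", "w") as f:
    json.dump(pool, f, ensure_ascii=False, indent=2)
```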
```bash
python scripts/train_mirt.py \
  --response-csv /path/to/response_matrix.csv \
  --output-dir artifacts/mirt
```

This script:
- builds user/item index mappings
- splits the response matrix into train/val/test
- fits the MIRT model
- saves the checkpoint and metrics
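The model being fit can be sketched as a multidimensional 2PL item-response model. The parameterization below (`P = sigmoid(a . theta + b)`) is one common convention; the repository's exact sign convention and training loop may differ, and all the numbers are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mirt_prob(theta, a, b):
    """Correct-response probability under a multidimensional 2PL model.

    theta: (d,) user ability vector; a: (d,) item discrimination vector;
    b: scalar item intercept.
    """
    return sigmoid(np.dot(a, theta) + b)

theta = np.array([0.5, -0.2])   # made-up ability vector for one user
a = np.array([1.2, 0.8])        # made-up discrimination vector for one item
b = -0.1                        # made-up item intercept
p = mirt_prob(theta, a, b)      # probability the user answers correctly
```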
```bash
python scripts/export_mirt_parameters.py \
  --checkpoint artifacts/mirt/mirt.pt \
  --mapping-dir artifacts/mirt/mappings \
  --output-dir artifacts/mirt/parameters
```

This exports:

- `user_embeddings.npy`
- `item_a_embeddings.npy`
- `item_b_embeddings.npy`
- user/item mapping files
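The exported arrays can be consumed directly with NumPy. The snippet below also shows one plausible reading of a Fisher-trace item score (for a 2PL item, the Fisher information at ability `theta` is `p(1-p) * a a^T`, whose trace is `p(1-p) * ||a||^2`, averaged over users); this is a sketch, not the repository's exact `fisher_trace` implementation. Random stand-ins replace the real `.npy` files so the example is self-contained.

```python
import numpy as np

# In practice these come from the export step, e.g.:
#   theta = np.load("artifacts/mirt/parameters/user_embeddings.npy")
#   A = np.load("artifacts/mirt/parameters/item_a_embeddings.npy")
#   b = np.load("artifacts/mirt/parameters/item_b_embeddings.npy")
rng = np.random.default_rng(0)
theta = rng.normal(size=(5, 3))   # 5 users x 3 latent dimensions
A = rng.normal(size=(10, 3))      # 10 items x 3 latent dimensions
b = rng.normal(size=10)           # 10 item intercepts

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

P = sigmoid(theta @ A.T + b)                          # (5, 10) response probabilities
info = (P * (1 - P)).mean(axis=0) * (A ** 2).sum(axis=1)  # Fisher-trace-style score per item
top_items = np.argsort(-info)[:3]                     # indices of the 3 highest-scoring items
```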
Paper-aligned ranking:

```bash
python scripts/select_items.py \
  --parameters-dir artifacts/mirt/parameters \
  --strategy fisher_trace \
  --top-k 7000 \
  --output-path artifacts/mirt/fisher_selection_7000.json
```

Legacy ranking compatible with the original preserved workspace logic:
```bash
python scripts/select_items.py \
  --parameters-dir artifacts/mirt/parameters \
  --strategy entropy \
  --top-k 7000 \
  --output-path artifacts/mirt/entropy_selection_7000.json
```

```bash
python scripts/build_training_datasets.py selected \
  --selection-map artifacts/mirt/fisher_selection_7000.json \
  --train-pool /path/to/public_training_pool.csv \
  --output-dir artifacts/selected_datasets
```

This creates one JSON training subset per model.
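The per-model rebuild can be sketched as filtering a pool by a selection map. The assumed map shape (`{model_name: [question_ID, ...]}`) and all pool entries below are hypothetical; the file actually written by `select_items.py` may differ in detail.

```python
import json

# Assumed selection-map shape; the real output of select_items.py may differ.
selection = {"llm_math": ["q001", "q003"]}

# Toy training pool in the instruction/input/output/id format.
pool = [
    {"instruction": "Q1?", "input": "", "output": "A1", "id": "q001"},
    {"instruction": "Q2?", "input": "", "output": "A2", "id": "q002"},
    {"instruction": "Q3?", "input": "", "output": "A3", "id": "q003"},
]

# Index the pool by question ID, then write one subset file per model.
by_id = {ex["id"]: ex for ex in pool}
for model, ids in selection.items():
    subset = [by_id[i] for i in ids if i in by_id]
    with open(f"{model}_subset.json", "w") as f:
        json.dump(subset, f, indent=2)
```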
The paper points to the following public resources:
If you want to reproduce the full paper pipeline, use the public dataset together with this repository for CASS selection, then run downstream LLM fine-tuning and inference with LLaMA-Factory in your own environment.
This repository is released under the Apache License 2.0. See [LICENSE](LICENSE).