This repository is a cleaned release of the CASS framework from the AAAI 2026 paper *From Diagnosis to Generalization: A Cognitive Approach to Data Selection for Educational LLMs*.
This release keeps only the core CASS code:
- MIRT-based cognitive diagnosis
- parameter export from the fitted MIRT model
- item ranking with Fisher-style or entropy-based criteria
- rebuilding selected training subsets from externally provided public data
This release does not include:
- any dataset files
- any merged models, LoRA checkpoints, or inference outputs
- LLM training or inference code
- LLaMA-Factory config files
In the paper workflow, downstream LLM fine-tuning and inference were run with LLaMA-Factory; that part is intentionally omitted here.
```
CASS_release/
├── scripts/
├── src/cass_release/
├── pyproject.toml
└── requirements.txt
```
```bash
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .
pip install -r requirements.txt
```

To fit the CASS selector, prepare a CSV file with:
- `user_id`: identifier of a specialized LLM or probing agent
- `item_id`: question identifier
- `score`: model response score in `[0, 1]`
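As a concrete illustration, the snippet below writes a tiny response matrix in this format. The model and question identifiers are made up; only the column names (`user_id`, `item_id`, `score`) are fixed by the expected input.

```python
import csv

# Hypothetical toy response matrix. The IDs are placeholders;
# the column names match what the CASS selector expects.
rows = [
    {"user_id": "llm_math", "item_id": "q001", "score": 1.0},
    {"user_id": "llm_math", "item_id": "q002", "score": 0.0},
    {"user_id": "llm_code", "item_id": "q001", "score": 0.5},
]

with open("response_matrix.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["user_id", "item_id", "score"])
    writer.writeheader()
    writer.writerows(rows)
```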
To rebuild selected training subsets, prepare either:
- a CSV with columns `new_stem`, `new_answer`, `question_ID`, or
- a JSON list with fields `instruction`, `input`, `output`, `id`
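For reference, a minimal JSON training pool in the second format can be produced like this; the question text and answer are invented placeholders, not data from the paper.

```python
import json

# Hypothetical single-entry pool in the instruction/input/output/id shape.
pool = [
    {
        "instruction": "Solve the equation 2x + 3 = 7.",  # placeholder question
        "input": "",
        "output": "x = 2",                                # placeholder answer
        "id": "q001",
    }
]

with open("public_training_pool.json", "w") as f:
    json.dump(pool, f, ensure_ascii=False, indent=2)
```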
```bash
python scripts/train_mirt.py \
  --response-csv /path/to/response_matrix.csv \
  --output-dir artifacts/mirt
```

This script:
- builds user/item index mappings
- splits the response matrix into train/val/test
- fits the MIRT model
- saves the checkpoint and metrics
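The model being fit can be sketched as a multidimensional 2PL item-response model. The parameterization below (`P = sigmoid(a . theta + b)`) is one common convention; the repository's exact sign convention and training loop may differ, and all the numbers are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mirt_prob(theta, a, b):
    """Correct-response probability under a multidimensional 2PL model.

    theta: (d,) user ability vector; a: (d,) item discrimination vector;
    b: scalar item intercept.
    """
    return sigmoid(np.dot(a, theta) + b)

theta = np.array([0.5, -0.2])   # made-up ability vector for one user
a = np.array([1.2, 0.8])        # made-up discrimination vector for one item
b = -0.1                        # made-up item intercept
p = mirt_prob(theta, a, b)      # probability the user answers correctly
```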
```bash
python scripts/export_mirt_parameters.py \
  --checkpoint artifacts/mirt/mirt.pt \
  --mapping-dir artifacts/mirt/mappings \
  --output-dir artifacts/mirt/parameters
```

This exports:

- `user_embeddings.npy`
- `item_a_embeddings.npy`
- `item_b_embeddings.npy`
- user/item mapping files
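The exported arrays can be consumed directly with NumPy. The snippet below also shows one plausible reading of a Fisher-trace item score (for a 2PL item, the Fisher information at ability `theta` is `p(1-p) * a a^T`, whose trace is `p(1-p) * ||a||^2`, averaged over users); this is a sketch, not the repository's exact `fisher_trace` implementation. Random stand-ins replace the real `.npy` files so the example is self-contained.

```python
import numpy as np

# In practice these come from the export step, e.g.:
#   theta = np.load("artifacts/mirt/parameters/user_embeddings.npy")
#   A = np.load("artifacts/mirt/parameters/item_a_embeddings.npy")
#   b = np.load("artifacts/mirt/parameters/item_b_embeddings.npy")
rng = np.random.default_rng(0)
theta = rng.normal(size=(5, 3))   # 5 users x 3 latent dimensions
A = rng.normal(size=(10, 3))      # 10 items x 3 latent dimensions
b = rng.normal(size=10)           # 10 item intercepts

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

P = sigmoid(theta @ A.T + b)                          # (5, 10) response probabilities
info = (P * (1 - P)).mean(axis=0) * (A ** 2).sum(axis=1)  # Fisher-trace-style score per item
top_items = np.argsort(-info)[:3]                     # indices of the 3 highest-scoring items
```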
Paper-aligned ranking:

```bash
python scripts/select_items.py \
  --parameters-dir artifacts/mirt/parameters \
  --strategy fisher_trace \
  --top-k 7000 \
  --output-path artifacts/mirt/fisher_selection_7000.json
```

Legacy ranking compatible with the original preserved workspace logic:
```bash
python scripts/select_items.py \
  --parameters-dir artifacts/mirt/parameters \
  --strategy entropy \
  --top-k 7000 \
  --output-path artifacts/mirt/entropy_selection_7000.json
```

```bash
python scripts/build_training_datasets.py selected \
  --selection-map artifacts/mirt/fisher_selection_7000.json \
  --train-pool /path/to/public_training_pool.csv \
  --output-dir artifacts/selected_datasets
```

This creates one JSON training subset per model.
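The per-model rebuild can be sketched as filtering a pool by a selection map. The assumed map shape (`{model_name: [question_ID, ...]}`) and all pool entries below are hypothetical; the file actually written by `select_items.py` may differ in detail.

```python
import json

# Assumed selection-map shape; the real output of select_items.py may differ.
selection = {"llm_math": ["q001", "q003"]}

# Toy training pool in the instruction/input/output/id format.
pool = [
    {"instruction": "Q1?", "input": "", "output": "A1", "id": "q001"},
    {"instruction": "Q2?", "input": "", "output": "A2", "id": "q002"},
    {"instruction": "Q3?", "input": "", "output": "A3", "id": "q003"},
]

# Index the pool by question ID, then write one subset file per model.
by_id = {ex["id"]: ex for ex in pool}
for model, ids in selection.items():
    subset = [by_id[i] for i in ids if i in by_id]
    with open(f"{model}_subset.json", "w") as f:
        json.dump(subset, f, indent=2)
```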
The paper points to the following public resources:
If you want to reproduce the full paper pipeline, use the public dataset together with this repository for CASS selection, then run downstream LLM fine-tuning and inference with LLaMA-Factory in your own environment.
This repository is released under the Apache License 2.0. See [LICENSE](LICENSE).