Yuan Zhang1, Lifeng Guo2, Junwen Pan3, Wenzhao Zheng4,
Wen Zhou5, Kuan Cheng1, Kurt Keutzer4, Shanghang Zhang1✉️
1School of Computer Science, Peking University
2Beijing University of Posts and Telecommunications, 3Tianjin University
4EECS, UC Berkeley, 5Chinese Academy of Sciences
🔥 [2026/05/18] We released SEED, and its code is now open-source!
Overview of SEED. SEED formulates subset selection as a Weighted Independent Set problem over a similarity graph constructed from training data, with better node weights from a mutual influence subspace and better edges from local scale normalization. The resulting structurally balanced graph enables selecting a compact, diverse, and high-influence subset. Different colors indicate that nodes belong to different domains, while the color intensity represents the node weights.
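To make the Weighted Independent Set framing concrete, here is a minimal sketch. It is not the SEED implementation. It assumes node weights are already given (SEED derives them from a mutual influence subspace, which is not reproduced here), builds edges from a locally scaled Gaussian affinity as one plausible reading of "local scale normalization", and uses a simple greedy heuristic rather than an exact solver. The function names and the `k`, `threshold`, and `budget` parameters are hypothetical.

```python
import numpy as np


def local_scale_affinity(x: np.ndarray, k: int = 2) -> np.ndarray:
    """Assumed edge score: exp(-d_ij^2 / (s_i * s_j)), where s_i is the
    distance from node i to its k-th nearest neighbor (a per-node scale)."""
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    # Column 0 of each sorted row is the node itself (distance 0).
    scale = np.maximum(np.sort(dist, axis=1)[:, k], 1e-12)
    return np.exp(-(dist ** 2) / (scale[:, None] * scale[None, :]))


def greedy_weighted_independent_set(weights, affinity, threshold=0.3, budget=None):
    """Greedy heuristic: visit nodes by descending weight and keep a node only
    if it has no edge (affinity > threshold) to any node already kept."""
    order = np.argsort(-np.asarray(weights, dtype=float))
    selected = []
    for i in order:
        if budget is not None and len(selected) >= budget:
            break
        if all(affinity[i, j] <= threshold for j in selected):
            selected.append(int(i))
    return selected


# Toy data: two tight clusters of three points each (illustrative only).
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
weights = np.array([0.9, 0.5, 0.4, 0.3, 0.8, 0.2])  # hypothetical influence scores

affinity = local_scale_affinity(points, k=2)
selected = greedy_weighted_independent_set(weights, affinity, threshold=0.3)
print(selected)  # [0, 4]: the highest-weight node from each cluster
```

The greedy pass keeps one high-weight representative per dense neighborhood, which is the "compact, diverse, and high-influence" behavior the overview describes, although it gives no optimality guarantee.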
- Clone this repository and navigate to the SEED folder
git clone https://github.com/Gumpest/SEED.git
cd SEED
- Install the necessary packages
conda create -n seed python=3.10 -y
conda activate seed
pip install torch==2.1.2 torchvision torchaudio
pip install -r requirement.txt
- Install SEED
pip install -e .
- Prepare Training and Target Data
4.1 Instruction Tuning
- Training datasets: Flan v2, COT, Dolly, and Open Assistant.
- Target datasets: MMLU, Tydiqa, and BBH.
- Processed versions of these files are available on Google Drive.
4.2 Visual Instruction Tuning
- Training datasets: Honeybee-Remake-SEED-200K, available on Hugging Face.
- Target datasets: a random 5% of the benchmark datasets.
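For the "random 5%" target split, a reproducible sampler might look like the sketch below. The source does not state how the sample is drawn, so the function name `sample_fraction`, the seed, and the placeholder records are all assumptions for illustration.

```python
import random


def sample_fraction(items, fraction=0.05, seed=0):
    """Return a reproducible random subset holding about `fraction` of items."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    items = list(items)
    if not items:
        return []
    # Keep at least one example so that small benchmarks are not left empty.
    n_keep = max(1, round(len(items) * fraction))
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return rng.sample(items, n_keep)


benchmark = [{"id": i} for i in range(200)]  # placeholder records, not real data
target_split = sample_fraction(benchmark, fraction=0.05, seed=42)
print(len(target_split))  # 10
```

Fixing the seed keeps the target split identical across runs, which matters when comparing selection methods against the same targets.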
We provide a complete example pipeline (LLaMA3-8B) for the instruction tuning task, covering data selection, model training, and evaluation. All commands are organized as shell scripts for easy reproduction and one-command execution. Please remember to replace the default paths with your own local paths before running the scripts.
- Warmup training (5% random data)
bash shell/1_warmup.sh
- Collect the training gradient datastore
bash shell/2_gradient_train.sh
Note: Gradient collection must be performed separately for each dataset by manually switching the corresponding comments in the script (four times in total).
- Collect the target gradient datastore
bash shell/3_gradient_val.sh
- Select data with SEED
bash shell/4_select.sh
- Train the model with selected data
bash shell/5_train.sh
- Evaluate the model
bash evaluation/batch_eval.sh
- Print your results
python evaluation/print_res.py
The results are shown as follows:
================== Summary Table ==================
Task | Checkpoint | Score
------------------------------------
tydiqa | 211 | 57.5664
mmlu | 211 | 0.6513
bbh | 317 | 0.6676
==================================================
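The pipeline steps listed above can be chained from a small driver script. The sketch below is a hypothetical convenience wrapper, not part of the repository: it only uses the script paths shown above, defaults to a dry run that prints the commands, and assumes it is launched from the repository root with the paths inside each script already edited. Because the gradient collection script needs its comments switched by hand for each dataset, that step cannot be fully automated this way.

```python
import subprocess

# Commands copied from the steps above, in the order they are listed.
PIPELINE = [
    ("warmup", ["bash", "shell/1_warmup.sh"]),
    # Re-run by hand for each dataset after switching the comments in the script.
    ("gradient_train", ["bash", "shell/2_gradient_train.sh"]),
    ("gradient_val", ["bash", "shell/3_gradient_val.sh"]),
    ("select", ["bash", "shell/4_select.sh"]),
    ("train", ["bash", "shell/5_train.sh"]),
    ("evaluate", ["bash", "evaluation/batch_eval.sh"]),
    ("print_results", ["python", "evaluation/print_res.py"]),
]


def run_pipeline(steps=PIPELINE, dry_run=True):
    """Go through the steps in order and return the command lines.
    With dry_run=True the commands are only printed, never executed."""
    planned = []
    for name, cmd in steps:
        line = " ".join(cmd)
        print(f"[{name}] {line}")
        planned.append(line)
        if not dry_run:
            # check=True raises CalledProcessError and stops at the first failure.
            subprocess.run(cmd, check=True)
    return planned


plan = run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the scripts for real, so it is worth reviewing the printed plan first.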
This project is released under the Apache 2.0 license.
If you use SEED in your research, please cite our work by using the following BibTeX entry:
@article{zhang2026seed,
title={SEED: Targeted Data Selection by Weighted Independent Set},
author={Zhang, Yuan and Guo, Lifeng and Pan, Junwen and Liu, Chang and Zheng, Wenzhao and Cheng, Kuan and Keutzer, Kurt and Zhang, Shanghang},
journal={arXiv preprint arXiv:2605.15691},
year={2026}
}
We extend our gratitude to the open-source efforts of LESS, FAISS, and HoneyBee.
