SEED: Targeted Data Selection by Weighted Independent Set

Yuan Zhang¹, Lifeng Guo², Junwen Pan³, Wenzhao Zheng⁴,

Wen Zhou⁵, Kuan Cheng¹, Kurt Keutzer⁴, Shanghang Zhang¹✉️

¹School of Computer Science, Peking University

²Beijing University of Posts and Telecommunications, ³Tianjin University

⁴EECS, UC Berkeley, ⁵Chinese Academy of Sciences

📜 News

🔥 [2026/05/18] We released SEED, and its code is now open source!

👀 Overview

Overview of SEED. SEED formulates subset selection as a Weighted Independent Set problem over a similarity graph constructed from training data, with better node weights from a mutual influence subspace and better edges from local scale normalization. The resulting structurally balanced graph enables selecting a compact, diverse, and high-influence subset. Different colors indicate that nodes belong to different domains, while the color intensity represents the node weights.
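To make the formulation above concrete, here is a minimal sketch of selecting a subset as a weighted independent set on a similarity graph. It is not the SEED implementation: the node weights are taken as given (SEED derives them from a mutual influence subspace, which is not reproduced here), the edge rule is a generic locally scaled Gaussian affinity, and the solver is a simple greedy heuristic. The function names `build_graph` and `greedy_wis`, and the parameters `k` and `threshold`, are illustrative assumptions.

```python
import numpy as np


def build_graph(features, k=2, threshold=0.3):
    """Boolean adjacency matrix from a locally scaled Gaussian affinity (assumed edge rule)."""
    # Pairwise Euclidean distances between all samples.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Local scale of each node: distance to its k-th nearest neighbor.
    # Index 0 of each sorted row is the node itself (distance 0).
    sigma = np.maximum(np.sort(dist, axis=1)[:, k], 1e-12)
    affinity = np.exp(-dist**2 / (sigma[:, None] * sigma[None, :]))
    # Connect two nodes when their normalized affinity is high (near-duplicates).
    adjacency = affinity > threshold
    np.fill_diagonal(adjacency, False)
    return adjacency


def greedy_wis(weights, adjacency, budget):
    """Greedy heuristic: take the heaviest available node, then exclude its neighbors."""
    available = np.ones(len(weights), dtype=bool)
    selected = []
    for node in np.argsort(-weights, kind="stable"):
        if len(selected) >= budget:
            break
        if available[node]:
            selected.append(int(node))
            available[node] = False
            available[adjacency[node]] = False
    return selected


# Toy data: two tight clusters of three points each, with made-up weights.
features = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
)
weights = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])

adjacency = build_graph(features)
selected = greedy_wis(weights, adjacency, budget=2)
print(selected)  # [0, 3]: the heaviest node from each cluster
```

Because no two selected nodes share an edge, the subset cannot contain two near-duplicates, which is how the independent-set constraint trades raw weight for diversity.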

👨‍💻 Preparation

Clone this repository and navigate to the SEED folder

git clone https://github.com/Gumpest/SEED.git
cd SEED

Install the necessary packages

conda create -n seed python=3.10 -y
conda activate seed

pip install torch==2.1.2 torchvision torchaudio
pip install -r requirement.txt

Install SEED

pip install -e .

Prepare Training and Target Data

4.1 Instruction Tuning
- Training datasets: Flan v2, COT, Dolly, and Open Assistant.
- Target datasets: MMLU, Tydiqa, and BBH.
- A processed version of these files is available on Google Drive.
4.2 Visual Instruction Tuning
- Training datasets: Honeybee-Remake-SEED-200K, available on Hugging Face.
- Target datasets: a random 5% sample of the benchmark datasets.
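
If you need to build such a 5% target split yourself, a reproducible random sample can be drawn as in the sketch below. The README does not describe how the repository draws its sample, so the helper name `sample_fraction`, the fixed seed, and the rounding rule are assumptions for illustration only.

```python
import random


def sample_fraction(examples, fraction=0.05, seed=0):
    """Return a reproducible random subset holding `fraction` of the examples."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    items = list(examples)
    # Keep at least one example so tiny datasets still yield a target set.
    count = max(1, round(len(items) * fraction))
    # A dedicated Random instance avoids touching the global random state.
    rng = random.Random(seed)
    return rng.sample(items, count)


# Hypothetical benchmark with 200 example IDs.
benchmark = list(range(200))
target_subset = sample_fraction(benchmark)
print(len(target_subset))  # 10
```

Fixing the seed keeps the target subset identical across runs, so gradient datastores built from it stay comparable.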

🎯 Quick Start

We provide a complete example pipeline (LLaMA3-8B) for the instruction tuning task, covering data selection, model training, and evaluation. All commands are organized as shell scripts for easy reproduction and one-command execution. Please remember to replace the default paths with your own local paths before running the scripts.

Data Selection with SEED

Warmup training (5% random data)

bash shell/1_warmup.sh

Collect the training gradient datastore

bash shell/2_gradient_train.sh

Note

Gradient collection must be run separately for each dataset. Switch the corresponding commented-out lines in the script before each run, for four runs in total.

Collect the target gradient datastore

bash shell/3_gradient_val.sh

Select data with SEED

bash shell/4_select.sh

Training

Train the model with the selected data

bash shell/5_train.sh

Evaluation

Evaluate the model

bash evaluation/batch_eval.sh

Print your results

python evaluation/print_res.py

The results are shown as follows:

================== Summary Table ==================

Task       | Checkpoint   | Score   
------------------------------------
tydiqa     | 211          | 57.5664 
mmlu       | 211          | 0.6513  
bbh        | 317          | 0.6676  

==================================================
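
To log or compare these numbers programmatically, the printed table can be parsed back into records. The sketch below assumes the exact pipe-separated layout shown above; `parse_summary` is a hypothetical helper and not part of the repository.

```python
def parse_summary(text):
    """Parse the pipe-separated summary table into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        # Skip banner lines, the dashed separator, blank lines, and the header row.
        if len(parts) != 3 or parts[0] == "Task":
            continue
        task, checkpoint, score = parts
        rows.append(
            {"task": task, "checkpoint": int(checkpoint), "score": float(score)}
        )
    return rows


summary = """
================== Summary Table ==================

Task       | Checkpoint   | Score   
------------------------------------
tydiqa     | 211          | 57.5664 
mmlu       | 211          | 0.6513  
bbh        | 317          | 0.6676  

==================================================
"""

results = parse_summary(summary)
for row in results:
    print(row["task"], row["checkpoint"], row["score"])
```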

License

This project is released under the Apache 2.0 license.

Citation

If you use SEED in your research, please cite our work by using the following BibTeX entry:

@article{zhang2026seed,
  title={SEED: Targeted Data Selection by Weighted Independent Set},
  author={Zhang, Yuan and Guo, Lifeng and Pan, Junwen and Liu, Chang and Zheng, Wenzhao and Cheng, Kuan and Keutzer, Kurt and Zhang, Shanghang},
  journal={arXiv preprint arXiv:2605.15691},
  year={2026}
}

Acknowledgment

We extend our gratitude to the open-source efforts of LESS, FAISS, and HoneyBee.
