🧬 UniSD

A Unified Self-Distillation Framework for Large Language Models

Yiqiao Jin¹*, Yiyang Wang¹*, Lucheng Fu¹, Yijia Xiao², Yinyi Luo³, Haoxin Liu¹, B. Aditya Prakash¹, Josiah Hester¹, Jindong Wang⁴†, Srijan Kumar¹†

¹ Georgia Institute of Technology · ² UCLA · ³ Carnegie Mellon University · ⁴ William & Mary

* Equal contribution · † Corresponding authors

📖 Abstract

Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a Unified framework to systematically study Self-Distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSD*, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 and the strongest baseline by +2.8. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.

✨ Highlights

🧩 Unified framework spanning the three axes of self-distillation: supervision reliability, representation alignment, and training stability.
🔬 Five complementary mechanisms studied in isolation and in combination across 6 benchmarks × 6 models × 3 model families.
🏆 UniSD* — the integrated recipe — achieves the strongest overall performance using only self-derived supervision, no stronger external teacher required.

🧩 The UniSD Framework

UniSD is built from five complementary mechanisms that can be enabled independently or composed into the integrated UniSD* recipe.

Component	`--mode`	Key flag(s)
🤝 Multi-Teacher Agreement (sequence-level)	`agreement_seq_{random,retrieval,induction}`	`--num-auxiliary-contexts`, `--gamma_agreement`
🎯 Multi-Teacher Agreement (token-level)	`agreement_tok_{random,retrieval,induction}`	`--num-auxiliary-contexts`, `--gamma_agreement`, `--agreement_stat`
🌊 EMA Teacher Stabilization	`ema`	`--ref_model_sync_steps`, `--ref_model_mixup_beta`
⚖️ Token-Level Contrastive Learning	`contrastive`	`--contrastive_weight`, `--contrastive_margin`
🧠 Feature Matching	`match_joint` / `match_repr`	`--final_layer_distill_weight`
✂️ Divergence Clipping (JSD-Clip)	`clip`	`--alpha`, `--token_clip`
⭐ UniSD* (integrated)	`unisd_star`	combines EMA + matching + contrastive + agreement
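To make two of the rows above more concrete, here is a minimal sketch in plain Python. It is not the repository's implementation. It assumes that EMA teacher stabilization periodically blends the student's weights into a reference (teacher) copy, and that divergence clipping caps a per-token Jensen-Shannon divergence before averaging. The function names `ema_update` and `clipped_jsd` are hypothetical, and the way `beta`, `alpha`, and `token_clip` are used here is only a guess at what `--ref_model_mixup_beta`, `--alpha`, and `--token_clip` control; the source code is the authority on their actual semantics.

```python
import math
from typing import Dict, Sequence


def ema_update(teacher: Dict[str, float], student: Dict[str, float], beta: float) -> Dict[str, float]:
    """Blend student weights into the teacher: t <- (1 - beta) * t + beta * s.

    Assumption: beta plays the role of a mix-up coefficient; weights are
    shown as plain floats keyed by parameter name to keep the sketch small.
    """
    return {name: (1.0 - beta) * teacher[name] + beta * student[name] for name in teacher}


def _kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def clipped_jsd(
    student_probs: Sequence[Sequence[float]],
    teacher_probs: Sequence[Sequence[float]],
    alpha: float = 0.5,
    token_clip: float = 0.5,
) -> float:
    """Mean over tokens of a generalized JSD, capped per token at token_clip.

    Assumption: 0 < alpha < 1, so the mixture m is positive wherever p or q is.
    """
    per_token = []
    for p, q in zip(student_probs, teacher_probs):
        # Mixture distribution between student (p) and teacher (q).
        m = [alpha * pi + (1.0 - alpha) * qi for pi, qi in zip(p, q)]
        jsd = alpha * _kl(p, m) + (1.0 - alpha) * _kl(q, m)
        # Clipping bounds the contribution of any single outlier token.
        per_token.append(min(jsd, token_clip))
    return sum(per_token) / len(per_token)
```

In this sketch a token where student and teacher fully disagree contributes at most `token_clip` to the loss, which is one plausible way to keep a few unstable tokens from dominating an update.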

🚀 Installation

UniSD targets Python 3.12 + CUDA 12.8 (cu128 wheels). The install has a few prerequisite steps before the final `pip install -r requirements.txt`, because (a) PyTorch's cu128 build lives on the PyTorch wheel index and (b) flash-attention-2 must be compiled against the installed torch.

# 1) Create and activate the env
conda create -n unisd python=3.12 -y
conda activate unisd
pip install -U pip setuptools wheel packaging ninja

# 2) Install cu128 PyTorch from the PyTorch wheel index (must precede flash-attn build)
pip install --index-url https://download.pytorch.org/whl/cu128 \
    torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0

# 3) Point flash-attn's CUDA build at a 12.x toolkit
#    (on many hosts /usr/local/cuda → 13.x, which mismatches torch's cu128 ABI)
export CUDA_HOME=/usr/local/cuda-12.6

# 4) Install everything else — flash-attn builds from source here (~20 min the first time)
pip install -r requirements.txt --no-build-isolation

💡 Don't have `/usr/local/cuda-12.6`? Any CUDA 12.x toolkit (12.4–12.8) works. Run `ls -d /usr/local/cuda-12*` to see what's available and set `CUDA_HOME` to that path.
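If you prefer to script the toolkit lookup, the following sketch does the same search as the `ls` command in Python and keeps only versions in the 12.4 to 12.8 range mentioned above. The helper name `find_cuda12_toolkits` is hypothetical and not part of the repository, and it assumes toolkits are installed as `cuda-12.<minor>` directories.

```python
import re
from pathlib import Path
from typing import List


def find_cuda12_toolkits(root: str = "/usr/local", min_minor: int = 4, max_minor: int = 8) -> List[str]:
    """Return cuda-12.<minor> directories under root, newest minor version first."""
    base = Path(root)
    if not base.is_dir():
        return []
    found = []
    for path in base.glob("cuda-12*"):
        # Only accept names of the exact form cuda-12.<minor>.
        match = re.fullmatch(r"cuda-12\.(\d+)", path.name)
        if match and path.is_dir():
            minor = int(match.group(1))
            if min_minor <= minor <= max_minor:
                found.append((minor, str(path)))
    return [path for _, path in sorted(found, reverse=True)]
```

The first entry of the returned list, if any, is a reasonable candidate for `CUDA_HOME`.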

⚠️ trl ↔ vLLM compatibility: this environment ships `trl==1.4.0` (officially supports vLLM 0.12.0–0.18.0) with `vllm==0.20.2`. The combination works in our smoke tests but trl will print a warning at import time. If you hit a runtime error from `VLLMClient`, pin `vllm<0.19`.
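To check ahead of time whether an installed vLLM falls inside the officially supported range quoted above, a small comparison like the one below may help. It is a sketch with hypothetical helper names (`parse_version`, `vllm_in_supported_range`), it treats both bounds as inclusive (an assumption), and it uses a simplified version parser rather than a full PEP 440 implementation.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.20.2' into (0, 20, 2); local tags such as '+cu128' are ignored."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def vllm_in_supported_range(vllm_version: str, low: str = "0.12.0", high: str = "0.18.0") -> bool:
    """True if low <= vllm_version <= high, comparing numeric components."""
    return parse_version(low) <= parse_version(vllm_version) <= parse_version(high)
```

At runtime the installed version string can be obtained with `importlib.metadata.version("vllm")` and passed to `vllm_in_supported_range`.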

Verify the install

python -c "
import torch, vllm, flash_attn, flashinfer
print('torch       ', torch.__version__, 'cuda_ok:', torch.cuda.is_available())
print('vllm        ', vllm.__version__)
print('flash_attn  ', flash_attn.__version__)
print('flashinfer  ', flashinfer.__version__)
"

Optional environment variables: `WANDB_API_KEY` (logging), `HF_TOKEN` (gated models).
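The verification one-liner above stops at the first failing import. As a possible alternative, the sketch below reports every package and every optional environment variable in one pass. The helpers `check_imports` and `check_env` are hypothetical and not shipped with the repository.

```python
import importlib
import os
from typing import Dict, Iterable


def check_imports(module_names: Iterable[str]) -> Dict[str, str]:
    """Map each module name to its version string, or to a description of the failure."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # missing package, or an error raised while importing it
            report[name] = f"FAILED: {type(exc).__name__}: {exc}"
        else:
            report[name] = str(getattr(module, "__version__", "unknown version"))
    return report


def check_env(var_names: Iterable[str]) -> Dict[str, bool]:
    """Report which environment variables are set to a non-empty value."""
    return {name: bool(os.environ.get(name)) for name in var_names}
```

For this project one would call `check_imports(["torch", "vllm", "flash_attn", "flashinfer"])` and `check_env(["WANDB_API_KEY", "HF_TOKEN"])` and print the resulting dictionaries.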

⚡ Quick Start

UniSD provides two ways to launch training: a high-level orchestrator with sane defaults, and a direct command for full per-flag control.

Option 1 — Preset orchestrator (preferred)

`scripts/run_experiments.py` handles GPU scheduling, dependency-aware sweeps, and sensible defaults.

# Template
python scripts/run_experiments.py <SUBCOMMAND> [--gpus <GPU_IDS>] [subcommand-flags...]

# Example: token-level contrastive learning
python scripts/run_experiments.py contrastive --weight 0.1 --margin 0.5

💡 Run `python scripts/run_experiments.py --dry-run` to preview every job before launch.

Option 2 — Direct command

`python -m src.train.train_unisd` exposes every UniSD flag for fine-grained control.

# Template
python -m src.train.train_unisd \
    --mode <MODE> --dataset <DATASET> \
    --model_name <MODEL> \
    --per_device_train_batch_size <BATCH> \
    --num-auxiliary-contexts <N> \
    --use_vllm

# Example: token-level contrastive on MBPP with Qwen2.5-7B
python -m src.train.train_unisd \
    --mode contrastive --dataset mbpp \
    --model_name Qwen/Qwen2.5-7B-Instruct \
    --per_device_train_batch_size 4 \
    --contrastive_weight 0.1 --use_vllm

Valid placeholder values

Placeholder	Values
`<SUBCOMMAND>`	`agreement`, `ema`, `contrastive`, `match_joint`, `match_repr`, `clip`, `unisd_star` (= UniSD*), `induction`
`<MODE>`	`agreement_{seq,tok}_{random,retrieval,induction}`, `ema`, `contrastive`, `match_joint`, `match_repr`, `clip`, `unisd_star`
`<DATASET>`	`mbpp`, `tooluse`, `scienceqa`, `cos_e`, `medmcqa` (eval-only: `gpqa`, `humaneval`)
`<MODEL>`	Qwen2.5 (0.5B/1.5B/3B/7B-Instruct), Llama-3.1-8B-Instruct, Gemma-3-4B-IT, InternLM3-8B-Instruct
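The brace pattern in the `<MODE>` row expands to six agreement modes. As an illustration, the sketch below enumerates every `--mode` value listed in the table and rejects anything else. The names `all_modes` and `validate_mode` are hypothetical helpers, not part of the repository.

```python
from itertools import product
from typing import List

AGREEMENT_LEVELS = ("seq", "tok")
AGREEMENT_SOURCES = ("random", "retrieval", "induction")
SINGLE_MODES = ("ema", "contrastive", "match_joint", "match_repr", "clip", "unisd_star")


def all_modes() -> List[str]:
    """Expand agreement_{seq,tok}_{random,retrieval,induction} and append the other modes."""
    agreement = [
        f"agreement_{level}_{source}"
        for level, source in product(AGREEMENT_LEVELS, AGREEMENT_SOURCES)
    ]
    return agreement + list(SINGLE_MODES)


def validate_mode(mode: str) -> str:
    """Return mode unchanged if it is a known --mode value, else raise ValueError."""
    if mode not in all_modes():
        raise ValueError(f"unknown mode {mode!r}; expected one of {all_modes()}")
    return mode
```

Note that `agreement` and `induction` are orchestrator subcommands only, so `validate_mode("agreement")` raises in this sketch.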

One-time cache prep

A few modes require a one-time cache build:

🔻 `contrastive` and `unisd_star` need a negative-demonstration cache:

python -m src.teacher.negative_demonstrations \
    --model_name Qwen/Qwen2.5-7B-Instruct --dataset mbpp

🪄 `agreement_*_induction` modes need an induction cache:

python scripts/run_experiments.py induction --num-demos 5

✅ `random` and `retrieval` agreement modes need no prep; their embeddings are built automatically on first run.
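The rules above can be summarized as a small lookup. This is a sketch only: `required_cache_prep` is a hypothetical helper, it encodes just what this section states, and it reuses `--num-demos 5` from the example command rather than any documented default.

```python
from typing import List


def required_cache_prep(mode: str, model_name: str, dataset: str) -> List[str]:
    """Return the one-time prep commands this section lists for a given --mode."""
    steps = []
    if mode in ("contrastive", "unisd_star"):
        # Negative-demonstration cache, built per model and dataset.
        steps.append(
            "python -m src.teacher.negative_demonstrations "
            f"--model_name {model_name} --dataset {dataset}"
        )
    if mode.startswith("agreement_") and mode.endswith("_induction"):
        # Induction cache, built through the orchestrator.
        steps.append("python scripts/run_experiments.py induction --num-demos 5")
    # Random and retrieval agreement modes, and all remaining modes, need no prep here.
    return steps
```

Whether `unisd_star` additionally needs the induction cache is not stated in this section, so the sketch does not assume it.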

📊 Datasets

UniSD is evaluated across six benchmarks spanning four task families.

Dataset	Role	Task
🔬 ScienceQA	train + eval	Scientific reasoning
💻 MBPP	train + eval	Code generation
💭 CoS-E	train + eval	Commonsense reasoning
🛠️ ToolAlpaca	train + eval	Tool usage
🎓 GPQA	OOD eval	Scientific reasoning
🧪 HumanEval	OOD eval	Code generation

🤖 Supported Models

UniSD supports models from the following families:

Qwen2.5 — 0.5B / 1.5B / 3B / 7B-Instruct (default: Qwen/Qwen2.5-7B-Instruct)
Llama-3.1 — 8B-Instruct
Gemma-3 — 4B-IT
InternLM3 — 8B-Instruct

🧪 Evaluation

Evaluation entry points live under `src/eval/`:

# Code generation (MBPP / HumanEval)
python -m src.eval.eval_code   --mode <MODE> --dataset humaneval \
    --model_name_or_path <CKPT_OR_HF_ID>

# Multiple-choice QA (ScienceQA / GPQA / CoS-E / MedMCQA)
python -m src.eval.eval_mcqa   --mode <MODE> --dataset gpqa \
    --model_name_or_path <CKPT_OR_HF_ID>

# Tool usage (ToolAlpaca)
python -m src.eval.eval_tooluse --mode <MODE> --dataset tooluse \
    --model_name_or_path <CKPT_OR_HF_ID>
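When evaluating one checkpoint on several datasets, it can be convenient to pick the entry point from the dataset name. The sketch below builds the argument list for the commands shown above. The mapping follows the comments in those commands, the dataset identifiers are taken from the `<DATASET>` placeholder values, and `build_eval_command` is a hypothetical helper rather than something the repository provides.

```python
from typing import List

# Dataset id -> evaluation module, following the three commands above.
EVAL_MODULE_BY_DATASET = {
    "mbpp": "src.eval.eval_code",
    "humaneval": "src.eval.eval_code",
    "scienceqa": "src.eval.eval_mcqa",
    "gpqa": "src.eval.eval_mcqa",
    "cos_e": "src.eval.eval_mcqa",
    "medmcqa": "src.eval.eval_mcqa",
    "tooluse": "src.eval.eval_tooluse",
}


def build_eval_command(mode: str, dataset: str, checkpoint: str) -> List[str]:
    """Return the argv list for evaluating a checkpoint on one dataset."""
    if dataset not in EVAL_MODULE_BY_DATASET:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return [
        "python", "-m", EVAL_MODULE_BY_DATASET[dataset],
        "--mode", mode,
        "--dataset", dataset,
        "--model_name_or_path", checkpoint,
    ]
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell-quoting issues with checkpoint paths.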

📝 Citation

If you find UniSD useful in your research, please cite:

@article{jin2026unisd,
  title={UniSD: Towards a Unified Self-Distillation Framework for Large Language Models},
  author={Jin, Yiqiao and Wang, Yiyang and Fu, Lucheng and Xiao, Yijia and Luo, Yinyi and Liu, Haoxin and Prakash, B Aditya and Hester, Josiah and Wang, Jindong and Kumar, Srijan},
  journal={arXiv preprint arXiv:2605.06597},
  year={2026}
}

🙏 Acknowledgements

UniSD is built on top of excellent open-source work from the community: 🤗 Transformers · 🤗 TRL · vLLM · DeepSpeed · PEFT · Accelerate.

⚖️ License

This project is released under the Apache License 2.0.
