# Master Taxonomy of Deep Model Optimization and Acceleration Techniques (2014–2025)

---

## 1. Model Compression

### 1.1 Pruning & Sparsity

- **Unstructured magnitude pruning**  
  *Han et al., “Learning both Weights and Connections...”, 2015; “Deep Compression”, ICLR 2016*  
  Iteratively remove small weights and retrain to maintain accuracy.

- **Structured / channel / filter pruning**  
  *Li et al., 2017; He et al., 2017*  
  Remove whole filters, channels, or residual blocks for real-world acceleration.

- **Automated / importance-based pruning**  
  *AMC, MorphNet, MetaPruning, Movement Pruning (2020)*  
  Learn what to prune using sensitivity or meta-learning.

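- **Dynamic sparse training (SET, RigL)**  
  *Mocanu et al., 2018; Mostafa & Wang, 2019; Evci et al., 2020*  
  Train at a fixed sparsity level, periodically dropping weak connections and regrowing others. Gradual magnitude pruning (GMP) is a dense-to-sparse schedule and belongs with magnitude pruning instead.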

- **Semi-structured N:M sparsity**  
  *NVIDIA Ampere and later (2:4 pattern); other N:M patterns such as 4:8 appear mainly in research*  
  Hardware-aligned sparsity for Sparse Tensor Core acceleration.

- **One-shot LLM pruning**  
  *SparseGPT (2023); Wanda (2023)*  
  Directly prune LLMs without full retraining.
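
As a minimal sketch of the unstructured magnitude pruning step listed above (one prune pass, no retraining), the following keeps the largest-magnitude weights and zeroes the rest. The function name and the flat-list representation are illustrative assumptions, not taken from any of the cited papers.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude entries of a flat list of weights.

    `sparsity` is the fraction of weights to remove (0.0 to 1.0).
    Returns (pruned_weights, mask); mask[i] is 1 where the weight is kept.
    """
    n_prune = int(len(weights) * sparsity)
    # Sort indices by ascending magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    removed = set(order[:n_prune])
    mask = [0 if i in removed else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1]
pruned, mask = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # half of the entries are now exactly zero
```

In an iterative prune-retrain loop, the returned mask would typically be held fixed during retraining so that removed weights stay at zero.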

---

### 1.2 Quantization

- **Classic quantization (INT8/INT4)**  
  *Jacob et al., 2018*  
  Post-training (PTQ) or quantization-aware (QAT) for CNNs/Transformers.

- **LLM-grade quantization (SmoothQuant)**  
  *Xiao et al., ICML 2023*  
  Migrates activation outliers into the weights through an offline per-channel rescaling, so both operands of a matmul can be quantized to INT8 (W8A8).

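- **Weight-only quantization (GPTQ, AWQ, AQLM)**  
  *Frantar et al., 2022 (GPTQ); Lin et al., 2023 (AWQ); Egiazarian et al., 2024 (AQLM)*  
  Quantize weights to roughly 2–4 bits while keeping activations in 16-bit. AWQ is the activation-aware member: it uses activation statistics to protect salient weight channels.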

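- **Ultra-low precision (FP8, FP4, NF4, NVFP4)**  
  *FP8 on NVIDIA Hopper (2022); FP4/NVFP4 on Blackwell (2024–2025); NF4 from QLoRA (2023)*  
  Low-bit formats balancing efficiency and stability. FP8, FP4, and NVFP4 are hardware tensor formats; NF4 is a storage data type that is dequantized before compute.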
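
To make the integer formats above concrete, here is a minimal sketch of symmetric absmax INT8 quantization for one row of weights. It is a generic textbook scheme, not the exact procedure of any method cited in this section, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of a list of floats to INT8.

    Returns (q, scale) with q[i] in [-127, 127] and values[i] ~= q[i] * scale.
    """
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate float values."""
    return [qi * scale for qi in q]


row = [0.02, -1.27, 0.5, 0.0]
q, scale = quantize_int8(row)
recovered = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, recovered))
print(q, scale, max_err)
```

Because the scale is set by the largest magnitude in the row, a single outlier stretches the step size for every other value, which is the problem that outlier-handling schemes such as SmoothQuant and LLM.int8 target.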

---

### 1.3 Low-Rank and Knowledge Distillation

- **Low-rank factorization**  
  *SVD, Tucker, CP, Tensor-Train, LoRA, DoRA, AdapterFusion*  
  Decompose or adapt model layers to reduce parameter count.

- **Knowledge Distillation (KD)**  
  *Hinton et al., 2015; DistilBERT (2019); TinyBERT (2020); MiniLM (2020)*  
  Teach smaller “student” models using teacher outputs.

- **Dataset distillation / pruning**  
  *"Less Is More for LLM Training", 2024–2025*  
  Optimize training data via quality filtering, scoring, and curricula.
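
The low-rank adaptation idea in the list above (LoRA) can be sketched as a frozen weight matrix plus a trainable rank-r update. The shapes, the `alpha / r` scaling convention, and all names below are assumptions made for illustration.

```python
def matvec(matrix, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trained."""
    r = len(A)  # rank of the update (number of rows in A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d_in, d_out, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen, d_out=2, d_in=3
A = [[0.1, 0.2, 0.3]]                    # r=1
B_zero = [[0.0], [0.0]]                  # zero init: adapter starts as a no-op
B_trained = [[1.0], [-2.0]]              # hypothetical values after training
x = [1.0, 2.0, 3.0]

y_init = lora_forward(W, A, B_zero, x, alpha=2.0)
y_tuned = lora_forward(W, A, B_trained, x, alpha=2.0)
print(y_init, y_tuned)
print(trainable_params(4096, 4096, 8), "vs", 4096 * 4096)
```

For a 4096 × 4096 layer at rank 8, the adapter trains 65,536 parameters against 16,777,216 in the full matrix, a 256× reduction in trainable parameters for that layer.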

---

## 2. Efficient Attention and Architecture

### 2.1 Attention Optimizations

- **FlashAttention 1–3**  
  *Dao et al., 2022–2024*  
  IO-aware exact-attention kernels that minimize reads and writes to GPU high-bandwidth memory.

- **Paged / chunked attention (vLLM)**  
  Enables long context windows with efficient KV-cache management.

- **Linear and sparse Transformers**  
  *Linformer, Performer, Reformer, Longformer, BigBird, Nyströmformer*  
  Reduce quadratic complexity of self-attention.

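- **MQA / GQA**  
  *Shazeer, 2019 (MQA); Ainslie et al., 2023 (GQA); used in PaLM (MQA) and LLaMA 3 (GQA)*  
  Share one K/V head across all query heads (MQA) or across each group of query heads (GQA) to cut KV memory.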

- **Mixture-of-Experts (MoE)**  
  *Switch Transformer (2021); GLaM; Mixtral; DeepSeek-V3 (2024–2025)*  
  Sparse activation of expert layers for high capacity with lower cost.

- **Early-exit / conditional compute**  
  *DeeBERT, PABEE, BEExformer (2024)*  
  Dynamically stop inference when confidence is high.

- **Neural Architecture Search (NAS)**  
  *MnasNet, EfficientNet, ProxylessNAS*  
  Automated discovery of efficient architectures.

- **CNN efficiency**  
  *MobileNet, Xception, ShuffleNet*  
  Depthwise-separable and pointwise convolutions for edge devices.
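
The saving from depthwise-separable convolutions in the last item above comes down to simple parameter arithmetic. The sketch below ignores biases, and the layer sizes are hypothetical.

```python
def standard_conv_params(k, c_in, c_out):
    """A k x k convolution mixes space and channels in one kernel."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Depthwise k x k (one filter per input channel) plus 1 x 1 pointwise."""
    return k * k * c_in + c_in * c_out


k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
ratio = sep / std  # equals 1 / c_out + 1 / k**2
print(std, sep, round(ratio, 3))
```

For 3 × 3 kernels the ratio approaches 1/9 as the output channel count grows, which is where the commonly quoted 8–9× reduction comes from.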

---

## 3. Training-Time Efficiency

- **Mixed precision (FP16/BF16/FP8)**  
  *Micikevicius et al., ICLR 2018*  
  Lower precision with master FP32 copy for stable gradients.

- **Activation / gradient checkpointing**  
  *Chen et al., 2016*  
  Recompute activations during backprop to save memory.

- **Parallelism & sharding**  
  *FSDP, ZeRO, Ulysses, Sequence Parallelism (2021–2025)*  
  Distribute parameters, gradients, and optimizer states.

- **Dynamic sparse training**  
  Compute only on active subgraphs with topology updates.

- **Progressive layer dropping / growing**  
  LayerDrop, DropPath for adaptive capacity.

- **Curriculum learning / data pruning**  
  Present simpler or higher-value samples earlier.

- **Optimizer compression (8-bit Adam)**  
  *Dettmers et al., 2022; bitsandbytes*  
  Quantize optimizer states for memory efficiency.
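
To show what the sharding item above buys, here is a sketch of per-GPU model-state memory under ZeRO, following the byte accounting reported in the ZeRO paper (Rajbhandari et al., 2020) for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer state. Activations, buffers, and fragmentation are not counted, and the function name is hypothetical.

```python
def zero_model_state_gb(n_params, n_gpus, stage):
    """Per-GPU memory (GB) for weights, gradients, and Adam state.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer state sharded
    stage 2: optimizer state and gradients sharded
    stage 3: optimizer state, gradients, and parameters sharded
    """
    psi = n_params
    if stage == 0:
        bytes_per_gpu = 16 * psi
    elif stage == 1:
        bytes_per_gpu = 4 * psi + 12 * psi / n_gpus
    elif stage == 2:
        bytes_per_gpu = 2 * psi + 14 * psi / n_gpus
    elif stage == 3:
        bytes_per_gpu = 16 * psi / n_gpus
    else:
        raise ValueError("stage must be 0, 1, 2, or 3")
    return bytes_per_gpu / 1e9


for stage in range(4):
    print(stage, round(zero_model_state_gb(7.5e9, 64, stage), 1))
```

With 7.5B parameters on 64 GPUs this gives roughly 120, 31.4, 16.6, and 1.9 GB for stages 0 through 3, so only stage 3 brings model state close to negligible per device.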

---

## 4. Inference-Time Serving Optimizations

- **Paged KV-cache (vLLM)**  
  Efficiently manage long contexts using paged attention structures.

- **Continuous batching (vLLM, SGLang)**  
  Aggregate requests dynamically to improve GPU utilization.

- **Speculative decoding**  
  *Leviathan et al., 2023 → Medusa/EAGLE (2024–2025)*  
  Draft-then-verify token generation to accelerate autoregressive inference.

- **Kernel and graph fusion**  
  *TorchInductor, TensorRT-LLM, XLA/IREE, Triton*  
  Fuse linear, normalization, and activation kernels.

- **Quantized serving kernels**  
  *Marlin/AWQ kernels for INT4 + FP16 decoding.*

- **Parallelism (pipeline/tensor/expert)**  
  *Megatron-Core, DeepSpeed inference*  
  Split model computation across devices for scalability.
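
The draft-then-verify step of speculative decoding listed above can be sketched for a single token. The acceptance rule shown is the standard rejection-sampling scheme from the speculative decoding literature; the toy distributions and function names are assumptions for illustration.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:  # p and q are identical; rejection cannot happen
        return list(p)
    return [r / total for r in raw]


def verify_token(p, q, x, rng):
    """Accept draft token x (sampled from q) with probability min(1, p[x]/q[x])."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    res = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=res)[0], False


def output_distribution(p, q):
    """Exact distribution of the emitted token, for checking correctness."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    res = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accepted, res)]


p = [0.6, 0.3, 0.1]  # target model distribution (toy values)
q = [0.3, 0.5, 0.2]  # draft model distribution (toy values)
rng = random.Random(0)
draft = rng.choices(range(3), weights=q)[0]
token, accepted = verify_token(p, q, draft, rng)
print(token, accepted, output_distribution(p, q))
```

`output_distribution` recovers `p` exactly, which is why this form of speculative decoding leaves the target model's sampling distribution unchanged; the speedup depends on how often drafts are accepted.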

---

## 5. Hardware and System-Level Optimizations

- **Quantized formats**  
  *INT8/INT4/PTQ/QAT, SmoothQuant, AWQ, GPTQ, AQLM.*

- **Low-bit floating formats**  
  *FP8/BF16/FP4/NVFP4 kernels (Hopper/Blackwell).*

- **N:M structured sparsity**  
  Tensor Core acceleration for semi-structured pruning.

- **AOT compilers**  
  *XLA, TensorRT-LLM, Triton, OpenXLA/IREE.*  
  Optimize computational graphs and runtime execution.

---

## 6. Decision Guidelines

### Training Optimization
Use:
- **Mixed precision (BF16/FP16)**  
- **Activation checkpointing**  
- **FSDP / ZeRO**  
to improve memory and training speed.

### Inference Optimization
Use:
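- **GQA/MQA**, **FlashAttention-3**, **Paged KV-cache (vLLM)**  
for high throughput. FlashAttention and paged KV-cache leave model outputs unchanged; GQA/MQA must already be part of the model architecture or be added through uptraining.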

### Memory-Constrained Serving
Use:
- **SmoothQuant (W8A8)** or **AWQ (W4A16)**  
on **vLLM/TRT-LLM** runtimes.
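
The W8A8 option above depends on SmoothQuant's offline smoothing step, which can be sketched as a per-channel rescaling that leaves the matmul result unchanged. The scale formula follows the SmoothQuant paper; the toy matrices and function names are illustrative assumptions, and the sketch assumes every channel has a nonzero maximum.

```python
def matmul(X, W):
    """Row-major matrix product X (tokens x channels) @ W (channels x outputs)."""
    cols = list(zip(*W))
    return [[sum(x * w for x, w in zip(row, col)) for col in cols] for row in X]


def smoothing_scales(X, W, alpha=0.5):
    """s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha) per input channel."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        scales.append(act_max ** alpha / w_max ** (1.0 - alpha))
    return scales


def smooth(X, W, scales):
    """Divide activations by s and multiply matching weight rows by s."""
    X_s = [[x / s for x, s in zip(row, scales)] for row in X]
    W_s = [[w * s for w in row] for row, s in zip(W, scales)]
    return X_s, W_s


X = [[1.0, 100.0], [-2.0, 50.0]]  # channel 1 carries an activation outlier
W = [[0.5, -0.25], [0.1, 0.2]]
scales = smoothing_scales(X, W)
X_s, W_s = smooth(X, W, scales)
Y, Y_s = matmul(X, W), matmul(X_s, W_s)
print(max(abs(v) for row in X for v in row), max(abs(v) for row in X_s for v in row))
```

After smoothing, the activation range shrinks from 100 to under 5 in this toy case while the product is preserved, so an INT8 grid wastes far fewer levels on the outlier channel.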

### Scaling Capacity
Use:
- **Sparse MoE models** (e.g., Mixtral, DeepSeek-V3)  
if expert-parallel frameworks are available.

### Latency-Sensitive Applications
Use:
- **Speculative decoding**, **Continuous batching**, **TensorRT fusion**  
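to minimize end-to-end latency: speculative decoding and kernel fusion cut per-token time, while continuous batching mainly reduces queueing delay under load.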

---

## 7. Cross-Technique Playbooks

### Long-Context LLM Serving (2025 GPUs)
$$
\text{GQA + FlashAttention-3 + SmoothQuant (W8A8) + vLLM + Continuous Batching + Speculative Decoding}
$$
Optionally export to TensorRT-LLM for fused kernels.

### High-Capacity MoE Training and Serving
$$
\text{Expert Parallel (Megatron-Core) + FlashAttention + INT8 Weights + SmoothQuant + vLLM Serving}
$$

### Efficient LLM Training on Small Clusters
$$
\text{BF16 + ZeRO-3 + Checkpointing + Sequence Parallel + 8-bit Adam}
$$

### Edge Vision or On-Device Models
$$
\text{Structured Pruning + INT8 PTQ + Early-Exit + TensorRT Deployment}
$$

---

## 8. Key References

- Han et al., *Deep Compression*, ICLR 2016.  
- Li et al., *Pruning Filters for Efficient ConvNets*, ICLR 2017.  
- Hinton et al., *Distilling the Knowledge*, 2015.  
- Frankle & Carbin, *Lottery Ticket Hypothesis*, ICLR 2019.  
- Sanh et al., *Movement Pruning*, NeurIPS 2020.  
- Micikevicius et al., *Mixed Precision Training*, ICLR 2018.  
- Dao et al., *FlashAttention*, 2022–2024.  
- Xiao et al., *SmoothQuant*, ICML 2023.  
- Lin et al., *AWQ*, 2023.  
- DeepSeek-V2/V3, Mixtral (2024–2025).  
- NVIDIA *TensorRT-LLM / Blackwell NVFP4* (2025).  
- DeepSpeed *ZeRO / Ulysses / SP* (2021–2025).

---

## 9. Notes and Emerging Directions

- Reasoning-aware quantization (2025) is under active development.  
- NVFP4 kernels are still maturing for general vLLM integration.  
- Domain-specific accelerations (e.g., diffusion, 3D vision) are expanding.  
- Backpropagation-free “forward-forward” training is experimental but promising.

---


# Master Taxonomy (2014 → 2025)

---

## A) Model Compression

### Pruning & Sparsity

- **Unstructured / magnitude pruning** — iterative pruning & retraining.  

- **Structured pruning (channels/filters/heads)**; **movement-based pruning**; **N:M structured (e.g., 2:4)**.  

- **One-shot & post-training pruning for LLMs (e.g., SparseGPT)**.  

---

### Quantization

- **PTQ/QAT for CNNs/Transformers; W8A8 (SmoothQuant); INT8 with FP16 outlier handling (LLM.int8); INT4 weight-only (GPTQ, AWQ); NF4 (QLoRA)**.

---

### Weight Sharing / Tying

- **Hashed weights; tied input–output embeddings; parameter sharing across layers (e.g., ALBERT)**.  

---

### Low-Rank & Tensor Decompositions

- **SVD/CP/Tucker for conv/linear layers; low-rank adapters (LoRA)**.  

---

### Knowledge Distillation (KD)

- **Vanilla KD; task-agnostic KD (DistilBERT, TinyBERT); hint-based (FitNets)**.  

---

## B) Efficient Attention & Architectures

### Efficient Transformers (long context / linearized attention)

- **Reformer (LSH), Linformer (low-rank), Longformer/BigBird (sparse patterns), Performer (FAVOR+)**.  

- **MQA/GQA (share K/V across heads or groups for fast decoding)**.  

---

### MoE (Sparsely Activated)

- **Sparsely-gated layers; Switch Transformer; GShard-style routing/parallelism**.  

---

### Conditional Compute & Early Exit

- **Token/patch pruning/merging; adaptive depth; early-exit Transformers**.  

---

### NAS & Compound Scaling

- **Search-based / rule-based scaling (e.g., EfficientNet)**.  

---

## C) Training-Time Efficiency

### Numerical & Memory

- **Mixed-precision (FP16/bfloat16/FP8); gradient/activation checkpointing; ZeRO-style partitioning**.  

---

### Sparse & Progressive Training

- **Dynamic sparse training (RigL/SET), LayerDrop/progressive depth**.  

---

### Data-Level

- **Curriculum/self-paced learning; data pruning (EL2N/forgetting events); dataset distillation**.  

---

## D) Inference-Time Serving Optimizations

### KV-Cache & Batching

- **PagedAttention (vLLM), continuous batching, cache sharing/eviction policies**.  

---

### Speculative & Multi-Draft Decoding

- **Draft-model / tree-based / Medusa-style generation**.  

---

### Kernel / Graph

- **FlashAttention kernels; graph/tensor fusion (XLA/TensorRT-LLM/Triton)**.  

---

### Memory & IO

- **Chunked prefill, CUDA graphs, paged/prefix KV, streaming attention variants (serving)**.  

---

## E) Hardware & Formats

### Low-Precision Formats

- **INT8/INT4/W8A8; FP8 training/inference; NormalFloat NF4 for fine-tuning**.

---

### Structured Sparsity for Accelerators

- **N:M (e.g., 2:4) exploiting hardware sparse-MM support**.  

---

### AOT/Compilers & Runtimes

- **XLA, TensorRT-LLM, Triton custom kernels**.  

---

## Evidence Table

| Technique | Core Goal (train / infer / both) | Key Idea (1–2 lines) | Canonical Paper(s) | 2023–2025 Follow-ups | Typical Wins* | Accuracy & Trade-offs |
|------------|------------|----------------------|--------------------|----------------------|----------------|-----------------------|
| **Magnitude / iterative pruning** | Both | Remove small-magnitude weights; retrain to recover accuracy. | Han et al., 2015; *Deep Compression* (ICLR 2016). | One-shot LLM pruning (SparseGPT). | 2–10× params ↓; 1.2–2× speed on sparse-aware hw | Dense GPUs need structured sparsity or kernels; retraining usually required. |
| **Structured filter/channel pruning** | Both | Prune channels/heads/filters to fit dense kernels. | Li et al., 2017; He et al., 2017. | Movement pruning (for Transformers). | 1.3–2× speed; 20–70% params ↓ | Larger drops if too aggressive; pick saliency-guided criteria. |
| **N:M structured sparsity (2:4)** | Infer | Enforce hardware-friendly pattern for GEMM speedups. | NVIDIA Ampere vendor documentation (2020). | Wider N:M support in recent libs. | Up to ~1.5× GEMM on supported GPUs | Requires retraining/fine-tune to preserve accuracy. |
| **PTQ W8A8 (SmoothQuant)** | Infer | Shift activation outliers into weights offline; enable W8A8. | SmoothQuant (Xiao et al., 2022; ICML 2023). | Broad adoption in LLM serving stacks. | ~1.3–1.6× speed; ~2× memory ↓ | Usually near-lossless on common LLMs; tune per-layer scaling. |
| **LLM.int8** | Infer | Mixed outlier handling + int8 matmuls for proj/FFN. | Dettmers et al., 2022. | — | ~2× memory ↓; modest throughput ↑ | Near-baseline accuracy; small quality drift on some tasks. |
| **GPTQ (3–4-bit PTQ)** | Infer | Second-order, one-shot weight quantization for GPTs. | Frantar et al., 2022; ICLR 2023. | Many toolchains (2023–25). | 2–4× mem ↓; 1.2–1.8× speed | Per-layer bit-allocation and calibration critical. |
| **AWQ (weight-only)** | Infer | Protect ~1% salient weights to cut quant error. | Lin et al., 2023; MLSys 2024. | Integrated in vLLM/TensorRT-LLM. | 2–4× mem ↓; small speed ↑ | Robust on edge; activations kept FP16/FP8. |
| **QLoRA (NF4)** | Train | 4-bit base weights + LoRA adapters + paged optimizers. | Dettmers et al., 2023. | Broader Hugging Face support. | 4–6× train mem ↓ | Quality near full fine-tuning when data/task aligned. |
| **Weight tying / ALBERT** | Both | Share embeddings/weights across layers to cut params. | Press & Wolf, 2017; ALBERT (Lan et al., 2019). | — | 1.5–3× params ↓ | Minimal loss; may affect capacity for heterogeneous features. |
| **Low-rank factorization (SVD/CP/Tucker)** | Infer | Factor conv/linear tensors into low-rank products. | Denton et al., 2014; Lebedev et al., 2014. | Modern SVD on attention/FFN blocks. | 1.3–2× speed; mem ↓ | Need layer-wise rank search; small quality drop typical. |
| **LoRA adapters** | Train | Low-rank updates on frozen weights. | Hu et al., 2021 (LoRA). | QLoRA (2023). | 10–100× fewer trainable params | Inference cost ~unchanged; combine with quant/PTQ. |
| **Knowledge Distillation (KD)** | Both | Match student to teacher logits/hidden states. | Hinton et al., 2015; DistilBERT (2019); TinyBERT (2020). | Cross-modal KD for V+L; task-agnostic KD. | 1.5–10× smaller; 1.2–1.8× faster | Careful temperature/loss balancing; may underperform OOD. |
| **Efficient long-seq attention** | Both | Approximate/sparse/low-rank attention to reduce O(L²). | Reformer, Linformer, Longformer, BigBird, Performer (2020). | Kernel-fused FlashAttention for exact softmax. | >2–10× mem↓/speed↑ on long L | May change inductive bias; benchmark task-wise. |
| **MQA / GQA** | Infer | Share K/V across heads or small groups to cut KV size; faster decode. | Shazeer 2019 (MQA); Ainslie et al., 2023 (GQA). | Widely used in modern LLMs. | 1.3–1.8× tok/s ↑; KV mem ↓ | Minor quality drop for MQA; GQA narrows gap. |
| **MoE / Switch** | Both | Route tokens to few experts (top-1/2) → sparse activation. | Fedus et al., 2021 (Switch). | Production MoE in 2024–25 families. | Pretrain up to 4–7× faster at iso-FLOPs | Requires careful routing/stability; inference sharding. |
| **Early-exit / conditional depth** | Infer | Exit when confidence high; or drop layers adaptively. | DeeBERT, PABEE (2020). | — | Latency ↓ 20–50% on easy inputs | Must guard against biased exits; calibrate. |
| **Mixed precision (FP16/bf16/FP8)** | Train | FP16/bf16 math with loss scaling; FP8 on Hopper-class GPUs. | Micikevicius et al., 2017. | FP8 serving/training in 2023–25 stacks. | 1.3–2× throughput; ~2× mem ↓ | Keep master weights FP32/bf16; watch overflow. |
| **Activation/gradient checkpointing** | Train | Recompute activations on backward to save memory. | Chen et al., 2016. | Widely adopted in LLM training. | up to ~3–5× larger batch/ctx at same VRAM | Extra fwd compute (~20–30%). |
| **Data curriculum / pruning / distillation** | Train | Order, select, or synthesize data to reduce steps. | Classic curriculum/self-paced; dataset distillation. | EL2N/forgetting-based selection (2021–24). | 1.2–3× wall-clock ↓ | Risk of overfitting to curated subsets; maintain diversity. |
| **FlashAttention (kernels)** | Both | IO-aware tiled exact attention; v2/v3 improve throughput. | Dao et al., 2022–2023. | Integrated in vLLM / PyTorch. | 1.5–3× speed; big mem ↓ | Exact softmax; needs matching layouts. |
| **vLLM: PagedAttention + continuous batching** | Infer | Page-based KV memory, near-zero waste; flexible sharing; continuous batching. | Kwon et al., 2023. | vTensor/Jenga/PagedEviction families. | 2–4× throughput vs prior, esp. long ctx | Requires engine integration; page size tuning. |
| **Speculative decoding** | Infer | Draft with small model/multi-heads; verify with target LM. | Leviathan et al., 2023; Chen et al., 2023. | Medusa / multi-draft variants. | 1.5–3× tok/s ↑ at similar quality | Gains shrink when the draft's acceptance rate is low. |
| **Compiler/graph fusion (XLA/TensorRT-LLM/Triton)** | Both | Fuse ops; autotune kernels; AOT scheduling; memory planning. | Triton (Tillet et al., 2019). | TensorRT-LLM 2023–25 releases. | 1.2–2× speed; latency tail ↓ | Engine-specific constraints; kernel portability. |

---

*Ballparks from cited papers and benchmarks; gains vary by model, context length, batch, hardware, and kernel maturity.*

---

## “When to Use” Decision Guide

- **Long context throughput** → FlashAttention + MQA/GQA + PagedAttention (vLLM); add speculative decoding if draft model is cheap enough.  

- **VRAM bound at training** → mixed precision (bf16/FP16) + activation checkpointing + gradient accumulation; if still tight, use LoRA/QLoRA for fine-tuning.  

- **Edge/device inference** → AWQ/GPTQ weight-only or W8A8 (SmoothQuant); avoid activation quantization on extreme outlier layers.  

- **Latency spikes on easy inputs** → early-exit / adaptive depth + continuous batching. Monitor drift on hard tails.  

- **Serving many small requests** → continuous batching + CUDA graphs + paged KV (vLLM/TensorRT-LLM).  

- **Model too big to fine-tune** → distill into a smaller student (KD), or train LoRA adapters on a frozen base; when memory is the limit, use QLoRA with a 4-bit base.

- **Compute budget fixed; need scale** → MoE/Switch (sparse activation) for larger capacity at constant FLOPs.  

---

## Cross-Technique Playbooks

### LLM Serving (Long Context, High Throughput)
**GQA + FlashAttention + W8A8 (SmoothQuant) + vLLM (PagedAttention, continuous batching) + speculative decoding**  

---

### Memory-Tight Fine-Tuning (Single GPU)
**QLoRA (NF4) + bf16/FP16 AMP + activation checkpointing; batch via gradient accumulation.**  
Switch to **LoRA + AWQ (int4)** for ablations.  

---

### Edge / Device CNN–ViT Deployment
**Weight-only PTQ (AWQ/GPTQ) + structured pruning + compiler fusion (TensorRT/Triton)**.  
Validate calibration carefully.  

---

### Throughput-Oriented Cluster Training
**MoE/Switch + bf16 AMP + checkpointing + efficient attention + ZeRO partitioning + curriculum pruning**.  

---

### Low-Latency Chat (Short Prompts)
**MQA/GQA + FlashAttention + CUDA Graphs + int8/FP8 kernels + speculative decoding.**  

---

## Coverage Check

- **Quantization:** LLM.int8 (Dettmers 2022), QLoRA (Dettmers 2023), SmoothQuant (Xiao 2022/23), GPTQ (Frantar 2022), AWQ (Lin 2023/24).  

- **Attention / Architectures:** Reformer, Linformer, Longformer, BigBird, Performer (2020); FlashAttention (2022–23); MQA (Shazeer 2019); GQA (Ainslie 2023); Switch Transformer (Fedus 2021).  

- **Serving:** vLLM + PagedAttention (Kwon 2023); vTensor, Jenga (2024–25).  

- **Compression:** Low-rank (Denton 2014), tensor decompositions (Lebedev 2014); weight tying/ALBERT; HashedNets.  

- **Training Efficiency:** Mixed precision (Micikevicius 2017); checkpointing (Chen 2016).  

- **Distillation:** Hinton 2015; DistilBERT 2019; TinyBERT 2020.  

---

## Possible Gaps / Families to Add

- Token-/head-level pruning for LLMs (2023–25)  
- Prefix-caching & attention sinks for streaming LLMs  
- Multi-token prediction (MTP)  
- NAS variants (MnasNet)  
- Vendor-specific FP8/FP4 details and 2:4 compiler passes (TensorRT-LLM/XLA releases)

---

## Notes on Scope Guard

Ambiguous terms like “shuffling” are treated under **curriculum/data ordering and selection**; practical implementations rely on **curriculum/self-paced learning** and **example-selection metrics** (loss, gradient norms, forgetting events).  


# Master Taxonomy (2014 → 2025)

---

## Model Compression

### Pruning & Sparsity
- **Magnitude & iterative pruning**; *Deep Compression* (Han et al., 2015) — iterative prune–retrain cycle.  
- **Lottery Ticket Hypothesis** (Frankle & Carbin, 2019) — find sparse subnetworks that can train to full accuracy.  
- **Movement pruning** (Sanh et al., 2020) — during fine-tuning, prune weights that move toward zero and keep those that move away from it, regardless of their magnitude.  
- **Structured N:M (2:4) sparsity** — GPU-accelerated sparsity patterns (Ampere/Hopper).  
- **Post-training sparse LLMs (SparseGPT)** and **activation-aware pruning (Wanda)** for one-shot LLM compression.  

---

### Quantization
- **Post-training quantization** (LLM.int8, GPTQ, AWQ, SmoothQuant) — activation-aware or layerwise scaling.  
- **Quantization-aware training (QAT)** — train with simulated (fake) quantization in the forward pass so the weights adapt to rounding error.  
- **Weight-only 4-bit quantization** — GPTQ, AWQ.  
- **FP8 training/inference** — Transformer Engine (E4M3/E5M2).  
- **NF4 (QLoRA)** — 4-bit NormalFloat quantization of the frozen base weights for fine-tuning.  

---

### Weight Sharing / Tying
- **HashedNets** (Chen et al., 2015) — hash multiple connections into shared parameters.  
- **ALBERT** (Lan et al., 2019 / ICLR 2020) — cross-layer parameter sharing & factorized embeddings.  

---

### Low-Rank & Adapters
- **LoRA / DoRA** — train low-rank update matrices for frozen models; efficient fine-tuning.  
- **QLoRA (4-bit base + LoRA)** — combine 4-bit NF4 weights with low-rank adapters for low-memory finetuning.  

---

### Tensor Decompositions
- **CP / Tucker / Tensor-Train (TT)** — compress convolutional and linear layers via low-rank factorization.  
  *Note:* Classic pre-LLM techniques, now rarely used in Transformer compression.

---

### Distillation (KD)
- **Hinton KD** — transfer soft labels from teacher to student.  
- **DistilBERT**, **TinyBERT** — Transformer distillation: DistilBERT distills during general pre-training (task-agnostic); TinyBERT adds a task-specific stage.  
- **Modern LLM self- and speculative-aided distillation** — use the model itself or speculative drafts for supervision.
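
The soft-label transfer in Hinton-style KD above is usually written as a temperature-scaled KL term blended with the ordinary cross-entropy. The sketch below follows that common formulation; the weighting `alpha`, the temperature value, and the toy logits are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    # T^2 keeps the soft-target gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce


teacher = [4.0, 1.0, 0.2]
good_student = [3.5, 1.2, 0.1]
poor_student = [0.1, 3.0, 0.5]
loss_good = kd_loss(good_student, teacher, label=0)
loss_poor = kd_loss(poor_student, teacher, label=0)
print(loss_good, loss_poor)
```

A student that already tracks the teacher gets a small loss, and one that ranks the classes differently is penalized by both terms.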

---

## Efficient Attention & Architecture

### Kernel-Efficient Attention
- **FlashAttention v1/v2** — IO-aware, tiled, exact attention computation.  
- **Long-context attention** — Reformer (LSH), Linformer (low-rank), Longformer/BigBird (sparse), Performer (FAVOR+).
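
The "tiled, exact" property of FlashAttention above rests on an online softmax: keys and values are consumed in blocks while a running maximum and normalizer are rescaled. The single-query sketch below shows only that numerical idea, not the GPU kernel or its memory behavior; names and the tile size are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, K, V):
    """Plain softmax attention for one query, materializing all scores."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    denom = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, V)) / denom for d in range(len(V[0]))]


def attention_tiled(q, K, V, tile=2):
    """Same result, visiting K/V in tiles with running max and normalizer."""
    scale = 1.0 / math.sqrt(len(q))
    running_max, normalizer = float("-inf"), 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), tile):
        scores = [dot(q, k) * scale for k in K[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        normalizer *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, V[start:start + tile]):
            w = math.exp(s - new_max)
            normalizer += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]


q = [1.0, 0.5]
K = [[0.1, 0.2], [1.0, -1.0], [0.5, 0.5], [-0.3, 0.8], [2.0, 0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
ref = attention_reference(q, K, V)
tiled = attention_tiled(q, K, V, tile=2)
print(ref, tiled)
```

The two results agree to floating-point precision, which is the sense in which the tiled computation is exact and not an approximation.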

---

### MQA / GQA
- **Multi-Query Attention (MQA)** — a single K/V head shared by all query heads.  
- **Grouped-Query Attention (GQA)** — shared K/V across head groups for efficiency.
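
The efficiency gain from sharing K/V heads is easiest to see as KV-cache arithmetic. The model shape below (32 layers, 32 query heads, head dimension 128, FP16 cache) is a hypothetical configuration chosen for round numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


shape = dict(n_layers=32, head_dim=128, seq_len=4096, batch=1)
mha = kv_cache_bytes(n_kv_heads=32, **shape)  # one K/V head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **shape)   # 4 query heads share each K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **shape)   # all query heads share one K/V head

for name, size in [("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)]:
    print(name, size / 2**20, "MiB")
```

For this shape the cache drops from 2 GiB per sequence with full multi-head attention to 512 MiB with 8 K/V groups and 64 MiB with MQA.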

---

### Sparsely-Activated (MoE)
- **Mixture-of-Experts (MoE)** — sparse activation of submodules; *GShard (2020)*, *Switch Transformer (2021)*, modern MoE LLMs.
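
Sparse activation in an MoE layer can be sketched as top-k routing: a router scores the experts, only the k best run, and their outputs are mixed by gate weight. Renormalizing the gates over the selected experts is one common convention and is an assumption here, as are the toy experts and names.

```python
import math


def top_k_route(router_logits, k):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    m = max(router_logits)
    exps = [math.exp(z - m) for z in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and combine their outputs by gate weight."""
    routes = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for idx, gate in routes:
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, routes


calls = []


def make_expert(scale):
    def expert(x):
        calls.append(scale)  # record which experts actually ran
        return [scale * v for v in x]
    return expert


experts = [make_expert(s) for s in (1.0, 10.0, 100.0, 1000.0)]
router_logits = [0.0, 2.0, -1.0, 1.0]  # would come from a learned router
out, routes = moe_forward([1.0, 2.0], experts, router_logits, k=2)
print(routes, out, calls)
```

Only two of the four experts execute for this token, so compute scales with k while total parameter count scales with the number of experts.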

---

### Early-Exit / Conditional Compute
- **BranchyNet (2016)** — dynamic inference depth.  
- **DeeBERT / PABEE (2020)** — confidence-based early exit during Transformer inference.
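
Confidence-based early exit can be sketched as checking the prediction entropy after each layer's classifier and stopping once it falls below a threshold. This is a simplified, entropy-style criterion in the spirit of DeeBERT; the threshold and the precomputed per-layer logits are assumptions (a real model would skip computing the later layers).

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def early_exit_predict(per_layer_logits, max_entropy):
    """Return (predicted_class, exit_depth), exiting at the first confident layer."""
    last = len(per_layer_logits)
    for depth, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        # Exit when confident enough, or unconditionally at the final layer.
        if entropy(probs) < max_entropy or depth == last:
            return probs.index(max(probs)), depth


easy_input = [[0.1, 0.2, 0.0], [2.0, 0.1, 0.0], [6.0, 0.0, 0.0]]
hard_input = [[0.1, 0.2, 0.0], [0.3, 0.2, 0.1], [0.5, 0.4, 0.3]]
easy_result = early_exit_predict(easy_input, max_entropy=0.8)
hard_result = early_exit_predict(hard_input, max_entropy=0.8)
print(easy_result, hard_result)
```

The easy input exits after two of three layers while the hard input runs the full depth, which is why the latency benefit depends on the mix of inputs.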

---

### NAS & Compound Scaling
- **ENAS** — efficient neural architecture search.  
- **EfficientNet scaling** — depth, width, and resolution compound scaling rules.

---

### State-Space / Hybrid Alternatives
- **Mamba** — linear-time selective state-space models (SSMs).  
- **Hybrid memory-augmented Transformers** for long-context tasks.  

---

## Training-Time Efficiency

### Mixed Precision
- **FP16/BF16 precision** — reduced-precision arithmetic with loss scaling.  
- **FP8 Transformer Engine** — emerging standard for NVIDIA Hopper/Blackwell.  

---

### Checkpointing / Rematerialization
- **Sublinear-memory training (O(√n))** — recompute activations during backprop to reduce memory footprint.  

---

### Dynamic Sparse Training
- **RigL**, **Sparse Evolutionary Training (SET)** — dynamically evolve sparse connectivity during training.
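
One topology update of dynamic sparse training can be sketched as "drop the weakest active connections, regrow the same number elsewhere", so total sparsity stays fixed. The version below grows connections at random positions, as in SET; RigL would instead pick positions by gradient magnitude. Starting regrown weights at zero, and all names, are assumptions.

```python
import random


def prune_and_regrow(weights, mask, drop_fraction, rng):
    """Swap a fraction of active connections for currently inactive ones."""
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    n_swap = min(int(len(active) * drop_fraction), len(inactive))
    # Drop the active connections with the smallest magnitude.
    dropped = sorted(active, key=lambda i: abs(weights[i]))[:n_swap]
    # Grow among positions that were inactive before this update.
    grown = rng.sample(inactive, n_swap)
    new_weights, new_mask = list(weights), list(mask)
    for i in dropped:
        new_weights[i], new_mask[i] = 0.0, 0
    for i in grown:
        new_weights[i], new_mask[i] = 0.0, 1  # trainable again, starts at zero
    return new_weights, new_mask


weights = [0.9, 0.0, -0.05, 0.0, 0.4, 0.0, -0.02, 0.0]
mask = [1, 0, 1, 0, 1, 0, 1, 0]
new_weights, new_mask = prune_and_regrow(weights, mask, 0.5, random.Random(0))
print(new_mask)
```

The number of active connections is unchanged after the update; only their positions move.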

---

### Depth Scheduling
- **Stochastic Depth**, **LayerDrop** — probabilistically drop residual paths or layers for efficiency and regularization.

---

### Curriculum / Data Pruning / Distillation
- **Curriculum learning** — schedule samples from easy to hard.  
- **Influence/DataMap pruning** — remove low-utility data points.  
- **Dataset distillation** — synthesize representative samples for faster training.

---

### Parallelism & Sharding (Systems)
- **ZeRO / FSDP** — optimizer, gradient, and parameter partitioning.  
- **4-D / compiler-aided parallelism (TorchTitan)** — advanced distributed model sharding.  

---

## Inference-Time Serving Optimizations

### KV-Cache Management
- **PagedAttention (vLLM)** — virtual-memory-like KV cache.  
- **Continuous batching** — dynamic batch aggregation for concurrent requests.  
- **GQA/MQA** — shrink KV cache footprint.  

---

### Speculative Decoding
- **Draft-and-verify decoding** — use a smaller model to propose tokens and verify with a larger one.  
- **Staged/block verification** and **self-speculative decoding** — modern latency-optimized variants.  

---

### Kernel / Graph Fusion
- **TorchInductor (PyTorch 2)** → **Triton kernels** — fuse pointwise and reduction operations into generated kernels, reducing GPU kernel launches and memory traffic.  
- **CUDA Graphs** — reduce CPU overhead for repeated operations.  

---

### Tensor & Graph Optimizers
- **TensorRT-LLM / ONNX / Flash-kernels** — optimize fused kernels with quantization-aware compilation.

---

### Scheduling
- **Continuous batching, chunked prefill, beam/prompt KV sharing** — throughput scaling and reduced fragmentation in *vLLM*.  

---

## Hardware / Format

### Low-Precision Formats
- **INT8 / INT4 weight-only quantization; W8A8 full quantization.**  
- **FP8 (E4M3/E5M2)** — new numeric formats for mixed-precision compute.  

---

### Structured Sparsity
- **2:4 N:M sparsity** — hardware-accelerated structured sparsity on Ampere/Hopper GPUs.
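
The 2:4 pattern means that in every group of four consecutive weights, at most two are nonzero. A minimal magnitude-based way to impose it is sketched below; choosing which two to keep by magnitude is a common heuristic and an assumption here, as is the function name.

```python
def prune_2_of_4(row):
    """Keep the two largest-magnitude values in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(g if i in keep else 0.0 for i, g in enumerate(group))
    return out


row = [0.9, -0.1, 0.2, -0.7, 0.05, 0.3, -0.4, 0.0]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

Every group of four now holds exactly two zeros, which is the layout that sparse matrix-multiply hardware expects; accuracy is normally recovered with fine-tuning afterwards.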

---

### Ahead-of-Time (AOT) Compilers
- **TensorRT-LLM, XLA/OpenXLA, Triton, PyTorch 2 compile path** — optimize static graphs and fusions for performance.  

---

## Evidence Table

| Technique | Core Goal (train / infer / both) | Key Idea | Canonical Paper(s) | Notable 2023–2025 Follow-ups | Typical Speedup / Memory Wins* | Accuracy & Trade-offs |
|------------|----------------------------------|-----------|--------------------|------------------------------|--------------------------------|-----------------------|
| **Magnitude / Iterative Pruning** | both | Remove small weights and retrain | *Deep Compression* (Han et al., 2015) | SparseGPT, Wanda (2023) | 3–10× params ↓; speedup needs sparse kernels | Accuracy recoverable; requires fine-tuning |
| **Lottery Ticket Hypothesis** | train | Sparse subnetworks train to full accuracy | Frankle & Carbin, 2019 | RigL integration | Param↓; variable training savings | Not plug-and-play for LLMs |
| **2:4 Structured Sparsity** | both | N:M pattern for GPU sparse matmul | NVIDIA Ampere (2020) | Post-A100 adoption | 1.5–2× kernel speed | Retraining needed |
| **LLM.int8** | infer | INT8 matmul with outlier feature dimensions kept in FP16 | Dettmers et al., 2022 | W8A8 variants | ~50% mem↓; speed gain modest at best | Minor drift if outliers mishandled |
| **GPTQ (PTQ)** | infer | One-shot, second-order weight quantization to 3–4 bits | Frantar et al., 2022 | AutoRound | 2–4× mem↓; 1.3–2× speed | Depends on calibration data |
| **SmoothQuant (W8A8)** | infer | Offline smoothing for 8-bit activations | Xiao et al., 2022 | TensorRT integration | 1.3–2× speed; mem↓ | Nearly lossless |
| **AWQ (4-bit)** | infer | Protect salient channels via scaling | Lin et al., 2023 | Integrated in vLLM/TensorRT | 2–4× mem↓ | Robust; needs calibration |
| **QLoRA (NF4)** | train | 4-bit base + LoRA adapters | Dettmers et al., 2023 | Instruction-tuning pipelines | 4–8× train mem↓ | Close to 16-bit fine-tuning in reported results |
| **FP8 Training/Inference** | both | Low-bit FP8 arithmetic | Micikevicius et al., 2022 | Transformer Engine | 1.2–1.6× speed; mem↓ | Near BF16 if tuned |
| **Hashed Weight Sharing** | both | Parameter hashing | Chen et al., 2015 | Balanced sharing (2023) | Large param↓ | Noise; best in over-param nets |
| **ALBERT Tying** | train | Cross-layer weight sharing | Lan et al., 2019 | — | 2–3× param↓ | Slight capacity loss |
| **LoRA / DoRA** | train | Low-rank finetuning | Hu et al., 2021 | DoRA (2024) | 10–100× fewer trainable params | Accuracy near full finetune |
| **FlashAttention v1/v2** | both | IO-aware exact attention | Dao et al., 2022–2023 | TensorRT integration | 1.5–2.5× faster attention | Exact results |
| **MQA / GQA** | infer | Shared K/V among heads | Shazeer 2019; Ainslie 2023 | Default in LLMs | KV mem↓ up to 8× | Minor quality drop |
| **MoE (Switch / GShard)** | both | Token routing to few experts | Fedus 2021; Lepikhin 2020 | Production MoE | 2–4× training throughput | Router stability issues |
| **Mamba (SSMs)** | both | Linear-time selective SSMs | Gu & Dao 2023 | Mamba-2 (2024–25) | O(L) inference | Maturing ecosystem |
| **Checkpointing** | train | Activation recomputation | Chen et al., 2016 | Hybrid checkpointing (2024–25) | >50% mem↓ | 20–40% compute↑ |
| **PagedAttention + Continuous Batching** | infer | Virtual KV cache, flexible batching | Kwon et al., 2023 | Jenga (2025) | 2–4× throughput | Scheduler tuning critical |
| **Speculative Decoding** | infer | Draft small model, verify outputs | Chen et al., 2023 | Block/self variants | 1.5–2.5× latency↓ | Gains vary with temperature |
| **TorchInductor → Triton Fusion** | both | Graph fusion & kernel optimization | Ansel et al., 2024 | CUDA Graphs integration | 1.2–2× speed | Limited dynamic support |

---

*Speedups from literature and practice; depend on hardware, sequence length, and kernel maturity.*

---

## “When to Use” Decision Guide

- **Long contexts / high concurrency** → FlashAttention + vLLM (PagedAttention + continuous batching) + GQA/MQA; add speculative decoding if latency dominates.  

- **Memory-bound serving** → Weight-only 4-bit (GPTQ/AWQ) or W8A8 (SmoothQuant) with FlashAttention kernels.  

- **Finetuning on limited VRAM** → QLoRA (NF4 base + LoRA adapters) + checkpointing + BF16/FP16 compute.  

- **Training large dense models** → Mixed precision (BF16/FP8) + checkpointing + compiler fusion + stochastic depth.  

- **Aggressive cost reduction** → Structured 2:4 pruning + brief KD refresh; fine-tune per-layer.

- **Beyond Transformers (linear-time)** → Mamba or hybrid SSM-Transformer when quadratic attention is the bottleneck.  

---

## Cross-Technique Playbooks

### LLM Serving (Long Context)
**GQA + FlashAttention-2 + W8A8 (SmoothQuant) + vLLM (PagedAttention + continuous batching) + speculative decoding.**  

### Cost-Efficient Instruction-Tuning (24–48 GB)
**QLoRA (NF4) + LoRA + checkpointing + BF16 + torch.compile (Inductor→Triton).**  

### Multi-Tenant API Throughput
**vLLM (continuous batching, chunked prefill) + FlashAttention + CUDA Graphs + TensorRT-LLM fusion.**  

### Sparse-Aware Acceleration
**2:4 pruning + distillation + FP8 kernels; keep critical layers dense.**  

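A minimal sketch of the 2:4 pattern named in this playbook, in plain Python: within every group of four consecutive weights, the two smallest-magnitude values are zeroed. The function name `prune_2_of_4` and the sample weights are illustrative assumptions; production use relies on vendor tooling and sparse Tensor Core kernels rather than a mask built this way.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Hypothetical weight row with two groups of four.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse = prune_2_of_4(weights)
print(sparse)
```
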
### Long-Sequence Research
**Mamba backbone / Reformer / Performer + LoRA adapters; evaluate against exact-attention baselines.**  

---

## Coverage Check (Primary Sources)

- **Pruning & Sparsity:** Han (2015); Frankle & Carbin (2019); Movement Pruning (2020); NVIDIA 2:4 (2020).  
- **Quantization:** LLM.int8 (2022); GPTQ (2022); SmoothQuant (2022); AWQ (2023); FP8 (2022); QLoRA/NF4 (2023).  
- **Weight Sharing:** HashedNets (2015); ALBERT (2019).  
- **Adapters:** LoRA (2021).  
- **Attention & Architectures:** FlashAttention (2022–23); Reformer, Linformer, Longformer, Performer (2020); MQA (2019); GQA (2023); GShard (2020); Switch (2021); Mamba (2023).  
- **Training Efficiency:** Mixed Precision (2017); Checkpointing (2016); Stochastic Depth (2016); LayerDrop (2019).  
- **Serving Systems:** vLLM (2023); TensorRT-LLM; PyTorch 2 / TorchInductor (2024); CUDA Graphs.  
- **Decoding:** Speculative Decoding (2023–24).  

---

## Possible Gaps
- Recent tensor-decomposition benchmarks for LLMs.  
- Structured sparsity beyond 2:4 (Hopper+).  
- Memory schedulers beyond vLLM (e.g., FlashInfer).  

---

## Notes
Speedups are indicative; they depend on sequence length, batch size, and hardware.  
“Lossless” denotes mathematically exact attention (FlashAttention) or identical output distribution (speculative decoding).  
Quantization and approximate attention may introduce minor accuracy drift.


# Master Taxonomy (2014 → 2025)

---

## 1) Pruning & Sparsification

- **Second-order (early) pruning:** *Optimal Brain Damage (OBD)*, *Optimal Brain Surgeon (OBS).*  

- **Magnitude / iterative pruning (“learn-prune-retrain”):** *Learning both Weights and Connections* (Han et al.).  

- **One-shot “at initialization” pruning:** *SNIP.*  

- **Winning subnets / trainability:** *Lottery Ticket Hypothesis.*  

- **Movement-aware pruning for transfer learning / NLP:** *Movement Pruning.*  

- **Dynamic sparse training:** *RigL* — sparse connectivity grown / pruned during training.  

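The "learn-prune-retrain" loop listed above can be sketched in a few lines of plain Python. This is an illustrative assumption of how a gradual magnitude schedule might look, not the procedure from any cited paper: `magnitude_prune` and the sparsity targets are hypothetical, and the retraining step between rounds is left as a comment.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of the weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.1, -0.02, 0.4]

# Gradual schedule: raise the sparsity target a little each round.
for target in (0.25, 0.5):
    weights = magnitude_prune(weights, target)
    # Retraining of the surviving weights would happen here (omitted).

print(weights)
```
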
---

## 2) Quantization (Weights / Activations / Compute)

- **INT8 for Transformers (mixed-precision outlier handling):** *LLM.int8().*

- **Post-training LLM quantization (second-order):** *GPTQ.*  

- **Activation-aware weight quantization (INT3/4) for LLMs:** *AWQ.*  

- **Training-free W8A8 quantization (activation smoothing):** *SmoothQuant.*

- **Low-cost PTQ pipeline for large Transformers (INT8/INT4):** *ZeroQuant.*

- **Ultra-low precision BERT (Hessian-guided):** *Q-BERT.*  

- **Binary / ternary / extreme low-bit CNNs:** *BinaryNet / BNN, XNOR-Net, DoReFa-Net.*  

- **Quantized fine-tuning (parameter-efficient):** *QLoRA* (4-bit base + LoRA).  

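As a baseline for the methods in this section, here is a hedged sketch of symmetric per-tensor INT8 quantization (absmax scaling). It is not the algorithm of any method listed above; the function names and sample values are assumptions. Because the scale is set by the largest magnitude, a single outlier coarsens the grid for every other value, which is the problem the outlier-handling methods above are designed to address.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (integer codes, scale)."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    # Round to the nearest step and clamp to the signed 8-bit range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


weights = [0.02, -0.5, 0.31, 1.27, -0.004]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```
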
---

## 3) Knowledge Distillation (Teacher → Student)

- **Classical response-based KD:** *Hinton et al.* (2015).  

- **Intermediate hints / feature KD:** *FitNets.*  

- **Self-distillation / Born-Again Nets:** *BAN.*  

- **Compact Transformer students:** *DistilBERT, TinyBERT, MobileBERT; Patient KD for BERT.*  

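Response-based distillation can be illustrated with the temperature-softened KL term. The sketch below assumes the common formulation (teacher and student logits divided by a temperature T, KL divergence scaled by T²); the hard-label cross-entropy term that is usually added is omitted, and the logits are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """T^2-scaled KL(teacher || student) on softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]
loss_far = kd_loss([0.5, 1.0, 4.0], teacher)    # student disagrees with teacher
loss_near = kd_loss([3.5, 1.2, 0.4], teacher)   # student roughly agrees
loss_same = kd_loss(teacher, teacher)           # identical logits
print(loss_far, loss_near, loss_same)
```
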
---

## 4) Low-Rank & Tensor Decomposition

- **Low-rank factorization of conv/fc layers:** *Jaderberg et al.* (rank-1 conv), *Denton et al.* (linear structure).  

- **Tensor (CP/Tucker) decompositions:** *Lebedev et al.; Tai et al.* (low-rank regularization).  

- **Low-rank adapters (efficient updates):** *LoRA.* Cuts trainable parameters and optimizer memory during fine-tuning; merged adapters add no inference latency but do not reduce inference compute.

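To make the low-rank adapter idea concrete, the sketch below checks two things with tiny hand-written matrices: applying a rank-r update B·A alongside a frozen weight W gives the same output as folding it into W, and the adapter adds only r·(d_in + d_out) parameters. The helper names and the 4096-dimensional example are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Frozen base weight W (3x4) and a rank-1 update B @ A with B (3x1), A (1x4).
W = [[1, 0, 2, 0], [0, 1, 0, 1], [3, 0, 0, 1]]
B = [[1], [0], [2]]
A = [[0, 1, 0, -1]]
x = [[1], [2], [3], [4]]  # input as a column vector

adapter_out = add(matmul(W, x), matmul(B, matmul(A, x)))  # adapter kept separate
merged_out = matmul(add(W, matmul(B, A)), x)              # adapter folded into W


def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one rank-`rank` adapter pair."""
    return rank * (d_in + d_out)


full = 4096 * 4096                       # dense weight of a square projection
lora = lora_params(4096, 4096, rank=8)   # adapter parameters for the same layer
print(adapter_out, merged_out, full // lora)
```
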
---

## 5) Weight Sharing, Clustering & Entropy Coding

- **Deep Compression pipeline:** *Prune → Quantize → Huffman coding* (*Han et al.*).  

- **Hashed parameter sharing:** *HashedNets.*  

---

## 6) Efficient Network Operators & Architectures

- **Depthwise-separable convs & inverted residuals:** *MobileNetV1*, *MobileNetV2.*  

- **Group conv + channel shuffle:** *ShuffleNet.*  

- **Parameter-lean CNN:** *SqueezeNet* (+Deep Compression).  

- **Ghost modules (cheap operations):** *GhostNet.*  

- **Compound scaling & NAS:** *EfficientNet; Once-for-All (OFANet).*  

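The saving from depthwise-separable convolutions follows from simple parameter arithmetic: a standard k×k convolution needs k²·C_in·C_out weights, while a depthwise k×k pass plus a 1×1 pointwise pass needs k²·C_in + C_in·C_out, a ratio of 1/C_out + 1/k². The sketch below works one example; the layer sizes are arbitrary assumptions and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
ratio = sep / std
print(std, sep, round(ratio, 4))
```
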
---

## 7) Attention / Transformer Efficiency (Sub-Quadratic or Memory-Aware)

- **Reversible / efficient Transformers:** *Reformer* (LSH attention, reversible layers).  

- **Low-rank projection attention:** *Linformer.*  

- **Kernelized attention:** *Performer* (FAVOR+).  

- **Sparse / long-sequence patterns:** *Longformer*, *BigBird.*  

- **Memory-optimal kernels:** *FlashAttention* (IO-aware exact attention).  

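Why tiled attention can remain exact is easiest to see for a single query. The sketch below is a simplified illustration, not the FlashAttention kernel: it uses scalar values, processes keys in blocks, and keeps only a running max, a running denominator, and a running weighted sum. Function names and inputs are assumptions.

```python
import math


def attention_reference(scores, values):
    """Softmax-weighted sum computed over all keys at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_tiled(scores, values, block=2):
    """Same result, visiting keys block by block with running statistics."""
    running_max = float("-inf")
    denom = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial sums to the new max before adding this block.
        correction = math.exp(running_max - new_max)
        denom = denom * correction + sum(math.exp(s - new_max) for s in s_blk)
        acc = acc * correction + sum(
            math.exp(s - new_max) * v for s, v in zip(s_blk, v_blk)
        )
        running_max = new_max
    return acc / denom


scores = [0.5, 2.0, -1.0, 3.0, 0.1]   # query-key scores for one query
values = [1.0, 2.0, 3.0, 4.0, 5.0]    # scalar values, for brevity
ref = attention_reference(scores, values)
tiled = attention_tiled(scores, values)
print(ref, tiled)
```
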
---

## 8) Conditional Computation & Sparse Experts

- **GShard:** Conditional computation and automatic sharding — foundational for MoE scaling.  

- **Switch Transformer:** Simple MoE routing, faster pretraining.  

- **Vision-MoE:** Sparse experts for ViT.  

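A toy version of top-1 expert routing, in the spirit of the simple routing listed above: each token is sent to the single expert with the highest router probability, and that expert's output is scaled by the probability. The experts here are hypothetical scalar functions and the router logits are made up; load-balancing losses and capacity limits are left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(router_logits):
    """Return (expert index, gate probability) for one token."""
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


# Hypothetical experts: each one is just a scalar function in this sketch.
experts = [lambda x: 2 * x, lambda x: x + 10, lambda x: -x]

tokens = [1.0, 2.0, 3.0]
router_logits = [[2.0, 0.1, 0.1], [0.0, 3.0, 0.5], [0.2, 0.1, 1.5]]

outputs, assignments = [], []
for x, logits in zip(tokens, router_logits):
    e, gate = route_top1(logits)
    assignments.append(e)
    outputs.append(gate * experts[e](x))  # only one expert runs per token

print(assignments, outputs)
```
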
---

## 9) Dynamic Inference & Early Exiting / Layer Skipping

- **Early-exit heads:** *BranchyNet; MSDNet (multi-exit DenseNets).*  

- **Dynamic layer routing / skipping:** *SkipNet, BlockDrop.*  

- **Early-exit Transformers for NLP:** *DeeBERT* (and related).  

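Early exiting can be sketched as a loop over exit heads that stops at the first confident prediction. The example assumes a max-probability threshold as the confidence test, which is a stand-in; the cited methods define their own exit criteria. The logits are precomputed here only for illustration, whereas a real model would skip the deeper layers entirely after exiting.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exit_logits, threshold=0.9):
    """Walk exit heads in depth order; stop at the first confident one."""
    for depth, logits in enumerate(exit_logits, start=1):
        probs = softmax(logits)
        confidence = max(probs)
        # The last head always answers, confident or not.
        if confidence >= threshold or depth == len(exit_logits):
            return probs.index(confidence), depth


# Hypothetical logits from three exit heads for an easy and a hard input.
easy = [[5.0, 0.0, 0.0], [6.0, 0.0, 0.0], [7.0, 0.0, 0.0]]
hard = [[1.0, 0.8, 0.5], [1.5, 1.0, 0.2], [0.5, 3.0, 0.1]]

easy_label, easy_depth = early_exit_predict(easy)
hard_label, hard_depth = early_exit_predict(hard)
print(easy_label, easy_depth, hard_label, hard_depth)
```
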
---

## 10) Algorithmic / Kernel-Level Speedups

- **Fast convolutions:** FFT-based convolution; *Winograd minimal filtering.*  

---

## 11) Architecture Growth / Sharing for Faster Training

- **Function-preserving transforms:** *Net2Net* and later extensions for rapid model growth.  

- **Parameter tying / sharing to reduce memory:** *ALBERT.*  

---

## 12) Automated, Hardware-Aware Compression / Search

- **RL-driven compression:** *AMC (AutoML for Model Compression).*  

- **Latency-measured simplification:** *NetAdapt.*  

- **Meta-learned channel pruning:** *MetaPruning.*  

---

## 13) Fast Decoding for LLMs (Quality-Preserving)

- **Speculative decoding:** Exact-output-distribution acceleration via draft-and-verify.  

- **Multi-token prediction heads (e.g., MEDUSA):** Predict multiple tokens per step.  

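The "exact-output-distribution" claim for draft-and-verify can be checked numerically for a single token. The sketch assumes the standard rule: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on rejection resample from the normalized positive part of p − q. Rather than sampling, it computes the resulting distribution in closed form; the two toy distributions are made up.

```python
def speculative_output_distribution(target, draft):
    """Exact distribution of one token produced by draft-then-verify."""
    # Probability mass that is drafted and then accepted, per token.
    accepted = [
        q * min(1.0, p / q) if q > 0 else 0.0 for p, q in zip(target, draft)
    ]
    reject_prob = 1.0 - sum(accepted)
    # On rejection, resample from the normalized positive part of (target - draft).
    residual = [max(0.0, p - q) for p, q in zip(target, draft)]
    norm = sum(residual)
    if norm == 0:
        return accepted  # draft equals target, nothing is ever rejected
    return [a + reject_prob * r / norm for a, r in zip(accepted, residual)]


target = [0.6, 0.3, 0.1]   # large-model distribution p
draft = [0.3, 0.5, 0.2]    # small-model distribution q
result = speculative_output_distribution(target, draft)
acceptance_rate = sum(min(p, q) for p, q in zip(target, draft))
print(result, acceptance_rate)
```
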
---

## Notes

- This taxonomy highlights **canonical**, **high-impact**, and **widely adopted** acceleration and compression methods from 2014–2025.  
- **Emerging directions:** token pruning/merging for ViTs and LLMs, FP8/BF16 mixed-precision training, KV-cache optimizations, hybrid state-space architectures.  
- **Composability:** Many techniques can be layered for multiplicative benefits — e.g., *(prune + quantize + distill)*, *(NAS + depthwise ops)*, *(MoE + FlashAttention).*  
