# MASTER LIST — SPEEDUP TECHNIQUES (TRAINING + INFERENCE)

Organized into **8 Mega-Categories** + **28 Subcategories**, plus a bonus section with 3 more.

---

# 1) Architectural Optimization

## 1.1 Smaller or Efficient Architectures
- MobileNet (Depthwise Separable Convolutions)  
- ShuffleNet (Channel Shuffle)  
- SqueezeNet (Fire modules)  
- EfficientNet (Compound scaling)  
- ConvNeXt / RepVGG (reparameterization)  
- MLP-Mixer / gMLP  
- Linformer / Nyströmformer / Performer (linear attention)
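
As a rough illustration of why depthwise separable convolutions (the MobileNet building block) are cheaper, the sketch below counts weights for a standard convolution versus a depthwise plus pointwise pair. The function names and the layer shape are illustrative assumptions, and biases are ignored.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 128 input channels, 256 output channels.
standard = conv_params(3, 128, 256)
separable = depthwise_separable_params(3, 128, 256)
reduction = standard / separable   # close to 1 / (1/c_out + 1/k^2)
```

For this shape the separable version needs roughly 8.7 times fewer weights; the same ratio applies to multiply-accumulate counts per output position.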

## 1.2 Token / Input Reduction
- Patch merging (Vision Transformers)  
- Token pruning / early exiting  
- Subword tokenization to shorten sequences (byte-level BPE, SentencePiece)  
- Sliding-window attention (Longformer, BigBird)  
- Temporal downsampling (audio, video)

## 1.3 Low-Rank Structure
- LoRA-style low-rank decomposition  
- Tensor-train decomposition  
- Kronecker factorization  
- SVD on weight matrices

## 1.4 Parameter Sharing
- ALBERT-style parameter sharing  
- Recurrent MLP blocks  
- Weight tying in embedding/output layers

---

# 2) Training Optimization

## 2.1 Mixed Precision Training
- FP16 / bfloat16  
- FP8 training (NVIDIA Hopper)  
- Dynamic loss scaling
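
Dynamic loss scaling can be sketched as a small state machine: shrink the scale and skip the step when gradients overflow, and grow the scale again after a run of clean steps. The class below is a toy, framework-free sketch; its name, parameters, and defaults are assumptions for illustration rather than any library's API.

```python
import math


class DynamicLossScaler:
    """Toy loss scaler: back off on overflow, grow after a run of clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=3):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def step(self, scaled_grads):
        """Return unscaled gradients, or None if the update must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale *= self.backoff_factor   # overflow: shrink the scale
            self._clean_steps = 0
            return None                         # caller skips the optimizer update
        grads = [g / self.scale for g in scaled_grads]
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            self.scale *= self.growth_factor    # stable for a while: try a larger scale
            self._clean_steps = 0
        return grads


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
first = scaler.step([1024.0, 2048.0])        # clean step, gradients are unscaled
second = scaler.step([float("inf"), 1.0])    # overflow, step skipped, scale halved
scale_after_overflow = scaler.scale
scaler.step([512.0])
scaler.step([512.0])                         # second clean step in a row, scale doubles
scale_after_recovery = scaler.scale
```

In a real training loop the loss is multiplied by `scale` before the backward pass, so small FP16 gradients do not underflow to zero.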

## 2.2 Gradient Optimization
- Gradient accumulation  
- Gradient checkpointing  
- Selective activation recomputation  
- Adaptive optimizers (AdamW, Adafactor)
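
Gradient accumulation trades wall-clock steps for memory: several micro-batches are processed before one optimizer update. The sketch below uses a hypothetical one-parameter model `y = w * x` with a mean-squared-error loss to show that, with equal-sized micro-batches, the accumulated gradient matches the full-batch gradient.

```python
def mse_grad(w, xs, ys):
    """d/dw of mean((w * x - y)^2) over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full_batch_grad = mse_grad(w, xs, ys)

micro_batch_size = 2
num_micro = len(xs) // micro_batch_size
accumulated = 0.0
for i in range(num_micro):
    lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
    # Divide by the number of micro-batches so the sum equals the full-batch mean.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_micro
```

The division by `num_micro` is the detail that is easy to forget; without it the effective learning rate grows with the number of accumulation steps.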

## 2.3 Batch & Data Pipeline Optimization
- Larger batch sizes with LARS / LAMB  
- Prefetching  
- Fused dataloaders  
- Asynchronous augmentation

## 2.4 Distributed Training
- Data Parallelism  
- Model Parallelism  
- Pipeline Parallelism  
- ZeRO (Stage 1–3)  
- Fully Sharded Data Parallel (FSDP)  
- Tensor Parallelism  
- Mixture-of-Experts parallelism

## 2.5 Curriculum Learning
- Progressive task difficulty  
- Progressive layer training  
- Freeze-unfreeze schedules

---

# 3) Inference Optimization

## 3.1 Quantization  
(8-bit, and often 4-bit, quantization preserves accuracy when calibrated well; 2-bit usually costs noticeable accuracy.)

- Post-training quantization (PTQ)  
- Quantization-aware training (QAT)  
- 8-bit / 4-bit / 2-bit weights  
- SmoothQuant  
- GPTQ / AWQ / AQLM  
- FP8 inference
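
The simplest form of post-training quantization is symmetric, per-tensor rounding to 8-bit integers. The sketch below is a minimal pure-Python version under that assumption; methods such as SmoothQuant, GPTQ, and AWQ are considerably more involved, and the helper names here are illustrative.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to signed 8-bit integers."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate float values."""
    return [v * scale for v in q]


weights = [0.02, -1.27, 0.5, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is why a single large outlier (which inflates `scale`) hurts accuracy for every other weight in the tensor.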

## 3.2 Pruning  
(Structured pruning yields real speedups on standard hardware; the accuracy drop is small only at moderate sparsity or after fine-tuning.)

- Block pruning  
- Head pruning (Transformers)  
- Neuron / channel pruning  
- Magnitude pruning  
- Movement pruning
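
Magnitude pruning, the simplest entry in the list above, can be sketched in a few lines: zero out the fraction of weights with the smallest absolute value. This is an unstructured, one-shot sketch with an illustrative function name; practical pipelines usually prune gradually and fine-tune afterwards.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest |w|."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2]
sparse = magnitude_prune(weights, sparsity=0.5)
```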

## 3.3 Distillation
- Knowledge distillation (teacher → student)  
- TinyBERT / DistilBERT  
- Layer-wise distillation  
- Logit matching + feature distillation

## 3.4 Caching & Reuse
- KV cache for autoregressive transformers  
- Attention cache  
- Prefix decoding  
- Recurrent memory tokens

---

# 4) Algorithmic Improvements

## 4.1 Efficient Attention
- FlashAttention (1/2)  
- Memory-efficient attention  
- Sparse attention  
- Linear attention (Performer, ETC)  
- Kernel-based attention  
- Reformer LSH attention
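
FlashAttention stays exact while processing keys and values block by block because softmax can be computed in a streaming fashion with a running maximum and a running sum. The sketch below shows only that rescaling idea for a single query with scalar values; it is a simplified assumption-laden illustration, not the tiled GPU kernel.

```python
import math


def attention_reference(scores, values):
    """Softmax-weighted average computed with all scores in memory."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total


def attention_streaming(scores, values, block_size):
    """Same result, visiting scores one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0   # sum of exp(score - running_max) seen so far
    running_out = 0.0   # sum of exp(score - running_max) * value seen so far
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial results to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(s_blk, v_blk):
            e = math.exp(s - new_max)
            running_sum += e
            running_out += e * v
        running_max = new_max
    return running_out / running_sum


scores = [0.1, 2.0, -1.0, 3.5, 0.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
reference = attention_reference(scores, values)
streamed = attention_streaming(scores, values, block_size=2)
```

Because only the running statistics are kept, the full score row never has to be materialized, which is the source of the memory savings.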

## 4.2 Inference-Time Algorithms
- Speculative decoding  
- Medusa decoding  
- Multi-step parallel decoding  
- Lookahead decoding  
- Tree-based decoding

## 4.3 Training-Time Algorithms
- SAM optimizer (seeks flat minima for better generalization, at roughly 2× the per-step cost)  
- Sharpness-aware reparameterization  
- Label smoothing

---

# 5) Hardware Optimization

## 5.1 GPU-Level Optimizations
- Tensor Cores  
- CUDA Graphs  
- Kernel fusion  
- Operator fusion (FlashAttention, Flash-Decoding)  
- cuDNN / cuBLAS optimized kernels

## 5.2 Multi-GPU & Cluster
- NCCL optimizations  
- Efficient all-reduce  
- Interconnect: NVLink, InfiniBand  
- Sharded state (ZeRO, FSDP)

## 5.3 Compilation
- TensorRT  
- ONNX Runtime  
- XLA Compilation  
- JAX jit  
- PyTorch 2.x `torch.compile` (TorchInductor backend)

---

# 6) Numerical & Mathematical Tricks

## 6.1 Normalization Improvements
- RMSNorm  
- LayerNorm-free designs  
- Pre-norm Transformer  
- ScaleNorm  
- Cosine normalization

## 6.2 Activation Function Improvements
- GELU vs ReLU  
- SwiGLU  
- SiLU  
- ReLU6 / hard-swish (edge devices)

## 6.3 Initialization & Scaling
- Xavier / Kaiming initialization  
- μ-parameterization  
- QK normalization (Transformers)

---

# 7) Data-Centric Speedups

## 7.1 Better Data, Less Compute
- Data deduplication  
- Data filtering  
- High-quality pretraining corpora  
- Synthetic data bootstrapping

## 7.2 Faster Data Sampling
- Token packing  
- Efficient shuffling  
- Weighted sampling for curriculum
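
Token packing reduces padding by concatenating several short sequences into one fixed-length training example. One simple way to choose which sequences share an example is greedy first-fit-decreasing bin packing, sketched below; the function name and the lengths are illustrative assumptions.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing of sequence lengths into bins."""
    bins = []   # each bin is a list of sequence indices
    free = []   # remaining capacity of each bin
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        n = lengths[idx]
        if n > max_len:
            raise ValueError(f"sequence {idx} is longer than max_len")
        for b, room in enumerate(free):
            if n <= room:
                bins[b].append(idx)
                free[b] -= n
                break
        else:
            # No existing bin has room: open a new one.
            bins.append([idx])
            free.append(max_len - n)
    return bins


lengths = [700, 300, 512, 200, 100, 900]
packed = pack_sequences(lengths, max_len=1024)
utilization = sum(lengths) / (len(packed) * 1024)
```

Here six sequences fit into three examples instead of six padded ones. A real pipeline also needs attention masks (and usually position resets) so that packed sequences do not attend to each other.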

## 7.3 Self-Supervised Efficiency
- SimCLR → BYOL → DINO → iBOT  
- Teacher-free self-distillation

---

# 8) Model Compression (Without Hurting Performance)

## 8.1 Reparameterization
- Iterative pruning + re-training  
- Structural re-parameterization (RepVGG)  
- Weight clustering

## 8.2 Tensor Compression
- Weight sharing  
- Huffman coding  
- Matrix factorization

## 8.3 KV-Cache Compression
- Sublayer KV merging  
- KV token dropping  
- Dynamic KV eviction

---

# BONUS: Transformer-Specific Modern Techniques

## 9.1 Scaling Laws for Efficient Training
- Chinchilla scaling (compute-optimal)  
- Data-optimal training  
- Small model + more data > large model + less data  

## 9.2 Efficient Long-Context
- FlashAttention-2 + paged attention  
- Streaming attention  
- RingAttention  
- H2O attention  

## 9.3 Efficient Decoding
- Grouped-query attention (GQA)  
- Multi-query attention (MQA)  
- Speculative sampling  
- Recurrent / state-space models (RWKV, Mamba)
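
GQA and MQA speed up decoding mainly by shrinking the KV cache, since fewer key/value heads are stored per layer. The sketch below estimates cache size for a hypothetical model shape (32 layers, 32 query heads, head dimension 128, FP16); the shape is an assumption chosen only for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for all layers."""
    # Factor 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
cfg = dict(n_layers=32, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(n_kv_heads=32, **cfg)   # every query head has its own K/V head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)    # 4 query heads share each K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)    # all query heads share one K/V head
```

For this shape the cache shrinks from 2 GiB (MHA) to 512 MiB (GQA with 8 groups) to 64 MiB (MQA) per sequence.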

---

# FINAL EXTREME SPEEDUPS (No Performance Loss)

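These are often cited as giving roughly **5×–20×** improvements when combined; actual gains, and any accuracy impact, depend on the model, hardware, and workload: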

- FP8 Training + FlashAttention2  
- GQA or MQA (MQA is GQA with a single KV group)  
- Speculative Decoding  
- TensorRT / Inductor compilation  
- Quantization-aware training  
- Distillation + low-rank adapters  
- Chinchilla scaling  
- Data filtering + token packing  
- Modular MoE routing (top-1 gating)

---


# Some Papers That Use These Techniques

---

# 1. Knowledge Distillation (KD) & Teacher–Student Compression

**Core idea:** Train a smaller “student” model to mimic a larger “teacher” using soft targets → keeps accuracy, big speed/inference gains.

## Flagship papers & authors

**Distilling the Knowledge in a Neural Network – Hinton, Vinyals, Dean (2015)**  
The classic KD paper. Defines temperature scaling, soft targets, and shows a large ensemble distilled into a single smaller model without losing performance.
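
A minimal sketch of the soft-target loss, assuming plain Python lists of logits: both teacher and student logits are softened with a temperature `T`, and the student is penalized by the KL divergence from the teacher, scaled by `T^2` to keep gradient magnitudes comparable across temperatures. Using KL rather than cross-entropy is a choice made here; the two differ only by a term that does not depend on the student.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, numerically stabilized."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, times T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [2.0, 1.0, 0.1]
student = [1.0, 0.5, 0.2]
loss = distillation_loss(student, teacher)
perfect = distillation_loss(teacher, teacher)   # student matches teacher exactly
```

In practice this term is usually combined with the ordinary cross-entropy loss on the hard labels.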

**Knowledge Distillation Surveys – Gou et al., 2021; Mansourian et al., 2024**  
Comprehensive surveys on KD variants (logit-based, feature-based, relation-based, self-distillation, multi-teacher, etc.).

**TinyBERT, MobileBERT, DistilBERT**

- TinyBERT – Jiao et al. (2019)  
- MobileBERT – Sun et al. (2020)  
- DistilBERT – Sanh et al. (2019)

All are “teacher–student” distillations of BERT designed for faster inference with minimal accuracy loss; they appear prominently in recent KD surveys.

**What to take away:** Hinton’s KD paper is the conceptual root; many modern small transformers and mobile models rely on some KD variant.

---

# 2. Pruning & Structured Sparsity

**Core idea:** Delete weights, channels, or blocks that don’t matter → fewer FLOPs, smaller memory footprint, similar accuracy.

## Key references

**Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding – Song Han, Huizi Mao, Bill Dally (2016)**  
Pipeline: magnitude pruning → weight quantization → Huffman coding. Compresses AlexNet from 240 MB to 6.9 MB (about 35×) with no accuracy loss.

Han’s follow-up pruning pipeline papers extend this prune-then-quantize approach.

**Modern sparse training / Lottery Ticket Hypothesis** also fit here, but Deep Compression is the canonical starting point.

---

# 3. Quantization & Low-Precision Inference (especially for LLMs)

**Core idea:** Represent weights (and sometimes activations) in INT8 / INT4 / INT2, etc. → 2–4× speed & memory savings, with minimal accuracy loss.

## Flagship LLM quantization papers

**SmoothQuant – Xiao, Lin, Seznec, Wu, Demouth, Song Han (2022)**  
“SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.”

**GPTQ – Frantar et al. (2023)**  
“GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.”

**AWQ – Ji Lin et al. (2024)**  
“AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.”

**FBGEMM – Khudia et al. (2021)**  
“FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference.”

---

# 4. Efficient CNN / Vision Architectures (lightweight by design)

**Core idea:** Build from efficient blocks (depthwise convolutions, bottlenecks) to get SOTA accuracy at low FLOPs.

## Canonical works

**MobileNet – Howard et al. (2017)**  
“MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.”

**EfficientNet – Tan & Le (2019)**  
“EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.”

MobileNetV2, MobileNetV3, ShuffleNet, GhostNet also belong here.

---

# 5. Efficient Transformer Architectures & Attention

**Core idea:** Make attention cheaper (less memory, faster) without large accuracy drops.

## Important papers

**Reformer – Kitaev, Kaiser, Levskaya (2020)**  
“Reformer: The Efficient Transformer.”

**Performer – Choromanski et al. (2021)**  
“Rethinking Attention with Performers.”

**FlashAttention – Tri Dao et al. (2022)**  
“FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.”

**Switch Transformer – Fedus, Zoph, Shazeer (2022)**  
“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.”

---

# 6. Faster Decoding / Inference Algorithms (for LLMs)

**Core idea:** Change the decoding algorithm, not the model → fewer serial forward passes.

**Speculative Decoding – Yaniv Leviathan et al. (2023)**  
“Fast Inference from Transformers via Speculative Decoding.”
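
The key step in speculative decoding is the accept/reject rule that keeps the output distribution identical to the target model's. The sketch below shows it for a single proposed token over a toy three-token vocabulary: `p` stands for the target model's distribution and `q` for the draft model's. Both distributions and all function names are illustrative assumptions; a real system proposes several draft tokens and verifies them in one target forward pass.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q): where to resample from after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]


def speculative_step(p, q, rng):
    """Propose one token from draft q; accept or resample so the result follows p."""
    vocab = list(range(len(p)))
    x = rng.choices(vocab, weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True    # draft token accepted
    # Rejected: draw from the part of p that q under-covers.
    return rng.choices(vocab, weights=residual_distribution(p, q))[0], False


rng = random.Random(0)
p, q = [0.6, 0.3, 0.1], [0.3, 0.5, 0.2]
n_samples = 20000
counts = [0, 0, 0]
for _ in range(n_samples):
    token, _accepted = speculative_step(p, q, rng)
    counts[token] += 1
freqs = [c / n_samples for c in counts]   # should track p, not q
```

The closer the draft distribution is to the target, the more proposals are accepted and the fewer serial target-model passes are needed.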

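**KV-cache / paged attention**  
KV caching itself has no single canonical paper; PagedAttention was introduced with vLLM in “Efficient Memory Management for Large Language Model Serving with PagedAttention” (Kwon et al., 2023). The core idea is caching attention keys and values and managing that memory efficiently.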

---

# 7. Parameter-Efficient Fine-Tuning (PEFT)

**Core idea:** Freeze the large backbone, add a small number of tunable parameters → faster training, minimal inference overhead.

**LoRA – Edward J. Hu et al. (2021)**  
“LoRA: Low-Rank Adaptation of Large Language Models.”
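
To make the savings concrete, the sketch below counts trainable parameters for a full weight update versus a rank-`r` LoRA update `B @ A`, and shows how the update can be folded back into the frozen weight so inference uses a single matrix. The matrix sizes and helper names are illustrative assumptions.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. a rank-`rank` LoRA update."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, b, a, alpha, rank):
    """Fold the low-rank update into W: W' = W + (alpha / rank) * B @ A."""
    delta = matmul(b, a)
    s = alpha / rank
    return [[wij + s * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Hypothetical 4096 x 4096 projection with rank 8.
full, lora = lora_param_counts(4096, 4096, rank=8)

# Tiny merge example with a rank-1 update.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.0]]
merged = merge_lora(w, b, a, alpha=2.0, rank=1)
```

For the 4096 × 4096 case the adapter trains 256 times fewer parameters than the full matrix, and after merging there is no extra inference cost.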

---

# 8. Compute-Optimal Training & Scaling Laws

**Core idea:** Choose optimal model size vs. number of training tokens → same or better performance at lower compute.

**Chinchilla – Hoffmann et al. (2022)**  
“Training Compute-Optimal Large Language Models.”

Shows that earlier LLMs were under-trained for their size; Chinchilla (70B parameters, 1.4T tokens) outperforms much larger models such as Gopher (280B) at the same training compute, and is cheaper to run at inference.
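
A back-of-the-envelope sketch of the trade-off, using two widely quoted approximations that are assumptions here rather than exact results: training compute of about `6 * N * D` FLOPs for `N` parameters and `D` tokens, and a compute-optimal ratio of roughly 20 tokens per parameter. The Chinchilla figures used are the paper's reported 70B parameters and about 1.4T training tokens.

```python
def training_flops(n_params, n_tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens


def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget assuming C = 6 * N * D and D = tokens_per_param * N."""
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params


chinchilla_flops = training_flops(70e9, 1.4e12)   # on the order of 6e23 FLOPs
n_opt, d_opt = compute_optimal_split(chinchilla_flops)
```

Feeding Chinchilla's own budget back through the rule of thumb recovers about 70B parameters and 1.4T tokens, which is simply a consistency check on the 20-tokens-per-parameter ratio.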

---

# 9. Distributed Training & Memory Optimizations

**Core idea:** Shard parameters, gradients, optimizer states; overlap communication and compute.

**ZeRO – Rajbhandari et al. (2020)**  
“ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.”
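
The effect of each ZeRO stage on per-GPU memory can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam, with 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer state (master weights, momentum, variance). It counts model states only; activations, buffers, and fragmentation are ignored, and the function name is illustrative.

```python
def zero_memory_gb(n_params, n_gpus, stage):
    """Per-GPU model-state memory (GB) for mixed-precision Adam under ZeRO."""
    params = 2.0 * n_params   # fp16 parameters
    grads = 2.0 * n_params    # fp16 gradients
    opt = 12.0 * n_params     # fp32 master weights, momentum, variance
    if stage >= 1:
        opt /= n_gpus         # stage 1 shards optimizer state
    if stage >= 2:
        grads /= n_gpus       # stage 2 also shards gradients
    if stage >= 3:
        params /= n_gpus      # stage 3 also shards parameters
    return (params + grads + opt) / 1e9


# Example setting: a 7.5B-parameter model on 64 GPUs.
per_stage = {stage: zero_memory_gb(7.5e9, 64, stage) for stage in range(4)}
```

For this setting the estimate drops from 120 GB without sharding to about 31.4 GB, 16.6 GB, and 1.9 GB for stages 1 to 3.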

**FSDP – Fully Sharded Data Parallel**  
“Fully Sharded Data Parallel: faster AI training with fewer GPUs.”

**PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel – Zhao et al. (2023)**  
Describes production-grade FSDP implementation.

---

# 10. Hardware-Aware Kernels & Libraries

**Core idea:** Optimize low-level kernels (matmul, convolutions, quantized ops) for massive speed gains with same math.

**FBGEMM – Khudia et al. (2021)**  
“FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference.”

FlashAttention (above) is this idea applied to attention.

---

# 11. Classic but Still Important Techniques

- Mixed-precision training (FP16/BF16) + loss scaling  
- Gradient checkpointing  
- Reversible layers (used in Reformer)

These are widely used to speed training and reduce memory.

---

# 12. Where Hinton Fits in This Landscape

Hinton contributes most directly via:

- **Distilling the Knowledge in a Neural Network** – foundation of KD  
- Earlier foundational work on efficient representations and autoencoders

KD is one of the main roots of modern model compression, alongside pruning and quantization.

---

# How to Use This as a Research Roadmap

## Start with the surveys
- Knowledge Distillation Surveys – Gou et al. (2021)  
- Knowledge Distillation Survey – Mansourian et al. (2024)  

For quantization, where no survey is listed here, start with the flagship papers:

- SmoothQuant  
- GPTQ  
- AWQ

## Then go technique by technique
Read the 1–3 canonical papers in each family (KD, pruning, quantization, efficient CNNs, efficient attention, speculative decoding, PEFT, ZeRO/FSDP, etc.).

## Use citation graphs
ConnectedPapers and Semantic Scholar help explore neighborhoods around:  
EfficientNet, FlashAttention, LoRA, GPTQ, AWQ, Speculative Decoding.

---
