# MASTER LIST — SPEEDUP TECHNIQUES (TRAINING + INFERENCE)

Organized into **8 Mega-Categories** + **28 Subcategories**, followed by a bonus section with 3 more.

---

# 1) Architectural Optimization

## 1.1 Smaller or Efficient Architectures
- MobileNet (Depthwise Separable Convolutions)  
- ShuffleNet (Channel Shuffle)  
- SqueezeNet (Fire modules)  
- EfficientNet (Compound scaling)  
- ConvNeXt (modernized ConvNet design) / RepVGG (structural reparameterization)  
- MLP-Mixer / gMLP  
- Linformer / Nyströmformer / Performer (linear attention)
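
As a rough illustration of why depthwise separable convolutions are cheaper, the sketch below counts the weights in one layer. The function names and the layer sizes (128 to 256 channels, 3×3 kernel) are made up for the example, and biases are ignored.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """One k x k depthwise filter per input channel, then a 1 x 1 pointwise conv."""
    return c_in * k * k + c_in * c_out


standard = conv_params(128, 256, 3)                  # 294912
separable = depthwise_separable_params(128, 256, 3)  # 33920
print(f"reduction: {standard / separable:.1f}x")
```

The reduction factor approaches k² as the number of output channels grows.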

## 1.2 Token / Input Reduction
- Patch merging (Vision Transformers)  
- Token pruning / early exiting  
- Compact tokenization (byte-level BPE, SentencePiece)  
- Sliding-window attention (Longformer, BigBird)  
- Temporal downsampling (audio, video)

## 1.3 Low-Rank Structure
- LoRA-style low-rank decomposition  
- Tensor-train decomposition  
- Kronecker factorization  
- SVD on weight matrices
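
A minimal sketch of the idea shared by these methods, assuming a weight matrix that factors as `W = B @ A` with a small inner rank. The helper names are hypothetical and plain Python lists stand in for tensors.

```python
def matvec(m, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def low_rank_matvec(b, a, v):
    """Compute (B @ A) @ v as B @ (A @ v), never forming the full matrix."""
    return matvec(b, matvec(a, v))


def param_counts(d_out: int, d_in: int, rank: int):
    """(full matrix parameters, factored parameters) for a rank-`rank` factorization."""
    return d_out * d_in, rank * (d_out + d_in)


B = [[1.0], [2.0], [3.0]]      # 3 x 1
A = [[1.0, 0.0, 2.0, 0.0]]     # 1 x 4
print(low_rank_matvec(B, A, [1.0, 1.0, 1.0, 1.0]))
print(param_counts(4096, 4096, 16))
```

For a 4096×4096 layer at rank 16, the factored form stores 131,072 parameters instead of 16,777,216.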

## 1.4 Parameter Sharing
- ALBERT-style parameter sharing  
- Recurrent MLP blocks  
- Weight tying in embedding/output layers

---

# 2) Training Optimization

## 2.1 Mixed Precision Training
- FP16 / bfloat16  
- FP8 training (NVIDIA Hopper)  
- Dynamic loss scaling
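
Dynamic loss scaling can be sketched as a small state machine: shrink the scale when gradients overflow, grow it again after a run of clean steps. The class below is an illustrative toy, not any framework's implementation; the name `DynamicLossScaler` and its default values are assumptions chosen for the example.

```python
import math


class DynamicLossScaler:
    """Toy loss scaler: the loss is multiplied by `scale` before backprop."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def unscale_and_update(self, scaled_grads):
        """Return unscaled gradients, or None if this step must be skipped."""
        if any(not math.isfinite(g) for g in scaled_grads):
            # Overflow: shrink the scale and skip the optimizer step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return None
        grads = [g / self.scale for g in scaled_grads]
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long run without overflow: try a larger scale.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return grads
```

bfloat16 has the same exponent range as FP32, so it usually does not need loss scaling; the technique matters mainly for FP16.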

## 2.2 Gradient Optimization
- Gradient accumulation  
- Gradient checkpointing  
- Selective activation recomputation  
- Adaptive optimizers (AdamW, Adafactor)
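
Gradient accumulation trades time for memory: averaging the gradients of equal-sized micro-batches reproduces the gradient of the full batch for a mean-reduced loss. The sketch below checks this on a one-parameter least-squares model; the functions and data are invented for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch):
    """Average per-micro-batch gradients; assumes len(xs) divides evenly."""
    starts = range(0, len(xs), micro_batch)
    total = 0.0
    for i in starts:
        total += grad_mse(w, xs[i:i + micro_batch], ys[i:i + micro_batch])
    return total / len(starts)


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
print(grad_mse(0.5, xs, ys), accumulated_grad(0.5, xs, ys, micro_batch=2))
```

If the micro-batches have unequal sizes, each gradient needs to be weighted by its sample count for the equivalence to hold.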

## 2.3 Batch & Data Pipeline Optimization
- Larger batch sizes with LARS / LAMB  
- Prefetching  
- Fused dataloaders  
- Asynchronous augmentation

## 2.4 Distributed Training
- Data Parallelism  
- Model Parallelism  
- Pipeline Parallelism  
- ZeRO (Stage 1–3)  
- Fully Sharded Data Parallel (FSDP)  
- Tensor Parallelism  
- Mixture-of-Experts parallelism

## 2.5 Curriculum Learning
- Progressive task difficulty  
- Progressive layer training  
- Freeze-unfreeze schedules

---

# 3) Inference Optimization

## 3.1 Quantization  
(These typically preserve accuracy when calibrated well; the risk of degradation grows as bit width drops below 8.)

- Post-training quantization (PTQ)  
- Quantization-aware training (QAT)  
- 8-bit / 4-bit / 2-bit weights  
- SmoothQuant  
- GPTQ / AWQ / AQLM  
- FP8 inference
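
The simplest form of post-training quantization is symmetric per-tensor int8 rounding, sketched below. This is a toy under stated assumptions (one scale per tensor, no calibration data, hypothetical function names); methods such as GPTQ or AWQ are considerably more involved.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, [round(r, 4) for r in restored])
```

The round-trip error per weight is at most half the scale, which is why a single outlier weight (a large `max_abs`) hurts every other weight in the tensor.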

## 3.2 Pruning  
(Structured pruning gives real speedups on standard hardware; the accuracy drop stays small only at moderate sparsity or after fine-tuning.)

- Block pruning  
- Head pruning (Transformers)  
- Neuron / channel pruning  
- Magnitude pruning  
- Movement pruning
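
Magnitude pruning is the easiest of these to sketch: zero out the fraction of weights with the smallest absolute value. The function below is an illustrative, unstructured, one-shot version with a hypothetical name; practical pipelines usually prune gradually and fine-tune in between.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


print(magnitude_prune([0.1, -2.0, 0.05, 3.0], sparsity=0.5))
```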

## 3.3 Distillation
- Knowledge distillation (teacher → student)  
- TinyBERT / DistilBERT  
- Layer-wise distillation  
- Logit matching + feature distillation
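
Logit matching is commonly written as a KL divergence between temperature-softened teacher and student distributions, scaled by T². The sketch below assumes that formulation; the function names and the temperature value are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


print(distillation_loss([1.0, 0.5, -1.0], [2.0, 0.0, -2.0]))
```

In practice this term is usually mixed with the ordinary cross-entropy loss on the ground-truth labels.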

## 3.4 Caching & Reuse
- KV cache for autoregressive transformers  
- Attention cache  
- Prefix decoding  
- Recurrent memory tokens
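
A KV cache avoids recomputing keys and values for earlier tokens at every decoding step. The toy below uses a single attention head and takes the query, key, and value vectors as given (the projection layers are omitted); `KVCache` and `attend` are hypothetical names for this sketch.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


class KVCache:
    """Stores keys and values of past tokens so each step only adds one pair."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        """Append this token's key/value, then attend over everything cached."""
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)
```

Each step costs time linear in the cached length instead of quadratic, at the price of memory that grows with sequence length.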

---

# 4) Algorithmic Improvements

## 4.1 Efficient Attention
- FlashAttention (v1 / v2)  
- Memory-efficient attention  
- Sparse attention  
- Linear attention (Performer, ETC)  
- Kernel-based attention  
- Reformer LSH attention
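
FlashAttention-style kernels rely on an "online softmax": the softmax-weighted sum is accumulated block by block with a running maximum, so the full score row never has to be materialized. The sketch below shows that recurrence for scalar values; it illustrates the numerical trick only, not the tiled GPU kernel, and the function names are made up.

```python
import math


def softmax_weighted_sum(scores, values):
    """Reference: sum_i softmax(scores)_i * values_i computed in one pass."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)


def online_softmax_weighted_sum(scores, values, block=2):
    """Same result, processing `block` scores at a time."""
    running_max = -math.inf
    denom = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        denom = denom * correction + sum(math.exp(s - new_max) for s in s_blk)
        acc = acc * correction + sum(math.exp(s - new_max) * v for s, v in zip(s_blk, v_blk))
        running_max = new_max
    return acc / denom
```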

## 4.2 Inference-Time Algorithms
- Speculative decoding  
- Medusa decoding  
- Multi-step parallel decoding  
- Lookahead decoding  
- Tree-based decoding
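
Speculative decoding lets a cheap draft model propose several tokens that the large target model then verifies. The sketch below is the greedy-verification variant with toy next-token functions; the sampling variant uses a rejection-sampling acceptance rule instead, and `speculative_step` is a hypothetical name.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """One greedy speculative step; returns the tokens accepted this round."""
    # The draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # The target model checks each position. In a real system these
    # checks come from one batched forward pass over the proposal.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # fix the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))  # extra token when every draft matched
    return accepted


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # wrong after a 3


print(speculative_step(toy_target, toy_draft, [1], k=4))
```

With greedy verification the output is identical to greedy decoding with the target model alone; the speedup depends on how often the draft agrees.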

## 4.3 Training-Time Algorithms
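- SAM optimizer (seeks flat minima that generalize better, which can let a smaller model reach a target accuracy; each step costs about two forward-backward passes)  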
- Sharpness-aware reparameterization  
- Label smoothing

---

# 5) Hardware Optimization

## 5.1 GPU-Level Optimizations
- Tensor Cores  
- CUDA Graphs  
- Kernel fusion  
- Operator fusion (FlashAttention, Flash-Decoding)  
- cuDNN / cuBLAS optimized kernels

## 5.2 Multi-GPU & Cluster
- NCCL optimizations  
- Efficient all-reduce  
- Interconnect: NVLink, InfiniBand  
- Sharded state (ZeRO, FSDP)

## 5.3 Compilation
- TensorRT  
- ONNX Runtime  
- XLA Compilation  
- JAX jit  
- PyTorch 2.x `torch.compile` (TorchInductor backend)

---

# 6) Numerical & Mathematical Tricks

## 6.1 Normalization Improvements
- RMSNorm  
- LayerNorm-free designs  
- Pre-norm Transformer  
- ScaleNorm  
- Cosine normalization
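
RMSNorm is cheaper than LayerNorm because it skips mean subtraction and the bias term. A minimal sketch on a plain list, with a hypothetical function name and an assumed `eps` value:

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide by the root mean square of x, then apply a learned per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


out = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
print(out)
```

With unit gain, the output has a mean square of approximately 1.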

## 6.2 Activation Function Improvements
- GELU vs ReLU  
- SwiGLU  
- SiLU  
- ReLU6 / hard-swish (edge devices)

## 6.3 Initialization & Scaling
- Xavier / Kaiming initialization  
- μ-parameterization  
- QK normalization (Transformers)

---

# 7) Data-Centric Speedups

## 7.1 Better Data, Less Compute
- Data deduplication  
- Data filtering  
- High-quality pretraining corpora  
- Synthetic data bootstrapping

## 7.2 Faster Data Sampling
- Token packing  
- Efficient shuffling  
- Weighted sampling for curriculum
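
Token packing places several short sequences into one fixed-length training window to cut padding. The sketch below uses first-fit-decreasing bin packing, one reasonable heuristic among several; the function name and the example lengths are invented.

```python
def pack_sequences(lengths, max_len):
    """First-fit-decreasing packing; returns bins of sequence indices."""
    bins, free = [], []
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        n = lengths[idx]
        if n > max_len:
            raise ValueError(f"sequence {idx} is longer than max_len")
        for b, space in enumerate(free):
            if n <= space:
                bins[b].append(idx)
                free[b] -= n
                break
        else:
            # No existing bin has room: open a new one.
            bins.append([idx])
            free.append(max_len - n)
    return bins


lengths = [5, 3, 7, 2, 4, 1]
bins = pack_sequences(lengths, max_len=8)
print(bins)
```

Here six sequences fit into three windows of length 8 instead of six padded ones. Packed sequences normally need an attention mask or position reset so that tokens do not attend across sequence boundaries.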

## 7.3 Self-Supervised Efficiency
- SimCLR → BYOL → DINO → iBOT  
- Teacher-free self-distillation

---

# 8) Model Compression (With Minimal Performance Loss)

## 8.1 Reparameterization
- Iterative pruning + retraining  
- Structural re-parameterization (RepVGG)  
- Weight clustering

## 8.2 Tensor Compression
- Weight sharing  
- Huffman coding  
- Matrix factorization

## 8.3 KV-Cache Compression
- Sublayer KV merging  
- KV token dropping  
- Dynamic KV eviction

---

# BONUS: Transformer-Specific Modern Techniques

## 9.1 Scaling Laws for Efficient Training
- Chinchilla scaling (compute-optimal)  
- Data-optimal training  
- Small model + more data > large model + less data  
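
The compute-optimal trade-off can be sketched with two widely quoted rules of thumb: training cost C ≈ 6·N·D FLOPs for N parameters and D tokens, and roughly 20 tokens per parameter. Both are approximations assumed for this example, not exact results, and `compute_optimal` is a hypothetical helper.

```python
import math


def compute_optimal(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (parameters, tokens) under C = 6 * N * D."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


# Budget matching a 70B-parameter model trained on 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
n, d = compute_optimal(budget)
print(f"{n:.3g} parameters, {d:.3g} tokens")
```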

## 9.2 Efficient Long-Context
- FlashAttention-2 + PagedAttention (vLLM)  
- Streaming attention  
- RingAttention  
- H2O (Heavy-Hitter Oracle KV-cache eviction)  

## 9.3 Efficient Decoding
- Grouped-query attention (GQA)  
- Multi-query attention (MQA)  
- Speculative sampling  
- Recurrent-state architectures (RWKV, Mamba) with constant-size decoding state
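
GQA and MQA speed up decoding mainly by shrinking the KV cache. The arithmetic below assumes a hypothetical 32-layer model with 32 query heads, head dimension 128, a 4096-token context, and FP16 storage; none of these numbers come from a specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Keys and values: two tensors per layer of shape [batch, kv_heads, seq_len, head_dim]."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


MIB = 1024 ** 2
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    size = kv_cache_bytes(layers=32, kv_heads=kv_heads, head_dim=128, seq_len=4096)
    print(f"{name}: {size / MIB:.0f} MiB per sequence")
```

Under these assumptions the cache shrinks from 2 GiB (MHA) to 512 MiB (GQA with 8 KV heads) and 64 MiB (MQA).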

---

# FINAL EXTREME SPEEDUPS (Little or No Performance Loss)

Stacked together, these can provide **5×–20×** improvements, depending on model, hardware, and workload:

- FP8 Training + FlashAttention2  
- GQA or MQA (alternatives: GQA interpolates between multi-head and multi-query attention)  
- Speculative Decoding  
- TensorRT / Inductor compilation  
- Quantization-aware training  
- Distillation + low-rank adapters  
- Chinchilla scaling  
- Data filtering + token packing  
- Modular MoE routing (top-1 gating)

---
