**THE ORDER I SHOULD STUDY (Simplified, Practical)**

### **Phase 1 — Foundations (2–4 weeks)**

FLOPs, GPU basics, memory hierarchy, autograd, PyTorch profiler.

### **Phase 2 — Architecture Optimization (2–3 weeks)**

MobileNet/EfficientNet/DistilBERT/FlashAttention.

### **Phase 3 — Training Optimization (3 weeks)**

FSDP, ZeRO, AMP, checkpointing, XLA.

### **Phase 4 — Compression (4–6 weeks)**

Quantization → pruning → distillation → LoRA → low-rank.

### **Phase 5 — Mobile Deployment (3–4 weeks)**

TFLite, CoreML, ONNX Runtime Mobile.

### **Phase 6 — LLM Inference Optimization (4–6 weeks)**

vLLM, TensorRT-LLM, KV cache, speculative decoding.

### **Phase 7 — Compiler & Kernel Level (6–12 weeks)**

Learn Triton → write fused kernels → study TVM/TensorRT.


<hr>
<hr>
<hr>


# **Detailed Roadmap for Model Optimization (Size, Params, Quantization)**


## STAGE 1 — Core Fundamentals (Must Learn First)

Applies to _all_ optimization domains.

### **1. FLOPs & MACs deeply**

Understand:

- FLOPs vs throughput
- Memory-bound vs compute-bound
- How to calculate FLOPs for Conv2D, MatMul, Attention
- Why reducing parameters sometimes doesn’t reduce latency
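
The FLOP counts and the memory-bound vs compute-bound distinction above can be sketched in a few lines of plain Python. This is a rough illustration, not a profiler: each multiply-accumulate is counted as 2 FLOPs, bias and softmax are ignored, and the accelerator figures (`PEAK_FLOPS`, `PEAK_BW`) are hypothetical numbers chosen only to make the roofline comparison concrete.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product, counting each MAC as 2 FLOPs."""
    return 2 * m * k * n


def conv2d_flops(c_in: int, c_out: int, kernel: int, h_out: int, w_out: int,
                 groups: int = 1) -> int:
    """FLOPs for a square-kernel Conv2D (bias ignored)."""
    macs = (c_in // groups) * kernel * kernel * c_out * h_out * w_out
    return 2 * macs


def attention_flops(seq_len: int, d_head: int, n_heads: int) -> int:
    """FLOPs for QK^T and the weighted sum over V (softmax, projections ignored)."""
    scores = matmul_flops(seq_len, d_head, seq_len)
    weighted_sum = matmul_flops(seq_len, seq_len, d_head)
    return n_heads * (scores + weighted_sum)


def is_compute_bound(flops: float, bytes_moved: float,
                     peak_flops: float, peak_bandwidth: float) -> bool:
    """Roofline test: compute-bound when FLOPs per byte exceeds machine balance."""
    return flops / bytes_moved > peak_flops / peak_bandwidth


# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS, PEAK_BW = 100e12, 1e12
BYTES_FP16 = 2
d = 4096

# Large square matmul: reads two d x d matrices and writes one.
big = is_compute_bound(matmul_flops(d, d, d), 3 * d * d * BYTES_FP16,
                       PEAK_FLOPS, PEAK_BW)

# Matrix-vector product (like a batch-1 decode step): weight traffic dominates.
small = is_compute_bound(matmul_flops(1, d, d), (d + d * d + d) * BYTES_FP16,
                         PEAK_FLOPS, PEAK_BW)

print(big, small)
```

Under these assumed numbers the square matmul lands on the compute-bound side and the matrix-vector product on the memory-bound side. That is one reason cutting parameters or FLOPs does not always cut latency: a memory-bound op is limited by bytes moved, not arithmetic.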

### **2. GPU Architecture**

Know exactly:

- Warps, threads, blocks
- Shared memory vs global memory
- Tensor Cores & BF16/FP16
- Why kernel fusion matters

### **3. PyTorch Internals**

- TorchScript / FX Graph
- Static vs dynamic shapes
- JIT optimization
- Profiling (PyTorch Profiler, TensorBoard)

### **4. Memory Optimization Basics**

- Activation checkpointing
- Mixed precision
- Gradient accumulation
- Offloading (CPU/GPU/NVMe)
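
Gradient accumulation is the easiest of these to check by hand. The sketch below uses a hypothetical one-parameter model `y = w * x` with a mean-squared-error loss, and assumes equally sized micro-batches; it shows that averaging micro-batch gradients reproduces the full-batch gradient while only one micro-batch is processed at a time.

```python
def grad_mse(w: float, xs: list, ys: list) -> float:
    """d/dw of mean((w * x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list, ys: list, micro_batch_size: int) -> float:
    """Accumulate micro-batch gradients, each scaled by 1 / number of steps."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("this sketch assumes equally sized micro-batches")
    steps = len(xs) // micro_batch_size
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        end = start + micro_batch_size
        total += grad_mse(w, xs[start:end], ys[start:end]) / steps
    return total


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(full, accumulated)
```

The same scaling idea (divide each micro-batch loss by the number of accumulation steps) carries over to real training loops; the memory saving comes from holding activations for one micro-batch only.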

This stage builds your foundation.

<hr>
<hr>
<hr>


## STAGE 2 — Model Architecture Optimization

Learn how efficient models are designed.

### **CNN Efficiency (for Computer Vision)**

Study:

- MobileNet V1 → depthwise separable conv
- MobileNet V2 → inverted residuals
- MobileNet V3 → SE blocks + NAS
- ShuffleNet → channel shuffle
- EfficientNet → compound scaling
- SqueezeNet → AlexNet-level accuracy with ~50× fewer parameters

Master:

- Depthwise conv math
- Grouped conv
- Bottleneck blocks
- Channel pruning concepts
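
The depthwise and grouped conv math above reduces to parameter counting. A minimal sketch, assuming square kernels and no bias; the layer sizes are arbitrary illustration values, not taken from any specific MobileNet layer.

```python
def conv_params(c_in: int, c_out: int, kernel: int, groups: int = 1) -> int:
    """Weight count of a Conv2D with a square kernel (bias ignored)."""
    return (c_in // groups) * kernel * kernel * c_out


def depthwise_separable_params(c_in: int, c_out: int, kernel: int) -> int:
    """Depthwise conv (groups == c_in) followed by a 1x1 pointwise conv."""
    depthwise = conv_params(c_in, c_in, kernel, groups=c_in)
    pointwise = conv_params(c_in, c_out, kernel=1)
    return depthwise + pointwise


standard = conv_params(256, 256, 3)
separable = depthwise_separable_params(256, 256, 3)
ratio = separable / standard  # equals 1 / c_out + 1 / kernel**2
print(standard, separable, round(ratio, 4))
```

For a 3×3 kernel the ratio approaches 1/9 as the channel count grows, which is where the roughly 8 to 9× saving usually quoted for depthwise separable convolutions comes from.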

---

### **Transformer Efficiency**

Understand:

- RMSNorm vs LayerNorm
- FlashAttention
- Multi-Query Attention (MQA)
- Grouped Query Attention (GQA)
- ALiBi / RoPE
- SwiGLU vs GELU
- KV cache — how it fundamentally impacts latency
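
To see why MQA and GQA matter for the KV cache, it helps to compute the cache size directly. The sketch below assumes a hypothetical decoder with 32 layers, 32 query heads, head dimension 128, FP16 storage, and a 4096-token context; none of these numbers describe a specific model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int, seq_len: int,
                   batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store keys and values for every layer and token."""
    keys_and_values = 2
    return (keys_and_values * n_layers * n_kv_heads * d_head
            * seq_len * batch_size * bytes_per_elem)


MIB = 1024 ** 2
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=4096)
mqa = kv_cache_bytes(n_layers=32, n_kv_heads=1, d_head=128, seq_len=4096)
print(mha // MIB, gqa // MIB, mqa // MIB)  # MiB per sequence
```

The cache scales with the number of key/value heads, not query heads, so sharing K/V heads shrinks both memory use and the bytes read per decoded token.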

Learn lighter architectures:

- DistilBERT
- MobileBERT
- TinyBERT
- Longformer / Linformer (attention cost linear in sequence length)
- LLaMA architecture principles

<hr>
<hr>
<hr>


## STAGE 3 — Training-Time Optimization (Speed + Memory)

This is critical.

### **Techniques you must master**

- FP16, BF16 (automatic mixed precision)
- Gradient checkpointing (also called activation recomputation)
- FullyShardedDataParallel (FSDP)
- ZeRO optimizers (DeepSpeed)
- Adafactor (low memory optimizer)
- FlashAttention (attention memory grows linearly with sequence length instead of quadratically)
- Tensor parallelism
- Pipeline parallelism
- XLA for training
- PyTorch Compile (torch.compile)
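
The memory effect of ZeRO-style sharding can be estimated with simple arithmetic. The sketch below follows the usual accounting for mixed-precision Adam (2 bytes per parameter for FP16 weights, 2 for FP16 gradients, 12 for FP32 optimizer state) and counts model states only; activations, buffers, and fragmentation are deliberately left out, so treat the results as a lower bound.

```python
def zero_model_state_bytes(n_params: int, n_gpus: int, stage: int) -> float:
    """Per-GPU bytes for model states under ZeRO stage 0 (none) to 3."""
    params = 2.0 * n_params   # FP16 weights
    grads = 2.0 * n_params    # FP16 gradients
    optim = 12.0 * n_params   # FP32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= n_gpus       # stage 1 shards optimizer state
    if stage >= 2:
        grads /= n_gpus       # stage 2 also shards gradients
    if stage >= 3:
        params /= n_gpus      # stage 3 also shards the weights
    return params + grads + optim


N_PARAMS, N_GPUS = 7_500_000_000, 64
per_stage_gb = [zero_model_state_bytes(N_PARAMS, N_GPUS, s) / 1e9 for s in range(4)]
print([round(gb, 1) for gb in per_stage_gb])
```

FSDP with full sharding follows roughly the same accounting as stage 3; the trade-off is extra communication to gather weights before each layer runs.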

### **Your skills after this stage**

You’ll know how to train:

- Larger models on the same GPU (how much larger depends on the model and the techniques combined)
- Faster with the same hardware
- At lower VRAM using sharding/offloading
- With fused kernels

<hr>
<hr>
<hr>


## STAGE 4 — Model Compression (Cross-domain)

This applies to CNNs + Transformers + Mobile.

### **A. Quantization**

Master:

- PTQ (Post-training quantization)
- QAT (Quantization-aware training)
- FP16, INT8, INT4, INT2
- Weight-only quantization
- GPTQ
- AWQ (Activation-aware Weight Quantization)
- SmoothQuant
- KV cache quantization (LLMs)
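
Most of the methods above build on the same primitive: map floats to a small integer grid with a scale factor. A minimal sketch of symmetric per-tensor quantization in plain Python, assuming a non-empty list of finite values; real toolchains add per-channel scales, calibration, and zero points, none of which are shown here.

```python
def quantize_symmetric(values: list, n_bits: int = 8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: list, scale: float) -> list:
    """Recover approximate floats from the integer grid."""
    return [q * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q8, scale8 = quantize_symmetric(weights, n_bits=8)
q4, scale4 = quantize_symmetric(weights, n_bits=4)
error8 = max(abs(w - d) for w, d in zip(weights, dequantize(q8, scale8)))
error4 = max(abs(w - d) for w, d in zip(weights, dequantize(q4, scale4)))
print(q8, q4, error8, error4)
```

The rounding error per value is bounded by half the scale, so dropping from INT8 to INT4 widens the grid spacing and the worst-case error along with it. Methods such as GPTQ, AWQ, and SmoothQuant are different strategies for keeping that error from hurting accuracy.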

Quantization is the **most important skill** for both mobile and LLM inference.

---

### **B. Pruning**

Learn:

- Unstructured pruning
- Structured pruning (filters, channels, heads)
- Movement pruning
- Lottery ticket hypothesis
- N:M sparsity (2:4 sparsity, hardware-accelerated on NVIDIA Ampere and newer GPUs)
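
The N:M pattern is easy to show on a flat list of weights. This sketch uses plain magnitude pruning (keep the `n` largest-magnitude values in every group of `m`), which is only one possible selection rule and is an assumption here; it does not show the fine-tuning that normally follows.

```python
def prune_n_of_m(weights: list, n: int = 2, m: int = 4) -> list:
    """Keep the n largest-magnitude values in each group of m; zero the rest."""
    if len(weights) % m != 0:
        raise ValueError("weight count must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest magnitudes within this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.3, 0.05, 2.0, -0.1, 0.0, 1.5]
sparse_row = prune_n_of_m(row)
print(sparse_row)
```

Because every group of four has the same number of zeros, the layout is regular enough for hardware to skip them, which unstructured pruning at the same sparsity level does not guarantee.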

---

### **C. Knowledge Distillation**

Master:

- Soft targets
- Matching intermediate layers
- Layer dropping
- Distilling LLMs into smaller student models
- Distilling CNNs into mobile CNNs
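
The soft-target idea can be written as a short loss function. A minimal sketch, assuming the common formulation of KL divergence between temperature-softened teacher and student distributions, scaled by T²; the logits are made-up values, and in practice this term is usually mixed with the ordinary hard-label loss.

```python
import math


def softmax(logits: list, temperature: float = 1.0) -> list:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: list, teacher_logits: list,
                      temperature: float = 2.0) -> float:
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]
print(distillation_loss(close_student, teacher), distillation_loss(far_student, teacher))
```

A higher temperature flattens the teacher distribution so the student also learns from the relative probabilities of the wrong classes, which is the information hard labels throw away.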

---

### **D. Low-rank Methods**

- LoRA
- QLoRA
- LoRA for inference distillation
- Low-rank factorization of weight matrices
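
A small sketch of the LoRA update may help here: the frozen weight `W` is left untouched and a rank-`r` product `B @ A` is added on top, scaled by `alpha / r`. Matrices are plain nested lists and the sizes are illustrative assumptions; real implementations operate on tensors and typically initialize `B` to zeros so training starts from the base model's output.

```python
def matvec(matrix: list, vector: list) -> list:
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W: list, A: list, B: list, x: list, alpha: float) -> list:
    """y = W x + (alpha / r) * B (A x), with A: r x d_in and B: d_out x r."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter."""
    return rank * (d_in + d_out)


W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, 0.5]]             # rank 1, d_in = 2
B_zero = [[0.0], [0.0]]      # zero init: adapter has no effect yet
B_trained = [[1.0], [2.0]]   # hypothetical values after training
x = [1.0, 1.0]

print(lora_forward(W, A, B_zero, x, alpha=2.0))
print(lora_forward(W, A, B_trained, x, alpha=2.0))
print(lora_params(4096, 4096, 8), 4096 * 4096)
```

For a 4096×4096 layer at rank 8, the adapter trains well under 1% of the parameters of the full matrix, which is the source of LoRA's memory savings during fine-tuning.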

<hr>
<hr>
<hr>


## STAGE 5 — Mobile Deployment Optimization

After compressing the model, learn:

### **Framework-specific**

- TFLite
- CoreML
- ONNX Runtime Mobile
- MediaPipe
- MNN (Alibaba)
- NCNN (Tencent)

### **Skills**

- Quantization for hardware support
- Operator fusion
- Delegate acceleration (GPU, NNAPI)
- Memory footprint optimization
- Latency vs accuracy tradeoffs

### **Techniques**

- Post-training integer quantization
- Micro-Conv and GhostNet architectures
- Pre-fused mobile kernels

<hr>
<hr>
<hr>


## STAGE 6 — Transformers / LLM Inference Optimization

This is a separate world.

### **A. GPU LLM Runtimes**

Learn how these work:

- vLLM
- TensorRT-LLM
- TGI (HuggingFace)
- DeepSpeed-MII
- FasterTransformer (NVIDIA; superseded by TensorRT-LLM)
- AITemplate

---

### **B. Key Concepts**

- PagedAttention (vLLM)
- Continuous batching
- Prefill vs decode phase
- KV cache optimization
- Flash-Decoding
- Speculative decoding
- Quantized attention kernels
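
The idea behind paged KV caching can be illustrated with a toy block allocator: the cache is split into fixed-size blocks, and each sequence keeps a table mapping its logical positions to physical blocks. The class below is a hypothetical teaching sketch, not vLLM's implementation; `PagedKVAllocator` and its method names are invented for this example.

```python
class PagedKVAllocator:
    """Toy allocator: hands out fixed-size KV blocks on demand per sequence."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id: str):
        """Reserve a slot for one new token; returns (block_id, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=2)
slots = [allocator.append_token("a") for _ in range(3)]
allocator.append_token("b")
print(slots, len(allocator.free_blocks))
allocator.free("a")
print(len(allocator.free_blocks))
```

Since blocks are allocated only as tokens arrive and returned as soon as a sequence finishes, memory is not reserved up front for the maximum context length. That is what lets continuous batching admit new requests mid-flight.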

<hr>
<hr>
<hr>


## STAGE 7 — Systems-Level Optimization (The Hardest Part)

This is where you become elite.

### **A. Tensor Compilers**

You must learn:

- ONNX Runtime
- TensorRT kernels
- TVM (Apache TVM)
- MLIR
- AITemplate
- PyTorch 2.x compiler (torch.compile / TorchInductor)

---

### **B. GPU Kernel Programming**

Learn Triton (way easier than CUDA):

- Fused linear layers
- Fused softmax
- FlashAttention kernel
- Fused LayerNorm + bias + matmul
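
Before writing these kernels, it helps to understand the running-max recurrence that FlashAttention-style kernels use to compute softmax over a row in tiles without seeing the whole row first. The sketch below is plain Python rather than Triton, processes one element at a time instead of one tile at a time, and assumes finite inputs; it shows the math only, not the memory layout or parallelism of a real kernel.

```python
import math


def online_softmax(xs: list) -> list:
    """Softmax with max and normalizer tracked in a single streaming pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the old sum to the new max, then add the current term.
        running_sum = (running_sum * math.exp(running_max - new_max)
                       + math.exp(x - new_max))
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]


def two_pass_softmax(xs: list) -> list:
    """Reference version: one pass for the max, one for the sum."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


row = [1000.0, 1001.0, 1002.0, -5.0]
print(online_softmax(row))
print(two_pass_softmax(row))
```

The rescaling step is what allows a kernel to keep only a running max and sum in fast memory while streaming blocks of keys, instead of materializing the full attention matrix.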

Then optionally learn basic CUDA:

- Thread divergence
- Warp-level primitives
- Shared-memory tiling
- Tensor core GEMM


<hr>
<hr>
<hr>


## STAGE 8 — Final Mastery: Model Serving Optimization

- Batching strategies
- CPU offloading
- Memory pools
- Quantization-aware serving
- Async execution queues

These are the kinds of techniques that large-scale production serving stacks rely on.
