FAQ
FlashOptim is a model optimization toolkit for compressing and accelerating deep learning models. It supports quantization, pruning, distillation, and neural architecture search.
FlashOptim works with any PyTorch model, with first-class support for FlashVision detection and classification models.
- Training/Optimization: GPU recommended (CUDA 11.8+)
- Inference: CPU, GPU, or edge devices (via ONNX/TensorRT export)
Choosing between post-training quantization (PTQ) and quantization-aware training (QAT):
- PTQ is faster (no training required) but may lose 1-2% accuracy
- QAT requires training but typically recovers accuracy to within 0.5%
- Start with PTQ; use QAT if the accuracy drop is unacceptable
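To make the PTQ accuracy trade-off concrete, here is a minimal, framework-free sketch of 8-bit affine quantization. It is not FlashOptim's API: `quantize` and `dequantize` are hypothetical helper names and the weights are made-up values. It only illustrates that mapping floats to integers without retraining introduces a rounding error of at most half a quantization step per value.

```python
def quantize(values, num_bits=8):
    """Affine (asymmetric) quantization of floats to unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range must include 0 so that zero maps exactly.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are 0
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map the integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


weights = [-0.62, -0.05, 0.0, 0.31, 1.27]  # illustrative values only
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(scale, 5), zp, max_error <= scale / 2 + 1e-9)
```

QAT simulates this rounding during training so the weights can adapt to it, which is why it tends to recover more accuracy than PTQ.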
Typically 100-500 representative calibration samples are sufficient. More samples help with histogram-based calibration.
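The sketch below shows why the calibration method matters. It compares a plain min-max range with a percentile-clipped range, which is used here as a simple stand-in for histogram-based calibration. The function names, the synthetic activations, and the 99% setting are all assumptions for illustration, not FlashOptim behavior.

```python
import random


def minmax_range(samples):
    """Use the full observed range; a single outlier can stretch it."""
    return min(samples), max(samples)


def percentile_range(samples, pct=99.0):
    """Clip the range so that roughly `pct` percent of values fall inside it."""
    ordered = sorted(samples)
    tail = (100.0 - pct) / 200.0          # fraction trimmed from each side
    k = int(len(ordered) * tail)
    return ordered[k], ordered[len(ordered) - 1 - k]


rng = random.Random(0)
activations = [rng.gauss(0.0, 1.0) for _ in range(500)]  # synthetic calibration data
activations.append(40.0)                                  # one outlier

mm = minmax_range(activations)
pr = percentile_range(activations, pct=99.0)
print("min-max:", mm)
print("percentile:", pr)
```

With min-max, the outlier sets the upper bound and most of the integer range is spent on values that never occur. The clipped range ignores it, and more samples make the tail estimate more stable.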
Unstructured pruning does not speed up inference without sparse hardware or sparse inference engines. Use structured pruning for guaranteed speedup on standard hardware.
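The difference is easiest to see in the shape of the result. This is a minimal sketch on a plain nested list, with hypothetical helper names and made-up weights: unstructured pruning zeroes individual entries but keeps the matrix size, so a dense kernel does the same amount of work, while structured pruning removes whole rows (output channels), so the matrix itself gets smaller.

```python
def prune_unstructured(weights, sparsity):
    """Zero the smallest-magnitude entries; the matrix shape is unchanged."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def prune_structured(weights, keep):
    """Keep the `keep` rows (output channels) with the largest L1 norm."""
    ranked = sorted(
        range(len(weights)),
        key=lambda i: sum(abs(w) for w in weights[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [weights[i][:] for i in kept]


W = [
    [0.9, -0.1, 0.4],
    [0.05, 0.02, -0.03],
    [-0.7, 0.6, 0.2],
    [0.1, -0.8, 0.3],
]
sparse = prune_unstructured(W, sparsity=0.5)
small = prune_structured(W, keep=2)
print(len(sparse), "x", len(sparse[0]))  # still 4 x 3, half the entries are zero
print(len(small), "x", len(small[0]))    # 2 x 3, a genuinely smaller matrix
```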
Yes, pruning and quantization can be combined. Apply pruning first, fine-tune, then quantize. This gives both sparsity and reduced precision.
No, the teacher and student do not need to share an architecture. Feature distillation with projection layers can handle different architectures, and logit distillation works regardless of architecture.
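As a rough illustration of what a projection layer does in feature distillation, the sketch below maps a 2-dimensional student feature into a 3-dimensional teacher feature space and takes one gradient step on the projection. All names and numbers are hypothetical, and a real setup would train the projection jointly with the student in a framework such as PyTorch.

```python
def project(P, x):
    """Apply a linear projection P (rows = teacher dims) to feature vector x."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


student_feat = [0.5, -1.0]                  # student feature, dimension 2
teacher_feat = [0.2, 0.4, -0.6]             # teacher feature, dimension 3
P = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]    # 3x2 projection, zero-initialised

loss_before = mse(project(P, student_feat), teacher_feat)

# One gradient-descent step on P:
# d(loss)/dP[i][j] = (2 / n) * (P x - t)[i] * x[j]
lr = 0.1
err = [p - t for p, t in zip(project(P, student_feat), teacher_feat)]
P = [
    [P[i][j] - lr * 2.0 / len(err) * err[i] * student_feat[j] for j in range(2)]
    for i in range(3)
]
loss_after = mse(project(P, student_feat), teacher_feat)
print(loss_before, loss_after)
```

Because the projection absorbs the dimension mismatch, the student and teacher layers being matched can have different widths.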
Start with a distillation temperature of T=4. Higher temperatures (5-10) work better when the teacher is much larger than the student.
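To show what the temperature does, here is a small stdlib-only sketch of a logit distillation loss. It assumes the common formulation of KL divergence between temperature-softened distributions, scaled by T squared. The function names and logits are illustrative, and FlashOptim's own loss may differ.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T gives a flatter distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * T * T


teacher = [8.0, 2.0, 0.5]   # made-up logits
student = [5.0, 3.0, 1.0]
sharp = softmax(teacher, T=1.0)
soft = softmax(teacher, T=4.0)
loss = distill_loss(student, teacher, T=4.0)
print(sharp, soft, loss)
```

At T=1 the teacher puts almost all probability on one class. At T=4 the smaller logits receive visible probability mass, which is the extra signal the student learns from.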
Neural architecture search (NAS) time depends on the strategy and search space:
- Random search: minutes to hours
- Evolutionary: hours to days
- Use proxy tasks to speed up evaluation
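The random search strategy with a proxy evaluation can be sketched in a few lines. Everything here is hypothetical: the search space, the `proxy_score` formula (a stand-in for a short training run on a reduced task), and the trial count. The point is only that each candidate is scored cheaply instead of being trained in full.

```python
import random

# Hypothetical search space; real spaces are usually much larger.
SEARCH_SPACE = {
    "depth": [2, 4, 6, 8],
    "width": [32, 64, 128],
    "kernel": [3, 5],
}


def sample_config(rng):
    """Draw one architecture uniformly at random from the search space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def proxy_score(config):
    """Cheap stand-in for a short training run (higher is better)."""
    capacity = config["depth"] * config["width"]
    cost = capacity * config["kernel"] ** 2
    return capacity ** 0.5 - 0.002 * cost   # reward capacity, penalise compute


def random_search(trials, seed=0):
    """Evaluate `trials` random configurations and return the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = sample_config(rng)
        score = proxy_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(trials=20)
print(best_config, best_score)
```

An evolutionary strategy would replace `sample_config` with mutation of the current best candidates, which is why it needs more evaluations and typically runs longer.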
Export formats:
- ONNX (recommended)
- TensorRT (planned)
- OpenVINO (planned)
- CoreML (planned)

flashoptim benchmark --model optimized/model.onnx --device cpu --warmup 10 --runs 100

FlashOptim: Model optimization toolkit | PyPI | MIT License