Ultra-lightweight real-time object detection with LoRA fine-tuning, Knowledge Distillation, and a full desktop UI
Train, distill, fine-tune, monitor, test, export, and quantize — all from a single application.
FlashDet is an end-to-end object detection system based on the NanoDet-Plus architecture with a ShuffleNetV2 backbone. It ships with a modern PyQt5 desktop application that covers the entire workflow — from dataset conversion to INT8 quantized deployment — without writing a single line of code. It also integrates advanced training techniques inspired by torchtune: LoRA/QLoRA for parameter-efficient fine-tuning and Knowledge Distillation for training compact student models from larger teachers.
- Ultra-Lightweight — 0.49M to 2.44M parameters (0.9 – 4.7 MB FP16)
- Real-Time — 100+ FPS on GPU, 30+ FPS on edge devices
- LoRA / QLoRA Fine-Tuning — Freeze the backbone, train only low-rank adapters; QLoRA adds INT8/NF4 quantization of frozen weights for even lower memory
- Knowledge Distillation — Train a small student model from a larger teacher using logit + feature distillation with configurable temperature and loss weights
- Full Desktop UI — Sidebar navigation, live charts, visual comparison tools — including LoRA/KD configuration panels
- End-to-End — Data conversion → Training → Dashboard → Inference → Export → Quantization
- Multiple Formats — Import from YOLO / VOC / COCO, export to ONNX
- Advanced Quantization — Static, Dynamic, QAT, Hessian-based with side-by-side visual comparison
- Mixed Precision & Multi-GPU — AMP (FP16) training with DataParallel support
- COCO Pretrained Weights — Auto-download official pretrained checkpoints for transfer learning
Catppuccin Mocha dark theme — deep navy base with warm pastel accents.
| Quantization + Visual Comparison | |
![]() |
|
| Static / Dynamic / QAT / Hessian quantization with algorithm selection, calibration settings, comparison charts, and visual side-by-side inference. |
Download the pre-built binary for your platform — no Python required.
Windows
- Download
FlashDet_Setup.exefrom Releases - Run the installer
- Launch from the Start Menu
Linux
tar -xzf FlashDet-linux.tar.gz
cd FlashDet && ./FlashDet# Clone
git clone https://github.com/GauravGoswami/NanoDet-Plus-Lite.git FlashDet
cd FlashDet
# One-command setup — creates venv, installs correct PyTorch + all deps
bash setup_env.sh # auto-detects GPU
# bash setup_env.sh --cpu # force CPU-only
# bash setup_env.sh --cuda 12.4 # force specific CUDA version
# Activate the environment
source venv/bin/activate
# Launch the UI
python ui/main.py # or: ./run_ui.shManual setup (if you prefer)
python3 -m venv venv && source venv/bin/activate
# Install PyTorch (pick ONE):
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu # CPU
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118 # CUDA 11.8
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124 # CUDA 12.x
# Install project dependencies
pip install -r requirements.txt# Train on the included demo dataset
python train.py --epochs 50 --batch-size 16 --device cuda
# Train on a custom dataset (0.5x ultra-lite model)
python train.py \
--model-size m-0.5x \
--input-size 320 \
--epochs 100 \
--batch-size 16 \
--save-dir workspace/my_project \
--device cuda \
--class-file classes/my_classes.txt \
--train-images data/my_dataset/train \
--val-images data/my_dataset/valid
# LoRA fine-tuning (parameter-efficient)
python train.py --lora --lora-rank 8 --epochs 50 --device cuda
# Knowledge Distillation (teacher → student)
python train_kd.py \
--teacher-checkpoint workspace/teacher/best.pth \
--teacher-size m-1.5x \
--model-size m-0.5x \
--kd-temperature 4.0 \
--epochs 100 --device cuda
# Inference
python test.py --model workspace/my_project/checkpoint_best.pth --image photo.jpg
# Export to ONNX
python scripts/convert_pth_to_onnx.py \
--checkpoint workspace/my_project/checkpoint_best.pth \
--output model.onnx --simplify
# INT8 quantization
python scripts/fp16_to_int8_quantize.py --model model.onnx --output model_int8.onnx| Model | Backbone | FPN | Params | FP16 Size | INT8 Size | Input |
|---|---|---|---|---|---|---|
| m-0.5x | ShuffleNetV2 0.5x | 96 | 0.49M | ~1.2 MB | ~0.6 MB | 320 |
| m (default) | ShuffleNetV2 1.0x | 96 | 1.17M | ~2.6 MB | ~1.3 MB | 320 / 416 |
| m-1.5x | ShuffleNetV2 1.5x | 128 | 2.44M | ~5.2 MB | ~2.6 MB | 320 / 416 |
| Model | Input | mAP | CPU (ms) | GPU (ms) | GFLOPs |
|---|---|---|---|---|---|
| m | 320 | 27.0 | 11.97 | 5.25 | 0.9 |
| m | 416 | 30.4 | 19.77 | 8.32 | 1.52 |
| m-1.5x | 320 | 29.9 | 15.90 | 7.21 | 1.75 |
| m-1.5x | 416 | 34.1 | 25.49 | 11.50 | 2.97 |
Input (320x320x3)
|
ShuffleNetV2 Backbone (0.5x / 1.0x / 1.5x)
C2 ──── C3 ──── C4
|
GhostPAN Neck (top-down + bottom-up FPN)
P3 ──── P4 ──── P5
|
FlashDet detection head (based on NanoDet-Plus; per-level classification + regression)
QFL + GIoU + DFL losses
|
NMS + Post-processing
|
Detections [x1, y1, x2, y2, score, class]
| Tab | What it does |
|---|---|
| Annotation | Full-featured image annotation tool. Bounding box, four-point quadrilateral, and polygon drawing. Zoom/pan, brightness/contrast, crosshair, undo, and COCO JSON export. |
| Data Conversion | Convert YOLO / VOC / custom formats to COCO. Validation split slider, class name editor, symlink option, annotation viewer. |
| Training | Configure device (CPU/GPU/multi-GPU), model size (0.5x/1.0x/1.5x), hyperparameters (LR, batch, warmup, patience, grad accum), AMP, resume from checkpoint, COCO pretrained init. |
| LoRA Fine-tune | Dedicated LoRA/QLoRA dashboard. Select variant (Standard, DoRA, LoRA+, AdaLoRA, OrthoLoRA, LoRA-FA), configure rank/alpha/dropout, choose target modules, select base checkpoint, view memory estimates. |
| Distillation | Knowledge Distillation dashboard. Select teacher checkpoint and size, configure temperature, logit/feature loss weights, hard loss weight. Supports LoRA on student and activation checkpointing. |
| Dashboard | Live loss charts (Total, QFL, BBox, DFL) updated per batch. Epoch vs iteration views, LR tracking, live detection preview from workspace. Auto-discovers experiments. |
| Inference | Load .pth or .onnx model. Test on single images, folders, video files, or live webcam. Confidence/NMS sliders, detection summary table, FPS counter, zoom/pan support. |
| Export Model | Export to ONNX with optional graph simplification, dynamic batch size, and input size selection. View exported model metadata. |
| Quantization | Static, Dynamic, QAT, and Hessian-based quantization. Multiple algorithms per type (MinMax, Histogram, Entropy, MSE, Per-Channel, Fisher, etc.). Comparison charts (size, speed, accuracy). Visual Comparison tool runs both original and quantized models side-by-side on the same image. |
| Format | Description |
|---|---|
| YOLO | .txt label files with class cx cy w h (normalised 0–1) |
| Pascal VOC | XML annotations with bounding boxes |
| COCO | JSON annotations (native format — no conversion needed) |
The UI converts YOLO and VOC to COCO format automatically.
| Type | Algorithms | Size Reduction | Speed-up | Accuracy Loss |
|---|---|---|---|---|
| Static | MinMax, Histogram, Entropy, MSE, Per-Channel | ~4x | 3–4x | 1–2% |
| Dynamic | PyTorch Dynamic, ONNX Runtime Dynamic | ~2–4x | 2–3x | 1–2% |
| QAT | Default QAT, Per-Channel QAT | ~4x | 3–4x | < 1% |
| Hessian-based | Layer Sensitivity, Fisher Information | ~4x | 3–4x | < 1% |
The Quantization tab includes a Visual Comparison tool that runs both the original and quantized model on the same images side-by-side, showing match rate, IoU, confidence changes, and latency.
| Target | Format | Notes |
|---|---|---|
| Edge (Raspberry Pi, Jetson) | ONNX INT8 | Use static quantization with calibration data |
| Mobile (Android) | ONNX / NCNN | Convert ONNX to NCNN for best performance |
| Server | ONNX Runtime / TensorRT | FP16 or INT8 with dynamic batching |
| Web | ONNX.js / TF.js | Convert ONNX to TF.js for browser inference |
| Intel | OpenVINO | Direct ONNX import |
FlashDet/
├── config/ # Model, data, training configuration
├── data/demo/ # Included demo dataset (64 train + 16 valid)
├── docs/screenshots/ # UI screenshots for README
├── scripts/
│ ├── prepare_data.py # YOLO → COCO conversion
│ ├── convert_pth_to_onnx.py # Export to ONNX
│ ├── fp16_to_int8_quantize.py# INT8 quantization
│ ├── build_executable.py # PyInstaller packaging
│ ├── take_screenshots.py # Capture UI screenshots
│ ├── build_linux.sh # Linux build script
│ └── build_windows.bat # Windows build script
├── src/
│ ├── models/
│ │ ├── backbone/ # ShuffleNetV2 (0.5x / 1.0x / 1.5x)
│ │ ├── neck/ # GhostPAN feature pyramid
│ │ ├── head/ # FlashDet detection head + aux head
│ │ ├── assignment/ # Dynamic soft label assigner
│ │ ├── lora.py # LoRA / QLoRA implementation
│ │ └── detector.py # FlashDet top-level module
│ ├── losses/
│ │ ├── focal_loss.py # Quality Focal Loss (QFL), Distribution Focal Loss (DFL)
│ │ ├── iou_loss.py # GIoU Loss
│ │ └── kd_loss.py # Knowledge Distillation losses (logit + feature)
│ ├── data/ # Dataset, dataloader, transforms, prepare
│ └── utils/ # Visualization, metrics, checkpoint, box ops
├── ui/
│ ├── main.py # App entry point with sidebar navigation
│ ├── styles.py # Shared colour palette and widget styles
│ ├── helpers.py # Model/class listing helpers
│ ├── tabs/
│ │ ├── data_tab.py # Data Conversion
│ │ ├── training_tab.py # Standard Training
│ │ ├── lora_tab.py # LoRA / QLoRA Fine-tuning
│ │ ├── kd_tab.py # Knowledge Distillation
│ │ ├── dashboard_tab.py # Dashboard
│ │ ├── inference_tab.py # Inference
│ │ ├── export_tab.py # Export Model
│ │ └── quantization_tab.py # Quantization + Visual Comparison
│ └── widgets/ # Shared file-dialog widgets
├── train.py # CLI training entry point (supports --lora, --qlora)
├── train_kd.py # CLI knowledge distillation entry point
├── test.py # CLI inference entry point
├── run_ui.sh # Launch desktop UI
└── requirements.txt # Python dependencies
- Quality Focal Loss (QFL) — joint classification and IoU quality
- Generalised IoU (GIoU) — robust bounding-box regression
- Distribution Focal Loss (DFL) — flexible localisation distribution
- Random horizontal flip
- Multi-scale resize (0.5x – 1.5x)
- Colour jittering (brightness, contrast, saturation)
- Mosaic (4-image combination)
- Cosine annealing LR schedule with warmup
- Gradient clipping and accumulation
- Mixed-precision training (FP16) with
torch.amp - Early stopping with configurable patience
- Multi-GPU via
DataParallel - EMA (Exponential Moving Average) for stable evaluation
FlashDet supports parameter-efficient fine-tuning inspired by torchtune, adapted for convolutional object detection:
| Method | What it does | Memory Savings |
|---|---|---|
| LoRA | Injects low-rank decomposition matrices into Conv2d layers. Only adapters are trained. | ~60–70% fewer trainable params |
| QLoRA | Same as LoRA but additionally quantizes frozen base weights to INT8 or NF4. | ~75–85% memory reduction |
# LoRA fine-tuning (rank=8, alpha=16)
python train.py --lora --lora-rank 8 --lora-alpha 16 --epochs 50 --device cuda
# QLoRA with INT8 quantized base
python train.py --qlora --qlora-dtype int8 --lora-rank 8 --epochs 50 --device cudaOpen the LoRA Fine-tune tab in the sidebar:
- Select a base model (COCO pretrained or custom checkpoint)
- Choose LoRA or QLoRA mode and pick a variant (Standard, DoRA, LoRA+, AdaLoRA, OrthoLoRA, LoRA-FA)
- Set Rank, Alpha, Dropout, and target modules
- Click Start LoRA Fine-tuning
After training, LoRA weights are automatically merged into the base model for zero-overhead inference.
Train a smaller, faster student model by distilling knowledge from a larger, more accurate teacher:
| Component | Loss Function | Purpose |
|---|---|---|
| Logit Distillation | KL-divergence (classification) + Smooth L1 (regression) | Transfer soft predictions |
| Feature Distillation | Normalised L2 on FPN feature maps | Transfer intermediate representations |
| Hard Loss | Standard QFL + GIoU + DFL | Ground-truth supervised learning |
python train_kd.py \
--teacher-checkpoint workspace/teacher/best.pth \
--teacher-size m-1.5x \
--model-size m-0.5x \
--kd-temperature 4.0 \
--kd-logit-weight 1.0 \
--kd-feature-weight 0.5 \
--epochs 100 --device cudaOpen the Distillation tab in the sidebar:
- Browse for the teacher checkpoint (
.pth) and select teacher model size - Choose student model size and input resolution
- Configure temperature, logit weight, feature weight, and hard loss weight
- Click Start Distillation — launches
train_kd.pywith full KD configuration
torch >= 2.0.0
torchvision >= 0.15.0
PyQt5 >= 5.15.0
opencv-python >= 4.5.0
matplotlib >= 3.5.0
numpy >= 1.20.0
Pillow >= 8.0.0
onnx >= 1.10.0
onnxruntime >= 1.15.0
pycocotools >= 2.0.0
Full list in requirements.txt.
FlashDet works for any object detection task where speed and small model size matter:
- Safety & PPE — hardhat, vest, mask detection on construction sites
- Container & Logistics — number plate / code reading on shipping containers
- Retail — product detection, shelf monitoring
- Agriculture — crop and pest identification
- Industrial QA — defect and part-count inspection
- Autonomous vehicles — pedestrian and sign detection at the edge
# Linux
./scripts/build_linux.sh
# Windows (Command Prompt)
scripts\build_windows.batSee docs/BUILD.md for full packaging instructions.
- NanoDet — original implementation by RangiLyu
- torchtune — PyTorch fine-tuning library (LoRA/KD inspiration)
- LoRA: Low-Rank Adaptation — parameter-efficient fine-tuning
- QLoRA — quantized low-rank adaptation
- ShuffleNetV2 — efficient backbone
- GhostNet — ghost modules for lightweight FPN
- Generalised Focal Loss — QFL and DFL
MIT License — see LICENSE for details.
Built by Gaurav Goswami








