VirPro introduces a new adaptive multimodal pretraining paradigm that enriches weak supervision with visually-referred probabilistic prompts, improving existing weakly-supervised monocular 3D detection frameworks such as WeakM3D and GGA and achieving up to +4.8% AP improvement on KITTI.
Multiple learnable prompts are assigned to each object instance by embedding class names into natural-language templates, enabling robust contextual representation learning.
Prompt embeddings are enriched with visual cues and parameterized as multivariate Gaussian distributions, whose means encode canonical semantics while their variances model visual uncertainty.
Our paradigm adopts a two-stage training pipeline. In the first stage, as shown in the following figure, we introduce an Adaptive Prompt Bank (APB) to generate diverse, instance-specific prompts. We further propose Multi-Gaussian Prompt Modeling (MGPM), which injects visual cues into the textual embeddings and represents each prompt as a multivariate Gaussian distribution. A unified prompt embedding is then sampled and normalized for each instance, followed by RoI-level Contrastive Matching, which aligns monocular 3D object embeddings with their corresponding textual prompt embeddings. In the second stage, we adopt the Dual-to-One Distillation (D2OD) strategy from CAW3D to transfer the learned scene-aware priors into the monocular encoder.
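The sampling and matching steps above can be sketched as follows. This is a minimal NumPy illustration of the idea, not the actual VirPro code: the embedding dimension, the number of prompts per instance, the averaging into a unified embedding, and the temperature value are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 16, 4  # embedding dim and prompts per instance (assumed values)

# Each prompt is a Gaussian with diagonal covariance: the mean encodes
# canonical semantics, the variance models visual uncertainty.
means = rng.normal(size=(K, D))
log_vars = rng.normal(scale=0.1, size=(K, D))

# Reparameterized draw per prompt, then fuse into one unified prompt
# embedding and L2-normalize it (fusion by averaging is an assumption).
eps = rng.normal(size=(K, D))
samples = means + np.exp(0.5 * log_vars) * eps
prompt = samples.mean(axis=0)
prompt /= np.linalg.norm(prompt)

# RoI-level contrastive matching: cosine similarity between normalized
# RoI object embeddings and the prompt embedding, scaled by a temperature.
rois = rng.normal(size=(8, D))
rois /= np.linalg.norm(rois, axis=1, keepdims=True)
logits = rois @ prompt / 0.07  # temperature tau = 0.07 (assumed)
```

In the full method these logits would feed a contrastive (InfoNCE-style) loss over matched and unmatched RoI/prompt pairs.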

AP40 IoU=0.5
| Method | Easy (AP_BEV / AP_3D) | Moderate (AP_BEV / AP_3D) | Hard (AP_BEV / AP_3D) |
|---|---|---|---|
| *Without 2D GT Annotation* | | | |
| WeakM3D | 58.20 / 50.16 | 38.02 / 29.94 | 30.17 / 23.11 |
| VirPro+WeakM3D | 55.09 / 50.97 | 38.76 / 31.95 | 31.12 / 24.27 |
| *With 2D GT Annotation* | | | |
| GGA+PGD | 57.20 / 51.48 | 40.11 / 35.73 | 34.96 / 30.49 |
| VirPro+GGA+PGD | 60.11 / 54.72 | 42.95 / 39.49 | 37.50 / 33.32 |
AP40 IoU=0.7
| Method | Easy (AP_BEV / AP_3D) | Moderate (AP_BEV / AP_3D) | Hard (AP_BEV / AP_3D) |
|---|---|---|---|
| *Without 2D GT Annotation* | | | |
| WeakM3D | 11.82 / 5.03 | 5.66 / 2.26 | 4.08 / 1.63 |
| VirPro+WeakM3D | 12.23 / 5.41 | 5.92 / 2.52 | 4.33 / 1.81 |
| *With 2D GT Annotation* | | | |
| GGA+PGD | 14.87 / 7.09 | 9.26 / 4.27 | 7.09 / 3.26 |
| VirPro+GGA+PGD | 15.59 / 7.95 | 9.58 / 4.96 | 7.29 / 3.64 |
- Environment setup
- Data preparation
  - Stage 1 requires:
    - KITTI RAW
    - 2D RoI Label
  - Stage 2 requires:
    - KITTI Object 3D
    - GGA Pseudo Labels
- Run Stage 1: VirPro pretraining
- Run Stage 2: GGA+PGD training
- Run test
GGA is a weakly supervised point encoder that outputs 3D bounding boxes. PGD is a fully supervised monocular 3D encoder. In the GGA+PGD training pipeline, the 3D boxes predicted by GGA are used as pseudo labels to replace the ground-truth annotations required by PGD. This project integrates the VirPro pretraining paradigm into the GGA+PGD framework to further enhance weakly supervised monocular 3D detection performance.
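To make the substitution concrete, the sketch below formats one GGA-predicted 3D box as a KITTI-style label line, the format PGD's data pipeline expects in place of ground truth. This is a hypothetical helper for illustration, not the conversion code in this repository; the function name and the fixed zero truncation/occlusion fields are assumptions.

```python
import math

def gga_box_to_kitti_line(cls, bbox2d, dims, loc, ry):
    """Format one pseudo label as a KITTI label_2 text line.

    cls: class name; bbox2d: (x1, y1, x2, y2) in pixels;
    dims: (h, w, l) in meters; loc: (x, y, z) in camera coordinates;
    ry: rotation around the camera y axis.
    """
    x, y, z = loc
    # KITTI's observation angle alpha is derived from the yaw and the
    # viewing direction toward the object.
    alpha = ry - math.atan2(x, z)
    # Fields: type, truncation, occlusion, alpha, 2D bbox, dims, location, ry.
    # Truncation/occlusion are set to 0 here since GGA does not predict them.
    fields = [cls, 0, 0, alpha, *bbox2d, *dims, x, y, z, ry]
    return " ".join(f"{v:.2f}" if isinstance(v, float) else str(v)
                    for v in fields)

line = gga_box_to_kitti_line("Car", (100.0, 120.0, 300.0, 260.0),
                             (1.5, 1.6, 3.9), (2.0, 1.5, 20.0), 0.1)
```

Writing one such line per pseudo box into a per-frame `.txt` file yields labels that PGD can consume exactly as it would ground-truth annotations.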
```shell
conda create --name virpro python=3.8 -y
conda activate virpro
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 -c pytorch
pip install openmim
mim install mmcv-full==1.4.0
mim install mmdet==3.3.0
mim install mmsegmentation==0.14.1
pip install -e .
```

Dataset Structure Example
```
data
└── kitti
    ├── ImageSets
    │   ├── test.txt
    │   ├── train.txt
    │   └── val.txt
    ├── training
    │   ├── calib
    │   ├── image_2
    │   ├── velodyne
    │   └── predicted_2d_bbox
    ├── testing
    │   ├── calib
    │   ├── image_2
    │   ├── velodyne
    │   └── label
    ├── kitti_infos_train_GGA_pseudo.pkl
    ├── kitti_infos_train_GGA_pseudo_mono3d.coco.json
    ├── kitti_infos_val_GGA_pseudo.pkl
    └── kitti_infos_val_GGA_pseudo_mono3d.coco.json
```
KITTI Raw

```shell
wget -i ./kitti_archives_to_download.txt -P kitti_data/
cd kitti_data
unzip "*.zip"
cd ..
ln -s kitti_data ./data/kitti/kitti_raw
```

2D RoI Labels
WeakM3D provides both 2D bounding boxes and the corresponding RoI LiDAR points. For each sample in the dataset, the provided KITTI_RAW pseudo label stores these two modalities in a single .npz file. In our case, we only use the 2D bounding boxes as our 2D RoI labels. Please download and unpack this folder, then rename it to `predicted_2d_bbox`.
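The sketch below shows how such an .npz file can be read while keeping only the 2D boxes. It builds an in-memory stand-in file for self-containment; the key names (`bbox2d`, `lidar_points`) are assumptions, so check the actual WeakM3D files for the real keys.

```python
import io
import numpy as np

# Build an in-memory stand-in for one WeakM3D pseudo-label .npz file.
# Key names here are illustrative assumptions, not WeakM3D's actual keys.
buf = io.BytesIO()
np.savez(buf,
         bbox2d=np.array([[100.0, 120.0, 300.0, 260.0]]),  # one (x1, y1, x2, y2) box
         lidar_points=np.zeros((500, 3)))                  # RoI LiDAR points
buf.seek(0)

data = np.load(buf)
boxes_2d = data["bbox2d"]  # VirPro only consumes the 2D RoI boxes
# data["lidar_points"] is present in the file but unused in this pipeline.
```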
KITTI Object 3D
Download from: https://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d
GGA Pseudo Labels
You may generate pseudo labels following the procedures provided in the original GGA project. Our repository also includes pre-generated pseudo labels (data/kitti/*.pkl files), which can be used directly.
```shell
CUDA_VISIBLE_DEVICES=0 python scripts/pretrain_ppl_multi.py --config ./config/resnet34_backbone.yaml
```

Then use `utils/ckp_pretrain_to_train.py` to convert a Stage 1 VirPro checkpoint into a backbone-only training checkpoint:

- `input_checkpoint.pth`: the checkpoint obtained after Stage 1 pretraining
- `output_checkpoint.pth`: the converted checkpoint for Stage 2 training

```shell
python utils/ckp_pretrain_to_train.py [input_checkpoint.pth] [output_checkpoint.pth]
```

Before running Stage 2, on line 190 of `configs_train/gga/gga_pdg.py`, remember to specify the converted checkpoint as:

```python
distill.teacher_ckpt = "path/to/output_checkpoint.pth"
```

```shell
./tools/dist_train.sh configs_train/gga/gga_pdg.py 1
```

Use `utils/ckp_train_to_test.py` to convert a Stage 2 checkpoint into a test checkpoint by removing `teacher.*` weights and stripping the `student.` prefix, then saving the cleaned model:

```shell
python utils/ckp_train_to_test.py [input_checkpoint.pth] [output_checkpoint.pth]
./tools/dist_test.sh configs_train/gga/gga_pdg.py 1
```
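The key filtering that the test-checkpoint conversion is described to perform can be sketched on a plain dict (the real script would load and save with torch; this is an illustration of the described behavior, not the script itself):

```python
def strip_for_test(state_dict):
    """Drop teacher.* weights and strip the student. prefix,
    mirroring what utils/ckp_train_to_test.py is described to do."""
    cleaned = {}
    for key, value in state_dict.items():
        if key.startswith("teacher."):
            continue                       # teacher branch is discarded
        if key.startswith("student."):
            key = key[len("student."):]    # student becomes the plain model
        cleaned[key] = value
    return cleaned

ckpt = {"teacher.backbone.w": 1, "student.backbone.w": 2, "student.head.b": 3}
test_ckpt = strip_for_test(ckpt)  # {"backbone.w": 2, "head.b": 3}
```

After this cleaning, the remaining keys match the plain detector's state dict, so the checkpoint loads directly for evaluation.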