Unsupervised object discovery, trajectory modelling, and collision prediction from raw CLEVRER video — no annotated masks, no supervised detectors.
Raw CLEVRER Video (480×320, ~128 frames @ 25 fps)
 │
 ├─ [P1 Perception] DINOv2 slot attention (unsupervised)
 │      → K slot centroids (y, x) + velocities (vy, vx) per frame
 │
 ├─ [P2 Dynamics] NRI-style GNN over slot state sequences
 │      → Multi-step trajectory rollout (T_pred steps)
 │
 └─ [Collision Head] Pairwise MLP over predicted trajectories
        → Collision probability per object pair
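To make the P1 output concrete, here is a minimal sketch of how per-slot centroids and velocities could be derived. It assumes each slot produces a non-negative attention mask over the patch grid, takes the mask-weighted mean coordinate as the centroid, and uses a finite difference between consecutive frames as the velocity. The function names are hypothetical and are not the repository's API; plain Python is used so the sketch runs without the project's dependencies.

```python
from typing import List, Sequence, Tuple

Mask = List[List[float]]          # H x W non-negative attention weights for one slot
Point = Tuple[float, float]       # (y, x) in patch units


def mask_centroid(mask: Mask) -> Point:
    """Attention-weighted mean (y, x) of a single slot mask."""
    total = sum(sum(row) for row in mask)
    if total <= 0.0:
        raise ValueError("mask has no mass")
    cy = sum(y * w for y, row in enumerate(mask) for w in row) / total
    cx = sum(x * w for row in mask for x, w in enumerate(row)) / total
    return cy, cx


def finite_difference_velocities(centroids: Sequence[Point]) -> List[Point]:
    """Per-frame (vy, vx); the first frame reuses the first difference."""
    if len(centroids) < 2:
        return [(0.0, 0.0) for _ in centroids]
    diffs = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(centroids, centroids[1:])]
    return [diffs[0]] + diffs


# Usage: one slot whose mass moves one patch to the right between two frames.
frame_a = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
frame_b = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
track = [mask_centroid(frame_a), mask_centroid(frame_b)]
velocities = finite_difference_velocities(track)
```

Velocities from this sketch are in patches per frame; converting to pixels would require the patch size, which is not stated here.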
Experiment matrix:
| Condition | Perception | States | GNN Rollout MSE | Slot-CV MSE |
|---|---|---|---|---|
| E1 (oracle) | — | GT | 1.567 | 105.87 |
| E2 (P1-Recon) | DINOv2 slots | Slot-derived | 67.71 | 50.19 |
| E2+ (P1-Motion) | DINOv2 slots + motion loss | Slot-derived | 40.08 | 21.24 |
| E3 (P1-JEPA) | DINOv2 slots + JEPA | Slot-derived | 39.64 | 26.91 |
| P1-SAM2 (reference) | SAM2 video seg | Slot-derived | — | 8.25 tracking MSE |
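
The "Slot-CV MSE" column is read here as a constant-velocity baseline (the metrics list names `constant_vel_baseline`); that reading is an assumption. Under it, the baseline extrapolates each object's last observed position with a fixed velocity and is scored against the true future positions. The sketch below uses hypothetical names and averages over frames and both coordinates, which may differ from the repository's exact normalization.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (y, x)


def constant_velocity_rollout(pos: Point, vel: Point, steps: int) -> List[Point]:
    """Extrapolate one object for `steps` frames with a fixed (vy, vx)."""
    return [(pos[0] + vel[0] * t, pos[1] + vel[1] * t) for t in range(1, steps + 1)]


def trajectory_mse(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean squared error over all frames and both coordinates."""
    if len(pred) != len(target) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    squared = [(p[0] - t[0]) ** 2 + (p[1] - t[1]) ** 2 for p, t in zip(pred, target)]
    return sum(squared) / (2 * len(squared))


# Usage: the object decelerates on the last frame, so the baseline overshoots.
baseline = constant_velocity_rollout(pos=(0.0, 0.0), vel=(1.0, 2.0), steps=3)
observed = [(1.0, 2.0), (2.0, 4.0), (3.0, 8.0)]
error = trajectory_mse(baseline, observed)
```

A learned rollout is only useful where it beats this baseline, which is why the two columns are reported side by side.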
CLEVRER dataset layout:
data/
  clevrer/
    videos_train/video_00000-01000/video_00000.mp4 ...
    videos_validation/video_10000-11000/video_10000.mp4 ...
    annotations_train/annotation_00000.json ...
    annotations_validation/annotation_10000.json ...
  clevrer_states/val/state_video_XXXXX.pt   # GT states (centers, velocities)
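
The layout above groups videos into range-named folders. A small helper can map a video index to its path; this is a sketch that assumes every folder covers 1000 consecutive indices (inferred from `video_00000-01000` and `video_10000-11000`) and that annotations sit in a flat folder as shown. The function names are hypothetical.

```python
from pathlib import Path


def clevrer_video_path(root: str, split: str, index: int, bucket: int = 1000) -> Path:
    """Path such as <root>/videos_train/video_00000-01000/video_00000.mp4."""
    lo = (index // bucket) * bucket  # first index of the folder holding `index`
    folder = f"video_{lo:05d}-{lo + bucket:05d}"
    return Path(root) / f"videos_{split}" / folder / f"video_{index:05d}.mp4"


def clevrer_annotation_path(root: str, split: str, index: int) -> Path:
    """Path such as <root>/annotations_train/annotation_00000.json."""
    return Path(root) / f"annotations_{split}" / f"annotation_{index:05d}.json"


# Usage
first_val_video = clevrer_video_path("data/clevrer", "validation", 10000)
```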
Pre-cache DINOv2 features (speeds up P1 training ~5×):
python scripts/precompute_dino_features.py \
--video_root data/clevrer/videos_train --out_dir data/clevrer_dino_cache_train

python scripts/train_slots_clevrer.py --config configs/p1_frame_slots_feature_v3.yaml

Key config options: motion_weight=3.0 (P1-Motion), num_slots=7, encoder_type=dinov2.
Checkpoint → checkpoints/slots_clevrer_p1B_v3/best_slots.pt
python scripts/prepare_p1_slot_states.py \
--p1_ckpt checkpoints/slots_clevrer_p1B_v3/best_slots.pt \
--video_root data/clevrer/videos_validation \
--output_root data/clevrer_p1B_v3_states_val

# Oracle (GT states)
python scripts/train_gnn_dynamics_clevrer.py --config configs/gnn_dynamics_gt.yaml
# Slot-derived states
python scripts/train_p2_gnn_from_p1.py \
--p1_states_root data/clevrer_p1B_v3_states_train \
--out_dir checkpoints/gnn_dynamics_p1B_v3

python scripts/train_collision_head_clevrer.py --config configs/collision_head.yaml

python scripts/evaluate_pipeline.py \
--slots_ckpt checkpoints/slots_clevrer_p1B_v3/best_slots.pt \
--p1_gnn_ckpt checkpoints/gnn_dynamics_p1B_v3/best_gnn_dynamics_p1.pt \
--p1_states_root data/clevrer_p1B_v3_states_val \
--val_videos_root data/clevrer/videos_validation \
--val_states_root data/clevrer_states/val \
--max_videos 200 \
--output outputs/eval_p1B_v3_final.json

Metrics: recon_mse, tracking_traj_mse (Hungarian-matched, patch units), vel_error, gnn_rollout_mse, constant_vel_baseline, collision P/R/F1.
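
Slots carry no fixed identity, so a tracking metric has to match slot tracks to ground-truth objects before measuring error. The sketch below illustrates the idea behind a Hungarian-matched trajectory MSE. To stay dependency-free it searches all assignments by brute force, which is a stand-in for a real Hungarian solver such as `scipy.optimize.linear_sum_assignment` and is only practical for small slot counts like the `num_slots=7` used here. The function names and the per-frame squared-distance cost are assumptions, not the repository's implementation.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Track = Sequence[Tuple[float, float]]  # (y, x) per frame, patch units


def track_cost(a: Track, b: Track) -> float:
    """Squared center distance between two tracks, averaged over frames."""
    return sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for p, q in zip(a, b)) / len(a)


def matched_tracking_mse(slot_tracks: List[Track], gt_tracks: List[Track]):
    """Return (mean matched cost, assignment); assignment[i] is the slot for GT object i."""
    if not gt_tracks or len(gt_tracks) > len(slot_tracks):
        raise ValueError("need at least one GT track and no more GT tracks than slots")
    cost = [[track_cost(s, g) for s in slot_tracks] for g in gt_tracks]
    # Brute-force search over one-to-one assignments (stand-in for the Hungarian algorithm).
    best = min(
        permutations(range(len(slot_tracks)), len(gt_tracks)),
        key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)),
    )
    total = sum(cost[i][j] for i, j in enumerate(best))
    return total / len(gt_tracks), list(best)


# Usage: three slots, two objects; the third slot is an unused background slot.
slots = [[(5, 5), (5, 6)], [(0, 0), (0, 1)], [(9, 9), (9, 9)]]
objects = [[(0, 0), (0, 1)], [(5, 5), (5, 7)]]
mse, assignment = matched_tracking_mse(slots, objects)
```

Unmatched slots are simply ignored in this sketch, so empty or background slots do not affect the score.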
models/
  slots.py - SlotAttentionAutoencoder, SlotAttention, slot_grounding_loss
  video_slots.py - SAViModel, RecurrentSlotAttention, SlotPredictor
  frame_encoder.py - FrameObjectEncoder (DINOv2 + slot attention, P1)
  gnn_dynamics.py - NRIEncoder, NRIDecoder, NRIDynamics
  collision_head.py - CollisionHead, GeometricCollisionBaseline
metrics/
  slots_clevrer.py - Centroid extraction, Hungarian matching, traj/vel MSE
  ari.py - ARI excluding background
  consistency.py - velocity_error, trajectory_mse, rollout_mse
scripts/
  train_slots_clevrer.py - P1 frame slot training
  train_video_slots_clevrer.py - SAVi video slot training
  train_gnn_dynamics_clevrer.py - GNN dynamics (GT states)
  train_p2_gnn_from_p1.py - GNN dynamics (P1 slot states)
  train_collision_head_clevrer.py - Collision head training
  prepare_p1_slot_states.py - Extract P1 states for GNN training
  prepare_p1_sam2_states.py - SAM2 video segmentation baseline
  precompute_dino_features.py - Cache DINOv2 features to disk
  evaluate_pipeline.py - End-to-end evaluation
  eval_sam2_tracking.py - SAM2 tracking MSE evaluation
  visualize_example.py - Single-video slot/trajectory visualization
configs/
  p1_frame_slots_feature_v3.yaml - P1-Motion (motion-weighted, best)
  p1_frame_slots_feature_sc.yaml - P1-SC (InfoNCE contrastive)
  gnn_dynamics_gt.yaml - GNN on GT states (E1 oracle)
  gnn_dynamics_slots.yaml - GNN on slot states (E2/E3)
  collision_head.yaml - Collision head
data/
  clevrer_config.py - Resolution constants (CLEVRER_H=320, CROP_H=CROP_W=320)
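
The listing names a `GeometricCollisionBaseline` next to the learned `CollisionHead`, but its logic is not described. One plausible form, sketched here purely as an assumption, flags a pair as colliding when the two predicted centers come within a distance threshold at any shared frame, and then scores the flagged pairs with precision, recall, and F1. The threshold value and all names below are hypothetical.

```python
from itertools import combinations
from math import hypot
from typing import Dict, List, Sequence, Set, Tuple

Track = Sequence[Tuple[float, float]]  # (y, x) per frame
Pair = Tuple[int, int]                 # object indices (i, j) with i < j


def geometric_collisions(tracks: List[Track], radius: float) -> Set[Pair]:
    """Pairs whose centers come within `radius` of each other at some shared frame."""
    hits: Set[Pair] = set()
    for (i, a), (j, b) in combinations(enumerate(tracks), 2):
        if any(hypot(p[0] - q[0], p[1] - q[1]) <= radius for p, q in zip(a, b)):
            hits.add((i, j))
    return hits


def precision_recall_f1(pred: Set[Pair], truth: Set[Pair]) -> Dict[str, float]:
    """Set-based P/R/F1 over colliding pairs; empty denominators give 0.0."""
    tp = len(pred & truth)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Usage: objects 0 and 1 approach each other; object 2 stays far away.
tracks = [
    [(0, 0), (0, 1), (0, 2)],
    [(0, 4), (0, 3), (0, 2.5)],
    [(10, 10), (10, 10), (10, 10)],
]
predicted = geometric_collisions(tracks, radius=1.0)
scores = precision_recall_f1(predicted, truth={(0, 1), (1, 2)})
```

A baseline of this kind is useful as a sanity check: a learned collision head that cannot beat a distance threshold on the same trajectories is not adding information.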
pip install torch torchvision numpy scipy pyyaml pillow opencv-python
DINOv2: downloaded automatically via torch.hub on first run.
SAM2: pip install git+https://github.com/facebookresearch/sam2.git