Add full SAM2 (Segment Anything Model 2) support to sam3.cpp, enabling both SAM2 and SAM3 models to be loaded and used through the same API. The model type is auto-detected from the binary file magic number.

New components:
- convert_sam2_to_ggml.py: weight conversion for all SAM2/2.1 variants (Tiny, Small, Base+, Large) with auto-detection and YAML config support
- Hiera backbone: PatchEmbed, windowed multi-scale attention with Q-pooling, bicubic + tiled positional embedding, 4-stage feature pyramid
- FPN neck: lateral 1x1 convs, nearest-neighbor top-down fusion, scalp
- SAM2 image preprocessing with ImageNet normalization
- 5-token decoder layout for older SAM2 (pred_obj_scores=False)
- Multimask output selection during video tracking propagation
- fixed_no_obj_ptr object pointer blending by presence score

Modified for dual-model support:
- sam3_load_model() dispatches on magic (0x73616D32 vs 0x73616D33)
- All shared functions parameterized: feat_size(), sigmoid_scale/bias, mask interpolation target, spatial dimensions from tensor shapes
- sam3_segment_pcs/sam3_create_tracker guarded for SAM2 models
- Examples (image/video) adapt UI at runtime based on model type
- Quantize tool handles both SAM2 and SAM3 headers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
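The magic-number dispatch described above can be sketched in a few lines. This is a hedged pure-Python illustration, not the actual sam3.cpp loader: the function name `detect_model_type` and the little-endian assumption are hypothetical; only the two magic values (0x73616D32 / 0x73616D33, ASCII "sam2" / "sam3") come from the commit.

```python
import struct

SAM2_MAGIC = 0x73616D32  # ASCII "sam2"
SAM3_MAGIC = 0x73616D33  # ASCII "sam3"

def detect_model_type(data: bytes) -> str:
    """Return 'sam2' or 'sam3' based on the leading 4-byte magic.

    Hypothetical sketch: the real header layout and endianness
    live in sam3_load_model() in sam3.cpp.
    """
    (magic,) = struct.unpack("<I", data[:4])
    if magic == SAM2_MAGIC:
        return "sam2"
    if magic == SAM3_MAGIC:
        return "sam3"
    raise ValueError(f"unknown magic 0x{magic:08x}")
```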
…cture

- Fix window_size "lags by a block" — Q-stride blocks now use the PREVIOUS stage's window_spec, matching Python's Hiera.__init__
- Fix window unpartition pad_hw computation for Q-stride blocks: recompute from target spatial dims instead of dividing partition pads
- Add SAM2 dispatch in sam3_encode_image_from_preprocessed()
- Add SAM2_DUMP_DIR environment variable for intermediate tensor dumps
- Add test_sam2_backbone_compare.cpp and dump_sam2_reference.py
- Add compare_tensors.py for numerical comparison

Status: PatchEmbed output matches Python (cos=0.9999). Positional embedding computation diverges (cos=0.005) — root cause identified, fix pending.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
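The pad_hw fix can be illustrated with plain integer arithmetic. A minimal sketch, not the ggml code: dims and window size below are hypothetical, chosen only to show why dividing the partition pads by the pool stride can yield a padded size that is no longer divisible by the window.

```python
def pad_hw(h: int, w: int, window: int) -> tuple:
    """Padded spatial dims so both h and w are divisible by window."""
    pad_h = (window - h % window) % window
    pad_w = (window - w % window) % window
    return h + pad_h, w + pad_w

# Hypothetical pre-pool dims and window size:
h, window = 50, 8
ph, _ = pad_hw(h, h, window)     # 56: partition padding at full resolution
th = h // 2                      # 25: target dim after 2x Q-stride pooling
tph, _ = pad_hw(th, th, window)  # 32: recomputed from the TARGET dims

# The buggy approach - halving the partition-padded dim - gives 28,
# which is not divisible by the window size:
assert (ph // 2) % window != 0
assert tph % window == 0
```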
Two bugs fixed in the Hiera backbone:

1. The bicubic interpolation kernel used the wrong coefficient (a=-0.5 instead of a=-0.75), making the positional embedding 6x too large in magnitude. Fixed to use the Keys cubic kernel matching PyTorch's F.interpolate.
2. The multi-head attention reshape incorrectly merged heads into the batch dimension. It now follows the exact SAM3 ViT pattern: reshape(HD,NH,N,B) → permute(0,2,1,3) → cont → reshape(HD,N,NH*B) → reshape_4d(HD,N,NH,B) for flash_attn_ext.

Also fixed: the PE weight tensor registration shape (was [E,W,H,1], should be [W,H,E,1] to match the conversion script's reversed dims).

Status: Hiera backbone blocks 0-23 now match the Python reference (cos≥0.999999 with f32 weights). The FPN neck still diverges — next to debug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
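For reference, the Keys cubic family referred to in fix 1 can be written down directly. A sketch, assuming the standard two-piece Keys formulation: a=-0.75 is the value PyTorch's `F.interpolate(mode='bicubic')` (and OpenCV) uses, while a=-0.5 gives the Catmull-Rom variant; the taps always sum to 1, so the wrong `a` changes the weight distribution rather than the overall scale.

```python
def keys_cubic(x: float, a: float = -0.75) -> float:
    """Keys cubic convolution kernel with free parameter a."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0

def taps(t: float, a: float = -0.75):
    """The 4 interpolation weights for a fractional offset t in [0, 1)."""
    return [keys_cubic(t + 1.0, a), keys_cubic(t, a),
            keys_cubic(1.0 - t, a), keys_cubic(2.0 - t, a)]
```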
FPN lateral convolutions now use ggml_conv_2d_sk_p0 (matching the SAM3 neck pattern) with proper bias broadcasting via ggml_repeat. Added debug tensor dumps for the FPN laterals.

Verified: the entire Hiera backbone + FPN neck pipeline matches the Python reference with cos≥0.99997 at all stages (f32 weights on cat.jpg). The previous comparison failure was a layout mismatch (NCHW vs CWHB) in the comparison script, not a computation error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
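The nearest-neighbor top-down fusion mentioned above is simple enough to sketch on a single-channel feature map. This is an illustration of the operation only, with hypothetical helper names; the real code runs on ggml tensors after the 1x1 lateral convolutions.

```python
def upsample_nearest_2x(fm):
    """Nearest-neighbor 2x upsample of a 2D feature map (list of rows)."""
    return [[v for v in row for _ in (0, 1)] for row in fm for _ in (0, 1)]

def fuse_top_down(top, lateral):
    """One FPN top-down step: upsample the coarser level, add the lateral."""
    up = upsample_nearest_2x(top)
    return [[u + l for u, l in zip(ur, lr)] for ur, lr in zip(up, lateral)]
```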
Add dump_sam2_pvs_reference.py and test_sam2_pvs_compare.cpp for comparing SAM2 PVS decoder output against the Python reference.

Results with SAM2.1 Base+ f32 on cat.jpg, center point (600, 599):
- Best mask (IoU=0.994): 99.9% binary agreement with Python
- Mask logits cosine similarity: 0.97-0.98 across all 3 multimask outputs
- IoU predictions match closely: [0.994, 0.397, 0.977] vs [0.994, 0.469, 0.986]

Also add SAM2_DUMP_DIR support in sam3_segment_pvs() for intermediate tensor dumping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
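The metrics reported here and in the later e2e tests (cosine similarity on logits, binary agreement, binary IoU) are standard; a minimal pure-Python sketch of how such numbers are computed, not the actual comparison script:

```python
import math

def cosine(a, b):
    """Cosine similarity between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def binary_metrics(logits_a, logits_b, thresh=0.0):
    """Binarize two mask logit vectors at `thresh`; return
    (fraction of pixels that agree, IoU of the binary masks)."""
    ma = [x > thresh for x in logits_a]
    mb = [x > thresh for x in logits_b]
    agree = sum(x == y for x, y in zip(ma, mb)) / len(ma)
    inter = sum(x and y for x, y in zip(ma, mb))
    union = sum(x or y for x, y in zip(ma, mb))
    return agree, (inter / union if union else 1.0)
```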
Add test_sam2_e2e_all_variants.py, which runs Python reference and C++ ggml inference for all SAM2.1 variants (Tiny, Small, Base+, Large) on the same image with the same point prompt, then compares binary masks and IoU scores.

Results on cat.jpg with center point (600, 599):

Variant     Best Mask IoU  Logit Cos  Binary IoU  Agreement
tiny        0.991          0.969      0.996       99.8%
small       0.991          0.970      0.997       99.8%
base_plus   0.994          0.969      0.997       99.8%
large       0.993          0.979      0.997       99.8%

ALL PASS: best masks match Python with >=99.6% binary IoU.

Also add sam3_state_set_orig_dims() to fix coordinate normalization when using sam3_encode_image_from_preprocessed() with non-square original images.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add SAM2.0 support alongside SAM2.1 in the e2e test. Fix the tensor loader to allow up to 5 optional missing tensors for SAM2.0 compat (no_obj_embed_spatial and obj_ptr_tpos_proj are not present in SAM2.0).

Results on cat.jpg with center point (600, 599):

sam2_tiny        0.988  binary_IoU=0.997  agree=99.8% ✓
sam2_small       0.993  binary_IoU=0.998  agree=99.9% ✓
sam2_base_plus   0.994  binary_IoU=0.997  agree=99.8% ✓
sam2_large       0.995  binary_IoU=0.982  agree=98.9% ✓
sam2.1_tiny      0.991  binary_IoU=0.996  agree=99.8% ✓
sam2.1_small     0.991  binary_IoU=0.997  agree=99.8% ✓
sam2.1_base_plus 0.994  binary_IoU=0.997  agree=99.8% ✓
sam2.1_large     0.993  binary_IoU=0.997  agree=99.8% ✓

ALL PASS: best masks match the Python reference across all variants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the fragile "tolerate missing tensors" hack with a clean solution: add an `is_sam2_1` flag to the binary header so the C++ loader knows exactly which tensors to register.

- convert_sam2_to_ggml.py: auto-detect SAM2.0 vs 2.1 from checkpoint tensor names (no_obj_embed_spatial presence) and write the is_sam2_1 flag
- sam3.cpp: read is_sam2_1 from the header; conditionally register the no_obj_embed_spatial and obj_ptr_tpos_proj tensors only for SAM2.1
- sam3.cpp: restore the strict tensor count check — no tolerance
- sam3.cpp: guard runtime no_obj_embed_spatial usage with a null check
- quantize.cpp: update the SAM2 header field count (56 → 57)

Tensor counts are now exact:
- SAM2.0 Tiny: registered 468 = loaded 468
- SAM2.1 Tiny: registered 471 = loaded 471

All 8 variants still pass end-to-end (test_sam2_e2e_all_variants.py).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
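The checkpoint-side auto-detection amounts to a presence check on tensor names. A sketch of the idea from the commit, with a hypothetical helper name; the real logic lives in convert_sam2_to_ggml.py:

```python
def is_sam2_1(state_dict_keys) -> bool:
    """True for SAM2.1 checkpoints, which add no_obj_embed_spatial
    (and obj_ptr_tpos_proj); both are absent from SAM2.0."""
    return any("no_obj_embed_spatial" in k for k in state_dict_keys)
```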
Added extensive debugging infrastructure to trace the TwoWay decoder:
- Initial tokens dump (verified cos=1.0 vs Python)
- Q projection dump (verified cos=1.0 vs Python)
- SA norm output dump (cos=0.16 — BUG identified)
- Manual numpy SDPA computation confirms the Python output (cos=1.0)

Root cause narrowed down: sam3_sam_attention produces the correct Q projection but the wrong attention output. The bug is in the ggml graph execution of the attention, not in the data preparation or the weights — possibly a graph allocator buffer reuse issue affecting the decoder.

Next step: isolate the self-attention into a separate sub-graph to test the graph allocator hypothesis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
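A reference SDPA like the "manual numpy computation" above is a useful independent check against any graph framework. This pure-Python sketch (no numpy, hypothetical function names) implements softmax(QK^T/sqrt(d))V directly:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sdpa(q, k, v):
    """Reference scaled dot-product attention.
    q, k, v are lists of row vectors (seq_len x head_dim)."""
    d = len(q[0])
    out = []
    for qr in q:
        scores = [sum(a * b for a, b in zip(qr, kr)) / math.sqrt(d)
                  for kr in k]
        w = softmax(scores)  # one attention weight per key
        out.append([sum(wi * vr[j] for wi, vr in zip(w, v))
                    for j in range(len(v[0]))])
    return out
```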
sam3_populate_pe_cache() used n_img_embd(), which returns img_size/patch_size = 1024/14 = 73 (SAM3 ViT convention). For SAM2, the correct spatial size is feat_size() = 64 (Hiera backbone with stride 4 plus 3 pooling stages). This caused the dense positional encoding grid to be computed at 73x73 instead of 64x64, producing garbage values in the SAM mask decoder.

The fix: use feat_size(), which returns the correct value for both SAM2 (64) and SAM3 (72).

Before fix (llama.jpg, point 320,240): IoU = [0.349, 0.007, 0.260] (garbage)
After fix:                             IoU = [0.970, 0.010, 0.923] (matches Python exactly)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mbd)

Catches the bug where sam3_populate_pe_cache() used n_img_embd() = 73 (SAM3 ViT: 1024/14) instead of feat_size() = 64 (SAM2 Hiera) for the dense positional encoding grid. A wrong-sized PE grid produces garbage masks from the SAM decoder.

The test verifies:
1. The clicked center point is inside the predicted foreground mask
2. The IoU score is above 0.1 (garbage PE gives near-zero IoU)
3. The foreground coverage is non-degenerate (0.05-99.5%)

Tested on all SAM2 variants (Tiny, Small, Base+, Large) × llama.jpg.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
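The three checks above are cheap to express directly. A hedged sketch of such a regression check (helper name and mask representation are hypothetical; the real test is C++ against the loaded model):

```python
def mask_sanity_checks(mask, iou, point):
    """Return the list of failed checks for a binary mask given as a
    list of rows of bools, an IoU score, and a clicked (x, y) point."""
    h, w = len(mask), len(mask[0])
    px, py = point
    failures = []
    if not mask[py][px]:
        failures.append("clicked point not inside foreground")
    if iou <= 0.1:
        failures.append("IoU too low (garbage PE gives near-zero IoU)")
    fg = sum(1 for row in mask for v in row if v)
    coverage = 100.0 * fg / (h * w)
    if not (0.05 <= coverage <= 99.5):
        failures.append("degenerate foreground coverage")
    return failures
```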
The SAM2 Hiera backbone uses head_dim=56 (embed_dim/num_heads = 112/2), which was not supported by the Metal flash_attn_ext kernel. SAM decoder cross-attention uses head_dim=16, which had shader code but was missing from the supports_op whitelist.

Both now work on the Metal GPU:
- Backbone encoding: 659ms (Metal) vs 4000ms (CPU) = 6x speedup
- PVS decoder: 96ms on Metal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…clicks

ImGui::Image() sets WantCaptureMouse=1, preventing the SDL mouse event handler from receiving clicks on the video canvas. Replace it with InvisibleButton + DrawList::AddImage, matching the pattern used in main_image.cpp, where canvas interaction works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The InvisibleButton also claims WantCaptureMouse, blocking the SDL event loop from seeing clicks. Use ImGui::IsMouseClicked/IsItemHovered after the InvisibleButton instead — matching the working main_image.cpp pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add an interactive timeline below the canvas showing frame position and per-instance presence bands, with click-to-seek/scrub
- Auto-create the tracker and encode the first frame on launch (remove the manual "Start tracking" button)
- Points mode now auto-triggers instance creation on each click, matching box mode behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Details
New files:
- convert_sam2_to_ggml.py — weight conversion for all SAM2/2.1 variants (Tiny/Small/Base+/Large)

Core changes (sam3.cpp, sam3.h):
- sam3_model_type enum + sam3_get_model_type() API
- sam2_build_hiera_graph() with window partition, Q-pooling, bicubic PE
- sam2_build_fpn_neck_graph() with nearest top-down + scalp
- sam2_encode_image_hiera() full pipeline with ImageNet preprocessing
- Shared functions parameterized (feat_size(), sigmoid_scale/bias, mask interpolation)
- fixed_no_obj_ptr blending

Examples:
- main_image.cpp — PCS mode disabled for SAM2, PVS works
- main_video.cpp — dispatches sam3_create_visual_tracker/sam3_propagate_frame for SAM2
- quantize.cpp — handles both SAM2 and SAM3 headers

Status
Backbone runs end-to-end and produces output. Segmentation masks are not yet numerically correct — debugging the forward pass is the next step.
Test plan
- sam3_image launches with SAM2 model, encodes image, runs PVS

🤖 Generated with Claude Code