
feat: add SAM2 Hiera backbone support #1

Merged
PABannier merged 15 commits into main from sam2
Mar 31, 2026
Conversation


@PABannier PABannier commented Mar 30, 2026

Summary

  • Add full SAM2 (Segment Anything Model 2) support alongside SAM3, auto-detected from the binary file magic number at load time
  • Implement the Hiera backbone (PatchEmbed, windowed multi-scale attention with Q-pooling, 4-stage FPN neck) and parameterize the shared code paths for dual-model inference
  • Adapt GUI examples and quantize tool to work with both SAM2 and SAM3 models at runtime

Details

New files:

  • convert_sam2_to_ggml.py — weight conversion for all SAM2/2.1 variants (Tiny/Small/Base+/Large)

Core changes (sam3.cpp, sam3.h):

  • sam3_model_type enum + sam3_get_model_type() API
  • Hiera backbone graph: sam2_build_hiera_graph() with window partition, Q-pooling, bicubic PE
  • FPN neck: sam2_build_fpn_neck_graph() with nearest top-down + scalp
  • sam2_encode_image_hiera() full pipeline with ImageNet preprocessing
  • All shared functions parameterized (feat_size(), sigmoid_scale/bias, mask interpolation)
  • 5-token decoder layout, multimask tracking, fixed_no_obj_ptr blending

Examples:

  • main_image.cpp — PCS mode disabled for SAM2, PVS works
  • main_video.cpp — dispatches sam3_create_visual_tracker/sam3_propagate_frame for SAM2
  • quantize.cpp — handles both SAM2 and SAM3 headers

Status

Backbone runs end-to-end and produces output. Segmentation masks are not yet numerically correct — debugging the forward pass is the next step.

Test plan

  • Compiles cleanly on macOS (Metal + CPU)
  • All 8 SAM2/2.1 checkpoints convert and quantize (F16, Q4_0, Q4_1, Q8_0)
  • sam3_image launches with SAM2 model, encodes image, runs PVS
  • SAM3 models still load and work (no regressions)
  • Numerical validation against Python SAM2 reference
  • Video tracking end-to-end test

🤖 Generated with Claude Code

PABannier and others added 10 commits March 30, 2026 22:06
Add full SAM2 (Segment Anything Model 2) support to sam3.cpp, enabling
both SAM2 and SAM3 models to be loaded and used through the same API.
The model type is auto-detected from the binary file magic number.

New components:
- convert_sam2_to_ggml.py: weight conversion for all SAM2/2.1 variants
  (Tiny, Small, Base+, Large) with auto-detection and YAML config support
- Hiera backbone: PatchEmbed, windowed multi-scale attention with Q-pooling,
  bicubic+tiled positional embedding, 4-stage feature pyramid
- FPN neck: lateral 1x1 convs, nearest-neighbor top-down fusion, scalp
- SAM2 image preprocessing with ImageNet normalization
- 5-token decoder layout for older SAM2 (pred_obj_scores=False)
- Multimask output selection during video tracking propagation
- fixed_no_obj_ptr object pointer blending by presence score
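The ImageNet preprocessing step above can be sketched as follows. This is a minimal illustration, not the actual sam3.cpp code path; the 1024×1024 input size and the standard ImageNet statistics are assumptions based on SAM2's published defaults.

```python
import numpy as np

# Standard ImageNet normalization constants (assumption: SAM2 uses these
# unchanged, applied after scaling pixel values to [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img_u8: np.ndarray) -> np.ndarray:
    """img_u8: (H, W, 3) uint8 RGB, already resized (e.g. to 1024x1024)."""
    x = img_u8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD
```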

Modified for dual-model support:
- sam3_load_model() dispatches on magic (0x73616D32 vs 0x73616D33)
- All shared functions parameterized: feat_size(), sigmoid_scale/bias,
  mask interpolation target, spatial dimensions from tensor shapes
- sam3_segment_pcs/sam3_create_tracker guarded for SAM2 models
- Examples (image/video) adapt UI at runtime based on model type
- Quantize tool handles both SAM2 and SAM3 headers
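The magic-based dispatch can be sketched like this, using the two magic values named above. The byte order of the on-disk header is an assumption here (little-endian, as most of these conversion scripts emit), and the function name is illustrative, not the real API.

```python
import struct

SAM2_MAGIC = 0x73616D32  # ASCII "sam2" read as a big-endian word
SAM3_MAGIC = 0x73616D33  # ASCII "sam3"

def detect_model_type(blob: bytes) -> str:
    """Peek at the leading uint32 magic of a model file and dispatch.
    Assumption: the header word is stored little-endian."""
    (magic,) = struct.unpack_from("<I", blob)
    if magic == SAM2_MAGIC:
        return "sam2"
    if magic == SAM3_MAGIC:
        return "sam3"
    raise ValueError(f"unknown magic: {magic:#x}")
```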

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cture

- Fix window_size "lags by a block" — Q-stride blocks now use the
  PREVIOUS stage's window_spec, matching Python's Hiera.__init__
- Fix window unpartition pad_hw computation for Q-stride blocks:
  recompute from target spatial dims instead of dividing partition pads
- Add SAM2 dispatch in sam3_encode_image_from_preprocessed()
- Add SAM2_DUMP_DIR environment variable for intermediate tensor dumps
- Add test_sam2_backbone_compare.cpp and dump_sam2_reference.py
- Add compare_tensors.py for numerical comparison

Status: PatchEmbed output matches Python (cos=0.9999). Positional
embedding computation diverges (cos=0.005) — root cause identified,
fix pending.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs fixed in the Hiera backbone:

1. Bicubic interpolation kernel used wrong coefficients (a=-0.5
   instead of a=-0.75), causing PE to be 6x too large in magnitude.
   Fixed to use Keys cubic kernel matching PyTorch's F.interpolate.

2. Multi-head attention reshape incorrectly merged heads into the
   batch dimension. Now follows the exact SAM3 ViT pattern:
   reshape(HD,NH,N,B) → permute(0,2,1,3) → cont → reshape(HD,N,NH*B)
   → reshape_4d(HD,N,NH,B) for flash_attn_ext.

Also fixed: PE weight tensor registration shape (was [E,W,H,1],
should be [W,H,E,1] to match conversion script's reversed dims).

Status: Hiera backbone blocks 0-23 now match Python reference
(cos≥0.999999 with f32 weights). FPN neck still diverges — next
to debug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FPN lateral convolutions now use ggml_conv_2d_sk_p0 (matching SAM3 neck
pattern) with proper bias broadcasting via ggml_repeat. Added debug
tensor dumps for FPN laterals.

Verified: entire Hiera backbone + FPN neck pipeline matches Python
reference with cos≥0.99997 at all stages (f32 weights on cat.jpg).
The previous comparison failure was a layout mismatch (NCHW vs CWHB)
in the comparison script, not a computation error.
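The layout pitfall described above is worth spelling out: comparing raw buffers of mismatched layouts (NCHW vs ggml's reversed-dims dump) gives a misleadingly low cosine even when the computation is correct. A sketch of a layout-aware comparison (not the actual compare_tensors.py, and assuming the ggml dump's dims are exactly the reverse of NCHW):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compare_nchw_vs_ggml(py: np.ndarray, gg: np.ndarray) -> float:
    """py: (N, C, H, W) reference; gg: ggml-side dump, assumed to have the
    same data with dims reversed. Transpose into a common layout first;
    note this heuristic is ambiguous when all dims are equal."""
    if gg.shape != py.shape:
        gg = gg.transpose(3, 2, 1, 0)
    return cosine(py, gg)
```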

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add dump_sam2_pvs_reference.py and test_sam2_pvs_compare.cpp for
comparing SAM2 PVS decoder output against Python reference.

Results with SAM2.1 Base+ f32 on cat.jpg, center point (600, 599):
- Best mask (IoU=0.994): 99.9% binary agreement with Python
- Mask logits cosine similarity: 0.97-0.98 across all 3 multimask outputs
- IoU predictions match closely: [0.994, 0.397, 0.977] vs [0.994, 0.469, 0.986]

Also add SAM2_DUMP_DIR support in sam3_segment_pvs() for intermediate
tensor dumping.
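The binary-agreement and binary-IoU numbers quoted above can be computed as sketched below (illustrative names; the real test script may differ in details such as thresholding of logits, which is omitted here).

```python
import numpy as np

def binary_metrics(mask_a: np.ndarray, mask_b: np.ndarray):
    """mask_a, mask_b: boolean arrays of equal shape (already thresholded).
    Returns (binary IoU, fraction of pixels where the masks agree)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    iou = inter / union if union else 1.0  # two empty masks agree perfectly
    agreement = (mask_a == mask_b).mean()
    return float(iou), float(agreement)
```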

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add test_sam2_e2e_all_variants.py that runs Python reference and C++
ggml inference for all SAM2.1 variants (Tiny, Small, Base+, Large)
on the same image with the same point prompt, then compares binary
masks and IoU scores.

Results on cat.jpg with center point (600, 599):

  Variant       Best Mask IoU  Logit Cos  Binary IoU  Agreement
  tiny                 0.991     0.969      0.996      99.8%
  small                0.991     0.970      0.997      99.8%
  base_plus            0.994     0.969      0.997      99.8%
  large                0.993     0.979      0.997      99.8%

ALL PASS: best masks match Python with >=99.6% binary IoU.

Also add sam3_state_set_orig_dims() to fix coordinate normalization
when using sam3_encode_image_from_preprocessed() with non-square
original images.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add SAM2.0 support alongside SAM2.1 in the e2e test. Fix tensor
loader to allow up to 5 optional missing tensors for SAM2.0 compat
(no_obj_embed_spatial, obj_ptr_tpos_proj not present in SAM2.0).

Results on cat.jpg with center point (600, 599):

  sam2_tiny          0.988  binary_IoU=0.997  agree=99.8%  ✓
  sam2_small         0.993  binary_IoU=0.998  agree=99.9%  ✓
  sam2_base_plus     0.994  binary_IoU=0.997  agree=99.8%  ✓
  sam2_large         0.995  binary_IoU=0.982  agree=98.9%  ✓
  sam2.1_tiny        0.991  binary_IoU=0.996  agree=99.8%  ✓
  sam2.1_small       0.991  binary_IoU=0.997  agree=99.8%  ✓
  sam2.1_base_plus   0.994  binary_IoU=0.997  agree=99.8%  ✓
  sam2.1_large       0.993  binary_IoU=0.997  agree=99.8%  ✓

ALL PASS: best masks match Python reference across all variants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the fragile "tolerate missing tensors" hack with a clean
solution: add an `is_sam2_1` flag to the binary header so the C++
loader knows exactly which tensors to register.

- convert_sam2_to_ggml.py: auto-detect SAM2.0 vs 2.1 from checkpoint
  tensor names (no_obj_embed_spatial presence) and write is_sam2_1 flag
- sam3.cpp: read is_sam2_1 from header, conditionally register
  no_obj_embed_spatial and obj_ptr_tpos_proj tensors only for SAM2.1
- sam3.cpp: restore strict tensor count check — no tolerance
- sam3.cpp: guard runtime no_obj_embed_spatial usage with null check
- quantize.cpp: update SAM2 header field count (56 → 57)

Tensor counts are now exact:
  SAM2.0 Tiny:  registered 468 = loaded 468
  SAM2.1 Tiny:  registered 471 = loaded 471

All 8 variants still pass end-to-end (test_sam2_e2e_all_variants.py).
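The conversion-side detection described above reduces to a key scan over the checkpoint. A sketch, assuming key paths may carry a prefix in real checkpoints (so a substring match is used):

```python
def detect_is_sam2_1(state_dict_keys) -> bool:
    """SAM2.1 checkpoints carry no_obj_embed_spatial (and obj_ptr_tpos_proj);
    SAM2.0 checkpoints do not. The exact prefix of the key is not assumed."""
    return any("no_obj_embed_spatial" in k for k in state_dict_keys)
```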

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Added extensive debugging infrastructure to trace the TwoWay decoder:
- Initial tokens dump (verified cos=1.0 vs Python)
- Q projection dump (verified cos=1.0 vs Python)
- SA norm output dump (cos=0.16 — BUG identified)
- Manual numpy SDPA computation confirms Python output (cos=1.0)

Root cause narrowed to: sam3_sam_attention produces correct Q projection
but wrong attention output. The bug is in the ggml graph execution of
the attention, not in the data preparation or weights. Possible graph
allocator buffer reuse issue affecting the decoder.

Next step: isolate the self-attention into a separate sub-graph to test
graph allocator hypothesis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sam3_populate_pe_cache() used n_img_embd() which returns img_size/patch_size
= 1024/14 = 73 (SAM3 ViT convention). For SAM2, the correct spatial size is
feat_size() = 64 (Hiera backbone with stride 4 + 3 pooling stages).

This caused the dense positional encoding grid to be computed at 73x73
instead of 64x64, producing garbage values in the SAM mask decoder.

The fix: use feat_size() which returns the correct value for both SAM2 (64)
and SAM3 (72).

Before fix (llama.jpg, point 320,240): IoU = [0.349, 0.007, 0.260] (garbage)
After fix:                              IoU = [0.970, 0.010, 0.923] (matches Python exactly)
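The two spatial-size conventions can be sketched as plain arithmetic. The Hiera-side formula below is an assumption about how feat_size() reaches 64 (patch-embed stride 4, then two halvings down to the stride-16 level the decoder consumes); the ViT-side formula matches the 1024/14 = 73 quoted above.

```python
def n_img_embd_vit(img_size: int, patch_size: int) -> int:
    """SAM3 ViT convention: one token per patch (integer division)."""
    return img_size // patch_size

def feat_size_hiera(img_size: int, patch_stride: int = 4, n_halvings: int = 2) -> int:
    """SAM2 Hiera (assumption): stride-4 patch embed, then n_halvings
    poolings down to the stride-16 feature map used by the decoder."""
    return img_size // (patch_stride * 2**n_halvings)
```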

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PABannier and others added 5 commits March 30, 2026 22:30
…mbd)

Catches the bug where sam3_populate_pe_cache() used n_img_embd() = 73
(SAM3 ViT: 1024/14) instead of feat_size() = 64 (SAM2 Hiera) for the
dense positional encoding grid. A wrong-sized PE grid produces garbage
masks from the SAM decoder.

The test verifies:
1. The clicked center point is inside the predicted foreground mask
2. The IoU score is above 0.1 (garbage PE gives near-zero IoU)
3. The foreground coverage is non-degenerate (0.05-99.5%)

Tested on all SAM2 variants (Tiny, Small, Base+, Large) × llama.jpg.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SAM2 Hiera backbone uses head_dim=56 (embed_dim/num_heads = 112/2)
which was not supported by the Metal flash_attn_ext kernel. SAM decoder
cross-attention uses head_dim=16 which had shader code but was missing
from the supports_op whitelist.

Both now work on Metal GPU:
- Backbone encoding: 659ms (Metal) vs 4000ms (CPU) = 6x speedup
- PVS decoder: 96ms on Metal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…clicks

ImGui::Image() sets WantCaptureMouse=1, preventing the SDL mouse event
handler from receiving clicks on the video canvas. Replace with
InvisibleButton + DrawList::AddImage, matching the pattern used in
main_image.cpp where canvas interaction works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The InvisibleButton claims WantCaptureMouse, blocking the SDL event
loop from seeing clicks. Use ImGui::IsMouseClicked/IsItemHovered
after the InvisibleButton — matching the working main_image.cpp pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add interactive timeline below canvas showing frame position and
  per-instance presence bands with click-to-seek/scrub
- Auto-create tracker and encode first frame on launch (remove
  manual "Start tracking" button)
- Points mode now auto-triggers instance creation on each click,
  matching box mode behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@PABannier PABannier merged commit 8dd369f into main Mar 31, 2026
@PABannier PABannier deleted the sam2 branch March 31, 2026 09:13