Add full SAM2 (Segment Anything Model 2) support to sam3.cpp, enabling both SAM2 and SAM3 models to be loaded and used through the same API. The model type is auto-detected from the binary file magic number.

New components:
- convert_sam2_to_ggml.py: weight conversion for all SAM2/2.1 variants (Tiny, Small, Base+, Large) with auto-detection and YAML config support
- Hiera backbone: PatchEmbed, windowed multi-scale attention with Q-pooling, bicubic + tiled positional embedding, 4-stage feature pyramid
- FPN neck: lateral 1x1 convs, nearest-neighbor top-down fusion, scalp
- SAM2 image preprocessing with ImageNet normalization
- 5-token decoder layout for older SAM2 (pred_obj_scores=False)
- Multimask output selection during video tracking propagation
- fixed_no_obj_ptr object pointer blending by presence score

Modified for dual-model support:
- sam3_load_model() dispatches on magic (0x73616D32 vs 0x73616D33)
- All shared functions parameterized: feat_size(), sigmoid_scale/bias, mask interpolation target, spatial dimensions from tensor shapes
- sam3_segment_pcs/sam3_create_tracker guarded for SAM2 models
- Examples (image/video) adapt UI at runtime based on model type
- Quantize tool handles both SAM2 and SAM3 headers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
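The magic-number dispatch described above can be sketched in a few lines. This is a hedged pure-Python illustration, not the actual sam3.cpp loader: the function name `detect_model_type` and the little-endian assumption are hypothetical; only the two magic values (0x73616D32 / 0x73616D33, ASCII "sam2" / "sam3") come from the commit.

```python
import struct

SAM2_MAGIC = 0x73616D32  # ASCII "sam2"
SAM3_MAGIC = 0x73616D33  # ASCII "sam3"

def detect_model_type(data: bytes) -> str:
    """Return 'sam2' or 'sam3' based on the leading 4-byte magic.

    Hypothetical sketch: the real header layout and endianness
    live in sam3_load_model() in sam3.cpp.
    """
    (magic,) = struct.unpack("<I", data[:4])
    if magic == SAM2_MAGIC:
        return "sam2"
    if magic == SAM3_MAGIC:
        return "sam3"
    raise ValueError(f"unknown magic 0x{magic:08x}")
```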
…cture

- Fix window_size "lags by a block" — Q-stride blocks now use the PREVIOUS stage's window_spec, matching Python's Hiera.__init__
- Fix window unpartition pad_hw computation for Q-stride blocks: recompute from target spatial dims instead of dividing partition pads
- Add SAM2 dispatch in sam3_encode_image_from_preprocessed()
- Add SAM2_DUMP_DIR environment variable for intermediate tensor dumps
- Add test_sam2_backbone_compare.cpp and dump_sam2_reference.py
- Add compare_tensors.py for numerical comparison

Status: PatchEmbed output matches Python (cos=0.9999). Positional embedding computation diverges (cos=0.005) — root cause identified, fix pending.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
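The pad_hw fix can be illustrated with plain integer arithmetic. A minimal sketch, not the ggml code: dims and window size below are hypothetical, chosen only to show why dividing the partition pads by the pool stride can yield a padded size that is no longer divisible by the window.

```python
def pad_hw(h: int, w: int, window: int) -> tuple:
    """Padded spatial dims so both h and w are divisible by window."""
    pad_h = (window - h % window) % window
    pad_w = (window - w % window) % window
    return h + pad_h, w + pad_w

# Hypothetical pre-pool dims and window size:
h, window = 50, 8
ph, _ = pad_hw(h, h, window)     # 56: partition padding at full resolution
th = h // 2                      # 25: target dim after 2x Q-stride pooling
tph, _ = pad_hw(th, th, window)  # 32: recomputed from the TARGET dims

# The buggy approach - halving the partition-padded dim - gives 28,
# which is not divisible by the window size:
assert (ph // 2) % window != 0
assert tph % window == 0
```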
Two bugs fixed in the Hiera backbone:

1. The bicubic interpolation kernel used the wrong coefficient (a=-0.5 instead of a=-0.75), making the positional embedding 6x too large in magnitude. Fixed to use the Keys cubic kernel matching PyTorch's F.interpolate.
2. The multi-head attention reshape incorrectly merged heads into the batch dimension. It now follows the exact SAM3 ViT pattern: reshape(HD,NH,N,B) → permute(0,2,1,3) → cont → reshape(HD,N,NH*B) → reshape_4d(HD,N,NH,B) for flash_attn_ext.

Also fixed: the PE weight tensor registration shape (was [E,W,H,1], should be [W,H,E,1] to match the conversion script's reversed dims).

Status: Hiera backbone blocks 0-23 now match the Python reference (cos≥0.999999 with f32 weights). The FPN neck still diverges — next to debug.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
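For reference, the Keys cubic family referred to in fix 1 can be written down directly. A sketch, assuming the standard two-piece Keys formulation: a=-0.75 is the value PyTorch's `F.interpolate(mode='bicubic')` (and OpenCV) uses, while a=-0.5 gives the Catmull-Rom variant; the taps always sum to 1, so the wrong `a` changes the weight distribution rather than the overall scale.

```python
def keys_cubic(x: float, a: float = -0.75) -> float:
    """Keys cubic convolution kernel with free parameter a."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0

def taps(t: float, a: float = -0.75):
    """The 4 interpolation weights for a fractional offset t in [0, 1)."""
    return [keys_cubic(t + 1.0, a), keys_cubic(t, a),
            keys_cubic(1.0 - t, a), keys_cubic(2.0 - t, a)]
```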
FPN lateral convolutions now use ggml_conv_2d_sk_p0 (matching the SAM3 neck pattern) with proper bias broadcasting via ggml_repeat. Added debug tensor dumps for the FPN laterals.

Verified: the entire Hiera backbone + FPN neck pipeline matches the Python reference with cos≥0.99997 at all stages (f32 weights on cat.jpg). The previous comparison failure was a layout mismatch (NCHW vs CWHB) in the comparison script, not a computation error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
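The nearest-neighbor top-down fusion mentioned above is simple enough to sketch on a single-channel feature map. This is an illustration of the operation only, with hypothetical helper names; the real code runs on ggml tensors after the 1x1 lateral convolutions.

```python
def upsample_nearest_2x(fm):
    """Nearest-neighbor 2x upsample of a 2D feature map (list of rows)."""
    return [[v for v in row for _ in (0, 1)] for row in fm for _ in (0, 1)]

def fuse_top_down(top, lateral):
    """One FPN top-down step: upsample the coarser level, add the lateral."""
    up = upsample_nearest_2x(top)
    return [[u + l for u, l in zip(ur, lr)] for ur, lr in zip(up, lateral)]
```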
Add dump_sam2_pvs_reference.py and test_sam2_pvs_compare.cpp for comparing SAM2 PVS decoder output against the Python reference.

Results with SAM2.1 Base+ f32 on cat.jpg, center point (600, 599):
- Best mask (IoU=0.994): 99.9% binary agreement with Python
- Mask logits cosine similarity: 0.97-0.98 across all 3 multimask outputs
- IoU predictions match closely: [0.994, 0.397, 0.977] vs [0.994, 0.469, 0.986]

Also add SAM2_DUMP_DIR support in sam3_segment_pvs() for intermediate tensor dumping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
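The metrics reported here and in the later e2e tests (cosine similarity on logits, binary agreement, binary IoU) are standard; a minimal pure-Python sketch of how such numbers are computed, not the actual comparison script:

```python
import math

def cosine(a, b):
    """Cosine similarity between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def binary_metrics(logits_a, logits_b, thresh=0.0):
    """Binarize two mask logit vectors at `thresh`; return
    (fraction of pixels that agree, IoU of the binary masks)."""
    ma = [x > thresh for x in logits_a]
    mb = [x > thresh for x in logits_b]
    agree = sum(x == y for x, y in zip(ma, mb)) / len(ma)
    inter = sum(x and y for x, y in zip(ma, mb))
    union = sum(x or y for x, y in zip(ma, mb))
    return agree, (inter / union if union else 1.0)
```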
Add test_sam2_e2e_all_variants.py, which runs Python reference and C++ ggml inference for all SAM2.1 variants (Tiny, Small, Base+, Large) on the same image with the same point prompt, then compares binary masks and IoU scores.

Results on cat.jpg with center point (600, 599):

Variant     Best Mask IoU  Logit Cos  Binary IoU  Agreement
tiny        0.991          0.969      0.996       99.8%
small       0.991          0.970      0.997       99.8%
base_plus   0.994          0.969      0.997       99.8%
large       0.993          0.979      0.997       99.8%

ALL PASS: best masks match Python with >=99.6% binary IoU.

Also add sam3_state_set_orig_dims() to fix coordinate normalization when using sam3_encode_image_from_preprocessed() with non-square original images.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add SAM2.0 support alongside SAM2.1 in the e2e test. Fix the tensor loader to allow up to 5 optional missing tensors for SAM2.0 compat (no_obj_embed_spatial and obj_ptr_tpos_proj are not present in SAM2.0).

Results on cat.jpg with center point (600, 599):

sam2_tiny        0.988  binary_IoU=0.997  agree=99.8% ✓
sam2_small       0.993  binary_IoU=0.998  agree=99.9% ✓
sam2_base_plus   0.994  binary_IoU=0.997  agree=99.8% ✓
sam2_large       0.995  binary_IoU=0.982  agree=98.9% ✓
sam2.1_tiny      0.991  binary_IoU=0.996  agree=99.8% ✓
sam2.1_small     0.991  binary_IoU=0.997  agree=99.8% ✓
sam2.1_base_plus 0.994  binary_IoU=0.997  agree=99.8% ✓
sam2.1_large     0.993  binary_IoU=0.997  agree=99.8% ✓

ALL PASS: best masks match the Python reference across all variants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the fragile "tolerate missing tensors" hack with a clean solution: add an `is_sam2_1` flag to the binary header so the C++ loader knows exactly which tensors to register.

- convert_sam2_to_ggml.py: auto-detect SAM2.0 vs 2.1 from checkpoint tensor names (no_obj_embed_spatial presence) and write the is_sam2_1 flag
- sam3.cpp: read is_sam2_1 from the header; conditionally register the no_obj_embed_spatial and obj_ptr_tpos_proj tensors only for SAM2.1
- sam3.cpp: restore the strict tensor count check — no tolerance
- sam3.cpp: guard runtime no_obj_embed_spatial usage with a null check
- quantize.cpp: update the SAM2 header field count (56 → 57)

Tensor counts are now exact:
- SAM2.0 Tiny: registered 468 = loaded 468
- SAM2.1 Tiny: registered 471 = loaded 471

All 8 variants still pass end-to-end (test_sam2_e2e_all_variants.py).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
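The checkpoint-side auto-detection amounts to a presence check on tensor names. A sketch of the idea from the commit, with a hypothetical helper name; the real logic lives in convert_sam2_to_ggml.py:

```python
def is_sam2_1(state_dict_keys) -> bool:
    """True for SAM2.1 checkpoints, which add no_obj_embed_spatial
    (and obj_ptr_tpos_proj); both are absent from SAM2.0."""
    return any("no_obj_embed_spatial" in k for k in state_dict_keys)
```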
Added extensive debugging infrastructure to trace the TwoWay decoder:
- Initial tokens dump (verified cos=1.0 vs Python)
- Q projection dump (verified cos=1.0 vs Python)
- SA norm output dump (cos=0.16 — BUG identified)
- Manual numpy SDPA computation confirms the Python output (cos=1.0)

Root cause narrowed down: sam3_sam_attention produces the correct Q projection but the wrong attention output. The bug is in the ggml graph execution of the attention, not in the data preparation or the weights — possibly a graph allocator buffer reuse issue affecting the decoder.

Next step: isolate the self-attention into a separate sub-graph to test the graph allocator hypothesis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
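A reference SDPA like the "manual numpy computation" above is a useful independent check against any graph framework. This pure-Python sketch (no numpy, hypothetical function names) implements softmax(QK^T/sqrt(d))V directly:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sdpa(q, k, v):
    """Reference scaled dot-product attention.
    q, k, v are lists of row vectors (seq_len x head_dim)."""
    d = len(q[0])
    out = []
    for qr in q:
        scores = [sum(a * b for a, b in zip(qr, kr)) / math.sqrt(d)
                  for kr in k]
        w = softmax(scores)  # one attention weight per key
        out.append([sum(wi * vr[j] for wi, vr in zip(w, v))
                    for j in range(len(v[0]))])
    return out
```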
sam3_populate_pe_cache() used n_img_embd(), which returns img_size/patch_size = 1024/14 = 73 (SAM3 ViT convention). For SAM2, the correct spatial size is feat_size() = 64 (Hiera backbone with stride 4 plus 3 pooling stages). This caused the dense positional encoding grid to be computed at 73x73 instead of 64x64, producing garbage values in the SAM mask decoder.

The fix: use feat_size(), which returns the correct value for both SAM2 (64) and SAM3 (72).

Before fix (llama.jpg, point 320,240): IoU = [0.349, 0.007, 0.260] (garbage)
After fix:                             IoU = [0.970, 0.010, 0.923] (matches Python exactly)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mbd)

Catches the bug where sam3_populate_pe_cache() used n_img_embd() = 73 (SAM3 ViT: 1024/14) instead of feat_size() = 64 (SAM2 Hiera) for the dense positional encoding grid. A wrong-sized PE grid produces garbage masks from the SAM decoder.

The test verifies:
1. The clicked center point is inside the predicted foreground mask
2. The IoU score is above 0.1 (garbage PE gives near-zero IoU)
3. The foreground coverage is non-degenerate (0.05-99.5%)

Tested on all SAM2 variants (Tiny, Small, Base+, Large) × llama.jpg.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
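The three checks above are cheap to express directly. A hedged sketch of such a regression check (helper name and mask representation are hypothetical; the real test is C++ against the loaded model):

```python
def mask_sanity_checks(mask, iou, point):
    """Return the list of failed checks for a binary mask given as a
    list of rows of bools, an IoU score, and a clicked (x, y) point."""
    h, w = len(mask), len(mask[0])
    px, py = point
    failures = []
    if not mask[py][px]:
        failures.append("clicked point not inside foreground")
    if iou <= 0.1:
        failures.append("IoU too low (garbage PE gives near-zero IoU)")
    fg = sum(1 for row in mask for v in row if v)
    coverage = 100.0 * fg / (h * w)
    if not (0.05 <= coverage <= 99.5):
        failures.append("degenerate foreground coverage")
    return failures
```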
The SAM2 Hiera backbone uses head_dim=56 (embed_dim/num_heads = 112/2), which was not supported by the Metal flash_attn_ext kernel. SAM decoder cross-attention uses head_dim=16, which had shader code but was missing from the supports_op whitelist.

Both now work on the Metal GPU:
- Backbone encoding: 659ms (Metal) vs 4000ms (CPU) = 6x speedup
- PVS decoder: 96ms on Metal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…clicks

ImGui::Image() sets WantCaptureMouse=1, preventing the SDL mouse event handler from receiving clicks on the video canvas. Replace it with InvisibleButton + DrawList::AddImage, matching the pattern used in main_image.cpp, where canvas interaction works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The InvisibleButton also claims WantCaptureMouse, blocking the SDL event loop from seeing clicks. Use ImGui::IsMouseClicked/IsItemHovered after the InvisibleButton instead — matching the working main_image.cpp pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add an interactive timeline below the canvas showing frame position and per-instance presence bands, with click-to-seek/scrub
- Auto-create the tracker and encode the first frame on launch (remove the manual "Start tracking" button)
- Points mode now auto-triggers instance creation on each click, matching box mode behavior

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Details
New files:
- convert_sam2_to_ggml.py — weight conversion for all SAM2/2.1 variants (Tiny/Small/Base+/Large)

Core changes (sam3.cpp, sam3.h):
- sam3_model_type enum + sam3_get_model_type() API
- sam2_build_hiera_graph() with window partition, Q-pooling, bicubic PE
- sam2_build_fpn_neck_graph() with nearest top-down + scalp
- sam2_encode_image_hiera() full pipeline with ImageNet preprocessing
- Shared functions parameterized (feat_size(), sigmoid_scale/bias, mask interpolation)
- fixed_no_obj_ptr blending

Examples:
- main_image.cpp — PCS mode disabled for SAM2, PVS works
- main_video.cpp — dispatches sam3_create_visual_tracker/sam3_propagate_frame for SAM2
- quantize.cpp — handles both SAM2 and SAM3 headers

Status
Backbone runs end-to-end and produces output. Segmentation masks are not yet numerically correct — debugging the forward pass is the next step.
Test plan
- sam3_image launches with SAM2 model, encodes image, runs PVS

🤖 Generated with Claude Code