feat: VLM primitives — vision ops, NN layers, encoder models#96

Merged
0xDaizz merged 1 commit into main from feat/vlm-primitives on Mar 17, 2026
Conversation


0xDaizz (Owner) commented Mar 17, 2026

Summary

  • Add VLM (Vision-Language Model) primitive operations: interpolate (bilinear/bicubic), pixel_normalize, pool2d with Metal kernel registration and buffer slot definitions
  • Add vision NN layers: PatchEmbedding, VisionPositionalEmbedding, VisionTransformerBlock, MultiModalProjector, and cache-free attention forward pass
  • Add vision encoder model configs: ViT-Base/16, ViT-Large/14, SigLIP SO400M/14, CLIP ViT-L/14-336 with HF safetensors weight mapping

Test plan

  • cargo check --workspace passes
  • cargo fmt --all --check passes
  • cargo clippy --workspace --all-targets passes (fixed large_enum_variant warning)
  • Integration test with actual vision encoder weights (future PR)

🤖 Generated with Claude Code

Core ops (rmlx-core):
- interpolate: bilinear/bicubic Metal kernels, align_corners support
- pixel_normalize: per-channel (pixel-mean)/std, ImageNet defaults
- pool2d: avg/max pooling with configurable kernel/stride/padding
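As a rough illustration of the `pixel_normalize` op above, here is a minimal pure-Rust sketch of per-channel `(pixel - mean) / std` with the standard ImageNet defaults. The function name and CHW layout assumption are illustrative only, not the actual rmlx-core API (the real op runs as a Metal kernel).

```rust
// Illustrative sketch only — not the rmlx-core API or its Metal kernel.
// Standard ImageNet normalization constants (per RGB channel).
const IMAGENET_MEAN: [f32; 3] = [0.485, 0.456, 0.406];
const IMAGENET_STD: [f32; 3] = [0.229, 0.224, 0.225];

/// Normalize a CHW-ordered pixel buffer in place: (pixel - mean) / std.
fn pixel_normalize(pixels: &mut [f32], channels: usize, mean: &[f32], std: &[f32]) {
    let per_channel = pixels.len() / channels;
    for c in 0..channels {
        let (m, s) = (mean[c], std[c]);
        for v in &mut pixels[c * per_channel..(c + 1) * per_channel] {
            *v = (*v - m) / s;
        }
    }
}

fn main() {
    // Two pixels per channel, already scaled to [0, 1], each set to its
    // channel mean — so every value should normalize to ~0.0.
    let mut img = vec![0.485, 0.485, 0.456, 0.456, 0.406, 0.406];
    pixel_normalize(&mut img, 3, &IMAGENET_MEAN, &IMAGENET_STD);
    assert!(img.iter().all(|v| v.abs() < 1e-6));
    println!("ok");
}
```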

NN layers (rmlx-nn):
- PatchEmbedding: Conv2d + reshape for image→patch sequence
- VisionPositionalEmbedding: learned + grid2D
- VisionTransformerBlock: pre-norm bidirectional attention
- MultiModalProjector: Linear or MLP (fc1→GELU→fc2)
- Attention::forward_no_cache() for vision (no RoPE, no mask, no KV cache)
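The cache-free attention path above differs from the text path in that every token attends to every other token with no causal mask, no RoPE, and no KV cache. A naive single-head sketch of that computation (scaled dot-product, then softmax over all positions) looks like the following — plain loops for clarity, not the rmlx-nn implementation:

```rust
// Illustrative single-head sketch of bidirectional, cache-free attention
// (no RoPE, no mask, no KV cache). Not the rmlx-nn kernel.

/// Numerically stable in-place softmax over one score row.
fn softmax(row: &mut [f32]) {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0;
    for v in row.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    for v in row.iter_mut() {
        *v /= sum;
    }
}

/// q, k, v: [seq, dim] row-major. Returns the attention output, [seq, dim].
fn attention_no_cache(q: &[f32], k: &[f32], v: &[f32], seq: usize, dim: usize) -> Vec<f32> {
    let scale = 1.0 / (dim as f32).sqrt();
    let mut out = vec![0.0; seq * dim];
    for i in 0..seq {
        // Score query i against every position — bidirectional, no mask.
        let mut scores: Vec<f32> = (0..seq)
            .map(|j| (0..dim).map(|d| q[i * dim + d] * k[j * dim + d]).sum::<f32>() * scale)
            .collect();
        softmax(&mut scores);
        // Weighted sum of value rows.
        for j in 0..seq {
            for d in 0..dim {
                out[i * dim + d] += scores[j] * v[j * dim + d];
            }
        }
    }
    out
}

fn main() {
    // With identical tokens, attention weights are uniform and the
    // output equals the (shared) value row.
    let (seq, dim) = (3, 2);
    let x = vec![1.0f32; seq * dim];
    let out = attention_no_cache(&x, &x, &x, seq, dim);
    assert!(out.iter().all(|&v| (v - 1.0).abs() < 1e-6));
    println!("ok");
}
```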

Vision encoder models:
- ViT-Base/16 (768h, 12L), ViT-Large/14 (1024h, 24L)
- SigLIP SO400M/14 (1152h, 27L, no class token)
- CLIP ViT-L/14-336 (1024h, 24L, QuickGELU)
- HF weight name mapping for vision encoders
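The config names above encode image size and patch size, which together fix the token sequence length each encoder produces. A small sketch of that arithmetic (function name is illustrative, not part of the crates):

```rust
// Illustrative helper: sequence length of a square ViT-style patch grid,
// optionally counting a prepended class token. Not the rmlx API.
fn seq_len(image: usize, patch: usize, class_token: bool) -> usize {
    assert!(image % patch == 0, "image size must be divisible by patch size");
    let grid = image / patch;
    grid * grid + usize::from(class_token)
}

fn main() {
    // ViT-Base/16 at 224x224: a 14x14 grid = 196 patches, 197 with class token.
    assert_eq!(seq_len(224, 16, true), 197);
    // SigLIP SO400M/14 at 224: 16x16 = 256 patches, no class token.
    assert_eq!(seq_len(224, 14, false), 256);
    // CLIP ViT-L/14-336: 24x24 = 576 patches, 577 with class token.
    assert_eq!(seq_len(336, 14, true), 577);
    println!("ok");
}
```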

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@0xDaizz 0xDaizz merged commit db8d116 into main Mar 17, 2026
7 checks passed