Summary
Factory recipe to add vision capability to any text-only LLM by training a projection layer between a vision encoder (SigLIP/CLIP) and the LLM's embedding space. LLaVA-style architecture.
Approach
- Frozen base LLM + frozen vision encoder (SigLIP-400M or similar)
- Train only the projection layer (MLP mapping image embeddings → LLM token space)
- Two-stage: (1) feature alignment on image-caption pairs, (2) visual instruction tuning
- Output: projection weights that bolt onto any model from the same family
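The projection layer in this setup is just a small MLP. A minimal sketch of the forward pass, using NumPy and assumed dimensions (SigLIP-400M's 1152-dim patch embeddings, a hypothetical 4096-dim LLM hidden size; real LLaVA-style projectors use GELU and are trained, not random):

```python
import numpy as np

# Assumed dimensions for illustration: SigLIP patch dim -> LLM hidden dim.
VISION_DIM, HIDDEN, LLM_DIM = 1152, 4096, 4096

rng = np.random.default_rng(0)
W1 = rng.standard_normal((VISION_DIM, HIDDEN)) * 0.02
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, LLM_DIM)) * 0.02
b2 = np.zeros(LLM_DIM)

def project(patch_embeddings: np.ndarray) -> np.ndarray:
    """Two-layer MLP mapping vision patches into the LLM's token space.

    Only these weights are trained; the vision encoder and LLM stay frozen.
    """
    h = np.maximum(patch_embeddings @ W1 + b1, 0.0)  # ReLU stand-in for GELU
    return h @ W2 + b2

# e.g. a 27x27 patch grid -> 729 "image tokens" in the LLM's embedding space
patches = rng.standard_normal((729, VISION_DIM))
tokens = project(patches)
```

The projected rows are then prepended (or interleaved) with the text token embeddings before the frozen LLM's first layer.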
Factory Integration
- New forge profile type: vision-encoder
- Recipe specifies: base LLM, vision encoder, projection architecture, training data
- Can combine with existing LoRA personality adapters — vision + personality in one model
- Validation: VQA benchmarks, image description quality
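A recipe covering those fields might look like the following. All field names here are illustrative assumptions, not an actual forge schema:

```python
# Hypothetical vision-encoder recipe (field names invented for illustration)
recipe = {
    "profile": "vision-encoder",
    "base_llm": "qwen3.5-compact",          # any model from the target family
    "vision_encoder": "siglip-400m",
    "projection": {"type": "mlp", "layers": 2, "activation": "gelu"},
    "stages": [
        {"name": "feature_alignment", "data": "image-caption-pairs"},
        {"name": "visual_instruction_tuning", "data": "visual-instructions"},
    ],
    "validation": ["vqa-benchmarks", "image-description-quality"],
}

# Stage order matters: alignment first, instruction tuning second.
stage_names = [s["name"] for s in recipe["stages"]]
```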
Why
This turns any text-only forged model into a multimodal one. A compacted Qwen3.5 with vision bolted on is a powerful local model. The projection layer is small (~50-100MB) — it's essentially an adapter.
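The size claim is easy to sanity-check. For a 2-layer MLP projector with the assumed dimensions above (1152-dim vision features, 4096-dim LLM hidden size), the parameter count lands in that range at fp16:

```python
# Rough size estimate for a 2-layer MLP projector (dims are assumptions)
vision_dim, hidden, llm_dim = 1152, 4096, 4096

params = (vision_dim * hidden + hidden) + (hidden * llm_dim + llm_dim)
size_mb_fp16 = params * 2 / 1e6  # 2 bytes per parameter

print(f"{params / 1e6:.1f}M params ~ {size_mb_fp16:.0f} MB in fp16")
```

That comes to about 21.5M parameters, roughly 43 MB in fp16; larger LLM hidden sizes or wider projectors push it toward the upper end of the range.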
Novel territory
Nobody is asking "what happens when you add vision to a structurally pruned model?" The compacted model's embedding space may behave differently post-pruning. This could be paper-worthy if the projection layer compensates for pruned capacity.
Related