Factory recipe: bolt-on vision encoder (LLaVA-style) #649

@joelteply

Description

Summary

A factory recipe that adds vision capability to any text-only LLM by training a projection layer between a vision encoder (SigLIP/CLIP) and the LLM's embedding space, following the LLaVA architecture.

Approach

  • Frozen base LLM + frozen vision encoder (SigLIP-400M or similar)
  • Train only the projection layer (MLP mapping image embeddings → LLM token space)
  • Two-stage: (1) feature alignment on image-caption pairs, (2) visual instruction tuning
  • Output: projection weights that bolt onto any model from the same family
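The trainable piece above can be sketched in a few lines. This is a minimal illustration, not the actual forge implementation; the embedding dims (1152 for SigLIP patch features, 4096 for the LLM hidden size) and the two-layer GELU MLP shape are assumptions borrowed from common LLaVA-style setups.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder patch embeddings into the
    LLM's token-embedding space. With the LLM and vision encoder frozen,
    this is the only component that receives gradients."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim)
        # returns pseudo-tokens: (batch, num_patches, llm_dim)
        return self.proj(image_feats)

projector = VisionProjector()
feats = torch.randn(2, 256, 1152)   # fake SigLIP patch features
tokens = projector(feats)           # (2, 256, 4096) pseudo-tokens
```

During stage 1 these pseudo-tokens are simply prepended to the caption's token embeddings and trained with the ordinary next-token loss, so the projector learns the alignment while both large models stay frozen.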

Factory Integration

  • New forge profile type: vision-encoder
  • Recipe specifies: base LLM, vision encoder, projection architecture, training data
  • Can combine with existing LoRA personality adapters — vision + personality in one model
  • Validation: VQA benchmarks, image description quality
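A recipe covering the fields above might look something like the following. Every key name here is hypothetical, shown as a plain dict only to make the shape concrete; the real forge profile schema would define the actual names.

```python
# Hypothetical vision-encoder profile shape -- all field names are
# illustrative, not the actual forge schema.
recipe = {
    "profile": "vision-encoder",
    "base_llm": "qwen3.5-compacted",       # any model from the family
    "vision_encoder": "siglip-400m",
    "projection": {"type": "mlp", "layers": 2, "activation": "gelu"},
    "stages": [
        {"name": "feature_alignment", "data": "image-caption-pairs"},
        {"name": "visual_instruction_tuning", "data": "vqa-instructions"},
    ],
    "adapters": ["lora-personality"],      # optional: combine with LoRA
    "validation": ["vqa-benchmarks", "image-description-quality"],
}
```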

Why

This turns any text-only forged model into a multimodal one. A compacted Qwen3.5 with vision bolted on is a powerful local model. The projection layer is small (~50-100MB) — it's essentially an adapter.
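The ~50-100MB figure checks out as back-of-envelope arithmetic, assuming the illustrative dims used earlier (1152-d vision features, 4096-d LLM hidden, two-layer MLP): roughly 21.5M parameters, i.e. ~43MB in fp16 or ~86MB in fp32.

```python
# Back-of-envelope size estimate for the projection MLP.
# Assumed dims: 1152-d SigLIP features -> 4096-d LLM hidden, 2 layers.
vision_dim, llm_dim = 1152, 4096
params = (vision_dim * llm_dim + llm_dim) + (llm_dim * llm_dim + llm_dim)
fp16_mb = params * 2 / 1e6   # 2 bytes per fp16 weight
fp32_mb = params * 4 / 1e6   # 4 bytes per fp32 weight
print(params, round(fp16_mb, 1), round(fp32_mb, 1))
```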

Novel territory

Nobody is asking "what happens when you add vision to a structurally pruned model?" The compacted model's embedding space may behave differently post-pruning. This could be paper-worthy if the projection layer compensates for pruned capacity.

Metadata

Labels: enhancement (New feature or request)
