Factory recipe: bolt-on vision encoder (LLaVA-style) #649

@joelteply

Description

Summary

A factory recipe that adds vision capability to any text-only LLM by training a projection layer between a vision encoder (SigLIP/CLIP) and the LLM's embedding space, following the LLaVA architecture.

Approach

  • Frozen base LLM + frozen vision encoder (SigLIP-400M or similar)
  • Train only the projection layer (MLP mapping image embeddings → LLM token space)
  • Two-stage: (1) feature alignment on image-caption pairs, (2) visual instruction tuning
  • Output: projection weights that bolt onto any model from the same family
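The trainable piece above can be sketched in a few lines. This is a minimal illustration, not the actual forge implementation; the embedding dims (1152 for SigLIP patch features, 4096 for the LLM hidden size) and the two-layer GELU MLP shape are assumptions borrowed from common LLaVA-style setups.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder patch embeddings into the
    LLM's token-embedding space. With the LLM and vision encoder frozen,
    this is the only component that receives gradients."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim)
        # returns pseudo-tokens: (batch, num_patches, llm_dim)
        return self.proj(image_feats)

projector = VisionProjector()
feats = torch.randn(2, 256, 1152)   # fake SigLIP patch features
tokens = projector(feats)           # (2, 256, 4096) pseudo-tokens
```

During stage 1 these pseudo-tokens are simply prepended to the caption's token embeddings and trained with the ordinary next-token loss, so the projector learns the alignment while both large models stay frozen.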

Factory Integration

  • New forge profile type: vision-encoder
  • Recipe specifies: base LLM, vision encoder, projection architecture, training data
  • Can combine with existing LoRA personality adapters — vision + personality in one model
  • Validation: VQA benchmarks, image description quality
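A recipe covering the fields above might look something like the following. Every key name here is hypothetical, shown as a plain dict only to make the shape concrete; the real forge profile schema would define the actual names.

```python
# Hypothetical vision-encoder profile shape -- all field names are
# illustrative, not the actual forge schema.
recipe = {
    "profile": "vision-encoder",
    "base_llm": "qwen3.5-compacted",       # any model from the family
    "vision_encoder": "siglip-400m",
    "projection": {"type": "mlp", "layers": 2, "activation": "gelu"},
    "stages": [
        {"name": "feature_alignment", "data": "image-caption-pairs"},
        {"name": "visual_instruction_tuning", "data": "vqa-instructions"},
    ],
    "adapters": ["lora-personality"],      # optional: combine with LoRA
    "validation": ["vqa-benchmarks", "image-description-quality"],
}
```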

Why

This turns any text-only forged model into a multimodal one. A compacted Qwen3.5 with vision bolted on is a powerful local model. The projection layer is small (~50-100MB) — it's essentially an adapter.
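The ~50-100MB figure checks out as back-of-envelope arithmetic, assuming the illustrative dims used earlier (1152-d vision features, 4096-d LLM hidden, two-layer MLP): roughly 21.5M parameters, i.e. ~43MB in fp16 or ~86MB in fp32.

```python
# Back-of-envelope size estimate for the projection MLP.
# Assumed dims: 1152-d SigLIP features -> 4096-d LLM hidden, 2 layers.
vision_dim, llm_dim = 1152, 4096
params = (vision_dim * llm_dim + llm_dim) + (llm_dim * llm_dim + llm_dim)
fp16_mb = params * 2 / 1e6   # 2 bytes per fp16 weight
fp32_mb = params * 4 / 1e6   # 4 bytes per fp32 weight
print(params, round(fp16_mb, 1), round(fp32_mb, 1))
```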

Novel territory

Nobody is asking "what happens when you add vision to a structurally pruned model?" The compacted model's embedding space may behave differently post-pruning. This could be paper-worthy if the projection layer compensates for pruned capacity.

Metadata

Labels: enhancement (New feature or request)
