Natural-language image editing through cascaded vision-language translation
A proof-of-concept exploring how vision-language models can bridge the gap between casual user prompts and precise image editing instructions.
Image editing models like QWEN-Image-Edit work great with specific instructions ("add sepia tone, reduce saturation"), but struggle with how people actually talk ("make it vintage"). If you feed vague prompts directly to diffusion models, they tend to reimagine the entire scene instead of editing what's there—changing subjects, hallucinating elements, losing the original composition.
This project uses a two-stage pipeline:
User Input → [JoyCaption Translation] → [QWEN Image Editing] → Output
"make it vintage" → "add sepia tone, reduce → [edited image]
saturation, add film grain"
Stage 1 - JoyCaption (LLaVA-based): Looks at both your prompt and the actual image, then translates vague requests into 1-4 concrete, atomic edits. It's explicitly constrained to preserve faces, identities, composition, and pose unless you specifically ask to change them.
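To make the idea concrete, here is a minimal sketch of what such a translation stage could look like with Hugging Face transformers. This is illustrative only, not the repository's actual code: the model id, the chat/prompt format, the system prompt wording, and the `translate_request` helper are all assumptions.

```python
# Sketch of a Stage-1 "prompt translator" built on a LLaVA-style captioner.
# NOTE: the model id, chat format, and prompt wording are illustrative
# placeholders; check the repo and the checkpoint's docs for the real setup.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "your-joycaption-checkpoint"  # placeholder, not a verified model id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

SYSTEM_PROMPT = (
    "You translate a casual editing request into 1-4 concrete, atomic image edits. "
    "Preserve faces, identities, composition, and pose unless the user explicitly "
    "asks to change them. Reply with a short comma-separated list of edits only."
)

def translate_request(image: Image.Image, user_request: str) -> str:
    """Turn a vague request like 'make it vintage' into concrete edit instructions."""
    # Whether the checkpoint's chat template supports a system role is an assumption.
    messages = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": f"Request: {user_request}"},
        ]},
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    inputs = inputs.to(model.device, torch.bfloat16)
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Decode only the newly generated tokens (the model's answer).
    answer_ids = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(answer_ids, skip_special_tokens=True)[0].strip()
```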
Why this matters: By breaking abstract concepts down into specific operations before diffusion, we keep the model from going rogue. The result is an edit of the original image, not a reimagining of it: subjects stay realistic and the composition stays intact.
Stage 2 - QWEN-Image-Edit: Takes those specific instructions and applies them. Because it receives unambiguous directives, it can focus on targeted modifications while maintaining coherence.
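A matching sketch for Stage 2, assuming recent diffusers releases expose a `QwenImageEditPipeline` for `Qwen/Qwen-Image-Edit` (verify the class name and call signature against your installed version). `translate_request` is the hypothetical helper from the Stage-1 sketch above; quantization and offloading details (the README compares against a 4-bit baseline) are omitted.

```python
# Sketch of Stage 2: feed the translated, unambiguous instructions to the editor.
# ASSUMPTION: your diffusers version ships QwenImageEditPipeline; otherwise use
# whatever pipeline class the diffusers docs list for Qwen-Image-Edit.
import torch
from PIL import Image
from diffusers import QwenImageEditPipeline

editor = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
).to("cuda")

source = Image.open("photo.jpg").convert("RGB")

# Stage 1 (see sketch above): "make it vintage" -> concrete, atomic edits,
# e.g. "add sepia tone, reduce saturation, add film grain".
instructions = translate_request(source, "make it vintage")

# Stage 2: apply only those targeted edits to the original image.
edited = editor(image=source, prompt=instructions, num_inference_steps=30).images[0]
edited.save("photo_edited.jpg")
```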
Comparison between our cascaded approach and vanilla QWEN-Image-Edit (4-bit):
Notice how our approach better preserves the original subject, composition, and realism while still applying the requested edits. The difference is especially apparent for more general prompts, where the cascading step helps pin down the user's intent.
This is an early proof of concept. The core pipeline works and produces good results, but expect rough edges:
- No streamlined installation process yet (you'll need to manually install PyTorch, transformers, diffusers, etc.)
- Models download on first run (~20GB total)
- Bugs and edge cases exist
- Requires GPU with ~20GB VRAM
A stable release with proper packaging and documentation is coming soon. For now, this is a research prototype.
If you want to try it anyway:
# Clone the repo
git clone https://github.com/SvenPfiffner/AutoEdit.git
cd AutoEdit
# Install dependencies (adjust for your CUDA version)
pip install streamlit pillow torch transformers diffusers accelerate
# Run the app
streamlit run src/autoedit/app.py
Models will download automatically on first run. Open the URL that appears, upload an image, and describe your edits naturally.
Author: Sven Pfiffner
Want to help improve this? Open an issue, or fork the repo and submit a pull request. All contributions welcome! 🙌
If you use this in commercial work, academic research, or public projects, please cite:
@software{pfiffner2025autoedit,
  author = {Pfiffner, Sven},
  title = {AutoEdit Studio: Cascaded Vision-Language Image Editing},
  year = {2025},
  url = {https://github.com/SvenPfiffner/AutoEdit}
}