Skip to content

AutoEdit ✨ Natural-language image editing that keeps your subject and composition intact. By translating vague prompts into precise edits, AutoEdit delivers intuitive, targeted changes.

License

Notifications You must be signed in to change notification settings

SvenPfiffner/AutoEdit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

54 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AutoEdit ✨

Natural-language image editing through cascaded vision-language translation

A proof-of-concept exploring how vision-language models can bridge the gap between casual user prompts and precise image editing instructions.

The Problem

Image editing models like QWEN-Image-Edit work great with specific instructions ("add sepia tone, reduce saturation"), but struggle with how people actually talk ("make it vintage"). If you feed vague prompts directly to diffusion models, they tend to reimagine the entire scene instead of editing what's there—changing subjects, hallucinating elements, losing the original composition.

The Approach

This project uses a two-stage pipeline:

User Input → [JoyCaption Translation] → [QWEN Image Editing] → Output
  "make it vintage"  →  "add sepia tone, reduce     →  [edited image]
                         saturation, add film grain"

Stage 1 - JoyCaption (LLaVA-based): Looks at both your prompt and the actual image, then translates vague requests into 1-4 concrete, atomic edits. It's explicitly constrained to preserve faces, identities, composition, and pose unless you specifically ask to change them.

Why this matters: By breaking down abstract concepts into specific operations before diffusion, we prevent the model from going rogue. The edit modifies the image, not a reimagining of it. Subjects stay realistic, composition stays intact.

Stage 2 - QWEN-Image-Edit: Takes those specific instructions and applies them. Because it receives unambiguous directives, it can focus on targeted modifications while maintaining coherence.

Examples

Comparison between our cascaded approach and vanilla QWEN-Image-Edit (4-bit):

PROMPT SOURCE OURS QWEN Image-Edit 4bit
The woman looks lonely, add a friend next to her
Remove watermarks
Turn this kitten into a nerdy supervillain
The person occupying this table is really thirsty
Oh no! This woman seems to be cold, do something about it

Notice how our approach better preserves the original subject, composition, and realism while still applying the requested edits. This becomes especially apparent the more general the prompt becomes, where the cascading helps to determine the users intend

Current Status ⚠️

This is an early proof of concept. The core pipeline works and produces good results, but expect rough edges:

  • No streamlined installation process yet (you'll need to manually install PyTorch, transformers, diffusers, etc.)
  • Models download on first run (~20GB total)
  • Bugs and edge cases exist
  • Requires GPU with ~20GB VRAM

A stable release with proper packaging and documentation is coming soon. For now, this is a research prototype.

Running It

If you want to try it anyway:

# Clone the repo
git clone https://github.com/SvenPfiffner/AutoEdit.git
cd AutoEdit

# Install dependencies (adjust for your CUDA version)
pip install streamlit pillow torch transformers diffusers accelerate

# Run the app
streamlit run src/autoedit/app.py

Models will download automatically on first run. Open the URL that appears, upload an image, and describe your edits naturally.


Contributing & Citation

Author: Sven Pfiffner

Want to help improve this? Open an issue or fork the repo and submit a merge request. All contributions welcome! 🙌

If you use this in commercial work, academic research, or public projects, please cite:

@software{pfiffner2025autoedit,
  author = {Pfiffner, Sven},
  title = {AutoEdit Studio: Cascaded Vision-Language Image Editing},
  year = {2025},
  url = {https://github.com/SvenPfiffner/AutoEdit}
}

About

AutoEdit ✨ Natural-language image editing that keeps your subject and composition intact. By translating vague prompts into precise edits, AutoEdit delivers intuitive, targeted changes.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages