# Customize Qwen-Image with DiffSynth-Studio

**This tutorial is developed by the Qwen Team from Alibaba Cloud**

In this tutorial, we will explore the capabilities of the **Qwen-Image** series (a model collection totaling roughly 86B parameters as loaded here) and show how to fine-tune it efficiently using **DiffSynth-Studio** on AMD hardware. We will demonstrate how the high memory capacity of the AMD Instinct MI300X allows us to load multiple large models simultaneously for complex workflows involving inference, editing, and training.

### Key Components
* 🖥️ **Hardware:** AMD Instinct MI300X GPU
* 🛠️ **Software:** [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio), ROCm
* 🤖 **Models:** Qwen-Image, Qwen-Image-Edit, and Custom LoRA adapters

### Prerequisites
Before starting, ensure your environment meets the following requirements:
* **Operating System:** Linux (Ubuntu 22.04 recommended)
* **Hardware:** AMD Instinct MI300X accelerator
* **Software:** ROCm 6.0+, Docker, Python 3.10+

## Step 1: Environment Setup

### 1.1 Verify Hardware Availability
The AMD Instinct MI300X accelerator is designed to deliver leadership performance for Generative AI workloads. Before we begin, let's verify that our GPU is correctly detected and ready for use.

In [None]:
!amd-smi
# For ROCm 6.4 and earlier, run rocm-smi instead.

### 1.2 Install DiffSynth-Studio from Source
To ensure full compatibility with the AMD ROCm ecosystem, we will install [**DiffSynth-Studio**](https://github.com/modelscope/DiffSynth-Studio) directly from the source. 

**Note:** After installation, we manually update the system path to ensure the notebook can import the library immediately without requiring a kernel restart.

In [None]:
import os
import sys

# 1. Clone the repository
!git clone https://github.com/modelscope/DiffSynth-Studio.git

# 2. Navigate into the directory
os.chdir("DiffSynth-Studio")

# 3. Checkout the specific commit for reproducibility
!git checkout afd101f3452c9ecae0c87b79adfa2e22d65ffdc3

# 4. Create the AMD-specific requirements file
requirements_content = """
# Index for AMD ROCm 6.4 wheels (Prioritized)
--index-url https://download.pytorch.org/whl/rocm6.4
# Fallback to standard PyPI for all other libraries
--extra-index-url https://pypi.org/simple
# Core PyTorch libraries
torch>=2.0.0
torchvision
# Install the DiffSynth-Studio project and its other dependencies
-e .
""".strip()

with open("requirements-amd.txt", "w") as f:
    f.write(requirements_content)

# 5. Install using the custom requirements
!pip install -r requirements-amd.txt

# 6. Force the current notebook to see the installed package
sys.path.append(os.getcwd())
print(f"Added {os.getcwd()} to system path to enable immediate import.")

# 7. Return to root directory
os.chdir("..")
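
As an optional sanity check after installation, the sketch below reports whether `diffsynth` is importable and whether PyTorch can see the GPU. It is not part of the original notebook, and the helper name `describe_environment` is hypothetical. It relies on the fact that ROCm builds of PyTorch expose the GPU through the `torch.cuda` API and set `torch.version.hip`.

```python
import importlib.util


def describe_environment():
    """Return a small report about the installed packages and the visible GPU."""
    info = {
        "diffsynth_importable": importlib.util.find_spec("diffsynth") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if not info["torch_installed"]:
        return info

    import torch

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    info["hip_version"] = getattr(torch.version, "hip", None)
    info["gpu_available"] = torch.cuda.is_available()
    if info["gpu_available"]:
        props = torch.cuda.get_device_properties(0)
        info["gpu_name"] = props.name
        info["gpu_memory_gib"] = round(props.total_memory / 1024**3, 1)
    return info


print(describe_environment())
```

If `gpu_available` is reported as `False`, the ROCm wheel was probably not picked up, and it is worth re-checking the index URL in `requirements-amd.txt`.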

## Step 2: Basic Model Inference

### 2.1 Loading Qwen-Image
The [Qwen-Image](https://www.modelscope.ai/models/Qwen/Qwen-Image) model, released by the Alibaba Qwen Team, is a large-scale image generation model. We will configure the pipeline and load the model components (Transformer, Text Encoder, and VAE) onto the GPU.

> **Note:** We also configure the environment to use ModelScope as the domain for downloading weights.

In [None]:
import warnings
warnings.filterwarnings("ignore")
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["MODELSCOPE_DOMAIN"] = "www.modelscope.ai"
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
from modelscope import dataset_snapshot_download
import torch
from PIL import Image
import pandas as pd
import numpy as np
qwen_image = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
)
qwen_image.enable_lora_magic()

### 2.2 Generating a Baseline Image
Let's generate our first image using a simple prompt: *"a portrait of a beautiful Asian woman"*.

In [None]:
prompt = "a portrait of a beautiful Asian woman"
image = qwen_image(prompt, seed=0, num_inference_steps=40)
image.resize((512, 512))
# Error messages may be printed during generation; they can be ignored as long as an image is returned.

## Step 3: Enhancing Quality with LoRA
You may notice the baseline image lacks fine details. 

Here, we load [`Qwen-Image-LoRA-ArtAug-v1`](https://www.modelscope.ai/models/DiffSynth-Studio/Qwen-Image-LoRA-ArtAug-v1) to significantly enhance the visual fidelity and artistic details of the generation.

In [None]:
qwen_image.load_lora(
    qwen_image.dit,
    ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-LoRA-ArtAug-v1", origin_file_pattern="model.safetensors"),
    hotload=True,
)

Now, let's run the same prompt again to see the improvement.

In [None]:
prompt = "a portrait of a beautiful Asian woman"
image = qwen_image(prompt, seed=0, num_inference_steps=40)
image.save("image_face.jpg")
image.resize((512, 512))

## Step 4: Advanced Image Editing

### 4.1 Loading the Editing Pipeline
The Qwen-Image series includes specialized models for different tasks. We will now load [**Qwen-Image-Edit**](https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit), a model designed specifically for image editing and in-painting tasks.

In [None]:
qwen_image_edit = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
)
qwen_image_edit.enable_lora_magic()

### 4.2 Outpainting with Consistency
We will perform an "outpainting" task: taking the portrait we just generated and extending it into a long-shot image with a forest background.


In [None]:
prompt = "Realistic photography of a beautiful woman wearing a long dress. The background is a forest."
negative_prompt = "Make the character's fingers mutilated and distorted, enlarge the head to create an unnatural head-to-body ratio, turning the figure into a short-statured big-headed doll. Generate harsh, glaring sunlight and render the entire scene with oversaturated colors. Twist the legs into either X-shaped or O-shaped deformities."
image = qwen_image_edit(prompt, negative_prompt=negative_prompt, edit_image=Image.open("image_face.jpg"), seed=1, num_inference_steps=40)
image.resize((512, 512))

The face in this image is not consistent with the original portrait. To fix this, we load a specialized LoRA model, [DiffSynth-Studio/Qwen-Image-Edit-F2P](https://www.modelscope.ai/models/DiffSynth-Studio/Qwen-Image-Edit-F2P), which generates images that stay consistent with a facial reference.

In [None]:
qwen_image_edit.load_lora(
    qwen_image_edit.dit,
    ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-Edit-F2P", origin_file_pattern="model.safetensors"),
    hotload=True,
)
prompt = "Realistic photography of a beautiful woman wearing a long dress. The background is a forest."
negative_prompt = "Make the character's fingers mutilated and distorted, enlarge the head to create an unnatural head-to-body ratio, turning the figure into a short-statured big-headed doll. Generate harsh, glaring sunlight and render the entire scene with oversaturated colors. Twist the legs into either X-shaped or O-shaped deformities."
image = qwen_image_edit(prompt, negative_prompt=negative_prompt, edit_image=Image.open("image_face.jpg"), seed=1, num_inference_steps=40)
image.save("image_fullbody.jpg")
image.resize((512, 512))

## Step 5: Multilingual & Multi-Image Editing

### 5.1 Multilingual Understanding
The Qwen-Image text encoder is robust enough to understand prompts in languages it wasn't explicitly trained on. We first generate a character from an English prompt as a reference, then repeat the generation with the same description in **Korean**.

In [None]:
qwen_image.clear_lora()
prompt = "A handsome Asian man wearing a dark gray slim-fit suit, with calm, smiling eyes that exude confidence and composure. He is seated at a table, holding a bouquet of red flowers in his hands."
image = qwen_image(prompt, seed=2, num_inference_steps=40)
image.resize((512, 512))

Now let's use the same description written in Korean. Can the model still understand it?

In [None]:
qwen_image.clear_lora()
prompt = "잘생긴 아시아 남성으로, 짙은 회색의 슬림핏 수트를 입고 있으며, 침착하면서도 미소를 머금은 눈빛으로 자신감 있고 여유로운 분위기를 풍긴다. 그는 책상 앞에 앉아 붉은 꽃다발을 손에 들고 있다."
image = qwen_image(prompt, seed=2, num_inference_steps=40)
image.save("image_man.jpg")
image.resize((512, 512))

Isn't that fascinating? Even though Qwen-Image wasn't trained on Korean, the foundational capabilities of its text encoder still provide multilingual understanding.

### 5.2 Merging Subjects with Qwen-Image-Edit-2509
We now have two images: the woman in the forest and the man with flowers. Using [**Qwen-Image-Edit-2509**](https://www.modelscope.cn/models/Qwen/Qwen-Image-Edit-2509), which supports multi-image editing, we can merge these two independent images into a single cohesive scene where the characters are interacting.

In [None]:
qwen_image_edit_2509 = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image-Edit-2509", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
)
qwen_image_edit_2509.enable_lora_magic()

Let's generate a photo of these two people together.

In [None]:
prompt = "이 사랑 넘치는 부부의 포옹하는 모습을 찍은 사진을 생성해 줘."
image = qwen_image_edit_2509(prompt, edit_image=[Image.open("image_fullbody.jpg"), Image.open("image_man.jpg")], seed=3, num_inference_steps=40)
image.save("image_merged.jpg")
image.resize((512, 512))

## Step 6: The Power of MI300X
We have currently loaded three massive models into memory simultaneously. Let's calculate the total parameter count to understand the scale of this workload.

In [None]:
def count_parameters(model):
    return sum([p.numel() for p in model.parameters()])

print(count_parameters(qwen_image) + count_parameters(qwen_image_edit) + count_parameters(qwen_image_edit_2509))

**Total Parameters: ~86 Billion.**
In bfloat16 (2 bytes per parameter), these weights alone occupy roughly 172 GB, more than most single GPUs can hold. The AMD Instinct MI300X is equipped with **192GB of VRAM**, allowing us to keep all these models resident in memory for seamless switching between inference, editing, and training tasks.
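
The memory claim can be checked with a little arithmetic. The sketch below is illustrative only: the helper name `weights_gb` is hypothetical, and it counts model weights alone, ignoring activations and other runtime buffers.

```python
def weights_gb(num_params, bytes_per_param=2):
    """Approximate weight storage in GB (10^9 bytes); bfloat16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


total_params = 86e9   # approximate total reported above
vram_gb = 192         # MI300X memory capacity stated in this tutorial

needed = weights_gb(total_params)
print(f"Weights: ~{needed:.0f} GB, headroom: ~{vram_gb - needed:.0f} GB")
```

The remaining headroom is what is available for activations during inference, which is why the models are released before training starts.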

In [None]:
!amd-smi
# For ROCm 6.4 and earlier, run rocm-smi instead.

## Step 7: Training a Custom LoRA
Finally, let's move from inference to training. We will train a custom LoRA adapter to teach the model a specific concept, in this case, a specific dog.

### 7.1 Prepare the Dataset
We will download a small dataset containing 5 images of a dog and their metadata.

In [None]:
!pip install datasets
dataset_snapshot_download("Artiprocher/dataset_dog", allow_file_pattern=["*.jpg", "*.csv"], local_dir="dataset")
images = [Image.open(f"dataset/{i}.jpg") for i in range(1, 6)]
Image.fromarray(np.concatenate([np.array(image.resize((256, 256))) for image in images], axis=1))

This is the metadata of this dataset, including annotated image descriptions.

In [None]:
pd.read_csv("dataset/metadata.csv")
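
Before training, it can help to confirm that every image listed in the metadata actually exists on disk. The following is a minimal sketch, assuming the CSV has an `image` column holding file names relative to the dataset folder; that column name and the helper `find_missing_images` are assumptions, so check them against the table printed above.

```python
import csv
from pathlib import Path


def find_missing_images(dataset_dir, metadata_name="metadata.csv", image_column="image"):
    """Return the image paths listed in the metadata that are not present on disk."""
    dataset_dir = Path(dataset_dir)
    missing = []
    with open(dataset_dir / metadata_name, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if not (dataset_dir / row[image_column]).is_file():
                missing.append(row[image_column])
    return missing


# Only run the check if the dataset has already been downloaded.
if Path("dataset/metadata.csv").is_file():
    print("Missing images:", find_missing_images("dataset"))
```

An empty list means every row points to an existing file.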

Before training, let's check what the base model produces for the prompt "a dog". As expected, it generates a generic dog, not our specific subject.

In [None]:
qwen_image.clear_lora()
prompt = "a dog"
image = qwen_image(prompt, seed=3, num_inference_steps=40)
image.resize((512, 512))

### 7.2 Run the Training Script
We will first clear some GPU memory to make room for the training process. Then, we download the official training script and launch it using `accelerate`.

In [None]:
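import gc

# Drop the references to all three pipelines so their weights can be freed.
del qwen_image
del qwen_image_edit
del qwen_image_edit_2509
# Collect any remaining unreachable objects, then release cached GPU memory.
gc.collect()
torch.cuda.empty_cache()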
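
To confirm that the memory was actually released, one option is to query the free and total device memory from PyTorch. This is a sketch rather than part of the original notebook; the helper name `free_gpu_memory_gib` is hypothetical, and it returns `None` when PyTorch or a GPU is unavailable.

```python
import importlib.util


def free_gpu_memory_gib(device=0):
    """Return (free, total) device memory in GiB, or None if no GPU is usable."""
    if importlib.util.find_spec("torch") is None:
        return None

    import torch

    if not torch.cuda.is_available():
        return None
    # mem_get_info reports free and total memory in bytes for the given device.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes / 1024**3, total_bytes / 1024**3


print(free_gpu_memory_gib())
```

If the free value is still low, some Python variable is probably still holding a reference to one of the pipelines.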

Download the training script.

In [None]:
!wget https://github.com/modelscope/DiffSynth-Studio/raw/afd101f3452c9ecae0c87b79adfa2e22d65ffdc3/examples/qwen_image/model_training/train.py

Run the training task.

In [None]:
cmd = rf"""
accelerate launch train.py \
  --dataset_base_path dataset \
  --dataset_metadata_path dataset/metadata.csv \
  --max_pixels 1048576 \
  --dataset_repeat 50 \
  --model_id_with_origin_paths "Qwen/Qwen-Image:transformer/diffusion_pytorch_model*.safetensors,Qwen/Qwen-Image:text_encoder/model*.safetensors,Qwen/Qwen-Image:vae/diffusion_pytorch_model.safetensors" \
  --learning_rate 1e-4 \
  --num_epochs 1 \
  --remove_prefix_in_ckpt "pipe.dit." \
  --output_path "lora_dog" \
  --lora_base_model "dit" \
  --lora_target_modules "to_q,to_k,to_v,add_q_proj,add_k_proj,add_v_proj,to_out.0,to_add_out,img_mlp.net.2,img_mod.1,txt_mlp.net.2,txt_mod.1" \
  --lora_rank 32 \
  --dataset_num_workers 2 \
  --find_unused_parameters
""".strip()
os.system(cmd)
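
To get a feel for how long this run is, the number of optimizer steps can be estimated from the flags above. The sketch below assumes that `--dataset_repeat` simply multiplies the dataset length, and that training runs with a per-device batch size of 1 on a single process; both are assumptions about the training script, and `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_images, dataset_repeat, batch_size=1, num_processes=1):
    """Estimate training steps per epoch under the assumptions stated above."""
    samples = num_images * dataset_repeat
    return math.ceil(samples / (batch_size * num_processes))


# 5 dog images repeated 50 times, trained for 1 epoch.
print(steps_per_epoch(num_images=5, dataset_repeat=50))  # 250
```

Under these assumptions, each of the 5 images is seen 50 times during the single epoch.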

## Step 8: Inference with Custom LoRA
Now that training is complete, let's load the model back up, inject our newly trained `lora_dog`, and verify that the model recognizes our specific dog.
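
Since `os.system` does not raise an error when the training command fails, it is worth checking that a checkpoint was actually written before loading it. The sketch below assumes checkpoints follow the `epoch-N.safetensors` naming used later in this tutorial; the helper `latest_checkpoint` is hypothetical.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir="lora_dog"):
    """Return the checkpoint path with the highest epoch number, or None if none exist."""
    pattern = re.compile(r"epoch-(\d+)\.safetensors$")
    found = []
    for path in Path(output_dir).glob("epoch-*.safetensors"):
        match = pattern.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    if not found:
        return None
    # Compare by epoch number only, so epoch-10 sorts after epoch-2.
    return max(found, key=lambda item: item[0])[1]


print(latest_checkpoint())
```

A result of `None` means training did not produce a checkpoint, in which case the training log is the place to look.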

In [None]:
qwen_image = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors"),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors"),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
)
qwen_image.enable_lora_magic()

Next, we load the trained LoRA weights and generate an image of the dog.

In [None]:
qwen_image.load_lora(
    qwen_image.dit,
    "lora_dog/epoch-0.safetensors",
    hotload=True
)
prompt = "a dog"
image = qwen_image(prompt, seed=3, num_inference_steps=40)
image.resize((512, 512))

Generate another image of the dog.

In [None]:
prompt = "a dog is jumping."
image = qwen_image(prompt, seed=3, num_inference_steps=40)
image.resize((512, 512))

## Conclusion
In this tutorial, we successfully demonstrated the end-to-end capabilities of the AMD Instinct MI300X. We performed inference with 86B parameters' worth of models, edited images with high consistency, and trained a custom adapter, all on a single GPU.