Suho Ryu,
Kihyun Kim,
Eugene Baek,
Dongsoo Shin,
Joonseok Lee
GitHub | arXiv
HATIE is a comprehensive evaluation framework for objectively assessing text-guided image editing models. It introduces a large-scale, diverse benchmark set covering a wide range of editing tasks, and it employs an automated, multifaceted evaluation pipeline that aligns closely with human perception. Together, these enable scalable, reproducible, and precise benchmarking of image editing models.
# clone repository
git clone https://github.com/SuhoRyu/HATIE.git
cd HATIE

# download image set
git lfs install
git clone https://huggingface.co/SHRyu97/HATIE
unzip HATIE/HATIE_original_images.zip

Alternatively, you can download HATIE_original_images.zip and editable_objs_mask.pkl directly from HuggingFace. After downloading, unzip the archive (if needed) and place the contents in any location you like on your system.
The following instructions should work on most machines. However, depending on your specific environment and system configuration, you may need to install PyTorch and CUDA separately to ensure compatibility.
conda create -n hatie python=3.10
conda activate hatie
conda install pytorch=2.7.0 torchvision=0.22.0 torchaudio=2.7.0 cudatoolkit=11.8 -c pytorch -c conda-forge
pip install -r requirements.txt

The files queries/queries_w_remove.pkl and queries/queries_wo_remove.pkl contain all the information needed for the benchmark queries. Specifically, queries_w_remove.pkl includes object-removal queries, whereas queries_wo_remove.pkl excludes them. You can load the desired query file using the pickle module.
import pickle
with open('queries_w_remove.pkl', 'rb') as f:
    queries = pickle.load(f)

The queries are formatted as a dictionary containing lists of queries, structured as shown below.
{
'2373554':
[
{'type': 'obj_rep',
'original': ['3143517', 'person'],
'target': 'cup',
'id': 23847,
'original_caption': 'a young person stands on snowshoes.',
'target_caption': 'A cup stands on its snowshoes.',
'instruction': 'Replace the person with a cup.'},
{'type': ...
],
'2370790': ...
}

The keys of the outermost dictionary ('2373554', '2370790', etc.) represent the image IDs of the original images, corresponding directly to the filenames ('2373554.jpg', '2370790.jpg', etc.). Each key maps to a list of query dictionaries that should be applied to the respective original image. Within each query dictionary, you'll find generated text prompts tailored for your model. Specifically, 'original_caption' and 'target_caption' are intended for description-based models, while 'instruction' is suitable for instruction-based models. Each query dictionary also includes a unique query ID ('id') that identifies each query across the entire dataset.
Run your model across all original images and query prompts, saving the generated outputs into a single folder. You can freely choose the naming prefix, but ensure consistency throughout. Each edited image filename should end with "_{query ID}.jpg". For example, if your chosen prefix is "output_modelA", an appropriate filename would be "output_modelA_23847.jpg" (query ID = 23847 in the example above). While JPG is set as the default format in the provided code, you're free to select a different format if preferred.
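As a rough sketch (not part of the HATIE codebase), such an editing loop could look like the following. The edit_image function is a hypothetical placeholder for your model, and the folder names simply mirror the paths used in the scripts below; adapt both to your setup.

import os
import pickle

from PIL import Image

def edit_image(image, instruction):
    # Hypothetical stand-in for your editing model; replace with your own call.
    raise NotImplementedError

with open('queries/queries_wo_remove.pkl', 'rb') as f:
    queries = pickle.load(f)

prefix = 'output_modelA'              # your chosen filename prefix
original_dir = 'original_images'      # folder containing the unzipped original images
edited_dir = 'path/to/edited/images'  # where your model's outputs will be collected
os.makedirs(edited_dir, exist_ok=True)

for image_id, query_list in queries.items():
    original = Image.open(os.path.join(original_dir, f'{image_id}.jpg'))
    for query in query_list:
        # Instruction-based models take query['instruction']; description-based
        # models take query['original_caption'] and query['target_caption'] instead.
        edited = edit_image(original, query['instruction'])
        # Filenames must end with "_{query ID}.jpg", e.g. output_modelA_23847.jpg.
        edited.save(os.path.join(edited_dir, f"{prefix}_{query['id']}.jpg"))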
The first step in the HATIE pipeline involves locating and segmenting all necessary objects for evaluation from the edited images. Open the file segment.sh, which should appear as shown below.
#!/bin/bash
python evaluation/segment_targets.py \
--query_file "queries/queries_wo_remove.pkl" \ # The query file you used for editing
--prefix "output_modelA" \ # The file name prefix of edited images
--edited_images "path/to/edited/images" \ # path to the edited images
--outdir "outputs" \ # path to where you desire to save the benchmark results
--image_format "jpg" \ # format of the edited images
--use_gpu 1 # use GPU = 1, don't use GPU = 0

Set each option appropriately based on your run configuration, then execute the script as follows:
./segment.sh

The code will then save the resulting segmentation mask files into the output directory you specified.
The next step is the actual scoring phase. Using the segmentation mask files obtained from the previous step, the following script computes scores for each edited output image using all metrics included in HATIE. Open the score.sh file, which should appear as shown below.
#!/bin/bash
python evaluation/score.py \
--query_file "queries/queries_wo_remove.pkl" \ # The query file you used for editing
--prefix "output_modelA" \ # The file name prefix of edited images
--original_seg_file "HATIE/editable_objs_mask.pkl" \ # path to the original images' object segmentation mask file
--original_images "original_images" \ # path to original images
--edited_images "path/to/edited/images" \ # path to edited images
--outdir "outputs" \ # path to where you desire to save the benchmark results
--use_gpu 1 \ # use GPU = 1, don't use GPU = 0
--compute_err 1 # compute only the benchmark scores = 0, compute scores with error = 1

Set each option according to your run configuration, then execute the script as follows:
./score.sh

The code will log the output scores into the output directory you specified.
The final phase aggregates all the scores obtained during the scoring phase into final model scores. Open the aggregate.sh file, which should look like the example below.
#!/bin/bash
python evaluation/aggregate.py \
--query_file "queries/queries_wo_remove.pkl" \ # The query file you used for editing
--prefix "output_modelA" \ # The file name prefix of edited images
--outdir "outputs" \ # path to where you desire to save the benchmark results
--compute_errors 1 # compute only the benchmark scores = 0, compute scores with error = 1

Set each option appropriately for your run, then execute the script with the following command:
./aggregate.sh

The code will aggregate the scores of each output into final model scores and save the results in the output directory you specified.
You will receive two JSON files containing the final model scores:
- {output path}/scores/scores_{prefix}_total.json – This file contains the aggregated total score along with scores for the five evaluation criteria: object fidelity, object consistency, background fidelity, background consistency, and image quality.
- {output path}/scores/scores_{prefix}_qtype.json – This file provides scores aggregated separately for each query type: object add, replace, remove, attribute change, resize, background change, and style change.
Each score is represented as a list of two values: "[score, error]", where the first value is the score, and the second is the corresponding error. If the error option was disabled, the error values will be "null".
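As a quick illustration (not part of the HATIE codebase, and assuming the JSON files are flat mappings from criterion or query-type name to [score, error] as described above), the total scores could be inspected like this:

import json

prefix = 'output_modelA'
with open(f'outputs/scores/scores_{prefix}_total.json') as f:
    total_scores = json.load(f)

# Each entry is [score, error]; error is null (None) if error computation was disabled.
for criterion, (score, error) in total_scores.items():
    if error is None:
        print(f'{criterion}: {score:.4f}')
    else:
        print(f'{criterion}: {score:.4f} ± {error:.4f}')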
Unlike aggregate.sh, both segment.sh and score.sh are safe to interrupt and resume. To resume a process, simply re-run the corresponding script; the code will automatically continue from where it left off.
The editing and segmentation processes do not need to be fully completed before running the subsequent scripts. Each script will proceed using the available results and will exit with a message indicating any unfinished items. This means you can run parts of the HATIE pipeline multiple times during the (potentially time-consuming) editing process to save time on evaluation.
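For example, a small helper script (not part of HATIE; shown here only as a sketch) can report how many edited outputs are still missing before you launch a partial evaluation run:

import os
import pickle

prefix = 'output_modelA'
edited_dir = 'path/to/edited/images'

with open('queries/queries_wo_remove.pkl', 'rb') as f:
    queries = pickle.load(f)

# Collect every query ID defined in the benchmark.
all_ids = {q['id'] for query_list in queries.values() for q in query_list}

# An output counts as finished if "{prefix}_{query ID}.jpg" exists in the edited-image folder.
done_ids = {qid for qid in all_ids
            if os.path.exists(os.path.join(edited_dir, f'{prefix}_{qid}.jpg'))}

print(f'{len(done_ids)}/{len(all_ids)} edits finished, {len(all_ids) - len(done_ids)} remaining')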
This implementation of HATIE integrates several pre-existing tools and libraries. Specifically, it imports:
In addition, modified versions of
are embedded directly into the codebase. We acknowledge and appreciate the contributions of these original repositories.
@inproceedings{ryu2025towards,
title={Towards Scalable Human-aligned Benchmark for Text-guided Image Editing},
author={Ryu, Suho and Kim, Kihyun and Baek, Eugene and Shin, Dongsoo and Lee, Joonseok},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={18292--18301},
year={2025}
}