CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

CoMM is a high-quality dataset designed to improve the coherence, consistency, and alignment of multimodal content. It sources raw data from diverse origins, focusing on instructional content and visual storytelling to establish a strong foundation.

🔔 News

[07/31/2024] Our dataset and evaluation code are open-sourced!
[06/15/2024] Our paper is released on arXiv: https://arxiv.org/abs/2406.10462.

Dataset

Download the dataset from Google Drive.
Unzip the downloaded file and put three split data to ./datasets.
Use the following command to download the images of the dataset: bash scripts/download_images.sh

Environment Setup

conda create -n comm python=3.8
conda activate comm
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt

Evaluation

The format of the prediction results is shown in eval/example. And we provide the evaluation scripts for the four tasks in the CoMM dataset:

cd eval

results_path="/path/to/predict_results"
model_type="your model_name"

# Task1  Image-to-Text Sequence Generation
python -u eval_metric.py --predict_results_path ${results_path} --model_type ${model_type} --task_type task1 
python -u cal_gpt4o_score.py --predict_results_path ${results_path} --model_type ${model_type} --task_type task1 

# Task2  Text-to-Image Sequence Generation
python -u eval_metric.py --predict_results_path ${results_path} --model_type ${model_type} --task_type task2 
python -u cal_gpt4o_score.py --predict_results_path ${results_path} --model_type ${model_type} --task_type task2 

# Task3  Interleaved Image-Text Content Continuation
python -u eval_metric.py --predict_results_path ${results_path} --model_type ${model_type} --task_type task3
python -u cal_gpt4o_score.py --predict_results_path ${results_path} --model_type ${model_type} --task_type task3

# Task4   Question-based Interleaved Image-Text Generation
python -u eval_metric.py --predict_results_path ${results_path} --model_type ${model_type} --task_type task4
python -u cal_gpt4o_score.py --predict_results_path ${results_path} --model_type ${model_type} --task_type task4

TODO

Release the training and inference code
- Emu2
- SEED
- MiniGPT5

Citation

If you find this dataset useful, please cite our paper:

@article{chen2024comm,
  title={CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation},
  author={Chen, Wei and Li, Lin and Yang, Yongqi and Wen, Bin and Yang, Fan and Gao, Tingting and Wu, Yu and Chen, Long},
  journal={arXiv preprint arXiv:2406.10462},
  year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
assets		assets
datasets		datasets
eval		eval
models/MiniGPT-5		models/MiniGPT-5
scripts		scripts
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

🔔 News

Dataset

Environment Setup

Evaluation

TODO

Citation

About

Releases

Packages

Languages

HKUST-LongGroup/CoMM

Folders and files

Latest commit

History

Repository files navigation

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

🔔 News

Dataset

Environment Setup

Evaluation

TODO

Citation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages