# ConverSeg

This repository contains the official code for the paper *Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision*.
Authors: Aadarsh Sahoo, Georgia Gkioxari
- Setup
- Checkpoints
- Dataset Formats
- Training
- Evaluation
- DataEngine Quickstart
- Outputs
- Citation
- Acknowledgments
## Setup

Use the provided conda environment file:

```bash
git clone --recurse-submodules https://github.com/AadSah/ConverSeg.git
cd ConverSeg
conda env create -f converseg.yml
conda activate converseg
```

## Checkpoints

Download the released checkpoints from Hugging Face:
These are raw checkpoint files for this repository (for example, `.torch` weights and LoRA adapter directories), not models in the Hugging Face `from_pretrained` format.

```bash
git lfs install
git clone https://huggingface.co/aadarsh99/ConverSeg-Net-3B ./checkpoints/ConverSeg-Net-3B
```

Run inference with `demo.py` by pointing to the downloaded checkpoint paths:

```bash
python demo.py \
--final_ckpt ./checkpoints/ConverSeg-Net-3B/ConverSeg-Net_sam2_90000.torch \
--plm_ckpt ./checkpoints/ConverSeg-Net-3B/ConverSeg-Net_plm_90000.torch \
--lora_ckpt ./checkpoints/ConverSeg-Net-3B/lora_plm_adapter_90000 \
--model_cfg sam2_hiera_l.yaml \
--base_ckpt /path/to/sam2_hiera_large.pt \
--image /path/to/image.jpg \
--prompt "the left-most person" \
--device cuda \
--out_dir ./demo_outputs
```

You can also run `demo.py` in interactive mode by omitting `--image` and `--prompt`.
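To run the demo over a whole directory, one option is a small wrapper that invokes `demo.py` once per image (a sketch; the checkpoint paths and prompt simply mirror the command above, and reloading the model per image is slow but simple):

```python
import subprocess
from pathlib import Path

CKPT_DIR = Path("./checkpoints/ConverSeg-Net-3B")

# Invoke demo.py per image; flags mirror the single-image command above.
for image in sorted(Path("/path/to/images").glob("*.jpg")):
    subprocess.run(
        [
            "python", "demo.py",
            "--final_ckpt", str(CKPT_DIR / "ConverSeg-Net_sam2_90000.torch"),
            "--plm_ckpt", str(CKPT_DIR / "ConverSeg-Net_plm_90000.torch"),
            "--lora_ckpt", str(CKPT_DIR / "lora_plm_adapter_90000"),
            "--model_cfg", "sam2_hiera_l.yaml",
            "--base_ckpt", "/path/to/sam2_hiera_large.pt",
            "--image", str(image),
            "--prompt", "the left-most person",
            "--device", "cuda",
            "--out_dir", "./demo_outputs",
        ],
        check=True,
    )
```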
## Dataset Formats

`train.py` reads a JSONL file (default: `dataset.jsonl`) inside `--data-dir`.
Each line contains:

- `image`: image path (relative to `--data-dir` or absolute)
- `mask_merged`: segmentation mask path (relative or absolute)
- `prompt`: optional text prompt
Example:
{"image":"images/0001.jpg","mask_merged":"masks/0001.png","prompt":"the left-most person"}Minimal layout:
my_data/
  dataset.jsonl
  images/
  masks/
```
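If your images and masks live in paired folders, a minimal sketch for generating `dataset.jsonl` (assumes masks share basenames with images; the fixed prompt is just a placeholder):

```python
import json
from pathlib import Path

data_dir = Path("my_data")

with (data_dir / "dataset.jsonl").open("w") as out:
    for image_path in sorted((data_dir / "images").glob("*.jpg")):
        mask_path = data_dir / "masks" / f"{image_path.stem}.png"
        if not mask_path.exists():
            continue  # skip images without a merged mask
        record = {
            # Paths are written relative to --data-dir, per the format above.
            "image": str(image_path.relative_to(data_dir)),
            "mask_merged": str(mask_path.relative_to(data_dir)),
            "prompt": "the left-most person",  # optional; replace per image
        }
        out.write(json.dumps(record) + "\n")
```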
When using `--input_json`, `eval.py` expects:

```json
{
"dataset": "chunk_01",
"count": 2,
"items": [
{"id":"0001","image":"images/0001.jpg","mask":"masks/0001.png","prompt":"the left-most person"},
{"id":"0002","image":"images/0002.jpg","mask":"masks/0002.png","prompt":"a red object near the chair"}
]
}
```

TODO:

- Release training data.
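To reuse a training-format JSONL for evaluation, a small conversion sketch (the `dataset` name and zero-padded ids are illustrative choices, not requirements of `eval.py`):

```python
import json
from pathlib import Path

data_dir = Path("my_data")
items = []

with (data_dir / "dataset.jsonl").open() as f:
    for i, line in enumerate(f, start=1):
        record = json.loads(line)
        items.append({
            "id": f"{i:04d}",               # illustrative zero-padded id
            "image": record["image"],
            "mask": record["mask_merged"],  # eval format names this field "mask"
            "prompt": record.get("prompt", ""),
        })

payload = {"dataset": "my_data", "count": len(items), "items": items}
Path("items.json").write_text(json.dumps(payload, indent=2))
```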
## Training

```bash
python train.py \
--data-dir /path/to/my_data \
--dataset-jsonl dataset.jsonl \
--model-cfg sam2_hiera_s.yaml \
--checkpoint /path/to/sam2_hiera_small.pt \
--device cuda \
--steps 3000 \
--batch-size 4 \
--acc 4 \
--lr 1e-4 \
--wd 1e-4 \
--out ./ckpts \
--name ConverSeg_sam2
```

Notes:
- Checkpoints and LoRA adapters are written under `--out`.
- TensorBoard logs are written to `--out/tb/<name>`.
## Evaluation

`eval.py` supports two modes: evaluating on the released Hugging Face dataset (`--hf_dataset`) or on a local JSON file (`--input_json`).

Hugging Face dataset mode:

```bash
python eval.py \
--final_ckpt ./ckpts/ConverSeg_sam2_final.torch \
--plm_ckpt ./ckpts/ConverSeg_sam2_plm_final.torch \
--lora_ckpt ./ckpts/lora_plm_adapter_final \
--model_cfg sam2_hiera_l.yaml \
--base_ckpt /path/to/sam2_hiera_large.pt \
--hf_dataset aadarsh99/ConverSeg \
--hf_config default \
--hf_splits sam_seeded,human_annotated \
--device cuda \
--save_preds ./preds_hf
```

Local JSON mode:

```bash
python eval.py \
--input_json /path/to/items.json \
--final_ckpt ./ckpts/ConverSeg_sam2_final.torch \
--plm_ckpt ./ckpts/ConverSeg_sam2_plm_final.torch \
--lora_ckpt ./ckpts/lora_plm_adapter_final \
--model_cfg sam2_hiera_l.yaml \
--base_ckpt /path/to/sam2_hiera_large.pt \
--device cuda \
--save_preds ./preds_json
```
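To inspect the released evaluation data before running `eval.py`, you can load it with the `datasets` library (a sketch; the config and split names simply mirror the flags above, and the field names depend on the dataset schema):

```python
from datasets import load_dataset

# Config and split names mirror --hf_config and --hf_splits above.
ds = load_dataset("aadarsh99/ConverSeg", "default", split="sam_seeded")

print(ds)            # number of rows and feature schema
print(ds[0].keys())  # field names of a single example
```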
## DataEngine Quickstart

Generate conversational supervision from raw images, then export it into ConverSeg train/eval formats.

Install the extra dependency:

```bash
pip install google-genai
```

Set the environment variable:

```bash
export GOOGLE_API_KEY=<your_key>
```

Run generation:

```bash
python dataengine/run.py \
--input /path/to/image_or_dir \
--config sam2.1_hiera_l.yaml \
--checkpoint /path/to/sam2.1_hiera_large.pt \
--output_dir /path/to/dataengine_runs
```

Export for training/evaluation:

```bash
python dataengine/tools/export_dataset.py \
--runs_root /path/to/dataengine_runs \
--out_dir /path/to/ConverSeg_export \
--mode both \
--path_mode relative
```

See `dataengine/README.md` for full schemas and failure modes.
## Outputs

From `train.py` (`--out`):
- `<name>_<step>.torch`: SAM2 checkpoints
- `<name>_plm_<step>.torch`: language adapter checkpoints
- `lora_plm_adapter_<step>/`: LoRA adapter snapshots
- `tb/<name>/`: TensorBoard logs
- `val/step_<step>.png`: validation panels
From `eval.py` (`--save_preds`):
- `*_pred.png`: predicted masks
- `*_gt.png`: GT masks
- `*_orig.png`: source images
- `*_panel.png`: overlays
- `*_prompt.txt`: prompt text
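To score saved predictions offline, a minimal IoU sketch over the `*_pred.png` / `*_gt.png` pairs (assumes masks are saved as single-channel, binary PNGs):

```python
from pathlib import Path

import numpy as np
from PIL import Image

def load_mask(path: Path) -> np.ndarray:
    """Load a mask PNG as a boolean array (assumes binary 0/255 values)."""
    return np.array(Image.open(path).convert("L")) > 127

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return np.logical_and(pred, gt).sum() / union

pred_dir = Path("./preds_json")  # the directory passed to --save_preds
scores = [
    iou(load_mask(p), load_mask(p.with_name(p.name.replace("_pred", "_gt"))))
    for p in sorted(pred_dir.glob("*_pred.png"))
]
print(f"mean IoU over {len(scores)} pairs: {np.mean(scores):.4f}")
```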
## Citation

```bibtex
@misc{sahoo2026conversationalimagesegmentationgrounding,
title = {Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision},
author = {Aadarsh Sahoo and Georgia Gkioxari},
year = {2026},
eprint = {2602.13195},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2602.13195},
}
```

For any questions or issues, please open a GitHub issue or contact Aadarsh. Thank you for your interest in our work!
## Acknowledgments

ConverSeg builds on SAM2.
