Conversational Image Segmentation

This repository contains the official code for the paper Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision.
Authors: Aadarsh Sahoo, Georgia Gkioxari

Project Page · arXiv Paper · Dataset · Hugging Face Space

(Figure: ConverSeg teaser)


Table of Contents

  1. Setup
  2. Checkpoints
  3. Dataset Formats
  4. Training
  5. Evaluation
  6. DataEngine Quickstart
  7. Outputs
  8. Citation
  9. Acknowledgments

1. Setup

Clone the repository and create the conda environment from the provided environment file:

git clone --recurse-submodules https://github.com/AadSah/ConverSeg.git
cd ConverSeg
conda env create -f converseg.yml
conda activate converseg

2. Checkpoints

Download the released checkpoints from Hugging Face:

These are raw checkpoint files consumed by this repository (.torch weights and LoRA adapter directories), not models in the Hugging Face from_pretrained format.

git lfs install
git clone https://huggingface.co/aadarsh99/ConverSeg-Net-3B ./checkpoints/ConverSeg-Net-3B

Run inference with demo.py by pointing to the downloaded checkpoint paths:

python demo.py \
  --final_ckpt ./checkpoints/ConverSeg-Net-3B/ConverSeg-Net_sam2_90000.torch \
  --plm_ckpt ./checkpoints/ConverSeg-Net-3B/ConverSeg-Net_plm_90000.torch \
  --lora_ckpt ./checkpoints/ConverSeg-Net-3B/lora_plm_adapter_90000 \
  --model_cfg sam2_hiera_l.yaml \
  --base_ckpt /path/to/sam2_hiera_large.pt \
  --image /path/to/image.jpg \
  --prompt "the left-most person" \
  --device cuda \
  --out_dir ./demo_outputs

You can also run demo.py in interactive mode by omitting --image and --prompt.


3. Dataset Formats

Training format (dataset.jsonl)

train.py reads a JSONL file (default: dataset.jsonl) inside --data-dir.

Each line is a JSON object with the following fields:

  • image: image path (relative to --data-dir or absolute)
  • mask_merged: segmentation mask path (relative or absolute)
  • prompt: optional text prompt

Example:

{"image":"images/0001.jpg","mask_merged":"masks/0001.png","prompt":"the left-most person"}

Minimal layout:

my_data/
  dataset.jsonl
  images/
  masks/
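
As a convenience, below is a minimal Python sketch for writing dataset.jsonl in this format. Pairing images and masks by filename stem and the example prompt are illustrative assumptions, not part of the repository's tooling.

# make_dataset_jsonl.py -- minimal sketch; pairs images and masks by filename stem
# (an assumption for illustration); the prompt below is a placeholder.
import json
from pathlib import Path

data_dir = Path("my_data")
with (data_dir / "dataset.jsonl").open("w") as f:
    for img in sorted((data_dir / "images").glob("*.jpg")):
        mask = data_dir / "masks" / f"{img.stem}.png"
        if not mask.exists():
            continue  # skip images without a merged mask
        record = {
            "image": f"images/{img.name}",        # relative to --data-dir
            "mask_merged": f"masks/{mask.name}",  # relative to --data-dir
            "prompt": "the left-most person",     # replace with your prompt
        }
        f.write(json.dumps(record) + "\n")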

Evaluation JSON format (items.json)

When using --input_json, eval.py expects:

{
  "dataset": "chunk_01",
  "count": 2,
  "items": [
    {"id":"0001","image":"images/0001.jpg","mask":"masks/0001.png","prompt":"the left-most person"},
    {"id":"0002","image":"images/0002.jpg","mask":"masks/0002.png","prompt":"a red object near the chair"}
  ]
}
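
A similar sketch can assemble items.json; the dataset name, ids, and prompts below are placeholders derived from filenames, not values expected by eval.py.

# make_items_json.py -- minimal sketch for the evaluation JSON format
import json
from pathlib import Path

root = Path("my_data")
items = []
for img in sorted((root / "images").glob("*.jpg")):
    items.append({
        "id": img.stem,
        "image": f"images/{img.name}",
        "mask": f"masks/{img.stem}.png",
        "prompt": "describe the target region here",  # placeholder
    })
payload = {"dataset": "my_eval_chunk", "count": len(items), "items": items}
(root / "items.json").write_text(json.dumps(payload, indent=2))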

📋 TODO

  • Release training data.

4. Training

python train.py \
  --data-dir /path/to/my_data \
  --dataset-jsonl dataset.jsonl \
  --model-cfg sam2_hiera_s.yaml \
  --checkpoint /path/to/sam2_hiera_small.pt \
  --device cuda \
  --steps 3000 \
  --batch-size 4 \
  --acc 4 \
  --lr 1e-4 \
  --wd 1e-4 \
  --out ./ckpts \
  --name ConverSeg_sam2

Notes:

  • Checkpoints and LoRA adapters are written under --out.
  • TensorBoard logs are written to --out/tb/<name>.

5. Evaluation

eval.py supports two modes.

Hugging Face mode (default)

python eval.py \
  --final_ckpt ./ckpts/ConverSeg_sam2_final.torch \
  --plm_ckpt ./ckpts/ConverSeg_sam2_plm_final.torch \
  --lora_ckpt ./ckpts/lora_plm_adapter_final \
  --model_cfg sam2_hiera_l.yaml \
  --base_ckpt /path/to/sam2_hiera_large.pt \
  --hf_dataset aadarsh99/ConverSeg \
  --hf_config default \
  --hf_splits sam_seeded,human_annotated \
  --device cuda \
  --save_preds ./preds_hf

JSON mode

python eval.py \
  --input_json /path/to/items.json \
  --final_ckpt ./ckpts/ConverSeg_sam2_final.torch \
  --plm_ckpt ./ckpts/ConverSeg_sam2_plm_final.torch \
  --lora_ckpt ./ckpts/lora_plm_adapter_final \
  --model_cfg sam2_hiera_l.yaml \
  --base_ckpt /path/to/sam2_hiera_large.pt \
  --device cuda \
  --save_preds ./preds_json

6. DataEngine Quickstart

Generate conversational supervision from raw images, then export into ConverSeg train/eval formats.

Install the extra dependency:

pip install google-genai

Set the environment variable:

export GOOGLE_API_KEY=<your_key>

Run generation:

python dataengine/run.py \
  --input /path/to/image_or_dir \
  --config sam2.1_hiera_l.yaml \
  --checkpoint /path/to/sam2.1_hiera_large.pt \
  --output_dir /path/to/dataengine_runs

Export for training/evaluation:

python dataengine/tools/export_dataset.py \
  --runs_root /path/to/dataengine_runs \
  --out_dir /path/to/ConverSeg_export \
  --mode both \
  --path_mode relative

See dataengine/README.md for full schemas and failure modes.
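
Before training on an export, it can help to sanity-check that every referenced file exists. The sketch below assumes the export places a training-format dataset.jsonl at the top of --out_dir; adjust the path to match the layout documented in dataengine/README.md.

# check_export.py -- sanity-check sketch; the dataset.jsonl location is an assumption
import json
from pathlib import Path

export_dir = Path("/path/to/ConverSeg_export")
missing, total = 0, 0
with (export_dir / "dataset.jsonl").open() as f:
    for line in f:
        rec = json.loads(line)
        total += 1
        for key in ("image", "mask_merged"):
            p = Path(rec[key])
            if not p.is_absolute():
                p = export_dir / p
            if not p.exists():
                missing += 1
print(f"{total} records checked, {missing} missing files")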


7. Outputs

From train.py (--out):

  • <name>_<step>.torch: SAM2 checkpoints
  • <name>_plm_<step>.torch: language adapter checkpoints
  • lora_plm_adapter_<step>/: LoRA adapter snapshots
  • tb/<name>/: TensorBoard logs
  • val/step_<step>.png: validation panels

From eval.py (--save_preds):

  • *_pred.png: predicted masks
  • *_gt.png: GT masks
  • *_orig.png: source images
  • *_panel.png: overlays
  • *_prompt.txt: prompt text
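
As one way to consume these files, the sketch below computes a rough foreground IoU between each *_pred.png and its matching *_gt.png. It assumes both are single-channel masks with nonzero pixels as foreground; the actual mask encoding may differ.

# eval_iou.py -- rough foreground IoU over saved predictions (assumed mask encoding)
import numpy as np
from PIL import Image
from pathlib import Path

pred_dir = Path("./preds_json")
ious = []
for pred_path in sorted(pred_dir.glob("*_pred.png")):
    gt_path = pred_path.with_name(pred_path.name.replace("_pred.png", "_gt.png"))
    if not gt_path.exists():
        continue
    pred = np.array(Image.open(pred_path).convert("L")) > 0
    gt = np.array(Image.open(gt_path).convert("L")) > 0
    union = np.logical_or(pred, gt).sum()
    if union:
        ious.append(np.logical_and(pred, gt).sum() / union)
print(f"mean IoU over {len(ious)} items: {np.mean(ious):.4f}")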

8. Citation

@misc{sahoo2026conversationalimagesegmentationgrounding,
  title = {Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision},
  author = {Aadarsh Sahoo and Georgia Gkioxari},
  year = {2026},
  eprint = {2602.13195},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2602.13195}, 
}

For any questions or issues, please open a GitHub issue or contact Aadarsh. Thank you for your interest in our work!


9. Acknowledgments

ConverSeg builds on SAM2.
