CAT: Compositional Adversarial Training for Robust Visual Watermarking

This repository contains the code for the paper "Compositional Adversarial Training for Robust Visual Watermarking". CAT is a plug-in training framework that replaces random augmentation with a learned sequential adversary to improve watermark robustness across post-generation and in-generation watermarking methods.

The watermarking code is based on PixelSeal and VideoSeal. For CAT applied to in-generation autoregressive watermarking, see the in-processing branch. For evaluations against watermarking baselines on image and video models, see the cat-eval branch.

Overview

Standard watermark training samples augmentations independently and uniformly, which under-covers the compositional attack sequences that dominate real-world failures. CAT instead trains a lightweight sequential adversary (a frozen DINOv3 backbone, a GRU controller, and MLP heads) that adaptively selects, at each training step, the sequence of augmentation families most likely to break the current watermark model.

Key design choices:

Straight-through Gumbel-Softmax for differentiable attack selection
Entropy regularization to prevent collapse to a single destructive attack
Sequential compositions of depth T ∈ {1, 2} (T=1: a single learned attack; T=2: a two-step chain)
~800K adversary parameters with only 20–30% additional training overhead
Inference-time behavior is unchanged — the adversary is training-only
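
The first two design choices above can be illustrated with a minimal, framework-free sketch. It shows only the forward pass of Gumbel-Softmax selection and the entropy term; it is not the code in videoseal/augmentation/adversary.py. The function names, the example logits, and the family subset are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_sample(logits, tau=1.0, rng=random):
    """Return (soft, hard): relaxed probabilities and their one-hot argmax."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    soft = softmax(noisy)
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    # Straight-through trick (in an autograd framework): use `hard` in the
    # forward pass but route gradients through `soft`, typically written as
    # hard - soft.detach() + soft.
    return soft, hard


def entropy(probs):
    """Shannon entropy in nats; high entropy means a diverse attack policy."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Illustrative subset of augmentation families and made-up adversary logits.
families = ["identity", "jpeg", "crop", "rotate"]
logits = [0.0, 1.5, 0.5, -0.5]

soft, hard = gumbel_softmax_sample(logits, tau=1.0, rng=random.Random(0))
chosen = families[hard.index(1.0)]

# Entropy bonus on the un-noised policy, weighted like --adversary_entropy_weight.
entropy_weight = 0.1
entropy_bonus = entropy_weight * entropy(softmax(logits))
```

A lower temperature makes the soft sample closer to one-hot, while the entropy bonus rewards the adversary for keeping probability mass on several families rather than collapsing onto one.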

Installation

Install PyTorch for your CUDA version from pytorch.org, then install the remaining dependencies:

pip install -r requirements.txt

torchcodec is required for video decoding and must be installed separately to match your CUDA version:

pip install torchcodec==0.9.0+cu128  # adjust cu128 to your CUDA version

FFmpeg with H.264/H.265 codec support is also required for video augmentations.

Repository Structure

CAT/
├── train.py                         # Main training script
├── extract.py                       # SA-1B tar extraction utility
├── new_analyze_results.py           # Results analysis and plotting
├── configs/
│   ├── all_augs.yaml                # Augmentation library + param ranges
│   ├── all_augs_video.yaml          # Video-specific augmentation config
│   ├── embedder.yaml                # Embedder architecture config
│   ├── extractor.yaml               # Extractor architecture config
│   ├── attenuation.yaml             # JND attenuation config
│   └── datasets/                   # Per-dataset path configs
├── videoseal/
│   ├── augmentation/
│   │   ├── adversary.py             # SequentialAdversary (the CAT module)
│   │   ├── augmenter.py             # Random augmenter + AdversarialAugmenter
│   │   ├── bandit.py                # UCB bandit adversary (ablation)
│   │   ├── curriculum.py            # Per-attack difficulty curriculum
│   │   ├── geometric.py             # Geometric augmentations
│   │   ├── valuemetric.py           # Photometric + compression augmentations
│   │   └── video.py                 # Video-specific augmentations
│   ├── models/
│   │   ├── videoseal.py             # Main Videoseal model (embedder + extractor)
│   │   ├── embedder.py              # U-Net embedder
│   │   ├── extractor.py             # ConvNeXt extractor
│   │   └── baselines.py             # Baseline model wrappers (TorchScript)
│   ├── evals/
│   │   ├── full.py                  # Full evaluation script
│   │   └── metrics.py               # Bit accuracy, capacity, PSNR, SSIM, LPIPS
│   ├── cards/                       # Model card YAMLs (checkpoints + arch configs)
│   │   ├── pixelseal.yaml
│   │   ├── videoseal_0.0.yaml
│   │   └── videoseal_1.0.yaml
│   └── data/                        # Dataset loaders (images, video, HDF5)
├── docs/
│   ├── training.md                  # Training guide
│   ├── baselines.md                 # How to load baseline models
│   └── HDF5_OPTIMIZATION.md        # Data loading optimization
├── notebooks/
│   ├── image_inference.ipynb
│   └── video_inference.ipynb
└── wmforger/                        # Watermark forging experiments

Data Preparation

SA-1B (recommended)

The training data is SA-1B loaded from HuggingFace. The config at configs/datasets/sa-1b-full-resized.yaml points to asatheesh/CAT-Image by default, which is a pre-processed Parquet mirror.

For evaluation, download the held-out test set and extract it:

wget https://tinyurl.com/cat-robust-watermark-eval-data -O sa-1b-test.tar
tar -xf sa-1b-test.tar && mv <extracted-folder> sa-1b-test

Then update configs/datasets/sa-1b-test.yaml to point at the extracted directory:

val_dir: /path/to/sa-1b-test

Video (Movie-Gen-Bench)

Video fine-tuning uses Movie-Gen-Bench loaded from HuggingFace. The config at configs/datasets/movie-gen-bench.yaml points to asatheesh/CAT-Video.

For OOD video evaluation, we use SA-V (Segment Anything Video). Download it and update configs/datasets/sav-test.yaml with the path to the extracted folder.

OOD evaluation datasets

The out-of-distribution image benchmarks used in the paper each have a config in configs/datasets/. Download and point each config's val_dir at the extracted folder:

DIV2K — download the validation set from the ETH CVL page; update configs/datasets/DIV2k.yaml

CLIC — download via Kaggle API:

curl -L -o clic-dataset.zip \
  https://www.kaggle.com/api/v1/datasets/download/mustafaalkhafaji95/clic-dataset
unzip clic-dataset.zip -d clic-dataset

Update configs/datasets/CLIC-test.yaml with the extracted path.

MetFace — download from Google Drive; update configs/datasets/metface.yaml

Custom datasets

For any other image dataset, create a config in configs/datasets/ pointing to your image folder:

# configs/datasets/myimages.yaml
train_dir: /path/to/images/train/
val_dir: /path/to/images/val/
train_annotation_file: null
val_annotation_file: null

The loader supports plain image folders and COCO-format annotations.

Training

Random Augmentation Baseline

Single-step random augmentation on 4 GPUs:

OMP_NUM_THREADS=40 torchrun --nproc_per_node=4 train.py --local_rank 0 \
    --video_dataset none --image_dataset sa-1b-full-resized --workers 4 \
    --extractor_model convnext_tiny --embedder_model unet_small2_yuv_quant \
    --hidden_size_multiplier 1 --nbits 128 \
    --scaling_w_schedule Cosine,scaling_min=0.2,start_epoch=200,epochs=200 \
    --scaling_w 1.0 --scaling_i 1.0 --attenuation jnd_1_1 \
    --epochs 500 --iter_per_epoch 1000 \
    --scheduler CosineLRScheduler,lr_min=1e-6,t_initial=500,warmup_lr_init=1e-8,warmup_t=5 \
    --optimizer AdamW,lr=5e-4 \
    --lambda_dec 1.0 --lambda_d 0.1 --lambda_i 0.1 --perceptual_loss yuv \
    --num_augs 1 --augmentation_config configs/all_augs.yaml \
    --disc_in_channels 1 --disc_start 50
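
To get a feel for the learning-rate settings in this command, the sketch below computes a per-epoch learning rate assuming a linear warmup followed by a cosine decay. This is one common reading of the --scheduler and --optimizer values, not a copy of the scheduler used by train.py; the function name lr_at_epoch is hypothetical.

```python
import math


def lr_at_epoch(epoch, base_lr=5e-4, lr_min=1e-6, t_initial=500,
                warmup_lr_init=1e-8, warmup_t=5):
    """Assumed schedule: linear warmup for warmup_t epochs, then cosine decay."""
    if epoch < warmup_t:
        # Ramp linearly from warmup_lr_init towards base_lr.
        return warmup_lr_init + (base_lr - warmup_lr_init) * epoch / warmup_t
    # Cosine decay from base_lr (epoch 0) to lr_min (epoch t_initial).
    progress = min(epoch, t_initial) / t_initial
    return lr_min + 0.5 * (base_lr - lr_min) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 5, 250, 500):
    print(epoch, lr_at_epoch(epoch))
```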

CAT (Single-Step Adversary, T=1)

Add --use_adversary True and keep --num_augs 1. The adversary activates at --adversary_start_epoch (default 0; we recommend 5 to allow five warmup epochs):

OMP_NUM_THREADS=40 torchrun --nproc_per_node=4 train.py --local_rank 0 \
    --video_dataset none --image_dataset sa-1b-full-resized --workers 4 \
    --extractor_model convnext_tiny --embedder_model unet_small2_yuv_quant \
    --hidden_size_multiplier 1 --nbits 128 \
    --scaling_w_schedule Cosine,scaling_min=0.2,start_epoch=200,epochs=200 \
    --scaling_w 1.0 --scaling_i 1.0 --attenuation jnd_1_1 \
    --epochs 500 --iter_per_epoch 1000 \
    --scheduler CosineLRScheduler,lr_min=1e-6,t_initial=500,warmup_lr_init=1e-8,warmup_t=5 \
    --optimizer AdamW,lr=5e-4 \
    --lambda_dec 1.0 --lambda_d 0.1 --lambda_i 0.1 --perceptual_loss yuv \
    --num_augs 1 --augmentation_config configs/all_augs.yaml \
    --disc_in_channels 1 --disc_start 50 \
    --use_adversary True \
    --adversary_entropy_weight 0.1 \
    --adversary_hidden_dim 256 \
    --adversary_gumbel_temperature 1.0 \
    --adversary_param_head_type beta \
    --adversary_start_epoch 5

CAT (Compositional Adversary, T=2)

Set --num_augs 2 for the T=2 compositional adversary. The command below also adds --percep_loss_start_epoch 50:

OMP_NUM_THREADS=40 torchrun --nproc_per_node=4 train.py --local_rank 0 \
    --video_dataset none --image_dataset sa-1b-full-resized --workers 4 \
    --extractor_model convnext_tiny --embedder_model unet_small2_yuv_quant \
    --hidden_size_multiplier 1 --nbits 128 \
    --scaling_w_schedule Cosine,scaling_min=0.2,start_epoch=200,epochs=200 \
    --scaling_w 1.0 --scaling_i 1.0 --attenuation jnd_1_1 \
    --epochs 500 --iter_per_epoch 1000 \
    --scheduler CosineLRScheduler,lr_min=1e-6,t_initial=500,warmup_lr_init=1e-8,warmup_t=5 \
    --optimizer AdamW,lr=5e-4 \
    --lambda_dec 1.0 --lambda_d 0.1 --lambda_i 0.1 --perceptual_loss yuv \
    --num_augs 2 --augmentation_config configs/all_augs.yaml \
    --disc_in_channels 1 --disc_start 50 \
    --use_adversary True \
    --adversary_entropy_weight 0.1 \
    --adversary_hidden_dim 256 \
    --adversary_gumbel_temperature 1.0 \
    --adversary_param_head_type beta \
    --adversary_start_epoch 5 \
    --percep_loss_start_epoch 50

Video Fine-tuning

After image pre-training, fine-tune on video:

OMP_NUM_THREADS=40 torchrun --nproc_per_node=2 train.py --local_rank 0 \
    --video_dataset movie-gen-bench --image_dataset none \
    --workers 0 --frames_per_clip 16 \
    --resume_from /path/to/image/checkpoint.pth \
    --resume_optimizer_state True --resume_disc True \
    --videoseal_step_size 4 --lowres_attenuation True \
    --img_size_proc 256 --img_size_val 768 --img_size 768 \
    --extractor_model convnext_tiny --embedder_model unet_small2_yuv_quant \
    --hidden_size_multiplier 1 --nbits 128 \
    --scaling_w_schedule None --scaling_w 0.2 --scaling_i 1.0 --attenuation jnd_1_1 \
    --epochs 500 --iter_per_epoch 100 \
    --scheduler None --optimizer AdamW,lr=1e-5 \
    --lambda_dec 1.0 --lambda_d 0.5 --lambda_i 0.1 --perceptual_loss yuv \
    --num_augs 1 --augmentation_config configs/all_augs.yaml \
    --disc_in_channels 1 --disc_start 50 \
    --use_adversary True --adversary_entropy_weight 0.1

Key Training Parameters

Parameter	Description	Paper default
`--use_adversary`	Enable CAT (vs. random augmentation)	`False`
`--num_augs`	Adversary depth T (1 or 2)	`1`
`--adversary_entropy_weight`	λ_ent for entropy regularization	`0.1`
`--adversary_hidden_dim`	GRU + MLP hidden dim (d_h)	`256`
`--adversary_gumbel_temperature`	Gumbel-Softmax temperature τ	`1.0`
`--adversary_param_head_type`	Param sampling: `beta`, `point`, `uniform`	`uniform`
`--adversary_backbone`	Feature backbone: `dinov2` or `resnet18`	`dinov2`
`--adversary_start_epoch`	Epoch to activate adversary	`5`
`--nbits`	Watermark payload size in bits	`128`
`--scaling_w`	Watermark strength (higher = more robust, less invisible)	`1.0`
`--use_bandit`	Use UCB bandit adversary instead (ablation)	`False`

Evaluation

Evaluate a trained checkpoint on image datasets:

python -m videoseal.evals.full \
    --checkpoint /path/to/checkpoint.pth \
    --lowres_attenuation True --scaling_w 0.2 \
    --dataset sa-1b-full-resized --is_video false \
    --num_samples 1000 --batch_size 4

Evaluate on video:

python -m videoseal.evals.full \
    --checkpoint /path/to/checkpoint.pth \
    --lowres_attenuation True --scaling_w 0.2 \
    --dataset movie-gen-bench --is_video true \
    --num_samples 100 --batch_size 2

The evaluation reports:

Bit accuracy: fraction of correctly recovered watermark bits
Capacity (bits): nbits × (1 − H(p)), where p is the bit accuracy and H(p) = −p log2(p) − (1 − p) log2(1 − p) is the binary (Bernoulli) entropy; this is the effective number of reliably recoverable bits
PSNR, SSIM, MS-SSIM, LPIPS for image quality
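
As a worked example of the capacity metric, the sketch below evaluates nbits × (1 − H(p)) for a few bit accuracies. The function names are illustrative and are not taken from videoseal/evals/metrics.py.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def capacity_bits(bit_accuracy, nbits=128):
    """Effective number of reliably recoverable bits for a given accuracy."""
    return nbits * (1.0 - binary_entropy(bit_accuracy))


for acc in (0.5, 0.9, 0.99, 1.0):
    print(f"bit accuracy {acc:.2f} -> capacity {capacity_bits(acc):.1f} bits")
```

Chance-level accuracy (0.5) gives zero capacity and perfect accuracy gives the full 128 bits. Because the entropy is symmetric around 0.5, the formula also assigns positive capacity to accuracies below 0.5.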

Augmentation families evaluated: identity, value (brightness/contrast/hue/saturation), compression (JPEG, H.265), geometric (rotate/crop/resize/perspective/flip), and pairwise compositions for T=2.
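
The pairwise compositions can be pictured as ordered pairs of families, since applying compression and then a geometric transform is a different attack from the reverse order. The sketch below enumerates such pairs; whether the evaluation script uses ordered pairs, and whether it includes same-family pairs, is an assumption here.

```python
from itertools import product

# Non-identity families named in the evaluation description.
FAMILIES = ["value", "compression", "geometric"]

# Ordered pairs (first attack, second attack), same-family pairs included.
pairs = list(product(FAMILIES, repeat=2))

for first, second in pairs:
    print(f"{first} -> {second}")
```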

Pre-trained Checkpoints

Model	Setting	HuggingFace
PixelSeal + CAT	Single-step (T=1)	asatheesh/CAT-Pixelseal-Single-Step
PixelSeal + CAT	Compositional (T=2)	asatheesh/CAT-Pixelseal-Compositional

git clone https://huggingface.co/asatheesh/CAT-Pixelseal-Single-Step
git clone https://huggingface.co/asatheesh/CAT-Pixelseal-Compositional

Augmentation Library

The augmentation library is defined in configs/all_augs.yaml. Each augmentation has a probability weight and a parameter range. You can add, remove, or reweight augmentations by editing this file. The adversary selects from exactly this set during training.

Image / Video augmentations: identity, JPEG compression, crop, rotate, rotate90, horizontal flip, perspective, Gaussian blur, brightness, contrast, saturation, hue, H.265 video codec.

Video-only augmentations: H.264, H.265, frame drop, speed change, temporal reorder, window averaging.
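
For comparison with the learned adversary, the random-augmentation baseline can be sketched as weighted sampling of an augmentation followed by uniform sampling of its parameter. The dictionary below is a hypothetical in-memory stand-in; the actual schema, weights, and ranges in configs/all_augs.yaml may differ.

```python
import random

# Hypothetical library: each entry has a probability weight and a parameter
# range. The values are illustrative, not read from configs/all_augs.yaml.
LIBRARY = {
    "identity": {"weight": 1.0, "range": None},
    "jpeg": {"weight": 2.0, "range": (40.0, 90.0)},     # quality
    "rotate": {"weight": 1.0, "range": (-10.0, 10.0)},  # degrees
    "brightness": {"weight": 1.0, "range": (0.5, 1.5)},  # scale factor
}


def sample_augmentation(library, rng):
    """Pick one augmentation by weight, then a parameter uniformly in range."""
    names = list(library)
    weights = [library[name]["weight"] for name in names]
    name = rng.choices(names, weights=weights, k=1)[0]
    bounds = library[name]["range"]
    param = None if bounds is None else rng.uniform(*bounds)
    return name, param


rng = random.Random(0)
samples = [sample_augmentation(LIBRARY, rng) for _ in range(5)]
```

CAT replaces the fixed weights in this sampling step with a learned, input-dependent policy over the same set of augmentations.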

Ablations

The paper ablates three CAT components. These are controlled via:

Ablation	Command-line change
No entropy regularization	`--adversary_entropy_weight 0.0`
ResNet18 backbone (vs DINOv3)	`--adversary_backbone resnet18`
UCB bandit adversary (vs Gumbel-Softmax)	`--use_bandit True --bandit_type ucb` (replaces `--use_adversary True`)
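
For readers unfamiliar with the bandit ablation, the following is a minimal UCB1 sketch that treats each augmentation family as an arm. It is a generic textbook formulation, not the implementation in videoseal/augmentation/bandit.py. The reward used here (a fixed per-arm damage score) is a made-up stand-in for whatever signal the real adversary receives.

```python
import math


class UCB1:
    """Generic UCB1 bandit: pick the arm with the highest optimistic estimate."""

    def __init__(self, n_arms, c=2.0):
        self.counts = [0] * n_arms    # number of pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.c = c                    # exploration constant

    def select(self):
        # Pull every arm once before using confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            v + math.sqrt(self.c * math.log(total) / n)
            for v, n in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the mean reward.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Hypothetical per-family damage (e.g. bit error rate caused by the attack).
damage = [0.1, 0.5, 0.2]
bandit = UCB1(n_arms=len(damage))
for _ in range(500):
    arm = bandit.select()
    bandit.update(arm, damage[arm])
```

Unlike the Gumbel-Softmax adversary, a bandit of this kind does not condition on the input image and receives no gradient signal; it only tracks average reward per arm.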

Notebooks

notebooks/image_inference.ipynb — embed and extract watermarks from images
notebooks/video_inference.ipynb — embed and extract watermarks from video

Citation

@article{cat2025,
  title={Compositional Adversarial Training for Robust Visual Watermarking},
  author={Anonymous Authors},
  year={2025},
}

License

This project is licensed under the MIT License — see the LICENSE file for details.

The watermarking backbone is based on VideoSeal (Meta Platforms, Inc.), which is also MIT licensed.
