🌐 Project Page • 🤗 Models • 📋 arXiv
This repository contains the code for the paper "Compositional Adversarial Training for Robust Visual Watermarking". CAT is a plug-in training framework that replaces random augmentation with a learned sequential adversary to improve watermark robustness across post-generation and in-generation watermarking methods.
The watermarking code is based on PixelSeal and VideoSeal. For CAT applied to in-generation autoregressive watermarking, see the in-processing branch. For the evaluations against watermarking baselines on image and video models, see the cat-eval branch.
Standard watermark training samples augmentations independently and uniformly, which under-covers the compositional attack sequences that dominate real-world failures. CAT instead trains a lightweight sequential adversary—a frozen DINOv3 backbone + GRU controller + MLP heads—that adaptively selects the sequence of augmentation families most likely to break the current watermark model at each training step.
Key design choices:
- Straight-through Gumbel-Softmax for differentiable attack selection
- Entropy regularization to prevent collapse to a single destructive attack
- Depth T ∈ {1, 2} sequential compositions (T=1 = single learned attack, T=2 = two-step chain)
- ~800K adversary parameters with only 20–30% additional training overhead
- Inference-time behavior is unchanged — the adversary is training-only
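The first two design choices can be illustrated with a small, framework-free sketch. It shows only the forward pass of a Gumbel-Softmax attack choice and an entropy bonus. The straight-through gradient needs an autodiff framework and is not shown. The function names and the exact form of `adversary_objective` are assumptions for illustration, not the code in videoseal/augmentation/adversary.py.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Return (one_hot, soft) for a single attack-family choice.

    `soft` is the relaxed sample and `one_hot` is its argmax. A
    straight-through estimator uses `one_hot` in the forward pass and
    the gradient of `soft` in the backward pass.
    """
    noisy = []
    for x in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append(x - math.log(-math.log(u)))
    soft = softmax(noisy, temperature)
    choice = max(range(len(soft)), key=soft.__getitem__)
    one_hot = [1.0 if i == choice else 0.0 for i in range(len(soft))]
    return one_hot, soft


def adversary_objective(decoding_loss, logits, entropy_weight=0.1):
    """Assumed adversary objective: watermark damage plus an entropy bonus."""
    return decoding_loss + entropy_weight * entropy(softmax(logits))
```

With equal damage, a uniform policy scores higher than a peaked one under this objective, which is the mechanism that discourages collapse to a single attack.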
Install PyTorch for your CUDA version from pytorch.org, then install the remaining dependencies:
pip install -r requirements.txt
torchcodec is required for video decoding and must be installed separately to match your CUDA version:
pip install torchcodec==0.9.0+cu128 # adjust cu128 to your CUDA version
FFmpeg with H.264/H.265 codec support is also required for video augmentations.
CAT/
├── train.py # Main training script
├── extract.py # SA-1B tar extraction utility
├── new_analyze_results.py # Results analysis and plotting
├── configs/
│ ├── all_augs.yaml # Augmentation library + param ranges
│ ├── all_augs_video.yaml # Video-specific augmentation config
│ ├── embedder.yaml # Embedder architecture config
│ ├── extractor.yaml # Extractor architecture config
│ ├── attenuation.yaml # JND attenuation config
│ └── datasets/ # Per-dataset path configs
├── videoseal/
│ ├── augmentation/
│ │ ├── adversary.py # SequentialAdversary (the CAT module)
│ │ ├── augmenter.py # Random augmenter + AdversarialAugmenter
│ │ ├── bandit.py # UCB bandit adversary (ablation)
│ │ ├── curriculum.py # Per-attack difficulty curriculum
│ │ ├── geometric.py # Geometric augmentations
│ │ ├── valuemetric.py # Photometric + compression augmentations
│ │ └── video.py # Video-specific augmentations
│ ├── models/
│ │ ├── videoseal.py # Main Videoseal model (embedder + extractor)
│ │ ├── embedder.py # U-Net embedder
│ │ ├── extractor.py # ConvNeXt extractor
│ │ └── baselines.py # Baseline model wrappers (TorchScript)
│ ├── evals/
│ │ ├── full.py # Full evaluation script
│ │ └── metrics.py # Bit accuracy, capacity, PSNR, SSIM, LPIPS
│ ├── cards/ # Model card YAMLs (checkpoints + arch configs)
│ │ ├── pixelseal.yaml
│ │ ├── videoseal_0.0.yaml
│ │ └── videoseal_1.0.yaml
│ └── data/ # Dataset loaders (images, video, HDF5)
├── docs/
│ ├── training.md # Training guide
│ ├── baselines.md # How to load baseline models
│ └── HDF5_OPTIMIZATION.md # Data loading optimization
├── notebooks/
│ ├── image_inference.ipynb
│ └── video_inference.ipynb
└── wmforger/ # Watermark forging experiments
The training data is SA-1B loaded from HuggingFace. The config at configs/datasets/sa-1b-full-resized.yaml points to asatheesh/CAT-Image by default, which is a pre-processed Parquet mirror.
For evaluation, download the held-out test set and extract it:
wget https://tinyurl.com/cat-robust-watermark-eval-data -O sa-1b-test.tar
tar -xf sa-1b-test.tar && mv <extracted-folder> sa-1b-test
Then update configs/datasets/sa-1b-test.yaml to point at the extracted directory:
val_dir: /path/to/sa-1b-test
Video fine-tuning uses Movie-Gen-Bench loaded from HuggingFace. The config at configs/datasets/movie-gen-bench.yaml points to asatheesh/CAT-Video.
For OOD video evaluation, we use SA-V (Segment Anything Video). Download it and update configs/datasets/sav-test.yaml with the path to the extracted folder.
The out-of-distribution image benchmarks used in the paper each have a config in configs/datasets/. Download and point each config's val_dir at the extracted folder:
- DIV2K: download the validation set from the ETH CVL page; update configs/datasets/DIV2k.yaml
- CLIC: download via Kaggle API:
  curl -L -o clic-dataset.zip \
    https://www.kaggle.com/api/v1/datasets/download/mustafaalkhafaji95/clic-dataset
  unzip clic-dataset.zip -d clic-dataset
  Update configs/datasets/CLIC-test.yaml with the extracted path.
- MetFace: download from Google Drive; update configs/datasets/metface.yaml
For any other image dataset, create a config in configs/datasets/ pointing to your image folder:
# configs/datasets/myimages.yaml
train_dir: /path/to/images/train/
val_dir: /path/to/images/val/
train_annotation_file: null
val_annotation_file: null
The loader supports plain image folders and COCO-format annotations.
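If you add several custom datasets, the four-key config can be generated instead of written by hand. The sketch below uses only the keys shown in the example config. The helper names are hypothetical and not part of this repository. By analogy with the commands in this README, the resulting file would presumably be selected by passing its name (without .yaml) to --image_dataset; that is an assumption.

```python
from pathlib import Path


def dataset_config_text(train_dir, val_dir,
                        train_annotations=None, val_annotations=None):
    """Render the dataset config keys as YAML text (None becomes null)."""
    def fmt(value):
        return "null" if value is None else str(value)

    lines = [
        f"train_dir: {fmt(train_dir)}",
        f"val_dir: {fmt(val_dir)}",
        f"train_annotation_file: {fmt(train_annotations)}",
        f"val_annotation_file: {fmt(val_annotations)}",
    ]
    return "\n".join(lines) + "\n"


def write_dataset_config(name, train_dir, val_dir,
                         config_root="configs/datasets"):
    """Write <config_root>/<name>.yaml and return its path."""
    path = Path(config_root) / f"{name}.yaml"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(dataset_config_text(train_dir, val_dir))
    return path
```

For example, `write_dataset_config("myimages", "/path/to/images/train/", "/path/to/images/val/")` reproduces the file shown above.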
Single-step random augmentation on 4 GPUs:
OMP_NUM_THREADS=40 torchrun --nproc_per_node=4 train.py --local_rank 0 \
--video_dataset none --image_dataset sa-1b-full-resized --workers 4 \
--extractor_model convnext_tiny --embedder_model unet_small2_yuv_quant \
--hidden_size_multiplier 1 --nbits 128 \
--scaling_w_schedule Cosine,scaling_min=0.2,start_epoch=200,epochs=200 \
--scaling_w 1.0 --scaling_i 1.0 --attenuation jnd_1_1 \
--epochs 500 --iter_per_epoch 1000 \
--scheduler CosineLRScheduler,lr_min=1e-6,t_initial=500,warmup_lr_init=1e-8,warmup_t=5 \
--optimizer AdamW,lr=5e-4 \
--lambda_dec 1.0 --lambda_d 0.1 --lambda_i 0.1 --perceptual_loss yuv \
--num_augs 1 --augmentation_config configs/all_augs.yaml \
--disc_in_channels 1 --disc_start 50
Add --use_adversary True and --num_augs 1. The adversary kicks in after --adversary_start_epoch (default 0, but we recommend 5 warmup epochs):
OMP_NUM_THREADS=40 torchrun --nproc_per_node=4 train.py --local_rank 0 \
--video_dataset none --image_dataset sa-1b-full-resized --workers 4 \
--extractor_model convnext_tiny --embedder_model unet_small2_yuv_quant \
--hidden_size_multiplier 1 --nbits 128 \
--scaling_w_schedule Cosine,scaling_min=0.2,start_epoch=200,epochs=200 \
--scaling_w 1.0 --scaling_i 1.0 --attenuation jnd_1_1 \
--epochs 500 --iter_per_epoch 1000 \
--scheduler CosineLRScheduler,lr_min=1e-6,t_initial=500,warmup_lr_init=1e-8,warmup_t=5 \
--optimizer AdamW,lr=5e-4 \
--lambda_dec 1.0 --lambda_d 0.1 --lambda_i 0.1 --perceptual_loss yuv \
--num_augs 1 --augmentation_config configs/all_augs.yaml \
--disc_in_channels 1 --disc_start 50 \
--use_adversary True \
--adversary_entropy_weight 0.1 \
--adversary_hidden_dim 256 \
--adversary_gumbel_temperature 1.0 \
--adversary_param_head_type beta \
--adversary_start_epoch 5
Set --num_augs 2 for T=2 compositional adversary:
OMP_NUM_THREADS=40 torchrun --nproc_per_node=4 train.py --local_rank 0 \
--video_dataset none --image_dataset sa-1b-full-resized --workers 4 \
--extractor_model convnext_tiny --embedder_model unet_small2_yuv_quant \
--hidden_size_multiplier 1 --nbits 128 \
--scaling_w_schedule Cosine,scaling_min=0.2,start_epoch=200,epochs=200 \
--scaling_w 1.0 --scaling_i 1.0 --attenuation jnd_1_1 \
--epochs 500 --iter_per_epoch 1000 \
--scheduler CosineLRScheduler,lr_min=1e-6,t_initial=500,warmup_lr_init=1e-8,warmup_t=5 \
--optimizer AdamW,lr=5e-4 \
--lambda_dec 1.0 --lambda_d 0.1 --lambda_i 0.1 --perceptual_loss yuv \
--num_augs 2 --augmentation_config configs/all_augs.yaml \
--disc_in_channels 1 --disc_start 50 \
--use_adversary True \
--adversary_entropy_weight 0.1 \
--adversary_hidden_dim 256 \
--adversary_gumbel_temperature 1.0 \
--adversary_param_head_type beta \
--adversary_start_epoch 5 \
--percep_loss_start_epoch 50
After image pre-training, fine-tune on video:
OMP_NUM_THREADS=40 torchrun --nproc_per_node=2 train.py --local_rank 0 \
--video_dataset movie-gen-bench --image_dataset none \
--workers 0 --frames_per_clip 16 \
--resume_from /path/to/image/checkpoint.pth \
--resume_optimizer_state True --resume_disc True \
--videoseal_step_size 4 --lowres_attenuation True \
--img_size_proc 256 --img_size_val 768 --img_size 768 \
--extractor_model convnext_tiny --embedder_model unet_small2_yuv_quant \
--hidden_size_multiplier 1 --nbits 128 \
--scaling_w_schedule None --scaling_w 0.2 --scaling_i 1.0 --attenuation jnd_1_1 \
--epochs 500 --iter_per_epoch 100 \
--scheduler None --optimizer AdamW,lr=1e-5 \
--lambda_dec 1.0 --lambda_d 0.5 --lambda_i 0.1 --perceptual_loss yuv \
--num_augs 1 --augmentation_config configs/all_augs.yaml \
--disc_in_channels 1 --disc_start 50 \
--use_adversary True --adversary_entropy_weight 0.1
| Parameter | Description | Paper default |
|---|---|---|
| --use_adversary | Enable CAT (vs. random augmentation) | False |
| --num_augs | Adversary depth T (1 or 2) | 1 |
| --adversary_entropy_weight | λ_ent for entropy regularization | 0.1 |
| --adversary_hidden_dim | GRU + MLP hidden dim (d_h) | 256 |
| --adversary_gumbel_temperature | Gumbel-Softmax temperature τ | 1.0 |
| --adversary_param_head_type | Param sampling: beta, point, uniform | uniform |
| --adversary_backbone | Feature backbone: dinov2 or resnet18 | dinov2 |
| --adversary_start_epoch | Epoch to activate adversary | 5 |
| --nbits | Watermark payload size in bits | 128 |
| --scaling_w | Watermark strength (higher = more robust, less invisible) | 1.0 |
| --use_bandit | Use UCB bandit adversary instead (ablation) | False |
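The image-training commands in this README differ only in the adversary flags and --num_augs, so a small launcher can assemble them from one place. The sketch below is a hypothetical helper, not part of this repository. `BASE_FLAGS` is deliberately abbreviated, so pass the remaining flags from the full commands (schedules, loss weights, discriminator settings) through `extra`.

```python
import shlex

# Abbreviated subset of the flags shared by the image-training commands.
BASE_FLAGS = {
    "video_dataset": "none",
    "image_dataset": "sa-1b-full-resized",
    "workers": 4,
    "extractor_model": "convnext_tiny",
    "embedder_model": "unet_small2_yuv_quant",
    "nbits": 128,
    "num_augs": 1,
    "augmentation_config": "configs/all_augs.yaml",
}

# Adversary flags with the values used in the CAT commands of this README.
CAT_FLAGS = {
    "use_adversary": True,
    "adversary_entropy_weight": 0.1,
    "adversary_hidden_dim": 256,
    "adversary_gumbel_temperature": 1.0,
    "adversary_param_head_type": "beta",
    "adversary_start_epoch": 5,
}


def build_command(num_gpus=4, depth=None, extra=None):
    """Return the torchrun argv; depth=None means random augmentation."""
    flags = dict(BASE_FLAGS)
    if depth is not None:
        if depth not in (1, 2):
            raise ValueError("adversary depth T must be 1 or 2")
        flags.update(CAT_FLAGS)
        flags["num_augs"] = depth
    flags.update(extra or {})
    argv = ["torchrun", f"--nproc_per_node={num_gpus}", "train.py"]
    for name, value in flags.items():
        argv += [f"--{name}", str(value)]
    return argv


if __name__ == "__main__":
    # Print the T=2 command so it can be inspected before launching.
    print(shlex.join(build_command(depth=2, extra={"epochs": 500})))
```

Keeping the adversary flags in one dictionary makes it harder for the T=1 and T=2 runs to drift apart in anything other than depth.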
Evaluate a trained checkpoint on image datasets:
python -m videoseal.evals.full \
--checkpoint /path/to/checkpoint.pth \
--lowres_attenuation True --scaling_w 0.2 \
--dataset sa-1b-full-resized --is_video false \
--num_samples 1000 --batch_size 4
Evaluate on video:
python -m videoseal.evals.full \
--checkpoint /path/to/checkpoint.pth \
--lowres_attenuation True --scaling_w 0.2 \
--dataset movie-gen-bench --is_video true \
--num_samples 100 --batch_size 2
The evaluation reports:
- Bit accuracy: fraction of correctly recovered watermark bits
- Capacity (bits): nbits × (1 - H(bit_accuracy)), where H is the Bernoulli entropy; this is the effective number of reliably recoverable bits
- PSNR, SSIM, MS-SSIM, LPIPS for image quality
Augmentation families evaluated: identity, value (brightness/contrast/hue/saturation), compression (JPEG, H.265), geometric (rotate/crop/resize/perspective/flip), and pairwise compositions for T=2.
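The capacity metric in the list above can be computed directly from the bit accuracy. The sketch below assumes H is the binary entropy measured in bits (log base 2), which is what makes the result a number of bits. The function names are illustrative and are not the ones in videoseal/evals/metrics.py.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def capacity_bits(bit_accuracy, nbits=128):
    """Capacity = nbits * (1 - H(bit_accuracy))."""
    if not 0.0 <= bit_accuracy <= 1.0:
        raise ValueError("bit_accuracy must lie in [0, 1]")
    return nbits * (1.0 - binary_entropy(bit_accuracy))


if __name__ == "__main__":
    for acc in (0.5, 0.9, 0.99, 1.0):
        print(f"bit accuracy {acc:.2f} -> {capacity_bits(acc):.1f} bits")
```

Chance-level accuracy (0.5) gives zero capacity and perfect accuracy gives the full payload, so the metric penalizes near-chance decoding much more sharply than raw bit accuracy does.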
| Model | Setting | HuggingFace |
|---|---|---|
| PixelSeal + CAT | Single-step (T=1) | asatheesh/CAT-Pixelseal-Single-Step |
| PixelSeal + CAT | Compositional (T=2) | asatheesh/CAT-Pixelseal-Compositional |
git clone https://huggingface.co/asatheesh/CAT-Pixelseal-Single-Step
git clone https://huggingface.co/asatheesh/CAT-Pixelseal-Compositional
The augmentation library is defined in configs/all_augs.yaml. Each augmentation has a probability weight and a parameter range. You can add, remove, or reweight augmentations by editing this file. The adversary selects from exactly this set during training.
Image / Video augmentations: identity, JPEG compression, crop, rotate, rotate90, horizontal flip, perspective, Gaussian blur, brightness, contrast, saturation, hue, H.265 video codec.
Video-only augmentations: H.264, H.265, frame drop, speed change, temporal reorder, window averaging.
The paper ablates three CAT components. These are controlled via:
| Ablation | Command-line change |
|---|---|
| No entropy regularization | --adversary_entropy_weight 0.0 |
| ResNet18 backbone (vs DINOv3) | --adversary_backbone resnet18 |
| UCB bandit adversary (vs Gumbel-Softmax) | --use_bandit True --bandit_type ucb (replaces --use_adversary True) |
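For readers unfamiliar with the bandit ablation, the following is a generic UCB1 sketch in which each arm stands for one augmentation family. It is a textbook formulation, not the code in videoseal/augmentation/bandit.py. Using the decoding loss caused by the chosen attack as the reward is an assumption.

```python
import math


class UCB1:
    """Textbook UCB1 over a fixed set of arms (augmentation families)."""

    def __init__(self, num_arms, exploration=2.0):
        self.counts = [0] * num_arms    # pulls per arm
        self.means = [0.0] * num_arms   # running mean reward per arm
        self.exploration = exploration
        self.total = 0                  # total pulls so far

    def select(self):
        """Pick the arm with the highest upper confidence bound."""
        # Try every arm once before trusting the confidence bounds.
        for arm, pulls in enumerate(self.counts):
            if pulls == 0:
                return arm
        log_total = math.log(self.total)
        scores = [
            mean + math.sqrt(self.exploration * log_total / pulls)
            for mean, pulls in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        """Fold one observed reward into the arm's running mean."""
        self.counts[arm] += 1
        self.total += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


if __name__ == "__main__":
    # Hypothetical per-attack damage; in training this would be measured.
    damage = [0.1, 0.9, 0.2]
    bandit = UCB1(num_arms=len(damage))
    for _ in range(200):
        arm = bandit.select()
        bandit.update(arm, damage[arm])
    print(bandit.counts)
```

Unlike the learned sequential adversary, a bandit of this kind ignores the input image and keeps only per-arm statistics, which is why it serves as an ablation.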
- notebooks/image_inference.ipynb — embed and extract watermarks from images
- notebooks/video_inference.ipynb — embed and extract watermarks from video
@article{cat2025,
title={Compositional Adversarial Training for Robust Visual Watermarking},
author={Anonymous Authors},
year={2025},
}
This project is licensed under the MIT License. See the LICENSE file for details.
The watermarking backbone is based on VideoSeal (Meta Platforms, Inc.), which is also MIT licensed.

