Asatheesh6561/CAT

CAT: Compositional Adversarial Training for Robust Visual Watermarking

🌐 Project Page | 🤗 Models | 📋 arXiv

This repository contains the code for the paper "Compositional Adversarial Training for Robust Visual Watermarking". CAT is a plug-in training framework that replaces random augmentation with a learned sequential adversary to improve watermark robustness across post-generation and in-generation watermarking methods.

The watermarking code is based on PixelSeal and VideoSeal. For CAT applied to in-generation autoregressive watermarking, see the in-processing branch. For the evaluations against watermarking baselines on image and video models, see the cat-eval branch.

[Figures: CAT training pipeline; results summary]


Overview

Standard watermark training samples augmentations independently and uniformly, which under-covers the compositional attack sequences that dominate real-world failures. CAT instead trains a lightweight sequential adversary—a frozen DINOv3 backbone + GRU controller + MLP heads—that adaptively selects the sequence of augmentation families most likely to break the current watermark model at each training step.

Key design choices:

  • Straight-through Gumbel-Softmax for differentiable attack selection
  • Entropy regularization to prevent collapse to a single destructive attack
  • Depth T ∈ {1, 2} sequential compositions (T=1 = single learned attack, T=2 = two-step chain)
  • ~800K adversary parameters with only 20–30% additional training overhead
  • Inference-time behavior is unchanged — the adversary is training-only
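
The first two design choices can be illustrated with a small standard-library sketch. It shows only the forward pass of Gumbel-Softmax sampling and the entropy term; it is not the code in videoseal/augmentation/adversary.py, and the logits and the four-family setup are made up for illustration. In an autodiff framework the straight-through trick uses the one-hot `hard` vector in the forward pass and the gradient of `soft` in the backward pass (in PyTorch, `torch.nn.functional.gumbel_softmax(logits, tau, hard=True)` does this).

```python
import math
import random


def softmax(logits, tau=1.0):
    """Numerically stable softmax of logits / tau."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_sample(logits, tau, rng):
    """Return (soft, hard): relaxed probabilities and their one-hot argmax."""
    noisy = []
    for x in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u kept strictly inside (0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append(x - math.log(-math.log(u)))
    soft = softmax(noisy, tau)
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard


def entropy(probs):
    """Shannon entropy in nats: 0 for a one-hot vector, log(K) for uniform."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.2, 1.5, -0.3, 0.0]  # hypothetical scores for four augmentation families
    soft, hard = gumbel_softmax_sample(logits, tau=1.0, rng=rng)
    policy = softmax(logits)
    print("selected family index:", hard.index(1.0))
    print("policy entropy (nats):", round(entropy(policy), 3))
```

An entropy bonus on the policy (weighted by --adversary_entropy_weight) rewards the adversary for keeping probability mass on several families, which is what discourages collapse onto one destructive attack.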

Installation

Install PyTorch for your CUDA version from pytorch.org, then install the remaining dependencies:

pip install -r requirements.txt

torchcodec is required for video decoding and must be installed separately to match your CUDA version:

pip install torchcodec==0.9.0+cu128  # adjust cu128 to your CUDA version

FFmpeg with H.264/H.265 codec support is also required for video augmentations.
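
Before launching a video run, it can help to confirm these dependencies are visible. The following is a minimal sketch, not part of the repository: it assumes the FFmpeg build exposes the software encoders libx264 and libx265, so builds that rely on other encoders (for example hardware ones) would be reported as missing even if they work.

```python
import importlib.util
import shutil
import subprocess


def check_video_deps():
    """Report whether torchcodec and FFmpeg H.264/H.265 encoders look available."""
    report = {
        "torchcodec": importlib.util.find_spec("torchcodec") is not None,
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "h264": False,
        "h265": False,
    }
    if report["ffmpeg"]:
        # `ffmpeg -encoders` lists every encoder compiled into this build.
        out = subprocess.run(
            ["ffmpeg", "-hide_banner", "-encoders"],
            capture_output=True, text=True, check=False,
        ).stdout
        # Assumption: the software encoders are named libx264 / libx265.
        report["h264"] = "libx264" in out
        report["h265"] = "libx265" in out
    return report


if __name__ == "__main__":
    for name, ok in check_video_deps().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```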


Repository Structure

CAT/
├── train.py                         # Main training script
├── extract.py                       # SA-1B tar extraction utility
├── new_analyze_results.py           # Results analysis and plotting
├── configs/
│   ├── all_augs.yaml                # Augmentation library + param ranges
│   ├── all_augs_video.yaml          # Video-specific augmentation config
│   ├── embedder.yaml                # Embedder architecture config
│   ├── extractor.yaml               # Extractor architecture config
│   ├── attenuation.yaml             # JND attenuation config
│   └── datasets/                   # Per-dataset path configs
├── videoseal/
│   ├── augmentation/
│   │   ├── adversary.py             # SequentialAdversary (the CAT module)
│   │   ├── augmenter.py             # Random augmenter + AdversarialAugmenter
│   │   ├── bandit.py                # UCB bandit adversary (ablation)
│   │   ├── curriculum.py            # Per-attack difficulty curriculum
│   │   ├── geometric.py             # Geometric augmentations
│   │   ├── valuemetric.py           # Photometric + compression augmentations
│   │   └── video.py                 # Video-specific augmentations
│   ├── models/
│   │   ├── videoseal.py             # Main Videoseal model (embedder + extractor)
│   │   ├── embedder.py              # U-Net embedder
│   │   ├── extractor.py             # ConvNeXt extractor
│   │   └── baselines.py             # Baseline model wrappers (TorchScript)
│   ├── evals/
│   │   ├── full.py                  # Full evaluation script
│   │   └── metrics.py               # Bit accuracy, capacity, PSNR, SSIM, LPIPS
│   ├── cards/                       # Model card YAMLs (checkpoints + arch configs)
│   │   ├── pixelseal.yaml
│   │   ├── videoseal_0.0.yaml
│   │   └── videoseal_1.0.yaml
│   └── data/                        # Dataset loaders (images, video, HDF5)
├── docs/
│   ├── training.md                  # Training guide
│   ├── baselines.md                 # How to load baseline models
│   └── HDF5_OPTIMIZATION.md        # Data loading optimization
├── notebooks/
│   ├── image_inference.ipynb
│   └── video_inference.ipynb
└── wmforger/                        # Watermark forging experiments

Data Preparation

SA-1B (recommended)

The training data is SA-1B loaded from HuggingFace. The config at configs/datasets/sa-1b-full-resized.yaml points to asatheesh/CAT-Image by default, which is a pre-processed Parquet mirror.

For evaluation, download the held-out test set and extract it:

wget https://tinyurl.com/cat-robust-watermark-eval-data -O sa-1b-test.tar
tar -xf sa-1b-test.tar && mv <extracted-folder> sa-1b-test

Then update configs/datasets/sa-1b-test.yaml to point at the extracted directory:

val_dir: /path/to/sa-1b-test

Video (Movie-Gen-Bench)

Video fine-tuning uses Movie-Gen-Bench loaded from HuggingFace. The config at configs/datasets/movie-gen-bench.yaml points to asatheesh/CAT-Video.

For OOD video evaluation, we use SA-V (Segment Anything Video). Download it and update configs/datasets/sav-test.yaml with the path to the extracted folder.

OOD evaluation datasets

The out-of-distribution image benchmarks used in the paper each have a config in configs/datasets/. Download and point each config's val_dir at the extracted folder:

  • DIV2K — download the validation set from the ETH CVL page; update configs/datasets/DIV2k.yaml
  • CLIC — download via Kaggle API:
    curl -L -o clic-dataset.zip \
      https://www.kaggle.com/api/v1/datasets/download/mustafaalkhafaji95/clic-dataset
    unzip clic-dataset.zip -d clic-dataset
    Update configs/datasets/CLIC-test.yaml with the extracted path.
  • MetFace — download from Google Drive; update configs/datasets/metface.yaml

Custom datasets

For any other image dataset, create a config in configs/datasets/ pointing to your image folder:

# configs/datasets/myimages.yaml
train_dir: /path/to/images/train/
val_dir: /path/to/images/val/
train_annotation_file: null
val_annotation_file: null

The loader supports plain image folders and COCO-format annotations.
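
As a convenience, a config like the one above can be generated from a script. This is a hypothetical helper, not something the repository provides; it writes only the four keys shown above and does no YAML escaping, so it assumes paths without characters such as `#` or `: `.

```python
from pathlib import Path


def write_dataset_config(name, train_dir, val_dir, config_root="configs/datasets"):
    """Write a folder-based dataset config and return the path of the new file."""
    for d in (train_dir, val_dir):
        if not Path(d).is_dir():
            raise FileNotFoundError(f"image folder not found: {d}")
    text = (
        f"train_dir: {train_dir}\n"
        f"val_dir: {val_dir}\n"
        "train_annotation_file: null\n"
        "val_annotation_file: null\n"
    )
    out = Path(config_root) / f"{name}.yaml"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(text)
    return out
```

Calling `write_dataset_config("myimages", "/path/to/images/train/", "/path/to/images/val/")` would produce configs/datasets/myimages.yaml. By analogy with sa-1b-full-resized, the dataset would then presumably be selected with --image_dataset myimages.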


Training

Random Augmentation Baseline

Single-step random augmentation on 4 GPUs:

OMP_NUM_THREADS=40 torchrun --nproc_per_node=4 train.py --local_rank 0 \
    --video_dataset none --image_dataset sa-1b-full-resized --workers 4 \
    --extractor_model convnext_tiny --embedder_model unet_small2_yuv_quant \
    --hidden_size_multiplier 1 --nbits 128 \
    --scaling_w_schedule Cosine,scaling_min=0.2,start_epoch=200,epochs=200 \
    --scaling_w 1.0 --scaling_i 1.0 --attenuation jnd_1_1 \
    --epochs 500 --iter_per_epoch 1000 \
    --scheduler CosineLRScheduler,lr_min=1e-6,t_initial=500,warmup_lr_init=1e-8,warmup_t=5 \
    --optimizer AdamW,lr=5e-4 \
    --lambda_dec 1.0 --lambda_d 0.1 --lambda_i 0.1 --perceptual_loss yuv \
    --num_augs 1 --augmentation_config configs/all_augs.yaml \
    --disc_in_channels 1 --disc_start 50

CAT (Single-Step Adversary, T=1)

Add --use_adversary True and --num_augs 1. The adversary kicks in after --adversary_start_epoch (default 0, but we recommend 5 warmup epochs):

OMP_NUM_THREADS=40 torchrun --nproc_per_node=4 train.py --local_rank 0 \
    --video_dataset none --image_dataset sa-1b-full-resized --workers 4 \
    --extractor_model convnext_tiny --embedder_model unet_small2_yuv_quant \
    --hidden_size_multiplier 1 --nbits 128 \
    --scaling_w_schedule Cosine,scaling_min=0.2,start_epoch=200,epochs=200 \
    --scaling_w 1.0 --scaling_i 1.0 --attenuation jnd_1_1 \
    --epochs 500 --iter_per_epoch 1000 \
    --scheduler CosineLRScheduler,lr_min=1e-6,t_initial=500,warmup_lr_init=1e-8,warmup_t=5 \
    --optimizer AdamW,lr=5e-4 \
    --lambda_dec 1.0 --lambda_d 0.1 --lambda_i 0.1 --perceptual_loss yuv \
    --num_augs 1 --augmentation_config configs/all_augs.yaml \
    --disc_in_channels 1 --disc_start 50 \
    --use_adversary True \
    --adversary_entropy_weight 0.1 \
    --adversary_hidden_dim 256 \
    --adversary_gumbel_temperature 1.0 \
    --adversary_param_head_type beta \
    --adversary_start_epoch 5

CAT (Compositional Adversary, T=2)

Set --num_augs 2 for the T=2 compositional adversary. The command below also sets --percep_loss_start_epoch 50:

OMP_NUM_THREADS=40 torchrun --nproc_per_node=4 train.py --local_rank 0 \
    --video_dataset none --image_dataset sa-1b-full-resized --workers 4 \
    --extractor_model convnext_tiny --embedder_model unet_small2_yuv_quant \
    --hidden_size_multiplier 1 --nbits 128 \
    --scaling_w_schedule Cosine,scaling_min=0.2,start_epoch=200,epochs=200 \
    --scaling_w 1.0 --scaling_i 1.0 --attenuation jnd_1_1 \
    --epochs 500 --iter_per_epoch 1000 \
    --scheduler CosineLRScheduler,lr_min=1e-6,t_initial=500,warmup_lr_init=1e-8,warmup_t=5 \
    --optimizer AdamW,lr=5e-4 \
    --lambda_dec 1.0 --lambda_d 0.1 --lambda_i 0.1 --perceptual_loss yuv \
    --num_augs 2 --augmentation_config configs/all_augs.yaml \
    --disc_in_channels 1 --disc_start 50 \
    --use_adversary True \
    --adversary_entropy_weight 0.1 \
    --adversary_hidden_dim 256 \
    --adversary_gumbel_temperature 1.0 \
    --adversary_param_head_type beta \
    --adversary_start_epoch 5 \
    --percep_loss_start_epoch 50
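
The three image-training commands above share every argument except the adversary flags. The sketch below collects just the differences, using only flag names and values that appear in those commands; the helper itself (`cat_flags`) is hypothetical and not part of train.py.

```python
def cat_flags(depth, start_epoch=5, entropy_weight=0.1):
    """Return the flags that distinguish a CAT run from the random baseline.

    These replace the baseline's `--num_augs 1`; all other arguments stay the same.
    """
    if depth not in (1, 2):
        raise ValueError("only T=1 and T=2 are documented")
    flags = [
        "--num_augs", str(depth),
        "--use_adversary", "True",
        "--adversary_entropy_weight", str(entropy_weight),
        "--adversary_hidden_dim", "256",
        "--adversary_gumbel_temperature", "1.0",
        "--adversary_param_head_type", "beta",
        "--adversary_start_epoch", str(start_epoch),
    ]
    if depth == 2:
        # Only the T=2 command above sets this flag.
        flags += ["--percep_loss_start_epoch", "50"]
    return flags


if __name__ == "__main__":
    print(" ".join(cat_flags(2)))
```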

Video Fine-tuning

After image pre-training, fine-tune on video:

OMP_NUM_THREADS=40 torchrun --nproc_per_node=2 train.py --local_rank 0 \
    --video_dataset movie-gen-bench --image_dataset none \
    --workers 0 --frames_per_clip 16 \
    --resume_from /path/to/image/checkpoint.pth \
    --resume_optimizer_state True --resume_disc True \
    --videoseal_step_size 4 --lowres_attenuation True \
    --img_size_proc 256 --img_size_val 768 --img_size 768 \
    --extractor_model convnext_tiny --embedder_model unet_small2_yuv_quant \
    --hidden_size_multiplier 1 --nbits 128 \
    --scaling_w_schedule None --scaling_w 0.2 --scaling_i 1.0 --attenuation jnd_1_1 \
    --epochs 500 --iter_per_epoch 100 \
    --scheduler None --optimizer AdamW,lr=1e-5 \
    --lambda_dec 1.0 --lambda_d 0.5 --lambda_i 0.1 --perceptual_loss yuv \
    --num_augs 1 --augmentation_config configs/all_augs.yaml \
    --disc_in_channels 1 --disc_start 50 \
    --use_adversary True --adversary_entropy_weight 0.1

Key Training Parameters

| Parameter | Description | Paper default |
|---|---|---|
| --use_adversary | Enable CAT (vs. random augmentation) | False |
| --num_augs | Adversary depth T (1 or 2) | 1 |
| --adversary_entropy_weight | λ_ent for entropy regularization | 0.1 |
| --adversary_hidden_dim | GRU + MLP hidden dim (d_h) | 256 |
| --adversary_gumbel_temperature | Gumbel-Softmax temperature τ | 1.0 |
| --adversary_param_head_type | Param sampling: beta, point, uniform | uniform |
| --adversary_backbone | Feature backbone: dinov2 or resnet18 | dinov2 |
| --adversary_start_epoch | Epoch to activate adversary | 5 |
| --nbits | Watermark payload size in bits | 128 |
| --scaling_w | Watermark strength (higher = more robust, less invisible) | 1.0 |
| --use_bandit | Use UCB bandit adversary instead (ablation) | False |

Evaluation

Evaluate a trained checkpoint on image datasets:

python -m videoseal.evals.full \
    --checkpoint /path/to/checkpoint.pth \
    --lowres_attenuation True --scaling_w 0.2 \
    --dataset sa-1b-full-resized --is_video false \
    --num_samples 1000 --batch_size 4

Evaluate on video:

python -m videoseal.evals.full \
    --checkpoint /path/to/checkpoint.pth \
    --lowres_attenuation True --scaling_w 0.2 \
    --dataset movie-gen-bench --is_video true \
    --num_samples 100 --batch_size 2

The evaluation reports:

  • Bit accuracy: fraction of correctly recovered watermark bits
  • Capacity (bits): nbits × (1 - H(bit_accuracy)), where H is the binary (Bernoulli) entropy in bits; this is the effective number of reliably recoverable bits
  • PSNR, SSIM, MS-SSIM, LPIPS for image quality
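
The capacity metric can be reproduced from a bit accuracy in a few lines. This is a sketch of the formula as stated above, assuming base-2 logarithms so the result is in bits; it is not the implementation in videoseal/evals/metrics.py.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def capacity_bits(nbits, bit_accuracy):
    """nbits * (1 - H(bit_accuracy)), following the metric definition above."""
    return nbits * (1.0 - binary_entropy(bit_accuracy))


if __name__ == "__main__":
    for acc in (0.5, 0.9, 0.99, 1.0):
        print(f"bit accuracy {acc:.2f} -> {capacity_bits(128, acc):.1f} bits")
```

With a 128-bit payload, chance-level accuracy (0.5) gives zero capacity, 90% accuracy gives roughly 68 bits, and perfect accuracy gives the full 128.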

Augmentation families evaluated: identity, value (brightness/contrast/hue/saturation), compression (JPEG, H.265), geometric (rotate/crop/resize/perspective/flip), and pairwise compositions for T=2.


Pre-trained Checkpoints

| Model | Setting | HuggingFace |
|---|---|---|
| PixelSeal + CAT | Single-step (T=1) | asatheesh/CAT-Pixelseal-Single-Step |
| PixelSeal + CAT | Compositional (T=2) | asatheesh/CAT-Pixelseal-Compositional |

git clone https://huggingface.co/asatheesh/CAT-Pixelseal-Single-Step
git clone https://huggingface.co/asatheesh/CAT-Pixelseal-Compositional

Augmentation Library

The augmentation library is defined in configs/all_augs.yaml. Each augmentation has a probability weight and a parameter range. You can add, remove, or reweight augmentations by editing this file. The adversary selects from exactly this set during training.

Image / Video augmentations: identity, JPEG compression, crop, rotate, rotate90, horizontal flip, perspective, Gaussian blur, brightness, contrast, saturation, hue, H.265 video codec.

Video-only augmentations: H.264, H.265, frame drop, speed change, temporal reorder, window averaging.
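
To make the role of the weights and parameter ranges concrete, here is a sketch of how a random-augmentation baseline could draw a chain of depth T from such a library. The dictionary layout, the weights, and the ranges are invented for illustration; the actual schema of configs/all_augs.yaml and the sampling code in videoseal/augmentation/augmenter.py may differ.

```python
import random

# Hypothetical entries: name -> (probability weight, (min, max) parameter range).
AUGS = {
    "identity": (1.0, (0.0, 0.0)),
    "jpeg": (1.0, (40.0, 80.0)),
    "rotate": (1.0, (-10.0, 10.0)),
    "brightness": (0.5, (0.5, 1.5)),
}


def sample_chain(augs, depth, rng):
    """Draw `depth` independent (name, parameter) pairs, weighted by each entry's weight."""
    names = list(augs)
    weights = [augs[n][0] for n in names]
    chain = []
    for _ in range(depth):
        name = rng.choices(names, weights=weights, k=1)[0]
        lo, hi = augs[name][1]
        chain.append((name, rng.uniform(lo, hi)))
    return chain


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_chain(AUGS, depth=2, rng=rng))
```

Each step here is drawn independently of the input and of the previous step. CAT replaces this draw with the learned adversary, which picks each family (and its parameter) conditioned on the current image and on what was applied earlier in the chain.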


Ablations

The paper ablates three CAT components. These are controlled via:

| Ablation | Command-line change |
|---|---|
| No entropy regularization | --adversary_entropy_weight 0.0 |
| ResNet18 backbone (vs DINOv3) | --adversary_backbone resnet18 |
| UCB bandit adversary (vs Gumbel-Softmax) | --use_bandit True --bandit_type ucb (replaces --use_adversary True) |
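
For readers unfamiliar with the bandit ablation, the following is a textbook UCB1 selector over augmentation families. It is a generic sketch under the assumption that the reward is some measure of how much an attack hurts the extractor; the repository's videoseal/augmentation/bandit.py may define rewards, bonuses, and state differently.

```python
import math


class UCB1:
    """Textbook UCB1 over a fixed set of arms (here: augmentation families)."""

    def __init__(self, n_arms, c=2.0):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.c = c

    def select(self):
        # Play every arm once before trusting the confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            m + math.sqrt(self.c * math.log(total) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean of the rewards observed for this arm.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


if __name__ == "__main__":
    rewards = [0.1, 0.9, 0.2]  # hypothetical per-family damage to the extractor
    bandit = UCB1(n_arms=3)
    for _ in range(500):
        arm = bandit.select()
        bandit.update(arm, rewards[arm])
    print("pull counts:", bandit.counts)
```

Unlike the Gumbel-Softmax adversary, a bandit of this kind keeps one statistic per family and ignores the input image, which is what makes it a useful point of comparison.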

Notebooks

The notebooks/ directory contains image_inference.ipynb and video_inference.ipynb.

Citation

@article{cat2025,
  title={Compositional Adversarial Training for Robust Visual Watermarking},
  author={Anonymous Authors},
  year={2025},
}

License

This project is licensed under the MIT License — see the LICENSE file for details.

The watermarking backbone is based on VideoSeal (Meta Platforms, Inc.), which is also MIT licensed.
