Abstract
Modern T2V/I2V generators synthesize people who are increasingly hard to distinguish from authentic footage, while current evaluation suites lag behind: legacy benchmarks target manipulation-based forgeries, and recent synthetic-video benchmarks prioritize scale over realistic human depiction. We introduce SynthForensics, a people-centric benchmark of 20,445 videos from 8 T2V and 7 I2V open-source generators. Each video is paired with a real source video from FF++ or DFD, validated by humans in two stages, and released in four compression versions with full metadata. In our paired-comparison human study, raters prefer SynthForensics in 71–77% of head-to-head comparisons against each of nine existing synthetic-video benchmarks, while facial-quality metrics fall within the FF++/DFD baseline range. Across 15 detectors and three protocols, face-based methods drop 13–55 AUC points (mean 27) from FF++ to SynthForensics and a further 23 points under aggressive compression; fine-tuning closes the gap at the cost of performance on legacy benchmarks; and re-training shows that synthetic and manipulation features are largely disjoint for most detectors. We release the dataset, pipeline, and code.
SynthForensics/
├── T2V/
│   ├── videos/
│   │   ├── raw/
│   │   │   ├── cogvideox/  # <ID>_cogvideox_t2v.mp4
│   │   │   ├── daVinci-MagiHuman/
│   │   │   ├── helios/
│   │   │   ├── ltx2-3/
│   │   │   ├── magi-1/
│   │   │   ├── self-forcing/
│   │   │   ├── skyreels-v2/
│   │   │   └── wan2-1/
│   │   ├── canonical/  # same per-generator structure
│   │   ├── crf23/
│   │   └── crf40/
│   └── metadata/
│       ├── cogvideox/  # <ID>_cogvideox_t2v.json
│       ├── daVinci-MagiHuman/
│       └── …  # one sub-folder per generator
├── I2V/
│   ├── videos/
│   │   ├── raw/
│   │   │   ├── cogvideox/  # <ID>_cogvideox_i2v.mp4
│   │   │   ├── daVinci-MagiHuman/
│   │   │   ├── helios/
│   │   │   ├── ltx2-3/
│   │   │   ├── magi-1/
│   │   │   ├── skyreels-v2/
│   │   │   └── wan2-1/
│   │   ├── canonical/  # same per-generator structure
│   │   ├── crf23/
│   │   └── crf40/
│   ├── i2v_frames/  # <ID>.png: reference frames used as conditioning input
│   └── metadata/
│       ├── cogvideox/  # <ID>_cogvideox_i2v.json
│       └── …  # one sub-folder per generator
├── captions/  # <ID>.json: dense captions for FF++ and DFD source videos
├── train.json
├── test.json
├── val.json
└── README.md
Within both T2V/videos/ and I2V/videos/, samples are organized by compression level (raw, canonical, crf23, crf40) and, within each compression level, by generator name. Two distinct ID schemes are used depending on the source:
- FF++ samples: <ID>_<generator>_t2v.mp4 / <ID>_<generator>_i2v.mp4, where <ID> is a zero-padded three-digit integer inherited from the FaceForensics++ dataset (e.g., 071_cogvideox_t2v.mp4).
- DFD samples: <subject_id>__<scene>_<generator>_t2v.mp4 / <subject_id>__<scene>_<generator>_i2v.mp4, where <subject_id> is a two-digit zero-padded integer and <scene> is a descriptive scene name (e.g., 01__exit_phone_room_cogvideox_t2v.mp4).
In both cases <generator> matches the directory name (e.g., cogvideox, daVinci-MagiHuman, wan2-1). Metadata files in T2V/metadata/<generator>/ and I2V/metadata/<generator>/ follow the same naming patterns with a .json extension.
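The naming rules above can be turned into a small parser. The following is a minimal sketch, not part of the released code: the names `SampleName`, `parse_sample_name`, and `video_path` are hypothetical, and the parser assumes that generator directory names never contain an underscore (true for the names listed in this README), so the generator and branch can be split off from the right even when a DFD scene name contains underscores.

```python
import re
from pathlib import Path, PurePosixPath
from typing import NamedTuple

COMPRESSIONS = ("raw", "canonical", "crf23", "crf40")


class SampleName(NamedTuple):
    """Hypothetical container for the fields encoded in a sample filename."""
    video_id: str   # "071" (FF++) or "01__exit_phone_room" (DFD)
    generator: str  # directory name, e.g. "wan2-1"
    branch: str     # "T2V" or "I2V"
    source: str     # "FF++" or "DFD"


def parse_sample_name(filename: str) -> SampleName:
    """Split '<ID>_<generator>_<t2v|i2v>.<ext>' into its parts."""
    stem = PurePosixPath(filename).stem
    # Split from the right: scene names may contain underscores,
    # generator names (as listed in this README) do not.
    parts = stem.rsplit("_", 2)
    if len(parts) != 3 or parts[2] not in ("t2v", "i2v"):
        raise ValueError(f"unexpected sample name: {filename!r}")
    video_id, generator, mode = parts
    if re.fullmatch(r"\d{3}", video_id):
        source = "FF++"   # zero-padded three-digit ID
    elif re.fullmatch(r"\d{2}__.+", video_id):
        source = "DFD"    # <subject_id>__<scene>
    else:
        raise ValueError(f"unrecognized ID scheme: {video_id!r}")
    return SampleName(video_id, generator, mode.upper(), source)


def video_path(root: str, name: SampleName, compression: str = "raw") -> Path:
    """Build <root>/<branch>/videos/<compression>/<generator>/<file>.mp4."""
    if compression not in COMPRESSIONS:
        raise ValueError(f"unknown compression level: {compression!r}")
    filename = f"{name.video_id}_{name.generator}_{name.branch.lower()}.mp4"
    return Path(root) / name.branch / "videos" / compression / name.generator / filename
```

For example, `parse_sample_name("01__exit_phone_room_cogvideox_t2v.mp4")` yields the ID `01__exit_phone_room`, the generator `cogvideox`, the branch `T2V`, and the source `DFD`.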
The files train.json, test.json, and val.json each contain a list of video identifiers (zero-padded three-digit strings, e.g., "071", "954") that define the official training, test, and validation partitions of the benchmark. These identifiers are inherited directly from the FaceForensics++ dataset splits, ensuring full compatibility with the FF++ evaluation protocol.
The identifiers serve a dual purpose:
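- Fake video selection. For each generator, only the videos whose numeric ID appears in the corresponding split file should be included in that partition. Concretely, given a split set $\mathcal{S}$ and a generator $g$, the subset of fake videos assigned to that partition is:

  $$\mathcal{V}_g^{\mathcal{S}} = \{\, v \in \mathcal{V}_g : \mathrm{id}(v) \in \mathcal{S} \,\}$$

  where $\mathcal{V}_g$ denotes the set of all videos produced by generator $g$ and $\mathrm{id}(v)$ is the <ID> prefix of the filename of $v$. This selection applies uniformly across all generators in both the T2V and I2V branches, at every available compression level.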
- Real video selection. The same identifiers correspond to the real (pristine) videos from the FaceForensics++ dataset that should be treated as the authentic counterpart for each partition. Detectors trained or evaluated on SynthForensics are therefore expected to use the FF++ real videos indexed by the same IDs as the negative class, preserving the one-to-one correspondence between real and fake samples established by the original FF++ benchmark.
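The ID-based fake-video selection described above can be sketched as follows. This is an illustration under stated assumptions rather than the official loader: `load_split_ids` and `select_fakes` are hypothetical helper names, and the split files are assumed to be flat JSON arrays of ID strings, as the description of train.json, test.json, and val.json suggests.

```python
import json


def load_split_ids(split_file):
    """Read a split file, assumed to be a flat JSON array of ID strings."""
    with open(split_file, encoding="utf-8") as f:
        return set(json.load(f))


def select_fakes(filenames, split_ids):
    """Keep the files whose <ID> prefix appears in the split set.

    Filenames follow '<ID>_<generator>_<t2v|i2v>.<ext>'; the ID is
    recovered by stripping the extension and the last two
    underscore-separated fields.
    """
    selected = []
    for name in filenames:
        stem = name.rsplit(".", 1)[0]
        video_id = stem.rsplit("_", 2)[0]
        if video_id in split_ids:
            selected.append(name)
    return sorted(selected)
```

The same function can be applied to the listing of any `<branch>/videos/<compression>/<generator>/` directory, since the selection rule does not depend on branch or compression level.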
The test partition is additionally supplemented with the full DeepFakeDetection (DFD) dataset. Unlike the outputs of the SynthForensics generators, whose test samples are selected via the ID-based mechanism described above, all DFD videos are included in the test split without any ID-based filtering. DFD videos follow the naming convention <subject_id>__<scene>.mp4 (e.g., 01__exit_phone_room.mp4) and are drawn from 16 distinct scenarios across multiple subjects. These samples serve as an out-of-domain evaluation source, enabling assessment of detector generalization beyond the FF++-aligned fake distribution.
| Branch | Display name | Directory name | Videos (raw) |
|---|---|---|---|
| T2V | CogVideoX | cogvideox | 1,363 |
| T2V | DaVinci-MagiHuman | daVinci-MagiHuman | 1,363 |
| T2V | Helios | helios | 1,363 |
| T2V | LTX-2.3 | ltx2-3 | 1,363 |
| T2V | Magi-1 | magi-1 | 1,363 |
| T2V | Self-Forcing | self-forcing | 1,363 |
| T2V | SkyReels-V2 | skyreels-v2 | 1,363 |
| T2V | Wan2.1 | wan2-1 | 1,363 |
| I2V | CogVideoX | cogvideox | 1,363 |
| I2V | DaVinci-MagiHuman | daVinci-MagiHuman | 1,363 |
| I2V | Helios | helios | 1,363 |
| I2V | LTX-2.3 | ltx2-3 | 1,363 |
| I2V | Magi-1 | magi-1 | 1,363 |
| I2V | SkyReels-V2 | skyreels-v2 | 1,363 |
| I2V | Wan2.1 | wan2-1 | 1,363 |
| Total (raw) | 15 T2V+I2V generators | | 20,445 |
| Total (all compressions) | 15 generators × 4 compression levels | | 81,780 |
| Metric | Value |
|---|---|
| Unique Synthetic Videos (T2V) | 10,904 |
| Unique Synthetic Videos (I2V) | 9,541 |
| Total Unique Synthetic Videos | 20,445 |
| Total Video Files (4 compressions) | 81,780 |
| Total Unique Frames | 1,934,097 |
| Total Unique Video Duration | ~27.2 hours |
| Landscape Videos | 16,349 |
| Portrait Videos | 4,096 |
| Resolution Range (W×H) | 640×384 – 1920×1088 |
| Frame Rate Range (FPS) | 8 – 25 |
| Duration Range (s) | 4 – 6 |
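The totals in the two tables above follow directly from the per-generator count. The short check below reproduces them; it uses only figures stated in this README, and the variable names are illustrative.

```python
VIDEOS_PER_GENERATOR = 1_363   # raw videos per generator
N_T2V, N_I2V = 8, 7            # generators per branch
N_COMPRESSIONS = 4             # raw, canonical, crf23, crf40

t2v_total = N_T2V * VIDEOS_PER_GENERATOR
i2v_total = N_I2V * VIDEOS_PER_GENERATOR
unique_total = t2v_total + i2v_total
all_files = unique_total * N_COMPRESSIONS

print(t2v_total, i2v_total, unique_total, all_files)
# 10904 9541 20445 81780

# Landscape and portrait counts partition the unique videos.
assert 16_349 + 4_096 == unique_total
```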
Resolutions are reported for the raw (uncompressed) videos; compressed versions preserve the same dimensions. Orientation: L = landscape (W > H), P = portrait (H > W).
| Branch | Generator | Resolution (W×H) | Orient. | Count (raw) |
|---|---|---|---|---|
| T2V | CogVideoX | 720×480 | L | 1,363 |
| T2V | DaVinci-MagiHuman | 1920×1088 | L | 667 |
| T2V | DaVinci-MagiHuman | 1088×1920 | P | 696 |
| T2V | Helios | 640×384 | L | 1,363 |
| T2V | LTX-2.3 | 1536×1024 | L | 703 |
| T2V | LTX-2.3 | 1024×1536 | P | 660 |
| T2V | Magi-1 | 1280×720 | L | 665 |
| T2V | Magi-1 | 720×1280 | P | 698 |
| T2V | Self-Forcing | 832×480 | L | 664 |
| T2V | Self-Forcing | 480×832 | P | 699 |
| T2V | SkyReels-V2 | 960×544 | L | 702 |
| T2V | SkyReels-V2 | 544×960 | P | 661 |
| T2V | Wan2.1 | 832×480 | L | 689 |
| T2V | Wan2.1 | 480×832 | P | 674 |
| I2V | CogVideoX | 720×480 | L | 1,363 |
| I2V | DaVinci-MagiHuman | 1920×1088 | L | 1,361 |
| I2V | DaVinci-MagiHuman | 1088×1920 | P | 2 |
| I2V | Helios | 640×384 | L | 1,363 |
| I2V | LTX-2.3 | 1536×1024 | L | 1,361 |
| I2V | LTX-2.3 | 1024×1536 | P | 2 |
| I2V | Magi-1 | 1280×720 | L | 1,363 |
| I2V | SkyReels-V2 | 960×544 | L | 1,361 |
| I2V | SkyReels-V2 | 544×960 | P | 2 |
| I2V | Wan2.1 | 832×464 | L | 917 |
| I2V | Wan2.1 | 720×544 | L | 273 |
| I2V | Wan2.1 | 736×528 | L | 89 |
| I2V | Wan2.1 | 704×560 | L | 51 |
| I2V | Wan2.1 | 768×512 | L | 28 |
| I2V | Wan2.1 | 800×480 | L | 1 |
| I2V | Wan2.1 | 816×480 | L | 1 |
| I2V | Wan2.1 | 688×560 | L | 1 |
| I2V | Wan2.1 | 464×832 | P | 1 |
| I2V | Wan2.1 | 608×640 | P | 1 |
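As a consistency check, the orientation rule (L if W > H, P if H > W) can be applied to the rows of the table above to recover the landscape and portrait totals reported in the dataset statistics. The sketch below copies the (width, height, count) triples from the table; the function name `orientation` is illustrative, and square frames are rejected because the README does not define an orientation for them.

```python
from collections import Counter


def orientation(width: int, height: int) -> str:
    """Return 'L' for landscape (W > H) or 'P' for portrait (H > W)."""
    if width > height:
        return "L"
    if height > width:
        return "P"
    raise ValueError("square frames have no orientation under this rule")


# (width, height, raw count) triples copied from the resolution table.
ROWS = [
    # T2V
    (720, 480, 1363),
    (1920, 1088, 667), (1088, 1920, 696),
    (640, 384, 1363),
    (1536, 1024, 703), (1024, 1536, 660),
    (1280, 720, 665), (720, 1280, 698),
    (832, 480, 664), (480, 832, 699),
    (960, 544, 702), (544, 960, 661),
    (832, 480, 689), (480, 832, 674),
    # I2V
    (720, 480, 1363),
    (1920, 1088, 1361), (1088, 1920, 2),
    (640, 384, 1363),
    (1536, 1024, 1361), (1024, 1536, 2),
    (1280, 720, 1363),
    (960, 544, 1361), (544, 960, 2),
    (832, 464, 917), (720, 544, 273), (736, 528, 89), (704, 560, 51),
    (768, 512, 28), (800, 480, 1), (816, 480, 1), (688, 560, 1),
    (464, 832, 1), (608, 640, 1),
]

totals = Counter()
for w, h, count in ROWS:
    totals[orientation(w, h)] += count

print(totals["L"], totals["P"])  # 16349 4096
```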