Verite!: Cross-Domain Deception Detection

Verite! is a cross-domain deception detection system built on ModernBERT-base, combining spectral features, hyperspherical classification, local consistency modeling, and multi-task domain learning.
It is evaluated on the DIFrauD benchmark (7 domains, ~103K samples).

Overview

Deception detection is a challenging NLP task that requires generalizing across radically different domains (fake news, phishing, product reviews, SMS spam, political statements, job scams, Twitter rumours). Verite! addresses this by combining a powerful pre-trained encoder with domain-aware multi-task learning and several auxiliary objectives designed to capture both semantic and structural deception signals.

Key contributions:

Spectral features: top-k FFT magnitudes + spectral centroid + entropy over the token sequence, capturing frequency-domain patterns of deception
Local Consistency Module: segment-level cross-attention to detect internal contradictions
HypersphericalHead: prototype-based classification on the unit hypersphere, robust to inter-domain shifts
Domain MTL: an auxiliary domain-classification head lets the shared encoder learn domain-specific cues cooperatively (a positive multi-task objective, complementary to adversarial DANN)
Multi-sample dropout (5×) + EMA + AWP (epochs 4–5): strong regularization stack
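
As a rough illustration of the spectral features listed above, the sketch below computes the top-k FFT magnitudes, the spectral centroid, and the spectral entropy of a 1-D sequence with NumPy. It assumes one scalar per token (for example a hidden-state norm); the README does not say which signal Verite! actually transforms, and the function name `spectral_features` is hypothetical.

```python
import numpy as np

def spectral_features(signal, k=8, eps=1e-8):
    """Return k top FFT magnitudes + spectral centroid + spectral entropy (k + 2 values).

    `signal` is a 1-D per-token sequence; which signal Verite! uses is an assumption here.
    """
    x = np.asarray(signal, dtype=np.float64)
    x = x - x.mean()                      # drop the DC component
    mag = np.abs(np.fft.rfft(x))          # magnitude spectrum, length n // 2 + 1

    # Top-k magnitudes in descending order, zero-padded for short sequences.
    top = np.sort(mag)[::-1][:k]
    top = np.pad(top, (0, k - top.size))

    # Treat the normalized spectrum as a probability distribution over frequency bins.
    p = mag / (mag.sum() + eps)
    bins = np.arange(mag.size)
    centroid = float((bins * p).sum() / max(mag.size - 1, 1))   # scaled to [0, 1]
    entropy = float(-(p * np.log(np.clip(p, eps, None))).sum())

    return np.concatenate([top, [centroid, entropy]])

# A pure sinusoid with 4 cycles over 32 tokens concentrates its energy in bin 4.
tokens = np.sin(2 * np.pi * 4 * np.arange(32) / 32)
feats = spectral_features(tokens)
print(feats.shape)  # (10,)
```

With k=8 the output has 10 values, which matches the 10-d `spec` vector in the architecture diagram.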

Architecture

Input Text ────────────────────────── Linguistic Features (8-d)
    │                                          │
    ▼                                          ▼
ModernBERT-base (149M params, fp32)       ling_proj → feat_emb (512-d)
    │                                          │
    ├── AttentionPooling ──► semantic_proj ──► sem_emb  (512-d) ──┐
    ├── LocalConsistencyModule (4 segs)  ──► cons_emb (256-d) ────┤
    └── SpectralFeatures (top-8 FFT + centroid + entropy) ─► spec (10-d)
                                                                   │
                  Concatenate [sem_emb | feat_emb | cons_emb | spec]  (1290-d)
                                         │
                               LayerNorm → Linear(512) → GELU
                                         │
                         Multi-sample Dropout (5×) → HypersphericalHead
                                         │
                                    Logits (2 classes)
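
The final stage of the diagram can be sketched as follows: a prototype-based head on the unit hypersphere typically L2-normalizes both the features and one prototype vector per class, then uses scaled cosine similarity as the logits. This is a minimal NumPy sketch under that assumption, not the actual `HypersphericalHead` code; the function names and the `scale` value are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project the rows of x onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def hyperspherical_logits(features, prototypes, scale=16.0):
    """Cosine similarity between each feature row and each class prototype, times a scale.

    features:   (batch, dim) fused representations
    prototypes: (num_classes, dim) one vector per class (learnable in a real model)
    scale:      hypothetical temperature; the README does not state the value used
    """
    return scale * l2_normalize(features) @ l2_normalize(prototypes).T

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(2, 512))            # 2 classes, 512-d as in the diagram
features = np.stack([3.0 * prototypes[0],         # same direction as class 0, larger norm
                     0.1 * prototypes[1]])        # same direction as class 1, smaller norm
logits = hyperspherical_logits(features, prototypes)
print(logits.argmax(axis=-1))  # [0 1]
```

Because only direction matters after normalization, the two inputs are classified the same way regardless of their norms, which is one reason such heads are considered less sensitive to scale shifts between domains.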

Training objectives:

Focal loss (γ=2.0) with class-balanced weights and label smoothing (ε=0.05)
Supervised Contrastive loss (λ=0.1, τ=0.07) on semantic embeddings
Domain MTL cross-entropy (λ=0.1) on 7 domain heads
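
To make the objectives above concrete, here is a small pure-Python sketch of a focal loss with a label-smoothed target, plus the weighted sum of the three terms. The way focal weighting, class weights, and label smoothing are combined is one common formulation and an assumption here; the README does not give the exact formula, and `focal_loss` and `total_loss` are hypothetical names.

```python
import math

def focal_loss(logits, target, gamma=2.0, smoothing=0.05, class_weights=None):
    """Focal loss for one example with a label-smoothed target.

    One common formulation (an assumption, not necessarily Verite!'s exact one):
        loss = -sum_c w_c * q_c * (1 - p_c)**gamma * log(p_c)
    where p = softmax(logits) and q is the smoothed one-hot target.
    """
    n = len(logits)
    weights = class_weights or [1.0] * n

    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Label smoothing: 1 - eps on the true class plus eps spread uniformly.
    q = [smoothing / n + (1.0 - smoothing) * (c == target) for c in range(n)]

    return -sum(w * qc * (1.0 - p) ** gamma * math.log(p)
                for w, qc, p in zip(weights, q, probs))

def total_loss(focal, supcon, domain_ce, lam_supcon=0.1, lam_domain=0.1):
    """Weighted sum of the three objectives using the lambda values listed above."""
    return focal + lam_supcon * supcon + lam_domain * domain_ce

easy = focal_loss([4.0, -4.0], target=0)   # confident, correct prediction
hard = focal_loss([-1.0, 1.0], target=0)   # wrong prediction
print(easy < hard)  # True
```

Setting `gamma=0` and `smoothing=0` reduces this formulation to plain cross-entropy, which is a quick sanity check for any reimplementation.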

Optimization:

AdamW with Layer-wise LR Decay (LLRD, decay=0.9): encoder LR=1e-5, head LR=1e-4
Cosine schedule with 8% linear warmup
Gradient accumulation (×4), gradient clipping (0.7)
Adversarial Weight Perturbation (AWP, ε=0.001) from epoch 4 onward
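
The learning-rate settings above can be illustrated with two small helpers: one that assigns per-layer rates under layer-wise decay, and one that returns the warmup-plus-cosine multiplier for a given step. This is a sketch under stated assumptions: it treats the encoder LR of 1e-5 as the rate of the top encoder layer, decays the schedule to zero, and uses hypothetical function names; the actual parameter grouping in VeriteTrainer.py may differ.

```python
import math

def llrd_learning_rates(num_layers, top_lr=1e-5, decay=0.9):
    """Per-layer encoder LRs: the top layer gets top_lr, each layer below is scaled by decay.

    Assumes top_lr applies to the top encoder layer; index 0 is closest to the embeddings.
    """
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

def warmup_cosine(step, total_steps, warmup_frac=0.08):
    """LR multiplier: linear ramp from 0 to 1 during warmup, then cosine decay to 0."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

rates = llrd_learning_rates(num_layers=22)   # ModernBERT-base has 22 encoder layers
print(f"top layer: {rates[-1]:.2e}, bottom layer: {rates[0]:.2e}")
print([round(warmup_cosine(s, 1000), 3) for s in (0, 40, 80, 540, 1000)])
```

With a decay of 0.9 over 22 layers, the lowest layer trains roughly an order of magnitude more slowly than the top one, while the task heads keep their separate LR of 1e-4.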

Results

Evaluated on the DIFrauD test set (macro-F1 and AUC-ROC; higher is better for both).

System	Macro-F1	AUC-ROC
Majority class	0.3792	0.5000
TF-IDF + LR	0.8094	0.9079
ModernBERT-base (fine-tuned)	~0.82	—
Verite! (ours)	0.8512	0.9487
SOTA (DIFrauD leaderboard)	0.904	—

Results obtained with a single seed (seed=42) on 2×NVIDIA T4 GPUs.
A multi-seed ensemble (multi_seed=True, 3 seeds) is expected to narrow the remaining gap to the leaderboard SOTA.
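
For readers unfamiliar with the metric, the sketch below computes macro-F1 in pure Python and shows why a majority-class baseline scores below 0.5 even when its accuracy is higher: the class it never predicts contributes an F1 of 0 to the average. The labels are toy values chosen for illustration, not DIFrauD data, and `macro_f1` is a hypothetical helper.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (not DIFrauD data): 6 truthful (0) and 4 deceptive (1) examples.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
majority = [0] * len(y_true)            # always predict the majority class
print(round(macro_f1(y_true, majority), 4))  # 0.375
```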

Installation

git clone https://github.com/Daxlia/Verite.git
cd Verite
pip install "torch>=2.1.0" "transformers>=4.47.0" safetensors sentencepiece
pip install scikit-learn pandas numpy tqdm datasets huggingface_hub

Hardware requirements: 1–2 GPUs with ≥15GB VRAM (tested on 2×T4 16GB).

Usage

Inference

import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from safetensors.torch import load_file

from VeriteTrainer import DeceptionReasoningModel, Config, DeceptionDataset, collate_fn

cfg       = Config()
tokenizer = AutoTokenizer.from_pretrained("Daxlia/verite")

model = DeceptionReasoningModel(cfg)
model.load_state_dict(load_file("model.safetensors"))
model.eval()

texts = ["This is a suspicious message claiming you've won a prize."]

dataset = DeceptionDataset(texts, [0] * len(texts), tokenizer, cfg)
loader  = DataLoader(dataset, batch_size=8, shuffle=False, collate_fn=collate_fn)

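probs = []
with torch.no_grad():
    for batch in loader:
        out = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    ling_feats=batch["ling_feats"])
        # Column 1 holds the probability of the "deceptive" class.
        probs.extend(torch.softmax(out["logits"], dim=-1)[:, 1].tolist())

# Pair texts with probabilities after the loop so that batches beyond the first stay aligned.
for t, p in zip(texts, probs):
    print(f"P(deceptive) = {p:.4f} | {t}")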

Training from scratch

Training was run on a Kaggle notebook with the following setup:

Accelerator: 2×T4 GPU
Dataset input: difraud/difraud (added via HuggingFace Hub integration)
VeriteTrainer.py uploaded as a private Kaggle dataset input
Runtime: ~13 h (< 14 h total session)

Cell 1 — install dependencies:

!pip install "transformers>=4.47.0" safetensors sentencepiece

Cell 2 — run training:

exec(open("/kaggle/input/verite/VeriteTrainer.py").read())

Then: Save Version → Run All.

Training

Training was performed on 2×NVIDIA T4 GPUs (16GB each) from Kaggle's free GPU tier.

Hyperparameter	Value
Base encoder	`answerdotai/ModernBERT-base`
Total runtime	~13 h on 2×T4 (< 14 h)
Max sequence length	256
Batch size (effective)	64 (8 × 2 GPUs × 4 accum steps)
Epochs	5
Encoder LR	1e-5
Head LR	1e-4
LLRD decay	0.9
Warmup	8%
Weight decay	0.02
Focal γ	2.0
SupCon λ	0.1
Domain MTL λ	0.1
AWP start	Epoch 4
EMA decay	0.995
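
As an illustration of the EMA setting in the table above, the following sketch applies the standard exponential-moving-average update to a dictionary of scalar "weights". It is a simplified stand-in for what would be tensor updates in a real model; how and when VeriteTrainer.py applies EMA (for example, whether the averaged weights are used at evaluation time) is not stated in this README, and `ema_update` is a hypothetical name.

```python
def ema_update(shadow, params, decay=0.995):
    """Update the EMA ("shadow") copy in place: shadow = decay * shadow + (1 - decay) * param."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow

# Hypothetical single scalar "weight" that jumps from 0.0 to 1.0 and stays there.
shadow = {"w": 0.0}
for _ in range(200):
    ema_update(shadow, {"w": 1.0})

# After n steps the EMA has covered 1 - decay**n of the jump.
print(round(shadow["w"], 4), round(1 - 0.995 ** 200, 4))
```

With a decay of 0.995 the average has an effective memory of roughly 1 / (1 - 0.995) = 200 optimizer steps, which smooths out step-to-step noise in the weights.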

Citation

If you use Verite! in your research, please cite:

@misc{verite2026,
  author    = {Daxlia},
  title     = {Verite!: Cross-Domain Deception Detection with ModernBERT},
  year      = {2026},
  doi       = {10.5281/zenodo.20256648},
  url       = {https://doi.org/10.5281/zenodo.20256648}
}

AI Disclosure

This project was developed with assistance from AI tools:

ChatGPT (OpenAI) — Bug identification, algorithmic suggestions, and research ideation
Claude (Anthropic) — Code implementation, debugging, and writing of documentation files

All AI-generated content was reviewed, validated, and adapted by the author.

License

This project is licensed under the MIT License.

The base encoder ModernBERT-base is licensed under Apache 2.0.
The DIFrauD dataset is subject to its own license; refer to the dataset repository.
