# WavLM - Release-In-The-Wild - Retrieval Augmented Framework for Deepfake Audio Detection

## WavLM Based DeepFake Audio Detection

* This notebook uses a WavLM-based classifier, which operates directly on raw audio and can be applied to a range of speech tasks.
* Here it is applied to deepfake audio detection.


### Summary of Key Processing Steps in WavLM
1. Raw-audio input: the model works directly on the waveform, so input preparation is simple.
2. Feature extraction: convolutional layers extract local features, and a Transformer encoder then captures global context in the audio.
3. Gated relative position bias: improves contextual modeling, which can help in capturing the anomalies or artifacts typical of deepfake audio.
4. Together, these steps let WavLM draw on both local and global information in the signal to pick up the subtle patterns that distinguish real from synthetic audio.
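
To make steps 1 and 2 concrete, the sketch below counts how many feature frames the convolutional front end yields for a given number of raw samples. The kernel sizes and strides are an assumption: they are the defaults of the WavLM Base architecture (shared with wav2vec 2.0 and HuBERT), not values read from this notebook's `Config`.

```python
# Assumed geometry of the 7-layer convolutional feature encoder
# (WavLM Base defaults; not taken from this notebook's Config).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Number of feature frames produced from `num_samples` raw audio samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Unpadded 1-D convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


# Overall downsampling factor is the product of the strides.
total_stride = 1
for stride in CONV_STRIDES:
    total_stride *= stride

frames_per_second = num_frames(16_000)  # one second of 16 kHz audio
print(f"total stride: {total_stride} samples, frames for 1 s: {frames_per_second}")
```

Under these assumptions, the front end downsamples by a factor of 320, so one second of 16 kHz audio becomes 49 frames (roughly one every 20 ms). The Transformer encoder then maps each frame to a feature vector.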

## Training

In [None]:
import logging
import os

import torch

from config import Config
from dataset import AudioDataset
from pipeline import DeepfakeDetectionPipeline

# ========================
# Main runner (same behavior; wandb toggle)
# ========================
# Run the complete audio deepfake detection pipeline with single-GPU optimizations.

# 1. Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# 3. Disable problematic torchaudio backends
os.environ["TORCHAUDIO_USE_SOX"] = "0"
os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"] = "1"

# 4. Set device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    torch.cuda.set_device(device)

# 5. Create configuration
config = Config()
config.device = device

config.train_split = 0.8
mode = "train"
audio_path = None
config.feature_extractor_type = "wavlm"
pipeline_check = False

if pipeline_check:
    # Quick smoke test: 1% of the data, 2 epochs, W&B logging off
    config.data_fraction = 0.01
    config.num_epochs = 2
    use_wandb = False
else:
    # Full run: all data, 10 epochs, W&B logging on
    config.data_fraction = 1.0
    config.num_epochs = 10
    use_wandb = True  # set False to disable W&B


config.use_wandb = use_wandb

# 6. DataLoader settings
config.num_workers = max(1, torch.cuda.device_count() * 2)
config.train_batch_size = getattr(config, "train_batch_size", 256)
config.eval_batch_size = getattr(config, "eval_batch_size", 256)
config.db_batch_size = getattr(config, "db_batch_size", 64)
config.top_k = getattr(config, "top_k", 5)
config.use_batch_norm = False
config.use_layer_norm = True


# 7. Initialize pipeline
pipeline = DeepfakeDetectionPipeline(config)

if mode == "train":
    train_dataset = AudioDataset(config, is_train=True, split_data=True)
    val_dataset   = AudioDataset(config, is_train=False, split_data=True)
    pipeline.print_split_stats(train_dataset, "Train")
    pipeline.print_split_stats(val_dataset,   "Val")
    pipeline.train(train_dataset, val_dataset)

elif mode == "evaluate":
    config.use_wandb = False
    pipeline.load_models("final_model")
    pipeline.vector_db.load()

    test_dataset = AudioDataset(config, is_train=False, split_data=True)
    if hasattr(pipeline, "evaluate_with_metrics"):
        metrics = pipeline.evaluate_with_metrics(test_dataset)
        print("Evaluation metrics:")
        for key, value in metrics.items():
            print(f"{key}: {value}")
    else:
        loss, acc = pipeline.evaluate(test_dataset)
        print(f"Eval Loss: {loss:.4f}, Eval Acc: {acc:.4f}")

elif mode == "predict":
    if not audio_path:
        raise ValueError("Audio path must be provided for predict mode")
    pipeline.load_models("best_model")
    pipeline.vector_db.load()
    result = pipeline.predict(audio_path)
    logging.info(f"Prediction  : {result['prediction']}")
    logging.info(f"Probability(bona-fide) : {result['probability_bonafide']:.4f}")
    logging.info(f"Retrieved   : {result['retrieved_labels']}")

Feature dimension set to: 768
Train set → total: 25423, bonafide: 9453 (37.18%), spoof: 15970 (62.82%)
Val set → total: 6356, bonafide: 2363 (37.18%), spoof: 3993 (62.82%)
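
The two splits above keep the same class ratio (37.18% bona-fide, 62.82% spoof), which is what a stratified 80/20 split would produce. The sketch below is one way to obtain such a split; it is an assumption about the behavior, not the actual `AudioDataset` code, and the helper name `stratified_split` is hypothetical. The class totals are simply the sums of the train and validation counts in the log.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split indices so that each class keeps roughly `train_fraction` in train."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        val_idx.extend(indices[cut:])
    return train_idx, val_idx


# Class totals = train + val counts reported in the log above.
labels = ["bona-fide"] * (9453 + 2363) + ["spoof"] * (15970 + 3993)
train_idx, val_idx = stratified_split(labels, train_fraction=0.8)

train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print("train:", dict(train_counts))
print("val:  ", dict(val_counts))
```

With `train_fraction=0.8` (the value of `config.train_split`), this reproduces the per-class counts shown in the log, which suggests but does not prove that the dataset is split per class.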


Vector DB Build: 100%|██████████| 398/398 [20:32<00:00,  3.10s/it]
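
The progress-bar totals in this run follow from the dataset sizes and batch sizes. The short check below assumes the fallback values from the configuration cell are in effect (`db_batch_size=64`, `train_batch_size=256`, `eval_batch_size=256`) and that the last partial batch is kept; `num_batches` is a hypothetical helper.

```python
import math


def num_batches(num_items: int, batch_size: int) -> int:
    """Batches needed to cover `num_items` when the last partial batch is kept."""
    return math.ceil(num_items / batch_size)


train_size, val_size = 25423, 6356  # from the split statistics in the log

db_steps = num_batches(train_size, 64)      # vector DB build over the train set
train_steps = num_batches(train_size, 256)  # training steps per epoch
eval_steps = num_batches(val_size, 256)     # evaluation steps per epoch
print(db_steps, train_steps, eval_steps)
```

These values match the totals shown by the `Vector DB Build`, `Epoch` and `Evaluating` progress bars (398, 100 and 25), which is consistent with the vector database being built from the training split.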


Epoch 1/10: 100%|██████████| 100/100 [19:58<00:00, 11.99s/it]
Evaluating: 100%|██████████| 25/25 [05:00<00:00, 12.00s/it]


Epoch 1: Train Loss: 0.8383, Train Acc: 0.5761, Val Loss: 0.7630, Val Acc:0.5887 | AUC: 0.8866, EER: 20.51% (thr=1.3564), Macro EER: 17.55%, min t-DCF: nan


Epoch 2/10: 100%|██████████| 100/100 [19:45<00:00, 11.86s/it]
Evaluating: 100%|██████████| 25/25 [04:57<00:00, 11.90s/it]


Epoch 2: Train Loss: 0.5360, Train Acc: 0.8116, Val Loss: 0.3876, Val Acc:0.8622 | AUC: 0.9508, EER: 12.48% (thr=0.5547), Macro EER: 7.89%, min t-DCF: nan


Epoch 3/10: 100%|██████████| 100/100 [19:35<00:00, 11.75s/it]
Evaluating: 100%|██████████| 25/25 [04:56<00:00, 11.87s/it]


Epoch 3: Train Loss: 0.4364, Train Acc: 0.8557, Val Loss: 0.3388, Val Acc:0.9138 | AUC: 0.9708, EER: 9.13% (thr=-1.5684), Macro EER: 5.27%, min t-DCF: nan


Epoch 4/10: 100%|██████████| 100/100 [19:21<00:00, 11.61s/it]
Evaluating: 100%|██████████| 25/25 [04:56<00:00, 11.88s/it]


Epoch 4: Train Loss: 0.3553, Train Acc: 0.8889, Val Loss: 0.3437, Val Acc:0.8609 | AUC: 0.9750, EER: 8.04% (thr=1.2041), Macro EER: 5.45%, min t-DCF: nan


Epoch 5/10: 100%|██████████| 100/100 [19:35<00:00, 11.75s/it]
Evaluating: 100%|██████████| 25/25 [04:53<00:00, 11.75s/it]


Epoch 5: Train Loss: 0.3315, Train Acc: 0.8942, Val Loss: 0.2943, Val Acc:0.9254 | AUC: 0.9814, EER: 7.25% (thr=-1.6426), Macro EER: 4.66%, min t-DCF: nan


Epoch 6/10: 100%|██████████| 100/100 [19:35<00:00, 11.75s/it]
Evaluating: 100%|██████████| 25/25 [04:52<00:00, 11.69s/it]


Epoch 6: Train Loss: 0.3124, Train Acc: 0.9017, Val Loss: 0.2993, Val Acc:0.9251 | AUC: 0.9830, EER: 6.72% (thr=-1.8662), Macro EER: 4.43%, min t-DCF: nan


Epoch 7/10: 100%|██████████| 100/100 [19:15<00:00, 11.55s/it]
Evaluating: 100%|██████████| 25/25 [04:48<00:00, 11.54s/it]


Epoch 7: Train Loss: 0.2785, Train Acc: 0.9143, Val Loss: 0.2084, Val Acc:0.9412 | AUC: 0.9855, EER: 6.19% (thr=-0.7051), Macro EER: 4.30%, min t-DCF: nan


Epoch 8/10: 100%|██████████| 100/100 [19:10<00:00, 11.50s/it]
Evaluating: 100%|██████████| 25/25 [04:47<00:00, 11.49s/it]


Epoch 8: Train Loss: 0.2472, Train Acc: 0.9248, Val Loss: 0.2294, Val Acc:0.9399 | AUC: 0.9869, EER: 6.31% (thr=-1.5547), Macro EER: 4.74%, min t-DCF: nan


Epoch 9/10: 100%|██████████| 100/100 [19:00<00:00, 11.41s/it]
Evaluating: 100%|██████████| 25/25 [04:44<00:00, 11.39s/it]


Epoch 9: Train Loss: 0.2614, Train Acc: 0.9173, Val Loss: 0.2445, Val Acc:0.9039 | AUC: 0.9869, EER: 5.88% (thr=1.1104), Macro EER: 4.03%, min t-DCF: nan


Epoch 10/10: 100%|██████████| 100/100 [18:57<00:00, 11.38s/it]
Evaluating: 100%|██████████| 25/25 [04:46<00:00, 11.45s/it]


Epoch 10: Train Loss: 0.2258, Train Acc: 0.9309, Val Loss: 0.2031, Val Acc:0.9292 | AUC: 0.9879, EER: 5.41% (thr=0.8394), Macro EER: 4.32%, min t-DCF: nan
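
Each epoch line reports an equal error rate (EER) and the threshold at which it occurs. The EER is the operating point where the false acceptance rate (spoof accepted as bona-fide) equals the false rejection rate (bona-fide rejected as spoof). The sketch below shows one simple way to estimate it. It is not the pipeline's implementation: the function name, the toy scores, and the convention that higher scores mean "more bona-fide" are all assumptions.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming higher scores mean 'more bona-fide'."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for thr in thresholds:
        # False acceptance: spoof samples scoring at or above the threshold
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona-fide samples scoring below the threshold
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, thr)
    return best[1], best[2]


# Toy scores, invented for illustration only
bonafide = [2.1, 1.4, 0.9, 0.3, -0.2]
spoof = [-1.8, -1.1, -0.6, 0.1, 0.5]

eer, thr = compute_eer(bonafide, spoof)
print(f"EER: {eer:.2%} (thr={thr})")
```

Because the threshold is chosen on the raw scores, it can be negative or positive, as in the log above. Accuracy at a fixed decision threshold can therefore move differently from the EER: epoch 10 has the lowest EER (5.41%) even though validation accuracy peaked at epoch 7 (0.9412).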


W&B run history (sparkline plots omitted): curves/auc, grad_norm/detection, grad_norm/fuse, grad_norm/projection, lr/detection, lr/fuse, lr/projection, train/batch_loss, train/nnz_neighbor_rate

W&B run summary (final logged values):

| Metric | Value |
|---|---|
| curves/auc | 0.98786 |
| grad_norm/detection | 0.3364 |
| grad_norm/fuse | 1.04317 |
| grad_norm/projection | 0.03671 |
| lr/detection | 0.001 |
| lr/fuse | 0.001 |
| lr/projection | 0.001 |
| train/batch_loss | 0.1487 |
| train/nnz_neighbor_rate | 1.0 |


## Loss and Accuracy Curves

---



In [None]:
pipeline.plot_training_curves()

In [None]:
pipeline.show_curves_inline(smooth=7)
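
How `show_curves_inline(smooth=7)` smooths the curves is not shown in this notebook. A common choice is a trailing moving average over the last `smooth` points, sketched below under that assumption; `moving_average` is a hypothetical helper, and the input is the per-epoch validation loss copied from the training log.

```python
def moving_average(values, window=7):
    """Trailing moving average; early points use the shorter available history."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Validation loss per epoch, copied from the training log
val_loss = [0.7630, 0.3876, 0.3388, 0.3437, 0.2943,
            0.2993, 0.2084, 0.2294, 0.2445, 0.2031]

smoothed = moving_average(val_loss, window=7)
print([round(v, 4) for v in smoothed])
```

With only ten points per curve, a window of 7 flattens most epoch-to-epoch variation, so such smoothing is more useful for per-batch metrics such as `train/batch_loss`.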

## Sample Predictions

### Spoof Prediction

In [None]:
import pandas as pd
file_name = "10136.wav"
audio_path = "/content/release_in_the_wild/"+file_name

df = pd.read_csv("/content/release_in_the_wild/meta.csv")
expected = df[df["file"]==file_name]


# Single-file prediction
result = pipeline.predict(audio_path)

filtered_df = df[df["file"].isin(result['retrieved_files'])]


print(f"Prediction  : {result['prediction']}, Expected: {expected['label'].values[0]}, Speaker: {expected['speaker'].values[0]}")
print(f"Probability Spoof: {result['probability_spoof']:.4f}")
print("Similar Audio Files retrieved")
print(filtered_df)

Prediction  : spoof, Expected: spoof, Speaker: Alec Guinness
Probability Spoof: 0.9414
Similar Audio Files retrieved
            file                speaker  label
8551    8551.wav  Arnold Schwarzenegger  spoof
15187  15187.wav          Alec Guinness  spoof
21971  21971.wav               Ayn Rand  spoof
25803  25803.wav               Ayn Rand  spoof
29585  29585.wav          Alec Guinness  spoof
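
The retrieved files above come from a nearest-neighbor lookup in the vector database (`top_k` is set to 5 in the configuration cell). The sketch below illustrates the idea with cosine similarity over a tiny in-memory list. It is a simplified stand-in, not the `pipeline.vector_db` implementation: the similarity measure, the 2-D embeddings, and the file names are all made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, database, k=5):
    """Return the k entries most similar to `query` as (score, file, label)."""
    scored = [(cosine(query, vec), name, label) for name, vec, label in database]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical database entries: (file name, embedding, label)
database = [
    ("clip_a.wav", [1.0, 0.1], "spoof"),
    ("clip_b.wav", [0.9, 0.2], "spoof"),
    ("clip_c.wav", [0.1, 1.0], "bona-fide"),
    ("clip_d.wav", [0.2, 0.9], "bona-fide"),
    ("clip_e.wav", [0.8, 0.3], "spoof"),
]

neighbors = retrieve_top_k([1.0, 0.15], database, k=3)
for score, name, label in neighbors:
    print(f"{name}  {label}  similarity={score:.4f}")
```

In the real pipeline the embeddings would be the 768-dimensional WavLM features, and a dedicated index would usually replace this linear scan.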


### Bonafide Prediction

In [None]:
file_name = "10135.wav"
audio_path = "/content/release_in_the_wild/"+file_name

df = pd.read_csv("/content/release_in_the_wild/meta.csv")
expected = df[df["file"]==file_name]


# Single-file prediction
result = pipeline.predict(audio_path)

filtered_df = df[df["file"].isin(result['retrieved_files'])]


print(f"Prediction  : {result['prediction']}, Expected: {expected['label'].values[0]}, Speaker: {expected['speaker'].values[0]}")
print(f"Probability Spoof: {result['probability_spoof']:.4f}")
print("Similar Audio Files retrieved")
print(filtered_df)

Prediction  : bona-fide, Expected: bona-fide, Speaker: Barack Obama
Probability Spoof: 0.1082
Similar Audio Files retrieved
            file       speaker      label
3019    3019.wav  Barack Obama  bona-fide
5376    5376.wav  Barack Obama  bona-fide
7594    7594.wav  Barack Obama  bona-fide
7666    7666.wav  Barack Obama  bona-fide
21007  21007.wav  Barack Obama  bona-fide
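
One way to read the two retrieval tables is to measure how often the neighbors share the query's label and speaker. The sketch below does this with the rows printed in the two sample outputs; `agreement` is a hypothetical helper, not part of the pipeline.

```python
def agreement(neighbors, expected_label, expected_speaker):
    """Fraction of neighbors sharing the query's label and the query's speaker."""
    total = len(neighbors)
    label_rate = sum(label == expected_label for _, label in neighbors) / total
    speaker_rate = sum(speaker == expected_speaker for speaker, _ in neighbors) / total
    return label_rate, speaker_rate


# (speaker, label) rows copied from the two retrieval tables
spoof_neighbors = [
    ("Arnold Schwarzenegger", "spoof"),
    ("Alec Guinness", "spoof"),
    ("Ayn Rand", "spoof"),
    ("Ayn Rand", "spoof"),
    ("Alec Guinness", "spoof"),
]
bonafide_neighbors = [("Barack Obama", "bona-fide")] * 5

spoof_rates = agreement(spoof_neighbors, "spoof", "Alec Guinness")
bonafide_rates = agreement(bonafide_neighbors, "bona-fide", "Barack Obama")
print("spoof query     (label, speaker):", spoof_rates)
print("bona-fide query (label, speaker):", bonafide_rates)
```

In both examples every neighbor carries the query's label. Speaker agreement differs: all five neighbors of the bona-fide query are from the same speaker, while only two of five are for the spoof query. This would fit spoofed clips clustering partly by synthesis artifacts rather than by speaker identity, but two examples are not enough to conclude that.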
