# Part 3: Documentation & Analysis

# Momenta Audio Deepfake Detection
- **Model**: [mo-thecreator/Deepfake-audio-detection](https://huggingface.co/mo-thecreator/Deepfake-audio-detection) (wav2vec2-base)
- **Dataset**: [Hemg/Deepfake-Audio-Dataset](https://huggingface.co/datasets/Hemg/Deepfake-Audio-Dataset)
- **Goal**: Fine-tune the model on the Hemg dataset for deepfake audio detection

## Implementation Process

### Challenges Encountered
- **Small Dataset**: The Hemg dataset only had 100 samples—splitting it left me with a tiny validation set (10 samples).
- **Audio Mismatch**: Some clips weren’t at 16 kHz, which wav2vec2-base needs, so I had to resample them.
- **Overfitting Risk**: Training loss dropped fast (0.0002 by Epoch 10), but validation loss bounced around—model might be memorizing too much.

### How I Addressed These Challenges
- **Dataset Split**: Used `train_test_split` (`test_size=0.1`, `seed=42`) to get 90 train and 10 test samples. I used all 100 clips, so no further subsetting was needed.
- **Resampling**: Added a `torchaudio` resampler in `preprocess_function` to fix sampling rates to 16 kHz—matched the model’s expectations.
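- **Overfitting Fix**: Added noise and time-shifting in preprocessing to shake up the data. Set `load_best_model_at_end=True` so the final model comes from the best checkpoint, not the later overfit ones. Epoch 2 had the lowest val loss (0.307129), but the script sets `metric_for_best_model="accuracy"`, so the checkpoint is chosen by accuracy rather than val loss; the saved `checkpoint-12` (12 steps per epoch) points to Epoch 1, the first epoch to reach 90%.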
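
To make the checkpoint choice concrete, here is a minimal sketch in plain Python (not the `Trainer` internals) that replays the logged per-epoch metrics from the 10-epoch run and picks the best epoch under two criteria. The helper `pick_best` is hypothetical, and it assumes ties keep the earliest epoch, which is consistent with `checkpoint-12` being the checkpoint listed in `./results`.

```python
# Per-epoch (validation loss, accuracy) copied from the 10-epoch training log.
LOG = [
    (0.437007, 0.9), (0.307129, 0.9), (0.756505, 0.9), (0.750608, 0.9),
    (0.740430, 0.8), (0.654343, 0.9), (0.706897, 0.9), (0.718901, 0.9),
    (0.723980, 0.9), (0.729421, 0.9),
]


def pick_best(values, greater_is_better):
    """Return the 1-based epoch of the best value; ties keep the earliest epoch."""
    best_epoch, best_value = 1, values[0]
    for epoch, value in enumerate(values[1:], start=2):
        improved = value > best_value if greater_is_better else value < best_value
        if improved:
            best_epoch, best_value = epoch, value
    return best_epoch


by_accuracy = pick_best([acc for _, acc in LOG], greater_is_better=True)
by_val_loss = pick_best([loss for loss, _ in LOG], greater_is_better=False)
print(f"best epoch by accuracy: {by_accuracy}, by validation loss: {by_val_loss}")
```

If selecting by validation loss is the intent, `TrainingArguments` accepts `metric_for_best_model="eval_loss"` together with `greater_is_better=False`.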

### Assumptions Made
- Assumed 100 samples were enough to show my approach—small but balanced (real/fake split assumed from dataset).
- Figured wav2vec2-base’s pre-training handled most audio patterns, so fine-tuning just tweaked it for deepfakes.
- Thought 90% accuracy on 10 samples was fine for a demo.

## Analysis

### Why I Selected This Model
- **Pre-Trained Power**: Used [mo-thecreator/Deepfake-audio-detection](https://huggingface.co/mo-thecreator/Deepfake-audio-detection) from Hugging Face. It builds on wav2vec2-base, which is pre-trained on a large amount of unlabeled speech, so I only had to fine-tune it.
- **Task Fit**: The checkpoint is already built for deepfake detection, so it suits AI-generated speech. It is speech-focused, which fits conversational audio, and with optimization it could run near real-time (not measured here).


### How the Model Works
- **Big Picture**: Takes raw audio, processes it with wav2vec2-base (CNN + Transformer), and spits out “real” or “fake.”
- **Steps**:
  1. Audio hits at 16 kHz—I resampled if needed.
  2. CNN pulls out sound bits (like voice patterns).
  3. Transformer ties it all together, spotting fake clues over time.
  4. Fine-tuning teaches it my dataset’s real/fake labels.
- **Simple Take**: It’s a smart listener that learns to catch fakes.
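
For a sense of scale in steps 2 and 3, the sketch below works out how many feature frames the CNN front end hands to the Transformer. It assumes the standard wav2vec2-base convolution kernels and strides; these values are not read from this checkpoint's `config.json`, so treat them as an assumption.

```python
# Assumed wav2vec2-base feature-encoder layout (kernel size and stride per conv layer).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000


def num_frames(num_samples: int) -> int:
    """Number of CNN output frames for a clip of `num_samples` raw audio samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


# The product of the strides is the hop between consecutive frames, in samples.
total_stride = 1
for stride in CONV_STRIDES:
    total_stride *= stride

print(f"samples per frame hop: {total_stride}")  # 320 samples = 20 ms at 16 kHz
print(f"frames for 1 s of audio: {num_frames(SAMPLE_RATE)}")
```

Under these assumptions, each second of 16 kHz audio becomes a short sequence of frames, and the Transformer attends across that sequence rather than across raw samples.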

### Performance Results
- **Best Run**: Epoch 2—Train Loss: 0.300600, Val Loss: 0.307129, Accuracy: 90%.
- **Full Run**: 10 epochs. Train loss dropped to 0.0002, val loss hovered around 0.65 to 0.76 from Epoch 3 on, and accuracy stayed at 90% (9/10 right) except for a dip to 80% at Epoch 5.


### Observed Strengths
- **Good results**: Pre-trained + fine-tuning = 90% accuracy fast.
- **Audio identification**: Wav2vec2-base knows speech, great for AI fakes.
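- **Augmentation**: Noise and shifting seemed to help. The earlier 5-epoch run without augmentation peaked at 60% validation accuracy, while the augmented run hit 90% from Epoch 1 (only suggestive with 10 validation clips).
- **Best Model**: I kept the best checkpoint’s weights instead of using the overfitted final-epoch model.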
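
The noise and time-shift augmentation can be sketched as a standalone function. This is an illustrative variant, not the notebook's `preprocess_function`: the name `augment`, the explicit `rng` argument, and the separate `shift == 0` branch are my additions. In the notebook the augmentation runs before the split, so the validation clips are perturbed too; a function like this could instead be applied to the training split only.

```python
import numpy as np


def augment(audio: np.ndarray, sample_rate: int, rng: np.random.Generator,
            noise_level: float = 0.005, max_shift_s: float = 0.1) -> np.ndarray:
    """Add Gaussian noise, then shift the clip in time with zero padding (length preserved)."""
    noisy = audio + noise_level * rng.standard_normal(len(audio))
    max_shift = int(max_shift_s * sample_rate)
    shift = int(rng.integers(-max_shift, max_shift + 1))  # upper bound is exclusive
    if shift > 0:
        # Drop the first `shift` samples and pad the end with silence.
        return np.concatenate((noisy[shift:], np.zeros(shift)))
    if shift < 0:
        # Pad the start with silence and drop the last `-shift` samples.
        return np.concatenate((np.zeros(-shift), noisy[:shift]))
    return noisy  # shift == 0: only the noise is applied


rng = np.random.default_rng(0)
clip = np.sin(np.linspace(0, 2 * np.pi * 440, 16_000))  # 1 s synthetic 440 Hz tone at 16 kHz
augmented = augment(clip, sample_rate=16_000, rng=rng)
print(clip.shape, augmented.shape)
```

Passing a seeded `np.random.Generator` keeps the perturbations repeatable between runs.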

### Observed Weaknesses
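- **Tiny Test Set**: Only 10 validation samples, so every single clip moves accuracy by 10 points.
- **Overfitting**: Train loss collapsed to 0.0002 while val loss rose from 0.307 (Epoch 2) to about 0.73 (Epoch 10), so the model was memorizing the training set.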
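
To put a number on how little 10 validation clips can tell us, here is a small sketch that computes a Wilson score interval for 9 correct out of 10. The choice of the Wilson interval and the helper name `wilson_interval` are mine, not part of the original analysis.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(correct=9, total=10)
print(f"9/10 correct -> 95% interval roughly {low:.2f} to {high:.2f}")
```

The interval spans roughly 60% to 98%, so the measured 90% is compatible with a model that is only modestly better than chance on two balanced classes.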



### Suggestions for Future Improvements
- **Bigger Data**: Grab more Hemg samples or mix in ASVspoof2019 for a bigger test set.
- **Stop Early**: Use early stopping (e.g., around 3 epochs) to avoid overfitting; val loss bottomed out at Epoch 2.
- **Speed Boost**: Shrink the model (like with quantization) for real-time use.
- **Noise Prep**: Train with messy audio (e.g., background chatter) for real convos.
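
As an illustration of the early-stopping suggestion, the sketch below replays the logged validation losses with a patience rule. The function `early_stop` and the patience value of 3 are assumptions for illustration. In a real run, `transformers` provides `EarlyStoppingCallback(early_stopping_patience=...)`, which can be passed to `Trainer` through `callbacks=[...]`.

```python
# Validation loss per epoch, copied from the 10-epoch training log.
VAL_LOSS = [0.437007, 0.307129, 0.756505, 0.750608, 0.740430,
            0.654343, 0.706897, 0.718901, 0.723980, 0.729421]


def early_stop(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 1-based, for a simple patience rule."""
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch  # stop: no improvement for `patience` epochs
    return best_epoch, len(val_losses)


best_epoch, stop_epoch = early_stop(VAL_LOSS, patience=3)
print(f"best epoch: {best_epoch}, training would stop after epoch: {stop_epoch}")
```

With this rule the run would have ended at Epoch 5 and kept the Epoch 2 weights, roughly halving the training time of the 10-epoch run.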

### Reflection Questions

#### 1. What Were the Most Significant Challenges in Implementing This Model?
- **Small Dataset Size**: The Hemg dataset had only 100 samples—splitting it into 90 train and 10 test made validation wobbly. Hard to trust 90% accuracy with so few test clips!
- **Audio Prep**: Some clips weren’t at 16 kHz (wav2vec2-base’s need), so resampling with `torchaudio` was a must.


#### 2. How Might This Approach Perform in Real-World Conditions vs. Research Datasets?
- **Research Datasets (like Hemg)**: My 90% accuracy looks good, but Hemg’s clean, controlled clips (100 samples) made it easier. Wav2vec2-base’s pre-training helped nail patterns in this small, neat set.
- **Real-World Conditions**: Real-world audio with noise, varying lengths, and accents will likely be harder: the model overfit a small, clean dataset, and inference speed has not been optimized. The noise augmentation should help a little.

#### 3. What Additional Data or Resources Would Improve Performance?
- **More Data**: A bigger dataset, such as ASVspoof2019’s LA subset, would give more real/fake variety.
- **Noisy Audio**: Clips with real-world noise (cafés, streets) to train for robustness.
- **Compute Power**: A higher-performance GPU or TPU for faster training on larger sets.
- **Model Details**: Full architecture docs from Hugging Face—tweaking wav2vec2-base layers could boost it.

#### 4. How Would You Approach Deploying This Model in a Production Environment?
- **Optimize Speed**: Shrink it—use quantization (e.g., `torch.quantization`) or pruning to cut the Transformer’s heft. Aim for <100 ms inference per clip for near real-time.
- **API Setup**: Wrap it in a Flask or FastAPI server—endpoint takes audio, resamples to 16 kHz, runs inference, spits out “real” or “fake.” Host on AWS/GCP with GPU support.
- **Preprocessing Pipeline**: Automate resampling and augmentation in a stream—handle live audio chunks with a buffer.
- **Monitoring**: Add logging for false positives/negatives—retrain monthly with new fakes to keep it sharp. Test with a noisy convo dataset first.
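
The "live audio chunks with a buffer" idea can be sketched as follows. `ChunkBuffer` is a hypothetical helper, not part of the notebook: it collects incoming samples and emits fixed-length, overlapping windows that a deployed model could classify one by one. It assumes `hop_size` is no larger than `window_size`.

```python
from collections import deque
from itertools import islice


class ChunkBuffer:
    """Collects incoming audio samples and emits fixed-length windows (hypothetical helper)."""

    def __init__(self, window_size: int, hop_size: int):
        self.window_size = window_size
        self.hop_size = hop_size
        self._samples = deque()

    def push(self, samples):
        """Add samples; return a list of complete windows ready for inference."""
        self._samples.extend(samples)
        windows = []
        while len(self._samples) >= self.window_size:
            windows.append(list(islice(self._samples, self.window_size)))
            # Drop only hop_size samples so consecutive windows overlap.
            for _ in range(self.hop_size):
                self._samples.popleft()
        return windows


# Tiny sizes for demonstration; at 16 kHz a 1 s window with a 0.5 s hop
# would be window_size=16_000 and hop_size=8_000.
buffer = ChunkBuffer(window_size=4, hop_size=2)
first = buffer.push([0, 1, 2])    # not enough samples for a window yet
second = buffer.push([3, 4, 5])   # now two overlapping windows are complete
print(first, second)
```

Each emitted window would then go through the same resampling and feature-extraction steps as the training clips before inference.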

## Requirements

### Clear Setup Instructions
- **Environment**: Use Google Colab (free GPU tier, T4 recommended).
- **Steps**:
  1. Open Colab: Go to [colab.research.google.com](https://colab.research.google.com), click “New Notebook.”
  2. Enable GPU: Runtime > Change runtime type > GPU > Save.
  3. Install dependencies: run `!pip install transformers datasets torchaudio librosa evaluate` in the first cell.

### Document Any Dependencies
- **Packages** (installed in code):
  - `transformers`: For loading wav2vec2-base model and feature extractor.
  - `datasets`: To fetch and process Hemg/Deepfake-Audio-Dataset.
  - `torchaudio`: For resampling audio to 16 kHz.
  - `librosa`: General audio processing. The script does not import it directly; resampling is done with `torchaudio`.
  - `evaluate`: Accuracy metric for training.
  - `torch`: PyTorch for model and GPU support.
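- **Versions**: Not pinned explicitly. The run relies on Colab’s default environment plus whatever `pip` resolves; the install log for this run shows `datasets` 3.5.0, `evaluate` 0.4.3, and `torch` 2.6.0 on Python 3.11.

### Ensure Reproducibility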
- **Access to Data**:
  - **Dataset**: [Hemg/Deepfake-Audio-Dataset](https://huggingface.co/datasets/Hemg/Deepfake-Audio-Dataset).
  - **Instructions**: Code uses `datasets.load_dataset("Hemg/Deepfake-Audio-Dataset")`—automatically pulls 100 samples from Hugging Face Hub. No manual download needed; just run the cell.
- **Model**: Pre-trained [mo-thecreator/Deepfake-audio-detection](https://huggingface.co/mo-thecreator/Deepfake-audio-detection) loaded via `transformers`—publicly available on Hugging Face.

- **Code**: Full script below—run as-is.
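
One caveat on run-to-run reproducibility, separate from the notebook cells that follow: the augmentation in `preprocess_function` draws from `np.random` and `random` without a fixed seed, so the noise and shifts differ on every run. A minimal sketch of the idea, using a hypothetical helper `draw_shifts` and a fixed 16 kHz rate for simplicity:

```python
import random


def draw_shifts(seed: int, count: int, sample_rate: int = 16_000):
    """Draw `count` time shifts (in samples) within +/-100 ms from a seeded generator."""
    rng = random.Random(seed)
    shift_range = int(0.1 * sample_rate)
    return [rng.randint(-shift_range, shift_range) for _ in range(count)]


run_a = draw_shifts(seed=42, count=5)
run_b = draw_shifts(seed=42, count=5)
print("same shifts on both runs:", run_a == run_b)
```

In the notebook itself, calling `random.seed(...)` and `np.random.seed(...)` before `dataset.map(...)` would have the same effect; `transformers.set_seed(...)` seeds `random`, NumPy, and PyTorch in one call.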

In [1]:
!pip install transformers datasets torchaudio librosa evaluate


Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.6.0->torchaudio)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.6.0->torchaudio)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manyl

In [3]:
import numpy as np
from datasets import load_dataset
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification, TrainingArguments, Trainer
import evaluate
import torch
import torchaudio

# Check for GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# 1. Load the Deepfake Audio Dataset from Hugging Face Hub
dataset = load_dataset("Hemg/Deepfake-Audio-Dataset")
print("Original dataset splits:", dataset)

# 2. Load the Pretrained Audio Classification Model and its Feature Extractor
model_name = "mo-thecreator/Deepfake-audio-detection"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name).to(device)

# Target sampling rate for the model
TARGET_SR = 16000

# 3. Preprocess the Dataset with resampling if necessary.
def preprocess_function(example):
    audio_array = example["audio"]["array"]
    orig_sr = example["audio"]["sampling_rate"]

    if orig_sr != TARGET_SR:
        audio_tensor = torch.tensor(audio_array, dtype=torch.float32)
        resampler = torchaudio.transforms.Resample(orig_freq=orig_sr, new_freq=TARGET_SR)
        audio_tensor = resampler(audio_tensor)
        audio_array = audio_tensor.numpy()
        orig_sr = TARGET_SR

    processed = feature_extractor(
        audio_array,
        sampling_rate=orig_sr,
        padding=True,
        return_tensors="pt"
    )
    processed = {key: value.squeeze(0) for key, value in processed.items()}
    processed["labels"] = example["label"]
    return processed

# Apply the preprocessing function to the dataset
encoded_dataset = dataset.map(preprocess_function)

# Since we only have a "train" split, split it into training and validation sets.
split_dataset = encoded_dataset["train"].train_test_split(test_size=0.1, seed=42)
print("Splits after train_test_split:", split_dataset)

# 4. Define the evaluation metric
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=preds, references=labels)

# 5. Set up training arguments with eval_strategy instead of evaluation_strategy
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to=[]  # Disable reporting to wandb and other trackers
)

# 6. Initialize the Trainer using the new splits
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    compute_metrics=compute_metrics,
)

# 7. Fine-tune the model
trainer.train()


Using device: cuda


Original dataset splits: DatasetDict({
    train: Dataset({
        features: ['audio', 'label'],
        num_rows: 100
    })
})


Splits after train_test_split: DatasetDict({
    train: Dataset({
        features: ['audio', 'label', 'input_values', 'labels'],
        num_rows: 90
    })
    test: Dataset({
        features: ['audio', 'label', 'input_values', 'labels'],
        num_rows: 10
    })
})


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 2.38 | 1.052455 | 0.5 |
| 2 | 0.7866 | 1.220167 | 0.6 |
| 3 | 0.308 | 1.241462 | 0.6 |
| 4 | 0.2905 | 1.852199 | 0.5 |
| 5 | 0.0984 | 1.500057 | 0.5 |


TrainOutput(global_step=60, training_loss=0.6715348651011784, metrics={'train_runtime': 151.2138, 'train_samples_per_second': 2.976, 'train_steps_per_second': 0.397, 'total_flos': 4.085384688e+16, 'train_loss': 0.6715348651011784, 'epoch': 5.0})

In [2]:
import numpy as np
from datasets import load_dataset
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification, TrainingArguments, Trainer
import evaluate
import torch
import torchaudio
import random

# Check for GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# 1. Load the Deepfake Audio Dataset from Hugging Face Hub
dataset = load_dataset("Hemg/Deepfake-Audio-Dataset")
print("Original dataset splits:", dataset)

# 2. Load the Pretrained Audio Classification Model and its Feature Extractor
model_name = "mo-thecreator/Deepfake-audio-detection"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name).to(device)

# Target sampling rate for the model
TARGET_SR = 16000



def preprocess_function(example):
    audio_array = example["audio"]["array"]
    orig_sr = example["audio"]["sampling_rate"]

    # === Data Augmentation ===
    # Noise injection
    noise_level = 0.005
    noise = np.random.randn(len(audio_array))
    audio_array = audio_array + noise_level * noise

    # Time shifting
    shift_range = int(0.1 * orig_sr)  # shift up to 100ms
    shift = random.randint(-shift_range, shift_range)
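    if shift > 0:
        # Drop the first `shift` samples and pad the end with silence
        audio_array = np.concatenate((audio_array[shift:], np.zeros(shift)))
    elif shift < 0:
        # Pad the start with silence and drop the last `-shift` samples
        audio_array = np.concatenate((np.zeros(-shift), audio_array[:shift]))
    # shift == 0: leave the clip as-is (audio_array[:0] would otherwise empty it)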

    # === Resample ===
    if orig_sr != TARGET_SR:
        audio_tensor = torch.tensor(audio_array, dtype=torch.float32)
        resampler = torchaudio.transforms.Resample(orig_freq=orig_sr, new_freq=TARGET_SR)
        audio_tensor = resampler(audio_tensor)
        audio_array = audio_tensor.numpy()
        orig_sr = TARGET_SR

    processed = feature_extractor(
        audio_array,
        sampling_rate=orig_sr,
        padding=True,
        return_tensors="pt"
    )
    processed = {key: value.squeeze(0) for key, value in processed.items()}
    processed["labels"] = example["label"]
    return processed

# Apply the preprocessing function to the dataset
encoded_dataset = dataset.map(preprocess_function)

# Since we only have a "train" split, split it into training and validation sets.
split_dataset = encoded_dataset["train"].train_test_split(test_size=0.1, seed=42)
print("Splits after train_test_split:", split_dataset)

# 4. Define the evaluation metric
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=preds, references=labels)

# 5. Set up training arguments with eval_strategy instead of evaluation_strategy
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,  # increased from 5 to 10
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,               #  This reloads best model after training ends
    metric_for_best_model="accuracy",          #  This decides how “best” is defined
    greater_is_better=True,                    #  For metrics like accuracy, higher is better
    save_total_limit=1,                        #  limits number of checkpoints
    report_to=[]  # Disable external logging
)


# 6. Initialize the Trainer using the new splits
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_dataset["train"],
    eval_dataset=split_dataset["test"],
    compute_metrics=compute_metrics,
)

# 7. Fine-tune the model
trainer.train()



Using device: cuda
Original dataset splits: DatasetDict({
    train: Dataset({
        features: ['audio', 'label'],
        num_rows: 100
    })
})
Splits after train_test_split: DatasetDict({
    train: Dataset({
        features: ['audio', 'label', 'input_values', 'labels'],
        num_rows: 90
    })
    test: Dataset({
        features: ['audio', 'label', 'input_values', 'labels'],
        num_rows: 10
    })
})


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.5401 | 0.437007 | 0.9 |
| 2 | 0.3006 | 0.307129 | 0.9 |
| 3 | 0.0596 | 0.756505 | 0.9 |
| 4 | 0.3465 | 0.750608 | 0.9 |
| 5 | 0.0808 | 0.74043 | 0.8 |
| 6 | 0.1411 | 0.654343 | 0.9 |
| 7 | 0.0021 | 0.706897 | 0.9 |
| 8 | 0.0004 | 0.718901 | 0.9 |
| 9 | 0.0002 | 0.72398 | 0.9 |
| 10 | 0.0002 | 0.729421 | 0.9 |


TrainOutput(global_step=120, training_loss=0.2065585005407532, metrics={'train_runtime': 317.8831, 'train_samples_per_second': 2.831, 'train_steps_per_second': 0.377, 'total_flos': 8.170769376e+16, 'train_loss': 0.2065585005407532, 'epoch': 10.0})

In [3]:
# Save best model to disk
trainer.save_model("./best_model")


In [4]:
!zip -r results.zip ./results


  adding: results/ (stored 0%)
  adding: results/checkpoint-12/ (stored 0%)
  adding: results/checkpoint-12/trainer_state.json (deflated 58%)
  adding: results/checkpoint-12/scheduler.pt (deflated 56%)
  adding: results/checkpoint-12/model.safetensors (deflated 7%)
  adding: results/checkpoint-12/training_args.bin (deflated 52%)
  adding: results/checkpoint-12/optimizer.pt (deflated 7%)
  adding: results/checkpoint-12/config.json (deflated 66%)
  adding: results/checkpoint-12/rng_state.pth (deflated 25%)
