# Video-Centric Deepfake Detection — Public Notebook

**Datasets used:** FakeAVCeleb (primary), CREMA-D (audio support), and Meta’s Casual Conversations (generalization).  
**Goal:** Train a video-primary detector with supporting audio cues.

**High-level preprocessing (omitting code & paths):**
- **Video:** sample/sync to **T=128** frames per clip, resize to **112×112**, normalize to **[0,1]**.
- **Audio:** extract **MFCC (n=40)** features, pad/trim to **S=200** time-steps.
- **Labels:** binary {0 = REAL, 1 = FAKE}.
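
The preprocessing code itself is omitted from this notebook, so the following is only a minimal NumPy sketch of the steps listed above, under stated assumptions: frames are already resized to 112×112 and stored as `uint8` RGB, frames are picked at evenly spaced indices (short clips repeat frames), and MFCCs arrive as a `(time, 40)` array that is zero-padded at the end. The helper names `sample_frame_indices`, `normalize_frames` and `pad_or_trim_mfcc` are hypothetical and are not taken from the private pipeline.

```python
import numpy as np

def sample_frame_indices(n_frames: int, T: int = 128) -> np.ndarray:
    """Pick T evenly spaced frame indices from a clip with n_frames frames.
    Assumption: clips shorter than T repeat frames instead of being padded."""
    if n_frames <= 0:
        raise ValueError("clip has no frames")
    return np.linspace(0, n_frames - 1, num=T).round().astype(np.int64)

def normalize_frames(frames_uint8: np.ndarray) -> np.ndarray:
    """Convert (T, H, W, 3) uint8 frames to (3, T, H, W) float32 in [0, 1]."""
    frames = frames_uint8.astype(np.float32) / 255.0
    return np.transpose(frames, (3, 0, 1, 2))

def pad_or_trim_mfcc(mfcc: np.ndarray, S: int = 200) -> np.ndarray:
    """Force a (time, n_mfcc) MFCC array to exactly S time-steps.
    Assumption: zero-padding at the end; longer inputs are cut at S."""
    steps, n_mfcc = mfcc.shape
    if steps >= S:
        return mfcc[:S].astype(np.float32)
    out = np.zeros((S, n_mfcc), dtype=np.float32)
    out[:steps] = mfcc
    return out

# Synthetic stand-ins for a decoded clip and its MFCCs (no real data involved)
idx = sample_frame_indices(n_frames=300, T=128)
raw = np.random.randint(0, 256, size=(300, 112, 112, 3), dtype=np.uint8)
video = normalize_frames(raw[idx])                    # (3, 128, 112, 112)
audio = pad_or_trim_mfcc(np.random.randn(150, 40))    # (200, 40)
print(video.shape, audio.shape)
```

Resizing and audio extraction are left out on purpose, since they depend on the video and audio libraries used in the private workflow.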

> This public notebook removes Drive mounts and raw file paths. Use your own data loader to supply tensors with the shapes below.


## Expected Inputs for Training

- `video`: **Tensor** `(B, 3, T=128, 112, 112)` — float32, values in `[0, 1]`
- `audio_mfcc`: **Tensor** `(B, S=200, 40)` — float32
- `label`: **Tensor** `(B,)` — int64, values in `{0, 1}`  (`0=REAL`, `1=FAKE`)
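
A loader that deviates from these shapes tends to fail deep inside the model with an unhelpful error. As a rough sketch, a check like the one below could be run on the first batch. `check_batch` is a hypothetical helper, it assumes labels are batched into a 1-D array, and it only relies on `.shape`, `.dtype`, `.min()` and `.max()`, so it should accept NumPy arrays as well as PyTorch tensors.

```python
import numpy as np

def check_batch(video, audio_mfcc, label, T=128, S=200, n_mfcc=40, size=112):
    """Raise ValueError if a batch does not match the documented input shapes."""
    B = video.shape[0]
    if tuple(video.shape) != (B, 3, T, size, size):
        raise ValueError(f"unexpected video shape {tuple(video.shape)}")
    if tuple(audio_mfcc.shape) != (B, S, n_mfcc):
        raise ValueError(f"unexpected audio_mfcc shape {tuple(audio_mfcc.shape)}")
    if tuple(label.shape) != (B,):
        raise ValueError(f"unexpected label shape {tuple(label.shape)}")
    for name, x in (("video", video), ("audio_mfcc", audio_mfcc)):
        if not str(x.dtype).endswith("float32"):
            raise ValueError(f"{name} must be float32, got {x.dtype}")
    if float(video.min()) < 0.0 or float(video.max()) > 1.0:
        raise ValueError("video values must lie in [0, 1]")
    if not {int(v) for v in label}.issubset({0, 1}):
        raise ValueError("labels must be 0 (REAL) or 1 (FAKE)")

# Synthetic batch of two clips; passes silently when everything matches
video = np.zeros((2, 3, 128, 112, 112), dtype=np.float32)
audio_mfcc = np.zeros((2, 200, 40), dtype=np.float32)
label = np.array([0, 1])
check_batch(video, audio_mfcc, label)
```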


In [None]:
# CONFIG (public-safe): use relative folders
from pathlib import Path

DATA_DIR    = Path("./data")             # put your datasets here locally (not in repo)
WEIGHTS_DIR = Path("./model/weights")    # trained weights (not included)
OUT_DIR     = Path("./outputs")

for p in [DATA_DIR, WEIGHTS_DIR, OUT_DIR]:
    p.mkdir(parents=True, exist_ok=True)


> **Note (Public-safe):**  
> The following cells are **dummy placeholders** to make this notebook runnable without private
> datasets and preprocessing scripts. In the private training workflow we perform video frame
> extraction (112×112, T=128), MFCC extraction (40, S=200), labeling, and folder structuring.
> For this public notebook we provide synthetic tensors with the **same shapes**.


# Preprocessing (Dummy)

In [None]:
import torch                            # tensors returned by the dummy helpers below
import zipfile                          # to read from the zip file
import os                               # os file handling
import io                               # to extract files from a zip without saving them to disk
import cv2                              # OpenCV - read and resize video frames
import numpy as np                      # handle and save numerical data like video frames and audio MFCCs
import librosa                          # audio processing library (load audio from the video and extract MFCCs)
import moviepy.editor as mp             # extracts audio from video
from tqdm import tqdm                   # progress bar


def dummy_preprocess_video(T=128, H=112, W=112):
    # Pretend we extracted and normalized frames
    return torch.rand(3, T, H, W)  # (C=3, T, H, W) in [0,1]

def dummy_preprocess_audio(S=200, n_mfcc=40):
    # Pretend we computed MFCCs
    return torch.rand(S, n_mfcc)   # (S, 40)


print("✅ Preprocessing Finished!")


Total video files found: 4458


100%|██████████| 4458/4458 [20:41<00:00,  3.59it/s]

✅ Preprocessing Finished!





# Renaming video files (Dummy)

In [None]:
def dummy_rename_files():
    # Public placeholder: real file renaming happens in the private pipeline.
    # Kept here only to show the step exists.
    return "Renaming step skipped in public notebook."


# Preprocessing, Labeling and Data Loader (Dummy)

In [None]:
from torch.utils.data import Dataset, DataLoader

class DummyAVDataset(Dataset):
    def __init__(self, n=100, T=128, H=112, W=112, S=200, n_mfcc=40):
        self.n, self.T, self.H, self.W, self.S, self.n_mfcc = n, T, H, W, S, n_mfcc

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        video = dummy_preprocess_video(self.T, self.H, self.W)      # (3,T,H,W)
        audio = dummy_preprocess_audio(self.S, self.n_mfcc)         # (S,40)
        label = torch.randint(0, 2, (1,)).item()                    # 0 or 1
        return video, audio, label

def make_dummy_loader(bs=4):
    ds = DummyAVDataset()
    return DataLoader(ds, batch_size=bs, shuffle=True)


In [None]:
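USE_DUMMY = True
if USE_DUMMY:
    train_loader = make_dummy_loader(bs=4)
    val_loader   = make_dummy_loader(bs=4)    # used by the validation phase of the training loop
    test_loader  = make_dummy_loader(bs=4)    # used by the evaluation cell
else:
    # train_loader, val_loader, test_loader = YourRealPrivateLoaders(...)
    raise NotImplementedError("Use your private loaders in a non-public environment.")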


# Building the Model

In [None]:
import torch                # PyTorch library - handles tensors, models, training
import torch.nn as nn       # contains neural network
import torchvision          # gives access to prebuilt models => R(2+1)D

# Improved Video Encoder → r2plus1d_18 (better than r3d_18)
class VideoEncoder(nn.Module):                                                # video branch (nn.Module is the base class of all PyTorch models)
    def __init__(self, out_features=512):                            # each 128-frame clip is reduced to a single 512-length feature vector
        super(VideoEncoder, self).__init__()                         # initialize the nn.Module parent class
        self.model = torchvision.models.video.r2plus1d_18(weights=None)       # R(2+1)D-18 learns spatial and temporal features; weights=None trains from scratch (torchvision >= 0.13, older versions use pretrained=False)
        self.model.fc = nn.Linear(self.model.fc.in_features, out_features)    # replace the final layer so the backbone outputs out_features values

    def forward(self, x):           # x: (B, 3, T, H, W) → (B, out_features)
        return self.model(x)

# Audio Encoder → same LSTM
class AudioEncoder(nn.Module):      # audio branch (nn.Module is the base class of all PyTorch models)
    def __init__(self, input_size=40, hidden_size=128, num_layers=2):  # 40 MFCCs per time-step, 128 hidden units, 2 stacked LSTM layers
        super(AudioEncoder, self).__init__()    # initialize the nn.Module parent class
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)    # batch_first=True → input is (batch, time_steps, features); this line builds the LSTM but does not run it yet
        self.output_size = hidden_size    # size of the returned feature vector (128)

    def forward(self, x):                   # x: (batch, time_steps=200, n_mfcc=40), as listed under "Expected Inputs"
        # No permute is needed for that layout. If your loader yields librosa-style
        # (batch, n_mfcc, time_steps) arrays instead, apply x = x.permute(0, 2, 1) first.
        output, (hn, cn) = self.lstm(x)     # hn: (num_layers, batch, hidden_size)
        return hn[-1]                       # final hidden state of the last (second) LSTM layer → (batch, 128)

# Final Model → Combine Video + Audio (concat fusion for now)
class DeepfakeDetector(nn.Module):  # full model: video encoder + audio encoder + fusion classifier
    def __init__(self, video_feature_size=512, audio_feature_size=128, num_classes=2):  # feature sizes of the two encoders; 2 output classes (REAL, FAKE)
        super(DeepfakeDetector, self).__init__()    # initialize the nn.Module parent class
        self.video_encoder = VideoEncoder(out_features=video_feature_size)    # video branch
        self.audio_encoder = AudioEncoder(input_size=40, hidden_size=audio_feature_size)    # audio branch

        fusion_size = video_feature_size + audio_feature_size   # size of the concatenated feature vector (512 + 128 = 640)

        # Classifier
        self.classifier = nn.Sequential(        # final decision-making block
            nn.Linear(fusion_size, 256),        # 640 fused features → 256 hidden features
            nn.ReLU(),                          # non-linearity, helps the model learn complex patterns
            nn.Dropout(0.3),                    # zeroes 30% of activations during training to reduce overfitting
            nn.Linear(256, num_classes)         # final layer: one raw score (logit) per class
        )

    def forward(self, video, audio):              # video: (B, 3, T, H, W), audio: (B, S, 40)
        video_feat = self.video_encoder(video)    # (B, 512)
        audio_feat = self.audio_encoder(audio)    # (B, 128)

        fused = torch.cat([video_feat, audio_feat], dim=1)    # concatenate along the feature axis → (B, 640)
        out = self.classifier(fused)                          # (B, 2) logits, one score per class
        return out                                            # used later for loss and accuracy during training

# ✅ Instantiate Improved Model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # checks if the system has GPU, if yes => cuda, no => cpu
model = DeepfakeDetector().to(device)       # create the model and move it to the selected device (GPU if available, otherwise CPU)

print("✅ Improved Model ready!")




✅ Improved Model ready!
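
To make the shapes in the fusion head concrete, here is a NumPy-only sketch of the same data flow. It is not the trained model: the encoder outputs and weight matrices are random placeholders, dropout is left out (as it would be in evaluation mode), and the names `fuse_and_classify`, `w1`, `b1`, `w2`, `b2` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_classify(video_feat, audio_feat, w1, b1, w2, b2):
    """Concatenate both feature vectors, then apply Linear -> ReLU -> Linear."""
    fused = np.concatenate([video_feat, audio_feat], axis=1)   # (B, 512 + 128)
    hidden = np.maximum(fused @ w1 + b1, 0.0)                  # (B, 256) after ReLU
    logits = hidden @ w2 + b2                                  # (B, 2) raw class scores
    return fused, logits

B = 4
video_feat = rng.standard_normal((B, 512))    # stand-in for the VideoEncoder output
audio_feat = rng.standard_normal((B, 128))    # stand-in for the AudioEncoder output
w1, b1 = rng.standard_normal((640, 256)) * 0.01, np.zeros(256)
w2, b2 = rng.standard_normal((256, 2)) * 0.01, np.zeros(2)

fused, logits = fuse_and_classify(video_feat, audio_feat, w1, b1, w2, b2)
print(fused.shape, logits.shape)   # (4, 640) (4, 2)
```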


# Training the Model

In [None]:
import torch                        # PyTorch library - handles tensors, models, training
import torch.nn as nn               # contains neural network
import torch.optim as optim         # optimization algorithms like Adam
import os                           # for handling folder and file paths

# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define Loss (classification) and Optimizer
criterion = nn.CrossEntropyLoss()    # this tells the model how wrong its predictions are (the loss function compares model output to the true label)
optimizer = optim.Adam(model.parameters(), lr=0.0001)         # this is optimizer - Adam, learning rate is small => smaller steps for stability

# Checkpoint directory (placeholder: replace with your own location, e.g. a folder under OUT_DIR)
save_path = "path to save the checkpoints"   # best model and per-epoch checkpoints are written here
os.makedirs(save_path, exist_ok=True)

# Training settings
num_epochs = 100
best_val_acc = 0.0      # best validation accuracy so far

for epoch in range(num_epochs):               # repeats training for 100 epochs
    print(f"\nEpoch {epoch+1}/{num_epochs}")

    # ------------------------------
    # Training phase
    # ------------------------------
    model.train()         # sets the model to training mode
    train_loss = 0
    correct = 0           # these variables help track total loss and accuracy during training
    total = 0

    for video, audio, labels in train_loader:   # loop through training data in batches
        video = video.to(device)
        audio = audio.to(device)                # move data to GPU or CPU, depending on device
        labels = labels.to(device)

        optimizer.zero_grad()           # clears old gradients so they don’t mix with the new ones

        outputs = model(video, audio)       # get predictions from the model
        loss = criterion(outputs, labels)   # compare predictions with true labels using loss function

        loss.backward()     # compute how the weights should change based on the loss
        optimizer.step()    # apply the changes (update the weights to reduce loss)

        train_loss += loss.item()
        _, predicted = outputs.max(1)                   # update the training loss and count how many predictions were correct
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    train_acc = 100. * correct / total
    print(f"Train Loss: {train_loss/len(train_loader):.4f} | Train Acc: {train_acc:.2f}%")    # calculate average loss and accuracy for this epoch

    # ------------------------------
    # Validation phase
    # ------------------------------
    model.eval()      # sets the model to evaluation mode (dropout off, batch-norm uses running statistics)
    val_loss = 0
    correct = 0       # these variables help track total loss and accuracy during evaluation
    total = 0

    with torch.no_grad():   # disable gradient tracking (no weight updates happen during validation)
        for video, audio, labels in val_loader:   # loop through validation data in batches
            video = video.to(device)
            audio = audio.to(device)              # move data to GPU or CPU, depending on device
            labels = labels.to(device)

            outputs = model(video, audio)         # get predictions from the model
            loss = criterion(outputs, labels)     # compare predictions with true labels using loss function

            val_loss += loss.item()
            _, predicted = outputs.max(1)                 # update the validation loss and count how many predictions were correct
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

    val_acc = 100. * correct / total
    print(f"Val Loss: {val_loss/len(val_loader):.4f} | Val Acc: {val_acc:.2f}%")      # show how well the model did on the validation data

    # ------------------------------
    # Save best model
    # ------------------------------
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        save_file = os.path.join(save_path, "model.pth")   # if this epoch had the highest validation accuracy so far, save the model as model.pth
        torch.save(model.state_dict(), save_file)
        print("✅ Best model saved!")

    # Save checkpoint every epoch
    checkpoint_file = os.path.join(save_path, f"epoch_{epoch+1}.pth")
    torch.save(model.state_dict(), checkpoint_file)
    print("💾 Checkpoint saved for this epoch!")                        # save the model at the end of every epoch

print("\n🎉 Training Finished!")



Epoch 1/100
Train Loss: 0.3632 | Train Acc: 86.41%
Val Loss: 0.3693 | Val Acc: 85.53%
✅ Best model saved!
💾 Checkpoint saved for this epoch!

Epoch 2/100
Train Loss: 0.2989 | Train Acc: 89.03%
Val Loss: 0.3703 | Val Acc: 85.53%
💾 Checkpoint saved for this epoch!

Epoch 3/100
Train Loss: 0.3117 | Train Acc: 88.78%
Val Loss: 0.3645 | Val Acc: 85.53%
💾 Checkpoint saved for this epoch!

Epoch 4/100
Train Loss: 0.2919 | Train Acc: 89.00%
Val Loss: 0.3626 | Val Acc: 85.53%
💾 Checkpoint saved for this epoch!

Epoch 5/100
Train Loss: 0.2854 | Train Acc: 89.09%
Val Loss: 0.3664 | Val Acc: 85.53%
💾 Checkpoint saved for this epoch!

Epoch 6/100
Train Loss: 0.2968 | Train Acc: 88.78%
Val Loss: 0.3685 | Val Acc: 85.53%
💾 Checkpoint saved for this epoch!

Epoch 7/100
Train Loss: 0.2799 | Train Acc: 89.09%
Val Loss: 0.3632 | Val Acc: 85.53%
💾 Checkpoint saved for this epoch!

Epoch 8/100
Train Loss: 0.2665 | Train Acc: 89.18%
Val Loss: 0.3786 | Val Acc: 85.53%
💾 Checkpoint saved for this epoch!

Epo

KeyboardInterrupt: 

Training was continued for **290** epochs
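
The validation accuracy in the log above is identical (85.53%) in every epoch while the training metrics keep moving. One common cause of such a plateau is a model that predicts the same class for every validation sample, in which case accuracy equals the share of that class. The log alone cannot confirm this; the sketch below only shows how the check could be done. `majority_baseline` and `is_constant` are hypothetical helpers, and the label lists are made up for illustration.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy (in %) obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return 100.0 * max(counts.values()) / len(labels)

def is_constant(preds):
    """True if every prediction is the same class."""
    return len(set(preds)) == 1

# Made-up validation labels and predictions, NOT taken from the run above
val_labels = [0] * 3 + [1] * 17
val_preds = [1] * 20

print(f"majority baseline: {majority_baseline(val_labels):.2f}%")
print("constant predictions:", is_constant(val_preds))
```

If the measured validation accuracy matches the baseline and the predictions are constant, the class balance of the validation split and the learning rate would be the first things to inspect.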

# Evaluate the model

In [None]:
import torch        # for loading and running PyTorch model
import numpy as np  # for handling arrays
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score      # for calculating evaluation metrics

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")   # checks if GPU is available if not use CPU

# ✅ Load best fine-tuned model
model.load_state_dict(torch.load("model path", map_location=device)) # loads the saved model weights; map_location lets GPU-trained weights load on a CPU-only machine
model.to(device)    # moves it to the selected device
model.eval()        # puts the model in evaluation mode

# ✅ Evaluate
all_labels, all_preds, all_probs = [], [], []   # all_labels => 0/1 | all_preds => what the model predicted (0 or 1) | all_probs => confidence score for AUC

with torch.no_grad():   # don't calculate gradient (since we are not training anymore)
    for video, audio, labels in test_loader:    # batch of video features | batch of audio features | labels (real or fake)
        video = video.to(device)
        audio = audio.to(device)      # move data to GPU/CPU
        labels = labels.to(device)

        outputs = model(video, audio)                 # run the model
        probs = torch.softmax(outputs, dim=1)[:, 1]   # softmax turns the raw scores (logits) into probabilities | [:, 1] picks the probability of the FAKE class
        _, predicted = outputs.max(1)     # finds the index of the higher score (So it picks either REAL (0) or FAKE (1) as the prediction)

        all_labels.extend(labels.cpu().numpy())
        all_preds.extend(predicted.cpu().numpy())  # save everything, converts to numpy arrays, adds them to the result lists for final metric calculations
        all_probs.extend(probs.cpu().numpy())

print(f"\n✅ Test Accuracy: {accuracy_score(all_labels, all_preds) * 100:.2f}%")  # calculate and print the metrics (measures the percentage of correct predictions out of all predictions)
print("\nClassification Report:")
print(classification_report(all_labels, all_preds, target_names=["REAL", "FAKE"]))
print(f"\n✅ AUC-ROC Score: {roc_auc_score(all_labels, all_probs):.4f}")


✅ Test Accuracy: 96.69%

Classification Report:
              precision    recall  f1-score   support

        REAL       0.97      0.98      0.97       200
        FAKE       0.97      0.95      0.96       132

    accuracy                           0.97       332
   macro avg       0.97      0.96      0.97       332
weighted avg       0.97      0.97      0.97       332


✅ AUC-ROC Score: 0.9909
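
As a consistency check on the numbers above, the per-class recall and support can be turned back into approximate counts. Because the report rounds to two decimals, the counts below are one reconstruction that reproduces the printed figures, not the actual confusion matrix of the run (which is not shown in this notebook).

```python
# Support (number of test samples per class), taken from the report above
support = {"REAL": 200, "FAKE": 132}

# Assumed counts of correct predictions, chosen to match the rounded recalls (0.98, 0.95)
correct = {"REAL": 196, "FAKE": 125}

recall = {k: correct[k] / support[k] for k in support}
real_as_fake = support["REAL"] - correct["REAL"]   # REAL clips predicted as FAKE
fake_as_real = support["FAKE"] - correct["FAKE"]   # FAKE clips predicted as REAL
precision = {
    "REAL": correct["REAL"] / (correct["REAL"] + fake_as_real),
    "FAKE": correct["FAKE"] / (correct["FAKE"] + real_as_fake),
}
accuracy = 100.0 * sum(correct.values()) / sum(support.values())

for k in support:
    print(f"{k}: precision={precision[k]:.2f} recall={recall[k]:.2f}")
print(f"accuracy={accuracy:.2f}%")   # 96.69%, i.e. 321 of 332 clips
```

For comparison, always predicting REAL on this test set would score 200/332, roughly 60.24%.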


# Graph (Dummy)

In [None]:
import torch
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, auc
import numpy as np

# Dummy predictions and labels
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 0, 1, 0, 0, 1])
y_score = np.array([0.1, 0.4, 0.9, 0.6, 0.2, 0.8])

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", cm)

# ROC Curve
fpr, tpr, _ = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
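
For intuition about what the AUC value means, it can also be computed without scikit-learn as the fraction of (FAKE, REAL) pairs in which the FAKE sample receives the higher score, with ties counting as half. The sketch below applies this to the same dummy arrays; `pairwise_auc` is a hypothetical helper written for this explanation, not a library function.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of positive/negative pairs ranked in the right order."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.9, 0.6, 0.2, 0.8]
print(pairwise_auc(y_true, y_score))   # 1.0: every FAKE score exceeds every REAL score
```

Since AUC depends only on the ranking of the scores, it can be perfect here even though the dummy `y_pred` array contains one misclassified sample.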


# Saving the final model

In [None]:
from pathlib import Path

save_path = Path("model/weights/model.pth")
save_path.parent.mkdir(parents=True, exist_ok=True)

# torch.save(model.state_dict(), save_path)  # ← leave commented in public notebook
print("Weights would be saved to:", save_path.resolve())


# Detection

**Inference:** See `model/inference.py` for how to run predictions if weights are available.  
Pretrained weights are not included in this repository.
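
`model/inference.py` is not reproduced in this notebook. Purely as an illustration of the last step such a script would need, and not as a description of that file, the sketch below turns the two raw scores emitted by the classifier into a label and a FAKE probability, mirroring the softmax and argmax used in the evaluation cell. `logits_to_prediction` is a hypothetical helper and the example scores are made up.

```python
import math

LABELS = ("REAL", "FAKE")   # index 0 = REAL, index 1 = FAKE, as in the label convention above

def logits_to_prediction(logits):
    """Map two raw class scores to (label, probability of FAKE)."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]            # softmax
    idx = max(range(len(probs)), key=probs.__getitem__)   # argmax
    return LABELS[idx], probs[1]

label, p_fake = logits_to_prediction([-1.2, 2.3])   # made-up scores for one clip
print(label, round(p_fake, 3))
```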
