**Objective**
Build a model that classifies a video as "real" or "fake" by analyzing facial features across multiple frames, using:

- MTCNN for face detection
- MobileNetV2 for feature extraction
- LSTM for modeling temporal (frame-by-frame) dependencies

In [1]:
import zipfile
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
from mtcnn.mtcnn import MTCNN
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, GlobalAveragePooling2D
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
from tqdm import tqdm

Unzips the FaceForensics++ dataset, which contains two folders, real and fake, each holding video files.

In [2]:
zip_path = r"D:\Spring_2025\STT_811\FF++.zip"
extract_to = "faceforensics_data"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_to)

print("Extracted files:", os.listdir(extract_to))

Extracted files: ['fake', 'real']


Initialize MTCNN to detect faces in video frames.
- IMG_SIZE = 224: The image size to which faces will be resized.
- FRAMES_PER_VIDEO = 5: Number of frames to extract from each video.

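The function extract_faces_from_video does the following for each video:

- Samples 5 evenly spaced frames.
- Converts each frame from BGR to RGB and runs MTCNN face detection on it.
- Crops the first detected face and resizes it to 224x224.
- Skips frames that cannot be read or contain no detected face, so fewer than 5 faces may be returned.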
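
The frame sampling step can be illustrated on its own. The sketch below is a minimal example, assuming a hypothetical helper name (evenly_spaced_frame_indices) that is not part of the notebook; it only shows how np.linspace with an integer dtype picks the frame positions, and that a very short video produces repeated indices.

```python
import numpy as np

def evenly_spaced_frame_indices(total_frames, frames_to_extract=5):
    """Hypothetical helper: indices spread evenly from the first to the last frame."""
    if total_frames <= 0:
        return []  # nothing to sample from an empty or unreadable video
    idx = np.linspace(0, total_frames - 1, frames_to_extract, dtype=int)
    return idx.tolist()

# A 300-frame video: the first and last frames are always included
long_video = evenly_spaced_frame_indices(300)
print(long_video)

# A 3-frame video: fewer frames than requested, so some indices repeat
short_video = evenly_spaced_frame_indices(3)
print(short_video)
```

Repeated indices mean the same frame is decoded more than once, which is worth keeping in mind for clips shorter than FRAMES_PER_VIDEO.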

In [3]:
detector = MTCNN()
IMG_SIZE = 224
FRAMES_PER_VIDEO = 5

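def extract_faces_from_video(video_path, frames_to_extract=FRAMES_PER_VIDEO):
    """Return up to frames_to_extract RGB face crops, each IMG_SIZE x IMG_SIZE."""
    faces = []
    cap = cv2.VideoCapture(video_path)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total_frames <= 0:  # unreadable or empty video
        cap.release()
        return faces

    # Sample evenly spaced frame indices from the first to the last frame
    for i in np.linspace(0, total_frames - 1, frames_to_extract, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ret, frame = cap.read()
        if not ret:
            continue
        # OpenCV decodes frames as BGR; MTCNN expects RGB
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        detections = detector.detect_faces(frame_rgb)
        if detections:
            # Use the first detection; clamp the corner because MTCNN can
            # report slightly negative box coordinates
            x, y, w, h = detections[0]['box']
            x, y = max(0, x), max(0, y)
            face = frame_rgb[y:y+h, x:x+w]
            if face.size == 0:
                continue
            face = cv2.resize(face, (IMG_SIZE, IMG_SIZE))
            faces.append(face)
    cap.release()
    return faces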

- Loads MobileNetV2 pretrained on ImageNet, without its top classification layer.
- Extracts a 1280-dimensional vector per face image using GlobalAveragePooling2D.
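
To make the pooling step concrete, here is a small NumPy sketch of what global average pooling does to a feature map. It is an illustration only: the random array stands in for the network's final feature map, assumed here to have shape (1, 7, 7, 1280) for a single 224x224 input, and no MobileNetV2 weights are involved.

```python
import numpy as np

# Stand-in for the final convolutional feature map of one 224x224 face
# (assumed shape: batch=1, height=7, width=7, channels=1280)
rng = np.random.default_rng(0)
feature_map = rng.random((1, 7, 7, 1280)).astype(np.float32)

# Global average pooling: average each channel over the spatial axes
pooled = feature_map.mean(axis=(1, 2))

print(pooled.shape)  # one 1280-dimensional vector per image
```

Each of the 1280 output values is the mean of one channel over its 49 spatial positions, which is why the vector length equals the channel count.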

In [4]:
base_model = MobileNetV2(weights='imagenet', include_top=False, input_shape=(IMG_SIZE, IMG_SIZE, 3))
feature_extractor = Model(inputs=base_model.input, outputs=GlobalAveragePooling2D()(base_model.output))

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224_no_top.h5
9406464/9406464 ━━━━━━━━━━━━━━━━━━━━ 1s 0us/step



Iterates through the real and fake folders.

For each video:
- Extracts up to 5 face crops; videos that yield fewer than 5 are skipped.
- Computes a 1280-dimensional feature vector per face.
- Stacks the vectors into a (5, 1280) array representing the video.
- Appends the array to sequences and the label (0 = real, 1 = fake) to labels.
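
The stacking and skipping logic described above can be sketched without any video data. The helper name build_sequence and the constant FEATURE_DIM are hypothetical, and the feature vectors are made-up constants rather than real MobileNetV2 outputs.

```python
import numpy as np

FRAMES_PER_VIDEO = 5
FEATURE_DIM = 1280  # length of one per-face feature vector

def build_sequence(per_face_features, frames_required=FRAMES_PER_VIDEO):
    """Hypothetical helper: stack per-face vectors, or return None if too few."""
    if len(per_face_features) < frames_required:
        return None  # incomplete video, skipped as in the loop below
    return np.stack(per_face_features)

# Made-up feature vectors standing in for real extractor outputs
complete = [np.full(FEATURE_DIM, i, dtype=np.float32) for i in range(5)]
incomplete = complete[:3]

seq = build_sequence(complete)
skipped = build_sequence(incomplete)
print(seq.shape, skipped)
```

Requiring every sequence to have the same length is what lets the list of sequences be converted into a single 3-D array later.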

In [None]:
sequences = []
labels = []
classes = ['real', 'fake']

for label, folder_name in enumerate(classes):
    folder_path = os.path.join(extract_to, folder_name)
    video_files = os.listdir(folder_path)
    print(f"Processing {len(video_files)} videos in {folder_name} folder")

    for video_file in tqdm(video_files):
        video_path = os.path.join(folder_path, video_file)
        faces = extract_faces_from_video(video_path)

        if len(faces) < FRAMES_PER_VIDEO:
            continue  # Skip incomplete videos

        video_features = []
        for face in faces:
            face = preprocess_input(face.astype(np.float32))
            features = feature_extractor.predict(np.expand_dims(face, axis=0), verbose=0)
            video_features.append(features.squeeze())

        sequences.append(np.array(video_features))  # Shape: (5, 1280)
        labels.append(label)

Processing 200 videos in real folder


 57%|█████▊    | 115/200 [15:28<12:04,  8.53s/it]

- Converts the sequences to a single array and one-hot encodes the labels as y_seq.
- Saves the extracted features and labels so that later runs can skip reprocessing.
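
As a rough illustration of the encoding and caching steps, the sketch below one-hot encodes a few made-up labels and round-trips them through a .npy file. It uses np.eye as a stand-in for to_categorical (an assumption for the sake of a dependency-free example) and writes to a temporary directory rather than the notebook's working directory.

```python
import os
import tempfile
import numpy as np

labels_demo = [0, 1, 1, 0]  # made-up labels: 0 = real, 1 = fake

# Row i of the identity matrix is the one-hot vector for class i
y_onehot = np.eye(2, dtype=np.float32)[labels_demo]
print(y_onehot)

# Save and reload, as a later session would do instead of reprocessing videos
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "y_seq_demo.npy")  # hypothetical file name
    np.save(path, y_onehot)
    restored = np.load(path)

print(np.array_equal(restored, y_onehot))
```

In a later session, loading X_seq.npy and y_seq.npy with np.load would replace the extraction loop entirely.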

In [None]:
X_seq = np.array(sequences)
y_seq = to_categorical(labels)

np.save("X_seq.npy", X_seq)
np.save("y_seq.npy", y_seq)

print("Shape of X_seq:", X_seq.shape)
print("Shape of y_seq:", y_seq.shape)

- Splits the dataset (80% train, 20% test) while keeping label distribution balanced (stratify=y_seq).
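
The effect of stratify is easiest to see on synthetic data. The sketch below assumes a made-up, perfectly balanced set of 100 samples and integer labels (instead of the one-hot y_seq) purely to keep the counts easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 "videos", 50 real (0) and 50 fake (1)
X_demo = np.zeros((100, 5, 1280), dtype=np.float32)
y_demo = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# Class counts per split; stratification keeps the 50/50 ratio in both
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

Without stratify, a small test set could end up with noticeably more videos of one class than the other by chance.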

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_seq, y_seq, stratify=y_seq, test_size=0.2, random_state=42)

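Builds a sequential model to learn temporal patterns across the per-frame face features of a video:

- LSTM(64) reads the sequence of 5 frame feature vectors and returns its final output.
- Dropout(0.5) helps reduce overfitting.
- Dense(64, relu) adds a fully connected hidden layer.
- Dense(2, softmax) outputs class probabilities (real/fake).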
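
As a sanity check on model size, the trainable parameter counts for this architecture can be worked out by hand. The helper functions below are illustrative only; they assume the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias, and the result should agree with what model.summary() reports.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output unit
    return input_dim * units + units

lstm_count = lstm_params(1280, 64)    # LSTM(64) on 1280-dim inputs
hidden_count = dense_params(64, 64)   # Dense(64, relu)
output_count = dense_params(64, 2)    # Dense(2, softmax)
total_count = lstm_count + hidden_count + output_count

print(lstm_count, hidden_count, output_count, total_count)
# 344320 4160 130 348610
```

Nearly all of the parameters sit in the LSTM layer, driven by the 1280-dimensional input. Dropout adds none.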

In [None]:
model = Sequential([
    LSTM(64, input_shape=(FRAMES_PER_VIDEO, 1280), return_sequences=False),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(2, activation='softmax')
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

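Trains the model for 10 epochs with a batch size of 16, using the Adam optimizer and categorical_crossentropy loss.
Note that the test split is also passed as validation_data, so the validation metrics reported during training are computed on the test set.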

In [None]:
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=16)

- Converts predicted probabilities to class labels.
- Prints a classification report (precision, recall, F1-score).
- Plots a confusion matrix.
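
The conversion from probabilities to class labels, and the layout of the confusion matrix, can be shown on a tiny made-up example. The probabilities and labels below are invented for illustration and are not model outputs; the matrix is built with plain NumPy, with rows as actual classes and columns as predicted classes.

```python
import numpy as np

# Invented softmax outputs for four videos: columns are [real, fake]
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.3, 0.7]])
y_true_demo = np.array([0, 1, 1, 1])

# argmax picks the class with the highest probability in each row
y_pred_demo = np.argmax(probs, axis=1)

# Confusion matrix: cm[actual, predicted] counts each (actual, predicted) pair
cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (y_true_demo, y_pred_demo), 1)

# Recall for the "fake" class: correctly flagged fakes / all actual fakes
fake_recall = cm[1, 1] / cm[1].sum()
print(y_pred_demo, cm, fake_recall)
```

Here the third video is a fake predicted as real, so it lands in row 1, column 0, the cell that counts missed fakes.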

In [None]:
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true = np.argmax(y_test, axis=1)

print("\nClassification Report:")
print(classification_report(y_true, y_pred_classes))

plt.figure(figsize=(6, 5))
sns.heatmap(confusion_matrix(y_true, y_pred_classes), annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()