
=============================================================================
# **MINOR PROJECT: DEEPFAKE DETECTION USING CNN AND FREQUENCY DOMAIN ANALYSIS**
=============================================================================
# **OBJECTIVE**:
This script automates the entire data preparation pipeline for the deepfake detection project.
Its responsibilities include:
1. Configuring the Google Colab environment with the necessary Kaggle credentials.
2. Downloading the Celeb-DF (v2) dataset directly from Kaggle.
3. Installing required libraries for video and image processing.
4. Systematically extracting frames from both real and fake videos.
5. Detecting and cropping faces from each frame using a Multi-task Cascaded Convolutional Network (MTCNN).
6. Standardizing the cropped face images by resizing them.
7. Organizing and saving the processed face images into a clean, structured directory that is ready for the Deep Learning Engineer to use for model training.


# **SECTION 1: ENVIRONMENT SETUP AND DATASET ACQUISITION**
**Purpose:** To prepare the Colab environment for data download and processing. This involves authenticating with the Kaggle API and downloading the specified dataset.

**Step 1.1: Authenticate Kaggle API**

First, we need to upload the kaggle.json API token file. This file contains your credentials and is required by the Kaggle API to authorize downloads. Running the cell below will prompt you to upload the file from your computer.

In [1]:
# We begin by uploading the 'kaggle.json' API token file. This file contains the user's
# credentials and is required by the Kaggle API to authorize downloads.

from google.colab import files
import os
import shutil

print("--- Authenticating Kaggle API ---")
print("Please upload your 'kaggle.json' file to authenticate.")

# Clean up any previous versions of the file or directory for a fresh start.
if os.path.exists('kaggle.json'):
    os.remove('kaggle.json')
if os.path.exists('/root/.kaggle'):
    shutil.rmtree('/root/.kaggle')

uploaded = files.upload()

# Once uploaded, the token must be moved to the appropriate directory (~/.kaggle/) where the
# Kaggle client expects to find it. We also set the correct file permissions (600) for security.
if 'kaggle.json' in uploaded:
    print("Authentication file uploaded successfully.")
    # Create the .kaggle directory.
    os.makedirs('/root/.kaggle', exist_ok=True)
    # Move the kaggle.json file into the directory (shutil.move also works across filesystems).
    shutil.move('kaggle.json', '/root/.kaggle/kaggle.json')
    # Set permissions to read/write for the owner only.
    os.chmod('/root/.kaggle/kaggle.json', 0o600)
    print("Kaggle API key configured.")
else:
    print("Upload failed. Please re-run the cell and upload the 'kaggle.json' file.")
    # Stop this cell if authentication fails, without shutting down the runtime.
    raise SystemExit("Kaggle authentication file was not uploaded.")

--- Authenticating Kaggle API ---
Please upload your 'kaggle.json' file to authenticate.


Saving kaggle.json to kaggle.json
Authentication file uploaded successfully.
Kaggle API key configured.
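
Before moving the token into place, it can help to confirm that the uploaded file really is a Kaggle token. The sketch below assumes the usual kaggle.json layout, a JSON object with `username` and `key` fields. The helper name `validate_kaggle_token` is hypothetical and not part of the original notebook.

```python
import json
import os
import tempfile


def validate_kaggle_token(path):
    """Return True if `path` holds a JSON object with non-empty 'username' and 'key' strings."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            token = json.load(fh)
    except (OSError, ValueError):
        # Missing or unreadable file, or content that is not valid JSON.
        return False
    if not isinstance(token, dict):
        return False
    return all(
        isinstance(token.get(field), str) and bool(token[field])
        for field in ("username", "key")
    )


# Demo with throwaway files (placeholder values, not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.json")
    bad = os.path.join(tmp, "bad.json")
    with open(good, "w", encoding="utf-8") as fh:
        json.dump({"username": "example_user", "key": "0123456789abcdef"}, fh)
    with open(bad, "w", encoding="utf-8") as fh:
        fh.write("not json")
    print(validate_kaggle_token(good))  # True
    print(validate_kaggle_token(bad))   # False
```

In the notebook, this check could run on 'kaggle.json' right after `files.upload()` returns, so a wrong file is caught before the download step.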


**Step 1.2: Download the Celeb-DF (v2) Dataset**

Now that the API is configured, we will install the kagglehub library and use its dataset_download function to fetch our dataset. This function downloads the archive, extracts it automatically, and returns the path to the extracted files.

In [2]:
# We install the 'kagglehub' library, which provides a high-level interface for
# interacting with Kaggle resources. Then, we download the dataset.

print("--- Downloading Dataset ---")
# The '-q' flag ensures a quiet installation with minimal output.
!pip install kagglehub -q
import kagglehub

# Using kagglehub, we download the dataset. This function handles both downloading
# and unzipping the data, returning the path to the extracted files.
print("Downloading the Celeb-DF (v2) dataset. This may take several minutes...")
try:
    dataset_path = kagglehub.dataset_download("reubensuju/celeb-df-v2")
    print(f"Dataset downloaded and extracted to: {dataset_path}")
except Exception as e:
    print(f"An error occurred during download: {e}")
    print("Please ensure you have accepted the dataset's rules on the Kaggle website.")
    raise SystemExit("Dataset download failed.")

--- Downloading Dataset ---
Downloading the Celeb-DF (v2) dataset. This may take several minutes...
Downloading from https://www.kaggle.com/api/v1/datasets/download/reubensuju/celeb-df-v2?dataset_version_number=1...


100%|██████████| 9.29G/9.29G [01:28<00:00, 112MB/s]

Extracting files...





Dataset downloaded and extracted to: /root/.cache/kagglehub/datasets/reubensuju/celeb-df-v2/versions/1
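
Once the download finishes, a quick look at the extracted folder confirms that the expected sub-directories exist before any paths are hard-coded. The sketch below counts .mp4 files per top-level folder. The helper name `count_videos_by_folder` is hypothetical, and the demo builds a tiny stand-in tree rather than reading the real dataset.

```python
import os
import tempfile
from collections import Counter


def count_videos_by_folder(root, extension=".mp4"):
    """Count files ending in `extension` under each top-level folder of `root`."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = rel.split(os.sep)[0]  # '.' for files stored directly in root
        counts[top] += sum(1 for name in filenames if name.lower().endswith(extension))
    # Drop folders that contain no matching files.
    return {folder: n for folder, n in counts.items() if n > 0}


# Demo on a small stand-in directory tree.
with tempfile.TemporaryDirectory() as tmp:
    layout = {"Celeb-real": ["a.mp4", "b.mp4"], "Celeb-synthesis": ["c.mp4"]}
    for folder, names in layout.items():
        os.makedirs(os.path.join(tmp, folder))
        for name in names:
            open(os.path.join(tmp, folder, name), "wb").close()
    print(count_videos_by_folder(tmp))
```

In the notebook, the call would be `count_videos_by_folder(dataset_path)`.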


# **SECTION 2: VIDEO PREPROCESSING AND FACE EXTRACTION**
**Purpose:** To process the raw video files into a structured set of face images suitable for training a neural network.

**Step 2.1: Install Computer Vision Libraries**

Next, we install the necessary Python libraries for our task: mtcnn for high-accuracy face detection and opencv-python for all video and image manipulation tasks.

In [3]:
print("--- Installing Computer Vision Libraries ---")
# We install 'mtcnn' for high-accuracy face detection and 'opencv-python' for video
# and image manipulation tasks.
!pip install mtcnn opencv-python -q
import cv2
from mtcnn.mtcnn import MTCNN
import glob

print("Libraries installed successfully.")

--- Installing Computer Vision Libraries ---
Libraries installed successfully.


**Step 2.2: Define File Paths, Parameters, and Initialize Detector**

Here, we'll set up all the important variables for our script. This includes defining the paths to the video files, creating the output directories for our processed images, setting parameters like image size, and initializing the MTCNN face detector model.

In [4]:
print("--- Defining Paths and Parameters ---")
# Path to the source videos within the downloaded dataset structure.
real_videos_path = os.path.join(dataset_path, 'Celeb-real', '*.mp4')
fake_videos_path = os.path.join(dataset_path, 'Celeb-synthesis', '*.mp4')

# Parent directory for all processed data.
output_parent_dir = '/content/processed_data'
# Subdirectories for real and fake face images.
output_dir_real = os.path.join(output_parent_dir, 'real')
output_dir_fake = os.path.join(output_parent_dir, 'fake')

# Create the output directories if they do not already exist.
os.makedirs(output_dir_real, exist_ok=True)
os.makedirs(output_dir_fake, exist_ok=True)
print(f"Output directories created at: {output_parent_dir}")

# Parameters for processing.
FRAME_INTERVAL = 15      # Extract one frame every 15 frames.
IMAGE_SIZE = (224, 224)  # Target image size for the model (e.g., ResNet, VGG).
MAX_VIDEOS_PER_CLASS = 200 # Limit the number of videos to process for a quicker run.

# Initialize the MTCNN model.
print("Initializing MTCNN face detector...")
detector = MTCNN()
print("MTCNN detector ready.")


--- Defining Paths and Parameters ---
Output directories created at: /content/processed_data
Initializing MTCNN face detector...
MTCNN detector ready.
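
To get a feel for what FRAME_INTERVAL = 15 means in practice, the sampling rule used later (`frame_index % FRAME_INTERVAL == 0`) can be worked through on its own. The sketch below lists the frame indices that would be inspected and the resulting upper bound on saved faces. The 450-frame clip is a hypothetical example, and both helper names are illustrative.

```python
import math

FRAME_INTERVAL = 15  # Same value as in the notebook.


def sampled_frame_indices(total_frames, interval=FRAME_INTERVAL):
    """Indices i in [0, total_frames) with i % interval == 0."""
    return list(range(0, total_frames, interval))


def expected_samples(total_frames, interval=FRAME_INTERVAL):
    """Number of sampled frames, computed without building the list."""
    return math.ceil(total_frames / interval)


# A hypothetical 450-frame clip.
indices = sampled_frame_indices(450)
print(indices[:4])            # [0, 15, 30, 45]
print(indices[-1])            # 435
print(expected_samples(450))  # 30
print(expected_samples(451))  # 31
```

The count is an upper bound on saved images per video, since sampled frames in which no face is detected are skipped.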


**Step 2.3: Define the Video Processing Function**

To keep our code clean and reusable, we define a single function process_videos that handles the entire pipeline for a given set of videos: extracting frames, detecting faces, cropping, resizing, and saving them.

In [5]:
# A modular function is created to handle the processing of a list of video files. This
# promotes code reusability and clarity.

def process_videos(video_paths, output_dir, video_type, limit):
    """
    Extracts, detects, crops, and saves faces from a list of video files.

    Args:
        video_paths (list): A list of paths to the video files.
        output_dir (str): The directory where processed faces will be saved.
        video_type (str): A string identifier ('real' or 'fake') for naming files.
        limit (int): The maximum number of videos to process from the list.
    """
    videos_processed = 0
    for video_path in video_paths:
        if videos_processed >= limit:
            print(f"Reached processing limit of {limit} videos for class '{video_type}'.")
            break

        print(f"Processing {video_type} video: {os.path.basename(video_path)}")

        cap = cv2.VideoCapture(video_path)
        frame_index = 0
        face_saved_count = 0

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            if frame_index % FRAME_INTERVAL == 0:
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                results = detector.detect_faces(frame_rgb)

                if results:
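                    x, y, width, height = results[0]['box']
                    # MTCNN can return slightly negative coordinates near the frame edge.
                    # Clamp them to 0 instead of using abs(), which would shift the box.
                    x1, y1 = max(0, x), max(0, y)
                    x2, y2 = x + width, y + height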

                    face = frame[y1:y2, x1:x2]

                    if face.size > 0:
                        face_resized = cv2.resize(face, IMAGE_SIZE)
                        output_filename = f"{os.path.basename(video_path).split('.')[0]}_frame{frame_index}.jpg"
                        save_path = os.path.join(output_dir, output_filename)
                        cv2.imwrite(save_path, face_resized)
                        face_saved_count += 1
                    else:
                        print(f"  - Warning: Empty face crop at frame {frame_index}.")
            frame_index += 1

        cap.release()
        print(f"  - Finished. Saved {face_saved_count} faces.")
        videos_processed += 1

print("Video processing function defined.")


Video processing function defined.
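
The cropping step can be isolated into two small helpers, which makes the box arithmetic easy to check without a video. The sketch below assumes the detection format returned by the mtcnn package (a list of dicts with 'box' and 'confidence' keys). Picking the highest-confidence face and the optional `margin` padding are illustrative choices, not what the notebook does (it takes `results[0]` with no padding).

```python
def pick_most_confident(detections):
    """Return the detection with the highest 'confidence', or None if the list is empty."""
    if not detections:
        return None
    return max(detections, key=lambda d: d["confidence"])


def clamp_box(box, frame_width, frame_height, margin=0.0):
    """Convert an (x, y, w, h) box to (x1, y1, x2, y2) clipped to the frame.

    `margin` is a hypothetical padding fraction applied to each side.
    """
    x, y, w, h = box
    pad_x, pad_y = int(w * margin), int(h * margin)
    x1 = max(0, x - pad_x)
    y1 = max(0, y - pad_y)
    x2 = min(frame_width, x + w + pad_x)
    y2 = min(frame_height, y + h + pad_y)
    return x1, y1, x2, y2


# Hand-written detections standing in for detector.detect_faces(frame_rgb).
detections = [
    {"box": [-5, 10, 100, 120], "confidence": 0.91},
    {"box": [200, 50, 80, 90], "confidence": 0.99},
]
best = pick_most_confident(detections)
print(best["box"])                                     # [200, 50, 80, 90]
print(clamp_box([-5, 10, 100, 120], 640, 480))         # (0, 10, 95, 130)
print(clamp_box(best["box"], 640, 480, margin=0.1))    # (192, 41, 288, 149)
```

With the clamped coordinates, the crop is `frame[y1:y2, x1:x2]`, using the same row-then-column slicing as in process_videos.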


**Step 2.4: Execute the Processing Pipeline**

This is the final step where we put everything together. We get the lists of real and fake video files and pass them to our process_videos function to start the main task. This will take some time to complete.

In [6]:
# We fetch the video file paths and call the processing function for both real and fake videos.

print("\n--- Starting Video Processing Pipeline ---")

# Get a list of all video files.
real_video_files = glob.glob(real_videos_path)
fake_video_files = glob.glob(fake_videos_path)
print(f"Found {len(real_video_files)} real videos and {len(fake_video_files)} fake videos.")

# Process FAKE videos.
process_videos(fake_video_files, output_dir_fake, "fake", MAX_VIDEOS_PER_CLASS)

# Process REAL videos.
process_videos(real_video_files, output_dir_real, "real", MAX_VIDEOS_PER_CLASS)

print("\n Processing pipeline finished.")


--- Starting Video Processing Pipeline ---
Found 590 real videos and 5639 fake videos.
Processing fake video: id53_id49_0009.mp4
  - Finished. Saved 31 faces.
Processing fake video: id34_id21_0005.mp4
  - Finished. Saved 32 faces.
Processing fake video: id57_id50_0004.mp4
  - Finished. Saved 31 faces.
Processing fake video: id39_id40_0001.mp4
  - Finished. Saved 21 faces.
Processing fake video: id23_id24_0009.mp4
  - Finished. Saved 22 faces.
Processing fake video: id30_id26_0004.mp4
  - Finished. Saved 27 faces.
Processing fake video: id26_id24_0004.mp4
  - Finished. Saved 24 faces.
Processing fake video: id9_id23_0005.mp4
  - Finished. Saved 28 faces.
Processing fake video: id47_id40_0007.mp4
  - Finished. Saved 23 faces.
Processing fake video: id35_id16_0000.mp4
  - Finished. Saved 33 faces.
Processing fake video: id59_id5_0002.mp4
  - Finished. Saved 34 faces.
Processing fake video: id2_id0_0008.mp4
  - Finished. Saved 35 faces.
Processing fake video: id21_id35_0000.mp4
  - Finish
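
Because process_videos stops after MAX_VIDEOS_PER_CLASS videos, the subset it sees depends on the order in which glob returns paths, which is not guaranteed to be stable across runs or machines. One way to make the selection repeatable is to sort the paths and draw a seeded random sample. The sketch below illustrates this with made-up file names. The helper `select_videos` and the seed value are assumptions, not part of the original pipeline.

```python
import random


def select_videos(paths, limit, seed=42):
    """Return at most `limit` paths, chosen reproducibly regardless of input order."""
    ordered = sorted(paths)
    if len(ordered) <= limit:
        return ordered
    # A dedicated Random instance keeps the global random state untouched.
    return sorted(random.Random(seed).sample(ordered, limit))


# Made-up file names standing in for glob.glob(fake_videos_path).
paths = [f"id{i}_id0_0000.mp4" for i in range(10)]
subset = select_videos(paths, limit=3)
print(len(subset))                                               # 3
print(subset == select_videos(list(reversed(paths)), limit=3))   # True
```

The selected list could then be passed to process_videos in place of the raw glob result, with `limit` set to the same value.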

**Step 2.5: Verification**

The data preparation is now complete! The final cell below will print a summary of the processed images. The processed_data folder now contains separate real and fake subfolders filled with clean face images, ready for the Deep Learning Engineer on your team. You can now download this folder or mount Google Drive to share it.

In [7]:
print("\n=========================================================")
print("          DATA PREPARATION COMPLETE")
print("=========================================================")
real_count = len(os.listdir(output_dir_real))
fake_count = len(os.listdir(output_dir_fake))
print(f"Total processed REAL face images: {real_count}")
print(f"Total processed FAKE face images: {fake_count}")


          DATA PREPARATION COMPLETE
Total processed REAL face images: 5148
Total processed FAKE face images: 5072
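
The totals above do not show how images are distributed across source videos. Since each saved file is named `<video>_frame<index>.jpg`, a per-video count can be recovered from the file names alone. The sketch below does this with the standard library. The helper name `faces_per_video` is hypothetical, and the sample names are taken from the zip log in this notebook.

```python
from collections import Counter


def faces_per_video(filenames):
    """Count saved face images per source video, based on the '<video>_frame<index>.jpg' pattern."""
    counts = Counter()
    for name in filenames:
        stem, sep, _rest = name.rpartition("_frame")
        if sep:  # Skip files that do not follow the naming pattern.
            counts[stem] += 1
    return counts


sample = ["id29_0009_frame60.jpg", "id0_0004_frame300.jpg", "id0_0004_frame120.jpg"]
counts = faces_per_video(sample)
print(counts["id0_0004"])   # 2
print(counts["id29_0009"])  # 1
```

In the notebook, `faces_per_video(os.listdir(output_dir_real))` would give the distribution for the real class, which helps spot videos that contributed few or no faces.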


**Step 2.6: Save Processed Data to Google Drive**

In [11]:
print("Zipping the processed_data folder...")
!zip -r /content/processed_data.zip /content/processed_data
print("✅ Zipping complete. File 'processed_data.zip' is ready for download.")

Streaming output truncated to the last 5000 lines.
  adding: content/processed_data/real/id29_0009_frame60.jpg (deflated 2%)
  adding: content/processed_data/real/id7_0002_frame105.jpg (deflated 1%)
  adding: content/processed_data/real/id10_0004_frame255.jpg (deflated 1%)
  adding: content/processed_data/real/id0_0004_frame300.jpg (deflated 1%)
  adding: content/processed_data/real/id28_0001_frame135.jpg (deflated 2%)
  adding: content/processed_data/real/id61_0006_frame195.jpg (deflated 2%)
  adding: content/processed_data/real/id5_0004_frame450.jpg (deflated 2%)
  adding: content/processed_data/real/id20_0004_frame60.jpg (deflated 1%)
  adding: content/processed_data/real/id26_0004_frame120.jpg (deflated 1%)
  adding: content/processed_data/real/id41_0008_frame210.jpg (deflated 1%)
  adding: content/processed_data/real/id47_0007_frame105.jpg (deflated 2%)
  adding: content/processed_data/real/id0_0004_frame120.jpg (deflated 1%)
  adding: content/processed_data/real/id6
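
The cell above only creates the archive. Copying it to Google Drive is a separate step. The sketch below shows one possible way to do both from Python with `shutil`, which also avoids the long per-file zip log. The helper name `archive_and_copy` is hypothetical. In Colab the destination would typically be a folder under a mounted Drive, as noted in the comments, which is an assumption about the environment. The demo uses temporary folders so it runs anywhere.

```python
import os
import shutil
import tempfile


def archive_and_copy(source_dir, archive_base, destination_dir):
    """Zip `source_dir` into `archive_base + '.zip'`, copy it to `destination_dir`, return the copy's path."""
    source_dir = os.path.abspath(source_dir)
    archive_path = shutil.make_archive(
        archive_base,
        "zip",
        root_dir=os.path.dirname(source_dir),   # Parent folder of the data.
        base_dir=os.path.basename(source_dir),  # Keep 'processed_data/' as the top-level entry.
    )
    os.makedirs(destination_dir, exist_ok=True)
    return shutil.copy2(archive_path, destination_dir)


# In Colab, Drive would first be mounted (assumed setup, not run here):
#   from google.colab import drive
#   drive.mount('/content/drive')
#   archive_and_copy('/content/processed_data', '/content/processed_data', '/content/drive/MyDrive')

# Demo with temporary folders standing in for /content and Drive.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "processed_data")
    os.makedirs(os.path.join(src, "real"))
    with open(os.path.join(src, "real", "sample.jpg"), "wb") as fh:
        fh.write(b"placeholder")
    copied = archive_and_copy(src, os.path.join(tmp, "processed_data"), os.path.join(tmp, "drive"))
    print(os.path.basename(copied))  # processed_data.zip
```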