# **Data Preprocessing**

This notebook covers the **data preprocessing steps** required to clean and prepare the dataset for further analysis and model training.

# **1. Importing Necessary Libraries**

This section installs `face_recognition` and imports the libraries used for file handling, video processing, face detection, and progress reporting.

In [1]:
!pip install face_recognition
import zipfile
import glob
import numpy as np
import cv2
import os
import face_recognition
from tqdm import tqdm


Collecting face_recognition
  Downloading face_recognition-1.3.0-py2.py3-none-any.whl.metadata (21 kB)
Collecting face-recognition-models>=0.3.0 (from face_recognition)
  Downloading face_recognition_models-0.3.0.tar.gz (100.1 MB)
  Preparing metadata (setup.py) ... done
Downloading face_recognition-1.3.0-py2.py3-none-any.whl (15 kB)
Building wheels for collected packages: face-recognition-models
  Building wheel for face-recognition-models (setup.py) ... done
  Created wheel for face-recognition-models: filename=face_recognition_models-0.3.0-py2.py3-none-any.whl size=100566162 sha256=de4a0a046d0de8c76a707365c2e3f03cd4fff57376a4da6985252ff87be82b53
  Stored in directory: /root/.cache/pip/wheels/04/52/ec/9355da79c29f160b038a20c784db2803c2f9fa2c8a462c176a
Successfully built face-recognition-models
Installing collected packages: face-recog

# **2. Mounting Google Drive**

Since the dataset is stored in Google Drive, this section ensures access to the files.

In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


# **3. Loading the Dataset**

The dataset (zip file) is extracted from Google Drive to a local directory for further processing.

In [9]:
# Path to your zip file in Google Drive
zip_path = "/content/drive/MyDrive/Dataset.zip"

# Destination folder to extract the contents
extract_path = '/content/Dataset'

# Unzip the archive (zipfile was already imported in the first cell)
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print("Unzipping completed. Files are extracted to:", extract_path)


Unzipping completed. Files are extracted to: /content/Dataset
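
The extraction step above assumes the zip file exists and is a valid archive. As a minimal sketch (the helper name `extract_zip` is hypothetical and not part of the original notebook), the same step can be wrapped in a function that fails early with a clear message and reports which files were extracted:

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, extract_path):
    """Extract a zip archive and return the paths of the extracted files.

    Hypothetical helper for illustration; the notebook calls
    zipfile.ZipFile(...).extractall(...) directly.
    """
    zip_path = Path(zip_path)
    if not zip_path.is_file():
        raise FileNotFoundError(f"Zip file not found: {zip_path}")
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"Not a valid zip archive: {zip_path}")

    extract_path = Path(extract_path)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_path)
        # Directory entries end with "/"; keep only regular files.
        names = [n for n in zip_ref.namelist() if not n.endswith("/")]
    return [extract_path / n for n in names]
```

In this notebook it could be called as `extract_zip(zip_path, extract_path)`, and the length of the returned list gives a quick check that the archive was not empty.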


# **4. Checking for Video Files**

We check for .mp4 video files in the extracted dataset directory.

In [10]:
video_files = glob.glob('/content/Dataset/Dataset/Realvideos/*.mp4')

print("\nNumber of .mp4 files found:", len(video_files))

# If no files are found, try recursive search
if len(video_files) == 0:
    print("No .mp4 files found! Checking subdirectories...")
    video_files = glob.glob('/content/Dataset/Dataset/Realvideos/**/*.mp4', recursive=True)
    print("Number of .mp4 files found (recursive search):", len(video_files))



Number of .mp4 files found: 11
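
The two-step search above (flat, then recursive) can also be written as a single recursive search. The sketch below is one possible alternative, not the notebook's method: `find_videos` is a hypothetical helper that uses `pathlib`, matches extensions case-insensitively, and sorts the result so that the processing order is the same on every run (plain `glob.glob` returns files in arbitrary order).

```python
from pathlib import Path


def find_videos(root, extensions=(".mp4",)):
    """Return a sorted list of video file paths found anywhere under root.

    Hypothetical helper for illustration. Extensions are compared in
    lower case, so "clip.MP4" is matched as well as "clip.mp4".
    """
    root = Path(root)
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        str(p)
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )
```

For this dataset the call would be `find_videos('/content/Dataset/Dataset/Realvideos')`. The function returns strings so the result can be passed to `cv2.VideoCapture` unchanged.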


# **5. Filtering Videos with Sufficient Frames**

Videos with fewer than 150 frames are removed from the list.

In [11]:
frame_count = []
for video_file in video_files.copy():  # Iterate over a copy so removing items from video_files is safe
    cap = cv2.VideoCapture(video_file)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()  # Free the capture handle once the frame count has been read
    if n_frames < 150:
        video_files.remove(video_file)
        continue
    frame_count.append(n_frames)

print("\nFiltered frames count:", frame_count)
print("Total number of videos after filtering:", len(frame_count))

# np.mean of an empty list returns NaN, so fall back to 0 when no videos remain
print("Average frame per video:", np.mean(frame_count) if frame_count else 0)



Filtered frames count: [693, 505, 309, 396, 460, 310, 637, 303, 474, 385, 588]
Total number of videos after filtering: 11
Average frame per video: 460.0
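
The filtering rule itself is independent of OpenCV, so it can be separated from the code that reads the frame count. The sketch below is an illustrative refactoring under that assumption: `filter_by_min_frames` and `opencv_frame_count` are hypothetical names, and the frame counter is passed in as a function so the rule can be checked without any video files.

```python
def filter_by_min_frames(paths, count_frames, min_frames=150):
    """Keep only the paths whose frame count is at least min_frames.

    count_frames is any callable that maps a path to an integer frame
    count. Returns (kept_paths, kept_counts) without modifying paths.
    """
    kept_paths, kept_counts = [], []
    for path in paths:
        n_frames = count_frames(path)
        if n_frames >= min_frames:
            kept_paths.append(path)
            kept_counts.append(n_frames)
    return kept_paths, kept_counts


def opencv_frame_count(path):
    """Read the frame count reported by the container via OpenCV."""
    import cv2  # imported lazily so the filter above has no OpenCV dependency

    cap = cv2.VideoCapture(path)
    try:
        return int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    finally:
        cap.release()
```

In the notebook this would be used as `video_files, frame_count = filter_by_min_frames(video_files, opencv_frame_count)`. Returning new lists avoids removing items from `video_files` while looping over it.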


# **6. Creating Output Directory for Processed Data**

A directory is created in Google Drive to store processed video outputs.

In [12]:
output_dir = '/content/drive/My Drive/FF_REAL_Face_only_data/'
os.makedirs(output_dir, exist_ok=True)


# **7. Defining Functions for Face Extraction**

We define functions to:

- Extract frames from videos
- Detect faces and save cropped face-only frames

In [13]:
# Function to extract frames from video
def frame_extract(path):
    vidObj = cv2.VideoCapture(path)
    try:
        while True:
            success, image = vidObj.read()
            if not success:
                break
            yield image
    finally:
        # Release the capture handle, including when the caller stops early
        vidObj.release()

# Function to create face-only videos
def create_face_videos(path_list, out_dir):
    already_present_count = glob.glob(os.path.join(out_dir, '*.mp4'))
    print("\nNumber of videos already present:", len(already_present_count))
    for path in tqdm(path_list, desc="Processing videos"):
        out_path = os.path.join(out_dir, os.path.basename(path))
        if os.path.exists(out_path):
            print(f"File already exists: {out_path}")
            continue
        # Output video: 112x112 face crops written at 30 fps
        out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc('M','J','P','G'), 30, (112, 112))
        frames = []
        for idx, frame in enumerate(frame_extract(path)):
            if idx > 150:
                break  # Only frames 0-150 are used, so stop decoding the rest of the video
            frames.append(frame)
            if len(frames) == 4:
                # Detect faces in a batch of 4 frames at a time
                faces = face_recognition.batch_face_locations(frames)
                for i, face in enumerate(faces):
                    if face:
                        # Use the first detected face; locations are (top, right, bottom, left)
                        top, right, bottom, left = face[0]
                        try:
                            cropped_face = cv2.resize(frames[i][top:bottom, left:right], (112, 112))
                            out.write(cropped_face)
                        except Exception as e:
                            print(f"Error processing frame {idx} in {path}: {e}")
                frames = []
        out.release()
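
One detail of the batching logic above: faces are detected only when exactly 4 frames have accumulated, so any frames left over at the end (up to 3) are never passed to the detector. The sketch below shows one way to group frames so that the final, shorter batch is also yielded. `batched` is a hypothetical helper, and the optional `limit` argument is an assumption standing in for the notebook's frame cap.

```python
def batched(iterable, batch_size, limit=None):
    """Yield lists of up to batch_size items, including a final partial batch.

    If limit is given, at most `limit` items are consumed from iterable.
    """
    batch = []
    for idx, item in enumerate(iterable):
        if limit is not None and idx >= limit:
            break
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the leftover items that did not fill a complete batch
        yield batch
```

With this helper, the inner loop could iterate as `for frames in batched(frame_extract(path), 4, limit=150):` and call `face_recognition.batch_face_locations(frames)` on each batch, so the trailing frames are processed as well.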


# **8. Running the Face Video Creation Process**
We now process the videos and extract face-only clips.

In [14]:
# Run the face video creation if there are videos to process
if video_files:
    create_face_videos(video_files, output_dir)
    print("\nFace video creation completed.")
else:
    print("\nNo videos to process.")



Number of videos already present: 11


Processing videos: 100%|██████████| 11/11 [00:00<00:00, 2792.14it/s]

File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/002.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/007.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/004.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/000.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/001.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/006.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/008.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/003.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/010.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/005.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/009.mp4

Face video creation completed.





# **9. Final Execution of Face Extraction Pipeline**

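We call the function once more on all valid videos. Because every output file already exists, each video is skipped, which shows that the function is safe to re-run.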

In [15]:
create_face_videos(video_files, output_dir)



Number of videos already present: 11


Processing videos: 100%|██████████| 11/11 [00:00<00:00, 4610.97it/s]

File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/002.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/007.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/004.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/000.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/001.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/006.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/008.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/003.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/010.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/005.mp4
File already exists: /content/drive/My Drive/FF_REAL_Face_only_data/009.mp4



