# Preprocessing

In this notebook we read in the train videos and extract a chosen number of frames from each. We then run face detection on those frames, crop every detected face, and write each one out as its own image (after resizing).

No augmentation is done in this notebook, which leaves us the option of doing it after the raw face images have been written. That way we can try numerous augmentation techniques without having to extract the frames again, and we apply each technique to the same raw images every time (and thus have a more reliable testing environment).

We use OpenCV to read the videos, extract the frames and resize the face crops. The [MTCNN algorithm](https://github.com/ipazc/mtcnn) is used for face detection. It is an effective algorithm, but I am keen to try quicker, more lightweight detectors like BlazeFace and YOLOv2 later on.

------------------------------
*PLEASE NOTE*:
The scripts in this notebook have been designed for the FULL training dataset on [Kaggle](https://www.kaggle.com/c/deepfake-detection-challenge). Expect some path- and folder-related errors if you attempt this using the train_sample data.

In [1]:
#!pip install ../input/mtcnn-package/mtcnn-0.1.0-py3-none-any.whl

import pandas as pd
import numpy as np

import os
import sys
import shutil

import cv2
from mtcnn import MTCNN

from tqdm.notebook import tqdm
import random

import warnings
warnings.filterwarnings("ignore")

Using TensorFlow backend.


First we define our directory paths, including the directory where we will save the train images that we extract from the videos.

In [2]:
train_videos_path = '../input/train_videos/'
train_metadata_path = '../input/train_metadata/'
train_images_path = "../input/train_images/" # path to save train images to

We'll loop through all of the videos in all the train folders to build one list of paths. We will also rename each metadata file (so we can tell which folder it corresponds to) and copy it to a new directory, 'train_metadata'.

In [3]:
train_videos_files = [] # List of all train videos paths
train_metadata_files = [] # List of train metadata paths

os.makedirs(train_metadata_path, exist_ok=True) # shutil.copy needs the target directory to exist

for idx, folder in enumerate(os.listdir(train_videos_path)):
    for file in os.listdir(train_videos_path + folder):
        if file == 'metadata.json':
            # Rename and copy the metadata to a new directory
            old_path = train_videos_path + folder + '/' + file
            new_path = train_metadata_path + 'metadata' + str(idx) + '.json'
            shutil.copy(old_path, new_path)
            train_metadata_files.append(new_path)
        else:
            train_videos_files.append(train_videos_path + folder + '/' + file)
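
The copied metadata files can later be merged into a single lookup of labels. A minimal sketch, assuming each `metadata.json` follows the competition's layout of mapping a video file name to a dict with a `label` field (`REAL` or `FAKE`); the helper name `load_labels` and the demo entries are hypothetical and not part of the original pipeline:

```python
import json
import os
import tempfile

def load_labels(metadata_files):
    """Merge several metadata JSON files into one {video_file_name: label} dict."""
    labels = {}
    for path in metadata_files:
        with open(path) as f:
            metadata = json.load(f)
        for video_name, info in metadata.items():
            labels[video_name] = info['label']
    return labels

# Demo on a throwaway file with made-up entries (not real competition data)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'metadata0.json')
with open(demo_path, 'w') as f:
    json.dump({'aaa.mp4': {'label': 'FAKE', 'split': 'train', 'original': 'bbb.mp4'},
               'bbb.mp4': {'label': 'REAL', 'split': 'train'}}, f)

demo_labels = load_labels([demo_path])
print(demo_labels)
```

In the notebook's setting, `load_labels(train_metadata_files)` would give one dict covering every train folder.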

Now we loop round all the videos in our directory to extract images for each video.

In [4]:
def extract_faces(videos_dir_path, images_dir_path, frames=1, conf_level=0.95):
    """
    Takes a directory of videos (or a list of video paths) and extracts n frames from each. 
    Outputs images of ANY faces detected in those frames.
    
    videos_dir_path: (str or list) Path to your directory of videos, or a list of video file paths
    images_dir_path: (str) Path to where you'll save your images to
    frames: (int or list) If int, take that many random frames. If list, take the frame numbers specified in the list. 
    conf_level: (float) Minimum detection confidence required to keep a face.
    """
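    def crop(img, x, y, w, h):
        """
        Crop a face with a 40px margin on each side and resize it to 256x256
        so that every saved image has the same shape
        """
        x -= 40
        y -= 40
        w += 80
        h += 80
        # Clamp the top-left corner to the frame
        x = max(x, 0)
        y = max(y, 0)
        # Keep the crop in BGR, the channel order cv2.imwrite expects
        return cv2.resize(img[y:y + h, x:x + w], (256, 256))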
    
    if isinstance(videos_dir_path, list): 
        videos_dir = videos_dir_path
    else: 
        videos_dir = os.listdir(videos_dir_path) # List train vids
    
    # Extract images from videos
    if isinstance(frames, list):
        print(f'Extracting frames {frames} from videos')
    else:
        print(f'Extracting {frames} random frame(s) from videos')
        
    with tqdm(total=len(videos_dir)) as pbar: 
        for i in range(0, len(videos_dir)): 
            try:
                if isinstance(videos_dir_path, list): 
                    file_path = videos_dir_path[i] # full file path
                    file_name = os.path.basename(file_path) # file name with .ext
                else: 
                    file_name = videos_dir[i] # file name with .ext
                    file_path = os.path.join(videos_dir_path, file_name) # full file path
                vid_name = file_name.split('.')[0] # file name without .ext

                # Open the video once and read each requested frame from it
                cap = cv2.VideoCapture(file_path)
                vid_length = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
                if isinstance(frames, list):
                    frame_numbers = frames # Use the frame numbers given in the list
                else:
                    # randint is inclusive at both ends, so the last valid index is vid_length - 1
                    frame_numbers = [random.randint(0, vid_length - 1) for _ in range(frames)]
                for num, frame_number in enumerate(frame_numbers):
                    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number) # Jump to the chosen frame
                    ret, frame = cap.read()
                    if not ret:
                        continue # Frame could not be read, so skip it
                    image_name = vid_name + '_' + str(num) + '.jpg'
                    cv2.imwrite(os.path.join(images_dir_path, image_name), frame) # Save frame as image
                cap.release()
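            except Exception as e:
                # Report the failure instead of silently dropping the video
                print(f'Skipping {videos_dir[i]}: {e}')
            pbar.update(1)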
    images_dir = os.listdir(images_dir_path) # List newly created training images
    detector = MTCNN()

    print('Extracting faces from frames')
    with tqdm(total=len(images_dir)) as pbar:
        
        for image in range(0, len(images_dir)):
            try:
                image_name = images_dir[image].split('.')[0] # Get image name without .ext
                # Read image and detect faces
                frame = cv2.imread(images_dir_path + images_dir[image])
                result = detector.detect_faces(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)) # MTCNN expects RGB input
                # Extract and save faces as their own images
                for face in range(0, len(result)):
                    # Only extract the face if confidence is at least conf_level (default 0.95)
                    if result[face]['confidence'] >= conf_level:
                        x, y, w, h = result[face]['box'] # MTCNN boxes are [x, y, width, height]
                        crop_img = crop(frame, x, y, w, h)
                        cv2.imwrite(images_dir_path + image_name + '_' + str(face) + '.jpg', crop_img)
                os.remove(images_dir_path + images_dir[image]) # Delete original images
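            except Exception as e:
                # Report the failure instead of silently dropping the image
                print(f'Skipping {images_dir[image]}: {e}')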
            pbar.update(1)
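
The margin logic inside `crop` can be looked at in isolation. A minimal sketch, assuming a hypothetical helper `padded_box` (not part of the notebook) that takes an MTCNN-style `[x, y, width, height]` box and also clamps the far edges to the frame size, which the notebook leaves to NumPy slicing:

```python
def padded_box(x, y, w, h, img_w, img_h, margin=40):
    """Grow an [x, y, width, height] box by `margin` px on every side,
    clamped to the image. Returns (x1, y1, x2, y2) slice bounds."""
    x1 = max(x - margin, 0)
    y1 = max(y - margin, 0)
    x2 = min(x + w + margin, img_w)
    y2 = min(y + h + margin, img_h)
    return x1, y1, x2, y2

# A face well inside a 1920x1080 frame, and one near the top-left corner
print(padded_box(100, 120, 60, 80, img_w=1920, img_h=1080))
print(padded_box(10, 5, 60, 80, img_w=1920, img_h=1080))
```

Unlike the notebook's `crop`, this keeps the right and bottom edges at the face plus margin even when the top-left corner has been clamped to zero. The result would be used as `img[y1:y2, x1:x2]`.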

In [5]:
extract_faces(train_videos_files, train_images_path, frames=2)

Extracting 2 random frames from videos


Extracting faces from frames






We have completed the face extraction and image preprocessing stage for the training data. We should now have a directory of images that we will train our model with.

This is by no means a perfect solution - it took ~1 day to complete on the entire training set. I ended up with ~120,000 images. 

I'd like to find a way to avoid writing the images twice (once for the frame extraction, and once for the face extraction) but have not yet found a solution. I will look into this at a later date.
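
One possible direction is to run detection on each frame while it is still in memory, so that only the face crops are ever written. A minimal sketch, under the assumption that frames arrive as `(name, frame)` pairs and that the detector returns MTCNN-style dicts with `box` and `confidence` keys; `confident_boxes` and the stand-in detector are hypothetical, and this has not been run on the competition data:

```python
def confident_boxes(named_frames, detect, conf_level=0.95):
    """Yield (output_name, frame, box) for each sufficiently confident face.

    named_frames: iterable of (name, frame) pairs held in memory
    detect: callable returning a list of dicts with 'box' and 'confidence'
    """
    for name, frame in named_frames:
        for face_idx, face in enumerate(detect(frame)):
            if face['confidence'] >= conf_level:
                yield f'{name}_{face_idx}.jpg', frame, face['box']

# Stand-in detector so the sketch runs without a video or a model
def fake_detect(frame):
    return [{'box': [10, 20, 50, 60], 'confidence': 0.99},
            {'box': [200, 40, 30, 30], 'confidence': 0.50}]

found = list(confident_boxes([('vid_0', 'frame-data')], fake_detect))
print([(name, box) for name, _, box in found])
```

In the notebook's setting, `named_frames` would be fed from `cap.read()` and `detect` would be `detector.detect_faces`, with each yielded box passed to `crop` and the result to `cv2.imwrite`, so the intermediate frame images are never saved.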

We'll revisit this code in the test stage, when we create our test pipeline.

The next stage of this project is the [Train notebook](https://github.com/TheNerdyCat/deepfake-detection-challenge/blob/master/output/train.ipynb)