## Summary

This notebook works with a sample of ~12k images from the WiderFace dataset, each containing one or more faces. The dataset has already been converted to a format suitable for training a face detection model with the `ultralytics` library.

* Because an image can contain many faces, and the face-generation modules need faces at a usable resolution, the images are filtered to those with a reasonable number of faces (at most 5) and a reasonable face area relative to the image.


In [4]:
import os

bounding_boxes = os.listdir("Data/labels")

print("Original Dataset Size:", len(bounding_boxes))

def contains_too_many_faces(label_file, max_faces=5):
    """
    Check if the bounding boxes contain too many faces.
    """
    
    #get number of lines in the label file
    with open(os.path.join("Data/labels", label_file), 'r') as file:
        lines = file.readlines()
    
    return len(lines) > max_faces

bounding_boxes = [bb for bb in bounding_boxes if not contains_too_many_faces(bb, max_faces=5)]
print("Filtered Dataset Size:", len(bounding_boxes))

Original Dataset Size: 12880
Filtered Dataset Size: 8439


In [13]:
import numpy as np

def get_area_of_faces(label_file):
    """
    Return the area of the smallest face in the label file,
    as a fraction of the image area (the boxes are normalized).
    """
    with open(os.path.join("Data/labels", label_file), 'r') as file:
        # drop the class id and keep the four box values; skip blank lines
        bboxes = [list(map(float, line.split()[1:])) for line in file if line.strip()]

    # Assumes the four values are (x_min, y_min, x_max, y_max), all normalized.
    areas = [np.abs((bbox[2] - bbox[0]) * (bbox[3] - bbox[1])) for bbox in bboxes]
    return min(areas) if areas else 0

bbox_areas = {file_name: get_area_of_faces(file_name) for file_name in bounding_boxes}

areas = np.array(list(bbox_areas.values()))
#print quantiles of the areas
quantiles = [0.15, 0.25, 0.3, 0.35, 0.5, 0.75, 0.9, 0.95, 0.99]
for quantile in quantiles:
    print(f"{int(quantile*100)}th quantile: {np.quantile(areas, quantile)*100:.2f}%")

15th quantile: 0.33%
25th quantile: 0.69%
30th quantile: 0.89%
35th quantile: 1.14%
50th quantile: 2.12%
75th quantile: 5.63%
90th quantile: 11.88%
95th quantile: 17.07%
99th quantile: 31.87%
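
The cell above reads the four values after the class id as corner coordinates. Ultralytics YOLO label files normally store `class x_center y_center width height`, normalized to the image size. If the files in `Data/labels` follow that layout, the area is simply width times height. The sketch below shows that variant. It assumes the YOLO layout, and the helper name `yolo_min_face_area` is hypothetical, not part of the notebook.

```python
def yolo_min_face_area(label_lines):
    """Smallest box area (fraction of the image) from YOLO-format label lines.

    Assumes each line is `class x_center y_center width height`, normalized.
    """
    areas = []
    for line in label_lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        width, height = float(parts[3]), float(parts[4])
        areas.append(width * height)
    return min(areas) if areas else 0.0


# Two made-up boxes covering about 2% and 20% of the image
sample = ["0 0.5 0.5 0.2 0.1", "0 0.3 0.3 0.4 0.5"]
print(yolo_min_face_area(sample))
```

Comparing a few files against both formulas is a quick way to confirm which layout the labels actually use before relying on the quantiles.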


* The original images are scaled down to 640x640, so a face covering 1% of the image area is roughly 64x64 pixels; samples whose smallest face falls below that are discarded. A threshold of ~4% would be needed to guarantee faces of at least 128x128 pixels. However, since this is only a detection model, face size is not critical, so only the noisy samples are removed. Similarly, the top 1% of images, where the smallest face occupies more than ~32% of the image area, are also removed.
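
The pixel sizes quoted above follow from taking the square root of the area fraction. A minimal sketch of that arithmetic, assuming a square face box and a square 640-pixel image (the helper `face_side_px` is hypothetical):

```python
import math


def face_side_px(area_fraction, image_side=640):
    """Side length in pixels of a square box covering `area_fraction` of a square image."""
    return math.sqrt(area_fraction) * image_side


for fraction in (0.01, 0.04):
    print(f"{fraction:.0%} of the image -> about {face_side_px(fraction):.0f} px per side")
```

Real face boxes are rarely square, so these numbers are only a rough guide to the resolution available to downstream modules.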

In [21]:
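# keep images whose smallest face covers more than 1% of the image and less than
# the 99th-quantile value (31.87%) printed above
bbox_areas = {file_name: area for file_name, area in bbox_areas.items() if 0.01 < area < 0.3187}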
bounding_boxes = list(bbox_areas.keys())
print("Final Dataset Size:", len(bounding_boxes))

Final Dataset Size: 5627
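
The upper bound `0.3187` is copied by hand from the 99th-quantile printout, so it would go stale if the earlier filters changed. One possible alternative is to compute the bound from the data. The sketch below assumes a dictionary shaped like `bbox_areas`; the function name `filter_by_area` and its parameters are hypothetical.

```python
import numpy as np


def filter_by_area(area_by_file, lower=0.01, upper_quantile=0.99):
    """Keep files whose smallest face area lies strictly between `lower`
    and the `upper_quantile` quantile of all areas."""
    if not area_by_file:
        return {}
    upper = np.quantile(list(area_by_file.values()), upper_quantile)
    return {name: area for name, area in area_by_file.items() if lower < area < upper}


# Toy values for illustration only, not taken from the dataset
toy = {"a.txt": 0.005, "b.txt": 0.02, "c.txt": 0.10, "d.txt": 0.50}
print(sorted(filter_by_area(toy)))
```

Because the quantile is recomputed on whatever is passed in, the result can differ slightly from the hard-coded version if the input set changes.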


In [23]:
import shutil
label_path = "Data/labels"
image_path = "Data/images"

filtered_labels = "Data/filtered_labels"
filtered_images = "Data/filtered_images"
os.makedirs(filtered_labels, exist_ok=True)
os.makedirs(filtered_images, exist_ok=True)

for filtered_file in bounding_boxes:
    #copy the label file
    shutil.copy(os.path.join(label_path, filtered_file), os.path.join(filtered_labels, filtered_file))
    
    #copy the image file
    image_file = os.path.splitext(filtered_file)[0] + ".jpg"
    shutil.copy(os.path.join(image_path, image_file), os.path.join(filtered_images, image_file))
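
After copying, it can be worth checking that every label still has a matching image and vice versa. The sketch below is one way to do that, assuming all images use the `.jpg` extension as in the loop above; the helper `find_unpaired` is hypothetical.

```python
import os


def find_unpaired(label_dir, image_dir, image_ext=".jpg"):
    """Return (labels without an image, images without a label) as sorted file stems."""
    label_stems = {os.path.splitext(f)[0] for f in os.listdir(label_dir) if f.endswith(".txt")}
    image_stems = {os.path.splitext(f)[0] for f in os.listdir(image_dir) if f.endswith(image_ext)}
    return sorted(label_stems - image_stems), sorted(image_stems - label_stems)


# Only runs when the filtered folders from the previous cell exist
if os.path.isdir("Data/filtered_labels") and os.path.isdir("Data/filtered_images"):
    missing_images, missing_labels = find_unpaired("Data/filtered_labels", "Data/filtered_images")
    print("Labels without an image:", len(missing_images))
    print("Images without a label:", len(missing_labels))
```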