# Arranging Incorrectly Masked Face Dataset (IMFD)

This notebook splits the __IMFD__ images by sub-label: "__uncovered chin__", "__uncovered nose__", and "__uncovered nose and mouth__". The split images are then sampled and copied to new directories so they can be fed into __tf.keras.preprocessing.image.ImageDataGenerator__.

In [1]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [36]:
import os
import ntpath
import random
import shutil

In [33]:
IMFD_DIR = '/content/drive/MyDrive/capstone_machine_learning/capstone_dataset/IMFD' # Change based on dataset dir on your system/device
DEST_DIR = '/content/drive/MyDrive/capstone_machine_learning/capstone_dataset/arranged_dataset' # Change based on dataset dir on your system/device

# Size for each label
TRAIN_SIZE = 2000
VAL_SIZE = 400
TEST_SIZE = 400

os.listdir(IMFD_DIR)

['00000',
 '01000',
 '02000',
 '03000',
 '04000',
 '05000',
 '06000',
 '07000',
 '08000',
 '09000',
 '10000',
 '11000',
 '12000',
 '13000',
 '14000',
 '15000',
 '16000',
 '17000',
 '18000',
 '19000',
 '20000',
 '21000',
 '22000',
 '23000',
 '24000',
 '25000',
 '26000',
 '27000',
 '28000',
 '29000',
 '30000',
 '31000',
 '32000',
 '33000',
 '34000',
 '35000',
 '36000',
 '37000',
 '38000',
 '39000',
 '40000',
 '41000',
 '42000',
 '43000',
 '44000',
 '45000',
 '46000',
 '47000',
 '48000',
 '49000',
 '50000',
 '51000',
 '52000',
 '53000',
 '54000',
 '55000',
 '56000',
 '57000',
 '58000',
 '59000',
 '60000',
 '61000',
 '62000',
 '63000',
 '64000',
 '65000',
 '66000',
 '67000',
 '68000',
 '69000']

## Splitting the three sub-labels

The IMFD directory does not hold the images as direct children. They are grouped into 70 subfolders of ~1000 images each, i.e. "00000/", "01000/", "02000/", etc. To get all the IMFD images, we need to traverse each of these subfolders.

The idea is to list the image names in every subfolder, then split them into three lists based on the file name. The name states which parts of the face the mask covers, so the uncovered part is whatever the name leaves out: "Mask_Nose_Mouth" goes to `uncov_chin_paths`, "Mask_Mouth_Chin" goes to `uncov_nose_paths`, and "Mask_Chin" goes to `uncov_nose_mouth_paths`.

In [18]:
uncov_chin_paths = []
uncov_nose_paths = []
uncov_nose_mouth_paths = []

for child in os.listdir(IMFD_DIR):
  child_path = os.path.join(IMFD_DIR, child)
  if os.path.isdir(child_path):
    # print(child, len(os.listdir(child_path)))
    # print(os.listdir(child_path))
    for img in os.listdir(child_path):
      img_path = os.path.join(child_path, img)
      if 'Mask_Nose_Mouth' in img:
        uncov_chin_paths.append(img_path)
      elif 'Mask_Mouth_Chin' in img:
        uncov_nose_paths.append(img_path)
      elif 'Mask_Chin' in img:
        uncov_nose_mouth_paths.append(img_path)
      
print('Count of uncovered chin:', len(uncov_chin_paths))
print('Count of uncovered nose:', len(uncov_nose_paths))
print('Count of uncovered nose and mouth:', len(uncov_nose_mouth_paths))

Count of uncovered chin: 6245
Count of uncovered nose: 55653
Count of uncovered nose and mouth: 4836
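The name-based split above can also be written with a lookup table, which keeps the marker-to-label mapping in one place. This is only a sketch, not the notebook's own code: `MARKER_TO_LABEL`, `label_for`, and `group_by_label` are hypothetical names, and `Path.rglob` descends through every nesting level, whereas the cell above looks exactly one level down.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical lookup table: file-name marker -> sub-label (same mapping as the cell above).
MARKER_TO_LABEL = {
    'Mask_Nose_Mouth': 'uncovered_chin',
    'Mask_Mouth_Chin': 'uncovered_nose',
    'Mask_Chin': 'uncovered_nose_and_mouth',
}

def label_for(filename):
    """Return the sub-label for an image file name, or None if no marker matches."""
    for marker, label in MARKER_TO_LABEL.items():
        if marker in filename:
            return label
    return None

def group_by_label(root):
    """Collect file paths under root (at any depth) into one list per sub-label."""
    groups = defaultdict(list)
    for path in Path(root).rglob('*'):
        if path.is_file():
            label = label_for(path.name)
            if label is not None:
                groups[label].append(str(path))
    return groups
```

None of the three markers is a substring of another, so the order of the checks does not change the result.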


In [20]:
# Debug: percentage of each split list
len_chin = len(uncov_chin_paths)
len_nose = len(uncov_nose_paths)
len_nose_mouth = len(uncov_nose_mouth_paths)
summ = len_chin + len_nose + len_nose_mouth
print('Total IMFD images:', summ)
print('Percentage of uncovered chin:', len_chin *100 / summ)
print('Percentage of uncovered nose:', len_nose *100 / summ)
print('Percentage of uncovered nose and mouth:', len_nose_mouth *100 / summ)

Total IMFD images: 66734
Percentage of uncovered chin: 9.358048371145143
Percentage of uncovered nose: 83.39527077651572
Percentage of uncovered nose and mouth: 7.2466808523391375
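The lists are heavily imbalanced, so it is worth checking that every label can supply the fixed per-label sample. The sketch below reuses the counts printed above and the size constants defined earlier; `needed` and `shortfalls` are names introduced here for illustration.

```python
# Counts copied from the output above; sizes as defined at the top of the notebook.
TRAIN_SIZE, VAL_SIZE, TEST_SIZE = 2000, 400, 400
counts = {
    'uncovered_chin': 6245,
    'uncovered_nose': 55653,
    'uncovered_nose_and_mouth': 4836,
}

# Images required per label across the three splits.
needed = TRAIN_SIZE + VAL_SIZE + TEST_SIZE

# Labels that cannot supply enough images, with the number missing.
shortfalls = {label: needed - n for label, n in counts.items() if n < needed}

print('Needed per label:', needed)
print('Shortfalls:', shortfalls)
```

Since even the smallest list (4836 paths) exceeds the 2800 needed, taking the same number from each list gives balanced classes despite the imbalance in the source data.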


To avoid potential ordering biases, we shuffle the lists.

In [29]:
random.shuffle(uncov_chin_paths)
random.shuffle(uncov_nose_paths)
random.shuffle(uncov_nose_mouth_paths)
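The shuffle above gives a different order on every run. If a repeatable split is wanted, one option is a seeded generator, sketched below. `shuffled_copy` and the seed value are assumptions for illustration, not part of the original notebook.

```python
import random

def shuffled_copy(paths, seed=42):
    """Return a new list with the same paths in a seed-determined order."""
    # Sort first: os.listdir returns entries in arbitrary order, so the
    # input order may differ between runs or machines.
    result = sorted(paths)
    # A private Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(result)
    return result
```

With this helper, a call such as `uncov_chin_paths = shuffled_copy(uncov_chin_paths)` would give the same sample on every run, as long as the seed and the set of files stay the same.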

## Sampling each shuffled list into destination directories

Once we have a list of paths for each label, we sample from it and copy the sampled images to new directories. There is one directory per label: `uncovered_chin/` for `uncov_chin_paths`, `uncovered_nose/` for `uncov_nose_paths`, and `uncovered_nose_and_mouth/` for `uncov_nose_mouth_paths`.

There are three sets of these label directories, one each under `train/`, `validation/`, and `test/`, all located in `DEST_DIR`. `TRAIN_SIZE`, `VAL_SIZE`, and `TEST_SIZE` set how many paths from each list go to `train/`, `validation/`, and `test/` respectively.
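The copy step expects these directories to exist already. If they do not, a sketch like the one below could create the layout. `make_label_dirs` is a hypothetical helper, and the folder names are taken from the copy cell further down.

```python
import os

SPLITS = ['train', 'validation', 'test']
LABELS = ['uncovered_chin', 'uncovered_nose', 'uncovered_nose_and_mouth']

def make_label_dirs(dest_dir, splits=SPLITS, labels=LABELS):
    """Create dest_dir/<split>/<label> for every combination and return the paths."""
    created = []
    for split in splits:
        for label in labels:
            path = os.path.join(dest_dir, split, label)
            # exist_ok=True makes the call safe to repeat on an existing layout.
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created
```

Calling `make_label_dirs(DEST_DIR)` once before copying would yield nine label folders, three per split.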

First we need a function that extracts the file name from a full path. The file name is appended to the new directory path when copying.

In [23]:
def get_path_leaf(path):
  '''
  Taken from https://stackoverflow.com/questions/8384737/extract-file-name-from-path-no-matter-what-the-os-path-format
  '''
  head, tail = ntpath.split(path)
  return tail or ntpath.basename(head)

In [31]:
get_path_leaf(uncov_chin_paths[0])

'04544_Mask_Nose_Mouth.jpg'
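To illustrate why `ntpath` is used here rather than a plain basename call, the sketch below repeats the function and runs it on a few made-up paths. The paths are illustrative only and are not taken from the dataset layout.

```python
import ntpath
import os

def get_path_leaf(path):
    # Same logic as the cell above: fall back to the parent's basename
    # when the path ends with a separator.
    head, tail = ntpath.split(path)
    return tail or ntpath.basename(head)

# Hypothetical paths for illustration.
posix_style = 'IMFD/04000/04544_Mask_Nose_Mouth.jpg'
windows_style = 'IMFD\\04000\\04544_Mask_Nose_Mouth.jpg'
trailing_sep = 'IMFD/04000/'

print(get_path_leaf(posix_style))    # forward slashes are handled
print(get_path_leaf(windows_style))  # backslashes are handled too
print(get_path_leaf(trailing_sep))   # a trailing separator still yields the last component
print(os.path.basename(posix_style)) # enough when paths only ever use '/'
```

On Colab the paths are all built with `os.path.join` on a POSIX system, so `os.path.basename` would likely suffice; the `ntpath` version is just more tolerant of mixed separators.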

Next we define a function that copies every file in a list of paths into a single destination directory. It returns `True` if every copy succeeds; on the first error it prints the exception, stops copying, and returns `False`.

In [37]:
def copy_list_of_files(list_of_paths, dest_dir):
  try:
    for img in list_of_paths:
      name = get_path_leaf(img)
      shutil.copy2(img, os.path.join(dest_dir, name))
    return True
  except Exception as err:
    print(err)
    return False
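Because the function above stops at the first error, one bad path can leave the remaining files uncopied. A possible variant, sketched here under the assumption that partial progress is preferable, keeps going and reports which files failed. `copy_files_collect_errors` is a hypothetical name, and creating the destination directory is an addition the original does not make.

```python
import os
import shutil

def copy_files_collect_errors(paths, dest_dir):
    """Copy each path into dest_dir; return a list of (path, error) for failures."""
    # Assumption: it is acceptable to create dest_dir when it is missing.
    os.makedirs(dest_dir, exist_ok=True)
    failed = []
    for src in paths:
        try:
            shutil.copy2(src, os.path.join(dest_dir, os.path.basename(src)))
        except OSError as err:
            # Record the failure and continue with the next file.
            failed.append((src, err))
    return failed
```

An empty return value means every file was copied.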

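Because the lists are already shuffled, we can simply slice consecutive, non-overlapping ranges from each list: the first `TRAIN_SIZE` paths for training, the next `VAL_SIZE` for validation, and the next `TEST_SIZE` for testing. Each subset is then copied to its corresponding directory using the function defined above. The training lines are commented out in the cell below, so this run copies only the validation and test subsets; the counts in the Debug section show the `train/` folders already populated.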

In [44]:
LABELS = ['uncovered_chin', 'uncovered_nose', 'uncovered_nose_and_mouth']
lists = [uncov_chin_paths, uncov_nose_paths, uncov_nose_mouth_paths]
for i, lst in enumerate(lists):
  # train = lst[:TRAIN_SIZE]
  val = lst[TRAIN_SIZE : TRAIN_SIZE+VAL_SIZE]
  test = lst[TRAIN_SIZE+VAL_SIZE : TRAIN_SIZE+VAL_SIZE+TEST_SIZE]
  label = LABELS[i]

  # train_dest = os.path.join(DEST_DIR, 'train', label)
  val_dest = os.path.join(DEST_DIR, 'validation', label)
  test_dest = os.path.join(DEST_DIR, 'test', label)

  # print(f'Copying to \"train/{label}\" success:', copy_list_of_files(train, train_dest))
  print(f'Copying to \"validation/{label}\" success:', copy_list_of_files(val, val_dest))
  print(f'Copying to \"test/{label}\" success:', copy_list_of_files(test, test_dest))

Copying to "validation/uncovered_chin" success: True
Copying to "test/uncovered_chin" success: True
Copying to "validation/uncovered_nose" success: True
Copying to "test/uncovered_nose" success: True
Copying to "validation/uncovered_nose_and_mouth" success: True
Copying to "test/uncovered_nose_and_mouth" success: True
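The slicing used in the cell above can be isolated into a small function, which makes the non-overlap easier to check. This is a sketch: `split_paths` is a hypothetical helper, and the length check is an addition, since plain slicing would silently return short lists.

```python
def split_paths(paths, train_size, val_size, test_size):
    """Slice paths into consecutive, non-overlapping train/val/test lists."""
    total = train_size + val_size + test_size
    if len(paths) < total:
        raise ValueError(f'need {total} paths, got {len(paths)}')
    train = paths[:train_size]
    val = paths[train_size:train_size + val_size]
    test = paths[train_size + val_size:total]
    return train, val, test

# Tiny illustration with integers standing in for paths.
train, val, test = split_paths(list(range(10)), 5, 3, 2)
print(train, val, test)  # [0, 1, 2, 3, 4] [5, 6, 7] [8, 9]
```

Paths beyond the first `total` entries are left unused, which is how the large "uncovered nose" list gets reduced to the same size as the others.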


## Debug

In [50]:
# Debug file count in each folder
summ = 0
for i in os.listdir(DEST_DIR):
  for j in os.listdir(os.path.join(DEST_DIR, i)):
    length = len(os.listdir(os.path.join(DEST_DIR, i, j)))
    summ += length
    print(i, j, length)

print(summ)

train correctly_masked 0
train uncovered_chin 2000
train uncovered_nose 2000
train uncovered_nose_and_mouth 2000
train no_mask 0
validation uncovered_chin 400
validation uncovered_nose 400
validation correctly_masked 0
validation uncovered_nose_and_mouth 400
validation no_mask 0
test uncovered_chin 400
test correctly_masked 0
test uncovered_nose 400
test uncovered_nose_and_mouth 400
test no_mask 0
8400


In [51]:
# Debug: count unique file names across all folders to check for duplicates
all_set = set()
for i in os.listdir(DEST_DIR):
  for j in os.listdir(os.path.join(DEST_DIR, i)):
    all_set = set.union(all_set, set(os.listdir(os.path.join(DEST_DIR, i, j))))

print(len(all_set))

8400
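The unique-name count equals the total file count (8400, i.e. 3 labels × 2800 images), so no file name appears twice. A more direct way to look for leakage between splits is to intersect their name sets, sketched below. `names_by_split` and `shared_names` are hypothetical helpers, and the check assumes the `DEST_DIR/<split>/<label>/` layout used above.

```python
import os
from itertools import combinations

def names_by_split(dest_dir):
    """Map each split folder to the set of file names under its label folders."""
    result = {}
    for split in os.listdir(dest_dir):
        split_dir = os.path.join(dest_dir, split)
        if not os.path.isdir(split_dir):
            continue
        names = set()
        for label in os.listdir(split_dir):
            label_dir = os.path.join(split_dir, label)
            if os.path.isdir(label_dir):
                names.update(os.listdir(label_dir))
        result[split] = names
    return result

def shared_names(split_names):
    """Return {(split_a, split_b): common names} for every pair of splits that overlaps."""
    overlaps = {}
    for a, b in combinations(sorted(split_names), 2):
        common = split_names[a] & split_names[b]
        if common:
            overlaps[(a, b)] = common
    return overlaps
```

An empty result from `shared_names(names_by_split(DEST_DIR))` would indicate that no image was copied into more than one split.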
