### Human Detection Script

In [1]:
from imageai.Detection import ObjectDetection
import os
from PIL import Image
from pathlib import Path
import pickle
import pandas as pd

pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_colwidth', 500)

def get_human_ratio(image, detector):
    """Return the fraction of the image area covered by each detected person."""
    human_ratios = []

    # Image area in pixels (width * height)
    with Image.open(image) as im:
        image_size = im.size[0] * im.size[1]

    # Each detection is a dict with 'name', 'percentage_probability' and 'box_points'
    detections = detector.detectObjectsFromImage(input_image=str(image))
    for detection in detections:
        if detection['name'] == 'person':
            # Box points - [x1, y1, x2, y2]
            human_box = detection['box_points']
            human_size = (human_box[2] - human_box[0]) * (human_box[3] - human_box[1])
            human_ratios.append(human_size / image_size)
    return human_ratios

Using TensorFlow backend.
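
As a quick check of the area arithmetic in `get_human_ratio`, the sketch below applies the same formula to a single made-up detection. The `box_ratio` helper and all values are hypothetical and only illustrate the calculation; no model is loaded.

```python
def box_ratio(box_points, image_width, image_height):
    """Fraction of the image area covered by one [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box_points
    return ((x2 - x1) * (y2 - y1)) / (image_width * image_height)

# Hypothetical detection in a 400 x 300 image (values are made up).
detection = {
    "name": "person",
    "percentage_probability": 91.2,
    "box_points": [100, 50, 300, 250],
}

ratio = box_ratio(detection["box_points"], 400, 300)
print(round(ratio, 4))  # a 200 x 200 box in a 400 x 300 image
```

Here the box covers 40,000 of 120,000 pixels, so the ratio is one third.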


### Loading of Object Detection Model

Detecting humans in the images requires a trained object detection model. The RetinaNet model used here can be downloaded from the following link:
https://github.com/OlafenwaMoses/ImageAI/releases/download/1.0/resnet50_coco_best_v2.0.1.h5

In [None]:
execution_path = os.getcwd()
detector = ObjectDetection()
detector.setModelTypeAsRetinaNet()
detector.setModelPath(os.path.join(execution_path, "resnet50_coco_best_v2.0.1.h5"))
detector.loadModel()

### Detect humans in all images
The list `detection_list` stores each image's path together with the total fraction of the image area covered by detected humans.

In [None]:
data_path = 'data/images'
data_path_full = (Path().resolve().parents[1] / data_path).resolve()
folders = data_path_full.glob("*/*")
detection_list = []

for folder in folders:
    files = folder.resolve().glob("*.*")
    for file in list(files):
        ratio = sum(get_human_ratio(file, detector))
        detection_list.append([str(file), ratio])
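
Because the ratios of all detected persons are summed, overlapping boxes are counted twice, so the total can in principle exceed 1. Whether this happens in this dataset is not shown here. The sketch below is one possible alternative, not the method used above: a hypothetical `union_ratio` helper that measures the covered area with a boolean pixel mask, assuming integer pixel coordinates.

```python
import numpy as np

def union_ratio(boxes, image_width, image_height):
    """Fraction of the image covered by at least one [x1, y1, x2, y2] box."""
    mask = np.zeros((image_height, image_width), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = True  # overlapping pixels are only marked once
    return mask.sum() / mask.size

# Two made-up, overlapping boxes in a 100 x 100 image.
boxes = [[0, 0, 60, 100], [40, 0, 100, 100]]

summed = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes) / (100 * 100)
covered = union_ratio(boxes, 100, 100)
print(summed, covered)
```

In this toy case the summed ratio is larger than the image itself, while the union stays within the range from 0 to 1.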

### Saving and loading list

In [54]:
with open("detection_list.pkl", "wb") as f:
    pickle.dump(detection_list, f)

In [2]:
with open("detection_list.pkl", "rb") as fp:   # Unpickling
    detection_list2 = pickle.load(fp)

### Understanding the image composition
Training a good model requires data that is both sufficient and clean. In this section, we aim to strike a balance between the quantity and the quality of the images.

In [79]:
import pandas as pd
import pickle
    
filtered_samples = pd.DataFrame(detection_list2, columns=["File", "Detection_Pct"])

def replace_folder(str_list, dest = 'filtered'):
    # Remove the train/val/test folder
    del str_list[-3]
    # Replace the 'images' folder with the destination folder
    str_list[-3] = dest
    new_path = '\\'.join(str_list)
    return new_path

filtered_samples['NewPath'] = filtered_samples.File.str.split('\\').apply(replace_folder)
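
The path rewriting above depends on Windows-style backslash separators. As a sketch only, the same two steps can be written with `pathlib`. The `replace_folder_pathlib` helper and the example path are hypothetical, and `PureWindowsPath` is used so the example behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

def replace_folder_pathlib(path_str, dest="filtered"):
    """Drop the train/val/test folder and swap the 'images' folder for dest."""
    parts = list(PureWindowsPath(path_str).parts)
    del parts[-3]     # remove the train/val/test folder
    parts[-3] = dest  # replace the 'images' folder
    return str(PureWindowsPath(*parts))

# Hypothetical path following the data/images/<split>/<category>/<file> layout.
example = r"C:\project\data\images\train\Kiosk\img_001.jpg"
print(replace_folder_pathlib(example))
```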

In [80]:
filtered_samples.loc[:,'Category'] = filtered_samples.File.apply(lambda x: x.split('\\')[-2])
filtered_samples.columns

Index(['File', 'Detection_Pct', 'NewPath', 'Category'], dtype='object')

### Number of images per Class at each Detection_Pct Bin

In [98]:
import numpy as np
import matplotlib.pyplot as plt
bins = [-0.1,0,0.05,0.1,0.15,0.2,0.3,0.4,0.5,1]

filtered_samples['Bin'] = pd.cut(filtered_samples.Detection_Pct,bins=bins, include_lowest=True)
filtered_samples.head()
summary_pivot = filtered_samples.drop(columns=['File','NewPath']).pivot_table(index='Bin', columns='Category',aggfunc=['count']).fillna(0)
summary_pivot.sort_values(by=['Bin'], ascending=True).astype('float').cumsum().droplevel(axis='columns', level=[0,1])

| Bin | Borehole - Mechanized | Borehole - Mechanized with diesel | Bucket | Hand Pump | Hand Pump - Afridev | Hand Pump - India Mark II | Hand Pump - Vergnet | Kiosk | Other | Protected Spring | Tapstand |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (-0.101, 0.0] | 63.0 | 3.0 | 624.0 | 686.0 | 677.0 | 674.0 | 624.0 | 845.0 | 6.0 | 236.0 | 874.0 |
| (0.0, 0.05] | 68.0 | 5.0 | 675.0 | 750.0 | 760.0 | 737.0 | 693.0 | 877.0 | 6.0 | 250.0 | 917.0 |
| (0.05, 0.1] | 72.0 | 5.0 | 725.0 | 795.0 | 791.0 | 786.0 | 761.0 | 907.0 | 6.0 | 268.0 | 940.0 |
| (0.1, 0.15] | 75.0 | 5.0 | 791.0 | 830.0 | 851.0 | 833.0 | 820.0 | 930.0 | 6.0 | 278.0 | 952.0 |
| (0.15, 0.2] | 76.0 | 5.0 | 847.0 | 866.0 | 887.0 | 873.0 | 862.0 | 947.0 | 8.0 | 284.0 | 958.0 |
| (0.2, 0.3] | 78.0 | 6.0 | 930.0 | 927.0 | 939.0 | 935.0 | 927.0 | 963.0 | 8.0 | 304.0 | 974.0 |
| (0.3, 0.4] | 80.0 | 6.0 | 974.0 | 957.0 | 970.0 | 972.0 | 960.0 | 977.0 | 10.0 | 314.0 | 983.0 |
| (0.4, 0.5] | 81.0 | 6.0 | 986.0 | 977.0 | 987.0 | 988.0 | 976.0 | 985.0 | 10.0 | 318.0 | 992.0 |
| (0.5, 1.0] | 82.0 | 6.0 | 1000.0 | 997.0 | 1000.0 | 998.0 | 999.0 | 999.0 | 10.0 | 324.0 | 999.0 |

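The values in this table are cumulative: each row counts the images of a category whose `Detection_Pct` falls in that bin or any lower one. The following sketch reproduces the idea on a small made-up DataFrame, using `groupby` instead of `pivot_table`. The toy data is hypothetical; the bin edges are the ones used above.

```python
import pandas as pd

# Made-up detection ratios for two categories.
toy = pd.DataFrame({
    "Category": ["Kiosk", "Kiosk", "Kiosk", "Other", "Other"],
    "Detection_Pct": [0.0, 0.03, 0.45, 0.0, 0.35],
})
bins = [-0.1, 0, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 1]
toy["Bin"] = pd.cut(toy["Detection_Pct"], bins=bins, include_lowest=True)

# Count images per (bin, category), then accumulate from the lowest bin upwards.
counts = (toy.groupby(["Bin", "Category"], observed=False)
             .size()
             .unstack(fill_value=0))
cumulative = counts.sort_index().cumsum()
print(cumulative)
```

The last row of `cumulative` equals the total number of images per category, which is a convenient sanity check.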

### Image Removal Decision
The table shows that in the categories with the fewest images, "Borehole - Mechanized with diesel" and "Other", humans never occupy more than 40% of any image.

Removing only the images in which detected humans cover more than __40% of the image__ therefore leaves these two categories untouched, while retaining sufficient training images for the other categories (7,203 of the 7,414 images overall).

In [99]:
filtered_samples = filtered_samples[filtered_samples.Detection_Pct < 0.4]
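
To compare candidate thresholds before committing to one, it can help to look at the share of images each category would keep. The sketch below does this on made-up data. The `retained_fraction` helper and the toy values are hypothetical, and the helper uses the same strict `<` comparison as the filter above.

```python
import pandas as pd

def retained_fraction(df, threshold):
    """Share of images per category with Detection_Pct below the threshold."""
    total = df.groupby("Category").size()
    kept = df[df["Detection_Pct"] < threshold].groupby("Category").size()
    # Categories that lose every image would be missing from `kept`.
    return kept.reindex(total.index, fill_value=0) / total

# Made-up detection ratios for two categories.
toy = pd.DataFrame({
    "Category": ["Kiosk"] * 4 + ["Other"] * 2,
    "Detection_Pct": [0.0, 0.1, 0.45, 0.6, 0.0, 0.35],
})

for threshold in (0.2, 0.4, 0.5):
    print(threshold, retained_fraction(toy, threshold).to_dict())
```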

### Putting the images into a new folder

In [100]:
import shutil

for index, row in filtered_samples.iterrows():
    # Create data/filtered/<category> (and any missing parent folders) if needed
    Path(row['NewPath']).parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(row['File'], row['NewPath'])

### Train / Val / Test Split
The split-folders package makes it easy to split the images into train/val/test datasets for Keras model training.

This library can be installed using the following command:
```
pip install split-folders
```
Detailed documentation can be found in the GitHub repository:
https://github.com/jfilter/split-folders

I chose a 65% / 15% / 20% train/val/test split so that the validation and test sets are not too small, given how few images the "Borehole - Mechanized with diesel" and "Other" categories contain.

In [101]:
# Note: newer releases of split-folders expose this module as `splitfolders`
import split_folders

split_folders.ratio('../../data/filtered', output="../../data/filtered_split", seed=1337, ratio=(.65, .15, .2))

Copying files: 7203 files [00:14, 499.12 files/s]
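
To see what these ratios mean for the smallest categories, the sketch below multiplies the image counts remaining after filtering (6 for "Borehole - Mechanized with diesel" and 10 for "Other", taken from the cumulative table) by each ratio. It assumes the split is applied per class folder. The `expected_split_sizes` helper is hypothetical, and the exact integer counts depend on how the library rounds, which is not reproduced here.

```python
def expected_split_sizes(n_images, ratio=(0.65, 0.15, 0.2)):
    """Unrounded train/val/test sizes for a class with n_images images."""
    return tuple(n_images * r for r in ratio)

small_classes = {
    "Borehole - Mechanized with diesel": 6,
    "Other": 10,
}

for name, n_images in small_classes.items():
    train, val, test = expected_split_sizes(n_images)
    print(f"{name}: train~{train:.1f}, val~{val:.1f}, test~{test:.1f}")
```

With so few images, the validation and test sets of these two classes contain only one or two images each, so per-class metrics for them should be read with caution.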
