# Extra Notebook 1. Extract raw image data

This Notebook is responsible for:
- extracting objects and bounding boxes from the raw images
- cleaning up detections registered in error
- exporting a dataset with all detections for further analysis (into a parquet file)

In [2]:
# import ConfigImports Notebook to import and configure libs
%run ../Config/ConfigImports.ipynb

TF -> Using GPU ->  PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')


### Prepare dates for images

In [21]:
# scan directory with images and find all folders corresponding to dates
days_recorded = np.array(os.listdir(CONFIG['IMG_BASE_DIR']))
days_recorded.sort()
len(days_recorded)

273

In [24]:
# create a DataFrame and remove dates outside of the scope for this research
df = pd.DataFrame({'date': days_recorded.tolist()})
df['date_dt'] = pd.to_datetime(df['date'])
START_DATE, END_DATE = '2019-09-09', '2020-03-02'
df = df.loc[(df['date_dt'] >= START_DATE) & (df['date_dt'] <= END_DATE)]

# extract unique dates
dates_found = df['date'].unique().tolist()
n_dates = len(dates_found)
n_dates

176

### Load NN Yolo model

In [26]:
options = {
    'model': 'cfg/yolov2.cfg',
    'load': 'weights/yolov2.weights',
    'threshold': 0.4,
    'gpu': 0.5,
    'verbalise': True
}
tfnet = TFNet(options);

Parsing ./cfg/yolov2.cfg
Parsing cfg/yolov2.cfg
Loading weights/yolov2.weights ...
Successfully identified 203934260 bytes
Finished in 0.008756637573242188s

Building net ...
Source | Train? | Layer description                | Output size
-------+--------+----------------------------------+---------------
       |        | input                            | (?, 608, 608, 3)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 608, 608, 32)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 304, 304, 32)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 304, 304, 64)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 152, 152, 64)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 152, 152, 128)
 Load  |  Yep!  | conv 1x1p0_1  +bnorm  leaky      | (?, 152, 152, 64)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 152, 152, 128)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 76, 76, 128)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | 

### Process images

This section addresses a significant data quality problem in the dataset.

The problem occurs when the background subtraction algorithm detects motion that is actually a False Positive (for example, wind moving tree branches). Normally such an image would be discarded, but if a car is parked outside the house, the detector picks it up and the image is therefore saved as a valid observation.

The code below detects these False Positives using a simple Computer Vision rule: predict labels for an image and ignore any cars parked in front of the house; an image with no remaining objects is rejected. This simple approach works quite well, but a more elegant solution should be identified as a future exercise.

This check has since been added to the real-time data processing (as a *Custom Logic* step), so predictions from 26 November 2019 onward no longer suffer from this problem.

The code section below needs over five hours to complete, as it is essentially running Yolo detection again for all the collected images along with some custom logic.

The goal is to create a set of csv files containing filenames of good images and bad images (which can be deleted as they have been registered erroneously).
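The rule described above can be sketched as a filter over the list returned by `tfnet.return_predict`. This is only an illustration and not the notebook's `draw_boxes` helper, which is defined elsewhere and not shown here. The `PARKED_ZONE` rectangle, the function names and the 0.8 overlap threshold are all hypothetical; the dictionary keys (`label`, `confidence`, `topleft`, `bottomright`) follow darkflow's prediction format.

```python
# Hypothetical parking zone in resized-image pixels: (x1, y1, x2, y2).
# The real coordinates depend on the camera view and are an assumption here.
PARKED_ZONE = (100, 300, 300, 420)


def overlap_ratio(box, zone):
    """Fraction of the box area that lies inside the zone (both x1, y1, x2, y2)."""
    ix1, iy1 = max(box[0], zone[0]), max(box[1], zone[1])
    ix2, iy2 = min(box[2], zone[2]), min(box[3], zone[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area > 0 else 0.0


def is_parked_car(pred, zone=PARKED_ZONE, min_overlap=0.8):
    """True when a 'car' detection sits almost entirely inside the parking zone."""
    if pred['label'] != 'car':
        return False
    box = (pred['topleft']['x'], pred['topleft']['y'],
           pred['bottomright']['x'], pred['bottomright']['y'])
    return overlap_ratio(box, zone) >= min_overlap


def filter_legit(results, zone=PARKED_ZONE):
    """Drop parked cars; an empty result marks the frame as a False Positive."""
    return [p for p in results if not is_parked_car(p, zone)]


# Made-up example: one car inside the zone plus one person walking by
results = [
    {'label': 'car', 'confidence': 0.52,
     'topleft': {'x': 120, 'y': 310}, 'bottomright': {'x': 280, 'y': 400}},
    {'label': 'person', 'confidence': 0.86,
     'topleft': {'x': 463, 'y': 55}, 'bottomright': {'x': 537, 'y': 263}},
]
legit = filter_legit(results)  # only the person remains
```

A frame containing nothing but the parked car would produce an empty list and end up in the rejected set, which mirrors the `else` branch of the processing loop.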

In [20]:
im_idx = 0
# loop through all dates and print progress (using tqdm package)
for i in tqdm(range(0, n_dates), ascii=True):
    # initialize daily variables
    dataset = []
    dataset_rejected = []
    d = dates_found[i]
    # set up paths
    images_path = CONFIG['IMG_BASE_DIR'] + '/' + d
    images_found = os.listdir(images_path)
    # iterate through images in a given day
    for img in images_found:
        objects_detected = img.split('_')[-1].replace('.jpg', '')
        time_of_event = img.split('_')[0][:8]
        im_path = images_path + '/' + img
        img_orig = cv2.imread(im_path)
        if img_orig is None:
            print('Unable to open file {}'.format(im_path))
            continue
        img_sm = resize(img_orig.copy(), width=608, height=608)
        # make predictions
        results = tfnet.return_predict(img_sm)
        # verify if valid objects were found
        im_with_boxes, legit_boxes_info = draw_boxes(img_sm, results)
        if len(legit_boxes_info) > 0:
            for b in legit_boxes_info:
                b.insert(0, im_idx)
                b.append(d)
                b.append(time_of_event)
                b.append(img)
                b.append(len(legit_boxes_info))
                dataset.append(b)
            im_idx += 1
        else:
            dataset_rejected.append([d, time_of_event, img])
    # store daily csv file with good predictions
    if len(dataset) > 0:
        # to_csv returns None when given a path, so the result is not assigned
        pd.DataFrame(dataset, columns=['img_idx', 'label', 'confidence', 'x1', 'y1', 'x2', 'y2',
                                       'date', 'time', 'filename', 'img_n_boxes']).to_csv(
            './out/ok_{}.csv'.format(d), index=False)
    # store daily csv file with FP predictions
    if len(dataset_rejected) > 0:
        pd.DataFrame(dataset_rejected, columns=['date', 'time', 'filename']).to_csv(
            './out/rejected_{}.csv'.format(d), index=False)
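The loop above relies on the image filename convention `<HH.MM.SS.mmm>_<id>_<label-label-...>.jpg`. As a minimal sketch, the same parsing can be isolated in a small helper that is easy to check on its own. The name `parse_image_name` is hypothetical, and the sketch assumes that the middle token contains no underscore.

```python
from datetime import datetime


def parse_image_name(date_str, filename):
    """Split '<HH.MM.SS.mmm>_<id>_<label-label...>.jpg' into timestamp and labels."""
    stem = filename.rsplit('.', 1)[0]  # drop the '.jpg' extension
    time_part, _, labels_part = stem.split('_', 2)
    # %f accepts 1 to 6 digits, so '921' is read as 921000 microseconds
    timestamp = datetime.strptime(
        '{} {}'.format(date_str, time_part), '%Y-%m-%d %H.%M.%S.%f')
    return timestamp, labels_part.split('-')


ts, labels = parse_image_name('2019-09-09', '12.02.42.921_ea6c9143_person-bicycle.jpg')
```

Here `ts` holds the full event time including milliseconds, and `labels` holds the object names that the real-time pipeline encoded in the filename.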

### Delete False Positive images from SSD

In [25]:
# find all csv files starting with rej*
rejected_files = glob.glob('./out/rej*')
rejected_files.sort()
df_rej = pd.concat([pd.read_csv(f) for f in rejected_files])
# remove images
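for index, row in df_rej.iterrows():
    # im_root is not defined in this notebook; the image base directory from CONFIG is used instead
    f_path = CONFIG['IMG_BASE_DIR'] + '/' + row['date'] + '/' + row['filename']
    try:
        os.unlink(f_path)
    except FileNotFoundError:
        pass  # ignore files that were already deleted in a previous run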
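Because deletion is irreversible, it can help to wrap this step in a function that reports how many files were actually removed. The sketch below is one possible variant using `pathlib`; the name `delete_rejected` and its arguments are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def delete_rejected(base_dir, rows):
    """Delete each (date, filename) pair under base_dir; return the number removed."""
    removed = 0
    for date, filename in rows:
        path = Path(base_dir) / date / filename
        try:
            path.unlink()
            removed += 1
        except FileNotFoundError:
            # the file (or its date folder) is already gone
            continue
    return removed
```

With the DataFrame above it could be called as `delete_rejected(CONFIG['IMG_BASE_DIR'], zip(df_rej['date'], df_rej['filename']))`. Only missing files are skipped, so other errors such as permission problems still surface.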

### Gather good images and create a DataFrame

Each row below represents a single object detected in a frame, so a single frame may be represented by multiple records if more than one object was detected in the image.

In [35]:
# find all csv files starting with ok*
ok_files = glob.glob('./out/ok*')
ok_files.sort()
# put in a dataframe and show top 5 results
df = pd.concat([pd.read_csv(f) for f in ok_files]).reset_index()
df.head()

| | index | img_idx | label | confidence | x1 | y1 | x2 | y2 | date | time | filename | img_n_boxes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 72846 | car | 0.523175 | 298 | 7 | 426 | 71 | 2019-09-09 | 07.02.40 | 07.02.40.270_34c99836_car-car-car.jpg | 1 |
| 1 | 1 | 72847 | person | 0.759682 | 489 | 31 | 518 | 106 | 2019-09-09 | 12.02.42 | 12.02.42.921_ea6c9143_person-bicycle.jpg | 2 |
| 2 | 2 | 72847 | bicycle | 0.532076 | 444 | 54 | 484 | 100 | 2019-09-09 | 12.02.42 | 12.02.42.921_ea6c9143_person-bicycle.jpg | 2 |
| 3 | 3 | 72848 | person | 0.864749 | 463 | 55 | 537 | 263 | 2019-09-09 | 07.30.02 | 07.30.02.409_c5662b14_person-car-car.jpg | 1 |
| 4 | 4 | 72849 | car | 0.859297 | 302 | 23 | 410 | 73 | 2019-09-09 | 20.26.56 | 20.26.56.841_4ba2f42d_car.jpg | 1 |


In [36]:
# verify how many results there are in total
df.shape

(657894, 12)

In [38]:
# verify a count for each class type
df['label'].value_counts()

person       333894
car          238556
truck         54166
dog           14692
bicycle        7241
cat            5026
bird           2351
motorbike      1968
Name: label, dtype: int64

In [39]:
# add date time related dimensions
df['time_ms'] = df['filename'].str[9:12]
# regex=False makes '.' a literal dot rather than a regex wildcard
df['date_time'] = pd.to_datetime((df['date'] + ' ' + df['time'].str.replace('.', ':', regex=False) + '.' + df['time_ms'] + '00'),
                                 format="%Y-%m-%d %H:%M:%S.%f")
df['week_day'] = df['date_time'].dt.day_name()
df['is_weekend'] = df['week_day'].isin(['Saturday', 'Sunday'])
df['month'] = df['date_time'].dt.month
df['hour'] = df['date_time'].dt.hour
df['min'] = df['date_time'].dt.minute

### Save into a Parquet file (efficient columnar format for large datasets)

In [41]:
# save into the efficient Parquet file
df.to_parquet('res/results_2019-09-09_2020-03-02.parquet.gzip', engine='fastparquet', compression='gzip')