# BiteMe | Preprocessing

This notebook builds the image preprocessing pipeline used at train and test time. Its output is a set of functions to include in the `preprocessing.py` script.

TODO: 
 - Preprocessing pipeline
 - Train/test split
 - Augmentations
 - Write augmented images into `preprocessed/train/<label>/...` and `preprocessed/test/<label>/...`
 - Write metadata that covers the processed images: write the augmented images first, rename them to their hash, then create the metadata.
 - [Histogram Equalization and Adaptive Histogram Equalization (CLAHE)](https://pyimagesearch.com/2021/02/01/opencv-histogram-equalization-and-adaptive-histogram-equalization-clahe/)
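The last TODO item links to an article on histogram equalization and CLAHE. As a rough illustration of the simpler of the two, here is a minimal NumPy sketch of global histogram equalization on an 8-bit grayscale image. The function name `equalize_histogram` is hypothetical and not part of `helpers.py`. The adaptive variant (CLAHE) works on local tiles instead and is available in OpenCV through `cv2.createCLAHE`; it is not reproduced here.

```python
import numpy as np


def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Globally equalize a 2-D uint8 image (hypothetical helper, not from helpers.py)."""
    if gray.dtype != np.uint8 or gray.ndim != 2:
        raise ValueError("expected a 2-D uint8 image")

    # Count how many pixels take each of the 256 possible intensities
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()

    # CDF value at the darkest intensity that actually occurs
    cdf_min = cdf[cdf > 0][0]
    n_pixels = gray.size
    if n_pixels == cdf_min:
        # Constant image: there is no contrast to stretch
        return gray.copy()

    # Build a lookup table that makes the output CDF roughly linear over [0, 255]
    lut = np.round((cdf - cdf_min) / (n_pixels - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]


# Low-contrast example: all values squeezed into [100, 103]
demo = np.array([[100, 101], [102, 103]], dtype=np.uint8)
equalized = equalize_histogram(demo)
```

For a color image, one option would be to equalize only a luminance channel rather than each RGB channel separately, but that choice is outside what this notebook specifies.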

In [1]:
import pandas as pd
import numpy as np
import os
import sys

import matplotlib.pyplot as plt
from tqdm import tqdm

import cv2
import albumentations as A
import imgaug as ia
import imgaug.augmenters as iaa

sys.path.append("..")
from helpers import read_images, get_train_test_split, get_augs
from constants import ROWS, COLS, CHANNELS, SEED, TEST_SIZE, VERBOSE

plt.rcParams["figure.figsize"] = (14, 8)

np.random.seed(SEED)
ia.seed(SEED)

In [2]:
# Define directories
base_dir_path = "../"

data_dir_path = os.path.join(base_dir_path, "data")
data_cleaned_dir_path = os.path.join(data_dir_path, "cleaned")
data_preprocessed_dir_path = os.path.join(data_dir_path, "preprocessed")

data_dir = os.listdir(data_dir_path)
data_cleaned_dir = os.listdir(data_cleaned_dir_path)

metadata_cleaned_path = os.path.join(data_cleaned_dir_path, "metadata.csv")
metadata = pd.read_csv(metadata_cleaned_path)

# Write processed images to disk
write_preprocessed_images = False

metadata.head()

| Unnamed: 0 | img_name | img_path | label |
|---|---|---|---|
| 0 | 7059b14d2aa03ed6c4de11afa32591995181d31c.jpg | ../data/cleaned/none/7059b14d2aa03ed6c4de11afa... | none |
| 1 | ea1b100b581fcdb7ddfae52cc62347a99e304ba4.jpg | ../data/cleaned/none/ea1b100b581fcdb7ddfae52cc... | none |
| 2 | 1a1442990ff143b7560e5757d9f76d37ab007f48.jpg | ../data/cleaned/none/1a1442990ff143b7560e5757d... | none |
| 3 | 6eac051b9c45ff6821ec8675216f371711b7cea9.jpg | ../data/cleaned/none/6eac051b9c45ff6821ec86752... | none |
| 4 | fc72767f8520df9b2b83941077dc0ee013eb9399.jpg | ../data/cleaned/none/fc72767f8520df9b2b8394107... | none |


## Split Data into Train/Test

In [3]:
# Split data into train and test
train_idx, test_idx, y_train, y_test = get_train_test_split(
    metadata_df=metadata, 
    test_size=TEST_SIZE,
    verbose=VERBOSE
)

192 train images
22 test images

TRAIN IMAGE COUNTS
------------------
tick        26
mosquito    25
horsefly    25
bedbug      25
none        25
ant         23
bee         22
mite        21
Name: label, dtype: int64

TEST IMAGE COUNTS
------------------
bedbug      3
tick        3
ant         3
horsefly    3
mosquito    3
none        3
mite        2
bee         2
Name: label, dtype: int64


In [4]:
# Re-write metadata csv for preprocessed
# WILL NEED TO UPDATE IF WE GENERATE SYNTHETIC IMAGES
metadata["split"] = "train"
# Use .loc to avoid chained assignment, which pandas may apply to a copy
metadata.loc[test_idx, "split"] = "test"

metadata_preprocessed_path = os.path.join(data_preprocessed_dir_path, "metadata.csv")
metadata.to_csv(metadata_preprocessed_path, index=False)

## Create Preprocessing Pipeline

In [5]:
img_array = read_images(
    data_dir_path=data_cleaned_dir_path, 
    rows=ROWS, 
    cols=COLS, 
    channels=CHANNELS, 
    write_images=False, 
    output_data_dir_path=None,
    verbose=VERBOSE
)

# Split images into train/test
X_train = img_array[train_idx]
X_test = img_array[test_idx]    

Reading images from: ../data/cleaned
Rows set to 512
Columns set to 512
Channels set to 3
Writing images is set to: False
Reading images...


100%|██████████| 28/28 [00:00<00:00, 186.15it/s]
100%|██████████| 29/29 [00:00<00:00, 95.31it/s]
100%|██████████| 23/23 [00:00<00:00, 66.95it/s]
100%|██████████| 28/28 [00:00<00:00, 52.07it/s]
100%|██████████| 28/28 [00:00<00:00, 41.46it/s]
100%|██████████| 24/24 [00:00<00:00, 35.05it/s]
100%|██████████| 28/28 [00:00<00:00, 30.29it/s]
100%|█

Image reading complete.
Image array shape: (214, 512, 512, 3)


In [6]:
if write_preprocessed_images:
    # Write each image to preprocessed/<split>/<label>/<img_name>
    for idx in tqdm(metadata.index):
        split = metadata.loc[idx, "split"]  # "train" or "test"
        label = metadata.loc[idx, "label"]
        # Create the split and label directories if they don't exist yet
        img_dir_path = os.path.join(data_preprocessed_dir_path, split, label)
        os.makedirs(img_dir_path, exist_ok=True)
        # Create img write path
        img_path_write = os.path.join(img_dir_path, metadata.loc[idx, "img_name"])
        # Write to the directory for this image's split
        cv2.imwrite(img_path_write, img_array[idx])
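The TODO list at the top mentions renaming written images to a hash before the metadata is created. The existing file names are 40 hexadecimal characters long, which matches the length of a SHA-1 digest, but the notebook does not say which hash is used or what is hashed. The sketch below assumes SHA-1 over the file's bytes; `hashed_name` and `rename_to_hash` are hypothetical helpers, and the demo runs on a throwaway file rather than on real image data.

```python
import hashlib
import os
import tempfile


def hashed_name(path: str) -> str:
    """Return '<sha1 of the file bytes><original extension>' for the file at path."""
    with open(path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    ext = os.path.splitext(path)[1].lower()
    return digest + ext


def rename_to_hash(path: str) -> str:
    """Rename a file in place to its content hash and return the new path."""
    new_path = os.path.join(os.path.dirname(path), hashed_name(path))
    os.replace(path, new_path)
    return new_path


# Demo on a temporary file (assumed name, not from the dataset)
with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, "aug_0001.jpg")
    with open(src, "wb") as f:
        f.write(b"not a real image")
    renamed = rename_to_hash(src)
    new_name = os.path.basename(renamed)
    still_exists = os.path.exists(renamed)
```

Hashing the content means two byte-identical images map to the same name, so exact duplicates within one label directory would overwrite each other instead of accumulating.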

## Run Preprocessing Pipeline

In [7]:
# Example preprocessing run
X_train_aug, y_train_aug, augs = get_augs(
    imgs_raw=X_train, 
    labels_raw=y_train,
    keep_originals=False,
    verbose=VERBOSE
)

Used augs: ['fliplr', 'flipud']
Created 192 augmentations.
Image array shape: (384, 512, 512, 3)
Labels array shape: (384,)
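The output above reports two augmentations, `fliplr` and `flipud`, turning 192 training images into 384 when `keep_originals=False`. The body of `get_augs` is not shown here, so the following is only a NumPy sketch of how two flips could produce that doubling. The function name `flip_augment` and the ordering of the output (all left-right flips, then all up-down flips) are assumptions.

```python
import numpy as np


def flip_augment(imgs, labels, keep_originals=False):
    """Apply horizontal and vertical flips to a batch shaped (N, H, W, C)."""
    flipped_lr = imgs[:, :, ::-1, :]  # mirror along the width axis
    flipped_ud = imgs[:, ::-1, :, :]  # mirror along the height axis

    img_parts = [flipped_lr, flipped_ud]
    label_parts = [labels, labels]
    if keep_originals:
        # Put the untouched images first so their positions are unchanged
        img_parts.insert(0, imgs)
        label_parts.insert(0, labels)

    return np.concatenate(img_parts, axis=0), np.concatenate(label_parts, axis=0)


# Small stand-in batch: 4 random images of 8x8 pixels with 3 channels
rng = np.random.default_rng(0)
imgs = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)
labels = np.array(["ant", "bee", "tick", "none"])
imgs_aug, labels_aug = flip_augment(imgs, labels, keep_originals=False)
```

With two flips and no originals, a batch of N images yields 2N augmented images, which is consistent with the `(384, 512, 512, 3)` shape reported above for 192 inputs.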
