This notebook prepares the images in the distracted driver dataset locally for upload to S3.

This notebook creates two versions of the data. The first is an LST mapping file, with the image files kept in JPEG format. The second is a binary RecordIO file.

# Import

## Install

In [188]:
!pip install checksumdir



## Library / Packages

In [189]:
import pandas as pd
import random 
import os
import shutil
import boto3

import hashlib
from checksumdir import dirhash

from filecmp import dircmp

## Data

In [190]:
url = "https://raw.githubusercontent.com/DSBA-6190-Final-Project-Team/DSBA-6190_Final-Project/master/wine_predict/data/driver_imgs_list.csv"
path_file = 'data/driver_imgs_list.csv'

df_driver_index = pd.read_csv(path_file) 

# EDA

In [191]:
df_driver_index.shape

(22424, 3)

In [192]:
df_driver_index.head(5)

Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [193]:
df_driver_index.columns

Index(['subject', 'classname', 'img'], dtype='object')

The following lists each unique driver along with the number of classname entries and images associated with each. Note that the classname count is a count of rows, not of unique classnames, so the classname and img columns have equal frequencies.

In [194]:
drivers_gb = df_driver_index.groupby(['subject'])
drivers_gb.count()

subject,classname,img
p002,725,725
p012,823,823
p014,876,876
p015,875,875
p016,1078,1078
p021,1237,1237
p022,1233,1233
p024,1226,1226
p026,1196,1196
p035,848,848
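
To see why the two columns match, it can help to compare `count()` with `nunique()` on a small frame. This is a minimal sketch on made-up data; the subjects and file names below are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical), shaped like driver_imgs_list.csv
toy = pd.DataFrame({
    "subject": ["p001", "p001", "p001", "p002", "p002"],
    "classname": ["c0", "c0", "c1", "c0", "c1"],
    "img": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
})

# count() tallies non-null rows per group, so every column shows the same number
row_counts = toy.groupby("subject").count()

# nunique() reports how many distinct values each column holds per group
distinct_counts = toy.groupby("subject").nunique()

print(row_counts)
print(distinct_counts)
```

With `nunique()`, the classname column would show how many distinct classes each driver appears in, which is probably the more informative figure for checking class coverage.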


We will set the training/validation split ratio, expressed as the fraction of drivers held out for validation.

In [195]:
train_val_split = 0.3

To split the data into training and validation sets, we'll need a unique index of drivers. We don't need the frequency counts.

We will create a shuffled list of unique drivers.


In [196]:
drivers_unique = drivers_gb.groups.keys()
drivers_unique = list(drivers_unique)

random.Random(5590).shuffle(drivers_unique)
print(drivers_unique)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042', 'p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']


Next, we'll set the list of drivers in the training set and in the validation set.

In [197]:
num_drivers_val = round(len(drivers_unique)*train_val_split)
#print(num_drivers_val)
num_drivers_train = len(drivers_unique) - num_drivers_val
#print(num_drivers_train)

In [198]:
drivers_val = drivers_unique[:num_drivers_val]
print(drivers_val)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042']


In [199]:
drivers_train = drivers_unique[-num_drivers_train:]
print(drivers_train)

['p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']
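
The shuffle-and-slice steps above can be wrapped in one helper, which makes the split easier to reuse and test. This is a sketch under assumptions: `split_groups` is a hypothetical name, the driver ids are generated for illustration, and sorting before shuffling is an added choice that the notebook itself does not make.

```python
import random

def split_groups(groups, val_fraction, seed):
    """Shuffle group ids with a fixed seed and split them into (val, train)."""
    shuffled = sorted(groups)  # assumption: sort first so input order cannot change the result
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    # Slice from n_val onward; the two slices cannot overlap
    return shuffled[:n_val], shuffled[n_val:]

drivers = ["p{:03d}".format(i) for i in range(26)]  # hypothetical ids
val, train = split_groups(drivers, val_fraction=0.3, seed=5590)

print(len(val), len(train))  # 8 18
assert set(val).isdisjoint(train)
```

Slicing with `[n_val:]` rather than a negative index also avoids an edge case: in Python, `some_list[-0:]` returns the whole list, so a split that leaves zero training groups would otherwise return every group as training data.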


# Training and Validation Image Lists
We'll now create two lists: one of every image file name associated with the training set, and another of every image file name associated with the validation set.

We will use these lists to filter the overall lst mapping file.

In [200]:
df_images_val = df_driver_index[df_driver_index['subject'].isin(drivers_val)]
df_images_training = df_driver_index[~df_driver_index['subject'].isin(drivers_val)]

In [201]:
print(df_images_training.shape)
df_images_training.head()

(15686, 3)


Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [202]:
print(df_images_val.shape)
df_images_val.head()

(6738, 3)


Unnamed: 0,subject,classname,img
725,p012,c0,img_10206.jpg
726,p012,c0,img_27079.jpg
727,p012,c0,img_50749.jpg
728,p012,c0,img_97089.jpg
729,p012,c0,img_37741.jpg
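
One way to confirm that the two frames partition the index is to check that their row counts add up to the total and that no driver appears in both. This is a minimal sketch on made-up data; the frame, the subject ids, and the variable names are illustrative.

```python
import pandas as pd

# Toy index (hypothetical) with three drivers
index = pd.DataFrame({
    "subject": ["p001", "p001", "p002", "p003", "p003"],
    "img": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
})
val_subjects = ["p002"]  # hypothetical validation drivers

# isin() expects the values to match against, here a list of driver ids
is_val = index["subject"].isin(val_subjects)
val_rows = index[is_val]
train_rows = index[~is_val]

# The two subsets should cover every row exactly once
assert len(val_rows) + len(train_rows) == len(index)
# No driver should appear on both sides of the split
assert set(val_rows["subject"]).isdisjoint(train_rows["subject"])

print(len(train_rows), len(val_rows))  # 4 1
```

Building one boolean mask and negating it guarantees that the two subsets are complements of each other.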


# Move Validation Images
We need to move the validation images into a set of validation folders. 

In [203]:
path_current = "D:\\Notebooks\\dsba_6190\\team_project\\image_classification_preprocessing"
prefix_training = "imgs\\train"
prefix_validation = "imgs\\validation"

We need to create a relative file path to each image. We will create this path for all of the images in the validation set. Then we will move them from the training folders to the validation folders.

In [204]:
# Work on an explicit copy so the new column is not assigned to a slice of df_driver_index
df_images_val = df_images_val.copy()
df_images_val['rel_path'] = df_images_val['classname'] + '\\' + df_images_val['img']
df_images_val.head(10)


Unnamed: 0,subject,classname,img,rel_path
725,p012,c0,img_10206.jpg,c0\img_10206.jpg
726,p012,c0,img_27079.jpg,c0\img_27079.jpg
727,p012,c0,img_50749.jpg,c0\img_50749.jpg
728,p012,c0,img_97089.jpg,c0\img_97089.jpg
729,p012,c0,img_37741.jpg,c0\img_37741.jpg
730,p012,c0,img_65697.jpg,c0\img_65697.jpg
731,p012,c0,img_3866.jpg,c0\img_3866.jpg
732,p012,c0,img_19098.jpg,c0\img_19098.jpg
733,p012,c0,img_31885.jpg,c0\img_31885.jpg
734,p012,c0,img_41423.jpg,c0\img_41423.jpg
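
Moving files by relative path can be sketched with `pathlib` and `shutil`, which build the separators for the current platform and can create missing class folders on the way. This is a sketch under assumptions: `move_images` is a hypothetical helper, and the demonstration runs in a temporary directory with a fake file rather than on the real images.

```python
import shutil
import tempfile
from pathlib import Path

def move_images(rel_paths, src_root, dst_root):
    """Move each relative path from src_root to dst_root, creating folders as needed."""
    moved = 0
    for rel in rel_paths:
        src = Path(src_root) / rel
        dst = Path(dst_root) / rel
        if not src.exists():
            continue  # already moved (or missing), so a re-run is harmless
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        moved += 1
    return moved

# Demonstration in a throwaway directory with a hypothetical file name
with tempfile.TemporaryDirectory() as tmp:
    train_root = Path(tmp) / "imgs" / "train"
    val_root = Path(tmp) / "imgs" / "validation"
    (train_root / "c0").mkdir(parents=True)
    (train_root / "c0" / "img_1.jpg").write_bytes(b"fake")

    rel = Path("c0") / "img_1.jpg"
    first = move_images([rel], train_root, val_root)
    second = move_images([rel], train_root, val_root)
    landed = (val_root / rel).exists()

print(first, second, landed)  # 1 0 True
```

Skipping sources that no longer exist makes the cell safe to run twice, which may remove the need to disable it by hand after the first run.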


In [205]:
os.path.join(path_current, prefix_training)

'D:\\Notebooks\\dsba_6190\\team_project\\image_classification_preprocessing\\imgs\\train'

In [206]:
'''
for index in df_images_val.index:
    path_train = os.path.join(path_current, prefix_training, df_images_val['rel_path'][index])
    path_val = os.path.join(path_current, prefix_validation, df_images_val['rel_path'][index])
    
    #path_train = path_train.replace('\\','/')
    #path_val = path_val.replace('\\','/')
    
    os.rename(path_train, path_val)
'''

## Verification

We will now check that each directory contains the expected number of files.

In [207]:
# Training
dir_train = os.path.join(path_current, prefix_training)

cpt = sum([len(files) for r, d, files in os.walk(dir_train)])
print("There are {} image files in the training dataset.".format(cpt))

There are 15686 image files in the training dataset.


In [208]:
# Validation
dir_val = os.path.join(path_current, prefix_validation)

cpt = sum([len(files) for r, d, files in os.walk(dir_val)])
print("There are {} image files in the validation dataset.".format(cpt))

There are 6738 image files in the validation dataset.
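
A total count confirms the split sizes but not the distribution across classes. A per-class breakdown could be sketched as follows; `count_files_by_class` is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory.

```python
import os
import tempfile
from collections import Counter

def count_files_by_class(root):
    """Return {class_folder: number_of_files} for the top-level subfolders of root."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == ".":
            continue  # skip files sitting directly in root
        top_level = rel.split(os.sep)[0]
        counts[top_level] += len(filenames)
    return dict(counts)

# Demonstration with hypothetical class folders and empty files
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["c0/a.jpg", "c0/b.jpg", "c1/c.jpg"]:
        path = os.path.join(tmp, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "wb").close()
    counts = count_files_by_class(tmp)

print(counts)
```

The resulting dictionary can be compared against `groupby('classname').size()` on the matching DataFrame to catch files that ended up in the wrong folder.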


# Process Flow

Some of the downstream processing steps are computationally heavy. We therefore want to skip these code chunks when their inputs have not changed, without having to comment them out manually after each run.

To check whether any of the input files have changed, which would require reprocessing the images, we will use hash checksums to record the state of the inputs. We will compare against this recorded state to determine whether the data needs to be reprocessed.

To do this, we will have to convert the processing steps into callable functions.
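
To illustrate what a directory hash captures, here is a standard-library sketch that folds every relative path and file's contents into a single SHA-256 digest. It is not the algorithm used by `checksumdir`, so its digests will not match those of `dirhash`; `hash_directory` is a hypothetical helper shown only to make the idea concrete.

```python
import hashlib
import os
import tempfile

def hash_directory(root):
    """Return a SHA-256 hex digest covering relative paths and file contents under root."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so the digest is reproducible
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            digest.update(rel.encode("utf-8"))
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large images are never loaded whole
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
    return digest.hexdigest()

# Demonstration: the digest is stable until a file changes
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "c0")
    os.makedirs(target)
    with open(os.path.join(target, "img.jpg"), "wb") as handle:
        handle.write(b"v1")
    before = hash_directory(tmp)
    unchanged = hash_directory(tmp)
    with open(os.path.join(target, "img.jpg"), "wb") as handle:
        handle.write(b"v2")
    after = hash_directory(tmp)

print(before == unchanged, before == after)  # True False
```

Including the relative path in the digest means that renaming or moving a file also changes the hash, not only editing its bytes.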

## im2rec Function Calls
The following functions use the im2rec.py tool to process the image inputs.

[https://mxnet.apache.org/api/faq/recordio](https://mxnet.apache.org/api/faq/recordio) (Accessed on 3/20/2020)

### LST File Generation
These functions perform the first step, creating LST mapping files.

#### Train

In [209]:
def im2rec_lst_train():
    %run tools/im2rec.py lst_files/train imgs/train --list --recursive 

#### Validation

In [210]:
def im2rec_lst_val():
    %run tools/im2rec.py lst_files/validation imgs/validation/ --list --recursive 
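
For reference, an LST file is plain text with one image per line. The sketch below parses such a file, assuming the tab-separated layout described in the MXNet RecordIO FAQ linked above (an integer index, one or more label values, then the image path relative to the root folder). The sample lines are made up, and `read_lst` is a hypothetical helper.

```python
import csv
import io

# Made-up sample in the assumed layout: index <TAB> label <TAB> relative path
sample_lst = (
    "0\t0.000000\tc0/img_100.jpg\n"
    "1\t1.000000\tc1/img_200.jpg\n"
)

def read_lst(handle):
    """Parse LST lines into dictionaries with index, labels and path."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # tolerate blank lines
        records.append({
            "index": int(row[0]),
            "labels": [float(value) for value in row[1:-1]],
            "path": row[-1],
        })
    return records

records = read_lst(io.StringIO(sample_lst))
print(records[0])
```

Reading the generated file back this way is a cheap check that every image was assigned the expected class label before the slower conversion step runs.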

### Recordio Conversion
These functions convert the images into the binary RecordIO file format.

#### Train

In [211]:
def im2rec_rec_train():
    %run tools/im2rec.py "lst_files\train.lst" "imgs\train" 

#### Validation

In [212]:
def im2rec_rec_val():
    %run tools/im2rec.py "lst_files\validation.lst" "imgs\validation" 
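
The four wrappers above differ only in their arguments, and `%run` works only inside IPython. One possible alternative is to build the command as a list and hand it to `subprocess.run`. This is a sketch under assumptions: `build_im2rec_command` is a hypothetical helper, the script path and the `--list` and `--recursive` flags are the ones used in the cells above, and the command is only printed here, not executed.

```python
import sys

def build_im2rec_command(prefix, root, make_list, script="tools/im2rec.py"):
    """Return the argument list for one im2rec.py call."""
    command = [sys.executable, script, prefix, root]
    if make_list:
        # Same flags as the LST generation cells above
        command += ["--list", "--recursive"]
    return command

lst_command = build_im2rec_command("lst_files/train", "imgs/train", make_list=True)
rec_command = build_im2rec_command("lst_files/train.lst", "imgs/train", make_list=False)

print(lst_command)
print(rec_command)

# To execute a command for real (not done here):
#   import subprocess
#   subprocess.run(lst_command, check=True)
```

Passing the arguments as a list avoids shell quoting problems, and forward slashes in the paths are accepted on Windows as well as on Linux and macOS.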

## Writing Hash Checksum
The following function generates the hash of a directory (training or validation images) and writes it to a CSV file.

In [213]:
def write_hash(status):
    # Generate File Name, Path and Directory
    file_name = "{}_imgs_hash.csv".format(status)
    file_path = os.path.join("hash", file_name)
    file_dir = os.path.join('imgs', status)
    
    # Generate Hash
    dir_hash = dirhash(file_dir, 'sha256')
    
    # Write to CSV
    dict_dir_hash = {'hash': [dir_hash]}
    df_dir_hash = pd.DataFrame(dict_dir_hash)
    df_dir_hash.to_csv(file_path, index=False)
    print("Hash for {} images successfully written to CSV.".format(status))
    print()
    

## Combine Processing and Writing Hash
Whenever we process the input data we need to write a new hash, and vice versa. Therefore, for simplicity in the final code, we will lump these actions together.


In [214]:
def im2rec_and_write(status):
    if status == "train":
        print("Generating LST File: {}".format(status))
        im2rec_lst_train()
        print("Generating REC file: {}".format(status))
        im2rec_rec_train()
        print("REC File complete.")
        write_hash(status)
    else:
        print("Generating LST File: {}".format(status))
        im2rec_lst_val()
        print("Generating REC file: {}".format(status))
        im2rec_rec_val()
        print("REC File complete.")
        write_hash(status)
        

## Process Flow Function
The following function checks whether the hash of the current files matches the stored record. If they match, no action is taken. If they do not match, or if no record exists yet, the input data is processed and a new hash is written.

In [215]:
def verify_hash(status):
    print()
    print("Status: {}".format(status))
    print()
    
    # Generate File Name, Path and Directory
    file_name = "{}_imgs_hash.csv".format(status)
    file_path = os.path.join("hash", file_name)
    file_dir = os.path.join('imgs', status)
    
    # Print for Sanity Check
    print("File Name: {}".format(file_name))
    print("File Path: {}".format(file_path))
    print("File Directory: {}".format(file_dir))
    print()
    
    # Generate Current Hash
    hash_new = dirhash(file_dir, 'sha256')
    
    if not os.path.exists(file_path):
        print("Hash for {} images do not exist.".format(status))
        print()
        im2rec_and_write(status)
        return
    
    else:
        # Read Existing Hash
        df_hash_old = pd.read_csv(file_path)
        hash_old = df_hash_old.iloc[0]['hash']
        
        if hash_old == hash_new:
            print("Hash for {} images are equal.".format(status))
            print("New Hash not generated. Processing not required.")
            print()
            return
        
        else:
            print("Hash for {} images are NOT equal.".format(status))
            print()
            im2rec_and_write(status)
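
Stripped of the printing, the control flow above is a small "run only if the inputs changed" pattern. The sketch below isolates it so it can be tested without images. It is written under assumptions: `run_if_changed` is a hypothetical helper, the record is a plain text file instead of the CSV used above, and the hash is passed in as a string instead of being computed with `dirhash`.

```python
import tempfile
from pathlib import Path

def run_if_changed(current_hash, record_path, action):
    """Call action() and store current_hash unless record_path already holds it."""
    record = Path(record_path)
    if record.exists() and record.read_text().strip() == current_hash:
        return False  # inputs unchanged, skip the expensive step
    action()
    # Write the record only after the action succeeds, so a failed run is retried
    record.parent.mkdir(parents=True, exist_ok=True)
    record.write_text(current_hash)
    return True

# Demonstration with made-up hashes and a stand-in for the processing step
calls = []
with tempfile.TemporaryDirectory() as tmp:
    record = Path(tmp) / "hash" / "train.txt"
    results = [
        run_if_changed("abc", record, lambda: calls.append("run")),  # no record yet
        run_if_changed("abc", record, lambda: calls.append("run")),  # same hash
        run_if_changed("def", record, lambda: calls.append("run")),  # hash changed
    ]

print(results, len(calls))  # [True, False, True] 2
```

Passing the processing step in as a callable keeps the check independent of im2rec, so the same helper could guard other slow steps.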

## Execute Process
Running the following code chunk will execute the process flow function developed above for both the training and validation data.

In [220]:
status_list = ['train', 'validation']
for status in status_list:
    verify_hash(status)


Status: train

File Name: train_imgs_hash.csv
File Path: hash\train_imgs_hash.csv
File Directory: imgs\train

Hash for train images are equal.
New Hash not generated. Processing not required.


Status: validation

File Name: validation_imgs_hash.csv
File Path: hash\validation_imgs_hash.csv
File Directory: imgs\validation

Hash for validation images are equal.
New Hash not generated. Processing not required.



# S3 Upload