This notebook prepares the images in the distracted driver dataset locally for upload to S3.

It creates two versions of the data. The first is an LST mapping file alongside the image files in JPEG format. The second is a binary RecordIO file.

# Import

## Install

In [58]:
!pip install checksumdir

Collecting checksumdir
  Downloading checksumdir-1.1.7.tar.gz (3.1 kB)
Building wheels for collected packages: checksumdir
  Building wheel for checksumdir (setup.py): started
  Building wheel for checksumdir (setup.py): finished with status 'done'
  Created wheel for checksumdir: filename=checksumdir-1.1.7-py3-none-any.whl size=4247 sha256=b25353c6ffb3528aad22ffa56f5058aaa5258adc81247bb0fdb6ab820888c28b
  Stored in directory: c:\users\canfi\appdata\local\pip\cache\wheels\38\02\0f\76662753d74e5b3ddddc1e6daa8fabe369be85f3ca21647b36
Successfully built checksumdir
Installing collected packages: checksumdir
Successfully installed checksumdir-1.1.7


## Library / Packages

In [59]:
import pandas as pd
import random 
import os
import shutil
import boto3

import hashlib
from checksumdir import dirhash

from filecmp import dircmp

## Data

In [2]:
url = "https://raw.githubusercontent.com/DSBA-6190-Final-Project-Team/DSBA-6190_Final-Project/master/wine_predict/data/driver_imgs_list.csv"
path_file = 'data/driver_imgs_list.csv'

df_driver_index = pd.read_csv(path_file) 

# EDA

In [3]:
df_driver_index.shape

(22424, 3)

In [4]:
df_driver_index.head(5)

Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [5]:
df_driver_index.columns

Index(['subject', 'classname', 'img'], dtype='object')

The following table lists each unique driver along with the number of class labels and images associated with each. Note that `count()` tallies rows rather than unique values, so the `classname` and `img` columns have equal frequencies.

In [6]:
drivers_gb = df_driver_index.groupby(['subject'])
drivers_gb.count()

subject,classname,img
p002,725,725
p012,823,823
p014,876,876
p015,875,875
p016,1078,1078
p021,1237,1237
p022,1233,1233
p024,1226,1226
p026,1196,1196
p035,848,848
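
Because `count()` tallies rows per column, the two columns in the table always agree. A minimal sketch, on an invented toy frame (the rows below are illustrative and not taken from the dataset), of how `nunique()` could report the number of distinct classes per driver instead:

```python
import pandas as pd

# Toy stand-in for df_driver_index; the rows are invented for illustration.
df_toy = pd.DataFrame({
    "subject":   ["p001", "p001", "p001", "p002", "p002"],
    "classname": ["c0", "c0", "c1", "c0", "c0"],
    "img":       ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
})

# count() returns the number of non-null rows per driver for every column.
row_counts = df_toy.groupby("subject").count()

# nunique() returns the number of distinct values per driver.
distinct_counts = df_toy.groupby("subject").nunique()

print(row_counts)
print(distinct_counts)
```

Which of the two is more useful depends on the question: row counts show how many images each driver contributes, while distinct counts show whether every driver covers every class.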


We will set the training/validation split ratio. The value is the fraction of drivers assigned to the validation set.

In [7]:
train_val_split = 0.3

To split the data into training and validation sets, we'll need a unique index of drivers. We don't need the frequency counts.

We will create a shuffled list of unique drivers.


In [8]:
drivers_unique = drivers_gb.groups.keys()
drivers_unique = list(drivers_unique)

random.Random(5590).shuffle(drivers_unique)
print(drivers_unique)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042', 'p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']


Next, we'll set the lists of drivers in the training set and the validation set.

In [9]:
num_drivers_val = round(len(drivers_unique)*train_val_split)
#print(num_drivers_val)
num_drivers_train = len(drivers_unique) - num_drivers_val
#print(num_drivers_train)

In [10]:
drivers_val = drivers_unique[:num_drivers_val]
print(drivers_val)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042']


In [11]:
drivers_train = drivers_unique[-num_drivers_train:]
print(drivers_train)

['p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']
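
The shuffle-and-slice steps above could be wrapped in a single helper. The sketch below is one possible arrangement, not the notebook's own code: `split_drivers` and the toy driver IDs are hypothetical names introduced for illustration.

```python
import random

def split_drivers(drivers, val_fraction, seed):
    """Shuffle a copy of `drivers` and split it into (validation, training)."""
    shuffled = sorted(drivers)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    num_val = round(len(shuffled) * val_fraction)
    # Slicing from num_val onward also behaves correctly when no drivers remain for training.
    return shuffled[:num_val], shuffled[num_val:]

# Hypothetical driver IDs, for illustration only.
toy_drivers = ["p{:03d}".format(i) for i in range(1, 11)]
val, train = split_drivers(toy_drivers, val_fraction=0.3, seed=5590)

# The two lists must not share a driver and must cover every driver.
assert set(val).isdisjoint(train)
assert sorted(val + train) == sorted(toy_drivers)
print(len(val), len(train))
```

Splitting by driver rather than by image keeps all images of one person in the same subset, so the validation set only contains drivers the model has not seen.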


# Training and Validation Image Lists
We'll now create two lists: one of every image file name associated with the training set, and another for the validation set.

We will use these lists to filter the overall LST mapping file.

In [12]:
df_images_val = df_driver_index[df_driver_index['subject'].isin(drivers_val)]
# Keep only the rows whose driver is NOT in the validation driver list
df_images_training = df_driver_index[~df_driver_index['subject'].isin(drivers_val)]

In [13]:
print(df_images_training.shape)
df_images_training.head()

(15686, 3)


Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [14]:
print(df_images_val.shape)
df_images_val.head()

(6738, 3)


Unnamed: 0,subject,classname,img
725,p012,c0,img_10206.jpg
726,p012,c0,img_27079.jpg
727,p012,c0,img_50749.jpg
728,p012,c0,img_97089.jpg
729,p012,c0,img_37741.jpg
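
A quick sanity check for a split like this is that the two subsets partition the index: their row counts add up to the total and no driver appears in both. A minimal sketch on an invented toy frame (the rows and the `val_drivers` list are hypothetical):

```python
import pandas as pd

# Toy index with invented rows; the column names follow df_driver_index.
df_index = pd.DataFrame({
    "subject": ["p001", "p001", "p002", "p003", "p003"],
    "img": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
})
val_drivers = ["p002"]  # hypothetical validation drivers

# Build one boolean mask and use it, and its negation, for the two subsets.
is_val = df_index["subject"].isin(val_drivers)
df_val = df_index[is_val]
df_train = df_index[~is_val]

# Every row lands in exactly one subset, and no driver appears in both.
assert len(df_val) + len(df_train) == len(df_index)
assert set(df_val["subject"]).isdisjoint(df_train["subject"])
print(len(df_train), len(df_val))
```

Reusing a single mask for both subsets makes it hard for the two filters to drift apart.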


# Move Validation Images
We need to move the validation images into a set of validation folders. 

In [15]:
path_current = "D:\\Notebooks\\dsba_6190\\team_project\\image_classification_preprocessing"
prefix_training = "imgs\\train"
prefix_validation = "imgs\\validation"

We need to create a relative file path to each image. We will create this path for all of the images in the validation set. Then we will move them from the training folders to the validation folders.
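
The move step described above can be sketched with the standard library alone. This is an illustrative variant, not the notebook's own cell: `move_images` is a hypothetical helper, and the demonstration runs in a throwaway temporary directory with an invented file name.

```python
import os
import shutil
import tempfile

def move_images(rel_paths, src_root, dst_root):
    """Move each relative path from src_root to dst_root, creating folders as needed."""
    moved = 0
    for rel_path in rel_paths:
        src = os.path.join(src_root, rel_path)
        dst = os.path.join(dst_root, rel_path)
        # Create the destination class folder if it does not exist yet.
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.move(src, dst)
        moved += 1
    return moved

# Demonstration in a temporary directory with a placeholder file.
root = tempfile.mkdtemp()
train_root = os.path.join(root, "imgs", "train")
val_root = os.path.join(root, "imgs", "validation")
os.makedirs(os.path.join(train_root, "c0"))
with open(os.path.join(train_root, "c0", "img_1.jpg"), "wb") as f:
    f.write(b"placeholder")

rel = os.path.join("c0", "img_1.jpg")
n_moved = move_images([rel], train_root, val_root)
moved_ok = os.path.exists(os.path.join(val_root, rel))
source_gone = not os.path.exists(os.path.join(train_root, rel))
shutil.rmtree(root)  # clean up the temporary directory
print(n_moved, moved_ok, source_gone)
```

Building paths with `os.path.join` instead of a hard-coded backslash keeps the same code usable on operating systems other than Windows.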

In [16]:
# Work on an explicit copy so the new column is not set on a slice of df_driver_index
df_images_val = df_images_val.copy()
df_images_val['rel_path'] = df_images_val['classname'] + '\\' + df_images_val['img']
df_images_val.head(10)


Unnamed: 0,subject,classname,img,rel_path
725,p012,c0,img_10206.jpg,c0\img_10206.jpg
726,p012,c0,img_27079.jpg,c0\img_27079.jpg
727,p012,c0,img_50749.jpg,c0\img_50749.jpg
728,p012,c0,img_97089.jpg,c0\img_97089.jpg
729,p012,c0,img_37741.jpg,c0\img_37741.jpg
730,p012,c0,img_65697.jpg,c0\img_65697.jpg
731,p012,c0,img_3866.jpg,c0\img_3866.jpg
732,p012,c0,img_19098.jpg,c0\img_19098.jpg
733,p012,c0,img_31885.jpg,c0\img_31885.jpg
734,p012,c0,img_41423.jpg,c0\img_41423.jpg


In [17]:
os.path.join(path_current, prefix_training)

'D:\\Notebooks\\dsba_6190\\team_project\\image_classification_preprocessing\\imgs\\train'

In [18]:
'''
for index in df_images_val.index:
    path_train = os.path.join(path_current, prefix_training, df_images_val['rel_path'][index])
    path_val = os.path.join(path_current, prefix_validation, df_images_val['rel_path'][index])
    
    #path_train = path_train.replace('\\','/')
    #path_val = path_val.replace('\\','/')
    
    os.rename(path_train, path_val)
'''

## Verification

We will now check there are the correct number of files in each directory.

In [63]:
# Training
dir_train = os.path.join(path_current, prefix_training)

cpt = sum([len(files) for r, d, files in os.walk(dir_train)])
print("There are {} image files in the training dataset.".format(cpt))

There are 15686 image files in the training dataset.


In [64]:
# Validation
dir_val = os.path.join(path_current, prefix_validation)

cpt = sum([len(files) for r, d, files in os.walk(dir_val)])
print("There are {} image files in the validation dataset.".format(cpt))

There are 6738 image files in the validation dataset.


# Process Flow

Some of the downstream processing steps are computationally heavy. To avoid rerunning these code chunks unnecessarily, without manually commenting them out after each run, we will execute them only when the inputs have changed.

To detect whether any of the input files have changed, which would require reprocessing the images, we will record a hash checksum of the inputs and compare the current state against it.

To do this, we will have to convert the processing steps into callable functions.
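
The idea can be sketched independently of the notebook's functions. In the example below, `hash_directory` is a simplified stand-in for `checksumdir.dirhash` (its digests will not match those of `dirhash`), `run_if_changed` is a hypothetical wrapper, and the stored hash is kept in a plain text file rather than a CSV to keep the sketch short.

```python
import hashlib
import os
import shutil
import tempfile

def hash_directory(directory):
    """Return a SHA-256 digest over relative file paths and file contents."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(directory):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, directory).encode("utf-8"))
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()

def run_if_changed(directory, hash_file, process):
    """Call process() only when the directory hash differs from the stored one."""
    new_hash = hash_directory(directory)
    old_hash = None
    if os.path.exists(hash_file):
        with open(hash_file) as f:
            old_hash = f.read().strip()
    if old_hash == new_hash:
        return False  # inputs unchanged, skip the heavy step
    process()
    with open(hash_file, "w") as f:
        f.write(new_hash)
    return True

# Demonstration with a temporary directory and a callable that records each run.
work = tempfile.mkdtemp()
img_dir = os.path.join(work, "imgs")
os.makedirs(img_dir)
with open(os.path.join(img_dir, "a.jpg"), "wb") as f:
    f.write(b"one")
hash_file = os.path.join(work, "hash.txt")  # stored outside the hashed directory
calls = []

first = run_if_changed(img_dir, hash_file, lambda: calls.append("run"))
second = run_if_changed(img_dir, hash_file, lambda: calls.append("run"))
with open(os.path.join(img_dir, "b.jpg"), "wb") as f:
    f.write(b"two")
third = run_if_changed(img_dir, hash_file, lambda: calls.append("run"))
shutil.rmtree(work)
print(first, second, third, len(calls))
```

Passing the processing step in as a callable is what makes the check reusable, which is why the processing steps need to be functions.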

## im2rec Function Calls
The following functions use the im2rec.py tool to process the image inputs.

[https://mxnet.apache.org/api/faq/recordio](https://mxnet.apache.org/api/faq/recordio) (Accessed on 3/20/2020)

### LST File Generation
These functions perform the first step, creating LST mapping files.

#### Train

In [None]:
def im2rec_lst_train():
    %run tools/im2rec.py lst_files/train imgs/train --list --recursive 

#### Validation

In [None]:
def im2rec_lst_val():
    %run tools/im2rec.py lst_files/validation imgs/validation/ --list --recursive 

### Recordio Conversion
These functions convert the images into the binary RecordIO file format.

#### Train

In [161]:
def im2rec_rec_train():
    %run tools/im2rec.py "lst_files\train.lst" "imgs\train" 

#### Validation

In [162]:
def im2rec_rec_val():
    %run tools/im2rec.py "lst_files\validation.lst" "imgs\validation" 

## Writing Hash Checksum
The following function generates the hash of a directory (training or validation images) and writes it to a CSV file.

In [139]:
def write_hash(file_name, file_path, file_dir):
    # Generate a SHA-256 hash of the directory contents
    dir_hash = dirhash(file_dir, 'sha256')
    
    # Write the hash to a single-row CSV
    df_dir_hash = pd.DataFrame({'hash': [dir_hash]})
    df_dir_hash.to_csv(file_path, index=False)
    # Report using the function's own arguments rather than the global `status`
    print("Hash for {} successfully written to {}.".format(file_dir, file_path))
    print()
    

## Process Flow Function
The following function will verify the hash of the current files matches the older record. If they match, no action is taken. If they do not match, the input data is processed, and a new hash is generated.

In [138]:
def verify_hash(status):
    print()
    print("Status: {}".format(status))
    print()
    
    # Generate File Name, Path and Directory
    file_name = "{}_imgs_hash.csv".format(status)
    file_path = os.path.join("hash", file_name)
    file_dir = os.path.join('imgs', status)
    
    # Print for Sanity Check
    print("File Name: {}".format(file_name))
    print("File Path: {}".format(file_path))
    print("File Directory: {}".format(file_dir))
    print()
    
    # Generate Current Hash
    hash_new = dirhash(file_dir, 'sha256')
    
    if not os.path.exists(file_path):
        print("Hash for {} images do not exist.".format(status))
        print()
        # Generate LST
        # Generate REC
        write_hash(file_name, file_path, file_dir)
        return
    
    else:
        # Read Existing Hash
        df_hash_old = pd.read_csv(file_path)
        hash_old = df_hash_old.iloc[0]['hash']
        
        if hash_old == hash_new:
            print("Hash for {} images are equal.".format(status))
            print("New Hash not generated. Processing not required.")
            print()
            return
        
        else:
            print("Hash for {} images are NOT equal.".format(status))
            print()
            # Generate LST
            # Generate REC
            write_hash(file_name, file_path, file_dir)
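
The `# Generate LST` and `# Generate REC` comments mark where the processing steps would be triggered. One possible way to wire that in is a lookup from status to an ordered list of steps. The sketch below uses a hypothetical `reprocess` helper and stand-in steps that only record their names; in the notebook, the entries would be the im2rec functions defined earlier.

```python
def reprocess(status, steps_by_status):
    """Run every processing step registered for `status`, in order."""
    for step in steps_by_status[status]:
        step()

# Stand-in steps that only record their names (assumption: the real entries
# would be im2rec_lst_train, im2rec_rec_train, and their validation counterparts).
log = []
steps = {
    "train": [lambda: log.append("lst_train"), lambda: log.append("rec_train")],
    "validation": [lambda: log.append("lst_val"), lambda: log.append("rec_val")],
}

reprocess("train", steps)
print(log)
```

Keeping the LST step ahead of the REC step in each list matters, because the RecordIO conversion reads the LST file produced by the first step.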

## Execute Process
Running the following code chunk will execute the process flow function developed above for both the training and validation data.

In [141]:
status_list = ['train', 'validation']
for status in status_list:
    verify_hash(status)

Status: train

File Name: train_imgs_hash.csv
File Path: hash\train_imgs_hash.csv
File Directory: imgs\train

Hash for train images are equal.
New Hash not generated.

Status: validation

File Name: validation_imgs_hash.csv
File Path: hash\validation_imgs_hash.csv
File Directory: imgs\validation

Hash for validation images are equal.
New Hash not generated.



# LST Files

The SageMaker Image Classification Algorithm requires either an LST or REC file as input, one for the training set and one for the validation set. The file acts as a mapping, connecting each image in a set to its file location and its class.

Now that the files have been sorted into training and validation sets, we will create the training and validation LST files.

The lst tool is im2rec.py, found here: [https://mxnet.apache.org/api/faq/recordio](https://mxnet.apache.org/api/faq/recordio)
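
To make the mapping concrete, the sketch below parses one LST line. It assumes the tab-separated layout described in the RecordIO documentation linked above (an integer index, a label, and a relative image path per line). `parse_lst_line` is a hypothetical helper, and the sample line is invented rather than copied from a generated file.

```python
def parse_lst_line(line):
    """Split one LST line into (index, label, relative_path)."""
    fields = line.rstrip("\n").split("\t")
    index = int(fields[0])      # running image index
    label = float(fields[1])    # class label, stored as a number
    rel_path = fields[-1]       # image path relative to the image root
    return index, label, rel_path

# Invented example line; real files are produced by im2rec.py.
sample = "12\t3.000000\tc3/img_100.jpg\n"
print(parse_lst_line(sample))
```

A parser like this is handy for spot checks, for example confirming that every path in the validation LST file points into the validation image folder.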

In [155]:
train_lst_loc = 'lst_files/train'


## Training

In [154]:
%run tools/im2rec.py {train_lst_loc} imgs/train --list --recursive

c0 0
c1 1
c2 2
c3 3
c4 4
c5 5
c6 6
c7 7
c8 8
c9 9

## Validation

In [22]:
%run tools/im2rec.py lst_files/validation imgs/validation/ --list --recursive 

c0 0
c1 1
c2 2
c3 3
c4 4
c5 5
c6 6
c7 7
c8 8
c9 9


# REC Binary File Creation

We will now convert the images from standard image formats to the RecordIO format, for more efficient processing by the image classification algorithm.

## Training

In [49]:
#%run tools/im2rec.py "lst_files\train.lst" "imgs\train" 

Creating .rec file from D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\lst_files\train.lst in D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\lst_files
multiprocessing not available, fall back to single threaded encoding
time: 0.053854942321777344  count: 0
time: 36.72012400627136  count: 1000
time: 40.099653482437134  count: 2000
time: 33.50782084465027  count: 3000
time: 29.593960523605347  count: 4000
time: 30.440842151641846  count: 5000
time: 28.752363204956055  count: 6000
time: 31.05421471595764  count: 7000
time: 31.182876586914062  count: 8000
time: 40.862788677215576  count: 9000
time: 29.686290740966797  count: 10000
time: 28.63100504875183  count: 11000
time: 29.594340562820435  count: 12000
time: 28.277966499328613  count: 13000
time: 35.239041805267334  count: 14000
time: 39.334925413131714  count: 15000


## Validation

In [50]:
#%run tools/im2rec.py "lst_files\validation.lst" "imgs\validation" 

Creating .rec file from D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\lst_files\validation.lst in D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\lst_files
multiprocessing not available, fall back to single threaded encoding
time: 0.06337356567382812  count: 0
time: 28.41474437713623  count: 1000
time: 27.819944381713867  count: 2000
time: 36.39020895957947  count: 3000
time: 40.80285167694092  count: 4000
time: 51.46012330055237  count: 5000
time: 33.11307072639465  count: 6000


In [55]:
with open("lst_files/train.idx", "rb") as f:

    print(hashlib.sha256(f.read()).hexdigest())

89017af1cfb0525ebb81e27d42a30df2f2359ba0efd1a92350b544b2baac4a63
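
Reading a whole file with `f.read()` is fine for a small index file, but the `.rec` files can be much larger. A minimal sketch of hashing in fixed-size chunks, which keeps memory use bounded; `sha256_of_file` is a hypothetical helper, and the check runs on a small temporary file rather than on the notebook's outputs.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the one-shot digest on a small temporary file.
fd, tmp_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"example bytes" * 1000)
with open(tmp_path, "rb") as f:
    expected = hashlib.sha256(f.read()).hexdigest()
chunked = sha256_of_file(tmp_path, chunk_size=64)
os.remove(tmp_path)
print(chunked == expected)
```

The chunked digest is identical to the one-shot digest, so the two approaches can be swapped without invalidating previously recorded hashes.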
