# Notes
## Project Folder Structure
The data for this project is too large to store in the GitHub repo, so the files must be added manually. The raw image files can be found here:

[https://www.kaggle.com/c/state-farm-distracted-driver-detection/data](https://www.kaggle.com/c/state-farm-distracted-driver-detection/data) (Accessed on 3/21/2020)

The file structure should be as follows:

-Working

--imgs
  
---train

---test

The train and test folders come directly from the Kaggle dataset. Subsequent folders will be generated by this notebook.
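
Because the images are added by hand, it may help to confirm the layout before running the rest of the notebook. The following is a minimal sketch; the helper name `missing_folders` is hypothetical, and it assumes the notebook runs from the working folder that contains `imgs`.

```python
from pathlib import Path

# Hypothetical helper: list the expected folders that are not present.
def missing_folders(root, expected=("imgs/train", "imgs/test")):
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]

missing = missing_folders(".")
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("Folder structure looks complete.")
```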
## Scope
This notebook prepares the images in the distracted driver dataset locally for upload to S3.

This notebook will create two versions of the data. The first will be an LST mapping file, with the image files in JPEG format. The second will be in RecordIO format.

A complete dataset and a sample dataset will be created and uploaded.

# Import

## Install

In [53]:
!pip install checksumdir



## Library / Packages

In [54]:
import pandas as pd
import random 
import os
import shutil
import boto3

import hashlib
from checksumdir import dirhash

from filecmp import dircmp

#Settings
seed = 5590

## Data

In [55]:
url = "https://raw.githubusercontent.com/DSBA-6190-Final-Project-Team/DSBA-6190_Final-Project/master/wine_predict/data/driver_imgs_list.csv"
path_file = 'data/driver_imgs_list.csv'

df_driver_index = pd.read_csv(path_file) 

# EDA

In [56]:
df_driver_index.shape

(22424, 3)

In [57]:
df_driver_index.head(5)

Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [58]:
df_driver_index.columns

Index(['subject', 'classname', 'img'], dtype='object')

The following lists each unique driver along with the number of classname entries and images associated with each. Note that the classname column counts rows rather than unique classes, so the classname and img counts are equal.

In [59]:
drivers_gb = df_driver_index.groupby(['subject'])
drivers_gb.count()

subject,classname,img
p002,725,725
p012,823,823
p014,876,876
p015,875,875
p016,1078,1078
p021,1237,1237
p022,1233,1233
p024,1226,1226
p026,1196,1196
p035,848,848


We will set the fraction of drivers held out for validation. The remaining drivers form the training set.

In [60]:
train_val_split = 0.3

To split the data into training and validation sets, we'll need a unique index of drivers. We don't need the frequency counts.

We will create a shuffled list of unique drivers.


In [61]:
drivers_unique = drivers_gb.groups.keys()
drivers_unique = list(drivers_unique)

random.Random(seed).shuffle(drivers_unique)
print(drivers_unique)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042', 'p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']


We'll set the list of drivers in the train set and the validation set.

In [62]:
num_drivers_val = round(len(drivers_unique)*train_val_split)
#print(num_drivers_val)
num_drivers_train = len(drivers_unique) - num_drivers_val
#print(num_drivers_train)

In [63]:
drivers_val = drivers_unique[:num_drivers_val]
print(drivers_val)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042']


In [64]:
drivers_train = drivers_unique[-num_drivers_train:]
print(drivers_train)

['p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']
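
The split above can be condensed into one function. This is a sketch only: `split_drivers` is a hypothetical name, and the driver IDs below are placeholders rather than the real subjects.

```python
import random

def split_drivers(drivers, val_fraction, seed):
    """Shuffle a copy of drivers and split it into (validation, train)."""
    shuffled = list(drivers)
    random.Random(seed).shuffle(shuffled)
    num_val = round(len(shuffled) * val_fraction)
    return shuffled[:num_val], shuffled[num_val:]

drivers = ["p{:03d}".format(i) for i in range(26)]  # placeholder IDs
val, train = split_drivers(drivers, val_fraction=0.3, seed=5590)
print(len(val), len(train))  # 8 18

# No driver may appear in both sets, otherwise validation leaks into training.
assert set(val).isdisjoint(train)
```

Slicing the training drivers with `[num_val:]` also avoids an edge case of the negative slice used above: if the training count were ever 0, `[-0:]` would return the whole list.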


# Training and Validation Image Lists
We'll now create two lists: one of every image file name associated with the training set, and another of every image file name associated with the validation set.

We will use these lists to filter the overall lst mapping file.

In [65]:
df_images_val = df_driver_index[df_driver_index['subject'].isin(drivers_val)]
df_images_train = df_driver_index[~df_driver_index['subject'].isin(drivers_val)]

In [66]:
print(df_images_train.shape)
df_images_train.head()

(15686, 3)


Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [67]:
print(df_images_val.shape)
df_images_val.head()

(6738, 3)


Unnamed: 0,subject,classname,img
725,p012,c0,img_10206.jpg
726,p012,c0,img_27079.jpg
727,p012,c0,img_50749.jpg
728,p012,c0,img_97089.jpg
729,p012,c0,img_37741.jpg


# Move Images
We need to create two image folders, one for training and one for validation. We will leave the complete folder of images alone, in case we need to go back and extract information from it.

## Execute Copy
The following function takes the training or validation dataframe and iterates through each row. It then copies each file from the overall data to the respective folder.

In [68]:
def copy_imgs_all(status, df):
    prefix_overall = "imgs\\train"
    folder_suffix = "_subset"
    folder_name = status + folder_suffix
    prefix_status = os.path.join("imgs",folder_name)

    for index, row in df.iterrows():
        path_src = os.path.join(prefix_overall,
                                row['classname'], 
                                row['img']).replace('\\','/')
             
       
        path_dst = os.path.join("imgs",
                               folder_name, 
                               row['classname'],
                               row['img']).replace('\\','/')
        
        #print("Source File:\t\t{}".format(path_src))
        #print("Destination File:\t{}".format(path_dst))
    
        
        if not os.path.exists(path_dst):
            # Verify Directory Exists. Create if Not 
            os.makedirs(os.path.dirname(path_dst), exist_ok=True)
            
            shutil.copy(path_src, path_dst)
            #print("Loop Start")
            #print("Source File:\t\t{}".format(path_src))
            #print("Destination File:\t{}".format(path_dst))
            #print()
        #else:
            #print("File Exists. No Copy Made.")
            

We will create a dictionary containing the status and associated dataframes. This will be looped over.

In [69]:
dict_image_df = {
    "train": df_images_train,
    "validation" : df_images_val
}

In [70]:
for key in dict_image_df:
    copy_imgs_all(key, dict_image_df[key])
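
The copy step above builds paths with a Windows-style prefix and then replaces backslashes. A platform-neutral variant could use `pathlib` instead. The sketch below is an alternative, not the notebook's own function; `copy_if_missing` is a hypothetical name and it takes plain `(classname, img)` pairs rather than a dataframe.

```python
import shutil
from pathlib import Path

def copy_if_missing(src_root, dst_root, rows):
    """Copy each (classname, img) file from src_root to dst_root once.

    rows is an iterable of (classname, img) pairs; returns the number copied.
    """
    copied = 0
    for classname, img in rows:
        src = Path(src_root) / classname / img
        dst = Path(dst_root) / classname / img
        if dst.exists():
            continue  # already copied on an earlier run
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, dst)
        copied += 1
    return copied
```

With the dataframes above, the pairs could come from `df_images_train[["classname", "img"]].itertuples(index=False)`.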

## Verification

We will now check there are the correct number of files in each directory.

In [71]:
path_current = "D:\\Notebooks\\dsba_6190\\team_project\\image_classification_preprocessing"
prefix_training = "imgs\\train_subset"
prefix_validation = "imgs\\validation_subset"

### Training Subset

In [72]:
# Training
dir_train = os.path.join(path_current, prefix_training)
print("Path Analyzed:\t{}".format(dir_train))

cpt = sum([len(files) for r, d, files in os.walk(dir_train)])
print("There are {} image files in the training dataset.".format(cpt))

Path Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\train_subset
There are 15686 image files in the training dataset.


In [73]:
# Validation
dir_val = os.path.join(path_current, prefix_validation)
print("Path Analyzed:\t{}".format(dir_val))

cpt = sum([len(files) for r, d, files in os.walk(dir_val)])
print("There are {} image files in the validation dataset.".format(cpt))

Path Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\validation_subset
There are 6738 image files in the validation dataset.
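
The totals above confirm the overall counts. A per-class breakdown can additionally show whether any class folder is missing or short. The following sketch assumes one sub-folder per class directly under the given root; `count_files_by_class` is a hypothetical helper.

```python
import os
from collections import Counter

def count_files_by_class(root):
    """Return {class folder name: file count} for folders directly under root."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == ".":
            continue  # skip files sitting directly in root
        top = rel.split(os.sep)[0]  # first path component is the class folder
        counts[top] += len(filenames)
    return dict(counts)
```

Calling it with a relative path such as `"imgs/train_subset"` avoids the hard-coded absolute path, and the sum of the returned values should equal the row count of the matching dataframe.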


# Sampling
In order to test the image classification algorithm, we are going to create a subsample of the training and validation sets. We will need to create a list of images for each, and then copy the images into their own directory.

In [74]:
sample_rate = 0.1

## Create Sample Lists

### Training

In [75]:
df_images_train_sample = df_images_train.groupby('classname').apply(pd.DataFrame.sample, 
                                                                    frac = sample_rate, 
                                                                    random_state = seed)
df_images_train_sample

classname,,subject,classname,img
c0,6903,p024,c0,img_86347.jpg
c0,4449,p021,c0,img_69997.jpg
c0,9311,p035,c0,img_30218.jpg
c0,17010,p056,c0,img_17591.jpg
c0,15491,p051,c0,img_24357.jpg
...,...,...,...,...
c9,4273,p016,c9,img_43974.jpg
c9,13457,p047,c9,img_55331.jpg
c9,2386,p014,c9,img_36650.jpg
c9,13507,p047,c9,img_44150.jpg


### Validation

In [76]:
df_images_val_sample = df_images_val.groupby('classname').apply(pd.DataFrame.sample, 
                                                                frac = sample_rate, 
                                                                random_state = seed)
df_images_val_sample

classname,,subject,classname,img
c0,8202,p026,c0,img_34269.jpg
c0,11400,p042,c0,img_59032.jpg
c0,13573,p049,c0,img_11474.jpg
c0,8177,p026,c0,img_26025.jpg
c0,5662,p022,c0,img_9678.jpg
...,...,...,...,...
c9,11937,p042,c9,img_20372.jpg
c9,12668,p045,c9,img_75520.jpg
c9,14464,p049,c9,img_30754.jpg
c9,1495,p012,c9,img_74547.jpg
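
Sampling within each `classname` group keeps the class proportions of the sample close to those of the full set. The standard-library sketch below illustrates the idea on placeholder rows. It is not the pandas call used above, and `stratified_sample` is a hypothetical name.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, frac, seed):
    """Sample round(frac * n) rows from each group defined by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    rng = random.Random(seed)
    sample = []
    for name in sorted(groups):  # fixed group order keeps the result repeatable
        members = groups[name]
        k = round(len(members) * frac)
        sample.extend(rng.sample(members, k))
    return sample

# Placeholder rows: 50 images in class c0 and 30 in class c1.
rows = [("c0", "img_{}.jpg".format(i)) for i in range(50)] + \
       [("c1", "img_{}.jpg".format(i)) for i in range(50, 80)]
picked = stratified_sample(rows, key=lambda r: r[0], frac=0.1, seed=5590)
print(len(picked))  # 8
```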


## Create Dataset

In [108]:
def copy_imgs_sample(status, df):
    path_prefix_all = "imgs"
    path_prefix_sample = "imgs/sample"
    folder_name = status + "_subset"

    for index, row in df.iterrows():
        path_src = os.path.join(path_prefix_all,
                                folder_name,
                                row['classname'], 
                                row['img']).replace('\\','/')
                
        path_dst = os.path.join(path_prefix_sample,
                               folder_name,
                               row['classname'],
                               row['img']).replace('\\','/')
        
        #print("Loop Start")
        #print("Source File:\t\t{}".format(path_src))
        #print("Destination File:\t{}".format(path_dst))
        #print("")    
        
        # verify
        os.makedirs(os.path.dirname(path_dst), exist_ok=True)
        
        if not os.path.exists(path_dst):
            shutil.copy(path_src, path_dst)

In [106]:
dict_image_df_sample = {
    "train": df_images_train_sample,
    "validation" : df_images_val_sample
}

dict_image_df_sample

{'train':                 subject classname            img
 classname                                       
 c0        6903     p024        c0  img_86347.jpg
           4449     p021        c0  img_69997.jpg
           9311     p035        c0  img_30218.jpg
           17010    p056        c0  img_17591.jpg
           15491    p051        c0  img_24357.jpg
 ...                 ...       ...            ...
 c9        4273     p016        c9  img_43974.jpg
           13457    p047        c9  img_55331.jpg
           2386     p014        c9  img_36650.jpg
           13507    p047        c9  img_44150.jpg
           18519    p061        c9  img_81214.jpg
 
 [1567 rows x 3 columns],
 'validation':                 subject classname            img
 classname                                       
 c0        8202     p026        c0  img_34269.jpg
           11400    p042        c0  img_59032.jpg
           13573    p049        c0  img_11474.jpg
           8177     p026        c0  img_26025.jpg

In [109]:
for key in dict_image_df_sample:
    # Use the sampled dataframes so that only the sampled images are copied
    copy_imgs_sample(key, dict_image_df_sample[key])

# Process Flow

Some of the downstream processing steps are computationally heavy. We therefore want to skip these code chunks when their inputs have not changed, without having to comment them out manually after each run.

To check if any of the input files have changed, which would then require reprocessing the images, we will use Hash checksum values to document the state of the inputs. We will check against this state to determine if we need to reprocess the data. 

To do this, we will have to convert the processing steps into callable functions.

## im2rec Function Calls
The following functions use the im2rec.py tool to process the image inputs.

[https://mxnet.apache.org/api/faq/recordio](https://mxnet.apache.org/api/faq/recordio) (Accessed on 3/20/2020)

### LST File Generation
This function performs the first step, creating LST mapping files.

In [126]:
def im2rec_lst(status):
    # Prefix chosen so the resulting .lst path matches the one im2rec_rec expects
    lst_file = "lst_files/" + status
    img_loc = "imgs/" + status + "_subset"
    
    %run tools/im2rec.py {lst_file} {img_loc} --list --recursive 

### Recordio Conversion
This function converts the images into the binary RecordIO file format.

In [128]:
def im2rec_rec(status):
    lst_file = "lst_files/" + status + ".lst"
    img_loc = "imgs/" + status + "_subset"
    
    %run tools/im2rec.py {lst_file} {img_loc}
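
The `%run` magic is only available inside IPython or Jupyter. If these functions were ever moved into a plain Python script, one option would be to build the same command line and pass it to `subprocess`. The sketch below assumes the script path and the `--list --recursive` flags shown above; `build_im2rec_command` is a hypothetical helper that only builds the argument list.

```python
import sys

def build_im2rec_command(prefix, image_root, make_list=False,
                         script="tools/im2rec.py"):
    """Return the argument list for one im2rec.py call."""
    cmd = [sys.executable, script, prefix, image_root]
    if make_list:
        cmd += ["--list", "--recursive"]
    return cmd

cmd = build_im2rec_command("lst_files/train.lst", "imgs/train_subset")
print(" ".join(cmd[1:]))  # tools/im2rec.py lst_files/train.lst imgs/train_subset

# To execute the command: import subprocess; subprocess.run(cmd, check=True)
```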

## Writing Hash Checksum
The following function generates the hash of a directory (training or validation images) and writes it to a CSV file.

In [None]:
def write_hash(status):
    # Generate File Name, Path and Directory
    file_name = "{}_imgs_hash.csv".format(status)
    file_path = os.path.join("hash", file_name)
    file_dir = os.path.join('imgs', status)
    
    # Generate Hash
    dir_hash = dirhash(file_dir, 'sha256')
    
    # Write to CSV
    dict_dir_hash = {'hash': [dir_hash]}
    df_dir_hash = pd.DataFrame(dict_dir_hash)
    df_dir_hash.to_csv(file_path, index=False)
    print("Hash for {} images successfully written to CSV.".format(status))
    print()
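
To make the idea of a directory hash concrete, here is a rough standard-library sketch of how one can be computed. It is an illustration only and will not necessarily produce the same digest as `dirhash`; `hash_directory` is a hypothetical name.

```python
import hashlib
from pathlib import Path

def hash_directory(root):
    """Return a SHA-256 hex digest covering relative paths and file contents."""
    root = Path(root)
    digest = hashlib.sha256()
    # Sort the files so the digest does not depend on directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Adding, removing, renaming, or editing any file under the directory changes the digest, which is the property the process flow relies on.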
    

## Combine Processing and Writing Hash
Whenever we process the input data we need to write a new hash, and vice versa. Therefore, for simplicity in the final code, we will lump these actions together.


In [None]:
def im2rec_and_write(status):
    print("Generating LST file: {}".format(status))

    im2rec_lst(status)

    print("Generating REC file: {}".format(status))

    im2rec_rec(status)

    print("REC file complete.")

    write_hash(status)

## Process Flow Function
The following function will verify the hash of the current files matches the older record. If they match, no action is taken. If they do not match, the input data is processed, and a new hash is generated.

In [None]:
def verify_hash(status):
    print()
    print("Status: {}".format(status))
    print()
    
    # Generate File Name, Path and Directory
    file_name = "{}_imgs_hash.csv".format(status)
    file_path = os.path.join("hash", file_name)
    file_dir = os.path.join('imgs', status)
    
    # Print for Sanity Check
    print("File Name: {}".format(file_name))
    print("File Path: {}".format(file_path))
    print("File Directory: {}".format(file_dir))
    print()
    
    # Generate Current Hash
    hash_new = dirhash(file_dir, 'sha256')
    
    if not os.path.exists(file_path):
        print("Hash for {} images do not exist.".format(status))
        print()
        im2rec_and_write(status)
        return
    
    else:
        # Read Existing Hash
        df_hash_old = pd.read_csv(file_path)
        hash_old = df_hash_old.iloc[0]['hash']
        
        if hash_old == hash_new:
            print("Hash for {} images are equal.".format(status))
            print("New Hash not generated. Processing not required.")
            print()
            return
        
        else:
            print("Hash for {} images are NOT equal.".format(status))
            print()
            im2rec_and_write(status)

## Execute Process
Running the following code chunk will execute the process flow function developed above for both training and validation data.

In [None]:
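# status_list is not defined earlier in this notebook; these values are
# assumed from the dictionary keys used in the copy steps above
status_list = ["train", "validation"]

for status in status_list:
    verify_hash(status)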

# S3 Upload

## Establish AWS Settings

In [None]:
profile = 'dsba_6190_proj_4'
region = 'us-east-1'
bucket = 'dsba-6190-final-team-project'
prefix = "channels"

In [None]:
session = boto3.session.Session(profile_name = profile,
                               region_name = region)

s3_resource = session.resource('s3')

## Upload

In [None]:
# Train
train_file = "lst_files/train.rec"
train_s3_key = os.path.join(prefix, "train", "train.rec").replace('\\','/')
print("s3 Key - Train: {}".format(train_s3_key))

# Validation
val_file = "lst_files/validation.rec"
val_s3_key = os.path.join(prefix, "validation","validation.rec").replace('\\','/')
print("s3 Key - Validation: {}".format(val_s3_key))
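
S3 keys always use forward slashes, which is why the code above replaces backslashes after `os.path.join`. A possible alternative is `posixpath.join`, which joins with `/` on every platform. The helper name `s3_key` below is hypothetical.

```python
import posixpath

def s3_key(prefix, channel, filename):
    """Join S3 key parts with forward slashes on every platform."""
    return posixpath.join(prefix, channel, filename)

print(s3_key("channels", "train", "train.rec"))            # channels/train/train.rec
print(s3_key("channels", "validation", "validation.rec"))  # channels/validation/validation.rec
```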

In [None]:
#Training
#s3_resource.Bucket(bucket).upload_file(Filename=train_file, Key = train_s3_key)

In [None]:
#Validation
#s3_resource.Bucket(bucket).upload_file(Filename=val_file, Key = val_s3_key)