# Notes
## Project Folder Structure
The input data for this project is too large to store in the GitHub repo. Therefore, the files need to be added manually. The raw image files can be found here:

[https://www.kaggle.com/c/state-farm-distracted-driver-detection/data](https://www.kaggle.com/c/state-farm-distracted-driver-detection/data) (Accessed on 3/21/2020)

You can download the raw input files as a zip file. It will contain the following:


* driver_imgs_list.csv (This file is maintained in the repo)
* sample_submission.csv (This file is maintained in the repo)
* imgs folder, with test and train subfolders

This **imgs** folder is the part of the input which is not maintained in the repo. Therefore, for this notebook to function the data in the downloaded **imgs** must be copied to the correct location. Copy the **test** and **train** subfolders to the location *Working Directory > imgs > kaggle*. The resulting folders should have the following structure:

+--Working

|_ +--imgs

|_ _ +--kaggle
  
|_ _ _ +--train

|_ _ _ +--test
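
Before running the rest of the notebook, it can help to confirm that the folders landed in the right place. The following is a minimal sketch, assuming the notebook's working directory is the current directory; the helper name `missing_image_folders` is hypothetical and not part of the original notebook.

```python
import os

def missing_image_folders(working_dir="."):
    """Return the expected image folders that are absent under working_dir."""
    expected = [
        os.path.join(working_dir, "imgs", "kaggle", "train"),
        os.path.join(working_dir, "imgs", "kaggle", "test"),
    ]
    return [path for path in expected if not os.path.isdir(path)]

# Report any folder that still needs to be copied over
missing = missing_image_folders(".")
if missing:
    print("Missing folders: {}".format(missing))
else:
    print("Folder structure looks correct.")
```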


## Note on Dummy Data
There are several locations in this notebook where dummy data is created. This works around an issue I have not been able to resolve when creating files and folders in Python inside a loop: the files created during the last iteration of the loop are corrupted. Therefore, I have created a dummy placeholder, which always goes last in the loop. The loop still creates a corrupted file, but that file is the dummy. The files we need for our analysis are not corrupted.

# Import

We will use a python script called **im2rec.py** as a tool to process our input data. This python script is hosted on the Apache GitHub page. To ensure we have the most current version of this tool, we will clone the required GitHub repo into this project.

In [106]:
#!git clone https://github.com/apache/incubator-mxnet.git

## Install
The following packages were not available and needed to be installed.

In [107]:
!pip install checksumdir
!pip install opencv-python mxnet



## Library / Packages

In [108]:
import pandas as pd
import numpy as np
import random 
import os
import shutil
import boto3

import hashlib
from checksumdir import dirhash

from filecmp import dircmp

#Settings
seed = 5590

## Data
See the note at the top of this notebook for how to use the raw input image files.

In [109]:
df_driver_index = pd.read_csv("data/driver_imgs_list.csv") 

# EDA
## Overall Data Shape
The following section is some general Exploratory Data Analysis, focused on the image list.

The following are the columns in the image list CSV.

In [110]:
df_driver_index.columns

Index(['subject', 'classname', 'img'], dtype='object')

In [111]:
num_images = df_driver_index.shape[0]
print("Number of Images: {}".format(num_images))

Number of Images: 22424


In [112]:
num_classes = df_driver_index['classname'].nunique()
print("Number of Classes: {}".format(num_classes))

Number of Classes: 10


## Driver Analysis

The following lists each unique driver along with the number of classname and image entries associated with each. Note that the classname count is a count of rows, not of unique classes, so the classname and img columns have equal frequencies.

In [113]:
drivers_gb = df_driver_index.groupby(['subject'])
drivers_gb.count()

| subject | classname | img |
|---|---|---|
| p002 | 725 | 725 |
| p012 | 823 | 823 |
| p014 | 876 | 876 |
| p015 | 875 | 875 |
| p016 | 1078 | 1078 |
| p021 | 1237 | 1237 |
| p022 | 1233 | 1233 |
| p024 | 1226 | 1226 |
| p026 | 1196 | 1196 |
| p035 | 848 | 848 |


# Train / Validation Data Split
While the imported data from **Kaggle** contains a **train** and **test** set of images, the **test** set is unable to function as our test or validation set. The images in the **test** folder are unlabeled, so we do not know what class they are in. These images function more as a **Kaggle** scoring test set.

In order to correctly train our image classification model, we will need proper training and validation sets. We will pursue two different train/validation split methodologies. 

1. **Random Split**: The first method will be a general random split of the images. We will use the **im2rec.py** tool provided by Apache in the MXNet project folder. With this tool all we need to do is supply the image folder location and the train/test split, and the tool will create the two sets of images automatically.
2. **Driver Split**: This method is based on a blog post on this data that I can no longer find, unfortunately. Instead of splitting randomly on the images, we will split the drivers randomly. Then, all of the images associated with a driver will be either in the train set or in the validation set. No driver will appear in both sets.
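
To make the difference between the two methods concrete, here is a small sketch on made-up data. The toy dataframe and the helper `shared_drivers` are assumptions for illustration only; just the column names match the image list CSV. With a random split, the same driver typically ends up on both sides of the split, while a driver split keeps them disjoint.

```python
import pandas as pd

# Toy image list (made-up values) with the same columns as the image list CSV
toy = pd.DataFrame({
    "subject": ["p001"] * 4 + ["p002"] * 4 + ["p003"] * 4,
    "classname": ["c0", "c0", "c1", "c1"] * 3,
    "img": ["img_{}.jpg".format(i) for i in range(12)],
})

# Random split: hold out individual images
random_val = toy.sample(frac=0.25, random_state=0)
random_train = toy.drop(random_val.index)

# Driver split: hold out whole drivers
val_drivers = ["p003"]
driver_val = toy[toy["subject"].isin(val_drivers)]
driver_train = toy[~toy["subject"].isin(val_drivers)]

def shared_drivers(train, val):
    """Return the set of drivers that appear in both dataframes."""
    return set(train["subject"]) & set(val["subject"])

print("Random split, shared drivers:", shared_drivers(random_train, random_val))
print("Driver split, shared drivers:", shared_drivers(driver_train, driver_val))
```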

## Method 1 - Random Split
As stated before, the **im2rec.py** tool will handle the training/validation split for us. But we still need to create new sets of images. We will create three image sets:

1. Full copy of complete data
2. Sampling of complete data (10% by class)
3. Dummy data, equal to the 10% sample (See top of notebook for a note on dummy data)

To create a 10% sample, we need to create a sample list of images. We do that by using the overall image list, grouping by the class, and then applying a random sample function.

In [114]:
df_driver_index_sample = df_driver_index.groupby('classname').apply(pd.DataFrame.sample, 
                                                                    random_state = seed,  
                                                                    frac=0.1)

df_driver_index_sample

| classname | | subject | classname | img |
|---|---|---|---|---|
| c0 | 8160 | p026 | c0 | img_87669.jpg |
| c0 | 3328 | p016 | c0 | img_67547.jpg |
| c0 | 20854 | p075 | c0 | img_43775.jpg |
| c0 | 15331 | p051 | c0 | img_17811.jpg |
| c0 | 15477 | p051 | c0 | img_99652.jpg |
| ... | ... | ... | ... | ... |
| c9 | 12635 | p045 | c9 | img_3155.jpg |
| c9 | 7987 | p024 | c9 | img_59938.jpg |
| c9 | 7999 | p024 | c9 | img_80769.jpg |
| c9 | 6728 | p022 | c9 | img_50864.jpg |
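
As an aside, pandas 1.1 and later also offer `DataFrameGroupBy.sample`, which draws the per-class sample without adding the extra index level seen above. The sketch below uses a made-up dataframe; whether it selects exactly the same rows as the `apply` call above is not something I have checked.

```python
import pandas as pd

# Toy data (made-up): 20 images in class c0 and 10 in class c1
toy = pd.DataFrame({
    "classname": ["c0"] * 20 + ["c1"] * 10,
    "img": ["img_{}.jpg".format(i) for i in range(30)],
})

# Draw 10% of the rows from each class
toy_sample = toy.groupby("classname").sample(frac=0.1, random_state=5590)

# Count how many images were drawn per class
counts = toy_sample["classname"].value_counts()
print(counts)
```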


## Method 2 - Driver Split
The following section splits the drivers into training and validation sets.

First, we will set the training/validation split ratio.

In [115]:
train_val_split = 0.2

To split the data into training and validation sets, we'll need a unique index of drivers. We don't need the frequency counts.

We will create a shuffled list of unique drivers.

In [116]:
drivers_unique = drivers_gb.groups.keys()
drivers_unique = list(drivers_unique)

random.Random(seed).shuffle(drivers_unique)
print(drivers_unique)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042', 'p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']


We'll set the list of drivers in the train set and the validation set.

In [117]:
num_drivers_val = round(len(drivers_unique)*train_val_split)
num_drivers_train = len(drivers_unique) - num_drivers_val

#Print
print("Number of Drivers - Training Set:\t{}".format(num_drivers_train))
print("Number of Drivers - Validation Set:\t{}".format(num_drivers_val))

Number of Drivers - Training Set:	21
Number of Drivers - Validation Set:	5


#### Training Drivers

In [118]:
drivers_train = drivers_unique[-num_drivers_train:]
print(drivers_train)

['p072', 'p026', 'p042', 'p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']


#### Validation Drivers

In [119]:
drivers_val = drivers_unique[:num_drivers_val]
print(drivers_val)

['p049', 'p022', 'p012', 'p075', 'p045']


### Training and Validation Image Lists

With the training and validation sets of drivers established, we will now create two lists of images. One list will contain every image file name associated with the training set, the other every image file name associated with the validation set.

These lists will be used to copy from the **Kaggle** dataset to create unique datasets.

In [120]:
df_images_val = df_driver_index[df_driver_index['subject'].isin(drivers_val)]
df_images_train = df_driver_index[~df_driver_index['subject'].isin(drivers_val)]

In [121]:
print(df_images_train.shape)
df_images_train.head()

(17819, 3)


| | subject | classname | img |
|---|---|---|---|
| 0 | p002 | c0 | img_44733.jpg |
| 1 | p002 | c0 | img_72999.jpg |
| 2 | p002 | c0 | img_25094.jpg |
| 3 | p002 | c0 | img_69092.jpg |
| 4 | p002 | c0 | img_92629.jpg |


In [122]:
print(df_images_val.shape)
df_images_val.head()

(4605, 3)


| | subject | classname | img |
|---|---|---|---|
| 725 | p012 | c0 | img_10206.jpg |
| 726 | p012 | c0 | img_27079.jpg |
| 727 | p012 | c0 | img_50749.jpg |
| 728 | p012 | c0 | img_97089.jpg |
| 729 | p012 | c0 | img_37741.jpg |
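
The two shapes above add up to the full image list (17819 + 4605 = 22424). A check along the following lines could make that, and the absence of shared drivers, explicit. This is only a sketch: the function name `check_driver_split` and the toy data are hypothetical, not part of the original notebook.

```python
import pandas as pd

def check_driver_split(df_all, df_train, df_val):
    """Raise ValueError if the split shares drivers or loses images."""
    overlap = set(df_train["subject"]) & set(df_val["subject"])
    if overlap:
        raise ValueError("Drivers in both sets: {}".format(sorted(overlap)))
    if len(df_train) + len(df_val) != len(df_all):
        raise ValueError("Train and validation rows do not add up to the total")

# Toy data (made-up) standing in for the real image list
toy = pd.DataFrame({
    "subject": ["p001", "p001", "p002", "p003"],
    "classname": ["c0", "c1", "c0", "c1"],
    "img": ["img_1.jpg", "img_2.jpg", "img_3.jpg", "img_4.jpg"],
})
toy_val = toy[toy["subject"].isin(["p003"])]
toy_train = toy[~toy["subject"].isin(["p003"])]

check_driver_split(toy, toy_train, toy_val)
print("Split is consistent.")
```

In this notebook the equivalent call would be `check_driver_split(df_driver_index, df_images_train, df_images_val)`.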


## Copy Images
We need to create two folders of images, one for training and one for validation. We will leave the **Kaggle** folder of images alone, so we always have the raw data if we need it.

We will create these new folders by copying from the **Kaggle** folder, using the lists of images created above.

## Execute Copy
The following function takes a nested dictionary of dataframes (training, validation, and so on) and iterates through each row of each dataframe. It then copies each file from the overall data to the respective folder.

In [123]:
def copy_imgs(nested_dict, folder = "default"):
   
    for key, value in nested_dict.items():
        print("Entering Initial For Loop")
        print("")
        if type(value) is dict:
            print("Entering If Statment - Recursive")
            print("")
            # Define Local Values
            folder = key
            copy_imgs(value, folder = folder)
           
        else:
            print("Entering Else Statment")
            print("Folder: {}".format(folder))
            print("Key: {}".format(key))
            print("")
            prefix_subfolder = "imgs"
            prefix_src = "kaggle/train"
            prefix_dst = "train-" + folder
    
            for index, row in value.iterrows():
                path_src = os.path.join(prefix_subfolder,
                                        prefix_src,
                                        row['classname'], 
                                        row['img']).replace('\\','/')

                path_dst = os.path.join(prefix_subfolder,
                            prefix_dst,
                            key,
                            row['classname'],
                            row['img']).replace('\\','/')
                
                #print("Source File:\t\t{}".format(path_src))
                #print("Destination File:\t{}".format(path_dst))
    
        
                if not os.path.exists(path_dst):
                    # Verify Directory Exists. Create if Not 
                    os.makedirs(os.path.dirname(path_dst), exist_ok=True)

                    shutil.copy(path_src, path_dst)
                    #print("Loop Start")
                    #print("Source File:\t\t{}".format(path_src))
                    #print("Destination File:\t{}".format(path_dst))
                    #print()
                #else:
                    #print("File Exists. No Copy Made.")


We will create a nested dictionary to loop over to execute the required copies.

In [124]:
# Random Split Dictionary
dict_random_split = {
    "complete": df_driver_index,
    "sample" : df_driver_index_sample,
    "dummy": df_driver_index_sample
}

# Driver Split Dictionary
dict_driver_split = {
    "train": df_images_train,
    "validation" : df_images_val,
    "dummy": df_images_val
}

dict_overall_copy = {
    "split_random" : dict_random_split,
    "split_driver" : dict_driver_split
}
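
The recursive functions in this notebook all walk this nested dictionary in the same way: recurse into dictionaries, act on everything else. As a rough sketch, that traversal could be written once as a generator. The helper name `iter_leaves` and the toy values below are hypothetical and only illustrate the pattern.

```python
def iter_leaves(nested, parent=None):
    """Yield (parent_key, key, value) for every non-dict value in a nested dict."""
    for key, value in nested.items():
        if isinstance(value, dict):
            # Recurse, remembering which top-level key we came from
            yield from iter_leaves(value, parent=key)
        else:
            yield parent, key, value

# Toy stand-in for the nested dictionary (integers instead of dataframes)
example = {
    "split_random": {"complete": 1, "sample": 2},
    "split_driver": {"train": 3, "validation": 4},
}

for folder, name, value in iter_leaves(example):
    print(folder, name, value)
```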

In [125]:
copy_imgs(dict_overall_copy)

Entering Initial For Loop

Entering If Statment - Recursive

Entering Initial For Loop

Entering Else Statment
Folder: split_random
Key: complete

Entering Initial For Loop

Entering Else Statment
Folder: split_random
Key: sample

Entering Initial For Loop

Entering Else Statment
Folder: split_random
Key: dummy

Entering Initial For Loop

Entering If Statment - Recursive

Entering Initial For Loop

Entering Else Statment
Folder: split_driver
Key: train

Entering Initial For Loop

Entering Else Statment
Folder: split_driver
Key: validation

Entering Initial For Loop

Entering Else Statment
Folder: split_driver
Key: dummy



# Verify Folder Size

We will now verify the size of the resulting folders.

In [126]:
# Function calculates the number of files in a directory, recursively
def num_files(nested_dict, folder = "default"):
    
     for key, value in nested_dict.items():
        #print("Entering Initial For Loop")
        #print("")
        if type(value) is dict:
            print("Entering If Statment - Recursive")
            print("")
            # Define Local Values
            folder = key
            num_files(value, folder = folder)
           
        else:
            
            path_current = "D:\\Notebooks\\dsba_6190\\team_project\\image_classification_preprocessing"
            prefix_folder = "imgs\\train-" + folder
            prefix_subfolder = key
            dir_check = os.path.join(path_current, prefix_folder, prefix_subfolder)
            
            cpt = sum([len(files) for r, d, files in os.walk(dir_check)])
            print("Directory Analyzed:\t{}".format(dir_check))
            print("\t\t\tThere are " + "\033[1m" + "{}".format(cpt) + "\033[0m" + " image files in the " + 
                  "\033[1m"+ folder + "\\" + prefix_subfolder +  "\033[0m" + " directory.")
            print()

In [127]:
num_files(dict_overall_copy)

Entering If Statment - Recursive

Directory Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\train-split_random\complete
			There are 22424 image files in the split_random\complete directory.

Directory Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\train-split_random\sample
			There are 2243 image files in the split_random\sample directory.

Directory Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\train-split_random\dummy
			There are 2243 image files in the split_random\dummy directory.

Entering If Statment - Recursive

Directory Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\train-split_driver\train
			There are 17819 image files in the split_driver\train directory.

Directory Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\train-

# Process Flow

Some of the downstream processing steps are computationally heavy. Therefore, I'm going to put in some guardrails to avoid unnecessarily re-running these code chunks.

To check if any of the input files have changed, which would then require reprocessing the images, we will use Hash checksum values to document the state of the input folders. We will check against this Hash state to determine if we need to reprocess the data. 
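
The notebook relies on `dirhash` from the `checksumdir` package for this. To illustrate the idea only, here is a standard-library sketch of a directory checksum and the comparison against a stored value. The function names `hash_directory` and `needs_reprocessing` are hypothetical, and the digest this produces is not expected to match the one from `dirhash`.

```python
import hashlib
import os

def hash_directory(root):
    """Return a SHA-256 hex digest covering relative paths and file contents."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subfolders in a deterministic order
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, root).replace(os.sep, "/")
            digest.update(rel_path.encode("utf-8"))
            # Read in chunks so large image files do not fill memory
            with open(full_path, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
    return digest.hexdigest()

def needs_reprocessing(root, stored_hash):
    """True when there is no stored hash or the folder contents have changed."""
    return stored_hash is None or hash_directory(root) != stored_hash
```

If the stored and current digests match, the expensive conversion step can be skipped.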

## im2rec Function Calls
The following functions use the im2rec.py tool to process the image inputs.

[https://mxnet.apache.org/api/faq/recordio](https://mxnet.apache.org/api/faq/recordio) (Accessed on 3/20/2020)

### LST File Generation
This function performs the first step, creating LST mapping files.

In [128]:
def im2rec_lst(status, status_drop_train, dataset):
    
    lst_file_name = dataset
    lst_file_path = os.path.join("model_input_files", status_drop_train, dataset)
    img_loc =  os.path.join("imgs", status, dataset)
    
    if "im2rec" in status:
    
        %run incubator-mxnet/tools/im2rec.py --list --train-ratio 0.8 --recursive {lst_file_path} {img_loc} 
    else: 
        %run incubator-mxnet/tools/im2rec.py --list --recursive {lst_file_path} {img_loc}
    
    return img_loc, lst_file_path

### Recordio Conversion
This function converts the images into a RECORDIO binary file format. In addition to converting images from raw image files to RECORDIO, we need to resize the images. In the Amazon Sagemaker Image Classification Hyperparameters documentation ([here](https://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html)) there is a parameter called **image_shape**. This parameter requires a string with three comma-separated numbers (e.g. "1,2,3").

The first value is **num_channels**, the number of channels in the input images. Ours are color images, so they have three channels (red, green, and blue, or RGB). The second and third values are the height and width of the images, in pixels. Technically, the algorithm can accommodate images of any size, but if the images are too large, there may be memory constraints. The documentation indicates typical dimensions are 244 x 244. Our images are 640 x 480, which is about 5x larger by area. So, we need to resize the images.

The tool **im2rec.py** is capable of resizing images during the RECORDIO conversion. All we need to do is add the argument "**--resize #**" to the command line entry. The number in the argument is the size, in pixels, to which the tool resizes the ***shortest edge*** of the input image. The longer edge is resized to maintain the same relative dimensions. 

Therefore, we are going to resize the input images so that the resulting area is approximately equal to the area of a 244 x 244 image. The dimensions we are going to use are 210 x 280. This maintains the relative dimensions of the input images, while having an area about 98.8% that of the typical image size. So, in the command line argument, the call will be **--resize 210**.
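
The arithmetic behind these numbers can be checked with a few lines of Python. This is just a worked calculation using the figures quoted in the text (640 x 480 source images, a 244 x 244 reference size, and a short edge of 210); the variable names are my own.

```python
# Source image size and the target for the shortest edge, as quoted in the text
src_w, src_h = 640, 480
short_edge = 210

# Scale both edges by the same factor so the aspect ratio is preserved
scale = short_edge / min(src_w, src_h)
new_w, new_h = round(src_w * scale), round(src_h * scale)

# Compare the resized area with the 244 x 244 reference size used in the text
ratio = (new_w * new_h) / (244 * 244)

print("Resized dimensions: {} x {}".format(new_w, new_h))
print("Area relative to 244 x 244: {:.1%}".format(ratio))
```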

In [129]:
def im2rec_rec(img_loc, lst_file_path):
    
    %run incubator-mxnet/tools/im2rec.py  --resize 210 {lst_file_path} {img_loc} 

## Writing Hash Checksum
The following function generates the hash of a directory (training or validation images) and writes the hash to a CSV file.

In [130]:
def write_hash(status, file_path, file_dir):

    # Generate Hash
    dir_hash = dirhash(file_dir, 'sha256')
    
    # Write to CSV
    dict_dir_hash = {'hash': [dir_hash]}
    df_dir_hash = pd.DataFrame(dict_dir_hash)
    df_dir_hash.to_csv(file_path, index=False)
    print("Hash for {} images successfully written to CSV.".format(status))
    print()    

## Combine Processing and Writing Hash
Whenever we process the input data we need to write a new hash, and vice versa. Therefore, for simplicity in the final code, we will lump these actions together.


In [131]:
def im2rec_and_write(status, status_drop_train, dataset, file_name, file_path, file_dir):
        print("Generating LST File: {} status - {} dataset".format(status, dataset))
        
        img_loc, lst_file_path = im2rec_lst(status, status_drop_train, dataset)
        
        print("Starting generating REC file: {} status - {} dataset".format(status, dataset))
        
        im2rec_rec(img_loc, lst_file_path)
        
        print("REC File complete.")
        
        write_hash(status, file_path, file_dir)

## Process Flow Function
The following function will verify the hash of the current files matches the older record. If they match, no action is taken. If they do not match, the input data is processed, and a new hash is generated.

In [132]:
def image_convert(nested_dict, folder = "default"):
    
    for key, value in nested_dict.items():
        print("Entering Initial For Loop")
        print("")
        if type(value) is dict:
            print("Entering If Statment - Recursive")
            print("")
            # Define Local Values
            folder = key
            image_convert(value, folder = folder)
           
        else:
    
            print("Split Method: {}".format(folder))
            print("Dataset: {}".format(key))
            print()


            # Generate File Name, Path and Directory for Hash
            folder_plus_train = "train-" + folder
            file_name = "hash_{}.csv".format(key)
            file_path_hash = os.path.join("hash", folder, file_name)
            file_dir = os.path.join('imgs', folder_plus_train, key)

            # Print for Sanity Check
            print("Hash File Name: {}".format(file_name))
            print("Hash File Path: {}".format(file_path_hash))
            print("Image Set File Directory: {}".format(file_dir))
            print()

            # Generate Current Hash
            print('New Hash - Generating')
            hash_new = dirhash(file_dir, 'sha256')
            print('New Hash - Generated')

            if not os.path.exists(file_path_hash):
                print("Hash for {} images in {} do not exist.".format(key, folder))
                print()
                im2rec_and_write(folder_plus_train, folder, key, 
                                 file_name, file_path_hash, file_dir)
                # Move on to the next dataset instead of leaving the function
                continue

            else:
                # Read Existing Hash
                df_hash_old = pd.read_csv(file_path_hash)
                hash_old = df_hash_old.iloc[0]['hash']

                if hash_old == hash_new:
                    print("Hash for {} images in {} are equal.".format(key, folder))
                    print("Hash not replaced.")
                    print("Processing not required.")
                    print()
                    # Move on to the next dataset instead of leaving the function
                    continue

                else:
                    print("Hash for {} images in {} are NOT equal.".format(key, folder))
                    print()
                    im2rec_and_write(folder_plus_train, folder, key, 
                                     file_name, file_path_hash, file_dir)

## Execute Process
Running the following code chunk will execute the process flow function developed above for both training and validation data.

In [133]:
image_convert(dict_overall_copy)

Entering Initial For Loop

Entering If Statment - Recursive

Entering Initial For Loop

Split Method: split_random
Dataset: complete

Hash File Name: hash_complete.csv
Hash File Path: hash\split_random\hash_complete.csv
Image Set File Directory: imgs\train-split_random\complete

New Hash - Generating
New Hash - Generated
Hash for complete images in split_random do not exist.

Generating LST File: train-split_random status - complete dataset
c0 0
c1 1
c2 2
c3 3
c4 4
c5 5
c6 6
c7 7
c8 8
c9 9


FileNotFoundError: [Errno 2] No such file or directory: 'D:\\Notebooks\\dsba_6190\\team_project\\image_classification_preprocessing\\model_input_files\\split_random\\complete.lst'

Starting generating REC file: train-split_random status - complete dataset


FileNotFoundError: [WinError 3] The system cannot find the path specified: 'D:\\Notebooks\\dsba_6190\\team_project\\image_classification_preprocessing\\model_input_files\\split_random'

REC File complete.


FileNotFoundError: [Errno 2] No such file or directory: 'hash\\split_random\\hash_complete.csv'
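
The `FileNotFoundError` messages above are consistent with the output folders (`model_input_files\split_random` and `hash\split_random`) not existing yet, although I have not confirmed that this is the only cause. One possible guard, sketched below, is to create the parent folder before anything is written to it. The helper name `ensure_parent_dir` is hypothetical.

```python
import os

def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it does not exist yet."""
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return parent

# Example: the path the hash CSV would be written to
example_path = os.path.join("hash", "split_random", "hash_complete.csv")
print("Parent folder needed: {}".format(os.path.dirname(example_path)))
```

Calling `ensure_parent_dir(file_path)` at the top of `write_hash`, and `ensure_parent_dir(lst_file_path)` before the `im2rec.py` call, would be one way to apply it.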

# S3 Upload

## Establish AWS Settings

In [59]:
profile = 'dsba_6190_proj_4'
region = 'us-east-1'
bucket = 'dsba-6190-final-team-project'
prefix = "channels_rec"
split_method = "split_driver"

In [40]:
session = boto3.session.Session(profile_name = profile,
                               region_name = region)

s3_resource = session.resource('s3')

## Upload
We now upload the Recordio format files to S3. The Amazon Sagemaker Algorithm requires a specific folder format, with the training and validation data as separate subfolders of the same directory, with the names train and validation, respectively.

In [54]:
def upload_to_s3(status, dataset):
    # Establish Dynamic File and Path Names
    folder = "model_input_files"
    file_name_local = status + "_split_driver_"  + dataset + ".rec"
    file_name_s3 = status +  ".rec"
    
    # Define Boto3 Resource Inputs
    file_for_upload =  os.path.join(folder, split_method, file_name_local).replace('\\','/')
    s3_key = os.path.join(prefix, split_method, dataset, status, file_name_s3).replace('\\','/')
    
    print("File for Upload:\t{}".format(file_for_upload))
    print("s3 Key:\t\t\t{}".format(s3_key))
    
    # Upload
    s3_resource.Bucket(bucket).upload_file(Filename = file_for_upload, Key = s3_key)
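
To show the folder layout this produces, the key construction can be run on its own without touching AWS. The sketch below mirrors the logic of `upload_to_s3` using `posixpath.join`, which always joins with forward slashes; the helper name `build_s3_key` is hypothetical, and the argument values are the settings defined earlier in this section.

```python
import posixpath

def build_s3_key(prefix, split_method, dataset, status):
    """Build the S3 key for one Recordio file, e.g. .../train/train.rec."""
    file_name = status + ".rec"
    return posixpath.join(prefix, split_method, dataset, status, file_name)

# Train and validation end up as sibling subfolders of the same dataset folder
keys = [
    build_s3_key("channels_rec", "split_driver", "complete", status)
    for status in ("train", "validation")
]
for key in keys:
    print(key)
```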

### Complete Dataset

#### Training

In [55]:
status = status_list[0]
dataset = dataset_list[0]

#upload_to_s3(status, dataset)

File for Upload:	model_input_files/split_driver/train_subset_complete.rec
s3 Key:			channels/split_driver/complete/train/train.rec


#### Validation

In [56]:
status = status_list[1]
dataset = dataset_list[0]

#upload_to_s3(status, dataset)

File for Upload:	model_input_files/split_driver/validation_subset_complete.rec
s3 Key:			channels/split_driver/complete/validation/validation.rec


### Sample Dataset

#### Training

In [57]:
status = status_list[0]
dataset = dataset_list[1]

#upload_to_s3(status, dataset)

File for Upload:	model_input_files/split_driver/train_subset_sample.rec
s3 Key:			channels/split_driver/sample/train/train.rec


#### Validation

In [58]:
status = status_list[1]
dataset = dataset_list[1]

#upload_to_s3(status, dataset)

File for Upload:	model_input_files/split_driver/validation_subset_sample.rec
s3 Key:			channels/split_driver/sample/validation/validation.rec
