# Notes
## Project Folder Structure
The data for this notebook is too large to store in the GitHub repo, so the files need to be added manually. The raw image files can be found here:

[https://www.kaggle.com/c/state-farm-distracted-driver-detection/data](https://www.kaggle.com/c/state-farm-distracted-driver-detection/data) (Accessed on 3/21/2020)

The file structure should be as follows:

--Working

----imgs

------complete
  
--------train

--------test

The train and test folders come directly from the Kaggle dataset. Subsequent folders will be generated by this notebook.

## Scope
This notebook prepares the images in the distracted driver dataset locally for upload to S3.

This notebook will create two versions of the data. The first will be an LST mapping file, with the image files in JPEG format. The second will be in RECORDIO format.

A Complete dataset and Sample dataset will be created and uploaded.

## Method

This notebook splits the training and validation set randomly, using the **im2rec.py** train/validate function.

# Import

Clone MXNet to get the im2rec.py tool.

In [50]:
#!git clone https://github.com/apache/incubator-mxnet.git

## Install

In [51]:
!pip install checksumdir
!pip install opencv-python mxnet



## Library / Packages

In [52]:
import pandas as pd
import random 
import os
import shutil
import boto3

import hashlib
from checksumdir import dirhash

from filecmp import dircmp

#Settings
seed = 5590

## Data

In [53]:
url = "https://raw.githubusercontent.com/DSBA-6190-Final-Project-Team/DSBA-6190_Final-Project/master/wine_predict/data/driver_imgs_list.csv"
path_file = 'data/driver_imgs_list.csv'

df_driver_index = pd.read_csv(path_file) 

# EDA

In [54]:
df_driver_index.shape

(22424, 3)

In [55]:
df_driver_index

Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg
...,...,...,...
22419,p081,c9,img_56936.jpg
22420,p081,c9,img_46218.jpg
22421,p081,c9,img_25946.jpg
22422,p081,c9,img_67850.jpg


In [56]:
df_driver_index.columns

Index(['subject', 'classname', 'img'], dtype='object')

The following lists each unique driver along with the number of classname entries and images associated with each. Note that the classname count is a row count, not a count of unique classes, so the image and classname columns have equal frequencies.

In [57]:
drivers_gb = df_driver_index.groupby(['subject'])
drivers_gb.count()

subject,classname,img
p002,725,725
p012,823,823
p014,876,876
p015,875,875
p016,1078,1078
p021,1237,1237
p022,1233,1233
p024,1226,1226
p026,1196,1196
p035,848,848


# Create Sampled Subset of Images

The **im2rec.py** tool builds RECORDIO files from a folder of images. To create the 10% sample, we therefore need to copy the sampled raw images into their own folder.

## Create Sampled Dataframe
To do this, we can make a dataframe listing every image with the image's relative location, then take a sample of the dataframe. With the sub-sampled dataframe we can copy the sampled set. We start with the master list of images.

In [58]:
df_driver_index.head()

Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


We want to sample from each class. To do so we'll group by class, then apply sampling.

In [59]:
df_driver_index_sample = df_driver_index.groupby('classname').apply(pd.DataFrame.sample, 
                                                                    random_state = seed,  
                                                                    frac=0.1)

df_driver_index_sample

classname,,subject,classname,img
c0,8160,p026,c0,img_87669.jpg
c0,3328,p016,c0,img_67547.jpg
c0,20854,p075,c0,img_43775.jpg
c0,15331,p051,c0,img_17811.jpg
c0,15477,p051,c0,img_99652.jpg
...,...,...,...,...
c9,12635,p045,c9,img_3155.jpg
c9,7987,p024,c9,img_59938.jpg
c9,7999,p024,c9,img_80769.jpg
c9,6728,p022,c9,img_50864.jpg


## Copy
With the sampled dataframe, we can iterate through the list and copy each image into the destination folder for its dataset and class.

In [60]:
def copy_imgs_im2rec(status, df):
    """Copy every image listed in df into imgs/train-split_im2rec/<status>/<classname>/."""
    prefix_subfolder = "imgs"
    prefix_src = "raw/train"
    prefix_dst = "train-split_im2rec"

    for index, row in df.iterrows():
        # Source: imgs/raw/train/<classname>/<img>
        path_src = os.path.join(prefix_subfolder,
                                prefix_src,
                                row['classname'],
                                row['img']).replace('\\','/')

        # Destination: imgs/train-split_im2rec/<status>/<classname>/<img>
        path_dst = os.path.join(prefix_subfolder,
                                prefix_dst,
                                status,
                                row['classname'],
                                row['img']).replace('\\','/')

        # Copy only files that are not already in place
        if not os.path.exists(path_dst):
            # Verify Directory Exists. Create if Not
            os.makedirs(os.path.dirname(path_dst), exist_ok=True)

            shutil.copy(path_src, path_dst)

Now we execute the function for both the complete data and the sample data.

**Note:** I have been having issues when generating RECORDIO files in a loop. The last files generated in the loop appear to be held open by Python until the Python session is restarted, and I'm worried they are being corrupted in some way. To try to avoid this, I am creating a third, dummy dataset to run at the end of the loop. This data will not be used in the analysis. It is only there to be the last action in the loop.

In [61]:
dict_input = {
    "complete": df_driver_index,
    "sample" : df_driver_index_sample,
    "dummy" : df_driver_index_sample
}

In [62]:
for key in dict_input:
    copy_imgs_im2rec(key, dict_input[key])

# Verify Folder Size

We will now verify the size of the resulting folders.

In [63]:
# Function calculates the number of files in a directory, recursively
def num_files(dir_scan):
    cpt = sum([len(files) for r, d, files in os.walk(dir_scan)])
    print("Directory Analyzed:\t{}".format(dir_scan))
    print("\t\t\tThere are {} image files in the directory.".format(cpt))
    print()

In [64]:
path_current = "D:\\Notebooks\\dsba_6190\\team_project\\image_classification_preprocessing"

## Complete and Sample Datasets

In [65]:
prefix_im2rec = "imgs\\train-split_im2rec"

# Complete
dir_complete = os.path.join(path_current, prefix_im2rec, "complete")
num_files(dir_complete)

# Sample
dir_sample = os.path.join(path_current, prefix_im2rec, "sample")
num_files(dir_sample)

Directory Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\train-split_im2rec\complete
			There are 22424 image files in the directory.

Directory Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\train-split_im2rec\sample
			There are 2243 image files in the directory.



# Process Flow

Some of the downstream processing steps are computationally heavy. We therefore want to skip these code chunks when their inputs have not changed, without having to manually comment them out after each run.

To check if any of the input files have changed, which would then require reprocessing the images, we will use Hash checksum values to document the state of the inputs. We will check against this state to determine if we need to reprocess the data. 

To do this, we will have to convert the processing steps into callable functions.
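
As a rough illustration of the idea, the sketch below hashes a directory using only the standard library and compares the result against a stored value. It is not how `checksumdir.dirhash` is implemented, and the helper names `hash_directory` and `needs_reprocessing` are hypothetical; the notebook itself relies on `dirhash`.

```python
import hashlib
import os
import tempfile


def hash_directory(dir_path):
    """Return a SHA-256 hex digest covering every relative path and file content under dir_path."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(dir_path):
        dirs.sort()  # fix the traversal order so the digest is reproducible
        for name in sorted(files):
            full_path = os.path.join(root, name)
            rel_path = os.path.relpath(full_path, dir_path).replace(os.sep, "/")
            digest.update(rel_path.encode("utf-8"))
            with open(full_path, "rb") as fh:
                # Read in chunks so large image files do not need to fit in memory
                for chunk in iter(lambda: fh.read(65536), b""):
                    digest.update(chunk)
    return digest.hexdigest()


def needs_reprocessing(dir_path, stored_hash):
    """True when no hash has been stored yet or the directory content has changed."""
    return stored_hash is None or stored_hash != hash_directory(dir_path)


# Small demonstration on a throwaway directory (file contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "img_1.jpg"), "wb") as fh:
        fh.write(b"fake image bytes")
    first_hash = hash_directory(tmp)
    unchanged = needs_reprocessing(tmp, first_hash)  # nothing changed since hashing

    with open(os.path.join(tmp, "img_2.jpg"), "wb") as fh:
        fh.write(b"another image")
    changed = needs_reprocessing(tmp, first_hash)    # a file was added
```

Storing the digest next to the processed outputs means a rerun only has to recompute one hash to decide whether the expensive steps can be skipped.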

## im2rec Function Calls
The following functions use the im2rec.py tool to process the image inputs.

[https://mxnet.apache.org/api/faq/recordio](https://mxnet.apache.org/api/faq/recordio) (Accessed on 3/20/2020)

### LST File Generation
This function performs the first step, creating LST mapping files.

In [66]:
def im2rec_lst(status, dataset):
    lst_file_name = dataset
    lst_file_path = os.path.join("model_input_files", "split_im2rec", dataset)
    img_loc =  os.path.join("imgs", status, dataset)
    
    %run incubator-mxnet/tools/im2rec.py --list --train-ratio 0.8  --recursive {lst_file_path} {img_loc} 
    
    return img_loc, lst_file_path
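
For orientation, an LST file is a tab-separated text file with one image per line: an integer index, a numeric label, and the image's relative path. The sketch below illustrates a random 80/20 split into two such lists under those assumptions. It is not the code that im2rec.py runs, and `split_lst_lines` and the sample entries are hypothetical.

```python
import random


def split_lst_lines(items, train_ratio=0.8, seed=5590):
    """Shuffle (label, relative_path) pairs and return (train_lines, val_lines) in LST layout."""
    rng = random.Random(seed)
    indexed = list(enumerate(items))
    rng.shuffle(indexed)
    n_train = int(len(indexed) * train_ratio)

    def to_line(entry):
        idx, (label, rel_path) = entry
        # index <TAB> label <TAB> relative path
        return "{}\t{:f}\t{}".format(idx, float(label), rel_path)

    train_lines = [to_line(e) for e in indexed[:n_train]]
    val_lines = [to_line(e) for e in indexed[n_train:]]
    return train_lines, val_lines


# Ten made-up entries: label 0 for class c0, label 1 for class c1.
items = [(i % 2, "c{}/img_{}.jpg".format(i % 2, i)) for i in range(10)]
train_lines, val_lines = split_lst_lines(items)
```

Because the split is random, each class is only approximately divided 80/20 between the two lists.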

### RECORDIO Conversion
This function converts the images into the RECORDIO binary file format. In addition to converting the raw image files to RECORDIO, we need to resize the images. The Amazon SageMaker Image Classification Hyperparameters documentation ([here](https://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html)) lists a parameter called **image_shape**. This parameter requires a string of three comma-separated numbers (e.g. "1,2,3").

The first value is **num_channels**, the number of channels in the input images. Ours are color images, so they have three channels (Red, Green, and Blue, aka RGB). The second and third values are the height and width of the images, in pixels. Technically, the algorithm can accommodate images of any size, but if the images are too large, there may be memory constraints. The documentation indicates typical dimensions are 244 x 244. Our images are 640 x 480, which is roughly 5x larger by area. So, we need to resize the images.

The tool **im2rec.py** is capable of resizing images during the RECORDIO conversion. All we need to do is add the argument "**--resize #**" to the command line entry. The number in the argument is the length, in pixels, that the tool will resize the ***shortest edge*** of the input image to. The longer edge is resized to maintain the same aspect ratio.

Therefore, we are going to resize the input images so that the resulting area is approximately equal to the area of a 244 x 244 image. The dimensions we are going to use are 210 x 280. This maintains the aspect ratio of the input images, while having an area about 98.8% that of the default image size. So, in the command line argument, the call will be **--resize 210**.
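
The resize arithmetic can be checked with a few lines of plain Python. This is only a worked calculation using the numbers quoted in the text (640 x 480 source images, a 244 x 244 reference size, and a shortest edge of 210); the variable names are illustrative.

```python
src_w, src_h = 640, 480   # original image size in pixels
ref_side = 244            # reference edge length quoted in the text
short_edge = 210          # value passed to --resize

# Scale the longer edge by the same factor as the shorter edge.
scale = short_edge / min(src_w, src_h)
new_w = round(src_w * scale)
new_h = round(src_h * scale)

# Compare areas against the square reference image.
area_ratio_src = (src_w * src_h) / ref_side ** 2
area_ratio_new = (new_w * new_h) / ref_side ** 2

print("Resized dimensions: {} x {}".format(new_w, new_h))
print("Original area vs reference: {:.2f}x".format(area_ratio_src))
print("Resized area vs reference: {:.1%}".format(area_ratio_new))
```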

In [67]:
def im2rec_rec(img_loc, lst_file_path):
    
    %run incubator-mxnet/tools/im2rec.py  --resize 210 {lst_file_path} {img_loc} 

## Writing Hash Checksum
The following function generates the hash of an image directory and writes it to a CSV file.

In [68]:
def write_hash(status, file_name, file_path, file_dir):

    # Generate Hash
    dir_hash = dirhash(file_dir, 'sha256')
    
    # Write to CSV
    dict_dir_hash = {'hash': [dir_hash]}
    df_dir_hash = pd.DataFrame(dict_dir_hash)
    df_dir_hash.to_csv(file_path, index=False)
    print("Hash for {} images successfully written to CSV.".format(status))
    print()

## Combine Processing and Writing Hash
Whenever we process the input data we need to write a new hash, and vice versa. Therefore, for simplicity in the final code, we will lump these actions together.


In [69]:
def im2rec_and_write(status, dataset, file_name, file_path, file_dir):
        print("Generating LST File: {} status - {} dataset".format(status, dataset))
        
        img_loc, lst_file_path = im2rec_lst(status, dataset)
        
        print("Starting generating REC file: {} status - {} dataset".format(status, dataset))
        
        im2rec_rec(img_loc, lst_file_path)
        
        print("REC File complete.")
        
        write_hash(status, file_name, file_path, file_dir)

## Process Flow Function
The following function will verify the hash of the current files matches the older record. If they match, no action is taken. If they do not match, the input data is processed, and a new hash is generated.

In [70]:
def verify_hash_im2rec(dataset):
    status = "train-split_im2rec"
    
    # Verify dataset variable is correct.
    dataset_list = ["sample", "complete", "dummy"]
    
    if dataset not in dataset_list:
        print("Error. Correct Dataset Type Not Entered.")
        print("Dataset must be one of: sample, complete, dummy.")
        return
    
    print()
    print("Dataset: {}".format(dataset))
    print()
            
              
    # Generate File Name, Path and Directory for Hash 
    file_name = "hash_{}.csv".format(dataset)
    file_path_hash = os.path.join("hash", "split_im2rec", file_name)
    file_dir = os.path.join('imgs', status, dataset)
    
    # Print for Sanity Check
    print("Hash File Name: {}".format(file_name))
    print("Hash File Path: {}".format(file_path_hash))
    print("Image Set File Directory: {}".format(file_dir))
    print()
    
    # Generate Current Hash
    print('New Hash - Generating')
    hash_new = dirhash(file_dir, 'sha256')
    print('New Hash - Generated')
    
    if not os.path.exists(file_path_hash):
        print("Hash for {} images in {} do not exist.".format(dataset, status))
        print()
        im2rec_and_write(status, dataset, file_name, file_path_hash, file_dir)
        return
    
    else:
        # Read Existing Hash
        df_hash_old = pd.read_csv(file_path_hash)
        hash_old = df_hash_old.iloc[0]['hash']
        
        if hash_old == hash_new:
            print("Hash for {} images are equal.".format(status))
            print("New Hash not generated. Processing not required.")
            print()
            return
        
        else:
            print("Hash for {} images are NOT equal.".format(status))
            print()
            # Inputs changed: reprocess the images and store the new hash
            im2rec_and_write(status, dataset, file_name, file_path_hash, file_dir)

## Execute Process
Running the following code chunk will execute the process flow function developed above for the complete, sample, and dummy datasets.

In [71]:
dataset_list = ["complete", "sample", "dummy"]

for dataset in dataset_list:
    verify_hash_im2rec(dataset)


Dataset: complete

Hash File Name: hash_complete.csv
Hash File Path: hash\split_im2rec\hash_complete.csv
Image Set File Directory: imgs\train-split_im2rec\complete

New Hash - Generating
New Hash - Generated
Hash for train-split_im2rec images are equal.
New Hash not generated. Processing not required.


Dataset: sample

Hash File Name: hash_sample.csv
Hash File Path: hash\split_im2rec\hash_sample.csv
Image Set File Directory: imgs\train-split_im2rec\sample

New Hash - Generating
New Hash - Generated
Hash for train-split_im2rec images are equal.
New Hash not generated. Processing not required.


Dataset: dummy

Hash File Name: hash_dummy.csv
Hash File Path: hash\split_im2rec\hash_dummy.csv
Image Set File Directory: imgs\train-split_im2rec\dummy

New Hash - Generating
New Hash - Generated
Hash for train-split_im2rec images are equal.
New Hash not generated. Processing not required.



# S3 Upload

## Establish AWS Settings

In [82]:
profile = 'dsba_6190_proj_4'
region = 'us-east-1'
bucket = 'dsba-6190-final-team-project'
prefix_1 = "channels"
prefix_2 = "rec"
split_method = "split_im2rec"

In [83]:
session = boto3.session.Session(profile_name = profile,
                               region_name = region)

s3_resource = session.resource('s3')

## Upload
We now upload the RECORDIO format files to S3. The Amazon SageMaker algorithm requires a specific folder format, with the training and validation data as separate subfolders of the same directory, named train and validation, respectively.

In [84]:
def upload_to_s3(status, status_tag, dataset):
    # Establish Dynamic File and Path Names
    folder = "model_input_files"
    file_name_local = dataset + "_"  + status_tag + ".rec"
    file_name_s3 = status +  ".rec"
    
    # Define Boto3 Resource Inputs
    file_for_upload =  os.path.join(folder, split_method, file_name_local).replace('\\','/')
    s3_key = os.path.join(prefix_1, prefix_2, split_method, dataset, status, file_name_s3).replace('\\','/')
    
    print("File for Upload:\t{}".format(file_for_upload))
    print("s3 Key:\t\t\t{}".format(s3_key))
    
    # Upload
    s3_resource.Bucket(bucket).upload_file(Filename = file_for_upload, Key = s3_key)
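
The naming scheme used by `upload_to_s3` can be checked without touching AWS. The sketch below rebuilds only the local file path and S3 key; `build_upload_paths` is a hypothetical helper, and its default values are copied from the settings above.

```python
import posixpath


def build_upload_paths(status, status_tag, dataset,
                       folder="model_input_files",
                       split_method="split_im2rec",
                       prefix_1="channels",
                       prefix_2="rec"):
    """Return (local_file, s3_key) following the naming used by upload_to_s3."""
    file_name_local = "{}_{}.rec".format(dataset, status_tag)
    file_name_s3 = "{}.rec".format(status)
    # posixpath always joins with forward slashes, which is what S3 keys expect
    local_file = posixpath.join(folder, split_method, file_name_local)
    s3_key = posixpath.join(prefix_1, prefix_2, split_method,
                            dataset, status, file_name_s3)
    return local_file, s3_key


local_file, s3_key = build_upload_paths("validation", "val", "sample")
print("File for Upload:\t{}".format(local_file))
print("s3 Key:\t\t\t{}".format(s3_key))
```

Using `posixpath.join` avoids the `.replace('\\','/')` step that `os.path.join` needs on Windows.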

### Complete Dataset

#### Training

In [85]:
status = "train"
status_tag = status
dataset = dataset_list[0]

#upload_to_s3(status, status_tag, dataset)

File for Upload:	model_input_files/split_im2rec/complete_train.rec
s3 Key:			channels/rec/split_im2rec/complete/train/train.rec


#### Validation

In [86]:
status = "validation"
status_tag = "val"
dataset = dataset_list[0]

#upload_to_s3(status, status_tag, dataset)

File for Upload:	model_input_files/split_im2rec/complete_val.rec
s3 Key:			channels/rec/split_im2rec/complete/validation/validation.rec


### Sample Dataset

#### Training

In [87]:
status = "train"
status_tag = status
dataset = dataset_list[1]

#upload_to_s3(status, status_tag, dataset)

File for Upload:	model_input_files/split_im2rec/sample_train.rec
s3 Key:			channels/rec/split_im2rec/sample/train/train.rec


#### Validation

In [88]:
status = "validation"
status_tag = "val"
dataset = dataset_list[1]

#upload_to_s3(status, status_tag, dataset)

File for Upload:	model_input_files/split_im2rec/sample_val.rec
s3 Key:			channels/rec/split_im2rec/sample/validation/validation.rec
