# Notes
## Project Folder Structure
The data for this project is too large to store in the GitHub repo, so the files need to be added manually. The raw image files can be found here:

[https://www.kaggle.com/c/state-farm-distracted-driver-detection/data](https://www.kaggle.com/c/state-farm-distracted-driver-detection/data) (Accessed on 3/21/2020)

The file structure should be as follows:

--Working

----imgs

------complete
  
--------train

--------test

The train and test folders come directly from the Kaggle dataset. Subsequent folders will be generated by this notebook.
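Before running the rest of the notebook, it can help to confirm the manually added folders are in place. The following is a minimal sketch, assuming the notebook is run from inside the Working folder; the helper name `missing_dirs` and the `REQUIRED` list are illustrative and not part of the original notebook.

```python
from pathlib import Path

def missing_dirs(root, required):
    """Return the required sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).is_dir()]

# Expected layout, taken from the structure listed above
REQUIRED = ["imgs/complete/train", "imgs/complete/test"]

missing = missing_dirs(".", REQUIRED)
if missing:
    print("Missing folders:", missing)
else:
    print("Folder structure looks complete.")
```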

## Scope
This notebook prepares the images in the distracted driver dataset locally for upload to S3.

This notebook will create two versions of the data. The first will be an LST mapping file, with the image files in JPEG format. The second will be in RecordIO format.

A Complete dataset and Sample dataset will be created and uploaded.

# Import

## Install

In [1]:
!pip install checksumdir



## Library / Packages

In [2]:
import pandas as pd
import random 
import os
import shutil
import boto3

import hashlib
from checksumdir import dirhash

from filecmp import dircmp

#Settings
seed = 5590

## Data

In [3]:
url = "https://raw.githubusercontent.com/DSBA-6190-Final-Project-Team/DSBA-6190_Final-Project/master/wine_predict/data/driver_imgs_list.csv"
path_file = 'data/driver_imgs_list.csv'

df_driver_index = pd.read_csv(path_file) 

# EDA

In [4]:
df_driver_index.shape

(22424, 3)

In [5]:
df_driver_index.head(5)

Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [6]:
df_driver_index.columns

Index(['subject', 'classname', 'img'], dtype='object')

The following lists each unique driver along with the number of classname and image records associated with each. Note that these are row counts, not counts of unique classnames, so the classname and img columns have equal frequencies.

In [7]:
drivers_gb = df_driver_index.groupby(['subject'])
drivers_gb.count()

Unnamed: 0_level_0,classname,img
subject,Unnamed: 1_level_1,Unnamed: 2_level_1
p002,725,725
p012,823,823
p014,876,876
p015,875,875
p016,1078,1078
p021,1237,1237
p022,1233,1233
p024,1226,1226
p026,1196,1196
p035,848,848
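Since `count()` tallies rows rather than distinct values, a per-driver count of distinct classnames needs `nunique()` instead. The sketch below illustrates the difference on a small made-up dataframe with the same column names; the rows and the output column names `n_images` and `n_classes` are hypothetical.

```python
import pandas as pd

# Toy index with the same columns as driver_imgs_list.csv (values are made up)
df = pd.DataFrame({
    "subject":   ["p001", "p001", "p001", "p002", "p002"],
    "classname": ["c0",   "c0",   "c1",   "c0",   "c0"],
    "img":       ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
})

# count() tallies rows per driver; nunique() tallies distinct values per driver
summary = df.groupby("subject").agg(
    n_images=("img", "count"),
    n_classes=("classname", "nunique"),
)
print(summary)
```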


We will set the training/validation split ratio.

In [8]:
train_val_split = 0.3

To split the data into training and validation sets, we'll need a unique index of drivers. We don't need the frequency counts.

We will create a shuffled list of unique drivers.


In [9]:
drivers_unique = drivers_gb.groups.keys()
drivers_unique = list(drivers_unique)

random.Random(seed).shuffle(drivers_unique)
print(drivers_unique)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042', 'p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']


We'll set the list of drivers in the train set and the validation set.

In [10]:
num_drivers_val = round(len(drivers_unique)*train_val_split)
#print(num_drivers_val)
num_drivers_train = len(drivers_unique) - num_drivers_val
#print(num_drivers_train)

In [11]:
drivers_val = drivers_unique[:num_drivers_val]
print(drivers_val)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042']


In [12]:
drivers_train = drivers_unique[-num_drivers_train:]
print(drivers_train)

['p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']
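Splitting by driver rather than by image keeps every image of a given driver on one side of the split. The steps above can be condensed into one helper, sketched here with made-up driver ids; `split_groups` is an illustrative name, and the final assertion checks that no driver appears in both sets.

```python
import random

def split_groups(groups, val_fraction, seed):
    """Shuffle unique group ids and split them into (validation, train) lists."""
    shuffled = sorted(set(groups))
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[:n_val], shuffled[n_val:]

drivers = ["p%03d" % i for i in range(26)]  # hypothetical driver ids
val, train = split_groups(drivers, val_fraction=0.3, seed=5590)

# No driver may appear in both sets
assert not set(val) & set(train)
print(len(val), len(train))  # 8 18
```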


# Training and Validation Image Lists
We'll now create two lists: one of every image file name associated with the training set, and another for the validation set.

We will use these lists to filter the overall LST mapping file.

In [13]:
df_images_val = df_driver_index[df_driver_index['subject'].isin(drivers_val)]
df_images_train = df_driver_index[~df_driver_index['subject'].isin(drivers_val)]

In [14]:
print(df_images_train.shape)
df_images_train.head()

(15686, 3)


Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [15]:
print(df_images_val.shape)
df_images_val.head()

(6738, 3)


Unnamed: 0,subject,classname,img
725,p012,c0,img_10206.jpg
726,p012,c0,img_27079.jpg
727,p012,c0,img_50749.jpg
728,p012,c0,img_97089.jpg
729,p012,c0,img_37741.jpg


# Move Images
We need to create two folders of images, one for training and one for validation. We will leave the complete folder of images alone, in case we need to go back and extract information from that folder.

## Execute Copy
The following function takes the training or validation dataframe and iterates through each row. It then copies each file from the overall data to the respective folder.

In [16]:
def copy_imgs_all(status, df):
    prefix_overall = "imgs\\complete\\train"
    folder_suffix = "_subset"
    folder_name = status + folder_suffix
    prefix_status = os.path.join("imgs",folder_name)

    for index, row in df.iterrows():
        path_src = os.path.join(prefix_overall,
                                row['classname'], 
                                row['img']).replace('\\','/')
             
       
        path_dst = os.path.join("imgs\\complete",
                               folder_name, 
                               row['classname'],
                               row['img']).replace('\\','/')
        
        #print("Source File:\t\t{}".format(path_src))
        #print("Destination File:\t{}".format(path_dst))
    
        
        if not os.path.exists(path_dst):
            # Verify Directory Exists. Create if Not 
            os.makedirs(os.path.dirname(path_dst), exist_ok=True)
            
            shutil.copy(path_src, path_dst)
            #print("Loop Start")
            #print("Source File:\t\t{}".format(path_src))
            #print("Destination File:\t{}".format(path_dst))
            #print()
        #else:
            #print("File Exists. No Copy Made.")
            

We will create a dictionary containing the status and associated dataframes. This will be looped over.

In [17]:
dict_image_df = {
    "train": df_images_train,
    "validation" : df_images_val
}

In [18]:
for key in dict_image_df:
    copy_imgs_all(key, dict_image_df[key])

### Complete Dataset

# Sampling
In order to test the image classification algorithm, we are going to create a subsample of the training and validation sets. We will need to create a list of images for each, and then copy the images into their own directory.

In [19]:
sample_rate = 0.1

## Create Sample Lists

### Training

In [20]:
df_images_train_sample = df_images_train.groupby('classname').apply(pd.DataFrame.sample, 
                                                                    frac = sample_rate, 
                                                                    random_state = seed)
df_images_train_sample

Unnamed: 0_level_0,Unnamed: 1_level_0,subject,classname,img
classname,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
c0,6903,p024,c0,img_86347.jpg
c0,4449,p021,c0,img_69997.jpg
c0,9311,p035,c0,img_30218.jpg
c0,17010,p056,c0,img_17591.jpg
c0,15491,p051,c0,img_24357.jpg
...,...,...,...,...
c9,4273,p016,c9,img_43974.jpg
c9,13457,p047,c9,img_55331.jpg
c9,2386,p014,c9,img_36650.jpg
c9,13507,p047,c9,img_44150.jpg


### Validation

In [21]:
df_images_val_sample = df_images_val.groupby('classname').apply(pd.DataFrame.sample, 
                                                                frac = sample_rate, 
                                                                random_state = seed)
df_images_val_sample

Unnamed: 0_level_0,Unnamed: 1_level_0,subject,classname,img
classname,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
c0,8202,p026,c0,img_34269.jpg
c0,11400,p042,c0,img_59032.jpg
c0,13573,p049,c0,img_11474.jpg
c0,8177,p026,c0,img_26025.jpg
c0,5662,p022,c0,img_9678.jpg
...,...,...,...,...
c9,11937,p042,c9,img_20372.jpg
c9,12668,p045,c9,img_75520.jpg
c9,14464,p049,c9,img_30754.jpg
c9,1495,p012,c9,img_74547.jpg
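The per-class sampling above can also be written with `DataFrameGroupBy.sample`, which is available in pandas 1.1 and later and keeps the original flat index instead of adding a classname level. This is a sketch on made-up data, not the notebook's own cell, and assumes a sufficiently recent pandas version.

```python
import pandas as pd

# Toy data: 20 images of class c0 and 10 of class c1 (file names are made up)
df = pd.DataFrame({
    "classname": ["c0"] * 20 + ["c1"] * 10,
    "img": ["img_%d.jpg" % i for i in range(30)],
})

# Draw 10% of the rows within each class, reproducibly
sample = df.groupby("classname").sample(frac=0.1, random_state=5590)
print(sample["classname"].value_counts().to_dict())
```

Sampling within each class keeps the class proportions of the sample close to those of the full set.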


## Create Dataset

In [22]:
def copy_imgs_sample(status, df):
    
    subfolder_complete = "complete"
    subfolder_sample = "sample"
    folder_name = status + "_subset"

    for index, row in df.iterrows():
        path_src = os.path.join("imgs",
                                subfolder_complete,
                                folder_name,
                                row['classname'], 
                                row['img']).replace('\\','/')
                
        path_dst = os.path.join("imgs",
                                subfolder_sample,
                                folder_name,
                                row['classname'],
                                row['img']).replace('\\','/')
        
        #print("Loop Start")
        #print("Source File:\t\t{}".format(path_src))
        #print("Destination File:\t{}".format(path_dst))
        #print("")    
        
        # Verify directory exists. Create if not.
        os.makedirs(os.path.dirname(path_dst), exist_ok=True)
        
        if not os.path.exists(path_dst):
            shutil.copy(path_src, path_dst)

We establish a dictionary of the sample image lists.

In [23]:
dict_image_df_sample = {
    "train": df_images_train_sample,
    "validation" : df_images_val_sample
}

dict_image_df_sample

{'train':                 subject classname            img
 classname                                       
 c0        6903     p024        c0  img_86347.jpg
           4449     p021        c0  img_69997.jpg
           9311     p035        c0  img_30218.jpg
           17010    p056        c0  img_17591.jpg
           15491    p051        c0  img_24357.jpg
 ...                 ...       ...            ...
 c9        4273     p016        c9  img_43974.jpg
           13457    p047        c9  img_55331.jpg
           2386     p014        c9  img_36650.jpg
           13507    p047        c9  img_44150.jpg
           18519    p061        c9  img_81214.jpg
 
 [1567 rows x 3 columns],
 'validation':                 subject classname            img
 classname                                       
 c0        8202     p026        c0  img_34269.jpg
           11400    p042        c0  img_59032.jpg
           13573    p049        c0  img_11474.jpg
           8177     p026        c0  img_26025.jpg

In [24]:
for key in dict_image_df_sample:
    copy_imgs_sample(key, dict_image_df_sample[key])

# Verify Folder Size

We will now verify the size of the resulting folders.

In [25]:
# Function calculates the number of files in a directory, recursively
def num_files(dir_scan):
    cpt = sum([len(files) for r, d, files in os.walk(dir_scan)])
    print("Directory Analyzed:\t{}".format(dir_scan))
    print("\t\t\tThere are {} image files in the directory.".format(cpt))
    print()

In [26]:
path_current = "D:\\Notebooks\\dsba_6190\\team_project\\image_classification_preprocessing"

## Complete Dataset

In [27]:
prefix_training = "imgs\\complete\\train_subset"
prefix_validation = "imgs\\complete\\validation_subset"

# Training
dir_train = os.path.join(path_current, prefix_training)
num_files(dir_train)

# Validation
dir_validation = os.path.join(path_current, prefix_validation)
num_files(dir_validation)

Directory Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\complete\train_subset
			There are 15686 image files in the directory.

Directory Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\complete\validation_subset
			There are 6738 image files in the directory.



## Sample Dataset

In [28]:
prefix_training = "imgs\\sample\\train_subset"
prefix_validation = "imgs\\sample\\validation_subset"

# Training
dir_train = os.path.join(path_current, prefix_training)
print(dir_train)
num_files(dir_train)

# Validation
dir_validation = os.path.join(path_current, prefix_validation)
num_files(dir_validation)

D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\sample\train_subset
Directory Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\sample\train_subset
			There are 1567 image files in the directory.

Directory Analyzed:	D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\sample\validation_subset
			There are 673 image files in the directory.
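The counts printed above can also be checked programmatically if the counting function returns its result instead of printing it. The following sketch demonstrates this on a throwaway directory; `count_files` is an illustrative variant of `num_files`, and in the notebook the expected value would be the length of the corresponding dataframe.

```python
import os
import tempfile

def count_files(dir_scan):
    """Return the number of files under dir_scan, recursively."""
    return sum(len(files) for _, _, files in os.walk(dir_scan))

# Demonstrate on a throwaway directory with two class folders
with tempfile.TemporaryDirectory() as tmp:
    for classname, n in [("c0", 3), ("c1", 2)]:
        os.makedirs(os.path.join(tmp, classname))
        for i in range(n):
            open(os.path.join(tmp, classname, "img_%d.jpg" % i), "w").close()
    total = count_files(tmp)

expected = 5  # in the notebook this could be len(df_images_train), for example
assert total == expected
print(total)
```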



# Process Flow

Some of the downstream processing steps are computationally heavy. Therefore, we want to avoid re-running these code chunks when nothing has changed, without having to manually comment them out once they have run.

To check if any of the input files have changed, which would then require reprocessing the images, we will use Hash checksum values to document the state of the inputs. We will check against this state to determine if we need to reprocess the data. 

To do this, we will have to convert the processing steps into callable functions.
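The gating idea can be sketched with the standard library alone. This is an illustration only: `dir_digest` is a hypothetical stand-in for `checksumdir.dirhash` and does not reproduce its exact algorithm, and the stored hash is kept in a plain text file rather than a CSV.

```python
import hashlib
import os
import tempfile

def dir_digest(path):
    """SHA-256 over relative file paths and contents, in sorted order."""
    h = hashlib.sha256()
    for root, dirs, files in os.walk(path):
        dirs.sort()  # make traversal order deterministic
        for name in sorted(files):
            full = os.path.join(root, name)
            h.update(os.path.relpath(full, path).encode())
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    h.update(chunk)
    return h.hexdigest()

def run_if_changed(path, hash_file, step):
    """Run step() only when the directory digest differs from the stored one."""
    new = dir_digest(path)
    if os.path.exists(hash_file):
        with open(hash_file) as f:
            if f.read().strip() == new:
                return False  # inputs unchanged, skip the heavy step
    step()
    with open(hash_file, "w") as f:
        f.write(new)
    return True

with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "data")
    os.makedirs(data_dir)
    hash_file = os.path.join(tmp, "data_hash.txt")  # stored outside data_dir
    with open(os.path.join(data_dir, "a.txt"), "w") as f:
        f.write("one")

    calls = []
    step = lambda: calls.append("ran")
    results = [run_if_changed(data_dir, hash_file, step)]      # no stored hash yet
    results.append(run_if_changed(data_dir, hash_file, step))  # inputs unchanged
    with open(os.path.join(data_dir, "a.txt"), "w") as f:
        f.write("two")
    results.append(run_if_changed(data_dir, hash_file, step))  # inputs changed

print(results, len(calls))
```

The hash file is kept outside the hashed directory so that writing it does not change the digest.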

## im2rec Function Calls
The following functions use the im2rec.py tool to process the image inputs.

[https://mxnet.apache.org/api/faq/recordio](https://mxnet.apache.org/api/faq/recordio) (Accessed on 3/20/2020)

### LST File Generation
This function performs the first step, creating LST mapping files.

In [29]:
def im2rec_lst(status, dataset):
    lst_file_name = status + "_subset_" + dataset
    lst_file_path = os.path.join("model_input_files", lst_file_name)
    folder_name = status + "_subset"
    img_loc =  os.path.join("imgs", dataset, folder_name)
    
    %run tools/im2rec.py {lst_file_path} {img_loc} --list --recursive 
    
    return img_loc

### Recordio Conversion
This function converts the images into the RecordIO binary file format. In addition to converting the raw image files to RecordIO, we need to resize the images. The Amazon SageMaker Image Classification Hyperparameters documentation ([here](https://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html)) describes a parameter called **image_shape**. This parameter requires a string of three comma-separated numbers (e.g. "1,2,3").

The first value is **num_channels**, the number of channels in the input images. Ours are color images, so they have three channels (Red, Green, and Blue, aka RGB). The second and third values are the height and width of the images, in pixels. Technically, the algorithm can accommodate images of any size, but if the images are too large, there may be memory constraints. The documentation indicates typical dimensions are 244 x 244. Our images are 640 x 480, which is about 5x bigger by area. So, we need to resize the images.

The tool **im2rec.py** is capable of resizing images during the RecordIO conversion. All we need to do is add the argument "**--resize #**" to the command line entry. The number in the argument is the size to which the tool will resize the ***shortest edge*** of the input image. The longer edge will be resized to maintain the same aspect ratio.

Therefore, we are going to resize the input images so that the resulting area is approximately equal to the area of a 244 x 244 image. The dimensions we are going to use are 210 x 280. This maintains the aspect ratio of the input images while giving an area roughly 98.8% that of the default image size. So, in the command line argument, the call will be **--resize 210**.
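The resize arithmetic can be checked with a few lines of Python. This is a sketch of the calculation only, using the 640 x 480 input size and 244 x 244 reference size quoted above; `resize_shortest_edge` is an illustrative helper, and the exact rounding im2rec.py applies is an assumption.

```python
def resize_shortest_edge(width, height, target):
    """Scale (width, height) so the shorter edge equals target, keeping aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

new_w, new_h = resize_shortest_edge(640, 480, 210)

# Area of the resized image relative to a 244 x 244 reference image
ratio = (new_w * new_h) / (244 * 244)
print(new_w, new_h, round(ratio, 3))  # 280 210 0.988
```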

In [30]:
def im2rec_rec(status, dataset, img_loc):
    lst_file_name_ext = status + "_subset_" + dataset +".lst"
    lst_file_path = os.path.join("model_input_files", lst_file_name_ext)
    
    %run tools/im2rec.py {lst_file_path} {img_loc} --resize 210
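The `%run` magic only works inside IPython or Jupyter. If these steps were moved to a plain Python script, one option would be to build the same command lines and pass them to `subprocess.run`. The sketch below only constructs the argument lists, using the flags already shown above; the helper name `build_im2rec_cmd` and the example paths are illustrative.

```python
import sys

def build_im2rec_cmd(prefix, root, make_list=False, resize=None,
                     script="tools/im2rec.py"):
    """Build an argv list for im2rec.py; run it with subprocess.run(cmd, check=True)."""
    cmd = [sys.executable, script, prefix, root]
    if make_list:
        cmd += ["--list", "--recursive"]
    if resize is not None:
        cmd += ["--resize", str(resize)]
    return cmd

# Step 1: LST file generation; step 2: RecordIO conversion with resizing
lst_cmd = build_im2rec_cmd("model_input_files/train_subset_sample",
                           "imgs/sample/train_subset", make_list=True)
rec_cmd = build_im2rec_cmd("model_input_files/train_subset_sample.lst",
                           "imgs/sample/train_subset", resize=210)
print(lst_cmd[1:])
print(rec_cmd[1:])
```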

## Writing Hash Checksum
The following function generates the hash of a directory (training or validation images) and writes the hash to a CSV file.

In [31]:
def write_hash(status,file_name, file_path, file_dir):

    # Generate Hash
    dir_hash = dirhash(file_dir, 'sha256')
    
    # Write to CSV
    dict_dir_hash = {'hash': [dir_hash]}
    df_dir_hash = pd.DataFrame(dict_dir_hash)
    df_dir_hash.to_csv(file_path, index=False)
    print("Hash for {} images successfully written to CSV.".format(status))
    print()
    

## Combine Processing and Writing Hash
Whenever we process the input data we need to write a new hash, and vice versa. Therefore, for simplicity in the final code, we will lump these actions together.


In [32]:
def im2rec_and_write(status, subfolder, file_name, file_path, file_dir):
        print("Generating LST File: {}".format(status))
        
        img_loc = im2rec_lst(status, subfolder)
        
        print("Starting generating REC file: {}".format(status))
        
        im2rec_rec(status, subfolder, img_loc)
        
        print("REC File complete.")
        
        write_hash(status,file_name, file_path, file_dir)

## Process Flow Function
The following function will verify the hash of the current files matches the older record. If they match, no action is taken. If they do not match, the input data is processed, and a new hash is generated.

In [33]:
def verify_hash(status, dataset):
    
    # Verify dataset variable is correct.
    dataset_list = ["sample", "complete"]
    
    if dataset not in dataset_list:
        print("Error. Correct Dataset Type Not Entered.")
        print("Dataset must be either type sample or complete.")
        return
    
    print()
    print("Dataset: {},  Input Class: {}".format(dataset, status))
    print()
            
              
    # Generate File Name, Path and Directory
    file_name = "imgs_{}_{}_subset_hash.csv".format(dataset,status)
    file_path = os.path.join("hash", file_name)
    folder_name = status + "_subset"
    file_dir = os.path.join('imgs', dataset, folder_name)
    
    # Print for Sanity Check
    print("File Name: {}".format(file_name))
    print("File Path: {}".format(file_path))
    print("File Directory: {}".format(file_dir))
    print()
    
    # Generate Current Hash
    hash_new = dirhash(file_dir, 'sha256')
    
    if not os.path.exists(file_path):
        print("Hash for {} {} images do not exist.".format(dataset, status))
        print()
        im2rec_and_write(status, dataset, file_name, file_path, file_dir)
        return
    
    else:
        # Read Existing Hash
        df_hash_old = pd.read_csv(file_path)
        hash_old = df_hash_old.iloc[0]['hash']
        
        if hash_old == hash_new:
            print("Hash for {} images are equal.".format(status))
            print("New Hash not generated. Processing not required.")
            print()
            return
        
        else:
            print("Hash for {} images are NOT equal.".format(status))
            print()
            im2rec_and_write(status, dataset, file_name, file_path, file_dir)

## Execute Process
Running the following code chunk will execute the process flow function developed above for both training and validation data.

In [34]:
status_list = ["train", "validation"]
dataset_list = ["complete", "sample"]

for dataset in dataset_list:
    for status in status_list:
        verify_hash(status, dataset)


Dataset: complete,  Input Class: train

File Name: imgs_complete_train_subset_hash.csv
File Path: hash\imgs_complete_train_subset_hash.csv
File Directory: imgs\complete\train_subset

Hash for train images are equal.
New Hash not generated. Processing not required.


Dataset: complete,  Input Class: validation

File Name: imgs_complete_validation_subset_hash.csv
File Path: hash\imgs_complete_validation_subset_hash.csv
File Directory: imgs\complete\validation_subset

Hash for validation images are equal.
New Hash not generated. Processing not required.


Dataset: sample,  Input Class: train

File Name: imgs_sample_train_subset_hash.csv
File Path: hash\imgs_sample_train_subset_hash.csv
File Directory: imgs\sample\train_subset

Hash for train images are equal.
New Hash not generated. Processing not required.


Dataset: sample,  Input Class: validation

File Name: imgs_sample_validation_subset_hash.csv
File Path: hash\imgs_sample_validation_subset_hash.csv
File Directory: imgs\sample\valid

# S3 Upload

## Establish AWS Settings

In [35]:
profile = 'dsba_6190_proj_4'
region = 'us-east-1'
bucket = 'dsba-6190-final-team-project'
prefix = "channels"

In [36]:
session = boto3.session.Session(profile_name = profile,
                               region_name = region)

s3_resource = session.resource('s3')

## Upload
We now upload the RecordIO format files to S3. The Amazon SageMaker algorithm requires a specific folder format, with the training and validation data as separate subfolders of the same directory, named train and validation, respectively.

In [37]:
def upload_to_s3(status, dataset):
    # Establish Dynamic File and Path Names
    folder = "model_input_files"
    file_name_local = status + "_subset_"  + dataset + ".rec"
    file_name_s3 = status +  ".rec"
    
    # Define Boto3 Resource Inputs
    file_for_upload =  os.path.join(folder, file_name_local).replace('\\','/')
    s3_key = os.path.join(prefix, dataset, status, file_name_s3).replace('\\','/')
    
    print("File for Upload:\t{}".format(file_for_upload))
    print("s3 Key:\t\t\t{}".format(s3_key))
    
    # Upload
    s3_resource.Bucket(bucket).upload_file(Filename = file_for_upload, Key = s3_key)
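S3 keys always use forward slashes, regardless of the local operating system. As an alternative to `os.path.join` followed by `replace`, the key could be built with `posixpath.join`. This is a small sketch of the key layout only; `build_s3_key` is an illustrative name.

```python
import posixpath

def build_s3_key(prefix, dataset, status):
    """Return the object key for a channel file, always using '/' separators."""
    return posixpath.join(prefix, dataset, status, status + ".rec")

print(build_s3_key("channels", "complete", "train"))  # channels/complete/train/train.rec
```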

### Complete Dataset

#### Training

In [47]:
status = status_list[0]
dataset = dataset_list[0]

#upload_to_s3(status, dataset)

File for Upload:	model_input_files/train_subset_complete.rec
s3 Key:			channels/complete/train/train.rec


#### Validation

In [48]:
status = status_list[1]
dataset = dataset_list[0]

#upload_to_s3(status, dataset)

File for Upload:	model_input_files/validation_subset_complete.rec
s3 Key:			channels/complete/validation/validation.rec


### Sample Dataset

#### Training

In [49]:
status = status_list[0]
dataset = dataset_list[1]

#upload_to_s3(status, dataset)

File for Upload:	model_input_files/train_subset_sample.rec
s3 Key:			channels/sample/train/train.rec


#### Validation

In [50]:
status = status_list[1]
dataset = dataset_list[1]

#upload_to_s3(status, dataset)

File for Upload:	model_input_files/validation_subset_sample.rec
s3 Key:			channels/sample/validation/validation.rec


# Verify Checksum in S3
To avoid accidentally uploading duplicate files to S3, I want to cross-check the checksum of the local file against the S3 object. The following functions approximate that. Note that, technically, the ETag in S3 is not always the MD5 of the file. Our upload process is simple, so the ETag should generally correspond to the MD5 checksum. Do not use this method for batch uploads.

The following functions were found on the following blog post:

[https://zihao.me/post/calculating-etag-for-aws-s3-objects/](https://zihao.me/post/calculating-etag-for-aws-s3-objects/) (Accessed on 3/21/2020)

## Functions

### Local Check

In [42]:
def md5_checksum(filename):
    m = hashlib.md5()
    with open(filename, 'rb') as f:
        for data in iter(lambda: f.read(1024 * 1024), b''):
            m.update(data)
    return m.hexdigest()

### S3 Check

In [43]:
def etag_checksum(filename, chunk_size=8 * 1024 * 1024):
    md5s = []
    with open(filename, 'rb') as f:
        for data in iter(lambda: f.read(chunk_size), b''):
            md5s.append(hashlib.md5(data).digest())
    m = hashlib.md5(b"".join(md5s))  # digests are bytes, so join with a bytes separator
    return '{}-{}'.format(m.hexdigest(), len(md5s))

### Compare

In [44]:
def etag_compare(filename, etag):
    et = etag[1:-1] # strip quotes
    if '-' in et and et == etag_checksum(filename):
        return True
    if '-' not in et and et == md5_checksum(filename):
        return True
    return False

### Scratch Work

In [45]:
status = status_list[0]
dataset = dataset_list[0]

folder = "model_input_files"
file_name_local = status + "_subset_"  + dataset + ".rec"
file_name_s3 = status +  ".rec"

file_for_upload =  os.path.join(folder, file_name_local).replace('\\','/')
s3_key = os.path.join(prefix, dataset, status, file_name_s3).replace('\\','/')

print("File for Upload:\t{}".format(file_for_upload))
print("s3 Key:\t\t\t{}".format(s3_key))

md5_local = md5_checksum(file_for_upload)

print("Local File MD5 Checksum: {}".format(md5_local))

#S3
obj = s3_resource.Object(bucket, s3_key)

print(obj.e_tag)

File for Upload:	model_input_files/train_subset_complete.rec
s3 Key:			channels/complete/train/train.rec
Local File MD5 Checksum: 5cccc4891b86b3679d48508c9bc4705e
"666c6c08660a90ae1c58936372eb3f18-149"


In [46]:
obj = s3_resource.Object(bucket, s3_key)

print(obj.e_tag[1:-1])

666c6c08660a90ae1c58936372eb3f18-149
