# Background and Proposed Method

## Data
The data set is the [State Farm Distracted Driver](https://www.kaggle.com/c/state-farm-distracted-driver-detection) dataset hosted on Kaggle.

The Kaggle page provides three items: 
1. a list of training images, their subject (driver) id, and class id (CSV)
2. a sample_submission.csv - a sample submission file in the correct format (CSV)
3. zipped folder of all (train/test) images (ZIP)

The data that the algorithm will train on is in the zipped folder. Within the folder are two sub-directories, train and test. The train directory contains the training set of the data, with each class of image within its own subfolder. The test directory contains only images, not seperated by class. It is to be used to score the final algorithm for Kaggle.






## Environment
The current plan is to generate a model in Amazon Sagemaker. A Amazon Sagemaker Notebook, using the Sagemaker Image Classification algorithm, will be used to generate a model. The Notebook will also deplot the model, once trained.

## Algorithm
The Amazon Sagemaker Image Classification Algorithm has four main inputs:
1. training set of images
2. validation set of images
3. lst/rec file for trianing set
4. lst/rec file for validation set

The lst or rec file (either acceptable) functions as a mapping resource, connecting each image to the image location and image class.

## Data Preparation
The provided test set will not be useful as a validation set for the algorithm. The Sagemaker Image Classification Algorithm requires a labeled validation set. Therefore, we will need to derive our own validation set from the provided training set.

Based on the [Towards Data blog post](https://towardsdatascience.com/how-i-tackled-my-first-kaggle-challenge-using-deep-learning-part-1-b0da29e1351b) walking through the same data, we are not going to split the training/validation data strictly randomly. Instead, we will split the drivers randomly, so that a unique driver only exists in either the training or the validation set.

### Training / validation Set
The first step will be developing a list of unique drivers. We should also check the amount of pictures per driver to get an idea of how even the split of pictures per driver is. 

Then, we will generate a list of training and validation drivers by randomly selecting from the unique list of drivers, using a 80/20 split.

### LST/REC File Development
The Image Classification Algorithm requires a file which maps the  images to the associated classes, for the training and validation sets. This can be in the form of a lst or rec file. 

There is python script available [here](https://mxnet.apache.org/api/faq/recordio) which will create these files. But these files must on your machine locally. I have already exported the files to S3. Therefore, I will need to re-download the files to a local machine and run this script. 

#### Training / Validation in the Mapping File
After the script has been generated, we will then need to seperate the training and validation images into seperate, labeled, folders.

# Import

## Library / Packages

In [2]:
import pandas as pd
import random 
import os
import shutil
import boto3

from filecmp import dircmp

## Data

In [3]:
url = "https://raw.githubusercontent.com/DSBA-6190-Final-Project-Team/DSBA-6190_Final-Project/master/wine_predict/data/driver_imgs_list.csv"
path_file = 'data/driver_imgs_list.csv'

df_driver_index = pd.read_csv(path_file) 

# EDA

In [4]:
df_driver_index.shape

(22424, 3)

In [5]:
df_driver_index.head(5)

Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [6]:
df_driver_index.columns

Index(['subject', 'classname', 'img'], dtype='object')

The following lists each unique driver along with the number of different classname and images are associated with each. Note the number of classnames is not unique, so the images an classnames have equal frequencies.

In [7]:
drivers_gb = df_driver_index.groupby(['subject'])
drivers_gb.count()

Unnamed: 0_level_0,classname,img
subject,Unnamed: 1_level_1,Unnamed: 2_level_1
p002,725,725
p012,823,823
p014,876,876
p015,875,875
p016,1078,1078
p021,1237,1237
p022,1233,1233
p024,1226,1226
p026,1196,1196
p035,848,848


We will set the training/validation split ratio.

In [8]:
train_val_split = 0.3

To split the data into training and validation sets, we'll need a unique index of drivers. We don't need the frequency counts.

We will create a shuffled list of unique drivers.


In [9]:
drivers_unique = drivers_gb.groups.keys()
drivers_unique = list(drivers_unique)

random.Random(5590).shuffle(drivers_unique)
print(drivers_unique)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042', 'p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']


validation we'll set the list of drivers in the train set and the test set.

In [10]:
num_drivers_val = round(len(drivers_unique)*train_val_split)
#print(num_drivers_val)
num_drivers_train = len(drivers_unique) - num_drivers_val
#print(num_drivers_train)

In [11]:
drivers_val = drivers_unique[:num_drivers_val]
print(drivers_val)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042']


In [12]:
drivers_train = drivers_unique[-num_drivers_train:]
print(drivers_train)

['p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']


# Training and Validation Image Lists
We'll now create two lists, one list of every image file name associated with the trainging set, another list associated with the test set. 

We will use these lists to filter the overall lst mapping file.

In [13]:
df_images_val = df_driver_index[df_driver_index['subject'].isin(drivers_val)]
df_images_training = df_driver_index[~df_driver_index['subject'].isin(df_images_val)]

In [14]:
print(df_images_training.shape)
df_images_training.head()

(22424, 3)


Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [15]:
print(df_images_val.shape)
df_images_val.head()

(6738, 3)


Unnamed: 0,subject,classname,img
725,p012,c0,img_10206.jpg
726,p012,c0,img_27079.jpg
727,p012,c0,img_50749.jpg
728,p012,c0,img_97089.jpg
729,p012,c0,img_37741.jpg


# LST Files
## Generate Overall List

The Sagemaker Image Classification Algorithm requires eitehr an LST or REC file as input, one for the training and one for the validation set. The file acts as a mapping function, connecting the image of each set to the image file location and the image class.

First we will develop a list, overall.lst, that catalogs every image. We will then import the list, and generate two new lists, train.lst and validation.lst. The train and validation files will be subsets of overall.lst, filtered based on the training and validation split done above.

The lst tool is im2rec.py, found here: [https://mxnet.apache.org/api/faq/recordio](https://mxnet.apache.org/api/faq/recordio)

In [16]:
%run tools/im2rec.py overall imgs/train --list --recursive 

c0 0
c1 1
c2 2
c3 3
c4 4
c5 5
c6 6
c7 7
c8 8
c9 9


## Generate Training / Validation List
We will now generate the lst files for the training and validation sets. 

We will also create sub-sampled training and validation lists, for preliminary work. To do this we first need to sample the overall list of images. Then, we will filter the training and validation **lst** files with the sampled overall list. In the sample, we will take the same sample, by percentage, of each unique driver.

### Upload Overall List

In [17]:
file_name = 'lst_files/overall.lst'

df_overall = pd.read_csv(file_name, sep='\t', header = None)

In [18]:
columns = ('index', 'classname', 'location')
df_overall.columns = columns
df_overall.head()

Unnamed: 0,index,classname,location
0,8127,3.0,c3\img_47595.jpg
1,3662,1.0,c1\img_56720.jpg
2,3008,1.0,c1\img_29658.jpg
3,10611,4.0,c4\img_5539.jpg
4,17309,7.0,c7\img_5064.jpg


### Create Image Column

We will create a new column, using the location column and char_location column, which will just list the image file name. 

In [19]:
# Create Dataframe of split image location
location_split = df_overall.location.str.split('\\', expand=True)

# Add Back Image File Name to original dataframe.
df_overall['img'] = location_split[1]
df_overall.head()

Unnamed: 0,index,classname,location,img
0,8127,3.0,c3\img_47595.jpg,img_47595.jpg
1,3662,1.0,c1\img_56720.jpg,img_56720.jpg
2,3008,1.0,c1\img_29658.jpg,img_29658.jpg
3,10611,4.0,c4\img_5539.jpg,img_5539.jpg
4,17309,7.0,c7\img_5064.jpg,img_5064.jpg


### Filter Overall List
We will now filter the overall list to create the training and validation set lists. Then, we will drop the img column so that when these files are exported, they maintain the correct **lst** format.

#### Total Data Set

In [20]:
df_train_lst_all = df_overall[df_overall['img'].isin(df_images_training['img'])]
df_train_lst = df_train_lst_all.drop(['img'], axis=1)

#Checks
print(df_train_lst.shape)
df_train_lst.head()

(22424, 3)


Unnamed: 0,index,classname,location
0,8127,3.0,c3\img_47595.jpg
1,3662,1.0,c1\img_56720.jpg
2,3008,1.0,c1\img_29658.jpg
3,10611,4.0,c4\img_5539.jpg
4,17309,7.0,c7\img_5064.jpg


In [21]:
df_val_lst_all = df_overall[~df_overall['img'].isin(df_images_training['img'])]
df_val_lst = df_val_lst_all.drop(['img'], axis=1)

#Checks
print(df_val_lst.shape)
df_val_lst.head()

(0, 3)


Unnamed: 0,index,classname,location


#### Sampled Data Set


In [22]:
sample_frac = 0.1

In [23]:
# Create Sample of Training Images
df_images_training_sample = df_images_training.groupby('subject').apply(pd.DataFrame.sample, 
                                                                        frac=sample_frac)

df_images_training_sample.shape

(2242, 3)

In [24]:
# Create Sample of Test Images
df_images_val_sample = df_images_val.groupby('subject').apply(pd.DataFrame.sample, 
                                                                        frac=sample_frac)

df_images_val_sample.shape

(673, 3)

##### Training

In [25]:
df_train_sample_lst_all = df_overall[df_overall['img'].isin(df_images_training_sample['img'])]
df_train_sample_lst = df_train_sample_lst_all.drop(['img'], axis=1)

#Checks
print(df_train_sample_lst.shape)
df_train_sample_lst.head()

(2242, 3)


Unnamed: 0,index,classname,location
6,5741,2.0,c2\img_45950.jpg
42,3510,1.0,c1\img_5090.jpg
46,14632,6.0,c6\img_29516.jpg
49,18392,8.0,c8\img_100688.jpg
80,13934,5.0,c5\img_94751.jpg


##### Validation

In [26]:
df_val_sample_lst_all = df_overall[df_overall['img'].isin(df_images_val_sample['img'])]
df_val_sample_lst = df_val_sample_lst_all.drop(['img'], axis=1)

#Checks
print(df_val_sample_lst.shape)
df_val_sample_lst.head()

(673, 3)


Unnamed: 0,index,classname,location
79,5012,2.0,c2\img_18257.jpg
107,9251,3.0,c3\img_93491.jpg
109,2855,1.0,c1\img_23212.jpg
135,21854,9.0,c9\img_75520.jpg
149,16468,7.0,c7\img_11423.jpg


### Export

#### Sample Subset

In [27]:
# Training
file_name = "lst_files/train_sample.lst"
df_train_sample_lst.to_csv(file_name, sep='\t', index=None, header=None)

In [28]:
# df_val_sample_lst
file_name = "lst_files/validation_sample.lst"
df_val_sample_lst.to_csv(file_name, sep='\t', index=None, header=None)

#### Total Sample

In [29]:
# Training
file_name = "lst_files/train.lst"
df_train_lst.to_csv(file_name, sep='\t', index=None, header=None)

In [30]:
# df_val_sample_lst
file_name = "lst_files/validation.lst"
df_val_lst.to_csv(file_name, sep='\t', index=None, header=None)

## Create Training and Validation Folders
With the training and validation sets established, we will need to physically seperate these files into a train and validation folder. To do this, we'll remove the images in the validation set from the current train folder. We will move these validation images into a seperate validation folder, with an equivalent strucutre to the train folder.

**Note:** The following code to create a Validation folder was started before the plan changed to move files directly in S3. This code will be maintained in case moving the files in S3 fails.

### Create Validation Folder Structure
Before we transfer the images, we need to create the validation strucure.

In [31]:
working_dir = os.getcwd()
print(working_dir)
rel_path = "imgs\\train"
file_location = os.path.join(working_dir, rel_path)
print(file_location)

D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing
D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\train


In [32]:
input_path = file_location
output_rel_path = "imgs\\validation"
output_path = os.path.join(working_dir, output_rel_path)
print(input_path)
print(output_path)

D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\train
D:\Notebooks\dsba_6190\team_project\image_classification_preprocessing\imgs\validation


The following function will be used with the shutil.copytree command. The function will copy the folder strucutre of a source location, and replicated at a destination location. For us, the source is the current train folder. The destination is our validation folder. 

The function was copied from this Stack Overflow thread:
[https://stackoverflow.com/questions/7011814/using-python-to-copy-the-directory-structure](https://stackoverflow.com/questions/7011814/using-python-to-copy-the-directory-structure)

In [33]:
def copy_dir(src, dst, *, follow_sym=True):
    if os.path.(dst):
        dst = os.path.join(dst, os.path.basename(src))
    if os.path.isdir(src):
        shutil.copyfile(src, dst, follow_symlinks=follow_sym)
        shutil.copystat(src, dst, follow_symlinks=follow_sym)
    return dst

SyntaxError: invalid syntax (<ipython-input-33-3f58f1a5bbab>, line 2)

Once this the shutil.copytree function has been executed, it will throw an error. So we will nest this call in a if/else statement to check if our folder has already been created.

In [None]:
if os.path.exists(output_path):
    print("Folder Already Exists.")
else:
    print("Creating Folder Structure...")
    shutil.copytree(input_path, output_path, copy_function=copy_dir) 
    print("...Folder Structure Created.")

## Move Files in S3

The concepts for "moving" files in S3 are taken from this blog post:

[https://medium.com/plusteam/move-and-rename-objects-within-an-s3-bucket-using-boto-3-58b164790b78](https://medium.com/plusteam/move-and-rename-objects-within-an-s3-bucket-using-boto-3-58b164790b78).

In [None]:
s3_resource = boto3.resource('s3')