# Background and Proposed Method

## Data
The data set is the [State Farm Distracted Driver](https://www.kaggle.com/c/state-farm-distracted-driver-detection) dataset hosted on Kaggle.

The Kaggle page provides three items: 
1. a list of training images with their subject (driver) ID and class ID (CSV)
2. sample_submission.csv, a sample submission file in the correct format (CSV)
3. a zipped folder of all (train/test) images (ZIP)

The data that the algorithm will train on is in the zipped folder. Within the folder are two sub-directories, train and test. The train directory contains the training set, with each class of image in its own subfolder. The test directory contains only images, not separated by class. It is used to score the final algorithm for Kaggle.






## Environment
The current plan is to generate a model in Amazon SageMaker, using an Amazon SageMaker Notebook and the SageMaker Image Classification algorithm. The Notebook will also deploy the model once it is trained.

## Algorithm
The Amazon Sagemaker Image Classification Algorithm has four main inputs:
1. training set of images
2. test set of images
3. lst/rec file for training set
4. lst/rec file for test set

The lst or rec file (either is acceptable) functions as a mapping resource, connecting each image to its location and its class.
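
As a rough illustration of that mapping, an lst file is a tab-separated text file with one row per image: an integer index, a numeric class label, and the image path relative to the image root. The sketch below builds two such rows in memory. The helper name `format_lst_row` is hypothetical, and the exact number formatting written by the real tool may differ.

```python
# Minimal sketch of the .lst layout: index <TAB> label <TAB> relative path.
# format_lst_row is a hypothetical helper, not part of any library.
def format_lst_row(index, label, rel_path):
    """Return one tab-separated .lst row."""
    return f"{index}\t{float(label)}\t{rel_path}"


rows = [
    format_lst_row(0, 3, "c3/img_47595.jpg"),
    format_lst_row(1, 1, "c1/img_56720.jpg"),
]
lst_text = "\n".join(rows)
print(lst_text)
```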

## Data Preparation
The provided test set will not be useful as a test set for the algorithm. The Sagemaker Image Classification Algorithm requires a labeled test set. Therefore, we will need to derive our own test set from the provided training set.

Based on the [Towards Data Science blog post](https://towardsdatascience.com/how-i-tackled-my-first-kaggle-challenge-using-deep-learning-part-1-b0da29e1351b) walking through the same data, we are not going to split the training/test data strictly randomly. Instead, we will split the drivers randomly, so that each unique driver exists in either the training set or the test set, never both.

### Training / Test Set
The first step will be developing a list of unique drivers. We should also check the number of pictures per driver to get an idea of how evenly the pictures are distributed across drivers.

Then, we will generate a list of training and test drivers by randomly selecting from the unique list of drivers, using an 80/20 split.

### LST/REC File Development
The Image Classification Algorithm requires a file that maps the images to their associated classes, one for the training set and one for the test set. This can be in the form of a lst or rec file.

There is a Python script available [here](https://mxnet.apache.org/api/faq/recordio) that will create these files. However, the images must be on your local machine. I have already exported the images to S3. Therefore, I will need to download them again to a local machine and run this script.

#### Training / Test in the Mapping File
Ideally, the training and test files would be physically separated into two separate folders. I would prefer not to do that, as the image folders have already been uploaded to S3.

Therefore, I am going to try to keep the training and test images in the same folder, but develop two mapping documents (as is required). That way, both mapping documents will point at the same image folder, but each will list only the images belonging to its own set.

If this doesn't work, I will revert to physically separating the training and test images.
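
A sketch of what the shared-folder setup might look like as training job input is shown below. It assumes the four channel names (`train`, `validation`, `train_lst`, `validation_lst`) and the `application/x-image` content type that, to my understanding, the SageMaker Image Classification algorithm expects for image-format input; these should be checked against the current SageMaker documentation. The bucket name, the prefixes, and the helper `make_channel` are hypothetical placeholders.

```python
# Sketch only: the bucket and prefixes below are hypothetical placeholders.
bucket = "my-example-bucket"
image_prefix = f"s3://{bucket}/imgs/train/"  # one shared image folder


def make_channel(name, s3_uri, content_type="application/x-image"):
    """Build one input channel definition (hypothetical helper)."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "ContentType": content_type,
    }


# Both image channels point at the same folder; only the lst channels differ.
input_data_config = [
    make_channel("train", image_prefix),
    make_channel("validation", image_prefix),
    make_channel("train_lst", f"s3://{bucket}/lst_files/train/"),
    make_channel("validation_lst", f"s3://{bucket}/lst_files/test/"),
]
```

The design choice here is that the split lives entirely in the two lst files, so no images have to be moved or uploaded again.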



# Import

## Library / Packages

In [41]:
import pandas as pd
import random 

## Data

In [42]:
url = "https://raw.githubusercontent.com/DSBA-6190-Final-Project-Team/DSBA-6190_Final-Project/master/wine_predict/data/driver_imgs_list.csv"
path_file = 'data/driver_imgs_list.csv'

df_driver_index = pd.read_csv(path_file) 

# EDA

In [43]:
df_driver_index.shape

(22424, 3)

In [44]:
df_driver_index.head(5)

Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [45]:
df_driver_index.columns

Index(['subject', 'classname', 'img'], dtype='object')

The following lists each unique driver along with the number of classname entries and images associated with each. Note that the classname count is a row count, not a count of unique classes, so the classname and img columns have equal frequencies.

In [46]:
drivers_gb = df_driver_index.groupby(['subject'])
drivers_gb.count()

subject,classname,img
p002,725,725
p012,823,823
p014,876,876
p015,875,875
p016,1078,1078
p021,1237,1237
p022,1233,1233
p024,1226,1226
p026,1196,1196
p035,848,848


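We will set the training/test split ratio. The value is the fraction of drivers assigned to the test set, so 0.3 gives roughly a 70/30 split of drivers, rather than the 80/20 split proposed earlier.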

In [47]:
train_test_split = 0.3

To split the data into training and test sets, we'll need a unique index of drivers. We don't need the frequency counts.

We will create a shuffled list of unique drivers.


In [48]:
drivers_unique = drivers_gb.groups.keys()
drivers_unique = list(drivers_unique)

random.Random(5590).shuffle(drivers_unique)
print(drivers_unique)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042', 'p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']


Now we'll set the list of drivers in the train set and the test set.

In [49]:
num_drivers_test = round(len(drivers_unique)*train_test_split)
#print(num_drivers_test)
num_drivers_train = len(drivers_unique) - num_drivers_test
#print(num_drivers_train)

In [50]:
drivers_test = drivers_unique[:num_drivers_test]
print(drivers_test)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042']


In [51]:
drivers_train = drivers_unique[-num_drivers_train:]
print(drivers_train)

['p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']
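
Since the whole point of splitting by driver is that no driver appears in both sets, a quick check is worthwhile. The sketch below copies the two lists printed above and confirms that they are disjoint; the variable names `overlap` and `test_share` are introduced here for illustration.

```python
# Driver lists copied from the printed output above.
drivers_test = ['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042']
drivers_train = ['p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061',
                 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']

# No driver may appear in both sets.
overlap = set(drivers_test) & set(drivers_train)
assert not overlap, f"Drivers in both sets: {overlap}"

# Share of drivers (not images) that ended up in the test set.
test_share = len(drivers_test) / (len(drivers_test) + len(drivers_train))
print(len(drivers_test), len(drivers_train), round(test_share, 2))
```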


# Training and Test Image Lists
We'll now create two lists: one of every image file name associated with the training set, and another of every image file name associated with the test set.

We will use these lists to filter the overall lst mapping file.

In [52]:
df_images_test = df_driver_index[df_driver_index['subject'].isin(drivers_test)]
df_images_training = df_driver_index[~df_driver_index['subject'].isin(drivers_test)]

In [53]:
print(df_images_training.shape)
df_images_training.head()

(15686, 3)


Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [54]:
print(df_images_test.shape)
df_images_test.head()

(6738, 3)


Unnamed: 0,subject,classname,img
725,p012,c0,img_10206.jpg
726,p012,c0,img_27079.jpg
727,p012,c0,img_50749.jpg
728,p012,c0,img_97089.jpg
729,p012,c0,img_37741.jpg


# LST Files
## Generate Overall List

The SageMaker Image Classification Algorithm requires either an LST or REC file as input, one for the training set and one for the test set. The file acts as a mapping, connecting each image in the set to its file location and its class.

First we will develop a list, overall.lst, that catalogs every image. We will then import the list, and generate two new lists, train.lst and test.lst. The train and test files will be subsets of overall.lst, filtered based on the training and test split done above.

The lst tool is im2rec.py, found here: [https://mxnet.apache.org/api/faq/recordio](https://mxnet.apache.org/api/faq/recordio)

In [55]:
%run tools/im2rec.py overall imgs/train --list --recursive 

c0 0
c1 1
c2 2
c3 3
c4 4
c5 5
c6 6
c7 7
c8 8
c9 9


## Generate Training / Test List
We will now generate the lst files for the training and test sets. 

We will also create sub-sampled training and test lists for preliminary work. To do this we first need to sample the overall list of images. Then, we will filter the training and test **lst** files with the sampled overall list. In the sample, we will take the same percentage of images from each unique driver.

### Load Overall List

In [56]:
file_name = 'lst_files/overall.lst'

df_overall = pd.read_csv(file_name, sep='\t', header = None)

In [57]:
columns = ('index', 'classname', 'location')
df_overall.columns = columns
df_overall.head()

Unnamed: 0,index,classname,location
0,8127,3.0,c3\img_47595.jpg
1,3662,1.0,c1\img_56720.jpg
2,3008,1.0,c1\img_29658.jpg
3,10611,4.0,c4\img_5539.jpg
4,17309,7.0,c7\img_5064.jpg


### Create Image Column

We will create a new column, img, by splitting the location column on the path separator, so that it lists just the image file name.

In [58]:
# Create Dataframe of split image location
location_split = df_overall.location.str.split('\\', expand=True)

# Add Back Image File Name to original dataframe.
df_overall['img'] = location_split[1]
df_overall.head()

Unnamed: 0,index,classname,location,img
0,8127,3.0,c3\img_47595.jpg,img_47595.jpg
1,3662,1.0,c1\img_56720.jpg,img_56720.jpg
2,3008,1.0,c1\img_29658.jpg,img_29658.jpg
3,10611,4.0,c4\img_5539.jpg,img_5539.jpg
4,17309,7.0,c7\img_5064.jpg,img_5064.jpg
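
The backslashes in the location column suggest the list was generated on Windows. Training would run on Linux, where the path separator is a forward slash, so it may be necessary to normalise the paths before export; I have not confirmed how the algorithm treats backslashes. The sketch below, on a two-row toy frame, normalises the separator first and then takes the last path component, which works for either style of input.

```python
import pandas as pd

# Toy frame with one Windows-style and one POSIX-style path.
df = pd.DataFrame({"location": ["c3\\img_47595.jpg", "c1/img_56720.jpg"]})

# Replace literal backslashes with forward slashes (no regex involved).
df["location"] = df["location"].str.replace("\\", "/", regex=False)

# The file name is whatever follows the last forward slash.
df["img"] = df["location"].str.rsplit("/", n=1).str[-1]
print(df)
```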


### Filter Overall List
We will now filter the overall list to create the training and test set lists. Then, we will drop the img column so that when these files are exported, they maintain the correct **lst** format.

#### Total Data Set

In [59]:
df_train_lst_all = df_overall[df_overall['img'].isin(df_images_training['img'])]
df_train_lst = df_train_lst_all.drop(['img'], axis=1)

#Checks
print(df_train_lst.shape)
df_train_lst.head()

(15686, 3)


Unnamed: 0,index,classname,location
0,8127,3.0,c3\img_47595.jpg
2,3008,1.0,c1\img_29658.jpg
3,10611,4.0,c4\img_5539.jpg
4,17309,7.0,c7\img_5064.jpg
6,5741,2.0,c2\img_45950.jpg


In [60]:
df_test_lst_all = df_overall[~df_overall['img'].isin(df_images_training['img'])]
df_test_lst = df_test_lst_all.drop(['img'], axis=1)

#Checks
print(df_test_lst.shape)
df_test_lst.head()

(6738, 3)


Unnamed: 0,index,classname,location
1,3662,1.0,c1\img_56720.jpg
5,1310,0.0,c0\img_56666.jpg
7,7177,3.0,c3\img_11920.jpg
10,20027,8.0,c8\img_87156.jpg
12,7265,3.0,c3\img_14803.jpg
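
The shapes above add up (15686 + 6738 = 22424), which suggests the two lists partition the overall list. A minimal sketch of making that check explicit is shown below on toy data; the frame contents and the names `train_imgs` and `test_imgs` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the overall list.
df_overall = pd.DataFrame({
    "index": [0, 1, 2, 3],
    "classname": [0.0, 1.0, 0.0, 1.0],
    "img": ["img_1.jpg", "img_2.jpg", "img_3.jpg", "img_4.jpg"],
})
train_imgs = pd.Series(["img_1.jpg", "img_3.jpg"])
test_imgs = pd.Series(["img_2.jpg", "img_4.jpg"])

df_train = df_overall[df_overall["img"].isin(train_imgs)]
df_test = df_overall[df_overall["img"].isin(test_imgs)]

# Every image should land in exactly one of the two lists.
assert len(df_train) + len(df_test) == len(df_overall)
assert set(df_train["img"]).isdisjoint(df_test["img"])
```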


#### Sampled Data Set


In [61]:
sample_frac = 0.1

In [62]:
# Create Sample of Training Images
df_images_training_sample = df_images_training.groupby('subject').apply(pd.DataFrame.sample, 
                                                                        frac=sample_frac)

df_images_training_sample.shape

(1569, 3)

In [63]:
# Create Sample of Test Images
df_images_test_sample = df_images_test.groupby('subject').apply(pd.DataFrame.sample, 
                                                                        frac=sample_frac)

df_images_test_sample.shape

(673, 3)
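
The samples above are drawn without a seed, so they will differ on every run. A possible reproducible variant, assuming pandas 1.1 or later where `DataFrameGroupBy.sample` is available, is sketched below on toy data. The seed value simply reuses the one from the driver shuffle; the toy frame is made up for illustration.

```python
import pandas as pd

# Toy data: 20 images for one driver, 10 for another.
df = pd.DataFrame({
    "subject": ["p002"] * 20 + ["p012"] * 10,
    "img": [f"img_{i}.jpg" for i in range(30)],
})

# Draw the same fraction from every driver, with a fixed seed.
sample = df.groupby("subject").sample(frac=0.1, random_state=5590)
counts = sample["subject"].value_counts().to_dict()
print(counts)
```

Unlike `groupby(...).apply(pd.DataFrame.sample, ...)`, this keeps the original flat index rather than adding the group key as an extra index level.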

##### Training

In [64]:
df_train_sample_lst_all = df_overall[df_overall['img'].isin(df_images_training_sample['img'])]
df_train_sample_lst = df_train_sample_lst_all.drop(['img'], axis=1)

#Checks
print(df_train_sample_lst.shape)
df_train_sample_lst.head()

(1569, 3)


Unnamed: 0,index,classname,location
11,13097,5.0,c5\img_61196.jpg
14,15088,6.0,c6\img_48528.jpg
16,8853,3.0,c3\img_76235.jpg
37,16476,7.0,c7\img_11763.jpg
41,12573,5.0,c5\img_40810.jpg


##### Test

In [65]:
df_test_sample_lst_all = df_overall[df_overall['img'].isin(df_images_test_sample['img'])]
df_test_sample_lst = df_test_sample_lst_all.drop(['img'], axis=1)

#Checks
print(df_test_sample_lst.shape)
df_test_sample_lst.head()

(673, 3)


Unnamed: 0,index,classname,location
102,3892,1.0,c1\img_65134.jpg
109,2855,1.0,c1\img_23212.jpg
120,2861,1.0,c1\img_23793.jpg
197,13307,5.0,c5\img_69824.jpg
205,12949,5.0,c5\img_55613.jpg


### Export

#### Sample Subset

In [66]:
# Training
file_name = "lst_files/train_sample.lst"
df_train_sample_lst.to_csv(file_name, sep='\t', index=None, header=None)

In [67]:
# Testing
file_name = "lst_files/test_sample.lst"
df_test_sample_lst.to_csv(file_name, sep='\t', index=None, header=None)

#### Total Data Set

In [68]:
# Training
file_name = "lst_files/train.lst"
df_train_lst.to_csv(file_name, sep='\t', index=None, header=None)

In [69]:
# Testing
file_name = "lst_files/test.lst"
df_test_lst.to_csv(file_name, sep='\t', index=None, header=None)
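
After exporting, it can be worth reading a file back to confirm it is tab-separated, has no header row, and has exactly three columns. The sketch below does this round trip on a two-row toy frame written to a temporary directory; the file name `train_check.lst` is a hypothetical placeholder.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for one of the lst data frames.
df_lst = pd.DataFrame({
    "index": [8127, 3008],
    "classname": [3.0, 1.0],
    "location": ["c3/img_47595.jpg", "c1/img_29658.jpg"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_check.lst")
    # Same export settings as above: tab-separated, no index, no header.
    df_lst.to_csv(path, sep="\t", index=False, header=False)

    # Inspect the raw first line, then parse the whole file back.
    with open(path) as f:
        first_line = f.readline().rstrip("\n")
    df_back = pd.read_csv(path, sep="\t", header=None)

print(first_line)
print(df_back.shape)
```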