# Note
This Notebook Lost Source Control of the Images in S3. A New Notebook will be develoepd to re-do the preprocessing of the distracted driver images.

# Background and Proposed Method

## Data
The data set is the [State Farm Distracted Driver](https://www.kaggle.com/c/state-farm-distracted-driver-detection) dataset hosted on Kaggle.

The Kaggle page provides three items: 
1. a list of training images, their subject (driver) id, and class id (CSV)
2. a sample_submission.csv - a sample submission file in the correct format (CSV)
3. zipped folder of all (train/test) images (ZIP)

The data that the algorithm will train on is in the zipped folder. Within the folder are two sub-directories, train and test. The train directory contains the training set of the data, with each class of image within its own subfolder. The test directory contains only images, not seperated by class. It is to be used to score the final algorithm for Kaggle.






## Environment
The current plan is to generate a model in Amazon Sagemaker. A Amazon Sagemaker Notebook, using the Sagemaker Image Classification algorithm, will be used to generate a model. The Notebook will also deplot the model, once trained.

## Algorithm
The Amazon Sagemaker Image Classification Algorithm has four main inputs:
1. training set of images
2. validation set of images
3. lst/rec file for trianing set
4. lst/rec file for validation set

The lst or rec file (either acceptable) functions as a mapping resource, connecting each image to the image location and image class.

## Data Preparation
The provided test set will not be useful as a validation set for the algorithm. The Sagemaker Image Classification Algorithm requires a labeled validation set. Therefore, we will need to derive our own validation set from the provided training set.

Based on the [Towards Data blog post](https://towardsdatascience.com/how-i-tackled-my-first-kaggle-challenge-using-deep-learning-part-1-b0da29e1351b) walking through the same data, we are not going to split the training/validation data strictly randomly. Instead, we will split the drivers randomly, so that a unique driver only exists in either the training or the validation set.

### Training / validation Set
The first step will be developing a list of unique drivers. We should also check the amount of pictures per driver to get an idea of how even the split of pictures per driver is. 

Then, we will generate a list of training and validation drivers by randomly selecting from the unique list of drivers, using a 80/20 split.

### LST/REC File Development
The Image Classification Algorithm requires a file which maps the  images to the associated classes, for the training and validation sets. This can be in the form of a lst or rec file. 

There is python script available [here](https://mxnet.apache.org/api/faq/recordio) which will create these files. But these files must on your machine locally. I have already exported the files to S3. Therefore, I will need to re-download the files to a local machine and run this script. 

#### Training / Validation in the Mapping File
After the script has been generated, we will then need to seperate the training and validation images into seperate, labeled, folders.

# Import

## Library / Packages

In [205]:
import pandas as pd
import random 
import os
import shutil
import boto3

from filecmp import dircmp

## Data

In [206]:
url = "https://raw.githubusercontent.com/DSBA-6190-Final-Project-Team/DSBA-6190_Final-Project/master/wine_predict/data/driver_imgs_list.csv"
path_file = 'data/driver_imgs_list.csv'

df_driver_index = pd.read_csv(path_file) 

# EDA

In [207]:
df_driver_index.shape

(22424, 3)

In [208]:
df_driver_index.head(5)

Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [209]:
df_driver_index.columns

Index(['subject', 'classname', 'img'], dtype='object')

The following lists each unique driver along with the number of different classname and images are associated with each. Note the number of classnames is not unique, so the images an classnames have equal frequencies.

In [210]:
drivers_gb = df_driver_index.groupby(['subject'])
drivers_gb.count()

Unnamed: 0_level_0,classname,img
subject,Unnamed: 1_level_1,Unnamed: 2_level_1
p002,725,725
p012,823,823
p014,876,876
p015,875,875
p016,1078,1078
p021,1237,1237
p022,1233,1233
p024,1226,1226
p026,1196,1196
p035,848,848


We will set the training/validation split ratio.

In [211]:
train_val_split = 0.3

To split the data into training and validation sets, we'll need a unique index of drivers. We don't need the frequency counts.

We will create a shuffled list of unique drivers.


In [212]:
drivers_unique = drivers_gb.groups.keys()
drivers_unique = list(drivers_unique)

random.Random(5590).shuffle(drivers_unique)
print(drivers_unique)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042', 'p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']


validation we'll set the list of drivers in the train set and the test set.

In [213]:
num_drivers_val = round(len(drivers_unique)*train_val_split)
#print(num_drivers_val)
num_drivers_train = len(drivers_unique) - num_drivers_val
#print(num_drivers_train)

In [214]:
drivers_val = drivers_unique[:num_drivers_val]
print(drivers_val)

['p049', 'p022', 'p012', 'p075', 'p045', 'p072', 'p026', 'p042']


In [215]:
drivers_train = drivers_unique[-num_drivers_train:]
print(drivers_train)

['p015', 'p039', 'p047', 'p081', 'p052', 'p014', 'p021', 'p035', 'p061', 'p064', 'p066', 'p056', 'p002', 'p041', 'p024', 'p016', 'p051', 'p050']


# Training and Validation Image Lists
We'll now create two lists, one list of every image file name associated with the trainging set, another list associated with the test set. 

We will use these lists to filter the overall lst mapping file.

In [216]:
df_images_val = df_driver_index[df_driver_index['subject'].isin(drivers_val)]
df_images_training = df_driver_index[~df_driver_index['subject'].isin(df_images_val)]

In [217]:
print(df_images_training.shape)
df_images_training.head()

(22424, 3)


Unnamed: 0,subject,classname,img
0,p002,c0,img_44733.jpg
1,p002,c0,img_72999.jpg
2,p002,c0,img_25094.jpg
3,p002,c0,img_69092.jpg
4,p002,c0,img_92629.jpg


In [218]:
print(df_images_val.shape)
df_images_val.head()

(6738, 3)


Unnamed: 0,subject,classname,img
725,p012,c0,img_10206.jpg
726,p012,c0,img_27079.jpg
727,p012,c0,img_50749.jpg
728,p012,c0,img_97089.jpg
729,p012,c0,img_37741.jpg


# LST Files
## Generate Overall List

The Sagemaker Image Classification Algorithm requires eitehr an LST or REC file as input, one for the training and one for the validation set. The file acts as a mapping function, connecting the image of each set to the image file location and the image class.

First we will develop a list, overall.lst, that catalogs every image. We will then import the list, and generate two new lists, train.lst and validation.lst. The train and validation files will be subsets of overall.lst, filtered based on the training and validation split done above.

The lst tool is im2rec.py, found here: [https://mxnet.apache.org/api/faq/recordio](https://mxnet.apache.org/api/faq/recordio)

In [219]:
%run tools/im2rec.py overall imgs/train --list --recursive 

c0 0
c1 1
c2 2
c3 3
c4 4
c5 5
c6 6
c7 7
c8 8
c9 9


## Generate Training / Validation List
We will now generate the lst files for the training and validation sets. 

We will also create sub-sampled training and validation lists, for preliminary work. To do this we first need to sample the overall list of images. Then, we will filter the training and validation **lst** files with the sampled overall list. In the sample, we will take the same sample, by percentage, of each unique driver.

### Upload Overall List

In [220]:
file_name = 'lst_files/overall.lst'

df_overall = pd.read_csv(file_name, sep='\t', header = None)

In [221]:
columns = ('index', 'classname', 'location')
df_overall.columns = columns
df_overall.head()
df_overall = df_overall.sort_values(by=['location'])

df_overall

Unnamed: 0,index,classname,location
18606,0,0.0,c0\img_100026.jpg
6202,1,0.0,c0\img_10003.jpg
778,2,0.0,c0\img_100050.jpg
9516,3,0.0,c0\img_100074.jpg
1452,4,0.0,c0\img_10012.jpg
...,...,...,...
8154,22419,9.0,c9\img_99761.jpg
4375,22420,9.0,c9\img_99801.jpg
2061,22421,9.0,c9\img_99927.jpg
11577,22422,9.0,c9\img_9993.jpg


For use in the Amazon environment, we need to invert the slashes in the location column.

In [222]:
df_overall['location'] = df_overall['location'].str.replace('\\','/')

In [223]:
df_overall.head()

Unnamed: 0,index,classname,location
18606,0,0.0,c0/img_100026.jpg
6202,1,0.0,c0/img_10003.jpg
778,2,0.0,c0/img_100050.jpg
9516,3,0.0,c0/img_100074.jpg
1452,4,0.0,c0/img_10012.jpg


### Create Image Column

We will create a new column, using the location column and char_location column, which will just list the image file name. 

In [224]:
# Create Dataframe of split image location
location_split = df_overall.location.str.split('/', expand=True)

# Add Back Image File Name to original dataframe.
df_overall['img'] = location_split[1]
df_overall.head()

Unnamed: 0,index,classname,location,img
18606,0,0.0,c0/img_100026.jpg,img_100026.jpg
6202,1,0.0,c0/img_10003.jpg,img_10003.jpg
778,2,0.0,c0/img_100050.jpg,img_100050.jpg
9516,3,0.0,c0/img_100074.jpg,img_100074.jpg
1452,4,0.0,c0/img_10012.jpg,img_10012.jpg


### Filter Overall List
We will now filter the overall list to create the training and validation set lists. Then, we will drop the img column so that when these files are exported, they maintain the correct **lst** format.

#### Total Data Set

In [225]:
df_train_lst_all = df_overall[df_overall['img'].isin(df_images_training['img'])]
df_train_lst = df_train_lst_all.drop(['img'], axis=1)

#Checks
print(df_train_lst.shape)
df_train_lst.head()

(22424, 3)


Unnamed: 0,index,classname,location
18606,0,0.0,c0/img_100026.jpg
6202,1,0.0,c0/img_10003.jpg
778,2,0.0,c0/img_100050.jpg
9516,3,0.0,c0/img_100074.jpg
1452,4,0.0,c0/img_10012.jpg


In [226]:
df_val_lst_all = df_overall[df_overall['img'].isin(df_images_val['img'])]
df_val_lst = df_val_lst_all.drop(['img'], axis=1)

#Checks
print(df_val_lst.shape)
df_val_lst.head()

(6738, 3)


Unnamed: 0,index,classname,location
6202,1,0.0,c0/img_10003.jpg
778,2,0.0,c0/img_100050.jpg
17543,5,0.0,c0/img_100145.jpg
21656,9,0.0,c0/img_100337.jpg
3034,10,0.0,c0/img_100456.jpg


#### Sampled Data Set


In [227]:
sample_frac = 0.1

In [228]:
# Create Sample of Training Images
df_images_training_sample = df_images_training.groupby('subject').apply(pd.DataFrame.sample, 
                                                                        frac=sample_frac)

df_images_training_sample.shape

(2242, 3)

In [229]:
# Create Sample of Test Images
df_images_val_sample = df_images_val.groupby('subject').apply(pd.DataFrame.sample, 
                                                                        frac=sample_frac)

df_images_val_sample.shape

(673, 3)

##### Training

In [230]:
df_train_sample_lst_all = df_overall[df_overall['img'].isin(df_images_training_sample['img'])]
df_train_sample_lst = df_train_sample_lst_all.drop(['img'], axis=1)

#Checks
print(df_train_sample_lst.shape)
df_train_sample_lst.head()

(2242, 3)


Unnamed: 0,index,classname,location
7843,19,0.0,c0/img_100824.jpg
12227,35,0.0,c0/img_101414.jpg
3404,41,0.0,c0/img_101668.jpg
11047,48,0.0,c0/img_101938.jpg
5772,62,0.0,c0/img_10570.jpg


##### Validation

In [231]:
df_val_sample_lst_all = df_overall[df_overall['img'].isin(df_images_val_sample['img'])]
df_val_sample_lst = df_val_sample_lst_all.drop(['img'], axis=1)

#Checks
print(df_val_sample_lst.shape)
df_val_sample_lst.head()

(673, 3)


Unnamed: 0,index,classname,location
6202,1,0.0,c0/img_10003.jpg
13191,12,0.0,c0/img_10053.jpg
20482,24,0.0,c0/img_101032.jpg
13616,50,0.0,c0/img_10206.jpg
6154,64,0.0,c0/img_10609.jpg


### Export

#### Sample Subset

In [232]:
# Training
file_name = "lst_files/train_sample.lst"
df_train_sample_lst.to_csv(file_name, sep='\t', index=None, header=None)

In [233]:
# df_val_sample_lst
file_name = "lst_files/validation_sample.lst"
df_val_sample_lst.to_csv(file_name, sep='\t', index=None, header=None)

#### Total Sample

In [234]:
# Training
file_name = "lst_files/train.lst"
df_train_lst.to_csv(file_name, sep='\t', index=None, header=None)

In [235]:
# df_val_sample_lst
file_name = "lst_files/validation.lst"
df_val_lst.to_csv(file_name, sep='\t', index=None, header=None)

## Move Files in S3

The concepts for "moving" files in S3 are taken from this blog post:

[https://medium.com/plusteam/move-and-rename-objects-within-an-s3-bucket-using-boto-3-58b164790b78](https://medium.com/plusteam/move-and-rename-objects-within-an-s3-bucket-using-boto-3-58b164790b78).

 In order to interact with AWS entities outside of the AWS system, we need to add AWS User credentials locally. The AWS User also needs to have the correct IAM permissions.
 
In order to avoid manually adding User credentials, I maintain a credentials file in Google Drive. We will copy the credentials file (INI file, no extension, per documentation) to the correct local directory.

I was unable to figure out a way to connect to my google drive and do this via code, so the credentials file was copied manually.

### Define S3 Parameters

In [236]:
profile = 'dsba_6190_proj_4'
region = 'us-east-1'
bucket = 'dsba-6190-final-team-project'
prefix_train = 'imgs/train'
prefix_val = 'imgs/validation'

In [237]:
session = boto3.session.Session(profile_name=profile, region_name = region)
s3_client = session.client('s3')

### Transfer Images

To copy everything correctly, we'll have to iterate over a list of images we know we want to move. We have that list in dataframe form (**df_val_lst**). We'll first test that we can iterate over the dataframe and generate the necessary s3 file locations.

The following code chunk is old. I am keeping it in case the actions used here need to be copied.

In [238]:
# Old
'''
#Set Row and Column Index
row_index = 5590
col_index = df_val_lst.columns.get_loc("location")

# Create relative path to image
img_loc = df_val_lst.iloc[row_index, col_index].replace("\\","/")

# Create source and destination paths in S3 to image
src_test = os.path.join(s3train, img_loc)
dst_test = os.path.join(s3validation, img_loc)
print('src: {}'.format(src_test))
print('dst: {}'.format(dst_test))
'''

'\n#Set Row and Column Index\nrow_index = 5590\ncol_index = df_val_lst.columns.get_loc("location")\n\n# Create relative path to image\nimg_loc = df_val_lst.iloc[row_index, col_index].replace("\\","/")\n\n# Create source and destination paths in S3 to image\nsrc_test = os.path.join(s3train, img_loc)\ndst_test = os.path.join(s3validation, img_loc)\nprint(\'src: {}\'.format(src_test))\nprint(\'dst: {}\'.format(dst_test))\n'

In [239]:
for index, row in df_val_lst.iterrows():
    src_key = os.path.join(prefix_train,row['location']).replace("\\","/")
    dst_key = os.path.join(prefix_val, row['location']).replace("\\","/")
    print()
    print('Source Key:\t\t{}'.format(src_key))
    print('Destination Key:\t{}'.format(dst_key))


Source Key:		imgs/train/c0/img_10003.jpg
Destination Key:	imgs/validation/c0/img_10003.jpg

Source Key:		imgs/train/c0/img_100050.jpg
Destination Key:	imgs/validation/c0/img_100050.jpg

Source Key:		imgs/train/c0/img_100145.jpg
Destination Key:	imgs/validation/c0/img_100145.jpg

Source Key:		imgs/train/c0/img_100337.jpg
Destination Key:	imgs/validation/c0/img_100337.jpg

Source Key:		imgs/train/c0/img_100456.jpg
Destination Key:	imgs/validation/c0/img_100456.jpg

Source Key:		imgs/train/c0/img_10053.jpg
Destination Key:	imgs/validation/c0/img_10053.jpg

Source Key:		imgs/train/c0/img_100598.jpg
Destination Key:	imgs/validation/c0/img_100598.jpg

Source Key:		imgs/train/c0/img_100605.jpg
Destination Key:	imgs/validation/c0/img_100605.jpg

Source Key:		imgs/train/c0/img_100828.jpg
Destination Key:	imgs/validation/c0/img_100828.jpg

Source Key:		imgs/train/c0/img_10092.jpg
Destination Key:	imgs/validation/c0/img_10092.jpg

Source Key:		imgs/train/c0/img_101032.jpg
Destination Key:	imgs/v

Destination Key:	imgs/validation/c0/img_75755.jpg

Source Key:		imgs/train/c0/img_75810.jpg
Destination Key:	imgs/validation/c0/img_75810.jpg

Source Key:		imgs/train/c0/img_75880.jpg
Destination Key:	imgs/validation/c0/img_75880.jpg

Source Key:		imgs/train/c0/img_75961.jpg
Destination Key:	imgs/validation/c0/img_75961.jpg

Source Key:		imgs/train/c0/img_75979.jpg
Destination Key:	imgs/validation/c0/img_75979.jpg

Source Key:		imgs/train/c0/img_76028.jpg
Destination Key:	imgs/validation/c0/img_76028.jpg

Source Key:		imgs/train/c0/img_76249.jpg
Destination Key:	imgs/validation/c0/img_76249.jpg

Source Key:		imgs/train/c0/img_76418.jpg
Destination Key:	imgs/validation/c0/img_76418.jpg

Source Key:		imgs/train/c0/img_76559.jpg
Destination Key:	imgs/validation/c0/img_76559.jpg

Source Key:		imgs/train/c0/img_76602.jpg
Destination Key:	imgs/validation/c0/img_76602.jpg

Source Key:		imgs/train/c0/img_76610.jpg
Destination Key:	imgs/validation/c0/img_76610.jpg

Source Key:		imgs/train/c0/im

Source Key:		imgs/train/c1/img_30434.jpg
Destination Key:	imgs/validation/c1/img_30434.jpg

Source Key:		imgs/train/c1/img_30504.jpg
Destination Key:	imgs/validation/c1/img_30504.jpg

Source Key:		imgs/train/c1/img_3055.jpg
Destination Key:	imgs/validation/c1/img_3055.jpg

Source Key:		imgs/train/c1/img_30715.jpg
Destination Key:	imgs/validation/c1/img_30715.jpg

Source Key:		imgs/train/c1/img_30855.jpg
Destination Key:	imgs/validation/c1/img_30855.jpg

Source Key:		imgs/train/c1/img_30861.jpg
Destination Key:	imgs/validation/c1/img_30861.jpg

Source Key:		imgs/train/c1/img_31004.jpg
Destination Key:	imgs/validation/c1/img_31004.jpg

Source Key:		imgs/train/c1/img_31054.jpg
Destination Key:	imgs/validation/c1/img_31054.jpg

Source Key:		imgs/train/c1/img_31346.jpg
Destination Key:	imgs/validation/c1/img_31346.jpg

Source Key:		imgs/train/c1/img_31485.jpg
Destination Key:	imgs/validation/c1/img_31485.jpg

Source Key:		imgs/train/c1/img_31534.jpg
Destination Key:	imgs/validation/c1/img_3

Source Key:		imgs/train/c1/img_75902.jpg
Destination Key:	imgs/validation/c1/img_75902.jpg

Source Key:		imgs/train/c1/img_76146.jpg
Destination Key:	imgs/validation/c1/img_76146.jpg

Source Key:		imgs/train/c1/img_76164.jpg
Destination Key:	imgs/validation/c1/img_76164.jpg

Source Key:		imgs/train/c1/img_76342.jpg
Destination Key:	imgs/validation/c1/img_76342.jpg

Source Key:		imgs/train/c1/img_76580.jpg
Destination Key:	imgs/validation/c1/img_76580.jpg

Source Key:		imgs/train/c1/img_76614.jpg
Destination Key:	imgs/validation/c1/img_76614.jpg

Source Key:		imgs/train/c1/img_76632.jpg
Destination Key:	imgs/validation/c1/img_76632.jpg

Source Key:		imgs/train/c1/img_76634.jpg
Destination Key:	imgs/validation/c1/img_76634.jpg

Source Key:		imgs/train/c1/img_76848.jpg
Destination Key:	imgs/validation/c1/img_76848.jpg

Source Key:		imgs/train/c1/img_76898.jpg
Destination Key:	imgs/validation/c1/img_76898.jpg

Source Key:		imgs/train/c1/img_77432.jpg
Destination Key:	imgs/validation/c1/img

Source Key:		imgs/train/c2/img_34008.jpg
Destination Key:	imgs/validation/c2/img_34008.jpg

Source Key:		imgs/train/c2/img_34137.jpg
Destination Key:	imgs/validation/c2/img_34137.jpg

Source Key:		imgs/train/c2/img_34169.jpg
Destination Key:	imgs/validation/c2/img_34169.jpg

Source Key:		imgs/train/c2/img_34191.jpg
Destination Key:	imgs/validation/c2/img_34191.jpg

Source Key:		imgs/train/c2/img_34358.jpg
Destination Key:	imgs/validation/c2/img_34358.jpg

Source Key:		imgs/train/c2/img_34386.jpg
Destination Key:	imgs/validation/c2/img_34386.jpg

Source Key:		imgs/train/c2/img_34409.jpg
Destination Key:	imgs/validation/c2/img_34409.jpg

Source Key:		imgs/train/c2/img_34520.jpg
Destination Key:	imgs/validation/c2/img_34520.jpg

Source Key:		imgs/train/c2/img_34682.jpg
Destination Key:	imgs/validation/c2/img_34682.jpg

Source Key:		imgs/train/c2/img_34699.jpg
Destination Key:	imgs/validation/c2/img_34699.jpg

Source Key:		imgs/train/c2/img_34791.jpg
Destination Key:	imgs/validation/c2/img


Source Key:		imgs/train/c2/img_82236.jpg
Destination Key:	imgs/validation/c2/img_82236.jpg

Source Key:		imgs/train/c2/img_82351.jpg
Destination Key:	imgs/validation/c2/img_82351.jpg

Source Key:		imgs/train/c2/img_82560.jpg
Destination Key:	imgs/validation/c2/img_82560.jpg

Source Key:		imgs/train/c2/img_82746.jpg
Destination Key:	imgs/validation/c2/img_82746.jpg

Source Key:		imgs/train/c2/img_82912.jpg
Destination Key:	imgs/validation/c2/img_82912.jpg

Source Key:		imgs/train/c2/img_83104.jpg
Destination Key:	imgs/validation/c2/img_83104.jpg

Source Key:		imgs/train/c2/img_83260.jpg
Destination Key:	imgs/validation/c2/img_83260.jpg

Source Key:		imgs/train/c2/img_83289.jpg
Destination Key:	imgs/validation/c2/img_83289.jpg

Source Key:		imgs/train/c2/img_83496.jpg
Destination Key:	imgs/validation/c2/img_83496.jpg

Source Key:		imgs/train/c2/img_83812.jpg
Destination Key:	imgs/validation/c2/img_83812.jpg

Source Key:		imgs/train/c2/img_83860.jpg
Destination Key:	imgs/validation/c2/im


Source Key:		imgs/train/c3/img_47156.jpg
Destination Key:	imgs/validation/c3/img_47156.jpg

Source Key:		imgs/train/c3/img_47186.jpg
Destination Key:	imgs/validation/c3/img_47186.jpg

Source Key:		imgs/train/c3/img_47378.jpg
Destination Key:	imgs/validation/c3/img_47378.jpg

Source Key:		imgs/train/c3/img_47426.jpg
Destination Key:	imgs/validation/c3/img_47426.jpg

Source Key:		imgs/train/c3/img_47635.jpg
Destination Key:	imgs/validation/c3/img_47635.jpg

Source Key:		imgs/train/c3/img_47731.jpg
Destination Key:	imgs/validation/c3/img_47731.jpg

Source Key:		imgs/train/c3/img_48101.jpg
Destination Key:	imgs/validation/c3/img_48101.jpg

Source Key:		imgs/train/c3/img_48324.jpg
Destination Key:	imgs/validation/c3/img_48324.jpg

Source Key:		imgs/train/c3/img_48350.jpg
Destination Key:	imgs/validation/c3/img_48350.jpg

Source Key:		imgs/train/c3/img_4842.jpg
Destination Key:	imgs/validation/c3/img_4842.jpg

Source Key:		imgs/train/c3/img_48454.jpg
Destination Key:	imgs/validation/c3/img_

Source Key:		imgs/train/c4/img_25113.jpg
Destination Key:	imgs/validation/c4/img_25113.jpg

Source Key:		imgs/train/c4/img_25138.jpg
Destination Key:	imgs/validation/c4/img_25138.jpg

Source Key:		imgs/train/c4/img_25448.jpg
Destination Key:	imgs/validation/c4/img_25448.jpg

Source Key:		imgs/train/c4/img_25495.jpg
Destination Key:	imgs/validation/c4/img_25495.jpg

Source Key:		imgs/train/c4/img_25700.jpg
Destination Key:	imgs/validation/c4/img_25700.jpg

Source Key:		imgs/train/c4/img_25790.jpg
Destination Key:	imgs/validation/c4/img_25790.jpg

Source Key:		imgs/train/c4/img_26101.jpg
Destination Key:	imgs/validation/c4/img_26101.jpg

Source Key:		imgs/train/c4/img_26149.jpg
Destination Key:	imgs/validation/c4/img_26149.jpg

Source Key:		imgs/train/c4/img_2625.jpg
Destination Key:	imgs/validation/c4/img_2625.jpg

Source Key:		imgs/train/c4/img_26292.jpg
Destination Key:	imgs/validation/c4/img_26292.jpg

Source Key:		imgs/train/c4/img_26388.jpg
Destination Key:	imgs/validation/c4/img_2


Source Key:		imgs/train/c4/img_74913.jpg
Destination Key:	imgs/validation/c4/img_74913.jpg

Source Key:		imgs/train/c4/img_7495.jpg
Destination Key:	imgs/validation/c4/img_7495.jpg

Source Key:		imgs/train/c4/img_75001.jpg
Destination Key:	imgs/validation/c4/img_75001.jpg

Source Key:		imgs/train/c4/img_75274.jpg
Destination Key:	imgs/validation/c4/img_75274.jpg

Source Key:		imgs/train/c4/img_75347.jpg
Destination Key:	imgs/validation/c4/img_75347.jpg

Source Key:		imgs/train/c4/img_75489.jpg
Destination Key:	imgs/validation/c4/img_75489.jpg

Source Key:		imgs/train/c4/img_75683.jpg
Destination Key:	imgs/validation/c4/img_75683.jpg

Source Key:		imgs/train/c4/img_75752.jpg
Destination Key:	imgs/validation/c4/img_75752.jpg

Source Key:		imgs/train/c4/img_75773.jpg
Destination Key:	imgs/validation/c4/img_75773.jpg

Source Key:		imgs/train/c4/img_75992.jpg
Destination Key:	imgs/validation/c4/img_75992.jpg

Source Key:		imgs/train/c4/img_75998.jpg
Destination Key:	imgs/validation/c4/img_

Source Key:		imgs/train/c5/img_46919.jpg
Destination Key:	imgs/validation/c5/img_46919.jpg

Source Key:		imgs/train/c5/img_47157.jpg
Destination Key:	imgs/validation/c5/img_47157.jpg

Source Key:		imgs/train/c5/img_47221.jpg
Destination Key:	imgs/validation/c5/img_47221.jpg

Source Key:		imgs/train/c5/img_47312.jpg
Destination Key:	imgs/validation/c5/img_47312.jpg

Source Key:		imgs/train/c5/img_47643.jpg
Destination Key:	imgs/validation/c5/img_47643.jpg

Source Key:		imgs/train/c5/img_4767.jpg
Destination Key:	imgs/validation/c5/img_4767.jpg

Source Key:		imgs/train/c5/img_47711.jpg
Destination Key:	imgs/validation/c5/img_47711.jpg

Source Key:		imgs/train/c5/img_48087.jpg
Destination Key:	imgs/validation/c5/img_48087.jpg

Source Key:		imgs/train/c5/img_4810.jpg
Destination Key:	imgs/validation/c5/img_4810.jpg

Source Key:		imgs/train/c5/img_48105.jpg
Destination Key:	imgs/validation/c5/img_48105.jpg

Source Key:		imgs/train/c5/img_48279.jpg
Destination Key:	imgs/validation/c5/img_482

Source Key:		imgs/train/c6/img_21284.jpg
Destination Key:	imgs/validation/c6/img_21284.jpg

Source Key:		imgs/train/c6/img_213.jpg
Destination Key:	imgs/validation/c6/img_213.jpg

Source Key:		imgs/train/c6/img_21400.jpg
Destination Key:	imgs/validation/c6/img_21400.jpg

Source Key:		imgs/train/c6/img_21435.jpg
Destination Key:	imgs/validation/c6/img_21435.jpg

Source Key:		imgs/train/c6/img_2152.jpg
Destination Key:	imgs/validation/c6/img_2152.jpg

Source Key:		imgs/train/c6/img_21766.jpg
Destination Key:	imgs/validation/c6/img_21766.jpg

Source Key:		imgs/train/c6/img_21889.jpg
Destination Key:	imgs/validation/c6/img_21889.jpg

Source Key:		imgs/train/c6/img_2209.jpg
Destination Key:	imgs/validation/c6/img_2209.jpg

Source Key:		imgs/train/c6/img_22101.jpg
Destination Key:	imgs/validation/c6/img_22101.jpg

Source Key:		imgs/train/c6/img_2239.jpg
Destination Key:	imgs/validation/c6/img_2239.jpg

Source Key:		imgs/train/c6/img_22534.jpg
Destination Key:	imgs/validation/c6/img_22534.jpg

Source Key:		imgs/train/c6/img_81133.jpg
Destination Key:	imgs/validation/c6/img_81133.jpg

Source Key:		imgs/train/c6/img_81272.jpg
Destination Key:	imgs/validation/c6/img_81272.jpg

Source Key:		imgs/train/c6/img_81351.jpg
Destination Key:	imgs/validation/c6/img_81351.jpg

Source Key:		imgs/train/c6/img_8143.jpg
Destination Key:	imgs/validation/c6/img_8143.jpg

Source Key:		imgs/train/c6/img_81432.jpg
Destination Key:	imgs/validation/c6/img_81432.jpg

Source Key:		imgs/train/c6/img_81555.jpg
Destination Key:	imgs/validation/c6/img_81555.jpg

Source Key:		imgs/train/c6/img_81698.jpg
Destination Key:	imgs/validation/c6/img_81698.jpg

Source Key:		imgs/train/c6/img_81704.jpg
Destination Key:	imgs/validation/c6/img_81704.jpg

Source Key:		imgs/train/c6/img_81728.jpg
Destination Key:	imgs/validation/c6/img_81728.jpg

Source Key:		imgs/train/c6/img_81807.jpg
Destination Key:	imgs/validation/c6/img_81807.jpg

Source Key:		imgs/train/c6/img_81828.jpg
Destination Key:	imgs/validation/c6/img_8

Source Key:		imgs/train/c7/img_61990.jpg
Destination Key:	imgs/validation/c7/img_61990.jpg

Source Key:		imgs/train/c7/img_62010.jpg
Destination Key:	imgs/validation/c7/img_62010.jpg

Source Key:		imgs/train/c7/img_62014.jpg
Destination Key:	imgs/validation/c7/img_62014.jpg

Source Key:		imgs/train/c7/img_62337.jpg
Destination Key:	imgs/validation/c7/img_62337.jpg

Source Key:		imgs/train/c7/img_62591.jpg
Destination Key:	imgs/validation/c7/img_62591.jpg

Source Key:		imgs/train/c7/img_62640.jpg
Destination Key:	imgs/validation/c7/img_62640.jpg

Source Key:		imgs/train/c7/img_62779.jpg
Destination Key:	imgs/validation/c7/img_62779.jpg

Source Key:		imgs/train/c7/img_62968.jpg
Destination Key:	imgs/validation/c7/img_62968.jpg

Source Key:		imgs/train/c7/img_62988.jpg
Destination Key:	imgs/validation/c7/img_62988.jpg

Source Key:		imgs/train/c7/img_63025.jpg
Destination Key:	imgs/validation/c7/img_63025.jpg

Source Key:		imgs/train/c7/img_63031.jpg
Destination Key:	imgs/validation/c7/img

Source Key:		imgs/train/c8/img_37727.jpg
Destination Key:	imgs/validation/c8/img_37727.jpg

Source Key:		imgs/train/c8/img_37781.jpg
Destination Key:	imgs/validation/c8/img_37781.jpg

Source Key:		imgs/train/c8/img_37893.jpg
Destination Key:	imgs/validation/c8/img_37893.jpg

Source Key:		imgs/train/c8/img_38041.jpg
Destination Key:	imgs/validation/c8/img_38041.jpg

Source Key:		imgs/train/c8/img_38369.jpg
Destination Key:	imgs/validation/c8/img_38369.jpg

Source Key:		imgs/train/c8/img_38759.jpg
Destination Key:	imgs/validation/c8/img_38759.jpg

Source Key:		imgs/train/c8/img_38924.jpg
Destination Key:	imgs/validation/c8/img_38924.jpg

Source Key:		imgs/train/c8/img_39068.jpg
Destination Key:	imgs/validation/c8/img_39068.jpg

Source Key:		imgs/train/c8/img_39134.jpg
Destination Key:	imgs/validation/c8/img_39134.jpg

Source Key:		imgs/train/c8/img_39194.jpg
Destination Key:	imgs/validation/c8/img_39194.jpg

Source Key:		imgs/train/c8/img_39467.jpg
Destination Key:	imgs/validation/c8/img


Source Key:		imgs/train/c9/img_17633.jpg
Destination Key:	imgs/validation/c9/img_17633.jpg

Source Key:		imgs/train/c9/img_18034.jpg
Destination Key:	imgs/validation/c9/img_18034.jpg

Source Key:		imgs/train/c9/img_18052.jpg
Destination Key:	imgs/validation/c9/img_18052.jpg

Source Key:		imgs/train/c9/img_18055.jpg
Destination Key:	imgs/validation/c9/img_18055.jpg

Source Key:		imgs/train/c9/img_1825.jpg
Destination Key:	imgs/validation/c9/img_1825.jpg

Source Key:		imgs/train/c9/img_18309.jpg
Destination Key:	imgs/validation/c9/img_18309.jpg

Source Key:		imgs/train/c9/img_18340.jpg
Destination Key:	imgs/validation/c9/img_18340.jpg

Source Key:		imgs/train/c9/img_18444.jpg
Destination Key:	imgs/validation/c9/img_18444.jpg

Source Key:		imgs/train/c9/img_18536.jpg
Destination Key:	imgs/validation/c9/img_18536.jpg

Source Key:		imgs/train/c9/img_18607.jpg
Destination Key:	imgs/validation/c9/img_18607.jpg

Source Key:		imgs/train/c9/img_19032.jpg
Destination Key:	imgs/validation/c9/img_

Source Key:		imgs/train/c9/img_67779.jpg
Destination Key:	imgs/validation/c9/img_67779.jpg

Source Key:		imgs/train/c9/img_67906.jpg
Destination Key:	imgs/validation/c9/img_67906.jpg

Source Key:		imgs/train/c9/img_68035.jpg
Destination Key:	imgs/validation/c9/img_68035.jpg

Source Key:		imgs/train/c9/img_68108.jpg
Destination Key:	imgs/validation/c9/img_68108.jpg

Source Key:		imgs/train/c9/img_68198.jpg
Destination Key:	imgs/validation/c9/img_68198.jpg

Source Key:		imgs/train/c9/img_68251.jpg
Destination Key:	imgs/validation/c9/img_68251.jpg

Source Key:		imgs/train/c9/img_68344.jpg
Destination Key:	imgs/validation/c9/img_68344.jpg

Source Key:		imgs/train/c9/img_6838.jpg
Destination Key:	imgs/validation/c9/img_6838.jpg

Source Key:		imgs/train/c9/img_68466.jpg
Destination Key:	imgs/validation/c9/img_68466.jpg

Source Key:		imgs/train/c9/img_68543.jpg
Destination Key:	imgs/validation/c9/img_68543.jpg

Source Key:		imgs/train/c9/img_68564.jpg
Destination Key:	imgs/validation/c9/img_6

The iteration test appears to have worked correctly. The image traning and validation file locations appear to be correct. To further test we will attacm to copy a file between S3 locations.

To copy a file within a bucket, we will use the **copy_object** function within a S3 client.

In [240]:
# Inputs
copy_source = {
    "Bucket": bucket,
    "Key": src_key
}

dest_key = dst_key

print("Bucket: {}".format(bucket))
print("Copy Source: {}".format(copy_source))
print("Destination Key: {}".format(dest_key))

Bucket: dsba-6190-final-team-project
Copy Source: {'Bucket': 'dsba-6190-final-team-project', 'Key': 'imgs/train/c9/img_99949.jpg'}
Destination Key: imgs/validation/c9/img_99949.jpg


In order to prevent rerunning a process heavy code chunk, and incurring unnessary AWS fees, I have commented out the following chunk after the initial run.

In [241]:
'''
for index, row in df_val_lst.iterrows():
    src_key = os.path.join(prefix_train, row['location']).replace("\\","/")
    dest_key = os.path.join(prefix_val, row['location']).replace("\\","/")
    
    # Function Inputs
    copy_source = {
        "Bucket": bucket,
        "Key": src_key
    }
    
    dest_key = dest_key
    
    # Copy
    s3.copy_object(Bucket = bucket,
                   CopySource = copy_source,
                   Key = dest_key)

    # Delete
    s3.delete_object(Bucket = bucket,
                     Key = src_key)
'''

'\nfor index, row in df_val_lst.iterrows():\n    src_key = os.path.join(prefix_train, row[\'location\']).replace("\\","/")\n    dest_key = os.path.join(prefix_val, row[\'location\']).replace("\\","/")\n    \n    # Function Inputs\n    copy_source = {\n        "Bucket": bucket,\n        "Key": src_key\n    }\n    \n    dest_key = dest_key\n    \n    # Copy\n    s3.copy_object(Bucket = bucket,\n                   CopySource = copy_source,\n                   Key = dest_key)\n\n    # Delete\n    s3.delete_object(Bucket = bucket,\n                     Key = src_key)\n'

### Copy LST Files
In order to copy files from our local directory to the correct S3 location, we'll need to us an S3 resource, as opposed to the S3 client used to move images.

In [242]:
s3_resource = session.resource('s3')

We will need to define where the LST files are going, and where they are located locally.

In [243]:
# S3 Locations
train_lst_key = "imgs/train_lst/train.lst"
validation_lst_key = "imgs/validation_lst/validation.lst"

#Local
train_lst_local = "lst_files/train.lst"
validation_lst_local = "lst_files/validation.lst"

In [244]:
# Training
#s3_resource.Bucket(bucket).upload_file(Filename=train_lst_local,
#                                      Key = train_lst_key)

In [245]:
# Validation
#s3_resource.Bucket(bucket).upload_file(Filename=validation_lst_local,
#                                      Key = validation_lst_key)