### Notebook to set up data for tensorflow.keras training from a directory
The following code reads the list of images in the NABirds dataset that I keep in GCS.  Some additional manipulation is performed so that image names and keys match the actual file names.  The train/test split suggested by the NABirds dataset is 50/50.  I'd prefer 80/20, so I built a new column (`Train_Test` in the code below) to mark that split.  Keras is set up to look for the resulting split folders under the "images" directory in GCS.  I also looked at the balance of images per bird class: counts range from 91 to 120, which is fairly balanced, so I did no data enrichment.  

In [1]:
import pandas as pd
import gcs
import gcs_inventory as gcsi
from google.cloud import storage
import os
import numpy as np

no module auth.py found with key google_json_key for gcs, assuming this is running from within GCP project


In [2]:
def pull_image_key(image_name: str) -> str:
    filename = os.path.basename(image_name)  # remove path   
    image_key = os.path.splitext(filename)[0]  # Remove .jpg
    return image_key
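
As a quick illustration of what `pull_image_key` returns, here is a minimal sketch run on a made-up blob path. The folder `0001` in the path is an assumption for illustration only; the key itself is the one shown in the split file output further down.

```python
import os


def pull_image_key(image_name: str) -> str:
    """Return the file name without its directory path or extension."""
    filename = os.path.basename(image_name)  # remove path
    return os.path.splitext(filename)[0]  # remove .jpg


# Hypothetical blob path; the real folder layout in the bucket may differ.
example_path = "images/0001/001c81f1d30240298cb54488e966eff7.jpg"
example_key = pull_image_key(example_path)
print(example_key)  # 001c81f1d30240298cb54488e966eff7
```

Only the last extension is stripped, so a name such as `c.tar.gz` would come back as `c.tar`.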

In [3]:
df_file_name = 'nabirds-jpg-list.csv'  # built in profiling nb
try: 
    df_raw = pd.read_csv(df_file_name)  # reload existing file if it exists, delete it if you want to recreate it
    print(f'Loading existing file {df_file_name}')
except FileNotFoundError:  # only rebuild when the cached CSV is missing
    print('Reading images from GCS....')
    df_raw = gcsi.get_nabirds_jpg_images(df_file_name)  # $0.01 per 1000 operations so about $0.80 per run, create from scratch
    
df_raw['Image'] = df_raw['Image Name'].apply(pull_image_key)

Loading existing file nabirds-jpg-list.csv


In [4]:
# print(df_raw.shape)
# print(df_raw.columns)
# print(df_raw.head(1))

In [5]:
nabirds_storage = gcs.Storage(bucket_name='nabirds_filtered')
nabirds_test_train_df = nabirds_storage.get_df('train_test_split_filtered.txt', 
                                               header=None, column_names=['Image','Train_B'], delimiter= ' ')
nabirds_test_train_df['Image'] = nabirds_test_train_df['Image'].astype(str).str.replace("-", "")  # remove - from names
nabirds_test_train_df['Train_B'] = nabirds_test_train_df['Train_B'].astype(int)
print(nabirds_test_train_df.shape)
print(nabirds_test_train_df.head(1))

(3076, 2)
                              Image  Train_B
0  001c81f1d30240298cb54488e966eff7        1


In [6]:
joined_df = pd.merge(df_raw, nabirds_test_train_df, on='Image', how='inner')
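
An inner join silently drops any image that appears on only one side. A minimal sketch of how that could be checked, using the `indicator` option of `pd.merge` on toy data (the helper name `merge_coverage` and the toy keys are assumptions, not part of the notebook):

```python
import pandas as pd


def merge_coverage(left: pd.DataFrame, right: pd.DataFrame, key: str) -> dict:
    """Count keys found in both frames, only in the left, or only in the right."""
    merged = pd.merge(left[[key]], right[[key]], on=key, how="outer", indicator=True)
    counts = merged["_merge"].value_counts()
    return {label: int(counts.get(label, 0)) for label in ("both", "left_only", "right_only")}


# Toy frames standing in for df_raw and nabirds_test_train_df.
toy_left = pd.DataFrame({"Image": ["a", "b", "c"]})
toy_right = pd.DataFrame({"Image": ["b", "c", "d"]})
report = merge_coverage(toy_left, toy_right, "Image")
print(report)
```

A non-zero `left_only` or `right_only` count would point at keys that still do not line up, for example names that kept their hyphens.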

In [7]:
# print(joined_df.shape)
# print(joined_df.columns)
# print(joined_df.head(1))

In [8]:
train_df = joined_df[joined_df['Train_B'] == 1] 
test_df = joined_df[joined_df['Train_B'] == 0]
print(train_df.shape)
print(test_df.shape)
print(joined_df['Class Name'].nunique())
print(joined_df['Class Name'].value_counts())
# print(train_df['Class Name'].value_counts())
# print(test_df['Class Name'].value_counts())

(1531, 9)
(1545, 9)
27
Class Name
Mourning Dove                             120
Northern Cardinal Female Juvenile         120
Blue Jay                                  120
White-Breasted Nuthatch                   120
Red-Bellied Woodpecker                    120
American Goldfinch Male                   120
House Finch Male                          120
American Goldfinch Female Juvenile        120
Brown-Headed Cowbird Male                 120
Rose-Breasted Grosbeak Male               120
Northern Cardinal Male                    120
Dark-Eyed Junco                           120
Downy Woodpecker                          120
Song Sparrow                              120
House Sparrow Male                        119
American Robin                            119
Common Grackle                            118
House Sparrow Female                      117
House Finch Female Juvenile               112
Tree Sparrow                              110
Baltimore Oriole Male                     109


### Test and Train Setup
The data is roughly split 50/50 between testing and training.  I'd like an 80/20 split, so I'll create a new column (`Train_Test` in the code below) for that split and leave Cornell's original column alone.

In [9]:
np.random.seed(1)  # Set random seed for consistent splits
train_size = 0.8  # 80/20 split
tt_df = joined_df.copy()  # take a copy so we can re run this code section and debug
tt_df['Train_Test'] = -1  # init with -1 to id unassigned rows
unique_names = tt_df['Class Name'].unique()
for name in unique_names:
    group_df = tt_df[tt_df['Class Name'] == name]
    group_size = len(group_df)  # count number of images in class
    train_count = int(group_size * train_size)
    print(f'Train Test class {name} with size {train_count} over the group of {group_size}')
    train_indices = np.random.choice(group_df.index, size=train_count, replace=False)
    tt_df.loc[train_indices, 'Train_Test'] = 1  # set train to true for selected rows

tt_df.loc[tt_df['Train_Test'] == -1, 'Train_Test'] = 0  # set remaining rows to zero
print(f'Unassigned rows {(tt_df["Train_Test"] == -1).sum()}')  # check that we got everything, should print 0

Train Test class Mourning Dove with size 96 over the group of 120
Train Test class Red-Bellied Woodpecker with size 96 over the group of 120
Train Test class Downy Woodpecker with size 96 over the group of 120
Train Test class Dark-Eyed Junco with size 96 over the group of 120
Train Test class American Robin with size 95 over the group of 119
Train Test class Northern Cardinal Male with size 96 over the group of 120
Train Test class Rose-Breasted Grosbeak Male with size 96 over the group of 120
Train Test class Brown-Headed Cowbird Male with size 96 over the group of 120
Train Test class Baltimore Oriole Male with size 87 over the group of 109
Train Test class Purple Finch Male with size 82 over the group of 103
Train Test class House Finch Male with size 96 over the group of 120
Train Test class American Goldfinch Male with size 96 over the group of 120
Train Test class House Sparrow Male with size 95 over the group of 119
Train Test class Black-Capped Chickadee with size 76 over the 

In [10]:
print(tt_df.groupby('Class Name')['Train_Test'].sum().to_string())  # check results should match train values above
# print(tt_df.head(1))

Class Name
American Goldfinch Female Juvenile        96
American Goldfinch Male                   96
American Robin                            95
Baltimore Oriole Female Juvenile          78
Baltimore Oriole Male                     87
Black-Capped Chickadee                    76
Blue Jay                                  96
Brown-Headed Cowbird Female Juvenile      82
Brown-Headed Cowbird Male                 96
Common Grackle                            94
Dark-Eyed Junco                           96
Downy Woodpecker                          96
House Finch Female Juvenile               89
House Finch Male                          96
House Sparrow Female                      93
House Sparrow Male                        95
Mourning Dove                             96
Northern Cardinal Female Juvenile         96
Northern Cardinal Male                    96
Purple Finch Female Juvenile              72
Purple Finch Male                         82
Red-Bellied Woodpecker                    96
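
The per-class split above could also be wrapped in a reusable function. The following is only a sketch under stated assumptions: the function name `stratified_flag` is made up, it uses NumPy's `default_rng` rather than the global seed, and so it will not reproduce the exact rows chosen above. It does keep the same `int(group_size * train_size)` truncation, so the per-class train counts follow the same rule.

```python
import numpy as np
import pandas as pd


def stratified_flag(df: pd.DataFrame, class_col: str,
                    train_size: float = 0.8, seed: int = 1) -> pd.Series:
    """Return a 0/1 Series marking a per-class random train selection."""
    rng = np.random.default_rng(seed)
    flag = pd.Series(0, index=df.index, dtype=int)  # default every row to test (0)
    for _, idx in df.groupby(class_col).groups.items():
        n_train = int(len(idx) * train_size)  # truncate, as in the loop above
        chosen = rng.choice(idx.to_numpy(), size=n_train, replace=False)
        flag.loc[chosen] = 1  # mark the sampled rows as train
    return flag


# Toy data: three classes with 10, 9 and 5 images.
toy = pd.DataFrame({"Class Name": ["A"] * 10 + ["B"] * 9 + ["C"] * 5})
flags = stratified_flag(toy, "Class Name")
train_counts = flags.groupby(toy["Class Name"]).sum().to_dict()
print(train_counts)
```

Because rows default to 0, there is no separate "unassigned" state to clean up afterwards.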

### Write out the train and test directories under the images directory in GCS
This next section sets up the split directories for the tensorflow.keras `image_dataset_from_directory` function.  The code below copies each image to `images/train/` or `images/test/`.  Delete those folders underneath the images directory if the split needs to be reworked for any reason.  
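
Clearing out an old split could be scripted rather than done by hand. This is a minimal sketch, not something the notebook runs: the helper name `delete_prefix` is an assumption, and it only relies on the bucket object offering `list_blobs(prefix=...)` and each blob offering `delete()`, as the `google.cloud.storage` client does.

```python
def delete_prefix(bucket, prefix: str) -> int:
    """Delete every blob whose name starts with prefix; return how many were removed."""
    deleted = 0
    # Materialize the listing first so deletions do not interfere with iteration.
    for blob in list(bucket.list_blobs(prefix=prefix)):
        blob.delete()
        deleted += 1
    return deleted


# Possible usage (commented out because it permanently deletes objects):
# from google.cloud import storage
# bucket = storage.Client().bucket('nabirds_filtered')
# delete_prefix(bucket, 'images/train/')
# delete_prefix(bucket, 'images/test/')
```

The trailing slash in the prefix matters: without it, `images/train` would also match any blob whose name merely begins with those characters.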

In [13]:
# wrote this out in native GCS API. Should consider moving this to gcs utility gcs.py 
storage_client = storage.Client()
bucket = storage_client.bucket('nabirds_filtered')

for _, row in tt_df.iterrows():
    source_blob_name = row['Image Name']
    path_parts = source_blob_name.split("images/")  # keep everything after the images/ prefix
    destination_folder = 'train' if row['Train_Test'] == 1 else 'test'
    destination_blob_name = f'images/{destination_folder}/{path_parts[1]}'
    source_blob = bucket.blob(source_blob_name)
    bucket.copy_blob(source_blob, bucket, destination_blob_name)  # copy within the same bucket
    # print(f'Copied {source_blob_name} to {destination_blob_name}')
    
print('Done')

Done
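
The destination naming used in the copy loop can be isolated into a small pure function, which makes it easy to check without touching GCS. This is a sketch only: the helper name `split_destination` and the example path are assumptions, and raising on a missing `images/` prefix is a design choice the notebook does not make.

```python
def split_destination(source_blob_name: str, is_train: bool, root: str = "images/") -> str:
    """Build the train/test destination path for a blob stored under root."""
    _, sep, tail = source_blob_name.partition(root)
    if not sep:
        # Fail loudly instead of hitting an IndexError on an unexpected name.
        raise ValueError(f"{source_blob_name!r} does not contain {root!r}")
    folder = "train" if is_train else "test"
    return f"{root}{folder}/{tail}"


# Hypothetical source path; the real folder names in the bucket may differ.
example_destination = split_destination("images/Blue Jay/abc.jpg", is_train=True)
print(example_destination)  # images/train/Blue Jay/abc.jpg
```

Keeping the class folder in the tail matters because `image_dataset_from_directory` infers labels from subdirectory names.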
