### Bootstrapping data prep pipeline

This is a weakly supervised technique and should be augmented with human annotations at a higher level.


### Sampler function for data prep for training


- The implementation for the blocks in blue ink.
- A set of functions that take as input:
    - CSVs at multiple levels: (image, species count) level, (image, bounding box) level, etc.
    - Sampling technique
    - Number of samples required
    - This list will evolve as I write this code
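
A minimal sketch of what such a sampler interface could look like, assuming rows that carry `filename` and `class` columns. The function name `sample_images`, the two `technique` values and the toy data are hypothetical and only illustrate the inputs listed above; they are not the notebook's final API.

```python
import pandas as pd

def sample_images(df, technique="random", n_samples=10, seed=None):
    """Return the rows of up to n_samples distinct images (hypothetical interface)."""
    # One row per image, so that each image can be drawn at most once
    images = df.drop_duplicates(subset="filename").reset_index(drop=True)
    n = min(n_samples, len(images))
    if technique == "random":
        picked = images.sample(n=n, random_state=seed)
    elif technique == "per_class":
        # Shuffle, rank the images within each class, and keep the lowest ranks
        # first, which spreads the sample across species
        shuffled = images.sample(frac=1, random_state=seed)
        rank = shuffled.groupby("class").cumcount()
        picked = shuffled.loc[rank.sort_values(kind="mergesort").index[:n]]
    else:
        raise ValueError(f"unknown sampling technique: {technique}")
    # Return every row (e.g. every bounding box) of the picked images
    return df[df["filename"].isin(picked["filename"])]

# Toy input: image "a" has two rows, the others have one
toy = pd.DataFrame({
    "filename": ["a", "a", "b", "c", "d"],
    "class": ["zebra", "zebra", "zebra", "topi", "topi"],
})
print(sample_images(toy, technique="per_class", n_samples=2, seed=0))
```

Sampling on distinct filenames rather than on rows keeps all boxes of a chosen image together, which matters for the bounding-box level CSVs.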
    

<img src="IMG_0026.jpg" width="600" height="200">



1. Read all the shards created - ```/home/ubuntu/data/tensorflow/my_workspace/training_demo/Predictions/round1/bootstrap_data_<>```
2. Consolidate the list of filenames and animals
3. Sample them according to the proportions that I build
4. Choose all the multi-species images

In [55]:
import pandas as pd
import csv, glob, random
from timeit import default_timer as timer
import numpy as np
import matplotlib.pyplot as plt

In [56]:
# add in parameters
round_number = 'round1'
round_number_next = 'round2'
data_already_used_for_training_herd = set(pd.read_csv('../data/bootstrapping/{0}_csv/herd_bbox.csv'.format(round_number))['filename'])
data_already_used_for_training_ms = set(pd.read_csv('../data/bootstrapping/{0}_csv/multiSpecies_bbox.csv'.format(round_number))['filename'])
data_already_used_for_training_ss = set(pd.read_csv('../data/bootstrapping/{0}_csv/singleSpecies_bbox.csv'.format(round_number))['filename'])
data_already_used_for_training_list = list(data_already_used_for_training_herd.union(data_already_used_for_training_ms).union(data_already_used_for_training_ss))

# data_already_used_for_training_list.shape
print(len(data_already_used_for_training_list))
print(len(data_already_used_for_training_herd))
print(len(data_already_used_for_training_ms))
print(len(data_already_used_for_training_ss))

12255
2747
6505
3003


### Dataframe with all shards appended
Generate a master dataset with all the appended bounding boxes

In [57]:
start = timer()
shard_pattern = '/home/ubuntu/data/tensorflow/my_workspace/training_demo/Predictions/S1_S6/{0}/Post_procession_of_infer_detection/bootstrap_data_snapshot_serengeti_s01_s06_*'.format(round_number)
# Read every shard and concatenate once; DataFrame.append was removed in pandas 2.0
df_base_bbox = pd.concat([pd.read_csv(path) for path in glob.glob(shard_pattern)])
end = timer()
print(round((end - start), 1))
print(df_base_bbox.shape)

15.4
(1828488, 6)


In [58]:
df_base_bbox.head()

Unnamed: 0,filename,class,xmin,ymin,xmax,ymax
0,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.645418,0.555125,0.722567,0.679608
1,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.348833,0.577583,0.39111,0.609568
2,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.213356,0.490319,0.334306,0.692091
3,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.676879,0.391617,1.0,0.91474
4,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.675606,0.395122,0.996273,0.925809


In [59]:
# Exclude data that have already been used in training
# index_not_used_for_training = [filename in data_already_used_for_training_list for filename in df_base_bbox['filename']]
# index_not_used_for_training

df_base_bbox = df_base_bbox[~df_base_bbox['filename'].isin(data_already_used_for_training_list)]
print(df_base_bbox.shape)
df_base_bbox.head()

(1783256, 6)


Unnamed: 0,filename,class,xmin,ymin,xmax,ymax
0,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.645418,0.555125,0.722567,0.679608
1,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.348833,0.577583,0.39111,0.609568
2,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.213356,0.490319,0.334306,0.692091
3,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.676879,0.391617,1.0,0.91474
4,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.675606,0.395122,0.996273,0.925809


## 1. HERD
We first sample the herd images, i.e. images with 11 or more predicted bounding boxes. Since there are fewer than 5000 herd images (2965 in this round; round 1 had 2747), we take all of them for model training.

In [60]:
df_image_popultion_dist = df_base_bbox.groupby(by=['filename'], as_index=False)['xmin'].count()
df_image_popultion_dist.columns = ['filename', 'count_box']
df_image_herd_lst = list(df_image_popultion_dist[df_image_popultion_dist['count_box'] >= 11]['filename'])
df_image_popultion_dist = df_image_popultion_dist.groupby(by=['count_box'], as_index=False)['filename'].count()
df_image_popultion_dist.columns = ['count_box', 'num_images']

# plt.bar(df_image_popultion_dist.count_box, df_image_popultion_dist.num_images)
df_image_popultion_dist.tail()

Unnamed: 0,count_box,num_images
15,16,43
16,17,23
17,18,11
18,19,4
19,20,2


In [61]:
len(df_image_herd_lst)

2965
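
The herd rule can be illustrated on toy data. This is a minimal sketch: the threshold of 11 boxes mirrors the cell above, while the constant name `HERD_MIN_BOXES` and the toy filenames are made up for illustration. It uses `size()`, which counts all rows per image, whereas the cell above counts non-null `xmin` values; the two agree when no box coordinates are missing.

```python
import pandas as pd

HERD_MIN_BOXES = 11  # same threshold as in the cell above

# Toy predictions: one row per bounding box
toy_bbox = pd.DataFrame({
    "filename": ["img_a"] * 12 + ["img_b"] * 3,
    "class": ["wildebeest"] * 12 + ["zebra"] * 3,
})

# Count the boxes per image, then keep the images at or above the threshold
boxes_per_image = toy_bbox.groupby("filename").size()
herd_images = boxes_per_image[boxes_per_image >= HERD_MIN_BOXES].index.tolist()
print(herd_images)
```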

## 2. Multi-Species images
1. Here I sample 5000 (or all, if fewer are available) multi-species images. The training dataset does not contain multi-species images, so I want to include them in the next model training loop.
2. Exclude the images that were already selected in **Part 1 - HERD**.

In [62]:
# 1. Get the unique (filename, class) pairs, i.e. the species present in each image
# 2. Exclude the images that are part of the HERD list
# 3. The result is a dataframe of image names and the species in each image
start = timer() # time the process
df_base = df_base_bbox[['filename', 'class']]
df_base = df_base.drop_duplicates()
df_base = df_base[~df_base['filename'].isin(df_image_herd_lst)] # 2. Excluding images in HERD list
end = timer()
print("execution time: {0} Seconds".format(round(end - start, 1)))

execution time: 0.6 Seconds


In [63]:
# 1. From the above dataframe, get the list of images that have multiple animals.
# 2. Get the dataframe with image names and the species in it for the multi-species images

df_base_temp = df_base.groupby(by=['filename'], as_index=False)['class'].count() # count the distinct species per image (df_base is already deduplicated)
df_base_temp.columns = ['filename', 'count_species'] # rename the columns

# Filtering the multispecies images
df_base_multi_species = df_base_temp[df_base_temp['count_species']>=2] # 1.
# Capturing the multi-species images in a list object 
df_base_multi_species_lst = list(set(df_base_multi_species.filename)) # 1.
print("# of images: {0}".format(len(set(df_base_multi_species['filename']))))
print(len(df_base_multi_species_lst))
df_base_multi_species.head()

# of images: 4142
4142


Unnamed: 0,filename,count_species
943,S1/B07/B07_R1/S1_B07_R1_PICT0461,2
1561,S1/B07/B07_R1/S1_B07_R1_PICT1218,2
1562,S1/B07/B07_R1/S1_B07_R1_PICT1220,2
3969,S1/C05/C05_R3/S1_C05_R3_PICT2416,2
6415,S1/D05/D05_R5/S1_D05_R5_PICT0491,2


In [64]:
# Distribution of the image count for n-distinct species
species_dist = df_base_temp.groupby(by=['count_species'], as_index=False)['filename'].count()
species_dist

Unnamed: 0,count_species,filename
0,1,652754
1,2,4130
2,3,12
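
As a cross-check, the same filter can be written with `nunique`, which counts distinct species per image without a prior `drop_duplicates`. The sketch below runs on toy data; the filenames and the names `species_per_image` and `species_dist_toy` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy predictions: img_a has two species, img_b has two boxes of one species
toy_bbox = pd.DataFrame({
    "filename": ["img_a", "img_a", "img_b", "img_b", "img_c"],
    "class": ["zebra", "wildebeest", "zebra", "zebra", "topi"],
})

# Number of distinct species per image
species_per_image = toy_bbox.groupby("filename")["class"].nunique()

# Images with two or more species are the multi-species images
multi_species_images = species_per_image[species_per_image >= 2].index.tolist()

# Number of images for each distinct-species count
species_dist_toy = species_per_image.value_counts().sort_index()
print(multi_species_images)
print(species_dist_toy)
```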


## 3. Single species images

In [65]:
# 1. Keep the (image, class) rows that belong to single-species images
## a. df_base already excludes HERD images **(Part 1, HERD)**

df_single_species = df_base[~df_base['filename'].isin(df_base_multi_species_lst)]
df_single_species.head()

Unnamed: 0,filename,class
0,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest
6,S5/J02/J02_R1/S5_J02_R1_IMAG0752,wildebeest
12,S5/U10/U10_R4/S5_U10_R4_IMAG3509,wildebeest
30,S3/B07/B07_R11/S3_B07_R11_IMAG0246,gazelleThomsons
31,S2/K09/K09_R1/S2_K09_R1_PICT0812,gazelleThomsons


#### Function to sample images given parameters such as:
1. Species to sample
2. Number of images to sample
    - The sample size is proportional to the inverse frequency of the species in the training data.
    - Only species with fewer than 1000 images in the initial dataset are considered.
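
The inverse-frequency allocation can be sketched as follows. This is an illustration under assumptions, not the script that produced the sampling table: `TOTAL_BUDGET` is a hypothetical overall sample size, and rounding up is an assumed choice. The column names follow the sampling table loaded later in this notebook.

```python
import math
import pandas as pd

TOTAL_BUDGET = 100  # hypothetical total number of images to sample

# Training image counts for three species
toy_counts = pd.DataFrame({
    "Species": ["ostrich", "topi", "eland"],
    "image_count": [750, 812, 861],
})

# Weight each species by the inverse of its training image count
toy_counts["inv_image_count"] = 1.0 / toy_counts["image_count"]

# Normalise the weights so that they sum to 1
toy_counts["p_samples"] = toy_counts["inv_image_count"] / toy_counts["inv_image_count"].sum()

# Round up so that every species receives at least one image (an assumption)
toy_counts["num_sampled_images"] = (toy_counts["p_samples"] * TOTAL_BUDGET).apply(math.ceil)
print(toy_counts)
```

With this weighting, a species with fewer training images receives a larger share of the budget than a species with more.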

In [66]:
def get_image_samples_for_species(dataframe, species, count, seed=None):
    '''Sample `count` images of the given species from `dataframe`.

    Returns all rows of the species when fewer than `count` images are available.'''
    rng = random.Random(seed)  # local generator; leaves the global random state alone
    dataframe_species = dataframe[dataframe['class'] == species]
    # Sort the unique filenames: set order is not stable across runs, which would defeat the seed
    filenames = sorted(set(dataframe_species['filename']))
    if len(filenames) < count:
        return dataframe_species
    filename_sampled = rng.sample(filenames, int(count))
    return dataframe_species[dataframe_species['filename'].isin(filename_sampled)]

In [67]:
# Test the function 
get_image_samples_for_species(df_single_species, 'wildebeest', 1)

Unnamed: 0,filename,class
4690,S6/S09/S09_R1/S6_S09_R1_IMAG1104,wildebeest


In [68]:
# 2(a) Import the dataset with sampling counts for Species
df_sample_size = pd.read_csv('~/data/tensorflow/my_workspace/camera-trap-detection/data/bootstrapping/sample_proportion_{0}.csv'.format(round_number_next), sep=',')
df_sample_size.head()

Unnamed: 0,Species,image_count,inv_image_count,p_samples,num_sampled_images
0,ostrich,750,0.00133,0.62%,32
1,topi,812,0.00123,0.57%,29
2,eland,861,0.00116,0.54%,28
3,human,638,0.00157,0.73%,37
4,impala,780,0.00128,0.60%,30


In [69]:
# Collect the sampled rows (filename and species) for every species in the sampling table
sampled_frames = []
for i in range(df_sample_size.shape[0]):
    species = df_sample_size.iloc[i]['Species'] # get the respective species to sample
    count = df_sample_size.iloc[i]['num_sampled_images'] # get count to sample
    sampled_frames.append(get_image_samples_for_species(df_single_species, species, count))
df_all = pd.concat(sampled_frames) # DataFrame.append was removed in pandas 2.0
df_sampled_species_lst = list(set(df_all['filename']))

In [70]:
print(sum(df_all.groupby(['class'], as_index=False)['filename'].count()['filename']))
print(len(df_sampled_species_lst))
df_all.tail()

1383
1383


Unnamed: 0,filename,class
3299,S1/O10/O10_R2/S1_O10_R2_PICT3226,reptiles
3526,S1/T13/T13_R1/S1_T13_R1_PICT3975,reptiles
1118,S2/K13/K13_R1/S2_K13_R1_PICT0662,zorilla
3303,S2/K13/K13_R1/S2_K13_R1_PICT0663,zorilla
1635,S2/K13/K13_R1/S2_K13_R1_PICT0664,zorilla


## Export the datasets to CSV:
1. herd dataset
2. Multi-Species
3. Single Species

**1. HERD images**

In [74]:
df_herd = df_base_bbox[df_base_bbox['filename'].isin(df_image_herd_lst)]
df_herd.to_csv('../data/bootstrapping/{0}_csv/herd_bbox.csv'.format(round_number_next), index=False)
print(df_herd.shape)

(35597, 6)


**2. Multi-Species**

In [75]:
df_multiSpecies = df_base_bbox[df_base_bbox['filename'].isin(df_base_multi_species_lst)]
df_multiSpecies.to_csv('../data/bootstrapping/{0}_csv/multiSpecies_bbox.csv'.format(round_number_next), index=False)
print(df_multiSpecies.shape)

(20701, 6)


**3. Single Species**

In [76]:
df_singleSpecies = df_base_bbox[df_base_bbox['filename'].isin(df_sampled_species_lst)]
df_singleSpecies.to_csv('../data/bootstrapping/{0}_csv/singleSpecies_bbox.csv'.format(round_number_next), index=False)
print(df_singleSpecies.shape)

(1678, 6)
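
The three exports are meant to be disjoint: herd images are removed from `df_base`, and multi-species images are removed before the single-species sampling. A quick check along the following lines could confirm this before training. The helper `assert_disjoint` is hypothetical, and the toy frames only stand in for `df_herd`, `df_multiSpecies` and `df_singleSpecies`.

```python
import pandas as pd

def assert_disjoint(**named_frames):
    """Raise ValueError if any image appears in more than one of the given dataframes."""
    names = list(named_frames)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = set(named_frames[first]["filename"]) & set(named_frames[second]["filename"])
            if shared:
                raise ValueError(
                    f"{len(shared)} image(s) appear in both {first} and {second}"
                )

# Toy frames standing in for the three exported dataframes
herd = pd.DataFrame({"filename": ["img_a", "img_a"]})
multi = pd.DataFrame({"filename": ["img_b"]})
single = pd.DataFrame({"filename": ["img_c"]})

assert_disjoint(herd=herd, multi=multi, single=single)  # passes silently
```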


## Species distribution in the samples

In [12]:
round_number = 'round1'
filepath_list = glob.glob('../data/bootstrapping/{0}_csv/*'.format(round_number))

In [13]:
filepath_list

['../data/bootstrapping/round1_csv/singleSpecies_bbox.csv',
 '../data/bootstrapping/round1_csv/herd_bbox.csv',
 '../data/bootstrapping/round1_csv/multiSpecies_bbox.csv']

In [14]:
frames = []
for filepath in filepath_list:
    # import csv to pandas
    df_pred_gt_temp = pd.read_csv(filepath)
    # subset the 2 columns
    df_pred_gt_temp = df_pred_gt_temp[['filename', 'class']]
    # Drop duplicates. We need to get the frequency of images for each animal
    df_pred_gt_temp = df_pred_gt_temp.drop_duplicates()
    frames.append(df_pred_gt_temp)
# Concatenate once; DataFrame.append was removed in pandas 2.0
df_pred_gt_consolidated = pd.concat(frames)

# Get the frequency of images for each animal
df_bootstrap = df_pred_gt_consolidated.groupby(by=['class'], as_index=False)['filename'].count()
df_bootstrap.columns = ['species', 'Additional_Data_Size']

In [16]:
df_bootstrap.to_csv('/home/ubuntu/data/tensorflow/my_workspace/camera-trap-detection/EDA_and_ModelEvaluation/BootStrap_{0}_addition.csv'.format(round_number), index=False)