### Bootstrapping data prep pipeline

This is a weakly supervised technique and should be augmented with human annotations at a higher level.


### Sampler function for data prep for training


- The implementation of the blocks drawn in blue ink in the figure below.
- A set of functions that take as input:
    - CSVs at multiple levels: (image, species count) level, (image, bounding box) level, etc.
    - Sampling technique
    - Number of samples required
    - This list will evolve as I write the code
    

<img src="IMG_0026.jpg" width="600" height="200">



1. Read all the shards created - ```/home/ubuntu/data/tensorflow/my_workspace/training_demo/Predictions/round1/bootstrap_data_<>```
2. Consolidate the list of filenames and animals
3. Sample them according to the proportions that I build
4. Choose all the multi-species images

In [1]:
import pandas as pd
import csv, glob, random
from timeit import default_timer as timer
import numpy as np
import matplotlib.pyplot as plt

### Dataframe with all shards appended
Generate a master dataset with all the appended bounding boxes

In [2]:
start = timer()
shard_paths = glob.glob('/home/ubuntu/data/tensorflow/my_workspace/training_demo/Predictions/round1/bootstrap_data_snapshot_serengeti_s01_s06_*')
# DataFrame.append was removed in pandas 2.0; read every shard and concatenate once
df_base_bbox = pd.concat([pd.read_csv(path) for path in shard_paths])
end = timer()
print(round((end - start), 1))
print(df_base_bbox.shape)

17.1
(1789370, 6)


In [3]:
df_base_bbox.head()

Unnamed: 0,filename,class,xmin,ymin,xmax,ymax
0,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.68684,0.403407,1.0,0.876225
1,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.64696,0.554431,0.722113,0.685629
2,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.206744,0.49246,0.337849,0.69106
3,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.351468,0.577559,0.393431,0.608868
4,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest,0.350235,0.578253,0.391804,0.608087


## 1. HERD
We first sample the herd images, i.e. images with 11 or more predicted bounding boxes. Since there are fewer than 5000 herd images (2747, to be precise), we take all of them for model training.
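
To make the herd criterion concrete, here is a minimal sketch on toy data. The filenames are hypothetical, and `groupby(...).size()` is used as a stand-in for the `xmin` count in the cell that follows (the two agree as long as `xmin` has no missing values).

```python
import pandas as pd

HERD_MIN_BOXES = 11  # same threshold as the notebook cell that follows

# Toy detections: one row per predicted box (hypothetical filenames)
toy_boxes = pd.DataFrame({
    'filename': ['img_a'] * 12 + ['img_b'] * 3,
    'class': ['wildebeest'] * 12 + ['zebra'] * 3,
})

# Number of boxes per image, then keep images at or above the threshold
boxes_per_image = toy_boxes.groupby('filename').size()
herd_images = boxes_per_image[boxes_per_image >= HERD_MIN_BOXES].index.tolist()
print(herd_images)
```

Only `img_a`, with 12 boxes, qualifies as a herd image here.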

In [3]:
df_image_popultion_dist = df_base_bbox.groupby(by=['filename'], as_index=False)['xmin'].count()
df_image_popultion_dist.columns = ['filename', 'count_box']
df_image_herd_lst = list(df_image_popultion_dist[df_image_popultion_dist['count_box'] >= 11]['filename'])
df_image_popultion_dist = df_image_popultion_dist.groupby(by=['count_box'], as_index=False)['filename'].count()
df_image_popultion_dist.columns = ['count_box', 'num_images']

# plt.bar(df_image_popultion_dist.count_box, df_image_popultion_dist.num_images)
df_image_popultion_dist.tail()

Unnamed: 0,count_box,num_images
16,17,31
17,18,10
18,19,9
19,20,1
20,21,1


In [4]:
len(df_image_herd_lst)

2747

## 2. Multi-Species images
1. Here I will sample 5000 (or all) multi-species images. The training dataset does not contain multi-species images, so I want to include them in the next model training loop.
2. Exclude the images that were part of **part 1 - HERD**.
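
As a small illustration of what "multi-species" means here, the sketch below flags toy images whose boxes cover two or more distinct classes. The filenames are hypothetical, and `nunique` is an alternative to the deduplicate-then-count approach used in the cells that follow.

```python
import pandas as pd

# Toy (filename, class) rows: one row per predicted box (hypothetical filenames)
toy = pd.DataFrame({
    'filename': ['img_a', 'img_a', 'img_b', 'img_b', 'img_c'],
    'class': ['zebra', 'wildebeest', 'zebra', 'zebra', 'impala'],
})

# Count the distinct species per image and keep images with at least two
species_per_image = toy.groupby('filename')['class'].nunique()
multi_species_images = species_per_image[species_per_image >= 2].index.tolist()
print(multi_species_images)
```

`img_b` has two boxes but only one species, so it is not counted as multi-species.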

In [8]:
# 1. get unique filename and class. i.e, for each of the images gets the species in the data
# 2. Exclude the images that were part of HERD list
# 3. Get a dataframe that has image name and the animals in the image. 
start = timer() # time the process
shard_frames = []
for path in glob.glob('/home/ubuntu/data/tensorflow/my_workspace/training_demo/Predictions/round1/bootstrap_data_snapshot_serengeti_s01_s06_*'):
    df = pd.read_csv(path)
    shard_frames.append(df[['filename', 'class']].drop_duplicates())
# DataFrame.append was removed in pandas 2.0; concatenate the shards once instead
df_base = pd.concat(shard_frames).drop_duplicates()
df_base = df_base[~df_base['filename'].isin(df_image_herd_lst)]  # 2. Excluding images in HERD list
end = timer()
print("execution time: {0} Seconds".format(round(end - start, 1)))

execution time: 49.2 Seconds


In [9]:
# 1. From the above dataframe, get the list of images that have multiple animals.
# 2. Get the dataframe with image names and the species in it for the multi-species images

df_base_temp = df_base.groupby(by=['filename'], as_index=False)['class'].count() # group by filename to get the number of distinct species per image
df_base_temp.columns = ['filename', 'count_species'] # rename the columns

# Filtering the multispecies images
df_base_multi_species = df_base_temp[df_base_temp['count_species']>=2] # 1.
# Capturing the multi-species images in a list object 
df_base_multi_species_lst = list(set(df_base_multi_species.filename)) # 1.
print("# of images: {0}".format(len(set(df_base_multi_species['filename']))))
print(len(df_base_multi_species_lst))
df_base_multi_species.head()

# of images: 6505
6505


Unnamed: 0,filename,count_species
434,S1/B05/B05_R3/S1_B05_R3_PICT0500,2
435,S1/B05/B05_R3/S1_B05_R3_PICT0501,2
951,S1/B07/B07_R1/S1_B07_R1_PICT0459,2
952,S1/B07/B07_R1/S1_B07_R1_PICT0460,2
1571,S1/B07/B07_R1/S1_B07_R1_PICT1219,2


In [10]:
# Distribution of the image count for n-distinct species
species_dist = df_base_temp.groupby(by=['count_species'], as_index=False)['filename'].count()
species_dist

Unnamed: 0,count_species,filename
0,1,655759
1,2,6488
2,3,17


## 3. Single species images

In [11]:
# 1. Keep the (image, class) rows that belong to single-species images
## a. df_base already excludes HERD images **(Part 1, HERD)**

df_single_species = df_base[~df_base['filename'].isin(df_base_multi_species_lst)]
df_single_species.head()

Unnamed: 0,filename,class
0,S5/B04/B04_R3/S5_B04_R3_IMAG0200,wildebeest
6,S5/J02/J02_R1/S5_J02_R1_IMAG0752,wildebeest
12,S5/U10/U10_R4/S5_U10_R4_IMAG3509,wildebeest
17,S3/B07/B07_R11/S3_B07_R11_IMAG0246,gazelleThomsons
18,S2/K09/K09_R1/S2_K09_R1_PICT0812,gazelleThomsons


#### Function to sample images given parameters such as:
1. Species to sample
2. Number of images to sample
    - The sample size is proportional to the inverse frequency of the species in the training data.
    - Only species with fewer than 1000 images in the initial dataset are considered.
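
The inverse-frequency idea can be sketched as follows. This is an illustration under assumptions, not the procedure used to build the sampling table: the helper name `inverse_frequency_allocation`, the budget of 300 images, and the rounding rule are all hypothetical.

```python
import pandas as pd

def inverse_frequency_allocation(image_counts, total_samples):
    """Split total_samples across species in proportion to 1 / image_count."""
    inv = 1.0 / image_counts.astype(float)
    proportions = inv / inv.sum()  # normalise so the proportions sum to 1
    return (proportions * total_samples).round().astype(int)

# Illustrative per-species training image counts
toy_counts = pd.Series({'ostrich': 676, 'topi': 674, 'eland': 665})
allocation = inverse_frequency_allocation(toy_counts, total_samples=300)
print(allocation)
```

Species with fewer training images receive a slightly larger share of the budget, which is the intent of inverse-frequency sampling. Because of rounding, the allocations may not sum exactly to the budget.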

In [12]:
def get_image_samples_for_species(dataframe, species, count, seed=None):
    '''Sample up to `count` images of the given species from the dataframe.
    If fewer than `count` rows exist for the species, all of them are returned.'''
    random.seed(seed)
    dataframe_species = dataframe[dataframe['class'] == species]
    if len(dataframe_species) < count:
        return dataframe_species
    # Sort the unique filenames so that a fixed seed gives a reproducible sample
    filenames = sorted(set(dataframe_species['filename']))
    filename_sampled = random.sample(filenames, int(count))
    return dataframe_species[dataframe_species['filename'].isin(filename_sampled)]

In [13]:
# Test the function 
get_image_samples_for_species(df_single_species, 'wildebeest', 1)

Unnamed: 0,filename,class
4084,S4/B13/B13_R1/S4_B13_R1_IMAG0606,wildebeest


In [14]:
# 2(a) Import the dataset with sampling counts for Species
df_sample_size = pd.read_csv('~/data/tensorflow/my_workspace/camera-trap-detection/data/bootstrapping/sample_proportion_round1.csv', sep='\t')
df_sample_size.head()

Unnamed: 0,Species,Training_boxes_Count,image_count,inv_image_count,p_samples,num_sampled_images
0,ostrich,1007,676,0.00148,1.30%,65
1,topi,1690,674,0.00148,1.30%,66
2,eland,1812,665,0.0015,1.32%,66
3,human,871,629,0.00159,1.39%,70
4,impala,2317,622,0.00161,1.41%,71


In [24]:
# Collect the sampled (filename, class) rows for every species in the sampling table
sampled_frames = []
for i in range(df_sample_size.shape[0]):
    species = df_sample_size.iloc[i]['Species'] # get the respective species to sample
    count = df_sample_size.iloc[i]['num_sampled_images'] # get count to sample
    sampled_frames.append(get_image_samples_for_species(df_single_species, species, count))
# DataFrame.append was removed in pandas 2.0; concatenate once instead
df_all = pd.concat(sampled_frames)
df_sampled_species_lst = list(set(df_all['filename']))

In [25]:
print(sum(df_all.groupby(['class'], as_index=False)['filename'].count()['filename']))
print(len(df_sampled_species_lst))
df_all.tail()

3003
3003


Unnamed: 0,filename,class
3041,S2/G12/G12_R3/S2_G12_R3_IMAG0350,rhinoceros
1824,S2/G12/G12_R3/S2_G12_R3_IMAG0421,rhinoceros
4349,S2/G12/G12_R3/S2_G12_R3_IMAG0293,rhinoceros
3581,S2/G12/G12_R3/S2_G12_R3_IMAG0351,rhinoceros
1410,S2/G12/G12_R3/S2_G12_R3_IMAG0414,rhinoceros


## Export the datasets to CSV:
1. herd dataset
2. Multi-Species
3. Single Species
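
The three exports below follow the same pattern: filter the bounding-box dataframe to a list of filenames and write the result to CSV. A minimal sketch of that pattern as a reusable helper is shown here; the helper name `export_bbox_subset`, the toy data, and the temporary output path are hypothetical.

```python
import os
import tempfile

import pandas as pd

def export_bbox_subset(df_bbox, filenames, out_path):
    """Keep the boxes whose filename is in `filenames` and write them to CSV."""
    subset = df_bbox[df_bbox['filename'].isin(set(filenames))]
    subset.to_csv(out_path, index=False)
    return subset

# Toy bounding-box table (hypothetical filenames and values)
toy_bbox = pd.DataFrame({
    'filename': ['img_a', 'img_a', 'img_b'],
    'class': ['zebra', 'zebra', 'impala'],
    'xmin': [0.1, 0.4, 0.2],
})
out_path = os.path.join(tempfile.mkdtemp(), 'toy_subset.csv')
toy_subset = export_bbox_subset(toy_bbox, ['img_a'], out_path)
print(toy_subset.shape)
```

`Series.isin` does the membership test in one vectorised pass, which matters when the bounding-box table has well over a million rows.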

**1. HERD images**

In [21]:
df_herd = df_base_bbox[df_base_bbox['filename'].isin(df_image_herd_lst)]
df_herd.to_csv('../data/bootstrapping/round1_csv/herd_bbox.csv', index=False)
print(df_herd.shape)

**2. Multi-Species**

In [23]:
df_multiSpecies = df_base_bbox[df_base_bbox['filename'].isin(df_base_multi_species_lst)]
df_multiSpecies.to_csv('../data/bootstrapping/round1_csv/multiSpecies_bbox.csv', index=False)
print(df_multiSpecies.shape)

(25181, 6)


**3. Single Species**

In [26]:
df_singleSpecies = df_base_bbox[df_base_bbox['filename'].isin(df_sampled_species_lst)]
df_singleSpecies.to_csv('../data/bootstrapping/round1_csv/singleSpecies_bbox.csv', index=False)
print(df_singleSpecies.shape)

(3735, 6)
