# Data Model for Object Detection
The task here is to prepare a TFRecord dataset that can be fed into the [object detection API of tensorflow](https://github.com/tensorflow/models/tree/master/research/object_detection). This notebook uses a subset of the [GSSS](https://datadryad.org/resource/doi:10.5061/dryad.5pt92) dataset that was used in this [paper](https://datadryad.org/resource/doi:10.5061/dryad.5pt92) by Schneider. <br> 
I broke down the data model into the following steps:<br>
1. Database creation - As part of this step I consolidate the input data, which comes in various formats, into a single JSON file.
2. Using this JSON file to create a TensorFlow record
3. Validating the pipeline


The detailed steps that I follow are:
1. Data Export : CSV (from the Panoptes API) -> JSON file
2. Data Import : JSON file -> Dictionary object 
3. Write TFRecord : Dictionary Object -> TFRecord file
4. Validate data in the TFRecord
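
As a rough illustration of steps 1 and 2, the sketch below groups one-row-per-box CSV data by image, serialises it to JSON, and reads it back as a dictionary. It uses only the standard library and a few rows shaped like the bounding-box CSV loaded later in this notebook. The function name `rows_to_records` and the record layout are assumptions made for illustration, not necessarily the schema used in the rest of the pipeline.

```python
import csv
import io
import json

# Sample rows with the same columns as the bounding-box CSV used in this notebook
SAMPLE_CSV = """filename,width,height,class,xmin,ymin,xmax,ymax
ASG000dz24.jpg,2048,1536,Impala,1141,883,1227,977
ASG000dz24.jpg,2048,1536,Impala,1340,876,1381,925
ASG000c7hr.jpg,2048,1536,Wildebeest,1987,680,2048,751
"""


def rows_to_records(csv_file):
    """Group one-row-per-box CSV data into one record per image (hypothetical layout)."""
    records = {}
    for row in csv.DictReader(csv_file):
        # Create the per-image record the first time a filename is seen
        record = records.setdefault(row['filename'], {
            'width': int(row['width']),
            'height': int(row['height']),
            'boxes': [],
        })
        record['boxes'].append({
            'class': row['class'],
            'xmin': int(row['xmin']), 'ymin': int(row['ymin']),
            'xmax': int(row['xmax']), 'ymax': int(row['ymax']),
        })
    return records


# Step 1: CSV -> JSON text
json_text = json.dumps(rows_to_records(io.StringIO(SAMPLE_CSV)), indent=2)
# Step 2: JSON text -> dictionary object
records = json.loads(json_text)
print(len(records['ASG000dz24.jpg']['boxes']))
```

Keeping one record per image, with its boxes nested inside, matches how a TFRecord example is usually built: one example per image carrying all of its annotations.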

## Importing necessary packages

In [13]:
import csv, os, sys
import operator
import tensorflow as tf
import json
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
sys.path.append('/home/ubuntu/data/tensorflow/my_workspace/camera-trap-detection/data/')
from utils import dataset_util
# Allow PIL to load truncated files, to avoid truncation errors while decoding JPEGs
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

In [14]:
# Project root. Earlier locations: "/home/rai00007/Desktop/zooniverse/"; original data: "/data/lucifer1.2/users/rai00007/"
Project_filepath = "/home/ubuntu/data/tensorflow/my_workspace/training_demo/"

In [17]:
# Bounding-box annotations, one row per box
df_schneider_box = pd.read_csv(Project_filepath + 'Data/GoldStandardBoundBoxCoord.csv')
# Unique capture-event IDs: file names with the extension stripped
schneider_events = list(set(df_schneider_box['filename']))
schneider_events = [word.split('.')[0] for word in schneider_events]
len(schneider_events)
df_schneider_box.head()

Unnamed: 0,filename,width,height,class,xmin,ymin,xmax,ymax
0,ASG000dz24.jpg,2048,1536,Impala,1141,883,1227,977
1,ASG000dz24.jpg,2048,1536,Impala,1340,876,1381,925
2,ASG000dz24.jpg,2048,1536,Impala,1448,803,1538,1042
3,ASG000dz24.jpg,2048,1536,Impala,1382,763,1485,1080
4,ASG000c7hr.jpg,2048,1536,Wildebeest,1987,680,2048,751
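
The coordinates above are absolute pixel values. The dataset-conversion scripts that ship with the TensorFlow Object Detection API store box coordinates as fractions of the image width and height, so a conversion along these lines is likely needed before writing the record. This is a minimal sketch; the helper `normalize_box` is hypothetical and not taken from this notebook.

```python
def normalize_box(xmin, ymin, xmax, ymax, width, height):
    """Convert absolute pixel coordinates to fractions in [0, 1]."""
    if not (0 <= xmin <= xmax <= width and 0 <= ymin <= ymax <= height):
        raise ValueError('box lies outside the image or has negative size')
    return (xmin / width, ymin / height, xmax / width, ymax / height)


# Last sample row above: Wildebeest box in a 2048 x 1536 image
box = normalize_box(1987, 680, 2048, 751, width=2048, height=1536)
print(box)
```

The bounds check is there because a box touching the image edge (as in this row, where `xmax` equals the width) is valid, while one extending past it usually signals an annotation or resizing problem.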


In [23]:
df_all_images = pd.read_csv(Project_filepath + 'Data/all_images.csv')
df_all_images = df_all_images[df_all_images['CaptureEventID'].isin(schneider_events)]
print(df_all_images.shape)
df_all_images.head()

(10597, 2)


Unnamed: 0,CaptureEventID,URL_Info
1377257,ASG000c6uw,S4/B03/B03_R1/S4_B03_R1_IMAG0137.JPG
1377258,ASG000c6uw,S4/B03/B03_R1/S4_B03_R1_IMAG0138.JPG
1377259,ASG000c6uw,S4/B03/B03_R1/S4_B03_R1_IMAG0139.JPG
1381262,ASG000c6x1,S4/B03/B03_R1/S4_B03_R1_IMAG4142.JPG
1381263,ASG000c6x1,S4/B03/B03_R1/S4_B03_R1_IMAG4143.JPG


In [41]:
# Keep the first image of each capture event; .copy() avoids pandas' SettingWithCopyWarning
df = df_all_images.drop_duplicates(subset='CaptureEventID', keep='first').copy()
df['URL_info_full'] = 'https://snapshotserengeti.s3.msi.umn.edu/' + df['URL_Info'].astype(str)
df.iloc[1]


CaptureEventID                                           ASG000c6x1
URL_Info                       S4/B03/B03_R1/S4_B03_R1_IMAG4142.JPG
URL_info_full     https://snapshotserengeti.s3.msi.umn.edu/S4/B0...
Name: 1381262, dtype: object

**Download the images**

In [42]:
import os, sys, random, ssl
import urllib, urllib.request

In [43]:
def get_images_from_url(dataset, image_name_index, url_col_index, outpath):
    # Fall back to an unverified SSL context unless PYTHONHTTPSVERIFY is set
    if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
            getattr(ssl, '_create_unverified_context', None)):
        ssl._create_default_https_context = ssl._create_unverified_context

    # Download every row, whether or not the SSL fallback above was applied
    for i in range(dataset.shape[0]):
        image_name = dataset.iloc[i, image_name_index]
        url = dataset.iloc[i, url_col_index]

        print('Processing image: %d' % i)

        # Save each image as <outpath>/<image name>.jpg
        urllib.request.urlretrieve(url, os.path.join(outpath, '{0}.jpg'.format(image_name)))
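
A download of this size tends to get interrupted, so it can help to skip files that already exist and to collect failures instead of stopping at the first error. The following is a sketch under those assumptions; `download_images`, its `fetch` parameter, and its return value are hypothetical and not part of the original notebook.

```python
import os
import urllib.request


def download_images(names_and_urls, outpath, fetch=urllib.request.urlretrieve):
    """Download each (name, url) pair to <outpath>/<name>.jpg.

    Skips files that already exist and returns the names that failed.
    `fetch` is injectable so the function can be tried without network access.
    """
    os.makedirs(outpath, exist_ok=True)
    failed = []
    for name, url in names_and_urls:
        target = os.path.join(outpath, '{0}.jpg'.format(name))
        if os.path.exists(target):
            continue  # already downloaded in an earlier run
        try:
            fetch(url, target)
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            print('Failed to download %s: %s' % (name, err))
            failed.append(name)
    return failed
```

With the DataFrame built above, a call might look like `download_images(zip(df['CaptureEventID'], df['URL_info_full']), outpath)`; rerunning it after an interruption would then fetch only the missing files.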

In [45]:
outpath = '../data/LILA/schneider_images/'

In [None]:
get_images_from_url(df, 0, 2, outpath)

In [48]:
pwd

'/home/ubuntu/data/tensorflow/my_workspace/camera-trap-detection/data_prep'