# Prepare Dataset

## Data Understanding
Images were obtained from two wildlife cameras (a Browning and a Reconyx) set up near a beaver lodge on Lake Sammamish, Washington. Over 2,000 images were collected between April and June 2019. Approximately 70% of the images contain beaver instances; the rest are false triggers or other animals (roof rat, raccoon, squirrel, muskrat, otter, rabbit, frog, and various bird species). Most of the images are grayscale because the animals are more active at night. Both cameras were motion activated and set to take at least 3 pictures when triggered. The date/time was off on the Browning camera, and images came in weekly zip files with duplicate file names across zip files. Challenges to object detection include class imbalance (few examples of non-beaver species), small instances (e.g. just the tip of a beaver tail), blur, grayscale, truncation, and occlusion.

## Data Cleaning

Image data is large, so before adding any data to my repo I made a data directory and added it to my .gitignore file. Any other files and folders that were too big for GitHub I also added to .gitignore and stored in a Google Cloud Storage bucket. I used [Photos for macOS](https://www.apple.com/macos/photos/) and [ExifRenamer](https://www.macupdate.com/app/mac/10043/exifrenamer) to correct datetimes and rename all images by site, camera, and datetime. There is only one site so far (s1). The Reconyx is camera 1 (c1) and the Browning is camera 2 (c2). For example, an image taken on the Reconyx on April 11th, 2019 at 9:30:56 a.m. is named s1c120190411_093056.jpg. Cleaned images were placed in the folder ddb/data/images.
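The renaming itself was done with ExifRenamer, not with code. Purely to illustrate the naming convention, here is a minimal Python sketch that builds a name from a site number, a camera number, and a capture time. The function name `image_name` and its parameters are hypothetical and not part of this project.

```python
from datetime import datetime


def image_name(site: int, camera: int, taken: datetime, ext: str = "jpg") -> str:
    """Build a file name of the form s<site>c<camera><YYYYMMDD>_<HHMMSS>.<ext>.

    Hypothetical helper that only illustrates the naming convention.
    """
    return f"s{site}c{camera}{taken:%Y%m%d_%H%M%S}.{ext}"


# Reconyx (camera 1) at site 1, April 11th, 2019 at 9:30:56 a.m.
name = image_name(1, 1, datetime(2019, 4, 11, 9, 30, 56))
print(name)  # s1c120190411_093056.jpg
```

Because the date and time are zero-padded and ordered from year down to second, sorting the names alphabetically also sorts each camera's images chronologically.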

## Data Annotation

I used [LabelImg](https://github.com/tzutalin/labelImg) to draw bounding boxes around the animal instances in each image. If an image contained no animal instances, I marked it as verified so that LabelImg would still create an annotation file, just with no bounding boxes. LabelImg creates one .xml file per image. Annotations were saved to the folder ddb/data/annots.
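To make the annotation format concrete, here is a small sketch that reads the labels and boxes out of one annotation. It assumes LabelImg was saving in its Pascal VOC XML layout (`object`, `name`, and `bndbox` elements). The sample XML, its image size, and the box coordinates are invented for illustration, and `read_boxes` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Invented example in the Pascal VOC layout; not a real annotation from the dataset.
SAMPLE_XML = """<annotation>
    <filename>s1c120190411_093056.jpg</filename>
    <size><width>1920</width><height>1080</height><depth>3</depth></size>
    <object>
        <name>beaver</name>
        <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>400</xmax><ymax>500</ymax></bndbox>
    </object>
</annotation>"""


def read_boxes(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) for every object in the XML."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = tuple(int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.findtext("name"), coords))
    return boxes


print(read_boxes(SAMPLE_XML))  # [('beaver', (100, 200, 400, 500))]
```

An annotation for a verified image with no animals has no `object` elements, so the same function returns an empty list for it.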

## Train Test Split Data

I decided to make two sets of data: a subset that contains only beaver images, and the full dataset. For the beaver subset, I made train and test directories in my data directory and ran the following code.

```python
from model_functions.data_preprocessing import (
    make_annot_list,
    make_species_xml_list,
    train_test_split_data,
)
import xml.etree.ElementTree as et
import glob
import os
from os import listdir
import shutil
from sklearn.model_selection import train_test_split

# Jupyter magics: reload edited modules automatically
%reload_ext autoreload
%autoreload 2
```

```python
# Collect the annotation files that contain at least one beaver
data_dir = './data/'
beaver_list = make_species_xml_list('beaver', data_dir)

# Split the beaver subset 80/20 into the train and test directories
train_test_split_data(beaver_list,
                      data_dir,
                      test_size=0.2,
                      random_state=42)
```

I renamed the resulting train and test directories to train_beaver and test_beaver. Then I made another pair of train and test directories and repeated the split with the whole dataset.

```python
# Split every annotation file (all species) 80/20 into train and test
annot_list = make_annot_list(data_dir)
train_test_split_data(annot_list,
                      data_dir,
                      test_size=0.2,
                      random_state=42)
```
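The internals of train_test_split_data are not shown here. As a rough sketch of the idea behind a reproducible 80/20 split, the example below uses only the standard library: it shuffles a list of file names with a fixed seed and slices off the test portion. The project itself imports scikit-learn's `train_test_split`, so this is a stand-in technique and not the actual helper; `split_names` and the file names are hypothetical.

```python
import random


def split_names(names, test_size=0.2, random_state=42):
    """Return (train, test) lists produced by a seeded shuffle of `names`."""
    shuffled = sorted(names)  # sort first so the result does not depend on input order
    random.Random(random_state).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Ten made-up annotation file names
names = [f"img_{i:03d}.xml" for i in range(10)]
train, test = split_names(names)
print(len(train), len(test))  # 8 2
```

Fixing the seed means the same files land in the test set on every run, which keeps evaluation results comparable between training runs.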

Next I made label maps for the beaver subset (beaver_label_map.pbtxt) and the whole dataset (species_label_map.pbtxt) in ddb/annotations. Label maps are pretty straightforward: they map each class label to an integer.
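For reference, a label map is a short text file with one `item` entry per class. The sketch below generates that text from a list of class names. The function name `label_map_text` is hypothetical, and the two-species list in the usage line is only an example, not the actual contents of species_label_map.pbtxt.

```python
def label_map_text(labels):
    """Return label map text with one item per label, numbering ids from 1."""
    items = []
    for idx, name in enumerate(labels, start=1):
        items.append(f"item {{\n    id: {idx}\n    name: '{name}'\n}}\n")
    return "\n".join(items)


# Single-class map, as for the beaver subset
print(label_map_text(["beaver"]))

# Example with more than one class (illustrative order only)
print(label_map_text(["beaver", "raccoon"]))
```

Ids start at 1 because the TensorFlow Object Detection API reserves id 0 for the background class.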

I used the following code from the [TensorFlow Object Detection API Tutorial](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#creating-label-map) to make TFRecord files from the train and test data. First, I added the xml_to_csv and generate_tfrecord scripts to my model_functions directory. Second, I moved into my model_functions directory in my terminal and ran the xml_to_csv script to convert the .xml annotation files into one csv file per split. I added the csv files to the data directory.

```bash
# create train data
python xml_to_csv.py -i [PATH_TO_IMAGES_FOLDER]/train -o [PATH_TO_ANNOTATIONS_FOLDER]/train_labels.csv

# create test data
python xml_to_csv.py -i [PATH_TO_IMAGES_FOLDER]/test -o [PATH_TO_ANNOTATIONS_FOLDER]/test_labels.csv
```
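To show what the resulting csv files look like, here is a minimal sketch that writes one row per bounding box. The column names are assumed to match those used by the tutorial's xml_to_csv script, and the row values are invented; `rows_to_csv` is a hypothetical helper, not part of that script.

```python
import csv
import io

# Assumed column layout: one row per bounding box
COLUMNS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]


def rows_to_csv(rows):
    """Serialize a list of row dicts to csv text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented example row; not taken from the real dataset
rows = [
    {"filename": "s1c120190411_093056.jpg", "width": 1920, "height": 1080,
     "class": "beaver", "xmin": 100, "ymin": 200, "xmax": 400, "ymax": 500},
]
print(rows_to_csv(rows))
```

With one row per box, an image that has several animals appears on several rows, and an image with no boxes contributes no rows at all.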

Then I ran the generate_tfrecord script to convert those csv files to TFRecord files. I only did this for the beaver data; generating records for the full dataset, with the other species included, is a future goal.

```bash
# create train data (backslashes continue the command onto the next line)
python generate_tfrecord.py --label=<LABEL> --csv_input=<PATH_TO_ANNOTATIONS_FOLDER>/train_labels.csv \
    --img_path=<PATH_TO_IMAGES_FOLDER>/train \
    --output_path=<PATH_TO_ANNOTATIONS_FOLDER>/train.record

# create test data
python generate_tfrecord.py --label=<LABEL> --csv_input=<PATH_TO_ANNOTATIONS_FOLDER>/test_labels.csv \
    --img_path=<PATH_TO_IMAGES_FOLDER>/test \
    --output_path=<PATH_TO_ANNOTATIONS_FOLDER>/test.record
```

Now the data is ready to train on.