For more information on this project, please visit the project website.
This is the pipeline for processing the image data, tiling the images, preparing the training, validation, and test data, and training the model in TensorFlow. DigitalGlobe data and NOAA data are processed separately. More details on the data used for this project can be found here.
1. Download data
Scrape the image files from the source websites and save them in a folder. For DigitalGlobe, the image files must also be sorted into 3-band and 1-band folders.
2. Compress images
Compress the image files to reduce storage. For DigitalGlobe, this compresses 3 TB of images to 60 GB.
3. Process image files
Apply the appropriate utility scripts as needed, based on observations of the data.
4. Tile images
Clip the large TIFF images into smaller tiles (2048 x 2048 pixels), moving left to right and top to bottom, and write a CSV of the latitude/longitude ranges for each TIFF image.
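The tiling order described above can be sketched as a generator of pixel windows. This is a minimal illustration, not the project's script: the function name `tile_windows` is hypothetical, and it assumes that tiles on the right and bottom edges are simply clipped to the image bounds, which the text above does not specify.

```python
def tile_windows(width, height, tile=2048):
    """Yield (x_off, y_off, w, h) windows, left to right, then top to bottom."""
    for y_off in range(0, height, tile):        # rows: top to bottom
        for x_off in range(0, width, tile):     # columns: left to right
            # Assumption: edge tiles are clipped rather than padded.
            w = min(tile, width - x_off)
            h = min(tile, height - y_off)
            yield x_off, y_off, w, h


# Example: a 5000 x 3000 pixel image yields a 3 x 2 grid of windows.
windows = list(tile_windows(5000, 3000))
```

Each window could then be passed to whatever raster library reads the TIFF, and its geographic extent written to the CSV.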
5. Index tiles to geojson
Using the CSV of latitude/longitude ranges per TIFF image and the GeoJSON file of bounding-box latitude/longitude coordinates (each with its TIFF ID attached), produce a GeoJSON of pixel ranges per bounding box, keyed by the ID of the small TIFF tile.
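One way to match bounding boxes to tiles is an overlap test against each tile's latitude/longitude range. The sketch below assumes the CSV has one row per small tile; the column names (`tile_id`, `lon_min`, `lat_min`, `lon_max`, `lat_max`), the sample values, and the function names are all hypothetical.

```python
import csv
import io

# Hypothetical CSV content: one row per small tile with its lat/long range.
TILE_CSV = """tile_id,lon_min,lat_min,lon_max,lat_max
big_0_0,10.0,20.5,10.5,21.0
big_0_1,10.5,20.5,11.0,21.0
"""


def load_tile_ranges(text):
    """Parse the CSV text into (tile_id, lon_min, lat_min, lon_max, lat_max) tuples."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        (r["tile_id"], float(r["lon_min"]), float(r["lat_min"]),
         float(r["lon_max"]), float(r["lat_max"]))
        for r in rows
    ]


def tiles_for_box(box, tiles):
    """Return the ids of tiles whose extent overlaps box = (lon_min, lat_min, lon_max, lat_max)."""
    b_lon0, b_lat0, b_lon1, b_lat1 = box
    hits = []
    for tile_id, lon0, lat0, lon1, lat1 in tiles:
        # Two axis-aligned rectangles overlap when they overlap on both axes.
        if b_lon0 < lon1 and b_lon1 > lon0 and b_lat0 < lat1 and b_lat1 > lat0:
            hits.append(tile_id)
    return hits


tiles = load_tile_ranges(TILE_CSV)
inside_one = tiles_for_box((10.1, 20.6, 10.2, 20.7), tiles)
across_two = tiles_for_box((10.4, 20.6, 10.6, 20.7), tiles)
```

A box that straddles a tile boundary is reported for both tiles; whether to keep, clip, or drop such boxes is a design choice the text above does not settle.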
6. Convert lat long to pixel coordinates
SSD requires the training data as pixel coordinates, so convert the latitude/longitude coordinates of each bounding box to pixel coordinates.
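A minimal sketch of the conversion, assuming each tile is north-up, its latitude/longitude range is known, and the mapping from degrees to pixels is linear within the tile. The function names and the `bounds` tuple layout are hypothetical; projected imagery would need the raster's actual geotransform instead.

```python
def latlon_to_pixel(lon, lat, bounds, width=2048, height=2048):
    """Map a lon/lat point to (x, y) pixel coordinates within one tile.

    bounds is (lon_min, lat_min, lon_max, lat_max) for the tile.
    Pixel y grows downward, so latitude is measured from the top edge.
    """
    lon_min, lat_min, lon_max, lat_max = bounds
    x = (lon - lon_min) / (lon_max - lon_min) * width
    y = (lat_max - lat) / (lat_max - lat_min) * height
    return x, y


def box_to_pixels(box, bounds, width=2048, height=2048):
    """Convert a (lon_min, lat_min, lon_max, lat_max) box to (x_min, y_min, x_max, y_max)."""
    lon0, lat0, lon1, lat1 = box
    x_min, y_max = latlon_to_pixel(lon0, lat0, bounds, width, height)
    x_max, y_min = latlon_to_pixel(lon1, lat1, bounds, width, height)
    return x_min, y_min, x_max, y_max


tile_bounds = (10.0, 20.0, 11.0, 21.0)
point = latlon_to_pixel(10.5, 20.75, tile_bounds)
pixel_box = box_to_pixels((10.25, 20.25, 10.5, 20.75), tile_bounds)
```

Note the axis flip: the box's southern edge becomes the larger pixel y value.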
7. Split training data
Split the images and geojson file into training, validation and test subsets (8:1:1).
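The 8:1:1 split can be sketched as a seeded shuffle over tile IDs followed by slicing. The function name `split_ids` and the fixed seed are illustrative assumptions; splitting by tile ID (rather than by individual bounding box) keeps all labels for one image in the same subset.

```python
import random


def split_ids(ids, ratios=(8, 1, 1), seed=0):
    """Shuffle ids reproducibly and split them into train, validation and test lists."""
    ids = sorted(ids)                      # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]           # the remainder goes to the test set
    return train, val, test


tile_ids = [f"tile_{i:03d}" for i in range(100)]
train_ids, val_ids, test_ids = split_ids(tile_ids)
```

The GeoJSON features would then be filtered by the tile ID in each subset.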
8. Debug dataset
Use an IPython notebook to render the bounding boxes over the images (TIFF files) and manually inspect them for accuracy. Record any bad labels and remove those bounding boxes from the GeoJSON file.
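Once bad labels have been recorded, removing them from the GeoJSON amounts to filtering the feature list. This is a sketch under assumptions: the property key `bb_id` and the function name `drop_bad_labels` are hypothetical, and the sample features carry no geometry for brevity.

```python
def drop_bad_labels(geojson, bad_ids, id_key="bb_id"):
    """Return a copy of the FeatureCollection without features whose id is in bad_ids."""
    kept = [
        feature for feature in geojson["features"]
        if feature["properties"].get(id_key) not in bad_ids
    ]
    # Build a new dict so the original collection is left untouched.
    return {**geojson, "features": kept}


labels = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"bb_id": 1}, "geometry": None},
        {"type": "Feature", "properties": {"bb_id": 2}, "geometry": None},
        {"type": "Feature", "properties": {"bb_id": 3}, "geometry": None},
    ],
}
cleaned = drop_bad_labels(labels, bad_ids={2})
```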
9. Data augmentation
Shift, flip and rotate the images as a way to add more training data.
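When an image is flipped or rotated, its bounding boxes must be transformed the same way. The sketch below shows only the box arithmetic for a horizontal flip and a 90-degree clockwise rotation; the function names are hypothetical, boxes are assumed to be (x_min, y_min, x_max, y_max) in pixel coordinates, and the matching image transform is left to the image library in use.

```python
def flip_box_horizontal(box, width):
    """Mirror a box left to right within an image of the given width."""
    x_min, y_min, x_max, y_max = box
    return width - x_max, y_min, width - x_min, y_max


def rotate_box_clockwise(box, height):
    """Rotate a box 90 degrees clockwise; height is the image height before rotation.

    A point (x, y) moves to (height - y, x), so the old bottom edge
    becomes the new left edge.
    """
    x_min, y_min, x_max, y_max = box
    return height - y_max, x_min, height - y_min, x_max


box = (100, 200, 300, 400)
flipped = flip_box_horizontal(box, width=2048)
rotated = rotate_box_clockwise(box, height=2048)
```

On a square 2048 x 2048 tile, applying the rotation four times returns the original box, which is a quick sanity check for the arithmetic.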
10. Feed training data to algorithm
Prepare input for the network.