Takes a GeoTiff image and labeled training polygons in a geojson file and produces a trained Convolutional Neural Network (CNN) classifier. The network architecture is VGG Net, which was developed as part of the 2014 ImageNet challenge. See here for an example of a specific implementation of this network.
Here we execute an example in which a classifier is trained to find property polygons with pools. Training data is provided in the specified S3 location.
Within an IPython terminal, create a GBDX interface and specify the task input location:
from gbdxtools import Interface
from os.path import join
import uuid

gbdx = Interface()
input_location = 's3://gbd-customer-data/32cbab7a-4307-40c8-bb31-e2de32f940c2/platform-stories/train-cnn-classifier'
Create a task instance and set the required inputs:
train_task = gbdx.Task('train-cnn-classifier')
train_task.inputs.images = join(input_location, 'images')
train_task.inputs.geojson = join(input_location, 'geojson')
train_task.inputs.classes = 'No swimming pool, Swimming pool'
Set any optional hyper-parameters if necessary. With the following parameters training should take about three hours to complete:
train_task.inputs.two_rounds = 'True'
train_task.inputs.nb_epoch = '30'
train_task.inputs.nb_epoch_2 = '5'
train_task.inputs.train_size = '4500'
train_task.inputs.train_size_2 = '2500'
train_task.inputs.test_size = '1000'
train_task.inputs.bit_depth = '8'
Initialize a workflow and specify where to save the output:
train_wf = gbdx.Workflow([train_task])
random_str = str(uuid.uuid4())
output_location = join('platform-stories/trial-runs', random_str)
train_wf.savedata(train_task.outputs.trained_model, join(output_location, 'trained_model'))
Execute the workflow:
Track the status of the workflow as follows:
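A minimal sketch of these two steps, assuming the standard gbdxtools Workflow interface, where `execute()` submits the workflow and the `status` property returns a dictionary describing its state. The helper `wait_for_workflow` and the `'complete'` state name it checks are assumptions made for illustration; the helper is not part of gbdxtools.

```python
import time


def wait_for_workflow(workflow, poll_seconds=60, max_polls=600):
    """Poll workflow.status until the workflow is done or max_polls is reached.

    Hypothetical helper: assumes workflow.status is a dict with a 'state' key
    that becomes 'complete' once the workflow has finished.
    """
    status = workflow.status
    for _ in range(max_polls):
        if status.get('state') == 'complete':
            break
        time.sleep(poll_seconds)
        status = workflow.status
    return status


# With the workflow defined above (requires GBDX credentials):
# train_wf.execute()
# print(train_wf.status)
# final_status = wait_for_workflow(train_wf)
```

The helper accepts any object exposing a `status` property, so it can be tried out with a stand-in object before running a real three-hour training job.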
The task input ports. Note that booleans, integers and floats must be passed to the task as strings, e.g., 'True', '10', '0.001'.
| Name | Default | Valid values | Description |
|------|---------|--------------|-------------|
| images | N/A | N/A | Contains GeoTiff image strip(s) from which to extract the training polygons in geojson. The strips must be named after their catalog ids (e.g., 1040010014BCA700.tif). Up to five image strips will be accepted as input. |
| geojson | N/A | N/A | Contains one geojson file with labeled training polygons. Each polygon should have an 'image_id' property containing the catalog id of the associated image strip, and a 'class_name' property with the appropriate training classification of the polygon. |
| classes | N/A | String | Classes to train the network on, separated by commas (e.g., 'No swimming pool, Swimming pool'). Names must be exactly as they appear in the class_name property of the polygon features in the geojson. |
| two_rounds | True | True, False | If True, train the network in two rounds: first on balanced classes, then on the original distribution of classes. In the second training round only the weights of the final layer of the model will be updated. Recommended if there is class imbalance in the dataset. |
| filter_geojson | True | True, False | If True, the task will remove any polygons that are larger than max_side_dim or smaller than min_side_dim from geojson. This is highly recommended, as errors in training may occur if any polygons are larger than max_side_dim or smaller than min_side_dim. |
| min_side_dim | 10 | integer from 0 to max_side_dim | The minimum acceptable side dimension (in pixels) for training polygons. |
| max_side_dim | 125 | integer from 125 to 500 | The maximum acceptable side dimension (in pixels) for training polygons. |
| train_size | 10000 | integer > 0 | Number of polygons to train the network on for the first round of training. |
| train_size_2 | 0.5 * train_size | integer > 0 | Number of polygons to train on in the second round of training. Only relevant if two_rounds is True. |
| batch_size | 32 | integer > 0 | Number of chips to train on per batch. |
| nb_epoch | 35 | integer > 1 | Number of training epochs to perform for the first round of training. |
| nb_epoch_2 | 8 | integer > 1 | Number of training epochs to perform for the second round of training. Only relevant if two_rounds is True. |
| test | True | True, False | If True, testing will be completed on a subset of polygons. A test report with accuracy metrics will be saved as a text file in the 'model' output directory. |
| test_size | 5000 | integer > 0 | Number of chips to test on. Only relevant if test is True. |
| learning_rate | 0.001 | float > 0 | Learning rate for the first round of training. |
| learning_rate_2 | 0.01 | float > 0 | Learning rate for the second round of training (if applicable). |
| bit_depth | 8 | integer > 0 | Bit depth of the image strips in images. This parameter is necessary for proper normalization. |
| use_lowest_val_loss | True | True, False | After the first round of training, use the model weights that yielded the lowest validation loss (recommended). Otherwise the model weights after the final epoch will be used. |
| kernel_size | 3 | integer > 1 | Side dimension (in pixels) of the kernels at each convolutional layer in the network. |
| resize_dim | None | integer < max_side_dim | Dimension to resize the chip side to after padding. This should be smaller than max_side_dim. |
| [small model] | False | True, False | Use a model with 8 layers instead of 16. Useful for large input images (>250 pixels). |
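Because every port value must be a string, it can be convenient to keep hyper-parameters as native Python values and convert them just before assignment. A minimal sketch: the helper `to_port_strings` is hypothetical, and the commented usage assumes that task inputs can be assigned with `setattr`.

```python
def to_port_strings(params):
    """Convert booleans, integers and floats to the string form the task expects."""
    return {name: str(value) for name, value in params.items()}


hyper_params = {'two_rounds': True, 'nb_epoch': 30, 'learning_rate': 0.001}
port_values = to_port_strings(hyper_params)
print(port_values)

# Hypothetical usage with a task instance named train_task:
# for name, value in port_values.items():
#     setattr(train_task.inputs, name, value)
```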
train-cnn-classifier has one output directory port, trained_model. Its contents are listed in the following table:
| Name | Description |
|------|-------------|
| model_arch.json | Architecture of the trained model, stored as a json file in the output folder. |
| model_weights.h5 | An h5py file containing the weights for the trained model. |
| model_weights | This sub-directory will contain the model weights after each epoch of training. This allows you to load and test different models if necessary. |
| test_report.txt | Results of testing the model. If test was set to False in the inputs, this file will not be in the output directory. |
This section contains additional information that provides further insight into training parameters and suggestions for training an effective model.
CNNs require ample data for effective training. In addition, all image classes should be equally represented in the data fed to the network. You should therefore have at least 2500 labeled polygons from each class in train.geojson. This file should also be as clean as possible (accurately labeled, legitimate polygons): your classifier will only be as good as the training data you feed it!
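A quick way to check class representation before launching the task is to count polygons per class. The sketch below assumes a standard GeoJSON FeatureCollection whose features carry the `class_name` property; the function name and the inline sample data are illustrative only.

```python
import json
from collections import Counter


def count_classes(geojson_dict):
    """Return the number of features per class_name in a FeatureCollection."""
    return Counter(
        feature['properties']['class_name']
        for feature in geojson_dict['features']
    )


# Tiny inline sample; in practice, load the training geojson file with json.load.
sample = json.loads('''{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "geometry": null,
     "properties": {"image_id": "1040010014BCA700", "class_name": "Swimming pool"}},
    {"type": "Feature", "geometry": null,
     "properties": {"image_id": "1040010014BCA700", "class_name": "No swimming pool"}},
    {"type": "Feature", "geometry": null,
     "properties": {"image_id": "1040010014BCA700", "class_name": "No swimming pool"}}
  ]
}''')

counts = count_classes(sample)
# Classes that fall short of the suggested 2500 polygons.
too_small = [name for name, n in counts.items() if n < 2500]
print(counts, too_small)
```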
Below are more details on the optional hyper-parameter settings.
The two_rounds flag is a method for dealing with data that has natural class imbalance (unequal representation of image classes). Training a CNN on data with unbalanced classes often results in the network classifying all target data under the majority class. two_rounds avoids this by training in the following steps:
1. Train the network on balanced data (the task will take care of creating a balanced training dataset).
2. Retrain the network on the original class distribution to account for the probability of encountering a given image class. This round will only update the weights of the output layer of the network.
This two-round training process allows the network to learn to distinguish between classes based on distinct features (round one) and then learn the probability of encountering each class (round two). This is highly recommended for data that is not balanced.
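Round one relies on a balanced training set. The task builds this internally; the sketch below only illustrates the idea by randomly downsampling every class to the size of the smallest one. It is an assumption about the general approach, not the task's actual implementation.

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, seed=0):
    """samples: list of (item, class_name) pairs. Returns a class-balanced subset."""
    by_class = defaultdict(list)
    for item, class_name in samples:
        by_class[class_name].append((item, class_name))

    # Every class is cut down to the size of the smallest class.
    smallest = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


samples = [(i, 'No swimming pool') for i in range(90)] + \
          [(i, 'Swimming pool') for i in range(90, 100)]
balanced = balance_by_downsampling(samples)
print(len(balanced))  # 20: ten polygons from each class
```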
The filter_geojson flag filters your geojson before training the network by removing any polygons that are larger than max_side_dim or smaller than min_side_dim. This is necessary if any polygon has a side dimension larger than max_side_dim. You may also do this yourself before executing the task as follows:
from mltools import geojson_tools as gt

gt.filter_shapefile('input_file.geojson', 'output_filename.geojson', min_polygon_hw=min_side_dim, max_polygon_hw=max_side_dim)
min_side_dim and max_side_dim are constraints on the minimum and maximum size (in pixels) of a polygon from geojson. Side dimensions are based on the size of the polygon's bounding box (white line below).
A sample input chip is displayed above. Note that CNNs require all train/test inputs to have identical dimensions. Thus, all polygons are zero-padded to the following dimensions: (num_bands, max_side_dim, max_side_dim). This means that any polygons that have a side dimension larger than max_side_dim cannot be used and will throw an error during training. Use the filter_geojson flag to avoid this.
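The zero-padding step can be sketched with NumPy as follows. Placing the chip in the top-left corner of the padded array is an assumption, since the padding layout used by the task is not stated here, and the function name is illustrative.

```python
import numpy as np


def zero_pad_chip(chip, max_side_dim):
    """Pad a (num_bands, rows, cols) chip with zeros to (num_bands, max_side_dim, max_side_dim)."""
    num_bands, rows, cols = chip.shape
    if rows > max_side_dim or cols > max_side_dim:
        # Mirrors the constraint above: oversized polygons cannot be used.
        raise ValueError('chip side exceeds max_side_dim')
    padded = np.zeros((num_bands, max_side_dim, max_side_dim), dtype=chip.dtype)
    padded[:, :rows, :cols] = chip  # top-left placement is an assumption
    return padded


chip = np.ones((3, 40, 60), dtype=np.uint8)
print(zero_pad_chip(chip, 125).shape)  # (3, 125, 125)
```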
train_size is the number of polygons to train on. If the provided geojson does not have enough polygons, the task will throw an error.
Note that if two_rounds is True, the maximum train size is: size of smallest class * number of classes, assuming no polygons are removed in the filter_geojson step. Additionally, if testing is performed, the test polygons are subtracted from the available training polygons.
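This bound is simple arithmetic. A minimal sketch with a hypothetical helper name follows; subtracting test_size from the total, and using the total polygon count when two_rounds is False, are assumptions about details not spelled out here.

```python
def max_train_size(class_counts, two_rounds=True, test_size=0):
    """Upper bound on train_size given the number of polygons per class."""
    if two_rounds:
        # Balanced training: every class is limited by the smallest class.
        available = min(class_counts.values()) * len(class_counts)
    else:
        available = sum(class_counts.values())
    return max(available - test_size, 0)


counts = {'No swimming pool': 9000, 'Swimming pool': 3000}
print(max_train_size(counts, two_rounds=True, test_size=1000))   # 5000
print(max_train_size(counts, two_rounds=False, test_size=1000))  # 11000
```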
batch_size is the number of polygons to train on per batch. The model weights will be updated following each batch. Smaller batch sizes can help avoid local minima by increasing the amount of noise in the gradient.
nb_epoch is the number of training epochs to complete. The validation loss tends to decrease with each successive epoch until a minimum loss is reached. At this point any additional training epochs may result in overfitting.
While the validation loss of the model tends to decrease with successive training epochs, it rarely does so monotonically. Furthermore, if too many training epochs are completed, the model may begin to overfit, causing the validation loss to increase with successive epochs. The loss may therefore not be at its minimum at the end of training. Set use_lowest_val_loss to True to ensure that the initial round returns the model with the lowest validation loss.
Note that all model weights will be returned in the model_weights folder of the output directory.
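Selecting the weights with the lowest validation loss amounts to an argmin over the per-epoch losses. The loss values below are made up, and the sketch says nothing about how the task names or stores the per-epoch weight files.

```python
def best_epoch(val_losses):
    """Return (epoch_index, loss) for the lowest validation loss; epochs are 0-indexed."""
    epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return epoch, val_losses[epoch]


# Made-up validation losses: non-monotonic, then rising as the model overfits.
val_losses = [0.69, 0.52, 0.55, 0.41, 0.38, 0.40, 0.47]
print(best_epoch(val_losses))  # (4, 0.38)
```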
Testing should only be done on two-class classifications. The following explanations assume the classes are input to the task as follows: 'Negative class, Positive class'.
If the test flag is set to True, the task will put aside a set of polygons from geojson to get accuracy metrics for the trained model. Set this to False if you would like to complete testing manually.
The following metrics are provided in test results:
- False Positives: number of polygons falsely classified as Positive class
- False Negatives: number of polygons falsely classified as Negative class
- Precision: True Positives / (True Positives + False Positives)
- Recall: True Positives / (True Positives + False Negatives)
- Accuracy: Number of correctly classified polygons / test_size
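The formulas above can be computed directly from the four confusion-matrix counts. A minimal sketch with made-up counts; the function name is illustrative, and returning 0.0 when a denominator is zero is a choice made here, not something the task is known to do.

```python
def compute_metrics(true_pos, true_neg, false_pos, false_neg):
    """Precision, recall and accuracy for a two-class test set."""
    test_size = true_pos + true_neg + false_pos + false_neg
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    return {
        'precision': true_pos / predicted_pos if predicted_pos else 0.0,
        'recall': true_pos / actual_pos if actual_pos else 0.0,
        'accuracy': (true_pos + true_neg) / test_size if test_size else 0.0,
    }


metrics = compute_metrics(true_pos=80, true_neg=890, false_pos=10, false_neg=20)
print(metrics)
```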
Each convolutional layer of a CNN uses kernels to extract features from the input image and create an output feature map that is passed to the next layer. The kernel_size parameter specifies the side dimension of the kernels used to train the network. Note that increasing the kernel size will slow down training dramatically. Finding the ideal kernel size for a specific use case is often a matter of trial and error.
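One reason larger kernels are costly is that the number of weights in a convolutional layer grows with the square of the kernel side. A small sketch of that arithmetic for a single 2D convolutional layer; the channel counts are arbitrary examples and not necessarily those of the network used by this task.

```python
def conv_layer_params(kernel_size, in_channels, out_channels):
    """Trainable parameters in one 2D convolutional layer (weights plus one bias per filter)."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


for k in (3, 5, 7):
    print(k, conv_layer_params(k, in_channels=64, out_channels=64))
```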
There may be memory errors when the input chips are too large (over 200px). The resize_dim argument downsamples the input chips to the given dimensions. Input should be as follows: (n_bands, rows, cols). The value of n_bands should be equal to the number of bands of the input imagery, since only the side dimensions will be updated.
Build the Docker Image
You need to install Docker.
Clone the repository:
git clone https://github.com/platformstories/train-cnn-classifier
cd train-cnn-classifier
docker build -t train-cnn-classifier .
Try out locally
Create a container in interactive mode and mount the sample input under /mnt/work/input:
docker run -v full/path/to/sample-input:/mnt/work/input -it train-cnn-classifier
Then, within the container:
Watch the stdout to confirm that the model is being trained.
Log in to Docker Hub:
Tag your image using your username and push it to Docker Hub:
docker tag train-cnn-classifier yourusername/train-cnn-classifier
docker push yourusername/train-cnn-classifier
The image name should be the same as the image name under containerDescriptors in train-cnn-classifier.json.
Alternatively, you can link this repository to a Docker automated build. Every time you push a change to the repository, the Docker image gets automatically updated.
Register on GBDX
In a Python terminal:
from gbdxtools import Interface

gbdx = Interface()
gbdx.task_registry.register(json_filename='train-cnn-classifier.json')
Note: If you change the task image, you need to reregister the task with a higher version number in order for the new image to take effect. Keep this in mind especially if you use Docker automated build.