02_Case Study Demo

Case Study, Default demonstration dataset

New users should first run this test project to gain familiarity with the software and expected data formats, and also to test hardware for model training.

We have 43 images and corresponding label images of Cape Hatteras National Seashore. The imagery is obtained from this Zenodo release (click here to download) and consists of a time-series of Landsat-8 visible-spectrum (RGB) satellite images, courtesy of the U.S. Geological Survey, with associated labels made using Doodler. The imagery spans the period February 2015 to September 2021.

We recommend that you download this dataset, unzip it, and test Gym with it. Once you have confirmed that Gym works with this test data, you can then go on to test Gym with your own data with the confidence that Gym works on your machine. This sample data is also useful as a template for your own data, which must be organized in the same way as the test data. The folder structure is

/Users/Someone/my_segmentation_gym_datasets
├── config
│   └── *.json
├── capehatteras_data
│   ├── fromDoodler
│   │   ├── images
│   │   │   └── *.jpg
│   │   └── labels
│   │       └── *.jpg
│   ├── npzForModel
│   │   └── *.npz
│   └── toPredict
│       └── *.jpg
├── modelOut
│   └── *.png
└── weights
    └── *.h5

An explanation of these files is as follows:

  • config: this folder contains config files, which are detailed here
  • capehatteras_data: this folder contains the model training data
    • fromDoodler: these are the images and labels (made, in this case, using Doodler, but note that it is not necessary to use Doodler to acquire labeled imagery)
    • npzForModel: these are the npz files made by running the Gym script make_nd_dataset.py
    • toPredict: these are sample images that you'd like to segment
  • modelOut: this folder contains files to demonstrate model performance. These files are written by train_model.py after model training, and consist of some validation sample images with a semi-transparent color overlay depicting the image segmentation.
  • weights: this folder contains model weight files in h5 format. These files are written by train_model.py during model training.

When preparing your own data for Gym model training, you should closely follow the above folder structure. Your top level folder (in this case, my_segmentation_gym_datasets) should be named something relevant to your project. Likewise, your specific data (in this case, capehatteras_data) would have a different name. The folder names fromDoodler, npzForModel, and toPredict may optionally be named something else. However, you MUST have a folder named config and another named weights, and finally another called modelOut under your top-level directory. For each .h5 weights file, the program expects a .json config file with the same root name.
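
Because each weights file is matched to its config file by root name, a quick way to sanity-check a project folder is to compare the two directories. Here is a minimal sketch, assuming the folder layout above (the project path is a placeholder for your own top-level folder):

from pathlib import Path

# Hypothetical project location; substitute your own top-level folder
project = Path("/Users/Someone/my_segmentation_gym_datasets")

config_names = {p.stem for p in (project / "config").glob("*.json")}
weight_names = {p.stem.replace("_fullmodel", "") for p in (project / "weights").glob("*.h5")}

# Report any weights file without a same-named config, and vice versa
print("weights missing a config:", sorted(weight_names - config_names))
print("configs not yet trained :", sorted(config_names - weight_names))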

The folder npzForModel is populated with .npz format files when you run the make_nd_dataset.py script, which also creates the subfolders aug_sample and noaug_sample. The folders weights and modelOut are populated with files when you run the train_model.py script. The folder toPredict contains test images that you'd like to use with the trained model; ideally, these are an independent set of images, distinct from the train and validation subsets. Within that folder, a subfolder called out is created by the seg_images_in_folder.py or batch_seg_images_in_folders.py script.

So, in summary, when you start a new project, you have some folders that you populate with files, and others that get populated by Gym. Those you populate are

  1. config
  2. fromDoodler/images
  3. fromDoodler/labels
  4. toPredict

The following you create and the program populates:

  1. npzForModel
  2. weights
  3. modelOut
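
A minimal sketch for scaffolding these folders at the start of a new project (the project and data folder names below are placeholders; rename them to suit your own work):

from pathlib import Path

# Hypothetical project root and data folder name; rename for your own project
root = Path("/Users/Someone/my_new_gym_project")

# Folders you populate yourself, plus the ones Gym populates for you
for folder in [
    "config",
    "mydata/fromDoodler/images",
    "mydata/fromDoodler/labels",
    "mydata/toPredict",
    "mydata/npzForModel",
    "weights",
    "modelOut",
]:
    (root / folder).mkdir(parents=True, exist_ok=True)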

There are four models: two Residual UNets, a SegFormer, and a "vanilla" UNet.

Test a model

You need an activated gym conda environment. Please refer to these instructions. CPU users should check that "SET_GPU": "-1" is in the config file for the model that they want to test. Users with a single NVIDIA GPU card should make sure that "SET_GPU": "0" is specified in the config file. Advanced users with multiple GPUs should specify which GPUs to use as a comma-separated list, e.g. "SET_GPU": "0,1" or "SET_GPU": "0,1,2".
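
If you prefer to verify this setting programmatically rather than opening the file, a minimal sketch (the config path is a placeholder; point it at the config for the model you want to test):

import json

# Hypothetical path; substitute the config file for the model you want to test
config_file = "/Users/Someone/my_segmentation_gym_datasets/config/hatteras_l8_resunet.json"

with open(config_file) as f:
    config = json.load(f)

# "-1" = CPU only; "0" = first GPU; "0,1" = first two GPUs, and so on
print("SET_GPU is set to:", config.get("SET_GPU", "not specified"))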

From your segmentation_gym root folder, e.g. /Users/Someone/github_clones/segmentation_gym, in an activated gym conda environment, run

python seg_images_in_folder.py

First, you will be prompted to select the location of the images you wish to apply the model to, e.g. /Users/Someone/my_segmentation_gym_datasets/capehatteras_data/toPredict

Next, you will be asked to select the weights file, e.g. /Users/Someone/my_segmentation_gym_datasets/weights/hatteras_l8_resunet.h5. Each weights file in h5 format represents a single model. The program is set up for ensemble modeling, where your ensemble of models would differ based on config settings. When you are prompted to add more weights files, say 'No'; the program will then load and use just the one model.

The output printed to screen should look similar to this (if using GPU):

Using GPU
Using single GPU device
Version:  2.8.1
Eager mode:  True
Version:  2.8.1
Eager mode:  True
Num GPUs Available:  1
[]
WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING
The dtype policy mixed_float16 may run slowly because this machine does not have a GPU. Only Nvidia GPUs with compute capability of at least 7.0 run quickly with mixed_float16.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
.....................................
Creating and compiling model 0...

Number of samples: 129

.....................................
Using model for prediction on images ...
100%|███████████████████████████████████████████████████████████████████████████████████████████| 129/129 [09:25<00:00,  4.38s/it]

The output printed to screen should look similar to this (if using CPU, which means editing the config file so that "SET_GPU": "-1"):

Using CPU
Using single CPU device
Version:  2.8.1
Eager mode:  True
Version:  2.8.1
Eager mode:  True
Num GPUs Available:  0

In the folder of images that you specified, there should be a new subfolder called out that contains model outputs, e.g. /Users/Someone/my_segmentation_gym_datasets/capehatteras_data/toPredict/out. These png format files show sample images with a semi-transparent color overlay depicting the image segmentation.

Train a ResUNet model from scratch

A model already exists, i.e. there are valid weight files with .h5 format in the weights folder, but it is strongly recommended that you retrain this model on your own machine before using Gym on your own data. This exercise will allow you to gain familiarity with the software and data formats, allow you to test your GPU hardware for model training, and give you some insight into how long it takes to train a small model on a small dataset.

Step 1: Create a folder somewhere on your machine, e.g. /Users/Someone/model_from_scratch_test

Note that if you have no CUDA-enabled GPU such as an NVIDIA brand GPU, you will need to edit the config file(s) so the line "SET_GPU": "0" is replaced with "SET_GPU": "-1", which tells the machine to use a CPU instead of a GPU.

Step 2: from your segmentation_gym root folder, e.g. /Users/Someone/github_clones/segmentation_gym, in an activated gym conda environment, run python make_nd_dataset.py.

  • It will first prompt you to select the output directory where model training files will be written to, e.g. /Users/Someone/model_from_scratch_test.
  • Then it will ask for a config file. Select the hatteras_l8_resunet.json config file, e.g. /Users/Someone/my_segmentation_gym_datasets/config/hatteras_l8_resunet.json.
  • Select a folder of labels in jpeg format e.g. /Users/Someone/my_segmentation_gym_datasets/capehatteras_data/fromDoodler/labels
  • Select a folder of images in jpeg format e.g. /Users/Someone/my_segmentation_gym_datasets/capehatteras_data/fromDoodler/images
  • Finally, you will be asked 'More directories of images?'. Select 'No'.

In the above, there may be a nested images folder containing the jpeg images inside the labels and images folders. In that case, you still choose the top-level directories, i.e. /Users/Someone/my_segmentation_gym_datasets/capehatteras_data/fromDoodler/labels/ and /Users/Someone/my_segmentation_gym_datasets/capehatteras_data/fromDoodler/images/.

make_nd_dataset.py will create new files inside your output folder, e.g. /Users/Someone/model_from_scratch_test: npz format files containing all the augmented ('aug') and non-augmented ('noaug') image:label pairs. A subset of augmented and non-augmented image-label overlays is also created inside the aug_sample and noaug_sample folders, respectively.

If you wish to open the npz files to examine their contents, you can do so with an archive manager such as 7zip on Windows, or whatever you would typically use to open zip or tar archives. However, the arrays inside are not human readable.
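
Alternatively, a minimal way to peek inside one of these files with numpy (the filename below is a placeholder; substitute one of the npz files written to your output folder):

import numpy as np

# Hypothetical filename; substitute an npz file made by make_nd_dataset.py
with np.load("/Users/Someone/model_from_scratch_test/example_pair.npz") as data:
    for key in data.files:
        print(key, data[key].shape, data[key].dtype)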

Step 3: run python train_model.py.

  • It will first prompt you to select the output directory where model training files were written to, e.g. /Users/Someone/model_from_scratch_test.
  • Then it will ask for a config file. Select the hatteras_l8_resunet.json config file, e.g. /Users/Someone/my_segmentation_gym_datasets/config/hatteras_l8_resunet.json.
  • The model will then train. Your outputs will look like this, usually with some additional warnings from Tensorflow:
Using GPU
Using single GPU device
Version:  2.8.1
Eager mode:  True
MODE not specified in config file. Setting to "all" files
MODE "all": using all augmented and non-augmented files
.....................................
Creating and compiling model ...
.....................................
Training model ...

Epoch 1: LearningRateScheduler setting learning rate to 1e-07.
Epoch 1/100
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
34/34 [==============================] - 49s 592ms/step - loss: 0.8441 - mean_iou: 0.0017 - dice_coef: 0.1559 - val_loss: 0.7839 - val_mean_iou: 1.6857e-06 - val_dice_coef: 0.2161 - lr: 1.0000e-07

Epoch 2: LearningRateScheduler setting learning rate to 5.095e-06.
Epoch 2/100
34/34 [==============================] - 19s 573ms/step - loss: 0.7513 - mean_iou: 0.0132 - dice_coef: 0.2487 - val_loss: 0.7287 - val_mean_iou: 4.4711e-06 - val_dice_coef: 0.2713 - lr: 5.0950e-06

Epoch 3: LearningRateScheduler setting learning rate to 1.0090000000000002e-05.
Epoch 3/100
34/34 [==============================] - 20s 587ms/step - loss: 0.5039 - mean_iou: 0.5175 - dice_coef: 0.4961 - val_loss: 0.5678 - val_mean_iou: 0.6194 - val_dice_coef: 0.4322 - lr: 1.0090e-05
Epoch 4: LearningRateScheduler setting learning rate to 1.5085000000000002e-05.
Epoch 4/100
34/34 [==============================] - 20s 597ms/step - loss: 0.2645 - mean_iou: 0.8364 - dice_coef: 0.7355 - val_loss: 0.4721 - val_mean_iou: 0.7722 - val_dice_coef: 0.5279 - lr: 1.5085e-05

Epoch 5: LearningRateScheduler setting learning rate to 2.008e-05.
Epoch 5/100
34/34 [==============================] - 21s 612ms/step - loss: 0.1473 - mean_iou: 0.9232 - dice_coef: 0.8527 - val_loss: 0.3025 - val_mean_iou: 0.8952 - val_dice_coef: 0.6975 - lr: 2.0080e-05

Epoch 6: LearningRateScheduler setting learning rate to 2.5075000000000003e-05.
Epoch 6/100
34/34 [==============================] - 19s 576ms/step - loss: 0.0936 - mean_iou: 0.9566 - dice_coef: 0.9064 - val_loss: 0.2152 - val_mean_iou: 0.8894 - val_dice_coef: 0.7848 - lr: 2.5075e-05

Epoch 7: LearningRateScheduler setting learning rate to 3.0070000000000002e-05.
Epoch 7/100
34/34 [==============================] - 19s 574ms/step - loss: 0.0656 - mean_iou: 0.9710 - dice_coef: 0.9344 - val_loss: 0.1573 - val_mean_iou: 0.9023 - val_dice_coef: 0.8427 - lr: 3.0070e-05

Epoch 8: LearningRateScheduler setting learning rate to 3.5065000000000004e-05.
Epoch 8/100
34/34 [==============================] - 20s 579ms/step - loss: 0.0518 - mean_iou: 0.9750 - dice_coef: 0.9482 - val_loss: 0.1151 - val_mean_iou: 0.9187 - val_dice_coef: 0.8849 - lr: 3.5065e-05

Epoch 9: LearningRateScheduler setting learning rate to 4.0060000000000006e-05.
Epoch 9/100
34/34 [==============================] - 20s 583ms/step - loss: 0.0418 - mean_iou: 0.9793 - dice_coef: 0.9582 - val_loss: 0.0897 - val_mean_iou: 0.9322 - val_dice_coef: 0.9103 - lr: 4.0060e-05
Epoch 10: LearningRateScheduler setting learning rate to 4.505500000000001e-05.
Epoch 10/100
34/34 [==============================] - 19s 573ms/step - loss: 0.0361 - mean_iou: 0.9807 - dice_coef: 0.9639 - val_loss: 0.0745 - val_mean_iou: 0.9409 - val_dice_coef: 0.9255 - lr: 4.5055e-05

Epoch 11: LearningRateScheduler setting learning rate to 5.005000000000001e-05.
Epoch 11/100
34/34 [==============================] - 20s 588ms/step - loss: 0.0319 - mean_iou: 0.9821 - dice_coef: 0.9681 - val_loss: 0.0582 - val_mean_iou: 0.9542 - val_dice_coef: 0.9418 - lr: 5.0050e-05

Epoch 12: LearningRateScheduler setting learning rate to 5.5045000000000006e-05.
Epoch 12/100
34/34 [==============================] - 19s 577ms/step - loss: 0.0297 - mean_iou: 0.9822 - dice_coef: 0.9703 - val_loss: 0.0576 - val_mean_iou: 0.9529 - val_dice_coef: 0.9424 - lr: 5.5045e-05

Epoch 13: LearningRateScheduler setting learning rate to 6.004000000000001e-05.
Epoch 13/100
34/34 [==============================] - 19s 574ms/step - loss: 0.0270 - mean_iou: 0.9835 - dice_coef: 0.9730 - val_loss: 0.0475 - val_mean_iou: 0.9635 - val_dice_coef: 0.9525 - lr: 6.0040e-05

Epoch 14: LearningRateScheduler setting learning rate to 6.5035e-05.
Epoch 14/100
34/34 [==============================] - 20s 586ms/step - loss: 0.0262 - mean_iou: 0.9833 - dice_coef: 0.9738 - val_loss: 0.0401 - val_mean_iou: 0.9708 - val_dice_coef: 0.9599 - lr: 6.5035e-05

Epoch 15: LearningRateScheduler setting learning rate to 7.003e-05.
Epoch 15/100
34/34 [==============================] - 20s 582ms/step - loss: 0.0249 - mean_iou: 0.9835 - dice_coef: 0.9751 - val_loss: 0.0401 - val_mean_iou: 0.9654 - val_dice_coef: 0.9599 - lr: 7.0030e-05

Epoch 16: LearningRateScheduler setting learning rate to 7.502500000000001e-05.
Epoch 16/100
34/34 [==============================] - 19s 570ms/step - loss: 0.0234 - mean_iou: 0.9845 - dice_coef: 0.9766 - val_loss: 0.0426 - val_mean_iou: 0.9659 - val_dice_coef: 0.9574 - lr: 7.5025e-05

Epoch 17: LearningRateScheduler setting learning rate to 8.002000000000001e-05.
Epoch 17/100
34/34 [==============================] - 20s 583ms/step - loss: 0.0224 - mean_iou: 0.9849 - dice_coef: 0.9776 - val_loss: 0.0311 - val_mean_iou: 0.9761 - val_dice_coef: 0.9689 - lr: 8.0020e-05

Epoch 18: LearningRateScheduler setting learning rate to 8.501500000000001e-05.
Epoch 18/100
34/34 [==============================] - 20s 578ms/step - loss: 0.0220 - mean_iou: 0.9848 - dice_coef: 0.9780 - val_loss: 0.0317 - val_mean_iou: 0.9747 - val_dice_coef: 0.9683 - lr: 8.5015e-05

Epoch 19: LearningRateScheduler setting learning rate to 9.001000000000001e-05.
Epoch 19/100
34/34 [==============================] - 19s 573ms/step - loss: 0.0215 - mean_iou: 0.9849 - dice_coef: 0.9785 - val_loss: 0.0457 - val_mean_iou: 0.9582 - val_dice_coef: 0.9543 - lr: 9.0010e-05

Epoch 20: LearningRateScheduler setting learning rate to 9.500500000000002e-05.
Epoch 20/100
34/34 [==============================] - 20s 580ms/step - loss: 0.0206 - mean_iou: 0.9857 - dice_coef: 0.9794 - val_loss: 0.0350 - val_mean_iou: 0.9689 - val_dice_coef: 0.9650 - lr: 9.5005e-05

Epoch 21: LearningRateScheduler setting learning rate to 0.0001.
Epoch 21/100
34/34 [==============================] - 20s 580ms/step - loss: 0.0209 - mean_iou: 0.9851 - dice_coef: 0.9791 - val_loss: 0.0348 - val_mean_iou: 0.9689 - val_dice_coef: 0.9652 - lr: 1.0000e-04

Epoch 22: LearningRateScheduler setting learning rate to 9.001e-05.
Epoch 22/100
34/34 [==============================] - 19s 565ms/step - loss: 0.0198 - mean_iou: 0.9859 - dice_coef: 0.9802 - val_loss: 0.0253 - val_mean_iou: 0.9799 - val_dice_coef: 0.9747 - lr: 9.0010e-05

Epoch 23: LearningRateScheduler setting learning rate to 8.1019e-05.
Epoch 23/100
34/34 [==============================] - 19s 572ms/step - loss: 0.0185 - mean_iou: 0.9870 - dice_coef: 0.9815 - val_loss: 0.0297 - val_mean_iou: 0.9753 - val_dice_coef: 0.9703 - lr: 8.1019e-05

Epoch 24: LearningRateScheduler setting learning rate to 7.292710000000001e-05.
Epoch 24/100
34/34 [==============================] - 19s 564ms/step - loss: 0.0181 - mean_iou: 0.9874 - dice_coef: 0.9819 - val_loss: 0.0280 - val_mean_iou: 0.9766 - val_dice_coef: 0.9720 - lr: 7.2927e-05

Epoch 25: LearningRateScheduler setting learning rate to 6.564439000000001e-05.
Epoch 25/100
34/34 [==============================] - 20s 579ms/step - loss: 0.0178 - mean_iou: 0.9874 - dice_coef: 0.9822 - val_loss: 0.0252 - val_mean_iou: 0.9796 - val_dice_coef: 0.9748 - lr: 6.5644e-05

Epoch 26: LearningRateScheduler setting learning rate to 5.908995100000001e-05.
Epoch 26/100
34/34 [==============================] - 19s 575ms/step - loss: 0.0170 - mean_iou: 0.9882 - dice_coef: 0.9830 - val_loss: 0.0247 - val_mean_iou: 0.9810 - val_dice_coef: 0.9753 - lr: 5.9090e-05

Epoch 27: LearningRateScheduler setting learning rate to 5.319095590000001e-05.
Epoch 27/100
34/34 [==============================] - 20s 583ms/step - loss: 0.0167 - mean_iou: 0.9884 - dice_coef: 0.9833 - val_loss: 0.0230 - val_mean_iou: 0.9823 - val_dice_coef: 0.9770 - lr: 5.3191e-05

Epoch 28: LearningRateScheduler setting learning rate to 4.788186031000001e-05.
Epoch 28/100
34/34 [==============================] - 20s 588ms/step - loss: 0.0163 - mean_iou: 0.9888 - dice_coef: 0.9837 - val_loss: 0.0243 - val_mean_iou: 0.9818 - val_dice_coef: 0.9757 - lr: 4.7882e-05

Epoch 29: LearningRateScheduler setting learning rate to 4.3103674279000016e-05.
Epoch 29/100
34/34 [==============================] - 20s 586ms/step - loss: 0.0165 - mean_iou: 0.9884 - dice_coef: 0.9835 - val_loss: 0.0251 - val_mean_iou: 0.9807 - val_dice_coef: 0.9749 - lr: 4.3104e-05

Epoch 30: LearningRateScheduler setting learning rate to 3.880330685110001e-05.
Epoch 30/100
34/34 [==============================] - 19s 576ms/step - loss: 0.0174 - mean_iou: 0.9873 - dice_coef: 0.9826 - val_loss: 0.0229 - val_mean_iou: 0.9828 - val_dice_coef: 0.9771 - lr: 3.8803e-05

Epoch 31: LearningRateScheduler setting learning rate to 3.493297616599001e-05.
Epoch 31/100
34/34 [==============================] - 19s 562ms/step - loss: 0.0164 - mean_iou: 0.9886 - dice_coef: 0.9836 - val_loss: 0.0231 - val_mean_iou: 0.9831 - val_dice_coef: 0.9769 - lr: 3.4933e-05

Epoch 32: LearningRateScheduler setting learning rate to 3.144967854939101e-05.
Epoch 32/100
34/34 [==============================] - 19s 566ms/step - loss: 0.0153 - mean_iou: 0.9897 - dice_coef: 0.9847 - val_loss: 0.0241 - val_mean_iou: 0.9819 - val_dice_coef: 0.9759 - lr: 3.1450e-05

Epoch 33: LearningRateScheduler setting learning rate to 2.831471069445191e-05.
Epoch 33/100
34/34 [==============================] - 19s 575ms/step - loss: 0.0151 - mean_iou: 0.9898 - dice_coef: 0.9849 - val_loss: 0.0220 - val_mean_iou: 0.9842 - val_dice_coef: 0.9780 - lr: 2.8315e-05

Epoch 34: LearningRateScheduler setting learning rate to 2.5493239625006718e-05.
Epoch 34/100
34/34 [==============================] - 19s 573ms/step - loss: 0.0146 - mean_iou: 0.9903 - dice_coef: 0.9854 - val_loss: 0.0216 - val_mean_iou: 0.9843 - val_dice_coef: 0.9784 - lr: 2.5493e-05

Epoch 35: LearningRateScheduler setting learning rate to 2.295391566250605e-05.
Epoch 35/100
34/34 [==============================] - 19s 570ms/step - loss: 0.0147 - mean_iou: 0.9902 - dice_coef: 0.9853 - val_loss: 0.0218 - val_mean_iou: 0.9841 - val_dice_coef: 0.9782 - lr: 2.2954e-05

Epoch 36: LearningRateScheduler setting learning rate to 2.066852409625544e-05.
Epoch 36/100
34/34 [==============================] - 19s 574ms/step - loss: 0.0145 - mean_iou: 0.9904 - dice_coef: 0.9855 - val_loss: 0.0217 - val_mean_iou: 0.9841 - val_dice_coef: 0.9783 - lr: 2.0669e-05

Epoch 37: LearningRateScheduler setting learning rate to 1.86116716866299e-05.
Epoch 37/100
34/34 [==============================] - 19s 569ms/step - loss: 0.0144 - mean_iou: 0.9904 - dice_coef: 0.9856 - val_loss: 0.0220 - val_mean_iou: 0.9845 - val_dice_coef: 0.9780 - lr: 1.8612e-05

Epoch 38: LearningRateScheduler setting learning rate to 1.676050451796691e-05.
Epoch 38/100
34/34 [==============================] - 19s 573ms/step - loss: 0.0142 - mean_iou: 0.9907 - dice_coef: 0.9858 - val_loss: 0.0217 - val_mean_iou: 0.9841 - val_dice_coef: 0.9783 - lr: 1.6761e-05

Epoch 39: LearningRateScheduler setting learning rate to 1.5094454066170219e-05.
Epoch 39/100
34/34 [==============================] - 20s 579ms/step - loss: 0.0142 - mean_iou: 0.9907 - dice_coef: 0.9858 - val_loss: 0.0213 - val_mean_iou: 0.9842 - val_dice_coef: 0.9787 - lr: 1.5094e-05

Epoch 40: LearningRateScheduler setting learning rate to 1.35950086595532e-05.
Epoch 40/100
34/34 [==============================] - 20s 579ms/step - loss: 0.0140 - mean_iou: 0.9908 - dice_coef: 0.9860 - val_loss: 0.0209 - val_mean_iou: 0.9846 - val_dice_coef: 0.9791 - lr: 1.3595e-05

Epoch 41: LearningRateScheduler setting learning rate to 1.224550779359788e-05.
Epoch 41/100
34/34 [==============================] - 19s 573ms/step - loss: 0.0140 - mean_iou: 0.9909 - dice_coef: 0.9860 - val_loss: 0.0217 - val_mean_iou: 0.9842 - val_dice_coef: 0.9783 - lr: 1.2246e-05

Epoch 42: LearningRateScheduler setting learning rate to 1.1030957014238093e-05.
Epoch 42/100
34/34 [==============================] - 19s 575ms/step - loss: 0.0138 - mean_iou: 0.9909 - dice_coef: 0.9862 - val_loss: 0.0214 - val_mean_iou: 0.9842 - val_dice_coef: 0.9786 - lr: 1.1031e-05

Epoch 43: LearningRateScheduler setting learning rate to 9.937861312814282e-06.
Epoch 43/100
34/34 [==============================] - 19s 570ms/step - loss: 0.0139 - mean_iou: 0.9909 - dice_coef: 0.9861 - val_loss: 0.0212 - val_mean_iou: 0.9844 - val_dice_coef: 0.9788 - lr: 9.9379e-06

Epoch 44: LearningRateScheduler setting learning rate to 8.954075181532855e-06.
Epoch 44/100
34/34 [==============================] - 19s 570ms/step - loss: 0.0137 - mean_iou: 0.9911 - dice_coef: 0.9863 - val_loss: 0.0213 - val_mean_iou: 0.9844 - val_dice_coef: 0.9787 - lr: 8.9541e-06

Epoch 45: LearningRateScheduler setting learning rate to 8.06866766337957e-06.
Epoch 45/100
34/34 [==============================] - 19s 569ms/step - loss: 0.0138 - mean_iou: 0.9910 - dice_coef: 0.9862 - val_loss: 0.0214 - val_mean_iou: 0.9844 - val_dice_coef: 0.9786 - lr: 8.0687e-06

Epoch 46: LearningRateScheduler setting learning rate to 7.271800897041612e-06.
Epoch 46/100
34/34 [==============================] - 19s 563ms/step - loss: 0.0136 - mean_iou: 0.9911 - dice_coef: 0.9864 - val_loss: 0.0214 - val_mean_iou: 0.9843 - val_dice_coef: 0.9786 - lr: 7.2718e-06

Epoch 47: LearningRateScheduler setting learning rate to 6.554620807337451e-06.
Epoch 47/100
34/34 [==============================] - 19s 572ms/step - loss: 0.0136 - mean_iou: 0.9911 - dice_coef: 0.9864 - val_loss: 0.0212 - val_mean_iou: 0.9845 - val_dice_coef: 0.9788 - lr: 6.5546e-06

Epoch 48: LearningRateScheduler setting learning rate to 5.9091587266037055e-06.
Epoch 48/100
34/34 [==============================] - 19s 571ms/step - loss: 0.0136 - mean_iou: 0.9911 - dice_coef: 0.9864 - val_loss: 0.0212 - val_mean_iou: 0.9843 - val_dice_coef: 0.9788 - lr: 5.9092e-06

Epoch 49: LearningRateScheduler setting learning rate to 5.328242853943336e-06.
Epoch 49/100
34/34 [==============================] - 19s 571ms/step - loss: 0.0135 - mean_iou: 0.9912 - dice_coef: 0.9865 - val_loss: 0.0213 - val_mean_iou: 0.9844 - val_dice_coef: 0.9787 - lr: 5.3282e-06

Epoch 50: LearningRateScheduler setting learning rate to 4.805418568549002e-06.
Epoch 50/100
34/34 [==============================] - 20s 579ms/step - loss: 0.0136 - mean_iou: 0.9911 - dice_coef: 0.9864 - val_loss: 0.0213 - val_mean_iou: 0.9843 - val_dice_coef: 0.9787 - lr: 4.8054e-06

  • When the model is finished training, it will evaluate on the validation set:
Evaluating model on entire validation set ...
20/20 [==============================] - 4s 180ms/step - loss: 0.0213 - mean_iou: 0.9843 - dice_coef: 0.9787
loss=0.0213, Mean IOU=0.9843, Mean Dice=0.9787
Mean of mean IoUs (validation subset)=0.985
Mean of mean Dice scores (validation subset)=0.979
Mean of mean IoUs (train subset)=0.990
Mean of mean Dice scores (train subset)=0.984

Model training results in the following new files inside your weights directory e.g. /Users/Someone/my_segmentation_gym_datasets/weights/

  • hatteras_l8_resunet.h5: this is a model weights file that is the result of running the train_model.py script
  • hatteras_l8_resunet_fullmodel.h5: this is a model + weights file for subsequent use for prediction. This is a more portable version and what you would give other people to use
  • hatteras_l8_resunet_model_history.npz: this is an archive containing arrays of losses and metrics per model training epoch. You may load these and plot them using your own script (see the sketch after this list)
  • hatteras_l8_resunet_train_files.txt: this is a list of files used for model training
  • hatteras_l8_resunet_val_files.txt: this is a list of files used for model validation
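
A minimal plotting sketch for the model history archive, assuming the filenames above (the array names are read from the archive itself, so nothing else needs to be hard-coded):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical path; substitute your own weights folder
history = np.load("/Users/Someone/my_segmentation_gym_datasets/weights/hatteras_l8_resunet_model_history.npz")

# Plot every per-epoch array stored in the archive against epoch number
for key in history.files:
    values = np.asarray(history[key]).squeeze()
    if values.ndim == 1:  # skip anything that is not a simple per-epoch series
        plt.plot(values, label=key)

plt.xlabel("epoch")
plt.legend()
plt.show()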

In addition, new files are created inside your modelOut directory e.g. /Users/Someone/my_segmentation_gym_datasets/modelOut/. These png format files show sample images with a semi-transparent color overlay depicting the image segmentation.

Train a ("vanilla") UNet model from scratch

UNets are a family of similar models. The original, or "vanilla" UNet is available within Gym in addition to the Residual UNet. You may train (or implement) a model based on the vanilla UNet using the provided config file hatteras_l8_vanilla_unet.json.

This model trains a lot faster, at least on smaller datasets such as this one. When the model is finished training, it will evaluate on the validation set:

Evaluating model on entire validation set ...
17/17 [==============================] - 2s 111ms/step - loss: 2.0198 - mean_iou: 3.2129e-04 - dice_coef: 0.1860
loss=2.0198, Mean IOU=0.0003, Mean Dice=0.1860
Mean of mean IoUs (validation subset)=0.000
Mean of mean Dice scores (validation subset)=0.186
Mean of mean IoUs (train subset)=0.000
Mean of mean Dice scores (train subset)=0.187

As you can see, it is a lot less powerful than the residual UNet whose hyperparameters are defined in the other provided config files.

Train another ResUNet model from scratch, for use in ensemble mode

A second config file is available, hatteras_l8_resunet_model2.json, that differs from the first only in specifying "KERNEL": 7 instead of "KERNEL": 9. The kernel size is a hyperparameter that is commonly experimented with; different values may show a variation in model performance.

When the model is finished training, it will evaluate on the validation set:

Evaluating model on entire validation set ...
20/20 [==============================] - 3s 152ms/step - loss: 0.0248 - mean_iou: 0.9822 - dice_coef: 0.9752
loss=0.0248, Mean IOU=0.9822, Mean Dice=0.9752
Mean of mean IoUs (validation subset)=0.984
Mean of mean Dice scores (validation subset)=0.977
Mean of mean IoUs (train subset)=0.987
Mean of mean Dice scores (train subset)=0.980

The validation and train statistics are very similar (and very slightly inferior) to those of the "KERNEL": 9 model version.

If you are going to experiment in this way, which is strongly encouraged, you should devise a way to keep track of experiments. The config and weights folders must share a common parent directory. The weights file will adopt the same name as the config file used to create it.

Once you have more than one model, you may use seg_images_in_folder.py in ensemble mode. When prompted, add the second set of weights to the first. The program will construct all models and use each of them separately to make predictions. The predictions (softmax scores) are then averaged, and argmax is used to determine the label integer.
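
To illustrate the averaging step conceptually, here is a minimal numpy sketch (the arrays stand in for per-model softmax outputs; the shapes and values are made up for illustration):

import numpy as np

# Pretend softmax outputs from two models for a tiny 2x2 image with 4 classes
# (shape: height x width x n_classes); values are illustrative only
rng = np.random.default_rng(0)
softmax_a = rng.dirichlet(np.ones(4), size=(2, 2))
softmax_b = rng.dirichlet(np.ones(4), size=(2, 2))

# Ensemble: average the per-class scores, then take the argmax per pixel
avg_softmax = (softmax_a + softmax_b) / 2
label = np.argmax(avg_softmax, axis=-1)

print(label)  # one integer label per pixel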

Use multiple models in ensemble mode

From your segmentation_gym root folder, e.g. /Users/Someone/github_clones/segmentation_gym, in an activated gym conda environment, run

python seg_images_in_folder.py

First, you will be prompted to select the location of the images you wish to apply the model to, e.g. /Users/Someone/my_segmentation_gym_datasets/capehatteras_data/toPredict

Next, you will be asked to select the weights file, e.g. /Users/Someone/my_segmentation_gym_datasets/weights/hatteras_l8_resunet.h5. This time, when you are prompted to add more weights files, say 'Yes'. Add another good model, e.g. weights/hatteras_l8_resunet_model2.h5 (the order of selection is unimportant).

The model weights/hatteras_l8_vanilla_unet.h5 did not have good training validation metrics (see above), so it is not used.

This time, the output looks like this:

.....................................
Creating and compiling model 0...
.....................................
Creating and compiling model 1...
Number of samples: 129
.....................................
Using model for prediction on images ...
100%|███████████████████████████████████████████████████████████████████████████████| 129/129 [14:40<00:00,  6.83s/it]

Notice that ensemble models take longer to execute, but the results are generally better. In my case, adding one model resulted in an execution time increase of 55%.