PWC-Net-small model mixed-precision training (with cyclical learning rate schedule)
=======================================================

In this notebook we:
- Use the small model variant (no dense or residual connections) with a 6-level pyramid, upsampling level 2 by 4 to produce the final flow prediction
- Train the PWC-Net-small model on a mix of the `FlyingChairs` and `FlyingThings3DHalfRes` datasets using a Cyclic<sub>short</sub> schedule of our own
- The Cyclic<sub>short</sub> schedule oscillates between `5e-04` and `1e-05` for `200,000` steps with a stepsize of `40,000` (defined for a batch size of 8 and rescaled below for the actual batch size)
- Training uses mixed precision with a loss scaler of `128.0` and a total batch size of `32` (16 per GPU on two GPUs)
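
To see why mixed-precision training needs a loss scaler, the sketch below round-trips a small gradient value through IEEE half precision using only Python's standard `struct` module. The gradient value `1e-8` is made up for illustration, and this is not how the notebook's model applies scaling internally; it only shows the general idea behind static loss scaling with the `128.0` factor used here.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

LOSS_SCALER = 128.0   # same static scale factor as in this notebook
tiny_grad = 1e-8      # hypothetical gradient magnitude, chosen for illustration

# Without scaling, the value is below the fp16 subnormal range and becomes zero
unscaled = to_fp16(tiny_grad)

# Scaling the loss scales every gradient by the same factor (chain rule), which
# lifts the value into the representable fp16 range; the gradient is then
# divided by the scale in higher precision before the weight update
recovered = to_fp16(tiny_grad * LOSS_SCALER) / LOSS_SCALER

print(unscaled)    # 0.0
print(recovered)   # close to 1e-8, up to fp16 rounding error
```

A static scaler like this is a trade-off: a larger factor rescues smaller gradients but makes overflow of large gradients more likely.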

Below, look for `TODO` references and customize this notebook based on your own needs.

## Reference

[2018a]<a name="2018a"></a> Sun et al. 2018. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. [[arXiv]](https://arxiv.org/abs/1709.02371) [[web]](http://research.nvidia.com/publication/2018-02_PWC-Net%3A-CNNs-for) [[PyTorch (Official)]](https://github.com/NVlabs/PWC-Net/tree/master/PyTorch) [[Caffe (Official)]](https://github.com/NVlabs/PWC-Net/tree/master/Caffe)

In [1]:
"""
pwcnet_train.ipynb

PWC-Net model training.

Written by Phil Ferriere

Licensed under the MIT License (see LICENSE for details)

Tensorboard:
    [win] tensorboard --logdir=E:\\repos\\tf-optflow\\tfoptflow\\pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16
    [ubu] tensorboard --logdir=/media/EDrive/repos/tf-optflow/tfoptflow/pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16
"""
from __future__ import absolute_import, division, print_function
import sys
from copy import deepcopy
import tensorflow as tf

from dataset_base import _DEFAULT_DS_TRAIN_OPTIONS
from dataset_flyingchairs import FlyingChairsDataset
from dataset_flyingthings3d import FlyingThings3DHalfResDataset
from dataset_mixer import MixedDataset
from model_pwcnet import ModelPWCNet, _DEFAULT_PWCNET_TRAIN_OPTIONS

## TODO: Set this first!

In [2]:
# TODO: You MUST set _DATASET_ROOT to the correct path on your machine!
if sys.platform.startswith("win"):
    _DATASET_ROOT = 'E:/datasets/'
else:
    _DATASET_ROOT = '/Vol1/dbstore/datasets/'
_FLYINGCHAIRS_ROOT = _DATASET_ROOT + 'FlyingChairs_release'
_FLYINGTHINGS3DHALFRES_ROOT = _DATASET_ROOT + 'FlyingThings3D_HalfRes'
    
# TODO: You MUST adjust the settings below based on the number of GPU(s) used for training
# Set controller device and devices
# A single-GPU setup would be something like controller='/device:GPU:0' and gpu_devices=['/device:GPU:0']
# Here, we use a dual-GPU setup, as shown below
gpu_devices = ['/device:GPU:0', '/device:GPU:1']
controller = '/device:GPU:0'

# TODO: You MUST adjust this setting below based on the amount of memory on your GPU(s)
# Batch size
batch_size = 16

# Train on `FlyingChairs+FlyingThings3DHalfRes` mix

## Load the dataset

In [3]:
# TODO: You MUST set the batch size based on the capabilities of your GPU(s)
# Load the training datasets
ds_opts = deepcopy(_DEFAULT_DS_TRAIN_OPTIONS)
ds_opts['in_memory'] = False                          # Too many samples to keep in memory at once, so don't preload them
ds_opts['aug_type'] = 'heavy'                         # Apply all supported augmentations
ds_opts['batch_size'] = batch_size * len(gpu_devices) # Total batch size; use a multiple of 8 (here, 2 x 16 = 32 on a Titan X & 1080 Ti)
ds_opts['crop_preproc'] = (256, 448)                  # Crop to a smaller input size
ds1 = FlyingChairsDataset(mode='train_with_val', ds_root=_FLYINGCHAIRS_ROOT, options=ds_opts)
ds_opts['type'] = 'into_future'
ds2 = FlyingThings3DHalfResDataset(mode='train_with_val', ds_root=_FLYINGTHINGS3DHALFRES_ROOT, options=ds_opts)
ds = MixedDataset(mode='train_with_val', datasets=[ds1, ds2], options=ds_opts)

In [4]:
# Display dataset configuration
ds.print_config()


Dataset Configuration:
  verbose              False
  in_memory            False
  crop_preproc         (256, 448)
  scale_preproc        None
  tb_test_imgs         False
  random_seed          1969
  val_split            0.03
  aug_type             heavy
  aug_labels           True
  fliplr               0.5
  flipud               0.5
  translate            (0.5, 0.05)
  scale                (0.5, 0.05)
  batch_size           32
  type                 into_future
  mode                 train_with_val
  train size           41731
  val size             1292


## Configure the training

In [5]:
# Start from the default options
nn_opts = deepcopy(_DEFAULT_PWCNET_TRAIN_OPTIONS)
nn_opts['verbose'] = True
nn_opts['ckpt_dir'] = './pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/'
nn_opts['batch_size'] = ds_opts['batch_size']
nn_opts['x_shape'] = [2, ds_opts['crop_preproc'][0], ds_opts['crop_preproc'][1], 3]
nn_opts['y_shape'] = [ds_opts['crop_preproc'][0], ds_opts['crop_preproc'][1], 2]
nn_opts['use_tf_data'] = True # Use tf.data reader
nn_opts['gpu_devices'] = gpu_devices
nn_opts['controller'] = controller

# Use the PWC-Net-small model in quarter-resolution mode
nn_opts['use_dense_cx'] = False
nn_opts['use_res_cx'] = False
nn_opts['pyr_lvls'] = 6
nn_opts['flow_pred_lvl'] = 2

# Use mixed precision training
nn_opts['use_mixed_precision'] = True 
nn_opts['loss_scaler'] = 128.
nn_opts['x_dtype'] = tf.float32
nn_opts['y_dtype'] = tf.float32

# More options
nn_opts['max_to_keep'] = 50
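
As a rough illustration of what "quarter-resolution mode" means for the settings above, the sketch below lists the spatial size of each pyramid level for the `(256, 448)` crop. It assumes each level halves the resolution of the one before it, as in the PWC-Net paper. The helper `pyramid_shapes` is hypothetical and is not part of `model_pwcnet`.

```python
def pyramid_shapes(height, width, pyr_lvls):
    """Return {level: (h, w)}, assuming each level halves the previous one."""
    return {lvl: (height // 2 ** lvl, width // 2 ** lvl)
            for lvl in range(1, pyr_lvls + 1)}

shapes = pyramid_shapes(256, 448, pyr_lvls=6)
flow_pred_lvl = 2

# Level 2 is at 1/4 of the input resolution, so upsampling its flow by 4
# brings the prediction back to the input size
h, w = shapes[flow_pred_lvl]
print(shapes[flow_pred_lvl])   # (64, 112)
print((h * 4, w * 4))          # (256, 448)
print(shapes[6])               # (4, 7)
```

Both crop dimensions are multiples of 2^6 = 64, so every one of the six levels has an integer size.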

In [6]:
# Set the learning rate schedule. The values below are for a single GPU using a batch size of 8;
# they are rescaled afterwards to match the actual batch size and number of GPUs.
nn_opts['lr_policy'] = 'cyclic'
nn_opts['cyclic_lr_max'] = 5e-04
nn_opts['cyclic_lr_base'] = 1e-05
nn_opts['cyclic_lr_stepsize'] = 40000
nn_opts['max_steps'] = 200000

# Adjust the schedule to the total batch size across our 2 GPUs (scale the step counts by 8 / batch_size).
nn_opts['max_steps'] = int(nn_opts['max_steps'] * 8 / ds_opts['batch_size'])
nn_opts['cyclic_lr_stepsize'] = int(nn_opts['cyclic_lr_stepsize'] * 8 / ds_opts['batch_size'])
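
The arithmetic above is easy to check by hand. The sketch below reproduces the rescaling for a total batch size of 32 and then evaluates a cyclical learning rate at a few steps. The `triangular2` form (amplitude halved after each full cycle) is an assumption: it is inferred from the `lr` values printed in the training log further down, not taken from the `model_pwcnet` source. `cyclic_lr` is a hypothetical helper.

```python
import math

# Schedule defined for a batch size of 8 (values from the cell above)
base_lr, max_lr = 1e-05, 5e-04
max_steps, stepsize = 200000, 40000
batch_size = 32  # total batch size across both GPUs

# Same rescaling as in the notebook cell
max_steps = int(max_steps * 8 / batch_size)   # 50000
stepsize = int(stepsize * 8 / batch_size)     # 10000

def cyclic_lr(step, base_lr, max_lr, stepsize):
    """Triangular2 cyclical learning rate (assumed form, see the text above)."""
    cycle = math.floor(1 + step / (2 * stepsize))
    x = abs(step / stepsize - 2 * cycle + 1)
    amplitude = (max_lr - base_lr) / 2 ** (cycle - 1)
    return base_lr + amplitude * max(0.0, 1 - x)

for step in (19000, 24100, 30000, 35100):
    print(step, round(cyclic_lr(step, base_lr, max_lr, stepsize), 6))
```

Under this assumption the four steps evaluate to roughly `0.000059`, `0.000110`, `0.000255` and `0.000130`, in line with the values logged during training.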

In [7]:
# Instantiate the model and display the model configuration
nn = ModelPWCNet(mode='train_with_val', options=nn_opts, dataset=ds)
nn.print_config()

Building model towers...
  Building tower_0...
  ...tower_0 built.
  Building tower_1...
  ...tower_1 built.
... model towers built.
Initializing model from previous checkpoint ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-10000 to resume training...

INFO:tensorflow:Restoring parameters from ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-10000
... model initialized

Model Configuration:
  verbose                True
  ckpt_dir               ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/
  max_to_keep            50
  x_dtype                <dtype: 'float32'>
  x_shape                [2, 256, 448, 3]
  y_dtype                <dtype: 'float32'>
  y_shape                [256, 448, 2]
  train_mode             train
  adapt_info             None
  sparse_gt_flow         False
  display_step           100
  snapshot_step          1000
  val_step               1000
  val_batch_size         -1
  tb_val_imgs            pyramid
  tb_test_imgs           None
  gpu_devices        

## Train the model

In [9]:
# Train the model
nn.train()

Resume training from step 18670...
2019-04-22 10:52:42 Iter 18700 [Train]: loss=76.40, epe=5.74, lr=0.000074, samples/sec=59.6, sec/step=1.074, eta=9:20:19
2019-04-22 10:56:21 Iter 18800 [Train]: loss=78.32, epe=5.89, lr=0.000069, samples/sec=60.0, sec/step=1.067, eta=9:14:49
2019-04-22 11:00:02 Iter 18900 [Train]: loss=71.64, epe=5.37, lr=0.000064, samples/sec=60.1, sec/step=1.065, eta=9:11:52
2019-04-22 11:03:40 Iter 19000 [Train]: loss=67.32, epe=4.99, lr=0.000059, samples/sec=59.9, sec/step=1.069, eta=9:12:25
2019-04-22 11:04:05 Iter 19000 [Val]: loss=76.50, epe=5.71
Saving model...
INFO:tensorflow:./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-19000 is not in all_model_checkpoint_paths. Manually adding it.
... model saved in ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-19000
2019-04-22 11:08:16 Iter 19100 [Train]: loss=65.43, epe=4.83, lr=0.000054, samples/sec=60.0, sec/step=1.067, eta=9:09:43
2019-04-22 11:11:57 Iter 19200 [Train]: loss=64.44, epe=4.75, lr=0.0

... model saved in ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-24000
2019-04-22 14:38:43 Iter 24100 [Train]: loss=64.82, epe=4.79, lr=0.000110, samples/sec=53.8, sec/step=1.191, eta=8:33:54
2019-04-22 14:42:57 Iter 24200 [Train]: loss=66.23, epe=4.90, lr=0.000113, samples/sec=53.7, sec/step=1.192, eta=8:32:22
2019-04-22 14:47:10 Iter 24300 [Train]: loss=74.66, epe=5.58, lr=0.000115, samples/sec=55.0, sec/step=1.164, eta=8:18:47
2019-04-22 14:51:25 Iter 24400 [Train]: loss=110.92, epe=10.56, lr=0.000118, samples/sec=55.3, sec/step=1.158, eta=8:14:14
2019-04-22 14:55:49 Iter 24500 [Train]: loss=72.76, epe=5.57, lr=0.000120, samples/sec=53.7, sec/step=1.191, eta=8:26:09
2019-04-22 15:00:06 Iter 24600 [Train]: loss=65.84, epe=4.87, lr=0.000123, samples/sec=54.8, sec/step=1.167, eta=8:14:13
2019-04-22 15:04:25 Iter 24700 [Train]: loss=63.60, epe=4.68, lr=0.000125, samples/sec=54.2, sec/step=1.181, eta=8:18:04
2019-04-22 15:08:58 Iter 24800 [Train]: loss=66.07, epe=4.89, lr=0.000

2019-04-22 19:23:33 Iter 29600 [Train]: loss=173.64, epe=9.39, lr=0.000245, samples/sec=50.4, sec/step=1.270, eta=7:11:41
2019-04-22 19:28:55 Iter 29700 [Train]: loss=66.28, epe=4.89, lr=0.000248, samples/sec=49.6, sec/step=1.291, eta=7:16:54
2019-04-22 19:34:13 Iter 29800 [Train]: loss=72.86, epe=5.55, lr=0.000250, samples/sec=49.9, sec/step=1.283, eta=7:12:05
2019-04-22 19:39:32 Iter 29900 [Train]: loss=93.07, epe=8.08, lr=0.000253, samples/sec=49.6, sec/step=1.290, eta=7:12:18
2019-04-22 19:44:52 Iter 30000 [Train]: loss=67.73, epe=5.01, lr=0.000255, samples/sec=49.7, sec/step=1.287, eta=7:09:00
2019-04-22 19:45:25 Iter 30000 [Val]: loss=74.01, epe=5.49
Saving model...
INFO:tensorflow:./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-30000 is not in all_model_checkpoint_paths. Manually adding it.
... model saved in ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-30000
2019-04-22 19:51:36 Iter 30100 [Train]: loss=66.65, epe=4.97, lr=0.000253, samples/sec=49.6, sec/step=

... model saved in ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-35000
2019-04-23 01:31:02 Iter 35100 [Train]: loss=73.25, epe=5.49, lr=0.000130, samples/sec=37.3, sec/step=1.715, eta=7:05:48
2019-04-23 01:41:51 Iter 35200 [Train]: loss=58.43, epe=4.25, lr=0.000128, samples/sec=33.3, sec/step=1.919, eta=7:53:27
2019-04-23 01:52:38 Iter 35300 [Train]: loss=94.67, epe=7.25, lr=0.000125, samples/sec=35.1, sec/step=1.823, eta=7:26:35
2019-04-23 02:03:23 Iter 35400 [Train]: loss=57.81, epe=4.21, lr=0.000123, samples/sec=36.0, sec/step=1.780, eta=7:13:08
2019-04-23 02:14:15 Iter 35500 [Train]: loss=59.61, epe=4.35, lr=0.000120, samples/sec=36.9, sec/step=1.732, eta=6:58:36
2019-04-23 02:25:09 Iter 35600 [Train]: loss=59.82, epe=4.37, lr=0.000118, samples/sec=35.6, sec/step=1.800, eta=7:12:00
2019-04-23 02:35:43 Iter 35700 [Train]: loss=64.34, epe=4.86, lr=0.000115, samples/sec=35.8, sec/step=1.790, eta=7:06:39
2019-04-23 02:46:23 Iter 35800 [Train]: loss=55.64, epe=4.03, lr=0.00011

KeyboardInterrupt: 

## Training log

Here are the training curves for the run above:

![Loss curve](img/pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/loss.png)
![EPE curve](img/pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/epe.png)
![Learning rate curve](img/pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/lr.png)
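
The `eta` column in the training output can be sanity-checked from the other fields on the same line. The sketch below parses one logged line with a regular expression and compares the logged ETA with the remaining steps multiplied by `sec/step`. That relationship is an assumption inferred from the logged numbers, not from the library's source, and `MAX_STEPS` is the rescaled step count (200,000 × 8 / 32).

```python
import re

MAX_STEPS = 50000  # 200000 * 8 / 32, the rescaled schedule length

# One line copied from the training output in this notebook
line = ("2019-04-22 10:52:42 Iter 18700 [Train]: loss=76.40, epe=5.74, "
        "lr=0.000074, samples/sec=59.6, sec/step=1.074, eta=9:20:19")

pattern = re.compile(
    r"Iter (\d+) \[Train\]: .*sec/step=([\d.]+), eta=(\d+):(\d+):(\d+)")
m = pattern.search(line)
step, sec_per_step = int(m.group(1)), float(m.group(2))
logged_eta = int(m.group(3)) * 3600 + int(m.group(4)) * 60 + int(m.group(5))

# Assumed relationship: remaining steps times the current time per step
estimated_eta = (MAX_STEPS - step) * sec_per_step

print(logged_eta, round(estimated_eta))
```

The two values agree to within a few seconds (the logged `sec/step` is rounded), which is consistent with a run length of 50,000 steps.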