# Tensorflow tips and tricks

This notebook collects details, tips, and tricks that complement the introductory tutorial. They cover a range of topics, including tensorflow internals and performance.

In [1]:
import tensorflow as tf
import numpy as np


## Scopes and collections

Scopes and collections are two important notions that help tensorflow manage the graph. Scopes nest variables and operations hierarchically, which both organizes the graph display on tensorboard and allows operations to be reused in the appropriate place. Collections are named lists that tensorflow uses to keep track of tensors. For example, this is how tensorflow remembers all the summaries and all the trainable variables that we have created.

In [2]:
# Let's have a look at the effect of scopes
def my_function(x, name=None):
    with tf.variable_scope(name, 'my_function_default'):
        w = tf.get_variable(name='w', shape=[], dtype=tf.float32, trainable=True)
        print(w.name)
        
        return w * x

with tf.Graph().as_default():
    x = tf.random_normal(shape=[])
    
    my_function(x)
    my_function(x)
    my_function(x, 'test1')
    
    with tf.variable_scope('outer_scope'):
        my_function(x)
        my_function(x)
        my_function(x, 'test1')

my_function_default/w:0
my_function_default_1/w:0
test1/w:0
outer_scope/my_function_default/w:0
outer_scope/my_function_default_1/w:0
outer_scope/test1/w:0


In [3]:
# Let's inspect some commonly used collections
with tf.Graph().as_default():
    x = tf.random_normal(shape=[32, 224, 224, 3])
    
    x = tf.layers.conv2d(x, filters=20, kernel_size=3, strides=1)
    tf.summary.scalar('test', tf.reduce_mean(x))  # a scalar summary needs a rank-0 tensor
    
    print('trainable variables: {0}'.format(list(tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES))))
    print('summary ops: {0}'.format(list(tf.get_collection(tf.GraphKeys.SUMMARIES))))

trainable variables: [<tf.Variable 'conv2d/kernel:0' shape=(3, 3, 3, 20) dtype=float32_ref>, <tf.Variable 'conv2d/bias:0' shape=(20,) dtype=float32_ref>]
summary ops: [<tf.Tensor 'test:0' shape=() dtype=string>]


## Calling Python functions from Tensorflow

Sometimes you need functionality that relies on some other python package, and thus can only be used from python. As we have mentioned already, the programming style of tensorflow requires us to describe computation in a symbolic fashion, which is incompatible with working on the values directly. Tensorflow has an "escape hatch" for this scenario, `py_func`, although performance often suffers substantially (in particular, the wrapped function cannot run on GPU at all).

In [6]:
from scipy.special import airy

# airy returns the Ai, Bi Airy functions and their derivatives (4 values total).

with tf.Graph().as_default():
    x = tf.random_normal(shape=[])
    y = tf.py_func(airy, [x], [tf.float32] * 4, stateful=False)
    
    with tf.Session() as session:
        print(session.run(y))

[0.5356552, 0.0012485808, 0.091488756, 0.5944572]


## Estimator Hooks

We may sometimes want to run extra operations before or after some or all training steps. In the estimator framework, this can be accomplished by creating custom hooks.

In [8]:
class AddNoiseHook(tf.train.SessionRunHook):
    def __init__(self):
        # We will create this operation in `begin`
        self._op = None
    
    def begin(self):
        # create the operations we need here
        with tf.variable_scope('add_noise_to_weights'):
            ops = [tf.assign_add(v, 0.1 * tf.random_normal(shape=v.shape)) for v in tf.trainable_variables()]

        self._op = tf.group(*ops, name='add_noise')
    
    def before_run(self, context):
        # Ask estimator to also run our add noise op
        return tf.train.SessionRunArgs(self._op)
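
To make the order of the hook calls concrete, here is a minimal pure-python sketch of a training loop that drives a hook. It is not tensorflow code: `RecordingHook` and `toy_train` are hypothetical names, and the real loop runs the extra fetches in the same `session.run` call as the training op rather than one after the other.

```python
# A toy stand-in for the estimator training loop; NOT tensorflow code.
# RecordingHook and toy_train are hypothetical names used for illustration.

class RecordingHook(object):
    """Records the order in which the loop calls the hook methods."""
    def __init__(self):
        self.calls = []

    def begin(self):
        # Called once, before any step is run.
        self.calls.append('begin')

    def before_run(self, step):
        # Return extra work to run alongside the training step.
        self.calls.append('before_run:{0}'.format(step))
        return lambda: 'extra'

    def after_run(self, step, extra_result):
        # Receives the result of the extra work requested in before_run.
        self.calls.append('after_run:{0}:{1}'.format(step, extra_result))

    def end(self):
        # Called once, after the last step.
        self.calls.append('end')


def toy_train(train_step, hooks, steps):
    for hook in hooks:
        hook.begin()
    for step in range(steps):
        extras = [hook.before_run(step) for hook in hooks]
        train_step(step)  # plays the role of the regular training op
        results = [extra() for extra in extras]
        for hook, result in zip(hooks, results):
            hook.after_run(step, result)
    for hook in hooks:
        hook.end()


hook = RecordingHook()
toy_train(lambda step: None, [hook], steps=2)
print(hook.calls)
```

In `AddNoiseHook` above, `begin` plays the same role of building the extra operation once, and `before_run` asks for it to be executed at every step.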

In [10]:
# We reproduce some of the training code we had earlier to help our examples.
def get_mnist_dataset():
    """ This function creates a dataset which can be used to load data from the MNIST dataset. """
    def _format_image(raw_data):
        image = tf.decode_raw(raw_data, tf.uint8)
        image = tf.to_float(image)
        image = tf.reshape(image, [28, 28, 1])
        image = image / 255
        return image
    
    def _format_label(raw_data):
        label = tf.decode_raw(raw_data, tf.uint8)
        label = tf.reshape(label, [])
        return tf.to_int32(label)
    
    dataset_img = tf.data.FixedLengthRecordDataset('data/train-images-idx3-ubyte', 28 * 28, header_bytes=16)
    dataset_img = dataset_img.map(_format_image)
    
    dataset_label = tf.data.FixedLengthRecordDataset('data/train-labels-idx1-ubyte', 1, header_bytes=8)
    dataset_label = dataset_label.map(_format_label)
    
    return tf.data.Dataset.zip((dataset_img, dataset_label))

def make_input_fn(repeat_count=None, shuffle_size=1000):
    def input_fn():
        dataset = get_mnist_dataset()
        # Shuffle the dataset, and repeat as necessary
        if shuffle_size is not None and shuffle_size > 0:
            from tensorflow.contrib.data import shuffle_and_repeat
            dataset = dataset.apply(shuffle_and_repeat(shuffle_size, repeat_count))
        else:
            dataset = dataset.repeat(repeat_count)
        dataset = dataset.prefetch(128) # Prefetch enough for a single batch for performance
        dataset = dataset.batch(batch_size=128) # Batch it up
        dataset = dataset.prefetch(2) # Prefetch two batches to device.
    
        return dataset
    return input_fn

def _evaluate(fn_or_value):
    if callable(fn_or_value):
        return fn_or_value()
    else:
        return fn_or_value

def model_fn(features, labels, mode, params):
    # Here, the params parameter is passed in from tensorflow
    images = features
    images = tf.layers.flatten(images)
    logits = tf.layers.dense(images, units=10)
    
    predictions = tf.argmax(logits, axis=1, output_type=tf.int32)
    accuracy = tf.metrics.accuracy(labels=labels, predictions=predictions)
    
    metrics = {'accuracy': accuracy}
    
    loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
    
    optimizer = tf.train.AdamOptimizer(
        learning_rate=_evaluate(params['learning_rate']))
    train_op = optimizer.minimize(loss, global_step=tf.train.get_or_create_global_step())
    
    # This gives tensorflow the description of what must be done
    # to construct our model.
    return tf.estimator.EstimatorSpec(
        loss=loss,
        mode=mode,
        train_op=train_op,
        predictions=predictions,
        eval_metric_ops=metrics)

In [11]:
# Let's use our hook to mess things up.
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    params={
        'learning_rate': lambda: tf.train.inverse_time_decay(
            0.1, tf.train.get_or_create_global_step(),
            decay_steps=10,
            decay_rate=1,
            staircase=True)
    })
print('\n---- Starting training ------')
estimator.train(make_input_fn(), steps=500)
print('---- Training Done ------')

print('\n---- Resuming training with noise ------')
estimator.train(make_input_fn(), steps=500, hooks=[AddNoiseHook()])
print('---- Training Done ------')

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\wenda\\AppData\\Local\\Temp\\tmp5qmv_t67', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000265AAA0AAC8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

---- Starting training ------
Instructions for updating:
Use the retry module or similar alternatives.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running lo
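
The learning rate passed through `params` above follows the inverse time decay schedule. As a rough illustration, the formula documented for `tf.train.inverse_time_decay` can be written in pure python as below; the function here is a sketch that mirrors that formula and is not the tensorflow implementation.

```python
import math

def inverse_time_decay(initial_rate, step, decay_steps, decay_rate, staircase=False):
    """Pure-python sketch of: initial_rate / (1 + decay_rate * step / decay_steps)."""
    progress = step / float(decay_steps)
    if staircase:
        # With staircase=True the rate only changes every `decay_steps` steps.
        progress = math.floor(progress)
    return initial_rate / (1.0 + decay_rate * progress)

# Same settings as the estimator above: 0.1 initial rate, decay every 10 steps.
for step in [0, 9, 10, 100, 499]:
    rate = inverse_time_decay(0.1, step, decay_steps=10, decay_rate=1, staircase=True)
    print(step, rate)
```

With these settings the rate stays at 0.1 for the first ten steps, halves at step 10, and has dropped by a factor of 50 by the end of the 500 training steps.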

# High-Performance models in tensorflow

Deep learning applications can be quite sensitive to performance, and there are numerous techniques used to enable faster training. A good overview is available on the [tensorflow website](https://www.tensorflow.org/performance/performance_guide). We discuss a couple here.

## MKL-Optimized Tensorflow on CPU

If you are using tensorflow on CPU, it can be very advantageous to obtain a version that has been compiled with Intel's MKL-DNN library. In addition to substantially speeding up inference and training, it implements NCHW convolutions on CPU (whereas default tensorflow can only perform NHWC convolutions on CPU).
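
For readers unfamiliar with the two layouts, the sketch below shows where a given element lives in memory under each of them. It is plain python with hypothetical helper names (`offset_nhwc`, `offset_nchw`) and assumes a dense, row-major buffer; it only illustrates the indexing, not how tensorflow or MKL-DNN implement convolutions.

```python
def offset_nhwc(n, h, w, c, shape):
    """Flat offset of element (n, h, w, c) when channels are stored last."""
    _, H, W, C = shape
    return ((n * H + h) * W + w) * C + c

def offset_nchw(n, h, w, c, shape):
    """Flat offset of the same element when channels are stored first."""
    _, H, W, C = shape
    return ((n * C + c) * H + h) * W + w

shape = (32, 224, 224, 3)  # batch, height, width, channels

# Distance in memory between two horizontally adjacent pixels.
step_w_nhwc = offset_nhwc(0, 0, 1, 0, shape) - offset_nhwc(0, 0, 0, 0, shape)
step_w_nchw = offset_nchw(0, 0, 1, 0, shape) - offset_nchw(0, 0, 0, 0, shape)

# Distance in memory between two channels of the same pixel.
step_c_nhwc = offset_nhwc(0, 0, 0, 1, shape) - offset_nhwc(0, 0, 0, 0, shape)
step_c_nchw = offset_nchw(0, 0, 0, 1, shape) - offset_nchw(0, 0, 0, 0, shape)

print('NHWC: next pixel +{0}, next channel +{1}'.format(step_w_nhwc, step_c_nhwc))
print('NCHW: next pixel +{0}, next channel +{1}'.format(step_w_nchw, step_c_nchw))
```

In NHWC the channels of one pixel are contiguous, while in NCHW each channel forms a contiguous image plane.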

## Multi-GPU and distributed training

Deep-learning training is often easily parallelizable by splitting the minibatch computation across different devices. These may be several GPUs on the same machine, or even different machines. Tensorflow provides several strategies to do this in an efficient manner. Beyond doing it manually, the most recent (and easiest to use) API is the `tf.contrib.distribute` API, coming out in the next version (r1.8). The external library [horovod](https://github.com/uber/horovod) is also a good choice and leverages MPI on the cluster. Note that when you use multi-GPU training, you should take care to rescale the learning rate and the total number of steps, as you are usually implicitly modifying the batch size.
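
As a rough illustration of this rescaling, the sketch below applies the linear scaling heuristic: multiply the learning rate by the number of devices and divide the number of steps by it, so that the total number of examples seen stays about the same. This heuristic is an assumption chosen for the example, not something tensorflow applies for you, and `scale_for_devices` is a hypothetical helper.

```python
import math

def scale_for_devices(base_learning_rate, base_steps, per_device_batch, num_devices):
    """Linear-scaling heuristic (an assumption, not a tensorflow API)."""
    # Each device processes its own minibatch, so the effective batch grows.
    global_batch = per_device_batch * num_devices
    # Scale the learning rate with the effective batch size.
    scaled_learning_rate = base_learning_rate * num_devices
    # Keep the total number of training examples roughly constant.
    scaled_steps = int(math.ceil(base_steps / float(num_devices)))
    return global_batch, scaled_learning_rate, scaled_steps

# Settings from this notebook (batch of 128, 500 steps, rate 0.1) on 4 devices.
print(scale_for_devices(0.1, 500, per_device_batch=128, num_devices=4))
```

In practice the scaled learning rate is a starting point to validate rather than a guarantee, especially for large numbers of devices.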

## Half-precision floating point

On the recent P100 and V100 GPUs, the throughput of half-precision linear algebra is substantially higher than that of normal 32-bit single-precision floating point operations. There are some subtleties to using half-precision floating point numbers without degrading the accuracy of the trained model, discussed [here](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/). Currently, implementing these in tensorflow is a manual process, although the developers plan to add automated tools for mixed precision training.
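
One of these subtleties is that small gradients underflow to zero in half precision, which is why the loss is scaled up before computing gradients. The sketch below shows the effect using only the python standard library (the `'e'` format of `struct` is IEEE half precision). The gradient value `1e-8` is a hypothetical magnitude chosen for illustration, and the scale of 128 matches the `LOSS_SCALE` used in the model in this section.

```python
import struct

def to_float16(value):
    """Round a python float to the nearest IEEE half-precision value."""
    return struct.unpack('e', struct.pack('e', value))[0]

LOSS_SCALE = 128
small_gradient = 1e-8  # hypothetical gradient magnitude

# Stored directly in float16, the gradient is rounded to zero.
unscaled = to_float16(small_gradient)

# Scaling the loss scales the gradient too, which keeps it representable.
scaled = to_float16(small_gradient * LOSS_SCALE)

# Dividing by the scale in higher precision recovers the value approximately.
recovered = scaled / LOSS_SCALE

print('float16 without scaling:', unscaled)
print('float16 with scaling:   ', scaled)
print('recovered gradient:     ', recovered)
```

The recovered value is only approximate, since the scaled gradient is still rounded to half precision, but it is no longer lost entirely.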

In [19]:
# We need some helpers for mixed-precision training.

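def mixed_precision_getter(dtype=tf.float16):
    """ A custom getter which implements mixed precision by creating
    the variables in float32 and casting them to `dtype` for computation.
    
    """
    def custom_getter(getter, *args, **kwargs):
        # Store the variable itself in float32 so that small updates are not lost
        kwargs['dtype'] = tf.float32
        variable = getter(*args, **kwargs)
        return tf.cast(variable, dtype)

    # Return the getter, otherwise `custom_getter=None` is passed to the scope
    return custom_getter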

def model_fn_mixed(features, labels, mode, params):
    # Here, the params parameter is passed in from tensorflow
    images = features
    
    with tf.variable_scope('Network', custom_getter=mixed_precision_getter()):
        images = tf.layers.flatten(images)
        # Convert the input to float16
        images = tf.cast(images, tf.float16)
        
        # Compute the network function
        logits = tf.layers.dense(images, units=10)
        
        # Convert the output back to float32 for precision
        logits = tf.to_float(logits)

        predictions = tf.argmax(logits, axis=1, output_type=tf.int32)
        accuracy = tf.metrics.accuracy(labels=labels, predictions=predictions)

        metrics = {'accuracy': accuracy}

        loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)

        optimizer = tf.train.GradientDescentOptimizer(
            learning_rate=_evaluate(params['learning_rate']))
        
        # To avoid loss of precision, we scale up the loss and then scale down the gradients
        LOSS_SCALE = 128
        grads_and_vars = optimizer.compute_gradients(LOSS_SCALE * loss)
        grads_and_vars = [(g / LOSS_SCALE, v) for (g, v) in grads_and_vars]
        train_op = optimizer.apply_gradients(grads_and_vars, global_step=tf.train.get_or_create_global_step())

    # This gives tensorflow the description of what must be done
    # to construct our model.
    return tf.estimator.EstimatorSpec(
        loss=loss,
        mode=mode,
        train_op=train_op,
        predictions=predictions,
        eval_metric_ops=metrics)

# Let's do mixed precision training.
# For a model this small, expect it to be slower than float32 training.
estimator = tf.estimator.Estimator(
    model_fn=model_fn_mixed,
    params={
        'learning_rate': lambda: tf.train.inverse_time_decay(
            0.1, tf.train.get_or_create_global_step(),
            decay_steps=10,
            decay_rate=1,
            staircase=True)
    })
print('\n---- Starting training ------')
estimator.train(make_input_fn(), steps=500)
print('---- Training Done ------')

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\wenda\\AppData\\Local\\Temp\\tmph77t56tf', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x00000265BC311908>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

---- Starting training ------
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\wenda

## TPUs

Google has developed custom hardware, called [TPU](https://cloud.google.com/tpu/) and available on GCP, for fast computation and large-scale inference. TPUs are accessible through an API similar to the estimator API, but it is somewhat trickier and more cumbersome to use. In particular, they require a careful separation of the host and TPU operations, which makes operations such as recording summaries somewhat more challenging.

# Eager execution

Eager execution is a relatively new tensorflow feature that lets us use tensorflow to compute values directly instead of building a graph. It trades some performance for a lot of flexibility, and is much more similar to pytorch. On the other hand, it has some drawbacks (for example, it does not work on TPU), and not all APIs work in eager mode (eager-compatible versions of newer APIs can often be found in `tf.contrib`).

Note that once you enable eager mode, you cannot disable it within the same python session; you will need to restart your python kernel to restore the default mode. Eager execution must also be enabled before you do anything else with tensorflow. To run this last part of the notebook, you will need to restart your kernel now.

In [1]:
import tensorflow as tf
tf.enable_eager_execution()


In [3]:
x = tf.random_normal(shape=[])
print(x)

tf.Tensor(-0.8435967, shape=(), dtype=float32)


In [4]:
print(2 * x)

tf.Tensor(-1.6871934, shape=(), dtype=float32)


In [5]:
# Most of the dataset API is compatible with eager mode.
def get_mnist_dataset():
    """ This function creates a dataset which can be used to load data from the MNIST dataset. """
    def _format_image(raw_data):
        image = tf.decode_raw(raw_data, tf.uint8)
        image = tf.to_float(image)
        image = tf.reshape(image, [28, 28, 1])
        image = image / 255
        return image
    
    def _format_label(raw_data):
        label = tf.decode_raw(raw_data, tf.uint8)
        label = tf.reshape(label, [])
        return tf.to_int32(label)
    
    dataset_img = tf.data.FixedLengthRecordDataset('data/train-images-idx3-ubyte', 28 * 28, header_bytes=16)
    dataset_img = dataset_img.map(_format_image)
    
    dataset_label = tf.data.FixedLengthRecordDataset('data/train-labels-idx1-ubyte', 1, header_bytes=8)
    dataset_label = dataset_label.map(_format_label)
    
    return tf.data.Dataset.zip((dataset_img, dataset_label))

In [6]:
dataset = get_mnist_dataset()

In [9]:
iterator = tf.contrib.eager.Iterator(dataset)

In [None]:
print(next(iterator))