##### Copyright 2020 The TensorFlow Authors.

In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# TensorFlow Profiler: Profile model performance

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/tensorboard/blob/master/docs/tensorboard_profiling_keras.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/tensorboard/blob/master/docs/tensorboard_profiling_keras.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Overview
Machine learning algorithms are typically computationally expensive, so it is vital to measure the performance of your machine learning application to make sure that you are running the most optimized version of your model. Use the TensorFlow Profiler to profile the execution of your TensorFlow code.

Before you get started, select GPU as the Hardware accelerator in **Edit > Notebook settings**.

## Setup

In [0]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from datetime import datetime
from packaging import version

import os

The TensorFlow Profiler requires the latest versions of TensorFlow and TensorBoard. Use the nightly builds of TensorFlow and TensorBoard until the 2.2 versions of both libraries are released.


In [0]:
# Uninstall twice to remove both the 1.15.0 and 2.1.0 versions of TensorFlow and TensorBoard.
!pip uninstall -y -q tensorflow tensorboard
!pip uninstall -y -q tensorflow tensorboard
!pip install -U -q tf-nightly tb-nightly tensorboard_plugin_profile

In [0]:
import tensorflow as tf

print("TensorFlow version: ", tf.__version__)

Confirm that TensorFlow can access the GPU.

In [0]:
device_name = tf.test.gpu_device_name()
if not device_name:
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

## Train an image classification model with TensorBoard callbacks

In this tutorial, you explore the capabilities of the TensorFlow Profiler by capturing the performance profile obtained by training a model to classify images in the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). 

Use TensorFlow Datasets to import the data, already split into training and test sets.

In [0]:
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

In [0]:
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

Preprocess the training and test data by normalizing pixel values to be between 0 and 1.

In [0]:
def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(normalize_img)
ds_train = ds_train.batch(128)

In [0]:
ds_test = ds_test.map(normalize_img)
ds_test = ds_test.batch(128)

Create the image classification model using Keras.

In [0]:
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=tf.keras.optimizers.Adam(0.001),
    metrics=['accuracy']
)

Create a TensorBoard callback to capture performance profiles and pass it to `model.fit` when training the model.

In [0]:
# Create a TensorBoard callback
logs = "logs/" + datetime.now().strftime("%Y%m%d-%H%M%S")

tboard_callback = tf.keras.callbacks.TensorBoard(log_dir = logs,
                                                 histogram_freq = 1,
                                                 profile_batch = '500,520')

model.fit(ds_train,
          epochs=2,
          validation_data=ds_test,
          callbacks = [tboard_callback])
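The `profile_batch = '500,520'` setting asks for a profile of batches 500 through 520. A rough sketch of the step arithmetic shows why that range falls in the second epoch. It assumes the standard MNIST training split of 60,000 images (a figure not stated above) and that batch numbers are counted cumulatively across epochs. The helper name `locate_step` is hypothetical.

```python
import math

NUM_TRAIN_EXAMPLES = 60_000  # assumption: size of the standard MNIST training split
BATCH_SIZE = 128             # batch size used in the input pipeline above

# The last batch of an epoch may be partial, so round up.
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / BATCH_SIZE)

def locate_step(global_step, steps_per_epoch):
  """Maps a 1-based cumulative step to (1-based epoch, 1-based step in that epoch)."""
  epoch, step_in_epoch = divmod(global_step - 1, steps_per_epoch)
  return epoch + 1, step_in_epoch + 1

print("Steps per epoch:", steps_per_epoch)
for step in (500, 520):
  print("Step", step, "->", locate_step(step, steps_per_epoch))
```

Under these assumptions an epoch has 469 steps, so the profiled range only becomes reachable once training runs for at least two epochs.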

## Use the TensorFlow Profiler to profile model training performance

The TensorFlow Profiler is embedded within TensorBoard. Load TensorBoard using Colab magic and launch it. View the performance profiles by navigating to the **Profile** tab. 

In [0]:
# Load the TensorBoard notebook extension.
%load_ext tensorboard

In [0]:
%tensorboard --logdir=logs  # Navigate to the profile tab to view performance profile.

<img class="tfo-display-only-on-site" src="https://github.com/tensorflow/tensorboard/blob/master/docs/images/profiler_overview_page_bad_ip.png?raw=1"/>

The **Profile** tab opens the Overview page, which shows you a high-level summary of your model performance. Looking at the Step-time Graph on the right, you can see that the model is highly input bound (i.e., it spends a lot of time in the data input pipeline). The Overview page also gives you recommendations on potential next steps you can follow to optimize your model performance.

To understand where the performance bottleneck occurs in the input pipeline, select the **Trace Viewer** from the **Tools** dropdown on the left. The Trace Viewer shows you a timeline of the different events that occurred on the CPU and the GPU during the profiling period.

The Trace Viewer shows multiple event groups on the vertical axis. Each event group has multiple horizontal tracks, filled with trace events. Each track is a timeline of the events executed on a thread or a GPU stream. Individual events are the colored, rectangular blocks on the timeline tracks. Time moves from left to right. Navigate the trace events by using the keyboard shortcuts `W` (zoom in), `S` (zoom out), `A` (scroll left), and `D` (scroll right).

A single rectangle represents a trace event. Select the mouse cursor icon in the floating tool bar (or use the keyboard shortcut `1`) and click the trace event to analyze it. This will display information about the event, such as its start time and duration.

In addition to clicking, you can drag the mouse to select a group of trace events. This will give you a list of all the events in that area along with an event summary. Use the `M` key to measure the time duration of the selected events.
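To make the idea of an event summary concrete, here is a minimal sketch of the kind of aggregation a selection implies: pick the events that fall inside a time window, total their durations by name, and measure the span they cover. This is an illustration only, not the Trace Viewer's implementation. The field names (`name`, `ts`, `dur`), the helper `summarize_selection`, and all of the event data are assumptions made up for this example.

```python
def summarize_selection(events, window_start_us, window_end_us):
  """Summarizes events that lie entirely inside the selected time window."""
  selected = [
      e for e in events
      if e["ts"] >= window_start_us and e["ts"] + e["dur"] <= window_end_us
  ]
  # Total duration per event name.
  totals = {}
  for e in selected:
    totals[e["name"]] = totals.get(e["name"], 0) + e["dur"]
  # Span from the earliest start to the latest end among the selected events.
  if selected:
    span = (max(e["ts"] + e["dur"] for e in selected)
            - min(e["ts"] for e in selected))
  else:
    span = 0
  return {"count": len(selected), "total_dur_us": totals, "wall_duration_us": span}

# Hypothetical events; timestamps and durations are in microseconds.
events = [
    {"name": "tf_data_iterator_get_next", "ts": 0, "dur": 900},
    {"name": "MatMul", "ts": 950, "dur": 40},
    {"name": "MatMul", "ts": 1000, "dur": 35},
    {"name": "Softmax", "ts": 1040, "dur": 10},
]
summary = summarize_selection(events, 900, 1100)
print(summary)
```

The sum of the per-event durations and the span covered by the selection are different quantities, because events can overlap or leave gaps between them.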

Trace events are collected from:

*   **CPU:** CPU events are displayed under an event group named `/host:CPU`. Each track represents a thread on the CPU. CPU events include input pipeline events, GPU operation (op) scheduling events, CPU op execution events, etc.
*   **GPU:** GPU events are displayed under event groups prefixed by `/device:GPU:`. Each event group represents one stream on the GPU. 

## Debug performance bottlenecks

Use the Trace Viewer to locate the performance bottlenecks in your input pipeline. The image below is a snapshot of the performance profile. 

![profiler_trace_viewer_bad_ip](https://github.com/tensorflow/tensorboard/blob/master/docs/images/profiler_trace_viewer_bad_ip.png?raw=1)

Looking at the event traces, you can see that the GPU is inactive while the `tf_data_iterator_get_next` op is running on the CPU. This op is responsible for processing the input data and sending it to the GPU for training. As a general rule of thumb, it is a good idea to always keep the device (GPU/TPU) active.
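One way to put a number on "the GPU is inactive" is to compute the fraction of a time window that is not covered by any GPU event. The sketch below does this for a list of `(start, end)` intervals. The helper names `busy_time` and `idle_fraction` and the interval values are hypothetical; in practice you would read the timings off the Trace Viewer rather than compute them this way.

```python
def busy_time(intervals):
  """Returns the total time covered by (start, end) intervals, merging overlaps."""
  total = 0
  current_start = current_end = None
  for start, end in sorted(intervals):
    if current_end is None or start > current_end:
      # A gap: close the previous merged interval and start a new one.
      if current_end is not None:
        total += current_end - current_start
      current_start, current_end = start, end
    else:
      # Overlap: extend the current merged interval.
      current_end = max(current_end, end)
  if current_end is not None:
    total += current_end - current_start
  return total

def idle_fraction(intervals, window_start, window_end):
  """Returns the fraction of the window in which no interval is active."""
  clipped = [
      (max(start, window_start), min(end, window_end))
      for start, end in intervals
      if end > window_start and start < window_end
  ]
  return 1.0 - busy_time(clipped) / (window_end - window_start)

# Hypothetical GPU activity (microseconds) during one 1000 us training step.
gpu_events = [(900, 950), (940, 1000)]
print("GPU idle fraction:", idle_fraction(gpu_events, 0, 1000))
```

A high idle fraction on the device while host-side input ops are running is the pattern that points to an input-bound model.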

Use the `tf.data` API to optimize the input pipeline. In this case, let's cache the training dataset and prefetch the data to ensure that there is always data available for the GPU to process. See the [`tf.data` performance guide](https://www.tensorflow.org/guide/data_performance) for more details on using `tf.data` to optimize your input pipelines.


In [0]:
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

In [0]:
ds_train = ds_train.map(normalize_img)
ds_train = ds_train.batch(128)
ds_train = ds_train.cache()
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

In [0]:
ds_test = ds_test.map(normalize_img)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)
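To build intuition for why prefetching helps, consider an idealized two-stage model: without prefetching, each step pays for input preparation and then for training; with prefetching, the two stages overlap, so each step costs roughly the slower of the two. The sketch below is a back-of-the-envelope model under that assumption, not a description of how `tf.data` is implemented. The function name `epoch_time` and the per-step timings are made up for illustration.

```python
def epoch_time(num_steps, input_ms, compute_ms, prefetch):
  """Idealized epoch time in milliseconds for a two-stage input/compute pipeline."""
  if num_steps == 0:
    return 0
  if not prefetch:
    # Each step waits for its input, then trains on it.
    return num_steps * (input_ms + compute_ms)
  # With perfect overlap: the first batch must be prepared up front, the last
  # batch must still be trained on, and every step in between costs the
  # slower of the two stages.
  return input_ms + (num_steps - 1) * max(input_ms, compute_ms) + compute_ms

# Hypothetical timings: 3 ms to prepare a batch, 2 ms to train on it.
serial = epoch_time(469, input_ms=3, compute_ms=2, prefetch=False)
overlapped = epoch_time(469, input_ms=3, compute_ms=2, prefetch=True)
print("Without prefetch:", serial, "ms")
print("With prefetch:   ", overlapped, "ms")
```

In this model, caching acts by lowering `input_ms` after the first epoch, which matters most when the input stage is the slower one.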

Train the model again and capture the performance profile by reusing the callback from before.

In [0]:
model.fit(ds_train,
          epochs=2,
          validation_data=ds_test,
          callbacks = [tboard_callback])

Re-launch TensorBoard and open the **Profile** tab to observe the performance profile for the updated input pipeline.

In [0]:
%tensorboard --logdir=logs

<img class="tfo-display-only-on-site" src="https://github.com/tensorflow/tensorboard/blob/master/docs/images/profiler_overview_page_good_ip.png?raw=1"/>

From the Overview page, you can see that both the Average Step time and the Input Step time have decreased. The Step-time Graph also indicates that the model is no longer input bound. Open the Trace Viewer to examine the trace events with the optimized input pipeline.

![profiler_trace_viewer_good_ip](https://github.com/tensorflow/tensorboard/blob/master/docs/images/profiler_trace_viewer_good_ip.png?raw=1)

The Trace Viewer shows that the `tf_data_iterator_get_next` op executes much faster. The GPU therefore gets a steady stream of data to train on and achieves much better utilization throughout model training.

## Summary

Use the TensorFlow Profiler to profile and debug model training performance. Read the [Profiler guide](https://www.tensorflow.org/guide/profiler) to learn more about the various profiling tools and data collection modes available with the TensorFlow Profiler. 