# TensorBoard Profiler

Source: https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras

In this notebook, we'll see how we can use TensorBoard to profile a training (or inference) run and optimize it for performance.

Let's start by clearing the log directory, loading the TensorBoard notebook extension, and importing the required modules.

In [1]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

# Clear any logs from previous runs
!rm -rf ./tb_log/ 

import tensorflow as tf
from tensorflow import keras

2021-12-22 12:14:28.280069: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-12-22 12:14:28.280091: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


##### Download the dataset

Download the MNIST dataset. Note that, this time, we'll use TensorFlow Datasets (`tensorflow_datasets`) rather than the Keras loader, because its input pipeline gives the TensorBoard profiler more to show.

In [2]:
!pip install tensorflow_datasets

Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow_datasets
  Downloading tensorflow_datasets-4.4.0-py3-none-any.whl (4.0 MB)
Collecting tensorflow-metadata
  Downloading tensorflow_metadata-1.5.0-py3-none-any.whl (48 kB)
Collecting promise
  Downloading promise-2.3.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting dill
  Downloading dill-0.3.4-py2.py3-none-any.whl (86 kB)
Collecting googleapis-common-protos<2,>=1.52.0
  Downloading googleapis_common_protos-1.54.0-py2.py3-none-any.whl (207 kB)
Collecting absl-py
  Downloading absl_py-0.12.0-py3-none-any.whl (129 kB)

In [3]:
# Equivalent in keras
# mnist = keras.datasets.mnist
# (x_train, y_train),(x_test, y_test) = mnist.load_data()
# x_train, x_test = x_train / 255.0, x_test / 255.0

import tensorflow_datasets as tfds

(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.cast(image, tf.float32) / 255., label


ds_train = ds_train.map(normalize_img)
ds_train = ds_train.batch(128)
ds_test = ds_test.map(normalize_img)
ds_test = ds_test.batch(128)

2021-12-22 12:14:41.029374: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".


Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /home/matteo/tensorflow_datasets/mnist/3.0.1...

Dataset mnist downloaded and prepared to /home/matteo/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.


2021-12-22 12:15:20.946912: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-12-22 12:15:20.946938: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2021-12-22 12:15:20.946958: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (matteo-Inspiron-7591-2n1): /proc/driver/nvidia/version does not exist
2021-12-22 12:15:20.947214: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


##### Build the Model

Create a simple two-layer fully-connected DNN.

In [4]:
model = keras.models.Sequential([
  keras.layers.Flatten(input_shape=(28, 28, 1)),
  keras.layers.Dense(128,activation='relu'),
  keras.layers.Dense(10, activation='softmax')
])

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=tf.keras.optimizers.Adam(0.001),
    metrics=['accuracy']
)
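As a quick sanity check on the size of this architecture, the number of trainable parameters can be worked out by hand. The sketch below uses plain Python only; the helper name `dense_params` is hypothetical and introduced just for illustration.

```python
# Parameter count for the Flatten -> Dense(128) -> Dense(10) model above.
# `dense_params` is a hypothetical helper, not part of Keras.

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

n_inputs = 28 * 28 * 1                 # Flatten has no parameters of its own
hidden = dense_params(n_inputs, 128)   # first Dense layer
output = dense_params(128, 10)         # second Dense layer
total_params = hidden + output

print(hidden, output, total_params)
```

The total should agree with the "Trainable params" figure reported by `model.summary()`.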


##### Train the Model

Create a TensorBoard callback with the `profile_batch` option. In this case, let us profile batches 500 to 520. Batches are counted across epochs, so with 469 batches per epoch this range falls in the second epoch.

Then, train the model.

In [5]:
logs = "./tb_log"

tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir=logs,
    histogram_freq=1,
    profile_batch='500,520',
)

# Using the test data for validation, just for simplicity
model.fit(ds_train, epochs=5, validation_data=ds_test, callbacks=[tb_callback])


2021-12-22 12:15:22.261812: I tensorflow/core/profiler/lib/profiler_session.cc:110] Profiler session initializing.
2021-12-22 12:15:22.261899: I tensorflow/core/profiler/lib/profiler_session.cc:125] Profiler session started.
2021-12-22 12:15:22.266503: I tensorflow/core/profiler/lib/profiler_session.cc:143] Profiler session tear down.


Epoch 1/5
Epoch 2/5
 42/469 [=>............................] - ETA: 1s - loss: 0.1993 - accuracy: 0.9412

2021-12-22 12:15:24.704117: I tensorflow/core/profiler/lib/profiler_session.cc:110] Profiler session initializing.
2021-12-22 12:15:24.704150: I tensorflow/core/profiler/lib/profiler_session.cc:125] Profiler session started.
2021-12-22 12:15:24.757959: I tensorflow/core/profiler/lib/profiler_session.cc:67] Profiler session collecting data.
2021-12-22 12:15:24.817609: I tensorflow/core/profiler/lib/profiler_session.cc:143] Profiler session tear down.




2021-12-22 12:15:24.917148: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: ./tb_log/plugins/profile/2021_12_22_12_15_24

2021-12-22 12:15:24.967400: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for trace.json.gz to ./tb_log/plugins/profile/2021_12_22_12_15_24/matteo-Inspiron-7591-2n1.trace.json.gz
2021-12-22 12:15:24.997290: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: ./tb_log/plugins/profile/2021_12_22_12_15_24

2021-12-22 12:15:24.997421: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for memory_profile.json.gz to ./tb_log/plugins/profile/2021_12_22_12_15_24/matteo-Inspiron-7591-2n1.memory_profile.json.gz
2021-12-22 12:15:24.998105: I tensorflow/core/profiler/rpc/client/capture_profile.cc:251] Creating directory: ./tb_log/plugins/profile/2021_12_22_12_15_24
Dumped tool data for xplane.pb to ./tb_log/plugins/profile/2021_12_22_12_15_24/matteo-I

Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f03520a4220>
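The profiler messages in the output above show up during epoch 2 rather than epoch 1. A minimal sketch of the arithmetic behind this, assuming that `profile_batch` counts training batches cumulatively across epochs starting from 1 (an assumption about the callback's bookkeeping, not something stated in this notebook), and using the 60,000 MNIST training examples with a batch size of 128. The function name `locate_batch` is hypothetical.

```python
import math

def locate_batch(global_batch, num_examples, batch_size):
    """Map a cumulative, 1-based batch index to (epoch, step, steps_per_epoch).

    Assumes batches are numbered across epochs starting at 1.
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    epoch = (global_batch - 1) // steps_per_epoch + 1
    step = (global_batch - 1) % steps_per_epoch + 1
    return epoch, step, steps_per_epoch

# MNIST training split: 60,000 examples, batched by 128 as above.
start = locate_batch(500, 60000, 128)
stop = locate_batch(520, 60000, 128)
print(start, stop)
```

Under these assumptions both ends of the range land in epoch 2, which matches where the profiler session starts and is torn down in the training log.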

##### Examine Profiling Results

Open TensorBoard (in the notebook or from the command line) and select `PROFILE` from the dropdown menu to examine the results.

In [6]:
%tensorboard --logdir="./tb_log"

##### Optimize for Performance

Optimize the input pipeline to speed up processing. In particular, cache and prefetch the data so that the model does not stall waiting for input (see the dataset API lecture).

In [7]:
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

ds_train = ds_train.map(normalize_img)
ds_train = ds_train.batch(128)
ds_train = ds_train.cache()
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

ds_test = ds_test.map(normalize_img)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)


##### Train the Model (v2)

Train the model again.

In [8]:
model.fit(ds_train, epochs=5, validation_data=ds_test, callbacks=[tb_callback])

Epoch 1/5
Epoch 2/5

2021-12-22 12:15:34.574677: I tensorflow/core/profiler/lib/profiler_session.cc:110] Profiler session initializing.
2021-12-22 12:15:34.574709: I tensorflow/core/profiler/lib/profiler_session.cc:125] Profiler session started.
2021-12-22 12:15:34.606636: I tensorflow/core/profiler/lib/profiler_session.cc:67] Profiler session collecting data.
2021-12-22 12:15:34.609227: I tensorflow/core/profiler/lib/profiler_session.cc:143] Profiler session tear down.
2021-12-22 12:15:34.615203: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: ./tb_log/plugins/profile/2021_12_22_12_15_34

2021-12-22 12:15:34.619759: I tensorflow/core/profiler/rpc/client/save_profile.cc:142] Dumped gzipped tool data for trace.json.gz to ./tb_log/plugins/profile/2021_12_22_12_15_34/matteo-Inspiron-7591-2n1.trace.json.gz
2021-12-22 12:15:34.622695: I tensorflow/core/profiler/rpc/client/save_profile.cc:136] Creating directory: ./tb_log/plugins/profile/2021_12_22_12_15_34

2021-12-22 12:15:34.622

Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f03526cc160>

Check TensorBoard again and compare the two runs!
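As the log messages show, each profiling session is written to its own timestamped directory under `./tb_log/plugins/profile/`. A minimal sketch for listing the captured runs before opening TensorBoard, using only the standard library; the function name `list_profile_runs` is hypothetical, and the directory layout is inferred from the log output above rather than from any documented guarantee.

```python
from pathlib import Path

def list_profile_runs(logdir):
    """Return the names of the profile run directories under `logdir`, sorted.

    Assumes the layout <logdir>/plugins/profile/<timestamp>/ seen in the logs.
    """
    profile_root = Path(logdir) / "plugins" / "profile"
    if not profile_root.is_dir():
        return []
    # Zero-padded timestamps sort chronologically when sorted as strings.
    return sorted(p.name for p in profile_root.iterdir() if p.is_dir())

for run in list_profile_runs("./tb_log"):
    print(run)
```

After both training runs, this should print two entries, one per profiling session; these are the runs to compare in the `PROFILE` view.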