Harness the benefits of TensorFlow Serving—a flexible and high performance serving system—with IBM Z Accelerated for TensorFlow Serving to help deploy ML models in production.

Using the IBM Z Accelerated Serving for TensorFlow Container Image

Overview

TensorFlow Serving is an open source, high-performance serving system that handles the inference aspect of machine learning.

On IBM® z16™ and later (running Linux on IBM Z or IBM® z/OS® Container Extensions (IBM zCX)), TensorFlow core Graph Execution will leverage new inference acceleration capabilities that transparently target the IBM Integrated Accelerator for AI through the IBM z Deep Neural Network (zDNN) library. The IBM zDNN library contains a set of primitives that support Deep Neural Networks. These primitives transparently target the IBM Integrated Accelerator for AI on IBM z16 and later. No changes to the original model are needed to take advantage of the new inference acceleration capabilities.

Note. When using IBM Z Accelerated Serving for TensorFlow on either an IBM z15® or an IBM z14®, TensorFlow will transparently target the CPU with no changes to the model.

Downloading the IBM Z Accelerated Serving for TensorFlow container image

Downloading the IBM Z Accelerated Serving for TensorFlow container image requires credentials for the IBM Z and LinuxONE Container Registry, icr.io.

Documentation on obtaining credentials to icr.io is located here.


Once credentials to icr.io are obtained and have been used to login to the registry, you may pull (download) the IBM Z Accelerated Serving for TensorFlow container image with the following code block:

# Replace X.X.X with the desired version to pull, or remove to fetch the latest.
docker pull icr.io/ibmz/ibmz-accelerated-serving-for-tensorflow:X.X.X

In the docker pull command illustrated above, the version specified is X.X.X, based on the versions available in the IBM Z and LinuxONE Container Registry. Release notes about a particular version can be found in this GitHub Repository under releases here.


To remove the IBM Z Accelerated Serving for TensorFlow container image, please follow the commands in the code block:

# Find the Image ID from the image listing
docker images

# Remove the image
docker rmi <IMAGE ID>

Note. This documentation refers to image/containerization commands in terms of Docker. If you are utilizing Podman, please replace docker with podman when using our example code snippets.

Container Image Contents

To view a brief overview of the operating system version, software versions and content installed in the container, as well as any release notes for each released container image version, please visit the releases section of this GitHub Repository, or you can click here.

TensorFlow Serving Usage

For documentation on serving models with TensorFlow Serving please visit the official Open Source TensorFlow Serving documentation.

For brief examples on deploying models with TensorFlow Serving, please visit our samples section.
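Once a model is being served, it can be queried over TensorFlow Serving's REST API. The sketch below builds a predict request with only the Python standard library; the model name my_model, the sample input, and the host/port are assumptions (8501 is TensorFlow Serving's default REST port), so adjust them to your deployment.

```python
import json

# Hypothetical model name and input row; replace with your deployed model's.
model_name = "my_model"
instances = [[1.0, 2.0, 3.0, 4.0]]

# TensorFlow Serving's REST predict endpoint expects a JSON body with an
# "instances" list, one entry per input example.
payload = json.dumps({"instances": instances})
url = f"http://localhost:8501/v1/models/{model_name}:predict"

print(url)
print(payload)

# To actually send the request (requires a running server):
# import urllib.request
# req = urllib.request.Request(
#     url, data=payload.encode(), headers={"Content-Type": "application/json"})
# predictions = json.loads(urllib.request.urlopen(req).read())["predictions"]
```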

A Look into the Acceleration

The acceleration is enabled through Custom Ops, Kernels, and a Graph Optimizer that get registered within TensorFlow.

  • The registered Ops are custom versions of built-in TensorFlow Ops.

    • They will only support float and half data types.
  • The registered Kernels will perform the computation for the Custom Ops.

    • There will be one or more Kernels registered for each Custom Op, depending on the data type(s) of the input(s) and/or output(s).
    • Many Kernels will call one or more zDNN functions to perform computation using the accelerator.
    • Some Kernels will run custom logic to perform non-computational procedures such as transposing or broadcasting.
  • The registered Graph Optimizer will check a TensorFlow Graph for Ops with valid input(s) and/or output(s) and remap them to the Custom Ops.

    • Only Ops with valid input(s) and/or output(s) will be remapped.
    • Some Ops with valid input(s) and/or output(s) may still not be remapped if their overhead is likely to outweigh any cost savings.
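The remapping decision described above can be sketched as a small predicate. This is purely illustrative logic mirroring the documented rules (float/half support only, and no remap when overhead likely outweighs savings); the function name, dtype strings, and overhead flag are hypothetical, not part of the container's API.

```python
# Custom Ops only support float and half data types.
SUPPORTED_DTYPES = {"float32", "float16"}

def should_remap(op_dtype, overhead_exceeds_savings=False):
    # An Op is remapped to its Custom Op only if its data type is supported
    # and the remap is expected to pay off.
    return op_dtype in SUPPORTED_DTYPES and not overhead_exceeds_savings
```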

Tensor

Kernels will receive input(s) and/or output(s) in the form of Tensor objects.

  • TensorFlow's internal Tensor objects manage the shape, data type, and a pointer to the data buffer.
  • More info can be found here

Graph Mode Requirement

Custom Kernels will only be used when the Graph Optimizer is utilized. This happens whenever TensorFlow is operating within a tf.function.

  • TensorFlow's built-in Keras module, used for creating a Model, will use a tf.function within many of the Model's internal functions, including the predict function.
  • A normal function can be used as a tf.function by:
    • Passing it to a tf.function call, or
    • Adding a @tf.function decorator above the function.
  • More information can be found here.

Eigen Fallback

During the Graph Optimizer pass, input(s) and/or output(s) are checked to ensure they are the correct shape and data type.

  • This is done before computation and is designed to work with variable batch sizes.
  • This can result in shapes for input(s) and/or output(s) being partially-defined, denoting undefined dimensions with negative numbers.
    • This means it is not always possible to determine if the input(s) and/or output(s) will be a valid shape for all batch sizes.

Due to this, all Custom Ops will check the shape of the input(s) and/or output(s) before performing computation.

  • If all shapes are valid, the custom logic is used.
  • If any shape is invalid, the default Eigen logic is used.
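The runtime check described above can be sketched as follows. This is an illustrative decision function, not the container's actual code: at run time every dimension is concrete, so the kernel validates shapes and falls back when any dimension is invalid. The per-dimension limit used here is a hypothetical placeholder for whatever the accelerator supports.

```python
MAX_DIM = 1 << 15  # hypothetical per-dimension limit for the accelerator

def pick_path(shape):
    # All dimensions must be positive and within the accelerator's limits;
    # otherwise the default Eigen logic is used.
    if all(0 < d <= MAX_DIM for d in shape):
        return "custom"  # accelerated, zDNN-backed logic
    return "eigen"       # default fallback
```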

NNPA Instruction Set Requirement

Before the Graph Optimizer is registered, a call to zdnn_is_nnpa_installed is made to ensure the NNPA instruction set for the accelerator is installed.

  • If this call returns false, the Graph Optimizer is not registered and runtime should proceed the same way TensorFlow would without the acceleration benefits.

Environment Variables for Logging

Certain environment variables can be set before execution to enable/disable features or logs.

  • ZDNN_ENABLE_PRECHECK: true

    • If set to true, zDNN will print logging information before running any computational operation.
    • Example: export ZDNN_ENABLE_PRECHECK=true.
      • Enable zDNN logging.
  • TF_CPP_MIN_LOG_LEVEL: integer

    • If set to any number >= 0, logging at that level and higher (up until TF_CPP_MAX_VLOG_LEVEL) will be enabled.
    • If TF_CPP_MAX_VLOG_LEVEL is not set, only logs exactly at TF_CPP_MIN_LOG_LEVEL will be enabled.
    • Example: export TF_CPP_MIN_LOG_LEVEL=0.
      • Logs at level 0 will be enabled.
  • TF_CPP_MAX_VLOG_LEVEL: integer

    • If set to any number >= 0, logging at that level and lower (down until TF_CPP_MIN_LOG_LEVEL) will be enabled.
    • Requires TF_CPP_MIN_LOG_LEVEL is set to a number >= 0.
    • Example: export TF_CPP_MIN_LOG_LEVEL=0 TF_CPP_MAX_VLOG_LEVEL=1.
      • Logs at levels 0 and 1 will be enabled.
  • TF_CPP_VMODULE: 'file=level' | 'file1=level1,file2=level2'

    • Enables logging at level 'level' and lower (down until TF_CPP_MIN_LOG_LEVEL) for any file named 'file'.
      • Extensions for file name are ignored so 'file.h', 'file.cc', and 'file.cpp' would all have logging enabled.
    • Requires TF_CPP_MIN_LOG_LEVEL is set to a number >= 0.
    • Example: export TF_CPP_MIN_LOG_LEVEL=0 TF_CPP_VMODULE='remapper=2,cwise_ops=1'.
      • Logs at level 0 will be enabled.
      • Logs for files named remapper.* will be enabled at levels 0, 1, and 2.
      • Logs for files named cwise_ops.* will be enabled at levels 0 and 1.
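When the serving process runs inside a container, these variables must be set inside the container, typically via docker run -e flags. The helper below simply flattens a dictionary of the values from this section into that flag form; the dictionary name and the specific values are just the examples above.

```python
# Values mirror the examples in this section.
log_env = {
    "ZDNN_ENABLE_PRECHECK": "true",
    "TF_CPP_MIN_LOG_LEVEL": "0",
    "TF_CPP_MAX_VLOG_LEVEL": "1",
}

# Flatten into `docker run -e NAME=VALUE` arguments so the variables are set
# inside the container before TensorFlow Serving starts.
env_flags = [arg for name, value in sorted(log_env.items())
             for arg in ("-e", f"{name}={value}")]
print(" ".join(env_flags))
```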

Security and Deployment Guidelines

  • For security and deployment best practices, please visit the common AI Toolkit documentation found here.

Execution on the Integrated Accelerator for AI and on CPU

Execution Paths

The IBM Z Accelerated Serving for TensorFlow container image follows IBM's train anywhere and deploy on IBM Z strategy.

By default, when using the IBM Z Accelerated Serving for TensorFlow container image on an IBM z16 and later system, TensorFlow core will transparently target the Integrated Accelerator for AI for a number of compute-intensive operations during inferencing with no changes to the model.

When using IBM Z Accelerated Serving for TensorFlow on either an IBM z15 or an IBM z14, TensorFlow will transparently target the CPU with no changes to the model.

To modify the default execution path, you may change the environment variable, NNPA_DEVICES, before the application calls any TensorFlow API:

  • NNPA_DEVICES: 0 | false
    • If set to '0' or 'false', the Graph Optimizer will not be registered and runtime will proceed the same way TensorFlow would without the acceleration benefits. If NNPA_DEVICES is unset the IBM Integrated Accelerator for AI is targeted by default.
    • Example: export NNPA_DEVICES=0.
      • Graph Optimizer will not be registered.
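The NNPA_DEVICES behavior described above can be expressed as a small predicate. This helper is purely illustrative (it is not an API of the container image); it just mirrors the documented rule that only the values '0' and 'false' disable registration of the Graph Optimizer.

```python
import os

def accelerator_enabled(env=None):
    # The Graph Optimizer is skipped only when NNPA_DEVICES is set to '0'
    # or 'false'; when unset, the IBM Integrated Accelerator for AI is
    # targeted by default.
    env = os.environ if env is None else env
    return env.get("NNPA_DEVICES") not in ("0", "false")
```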

Eager Mode vs Graph Mode

The two primary methods of performing computations with TensorFlow are Eager Execution and Graph Execution.

  • Eager Execution performs computations immediately as they are received (operation by operation); the resulting values are then returned.

  • Graph Execution encapsulates the computations within a graph (tf.graph). Each node in the graph is an operation (tf.Operation). Each edge that connects one node to another is a tensor (tf.Tensor) that represents the flow of data between operations. Graph Execution performs these computations at a later point within a TensorFlow Session. To instruct TensorFlow to run in Graph Mode, leverage tf.function either as a direct call or as a decorator. tf.function is a Python callable that builds TensorFlow graphs from a Python function.

As mentioned in an earlier section, there is a Graph Mode Requirement to take advantage of the acceleration capabilities. This means that in order to leverage the IBM Integrated Accelerator for AI, TensorFlow must be used through tf.function.

Model Validation

Various models that were trained on x86 or IBM Z have demonstrated focused optimizations that transparently target the IBM Integrated Accelerator for AI for a number of compute-intensive operations during inferencing.

Models that we expect (based on internal research) to demonstrate the optimization illustrated in this document can be found here.

Note. Models that were trained outside of the TensorFlow ecosystem may encounter endianness issues.

Using the Code Samples

Documentation for our code samples can be found here.

Frequently Asked Questions

Q: Where can I get the IBM Z Accelerated Serving for TensorFlow container image?

Please visit this link here, or read the section titled Downloading the IBM Z Accelerated Serving for TensorFlow container image.

Q: Why are there multiple TensorFlow container images in the IBM Z and LinuxONE Container Registry?

You may have seen multiple TensorFlow Serving container images in IBM Z and LinuxONE Container Registry, namely ibmz/tensorflow-serving and ibmz/ibmz-accelerated-serving-for-tensorflow.

The ibmz/tensorflow-serving container image does not have support for the IBM Integrated Accelerator for AI. The ibmz/tensorflow-serving container image only transparently targets the CPU. It does not have any optimizations referenced in this document.

The ibmz/ibmz-accelerated-serving-for-tensorflow container image includes support for TensorFlow core Graph Execution to transparently target the IBM Integrated Accelerator for AI. The ibmz/ibmz-accelerated-serving-for-tensorflow container image also still allows its users to transparently target the CPU. This container image contains the optimizations referenced in this document.

Q: Where can I run the IBM Z Accelerated Serving for TensorFlow container image?

You may run the IBM Z Accelerated Serving for TensorFlow container image on IBM Linux on Z or IBM® z/OS® Container Extensions (IBM zCX).

Note. The IBM Z Accelerated Serving for TensorFlow container image will transparently target the IBM Integrated Accelerator for AI on IBM z16 and later. However, if using the IBM Z Accelerated Serving for TensorFlow container image on either an IBM z15 or an IBM z14, TensorFlow will transparently target the CPU with no changes to the model.

Q: Can I install a newer or older version of TensorFlow Serving in the container?

No. Installing a newer or older version of TensorFlow Serving than what is configured in the container will not target the IBM Integrated Accelerator for AI. Additionally, installing a newer or older version of TensorFlow Serving, or modifying the existing TensorFlow Serving installation in the container image, may have unintended, unsupported consequences.

Technical Support

Information regarding technical support can be found here.

Versioning Policy and Release Cadence

IBM Z Accelerated Serving for TensorFlow will follow the semantic versioning guidelines with a few deviations. Overall, IBM Z Accelerated Serving for TensorFlow follows a continuous release model with a cadence of 1-2 minor releases per year. In general, bug fixes will be applied to the next minor release and not backported to prior major or minor releases. Major version changes are not frequent and may include features supporting new zSystems hardware as well as major feature changes in TensorFlow Serving that are not likely backward compatible. Please refer to TensorFlow Serving guidelines for backwards compatibility across different versions of TensorFlow Serving.

IBM Z Accelerated Serving for TensorFlow Versions

Each release version of IBM Z Accelerated Serving for TensorFlow has the form MAJOR.MINOR.PATCH. For example, IBM Z Accelerated Serving for TensorFlow version 1.2.3 has MAJOR version 1, MINOR version 2, and PATCH version 3. Changes to each number have the following meaning:

MAJOR / VERSION

All releases with the same major version number will have API compatibility. Major version numbers will remain stable; for instance, 1.X.Y may last a year or more. A new major version will potentially have backwards-incompatible changes: code and data that worked with a previous major release will not necessarily work with the new release.

MINOR / FEATURE

Minor releases will typically contain new backward compatible features, improvements, and bug fixes.

PATCH / MAINTENANCE

Maintenance releases will occur more frequently and depend on specific patches introduced (e.g. bug fixes) and their urgency. In general, these releases are designed to patch bugs.
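The MAJOR.MINOR.PATCH scheme above can be sketched with a small helper. The function names below are illustrative, not part of the product; the compatibility rule simply restates the policy that releases sharing a MAJOR version keep API compatibility.

```python
def parse_version(version):
    # Split a MAJOR.MINOR.PATCH string such as "1.2.3" into integers.
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def api_compatible(a, b):
    # Releases with the same MAJOR version number have API compatibility.
    return parse_version(a)[0] == parse_version(b)[0]
```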

Release cadence

Feature releases for IBM Z Accelerated Serving for TensorFlow occur about every 6 months in general. Hence, IBM Z Accelerated Serving for TensorFlow 1.3.0 would generally be released about 6 months after 1.2.0. Maintenance releases happen as needed in between feature releases. Major releases do not happen according to a fixed schedule.

Licenses

The International License Agreement for Non-Warranted Programs (ILAN) agreement can be found here.

The registered trademark Linux® is used pursuant to a sublicense from the Linux Foundation, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis.

TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.

Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc. in the United States and/or other countries. Docker, Inc. and other parties may also have trademark rights in other terms used herein.

IBM, the IBM logo, and ibm.com, IBM z16, IBM z15, IBM z14 are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. The current list of IBM trademarks can be found here.
