General Best Practices for Intel® Optimization for TensorFlow
TensorFlow* is one of the most popular deep learning frameworks for large-scale machine learning (ML) and deep learning (DL). Since 2016, Intel and Google engineers have been working together to optimize TensorFlow performance for deep learning training and inference on Intel® Xeon® processors using the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). The Intel MKL-DNN primitives library includes convolution, normalization, activation, and other primitives. Please see the install guide for how to set up Intel® Optimization for TensorFlow on your system for accelerated TensorFlow execution on CPU platforms with no code changes.
- Throughput measures how many input tensors can be processed per second with batches of size greater than one. Typically for maximum throughput, optimal performance is achieved by exercising all the physical cores on a socket.
- Latency (also called real-time inference) is a measurement of the time it takes to process a single input tensor, i.e. a batch of size one. In a real-time inference scenario, optimal latency is achieved by minimizing thread launching and orchestration between concurrent processes.
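As a rough illustration of the two metrics defined above, the following sketch times a callable at batch size one (latency) and at a larger batch size (throughput). The names `measure` and `fake_model` are hypothetical and not part of TensorFlow; `fake_model` only sleeps and stands in for a real inference call.

```python
import time


def measure(run_batch, batch_size, iterations=20, warmup=3):
    """Time a callable; return (seconds per batch, inputs per second)."""
    # Warm-up runs are excluded so one-time setup cost does not skew results.
    for _ in range(warmup):
        run_batch(batch_size)
    start = time.perf_counter()
    for _ in range(iterations):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    seconds_per_batch = elapsed / iterations
    inputs_per_second = batch_size * iterations / elapsed
    return seconds_per_batch, inputs_per_second


def fake_model(batch_size):
    # Hypothetical stand-in for a real inference call (assumption).
    time.sleep(0.001 + 0.0001 * batch_size)


# Latency: time to process a single input (batch of size one).
latency, _ = measure(fake_model, batch_size=1)
# Throughput: inputs processed per second with a batch larger than one.
_, throughput = measure(fake_model, batch_size=32)
print(f"latency: {latency * 1000:.2f} ms, throughput: {throughput:.0f} inputs/s")
```

To measure a real model, replace `fake_model` with a function that runs one inference call on a batch of the given size.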
This guide will help you set your TensorFlow runtime options for good, balanced performance across both metrics. However, if you want to prioritize one metric over the other or further tune TensorFlow for your specific model, please see the tutorials. A link to these can be found in the Model Zoo docs readme.
TensorFlow Configuration Settings
These are the parameters you need to set when running TensorFlow with Intel MKL-DNN. A more complete description of these settings can be found in the performance considerations article.
TensorFlow Runtime Settings
inter_op_parallelism_threads is the number of thread pools to use for a TensorFlow session. A good guideline we have found empirically is to set this to 2. You may want to start with this suggestion and then try other values as well.
intra_op_parallelism_threads is the number of threads in each thread pool to use for a TensorFlow session. This should be set to the number of physical cores, which may be different from the number of logical cores or CPUs and can be found in Linux with the lscpu command.
Data Format specifies the way data is stored and accessed in memory. We recommend using channels-first (NCHW) format. Please see the data format section of performance doc for more information.
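The runtime settings above could be collected as in the following sketch. The helper names `recommended_runtime_settings` and `make_session_config` are hypothetical, and the core count of 16 is only an example. The sketch assumes a TensorFlow build that exposes `tf.compat.v1.ConfigProto`; TensorFlow is imported lazily, so the first helper runs without it.

```python
def recommended_runtime_settings(physical_cores):
    """Collect the starting-point values suggested in this guide."""
    return {
        # Empirical starting point from this guide; try other values too.
        "inter_op_parallelism_threads": 2,
        # One thread per physical core (not per logical CPU).
        "intra_op_parallelism_threads": physical_cores,
        # Channels-first layout; this is passed to layers and ops,
        # not to the session config.
        "data_format": "NCHW",
    }


def make_session_config(settings):
    """Build a session config (assumes TensorFlow is installed)."""
    import tensorflow as tf  # imported lazily on purpose

    return tf.compat.v1.ConfigProto(
        inter_op_parallelism_threads=settings["inter_op_parallelism_threads"],
        intra_op_parallelism_threads=settings["intra_op_parallelism_threads"],
    )


# Example value only: 16 physical cores.
settings = recommended_runtime_settings(physical_cores=16)
print(settings)
```

The data format is not a session option; it is typically given per operation or layer, for example through the `data_format` argument of `tf.nn.conv2d`.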
- OMP_NUM_THREADS is the maximum number of threads available for the OpenMP runtime. A good guideline is to set it equal to the number of physical cores.
- KMP_BLOCKTIME This environment variable sets the time, in milliseconds, that a thread should wait after completing the execution of a parallel region before sleeping. The default value is 200ms. A small KMP_BLOCKTIME value may offer better overall performance if the application contains non-OpenMP threaded code. A larger KMP_BLOCKTIME value may be more appropriate if threads are to be reserved solely for OpenMP execution. We suggest setting it to 0 for CNN-based models. KMP_BLOCKTIME=1 is a good place to start for non-CNN topologies.
- KMP_AFFINITY controls how threads are distributed and ultimately bound to specific processing units. KMP_AFFINITY=granularity=fine,verbose,compact,1,0 is recommended when hyperthreading is enabled, and KMP_AFFINITY=granularity=fine,verbose,compact is recommended when hyperthreading is disabled.
- cpu_node_bind/membind It is recommended to confine a TensorFlow session to a single NUMA node. If NUMA is enabled on your system, use numactl --cpunodebind=0 --membind=0 python to call your TensorFlow script.
- Concurrent Execution You can fully exercise your hardware by sharding your data and launching multiple executions concurrently, each bound to a different NUMA node. To do this, use the & operator to launch non-blocking execution in the shell: numactl --cpunodebind=0 --membind=0 python & numactl --cpunodebind=1 --membind=1 python
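The environment variables and NUMA bindings listed above could be assembled from Python as in this sketch. The helper names `mkl_dnn_env`, `numactl_command`, and `launch_per_node`, and the script name `train.py`, are hypothetical. The launcher assumes `numactl` is installed and is defined here but not called.

```python
import os
import subprocess


def mkl_dnn_env(physical_cores, cnn_model=True, hyperthreading=True):
    """Return the OpenMP-related variables suggested in this guide."""
    affinity = "granularity=fine,verbose,compact"
    if hyperthreading:
        affinity += ",1,0"
    return {
        "OMP_NUM_THREADS": str(physical_cores),
        # 0 for CNN-based models, 1 as a starting point otherwise.
        "KMP_BLOCKTIME": "0" if cnn_model else "1",
        "KMP_AFFINITY": affinity,
    }


def numactl_command(node, script):
    """Build a command that confines a script to one NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}",
            "python", script]


def launch_per_node(script, nodes, env):
    """Start one non-blocking process per NUMA node (needs numactl)."""
    child_env = {**os.environ, **env}
    return [subprocess.Popen(numactl_command(n, script), env=child_env)
            for n in nodes]


# Example values only: 16 physical cores, CNN model, hyperthreading on.
env = mkl_dnn_env(16)
print(env)
print(numactl_command(0, "train.py"))  # "train.py" is a placeholder name
```

Passing the variables through the child process environment means they are already set when the OpenMP runtime initializes in the launched script.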
To help choose these settings, you can execute the lscpu command in Linux to find important information about your system:
    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                32
    On-line CPU(s) list:   0-31
    Thread(s) per core:    2
    Core(s) per socket:    16
    Socket(s):             1
    NUMA node(s):          1
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 85
    Model name:            Intel(R) Xeon(R) CPU @ 2.00GHz
    Stepping:              3
    CPU MHz:               2000.168
    BogoMIPS:              4000.33
    Hypervisor vendor:     KVM
    Virtualization type:   full
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              56320K
    NUMA node0 CPU(s):     0-31
    Flags:                 fpu vme de pse tsc ...
To get the number of physical cores, simply multiply Core(s) per socket by Socket(s).
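The number of physical cores is Core(s) per socket multiplied by Socket(s). As a minimal sketch, that calculation can be done on lscpu output as follows. The function name `physical_cores` is hypothetical, and the sample text is an excerpt of the output shown above rather than a live call to lscpu.

```python
def physical_cores(lscpu_text):
    """Compute physical cores from the text printed by lscpu."""
    fields = {}
    for line in lscpu_text.splitlines():
        # Split on the first colon only; values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return int(fields["Core(s) per socket"]) * int(fields["Socket(s)"])


# Excerpt of the lscpu output shown in this guide.
sample = """\
CPU(s):                32
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
"""

cores = physical_cores(sample)
print(cores)  # 16 physical cores, although lscpu reports 32 logical CPUs
```

On a live system, the same function could be given the captured output of the lscpu command instead of the sample string.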
For usage specifics and command line examples, see one of the hands-on tutorials for an advanced walkthrough of your use case.