# Auto-tuning TensorFlow Threading Model for CPU Backend

Niranjan Hasabnis
Intel Corporation
niranjan.hasabnis@intel.com

Abstract—

TensorFlow\* is a popular deep learning framework used by data scientists to solve a wide range of machine learning and deep learning problems such as image classification and speech recognition. It also operates at large scale and in heterogeneous environments: it allows users to train neural network models or deploy them for inference using GPUs, CPUs (e.g., Intel® Xeon®† CPUs), and custom-designed deep-learning hardware such as TPUs. Even though TensorFlow supports a variety of optimized backends, realizing the best performance from a backend may require additional effort. For instance, getting the best performance from a CPU backend requires careful tuning of its threading model. Unfortunately, the best tuning approach used today is manual, tedious, and time-consuming and, more importantly, may not guarantee the best performance.

In this paper, we develop an automatic approach, called TENSORTUNER, to search for optimal parameter settings of TensorFlow's threading model for CPU backends. We evaluate TENSORTUNER on both the Eigen and Intel's MKL CPU backends using a set of neural networks from TensorFlow's benchmarking suite. Our evaluation results demonstrate that the parameter settings found by TENSORTUNER produce 2% to 123% performance improvement for the Eigen CPU backend and 1.5% to 28% performance improvement for the MKL CPU backend over the performance obtained using their best-known parameter settings. This shows that the default parameter settings of the Eigen CPU backend are not ideal, and that even the settings of a carefully hand-tuned MKL backend are sub-optimal. Our evaluation also shows that TENSORTUNER is efficient at finding the optimal settings: it converges to them quickly by pruning more than 90% of the parameter search space.

# I. INTRODUCTION

Machine learning has seen phenomenal growth in the last few years and is being applied to a variety of problems in different fields. This success can be attributed to the availability of large datasets and easy access to increasingly powerful computational resources (thanks to Moore's Law) for processing these datasets.

Recognizing the growth in machine learning and deep learning, Google developed the TensorFlow framework [14], [17] and open-sourced it in 2015. TensorFlow has since become a very popular framework for developing deep learning models, with applications to image recognition, speech recognition, language translation, NLP, etc. In addition to supporting various types of models, TensorFlow also supports both large-scale training and inference: models can be trained on a large distributed cluster of different devices such as CPUs, GPUs, and TPUs, and inference can be done on a device as small as a mobile phone [18].

TensorFlow uses the Eigen [11] template library as the default implementation of its CPU backend. Realizing that the default implementation does not deliver the best training and inference performance on Intel® Xeon® and Xeon Phi™ platforms, Intel open-sourced an alternative implementation [17] using its Math Kernel Library for Deep Neural Networks (MKL-DNN) [6]. Intel's implementation delivers up to 70x gains over the Eigen CPU backend [5], [12].

Unfortunately, realizing the best performance even from a highly optimized backend like Intel's MKL requires additional effort. To be precise, TensorFlow represents a neural network as a data-flow graph and allows users to exploit graph-level parallelism through a configurable threading model. Concretely, TensorFlow's threading model contains the following three parameters: (1) inter\_op\_parallelism\_threads, which dictates the maximum number of graph nodes that can be executed in parallel; (2) intra\_op\_parallelism\_threads, which dictates the maximum number of threads that can be used to execute a single graph node; and (3) OMP\_NUM\_THREADS, which applies to the MKL backend only and dictates the maximum number of threads used to execute a graph node of type MKL. The training or inference performance of a neural network therefore depends on setting these parameters correctly; incorrect settings will not deliver the best possible performance from a CPU backend.

<sup>\*</sup>Other names and brands may be claimed as the property of others.

<sup>†</sup>Intel, the Intel logo, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. © Intel Corporation.

The threading model parameters can be treated as hyper-parameters of a neural network, and one can apply well-researched hyper-parameter tuning techniques [3], [24], [26] to get the best performance. Unfortunately, the tuning techniques used in TensorFlow's CPU backends are not that advanced: the default approach used by both the Eigen and MKL CPU backends is to set inter\_op\_parallelism\_threads to a fixed number such as 2 or 4 and to set the remaining two parameters to the number of available CPU cores (queried at runtime). Such an approach cannot be expected to deliver the best performance because it is agnostic to the neural network. It is possible, nevertheless, to override these defaults by setting the parameters explicitly. Both TensorFlow's publicly available performance guide [16] and Intel's blogs on TensorFlow optimizations [8] offer general advice on selecting parameter values (Intel's blogs also offer specific parameter values for a set of popular neural networks such as ResNet50). Unfortunately, as we will see in the Evaluation section, such general advice may not deliver the best performance. The last resort of exhaustively exploring the parameter search space does not scale either, since the number of settings grows exponentially with the number of parameters (it may work for small search spaces, though). An automated approach that efficiently finds the parameter settings that deliver the best performance is therefore highly desirable.

In this paper, we propose an automatic approach, called TENSORTUNER, to efficiently search for optimal parameter settings of TensorFlow's threading model for CPU backends. We formulate the problem as a *black-box function optimization problem* and solve it using the Nelder-Mead Simplex [22] gradient-free optimization algorithm. We use a set of neural networks from TensorFlow's benchmarking suite and evaluate our approach using two criteria: (1) tuning quality measures the performance of the neural network with the parameter settings suggested by TENSORTUNER, and (2) tuning efficiency measures the ability of TENSORTUNER to find the optimal settings quickly. Our results demonstrate that the optimal settings found by TENSORTUNER deliver 1.5% to 123% better performance with the Eigen and MKL CPU backends. Moreover, they demonstrate that TENSORTUNER is efficient in finding the optimal settings and can prune the parameter search space by as much as 90%.

Contributions. This paper makes the following contributions:

- To the best of our knowledge, ours is the first effort to solve the problem of tuning TensorFlow's threading model for CPU backends.
- Our evaluations demonstrate that TENSORTUNER can find optimal parameter settings for the MKL and Eigen CPU backends efficiently (10X more efficiently than an exhaustive evaluation). Moreover, the optimal parameter settings found by TENSORTUNER deliver 1.5% to 123% improvement in performance over the best-known parameter settings for a set of neural network models from TensorFlow's benchmark suite.
- More importantly, our evaluations demonstrate that the approach proposed by TENSORTUNER is considerably more efficient and qualitatively better than the current approach of manually tuning TensorFlow's CPU backend.

Paper organization. This paper is organized as follows. Section II covers the background necessary to understand the technique. Section III formulates the problem of tuning TensorFlow's CPU backend and describes the design and implementation of TENSORTUNER. Evaluation results for both the MKL and Eigen CPU backends are presented in Section IV. Section V covers related work. Lastly, Section VI briefly mentions future work, and Section VII concludes the paper.

# II. BACKGROUND

The background provided here is not meant to be a comprehensive description of TensorFlow; readers may refer to references such as [14], [17] for a detailed description. Rather, it is a short summary of the TensorFlow concepts that are necessary to understand the optimizations discussed in the next section.

```
import tensorflow as tf  # TensorFlow 1.x API

# Two variables: a vector of 100 zeros and a vector of 100 ones.
a = tf.Variable(tf.zeros([100]))
b = tf.Variable(tf.ones([100]))
c = a + b
d = a - b
e = c * d

s = tf.Session()
# Variables must be initialized before the graph is run.
s.run(tf.global_variables_initializer())
result = s.run([e])
```

Fig. 1: TensorFlow Python code for e = (a + b) \* (a - b)

Fig. 2: Dataflow graph for e = (a + b) \* (a - b)

a) TensorFlow execution model: TensorFlow provides a simple dataflow-based programming abstraction by offering a high-level scripting interface (typically Python) that allows users (typically data scientists) to construct dataflow graphs and experiment with them easily. Figure 1 shows example Python code that constructs and executes a TensorFlow graph (shown in Figure 2) for the expression e = (a + b) \* (a - b). Here, a and b are TensorFlow variables, initialized to vectors containing one hundred zeros and one hundred ones, respectively. Rectangular graph nodes represent individual mathematical operators (called *operations*) such as matrix multiplication and convolution, and the edges between the nodes represent the dataflow between them. To be precise, edges carry multidimensional arrays called tensors. After the graph is constructed, it is executed using TensorFlow's Session APIs.

```
# Allow up to 2 operators to run in parallel, each using up to 8 threads.
session_config = tf.ConfigProto(
  inter_op_parallelism_threads=2,
  intra_op_parallelism_threads=8)
s = tf.Session(config=session_config)
s.run(tf.global_variables_initializer())
result = s.run([e])
```

Fig. 3: TensorFlow Python code for setting threading model

Execution of a neural network model is mapped to executing the operators in the dataflow graph while respecting their constraints. Constraints consist of simple dataflow constraints (the inputs of an operation need to be available before it can execute) and control-flow constraints (a user may explicitly add a control-flow constraint between operations to enforce sequential execution). When an operator is ready for execution, the TensorFlow runtime calls the kernel for that operator on the device to which it has assigned the operator. A kernel is a function implementing the operator. A single operation may have multiple registered kernels with specialized implementations for a particular device or data type. TensorFlow's CPU backend uses the open-source Eigen [11] library to implement CPU kernels for almost all of the TensorFlow operators.

b) Parallel execution: Given how TensorFlow's dataflow graphs are executed, it is easy to see that the dataflow graph shown in Figure 2 can execute the operators + and - in parallel. Such parallel execution enables users to exploit their hardware-level parallelism to its full potential. Serial execution of + and -, on the other hand, may waste the CPU's computation power by keeping part of the CPU idle. TensorFlow offers a threading model with two parameters to control the parallel execution of its dataflow graph: (1) inter\_op\_parallelism\_threads specifies the maximum number of operators that users want to execute in parallel, and (2) intra\_op\_parallelism\_threads specifies the maximum number of threads that users want to use to execute an individual operator. For the example graph shown above, setting inter\_op\_parallelism\_threads to 2 allows + and - to execute in parallel, while setting it to 1 essentially forces serial execution.
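The effect of inter\_op\_parallelism\_threads on the graph of Figure 2 can be illustrated with a small analogy in plain Python. This is only a sketch: it uses a standard-library thread pool rather than TensorFlow's scheduler, and the helper names (`add`, `sub`, `mul`, `run_graph`) are hypothetical. With `inter_op=2`, the two independent operators may run concurrently; with `inter_op=1`, they are serialized.

```python
from concurrent.futures import ThreadPoolExecutor

a = [0.0] * 100  # stands in for the variable a (one hundred zeros)
b = [1.0] * 100  # stands in for the variable b (one hundred ones)

def add(x, y):
    return [p + q for p, q in zip(x, y)]

def sub(x, y):
    return [p - q for p, q in zip(x, y)]

def mul(x, y):
    return [p * q for p, q in zip(x, y)]

def run_graph(inter_op):
    """Evaluate e = (a + b) * (a - b) with at most `inter_op` operators in flight."""
    with ThreadPoolExecutor(max_workers=inter_op) as pool:
        # + and - have no dataflow dependency on each other, so both are submitted at once.
        c_future = pool.submit(add, a, b)
        d_future = pool.submit(sub, a, b)
        # * depends on both results, so it runs only after they are available.
        return mul(c_future.result(), d_future.result())

e = run_graph(inter_op=2)
```

Both `run_graph(1)` and `run_graph(2)` compute the same values; only the scheduling of the two independent operators differs.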

TensorFlow offers high-level configuration APIs to specify the values of the threading model parameters. Figure 3 shows the Python code to set these parameters for the graph shown in Figure 2.
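As a complement to Figure 3, the following sketch collects all three threading parameters discussed in this paper in one place. It assumes the TensorFlow 1.x API used in Figures 1 and 3; the helper name `threading_settings` is hypothetical, and the values shown are placeholders rather than recommendations. Because OMP\_NUM\_THREADS is an environment variable read by the OpenMP runtime, the sketch sets it before TensorFlow would be imported.

```python
import os

def threading_settings(inter_op, intra_op, omp_num_threads=None):
    """Return (ConfigProto keyword arguments, environment variables).

    A value of 0 for inter_op or intra_op leaves the choice to TensorFlow.
    """
    if inter_op < 0 or intra_op < 0:
        raise ValueError("thread counts must be non-negative")
    config_kwargs = {
        "inter_op_parallelism_threads": inter_op,
        "intra_op_parallelism_threads": intra_op,
    }
    env = {}
    if omp_num_threads is not None:  # only meaningful for the MKL backend
        env["OMP_NUM_THREADS"] = str(omp_num_threads)
    return config_kwargs, env

config_kwargs, env = threading_settings(2, 8, omp_num_threads=8)
os.environ.update(env)  # set before importing tensorflow

# With TensorFlow 1.x installed, the settings would then be applied as:
#   import tensorflow as tf
#   s = tf.Session(config=tf.ConfigProto(**config_kwargs))
```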

# III. DESIGN AND IMPLEMENTATION

The problem of tuning TensorFlow's CPU backend to deliver maximum performance for a given neural network is an optimization problem that can be expressed as a function maximization problem.

## A. Problem formulation

Broadly speaking, performance of TensorFlow's CPU backend is a function of various types of input parameters such as

- The neural network along with its hyperparameters and the input dataset used for execution,
- The configuration of the machine used for execution in terms of the hardware devices and their configurations (micro-architecture of the CPU, cache sizes, etc), operating system version installed on the machine, the configuration of the software environment (such as TensorFlow version, Eigen version, MKL version, compiler version as well as compiler options used to build CPU backend, Python version, etc).

Given such a diverse set of parameters that define the performance of TensorFlow's CPU backend, precise formulation of the problem requires assuming that a number of these input parameters are known. To be precise, we concretize the problem by assuming that:

- The neural network along with its hyperparameters used for execution and the input dataset to the model are known,
- The configurations of the underlying hardware CPU devices are known,
- The configurations of the software environment used for running the model are precisely defined.

In other words, assuming that all of these parameters are constant, the performance of TensorFlow's CPU backend can be defined as a function:

$$s = f_C(\Sigma)$$

where

- s represents the performance "score" (e.g., images per second is a typical metric for the training throughput of convolutional neural networks),
- C represents the set of all the constant parameters discussed above,
- $\Sigma$ represents the set of threading model parameters for which we are looking for optimal settings. Concretely, $\Sigma$ can be defined as:

$$\Sigma = \{p_1, p_2, ..., p_n\}$$

where each $p_i$ is a parameter and n is the number of parameters for which we are looking for optimal settings. To be precise, in this paper, we assume that

- Σ for the MKL backend contains the parameters specified in Intel's blog (i.e., Σ = {inter\_op, intra\_op, OMP\_NUM\_THREADS}), and
- Σ for the Eigen backend contains the parameters from TensorFlow's threading model (i.e., Σ = {inter\_op, intra\_op}).

Since $\Sigma$ is the only input to our performance function, the set of all possible instantiations of $\Sigma$, denoted by $\tau$, is the parameter search space. Concretely, if $v_i$ is the setting for parameter $p_i$, then a single element of $\tau$ is a map containing the setting for every parameter of $\Sigma$; it can be represented as $\{(p_1,v_1),(p_2,v_2),...,(p_n,v_n)\}$. To keep the search space bounded, we impose strict lower and upper bounds on all the elements of $\Sigma$ (i.e., $\forall p \in \Sigma, v_p \in \{l_p,...,h_p\}$, where $l_p$ and $h_p$ are the lower and upper bounds for p).

Given this formulation, the problem of tuning TensorFlow's CPU backend for maximum performance can be defined as:

$$\text{find } t \in \tau \mid f_C(t) > f_C(t') \;\; \forall \, t' \in \tau \land t \neq t'$$
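To make the formulation concrete, the sketch below enumerates $\tau$ as the Cartesian product of the bounded parameter ranges and picks the best element by brute force. The bounds are the [lower bound, upper bound, step size] triples used for the MKL backend in the evaluation (Figure 7). The function `toy_score` is a purely hypothetical stand-in for $f_C$; in practice each score would come from running a benchmark with the given settings.

```python
from itertools import product

# [lower bound, upper bound, step size] for each parameter in Sigma (MKL backend).
BOUNDS = {
    "inter_op": (1, 4, 1),
    "intra_op": (14, 56, 7),
    "OMP_NUM_THREADS": (14, 56, 7),
}

def search_space(bounds):
    """Enumerate tau: every assignment {(p_1, v_1), ..., (p_n, v_n)}."""
    names = list(bounds)
    ranges = [range(lo, hi + 1, step) for lo, hi, step in bounds.values()]
    return [dict(zip(names, values)) for values in product(*ranges)]

def toy_score(t):
    """Hypothetical stand-in for f_C; NOT a measured performance model."""
    return (1000.0
            - (t["inter_op"] - 2) ** 2
            - (t["intra_op"] - 49) ** 2
            - (t["OMP_NUM_THREADS"] - 42) ** 2)

tau = search_space(BOUNDS)
best = max(tau, key=toy_score)  # exhaustive search over all of tau
print(len(tau), best)
```

Exhaustive search of this kind evaluates every element of $\tau$, which is what makes it expensive when each evaluation is a full benchmark run.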

## B. Design

Since we have formulated our tuning problem as a function maximization problem, we could use the typical approach of applying gradient-based optimizers [2], [9] to solve it. Gradient-based optimizers use the gradient of the objective function to determine the most promising directions along which to search. For a large number of variables, gradient-based optimizers are usually the most efficient algorithms.

Unfortunately, gradient-based optimizers are known to be inefficient at handling optimization problems with one or more of the following challenges: non-differentiable functions, mixed variables, multiple local minima, and large dimensionality. Although the objective function for our tuning problem may not have all of these issues (for example, it does not have a large number of variables), it may be non-differentiable, and it definitely has mixed variables. The key strength of gradient-free methods is their ability to solve problems that are difficult to solve using gradient-based methods. Furthermore, many of them are designed as global optimizers and are thus able to find multiple local optima while searching for the global optimum. A number of gradient-free methods, such as Simulated Annealing, the Divided Rectangles method, Genetic Algorithms, and Particle Swarm optimization, have been developed. Of these, Nelder-Mead Simplex [22] is a very simple yet efficient method, which is why we chose it for our approach. Nonetheless, a comparative analysis of these methods, and further improvements to TENSORTUNER based on it, may be possible.

a) Nelder-Mead simplex algorithm: Nelder-Mead simplex is a well-known and popular algorithm for multi-dimensional unconstrained optimization without gradients. Since this algorithm is not a contribution of this paper, we refer readers to the original paper [22] for details. Because Nelder-Mead simplex solves a function minimization problem, we convert our function maximization problem into a minimization problem by taking the reciprocal of the objective function. Specifically, our new objective function is:

$$f_C'(\Sigma) = \frac{1}{f_C(\Sigma)}$$

With the new objective function, the problem of tuning for maximum performance can be defined as:

$$\text{find } t \in \tau \mid f_C'(t) < f_C'(t') \;\; \forall \, t' \in \tau \land t \neq t'$$

Figure 4 shows the design of TENSORTUNER. Given the problem formulation, TENSORTUNER just needs to produce an assignment for all the variables of $\Sigma$ in every iteration. The Nelder-Mead simplex search strategy decides this assignment using the output of the objective function for the previous assignments. The design of our system makes it easy to plug in new search strategies. In addition to the search strategy, TENSORTUNER also accepts configurations (upper bound, lower bound, and step size) for the variables in $\Sigma$. These constraints on the search space are not strictly necessary, since the Nelder-Mead simplex algorithm also supports unconstrained search. But given that the number of CPU cores is bounded, a constrained search is likely to converge more quickly.

Fig. 4: TENSORTUNER in operation
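The tuning loop described above can be sketched in plain Python as follows. This is a simplified illustration and not the Active Harmony implementation used by TENSORTUNER: the Nelder-Mead variant shown uses only reflection, expansion, inside contraction, and shrink; continuous simplex vertices are snapped to the bounded grid (an assumption of this sketch); and `toy_score` is a hypothetical stand-in for a real benchmark run. The objective is the reciprocal $f_C' = 1/f_C$, and a cache records which settings were actually evaluated.

```python
BOUNDS = [(1, 4, 1), (14, 56, 7), (14, 56, 7)]  # inter_op, intra_op, OMP_NUM_THREADS

def snap(point):
    """Map a continuous simplex vertex to the nearest in-bounds grid setting."""
    setting = []
    for x, (lo, hi, step) in zip(point, BOUNDS):
        k = round((x - lo) / step)
        k = max(0, min(k, (hi - lo) // step))
        setting.append(lo + k * step)
    return tuple(setting)

def toy_score(setting):
    """Hypothetical stand-in for f_C; positive everywhere on the grid."""
    inter_op, intra_op, omp = setting
    return 1000.0 - 10 * (inter_op - 2) ** 2 - (intra_op - 49) ** 2 / 10 - (omp - 42) ** 2 / 10

evaluated = {}  # settings that were actually "benchmarked"

def objective(point):
    """f'_C = 1 / f_C on the snapped setting, memoized to avoid repeated runs."""
    setting = snap(point)
    if setting not in evaluated:
        evaluated[setting] = toy_score(setting)
    return 1.0 / evaluated[setting]

def nelder_mead(f, start, steps, max_iter=200, tol=1e-12):
    """Minimize f with a simplified Nelder-Mead simplex search."""
    n = len(start)
    simplex = [list(start)]
    for i in range(n):  # initial simplex: start plus one offset point per axis
        vertex = list(start)
        vertex[i] += steps[i]
        simplex.append(vertex)
    values = [f(p) for p in simplex]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if values[-1] - values[0] < tol:
            break
        worst = simplex[-1]
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        f_r = f(reflected)
        if f_r < values[0]:  # very good: try to expand further
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            f_e = f(expanded)
            simplex[-1], values[-1] = (expanded, f_e) if f_e < f_r else (reflected, f_r)
        elif f_r < values[-2]:  # better than second worst: accept reflection
            simplex[-1], values[-1] = reflected, f_r
        else:  # contract toward the centroid, or shrink toward the best vertex
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            f_c = f(contracted)
            if f_c < values[-1]:
                simplex[-1], values[-1] = contracted, f_c
            else:
                best = simplex[0]
                simplex = [best] + [[b + 0.5 * (x - b) for b, x in zip(best, p)]
                                    for p in simplex[1:]]
                values = [values[0]] + [f(p) for p in simplex[1:]]
    i_best = min(range(n + 1), key=lambda i: values[i])
    return simplex[i_best], values[i_best]

best_point, best_value = nelder_mead(objective, start=[1, 14, 14], steps=[1, 7, 7])
best_setting = snap(best_point)
print(best_setting, len(evaluated), "of 196 settings evaluated")
```

Swapping in a different search strategy only requires replacing `nelder_mead`; the `objective` wrapper, the bounds, and the evaluation cache stay the same.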

## C. Implementation

Parameter tuning is a well-known problem in the HPC domain. Given that, we use the open-source implementation of the Active Harmony [25] tool to implement TENSORTUNER. The Active Harmony project provides infrastructure and algorithms (such as Nelder-Mead Simplex) for achieving high performance in distributed, heterogeneous software systems with changing resource requirements and capacities. For our implementation, we used Active Harmony's publicly available source code and wrote a shell script on top of its tuna tool that accepts the configurations for the variables and passes them to tuna. The specific command lines used to obtain our evaluation results are listed in the Appendix (Section A).

# IV. EVALUATION

We evaluated the effectiveness of TENSORTUNER in tuning TensorFlow's default Eigen CPU backend and its MKL CPU backend. We describe our evaluation results in terms of (1) tuning quality and (2) tuning efficiency. Tuning quality is a criterion for comparing search algorithms in terms of their ability to find the global optimum (and to avoid local optima). Tuning efficiency, on the other hand, evaluates the ability of a search algorithm to converge to the global optimum quickly.

## A. Experimental setup

Before we discuss the results, we discuss the experimental setup used for the evaluation.

a) Hardware description: We conducted all of our experiments on an Intel<sup>®</sup> Xeon<sup>®</sup> Platinum 8180 processor [4], running CentOS Linux version 7.3.1611. We used GCC-6.3.0 for building TensorFlow and Python-2.7.5 for running the TensorFlow benchmarks.

b) Eigen backend: Since Eigen is the default CPU backend for TensorFlow, we used the pre-built TensorFlow-1.7 wheel for CPU<sup>‡</sup> (the latest version as of March 2018) for our experiments.

c) MKL backend: The source code of the Intel-optimized MKL CPU backend is now present in TensorFlow's GitHub repository [17]. We cloned TensorFlow's master branch at commit id 7d8ad3da99aba29319c7c9f0e62d567aa2071c21<sup>§</sup> and built it using the steps listed on the TensorFlow website [20]. We built the wheel for the MKL backend ourselves since a pre-built wheel for version 1.7 was not available from Intel's webpage [7].

d) TensorFlow benchmarks: The TensorFlow authors have open-sourced a suite of popular convolutional neural networks for benchmarking purposes [19]. The suite includes popular networks such as ResNet50, VGG11, and InceptionV3.

Figure 5 shows the neural network models from TensorFlow's benchmarking suite that we use for evaluation, along with their hyper-parameters such as batch size and data format. The configurations of batch size and data format that deliver the best performance with the MKL CPU backend are publicly available from Intel's blog [8]. The batch sizes used for the Eigen CPU backend evaluation are the same as those used for the MKL backend evaluation, while the data format is NHWC (since the Eigen CPU backend does not support the NCHW data format). The rest of the hyper-parameters for both backends are the defaults set by TensorFlow's benchmarking scripts.

Intel's blog [8] also specifies values for different environment variables that deliver the best performance on the MKL CPU backend. We used these values for our experiments and have documented them in Figure 6. Note that Intel's blog does not specify values for the VGG11 and GoogLeNet models. We obtained the values for these models from the observation that the values of these variables are fairly uniform across the other models (e.g., OMP\_NUM\_THREADS is 56 for all other models). We also confirmed with Intel that the values we used were indeed the best-known settings for VGG11 and GoogLeNet.

| Backend   | Model      | Batch size | Data format |
|-----------|------------|------------|-------------|
| MKL CPU   | ResNet-50  | 128        | NCHW        |
| MKL CPU   | Inception3 | 64         | NCHW        |
| MKL CPU   | VGG16      | 128        | NCHW        |
| MKL CPU   | VGG11      | 128        | NCHW        |
| MKL CPU   | GoogLeNet  | 96         | NCHW        |
| Eigen CPU | ResNet-50  | 128        | NHWC        |
| Eigen CPU | Inception3 | 64         | NHWC        |
| Eigen CPU | VGG16      | 128        | NHWC        |
| Eigen CPU | VGG11      | 128        | NHWC        |
| Eigen CPU | GoogLeNet  | 96         | NHWC        |

Fig. 5: Models used for evaluation

| Model      | inter_op | intra_op | OMP_NUM_THREADS | KMP_BLOCKTIME |
|------------|----------|----------|-----------------|---------------|
| ResNet-50  | 2        | 56       | 56              | 1             |
| Inception3 | 2        | 56       | 56              | 1             |
| VGG16      | 1        | 56       | 56              | 1             |
| VGG11      | 1        | 56       | 56              | 1             |
| GoogLeNet  | 2        | 56       | 56              | 1             |

Fig. 6: Environment variables used for MKL backend evaluation

Our evaluation methodology obtains the performance of TensorFlow's Eigen and MKL CPU backends with the default settings and compares it with the performance obtained using the optimum settings suggested by TENSORTUNER. We used the number of images processed per second (throughput) as the performance metric for both training and inference scenarios.

e) Configurations for parameter search space: As mentioned in the problem formulation (Section III-A), we impose strict upper and lower bounds on the parameters to restrict the search space. Figure 7 specifies the bounds used for our evaluation. These values are derived from the number of cores available on the machine that we used for evaluation (56 physical cores). One can, however, choose other ranges and step sizes depending on one's needs; it is not necessary to stick to these ranges.

| Backend | inter_op  | intra_op    | OMP_NUM_THREADS |
|---------|-----------|-------------|-----------------|
| MKL     | [1, 4, 1] | [14, 56, 7] | [14, 56, 7]     |
| Eigen   | [1, 4, 1] | [14, 56, 7] | -               |

Fig. 7: [Lower bound, upper bound, step size] for parameter search

<sup>‡</sup>pip install tensorflow==1.7

<sup>§</sup>The latest commit when we started experiments

## B. Tuning quality

Tuning quality evaluates TENSORTUNER's ability to find globally optimal parameter settings. Concretely, we evaluate tuning quality by comparing the performance score obtained with the optimal parameter settings suggested by TENSORTUNER against the score obtained using the best-known parameter settings. All the best-known parameter settings for TensorFlow's MKL CPU backend are given in Intel's recent blog [8]. For the Eigen CPU backend, TensorFlow's performance guide [16] mentions that the default parameter settings in TensorFlow's source code (for inter\_op\_parallelism\_threads and intra\_op\_parallelism\_threads) are efficient for most systems.

a) Evaluating tuning quality on TensorFlow's MKL CPU backend: Figure 8a compares the training performance of TensorFlow's MKL CPU backend tuned using the best-known parameter settings (baseline) with that obtained using the optimal parameter settings found by TENSORTUNER. Labels on the X-axis show the parameter settings suggested by TENSORTUNER, and the Y-axis shows the percentage improvement of the TENSORTUNER-suggested settings over the baseline. For all the topologies, TENSORTUNER found a setting that delivered performance close to the best-known performance of TensorFlow's MKL CPU backend. In fact, for GoogLeNet, the settings found by TENSORTUNER produced better performance (by 1.28%) than the best-known settings from Intel's blog.

Figure 8b shows the inference performance of TensorFlow's MKL CPU backend with the settings found by TENSORTUNER and compares it with the performance obtained using the best-known settings (baseline). As in the training plot, labels on the X-axis show the parameter settings. Continuing the trend from the training results, the inference results show that TENSORTUNER found settings that outperformed the best-known settings on all the topologies. To be precise, the settings found by TENSORTUNER improved performance by 1.28% to 28% (the 28% improvement was on GoogLeNet) over the best-known settings. This shows that manual tuning, if it is not systematic, can lead to sub-optimal parameter settings.

To understand how effectively TENSORTUNER converges to the global optimum, we performed an exhaustive evaluation by scanning the whole parameter search space for an InceptionV3 training run and obtaining the performance score for every parameter setting. The exhaustive search found the settings (2, 56, 49), which delivered 1.47% better performance than the optimal settings found by TENSORTUNER ((2, 49, 49), as per Figure 8a). This result highlights the ability of TENSORTUNER and the underlying Nelder-Mead Simplex algorithm to get close to the global optimum, if not to converge to it. Nonetheless, it is interesting to understand why TENSORTUNER could not find the best setting for InceptionV3, and whether the Nelder-Mead algorithm can be tuned to find it. Various settings of Nelder-Mead, such as the radius used to construct the initial simplex and the convergence criteria, could make a difference. We plan to do this analysis as part of future work.

b) Evaluating tuning quality on TensorFlow's Eigen backend: After the evaluation of TensorFlow's MKL backend, we evaluated its Eigen CPU backend for training and inference scenarios. Since TensorFlow sets the default parameter values for Eigen backend statically, this is an evaluation of the effectiveness of such a static approach and its generality.

Figure 8c compares the training performance obtained using the default settings with that obtained using the optimal settings found by TENSORTUNER. As in the earlier figures, labels on the X-axis indicate the parameter settings used to obtain the performance improvements plotted on the Y-axis. Note that the settings found by TENSORTUNER delivered 4% to 123% improvement in training performance over the performance obtained using the default settings. Figure 8d compares the inference performance for the same set of models. Even for the inference scenario, the optimal settings found by TENSORTUNER delivered 2% to 30% improvement over the performance obtained using the default settings. This clearly shows that the static approach to default parameter settings in TensorFlow's Eigen CPU backend is not ideal.

[Figure 8 has four bar-chart panels: (a) Tuning Quality of MKL Backend for Training, (b) Tuning Quality of MKL Backend for Inference, (c) Tuning Quality of Eigen Backend for Training, (d) Tuning Quality of Eigen Backend for Inference. X-axis: models with the (inter\_op, intra\_op, OMP\_NUM\_THREADS) settings (MKL panels) or (inter\_op, intra\_op) settings (Eigen panels) found by TENSORTUNER. Y-axis: % improvement over best performance.]

Fig. 8: TENSORTUNER Tuning Quality on MKL and Eigen CPU Backends

**DISCLAIMER**: Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at www.intel.com.

<sup>¶</sup>To trigger the default settings for inter\_op\_parallelism\_threads and intra\_op\_parallelism\_threads, we set both of them to 0 as the command line arguments to tf\_cnn\_benchmarks.

c) Comparison of settings found by TENSOR-TUNER with the default settings: The settings found by TENSORTUNER delivered 123% improvement in VGG11 training performance with Eigen backend over the default settings. To get more insights into this performance improvement, we collected CPU utilization over the whole duration of VGG11's

training run. The utilization was obtained for every 1 second from Linux top utility. Figure 9a shows the CPU utilization for the training run with the settings found by TENSORTUNER in black color, while the CPU utilization for the training run with the default settings is shown in orange color. Notice that the complete training run with the default settings took 751 seconds, while it took 340 seconds with the settings found by TENSORTUNER. Also notice that the peak CPU utilization for the run with the default settings is almost 11200%. This is because Intel<sup>®</sup> Xeon<sup>®</sup> Platinum 8180 processor has 28 physical cores in 1 socket and 56 physical cores with 2 sockets. Additionally, with hyperthreading enabled (and 2 threads per core), the total number of cores in a 2-socket 8180 processor is 112. The most important observation though is that the average CPU utilization during the run with the settings found by TENSORTUNER was much

DISCLAIMER: The benchmark results may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user's components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.



(a) CPU Utilization of VGG11 with Eigen CPU Backend

(b) CPU Utilization With Default Settings

(c) CPU Utilization With TENSORTUNER Found Settings

Fig. 9: Analysis of VGG11 Performance with Eigen CPU Backend

less than during the run with the default settings. This points to a thread over-subscription issue. The difference between the average CPU utilizations was also captured in the reports obtained from Intel's VTune Amplifier (Figure 9b and Figure 9c).
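As a quick check of the figures quoted above, the following sketch recomputes the utilization ceiling that top can report on the 2-socket system and the speedup implied by the two elapsed times. The variable names are illustrative. The elapsed-time ratio comes out slightly below the reported 123%, presumably because the reported figure is based on benchmark throughput rather than total wall-clock time (an assumption, not stated in the text).

```python
# Platform figures quoted in the text for a 2-socket Xeon Platinum 8180 system.
cores_per_socket = 28
sockets = 2
threads_per_core = 2  # hyperthreading enabled

physical_cores = cores_per_socket * sockets        # 56
logical_cores = physical_cores * threads_per_core  # 112

# top reports up to 100% per fully busy logical core, so the ceiling is:
peak_utilization_pct = logical_cores * 100         # 11200

# Elapsed times of the two VGG11 training runs, in seconds.
default_seconds = 751
tuned_seconds = 340
speedup = default_seconds / tuned_seconds
improvement_pct = (speedup - 1) * 100

print(physical_cores, logical_cores, peak_utilization_pct)
print(f"speedup = {speedup:.2f}x, improvement = {improvement_pct:.0f}%")
```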

To confirm the thread over-subscription issue, we repeated the training experiment with the same settings while exposing only 56 CPU cores to the TensorFlow framework. We saw that the VGG11 training performance with the default settings improved by almost 26%. In other words, the performance gap was reduced from 123% to 65%. The report obtained using Intel's VTune Amplifier also confirmed a reduction in thread spin time, which led to the performance improvement.
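One way to expose only a subset of CPUs to a process is to launch it under numactl with an explicit CPU list. The sketch below only builds such a command line. The helper name `restrict_to_cpus` and the CPU range 0-55 are assumptions made for illustration; the exact numactl options used in the experiment are not given in the text, and which CPU IDs map to distinct physical cores depends on how the operating system enumerates them.

```python
def restrict_to_cpus(command, first_cpu=0, last_cpu=55):
    """Prefix a command with numactl so it may only run on the given CPU range."""
    # --physcpubind limits execution to the listed CPUs (assumed range 0-55 here).
    return ["numactl", f"--physcpubind={first_cpu}-{last_cpu}"] + list(command)


argv = restrict_to_cpus(["python", "tf_cnn_benchmarks.py"])
print(" ".join(argv))
```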

# C. Efficiency of TENSORTUNER

We evaluate the efficiency of our approach by comparing the number of points in the parameter space searched by TENSORTUNER with the total number of points that would be explored by an exhaustive search. Figure 10a compares the efficiency in terms of tuning the MKL CPU backend for training and inference modes. In comparison to the 196 points that would be explored by an exhaustive search, TENSORTUNER searches from 9% to 24% of the points to find the optimal settings. It may be possible to further reduce the number of points searched by tuning the expansion or reduction settings of the Nelder-Mead algorithm, but that could also affect the tuning quality.
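To make the notion of "points searched" concrete, the following is a minimal sketch of a textbook Nelder-Mead loop running over an integer grid, with a cache that counts how many distinct grid points are actually evaluated. It is not the Active Harmony implementation used by TENSORTUNER: the 14 x 14 grid shape, the quadratic cost function standing in for a benchmark run, the starting simplex, and the coefficient names `alpha`, `gamma`, `rho`, and `sigma` (reflection, expansion, contraction, shrink) are all assumptions chosen for illustration.

```python
def make_grid_objective(cost, low, high):
    """Wrap cost so real-valued points snap to an integer grid and results are cached."""
    cache = {}

    def objective(point):
        key = tuple(min(high, max(low, round(v))) for v in point)
        if key not in cache:
            cache[key] = cost(key)  # one "benchmark run" per distinct grid point
        return cache[key]

    return objective, cache


def nelder_mead(f, simplex, iterations=30, alpha=1.0, gamma=2.0, rho=0.5, sigma=0.5):
    """Minimize f from an initial simplex of n+1 points in n dimensions."""
    simplex = [list(p) for p in simplex]
    n = len(simplex) - 1
    for _ in range(iterations):
        simplex.sort(key=f)
        best, second_worst, worst = simplex[0], simplex[-2], simplex[-1]
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        reflected = [c + alpha * (c - w) for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            # Expansion: try going further in the promising direction.
            expanded = [c + gamma * (r - c) for c, r in zip(centroid, reflected)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(second_worst):
            simplex[-1] = reflected
        else:
            # Contraction towards the centroid, else shrink towards the best point.
            contracted = [c + rho * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                simplex = [best] + [
                    [b + sigma * (v - b) for b, v in zip(best, p)] for p in simplex[1:]
                ]
    return min(simplex, key=f)


# Made-up cost with its minimum at (9, 4); a real run would time a benchmark instead.
objective, cache = make_grid_objective(
    lambda p: (p[0] - 9) ** 2 + (p[1] - 4) ** 2, low=1, high=14
)
best = nelder_mead(objective, [(1, 1), (14, 1), (1, 14)])
print(f"best value {objective(best)} after evaluating {len(cache)} of {14 * 14} points")
```

Changing `gamma` or `sigma` in this sketch changes how aggressively the simplex grows or shrinks, which is the kind of trade-off between points searched and tuning quality mentioned above.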

Figure 10b compares the efficiency of TENSORTUNER for tuning the Eigen CPU backend for training and inference modes. In comparison to the 35 points that would be explored by an exhaustive search, TENSORTUNER searches from 31% to 77% of the points to find the optimal settings. That TENSORTUNER searches more than 50% of the points in some cases is potentially due to the smaller search space, which reduces the effectiveness of the Nelder-Mead algorithm in quickly pruning it.
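The percentages above translate into approximate point counts as follows. This is a small worked calculation; because the quoted percentages are rounded, the resulting counts are approximate, and the helper name `approx_points` is illustrative.

```python
def approx_points(total, fraction):
    """Approximate number of points evaluated, given a rounded fraction."""
    return round(total * fraction)


# (total points, lowest fraction searched, highest fraction searched) from the text.
search_spaces = {"MKL": (196, 0.09, 0.24), "Eigen": (35, 0.31, 0.77)}

for backend, (total, low, high) in search_spaces.items():
    fewest, most = approx_points(total, low), approx_points(total, high)
    print(
        f"{backend}: about {fewest} to {most} of {total} points evaluated, "
        f"{1 - high:.0%} to {1 - low:.0%} pruned"
    )
```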

# V. RELATED WORK

a) Optimization algorithms: The problem of auto-tuning TensorFlow's CPU backend falls under a general class of mathematical optimization problems. Considerable research over multiple centuries has been devoted to the problem of optimizing linear and non-linear functions, single-variable and multi-variable functions, functions with and without constraints, and convex and non-convex functions. Of course, the discussion

The number of CPUs was restricted using the numactl utility.





(a) Tuning Efficiency of MKL Backend

(b) Tuning Efficiency of Eigen Backend

Fig. 10: TENSORTUNER Tuning Efficiency on MKL and Eigen CPU Backends

here cannot cover all of the past research, so an interested reader can refer to various books [23] on mathematical optimizations.

The problem of auto-tuning TensorFlow's CPU backend can be formulated as a black-box optimization problem or a white-box optimization problem. Black-box optimization problems are those in which the objective function $f: X \to \mathbb{R}$ can only be evaluated, that is, $f(x)$ can be computed for any $x \in X$, but we have no access to any other information about $f$, such as its gradients or Hessian. White-box optimization problems, on the other hand, are those in which such additional information about $f$ is available.

White-box optimization problems can be solved efficiently using gradient-based function optimization algorithms, as the gradients or Hessian are available. These algorithms (such as gradient descent, Newton's method, and quasi-Newton methods [10] such as the DFP and BFGS methods) use the additional information available about the function to decide the promising directions along which to search for an optimum solution. For a large number of variables, gradient-based optimization algorithms are typically the most efficient algorithms. Unfortunately, gradient-based optimization algorithms are inefficient at optimizing noisy, discontinuous, and non-convex functions. Additionally, they do not support function parameters that are discrete or mixed discrete-continuous.
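As a minimal illustration of the white-box setting, the sketch below runs plain gradient descent on a toy one-dimensional function whose gradient is known in closed form. The function, starting point, learning rate, and step count are assumptions chosen for illustration and are unrelated to the tuning problem itself.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Follow the negative gradient from x0 for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


# f(x) = (x - 3)^2 has gradient f'(x) = 2 (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

No comparable gradient is available when the inputs are discrete thread counts and the output is a measured run time, which is why the discussion turns to gradient-free methods next.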

Black-box optimization problems, on the other hand, can only be solved using *gradient-free function optimization algorithms*. A number of gradient-free optimization algorithms exist, such as Nelder-Mead Simplex [22], Simulated Annealing, the Divided Rectangles method, Genetic Algorithms, and Particle Swarm Optimization [2], but each has its own weaknesses. Gradient-free optimization algorithms are typically expensive when the objective function has a large number of parameters. More importantly, gradient-free optimization algorithms are not guaranteed to find the global optimum, as the problem of finding the global optimum is NP-complete (they would need to evaluate $f(x)\ \forall\, x \in X$ in order to find it).
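The point about needing to evaluate every $x \in X$ can be sketched as an exhaustive search over a discrete space. The two parameter ranges and the cost function below are hypothetical stand-ins; in the tuning problem each evaluation of `f` would be a full benchmark run.

```python
import itertools


def exhaustive_search(f, ranges):
    """Evaluate f at every point of a discrete space and return the best one."""
    best_point, best_value, evaluations = None, float("inf"), 0
    for point in itertools.product(*ranges):
        value = f(point)
        evaluations += 1
        if value < best_value:
            best_point, best_value = point, value
    return best_point, best_value, evaluations


# Hypothetical two-parameter space and a made-up cost standing in for run time.
ranges = [range(1, 5), range(1, 29)]
result = exhaustive_search(lambda p: abs(p[0] - 2) + abs(p[1] - 14), ranges)
print(result)
```

The number of evaluations is the product of the range sizes, so the cost of this guarantee grows quickly with every added parameter.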

b) Auto-tuning in high-performance computing systems: Automatic tuning of parameters for best performance is a well-researched area in the high-performance computing (HPC) domain [1]. Performance is typically the most important objective function in many scientific and HPC applications. Unsurprisingly, considerable effort has been devoted to solving this problem. Since matrix multiplication is one of the fundamental computations in many HPC applications, considerable research effort was devoted to auto-tuning matrix multiplication kernels [29], [28]. Given the raw compute power of GPUs, considerable research effort has focused on performance optimization for GPUs as well [21], [13].

In the machine learning domain, auto-tuning is routinely applied to the problem of *hyper-parameter tuning* (e.g., HyperOpt [3], MOE\*\*, Spearmint [24], AutoWeka [26], and the Hypertune†† subsystem in Google Cloud Machine Learning Engine that uses Google's Vizier [15]) and to the *automatic generation of efficient kernels* of neural network operations (e.g., Tensor Comprehensions [27]).

<sup>\*\*</sup>https://github.com/Yelp/MOE

<sup>††</sup> https://cloud.google.com/ml/

Hyper-parameters are tunable parameters in neural networks. Typical hyper-parameters, such as batch size and learning rate, are critical to convergence as well as to good training or inference accuracy. Hyper-parameter tuning also takes considerable time, since neural networks typically take a long time to converge. It is unsurprising, then, to see research effort dedicated to solving this problem. All of the existing hyper-parameter tuning techniques are applicable to the problem of tuning TensorFlow's CPU backend, since the threading model parameters can be considered a set of hyper-parameters. Conversely, TENSORTUNER can also be applied to the tuning problems solved using existing hyper-parameter tuning tools. All of these tools have fairly similar strengths: the ability to define configurable search spaces, the ability to plug in new search algorithms (such as Batched Gaussian Process Bandits, Tree-of-Parzen-Estimators (TPE), random search, and grid search), and the ability to support real-valued, integer-valued, or discrete-valued parameters.
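A configurable search space with mixed parameter types, combined with a pluggable search strategy, can be sketched as below. This is not the API of HyperOpt, Spearmint, or any of the other tools named above: the dictionary format, the parameter names and ranges, and the functions `sample` and `random_search` are hypothetical, and random search is used only because it is the simplest of the listed strategies.

```python
import random

# Hypothetical search space mixing the three parameter kinds mentioned above.
search_space = {
    "learning_rate": ("real", 1e-4, 1e-1),
    "batch_size": ("integer", 16, 256),
    "optimizer": ("discrete", ["sgd", "momentum", "adam"]),
}


def sample(space, rng):
    """Draw one configuration from the search space."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "real":
            config[name] = rng.uniform(spec[1], spec[2])
        elif kind == "integer":
            config[name] = rng.randint(spec[1], spec[2])
        else:
            config[name] = rng.choice(spec[1])
    return config


def random_search(objective, space, trials=20, seed=0):
    """Return the sampled configuration with the lowest objective value."""
    rng = random.Random(seed)
    candidates = [sample(space, rng) for _ in range(trials)]
    return min(candidates, key=objective)


# Made-up objective that prefers batch sizes close to 64.
best_config = random_search(lambda c: abs(c["batch_size"] - 64), search_space)
print(best_config)
```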

We are not aware of any existing research work that applies auto-tuning techniques to improve the performance of TensorFlow's CPU backend on a neural network model. TensorFlow's GPU backend, however, uses an auto-tuning technique to choose the best convolution algorithms at the beginning of training or inference of a neural network. But unlike TENSORTUNER, which optimizes the threading model parameters, the auto-tuning technique used by TensorFlow's GPU backend seems to<sup>‡‡</sup> just scan the list of available convolution algorithms at the start and pick the best-performing one after trial executions. TENSORTUNER, in that sense, is a black-box optimization approach.
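The "scan the candidates and keep the fastest" strategy described above can be sketched generically as follows. This is not TensorFlow's actual auto-tuning code; the function names `measure_seconds` and `pick_fastest` and the two toy candidate implementations are assumptions made for illustration.

```python
import time


def measure_seconds(fn, repeats=3):
    """Best wall-clock time of a few trial executions of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def pick_fastest(candidates, measure=measure_seconds):
    """candidates maps an algorithm name to a zero-argument callable."""
    timings = {name: measure(fn) for name, fn in candidates.items()}
    return min(timings, key=timings.get), timings


# Two hypothetical implementations of the same operation (summing squares).
candidates = {
    "list": lambda: sum([i * i for i in range(20000)]),
    "generator": lambda: sum(i * i for i in range(20000)),
}
winner, timings = pick_fastest(candidates)
print(winner, timings)
```

Unlike a simplex search, this approach evaluates every candidate once, which is practical only because the list of algorithms is short.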

# VI. FUTURE WORK

Although TENSORTUNER is able to find many parameter settings that deliver better performance for TensorFlow's MKL and Eigen CPU backends than the best-known performance numbers, there are a number of avenues for improvement. First, although the Nelder-Mead Simplex algorithm was able to find optimal parameter settings efficiently, the effectiveness of other gradient-free search strategies remains unexplored. Additionally, the Nelder-Mead algorithm is known to have convergence issues with a large number of design variables. So far we have explored the algorithm with a maximum of 3 variables; it would be interesting to explore its applicability to a larger number of TensorFlow design variables. We evaluated TENSORTUNER on TensorFlow's benchmark suite, and the applicability of the technique to general models with large datasets, such as ImageNet, is unknown. Although the model along with its dataset is a black box for TENSORTUNER, the number of threads used for pre-processing a dataset could affect the optimal settings for TensorFlow's threading model.

# VII. CONCLUSION

In this paper, we presented our approach, called TENSORTUNER, to the problem of auto-tuning TensorFlow's threading model for MKL and Eigen CPU backends. Our experimental evaluation on a set of popular convolutional neural network models from TensorFlow's benchmark suite revealed that the default parameter settings used by the Eigen CPU backend for both training and inference use cases deliver sub-optimal TensorFlow performance, and that with the settings found by TENSORTUNER, we get almost 2X improvement in performance. Even for the well-tuned MKL CPU backend, we found that the publicly available settings delivered sub-optimal TensorFlow performance for the inference use case (although they delivered optimal performance for the training use case), and that the settings found by TENSORTUNER delivered almost 30% improvement in performance for the inference use case. Our experimental results underscore the fact that manual tuning of TensorFlow's CPU backends may not yield the best TensorFlow performance, and that an automated approach can tune the CPU backends much better than manual tuning.

# ACKNOWLEDGEMENTS

We would like to thank the anonymous reviewers for their comments and suggestions. We would specifically like to thank Nagib Hakim for his insightful comments and discussions on the preliminary version of the paper. Finally, we would also like to thank all the members of Intel's TensorFlow optimization team for their input to this work and for experimenting with TENSORTUNER.

<sup>‡‡</sup>No standard reference, other than TensorFlow's source code, exists for this work.

# REFERENCES

- [1] David H Bailey, Robert F Lucas, and Samuel Williams. Performance tuning of scientific applications. CRC Press, 2010
- [2] A.D. Belegundu and T.R. Chandrupatla. Optimization Concepts and Applications in Engineering. Prentice Hall, 1999
- [3] James Bergstra, Dan Yamins, and David D Cox. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In *Proceedings of the 12th Python in Science Conference*, 2013.
- [4] Intel Corporation. Intel<sup>®</sup> Xeon<sup>®</sup> Platinum 8180 Processor, 2017. https://ark.intel.com/products/120496/Intel-Xeon-Platinum-8180-Processor-38\_5M-Cache-2\_50-GHz.
- [5] Intel Corporation. TensorFlow Optimizations on Modern Intel® Architecture, 2017. https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture.
- [6] Intel Corporation. Intel<sup>®</sup> Math Kernel Library for Deep Neural Network, 2018. https://github.com/01org/mkl-dnn/releases.
- [7] Intel Corporation. Intel® Optimization for TensorFlow\* Installation Guide, 2018. https://software.intel.com/en-us/articles/intel-optimization-for-tensorflow-installation-guide.
- [8] Intel Corporation. TensorFlow\* Optimizations for the Intel® Xeon® Scalable Processor, 2018. https://ai.intel.com/tensorflow-optimizations-intel-xeon-scalable-processor/.
- [9] Y. H. Dai and Y. Yuan. A nonlinear conjugate gradient method with a strong global convergence property. SIAM J. on Optimization.
- [10] John E Dennis, Jr and Jorge J Moré. Quasi-newton methods, motivation and theory. SIAM review, 1977.
- [11] Eigen. Eigen, 2017. http://eigen.tuxfamily.org.
- [12] Elmoustapha Ould-Ahmed-Vall et al. Accelerating tensorflow on modern intel architectures. First International Workshop on Architectures for Intelligent Machines, 2017.
- [13] Grauer-Gray et al. Auto-tuning a high-level language targeted to gpu codes. In *Innovative Parallel Computing* (*InPar*), 2012. IEEE, 2012.
- [14] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
- [15] Daniel Golovin et al. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, 2017.
- [16] Google Inc. Performance Guide. https://www.tensorflow.org/performance/performance\_guide#optimizing\_for\_cpu.
- [17] Google Inc. Computation using data flow graphs for scalable machine learning, 2016. https://github.com/tensorflow/tensorflow.
- [18] Google Inc. TensorFlow Mobile, 2016. https://www.tensorflow.org/mobile.
- [19] Google Inc. *TensorFlow Benchmarks Github*, 2017. https://github.com/tensorflow/benchmarks.
- [20] Google Inc. TensorFlow with Intel<sup>®</sup> MKL DNN, 2017. https://www.tensorflow.org/performance/performance\_guide#tensorflow\_with\_intel\_mkl\_dnn.
- [21] Yinan Li, Jack Dongarra, and Stanimire Tomov. A note on auto-tuning gemm for gpus. In *International Conference on Computational Science*. Springer, 2009.
- [22] John A. Nelder and Roger Mead. A simplex method for function minimization. *Computer Journal*, 1965.
- [23] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Inc., 1982.
- [24] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, 2012.
- [25] Cristian Tapus, I-Hsin Chung, and Jeffrey K. Hollingsworth. Active harmony: Towards automated performance tuning. ACM/IEEE SC 2002 Conference (SC'02), 2002.
- [26] Chris Thornton et al. Auto-weka: Combined selection and hyperparameter optimization of classification algorithms. In 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013.
- [27] Nicolas Vasilache et al. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730, 2018.
- [28] Richard Vuduc, James W Demmel, and Katherine A Yelick. Oski: A library of automatically tuned sparse matrix kernels. In *Journal of Physics: Conference Series*. IOP Publishing, 2005.
- [29] R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Supercomputing, 1998. SC98. IEEE/ACM Conference on, 1998.

# APPENDIX

a) Command to run TensorTuner to tune MKL CPU backend for training:

harmony/bin/tuna STRATEGY=nm.so -q -v
-i=interop,\$interop\_min,\$interop\_max,\$interop\_step
-i=intraop,\$intraop\_min,\$intraop\_max,\$intraop\_step
-i=omp,\$omp\_min,\$omp\_max,\$omp\_step
numactl -l python tf\_cnn\_benchmarks.py
--forward\_only=False --num\_warmup\_batches=0
--batch\_size=\$batch\_size --data\_format=NCHW
--num\_batches=100 --model=\$model
--num\_inter\_threads=%interop
--num\_intra\_threads=%intraop
--num\_omp\_threads=%omp

b) Command to run TensorTuner to tune Eigen CPU backend for training:

harmony/bin/tuna STRATEGY=nm.so -q -v
-i=interop,\$interop\_min,\$interop\_max,\$interop\_step
-i=intraop,\$intraop\_min,\$intraop\_max,\$intraop\_step
python tf\_cnn\_benchmarks.py
--forward\_only=False --num\_warmup\_batches=0
--batch\_size=\$batch\_size --data\_format=NHWC
--num\_batches=100 --model=\$model
--num\_inter\_threads=%interop
--num\_intra\_threads=%intraop

c) Command to run TensorTuner to tune MKL and Eigen CPU backends for inference: Running the respective commands above with --forward\_only=True instead of --forward\_only=False enables the inference mode of tf\_cnn\_benchmarks.py.
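For readers who want to reproduce a single measurement without the tuner, the sketch below assembles the benchmark command line for one concrete parameter setting, using only the flags that appear in commands a) and b). The helper name `benchmark_command`, the model name, and the thread counts in the usage lines are hypothetical; the sketch only mimics what substituting values for the %interop, %intraop, and %omp placeholders would produce, and it does not run anything.

```python
def benchmark_command(model, batch_size, interop, intraop, omp=None, forward_only=False):
    """Build the tf_cnn_benchmarks.py command line for one parameter setting."""
    argv = [
        "python", "tf_cnn_benchmarks.py",
        f"--forward_only={forward_only}",
        "--num_warmup_batches=0",
        f"--batch_size={batch_size}",
        "--num_batches=100",
        f"--model={model}",
        f"--num_inter_threads={interop}",
        f"--num_intra_threads={intraop}",
    ]
    if omp is None:
        # Eigen backend, as in command b): NHWC data format, no OpenMP setting.
        argv.append("--data_format=NHWC")
    else:
        # MKL backend, as in command a): NCHW data format, OpenMP thread count,
        # and the whole command run under numactl -l.
        argv = ["numactl", "-l"] + argv + ["--data_format=NCHW", f"--num_omp_threads={omp}"]
    return argv


# Hypothetical settings; the values are placeholders, not tuned results.
mkl_cmd = benchmark_command("vgg11", 128, interop=2, intraop=28, omp=28)
eigen_cmd = benchmark_command("vgg11", 128, interop=2, intraop=28, forward_only=True)
print(" ".join(mkl_cmd))
print(" ".join(eigen_cmd))
```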