Maximize Mask-RCNN (TF, Keras) inference time performance on AKS with CPU #1019

@Larleyt

Hello!
The aim: run inference (no training needed) of a custom Mask-RCNN on a CPU VM as an AksWebservice, as fast as possible. (CPU was chosen mainly because it's cheaper.)

The default TF build from pip emitted warnings like this at inference time:
Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

So I tried installing the provided CPU-optimized build using:
CondaDependencies.add_tensorflow_conda_package(core_type='cpu', version="1.15")

It installs TF 1.15 successfully, with all the needed instructions enabled, but only for MKL-DNN operations:

This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags. 

And it made inference about 2x slower. Why could that be? I've seen quite a few similar issues ([1], [2], [3]) about performance degradation when using MKL. Maybe it is somehow thread-related.
I also tried pip intel-tensorflow==1.15.2: same performance.
Installing with CondaDependencies.add_tensorflow_pip_package(core_type='cpu', version="1.15")
results in no optimizations at all (pip installs the usual TF binary).
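To make the thread-related suspicion testable, here is a minimal sketch of the knobs usually tuned for MKL builds: the Intel OpenMP environment variables and the TF session thread counts. The helper name `mkl_thread_settings` and the value `physical_cores=4` are assumptions for illustration, and the numbers are starting points to experiment with, not measured optima.

```python
import os


def mkl_thread_settings(physical_cores, inter_op_threads=1):
    """Return (env vars, session thread counts) to try with an MKL build.

    Hypothetical helper: the values are starting points, not tuned results.
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be >= 1")
    env = {
        # Number of OpenMP threads used by MKL-DNN kernels.
        "OMP_NUM_THREADS": str(physical_cores),
        # Milliseconds a thread keeps spinning after finishing its work.
        "KMP_BLOCKTIME": "0",
        # Pin OpenMP threads to cores to limit thread migration.
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
    }
    session = {
        "intra_op_parallelism_threads": physical_cores,
        "inter_op_parallelism_threads": inter_op_threads,
    }
    return env, session


# Assumed core count; replace with the VM's real physical core count.
env, session_kwargs = mkl_thread_settings(physical_cores=4)
os.environ.update(env)  # has to happen before TensorFlow is imported

# With TF 1.15 installed, the session settings would be applied roughly as:
#   import tensorflow as tf
#   config = tf.ConfigProto(**session_kwargs)
#   tf.keras.backend.set_session(tf.Session(config=config))
```

If the MKL build is oversubscribing threads inside the container, changing these values should visibly move the inference time; if nothing changes, threading is probably not the cause.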

So I decided to build my own TF 1.15.3 with AVX2, AVX512F, and FMA but without MKL (correct me if I'm wrong and that won't change the performance):

$ bazel build -c opt --copt=-march=native --copt=-mfpmath=both //tensorflow/tools/pip_package:build_pip_package

It compiled and installed without errors, and there were no more warnings about unsupported instructions or MKL-DNN-only operations. But I noticed no performance boost.
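For comparing builds on equal terms, a small timing harness along these lines could be run against each TF build. This is a sketch, not what I actually used: `benchmark` and `fake_predict` are hypothetical names, and `fake_predict` is only a stand-in for the real Mask-RCNN call. Warm-up runs are excluded because the first calls usually include graph and memory initialization.

```python
import statistics
import time


def benchmark(predict_fn, sample, warmup=3, runs=20):
    """Time predict_fn(sample) and return latency statistics in milliseconds."""
    # Warm-up calls are not timed.
    for _ in range(warmup):
        predict_fn(sample)

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    timings_ms.sort()
    return {
        "runs": runs,
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        # Nearest-rank style 95th percentile on the sorted timings.
        "p95_ms": timings_ms[min(runs - 1, int(0.95 * runs))],
    }


def fake_predict(image):
    # Stand-in workload; swap in the real model call here.
    return sum(x * x for x in image)


stats = benchmark(fake_predict, list(range(10000)), warmup=2, runs=10)
print(stats)
```

Running the same harness with the same input under each build (pip, conda MKL, custom bazel build) gives numbers that can be compared directly.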

So why do the optimized TF builds perform the same (if not worse, in the MKL-DNN case)? Am I using the wrong type of VM for this kind of task? (Right now I'm using the Fs-v2 series.)

And also a few side questions, if you please:

  1. How do I delete an Environment from a Workspace so that it no longer appears in Environment.list(ws)?
  2. Why would the AksWebservice create only a few of the replicas, say 2/10? The others are shown as "unavailable" on the Kubernetes Services page on portal.azure.com. Autoscaling is set to False, so that should not be the reason.
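On the replica question, one thing that might be worth ruling out is plain CPU capacity: if each replica requests more cores than the nodes can spare, Kubernetes would leave the rest unscheduled. The sketch below is only back-of-the-envelope arithmetic under that assumption. The function name and all numbers are made up for illustration, it ignores memory and system pods, and I have not confirmed this is the actual cause.

```python
import math


def schedulable_replicas(nodes, allocatable_cores_per_node, cpu_request_per_replica):
    """Estimate how many replicas fit by CPU request alone (memory ignored)."""
    if cpu_request_per_replica <= 0:
        raise ValueError("cpu_request_per_replica must be positive")
    # A replica cannot be split across nodes, so round down per node.
    per_node = math.floor(allocatable_cores_per_node / cpu_request_per_replica)
    return nodes * per_node


# Illustrative numbers only: 2 nodes with ~3.8 allocatable cores each,
# and each replica requesting 2 cores.
print(schedulable_replicas(nodes=2,
                           allocatable_cores_per_node=3.8,
                           cpu_request_per_replica=2.0))  # prints 2
```

With numbers like these, only 2 of 10 requested replicas would fit, so comparing the per-replica CPU request in the deployment configuration against the cluster's node sizes may be a useful first check.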
