
MKL backend performance regression with some topologies #398

Closed
moderato opened this issue Sep 7, 2017 · 24 comments

@moderato

moderato commented Sep 7, 2017

Hello! I use neon to train a model on three backends: CPU, MKL and GPU. All the settings are the same when running with these backends. I got very similar costs from CPU and GPU, while the cost from the MKL backend was usually higher than the other two, and sometimes even nan. Does anybody have an idea why that happens?

The CPU is an Intel i7; the GPU is an Nvidia GTX 1050; the code is running on Ubuntu 16.04. Here is the printed output of the code...

Use cpu as backend.

DISPLAY:neon:-------------------------------------------------------------------------------------
DISPLAY:neon:|    Func     |    Mean     |   Median    |     Min     |     Max     |    Units    |
DISPLAY:neon:-------------------------------------------------------------------------------------
DISPLAY:neon:| fprop       |  456.74     |  452.61     |  439.07     |  501.7      |    msec     |
DISPLAY:neon:| bprop       |  819.21     |  796.45     |  772.53     |  979.8      |    msec     |
DISPLAY:neon:| iteration   |  1276       |  1250       |  1213.5     |  1457       |    msec     |
DISPLAY:neon:-------------------------------------------------------------------------------------

Epoch 0   [Train |████████████████████|  246/246  batches, 3.51 cost, 303.30s] [CrossEntropyMulti Loss 0.00, 0.00s]
Epoch 1   [Train |████████████████████|  245/245  batches, 3.49 cost, 301.14s] [CrossEntropyMulti Loss 0.00, 0.00s]
Epoch 2   [Train |████████████████████|  245/245  batches, 3.47 cost, 301.43s] [CrossEntropyMulti Loss 0.00, 0.00s]
Epoch 3   [Train |████████████████████|  245/245  batches, 3.46 cost, 302.56s] [CrossEntropyMulti Loss 0.00, 0.00s]
Epoch 4   [Train |████████████████████|  245/245  batches, 3.44 cost, 302.91s] [CrossEntropyMulti Loss 0.00, 0.00s]
Neon training finishes in 1646.99 seconds.
Misclassification error = 91.2%. Finished in 26.86 seconds.
Top 3 Misclassification error = 78.1%. Finished in 27.36 seconds.
Top 5 Misclassification error = 65.7%. Finished in 27.36 seconds.
Misclassification error = 91.7% on test set. Finished in 43.54 seconds.
Top 3 Misclassification error = 79.8% on test set. Finished in 43.60 seconds.
Top 5 Misclassification error = 67.3% on test set. Finished in 43.76 seconds.


Use mkl as backend.

DISPLAY:neon:-------------------------------------------------------------------------------------
DISPLAY:neon:|    Func     |    Mean     |   Median    |     Min     |     Max     |    Units    |
DISPLAY:neon:-------------------------------------------------------------------------------------
DISPLAY:neon:| fprop       |  119.82     |  120.03     |  111.14     |  130.82     |    msec     |
DISPLAY:neon:| bprop       |  157.51     |  156.32     |  151.81     |  165.86     |    msec     |
DISPLAY:neon:| iteration   |  277.33     |  280.49     |  264.03     |  285.16     |    msec     |
DISPLAY:neon:-------------------------------------------------------------------------------------

Epoch 0   [Train |████████████████████|  246/246  batches, 48.12 cost, 70.76s] [CrossEntropyMulti Loss 0.00, 0.00s]
Epoch 1   [Train |████████████████████|  245/245  batches, 47.54 cost, 73.94s] [CrossEntropyMulti Loss 0.00, 0.00s]
Epoch 2   [Train |████████████████████|  245/245  batches, 48.52 cost, 77.99s] [CrossEntropyMulti Loss 0.00, 0.00s]
Epoch 3   [Train |████████████████████|  245/245  batches, 48.09 cost, 74.04s] [CrossEntropyMulti Loss 0.00, 0.00s]
Epoch 4   [Train |████████████████████|  245/245  batches, 48.20 cost, 79.86s] [CrossEntropyMulti Loss 0.00, 0.00s]
Neon training finishes in 422.74 seconds.
Misclassification error = 94.6%. Finished in 9.29 seconds.
Top 3 Misclassification error = 90.1%. Finished in 9.56 seconds.
Top 5 Misclassification error = 85.6%. Finished in 9.78 seconds.
Misclassification error = 94.5% on test set. Finished in 15.48 seconds.
Top 3 Misclassification error = 90.0% on test set. Finished in 15.47 seconds.
Top 5 Misclassification error = 85.5% on test set. Finished in 14.99 seconds.


Use gpu as backend.

DISPLAY:neon:-------------------------------------------------------------------------------------
DISPLAY:neon:|    Func     |    Mean     |   Median    |     Min     |     Max     |    Units    |
DISPLAY:neon:-------------------------------------------------------------------------------------
DISPLAY:neon:| fprop       |  6.1057     |  6.0366     |  5.8992     |  6.3699     |    msec     |
DISPLAY:neon:| bprop       |  10.76      |  10.753     |  9.9809     |  11.841     |    msec     |
DISPLAY:neon:| iteration   |  16.865     |  16.783     |  15.88      |  18.185     |    msec     |
DISPLAY:neon:-------------------------------------------------------------------------------------

Epoch 0   [Train |████████████████████|  246/246  batches, 3.51 cost, 3.98s] [CrossEntropyMulti Loss 0.00, 0.00s]
Epoch 1   [Train |████████████████████|  245/245  batches, 3.48 cost, 3.97s] [CrossEntropyMulti Loss 0.00, 0.00s]
Epoch 2   [Train |████████████████████|  245/245  batches, 3.47 cost, 3.98s] [CrossEntropyMulti Loss 0.00, 0.00s]
Epoch 3   [Train |████████████████████|  245/245  batches, 3.46 cost, 3.98s] [CrossEntropyMulti Loss 0.00, 0.00s]
Epoch 4   [Train |████████████████████|  245/245  batches, 3.44 cost, 3.98s] [CrossEntropyMulti Loss 0.00, 0.00s]
Neon training finishes in 21.84 seconds.
Misclassification error = 91.2%. Finished in 0.38 seconds.
Top 3 Misclassification error = 78.0%. Finished in 0.38 seconds.
Top 5 Misclassification error = 65.6%. Finished in 0.38 seconds.
Misclassification error = 91.6% on test set. Finished in 0.60 seconds.
Top 3 Misclassification error = 79.8% on test set. Finished in 0.60 seconds.
Top 5 Misclassification error = 67.4% on test set. Finished in 0.60 seconds.
@indie
Contributor

indie commented Sep 13, 2017

Hmm... without knowing more about what kind of functions or code you are running in this model, here's a possible cause:

The high-level interface for this backend (-b mkl) includes an optional, on-by-default NaN check on all matrix inputs before they are passed to any LAPACKE functions. Basically, when an input matrix contains any NaNs, the input parameter corresponding to that matrix can be flagged with an INFO parameter error.

For your reference (and this was recently added to the documentation for Intel MKL 2018 Gold), you can turn the NaN check OFF, to save that overhead, in a couple of different ways:

  • Through the environment variable:
    • Set LAPACKE_NANCHECK to 0 to turn NaN checking OFF
    • Set LAPACKE_NANCHECK to 1 (or any non-zero integer) to turn NaN checking back ON.
  • Through the API
    • Call LAPACKE_set_nancheck(flag) where flag = 0 turns OFF NaN checking.
    • Call LAPACKE_set_nancheck(flag) where flag ≠ 0 turns NaN checking back ON.

It could also be something else, but based only on the output you are showing here, this might be a good place to start. NaN checks are not always a bad idea; it just depends on the data.
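
If your run is driven from Python, here is a minimal sketch of the environment-variable route; it assumes your MKL build honors LAPACKE_NANCHECK (as described above) and that the variable is set before the first LAPACKE call:

import os

# Assumption: the MKL/LAPACKE build honors LAPACKE_NANCHECK (see above).
# Set it before the first LAPACKE call -- safest is before importing
# numpy / neon, since those may trigger MKL initialization.
os.environ["LAPACKE_NANCHECK"] = "0"   # "0" = OFF; any non-zero value = ON

import numpy as np   # imported only after the variable is set
# ... build the backend / model and train as usual ...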

@wei-v-wang
Contributor

wei-v-wang commented Sep 13, 2017

@moderato Could you try another MKL version (as found here: https://github.com/01org/mkl-dnn/releases), e.g. mklml_lnx_2018.0.20170425.tgz?
To switch to an older MKL version: 1) delete the existing MKL folder under neon, and 2) comment out the "wget" and "tar (unzip)" lines in prepare_mkl.sh.

@moderato
Author

@wei-v-wang Sorry for the late reply! I rebuilt numpy and scipy from source against the MKL library that comes with the Intel Parallel Studio XE student edition (https://software.intel.com/en-us/parallel-studio-xe/choose-download/student-linux-fortran), and the nan cost problem still exists. It usually pops up these two warnings:

/home/moderato/miniconda3/envs/neon/lib/python3.5/site-packages/neon-2.1.0-py3.5.egg/neon/backends/math_cpu.py:138: RuntimeWarning: invalid value encountered in maximum
  return np.log(np.maximum(x, np.exp(-50.)))
/home/moderato/miniconda3/envs/neon/lib/python3.5/site-packages/neon-2.1.0-py3.5.egg/neon/backends/math_cpu.py:242: RuntimeWarning: invalid value encountered in multiply
  return np.multiply(x1, x2)

Any hints?
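
In case it helps with reproducing this: one way to localize where the first NaN shows up is to promote these numpy warnings to hard errors, so the traceback points at the offending operation. A minimal sketch using only standard numpy error handling (it only catches NaNs produced by numpy ops such as the math_cpu.py lines above, not NaNs produced inside the MKL kernels):

import numpy as np

# Turn "invalid value" / overflow warnings into FloatingPointError so the
# first NaN-producing numpy operation raises with a full traceback.
np.seterr(invalid="raise", over="raise")

# ... then run the training script / model.fit(...) as usual ...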

@moderato
Author

@indie Ohhh, I guess you were giving this answer in terms of C? I'm actually using Python; sorry for not being clear enough. By the way, is there a way to do the NaN check from Python?

@wei-v-wang
Contributor

@moderato Can you please use the MKL library downloaded automatically by neon (when you type "make")? You probably need to unset the MKL library path so that neon uses the default one under its root directory rather than the one from the Intel Parallel Studio XE student edition. It may or may not make a difference; please let me know what you find.

@moderato
Author

@wei-v-wang I tried your suggestion. I edited the site.cfg file and .bashrc accordingly as follows:

site.cfg

[mkl]
library_dirs = /home/moderato/Documents/neon/mklml_lnx_2018.0.20170720/lib
include_dirs = /home/moderato/Documents/neon/mklml_lnx_2018.0.20170720/include
mkl_libs = mkl_rt
lapack_libs =

.bashrc (only the newly added part)

# mkl
export PATH=$PATH:/opt/intel/bin
# mkl intel, comment for neon
# export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2018/linux/mkl/lib/intel64/:/opt/intel/compilers_and_libraries_2018/linux/lib/intel64:$LD_LIBRARY_PATH
# mkl neon, comment for intel
export LD_LIBRARY_PATH=/home/moderato/Documents/neon/mklml_lnx_2018.0.20170720/lib:/opt/intel/compilers_and_libraries_2018/linux/lib/intel64:$LD_LIBRARY_PATH

Actually I have no idea which compiler to use. I used the same settings as when I built numpy on top of the MKL that comes with Parallel Studio. Here's the tutorial I followed: https://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl
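
As a sanity check on that build, this is how I verify which BLAS/LAPACK numpy was actually linked against (standard numpy introspection, nothing neon-specific):

import numpy as np

# Prints the blas_mkl_info / lapack_mkl_info sections (library_dirs,
# libraries, include_dirs) this numpy build was compiled with.
np.show_config()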

However this time it gave me an error like this:

>>> import numpy
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moderato/miniconda3/envs/neon/lib/python3.5/site-packages/numpy/__init__.py", line 142, in <module>
    from . import add_newdocs
  File "/home/moderato/miniconda3/envs/neon/lib/python3.5/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/home/moderato/miniconda3/envs/neon/lib/python3.5/site-packages/numpy/lib/__init__.py", line 19, in <module>
    from .polynomial import *
  File "/home/moderato/miniconda3/envs/neon/lib/python3.5/site-packages/numpy/lib/polynomial.py", line 20, in <module>
    from numpy.linalg import eigvals, lstsq, inv
  File "/home/moderato/miniconda3/envs/neon/lib/python3.5/site-packages/numpy/linalg/__init__.py", line 51, in <module>
    from .linalg import *
  File "/home/moderato/miniconda3/envs/neon/lib/python3.5/site-packages/numpy/linalg/linalg.py", line 31, in <module>
    from numpy.linalg import lapack_lite, _umath_linalg
ImportError: /home/moderato/miniconda3/envs/neon/lib/python3.5/site-packages/numpy/linalg/_umath_linalg.cpython-35m-x86_64-linux-gnu.so: undefined symbol: __intel_avx_rep_memset

Any ideas?

@wei-v-wang
Contributor

@moderato From the above, may I suggest starting from scratch: download neon, make sure the prerequisites (python-pip, python-virtualenv, libhdf5-dev, libyaml-dev, pkg-config) are satisfied, and then type "make clean && make"?

From our side, we do not see a need to build numpy or scipy manually; they are installed automatically in a virtual environment. Our performance does not come from an "MKL-based numpy" but from pure MKL, so please do not worry about building numpy/scipy against MKL. I believe "site.cfg" is something used by numpy? Please start from scratch: after downloading neon, just do "make clean && make", provided you have the aforementioned packages installed.
For example, on an Ubuntu system, you can do "apt-get install python-virtualenv" if python-virtualenv is not already installed.

@moderato
Author

@wei-v-wang The reason I built numpy manually is that the numpy version that comes with neon is 1.11.1, which is not the latest, so every time I reinstall neon and get started it gives me a numpy version error right away, and I then have to reinstall numpy myself. Does the numpy version make any difference?

@wei-v-wang
Contributor

wei-v-wang commented Sep 18, 2017

@moderato We have been using numpy 1.11.1 and numpy 1.13 without any problem. However, please note that numpy is recommended to be installed inside the virtualenv. See below for the difference: the first "pip list" is outside the virtualenv, and numpy is not even installed. After activating the virtual environment with ". .venv/bin/activate" and doing another "pip list", it shows that numpy 1.11 is used. If it complains about numpy 1.11, can you please do "pip uninstall numpy"? I think neon will then take care of installing numpy in its own virtualenv.

wangwei3@wangwei3-mac01: ~/git/private-neon$ pip list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
pip (9.0.1)
setuptools (32.1.0)
virtualenv (15.1.0)
wheel (0.29.0)
wangwei3@wangwei3-mac01: ~/ git/private-neon$ . .venv2/bin/activate
(.venv2) wangwei3@wangwei3-mac01: ~/ git/private-neon$ pip list |grep numpy
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
numpy (1.11.1)

@moderato
Author

@wei-v-wang The numpy version error I mentioned is like:

RuntimeError: module compiled against API version 0xb but this version of numpy is 0xa

Every time I reinstall neon and run a neon program it pops up, and I usually solve it with 'pip uninstall numpy' followed by 'pip install numpy'. By the way, I use Anaconda for virtual environment management. Does that make any difference?

@wei-v-wang
Contributor

@moderato I might be wrong, but 0xb = numpy v1.11, and during runtime numpy v1.10 (0xa) is used?
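
A quick way to confirm which numpy (and from which install location) the failing program is actually picking up, run from the same environment, is for example:

import sys
import numpy as np

print(sys.executable)               # which Python interpreter is running
print(np.__version__, np.__file__)  # which numpy it imports, and from where

If that path does not point into neon's virtualenv (or the environment you built neon in), the mismatch usually explains the "compiled against API version ..." error.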

Anaconda might have added one more layer of confusion. Could you please just use the normal python distribution and not use the python in Anaconda?

So far, have you been able to use neon with the mkl backend? Did the "make" command eventually work?

@moderato
Author

moderato commented Sep 18, 2017

@wei-v-wang I think I'll stick with Anaconda, as I have also installed many other necessary packages in this environment. The 'make' command always works; only the mkl backend produces weird results.

I just rebuilt neon and tried

pip install --upgrade numpy --no-cache-dir

to fix that error. Unlike with the MKL-based numpy, the "RuntimeWarning: invalid value encountered" problem is gone, but the cost still blows up easily.

@wei-v-wang
Contributor

OK, got it. Thanks for the feedback @moderato. In the future we will probably make sure we test in an Anaconda environment as well.
Please stay tuned: we are going to have a new release that will hopefully get rid of the errors you see. At that time, you can delete the existing "0720" MKL first and do "git pull" and "make" again.

@wei-v-wang
Contributor

wei-v-wang commented Sep 18, 2017

If you cannot wait, could you apply the following patch and retry? @moderato

Please apply the following changes to prepare_mkl.sh (only these few lines change):
old:
-VERSION_MATCH=20170720
-ARCHIVE_BASE=mklml_lnx_2018.0.20170720
new:
+VERSION_MATCH=20170908
+ARCHIVE_BASE=mklml_lnx_2018.0.$VERSION_MATCH

old:
-GITHUB_RELEASE_TAG=v0.9
new:
+GITHUB_RELEASE_TAG=v0.10

@moderato
Author

moderato commented Sep 18, 2017

@wei-v-wang I see. Maybe the problem is related to anaconda. For your reference, I am using miniconda3 with Python 3.5. Thanks for your help and looking forward to the new release!

@moderato
Author

moderato commented Sep 18, 2017

@wei-v-wang I just edited the prepare_mkl.sh file and gave it a quick try. The good news is that the nan cost problem is gone! But another problem has come up: the speed is much slower than I expected. The output looks like:

Epoch 0   [Train |██████              |   78/246  batches, 3.62 cost, 117.85s]

Based on the CNN model and the dataset I use, the ratio of elapsed seconds to trained batches is around 1 with the neon cpu backend (which works fine), and around 0.5-0.6 with MKL-based numpy and the neon mkl backend (although that gives the nan cost). I think in my case this ratio is supposed to be below 1, right? After all, the mkl backend should be faster than cpu. The CPU on my PC is an i7.

Hope it helps with your testing!

@wei-v-wang
Contributor

@moderato Glad to know the NaN is gone and thank you for the update!
I see what you mean: you used to have 246 batches done in 70 to 80 seconds with the MKL backend, and now it is slower than the CPU backend. Do you mind sharing the model file that you use? We want to debug the performance regression with it, and if you can provide the model file we may be able to include a fix for this performance drop as well.
You can also send it via email to wei.v.wang@intel.com if that is better. Thanks!

@moderato
Author

@wei-v-wang Program code sent. Thank you for your help!

@wei-v-wang
Contributor

Thanks @moderato. We will take a look and let you know when we have improved the performance.

wei-v-wang changed the title from "Use MKL backend and get very big (even nan) cost..." to "MKL backend performance regression with some topologies" on Oct 13, 2017
@moderato
Author

@wei-v-wang I tried Neon 2.2 on my model and it looks like the problem is gone! I also noticed that the MKL training loss fluctuates a lot compared to the CPU and GPU backends, as well as other frameworks, although the validation accuracy is barely affected. Just letting you know in case it's helpful to your work. Thanks!

@wei-v-wang
Contributor

Hi @moderato, Thanks for your update!
To clarify: I know you already said above that the NaN error is gone. Do you also mean that the problem of MKL being slow for your topology is gone?

@moderato
Author

Hi @wei-v-wang, sorry for only just seeing the message... Yes, the NaN problem is gone, and MKL is much better than CPU in this version. It got the best performance in some of our tests, e.g. on ResNet-50 with an image size of 32x32, while it was beaten in other tests that used a random model with a random input size, e.g. 47x47. I wonder if that's because neon is only optimized for 2^n sizes and well-known models like ResNets?
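
For reference, this is roughly how such a size comparison could be set up; a rough, untested sketch that assumes neon 2.x's Model.benchmark API and the layer/iterator constructors used in neon's examples (the network here is a hypothetical toy model, not our actual one):

import numpy as np
from neon.backends import gen_backend
from neon.data import ArrayIterator
from neon.initializers import Gaussian
from neon.layers import Conv, Affine, GeneralizedCost
from neon.models import Model
from neon.optimizers import GradientDescentMomentum
from neon.transforms import Rectlin, Softmax, CrossEntropyMulti

be = gen_backend(backend='mkl', batch_size=32)

for side in (32, 47):   # power-of-two spatial size vs. an "odd" one
    # Random data just to drive the benchmark; rows are flattened CHW images.
    X = np.random.rand(1024, 3 * side * side).astype(np.float32)
    y = np.random.randint(0, 10, 1024)
    train = ArrayIterator(X, y, nclass=10, lshape=(3, side, side))

    layers = [Conv((3, 3, 64), init=Gaussian(scale=0.01), activation=Rectlin()),
              Affine(nout=10, init=Gaussian(scale=0.01), activation=Softmax())]
    model = Model(layers=layers)
    cost = GeneralizedCost(costfunc=CrossEntropyMulti())
    opt = GradientDescentMomentum(0.01, momentum_coef=0.9)

    print('--- input size %dx%d ---' % (side, side))
    model.benchmark(train, cost=cost, optimizer=opt, niterations=20, nskip=2)

The fprop/bprop/iteration tables it prints (similar to the ones at the top of this issue) make the per-size difference easy to compare.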

@wei-v-wang
Contributor

wei-v-wang commented Oct 28, 2017

Hi @moderato I am very glad to hear that your NaN problem is gone and that you are also seeing good performance with MKL.

Yes, you are right: we prioritized accelerating the common and popular cases that we are aware of. Neon is optimized not only for ResNets but also for AlexNet, GoogLeNet, VGG, Deep Speech 2, etc. But we do not guarantee a universal performance advantage over the numpy CPU backend for an arbitrary model or input size.

@wei-v-wang
Contributor

Hi @moderato I am closing this issue. Feel free to open more issues.
