MKL backend performance regression with some topologies #398
Hmm... without knowing more about what kind of functions or code you are running in this model, here's a possible cause: NaN checking in the high-level interface for this backend.
For reference (this was recently added to the documentation for Intel MKL 2018 Gold), you can turn the NaN check OFF, for some savings, in a couple of different ways:
It could also be something else, but based only on the output you are showing here, this is a good place to start. NaN checks are not always a bad idea; it just depends on the data.
@moderato Could you try another MKL version (as found here: https://github.com/01org/mkl-dnn/releases), e.g. mklml_lnx_2018.0.20170425.tgz?
@wei-v-wang Sorry for the late reply! I rebuilt numpy and scipy from source against the MKL library that comes with Intel Parallel Studio XE Student Edition (https://software.intel.com/en-us/parallel-studio-xe/choose-download/student-linux-fortran), and the NaN cost problem still exists. Usually it pops up these two warnings:
Any hints?
@indie Ohhh, I guess you were giving this answer in terms of C? I'm actually using Python; sorry for not being clear enough. By the way, is there a way to do the NaN check from Python?
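One way this could work from Python is to set the relevant environment variable before any MKL-backed library is imported. This is only a sketch: the variable name `LAPACKE_NANCHECK` is an assumption based on the MKL 2018 NaN-check documentation mentioned above, so check the MKL release notes for your version.

```python
# Sketch: disabling MKL's NaN checking from Python via the environment.
# LAPACKE_NANCHECK is an assumed variable name (see MKL 2018 docs);
# it must be set before any MKL-backed module (numpy, neon) is imported.
import os

os.environ["LAPACKE_NANCHECK"] = "0"

# MKL-backed imports go only after the environment is prepared:
# import numpy
print(os.environ["LAPACKE_NANCHECK"])
```

Setting the variable in the shell before launching Python (`export LAPACKE_NANCHECK=0`) would have the same effect and avoids import-order pitfalls.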
@moderato Can you please use the MKL library downloaded automatically by neon (when you type "make")? You probably need to unset the MKL library path so that neon uses the default one under its root directory rather than the one from Intel Parallel Studio XE Student Edition. It may or may not make a difference; please let me know what you find.
@wei-v-wang I tried your suggestion. I edited the site.cfg file and .bashrc accordingly as follows: site.cfg
.bashrc (only the newly added part)
Actually I have no idea which compiler to use. I used the same settings as when I built numpy on top of the MKL that comes with Parallel Studio. Here's the tutorial I followed: https://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl However, this time it gave me an error like this:
Any ideas?
@moderato From the above, may I suggest starting from scratch: download neon and make sure the prerequisites are satisfied: python-pip, python-virtualenv, libhdf5-dev, libyaml-dev, pkg-config. From our side, we do not see a need to manually build numpy or scipy; they are automatically installed in a virtual environment. Our performance does not come from MKL-based numpy but from pure MKL, so please do not worry about building numpy/scipy on top of MKL. I believe "site.cfg" is something used by numpy? Please start from scratch: after downloading neon, just do "make clean && make", provided you have the aforementioned packages installed.
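The from-scratch flow described above can be sketched as the following commands (package names are the ones listed in the comment; the clone URL and virtualenv path are the usual neon defaults and may differ locally):

```shell
# Prerequisites named above (Ubuntu package names)
sudo apt-get install python-pip python-virtualenv libhdf5-dev libyaml-dev pkg-config

# Fresh neon checkout and build; "make" downloads its own MKL
# and sets up a virtualenv with numpy/scipy inside it
git clone https://github.com/NervanaSystems/neon.git
cd neon
make clean && make

# Use neon's own virtualenv rather than a system or conda Python
. .venv/bin/activate
pip list    # numpy installed by neon should appear here
```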
@wei-v-wang The reason why I built numpy manually is that the version of numpy coming with neon is 1.11.1, which is not the latest version, so every time I reinstall neon and get started, it first gives me an error about the numpy version, and then I have to reinstall numpy myself. Does the numpy version make any difference?
@moderato We have been using numpy 1.11.1 and numpy 1.13 without any problem. However, please note that numpy is recommended to live in the virtualenv. Please see below for the difference: the first "pip list" is without the virtualenv, and you can see numpy is not even installed. After activating the virtual environment with ". .venv/bin/activate" and doing another "pip list", it says numpy 1.11 is used. If it complains about numpy 1.11, can you please do "pip uninstall numpy"? I think neon will then take care of installing numpy in its own virtualenv. wangwei3@wangwei3-mac01: ~/git/private-neon$ pip list
@wei-v-wang The numpy version error I mentioned is like:
Every time I reinstall neon and run a neon program it pops up, and I usually solve it with 'pip uninstall numpy' followed by 'pip install numpy'. By the way, I use Anaconda for virtual environment management. Does that make any difference?
@moderato I might be wrong, but 0xb = numpy v1.11, and during runtime numpy v1.10 (0xa) is used? Anaconda might have added one more layer of confusion. Could you please use the regular Python distribution rather than the Python in Anaconda? So far, have you been able to use neon with the MKL backend? Did the "make" command eventually work?
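The guess above can be restated as a small sketch of what the "module compiled against API version 0xb but this version of numpy is 0xa" error means: a C extension was built against one numpy C-API version but a different numpy is found at runtime. The hex-to-release mapping below is only the commenter's assumption, not an authoritative table:

```python
# Illustrative sketch of the API-version mismatch discussed above.
# The mapping is an assumption from the comment (0xb ~ 1.11, 0xa ~ 1.10).
api_to_release = {
    0xa: "numpy 1.10.x",  # assumed
    0xb: "numpy 1.11.x",  # assumed
}

def explain_mismatch(compiled_against, runtime):
    # The extension expects one C-API version; the numpy importable at
    # runtime provides another, so the import fails.
    return ("extension built for %s, but %s is installed"
            % (api_to_release[compiled_against], api_to_release[runtime]))

print(explain_mismatch(0xb, 0xa))
```

The practical fix in the thread (uninstalling the stray numpy so neon's virtualenv copy wins) removes exactly this mismatch.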
@wei-v-wang I think I'll stick with Anaconda, as I have also installed many other necessary packages in that environment. The 'make' command always works; only the MKL backend produces weird results. I just rebuilt neon and tried to fix that error. Unlike with MKL-based numpy, the "RuntimeWarning: invalid value encountered" problem is gone, but the cost still explodes easily.
Ok, got you. Thanks for the feedback, @moderato. In the future we will probably make sure we test in an Anaconda environment as well.
If you cannot wait, could you apply the following patch and retry? @moderato Please change prepare_mkl.sh by applying only the following few lines of changes: old:
@wei-v-wang I see. Maybe the problem is related to Anaconda. For your reference, I am using miniconda3 with Python 3.5. Thanks for your help, and I'm looking forward to the new release!
@wei-v-wang Just edited the prepare_mkl.sh file and gave it a quick try. The good news is that the NaN cost problem is gone! But another problem has come up: the speed is much slower than I expected. The output looks like:
Based on the CNN model and the dataset I use, the ratio of time over the number of trained batches is around 1 with the neon CPU backend (which works fine), and around 0.5-0.6 using MKL-based numpy with the neon MKL backend (although it has the NaN cost). I think in my case this ratio is supposed to be less than 1, right? After all, the MKL backend should be faster than the CPU one. The CPU on my PC is an Intel i7. Hope this helps with your testing!
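The metric being compared above can be made explicit as wall-clock seconds divided by the number of batches trained; the numbers below are illustrative stand-ins for the ratios reported in the comment, not real measurements:

```python
# Sketch of the time-per-batch ratio discussed above (illustrative values).
def time_per_batch(elapsed_seconds, batches_trained):
    return elapsed_seconds / float(batches_trained)

cpu_ratio = time_per_batch(120.0, 120)  # ~1.0, as reported for the CPU backend
mkl_ratio = time_per_batch(66.0, 120)   # ~0.55, as reported for MKL-based numpy
print(cpu_ratio, mkl_ratio)
print(mkl_ratio < cpu_ratio)  # the MKL backend is expected to be faster
```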
@moderato Glad to know the NaN is gone, and thank you for the update!
@wei-v-wang Program code sent. Thank you for your help!
Thanks, @moderato. We will take a look and let you know when we have improved the performance.
@wei-v-wang I tried neon 2.2 on my model and it looks like the problem is gone! I also noticed that the MKL backend's training loss fluctuates much more than with the CPU and GPU backends, as well as other frameworks, although the validation accuracy is barely affected. Just letting you know in case it's helpful to your work. Thanks!
Hi @moderato, thanks for your update!
Hi @wei-v-wang, sorry for only just seeing the message... Yes, the NaN problem is gone, and MKL is much better than CPU in this version. It got the best performance in some of our tests, e.g. on ResNet-50 with an image size of 32x32, while it was beaten in other tests using a random model with a random size, e.g. 47x47. I wonder if that's because neon is only optimized for 2^n sizes and well-known models like ResNets?
Hi @moderato, I am very glad to hear your NaN problem is gone, with some good performance observations using MKL as well. Yes, you are right: we prioritized by accelerating the common/popular cases we are aware of. Neon is optimized not only for ResNets but also AlexNet, GoogLeNet, VGG, Deep Speech 2, etc. But we do not guarantee a universal performance advantage over the numpy CPU backend for any random model or random size.
Hi @moderato, I am closing this issue. Feel free to open more issues.
Hello! I use neon to train a model on three backends: CPU, MKL and GPU. All settings are the same when running on these backends. I got very similar costs from CPU and GPU, while the cost from the MKL backend was usually higher than the other two, sometimes even NaN. Does anybody have an idea why that happens?
The CPU is an Intel i7; the GPU is an NVIDIA GTX 1050; the code is running on Ubuntu 16.04. Here is the printed result of the code...
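For reference, the three backends can be selected with the `-b` flag that neon's argument parser exposes. This is a sketch: `mnist_mlp.py` is the stock neon example and stands in for the actual model script, which is not shown in the issue.

```shell
# Run the same script on each backend for a like-for-like cost comparison
python examples/mnist_mlp.py -b cpu
python examples/mnist_mlp.py -b mkl   # requires neon built via "make" so its MKL is present
python examples/mnist_mlp.py -b gpu   # requires a CUDA-capable GPU
```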