Threaded MKL for paddle #2379

Closed
wanglovesyang opened this issue Jun 5, 2017 · 4 comments
Labels
User (用于标记用户问题: used to tag user questions)

Comments

wanglovesyang commented Jun 5, 2017

I read cblas.cmake in Paddle and found that Paddle links libmkl_sequential.so, which means that all matrix operations on the CPU are done by one core (per trainer). This is reasonable on common server nodes (128 GB + 12 cores). However, I am currently using an Intel Xeon Phi CPU, which contains 256 cores. The 128 GB of memory cannot hold 256 trainers if I want to make use of all the computing resources.

Hence, I switched to libmkl_intel_thread.so (by changing the cmake file) to parallelize Paddle's GEMM operations, so that I can obtain 100% CPU usage while running 10 trainers. Unfortunately, the training process (1 h/pass, 100% CPU) is much slower than using libmkl_sequential.so with 10 trainers (0.5 h/pass, 5% CPU). This result seems nearly absurd to me. Could anyone help me check out this problem?
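For reference, the change amounts to swapping MKL's threading layer on the link line. A rough sketch, using Intel's standard MKL link lines as an assumption (the exact cblas.cmake contents are not shown here):

```bash
# Sequential MKL: single-threaded BLAS, one core per trainer.
MKL_SEQ="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm"

# Threaded MKL: GEMM parallelized through Intel's OpenMP runtime (iomp5).
MKL_THR="-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm"
```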

Xreki (Contributor) commented Jun 5, 2017

Hi @wanglovesyang, I am curious about the Xeon Phi you use. Do you use it as the CPU or as a co-processor? We have never run PaddlePaddle on a Xeon Phi, so thank you for trying it.

In PaddlePaddle, when you launch 10 trainers, 10 threads are created and each is assigned a trainer. Thus we use a single-threaded GEMM implementation and link libmkl_sequential.so instead of libmkl_intel_thread.so. In your test, how many threads did you use to run parallel MKL? I think you may need to adapt the number of threads for each trainer using MKL_NUM_THREADS, as well as the thread affinity.
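Something like the following is what I have in mind (a minimal sketch; the values are placeholders assuming 10 trainers on 256 cores and would need tuning):

```bash
# Cap the threads each trainer's MKL may spawn, so that
# 10 trainers do not oversubscribe the 256 cores.
export MKL_NUM_THREADS=25       # roughly 256 cores / 10 trainers
export OMP_NUM_THREADS=25

# Pin threads so each trainer's MKL threads stay on their own cores
# instead of migrating and contending (Intel OpenMP runtime setting).
export KMP_AFFINITY=granularity=fine,compact,1,0
```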

wanglovesyang (Author) commented

@Xreki I tried two thread settings, both with trainers=10:

  1. MKL_NUM_THREADS=25: since each trainer gets at most 25 MKL threads, at most 250 cores can be used.
  2. MKL_DYNAMIC=true: this uses MKL's dynamic thread scheduling to maximize CPU usage.

However, both of these settings run slower than the single-threaded setup, even though CPU usage is close to 100% most of the time. (The launch lines for the two runs are sketched below.)
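For concreteness, the two runs were launched roughly as below (a sketch assuming the usual paddle train command line; the remaining flags are omitted):

```bash
# Setting 1: a fixed 25 MKL threads per trainer (10 x 25 = 250 threads).
export MKL_NUM_THREADS=25

# Setting 2 (used instead of setting 1): let MKL pick the thread count
# dynamically at run time.
# export MKL_DYNAMIC=true

paddle train --trainer_count=10   # data/config flags omitted here
```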

luotao1 (Contributor) commented Jun 6, 2017

  1. What is the model of your Xeon Phi, and how many physical cores does it have?
  2. Are you using the Xeon Phi as an ordinary CPU, or as a co-processor?
  3. trainers=10 means 10 threads are launched; in general, running one thread per physical core is the recommended practice (see the sketch after this list).
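As a rough illustration of point 3 (a sketch; the sizing rule is a general guideline rather than a PaddlePaddle requirement, and trainers=10 is taken from the discussion above):

```bash
# Count physical cores: unique (core, socket) pairs, ignoring hyperthreads.
PHYS_CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)

# Size the per-trainer MKL thread count so that
# trainers * MKL_NUM_THREADS does not exceed the physical core count.
export MKL_NUM_THREADS=$((PHYS_CORES / 10))   # assuming trainers=10
```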

QiJune added the User (used to tag user questions) label on Jul 23, 2017
QiJune (Member) commented Jul 23, 2017

Closing for now, since there have been no updates for a long time; if there are further updates, feel free to reopen.

QiJune closed this as completed on Jul 23, 2017