Threaded MKL for paddle #2379

Closed
wanglovesyang opened this issue Jun 5, 2017 · 4 comments
Labels
User (用于标记用户问题: used to tag user questions)

Comments

wanglovesyang commented Jun 5, 2017

I read cblas.cmake in Paddle and found that Paddle links libmkl_sequential.so, which means that all matrix operations on the CPU are done by one core (per trainer). This is reasonable on common server nodes (128 GB + 12 cores). However, I am currently using an Intel Xeon Phi CPU, which contains 256 cores. The 128 GB of memory cannot hold 256 trainers if I want to make use of all the computing resources.

Hence, I switched to libmkl_intel_thread.so (by changing the cmake file) to parallelize Paddle's GEMM operations, so that I can obtain 100% CPU usage while running 10 trainers. Unfortunately, the training process (1 h/pass, 100% CPU) is much slower than using libmkl_sequential.so with 10 trainers (0.5 h/pass, 5% CPU). This result seems nearly absurd to me. Could anyone help me check out this problem?
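For reference, the change amounts to swapping MKL's threading layer on the link line. A rough sketch, using Intel's standard MKL link lines as an assumption (the exact cblas.cmake contents are not shown here):

```bash
# Sequential MKL: single-threaded BLAS, one core per trainer.
MKL_SEQ="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm"

# Threaded MKL: GEMM parallelized through Intel's OpenMP runtime (iomp5).
MKL_THR="-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm"
```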

Xreki (Contributor) commented Jun 5, 2017

Hi @wanglovesyang, I am curious about the Xeon Phi you use. Do you use it as the CPU or as a co-processor? We have never run PaddlePaddle on a Xeon Phi, so thank you for trying it.

In PaddlePaddle, when you launch 10 trainers, 10 threads are created and each is assigned a trainer. Thus we use a single-threaded GEMM implementation and link libmkl_sequential.so instead of libmkl_intel_thread.so. In your test, how many threads did you use to run parallel MKL? I think you may need to adapt the number of threads for each trainer using MKL_NUM_THREADS, as well as the thread affinity.
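Something like the following is what I have in mind (a minimal sketch; the values are placeholders assuming 10 trainers on 256 cores and would need tuning):

```bash
# Cap the threads each trainer's MKL may spawn, so that
# 10 trainers do not oversubscribe the 256 cores.
export MKL_NUM_THREADS=25       # roughly 256 cores / 10 trainers
export OMP_NUM_THREADS=25

# Pin threads so each trainer's MKL threads stay on their own cores
# instead of migrating and contending (Intel OpenMP runtime setting).
export KMP_AFFINITY=granularity=fine,compact,1,0
```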

wanglovesyang (Author) commented

@Xreki I tried two thread settings, both with trainers=10:

  1. MKL_NUM_THREADS=25: since each trainer gets at most 25 MKL threads, at most 250 cores can be used.
  2. MKL_DYNAMIC=true: this uses MKL's dynamic thread scheduling to maximize CPU usage.

However, both of these settings run slower than the single-threaded setup, even though CPU usage is close to 100% most of the time. (The launch lines for the two runs are sketched below.)
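For concreteness, the two runs were launched roughly as below (a sketch assuming the usual paddle train command line; the remaining flags are omitted):

```bash
# Setting 1: a fixed 25 MKL threads per trainer (10 x 25 = 250 threads).
export MKL_NUM_THREADS=25

# Setting 2 (used instead of setting 1): let MKL pick the thread count
# dynamically at run time.
# export MKL_DYNAMIC=true

paddle train --trainer_count=10   # data/config flags omitted here
```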

luotao1 (Contributor) commented Jun 6, 2017

  1. What is the model of your Xeon Phi, and how many physical cores does it have?
  2. Are you using the Xeon Phi as an ordinary CPU, or as a co-processor?
  3. trainers=10 means 10 threads are launched; in general, running one thread per physical core is the recommended practice (see the sketch after this list).
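As a rough illustration of point 3 (a sketch; the sizing rule is a general guideline rather than a PaddlePaddle requirement, and trainers=10 is taken from the discussion above):

```bash
# Count physical cores: unique (core, socket) pairs, ignoring hyperthreads.
PHYS_CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)

# Size the per-trainer MKL thread count so that
# trainers * MKL_NUM_THREADS does not exceed the physical core count.
export MKL_NUM_THREADS=$((PHYS_CORES / 10))   # assuming trainers=10
```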

QiJune added the User (used to tag user questions) label on Jul 23, 2017
QiJune (Member) commented Jul 23, 2017

Closing for now, since there have been no updates for a long time; if there are further updates, feel free to reopen.

QiJune closed this as completed on Jul 23, 2017