
Multithreading issue when running the CPU version locally #5280

Closed
CAOYUHUI opened this issue Nov 1, 2017 · 40 comments
Labels
User (label used to tag user questions)

Comments

@CAOYUHUI

CAOYUHUI commented Nov 1, 2017

In the DSSM example code I set trainer_count=32, but CPU usage during training shows that it is not running multithreaded. Training is very slow; any advice would be appreciated.

@peterzhang2029
Contributor

Hi, to help pin down the problem, please provide more information, including the actual training speed, the network configuration file, the format of the training data, etc. Thanks.

@CAOYUHUI
Author

CAOYUHUI commented Nov 1, 2017

Uploading 2.jpg…
I'm using the DSSM example with the fc model for a classification task; the training data format is the same as in the provided example files.

@peterzhang2029
Contributor

Hi, the image here is not displaying. Please paste the corresponding code or other text information so everyone can search it. Thanks.

@peterzhang2029 added the User label (used to tag user questions) Nov 1, 2017
@CAOYUHUI
Author

CAOYUHUI commented Nov 1, 2017

CPU utilization details:
Cpu0 : 98.3% us, 1.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu1 : 1.7% us, 5.3% sy, 0.0% ni, 93.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu2 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu3 : 2.0% us, 5.6% sy, 0.0% ni, 92.4% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu4 : 1.7% us, 5.6% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu5 : 1.7% us, 5.6% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu6 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu7 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu8 : 3.0% us, 6.3% sy, 0.3% ni, 90.4% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu9 : 2.3% us, 7.6% sy, 0.0% ni, 90.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu10 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu11 : 2.3% us, 5.0% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu12 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu13 : 1.7% us, 5.6% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu14 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu15 : 2.0% us, 5.3% sy, 0.0% ni, 92.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu16 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu17 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu18 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu19 : 0.3% us, 0.3% sy, 0.0% ni, 99.3% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu20 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu21 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu22 : 0.0% us, 0.0% sy, 0.0% ni, 98.3% id, 1.7% wa, 0.0% hi, 0.0% si
Cpu23 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu24 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu25 : 0.0% us, 0.3% sy, 0.0% ni, 99.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu26 : 0.0% us, 0.3% sy, 0.0% ni, 99.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu27 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu28 : 0.0% us, 0.0% sy, 0.3% ni, 99.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu29 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu30 : 0.0% us, 0.0% sy, 0.3% ni, 99.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu31 : 0.7% us, 0.3% sy, 0.0% ni, 99.0% id, 0.0% wa, 0.0% hi, 0.0% si

The script used to run this example:
${python} train.py -y 0 --model_arch 0 --class_num 2 \
    --train_data_dir './data/train/' \
    --test_data_path './data/test_data.txt' \
    --source_dic_path './data/dict' \
    --target_dic_path './data/dict' \
    --batch_size 1024 \
    --num_passes 50 \
    --dnn_dims '128,64,32' \
    --num_workers 10 \
    --model_output_prefix './models/' \
    --share_embed True

@windy444

windy444 commented Nov 2, 2017

I've hit a similar problem here. It appeared today after I updated to the latest version.
My code is based on the nce example; basically only the input-data reading part was changed:
https://github.com/PaddlePaddle/models/tree/641554898bae59e68a909e299c84074a645d5464/nce_cost

paddle.init(use_gpu=False, trainer_count=24)
optimizer = paddle.optimizer.Adam(learning_rate=3e-2)
trainer.train(
    paddle.batch(
        paddle.reader.shuffle(
            lambda: reader.train_reader(train_data, word_dict, 5)(),
            buf_size=8000),
        6400),
    num_passes=1000,
    event_handler=event_handler)

Current CPU usage:
top - 15:55:19 up 423 days, 2:01, 11 users, load average: 17.92, 27.74, 27.97
Tasks: 609 total, 1 running, 608 sleeping, 0 stopped, 0 zombie
Cpu0 : 91.4%us, 7.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st
Cpu1 : 8.7%us, 41.7%sy, 0.0%ni, 49.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 6.3%us, 42.9%sy, 0.3%ni, 50.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 7.0%us, 42.4%sy, 0.0%ni, 50.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 7.6%us, 41.7%sy, 0.0%ni, 50.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 8.3%us, 41.3%sy, 0.0%ni, 50.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 7.6%us, 41.7%sy, 0.0%ni, 50.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 7.0%us, 42.4%sy, 0.0%ni, 50.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 7.9%us, 42.7%sy, 0.3%ni, 49.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 8.3%us, 41.7%sy, 0.3%ni, 49.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 8.3%us, 41.5%sy, 0.0%ni, 50.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 8.3%us, 41.4%sy, 0.3%ni, 50.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 8.0%us, 42.0%sy, 0.0%ni, 50.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 8.4%us, 41.8%sy, 0.0%ni, 49.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 9.6%us, 40.9%sy, 0.0%ni, 49.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 7.6%us, 42.2%sy, 0.3%ni, 46.5%id, 3.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.3%us, 0.0%sy, 0.0%ni, 99.0%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 1.0%us, 0.3%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.7%us, 1.3%sy, 0.0%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 1.0%us, 1.7%sy, 0.0%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.7%us, 1.0%sy, 0.3%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 1.7%us, 1.3%sy, 0.3%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.3%us, 1.7%sy, 0.7%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.7%us, 2.0%sy, 1.0%ni, 96.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 1.0%us, 1.7%sy, 1.7%ni, 95.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu31 : 1.0%us, 2.0%sy, 0.7%ni, 94.7%id, 1.7%wa, 0.0%hi, 0.0%si, 0.0%st
But on the previous version each core reached roughly 50% utilization; total run time is now about 5x what it was.

Installing the latest version produced this error:
"PaddlePaddle wasn't compiled to use avx instructions, but these are available on your machine"
I fixed it afterwards, but CPU behavior and run time were similar before and after the fix.

One more question: I have never managed to observe any difference in run time between single-machine multi-threaded and single-threaded runs. I adjusted batch_size and the learning rate but saw no effect, although at least CPU utilization used to go up.

@typhoonzero
Contributor

In both cases above, CPU0 is completely saturated: Cpu0 : 91.4%us, 7.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st. On Linux, CPU0 usually handles system interrupts, scheduling, and similar work, so it can become the bottleneck on an SMP machine.

Suggested fixes: use Linux taskset to set CPU affinity so the paddle process does not use CPU0, or find the other processes occupying CPU0 and pin them to CPUs other than CPU0. If NIC interrupt handling is the slow part, a tool such as irqbalance may also help.

References:

https://linux.die.net/man/1/taskset

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Affinity.html

https://unix.stackexchange.com/questions/73/how-can-i-set-the-processor-affinity-of-a-process-on-linux
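The taskset advice above can also be applied from inside the training script itself. A minimal sketch using the Linux-only os.sched_setaffinity (the helper name pin_away_from_cpu0 is my own, not part of Paddle), under the assumption that it is called before paddle.init() so worker threads inherit the mask:

```python
import os

def pin_away_from_cpu0(pid=0):
    """Restrict `pid` (0 = the current process) to every CPU except CPU0,
    leaving CPU0 free for interrupt and scheduler work. Linux-only."""
    cpus = os.sched_getaffinity(pid)
    if len(cpus) > 1:          # don't strand a single-CPU machine
        cpus.discard(0)
    os.sched_setaffinity(pid, cpus)
    return os.sched_getaffinity(pid)
```

Roughly equivalent to launching with `taskset -c 1-31 python train.py ...` on a 32-core box.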

@windy444

windy444 commented Nov 2, 2017

The one I posted isn't fully saturated, is it? CPU0 is around 90%, 93% at most. @typhoonzero

Below is the situation with affinity set to CPU 2. Apart from this job, no other task is using significant resources.

Cpu0 : 90.7%us, 2.3%sy, 0.0%ni, 6.6%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1 : 4.0%us, 20.1%sy, 0.0%ni, 75.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 11.8%us, 19.1%sy, 0.3%ni, 68.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 4.3%us, 19.9%sy, 0.0%ni, 75.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 4.7%us, 19.6%sy, 0.7%ni, 75.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 4.7%us, 19.6%sy, 0.0%ni, 75.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 4.7%us, 19.3%sy, 0.0%ni, 76.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 4.0%us, 19.9%sy, 0.0%ni, 76.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 4.3%us, 20.3%sy, 0.3%ni, 75.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 4.0%us, 20.1%sy, 0.0%ni, 75.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 4.0%us, 20.6%sy, 0.0%ni, 75.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 5.0%us, 19.9%sy, 0.3%ni, 74.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 5.0%us, 19.6%sy, 0.0%ni, 75.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 5.0%us, 19.6%sy, 0.3%ni, 75.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 6.0%us, 19.0%sy, 0.3%ni, 74.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 5.0%us, 19.9%sy, 0.0%ni, 75.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.0%us, 1.0%sy, 0.3%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.3%us, 0.0%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 0.7%us, 0.7%sy, 0.3%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.7%us, 1.0%sy, 0.0%ni, 97.4%id, 1.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 0.3%us, 1.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.3%us, 0.7%sy, 1.0%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.7%us, 1.3%sy, 0.3%ni, 97.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 0.3%us, 2.0%sy, 0.7%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu31 : 1.3%us, 2.3%sy, 0.3%ni, 96.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

@typhoonzero
Contributor

@windy444 In Cpu0 : 91.4%us, 7.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st, the us, sy, and si fields are all usage (user-space CPU, kernel-space CPU, and soft interrupts, respectively); id is idle. So CPU0 is fully (100%) busy.

In the second CPU snapshot, CPU0 still has very high utilization even though there are no other tasks.

You can also check whether the reader has become the training bottleneck: use paddle.v2.reader.buffered to buffer the reader's data and raise throughput. See: http://doc.paddlepaddle.org/release/0.10.0/doc/api/v2/data.html#reader
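For intuition, the idea behind a buffered reader can be sketched in plain Python. This is a simplified stand-in, not the actual paddle.v2.reader.buffered implementation: a background thread pre-fetches items into a bounded queue so the training loop rarely blocks on data loading.

```python
import threading
from queue import Queue

def buffered(reader, size):
    """Wrap a reader creator: a background thread keeps up to `size`
    items pre-fetched so the consumer rarely waits on data loading."""
    end = object()  # sentinel marking reader exhaustion

    def data_reader():
        q = Queue(maxsize=size)

        def fill():
            for item in reader():
                q.put(item)
            q.put(end)

        threading.Thread(target=fill, daemon=True).start()
        while True:
            item = q.get()
            if item is end:
                return
            yield item

    return data_reader

# Example: wrap a toy reader producing 0..9
wrapped = buffered(lambda: iter(range(10)), size=3)
print(list(wrapped()))  # prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```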

@Yancey1989
Contributor

Hi @windy444, can you check whether it's the reader part consuming the CPU? (A Python program occupies only one core.)

You can try Python's profiling tools: https://docs.python.org/2/library/profile.html
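As a concrete starting point, a profiling run might look like this; slow_reader here is a hypothetical stand-in for whichever data-reader function is under suspicion:

```python
import cProfile
import io
import pstats

def slow_reader():
    # Hypothetical stand-in for a CPU-heavy data reader.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
slow_reader()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # top-5 entries by cumulative time
```

If the reader dominates the cumulative-time column, it is the bottleneck rather than the trainer threads.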

@windy444

windy444 commented Nov 6, 2017

@typhoonzero I tried buffered, but the total run time is about the same. I'm not sure whether I'm using it correctly.
trainer.train(
    paddle.batch(
        paddle.reader.shuffle(
            lambda: paddle.reader.buffered(
                reader.train_reader(train_data, word_dict, 5), 1000)(),
            buf_size=2000),
        640),
    num_passes=1000,
    event_handler=event_handler)

Also, I found that even when running with a single thread, many cores are occupied. With 24 cores the picture is about the same as this:
top - 14:59:56 up 427 days, 1:05, 9 users, load average: 10.70, 8.82, 6.54
Tasks: 609 total, 1 running, 608 sleeping, 0 stopped, 0 zombie
Cpu0 : 99.7%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1 : 10.3%us, 45.5%sy, 0.3%ni, 43.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 10.6%us, 45.2%sy, 0.0%ni, 44.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 11.6%us, 44.5%sy, 0.0%ni, 43.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 10.6%us, 45.2%sy, 0.0%ni, 44.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 10.0%us, 45.8%sy, 0.0%ni, 44.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 9.9%us, 45.9%sy, 0.3%ni, 43.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 11.6%us, 44.2%sy, 0.3%ni, 43.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 11.0%us, 46.2%sy, 0.7%ni, 42.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 10.3%us, 45.8%sy, 0.0%ni, 43.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 11.3%us, 45.7%sy, 0.0%ni, 43.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 10.6%us, 46.0%sy, 0.3%ni, 43.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 10.3%us, 46.2%sy, 0.3%ni, 43.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 10.6%us, 45.8%sy, 0.0%ni, 43.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 11.6%us, 44.7%sy, 0.3%ni, 43.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 11.6%us, 44.2%sy, 0.3%ni, 43.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.0%us, 3.3%sy, 1.0%ni, 95.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.3%us, 3.3%sy, 0.7%ni, 95.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 1.0%us, 3.7%sy, 0.7%ni, 94.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.0%us, 2.0%sy, 0.3%ni, 97.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.3%us, 1.3%sy, 0.3%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.0%us, 2.7%sy, 1.7%ni, 95.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.7%us, 4.7%sy, 1.7%ni, 93.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.7%us, 2.0%sy, 0.7%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.7%us, 7.0%sy, 2.0%ni, 90.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 0.3%us, 11.0%sy, 4.0%ni, 84.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.7%us, 6.4%sy, 2.0%ni, 91.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 2.0%us, 6.3%sy, 1.7%ni, 90.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.3%us, 16.8%sy, 1.7%ni, 80.5%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.3%us, 8.0%sy, 2.0%ni, 89.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 0.0%us, 6.4%sy, 3.7%ni, 89.6%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu31 : 0.3%us, 9.0%sy, 1.7%ni, 89.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

@typhoonzero
Contributor

@windy444 The usage is correct; you can try a larger buffer size. Also, could you show the CPU utilization when the machine is idle?

@windy444

windy444 commented Nov 6, 2017

@typhoonzero CPU state when idle:
top - 17:55:14 up 427 days, 4:00, 9 users, load average: 5.24, 7.62, 5.55
Tasks: 602 total, 1 running, 601 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.3%sy, 0.0%ni, 98.7%id, 1.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.3%us, 0.3%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.3%us, 0.0%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.3%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.3%us, 0.3%sy, 0.7%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.7%sy, 0.7%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.3%us, 0.0%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.0%us, 0.7%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.0%us, 0.7%sy, 1.4%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.3%us, 1.0%sy, 0.3%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.7%us, 1.3%sy, 0.0%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 0.3%us, 0.7%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.3%us, 0.7%sy, 0.3%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.3%us, 0.7%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 0.0%us, 0.3%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu31 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

After increasing the buffer size:
--use_gpu=False --trainer_count=1
trainer.train(
    paddle.batch(
        paddle.reader.shuffle(
            lambda: paddle.reader.buffered(
                reader.train_reader(train_data, word_dict, 5), 100000)(),
            buf_size=2000),
        640),
    num_passes=1000,
    event_handler=event_handler)

top - 17:58:02 up 427 days, 4:03, 10 users, load average: 5.02, 5.60, 5.06
Tasks: 609 total, 2 running, 607 sleeping, 0 stopped, 0 zombie
Cpu0 : 99.3%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1 : 11.6%us, 45.2%sy, 0.3%ni, 42.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 11.2%us, 45.4%sy, 0.3%ni, 42.8%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu3 : 10.6%us, 48.3%sy, 0.0%ni, 41.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 10.0%us, 46.8%sy, 0.0%ni, 43.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 10.9%us, 45.7%sy, 0.3%ni, 43.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 11.9%us, 44.9%sy, 0.0%ni, 43.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 11.3%us, 45.5%sy, 0.3%ni, 42.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 10.6%us, 46.0%sy, 0.0%ni, 43.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 10.6%us, 46.2%sy, 0.0%ni, 43.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 11.3%us, 45.4%sy, 0.0%ni, 43.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 10.9%us, 46.2%sy, 0.0%ni, 42.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 9.2%us, 47.4%sy, 0.0%ni, 43.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 11.3%us, 45.4%sy, 0.0%ni, 43.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 11.5%us, 45.1%sy, 0.0%ni, 43.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 11.9%us, 44.9%sy, 0.0%ni, 43.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.3%us, 3.7%sy, 0.3%ni, 95.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 1.0%us, 4.3%sy, 0.7%ni, 94.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.7%us, 2.3%sy, 0.3%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.7%us, 1.3%sy, 0.7%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.3%us, 1.4%sy, 0.7%ni, 97.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.3%us, 1.3%sy, 2.0%ni, 96.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.3%us, 0.7%sy, 0.3%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.3%us, 1.3%sy, 1.7%ni, 96.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.3%us, 1.3%sy, 0.3%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 0.3%us, 1.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.3%us, 1.0%sy, 0.3%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 0.0%us, 5.0%sy, 0.3%ni, 94.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 97.7%us, 2.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 0.3%us, 0.7%sy, 0.7%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu31 : 0.3%us, 2.6%sy, 0.3%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

@Yancey1989
Contributor

Hi @windy444, could you show the CPU utilization of the python process with trainer_count=1 and with trainer_count=2?

@windy444

windy444 commented Nov 9, 2017

@Yancey1989
First, I confirmed that when no paddle job is running, overall CPU usage is essentially zero.
trainer_count=1
top - 17:00:50 up 430 days, 3:06, 9 users, load average: 4.25, 1.64, 0.78
Tasks: 607 total, 1 running, 605 sleeping, 1 stopped, 0 zombie
Cpu0 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 2.3%us, 10.7%sy, 0.7%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 2.3%us, 10.4%sy, 0.7%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 3.0%us, 10.0%sy, 0.3%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 2.7%us, 10.3%sy, 0.3%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 2.3%us, 10.0%sy, 0.3%ni, 87.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 3.0%us, 10.0%sy, 0.0%ni, 87.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 3.0%us, 9.7%sy, 0.7%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 2.3%us, 10.0%sy, 0.0%ni, 87.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 3.0%us, 9.7%sy, 0.7%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 2.7%us, 10.0%sy, 0.0%ni, 87.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 2.3%us, 11.0%sy, 0.0%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 2.7%us, 10.1%sy, 0.0%ni, 87.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 3.0%us, 9.9%sy, 0.0%ni, 87.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 2.7%us, 9.7%sy, 0.3%ni, 87.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 1.7%us, 10.6%sy, 0.0%ni, 87.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.3%us, 2.3%sy, 1.0%ni, 96.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.7%us, 2.3%sy, 0.3%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.7%us, 1.7%sy, 0.3%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.3%us, 1.0%sy, 0.3%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.7%us, 0.7%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 1.0%us, 1.3%sy, 1.0%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.0%us, 0.7%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 0.7%us, 1.7%sy, 0.3%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.3%us, 1.7%sy, 0.7%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.3%us, 0.7%sy, 0.3%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.0%us, 0.7%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 0.0%us, 1.0%sy, 0.7%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu31 : 15.3%us, 54.5%sy, 0.0%ni, 30.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

trainer_count=2
top - 17:04:25 up 430 days, 3:10, 9 users, load average: 2.99, 2.10, 1.14
Tasks: 608 total, 1 running, 606 sleeping, 1 stopped, 0 zombie
Cpu0 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 2.3%us, 10.7%sy, 0.3%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 2.3%us, 10.3%sy, 0.7%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 2.6%us, 10.3%sy, 0.3%ni, 86.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 2.3%us, 10.7%sy, 0.3%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 2.3%us, 10.3%sy, 0.0%ni, 87.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 2.3%us, 10.6%sy, 0.0%ni, 87.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 2.3%us, 10.3%sy, 0.3%ni, 87.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 2.7%us, 10.3%sy, 0.3%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 2.7%us, 10.3%sy, 0.0%ni, 87.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 3.0%us, 10.3%sy, 0.0%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 2.7%us, 10.7%sy, 0.0%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 2.3%us, 11.1%sy, 0.0%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 2.7%us, 10.4%sy, 0.3%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 2.7%us, 10.7%sy, 0.3%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 2.3%us, 11.0%sy, 0.0%ni, 86.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.0%us, 0.7%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.3%us, 0.0%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.3%us, 0.3%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.3%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 0.3%us, 0.3%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.3%us, 0.7%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 0.3%us, 0.7%sy, 0.3%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.7%us, 0.7%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.3%us, 1.7%sy, 0.7%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 0.0%us, 1.0%sy, 1.0%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu31 : 0.7%us, 1.0%sy, 0.0%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

@Yancey1989
Contributor

@windy444 Something like this:

trainer_count=1

%Cpu0  : 76.5 us,  1.3 sy,  0.0 ni, 20.2 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 18.9 us,  0.3 sy,  0.0 ni, 80.1 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :  4.3 us,  0.7 sy,  0.0 ni, 94.7 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu13 :  3.0 us,  0.7 sy,  0.0 ni, 96.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :  0.7 us,  1.3 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  0.3 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 :  0.7 us,  0.3 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 :  1.3 us,  0.7 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 : 13.0 us,  1.3 sy,  0.0 ni, 85.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 :  2.0 us,  2.3 sy,  0.0 ni, 95.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu20 : 13.6 us,  1.3 sy,  0.0 ni, 85.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 :  3.3 us,  1.7 sy,  0.0 ni, 95.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  2.6 us,  0.7 sy,  0.0 ni, 96.4 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu23 : 98.3 us,  0.0 sy,  0.0 ni,  1.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem : 26404276+total, 57028208 free, 25900816 used, 18111374+buff/cache
KiB Swap:   975868 total,        0 free,   975868 used. 23048297+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15099 root      20   0 2066412 651536  49528 R 100.0  0.2   0:07.46 python train.py -y 0 --model_arch 0 --class_num=2 --num_passes=100 --num_workers=1

trainer_count=2

%Cpu0  :  6.2 us,  6.2 sy,  0.0 ni, 87.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 41.2 us,  5.9 sy,  0.0 ni, 52.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 93.3 us,  0.0 sy,  0.0 ni,  6.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  : 50.0 us, 12.5 sy,  0.0 ni, 37.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 : 88.2 us,  5.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  5.9 si,  0.0 st
%Cpu14 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 : 26.7 us,  6.7 sy,  0.0 ni, 66.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 : 93.3 us,  6.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu20 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 : 88.2 us,  5.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  5.9 si,  0.0 st
%Cpu22 : 56.2 us,  6.2 sy,  0.0 ni, 37.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 : 93.8 us,  6.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 26404276+total, 57017188 free, 25997252 used, 18102832+buff/cache
KiB Swap:   975868 total,        0 free,   975868 used. 23038672+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15303 root      20   0 3542044 660096  49088 R  1894  0.2   0:20.57 python train.py -y 0 --model_arch 0 --class_num=2 --num_passes=100 --num_workers=2

@windy444

windy444 commented Nov 9, 2017

@Yancey1989
trainer_count=1
top - 17:48:28 up 430 days, 3:54, 9 users, load average: 2.65, 2.37, 1.61
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu0 :100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 3.7%us, 13.0%sy, 0.3%ni, 83.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 3.7%us, 13.3%sy, 0.0%ni, 83.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 3.0%us, 13.7%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 3.7%us, 13.3%sy, 0.0%ni, 83.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 3.3%us, 13.4%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 2.7%us, 13.8%sy, 0.0%ni, 83.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 3.3%us, 13.4%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 4.0%us, 12.7%sy, 1.0%ni, 82.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 3.3%us, 13.4%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 3.7%us, 13.0%sy, 1.0%ni, 82.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 2.7%us, 14.0%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 3.3%us, 13.3%sy, 0.0%ni, 83.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu13 : 3.4%us, 13.6%sy, 0.7%ni, 82.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 3.7%us, 13.1%sy, 0.0%ni, 83.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 3.7%us, 13.0%sy, 0.0%ni, 83.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.3%us, 0.7%sy, 0.0%ni, 98.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.3%us, 1.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.7%us, 1.0%sy, 0.3%ni, 98.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 0.3%us, 1.3%sy, 0.0%ni, 98.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.3%us, 1.0%sy, 1.0%ni, 97.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 0.3%us, 0.0%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.3%us, 0.7%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.3%us, 0.7%sy, 0.3%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 0.0%us, 0.3%sy, 0.7%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu31 : 0.7%us, 0.7%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 131836552k total, 125252140k used, 6584412k free, 412236k buffers
Swap: 0k total, 0k used, 0k free, 63336072k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18522 work 20 0 3644m 1.3g 22m S 344.0 1.0 3:54.04 python

trainer_count=2
top - 17:43:27 up 430 days, 3:49, 9 users, load average: 4.87, 2.75, 1.40
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu0 : 99.0%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.7%si, 0.0%st
Cpu1 : 2.7%us, 10.7%sy, 0.7%ni, 85.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 2.3%us, 11.4%sy, 0.3%ni, 85.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 2.0%us, 11.4%sy, 0.3%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 3.4%us, 10.4%sy, 0.7%ni, 85.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 2.3%us, 11.4%sy, 0.7%ni, 85.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 2.3%us, 11.4%sy, 0.0%ni, 86.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 2.7%us, 11.0%sy, 0.7%ni, 85.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 2.7%us, 11.0%sy, 0.0%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 2.3%us, 11.3%sy, 0.0%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 2.3%us, 11.4%sy, 0.0%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 2.0%us, 11.7%sy, 0.0%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 2.7%us, 11.0%sy, 0.0%ni, 86.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 2.3%us, 11.0%sy, 0.0%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 3.0%us, 10.9%sy, 0.0%ni, 86.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 2.0%us, 11.4%sy, 0.0%ni, 86.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu16 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 0.0%us, 0.0%sy, 0.3%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 0.3%us, 0.3%sy, 0.3%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.0%us, 0.3%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 1.0%us, 1.4%sy, 0.3%ni, 97.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 0.0%us, 0.3%sy, 0.3%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu25 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 0.0%us, 0.7%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 0.3%us, 1.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu28 : 0.3%us, 1.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 0.3%us, 0.7%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 0.0%us, 0.7%sy, 0.7%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu31 : 0.3%us, 1.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 131836552k total, 125839760k used, 5996792k free, 412228k buffers
Swap: 0k total, 0k used, 0k free, 63329476k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6429 work 20 0 4463m 1.9g 22m S 297.7 1.5 11:03.32 python

@luotao1
Contributor

luotao1 commented Nov 13, 2017

But in an earlier version, each core could reach roughly 50% usage. Now the overall running time is about 5x what it was before.

Which version were you using before? @windy444

@Yancey1989
Contributor

Judging from the gdb output, there are many iomp threads even when trainer_count=1:

(gdb) info thread
  Id   Target Id         Frame
* 1    Thread 0x7f915f7d8700 (LWP 51) "python" __memset_avx2 () at ../sysdeps/x86_64/multiarch/memset-avx2.S:161
  2    Thread 0x7f9147652700 (LWP 79) "python" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:423
  3    Thread 0x7f9147e53700 (LWP 80) "python" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:423
  4    Thread 0x7f9148654700 (LWP 81) "python" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:423
  5    Thread 0x7f9148e55700 (LWP 82) "python" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:423
  6    Thread 0x7f911b34b700 (LWP 83) "python" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:423
  7    Thread 0x7f9111254780 (LWP 84) "python" 0x00007f911b3e4bd6 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /usr/local/lib/libiomp5.so
  8    Thread 0x7f9110e53800 (LWP 85) "python" 0x00007f911b3e4c61 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /usr/local/lib/libiomp5.so
  9    Thread 0x7f9110a52880 (LWP 86) "python" 0x00007f915f0c07f7 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
  10   Thread 0x7f9110651900 (LWP 87) "python" 0x00007f911b3e4c61 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /usr/local/lib/libiomp5.so
  11   Thread 0x7f90e3ffc980 (LWP 88) "python" 0x00007f911b3e4c68 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /usr/local/lib/libiomp5.so
  12   Thread 0x7f90e3bfba00 (LWP 89) "python" 0x00007f911b3e4cc2 in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /usr/local/lib/libiomp5.so
  13   Thread 0x7f90e37faa80 (LWP 90) "python" 0x00007f915f0c07f7 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
  14   Thread 0x7f90e33f9b00 (LWP 91) "python" 0x00007f911b3e4c5c in _INTERNAL_25_______src_kmp_barrier_cpp_34128d84::__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) () from /usr/local/lib/libiomp5.so
...

After syncing with @luotao1: MKL automatically saturates the CPU to optimize compute performance. So, regarding

Also, I found that even when running with a single thread, many cores are occupied. With 24 cores the situation is about the same.

this should be MKL doing automatic performance optimization.

@typhoonzero
Contributor

If iomp is enabled by default, can we set the number of iomp threads to speed up training? Or, alternatively, how do we disable iomp and rely on trainer_count for the speedup?

@luotao1
Contributor

luotao1 commented Nov 13, 2017

When accelerating with MKL, set the following environment variables to get the best speedup:

unset OMP_NUM_THREADS MKL_NUM_THREADS
export OMP_DYNAMIC="FALSE"
export KMP_AFFINITY="granularity=fine,compact,0,0"   # if Hyper-Threading is OFF on this machine
# export KMP_AFFINITY="granularity=fine,compact,1,0" # if Hyper-Threading is ON
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

@Yancey1989
Contributor

Should these environment variables be set to the same value as trainer_count? I found that with OMP_NUM_THREADS=1, MKL_NUM_THREADS=1 and trainer_count=10, only one core is actually used.

@tensor-tang
Contributor

Putting together the issues from @CAOYUHUI and @windy444, the root cause looks like the cores are not being pinned.

  1. First, the trainer_count=1 case. The fraction of time that multiple cores are actually in use is not fixed: it depends on the workload and on how the network is written. Since CPU utilization fluctuates constantly, look at the peak values and check whether the cores are ever fully saturated.
    Also, even with trainer_count=1, Paddle automatically uses multiple cores, so as long as the hardware supports it you should at least see moments where all cores are busy (as noted, not all of the time). If such moments never occur, suspect that the cores are not pinned. Hyper-Threading is usually enabled, so pin the cores with export KMP_AFFINITY="granularity=fine,compact,1,0" and check again. If you are not sure whether Hyper-Threading is on, check with lscpu | grep "per core"; a value > 1 means it is on. At this stage OMP_NUM_THREADS and MKL_NUM_THREADS should be unset, since machines normally do not set them by default.

  2. Once the trainer_count=1 case is settled, consider trainer_count > 1. Here you need to set OMP_NUM_THREADS to get the best performance (MKL_NUM_THREADS can usually be set to the same value); this keeps Paddle at a consistently high utilization.
    The value of OMP_NUM_THREADS should be chosen based on how many hardware threads the machine actually has. Suppose the maximum number of CPUs available to the system is MAX_N. If this is the only job on the machine, then OMP_NUM_THREADS = int(MAX_N / trainer_count) gives the best performance.
    As for how many trainers make the whole system fastest: performance does not scale linearly with the number of cores, so tune it for your actual workload. For example, with MAX_N=50 and batch_size=64, trainer_count=8 with OMP_NUM_THREADS=6 works well.
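The rule above is a one-line integer division; here is a minimal shell sketch using the example numbers from the comment (MAX_N=50 and TRAINER_COUNT=8 are illustrative values, not constants — in practice MAX_N would come from `nproc` or `top`):

```shell
# Sketch of the rule above: threads per trainer = int(MAX_N / trainer_count).
MAX_N=50                                     # example value from the comment
TRAINER_COUNT=8                              # example value from the comment
OMP_NUM_THREADS=$((MAX_N / TRAINER_COUNT))   # shell arithmetic is integer division
MKL_NUM_THREADS=${OMP_NUM_THREADS}
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS} MKL_NUM_THREADS=${MKL_NUM_THREADS}"
# prints: OMP_NUM_THREADS=6 MKL_NUM_THREADS=6
```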

@luotao1
Contributor

luotao1 commented Nov 13, 2017

Thanks @tensor-tang for the detailed answer. A few remaining questions:

  1. Can the core-pinning step be packaged into the Docker image?
  2. "Suppose the maximum number of CPUs available to the system is MAX_N": is that the actual number of physical cores? Which command measures it?

@tensor-tang
Contributor

  1. That should be possible.
  2. MAX_N is the maximum number of CPUs shown in the CPU-usage listings posted above; it is not necessarily the number of physical cores. Either top or lscpu will show it.
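On Linux the logical CPU count (MAX_N here) can also be read non-interactively; the commands below are standard coreutils/util-linux/procfs tools, not Paddle-specific:

```shell
# Three equivalent ways to read the logical CPU count on Linux.
nproc                                 # coreutils: logical CPUs available to this process
lscpu | grep '^CPU(s):'               # util-linux: same count, labelled
grep -c '^processor' /proc/cpuinfo    # count processor entries directly
```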

@luotao1
Contributor

luotao1 commented Nov 15, 2017

@CAOYUHUI @windy444: Please first use the following script to pin the cores and set the optimal MKL_NUM_THREADS and OMP_NUM_THREADS. It works both in a local Linux environment and in Docker:

#!/bin/bash 

logicalNumber=$(grep "processor" /proc/cpuinfo|sort -u|wc -l)
physicalNumber=$(grep "physical id" /proc/cpuinfo|sort -u|wc -l)
coreNumber=$(grep "cpu cores" /proc/cpuinfo|uniq|awk -F':' '{print $2}'|xargs)
HT=$((logicalNumber / (physicalNumber * coreNumber))) 

echo "****** CPU Information ******"
echo "Logical CPU Number  : ${logicalNumber}"
echo "Physical CPU Number : ${physicalNumber}"
echo "CPU Core Number     : ${coreNumber}"

if [ ${HT} -ne 1 ]; then
    echo "Hyper Threading(HT) : ON"
    export KMP_AFFINITY="granularity=fine,compact,1,0"
else
    echo "Hyper Threading(HT) : OFF"
    export KMP_AFFINITY="granularity=fine,compact,0,0"
fi

echo "********** Settings *********"
unset OMP_NUM_THREADS MKL_NUM_THREADS
trainerCount=$1
numThreads=$((logicalNumber / trainerCount))
export OMP_NUM_THREADS=${numThreads}
export MKL_NUM_THREADS=${numThreads}

echo "Trainer Count      : ${trainerCount}"
echo "OMP_NUM_THREADS    : ${OMP_NUM_THREADS}"
echo "MKL_NUM_THREADS    : ${MKL_NUM_THREADS}"

Save the script above as cpu_configure.sh; use it as follows:

sh cpu_configure.sh TRAINER_COUNT

Here is the result of running it on my server:

$ sh cpu_configure.sh 2
****** CPU Information ******
Logical CPU Number  : 12
Physical CPU Number : 2
CPU Core Number     : 6
Hyper Threading(HT) : OFF
********** Settings *********
Trainer Count      : 2
OMP_NUM_THREADS    : 6
MKL_NUM_THREADS    : 6

Next, @tensor-tang will add this functionality to the source code.

@luotao1
Contributor

luotao1 commented Nov 16, 2017

@CAOYUHUI @windy444
Please update your code: when using MKL, core pinning and the optimal MKL_NUM_THREADS and OMP_NUM_THREADS are now set automatically.

@Bella-Zhao

Bella-Zhao commented Nov 16, 2017

@luotao1 I tested the script above. In my script I run:

sh cpu_configure.sh ${TRAINER_COUNT}
python train.py \
    --train_data_path /home/work/zhaoyijin/video-recsys-model/dssm/train_data_dir/train/train \
    --test_data_path /home/work/zhaoyijin/video-recsys-model/dssm/test_data_dir/test/test \
    --dic_path /home/work/zhaoyijin/video-recsys-model/dssm/dict_data_dir/feature_dict \
    --batch_size 1000 \
    --num_passes 17 \
    --model_type 0 \
    --share_network_between_source_target FALSE \
    --share_embed FALSE \
    --dnn_dims 512,216,216,216,128 \
    --num_workers ${TRAINER_COUNT} \
    --use_gpu FALSE \
    --class_num 2 \
    --model_output_prefix ./output_model/ \
    --num_batches_to_log 1

With TRAINER_COUNT=24:

****** CPU Information ******
Logical CPU Number  : 32
Physical CPU Number : 2
CPU Core Number     : 8
Hyper Threading(HT) : ON
********** Settings *********
Trainer Count      : 24
OMP_NUM_THREADS    : 1
MKL_NUM_THREADS    : 1

top - 15:24:26 up 437 days,  1:30, 10 users,  load average: 4.59, 3.52, 3.02
Tasks: 615 total,   3 running, 612 sleeping,   0 stopped,   0 zombie
Cpu0  : 96.4%us,  3.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 13.6%us, 50.7%sy,  0.0%ni, 35.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 14.3%us, 49.7%sy,  0.0%ni, 36.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 14.9%us, 49.3%sy,  0.0%ni, 35.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 14.3%us, 49.7%sy,  0.3%ni, 35.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 14.2%us, 49.7%sy,  0.3%ni, 35.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 14.6%us, 49.3%sy,  0.0%ni, 36.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 14.6%us, 49.3%sy,  0.0%ni, 36.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 13.3%us, 50.8%sy,  0.0%ni, 35.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  : 14.0%us, 50.2%sy,  0.3%ni, 35.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 : 14.6%us, 49.8%sy,  0.0%ni, 35.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 : 14.9%us, 49.3%sy,  0.0%ni, 35.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 : 14.5%us, 49.8%sy,  0.0%ni, 35.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 : 14.2%us, 49.7%sy,  0.0%ni, 36.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 : 14.6%us, 49.5%sy,  0.0%ni, 35.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 : 15.3%us, 49.2%sy,  0.0%ni, 35.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu16 :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 :  0.3%us,  0.7%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 :  0.3%us,  1.3%sy,  0.7%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 :  0.3%us,  0.0%sy,  0.3%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 :  0.3%us,  0.7%sy,  0.7%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu21 :  0.3%us,  0.3%sy,  0.3%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 :  0.0%us,  0.3%sy,  0.7%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu24 :  1.0%us,  1.4%sy,  0.7%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu25 :  1.0%us,  1.7%sy,  0.3%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu26 :  0.3%us,  1.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu27 :  0.7%us,  1.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu28 :  0.3%us,  2.0%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu29 :  0.3%us,  1.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu30 :  0.3%us,  1.4%sy,  1.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu31 :  0.3%us,  1.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

With TRAINER_COUNT=1:

****** CPU Information ******
Logical CPU Number  : 32
Physical CPU Number : 2
CPU Core Number     : 8
Hyper Threading(HT) : ON
********** Settings *********
Trainer Count      : 1
OMP_NUM_THREADS    : 32
MKL_NUM_THREADS    : 32

top - 15:27:10 up 437 days,  1:32, 10 users,  load average: 2.46, 3.05, 2.93
Tasks: 613 total,   2 running, 611 sleeping,   0 stopped,   0 zombie
Cpu0  : 91.7%us,  7.9%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu1  : 15.9%us, 40.7%sy,  0.3%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 16.2%us, 40.9%sy,  0.0%ni, 42.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 15.7%us, 41.0%sy,  0.3%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 15.8%us, 40.9%sy,  0.0%ni, 43.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 15.9%us, 41.1%sy,  0.0%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 15.9%us, 41.4%sy,  0.3%ni, 42.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 15.4%us, 41.3%sy,  0.3%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 15.1%us, 41.8%sy,  0.3%ni, 42.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  : 14.8%us, 42.0%sy,  0.3%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 : 15.1%us, 41.8%sy,  0.3%ni, 42.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 : 14.8%us, 42.4%sy,  0.3%ni, 42.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 : 15.6%us, 41.7%sy,  0.0%ni, 42.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 : 15.2%us, 41.7%sy,  0.0%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 : 15.2%us, 42.1%sy,  0.0%ni, 42.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 : 15.9%us, 41.1%sy,  0.0%ni, 43.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu16 :  0.0%us,  0.7%sy,  1.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu17 :  0.7%us,  1.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 :  0.3%us,  3.0%sy,  1.3%ni, 95.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu19 :  0.3%us,  1.3%sy,  0.7%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 :  0.0%us,  0.3%sy,  0.3%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu21 :  0.0%us,  0.7%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu24 :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu25 :  0.7%us,  2.3%sy,  0.7%ni, 96.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu26 :  0.3%us,  1.7%sy,  0.3%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu27 :  0.3%us,  1.7%sy,  0.7%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu28 :  0.7%us,  1.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu29 :  0.3%us,  1.0%sy,  0.0%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu30 :  0.0%us,  0.7%sy,  0.3%ni, 98.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu31 :  0.7%us,  1.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

The CPU usage is essentially the same in both cases. Could you help check whether I am using it correctly? Thanks.

@tensor-tang
Contributor

The variables you set probably did not take effect.

You can add echo $OMP_NUM_THREADS before the paddle train command to confirm whether they are in effect.

@CAOYUHUI
Author

@luotao1 Hi, I tried the cpu_configure.sh script. After running it, my first training attempt with trainer_count=4 did show multiple CPUs in use in htop, so multithreading seemed to have taken effect. But in subsequent training runs, htop shows the process occupying the same single CPU again.
Output of cpu_configure.sh: [screenshot, 2017-11-16]
htop view: [screenshot, 2017-11-16]

Could you help take a look at what is going on? Many thanks!

@tensor-tang
Contributor

The echo output inside the script is normal, but what I meant was adding the echo before your paddle train command.
The environment variable in your screenshot does not necessarily take effect in the environment your training command runs in (the top output will always be there regardless). I suspect that if you add the echo as follows, it will print nothing:

sh cpu_configure.sh ${TRAINER_COUNT}
echo $OMP_NUM_THREADS
python train.py \

Please confirm this first, thanks.

@CAOYUHUI
Author

@tensor-tang Hi, running echo $OMP_NUM_THREADS prints nothing.

@tensor-tang
Contributor

tensor-tang commented Nov 16, 2017

Thanks, that confirms the environment variables set inside the script really did not take effect.

You need to use source cpu_configure.sh ${TRAINER_COUNT}; after that the variable should have a value.

Alternatively, build and install the latest Paddle: the script's functionality has already been integrated, so no manual configuration is needed.
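The difference between `sh` and `source` is easy to reproduce. A minimal, hypothetical demo (the `/tmp/omp_demo.sh` path is arbitrary): `sh` runs the script in a child shell whose exports die with it, while `source` (or its portable spelling `.`) runs it in the current shell:

```shell
# Demo of why `sh script.sh` cannot set environment variables for the caller.
unset OMP_NUM_THREADS
cat > /tmp/omp_demo.sh <<'EOF'
export OMP_NUM_THREADS=8
EOF

sh /tmp/omp_demo.sh                        # runs in a child shell
echo "after sh:     '${OMP_NUM_THREADS}'"  # empty: the export died with the child

. /tmp/omp_demo.sh                         # `.` is the portable `source`
echo "after source: '${OMP_NUM_THREADS}'"  # 8: ran in the current shell
```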

@luotao1
Contributor

luotao1 commented Nov 16, 2017

But in subsequent training runs, htop shows it occupying the same single CPU again.

For those subsequent training runs, was trainer_count still 4? @CAOYUHUI

@CAOYUHUI
Author

@luotao1 Later I set trainer_count=8.
@tensor-tang With source, the echo now prints a value. But training still uses only one core, and the time per batch is the same as before.

@luotao1
Contributor

luotao1 commented Nov 16, 2017

Which version of Paddle are you using?

@CAOYUHUI
Author

@luotao1 The v2 API, installed via pip.

@luotao1
Contributor

luotao1 commented Nov 21, 2017

Did pip install the latest version of Paddle, or version 0.10.0?

@CAOYUHUI
Author

@luotao1 It's 0.10.0.

@luotao1
Contributor

luotao1 commented Nov 21, 2017

But in subsequent training runs, htop shows it occupying the same single CPU again.

If you change trainer_count, you need to re-run the script. After re-running it, does htop still show only one CPU?

@peterzhang2029
Contributor

Closing due to low activity. Feel free to reopen it.
