update benchmark data on VGG19 #5148

Merged
2 commits merged into PaddlePaddle:develop on Nov 1, 2017

Conversation

tensor-tang
Contributor

Related to #5008

Machine:

- Server
  - Intel(R) Xeon(R) Gold 6148M CPU @ 2.40GHz, 2 Sockets, 20 Cores per socket
Contributor

2 Sockets with 20 Cores per socket makes 40 cores, but `cat /proc/cpuinfo` shows 80 processors on my machine.

Contributor Author

Yes, what I list here is the number of physical cores. Seeing 80 means hyper-threading is enabled on your machine.
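
For a quick sanity check, here is a minimal sketch (not part of this PR) that counts logical processors versus physical cores from `/proc/cpuinfo`; with hyper-threading on, the first number is twice the second:

```python
# Count logical processors vs. physical cores from /proc/cpuinfo (Linux only).
# Rough illustration; `lscpu` or psutil.cpu_count(logical=False) report the same thing.
def cpu_topology(path="/proc/cpuinfo"):
    with open(path) as f:
        blocks = [b for b in f.read().strip().split("\n\n") if b]
    cores = set()
    for block in blocks:
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        # (physical id, core id) pairs identify distinct physical cores
        cores.add((fields.get("physical id", "0"), fields.get("core id", "0")))
    return len(blocks), len(cores)  # (logical processors, physical cores)

logical, physical = cpu_topology()
print(f"{logical} logical processors, {physical} physical cores")
# Expected on the 6148 machine above: 80 / 40 with hyper-threading on, 40 / 40 with it off.
```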

- Laptop
  - DELL XPS15-9560-R1745: i7-7700HQ, 8G RAM, 256G SSD
  - i5 MacBook Pro (Retina, 13-inch, Early 2015)
- Desktop
  - i7-6700k
Contributor

The model information for the Laptop and Desktop entries here is incomplete; a TODO could be added.

Contributor Author

Sure, no problem. That can be added later when you add the test data for those machines.
The models I listed here are the ones you listed in issue #5008.

- Desktop
- i7-6700k

System: CentOS 7.3.1611
Contributor

CentOS 6.3.10

Contributor Author

Oh, I am using the 7.3 release.

| BatchSize | 64    | 128   | 256   |
|-----------|-------|-------|-------|
| OpenBLAS  | 7.86  | 9.02  | 10.62 |
| MKLML     | 11.80 | 13.43 | 16.21 |
| MKL-DNN   | 29.07 | 30.40 | 31.06 |
Contributor
@luotao1 luotao1 Oct 27, 2017

The numbers I measured are 1.5-2x slower overall. OpenBLAS was built from source, while MKLML and MKL-DNN were both run from the docker image.

| BatchSize | 64  | 128  | 256          |
|-----------|-----|------|--------------|
| OpenBLAS  | 4   | 4.92 | not measured |
| MKLML     | 4.7 | 6.4  | 7.68         |
| MKL-DNN   | 20  | 20   | 21           |

Contributor Author

Based on what you just said, your system has hyper-threading enabled. In that case it is best to set `export KMP_AFFINITY="granularity=fine,compact,1,0"`.

My script measured with hyper-threading disabled.

Also, while the benchmark is running, it is best to check with `perf top` whether the MKL-DNN engine is really being used.
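
For illustration, a minimal sketch of launching a run with that affinity setting exported (the script name and thread count below are placeholders, not something defined in this PR):

```python
# Minimal sketch: export the OpenMP/KMP settings suggested above before starting the run.
import os
import subprocess

env = dict(os.environ)
# With hyper-threading ON, "compact,1,0" pins one OpenMP thread per physical core.
env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
env["OMP_NUM_THREADS"] = "40"  # assumption: one thread per physical core on the 6148 box

# Placeholder command; replace with the actual VGG19 benchmark invocation.
subprocess.run(["bash", "run_vgg19_benchmark.sh"], env=env, check=True)
```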

Contributor

After changing to `export KMP_AFFINITY="granularity=fine,compact,1,0"`, the results are still the same.

Contributor Author

OK. May I ask what the BIOS version is? Also, are all the memory slots populated, and what is the memory frequency?

Contributor
@luotao1 luotao1 Oct 27, 2017

Here is the output of the `dmidecode` command:
dmidecode.log.txt

Contributor

I don't think this is related to docker. mklml and mkldnn both run inside docker; taking the first column, my speedup is (20-4.7)/4.7 = 3.25x, while yours is (29.07-11.8)/11.8 = 1.45x.

Could the mklml and mkldnn versions also be built locally? Measuring just one data point would be enough to see whether the numbers improve.

I can build the openblas version inside docker for testing.

The libc issue you mentioned last time when building locally still needs to be resolved.

It would be best to use docker as the benchmark environment, so that differences caused by different environments can be avoided.

Contributor Author

> mklml and mkldnn both run inside docker; taking the first column, my speedup is (20-4.7)/4.7 = 3.25x, while yours is (29.07-11.8)/11.8 = 1.45x.

That actually illustrates the point even better. The MKLML-to-MKLDNN ratio I measured outside docker is not that large, which suggests that your MKLML number inside docker is on the low side, or that there is a hidden problem that has not been found yet.

> I can build the openblas version inside docker for testing.

Agreed. It is better to compare all three in the same environment.

> It would be best to use docker as the benchmark environment, so that differences caused by different environments can be avoided.

I agree with that too. If possible, could you share your docker image with me? I will run it on my machine as well, to first rule out basic issues such as machine configuration.

Contributor Author

After looking carefully at the dmidecode output, the machine's memory is indeed not configured for best performance. There are 16 DIMMs installed, and 8 of them are sharing 4 channels.
The DIMMs in CPU0_A1, CPU0_D1, CPU1_A1, and CPU1_D1 should be removed.
If the slots on the board are colored blue and black, that means removing all the DIMMs in the black slots.

Contributor
@luotao1 luotao1 Oct 31, 2017

Many thanks to @BlackZhengSQ from the systems department for helping us rearrange the DIMMs correctly.
With MKLDNN at batchsize=64 the number is now 26.67, so memory configuration clearly has a large impact on performance.
However, there is still a gap between 26.67 and 28.46.

| BatchSize | 64    | 128   | 256   |
|-----------|-------|-------|-------|
| MKLML     | 10.95 | 12.81 | 15.21 |
| MKL-DNN   | 26.67 | 28.06 | 28.65 |

Contributor

Many thanks to @BlackZhengSQ from the systems department for helping us upgrade from CentOS 4.3 to CentOS 6.3.
With MKLDNN, the gap has now narrowed from 6% to 3%.

| BatchSize | 64    | 128  |
|-----------|-------|------|
| MKL-DNN   | 27.69 | 28.8 |

@tensor-tang
Contributor Author

I used the latest docker image paddlepaddle/paddle:latest and did a quick MKLDNN run with bs=64 on the 6148:

I1030 12:57:26.494154 40 Stat.cpp:102] ======= StatSet: [GlobalStatInfo] status ======
I1030 12:57:26.494226 40 Stat.cpp:105] Stat=FwdBwd TID=40 total=224885 avg=2248.85 max=2322.83 min=2235.7 count=100

That gives 64 / 2.24885 = 28.46 images/sec, which basically matches the 29.07 I measured before.
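
For reference, a minimal sketch (not part of this PR) of how these throughput numbers can be derived from the `Stat=FwdBwd` log line, assuming `avg` is the per-iteration forward+backward time in milliseconds:

```python
import re

# One of the log lines quoted above.
line = ("I1030 12:57:26.494226 40 Stat.cpp:105] Stat=FwdBwd TID=40 "
        "total=224885 avg=2248.85 max=2322.83 min=2235.7 count=100")

def throughput(stat_line: str, batch_size: int) -> float:
    """Images per second from a Stat=FwdBwd line, assuming avg is in milliseconds."""
    avg_ms = float(re.search(r"avg=([\d.]+)", stat_line).group(1))
    return batch_size / (avg_ms / 1000.0)

print(round(throughput(line, batch_size=64), 2))  # -> 28.46
```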

@tensor-tang
Contributor Author

It looks like the CentOS version does have some impact. The data in the commit is what I measured on a bare-metal machine running version 7.2.

For the MKL-DNN numbers, I ran again inside docker 1.12.6:
batchsize 64

I1101 08:12:39.871258    35 Stat.cpp:105] Stat=FwdBwd                         TID=35     total=223947     avg=2239.47    max=2306.18    min=2225.71    count=100

batchsize 128

I1101 08:20:38.068055   137 Stat.cpp:105] Stat=FwdBwd                         TID=137    total=422396     avg=4223.96    max=4285.01    min=4200.91    count=100

batchsize 256

I1101 08:36:04.149484   239 Stat.cpp:105] Stat=FwdBwd                         TID=239    total=829780     avg=8297.8     max=8360.78    min=8259.83    count=100

Summarized below:

| BatchSize          | 64    | 128   | 256   |
|--------------------|-------|-------|-------|
| with docker (A)    | 28.58 | 30.30 | 30.85 |
| without docker (B) | 29.07 | 30.40 | 31.06 |
| differ = (B-A)/A   | 1.72% | 0.32% | 0.68% |

The largest difference is about 1.7%, so running inside or outside docker makes essentially no difference.

Compared with the data on CentOS 6.3, the difference is around 3%.

@luotao1
Contributor

luotao1 commented Nov 1, 2017

The docker version also has little impact on performance. For the MKL-DNN numbers at batchsize=64, I ran inside docker 1.6.0 and 1.13.1:

docker 1.13.1:

I1101 08:44:03.134213 39 Stat.cpp:105] Stat=FwdBwd TID=39 total=232649 avg=2326.49 max=2522.99 min=2301.4 count=100

docker 1.6.0:

I1101 08:03:47.377549 34 Stat.cpp:105] Stat=FwdBwd TID=34 total=231180 avg=2311.8 max=2405.7 min=2298.31 count=100

All the data earlier in this conversation was measured under docker 1.6.0.

@tensor-tang
Contributor Author

tensor-tang commented Nov 1, 2017

I compared against the data @luotao1 updated.

1. Comparing the absolute values, the differences are mostly around 5%:

| BatchSize | 64    | 128   | 256   |
|-----------|-------|-------|-------|
| OpenBLAS  | 0.51% | 4.64% | 2.71% |
| MKLML     | 7.08% | 4.43% | 5.74% |
| MKL-DNN   | 4.98% | 5.56% | 6.12% |
2. In relative terms, using my earlier data:

| BatchSize        | 64   | 128  | 256  |
|------------------|------|------|------|
| MKLML / OpenBLAS | 1.50 | 1.49 | 1.53 |
| MKL-DNN / MKLML  | 2.46 | 2.26 | 1.92 |

Data on CentOS 6.3:

| BatchSize        | 64   | 128  | 256  |
|------------------|------|------|------|
| MKLML / OpenBLAS | 1.41 | 1.49 | 1.48 |
| MKL-DNN / MKLML  | 2.51 | 2.24 | 1.91 |

The differences between the two are:

| BatchSize        | 64     | 128    | 256   |
|------------------|--------|--------|-------|
| MKLML / OpenBLAS | 6.53%  | -0.20% | 2.95% |
| MKL-DNN / MKLML  | -1.96% | 1.08%  | 0.35% |

Looking at the relative differences, only the first number is somewhat large; the MKL-DNN / MKLML ratios all look fine.
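
As a reference only, a minimal sketch (with the figures hard-coded from the tables in this thread) of how these ratios and relative differences can be reproduced:

```python
# Throughput figures quoted in this thread (images/sec for batch sizes 64/128/256).
mine = {"OpenBLAS": [7.86, 9.02, 10.62],
        "MKLML":    [11.80, 13.43, 16.21],
        "MKL-DNN":  [29.07, 30.40, 31.06]}

def ratios(data, numerator, denominator):
    return [round(a / b, 2) for a, b in zip(data[numerator], data[denominator])]

print("MKLML / OpenBLAS:", ratios(mine, "MKLML", "OpenBLAS"))  # [1.5, 1.49, 1.53]
print("MKL-DNN / MKLML: ", ratios(mine, "MKL-DNN", "MKLML"))   # [2.46, 2.26, 1.92]

def rel_diff_percent(a, b):
    """Relative difference of a vs. b, in percent."""
    return [round((x - y) / y * 100, 2) for x, y in zip(a, b)]

# Compare against the CentOS 6.3 ratios quoted above; exact percentages depend on
# the unrounded source data, so this only approximates the row in the comment.
centos63_mklml_openblas = [1.41, 1.49, 1.48]
print(rel_diff_percent(ratios(mine, "MKLML", "OpenBLAS"), centos63_mklml_openblas))
```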

@tensor-tang tensor-tang merged commit a343504 into PaddlePaddle:develop Nov 1, 2017
@tensor-tang tensor-tang deleted the benchmark branch November 1, 2017 14:42
@tensor-tang tensor-tang moved this from Doing to Done in Optimization on Intel Platform Nov 2, 2017
@luotao1 luotao1 mentioned this pull request Nov 27, 2017