update resnet50 benchmark data #5578

Merged 2 commits into PaddlePaddle:develop on Nov 22, 2017
Conversation

tensor-tang (Contributor)

MKLML and MKL-DNN data were tested with the latest Docker image.
OpenBLAS data was tested with a locally compiled build, since there is no OpenBLAS Docker image.

| BatchSize | 64    | 128   | 256   |
|-----------|-------|-------|-------|
| OpenBLAS  | 22.90 | 23.10 | 25.59 |
| MKLML     | 29.81 | 30.18 | 32.77 |
| MKL-DNN   | 80.49 | 82.89 | 83.13 |
luotao1 (Contributor), Nov 13, 2017

| BatchSize | 64    | 128   | 256   |
|-----------|-------|-------|-------|
| MKLML     | 25.77 | 23.98 | 24.27 |
| MKL-DNN   | 77.57 | 77.99 | 74.23 |

There are two differences in how the tests were run:

  • I tested on CentOS 6, while the PR used CentOS 7.
  • My CPU is a 6148, while the PR used a 6148M.

Could these two differences be why my measured values actually drop as the batch size increases?

tensor-tang (Contributor, Author)

I compared the percentages, (my value - your value) / your value, and the differences are quite large.
The MKLML differences are generally bigger than the MKL-DNN ones.

| BatchSize | 64     | 128    | 256    |
|-----------|--------|--------|--------|
| MKLML     | 15.68% | 25.85% | 35.02% |
| MKL-DNN   | 3.76%  | 6.28%  | 11.99% |
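As a quick sanity check, the first MKLML entry can be reproduced from the two tables above (a small awk sketch; any calculator works):

```bash
# (PR value - local value) / local value for MKLML at batch size 64:
# (29.81 - 25.77) / 25.77, expressed as a percentage.
awk 'BEGIN { printf "%.2f%%\n", (29.81 - 25.77) / 25.77 * 100 }'   # prints 15.68%
```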

Could you check the value of `cat /sys/devices/system/cpu/cpu*/online | grep -o '1' | wc -l`?

luotao1 (Contributor)

That value is 39, but I have 40 processors.

tensor-tang (Contributor, Author)

On my machine both values are 40.

luotao1 (Contributor)

Thanks to @BlackZhengSQ for the explanation: cpu0 does not support online/offline (the other CPUs do), but cpu0 is always active.
Watching with top inside Docker, close to all 40 cores are fully loaded:
(screenshot of top output)
So there is no missing core after all.
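For reference, a minimal way to see this cpu0 behavior from the shell (a sketch assuming the usual Linux sysfs layout, where cpu0 exposes no online control file):

```bash
# cpu0 typically has no "online" control file, so counting the '1' entries
# under /sys/devices/system/cpu/cpu*/online only covers cpus 1..39.
ls /sys/devices/system/cpu/cpu0/online 2>/dev/null || echo "cpu0 has no online file"
cat /sys/devices/system/cpu/cpu*/online | grep -c '^1$'   # 39 here, even though 40 CPUs are active
```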

tensor-tang (Contributor, Author), Nov 15, 2017

The MKLML gap is indeed quite large, and it grows as the batch size increases. Could you check whether NUMA is enabled? NUMA is used for memory allocation, and the 6148 should have 2 NUMA nodes.

luotao1 (Contributor)

Using the `lscpu` command: NUMA is not enabled on my machine.

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    1
Core(s) per socket:    20
CPU socket(s):         2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Stepping:              4
CPU MHz:               2401.000
BogoMIPS:              4804.43
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              28160K
NUMA node0 CPU(s):     0-39
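Since lscpu reports 2 sockets but only 1 NUMA node, NUMA is most likely disabled in the BIOS (often a "Node Interleaving" setting) or on the kernel command line. A rough way to check, assuming numactl is installed:

```bash
numactl --hardware                  # lists the NUMA nodes the kernel actually sees
grep -o 'numa=off' /proc/cmdline    # non-empty output means NUMA was disabled at boot
lscpu | grep -i numa                # the same summary quoted above
```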

luotao1 (Contributor)

After enabling NUMA, the numbers do match up, and are even slightly faster:

| BatchSize | 64    | 128   | 256   |
|-----------|-------|-------|-------|
| MKLML     | 33.32 | 31.68 | 33.12 |
| MKL-DNN   | 81.69 | 82.35 | 84.08 |
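For completeness, once both NUMA nodes are visible, memory placement can also be controlled explicitly with numactl when launching the benchmark. This is only a generic sketch; `train.sh` is a stand-in for whatever benchmark entry point is actually used, not a script from this PR:

```bash
# Interleave memory allocations across both NUMA nodes of the 2-socket 6148 box.
numactl --interleave=all ./train.sh

# Or pin both CPUs and memory to a single node to avoid cross-node traffic.
numactl --cpunodebind=0 --membind=0 ./train.sh
```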

tensor-tang (Contributor, Author)

Good news, then let's just use your data directly. After all, the code base you tested is also newer.

tensor-tang (Contributor, Author)

One more note: the NUMA issue here should also apply to the gap seen on VGG.

@luotao1 luotao1 merged commit f7fc6c2 into PaddlePaddle:develop Nov 22, 2017
@tensor-tang tensor-tang moved this from Doing to Done in Optimization on Intel Platform Nov 22, 2017
@tensor-tang tensor-tang deleted the benchmark branch November 22, 2017 03:02