
Distribute training KPI on CE #10944

Closed
Yancey1989 opened this issue May 25, 2018 · 10 comments

@Yancey1989 (Contributor) commented May 25, 2018

To ensure the stability of Fluid distributed training, we need to add evaluation metrics (KPIs) to the CE system that verify that every merge request landed on the develop branch can still run multi-node training correctly. We will use the aws benchmarking tool to submit jobs to AWS and compute each KPI from the resulting logs:

1. Speedup

| batch_size | trainers | pservers | GPUs per trainer | throughput (samples/sec) |
|------------|----------|----------|------------------|--------------------------|
| 64         | 1        | 0        | 1                | t0                       |
| 64*8       | 1        | 0        | 8                | t1                       |
| 64*8       | 2        | 4        | 8                | t2                       |
| 64*8       | 4        | 8        | 8                | t3                       |

Following https://en.wikipedia.org/wiki/Speedup, we use the throughput-based definition of speedup:

speedup(8 GPUs) = t1 / t0

This yields the following sample results (a minimal computation sketch follows the table):

| GPUs | speedup |
|------|---------|
| 1    | 1       |
| 8    | 6       |
| 16   | 10      |
| 32   | 20      |
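As a minimal sketch of this computation (the throughput numbers are the sample values from the summary table below, purely illustrative, not real measurements):

```python
# Throughput-based speedup, using the sample throughputs (samples/sec)
# from the summary table in this issue; real values would come from
# the AWS benchmarking logs.
throughputs = {1: 120.0, 8: 768.0, 16: 1344.0, 32: 2304.0}

def speedup(n_gpus, baseline=1):
    """speedup(N GPUs) = t_N / t_baseline, both measured as throughput."""
    return throughputs[n_gpus] / throughputs[baseline]

for n in sorted(throughputs):
    print(f"speedup({n} GPUs) = {speedup(n):.1f}")
# prints 1.0, 6.4, 11.2, 19.2 for 1, 8, 16, 32 GPUs
```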
2. For each job we also need to compute the following metrics (a sketch of how they might be derived follows this list):
   - convergence speed: the total training time needed for accuracy to exceed a given value (e.g. 0.6)
   - GPU memory: GPU memory usage of the trainer and pserver nodes
   - speed: the total throughput of the training job
   - accuracy: verify that test accuracy is aligned, e.g. after training the same 5 passes on a single GPU (batch_size=80) and on 32 GPUs across 4 nodes (batch_size=20), the test acc should differ by less than 0.001
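A rough sketch of how these per-job metrics could be derived from parsed training logs (the record layout and helper names here are assumptions for illustration; the actual CE log format may differ):

```python
# Hypothetical log records: one dict per evaluation step, e.g.
#   {"time": 350.0, "test_acc": 0.62, "samples": 128000}
# where "time" is seconds since the job started.

def convergence_speed(records, target_acc=0.6):
    """Total training time until test accuracy first exceeds target_acc."""
    for r in records:
        if r["test_acc"] > target_acc:
            return r["time"]
    return None  # the run never reached the target accuracy

def total_throughput(records):
    """Overall throughput (samples/sec) of the whole training job."""
    return sum(r["samples"] for r in records) / records[-1]["time"]

def acc_aligned(single_gpu_acc, multi_gpu_acc, tol=0.001):
    """True if test accuracy after the same number of passes is aligned."""
    return abs(single_gpu_acc - multi_gpu_acc) < tol
```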
3. Summary

The final distributed-training KPIs on CE are as follows:

| KPI | value (1 GPU) | value (8 GPUs) | value (16 GPUs) | value (32 GPUs) |
|-----|---------------|----------------|-----------------|-----------------|
| speedup | 1 | 6 | 10 | 20 |
| speed | 120 (samples/sec) | 768 (samples/sec) | 1344 (samples/sec) | 2304 (samples/sec) |
| convergence speed (acc > 0.6) | 1000s | 800s | 700s | 600s |
| gpu memory | 12000 (MB) | 12000 (MB) | 12000 (MB) | 12000 (MB) |
| test acc_4passes | 0.002 | 0.002 | 0.002 | 0.002 |

Note: all metric values above are sample data; actual experiment results take precedence.

@guochaorong (Contributor):

Is the table above actual data that @Yancey1989 measured on some model, or is the plan for all models to follow these KPI thresholds, e.g. for speed on a single machine with 1 GPU, anything not below 120 samples/sec counts as passing?

@guochaorong (Contributor):

| GPUs | speedup |
|------|---------|
| 1    | 1.0        |
| 8    | t1/(t0*8)  |
| 16   | t2/(t0*16) |
| 32   | t3/(t0*32) |

Is this way of computing speedup an industry standard? I thought speedup was the running time on a single machine with a single GPU divided by the time on N machines with N GPUs, i.e. a number greater than 1.

https://baike.baidu.com/item/%E5%8A%A0%E9%80%9F%E6%AF%94/4661409?fr=aladdin

@typhoonzero (Contributor):

These should all be sample figures, not real data. Speedup can be computed either by dividing times or by dividing throughputs; for example, the speedup on 32 GPUs would be about 19x.

@panyx0718 (Contributor) commented May 27, 2018

@putcn Is the aws benchmarking tool ready? It seems it could be integrated into CE: have TeamCity invoke the tool, produce training results, and monitor them continuously.

@putcn (Contributor) commented May 28, 2018

@panyx0718 It's ready. It still needs a few changes on the CE side, e.g. supporting dist in the Paddle build. See the PR for details:
PaddlePaddle/paddle-ce-latest-kpis#27

@Yancey1989 (Contributor, Author) commented May 28, 2018

FROM @guochaorong

> Is the table above actual data that @Yancey1989 measured on some model, or is the plan for all models to follow these KPI thresholds, e.g. for speed on a single machine with 1 GPU, anything not below 120 samples/sec counts as passing?

The values listed are only sample data, not real measurements.

FROM @guochaorong @typhoonzero

> Is this way of computing speedup an industry standard? I thought speedup was the running time on a single machine with a single GPU divided by the time on N machines with N GPUs, i.e. a number greater than 1.

See the explanation of speedup on Wikipedia: https://en.wikipedia.org/wiki/Speedup. It defines speedup both in terms of latency and in terms of throughput; for training we care more about throughput, so we can compute speedup as @typhoonzero suggested:

> Speedup can be computed either by dividing times or by dividing throughputs; for example, the speedup on 32 GPUs would be about 19x.
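For clarity, a small sketch contrasting the two quantities that came up in this discussion: throughput speedup (t_N / t0, what the KPI tracks) and parallel efficiency (the t_N / (t0 * N) form quoted earlier). The numbers are the sample throughputs, not real measurements:

```python
# Sample throughputs (samples/sec) from the summary table above.
t = {1: 120.0, 8: 768.0, 16: 1344.0, 32: 2304.0}

for n in (8, 16, 32):
    speedup = t[n] / t[1]        # throughput speedup, e.g. ~19.2x for 32 GPUs
    efficiency = speedup / n     # the t_N / (t0 * N) form from the quoted table
    print(f"{n} GPUs: speedup = {speedup:.1f}, efficiency = {efficiency:.2f}")
```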

@guochaorong (Contributor):

| GPUs | speedup |
|------|---------|
| 1    | 1.0        |
| 8    | t1/(t0*8)  |
| 16   | t2/(t0*16) |
| 32   | t3/(t0*32) |

OK, so it would be t3/t0 then? We do want to produce all four KPIs for the four GPU configurations of each multi-node model, which is quite comprehensive. I'd suggest first locking down the speedup metric for every model.

@putcn (Contributor) commented May 28, 2018

OK, I'll write these metrics into the tests tomorrow.

@guochaorong (Contributor):

Great!

@shanyi15 (Collaborator):

Hello, this issue has not been updated in the past month, so we will close it today. If you still need to follow up on this question after it is closed, please feel free to reopen it and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!
