
Distribute training KPI on CE #10944

Closed
Yancey1989 opened this issue May 25, 2018 · 10 comments

@Yancey1989 (Contributor) commented May 25, 2018

To ensure the stability of Fluid distributed training, we need to add evaluation metrics (KPIs) to the CE system that verify that every merge request landed on the develop branch can still run multi-node training correctly. We will use the aws benchmarking tool to submit jobs to AWS and compute each KPI from the resulting logs:

1. Speedup

| batch_size | trainers | pservers | GPUs per trainer | throughput (samples/sec) |
|------------|----------|----------|------------------|--------------------------|
| 64         | 1        | 0        | 1                | t0                       |
| 64*8       | 1        | 0        | 8                | t1                       |
| 64*8       | 2        | 4        | 8                | t2                       |
| 64*8       | 4        | 8        | 8                | t3                       |

Following https://en.wikipedia.org/wiki/Speedup, we use the throughput-based definition of speedup:

speedup(8 GPUs) = t1 / t0

This yields the following sample results (a minimal computation sketch follows the table):

| GPUs | speedup |
|------|---------|
| 1    | 1       |
| 8    | 6       |
| 16   | 10      |
| 32   | 20      |
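As a minimal sketch of this computation (the throughput numbers are the sample values from the summary table below, purely illustrative, not real measurements):

```python
# Throughput-based speedup, using the sample throughputs (samples/sec)
# from the summary table in this issue; real values would come from
# the AWS benchmarking logs.
throughputs = {1: 120.0, 8: 768.0, 16: 1344.0, 32: 2304.0}

def speedup(n_gpus, baseline=1):
    """speedup(N GPUs) = t_N / t_baseline, both measured as throughput."""
    return throughputs[n_gpus] / throughputs[baseline]

for n in sorted(throughputs):
    print(f"speedup({n} GPUs) = {speedup(n):.1f}")
# prints 1.0, 6.4, 11.2, 19.2 for 1, 8, 16, 32 GPUs
```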
2. For each job we also need to compute the following metrics (a sketch of how they might be derived follows this list):
   - convergence speed: the total training time needed for accuracy to exceed a given value (e.g. 0.6)
   - GPU memory: GPU memory usage of the trainer and pserver nodes
   - speed: the total throughput of the training job
   - accuracy: verify that test accuracy is aligned, e.g. after training the same 5 passes on a single GPU (batch_size=80) and on 32 GPUs across 4 nodes (batch_size=20), the test acc should differ by less than 0.001
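A rough sketch of how these per-job metrics could be derived from parsed training logs (the record layout and helper names here are assumptions for illustration; the actual CE log format may differ):

```python
# Hypothetical log records: one dict per evaluation step, e.g.
#   {"time": 350.0, "test_acc": 0.62, "samples": 128000}
# where "time" is seconds since the job started.

def convergence_speed(records, target_acc=0.6):
    """Total training time until test accuracy first exceeds target_acc."""
    for r in records:
        if r["test_acc"] > target_acc:
            return r["time"]
    return None  # the run never reached the target accuracy

def total_throughput(records):
    """Overall throughput (samples/sec) of the whole training job."""
    return sum(r["samples"] for r in records) / records[-1]["time"]

def acc_aligned(single_gpu_acc, multi_gpu_acc, tol=0.001):
    """True if test accuracy after the same number of passes is aligned."""
    return abs(single_gpu_acc - multi_gpu_acc) < tol
```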
3. Summary

The final distributed-training KPIs on CE are as follows:

| KPI | value (1 GPU) | value (8 GPUs) | value (16 GPUs) | value (32 GPUs) |
|-----|---------------|----------------|-----------------|-----------------|
| speedup | 1 | 6 | 10 | 20 |
| speed | 120 (samples/sec) | 768 (samples/sec) | 1344 (samples/sec) | 2304 (samples/sec) |
| convergence speed (acc > 0.6) | 1000s | 800s | 700s | 600s |
| gpu memory | 12000 (MB) | 12000 (MB) | 12000 (MB) | 12000 (MB) |
| test acc_4passes | 0.002 | 0.002 | 0.002 | 0.002 |

Note: all metric values above are sample data; actual experiment results take precedence.

@guochaorong (Contributor):

Is the table above actual data that @Yancey1989 measured on some model, or is the plan for all models to follow these KPI thresholds, e.g. for speed on a single machine with 1 GPU, anything not below 120 samples/sec counts as passing?

@guochaorong (Contributor):

| GPUs | speedup |
|------|---------|
| 1    | 1.0        |
| 8    | t1/(t0*8)  |
| 16   | t2/(t0*16) |
| 32   | t3/(t0*32) |

Is this way of computing speedup an industry standard? I thought speedup was the running time on a single machine with a single GPU divided by the time on N machines with N GPUs, i.e. a number greater than 1.

https://baike.baidu.com/item/%E5%8A%A0%E9%80%9F%E6%AF%94/4661409?fr=aladdin

@typhoonzero (Contributor):

These should all be sample figures, not real data. Speedup can be computed either by dividing times or by dividing throughputs; for example, the speedup on 32 GPUs would be about 19x.

@panyx0718 (Contributor) commented May 27, 2018

@putcn Is the aws benchmarking tool ready? It seems it could be integrated into CE: have TeamCity invoke the tool, produce training results, and monitor them continuously.

@putcn (Contributor) commented May 28, 2018

@panyx0718 It's ready. It still needs a few changes on the CE side, e.g. supporting dist in the Paddle build. See the PR for details:
PaddlePaddle/paddle-ce-latest-kpis#27

@Yancey1989 (Contributor, Author) commented May 28, 2018

FROM @guochaorong

> Is the table above actual data that @Yancey1989 measured on some model, or is the plan for all models to follow these KPI thresholds, e.g. for speed on a single machine with 1 GPU, anything not below 120 samples/sec counts as passing?

The values listed are only sample data, not real measurements.

FROM @guochaorong @typhoonzero

> Is this way of computing speedup an industry standard? I thought speedup was the running time on a single machine with a single GPU divided by the time on N machines with N GPUs, i.e. a number greater than 1.

See the explanation of speedup on Wikipedia: https://en.wikipedia.org/wiki/Speedup. It defines speedup both in terms of latency and in terms of throughput; for training we care more about throughput, so we can compute speedup as @typhoonzero suggested:

> Speedup can be computed either by dividing times or by dividing throughputs; for example, the speedup on 32 GPUs would be about 19x.
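For clarity, a small sketch contrasting the two quantities that came up in this discussion: throughput speedup (t_N / t0, what the KPI tracks) and parallel efficiency (the t_N / (t0 * N) form quoted earlier). The numbers are the sample throughputs, not real measurements:

```python
# Sample throughputs (samples/sec) from the summary table above.
t = {1: 120.0, 8: 768.0, 16: 1344.0, 32: 2304.0}

for n in (8, 16, 32):
    speedup = t[n] / t[1]        # throughput speedup, e.g. ~19.2x for 32 GPUs
    efficiency = speedup / n     # the t_N / (t0 * N) form from the quoted table
    print(f"{n} GPUs: speedup = {speedup:.1f}, efficiency = {efficiency:.2f}")
```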

@guochaorong (Contributor):

| GPUs | speedup |
|------|---------|
| 1    | 1.0        |
| 8    | t1/(t0*8)  |
| 16   | t2/(t0*16) |
| 32   | t3/(t0*32) |

OK, so it would be t3/t0 then? We do want to produce all four KPIs for the four GPU configurations of each multi-node model, which is quite comprehensive. I'd suggest first locking down the speedup metric for every model.

@putcn (Contributor) commented May 28, 2018

OK, I'll write these metrics into the tests tomorrow.

@guochaorong (Contributor):

Great!

@shanyi15 (Collaborator):

Hello, this issue has not been updated in the past month, so we will close it today. If you still need to follow up on this question after it is closed, please feel free to reopen it and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!
