
Distributed stability testing #11289

Closed
kolinwei opened this issue Jun 7, 2018 · 4 comments

@kolinwei
Contributor

kolinwei commented Jun 7, 2018

1. Functional verification

The main goal is to verify that multi-machine training works correctly under different parameters and running conditions. The following dimensions need to be considered:

(1) Models

    Functionally, fast verification is needed, so one CNN model and one RNN model can be chosen:
    resnet50 model with the flowers dataset.
    seq2seq model with the wmt14 dataset.

(2) Multi-machine training scale

    Two scales are used: 1 ps with 2 trainers, and 2 ps with 4 trainers. For nccl, use 4 trainers.

(3) Training-related configuration

   sync and async
   pserver and nccl2
   CPU and GPU
   other training parameters; see #11206

Combine the above dimensions into test cases and verify functionality, focusing mainly on training speed and training convergence.
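As a rough sketch (not part of the original plan), the combinations above could be enumerated with a small Python helper; the dimension values come from this issue, while the structure and names are illustrative assumptions:

```python
import itertools

# Dimensions taken from the plan above; everything else is illustrative.
models = [("resnet50", "flowers"), ("seq2seq", "wmt14")]
modes = ["sync", "async"]
devices = ["CPU", "GPU"]

# pserver runs use the two scales listed above; nccl2 runs use 4 trainers and no ps.
pserver_scales = [(1, 2), (2, 4)]   # (num_ps, num_trainers)
nccl2_scales = [(0, 4)]

cases = []
for (model, dataset), mode, device in itertools.product(models, modes, devices):
    for transport, scales in (("pserver", pserver_scales), ("nccl2", nccl2_scales)):
        for num_ps, num_trainers in scales:
            cases.append({
                "model": model, "dataset": dataset, "transport": transport,
                "num_ps": num_ps, "num_trainers": num_trainers,
                "mode": mode, "device": device,
            })

# Each entry is one functional-verification run; speed and convergence are checked per run.
for case in cases:
    print(case)
```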

2. Continuous stability verification

Verify the stability of continuous training on some relatively large models.

(1) Models

   Use two models: SE-ResNeXt (Imagenet dataset) and transformer.

(2) Training scale

   2 ps with 4 trainers; for nccl, use 4 trainers.

(3) Training-related configuration

   sync and async
   CPU and GPU
   pserver and nccl2

Focus mainly on training convergence, speed, and memory usage. Training should run stably for an extended period (1-2 days).
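A minimal sketch of the periodic logging such a 1-2 day run might use to track convergence, speed, and memory; the function name, log format, and use of psutil are assumptions, not something specified in this plan:

```python
import time
import psutil  # assumed to be available on the test machines

def log_stability_metrics(step, loss, examples_per_sec, log_path="stability_log.csv"):
    """Append one record of the metrics the plan watches: convergence (loss),
    speed (examples/sec), and resident memory of the current process."""
    mem_mb = psutil.Process().memory_info().rss / (1024 ** 2)
    with open(log_path, "a") as f:
        f.write("%d,%d,%.6f,%.2f,%.1f\n"
                % (int(time.time()), step, loss, examples_per_sec, mem_mb))

# Example: call this every few hundred steps from the trainer's training loop.
# log_stability_metrics(step=1000, loss=2.35, examples_per_sec=180.0)
```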

@panyx0718 panyx0718 self-assigned this Jun 7, 2018
@typhoonzero
Contributor

other training parameters; see #11206

Let's do the other configurations in CI; putting them in CE would make a single run take too long. Among the verification models, if seq2seq includes a sparse embedding it can also cover the sparse-scenario use case.

Distributed sparse scenarios also need to be added to the verification later on.

@panyx0718
Contributor

Does this issue not cover final accuracy alignment?

@velconia
Collaborator

velconia commented Jun 8, 2018

I think simply using 2 ps and 2 trainers (with 2 trainers for nccl) is a good enough scale;

It covers the multi-pserver, multi-trainer case while also saving CE resources and speeding up testing.

@gongweibao gongweibao self-assigned this Jul 10, 2018
@shanyi15
Collaborator

Hello, this issue has not been updated in the past month. We will close it today for the sake of other users' experience. If you still need to follow up on this question after it is closed, please feel free to reopen it, and we will get back to you within 24 hours. We apologize for the inconvenience caused by the closure and thank you for your support of PaddlePaddle!
