[discuss] Distributed Paddle performance on a CPU cluster #1359
Why run only 50,000 samples on 100 nodes? Also, reducing trainer_count will not bring an order-of-magnitude performance improvement, though it can…
@hedaoyuan We are currently running performance tests comparing single-machine and multi-machine performance, which is why we used a relatively small dataset.
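The mismatch between the dataset size and the cluster size can be made concrete with a quick back-of-the-envelope calculation (a sketch using only the numbers quoted in this thread; the variable names are illustrative):

```python
# Back-of-the-envelope check of the data-vs-cluster-size mismatch.
# All numbers are taken from this thread; nothing here is measured.
total_samples = 50_000       # dataset size mentioned above
num_nodes = 100              # training nodes
batch_size_per_node = 2_000  # per-node batch_size

samples_per_node = total_samples / num_nodes
batches_per_node_per_epoch = samples_per_node / batch_size_per_node

print(samples_per_node)            # 500.0 samples per node per epoch
print(batches_per_node_per_epoch)  # 0.25 -- less than one full batch per node
```

With only 500 samples per node per epoch, each node cannot even fill a single batch of 2,000, so per-step synchronization and communication overhead is likely to dominate actual computation.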
The specific scenario is as follows:
2. The input data is a sequence (2 words long) used to predict the 3rd word; the vocabulary size is 2 million.
3. Training uses 100 nodes, with a per-node batch_size of 2000, trainer_count of 32, and momentum sync as the optimization method.
4. CPU utilization on each node is quite low.
The current training speed is 9.3 s per 10,000 samples, which is very slow. Is this performance in line with expectations, and are there any suggestions for optimization?
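For reference, the reported speed can be converted into throughput figures (a quick calculation from the numbers above; it makes no assumptions beyond them):

```python
# Convert the reported "9.3 s per 10,000 samples" into throughput figures.
samples = 10_000   # samples processed
seconds = 9.3      # reported wall-clock time
num_nodes = 100    # training nodes

cluster_throughput = samples / seconds            # samples/sec, whole cluster
per_node_throughput = cluster_throughput / num_nodes

print(round(cluster_throughput))       # 1075 samples/sec overall
print(round(per_node_throughput, 1))   # 10.8 samples/sec per node
```

At roughly 10.8 samples/sec per node, each node is nearly idle, which is consistent with the low CPU utilization reported above and suggests the run is likely dominated by communication rather than computation (the 2-million-word output vocabulary makes each parameter exchange especially heavy).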