support distribute training in python v2 API #1782
Conversation
python/paddle/v2/topology.py
Outdated
""" | ||
for parameter in self.__model_config__.parameters: | ||
if parameter.sparse_update or parameter.sparse_remote_update: | ||
return True |
We generally don't return from inside a for loop; would it be cleaner to break out and then return?
Sure~
done
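(For reference, a minimal sketch of the break-then-return shape being suggested; the method and attribute names are taken from the diff above, and the final code may differ:)

def use_sparse_updater(self):
    """Whether any parameter in the model uses sparse update."""
    use_sparse = False
    for parameter in self.__model_config__.parameters:
        if parameter.sparse_update or parameter.sparse_remote_update:
            use_sparse = True
            break  # stop scanning once one sparse parameter is found
    return use_sparse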
python/paddle/v2/trainer.py
Outdated
if self.__is_local__:
    updater = self.__optimizer__.create_local_updater()
else:
    updater = self.__optimizer__.create_remote_updater(num_passes)
Would it be better to put the if/else inside the optimizer, so the logic in train() stays as simple as possible?
done
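(A rough sketch of what train() might look like once the branch moves into the optimizer; the create_updater name appears later in this review, but the exact signature here is an assumption:)

# trainer.py: the local/remote decision is delegated to the optimizer,
# so the training path only needs a single call (signature assumed)
self.__parameter_updater__ = self.__optimizer__.create_updater(
    self.__is_local__, num_passes)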
python/paddle/v2/trainer.py
Outdated
@@ -42,7 +42,7 @@ class SGD(object):
     :type extra_layers: paddle.v2.config_base.Layer
     """

-    def __init__(self, cost, parameters, update_equation, extra_layers=None):
+    def __init__(self, cost, parameters, update_equation, extra_layers=None, is_local=True):
Just an idea: could the variable that distinguishes local from remote come from an environment variable? The goal is that users doing distributed training don't have to modify code, for example:
self.__is_local__ = os.getenv("DISTRIBUTED_TRAIN", None)
OK, that can be supported.
It turns out this also touches quite a few other configuration mechanisms; it's better to open a dedicated PR to sort out those config options, so I won't include it in this PR and will keep the explicit configuration approach for now.
Makes sense. Could you create an issue to record this item so it can be implemented later?
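(For illustration, reading the switch from the environment might look roughly like this; DISTRIBUTED_TRAIN comes from the comment above, and treating any set value as "remote" is an assumption of the sketch:)

import os
import paddle.v2 as paddle

# Run locally unless the DISTRIBUTED_TRAIN environment variable is set,
# so the same script can be submitted to a cluster without code changes.
is_local = os.getenv("DISTRIBUTED_TRAIN") is None
trainer = paddle.trainer.SGD(
    cost=cost,
    parameters=parameters,
    update_equation=adagrad,
    is_local=is_local)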
@@ -2,26 +2,38 @@

import paddle.v2 as paddle

dictsize = 1953
Doesn't the demo directory in github.com/paddlepaddle/padddle overlap with the content in github.com/paddlepaddle/book? Should only one copy be kept?
Also, if the changes to the word2vec demo here are meant to test the effect of updating sparse parameters in distributed training, wouldn't it be better to create a new file under some test directory, e.g. paddle/v2/trainer/test/sparse_parameter_update_test.py?
1. There is indeed duplication; the copy in book could be removed.
2. This is not just a test; it can actually train a usable model.
paddle/api/ParameterUpdater.cpp
Outdated
auto remoteUpdater = new paddle::RemoteParameterUpdater(
    config->m->getConfig(), passCount, nullptr);
if (useSparseUpdater) {
  std::unique_ptr<paddle::ParameterUpdater> remoteUpdaterPtr;
Do L37 and L38 really need to be two separate lines? Couldn't they be combined as:
std::unique_ptr<paddle::ParameterUpdater> remoteUpdaterPtr(remoteUpdater);
Nice, yes.
done
python/paddle/v2/optimizer.py
Outdated
-    def create_remote_updater(self, pass_num):
-        return swig_api.ParameterUpdater.createRemoteUpdater(self.__opt_conf__,
-                                                             pass_num)
+    def create_remote_updater(self, pass_num, use_sparse_updater):
Two functions are exposed here, create_remote_updater and create_updater. How is a user supposed to know which one to use, i.e. when to use the first and when the second? Does this need a comment?
After this change, create_remote_updater should become an internal function and not be exposed externally.
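(A sketch of that shape, under the assumption that the single public entry point is create_updater and the remote factory becomes private; the double-underscore name is my assumption, following Paddle's existing naming style:)

# optimizer.py (sketch): one public factory, the remote variant kept internal
def __create_remote_updater__(self, pass_num, use_sparse_updater):
    return swig_api.ParameterUpdater.createRemoteUpdater(
        self.__opt_conf__, pass_num, use_sparse_updater)

def create_updater(self, is_local, num_passes, use_sparse_updater):
    """Single public factory; callers never choose local vs. remote directly."""
    if is_local:
        return self.create_local_updater()
    return self.__create_remote_updater__(num_passes, use_sparse_updater)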
""" | ||
use_sparse = False | ||
for parameter in self.__model_config__.parameters: | ||
if parameter.sparse_update or parameter.sparse_remote_update: |
What exactly do the two concepts sparse_update and sparse_remote_update mean?
sparse_update is a concept of its own: when the data is sparse, only part of the parameters need to be updated, and this is currently supported both locally and remotely. For local single-machine training no special setup is needed, but for remote sparse_update the remote_parameter_updater has to be configured specially, because some data must be fetched in advance from the remote parameter server.
It mainly comes down to this part of trainer.py:
if self.use_remote_sparse_updater():
    self.__gradient_machine__.prefetch(in_args)
    self.__parameter_updater__.getParametersRemote()
@jacquesqiao My understanding is that sparse update is only used for the word-embedding dictionary: each word is one row, so the table has a very large number of rows, and each time the trainer sends gradients to the parameter server it only needs to send the few rows that changed. Is that understanding correct? If so, why is prefetch needed?
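(As a rough illustration of where sparse_update enters a model configuration, here is a sketch of a large embedding table marked for sparse updates; the layer and attribute names follow the v2 API as used in the word2vec demo, but treat the details as an assumption:)

# A large word-embedding table marked for sparse updates: only the rows
# hit by a batch are updated, and in the remote case those rows are
# prefetched from the parameter server before forward/backward.
word = paddle.layer.data(
    name='word', type=paddle.data_type.integer_value(dictsize))
emb = paddle.layer.embedding(
    input=word,
    size=32,
    param_attr=paddle.attr.Param(name='emb', sparse_update=True))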
if not cluster_train:
    paddle.init(use_gpu=False, trainer_count=1)
else:
    paddle.init(
The arguments to paddle.init are all written in the code, so cluster training forces users to modify the Python code. I suggest that paddle.init read its arguments first from environment variables, then from **kwargs, and finally fall back to defaults. That way local training code can be submitted for cluster training by changing only environment variables.
Yes, it really should work that way, but there are quite a few of these configuration parameters. Once this PR is merged, we can open a dedicated issue to sort out how these configuration parameters are obtained.
This also depends on the reader implementation; the reader will very likely force changes to the Python code anyway. I haven't come up with a good way to run on a cluster without touching the reader.
Today I saw Wang Yi's comment about launching Paddle from a Jupyter notebook; I'm actually not sure whether distributed training should be launched from the command line or from Python.
python/paddle/v2/trainer.py
Outdated
@@ -96,6 +98,18 @@ def __prepare_parameter__(self, in_args):
         self.__gradient_machine__.prefetch(in_args)
         self.__parameter_updater__.getParametersRemote()

+    def save_parameter(self, dir_name, file_name):
Don't pass dir_name and file_name as arguments; pass a file object (fp) in directly.
That way we can save not only to a local file but also to a binary stream.
@jacquesqiao Please fix this.
done
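(With the file-object interface, the call site in the event handler could look roughly like this; save_parameter_to_tar is a placeholder name for the reworked method, not necessarily the final one:)

import gzip

# Any binary stream works, not just a local file path.
with gzip.open('output/batch-%d.tar.gz' % event.batch_id, 'w') as f:
    trainer.save_parameter_to_tar(f)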
demo/word2vec/api_train_v2.py
Outdated
@@ -57,6 +69,7 @@ def main():
     def event_handler(event):
         if isinstance(event, paddle.event.EndIteration):
             if event.batch_id % 100 == 0:
+                trainer.save_parameter("output", "batch-" + str(event.batch_id))
After the refactoring the interface can stay the same; for the implementation, consider having the parameter server save the parameters (trainer.save_parameter tells the parameter server to store them).
 parameters = paddle.parameters.create(cost)
-adam_optimizer = paddle.optimizer.Adam(
+adagrad = paddle.optimizer.AdaGrad(
If the parameter server is rewritten, we should discuss whether the first version needs to support this many update rules. To keep the implementation simple, we could skip momentum-based rules like AdaGrad for now and only implement the simplest additive update?
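(For concreteness, the "simplest additive update" would be a plain SGD step with no per-parameter state; a sketch, not actual parameter-server code:)

import numpy as np

def apply_gradient(param, grad, learning_rate=0.01):
    # Plain SGD step: no momentum or AdaGrad accumulators to keep on the
    # parameter server, which is what makes a first version simple.
    param -= learning_rate * grad
    return param

param = np.zeros(4)
param = apply_gradient(param, np.ones(4))   # each entry becomes -0.01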
-    updater->m->updater.reset(new paddle::RemoteParameterUpdater(
-        config->m->getConfig(), passCount, nullptr));
+    auto remoteUpdater = new paddle::RemoteParameterUpdater(
+        config->m->getConfig(), passCount, nullptr);
Why does the parameter updater need to know passCount?
Because once the parameter server knows the pass count it can exit by itself... though that isn't really useful anyway; after training finishes you can just kill the parameter server.
python/paddle/v2/trainer.py
Outdated
os.makedirs(dir_name)
param_file_name = dir_name + "/" + file_name + '.tar.gz'
assert not os.path.exists(param_file_name)
self.__parameter_updater__.catchUpWith()
Could you explain what this sequence of operations on __parameter_updater__ does?
self.__parameter_updater__.catchUpWith()
self.__parameter_updater__.apply()
self.__parameter_updater__.getParametersRemote(True, True)
self.__parameter_updater__.restore()
These operations are there to support regularization and ModelAverage...
Regularization is computed lazily, and with ModelAverage the model used during training is not the same model that is actually used for inference or saving.
@@ -101,23 +142,26 @@ def train(self, reader, num_passes=1, event_handler=None, feeding=None):
         for pass_id in xrange(num_passes):
             event_handler(v2_event.BeginPass(pass_id))
             pass_evaluator.start()
-            updater.startPass()
+            self.__parameter_updater__.startPass()
Why does __parameter_updater__ need to know about startPass, startBatch, finishBatch, and finishPass? My understanding is that it only needs to collect gradients and distribute parameters.
LGTM, approving the C++ part of the implementation. @reyoung please take another look at the Python part.
LGTM, but there are two unused imports that can be deleted; please remove them before merging.
@@ -1,4 +1,6 @@
 import collections
+import gzip
It seems gzip is no longer needed?
@@ -1,4 +1,6 @@
 import collections
+import gzip
+import os
os isn't needed anymore either, right?
Fixes #1802