support distribute training in python v2 API #1782
Conversation
python/paddle/v2/topology.py
Outdated
""" | ||
for parameter in self.__model_config__.parameters: | ||
if parameter.sparse_update or parameter.sparse_remote_update: | ||
return True |
We generally don't return from inside a for loop; would it be cleaner to break out and then return?
Sure~
done
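(For reference, a minimal sketch of the break-then-return shape being suggested; the method and attribute names are taken from the diff above, and the final code may differ:)

def use_sparse_updater(self):
    """Whether any parameter in the model uses sparse update."""
    use_sparse = False
    for parameter in self.__model_config__.parameters:
        if parameter.sparse_update or parameter.sparse_remote_update:
            use_sparse = True
            break  # stop scanning once one sparse parameter is found
    return use_sparse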
python/paddle/v2/trainer.py
Outdated
if self.__is_local__:
    updater = self.__optimizer__.create_local_updater()
else:
    updater = self.__optimizer__.create_remote_updater(num_passes)
Would it be better to put the if/else inside the optimizer, so the logic in train() stays as simple as possible?
done
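(A rough sketch of what train() might look like once the branch moves into the optimizer; the create_updater name appears later in this review, but the exact signature here is an assumption:)

# trainer.py: the local/remote decision is delegated to the optimizer,
# so the training path only needs a single call (signature assumed)
self.__parameter_updater__ = self.__optimizer__.create_updater(
    self.__is_local__, num_passes)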
python/paddle/v2/trainer.py
Outdated
@@ -42,7 +42,7 @@ class SGD(object):
     :type extra_layers: paddle.v2.config_base.Layer
     """

-    def __init__(self, cost, parameters, update_equation, extra_layers=None):
+    def __init__(self, cost, parameters, update_equation, extra_layers=None, is_local=True):
Just an idea: could the variable that distinguishes local from remote come from an environment variable? The goal is that users doing distributed training don't have to modify code, for example:
self.__is_local__ = os.getenv("DISTRIBUTED_TRAIN", None)
OK, that can be supported.
It turns out this also touches quite a few other configuration mechanisms; it's better to open a dedicated PR to sort out those config options, so I won't include it in this PR and will keep the explicit configuration approach for now.
Makes sense. Could you create an issue to record this item so it can be implemented later?
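(For illustration, reading the switch from the environment might look roughly like this; DISTRIBUTED_TRAIN comes from the comment above, and treating any set value as "remote" is an assumption of the sketch:)

import os
import paddle.v2 as paddle

# Run locally unless the DISTRIBUTED_TRAIN environment variable is set,
# so the same script can be submitted to a cluster without code changes.
is_local = os.getenv("DISTRIBUTED_TRAIN") is None
trainer = paddle.trainer.SGD(
    cost=cost,
    parameters=parameters,
    update_equation=adagrad,
    is_local=is_local)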
@@ -2,26 +2,38 @@

import paddle.v2 as paddle

dictsize = 1953
Doesn't the demo directory in github.com/paddlepaddle/padddle overlap with the content in github.com/paddlepaddle/book? Should only one copy be kept?
Also, if the changes to the word2vec demo here are meant to test the effect of updating sparse parameters in distributed training, wouldn't it be better to create a new file under some test directory, e.g. paddle/v2/trainer/test/sparse_parameter_update_test.py?
1. There is indeed duplication; the copy in book could be removed.
2. This is not just a test; it can actually train a usable model.
paddle/api/ParameterUpdater.cpp
Outdated
auto remoteUpdater = new paddle::RemoteParameterUpdater(
    config->m->getConfig(), passCount, nullptr);
if (useSparseUpdater) {
  std::unique_ptr<paddle::ParameterUpdater> remoteUpdaterPtr;
Do L37 and L38 really need to be two separate lines? Couldn't they be combined as:
std::unique_ptr<paddle::ParameterUpdater> remoteUpdaterPtr(remoteUpdater);
Nice, yes.
done
python/paddle/v2/optimizer.py
Outdated
-    def create_remote_updater(self, pass_num):
-        return swig_api.ParameterUpdater.createRemoteUpdater(self.__opt_conf__,
-                                                             pass_num)
+    def create_remote_updater(self, pass_num, use_sparse_updater):
Two functions are exposed here, create_remote_updater and create_updater. How is a user supposed to know which one to use, i.e. when to use the first and when the second? Does this need a comment?
After this change, create_remote_updater should become an internal function and not be exposed externally.
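(A sketch of that shape, under the assumption that the single public entry point is create_updater and the remote factory becomes private; the double-underscore name is my assumption, following Paddle's existing naming style:)

# optimizer.py (sketch): one public factory, the remote variant kept internal
def __create_remote_updater__(self, pass_num, use_sparse_updater):
    return swig_api.ParameterUpdater.createRemoteUpdater(
        self.__opt_conf__, pass_num, use_sparse_updater)

def create_updater(self, is_local, num_passes, use_sparse_updater):
    """Single public factory; callers never choose local vs. remote directly."""
    if is_local:
        return self.create_local_updater()
    return self.__create_remote_updater__(num_passes, use_sparse_updater)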
""" | ||
use_sparse = False | ||
for parameter in self.__model_config__.parameters: | ||
if parameter.sparse_update or parameter.sparse_remote_update: |
What exactly do the two concepts sparse_update and sparse_remote_update mean?
sparse_update is a concept of its own: when the data is sparse, only part of the parameters need to be updated, and this is currently supported both locally and remotely. For local single-machine training no special setup is needed, but for remote sparse_update the remote_parameter_updater has to be configured specially, because some data must be fetched in advance from the remote parameter server.
It mainly comes down to this part of trainer.py:
if self.use_remote_sparse_updater():
    self.__gradient_machine__.prefetch(in_args)
    self.__parameter_updater__.getParametersRemote()
@jacquesqiao My understanding is that sparse update is only used for the word-embedding dictionary: each word is one row, so the table has a very large number of rows, and each time the trainer sends gradients to the parameter server it only needs to send the few rows that changed. Is that understanding correct? If so, why is prefetch needed?
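(As a rough illustration of where sparse_update enters a model configuration, here is a sketch of a large embedding table marked for sparse updates; the layer and attribute names follow the v2 API as used in the word2vec demo, but treat the details as an assumption:)

# A large word-embedding table marked for sparse updates: only the rows
# hit by a batch are updated, and in the remote case those rows are
# prefetched from the parameter server before forward/backward.
word = paddle.layer.data(
    name='word', type=paddle.data_type.integer_value(dictsize))
emb = paddle.layer.embedding(
    input=word,
    size=32,
    param_attr=paddle.attr.Param(name='emb', sparse_update=True))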
if not cluster_train:
    paddle.init(use_gpu=False, trainer_count=1)
else:
    paddle.init(
The arguments to paddle.init are all written in the code, so cluster training forces users to modify the Python code. I suggest that paddle.init read its arguments first from environment variables, then from **kwargs, and finally fall back to defaults. That way local training code can be submitted for cluster training by changing only environment variables.
Yes, it really should work that way, but there are quite a few of these configuration parameters. Once this PR is merged, we can open a dedicated issue to sort out how these configuration parameters are obtained.
This also depends on the reader implementation; the reader will very likely force changes to the Python code anyway. I haven't come up with a good way to run on a cluster without touching the reader.
Today I saw Wang Yi's comment about launching Paddle from a Jupyter notebook; I'm actually not sure whether distributed training should be launched from the command line or from Python.
python/paddle/v2/trainer.py
Outdated
@@ -96,6 +98,18 @@ def __prepare_parameter__(self, in_args):
         self.__gradient_machine__.prefetch(in_args)
         self.__parameter_updater__.getParametersRemote()

+    def save_parameter(self, dir_name, file_name):
Don't pass dir_name and file_name as arguments; pass a file object (fp) in directly.
That way we can save not only to a local file but also to a binary stream.
@jacquesqiao Please fix this.
done
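(With the file-object interface, the call site in the event handler could look roughly like this; save_parameter_to_tar is a placeholder name for the reworked method, not necessarily the final one:)

import gzip

# Any binary stream works, not just a local file path.
with gzip.open('output/batch-%d.tar.gz' % event.batch_id, 'w') as f:
    trainer.save_parameter_to_tar(f)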
demo/word2vec/api_train_v2.py
Outdated
@@ -57,6 +69,7 @@ def main():
     def event_handler(event):
         if isinstance(event, paddle.event.EndIteration):
             if event.batch_id % 100 == 0:
+                trainer.save_parameter("output", "batch-" + str(event.batch_id))
After the refactoring the interface can stay the same; for the implementation, consider having the parameter server save the parameters (trainer.save_parameter tells the parameter server to store them).
 parameters = paddle.parameters.create(cost)
-adam_optimizer = paddle.optimizer.Adam(
+adagrad = paddle.optimizer.AdaGrad(
If the parameter server is rewritten, we should discuss whether the first version needs to support this many update rules. To keep the implementation simple, we could skip momentum-based rules like AdaGrad for now and only implement the simplest additive update?
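(For concreteness, the "simplest additive update" would be a plain SGD step with no per-parameter state; a sketch, not actual parameter-server code:)

import numpy as np

def apply_gradient(param, grad, learning_rate=0.01):
    # Plain SGD step: no momentum or AdaGrad accumulators to keep on the
    # parameter server, which is what makes a first version simple.
    param -= learning_rate * grad
    return param

param = np.zeros(4)
param = apply_gradient(param, np.ones(4))   # each entry becomes -0.01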
-    updater->m->updater.reset(new paddle::RemoteParameterUpdater(
-        config->m->getConfig(), passCount, nullptr));
+    auto remoteUpdater = new paddle::RemoteParameterUpdater(
+        config->m->getConfig(), passCount, nullptr);
Why does the parameter updater need to know passCount?
Because once the parameter server knows the pass count it can exit by itself... though that isn't really useful anyway; after training finishes you can just kill the parameter server.
python/paddle/v2/trainer.py
Outdated
os.makedirs(dir_name)
param_file_name = dir_name + "/" + file_name + '.tar.gz'
assert not os.path.exists(param_file_name)
self.__parameter_updater__.catchUpWith()
Could you explain what this sequence of operations on __parameter_updater__ does?
self.__parameter_updater__.catchUpWith()
self.__parameter_updater__.apply()
self.__parameter_updater__.getParametersRemote(True, True)
self.__parameter_updater__.restore()
These operations are there to support regularization and ModelAverage...
Regularization is computed lazily, and with ModelAverage the model used during training is not the same model that is actually used for inference or saving.
@@ -101,23 +142,26 @@ def train(self, reader, num_passes=1, event_handler=None, feeding=None):
         for pass_id in xrange(num_passes):
             event_handler(v2_event.BeginPass(pass_id))
             pass_evaluator.start()
-            updater.startPass()
+            self.__parameter_updater__.startPass()
Why does __parameter_updater__ need to know about startPass, startBatch, finishBatch, and finishPass? My understanding is that it only needs to collect gradients and distribute parameters.
LGTM, approving the C++ part of the implementation. @reyoung please take another look at the Python part.
LGTM, but there are two unused imports that can be deleted; please remove them before merging.
@@ -1,4 +1,6 @@
 import collections
+import gzip
It seems gzip is no longer needed?
@@ -1,4 +1,6 @@
 import collections
+import gzip
+import os
os isn't needed anymore either, right?
Fixes #1802