
support distribute training in python v2 API #1782

Merged: 15 commits into PaddlePaddle:develop on Apr 24, 2017

Conversation

@jacquesqiao (Member) commented Apr 13, 2017

Fixes #1802

@jacquesqiao changed the title from "init support remote updater(draft)" to "support remote updater(draft)" on Apr 13, 2017
"""
for parameter in self.__model_config__.parameters:
if parameter.sparse_update or parameter.sparse_remote_update:
return True
Contributor:
Generally we don't return from inside a for loop; would it be a bit more conventional to break out of the loop and return afterwards?

Member Author (@jacquesqiao):
Sure, will do.

Member Author (@jacquesqiao):
done
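For reference, a minimal sketch of the break-then-return shape the reviewer is asking for; the attribute names follow the diff above, while the method name is illustrative only:

    def __use_sparse_updater__(self):
        # Illustrative name: report whether any parameter needs sparse updates.
        use_sparse = False
        for parameter in self.__model_config__.parameters:
            if parameter.sparse_update or parameter.sparse_remote_update:
                use_sparse = True
                break  # leave the loop here and return once, below
        return use_sparse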

        if self.__is_local__:
            updater = self.__optimizer__.create_local_updater()
        else:
            updater = self.__optimizer__.create_remote_updater(num_passes)
Contributor:
Would it be better to put the if/else inside the optimizer, so the logic in train() stays as simple as possible?

Member Author (@jacquesqiao):
done
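A rough sketch of what moving the branch into the optimizer could look like; create_updater is a hypothetical name here, while create_local_updater and create_remote_updater come from the diff above:

    # In the optimizer (sketch):
    def create_updater(self, is_local, num_passes):
        # Hide the local/remote decision from the trainer (hypothetical API).
        if is_local:
            return self.create_local_updater()
        else:
            return self.create_remote_updater(num_passes)

    # The trainer then only needs a single call:
    #     updater = self.__optimizer__.create_updater(self.__is_local__, num_passes)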

@@ -42,7 +42,7 @@ class SGD(object):
    :type extra_layers: paddle.v2.config_base.Layer
    """

-    def __init__(self, cost, parameters, update_equation, extra_layers=None):
+    def __init__(self, cost, parameters, update_equation, extra_layers=None, is_local=True):
Contributor:
One idea: could the flag that distinguishes local from remote training live in an environment variable? The goal is that users doing distributed training don't have to modify their code at all, e.g.:

self.__is_local__ = os.getenv("DISTRIBUTED_TRAIN", None)
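Note that os.getenv returns a string (or None), so the flag would still need to be interpreted; a small sketch, assuming a variable name like DISTRIBUTED_TRAIN as in the comment above:

    import os

    def is_local_from_env(default=True):
        # Illustrative only: an unset variable means local training.
        value = os.getenv("DISTRIBUTED_TRAIN")
        if value is None:
            return default
        # any truthy setting means distributed, i.e. not local
        return value.lower() not in ("1", "true", "yes")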

Member Author (@jacquesqiao):
Sure, we can support that.

Member Author (@jacquesqiao):
It turns out this also touches a lot of other configuration options, so it would be better to clean up all of those configuration items in a dedicated PR. I'll leave it out of this one and keep the explicit configuration for now.

Collaborator:
Makes sense. Could you create an issue to track this item so it can be implemented later?

@jacquesqiao changed the title from "support remote updater(draft)" to "support remote updater" on Apr 16, 2017
@@ -2,26 +2,38 @@

import paddle.v2 as paddle

dictsize = 1953
Collaborator:
Doesn't the demo directory in github.com/paddlepaddle/paddle overlap with the content of github.com/paddlepaddle/book? Shouldn't we keep only one copy?

Also, if the changes to the word2vec demo here are meant to exercise sparse parameter updates in distributed training, wouldn't it be better to put them in a new file under a test directory, e.g. paddle/v2/trainer/test/sparse_parameter_update_test.py ?

Member Author (@jacquesqiao):
1. There is indeed duplication; the copy in book could be removed.
2. This is not just a test; it can actually train a usable model.

  auto remoteUpdater = new paddle::RemoteParameterUpdater(
      config->m->getConfig(), passCount, nullptr);
  if (useSparseUpdater) {
    std::unique_ptr<paddle::ParameterUpdater> remoteUpdaterPtr;
Collaborator:
Do L37 and L38 really need to be two separate lines? Couldn't they simply be:

std::unique_ptr<paddle::ParameterUpdater> remoteUpdaterPtr(remoteUpdater);

Member Author (@jacquesqiao):
Nice, yes.

Member Author (@jacquesqiao):
done

-    def create_remote_updater(self, pass_num):
-        return swig_api.ParameterUpdater.createRemoteUpdater(self.__opt_conf__,
-                                                             pass_num)
+    def create_remote_updater(self, pass_num, use_sparse_updater):
Collaborator:
Two functions are exposed here, create_remote_updater and create_updater. How is a user supposed to know which one to use, or when to use each? Should there be a comment?

Member Author (@jacquesqiao):
After this change, create_remote_updater should be an internal function and no longer exposed.

"""
use_sparse = False
for parameter in self.__model_config__.parameters:
if parameter.sparse_update or parameter.sparse_remote_update:
Collaborator:
What do the two concepts sparse_update and sparse_remote_update actually mean?

Member Author (@jacquesqiao):
sparse_update is an independent concept: when the data is sparse, only part of the parameters needs to be updated. It is currently supported both locally and remotely. For local, single-machine training no special setup is needed, but for remote sparse_update the remote_parameter_updater needs extra configuration, because some values have to be fetched from the remote parameter_server in advance.
The key part is in trainer.py:

if self.use_remote_sparse_updater():
    self.__gradient_machine__.prefetch(in_args)
    self.__parameter_updater__.getParametersRemote()

Contributor (@helinwang) commented Apr 20, 2017:
@jacquesqiao My understanding is that sparse update is only used for the word-embedding dictionary, which has one row per word and therefore a very large number of rows, and each time the trainer sends gradients to the parameter server it only needs to send the rows that changed. Is that right? If so, why is prefetch needed?

    if not cluster_train:
        paddle.init(use_gpu=False, trainer_count=1)
    else:
        paddle.init(
Contributor:
The paddle.init arguments are all hard-coded, so cluster training requires modifying the Python code. I suggest paddle.init read its arguments from environment variables first, then from **kwargs, then fall back to defaults. That way, code written for local training can be submitted for cluster training by changing only environment variables.
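A hedged sketch of the suggested lookup order (environment variables first, then keyword arguments, then defaults); the PADDLE_INIT_ prefix and the helper itself are hypothetical, not current paddle.init behaviour:

    import os

    def resolve_init_args(defaults, **kwargs):
        # Hypothetical helper: env vars override kwargs, kwargs override defaults.
        args = dict(defaults)
        args.update(kwargs)
        for key, default in defaults.items():
            env_value = os.getenv("PADDLE_INIT_" + key.upper())
            if env_value is not None:
                if isinstance(default, bool):
                    args[key] = env_value.lower() in ("1", "true", "yes")
                else:
                    # cast the string from the environment to the default's type
                    args[key] = type(default)(env_value)
        return args

    # e.g. paddle.init(**resolve_init_args({"use_gpu": False, "trainer_count": 1}))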

Member Author (@jacquesqiao):
Agreed, it should work that way, but there are quite a lot of these configuration parameters. Once this PR is merged, let's open a dedicated issue to sort out how configuration parameters are obtained.

Contributor:
This is also related to the reader implementation; the reader will very likely force changes to the Python code anyway. I haven't figured out a good way to run on a cluster without changing the reader.

I saw Wang Yi's comment today about launching Paddle from a Jupyter notebook; I'm actually not sure whether distributed training should be launched from the command line or from Python.

@jacquesqiao changed the title from "support remote updater" to "support distribute training in python v2 API" on Apr 19, 2017
@@ -96,6 +98,18 @@ def __prepare_parameter__(self, in_args):
            self.__gradient_machine__.prefetch(in_args)
            self.__parameter_updater__.getParametersRemote()

    def save_parameter(self, dir_name, file_name):
Collaborator:
Don't pass dir_name and file_name as arguments; pass in a file object instead.

That way we can save not only to a local file but also to a binary stream.
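A minimal sketch of the file-object variant being asked for; the method name and the to_tar call are illustrative, assuming the Parameters object can already serialize itself to a stream:

    def save_parameter(self, f):
        # Illustrative: write the parameters to any writable file-like object.
        self.__parameters__.to_tar(f)  # assumed serialization helper

    # Either of these would then work:
    #     with open("output/batch-100.tar", "wb") as f:
    #         trainer.save_parameter(f)
    #     buf = io.BytesIO()
    #     trainer.save_parameter(buf)   # in-memory binary stream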

Collaborator:
@jacquesqiao Please fix this.

Member Author (@jacquesqiao):
done


@@ -57,6 +69,7 @@ def main():
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 100 == 0:
+                trainer.save_parameter("output", "batch-" + str(event.batch_id))
Contributor:
After the refactor the interface can stay the same; for the implementation, consider having the parameter server save the parameters (trainer.save_parameter tells the parameter server to store them).

    parameters = paddle.parameters.create(cost)
-    adam_optimizer = paddle.optimizer.Adam(
+    adagrad = paddle.optimizer.AdaGrad(
Contributor:
If the parameter server gets rewritten, we should discuss whether the first version needs to support this many update rules. For simplicity, could the first version skip momentum-style rules like AdaGrad and only implement the simplest additive update?
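For reference, the "simplest additive update" amounts to plain SGD, where the parameter server only has to add a scaled gradient to each parameter; a toy sketch, not Paddle code:

    def sgd_update(parameter, gradient, learning_rate=0.01):
        # Plain SGD: no per-parameter state, just a scaled addition.
        # Works elementwise for numpy arrays as well as plain floats.
        return parameter - learning_rate * gradient

    # AdaGrad, by contrast, must also keep accumulated squared gradients per
    # parameter, extra state that a first parameter server version could avoid.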

-  updater->m->updater.reset(new paddle::RemoteParameterUpdater(
-      config->m->getConfig(), passCount, nullptr));
+  auto remoteUpdater = new paddle::RemoteParameterUpdater(
+      config->m->getConfig(), passCount, nullptr);
Contributor:
Why does the parameter updater need to know passCount?

Collaborator:
Because once the parameter server knows the pass count it can shut itself down. Not that it matters much; after training you can just kill the parameter server directly.

        os.makedirs(dir_name)
        param_file_name = dir_name + "/" + file_name + '.tar.gz'
        assert not os.path.exists(param_file_name)
        self.__parameter_updater__.catchUpWith()
Contributor (@helinwang) commented Apr 20, 2017:
Could you explain what this sequence of operations on __parameter_updater__ does?

self.__parameter_updater__.catchUpWith()
self.__parameter_updater__.apply()
self.__parameter_updater__.getParametersRemote(True, True)
self.__parameter_updater__.restore()

Collaborator:
These calls support regularization and ModelAverage.

Regularization is computed lazily, and with ModelAverage the model currently being trained is not the same model that is actually used for inference or saved.
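Based on that explanation, an annotated reading of the sequence (the per-call comments are one interpretation, not documented guarantees):

    self.__parameter_updater__.catchUpWith()   # flush lazily-deferred updates (e.g. regularization)
    self.__parameter_updater__.apply()         # switch to the averaged parameters used for saving/inference
    self.__parameter_updater__.getParametersRemote(True, True)  # pull current values from the parameter server
    # ... parameters are written out at this point ...
    self.__parameter_updater__.restore()       # switch back to the parameters used for training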

@@ -101,23 +142,26 @@ def train(self, reader, num_passes=1, event_handler=None, feeding=None):
        for pass_id in xrange(num_passes):
            event_handler(v2_event.BeginPass(pass_id))
            pass_evaluator.start()
-            updater.startPass()
+            self.__parameter_updater__.startPass()
Contributor:
Why does __parameter_updater__ need to know about startPass, startBatch, finishBatch, and finishPass? My understanding is that it only needs to collect gradients and distribute parameters.

Contributor (@hedaoyuan) left a comment:
LGTM, approving the C++ part of the implementation. @reyoung please take another look at the Python part.

Collaborator (@reyoung) left a comment:
LGTM.

There are two unused imports that could be removed, though; please delete them before merging.

@@ -1,4 +1,6 @@
import collections
import gzip
Collaborator:
It seems gzip is no longer needed?

@@ -1,4 +1,6 @@
import collections
import gzip
import os
Collaborator:
os probably isn't needed either?

@jacquesqiao merged commit 5f92400 into PaddlePaddle:develop on Apr 24, 2017
lizexu123 pushed a commit to lizexu123/Paddle that referenced this pull request Feb 23, 2024