
Design doc: submit a distributed job #1770

Closed

Conversation

@Yancey1989 (Contributor) commented Apr 11, 2017

The rendered version may be easier to view here.


This PR has accumulated too many comments and become hard to review; please move to #2111.
Thanks @helinwang @jacquesqiao @gongweibao @typhoonzero @wangkuiyi

@@ -0,0 +1,70 @@

# PaddlePaddle Client
PaddlePaddle clinet is a command line tool, before startting a cluster train, you need to install it on your latop, you can sumit a cluster train job looks like:
Member

clinet => client

--input=${INPUT_DIR}
```

- job-name: you can specify a name for every job, and the name should be uniq.
Member

The user can pick the job-name themselves, but the framework should automatically generate an app-id, and that is the thing that must be unique.

Collaborator

Makes sense. kubectl should return an id for every job; that is not hard.

Still, it would be nice if job-name could also be made unique automatically, e.g. full_job_name = k8s_user_name + "/" + job_name ?

Contributor Author

Maybe full_job_name=<k8s-user-name>/<job-name>/<job-id>? A user may submit the same job name twice.

Contributor

We could consider not letting users specify the job name at all. As mentioned above, we can simply generate an id for the user (similar to how docker run prints the Docker-generated container ID on stdout).

Contributor Author

Manually specifying a unique job name looks like a fairly common approach (both Kubernetes and gcloud have the convention of manually specifying a unique job name), and it gives users a consistent experience, e.g. the usual sequence of operations:

paddle job submit <job-name>
paddle job status <job-name>
paddle job stop <job-name>

With a generated job-id, the user has to note the ID returned by the first submission; if they miss it, it is hard to find out which job they just submitted.


- Data source

Input data should be saved on the distributed and you have the access.
Member

access => access right?

--pcakage-path=./sharding-data
--module-name=sharding-data.main
--input=${INPUT_DIR}
--output=${OUTPUT_DIR
Member

Should there be a parameter such as sharding-count, to specify how many shards to split into?

Contributor Author

@Yancey1989 Apr 12, 2017

@jacquesqiao Indeed. Following https://github.com/PaddlePaddle/Paddle/pull/1770#discussion_r111053711, controlling this through environment variables looks more general.

Contributor

@helinwang Apr 12, 2017

I don't think sharding should be left to the user. With 10 trainers a dataset might need to be sharded into 500 pieces; with 100 trainers, maybe 5000. Since the shard count changes with the trainer count, the user would likely have to re-shard before changing the number of trainers. Isn't that too much trouble?

As I understand it, sharding is needed here because we let users upload their own data, and the cluster does not know the format, so it cannot shard it.
Another option is to require users to convert their dataset into a cluster-supported format before using it (as far as I know this is what Google does; the corresponding format is SSTable). That way the cluster controls the storage format, can optimize for sequential reads (training reads sequentially), and can take care of sharding itself. All of this stays invisible to the user. If sharding and the choice of format are left to users, they may not be interested in doing it, and it is hard to do well.
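
As a rough sketch of the convert-then-let-the-cluster-shard idea above (purely illustrative; the helper name and the plain-text record format are made up and not part of the design):

```python
import os

def convert_and_shard(records, out_dir, num_shards):
    """Write text records into num_shards plain-text part files.

    Illustrative only: the cluster, not the user, would choose num_shards,
    so changing the trainer count never forces the user to re-shard.
    """
    writers = [open(os.path.join(out_dir, "part-%05d" % i), "w")
               for i in range(num_shards)]
    for i, record in enumerate(records):
        writers[i % num_shards].write(record.rstrip("\n") + "\n")
    for w in writers:
        w.close()
```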

Contributor Author

@Yancey1989 Apr 13, 2017

@helinwang Having users do file sharding themselves would indeed cause all sorts of problems. In HDFS, for example, a file is split into multiple blocks; at startup a trainer only needs to fetch a block that has not been read yet and train on it. As a further optimization, if the storage is deployed on the compute nodes of the Kubernetes cluster, trainers can prefer reading locally stored blocks.

@@ -0,0 +1,70 @@

# PaddlePaddle Client
PaddlePaddle clinet is a command line tool, before startting a cluster train, you need to install it on your latop, you can sumit a cluster train job looks like:
Collaborator

There are quite a few English grammar errors here. I suggest checking with Grammarly.com. I paid for it; below is my check result (as a paid user, Grammarly.com checks more for me than it does for free users):

screen shot 2017-04-11 at 7 12 42 pm

PaddlePaddle clinet is a command line tool, before startting a cluster train, you need to install it on your latop, you can sumit a cluster train job looks like:

```bash
paddle k8s job <job-name>
Collaborator

Shouldn't this be

paddle k8s start <job-name>

? And we also need some other commands, e.g. stop, log, ...?

paddle k8s job <job-name>
--package-path= ./demo
--module=demo.train
--pserver-count=2
Collaborator

--pservers=2, --trainers=4

--pserver-count=2
--trainer-count=4
--output=${OUTPUT_DIR}
--input=${INPUT_DIR}
Collaborator

I suspect we do not need a "standard" --input option; instead, an -e flag could pass environment variables or command-line arguments that demo/train.py accepts. For example:

paddle k8s start deepspeech2 \
   -e "INPUT_DIR=/glusterfs/voice-team/voice-data" \
   -e "MODEL_DIR=/glusterfs/yanxu/ds_2ps_4tn"

Then demo/train.py can contain:

paddle.v2.train(
    reader=dataset.voice.train(os.getenv("INPUT_DIR")), 
    ...)

# Master process
- Setup master process

While user submit a distributed train job through PaddlePaddle client, it will setup a master process on cluster, the master process provids a http service for receiving package files and some cluster parameter.
Collaborator

What language will the master be written in? How does it communicate with the trainers?


- Saving package

Master process will receive the package files and save on the distributed storage, for every job, master will generate a random id called *job-id*, the package file path looks like:
Collaborator

What is a "package file"? I assumed paddle k8s start would pack the contents of the package path into a Docker image and push it to a Docker registry that Kubernetes can access?

Contributor Author

The package file is the program the user developed against the PaddlePaddle API.

> I assumed paddle k8s start would pack the contents of the package path into a Docker image and push it to a Docker registry that Kubernetes can access?

That is indeed one way to do it, but it feels like a lot of work on the client side. Since all we need is this package, uploading it to the server, storing it on distributed storage, and mounting that directory as a volume when the job starts looks simpler.

Collaborator

What if my .py file imports some Python packages that are not installed on the cluster?

Contributor

Doing it on the client side can mean our tooling handles it under the hood: it packages automatically based on the specified directory, with nothing for the user to do. The user only needs Docker installed.
The benefit is that users with deeply customized setups can directly provide a Docker image with their environment already configured.

Collaborator

@wangkuiyi Apr 13, 2017

@emailweixu and I once discussed an idea: enumerate the imported Python packages by inspecting built-in variables such as Python's __all__. Assuming those packages can all be installed via pip install, we could automatically generate a Dockerfile that installs them. It is just an idea and has not been technically validated yet.
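
A minimal sketch of that idea, assuming the user's imports are all pip-installable (the standard-library filter below is deliberately incomplete and the function names are illustrative, not part of the design):

```python
import ast
import sys

def third_party_imports(py_file):
    # Collect top-level imported module names from the user's script.
    with open(py_file) as f:
        tree = ast.parse(f.read())
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    # Very rough standard-library filter; a real implementation needs a full list.
    stdlib = set(sys.builtin_module_names) | {"os", "sys", "json", "time", "math"}
    return sorted(names - stdlib)

def generate_dockerfile(py_file, base_image="paddlepaddle/paddle"):
    lines = ["FROM %s" % base_image]
    packages = third_party_imports(py_file)
    if packages:
        lines.append("RUN pip install %s" % " ".join(packages))
    lines.append("COPY . /package")
    return "\n".join(lines)
```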

Contributor Author

> What if my .py file imports some Python packages that are not installed on the cluster?

One approach is to put a requirements.txt file in the package directory listing the Python packages it depends on; when the system prepares the environment, it runs pip install -r requirements.txt beforehand to install them.
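
A minimal sketch of that preparation step, assuming it runs inside the container before the trainer starts (the function name and paths here are illustrative, not part of the proposal):

```python
import os
import subprocess
import sys

def prepare_and_launch(package_dir, entrypoint):
    # Install the user's declared dependencies, then start the trainer.
    requirements = os.path.join(package_dir, "requirements.txt")
    if os.path.exists(requirements):
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-r", requirements])
    subprocess.check_call(entrypoint, cwd=package_dir)

# e.g. prepare_and_launch("/train/package", ["python", "train.py"])
```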


- Data source

Input data should be saved on the distributed and you have the access.
Collaborator

There are several kinds of data sources:

  1. Data inside the Docker image (because it was placed in the package directory and got packed into the image).
  2. Data on a distributed file system.
  3. Data the reader fetches online, e.g. the reader is an HTTP server waiting for a data generator to send data.
  4. Data synthesized by the reader, like the fit_a_line demo.

All of these need careful consideration. For example, if the data sits on a distributed file system, should any format be allowed (since users can write their own readers), or must it be one of a few predefined formats (SSTable or a string sequence file) read with predefined readers?

Contributor Author

Based on the recent discussion, this version only considers case 2, data on a distributed file system. We provide a default reader for data on the distributed file system (GlusterFS), but the design should allow other readers to plug into the system.


# Different cluster management

PaddlePaddle client support plugins for different cluster management client, such as kubernetes looks like `paddle k8s ...`, MPI looks like `paddl mpi...`,but runinig process are the same:
Collaborator

paddle k8s start needs to generate a Docker image, but does paddle mpi start skip the Docker image (and the package file) and instead assume the program and data are already deployed?

Contributor Author

The package file contains the training program the user developed, which is always needed. paddle mpi start indeed does not need a Docker image; the environment lives where the master (MPI-master) process runs, and when the job is submitted to the MPI cluster, the user's training program and environment files are submitted to the cluster together for training.

@@ -0,0 +1,70 @@

# PaddlePaddle Client
Contributor

Since you are describing the PaddlePaddle client, do you need to describe the full features of it, like local training, show version etc.

PaddlePaddle clinet is a command line tool, before startting a cluster train, you need to install it on your latop, you can sumit a cluster train job looks like:

```bash
paddle k8s job <job-name>
Contributor

Should be paddle k8s_train ..., use a single subcommand or use the kubectl style: paddle submit k8s_job -f [job.yaml]

Contributor Author

Besides paddle k8s train there are also paddle k8s [stop|status...] commands, and the paddle client also supports submitting MPI cluster training jobs, e.g. paddle mpi train, so maybe subcommands look better?

The kubectl style is a good point. Shall we support both styles: command-line parameters for simple job configurations and a yml file for complex ones?

# Master process
- Setup master process

While user submit a distributed train job through PaddlePaddle client, it will setup a master process on cluster, the master process provids a http service for receiving package files and some cluster parameter.
Contributor

Why do we need to set up a master process for each job?

Contributor Author

The main consideration is that jobs are independent of each other. The master is responsible for setting up, stopping, and reporting status for the job, so one master process per job is a simple plan; it does not have to care about things such as user access control.

On the other hand, pre-starting a pool of master processes might be a better plan, so the user does not need to set up a new master process when submitting a job. If you have ideas, let's discuss this.

--module=demo.train
--pserver-count=2
--trainer-count=4
--output=${OUTPUT_DIR}
Contributor

@helinwang Apr 12, 2017

This client is installed on the user's machine. I don't think users should be able to set the output path, otherwise they may pass a wrong value. We could assign a path after the job is submitted and print it when the user runs paddle k8s describe.
For the sub-commands after paddle k8s (submit, describe, ...), see: https://cloud.google.com/sdk/gcloud/reference/ml-engine/jobs/

--pserver-count=2
--trainer-count=4
--output=${OUTPUT_DIR}
--input=${INPUT_DIR}
Contributor

@helinwang Apr 12, 2017

I assume input here refers to the dataset. I think specifying it in the .py file is enough, e.g.:

reader = paddle.dist.dataset("mnist") # mnist digit recognition dataset

A dataset uploaded by the user could also be given a name and referenced in the .py file.
Another benefit of referencing it in the .py file is that using multiple datasets at the same time becomes easy.
Also, the dataset and the input format of the network topology are strongly coupled (e.g. an image recognition topology takes images as input and needs an image dataset); since they are strongly coupled, it feels better to keep them in the same place (both in the .py file).

Contributor Author

@helinwang Here input refers to the path of the input data. I don't think the dataset name needs to change; when defining the dataset, the input environment variable can specify the input path. The benefit is that for local training input is a local path and in the cloud it is a path on distributed storage, while the dataset name stays the same; only the reader code may differ.
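
A small sketch of what that could look like from the user's side: the reader code stays the same and only the INPUT environment variable changes between local and cloud runs (the variable name and file layout here are illustrative assumptions):

```python
import os

def train_reader():
    # Locally, INPUT points at a folder on disk; on the cluster it points at
    # the mounted distributed-storage path (e.g. a GlusterFS mount).
    data_dir = os.getenv("INPUT", "./data")

    def reader():
        for name in sorted(os.listdir(data_dir)):
            with open(os.path.join(data_dir, name)) as f:
                for line in f:
                    yield line.rstrip("\n")

    return reader
```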

Contributor

@helinwang Apr 15, 2017

@Yancey1989 I see: you mean the user uploads the dataset somewhere, and the input environment variable holds the input path (where the dataset was uploaded). The benefit is that local and cloud training need no code change (only the environment variable changes).
But there is one problem with this approach that I haven't figured out: the only difference between local and cloud execution is input, a local folder locally and a cloud folder in the cloud. The data in the cloud is then still the same data, and we still don't know how to shard it (we don't know what format the user used). Maybe I'm just missing something; if you have a way, happy to discuss.

What I have in mind is a similar but slightly different approach: let the user upload the dataset, but require them to convert it into our format with our tooling and store it somewhere. The conversion can assign a name, and afterwards only that name is needed; our tooling can look up where the converted files are stored.
For example:

reader = paddle.dist.dataset("mnist") # mnist is a public dataset; no upload needed, just specify the name.

or

reader = paddle.dist.dataset("my-custom-dataset")

The downside is that local and cloud training require a code change (though only one line).
The upside is that since it is our own format, we can do the sharding ourselves and design the format for fast sequential reads.

# Master process
- Setup master process

While user submit a distributed train job through PaddlePaddle client, it will setup a master process on cluster, the master process provids a http service for receiving package files and some cluster parameter.
Contributor

it will setup -> PaddlePaddle will setup
It is PaddlePaddle that creates it, not the user directly; changing this avoids confusion.


- Setup parameter and trainer

Master process will set up parameter server process and trainer process, on kubernetes, master will deploy two deployment for parameter server and trainer.
Contributor

@helinwang Apr 12, 2017

I think the training master process should only handle training (dispatching tasks to trainers) and should not manage starting the trainers and parameter servers. The master, trainers, and parameter servers can be scheduled directly by k8s.

What is described here seems to be: the user tells the master, and the master relays to k8s for scheduling. I'm not sure what the benefit is.
If there is a benefit, this master could be made a standalone service rather than giving this responsibility to the training master process.

Contributor Author

> What is described here seems to be: the user tells the master, and the master relays to k8s for scheduling. I'm not sure what the benefit is.

My understanding is that the master still needs to coordinate the startup sequence of the pservers and trainers. For example, each job starts roughly like this (see the sketch after this comment):

  1. Submit a k8s job to start the pservers.
  2. Wait for the pserver job to reach the RUNNING state and obtain each pod's IP address (the IP is only available once the pod is RUNNING).
  3. Start the trainer job and pass the pserver IP addresses from the previous step to it.

There is a time gap between steps 1 and 2; if this is done on the client side, every submission makes the user wait a long time, which is not very friendly. Doing the submission on the server side also helps simplify the client logic.

Whether this master and the training master are the same process is worth discussing. If they were the same process, it would roughly work like this:

  1. Start the pserver and trainer jobs.
  2. Communicate with the trainers and dispatch tasks; update the connection pool if a trainer restarts.
  3. Finish training and exit once all trainer jobs reach the completed state.
  4. For availability, two master processes can run at the same time, with leader election guaranteeing that one of them serves.
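
A hedged sketch of step 2 above using the official kubernetes Python client (the label selector, namespace, and timeout are made-up values), to show what "wait for RUNNING, then collect pod IPs" could look like on the master:

```python
import time
from kubernetes import client, config

def wait_for_pserver_ips(namespace, label_selector, expected, timeout=600):
    # The master runs inside the cluster, so it can use the in-cluster config.
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    deadline = time.time() + timeout
    while time.time() < deadline:
        pods = v1.list_namespaced_pod(namespace, label_selector=label_selector).items
        ips = [p.status.pod_ip for p in pods
               if p.status.phase == "Running" and p.status.pod_ip]
        if len(ips) >= expected:
            return ips          # pass these to the trainer job's spec
        time.sleep(5)
    raise RuntimeError("pserver pods did not reach Running in time")

# e.g. wait_for_pserver_ips("default", "paddle-job-pserver=quickstart", expected=2)
```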

Contributor

@helinwang Apr 16, 2017

Service discovery can be done through etcd. Please see https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/dist ; as described there:

  1. The master only starts dispatching tasks to trainers once it sees them in etcd.
  2. A trainer only announces its contact information in etcd after it sees enough parameter servers in etcd.
  3. A parameter server announces its contact information in etcd as soon as it starts.

This way the master seemingly does not need to coordinate the startup of the parameter servers and trainers.
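
A minimal sketch of item 3 above using the python-etcd3 client (the key layout, env variable, and TTL are assumptions, not part of the design): each pserver registers itself under a per-job prefix with a lease, so its entry disappears if it dies, and trainers or the master simply list that prefix.

```python
import os
import etcd3

def announce_pserver(job_name, addr, ttl=30):
    # Called by a parameter server right after it starts listening on addr.
    cli = etcd3.client(host=os.getenv("ETCD_HOST", "127.0.0.1"))
    lease = cli.lease(ttl)                       # entry expires if the pserver dies
    cli.put("/paddle/%s/pservers/%s" % (job_name, addr), addr, lease=lease)
    return lease                                 # caller keeps calling lease.refresh()

def list_pservers(job_name):
    # Called by trainers (or the master) to discover live parameter servers.
    cli = etcd3.client(host=os.getenv("ETCD_HOST", "127.0.0.1"))
    prefix = "/paddle/%s/pservers/" % job_name
    return sorted(value.decode() for value, _ in cli.get_prefix(prefix))
```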

Contributor Author

@Yancey1989 Apr 16, 2017

My latest thought: after each parameter server process (Pod) starts, Kubernetes automatically records information such as its IP address in etcd. Could we avoid reading and writing etcd directly and instead obtain each Pod's contact information through the Kubernetes API?

Contributor

@Yancey1989 This is an interesting idea. I didn't know Kubernetes also writes the IP information into etcd and exposes it through the Kubernetes API; how is that done?

I thought about it, and maybe I don't understand it deeply enough, but I'm a bit worried this creates too strong a dependency on Kubernetes: before, we only depended on etcd (any cluster manager, k8s, Mesos, or Nomad, could start the job), but now we would depend strongly on k8s.
Another concern is how much we would have to customize k8s to obtain each pod's contact information directly through the k8s API.

Contributor Author

Using Kubernetes' client-go, the code looks roughly like this:

package main

import (
	"flag"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "/Users/yanxu05/.kube/config",
		"absolute path to the kubeconfig file")
	flag.Parse()
	// uses the current context in kubeconfig
	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err.Error())
	}
	// creates the clientset
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err.Error())
	}
	// looks up a pod by name and prints its IP address
	pod, err := clientset.CoreV1().Pods("default").Get("nginx-2351037158-ph4ts", metav1.GetOptions{})
	if err != nil {
		panic(err.Error())
	}
	fmt.Println(pod.Status.PodIP)
}

I saw issue #1807; it makes sense that using the Kubernetes client does have security and portability problems. Thanks @helinwang!

Contributor Author

Done.


# Different cluster management

PaddlePaddle client support plugins for different cluster management client, such as kubernetes looks like `paddle k8s ...`, MPI looks like `paddl mpi...`,but runinig process are the same:
Contributor

@helinwang Apr 12, 2017

Does MPI have any advantage or specialty compared with k8s? If not, and since it is said here that MPI also runs on top of k8s, and the cluster has k8s installed anyway, should we just drop MPI?

Contributor

@gongweibao Apr 13, 2017

The figure may be slightly wrong.
The MPI receiver should not be based on the k8s cluster; it belongs with the MPI cluster. The goal is compatibility with the company's internal MPI cluster environment.

Contributor Author

@Yancey1989 Apr 13, 2017

@gongweibao The receiver can run on a machine in the company intranet as long as the MPI client is installed; running it on Kubernetes here is meant to simplify the earlier deployment process.


The relation of PaddlePaddle, kubernetes and docker:

<img src="./submit-job.png" width="500">
Collaborator

Questions about this figure:

  1. I am not sure that pservers and trainers should be in two jobs. In our current configuration, each trainer has a pserver running on the same physical node to optimally overlap networking and computing.

  2. paddle must communicate with Kubernetes' API server to start the job. It might communicate with the master process of the job too.

Contributor Author

> 1. I am not sure that pservers and trainers should be in two jobs

The current configuration has two problems:

  1. It does not scale to large trainer counts: since trainer count == pserver count, when the trainer count is large we do not need the same number of pservers, and matching them increases both the probability of pserver failure and the network load.
  2. Since they start in the same container, either the pserver or the trainer has to run in the background, which violates the container design principle and makes failure detection and recovery harder.

> each trainer has a pserver running on the same physical node to optimally overlap networking and computing

It seems that controlling the number of pservers does more to optimize the network; besides, a trainer has to communicate with all pservers, so having only one pserver colocated on the local node does not seem to help much?

> paddle must communicate with Kubernetes' API server to start the job. It might communicate with the master process of the job too

I think @helinwang's comment makes sense: paddle can communicate only with the master, with the master existing as a service. This master is not the same master as in https://github.com/PaddlePaddle/Paddle/tree/develop/doc/design/dist#master-process ; I will revise this part of the description in the next update.

Contributor Author

Done.

# Running your training locally
Execute `paddle local train` to run your local train.
```bash
paddle local train
Collaborator

The first argument after paddle should be a command. local isn't even a verb. It seems that it could simply be paddle train --locally or just paddle train without Kubernetes related arguments.

Contributor Author

Done.

- image: paddlepaddle production image
- e: environment varible

When you start the local train, the client starts a docker container like:
Collaborator

==> When users start a local training job ...

--input=<input_dir>
--output=<output_dir>
--image=<paddle_image>
--e=NUM_PASS=4
Collaborator

--env == -e

Contributor Author

Done.

Execute `paddle local train` to run your local train.
```bash
paddle local train
--pcakage-path=./demo
Collaborator

pcakage ==> package

-v <input_dir>:/train/input
-v <output_dir>:/train/output
-v <package-path>:/train/package
-e NUM_PASS=4 <paddle_image>
Collaborator

I guess we also need -e "PYTHONPATH=/train/package" so that Python can find imported packages there?

Contributor Author

Done.

When you start the local train, the client starts a docker container like:
```bash
docker run --rm
-v <input_dir>:/train/input
Collaborator

How about changing /train/{input,output,package} into /{input,output,package}?

Contributor Author

Done.

You can use `paddle submit job <job-name>` to submit a distributed training job.

```bash
paddle job submit train <job-name>
Collaborator

paddle job submit train ==>
paddle train

Contributor Author

Done.

- e: environment variable

## Build your docker image
Before submitting a distributed training, you should build your docker image, here
Collaborator

What I expected is that paddle train packs everything into a Docker image, rather than having users pack it themselves?

Contributor Author

Done.


The relation of PaddlePaddle, kubernetes and docker:

<img src="./submit-job.png" width="500">
Contributor

Does "start up a local training job" in the figure need to go through the PaddlePaddle Client? It feels like running python train.py locally is enough.

Contributor Author

Currently local training requires docker run ... python train.py; having the PaddlePaddle Client support local training is also meant to spare users from learning Docker operations.

Contributor

I see. In that case, besides downloading the Docker image, does the user also have to download a separate launcher script?

Contributor Author

Let's discuss this offline; we also need to discuss whether the Queue implementation should use etcd :)

<img src="./submit-job.png" width="500">


# Running local training job
Contributor

For a local training job, it feels like python train.py is enough; no command-line support is needed.

```bash
paddle train \
--job-name=cluster-quickstart \
--package-path=$PWD/quick_start \
Contributor

This says we upload a folder to serve as the root directory at execution time, which doesn't seem to be the same thing as my understanding of a package. Calling it --env-path might be easier to understand.

Contributor Author

env-path sounds more like setting an environment variable? This package-path is a local directory containing the network configuration. @helinwang what do you mean by "package"?

Contributor

I understood "package" as a package in a programming language, which in my mind is not the same as a folder (e.g. a C++ package is headers plus dynamic/static libraries, not a folder).

paddle train \
--job-name=cluster-quickstart \
--package-path=$PWD/quick_start \
--module=quick_start.train \
Contributor

Suggest changing "--module=quick_start.train" to "--entry-point=python train.py" # it could also be a user-written script; there is no need to restrict execution to a Python module.

Contributor Author

Agree with --entry-point instead of --module

--job-name=cluster-quickstart \
--package-path=$PWD/quick_start \
--module=quick_start.train \
--input=<input-dir> \
Contributor

According to last night's discussion about datasets and readers, --input seems unnecessary. See: #1696 (comment)

Contributor Author

I think the --input parameter is still needed. In distributed training, --input points to a GlusterFS path; the process of uploading data and submitting the job is:

  1. paddle upload <host path> <clusterfs path>
  2. paddle train --input=<clusterfs path>
  3. When describing the Job that starts the trainers, the GlusterFS volume must be mounted into the Pod, e.g. https://github.com/k8sp/tutorials/blob/develop/quickstart/paddle_dist_train/quickstart.yaml#L57
  4. In the reader:

trainerid = fetch_trainerid_from_todo_queue()
fp = open(os.path.join("/mnt/glusterfs", os.getenv("INPUT"), trainerid, ".list"))
def reader():
    for l in fp:
        yield ...
return reader

Contributor

@Yancey1989 Wu Yi has updated his PR on uploading data and the reader usage; see: https://github.com/typhoonzero/Paddle/blob/clusterdesign/doc/design/cluster_train/data_dispatch.md

PaddlePaddle support a default reader for reading data from distributed file system.
- HTTP server

TODO
Contributor

I think it's already mentioned in #1696 , maybe we can give a general introduction (as you already did) and reference there after that PR is merged.

Contributor Author

Delete this section :)


TODO

## PaddlePaddle client commands:
Contributor

I think for running locally we can just let the user run python train.py.

## Work feature
- V1
- Paddle server is a local version, build `runtime docker image`, deploy trainer and pserver job on user's host.
- implement `paddle train`, `paddle list`, `paddle cancel`
Contributor

If we let users run python train.py locally, we can change paddle train to something like paddle submit.

Contributor Author

It seems that we still need paddle train for local training; see https://github.com/PaddlePaddle/Paddle/pull/1770/files#r112356575

- Paddle server is a local version, build `runtime docker image`, deploy trainer and pserver job on user's host.
- implement `paddle train`, `paddle list`, `paddle cancel`
- V2
- Paddle server is running on kubernetes, users will only upload the package files and some setup parameters and building `runtime docker image` on kubernetes.
Contributor

Please list the benefit of building docker image on the cloud. Otherwise it's not convincing that we should do this step.

Contributor Author

Done.

- implement `paddle train`, `paddle list`, `paddle cancel`
- V2
- Paddle server is running on kubernetes, users will only upload the package files and some setup parameters and building `runtime docker image` on kubernetes.
- implement `paddle prediction` and other feature.
Contributor

I think this is too far away, and too much uncertainty: do we need to do load balancing for paddle prediction, etc... Maybe we can just remove it and add it later.

Contributor Author

Delete paddle prediction, done.

@helinwang
Contributor

We may need to discuss this comment when we meet: #1770 (comment)

I feel that wrapping Docker inside a script so users are unaware of Docker is a bit troublesome, because they still have to download a separate script. Also, if paddle is driven from the command line, Jupyter on the cluster does not seem easy to support. I don't know what others think.

@Yancey1989
Contributor Author

Updated according to the outcome of the discussion.

If a user wants to start up a local train, he will start up a PaddlePaddle product Docker container firstly, and then
execute `python train.py` in the Docker container.The details about PaddlePaddle Docker image is [here](../../../paddle/scripts/docker/README.md)

If a user wants to start up a distributed training job, he will submit the distributed training job in python code, or use a command line tool.
Contributor

Put submitting a distributed job first and submitting a local job after it? The focus of this document is describing how to submit a distributed job.

Contributor Author

Done.


If a user wants to start up a distributed training job, he will submit the distributed training job in python code, or use a command line tool.

The relation of PaddlePaddle, kubernetes and docker:
Contributor

relation -> relationship

And the very next line is already a level-1 heading?

Contributor Author

Done, changed it to a level-2 heading, thanks for pointing it out.


- Python Dependencies

Users will provide a `requirments.txt` file in trainer packages, to list python dependencies packages, such as:
Contributor

Users will provide a requirments.txt file in trainer packages, to list python dependencies packages, such as:

You need to provide requirments.txt file in your "trainer" package. Example:

Contributor Author

Done

The relation of PaddlePaddle, kubernetes and docker:


# Runtime Environment On kubernetes
Contributor

Should this be a level-2 heading?

pillow
protobuf==3.1.0
```
some other details about `requirements` is [here](https://pip.readthedocs.io/en/1.1/requirements.html).
Contributor

some other details about requirements is here.

More details about requirements.

Contributor Author

done.

```
some other details about `requirements` is [here](https://pip.readthedocs.io/en/1.1/requirements.html).

Here is an example project:
Contributor

An example project looks like:

```
Execute the command: `paddle train --trainer-package=./paddle_eample/quick_start ...`, PaddlePaddle client will upload the trainer package(quick_start)and setup parameters to [Job Server](#job-server)

## Submit a Distributed Training Job In Python Code
Contributor

Submit a Distributed Training Job In Python Code -> Submit Distributed Training Job With Python Code

## Submit a Distributed Training Job In Python Code
<img src="./submit-job-python.png" width="800">

Users will call `paddle.dist_train` and provide distributed training configuration as the parameters.
Contributor

Users will -> You can


Users will call `paddle.dist_train` and provide distributed training configuration as the parameters.
```python
paddle.dist_train(
Contributor

For the parameters of the Python submission and the command-line submission, please add their default values and whether they are optional or required.

Contributor Author

done.

Execute the command: `paddle train --trainer-package=./paddle_eample/quick_start ...`, PaddlePaddle client will upload the trainer package(quick_start)and setup parameters to [Job Server](#job-server)

## Submit a Distributed Training Job In Python Code
<img src="./submit-job-python.png" width="800">
Contributor

Is the job server the same instance as the paddle cloud dashboard website?

If a user wants to start up a distributed training job, he will submit the distributed training job with python code.

If a user wants to start up a local train, he will start up a PaddlePaddle production Docker container firstly, and then
execute `python train.py` in the Docker container.The details about PaddlePaddle Docker image is [here](../../../paddle/scripts/docker/README.md)
Contributor

A space is missing after the ".".

Contributor Author

Done.

@@ -0,0 +1,111 @@
# Submit a Distributed Training Job

If a user wants to start up a distributed training job, he will submit the distributed training job with python code.
Contributor

"python" should be capitalized.

Contributor Author

Done

If a user wants to start up a distributed training job, he will submit the distributed training job with python code.

If a user wants to start up a local train, he will start up a PaddlePaddle production Docker container firstly, and then
execute `python train.py` in the Docker container.The details about PaddlePaddle Docker image is [here](../../../paddle/scripts/docker/README.md)
Contributor

Sentences should end with a period ".".

Contributor Author

Done.


## Runtime Environment On Kubernetes

For a distributed training job, there is two Docker image called **runtime Docker image** and **base Docker image**. The runtime Docker image is the Docker image that gets scheduled by Kubernetes to run during training. The base Docker image is for building the runtime Docker image.
Contributor

Sorry, I misspoke before: my earlier comment said italics, but the markdown I gave was actually bold. New terms should be in italics:

*runtime Docker image*
*base Docker image*

Contributor Author

Done.


- Base Docker Image

Usually, the base Docker image is PaddlePaddle product Docker image including paddle binary files and trainer startup script file. And of course, users can specify any image name hosted on any docker registry which users have the right access.
Contributor

have the right access -> have the access right.

Contributor Author

Done.

gpu_num|NO|1| if `use_gpu=true`, this parameter is required

- Startup Parameter Server and Trainer Jobs
- Deploy parameter server job, it's a Kubernetes StatefulSet.
Contributor

Would a ReplicaSet work for the parameter server? Why does it need a StatefulSet? If the simple approach works, let's not use the complex one.

Contributor Author

The benefit of a StatefulSet is that when a Parameter Server Pod dies, the Pod that Kubernetes starts as a replacement keeps the same hostname as the previous one, so trainers do not need to look up the new Parameter Server Pod's address again.

Contributor

The design doc now describes how service discovery is done; it does not require the hostname or IP to stay the same.

Contributor Author

Done. Since service discovery uses IP addresses, keeping the hostname stable is no longer necessary; the StatefulSet is replaced with a simpler ReplicaSet.

- RESTful API

Job server provides a RESTful HTTP server receives the trainer packages, list PaddlePaddle job etc...
- `POST /v1/package` receive the trainer package and save them on GlustereFS
Contributor

GlustereFS -> CephFS

Contributor Author

Done.

Job server provides a RESTful HTTP server receives the trainer packages, list PaddlePaddle job etc...
- `POST /v1/package` receive the trainer package and save them on GlustereFS
- `POST /v1/trainer/job` submit a trainer job
- `GET /v1/jobs/` list all job
Contributor

list all job -> list all jobs

Contributor Author

Done.

- `POST /v1/trainer/job` submit a trainer job
- `GET /v1/jobs/` list all job
- `GET /v1/jobs/<job-name>` the status of a job
- `DELETE /v1/jobs/<job-name>` cancel a job
Contributor

Should the logs of a canceled job still be visible to the user? If so, Cancel seems different from Delete: Delete removes the job, while Cancel changes its state to canceled. I'm not sure whether HTTP DELETE is still appropriate in that case.

Contributor Author

For the first stage we can probably consider only Delete Job; logs can be stored on GlusterFS or shipped via Logstash to Elasticsearch and kept for a period of time.

`paddle.dist_train` will upload the trainer package to Job Server and then save them on the distributed filesystem, and then start up a job for building the runtime Docker image, Parameter Server and Trainer will use this runtime Docker image.

There are some benefits for building runtime Docker image on JobServer:
- **Docker in Docker** should mount `docker.sock` in the container and set `--privileged`, if the code running in a kubernetes pod, it's not safety.
Contributor

it's not safety -> it's not safe

Contributor Author

Done.

@Yancey1989 Yancey1989 changed the title Submit a distributed job (draft) Submit a distributed job May 3, 2017
@Yancey1989 Yancey1989 changed the title Submit a distributed job Design doc: submit a distributed job May 3, 2017

- Runtime Docker Image

The trainer package which user upload and some Python dependencies are packaged into a runtime Docker image based on base Docker image
Contributor

Add "."

Contributor Author

Done.

```

## Submit Distributed Training Job With Python Code
<img src="./src/submit-job-python.png" width="800">
Contributor

Need a paragraph to describe the flow in a big picture. The paragraph below goes directly to the detail of paddle.dist_train, the reader needs a big picture to follow the concept.

Contributor Author

Done.

```

## Submit Distributed Training Job With Python Code
<img src="./src/submit-job-python.png" width="800">
Contributor

"Docker build" / "Docker push" should be on the arrows: they are actions, so they belong on the edges rather than as boxes in the figure (the boxes here are basically all nouns/concepts).

Contributor Author

Done.

reader=reader,
paddle_job=PaddleJob(
job_name="quickstart",
pservers=4,
Contributor

pservers=4 with cpu=1 doesn't seem very reasonable... with cpu or gpu=1, no pserver is needed at all.

Contributor Author

Isn't the no-pserver case single-node training on the cluster? That feels like it needs a separate API, because even with CPU=10 the user may only want single-node training with 10 CPUs. I also opened issue #2019 to discuss how to specify resources; it feels like pservers could be dropped as well.

Contributor Author

Done, added a description of resource usage.


- RESTful API

Job server provides a RESTful HTTP server receives the trainer packages, list PaddlePaddle job etc...
Contributor

Could "list PaddlePaddle job etc..." be changed to something more descriptive, such as "display job related information"? Writing "etc..." is vague and doesn't feel appropriate in a design doc.

Contributor Author

Done.


`paddle.dist_train` will upload the trainer package to Job Server and then save them on the distributed filesystem, and then start up a job for building the runtime Docker image, Parameter Server and Trainer will use this runtime Docker image.

There are some benefits for building runtime Docker image on JobServer:
Contributor

Did we finally decide to build the image on the JobServer or to build it locally?

Contributor Author

For this phase, since all the code already lives on cloud storage, we skip building an image and simply copy the trainer_package, which is simpler. Based on the earlier discussion, building the runtime Docker image on the JobServer still seems more appropriate later on.

Contributor Author

Done.


- Python Dependencies

You need to provide requirments.txt file in your "trainer" package. Example:
Contributor

-> requirements.txt

Contributor Author

Done.

|-quick_start
|-trainer.py
|-dataset.py
|-requirments.txt
Contributor

-> requirements.txt

Contributor Author

Done.


## Submit Distributed Training Job With Python Code
<img src="./src/submit-job-python.png" width="800">
- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save them on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job.
Contributor

Could you check why paddle.job.dist_train() and the like are not syntax-highlighted properly? They render like this:
screen shot 2017-05-05 at 11 40 19 am

Contributor Author

A blank line was missing... Done.


## Submit Distributed Training Job With Python Code
<img src="./src/submit-job-python.png" width="800">
- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save them on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job.
Contributor

@helinwang May 5, 2017

To make our first version more concrete, could we change "The first version, we only implement submit the PaddlePaddle job in paddle.job.dist_train()" to "For the first version, we will not prepare the runtime docker image, instead, we will mount the trainer package in a temporary folder into the base docker image. We will not support custom Python dependencies in the first version as well."

Contributor Author

Done.

cpu_num=4,
gpu_num=2,
memory="1G"
input=/quickstart/input,
Contributor

My understanding is that we automatically mount the user's home directory at a fixed location (e.g. /home/), so neither input nor output is needed here?

Contributor Author

Done.

parameter | required | default | explain
--- | --- | --- | ---
job_name|YES||you should special a uniq job name which in a namespace
trainer_package|YES|| entry point for startup trainer process
Contributor

Does trainer_package refer to which folder gets uploaded? It is not the entrypoint either; I understand an entrypoint to be a command.
What I have in mind is something like:

(..., trainer_package="/path/to/folder", entrypoint="python train.py")

Actually I'm not sure either is needed; maybe trainer_package could just be the directory of the current Python file, and entrypoint could just be python plus the trainer Python file.

Contributor Author

Sorry, a line was missing here. Because the Kubernetes Pod must be able to access the directory that trainer_package points to, trainer_package should be a directory on CephFS or inside the Docker image; using the directory of the current Python file would not work.

The entrypoint can simply be "python trainer %s" % __file__

Contributor Author

Done.

@@ -0,0 +1,115 @@
# Submit a Distributed Training Job

If a user wants to start up a distributed training job, he will submit the distributed training job with Python code.
Contributor

The title is "Submit a Distributed Training Job", so maybe local training doesn't need to be covered (or at least not in this much detail; the local-training description here is longer than the remote-training one).
Could it be changed to:
The user can submit a distributed training job with Python code, rather than with a command-line interface.

Contributor Author

Done.


The trainer package which user upload and some Python dependencies are packaged into a runtime Docker image based on base Docker image.

- Python Dependencies
Contributor

"Python Dependencies" here is not at the same level as the Base / Runtime Docker image items; could it be made a sub-heading, e.g.

### Handle Python Dependencies

Contributor Author

Done. Would using "-" be more suitable? A sub-heading would have to be "#### Handle Python Dependencies", which nests too deep.

Contributor

@Yancey1989 It's mainly that the two adjacent items starting with "-" are siblings, so adding another "-" item here doesn't feel right. In other cases I think it's fine.

Contributor

I saw the updated version; looks good~


There are some benefits for building runtime Docker image on JobServer:
- **Docker in Docker** should mount `docker.sock` in the container and set `--privileged`, if the code running in a kubernetes pod, it's not safe.
- Users only need to upload the training package files, does not dependency docker engine, docker registry.
Contributor

does not dependency docker engine, docker registry. -> does not need to install docker engine, docker registry as dependencies.

Contributor Author

Done.

`paddle.job.dist_train` will upload the trainer package to Job Server and then save them on the distributed filesystem, and then start up a job for building the runtime Docker image, Parameter Server and Trainer will use this runtime Docker image.

There are some benefits for building runtime Docker image on JobServer:
- **Docker in Docker** should mount `docker.sock` in the container and set `--privileged`, if the code running in a kubernetes pod, it's not safe.
Contributor

Bolding "Docker in Docker" isn't ideal: readers may not even know what Docker in Docker is.
Also, I don't understand why Docker in Docker is unsafe inside k8s but safe on the JobServer.

Contributor Author

If the code runs on Paddle Cloud, each Jupyter Notebook is a Kubernetes Pod. To do a Docker build inside that Pod we would have to mount the host's docker.sock into it, and then a user could write code in the Notebook that calls the Docker REST API and accesses the host's Docker Engine.
Doing the Docker build on the JobServer is safer because the JobServer runs a bash script we wrote that only performs the Docker build and never executes the user's code directly.
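
To make that concrete, a hedged sketch of what the JobServer's fixed build step could look like using the Docker SDK for Python (the paths, tag format, registry, and function name are illustrative assumptions; it only builds and pushes, and never executes user code):

```python
import docker

def build_runtime_image(package_dir, job_id, registry="registry.example.com"):
    # package_dir already contains the user's trainer package plus a Dockerfile
    # generated by the JobServer from the base image and requirements.txt.
    tag = "%s/paddle-runtime:%s" % (registry, job_id)
    client = docker.from_env()
    client.images.build(path=package_dir, tag=tag, rm=True)
    client.images.push(tag)
    return tag
```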

Contributor Author

Done.

paddle.updater.Adam(...)),
reader=reader,
paddle_job=PaddleJob(
pserver_bucket="stander",
Contributor

stander looks like a typo.

Contributor Author

Done.

trainer.train(reader, num_passes, event_handler, feeding)
```

parameter | required | default | explain
Contributor

explain -> explanation

Contributor Author

Done.


parameter | required | default | explain
--- | --- | --- | ---
job_name|YES||you should special a unique job name which in a namespace
Contributor

special is probably meant to be specify.

I think a parameter description should state what the parameter does, in a declarative tone. The current text reads as an instruction ("you should pick a job name that is unique within a namespace") rather than a description, and it may also confuse users: they will not know what a namespace is (and apparently they do not need to; from their point of view there is no namespace, only their own jobs).

For example, the description of docker run --name:

If you do not assign a container name with the --name option, then the daemon generates a random string name for you. Defining a name can be a handy way to add meaning to a container. If you specify a name, you can use it when referencing the container within a Docker network.

That may be too long for this table; consider writing it as:
the unique name for the training job.

Contributor Author

Done.

--- | --- | --- | ---
job_name|YES||you should special a unique job name which in a namespace
entry_point|YES|| entry point for startup trainer process
trainer_package|YES|| trainer package file path, you can special a cloud path with `pfs://home/paddle` or a normal path with `/home/paddle`
Contributor

special appears in several places and seems to be the wrong word; I think specify was intended.

Contributor Author

Done.

--- | --- | --- | ---
job_name|YES||you should special a unique job name which in a namespace
entry_point|YES|| entry point for startup trainer process
trainer_package|YES|| trainer package file path, you can special a cloud path with `pfs://home/paddle` or a normal path with `/home/paddle`
Contributor

I do not quite understand: if a cloud path is given, how do we run the training in the cloud? And what if the train.py currently running differs from the train.py at the cloud path?

Contributor Author

On PaddleCloud users write their code directly on cloud storage, so this is a path on cloud storage.
When submitting from a local machine, it should be a path inside the Docker image, i.e. a local path.

Contributor Author

Changed it to: "trainer package file path which user have the access right".
Done.
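To make the parameter table concrete, here is a small sketch of how the required parameters above might be passed with the API proposed in this doc; the import path and all values are illustrative assumptions, not a released interface.

```python
from paddle.job import PaddleJob  # assumed import path for the proposed API

job = PaddleJob(
    job_name="mnist-demo",                     # unique name for the training job
    entry_point="python train.py",             # command that starts the trainer process
    trainer_package="/pfs/home/paddle/mnist",  # package path the user has access to
)
```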

- PServer Resource
- Special `pserver_bucket`
- `pserver_bucket=mini`, a single PServer process, it's suitable for learning how to Paddle Cloud.
- `pserver_bueckt=stander`, many PServer processes.
Contributor

There is no such word as stander; do you mean standard?
I see the Google Cloud ML API uses standard and premium. I understand they use premium because that tier configures both the trainers and the parameter servers. Here I think single, medium, large would do (more straightforward).

### Special Resource for a Distributed Training Job
- PServer Resource
- Special `pserver_bucket`
- `pserver_bucket=mini`, a single PServer process, it's suitable for learning how to Paddle Cloud.
Contributor

If it is a single process, just name the bucket single. (Simpler; we should aim for simplicity.)

Contributor Author

Removed pserver_bucket, done.

- `pserver_bucket=mini`, a single PServer process, it's suitable for learning how to Paddle Cloud.
- `pserver_bueckt=stander`, many PServer processes.
- `pserver_bucket=premium`, large PServer processes
- Custom PServer Resource
Contributor

Let's not consider this for now; it is too complex :)

Contributor Author

Done, the same as #1770 (comment)

- Trainer Resource
- you *may* special `gpu_num` for the trainers totally used. By default, trainer count equal GPU count.
- you *must* special `cpu_num` for the trainers totally used. if `gpu_num=0`, trainer count equal CPU count.
- you *must* special `memory` for the trainers totally used, you can express memory as a plain integer using one of these suffixes: E, P, T, G, M, K.
Contributor

Users should only care about the memory of each trainer, so memory should be specified per trainer. If it is a job total, we would also have to work out the number of trainers under different rules depending on whether GPUs are used...

Contributor Author

Done.
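As a side note on the memory format above ("a plain integer using one of these suffixes: E, P, T, G, M, K"), a tiny helper showing one way such strings could be interpreted, assuming decimal, Kubernetes-style suffixes.

```python
# Assumes decimal suffixes (K = 10^3, ..., E = 10^18), matching Kubernetes quantities.
SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18}

def parse_memory(value):
    """Convert a string such as '8G' into a number of bytes."""
    if value and value[-1] in SUFFIXES:
        return int(value[:-1]) * SUFFIXES[value[-1]]
    return int(value)

assert parse_memory("8G") == 8 * 10**9
assert parse_memory("512M") == 512 * 10**6
```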

`POST /v1/trainer/job` receives the distributed training parameters, and deploy the job as follows:
- Deploy PaddlePaddle Parameter Server processes, it's a Kubernetes ReplicaSet.
- Deploy PaddlePaddle Trainer processes, it's a Kubernetes Job.
- Deploy PaddlePaddle Master processes, it's a Kubernetes ReplicaSet.
Contributor

There is only one master; do we need a ReplicaSet? (Or would something else be more suitable?) I am not really sure which is best for deployment, please advise.

Contributor Author

A ReplicaSet is simpler. The main point is that when the Master node dies, Kubernetes will reschedule the Master; with a bare Pod we would lose that property.

Contributor

OK, understood!
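For illustration only, a sketch of how the trainer processes could be launched as a Kubernetes Job with the official kubernetes Python client; this is not the Job Server's actual code, and the image name, namespace, and parallelism are assumptions.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in a Pod

trainer_container = client.V1Container(
    name="trainer",
    image="paddle-cloud/mnist-demo:v1",   # hypothetical runtime image
    command=["python", "train.py"],
)
trainer_job = client.V1Job(
    metadata=client.V1ObjectMeta(name="mnist-demo-trainer"),
    spec=client.V1JobSpec(
        parallelism=4,                     # one Pod per trainer process
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[trainer_container],
                                  restart_policy="Never"))),
)
client.BatchV1Api().create_namespaced_job(namespace="paddle-cloud", body=trainer_job)
```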

Contributor

@helinwang helinwang left a comment

Two more comments. Almost there!

<img src="./src/submit-job-python.png" width="800">

- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save them on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job.
- `/v1/trainer/job` will start a building job for preparing the runtime Docker image. When the building job is finished, Job Server will submit the PaddlePaddle distributed job to Kubernetes.
Contributor

In the first version, is the job submitted through the JobServer, or directly from local Python?

Contributor Author

@Yancey1989 Yancey1989 May 11, 2017

In the first version it is submitted directly from the local machine; local submission needs to build the runtime Docker image, and I added a sentence about that.
Done.
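For context, a client-side sketch of the two Job Server calls described above; the endpoints come from this doc, while the host, payload fields, and file name are assumptions made for illustration.

```python
import requests

JOB_SERVER = "http://jobserver.paddle-cloud:8080"  # hypothetical address

# 1. Upload the trainer package.
with open("mnist-demo.tar.gz", "rb") as pkg:
    resp = requests.post("%s/v1/packages" % JOB_SERVER, files={"package": pkg})
    resp.raise_for_status()

# 2. Submit the distributed training job.
resp = requests.post("%s/v1/trainer/job" % JOB_SERVER, json={
    "job_name": "mnist-demo",
    "entry_point": "python train.py",
    "trainer_package": "pfs://home/paddle/mnist-demo.tar.gz",
})
resp.raise_for_status()
```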

- Deploy PaddlePaddle Trainer processes, it's a Kubernetes Job.
- Deploy PaddlePaddle Master processes, it's a Kubernetes ReplicaSet.

## Job Server
Contributor

Does the first version need to implement the job server? https://github.com/PaddlePaddle/cloud/wiki/2017-05 mentions "viewing training status"; does that mean the website fetches information through the job server API and shows the user the job list and job logs?

Contributor Author

In this version we plan to implement part of the JobServer interface directly in the website backend, for example viewing training status and the job list.

### specify Resource for a Distributed Training Job
- PServer Resource
- specify `pserver_bucket`
- `pserver_bucket=single`, a single PServer process, it's suitable for learning how to Paddle Cloud.
Member

how to Paddle Cloud

how to use Paddle Cloud?

Contributor Author

Removed this parameter. The wording was unclear; the intent was that Paddle Cloud computes the number of PServers automatically from the cluster architecture and available resources.


- `paddle.job.dist_train()` will call the Job Server API `/v1/packages` to upload the trainer package and save them on CephFS, and then call `/v1/trainer/job` to submit the PaddlePaddle distributed job.
- `/v1/trainer/job` will start a building job for preparing the runtime Docker image. When the building job is finished, Job Server will submit the PaddlePaddle distributed job to Kubernetes.
- *NOTE*: For the first version, we will not prepare the runtime Docker image on JobServer, instead, we will build the runtime Docker image on our host. If the code is running on PaddleCloud, we will mount the trainer package in a temporary folder into the base Docker image. We will not support custom Python dependencies in the first version as well.
Contributor

"instead, we will build the runtime Docker image on our host" conflicts with the later "we will mount the trainer package in a temporary folder into the base Docker image": the former says a runtime image will be built, the latter says the base image is used directly.
Suggest changing it to:

NOTE: For the first version, we will not prepare the runtime Docker image, instead, the package is uploaded to Paddle cloud, and Paddle cloud will mount the package in a temporary folder into the base Docker image. We will not support custom Python dependencies in the first version as well.

Contributor Author

Paddle cloud => Paddle Cloud
Done.

Contributor Author

Also, it seems the user does not need to specify the runtime Docker image, so only image is kept in the parameters, and base_image and runtime_image are removed.

paddle_job=PaddleJob(
runtime_image = "yancey1989/paddle-job",
job_name="paddle-job",
namespace="paddle-cloud",
Contributor

I do not understand why namespace is needed; I thought job_name could serve as the namespace?

Contributor Author

The current practice on the Kubernetes cluster is to create a namespace for each user, and a namespace runs multiple Jobs. For local submission the namespace can be read from ~/.kube/config, and on PaddleCloud it can be taken from an environment variable, so this parameter can be removed :)
Done.
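A small sketch of how the client could pick up the namespace without an explicit parameter, as suggested above: first an environment variable on Paddle Cloud, then the active context in ~/.kube/config. The environment variable name is an assumption.

```python
import os
from kubernetes import config

def current_namespace():
    ns = os.getenv("PADDLE_NAMESPACE")  # hypothetical variable set by Paddle Cloud
    if ns:
        return ns
    # Fall back to the namespace of the active context in ~/.kube/config.
    _, active = config.list_kube_config_contexts()
    return active["context"].get("namespace", "default")
```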

Contributor Author

Also, while implementing PaddleJob I found that dist_train may only need two parameters, trainer and paddle_job, where trainer is a function; this is explained in the PR.
Done.
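A sketch of the simplified two-parameter call mentioned above, with trainer passed as a function; the call shape follows this design doc and is not a released API, and the argument values are hypothetical.

```python
def trainer():
    # The user's usual single-machine training code goes here, e.g.
    # trainer.train(reader, num_passes, event_handler, feeding)
    pass

paddle.job.dist_train(
    trainer=trainer,
    paddle_job=PaddleJob(
        job_name="paddle-job",
        image="yancey1989/paddle-job",  # the single image parameter kept after this change
    ))
```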

trainer_package | str | trainer package file path which user have the access right
base_image|str|the [base image](#base-docker-image) for building the [runtime image](#runtime-docker-image)
runtime_image|str| [runtime image](#runtime-docker-image)
memory|str| memory allocated for the job, a plain integer using one of these suffixes: E, P, T, G, M, K
Contributor

trainer_mem already exists; can memory be removed?

Contributor Author

Done.
Removed the first way of configuring resources.

trainer_mem|str| memory allocated for each Trainer process, a plain integer using one of these suffixes: E, P, T, G, M, K

### Specify Resource for a Distributed Training Job
- Specify Job Resource
Contributor

Two approaches are described here, but I feel we have only thought through the second one, which is the one we plan to support. Let's remove the first one for now :)

Contributor Author

Done.

@Yancey1989 Yancey1989 closed this May 12, 2017