
Cluster train doc for v2 API #2072

Merged
merged 15 commits into PaddlePaddle:develop from cluster_train_doc on Oct 19, 2017

Conversation

typhoonzero
Contributor

@typhoonzero typhoonzero commented May 9, 2017

Fix #1942. Work in progress.
Fix #3003
Fix #3008

Better to view from here

and here

This document uses https://github.com/ekalinin/github-markdown-toc to generate the TOC.

@typhoonzero typhoonzero requested a review from luotao1 May 12, 2017 03:10
@typhoonzero typhoonzero changed the title [WIP] Cluster train doc for v2 API Cluster train doc for v2 API May 12, 2017
Contributor

@luotao1 luotao1 left a comment

  1. I will take a closer look at the English doc after the Chinese doc revisions are done.
  2. The code currently lives under demo/word2vec, but the demo directory is going to be removed, so @lcy-seso could this cluster code be moved to the book repo?

os.getenv("PADDLE_INIT_NUM_GRADIENT_SERVERS", "1")),
trainer_id=int(os.getenv("PADDLE_INIT_TRAINER_ID", "0")),
pservers=os.getenv("PADDLE_INIT_PSERVERS", "127.0.0.1"))
#word_dict = paddle.dataset.imikolov.build_dict()
Contributor

The commented-out code on line 63 can be removed.

Contributor Author

Done

Contributor Author

The sample code has been moved to the src directory.
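
For readers following the excerpt above, here is a minimal sketch, assuming the PaddlePaddle v2 API, of how a trainer might read these `PADDLE_INIT_*` environment variables during initialization. Keyword arguments not visible in the excerpt (such as `port` and `ports_num`) are assumptions, not a copy of the PR's `api_train_v2_cluster.py`:

```python
import os
import paddle.v2 as paddle

# Sketch only: read the cluster settings injected as environment variables
# and pass them to paddle.init(); defaults fall back to a single local node.
paddle.init(
    use_gpu=False,
    trainer_count=1,
    port=int(os.getenv("PADDLE_INIT_PORT", "7164")),
    ports_num=int(os.getenv("PADDLE_INIT_PORTS_NUM", "1")),
    ports_num_for_sparse=int(os.getenv("PADDLE_INIT_PORTS_NUM_FOR_SPARSE", "1")),
    num_gradient_servers=int(os.getenv("PADDLE_INIT_NUM_GRADIENT_SERVERS", "1")),
    trainer_id=int(os.getenv("PADDLE_INIT_TRAINER_ID", "0")),
    pservers=os.getenv("PADDLE_INIT_PSERVERS", "127.0.0.1"))
```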

@@ -1,135 +1,214 @@
```eval_rst
.. _cluster_train:
* [Overview](#概述)
Contributor

Both the Chinese and English docs are missing an overall title, so the page cannot be referenced and displayed in the table of contents, and all the listed headings end up one level lower than they should be.

Contributor Author

Done

# Overview
This article describes how to run distributed training with PaddlePaddle on different cluster frameworks. The distributed training architecture is shown in the figure below:

![cluster train](src/trainer.png)
Contributor

  1. The figure looks too large on GitHub; please shrink it a bit.
  2. In the Chinese doc, can the text inside the figure be changed to Chinese? Do "local/global model shard" in the figure mean local/global parameter shards? These two items are not mentioned in the three points below.

Contributor Author

Done


![cluster train](src/trainer.png)

- Data shard (数据分片): the data used to train the neural network is split into multiple parts, and each part is used by one trainer
Contributor

  1. Line 25 is missing a full stop.
  2. Since this is the Chinese doc, writing 数据分片 (Data shard), with the Chinese term first, would be better; same below.

Contributor Author

Done

![cluster train](src/trainer.png)

- Data shard (数据分片): the data used to train the neural network is split into multiple parts, and each part is used by one trainer
- Trainer (计算节点): after starting, each trainer reads its portion of the split data, runs the "forward" and "backward" computation of the neural network, and communicates with the parameter server. After training on a certain amount of data, it uploads the computed gradients and then downloads the optimized, updated neural network parameters.
Contributor

  1. Add a comma before 然后 ("then").
  2. Change "parameter server" to 参数服务器.

Contributor Author

Done


`PADDLE_NIC` the NIC (Network Interface Card) interface name used for the cluster communication channel, for example eth0 for Ethernet or ib0 for InfiniBand.
```
- `train_data_dir`: this directory can be a directory containing training data mounted from distributed storage, or it can be downloaded locally before the job starts.
Contributor

Suggested wording: the directory containing the training data; it can be mounted from distributed storage or downloaded locally before the job starts.

Contributor Author

Done

`PADDLE_NIC` the NIC (Network Interface Card) interface name used for the cluster communication channel, for example eth0 for Ethernet or ib0 for InfiniBand.
```
- `train_data_dir`: this directory can be a directory containing training data mounted from distributed storage, or it can be downloaded locally before the job starts.
- `test_data_dir`: contains the test dataset; likewise it can be mounted or downloaded.
Contributor

Suggested wording: the directory containing the test dataset.

Contributor Author

Done

* [Start the trainers](#启动trainer)
* [Prepare the dataset](#准备数据集)
* [Prepare the training program](#准备训练程序)
* [Use a distributed computing platform or tool](#使用分布式计算平台或工具)
Contributor

The computing platforms and the tools are not at the same level, and this part is confusing:
"Launch a cluster job with Fabric" contains "Start a test cluster with Kubernetes", while "Submit a training job on an OpenMPI cluster" contains "Start a test OpenMPI cluster with Kubernetes".

So are there two ways to launch (Fabric and Kubernetes) and two ways to submit (OpenMPI and Kubernetes)? Or are there two platforms, both of which can be launched with Fabric?

Please reorganize the table of contents and the corresponding content; I will review the content in detail in the next pass.

Contributor Author

Done


`PADDLE_PORTS_NUM_FOR_SPARSE` the number of ports used for the cluster communication channel of the sparse remote updater. If sparse remote update is used, it can be set in the same way as `PADDLE_PORTS_NUM`.
For each cluster platform, the methods for starting and stopping a cluster job are described separately. All of these examples can be found in [cluster_train_v2](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/scripts/cluster_train_v2).
Contributor

This link is broken.

Contributor Author

It points to the develop branch, so it will work once this PR is merged.


## Submit a training job on a Kubernetes cluster

For this part, refer to [k8s_cn.md](../k8s/k8s_cn.md)

Contributor Author

Chinese could not be used here before; otherwise Sphinx would report an error in CI.

Contributor

@luotao1 luotao1 left a comment

Could @helinwang please help review the English part?

* [Check the cluster training results](#检查集群训练结果)
* [Check the model output](#检查模型输出)
* [Submit a training job on an OpenMPI cluster](#在openmpi集群中提交训练作业)
* [Prepare an openmpi cluster](#准备openmpi集群)
Contributor

openmpi->OpenMPI


In this way, through the distributed cooperation of trainers and parameter servers, SGD training of the neural network is carried out. PaddlePaddle supports both synchronous stochastic gradient descent (SGD) and asynchronous SGD.

When training a neural network with synchronous SGD, PaddlePaddle uses a synchronization barrier so that gradient submission and parameter updates are executed in order. With asynchronous SGD, parameters are updated without waiting for all trainers to submit their gradients, which greatly improves parallelism: parameter servers do not depend on each other and receive gradients and update parameters in parallel; a parameter server does not wait for all trainers to submit their gradients before taking its next step; and trainers do not depend on each other either, training the model in parallel. Note that although asynchronous SGD increases the parallelism of parameter updates, it cannot guarantee synchronous parameter updates: at any point in time the parameters stored on one parameter server may be newer than those on another, so compared with synchronous SGD the gradients are noisier.
Contributor

并行的接收梯度 -> 并行地接收梯度 (use 地, not 的).
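
To make the quoted sync/async distinction concrete, here is a minimal toy sketch, not part of the PR and not PaddlePaddle API, contrasting one barrier-synchronized update with per-trainer asynchronous updates:

```python
import numpy as np

def sync_update(params, trainer_grads, lr=0.1):
    # Synchronous SGD: a barrier collects every trainer's gradient,
    # then a single averaged update is applied.
    return params - lr * np.mean(trainer_grads, axis=0)

def async_update(params, trainer_grads, lr=0.1):
    # Asynchronous SGD: each gradient is applied as soon as it arrives, with
    # no barrier; in a real cluster the later gradients would have been
    # computed against stale parameters, which is the source of the noise
    # described above.
    for g in trainer_grads:
        params = params - lr * g
    return params

params = np.zeros(4)
grads = [np.random.randn(4) for _ in range(3)]  # gradients from 3 trainers
print(sync_update(params, grads))
print(async_update(params, grads))
```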

<img src="src/trainer_cn.png" width="500">

- Data shard (数据分片): the data used to train the neural network is split into multiple parts, and each part is used by one trainer.
- Trainer (计算节点): after starting, each trainer reads its portion of the split data, runs the "forward" and "backward" computation of the neural network, and communicates with the parameter server. After training on a certain amount of data, it uploads the computed gradients and then downloads the optimized, updated neural network parameters.
Contributor

并开始神经网络 -> 开始神经网络 (remove the first 并).


The following steps are based on [demo/recommendation](https://github.com/PaddlePaddle/Paddle/tree/develop/demo/recommendation) in the demo directory
Referring to the sample data preparation script "prepare.py" (located at `doc/howto/usage/cluster/src/word2vec/prepare.py`), prepare the training and validation datasets. We use the paddle.dataset.imikolov dataset and, according to the number of concurrent distributed trainers (the number of trainer nodes), set `SPLIT_COUNT` to split the data into multiple parts
Contributor

  1. Change it to a URL (link) format: prepare.py
  2. Where is SPLIT_COUNT supposed to be set? Or is it a transparent variable the reader does not need to care about?

Contributor Author

It needs to be set in prepare.py.
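
To illustrate what such a script does, here is a rough sketch assuming the v2 `paddle.dataset.imikolov` reader; it is not the repository's actual prepare.py, and the output directory and file naming are made up for the example:

```python
import os
import paddle.v2 as paddle

SPLIT_COUNT = 3            # number of trainer nodes; adjust to your cluster
OUTPUT_DIR = "train_data"  # illustrative output directory

word_dict = paddle.dataset.imikolov.build_dict()
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

# One output file per shard; samples are assigned round-robin.
shards = [open(os.path.join(OUTPUT_DIR, "train-%05d" % i), "w")
          for i in range(SPLIT_COUNT)]
for idx, sample in enumerate(paddle.dataset.imikolov.train(word_dict, 5)()):
    shards[idx % SPLIT_COUNT].write(" ".join(map(str, sample)) + "\n")
for f in shards:
    f.close()
```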

`HOSTS` the hostnames or IPs of all nodes that run the cluster job. You can also append a user and an ssh port to a hostname, for example root@192.168.100.17:9090.
- `mylib.py`: library functions that are called by `train.py`.
- `word_dict.pickle`: the dictionary data file used by `train.py`.
- `train.py`: the training program; see "api_train_v2_cluster.py" (located at `doc/howto/usage/cluster/src/word2vec/api_train_v2_cluster.py`) for reference code.
Contributor

Change it to a URL (link) format: see api_train_v2_cluster.py for reference code.
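
For context, the `HOSTS` entry quoted above is presumably part of the node list that `paddle.py` reads from `conf.py` (the doc mentions "all nodes set in conf.py"); a minimal hypothetical sketch, with made-up addresses:

```python
# conf.py (sketch): hostnames or IPs of all nodes that run the cluster job.
# A user and an ssh port may be appended to a hostname.
HOSTS = [
    "root@192.168.100.17",       # default ssh port
    "root@192.168.100.18:9090",  # explicit user and ssh port
]
```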


`PADDLE_PORTS_NUM_FOR_SPARSE` the number of ports used for the cluster communication channel of the sparse remote updater. If sparse remote update is used, it can be set in the same way as `PADDLE_PORTS_NUM`.
When training on a distributed computing platform, once the job is scheduled onto the cluster by the platform, it obtains its startup parameters through the API or environment variables provided by the platform.
Contributor

The second half of the sentence is unclear. Does it mean the job will automatically use the API or the startup parameters?


## Submit a training job on a Kubernetes cluster

For this part, refer to [Kubernetes distributed training](../k8s/k8s_distributed_cn.md)
Contributor

Missing a full stop.


### Launch the cluster job

`paddle.py` provides automation scripts to start all PaddlePaddle cluster processes on the different nodes. By default, all command-line options can be set as ```paddle.py``` command options, and `paddle.py` transparently and automatically passes these options to the underlying PaddlePaddle processes.

`paddle.py` provides two special command options to make job launching easier.

`job_dispatch_package` set it to the local `workspace` directory; it will be dispatched to all nodes configured in conf.py. This helps reduce the burden for users who frequently modify and access workspace files; otherwise, repeated multi-node workspace deployment can be troublesome.
`job_workspace` set it to an already deployed workspace directory; `paddle.py` will skip the dispatch stage and directly launch the cluster job on all nodes. This helps reduce dispatch latency.
Contributor

  • job_dispatch_package set it to the local workspace directory; it will be dispatched to all nodes configured in conf.py. This helps reduce the burden for users who frequently modify and access workspace files; otherwise, repeated multi-node workspace deployment can be troublesome.
  • job_workspace set it to an already deployed workspace directory; paddle.py will skip the dispatch stage and directly launch the cluster job on all nodes. This helps reduce dispatch latency.


`cluster_train/run.sh` provides sample commands to run the `demo/recommendation` cluster job; just modify `job_dispatch_package` and `job_workspace` with the directories you define, and then:
`cluster_train/run.sh` provides sample commands to run the `doc/howto/usage/cluster/src/word2vec` cluster job; just modify `job_dispatch_package` and `job_workspace` with the directories you define, and then:
Contributor

你 -> 您 (use the polite form).


### Launch the cluster job

`paddle.py` provides automation scripts to start all PaddlePaddle cluster processes on the different nodes. By default, all command-line options can be set as ```paddle.py``` command options, and `paddle.py` transparently and automatically passes these options to the underlying PaddlePaddle processes.
Contributor

Suggested wording: "all command-line options can be set as options in paddle.py, and XXX" (keep the rest of the sentence).

@helinwang
Contributor

@luotao1 Sure will review.
@typhoonzero Is the private key safe to check in?


# Introduction

In this article, we'll explain how to do distributed training jobs with PaddlePaddle on different types of clusters. The diagram below shows the mail architecture of a distributed trainning job:
Contributor

how to do -> how to run
mail -> main


<img src="src/trainer.png" width="500">

- Data shard: training data will be split into multiple parts, trainers use some parts of the whole dataset to do the training job.
Contributor

multiple parts -> multiple partitions
trainers use some parts of the whole dataset -> trainers use the partitions
These might read better.

<img src="src/trainer.png" width="500">

- Data shard: training data will be split into multiple parts, trainers use some parts of the whole dataset to do the training job.
- Trainer: each trainer reads the data shard, and do Neural Network training algorithms like "forward" and "backward". Then the trainer will upload calculated "gradients" to parameter servers, and wait for parameters to be optimized on the parameter server side. When that finishes, the trainer download optimized parameters and continues its training.
Contributor

"Neural Network" probably doesn't need to be capitalized.

<img src="src/trainer.png" width="500">

- Data shard: training data will be split into multiple parts, trainers use some parts of the whole dataset to do the training job.
- Trainer: each trainer reads the data shard, and do Neural Network training algorithms like "forward" and "backward". Then the trainer will upload calculated "gradients" to parameter servers, and wait for parameters to be optimized on the parameter server side. When that finishes, the trainer download optimized parameters and continues its training.
Contributor

and do Neural Network training algorithms like "forward" and "backward"

It feels like forward and backward are not separate training algorithms; could this be written as "and train the neural network"?


- Data shard: training data will be split into multiple parts, trainers use some parts of the whole dataset to do the training job.
- Trainer: each trainer reads the data shard, and do Neural Network training algorithms like "forward" and "backward". Then the trainer will upload calculated "gradients" to parameter servers, and wait for parameters to be optimized on the parameter server side. When that finishes, the trainer download optimized parameters and continues its training.
- Parameter server: every parameter server stores part of the whole Neural Network model data. They will do optimization calculations when gradients are uploaded from trainers, and then send updated parameters to trainers.
Contributor

"Neural Network" probably doesn't need to be capitalized.


`dataprovider.py`
used to read train/test samples. It's same as local training.
When job started, every trainer needs to get it's own part of data. In some distributed systems they will provide a storage service, so the date under that path can be accessed by all the trainer nodes. Without storage service, you must copy the training data to each trainer node.
Contributor

they will provide a storage service -> a storage service will be provided.
Without storage service -> Without the storage service
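
As an illustration of how trainers might pick up their own data when no shared storage service is available, here is a hypothetical sketch, not from the PR; it assumes the data was pre-split into files named `train-*` and that, as in the v2 setup above, the number of gradient servers equals the number of trainers:

```python
import glob
import os

# Sketch only: each trainer selects the subset of pre-split files that
# belongs to its own shard, using its trainer id.
trainer_id = int(os.getenv("PADDLE_INIT_TRAINER_ID", "0"))
num_trainers = int(os.getenv("PADDLE_INIT_NUM_GRADIENT_SERVERS", "1"))

all_files = sorted(glob.glob("train_data/train-*"))
my_files = [f for i, f in enumerate(all_files) if i % num_trainers == trainer_id]
print("trainer %d reads: %s" % (trainer_id, my_files))
```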

`ROOT_DIR` workspace ROOT directory for placing JOB workspace directory
- `mylib.py`: some library functions. This is optional.
- `word_dict.pickle`: dict file for training word embedding.
- `train.py`: training program. Sample code: "api_train_v2_cluster.py"(located in `doc/howto/usage/cluster/src/word2vec/api_train_v2_cluster.py`).
Contributor

Maybe use a link:

[api_train_v2_cluster.py](./src/word2vec/api_train_v2_cluster.py)


`ROOT_DIR` workspace ROOT directory for placing JOB workspace directory
- `mylib.py`: some library functions. This is optional.
Contributor

After reading this, I still don't know what mylib.py does; could you give a more detailed explanation or an example?


### Launching Cluster Job
`paddle.py` provides automatical scripts to start all PaddlePaddle cluster processes in different nodes. By default, all command line options can set as `paddle.py` command options and `paddle.py` will transparently and automatically set these options to PaddlePaddle lower level processes.

`paddle.py`provides two distinguished command option for easy job launching.

`job_dispatch_package` set it with local `workspace`directory, it will be dispatched to all nodes set in conf.py. It could be helpful for frequent hacking workspace files, otherwise frequent mulit-nodes workspace deployment could make your crazy.
`job_dispatch_package` set it with local `workspace`directory, it will be dispatched to all nodes set in conf.py. It could be helpful for frequent hacking workspace files, otherwise, frequent multi-nodes workspace deployment could make your crazy.
Contributor

maybe change "could make your crazy" to "is very troublesome"?

Contributor Author

Changed it to "frequent multi-nodes workspace deployment is very annoying".

#environments setting for all processes in cluster job
LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/lib64"
```
Running `kubectl create -f ssh_servers.yaml` under the directory `paddle/scripts/cluster_train_v2/fabric/docker_cluster` will launch a demo cluster. Run `kubectl get po -o wide` to get the IP addresses of these nodes.

### Launching Cluster Job
`paddle.py` provides automatical scripts to start all PaddlePaddle cluster processes in different nodes. By default, all command line options can set as `paddle.py` command options and `paddle.py` will transparently and automatically set these options to PaddlePaddle lower level processes.
Contributor

can set as -> can be set as

@typhoonzero
Contributor Author

@luotao1 @helinwang Comments are all addressed. Thank you very much for such detailed comments.

luotao1
luotao1 previously approved these changes Jul 28, 2017
Contributor

@luotao1 luotao1 left a comment

LGTM

helinwang
helinwang previously approved these changes Jul 28, 2017
Contributor

@helinwang helinwang left a comment

LGTM.

@typhoonzero typhoonzero dismissed stale reviews from helinwang and luotao1 via 3cee548 October 19, 2017 12:42
@typhoonzero typhoonzero merged commit 63ffe52 into PaddlePaddle:develop Oct 19, 2017
@typhoonzero typhoonzero deleted the cluster_train_doc branch December 22, 2017 05:43