Release v0.1.1 beta.3 #235

Merged · 61 commits · Jul 23, 2017
c1c28c7
tutorial and usage update
Jun 26, 2017
93f3c58
first add
Jun 26, 2017
f0da31d
rm not need
Jun 26, 2017
8247b5d
fix bugs
Jun 26, 2017
5135d6c
add logging
Jun 27, 2017
32b7ea7
add docker file
Jun 27, 2017
279060b
fix by yancey's comment
Jun 27, 2017
7448274
update
Jun 27, 2017
9f52351
save parameter
Yancey1989 Jun 27, 2017
fdbe8c0
fix yaml
Jun 27, 2017
1357ad8
fix bugs
gongweibao Jun 27, 2017
67c1d9f
fix bugs
Jun 27, 2017
599d873
fix convert bug
gongweibao Jun 27, 2017
aa0f747
fix by wuyi's comment
Jun 27, 2017
deab719
add readme
gongweibao Jun 27, 2017
1ab5b67
Merge branch 'convertdataset' of https://github.com/gongweibao/cloud …
gongweibao Jun 27, 2017
53f8417
rm logging.confg
gongweibao Jun 27, 2017
af4dbfd
modify README.md
gongweibao Jun 28, 2017
a0017f0
modify README.md
gongweibao Jun 28, 2017
87dd9d7
add start command
Jun 28, 2017
f950d13
update
Yancey1989 Jun 28, 2017
b5e27bf
Merge pull request #189 from Yancey1989/save_paramters
Yancey1989 Jun 28, 2017
19eff0c
Merge pull request #183 from gongweibao/convertdataset
gongweibao Jun 29, 2017
f1e1588
update by comments
Jun 29, 2017
7aa6486
Merge pull request #181 from typhoonzero/update_doc
typhoonzero Jun 29, 2017
2f0ebbf
upload files with recursion
Yancey1989 Jun 30, 2017
2f59037
prettify output
Jul 4, 2017
dde9e8a
remove replica set name
Jul 4, 2017
453c2c0
Merge pull request #200 from typhoonzero/prettify_output
typhoonzero Jul 4, 2017
493f1dd
recursion to loop
Yancey1989 Jul 5, 2017
e664206
Merge pull request #193 from Yancey1989/upload_file_recursion
Yancey1989 Jul 5, 2017
3b5e00d
dlnel index page (#194)
Yancey1989 Jul 6, 2017
fa66e4e
update tutorial (#202)
Yancey1989 Jul 6, 2017
2bc48db
test format
gongweibao Jul 7, 2017
310521b
modify travis.yaml
gongweibao Jul 7, 2017
0d4772c
fix
gongweibao Jul 7, 2017
693c7c8
fix
gongweibao Jul 7, 2017
1d30410
fix sudo
gongweibao Jul 7, 2017
bbc7cf7
add travis
gongweibao Jul 7, 2017
3d06084
add glide
gongweibao Jul 7, 2017
9ce8bda
add gimme
gongweibao Jul 7, 2017
fecfe65
fix style
gongweibao Jul 7, 2017
c7b205a
fix style
gongweibao Jul 7, 2017
fb33706
add files
gongweibao Jul 7, 2017
03b73db
modify sh
gongweibao Jul 7, 2017
284e69a
fix by wuyi's comments
gongweibao Jul 11, 2017
b00a1f6
Merge pull request #204 from gongweibao/goprecommit
gongweibao Jul 11, 2017
2573a2d
fix pre-commit bugs
gongweibao Jul 11, 2017
e79391d
Merge pull request #209 from gongweibao/precommit
gongweibao Jul 11, 2017
e457b04
Format quota print (#205)
Yancey1989 Jul 12, 2017
4779205
add sleep for pserver get ready (#216)
Yancey1989 Jul 13, 2017
efe6abd
Update readme (#214)
typhoonzero Jul 13, 2017
826df41
Enable ingress notebook access (#219)
typhoonzero Jul 17, 2017
3e53a5c
test ok
gongweibao Jul 18, 2017
b5e235c
fix
gongweibao Jul 18, 2017
f5c3314
fix
gongweibao Jul 18, 2017
5e70a76
Merge pull request #225 from gongweibao/review
gongweibao Jul 18, 2017
013ee18
fix login url (#229)
typhoonzero Jul 19, 2017
0fa45d9
Fix invalid job path (#227)
Yancey1989 Jul 19, 2017
f916c16
check job name in clint (#231)
Yancey1989 Jul 20, 2017
009314e
[Done]Fault tolerant job (#212)
typhoonzero Jul 20, 2017
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,2 +1,4 @@
.Python
*.crt
.cache
vendor
7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,7 @@
- repo: https://github.com/dnephin/pre-commit-golang
  sha: e4693a4c282b4fc878eda172a929f7a6508e7d16
  hooks:
    - id: go-fmt
      files: \.go$
    - id: go-lint
      files: \.go$
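The `files: \.go$` patterns above scope both hooks to Go sources only. As a quick sanity check of what that regex selects (shown in Python purely for illustration; the file list is made up):

```python
import re

# The same pattern the pre-commit hooks use to select files.
go_files = re.compile(r"\.go$")

paths = ["master/main.go", "README.md", "vendor/pkg/util.go", "notes.gold"]
matched = [p for p in paths if go_files.search(p)]
print(matched)  # ['master/main.go', 'vendor/pkg/util.go']
```

Note that the trailing `$` is what keeps `notes.gold` out: the pattern requires `.go` at the end of the path, not merely somewhere inside it.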
20 changes: 20 additions & 0 deletions .tools/check_style.sh
@@ -0,0 +1,20 @@
#!/bin/bash
function abort(){
    echo "Your change doesn't follow PaddleCloud's code style." 1>&2
    echo "Please use pre-commit to reformat your code and git push again." 1>&2
    exit 1
}

trap 'abort' 0
set -e

cd $TRAVIS_BUILD_DIR
export PATH=/usr/bin:$PATH
pre-commit install
pre-commit --version

if ! pre-commit run -a ; then
    git diff --exit-code
fi

trap : 0
12 changes: 11 additions & 1 deletion .travis.yml
@@ -2,7 +2,17 @@ matrix:
  include:
    - language: go
      go: 1.8.x
      script: bash .tools/gen_config.sh && cd go && go test ./...
      sudo: required
      before_script:
        - eval "$(GIMME_GO_VERSION=1.8.3 gimme)"
        - go get -u github.com/golang/lint/golint
        - curl https://glide.sh/get | bash
        - sudo pip install pre-commit
      script:
        - |
          bash .tools/check_style.sh
          RESULT=$?; if [ $RESULT -eq 0 ]; then true; else false; fi;
        - bash .tools/gen_config.sh && cd go && glide install && go test $(glide novendor)
    - language: python
      python: 2.7
      sudo: required
77 changes: 49 additions & 28 deletions README.md
@@ -1,82 +1,104 @@
# PaddlePaddle Cloud

PaddlePaddle Cloud is a Distributed Deep-Learning Cloud Platform for both cloud
providers and enterprises.

PaddlePaddle Cloud uses [Kubernetes](https://kubernetes.io) as its backend job
dispatching and cluster resource management center, and uses [PaddlePaddle](https://github.com/PaddlePaddle/Paddle.git)
as the deep-learning framework. Users can use web pages or command-line tools
to submit their deep-learning training jobs remotely to make use of the power of
large-scale GPU clusters.

[English tutorials](./doc/usage_en.md)
## Using Command-line To Submit Cloud Training Jobs

[中文手册](./doc/usage_cn.md)


## Deploy PaddlePaddle Cloud

### Pre-Requirements
- PaddlePaddle Cloud needs python to support `OPENSSL 1.2`. To check it out, simply run:
```python
>>> import ssl
>>> ssl.OPENSSL_VERSION
'OpenSSL 1.0.2k 26 Jan 2017'
```
- Make sure you have `Python > 2.7.10` installed.
- PaddlePaddle Cloud uses Kubernetes as its backend core; deploy a Kubernetes cluster
using [Sextant](https://github.com/k8sp/sextant) or any tool you like.


### Run on kubernetes
- Build Paddle Cloud Docker Image

```bash
# build docker image
git clone https://github.com/PaddlePaddle/cloud.git
cd cloud/paddlecloud
docker build -t [your_docker_registry]/pcloud .
# push to registry so that we can submit paddlecloud to kubernetes
docker push [your_docker_registry]/pcloud
```
- We use [volume](https://kubernetes.io/docs/concepts/storage/volumes/) to mount MySQL data,
cert files, and settings; in the `k8s/` folder we have some samples showing how to mount
stand-alone files and settings using [hostpath](https://kubernetes.io/docs/concepts/storage/volumes/#hostpath). Here's
a good tutorial on creating Kubernetes certs: https://coreos.com/kubernetes/docs/latest/getting-started.html

- create data folder on a Kubernetes node, such as:
```bash
mkdir -p /home/pcloud/data/mysql
mkdir -p /home/pcloud/data/certs
```
- Copy the Kubernetes CA files (ca.pem, ca-key.pem, ca.srl) to the `/home/pcloud/data/certs` folder
- Copy the Kubernetes admin user key (admin.pem, admin-key.pem) to the `/home/pcloud/data/certs` folder
- Optional: copy the CephFS key file (admin.secret) to the `/home/pcloud/data/certs` folder
- Copy the `paddlecloud/settings.py` file to the `/home/pcloud/data` folder

- Configure `cloud_deployment.yaml`
- `spec.template.spec.containers[0].volumes`: change the `hostPath` to match your data folder.
- `spec.template.spec.nodeSelector`: edit the `kubernetes.io/hostname` value to the host that holds the data folder. You can use `kubectl get nodes` to list all the Kubernetes nodes.
- Configure `settings.py`
- Add your domain name to `ALLOWED_HOSTS`.
- Configure `DATACENTERS` to your backend storage; CephFS and HostPath are currently supported.
You can use HostPath mode to make use of shared file systems like NFS.
- Configure `cloud_ingress.yaml` if your Kubernetes cluster is using [ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/)
to proxy HTTP traffic, or configure `cloud_service.yaml` to use [NodePort](https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport)
- If using ingress, configure `spec.rules[0].host` to your domain name
- Deploy cloud on Kubernetes
- `kubectl create -f k8s/cloud_deployment.yaml`
- `kubectl create -f k8s/cloud_service.yaml`
- `kubectl create -f k8s/cloud_ingress.yaml` (optional)
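The `DATACENTERS` setting mentioned above lives in `paddlecloud/settings.py` (a Django settings module). The exact schema is defined there, so the field names below are illustrative assumptions only — a minimal sketch of what one CephFS backend and one HostPath backend might look like:

```python
# Hypothetical sketch of a DATACENTERS setting; the field names are
# assumptions for illustration, not the exact schema — consult the real
# paddlecloud/settings.py before deploying.
DATACENTERS = {
    "datacenter1": {
        "fstype": "cephfs",
        "monitors_addr": ["192.168.1.10:6789"],  # Ceph monitor endpoints
        "secret": "ceph-secret",
        "mount_path": "/pfs/%s/home/%s/",        # filled with (datacenter, username)
    },
    "public": {
        "fstype": "hostpath",
        "host_path": "/mnt/nfs/",                # shared NFS mount present on every node
        "mount_path": "/pfs/%s/public/",
    },
}

# The server would select a backend by datacenter name:
dc = DATACENTERS["datacenter1"]
print(dc["fstype"])  # cephfs
```

The HostPath entry is how a shared file system such as NFS can be used: mount it at the same path on every node, then point `host_path` at that mount.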


To test or visit the website, find out the kubernetes ingress IP
addresses, or the NodePort.

Then open your browser and visit http://<ingress-ip-address>, or
http://<any-node-ip-address>:<NodePort>

- Prepare public dataset

You can create a Kubernetes Job for preparing the public dataset and cluster trainer files.
```bash
kubectl create -f k8s/prepare_dataset.yaml
```


### Run locally without docker

- You still need a Kubernetes cluster when running locally.
- Make sure you have `Python > 2.7.10` installed.
- Python needs to support `OPENSSL 1.2`. To check it out, simply run:
```python
>>> import ssl
>>> ssl.OPENSSL_VERSION
'OpenSSL 1.0.2k 26 Jan 2017'
```
- Make sure you are using a virtual environment of some sort (e.g. `virtualenv` or
`pyenv`).

```
virtualenv paddlecloudenv
# enable the virtualenv
source paddlecloudenv/bin/activate
```

To run for the first time, you need to:

```
npm install
pip install -r requirements.txt
@@ -102,4 +124,3 @@ EMAIL_BACKEND = 'django_sendmail_backend.backends.EmailBackend'
You may need to use `hostNetwork` for your pod when using the mail command.

Or you can use Django SMTP bindings; refer to https://docs.djangoproject.com/en/1.11/topics/email/

8 changes: 7 additions & 1 deletion demo/fit_a_line/train.py
@@ -1,5 +1,8 @@
import paddle.v2 as paddle
import pcloud.dataset.uci_housing as uci_housing
import os
import gzip
trainer_id = os.getenv("PADDLE_INIT_TRAINER_ID")

def main():
    # init
@@ -34,7 +37,10 @@ def event_handler(event):
            reader=paddle.batch(uci_housing.test(), batch_size=2),
            feeding=feeding)
        print "Test %d, Cost %f" % (event.pass_id, result.cost)

        if trainer_id == "0":
            with gzip.open("fit-a-line_pass_%05d.tar.gz" % event.pass_id,
                           "w") as f:
                parameters.to_tar(f)
    # training
    trainer.train(
        reader=paddle.batch(
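The hunk above gates checkpointing on `PADDLE_INIT_TRAINER_ID`, so that of N parallel trainers only trainer 0 writes the archive. A minimal self-contained sketch of that pattern (the helper name and the fake parameter bytes are ours, not part of the demo — the real code calls `parameters.to_tar(f)` inside the event handler):

```python
import gzip
import os
import tempfile

def save_on_chief(parameters_bytes, pass_id, trainer_id, out_dir):
    """Write a gzipped snapshot only on trainer 0, so N identical
    trainers don't all write the same archive to shared storage."""
    if trainer_id != "0":
        return None  # non-chief trainers skip checkpointing
    path = os.path.join(out_dir, "fit-a-line_pass_%05d.tar.gz" % pass_id)
    with gzip.open(path, "wb") as f:
        f.write(parameters_bytes)  # stand-in for parameters.to_tar(f)
    return path

out_dir = tempfile.mkdtemp()
# Trainer "1" writes nothing; trainer "0" produces the archive.
assert save_on_chief(b"fake-params", 3, "1", out_dir) is None
path = save_on_chief(b"fake-params", 3, "0", out_dir)
print(os.path.basename(path))  # fit-a-line_pass_00003.tar.gz
```

The string comparison against `"0"` mirrors the demo: `os.getenv` returns a string (or `None` outside the cluster), not an integer.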
150 changes: 150 additions & 0 deletions doc/tutorial_cn.md
@@ -0,0 +1,150 @@
# Submitting Your First Training Job

---

## Download and configure paddlecloud

`paddlecloud` is the command-line client for submitting distributed training jobs to PaddlePaddleCloud.

Step 1: Visit https://github.com/PaddlePaddle/cloud/releases and download the latest `paddlecloud`
binary client for your operating system. Copy `paddlecloud` to a directory on your $PATH, such as `/usr/local/bin`, then add execute permission:
`chmod +x /usr/local/bin/paddlecloud`

|Operating system|Binary|
|--|--|
|Mac OSX|paddlecloud.dawin|
|Windows|paddlecloud.exe|
|Linux|paddlecloud.x86_64|

Step 2: Create the file `~/.paddle/config` (on Windows, create `.paddle\config` under the current user's home directory) with the following content:

```yaml
datacenters:
- name: dlnel
username: [your user name]
password: [secret]
endpoint: http://cloud.dlnel.com
current-datacenter: dlnel
```

The config file specifies the endpoint of the PaddlePaddleCloud cluster and the user's login credentials:
- name: a custom datacenter name; any string is fine
- username: the PaddlePaddleCloud user name, usually an email address; before registration is opened, accounts must be requested from the administrator
- password: the password for the account
- endpoint: the API address of the PaddlePaddleCloud cluster, available from the cluster administrator
- current-datacenter: the datacenter to use as the current one for operations

Once the config file is in place, running `paddlecloud` prints the client's help message:

```
Usage: paddlecloud <flags> <subcommand> <subcommand args>

Subcommands:
commands list all command names
delete Delete the specify resource.
file Simple file operations.
get Print resources
help describe subcommands and their syntax
kill Stop the job. -rm will remove the job from history.
logs Print logs of the job.
registry Add registry secret on paddlecloud.
submit Submit job to PaddlePaddle Cloud.

Subcommands for PFS:
cp uoload or download files
ls List files on PaddlePaddle Cloud
mkdir mkdir directoies on PaddlePaddle Cloud
rm rm files on PaddlePaddle Cloud


Use "paddlecloud flags" for a list of top-level flags
```

## Download the demo code and submit a job

With the configuration above in place, you can submit a sample cluster training job right away. We have prepared some sample code to help you understand how cluster training jobs are submitted; fetch the sample code and submit a job with the commands below.

The samples are based on [paddle book](https://github.com/PaddlePaddle/book); see paddle book for an explanation of each sample.

```bash
mkdir fit_a_line
cd fit_a_line
wget https://raw.githubusercontent.com/PaddlePaddle/cloud/develop/demo/fit_a_line/train.py
cd ..
paddlecloud submit -jobname fit-a-line -cpu 1 -gpu 1 -parallelism 1 -entry "python train.py" fit_a_line/
```

As you can see, the submission specifies the job name `-jobname fit-a-line`, the CPU resources `-cpu 1`,
the GPU resources `-gpu 1`, the parallelism `-parallelism 1` (the number of trainer nodes), the start command `-entry "python train.py"`,
and the job's program directory `fit_a_line/`.

***Note 1:*** run `paddlecloud submit -h` for the full list of submission options.

***Note 2:*** it is recommended to submit each job under a distinct jobname, so that the code and results of earlier jobs stay saved in the cloud.

## Checking job status and logs

Once the job has started, list running jobs with the command `paddlecloud get jobs`:
```bash
paddlecloud get jobs
NUM NAME SUCC FAIL START COMP ACTIVE
0 fit-a-line <nil> <nil> 2017-06-26T08:41:01Z <nil> 1
```

Here, "ACTIVE" is the number of nodes currently running, "SUCC" the number of nodes that finished successfully, and "FAIL" the number that failed.

Then use the following command to view the logs of a running or completed job:

```bash
paddlecloud logs fit-a-line
Test 28, Cost 13.184950
append file: /pfs/dlnel/public/dataset/uci_housing/train-00000.pickle
append file: /pfs/dlnel/public/dataset/uci_housing/train-00001.pickle
append file: /pfs/dlnel/public/dataset/uci_housing/train-00002.pickle
append file: /pfs/dlnel/public/dataset/uci_housing/train-00003.pickle
append file: /pfs/dlnel/public/dataset/uci_housing/train-00004.pickle
Pass 28, Batch 0, Cost 9.695825
Pass 28, Batch 100, Cost 14.143484
Pass 28, Batch 200, Cost 11.380404
Test 28, Cost 13.184950
...
# The logs command returns the last 10 log lines by default;
# use the -n flag to request more lines:
paddlecloud logs -n 100 fit-a-line
...
```

After the job finishes, its status is displayed as follows:

```bash
paddlecloud get jobs
NUM NAME SUCC FAIL START COMP ACTIVE
0 fit-a-line 1 <nil> 2017-06-26T08:41:01Z 2017-06-26T08:41:29Z <nil>
```

## Downloading the job's model output

After a job completes successfully, the training program usually saves its model output to the cloud filesystem. You can list and download the output with:

```
paddlecloud file ls /pfs/dlnel/home/wuyi05@baidu.com/jobs/fit_a_line/
train.py
image
output
paddlecloud file ls /pfs/dlnel/home/wuyi05@baidu.com/jobs/fit_a_line/output/
pass-0001.tar
...
paddlecloud file get /pfs/dlnel/home/wuyi05@baidu.com/jobs/fit_a_line/output/pass-0001.tar ./
```

Once downloaded, the model can be used in environments such as a prediction service.

## Cleaning up jobs

The following command completely removes a training job from the cluster. After cleanup, the job's history logs can no longer be viewed, but earlier output can still be found under the job-name directory.

```bash
paddlecloud kill fit-a-line
```