Added paddle on kubernetes tutorial. #370

Merged 11 commits on Dec 30, 2016

Member

@tizhou86 tizhou86 commented Nov 7, 2016

Added the PaddlePaddle on Kubernetes tutorial in English.

@tizhou86
Member Author

tizhou86 commented Nov 7, 2016

@wangkuiyi Please review this tutorial and let me know your feedback, thanks!

@coveralls

Coverage decreased (-0.03%) to 62.391% when pulling dc6859f on tizhou86:develop into e05f4ff on baidu:develop.

@tizhou86
Member Author

@wangkuiyi @backyes First edition of the PaddlePaddle on AWS with Kubernetes tutorial. EFS integration and Calico container networking are still a work in progress.

@backyes
Contributor

backyes commented Nov 29, 2016

@tizhou86

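Is this PR complete? Could you briefly describe its purpose? Is it simply an English version of running PaddlePaddle on AWS? Does it support both single-node and multi-node setups?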

@tizhou86
Member Author

tizhou86 commented Dec 7, 2016

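@wangkuiyi @backyes The first document, Kubernetes on PaddlePaddle, is purely the English documentation for running PaddlePaddle on a single node. The second, PaddlePaddle on AWS, walks users through building a Kubernetes cluster on AWS from scratch to run distributed PaddlePaddle training.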



###Script Details

Contributor

The items listed here do not seem very helpful to beginners: it is hard to tell what they are or where the corresponding scripts live. This reads more like notes than a tutorial. I suggest making it a bit easier to follow.

Member Author

Right, this part is not aimed at beginners. It is written for developers with Kubernetes and AWS experience, to help them set up a private cluster. Beginners only need to follow the first document.

@tizhou86
Member Author

@wangkuiyi @backyes I revised it again, added some figures to help users understand, and merged in the translation of PaddlePaddle on Kubernetes as well. Please review it again.

@tizhou86
Member Author

This article is rather long. It would be best to have someone from the PaddlePaddle team run through it as a user, so we can fix it up together.

@pineking

@tizhou86 I followed https://github.com/tizhou86/Paddle/blob/develop/doc/kubernetes_on_paddle.md#use-kubernetes-for-training and the pod failed with an error saying dict.txt is missing: IOError: [Errno 2] No such file or directory: './data/dict.txt'

$ docker logs k8s_pi.8f635e06_quickstart-mx24w_default_8626edbc-c1d0-11e6-b8dc-002590c0f780_438904e3
I1214 07:41:10.549923    26 Util.cpp:155] commandline: /usr/local/bin/../opt/paddle/bin/paddle_trainer --config=trainer_config.lr.py --save_dir=./output --trainer_count=4 --log_period=20 --num_passes=15 --use_gpu=false --show_parameter_stats_period=100 --test_all_data_in_one_period=1
I1214 07:41:10.550160    26 Util.cpp:130] Calling runInitFunctions
I1214 07:41:10.550518    26 Util.cpp:143] Call runInitFunctions done.
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/paddle/trainer/config_parser.py", line 3406, in parse_config_and_serialize
    config = parse_config(config_file, config_arg_str)
  File "/usr/local/lib/python2.7/dist-packages/paddle/trainer/config_parser.py", line 3382, in parse_config
    execfile(config_file, make_config_environment(config_file, config_args))
  File "trainer_config.lr.py", line 21, in <module>
    with open(dict_file, 'r') as f:
IOError: [Errno 2] No such file or directory: './data/dict.txt'
F1214 07:41:10.609591    26 PythonUtil.cpp:134] Check failed: (ret) != nullptr Current PYTHONPATH: ['/usr/local/opt/paddle/bin', '/root/paddle/demo/quick_start', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages/PILcompat', '/usr/lib/python2.7/dist-packages/gtk-2.0', '/usr/lib/pymodules/python2.7']
Python Error: <type 'exceptions.IOError'> : [Errno 2] No such file or directory: './data/dict.txt'
Python Callstack:
            /usr/local/lib/python2.7/dist-packages/paddle/trainer/config_parser.py : 3406
            /usr/local/lib/python2.7/dist-packages/paddle/trainer/config_parser.py : 3382
            trainer_config.lr.py : 21
Call Object failed.
*** Check failure stack trace: ***
    @     0x7fd3cc271daa  (unknown)
    @     0x7fd3cc271ce4  (unknown)
    @     0x7fd3cc2716e6  (unknown)
    @     0x7fd3cc274687  (unknown)
    @           0x76814a  paddle::callPythonFuncRetPyObj()
    @           0x76832c  paddle::callPythonFunc()
    @           0x684ef3  paddle::TrainerConfigHelper::TrainerConfigHelper()
    @           0x685534  paddle::TrainerConfigHelper::createFromFlags()
    @           0x513207  main
    @     0x7fd3cb47df45  (unknown)
    @           0x51f2a5  (unknown)
    @              (nil)  (unknown)
/usr/local/bin/paddle: line 109:    26 Aborted                 (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}

The files under the data directory in the Docker image are as follows:

root@c020d520a0fd:~/paddle/demo/quick_start/data# ll
total 484252
drwxr-xr-x  1 root root         6 Dec 14 07:58 ./
drwxr-xr-x  1 root root        17 Dec 14 07:27 ../
-rwxr-xr-x  2 root root      1052 Nov 30 09:02 get_data.sh*
drwxr-xr-x 22 root root      4096 Dec  7 10:09 mosesdecoder-master/
-rw-r--r--  2 root root        16 Nov 30 09:02 pred.list
-rw-r--r--  2 root root      1740 Nov 30 09:02 pred.txt
-rw-r--r--  1 root root 495854086 Apr 26  2016 reviews_Electronics_5.json.gz

@drinktee

@pineking
After running get_data.sh, you also need to run preprocess.sh.

@pineking

pineking commented Dec 14, 2016

@drinktee Thanks for the reminder, I'll give it a try.
@tizhou86 The preprocess.sh step is required; it could be added to https://github.com/tizhou86/Paddle/blob/develop/doc/kubernetes_on_paddle.md

@pineking

pineking commented Dec 14, 2016

@tizhou86 @drinktee I re-ran it with the preprocess.sh step included, and it worked without problems on a Kubernetes cluster of 3 physical machines.
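
The failure above comes down to trainer_config.lr.py opening ./data/dict.txt before preprocessing has produced it. A pre-flight check along the following lines could surface that earlier. This is only a sketch: the function name `missing_inputs` and the idea of passing a list of required files are assumptions for illustration, and dict.txt is the only file name known from the traceback.

```python
import os


def missing_inputs(data_dir, required=("dict.txt",)):
    """Return the required files that are absent from data_dir.

    Only dict.txt is known from the traceback; extend `required`
    with whatever else your trainer config reads (assumption).
    """
    return [name for name in required
            if not os.path.isfile(os.path.join(data_dir, name))]


if __name__ == "__main__":
    missing = missing_inputs("./data")
    if missing:
        print("Missing %s; run get_data.sh and then preprocess.sh first."
              % ", ".join(missing))
```

Running it inside the container before submitting the job would report a missing dict.txt instead of letting the trainer abort.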

@@ -0,0 +1,639 @@
ddlePaddle on AWS with Kubernetes

PaddlePaddle

Member Author

Ok

* NetworkAdministrator


![managed_policy](managed_policy.png =800x))

There is an extra ).

This image fails to insert and does not render. The other images further down in this file do not render either. You can either remove the =800x size control or insert the image with an HTML tag, which supports size control.

Member Author

OK, I changed this to use an HTML tag.

@wangkuiyi
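Collaborator

@tizhou86 I see that this PR did not pass Travis CI. The specific error is at https://travis-ci.org/PaddlePaddle/Paddle/jobs/183563852#L700 . It seems to mean that some text files do not end with a blank line.

The code style check was added recently by @reyoung. You probably need to install pre-commit on your machine:

pip install pre-commit

as well as clang-format

brew update && brew install clang-format

@reyoung says the clang-format version apparently has to be 4.0.0 or above: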

$ clang-format --version
clang-format version 4.0.0 (tags/google/testing/2016-08-03)

@@ -0,0 +1,639 @@
ddlePaddle on AWS with Kubernetes
Collaborator

Isn't there a problem with this title? Should it be written as follows?

# PaddlePaddle on AWS and Kubernetes

Member Author

OK, fixed.


##Prerequisites

You need an Amazon account and your user account needs the following privileges to continue:
Collaborator

This should also cover the steps for registering an AWS account and for downloading and installing the aws command-line tool. There should at least be links, and ideally screenshots.

Member Author

OK, I added the corresponding links here.


![managed_policy](managed_policy.png =800x))

If you are not in Unites States, we also recommend creating a jump server VM instance with default amazon AMI in the same available zone as your cluster and login to jump server for the following operations, otherwise there will be some issues related to account authentication.
Collaborator

Shouldn't this be "if you are in China, you need a tunnel server", rather than depending on whether you are "in the United States"?

Member Author

OK, changed it to "in China".

Fill in the required fields:

```
AWS Access Key ID: YOUR_ACCESS_KEY_ID
Collaborator

Where do the access key and secret access key come from? There should at least be a link after these command lines.

Member Author

OK, added.


```

By default, the script will provision a new VPC and a 4 node k8s cluster in us-west-2a (Oregon) with EC2 instances running on Debian. You can override the variables defined in `<path/to/kubernetes-directory>/cluster/config-default.sh` to change this behavior as follows:
Collaborator

What is a VPC?

Member Author

Virtual private cloud. I added an explanation to the document in the revision.

```

###Kubernetes Cluster Start Up
And then type the following command:
Collaborator

On which machine should the following commands be run? On my laptop?

Member Author

Yes, after you have run aws configure. I have updated the document.

export KUBE_AWS_INSTANCE_PREFIX=k8s
...

```
Collaborator

Are the three ``` on this line needed?

Member Author

No, I removed them.

@luotao1
Contributor

luotao1 commented Dec 15, 2016

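@tizhou86 The cluster directory has been moved to doc/howto/usage/cluster. Please update the locations of the newly added images and documents, and put kubernetes_on_paddle.md under doc/howto/usage/cluster/k8s. Thanks!

Also, is the name paddlepaddle_on_aws_with_kubernetes.md too long? Could paddlepaddle be dropped?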

@wangkuiyi
Collaborator

wangkuiyi commented Dec 16, 2016

@tizhou86 I asked @reyoung, who configured Paddle's pre-commit check. The checks currently fail because every text file must end with exactly one blank line.

The detailed error message is here: https://travis-ci.org/PaddlePaddle/Paddle/jobs/184222241#L699
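
For reference, the rule can be checked locally before pushing. The sketch below interprets it as "the file ends with exactly one newline character", which is how end-of-file checks commonly behave. That interpretation, the helper names `fix_end_of_file` and `fix_file`, and the assumption of Unix line endings are mine and are not taken from Paddle's pre-commit configuration.

```python
import sys


def fix_end_of_file(text):
    """Return text ending in exactly one newline; empty input stays empty."""
    stripped = text.rstrip("\n")
    return stripped + "\n" if stripped else ""


def fix_file(path):
    """Rewrite path in place if needed; return True when it was changed."""
    with open(path) as f:
        original = f.read()
    fixed = fix_end_of_file(original)
    if fixed != original:
        with open(path, "w") as f:
            f.write(fixed)
    return fixed != original


if __name__ == "__main__":
    # Usage: python fix_eof.py doc/*.md   (the script name is hypothetical)
    for path in sys.argv[1:]:
        if fix_file(path):
            print("fixed", path)
```

Installing pre-commit itself remains the more reliable route, since it runs the project's actual hooks.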

@xiang90

xiang90 commented Dec 16, 2016

@luotao1 @wangkuiyi

Have you tried https://github.com/coreos/coreos-kubernetes/tree/master/multi-node/aws to set up k8s on AWS? This should make your life a lot easier.

@xiang90

xiang90 commented Dec 16, 2016

For the local dev doc, you probably want to try https://github.com/kubernetes/minikube.

It is the easiest way to set up a local k8s for testing/demo purposes.

$ docker run --name quick_start_data -it paddledev/paddle:cpu-demo-latest
```

### Download Training Data

Download Example Training Data

Member Author

Ok, I'll modify that.


####Create PaddlePaddle Node

After Kubernetes master gets the request, it will parse the yaml file and create several pods (PaddlePaddle's node number), Kubernetes will allocate these pods onto cluster's node. A pod represents a PaddlePaddle node, when pod is successfully allocated onto one physical/virtual machine, Kubernetes will startup the container in the pod, and this container will use the environment variables in yaml file and start up `paddle pserver` and `paddle trainer` processes.

After Kubernetes receives the job creation, it will .... Kubernetes will assign these pods to work nodes.

Member Author

Ok, I'll use "assign" instead of "allocate", thanks~


####Start up Training

After container gets started, it starts up the distributed training by using scripts. We know `paddle train` process need to know other node's ip address and it's own trainer_id, since PaddlePaddle currently don't have the ability to do the service discovery, so in the start up script, each node will use job pod's name to query all to pod info from Kubernetes apiserver (apiserver's endpoint is an environment variable in container by default).

Actually, you do not need to do anything for service discovery. You can create services for the created pods and give their DNS names to the paddle master. Then you do not need the IP collector script.

Member Author

Currently, PaddlePaddle requires every paddle pserver's IP and port to be passed as startup parameters to paddle train.
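
To make that constraint concrete: once a start-up script has the IP of every pod in the job, each pod can derive its own trainer id and the shared pserver list deterministically. The sketch below assumes the IPs have already been fetched from the apiserver. The function name `plan_node`, the port value, and the ip:port formatting are illustrative assumptions rather than PaddlePaddle's actual interface.

```python
def plan_node(pod_ips, my_ip, port=7164):
    """Derive this node's trainer id and the pserver list from pod IPs.

    Sorting gives every pod the same ordering, so the ids agree across
    nodes without any coordination. The order is lexicographic, which
    is fine because only consistency matters. `port` is a placeholder.
    """
    ordered = sorted(set(pod_ips))
    if my_ip not in ordered:
        raise ValueError("%s is not among the job's pods" % my_ip)
    return {
        "trainer_id": ordered.index(my_ip),
        "pservers": ["%s:%d" % (ip, port) for ip in ordered],
    }
```

Each pod would call this with the same pod list and its own IP, then pass the result on to its pserver and trainer processes.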

@wangkuiyi
Collaborator

wangkuiyi commented Dec 18, 2016

The source branch tizhou86/develop hadn't been merged with upstream/develop for a while, so I did a git pull upstream develop and updated this PR.

@wangkuiyi
Collaborator

  1. Following @luotao1 's suggestions, I moved files in this PR to their canonical places.
  2. After learning from @reyoung, I removed the extra empty lines at the end of the Markdown files so that they pass the pre-commit check.

Collaborator

@wangkuiyi wangkuiyi left a comment

@tizhou86 doesn't have to fix the following comments, as I have created a PR tizhou86#1 to fix them and for your review.

@@ -0,0 +1,650 @@
#PaddlePaddle on AWS with Kubernetes
Collaborator

There should be a space after the #. GitHub renders it correctly without the space, but other Markdown processors (including Paddle Notebook) may not handle it.

@@ -0,0 +1,650 @@
#PaddlePaddle on AWS with Kubernetes

##Prerequisites
Collaborator

##Prerequisites ==> ## Set up AWS Accounts


##Prerequisites

First, you need an AWS account, please check out [this](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html) for how to setup an AWS account.
Collaborator

please follow this guide to set up your AWS account.

Collaborator

your AWS account ==> our AWS account

As this is a tutorial, we are supposed to run it together with our readers.


First, you need an AWS account, please check out [this](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html) for how to setup an AWS account.

And then you can create an user by following [this](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html) instruction, you shall create an user group with following privileges, and then add the user to that group:
Collaborator

And then ==> Then


First, you need an AWS account, please check out [this](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html) for how to setup an AWS account.

And then you can create an user by following [this](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html) instruction, you shall create an user group with following privileges, and then add the user to that group:
Collaborator

you can create an user

we can create a user


First, you need an AWS account, please check out [this](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html) for how to setup an AWS account.

And then you can create an user by following [this](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html) instruction, you shall create an user group with following privileges, and then add the user to that group:
Collaborator

you should create a user group

we should create a user group


##Prerequisites

First, you need an AWS account, please check out [this](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html) for how to setup an AWS account.
Collaborator

you need ==> we need


###Core Concept of PaddlePaddle Training on AWS

Now we've already setup a 3 node distributed training cluster, and on each node we've attached the EFS volume, in this training demo, we will create three Kubernetes pod and scheduling them on 3 node. Each pod contains a PaddlePaddle container. When container gets created, it will start pserver and trainer process, load the training data from EFS volume and start the distributed training task.

a 3 nodes Kubernetes cluster. (we have not setup the training cluster yet I guess.)

Member Author

Yeah, I'll modify it.


####Create PaddlePaddle Node

After Kubernetes master gets the request, it will parse the yaml file and create several pods (PaddlePaddle's node number), Kubernetes will allocate these pods onto cluster's node. A pod represents a PaddlePaddle node, when pod is successfully allocated onto one physical/virtual machine, Kubernetes will startup the container in the pod, and this container will use the environment variables in yaml file and start up `paddle pserver` and `paddle trainer` processes.

defined by PaddlePaddle's node number

Member Author

Sure, I'll modify it.

Rephrase the first paragraph
```


Fill in the required fields (You can get your AWS aceess key id and AWS secrete access key by following [this](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html) instruction):
Contributor

The link does not work for me. It shows up a webpage of Getting Started with Amazon SQS

Maybe this could be a better link? http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html

Member Author

Thanks for pointing it out, I'll modify it.

@helinwang
Contributor

@tizhou86 I tried following these instructions, and it gets stuck at

 [master running]
Attaching IP 52.9.99.195 to instance i-0eec1bfda9aa908d9
Attaching persistent data volume (vol-024eeaf7baa728f7c) to master
2016-12-21T00:34:36.626Z	/dev/sdb	i-0eec1bfda9aa908d9	attaching	vol-024eeaf7baa728f7c
Cluster "aws_kubernetes" set.
User "aws_kubernetes" set.
Context "aws_kubernetes" set.
Switched to context "aws_kubernetes".
User "aws_kubernetes-basic-auth" set.
Wrote config for aws_kubernetes to /Users/helinwang/.kube/config
Creating minion configuration
Creating autoscaling group
 0 minions started; waiting
 0 minions started; waiting
 0 minions started; waiting
 0 minions started; waiting
 0 minions started; waiting
 2 minions started; ready
Waiting for cluster initialization.

  This will continually check to see if the API for kubernetes is reachable.
  This might loop forever if there was some uncaught error during start
  up.

...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

I tried three times with different settings, and it got stuck here every time:

export KUBERNETES_PROVIDER=aws; curl -sS https://get.k8s.io | bash
export KUBE_AWS_ZONE=us-west-1a; export KUBERNETES_PROVIDER=aws; curl -sS https://get.k8s.io | bash
export NUM_NODES=2&&export KUBE_AWS_ZONE=us-west-1a; export KUBERNETES_PROVIDER=aws; curl -sS https://get.k8s.io | bash

Have you run into a similar problem?

@tizhou86
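Member Author

@helinwang Do you have all the required permissions? Quite possibly a missing permission is causing the cluster startup to fail. You can log in to the master machine and run ps aux | grep api to see whether the api server has started.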

@helinwang
Contributor

helinwang commented Dec 22, 2016

@tizhou86 I switched to AWS root credentials (the token of my AWS account, which has all permissions) and tried again, but it still got stuck.

ps aux | grep api
admin     4410  0.0  0.1  12728  2148 pts/0    S+   00:21   0:00 grep api

It looks like the api server did not start.
I'll keep looking into it; it seems to be a problem with the k8s AWS script.

@tizhou86
Member Author

Modified VM OS from Debian to CoreOS.

@wangkuiyi
Collaborator

wangkuiyi commented Dec 28, 2016

I am sorry that due to the Christmas vacation the approval is 6 days later than the last change. My great appreciation to @tizhou86 @drinktee for this work, to @pineking and @helinwang and @backyes for the verification, and to @xiang90 for reviewing!

@backyes backyes merged commit 1a24be1 into PaddlePaddle:develop Dec 30, 2016
thisjiang pushed a commit to thisjiang/Paddle that referenced this pull request Oct 28, 2021
* fix computeat

* fix bugs

* fix compute at

* enhance compute_at

* add test

* fix codestyle

* add comments

* fix bugs

* add comments