Added paddle on kubernetes tutorial. #370
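@wangkuiyi Please review this tutorial and let me know your feedback, thanks!
@wangkuiyi @backyes First edition of the PaddlePaddle on AWS with Kubernetes tutorial. EFS integration and Calico container networking are still a work in progress.
Is this PR complete? Could you briefly describe its purpose? Is it simply an English version of running PaddlePaddle on AWS? Does it support both single-node and multi-node training?
@wangkuiyi @backyes The first document, Kubernetes on PaddlePaddle, is purely an English document on running PaddlePaddle on a single node. The second, PaddlePaddle on AWS, walks users through building a Kubernetes cluster on AWS from scratch to run distributed PaddlePaddle training.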
###Script Details
The items listed here don't seem very helpful to beginners: it is hard to tell what they are or where the corresponding scripts live. This reads more like a note than a tutorial. I suggest making it a bit easier to understand.
Right, this part is not meant for beginners. It is written for developers with Kubernetes and AWS experience, to help them set up a private cluster. Beginners only need to read the first part of the documentation.
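@wangkuiyi @backyes I've revised it again, added some figures to help users understand, and merged in the translation of PaddlePaddle on Kubernetes. Please review it again.
This article is rather long. It would be best to have someone from the PaddlePaddle team run through it as a user and help fix it up.
@tizhou86 I followed https://github.com/tizhou86/Paddle/blob/develop/doc/kubernetes_on_paddle.md#use-kubernetes-for-training and the pod fails with an error saying dict.txt is missing:
The files under the data directory in the docker image are as follows:
@pineking
@drinktee Thanks for the reminder, I'll give it a try.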
@@ -0,0 +1,639 @@
ddlePaddle on AWS with Kubernetes
PaddlePaddle
Ok
* NetworkAdministrator
![managed_policy](managed_policy.png =800x))
There is an extra ")".
Right, I fixed this by using an HTML tag.
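@tizhou86 I see that Travis CI did not pass for this PR. The specific error is at https://travis-ci.org/PaddlePaddle/Paddle/jobs/183563852#L700 . It seems to mean that some text files do not end with a blank line. The code style check was added recently by @reyoung. You probably need to install pre-commit on your local machine:
as well as clang-format.
According to @reyoung, the clang-format version apparently needs to be 4.0.0 or above: $ clang-format --version
clang-format version 4.0.0 (tags/google/testing/2016-08-03)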
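The version requirement mentioned above can be checked mechanically. The following is a minimal sketch, assuming the output of `clang-format --version` has the shape shown in the comment. The helper names `parse_clang_format_version` and `meets_minimum` are hypothetical and are not part of Paddle's tooling.

```python
import re


def parse_clang_format_version(output):
    """Extract (major, minor, patch) from `clang-format --version` output.

    Assumes the output contains "version X.Y.Z", as in the sample above.
    """
    match = re.search(r"version\s+(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("unrecognized version string: %r" % output)
    return tuple(int(part) for part in match.groups())


def meets_minimum(output, minimum=(4, 0, 0)):
    """Return True if the reported version is at least `minimum`."""
    return parse_clang_format_version(output) >= minimum


# Sample string copied from the comment above.
sample = "clang-format version 4.0.0 (tags/google/testing/2016-08-03)"
print(meets_minimum(sample))
```

To check a locally installed binary, the same functions could be fed the decoded output of `subprocess.check_output(["clang-format", "--version"])`.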
@@ -0,0 +1,639 @@
ddlePaddle on AWS with Kubernetes
There is something wrong with this title, isn't there? Should it be written as:
# PaddlePaddle on AWS and Kubernetes
OK, fixed.
##Prerequisites
You need an Amazon account and your user account needs the following privileges to continue:
This also needs the steps for registering an AWS account and for downloading and installing the aws command-line tool. There should at least be links, and ideally screenshots.
OK, I've added the corresponding links here.
![managed_policy](managed_policy.png =800x))
If you are not in Unites States, we also recommend creating a jump server VM instance with default amazon AMI in the same available zone as your cluster and login to jump server for the following operations, otherwise there will be some issues related to account authentication.
Shouldn't the condition be "if you are in China, you need a tunnel server", rather than whether you are "in the United States"?
OK, changed it to "in China".
Fill in the required fields:
```
AWS Access Key ID: YOUR_ACCESS_KEY_ID
Where do the access key and secret access key come from? There should at least be a link after these command lines.
OK, added.
```
By default, the script will provision a new VPC and a 4 node k8s cluster in us-west-2a (Oregon) with EC2 instances running on Debian. You can override the variables defined in `<path/to/kubernetes-directory>/cluster/config-default.sh` to change this behavior as follows:
What is a VPC?
Virtual private cloud. I've added an explanation to the document in the revision.
```
###Kubernetes Cluster Start Up
And then type the following command:
On which machine should the following command be run? On my laptop?
Yes, after you have run aws configure. I've updated the document.
export KUBE_AWS_INSTANCE_PREFIX=k8s
...
```
Is this line of three backticks needed?
No, I've removed it.
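@tizhou86 The cluster directory has been moved to doc/howto/usage/cluster. Please update the locations of the newly added images and documents, and put kubernetes_on_paddle.md under doc/howto/usage/cluster/k8s. Thanks! Also, isn't the name paddlepaddle_on_aws_with_kubernetes.md too long? Could "paddlepaddle" be dropped?
@tizhou86 I asked @reyoung, who configured Paddle's pre-commit checks. The checks currently fail because every text file must end with exactly one blank line. The specific error message is here: https://travis-ci.org/PaddlePaddle/Paddle/jobs/184222241#L699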
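The rule described above can be sketched in a few lines. This is a minimal illustration, assuming that "exactly one blank line at the end" means the file ends with a single newline character and no extra trailing blank lines, and that files use LF line endings. `ends_with_single_newline` and `fix_trailing_newlines` are hypothetical helpers, not the actual pre-commit hook.

```python
def ends_with_single_newline(text):
    """Return True if the text ends with exactly one newline character."""
    return text.endswith("\n") and not text.endswith("\n\n")


def fix_trailing_newlines(text):
    """Strip all trailing line breaks and append exactly one.

    Text that is empty after stripping stays empty.
    """
    stripped = text.rstrip("\r\n")
    return stripped + "\n" if stripped else ""


# A document that ends with extra blank lines fails the check...
broken = "# Title\n\n\n"
print(ends_with_single_newline(broken))
# ...and passes after fixing.
print(ends_with_single_newline(fix_trailing_newlines(broken)))
```

Running the check locally before pushing avoids a round trip through CI for a purely formatting-related failure.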
Have you tried https://github.com/coreos/coreos-kubernetes/tree/master/multi-node/aws to set up k8s on AWS? This should make your life a lot easier.
For the local dev doc, you probably want to try https://github.com/kubernetes/minikube. It is the easiest way to set up a local k8s for testing and demo purposes.
```
$ docker run --name quick_start_data -it paddledev/paddle:cpu-demo-latest
```
### Download Training Data
Download Example Training Data
Ok, I'll modify that.
####Create PaddlePaddle Node
After Kubernetes master gets the request, it will parse the yaml file and create several pods (PaddlePaddle's node number), Kubernetes will allocate these pods onto cluster's node. A pod represents a PaddlePaddle node, when pod is successfully allocated onto one physical/virtual machine, Kubernetes will startup the container in the pod, and this container will use the environment variables in yaml file and start up `paddle pserver` and `paddle trainer` processes.
After Kubernetes receives the job creation, it will .... Kubernetes will assign these pods to work nodes.
Ok, I'll use "assign" instead of "allocate", thanks~
####Start up Training
After container gets started, it starts up the distributed training by using scripts. We know `paddle train` process need to know other node's ip address and it's own trainer_id, since PaddlePaddle currently don't have the ability to do the service discovery, so in the start up script, each node will use job pod's name to query all to pod info from Kubernetes apiserver (apiserver's endpoint is an environment variable in container by default).
Actually, you do not need to do anything for service discovery. You can create services for the created pods and give DNS names to the paddle master. Then you do not need the IP collector script.
Currently, PaddlePaddle has to pass the IP and port of every `paddle pserver` as startup parameters to `paddle train`.
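The constraint discussed in this thread can be sketched as follows: once a start-up script has the IP addresses of all pods in the job, it can derive a stable trainer ID and the pserver list from them. This is a minimal sketch, not the tutorial's actual script. It assumes the pod IPs have already been fetched from the Kubernetes API server; the function names, the port value, and the comma-separated output format are hypothetical.

```python
def trainer_id_for(own_ip, pod_ips):
    """Return this pod's index in the sorted list of pod IPs.

    Every pod sorts the same list, so each one derives a distinct,
    consistent ID without extra coordination.
    """
    return sorted(pod_ips).index(own_ip)


def pserver_endpoints(pod_ips, port):
    """Join the sorted pod IPs into a comma-separated list of ip:port pairs."""
    return ",".join("%s:%d" % (ip, port) for ip in sorted(pod_ips))


# Hypothetical pod IPs, as a start-up script might collect them.
pod_ips = ["10.0.1.7", "10.0.1.5", "10.0.1.6"]
print(trainer_id_for("10.0.1.6", pod_ips))
print(pserver_endpoints(pod_ips, 7164))
```

With Kubernetes services and DNS names, as the reviewer suggests, the IP list would be replaced by stable service names, and only the ID derivation would remain.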
… tizhou-develop
The source branch
@tizhou86 does not need to fix the following comments; I have created a PR, tizhou86#1, that fixes them, for your review.
@@ -0,0 +1,650 @@
#PaddlePaddle on AWS with Kubernetes
There should be a space after #. GitHub renders the heading correctly without the space, but other Markdown processors (including Paddle Notebook) may not be able to handle it.
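A heading check like the one requested here is easy to automate. The following is a minimal sketch, assuming ATX-style headings of one to six "#" characters at the start of a line. `fix_heading` is a hypothetical helper, and the sketch does not skip fenced code blocks, where a leading "#" may be a comment or a shebang rather than a heading.

```python
import re

# An ATX heading marker (one to six "#") followed directly by a character
# that is neither "#" nor whitespace, e.g. "##Prerequisites".
MISSING_SPACE = re.compile(r"^(#{1,6})([^#\s].*)$")


def fix_heading(line):
    """Insert a space after the leading "#" run if it is missing."""
    return MISSING_SPACE.sub(r"\1 \2", line)


for line in [
    "#PaddlePaddle on AWS with Kubernetes",
    "##Prerequisites",
    "## Set up AWS Accounts",
]:
    print(fix_heading(line))
```

Lines that already have the space, and ordinary prose lines, pass through unchanged.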
@@ -0,0 +1,650 @@
#PaddlePaddle on AWS with Kubernetes
##Prerequisites
##Prerequisites ==> ## Set up AWS Accounts
##Prerequisites
First, you need an AWS account, please check out [this](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html) for how to setup an AWS account.
please follow this guide to set up your AWS account.
your AWS account ==> our AWS account
As this is a tutorial, we are supposed to run it together with our readers.
First, you need an AWS account, please check out [this](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html) for how to setup an AWS account.
And then you can create an user by following [this](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html) instruction, you shall create an user group with following privileges, and then add the user to that group:
And then ==> Then
First, you need an AWS account, please check out [this](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html) for how to setup an AWS account.
And then you can create an user by following [this](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html) instruction, you shall create an user group with following privileges, and then add the user to that group:
you can create an user ==> we can create a user
First, you need an AWS account, please check out [this](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html) for how to setup an AWS account.
And then you can create an user by following [this](http://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html) instruction, you shall create an user group with following privileges, and then add the user to that group:
you should create a user group ==> we should create a user group
##Prerequisites
First, you need an AWS account, please check out [this](http://docs.aws.amazon.com/lambda/latest/dg/setting-up.html) for how to setup an AWS account.
you need ==> we need
###Core Concept of PaddlePaddle Training on AWS
Now we've already setup a 3 node distributed training cluster, and on each node we've attached the EFS volume, in this training demo, we will create three Kubernetes pod and scheduling them on 3 node. Each pod contains a PaddlePaddle container. When container gets created, it will start pserver and trainer process, load the training data from EFS volume and start the distributed training task.
a 3-node Kubernetes cluster. (We have not set up the training cluster yet, I guess.)
Yeah, I'll modify it.
####Create PaddlePaddle Node
After Kubernetes master gets the request, it will parse the yaml file and create several pods (PaddlePaddle's node number), Kubernetes will allocate these pods onto cluster's node. A pod represents a PaddlePaddle node, when pod is successfully allocated onto one physical/virtual machine, Kubernetes will startup the container in the pod, and this container will use the environment variables in yaml file and start up `paddle pserver` and `paddle trainer` processes.
defined by PaddlePaddle's node number
Sure, I'll modify it.
Rephrase the first paragraph
Fill in the required fields (You can get your AWS aceess key id and AWS secrete access key by following [this](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html) instruction):
The link does not work for me. It opens a "Getting Started with Amazon SQS" page.
Maybe this could be a better link? http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
Thanks for pointing it out, I'll modify it.
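@tizhou86 I tried following these instructions, and it gets stuck at
I tried three times with different settings, and it got stuck here every time:
Have you run into a similar problem?
@helinwang Do you have all the required permissions? Quite possibly a missing permission is causing the cluster startup to fail. You can log in to the master machine and run ps aux | grep api to see whether the API server has started.
@tizhou86 I switched to AWS root credentials (the token of my AWS account, which has all permissions) and tried again, but it still got stuck.
It looks like the API server did not start.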
Modified VM OS from Debian to CoreOS.
* fix computeat * fix bugs * fix compute at * enhance compute_at * add test * fix codestyle * add comments * fix bugs * add comments
Added paddle on kubernetes tutorial in english.