
aws benchmarking tool #9638

Merged: 16 commits into PaddlePaddle:develop on Apr 17, 2018
Conversation

@putcn (Contributor) commented Apr 4, 2018

Purpose

This is an automation tool for deploying PaddlePaddle benchmark tests to AWS.

Features

  • Creates a subnet sized to just the number of EC2 instances required
  • Allocates pserver and trainer EC2 instances and verifies instance state
  • nvidia-docker ready for GPU training
  • Garbage-collects instances and network elements when a task completes or an error occurs
  • Collects test logs in real time
  • Provides a web service for checking logs or tearing down the test setup
  • Requires no changes to the test code
  • Offers many optional configuration options

For more info, please refer to the README.md in this PR.
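To make the feature list above concrete, here is a minimal boto3 sketch of the allocate, verify, and garbage-collect cycle such a tool automates. This is an illustrative outline only, not the PR's implementation; the VPC ID, AMI ID, CIDR block, and instance types are assumed placeholders.

```python
# Illustrative outline of the allocate/verify/clean-up cycle; all IDs and
# values below are placeholders, not taken from the PR.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

VPC_ID = "vpc-0123456789abcdef0"   # hypothetical
AMI_ID = "ami-0123456789abcdef0"   # hypothetical GPU-ready AMI


def launch_cluster(pserver_count, trainer_count):
    # Subnet sized to just the instances needed (plus AWS-reserved addresses).
    subnet = ec2.create_subnet(VpcId=VPC_ID, CidrBlock="10.0.1.0/28")
    subnet_id = subnet["Subnet"]["SubnetId"]
    instance_ids = []
    try:
        for count, instance_type in [(pserver_count, "c5.2xlarge"),
                                     (trainer_count, "p2.xlarge")]:
            resp = ec2.run_instances(
                ImageId=AMI_ID, InstanceType=instance_type,
                MinCount=count, MaxCount=count, SubnetId=subnet_id)
            instance_ids += [i["InstanceId"] for i in resp["Instances"]]
        # Verify instance state before starting training.
        ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)
        return subnet_id, instance_ids
    except Exception:
        # Garbage-collect on error: instances first, then the subnet.
        if instance_ids:
            ec2.terminate_instances(InstanceIds=instance_ids)
            ec2.get_waiter("instance_terminated").wait(InstanceIds=instance_ids)
        ec2.delete_subnet(SubnetId=subnet_id)
        raise
```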

Yancey1989 previously approved these changes Apr 4, 2018

@Yancey1989 (Contributor) left a comment:

Thanks, it's a very useful PR.
I see we need some Python dependencies to run the client, so maybe we could run the client in a Docker image? In that case we would need a Dockerfile under the path tools/aws_benchmarking/Dockerfile.

@@ -0,0 +1 @@
nvidia-docker run -i -e "TRAINING_ROLE=PSERVER" -e "PSERVER_HOSTS={PSERVER_HOSTS}" {DOCKER_IMAGE}
Reviewer (Contributor):

Maybe we need a -p argument to map the container port to the host?

putcn (Author):

thanks, will add it
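For reference, a minimal sketch of what the pserver launch template could look like with the suggested -p flag added, written as the Python format string the tool presumably fills in. The PSERVER_PORT placeholder and the example values are assumptions, not taken from the PR.

```python
# Sketch only: pserver launch command template with a host port mapping added.
# PSERVER_PORT is a hypothetical placeholder name; the PR may use another.
PSERVER_CMD_TEMPLATE = (
    "nvidia-docker run -i "
    "-p {PSERVER_PORT}:{PSERVER_PORT} "
    '-e "TRAINING_ROLE=PSERVER" '
    '-e "PSERVER_HOSTS={PSERVER_HOSTS}" '
    "{DOCKER_IMAGE}"
)

cmd = PSERVER_CMD_TEMPLATE.format(
    PSERVER_PORT="5436",
    PSERVER_HOSTS="192.168.1.2:5436,192.168.1.3:5436",
    DOCKER_IMAGE="paddlepaddle/paddle:latest-gpu",  # illustrative image name
)
```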

@putcn (Author) commented Apr 5, 2018

@Yancey1989 running this tool in Docker is a great idea, will do, thanks.

@helinwang (Contributor) left a comment:

It would be nice if we could achieve the following:

  1. The user doesn't have to change anything in train.py; all they have to do is provide train.py and specify the command line arguments for the pserver and trainer.

  2. Terminate the EC2 instances automatically when the training process finishes (either due to completion or due to a crash).

  3. Save all output (stdout and stderr) from the process, and be able to show the output in real time (a rough sketch of points 2 and 3 follows below).
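A rough sketch of points 2 and 3, assuming boto3 and a locally launched training process; the instance IDs and region are placeholders, and the PR's actual mechanism may differ.

```python
# Sketch only: stream a training process's output in real time and terminate
# the cluster's EC2 instances when it finishes, whether it completed or crashed.
import subprocess

import boto3


def run_and_clean_up(train_cmd, instance_ids, region="us-west-2"):
    ec2 = boto3.client("ec2", region_name=region)
    try:
        proc = subprocess.Popen(
            train_cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            universal_newlines=True)
        for line in proc.stdout:  # relay each line as soon as it appears
            print(line, end="")
        proc.wait()
    finally:
        # Tear down on completion or crash (point 2).
        ec2.terminate_instances(InstanceIds=instance_ids)
```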

parser.add_argument(
'--pserver_instance_type',
type=str,
default="p2.8xlarge",
Reviewer (Contributor):

We probably don't need a GPU instance for the pserver.

putcn (Author):
Agreed, will update the default pserver instance type.
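A minimal sketch of the adjusted default, assuming a CPU instance type is chosen for the pserver; c5.2xlarge is illustrative, not necessarily the value the PR settled on.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--pserver_instance_type',
    type=str,
    default="c5.2xlarge",  # CPU instance type; pservers don't need a GPU
    help="Pserver instance type")
```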

'--pserver_count', type=int, default=1, help="Pserver count")

parser.add_argument(
'--action', type=str, default="serve", help="create|cleanup|status")
Reviewer (Contributor):

serve is not in the set "create|cleanup|status"?

putcn (Author):

thanks for catching this
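A minimal sketch of how the --action argument could be made consistent, for example by listing serve in an explicit choices set; the PR's final wording may differ.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--action',
    type=str,
    default="serve",
    choices=["serve", "create", "cleanup", "status"],
    help="serve|create|cleanup|status")
```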

@putcn (Author) commented Apr 9, 2018

@helinwang thanks for the great ideas, will update

@putcn (Author) commented Apr 11, 2018

Going to make some final tweaks and add a README file tomorrow.

```

***Please Note***
Training nodes will run your `ENTRYPOINT` script with the following environment variables:
Reviewer (Contributor):

Does this work with our current benchmark scripts?
The vgg16 script takes the env variables SERVER_ENDPOINT, PSERVERS, TRAINERS, and TRAINING_ROLE.

example usages:

pserver:

SERVER_ENDPOINT=172.19.61.250:8000 PSERVERS=172.19.61.250:8000 TRAINERS=1 TRAINING_ROLE=PSERVER CUDA_VISIBLE_DEVICES=2 LD_LIBRARY_PATH=`pwd`:/usr/local/cuda-8.0/lib64:/usr/local/lib/ python vgg16_fluid.py --local false --device GPU --data_set flowers --batch_size 4

trainer:

SERVER_ENDPOINT=172.19.61.250:8000 PSERVERS=172.19.61.250:8000 TRAINERS=1 TRAINING_ROLE=TRAINER CUDA_VISIBLE_DEVICES=1 LD_LIBRARY_PATH=`pwd`:/usr/local/cuda-8.0/lib64:/usr/local/lib/ python vgg16_fluid.py --local false --device GPU --data_set flowers --batch_size 4

putcn (Author):

got it, will update the env vars
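A minimal sketch of the environment a trainer container might be launched with so that vgg16_fluid.py's expected variables are all present; the endpoint values and trainer count are illustrative placeholders.

```python
# Sketch only: environment passed to a trainer node, covering both the newer
# names and the ones vgg16_fluid.py expects (SERVER_ENDPOINT, PSERVERS,
# TRAINERS, TRAINING_ROLE). Values are illustrative placeholders.
pserver_endpoints = "192.168.1.2:5436,192.168.1.3:5436"

trainer_env = {
    "TRAINING_ROLE": "TRAINER",
    "SERVER_ENDPOINT": "192.168.1.2:5436",  # endpoint this node talks to
    "PSERVERS": pserver_endpoints,          # expected by vgg16_fluid.py
    "PSERVER_HOSTS": pserver_endpoints,     # same list under the newer name
    "TRAINERS": "2",                        # number of trainer nodes
}

# Rendered as docker -e flags when the training container is launched.
env_flags = " ".join('-e "{}={}"'.format(k, v) for k, v in trainer_env.items())
```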

To access the master log:

```bash
docker run -i -v $HOME/.aws:/root/.aws -v <full path to your pem file>:/<key pair name>.pem \
Reviewer (Contributor):

Curious why accessing the master log requires *.pem?

putcn (Author):

Ah, good catch, we don't need the pem to access the log, will update.

To retrieve training logs
TBD

### Tech details
Reviewer (Contributor):

There is one special character here that needs to be deleted (rendered as �Tech details).

putcn (Author):

thanks, will update

@putcn changed the title from "[WIP]aws benchmarking tool" to "aws benchmarking tool" on Apr 12, 2018
@helinwang (Contributor) left a comment:

One comment, otherwise LGTM!


- `TASK_NAME`: unique name to identify this training process.
- `TRAINING_ROLE`: the current node's role in this training process, either "PSERVER" or "TRAINER".
- `PSERVER_HOSTS`: comma-separated list of pserver endpoints, e.g. "192.168.1.2:5436,192.168.1.3:5436".
Reviewer (Contributor):

Do we need PSERVER_HOSTS? It's not in @typhoonzero's script or transpiler.py. Could we remove it? Otherwise it causes confusion about why "PSERVER_HOSTS" and "PSERVERS" exist with exactly the same meaning.

putcn (Author):

The reason for leaving these duplicated env vars is to stay compatible with other existing tests, which may require different env vars for the same purpose.
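From the training-script side, a minimal sketch of why keeping both names is harmless: a script can read whichever variable it was written against and see the same endpoint list. Illustrative only, not the PR's code.

```python
import os


def pserver_endpoints():
    # PSERVER_HOSTS and PSERVERS carry the same comma-separated endpoint list,
    # so a script written against either name resolves to the same value.
    value = os.environ.get("PSERVER_HOSTS") or os.environ.get("PSERVERS", "")
    return [ep for ep in value.split(",") if ep]
```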

@helinwang (Contributor) left a comment:

LGTM!!!

@putcn merged commit 3b6d678 into PaddlePaddle:develop on Apr 17, 2018