
aws benchmarking tool #9638

Merged: 16 commits into PaddlePaddle:develop on Apr 17, 2018
Conversation

@putcn (Contributor) commented Apr 4, 2018

Purpose

This is an automation tool for deploying PaddlePaddle benchmark tests to AWS.

Features

  • Creates a subnet sized to just the number of EC2 instances required
  • Allocates pserver and trainer EC2 instances and verifies instance state
  • nvidia-docker ready for GPU training
  • Garbage-collects instances and network elements when a task completes or an error occurs
  • Collects test logs in real time
  • Provides a web service for checking logs or tearing down the test setup
  • Requires no changes to the test code
  • Offers many optional configuration options

For more info, please refer to the README.md in this PR.
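To make the feature list above concrete, here is a minimal boto3 sketch of the allocate, verify, and garbage-collect cycle such a tool automates. This is an illustrative outline only, not the PR's implementation; the VPC ID, AMI ID, CIDR block, and instance types are assumed placeholders.

```python
# Illustrative outline of the allocate/verify/clean-up cycle; all IDs and
# values below are placeholders, not taken from the PR.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

VPC_ID = "vpc-0123456789abcdef0"   # hypothetical
AMI_ID = "ami-0123456789abcdef0"   # hypothetical GPU-ready AMI


def launch_cluster(pserver_count, trainer_count):
    # Subnet sized to just the instances needed (plus AWS-reserved addresses).
    subnet = ec2.create_subnet(VpcId=VPC_ID, CidrBlock="10.0.1.0/28")
    subnet_id = subnet["Subnet"]["SubnetId"]
    instance_ids = []
    try:
        for count, instance_type in [(pserver_count, "c5.2xlarge"),
                                     (trainer_count, "p2.xlarge")]:
            resp = ec2.run_instances(
                ImageId=AMI_ID, InstanceType=instance_type,
                MinCount=count, MaxCount=count, SubnetId=subnet_id)
            instance_ids += [i["InstanceId"] for i in resp["Instances"]]
        # Verify instance state before starting training.
        ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)
        return subnet_id, instance_ids
    except Exception:
        # Garbage-collect on error: instances first, then the subnet.
        if instance_ids:
            ec2.terminate_instances(InstanceIds=instance_ids)
            ec2.get_waiter("instance_terminated").wait(InstanceIds=instance_ids)
        ec2.delete_subnet(SubnetId=subnet_id)
        raise
```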

Yancey1989 previously approved these changes Apr 4, 2018

@Yancey1989 (Contributor) left a comment:

Thanks, it's a very useful PR.
I see we need some Python dependencies to run the client, so maybe we could run the client in a Docker image? In that case we would need a Dockerfile under the path tools/aws_benchmarking/Dockerfile.

@@ -0,0 +1 @@
nvidia-docker run -i -e "TRAINING_ROLE=PSERVER" -e "PSERVER_HOSTS={PSERVER_HOSTS}" {DOCKER_IMAGE}
Reviewer (Contributor):

Maybe we need a -p argument to map the container port to the host?

putcn (Author):

thanks, will add it
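For reference, a minimal sketch of what the pserver launch template could look like with the suggested -p flag added, written as the Python format string the tool presumably fills in. The PSERVER_PORT placeholder and the example values are assumptions, not taken from the PR.

```python
# Sketch only: pserver launch command template with a host port mapping added.
# PSERVER_PORT is a hypothetical placeholder name; the PR may use another.
PSERVER_CMD_TEMPLATE = (
    "nvidia-docker run -i "
    "-p {PSERVER_PORT}:{PSERVER_PORT} "
    '-e "TRAINING_ROLE=PSERVER" '
    '-e "PSERVER_HOSTS={PSERVER_HOSTS}" '
    "{DOCKER_IMAGE}"
)

cmd = PSERVER_CMD_TEMPLATE.format(
    PSERVER_PORT="5436",
    PSERVER_HOSTS="192.168.1.2:5436,192.168.1.3:5436",
    DOCKER_IMAGE="paddlepaddle/paddle:latest-gpu",  # illustrative image name
)
```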

@putcn (Author) commented Apr 5, 2018

@Yancey1989 running this tool in Docker is a great idea, will do, thanks.

@helinwang (Contributor) left a comment:

It would be nice if we could achieve the following:

  1. The user doesn't have to change anything in train.py; all they have to do is provide train.py and specify the command line arguments for the pserver and trainer.

  2. Terminate the EC2 instances automatically when the training process finishes (either due to completion or due to a crash).

  3. Save all output (stdout and stderr) from the process, and be able to show the output in real time (a rough sketch of points 2 and 3 follows below).
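A rough sketch of points 2 and 3, assuming boto3 and a locally launched training process; the instance IDs and region are placeholders, and the PR's actual mechanism may differ.

```python
# Sketch only: stream a training process's output in real time and terminate
# the cluster's EC2 instances when it finishes, whether it completed or crashed.
import subprocess

import boto3


def run_and_clean_up(train_cmd, instance_ids, region="us-west-2"):
    ec2 = boto3.client("ec2", region_name=region)
    try:
        proc = subprocess.Popen(
            train_cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            universal_newlines=True)
        for line in proc.stdout:  # relay each line as soon as it appears
            print(line, end="")
        proc.wait()
    finally:
        # Tear down on completion or crash (point 2).
        ec2.terminate_instances(InstanceIds=instance_ids)
```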

parser.add_argument(
'--pserver_instance_type',
type=str,
default="p2.8xlarge",
Reviewer (Contributor):

We probably don't need a GPU instance for the pserver.

putcn (Author):
Agreed, will update the default pserver instance type.
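A minimal sketch of the adjusted default, assuming a CPU instance type is chosen for the pserver; c5.2xlarge is illustrative, not necessarily the value the PR settled on.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--pserver_instance_type',
    type=str,
    default="c5.2xlarge",  # CPU instance type; pservers don't need a GPU
    help="Pserver instance type")
```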

'--pserver_count', type=int, default=1, help="Pserver count")

parser.add_argument(
'--action', type=str, default="serve", help="create|cleanup|status")
Reviewer (Contributor):

serve is not in the set "create|cleanup|status"?

putcn (Author):

thanks for catching this
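A minimal sketch of how the --action argument could be made consistent, for example by listing serve in an explicit choices set; the PR's final wording may differ.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--action',
    type=str,
    default="serve",
    choices=["serve", "create", "cleanup", "status"],
    help="serve|create|cleanup|status")
```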

@putcn (Author) commented Apr 9, 2018

@helinwang thanks for the great ideas, will update

@putcn (Author) commented Apr 11, 2018

Going to make some final tweaks and add a README file tomorrow.

```

***Please Note***
Training nodes will run your `ENTRYPOINT` script with the following environment variables:
Reviewer (Contributor):

Does this work with our current benchmark scripts?
The vgg16 script takes the env variables SERVER_ENDPOINT, PSERVERS, TRAINERS, and TRAINING_ROLE.

example usages:

pserver:

SERVER_ENDPOINT=172.19.61.250:8000 PSERVERS=172.19.61.250:8000 TRAINERS=1 TRAINING_ROLE=PSERVER CUDA_VISIBLE_DEVICES=2 LD_LIBRARY_PATH=`pwd`:/usr/local/cuda-8.0/lib64:/usr/local/lib/ python vgg16_fluid.py --local false --device GPU --data_set flowers --batch_size 4

trainer:

SERVER_ENDPOINT=172.19.61.250:8000 PSERVERS=172.19.61.250:8000 TRAINERS=1 TRAINING_ROLE=TRAINER CUDA_VISIBLE_DEVICES=1 LD_LIBRARY_PATH=`pwd`:/usr/local/cuda-8.0/lib64:/usr/local/lib/ python vgg16_fluid.py --local false --device GPU --data_set flowers --batch_size 4

putcn (Author):

got it, will update the env vars
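A minimal sketch of the environment a trainer container might be launched with so that vgg16_fluid.py's expected variables are all present; the endpoint values and trainer count are illustrative placeholders.

```python
# Sketch only: environment passed to a trainer node, covering both the newer
# names and the ones vgg16_fluid.py expects (SERVER_ENDPOINT, PSERVERS,
# TRAINERS, TRAINING_ROLE). Values are illustrative placeholders.
pserver_endpoints = "192.168.1.2:5436,192.168.1.3:5436"

trainer_env = {
    "TRAINING_ROLE": "TRAINER",
    "SERVER_ENDPOINT": "192.168.1.2:5436",  # endpoint this node talks to
    "PSERVERS": pserver_endpoints,          # expected by vgg16_fluid.py
    "PSERVER_HOSTS": pserver_endpoints,     # same list under the newer name
    "TRAINERS": "2",                        # number of trainer nodes
}

# Rendered as docker -e flags when the training container is launched.
env_flags = " ".join('-e "{}={}"'.format(k, v) for k, v in trainer_env.items())
```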

To access the master log:

```bash
docker run -i -v $HOME/.aws:/root/.aws -v <full path to your pem file>:/<key pair name>.pem \
Reviewer (Contributor):

Curious why accessing the master log requires *.pem?

putcn (Author):

Ah, good catch, we don't need the pem to access the log, will update.

To retrieve training logs
TBD

### Tech details
Reviewer (Contributor):

There is one special character here that needs to be deleted (rendered as �Tech details).

putcn (Author):

thanks, will update

@putcn changed the title from "[WIP]aws benchmarking tool" to "aws benchmarking tool" on Apr 12, 2018
@helinwang (Contributor) left a comment:

One comment, otherwise LGTM!


- `TASK_NAME`: unique name to identify this training process.
- `TRAINING_ROLE`: the current node's role in this training process, either "PSERVER" or "TRAINER".
- `PSERVER_HOSTS`: comma-separated list of pserver endpoints, e.g. "192.168.1.2:5436,192.168.1.3:5436".
Reviewer (Contributor):

Do we need PSERVER_HOSTS? It's not in @typhoonzero's script or transpiler.py. Could we remove it? Otherwise it causes confusion about why "PSERVER_HOSTS" and "PSERVERS" exist with exactly the same meaning.

putcn (Author):

The reason for leaving these duplicated env vars is to stay compatible with other existing tests, which may require different env vars for the same purpose.
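From the training-script side, a minimal sketch of why keeping both names is harmless: a script can read whichever variable it was written against and see the same endpoint list. Illustrative only, not the PR's code.

```python
import os


def pserver_endpoints():
    # PSERVER_HOSTS and PSERVERS carry the same comma-separated endpoint list,
    # so a script written against either name resolves to the same value.
    value = os.environ.get("PSERVER_HOSTS") or os.environ.get("PSERVERS", "")
    return [ep for ep in value.split(",") if ep]
```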

@helinwang (Contributor) left a comment:

LGTM!!!

@putcn merged commit 3b6d678 into PaddlePaddle:develop on Apr 17, 2018