
Distributed training issue on aws #10106

Closed · putcn opened this issue Apr 20, 2018 · 3 comments

putcn commented Apr 20, 2018

Still trying to get the VGG benchmark working on AWS. I fixed the earlier issue, which was due to the build type. Now I'm using the latest build with GPU ON; here is the error message from the trainer:

/usr/local/lib/python2.7/dist-packages/paddle/fluid/average.py:42: Warning: The WeightedAverage is deprecated, please use fluid.metrics.Accuracy instead.
  (self.__class__.__name__), Warning)
*** Aborted at 1524263251 (unix time) try "date -d @1524263251" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 1 (TID 0x7f3eec864700) from PID 0; stack trace: ***
    @     0x7f3eec43c390 (unknown)
    @                0x0 (unknown)

Here are the commands to start the pserver and trainer on different AWS instances:

docker run --network="host" -i \
-e "SERVER_ENDPOINT=172.31.48.104:5436" \
-e "MASTER_ENDPOINT=172.31.48.110:5436" \
-e "TASK_NAME=nostalgic_boyd" \
-e "TRAINER_INDEX=0" \
-e "TRAINING_ROLE=PSERVER" \
-e "TRAINER_COUNT=2" \
-e "TRAINERS=2" \
-e "PSERVER_HOSTS=172.31.48.101:5436,172.31.48.104:5436" \
-e "PSERVERS=172.31.48.101:5436,172.31.48.104:5436" \
putcn/vgg16_test:latest-mainrepo \
--device CPU \
--local no

nvidia-docker run --network="host" -i \
-e "MASTER_ENDPOINT=172.31.48.110:5436" \
-e "TASK_NAME=nostalgic_boyd" \
-e "TRAINER_COUNT=2" \
-e "TRAINERS=2" \
-e "TRAINER_INDEX=1" \
-e "PADDLE_INIT_TRAINER_ID=1" \
-e "TRAINING_ROLE=TRAINER" \
-e "PSERVER_HOSTS=172.31.48.101:5436,172.31.48.104:5436" \
-e "PSERVERS=172.31.48.101:5436,172.31.48.104:5436" \
putcn/vgg16_test:latest-mainrepo \
--device GPU \
--local no \
--batch_size 20

For more info, please find the master log at:
http://13.58.193.187:5436/status
trainer log:
http://13.58.193.187:5436/log/trainer_0.log
trainer error log:
http://13.58.193.187:5436/log/trainer_0_err.log
pserver log:
http://13.58.193.187:5436/log/pserver_172.31.48.104.log
pserver error log:
http://13.58.193.187:5436/log/pserver_172.31.48.104_err.log

thanks!

Yancey1989 commented

Hi @putcn, I can't reproduce this issue on my host. Maybe you can send me the key of the AWS host so that I can do some debugging.

putcn commented Apr 27, 2018

I was trying to reproduce the error on our local dev machine; it looks like the testing script is not working with the latest Paddle dist.

I built the latest Paddle with the following command:

nvidia-docker run --rm -v $PWD:/paddle \
-e "WITH_GPU=ON" \
-e "WITH_DISTRIBUTE=ON" \
-e "WITH_AVX=ON" \
-e "WITH_GOLANG=ON" \
-e "WITH_PYTHON=ON" \
-e "WITH_STYLE_CHECK=OFF" \
-e "WITH_TESTING=OFF" \
-e "WITH_DOC=OFF" \
paddlepaddle/paddle:latest-dev

Then I built the production image with:

cd build
nvidia-docker build -t putcn/paddle:latest-gpu-dist .

Then I built the vgg16 testing Docker image with the Python file from https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/cluster/vgg16/vgg16_fluid.py and tagged it as putcn/vgg16_test:latest-gpu-dist.
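
The image is built roughly like this (a minimal sketch, not the exact Dockerfile; the WORKDIR and ENTRYPOINT are assumptions that match how the run commands below pass --device/--local/--ps_hosts straight to the script):

# Sketch only: layer the benchmark script onto the production image built above.
cat > Dockerfile <<'EOF'
FROM putcn/paddle:latest-gpu-dist
ADD vgg16_fluid.py /workspace/
WORKDIR /workspace
ENTRYPOINT ["python", "vgg16_fluid.py"]
EOF
docker build -t putcn/vgg16_test:latest-gpu-dist .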

The script to start the pserver:

docker run --rm --network="host" -i \
-e "SERVER_ENDPOINT=172.19.56.199:5436" \
-e "MASTER_ENDPOINT=172.19.56.199:5436" \
-e "POD_IP=172.19.56.199" -e "PADDLE_INIT_PORT=5436" \
-e "TASK_NAME=nervous_newton" \
-e "TRAINER_INDEX=0" \
-e "TRAINING_ROLE=PSERVER" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "PSERVER_HOSTS=172.19.56.199:5436" \
-e "PSERVERS=172.19.56.199:5436" \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
putcn/vgg16_test:latest-gpu-dist \
--device CPU \
--local no \
--ps_hosts 172.19.56.199:5436

which looks like it works fine, and it's listening on port 5436:

I0427 21:21:28.128012    78 grpc_server.cc:224] Server listening on 172.19.56.199:5436 selected port: 5436
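
As a side note, reachability of the pserver endpoint from the trainer host can be sanity-checked with something like the following (assuming netcat is installed; this is just a generic probe, not part of the original commands):

# Probe the pserver endpoint from the trainer host (netcat assumed available).
nc -vz 172.19.56.199 5436      # TCP check against the gRPC listen port
nc -vzu 172.19.56.199 5436     # UDP probe; note nc UDP checks are not fully reliable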

Then I tried to start the trainer on the same server:

nvidia-docker run --network="host" -it  \
-e "MASTER_ENDPOINT=172.19.56.199:5436" \
-e "TASK_NAME=nervous_newton" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "TRAINER_INDEX=1"  \
-e "PADDLE_INIT_TRAINER_ID=1" \
-e "TRAINING_ROLE=TRAINER"  \
-e "PSERVER_HOSTS=172.19.56.199:5436"  \
-e "PSERVERS=172.19.56.199:5436"  \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
putcn/vgg16_test:latest-gpu-dist \
--device GPU \
--local no \
--ps_hosts 172.19.56.199:5436

It failed with the following error:

Traceback (most recent call last):
  File "vgg16_fluid.py", line 293, in <module>
    main()
  File "vgg16_fluid.py", line 261, in main
    exe.run(fluid.default_startup_program())
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 336, in run
    self.executor.run(program.desc, scope, 0, True, True)
paddle.fluid.core.EnforceNotMet: enforce allocating <= available failed, 11177147630 > 901906176
 at [/paddle/paddle/fluid/platform/gpu_info.cc:119]

I tried both docker and nvidia-docker on the pserver side; same error.
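
The EnforceNotMet above means Fluid tried to pre-allocate roughly 11 GB while only about 900 MB of GPU memory was free (perhaps something else on the box is already holding the card). A possible workaround, assuming the standard fraction_of_gpu_memory_to_use gflag is read from the FLAGS_-prefixed environment variable, would be to cap the trainer's pre-allocation, e.g.:

# Sketch of a possible workaround (not verified here): limit Fluid's GPU
# pre-allocation so the trainer does not try to grab most of the card up front.
nvidia-docker run --network="host" -it \
-e "FLAGS_fraction_of_gpu_memory_to_use=0.5" \
-e "MASTER_ENDPOINT=172.19.56.199:5436" \
-e "TASK_NAME=nervous_newton" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "TRAINER_INDEX=1" \
-e "PADDLE_INIT_TRAINER_ID=1" \
-e "TRAINING_ROLE=TRAINER" \
-e "PSERVER_HOSTS=172.19.56.199:5436" \
-e "PSERVERS=172.19.56.199:5436" \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
putcn/vgg16_test:latest-gpu-dist \
--device GPU \
--local no \
--ps_hosts 172.19.56.199:5436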

putcn commented Apr 27, 2018

The issue was caused by the AWS network policy, which needs to allow UDP traffic over the specific port. Now it works.
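
For anyone hitting the same thing: the fix amounts to opening the port between the instances in the security group. A sketch with the AWS CLI, where sg-xxxxxxxx and 172.31.0.0/16 are placeholders for the actual security group ID and VPC CIDR:

# Sketch only: allow UDP on the pserver port within the VPC.
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol udp --port 5436 \
  --cidr 172.31.0.0/16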

putcn closed this as completed Apr 27, 2018