
Distributed training issue on aws #10106

Closed · putcn opened this issue Apr 20, 2018 · 3 comments

putcn commented Apr 20, 2018

Still trying to get the VGG benchmark working on AWS. I fixed the earlier issue, which was due to the build type. Now I'm using the latest build with GPU ON; here is the error message from the trainer:

/usr/local/lib/python2.7/dist-packages/paddle/fluid/average.py:42: Warning: The WeightedAverage is deprecated, please use fluid.metrics.Accuracy instead.
  (self.__class__.__name__), Warning)
*** Aborted at 1524263251 (unix time) try "date -d @1524263251" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 1 (TID 0x7f3eec864700) from PID 0; stack trace: ***
    @     0x7f3eec43c390 (unknown)
    @                0x0 (unknown)

Here are the commands to start the pserver and trainer on different AWS instances:

docker run --network="host" -i \
-e "SERVER_ENDPOINT=172.31.48.104:5436" \
-e "MASTER_ENDPOINT=172.31.48.110:5436" \
-e "TASK_NAME=nostalgic_boyd" \
-e "TRAINER_INDEX=0" \
-e "TRAINING_ROLE=PSERVER" \
-e "TRAINER_COUNT=2" \
-e "TRAINERS=2" \
-e "PSERVER_HOSTS=172.31.48.101:5436,172.31.48.104:5436" \
-e "PSERVERS=172.31.48.101:5436,172.31.48.104:5436" \
putcn/vgg16_test:latest-mainrepo \
--device CPU \
--local no

nvidia-docker run --network="host" -i \
-e "MASTER_ENDPOINT=172.31.48.110:5436" \
-e "TASK_NAME=nostalgic_boyd" \
-e "TRAINER_COUNT=2" \
-e "TRAINERS=2" \
-e "TRAINER_INDEX=1" \
-e "PADDLE_INIT_TRAINER_ID=1" \
-e "TRAINING_ROLE=TRAINER" \
-e "PSERVER_HOSTS=172.31.48.101:5436,172.31.48.104:5436" \
-e "PSERVERS=172.31.48.101:5436,172.31.48.104:5436" \
putcn/vgg16_test:latest-mainrepo \
--device GPU \
--local no \
--batch_size 20

For more info, please find the master log at:
http://13.58.193.187:5436/status
trainer log:
http://13.58.193.187:5436/log/trainer_0.log
trainer error log:
http://13.58.193.187:5436/log/trainer_0_err.log
pserver log:
http://13.58.193.187:5436/log/pserver_172.31.48.104.log
pserver error log:
http://13.58.193.187:5436/log/pserver_172.31.48.104_err.log

thanks!

Yancey1989 commented

Hi @putcn, I can't reproduce this issue on my host. Maybe you can send me the key of the AWS host so that I can do some debugging.

putcn commented Apr 27, 2018

I was trying to reproduce the error on our local dev machine; it looks like the testing script is not working with the latest Paddle dist.

I built the latest Paddle with the following command:

nvidia-docker run --rm -v $PWD:/paddle \
-e "WITH_GPU=ON" \
-e "WITH_DISTRIBUTE=ON" \
-e "WITH_AVX=ON" \
-e "WITH_GOLANG=ON" \
-e "WITH_PYTHON=ON" \
-e "WITH_STYLE_CHECK=OFF" \
-e "WITH_TESTING=OFF" \
-e "WITH_DOC=OFF" \
paddlepaddle/paddle:latest-dev

Then I built the production image with:

cd build
nvidia-docker build -t putcn/paddle:latest-gpu-dist .

Then I built the vgg16 testing Docker image with the Python file from https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/cluster/vgg16/vgg16_fluid.py and tagged it as putcn/vgg16_test:latest-gpu-dist.
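
The image is built roughly like this (a minimal sketch, not the exact Dockerfile; the WORKDIR and ENTRYPOINT are assumptions that match how the run commands below pass --device/--local/--ps_hosts straight to the script):

# Sketch only: layer the benchmark script onto the production image built above.
cat > Dockerfile <<'EOF'
FROM putcn/paddle:latest-gpu-dist
ADD vgg16_fluid.py /workspace/
WORKDIR /workspace
ENTRYPOINT ["python", "vgg16_fluid.py"]
EOF
docker build -t putcn/vgg16_test:latest-gpu-dist .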

The script to start the pserver:

docker run --rm --network="host" -i \
-e "SERVER_ENDPOINT=172.19.56.199:5436" \
-e "MASTER_ENDPOINT=172.19.56.199:5436" \
-e "POD_IP=172.19.56.199" -e "PADDLE_INIT_PORT=5436" \
-e "TASK_NAME=nervous_newton" \
-e "TRAINER_INDEX=0" \
-e "TRAINING_ROLE=PSERVER" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "PSERVER_HOSTS=172.19.56.199:5436" \
-e "PSERVERS=172.19.56.199:5436" \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
putcn/vgg16_test:latest-gpu-dist \
--device CPU \
--local no \
--ps_hosts 172.19.56.199:5436

which looks like it works fine, and it's listening on port 5436:

I0427 21:21:28.128012    78 grpc_server.cc:224] Server listening on 172.19.56.199:5436 selected port: 5436
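
As a side note, reachability of the pserver endpoint from the trainer host can be sanity-checked with something like the following (assuming netcat is installed; this is just a generic probe, not part of the original commands):

# Probe the pserver endpoint from the trainer host (netcat assumed available).
nc -vz 172.19.56.199 5436      # TCP check against the gRPC listen port
nc -vzu 172.19.56.199 5436     # UDP probe; note nc UDP checks are not fully reliable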

Then I tried to start the trainer on the same server:

nvidia-docker run --network="host" -it  \
-e "MASTER_ENDPOINT=172.19.56.199:5436" \
-e "TASK_NAME=nervous_newton" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "TRAINER_INDEX=1"  \
-e "PADDLE_INIT_TRAINER_ID=1" \
-e "TRAINING_ROLE=TRAINER"  \
-e "PSERVER_HOSTS=172.19.56.199:5436"  \
-e "PSERVERS=172.19.56.199:5436"  \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
putcn/vgg16_test:latest-gpu-dist \
--device GPU \
--local no \
--ps_hosts 172.19.56.199:5436

It failed with the following error:

Traceback (most recent call last):
  File "vgg16_fluid.py", line 293, in <module>
    main()
  File "vgg16_fluid.py", line 261, in main
    exe.run(fluid.default_startup_program())
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 336, in run
    self.executor.run(program.desc, scope, 0, True, True)
paddle.fluid.core.EnforceNotMet: enforce allocating <= available failed, 11177147630 > 901906176
 at [/paddle/paddle/fluid/platform/gpu_info.cc:119]

I tried both docker and nvidia-docker on the pserver side; same error.
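
The EnforceNotMet above means Fluid tried to pre-allocate roughly 11 GB while only about 900 MB of GPU memory was free (perhaps something else on the box is already holding the card). A possible workaround, assuming the standard fraction_of_gpu_memory_to_use gflag is read from the FLAGS_-prefixed environment variable, would be to cap the trainer's pre-allocation, e.g.:

# Sketch of a possible workaround (not verified here): limit Fluid's GPU
# pre-allocation so the trainer does not try to grab most of the card up front.
nvidia-docker run --network="host" -it \
-e "FLAGS_fraction_of_gpu_memory_to_use=0.5" \
-e "MASTER_ENDPOINT=172.19.56.199:5436" \
-e "TASK_NAME=nervous_newton" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "TRAINER_INDEX=1" \
-e "PADDLE_INIT_TRAINER_ID=1" \
-e "TRAINING_ROLE=TRAINER" \
-e "PSERVER_HOSTS=172.19.56.199:5436" \
-e "PSERVERS=172.19.56.199:5436" \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
putcn/vgg16_test:latest-gpu-dist \
--device GPU \
--local no \
--ps_hosts 172.19.56.199:5436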

putcn commented Apr 27, 2018

The issue was caused by the AWS network policy, which needs to allow UDP traffic over the specific port. Now it works.
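
For anyone hitting the same thing: the fix amounts to opening the port between the instances in the security group. A sketch with the AWS CLI, where sg-xxxxxxxx and 172.31.0.0/16 are placeholders for the actual security group ID and VPC CIDR:

# Sketch only: allow UDP on the pserver port within the VPC.
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxxxxxxx \
  --protocol udp --port 5436 \
  --cidr 172.31.0.0/16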

putcn closed this as completed Apr 27, 2018