Distributed training issue on AWS #10106
Hi @putcn, I can't reproduce this issue on my host. Maybe you can send me the key of the AWS host so that I can do some debugging.
I was trying to reproduce the error on our local dev machine, but it looks like the testing script does not work with the latest Paddle dist build. I built the latest Paddle with the following command:

```bash
nvidia-docker run --rm -v $PWD:/paddle \
    -e "WITH_GPU=ON" \
    -e "WITH_DISTRIBUTE=ON" \
    -e "WITH_AVX=ON" \
    -e "WITH_GOLANG=ON" \
    -e "WITH_PYTHON=ON" \
    -e "WITH_STYLE_CHECK=OFF" \
    -e "WITH_TESTING=OFF" \
    -e "WITH_DOC=OFF" \
    paddlepaddle/paddle:latest-dev
```

Then I built the production image with:

```bash
cd build
nvidia-docker build -t putcn/paddle:latest-gpu-dist .
```

Then I built the vgg16 testing docker image with the python file from https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/cluster/vgg16/vgg16_fluid.py (a sketch of how such an image might look is below).
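The thread does not show the Dockerfile for the test image, but given the `docker run` invocations below (where `--device`, `--local`, and `--ps_hosts` are passed straight to the container), a minimal sketch might look like this, assuming the production image above as the base and vgg16_fluid.py as the entrypoint:

```bash
# Hypothetical sketch; the actual Dockerfile is not shown in this thread.
# vgg16_fluid.py is the benchmark script linked above; the ENTRYPOINT makes
# `docker run <image> --device GPU ...` forward its flags to the script.
cat > Dockerfile <<'EOF'
FROM putcn/paddle:latest-gpu-dist
ADD vgg16_fluid.py /workspace/vgg16_fluid.py
ENTRYPOINT ["python", "/workspace/vgg16_fluid.py"]
EOF
docker build -t putcn/vgg16_test:latest-gpu-dist .
```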
-e "SERVER_ENDPOINT=172.19.56.199:5436" \
-e "MASTER_ENDPOINT=172.19.56.199:5436" \
-e "POD_IP=172.19.56.199" -e "PADDLE_INIT_PORT=5436" \
-e "TASK_NAME=nervous_newton" \
-e "TRAINER_INDEX=0" \
-e "TRAINING_ROLE=PSERVER" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "PSERVER_HOSTS=172.19.56.199:5436" \
-e "PSERVERS=172.19.56.199:5436" \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
putcn/vgg16_test:latest-gpu-dist \
--device CPU \
--local no \
--ps_hosts 172.19.56.199:5436 which looks works fine, and it's listening at port 5436 I0427 21:21:28.128012 78 grpc_server.cc:224] Server listening on 172.19.56.199:5436 selected port: 5436 then I tried to start the trainer in the same server: nvidia-docker run --network="host" -it \
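As a quick sanity check (my addition, not from the original thread), one can confirm the pserver's gRPC port is reachable before launching the trainer; this assumes netcat is installed on the host:

```bash
# Probe the pserver endpoint used above; -z scans without sending data,
# -v reports success or failure for the port.
nc -zv 172.19.56.199 5436
```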
-e "MASTER_ENDPOINT=172.19.56.199:5436" \
-e "TASK_NAME=nervous_newton" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "TRAINER_INDEX=1" \
-e "PADDLE_INIT_TRAINER_ID=1" \
-e "TRAINING_ROLE=TRAINER" \
-e "PSERVER_HOSTS=172.19.56.199:5436" \
-e "PSERVERS=172.19.56.199:5436" \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
putcn/vgg16_test:latest-gpu-dist \
--device GPU \
--local no \
--ps_hosts 172.19.56.199:5436 it failed with the following error:
I tried with both
The issue was caused by an AWS network policy that needs to allow UDP traffic over a specific port. Now it works.
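For reference (my addition, not stated in the thread), that kind of fix corresponds to adding a UDP ingress rule to the instances' security group. A sketch with the AWS CLI, where the group ID and CIDR are placeholders:

```bash
# Allow inbound UDP on the port the cluster uses (5436 in the commands above).
# sg-0123456789abcdef0 and 172.31.0.0/16 are placeholder values.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol udp \
    --port 5436 \
    --cidr 172.31.0.0/16
```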
Still trying to get the vgg benchmark working on AWS. I fixed the earlier issue, which was due to the build type; now I'm using the latest build with GPU ON. Here is the error message from the trainer:
Here are the commands to start the pserver and trainer on different AWS instances:
For more info, please find the logs below (a small fetch sketch follows the list).
master log:
http://13.58.193.187:5436/status
trainer log:
http://13.58.193.187:5436/log/trainer_0.log
trainer error log:
http://13.58.193.187:5436/log/trainer_0_err.log
pserver log:
http://13.58.193.187:5436/log/pserver_172.31.48.104.log
pserver error log:
http://13.58.193.187:5436/log/pserver_172.31.48.104_err.log
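For convenience (my addition), the logs above can be pulled with curl using the URLs exactly as listed:

```bash
# Fetch the master status and each log file listed above.
BASE=http://13.58.193.187:5436
curl -s "$BASE/status"
for f in trainer_0.log trainer_0_err.log \
         pserver_172.31.48.104.log pserver_172.31.48.104_err.log; do
    curl -sO "$BASE/log/$f"   # -O saves each file under its remote name
done
```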
thanks!