Trainer hangs when ParallelExecutor runs on an already-occupied GPU #11375

Closed
velconia opened this issue Jun 11, 2018 · 2 comments · Fixed by #11377

Comments

@velconia
Collaborator

velconia commented Jun 11, 2018

Steps to reproduce:

Environment: GPUs 0-3 are idle; GPUs 4-7 are occupied.
Tesla P40, 0 MiB
Tesla P40, 0 MiB
Tesla P40, 0 MiB
Tesla P40, 0 MiB
Tesla P40, 20391 MiB
Tesla P40, 20391 MiB
Tesla P40, 20391 MiB
Tesla P40, 20391 MiB

  1. Start the Pserver on GPUs 0 and 1:
CUDA_VISIBLE_DEVICES=0,1 PADDLE_TRAINING_ROLE=PSERVER PADDLE_PSERVER_PORT=7164 PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=1 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0 python fluid/fluid_benchmark.py --gpus 2 --device GPU --model resnet --update_method pserver &
sleep 5
  2. Start the Trainer on GPUs 3 and 4; at this point GPU 3 is idle and GPU 4 is occupied:
CUDA_VISIBLE_DEVICES=3,4 GLOG_logtostderr=1 GLOG_v=4 PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_PORT=7164 PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=1 PADDLE_CURRENT_IP=127.0.0.1 PADDLE_TRAINER_ID=0 python fluid/fluid_benchmark.py --gpus 2 --device GPU --model resnet --update_method pserver

Result: the Trainer hangs for a long time.

@velconia
Collaborator Author

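Root cause investigation:

Paddle only checked the state of the first GPU (through Executor rather than ParallelExecutor) and did not check the second one. When the variables are broadcast to the second GPU with ncclBcast, that call blocks for a long time, so NcclGroupEnd does not return and the process hangs.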
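As a stopgap on the user side, every GPU in CUDA_VISIBLE_DEVICES can be checked before launching, not just the first. The sketch below is not part of Paddle. It assumes the listing in the reproduction steps came from something like `nvidia-smi --query-gpu=name,memory.used --format=csv,noheader`, and that the nvidia-smi order matches the CUDA device order (which may require `CUDA_DEVICE_ORDER=PCI_BUS_ID`). The helper names `parse_used_mib` and `occupied_devices` are hypothetical.

```python
import re

# Sample listing copied from the environment described in this issue.
LISTING = """Tesla P40, 0 MiB
Tesla P40, 0 MiB
Tesla P40, 0 MiB
Tesla P40, 0 MiB
Tesla P40, 20391 MiB
Tesla P40, 20391 MiB
Tesla P40, 20391 MiB
Tesla P40, 20391 MiB"""


def parse_used_mib(smi_csv):
    """Return used memory in MiB per GPU, in listing order.

    Expects one 'name, N MiB' line per GPU (assumed format).
    """
    used = []
    for line in smi_csv.strip().splitlines():
        match = re.search(r",\s*(\d+)\s*MiB\s*$", line)
        if match is None:
            raise ValueError(f"unrecognized line: {line!r}")
        used.append(int(match.group(1)))
    return used


def occupied_devices(smi_csv, visible_devices, threshold_mib=0):
    """Return the ids from a CUDA_VISIBLE_DEVICES string whose used memory
    exceeds threshold_mib. Checks every listed id, not only the first."""
    used = parse_used_mib(smi_csv)
    ids = [int(x) for x in visible_devices.split(",") if x.strip()]
    return [i for i in ids if used[i] > threshold_mib]


if __name__ == "__main__":
    print(occupied_devices(LISTING, "3,4"))
```

For the Trainer command in this report (`CUDA_VISIBLE_DEVICES=3,4`), the check flags GPU 4, which is the device the first-GPU-only check missed.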

@velconia
Collaborator Author

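Further investigation:

The reason ncclBcast hangs: when the buddy allocator creates the variable for ncclBcast, it detects that GPU memory is insufficient and throws a runtime_error. When the program receives the runtime_error and starts destructing NcclGroupGuard, NcclGroupEnd hangs (the ncclBcast for the first GPU has already been issued but the one for the second GPU has not, so the first call never completes). The fix moves the creation of GPU variables outside NcclGroupGuard. Recommendation: when using NcclGroupGuard in the future, guard only the NCCL calls that actually need it.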
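The ordering problem can be illustrated with a toy model. This is not Paddle's or NCCL's code: `ToyGroup`, `group_guard`, and `allocate` are hypothetical stand-ins. The guard's exit handler plays the role of the NcclGroupGuard destructor. Where the real group end would block on a partially enqueued group, the model only records "would hang".

```python
from contextlib import contextmanager


class ToyGroup:
    """Toy stand-in for a collective group: ending it only succeeds once
    every rank has enqueued its broadcast."""

    def __init__(self, n_ranks):
        self.n_ranks = n_ranks
        self.enqueued = []
        self.end_status = None  # None means end() was never called

    def bcast(self, rank):
        self.enqueued.append(rank)

    def end(self):
        complete = len(self.enqueued) == self.n_ranks
        # A real group end would block here if some rank is missing.
        self.end_status = "completed" if complete else "would hang"


@contextmanager
def group_guard(group):
    try:
        yield group
    finally:
        group.end()  # runs during exception unwinding too, like a destructor


def allocate(rank, free_ranks):
    """Pretend allocation that fails on ranks whose GPU is occupied."""
    if rank not in free_ranks:
        raise RuntimeError(f"out of memory on rank {rank}")
    return bytearray(8)


def bcast_alloc_inside(group, free_ranks):
    # Problematic ordering: allocation can throw after rank 0 has enqueued.
    with group_guard(group):
        for rank in range(group.n_ranks):
            allocate(rank, free_ranks)
            group.bcast(rank)


def bcast_alloc_outside(group, free_ranks):
    # Fixed ordering: all allocations finish (or fail) before the guard opens.
    buffers = [allocate(rank, free_ranks) for rank in range(group.n_ranks)]
    with group_guard(group):
        for rank in range(len(buffers)):
            group.bcast(rank)


def run(fn, free_ranks, n_ranks=2):
    group = ToyGroup(n_ranks)
    try:
        fn(group, free_ranks)
        error = None
    except RuntimeError as exc:
        error = str(exc)
    return group.end_status, error


if __name__ == "__main__":
    print(run(bcast_alloc_inside, {0}))
    print(run(bcast_alloc_outside, {0}))
```

With allocation inside the guard, the failure on rank 1 leaves the group half enqueued when the guard exits. With allocation outside, the error surfaces before the guard is ever entered, so the group end never sees a partial group.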
