OS: Ubuntu 18.04 LTS
GPU: 8 x NVIDIA Tesla V100 32 GB
Docker: pytorch-18.09-py3
I ran the scripts RN50_FP16_8GPU.sh, RN50_FP16_4GPU.sh, RN50_FP32_8GPU.sh, and RN50_FP32_4GPU.sh, and all of them produced a nan loss within the first few epochs (<= 6). After I replaced ResNet50 with ResNet18, the loss still became nan, after ~20 epochs.
I have tried decreasing lr and batch_size, but that did not help.
PS: I could successfully run the MXNet version of ImageNet ResNet50v1.5.
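
To narrow down where the loss first becomes nan, a check like the one below could be used. This is only a minimal sketch and is not part of the scripts above: the helper name first_nonfinite_step is made up for illustration, and it assumes the per-iteration loss is available as a plain Python float (for example via loss.item() in PyTorch).

```python
import math


def first_nonfinite_step(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite.

    `losses` is any iterable of Python floats (hypothetical helper,
    not part of the training scripts).
    """
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


# Made-up loss values for illustration only, not measurements from the runs above.
example_losses = [6.91, 5.87, 4.92, float("nan"), float("nan")]
print(first_nonfinite_step(example_losses))
```

Knowing the exact iteration makes it easier to check the learning rate, the loss scale (for the FP16 runs), and the input batch at that point.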