CUDA device-side assert: image scores passed to BCE loss are NaN during early training #24

Closed
xiaaoo-zz opened this issue Dec 7, 2020 · 8 comments

@xiaaoo-zz

Thanks for your amazing work!
I get RuntimeError: CUDA error: device-side assert triggered around ~200 steps into training. The error occurs every time, even after rerunning the program multiple times and, following #22, setting a higher epsilon (I've tried 1e-8 and 1e-6).

This is the command I use for training. I have tried PyTorch 1.6 and 1.7 with CUDA 10.1:

CUDA_LAUNCH_BLOCKING=1 python tools/train_net.py --config-file "configs/voc/V_16_voc07.yaml" --use-tensorboard  \
OUTPUT_DIR output \
SOLVER.IMS_PER_BATCH 1 \
SOLVER.ITER_SIZE 8 \
DB.METHOD none

Here is the log:

2020-12-08 00:13:05,575 wetectron.trainer INFO: eta: 1 day, 5:33:33  iter: 180  loss: 0.4550 (0.6005)  loss_img: 0.2575 (0.2831)  loss_ref_cls0: 0.0003 (0.0011)  loss_ref_reg0: 0.0000 (0.0002)  loss_ref_cls1: 0.1219 (0.1517)  loss_ref_reg1: 0.0277 (0.0274)  loss_ref_cls2: 0.0612 (0.1122)  loss_ref_reg2: 0.0119 (0.0247)  acc_img: 0.0000 (0.2319)  acc_ref0: 0.0000 (0.0690)  acc_ref1: 0.0000 (0.2546)  acc_ref2: 0.0000 (0.2713)  time: 0.4167 (0.4437)  data: 0.0097 (0.0117)  lr: 0.004100  max mem: 4047
tensor([0.0061, 0.0272, 0.0212, 0.0003, 0.0143, 0.0203, 0.0304, 0.2264, 0.0059,
        0.2383, 0.0125, 0.0261, 0.0525, 0.1852, 0.0306, 0.0003, 0.0211, 0.0074,
        0.0092, 0.0183, 0.0403], device='cuda:0', grad_fn=<ClampBackward>)
tensor([3.4809e-03, 2.0176e-02, 1.5387e-02, 3.3157e-05, 7.3178e-03, 2.7314e-02,
        1.9246e-02, 4.2303e-01, 1.8498e-03, 2.4402e-01, 8.2913e-03, 2.3048e-02,
        4.3761e-02, 1.8561e-01, 1.7174e-02, 5.2509e-05, 1.1843e-02, 3.0689e-03,
        5.3479e-03, 8.2327e-03, 2.8905e-02], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([0.0063, 0.0345, 0.0294, 0.0005, 0.0227, 0.0191, 0.0276, 0.1232, 0.0156,
        0.2002, 0.0143, 0.0268, 0.0588, 0.2178, 0.0280, 0.0011, 0.0264, 0.0082,
        0.0126, 0.0231, 0.0392], device='cuda:0', grad_fn=<ClampBackward>)
tensor([4.6997e-03, 2.4954e-02, 1.9726e-02, 8.2294e-05, 1.1063e-02, 2.8040e-02,
        2.3942e-02, 3.1750e-01, 4.0102e-03, 2.3422e-01, 1.1085e-02, 2.5569e-02,
        4.9987e-02, 1.9260e-01, 2.2549e-02, 1.3269e-04, 1.6148e-02, 5.0528e-03,
        7.5706e-03, 1.2965e-02, 3.4329e-02], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 5.4938e-07, 3.5194e-06, 1.3429e-08, 4.6343e-08, 1.0000e-08,
        1.0000e-08, 3.2079e-05, 1.9538e-08, 4.8927e-03, 9.9997e-05, 1.4685e-07,
        3.2431e-01, 7.7275e-02, 1.0000e-08, 9.3841e-01, 1.4319e-03, 1.0000e-08,
        1.0000e-08, 1.9370e-08, 1.7341e-06], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 2.8905e-08, 9.4311e-06, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 6.9164e-04, 1.0000e-08, 1.7446e-02, 2.6333e-08, 2.8428e-07,
        1.0017e-01, 7.1074e-02, 1.0000e-08, 9.8742e-01, 1.4429e-06, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 6.2333e-07], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 5.2896e-06, 6.3474e-07, 8.0780e-07, 4.2772e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 2.0732e-02, 1.0000e-08, 2.8691e-08,
        2.5120e-01, 1.0660e-02, 1.0000e-08, 8.6331e-01, 2.0186e-03, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 3.8909e-07], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 5.0810e-07, 1.7225e-06, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.9051e-07, 2.6197e-02, 5.0034e-05, 1.4151e-07,
        4.4598e-01, 7.7825e-02, 1.0000e-08, 2.2080e-01, 9.6862e-03, 1.0000e-08,
        1.0000e-08, 2.3176e-08, 1.8984e-06], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 2.9017e-08, 1.0000e-08, 1.0000e-08, 4.3676e-07,
        1.0000e-08, 9.8482e-01, 1.0000e-08, 2.6416e-03, 1.7355e-06, 5.0040e-08,
        3.8472e-01, 9.1854e-03, 1.0000e-08, 9.9960e-01, 5.8315e-05, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 4.1070e-07], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 3.2842e-06, 2.7481e-06, 1.5629e-07, 3.9651e-06, 1.0004e-07,
        9.6509e-08, 1.3372e-04, 1.6091e-08, 1.6981e-02, 2.7259e-04, 1.5076e-05,
        1.5756e-01, 6.3610e-02, 3.7470e-07, 9.4090e-01, 2.4577e-04, 1.0000e-08,
        1.0000e-08, 4.2767e-08, 1.5077e-04], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 7.5319e-04, 1.0000e-08, 1.0000e-08,
        1.7518e-03, 1.7722e-01, 1.0000e-08, 9.8997e-01, 6.3139e-02, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 4.1463e-05, 5.7913e-04, 1.0175e-05, 8.5911e-06, 1.0000e-08,
        2.1342e-07, 6.7830e-02, 1.5353e-06, 2.2693e-02, 1.1492e-07, 1.2851e-05,
        3.3217e-01, 1.1930e-01, 4.0176e-06, 8.4664e-01, 4.7693e-03, 1.0000e-08,
        1.0000e-08, 1.9446e-06, 7.9586e-05], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
2020-12-08 00:13:14,823 wetectron.trainer INFO: eta: 1 day, 5:40:51  iter: 200  loss: 5.6391 (16.8278)  loss_img: 0.9144 (0.5902)  loss_ref_cls0: 0.0000 (0.0021)  loss_ref_reg0: 0.0000 (0.0008)  loss_ref_cls1: 0.0425 (0.1522)  loss_ref_reg1: 0.0017 (0.0262)  loss_ref_cls2: 0.0000 (14.2085)  loss_ref_reg2: 0.0000 (1.8478)  acc_img: 0.0000 (0.2213)  acc_ref0: 0.0000 (0.0704)  acc_ref1: 0.0000 (0.2392)  acc_ref2: 0.0000 (0.2625)  time: 0.4264 (0.4456)  data: 0.0115 (0.0117)  lr: 0.004167  max mem: 4047
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e+00, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       device='cuda:0', grad_fn=<ClampBackward>)
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [3,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [4,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [5,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [6,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [7,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [8,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [9,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [10,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [11,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [12,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [13,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [14,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [15,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [16,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [17,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [18,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [19,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [20,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "tools/train_net.py", line 301, in <module>
    main()
  File "tools/train_net.py", line 280, in main
    use_tensorboard=args.use_tensorboard
  File "tools/train_net.py", line 92, in train
    meters
  File "/home/unnc/Desktop/sota/wetectron/wetectron/engine/trainer.py", line 94, in do_train
    loss_dict, metrics = model(images, targets, rois)
  File "/home/unnc/anaconda3/envs/wetectron1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/unnc/anaconda3/envs/wetectron1.7/lib/python3.7/site-packages/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/home/unnc/Desktop/sota/wetectron/wetectron/modeling/detector/generalized_rcnn.py", line 61, in forward
    x, result, detector_losses, accuracy = self.roi_heads(features, proposals, targets, model_cdb)
  File "/home/unnc/anaconda3/envs/wetectron1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/unnc/Desktop/sota/wetectron/wetectron/modeling/roi_heads/weak_head/weak_head.py", line 106, in forward
    loss_img, accuracy_img = self.loss_evaluator([cls_score], [det_score], ref_scores, ref_bbox_preds, proposals, targets)
  File "/home/unnc/Desktop/sota/wetectron/wetectron/modeling/roi_heads/weak_head/loss.py", line 254, in __call__
    return_loss_dict['loss_img'] += F.binary_cross_entropy(img_score_per_im, labels_per_im.clamp(0, 1))
  File "/home/unnc/anaconda3/envs/wetectron1.7/lib/python3.7/site-packages/torch/nn/functional.py", line 2526, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
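
For anyone debugging the same assert: the failing call is the F.binary_cross_entropy(img_score_per_im, labels_per_im.clamp(0, 1)) line shown in the traceback, and the tensors printed above show img_score_per_im turning into nan right before the crash (clamp passes NaNs through, so clamping alone cannot prevent the assert). Below is a minimal sketch of the kind of defensive check that turns the opaque device-side assert into a readable Python error. The variable names are taken from the traceback above; this is only a debugging aid, not the repository's actual fix.

```python
import torch
import torch.nn.functional as F

def checked_bce(img_score_per_im, labels_per_im, eps=1e-8):
    # Fail fast with a readable Python error instead of a CUDA device-side assert.
    if not torch.isfinite(img_score_per_im).all():
        raise FloatingPointError(
            f"img_score_per_im contains NaN/Inf: {img_score_per_im}")
    # clamp() propagates NaN, so it only bounds finite values into [eps, 1 - eps];
    # the explicit isfinite check above is what actually catches the bad batch.
    score = img_score_per_im.clamp(eps, 1.0 - eps)
    return F.binary_cross_entropy(score, labels_per_im.clamp(0, 1))
```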

@xiaaoo-zz
Author

I found that lowering BASE_LR to 0.001 helps, but I saw this comment saying the learning rate does not need to be changed in the config if I only set ITER_SIZE to 8 for single-GPU training.

@jason718
Contributor

jason718 commented Dec 7, 2020

Honestly, we don't recommend single-GPU training. First, it's unstable, as you have spotted. Second, the performance has not been verified, especially after the learning-rate drop.

As you mentioned, this happened early in training, so maybe try a longer warm-up? Note that your log shows the issue appearing at iteration 180, while the default WARMUP_ITERS is just 200.
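
For concreteness, a linear warm-up of the kind being suggested looks roughly like the sketch below. This is a generic illustration, not wetectron's exact scheduler; the warm-up factor of 1/3 and the linear ramp are assumptions.

```python
def warmup_lr(iteration: int, base_lr: float, warmup_iters: int,
              warmup_factor: float = 1.0 / 3) -> float:
    """Return the learning rate at `iteration` under a generic linear warm-up."""
    if iteration >= warmup_iters:
        return base_lr  # warm-up finished: run at the full base LR
    alpha = iteration / warmup_iters
    # Ramp linearly from base_lr * warmup_factor up to base_lr over warmup_iters steps.
    return base_lr * (warmup_factor * (1.0 - alpha) + alpha)
```

Raising WARMUP_ITERS simply keeps the learning rate below base_lr for more iterations, so the unstable first few hundred steps see a smaller LR.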

@xiaaoo-zz
Author

xiaaoo-zz commented Dec 7, 2020

Thanks for your help. I tested a wide range of values for WARMUP_ITERS, but the error still occurs between 100 and 400 steps. I think it's more related to BASE_LR, but I'm not sure whether I can reproduce the reported results with a lower BASE_LR, say 0.00125 (0.01 / 8).
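
(For reference, the 0.00125 comes from the linear scaling rule, i.e. LR proportional to the effective batch size. A small sketch with the numbers from this thread; the reference effective batch of 8 is an assumption based on the discussion above, not a documented value.)

```python
reference_lr = 0.01     # BASE_LR from the config
reference_batch = 8     # assumed effective batch size the config was tuned for

# Treating a single-GPU run with IMS_PER_BATCH 1 as a true batch of 1:
scaled_lr = reference_lr * 1 / reference_batch
print(scaled_lr)        # 0.00125

# With ITER_SIZE 8 the effective batch stays at 1 * 8 = 8, which is why the
# comment linked above says BASE_LR should not need to change in that setup.
```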

@jason718
Contributor

jason718 commented Dec 8, 2020

Probably not. BASE_LR is pretty crucial to performance.

@jinhseo

jinhseo commented Dec 9, 2020

@xiaaoo Does the warmup factor / BASE_LR actually solve this problem?
The same issue happened to me. The error seems to be related to the ITER_SIZE argument added by @bradezard131.

I'm not very familiar with the ITER_SIZE argument either, but adding ITER_SIZE triggers this error in my environment.

My workaround is to increase IMS_PER_BATCH instead of using ITER_SIZE, e.g. IMS_PER_BATCH 4 rather than IMS_PER_BATCH 1 with ITER_SIZE 4.

The setup IMS_PER_BATCH 4, BASE_LR 0.01, STEPS (0, 25000) reproduces mAP 42.22 (the OICR baseline) at the last checkpoint. My conclusion still isn't definitive, though; any ideas or shared configs from other users would be helpful.

@xiaaoo-zz
Author

@tohoaa I am still waiting for the results of training with a lower BASE_LR. We still need to increase ITER_SIZE when a single GPU can't handle a larger batch size. Have you tried to reproduce the MIST results?

@bradezard131 commented that BS=1 with ITER_SIZE=8 can give ~50, and it seems no other config changes are needed.

@bradezard131
Contributor

I have not had the issues you are describing; for me it works fine. If I get time to look at it again, I will. It appears that adding iter-size has hurt more people than it has helped, which was not my intention :(

@xiaaoo-zz
Author

@bradezard131 Your idea has helped me a lot. Without your iter-size PR, it wouldn't even be possible to run this on 1 GPU while maintaining the effective batch size (batch_size * iter_size = 8). I am just curious why this error occurs when using iter_size to approximate the batch_size parameter. Again, thank you for your contribution :)
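
For anyone reading later, ITER_SIZE-style gradient accumulation does roughly what the sketch below shows: gradients from iter_size batches of size 1 are summed before a single optimizer step, so the accumulated gradient matches a larger batch, while the per-step noise and any batch statistics do not, which may be part of why the two setups behave differently. This is a generic sketch under the assumption that the model returns a scalar loss, not the actual wetectron implementation.

```python
def train_with_accumulation(model, optimizer, data_loader, iter_size=8):
    """One optimizer step per iter_size mini-batches (generic sketch)."""
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(data_loader, start=1):
        loss = model(images, targets)    # assumed to return a scalar loss
        (loss / iter_size).backward()    # scale so the sum matches a big-batch mean
        if step % iter_size == 0:
            optimizer.step()
            optimizer.zero_grad()
```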

jason718 closed this as completed Jan 1, 2021