CUDA device-side assert: image scores passed to BCE loss are NaN during early training #24

Closed
xiaaoo-zz opened this issue Dec 7, 2020 · 8 comments

@xiaaoo-zz

Thanks for your amazing work!
I get RuntimeError: CUDA error: device-side assert triggered around ~200 steps into training. The error occurs every time, even after rerunning the program multiple times and, following #22, setting a higher epsilon (I've tried 1e-8 and 1e-6).

This is the command I use for training. I have tried PyTorch 1.6 and 1.7 with CUDA 10.1:

CUDA_LAUNCH_BLOCKING=1 python tools/train_net.py --config-file "configs/voc/V_16_voc07.yaml" --use-tensorboard  \
OUTPUT_DIR output \
SOLVER.IMS_PER_BATCH 1 \
SOLVER.ITER_SIZE 8 \
DB.METHOD none

Here is the log:

2020-12-08 00:13:05,575 wetectron.trainer INFO: eta: 1 day, 5:33:33  iter: 180  loss: 0.4550 (0.6005)  loss_img: 0.2575 (0.2831)  loss_ref_cls0: 0.0003 (0.0011)  loss_ref_reg0: 0.0000 (0.0002)  loss_ref_cls1: 0.1219 (0.1517)  loss_ref_reg1: 0.0277 (0.0274)  loss_ref_cls2: 0.0612 (0.1122)  loss_ref_reg2: 0.0119 (0.0247)  acc_img: 0.0000 (0.2319)  acc_ref0: 0.0000 (0.0690)  acc_ref1: 0.0000 (0.2546)  acc_ref2: 0.0000 (0.2713)  time: 0.4167 (0.4437)  data: 0.0097 (0.0117)  lr: 0.004100  max mem: 4047
tensor([0.0061, 0.0272, 0.0212, 0.0003, 0.0143, 0.0203, 0.0304, 0.2264, 0.0059,
        0.2383, 0.0125, 0.0261, 0.0525, 0.1852, 0.0306, 0.0003, 0.0211, 0.0074,
        0.0092, 0.0183, 0.0403], device='cuda:0', grad_fn=<ClampBackward>)
tensor([3.4809e-03, 2.0176e-02, 1.5387e-02, 3.3157e-05, 7.3178e-03, 2.7314e-02,
        1.9246e-02, 4.2303e-01, 1.8498e-03, 2.4402e-01, 8.2913e-03, 2.3048e-02,
        4.3761e-02, 1.8561e-01, 1.7174e-02, 5.2509e-05, 1.1843e-02, 3.0689e-03,
        5.3479e-03, 8.2327e-03, 2.8905e-02], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([0.0063, 0.0345, 0.0294, 0.0005, 0.0227, 0.0191, 0.0276, 0.1232, 0.0156,
        0.2002, 0.0143, 0.0268, 0.0588, 0.2178, 0.0280, 0.0011, 0.0264, 0.0082,
        0.0126, 0.0231, 0.0392], device='cuda:0', grad_fn=<ClampBackward>)
tensor([4.6997e-03, 2.4954e-02, 1.9726e-02, 8.2294e-05, 1.1063e-02, 2.8040e-02,
        2.3942e-02, 3.1750e-01, 4.0102e-03, 2.3422e-01, 1.1085e-02, 2.5569e-02,
        4.9987e-02, 1.9260e-01, 2.2549e-02, 1.3269e-04, 1.6148e-02, 5.0528e-03,
        7.5706e-03, 1.2965e-02, 3.4329e-02], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 5.4938e-07, 3.5194e-06, 1.3429e-08, 4.6343e-08, 1.0000e-08,
        1.0000e-08, 3.2079e-05, 1.9538e-08, 4.8927e-03, 9.9997e-05, 1.4685e-07,
        3.2431e-01, 7.7275e-02, 1.0000e-08, 9.3841e-01, 1.4319e-03, 1.0000e-08,
        1.0000e-08, 1.9370e-08, 1.7341e-06], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 2.8905e-08, 9.4311e-06, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 6.9164e-04, 1.0000e-08, 1.7446e-02, 2.6333e-08, 2.8428e-07,
        1.0017e-01, 7.1074e-02, 1.0000e-08, 9.8742e-01, 1.4429e-06, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 6.2333e-07], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 5.2896e-06, 6.3474e-07, 8.0780e-07, 4.2772e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 2.0732e-02, 1.0000e-08, 2.8691e-08,
        2.5120e-01, 1.0660e-02, 1.0000e-08, 8.6331e-01, 2.0186e-03, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 3.8909e-07], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 5.0810e-07, 1.7225e-06, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.9051e-07, 2.6197e-02, 5.0034e-05, 1.4151e-07,
        4.4598e-01, 7.7825e-02, 1.0000e-08, 2.2080e-01, 9.6862e-03, 1.0000e-08,
        1.0000e-08, 2.3176e-08, 1.8984e-06], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 2.9017e-08, 1.0000e-08, 1.0000e-08, 4.3676e-07,
        1.0000e-08, 9.8482e-01, 1.0000e-08, 2.6416e-03, 1.7355e-06, 5.0040e-08,
        3.8472e-01, 9.1854e-03, 1.0000e-08, 9.9960e-01, 5.8315e-05, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 4.1070e-07], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 3.2842e-06, 2.7481e-06, 1.5629e-07, 3.9651e-06, 1.0004e-07,
        9.6509e-08, 1.3372e-04, 1.6091e-08, 1.6981e-02, 2.7259e-04, 1.5076e-05,
        1.5756e-01, 6.3610e-02, 3.7470e-07, 9.4090e-01, 2.4577e-04, 1.0000e-08,
        1.0000e-08, 4.2767e-08, 1.5077e-04], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 7.5319e-04, 1.0000e-08, 1.0000e-08,
        1.7518e-03, 1.7722e-01, 1.0000e-08, 9.8997e-01, 6.3139e-02, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 4.1463e-05, 5.7913e-04, 1.0175e-05, 8.5911e-06, 1.0000e-08,
        2.1342e-07, 6.7830e-02, 1.5353e-06, 2.2693e-02, 1.1492e-07, 1.2851e-05,
        3.3217e-01, 1.1930e-01, 4.0176e-06, 8.4664e-01, 4.7693e-03, 1.0000e-08,
        1.0000e-08, 1.9446e-06, 7.9586e-05], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
2020-12-08 00:13:14,823 wetectron.trainer INFO: eta: 1 day, 5:40:51  iter: 200  loss: 5.6391 (16.8278)  loss_img: 0.9144 (0.5902)  loss_ref_cls0: 0.0000 (0.0021)  loss_ref_reg0: 0.0000 (0.0008)  loss_ref_cls1: 0.0425 (0.1522)  loss_ref_reg1: 0.0017 (0.0262)  loss_ref_cls2: 0.0000 (14.2085)  loss_ref_reg2: 0.0000 (1.8478)  acc_img: 0.0000 (0.2213)  acc_ref0: 0.0000 (0.0704)  acc_ref1: 0.0000 (0.2392)  acc_ref2: 0.0000 (0.2625)  time: 0.4264 (0.4456)  data: 0.0115 (0.0117)  lr: 0.004167  max mem: 4047
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e+00, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e-08,
        1.0000e-08, 1.0000e-08, 1.0000e-08, 1.0000e+00, 1.0000e-08, 1.0000e+00,
        1.0000e-08, 1.0000e-08, 1.0000e-08], device='cuda:0',
       grad_fn=<ClampBackward>)
tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
       device='cuda:0', grad_fn=<ClampBackward>)
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [3,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [4,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [5,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [6,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [7,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [8,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [9,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [10,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [11,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [12,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [13,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [14,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [15,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [16,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [17,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [18,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [19,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [0,0,0], thread: [20,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "tools/train_net.py", line 301, in <module>
    main()
  File "tools/train_net.py", line 280, in main
    use_tensorboard=args.use_tensorboard
  File "tools/train_net.py", line 92, in train
    meters
  File "/home/unnc/Desktop/sota/wetectron/wetectron/engine/trainer.py", line 94, in do_train
    loss_dict, metrics = model(images, targets, rois)
  File "/home/unnc/anaconda3/envs/wetectron1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/unnc/anaconda3/envs/wetectron1.7/lib/python3.7/site-packages/apex/amp/_initialize.py", line 197, in new_fwd
    **applier(kwargs, input_caster))
  File "/home/unnc/Desktop/sota/wetectron/wetectron/modeling/detector/generalized_rcnn.py", line 61, in forward
    x, result, detector_losses, accuracy = self.roi_heads(features, proposals, targets, model_cdb)
  File "/home/unnc/anaconda3/envs/wetectron1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/unnc/Desktop/sota/wetectron/wetectron/modeling/roi_heads/weak_head/weak_head.py", line 106, in forward
    loss_img, accuracy_img = self.loss_evaluator([cls_score], [det_score], ref_scores, ref_bbox_preds, proposals, targets)
  File "/home/unnc/Desktop/sota/wetectron/wetectron/modeling/roi_heads/weak_head/loss.py", line 254, in __call__
    return_loss_dict['loss_img'] += F.binary_cross_entropy(img_score_per_im, labels_per_im.clamp(0, 1))
  File "/home/unnc/anaconda3/envs/wetectron1.7/lib/python3.7/site-packages/torch/nn/functional.py", line 2526, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
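
For anyone debugging the same assert: the failing call is the F.binary_cross_entropy(img_score_per_im, labels_per_im.clamp(0, 1)) line shown in the traceback, and the tensors printed above show img_score_per_im turning into nan right before the crash (clamp passes NaNs through, so clamping alone cannot prevent the assert). Below is a minimal sketch of the kind of defensive check that turns the opaque device-side assert into a readable Python error. The variable names are taken from the traceback above; this is only a debugging aid, not the repository's actual fix.

```python
import torch
import torch.nn.functional as F

def checked_bce(img_score_per_im, labels_per_im, eps=1e-8):
    # Fail fast with a readable Python error instead of a CUDA device-side assert.
    if not torch.isfinite(img_score_per_im).all():
        raise FloatingPointError(
            f"img_score_per_im contains NaN/Inf: {img_score_per_im}")
    # clamp() propagates NaN, so it only bounds finite values into [eps, 1 - eps];
    # the explicit isfinite check above is what actually catches the bad batch.
    score = img_score_per_im.clamp(eps, 1.0 - eps)
    return F.binary_cross_entropy(score, labels_per_im.clamp(0, 1))
```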

@xiaaoo-zz
Author

I found that lowering BASE_LR to 0.001 helps, but I saw this comment saying the learning rate does not need to be changed in the config if I only set ITER_SIZE to 8 for single-GPU training.

@jason718
Contributor

jason718 commented Dec 7, 2020

Honestly, we don't recommend single-GPU training. First, it's unstable, as you have spotted. Second, the performance has not been verified, especially after the learning-rate drop.

As you mentioned, this happened early in training, so maybe try a longer warm-up? Note that your log shows the issue appearing at iteration 180, while the default WARMUP_ITERS is just 200.
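
For concreteness, a linear warm-up of the kind being suggested looks roughly like the sketch below. This is a generic illustration, not wetectron's exact scheduler; the warm-up factor of 1/3 and the linear ramp are assumptions.

```python
def warmup_lr(iteration: int, base_lr: float, warmup_iters: int,
              warmup_factor: float = 1.0 / 3) -> float:
    """Return the learning rate at `iteration` under a generic linear warm-up."""
    if iteration >= warmup_iters:
        return base_lr  # warm-up finished: run at the full base LR
    alpha = iteration / warmup_iters
    # Ramp linearly from base_lr * warmup_factor up to base_lr over warmup_iters steps.
    return base_lr * (warmup_factor * (1.0 - alpha) + alpha)
```

Raising WARMUP_ITERS simply keeps the learning rate below base_lr for more iterations, so the unstable first few hundred steps see a smaller LR.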

@xiaaoo-zz
Author

xiaaoo-zz commented Dec 7, 2020

Thanks for your help. I tested a wide range of values for WARMUP_ITERS, but the error still occurs between 100 and 400 steps. I think it's more related to BASE_LR, but I'm not sure whether I can reproduce the reported results with a lower BASE_LR, say 0.00125 (0.01 / 8).
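
(For reference, the 0.00125 comes from the linear scaling rule, i.e. LR proportional to the effective batch size. A small sketch with the numbers from this thread; the reference effective batch of 8 is an assumption based on the discussion above, not a documented value.)

```python
reference_lr = 0.01     # BASE_LR from the config
reference_batch = 8     # assumed effective batch size the config was tuned for

# Treating a single-GPU run with IMS_PER_BATCH 1 as a true batch of 1:
scaled_lr = reference_lr * 1 / reference_batch
print(scaled_lr)        # 0.00125

# With ITER_SIZE 8 the effective batch stays at 1 * 8 = 8, which is why the
# comment linked above says BASE_LR should not need to change in that setup.
```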

@jason718
Contributor

jason718 commented Dec 8, 2020

Probably not. BASE_LR is pretty crucial to performance.

@jinhseo

jinhseo commented Dec 9, 2020

@xiaaoo Does the warmup factor / BASE_LR actually solve this problem?
The same issue happened to me. The error seems to be related to the ITER_SIZE argument added by @bradezard131.

I'm not very familiar with the ITER_SIZE argument either, but adding ITER_SIZE triggers this error in my environment.

My workaround is to increase IMS_PER_BATCH instead of using ITER_SIZE, e.g. IMS_PER_BATCH 4 rather than IMS_PER_BATCH 1 with ITER_SIZE 4.

The setup IMS_PER_BATCH 4, BASE_LR 0.01, STEPS (0, 25000) reproduces mAP 42.22 (the OICR baseline) at the last checkpoint. My conclusion still isn't definitive, though; any ideas or shared configs from other users would be helpful.

@xiaaoo-zz
Author

@tohoaa I am still waiting for the results of training with a lower BASE_LR. We still need to increase ITER_SIZE when a single GPU can't handle a larger batch size. Have you tried to reproduce the MIST results?

@bradezard131 commented that BS=1 with ITER_SIZE=8 can give ~50, and it seems no other config changes are needed.

@bradezard131
Contributor

I have not had the issues you are describing; for me it works fine. If I get time to look at it again, I will. It appears that adding iter-size has hurt more people than it has helped, which was not my intention :(

@xiaaoo-zz
Author

@bradezard131 Your idea has helped me a lot. Without your iter-size PR, it wouldn't even be possible to run this on 1 GPU while maintaining the effective batch size (batch_size * iter_size = 8). I am just curious why this error occurs when using iter_size to approximate the batch_size parameter. Again, thank you for your contribution :)
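
For anyone reading later, ITER_SIZE-style gradient accumulation does roughly what the sketch below shows: gradients from iter_size batches of size 1 are summed before a single optimizer step, so the accumulated gradient matches a larger batch, while the per-step noise and any batch statistics do not, which may be part of why the two setups behave differently. This is a generic sketch under the assumption that the model returns a scalar loss, not the actual wetectron implementation.

```python
def train_with_accumulation(model, optimizer, data_loader, iter_size=8):
    """One optimizer step per iter_size mini-batches (generic sketch)."""
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(data_loader, start=1):
        loss = model(images, targets)    # assumed to return a scalar loss
        (loss / iter_size).backward()    # scale so the sum matches a big-batch mean
        if step % iter_size == 0:
            optimizer.step()
            optimizer.zero_grad()
```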

jason718 closed this as completed Jan 1, 2021