
Error during training (Assertion input_val >= zero && input_val <= one failed.) #813

Closed
cena001plus opened this issue Oct 20, 2021 · 7 comments

Comments

@cena001plus

Thank you for your contribution. I have run into a problem while using this project and need some suggestions. I am training yolox-tiny on my own VOC data with batch_size 32, 2 GPUs, and img_size 224x224. The error below occurs when training reaches around epoch 30-40. I don't think it is a memory overflow problem, because my images are very small; when I trained yolox-s with a larger batch_size and a larger image size, this problem did not occur:

/pytorch/aten/src/ATen/native/cuda/Loss.cu:111: operator(): block: [62,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:111: operator(): block: [62,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:111: operator(): block: [62,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.
... (the same assertion repeats for every thread up to [63,0,0]) ...
/pytorch/aten/src/ATen/native/cuda/Loss.cu:111: operator(): block: [62,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.
[W CUDAGuardImpl.h:112] Warning: CUDA warning: device-side assert triggered (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f443a744a22 in /home/ailab/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x10983 (0x7f443a9a5983 in /home/ailab/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f443a9a7027 in /home/ailab/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f443a72e5a4 in /home/ailab/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7f44915f7199 in /home/ailab/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x276 (0x7f44915edbc6 in /home/ailab/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f449161d882 in /home/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

Traceback (most recent call last):
  File "train.py", line 135, in <module>
    args=(exp, args),
  File "/media/E/yolox/core/launch.py", line 95, in launch
    start_method=start_method,
  File "/home/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 136, in join
    signal_name=name
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT

frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f4490d675c6 in /home/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_ptr<c10d::Logger*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x1d (0x7f449162259d in /home/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f4490d675c6 in /home/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0xdaf07f (0x7f449162007f in /home/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #11: + 0x4ff188 (0x7f4490d70188 in /home/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x50048e (0x7f4490d7148e in /home/anaconda3/envs/yolox/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #13: + 0xfc197 (0x557df8e1e197 in /home/anaconda3/envs/yolox/bin/python)
frame #14: + 0x1817b6 (0x557df8ea37b6 in /home/anaconda3/envs/yolox/bin/python)
frame #15: + 0xfc1

@cena001plus changed the title from "Error during training" to "Error during training (Assertion input_val >= zero && input_val <= one failed.)" on Oct 20, 2021
@cena001plus
Author

@Joker316701882

@charles-str

@Joker316701882

I am also training yolox-tiny on my own dataset, and I have not encountered this problem.

@cena001plus
Author

I want to know what this error message means. I hope someone can help me interpret it and find the cause. @charles-str @Joker316701882

@charles-str

I want to know what this error message means. I hope someone can help me interpret it and find the cause. @charles-str @Joker316701882

Is your PyTorch version the same as the author's?

@cena001plus
Author

cena001plus commented Oct 20, 2021

yolox-s is OK; I have not encountered this problem with yolox-s, only an issue when the data augmentation is turned off.
Assertion `input_val >= zero && input_val <= one` failed. If it were a problem with input_val itself, it should go wrong in the first epoch. Why does it only go wrong after dozens of epochs? I don't know why.
#748 #147 #159

@DacDinh147

DacDinh147 commented Oct 21, 2021

@cena001plus I got the same problem. I tried setting a lower learning rate to avoid it. I found that it happens because the model outputs NaN values in the prediction head. You can print the values of bbox_preds or obj_preds, or use a torch.isnan(x).sum().item() check, to locate the error. I passed !export CUDA_LAUNCH_BLOCKING=1; python train.py ... to debug this. Hope this helps. I do not know why, but the model is quite unstable; they should normalize it somewhere in the prediction head to keep it more stable.
ps: This logging error tells you that the NaN values in the prediction head output cannot be used to compute the BCE in this excerpt:

with torch.cuda.amp.autocast(enabled=False):
    cls_preds_ = (
        cls_preds_.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
        * obj_preds_.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
    )
    pair_wise_cls_loss = F.binary_cross_entropy(
        cls_preds_.sqrt_(), gt_cls_per_image, reduction="none"
    ).sum(-1)
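
For reference, a minimal NaN/Inf check along these lines might look like the sketch below. The helper name assert_finite is made up for illustration, and placing the check on cls_preds_ and obj_preds_ right before the binary_cross_entropy call in the excerpt above is an assumption, not something from the official YOLOX code:

import torch

def assert_finite(name, t):
    # Count NaN/Inf entries and fail with a readable message, so the bad batch
    # is reported before the device-side assert inside binary_cross_entropy fires.
    nan_count = torch.isnan(t).sum().item()
    inf_count = torch.isinf(t).sum().item()
    if nan_count or inf_count:
        raise RuntimeError(f"{name} has {nan_count} NaN and {inf_count} Inf values")

# hypothetical usage just before the F.binary_cross_entropy call shown above:
# assert_finite("cls_preds_", cls_preds_)
# assert_finite("obj_preds_", obj_preds_)

Combined with CUDA_LAUNCH_BLOCKING=1, this makes the failing step show up as a Python-side traceback instead of an asynchronous CUDA abort.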

@cena001plus
Author

@cena001plus I got the same problem. I tried setting a lower learning rate to avoid it. I found that it happens because the model outputs NaN values in the prediction head. You can print the values of bbox_preds or obj_preds, or use a torch.isnan(x).sum().item() check, to locate the error. I passed !export CUDA_LAUNCH_BLOCKING=1; python train.py ... to debug this. Hope this helps. I do not know why, but the model is quite unstable; they should normalize it somewhere in the prediction head to keep it more stable. ps: This logging error tells you that the NaN values in the prediction head output cannot be used to compute the BCE in this excerpt:

with torch.cuda.amp.autocast(enabled=False):
    cls_preds_ = (
        cls_preds_.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
        * obj_preds_.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
    )
    pair_wise_cls_loss = F.binary_cross_entropy(
        cls_preds_.sqrt_(), gt_cls_per_image, reduction="none"
    ).sum(-1)

I tried reducing the learning rate, and the problem is solved. Thank you very much.
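
For anyone hitting the same issue, the learning rate is set in the experiment file. A rough sketch is below; it uses the attribute names from YOLOX's base Exp class (basic_lr_per_img, warmup_epochs, depth, width), but the file path and every concrete number here are assumptions for illustration, so adjust them to your own setup and YOLOX version:

# exps/example/yolox_tiny_my_voc.py  (hypothetical experiment file)
from yolox.exp import Exp as MyExp

class Exp(MyExp):
    def __init__(self):
        super().__init__()
        self.depth = 0.33       # yolox-tiny depth multiplier
        self.width = 0.375      # yolox-tiny width multiplier
        # YOLOX scales the effective LR as basic_lr_per_img * total batch size,
        # so halving basic_lr_per_img (default 0.01 / 64.0) halves the learning rate.
        self.basic_lr_per_img = 0.005 / 64.0
        self.warmup_epochs = 8  # longer than the usual default of 5, for extra early-training stability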
