CUDA error when evaluating RT-DETR trained on custom data #8402
Comments
After some investigation, I found that the trained model is producing NaN as output, which results in invalid index values that in turn lead to the above CUDA error. However, the training looked perfectly normal: there was no Inf or NaN issue during training, and the loss was decreasing gradually. I don't see how a normally trained model can output NaN during inference. Edit: the error does not occur when I train a smaller model.
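A quick way to confirm this kind of failure (a sketch, not the exact code used here; the tensors are placeholder stand-ins for the raw head outputs that feed DETRPostProcess) is to check the network outputs for NaN before post-processing:

```python
import paddle

def report_nan(name: str, tensor: paddle.Tensor) -> None:
    # Flag tensors that already contain NaN before they reach post-processing.
    if bool(paddle.isnan(tensor).any()):
        print(f"NaN detected in '{name}'")

# Placeholder stand-ins; in practice, pass the real pred_logits / pred_boxes.
logits = paddle.rand([1, 300, 80])
boxes = paddle.rand([1, 300, 4])
report_nan("pred_logits", logits)
report_nan("pred_boxes", boxes)
```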
I also encountered the same issue. It seems to come from `paddle.gather_nd`: the failing line should be line 542 in post_process.py, where the "index" contains bad values.
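For reference, here is a minimal sketch (not the PaddleDetection code) of how an out-of-range index handed to `paddle.gather_nd` surfaces as an opaque CUDA error:

```python
import paddle

# On GPU, an index beyond the tensor's extent typically triggers a
# device-side assert, which the runtime reports as a generic CUDA error
# rather than a readable out-of-range message.
x = paddle.rand([300, 4])
bad_idx = paddle.to_tensor([[299], [300]])  # 300 is out of range for axis 0
y = paddle.gather_nd(x, bad_idx)
```

NaN scores can push the computed indices out of range in exactly this way.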
I tried setting snapshot_epoch to 20, and rtdetr_r50vd is normal now.
Hi @HansenLYX0708, I also encountered the same problem when training rtdetr_r50vd on COCO. Did you mean that after training for some epochs, rtdetr_r50vd produces normal performance?
I had tried the same strategy (training without evaluation for more epochs), but that didn't solve the problem for me. I had stopped training just before 20 epochs, though.
@ichbing1 I found that this error occurs because the top-k scores are NaN when the DETRPostProcess class is called, but in my debugging it is only called during evaluation. Do you see the same error during training?
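One defensive workaround (not a fix proposed in this thread, and it only hides the symptom rather than removing the NaN source) would be to sanitize the scores before the top-k selection:

```python
import paddle

# Replace NaN scores with a large negative value so paddle.topk can
# never select them as winners.
scores = paddle.concat([paddle.full([5], float("nan")),
                        paddle.rand([300 * 80 - 5])])
scores = paddle.where(paddle.isnan(scores),
                      paddle.full_like(scores, -1e9),
                      scores)
_, topk_idx = paddle.topk(scores, k=100)  # indices now stay in range
```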
@nijkah I only have one GPU, but the README suggests using 4 GPUs, so I haven't reached the reference performance yet. I can only say that setting a larger evaluation interval avoids hitting NaN during evaluation. Here is my initial result on custom data; it seems a little hard to train with one GPU: after 72 epochs, the loss is 20.8.
I think you should adapt the learning rate according to the total batch size, e.g. by scaling it linearly (see the sketch below).
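The concrete example is cut off above; a common convention (an assumption here, not spelled out in the thread) is the linear scaling rule, where the reference learning rate is scaled by the ratio of your total batch size to the reference total batch size:

```python
# Linear scaling rule sketch. The reference numbers are illustrative
# assumptions, not the exact values from rtdetr_r50vd.yml.
base_lr = 1e-4          # LR tuned for the reference setup
ref_total_bs = 4 * 4    # 4 GPUs x per-GPU batch size 4
my_total_bs = 1 * 4     # 1 GPU  x per-GPU batch size 4
scaled_lr = base_lr * my_total_bs / ref_total_bs
print(scaled_lr)        # 2.5e-05
```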
@lyuwenyu Yes, I'm also seeing the error only during evaluation. And I think I found a solution. I backtraced the NaN output of the network to see where the values start to diverge. It seems that […]. Then I noticed that […]. Now when I train […]
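The details of that backtrace are lost above, but the technique it describes can be sketched as follows (my sketch, assuming a `paddle.nn.Layer` model): register a forward hook on every sublayer and report any whose output contains NaN; the first name printed is where the values start to diverge.

```python
import paddle

def add_nan_hooks(model: paddle.nn.Layer) -> None:
    # Print the name of every sublayer whose output contains NaN.
    def make_hook(name):
        def hook(layer, inputs, output):
            outs = output if isinstance(output, (list, tuple)) else (output,)
            for o in outs:
                if isinstance(o, paddle.Tensor) and bool(paddle.isnan(o).any()):
                    print(f"NaN in output of sublayer: {name}")
        return hook
    for name, sub in model.named_sublayers():
        sub.register_forward_post_hook(make_hook(name))

# Usage sketch: call add_nan_hooks(model) before running evaluation.
```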
ichbing1's answer solved my problem perfectly.
Issue confirmation / Search before asking
Please ask your question
While training RT-DETR on my custom dataset in COCO format, I'm getting the following errors after the first training epoch when it tries to evaluate the model.
I searched for similar questions and found some answers saying it's because of indices exceeding the number of categories, but I double-checked and my dataset has no such problem (see the check sketched below). I also specified the new number of classes in the data configuration file `configs/datasets/my_dataset.yml`. Just to make sure, I also tested training with the validation data (which had caused errors during evaluation), and confirmed that training proceeds without errors. Likewise, I tested evaluating with the training data, and got the same errors. Also, I get the same error when running inference.
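For reference, the check can be run like this (a sketch with hypothetical annotation paths; adjust to your dataset):

```python
import json

# Verify that every category_id used in the annotations is declared,
# and see how many categories the dataset actually defines.
with open("annotations/instances_train.json") as f:  # hypothetical path
    coco = json.load(f)

declared = {c["id"] for c in coco["categories"]}
used = {a["category_id"] for a in coco["annotations"]}
print(len(declared), "declared categories; ids used:", sorted(used))
assert used <= declared, "annotations reference undefined category ids"
```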
Am I missing something here? Any suggestions or comments are appreciated. Thanks!
I'm using the PaddleDetection docker image `paddlecloud/paddledetection:2.4-gpu-cuda11.2-cudnn8-e9a542` with `paddlepaddle-gpu==2.4.2.post112` installed (2.4.1 required for RT-DETR). The same issue is observed on a Tesla V100 and an RTX 4080. The model is `rtdetr_r50vd.yml`.