
RuntimeError: CUDA error: device-side assert triggered #69

Open
PhamLeQuangNhat opened this issue Jun 1, 2021 · 15 comments

Comments

@PhamLeQuangNhat

PhamLeQuangNhat commented Jun 1, 2021

When I was training on my own dataset, I modified the following in the config file:

character = 'aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0123456789'
batch_max_length = 25
num_class = len(character) + 1 # num_class = 197
gpu_id='5,7'

and I ran the command: bash tools/dist_train.sh configs/stn_cstr.py 2

then I got the error:

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [24,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [25,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
  File "/home/recognition/vedastr-cstr/tools/train.py", line 49, in <module>
    main()
  File "/home/recognition/vedastr-cstr/tools/train.py", line 45, in main
    runner()
  File "/home/recognition/vedastr-cstr/tools/../vedastr/runners/train_runner.py", line 165, in __call__
    self._train_batch(img, label)
  File "/home/recognition/vedastr-cstr/tools/../vedastr/runners/train_runner.py", line 118, in _train_batch
    loss.backward()
  File "/root/anaconda3/envs/vedastr/lib/python3.9/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/anaconda3/envs/vedastr/lib/python3.9/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered

What might cause this? How can I fix it? Thanks in advance.
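(For anyone comparing configs: a quick standalone sanity check, not vedastr code, of the assumption these settings encode; character and the labels below are placeholders.)

# Standalone sanity check, not vedastr code: the config assumes one class per
# character plus one extra token, so every character appearing in the ground
# truth must also appear in `character`.
character = 'abc0123456789'     # placeholder; use the full string from the config
num_class = len(character) + 1

labels = ['abc012', 'a9b']      # hypothetical ground-truth strings
for label in labels:
    missing = [ch for ch in label if ch not in character]
    assert not missing, f'characters {missing} are not in the character set'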

@bharatsubedi

@PhamLeQuangNhat did you solve this? Could you share your solution, please?

@ChaseMonsterAway
Contributor

@bharatsubedi
Did you meet the same problem? Can you tell me which parts of the config file you have modified?

@ChaseMonsterAway
Contributor

ChaseMonsterAway commented Jun 4, 2021

@PhamLeQuangNhat
I guess this is caused by a wrong class index. Can you tell me which parts of the config file you have modified? I will check it with your settings.

@bharatsubedi

bharatsubedi commented Jun 4, 2021

I am using a Korean dataset for training and I only modified the character set. What do you mean by a wrong class index? Could you explain, please? Special characters are also included in my data.

@ChaseMonsterAway
Contributor


For example, if the channel dimension of your logits is C and your ground truth contains an index outside the range [0, C-1], computing the cross-entropy loss with those logits and targets will cause errors like t >= 0 && t < n_classes.
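A minimal standalone PyTorch sketch (not the vedastr code) that reproduces this: with C output classes, any target index outside [0, C-1] trips the same assert on GPU (and an IndexError on CPU).

import torch
import torch.nn as nn

num_classes = 197                           # channel dimension C of the logits
logits = torch.randn(4, num_classes)        # (batch, C)
targets = torch.tensor([3, 10, 196, 197])   # 197 is out of range for C = 197

criterion = nn.CrossEntropyLoss()
# On CPU this raises an IndexError; on CUDA it surfaces as the
# "device-side assert triggered" from ClassNLLCriterion shown above.
loss = criterion(logits, targets)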

@bharatsubedi

bharatsubedi commented Jun 4, 2021

@ChaseMonsterAway Could you please let me know which part of the code I should change to solve that problem? The length of my character set is 1800.

@ChaseMonsterAway
Contributor


I think you should change the ignore index of the criterion in your config file. In your case, I think you should set it to 1801.
I will optimize the config file in the future.
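In plain PyTorch terms (the vedastr config wraps the criterion, so the exact key name there may differ), the suggestion corresponds to something like:

import torch.nn as nn

# Hypothetical numbers matching the case above: 1800 characters plus one
# extra token, so padded label positions use index 1801 and the loss is
# told to skip them.
num_chars = 1800
criterion = nn.CrossEntropyLoss(ignore_index=num_chars + 1)  # ignore_index=1801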

@bharatsubedi

@ChaseMonsterAway After changing the criterion's ignore index to len(character) + 1 the error was gone, but I received another error: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
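For anyone else hitting this, a minimal standalone sketch (not the vedastr runner) of where the flag goes; it assumes the default process group has already been initialized and a CUDA device is available.

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# `model` stands in for the STN + CSTR network here.
model = nn.Linear(10, 10).cuda()

# find_unused_parameters=True lets DDP handle parameters that receive no
# gradient in a given iteration, which is what the error above complains about.
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()],
                find_unused_parameters=True)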

@ChaseMonsterAway
Contributor

@bharatsubedi
Which config did you use?

@bharatsubedi

bharatsubedi commented Jun 4, 2021

@ChaseMonsterAway
I am using the stn_cstr config file. After adding find_unused_parameters=True to DistributedDataParallel there is no error during training, but I get an error during validation:
[screenshot of the validation error]

@PhamLeQuangNhat
Author

PhamLeQuangNhat commented Jun 5, 2021

@ChaseMonsterAway
I tried to add special characters such as !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ in config/cstr.py and trained on my own data. It seemed to run successfully until the model was validated, when it threw this error:

batch_text[i][:len(text)] = torch.LongTensor(text)
RuntimeError: The expanded size of the tensor (26) must match the existing size (27) at non-singleton dimension 0. Target sizes: [26]. Tensor sizes: [27]
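For reference, a standalone sketch (not the vedastr converter) of how that mismatch arises; the buffer width and token count below are just the numbers from the error message:

import torch

# The label buffer is sized from batch_max_length, so a label that encodes
# to more tokens than the buffer holds triggers exactly this size mismatch.
batch_max_length = 25
batch_text = torch.zeros(1, batch_max_length + 1, dtype=torch.long)  # 26 slots per sample

text = list(range(27))  # hypothetical label that encodes to 27 token indices
batch_text[0][:len(text)] = torch.LongTensor(text)
# RuntimeError: The expanded size of the tensor (26) must match the existing size (27) ...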

@bharatsubedi
I met the same 'find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel' problem when running config/stn_cstr.py with data from deep-text-recognition-benchmark. Did you solve this? Could you share your solution?

@bharatsubedi

bharatsubedi commented Jun 5, 2021

@PhamLeQuangNhat
If you add find_unused_parameters=True to the DistributedDataParallel call in inference_runner.py, this error will not happen, but you will hit the problem during validation that you mention. I don't know how to solve that error yet; we have to figure it out and share it with everyone.

@ChaseMonsterAway
Contributor

@bharatsubedi @PhamLeQuangNhat
Hi,
Sorry that you are running into these problems.
I will fix some bugs and make the config file clearer today.
After I test the code successfully, I will update the cstr branch.

@ChaseMonsterAway
Contributor

@bharatsubedi @PhamLeQuangNhat
I have fixed some bugs and made the config file clearer. Please use the latest code on the cstr branch. If you have any other problems, please let me know. Thanks.

@PhamLeQuangNhat
Author

@ChaseMonsterAway Yes, I tried the latest code on the cstr branch. It works well. Thank you very much.
