
RuntimeError: CUDA error: device-side assert triggered #69

Open
PhamLeQuangNhat opened this issue Jun 1, 2021 · 15 comments

Comments

@PhamLeQuangNhat

PhamLeQuangNhat commented Jun 1, 2021

When I was training on my own dataset, I modified the following in the config file:

character = 'aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0123456789'
batch_max_length = 25
num_class = len(character) + 1 # num_class = 197
gpu_id='5,7'

and I ran the command: bash tools/dist_train.sh configs/stn_cstr.py 2

then I got the error:

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [24,0,0] Assertion t >= 0 && t < n_classes failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [25,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
  File "/home/recognition/vedastr-cstr/tools/train.py", line 49, in <module>
    main()
  File "/home/recognition/vedastr-cstr/tools/train.py", line 45, in main
    runner()
  File "/home/recognition/vedastr-cstr/tools/../vedastr/runners/train_runner.py", line 165, in __call__
    self._train_batch(img, label)
  File "/home/recognition/vedastr-cstr/tools/../vedastr/runners/train_runner.py", line 118, in _train_batch
    loss.backward()
  File "/root/anaconda3/envs/vedastr/lib/python3.9/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/anaconda3/envs/vedastr/lib/python3.9/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered

What might cause this? How can I fix it? Thanks in advance.
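(For anyone comparing configs: a quick standalone sanity check, not vedastr code, of the assumption these settings encode; character and the labels below are placeholders.)

# Standalone sanity check, not vedastr code: the config assumes one class per
# character plus one extra token, so every character appearing in the ground
# truth must also appear in `character`.
character = 'abc0123456789'     # placeholder; use the full string from the config
num_class = len(character) + 1

labels = ['abc012', 'a9b']      # hypothetical ground-truth strings
for label in labels:
    missing = [ch for ch in label if ch not in character]
    assert not missing, f'characters {missing} are not in the character set'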

@bharatsubedi

@PhamLeQuangNhat did you solve this? Could you share your solution, please?

@ChaseMonsterAway
Contributor

@bharatsubedi
Did you meet the same problem? Can you tell me which parts of the config file you have modified?

@ChaseMonsterAway
Contributor

ChaseMonsterAway commented Jun 4, 2021

@PhamLeQuangNhat
I guess this is caused by a wrong class index. Can you tell me which parts of the config file you have modified? I will check it with your settings.

@bharatsubedi

bharatsubedi commented Jun 4, 2021

I am using a Korean dataset for training and I only modified the character set. What do you mean by a wrong class index? Could you explain, please? Special characters are also included in my data.

@ChaseMonsterAway
Contributor


For example, if the channel dimension of your logits is C and your ground truth contains an index outside the range [0, C-1], computing the cross-entropy loss with those logits and targets will cause errors like t >= 0 && t < n_classes.
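A minimal standalone PyTorch sketch (not the vedastr code) that reproduces this: with C output classes, any target index outside [0, C-1] trips the same assert on GPU (and an IndexError on CPU).

import torch
import torch.nn as nn

num_classes = 197                           # channel dimension C of the logits
logits = torch.randn(4, num_classes)        # (batch, C)
targets = torch.tensor([3, 10, 196, 197])   # 197 is out of range for C = 197

criterion = nn.CrossEntropyLoss()
# On CPU this raises an IndexError; on CUDA it surfaces as the
# "device-side assert triggered" from ClassNLLCriterion shown above.
loss = criterion(logits, targets)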

@bharatsubedi

bharatsubedi commented Jun 4, 2021

@ChaseMonsterAway Could you please let me know which part of the code I should change to solve that problem? The length of my character set is 1800.

@ChaseMonsterAway
Contributor


I think you should change the ignore index of the criterion in your config file. In your case, I think you should set it to 1801.
I will optimize the config file in the future.
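In plain PyTorch terms (the vedastr config wraps the criterion, so the exact key name there may differ), the suggestion corresponds to something like:

import torch.nn as nn

# Hypothetical numbers matching the case above: 1800 characters plus one
# extra token, so padded label positions use index 1801 and the loss is
# told to skip them.
num_chars = 1800
criterion = nn.CrossEntropyLoss(ignore_index=num_chars + 1)  # ignore_index=1801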

@bharatsubedi

@ChaseMonsterAway After changing the criterion's ignore index to len(character) + 1 the error was gone, but I received another error: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
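For anyone else hitting this, a minimal standalone sketch (not the vedastr runner) of where the flag goes; it assumes the default process group has already been initialized and a CUDA device is available.

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# `model` stands in for the STN + CSTR network here.
model = nn.Linear(10, 10).cuda()

# find_unused_parameters=True lets DDP handle parameters that receive no
# gradient in a given iteration, which is what the error above complains about.
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()],
                find_unused_parameters=True)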

@ChaseMonsterAway
Contributor

@bharatsubedi
Which config did you use?

@bharatsubedi

bharatsubedi commented Jun 4, 2021

@ChaseMonsterAway
I am using the stn_cstr config file. After adding find_unused_parameters=True to DistributedDataParallel there is no error during training, but I get an error during validation:
[screenshot of the validation error]

@PhamLeQuangNhat
Author

PhamLeQuangNhat commented Jun 5, 2021

@ChaseMonsterAway
I tried to add special characters such as !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ in config/cstr.py and trained on my own data. It seemed to run successfully until the model was validated, when it threw this error:

batch_text[i][:len(text)] = torch.LongTensor(text)
RuntimeError: The expanded size of the tensor (26) must match the existing size (27) at non-singleton dimension 0. Target sizes: [26]. Tensor sizes: [27]
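For reference, a standalone sketch (not the vedastr converter) of how that mismatch arises; the buffer width and token count below are just the numbers from the error message:

import torch

# The label buffer is sized from batch_max_length, so a label that encodes
# to more tokens than the buffer holds triggers exactly this size mismatch.
batch_max_length = 25
batch_text = torch.zeros(1, batch_max_length + 1, dtype=torch.long)  # 26 slots per sample

text = list(range(27))  # hypothetical label that encodes to 27 token indices
batch_text[0][:len(text)] = torch.LongTensor(text)
# RuntimeError: The expanded size of the tensor (26) must match the existing size (27) ...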

@bharatsubedi
I met the same 'find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel' problem when running config/stn_cstr.py with data from deep-text-recognition-benchmark. Did you solve this? Could you share your solution?

@bharatsubedi

bharatsubedi commented Jun 5, 2021

@PhamLeQuangNhat
If you add find_unused_parameters=True to the DistributedDataParallel call in inference_runner.py, this error will not happen, but you will hit the problem during validation that you mention. I don't know how to solve that error yet; we have to figure it out and share it with everyone.

@ChaseMonsterAway
Contributor

@bharatsubedi @PhamLeQuangNhat
Hi,
Sorry that you are running into these problems.
I will fix some bugs and make the config file clearer today.
After I test the code successfully, I will update the cstr branch.

@ChaseMonsterAway
Contributor

@bharatsubedi @PhamLeQuangNhat
I have fixed some bugs and made the config file clearer. Please use the latest code on the cstr branch. If you have any other problems, please let me know. Thanks.

@PhamLeQuangNhat
Author

@ChaseMonsterAway Yes, I tried the latest code on the cstr branch. It works well. Thank you very much.
