Cudnn error when using SPADE #57
@emanueledelsozzo Check that your custom dataset doesn't contain any class with a label >= the number of labels in your config. Prepend `CUDA_LAUNCH_BLOCKING=1` to your `python ...` command to get a normal traceback.
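You can also scan your masks offline before launching training. A minimal sketch, assuming your label maps load as integer NumPy arrays (`check_mask` is a hypothetical helper, not part of the repo):

```python
import numpy as np

def check_mask(mask, label_nc):
    """Return the label values in `mask` that fall outside [0, label_nc)."""
    return [int(v) for v in np.unique(mask) if v < 0 or v >= label_nc]

# A 4-class config (label_nc=4), but the mask contains a stray label 7:
mask = np.array([[0, 1, 2],
                 [3, 7, 1]])
print(check_mask(mask, label_nc=4))  # -> [7]; this mask would trip the assert
```

Running this over every file in your label directory pinpoints the offending images without waiting for the asynchronous CUDA error.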
Hello @alexsannikoff and thank you for your reply!
My run also ends with `Aborted (core dumped)`. The problem is probably related to your observation about the labels. What I am trying to do is train a pix2pix model that learns the transformation from A to B; I am not really using any semantic information. I trained a similar network with pix2pixHD and it worked. Now I would like to do the same with SPADE, but I am not sure whether that is possible. I also tried setting netG to pix2pixhd, but I got the same error.
Hi, I'm trying to train a custom dataset as well and I'm facing the same error. In my case I'm using labels and instances, but that doesn't make any difference.
@emanueledelsozzo Here the program tries to create a one-hot encoding of your semantic mask, and if the number of classes in your mask is greater than label_nc, this error occurs. Regarding learning an image translation from A to B: in this code, even the pix2pixhd generator uses the SPADE block, which takes the semantic mask as input. I think that in this version you can't learn a plain image translation without changing the code; the main idea of this repo is the SPADE block.
@aviel08 Check your semantic mask: make sure max(np.unique(mask)) < label_nc. Also don't forget that the instance mask has more classes than the semantic one, since it is used to separate objects of the same class from each other. label_nc is the number of classes in the semantic mask.
@alexsannikoff
@emanueledelsozzo Did you get it to work? I am having the same problem. My masks have 3 semantic labels plus the background. I don't understand how the labels are treated, since they are loaded in the range [0, 255] and then converted to a one-hot encoding. What happens if some label is missing in a specific image?
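For what it's worth, a missing label should be harmless: the one-hot channel for that label simply stays all-zero. Only a label value >= label_nc breaks the encoding. Here is a NumPy sketch of the scatter-style one-hot step (an analogue of what the model does on the GPU, not the repo's actual code):

```python
import numpy as np

def one_hot(mask, label_nc):
    """One-hot encode an HxW integer mask into a (label_nc, H, W) array."""
    out = np.zeros((label_nc,) + mask.shape, dtype=np.float32)
    # Places a 1 in channel mask[y, x] for each pixel; an index >= label_nc
    # raises here, just as torch's scatter_ asserts on the GPU.
    np.put_along_axis(out, mask[None, ...], 1.0, axis=0)
    return out

enc = one_hot(np.array([[0, 1], [2, 1]]), label_nc=4)
print(enc.shape)  # (4, 2, 2); channel 3 is all zeros -- and that's fine
# one_hot(np.array([[0, 5]]), label_nc=4) would raise an IndexError instead
```

So an image that lacks some class still encodes cleanly; an image containing an out-of-range value is what triggers the device-side assert.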
Hi guys, just in case it helps anyone: I was also facing the same CUDA error and was able to resolve it by using the --no_instance and --contain_dontcare_label arguments. I should also mention that I am not using instance maps with my custom dataset.
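For reference, the flags slot into the training command like this (paths and --label_nc are placeholders for your own dataset; if I read the options correctly, --contain_dontcare_label tells the loader that value 255 means "unknown" and reserves an extra channel for it instead of indexing out of range):

```shell
# Hypothetical dataset paths; --label_nc must match your number of classes.
python train.py --name myTest --dataset_mode custom \
    --label_dir datasets/label_dir --image_dir datasets/image_dir \
    --label_nc 4 --no_instance --contain_dontcare_label \
    --gpu_ids 0,1 --batchSize 2
```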
@vdhyani96 I am wondering if you are doing Image->Image or Semantic->Image task? |
@zhangdan8962 I am doing semantic->image task |
Hello, I had the same error when attempting to train with a custom dataset. I made some changes to my dataset and now it works. I don't know whether it was just one or all of my changes that made the difference, but for anyone as desperate as I was, here they are:
General approach: first make it work on
can be solved by
I have tried all the above solutions and I am still getting the same error. Has anyone found another solution that sorted this out? I have no issues with the coco_stuff dataset; I get this error only with my custom dataset.
Hello,
I am trying to use SPADE on a custom dataset. Here is the command I run:

```
python train.py --name myTest --dataset_mode custom --label_dir datasets/label_dir --image_dir datasets/image_dir --no_instance --gpu_ids 0,1 --batchSize 2
```

However, after the creation of the web directory, I get this error:
```
/pytorch/aten/src/THC/THCTensorScatterGather.cu:188: void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [90,0,0], thread: [21,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.
[... the same assertion repeated for many more threads, [22,0,0] through [287,0,0] ...]
Traceback (most recent call last):
  File "train.py", line 40, in <module>
    trainer.run_generator_one_step(data_i)
  File "/home/emanuele/repos/testSPADE/SPADE/trainers/pix2pix_trainer.py", line 35, in run_generator_one_step
    g_losses, generated = self.pix2pix_model(data, mode='generator')
  File "/home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/emanuele/repos/testSPADE/SPADE/models/pix2pix_model.py", line 46, in forward
    input_semantics, real_image)
  File "/home/emanuele/repos/testSPADE/SPADE/models/pix2pix_model.py", line 137, in compute_generator_loss
    input_semantics, real_image, compute_kld_loss=self.opt.use_vae)
  File "/home/emanuele/repos/testSPADE/SPADE/models/pix2pix_model.py", line 196, in generate_fake
    fake_image = self.netG(input_semantics, z=z)
  File "/home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/emanuele/repos/testSPADE/SPADE/models/networks/generator.py", line 89, in forward
    x = self.fc(x)
  File "/home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: device-side assert triggered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f25a3999441 in /home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f25a3998d7a in /home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: + 0x13652 (0x7f25a2f45652 in /home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x50 (0x7f25a3989ce0 in /home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: + 0x30facb (0x7f2573acdacb in /home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #5: + 0x1420bb (0x7f25a3f270bb in /home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x3c30f4 (0x7f25a41a80f4 in /home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x3c3141 (0x7f25a41a8141 in /home/emanuele/anaconda3/envs/SPADEEnv/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #24: __libc_start_main + 0xe7 (0x7f25a8297b97 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
```
I am using Python 3.6, CUDA 10.0, PyTorch 1.1.0, and cuDNN 7.5.0.
I have also tried another machine with CUDA 9.0, PyTorch 1.0.0, and cuDNN 7.4.1, but I got the same error.
Could you help me?