
RuntimeError: Dimension out of range (expected to be in range of [-2, 1], but got 2) with CornerNet_Saccade #34

Closed
Sujith93 opened this issue May 8, 2020 · 15 comments
Labels: bug (Something isn't working)

Comments

@Sujith93 commented May 8, 2020:

I'm running

from train_detector import Detector
gtf = Detector();

root_dir = "/home/SK00495085/monk/Monk_Object_Detection/data";
coco_dir = "training_menu"
img_dir = "/"
set_dir = "Images"

gtf.Train_Dataset(root_dir, coco_dir, img_dir, set_dir, batch_size=4, num_workers=4)

root_dir = "/home/SK00495085/monk/Monk_Object_Detection/data";
coco_dir = "validation_menu"
img_dir = "/"
set_dir = "Images"

gtf.Val_Dataset(root_dir, coco_dir, img_dir, set_dir)
gtf.Model(model_name="CornerNet_Saccade")
gtf.Hyper_Params(lr=0.00025, total_iterations=6900000, val_interval=10000)
gtf.Setup();
gtf.Train();

I got this error:

loading annotations into memory...
Done (t=0.59s)
creating index...
index created!
loading annotations into memory...
Done (t=0.22s)
creating index...
index created!
Loading Model - core.models.CornerNet_Saccade
Model Loaded
start_iter = 0
distributed = False
world_size = 0
initialize = False
batch_size = 1
learning_rate = 0.00025
max_iteration = 6900000
stepsize = 5520000
snapshot = 3450000
val_iter = 10000
display = 100
decay_rate = 10
Process 0: building model...
total parameters: 116967797
start prefetching data...
shuffling indices...
setting learning rate to: 0.00025
training start...
start prefetching data...
shuffling indices...
0%| | 0/6900000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "training_saccade.py", line 31, in
gtf.Train();
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/train_detector.py", line 298, in Train
training_loss = self.system_dict["local"]["nnet"].train(**training)
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/core/nnet/py_factory.py", line 93, in train
loss = self.network(xs, ys)
File "/home/SK00495085/.conda/envs/monk_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/core/models/py_utils/data_parallel.py", line 68, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/SK00495085/.conda/envs/monk_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/core/nnet/py_factory.py", line 20, in forward
loss = self.loss(preds, ys, **kwargs)
File "/home/SK00495085/.conda/envs/monk_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/core/models/py_utils/losses.py", line 150, in forward
pull, push = self.ae_loss(tl_tag, br_tag, gt_mask)
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/core/models/py_utils/losses.py", line 26, in _ae_loss
dist = tag_mean.unsqueeze(1) - tag_mean.unsqueeze(2)
RuntimeError: Dimension out of range (expected to be in range of [-2, 1], but got 2)
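
For context, this "Dimension out of range" message is what PyTorch raises when unsqueeze(2) is called on a tensor with fewer than two dimensions, which suggests tag_mean reaches the ae_loss with fewer dimensions than the loss expects. A minimal sketch that reproduces only the message (the shape below is hypothetical, not taken from the pipeline):

import torch

# Hypothetical 1-D tensor standing in for a tag_mean that has lost a dimension.
tag_mean = torch.randn(4)

# unsqueeze(1) is still valid for a 1-D tensor, but unsqueeze(2) is not,
# which raises the same "expected to be in range of [-2, 1], but got 2" error.
dist = tag_mean.unsqueeze(1) - tag_mean.unsqueeze(2)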

Sujith93 changed the title from "RuntimeError: Dimension out of range (expected to be in range of [-2, 1], but got 2)" to "RuntimeError: Dimension out of range (expected to be in range of [-2, 1], but got 2) with CornerNet_Saccade" on May 8, 2020
@THEFASHIONGEEK (Contributor) commented:

Are you training on multiple GPUs?

@Sujith93 (Author) commented May 8, 2020:

No, only one GPU.

@THEFASHIONGEEK (Contributor) commented:

Try increasing the batch size to 8. If the error still persists, please let us know.

@Sujith93 (Author) commented May 8, 2020:

I changed the batch size to 8 and it's working fine.
The reason I had reduced it to 1 was the error below:
"RuntimeError: CUDA out of memory. Tried to allocate 399.88 MiB"

Now it's running. Thank you

@Sujith93 (Author) commented May 8, 2020:

It ran for some time, but now it's throwing the error below:

batch_size = 8
learning_rate = 0.00025
max_iteration = 6900000
stepsize = 5520000
snapshot = 3450000
val_iter = 10000
display = 100
decay_rate = 10
Process 0: building model...
Traceback (most recent call last):
File "training_saccade.py", line 31, in
gtf.Train();
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/train_detector.py", line 232, in Train
self.system_dict["local"]["model"], distributed=distributed, gpu=gpu)
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/core/nnet/py_factory.py", line 51, in init
self.network = DataParallel(self.network, chunk_sizes=system_config.chunk_sizes)
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/core/models/py_utils/data_parallel.py", line 61, in init
self.module.cuda(device_ids[0])
File "/home/SK00495085/.conda/envs/monk_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 260, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/SK00495085/.conda/envs/monk_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/home/SK00495085/.conda/envs/monk_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/home/SK00495085/.conda/envs/monk_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
[Previous line repeated 7 more times]
File "/home/SK00495085/.conda/envs/monk_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
param.data = fn(param.data)
File "/home/SK00495085/.conda/envs/monk_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 260, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA out of memory. Tried to allocate 5.12 MiB (GPU 0; 7.93 GiB total capacity; 146.31 MiB already allocated; 5.56 MiB free; 839.00 KiB cached)
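
If a previous run or another process is still holding GPU memory, a fresh run can fail like this even though the model fits on its own. A small diagnostic sketch (assumes nvidia-smi is on the PATH; it is not part of the Monk pipeline):

import subprocess
import torch

# List the processes currently holding GPU memory; a stale run from an
# earlier session can leave almost no free memory for a new one.
print(subprocess.check_output(["nvidia-smi"]).decode())

# Within the same Python session, release cached blocks before rebuilding
# the model and report what is still allocated on GPU 0.
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated(0), "bytes still allocated on GPU 0")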

@Sujith93 (Author) commented May 8, 2020:

It ran the first time, but running the same code a second time throws this runtime error.

@Sujith93 (Author) commented May 8, 2020:

The issue is with CUDA. I got it.
Thanks.

Sujith93 closed this as completed May 8, 2020
@Sujith93 (Author) commented:

Once again, a runtime error:

  0%|          | 9999/6900000 [20:39:32<2965:10:45, 1.55s/it]
Process 0: training loss at iteration 10000: 3.72452449798584

Traceback (most recent call last):
File "training_saccade.py", line 32, in
gtf.Train();
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/train_detector.py", line 307, in Train
validation_loss = self.system_dict["local"]["nnet"].validate(**validation)
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/core/nnet/py_factory.py", line 105, in validate
loss = self.network(xs, ys)
File "/home/SK00495085/.conda/envs/monk_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/core/models/py_utils/data_parallel.py", line 68, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/SK00495085/.conda/envs/monk_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/core/nnet/py_factory.py", line 20, in forward
loss = self.loss(preds, ys, **kwargs)
File "/home/SK00495085/.conda/envs/monk_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/core/models/py_utils/losses.py", line 134, in forward
focal_loss += self.focal_loss(tl_heats, gt_tl_heat, gt_tl_valid)
File "/home/SK00495085/monk/Monk_Object_Detection/6_cornernet_lite/lib/core/models/py_utils/losses.py", line 57, in _focal_loss_mask
pos_pred = pred[pos_inds]
RuntimeError: The shape of the mask [8, 18, 64, 64] at index 1 does not match the shape of the indexed tensor [8, 79, 64, 64] at index 1
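
The mask has 18 channels at dim 1 while the indexed prediction has 79, which looks like the validation annotations carry a different number of categories than the one the model heads were built from. A minimal comparison (the annotation paths are hypothetical placeholders for the files used above; pycocotools is assumed to be installed):

from pycocotools.coco import COCO

# Hypothetical paths; point these at the train/val annotation JSONs used above.
train_coco = COCO("data/training_menu/annotations/instances_Images.json")
val_coco = COCO("data/validation_menu/annotations/instances_Images.json")

train_cats = {c["name"] for c in train_coco.loadCats(train_coco.getCatIds())}
val_cats = {c["name"] for c in val_coco.loadCats(val_coco.getCatIds())}

print("train categories:", len(train_cats), "val categories:", len(val_cats))
print("only in train:", sorted(train_cats - val_cats))
print("only in val:", sorted(val_cats - train_cats))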

Sujith93 reopened this May 10, 2020
@abhi-kumar (Contributor) commented:

Which dataset are you working on? Please point to that dataset and share your code so that we can reproduce the error.

abhi-kumar added the bug (Something isn't working) label May 10, 2020
@Sujith93 (Author) commented May 10, 2020:

I'm working on a real-time project, so I'm unable to share the data.
But the steps I have done so far are listed below.

  1. Set up the required environment to run CornerNet_Saccade.
  2. Converted the VOC XML annotations to a COCO dataset using VOC Type to Coco - Via Monk Type Annotation.ipynb.
  3. Since I have both train and valid data, I followed Train With Validation Data.ipynb.

I kept the model running for the whole night, but it suddenly stopped, throwing that runtime error.

@abhi-kumar (Contributor) commented:

Please check whether there is any discrepancy within the annotation files. Since training started and ran for several hours, the issue could likely be traced back to an image with no labels or bounding boxes, or with box shapes crossing the image boundaries.
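
A rough sketch of such a check over a COCO-style annotation file, flagging images with no annotations and boxes that fall outside the image (the path is a hypothetical placeholder; pycocotools assumed):

from pycocotools.coco import COCO

# Hypothetical path; point this at the validation annotation file.
coco = COCO("data/validation_menu/annotations/instances_Images.json")

for img in coco.loadImgs(coco.getImgIds()):
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img["id"]))
    if not anns:
        print("no annotations:", img["file_name"])
    for ann in anns:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        if w <= 0 or h <= 0 or x < 0 or y < 0 \
                or x + w > img["width"] or y + h > img["height"]:
            print("suspect box:", img["file_name"], ann["bbox"])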

@Sujith93 (Author) commented:

Yes, I cross-checked both the train and valid data a couple of times for the bounding boxes and labels; all are good. But when I train with only training data and no validation data, everything works fine.
It seems there is an issue with the validation part.

@abhi-kumar (Contributor) commented:

Thank you for the detailed analysis. We will check the validation code.

@abhi-kumar (Contributor) commented:

We have run multiple tests on the CornerNet pipeline, yet the error wasn't reproduced. Have you reached a solution yet?

@abhi-kumar (Contributor) commented:

Closing due to inactivity.
