Train VisDrone data error: cuda out of memory #125

Closed
YannBai opened this issue Apr 10, 2020 · 1 comment


YannBai commented Apr 10, 2020

When I trained on the VisDrone dataset, I got a CUDA out-of-memory error. Is the image size too large? I have already changed the input size to [511, 511]. Does anyone know how to fix this? Thanks!

loading all datasets...
using 4 threads
loading from cache file: cache/visdrone_train.pkl
loading annotations into memory...
Done (t=1.43s)
creating index...
index created!
loading from cache file: cache/visdrone_train.pkl
loading annotations into memory...
Done (t=1.59s)
creating index...
index created!
loading from cache file: cache/visdrone_train.pkl
loading annotations into memory...
Done (t=1.30s)
creating index...
index created!
loading from cache file: cache/visdrone_train.pkl
loading annotations into memory...
Done (t=1.47s)
creating index...
index created!
loading from cache file: cache/visdrone_val.pkl
loading annotations into memory...
Done (t=0.10s)
creating index...
index created!
system config...
{'batch_size': 8,
'cache_dir': 'cache',
'chunk_sizes': [2, 2, 2, 2],
'config_dir': 'config',
'data_dir': '/home/by/data',
'data_rng': <mtrand.RandomState object at 0x7f7f342904c8>,
'dataset': 'Visdrone',
'decay_rate': 10,
'display': 5,
'learning_rate': 0.00025,
'max_iter': 480000,
'nnet_rng': <mtrand.RandomState object at 0x7f7f34290510>,
'opt_algo': 'adam',
'prefetch_size': 6,
'pretrain': None,
'result_dir': 'results',
'sampling_function': 'kp_detection',
'snapshot': 5000,
'snapshot_name': 'CenterNet-104',
'stepsize': 450000,
'test_split': 'VisDrone2019-DET-test-dev',
'train_split': 'VisDrone2019-DET-train',
'val_iter': 500,
'val_split': 'VisDrone2019-DET-val',
'weight_decay': False,
'weight_decay_rate': 1e-05,
'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5,
'border': 128,
'categories': 10,
'data_aug': True,
'gaussian_bump': True,
'gaussian_iou': 0.7,
'gaussian_radius': -1,
'input_size': [511, 511],
'kp_categories': 1,
'lighting': True,
'max_per_image': 100,
'merge_bbox': False,
'nms_algorithm': 'exp_soft_nms',
'nms_kernel': 3,
'nms_threshold': 0.5,
'output_sizes': [[128, 128]],
'rand_color': True,
'rand_crop': True,
'rand_pushes': False,
'rand_samples': False,
'rand_scale_max': 1.4,
'rand_scale_min': 0.6,
'rand_scale_step': 0.1,
'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]),
'special_crop': False,
'test_scales': [1],
'top_k': 70,
'weight_exp': 8}
len of db: 6471
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
building model...
module_file: models.CenterNet-104
total parameters: 210062960
setting learning rate to: 0.00025
training start...
0%| | 0/480000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 203, in <module>
train(training_dbs, validation_db, args.start_iter)
File "train.py", line 138, in train
training_loss, focal_loss, pull_loss, push_loss, regr_loss = nnet.train(**training)
File "/home/by/s/pytorch/DET/CenterNet/nnet/py_factory.py", line 82, in train
loss_kp = self.network(xs, ys)
File "/home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, *kwargs)
File "/home/by/s/pytorch/DET/CenterNet/models/py_utils/data_parallel.py", line 69, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/home/by/s/pytorch/DET/CenterNet/models/py_utils/data_parallel.py", line 74, in replicate
return replicate(module, device_ids)
File "/home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
param_copies = Broadcast.apply(devices, params)
File "/home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 19, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: CUDA error: out of memory (allocate at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THC/THCCachingAllocator.cpp:510)
frame #0: THCStorage_resize + 0x123 (0x7f7f48085783 in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #1: THCTensor_resizeNd + 0x30f (0x7f7f4809341f in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: THCudaTensor_newWithStorage + 0xfa (0x7f7f480998fa in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: at::CUDAFloatType::th_tensor(at::ArrayRef) const + 0xa5 (0x7f7f47fb99d5 in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: at::native::tensor(at::Type const&, at::ArrayRef) + 0x3a (0x7f7f640d17da in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #5: at::Type::tensor(at::ArrayRef) const + 0x9 (0x7f7f642bfb69 in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #6: torch::autograd::VariableType::tensor(at::ArrayRef) const + 0x44 (0x7f7f65f38ea4 in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #7: torch::cuda::broadcast(at::Tensor const&, at::ArrayRef) + 0x194 (0x7f7f663eaf64 in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #8: torch::cuda::broadcast_coalesced(at::ArrayRef<at::Tensor>, at::ArrayRef, unsigned long) + 0xa10 (0x7f7f663ec200 in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #9: + 0xc4256b (0x7f7f663f056b in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #10: + 0x38a52b (0x7f7f65b3852b in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)

frame #21: THPFunction_apply(_object*, _object*) + 0x38f (0x7f7f65f16bcf in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #63: __libc_start_main + 0xf0 (0x7f7f79b2b830 in /lib/x86_64-linux-gnu/libc.so.6)
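
Editor's note on where this fails, since the thread does not record the actual fix: the OOM is raised inside replicate()/Broadcast.apply, i.e. while copying the ~210M-parameter model onto each GPU in device_ids, before any image batch is processed, so shrinking input_size alone may not be enough. With this codebase the usual knobs are batch_size and chunk_sizes (the per-GPU split of the batch used by models/py_utils/data_parallel.py), plus confirming that the listed GPUs actually have free memory (nvidia-smi). Below is a minimal sanity-check sketch; the helper name, config path, and example values are illustrative assumptions, not the author's solution, and it assumes the config keeps the CornerNet-style "system" section.

# Hypothetical helper (not part of the CenterNet repo): sanity-check the
# system config before launching training. Assumes config/CenterNet-104.json
# keeps the CornerNet-style {"system": {...}, "db": {...}} layout.
import json
import torch

def check_system_config(path="config/CenterNet-104.json"):
    with open(path) as f:
        cfg = json.load(f)["system"]
    batch_size = cfg["batch_size"]      # 8 in the log above
    chunk_sizes = cfg["chunk_sizes"]    # [2, 2, 2, 2] -> batch split over 4 GPUs
    # chunk_sizes is how the batch is split across GPUs, so it must sum to
    # batch_size and must not list more GPUs than are actually visible.
    assert sum(chunk_sizes) == batch_size, "chunk_sizes must sum to batch_size"
    assert len(chunk_sizes) <= torch.cuda.device_count(), "more chunks than visible GPUs"
    return batch_size, chunk_sizes

# If replication itself runs out of memory (as in the traceback above), try a
# smaller footprint, e.g. batch_size 4 with chunk_sizes [2, 2], and/or restrict
# training to GPUs that are free: CUDA_VISIBLE_DEVICES=0,1 python train.py CenterNet-104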

YannBai closed this as completed Apr 11, 2020

YannBai commented Apr 11, 2020

Solved.
