Train VisDrone data error: cuda out of memory #125

Closed
YannBai opened this issue Apr 10, 2020 · 1 comment


YannBai commented Apr 10, 2020

When I trained on the VisDrone dataset, I got a CUDA out-of-memory error. Is the image size too large? I have already changed the input size to [511, 511]. Does anyone know how to fix this? Thanks!

loading all datasets...
using 4 threads
loading from cache file: cache/visdrone_train.pkl
loading annotations into memory...
Done (t=1.43s)
creating index...
index created!
loading from cache file: cache/visdrone_train.pkl
loading annotations into memory...
Done (t=1.59s)
creating index...
index created!
loading from cache file: cache/visdrone_train.pkl
loading annotations into memory...
Done (t=1.30s)
creating index...
index created!
loading from cache file: cache/visdrone_train.pkl
loading annotations into memory...
Done (t=1.47s)
creating index...
index created!
loading from cache file: cache/visdrone_val.pkl
loading annotations into memory...
Done (t=0.10s)
creating index...
index created!
system config...
{'batch_size': 8,
'cache_dir': 'cache',
'chunk_sizes': [2, 2, 2, 2],
'config_dir': 'config',
'data_dir': '/home/by/data',
'data_rng': <mtrand.RandomState object at 0x7f7f342904c8>,
'dataset': 'Visdrone',
'decay_rate': 10,
'display': 5,
'learning_rate': 0.00025,
'max_iter': 480000,
'nnet_rng': <mtrand.RandomState object at 0x7f7f34290510>,
'opt_algo': 'adam',
'prefetch_size': 6,
'pretrain': None,
'result_dir': 'results',
'sampling_function': 'kp_detection',
'snapshot': 5000,
'snapshot_name': 'CenterNet-104',
'stepsize': 450000,
'test_split': 'VisDrone2019-DET-test-dev',
'train_split': 'VisDrone2019-DET-train',
'val_iter': 500,
'val_split': 'VisDrone2019-DET-val',
'weight_decay': False,
'weight_decay_rate': 1e-05,
'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5,
'border': 128,
'categories': 10,
'data_aug': True,
'gaussian_bump': True,
'gaussian_iou': 0.7,
'gaussian_radius': -1,
'input_size': [511, 511],
'kp_categories': 1,
'lighting': True,
'max_per_image': 100,
'merge_bbox': False,
'nms_algorithm': 'exp_soft_nms',
'nms_kernel': 3,
'nms_threshold': 0.5,
'output_sizes': [[128, 128]],
'rand_color': True,
'rand_crop': True,
'rand_pushes': False,
'rand_samples': False,
'rand_scale_max': 1.4,
'rand_scale_min': 0.6,
'rand_scale_step': 0.1,
'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]),
'special_crop': False,
'test_scales': [1],
'top_k': 70,
'weight_exp': 8}
len of db: 6471
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
building model...
module_file: models.CenterNet-104
total parameters: 210062960
setting learning rate to: 0.00025
training start...
0%| | 0/480000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 203, in <module>
train(training_dbs, validation_db, args.start_iter)
File "train.py", line 138, in train
training_loss, focal_loss, pull_loss, push_loss, regr_loss = nnet.train(**training)
File "/home/by/s/pytorch/DET/CenterNet/nnet/py_factory.py", line 82, in train
loss_kp = self.network(xs, ys)
File "/home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, *kwargs)
File "/home/by/s/pytorch/DET/CenterNet/models/py_utils/data_parallel.py", line 69, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/home/by/s/pytorch/DET/CenterNet/models/py_utils/data_parallel.py", line 74, in replicate
return replicate(module, device_ids)
File "/home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
param_copies = Broadcast.apply(devices, params)
File "/home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 19, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: CUDA error: out of memory (allocate at /opt/conda/conda-bld/pytorch_1532581333611/work/aten/src/THC/THCCachingAllocator.cpp:510)
frame #0: THCStorage_resize + 0x123 (0x7f7f48085783 in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #1: THCTensor_resizeNd + 0x30f (0x7f7f4809341f in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: THCudaTensor_newWithStorage + 0xfa (0x7f7f480998fa in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: at::CUDAFloatType::th_tensor(at::ArrayRef) const + 0xa5 (0x7f7f47fb99d5 in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: at::native::tensor(at::Type const&, at::ArrayRef) + 0x3a (0x7f7f640d17da in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #5: at::Type::tensor(at::ArrayRef) const + 0x9 (0x7f7f642bfb69 in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #6: torch::autograd::VariableType::tensor(at::ArrayRef) const + 0x44 (0x7f7f65f38ea4 in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #7: torch::cuda::broadcast(at::Tensor const&, at::ArrayRef) + 0x194 (0x7f7f663eaf64 in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #8: torch::cuda::broadcast_coalesced(at::ArrayRef<at::Tensor>, at::ArrayRef, unsigned long) + 0xa10 (0x7f7f663ec200 in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #9: + 0xc4256b (0x7f7f663f056b in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #10: + 0x38a52b (0x7f7f65b3852b in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)

frame #21: THPFunction_apply(_object*, _object*) + 0x38f (0x7f7f65f16bcf in /home/by/APP/anaconda2/envs/CenterNet/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so)
frame #63: __libc_start_main + 0xf0 (0x7f7f79b2b830 in /lib/x86_64-linux-gnu/libc.so.6)
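
Editor's note on where this fails, since the thread does not record the actual fix: the OOM is raised inside replicate()/Broadcast.apply, i.e. while copying the ~210M-parameter model onto each GPU in device_ids, before any image batch is processed, so shrinking input_size alone may not be enough. With this codebase the usual knobs are batch_size and chunk_sizes (the per-GPU split of the batch used by models/py_utils/data_parallel.py), plus confirming that the listed GPUs actually have free memory (nvidia-smi). Below is a minimal sanity-check sketch; the helper name, config path, and example values are illustrative assumptions, not the author's solution, and it assumes the config keeps the CornerNet-style "system" section.

# Hypothetical helper (not part of the CenterNet repo): sanity-check the
# system config before launching training. Assumes config/CenterNet-104.json
# keeps the CornerNet-style {"system": {...}, "db": {...}} layout.
import json
import torch

def check_system_config(path="config/CenterNet-104.json"):
    with open(path) as f:
        cfg = json.load(f)["system"]
    batch_size = cfg["batch_size"]      # 8 in the log above
    chunk_sizes = cfg["chunk_sizes"]    # [2, 2, 2, 2] -> batch split over 4 GPUs
    # chunk_sizes is how the batch is split across GPUs, so it must sum to
    # batch_size and must not list more GPUs than are actually visible.
    assert sum(chunk_sizes) == batch_size, "chunk_sizes must sum to batch_size"
    assert len(chunk_sizes) <= torch.cuda.device_count(), "more chunks than visible GPUs"
    return batch_size, chunk_sizes

# If replication itself runs out of memory (as in the traceback above), try a
# smaller footprint, e.g. batch_size 4 with chunk_sizes [2, 2], and/or restrict
# training to GPUs that are free: CUDA_VISIBLE_DEVICES=0,1 python train.py CenterNet-104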

YannBai closed this as completed Apr 11, 2020

YannBai commented Apr 11, 2020

Solved.
