Out of memory issue when training a new dataset #5
Hi, as for the ImageNet models, we used four 12GB Titan X (Pascal) GPUs to train them. So perhaps you need to change your command to train on multiple GPUs.
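Given the single-GPU command quoted later in this thread, a multi-GPU variant would look something like the following (the flag values are copied from this thread, not verified against the repo; adjust them to your setup):

```shell
python main.py --model condensenet -b 256 -j 28 lima_train \
    --stages 4-6-8-10-8 --growth 8-16-32-64-128 --gpu 0,1,2,3 --resume
```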
Hi @ShichenLiu, thanks a lot for your fast answer. So the out-of-memory error may appear because I am using only one GPU for training, right?
I believe so.
Hi @ShichenLiu, I'm now using a shared node with 4 GPUs via `--gpu 0,1,2,3`. The problem is that memory still grew during the initial steps of training with a batch size of 256 images, exceeding the 128 GB of RAM available on the machines I am working with. When reducing the batch size to 128 and then to 64, it also failed, at successively later iterations within the same initial epoch. The number of samples in my dataset is, however, much lower than in CIFAR or ImageNet, as I explained above, and I presume 128 GB of RAM should be enough to train this type of network. I would therefore appreciate your thoughts on whether some problem in my parameter settings could cause such growing memory use on non-large datasets. Cheers,
Hi @vponcelo, In fact, the size of the dataset should not affect memory consumption; only the batch size can cause the OOM problem. Could you please provide the network architecture that is printed right after training starts? Thanks
@vponcelo: You presumably had to make some changes to the data loading to facilitate training on a different dataset. Can you please double-check that you're not keeping all the data in RAM, i.e. that you're not loading the entire dataset into memory up front? Presuming the above is not the case, this may be an issue where the Python garbage collector does not get invoked for some reason. You could try adding an explicit call to the garbage collector (`gc.collect()`) in the training loop.
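A minimal sketch of that suggestion, assuming a generic per-batch loop (`batches` and `step` are illustrative stand-ins for the real data loader and forward/backward pass, not names from this repo):

```python
import gc

def train_one_epoch(batches, step):
    """Run one epoch, forcing a garbage-collection pass after each batch.

    `batches` is any iterable of training batches; `step` performs the
    per-batch work (hypothetical stand-ins for the real loop).
    """
    for batch in batches:
        step(batch)
        # Explicitly invoke the collector so cyclic garbage from the
        # previous iteration is freed before the next batch allocates.
        gc.collect()
```

This only helps when objects are kept alive by reference cycles; if tensors are still reachable from live variables, collection frees nothing.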
Thanks a lot for your very fast and useful answers @ShichenLiu and @lvdmaaten. I went back to the original OOM problem, which occurred before any invocation of the Python garbage collector could take place. After trying several settings, I found that a solution was as simple as capping the number of data-loading workers (the `-j` parameter) at a maximum of 26. With that, the OOM problem is now solved. I am attaching the output of my trained network so that you can check whether it looks fine before closing this issue. It seems to be working pretty well on my dataset from the starting epochs, doesn't it? Cheers and thank you both again,
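One way to see why the worker count matters: each data-loading worker prefetches decoded batches into host RAM, so peak usage grows roughly linearly with the number of workers. A back-of-the-envelope model (the prefetch depth and float32 image tensors here are assumptions for illustration, not an exact PyTorch accounting):

```python
def loader_ram_gb(batch_size, num_workers, h=256, w=256, c=3,
                  bytes_per_value=4, prefetch_batches=2):
    """Rough upper bound (GB) on host RAM held by prefetched image tensors.

    Assumes every worker may buffer `prefetch_batches` decoded float32
    batches of h x w x c images before the main process consumes them.
    """
    per_image = h * w * c * bytes_per_value
    return num_workers * prefetch_batches * batch_size * per_image / 1024**3
```

Under this model, fewer workers or a smaller batch size reduces the prefetch footprint proportionally, which matches the behavior observed above.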
Great to know that your problem has been solved and the log looks good to me. :-) |
Hi,
I am attempting to use your CondenseNet code to train on a dataset of 7 classes, with approximately 100K-150K training images split (non-equally) across those classes. My images consist of bounding boxes of different sizes. I'm using a setting similar to the one you use for ImageNet, pointing to my dataset and preparing the class folders so the paths resolve properly. I resized all images to 256x256 as you did in your paper. This is the command line I use for training the new dataset:

```
python main.py --model condensenet -b 256 -j 28 lima_train --stages 4-6-8-10-8 --growth 8-16-32-64-128 --gpu 0 --resume
```

where `lima_train` is a link file pointing to the folder containing all training data, split into class subfolders as required.

I'm using a datacenter whose GPU nodes have NVIDIA Tesla P100s with 16 GB each, and CUDA 8 with cuDNN, so I presume training itself should not be a problem; a GPU with 16 GB, or even 8 GB, should be enough to train this network, shouldn't it? However, I'm getting the out-of-memory error shown below. I have already tried reducing the batch size to 64 and adjusting the number of workers to the machine. Probably I am missing some step, or I should modify the command line according to the characteristics of my data.
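For intuition on why a smaller batch size (or spreading the batch across more GPUs) helps with GPU memory: under data parallelism, each device only materializes activations for its share of the batch. A trivial sketch of that split (illustrative only; the actual splitting is done by the framework):

```python
def per_gpu_batch(total_batch: int, n_gpus: int) -> int:
    # Data parallelism splits each batch evenly across devices, so
    # activation memory per GPU scales with total_batch / n_gpus.
    return total_batch // n_gpus
```

So a batch of 256 on four GPUs puts the same per-device pressure as a batch of 64 on one.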
I would appreciate any feedback.
Thanks in advance, and congratulations on this work.
```
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "main.py", line 479, in <module>
    main()
  File "main.py", line 239, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "main.py", line 303, in train
    output = model(input_var, progress)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 58, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/CondenseNet/models/condensenet.py", line 127, in forward
    features = self.features(x)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/CondenseNet/models/condensenet.py", line 33, in forward
    x = self.conv_1(x)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/CondenseNet/layers.py", line 42, in forward
    x = self.norm(x)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 37, in forward
    self.training, self.momentum, self.eps)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/functional.py", line 639, in batch_norm
    return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu:66
srun: error: gpu09: task 0: Exited with exit code 1
```