
Out of memory issue when training a new dataset #5

Closed
vponcelo opened this issue Dec 5, 2017 · 8 comments

vponcelo commented Dec 5, 2017

Hi,

I am attempting to use your CondenseNet code to train on a dataset of 7 classes, with approximately 100K-150K training images split (unevenly) among those classes. My images consist of bounding boxes of different sizes. To do this, I first use a setup similar to the one you use to train on ImageNet, pointing to my dataset and preparing the class folders so the paths are found properly. I resized all images to 256x256 as you did in your paper. This is the command line I use for training the new dataset:

python main.py --model condensenet -b 256 -j 28 lima_train --stages 4-6-8-10-8 --growth 8-16-32-64-128 --gpu 0 --resume

where lima_train is a symlink pointing to the folder containing all training data, split into class subfolders as required (sketched below).
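
For reference, this is the folder layout I am assuming the ImageNet-style loader expects (the torchvision ImageFolder convention); the snippet below is only an illustrative sanity check with my own names, not code from main.py:

    # Expected layout (ImageFolder convention):
    #   lima_train/
    #       class_0/  img_0001.jpg, img_0002.jpg, ...
    #       class_1/  ...
    #       ...
    #       class_6/  ...
    import torchvision.datasets as datasets
    import torchvision.transforms as transforms

    dataset = datasets.ImageFolder("lima_train", transforms.ToTensor())
    print(dataset.classes)   # should list the 7 class subfolders
    print(len(dataset))      # roughly 100K-150K samples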

I'm using a data center whose GPU nodes have NVIDIA Tesla P100s with 16 GB each, and CUDA 8 with cuDNN, so I presume training should not be a problem. I understand that a 16 GB, or even 8 GB, GPU should be enough to train this network, shouldn't it? However, I'm getting the out-of-memory error shown below. I also tried reducing the batch size to 64 and adjusting the number of workers to the machine. Probably I am missing some step, or I should modify the command line according to the settings of my data.

I would appreciate any feedback.

Thanks in advance, and congratulations on this work.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "main.py", line 479, in <module>
    main()
  File "main.py", line 239, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "main.py", line 303, in train
    output = model(input_var, progress)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 58, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/CondenseNet/models/condensenet.py", line 127, in forward
    features = self.features(x)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/CondenseNet/models/condensenet.py", line 33, in forward
    x = self.conv_1(x)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/CondenseNet/layers.py", line 42, in forward
    x = self.norm(x)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 37, in forward
    self.training, self.momentum, self.eps)
  File "/mnt/storage/home/vp17941/.conda/envs/condensenet/lib/python3.6/site-packages/torch/nn/functional.py", line 639, in batch_norm
    return f(input, weight, bias)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu:66
srun: error: gpu09: task 0: Exited with exit code 1

@ShichenLiu (Owner)

Hi, for the ImageNet models we used four 12 GB Titan X (Pascal) GPUs to train them. So perhaps you need to change your command to python main.py --model condensenet -b 256 -j 28 lima_train --stages 4-6-8-10-8 --growth 8-16-32-64-128 --gpu 0,1,2,3 --resume (you may only need three 16 GB GPUs in your case). Hope this helps.
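
The reason more GPUs help is that DataParallel splits each batch along dimension 0 across the listed devices, so the per-GPU activation memory scales with batch size divided by the number of GPUs. A minimal sketch of that behaviour (the tiny model here is only a stand-in for CondenseNet):

    import torch
    import torch.nn as nn

    # Stand-in model; main.py wraps the real CondenseNet the same way
    # (the data_parallel.py frame in the traceback above shows this).
    model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
    model = nn.DataParallel(model, device_ids=[0, 1, 2, 3]).cuda()

    x = torch.randn(256, 3, 224, 224).cuda()  # full batch of 256
    y = model(x)                              # each GPU sees 256 / 4 = 64 images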


vponcelo commented Dec 5, 2017

Hi @ShichenLiu, thanks a lot for your fast answer. So the out-of-memory error may occur because I am using only one GPU for training, right?

@ShichenLiu (Owner)

I believe so.


vponcelo commented Dec 6, 2017

Hi @ShichenLiu,

Now I'm using a shared node setting with 4 GPUs with --gpu 0,1,2,3.

The problem is that memory still grows during the initial steps of training with a batch size of 256, eventually exceeding the 128 GB of RAM available on the machines I am working with. When I reduce the batch size to 128 or 64, it also fails, only at later iterations within the same first epoch.

The number of samples in my dataset is, however, much smaller than CIFAR or ImageNet, as I explained above, and I presume 128 GB of RAM should be enough to train this type of network. I would therefore appreciate your thoughts on whether there could be a problem with the parameter settings I am using, so as to avoid this growing memory consumption on a dataset that is not especially large.

Cheers,

@ShichenLiu (Owner)

Hi @vponcelo,

In fact, the size of the dataset should not affect memory consumption; only the batch size should cause the OOM problem. Could you please provide the network architecture that is printed right after running python main.py --model condensenet -b 256 -j 20 /PATH/TO/DATASET --stages 4-6-8-10-8 --growth 8-16-32-64-128 --gpu 0,1,2,3 --resume? I believe that would help us solve the problem more easily.
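
A generic way to dump the same kind of information, in case it is easier to attach (an illustrative sketch, not code from main.py):

    import torch.nn as nn

    def describe(model: nn.Module) -> None:
        # Layer-by-layer architecture of the constructed network
        print(model)
        # Parameter count, handy for sanity-checking --stages / --growth
        n_params = sum(p.numel() for p in model.parameters())
        print("Total parameters: {:.2f}M".format(n_params / 1e6))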

Thanks

@lvdmaaten (Collaborator)

@vponcelo: You presumably had to make some changes in the data-loading to facilitate training on a different data set. Can you please double-check that you're not keeping all the data in RAM? (That is, that you're not doing, say, torch.load or pickle.load on a 200GB data file.)
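
For example, the two loading patterns look like this (file names are hypothetical, just to illustrate the contrast):

    import torch
    from torch.utils.data import DataLoader
    import torchvision.datasets as datasets
    import torchvision.transforms as transforms

    # Pattern to avoid: materialising the whole dataset in host RAM up front.
    # all_images = torch.load("lima_train_tensors.pt")  # hypothetical huge blob

    # Lazy pattern: ImageFolder keeps only the file paths in memory and decodes
    # each image on the fly in the loader's worker processes.
    train_set = datasets.ImageFolder("lima_train", transforms.ToTensor())
    train_loader = DataLoader(train_set, batch_size=64, shuffle=True,
                              num_workers=4, pin_memory=True)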

Presuming the above is not the case, this may be an issue where the Python garbage collector does not get invoked for some reason. You could try adding an explicit call to the garbage collector here (add gc.collect()) to see if that helps.
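
A minimal sketch of where such a call could go, with a simplified loop standing in for the repo's train() function:

    import gc

    def train_one_epoch(train_loader, model, criterion, optimizer):
        for i, (input, target) in enumerate(train_loader):
            output = model(input.cuda())
            loss = criterion(output, target.cuda())

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Explicit garbage collection as suggested above; normally not
            # needed, but it can release cyclic references that keep large
            # tensors alive between iterations.
            if i % 100 == 0:
                gc.collect()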


vponcelo commented Dec 6, 2017

Thanks a lot for your very fast and useful answers @ShichenLiu and @lvdmaaten.

I went back to the original OOM problem, which occurs here (before any invocation of the Python garbage collector could take place). In my case, input_var was a [torch.FloatTensor of size <batchsize>x3x224x224], and I tried several values for <batchsize> (16, 32, 64, 128, 256), all of which raised the OOM error. I wondered whether I should crop the images to 224x224 rather than using the original bounding boxes resized to 256x256, as you did in your paper, but that did not seem to be the problem at all (see the preprocessing sketch below). Then I tried to track down a data-loading problem, but I realized that loading very big files only happens when loading a model; it could also happen when loading very large images, but that is not my case.
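
For completeness, the usual ImageNet-style preprocessing goes from 256x256 images to 224x224 network inputs with a crop; this is what I mean (the exact transforms used in main.py may differ):

    import torchvision.transforms as transforms

    # Training: random 224x224 crop out of the 256x256 image, plus a flip.
    train_tf = transforms.Compose([
        transforms.RandomCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

    # Evaluation: deterministic 224x224 center crop.
    val_tf = transforms.Compose([
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])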

After trying several settings, I found that another fix was as simple as capping the number of workers at 26 with -j 26, even though the machine I am using has 28 cores. It seems that two of them are not available, perhaps reserved for system management purposes, or that Python ran into trouble when requesting the full number of workers. Indeed, I am not sure why this caused a problem related to exceeding memory limits.
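
A quick way to check how many cores the job can actually use (relevant since the job is launched with srun, and the scheduler may pin the task to fewer cores than the node physically has):

    import multiprocessing
    import os

    print("Cores on the node:     ", multiprocessing.cpu_count())
    # On Linux, the set of cores the scheduler actually lets this process use:
    print("Cores available to job:", len(os.sched_getaffinity(0)))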

In any case, the OOM problem is now solved. I am attaching the training output of my network so that you can check whether it looks fine before closing this issue. It seems to be working pretty well on my dataset from the first epochs, doesn't it?

Cheers and thank you both again,

output.log

@ShichenLiu (Owner)

Great to know that your problem has been solved and the log looks good to me. :-)
