
Questions about multi-gpu training #14

RyanHTR opened this issue Apr 16, 2018 · 11 comments

RyanHTR commented Apr 16, 2018

This is great work. Does this code support multi-GPU training? I've tried altering NUM_GPUS and GPU_ID, but it seems that the code selects only one GPU for training. Is there any clue about this? Thanks.

@JiahuiYu (Owner)

To enable multi-GPU training, you will need to change this line to MultiGPUTrainer.
Expect some adventures when using multi-GPU for this project. I am not sure about the behavior.
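
A minimal sketch of what that swap might look like in train.py, assuming neuralgym's trainer interface and the names used elsewhere in this repo (multigpu_graph_def, config.NUM_GPUS); treat it as illustrative, not verbatim:

```python
import neuralgym as ng

# Single-GPU (original):
# trainer = ng.train.Trainer(optimizer=g_optimizer, var_list=g_vars, ...)

# Multi-GPU: pass num_gpus and a graph_def callable that builds the loss
# on each GPU, instead of a single precomputed loss tensor.
trainer = ng.train.MultiGPUTrainer(
    num_gpus=config.NUM_GPUS,       # e.g. 4
    optimizer=g_optimizer,
    var_list=g_vars,
    max_iters=config.MAX_ITERS,
    graph_def=multigpu_graph_def,   # builds per-GPU towers and returns loss
    graph_def_kwargs={
        'model': model, 'data': data, 'config': config, 'loss_type': 'g'},
)
```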

@zhiweige

@RyanHTR Hello, were you able to train the network successfully on multiple GPUs?

lipanpeng commented May 7, 2018

@RyanHTR I changed this line to MultiGPUTrainer, but I got the error "TypeError: 'NoneType' object is not callable", which I can't figure out. Did you have this problem?

zengyh1900 commented Jun 28, 2018

@JiahuiYu There is a bug behind the 'NoneType' object is not callable error: None() ends up being called.

@JiahuiYu (Owner)

@1900zyh This is not a bug. The loss should be None for multi-GPU training.

@zengyh1900

@JiahuiYu I think it should be:
assert loss is None, 'For multi-GPU training, graph_def should be provided instead of loss.'
Otherwise it will report a TypeError.
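
For context, a hypothetical sketch of where such a check could sit in the trainer's argument handling (the surrounding names are illustrative, not neuralgym's exact internals):

```python
def check_multigpu_args(loss, graph_def, num_gpus):
    """Fail fast with a clear message instead of the opaque TypeError."""
    if num_gpus > 1:
        # Without these checks, a missing graph_def is later invoked as a
        # function, raising "TypeError: 'NoneType' object is not callable".
        assert loss is None, (
            'For multi-GPU training, graph_def should be provided '
            'instead of loss.')
        assert graph_def is not None, (
            'graph_def must be a callable that builds the per-GPU loss.')
```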

@JiahuiYu (Owner)

@1900zyh Ohhhh I see. Thank you!

bis-carbon commented Apr 4, 2019

I have four GTX 1080 Ti GPUs, and each GPU can handle a batch size of 16, which means that if I use all the GPUs I should be able to set the batch size to 64. But when I do that, my GPUs run out of memory.
I am assuming here that ng.train.MultiGPUTrainer uses data parallelism to split the input data (batch size 64) across the 4 GPUs, so that each GPU gets a batch of 16 images.

Because of this issue, I can only train with a batch size of 16, whether I use 4 GPUs or 1 GPU.
What are your thoughts on this?

JiahuiYu (Owner) commented Apr 4, 2019

@bis-carbon The batch size here is the per-GPU batch size.
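
In other words, the data-parallel trainer consumes NUM_GPUS × BATCH_SIZE images per step, so BATCH_SIZE should stay at what one card can hold. A quick sketch of the arithmetic (config names assumed to match the repo's YAML):

```python
BATCH_SIZE = 16   # per-GPU batch size; what one GTX 1080 Ti can hold
NUM_GPUS = 4
effective_batch = NUM_GPUS * BATCH_SIZE
print(effective_batch)   # 64 images are consumed per training step
```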

@bis-carbon

Thank you for your quick response and great work.

JiahuiYu reopened this Aug 9, 2019
JiahuiYu added the good first issue label Aug 15, 2019
@Adhiyaman-Manickam

@1900zyh @bis-carbon @lipanpeng Hi, have you figured out how to use multiple GPUs for training? If so, kindly let me know; I am struggling. Thanks in advance.
