Multigpu and change from epoch to step #783

Merged
merged 3 commits into OpenNMT:master from vince62s:refactoring3 on Jul 3, 2018

Conversation

@vince62s (Contributor) commented on Jun 27, 2018

Ready for review.

It still requires some documentation on how to use multi-GPU training and the new step-based options.

For multi-GPU, there are extra options intended for a future multi-node implementation; I will leave them in place. Right now the only thing needed is "-gpuid 0 1 2 3" (see the sketch below).
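For context, a minimal sketch (hypothetical names, not this PR's exact code) of how such a "-gpuid 0 1 2 3" list can map to one training process per listed device:

import torch.multiprocessing as mp

def worker(device_id, opt):
    # Each process would pin itself to device_id, join the process group via
    # the configured -gpu_backend (e.g. nccl), and run the single-GPU loop.
    pass

def launch(opt):
    if len(opt.gpuid) > 1:
        procs = [mp.Process(target=worker, args=(d, opt), daemon=True)
                 for d in opt.gpuid]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
    else:
        # single GPU, or CPU when -gpuid is empty
        worker(opt.gpuid[0] if opt.gpuid else -1, opt)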

@srush approved these changes on Jul 2, 2018

Looks great.

My only worry is making the transition a bit easier for people. Do you think we could add some logging that tells users the rough "steps per epoch"?
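For illustration, a helper along those lines might look like this (hypothetical name and signature, assuming a fixed sentence-level batch size):

import logging
import math

logger = logging.getLogger(__name__)

def log_steps_per_epoch(num_examples, batch_size, accum_count=1, n_gpu=1):
    # Approximate number of optimizer steps needed to see the dataset once,
    # so users migrating from -epochs can pick a sensible -train_steps value.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps = math.ceil(batches_per_epoch / (accum_count * max(n_gpu, 1)))
    logger.info("Approximately %d steps per epoch "
                "(%d examples, batch_size=%d, accum_count=%d, gpus=%d)",
                steps, num_examples, batch_size, accum_count, n_gpu)
    return steps

Logging this once at startup would give users migrating from -epochs a concrete number to plug into -train_steps.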

# GPU
group.add_argument('-gpuid', default=[], nargs='+', type=int,
                   help="Use CUDA on the listed devices.")
group.add_argument('-gpu_rank', default=0, nargs='+', type=int,

@srush (Contributor) commented on Jul 2, 2018

Can you give clearer comments here? Make it evident that these are used for distributed training. Maybe use a --dist prefix.
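As an illustration of that suggestion (hypothetical wording, not the final help text), the options could spell out their distributed-only role:

import argparse

parser = argparse.ArgumentParser()
group = parser.add_argument_group('Distributed')
group.add_argument('-gpu_rank', default=0, nargs='+', type=int,
                   help="Distributed training only: rank of this process "
                        "in the process group (one rank per listed GPU).")
group.add_argument('-gpu_backend', default='nccl', type=str,
                   help="Distributed training only: torch.distributed "
                        "backend to use (nccl, gloo, ...).")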

group.add_argument('-valid_batch_size', type=int, default=32,
                   help='Maximum batch size for validation')
group.add_argument('-max_generator_batches', type=int, default=32,
                   help="""Maximum batches of words in a sequence to run
                   the generator on in parallel. Higher is faster, but
                   uses more memory.""")
group.add_argument('-epochs', type=int, default=13,

@srush (Contributor) commented on Jul 2, 2018

Could you make it so there is a clear error message when people try to use this argument? Maybe something like this: https://gist.github.com/alzix/4868d66115a55567d3b43b32742f6f4f, where the flag stays but triggers a deprecation warning and quits out.
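A sketch in the spirit of the linked gist (hypothetical code, neither the gist's nor this PR's exact implementation): the flag stays registered, but using it aborts with a clear message.

import argparse

class DeprecateAction(argparse.Action):
    # Using the deprecated flag aborts argument parsing with an explanation.
    def __call__(self, parser, namespace, values, option_string=None):
        parser.error("%s is deprecated: epoch-based training was replaced "
                     "by step-based training, use -train_steps instead."
                     % option_string)

parser = argparse.ArgumentParser()
parser.add_argument('-epochs', nargs='?', action=DeprecateAction,
                    help="Deprecated: use -train_steps instead.")

parser.error() writes the message to stderr and exits with a non-zero status, so the run stops immediately instead of silently ignoring the old option.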

@@ -0,0 +1,136 @@
""" Pytorch Distributed utils

@srush (Contributor) commented on Jul 2, 2018

Can you give this file a better name? How about utils/distributed.py?

I need to study this code a bit more, but for now I'm going to trust fb's implementation.

if self.n_gpu == 0 or (i % self.n_gpu == self.gpu_rank):
    if self.gpu_verbose > 1:
        print("GPU %d: index: %d accum: %d"
              % (self.gpu_rank, i, accum))

@srush (Contributor) commented on Jul 2, 2018

I don't think "GPU" is the right label to use in this log message. Either call it "Rank" or print the underlying gpu_id.

help="Rank the current gpu device.")
group.add_argument('-gpu_backend', default='nccl', nargs='+', type=str,
help="Type of torch distributed backend")
group.add_argument('-gpu_verbose', default=0, type=int,

@srush (Contributor) commented on Jul 2, 2018

I think this should be a boolean, or make it clear that there are logging levels. (Maybe we can start switching to the logging module: https://docs.python.org/3/library/logging.html.)
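One possible wiring of that suggestion (hypothetical mapping, not this PR's code): keep -gpu_verbose as an integer but translate it into stdlib logging levels, so per-rank chatter only appears at the more verbose settings.

import logging

def init_worker_logger(gpu_verbose, rank):
    # 0 -> warnings only, 1 -> info, 2 and above -> full per-batch debug output
    level = {0: logging.WARNING, 1: logging.INFO}.get(gpu_verbose,
                                                      logging.DEBUG)
    logging.basicConfig(level=level,
                        format="[rank %(rank)d] %(levelname)s: %(message)s")
    return logging.LoggerAdapter(logging.getLogger("onmt.trainer"),
                                 {"rank": rank})

# e.g. logger = init_worker_logger(opt.gpu_verbose, opt.gpu_rank)
#      logger.debug("index %d, accum %d", i, accum)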

@@ -343,26 +309,46 @@ def start_report_manager(self, start_time=None):
        else:
            self.report_manager.start_time = start_time

    def report_training(self, *args, **kwargs):

    def maybe_gather_stats(self, stat):

@srush (Contributor) commented on Jul 2, 2018

Can we just call this _gather_stats? I think it's private, right?

            return onmt.utils.Statistics.all_gather_stats(stat)
        return stat

    def maybe_report_training(self, step, num_steps, learning_rate,

@srush (Contributor) commented on Jul 2, 2018

_report_training?

@@ -3,6 +3,9 @@
import time
import math
import sys
import torch
import onmt

@srush (Contributor) commented on Jul 2, 2018

Is this needed? I don't think utils should require the base package. Can you just import the file that is needed?
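For illustration, the narrower import could look like this (module path taken from the snippet below; it would become onmt.utils.distributed if the rename suggested earlier is applied):

# Pull in only the helper that Statistics needs, not the whole onmt package.
from onmt.utils.multi_utils import all_gather_list

# The call in the snippet below would then read:
# all_stats = all_gather_list(stat_list, max_size=max_size)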

all_stats = onmt.utils.multi_utils.all_gather_list(stat_list,
                                                   max_size=max_size)
our_rank = torch.distributed.get_rank()

@srush (Contributor) commented on Jul 2, 2018

This is cool.

train.py (outdated)
import onmt.opts as opts
from train_multi import main as multi_main
from train_single import main as single_main

@srush (Contributor) commented on Jul 2, 2018

Any chance we could remove this as a separate file? Ideally train.py would contain all the code for single_main, and multi_main would be the only separate file. I want to have as few base-level files as possible.
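A rough sketch of the layout being suggested (hypothetical structure, not the merged code): train.py parses the options once and branches to the multi-process entry point only when several GPUs are listed.

import argparse

import onmt.opts as opts
from train_single import main as single_main
from train_multi import main as multi_main


def main():
    parser = argparse.ArgumentParser(description='train.py')
    opts.model_opts(parser)   # option groups assumed to live in onmt.opts
    opts.train_opts(parser)
    opt = parser.parse_args()

    if len(opt.gpuid) > 1:
        multi_main(opt)       # one worker process per listed device
    else:
        single_main(opt)


if __name__ == "__main__":
    main()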

@vince62s changed the title from "[WIP] multigpu and change from epoch to step" to "Multigpu and change from epoch to step" on Jul 3, 2018

@vince62s (Contributor) commented on Jul 3, 2018

@srush I took care of most of the requested changes.
Mixing train_multi and train_single into train.py would make things very unclear, so I chose to move them into the onmt package instead.
I changed print to logger.info in many places where it was easy. There are still some prints, but I think that's OK for now.
Should be fine to merge.

@srush merged commit 4d17982 into OpenNMT:master on Jul 3, 2018

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)

@vince62s deleted the vince62s:refactoring3 branch on Jul 4, 2018
