Conversation

@williamFalcon
Contributor

@williamFalcon williamFalcon commented Nov 1, 2019

This PR does the following:

  1. Fixes #446 (Add a way to operate on all outputs from training_step).
  2. Adds a hook for modifying DDP init. Related to #353 (Support alternative distributed communication).
  3. Adds a hook for modifying apex.
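
For illustration, a rough sketch of what overriding the new apex hook from item 3 might look like on a LightningModule subclass. The hook name configure_apex and its signature are assumptions here, not taken from this PR's diff; only apex's amp.initialize call is a known API.

import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    # hypothetical override of the apex hook; name and signature are assumed
    def configure_apex(self, amp, model, optimizers, amp_level):
        # let apex patch the model and optimizers for mixed precision,
        # then hand the wrapped objects back to the trainer
        model, optimizers = amp.initialize(model, optimizers, opt_level=amp_level)
        return model, optimizers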

return loss, dict with metrics for tqdm
:param called with batch, batch_nb
additional: optimizer_i if multiple optimizers used
:return:
Collaborator

what's the return? if there is none, drop this line
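
For reference, a minimal sketch of what the filled-in docstring and a matching training_step could look like. The MyModel class, the toy layer, and the extra dict key are illustrative assumptions; only the "loss plus tqdm metrics" return described above comes from the PR.

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 10)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_nb):
        """Called with batch, batch_nb (and optimizer_i when multiple optimizers are used).

        :return: dict with the loss and any metrics to show in the tqdm bar
        """
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        # keys other than 'loss' are illustrative; they are surfaced in the progress bar
        return {'loss': loss, 'train_loss': loss}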

Must return model.
:param model:
:param device_ids:
:return:
Collaborator

return model...
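
A sketch of an override that satisfies this docstring, with the :return: filled in as suggested. Wrapping with torch's DistributedDataParallel is an assumption (Lightning may use its own wrapper); only the model and device_ids parameters come from the hunk above.

import pytorch_lightning as pl
from torch.nn.parallel import DistributedDataParallel

class MyModel(pl.LightningModule):
    def configure_ddp(self, model, device_ids):
        """Override how the model is wrapped for distributed training.

        :param model: the LightningModule to wrap
        :param device_ids: GPU ids assigned to this process
        :return: the DDP-wrapped model
        """
        model = DistributedDataParallel(model, device_ids=device_ids)
        return model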

output[k] = self.reduce_distributed_output(output[k], nb_gpus)

# do nothing when there's a scalar
elif isinstance(output[k], torch.Tensor) and output[k].dim() == 0:
Collaborator

in such case, you can skip this branching condition, right?
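
For context, a reconstructed sketch of how the surrounding reduce_distributed_output method might be structured; the details are assumptions, not the exact trainer code. If the method later reduces per-GPU tensors by their first dimension, a 0-d tensor would error on size(0), so the explicit no-op branch is one way to keep scalars out of that path rather than being redundant.

import torch

class TrainerSketch:
    # reconstructed sketch of the method under review; structure is an assumption
    def reduce_distributed_output(self, output, nb_gpus):
        # nothing to reduce on a single device
        if nb_gpus <= 1:
            return output

        # a bare tensor carries one value per GPU: average it
        if isinstance(output, torch.Tensor):
            return output.mean()

        for k in output:
            # recurse into nested dicts
            if isinstance(output[k], dict):
                output[k] = self.reduce_distributed_output(output[k], nb_gpus)

            # do nothing when there's a scalar (0-d tensors have no dim 0 to reduce over)
            elif isinstance(output[k], torch.Tensor) and output[k].dim() == 0:
                pass

            # reduce metrics that carry one entry per GPU
            elif isinstance(output[k], torch.Tensor) and output[k].size(0) == nb_gpus:
                output[k] = torch.mean(output[k])

        return output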

@williamFalcon williamFalcon changed the title from Ddp2 fix to [WIP] Ddp2 fix on Nov 3, 2019
@williamFalcon
Contributor Author

GPU tests passed. Waiting on CircleCI.

@williamFalcon williamFalcon merged commit 3e38005 into master Nov 5, 2019
@neggert
Contributor

neggert commented Nov 5, 2019

It's probably too late, but it occurs to me that since DDP and AMP have nothing to do with the actual research code, it might be better to pass them as callbacks or something rather than bundling them with the model. Maybe something to think about for the future.

@williamFalcon williamFalcon deleted the ddp2_fix branch November 5, 2019 15:30
@Borda
Collaborator

Borda commented Nov 6, 2019

Do we have CircleCI? I do not see its config in the repo master... @williamFalcon
