Workaround for NCCL bug in TF 1.15
tkarras committed Oct 8, 2020
1 parent 6af5afc commit 23f8bed
Showing 3 changed files with 50 additions and 16 deletions.
2 changes: 1 addition & 1 deletion Dockerfile
@@ -4,7 +4,7 @@
 # To view a copy of this license, visit
 # https://nvlabs.github.io/stylegan2/license.html
 
-FROM tensorflow/tensorflow:1.15.0-gpu-py3
+FROM tensorflow/tensorflow:1.14.0-gpu-py3
 
 RUN pip install scipy==1.3.3
 RUN pip install requests==2.22.0
4 changes: 2 additions & 2 deletions README.md
@@ -36,8 +36,8 @@ For press and other inquiries, please contact Hector Marinez at [hmarinez@nvidia
 
 * Both Linux and Windows are supported. Linux is recommended for performance and compatibility reasons.
 * 64-bit Python 3.6 installation. We recommend Anaconda3 with numpy 1.14.3 or newer.
-* TensorFlow 1.14 or 1.15 with GPU support. The code does not support TensorFlow 2.0.
-* On Windows, you need to use TensorFlow 1.14 — TensorFlow 1.15 will not work.
+* We recommend TensorFlow 1.14, which we used for all experiments in the paper, but TensorFlow 1.15 is also supported on Linux. TensorFlow 2.x is not supported.
+* On Windows you need to use TensorFlow 1.14, as the standard 1.15 installation does not include necessary C++ headers.
 * One or more high-end NVIDIA GPUs, NVIDIA drivers, CUDA 10.0 toolkit and cuDNN 7.5. To reproduce the results reported in the paper, you need an NVIDIA GPU with at least 16 GB of DRAM.
 * Docker users: use the [provided Dockerfile](./Dockerfile) to build an image with the required library dependencies.
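The TensorFlow constraints in the README hunk above can be expressed as a small compatibility check. This is a sketch for illustration only; `tf_version_ok` is a hypothetical helper, not part of the repository:

```python
# Hypothetical helper (not in the repository) that encodes the README's
# version constraints: TF 1.14 or 1.15 only, and only 1.14 on Windows.
def tf_version_ok(version: str, system: str = "Linux") -> bool:
    """Return True if this TensorFlow version is supported on this OS."""
    parts = version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    if (major, minor) not in [(1, 14), (1, 15)]:
        return False  # TF 2.x and anything older than 1.14 are unsupported.
    if system == "Windows" and minor == 15:
        return False  # Windows requires TF 1.14 (1.15 lacks C++ headers).
    return True

print(tf_version_ok("1.14.0"))             # True on any OS
print(tf_version_ok("1.15.0", "Windows"))  # False
```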
60 changes: 47 additions & 13 deletions dnnlib/tflib/optimizer.py
@@ -6,6 +6,7 @@
 
 """Helper wrapper for a Tensorflow optimizer."""
 
+import platform
 import numpy as np
 import tensorflow as tf
 
@@ -18,12 +19,9 @@
 
 from .tfutil import TfExpression, TfExpressionEx
 
-try:
-    # TensorFlow 1.13
-    from tensorflow.python.ops import nccl_ops
-except:
-    # Older TensorFlow versions
-    import tensorflow.contrib.nccl as nccl_ops
+_collective_ops_warning_printed = False
+_collective_ops_group_key = 831766147
+_collective_ops_instance_key = 436340067
 
 class Optimizer:
     """A Wrapper for tf.train.Optimizer.
@@ -193,12 +191,12 @@ def apply_updates(self, allow_no_op: bool = False) -> tf.Operation:
         # Sum gradients across devices.
         if len(self._devices) > 1:
             with tfutil.absolute_name_scope(self.scope + "/Broadcast"), tf.device(None):
-                for all_vars in zip(*[device.grad_clean.keys() for device in self._devices.values()]):
-                    if len(all_vars) > 0 and all(dim > 0 for dim in all_vars[0].shape.as_list()): # NCCL does not support zero-sized tensors.
-                        all_grads = [device.grad_clean[var] for device, var in zip(self._devices.values(), all_vars)]
-                        all_grads = nccl_ops.all_sum(all_grads)
-                        for device, var, grad in zip(self._devices.values(), all_vars, all_grads):
-                            device.grad_clean[var] = grad
+                if platform.system() == "Windows": # Windows => NCCL ops are not available.
+                    self._broadcast_fallback()
+                elif tf.VERSION.startswith("1.15."): # TF 1.15 => NCCL ops are broken: https://github.com/tensorflow/tensorflow/issues/41539
+                    self._broadcast_fallback()
+                else: # Otherwise => NCCL ops are safe to use.
+                    self._broadcast_nccl()
 
         # Apply updates separately on each device.
         for device_idx, device in enumerate(self._devices.values()):
@@ -247,7 +245,7 @@ def apply_updates(self, allow_no_op: bool = False) -> tf.Operation:
 
                 # Last device => report statistics.
                 if device_idx == len(self._devices) - 1:
-                    all_ops.append(autosummary.autosummary(self.id + "/learning_rate", self.learning_rate))
+                    all_ops.append(autosummary.autosummary(self.id + "/learning_rate", tf.convert_to_tensor(self.learning_rate)))
                     all_ops.append(autosummary.autosummary(self.id + "/overflow_frequency", tf.where(all_ok, 0, 1), condition=acc_ok))
                     if self.use_loss_scaling:
                         all_ops.append(autosummary.autosummary(self.id + "/loss_scaling_log2", device.loss_scaling_var))
@@ -286,6 +284,42 @@ def undo_loss_scaling(self, value: TfExpression) -> TfExpression:
             return value
         return value * tfutil.exp2(-self.get_loss_scaling_var(value.device)) # pylint: disable=invalid-unary-operand-type
 
+    def _broadcast_nccl(self):
+        """Sum gradients across devices using NCCL ops (fast path)."""
+        from tensorflow.python.ops import nccl_ops # pylint: disable=no-name-in-module
+        for all_vars in zip(*[device.grad_clean.keys() for device in self._devices.values()]):
+            if any(x.shape.num_elements() > 0 for x in all_vars):
+                all_grads = [device.grad_clean[var] for device, var in zip(self._devices.values(), all_vars)]
+                all_grads = nccl_ops.all_sum(all_grads)
+                for device, var, grad in zip(self._devices.values(), all_vars, all_grads):
+                    device.grad_clean[var] = grad
+
+    def _broadcast_fallback(self):
+        """Sum gradients across devices using TensorFlow collective ops (slow fallback path)."""
+        from tensorflow.python.ops import collective_ops # pylint: disable=no-name-in-module
+        global _collective_ops_warning_printed, _collective_ops_group_key, _collective_ops_instance_key
+        if all(x.shape.num_elements() == 0 for device in self._devices.values() for x in device.grad_clean.values()):
+            return
+        if not _collective_ops_warning_printed:
+            print("------------------------------------------------------------------------")
+            print("WARNING: Using slow fallback implementation for inter-GPU communication.")
+            print("Please use TensorFlow 1.14 on Linux for optimal training performance.")
+            print("------------------------------------------------------------------------")
+            _collective_ops_warning_printed = True
+        for device in self._devices.values():
+            with tf.device(device.name):
+                combo = [tf.reshape(x, [x.shape.num_elements()]) for x in device.grad_clean.values()]
+                combo = tf.concat(combo, axis=0)
+                combo = collective_ops.all_reduce(combo, merge_op='Add', final_op='Id',
+                    group_size=len(self._devices), group_key=_collective_ops_group_key,
+                    instance_key=_collective_ops_instance_key)
+                cur_ofs = 0
+                for var, grad_old in device.grad_clean.items():
+                    grad_new = tf.reshape(combo[cur_ofs : cur_ofs + grad_old.shape.num_elements()], grad_old.shape)
+                    cur_ofs += grad_old.shape.num_elements()
+                    device.grad_clean[var] = grad_new
+        _collective_ops_instance_key += 1
+
 
 class SimpleAdam:
     """Simplified version of tf.train.AdamOptimizer that behaves identically when used with dnnlib.tflib.Optimizer."""
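The `_broadcast_fallback` method in the diff above flattens each device's gradient tensors, concatenates them into a single vector, all-reduces that vector across devices, and slices the sum back into the original shapes. A minimal NumPy sketch of the same round trip, simulating two devices as plain arrays (variable names are illustrative; the real code operates on TF tensors via `collective_ops.all_reduce`):

```python
import numpy as np

# Simulated per-device gradients: each device holds the same set of
# variables with matching shapes (the names "w" and "b" are made up).
device_grads = [
    {"w": np.ones((2, 3)),     "b": np.full((3,), 2.0)},  # device 0
    {"w": np.ones((2, 3)) * 3, "b": np.full((3,), 4.0)},  # device 1
]

# Flatten and concatenate each device's gradients into one vector.
combos = [np.concatenate([g.ravel() for g in grads.values()])
          for grads in device_grads]

# All-reduce: every device ends up with the elementwise sum.
summed = np.sum(combos, axis=0)

# Slice the summed vector back into the original shapes on each device.
for grads in device_grads:
    ofs = 0
    for name, old in grads.items():
        n = old.size
        grads[name] = summed[ofs : ofs + n].reshape(old.shape)
        ofs += n

print(device_grads[0]["w"][0, 0])  # 4.0 (1 + 3)
print(device_grads[0]["b"][0])     # 6.0 (2 + 4)
```

Transporting one concatenated tensor per device needs only a single collective op per training step, which is why the fallback bumps `_collective_ops_instance_key` just once at the end.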

1 comment on commit 23f8bed

@johndpope commented on 23f8bed Dec 14, 2020


hi @tkarras
Background: I just bought a new NVIDIA 3090 card and want to get this running on it.
What are my options? Presumably I want/need to use the latest CUDA / graphics driver 455.
It seems the only route is to upgrade TensorFlow to version 2?

Assuming this is the case, if I run the official TensorFlow auto-upgrade tool (https://www.tensorflow.org/guide/upgrade) against the project:

tf_upgrade_v2 \
  --intree stylegan2/ \
  --outtree stylegan2/ \
  --reportfile report.txt

I get this report.
There are 20 'cannot be converted automatically' errors.
The scope of getting the code to work on TensorFlow 2 seems to encompass mostly the following errors:

tf.contrib.memory_stats.MaxBytesInUse: using member tf.contrib in deprecated module tf.contrib; cannot be converted automatically.
tf.contrib.opt.ScipyOptimizerInterface: cannot be converted automatically.
tf.contrib.opt.GGTOptimizer: cannot be converted automatically.

https://gist.github.com/johndpope/6fda5f9a39375437dbfceb9728ba13a0

Even though the project doesn't support TensorFlow 2, I think a new branch would be prudent to push ahead.
Contributors could get the branch working (with Windows NCCL caveats).
And then it's case closed... please consider.

RELATED - NVlabs/stylegan2-ada#32
