
Add TensorFlow 2.2.0 support (#46)
Add the support for TensorFlow 2.2.0 which matches
the code level used in the WML CE early access conda
channel.
smatzek committed Nov 2, 2020
1 parent 2157ec4 commit 0ec3709
Showing 5 changed files with 3,023 additions and 68 deletions.
34 changes: 5 additions & 29 deletions README.md
@@ -26,11 +26,14 @@ previously possible and, ultimately, generate more accurate results.

TFLMS is built into the `tensorflow-gpu` conda package so it is installed by
default when you install the GPU enabled TensorFlow from WML CE.
The support is currently available in the [WML CE conda channel](https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/#/).

The support is currently available for TensorFlow 2.2.0 in the [WML CE early access conda channel](https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/).

The support is currently available for TensorFlow 2.1.0 in the [WML CE conda channel](https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/#/).

For more information on this channel, how to add channels, and install
frameworks see [this WML CE install documentation](https://www.ibm.com/support/knowledgecenter/SS5SF7_1.7.0/navigation/wmlce_install.htm).


# How to enable TFLMS

The TFLMS functionality is disabled by default in TensorFlow and needs to be
@@ -153,33 +156,6 @@ process have socket affinity with the GPU which allows the fastest
connection paths between system memory and GPU memory, which reduces the
training or inferencing time.

# Memory defragmentation
When using very large tensors or over the course of a very long training
operation, the model's memory allocation and usage pattern may lead to
fragmented GPU memory and out-of-memory errors. When this occurs, there is
enough free memory in the GPU for the next allocation, but it is in
non-contiguous blocks. In these cases, the process will fail and output a
message like this:

```
Enough free memory to satisfy the allocation request exists but it is fragmented.
Enabling Large Model Support defragmentation may avoid this failure.
```

TFLMS is capable of defragmenting sections of GPU memory to gather a
contiguous block large enough for the request. This feature waits for current
GPU computation to finish and then relocates active tensors so that free
memory blocks can coalesce into larger contiguous blocks.

Even with the GPU computation drained, moving active tensors carries
a risk of introducing NaN errors or other instability into the model. Despite
this risk, it has performed well in multi-week training runs with very large
tensors and frequent defragmentation calls.

Due to this possible instability, Large Model Support defragmentation
is disabled by default. It can be enabled along with LMS using the `tf.config.experimental.set_lms_defrag_enabled(True)` API or the
`config.gpu_options.experimental.lms_defrag_enabled=True` ConfigProto setting.
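
For releases that still include the defragmentation feature (this commit removes it from the README at the TensorFlow 2.2.0 level), a minimal sketch of enabling it alongside LMS at the top of a training script:

```python
import tensorflow as tf

# Minimal sketch: enable Large Model Support and its defragmentation
# feature before any model is built or executed. The ConfigProto form
# (config.gpu_options.experimental.lms_defrag_enabled = True) is the
# session-based equivalent for the defragmentation setting.
tf.config.experimental.set_lms_enabled(True)
tf.config.experimental.set_lms_defrag_enabled(True)
```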

# Model memory usage analysis with allocator statistics
TFLMS adds several APIs to obtain GPU memory allocator statistics such as
the number of allocations, the peak memory usage, the amount
58 changes: 36 additions & 22 deletions examples/AllocatorStats.md
@@ -68,6 +68,25 @@ Returns the limit of reservable memory.

**Parameter:** `gpu_id`: The zero indexed GPU ID for which to retrieve the statistic.

```python
tf.experimental.get_gpu_host_bytes_in_use(numa_node)
```
Returns the current number of bytes in use in the GPU host (CPU memory) allocator.

_Since: 2.2.0_

**Parameter:** `numa_node`: The ID of the NUMA node for the allocator.

```python
tf.experimental.get_gpu_host_peak_bytes_in_use(numa_node)
```
Returns the peak number of bytes in use in the GPU host (CPU memory) allocator.

_Since: 2.2.0_

**Parameter:** `numa_node`: The ID of the NUMA node for the allocator.
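
For illustration, a minimal sketch that polls both host-allocator statistics, assuming an LMS-enabled TensorFlow 2.2.0 build and NUMA node 0:

```python
import tensorflow as tf

# Minimal sketch, assuming an LMS-enabled TensorFlow 2.2.0 build:
# report GPU host (CPU memory) allocator usage for NUMA node 0,
# for example after a training step.
numa_node = 0
in_use = tf.experimental.get_gpu_host_bytes_in_use(numa_node)
peak = tf.experimental.get_gpu_host_peak_bytes_in_use(numa_node)
print("GPU host allocator: %d bytes in use, %d bytes at peak" % (in_use, peak))
```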


## Large Model Support Specific Statistics
The Large Model Support specific statistics provide information about Large
Model Support's memory management. The statistics use the following terms:
@@ -80,9 +99,6 @@ Inactive tensors are those tensors which are not currently being used by an
executing operation or a soon-to-be executing operation.
* reclaim bytes - Reclaimed bytes are the bytes of inactive tensors which have
been moved from GPU memory to the system (host) memory.
* defragmentation - A method of producing contiguous memory blocks by moving
active bytes to allow free memory blocks between the active bytes to coalesce
into larger contiguous blocks.


```python
@@ -114,41 +130,39 @@ Returns the number of reclaimed bytes.

**Parameter:** `gpu_id`: The zero indexed GPU ID for which to retrieve the statistic.


```python
tf.experimental.get_num_single_reclaims(gpu_id)
tf.experimental.get_current_bytes_reclaimed(gpu_id)
```
Large Model Support will reclaim the bytes of single tensors when possible.
This returns the number of times single tensors' bytes were reclaimed.
Returns the current number of reclaimed bytes.

_Since: 2.2.0_

**Parameter:** `gpu_id`: The zero indexed GPU ID for which to retrieve the statistic.


```python
tf.experimental.get_num_full_reclaims(gpu_id)
tf.experimental.get_peak_bytes_reclaimed(gpu_id)
```
When no single tensor reclamation is able to free enough GPU memory for the
allocation request, all tensors are reclaimed. This returns the number
of times all tensors were reclaimed.
Returns the peak number of reclaimed bytes.

**Parameter:** `gpu_id`: The zero indexed GPU ID for which to retrieve the statistic.
_Since: 2.2.0_

**Parameter:** `gpu_id`: The zero indexed GPU ID for which to retrieve the statistic.
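
A minimal sketch that reads both reclaim-byte counters added in 2.2.0, assuming GPU 0:

```python
import tensorflow as tf

# Minimal sketch, assuming GPU 0 and an LMS-enabled TensorFlow 2.2.0 build:
# compare the bytes currently reclaimed to host memory with the peak value.
gpu_id = 0
current_gib = tf.experimental.get_current_bytes_reclaimed(gpu_id) / 1073741824.0
peak_gib = tf.experimental.get_peak_bytes_reclaimed(gpu_id) / 1073741824.0
print("Reclaimed: %.2f GiB now, %.2f GiB at peak" % (current_gib, peak_gib))
```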

```python
tf.experimental.get_num_defragmentations(gpu_id)
tf.experimental.get_num_single_reclaims(gpu_id)
```
GPU memory may become fragmented such that there are no contiguous blocks which
can fulfill an allocation request, even after reclaiming all inactive
tensors. In this case, active tensors may be moved to allow free blocks to be
coalesced to produce a contiguous memory block large enough to fulfill the
allocation request. The defragmentation function of Large Model Support is
disabled by default. This API returns the number of times defragmentation was
performed.
Large Model Support will reclaim the bytes of single tensors when possible.
This returns the number of times single tensors' bytes were reclaimed.

**Parameter:** `gpu_id`: The zero indexed GPU ID for which to retrieve the statistic.


```python
tf.experimental.get_bytes_defragged(gpu_id)
tf.experimental.get_num_full_reclaims(gpu_id)
```
The number of bytes moved during GPU memory defragmentation.
When no single tensor reclamation is able to free enough GPU memory for the
allocation request, all tensors are reclaimed. This returns the number
of times all tensors were reclaimed.

**Parameter:** `gpu_id`: The zero indexed GPU ID for which to retrieve the statistic.
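
Taken together, the reclaim statistics give a quick picture of how often LMS intervened during a run. A minimal sketch, assuming GPU 0 and mirroring the counters logged by `examples/callbacks.py`:

```python
import tensorflow as tf

# Minimal sketch, assuming GPU 0: summarize LMS activity after training,
# using the same statistics the examples/callbacks.py logger records.
gpu_id = 0
print("allocations:    ", tf.experimental.get_num_allocs(gpu_id))
print("single reclaims:", tf.experimental.get_num_single_reclaims(gpu_id))
print("full reclaims:  ", tf.experimental.get_num_full_reclaims(gpu_id))
print("GiB reclaimed:  ",
      tf.experimental.get_bytes_reclaimed(gpu_id) / 1073741824.0)
```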
10 changes: 0 additions & 10 deletions examples/ManyModel.py
@@ -134,8 +134,6 @@ def get_callbacks(args):
def run_model(args):
if args.lms:
tf.config.experimental.set_lms_enabled(True)
if args.lms_defrag:
tf.config.experimental.set_lms_defrag_enabled(True)

image_dim = args.image_size
opt = tf.keras.optimizers.RMSprop()
@@ -209,14 +207,6 @@ def main():
help='Disable LMS (Default)')
parser.set_defaults(lms=False)

defrag_group = parser.add_mutually_exclusive_group(required=False)
defrag_group.add_argument('--lms_defrag', dest='lms_defrag',
action='store_true',
help='Enable LMS defragmentation')
defrag_group.add_argument('--no-lms_defrag', dest='lms_defrag',
action='store_false',
help='Disable LMS defragmentation (Default)')
parser.set_defaults(lms_defrag=False)
lms_stats = parser.add_mutually_exclusive_group(required=False)
lms_stats.add_argument('--lms_stats', dest='lms_stats', action='store_true',
help='Log LMS per-step stats to a file named '
9 changes: 2 additions & 7 deletions examples/callbacks.py
@@ -27,7 +27,7 @@
nvtx.nvtxMarkA.restype = None

STATS_KEYS = ['time', 'allocs', 'reclaim_ones',
'reclaim_alls', 'defrags', 'gib_reclaimed', 'gib_defragged']
'reclaim_alls', 'gib_reclaimed']

class CudaProfileCallback(Callback):
def __init__(self, profile_epoch, profile_batch_start, profile_batch_end):
@@ -66,9 +66,7 @@ def _get_stats(self):
stats['allocs'] = tf.experimental.get_num_allocs(self._gpu_id)
stats['reclaim_ones'] = tf.experimental.get_num_single_reclaims(self._gpu_id)
stats['reclaim_alls'] = tf.experimental.get_num_full_reclaims(self._gpu_id)
stats['defrags'] = tf.experimental.get_num_defragmentations(self._gpu_id)
stats['gib_reclaimed'] = tf.experimental.get_bytes_reclaimed(self._gpu_id) / 1073741824.0
stats['gib_defragged'] = tf.experimental.get_bytes_defragged(self._gpu_id) / 1073741824.0
return stats

def step_begin(self):
@@ -114,9 +112,7 @@ def write_step_stats(logfile, step_type, epoch, step_num, step_stats):
row.append(step_stats['allocs'])
row.append(step_stats['reclaim_ones'])
row.append(step_stats['reclaim_alls'])
row.append(step_stats['defrags'])
row.append(step_stats['gib_reclaimed'])
row.append(step_stats['gib_defragged'])
with open(logfile, 'a+', newline='') as csvfile:
statswriter = csv.writer(csvfile)
statswriter.writerow(row)
@@ -127,8 +123,7 @@ def write_step_log_header(logfile):
statswriter = csv.writer(csvfile)
statswriter.writerow(['step type', 'epoch', 'step',
'duration', 'allocs', 'reclaimOnes',
'reclaimAlls', 'defrags',
'GiB reclaimed', 'GiB defragged'])
'reclaimAlls', 'GiB reclaimed'])


class LMSStatsLogger(Callback):
