In [None]:
# Copyright 2019 Google LLC
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<a target="_blank" href="https://colab.research.google.com/github/GoogleCloudPlatform/keras-idiomatic-programmer/blob/master/notebooks/estimating_your_training_utilization.ipynb">
<img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>

# Managing your CPU/GPU utilization during training with Warm-Up

## Objective

This notebook demonstrates a simple way to estimate and control your CPU/GPU utilization during training. Training infrastructure currently does not auto-scale (unlike batch prediction). Instead, you set your utilization strategy when you start your training job.

If you're training in the cloud, a poor utilization strategy results in either under-utilization or over-utilization. With under-utilization, you're leaving compute power (money) on the table. With over-utilization, the training job may become bottlenecked or be excessively interrupted by other processes.

When you're under-utilizing, the question to consider is: do I scale up (larger instances) or do I scale out (distributed training)?

In this notebook, we will use short training runs (warm-start) combined with the psutil module to see what our utilization will be when we do a full training run. Since we are only interested in utilization, we don't care what the accuracy is; we can just use de facto (best-guess) hyperparameters.

In my experience, the sweet spot for utilization on a single instance is about 70%. That leaves enough compute power for background processes that pre-empt the training and, if you're training headless, for you to ssh in and monitor the system.
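
As a rough illustration of how a reading could be compared against that sweet spot, here is a minimal sketch. The function name `classify_utilization` and the `tolerance` band of 10 points are assumptions made up for this example; only the 70% target comes from the text above.

```python
def classify_utilization(average_percent, target=70.0, tolerance=10.0):
    """Label an average CPU utilization reading relative to a target.

    `target` follows the ~70% sweet spot mentioned above; `tolerance` is an
    arbitrary band chosen for illustration, not a recommended value.
    """
    if not 0.0 <= average_percent <= 100.0:
        raise ValueError("average_percent must be between 0 and 100")
    if average_percent < target - tolerance:
        return "under-utilized"
    if average_percent > target + tolerance:
        return "over-utilized"
    return "on target"


# Three made-up readings, one for each outcome
for reading in (35.0, 72.5, 96.0):
    print(reading, classify_utilization(reading))
```

The band is deliberately wide; on a shared instance you may want a narrower or asymmetric one.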

## Imports

We will be using tensorflow and the psutil module. This notebook will work with both TF 1.X and TF 2.0.

In [None]:
import tensorflow
import psutil

## Get Dataset

Let's use the MNIST dataset (for brevity) as a stand-in for the dataset you will use for training. We will draw from the dataset during the warm-start training in the same manner that we plan to in the later full training. In this case, because the whole dataset is small enough to fit in memory, we load it as a multi-dimensional NumPy array.
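
To make "small enough to fit in memory" concrete, the arithmetic can be sketched as follows. The helper `array_mib` is a hypothetical name for this example; the shape is that of the MNIST training images (60,000 images of 28x28 pixels).

```python
import math


def array_mib(shape, bytes_per_element):
    """Return the in-memory size of a dense array in MiB (2**20 bytes)."""
    return math.prod(shape) * bytes_per_element / 2**20


train_shape = (60000, 28, 28)  # MNIST training images

# Raw pixels are stored as uint8 (1 byte each)
print(round(array_mib(train_shape, 1), 1))
# After normalization the array is float32 (4 bytes each)
print(round(array_mib(train_shape, 4), 1))
```

That works out to roughly 45 MiB as raw `uint8` pixels and roughly 179 MiB once converted to `float32`, which fits comfortably in memory on most instances.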

In [None]:
# Get the builtin MNIST dataset
from tensorflow.keras.datasets import mnist
import numpy as np
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize the data
x_train = (x_train / 255.0).astype(np.float32)

## Make Model

We will use the `create_model()` function to create simple DNN models, which are sufficient for training on MNIST. The function builds a model with N hidden layers (`n_layers`), each with the same number of nodes (`n_nodes`).

In [None]:
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras import Sequential

def create_model(n_layers, n_nodes):
    model = Sequential()
    model.add(Flatten(input_shape=(28, 28)))
    for _ in range(n_layers):
        model.add(Dense(n_nodes, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
    return model

## Do a warm start to view the CPU/GPU utilization

### Small Model, 1 layer 128 nodes

Okay, let's start. In our first test, we try a model with one hidden dense layer of 128 nodes.

We then call `psutil.cpu_percent(interval=None)`. With `interval=None`, the call returns immediately and sets a checkpoint (start point) for measuring CPU utilization; the value returned by this first call is meaningless and can be ignored.

We then train the model for a couple of epochs (`epochs=2`). If your dataset is huge and drawn from storage, you might want to use just a subset of your data by setting the `steps_per_epoch` parameter.
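
If you do stream your data from storage, one way to pick a `steps_per_epoch` value is to work backwards from the fraction of the dataset you want the warm-start run to see. This is a minimal sketch: `warmup_steps` and its `fraction` parameter are hypothetical names for this example, and 32 is simply the Keras default batch size.

```python
import math


def warmup_steps(n_samples, batch_size=32, fraction=0.1):
    """Number of batches per epoch needed to cover `fraction` of the dataset."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Always run at least one step, even for tiny fractions
    return max(1, math.ceil(n_samples * fraction / batch_size))


# 10% of a 60,000-sample dataset at batch size 32
print(warmup_steps(60000))
# The full dataset, for comparison
print(warmup_steps(60000, fraction=1.0))
```

The result would be passed as `steps_per_epoch` to `model.fit()`. Keep the batch size the same as in the planned full run, since it affects utilization.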

Once the training finishes, we call `psutil.cpu_percent(interval=None, percpu=True)`. This reports the utilization of each CPU on the instance over the measured interval. We then call `psutil.cpu_percent(interval=None)` to show the average utilization across all the CPUs since the start checkpoint. Note that psutil reports CPU utilization only; GPU utilization has to be read with a separate tool such as `nvidia-smi`.
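
The checkpoint-then-read pattern can be wrapped in a small context manager so it is harder to forget a step. This is a sketch under a few assumptions: `cpu_utilization` and its `read_percent` parameter are hypothetical names for this example, and the helper sets a checkpoint for both the average and the per-CPU reading because, as far as I can tell, psutil keeps a separate baseline for each.

```python
import contextlib


@contextlib.contextmanager
def cpu_utilization(read_percent=None):
    """Collect average and per-CPU utilization across a block of code.

    `read_percent` defaults to `psutil.cpu_percent`; passing another callable
    with the same keyword arguments lets you try the helper without psutil.
    """
    if read_percent is None:
        import psutil  # imported lazily so a stand-in reader can be used instead
        read_percent = psutil.cpu_percent
    result = {}
    # Set the start checkpoints; the values returned here are ignored
    read_percent(interval=None)
    read_percent(interval=None, percpu=True)
    try:
        yield result
    finally:
        # Readings cover the time since the checkpoints above
        result["per_cpu"] = read_percent(interval=None, percpu=True)
        result["average"] = read_percent(interval=None)
```

A warm-start run would then look like `with cpu_utilization() as usage: model.fit(x_train, y_train, epochs=2)`, followed by `print(usage)`.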


In [None]:
model = create_model(1, 128)
model.summary()

import psutil
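# Set the start checkpoint for both the average and the per-CPU readings
# (psutil keeps a separate baseline for each); the return values are ignored
psutil.cpu_percent(interval=None)
psutil.cpu_percent(interval=None, percpu=True)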
model.fit(x_train, y_train, epochs=2, verbose=1)
print(psutil.cpu_percent(interval=None, percpu=True), psutil.cpu_percent(interval=None))

### Larger Model, 2 layers 1024 nodes

In our next example, we will make the model roughly 16X more computationally expensive by using two hidden layers of 1024 nodes each.
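
A quick way to sanity-check the relative sizes is to count trainable parameters, which is only a rough proxy for compute cost. The sketch below assumes the architecture built by `create_model()` (784 flattened inputs, `n_layers` hidden dense layers, a 10-way output); `dense_param_count` is a hypothetical helper name for this example.

```python
def dense_param_count(n_layers, n_nodes, n_inputs=28 * 28, n_classes=10):
    """Trainable parameters (weights plus biases) of the DNN from create_model()."""
    total = 0
    fan_in = n_inputs
    for _ in range(n_layers):
        total += fan_in * n_nodes + n_nodes  # weights + biases of a hidden layer
        fan_in = n_nodes
    total += fan_in * n_classes + n_classes  # output layer
    return total


small = dense_param_count(1, 128)
# The three configurations used in this notebook
for n_layers, n_nodes in [(1, 128), (2, 1024), (4, 2048)]:
    count = dense_param_count(n_layers, n_nodes)
    print(n_layers, n_nodes, count, round(count / small, 1))
```

By this measure the three models have 101,770, 1,863,690 and 14,217,226 parameters, so the ratios to the small model are about 18X and 140X, in the same ballpark as the rounder figures quoted in the text. The counts should match what `model.summary()` reports.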

In [None]:
model = create_model(2, 1024)
model.summary()

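# Reset the start checkpoint for both the average and the per-CPU readings
psutil.cpu_percent(interval=None)
psutil.cpu_percent(interval=None, percpu=True)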
# `workers` only applies to generator/Sequence input, so it is omitted for in-memory arrays
model.fit(x_train, y_train, epochs=2, verbose=1)
print(psutil.cpu_percent(interval=None, percpu=True), psutil.cpu_percent(interval=None))

### Even Larger Model, 4 layers 2048 nodes

In our last example, we will make the model roughly 128X more computationally expensive than the first one by using four hidden layers of 2048 nodes each.

In [None]:
model = create_model(4, 2048)
model.summary()

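# Reset the start checkpoint for both the average and the per-CPU readings
psutil.cpu_percent(interval=None)
psutil.cpu_percent(interval=None, percpu=True)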
model.fit(x_train, y_train, epochs=2, verbose=1)
print(psutil.cpu_percent(interval=None, percpu=True), psutil.cpu_percent(interval=None))