# Device placement


In this reading, we look at device placement. We will see how to access the device associated with a given tensor, and compare the performance of GPUs and CPUs.

When running this notebook, ensure that the GPU runtime type is selected (Runtime -> Change runtime type).

In [None]:
%%bash
pip install --no-cache-dir -qU pip wheel
pip install --no-cache-dir -qU numpy==1.23.0 pandas matplotlib seaborn scikit-learn
pip install --no-cache-dir -qU tensorflow pydot
pip check

In [None]:
import os
import numpy as np
import pandas as pd

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt

import seaborn as sns
# sns.set() would reset the style to its default, so set style and font together
sns.set_theme(style='whitegrid', font='DejaVu Sans')

import tensorflow as tf
tf.keras.utils.set_random_seed(42)
tf.get_logger().setLevel('ERROR')

## Get the physical devices

First, we can list the physical devices available.

In [3]:
# List all physical devices

tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

If you have enabled the GPU runtime, then you should see the GPU device in the above list.

We can also check specifically for the GPU or CPU devices.

In [4]:
# Check for GPU devices

tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [5]:
# Check for CPU devices

tf.config.list_physical_devices('CPU')

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

We can get the GPU device name as follows:

In [6]:
# Get the GPU device name

tf.test.gpu_device_name()

2022-12-31 08:11:51.809942: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-31 08:11:51.811681: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-31 08:11:51.812303: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-31 08:11:51.812713: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least on

'/device:GPU:0'
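
The calls above can be combined into a small helper that chooses a device string at runtime. The following is a minimal sketch: the name `pick_device` is an illustrative assumption and not part of TensorFlow, and it only ever considers the first GPU.

```python
import tensorflow as tf

def pick_device():
    """Return "GPU:0" if TensorFlow sees at least one GPU, otherwise "CPU:0"."""
    gpus = tf.config.list_physical_devices('GPU')
    return "GPU:0" if gpus else "CPU:0"

device = pick_device()
print("Selected device:", device)

# Create a small Tensor inside the chosen device context
with tf.device(device):
    y = tf.ones([2, 2]) * 3.0

print(y.device)
```

With the GPU runtime enabled, the selected device should be `GPU:0`; on a CPU-only machine the same code falls back to `CPU:0`.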

## Placement of Tensor operations

TensorFlow will automatically allocate Tensor operations to a physical device, and will handle the copying between CPU and GPU memory if necessary. 

Let's define a random Tensor:

In [8]:
# Define a Tensor

x = tf.random.uniform([3, 3])
x

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[0.68789124, 0.48447883, 0.9309944 ],
       [0.252187  , 0.73115396, 0.89256823],
       [0.94674826, 0.7493341 , 0.34925628]], dtype=float32)>

We can see which device this Tensor is placed on using its `device` attribute.

In [9]:
# Get the Tensor device

x.device

'/job:localhost/replica:0/task:0/device:GPU:0'

The above string will end with `'GPU:K'` if the Tensor is placed on the `K`-th GPU device. Because `device` is an ordinary Python string, we can also check whether a Tensor is placed on a specific device by using the string method `endswith`:

In [10]:
# Test for device allocation

print("Is the Tensor on CPU #0:  ")
print(x.device.endswith('CPU:0'))
print('')
print("Is the Tensor on GPU #0:  ")
print(x.device.endswith('GPU:0'))

Is the Tensor on CPU #0:  
False

Is the Tensor on GPU #0:  
True
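
The device string can also be split into its device type and index instead of being tested with `endswith`. Below is a minimal sketch using a regular expression. `parse_device` is a hypothetical helper written for this reading, and it assumes the string ends in `device:<TYPE>:<INDEX>`, as in the output shown earlier.

```python
import re

def parse_device(device_string):
    """Extract (device_type, index) from a fully qualified device string."""
    match = re.search(r"device:([A-Za-z_]+):(\d+)$", device_string)
    if match is None:
        raise ValueError("Unrecognised device string: {!r}".format(device_string))
    return match.group(1), int(match.group(2))

# Device string copied from the output of `x.device` above
device_type, index = parse_device('/job:localhost/replica:0/task:0/device:GPU:0')
print(device_type, index)
```

This makes it easy to branch on the device type (for example `device_type == 'GPU'`) without hard-coding the index.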


## Specifying device placement

As mentioned previously, TensorFlow will automatically allocate Tensor operations to specific devices. However, it is possible to force placement on specific devices, if they are available. 

We can view the benefits of GPU acceleration by running some tests and placing the operations on the CPU or GPU respectively.

In [11]:
# Define simple tests to time computation speed

import time

def time_matadd(x):
    """Time 10 element-wise additions of x with itself."""
    start = time.time()
    for _ in range(10):
        tf.add(x, x)
    result = time.time() - start
    print("Matrix addition (10 loops): {:0.2f} ms".format(1000*result))

def time_matmul(x):
    """Time 10 matrix multiplications of x with itself."""
    start = time.time()
    for _ in range(10):
        tf.matmul(x, x)
    result = time.time() - start
    print("Matrix multiplication (10 loops): {:0.2f} ms".format(1000*result))

In the following cell, we run the above tests inside the context `with tf.device("CPU:0")`, which forces the operations to be run on the CPU.

In [12]:
# Force execution on CPU

print("On CPU:")
with tf.device("CPU:0"):
    x = tf.random.uniform([1000, 1000])
    assert x.device.endswith("CPU:0")
    time_matadd(x)
    time_matmul(x)

On CPU:
Matrix addition (10 loops): 12.74 ms
Matrix multiplication (10 loops): 137.07 ms


And now run the same operations on the GPU:

In [17]:
# Force execution on GPU #0 if available

if tf.config.list_physical_devices("GPU"):
    print("On GPU:")
    with tf.device("GPU:0"):
        x = tf.random.uniform([1000, 1000])
        assert x.device.endswith("GPU:0")
        time_matadd(x)
        time_matmul(x)

On GPU:
Matrix addition (10 loops): 1.77 ms
Matrix multiplication (10 loops): 2.92 ms


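Note the significant time difference between the two devices: in this run, matrix multiplication took about 137 ms on the CPU and about 3 ms on the GPU. Exact timings will vary with the hardware.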
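
The timing functions above measure a single batch of 10 calls. As a hedged sketch of a slightly more careful measurement, the hypothetical helper `time_op` below runs a warm-up call first, so that one-off setup cost is excluded, and uses `time.perf_counter`, which is intended for measuring short durations. The workload here is a pure-Python stand-in so that the sketch runs without a GPU.

```python
import time

def time_op(fn, repeats=10, warmup=1):
    """Return the mean wall-clock time in milliseconds of calling fn()."""
    # Warm-up calls are executed but not timed
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    elapsed = time.perf_counter() - start
    return 1000 * elapsed / repeats

# Stand-in workload (an assumption for illustration only)
mean_ms = time_op(lambda: sum(i * i for i in range(10_000)))
print("Mean time per call: {:0.3f} ms".format(mean_ms))
```

To time a TensorFlow operation, one could pass something like `lambda: tf.matmul(x, x).numpy()`. Calling `.numpy()` copies the result back to host memory, so the measurement waits for the computation to finish.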

## Model training

Finally, we will demonstrate that placing a model on the GPU also speeds up training.

In [18]:
# Load the MNIST dataset

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train/255., x_test/255.

In [19]:
# Reduce the dataset size to speed up the test

x_train, y_train = x_train[:1000], y_train[:1000]

In [20]:
# Define a function to build the model

def get_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3),
            activation='relu', padding='same', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3),
            activation='relu', padding='same'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Conv2D(filters=128, kernel_size=(3, 3),
            activation='relu', padding='same'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(units=64, activation='relu'),
        tf.keras.layers.Dense(units=10, activation='softmax')
    ])
    return model

In [22]:
# Time a training run on the CPU

with tf.device("CPU:0"):
    model = get_model()
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(
            learning_rate=1e-3), 
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
    start = time.time()
    model.fit(
        x=x_train[..., np.newaxis], y=y_train,
        epochs=5, verbose=0)
    result = time.time() - start

print("CPU training time: {:0.2f}ms".format(1000 * result))

CPU training time: 2902.69ms


In [26]:
# Time a training run on the GPU

with tf.device("GPU:0"):
    model = get_model()
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(
            learning_rate=1e-3), 
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
    start = time.time()
    model.fit(x=x_train[..., np.newaxis], y=y_train,
        epochs=5, verbose=0)
    result = time.time() - start

print("GPU training time: {:0.2f}ms".format(1000 * result))

GPU training time: 1922.78ms
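
The speedups implied by the timings reported in this reading can be computed directly. The numbers below are copied from the outputs above. They come from a single run and will differ on other hardware.

```python
# (CPU time, GPU time) in milliseconds, copied from the outputs in this reading
timings_ms = {
    "matrix addition": (12.74, 1.77),
    "matrix multiplication": (137.07, 2.92),
    "model training": (2902.69, 1922.78),
}

# Speedup = CPU time divided by GPU time
speedups = {name: cpu / gpu for name, (cpu, gpu) in timings_ms.items()}

for name, speedup in speedups.items():
    print("{}: {:.1f}x faster on GPU".format(name, speedup))
```

In this run the training speedup (about 1.5x) is much smaller than the matrix multiplication speedup (about 47x). One plausible explanation is that, with only 1000 training examples, fixed overheads account for a larger share of the total training time.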


## Further reading and resources 
* https://www.tensorflow.org/tutorials/customization/basics#gpu_acceleration