# What is my GPU doing? <br /><small style="font-size: 1.2rem;color: #111111; font-weight: 300;">Using PyNVML and NVDashboard to access GPU metrics.</small>

**Author** - Jacob Tomlinson (NVIDIA)

This notebook accompanies a talk of the same name presented at JupyterCon 2020.

`pynvml` provides a Python interface to NVIDIA's GPU management and monitoring functions. In this notebook we will use it to gather metrics about our GPUs and to build plots and real-time dashboards.

In [None]:
!pip install pynvml

In [None]:
import pynvml

In [None]:
pynvml.nvmlInit()
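
`nvmlInit()` raises an error when the NVIDIA driver or the NVML library cannot be loaded, so the cell above fails on a machine without a GPU. Below is a minimal sketch of a guard, assuming you would rather get a boolean back than a traceback. `try_init_nvml` is a hypothetical helper name, not part of `pynvml`.

```python
def try_init_nvml():
    """Return True if NVML initialised, False if pynvml or the driver is unavailable."""
    try:
        import pynvml
    except ImportError:
        # pynvml is not installed in this environment.
        return False
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # Raised when the driver or the NVML shared library cannot be loaded.
        return False
    return True


nvml_ready = try_init_nvml()
print(f"NVML available: {nvml_ready}")
```

If this prints `False`, the remaining cells in this notebook will not work on the current machine.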

## Querying system info

With `pynvml` you can query general system information, such as the driver version and the number of available GPUs.

In [None]:
pynvml.nvmlSystemGetDriverVersion()

In [None]:
pynvml.nvmlSystemGetNVMLVersion()

In [None]:
device_count = pynvml.nvmlDeviceGetCount()
device_count

## Querying device info

You can also query individual GPU devices for information and metrics.

To query a device you need a handle, which you retrieve using the device index. Valid indices run from `0` to `n-1`, where `n` is the number of detected devices.

In [None]:
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
handle

In [None]:
pynvml.nvmlDeviceGetName(handle)

In [None]:
f"{pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)}°C"

## Collecting data

Now that we know how to query our GPU devices for metrics, let's write a function which gathers a bunch of metrics in one go.

In [None]:
import pandas as pd
from datetime import datetime, timedelta

In [None]:
def get_data_for_gpu(gpu_index):
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    # Older pynvml releases return the name as bytes, newer ones as str.
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):
        name = name.decode()
    # Query memory once; both values are in bytes.
    memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return {
        "timestamp": datetime.now(),
        "gpu_index": gpu_index,
        "name": name,
        "utilization": pynvml.nvmlDeviceGetUtilizationRates(handle).gpu,  # percent
        "memory_used": memory.used,
        "memory_total": memory.total,
        "pcie_supported_generation": pynvml.nvmlDeviceGetMaxPcieLinkGeneration(handle),
        # Maximum link width, i.e. the number of PCIe lanes.
        "pcie_bandwidth_limit": pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle),
        # NVML reports throughput in KB/s; convert to bytes/s.
        "pcie_throughput_tx": pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_TX_BYTES) * 1e3,
        "pcie_throughput_rx": pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_RX_BYTES) * 1e3,
        "temperature": pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU),  # °C
    }

get_data_for_gpu(0)
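
The raw values are not always easy to read: memory is reported in bytes, and `pcie_bandwidth_limit` is a lane count rather than a bandwidth. The sketch below derives two friendlier numbers from a dictionary shaped like the one above. `derive_metrics` and `PCIE_GB_PER_S_PER_LANE` are hypothetical names, the per-lane figures are nominal values for each PCIe generation rather than measurements, and the sample values are made up for illustration.

```python
# Approximate usable bandwidth per PCIe lane in GB/s, keyed by generation.
# Nominal figures after encoding overhead, not measurements.
PCIE_GB_PER_S_PER_LANE = {1: 0.25, 2: 0.5, 3: 0.985, 4: 1.969, 5: 3.938}


def derive_metrics(sample):
    """Return a copy of a metrics dict with two derived, human-friendly fields."""
    derived = dict(sample)
    derived["memory_used_percent"] = 100 * sample["memory_used"] / sample["memory_total"]
    per_lane = PCIE_GB_PER_S_PER_LANE.get(sample["pcie_supported_generation"])
    # "pcie_bandwidth_limit" holds the maximum link width (number of lanes).
    derived["pcie_max_gb_per_s"] = (
        per_lane * sample["pcie_bandwidth_limit"] if per_lane is not None else None
    )
    return derived


# Hypothetical sample values for illustration only, not real measurements.
example = {
    "memory_used": 4 * 1024**3,
    "memory_total": 16 * 1024**3,
    "pcie_supported_generation": 3,
    "pcie_bandwidth_limit": 16,
}
result = derive_metrics(example)
print(result["memory_used_percent"], result["pcie_max_gb_per_s"])
```

On a real system you could pass the output of `get_data_for_gpu(0)` straight to `derive_metrics()`.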

We can then write another function which gathers metrics for all our GPUs and returns them in a `pandas` DataFrame.

In [None]:
def get_data():
    return pd.DataFrame([get_data_for_gpu(i) for i in range(device_count)])

In [None]:
df = get_data()
df

We can now see a table of all the metrics we queried for each device.

### Collecting data over time

To take this a step further let's query this information every 100ms and record it over time to create a time series DataFrame of our GPU usage.

In [None]:
import time

In [None]:
timeout_seconds = 10
samples_per_second = 10
timeout_start = time.time()

df = get_data()

while time.time() < timeout_start + timeout_seconds:
    time.sleep(1 / samples_per_second)
    df = pd.concat([df, get_data()])

# Drop the old per-sample index instead of keeping it as an extra column.
df = df.reset_index(drop=True)
df
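
The loop above sleeps for a fixed time after each query, so the real interval between samples is the sleep plus however long the queries take. It also rebuilds the DataFrame on every iteration. Below is a sketch of an alternative, assuming you want evenly spaced samples: schedule each sample against a monotonic clock and collect the results in a list. `collect_samples` is a hypothetical helper, and the lambda is a stand-in sampler so the example runs without a GPU.

```python
import time


def collect_samples(sample_fn, duration_seconds, samples_per_second):
    """Call sample_fn() at a fixed cadence and return the results as a list."""
    interval = 1 / samples_per_second
    total = round(duration_seconds * samples_per_second)
    start = time.monotonic()
    samples = []
    for i in range(total):
        # Sleep until the scheduled tick so slow queries do not stretch the interval.
        delay = start + i * interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        samples.append(sample_fn())
    return samples


# Stand-in sampler so the sketch runs without a GPU.
readings = collect_samples(lambda: {"utilization": 0}, duration_seconds=0.5, samples_per_second=10)
print(len(readings))
```

In this notebook you could pass `get_data` as `sample_fn` and combine the results once at the end with `pd.concat(samples, ignore_index=True)`.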

### Plotting

Now that we have some time series data let's plot it so we can see what our GPUs are doing.

In [None]:
%matplotlib inline

In [None]:
df.groupby("gpu_index").memory_used.plot()

In [None]:
df.groupby("gpu_index").utilization.plot()
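
Alongside the plots, a short numeric summary per GPU can be useful. The sketch below computes the minimum, mean and maximum utilization from a list of sample dictionaries using only the standard library. `summarise_utilization` is a hypothetical helper, and the readings are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarise_utilization(samples):
    """Return {gpu_index: {"min": ..., "mean": ..., "max": ...}} for a list of sample dicts."""
    by_gpu = defaultdict(list)
    for sample in samples:
        by_gpu[sample["gpu_index"]].append(sample["utilization"])
    return {
        gpu: {"min": min(values), "mean": mean(values), "max": max(values)}
        for gpu, values in by_gpu.items()
    }


# Hypothetical readings for illustration only, not real measurements.
samples = [
    {"gpu_index": 0, "utilization": 10},
    {"gpu_index": 0, "utilization": 30},
    {"gpu_index": 1, "utilization": 0},
]
summary = summarise_utilization(samples)
print(summary)
```

With the DataFrame collected earlier, `df.groupby("gpu_index").utilization.agg(["min", "mean", "max"])` should give the same kind of table.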

### Real time plotting

Seeing plots of our usage is really helpful, but what about real-time monitoring? Let's use `bokeh` to create a live plot in our notebook showing real-time GPU utilization.

In [None]:
from bokeh.io import show, push_notebook, output_notebook
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral6
from bokeh.plotting import figure
from bokeh.transform import factor_cmap

output_notebook()

First we need to query our GPU names and IDs to give unique bar names for our plot.

In [None]:
def get_gpu_names():
    gpu_data = [get_data_for_gpu(i) for i in range(device_count)]
    return [f"{gpu['name']} ({gpu['gpu_index']})" for gpu in gpu_data]

gpus = get_gpu_names()
gpus

Now let's write a function which reuses our `get_data_for_gpu()` function from before but returns only a list of GPU utilization values, which we will use to set the bar heights on our plot.

In [None]:
def update_plot_data():
    gpu_data = [get_data_for_gpu(i) for i in range(device_count)]
    return [gpu['utilization'] for gpu in gpu_data]

utilization = update_plot_data()
utilization

With this data we can create a `ColumnDataSource` and display the graph below.

In [None]:
source = ColumnDataSource(data=dict(gpus=gpus, utilization=utilization))

# Note: Bokeh 3.0 removed `plot_height`; pass `height=350` instead on newer versions.
p = figure(x_range=gpus, plot_height=350, toolbar_location=None, title="GPU Utilization")

r = p.vbar(x='gpus', top='utilization', width=0.9, source=source,
       line_color='white', fill_color=factor_cmap('gpus', palette=Spectral6, factors=gpus))

p.y_range.start = 0
p.y_range.end = 100

target = show(p, notebook_handle=True)

Lastly, we need to poll `pynvml` every 100ms, update the data source with the new values, and tell `bokeh` to re-render the plot. The loop runs until you interrupt the kernel.

In [None]:
while True:
    r.data_source.data["utilization"] = update_plot_data()
    push_notebook(handle=target)
    time.sleep(0.1)
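
The loop above only stops when the kernel is interrupted, which leaves a traceback and no chance to clean up. Below is a sketch of a polling helper, assuming you want an optional update limit and a cleanup step that always runs. `poll` and its parameters are hypothetical names, and the lambdas are stand-in callbacks so the example runs without a GPU or a plot.

```python
import time


def poll(update_fn, interval_seconds=0.1, max_updates=None, cleanup_fn=None):
    """Call update_fn() repeatedly; stop after max_updates or on interrupt."""
    count = 0
    try:
        while max_updates is None or count < max_updates:
            update_fn()
            count += 1
            time.sleep(interval_seconds)
    except KeyboardInterrupt:
        # Interrupting the kernel lands here instead of printing a traceback.
        pass
    finally:
        # Runs whether the loop finished, was interrupted, or raised an error.
        if cleanup_fn is not None:
            cleanup_fn()
    return count


# Stand-in callbacks so the sketch runs without a GPU or a Bokeh plot.
events = []
n = poll(lambda: events.append("update"), interval_seconds=0.01, max_updates=3,
         cleanup_fn=lambda: events.append("cleanup"))
print(n, events)
```

In this notebook, `update_fn` could wrap the two lines that set `r.data_source.data["utilization"]` and call `push_notebook(handle=target)`, and `cleanup_fn` could be `pynvml.nvmlShutdown` to release NVML once you are done.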