# NVIDIA Monitor
---

Pulls data from NVIDIA-SMI output and saves it as a time series. The data is written to TXT for ease of reading and to CSV for ease of graphing.

---

## <a name="toc"></a> Table of Contents
1. [Proof of Concept](#proof-of-concept)
    1. [Capture Raw Status Data](#capture-raw-status-data)
    2. [Parsing Raw Status Data](#parsing-raw-status-data)
    3. [Logging to TXT](#logging-to-txt)
    4. [Logging to CSV](#logging-to-csv)
2. [Object Oriented Approach](#object-oriented-approach)
3. [Automated Reporting](#automated-reporting)


In [1]:
%reset -f

## <a name="proof-of-concept"></a> [Proof of Concept](#toc)

Purely functional code: small functions, each representing one core task of the program.



#### <a name="capture-raw-status-data"></a> [Capture Raw Status Data](#toc)

By running the `nvidia-smi` command, we obtain real-time status data for the GPUs installed on the machine. Using Python's built-in `os` module, this raw output can be captured from the terminal and kept by the local program for processing. This procedure gives us a window into the GPU's core sensors and the data we need to protect our assets from overheating or over-utilization.



In [2]:
# -------------------- RUN AND CAPTURE TERMINAL -------------------- #

import os

def get_raw_status_data(debug=False):
    # Return the nvidia-smi table as text, or a canned sample when debug is True.
    if not debug:
        return os.popen('nvidia-smi').read()
    else:
        return """
Thu Mar  4 13:32:39 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:09:00.0 Off |                  N/A |
|  0%   98C    P0     1W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
"""

raw_status_data = get_raw_status_data(debug=True)
print(raw_status_data)


Thu Mar  4 13:32:39 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:09:00.0 Off |                  N/A |
|  0%   98C    P0     1W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
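
`nvidia-smi` can also print selected fields as CSV through its `--query-gpu` and `--format` options, which avoids depending on the table layout. The following is a minimal sketch and not part of the original program: `query_gpu_fields`, `parse_query_line`, and the sample line are made up for illustration, and the field list is only one plausible selection.

```python
import subprocess

# Fields to request from nvidia-smi. This selection is an assumption chosen
# to mirror the values this notebook logs.
FIELDS = ["timestamp", "temperature.gpu", "fan.speed", "utilization.gpu",
          "memory.used", "memory.total", "power.draw", "power.limit"]


def query_gpu_fields(fields=FIELDS):
    """Run nvidia-smi in CSV mode and return its raw output (one line per GPU)."""
    command = ["nvidia-smi",
               "--query-gpu=" + ",".join(fields),
               "--format=csv,noheader,nounits"]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def parse_query_line(line, fields=FIELDS):
    """Map one CSV line onto the requested field names."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(fields, values))


# Hypothetical sample line, so the parser can be tried without a GPU.
sample = "2021/03/04 13:32:39.000, 98, 0, 0, 0, 11019, 1.00, 250.00"
print(parse_query_line(sample)["temperature.gpu"])  # 98
```

On a machine with a GPU, `parse_query_line(query_gpu_fields().splitlines()[0])` would return the readings for the first device.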

#### <a name="parsing-raw-status-data"></a> [Parsing Raw Status Data](#toc)

The NVIDIA-SMI status output is excellent for a human to read and comprehend but awkward to log directly. Using native string functions, we can rip and tear at the raw data to extract meaningful information that we can [log as TXT](#logging-to-txt) or [as CSV](#logging-to-csv).



In [3]:
# -------------------- PARSE NVIDIA SMI OUTPUT -------------------- #

def parse_day_term(status_terms):
    day = status_terms[2]
    if len(day) == 1:
        day = "0" + day
    return day


def parse_month_term(status_terms):
    months_str = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul",\
                  "Aug", "Sep", "Oct", "Nov", "Dec"]
    months_num = ["01", "02", "03", "04", "05", "06", "07", "08", \
                  "09", "10", "11", "12"]
    months_map = dict(zip(months_str, months_num))
    return months_map[status_terms[1]]


def parse_year_term(status_terms):
    return status_terms[4]


def parse_time_term(status_terms):
    return status_terms[3]


def string_int(string):
    # True if the string can be converted to an integer.
    try:
        int(string)
        return True
    except ValueError:
        return False


def parse_temperature_term(status_terms):
    for status_term in status_terms:
        if status_term[-1] == "C" and string_int(status_term[:-1]):
            return status_term[:-1]
    return 1


def parse_memory_term(status_terms):
    memory = list()
    for status_term in status_terms:
        if status_term[-3:] == "MiB":
            memory.append(status_term.split('MiB')[0])
    return memory


def parse_power_term(status_terms):
    power = list()
    for status_term in status_terms:
        if status_term[-1] == "W":
            power.append(status_term.split('W')[0])
    return power


def parse_fan_term(status_terms):
    fan = None  # stays None if no fan reading is found
    for i in range(len(status_terms)):
        status_term = status_terms[i]
        if status_term[-1] == "%":
            # If the two preceding terms are equal, then they must
            # be the "|" characters that end one table row and
            # start the next. Ergo, we found the fan %.
            if status_terms[i-1] == status_terms[i-2]:
                fan = status_term.split("%")[0]
    return fan


def parse_utilization_term(status_terms):
    utilization = None  # stays None if no utilization reading is found
    for term_i in range(len(status_terms)):
        status_term = status_terms[term_i]
        if status_term[-1] == "%":
            # If the two preceding terms are not equal, then we found
            # the utilization % in the middle of the raw status
            # data since it's not up against a "|" "|" wall.
            if status_terms[term_i-1] != status_terms[term_i-2]:
                utilization = status_term.split("%")[0]
    return utilization


def parse_raw_status_data(raw_status_data):
    status_terms = raw_status_data.split()
    status = dict()
    status["day"] = parse_day_term(status_terms)
    status["month"] = parse_month_term(status_terms)
    status["year"] = parse_year_term(status_terms)
    status["temperature"] = parse_temperature_term(status_terms)
    status["time"] = parse_time_term(status_terms)
    status["fan"] = parse_fan_term(status_terms)
    status["utilization"] = parse_utilization_term(status_terms)
    status["memory used"] = parse_memory_term(status_terms)[0]
    status["memory max"] = parse_memory_term(status_terms)[1]
    status["power used"] = parse_power_term(status_terms)[0]
    status["power max"] = parse_power_term(status_terms)[1]
    return status


#### TEST ####
status = parse_raw_status_data(raw_status_data)
status

{'day': '04',
 'month': '03',
 'year': '2021',
 'temperature': '98',
 'time': '13:32:39',
 'fan': '0',
 'utilization': '0',
 'memory used': '0',
 'memory max': '11019',
 'power used': '1',
 'power max': '250'}
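
The first five terms of the output form a timestamp, so the date helpers above could also be expressed with the standard library's `datetime.strptime`. A small sketch under that assumption: `parse_timestamp` is a hypothetical helper, not part of this notebook, and it assumes the default (English) locale for day and month names.

```python
from datetime import datetime


def parse_timestamp(status_terms):
    """Build a datetime from terms such as ['Thu', 'Mar', '4', '13:32:39', '2021']."""
    stamp = " ".join(status_terms[:5])
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")


terms = "Thu Mar  4 13:32:39 2021".split()
moment = parse_timestamp(terms)
print(moment.strftime("%d-%m-%Y @ %H:%M:%S"))  # 04-03-2021 @ 13:32:39
```

A `datetime` object also makes later steps, such as sorting entries or computing intervals between samples, simpler than working with separate day, month, and year strings.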

#### <a name="logging-to-txt"></a> [Logging to TXT](#toc)

From within the terminal, we need a logfile that a human can easily read using `less`, `more`, or an editor like vim or emacs. Text files satisfy this requirement and can be printed in a pretty format for readability.



In [4]:
# -------------------- TEXT OUTPUT -------------------- #

def format_txt_log_entry(status):
    return f"[{status['day']}-{status['month']}" \
           f"-{status['year']} @ {status['time']}] >" \
           f"  Temp: {status['temperature']}C" \
           f"  Power: {status['power used']}/{status['power max']} W" \
           f"  Fan: {status['fan']}%" \
           f"  Utilization: {status['utilization']}%" \
           f"  Memory: {status['memory used']}/{status['memory max']} MiB\n"
    

def write_status_to_txt_logfile(status, txt_logfile):
    func_name = "write_status_to_txt_logfile"
    try:
        # The with-block closes the file automatically.
        with open(txt_logfile, mode="a") as file:
            entry = format_txt_log_entry(status)
            file.write(entry)
        return 0
    except FileNotFoundError:
        print(f"Error in {func_name}: {txt_logfile} not found.")
        return 1
    except KeyError:
        print(f"Error in {func_name}: KeyError when parsing data: {status}")
        return 1


#### TEST ####
txt_logfile = "gpu_log.txt"
write_status_to_txt_logfile(status, txt_logfile)

0

#### <a name="logging-to-csv"></a> [Logging to CSV](#toc)

CSV time-series data can be read by analysis programs and easily fed into Matplotlib or other visualization software for graphical reporting.



In [5]:
# -------------------- CSV OUTPUT -------------------- #

def format_csv_log_entry(status):
    return f"{status['day']}-{status['month']}" \
           f"-{status['year']},{status['time']}," \
           f"{status['temperature']},{status['power used']}," \
           f"{status['power max']},{status['fan']}," \
           f"{status['utilization']},{status['memory used']}," \
           f"{status['memory max']}\n"

def write_status_to_csv_logfile(status, csv_logfile):
    func_name = "write_status_to_csv_logfile"
    try:
        # A missing file counts as empty so the first write adds the header.
        file_empty = (not os.path.exists(csv_logfile)
                      or os.stat(csv_logfile).st_size == 0)
        with open(csv_logfile, mode="a") as file:
            if file_empty:
                file.write("Date,Time,Temperature,Power Used,Power Max,Fan,"
                           "Utilization,Memory Used,Memory Max\n")
            entry = format_csv_log_entry(status)
            file.write(entry)
        return 0
    except FileNotFoundError:
        print(f"Error in {func_name}: {csv_logfile} not found.")
        return 1
    except KeyError:
        print(f"Error in {func_name}: KeyError when parsing data: {status}")
        return 1


#### TEST ####
csv_logfile = "gpu_log.csv"
write_status_to_csv_logfile(status, csv_logfile)

0
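
Once the CSV logfile has a header row, it can be read back with the standard library before any plotting. A minimal sketch, assuming a `Temperature` column as written above: `read_temperatures` is a hypothetical helper, and the sample file with made-up values is written to a temporary directory only for demonstration.

```python
import csv
import os
import tempfile


def read_temperatures(csv_logfile):
    """Return the Temperature column of a GPU logfile as a list of ints."""
    with open(csv_logfile, newline="") as file:
        return [int(row["Temperature"]) for row in csv.DictReader(file)]


# Demonstration data in the same layout as the logfile (values are made up).
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "gpu_log.csv")
    with open(path, "w") as file:
        file.write("Date,Time,Temperature,Power Used,Power Max,Fan,"
                   "Utilization,Memory Used,Memory Max\n")
        file.write("04-03-2021,13:32:39,98,1,250,0,0,0,11019\n")
        file.write("04-03-2021,13:33:39,91,1,250,0,0,0,11019\n")
    temperatures = read_temperatures(path)

print(max(temperatures))  # 98
```

The resulting list can be passed directly to a plotting call or reduced to summary values such as the maximum shown here.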

## <a name="object-oriented-approach"></a> [Object Oriented Approach](#toc)

Looking at the functions above, it becomes clear that they act on a shared data flow. Additionally, the intermediate data is irrelevant to the user. Ultimately, the user is only interested in the output: accurate logfiles of the GPU status.

```
raw_status_data -> status -> logfile
```

Consequently, we can shield this flow from the user and expose only those functions that the user needs. Using an object oriented approach, we are left with the following object call:

```PYTHON
Status("gpu_log.txt", "gpu_log.csv", debug=False)
```

This object will call NVIDIA-SMI and capture the raw data, parse the data into a valid status format and write the status data into both TXT and CSV formats.


**TODO:**
- Eventually, we will want automated notifications when the GPU temperature reaches a certain point. The GeForce RTX 2080 Ti, for instance, has a maximum temperature of 89C according to [NVIDIA](https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/) (under Specs > View Full Specs). Ergo, we want to receive a notification whenever the GPU temperature reaches, say, 85C.
- What if the logfiles get too long? Ideally, there is a routine that inspects the logfiles and, if they are longer than a certain amount, deletes the top line during every write. This strategy will prevent the logging system from failing if the files get bloated.
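
The logfile-length item could be approached by trimming the file after each write. A rough sketch of one option: `trim_logfile`, `max_lines`, and `keep_header` are hypothetical names, and rewriting the whole file on every call is a simplification that suits small logs only. For the CSV log, `keep_header=True` preserves the first line.

```python
import os
import tempfile


def trim_logfile(logfile, max_lines, keep_header=False):
    """Drop the oldest entries so the file holds at most max_lines entries."""
    with open(logfile) as file:
        lines = file.readlines()
    header = lines[:1] if keep_header else []
    entries = lines[1:] if keep_header else lines
    if len(entries) <= max_lines:
        return 0  # nothing to trim
    removed = len(entries) - max_lines
    with open(logfile, "w") as file:
        file.writelines(header + entries[removed:])
    return removed


# Demonstration on a throwaway file with five entries.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "gpu_log.txt")
    with open(path, "w") as file:
        file.writelines(f"entry {i}\n" for i in range(5))
    removed = trim_logfile(path, max_lines=3)
    with open(path) as file:
        remaining = file.read().splitlines()

print(removed, remaining)  # 2 ['entry 2', 'entry 3', 'entry 4']
```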



In [6]:
# -------------------- INSTANTIATE STATUS -------------------- #

from custom.Status import Status

status = Status("gpu_log.txt", "gpu_log.csv", debug=True)
status.display()

----------- META -----------
Hostname: Seaborns-Mini.ngs.local
TXT Logfile: gpu_log.txt
CSV Logfile: gpu_log.csv
Debug: True
---------- STATUS ----------
2021-03-04 @ 13:32:39
Temperature: 95 C
Utilization: 0%
Fan: 0%
Memory: 0 / 11019 MiB
Power 1 / 250 W



In [7]:
# -------------------- WRITE LOGFILES -------------------- #

status.write_status_to_logfiles()

'Done.'

This approach is much cleaner than the functional approach and it exposes the user to the only thing they really need: passing in logfile names and getting the status output written to the logfiles. I exposed the `write_status_to_logfiles` function so that future developers can see the difference between instantiating the Status object and committing it to logfiles.



## <a name="automated-reporting"></a> [Automated Reporting](#toc)

Eventually, we will want automated notifications when the GPU temperature reaches a certain point. The GeForce RTX 2080 Ti, for instance, has a maximum temperature of 89C according to [NVIDIA](https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/) (under Specs > View Full Specs). Ergo, we want to receive a notification whenever the GPU temperature reaches, say, 85C.

The following technique works well but it requires that the system has `sendmail` installed and properly configured for sending email through the local firewalls.

**TODO:**
- Email notifications are annoying, slow, and bloat the mailboxes of their recipients. What if, instead of sending emails, this system communicated important notifications via Telegram (a messaging service with a bot API)? Future bots could be created as contacts and grouped into channels by constellation. This method would keep the messages timely and organized (\*).
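
The Telegram idea could be prototyped against the Bot API's `sendMessage` method using only the standard library. This is a sketch rather than a tested integration: `build_telegram_request` and `send_telegram_alert` are hypothetical helpers, and the token and chat id below are placeholders that must be replaced with real values before anything is sent.

```python
import urllib.parse
import urllib.request

API_URL = "https://api.telegram.org/bot{token}/sendMessage"


def build_telegram_request(token, chat_id, text):
    """Prepare (but do not send) a sendMessage request for the Telegram Bot API."""
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    return urllib.request.Request(API_URL.format(token=token), data=data)


def send_telegram_alert(token, chat_id, text):
    """Send the alert and return the HTTP status code."""
    request = build_telegram_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Placeholder credentials; nothing is sent over the network here.
request = build_telegram_request("TOKEN", "42", "GPU at 98C")
print(request.get_method())  # POST
```

Keeping request construction separate from sending makes the message format checkable without network access or a real bot.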



In [8]:
# -------------------- SEND EMAIL NOTIFICATIONS -------------------- #

import os

def send_alert(alert, recipient, status):
    # Compose an alert email from the templates and hand it to sendmail.

    subject = "{} GPU Status Update".format(status.hostname)

    with open("templates/body.html") as file:
        body = file.read()

    # Use the alert argument rather than a global variable.
    body = body.format(status.hostname, alert,
                       status.year, status.month, status.day,
                       status.time, status.utilization,
                       status.temperature, status.fan,
                       status.power_used, status.power_max,
                       status.memory_used, status.memory_max)

    with open("templates/email.txt", "r") as file:
        email = file.read()

    email = email.format(recipient, subject, body)

    with open("message.txt", "w") as file:
        file.write(email)

    # -t reads the recipients from the message headers; -v is verbose.
    msg_status = os.popen("sendmail -vt < message.txt").read()
    return msg_status

#####

status = Status("gpu_log.txt", "gpu_log.csv", debug=True)
recipient = "seaborn.dev@gmail.com"
alert_message = "Your GPU is practically barbequing in your rack."

send_alert(alert_message, recipient, status)


'Mail Delivery Status Report will be mailed to <seaborn>.\n'

This function is added to the Status object so that, when creating a Status instance or updating its status attributes, an email automatically goes out to the recipient if the temperature exceeds the set threshold.

In [9]:
# -------------------- CHECK AUTOREPORTING -------------------- #

limits={"temperature": 85}
status = Status("gpu_log.txt", "gpu_log.csv", limits=limits, recipient="seaborn.dev@gmail.com", debug=True)
status.update()
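
How `Status` evaluates `limits` is not shown in this notebook. One plausible shape for that check, as a sketch only: `exceeded_limits` is a hypothetical helper, and it assumes readings are numeric strings keyed like the parsed status dictionary from the proof of concept.

```python
def exceeded_limits(reading, thresholds):
    """Return {field: (value, limit)} for every reading at or above its limit."""
    exceeded = {}
    for field, limit in thresholds.items():
        value = float(reading[field])
        if value >= limit:
            exceeded[field] = (value, limit)
    return exceeded


# Made-up reading in the same shape as the parsed status dictionary.
reading = {"temperature": "98", "utilization": "0"}
thresholds = {"temperature": 85}
print(exceeded_limits(reading, thresholds))  # {'temperature': (98.0, 85)}
```

An empty result means no alert is needed; a non-empty one carries enough detail to build the alert message.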


*Written by Austin Dial. Maintained by Alice Seaborn*

\* Yeah right! Instant messages can become bloated and otherwise totally ignored just like emails can. But messages are more easily handled by users, so using Telegram is still a viable solution.