# Analysis of the Relationship Between CPU Utilization and Temperature on Raspberry Pi Devices

## Introduction

This analysis explores the relationship between CPU utilization and temperature across different Raspberry Pi specifications. The goal is to understand how CPU workload affects device temperature and whether this relationship can be used to predict thermal behavior under various application states. This information is valuable for optimizing edge computing applications and preventing thermal throttling.

### Research Question

How does resource utilization affect temperature across different Raspberry Pi configurations and application states, and can we identify patterns in this relationship that predict thermal behavior?

## Methods

### Dataset Description

The dataset used in this analysis was constructed and collected by researchers in the Queen's Telecommunications Research Lab (TRL), led by Ruslan Kain. It contains resource usage information from four heterogeneous Raspberry Pi 4 devices with different RAM sizes (2GB, 4GB, 8GB) and CPU frequencies (1200MHz, 1500MHz, 1800MHz), measured under different application usages: gaming, streaming, augmented reality, mining, and idling. The data was collected at ~5 second intervals using the `psutil` Python package.

Since the dataset is split into multiple smaller files, one per device and application usage pattern (consistent vs. random), this exploratory analysis uses one file per device, namely the one with the consistent usage pattern.

### Data Wrangling
The dataset is subset to the following columns:

- `cpu`
- `state`
- `memory`
- `cpu_freq`
- `net_upload_rate`
- `net_download_rate`
- `temp`

This choice is based on domain knowledge: CPU, memory, and network activity are all known to generate heat, so they are included to account for confounding effects. The state is also kept so that differences in resource activity between states can be observed.


In [None]:
import requests
import zipfile

url = "https://borealisdata.ca/api/access/dataset/:persistentId"
rpi_id = "doi:10.5683/SP3/GOZAJE"
params = {
    "persistentId": rpi_id,
}

with requests.get(url, params=params, stream=True) as resp:
    # stop here on an HTTP error instead of continuing with a missing or partial file
    resp.raise_for_status()
    with open("data.zip", "wb") as f:
        for chunk in resp.iter_content(16384):
            f.write(chunk)

assert zipfile.is_zipfile("data.zip")

# close the archive once extraction is done
with zipfile.ZipFile("data.zip") as zf:
    zf.extractall("data")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gc

import warnings
warnings.filterwarnings('ignore')

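COLS = [ 'time_stamp', 'time', 'state', 'cpu', 'cpu_freq', 'memory', 'net_upload_rate', 'net_download_rate', 'temp' ]
# application states in the data, in the order used for plots and legends
states = ['augmented_reality', 'game', 'idle', 'mining', 'stream']
# read dataframes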
# df_1 : 2GB 1200MHz
# df_2 : 2GB 1500MHz
# df_3 : 4GB 1500MHz
# df_4 : 8GB 1800MHz
df_1 = pd.read_table('data/RPi4B2GB1_1200MHz_res_usage_data_rvp_pattern_48hr.tab',
                    parse_dates=['time_stamp'],
                    index_col='time_stamp',
                    usecols=COLS,
                )
df_2 = pd.read_table('data/RPi4B2GB2_1500MHz_res_usage_data_rvp_pattern_48hr.tab',
                    parse_dates=['time_stamp'],
                    index_col='time_stamp',
                    usecols=COLS,
                )
df_3 = pd.read_table('data/RPi4B4GB_1500MHz_res_usage_data_rvp_pattern_48hr.tab',
                    parse_dates=['time_stamp'],
                    index_col='time_stamp',
                    usecols=COLS,
                )
df_4 = pd.read_table('data/RPi4B8GB_1800MHz_res_usage_data_rvp_pattern_48hr.tab',
                    parse_dates=['time_stamp'],
                    index_col='time_stamp',
                    usecols=COLS,
                    )

In [None]:
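# we will only use temp, memory, cpu, cpu_freq for now
def clean(df):
    """Add smoothed CPU and temperature columns and the per-sample memory change (modifies df in place)."""
    # unweighted moving average of CPU utilization
    df['cpu_ma'] = df.cpu.rolling(window=3, closed='both').mean()
    # exponentially weighted moving average of temperature
    df['temp_ma'] = df.temp.rolling(window=3, closed='both', win_type='exponential').mean(center=0, tau=5, sym=False)
    # change in memory usage since the previous sample; the first sample gets 0
    df['mem_diff'] = df.memory.diff().fillna(0)
    return df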

In [None]:
# prepare dataframes
df_1 = clean(df_1)
df_2 = clean(df_2)
df_3 = clean(df_3)
df_4 = clean(df_4)

df = df_3  # 4GB 1500MHz device
gc.collect()

In [None]:
def plot_facet(df, datacol):
    fig, ax = plt.subplots(nrows=1, ncols=5, figsize=(12, 3))
    for i, state in enumerate(states):
        sns.histplot(df[df.state == state][datacol], bins=30, kde=True, ax=ax[i])
        ax[i].set_title(f'state: {state}')
    fig.suptitle(f"{datacol} distribution")
    plt.tight_layout()
    plt.show()

# plot_facet(df, 'cpu')
# plot_facet(df, 'memory')
# plot_facet(df, 'net_upload_rate')
# plot_facet(df, 'net_download_rate')

In [None]:
df = df_1
plotdf = df[(df.time < df.time.quantile(0.125))]

def plot_time_series(plotdf, colname, ax=None, show=True, retpatches=False):
    if ax is None:
        fig, ax = plt.subplots(figsize=(16, 3))
    data = plotdf[colname]
    ax.plot(plotdf.time, data)

    state_colors = {
        'augmented_reality': 'tab:blue',
        'game': 'tab:orange',
        'idle': 'tab:green',
        'mining': 'tab:red',
        'stream': 'tab:cyan',
    }
    legend_patches = []
    for state, color in state_colors.items():
        ax.fill_between(plotdf.time, data.min(), data.max(), where=(plotdf.state == state), color=color, alpha=0.5)
        # use the same alpha as the shaded regions so legend colours match the plot
        legend_patches.append(plt.Rectangle((0,0), 1, 1, fc=color, alpha=0.5, label=state))
        
    ax.set_xbound(plotdf.time.iloc[0], plotdf.time.iloc[-1])
    if not retpatches:
        ax.legend(handles=legend_patches, loc='upper right', title="states")
    if show:
        plt.show()
    return legend_patches if retpatches else None

In [None]:
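def plot_mult_time_series(df, colnames, ylabs=None, titles=None, suptitle=None):
    # squeeze=False keeps axs two-dimensional even when only one column is plotted
    fig, axs = plt.subplots(nrows=len(colnames), figsize=(12, 3 * len(colnames)), squeeze=False)
    axs = axs[:, 0]
    if suptitle is not None:
        fig.suptitle(suptitle)
    for i, col in enumerate(colnames):
        patches = plot_time_series(df, col, ax=axs[i], show=False, retpatches=True)
        axs[i].set_xlabel('time (ticks)')
        # fall back to the column name when no y-axis label is given
        axs[i].set_ylabel(ylabs[i] if ylabs is not None else col)
        if titles is not None:
            axs[i].set_title(titles[i])
    fig.legend(handles=patches, loc='upper right', title="states")
    plt.tight_layout()
    plt.show()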


Since each application runs on its own, the application runs are treated as independent of one another. The device runs one application for a window of time, so the dataset is split into time windows, one per contiguous run of a state, and each window is analysed as an independent sample.
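
The windowing idea can be sketched with a run-length trick: a new window starts whenever `state` differs from the previous row. This is only an illustrative sketch. The helper name `label_windows`, the `window_id` column, and the toy values are assumptions made up for the example and are not part of the dataset.

```python
import pandas as pd

def label_windows(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a `window_id` column that increases by one each time `state` changes."""
    out = df.copy()
    # True on the first row and on every row whose state differs from the row before it
    changed = out["state"] != out["state"].shift()
    out["window_id"] = changed.cumsum() - 1
    return out

# Toy data (made up for illustration; not taken from the dataset)
toy = pd.DataFrame({
    "state": ["idle", "idle", "game", "game", "game", "idle"],
    "cpu": [2.0, 3.0, 60.0, 62.0, 61.0, 4.0],
    "temp": [40.0, 40.5, 55.0, 58.0, 59.0, 50.0],
})
labelled = label_windows(toy)

# One row per window: its state, how many samples it spans, and its mean readings
per_window = labelled.groupby("window_id").agg(
    state=("state", "first"),
    n_samples=("cpu", "size"),
    cpu_mean=("cpu", "mean"),
    temp_mean=("temp", "mean"),
)
print(per_window)
```

Labelling rows rather than storing index pairs keeps each window easy to aggregate with `groupby`, which is convenient if windows are later compared as independent samples.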

### Feature Engineering
Resource activity data can contain a lot of noise, especially for CPU utilization, upload/download rate (network), and temperature, perhaps due to inherent jitter or measurement variance. For those columns, we calculate a moving average to smooth them out. The window size and weighting method are chosen based on the type of resource: CPU and upload/download rate use an unweighted window, while temperature uses an exponentially weighted window.
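
To make the difference between the two weighting schemes concrete, here is a minimal sketch on a toy series. It writes the exponential weights out by hand with NumPy, so it is not claimed to reproduce pandas' `win_type='exponential'` window exactly. The function names, the default `window` and `tau` values, and the toy readings are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def unweighted_ma(series: pd.Series, window: int = 3) -> pd.Series:
    """Plain moving average: every sample in the window counts equally."""
    return series.rolling(window=window, min_periods=1).mean()

def exp_weighted_ma(series: pd.Series, window: int = 3, tau: float = 5.0) -> pd.Series:
    """Moving average whose weights decay as exp(-age / tau), so the newest sample counts most."""
    ages = np.arange(window - 1, -1, -1)   # oldest sample first, newest sample has age 0
    weights = np.exp(-ages / tau)

    def weighted_mean(values: np.ndarray) -> float:
        w = weights[-len(values):]         # shorter windows at the start of the series
        return float(np.dot(values, w) / w.sum())

    return series.rolling(window=window, min_periods=1).apply(weighted_mean, raw=True)

# Toy temperature readings with a step change (made up for illustration)
toy_temp = pd.Series([40.0, 40.0, 46.0, 46.0, 46.0])
plain = unweighted_ma(toy_temp)
weighted = exp_weighted_ma(toy_temp)
print(plain.tolist())
print(weighted.round(3).tolist())
```

Because recent samples carry more weight, the exponentially weighted average reacts to the step slightly faster than the plain average while still damping sample-to-sample noise.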

Memory usage on its own does not necessarily reflect the utilization of other resources, since a high but static value can mean that nothing is going on at the moment. So, the change in RAM usage between consecutive samples (`mem_diff`) is added to account for memory activity by the device for each application.
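
As a small illustration of the memory-change feature: `clean` computes `mem_diff` with `diff()`, which gives an absolute change between consecutive samples. A relative change would use `pct_change()` instead. The toy readings below are made up for illustration.

```python
import pandas as pd

# Toy memory-usage readings in percent (made up for illustration)
memory = pd.Series([30.0, 30.0, 36.0, 33.0])

# Absolute change between consecutive samples, in percentage points.
# The first sample has no predecessor, so its change is set to 0.
mem_diff = memory.diff().fillna(0)

# Relative change, as a fraction of the previous reading
mem_pct = memory.pct_change().fillna(0)

print(mem_diff.tolist())
print(mem_pct.round(3).tolist())
```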


In [None]:
plot_mult_time_series(plotdf,
                      ['cpu_ma', 'temp_ma'],
                      ylabs=['CPU usage (%)', 'Temperature (Celsius)'],
                      suptitle='CPU Activity and Temperature Moving Average over Time'
                     )

## Preliminary Results

We see a stark difference in CPU and memory usage between states, with each state using roughly the same percentage of CPU throughout each of its usage windows. The utilization percentage for each state is also similar across devices. We also observe CPU frequency scaling in `cpu_freq`: the CPU lowers its frequency to conserve power when running processes with lower utilization.

Furthermore, the temperature seems to rise or fall quickly when CPU utilization changes abruptly because the device switches application state, and then plateaus at a stable temperature that is consistent for each state. This suggests that changes in device temperature are related to CPU utilization.
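
One way to probe this kind of delayed response is a lagged correlation between CPU utilization and temperature. The sketch below uses synthetic data rather than the dataset: the first-order lag used to generate `temp`, its constants, and the helper name `lagged_corr` are all assumptions chosen only to mimic a ramp-then-plateau shape.

```python
import numpy as np
import pandas as pd

def lagged_corr(cpu: pd.Series, temp: pd.Series, max_lag: int = 5) -> pd.Series:
    """Pearson correlation between cpu[t - lag] and temp[t] for lag = 0..max_lag samples."""
    return pd.Series(
        {lag: temp.corr(cpu.shift(lag)) for lag in range(max_lag + 1)},
        name="corr",
    )

# Synthetic CPU load: a square wave alternating between light and heavy use
cpu_values = np.repeat([5.0, 80.0, 5.0, 80.0], 50)

# Synthetic temperature: moves a fixed fraction of the way toward a
# load-dependent target at every step (a simple first-order lag)
temp_values = np.empty_like(cpu_values)
temp_values[0] = 40.0
for t in range(1, len(cpu_values)):
    target = 40.0 + 0.25 * cpu_values[t - 1]
    temp_values[t] = temp_values[t - 1] + 0.2 * (target - temp_values[t - 1])

cpu = pd.Series(cpu_values)
temp = pd.Series(temp_values)

corrs = lagged_corr(cpu, temp)
print(corrs.round(3))
print("lag with highest correlation (samples):", corrs.idxmax())
```

On real data the same function could be applied to the smoothed columns, for example `lagged_corr(df_1.cpu_ma, df_1.temp_ma)`, keeping in mind that correlation alone does not separate the effect of CPU load from other heat sources.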

In [None]:
plot_mult_time_series(plotdf,
                      ['cpu', 'memory', 'net_upload_rate', 'net_download_rate', 'temp'],
                      ylabs=['CPU usage (%)', 'Memory usage (%)', 'Upload Rate (Mb/s)', 'Download Rate (Mb/s)', 'Temperature (Celsius)'],
                      suptitle='Resource Activity and Temperature over Time'
                     )

Finally, we see that network activity is relevant mainly when the device is streaming; specifically, the download rate increases. In every other state, the device shows little to no network activity, so the upload rate will be removed from future analyses.

In [None]:
# one figure per device: CPU and memory usage distributions, coloured by state
devices = [
    (df_1, '2GB 1200MHz'),
    (df_2, '2GB 1500MHz'),
    (df_3, '4GB 1500MHz'),
    (df_4, '8GB 1800MHz'),
]
for device_df, spec in devices:
    fig, ax = plt.subplots(ncols=2, nrows=1, figsize=(12, 4))
    sns.histplot(device_df, x='cpu', hue='state', ax=ax[0], bins=50, hue_order=states, kde=True)
    sns.histplot(device_df, x='memory', hue='state', ax=ax[1], bins=50, hue_order=states, kde=True)
    fig.suptitle(f'Resource Activity in Raspberry Pi 4 ({spec})')
    plt.show()

From the histograms above, we notice that the resource usage distributions (CPU and RAM) for each application are largely unimodal, with slight variations. Below is a table summarizing some statistics on these distributions.

In [None]:
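def get_windows(df):
    """Map each state to a list of (start, end) row positions, end inclusive, for its contiguous runs."""
    cur = df.state.iloc[0]
    prev = 0
    windows = {
        state : [] for state in df.state.unique()
    }
    for i, state in enumerate(df.state):
        if cur == state:
            continue
        windows[cur].append((prev, i-1))
        cur = state
        prev = i
    # close the final window, which the loop above never appends
    windows[cur].append((prev, len(df) - 1))
    return windows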

In [None]:
w1 = get_windows(df_1)
w2 = get_windows(df_2)
w3 = get_windows(df_3)
w4 = get_windows(df_4)

In [None]:
df_1.describe()

In [None]:
df_2.describe()

In [None]:
df_3.describe()

In [None]:
df_4.describe()
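
Since `describe()` pools all states together, a per-state breakdown may be easier to compare against the histograms. The sketch below shows one way to do that with `groupby`. The helper name `summarize_by_state`, its default column list, and the toy values are assumptions for illustration.

```python
import pandas as pd

def summarize_by_state(df: pd.DataFrame, cols=("cpu", "memory", "temp")) -> pd.DataFrame:
    """Mean, standard deviation, min and max of each column, computed separately per state."""
    return df.groupby("state")[list(cols)].agg(["mean", "std", "min", "max"])

# Toy data (made up for illustration; not taken from the dataset)
toy = pd.DataFrame({
    "state": ["idle", "idle", "game", "game"],
    "cpu": [2.0, 4.0, 60.0, 64.0],
    "memory": [30.0, 30.0, 45.0, 47.0],
    "temp": [40.0, 41.0, 58.0, 60.0],
})
summary = summarize_by_state(toy)
print(summary)
```

Applied to a device frame, for example `summarize_by_state(df_1)`, this would give one row per application state, assuming the columns loaded earlier in the notebook.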

## Summary and Future Plans

The data shows a promising relationship between resource usage, such as CPU utilization, and temperature, with consistent trends across devices and states. Future analysis will focus on characterizing and modelling this relationship. Furthermore, other variables such as RAM usage and changes in RAM usage might be incorporated to account for confounding relationships. For example, RAM usage might increase significantly with a high download rate, which in turn makes the operation I/O bound, so CPU utilization might drop. This can be seen in the fluctuations in the streaming state.