# VisCPU: data pre-processing

TODO:
- write about;
- why this notebook;

## Dependencies and imports

Install and import required packages.

In [None]:
!pip install pandas
!pip install psutil
!conda install -c plotly plotly-orca -y

In [321]:
import json
import pandas as pd
import numpy as np
from viscpu import utils

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load datasets

Load the two datasets to compare. The first dataset is the baseline for the comparison. For example, if the first experiment has 10 cache misses and the second has 15, the number of cache misses increased.

In [311]:
dataset_1 = "../applications/simple-ff-test/data/perf-test-1.csv"
dataset_2 = "../applications/simple-ff-test/data/perf-test-2.csv"

# Column names assigned to the CSV fields; the "ignore_*" fields are not loaded.
column_names = ["time", "cpu", "counter_value", "ignore_1", "event",
                "ignore_2", "ignore_3", "ignore_4", "ignore_5", "ignore_6"]
usecols = ["time", "cpu", "counter_value", "event"]

# Skip the first line of each file and round timestamps to 4 decimal places.
df_1 = pd.read_csv(dataset_1, skiprows=1, header=None, names=column_names, usecols=usecols)
df_1["time"] = df_1["time"].round(4)
df_2 = pd.read_csv(dataset_2, skiprows=1, header=None, names=column_names, usecols=usecols)
df_2["time"] = df_2["time"].round(4)
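
As a quick check of the parsing options, the sketch below applies the same `read_csv` arguments to a small in-memory sample. The sample rows and their values are made up for illustration. They only assume ten comma-separated fields per row, matching `column_names`, and are not taken from the real datasets.

```python
import io

import pandas as pd

column_names = ["time", "cpu", "counter_value", "ignore_1", "event",
                "ignore_2", "ignore_3", "ignore_4", "ignore_5", "ignore_6"]
usecols = ["time", "cpu", "counter_value", "event"]

# Hypothetical sample: one line that skiprows=1 drops, then ten fields per row.
sample_csv = (
    "# first line, skipped\n"
    "1.000123456,CPU0,1200,,cpu-cycles,1000,100.00,,,\n"
    "1.000123456,CPU1,800,,cpu-cycles,1000,100.00,,,\n"
    "2.000234567,CPU0,1500,,cpu-cycles,1000,100.00,,,\n"
)

sample_df = pd.read_csv(io.StringIO(sample_csv), skiprows=1, header=None,
                        names=column_names, usecols=usecols)
# Same rounding as in the cell above.
sample_df["time"] = sample_df["time"].round(4)

print(sample_df)
```

Only the four columns listed in `usecols` end up in the resulting frame.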

Define captured events:

In [341]:
events = ["cpu-cycles", "cache-misses", "instructions", "cycle_activity.cycles_l3_miss"]

## Pre-processing

In [342]:
output = {"events": events, "dataset-1": {"raw": {}, "aggregated": {}}, "dataset-2": {"raw": {}, "aggregated": {}}, "comparison": {}}

Build the CPU layout from the CPUs found in the first dataset. The `cpus_per_row` argument sets how many CPUs are shown on each row.

In [343]:
cpu_labels, cpu_setup = utils.get_cpu_setup(df_1["cpu"].unique(), cpus_per_row=4)

output["cpu_labels"] = cpu_labels
output["cpu_setup"] = cpu_setup
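
To illustrate what a row-based layout could look like, here is a minimal sketch. `grid_layout` is a hypothetical helper written for this example, and integer CPU ids are assumed; it is not the implementation of `utils.get_cpu_setup`, whose return values may be structured differently.

```python
def grid_layout(cpus, cpus_per_row=4):
    """Hypothetical helper: split sorted CPU ids into rows of fixed width."""
    ordered = sorted(cpus)
    return [ordered[i:i + cpus_per_row]
            for i in range(0, len(ordered), cpus_per_row)]


# Six CPUs with four per row give one full row and one partial row.
layout = grid_layout([3, 0, 1, 5, 2, 4], cpus_per_row=4)
print(layout)
```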

Load the data for each captured event and write it to `output`:

In [344]:
for event in events:
    print(f"Processing event '{event}'...")
    for name, df in (("dataset-1", df_1), ("dataset-2", df_2)):
        # Counter values of this event across all CPUs and timestamps.
        values = df.loc[df["event"] == event, "counter_value"]
        times, captures = utils.get_event_data(df, cpu_setup, event)
        output[name]["raw"][event] = {
            "captures": captures,
            "captures_min": float(values.min()),
            "captures_max": float(values.max()),
        }
    print(f"Finished event '{event}'.")

Processing event 'cpu-cycles'...
Finished event 'cpu-cycles'.
Processing event 'cache-misses'...
Finished event 'cache-misses'.
Processing event 'instructions'...
Finished event 'instructions'.
Processing event 'cycle_activity.cycles_l3_miss'...
Finished event 'cycle_activity.cycles_l3_miss'.
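
One way to picture the raw data of a single event is a table with one row per timestamp and one column per CPU. The sketch below builds such a table with `DataFrame.pivot` on made-up values and computes the same minimum and maximum as the loop above. It is an illustration under these assumptions, not the behaviour of `utils.get_event_data`.

```python
import pandas as pd

# Hypothetical toy data in the same four-column shape as df_1 and df_2.
toy = pd.DataFrame({
    "time": [1.0, 1.0, 2.0, 2.0],
    "cpu": ["CPU0", "CPU1", "CPU0", "CPU1"],
    "counter_value": [1200, 800, 1500, 700],
    "event": ["cpu-cycles"] * 4,
})

event_values = toy.loc[toy["event"] == "cpu-cycles"]

# Rows are timestamps, columns are CPUs.
series = event_values.pivot(index="time", columns="cpu", values="counter_value")

# Global bounds over all CPUs and timestamps of this event.
captures_min = float(event_values["counter_value"].min())
captures_max = float(event_values["counter_value"].max())

print(series)
print(captures_min, captures_max)
```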


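Aggregate the time series of each event per CPU, using both the mean and the sum. This allows comparing the overall performance of the two experiments. The comparison is stored as dataset 1 minus dataset 2, so a counter that increased in the second experiment yields a negative value.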

In [352]:
for event in events:
    print(f"Processing event '{event}'...")
    df_1_aggr = df_1[df_1["event"] == event].groupby(["cpu"], as_index=False)["counter_value"]
    df_2_aggr = df_2[df_2["event"] == event].groupby(["cpu"], as_index=False)["counter_value"]
    
    if event not in output["dataset-1"]["aggregated"]:
        output["dataset-1"]["aggregated"][event] = {"mean": {}, "sum": {}}
        output["dataset-2"]["aggregated"][event] = {"mean": {}, "sum": {}}
    
    if event not in output["comparison"]:
        output["comparison"][event] = {}
    
    a = utils.transform_cpu_data(df_1_aggr.mean(), cpu_setup)
    b = utils.transform_cpu_data(df_2_aggr.mean(), cpu_setup)
    output["dataset-1"]["aggregated"][event]["mean"] = a
    output["dataset-2"]["aggregated"][event]["mean"] = b
    
    output["comparison"][event]["mean"] = (np.array(a) - np.array(b)).tolist()
    
    a = utils.transform_cpu_data(df_1_aggr.sum(), cpu_setup)
    b = utils.transform_cpu_data(df_2_aggr.sum(), cpu_setup)
    output["dataset-1"]["aggregated"][event]["sum"] = a
    output["dataset-2"]["aggregated"][event]["sum"] = b
    
    output["comparison"][event]["sum"] = (np.array(a) - np.array(b)).tolist()
    
    print(f"Finished event '{event}'.")

Processing event 'cpu-cycles'...
Finished event 'cpu-cycles'.
Processing event 'cache-misses'...
Finished event 'cache-misses'.
Processing event 'instructions'...
Finished event 'instructions'.
Processing event 'cycle_activity.cycles_l3_miss'...
Finished event 'cycle_activity.cycles_l3_miss'.
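
The aggregation and comparison can be sketched on toy data as follows. The values are made up so that `CPU0` has 10 cache misses in the first run and 15 in the second. The sketch skips `utils.transform_cpu_data` and simply keeps the per-CPU order produced by `groupby`, which is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical per-sample counter values for one event in two runs.
run_1 = pd.DataFrame({"cpu": ["CPU0", "CPU0", "CPU1", "CPU1"],
                      "counter_value": [4, 6, 1, 3]})
run_2 = pd.DataFrame({"cpu": ["CPU0", "CPU0", "CPU1", "CPU1"],
                      "counter_value": [7, 8, 2, 2]})

# Per-CPU aggregates; groupby sorts the CPU labels.
sum_1 = run_1.groupby("cpu")["counter_value"].sum()
sum_2 = run_2.groupby("cpu")["counter_value"].sum()
mean_1 = run_1.groupby("cpu")["counter_value"].mean()
mean_2 = run_2.groupby("cpu")["counter_value"].mean()

# Same sign convention as the notebook: first dataset minus second dataset.
diff_sum = (np.array(sum_1) - np.array(sum_2)).tolist()
diff_mean = (np.array(mean_1) - np.array(mean_2)).tolist()

print(diff_sum)
print(diff_mean)
```

With this convention, the increase from 10 to 15 cache misses on `CPU0` appears as a difference of -5.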


## Output

Write `output` to a JSON file:

In [354]:
!mkdir -p data

with open("data/simple-ff-test-1x2.json", "w") as f:
    json.dump(output, f)
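
A portable alternative to the shell command is `os.makedirs` with `exist_ok=True`, which does not fail when the directory already exists. The sketch below writes a small dictionary to a temporary directory and reads it back. The dictionary contents and the file name `demo.json` are made up for illustration. NumPy arrays are converted with `.tolist()` first, since `json.dump` cannot serialize them directly.

```python
import json
import os
import tempfile

import numpy as np

# Hypothetical, reduced version of the `output` structure.
demo_output = {
    "events": ["cache-misses"],
    "comparison": {
        "cache-misses": {"sum": (np.array([10, 4]) - np.array([15, 4])).tolist()}
    },
}

out_dir = os.path.join(tempfile.mkdtemp(), "data")
os.makedirs(out_dir, exist_ok=True)
os.makedirs(out_dir, exist_ok=True)  # calling it again raises no error
out_path = os.path.join(out_dir, "demo.json")

with open(out_path, "w") as f:
    json.dump(demo_output, f)

# Read the file back to confirm the round trip.
with open(out_path) as f:
    loaded = json.load(f)

print(loaded)
```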
