### Notebook for aggregating a collection of HPC tasks on GraphWorld
Given a type of experiment, this notebook takes all the JSON result files of a collection of HPC tasks and merges them into a single file in the `processed` directory. It also maintains a summary file in the same folder that covers all runs for the experiment. Finally, it loads the results and prints basic statistics that are not part of the summary file (see the last cell of this notebook).

Set `RAW_DIR` to the raw experiments you want to process.

Set `PROCESSED_DIR` to where the processed results should be stored.
The processed results are stored in shards. Each time this notebook is run, one shard is created, so the shard size depends on the contents of `RAW_DIR`.

The processing assumes that the raw results come from our HPC experimental setup.
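
The layout this assumes can be read off the cells below: one sub-directory per task inside `RAW_DIR`, each holding `NSHARDS` files named `results.ndjson-XXXXX-of-XXXXX`, plus one `slurm_<task>.out` log per task next to the sub-directories. As a minimal sketch (the helper name `find_missing_files` is hypothetical and not part of the original notebook), a check like the following could be run before aggregating to report incomplete tasks:

```python
import os

def find_missing_files(raw_dir, nshards=10):
    """Return the paths the aggregation cells expect but that do not exist."""
    missing = []
    # Every sub-directory of raw_dir is treated as one HPC task
    for task in sorted(next(os.walk(raw_dir))[1]):
        for shard_idx in range(nshards):
            shard = f'results.ndjson-{shard_idx:05d}-of-{nshards:05d}'
            path = os.path.join(raw_dir, task, shard)
            if not os.path.isfile(path):
                missing.append(path)
        # The Slurm log of a task lives next to its sub-directory
        log_path = os.path.join(raw_dir, f'slurm_{task}.out')
        if not os.path.isfile(log_path):
            missing.append(log_path)
    return missing
```

An empty list means every task has all of its shards and its log, so the cells below should not fail with `FileNotFoundError`.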

In [1]:
RAW_DIR = 'results/mode1/raw/node_classification_daen'
PROCESSED_DIR = 'results/mode1/processed'

In [2]:
import os
import pandas as pd
import json

PROCESSED_SHARDS = f'{PROCESSED_DIR}/shards'
NSHARDS = 10

# Create the shard directory (and its parents) if it does not exist yet
os.makedirs(PROCESSED_SHARDS, exist_ok=True)

In [3]:
# Read (existing) summary file for experiment
try:
    with open(f'{PROCESSED_DIR}/summary.json', 'r') as f:
        summary = json.load(f)
except FileNotFoundError:
    summary = {
        'N_GRAPHS': 0,
        'N_RUNS': 0,
        'RUN_GRAPHS': []
    }

# This run's number; it is also used as the shard file name below
summary['N_RUNS'] += 1

In [4]:
# Read the NDJSON shards of each HPC task, aggregate them,
# and store everything in one file in the processed folder
lines = []
for sub_dir in next(os.walk(RAW_DIR))[1]:
    for shard_idx in range(NSHARDS):
        filename = f'results.ndjson-{shard_idx:05d}-of-{NSHARDS:05d}'
        with open(f'{RAW_DIR}/{sub_dir}/{filename}', 'r') as f:
            lines.extend(f.readlines())  # aggregate all shards (collection of graphs)

with open(f'{PROCESSED_SHARDS}/{summary["N_RUNS"]}.ndjson', 'w') as dst:
    dst.writelines(lines)  # write all graph experiments to the same file

In [5]:
# Load lines dataframe for printing statistics
records = map(json.loads, lines)
results_df = pd.DataFrame.from_records(records)

In [6]:
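# Get the running time of each task (in minutes) from its Slurm log
times = []

for task in next(os.walk(RAW_DIR))[1]:
    with open(f'{RAW_DIR}/slurm_{task}.out', 'r') as f:
        # Use a separate name so `lines` from In [4] is not overwritten
        log_lines = f.readlines()
        # The second space-separated token of the last line is the runtime
        # in seconds; convert it to whole minutes
        times.append(int(log_lines[-1].split(" ")[1]) // 60)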
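
The cell above takes the second space-separated token of the last log line as the runtime in seconds. The exact log format is not shown here, so the sketch below assumes a hypothetical final line such as `Elapsed 46049 seconds`; `parse_runtime_minutes` is an illustrative helper, not part of the original notebook. It returns `None` instead of raising when the line does not match, which makes unfinished or crashed tasks easier to spot:

```python
def parse_runtime_minutes(log_text):
    """Return the runtime in whole minutes, or None if it cannot be parsed."""
    log_lines = log_text.strip().splitlines()
    if not log_lines:
        return None
    tokens = log_lines[-1].split()
    # Assumed format: "<label> <seconds> ...", e.g. "Elapsed 46049 seconds"
    if len(tokens) < 2 or not tokens[1].isdigit():
        return None
    return int(tokens[1]) // 60

print(parse_runtime_minutes("step 1 done\nElapsed 46049 seconds\n"))  # 767
print(parse_runtime_minutes("job cancelled\n"))  # None
```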

In [7]:
# Getting basic statistics of raw data
N_GRAPHS = len(results_df)
N_METHODS = len([col for col in results_df if 'encoder_hidden_channels' in col])
N_TASKS = len(next(os.walk(RAW_DIR))[1])

AVG_TIME = sum(times) / len(times)
MAX_TIME = max(times)
MIN_TIME = min(times)

In [8]:
# Collect the methods that have crashed / were skipped
skipped_methods = {}
for s_col in [col for col in results_df if '_skipped' in col]:
    count = results_df[s_col].sum()
    if count > 0:
        # Key by method name, i.e. the column name without its suffix
        skipped_methods[s_col.removesuffix('_skipped')] = count

In [9]:
# Update summary file
summary['N_GRAPHS'] += N_GRAPHS
summary['RUN_GRAPHS'].append(N_GRAPHS)
with open(f'{PROCESSED_DIR}/summary.json', 'w') as s:
    json.dump(summary, s)
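
To make the bookkeeping concrete, here is a self-contained sketch of how `summary.json` evolves over two runs of the notebook. The function `record_run` and the graph counts are made up for illustration; only the three keys come from the cells above.

```python
import json
import os
import tempfile

def record_run(summary_path, n_graphs):
    """Load (or initialise) the summary, register one run, and save it."""
    try:
        with open(summary_path, 'r') as f:
            summary = json.load(f)
    except FileNotFoundError:
        summary = {'N_GRAPHS': 0, 'N_RUNS': 0, 'RUN_GRAPHS': []}
    summary['N_RUNS'] += 1
    summary['N_GRAPHS'] += n_graphs
    summary['RUN_GRAPHS'].append(n_graphs)
    with open(summary_path, 'w') as f:
        json.dump(summary, f)
    return summary

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'summary.json')
    record_run(path, 100)           # first run: summary file is created
    final = record_run(path, 250)   # second run: counts accumulate
    print(final)
    # {'N_GRAPHS': 350, 'N_RUNS': 2, 'RUN_GRAPHS': [100, 250]}
```

Since each shard is named after the 1-based run number, `RUN_GRAPHS[i]` is the number of graphs in shard `i + 1`.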

In [10]:
# Printing statistics
print('------- Task/Graph statistics -------')
print(f'Total processed tasks: {N_TASKS}')
print(f'Total processed graphs: {N_GRAPHS}')
print(f'Graphs per task: {N_GRAPHS / N_TASKS}')
print(f'Avg task runtime (min): {AVG_TIME} ({AVG_TIME / (N_GRAPHS / N_TASKS)} per graph)')
print(f'Max task runtime (min): {MAX_TIME} ({MAX_TIME / (N_GRAPHS / N_TASKS)} per graph)')
print(f'Min task runtime (min): {MIN_TIME} ({MIN_TIME / (N_GRAPHS / N_TASKS)} per graph)\n')

print('------- Skipped (crashed) methods -------')
for method, count in skipped_methods.items():
    print(f'{method} skipped {count} times')

------- Task/Graph statistics -------
Total processed tasks: 100
Total processed graphs: 10000
Graphs per task: 100.0
Avg task runtime (min): 767.48 (7.6748 per graph)
Max task runtime (min): 1043 (10.43 per graph)
Min task runtime (min): 543 (5.43 per graph)

------- Skipped (crashed) methods -------
GCN_GBT_JL skipped 24 times
GAT_GBT_JL skipped 25 times
GIN_GBT_JL skipped 39 times
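
Relative to the 10000 processed graphs, these counts are small. A quick calculation using the numbers printed above (the dictionary is typed in by hand here rather than read from the results, and it assumes each `_skipped` column is a per-graph flag):

```python
n_graphs = 10000
skipped = {'GCN_GBT_JL': 24, 'GAT_GBT_JL': 25, 'GIN_GBT_JL': 39}

# Fraction of graphs on which each method was skipped
skip_rates = {method: count / n_graphs for method, count in skipped.items()}
for method, rate in skip_rates.items():
    print(f'{method}: {rate:.2%} of graphs skipped')
# GCN_GBT_JL: 0.24% of graphs skipped
# GAT_GBT_JL: 0.25% of graphs skipped
# GIN_GBT_JL: 0.39% of graphs skipped
```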
