# Slurm EDA - Anonymised June Data Preparation

In this Jupyter Notebook, we continue the EDA started in the *Slurm-EDA-Sample-Data* notebook. We carry on processing the data, but this time we use the anonymised June sacct data rather than our own sample data.

We will begin by importing all the necessary libraries. 

In [1]:
import pandas as pd
import numpy as np
import datetime
import re
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp

### Loading in The Partition Data

Here we will load in the *dfPartition* DataFrame that we prepared in the *Slurm-EDA-Sample-Data* notebook so that we can use it in this notebook. 

In [2]:
# Here we read the csv file containing the dfPartition DataFrame
dfPartition = pd.read_csv('../data/dfPartition.csv', index_col=0)

### Loading in the Job Data

We have saved the dataset containing all the anonymized job data for June in the file *sacct_june_anonymized.csv*. 

We will read the anonymized job data for June and store it in the DataFrame *dfSacct*.

In [3]:
# We will now read the .csv file containing the anonymized job data for June into a pandas DataFrame
# Here we specify 'index_col=0', since the first column of the csv file contains the indexes of the rows. 
sSlurmDataPath = '../data/data_anonymized.csv'
dfSacct = pd.read_csv(sSlurmDataPath, index_col=0)

# We are also going to make the JobIDRaw column the index of our DataFrame
dfSacct.set_index(['JobIDRaw'], drop=True, inplace=True)

# We will now output the first 5 rows of the DataFrame
dfSacct.head()

JobIDRaw,JobName,Partition,ElapsedRaw,Account,State,CPUTimeRAW,NodeList,User,AllocCPUS,AllocNodes,QOS,Start,End,Timelimit,Suspended
15669021,0036959c5bc8397d719d9c1699b38d751afc8fb6eca27c...,skylake,0,bc9b09a5785b66545ba030dbf421adb9f676a3fccb81dd...,CANCELLED by 628,0,None assigned,8138da81aa8ac07579ba662a85b6886c0180b30648ce31...,768,0,dirac-cpu1,Unknown,2023-07-24T09:26:17,1-12:00:00,00:00:00
16604661,8cf2680e34f11d7869634180971329487846dc1a37796f...,ampere,0,89d68041add78dfc7c00c487e8014ac31de6a2302c9237...,PENDING,0,None assigned,a09eadd00bfbbad5faf87c1f2cb461eb1ede976c0f6bbd...,1,0,gpu1,Unknown,Unknown,12:00:00,00:00:00
16604679,a0a4ddbfff70d045b46cea150835e8ed5c954746b13529...,ampere,0,89d68041add78dfc7c00c487e8014ac31de6a2302c9237...,PENDING,0,None assigned,a09eadd00bfbbad5faf87c1f2cb461eb1ede976c0f6bbd...,1,0,gpu1,Unknown,Unknown,12:00:00,00:00:00
16604696,9bef61280f5a21bb68a4e7981914381a7014c069fab138...,ampere,0,89d68041add78dfc7c00c487e8014ac31de6a2302c9237...,PENDING,0,None assigned,a09eadd00bfbbad5faf87c1f2cb461eb1ede976c0f6bbd...,1,0,gpu1,Unknown,Unknown,12:00:00,00:00:00
16604714,bb1707d470849c6d4c96460b43058b37319658826df637...,ampere,0,89d68041add78dfc7c00c487e8014ac31de6a2302c9237...,PENDING,0,None assigned,a09eadd00bfbbad5faf87c1f2cb461eb1ede976c0f6bbd...,1,0,gpu1,Unknown,Unknown,12:00:00,00:00:00


### Processing The Job Data

As the first 5 rows of the *dfSacct* DataFrame show, some jobs in this dataset were cancelled or are still pending. Jobs that never ran consumed no energy and therefore produced no carbon emissions, so we can remove them from the DataFrame.

We are going to remove all jobs that have 0 CPU time, since these jobs will not have run.

In [4]:
# First we create a boolean mask that is True for the rows with a non-zero CPU time. 
bNoCPUTimeMask = dfSacct['CPUTimeRAW'] != 0

# We now apply this mask to the dfSacct DataFrame, keeping only the jobs that ran
dfSacct = dfSacct[bNoCPUTimeMask]
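
The filter above can be sketched on a small toy DataFrame. The rows below are invented for illustration and are not taken from the sacct data; the names `dfToy`, `bHasCPUTime` and `iRemoved` are our own. Counting how many rows the mask removes is a cheap sanity check before overwriting the DataFrame.

```python
import pandas as pd

# Toy data, invented for illustration only
dfToy = pd.DataFrame(
    {
        'State': ['CANCELLED by 628', 'PENDING', 'TIMEOUT', 'COMPLETED'],
        'CPUTimeRAW': [0, 0, 129615, 3600],
    },
    index=pd.Index([1, 2, 3, 4], name='JobIDRaw'),
)

# True for the jobs that consumed some CPU time
bHasCPUTime = dfToy['CPUTimeRAW'] != 0

# Number of jobs that the filter would drop
iRemoved = int((~bHasCPUTime).sum())

# Keep only the jobs that ran
dfToyFiltered = dfToy[bHasCPUTime]
print(f'Removed {iRemoved} of {len(dfToy)} jobs')
```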

Furthermore, some jobs towards the end of the month have not yet ended, resulting in an *Unknown* end time. We have decided that, for a monthly analysis, we will not take into account any jobs that do not end within the specified month. As a result, we will remove the jobs with *Unknown* end times from our DataFrame. 

In [5]:
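# First we create a boolean mask that is True for the rows with a known 
# (i.e. not 'Unknown') end time. 
bKnownEndTime = dfSacct['End'] != 'Unknown'

# We now apply this mask to the dfSacct DataFrame. Taking an explicit copy 
# avoids pandas' SettingWithCopyWarning when we modify columns later on.
dfSacct = dfSacct[bKnownEndTime].copy()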

dfSacct.head()

JobIDRaw,JobName,Partition,ElapsedRaw,Account,State,CPUTimeRAW,NodeList,User,AllocCPUS,AllocNodes,QOS,Start,End,Timelimit,Suspended
23299509,fbe0f463a6725956666ee2ca8997a8013deeaaf7933876...,ampere,129615,064913b8f3d7de55ed18a2a0f29cf9f2d49a9bf2a68467...,TIMEOUT,129615,gpu-q-15,9d2240399ac4b590d6f3d5fd8bf13a6a29937bd5124d64...,1,1,dirac-gpu1,2023-07-03T11:26:17,2023-07-04T23:26:32,1-12:00:00,00:00:00
23299510,fbe0f463a6725956666ee2ca8997a8013deeaaf7933876...,ampere,129615,064913b8f3d7de55ed18a2a0f29cf9f2d49a9bf2a68467...,TIMEOUT,129615,gpu-q-15,9d2240399ac4b590d6f3d5fd8bf13a6a29937bd5124d64...,1,1,dirac-gpu1,2023-07-03T11:26:17,2023-07-04T23:26:32,1-12:00:00,00:00:00
23299511,fbe0f463a6725956666ee2ca8997a8013deeaaf7933876...,ampere,129615,064913b8f3d7de55ed18a2a0f29cf9f2d49a9bf2a68467...,TIMEOUT,129615,gpu-q-15,9d2240399ac4b590d6f3d5fd8bf13a6a29937bd5124d64...,1,1,dirac-gpu1,2023-07-03T11:26:17,2023-07-04T23:26:32,1-12:00:00,00:00:00
23299512,fbe0f463a6725956666ee2ca8997a8013deeaaf7933876...,ampere,129615,064913b8f3d7de55ed18a2a0f29cf9f2d49a9bf2a68467...,TIMEOUT,129615,gpu-q-38,9d2240399ac4b590d6f3d5fd8bf13a6a29937bd5124d64...,1,1,dirac-gpu1,2023-07-03T11:26:17,2023-07-04T23:26:32,1-12:00:00,00:00:00
23299513,fbe0f463a6725956666ee2ca8997a8013deeaaf7933876...,ampere,129615,064913b8f3d7de55ed18a2a0f29cf9f2d49a9bf2a68467...,TIMEOUT,129615,gpu-q-70,9d2240399ac4b590d6f3d5fd8bf13a6a29937bd5124d64...,1,1,dirac-gpu1,2023-07-03T11:26:17,2023-07-04T23:26:32,1-12:00:00,00:00:00
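
As an alternative to comparing against the string 'Unknown', the same rows could be identified while parsing the timestamps. The sketch below is not what this notebook does; it assumes the only non-timestamp value in the column is a placeholder such as 'Unknown', and uses toy values and our own variable names.

```python
import pandas as pd

# Toy 'End' column containing one placeholder value
dfToy = pd.DataFrame(
    {'End': ['2023-07-04T23:26:32', 'Unknown', '2023-07-24T09:26:17']}
)

# errors='coerce' turns any value that does not match the format into NaT
sEndParsed = pd.to_datetime(
    dfToy['End'], format='%Y-%m-%dT%H:%M:%S', errors='coerce'
)

# Keep only the rows whose end time could be parsed
dfToyKnown = dfToy[sEndParsed.notna()]
print(len(dfToyKnown), 'of', len(dfToy), 'rows have a known end time')
```

Note that coercion would also silently drop any malformed timestamps, so the explicit string comparison used in this notebook is the stricter choice.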


We will now check the data type of each column in the DataFrame *dfSacct*. 

In [6]:
dfSacct.dtypes

JobName       object
Partition     object
ElapsedRaw     int64
Account       object
State         object
CPUTimeRAW     int64
NodeList      object
User          object
AllocCPUS      int64
AllocNodes     int64
QOS           object
Start         object
End           object
Timelimit     object
Suspended     object
dtype: object

As the output above shows, the *Start* and *End* columns are stored as generic objects (strings) rather than datetime values, even though every remaining value is a valid timestamp. We will therefore convert the values in these two columns to the datetime64 type.

In [7]:
dfSacct['Start'] = pd.to_datetime(dfSacct['Start'], format='%Y-%m-%dT%H:%M:%S')
dfSacct['End'] = pd.to_datetime(dfSacct['End'], format='%Y-%m-%dT%H:%M:%S')
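
Once the columns are datetimes, subtracting them gives a duration that can be compared with *ElapsedRaw*. The short check below uses the *Start*, *End* and *ElapsedRaw* values copied from the first row of the filtered output shown earlier; the variable names are our own.

```python
import pandas as pd

# Values copied from job 23299509 in the output above
sStart = '2023-07-03T11:26:17'
sEnd = '2023-07-04T23:26:32'
iElapsedRaw = 129615

sFormat = '%Y-%m-%dT%H:%M:%S'
tStart = pd.to_datetime(sStart, format=sFormat)
tEnd = pd.to_datetime(sEnd, format=sFormat)

# Subtracting two Timestamps gives a Timedelta, which we convert to seconds
iElapsedSeconds = int((tEnd - tStart).total_seconds())
print(iElapsedSeconds, iElapsedSeconds == iElapsedRaw)
```

For this job the difference is 1 day, 12 hours and 15 seconds, i.e. 129615 seconds, which matches *ElapsedRaw*.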

Now that these columns have values of the type datetime64, we can sort the DataFrame based on the start times of the jobs. 

In [8]:
dfSacct.sort_values('Start', axis=0, inplace=True)

As the comments in the code block below note, some jobs run across multiple partitions. However, we did not account for this when we first wrote our code to check whether a job is exclusive. There are also some partitions that did not appear when we obtained our partition information using the *sinfo* command. As a result, we do not have the number of CPUs per node for these partitions. 

In [9]:
# The output of this code block shows that some jobs run across multiple partitions 
# (E.g., the value 'cclake,skylake-himem,cclake-himem,icelake-himem,skylake,icelake'
# in the 'Partition' column shows that at least one job runs across 6 partitions)
dfSacct['Partition'].unique()

array(['mtg', 'ampere-long', 'cclake-long', 'epid', 'icelake', 'cclake',
       'ampere', 'skylake', 'pascal', 'skylake-himem', 'login-epicov',
       'cclake-himem', 'icelake-himem', 'cardio', 'desktop', 'bluefield',
       'cardio_intr', 'icelake-long', 'mtg-himem'], dtype=object)
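
A minimal sketch of how multi-partition values could be flagged, assuming they appear as a single comma-separated string in the *Partition* column, as the comment in the cell above describes. The Series below is toy data and the variable names are our own.

```python
import pandas as pd

# Toy Partition values; the second entry mimics a job listed on several partitions
sPartitions = pd.Series(['ampere', 'cclake,skylake-himem,cclake-himem', 'icelake'])

# A comma in the field marks a multi-partition value
bMultiPartition = sPartitions.str.contains(',', regex=False)

# Split each value into its individual partition names and count them
sPartitionLists = sPartitions.str.split(',')
sPartitionCounts = sPartitionLists.str.len()

print(bMultiPartition.tolist())
print(sPartitionCounts.tolist())
```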

We are first going to calculate the proportion of total jobs that we do not have partition information for (this includes the jobs that run across multiple partitions). 

In [10]:
# We are going to iterate through each partition present in the sacct dataset. For each partition, if it is not present
# in the dfPartition DataFrame, then we are going to find the number of jobs running on that partition before adding 
# this value to our total. 
iExcludedCount = 0
iExcludeRuntime = 0

for sPartitionName in dfSacct['Partition'].unique():
    if (sPartitionName not in dfPartition.index): 
        iExcludedCount += np.sum(dfSacct['Partition'] == sPartitionName)
        iExcludeRuntime += np.sum(dfSacct[dfSacct['Partition'] == sPartitionName]['CPUTimeRAW'])

iPercentage = np.round((iExcludedCount/len(dfSacct))*100, decimals=3)
# The ratio is multiplied by 100 so that the printed value really is a percentage
iRunPercentage = np.round((iExcludeRuntime/np.sum(dfSacct['CPUTimeRAW']))*100, decimals=3)

print("A total of", iExcludedCount, "jobs are not accounted for in the dfPartition DataFrame.")
print("This makes up", iPercentage, "% of the sacct data from June.")
print(f'{iRunPercentage}% of CPU time does not have partition data')

A total of 289 jobs are not accounted for in the dfPartition DataFrame.
This makes up 0.033 % of the sacct data from June.
0.026% of CPU time does not have partition data
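
The same two proportions can also be computed without a loop by using `isin`. This is a sketch on toy data: the partition table, its hypothetical `CPUsPerNode` column and all of the numbers are invented for illustration.

```python
import pandas as pd

# Toy stand-in for dfPartition (column name and values are invented)
dfToyPartition = pd.DataFrame({'CPUsPerNode': [8, 4]}, index=['icelake', 'cclake'])

# Toy stand-in for the job data
dfToyJobs = pd.DataFrame(
    {
        'Partition': ['icelake', 'cclake', 'desktop', 'icelake'],
        'CPUTimeRAW': [600, 300, 100, 1000],
    }
)

# True for the jobs whose partition appears in the partition table
bKnownPartition = dfToyJobs['Partition'].isin(dfToyPartition.index)

# Jobs and CPU time that are not covered by the partition table
iExcludedCount = int((~bKnownPartition).sum())
iExcludedCPUTime = int(dfToyJobs.loc[~bKnownPartition, 'CPUTimeRAW'].sum())

# Both values are expressed as percentages of the totals
fJobPercentage = 100 * iExcludedCount / len(dfToyJobs)
fCPUTimePercentage = 100 * iExcludedCPUTime / dfToyJobs['CPUTimeRAW'].sum()

print(f'{fJobPercentage}% of jobs, {fCPUTimePercentage}% of CPU time excluded')
```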


As we can see from the output above, 0.033% of the jobs in the June sacct data are not accounted for by the *dfPartition* DataFrame. 

Since this makes up such a small proportion of the total data, we will initially remove these jobs from the dataset. 

In [11]:
# We first create a boolean mask that is True for the jobs that run on partitions accounted for by dfPartition.
bExcludePartitionMask = dfSacct['Partition'].isin(dfPartition.index)

# We then apply this boolean mask to dfSacct, keeping only those jobs.
dfSacct = dfSacct[bExcludePartitionMask]

We will now save the *dfSacct* DataFrame as a .csv file so that it can be accessed in other notebooks. 

In [11]:
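# We first create a csv file to store the data types of the DataFrame columns, 
# so that they can be restored when the data is loaded in another notebook. 
dfTypes = dfSacct.dtypes.to_frame()
# Start and End are recorded as 'str', since read_csv cannot take datetime64 as a dtype; 
# they need to be converted back to datetimes after loading.
dfTypes.loc[['Start', 'End']] = 'str'
dfTypes.to_csv('../data/SacctTypes.csv')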

# Here we save the DataFrame as a .csv file. 
dfSacct.to_csv('../data/dfSacctPartialProcessed.csv')
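
For reference, one way a later notebook could read these two files back is sketched below. This is an assumption about how the files might be used, not the code from the follow-up notebook. It uses a temporary directory and a toy DataFrame so that it runs on its own; the file names and variable names are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy DataFrame with a string, an integer and a datetime column
dfToy = pd.DataFrame(
    {
        'Partition': ['ampere', 'icelake'],
        'AllocCPUS': [1, 76],
        'Start': pd.to_datetime(['2023-07-03T11:26:17', '2023-07-03T12:00:00']),
    },
    index=pd.Index([23299509, 23299510], name='JobIDRaw'),
)

with tempfile.TemporaryDirectory() as sTmpDir:
    sTypesPath = os.path.join(sTmpDir, 'types.csv')
    sDataPath = os.path.join(sTmpDir, 'data.csv')

    # Save the dtypes and the data using the same pattern as the cell above
    dfTypes = dfToy.dtypes.to_frame()
    dfTypes.loc[['Start']] = 'str'
    dfTypes.to_csv(sTypesPath)
    dfToy.to_csv(sDataPath)

    # Read the dtypes back into a {column name: dtype string} dictionary
    dTypes = pd.read_csv(sTypesPath, index_col=0).iloc[:, 0].to_dict()

    # Load the data with those dtypes; 'Start' arrives as a string column
    dfReloaded = pd.read_csv(sDataPath, index_col=0, dtype=dTypes)

# Convert the timestamp column back to datetimes after loading
dfReloaded['Start'] = pd.to_datetime(dfReloaded['Start'])
print(dfReloaded.dtypes)
```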

*NOTE: we will now end this notebook to ensure it remains readable. We will continue the pre-processing of the anonymised sacct data in the 'Slurm-EDA-June-Exclusive' notebook, where we will carry out the checks for whether or not a job is exclusive.*