# Slurm EDA - Finding Exclusive Jobs

In this Jupyter Notebook, we continue the EDA started in the *Slurm-EDA-June-Preparation* notebook. We will continue to work on the processing of the data, and in particular we will identify which jobs are exclusive.

We will begin by importing all the necessary libraries. 

In [19]:
import pandas as pd
import numpy as np
import datetime
import re
from joblib import Parallel, delayed

### Loading in The Partition Data

Here we will load in the *dfPartition* DataFrame that we prepared in the *Slurm-EDA-Sample-Data* notebook so that we can use it in this notebook. 

In [20]:
# Here we read the csv file containing the dfPartition DataFrame
dPartitionTypes = pd.read_csv('../data/PartitionTypes.csv', index_col=0).to_dict()
dfPartition = pd.read_csv('../data/dfPartition.csv', index_col=0, dtype=dPartitionTypes['0'])

### Loading in the Job Data

We have already started to process the anonymised sacct data for June in the *Slurm-EDA-June-Preparation* notebook. 

We will now load in the partially processed DataFrame obtained from the *Slurm-EDA-June-Preparation* notebook.

In [21]:
# We will now read the .csv file containing the anonymised job data for June into a pandas DataFrame
# Here we specify 'index_col=0', since the first column of the csv file contains the indexes of the rows. 
dSacctTypes = pd.read_csv('../data/SacctTypes.csv', index_col=0).to_dict()
sSlurmDataPath = '../data/dfSacctPartialProcessed.csv'
dfSacct = pd.read_csv(sSlurmDataPath, index_col=0, dtype=dSacctTypes['0'], parse_dates=['Start', 'End'], infer_datetime_format=True)

### Finding Exclusive Jobs

We are now going to check whether or not the jobs run exclusively on their nodes.

We specify a job to be exclusive if:
1. it has been allocated all of the CPUs on the node it is using, or
2. it is the only job running on a node (for the duration of its execution)
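
As a minimal sketch of the first definition, the example below uses a made-up partition table and made-up jobs (the partition names `part-a` and `part-b` and all of the numbers are assumptions for illustration, not values from the data). A job counts as exclusive when its allocated CPUs equal the CPUs per node multiplied by the number of allocated nodes.

```python
import pandas as pd

# Hypothetical partition table: CPUs available on each node of a partition.
dfToyPartition = pd.DataFrame({'CPUS': [56, 76]}, index=['part-a', 'part-b'])

# Hypothetical jobs: partition, allocated CPUs and allocated nodes.
dfToyJobs = pd.DataFrame({
    'Partition': ['part-a', 'part-a', 'part-b', 'part-b'],
    'AllocCPUS': [56, 28, 152, 10],
    'AllocNodes': [1, 1, 2, 1],
})

# Look up the CPUs per node for each job's partition.
sCPUsPerNode = dfToyJobs['Partition'].map(dfToyPartition['CPUS'])

# A job is exclusive (first definition) if it holds every CPU on every node it uses.
dfToyJobs['ExclusiveCPU'] = dfToyJobs['AllocCPUS'] == sCPUsPerNode * dfToyJobs['AllocNodes']
print(dfToyJobs)
```

Only the first and third toy jobs hold all of the CPUs on their nodes, so only those two are flagged.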

We will now check whether the jobs are exclusive according to our first definition above. 

In [22]:
# We will now check whether or not each job is exclusive by the first definition of exclusiveness above. 

# We first create series of all the necessary values: the partition names, the CPUs per node for each partition, 
# the number of allocated CPUs and the number of allocated nodes. 
SPartitionNames = dfSacct['Partition']
ICPUsPerNode = dfPartition.loc[SPartitionNames]['CPUS'] 
IAllocatedCPUs = dfSacct['AllocCPUS']
IAllocatedNodes = dfSacct['AllocNodes']

# We then calculate the expected number of CPUs for an exclusive job. 
IExclusiveCPUCount = ICPUsPerNode.values * IAllocatedNodes

# We then calculate the actual number of CPUs allocated for the job. 
ICPUCount = IAllocatedCPUs

# We now create the 'ExclusiveCPU' column, adding the value True if the job is exclusive and the 
# value False if the job is shared (by the first definition of exclusiveness).
dfSacct['ExclusiveCPU'] = (ICPUCount == IExclusiveCPUCount)

dfSacct.head()

JobIDRaw,JobName,Partition,ElapsedRaw,Account,State,CPUTimeRAW,NodeList,User,AllocCPUS,AllocNodes,QOS,Start,End,Timelimit,Suspended,ExclusiveCPU
21746840,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2592058,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,TIMEOUT,145155248,cpu-c-18,c4ece8994745f178a32e57ab4ae3f3ab5fe514b6ed876a...,56,1,sqos1,2023-06-05 10:40:53,2023-07-05 10:41:51,30-00:00:00,00:00:00,True
22338038,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2283562,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,CANCELLED by 17574,1278794720,cpu-c-[1-10],e3165dadd9e35c2862c565ad555b75a554606fe79d3d2f...,560,10,sqos1,2023-06-16 13:34:04,2023-07-12 23:53:26,30-00:00:00,00:00:00,True
21905921,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2417133,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,CANCELLED by 17636,135359448,cpu-c-12,7543217955e8d686ede335dae3a036654be47f5f865008...,56,1,sqos1,2023-06-19 10:33:16,2023-07-17 09:58:49,30-00:00:00,00:00:00,True
22569784,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2061307,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,CANCELLED by 17636,115433192,cpu-c-13,7543217955e8d686ede335dae3a036654be47f5f865008...,56,1,sqos1,2023-06-23 13:23:42,2023-07-17 09:58:49,30-00:00:00,00:00:00,True
21248177,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2057023,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,CANCELLED by 17636,115193288,cpu-c-16,7543217955e8d686ede335dae3a036654be47f5f865008...,56,1,sqos1,2023-06-23 14:35:06,2023-07-17 09:58:49,30-00:00:00,00:00:00,True


We will now calculate what proportion of the total jobs is exclusive by our first definition of exclusiveness.

In [23]:
iTotalExclusive = sum(dfSacct['ExclusiveCPU'])
iTotalJobs = len(dfSacct)
iPercentage = (iTotalExclusive)/(iTotalJobs)*100

iExclusiveRuntime = sum(dfSacct[dfSacct['ExclusiveCPU'] == True]['ElapsedRaw'])
iTotalRuntime = sum(dfSacct['ElapsedRaw'])
iRuntimePercentage = (iExclusiveRuntime)/(iTotalRuntime) * 100

print(iPercentage, '% of jobs are exclusive by our first definition of exclusiveness')
print(iRuntimePercentage, '% of runtime is made up of exclusive jobs (by our first definition of exclusiveness).')

6.338754920407716 % of jobs are exclusive by our first definition of exclusiveness
19.795271292238322 % of runtime is made up of exclusive jobs (by our first definition of exclusiveness).


For the jobs that have been declared as not exclusive, we will now check whether or not they are exclusive by our second definition above.

In [24]:
# We will first create a new DataFrame only containing non-exclusive jobs.
bNonExclusiveMask = dfSacct['ExclusiveCPU'] == False
dfNonExclusive = dfSacct[bNonExclusiveMask]

Before we check whether or not the jobs on a node overlap, we must separate out the jobs that run on multiple nodes. We will now create duplicate rows for jobs that run on multiple nodes (one per node). 
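
As a small illustration of this expansion, the sketch below turns a compressed node list into individual node names using only the standard library. The helper name `expand_node_list` is hypothetical, and its regular expression is a simplified variant of the pattern used in this notebook, so it should be read as an example under those assumptions rather than as the notebook's implementation.

```python
import re

def expand_node_list(sNodeList):
    """Expand a compressed node list such as 'cpu-c-[1-3,7]' into individual node names."""
    # Group 1 is the node prefix, group 2 the comma-separated numbers and ranges.
    match = re.fullmatch(r'(.+?)-\[?(\d+(?:-\d+)?(?:,\d+(?:-\d+)?)*)\]?', sNodeList)
    if match is None:
        # Anything that does not look like a node list is returned unchanged.
        return [sNodeList]
    sPrefix, sNumbers = match.groups()
    lNodes = []
    for sPart in sNumbers.split(','):
        # '1-3' gives the bounds 1 and 3, while a single number '7' gives 7 and 7.
        lBounds = sPart.split('-')
        iFirst, iLast = int(lBounds[0]), int(lBounds[-1])
        lNodes.extend(f'{sPrefix}-{i}' for i in range(iFirst, iLast + 1))
    return lNodes

print(expand_node_list('cpu-c-[1-3,7]'))
print(expand_node_list('gpu-q-80'))
```

Like the notebook's approach, this sketch converts the node numbers to integers, so any zero padding in the original names would be lost.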

In [25]:
# We first separate out the node prefix and the node numbers with a single call to .str.extract()
# The node prefix is in the first column (0) of the DataFrame created by .str.extract()
# The node numbers are in the second column (1) of the DataFrame created by .str.extract()
# We use .assign() to build a new DataFrame rather than setting columns on a slice of dfSacct,
# which avoids the pandas SettingWithCopyWarning.
dfNodeParts = dfNonExclusive['NodeList'].str.extract(r'(.+?)(?:-\[|\-)(\d+(?:-\d+)?(?:,\d+(?:-\d+)?)*)(?:\]|$)')
dfNonExclusive = dfNonExclusive.assign(NodePrefix=dfNodeParts[0], NodeNumbers=dfNodeParts[1])


In [26]:
# We now duplicate the rows based on the number of nodes the jobs run on. 
# Below we separate out the separate ranges of numbers.
df_duplicate = dfNonExclusive.assign(list=dfNonExclusive['NodeNumbers'].str.split(',')).explode('list')

# Now we separate out the two bounds of the range of nodes and create a list of all node numbers within that range. 
df_duplicate = df_duplicate.assign(consecutive=df_duplicate['list'].str.split('-'))
df_duplicate['consecutive'] = df_duplicate['consecutive'].apply(lambda lsRange: [lsRange] if type(lsRange) == float else list(range(int(lsRange[0]), int(lsRange[1]) + 1)) if len(lsRange) > 1 else [int(lsRange[0])])

# We now separate out all of the node numbers from the range of nodes. 
df_duplicate = df_duplicate.explode('consecutive')

# Add the prefix from the 'NodePrefix' column to each node number in the 'consecutive' column. 
# We then replace the node name (which contained all node numbers) with a single node name.
# Jobs without a node list end up with the placeholder name 'nan-nan'.
df_duplicate['NodeList'] = df_duplicate.apply(lambda row: str(row['NodePrefix']) + '-' + str(row['consecutive']), axis=1)

# We now drop the 'NodePrefix', 'NodeNumbers', 'list' and 'consecutive' columns
df_duplicate = df_duplicate.drop(['NodePrefix', 'list', 'NodeNumbers', 'consecutive'], axis=1)

# Display the modified DataFrame
df_duplicate

JobIDRaw,JobName,Partition,ElapsedRaw,Account,State,CPUTimeRAW,NodeList,User,AllocCPUS,AllocNodes,QOS,Start,End,Timelimit,Suspended,ExclusiveCPU
23233722,07398de0c7874722fe3f07e03d798808577cf3e7c3bac5...,ampere-long,335248,1b79191747cee4d46e6eb0a7fafa636bda8db2cfeb9658...,CANCELLED by 19685,10727936,gpu-q-80,e01ae5b4a76ce40b994cd8504289af4107b2536d572977...,32,1,gpul,2023-06-30 14:36:31,2023-07-04 11:43:59,7-00:00:00,00:00:00,False
23136491,f0a289923ed634acec748941a7fab6a057e5d4a5cb29e5...,cclake-long,261559,132b1ed7e0e25cac29527013febd1dc3e0d071546ee52e...,COMPLETED,7323652,cpu-p-41,9dc5bb5fd2330534ccf2b31bd331d21abff1910f94959d...,28,1,cpul,2023-07-01 10:09:02,2023-07-04 10:48:21,7-00:00:00,00:00:00,False
23136492,f0a289923ed634acec748941a7fab6a057e5d4a5cb29e5...,cclake-long,250811,132b1ed7e0e25cac29527013febd1dc3e0d071546ee52e...,COMPLETED,7022708,cpu-p-18,9dc5bb5fd2330534ccf2b31bd331d21abff1910f94959d...,28,1,cpul,2023-07-01 11:14:02,2023-07-04 08:54:13,7-00:00:00,00:00:00,False
23136493,f0a289923ed634acec748941a7fab6a057e5d4a5cb29e5...,cclake-long,348295,132b1ed7e0e25cac29527013febd1dc3e0d071546ee52e...,COMPLETED,9752260,cpu-p-13,9dc5bb5fd2330534ccf2b31bd331d21abff1910f94959d...,28,1,cpul,2023-07-01 12:07:47,2023-07-05 12:52:42,7-00:00:00,00:00:00,False
23136516,f0a289923ed634acec748941a7fab6a057e5d4a5cb29e5...,cclake-long,273070,132b1ed7e0e25cac29527013febd1dc3e0d071546ee52e...,COMPLETED,7645960,cpu-p-31,9dc5bb5fd2330534ccf2b31bd331d21abff1910f94959d...,28,1,cpul,2023-07-01 12:19:28,2023-07-04 16:10:38,7-00:00:00,00:00:00,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24261763,171f8e926c82dd83fc75053ec9b110f092aa32b41d5c98...,ampere,7225,99fdc1f587a9423c1abc5a1ce22053628b94a5226d3c0f...,TIMEOUT,231200,gpu-q-63,dd68b7c728069b005e5dac9c3e9d59a7379b1347fa4e6f...,32,1,gpu2,2023-07-27 08:13:36,2023-07-27 10:14:01,02:00:00,00:00:00,False
24261762,171f8e926c82dd83fc75053ec9b110f092aa32b41d5c98...,ampere,29,99fdc1f587a9423c1abc5a1ce22053628b94a5226d3c0f...,COMPLETED,928,gpu-q-63,dd68b7c728069b005e5dac9c3e9d59a7379b1347fa4e6f...,32,1,gpu2,2023-07-27 08:13:36,2023-07-27 08:14:05,02:00:00,00:00:00,False
24261759,171f8e926c82dd83fc75053ec9b110f092aa32b41d5c98...,ampere,28,99fdc1f587a9423c1abc5a1ce22053628b94a5226d3c0f...,COMPLETED,896,gpu-q-14,dd68b7c728069b005e5dac9c3e9d59a7379b1347fa4e6f...,32,1,gpu2,2023-07-27 08:13:36,2023-07-27 08:14:04,02:00:00,00:00:00,False
24261761,171f8e926c82dd83fc75053ec9b110f092aa32b41d5c98...,ampere,29,99fdc1f587a9423c1abc5a1ce22053628b94a5226d3c0f...,COMPLETED,928,gpu-q-62,dd68b7c728069b005e5dac9c3e9d59a7379b1347fa4e6f...,32,1,gpu2,2023-07-27 08:13:36,2023-07-27 08:14:05,02:00:00,00:00:00,False


We will now check which nodes could possibly have jobs that overlap. This means that we will not need to check all nodes for overlapping jobs. 

In [27]:
# We first sort the jobs by the node they run on and their start time. 
df_duplicate.sort_values(['NodeList', 'Start'], axis=0, inplace=True)

# We then create a shifted DataFrame. 
df_duplicate_shift = df_duplicate.shift(periods=1)

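# We then compare each job to the job that starts immediately before it in the sorted DataFrame. If the two jobs run on the same 
# node and the earlier job ends after the later job starts, we record the node as having a possible overlap.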
lPossibleOverlapNodes = df_duplicate[(df_duplicate_shift['NodeList'] == df_duplicate['NodeList']) & (df_duplicate_shift['End'] > df_duplicate['Start'])]['NodeList'].unique()
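
The sort-and-shift check can be illustrated on a toy DataFrame. The job IDs, node names and times below are made up for illustration. Because the rows are sorted by node and start time, any overlap on a node should show up between at least one pair of neighbouring rows, which is why comparing each job with the previous row is enough to flag the node.

```python
import pandas as pd

# Hypothetical jobs: two on node-1 that overlap, two on node-2 that do not.
dfToy = pd.DataFrame({
    'NodeList': ['node-1', 'node-1', 'node-2', 'node-2'],
    'Start': pd.to_datetime(['2023-06-19 08:00', '2023-06-19 08:30',
                             '2023-06-19 08:00', '2023-06-19 09:30']),
    'End': pd.to_datetime(['2023-06-19 09:00', '2023-06-19 10:00',
                           '2023-06-19 09:00', '2023-06-19 10:00']),
}, index=pd.Index([101, 102, 103, 104], name='JobIDRaw'))

# Sort by node and start time, then line each row up with the row before it.
dfToy = dfToy.sort_values(['NodeList', 'Start'])
dfPrevious = dfToy.shift(periods=1)

# A node is flagged when the previous job ran on the same node and ended after this job started.
bSameNode = dfPrevious['NodeList'] == dfToy['NodeList']
bPreviousStillRunning = dfPrevious['End'] > dfToy['Start']
lToyOverlapNodes = dfToy[bSameNode & bPreviousStillRunning]['NodeList'].unique()
print(lToyOverlapNodes)
```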

We will now filter the DataFrame to include only the nodes that have possible overlaps. 

In [28]:
# We first create a boolean mask that only includes nodes with a possible overlap
bPossibleSharedMask = df_duplicate['NodeList'].isin(lPossibleOverlapNodes)

# We now apply the boolean mask to the DataFrame
dfPossibleSharedExpanded = df_duplicate[bPossibleSharedMask]

We will now compute which jobs are shared. 

We will start by creating a function which compares a job to all other jobs that start at most 36 hours before the job starts or that start during the job's runtime. We check for 36 hours before the job starts since the maximum runtime of a job (that isn't on a 'long' partition) is 36 hours. This allows us to minimise the number of other jobs that we need to compare against. 

In [29]:
def fFindSharedJobs(job, df):
    
    # We first store some useful information about the job we are checking. 
    sNode = job.NodeList
    tJobStart = job.Start
    tJobEnd = job.End
    tJobTime = job.ElapsedRaw
    
    # We now create the boolean mask that filters out any jobs that start too early/ late for a possible overlap.
    bTimeMask = (df.Start <= job.Start + datetime.timedelta(seconds = tJobTime)) & (df.Start >= job.Start - datetime.timedelta(hours = 36))
    
    # We now create series of the start and end times of all other jobs
    StOtherStart = df.Start[bTimeMask]
    StOtherEnd = df.End[bTimeMask]
    
    # We now create a series of boolean series checking separate conditions that must be met for a job to overlap. 
    bTest1 = (df[bTimeMask].index != job.name) 
    bTest2 = (df[bTimeMask]['NodeList'] == sNode)
    bTest3 = ((StOtherStart >= tJobStart) & (StOtherStart <= tJobEnd)) 
    bTest4 = ((StOtherEnd >= tJobStart) & (StOtherEnd <= tJobEnd)) 
    bTest5 = ((StOtherStart <= tJobStart) & (StOtherEnd >= tJobEnd))
    
    # We now put all of the boolean checks above together to make a boolean mask. 
    bMask = bTest1 & bTest2 & (bTest3 | bTest4 | bTest5)
    
    # This commented out code would return True if the mask has at least one True value (and so the job overlaps with at least one other job.)
    # return sum(bMask) > 0 

    # This will return a list of overlapping jobs rather than a boolean value.
    return list(df[bTimeMask][bMask].index)
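
The three overlap conditions in the function (the other job starts inside the job, ends inside the job, or spans the whole job) can be compared with the usual single test for closed intervals. The sketch below, using hypothetical helper names and made-up times, checks by brute force that the two formulations agree whenever every job starts no later than it ends.

```python
from datetime import datetime, timedelta
from itertools import product

def three_case_overlap(tStartA, tEndA, tStartB, tEndB):
    """Overlap test written as the three cases used in fFindSharedJobs (job A is the reference job)."""
    bStartsInside = tStartA <= tStartB <= tEndA
    bEndsInside = tStartA <= tEndB <= tEndA
    bSpans = tStartB <= tStartA and tEndB >= tEndA
    return bStartsInside or bEndsInside or bSpans

def single_overlap(tStartA, tEndA, tStartB, tEndB):
    """Equivalent closed-interval test: each job starts no later than the other one ends."""
    return tStartB <= tEndA and tEndB >= tStartA

# Build every interval whose bounds lie on a small grid of times.
tBase = datetime(2023, 6, 19, 8, 0)
lTimes = [tBase + timedelta(minutes=10 * i) for i in range(6)]
lIntervals = [(tStart, tEnd) for tStart, tEnd in product(lTimes, repeat=2) if tStart <= tEnd]

# Compare the two tests on every ordered pair of intervals.
bAgree = all(
    three_case_overlap(*tA, *tB) == single_overlap(*tA, *tB)
    for tA, tB in product(lIntervals, repeat=2)
)
print(bAgree)
```

Both tests treat the bounds as inclusive, so a job that starts at the exact moment another one ends is counted as overlapping.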

We will now create a function that checks whether all of the jobs that overlap with a shared job were submitted by a single user. 

*NOTE: This function is not used in the final framework as I did not have enough time to include it. I created this function so that we could group shared jobs submitted by the same user and provide a total carbon footprint for that group of jobs (rather than excluding the jobs completely).*

In [30]:
def sharedSameUser(job, df):
    
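    # We first store the list of jobs that overlap with the job we are checking.
    # Note that the job's own user (job.User) is not part of the comparison below:
    # the function only checks that all of the overlapping jobs share a single user.
    lOverlapping = job.Overlapping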

    lUsers = []

    for sJob in lOverlapping:
        lUsers.append(df.loc[sJob, 'User'])
    
    if len(set(lUsers)) == 1:
        return True
    else: 
        return False
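
If the grouping should also require that the overlapping jobs belong to the same user as the job itself, a stricter variant could look like the sketch below. This is an assumption about a possible extension, not something the function above does. The helper name `shared_only_with_own_user` and the job-to-user mapping are hypothetical.

```python
# Hypothetical mapping from job ID to user, made up for illustration.
dToyUsers = {1: 'alice', 2: 'alice', 3: 'bob', 4: 'bob'}

def shared_only_with_own_user(iJob, lOverlapping, dUsers):
    """Return True if the job overlaps with at least one job and all of them belong to the job's own user."""
    sUser = dUsers[iJob]
    setOtherUsers = {dUsers[iOther] for iOther in lOverlapping}
    # An empty overlap list gives an empty set, which is not equal to {sUser}, so the result is False.
    return setOtherUsers == {sUser}

print(shared_only_with_own_user(1, [2], dToyUsers))     # overlaps only with the same user's other job
print(shared_only_with_own_user(1, [3, 4], dToyUsers))  # overlapping jobs share one user, but a different one
print(shared_only_with_own_user(1, [], dToyUsers))      # no overlapping jobs at all
```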

Now that we have functions to find the jobs that share a node with a given job, we will apply them to a test DataFrame containing 7 jobs. 

In [31]:
# Below we create our test DataFrame, editing the timestamps to get a set of all possible overlap combinations and 
# changing the node to ensure all jobs are running on the same node. 

dfTest = dfSacct.iloc[0:7].copy()

dfTest.loc[dfTest.index[0], 'Start'] = '2023-06-19 08:00:00'
dfTest.loc[dfTest.index[0], 'End'] = '2023-06-19 09:00:00'
dfTest.loc[dfTest.index[0], 'NodeList'] = 'cpu-c-14'

dfTest.loc[dfTest.index[1], 'Start'] = '2023-06-19 07:00:00'
dfTest.loc[dfTest.index[1], 'End'] = '2023-06-19 07:30:00'
dfTest.loc[dfTest.index[1], 'NodeList'] = 'cpu-c-14'

dfTest.loc[dfTest.index[2], 'Start'] = '2023-06-19 09:01:00'
dfTest.loc[dfTest.index[2], 'End'] = '2023-06-19 09:30:00'
dfTest.loc[dfTest.index[2], 'NodeList'] = 'cpu-c-14'

dfTest.loc[dfTest.index[3], 'Start'] = '2023-06-19 07:51:00'
dfTest.loc[dfTest.index[3], 'End'] = '2023-06-19 08:51:00'
dfTest.loc[dfTest.index[3], 'NodeList'] = 'cpu-c-14'

dfTest.loc[dfTest.index[4], 'Start'] = '2023-06-19 08:51:00'
dfTest.loc[dfTest.index[4], 'End'] = '2023-06-19 09:51:00'
dfTest.loc[dfTest.index[4], 'NodeList'] = 'cpu-c-14'
dfTest.loc[dfTest.index[4], 'User'] = 'TEST'

dfTest.loc[dfTest.index[5], 'Start'] = '2023-06-19 07:51:00'
dfTest.loc[dfTest.index[5], 'End'] = '2023-06-19 09:51:00'
dfTest.loc[dfTest.index[5], 'NodeList'] = 'cpu-c-14'
dfTest.loc[dfTest.index[5], 'User'] = 'TEST'

dfTest.loc[dfTest.index[6], 'Start'] = '2023-06-19 07:51:00'
dfTest.loc[dfTest.index[6], 'End'] = '2023-06-19 07:52:00'
dfTest.loc[dfTest.index[6], 'NodeList'] = 'cpu-c-14'

# We will now apply our functions to each row of the test DataFrame
dfTest['Overlapping'] = dfTest.apply(lambda row : fFindSharedJobs(row, dfTest), axis=1)
dfTest['SharedSameUser'] = dfTest.apply(lambda row : sharedSameUser(row, dfTest), axis=1)

# We will output the test DataFrame to ensure the function worked as expected. 
dfTest

JobIDRaw,JobName,Partition,ElapsedRaw,Account,State,CPUTimeRAW,NodeList,User,AllocCPUS,AllocNodes,QOS,Start,End,Timelimit,Suspended,ExclusiveCPU,Overlapping,SharedSameUser
21746840,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2592058,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,TIMEOUT,145155248,cpu-c-14,c4ece8994745f178a32e57ab4ae3f3ab5fe514b6ed876a...,56,1,sqos1,2023-06-19 08:00:00,2023-06-19 09:00:00,30-00:00:00,00:00:00,True,"[22569784, 21248177, 23233722]",False
22338038,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2283562,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,CANCELLED by 17574,1278794720,cpu-c-14,e3165dadd9e35c2862c565ad555b75a554606fe79d3d2f...,560,10,sqos1,2023-06-19 07:00:00,2023-06-19 07:30:00,30-00:00:00,00:00:00,True,[],False
21905921,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2417133,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,CANCELLED by 17636,135359448,cpu-c-14,7543217955e8d686ede335dae3a036654be47f5f865008...,56,1,sqos1,2023-06-19 09:01:00,2023-06-19 09:30:00,30-00:00:00,00:00:00,True,"[21248177, 23233722]",True
22569784,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2061307,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,CANCELLED by 17636,115433192,cpu-c-14,7543217955e8d686ede335dae3a036654be47f5f865008...,56,1,sqos1,2023-06-19 07:51:00,2023-06-19 08:51:00,30-00:00:00,00:00:00,True,"[21746840, 21248177, 23233722, 23136491]",False
21248177,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2057023,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,CANCELLED by 17636,115193288,cpu-c-14,TEST,56,1,sqos1,2023-06-19 08:51:00,2023-06-19 09:51:00,30-00:00:00,00:00:00,True,"[21746840, 21905921, 22569784, 23233722]",False
23233722,07398de0c7874722fe3f07e03d798808577cf3e7c3bac5...,ampere-long,335248,1b79191747cee4d46e6eb0a7fafa636bda8db2cfeb9658...,CANCELLED by 19685,10727936,cpu-c-14,TEST,32,1,gpul,2023-06-19 07:51:00,2023-06-19 09:51:00,7-00:00:00,00:00:00,False,"[21746840, 21905921, 22569784, 21248177, 23136...",False
23136491,f0a289923ed634acec748941a7fab6a057e5d4a5cb29e5...,cclake-long,261559,132b1ed7e0e25cac29527013febd1dc3e0d071546ee52e...,COMPLETED,7323652,cpu-c-14,9dc5bb5fd2330534ccf2b31bd331d21abff1910f94959d...,28,1,cpul,2023-06-19 07:51:00,2023-06-19 07:52:00,7-00:00:00,00:00:00,False,"[22569784, 23233722]",False


We will now create a list of DataFrames, each one filtered by the node the jobs run on. This allows us to easily compare each job to every other job on the node. It also makes the program more efficient, since we don't compare jobs to jobs on other nodes (which would be pointless since, by our definition of exclusive jobs, we are only interested in overlapping jobs on the same node). 

In [32]:
# We will first create a list of all nodes available. 
lNodes = dfPossibleSharedExpanded['NodeList'].unique()

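# We will also remove the 'nan-nan' element (jobs without a node list) from the array. 
# Filtering by value does not rely on 'nan-nan' being the last element.
lNodes = lNodes[lNodes != 'nan-nan']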

# We will now create a list of filtered DataFrames, each one only containing the jobs that run on a certain node. 
lNodeDataFrames =  []

for sNode in lNodes:
    lNodeDataFrames.append(dfPossibleSharedExpanded[dfPossibleSharedExpanded['NodeList'] == sNode].copy())
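
An alternative way of producing a per-node split, shown here only as a sketch on made-up data, is pandas `groupby`, which splits the DataFrame in one pass instead of filtering it once per node. The job IDs, node names and users below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical expanded job table with one row per (job, node) pair.
dfToy = pd.DataFrame({
    'NodeList': ['node-1', 'node-2', 'node-1', 'nan-nan'],
    'User': ['a', 'b', 'c', 'd'],
}, index=pd.Index([101, 102, 103, 104], name='JobIDRaw'))

# Drop the rows without a real node name, then split into one DataFrame per node.
dfToyValid = dfToy[dfToy['NodeList'] != 'nan-nan']
lToyNodeDataFrames = [dfNode.copy() for _, dfNode in dfToyValid.groupby('NodeList')]

for dfNode in lToyNodeDataFrames:
    print(dfNode, end='\n\n')
```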

We will now apply our fFindSharedJobs function to each job on the separate nodes. We use the joblib library to process the nodes in parallel and reduce the total running time. 

In [33]:

def fFindOnNode(df):
    df['Overlapping'] = df.apply(lambda row : fFindSharedJobs(row, df), axis=1)
    df['SharedSameUser'] = df.apply(lambda row : sharedSameUser(row, df), axis=1)
    return df

lOverlapDataFrames = Parallel(n_jobs=8)(delayed(fFindOnNode)(df) for df in lNodeDataFrames)

We will now concatenate all of the DataFrames from above to create one DataFrame containing all of the jobs and their *Overlapping* column. 

In [34]:
dfPossibleSharedOverlapping = pd.concat(lOverlapDataFrames)

We will now check the rows of *dfPossibleSharedOverlapping* for one node to ensure our code has worked as expected and the *Overlapping* column lists the correct jobs.

In [35]:
# We first create the boolean mask to filter the DataFrame by node. 
bNodeMask = dfPossibleSharedOverlapping['NodeList'] == 'cpu-b-51'
dfPossibleSharedOverlapping[bNodeMask]

JobIDRaw,JobName,Partition,ElapsedRaw,Account,State,CPUTimeRAW,NodeList,User,AllocCPUS,AllocNodes,QOS,Start,End,Timelimit,Suspended,ExclusiveCPU,Overlapping,SharedSameUser
23299233,2470f61f0db22994a9ccdcf9870c71eb4da95cdc949f30...,login-epicov,86409,90e3af25b42bb3534d084cdbd7a522bc3202d8814f628a...,TIMEOUT,86409,cpu-b-51,c6d7d845ecc01a5a200c097eac49143eb341e3c1640abf...,1,1,sqos1,2023-07-03 11:09:34,2023-07-04 11:09:43,1-00:00:00,00:00:00,False,"[23299237, 23341843]",False
23299237,2470f61f0db22994a9ccdcf9870c71eb4da95cdc949f30...,login-epicov,86409,90e3af25b42bb3534d084cdbd7a522bc3202d8814f628a...,TIMEOUT,86409,cpu-b-51,c6d7d845ecc01a5a200c097eac49143eb341e3c1640abf...,1,1,sqos1,2023-07-03 11:09:34,2023-07-04 11:09:43,1-00:00:00,00:00:00,False,"[23299233, 23341843]",False
23341843,2470f61f0db22994a9ccdcf9870c71eb4da95cdc949f30...,login-epicov,28825,90e3af25b42bb3534d084cdbd7a522bc3202d8814f628a...,TIMEOUT,28825,cpu-b-51,accda041ca87e9e74a68e86a0319d9e9bec61b54868c5c...,1,1,sqos1,2023-07-04 11:00:54,2023-07-04 19:01:19,08:00:00,00:00:00,False,"[23299233, 23299237, 23345862, 23346672, 23346...",False
23345862,d79bad4ffccdbe4cc15961745f4bd1084bce06b7b6f1bf...,login-epicov,77,90e3af25b42bb3534d084cdbd7a522bc3202d8814f628a...,CANCELLED by 9885,154,cpu-b-51,cc1066aea69b3dcff448e8e5d01612c20b3222abe78b7d...,2,1,sqos1,2023-07-04 13:33:01,2023-07-04 13:34:18,12:00:00,00:00:00,False,[23341843],True
23346672,2470f61f0db22994a9ccdcf9870c71eb4da95cdc949f30...,login-epicov,9652,90e3af25b42bb3534d084cdbd7a522bc3202d8814f628a...,COMPLETED,9652,cpu-b-51,f2bc06da75869f67d29da165a48eb0d5a33e09bd72dc12...,1,1,sqos1,2023-07-04 14:30:27,2023-07-04 17:11:19,03:00:00,00:00:00,False,"[23341843, 23346676, 23346714, 23346881]",False
23346676,d79bad4ffccdbe4cc15961745f4bd1084bce06b7b6f1bf...,login-epicov,9618,90e3af25b42bb3534d084cdbd7a522bc3202d8814f628a...,COMPLETED,19236,cpu-b-51,f2bc06da75869f67d29da165a48eb0d5a33e09bd72dc12...,2,1,sqos1,2023-07-04 14:30:57,2023-07-04 17:11:15,03:00:00,00:00:00,False,"[23341843, 23346672, 23346714, 23346881]",False
23346714,2470f61f0db22994a9ccdcf9870c71eb4da95cdc949f30...,login-epicov,86422,90e3af25b42bb3534d084cdbd7a522bc3202d8814f628a...,TIMEOUT,86422,cpu-b-51,c6d7d845ecc01a5a200c097eac49143eb341e3c1640abf...,1,1,sqos1,2023-07-04 14:36:16,2023-07-05 14:36:38,1-00:00:00,00:00:00,False,"[23341843, 23346672, 23346676, 23346881, 23454...",False
23346881,d79bad4ffccdbe4cc15961745f4bd1084bce06b7b6f1bf...,login-epicov,3600,90e3af25b42bb3534d084cdbd7a522bc3202d8814f628a...,TIMEOUT,72000,cpu-b-51,f2bc06da75869f67d29da165a48eb0d5a33e09bd72dc12...,20,1,sqos1,2023-07-04 14:45:12,2023-07-04 15:45:12,01:00:00,00:00:00,False,"[23341843, 23346672, 23346676, 23346714]",False
23454486,2470f61f0db22994a9ccdcf9870c71eb4da95cdc949f30...,login-epicov,13589,90e3af25b42bb3534d084cdbd7a522bc3202d8814f628a...,COMPLETED,13589,cpu-b-51,f2bc06da75869f67d29da165a48eb0d5a33e09bd72dc12...,1,1,sqos1,2023-07-05 14:34:28,2023-07-05 18:20:57,04:00:00,00:00:00,False,"[23346714, 23454485, 23455061, 23455057, 23489...",False
23454485,d79bad4ffccdbe4cc15961745f4bd1084bce06b7b6f1bf...,login-epicov,13587,90e3af25b42bb3534d084cdbd7a522bc3202d8814f628a...,COMPLETED,27174,cpu-b-51,f2bc06da75869f67d29da165a48eb0d5a33e09bd72dc12...,2,1,sqos1,2023-07-05 14:34:28,2023-07-05 18:20:55,04:00:00,00:00:00,False,"[23346714, 23454486, 23455061, 23455057, 23489...",False


By checking the *Start* and *End* columns as well as the *Overlapping* value, we can see that the code has worked as expected. 

We will now create two new columns in the *dfSacct* DataFrame:

- *ExclusiveOverlapping*, which will be *True* if the job is exclusive by our second definition of exclusiveness (the job doesn't overlap with any other jobs) and *False* otherwise. 
- *Exclusive*, which will be *True* if the job is exclusive by either one of our definitions of exclusiveness, and *False* otherwise. 

In [36]:
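# We will first create a boolean mask for all jobs that do not overlap. 
# The 'Overlapping' column holds a list of job IDs, so a job does not overlap when its list is empty
# (a comparison with False would not detect the empty lists).
bNoOverlap = dfPossibleSharedOverlapping['Overlapping'].map(len) == 0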

# We will now apply this boolean mask to our 'dfPossibleSharedOverlapping' DataFrame to get all jobs which are exclusive by our second definition
dfNoOverlap = dfPossibleSharedOverlapping[bNoOverlap]

# We will now get a list of the job indexes of the jobs that do not overlap
lNotOverlapping = list(dfNoOverlap.index.unique())

# For the jobs that do not overlap, we will give them a value of True for the 'ExclusiveOverlapping' column of the dfSacct DataFrame. 
dfSacct['ExclusiveOverlapping'] = False
dfSacct.loc[lNotOverlapping, 'ExclusiveOverlapping'] = True

# We will now create a function to check whether a job is exclusive by either one of our definitions. 
def isExclusive(job):
    """ This function will return True if the job has a value of True for either the 'ExclusiveCPU' column or the 'ExclusiveOverlapping' column."""

    # We will first get the values of the 'ExclusiveCPU' and 'ExclusiveOverlapping' columns for that job.
    bCPU = job.ExclusiveCPU
    bOverlapping = job.ExclusiveOverlapping

    return bCPU or bOverlapping

# We will now apply the function above to each row of the dfSacct DataFrame to create the new 'Exclusive' column. 
dfSacct['Exclusive'] = dfSacct.apply(lambda row : isExclusive(row), axis=1)
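
Since the *Overlapping* column holds lists of job IDs, a job with no overlaps is one whose list is empty. The sketch below, on made-up data, shows how an empty-list test and the combined *Exclusive* flag can be computed with vectorised operations. The job IDs and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical jobs: 101 and 102 overlap each other, 103 and 104 overlap with nothing.
dfToy = pd.DataFrame({
    'Overlapping': [[102], [101], [], []],
    'ExclusiveCPU': [False, False, False, True],
}, index=pd.Index([101, 102, 103, 104], name='JobIDRaw'))

# Comparing a column of lists with False gives False for every row, even for empty lists.
bComparedWithFalse = dfToy['Overlapping'] == False

# Testing the length of each list identifies the jobs with no overlap.
dfToy['ExclusiveOverlapping'] = dfToy['Overlapping'].map(len) == 0

# A job is exclusive if either definition holds; '|' combines the two boolean columns row by row.
dfToy['Exclusive'] = dfToy['ExclusiveCPU'] | dfToy['ExclusiveOverlapping']
print(dfToy)
```

The element-wise `|` gives the same result as applying a row-wise `or` function, without looping over the rows in Python.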


We will now add the *SharedSameUser* column to the *dfSacct* DataFrame.

In [56]:
# We will first create a boolean mask for all jobs that only share a node with jobs run by the same user. 
bSameUser = dfPossibleSharedOverlapping['SharedSameUser'] == True

# We will now apply this to the 'dfPossibleSharedOverlapping' DataFrame to obtain only the shared jobs
# that share with no other users. 
dfSameUser = dfPossibleSharedOverlapping[bSameUser]

# We will now get a list of all shared job indexes that do not share with any other user
lSameUserIndex = list(dfSameUser.index.unique())

# We will now create the 'SharedSameUser' column in the dfSacct DataFrame
dfSacct['SharedSameUser'] = False
dfSacct.loc[lSameUserIndex, 'SharedSameUser'] = True

We are now going to create another DataFrame, *dfSacctExtended*, that separates out the different nodes that a job runs on. We may be interested in this when analysing the data. 

In [57]:
# We first create a copy of the 'dfSacct' DataFrame
dfSacctExtended = dfSacct.copy()

# We then separate out the node prefix and the node numbers
# The node prefix is in the first column (0) of the DataFrame created by .str.extract()
# The node numbers are in the second column (1) of the DataFrame created by .str.extract()
dfSacctExtended['NodePrefix'] = dfSacctExtended['NodeList'].str.extract(r'(.+?)(?:-\[|\-)(\d+(?:-\d+)?(?:,\d+(?:-\d+)?)*)(?:\]|$)')[0]
dfSacctExtended['NodeNumbers'] = dfSacctExtended['NodeList'].str.extract(r'(.+?)(?:-\[|\-)(\d+(?:-\d+)?(?:,\d+(?:-\d+)?)*)(?:\]|$)')[1]

# We now duplicate the rows based on the number of nodes the jobs run on. 
# Below we separate out the separate ranges of numbers.
dfSacctExtended = dfSacctExtended.assign(list=dfSacctExtended['NodeNumbers'].str.split(',')).explode('list')

# Now we separate out the two bounds of the range of nodes and create a list of all node numbers within that range. 
dfSacctExtended = dfSacctExtended.assign(consecutive=dfSacctExtended['list'].str.split('-'))
dfSacctExtended['consecutive'] = dfSacctExtended['consecutive'].apply(lambda lsRange: [lsRange] if type(lsRange) == float else list(range(int(lsRange[0]), int(lsRange[1]) + 1)) if len(lsRange) > 1 else [int(lsRange[0])])

# We now separate out all of the node numbers from the range of nodes. 
dfSacctExtended = dfSacctExtended.explode('consecutive')

# Add the prefix from the 'NodePrefix' column to each node number in the 'consecutive' column. 
# We then replace the node name (which contained all node numbers) with a single node name.
dfSacctExtended['NodeList'] = dfSacctExtended.apply(lambda row: str(row['NodePrefix']) + '-' + str(row['consecutive']), axis=1)

# We now drop the helper columns ('NodePrefix', 'NodeNumbers', 'list' and 'consecutive') as well as 'AllocNodes', 'AllocCPUS' and 'CPUTimeRAW'
dfSacctExtended = dfSacctExtended.drop(['NodePrefix', 'list', 'NodeNumbers', 'consecutive', 'AllocNodes', 'AllocCPUS', 'CPUTimeRAW'], axis=1)

# Display the modified DataFrame
dfSacctExtended

JobIDRaw,JobName,Partition,ElapsedRaw,Account,State,NodeList,User,QOS,Start,End,Timelimit,Suspended,ExclusiveCPU,ExclusiveOverlapping,Exclusive,SharedSameUser
21746840,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2592058,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,TIMEOUT,cpu-c-18,c4ece8994745f178a32e57ab4ae3f3ab5fe514b6ed876a...,sqos1,2023-06-05 10:40:53,2023-07-05 10:41:51,30-00:00:00,00:00:00,True,False,True,False
22338038,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2283562,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,CANCELLED by 17574,cpu-c-1,e3165dadd9e35c2862c565ad555b75a554606fe79d3d2f...,sqos1,2023-06-16 13:34:04,2023-07-12 23:53:26,30-00:00:00,00:00:00,True,False,True,False
22338038,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2283562,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,CANCELLED by 17574,cpu-c-2,e3165dadd9e35c2862c565ad555b75a554606fe79d3d2f...,sqos1,2023-06-16 13:34:04,2023-07-12 23:53:26,30-00:00:00,00:00:00,True,False,True,False
22338038,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2283562,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,CANCELLED by 17574,cpu-c-3,e3165dadd9e35c2862c565ad555b75a554606fe79d3d2f...,sqos1,2023-06-16 13:34:04,2023-07-12 23:53:26,30-00:00:00,00:00:00,True,False,True,False
22338038,0ad8058c9113c4dbf65087ddf7378c4ff006e7f3bc11b6...,mtg,2283562,acfc5635378d1cfdb38e5c8153b64aac9f7bc780f6db14...,CANCELLED by 17574,cpu-c-4,e3165dadd9e35c2862c565ad555b75a554606fe79d3d2f...,sqos1,2023-06-16 13:34:04,2023-07-12 23:53:26,30-00:00:00,00:00:00,True,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24261763,171f8e926c82dd83fc75053ec9b110f092aa32b41d5c98...,ampere,7225,99fdc1f587a9423c1abc5a1ce22053628b94a5226d3c0f...,TIMEOUT,gpu-q-63,dd68b7c728069b005e5dac9c3e9d59a7379b1347fa4e6f...,gpu2,2023-07-27 08:13:36,2023-07-27 10:14:01,02:00:00,00:00:00,False,False,False,True
24261762,171f8e926c82dd83fc75053ec9b110f092aa32b41d5c98...,ampere,29,99fdc1f587a9423c1abc5a1ce22053628b94a5226d3c0f...,COMPLETED,gpu-q-63,dd68b7c728069b005e5dac9c3e9d59a7379b1347fa4e6f...,gpu2,2023-07-27 08:13:36,2023-07-27 08:14:05,02:00:00,00:00:00,False,False,False,True
24261759,171f8e926c82dd83fc75053ec9b110f092aa32b41d5c98...,ampere,28,99fdc1f587a9423c1abc5a1ce22053628b94a5226d3c0f...,COMPLETED,gpu-q-14,dd68b7c728069b005e5dac9c3e9d59a7379b1347fa4e6f...,gpu2,2023-07-27 08:13:36,2023-07-27 08:14:04,02:00:00,00:00:00,False,False,False,False
24261761,171f8e926c82dd83fc75053ec9b110f092aa32b41d5c98...,ampere,29,99fdc1f587a9423c1abc5a1ce22053628b94a5226d3c0f...,COMPLETED,gpu-q-62,dd68b7c728069b005e5dac9c3e9d59a7379b1347fa4e6f...,gpu2,2023-07-27 08:13:36,2023-07-27 08:14:05,02:00:00,00:00:00,False,False,False,True


We will now save the two DataFrames *dfSacct* and *dfSacctExtended* as .csv files so that they can be used in other notebooks. 

In [58]:
dfSacct.to_csv('../data/dfSacctFinal.csv')
dfSacctExtended.to_csv('../data/dfSacctExtended.csv')