# Slurm EDA - Sample Data

In this Jupyter Notebook, we begin an exploratory data analysis (EDA) of the Slurm data from the CSD3 supercomputer. The goal is to gain insight into the data and to determine the proportion of shared jobs submitted using Slurm.

We begin by importing the necessary libraries.

In [2]:
import pandas as pd
import numpy as np
import datetime
import re
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp

## Test on Sample Data

Initially, we did not have access to any user data from Slurm. As a result, we begin by analysing the Slurm data for the jobs that we submitted ourselves.

### Loading in The Partition Data

Before we explore the job data, we are going to create a dictionary containing the number of cores per node for each partition. 

To do this, we first need to load data containing each partition's name, its number of nodes and its number of CPUs per node. This data was obtained using the command *sinfo --format '%R|%D|%c|'*.
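As a rough illustration of the dictionary described above, the sketch below parses pipe-delimited text in the shape produced by that `sinfo` command into a mapping from partition name to CPUs per node. It uses only the standard library; the sample text is a hypothetical excerpt built from values shown in this notebook, and the function name `parse_cpus_per_node` is an assumption made for this example. The values are kept as strings because some partitions report a value ending in '+'.

```python
import csv
import io

# Hypothetical excerpt of the sinfo output, using values from this notebook.
# Every line ends with a trailing '|' because of the format string.
SAMPLE_SINFO_OUTPUT = """\
PARTITION|NODES|CPUS|
skylake|1145|32|
cclake|672|56|
icelake|544|76|
sapphire|112|112|
"""


def parse_cpus_per_node(text):
    """Return {partition name: CPUs per node (as a string)}."""
    reader = csv.reader(io.StringIO(text), delimiter='|')
    header = next(reader)
    name_col = header.index('PARTITION')
    cpus_col = header.index('CPUS')

    cpus_per_node = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # The trailing '|' yields an empty final field, which is never indexed.
        cpus_per_node[row[name_col]] = row[cpus_col]
    return cpus_per_node


cpus_per_node = parse_cpus_per_node(SAMPLE_SINFO_OUTPUT)
print(cpus_per_node)
```

The notebook itself takes the pandas route in the cells that follow; this version is only meant to show what the end result looks like.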

In [3]:
# We will now read the .txt file containing the partition data into a pandas DataFrame.
sPartitionDataPath = '../data/partition-info-all.txt'
dfPartition = pd.read_csv(sPartitionDataPath, sep='|')

# The trailing '|' on each line produces an empty extra column at the end, which we remove now.
dfPartition = dfPartition.iloc[:, :-1]

# We will now output the DataFrame
dfPartition

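| | PARTITION | NODES | CPUS |
|---|---|---|---|
| 0 | skylake | 1145 | 32 |
| 1 | skylake-himem | 384 | 32 |
| 2 | skylake-long | 50 | 32 |
| 3 | cclake | 672 | 56 |
| 4 | cclake-himem | 56 | 56 |
| 5 | cclake-long | 84 | 56 |
| 6 | icelake | 544 | 76 |
| 7 | icelake-himem | 136 | 76 |
| 8 | icelake-long | 56 | 76 |
| 9 | sapphire | 112 | 112 |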


We will now modify the *dfPartition* DataFrame to only include the *CPUS* column and have the *PARTITION* column as the index. 

In [4]:
# We make the 'PARTITION' column the index of this DataFrame.
# We are not interested in the number of nodes for each partition, so we drop the 'NODES' column.
dfPartition = dfPartition.set_index('PARTITION').drop(columns='NODES')

# We will now output the first 5 rows of the DataFrame
dfPartition.head()

| PARTITION | CPUS |
|---|---|
| skylake | 32 |
| skylake-himem | 32 |
| skylake-long | 32 |
| cclake | 56 |
| cclake-himem | 56 |


The *pvc* and *bluefield* partitions report 104+ and 8+ CPUs per node respectively. ***At the time of writing, it is unclear what the '+' means. For the moment, we are going to remove any job data for the pvc and bluefield partitions and focus the EDA on the rest of the Slurm data. The commented-out code cell below strips the '+' from these values so that their data could be used, and is kept in case we decide to use it in the future***.

In [5]:
# dfPartition['CPUS'] = pd.to_numeric(dfPartition['CPUS'].str.strip('+'), errors='coerce')
# dfPartition.dtypes

***The code cell below removes any rows in the dfPartition DataFrame that have a value in the CPUS column that ends in a '+', as explained above***.

In [6]:
# We create a boolean mask that returns True for all rows that have a 
# CPUS value that doesn't end in a '+'
bNoPlusMask = ~dfPartition['CPUS'].str.endswith('+')

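# We now apply this mask to the dfPartition DataFrame. The .copy() gives us an
# independent DataFrame, so the later assignment to the CPUS column cannot
# trigger a SettingWithCopyWarning.
dfPartition = dfPartition[bNoPlusMask].copy()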
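The two options discussed above (dropping the '+' rows, or stripping the '+' and keeping the number) can be compared side by side in plain Python. This is a minimal sketch rather than the notebook's own method: the helper name `split_by_plus` and the small input dictionary are assumptions for illustration, with the pvc and bluefield values taken from the text above.

```python
def split_by_plus(cpus_per_node):
    """Split raw CPU strings into exact values and '+'-suffixed values.

    Returns two dicts of integers: partitions with a plain value, and
    partitions whose value ended in '+' (with the '+' stripped).
    """
    exact = {}
    flagged = {}
    for partition, cpus in cpus_per_node.items():
        if cpus.endswith('+'):
            flagged[partition] = int(cpus.rstrip('+'))
        else:
            exact[partition] = int(cpus)
    return exact, flagged


# Hypothetical input using values mentioned in this notebook.
raw = {'skylake': '32', 'cclake': '56', 'pvc': '104+', 'bluefield': '8+'}
exact, flagged = split_by_plus(raw)

print(exact)    # partitions kept by the boolean mask approach
print(flagged)  # partitions that the mask removes, with '+' stripped
```

Keeping the flagged partitions in a separate dictionary makes it easy to revisit them later without mixing them into the main analysis.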

The code cell below shows that, although all the remaining values in the CPUS column are whole numbers, they are stored with the object dtype. As a result, we will now cast them to a numeric data type.

In [7]:
dfPartition.dtypes

CPUS    object
dtype: object

In [8]:
# Here we cast the values of the CPUS column to a numeric data type
dfPartition['CPUS'] = pd.to_numeric(dfPartition['CPUS'], errors='coerce')
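To show what `errors='coerce'` does in the cast above, the sketch below applies `pd.to_numeric` to a small hypothetical Series that still contains a '+' value. The Series and its variable names are assumptions for illustration only; in the notebook the '+' rows have already been removed, so no values are coerced there.

```python
import pandas as pd

# Hypothetical CPUS values: two clean strings and one with a '+' suffix.
sCpus = pd.Series(['32', '56', '104+'], name='CPUS')

# errors='coerce' turns anything that cannot be parsed into NaN
# instead of raising an exception.
sCpusNumeric = pd.to_numeric(sCpus, errors='coerce')

print(sCpusNumeric.dtype)            # float64, because NaN forces a float dtype
print(sCpusNumeric.isna().tolist())  # [False, False, True]
```

Because unparseable values become NaN silently, it can be worth checking `isna().sum()` after the cast if the '+' rows are ever kept.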

We will now save the dfPartition DataFrame, along with the data types of its columns, to .csv files so that we can use it in other notebooks.

In [32]:
# We will also create a csv file to store the data types of the DataFrame columns. 
dfTypes = dfPartition.dtypes.to_frame()
dfTypes.to_csv('../data/partitionTypes.csv')

# Here we save the DataFrame as a .csv file. 
dfPartition.to_csv('../data/dfPartition.csv')
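One way another notebook might use the two saved files is to read the types file first and pass it to `pd.read_csv` as a `dtype` mapping. The sketch below is an assumption about how the files could be consumed, not a description of the later notebooks. It uses in-memory buffers as stand-ins for '../data/dfPartition.csv' and '../data/partitionTypes.csv' and a three-row DataFrame built from values in this notebook.

```python
import io
import pandas as pd

# Small stand-in for dfPartition, using values from this notebook.
dfExample = pd.DataFrame(
    {'CPUS': [32, 56, 76]},
    index=pd.Index(['skylake', 'cclake', 'icelake'], name='PARTITION'),
)

# In-memory stand-ins for the two .csv files written above.
dataBuffer = io.StringIO()
typesBuffer = io.StringIO()
dfExample.to_csv(dataBuffer)
dfExample.dtypes.to_frame().to_csv(typesBuffer)
dataBuffer.seek(0)
typesBuffer.seek(0)

# Read the saved dtypes back as a {column name: dtype string} mapping.
dfSavedTypes = pd.read_csv(typesBuffer, index_col=0)
dTypes = dfSavedTypes.iloc[:, 0].to_dict()

# Reload the data with the saved dtypes and the original index column.
dfRestored = pd.read_csv(dataBuffer, index_col='PARTITION', dtype=dTypes)
print(dfRestored.dtypes)
```

Storing the dtypes separately matters most for columns that `read_csv` would otherwise infer differently, such as dates or identifiers that look numeric.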

*NOTE: this is the end of the Slurm EDA using the sample data. At this point we were given the anonymised sacct data for June and continued our work using this new dataset. To see this work please look at the 'Slrum-EDA-June-Preparation' notebook.*