# Calculating Carbon Footprint

In this Jupyter Notebook we will expand on the previous notebooks, creating functions that query both Victoria Metrics (in order to obtain node power readings) and a public carbon intensity API (in order to obtain the carbon intensity in the UK for the given times). 

## Preparing the Notebook 

We will first import all of the necessary libraries.

In [2]:
import requests
import pandas as pd
import numpy as np 
from datetime import datetime
import pytz
from joblib import Parallel, delayed

We will now set up our jupyter notebook to make queries from VictoriaMetrics.

In [3]:
# I have removed the code to set up the jupyter notebook to make queries from VictoriaMetrics since it was accessing a private database. 

## Loading in the Data

We have saved the processed Slurm data in csv files, which we will now load into DataFrames.

In [4]:
# We will now read the .csv file containing the Slurm data for June into a DataFrame
sSlurmDataPath = '../data/dfSacctFinal.csv'
dfSacct = pd.read_csv(sSlurmDataPath, index_col=0, parse_dates=['Start', 'End'], infer_datetime_format=True)

# We will now read the .csv file containing the extended Slurm data for June into a DataFrame. 
sSlurmExtendedPath = '../data/dfSacctExtended.csv'
dfSacctExtended = pd.read_csv(sSlurmExtendedPath, index_col=0, parse_dates=['Start', 'End'], infer_datetime_format=True)

## Approach 

We will start by creating all of the necessary functions before applying them to a set of test data to ensure that the code works as expected. 

Once we know that the code works, we will apply the functions to the job data above.

## Ensuring the Time is in UTC

We must first ensure that all times are in UTC so that we access the correct power readings from Victoria Metrics. 

Therefore, we will create a function that converts all times in the DataFrame to UTC.

In [5]:
def dfToUTC(df):
    """
    Returns a pd DataFrame containing columns for the start and end times in UTC.

    Parameters
    ----------
    df: pdDataFrame 
        The pd DataFrame containing all of the job data. This DataFrame must contain 
        a 'Start' and 'End' column of pd DateTime64 objects. 
    
    Returns
    ----------
    df: pdDataFrame
        The pd DataFrame that was passed in as a parameter with two new columns:
        'UTCStart' and 'UTCEnd' of pd DateTime64 objects, which contain the original 
        start and end times in UTC rather than local time. 
    """

    df['UTCStart'] = df['Start'].dt.tz_localize('Europe/London')
    df['UTCEnd'] = df['End'].dt.tz_localize('Europe/London')

    df['UTCStart'] = df['UTCStart'].dt.tz_convert(pytz.utc)
    df['UTCEnd'] = df['UTCEnd'].dt.tz_convert(pytz.utc)

    return df 
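
To illustrate why the conversion matters, the sketch below uses only the standard library (`zoneinfo`, rather than the pandas calls in `dfToUTC()`) to show that a July timestamp recorded in UK local time is one hour ahead of UTC, while a January timestamp is not. The helper name `toUTC` and the sample times are illustrative assumptions and are not part of the notebook.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def toUTC(sLocal, sZone='Europe/London'):
    """Interpret a naive 'YYYY-MM-DD HH:MM:SS' string as local time and convert it to UTC."""
    dtLocal = datetime.strptime(sLocal, '%Y-%m-%d %H:%M:%S').replace(tzinfo=ZoneInfo(sZone))
    return dtLocal.astimezone(timezone.utc)

dtSummer = toUTC('2023-07-27 09:35:05')   # British Summer Time (UTC+1)
dtWinter = toUTC('2023-01-27 09:35:05')   # Greenwich Mean Time (UTC+0)

print(dtSummer.isoformat())   # 2023-07-27T08:35:05+00:00
print(dtWinter.isoformat())   # 2023-01-27T09:35:05+00:00
```

For job data that spans the clock changes in March or October, the pandas `tz_localize` method accepts `ambiguous` and `nonexistent` arguments to control how such local times are handled; the July data used here is not affected.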
    

## Querying Victoria Metrics

Rather than constantly querying Victoria Metrics, I thought it would be a good idea to query it once at the start and store the power data for all of the nodes across the entire July time period in a .csv file called *VMPowerJuly.csv*.

Below I created a function that checks whether the file *VMPowerJuly.csv* exists. If it exists, the function will return a DataFrame containing the data in the .csv file. If it does not exist, the function will query Victoria Metrics and will create the file *VMPowerJuly.csv* to store the data. The function will then return a DataFrame containing the power data. 

*NOTE: We will not be using this function for our program. The function below uses a lot of RAM because pandas stores DataFrames in memory rather than on the disk and is too inefficient to use on all of the data. As a result we will query Victoria Metrics for each job rather than making one big query at the start.*

In [6]:
def getPowerDataMonth(dfJobs):
    """
    Returns a DataFrame containing the Victoria Metrics power data for the specified month.

    Checks whether the .csv file containing the Victoria Metrics power data for the specified 
    month (with the format 'VMPower<Month>.csv') exists. If the file exists the DataFrame 
    containing the power data is returned. If the file does not exist, Victoria Metrics is 
    queried and the .csv file is created; the function then returns the DataFrame containing
    the power data. If there is a problem while querying Victoria Metrics, the function will 
    return None.

    Parameters 
    ----------
    dfJobs: pdDataFrame
        The DataFrame containing all of the job data that is to be processed. This DataFrame 
        must have already been processed to contain UTCStart and UTCEnd columns. This can be 
        done using the dfToUTC() function. 
    
    Returns
    ----------
    dfJobPower: pdDataFrame
        The DataFrame containing all of the power data. 
    None: NoneType
        Returned if there is a problem while querying Victoria Metrics.

    """

    # We start by obtaining the file name from the input DataFrame
    sMonth = dfJobs['UTCStart'].dt.month_name(locale='English').value_counts().index[0]
    sFileName = '../data/VMPower' + sMonth + '.csv'

    # We will now open the file and check whether it is empty. If the file is 
    # empty we will query Victoria Metrics and obtain the data. If the file is 
    # not empty we will load in the data and return a DataFrame. 
    fPowerData = open(sFileName, 'a+')

    fPowerData.seek(0)
    bEmpty = len(fPowerData.read()) == 0

    fPowerData.close()

    if bEmpty:
        # We will now obtain the start and end dates and times for our query.
        start = dfJobs.iloc[0]['UTCStart'].strftime("%Y-%m-%dT%H:%M:%SZ")
        end = dfJobs.iloc[round(len(dfJobs)/30)]['UTCEnd'].strftime("%Y-%m-%dT%H:%M:%SZ")

        # We will now query Victoria Metrics to obtain the power data
        data = {
            'query': f'amperageProbeReading{{amperageProbeLocationName="System Board Pwr Consumption"}}',
            'start': start,
            'end' : end,
            'step': '30s'
        }

        response = requests.put(
            url, 
            data=data,
            proxies=proxies,
            headers=headers,
            timeout=10
        )  

        # We will now ensure that the request was successful 
        if (response.status_code != 200):
            print('Response status code was not 200.')
            print(f'The response was {response.status_code}')
            return None

        # We will now create a DataFrame containing the power data
        dNodePowers = {}
        dPowerData = {}
        lTicks = []
        lNodes = []

        for dNodeData in response.json()['data']['result']:

            # Here we skip any entries that do not have an alias.
            if 'alias' not in dNodeData['metric'].keys():
                continue 

            sNode = dNodeData['metric']['alias']
            lData = dNodeData['values']

            lNodes.append(sNode)

            dNodePowers[sNode] = lData

            for lDataPoint in lData:
                iTick = lDataPoint[0]
                iPower = lDataPoint[1]
                lTicks.append(iTick)
                dPowerData[(iTick, sNode)] = iPower
        
        # De-duplicate the ticks and sort them so the index is in chronological order.
        lTicksOrdered = sorted(set(lTicks))

        # Below we create the structure of the DataFrame
        dfJobPower = pd.DataFrame(
            index = lTicksOrdered,
            columns = lNodes
        )

        # We now populate the DataFrame with the power data. 
        for tIndex in dPowerData.keys():
            dfJobPower.loc[tIndex[0], tIndex[1]] = dPowerData[tIndex]

        # We now change the format of the timestamp
        dfJobPower.index = pd.to_datetime(dfJobPower.index, unit='s', utc=True)
        dfJobPower['Date'] = dfJobPower.index.strftime('%Y-%m-%d')
        dfJobPower['Date'] = dfJobPower['Date'].str.cat((((dfJobPower.index).hour * 2) + ((dfJobPower.index).minute//30) + 1).astype(str), sep=" ")

        # We now resample the DataFrame and interpolate the data to obtain 
        # a datapoint for each 30s. 
        dfJobPower[dfJobPower.columns[:-1]] = dfJobPower[dfJobPower.columns[:-1]].apply(pd.to_numeric, axis=1)
        dfJobPower = dfJobPower.resample('30S').interpolate()

        # We now save the DataFrame as a .csv file and return the DataFrame
        dfJobPower.to_csv(sFileName)

        return dfJobPower
    
    # Below we read in the .csv file containing the power data
    # and we format the DataFrame before returning it. 
    dfJobPower = pd.read_csv(sFileName, parse_dates=[0], infer_datetime_format=True)
    dfJobPower.set_index('Unnamed: 0', inplace=True)
    dfJobPower.index.rename('Timestamp', inplace=True)

    return dfJobPower
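
The check-then-query caching pattern used by the function above can be reduced to a few lines. The sketch below is a minimal standard-library version, assuming a hypothetical `loadOrFetch` helper and a `fakeQuery` stand-in for the expensive Victoria Metrics request; neither name comes from the notebook.

```python
import csv
import tempfile
from pathlib import Path

def loadOrFetch(pathCache, fnFetch):
    """Return cached rows from pathCache, calling fnFetch() and caching its result on a miss."""
    if pathCache.exists() and pathCache.stat().st_size > 0:
        with pathCache.open(newline='') as fCache:
            return list(csv.reader(fCache))

    lRows = fnFetch()
    with pathCache.open('w', newline='') as fCache:
        csv.writer(fCache).writerows(lRows)
    return lRows

# Hypothetical stand-in for the expensive query; it records how often it is called.
lCalls = []
def fakeQuery():
    lCalls.append(1)
    return [['1690355610', '300'], ['1690355640', '300']]

with tempfile.TemporaryDirectory() as sDir:
    pathCache = Path(sDir) / 'VMPowerExample.csv'
    lFirst = loadOrFetch(pathCache, fakeQuery)    # cache miss: runs the query
    lSecond = loadOrFetch(pathCache, fakeQuery)   # cache hit: reads the file

print(len(lCalls))   # 1
```

Unlike opening the file in `'a+'` mode, this variant only creates the file once the query has returned data, so a failed query leaves no empty file behind.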



## Calculating the Energy Consumption

We will now write code to calculate the energy consumption of each job, in Wh, for each 30 minute time period that the job runs. 

*NOTE: We separate the energy consumption into 30 minute time periods because the carbon API we are accessing provides a separate carbon intensity for each 30 minute period of the day.*
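
The functions in this notebook label each half-hour period as `'YYYY-MM-DD N'`, where `N` runs from 1 to 48. The sketch below isolates that calculation using only the standard library; the helper name `periodLabel` is an illustrative assumption.

```python
from datetime import datetime, timezone

def periodLabel(dt):
    """Return 'YYYY-MM-DD N', where N (1-48) is the half-hour period of the day containing dt."""
    iPeriod = dt.hour * 2 + dt.minute // 30 + 1
    return f'{dt:%Y-%m-%d} {iPeriod}'

print(periodLabel(datetime(2023, 7, 27, 0, 0, tzinfo=timezone.utc)))    # 2023-07-27 1
print(periodLabel(datetime(2023, 7, 27, 7, 45, tzinfo=timezone.utc)))   # 2023-07-27 16
print(periodLabel(datetime(2023, 7, 27, 23, 59, tzinfo=timezone.utc)))  # 2023-07-27 48
```

Because the power data and the carbon intensity data are both given the same label, the two tables can later be matched on it directly.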

We will first create a function that queries Victoria Metrics and returns a DataFrame of the power usage for each node the job runs on for the duration of the job. 

In [7]:
def getJobPower(jobID, dfJobs):
    """
    Returns a DataFrame containing the power readings, in W, on each node that the job runs on for the duration of the job. 

    Returns a DataFrame whose index is the timestamp and whose columns are the power readings, in W, for each node the job 
    runs on. If there is a problem while querying Victoria Metrics, None is returned. 

    Parameters
    ----------
    jobID: integer
        The integer job ID for the job in question. 
    dfJobs: pdDataFrame
        The DataFrame containing all of the job data for the time period in question. 
    
    Returns
    ----------
    dfJobPower: pdDataFrame
        The DataFrame containing the power readings, in W, for each node the job runs on. 
    None: NoneType
        Returns None if there is a problem while querying Victoria Metrics
    """

    # We must first find the UTC start and end time of the job. 
    # We need to reformat these times to the correct format for Victoria Metrics. 
    # A job ID can appear on several rows (one per node); in that case we take the first row's times.
    if sum(dfJobs.index == jobID) > 1:
        sStart = dfJobs.loc[jobID, 'UTCStart'].iloc[0].strftime("%Y-%m-%dT%H:%M:%SZ")
        sEnd = dfJobs.loc[jobID, 'UTCEnd'].iloc[0].strftime("%Y-%m-%dT%H:%M:%SZ")
    else: 
        sStart = dfJobs.loc[jobID, 'UTCStart'].strftime("%Y-%m-%dT%H:%M:%SZ")
        sEnd = dfJobs.loc[jobID, 'UTCEnd'].strftime("%Y-%m-%dT%H:%M:%SZ")

    # We will now create a list of the nodes that the job runs on.
    if sum(dfJobs.index == jobID) > 1:
        lNodeList = list(dfJobs.loc[jobID, 'NodeList'])
    else:
        lNodeList = [dfJobs.loc[jobID, 'NodeList']]

    # We will now create the string containing the logical node query for Victoria Metrics
    sNodeQuery = '|'.join(lNodeList)

    # We will now query Victoria Metrics to obtain the power data
    data = {
        'query': f'amperageProbeReading{{alias=~"{sNodeQuery}", amperageProbeLocationName="System Board Pwr Consumption"}}',
        'start': sStart,
        'end' : sEnd,
        'step': '30s'
    }

    response = requests.put(
        url, 
        data=data,
        proxies=proxies,
        headers=headers,
        timeout=10
    )  

    # We will now ensure that the request was successful 
    if (response.status_code != 200):
        print('Response status code was not 200.')
        print(f'The response was {response.status_code}')
        return None

    # We will now create a DataFrame containing the power data
    dNodePowers = {}
    dPowerData = {}
    lTicks = []
    lNodes = []

    for dNodeData in response.json()['data']['result']:
        sNode = dNodeData['metric']['alias']
        
        if sNode in lNodes:
            continue 

        lData = dNodeData['values']
        lNodes.append(sNode)
        dNodePowers[sNode] = lData

        for lDataPoint in lData:
            iTick = lDataPoint[0]
            iPower = lDataPoint[1]
            lTicks.append(iTick)
            dPowerData[(iTick, sNode)] = iPower
    
    # De-duplicate the ticks and sort them so the index is in chronological order.
    lTicksOrdered = sorted(set(lTicks))

    # Below we create the structure of the DataFrame
    dfJobPower = pd.DataFrame(
        index = lTicksOrdered,
        columns = lNodes
    )

    # We now populate the DataFrame with the power data. 
    for tIndex in dPowerData.keys():
        dfJobPower.loc[tIndex[0], tIndex[1]] = dPowerData[tIndex]

    # We now change the format of the timestamp
    dfJobPower.index = pd.to_datetime(dfJobPower.index, unit='s', utc=True)
    dfJobPower['Date'] = dfJobPower.index.strftime('%Y-%m-%d')
    dfJobPower['Date'] = dfJobPower['Date'].str.cat((((dfJobPower.index).hour * 2) + ((dfJobPower.index).minute//30) + 1).astype(str), sep=" ")

    # We now resample the DataFrame and interpolate the data to obtain 
    # a datapoint for each 30s. 
    dfJobPower[dfJobPower.columns[:-1]] = dfJobPower[dfJobPower.columns[:-1]].apply(pd.to_numeric, axis=1)
    dfJobPower = dfJobPower.resample('30S').interpolate()

    return dfJobPower
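
To make the reshaping step in `getJobPower()` easier to follow, the sketch below pivots a response of the same shape into a table keyed by timestamp. The response body, node names and the helper `pivotPower` are hypothetical; only the keys `data`, `result`, `metric`, `alias` and `values` are taken from the code above.

```python
# Hypothetical response body with the shape accessed by getJobPower().
dResponse = {
    'data': {
        'result': [
            {'metric': {'alias': 'node-a'}, 'values': [[1690355610, '300'], [1690355640, '310']]},
            {'metric': {'alias': 'node-b'}, 'values': [[1690355610, '150']]},
        ]
    }
}

def pivotPower(dResponse):
    """Return {tick: {node: watts}} with the ticks in ascending order."""
    dTable = {}
    for dNodeData in dResponse['data']['result']:
        sNode = dNodeData['metric']['alias']
        for iTick, sPower in dNodeData['values']:
            dTable.setdefault(iTick, {})[sNode] = float(sPower)
    return dict(sorted(dTable.items()))

dTable = pivotPower(dResponse)
for iTick, dRow in dTable.items():
    print(iTick, dRow)
```

In this example `node-b` has no reading at the second tick. In the DataFrame built by `getJobPower()` such a gap appears as a missing value, which the later `interpolate()` call fills.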

    
    

We will now create a function that calculates the energy consumed by each node the job runs on for each 30 minute period. 

In [8]:
def getJobEnergy(dfPowerData):
    """
    Returns a DataFrame containing the energy consumed by each node the job runs on (in Wh) for each 30 minute time period of the Job's duration. 

    Parameters 
    ----------
    dfPowerData: pdDataFrame
        The DataFrame containing the Victoria Metrics power data for each node the job runs on for the job's running period. 
        This DataFrame is returned by the getJobPower() function.  

    Returns
    ----------
    dfNodeEnergies: pdDataFrame
        The DataFrame containing the energy consumed by each node (in Wh) for each 30 minute time period. 
    """

    dEnergies = {}

    # Below we iterate through each node, creating a dictionary of energies for 
    # each 30 minute interval.
    for sNode in dfPowerData.columns[:-1]:
            dIntervalEnergies = {}

            # Below we iterate through each 30 minute interval, calculating the 
            # energy for that interval.

            # We will first create a list of DataFrames, one for each interval in the 
            # time frame. 
            lPeriodDFs = []

            for interval in dfPowerData['Date'].unique():                
                bIntervalMask = dfPowerData['Date'] == interval
                lPeriodDFs.append(dfPowerData[bIntervalMask])

            for dfIndex in range(len(lPeriodDFs)):
                interval = lPeriodDFs[dfIndex]['Date'].unique()[0]

                if dfIndex != 0:
                        dfIntervalPower = pd.concat([lPeriodDFs[dfIndex - 1].iloc[-1].to_frame().transpose(), lPeriodDFs[dfIndex]])
                else:
                        dfIntervalPower = lPeriodDFs[dfIndex]

                if dfIntervalPower[sNode].isnull().values.any():
                        dIntervalEnergies[interval] = None
                        continue 

                iJoules = np.trapz(dfIntervalPower[sNode].astype(int), dx=30)
                iWattHour = iJoules/3600

                dIntervalEnergies[interval] = iWattHour

            dEnergies[sNode] = dIntervalEnergies

    # Below we create a DataFrame from the dictionary of energies.
    dfNodeEnergies = pd.DataFrame.from_dict(dEnergies)
    
    return dfNodeEnergies
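
The energy calculation above integrates the power samples with the trapezoidal rule and converts joules to watt-hours. The sketch below shows the same arithmetic in plain Python, without `np.trapz`, assuming evenly spaced samples; the helper name `trapezoidWh` is illustrative.

```python
def trapezoidWh(lWatts, iStepSeconds=30):
    """Integrate evenly spaced power samples (W) with the trapezoidal rule and return Wh."""
    fJoules = sum((a + b) / 2 * iStepSeconds for a, b in zip(lWatts, lWatts[1:]))
    return fJoules / 3600

# 61 samples at 30 s spacing cover exactly 1800 s at a constant 300 W.
print(trapezoidWh([300] * 61))   # 150.0

# A linear ramp from 0 W to 300 W over a single 30 s step.
print(trapezoidWh([0, 300]))     # 1.25
```

This also shows why `getJobEnergy()` prepends the last sample of the previous interval to each interval: without it, the 30 s step that crosses each half-hour boundary would not be counted.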

## Querying the Carbon Intensity API

As we do not want to continuously query the public carbon intensity API, we are now going to create a function that queries the API to obtain all of the carbon intensity data for the entire time period we are interested in. We will then store this carbon intensity data in a .csv file (if it doesn't already exist).

*NOTE: The API only allows for a maximum date range of 30 days*
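
One way to respect the 30 day limit is to split the full date range into consecutive windows before making any requests. The sketch below is a simplified standard-library illustration of that idea, not the exact logic used in this notebook; the helper name `splitRange` and the sample dates are assumptions.

```python
from datetime import datetime, timedelta, timezone

def splitRange(dtStart, dtEnd, iMaxDays=30):
    """Return consecutive (start, end) pairs covering dtStart to dtEnd, each at most iMaxDays long."""
    lChunks = []
    dtChunkStart = dtStart
    while dtChunkStart < dtEnd:
        dtChunkEnd = min(dtChunkStart + timedelta(days=iMaxDays), dtEnd)
        lChunks.append((dtChunkStart, dtChunkEnd))
        dtChunkStart = dtChunkEnd
    return lChunks

lChunks = splitRange(
    datetime(2023, 6, 1, tzinfo=timezone.utc),
    datetime(2023, 8, 15, tzinfo=timezone.utc),
)

for dtFrom, dtTo in lChunks:
    print(f'{dtFrom:%Y-%m-%dT%H:%MZ} -> {dtTo:%Y-%m-%dT%H:%MZ}')
```

Each chunk starts where the previous one ended, so the chunk boundaries are always derived from the previous boundary rather than from a running offset.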

We will now create a function that returns a DataFrame of all the carbon intensities that we are interested in.

In [50]:
def getCarbonIntensities(dfJobs):
    """ 
    Returns a DataFrame of the carbon intensities for each 30 minute time period in the interval.

    Checks whether the .csv file containing the carbon intensities for the given month, with the 
    format 'CarbonIntensities<Month>.csv' already exists. If it exists, returns a DataFrame, from 
    the .csv file, whose index is the time period, in the format 'YYYY-MM-DD PERIOD' where PERIOD 
    is the 30 minute time period of that date as an integer from 1-48. The DataFrame contains all 
    carbon intensities, in gCO2/kWh, for each 30 minute time period in the given interval. If the 
    .csv file does not exist, the DataFrame with the format above will be created and saved into a
    .csv file, before being returned.

    Parameters
    ----------
    dfJobs: pdDataFrame
        The DataFrame containing all of the job data for the given time period. 

    Returns
    ----------
    dfIntensities: pdDataFrame 
        A DataFrame containing the carbon intensity, in gCO2/kWh, for each 30 minute interval within
        the specified time range. 
    None: NoneType
        Returns 'None' if there is a problem accessing the carbon intensity API.
    """

    # We start by obtaining the file name from the input DataFrame
    sMonth = dfJobs['UTCStart'].dt.month_name(locale='English').value_counts().index[0]
    sFileName = '../data/CarbonIntensities' + sMonth + '.csv'

    # We will now open the file and check whether it is empty. If the file is 
    # empty we will create the carbon intensity DataFrame. If the file is not
    # empty we will load in the data and return a DataFrame. 
    fCarbonData = open(sFileName, 'a+')

    fCarbonData.seek(0)
    bEmpty = len(fCarbonData.read()) == 0

    fCarbonData.close()

    if bEmpty:
        # We will first find the start and end dates and times for the period 
        # over which the jobs run. 
        start = dfJobs.iloc[0]['UTCStart']
        end = dfJobs.iloc[-1]['UTCEnd']

        # We must first check if the date range is longer than 30 days. 
        # If it is, we will split the date range into chunks that are 30
        # days or shorter. 
        timeDeltaSeconds = (end-start).total_seconds()
        iChunks = int(np.ceil(timeDeltaSeconds/(30 * 86400)))

        lCarbonDFs = []

        for iCount in range(iChunks):
            # Each chunk is offset from the original start so that the offsets do not accumulate.
            chunkStart = start + pd.Timedelta((30 * iCount), 'd')
            if iCount < iChunks - 1:
                tempEnd = chunkStart + pd.Timedelta(30, 'd')
            else:
                #tempEnd = end + pd.Timedelta(30, 'min')
                tempEnd = end.date() + pd.Timedelta(24, 'h')
            
            sStart = chunkStart.strftime("%Y-%m-%dT%H:%MZ")
            sEnd = tempEnd.strftime("%Y-%m-%dT%H:%MZ")

            # We will now request the carbon intensity data from the API
            # and we will check if the request was successful. If the request
            # is unsuccessful we will return None. 
            intensity = requests.get(f'https://api.carbonintensity.org.uk/intensity/{sStart}/{sEnd}')

            if (intensity.status_code == 400):
                print('Status code was 400.')
                print('There was a bad request.')
                return None
            elif (intensity.status_code == 500):
                print('Status code was 500.')
                print('There was an internal server error.')
                return None

            # We will now create and reformat a DataFrame from the 
            # carbon intensity data. 
            dfTempIntensities = pd.DataFrame(intensity.json()['data'])

            dfTempIntensities['from'] = pd.to_datetime(dfTempIntensities['from'])
            dfTempIntensities['date'] = dfTempIntensities['from'].dt.date
            dfTempIntensities['period'] = dfTempIntensities['date'].astype(str) + " " + ((dfTempIntensities['from'].dt.hour * 2) + (dfTempIntensities['from'].dt.minute//30) + 1).astype(str)

            dfTempIntensities.drop(columns=['from', 'to', 'date'], inplace=True)
            dfTempIntensities['intensity'] = pd.json_normalize(dfTempIntensities['intensity'])['actual']
            dfTempIntensities.set_index('period', inplace=True)

            lCarbonDFs.append(dfTempIntensities)

        dfIntensities = pd.concat(lCarbonDFs)

        # We will now save the DataFrame to a .csv file and return it. 
        dfIntensities.to_csv(sFileName)
        
        return dfIntensities

    # Below we will read in the data from the .csv file to a DataFrame 
    # which we will return.
    # The 'period' labels are strings of the form 'YYYY-MM-DD PERIOD', so they are not parsed as dates.
    dfIntensities = pd.read_csv(sFileName)
    dfIntensities.set_index('period', inplace=True)

    return dfIntensities
    

## Calculating the Carbon Footprint

We will now write a function to calculate the carbon footprint given the Watt Hours.

In [130]:
def getCarbonFootprint(jobID, df):
    """
    Returns a DataFrame containing the job's carbon footprint data.

    Returns a DataFrame containing the job's energy consumption, in Wh; 
    carbon footprint, in gCO2; and the distance driven by a medium sized
    diesel car, in km, that releases the same amount of carbon dioxide. 

    Parameters 
    ----------
    jobID: integer
        The integer job ID of the job whose carbon footprint is calculated.
    df: pdDataFrame
        The pandas DataFrame containing all the job data.

    Returns
    ----------
    dfCarbonData: pd.DataFrame
        The pd DataFrame containing the job's carbon data.  

    """

    dfJobPower = getJobPower(jobID, df)

    # Three NaN values keep this result the same length as the successful return below.
    if dfJobPower is None or dfJobPower.isnull().values.any():
        print(f'Missing Data for job: {jobID}')
        return pd.DataFrame([np.nan, np.nan, np.nan]) 
    
    
    dfJobEnergy = getJobEnergy(dfJobPower)

    if '2023-07-27 16' in dfJobEnergy.index:
        print(jobID)

    dfJobCarbonIntensities = getCarbonIntensities(df).loc[dfJobEnergy.index]

    dfJobEnergy.rename(columns={dfJobEnergy.columns[0] : 'Data'}, inplace=True)
    dfJobCarbonIntensities.rename(columns={dfJobCarbonIntensities.columns[0] : 'Data'}, inplace=True)

    iEnergyTotal = sum(dfJobEnergy[dfJobEnergy.columns[0]])

    dfJobCarbon = (dfJobEnergy * dfJobCarbonIntensities)/1000

    iCarbon = round(sum(dfJobCarbon['Data']))

    iDistance = iCarbon/171

    return pd.DataFrame([iCarbon, iEnergyTotal, iDistance])
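
The unit conversion inside `getCarbonFootprint()` can be checked by hand: energy in Wh is divided by 1000 to give kWh, multiplied by the carbon intensity in gCO2/kWh, and the resulting mass is divided by 171 gCO2 per km to give an equivalent driving distance. The sketch below works through this with made-up intensities; the helper name `carbonGrams` and all sample values are illustrative assumptions.

```python
def carbonGrams(lEnergyWh, lIntensity):
    """Return total gCO2 for per-period energies (Wh) and carbon intensities (gCO2/kWh)."""
    return sum(fWh / 1000 * fIntensity for fWh, fIntensity in zip(lEnergyWh, lIntensity))

# Hypothetical job spanning three half-hour periods.
lEnergyWh = [150.0, 53.7, 150.0]    # energy per period, Wh
lIntensity = [200, 180, 220]        # illustrative intensities, gCO2/kWh

fCarbon = carbonGrams(lEnergyWh, lIntensity)
fDistanceKm = fCarbon / 171         # 171 gCO2 per km, the figure used in getCarbonFootprint()

print(round(fCarbon, 2), round(fDistanceKm, 3))
```

Multiplying period by period, rather than multiplying the total energy by an average intensity, is what allows a job that runs during a low-carbon part of the day to show a smaller footprint.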

## Testing the Code 

In order to test the carbon footprint code above, I am going to create some test power reading data which will have the following structure: 

    - A power of 300 W for the first 1800 seconds.
    - A power of 150 W for the next 1289 seconds. 
    - A power of 300 W for the next 1800 seconds.

The total energy consumed by this test job should be 353.7 Wh.
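
The expected figure follows from multiplying each power level by its duration, summing the results in joules and dividing by 3600. A short check, with variable names chosen for illustration:

```python
# (power in W, duration in s) for each segment of the test profile
lSegments = [(300, 1800), (150, 1289), (300, 1800)]

fJoules = sum(iWatts * iSeconds for iWatts, iSeconds in lSegments)
fWattHours = fJoules / 3600

print(round(fWattHours, 1))   # 353.7
```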

I will now create a test job DataFrame containing the test job only. 

In [11]:
dfTest = dfSacctExtended[dfSacctExtended.index.value_counts() == 1].iloc[-1].to_frame().transpose().reset_index(drop=True)

dfTest['End'] = pd.to_datetime("2023-07-27 09:35:05")
dfTest['UTCEnd'] = pd.to_datetime("2023-07-27 08:35:05+00:00")
dfTest['NodeList'] = 'Test'



In [12]:
dfTest

Unnamed: 0,JobName,Partition,ElapsedRaw,Account,State,NodeList,User,QOS,Start,End,Timelimit,Suspended,ExclusiveCPU,ExclusiveOverlapping,Exclusive,SharedSameUser,UTCEnd
0,171f8e926c82dd83fc75053ec9b110f092aa32b41d5c98...,ampere,29,99fdc1f587a9423c1abc5a1ce22053628b94a5226d3c0f...,COMPLETED,Test,dd68b7c728069b005e5dac9c3e9d59a7379b1347fa4e6f...,gpu2,2023-07-27 08:13:36,2023-07-27 09:35:05,02:00:00,00:00:00,False,False,False,True,2023-07-27 08:35:05+00:00


The test power data will be created using the following code. 

In [13]:
lPowerTest = []

for tTick in list(range(1690355610, 1690360500, 30)):
    if tTick < (1690355610 + 1800):
        lPowerTest.append([tTick, 300])
    elif (1690355610 + 1800) < tTick and tTick < (1690355610 + 3089):
        lPowerTest.append([tTick, 150])
    elif tTick > (1690355610 + 3089):
        lPowerTest.append([tTick, 300])


I am now going to modify the getJobPower() function to use our test data.

In [14]:
def getJobPower(jobID, dfJobs):
    """
    Returns a DataFrame containing the power readings, in W, on each node that the job runs on for the duration of the job. 

    Returns a DataFrame whose index is the timestamp and whose columns are the power readings, in W, for each node the job 
    runs on. If there is a problem while querying Victoria Metrics, None is returned. 

    Parameters
    ----------
    jobID: integer
        The integer job ID for the job in question. 
    dfJobs: pdDataFrame
        The DataFrame containing all of the job data for the time period in question. 
    
    Returns
    ----------
    dfJobPower: pdDataFrame
        The DataFrame containing the power readings, in W, for each node the job runs on. 
    None: NoneType
        Returns None if there is a problem while querying Victoria Metrics
    """

    if jobID != 0:
        # We must first find the UTC start and end time of the job. 
        # We need to reformat these times to the correct format for Victoria Metrics. 
        if dfJobs.index.value_counts().loc[jobID] > 1:
            sStart = dfJobs.loc[jobID, 'UTCStart'].iloc[0].strftime("%Y-%m-%dT%H:%M:%SZ")
            sEnd = dfJobs.loc[jobID, 'UTCEnd'].iloc[0].strftime("%Y-%m-%dT%H:%M:%SZ")
        else: 
            sStart = dfJobs.loc[jobID, 'UTCStart'].strftime("%Y-%m-%dT%H:%M:%SZ")
            sEnd = dfJobs.loc[jobID, 'UTCEnd'].strftime("%Y-%m-%dT%H:%M:%SZ")

        # We will now create a list of the nodes that the job runs on.
        if sum(dfJobs.index == jobID) > 1:
            lNodeList = list(dfJobs.loc[jobID, 'NodeList'])
        else:
            lNodeList = [dfJobs.loc[jobID, 'NodeList']]

        # We will now create the string containing the logical node query for Victoria Metrics
        sNodeQuery = '|'.join(lNodeList)

        # We will now query Victoria Metrics to obtain the power data
        data = {
            'query': f'amperageProbeReading{{alias=~"{sNodeQuery}", amperageProbeLocationName="System Board Pwr Consumption"}}',
            'start': sStart,
            'end' : sEnd,
            'step': '30s'
        }

        try:
            response = requests.put(
                url, 
                data=data,
                proxies=proxies,
                headers=headers,
                timeout=10
            )  
        except requests.exceptions.ConnectionError as e:
            print('ConnectionError')
            return None
        except requests.exceptions.ReadTimeout as e:
            print('ReadTimeout')
            return None

        # We will now ensure that the request was successful 
        if (response.status_code != 200):
            print('Response status code was not 200.')
            print(f'The response was {response.status_code}')
            return None

        # We will now check that the result is not empty.
        if len(response.json()['data']['result']) == 0:
            print('No data.')
            return None

        # We will now create a DataFrame containing the power data
        dNodePowers = {}
        dPowerData = {}
        lTicks = []
        lNodes = []

        for dNodeData in response.json()['data']['result']:
            sNode = dNodeData['metric']['alias']
            
            if sNode in lNodes:
                continue 

            lData = dNodeData['values']
            lNodes.append(sNode)
            dNodePowers[sNode] = lData

            for lDataPoint in lData:
                iTick = lDataPoint[0]
                iPower = lDataPoint[1]
                lTicks.append(iTick)
                dPowerData[(iTick, sNode)] = iPower
        
        # De-duplicate the ticks and sort them so the index is in chronological order.
        lTicksOrdered = sorted(set(lTicks))

        # Below we create the structure of the DataFrame
        dfJobPower = pd.DataFrame(
            index = lTicksOrdered,
            columns = lNodes
        )

        # We now populate the DataFrame with the power data. 
        for tIndex in dPowerData.keys():
            dfJobPower.loc[tIndex[0], tIndex[1]] = dPowerData[tIndex]

        # We now change the format of the timestamp
        dfJobPower.index = pd.to_datetime(dfJobPower.index, unit='s', utc=True)
        dfJobPower['Date'] = dfJobPower.index.strftime('%Y-%m-%d')
        dfJobPower['Date'] = dfJobPower['Date'].str.cat((((dfJobPower.index).hour * 2) + ((dfJobPower.index).minute//30) + 1).astype(str), sep=" ")

        # We now resample the DataFrame and interpolate the data to obtain 
        # a datapoint for each 30s. 
        dfJobPower[dfJobPower.columns[:-1]] = dfJobPower[dfJobPower.columns[:-1]].apply(pd.to_numeric, axis=1)
        dfJobPower = dfJobPower.resample('30S', origin='start').interpolate()

        return dfJobPower
    else:
        # We will now create a DataFrame containing the power data
        dNodePowers = {}
        dPowerData = {}
        lTicks = []
        lNodes = []

        lData = []

        for tTick in list(range(1690355610, 1690360500, 30)):
            if tTick < (1690355610 + 1800):
                lData.append([tTick, 300])
            elif (1690355610 + 1800) < tTick and tTick < (1690355610 + 3089):
                lData.append([tTick, 150])
            elif tTick > (1690355610 + 3089):
                lData.append([tTick, 300])

        sNode = 'TEST'

        lNodes.append(sNode)
        dNodePowers[sNode] = lData

        for lDataPoint in lData:
            iTick = lDataPoint[0]
            iPower = lDataPoint[1]
            lTicks.append(iTick)
            dPowerData[(iTick, sNode)] = iPower
        
        # De-duplicate the ticks and sort them so the index is in chronological order.
        lTicksOrdered = sorted(set(lTicks))

        # Below we create the structure of the DataFrame
        dfJobPower = pd.DataFrame(
            index = lTicksOrdered,
            columns = lNodes
        )

        # We now populate the DataFrame with the power data. 
        for tIndex in dPowerData.keys():
            dfJobPower.loc[tIndex[0], tIndex[1]] = dPowerData[tIndex]

        # We now change the format of the timestamp
        dfJobPower.index = pd.to_datetime(dfJobPower.index, unit='s', utc=True)

        # We now resample the DataFrame and interpolate the data to obtain 
        # a datapoint for each 30s. 
        dfJobPower = dfJobPower.apply(pd.to_numeric, axis=1)
        dfJobPower = dfJobPower.resample('30S').interpolate()

        # We will now create the 'Date' Column
        dfJobPower['Date'] = dfJobPower.index.strftime('%Y-%m-%d')
        dfJobPower['Date'] = dfJobPower['Date'].str.cat((((dfJobPower.index).hour * 2) + ((dfJobPower.index).minute//30) + 1).astype(str), sep=" ")

        return dfJobPower
    
    

We will now run our code on the test data to ensure it works as expected. 

In [17]:
iTestEnergy = sum(getJobEnergy(getJobPower(0, dfTest))['TEST'])
iCorrectTestEnergy = 353.7

print(f'The total energy usage for the test job is: {iTestEnergy} Wh')

iEnergyPercentage = iTestEnergy/iCorrectTestEnergy * 100

print(f'The calculated energy is {round(iEnergyPercentage)}% of the expected energy.')

The total energy usage for the test job is: 351.875 Wh
The calculated energy is 99% of the expected energy.
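
The small shortfall relative to 353.7 Wh comes from sampling: the test profile is built from 30 s ticks, so the 1289 s segment cannot be represented exactly, and the one tick left unset by the strict inequalities is filled by interpolation. The sketch below reproduces the calculated value in plain Python, assuming that gap is filled linearly between its neighbours, as `interpolate()` does by default.

```python
iStart, iStop, iStep = 1690355610, 1690360500, 30

# Rebuild the sampled profile, leaving the tick at iStart + 1800 unset as the test code does.
dPower = {}
for iTick in range(iStart, iStop, iStep):
    if iTick < iStart + 1800:
        dPower[iTick] = 300
    elif iStart + 1800 < iTick < iStart + 3089:
        dPower[iTick] = 150
    elif iTick > iStart + 3089:
        dPower[iTick] = 300

# Fill the single missing tick by linear interpolation between its neighbours.
iGap = iStart + 1800
dPower[iGap] = (dPower[iGap - iStep] + dPower[iGap + iStep]) / 2

# Trapezoidal rule over the whole profile, converted from joules to watt-hours.
lWatts = [dPower[iTick] for iTick in sorted(dPower)]
fJoules = sum((a + b) / 2 * iStep for a, b in zip(lWatts, lWatts[1:]))

print(fJoules / 3600)   # 351.875
```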


## Calculating the Carbon Footprint of the Jobs


We first need to add the UTC columns to our DataFrame

In [18]:
dfSacctExtended = dfToUTC(dfSacctExtended)

Now that we know that our code that calculates the carbon footprint works, we will apply this function to all of the jobs in our job DataFrame.

We will use the joblib library to parallelise this code and maximise its efficiency. 
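
The general pattern is to split the jobs into independent groups and hand each group to a worker. The sketch below illustrates this with the standard library's `concurrent.futures` instead of joblib, so that it runs without extra dependencies; the job records and the helper `totalEnergy` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical job records: (jobID, user, energy in Wh).
lJobs = [(1, 'alice', 120.0), (2, 'bob', 80.0), (3, 'alice', 40.0), (4, 'carol', 10.0)]

# Group the jobs by user, mirroring the per-user DataFrames built in this notebook.
dByUser = defaultdict(list)
for tJob in lJobs:
    dByUser[tJob[1]].append(tJob)

def totalEnergy(lUserJobs):
    """Stand-in for the per-group work: return the user and the summed energy of their jobs."""
    return lUserJobs[0][1], sum(tJob[2] for tJob in lUserJobs)

# executor.map preserves the order of its input, so the results line up with the groups.
with ThreadPoolExecutor(max_workers=4) as executor:
    dTotals = dict(executor.map(totalEnergy, dByUser.values()))

print(dTotals)   # {'alice': 160.0, 'bob': 80.0, 'carol': 10.0}
```

Grouping by user keeps each worker's input self-contained, so the per-group results can simply be concatenated afterwards.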

In [28]:
# We will first create a list of DataFrames, each one containing jobs that 
# run on a different partition.

lPartitionDFs = []

for partition in dfSacctExtended['Partition'].unique():
    bPartitionMask = dfSacctExtended['Partition'] == partition
    lPartitionDFs.append(dfSacctExtended[bPartitionMask])

# We will also create a list of DataFrames, each one containing jobs that 
# run for a different user.

lUserDFs = []

for user in dfSacctExtended['User'].unique():
    bUserMask = dfSacctExtended['User'] == user
    lUserDFs.append(dfSacctExtended[bUserMask])

In [None]:
def findCarbonEnergy(df):
    df[['CarbonFootprint(gCO2)', 'TotalEnergy', 'EquivalentDistance(km)']] = df.apply(lambda row : getCarbonFootprint(row.name, df)[0], axis=1)

    return df

In [None]:

lCarbonDFs = Parallel(n_jobs=8)(delayed(findCarbonEnergy)(df) for df in lUserDFs)


In [None]:
dfFinal = pd.concat(lCarbonDFs)
