# Selecting Cloudydap Cost and Usage Data

In this notebook:

* Read in AWS Cost and Usage Report data from a CSV file into a `pandas` DataFrame
* Specify use case run periods for each of the Cloudydap architectures
* Create a new column, named `Arch`, for storing Cloudydap architecture identifiers
* Select cost and usage data based on use case run periods and assign corresponding architecture identifiers (`A1`, `A2`, and `A3`)
* Remove all the other cost and usage data
* Save the use case cost and usage data into a CSV file

In [1]:
from pathlib import Path
import datetime as dt
import numpy as np
import pytz
import pandas as pd

## Raw Cost and Usage Data

In [2]:
r = pd.read_csv('../../../Arch1-20170201-20170301-1.csv')

In [3]:
r.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13292 entries, 0 to 13291
Data columns (total 73 columns):
identity/LineItemId                    13292 non-null object
identity/TimeInterval                  13292 non-null object
bill/InvoiceId                         0 non-null float64
bill/BillingEntity                     13292 non-null object
bill/BillType                          13292 non-null object
bill/PayerAccountId                    13292 non-null int64
bill/BillingPeriodStartDate            13292 non-null object
bill/BillingPeriodEndDate              13292 non-null object
lineItem/UsageAccountId                13292 non-null int64
lineItem/LineItemType                  13292 non-null object
lineItem/UsageStartDate                13292 non-null object
lineItem/UsageEndDate                  13292 non-null object
lineItem/ProductCode                   13292 non-null object
lineItem/UsageType                     13292 non-null object
lineItem/Operation                     132

Convert two columns to datetime, and sort all entries by time:

In [4]:
r['lineItem/UsageStartDate'] = pd.to_datetime(r['lineItem/UsageStartDate'])
r['lineItem/UsageEndDate'] = pd.to_datetime(r['lineItem/UsageEndDate'])
r.sort_values('lineItem/UsageStartDate', inplace=True)
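
As a small illustration of the conversion and sort above, the sketch below applies the same two steps to a made-up two-row frame. The timestamp strings are an assumption about the report's format (ISO 8601 with a trailing `Z`), and passing `utc=True` is a choice of this sketch rather than of the notebook; it yields timezone-aware values regardless of the pandas version.

```python
import pandas as pd

# Hypothetical line items, deliberately out of chronological order.
toy = pd.DataFrame({
    'lineItem/UsageStartDate': ['2017-02-23T11:00:00Z', '2017-02-23T10:00:00Z'],
    'lineItem/UsageEndDate': ['2017-02-23T12:00:00Z', '2017-02-23T11:00:00Z'],
})

# Parse both timestamp columns into timezone-aware (UTC) datetimes.
for col in ['lineItem/UsageStartDate', 'lineItem/UsageEndDate']:
    toy[col] = pd.to_datetime(toy[col], utc=True)

# Sort chronologically; the original row labels are kept, as in the notebook.
toy = toy.sort_values('lineItem/UsageStartDate')
first_start = toy['lineItem/UsageStartDate'].iloc[0]
```

Because `sort_values` keeps the original row labels, the index is no longer a plain range after sorting.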

In [5]:
r.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13292 entries, 0 to 13291
Data columns (total 73 columns):
identity/LineItemId                    13292 non-null object
identity/TimeInterval                  13292 non-null object
bill/InvoiceId                         0 non-null float64
bill/BillingEntity                     13292 non-null object
bill/BillType                          13292 non-null object
bill/PayerAccountId                    13292 non-null int64
bill/BillingPeriodStartDate            13292 non-null object
bill/BillingPeriodEndDate              13292 non-null object
lineItem/UsageAccountId                13292 non-null int64
lineItem/LineItemType                  13292 non-null object
lineItem/UsageStartDate                13292 non-null datetime64[ns]
lineItem/UsageEndDate                  13292 non-null datetime64[ns]
lineItem/ProductCode                   13292 non-null object
lineItem/UsageType                     13292 non-null object
lineItem/Operation        

## Use Case Run Times

Because AWS reports cost and usage information per hour, use cases cannot be correlated precisely with their costs. Instead, every hourly entry that overlaps a run period is attributed to that run. The time periods during which the use cases for the Cloudydap architectures were run are listed below.

Each use case run period is a `(start, end)` tuple of times in the US Central time zone (`America/Chicago`).

In [6]:
def set_time(year, month, day, hour, tzname='America/Chicago'):
    ctz = pytz.timezone(tzname)
    return ctz.localize(dt.datetime(year, month, day, hour), is_dst=None)
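
The `is_dst=None` argument makes `localize` strict: instead of guessing, `pytz` raises an error for wall-clock times that do not exist or occur twice around a daylight saving transition. The sketch below repeats `set_time` so that it runs on its own; the two dates are illustrative and are not run periods from this notebook.

```python
import datetime as dt
import pytz

def set_time(year, month, day, hour, tzname='America/Chicago'):
    ctz = pytz.timezone(tzname)
    return ctz.localize(dt.datetime(year, month, day, hour), is_dst=None)

# 02:00 on 12 March 2017 was skipped in US Central time (clocks jumped to 03:00).
try:
    set_time(2017, 3, 12, 2)
    skipped_hour_rejected = False
except pytz.exceptions.NonExistentTimeError:
    skipped_hour_rejected = True

# 01:00 on 5 November 2017 occurred twice (once in CDT, once in CST).
try:
    set_time(2017, 11, 5, 1)
    repeated_hour_rejected = False
except pytz.exceptions.AmbiguousTimeError:
    repeated_hour_rejected = True
```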

In [7]:
arch1_runs = [(set_time(2017, 2, 23, 4), set_time(2017, 2, 23, 23))]
arch2_runs = [(set_time(2017, 2, 24, 4), set_time(2017, 2, 24, 6))]
arch3_runs = [(set_time(2017, 2, 25, 4), set_time(2017, 2, 25, 6))]
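
To see which UTC hours a run period covers, the first period above can be converted by hand. This is a minimal sketch that repeats `set_time` so it runs on its own; the names `start_utc` and `end_utc` are introduced here for illustration.

```python
import datetime as dt
import pytz

def set_time(year, month, day, hour, tzname='America/Chicago'):
    ctz = pytz.timezone(tzname)
    return ctz.localize(dt.datetime(year, month, day, hour), is_dst=None)

# The Architecture 1 run period from the cell above, in US Central time.
start = set_time(2017, 2, 23, 4)
end = set_time(2017, 2, 23, 23)

# In February, US Central time is CST (UTC-6), so both ends shift by six hours.
start_utc = start.astimezone(pytz.utc)
end_utc = end.astimezone(pytz.utc)
print(start_utc.isoformat())  # 2017-02-23T10:00:00+00:00
print(end_utc.isoformat())    # 2017-02-24T05:00:00+00:00
```

The end of the period falls on the next calendar day in UTC, which matters when reading the report's UTC timestamps.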

## Assign Use Case Cost and Usage Data to Architectures

Create a new column `Arch` to hold architecture identifiers and fill it with `NaN`s:

In [8]:
r['Arch'] = np.nan

In [9]:
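def assign_arch(df, arch_times, arch_name):
    """Label rows of df whose usage interval overlaps any run period."""
    utc = pytz.timezone('UTC')
    for rt in arch_times:
        # Run periods are in US Central time; the usage timestamps are UTC.
        arch_start = rt[0].astimezone(utc).isoformat()
        arch_end = rt[1].astimezone(utc).isoformat()
        # An entry overlaps the run if it starts before the run ends
        # and ends after the run starts.
        mask = ((df['lineItem/UsageStartDate'] < arch_end) &
                (df['lineItem/UsageEndDate'] > arch_start))
        df.loc[mask, 'Arch'] = arch_name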
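
The overlap test inside `assign_arch` can be checked on a toy frame. The sketch below uses four made-up hourly entries and a hypothetical run window from 09:30 to 10:30 UTC. Unlike the notebook, it compares against `pd.Timestamp` objects rather than ISO strings, which is a simplification of this example.

```python
import pandas as pd

# Hypothetical hourly line items (UTC), one per row: 08-09, 09-10, 10-11, 11-12.
starts = pd.to_datetime(['2017-02-23T08:00:00Z', '2017-02-23T09:00:00Z',
                         '2017-02-23T10:00:00Z', '2017-02-23T11:00:00Z'],
                        utc=True)
toy = pd.DataFrame({
    'lineItem/UsageStartDate': starts,
    'lineItem/UsageEndDate': starts + pd.Timedelta(hours=1),
})

# Hypothetical run window: 09:30 to 10:30 UTC.
run_start = pd.Timestamp('2017-02-23T09:30:00Z')
run_end = pd.Timestamp('2017-02-23T10:30:00Z')

# Same interval-overlap condition as in assign_arch.
mask = ((toy['lineItem/UsageStartDate'] < run_end) &
        (toy['lineItem/UsageEndDate'] > run_start))
selected = toy.loc[mask, 'lineItem/UsageStartDate'].dt.hour.tolist()
print(selected)  # [9, 10]
```

Both hours that the window touches are selected, even though the run covers only half of each. This is the imprecision that hourly reporting introduces.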

In [10]:
assign_arch(r, arch1_runs, 'A1')
assign_arch(r, arch2_runs, 'A2')
assign_arch(r, arch3_runs, 'A3')

In [11]:
r.Arch.value_counts()

A1    437
A3     61
A2     55
Name: Arch, dtype: int64

Remove all cost entries that do not belong to an architecture use case run:

In [12]:
r.dropna(subset=['Arch'], inplace=True)

In [13]:
r.shape

(553, 74)
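
The effect of `dropna(subset=['Arch'])` can be seen on a small made-up frame: only a missing `Arch` value causes a row to be dropped, while missing values in other columns are ignored. The `cost` column and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 1 and 3 were never assigned an architecture,
# and row 2 has a missing cost but a valid architecture label.
toy = pd.DataFrame({'cost': [1.0, 2.0, np.nan, 4.0],
                    'Arch': ['A1', np.nan, 'A2', np.nan]})

# Drop only the rows whose Arch value is missing.
kept = toy.dropna(subset=['Arch'])
counts = kept['Arch'].value_counts().to_dict()
```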

## Save Use Case Cost Data to a File

Path of the CSV file in which to save the use case cost data:

In [14]:
outfile = Path('cloudydap_costs.csv')

In [15]:
if outfile.suffix != '.csv':
    raise ValueError('The file name must end with ".csv"')
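
For reference, `Path.suffix` returns only the last extension and is case-sensitive, so the check above also rejects names such as `costs.CSV`. The file names in this sketch are made up for illustration.

```python
from pathlib import Path

# Hypothetical file names and the suffix that pathlib reports for each.
suffixes = {name: Path(name).suffix
            for name in ['cloudydap_costs.csv', 'costs.csv.gz',
                         'costs.CSV', 'cloudydap_costs']}
print(suffixes)
```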

If the file already exists, just append the new data:

In [16]:
if outfile.exists():
    mode = 'a'
    header = False
else:
    mode = 'w'
    header = True
r.to_csv(str(outfile), mode=mode, header=header, index=False,
         date_format='%Y-%m-%dT%H:%M:%S+00:00')
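
The append-or-create logic can be tried safely in a temporary directory. The sketch below wraps it in a hypothetical helper, `append_csv`, and writes the same made-up batch twice; the file name and column values are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

def append_csv(df, path):
    """Write df to path, emitting the header row only when the file is new."""
    path = Path(path)
    new_file = not path.exists()
    df.to_csv(str(path), mode='w' if new_file else 'a',
              header=new_file, index=False)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / 'toy_costs.csv'
    batch = pd.DataFrame({'Arch': ['A1', 'A2'], 'cost': [0.5, 0.25]})
    append_csv(batch, out)   # creates the file with a header row
    append_csv(batch, out)   # appends two more rows without a second header
    lines = out.read_text().splitlines()
    roundtrip = pd.read_csv(out)
```

Since appended rows are written without a header, this approach assumes that every batch has the same columns in the same order as the existing file.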