# Cloudydap Cost Analysis

In [1]:
import numpy as np
import pandas as pd
from aws_price_list import AWSOffersIndex
from bokeh.charts import (output_notebook, output_file, show, 
                          Scatter, Histogram, TimeSeries, Donut, Step, Bar)
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import Range1d, HoverTool, ResizeTool


output_notebook()

In [2]:
def percent_change(before, after):
    """Return the percent change from `before` to `after`, dropping undefined entries."""
    return 100 * (after/before - 1).dropna(how='all')
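
The helper relies on pandas index alignment: entries that exist in only one of the two inputs become `NaN` after the division and are then dropped. A minimal sketch of that behavior, using made-up cost values (the labels and numbers below are hypothetical, not taken from the billing data):

```python
import pandas as pd


def percent_change(before, after):
    # Same helper as defined in the notebook cell.
    return 100 * (after / before - 1).dropna(how='all')


# Hypothetical costs, not taken from the cost report.
before = pd.Series({'compute': 10.0, 'storage': 2.0, 'snapshots': 1.0})
after = pd.Series({'compute': 5.0, 'storage': 3.0})

# 'snapshots' has no counterpart in `after`, so it is dropped from the result.
change = percent_change(before, after)
print(change)
```

This is why the per-architecture comparisons later in the notebook list only the cost items that both architectures share.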

# Input Cloudydap Cost Data

In [3]:
r = pd.read_csv('../../logs/cloudydap_costs.csv')

Check if there is a column named "Arch" (indicates these are indeed Cloudydap cost data):

In [4]:
if 'Arch' not in r:
    raise RuntimeError('Missing "Arch" column')

Are there any null values in the `Arch` column?

In [5]:
if r.Arch.isnull().any():
    raise ValueError('Null values detected in the "Arch" column')

What do we have?

In [6]:
r.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 553 entries, 0 to 552
Data columns (total 74 columns):
identity/LineItemId                    553 non-null object
identity/TimeInterval                  553 non-null object
bill/InvoiceId                         0 non-null float64
bill/BillingEntity                     553 non-null object
bill/BillType                          553 non-null object
bill/PayerAccountId                    553 non-null int64
bill/BillingPeriodStartDate            553 non-null object
bill/BillingPeriodEndDate              553 non-null object
lineItem/UsageAccountId                553 non-null int64
lineItem/LineItemType                  553 non-null object
lineItem/UsageStartDate                553 non-null object
lineItem/UsageEndDate                  553 non-null object
lineItem/ProductCode                   553 non-null object
lineItem/UsageType                     553 non-null object
lineItem/Operation                     553 non-null object
lineItem/Avai

## Data Preprocessing


### Remove EBS Volume Costs For Arch. \#2 and \#3

Architectures \#2 and \#3 do not need an EBS volume, so remove the EBS costs recorded for them.

Find and display those cost entries:

In [7]:
no_ebs = (r.Arch != 'A1') & (r['lineItem/UsageType'].str.startswith('EBS'))

In [8]:
r.loc[no_ebs, ['Arch', 'lineItem/ProductCode', 'lineItem/UsageType', 'lineItem/BlendedCost']]

    Arch lineItem/ProductCode      lineItem/UsageType  lineItem/BlendedCost
437   A2            AmazonEC2     EBS:VolumeUsage.gp2              0.011905
438   A2            AmazonEC2  EBSOptimized:m4.xlarge              0.000000
439   A2            AmazonEC2  EBSOptimized:m4.xlarge              0.000000
440   A2            AmazonEC2  EBSOptimized:m4.xlarge              0.000000
441   A2            AmazonEC2  EBSOptimized:m4.xlarge              0.000000
447   A2            AmazonEC2     EBS:VolumeUsage.gp2              0.014881
450   A2            AmazonEC2     EBS:VolumeUsage.gp2              0.014881
451   A2            AmazonEC2     EBS:VolumeUsage.gp2              0.014881
464   A2            AmazonEC2  EBSOptimized:m4.xlarge              0.000000
465   A2            AmazonEC2     EBS:VolumeUsage.gp2              0.014881


What are these costs in US$ per architecture?

In [9]:
r.loc[no_ebs, ['Arch', 'lineItem/BlendedCost']].groupby('Arch').sum()

      lineItem/BlendedCost
Arch
A2                0.113095
A3                0.142857


Remove those entries:

In [10]:
r.drop(r[no_ebs].index, inplace=True)
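
The mask-and-drop step can be tried in isolation on a small frame. The sketch below uses hypothetical rows that only mimic the relevant columns of the cost report, and it keeps rows with an inverted boolean mask instead of `drop`, which gives the same result without modifying the frame in place:

```python
import pandas as pd

# Hypothetical rows that mimic the relevant columns of the cost report.
df = pd.DataFrame({
    'Arch': ['A1', 'A2', 'A2', 'A3'],
    'lineItem/UsageType': ['EBS:VolumeUsage.gp2', 'EBS:VolumeUsage.gp2',
                           'BoxUsage:m4.xlarge', 'EBSOptimized:m4.xlarge'],
    'lineItem/BlendedCost': [0.8, 0.01, 1.7, 0.0],
})

# True for EBS-related usage types billed to any architecture other than A1.
mask = (df['Arch'] != 'A1') & df['lineItem/UsageType'].str.startswith('EBS')

# Boolean indexing with the inverted mask keeps every other row.
kept = df[~mask]
print(kept)
```

Only the A1 EBS row and the A2 instance-usage row survive, which mirrors what the notebook does to `r`.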

Convert time columns to datetime type:

In [11]:
r['lineItem/UsageStartDate'] = pd.to_datetime(r['lineItem/UsageStartDate'])
r['lineItem/UsageEndDate'] = pd.to_datetime(r['lineItem/UsageEndDate'])

## Analysis

Breakdown of the entries per architecture (and what their identifiers are):

In [12]:
r.Arch.value_counts().sort_index()

A1    437
A2     39
A3     43
Name: Arch, dtype: int64

Time span for each architecture's cost data:

In [13]:
arch_dur = dict()
grp = r.groupby('Arch')
for arch in sorted(grp.groups):
    rows = grp.get_group(arch)  # all cost entries of this architecture
    start = rows['lineItem/UsageStartDate'].min()
    end = rows['lineItem/UsageEndDate'].max()
    print('Architecture:', arch)
    print('    Start: {}\n    End:   {}'.format(start, end))
    arch_dur[arch] = end - start
    print('    Duration:', arch_dur[arch])

Architecture: A1
    Start: 2017-02-23 10:00:00
    End:   2017-02-24 05:00:00
    Duration: 0 days 19:00:00
Architecture: A2
    Start: 2017-02-24 10:00:00
    End:   2017-02-24 12:00:00
    Duration: 0 days 02:00:00
Architecture: A3
    Start: 2017-02-25 10:00:00
    End:   2017-02-25 12:00:00
    Duration: 0 days 02:00:00
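
The same time spans can also be computed without an explicit loop. This is a sketch on hypothetical intervals (the column names `start` and `end` are placeholders for the usage start and end date columns), not the notebook's own method:

```python
import pandas as pd

# Hypothetical usage intervals; the real ones come from the cost report.
df = pd.DataFrame({
    'Arch': ['A1', 'A1', 'A2'],
    'start': pd.to_datetime(['2017-02-23 10:00', '2017-02-23 12:00',
                             '2017-02-24 10:00']),
    'end': pd.to_datetime(['2017-02-23 11:00', '2017-02-23 13:00',
                           '2017-02-24 12:00']),
})

g = df.groupby('Arch')

# Latest end minus earliest start, per architecture.
span = g['end'].max() - g['start'].min()

# Dividing by a one-hour Timedelta converts the spans to float hours.
hours = span / pd.Timedelta('1 hour')
print(hours)
```

The division by `pd.Timedelta('1 hour')` is the same conversion the notebook later uses when it turns `arch_dur` into a number of billable hours.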


### Cost

Group the data by architecture, AWS service, its usage type, and its operation:

In [14]:
grp = r.groupby(['Arch', 'lineItem/ProductCode', 'lineItem/UsageType', 'lineItem/Operation'])

#### Blended Costs

Breakdown of the blended cost:

In [15]:
blend_cost = grp['lineItem/BlendedCost'].sum()
blend_cost

Arch  lineItem/ProductCode  lineItem/UsageType           lineItem/Operation
A1    AmazonEC2             BoxUsage:m4.xlarge           RunInstances          1.268500e+01
                            DataTransfer-In-Bytes        RunInstances          0.000000e+00
                            DataTransfer-Out-Bytes       RunInstances          0.000000e+00
                            DataTransfer-Regional-Bytes  PublicIP-In           0.000000e+00
                                                         PublicIP-Out          8.723100e-04
                            EBS:SnapshotUsage            CreateSnapshot        6.504439e-02
                            EBS:VolumeUsage.gp2          CreateVolume-Gp2      8.214284e-01
                            EBSOptimized:m4.xlarge       Hourly                0.000000e+00
      AmazonS3              DataTransfer-Out-Bytes       GetObject             0.000000e+00
                            Requests-Tier1               ListBucketVersions    2.751000e-02
    

Remove the zero costs for easier calculations later:

In [16]:
blend_cost = blend_cost[blend_cost != 0]

Stand by for special announcement...

---

**SPECIAL ANNOUNCEMENT**

All the study data (files and byte streams) are stored in the same S3 bucket, so the billed S3 storage cost does not reflect what each architecture would incur on its own. The following code tries to remedy this by manually setting the S3 storage cost for each architecture. The S3 storage used by each architecture is assumed to be:

* Arch. \#1 = 156GB
* Arch. \#2 = 158.1GB
* Arch. \#3 = 126.1GB

The S3 storage price used for these cases is $0.023 per GB-month, converted to a GB-hour rate by assuming a 30-day month. The time over which these costs are accrued equals the number of hours each architecture's use case ran.

In [17]:
s3_rate = 0.023/(24*30)
s3_rate

3.194444444444444e-05

In [18]:
arch_storage = {'A1': 156, 'A2': 158.1, 'A3': 126.1}  # GB stored per architecture
for arch in blend_cost.groupby(level=0).groups.keys():
    # Assign through a single .loc call; chained indexing may write to a temporary copy.
    blend_cost.loc[(arch, 'AmazonS3', 'TimedStorage-ByteHrs', 'StandardStorage')] =\
        arch_storage[arch] * s3_rate * (arch_dur[arch] / pd.Timedelta('1 hour'))

The new S3 storage costs are:

In [19]:
blend_cost.loc[pd.IndexSlice[:, ['AmazonS3'], ['TimedStorage-ByteHrs'], ['StandardStorage']]]

Arch  lineItem/ProductCode  lineItem/UsageType    lineItem/Operation
A1    AmazonS3              TimedStorage-ByteHrs  StandardStorage       0.094683
A2    AmazonS3              TimedStorage-ByteHrs  StandardStorage       0.010101
A3    AmazonS3              TimedStorage-ByteHrs  StandardStorage       0.008056
Name: lineItem/BlendedCost, dtype: float64
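
As a cross-check, the adjusted storage costs can be reproduced with plain arithmetic: storage in GB, times the GB-hour rate, times the hours of use. The sketch below uses only numbers quoted in this notebook (the storage sizes, the $0.023 rate, and the 19-hour and 2-hour durations printed earlier); the variable names are illustrative:

```python
# Values quoted in the notebook text and outputs.
PRICE_PER_GB_MONTH = 0.023       # US$ per GB-month
HOURS_PER_MONTH = 24 * 30        # a month is taken to be 30 days
storage_gb = {'A1': 156, 'A2': 158.1, 'A3': 126.1}
hours_used = {'A1': 19, 'A2': 2, 'A3': 2}   # durations printed earlier

rate_per_gb_hour = PRICE_PER_GB_MONTH / HOURS_PER_MONTH

# cost = storage (GB) x rate (US$ per GB-hour) x duration (hours)
costs = {arch: storage_gb[arch] * rate_per_gb_hour * hours_used[arch]
         for arch in storage_gb}

for arch in sorted(costs):
    print(arch, round(costs[arch], 6))
```

Rounded to six decimals, these agree with the values shown in the output above. The large gap between A1 and the other two comes almost entirely from the 19-hour versus 2-hour durations, not from the storage sizes.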

---

Continue with cost analysis...

Total cost per architecture in US$:

In [20]:
tot_cost = blend_cost.groupby(level=[0]).sum()
tot_cost

Arch
A1    13.745222
A2     1.752961
A3     1.752981
Name: lineItem/BlendedCost, dtype: float64

Each architecture's total cost as a percentage of the combined total:

In [21]:
100 * tot_cost/tot_cost.sum()

Arch
A1    79.677072
A2    10.161407
A3    10.161521
Name: lineItem/BlendedCost, dtype: float64

In [22]:
f = Donut(blend_cost.groupby(level=[0, 1]).sum(), 
          title='AWS Total Cost Breakdown per Architecture')
show(f)

Cost percentage change when compared to Architecture \#1:

In [23]:
percent_change(tot_cost['A1'], tot_cost)

Arch
A1     0.000000
A2   -87.246761
A3   -87.246618
Name: lineItem/BlendedCost, dtype: float64

##### Blended Cost Percent Changes between Two Architectures

Cost percentage change between Arch. \#2 and \#1:

In [24]:
percent_change(blend_cost['A1'], blend_cost['A2'])

lineItem/ProductCode  lineItem/UsageType           lineItem/Operation
AmazonEC2             BoxUsage:m4.xlarge           RunInstances         -86.440678
                      DataTransfer-Regional-Bytes  PublicIP-Out          52.058328
AmazonS3              Requests-Tier1               ListBucketVersions   -90.676118
                      Requests-Tier2               GetObject             88.502714
                                                   HeadBucket           -90.372671
                      TimedStorage-ByteHrs         StandardStorage      -89.331984
                      USE1-USW2-AWS-Out-Bytes      GetObject            -50.921642
                                                   HeadBucket           -90.229885
                                                   ListBucketVersions   -90.645253
Name: lineItem/BlendedCost, dtype: float64

Cost percentage change between Arch. \#3 and \#1:

In [25]:
percent_change(blend_cost['A1'], blend_cost['A3'])

lineItem/ProductCode  lineItem/UsageType           lineItem/Operation
AmazonEC2             BoxUsage:m4.xlarge           RunInstances         -86.440678
                      DataTransfer-Regional-Bytes  PublicIP-Out          52.043425
AmazonS3              Requests-Tier1               ListBucketVersions   -88.131589
                      Requests-Tier2               GetObject             75.513725
                                                   HeadBucket           -90.838509
                      TimedStorage-ByteHrs         StandardStorage      -91.491228
                      USE1-USW2-AWS-Out-Bytes      GetObject            -54.420944
                                                   HeadBucket           -90.804598
                                                   ListBucketVersions   -88.103912
Name: lineItem/BlendedCost, dtype: float64

Cost percentage change between Arch. \#3 and \#2:

In [26]:
percent_change(blend_cost['A2'], blend_cost['A3'])

lineItem/ProductCode  lineItem/UsageType           lineItem/Operation
AmazonEC2             BoxUsage:m4.xlarge           RunInstances           0.000000
                      DataTransfer-Regional-Bytes  PublicIP-Out          -0.009801
AmazonS3              Requests-Tier1               ListBucketVersions    27.290448
                      Requests-Tier2               GetObject             -6.890612
                                                   HeadBucket            -4.838710
                      TimedStorage-ByteHrs         StandardStorage      -20.240354
                      USE1-USW2-AWS-Out-Bytes      GetObject             -7.130031
                                                   HeadBucket            -5.882353
                                                   ListBucketVersions    27.166324
Name: lineItem/BlendedCost, dtype: float64

##### Blended Cost as Percentage of the Total per Architecture

Architecture \#1:

In [27]:
blend_cost['A1']/blend_cost['A1'].sum() * 100

lineItem/ProductCode  lineItem/UsageType           lineItem/Operation
AmazonEC2             BoxUsage:m4.xlarge           RunInstances          92.286612
                      DataTransfer-Regional-Bytes  PublicIP-Out           0.006346
                      EBS:SnapshotUsage            CreateSnapshot         0.473215
                      EBS:VolumeUsage.gp2          CreateVolume-Gp2       5.976101
AmazonS3              Requests-Tier1               ListBucketVersions     0.200142
                                                   PutObject              0.000364
                      Requests-Tier2               GetObject              0.057355
                                                   HeadBucket             0.001874
                                                   ReadACL                0.000012
                      TimedStorage-ByteHrs         StandardStorage        0.688845
                      USE1-USW2-AWS-Out-Bytes      GetObject              0.001981
                 

Architecture \#2:

In [28]:
blend_cost['A2']/blend_cost['A2'].sum() * 100

lineItem/ProductCode  lineItem/UsageType           lineItem/Operation
AmazonEC2             BoxUsage:m4.xlarge           RunInstances          98.119696
                      DataTransfer-Regional-Bytes  PublicIP-Out           0.075667
AmazonS3              Requests-Tier1               ListBucketVersions     0.146324
                      Requests-Tier2               GetObject              0.847754
                                                   HeadBucket             0.001415
                      TimedStorage-ByteHrs         StandardStorage        0.576216
                      USE1-USW2-AWS-Out-Bytes      GetObject              0.007625
                                                   HeadBucket             0.000029
                                                   ListBucketVersions     0.225274
Name: lineItem/BlendedCost, dtype: float64

Architecture \#3:

In [29]:
blend_cost['A3']/blend_cost['A3'].sum() * 100

lineItem/ProductCode  lineItem/UsageType           lineItem/Operation
AmazonEC2             BoxUsage:m4.xlarge           RunInstances          98.118591
                      DataTransfer-Regional-Bytes  PublicIP-In            0.075659
                                                   PublicIP-Out           0.075659
AmazonS3              Requests-Tier1               ListBucketVersions     0.186254
                      Requests-Tier2               GetObject              0.789330
                                                   HeadBucket             0.001346
                      TimedStorage-ByteHrs         StandardStorage        0.459582
                      USE1-USW2-AWS-Out-Bytes      GetObject              0.007081
                                                   HeadBucket             0.000027
                                                   ListBucketVersions     0.286470
Name: lineItem/BlendedCost, dtype: float64

### Usage Amount

In [30]:
use_amt = grp['lineItem/UsageAmount'].sum()
use_amt = use_amt[use_amt != 0]
use_amt

Arch  lineItem/ProductCode  lineItem/UsageType           lineItem/Operation
A1    AmazonEC2             BoxUsage:m4.xlarge           RunInstances          5.900000e+01
                            DataTransfer-In-Bytes        RunInstances          6.271791e-01
                            DataTransfer-Out-Bytes       RunInstances          3.444954e-02
                            DataTransfer-Regional-Bytes  PublicIP-In           8.722682e-02
                                                         PublicIP-Out          8.723084e-02
                            EBS:SnapshotUsage            CreateSnapshot        1.300888e+00
                            EBS:VolumeUsage.gp2          CreateVolume-Gp2      8.214286e+00
                            EBSOptimized:m4.xlarge       Hourly                5.900000e+01
      AmazonS3              DataTransfer-Out-Bytes       GetObject             3.551022e-02
                            Requests-Tier1               ListBucketVersions    5.502000e+03
    

Isolate only the number of Hyrax S3 requests:

In [31]:
use_amt.loc[pd.IndexSlice[:, ['AmazonS3'], ['Requests-Tier2'], ['GetObject']]]

Arch  lineItem/ProductCode  lineItem/UsageType  lineItem/Operation
A1    AmazonS3              Requests-Tier2      GetObject             19709.0
A2    AmazonS3              Requests-Tier2      GetObject             37152.0
A3    AmazonS3              Requests-Tier2      GetObject             34592.0
Name: lineItem/UsageAmount, dtype: float64

The change in the number of S3 requests between the architectures (A1 → A2 → A3):

In [32]:
use_amt.loc[pd.IndexSlice[:, ['AmazonS3'], ['Requests-Tier2'], ['GetObject']]].diff()

Arch  lineItem/ProductCode  lineItem/UsageType  lineItem/Operation
A1    AmazonS3              Requests-Tier2      GetObject                 NaN
A2    AmazonS3              Requests-Tier2      GetObject             17443.0
A3    AmazonS3              Requests-Tier2      GetObject             -2560.0
Name: lineItem/UsageAmount, dtype: float64
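
The absolute differences can be complemented with relative ones. A small sketch, with the request counts copied from the output above into a flat Series (the simplified index is an assumption made for brevity):

```python
import pandas as pd

# GetObject request counts copied from the output above.
requests = pd.Series([19709.0, 37152.0, 34592.0], index=['A1', 'A2', 'A3'])

# Difference to the previous architecture (the first entry has no predecessor).
absolute = requests.diff()

# The same step expressed as a percent change.
relative = 100 * requests.pct_change()

print(absolute)
print(relative.round(2))
```

The relative changes, roughly +88.5% from A1 to A2 and -6.9% from A2 to A3, match the `Requests-Tier2` / `GetObject` cost changes reported in the blended cost comparison, as expected for a cost that is billed per request.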

## AWS Product SKUs

Below is the list of AWS product SKUs found in the Cloudydap cost report and their descriptions:

In [33]:
SKUs = grp['product/sku'].unique()
pc = SKUs.groupby(level='lineItem/ProductCode').apply(np.unique)
oi = AWSOffersIndex()
for g in pc.index:
    print(g, ':')
    s = oi.offer(g)
    for sku in np.hstack(pc[g].flat):
        print('  SKU:', sku)
        prod = s.product(sku)
        if len(prod.pricing) > 1:
            raise ValueError('{}: More than one product pricing info'.format(sku))
        tiers = prod.pricing[0].tiers
        for t in tiers:
            print('    ', t.description)
    print('\n')

AmazonEC2 :
  SKU: 33Y5KYZ4JQEF6J66
     $0.00 per GB - US East (Northern Virginia) data transfer from US West (Oregon)
  SKU: 47GP959QAF69YPG5
     $0.215 per On Demand Linux m4.xlarge Instance Hour
  SKU: 7U7TWP44UP36AT3R
     $0.05 per GB-Month of snapshot data stored - US East (Northern Virginia)
  SKU: 9MG5B7V4UUU2WPAV
     $0.000 per GB - data transfer in per month
  SKU: 9W95WEA2F9V4BVUJ
     $0.000 for 750 Mbps per m4.xlarge instance-hour (or partial hour)
  SKU: HQEH3ZWJVT46JHRG
     $0.000 per GB - first 1 GB of data transferred out per month
     $0.090 per GB - first 10 TB / month data transfer out beyond the global free tier
     $0.085 per GB - next 40 TB / month data transfer out
     $0.070 per GB - next 100 TB / month data transfer out
     $0.050 per GB - greater than 150 TB / month data transfer out
  SKU: HY3BZPP2B6K8MSJF
     $0.10 per GB-month of General Purpose SSD (gp2) provisioned storage - US East (Northern Virginia)
  SKU: PNUBVW4CPC8XA46W
     $0.010 per GB 