# AWS Cost and Usage Report Analysis

This notebook analyzes a Detailed Hourly AWS Cost and Usage Report with Resource IDs and Tags. The reports are downloaded with [get-aws-cost.py](https://github.com/OPENDAP/cloudydap/blob/master/python/logging/get-aws-cost.py), a command-line program developed for this project.

In [1]:
import numpy as np
import pandas as pd
from aws_price_list import AWSOffersIndex
from bokeh.charts import (output_notebook, output_file, show, 
                          Scatter, Histogram, TimeSeries, Donut, Step)
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import Range1d, HoverTool, ResizeTool


output_notebook()

## Report to Analyze

In [2]:
r = pd.read_csv('../../../Arch1-20170201-20170301-1.csv')

What do we have?

In [3]:
r.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10120 entries, 0 to 10119
Data columns (total 73 columns):
identity/LineItemId                    10120 non-null object
identity/TimeInterval                  10120 non-null object
bill/InvoiceId                         0 non-null float64
bill/BillingEntity                     10120 non-null object
bill/BillType                          10120 non-null object
bill/PayerAccountId                    10120 non-null int64
bill/BillingPeriodStartDate            10120 non-null object
bill/BillingPeriodEndDate              10120 non-null object
lineItem/UsageAccountId                10120 non-null int64
lineItem/LineItemType                  10120 non-null object
lineItem/UsageStartDate                10120 non-null object
lineItem/UsageEndDate                  10120 non-null object
lineItem/ProductCode                   10120 non-null object
lineItem/UsageType                     10120 non-null object
lineItem/Operation                     101

Convert two columns to datetime, and sort all entries by time:

In [4]:
r['lineItem/UsageStartDate'] = pd.to_datetime(r['lineItem/UsageStartDate'])
r['lineItem/UsageEndDate'] = pd.to_datetime(r['lineItem/UsageEndDate'])
r.sort_values('lineItem/UsageStartDate', inplace=True)
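
As an alternative to converting the columns after loading, the dates can be parsed while the file is read. The sketch below assumes a tiny made-up CSV (`sample_csv` and its two rows are hypothetical, not taken from the report) and uses the `parse_dates` argument of `pd.read_csv`. Depending on the pandas version, timestamps ending in `Z` may come back as timezone-aware UTC values rather than naive ones.

```python
import io
import pandas as pd

# Two made-up line items standing in for the real report (hypothetical data).
sample_csv = io.StringIO(
    "lineItem/UsageStartDate,lineItem/UsageEndDate,lineItem/BlendedCost\n"
    "2017-02-01T01:00:00Z,2017-02-01T02:00:00Z,0.215\n"
    "2017-02-01T00:00:00Z,2017-02-01T01:00:00Z,0.215\n"
)

date_cols = ['lineItem/UsageStartDate', 'lineItem/UsageEndDate']

# parse_dates converts the listed columns while the file is being read.
sample = pd.read_csv(sample_csv, parse_dates=date_cols)

# Sort by start time without modifying the frame in place.
sample = sample.sort_values('lineItem/UsageStartDate')

print(sample.dtypes)
```

Assigning the result of `sort_values` instead of passing `inplace=True` keeps each step explicit, which can make a notebook cell safer to re-run.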

In [5]:
r.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10120 entries, 0 to 10119
Data columns (total 73 columns):
identity/LineItemId                    10120 non-null object
identity/TimeInterval                  10120 non-null object
bill/InvoiceId                         0 non-null float64
bill/BillingEntity                     10120 non-null object
bill/BillType                          10120 non-null object
bill/PayerAccountId                    10120 non-null int64
bill/BillingPeriodStartDate            10120 non-null object
bill/BillingPeriodEndDate              10120 non-null object
lineItem/UsageAccountId                10120 non-null int64
lineItem/LineItemType                  10120 non-null object
lineItem/UsageStartDate                10120 non-null datetime64[ns]
lineItem/UsageEndDate                  10120 non-null datetime64[ns]
lineItem/ProductCode                   10120 non-null object
lineItem/UsageType                     10120 non-null object
lineItem/Operation        

## AWS Price Information

### AWS Offer Index

In [6]:
oi = AWSOffersIndex()

In [7]:
oi.published.strftime('%c %Z')

'Mon Feb 20 21:47:26 2017 UTC+00:00'

In [8]:
oi.accessed.strftime('%c %Z')

'Tue Feb 21 02:48:13 2017 UTC+00:00'

### Amazon EC2

In [9]:
ec2o = oi.offer('AmazonEC2')

In [10]:
ec2o.version

'20170210223144'

In [11]:
ec2o.published.strftime('%c %Z')

'Fri Feb 10 22:31:44 2017 UTC+00:00'

In [12]:
ec2o.accessed.strftime('%c %Z')

'Tue Feb 21 02:48:27 2017 UTC+00:00'

### Amazon S3

In [13]:
s3o = oi.offer('AmazonS3')

In [14]:
s3o.version

'20170127221642'

In [15]:
s3o.published.strftime('%c %Z')

'Fri Jan 27 22:16:42 2017 UTC+00:00'

In [16]:
s3o.accessed.strftime('%c %Z')

'Tue Feb 21 02:48:28 2017 UTC+00:00'

## Analysis

Time span of the report's data:

In [17]:
r['lineItem/UsageStartDate'].min()

Timestamp('2017-02-01 00:00:00')

In [18]:
r['lineItem/UsageEndDate'].max()

Timestamp('2017-02-20 08:00:00')

How many different Availability Zones?

In [19]:
r['lineItem/AvailabilityZone'].unique()

array(['us-east-1b', nan], dtype=object)

How many different product codes?

In [20]:
r['lineItem/ProductCode'].unique()

array(['AmazonEC2', 'AmazonS3'], dtype=object)

How many report entries for each product code?

In [21]:
r['lineItem/ProductCode'].value_counts()

AmazonEC2    7079
AmazonS3     3041
Name: lineItem/ProductCode, dtype: int64

Cost breakdown per AWS product:

In [22]:
grp = r.groupby('lineItem/ProductCode')
x = grp['lineItem/BlendedCost'].sum()
x

lineItem/ProductCode
AmazonEC2    315.790803
AmazonS3       8.109816
Name: lineItem/BlendedCost, dtype: float64
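
The same totals can be expressed as a share of the overall bill. This is a minimal sketch that re-enters the two sums printed above by hand (the `cost` and `share` names are illustrative) rather than reading them from the report:

```python
import pandas as pd

# Per-product totals copied from the output above.
cost = pd.Series({'AmazonEC2': 315.790803, 'AmazonS3': 8.109816},
                 name='lineItem/BlendedCost')

# Share of the total blended cost for each product, in percent.
share = (cost / cost.sum() * 100).round(2)
print(share)
```

With these figures, AmazonEC2 accounts for roughly 97.5% of the blended cost and AmazonS3 for the remaining 2.5%.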

In [23]:
f = Donut(x[x != 0], title='AWS Product Cost Breakdown')
show(f)

### Analysis: AmazonEC2

In [24]:
ec2 = r[r['lineItem/ProductCode'] == 'AmazonEC2']

How many usage types?

In [25]:
ec2['lineItem/UsageType'].value_counts()

DataTransfer-In-Bytes          1392
DataTransfer-Out-Bytes         1392
BoxUsage:m4.xlarge             1388
EBSOptimized:m4.xlarge         1388
EBS:VolumeUsage.gp2            1386
DataTransfer-Regional-Bytes      46
USE1-USW2-AWS-Out-Bytes          25
USE1-USW2-AWS-In-Bytes           25
EBS:SnapshotUsage                19
USE1-APS1-AWS-Out-Bytes           3
USE1-APS1-AWS-In-Bytes            3
USE1-EU-AWS-Out-Bytes             2
USE1-USW1-AWS-In-Bytes            2
USE1-EU-AWS-In-Bytes              2
USE1-APS3-AWS-In-Bytes            2
USE1-USW1-AWS-Out-Bytes           2
USE1-APS3-AWS-Out-Bytes           2
Name: lineItem/UsageType, dtype: int64

Cost breakdown by usage type:

In [26]:
grp = ec2.groupby('lineItem/UsageType')

In [27]:
x = grp['lineItem/BlendedCost'].sum()
x

lineItem/UsageType
BoxUsage:m4.xlarge             298.420000
DataTransfer-In-Bytes            0.000000
DataTransfer-Out-Bytes           0.000000
DataTransfer-Regional-Bytes      0.000000
EBS:SnapshotUsage                1.120799
EBS:VolumeUsage.gp2             16.249997
EBSOptimized:m4.xlarge           0.000000
USE1-APS1-AWS-In-Bytes           0.000000
USE1-APS1-AWS-Out-Bytes          0.000000
USE1-APS3-AWS-In-Bytes           0.000000
USE1-APS3-AWS-Out-Bytes          0.000005
USE1-EU-AWS-In-Bytes             0.000000
USE1-EU-AWS-Out-Bytes            0.000000
USE1-USW1-AWS-In-Bytes           0.000000
USE1-USW1-AWS-Out-Bytes          0.000000
USE1-USW2-AWS-In-Bytes           0.000000
USE1-USW2-AWS-Out-Bytes          0.000001
Name: lineItem/BlendedCost, dtype: float64

In [28]:
f = Donut(x[x != 0], title='AWS EC2 Product Cost Breakdown',
          plot_height=600, plot_width=600)
show(f)

Which operations appear under each AmazonEC2 usage type?

In [29]:
grp['lineItem/Operation'].unique()

lineItem/UsageType
BoxUsage:m4.xlarge                          [RunInstances]
DataTransfer-In-Bytes                       [RunInstances]
DataTransfer-Out-Bytes                      [RunInstances]
DataTransfer-Regional-Bytes    [PublicIP-Out, PublicIP-In]
EBS:SnapshotUsage                         [CreateSnapshot]
EBS:VolumeUsage.gp2                     [CreateVolume-Gp2]
EBSOptimized:m4.xlarge                            [Hourly]
USE1-APS1-AWS-In-Bytes                       [PublicIP-In]
USE1-APS1-AWS-Out-Bytes                     [PublicIP-Out]
USE1-APS3-AWS-In-Bytes                       [PublicIP-In]
USE1-APS3-AWS-Out-Bytes                     [PublicIP-Out]
USE1-EU-AWS-In-Bytes                         [PublicIP-In]
USE1-EU-AWS-Out-Bytes                       [PublicIP-Out]
USE1-USW1-AWS-In-Bytes                       [PublicIP-In]
USE1-USW1-AWS-Out-Bytes                     [PublicIP-Out]
USE1-USW2-AWS-In-Bytes                       [PublicIP-In]
USE1-USW2-AWS-Out-Bytes              

Which product SKUs appear under each AmazonEC2 usage type?

In [30]:
skus = grp['product/sku'].unique()
skus

lineItem/UsageType
BoxUsage:m4.xlarge             [47GP959QAF69YPG5]
DataTransfer-In-Bytes          [9MG5B7V4UUU2WPAV]
DataTransfer-Out-Bytes         [HQEH3ZWJVT46JHRG]
DataTransfer-Regional-Bytes    [PNUBVW4CPC8XA46W]
EBS:SnapshotUsage              [7U7TWP44UP36AT3R]
EBS:VolumeUsage.gp2            [HY3BZPP2B6K8MSJF]
EBSOptimized:m4.xlarge         [9W95WEA2F9V4BVUJ]
USE1-APS1-AWS-In-Bytes         [DZHU5BKVVZXEHEYR]
USE1-APS1-AWS-Out-Bytes        [2T92AZQGNFAQHEXW]
USE1-APS3-AWS-In-Bytes         [CA5RFBSY7MFBKHTZ]
USE1-APS3-AWS-Out-Bytes        [BHM7BATGW8NG4NZ4]
USE1-EU-AWS-In-Bytes           [725FHGTUB3P2B9EU]
USE1-EU-AWS-Out-Bytes          [NW4B786HNAH6HZ7R]
USE1-USW1-AWS-In-Bytes         [NBUQPTSYHSXS2EB6]
USE1-USW1-AWS-Out-Bytes        [8X3QU4DYXVJAXZK3]
USE1-USW2-AWS-In-Bytes         [33Y5KYZ4JQEF6J66]
USE1-USW2-AWS-Out-Bytes        [XGXYRYWGNXSSEUVT]
Name: product/sku, dtype: object

Display description for all of these SKUs and their pricing tiers:

In [31]:
for ut, sku_list in skus.items():
    print('Usage Type:', ut)
    for sku in sku_list:
        print('  SKU:', sku)
        prod = ec2o.product(sku)
        # Only a single pricing entry per product is handled here.
        if len(prod.pricing) > 1:
            raise ValueError(
                '{}: More than one product pricing info'.format(sku))
        tiers = prod.pricing[0].tiers
        for t in tiers:
            print('    ', t.description)
        print('\n')

Usage Type: BoxUsage:m4.xlarge
  SKU: 47GP959QAF69YPG5
     $0.215 per On Demand Linux m4.xlarge Instance Hour


Usage Type: DataTransfer-In-Bytes
  SKU: 9MG5B7V4UUU2WPAV
     $0.000 per GB - data transfer in per month


Usage Type: DataTransfer-Out-Bytes
  SKU: HQEH3ZWJVT46JHRG
     $0.000 per GB - first 1 GB of data transferred out per month
     $0.090 per GB - first 10 TB / month data transfer out beyond the global free tier
     $0.050 per GB - greater than 150 TB / month data transfer out
     $0.070 per GB - next 100 TB / month data transfer out
     $0.085 per GB - next 40 TB / month data transfer out


Usage Type: DataTransfer-Regional-Bytes
  SKU: PNUBVW4CPC8XA46W
     $0.010 per GB - regional data transfer - in/out/between EC2 AZs or using elastic IPs or ELB


Usage Type: EBS:SnapshotUsage
  SKU: 7U7TWP44UP36AT3R
     $0.05 per GB-Month of snapshot data stored - US East (Northern Virginia)


Usage Type: EBS:VolumeUsage.gp2
  SKU: HY3BZPP2B6K8MSJF
     $0.10 per GB-month of G
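
The listed on-demand rate can be cross-checked against the report totals. This sketch re-enters three figures from the outputs above by hand and assumes that each `BoxUsage:m4.xlarge` line item covers one full instance hour, which the report itself does not state:

```python
# Figures copied from the outputs above.
box_usage_cost = 298.42   # BlendedCost total for BoxUsage:m4.xlarge
hourly_rate = 0.215       # $ per On Demand Linux m4.xlarge Instance Hour
box_usage_items = 1388    # number of BoxUsage:m4.xlarge line items

# Instance hours implied by the total cost at the listed rate.
implied_hours = box_usage_cost / hourly_rate
print(round(implied_hours, 3))

# Assumption: one line item represents one full instance hour.
print(round(implied_hours) == box_usage_items)
```

The implied hour count matches the number of line items, which is consistent with a single m4.xlarge instance billed at the on-demand rate for every hour it appears in the report.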

### Analysis: AmazonS3

In [32]:
s3 = r[r['lineItem/ProductCode'] == 'AmazonS3']

How many usage types?

In [33]:
s3['lineItem/UsageType'].value_counts()

USE1-USW2-AWS-Out-Bytes    1393
Requests-Tier2             1006
Requests-Tier1              578
DataTransfer-Out-Bytes       44
TimedStorage-ByteHrs         20
Name: lineItem/UsageType, dtype: int64

Cost breakdown by usage type:

In [34]:
grp = s3.groupby('lineItem/UsageType')

In [35]:
x = grp['lineItem/BlendedCost'].sum()
x

lineItem/UsageType
DataTransfer-Out-Bytes     0.000000
Requests-Tier1             0.336420
Requests-Tier2             0.395917
TimedStorage-ByteHrs       7.045189
USE1-USW2-AWS-Out-Bytes    0.332290
Name: lineItem/BlendedCost, dtype: float64

In [36]:
f = Donut(x[x != 0], title='AWS S3 Cost Breakdown',
          plot_height=600, plot_width=600)
show(f)

Which operations appear under each AmazonS3 usage type?

In [37]:
grp['lineItem/Operation'].unique()

lineItem/UsageType
DataTransfer-Out-Bytes                          [GetObject, ListBucket]
Requests-Tier1              [ListBucketVersions, PutObject, ListBucket]
Requests-Tier2             [HeadBucket, GetObject, ReadACL, HeadObject]
TimedStorage-ByteHrs                                  [StandardStorage]
USE1-USW2-AWS-Out-Bytes     [ListBucketVersions, HeadBucket, GetObject]
Name: lineItem/Operation, dtype: object

Which product SKUs appear under each AmazonS3 usage type?

In [38]:
skus = grp['product/sku'].unique()
skus

lineItem/UsageType
DataTransfer-Out-Bytes     [HQEH3ZWJVT46JHRG]
Requests-Tier1             [E9YHNFENF4XQBZR6]
Requests-Tier2             [ZWQ6Q48CRJXX4FXE]
TimedStorage-ByteHrs       [WP9ANXZGBYYSGJEA]
USE1-USW2-AWS-Out-Bytes    [XGXYRYWGNXSSEUVT]
Name: product/sku, dtype: object

Display description for all of these SKUs and their pricing tiers:

In [39]:
for ut, sku_list in skus.items():
    print('Usage Type:', ut)
    for sku in sku_list:
        print('  SKU:', sku)
        prod = s3o.product(sku)
        # Only a single pricing entry per product is handled here.
        if len(prod.pricing) > 1:
            raise ValueError(
                '{}: More than one product pricing info'.format(sku))
        tiers = prod.pricing[0].tiers
        for t in tiers:
            print('    ', t.description)
        print('\n')

Usage Type: DataTransfer-Out-Bytes
  SKU: HQEH3ZWJVT46JHRG
     $0.000 per GB - first 1 GB of data transferred out per month
     $0.090 per GB - first 10 TB / month data transfer out beyond the global free tier
     $0.050 per GB - greater than 150 TB / month data transfer out
     $0.070 per GB - next 100 TB / month data transfer out
     $0.085 per GB - next 40 TB / month data transfer out


Usage Type: Requests-Tier1
  SKU: E9YHNFENF4XQBZR6
     $0.005 per 1,000 PUT, COPY, POST, or LIST requests


Usage Type: Requests-Tier2
  SKU: ZWQ6Q48CRJXX4FXE
     $0.004 per 10,000 GET and all other requests


Usage Type: TimedStorage-ByteHrs
  SKU: WP9ANXZGBYYSGJEA
     $0.021 per GB - storage used / month over 500 TB
     $0.022 per GB - next 450 TB / month of storage used
     $0.023 per GB - first 50 TB / month of storage used


Usage Type: USE1-USW2-AWS-Out-Bytes
  SKU: XGXYRYWGNXSSEUVT
     $0.02 per GB - US East (Northern Virginia) data transfer to US West (Oregon)
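
The tier descriptions are printed in no particular order, so it may help to see how graduated pricing would apply to them. The sketch below uses the three `TimedStorage-ByteHrs` rates listed above. The `tiered_cost` function and the `S3_STORAGE_TIERS` table are hypothetical helpers, not part of `aws_price_list`, and the code assumes that each rate applies only to the usage that falls inside its tier and that 1 TB equals 1024 GB.

```python
GB_PER_TB = 1024  # assumption: binary terabytes

# (tier size in GB, $ per GB-month), ordered from the first tier to the last.
# Rates come from the WP9ANXZGBYYSGJEA tier descriptions above.
S3_STORAGE_TIERS = [
    (50 * GB_PER_TB, 0.023),    # first 50 TB / month
    (450 * GB_PER_TB, 0.022),   # next 450 TB / month
    (float('inf'), 0.021),      # over 500 TB / month
]


def tiered_cost(gb_months, tiers=S3_STORAGE_TIERS):
    """Return the cost of gb_months of usage under graduated tier pricing."""
    remaining = gb_months
    cost = 0.0
    for size, rate in tiers:
        # Charge this tier's rate only for the usage that fits inside it.
        used = min(remaining, size)
        cost += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return cost


print(tiered_cost(100))                  # entirely within the first tier
print(tiered_cost(50 * GB_PER_TB + 10))  # 10 GB spill into the second tier
```

Usage of the size seen in this report stays well inside the first tier, so only the $0.023 rate would apply under these assumptions.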




## Sandbox

Experiments with various ideas follow. Some of this may eventually move into the sections above.

In [40]:
f = Step(s3[(s3['lineItem/UsageType'] == 'TimedStorage-ByteHrs') & 
            (s3['lineItem/Operation'] == 'StandardStorage')],
         x='lineItem/UsageStartDate', y='lineItem/UsageAmount', xscale='datetime')
show(f)