Cost Modeling Plan

Aleksandar Jelenak edited this page Mar 7, 2017 · 4 revisions

The goal of this document is to define the methodology for estimating the cost of running the three proposed Hyrax architectures on Amazon Web Services (AWS).

All costs fall into two general categories: fixed and dynamic. Fixed costs can apply to all three architectures, while dynamic costs always depend on the specific architecture. Examples of fixed costs are:

  • Data storage in S3.
  • AWS outbound data as a result of running use cases.
  • EC2 instance(s).

Examples of dynamic costs are:

  • The type and size of the Hyrax server's cache employed.
  • The number of S3 requests made by the Hyrax server.
  • The data flow between the S3 bucket and the Hyrax server if they are not located in the same AWS region.
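The fixed/dynamic split above can be sketched as two small functions. This is only an illustration: the `UnitPrices` fields, the function names, and the choice of billing units are assumptions made for this sketch, and no actual AWS prices are implied. Real values would have to be taken from current AWS pricing.

```python
from dataclasses import dataclass


@dataclass
class UnitPrices:
    """Hypothetical unit prices; fill in from current AWS pricing."""
    s3_storage_per_gb_month: float
    data_out_per_gb: float
    ec2_per_hour: float
    s3_get_per_1000: float
    cache_per_gb_month: float
    cross_region_per_gb: float


def fixed_cost(p, stored_gb, outbound_gb, ec2_hours):
    """Costs that can apply to all architectures: S3 storage, outbound data, EC2."""
    return (stored_gb * p.s3_storage_per_gb_month
            + outbound_gb * p.data_out_per_gb
            + ec2_hours * p.ec2_per_hour)


def dynamic_cost(p, s3_requests, cache_gb, cross_region_gb=0.0):
    """Architecture-specific costs: S3 requests, cache, cross-region transfer.

    cross_region_gb defaults to zero, matching a setup where the S3 bucket
    and the Hyrax server are in the same AWS region.
    """
    return (s3_requests / 1000 * p.s3_get_per_1000
            + cache_gb * p.cache_per_gb_month
            + cross_region_gb * p.cross_region_per_gb)
```

The total for one architecture over one month would then be `fixed_cost(...) + dynamic_cost(...)`, with only the arguments of `dynamic_cost` changing between architectures.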

The following are not considered in the cost modeling:

  • EC2 instance type for the Hyrax server.
  • How many Hyrax servers to run at any one time.
  • Other types of data storage available in AWS, like Glacier or S3 Infrequent Access.
  • Usage of any other AWS services that could make the proposed architectures more efficient or robust, like Lambda, SNS, etc.

The common configuration for all three architectures is:

  • One EC2 instance of the same type for the Hyrax server.
  • Sample HDF5 data stored in S3.
  • Both the EC2 instance and the S3 bucket with sample data are in the same AWS region.

Cost Variables

Besides the fixed costs of staging data in S3 and running Hyrax on an EC2 instance, all other usage costs stem from user DAP requests. The ability to model these requests appropriately is crucial for estimating realistic total costs. The properties of user DAP requests that matter for cost modeling are:

  • type (metadata, data),
  • if a data request:
    • subsetting, or
    • aggregation, or
    • file download;
  • source data file(s).
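One possible way to capture these properties when tallying requests is a small record type. The class and field names below are hypothetical, chosen only for this sketch, and are not part of Hyrax or any DAP library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class RequestType(Enum):
    METADATA = "metadata"
    DATA = "data"


class DataOperation(Enum):
    SUBSETTING = "subsetting"
    AGGREGATION = "aggregation"
    FILE_DOWNLOAD = "file download"


@dataclass
class DapRequest:
    """One user DAP request, reduced to the properties relevant for cost."""
    request_type: RequestType
    source_files: List[str]
    operation: Optional[DataOperation] = None

    def __post_init__(self):
        # Only data requests carry an operation (subsetting, aggregation, download).
        if self.request_type is RequestType.DATA and self.operation is None:
            raise ValueError("a data request needs an operation")
        if self.request_type is RequestType.METADATA and self.operation is not None:
            raise ValueError("a metadata request has no data operation")
```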

Architecture #1

Cost variables:

  • Type of cache used for storing data files retrieved from the S3 bucket:
    • Elastic Block Store (EBS),
    • Elastic File System (EFS),
    • ElastiCache;
  • Cache size;
  • Cache eviction policy.

Assumptions:

  • DAP metadata requests can be served from the DMR++ files only.

Of the three cache types, EBS seems most appropriate when running only one Hyrax server. However, a convenient feature of EFS is that it scales automatically to the actual size of its data, which reduces the cost of unused cache space. Coupled with an appropriate cache eviction policy, EFS may even be the most cost-effective option for a single Hyrax server.

User DAP requests have a limited influence on cost in this architecture: regardless of its type, one DAP request produces either zero or one Hyrax S3 request.
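The effect of cache size on the number of S3 requests can be explored with a small simulation. The sketch below assumes a least-recently-used (LRU) eviction policy and a cache capacity counted in whole files. Both are assumptions made for illustration; the eviction policy is itself one of the cost variables listed above. The function name is hypothetical.

```python
from collections import OrderedDict


def count_s3_fetches(file_accesses, cache_capacity):
    """Count S3 fetches for a sequence of accessed file names.

    A cache hit costs zero S3 requests and a miss costs one.
    Eviction is LRU; capacity is the number of files the cache can hold.
    """
    cache = OrderedDict()
    fetches = 0
    for name in file_accesses:
        if name in cache:
            cache.move_to_end(name)        # mark as most recently used
        else:
            fetches += 1                   # miss: file is fetched from S3
            cache[name] = True
            if len(cache) > cache_capacity:
                cache.popitem(last=False)  # evict the least recently used file
    return fetches
```

Running the same access sequence with several values of `cache_capacity` shows how quickly additional cache space stops reducing the number of S3 fetches.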

Architecture #2

Cost variables:

  • The number of HTTP range GET requests for each DAP data request (in theory, anywhere from one up to the total number of the dataset's chunks).

Assumptions:

  • DAP metadata requests can be served from the DMR++ files only.
  • No Hyrax cache.
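For a subsetting request, the number of range GET requests can be bounded by counting the chunks that the requested index range intersects. The sketch below assumes one range GET per intersected chunk, with no merging of adjacent byte ranges, and a regular chunk grid. The function name and argument layout are hypothetical.

```python
from math import prod


def chunks_touched(start, stop, chunk_shape):
    """Number of chunks intersected by the selection [start, stop) per dimension.

    start, stop, and chunk_shape are sequences with one entry per dimension;
    stop is exclusive, as in Python slices.
    """
    counts = []
    for lo, hi, size in zip(start, stop, chunk_shape):
        if hi <= lo:
            raise ValueError("empty selection")
        first_chunk = lo // size
        last_chunk = (hi - 1) // size
        counts.append(last_chunk - first_chunk + 1)
    return prod(counts)
```

For a 100 x 100 dataset stored in 10 x 10 chunks, selecting a single element touches one chunk, while selecting the whole array touches all 100 chunks, which are the two extremes mentioned above.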

Architecture #3

Cost variables:

  • Data storage cost in S3 (the stored volume can be reduced by the difference between the total file size and the size of its datasets alone).
  • The number of Hyrax S3 requests (HTTP GET) for each DAP data request (in theory, anywhere from one up to the total number of the dataset's chunks).

Assumptions:

  • DAP metadata requests can be served from the DMR++ files only.
  • No Hyrax cache.

Modeling Methodology

Sources of usage information:

  • AWS Hourly Usage and Cost Reports,
  • S3 access logs,
  • NASA DAAC OPeNDAP logs.

NASA DAAC OPeNDAP logs are the only real-life source of user DAP requests, so it is important to mine them for useful information. The following information should be extracted for at least one of the data collections appearing in those logs:

  • Number (min/avg/max) of DAP requests per day.
  • Percentage of metadata and data DAP requests in the daily total.
  • Total byte size (min/avg/max) of DAP responses per day.
  • Typical byte size of a metadata DAP response.
  • Daily total byte size of DAP responses as a percentage of the total size of the hosted data.
  • Number (min/avg/max) of arrays (variables, HDF5 datasets) in one DAP request.
  • How often the same file was accessed in a given time period (e.g. one day).
  • Rate of return access to a file on the 2nd, 3rd, etc. successive day after the file was first accessed.
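A few of these statistics can be sketched in Python once the logs have been parsed. The record layout below, a `(day, request_type, file_name)` tuple, is a hypothetical intermediate form. The actual DAAC log format is not specified here and would need its own parser. All function names are assumptions made for this sketch.

```python
from collections import Counter, defaultdict
from datetime import timedelta
from statistics import mean


def daily_request_stats(records):
    """Return (min, avg, max) requests per day; records must be non-empty.

    Days with no requests at all do not appear in the logs and are not counted.
    """
    per_day = Counter(day for day, _, _ in records)
    counts = list(per_day.values())
    return min(counts), mean(counts), max(counts)


def metadata_share(records):
    """Fraction of all requests whose type is 'metadata'."""
    kinds = [kind for _, kind, _ in records]
    return kinds.count("metadata") / len(kinds)


def return_access_rate(records, day_offset):
    """Fraction of files accessed again exactly day_offset days after first access.

    day_offset=1 corresponds to the 2nd successive day, 2 to the 3rd, and so on.
    """
    days_by_file = defaultdict(set)
    for day, _, file_name in records:
        days_by_file[file_name].add(day)
    returned = sum(
        1 for days in days_by_file.values()
        if min(days) + timedelta(days=day_offset) in days
    )
    return returned / len(days_by_file)
```

Here `day` is expected to be a `datetime.date`, so that adding a `timedelta` yields the date to look up.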

AWS Hourly Usage and Cost Reports can provide information on:

  • exact AWS services (product SKU) used,
  • all accrued (fixed and dynamic) costs.

S3 access logs are another essential source of information:

  • The number of S3 requests for each user DAP request.
  • The response time and byte size of each S3 response.

With all this information, we should be able to:

  • Estimate the relative distribution of DAP request types (% of metadata, data, download, etc.).
  • Estimate the optimal number of files to keep in a cache and, hence, the cache size.
  • Monitor the throughput of S3.
  • Estimate the upper and lower limits of the number of S3 requests for different DAP data requests.

Combining these estimates should allow us to model more realistic costs associated with each of the three architectures.