# Setup

In [1]:
import pandas as pd

In [2]:
config = {
    'n_col_metadata': 50, # FH135 has 46 columns
    'n_images': 3e4, # FH135 has 33783 images
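    'raw_to_tiff_ratio': 5, # Number of raw images per tiff image
    'raw_image_size': 4.4e-3, # In GB
    'tiff_image_size': 13e-3, # In GB
    'n_referenced_images': 2e4,
    'default_byte_size': 4, # Assumed bytes per metadata value
    'units': ('GB', 1024**3), # (label, bytes per unit)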
    'disk_to_ram_ratio': 10, # Conservative numbers are higher
    'n_flights': 20, # Currently there are six flights
    'transfer_speed': 1e-2, # In GB/s.
    'processing_speed': 5e-3, # In GB/s. This is probably a conservative estimate.
    'n_full_metadata_queries_per_week': 50, # Should be conservative again
    'n_full_image_queries_per_week': 1,
    'minimum_processing_time': 60., # Assume no query uses less than this time in seconds
}

# Stakeholders

#### Far Horizons Needs

* Reference as many of the images as possible
* Combine referenced images into a cohesive map
* Give users control over how the images used to make the map are selected
* Give users control over how pixel values are calculated
* Select all images within a given distance of a coordinate
* Do the above for each flight
* Maximize accessibility, maintainability, and editability

#### External Needs

* Retrieve reliable pixel values within a given distance of a coordinate
* Retrieve and visualize a map
* Select different maps for different times

# Requirements Estimation

## Images

In [3]:
# Per flight reqs
n_tiff = config['n_images'] / (1. + config['raw_to_tiff_ratio'])
n_raw = config['n_images'] - n_tiff
images_volume = config['tiff_image_size'] * n_tiff + \
    config['raw_image_size'] * n_raw
images_volume

175.0
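
As a cross-check on the per-flight figure, the sketch below repeats the same arithmetic with the intermediate counts broken out. It assumes the image sizes are in GB and that `raw_to_tiff_ratio` means raw images per tiff image, which is how the cell above uses them.

```python
# Worked per-flight breakdown using the same values as the config above.
n_images = 3e4
raw_to_tiff_ratio = 5      # raw images per tiff image
raw_image_size = 4.4e-3    # GB (assumed unit)
tiff_image_size = 13e-3    # GB (assumed unit)

n_tiff = n_images / (1.0 + raw_to_tiff_ratio)  # 5000 tiff images
n_raw = n_images - n_tiff                      # 25000 raw images
tiff_volume = tiff_image_size * n_tiff         # roughly 65 GB
raw_volume = raw_image_size * n_raw            # roughly 110 GB
per_flight_volume = tiff_volume + raw_volume   # roughly 175 GB

print(f'{n_tiff:.0f} tiff + {n_raw:.0f} raw = {per_flight_volume:.0f} GB per flight')
```

Under these assumptions the raw images account for most of the per-flight volume, even though each raw image is smaller than a tiff.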

In [4]:
total_image_volume = images_volume * config['n_flights']

In [5]:
# Total image transfer time assuming every query retrieves all the images
image_time_estimate = (
    (
        # Time to transfer all the images
        total_image_volume
        / config['transfer_speed']
    )
    * config['n_full_image_queries_per_week']
    / 3600. # Convert seconds to hours
)

## Metadata

In [6]:
# Per flight reqs
row_size_bytes = config['n_col_metadata'] * config['default_byte_size']
table_size = (
    row_size_bytes * config['n_referenced_images']
    / config['units'][1]
)

In [7]:
# Across all flights
total_metadata_volume = table_size * config['n_flights']
metadata_ram = total_metadata_volume * config['disk_to_ram_ratio']

In [8]:
# Total metadata cpu usage assuming every query retrieves all the metadata
metadata_time_estimate = (
    (
        # Time to process all the metadata
        total_metadata_volume
        / config['processing_speed']
        + config['minimum_processing_time']
    )
    * config['n_full_metadata_queries_per_week']
    / 3600. # Convert seconds to hours
)
n_writes_metadata = (
    config['n_full_metadata_queries_per_week'] * config['n_referenced_images']
)
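
The image and metadata time estimates follow the same pattern: volume divided by a speed, plus an optional fixed overhead, times the number of queries per week. A minimal sketch of that pattern as one helper is shown below; the function name `weekly_hours` and its parameters are illustrative and not part of the notebook.

```python
def weekly_hours(volume_gb, speed_gb_per_s, queries_per_week, overhead_s=0.0):
    """Hours per week spent if every query touches the full volume once."""
    seconds_per_query = volume_gb / speed_gb_per_s + overhead_s
    return seconds_per_query * queries_per_week / 3600.0


# Images: 20 flights x 175 GB, transferred at 1e-2 GB/s, once per week.
image_hours = weekly_hours(20 * 175.0, 1e-2, 1)

# Metadata: 20 flights x 2e4 rows x 50 columns x 4 bytes, converted to GB.
metadata_volume = 20 * 2e4 * 50 * 4 / 1024**3
metadata_hours = weekly_hours(metadata_volume, 5e-3, 50, overhead_s=60.0)

print(f'images: {image_hours:.3g} hrs/week, metadata: {metadata_hours:.2g} hrs/week')
```

With these inputs the 60 s minimum processing time accounts for most of the metadata estimate, since processing the full table takes only about 15 s per query.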

## Summarize Requirements

In [9]:
print(
f'''
Size per row: {row_size_bytes/1000:.2g} KB
Number of writes per week: {n_writes_metadata:.2g}
Required volume for images: {total_image_volume:.2g} {config['units'][0]}
Images usage estimate: {image_time_estimate:.3g} hrs/week

Required volume for metadata: {total_metadata_volume:.2g} {config['units'][0]}
Required RAM for metadata: {metadata_ram:.2g} {config['units'][0]}
Metadata usage estimate: {metadata_time_estimate:.2g} hrs/week
'''
)


Size per row: 0.2 KB
Number of writes per week: 1e+06
Required volume for images: 3.5e+03 GB
Images usage estimate: 97.2 hrs/week

Required volume for metadata: 0.075 GB
Required RAM for metadata: 0.75 GB
Metadata usage estimate: 1 hrs/week



# Possible Solutions

## Overview

### Metadata Storage

All solutions will require a way for unauthenticated users to access the data.
[This guide](https://docs.aws.amazon.com/lambda/latest/operatorguide/public-endpoints.html) addresses a few possibilities, including using [AWS Lambda](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-lambda-tutorial.html) or a website.
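
To make the Lambda option concrete, here is a minimal sketch of a handler that could sit behind a public HTTP endpoint. It assumes the endpoint passes query parameters in `event['queryStringParameters']`, as API Gateway proxy integrations and Lambda function URLs do. The parameter names `lat`, `lon`, and `radius_km` are hypothetical, and the actual metadata lookup is left as a placeholder.

```python
import json


def lambda_handler(event, context):
    """Validate a coordinate query and return a JSON response."""
    params = event.get('queryStringParameters') or {}
    try:
        lat = float(params['lat'])
        lon = float(params['lon'])
        radius_km = float(params['radius_km'])
    except (KeyError, TypeError, ValueError):
        return {
            'statusCode': 400,
            'body': json.dumps({'error': 'lat, lon, and radius_km must be numbers'}),
        }

    # Placeholder: a real handler would query the metadata store here.
    images = []

    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps(
            {'lat': lat, 'lon': lon, 'radius_km': radius_km, 'images': images}
        ),
    }
```

The same handler shape would apply to any of the storage options below; only the placeholder lookup would change.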

#### Amazon RDS with Proxy

In this solution the DB lives on Amazon RDS.
The DB is stopped most of the time and is started only when a request is made.

Notes:
- There are ways to automate the starting and stopping of the DB instance ([see here](https://aws.amazon.com/blogs/database/schedule-amazon-rds-stop-and-start-using-aws-lambda/))
- Due to the small size of the DB, most of the cost will be in the proxy
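
The start-on-request idea could be sketched as below. The functions take an RDS client as an argument so they can be exercised without AWS credentials; with boto3 the client would come from `boto3.client('rds')`. The function names and the instance identifier in the usage comment are hypothetical. A stopped instance is not available immediately after the start call, so the first request after an idle period would have to wait or retry.

```python
def ensure_db_running(rds_client, instance_id):
    """Start the instance if it is stopped; return the status seen beforehand."""
    response = rds_client.describe_db_instances(DBInstanceIdentifier=instance_id)
    status = response['DBInstances'][0]['DBInstanceStatus']
    if status == 'stopped':
        rds_client.start_db_instance(DBInstanceIdentifier=instance_id)
    return status


def stop_db_if_available(rds_client, instance_id):
    """Stop the instance if it is running; return the status seen beforehand."""
    response = rds_client.describe_db_instances(DBInstanceIdentifier=instance_id)
    status = response['DBInstances'][0]['DBInstanceStatus']
    if status == 'available':
        rds_client.stop_db_instance(DBInstanceIdentifier=instance_id)
    return status


# Usage with boto3 (identifier is a made-up example):
#   import boto3
#   rds = boto3.client('rds')
#   ensure_db_running(rds, 'nitelite-metadata')
```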

#### Amazon DynamoDB

In this solution the DB lives on Amazon DynamoDB, and users query it.

Notes:
- Most operations we perform will be batch operations (batch writes are limited to 25 items per operation)
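
To give a sense of what the 25-item limit means in practice, the sketch below splits one flight's worth of referenced images into batches. The helper name `chunked` is illustrative. If boto3 is used, `Table.batch_writer()` handles this batching automatically, so explicit chunking like this would mainly matter when calling the low-level batch API directly.

```python
from itertools import islice


def chunked(items, batch_size=25):
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


n_referenced_images = int(2e4)  # per flight, from the config above
batches = list(chunked(range(n_referenced_images)))
print(f'{len(batches)} batch operations per flight')
```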

#### S3

In this solution the data is stored on S3, and the full metadata dataset
is downloaded every time it's used. This is feasible because the full metadata
dataset is anticipated to be <0.1 GB.

Notes:
- S3 Object Lambda could feasibly be used to filter a selection.
- We might just want to store the data as a CSV.
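
If the metadata is a CSV that is downloaded in full, selecting images near a coordinate can happen on the client. The sketch below uses only the standard library and a great-circle (haversine) distance. The column names `filename`, `lat`, and `lon` and the two example rows are made up for illustration, since the real metadata schema is not shown here.

```python
import csv
import io
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def select_within(csv_text, lat, lon, radius_km):
    """Return the CSV rows whose (lat, lon) lies within radius_km of the target."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in rows
        if haversine_km(float(row['lat']), float(row['lon']), lat, lon) <= radius_km
    ]


# Made-up example rows standing in for the downloaded metadata file.
example_csv = 'filename,lat,lon\na.tiff,39.77,-86.16\nb.tiff,41.88,-87.63\n'
selected = select_within(example_csv, lat=39.8, lon=-86.2, radius_km=10.0)
print([row['filename'] for row in selected])
```

At the anticipated table size (20 flights of roughly 2e4 rows each), a linear scan like this should be quick enough that no index is needed.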

## Cost Estimates

In [11]:
estimates = pd.read_csv('./NITELite_estimate.csv', header=5, skipfooter=2, engine='python') # skipfooter requires the python engine
estimates


| | Group hierarchy | Region | Description | Service | Upfront | Monthly | First 12 months total | Currency | Status | Configuration summary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NITELite_pipeline | US East (N. Virginia) | DynamoDB:metadata | DynamoDB on-demand capacity | 0.0 | 13.13 | 157.56 | USD | | Table class (Standard), Average item size (all... |
| 1 | NITELite_pipeline | US East (N. Virginia) | DynamoDB:metadata | DynamoDB Backup and restore | 0.0 | 0.9 | 10.8 | USD | | On-demand backup data storage (2 GB), Table da... |
| 2 | NITELite_pipeline | US East (N. Virginia) | DynamoDB:metadata | DynamoDB Data export to Amazon S3 | 0.0 | 0.4 | 4.8 | USD | | Full export to Amazon S3 (2 GB), Incremental e... |
| 3 | NITELite_pipeline | US East (N. Virginia) | DynamoDB:metadata | DynamoDB Data Import from Amazon S3 | 0.0 | 0.3 | 3.6 | USD | | Uncompressed source file size for Import from ... |
| 4 | NITELite_pipeline | US East (N. Virginia) | RDS:metadata | Amazon RDS for PostgreSQL | 0.0 | 23.2325 | 278.79 | USD | | Storage volume (General Purpose SSD (gp2)), St... |
| 5 | NITELite_pipeline | US East (N. Virginia) | S3:metadata | S3 Standard | 0.000715 | 0.03 | 0.36 | USD | | S3 Standard storage (1 GB per month), S3 Stand... |
| 6 | NITELite_pipeline | US East (N. Virginia) | S3:metadata | Data Transfer | 0.0 | 0.45 | 5.4 | USD | | DT Inbound: Internet (0 TB per month), DT Outb... |
| 7 | NITELite_pipeline | US East (N. Virginia) | S3:images | S3 Standard | 0.0 | 94.65 | 1135.8 | USD | | S3 Standard storage (4 TB per month), PUT, COP... |
| 8 | NITELite_pipeline | US East (N. Virginia) | S3:images | Data Transfer | 0.0 | 18.0 | 216.0 | USD | | DT Inbound: Internet (0 TB per month), DT Outb... |
| 9 | NITELite_pipeline | US East (N. Virginia) | RDS:images | Amazon RDS for PostgreSQL | 0.0 | 1027.4734 | 12329.68 | USD | | Storage volume (General Purpose SSD (gp2)), St... |


In [12]:
estimates.groupby('Description')['Monthly'].sum()

Description
DynamoDB:metadata      14.7300
RDS:images           1027.4734
RDS:metadata           23.2325
S3:images             112.6500
S3:metadata             0.4800
Name: Monthly, dtype: float64
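
For the metadata options, the monthly totals can be annualized and expressed relative to the cheapest option. This is a small sketch with the monthly figures copied by hand from the grouped output; the variable names are illustrative.

```python
# Monthly metadata costs in USD, copied from the grouped totals.
monthly_usd = {
    'DynamoDB:metadata': 14.73,
    'RDS:metadata': 23.2325,
    'S3:metadata': 0.48,
}

annual_usd = {name: 12 * monthly for name, monthly in monthly_usd.items()}
baseline = monthly_usd['S3:metadata']

# Print cheapest option first, with cost relative to S3.
for name, monthly in sorted(monthly_usd.items(), key=lambda item: item[1]):
    print(f'{name}: {annual_usd[name]:.2f} USD/year, {monthly / baseline:.0f}x S3')
```

The annual figures agree with the sum of the "First 12 months total" column for each group. All three options are small in absolute terms.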

## Pros/Cons

Storing the images on RDS is cost-prohibitive (about $1,027/month, versus about $113/month on S3), so keeping the images on S3, where they are now, is the only real choice.
The rest of the analysis focuses on the metadata.

### RDS

Pros:
- Good SQL practice ;)

Cons:
- Requires an elaborate set-up
- Most expensive of the options if using a proxy. Not prohibitively expensive, but not free.

### DynamoDB

Pros:
- Fully on-demand usage.

Cons:
- Not prohibitively expensive, but not free.

### S3

Pros:
- Compatible with CSV, the format researchers will expect.
- Practically free.

Cons:
- Access *may* be more difficult than for RDS or DynamoDB.