Skip to content
This repository has been archived by the owner on Jan 18, 2024. It is now read-only.

Work to reduce database costs #32

Open
schnuerle opened this issue Apr 23, 2018 · 8 comments
Open

Work to reduce database costs #32

schnuerle opened this issue Apr 23, 2018 · 8 comments
Labels
Future Effort Help Wanted Good items to start with if you are looking to help with the project Phase 1 RDS End to end data processor with hooks and alarms

Comments

@schnuerle
Copy link
Contributor

Once the RDS is complete in Phase 1, we'd like to find ways to reduce the overall cloud costs. If you have ideas, you can make a pull request to add support for it with examples to this repo so everyone can benefit.

One idea is to pull data into a database that is real-time in nature (gets updated every 2 minutes, but maybe only has 1 day of data) that runs all the time, and then have a second serverless database that is historic in nature (has all data and gets an update from the real-time table nightly) which will only be active when queried, thus saving costs.

Use this issue as a place to discuss and collaborate.

@schnuerle schnuerle created this issue from a note in Phase 1 - Raw Waze to RDS (Data Store) Apr 23, 2018
@schnuerle schnuerle added Help Wanted Good items to start with if you are looking to help with the project Phase 1 RDS End to end data processor with hooks and alarms labels Apr 23, 2018
@jedsundwall
Copy link

A "serverless database" could be nice. Depending on the volume of data and number of tables, there are a few ways you could do this with S3 – either storing the data in columnar a format like Parquet or simply storing it as gzipped CSVs. Depending on the route you take, you can use S3 Select or Athena to interrogate the data.

Going even further, if this is data that you know you can reliably recreate if needed, you could save even more costs by storing it in S3 One Zone-Infrequent Access.

@schnuerle
Copy link
Contributor Author

Thanks @jedsundwall for hopping in with suggestions. I'm not that familiar with Parquet and the kinds of data it can handle. The JSON files that come from Waze every 2 minutes have a hierarchy and redundant data and to use the data for analysis you have query across multiple files. There is also a need to backload old files, and these have to be checked for duplicate data before loading. I have used Athena to test and create a schema around them but it seemed to not be efficient enough.

Maybe I can post a query we are doing now with Power BI to an internal DB (we'll be migrating it to the cloud soon) so you can see the types of things we need to run. And it seems like with an RDS you can more easily build integrations like APIs, add enriched fields and FK to other data tables, and query using standard DB desktop tools, but maybe that's me speaking from my RDS world view. @jrstill might know more.

The main cost right now is the DB (the S3 storage cost is minimal) since it always doing something every 1-2 minutes. A 1 day realtime DB and multi-year historic serverless DB combo makes the most sense to me to improve this setup and reduce cost, if sticking with an RDS setup.

@schnuerle
Copy link
Contributor Author

We are pushing this off to later in the year, because Aurora Serverless Postgres is not available until later in 2018. https://aws.amazon.com/blogs/aws/in-the-works-amazon-aurora-serverless/

Also right now we are running a large instance, which is as small as Aurora Postgres goes now. There is talk of making a medium and/or small option, but until that happens, running a large real-time RDS won't reduce the costs at all. It may be possible to switch to a MySQL instance for the RT, which does support medium and small, but we'd have to change how some of our data is stored I think (like JSON chunks).

@sinacek
Copy link

sinacek commented Sep 24, 2018

Do you need AWS Aurora? I manually switched to RDS PostgreSQL (db.t2.medium) and it works and its much cheaper (reduced from $10 to $2 per day).

@schnuerle
Copy link
Contributor Author

@sinacek nice, thanks for trying and the suggestion. How does everything work? Can it process the data coming in alright? Get any DB errors?

The nice thing about non Aurora is you can shrink the DB down to save costs, then up when you are doing bigger analysis/extracts. @jrstill what do you think about this?

@sinacek
Copy link

sinacek commented Sep 28, 2018

It works fine without any error. For now I have just updated lambda functions manually, because I haven't experience with terraform.

@schnuerle
Copy link
Contributor Author

@sinacek can you specify what changes you made to your lambda functions to get this to work? Can you provide instructions for those who want to try this?

@sinacek
Copy link

sinacek commented Aug 12, 2019

@schnuerle Sorry I'm not able found what I have changed year ago :/

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Future Effort Help Wanted Good items to start with if you are looking to help with the project Phase 1 RDS End to end data processor with hooks and alarms
Projects
Development

No branches or pull requests

4 participants
@jedsundwall @sinacek @schnuerle and others