
WithStreamingInputTable() #8

Merged: 6 commits merged into master from evan/withStreamingInputTable on Dec 28, 2019

Conversation

@EvanBoyle (Owner) commented Dec 28, 2019

Large refactor. I've created a fluent API for adding tables that I think makes more sense. The DW itself only creates the necessary buckets and top level glue database upon construction. Subsequent fluent withTable or withStreamingInputTable calls actually create glue tables and store them within the component. This is congruent with the idea that you'd like to create a single database with multiple tables.

Some more detail on what's included in this review:

  1. Fluent .withTable() API. This is the escape hatch that allows a user to create a table of any definition and then populate the underlying S3 data in some custom manner. We still need to update the API at some point to support formats other than parquet (json, csv, tsv), but I've added a TODO for now.
  2. Fluent .withStreamingInputTable() API. This creates the glue table, kinesis stream, parquet firehose stream, and partition registrar, and just works (see the sketch after this list). This is the core use case I was targeting when I thought about building this library. Calling this API gives you an endpoint that you can send records to, and the records automatically end up queryable via athena, partitioned by inserted_at.
  3. Support for multiple tables, exercised in the example. There were lots of places in the code where things were hardcoded, either to the logs table name or to pulumi resource names (which must be unique if you want to create multiple instances). Now the example creates impressions and clicks tables that share the same schema.
  4. I added a TODO and method stub for withBatchInputTable(). I think we should have a batch counterpart to the streaming input. I can see this being useful for cases where you aggregate streaming events on an hourly basis into higher-level statistics tables. I'll create an issue with more details on what this might look like.
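
Here is a minimal sketch of how the fluent API reads end to end. The ServerlessDataWarehouse class name, the import path, and the columns shape are placeholders assumed for illustration; only withTable and withStreamingInputTable come from this change.

import { ServerlessDataWarehouse } from "./datawarehouse"; // hypothetical class name and path

const columns = [
    { name: "event_type", type: "string" },
    { name: "session_id", type: "string" },
];

// Construction only provisions the shared buckets and the top-level glue database.
const dw = new ServerlessDataWarehouse("analytics_dw")
    // Escape hatch: define the table, then populate its S3 data in some custom manner.
    .withTable("impressions", { columns })
    // Full streaming path: glue table + kinesis stream + parquet firehose stream
    // + hourly partition registrar, partitioned by inserted_at.
    .withStreamingInputTable("clicks", { columns });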

One thing that became painfully obvious to me during this process is that we need an integration test. My current process for making changes is tearing down my stack (which requires manually deleting all of the data from the S3 buckets first), recreating it, and then issuing an athena query to make sure I get data back from both tables. We could definitely write a script that automates this; a rough sketch follows.
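
A rough sketch of what that automated check could look like, assuming the aws-sdk v2 Athena client; the database name, table names, and results bucket below are placeholders.

import * as AWS from "aws-sdk";

const athena = new AWS.Athena({ region: "us-west-2" });

async function assertTableHasRows(database: string, table: string, resultsLocation: string): Promise<void> {
    const start = await athena.startQueryExecution({
        QueryString: `SELECT count(*) AS c FROM ${table}`,
        QueryExecutionContext: { Database: database },
        ResultConfiguration: { OutputLocation: resultsLocation },
    }).promise();
    const id = start.QueryExecutionId!;

    // Poll until the query reaches a terminal state.
    let state = "QUEUED";
    while (state === "QUEUED" || state === "RUNNING") {
        await new Promise((resolve) => setTimeout(resolve, 2000));
        const exec = await athena.getQueryExecution({ QueryExecutionId: id }).promise();
        state = exec.QueryExecution!.Status!.State!;
    }
    if (state !== "SUCCEEDED") {
        throw new Error(`query against ${table} ended in state ${state}`);
    }

    const results = await athena.getQueryResults({ QueryExecutionId: id }).promise();
    // Row 0 is the header row; row 1 holds the count.
    const count = Number(results.ResultSet!.Rows![1].Data![0].VarCharValue);
    if (count < 1) {
        throw new Error(`expected rows in ${table}, found ${count}`);
    }
}

// e.g. after `pulumi up` on the example stack:
// await assertTableHasRows("analytics_dw", "impressions", "s3://<athena-results-bucket>/");
// await assertTableHasRows("analytics_dw", "clicks", "s3://<athena-results-bucket>/");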

@EvanBoyle changed the title from Evan/with streaming input table to WithStreamingInputTable() on Dec 28, 2019
@jmaysrowland (Collaborator) left a comment

What are your thoughts about adding testing for the features?

It's something I've been thinking about for a while, and I honestly don't know how we would go about it.
We could break the code up into smaller components and test the ones that are easier to test. But the ones that actually call AWS, we wouldn't really be able to test. Right?

Refactor looks good.

import { getS3Location } from "../../utils";
import { createPartitionDDLStatement } from "./partitionHelper";

export class HourlyPartitionRegistrar extends pulumi.ComponentResource {
@jmaysrowland (Collaborator) commented:

Do we see a case for not having hourly buckets? Maybe daily or every minute?

@EvanBoyle (Owner, Author) replied:

Yeah, I can definitely see some customization here for other use cases. For this one we specifically care about hourly, as that is all that kinesis firehose supports for its output folders. But it might make sense to refactor or create something new for registering the output for batch tables.
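
For reference, a hedged sketch of what an hourly partition registration statement could look like. The real createPartitionDDLStatement imported above isn't shown in this diff, so the function name, arguments, and folder layout here are assumptions for illustration only.

// Sketch: assumes firehose writes parquet into hourly YYYY/MM/DD/HH folders
// and that the glue table is partitioned by inserted_at.
function hourlyAddPartitionDDL(table: string, s3Location: string, date: Date): string {
    const yyyy = date.getUTCFullYear();
    const mm = String(date.getUTCMonth() + 1).padStart(2, "0");
    const dd = String(date.getUTCDate()).padStart(2, "0");
    const hh = String(date.getUTCHours()).padStart(2, "0");
    return `ALTER TABLE ${table} ADD IF NOT EXISTS
        PARTITION (inserted_at = '${yyyy}/${mm}/${dd}/${hh}')
        LOCATION '${s3Location}/${yyyy}/${mm}/${dd}/${hh}/';`;
}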

athenaResultsBucket: aws.s3.Bucket;
database: aws.glue.CatalogDatabase;
region: string;
scheduleExpression?: string; // TODO: we should remove this. It's useful in active development, but users would probably never bother.
@jmaysrowland (Collaborator) commented:

Do we see a case for not having hourly buckets? Maybe daily or every minute?

If not, I can definitely see the benefit of taking this out.

@EvanBoyle (Owner, Author) replied:

This was more about being able to set a 1-minute interval during development when we're manually testing things, just so we don't have to wait an hour to see results between changes.
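
A minimal usage sketch of that development override, assuming the registrar takes a standard Pulumi (name, args) constructor. Only the argument names come from the excerpt above; "rate(1 minute)" is a standard CloudWatch Events schedule expression.

// Sketch: shorten the registrar's schedule while iterating locally.
const registrar = new HourlyPartitionRegistrar("clicks-partitions", {
    athenaResultsBucket,
    database,
    region: "us-west-2",
    scheduleExpression: "rate(1 minute)", // the default would run hourly
});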

@jmaysrowland jmaysrowland merged commit 1ec10b5 into master Dec 28, 2019
@jmaysrowland jmaysrowland deleted the evan/withStreamingInputTable branch December 28, 2019 07:34