feat(batch-exports): Write initial batch export data to internal S3 stage prior to exporting #32594
Conversation
I mean, we have to read it again from S3 instead of CH. I don't see how we can continue from where we left off with S3 (without reading the entire file again). In fact, we had that feature already with CH (as with CH we can query from a new timestamp), but it had to be removed for some reason. EDIT: I guess what I am asking is, could you clarify this?
Are you aware if this is even possible? Concretely: Can we impact memory consumption by controlling the number of partitions when exporting to S3? EDIT: If anything, I would assume that more partitions = more parallelization = more resources can be used to gain more speed. So, we gain speed but in fact use up more memory.
```python
# TODO - remove this once testing over
# need to wait for query info to become available in system.query_log
await asyncio.sleep(5)
memory_usage = await client.read_query(
    f"SELECT formatReadableSize(memory_usage) as memory_used FROM system.query_log WHERE query_id = '{query_id}' AND type='QueryFinish' ORDER BY event_time DESC LIMIT 1",
)
await logger.ainfo(f"Query memory usage = {memory_usage.decode('utf-8').strip()}")
```
Maybe consider putting this in a background task instead so as to not block the main thread? I know we are testing but still: What if it takes longer than 5 seconds?
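For illustration, a minimal sketch of that background-task idea (the helper name is hypothetical; `client`, `logger` and `query_id` come from the surrounding code):

```python
import asyncio


async def log_query_memory_usage(client, logger, query_id: str) -> None:
    # Debug-only helper: wait for system.query_log to be populated, then log
    # how much memory the export query used.
    await asyncio.sleep(5)
    memory_usage = await client.read_query(
        f"SELECT formatReadableSize(memory_usage) AS memory_used "
        f"FROM system.query_log "
        f"WHERE query_id = '{query_id}' AND type = 'QueryFinish' "
        f"ORDER BY event_time DESC LIMIT 1",
    )
    await logger.ainfo(f"Query memory usage = {memory_usage.decode('utf-8').strip()}")


# Fire-and-forget so the main export path isn't blocked (keep a reference so
# the task isn't garbage collected before it finishes):
# task = asyncio.create_task(log_query_memory_usage(client, logger, query_id))
```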
I think I can probably just remove this now
yeah sorry, I meant we can continue from the 2nd activity (reading from S3) rather than needing to read the data again from ClickHouse. We would indeed need to find a better way to handle failures (for example, storing multi-part upload progress in the DB).
```sql
    '$s3_secret',
    'Parquet'
)
PARTITION BY rand() %% 10
```
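For context, the fragment above is the tail of a partitioned `INSERT INTO FUNCTION s3(...)` statement. A hedged sketch of what the full query presumably looks like (the URL, `$`-placeholders and `SELECT` body are illustrative, not the exact query in this PR):

```python
# Sketch only: ClickHouse writes one Parquet file per partition, substituting
# `{_partition_id}` in the object key; `%%` is presumably an escaped literal `%`
# for the query client's parameter formatting.
PARTITIONED_EXPORT_QUERY = """
INSERT INTO FUNCTION s3(
    '$s3_url/export_{_partition_id}.parquet',
    '$s3_key',
    '$s3_secret',
    'Parquet'
)
PARTITION BY rand() %% 10
SELECT *
FROM events
WHERE team_id = $team_id
  AND _inserted_at >= $data_interval_start
  AND _inserted_at < $data_interval_end
"""
```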
I haven't read that far so maybe this is covered later, if so just tell me to look it up later:
This partitioning scheme doesn't consider data size. We could end up with a 100 row export (very light) partitioned into up to 10 files (assuming a uniform distribution, each one should be expected to have 10 rows).
Why did we choose 10? Are we considering that these 10 files will need to be joined together later?
I found my solution for one of the two questions: We are letting arrow deal with multiple files.
Interestingly, whereas partitioning will result in faster writes from ClickHouse, it will result in slower reads on our side, as we HAVE to read all the files.
Not sure where to position ourselves with this tradeoff, I think we'll have to experiment.
Good questions!
10 is just an arbitrary value I picked for now - as mentioned, I want to do some more testing of this to see how it affects performance and memory usage.
Initially, I was thinking of keeping the number of partitions constant in order to make the load on ClickHouse as predictable as possible, which I think is more important than file size.
We could try partitioning based on max file size but then if there's a lot of data we could be trying to write say 100 partitions at once, which I presume would consume a lot of memory in CH (again, just an assumption at this point). I suppose we could make it dynamic with an upper limit but at this point I'm not sure if there is much benefit or not.
I chose to use pyarrow.datasets to read in the data from S3. I have not worked with it before but it sounds like a very performant way of reading in data from S3 which could be contained in multiple files, and also has the benefit of working with our existing code (our Consumer expects a queue of RecordBatches).
If we wanted to copy the data directly from our own S3 to the customer's S3 then I agree, we would probably want to control the creation of these files but for now I think it's better to keep the implementation generic so it can be used across all destinations.
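For reference, a minimal sketch of what reading the staged Parquet files back with `pyarrow.dataset` could look like (the credentials, endpoint and prefix layout are placeholders, not the actual values or code in this PR):

```python
import pyarrow.dataset as ds
from pyarrow import fs

# Placeholder credentials/endpoint; the real values would come from settings.
s3 = fs.S3FileSystem(
    access_key="object-storage-key",
    secret_key="object-storage-secret",
    endpoint_override="http://localhost:19000",
)

# All Parquet partitions written by ClickHouse for this batch export run
# (the prefix layout here is hypothetical).
dataset = ds.dataset(
    "posthog/batch-exports/some-run-id/",
    format="parquet",
    filesystem=s3,
)

# Stream RecordBatches, which is what the existing Consumer expects on its queue.
for batch in dataset.to_batches():
    print(batch.num_rows)  # stand-in for handing the batch to the Consumer
```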
> If we wanted to copy the data directly from our own S3 to the customer's S3 then I agree
Nah, I don't think I would want to go that way personally. I think keeping the extra level of indirection is valuable even for S3.
Maaaaaaybe later down the line as a super-optimization for some really time sensitive exports, but otherwise no.
> I chose to use pyarrow.datasets to read in the data from S3.
Yeah, let's talk about this one. I've pinged you.
> Initially, I was thinking of keeping the number of partitions constant in order to make the load on ClickHouse as predictable as possible, which I think is more important than file size.
Keep in mind that we will now have to deal with a new kind of memory pressure: Pod memory pressure. I think this is also impacted by the choice of parquet format. I've brought this up in the thread I've opened up with you too.
I still need to test this out. This is just an assumption from what I read in the docs:
I assume writing to multiple files at once would be a lot faster, but at the cost of higher memory usage; I still need to test it though. I could also try playing around with different file formats & compression.
Yeah, I read the same docs and arrived at the same assumption:
I just asked because the PR calls out a potential benefit to memory usage, whereas in fact the benefit would be in speed at the cost of memory usage:
EDIT: As we can't have negative partitions of course!
```python
class ProducerFromInternalS3Stage:
    """
    This is an alternative implementation of the `spmc.Producer` class that reads data from the internal S3 staging area.
```
Seems like we could inherit from spmc.Producer
Maybe fine to do this later though
The method signatures are different so not sure it would help that much to inherit, apart from documentation. Also, if we decide to migrate to this new architecture for all teams then we'll no longer need the old Producer anyway
```python
# Read in batches
try:
    for batch in dataset.to_batches():
```
This looks very blocking
I think we will need to extend asyncpa to support this.
Hmm yeah, internally threads are used, we may not be able to use async python at all without changing that
Just pushed a commit to fix this: edb36d7 (#32594)
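(For anyone following along: one way to keep the event loop responsive here, which is not necessarily what the commit above does, is to drive the blocking scanner from a worker thread and hand batches back through a queue. A rough sketch:)

```python
import asyncio

import pyarrow.dataset as ds


async def stream_batches(dataset: ds.Dataset, queue: asyncio.Queue) -> None:
    # Iterate the blocking pyarrow scanner off the event loop thread and feed
    # RecordBatches back through an asyncio.Queue (sketch only).
    loop = asyncio.get_running_loop()

    def produce() -> None:
        # dataset.to_batches() blocks while reading from S3, so it runs in a
        # worker thread; a bounded queue gives the consumer backpressure.
        for batch in dataset.to_batches():
            asyncio.run_coroutine_threadsafe(queue.put(batch), loop).result()

    await asyncio.to_thread(produce)
    await queue.put(None)  # sentinel so the consumer knows we're done
```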
Force-pushed from 18b1579 to edb36d7.
PR Summary
This PR introduces a two-stage batch export process that first writes data to an internal S3 staging area before exporting to the final destination. The change aims to improve resource management and error recovery by decoupling data reading from ClickHouse and writing operations.
- Adds new `pre_export_stage.py` module implementing the core S3 staging functionality with configurable partitioning for performance optimization
- Introduces feature flag `BATCH_EXPORT_USE_INTERNAL_S3_STAGE_TEAM_IDS` to gradually roll out the feature to specific teams
- Improves memory usage and query performance through configurable partitioning (5 partitions by default) when writing to S3
- Adds ability to resume failed exports from S3 stage without re-querying ClickHouse, reducing database load
- Maintains data copies in S3 staging for debugging purposes while properly cleaning up after successful exports
14 file(s) reviewed, 5 comment(s)
```python
BATCH_EXPORT_USE_INTERNAL_S3_STAGE_TEAM_IDS: list[str] = get_list(
    os.getenv("BATCH_EXPORT_USE_INTERNAL_S3_STAGE_TEAM_IDS", "")
)
BATCH_EXPORT_INTERNAL_STAGING_BUCKET: str = os.getenv("BATCH_EXPORT_INTERNAL_STAGING_BUCKET", "posthog")
```
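Presumably the rollout check then looks something like the following (a hypothetical helper, not necessarily how the PR gates the feature):

```python
from django.conf import settings


def use_internal_s3_stage(team_id: int) -> bool:
    # Hypothetical gating helper: the staging path is enabled only for teams
    # explicitly listed in the environment variable.
    return str(team_id) in settings.BATCH_EXPORT_USE_INTERNAL_S3_STAGE_TEAM_IDS
```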
style: Default bucket name 'posthog' is too generic and could conflict with existing buckets. Consider a more specific default like 'posthog-batch-export-staging'
this is mainly for the local environment where a posthog bucket is used elsewhere
```python
    ],
    select_from=ast.JoinExpr(table=ast.Field(chain=["sessions"])),
    order_by=[ast.OrderExpr(expr=ast.Field(chain=["_inserted_at"]), order="ASC")],
    # TODO: Add log_comment
```
style: TODO comment about adding log_comment should be removed since log_comment has been implemented in all new queries
```diff
- # TODO: Add log_comment
+ # log_comment is used in all export queries
```
it hasn't been implemented for sessions yet
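For reference, `log_comment` is a per-query ClickHouse setting that tags rows in `system.query_log`. A hedged sketch of the raw SQL form (the sessions query in this PR is built via the HogQL AST, so it would be attached differently, and the metadata fields here are purely illustrative):

```python
import json

# Illustrative metadata only; the fields the export queries actually attach may differ.
team_id, batch_export_id = 1, "some-export-id"  # placeholders
log_comment = json.dumps({"team_id": team_id, "batch_export_id": batch_export_id})

sessions_query = f"""
SELECT *
FROM sessions
ORDER BY _inserted_at ASC
SETTINGS log_comment = '{log_comment}'
"""
```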
```python
# TODO - remove this once testing over
# need to wait for query info to become available in system.query_log
await asyncio.sleep(5)
memory_usage = await client.read_query(
    f"SELECT formatReadableSize(memory_usage) as memory_used FROM system.query_log WHERE query_id = '{query_id}' AND type='QueryFinish' ORDER BY event_time DESC LIMIT 1",
)
await self.logger.ainfo(f"Query memory usage = {memory_usage.decode('utf-8').strip()}")
```
style: Remove debug logging block before production deployment
```python
is_backfill: bool = False
# TODO - pass these in to all inherited classes
batch_export_id: str | None = None
destination_default_fields: list[BatchExportField] | None = None
```
Is BatchExportField serializable? Probably since it's a dict I guess...
```python
# TODO - should we use our own set of env vars for this?
# TODO - check these are available in production workers
aws_access_key_id=settings.OBJECT_STORAGE_ACCESS_KEY_ID,
aws_secret_access_key=settings.OBJECT_STORAGE_SECRET_ACCESS_KEY,
endpoint_url=settings.OBJECT_STORAGE_ENDPOINT,
region_name=settings.OBJECT_STORAGE_REGION,
```
I am not familiar with these env variables, so I think we should look up what they are used for, to make sure they don't get changed from under us.
May be easier to just use our own.
yes, I've confirmed we'll be using our own bucket with existing credentials
will update this
```sql
ORDER BY
    _inserted_at, event
```
We can get rid of this.
I think we'll need to test this with some real data to make the call. Initially, I have high hopes. Let's just get rid of the ordering constraints before shipping.
Force-pushed from f2270a8 to 4cf4bcf.
Force-pushed from 8d8b715 to 7b1d3e5.
feat(batch-exports): Write initial batch export data to internal S3 stage prior to exporting (#32594)
Problem
Batch export queries can consume a considerable amount of ClickHouse resources, particularly when exporting a large amount of data.
Changes
Here, we use an intermediate S3 stage, meaning we export the data directly from ClickHouse to our internal S3 bucket, before then exporting the data from S3 to the final destination.
By separating the reading of data from ClickHouse from the writing of data to the destination, we get some benefits:
- we can retry failed exports from the S3 stage without having to re-query ClickHouse, reducing the load on the database
- the load we place on ClickHouse becomes more predictable
- we keep a copy of the data in the S3 stage for debugging purposes (and clean it up after a successful export)
Initial version
Note that this is just an initial MVP, and needs thorough testing before rolling it out to all users. Therefore, it will only be enabled for certain teams using an environment variable.
The code could definitely be made nicer. I'm also open to suggestions on structure/organization: I've put most of the new code in a new `posthog/temporal/batch_exports/pre_export_stage.py` module to keep it separate, but am happy to put it somewhere else.
Limitations
Questions
TODO
- Add `log_comment` to queries
Did you write or update any docs for this change?
Docs will need updating once this is live for all users.
How did you test this code?
Local testing to ensure data exported looks correct.
Have added some test cases which use the new activities and assert the data is the same as before (although it seems to be in a different order, so I've updated the tests to not care about the ordering of exported events, since I don't think that's something we guarantee anyway).
Performance testing
We can improve performance by varying the number of file partitions we use when writing to S3 from ClickHouse.
Query performance for exporting 100k events locally: