Skip to content
This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

Overcoming memory issues by configurable repartition step #21

Open
RickardCardell opened this issue Jun 15, 2020 · 0 comments
Open

Overcoming memory issues by configurable repartition step #21

RickardCardell opened this issue Jun 15, 2020 · 0 comments

Comments

@RickardCardell
Copy link

Hi
We got some S3 access logs stored in S3 and we tried to use this lib but we were unable to make the job run on that dataset: 25GB s3 access logs/day for 30days.

I've tried with:

150 standard DPUs
100 G.1X
50 G.2X
all with many combinations of memory settings to no avail.

I instead went to the code and skipped the repartition stage: https://github.com/awslabs/athena-glue-service-logs/blob/master/athena_glue_service_logs/converter.py#L66
I also had to add spark.hadoop.fs.s3.maxRetries=20 since it now makes quite a lot S3 calls which caused throttling.

The job succeeded with 100 'standard' workers after only 4hours.
The drawback is of course that more objects were created: between 50-140 per day-partition. For smaller datasets the amount of files are higher: some thousands.

But for us at least it is better to have the jobs succeeding, than having no log data at all. Also, for our use case, the athena query performance will be good enough.

Would it make sense to make the repartitioning step configurable? I.e being able to skip it.
I can foresee that someone will mention the option to use coalesce instead of repartition. I have tried that already, and that only failed as well.

Another option is to have a (separate?) step that reduces the number of objects but more efficiently.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant