Overcoming memory issues by configurable repartition step #21

RickardCardell · 2020-06-15T11:13:54Z

Hi,
We have S3 access logs stored in S3 and tried to use this library, but we were unable to get the job to run on our dataset: 25 GB of S3 access logs per day for 30 days.

I've tried with:

150 standard DPUs
100 G.1X
50 G.2X
each with many combinations of memory settings, all to no avail.

I instead went into the code and skipped the repartition stage: https://github.com/awslabs/athena-glue-service-logs/blob/master/athena_glue_service_logs/converter.py#L66
I also had to set spark.hadoop.fs.s3.maxRetries=20, since the job now makes quite a lot of S3 calls, which caused throttling.
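For reference, one way such a setting could be applied in code is sketched below. This is an assumption about the mechanism, not a description of how it was set in the job above: the helper name `build_spark_conf` is hypothetical, and only the key and value come from this comment.

```python
# Key and value taken from the comment above; everything else is illustrative.
S3_RETRY_SETTINGS = {"spark.hadoop.fs.s3.maxRetries": "20"}


def build_spark_conf(settings=None):
    """Return a SparkConf carrying the given settings (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without PySpark installed.
    from pyspark import SparkConf

    conf = SparkConf()
    conf.setAll(list((settings or S3_RETRY_SETTINGS).items()))
    return conf
```

The resulting conf would need to be passed when the SparkContext is first created, for example via `SparkContext.getOrCreate(conf)`, for the setting to take effect.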

The job succeeded with 100 'standard' workers after only 4 hours.
The drawback is of course that more objects are created: between 50 and 140 per day partition. For smaller datasets the number of files is higher: some thousands.

But for us at least it is better to have the jobs succeed than to have no log data at all. Also, for our use case, the Athena query performance will be good enough.

Would it make sense to make the repartitioning step configurable, i.e. being able to skip it?
I can foresee that someone will mention the option to use coalesce instead of repartition. I have already tried that, and it failed as well.
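A minimal sketch of what such a switch might look like, assuming the job receives its options as command-line style arguments. The `--skip-repartition` flag and both function names are hypothetical and do not exist in the library today; `data_frame` stands for any object with a Spark-style `repartition(*cols)` method.

```python
import argparse


def parse_options(argv):
    """Parse a hypothetical --skip-repartition flag (not an existing option)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--skip-repartition", action="store_true")
    # parse_known_args leaves the job's other arguments untouched.
    options, _unknown = parser.parse_known_args(argv)
    return options


def maybe_repartition(data_frame, partition_columns, skip):
    """Repartition by the given columns unless skipping was requested."""
    if skip:
        # Keep the existing partitioning: no shuffle, but more output objects.
        return data_frame
    return data_frame.repartition(*partition_columns)
```

With the flag set, the data frame is written as-is, which trades a larger number of output objects for avoiding the shuffle.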

Another option is to have a (separate?) step that reduces the number of objects, but more efficiently.
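One possible shape for such a step, sketched under several assumptions: the converted output is Parquet, compaction runs on one day partition at a time, and the partition's total size in bytes is already known. The function names and the 128 MiB target are illustrative choices, not something the library provides, and this has not been tried on the dataset described above.

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # assumed target object size


def target_file_count(total_bytes, target_bytes=TARGET_FILE_BYTES):
    """Number of output files so that each holds roughly target_bytes at most."""
    return max(1, math.ceil(total_bytes / target_bytes))


def compact_partition(spark, source_path, dest_path, total_bytes):
    """Rewrite one partition's Parquet files into fewer, larger files."""
    num_files = target_file_count(total_bytes)
    # coalesce merges existing partitions without a full shuffle.
    (spark.read.parquet(source_path)
          .coalesce(num_files)
          .write.mode("overwrite")
          .parquet(dest_path))
    return num_files
```

`dest_path` should differ from `source_path`, since overwriting a location that is being read from in the same job fails. Working on a single day partition keeps each run far smaller than the full 30-day job.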


RickardCardell mentioned this issue Jun 15, 2020

Error: Container is running beyond physical memory limits #19

Closed
