Adaptive chunksize test implementation #544
Conversation
Force-pushed from d0a56a1 to fbd8d2f
The `Filemeta.chunks` generator is changed so that it can receive an updated chunksize. E.g., to get the next chunks of size `chunksize` and `2 * chunksize`:

```python
chunks = _chunk_generator(fileset, metadata_fetcher, metadata_cache,
                          chunksize, align_clusters, maxchunks,
                          dynamic_chunksize=True)
...
next(chunks)  # move to next yield
chunk = chunks.send(chunksize)
...
next(chunks)  # move to next yield
chunk = chunks.send(2 * chunksize)
```

For this to work, preprocessing was moved to its own function that is called inside `run_uproot_job`.
Force-pushed from fbd8d2f to a1df7d2
This looks really nice - we should think about whether there is a nice way to bring this to the other executors, since it would be convenient to have this available where it makes sense. I'm just not sure off the top of my head. Happy to continue the discussion. I'll merge this if you guys are OK with it?
@lgray: Sounds very good! Something that would make it easier to adapt to other executors is an easier way to know the total number of events. It is still not clear to me how `maxchunks` works without doing the preprocessing step. (Also, maybe it does not make sense to use a dynamic chunksize with `maxchunks`? Is `maxchunks` used only for testing purposes? Along the same lines, `align_clusters` probably does not make much sense either, as chunks will differ by design.) One thing that is not in the current PR is that it may make more sense to have a chunksize per dataset, rather than per workflow.
Yeah, `maxchunks`/`align_clusters` + dynamic chunking should produce an error; it's really easy to get undefined behavior out of that combination. @nsmith- ?
Sorry for the delay in reviewing this. Thanks for the nice addition.
One issue with a dynamic chunksize is that if we are using a column caching solution, it is more likely to spoil the cache, since caching is done by chunk. To see what I mean, take a look at the way `persistent_cache` behaves with `NanoEventsFactory` on a local file with various choices of `entry_stop`. On the other hand, if you keep to a fixed set of sizes as you do (powers of two; by the way, why the `base_chunksize`? Just sticking to powers of two would give the same flexibility, and you can set the start to `2**10` instead of 1000), then perhaps the column cache can be intelligent enough to know how to subdivide if requested.
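To make the fixed-set-of-sizes idea concrete, snapping a chunksize to a power-of-two boundary can be done with a small helper like this (a hypothetical sketch, not coffea's actual API):

```python
def round_up_to_power_of_two(n: int) -> int:
    # Round a proposed chunksize up to the nearest power of two, so a
    # column cache only ever sees a small, fixed set of chunk sizes.
    return 1 << max(0, (n - 1).bit_length())

print(round_up_to_power_of_two(1000))  # 1024
print(round_up_to_power_of_two(2048))  # 2048
```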
> It is still not clear to me how maxchunks works without doing the preprocessing step
As you guessed, `maxchunks` is mainly used as a testing device. It simply opens files sequentially in the current thread of execution until it reaches enough chunks, and then sends the list to the executor. In a system where the pre-executor and executor are both operating as streams, we could improve this logic to allow parallel execution of the preprocessing stage while keeping the `maxchunks` feature. Also, the motivation for sharing the same executor abstraction between preprocessing and processing is not so strong, and it might make sense to specialize the file preprocessing anyway. The results of preprocessing are also better kept in a local database than recomputed each time, but our current setup doesn't really accommodate this in a user-friendly way.
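The sequential behavior described above can be sketched roughly as follows (names are illustrative, not coffea's actual preprocessing code):

```python
def gather_chunks(chunk_iters, maxchunks):
    # Walk per-file chunk iterators sequentially, collecting chunks
    # until maxchunks is reached, then hand the list to the executor.
    chunks = []
    for it in chunk_iters:
        for chunk in it:
            chunks.append(chunk)
            if len(chunks) >= maxchunks:
                return chunks
    return chunks

# Example: two files yielding (entrystart, entrystop) chunk ranges
files = [iter([(0, 100), (100, 200)]), iter([(0, 50)])]
print(gather_chunks(files, 2))  # [(0, 100), (100, 200)]
```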
@nsmith- Thanks for your comments! I've changed it so that the chunksizes are always powers of 2.
@nsmith- you happy?
🚀
This is an idea we have been playing with to adapt the chunksize as the workflow progresses. Instead of relying on trial and error, the objective is to let the executors choose a correct chunksize given a desired memory usage or a desired runtime per task. (Currently only the runtime target is implemented, and only for the `work_queue_executor`.)
To achieve this, we let the `WorkItem` generator reach the executor as is; previously, `run_uproot_job` would collect the generator into a list. This also meant factoring the preprocessing code into its own function outside `run_uproot_job`, and filtering bad files out of the fileset (if needed) before it reaches the generator.
The main complication is how to control the loops. Since the number of chunks is not known beforehand, we control the loop by counting the number of events. However, this cannot be done directly with the generator, as it would be consumed before reaching the executor. Therefore, we count the events directly from the filemeta objects, but this may not work well when using `align_clusters` or `maxchunks`. In the current implementation, dynamic chunksize cannot be used with `align_clusters` or `maxchunks`.
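A minimal sketch of that loop control, assuming hypothetical filemeta records that expose an event count (field names are illustrative):

```python
def total_events(filemetas):
    # Sum event counts from the filemeta records directly, leaving the
    # chunk generator unconsumed so it can still reach the executor.
    return sum(meta["numentries"] for meta in filemetas)

filemetas = [{"numentries": 1000}, {"numentries": 2500}]
print(total_events(filemetas))  # 3500

# The executor loop can then terminate once a processed-event counter
# reaches total_events(filemetas), instead of counting chunks.
```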
Another complication is that since chunks are accounted for by their number of events, and not simply by how many there are, the executor needs to know how to accumulate the items; otherwise the executor cannot be used unchanged for preprocessing. This could be solved by defining `len(file) = 1` for preprocessing and `len(WorkItem) = number of events` for processing. In the current implementation we only did this for processing.
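The `len`-based accounting could look something like this (a hypothetical minimal `WorkItem`; the real class carries more fields):

```python
from dataclasses import dataclass

@dataclass
class WorkItem:
    # Illustrative sketch of a work item spanning an event range.
    dataset: str
    entrystart: int
    entrystop: int

    def __len__(self):
        # Account for progress by events processed, not by item count,
        # so variable-size chunks are weighted correctly.
        return self.entrystop - self.entrystart

item = WorkItem("ttbar", 0, 2048)
print(len(item))  # 2048
```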
The chunksize itself is computed with a linear regression, rounded up to a power of two of the base chunksize. Some variation is added to this value to gather sampling data for the regression.
If this sounds interesting to you, please let us know how we can make this change more amenable to the other executors.
Thanks!