
Allow compute to return a generator instead of chunks #751

Closed
WenzDaniel wants to merge 7 commits into master from add_chunk_yielding_for_fuse

Conversation

WenzDaniel
Collaborator

@WenzDaniel WenzDaniel commented Aug 24, 2023

What is the problem / what does the code in this PR do
A slight modification that allows a plugin to yield chunks instead of returning them.

This makes it possible to reduce the chunk size while creating new data. It is not needed for normal data processing, but it is for simulations. A simulation starts with a small list of photons which are then turned into pulses and fragments, where each photon then takes 110 x 16 bit of data. In such a case it is helpful to yield multiple smaller chunks for a single larger input chunk.

Since the chunking needs to be done while computing, the plugin's compute method needs to yield the data early. An example plugin can be found in testutils; a rough sketch of the pattern is also shown below.

This change will help fuse avoid the out-of-memory issues we were facing with wfsim, leading to more reliable and stable performance.
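Loosely, the pattern looks like the following sketch. This is not the actual testutils plugin; the data-type names, the dtype, and the `_photons_to_records` helper are made up for illustration, and only the "yield a chunk per slice instead of returning one array" idea matters here.

```python
import numpy as np
import strax


class YieldingExamplePlugin(strax.Plugin):
    """Illustrative sketch of a plugin whose compute yields several
    small chunks for a single input chunk."""
    depends_on = ('photons',)         # hypothetical input data kind
    provides = 'simulated_records'    # hypothetical output data type
    dtype = strax.raw_record_dtype()  # assumed output dtype

    def compute(self, photons, start, end):
        # Split the input time range into a few slices and yield the
        # result for each slice as its own chunk, instead of building
        # one large array and returning it at the end.
        n_slices = 4
        edges = np.linspace(start, end, n_slices + 1).astype(np.int64)
        for t0, t1 in zip(edges[:-1], edges[1:]):
            mask = (photons['time'] >= t0) & (photons['time'] < t1)
            records = self._photons_to_records(photons[mask])  # hypothetical helper
            yield self.chunk(start=t0, end=t1, data=records)
```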

@coveralls

coveralls commented Aug 24, 2023

Coverage Status

coverage: 91.517% (+0.08%) from 91.439% when pulling bc4e4eb on add_chunk_yielding_for_fuse into 5aba752 on master.

@WenzDaniel WenzDaniel marked this pull request as ready for review August 25, 2023 15:15
@WenzDaniel
Collaborator Author

WenzDaniel commented Aug 25, 2023

I won't be able to fix the CodeFactor complaint about a too-complex method, so I would propose to merge without fixing it.

@jmosbacher
Contributor

This is essentially just online rechunking, right?
In that case, wouldn't it be better to have the context control the online rechunking via some other setting, such as max_chunk_length?
Why is control over the re-chunking ceded to the plugin's compute method?
I feel very uncomfortable about this PR. Maybe we can have an in-depth discussion about it in team A?

@dachengx
Collaborator

I guess that, judging from the logic of Plugin.compute, the memory overflow happens when the resulting chunk is very large, and the new functionality splits the result into pieces.

@WenzDaniel
Collaborator Author

WenzDaniel commented Aug 28, 2023

This is essentially just online rechunking, right?

Yes.

In that case, wouldn't it be better to have the context control the online rechunking via some other setting, such as max_chunk_length?
Why is control over the re-chunking ceded to the plugin's compute method?

Hmm, I am not sure this will work. As Dacheng pointed out, the overflow happens while performing the tasks inside the compute method, so compute must yield early. For example, in the case of wfsim and fuse we transform information in the form of photons detected by a PMT (time, channel) into raw_records. This means that you suddenly blow up your information from ~10 byte/photon to ~10 + 220 byte/photon.
So either you need to make all your input chunks tiny, or you need to provide a way for a plugin to yield results early when a user-defined condition is met (a sketch of that pattern is shown below).
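As an illustration of the kind of user-defined condition meant here, a compute method could buffer converted records and yield a chunk as soon as the buffer exceeds a size limit. All names below, including the two helpers and the byte limit, are hypothetical:

```python
import numpy as np

MAX_OUTPUT_BYTES = 500_000_000  # hypothetical user-defined limit


def compute(self, photons, start, end):
    buffer, buffer_bytes, chunk_start = [], 0, start
    for batch in self._iter_photon_batches(photons):  # hypothetical helper
        records = self._photons_to_records(batch)     # hypothetical helper
        buffer.append(records)
        buffer_bytes += records.nbytes
        if buffer_bytes > MAX_OUTPUT_BYTES:
            # Yield everything accumulated so far as one chunk and start
            # a fresh buffer, so the full output for this input chunk
            # never has to fit in memory at once.
            chunk_end = int(records['time'][-1]) + 1  # simplistic boundary choice
            yield self.chunk(start=chunk_start, end=chunk_end,
                             data=np.concatenate(buffer))
            buffer, buffer_bytes, chunk_start = [], 0, chunk_end
    if buffer:
        yield self.chunk(start=chunk_start, end=end,
                         data=np.concatenate(buffer))
```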

I feel very uncomfortable about this PR. Maybe we can have an in-depth discussion about it in team A?

I can see your point; I am also not 100% happy with this solution. An alternative would be to develop a dedicated ChunkDown plugin class with its own do_compute and iter methods (which is of course much more work; a rough outline is sketched below). But sure, let us discuss it in team A.
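Purely illustrative outline of that alternative (the `_split_into_smaller_chunks` helper does not exist; the point is only that the base class, not the user's compute method, would own the down-chunking):

```python
import strax


class ChunkDownPlugin(strax.Plugin):
    """Sketch of a base class that splits oversized compute results."""

    def do_compute(self, chunk_i=None, **kwargs):
        # Run the normal compute, then split the oversized result into
        # smaller chunks before handing them to the processing loop.
        big_chunk = super().do_compute(chunk_i=chunk_i, **kwargs)
        yield from self._split_into_smaller_chunks(big_chunk)  # hypothetical helper
```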
