Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use tree reduction to aggregate files in preprocessing #1079

Merged

Conversation

alexander-held
Copy link
Contributor

@alexander-held alexander-held commented Apr 14, 2024

Preprocessing used to fail with RecursionError when enough files were included (#1078). Following helpful advice from @lgray, this applies the exact same method as dask-contrib/dask-awkward#479 to resolve the problem by using a tree reduction strategy.

This seems to have worked as a rather clean copy/paste with no real changes apart from the paths to all the imported objects and functions. Perhaps that is not so surprising given that the approach is generic, but it did make me slightly suspicious at first. I did not find anything wrong with it myself so far. I was also testing with sys.setrecursionlimit(200) to more quickly hit the error. I don't think that this is an issue, but wanted to mention it to be sure.

I am not very familiar with the internals here and only have a rough understanding of everything happening. In addition, my test case is running distributed on a facility (which works since the relevant code only is needed on the head node). I can confirm that these changes do resolve the issue in my test setup, but a critical look / other tests would be very welcome.

resolves #1078

Comment on lines 325 to 328
files_trl_label = f"{name}"
files_trl_token = dask.base.tokenize(dak_norm_files, concat_fn, split_every)
files_trl_name = f"{files_trl_label}-{files_trl_token}"
files_trl_tree_node_name = f"{files_trl_label}-tree-node-{files_trl_token}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These use essentially the fileset keys to identify the names, which looked fine to me in the task graph but I'm happy for these to change as desired.

@alexander-held
Copy link
Contributor Author

The errors seen in CI come from Delphes and treemaker nanoevents and seem to also happen in nightly CI, so I believe they are unrelated and I'm guessing resolved by #1077?

@lgray
Copy link
Collaborator

lgray commented Apr 15, 2024

Can you get an image of the task graph for preprocess before and after this change? Just for posterity?

@alexander-held
Copy link
Contributor Author

alexander-held commented Apr 15, 2024

Before the change:

before_change

After the change:
after_change

This is for a fileset with key ttbar and 30 files. For full context, produced with optimization:

files_to_preprocess[name].visualize(filename=..., optimize_graph=True)

@lgray
Copy link
Collaborator

lgray commented Apr 15, 2024

We'll merge this after #1076

@lgray lgray enabled auto-merge April 15, 2024 21:13
@lgray lgray merged commit 92adea6 into CoffeaTeam:master Apr 15, 2024
14 checks passed
@alexander-held alexander-held deleted the feat/preprocessing-tree-reduction branch April 15, 2024 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Maximum recursion depth during preprocessing
2 participants