-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: use tree reduction to aggregate files in preprocessing #1079
feat: use tree reduction to aggregate files in preprocessing #1079
Conversation
files_trl_label = f"{name}" | ||
files_trl_token = dask.base.tokenize(dak_norm_files, concat_fn, split_every) | ||
files_trl_name = f"{files_trl_label}-{files_trl_token}" | ||
files_trl_tree_node_name = f"{files_trl_label}-tree-node-{files_trl_token}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These use essentially the fileset keys to identify the names, which looked fine to me in the task graph but I'm happy for these to change as desired.
The errors seen in CI come from Delphes and treemaker nanoevents and seem to also happen in nightly CI, so I believe they are unrelated and I'm guessing resolved by #1077? |
Can you get an image of the task graph for preprocess before and after this change? Just for posterity? |
We'll merge this after #1076 |
Preprocessing used to fail with
RecursionError
when enough files were included (#1078). Following helpful advice from @lgray, this applies the exact same method as dask-contrib/dask-awkward#479 to resolve the problem by using a tree reduction strategy.This seems to have worked as a rather clean copy/paste with no real changes apart from the paths to all the imported objects and functions. Perhaps that is not so surprising given that the approach is generic, but it did make me slightly suspicious at first. I did not find anything wrong with it myself so far. I was also testing with
sys.setrecursionlimit(200)
to more quickly hit the error. I don't think that this is an issue, but wanted to mention it to be sure.I am not very familiar with the internals here and only have a rough understanding of everything happening. In addition, my test case is running distributed on a facility (which works since the relevant code only is needed on the head node). I can confirm that these changes do resolve the issue in my test setup, but a critical look / other tests would be very welcome.
resolves #1078