feat: use tree reduction to aggregate files in preprocessing #1079

alexander-held · 2024-04-14T22:42:03Z

Preprocessing fails with a RecursionError once enough files are included (#1078). Following helpful advice from @lgray, this PR resolves the problem with a tree reduction strategy, applying the exact same method as dask-contrib/dask-awkward#479.
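For readers unfamiliar with the idea, here is a minimal pure-Python sketch of a tree reduction. It is not the coffea or dask-awkward implementation: `tree_reduce` and `concat` are hypothetical helpers, and the group size of 8 is an assumed value chosen only for illustration. The point is that combining items in groups keeps the nesting depth logarithmic in the number of files rather than linear.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def tree_reduce(
    items: Sequence[T],
    combine: Callable[[List[T]], T],
    split_every: int = 8,  # assumed group size, for illustration only
) -> Tuple[T, int]:
    """Combine items in groups of at most split_every until one is left.

    Returns the final result and the number of levels that were needed.
    """
    if not items:
        raise ValueError("need at least one item to reduce")
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    level = list(items)
    depth = 0
    while len(level) > 1:
        # One level of the tree: merge neighbouring groups of items.
        level = [
            combine(level[i : i + split_every])
            for i in range(0, len(level), split_every)
        ]
        depth += 1
    return level[0], depth


def concat(chunks: List[List[str]]) -> List[str]:
    """Toy stand-in for the per-file aggregation step."""
    merged: List[str] = []
    for chunk in chunks:
        merged.extend(chunk)
    return merged


# 30 single-file inputs, loosely mirroring a 30-file fileset.
per_file = [[f"file_{i}.root"] for i in range(30)]
merged, depth = tree_reduce(per_file, concat, split_every=8)
print(len(merged), depth)  # 30 2
```

Folding the same 30 inputs one at a time would nest 29 combine steps inside each other, whereas grouping by 8 needs only two levels (30 inputs, then 4 partial results, then 1).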

This seems to have worked as a rather clean copy/paste, with no real changes apart from the import paths of the objects and functions involved. Perhaps that is not so surprising given that the approach is generic, but it did make me slightly suspicious at first. I have not found anything wrong with it so far. To hit the error more quickly, I also tested with sys.setrecursionlimit(200). I don't think the lowered limit is an issue, but wanted to mention it to be sure.
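The trick of lowering the recursion limit can be sketched with a toy example. Everything below is an assumption made for illustration: `evaluate`, `linear_chain`, `balanced_tree`, and `recursion_headroom` are hypothetical helpers, and the nested tuples are only a stand-in for a deeply nested graph, not the actual code path that fails in preprocessing. Instead of a fixed limit of 200, the sketch allows 200 frames on top of the current stack depth, so that it also runs inside test harnesses that are already some frames deep.

```python
import sys
from contextlib import contextmanager


def stack_depth() -> int:
    """Count the Python frames currently on the stack."""
    depth = 0
    frame = sys._getframe()
    while frame is not None:
        depth += 1
        frame = frame.f_back
    return depth


@contextmanager
def recursion_headroom(extra_frames: int):
    """Temporarily allow only extra_frames more frames than are in use now."""
    old_limit = sys.getrecursionlimit()
    sys.setrecursionlimit(stack_depth() + extra_frames)
    try:
        yield
    finally:
        sys.setrecursionlimit(old_limit)


def evaluate(node):
    """Recursively evaluate a leaf list or a ('concat', left, right) node."""
    if isinstance(node, list):
        return node
    _, left, right = node
    return evaluate(left) + evaluate(right)


def linear_chain(leaves):
    """Fold leaves one at a time: nesting depth grows linearly."""
    expr = leaves[0]
    for leaf in leaves[1:]:
        expr = ("concat", expr, leaf)
    return expr


def balanced_tree(leaves):
    """Split leaves in halves: nesting depth grows logarithmically."""
    if len(leaves) == 1:
        return leaves[0]
    mid = len(leaves) // 2
    return ("concat", balanced_tree(leaves[:mid]), balanced_tree(leaves[mid:]))


leaves = [[i] for i in range(1000)]
with recursion_headroom(200):
    try:
        evaluate(linear_chain(leaves))
        chain_ok = True
    except RecursionError:
        chain_ok = False
    tree_result = evaluate(balanced_tree(leaves))

print(chain_ok, len(tree_result))
```

With the limit lowered, the linear chain fails after a few hundred nested calls while the balanced tree evaluates the same 1000 leaves, which is why a small limit makes this kind of regression quick to reproduce.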

I am not very familiar with the internals here and only have a rough understanding of everything happening. In addition, my test case is running distributed on a facility (which works since the relevant code is only needed on the head node). I can confirm that these changes resolve the issue in my test setup, but a critical look and further tests would be very welcome.

resolves #1078

alexander-held · 2024-04-14T22:43:45Z

src/coffea/dataset_tools/preprocess.py

+        files_trl_label = f"{name}"
+        files_trl_token = dask.base.tokenize(dak_norm_files, concat_fn, split_every)
+        files_trl_name = f"{files_trl_label}-{files_trl_token}"
+        files_trl_tree_node_name = f"{files_trl_label}-tree-node-{files_trl_token}"


These essentially use the fileset keys as the task names, which looked fine to me in the task graph, but I'm happy to change them as desired.
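As a rough illustration of the naming scheme in the diff above, the sketch below builds the same two name patterns from a label and a deterministic token. It deliberately swaps in a `hashlib` digest for `dask.base.tokenize` so that it runs without dask; `toy_token` and `reduction_names` are hypothetical helpers, and the real tokens will look different.

```python
import hashlib
from typing import Tuple


def toy_token(*parts: object) -> str:
    """Deterministic stand-in for dask.base.tokenize (assumption: repr-based)."""
    return hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()[:32]


def reduction_names(label: str, *inputs: object) -> Tuple[str, str]:
    """Return (final layer name, intermediate tree-node name) for a label."""
    token = toy_token(*inputs)
    return f"{label}-{token}", f"{label}-tree-node-{token}"


# The label plays the role of the fileset key, e.g. "ttbar".
final_name, node_name = reduction_names("ttbar", ["a.root", "b.root"], 8)
print(final_name)
print(node_name)
```

Because the token is derived from the inputs, identical inputs give identical names and different inputs give different names, while the label keeps the fileset key visible in the task graph.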

alexander-held · 2024-04-14T23:28:56Z

The errors seen in CI come from Delphes and treemaker nanoevents and also seem to occur in the nightly CI, so I believe they are unrelated to this PR. I'm guessing they are resolved by #1077?

lgray · 2024-04-15T08:49:44Z

Can you get an image of the task graph for preprocess before and after this change? Just for posterity?

alexander-held · 2024-04-15T14:37:47Z

Before the change: (task graph image not preserved)

After the change: (task graph image not preserved)

This is for a fileset with key ttbar and 30 files. For full context, the graphs were produced with graph optimization enabled:

files_to_preprocess[name].visualize(filename=..., optimize_graph=True)

lgray · 2024-04-15T15:12:35Z

We'll merge this after #1076

feat: use tree reduction to aggregate files in preprocessing

50a7069

alexander-held mentioned this pull request Apr 14, 2024

Recursion errors at scale iris-hep/idap-200gbps-atlas#37

Closed

lgray added 2 commits April 15, 2024 11:47

Merge branch 'master' into feat/preprocessing-tree-reduction

1771a71

indicate task is preprocess

7bfdcc1

lgray enabled auto-merge April 15, 2024 21:13

lgray merged commit 92adea6 into CoffeaTeam:master Apr 15, 2024
14 checks passed

alexander-held deleted the feat/preprocessing-tree-reduction branch April 15, 2024 22:36
