
feat!: switch over to dask-based processing idioms, improve dataset handling #882

Merged
merged 119 commits into master from local_executors_to_dask
Dec 12, 2023

Conversation

@lgray (Collaborator) commented Aug 22, 2023

This will:

  • add a tool for dask-based pre-processing of "filesets" (the uproot file-dictionary spec plus metadata support)
    • Note that this produces JSON-serializable output, so we can store the results and run preprocessing a minimal number of times instead of on every run (a round-trip sketch appears after the preprocessing description below).
  • add support for dask-driven execution of processors (see the sketch after this list)
  • provide convenience tools for users who no longer wish to use processors
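For the dask-driven execution path, here is a minimal sketch, assuming the apply_to_fileset helper exposed from coffea.dataset_tools in this PR; the file path, the analysis function, and the NanoAOD field it touches are illustrative assumptions:

import dask
import dask_awkward as dak
from coffea.dataset_tools import apply_to_fileset
from coffea.nanoevents import NanoAODSchema

fileset = {
    "dataset_name": {"files": {"path/to/file.root": "Events"}},
}

def n_muons(events):
    # hypothetical analysis function: count muons per event
    return dak.num(events.Muon, axis=1)

# builds one lazy task graph per dataset; nothing is read yet
outputs = apply_to_fileset(n_muons, fileset, schemaclass=NanoAODSchema)

# execute with whatever dask scheduler is configured
(results,) = dask.compute(outputs)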

@lgray (Collaborator, Author) commented Aug 22, 2023

@valsdav Here's the PR with the fileset pre-processor. The query tool you're working on with the rucio / servicex backend should produce JSON that's compatible with this spec (and we can go so far as to define a schema if you like).

Essentially, this nests the file spec from uproot, so that you've got an object that looks like:

dataset = {
    "dataset_name": {"files": <the-uproot-spec>, "metadata": {...}, ...},
    "dataset2_name": ...,
    ...
}

The preprocessing step pulls the files out of the fileset and runs a preprocessing pass on each one to determine the desired chunking. It then returns two filesets: the "available" dataset, containing only the files it was able to reach and parse, and the "complete" dataset, containing every input file, with reachable files updated with their chunking and unreachable files left as given. A sketch of this follows.
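A minimal sketch of that two-output call, plus the JSON round-trip mentioned in the PR description, assuming the preprocess function from coffea.dataset_tools; the step_size and skip_bad_files keyword names and the file path are assumptions:

import json

from coffea.dataset_tools import preprocess

fileset = {
    "dataset_name": {"files": {"path/to/file.root": "Events"}},
}

# "available" holds only the files that could be opened and chunked;
# "complete" echoes the full input, updating only the reachable files
available, complete = preprocess(fileset, step_size=100_000, skip_bad_files=True)

# the result is plain JSON-serializable data, so preprocessing only
# needs to rerun when the fileset itself changes
with open("preprocessed_fileset.json", "w") as f:
    json.dump(available, f)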

I'm going to add some modifiers on top of this to recover capabilities like max_chunks, etc.
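One way such a modifier might look, reusing the "available" fileset from the sketch above; the max_chunks name and signature are assumptions based on this comment, not a settled API:

from coffea.dataset_tools import max_chunks

# cap each dataset at 10 chunks, e.g. for a quick test run
limited = max_chunks(available, 10)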

@valsdav (Contributor) commented Aug 23, 2023

Hi @lgray thanks! Starting to work on the query side :)

valsdav and others added 23 commits August 23, 2023 11:41
feat: Dataset querying features using rucio
@ikrommyd (Contributor) commented:
Pre-processing fails on a distributed Client with

TypeError: ('Could not serialize object of type HighLevelGraph', '<ToPickle: HighLevelGraph with 1 layers.\n<dask.highlevelgraph.HighLevelGraph object at 0x7f525471cfd0>\n 0. -get-steps-6f666e09968af4341f760475fe63382d\n>')

which occurs during all_processed_files = dask.compute(files_to_preprocess)[0], with

AssertionError: bug in Awkward Array: attempt to get length of a TypeTracerArray

See if this has been reported at https://github.com/scikit-hep/awkward/issues

The above exception was the direct cause of the following exception:

appearing throughout the error message. It works fine with local threads or processes.
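A sketch of the setup that reproduces this report, with a hypothetical fileset; the only relevant difference from the working case is creating the distributed Client first:

from dask.distributed import Client

from coffea.dataset_tools import preprocess

fileset = {
    "dataset_name": {"files": {"path/to/file.root": "Events"}},
}

client = Client()  # distributed scheduler: triggers the serialization error
available, complete = preprocess(fileset)  # fails as described above

# the same call completes with the local threaded or
# multiprocessing schedulers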

@lgray (Collaborator, Author) commented Aug 31, 2023

@iasonkrom this is a limitation in the latest awkward array, I think. @agoose77 would know better.

Though I'm a bit confused why it's trying to serialize a typetracer in the first place, that shouldn't be happening.

valsdav and others added 21 commits December 11, 2023 13:04
…nto local_executors_to_dask

Conflicts:
	src/coffea/dataset_tools/dataset_query.py
fix: improvements to dataset_query tools
@lgray (Collaborator, Author) commented Dec 12, 2023

@nsmith- can you give this another run-through? I intend to merge later today unless there is something big.

@lgray lgray enabled auto-merge December 12, 2023 20:52
@lgray lgray merged commit 16dd8df into master Dec 12, 2023
14 checks passed
@lgray lgray deleted the local_executors_to_dask branch December 14, 2023 19:08