
fix: improved DataDiscoveryCLI interface #965

Merged · 8 commits · Dec 13, 2023

Conversation

@valsdav (Contributor) commented Dec 13, 2023

This PR improves the DataDiscoveryCLI:

  • do_replicas and load_dataset_definitions now return uproot-compatible metadata (see the usage sketch after this list)
  • Added a "first" option for Rucio replicas
  • preprocess now returns both the overall and the updated datasets
  • Fixes and docs
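A minimal usage sketch of the reworked interface. Method and parameter names follow this PR's description; the exact signatures in the merged code may differ, so treat this as illustrative rather than authoritative:

from coffea.dataset_tools.dataset_query import DataDiscoveryCLI

ddc = DataDiscoveryCLI()
# Per this PR, the return value is now uproot-compatible metadata.
fileset = ddc.load_dataset_definitions("dataset_definition.json")
# "first" always takes the first available site for each file;
# "round-robin" cycles through the available sites (mode name from this PR).
ddc.do_replicas(mode="first")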

@iasonkrom

@valsdav valsdav marked this pull request as ready for review December 13, 2023 15:16
@ikrommyd (Contributor)

Should "first" or "round-robin" be the default, though? Opinions?

@valsdav (Contributor, Author) commented Dec 13, 2023

round-robin as the default distributes the load more evenly between sites.
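As a plain-Python illustration of the difference (this is not the coffea API, just the idea), given a mapping from files to their replica sites:

replicas = {
    "file1.root": ["T1_SITE_A", "T2_SITE_B"],
    "file2.root": ["T1_SITE_A", "T2_SITE_C"],
    "file3.root": ["T2_SITE_B", "T2_SITE_C"],
}

# "first": always pick the first listed site; load piles up on T1_SITE_A.
first = {f: sites[0] for f, sites in replicas.items()}

# "round-robin": advance through each file's site list; load spreads out.
round_robin = {
    f: sites[i % len(sites)] for i, (f, sites) in enumerate(replicas.items())
}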

@ikrommyd (Contributor)

Does it "reduce" the quality of the sites though? Are "bigger" sites generally better?

@lgray (Collaborator) commented Dec 13, 2023

Round-robin is fine and probably scales better. If it turns out that it causes problems we can change it later, but I think that gets determined by experimentation rather than by how we think it is supposed to work.

@ikrommyd (Contributor)

One final thing to ask for is a method that returns the dictionary that do_save would write to a JSON file. Maybe .as_dict, like @lgray mentioned in Slack. After that, it looks good to me!

@lgray (Collaborator) commented Dec 13, 2023

Either as_dict or to_dict would be fine, but yes, that's needed, @valsdav.
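A minimal sketch of what such a method could look like; the internal attribute name here is a placeholder, and the merged code defines the real one:

class DataDiscoveryCLI:
    # ...existing state; self._fileset is a placeholder attribute name.

    def as_dict(self) -> dict:
        # Return the same fileset dictionary that do_save would serialize,
        # so callers can use it directly instead of re-reading the JSON file.
        return self._fileset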

@ikrommyd (Contributor) commented Dec 13, 2023

Oh, one more thought: what if we also keep the out_available and out_updated dictionaries as class attributes once preprocess has run? Just to have direct access to them from the class, and not only from the JSON files.

@@ -542,19 +544,19 @@ def do_preprocess(
             "[red] Preprocessing files to extract available chunks with dask[/]"
         ):
             with Client(dask_cluster) as _:
-                out_available, out_updated = preprocess(
+                self.preprocess_available, self.preprocessed_total = preprocess(

Review comment (Contributor): self.preprocessed_available, not preprocess_available

             with gzip.open(f"{output_file}_all.json.gz", "wt") as file:
                 print(f"Saved all fileset chunks to {output_file}_all.json.gz")
-                json.dump(out_updated, file, indent=2)
-                return out_available, out_updated
+                json.dump(self.preprocess_available, file, indent=2)

Review comment (Contributor): same here

-                json.dump(out_updated, file, indent=2)
-                return out_available, out_updated
+                json.dump(self.preprocess_available, file, indent=2)
+                return self.preprocessed_total, self.preprocess_available

Review comment (Contributor): same here

@lgray (Collaborator) commented Dec 13, 2023

@valsdav @iasonkrom ping me when you've converged, and I'll merge it.

@ikrommyd (Contributor) commented Dec 13, 2023

@lgray do_preprocess() takes in dask_cluster=None and spawns a with Client(dask_cluster) as _ to do the preprocessing. But what if the user already has a Client running? No reason to spawn another one, right? I'm good with all the rest.

@ikrommyd (Contributor)

I don't know if passing in dask_client rather than dask_cluster is the correct approach here.

@valsdav (Contributor, Author) commented Dec 13, 2023

If you provide a dask scheduler URL, the Client is going to connect to that. If None, it will spawn a local multiprocessing cluster, won't it?

@ikrommyd (Contributor)

I'm not sure about that. Lindsey will have to confirm that it's fine. I'm good with all the rest. @lgray, you're free to review.

@lgray (Collaborator) commented Dec 13, 2023

That's all the functionality we should expect in a CLI system; see https://distributed.dask.org/en/stable/client.html.

So I think we can provide the scheduler endpoint as a string and call it a day here.

If you need to do more complicated preprocessing, you can get the raw list and do what you want, or call the pre-processing command in a script in the appropriate context manager with your dask cluster setup.

What's there is fit to task.
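For reference, that is the documented behavior of distributed.Client: pass a scheduler address string to attach to an existing cluster, or pass nothing to spawn a local one.

from distributed import Client

# Attach to an already-running scheduler by its address...
client = Client("tcp://127.0.0.1:8786")

# ...or spawn a local cluster when no address is given.
client = Client()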

@lgray (Collaborator) commented Dec 13, 2023

@valsdav, change dask_cluster to scheduler_url and I think the usage becomes clearer.

@ikrommyd (Contributor)

So if I've spawned an LPC Condor client with:

from distributed import Client
from lpcjobqueue import LPCCondorCluster


cluster = LPCCondorCluster()
cluster.adapt(minimum=0, maximum=100)
client = Client(cluster)

what should I do to make do_preprocess use that Client?

@lgray (Collaborator) commented Dec 13, 2023

as_dict -> call coffea.dataset_tools.preprocess

@ikrommyd (Contributor)

> as_dict -> call coffea.dataset_tools.preprocess

Ah okay, so we just don't use the CLI class's method in that case. Thanks!

I'm good with merging as soon as you review! Thanks a lot @valsdav!
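A sketch of that escape hatch, combining the condor snippet above with the as_dict method discussed in this thread (as_dict and its final name are assumptions; per the diff above, preprocess runs on the Client active in the surrounding context):

from coffea.dataset_tools import preprocess
from coffea.dataset_tools.dataset_query import DataDiscoveryCLI
from distributed import Client
from lpcjobqueue import LPCCondorCluster

cluster = LPCCondorCluster()
cluster.adapt(minimum=0, maximum=100)

ddc = DataDiscoveryCLI()
# ...query datasets and resolve replicas here, as in the sketch near the top...

with Client(cluster):
    # as_dict is the method proposed in this thread (name may differ in the
    # merged code); preprocess picks up the Client from this context.
    fileset = ddc.as_dict()
    out_available, out_updated = preprocess(fileset)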

@lgray lgray enabled auto-merge December 13, 2023 17:12
@lgray lgray merged commit 7a236f1 into CoffeaTeam:master Dec 13, 2023
14 checks passed