
fix: improved DataDiscoveryCLI interface #965

Merged · 8 commits · Dec 13, 2023

Conversation

@valsdav (Contributor) commented Dec 13, 2023

This PR improves the DataDiscoveryCLI:

  • do_replicas and load_dataset_definitions now return uproot-compatible metadata (see the usage sketch after this list)
  • Added a "first" option for Rucio replicas
  • preprocess now returns both the overall and the updated datasets
  • Fixes and docs
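A minimal usage sketch of the reworked interface. Method and parameter names follow this PR's description; the exact signatures in the merged code may differ, so treat this as illustrative rather than authoritative:

from coffea.dataset_tools.dataset_query import DataDiscoveryCLI

ddc = DataDiscoveryCLI()
# Per this PR, the return value is now uproot-compatible metadata.
fileset = ddc.load_dataset_definitions("dataset_definition.json")
# "first" always takes the first available site for each file;
# "round-robin" cycles through the available sites (mode name from this PR).
ddc.do_replicas(mode="first")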

@iasonkrom

@valsdav valsdav marked this pull request as ready for review December 13, 2023 15:16
@ikrommyd (Contributor)

Should "first" or "round-robin" be the default, though? Opinions?

@valsdav (Contributor, Author) commented Dec 13, 2023

round-robin as the default distributes the load more evenly between sites.
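As a plain-Python illustration of the difference (this is not the coffea API, just the idea), given a mapping from files to their replica sites:

replicas = {
    "file1.root": ["T1_SITE_A", "T2_SITE_B"],
    "file2.root": ["T1_SITE_A", "T2_SITE_C"],
    "file3.root": ["T2_SITE_B", "T2_SITE_C"],
}

# "first": always pick the first listed site; load piles up on T1_SITE_A.
first = {f: sites[0] for f, sites in replicas.items()}

# "round-robin": advance through each file's site list; load spreads out.
round_robin = {
    f: sites[i % len(sites)] for i, (f, sites) in enumerate(replicas.items())
}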

@ikrommyd (Contributor)

Does it "reduce" the quality of the sites though? Are "bigger" sites generally better?

@lgray (Collaborator) commented Dec 13, 2023

Round-robin is fine and probably scales better. If it turns out that it causes problems we can change it later, but I think that gets determined by experimentation rather than by how we think it is supposed to work.

@ikrommyd (Contributor)

One final thing to ask for is a method that returns the dictionary that do_save would write to a JSON file. Maybe .as_dict, like @lgray mentioned in Slack. After that, it looks good to me!

@lgray (Collaborator) commented Dec 13, 2023

Either as_dict or to_dict would be fine, but yes, that's needed, @valsdav.
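A minimal sketch of what such a method could look like; the internal attribute name here is a placeholder, and the merged code defines the real one:

class DataDiscoveryCLI:
    # ...existing state; self._fileset is a placeholder attribute name.

    def as_dict(self) -> dict:
        # Return the same fileset dictionary that do_save would serialize,
        # so callers can use it directly instead of re-reading the JSON file.
        return self._fileset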

@ikrommyd (Contributor) commented Dec 13, 2023

Oh, one more thought: what if we also keep the out_available and out_updated dictionaries as class attributes once preprocess has run? Just to have direct access to them from the class, and not only from the JSON files.

@@ -542,19 +544,19 @@ def do_preprocess(
             "[red] Preprocessing files to extract available chunks with dask[/]"
         ):
             with Client(dask_cluster) as _:
-                out_available, out_updated = preprocess(
+                self.preprocess_available, self.preprocessed_total = preprocess(

Review comment (Contributor): self.preprocessed_available, not preprocess_available

             with gzip.open(f"{output_file}_all.json.gz", "wt") as file:
                 print(f"Saved all fileset chunks to {output_file}_all.json.gz")
-                json.dump(out_updated, file, indent=2)
-                return out_available, out_updated
+                json.dump(self.preprocess_available, file, indent=2)

Review comment (Contributor): same here

-                json.dump(out_updated, file, indent=2)
-                return out_available, out_updated
+                json.dump(self.preprocess_available, file, indent=2)
+                return self.preprocessed_total, self.preprocess_available

Review comment (Contributor): same here

@lgray (Collaborator) commented Dec 13, 2023

@valsdav @iasonkrom ping me when you've converged, and I'll merge it.

@ikrommyd (Contributor) commented Dec 13, 2023

@lgray do_preprocess() takes in dask_cluster=None and spawns a with Client(dask_cluster) as _ to do the preprocessing. But what if the user already has a Client running? No reason to spawn another one, right? I'm good with all the rest.

@ikrommyd (Contributor)

I don't know if passing in dask_client rather than dask_cluster is the correct approach here.

@valsdav (Contributor, Author) commented Dec 13, 2023

If you provide a dask scheduler URL, the Client is going to connect to that. If None, it will spawn a local multiprocessing cluster, won't it?

@ikrommyd (Contributor)

I'm not sure about that. Lindsey will have to confirm that it's fine. I'm good with all the rest. @lgray, you're free to review.

@lgray (Collaborator) commented Dec 13, 2023

That's all the functionality we should expect in a CLI system; see https://distributed.dask.org/en/stable/client.html.

So I think we can provide the scheduler endpoint as a string and call it a day here.

If you need to do more complicated preprocessing, you can get the raw list and do what you want, or call the pre-processing command in a script in the appropriate context manager with your dask cluster setup.

What's there is fit to task.
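For reference, that is the documented behavior of distributed.Client: pass a scheduler address string to attach to an existing cluster, or pass nothing to spawn a local one.

from distributed import Client

# Attach to an already-running scheduler by its address...
client = Client("tcp://127.0.0.1:8786")

# ...or spawn a local cluster when no address is given.
client = Client()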

@lgray (Collaborator) commented Dec 13, 2023

@valsdav, change dask_cluster to scheduler_url and I think the usage becomes clearer.

@ikrommyd (Contributor)

So if I've spawned an LPC Condor client with:

from distributed import Client
from lpcjobqueue import LPCCondorCluster


cluster = LPCCondorCluster()
cluster.adapt(minimum=0, maximum=100)
client = Client(cluster)

what should I do to make do_preprocess use that Client?

@lgray (Collaborator) commented Dec 13, 2023

as_dict -> call coffea.dataset_tools.preprocess

@ikrommyd (Contributor)

> as_dict -> call coffea.dataset_tools.preprocess

Ah okay, so we just don't use the CLI class's method in that case. Thanks!

I'm good with merging as soon as you review! Thanks a lot @valsdav!
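A sketch of that escape hatch, combining the condor snippet above with the as_dict method discussed in this thread (as_dict and its final name are assumptions; per the diff above, preprocess runs on the Client active in the surrounding context):

from coffea.dataset_tools import preprocess
from coffea.dataset_tools.dataset_query import DataDiscoveryCLI
from distributed import Client
from lpcjobqueue import LPCCondorCluster

cluster = LPCCondorCluster()
cluster.adapt(minimum=0, maximum=100)

ddc = DataDiscoveryCLI()
# ...query datasets and resolve replicas here, as in the sketch near the top...

with Client(cluster):
    # as_dict is the method proposed in this thread (name may differ in the
    # merged code); preprocess picks up the Client from this context.
    fileset = ddc.as_dict()
    out_available, out_updated = preprocess(fileset)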

@lgray lgray enabled auto-merge December 13, 2023 17:12
@lgray lgray merged commit 7a236f1 into CoffeaTeam:master Dec 13, 2023
14 checks passed