fix: improved DataDiscoveryCLI interface #965
Conversation
Force-pushed 64ede9c to ad1ac02
Should we have "first" or "round-robin" be the default, though? Opinions?
Does it "reduce" the quality of the sites though? Are "bigger" sites generally better?
Round-robin is fine and probably scales better. If it turns out to cause problems we can change it later, but I think that gets determined by experimentation rather than by how we think it's supposed to work.
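The two strategies being weighed ("first" vs. "round-robin") can be sketched roughly as below. `choose_replicas` and the `{file: [sites]}` mapping shape are hypothetical illustrations, not the actual implementation in the PR:

```python
def choose_replicas(replicas, strategy="round-robin"):
    """Pick one replica site per file.

    replicas: dict mapping file name -> ordered list of sites hosting it.
    'first' always takes the first listed replica; 'round-robin' rotates
    through each file's replicas to spread load across sites.
    """
    chosen = {}
    idx = 0
    for fname, sites in replicas.items():
        if strategy == "first":
            chosen[fname] = sites[0]
        else:
            # rotate: advance a counter per file and wrap around the
            # number of replicas available for that file
            chosen[fname] = sites[idx % len(sites)]
            idx += 1
    return chosen
```

With "first", every file lands on its first-listed site, which tends to concentrate load on the biggest sites; "round-robin" spreads files across all available replicas.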
One final thing to ask is a method that returns the dictionary that …
Either …
…nto local_executors_to_dask
Oh, I've got one more thought. What if we also keep …
```diff
@@ -542,19 +544,19 @@ def do_preprocess(
             "[red] Preprocessing files to extract available chunks with dask[/]"
         ):
             with Client(dask_cluster) as _:
-                out_available, out_updated = preprocess(
+                self.preprocess_available, self.preprocessed_total = preprocess(
```
`self.preprocessed_available`, not `preprocess_available`
```diff
         with gzip.open(f"{output_file}_all.json.gz", "wt") as file:
             print(f"Saved all fileset chunks to {output_file}_all.json.gz")
-            json.dump(out_updated, file, indent=2)
-            return out_available, out_updated
+            json.dump(self.preprocess_available, file, indent=2)
```
same here
```diff
-            json.dump(out_updated, file, indent=2)
-            return out_available, out_updated
+            json.dump(self.preprocess_available, file, indent=2)
+            return self.preprocessed_total, self.preprocess_available
```
same here
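The pattern the diff is converging on — keeping the preprocess outputs as attributes on the CLI object while still returning them — can be sketched with a reduced, hypothetical stand-in (the real method runs dask's `preprocess` under a `Client` and gzips the result):

```python
class DataDiscoveryCLI:
    """Reduced sketch: do_preprocess stores its results on self so later
    CLI commands can reuse them, and also returns them for direct callers."""

    def __init__(self):
        self.preprocess_available = None
        self.preprocessed_total = None

    def do_preprocess(self, fileset):
        # stand-in for the dask-based preprocess() call in the real code:
        # 'total' is every chunk found, 'available' drops empty entries
        self.preprocessed_total = dict(fileset)
        self.preprocess_available = {k: v for k, v in fileset.items() if v}
        return self.preprocessed_total, self.preprocess_available
```

Storing the results on `self` is what lets a later CLI step (e.g. saving the JSON) reuse them without re-running the dask preprocessing.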
Force-pushed d969544 to ce98b8a
@valsdav @iasonkrom ping me when you've converged, and I'll merge it.
@lgray
I don't know if passing in …
If you provide a dask scheduler URL, the Client is going to connect to that. If it's None, it will spawn a local multiprocessing cluster, won't it?
I'm not sure about that. Lindsey will have to confirm that it's fine. I'm good with all the rest. @lgray you're free to review.
That's all the functionality we should expect in a CLI system; see https://distributed.dask.org/en/stable/client.html. So I think we can provide the scheduler endpoint as a string and call it a day here. If you need to do more complicated preprocessing, you can get the raw list and do what you want, or call the preprocessing command in a script inside the appropriate context manager with your own dask cluster setup. What's there is fit to task.
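The behavior described here — `Client(address)` connects to an existing scheduler, `Client(None)` spawns a local cluster — could be wrapped roughly as below. `resolve_client_target` is a hypothetical helper for illustration, not part of the PR:

```python
def resolve_client_target(dask_cluster=None):
    """Decide what distributed.Client would be handed.

    A non-empty string like "tcp://scheduler:8786" means: connect to that
    existing scheduler. None means: Client() will spawn a local cluster.
    """
    if isinstance(dask_cluster, str) and dask_cluster:
        # e.g. Client("tcp://scheduler:8786") attaches to a running cluster
        return ("connect", dask_cluster)
    # e.g. Client() with no address starts a LocalCluster under the hood
    return ("local", None)
```

This matches the suggestion above: the CLI only needs to accept an optional scheduler-endpoint string; anything fancier is done by the user in their own script with their own cluster object.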
@valsdav change …
so if I've spawned an LPC condor client with:

```python
from distributed import Client
from lpcjobqueue import LPCCondorCluster

cluster = LPCCondorCluster()
cluster.adapt(minimum=0, maximum=100)
client = Client(cluster)
```

what should I do to make …
Ah okay, so we just don't use the CLI class's method in that case. Thanks! I'm good with merging as soon as you review. Thanks a lot @valsdav!
Improving the DataDiscoveryCLI: `do_replicas` and `load_dataset_definitions` now return uproot-compatible metadata. @iasonkrom