
WIP: ServiceX processor #488

Merged
merged 15 commits into from
Apr 16, 2021
Conversation

@BenGalewsky (Contributor) commented Apr 8, 2021

This PR integrates the asynchronous serviceX processor into the Coffea repo.

It represents a slightly different usage pattern since we have access to a streaming source of event data.

This streaming interface is provided by the DataSource class. Its constructor accepts a func_adl query, a list of ServiceX datasets, and optional metadata. It produces an asyncio stream of parquet files that result from applying the ServiceX transform to the provided datasets and selection statements. Users of this class can request a stream of paths to locally downloaded files or of URLs to files on the ServiceX Minio object store.
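A minimal sketch of the streaming pattern described above (not the real coffea DataSource API): the constructor arguments mirror the description, and the stream is simulated with a canned list instead of a live ServiceX transform.

```python
import asyncio

class DataSource:
    # Hypothetical stand-in: the real class drives a ServiceX transform.
    def __init__(self, query, datasets, metadata=None):
        self.query = query
        self.datasets = datasets
        self.metadata = metadata or {}

    async def stream_result_files(self):
        # The real class yields each parquet file (a local path or a
        # Minio URL) as ServiceX finishes transforming it; here we just
        # simulate the asynchronous arrival of results.
        for path in ["result-0.parquet", "result-1.parquet"]:
            await asyncio.sleep(0)  # yield control, as real I/O would
            yield path

async def main():
    source = DataSource(query="<func_adl selection>", datasets=["demo_dataset"])
    return [p async for p in source.stream_result_files()]

print(asyncio.run(main()))
```

The key point is that consumers iterate with `async for`, so downstream work can begin as soon as the first file lands rather than after the whole transform completes.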

We've created a simple Analysis base class that analyzers subclass to hold their physics code. We've factored out almost all of the boilerplate so an analysis stays tightly focused; this meant subclassing dict_accumulator to inject the standard setup.
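A hypothetical sketch of this pattern: the base class carries the accumulator boilerplate and the subclass supplies only the physics. The class and method names here are illustrative, not the exact coffea.processor.servicex API (which builds on dict_accumulator).

```python
class Analysis:
    # Base class owns the standard setup so subclasses stay focused.
    @staticmethod
    def initial_accumulator():
        # Stand-in for the injected dict_accumulator setup.
        return {"sumw": 0.0, "nevents": 0}

    def process(self, events):
        raise NotImplementedError("subclasses implement the physics here")

class ZPeakAnalysis(Analysis):
    # Illustrative subclass: only the physics code lives here.
    def process(self, events):
        out = self.initial_accumulator()
        out["nevents"] = len(events)
        out["sumw"] = sum(e["weight"] for e in events)
        return out

result = ZPeakAnalysis().process([{"weight": 0.5}, {"weight": 1.5}])
print(result)
```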

Finally, we created a base class and a series of subclasses for an Executor. The contract of the executor is to consume files from the DataSource instance and invoke asynchronous calls to the analysis code on a backend of choice. The return value is an asyncio stream of histogram objects. The executor also performs some last-minute packaging to further reduce duplicated code the analyzer would otherwise have to include.
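The executor contract can be sketched as follows. All names are illustrative; real executor subclasses dispatch the work to a backend of choice rather than a worker thread.

```python
import asyncio

async def file_stream():
    # Stand-in for the DataSource stream of result files.
    for path in ["part-0.parquet", "part-1.parquet"]:
        await asyncio.sleep(0)  # simulate files arriving over time
        yield path

def analyze(path):
    # Stand-in for running the analysis on one file and producing a
    # histogram-like object.
    return {"file": path, "counts": [1, 2, 3]}

async def execute(analysis):
    # Consume files from the stream and yield results as an asyncio
    # stream, per the executor contract described above.
    async for path in file_stream():
        # A real executor submits these concurrently to its backend;
        # this sketch runs each in a worker thread and awaits inline.
        yield await asyncio.to_thread(analysis, path)

async def main():
    return [h async for h in execute(analyze)]

histograms = asyncio.run(main())
print(histograms)
```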

Still to Come

  • Need to write unit tests
  • More documentation
  • Try this out on coffea-casa

Outstanding Questions

  1. Is this the right location for these classes?
  2. My IDE includes the IRIS-HEP copyright on new files. I wonder if files contributed by IRIS-HEP should include that?
  3. I put the example notebooks in the binder folder, but I don't know how these will run in binder. We have to provide a .servicex file somehow.

@BenGalewsky BenGalewsky marked this pull request as draft April 8, 2021 15:55
@lgray (Collaborator) commented Apr 10, 2021

@BenGalewsky can you rebase this on the latest master at some point? I can't seem to merge in today's additions.

@lgray (Collaborator) commented Apr 10, 2021

FYI we're fine with justified # noqa: F401 lines in case you load stuff dynamically later.
Just write a comment to justify the exception. :-)
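A sketch of the justified lint exception being described: flake8 flags an import as unused (F401) when it is kept only for its side effects or for later dynamic lookup, so the warning is suppressed and the reason documented inline. The module here is a stand-in for one that registers itself on import.

```python
# flake8 would flag this as F401 (imported but unused); it is kept so
# the module is loaded up front, and the comment justifies the exception.
import json  # noqa: F401  (needed for dynamic lookup below; illustrative)

import importlib

# Later, the module is resolved dynamically rather than referenced by name.
codec = importlib.import_module("json")
print(codec.dumps({"registered": True}))
```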

@BenGalewsky BenGalewsky force-pushed the servicex_coffea branch 2 times, most recently from 1c381fa to a9a612c Compare April 10, 2021 20:09
@lgray (Collaborator) commented Apr 10, 2021

FYI it looks like #495 may actually fix the random failures with the spark and parsl unit tests.

@BenGalewsky BenGalewsky force-pushed the servicex_coffea branch 5 times, most recently from 7c888b0 to a9f5223 Compare April 11, 2021 14:14
@lgray (Collaborator) commented Apr 11, 2021

Hmm it has spent 1h installing in py39. What does SX drag along with it?

@lgray (Collaborator) commented Apr 11, 2021

It is also OK not to test all versions on all platforms if it's significantly painful to do so.
Right now we don't test quite a few things on Windows, parsl is very flaky depending on the OSX and Python version, Spark 2.4 simply doesn't work on Python >= 3.8, etc.

But, if you can make your stuff work everywhere - awesome. :-)

@BenGalewsky (Contributor, Author) replied:

> Hmm it has spent 1h installing in py39. What does SX drag along with it?

Not much at all, just some asynchronous I/O libraries and a little abstract syntax tree manipulation. Let me try building locally and see what the hangup could be.

@BenGalewsky (Contributor, Author) replied:

It looks like some of the func_adl libraries don't support 3.9. Rather than hack up the CI job to omit installing ServiceX only on 3.9, I deactivated the ServiceX tests in the CI build. I'll work with Gordon to get us working on 3.9 and update then.

@lgray (Collaborator) commented Apr 11, 2021

@BenGalewsky I'd still recommend testing things and continuing to kick this down the field. You may go the route of SkyHook/Workqueue and add your own separate (non-required) tests in a python version that is suitably comfy.

@BenGalewsky BenGalewsky marked this pull request as ready for review April 12, 2021 01:27
@lgray (Collaborator) commented Apr 12, 2021

I had a few minor comments, otherwise it looks fine for now.
We could probably do with fewer tests while we're waiting for func_adl to get into py3.9...
That's about it.

@nsmith- take a look as well?

Review comments (since resolved) were left on:
coffea/processor/servicex/schema.py
.github/workflows/ci.yml
coffea/processor/servicex/analysis.py
coffea/processor/servicex/executor.py
    async def inline_wait(r):
        x = await r
        return x

    finished_events = aiostream.stream.map(func_results, inline_wait, ordered=False)
A Member left a review comment on this excerpt:

Do you know the lifecycle of the futures here? Will they be cleaned up as soon as we're done adding them to the output? If not, there is a good chance of a memory explosion for large outputs. Further, since everything is async, you'll want to put some throttling/backpressure semaphores in this flow so that results don't pile up somewhere.
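The throttling being suggested can be sketched with an asyncio.Semaphore, which bounds how many analysis tasks run at once so completed results cannot pile up faster than the consumer drains them. The bound and the work function here are illustrative.

```python
import asyncio

async def process(sem, i):
    async with sem:              # at most 4 tasks inside at a time
        await asyncio.sleep(0)   # stand-in for real analysis work
        return i * i

async def main():
    sem = asyncio.Semaphore(4)   # the backpressure bound
    tasks = [asyncio.create_task(process(sem, i)) for i in range(10)]
    total = 0
    # as_completed lets us fold each result into the accumulator as
    # soon as it is ready, then drop the reference to that future.
    for fut in asyncio.as_completed(tasks):
        total += await fut
    return total

print(asyncio.run(main()))
```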

Further review comments (since resolved) were left on:
coffea/processor/servicex/executor.py
coffea/processor/servicex/executor.py
coffea/processor/servicex/accumulator.py
@gordonwatts (Contributor) replied:

> I had a few minor comments, otherwise it looks fine for now.
> We could probably do with fewer tests while we're waiting for func_adl to get into py3.9...
> That's about it.

Just seeing this. If you mean just the func_adl package, that is already at 3.9. The servicex_frontend isn't there yet because the last time I attempted to run it I couldn't find numpy wheels, but that was before my teaching started. I can prioritize that; I suspect it will just work on 3.9. Let me know what you are specifically referring to here.

@nsmith- (Member) left a review:

I'll be interested to see how the memory behavior plays out at scale with aiostream. If it is doing a good job of dropping references early that would be great news and maybe we can refactor the regular processor to use it.

Overall there's a lot of bifurcation here:

  • processor.servicex.executor.run_coffea_processor is akin to processor.processor._work_function
  • processor.servicex.executor.Executor is akin to run_uproot_job

In the long run we'll need to harmonize (likely in the direction of objects representing analysis state, input state, output state, and executor state), but for now it can be considered an experimental interface.

@nsmith- nsmith- merged commit e68ce6f into CoffeaTeam:master Apr 16, 2021