
WIP: ServiceX processor #488

Merged
merged 15 commits into from
Apr 16, 2021
Conversation

@BenGalewsky (Contributor) commented Apr 8, 2021

This PR integrates the asynchronous serviceX processor into the Coffea repo.

It represents a slightly different usage pattern since we have access to a streaming source of event data.

This streaming interface is provided by the DataSource class. Its constructor accepts a func_adl query, a list of ServiceX datasets, and optional metadata. It produces an asyncio stream of parquet files that result from applying the ServiceX transform to the provided datasets and selection statements. Users of this class can request a stream of paths to locally downloaded files or of URLs to files on the ServiceX Minio object store.
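A minimal sketch of the streaming pattern described above (not the real coffea DataSource API): the constructor arguments mirror the description, and the stream is simulated with a canned list instead of a live ServiceX transform.

```python
import asyncio

class DataSource:
    # Hypothetical stand-in: the real class drives a ServiceX transform.
    def __init__(self, query, datasets, metadata=None):
        self.query = query
        self.datasets = datasets
        self.metadata = metadata or {}

    async def stream_result_files(self):
        # The real class yields each parquet file (a local path or a
        # Minio URL) as ServiceX finishes transforming it; here we just
        # simulate the asynchronous arrival of results.
        for path in ["result-0.parquet", "result-1.parquet"]:
            await asyncio.sleep(0)  # yield control, as real I/O would
            yield path

async def main():
    source = DataSource(query="<func_adl selection>", datasets=["demo_dataset"])
    return [p async for p in source.stream_result_files()]

print(asyncio.run(main()))
```

The key point is that consumers iterate with `async for`, so downstream work can begin as soon as the first file lands rather than after the whole transform completes.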

We've created a simple Analysis base class that analyzers subclass to hold their physics code. We've factored out almost all of the boilerplate so an analysis stays tightly focused; this meant subclassing dict_accumulator to inject the standard setup.
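A hypothetical sketch of this pattern: the base class carries the accumulator boilerplate and the subclass supplies only the physics. The class and method names here are illustrative, not the exact coffea.processor.servicex API (which builds on dict_accumulator).

```python
class Analysis:
    # Base class owns the standard setup so subclasses stay focused.
    @staticmethod
    def initial_accumulator():
        # Stand-in for the injected dict_accumulator setup.
        return {"sumw": 0.0, "nevents": 0}

    def process(self, events):
        raise NotImplementedError("subclasses implement the physics here")

class ZPeakAnalysis(Analysis):
    # Illustrative subclass: only the physics code lives here.
    def process(self, events):
        out = self.initial_accumulator()
        out["nevents"] = len(events)
        out["sumw"] = sum(e["weight"] for e in events)
        return out

result = ZPeakAnalysis().process([{"weight": 0.5}, {"weight": 1.5}])
print(result)
```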

Finally, we created a base class and a series of subclasses for an Executor. The contract of the executor is to consume files from the DataSource instance and invoke asynchronous calls to the analysis code on a backend of choice. The return value is an asyncio stream of histogram objects. The executor also performs some last-minute packaging to further reduce duplicated code the analyzer would otherwise have to include.
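The executor contract can be sketched as follows. All names are illustrative; real executor subclasses dispatch the work to a backend of choice rather than a worker thread.

```python
import asyncio

async def file_stream():
    # Stand-in for the DataSource stream of result files.
    for path in ["part-0.parquet", "part-1.parquet"]:
        await asyncio.sleep(0)  # simulate files arriving over time
        yield path

def analyze(path):
    # Stand-in for running the analysis on one file and producing a
    # histogram-like object.
    return {"file": path, "counts": [1, 2, 3]}

async def execute(analysis):
    # Consume files from the stream and yield results as an asyncio
    # stream, per the executor contract described above.
    async for path in file_stream():
        # A real executor submits these concurrently to its backend;
        # this sketch runs each in a worker thread and awaits inline.
        yield await asyncio.to_thread(analysis, path)

async def main():
    return [h async for h in execute(analyze)]

histograms = asyncio.run(main())
print(histograms)
```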

Still to Come

  • Need to write unit tests
  • More documentation
  • Try this out on coffea-casa

Outstanding Questions

  1. Is this the right location for these classes?
  2. My IDE includes the IRIS-HEP copyright on new files. I wonder if files contributed by IRIS-HEP should include that?
  3. I put the example notebooks in the binder folder, but I don't know how these will run in binder. We have to provide a .servicex file somehow.

@BenGalewsky BenGalewsky marked this pull request as draft April 8, 2021 15:55
@lgray (Collaborator) commented Apr 10, 2021

@BenGalewsky can you rebase this on the latest master at some point? I can't seem to merge in today's additions.

@lgray (Collaborator) commented Apr 10, 2021

FYI we're fine with justified # noqa: F401 lines in case you load stuff dynamically later.
Just write a comment to justify the exception. :-)
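A sketch of the justified lint exception being described: flake8 flags an import as unused (F401) when it is kept only for its side effects or for later dynamic lookup, so the warning is suppressed and the reason documented inline. The module here is a stand-in for one that registers itself on import.

```python
# flake8 would flag this as F401 (imported but unused); it is kept so
# the module is loaded up front, and the comment justifies the exception.
import json  # noqa: F401  (needed for dynamic lookup below; illustrative)

import importlib

# Later, the module is resolved dynamically rather than referenced by name.
codec = importlib.import_module("json")
print(codec.dumps({"registered": True}))
```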

@BenGalewsky BenGalewsky force-pushed the servicex_coffea branch 2 times, most recently from 1c381fa to a9a612c Compare April 10, 2021 20:09
@lgray (Collaborator) commented Apr 10, 2021

FYI it looks like #495 may actually fix the random failures with the spark and parsl unit tests.

@BenGalewsky BenGalewsky force-pushed the servicex_coffea branch 5 times, most recently from 7c888b0 to a9f5223 Compare April 11, 2021 14:14
@lgray (Collaborator) commented Apr 11, 2021

Hmm it has spent 1h installing in py39. What does SX drag along with it?

@lgray (Collaborator) commented Apr 11, 2021

It is also OK not to test all versions on all platforms if it's significantly painful to do so.
Right now we don't test quite a few things on Windows, parsl is very flaky depending on the OSX and Python version, Spark 2.4 simply doesn't work on Python >= 3.8, etc.

But, if you can make your stuff work everywhere - awesome. :-)

@BenGalewsky (Contributor, Author) replied:

> Hmm it has spent 1h installing in py39. What does SX drag along with it?

Not much at all, just some asynchronous I/O libraries and a little abstract syntax tree manipulation. Let me try building locally and see what the hangup could be.

@BenGalewsky (Contributor, Author) replied:

It looks like some of the func_adl libraries don't support 3.9. Rather than hack up the CI job to omit installing ServiceX only on 3.9, I deactivated the ServiceX tests in the CI build. I'll work with Gordon to get us working on 3.9 and update then.

@lgray (Collaborator) commented Apr 11, 2021

@BenGalewsky I'd still recommend testing things and continuing to kick this down the field. You may go the route of SkyHook/Workqueue and add your own separate (non-required) tests in a python version that is suitably comfy.

@BenGalewsky BenGalewsky marked this pull request as ready for review April 12, 2021 01:27
@lgray (Collaborator) commented Apr 12, 2021

I had a few minor comments, otherwise it looks fine for now.
We could probably do with fewer tests while we're waiting for func_adl to get into py3.9...
That's about it.

@nsmith- take a look as well?

Review comments (since resolved) were left on:
coffea/processor/servicex/schema.py
.github/workflows/ci.yml
coffea/processor/servicex/analysis.py
coffea/processor/servicex/executor.py
    async def inline_wait(r):
        x = await r
        return x

    finished_events = aiostream.stream.map(func_results, inline_wait, ordered=False)
A Member left a review comment on this excerpt:

Do you know the lifecycle of the futures here? Will they be cleaned up as soon as we're done adding them to the output? If not, there is a good chance of a memory explosion for large outputs. Further, since everything is async, you'll want to put some throttling/backpressure semaphores in this flow so that results don't pile up somewhere.
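The throttling being suggested can be sketched with an asyncio.Semaphore, which bounds how many analysis tasks run at once so completed results cannot pile up faster than the consumer drains them. The bound and the work function here are illustrative.

```python
import asyncio

async def process(sem, i):
    async with sem:              # at most 4 tasks inside at a time
        await asyncio.sleep(0)   # stand-in for real analysis work
        return i * i

async def main():
    sem = asyncio.Semaphore(4)   # the backpressure bound
    tasks = [asyncio.create_task(process(sem, i)) for i in range(10)]
    total = 0
    # as_completed lets us fold each result into the accumulator as
    # soon as it is ready, then drop the reference to that future.
    for fut in asyncio.as_completed(tasks):
        total += await fut
    return total

print(asyncio.run(main()))
```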

Further review comments (since resolved) were left on:
coffea/processor/servicex/executor.py
coffea/processor/servicex/executor.py
coffea/processor/servicex/accumulator.py
@gordonwatts (Contributor) replied:

> I had a few minor comments, otherwise it looks fine for now.
> We could probably do with fewer tests while we're waiting for func_adl to get into py3.9...
> That's about it.

Just seeing this. If you mean just the func_adl package, that is already at 3.9. The servicex_frontend isn't there yet because the last time I attempted to run it I couldn't find numpy wheels, but that was before my teaching started. I can prioritize that; I suspect it will just work on 3.9. Let me know what you are specifically referring to here.

@nsmith- (Member) left a review:

I'll be interested to see how the memory behavior plays out at scale with aiostream. If it is doing a good job of dropping references early that would be great news and maybe we can refactor the regular processor to use it.

Overall there's a lot of bifurcation here:

  • processor.servicex.executor.run_coffea_processor is akin to processor.processor._work_function
  • processor.servicex.executor.Executor is akin to run_uproot_job

In the long run we'll need to harmonize (likely in the direction of objects representing analysis state, input state, output state, and executor state), but for now it can be considered an experimental interface.

@nsmith- nsmith- merged commit e68ce6f into CoffeaTeam:master Apr 16, 2021