
[BEAM-562] Add DoFn.setup and DoFn.teardown to Python SDK #7994

Merged: 11 commits merged into apache:master on May 14, 2019

Conversation

yifanmai
Contributor

yifanmai commented Mar 5, 2019

DoFn.setup and DoFn.teardown are currently supported in Java but not in Python. These methods are useful for performing expensive per-thread initialization. This change adds those methods to make the Python SDK more consistent with the Java SDK. It also modifies the direct runner to invoke these methods.
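
As a usage illustration, here is a minimal sketch of a DoFn using the new methods. The class and the _connect helper are hypothetical; only setup, process, and teardown are the API being added.

import apache_beam as beam

class ExpensiveResourceDoFn(beam.DoFn):
  def setup(self):
    # Called once per DoFn instance before any bundle is processed: a
    # good place for expensive initialization, such as opening a client
    # connection that should be reused across bundles.
    self._client = _connect()  # hypothetical helper

  def process(self, element):
    yield self._client.lookup(element)

  def teardown(self):
    # Called once per instance when the runner retires it. Runners treat
    # teardown as best-effort, so do not rely on it for correctness.
    self._client.close()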

Post-Commit Tests Status (on master branch)

[Build-status badge matrix for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners.]

See .test-infra/jenkins/README for the trigger phrases, statuses, and links of all Jenkins jobs.

@yifanmai
Contributor Author

yifanmai commented Mar 5, 2019

@charlesccychen PTAL

Contributor

@charlesccychen left a comment


Thanks!

Member

@kennknowles left a comment


Only reviewed to the point of my last comment, not further.

sdks/python/apache_beam/runners/common.py
@@ -619,6 +619,7 @@ def start_bundle(self):
step_name=self._applied_ptransform.full_label,
state=DoFnState(self._counter_factory),
user_state_context=self.user_state_context)
self.runner.setup()
Member


This should be called once per instance, not in start_bundle.
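
For reference, a sketch of the intended ordering per the DoFn lifecycle (illustrative pseudocode, not actual runner code):

# Per DoFn instance, not per bundle:
#
#   dofn.setup()             # once per instance
#   for each bundle assigned to this instance:
#     dofn.start_bundle()
#     dofn.process(element)  # once per element
#     dofn.finish_bundle()
#   dofn.teardown()          # once per instance, best-effort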

Contributor


In the DirectRunner, we currently have one instance per bundle. If we want these instances to be long-lived, we need a larger refactoring here.

Contributor Author


Yes, unfortunately the DirectRunner does not cache instances between bundles. Fortunately this code path is only used by DirectRunner and not the other runners.

Member


If self._perform_dofn_pickle_test is false, it looks like the DoFn is not cloned but comes directly from self._applied_ptransform.transform.dofn. Is the whole applied_ptransform cloned per bundle?

Contributor Author


I added some pdb breakpoints and confirmed that the DoFn is cloned during the test and that the lifecycle methods are called on the clones. See this thread for where the cloning occurs.

@yifanmai
Contributor Author

Hi @kennknowles, would you be willing to be the primary reviewer on this since @charlesccychen is unavailable for a while?

@kennknowles
Member

Sure, I would be happy to.

@kennknowles self-requested a review on April 15, 2019 22:39
@yifanmai
Contributor Author

Thank you! Could you also let me know which of your previous review comments are still actionable?

sdks/python/apache_beam/transforms/dofn_lifecycle_test.py
with TestPipeline() as p:
(p
| 'Start' >> beam.Create([1, 2, 3])
| 'Do' >> beam.ParDo(CallSequenceEnforcingDoFn()))
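
For context, a minimal sketch of what a call-sequence-enforcing DoFn could look like; this is an assumed shape, and the actual CallSequenceEnforcingDoFn in the PR may differ:

import apache_beam as beam

class CallSequenceEnforcingDoFn(beam.DoFn):
  def __init__(self):
    self._setup_called = False
    self._start_bundle_called = False

  def setup(self):
    assert not self._setup_called, 'setup should not be called twice'
    self._setup_called = True

  def start_bundle(self):
    assert self._setup_called, 'setup should have been called'
    self._start_bundle_called = True

  def process(self, element):
    assert self._start_bundle_called, 'start_bundle should have been called'
    yield element
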
Member


How about adding an assert after the end of the with: block, to check that teardown was actually called?

Contributor Author


Would that actually work? My understanding is that with remote runners, the DoFn instance here and the one on the worker are different physical instances, and there isn't a way to check for state changes on the DoFn on the worker to figure out if teardown was called or not.

Also, when I was doing print debugging, it looked like teardown was not being called on the direct runner.

Member

@aaltay Apr 19, 2019


It would not work with remote runners, but it would work with DirectRunner. Isn't DirectRunner the default for TestPipeline?

Also, when I was doing print debugging, it looked like teardown was not being called on the direct runner.

That looks like a bug in this PR. (Or maybe not, since teardown is not guaranteed to be called, but I think we should exercise these paths in DirectRunner.)

Contributor Author


I see. If we only run this test for DirectRunner, then I can add the extra assert.

Yes, I will investigate and see if DirectRunner can be made to call teardown.

Contributor Author


Print debugging shows that teardown is being correctly called. However, the assert in this location fails because the worker is using a different unpickled copy of the DoFn.

This means that it is difficult to get a handle to the DoFn instance that the direct runner worker is actually using.
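
A minimal illustration of why asserting on the local instance fails; plain pickle is used here for clarity, while Beam actually uses its own pickler:

import pickle

local_dofn = CallSequenceEnforcingDoFn()
worker_dofn = pickle.loads(pickle.dumps(local_dofn))  # what the worker runs
assert worker_dofn is not local_dofn  # lifecycle calls mutate only the clone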

Contributor Author


Added a test that checks that a global flag is set. This assumes that the worker is running in the same process as the test.
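
A sketch of that approach, assuming the worker runs in the same process as the test; the DoFn class name is hypothetical, while the _global_teardown_called flag matches the test referenced later in this thread:

import apache_beam as beam

_global_teardown_called = False

class TeardownTrackingDoFn(beam.DoFn):  # hypothetical name
  def process(self, element):
    yield element

  def teardown(self):
    # Flip a module-level flag so the test can observe the call, as long
    # as the worker shares the test's process.
    global _global_teardown_called
    _global_teardown_called = True

After the pipeline's with block exits, the test can assert that _global_teardown_called is True.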



@attr('ValidatesRunner')
class DoFnLifecycleTest(unittest.TestCase):
Member


Does Python have TestStream implemented in the direct runner? Or something equivalent for testing directly against the SDK harness? It would be good to have a test that explicitly crosses multiple bundles, since the purpose of setup / teardown is for operations that cross bundles.

Member


TestStream is implemented in the direct runner, but it would not use the SDK harness.

Also, Yifan's implementation here adds the APIs, but does not really change the lifecycle to reuse DoFns.

Nevertheless, it would be good to run a test with the SDK harness.
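
A hedged sketch of such a test using TestStream; note that bundle boundaries are runner-dependent, so delivering elements in separate events does not guarantee separate bundles. CallSequenceEnforcingDoFn refers to the test helper sketched earlier in this thread:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.test_stream import TestStream

stream = (TestStream()
          .add_elements([1, 2])   # intended as one bundle
          .advance_watermark_to(10)
          .add_elements([3, 4])   # intended as a later bundle
          .advance_watermark_to_infinity())

# TestStream requires streaming mode on the direct runner.
with TestPipeline(options=PipelineOptions(streaming=True)) as p:
  _ = p | stream | beam.ParDo(CallSequenceEnforcingDoFn())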

Contributor Author


Yes, the test already checks that setup is called exactly once, so all that is needed is a way to generate a test such that the runner runs multiple bundles with one DoFn. I'm not sure if that capability exists currently.

Unfortunately the test only checks that teardown is called at most once, and not exactly once. I don't have a good way of checking that.

Member


I believe this will be possible if we set _perform_dofn_pickle_test to false in a test. However, that flag needs to be somehow exposed as an option or monkey-patched.

@aaltay
Member

aaltay commented Apr 19, 2019

I cannot reply to this comment in place

If self._perform_dofn_pickle_test is false, it looks like the DoFn is not cloned but comes directly from self._applied_ptransform.transform.dofn. Is the whole applied_ptransform cloned per bundle?

The applied transform will be a new instance for each bundle (for the direct runner). (See: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/direct/executor.py#L366)

@kennknowles
Copy link
Member

kennknowles commented Apr 19, 2019 via email

That line does not show a cloning of applied_ptransform...

@aaltay
Member

aaltay commented Apr 19, 2019

You are right; I was confused by the evaluator. _perform_dofn_pickle_test defaults to True, and there is only one path that sets it to false, related to SDF evaluation.

I think we should clone by default, probably by getting rid of that flag. I am not sure why we had the flag in the first place, but I am guessing there is an underlying issue.

@yifanmai
Contributor Author

Added a test to ensure that teardown is called on DirectRunner. PTAL

I discovered that metrics counter increments during setup and teardown will not be registered in the pipeline result metrics. I'm not sure if this should be considered a bug, since setup and teardown could happen outside the pipeline run (it happens within the pipeline run for DirectRunner, but not necessarily for other runners). I looked at the metrics API, but could not figure out how to set up the metrics environment correctly for those stages. @aaltay do you have any suggestions on how to handle this?
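
To illustrate the observation, a minimal sketch; the class name is hypothetical, while Metrics.counter is the existing metrics API:

import apache_beam as beam
from apache_beam.metrics import Metrics

class MetricsInLifecycleDoFn(beam.DoFn):
  def setup(self):
    # This increment happens outside bundle execution, so it may never be
    # committed to the pipeline result metrics (the issue described above).
    Metrics.counter(self.__class__, 'setup_calls').inc()

  def process(self, element):
    # Increments inside process are attributed to the bundle and do appear.
    Metrics.counter(self.__class__, 'elements').inc()
    yield element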

@aaltay
Member

aaltay commented Apr 29, 2019

@pabloem or @ajamato will likely have a more informed opinion than myself. Pablo, Alex, what is the expected behavior for metrics in setup/teardown?

@pabloem
Member

pabloem commented May 2, 2019

Hmm my initial gut reaction is to think that metrics are computed / committed as part of a bundle, and setup/teardown are not part of bundle execution.

On the other hand, I can imagine users wanting access to metrics from setup/teardown... So perhaps I'd say to file a bug so we can figure out what's necessary to make metrics available there.

@lukecwik
Member

lukecwik commented May 2, 2019

@pabloem You're correct that setup/teardown are not part of bundle execution, but they are part of the DoFn lifecycle.

Metrics outside of a bundle do make sense since you may want to capture something like memory usage, CPU utilization, caching stats, ... that cross bundle boundaries or are global. A lot of this falls under worker "health" though.

@aaltay
Member

aaltay commented May 2, 2019

@yifanmai In addition to filing a bug related to metrics, could this PR move forward?

@yifanmai
Contributor Author

yifanmai commented May 3, 2019

Yes, I think so. I will open a bug to track the metrics issue in teardown. (I will file it after this PR is merged; otherwise the bug will not make sense.)

I have addressed the earlier feedback regarding adding tests to make sure that teardown is called; @aaltay could you and @kennknowles take another look?

@aaltay
Member

aaltay commented May 6, 2019

The change LGTM. But there are test errors. Please take a look at those.

(If you need help with py3 compatibility, cc @tvalentyn, who could help.)

One error I found in the logs is the following; there might be others as well:

16:57:54 FAILED (SKIP=448, errors=24, failures=5)
16:57:54 ERROR: InvocationError for command '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/target/.tox-py35-cython/py35-cython/bin/python setup.py nosetests' (exited with code 1)
16:57:54 ___________________________________ summary ____________________________________
16:57:54 ERROR: py35-cython: commands failed

16:58:01 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
16:58:01 self.run()
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/portability/local_job_service.py", line 245, in run
16:58:01 self._pipeline_proto)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 281, in run_via_runner_api
16:58:01 return self.run_stages(*self.create_stages(pipeline_proto))
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 357, in run_stages
16:58:01 stage_context.safe_coders)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 521, in run_stage
16:58:01 data_input, data_output)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 1227, in process_bundle
16:58:01 result_future = self._controller.control_handler.push(process_bundle)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 842, in push
16:58:01 response = self.worker.do_instruction(request)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/worker/sdk_worker.py", line 342, in do_instruction
16:58:01 request.instruction_id)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/worker/sdk_worker.py", line 382, in process_bundle
16:58:01 self.bundle_processor_cache.discard(instruction_id)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/worker/sdk_worker.py", line 314, in discard
16:58:01 self.active_bundle_processors[instruction_id].shutdown()
16:58:01 AttributeError: 'tuple' object has no attribute 'shutdown'

@yifanmai force-pushed the yifan/setup branch 2 times, most recently from 1f0078f to 830228a on May 8, 2019 17:07
@yifanmai
Contributor Author

yifanmai commented May 8, 2019

@kennknowles @aaltay tests are passing now. PTAL?

@aaltay
Member

aaltay commented May 8, 2019

Run Python PostCommit

@aaltay
Member

aaltay commented May 10, 2019

Fails with a py3 error, not sure how it is related:
22:36:06 File "/usr/local/lib/python3.5/site-packages/apache_beam/runners/worker/sdk_worker.py", line 613, in _blocking_request
22:36:06 raise RuntimeError(response.error)
22:36:06 RuntimeError: java.lang.ClassCastException: java.lang.String cannot be cast to [B

@aaltay
Member

aaltay commented May 10, 2019

Run Python PostCommit

@yifanmai
Contributor Author

The teardown assert is breaking in Dataflow too (because it only works when the worker runs in-process). Can I delete that assert, or disable it for Dataflow?

14:22:11 Traceback (most recent call last):
14:22:11   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify_PR/src/sdks/python/apache_beam/transforms/dofn_lifecycle_test.py", line 90, in test_dofn_lifecycle
14:22:11     self.assertTrue(_global_teardown_called)
14:22:11 AssertionError: False is not true

@aaltay
Member

aaltay commented May 10, 2019

The teardown assert is breaking in Dataflow too (because it only works when the worker runs in-process). Can I delete that assert, or disable it for Dataflow?

Let's delete it. I do not think it will be easy to assert just for Dataflow runner.

@yifanmai
Contributor Author

Run Python PostCommit

@yifanmai
Contributor Author

Failing at :beam-sdks-python:crossLanguageTests, which is suspicious since it works for me when run locally.

13:26:16 [grpc-default-executor-0] ERROR org.apache.beam.runners.fnexecution.jobsubmission.InMemoryJobService - Encountered Unexpected Exception for Invocation job_de1c9355-5fa3-48c7-bedd-29035031739a
13:26:16 org.apache.beam.vendor.grpc.v1p13p1.io.grpc.StatusException: NOT_FOUND
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.Status.asException(Status.java:534)
13:26:16 	at org.apache.beam.runners.fnexecution.jobsubmission.InMemoryJobService.getInvocation(InMemoryJobService.java:341)
13:26:16 	at org.apache.beam.runners.fnexecution.jobsubmission.InMemoryJobService.getStateStream(InMemoryJobService.java:262)
13:26:16 	at org.apache.beam.model.jobmanagement.v1.JobServiceGrpc$MethodHandlers.invoke(JobServiceGrpc.java:770)
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)

@yifanmai
Contributor Author

Run Python PostCommit

@aaltay merged commit 4629e82 into apache:master on May 14, 2019
@aaltay
Member

aaltay commented May 14, 2019

Thank you @yifanmai!

charithe pushed a commit to shehzaadn-vd/vend-beam that referenced this pull request May 16, 2019
* [BEAM-562] Add setup and teardown to Python DoFn
@yifanmai
Contributor Author

Thanks @aaltay, @kennknowles and @charlesccychen for your help!

I added https://issues.apache.org/jira/browse/BEAM-7340 to track the issue related to metrics in DoFn.teardown, as discussed earlier.

@NikeNano
Contributor

NikeNano commented Aug 2, 2019

@yifanmai is this also expected to work for unbounded sources? I can't get it to work when reading from Pub/Sub...

@aaltay
Member

aaltay commented Aug 2, 2019

@NikeNano it should work with all DoFns regardless of the type of source. If it is not working, please file an issue.

@NikeNano
Contributor

NikeNano commented Aug 2, 2019

OK, thanks @aaltay. I will investigate further and file an issue if I don't get it to work.

@NikeNano
Contributor

NikeNano commented Aug 2, 2019

Created an issue for DoFn.setup: https://issues.apache.org/jira/browse/BEAM-7885
