
[BEAM-562] Add DoFn.setup and DoFn.teardown to Python SDK #7994

Merged: 11 commits merged into apache:master on May 14, 2019

Conversation

yifanmai
Contributor

yifanmai commented Mar 5, 2019

DoFn.setup and DoFn.teardown are currently supported in Java but not in Python. These methods are useful for performing expensive per-thread initialization. This change adds those methods to make the Python SDK more consistent with the Java SDK. It also modifies the direct runner to invoke these methods.
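
As a usage illustration, here is a minimal sketch of a DoFn using the new methods. The class and the _connect helper are hypothetical; only setup, process, and teardown are the API being added.

import apache_beam as beam

class ExpensiveResourceDoFn(beam.DoFn):
  def setup(self):
    # Called once per DoFn instance before any bundle is processed: a
    # good place for expensive initialization, such as opening a client
    # connection that should be reused across bundles.
    self._client = _connect()  # hypothetical helper

  def process(self, element):
    yield self._client.lookup(element)

  def teardown(self):
    # Called once per instance when the runner retires it. Runners treat
    # teardown as best-effort, so do not rely on it for correctness.
    self._client.close()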

Post-Commit Tests Status (on master branch)

[Build-status badge matrix for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners.]

See .test-infra/jenkins/README for the trigger phrases, statuses, and links of all Jenkins jobs.

@yifanmai
Contributor Author

yifanmai commented Mar 5, 2019

@charlesccychen PTAL

Contributor

@charlesccychen left a comment


Thanks!

Member

@kennknowles left a comment


Only reviewed to the point of my last comment, not further.

sdks/python/apache_beam/runners/common.py
@@ -619,6 +619,7 @@ def start_bundle(self):
step_name=self._applied_ptransform.full_label,
state=DoFnState(self._counter_factory),
user_state_context=self.user_state_context)
self.runner.setup()
Member


This should be called once per instance, not in start_bundle.
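
For reference, a sketch of the intended ordering per the DoFn lifecycle (illustrative pseudocode, not actual runner code):

# Per DoFn instance, not per bundle:
#
#   dofn.setup()             # once per instance
#   for each bundle assigned to this instance:
#     dofn.start_bundle()
#     dofn.process(element)  # once per element
#     dofn.finish_bundle()
#   dofn.teardown()          # once per instance, best-effort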

Contributor


In the DirectRunner, we currently have one instance per bundle. If we want these instances to be long-lived, we need a larger refactoring here.

Contributor Author


Yes, unfortunately the DirectRunner does not cache instances between bundles. Fortunately this code path is only used by DirectRunner and not the other runners.

Member


If self._perform_dofn_pickle_test is false, it looks like the DoFn is not cloned but comes directly from self._applied_ptransform.transform.dofn. Is the whole applied_ptransform cloned per bundle?

Contributor Author


I added some pdb breakpoints and confirmed that the DoFn is cloned during the test and that the lifecycle methods are called on the clones. See this thread for where the cloning occurs.

@yifanmai
Contributor Author

Hi @kennknowles, would you be willing to be the primary reviewer on this since @charlesccychen is unavailable for a while?

@kennknowles
Member

Sure, I would be happy to.

@kennknowles self-requested a review on April 15, 2019 22:39
@yifanmai
Contributor Author

Thank you! Could you also let me know which of your previous review comments are still actionable?

sdks/python/apache_beam/transforms/dofn_lifecycle_test.py
with TestPipeline() as p:
(p
| 'Start' >> beam.Create([1, 2, 3])
| 'Do' >> beam.ParDo(CallSequenceEnforcingDoFn()))
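
For context, a minimal sketch of what a call-sequence-enforcing DoFn could look like; this is an assumed shape, and the actual CallSequenceEnforcingDoFn in the PR may differ:

import apache_beam as beam

class CallSequenceEnforcingDoFn(beam.DoFn):
  def __init__(self):
    self._setup_called = False
    self._start_bundle_called = False

  def setup(self):
    assert not self._setup_called, 'setup should not be called twice'
    self._setup_called = True

  def start_bundle(self):
    assert self._setup_called, 'setup should have been called'
    self._start_bundle_called = True

  def process(self, element):
    assert self._start_bundle_called, 'start_bundle should have been called'
    yield element
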
Member


How about adding an assert after the end of the with: block, to check that teardown was actually called?

Contributor Author


Would that actually work? My understanding is that with remote runners, the DoFn instance here and the one on the worker are different physical instances, and there isn't a way to check for state changes on the DoFn on the worker to figure out if teardown was called or not.

Also, when I was doing print debugging, it looked like teardown was not being called on the direct runner.

Member

@aaltay Apr 19, 2019


It would not work with remote runners, but it would work with DirectRunner. Isn't DirectRunner the default for TestPipeline?

Also, when I was doing print debugging, it looked like teardown was not being called on the direct runner.

That looks like a bug in this PR. (Or maybe not, since teardown is not guaranteed to be called, but I think we should exercise these paths in DirectRunner.)

Contributor Author


I see. If we only run this test for DirectRunner, then I can add the extra assert.

Yes, I will investigate and see if DirectRunner can be made to call teardown.

Contributor Author


Print debugging shows that teardown is being correctly called. However, the assert in this location fails because the worker is using a different unpickled copy of the DoFn.

This means that it is difficult to get a handle to the DoFn instance that the direct runner worker is actually using.
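
A minimal illustration of why asserting on the local instance fails; plain pickle is used here for clarity, while Beam actually uses its own pickler:

import pickle

local_dofn = CallSequenceEnforcingDoFn()
worker_dofn = pickle.loads(pickle.dumps(local_dofn))  # what the worker runs
assert worker_dofn is not local_dofn  # lifecycle calls mutate only the clone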

Contributor Author


Added a test that checks that a global flag is set. This assumes that the worker is running in the same process as the test.
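
A sketch of that approach, assuming the worker runs in the same process as the test; the DoFn class name is hypothetical, while the _global_teardown_called flag matches the test referenced later in this thread:

import apache_beam as beam

_global_teardown_called = False

class TeardownTrackingDoFn(beam.DoFn):  # hypothetical name
  def process(self, element):
    yield element

  def teardown(self):
    # Flip a module-level flag so the test can observe the call, as long
    # as the worker shares the test's process.
    global _global_teardown_called
    _global_teardown_called = True

After the pipeline's with block exits, the test can assert that _global_teardown_called is True.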



@attr('ValidatesRunner')
class DoFnLifecycleTest(unittest.TestCase):
Member


Does Python have TestStream implemented in the direct runner? Or something equivalent for testing directly against the SDK harness? It would be good to have a test that explicitly crosses multiple bundles, since the purpose of setup / teardown is for operations that cross bundles.

Member


TestStream is implemented in the direct runner, but it would not use the SDK harness.

Also, Yifan's implementation here adds the APIs, but does not really change the lifecycle to reuse DoFns.

Nevertheless, it would be good to run a test with the SDK harness.
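
A hedged sketch of such a test using TestStream; note that bundle boundaries are runner-dependent, so delivering elements in separate events does not guarantee separate bundles. CallSequenceEnforcingDoFn refers to the test helper sketched earlier in this thread:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.test_stream import TestStream

stream = (TestStream()
          .add_elements([1, 2])   # intended as one bundle
          .advance_watermark_to(10)
          .add_elements([3, 4])   # intended as a later bundle
          .advance_watermark_to_infinity())

# TestStream requires streaming mode on the direct runner.
with TestPipeline(options=PipelineOptions(streaming=True)) as p:
  _ = p | stream | beam.ParDo(CallSequenceEnforcingDoFn())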

Contributor Author


Yes, the test already checks that setup is called exactly once, so all that is needed is a way to generate a test such that the runner runs multiple bundles with one DoFn. I'm not sure if that capability exists currently.

Unfortunately the test only checks that teardown is called at most once, and not exactly once. I don't have a good way of checking that.

Member


I believe this will be possible if we set _perform_dofn_pickle_test to false in a test. However, that flag needs to be somehow exposed as an option or monkey-patched.

@aaltay
Member

aaltay commented Apr 19, 2019

I cannot reply to this comment in place

If self._perform_dofn_pickle_test is false, it looks like the DoFn is not cloned but comes directly from self._applied_ptransform.transform.dofn. Is the whole applied_ptransform cloned per bundle?

The applied transform will be a new instance for each bundle (for the direct runner). (See: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/direct/executor.py#L366)

@kennknowles
Copy link
Member

kennknowles commented Apr 19, 2019 via email

That line does not show a cloning of applied_ptransform...

@aaltay
Member

aaltay commented Apr 19, 2019

You are right; I was confused by the evaluator. _perform_dofn_pickle_test defaults to True, and there is only one path that sets it to false, related to SDF evaluation.

I think we should clone by default, probably by getting rid of that flag. I am not sure why we had the flag in the first place, but I am guessing there is an underlying issue.

@yifanmai
Contributor Author

Added a test to ensure that teardown is called on DirectRunner. PTAL

I discovered that metrics counter increments during setup and teardown will not be registered in the pipeline result metrics. I'm not sure if this should be considered a bug, since setup and teardown could happen outside the pipeline run (it happens within the pipeline run for DirectRunner, but not necessarily for other runners). I looked at the metrics API, but could not figure out how to set up the metrics environment correctly for those stages. @aaltay do you have any suggestions on how to handle this?
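
To illustrate the observation, a minimal sketch; the class name is hypothetical, while Metrics.counter is the existing metrics API:

import apache_beam as beam
from apache_beam.metrics import Metrics

class MetricsInLifecycleDoFn(beam.DoFn):
  def setup(self):
    # This increment happens outside bundle execution, so it may never be
    # committed to the pipeline result metrics (the issue described above).
    Metrics.counter(self.__class__, 'setup_calls').inc()

  def process(self, element):
    # Increments inside process are attributed to the bundle and do appear.
    Metrics.counter(self.__class__, 'elements').inc()
    yield element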

@aaltay
Member

aaltay commented Apr 29, 2019

@pabloem or @ajamato will likely have a more informed opinion than myself. Pablo, Alex, what is the expected behavior for metrics in setup/teardown?

@pabloem
Member

pabloem commented May 2, 2019

Hmm my initial gut reaction is to think that metrics are computed / committed as part of a bundle, and setup/teardown are not part of bundle execution.

On the other hand, I can imagine users wanting access to metrics from setup/teardown... So perhaps I'd say to file a bug so we can figure out what's necessary to make metrics available there.

@lukecwik
Member

lukecwik commented May 2, 2019

@pabloem You're correct that setup/teardown are not part of bundle execution, but they are part of the DoFn lifecycle.

Metrics outside of a bundle do make sense since you may want to capture something like memory usage, CPU utilization, caching stats, ... that cross bundle boundaries or are global. A lot of this falls under worker "health" though.

@aaltay
Member

aaltay commented May 2, 2019

@yifanmai In addition to filing a bug related to metrics, could this PR move forward?

@yifanmai
Contributor Author

yifanmai commented May 3, 2019

Yes, I think so. I will open a bug to track the metrics issue in teardown. (I will file it after this PR is merged; otherwise the bug will not make sense.)

I have addressed the earlier feedback regarding adding tests to make sure that teardown is called; @aaltay could you and @kennknowles take another look?

@aaltay
Member

aaltay commented May 6, 2019

The change LGTM. But there are test errors. Please take a look at those.

(If you need help with py3 compatibility, cc @tvalentyn, who could help.)

One error I found in the logs is the following; there might be others as well:

16:57:54 FAILED (SKIP=448, errors=24, failures=5)
16:57:54 ERROR: InvocationError for command '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/target/.tox-py35-cython/py35-cython/bin/python setup.py nosetests' (exited with code 1)
16:57:54 ___________________________________ summary ____________________________________
16:57:54 ERROR: py35-cython: commands failed

16:58:01 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
16:58:01 self.run()
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/portability/local_job_service.py", line 245, in run
16:58:01 self._pipeline_proto)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 281, in run_via_runner_api
16:58:01 return self.run_stages(*self.create_stages(pipeline_proto))
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 357, in run_stages
16:58:01 stage_context.safe_coders)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 521, in run_stage
16:58:01 data_input, data_output)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 1227, in process_bundle
16:58:01 result_future = self._controller.control_handler.push(process_bundle)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 842, in push
16:58:01 response = self.worker.do_instruction(request)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/worker/sdk_worker.py", line 342, in do_instruction
16:58:01 request.instruction_id)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/worker/sdk_worker.py", line 382, in process_bundle
16:58:01 self.bundle_processor_cache.discard(instruction_id)
16:58:01 File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/build/srcs/sdks/python/apache_beam/runners/worker/sdk_worker.py", line 314, in discard
16:58:01 self.active_bundle_processors[instruction_id].shutdown()
16:58:01 AttributeError: 'tuple' object has no attribute 'shutdown'

@yifanmai force-pushed the yifan/setup branch 2 times, most recently from 1f0078f to 830228a on May 8, 2019 17:07
@yifanmai
Contributor Author

yifanmai commented May 8, 2019

@kennknowles @aaltay tests are passing now. PTAL?

@aaltay
Member

aaltay commented May 8, 2019

Run Python PostCommit

@aaltay
Member

aaltay commented May 10, 2019

Fails with a py3 error, not sure how it is related:
22:36:06 File "/usr/local/lib/python3.5/site-packages/apache_beam/runners/worker/sdk_worker.py", line 613, in _blocking_request
22:36:06 raise RuntimeError(response.error)
22:36:06 RuntimeError: java.lang.ClassCastException: java.lang.String cannot be cast to [B

@aaltay
Member

aaltay commented May 10, 2019

Run Python PostCommit

@yifanmai
Contributor Author

The teardown assert is breaking in Dataflow too (because it only works when the worker runs in-process). Can I delete that assert, or disable it for Dataflow?

14:22:11 Traceback (most recent call last):
14:22:11   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python3_Verify_PR/src/sdks/python/apache_beam/transforms/dofn_lifecycle_test.py", line 90, in test_dofn_lifecycle
14:22:11     self.assertTrue(_global_teardown_called)
14:22:11 AssertionError: False is not true

@aaltay
Member

aaltay commented May 10, 2019

The teardown assert is breaking in Dataflow too (because it only works when the worker runs in-process). Can I delete that assert, or disable it for Dataflow?

Let's delete it. I do not think it will be easy to assert just for Dataflow runner.

@yifanmai
Contributor Author

Run Python PostCommit

@yifanmai
Contributor Author

Failing at :beam-sdks-python:crossLanguageTests, which is suspicious since it works for me when run locally.

13:26:16 [grpc-default-executor-0] ERROR org.apache.beam.runners.fnexecution.jobsubmission.InMemoryJobService - Encountered Unexpected Exception for Invocation job_de1c9355-5fa3-48c7-bedd-29035031739a
13:26:16 org.apache.beam.vendor.grpc.v1p13p1.io.grpc.StatusException: NOT_FOUND
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.Status.asException(Status.java:534)
13:26:16 	at org.apache.beam.runners.fnexecution.jobsubmission.InMemoryJobService.getInvocation(InMemoryJobService.java:341)
13:26:16 	at org.apache.beam.runners.fnexecution.jobsubmission.InMemoryJobService.getStateStream(InMemoryJobService.java:262)
13:26:16 	at org.apache.beam.model.jobmanagement.v1.JobServiceGrpc$MethodHandlers.invoke(JobServiceGrpc.java:770)
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
13:26:16 	at org.apache.beam.vendor.grpc.v1p13p1.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)

@yifanmai
Contributor Author

Run Python PostCommit

@aaltay merged commit 4629e82 into apache:master on May 14, 2019
@aaltay
Member

aaltay commented May 14, 2019

Thank you @yifanmai!

charithe pushed a commit to shehzaadn-vd/vend-beam that referenced this pull request May 16, 2019
* [BEAM-562] Add setup and teardown to Python DoFn
@yifanmai
Contributor Author

Thanks @aaltay, @kennknowles and @charlesccychen for your help!

I added https://issues.apache.org/jira/browse/BEAM-7340 to track the issue related to metrics in DoFn.teardown, as discussed earlier.

@NikeNano
Contributor

NikeNano commented Aug 2, 2019

@yifanmai is this also expected to work for unbounded sources? I can't get it to work when reading from Pub/Sub...

@aaltay
Member

aaltay commented Aug 2, 2019

@NikeNano it should work with all DoFns regardless of the type of source. If it is not working, please file an issue.

@NikeNano
Contributor

NikeNano commented Aug 2, 2019

OK, thanks @aaltay. I will investigate further and file an issue if I don't get it to work.

@NikeNano
Contributor

NikeNano commented Aug 2, 2019

Created an issue for DoFn.setup: https://issues.apache.org/jira/browse/BEAM-7885
