Description
The Databricks task classes are not thread safe and therefore cannot be mapped over when using a non-sequential executor like LocalDaskExecutor. This is because they store the run ID for the Databricks job they are polling as an attribute on the task instance. If run() is called on the task instance before the prior call has finished, the attribute is overwritten and the concurrently running tasks return the wrong result.
Expected Behavior
When mapping N-fold over a DatabricksSubmitRun or DatabricksRunNow task, I expect to get N distinct Databricks job run IDs in the result. Instead I get N copies of the same job run ID (whichever one started last).
Reproduction
This simplified version of the code illustrates the problem:
```python
from random import random
from time import sleep

import prefect
from prefect.executors import LocalDaskExecutor


class MyTask(prefect.Task):
    def run(self, person: str = None) -> str:
        # submit to API and get a run id
        self.run_id = person
        # poll until job is done
        sleep(random())
        # return run id, but since it was stored as an attribute,
        # other threads may have changed it while polling!
        return self.run_id


my_task = MyTask()


@prefect.task
def reducer(result):
    return sorted(result)


with prefect.Flow("my_flow", executor=LocalDaskExecutor()) as flow:
    mapped_result = my_task.map(person=["arthur", "ford", "marvin"])
    reduced_result = reducer(mapped_result)

state = flow.run()
assert state.result[reduced_result].result == sorted(["arthur", "ford", "marvin"])
```
This will fail because the result is ["marvin", "marvin", "marvin"] or similar.
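The race does not depend on Prefect at all. The following minimal sketch (an illustration, not code from the issue) reproduces the same last-writer-wins behavior with plain threads; a barrier stands in for the polling delay and forces every call to finish its "submit" step before any call reads the shared attribute back:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class UnsafeTask:
    """Mimics the unsafe pattern: run id stored on the shared instance."""

    def run(self, person, barrier):
        self.run_id = person  # shared attribute: last writer wins
        barrier.wait()        # simulate overlapping polling across threads
        return self.run_id    # may now hold another thread's id


task = UnsafeTask()                      # one shared instance, as when mapping
people = ["arthur", "ford", "marvin"]
barrier = threading.Barrier(len(people))

with ThreadPoolExecutor(max_workers=len(people)) as pool:
    results = list(pool.map(lambda p: task.run(p, barrier), people))

# Every call returns the same id, even though each "submitted" its own.
assert len(set(results)) == 1
```

Because all writes complete before the barrier releases any reader, every thread deterministically observes whichever write landed last, which is exactly the `["marvin", "marvin", "marvin"]`-style result described above.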
MyTask is just a mockup, but both DatabricksSubmitRun and DatabricksRunNow have this same structure (mutating task instance attributes in the run method).
Workaround
This led to some unexpected results in our application. We have worked around it either by discontinuing our use of LocalDaskExecutor, or by making sure to instantiate the Databricks task class instances in a scope where they cannot be shared (e.g., inside of a FunctionTask).
I think the long-term fix would be to use a local variable for run_id instead of assigning to attributes of self in the run method of these task classes.
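A sketch of that fix, using the same barrier-based harness as the mockup above (illustrative only, not the actual Databricks task code): keeping the run ID in a local variable ties it to the individual call's stack frame, so concurrent mapped runs cannot clobber each other's state.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SafeTask:
    """Proposed pattern: run id kept local to run(), never stored on self."""

    def run(self, person, barrier):
        run_id = person   # local variable: private to this call
        barrier.wait()    # other mapped runs may proceed concurrently
        return run_id     # always this call's own id


task = SafeTask()                        # still one shared instance
people = ["arthur", "ford", "marvin"]
barrier = threading.Barrier(len(people))

with ThreadPoolExecutor(max_workers=len(people)) as pool:
    results = list(pool.map(lambda p: task.run(p, barrier), people))

# Each call returns its own distinct id.
assert sorted(results) == sorted(people)
```

The instance can still be shared freely because run() no longer mutates it; all per-run state lives in locals.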
anna-geller added a commit to anna-geller/prefect that referenced this issue on Oct 6, 2021:
* Prevent run_id to be modified outside of run() method
to make it work with mapped tasks. Details in the issue: #4958
* Create issue4958.yaml
* Update test_databricks.py