Databricks task instances cannot be mapped #4958

Closed
baldwint opened this issue Sep 11, 2021 · 0 comments

Comments

Description

The Databricks task classes are not thread safe and therefore cannot be mapped over when using a non-sequential executor like LocalDaskExecutor. This is because they store the run ID of the Databricks job they are polling as an attribute on the task instance. If run() is called on the same task instance before a prior call has finished, the attribute is overwritten and the concurrently running tasks return the wrong result.

Expected Behavior

When mapping N-fold over a DatabricksSubmitRun or DatabricksRunNow task, I expect to get N distinct Databricks job run IDs in the result. Instead I get N copies of the same job run ID (whichever one started last).

Reproduction

This simplified version of the code illustrates the problem:

from random import random
from time import sleep

import prefect
from prefect.executors import LocalDaskExecutor


class MyTask(prefect.Task):
    def run(self, person: str = None) -> str:

        # submit to API and get a run id
        self.run_id = person

        # poll until job is done
        sleep(random())

        # return run id, but since it was stored as an attribute,
        # other threads may have changed it while polling!
        return self.run_id


my_task = MyTask()


@prefect.task
def reducer(result):
    return sorted(result)


with prefect.Flow("my_flow", executor=LocalDaskExecutor()) as flow:
    mapped_result = my_task.map(person=["arthur", "ford", "marvin"])
    reduced_result = reducer(mapped_result)

state = flow.run()

assert state.result[reduced_result].result == sorted(["arthur", "ford", "marvin"])

This will fail because the result is ["marvin", "marvin", "marvin"] or similar.

MyTask is just a mockup, but both DatabricksSubmitRun and DatabricksRunNow have this same structure (mutating task instance attributes in the run method).

Workaround

This led to some unexpected results in our application. We have worked around it either by discontinuing our use of LocalDaskExecutor, or by making sure to instantiate the Databricks task class instances in a scope where they cannot be shared (e.g., inside of a FunctionTask).
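The second workaround can be sketched in plain Python, without Prefect. The `PollingTask` class below is a hypothetical stand-in for the Databricks tasks (it has the same stores-run_id-on-self structure as the MyTask mockup above); the point is that constructing a fresh instance per call means the attribute is never shared between threads:

```python
from concurrent.futures import ThreadPoolExecutor
from random import random
from time import sleep


class PollingTask:
    """Hypothetical stand-in for a Databricks task that stores run_id on self."""

    def run(self, person: str) -> str:
        self.run_id = person   # shared mutable state on the instance
        sleep(random() / 20)   # simulate polling the API
        return self.run_id     # may have been overwritten by another thread


def run_isolated(person: str) -> str:
    # A fresh instance per call, so the run_id attribute is
    # private to this call and cannot be clobbered concurrently.
    return PollingTask().run(person)


people = ["arthur", "ford", "marvin"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_isolated, people))

assert sorted(results) == ["arthur", "ford", "marvin"]
```

This mirrors what wrapping the instantiation inside a FunctionTask achieves: each mapped run gets its own task object, so the race disappears even though the class itself is unchanged.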

I think the long-term fix would be to use a local variable for run_id instead of assigning to attributes of self in the run method of these task classes.
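As a minimal sketch of that fix (pure Python, no Prefect; `FixedTask` is an illustrative name, not the actual class), keeping the run ID in a local variable makes even a single shared instance safe to map over concurrently:

```python
from concurrent.futures import ThreadPoolExecutor
from random import random
from time import sleep


class FixedTask:
    """Sketch of the proposed fix: run_id lives in a local variable."""

    def run(self, person: str) -> str:
        run_id = person        # local to this call's stack frame
        sleep(random() / 20)   # simulate polling the API
        return run_id          # other threads cannot overwrite a local


people = ["arthur", "ford", "marvin"]
task = FixedTask()             # one shared instance is now safe
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(task.run, people))

assert sorted(results) == ["arthur", "ford", "marvin"]
```

Because each thread's `run()` call gets its own stack frame, the locals are isolated per invocation, which is exactly the property the instance attribute lacked.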

anna-geller added a commit to anna-geller/prefect that referenced this issue Oct 6, 2021
to make it work with mapped tasks. Details in the issue: PrefectHQ#4958
zanieb pushed a commit that referenced this issue Oct 8, 2021
* Prevent run_id to be modified outside of run() method

to make it work with mapped tasks. Details in the issue: #4958

* Create issue4958.yaml

* Update test_databricks.py
@zanieb zanieb closed this as completed Oct 8, 2021
@zanieb zanieb mentioned this issue Oct 21, 2021
lance0805 pushed a commit to hyl2015/prefect that referenced this issue Aug 2, 2022
* Prevent run_id to be modified outside of run() method

to make it work with mapped tasks. Details in the issue: PrefectHQ#4958

* Create issue4958.yaml

* Update test_databricks.py