Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid mutating collections unless absolutely necessary #14048

Merged
merged 12 commits into from
Jun 17, 2024

Conversation

jlowin
Copy link
Member

@jlowin jlowin commented Jun 14, 2024

Today, Prefect uses the visit_collections function to resolve task arguments into futures. This is because the future might be passed directly (e.g. f(x=Future())) or in a deeply nested collection or BaseModel attribute (or both).

Currently, visit collections recursively visits every member of every collection in a DFS, calling the visit_fn on each one, and then rebuilds a copy of the data structure. For example, if the visit_fn resolves futures to results and you pass dict(x=[1, 2], y=[1, 2, Future(3)]) as the arg, then a new object dict(x=[1, 2], y=[1, 2, 3]) will be generated and passed to the task.

The problem is that the object is copied even if it doesn't have any futures in it.

from prefect import task

@task
def f(x):
    return x

val = dict(x=1)

assert f(val) == val # true
assert f(val) is val # false, very surprisingly

This PR modifies the behavior of visit_collections to make the smallest number of copies possible. The way it does this is by examining the return value of the visit_fn. if visit_fn(x) is x for all values of a collection, then the collection is left unchanged. But if one of those values changes, say because x is a Future and visit_fn resolves it to a value, then the collection is copied. So in addition to passing the test above, with this strategy and a hypothetical resolve_futures fn:

val = [[1, 2], [3, 4]]
args = resolve_futures(val)

assert args[0] is val[0]
assert args[1] is val[1]
assert args is val

but:

val = [[1, 2], [3, Future(4)]
args = resolve_futures(val)

assert args == [[1, 2], [3, 4]]

# the container without a future is untouched
assert args[0] is val[0]

# the container with the future is a copy
assert args[1] is not val[1]

# the parent container is a copy, because one of its elements was modified
assert args is not val

Note that this only affects identity, not equality!

What brought this to my attention is I was modifying mutable objects in one task that then appeared, unmodified, in a different task, despite the fact that everything was in-memory. Since I wasn't using futures, I had no expectation that I was working with copies of data.

This implementation of visit_collections also fixes the following behaviors:

  • visit_collections will no longer iterate over a generator. Previously, it automatically converted generators to lists. Since generators can only be consumed once, and may be infinite, this is almost always going to be the wrong behavior, at best silently exhausting the generator before the user can call it, and at worst blocking forever!
  • visit_collections will no longer recurse endlessly on collections with mutual references, allowing us to fix two xfail tests.

@jlowin jlowin requested a review from a team as a code owner June 14, 2024 22:26
Copy link
Collaborator

@chrisguidry chrisguidry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Love the new tests

@jlowin jlowin merged commit 65fcefe into main Jun 17, 2024
26 checks passed
@jlowin jlowin deleted the collections_avoid_mutation branch June 17, 2024 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants