Input cacheing #78

cicdw · 2018-07-27T19:31:47Z

Addresses half of #56 w/ input cacheing.

Use cases:

Retrying states
manual_only triggers

Also updates Parameter repr.

Question for @jlowin: should we just let cached_inputs be an attribute of the State class (defaulting to None)? Reaching into the data attribute sometimes feels like unnecessary work.

jlowin · 2018-07-27T20:25:30Z

My gut concern with making it an attribute of State is that right now we have uniform attributes across all states (data, message, and _timestamp), which is nice.

That said, I'm not sure I see a downside because we can check a state's type before trying to access its unique attributes -- by which I mean any attributes other than data or message. For Pending states, that would include cached_inputs (and later cached_outputs); for Retrying states, that would additionally include retry_time, etc.
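A minimal sketch of the type-check-then-access pattern described above. The classes are simplified, hypothetical stand-ins rather than Prefect's actual State definitions, and making Retrying a subclass of Pending is an assumption made only for illustration.

```python
import datetime
from typing import Any, Dict, Optional


class State:
    """Hypothetical base class holding only the attributes every state shares."""

    def __init__(self, data: Any = None, message: Optional[str] = None) -> None:
        self.data = data
        self.message = message
        self._timestamp = datetime.datetime.now(datetime.timezone.utc)


class Pending(State):
    """Hypothetical Pending state that additionally carries cached inputs."""

    def __init__(
        self,
        data: Any = None,
        message: Optional[str] = None,
        cached_inputs: Optional[Dict[str, Any]] = None,
    ) -> None:
        super().__init__(data=data, message=message)
        self.cached_inputs = cached_inputs


class Retrying(Pending):
    """Hypothetical Retrying state: cached inputs plus a retry time."""

    def __init__(
        self,
        data: Any = None,
        message: Optional[str] = None,
        cached_inputs: Optional[Dict[str, Any]] = None,
        retry_time: Optional[datetime.datetime] = None,
    ) -> None:
        super().__init__(data=data, message=message, cached_inputs=cached_inputs)
        self.retry_time = retry_time


def get_cached_inputs(state: State) -> Dict[str, Any]:
    """Check the state's type before touching a type-specific attribute."""
    if isinstance(state, Pending) and state.cached_inputs is not None:
        return dict(state.cached_inputs)
    return {}
```

With this layout, callers never need to reach into `data` to find the cache, and states that have no cache simply lack the attribute.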

jlowin · 2018-07-27T20:27:48Z

src/prefect/engine/flow_runner.py

@@ -179,6 +182,9 @@ def get_run_state(
                            lambda s: s.data, task_states[edge.upstream_task]
                        )

+                if task in start_tasks and hasattr(task_states.get(task), "data"):
+                    upstream_inputs.update(task_states[task].data.get("cached_inputs"))


Comment that may not matter, just thinking here: is it better to use upstream_inputs.update(), which could let any existing data in upstream_inputs stick around, or to wholesale replace upstream_inputs with whatever is in cached_inputs?

I considered that; I don't think I have enough experience to make an informed decision here, which is why I went with prioritizing the cache without fully clearing upstream_inputs. Maybe there will one day be a situation where task is a start_task but for some reason the cached_inputs key is empty, or where upstream_inputs has more data in it than the cache tracks? Within our current setup it won't break anything either way.
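To make the two options concrete, here is a small sketch using plain dicts. The helper names `merge_cached` and `replace_with_cached` are hypothetical and do not appear in the PR. It also guards against a missing cache, since `dict.update(None)` raises a `TypeError`.

```python
from typing import Any, Dict, Optional


def merge_cached(
    upstream_inputs: Dict[str, Any], cached_inputs: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Cache takes priority, but keys missing from the cache survive."""
    merged = dict(upstream_inputs)
    # `or {}` guards against a missing cache: update(None) raises TypeError.
    merged.update(cached_inputs or {})
    return merged


def replace_with_cached(
    upstream_inputs: Dict[str, Any], cached_inputs: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Cache wholly replaces the upstream inputs whenever a cache exists."""
    if cached_inputs is None:
        return dict(upstream_inputs)
    return dict(cached_inputs)


upstream = {"x": 1, "y": 2}
cached = {"x": 10}
print(merge_cached(upstream, cached))         # {'x': 10, 'y': 2}
print(replace_with_cached(upstream, cached))  # {'x': 10}
```

The two only differ when `upstream_inputs` holds keys the cache does not track, which is the situation speculated about above.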

jlowin · 2018-07-27T20:29:27Z

src/prefect/engine/flow_runner.py

@@ -179,6 +182,9 @@ def get_run_state(
                            lambda s: s.data, task_states[edge.upstream_task]
                        )

+                if task in start_tasks and hasattr(task_states.get(task), "data"):


A State should always have a data attribute, so I'm not sure the second check in this line is ever going to fail (as long as it is, in fact, a state). However, the data attribute might not be a dict, which would cause the next line to error even if this check passed. To be clear, that would be unexpected (if you're telling the system to run a start_task, I think it would be wrong for its state to be anything other than a Pending state with a dict data attribute) -- but certainly possible.

If task is not actually present in task_states, then task_states.get(task) will be None. Basically the check is equivalent to task in task_states -- maybe I should change it to that.
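A short sketch of the point made in this exchange, assuming a hypothetical `FakeState` stand-in that always has a `data` attribute: the `hasattr` check is then equivalent to a membership test, while a non-dict `data` needs its own guard.

```python
from typing import Any, Dict


class FakeState:
    """Hypothetical stand-in for a State: it always has a `data` attribute."""

    def __init__(self, data: Any = None) -> None:
        self.data = data


task_states: Dict[str, FakeState] = {
    "a": FakeState(data={"cached_inputs": {"x": 1}}),
    "b": FakeState(data=42),  # unexpected, but possible: data is not a dict
}


def present_via_hasattr(task: str) -> bool:
    # task_states.get(task) is None for unknown tasks, and None has no `data`.
    return hasattr(task_states.get(task), "data")


def present_via_membership(task: str) -> bool:
    return task in task_states


def cached_inputs_for(task: str) -> Dict[str, Any]:
    """Return cached inputs, tolerating a missing task or non-dict data."""
    data = getattr(task_states.get(task), "data", None)
    if isinstance(data, dict):
        return data.get("cached_inputs") or {}
    return {}
```

Both presence checks agree for every key here, so the membership test is the more direct spelling; the `isinstance` guard is what actually protects the following line.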

jlowin · 2018-07-27T20:31:12Z

tests/engine/test_flow_runner.py

+        assert isinstance(state, Success)
+
+
+class TestInputCacheing:


Minor spelling error here and a couple other places - caching vs cacheing

Interesting; I thought it looked weird, but Google told me cacheing was the correct gerund form. I'll change it though. 👍

jlowin · 2018-07-27T20:33:27Z

src/prefect/core/task.py

@@ -199,6 +199,9 @@ def __init__(self, name: str, default: Any = None, required: bool = True) -> Non

        super().__init__(name=name, slug=name)

+    def __repr__(self) -> str:


cicdw · 2018-07-27T20:50:27Z

Re: uniform attributes across states; while I definitely understand your concern, functionally it doesn't feel very different from state.data.get('unique_key') or if 'unique_key' in state.data, except it removes one level of depth and allows for the case when state.data is None. Does it have any consequences on the server side (cc: @joshmeek )?

EDIT: a benefit of having everything in data is that it makes for convenient reprs: state.data displays most of the relevant info.

joshmeek · 2018-07-27T21:44:19Z

@cicdw No consequences on the server side at all! As long as the State object is still serializable everything will work fine.

jlowin · 2018-07-27T21:51:48Z

After looking through the PR I think I understand why constantly accessing data as a dict felt uncomfortable... it's like how it felt accessing states before we turned them into classes. I think I'm +1 for making them unique attributes.

⚫ Remove attributes from base State class oof

State class refactor

jlowin · 2018-07-30T18:29:19Z

src/prefect/engine/state.py

+            for attr in self.__dict__:
+                if attr.startswith("_") or attr == "message":
+                    continue
+                eq &= getattr(self, attr) == getattr(other, attr)


This will raise errors if someone (for whatever reason) added an after-the-fact attribute to their state object that doesn't exist on the other one. I suggest getattr(self, attr) == getattr(other, attr, object()) -- this will allow missing (or extra) attributes to be compared safely, and object() is never equal to anything else so it won't give false matches.

Good idea; done.
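A self-contained sketch of the suggested comparison, assuming a simplified, hypothetical State class. The `type(...)` check at the top is an assumption, since the surrounding method body is not shown in the diff.

```python
from typing import Any, Optional


class State:
    """Hypothetical, simplified state used only to illustrate __eq__."""

    def __init__(self, result: Any = None, message: Optional[str] = None) -> None:
        self.result = result
        self.message = message
        self._timestamp = None  # private attributes are skipped in comparisons

    def __eq__(self, other: object) -> bool:
        # Assumption: states of different types are never equal.
        if type(self) is not type(other):
            return False
        eq = True
        for attr in self.__dict__:
            if attr.startswith("_") or attr == "message":
                continue
            # A fresh object() compares unequal to everything, so an attribute
            # missing on `other` yields False instead of an AttributeError.
            eq &= getattr(self, attr) == getattr(other, attr, object())
        return eq


a = State(result=1, message="first")
b = State(result=1, message="second")
print(a == b)      # True: message is ignored
a.extra = "added after the fact"
print(a == b)      # False, and no AttributeError is raised
```

Note that the loop only walks `self.__dict__`, so in this sketch an extra attribute that exists only on `other` is not seen from `self`'s side.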

jlowin · 2018-07-30T18:40:31Z

src/prefect/engine/task_runner.py


            # a DONTRUN signal at any point breaks the chain and we return
            # the most recently computed state
-            except signals.DONTRUN:
+            except signals.DONTRUN as exc:
+                if "manual_only" in str(exc):


This feels fragile. I totally get why it's here and what it's accomplishing, but I think the fact that it's here suggests that we have a new type of state that needs managing.

Maybe the manual_only trigger should be raising a new TRIGGERDONTRUN signal (analogous to TRIGGERFAILED) which could be handled explicitly in handle_signals(). This new signal could automatically build and return a Pending(cached_inputs=inputs) state, rather than hoping to trap the error message itself.

If we want to punt on this (because this approach will work at the moment), I'm fine with that too. We can open a new issue to readdress it later. The core functionality here is more important!

Yeah, that makes sense; I think I'd rather punt on this and think about it a little more. Having handle_signals do something like inputs = kwargs.get('inputs') doesn't feel great either.
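For reference, a rough sketch of what the dedicated-signal idea could look like. Everything here is hypothetical: `TRIGGERDONTRUN` is only a proposal in this thread, and the `run_task` function, its arguments, and the simplified `Pending` class are stand-ins rather than the actual task runner code.

```python
from typing import Any, Callable, Dict, Optional


class DONTRUN(Exception):
    """Stand-in for signals.DONTRUN."""


class TRIGGERDONTRUN(DONTRUN):
    """Hypothetical signal: a trigger decided the task must not run yet."""


class Pending:
    """Simplified stand-in for a Pending state with cached inputs."""

    def __init__(self, cached_inputs: Optional[Dict[str, Any]] = None) -> None:
        self.cached_inputs = cached_inputs


def manual_only(upstream_states: Dict[str, Any]) -> bool:
    # Raise the dedicated signal instead of relying on the message text.
    raise TRIGGERDONTRUN('Trigger function is "manual_only"')


def run_task(
    trigger: Callable[[Dict[str, Any]], bool], state: Any, inputs: Dict[str, Any]
) -> Any:
    try:
        trigger({})
        return "ran"  # placeholder for actually running the task
    except TRIGGERDONTRUN:
        # Handled by type, so no string matching on the exception is needed.
        return Pending(cached_inputs=inputs)
    except DONTRUN:
        # Any other DONTRUN returns the most recently computed state.
        return state
```

Catching the subclass first replaces the `"manual_only" in str(exc)` test with an explicit type, at the cost of the handler needing access to `inputs`, which is the awkwardness raised in the reply above.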

jlowin · 2018-07-30T19:43:31Z

LGTM! So glad to see this

cicdw added 4 commits July 26, 2018 18:28

Write some spect tests

83da2bb

Very close to have clean retry-cacheing implementation

26a0c28

Input cacheing tests pass ✅

256120c

Ensure manual_only trigger also caches w/ test

6e5afd0

cicdw requested a review from jlowin July 27, 2018 19:31

Ensure all calls to get_retry_state attempt to cache inputs

3f32c0a

jlowin requested changes Jul 27, 2018

Edits for clarity, both in code and in english

3c7f40c

Rename data -> result and elevate other items as attributes

6b753a1

cicdw mentioned this pull request Jul 28, 2018

State class refactor #80

Merged

2 tasks

cicdw and others added 4 commits July 27, 2018 18:31

Remove data kwarg from signals

c176ce6

Only allow Pending states to have cached_inputs attribute

8a48788

⚫ Remove attributes from base State class oof

retry_time -> scheduled_time

e355838

Merge pull request #80 from PrefectHQ/state-refactor

f79410d

State class refactor

jlowin requested changes Jul 30, 2018

Update state.__eq__ to be more robust

dd56545

cicdw mentioned this pull request Jul 30, 2018

Handle manual_only triggers more elegantly #83

Closed

jlowin approved these changes Jul 30, 2018

cicdw merged commit 7f11398 into master Jul 30, 2018

cicdw deleted the cache-inputs branch July 30, 2018 19:45

cicdw mentioned this pull request Jul 30, 2018

Implement checkpointing for data #56

Closed
