PIN 16: Use cache_for and cache_validator kwargs against a task's checkpointed Result #2619

Closed
lauralorenz opened this issue May 20, 2020 · 2 comments
Labels
enhancement An improvement of an existing feature

Comments

@lauralorenz

Current behavior

Please describe how the feature works today
Currently users can cache data output from a task run between flow runs via two paths:

  • use the target and result kwargs on a task, which treats the task as cached whenever the specified location name already exists in persistent storage and reuses the checkpointed result stored there
  • use the cache_for / cache_key / cache_validator kwargs on a task, which treats the task as cached whenever a matching entry exists in prefect.context.caches, held in memory by the current Python process

The former checks only that the path name exists (with no other conditions, such as a duration or arbitrary validation against the deserialized Python object), but it is persistent: a flow run started later can reuse the same data as long as it is configured to look at the same location. The latter offers more cache validation options, since it supports arbitrary validation functions against the cached value, but it works only for as long as prefect.context.caches exists in memory.
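The difference between the two paths can be sketched in plain Python. This is only an illustration under stated assumptions, not Prefect's implementation: in_memory_caches is a hypothetical stand-in for prefect.context.caches, and target_hit, in_memory_hit and not_expired are made-up helper names.

```python
import os
import tempfile
import time

# Hypothetical stand-in for prefect.context.caches: gone when the process exits.
in_memory_caches = {}


def target_hit(path):
    """Path 1: the only check is whether the target location exists."""
    return os.path.exists(path)


def in_memory_hit(key, validator):
    """Path 2: an arbitrary validator decides whether the cached entry is usable."""
    entry = in_memory_caches.get(key)
    return entry is not None and validator(entry)


def not_expired(entry):
    """Example validator: a duration check, which path 1 cannot express."""
    return time.time() < entry["expires"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello")
    miss_before_write = target_hit(path)  # nothing persisted yet
    with open(path, "w") as f:
        f.write("data")
    hit_after_write = target_hit(path)  # exists, so it is reused regardless of age

in_memory_caches["hello"] = {"value": "data", "expires": time.time() + 10}
hit_in_memory = in_memory_hit("hello", not_expired)
```

The file-based check survives a process restart but cannot express "only for 10 seconds"; the dictionary-based check can, but the dictionary disappears with the process.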

Proposed behavior

Please describe your proposed change to the current behavior
Implement the part of PIN 16 that describes honoring kwargs like cache_for and cache_validator against not just the in-memory cache but also the serialized object from a prior task run, which the Result interface should already have persisted. This gives us a persistent cache 'for free', since Result subclasses already write to their storage backend during pipeline execution. It then becomes worthwhile to extend just the Result subclasses to cover more cache-like backends.

Example

Please give an example of how the enhancement would be useful

For example, a task decorator like:

@task(cache_for=timedelta(seconds=10), result=LocalResult(location="{task_name}"), task_name="hello")

would check, during this task run's pre-pipeline checks, for a file at ~/.prefect/hello and use the data from that location as the result of the task run's Cached state, as long as the rerun happened within 10 seconds of the first run.
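A rough sketch of that check, using only the standard library. The names run_task and read_if_fresh are hypothetical, and using the file's modification time as "the time of the first run" is an assumption made here for brevity; a real implementation might store the expiration alongside the result instead.

```python
import os
import pickle
import tempfile
import time
from datetime import timedelta


def read_if_fresh(path, cache_for, now=None):
    """Return (hit, value); a hit needs an existing file younger than cache_for."""
    if not os.path.exists(path):
        return False, None
    now = time.time() if now is None else now
    age_seconds = now - os.path.getmtime(path)  # assumption: mtime == time of first run
    if age_seconds > cache_for.total_seconds():
        return False, None
    with open(path, "rb") as f:
        return True, pickle.load(f)


def run_task(path, cache_for, compute, now=None):
    """Use the persisted value if it is still valid, else compute and persist it."""
    hit, value = read_if_fresh(path, cache_for, now=now)
    if hit:
        return "Cached", value
    value = compute()
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return "Success", value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello")
    ten_seconds = timedelta(seconds=10)
    first = run_task(path, ten_seconds, lambda: 42)   # no file yet: computes
    second = run_task(path, ten_seconds, lambda: 42)  # within 10 seconds: reuses file
    # Simulate a rerun 11 seconds later by passing a shifted clock.
    late = run_task(path, ten_seconds, lambda: 43, now=time.time() + 11)
```

Because the value is read from disk, the second call would still be a cache hit if it came from a different Python process, which is the point of the proposal.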

Other thoughts/ideas:

  • Note that this behavior should probably be incompatible with the target kwarg, since target is purposefully about location existence only, not any other cache validation.
  • We probably need to decide which to prioritize, the in-memory cache or the persistent cache, if both exist. Or does the in-memory cache go away entirely?
@lauralorenz lauralorenz added the enhancement An improvement of an existing feature label May 20, 2020
@jorwoods

I think this would resolve the issues I had been asking about in the Slack channels. Would the presence of both a target and a Result.location raise an error, or would target take precedence?

There was discussion on Slack about the presence of a target being an opt-in to the caching behavior, but from my perspective, providing a template string to location is equally indicative of a desire to utilize caching, especially if validators are assigned to the Result.

Caching via Result.location also enables a scenario where a user invalidates the cache of a Task or an entire Flow by appending never_use to the validators. This would be useful when something in the source has changed and sections of the Flow, or the entire Flow, need to be rerun to pick up the alterations.
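To illustrate the invalidation idea, here is a minimal sketch in plain Python. The combinator all_of and the validator is_nonempty are hypothetical, and the single-argument validator signature is a simplification for this example rather than the signature Prefect uses.

```python
def all_of(*validators):
    """Combine validators: the cached value is usable only if every one accepts it."""
    def combined(cached_value):
        return all(validator(cached_value) for validator in validators)
    return combined


def is_nonempty(cached_value):
    """Example validator: accept any non-empty cached value."""
    return bool(cached_value)


def never_use(cached_value):
    """Always reject, which forces the task to rerun."""
    return False


normal = all_of(is_nonempty)
forced_rerun = all_of(is_nonempty, never_use)  # appended to invalidate the cache

uses_cache = normal([1, 2, 3])
skips_cache = forced_rerun([1, 2, 3])
```

Appending never_use leaves the other validators and the stored data untouched, so removing it again restores the previous caching behavior.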

@cicdw
Member

cicdw commented Jun 14, 2020

"providing a template string to location is equally indicative of a desire to utilize caching, especially if validators are assigned to the Result."

A location such as location="{task_name}/{date:%A}" can use templating purely for organizational convenience. The target kwarg, by contrast, is specifically designed so that its presence indicates the desire for caching behavior.

I think this issue will do exactly what you want, though: results can be initialized with a location plus caching configuration so that the location is used for caching.
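As a small illustration of the templated location above, plain str.format with a datetime already handles the {date:%A} field. The values passed in are made up for the example, and this shows only the string formatting, not how Prefect fills the template at runtime.

```python
from datetime import datetime

template = "{task_name}/{date:%A}"

# Hypothetical context values; %A renders the full weekday name.
location = template.format(task_name="hello", date=datetime(2020, 6, 14))
print(location)  # hello/Sunday (with the default C locale)
```

A location like this groups output by weekday and is rewritten each week, so treating every templated location as a cache opt-in would surprise users who only wanted the folder layout.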
