PIN 16: Use `cache_for` and `cache_validator` kwargs against a task's checkpointed Result #2619

lauralorenz · 2020-05-20T21:39:10Z

Current behavior

Please describe how the feature works today
Currently users can cache data output from a task run between flow runs via two paths:

- Use the target and result kwargs on tasks. Caching is decided from checkpointed results, by checking whether a specific location name exists in persistent storage.
- Use the cache_for / cache_key / cache_validator kwargs on tasks. Caching is decided by checking for a matching entry in prefect.context.caches, which lives in the memory of the current Python process.

The former works only against a path name existing (and no other terms, such as duration or arbitrary validation against the deserialized Python object), but is persistent -- starting up a flow run later can utilize the same data as long as it is configured to look at the same location. The latter provides more cache validation options since it supports arbitrary validation functions against the cached value, but only works for as long as prefect.context.caches exists in memory.
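
The trade-off between the two paths can be modelled in a few lines of plain Python. This is only an illustrative sketch and not Prefect's implementation: `CACHES` is a hypothetical stand-in for prefect.context.caches, and `target_is_cached`, `memory_is_cached`, and `not_expired` are made-up helper names.

```python
import os
import tempfile
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for prefect.context.caches: it lives only in this process.
CACHES = {}


def target_is_cached(path):
    """Path 1: the only check is whether the target location exists."""
    return os.path.exists(path)


def memory_is_cached(key, validator):
    """Path 2: an arbitrary validator inspects the in-memory entry."""
    entry = CACHES.get(key)
    return entry is not None and validator(entry)


def not_expired(entry):
    """Example validator: the entry is usable until its expiration time."""
    return datetime.now(timezone.utc) < entry["expires"]


# Path 1 survives a restart but cannot express a duration or custom validation.
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "hello")
with open(target, "w") as f:
    f.write("data")
print(target_is_cached(target))

# Path 2 supports validation, but CACHES starts out empty in every new process.
CACHES["hello"] = {
    "value": "data",
    "expires": datetime.now(timezone.utc) + timedelta(seconds=10),
}
print(memory_is_cached("hello", not_expired))
```

The file on disk is still there after the interpreter exits, while the `CACHES` entry is not, which is the gap this issue proposes to close.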

Proposed behavior

Please describe your proposed change to the current behavior
Implement the part of PIN 16 that describes honoring kwargs like cache_for and cache_validator against not just the in-memory cache but also the serialized object from a prior task run, which should already have been persisted by the Result interface. This gives us a persistent cache 'for free', since Result subclasses already write to their storage backend during pipeline execution. From there, it becomes useful to extend the Result subclasses to include more cache-like backends.

Example

Please give an example of how the enhancement would be useful

For example a task decorator like:

@task(cache_for=timedelta(seconds=10), result=LocalResult(location="{task_name}"), name="hello")

During the pre-pipeline checks for this task run, it would look for a file at ~/.prefect/hello and use the data from that location as the result of the task run's Cached state, as long as the rerun happened within 10 seconds of the first run.
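
The check described above could be sketched roughly as follows. This is a sketch under stated assumptions, not the proposed implementation: it assumes the persisted result is a pickle file and treats the file's modification time as the moment the result was written. The helpers `write_result` and `read_if_fresh` are hypothetical, and a real implementation would go through the Result interface rather than touch files directly.

```python
import os
import pickle
import tempfile
import time
from datetime import timedelta


def write_result(path, value):
    """Persist a task's return value, as a checkpointing Result would."""
    with open(path, "wb") as f:
        pickle.dump(value, f)


def read_if_fresh(path, cache_for, now=None):
    """Return (True, value) if a persisted result exists and is younger than
    cache_for; otherwise (False, None), meaning the task must run again."""
    if not os.path.exists(path):
        return False, None
    now = time.time() if now is None else now
    age = now - os.path.getmtime(path)
    if age > cache_for.total_seconds():
        return False, None
    with open(path, "rb") as f:
        return True, pickle.load(f)


path = os.path.join(tempfile.mkdtemp(), "hello")
write_result(path, {"greeting": "hello"})

# A rerun within the 10-second window reuses the persisted value.
hit, value = read_if_fresh(path, timedelta(seconds=10))

# Simulate a rerun 11 seconds later: the cache has expired.
late_hit, _ = read_if_fresh(path, timedelta(seconds=10), now=time.time() + 11)
```

Because the freshness check reads only from storage, it would give the same answer in a brand-new Python process, unlike a lookup in the in-memory cache.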

Other thoughts/ideas:

- Note that this behavior should probably be incompatible with the target kwarg, since the target kwarg is purposefully only about location existence, not any other cache validation.
- We probably need to decide which to prioritize, the in-memory cache or the persistent cache, if both exist. Or does the in-memory cache go away entirely?


jorwoods · 2020-06-12T18:07:48Z

I think this would resolve the issues that I had been asking about in the Slack channels. Would the presence of both a target and Result.location raise an error, or would target take precedence?

There was discussion on Slack about the presence of a target being an opt-in to the caching behavior, but from my perspective, providing a template string to location is equally indicative of a desire to utilize caching, especially if validators are assigned to the Result.

Caching via Result.location also enables a scenario where a user may wish to invalidate the cache of a Task or entire flow by appending a never_use to the validators. This would be useful in scenarios where something in the source has changed, and sections or the entire Flow need to be rerun to pick up the alterations.
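
To illustrate the invalidation scenario, here is a small sketch in plain Python. It assumes a hypothetical list of validators that must all pass, which is not how the current cache_validator kwarg works (it takes a single callable), and the validator signature `(cached, inputs)` is simplified for illustration. The local `never_use` below is a stand-in that always rejects the cache.

```python
from datetime import datetime, timedelta, timezone


def never_use(cached, inputs):
    """Always reject the cached value, forcing the task to run again."""
    return False


def within_duration(cached, inputs):
    """Accept the cached value until its expiration time."""
    return datetime.now(timezone.utc) < cached["expires"]


def same_inputs(cached, inputs):
    """Accept the cached value only if it was produced from the same inputs."""
    return cached["inputs"] == inputs


def cache_is_valid(cached, inputs, validators):
    """The cache is used only if every validator accepts it."""
    return all(check(cached, inputs) for check in validators)


cached = {
    "value": 42,
    "inputs": {"x": 1},
    "expires": datetime.now(timezone.utc) + timedelta(hours=1),
}
validators = [within_duration, same_inputs]
normal_run = cache_is_valid(cached, {"x": 1}, validators)

# Source changed: append never_use to force a rerun without deleting anything.
forced_rerun = cache_is_valid(cached, {"x": 1}, validators + [never_use])
```

The persisted data stays in place, so removing `never_use` again restores normal caching on the next run.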

cicdw · 2020-06-14T22:05:33Z

> providing a template string to location is equally indicative of a desire to utilize caching, especially if validators are assigned to the Result.

Providing locations such as location="{task_name}/{date:%A}" can use templating purely for organizational convenience. The target kwarg is specifically designed so that its presence indicates the desire for caching behavior.

I think this issue will do exactly what you want though - results can be initialized with location + caching configuration so that the location is used for caching.
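
As a small illustration of the templated location mentioned above, the snippet below fills the template with Python's built-in `str.format`. The values passed in are made up for the example, and it is an assumption here that the template fields are filled the same way from run-time context.

```python
from datetime import date

template = "{task_name}/{date:%A}"

# The format spec after the colon is handed to the date object, so %A
# becomes the full weekday name (in the default C locale).
location = template.format(task_name="hello", date=date(2020, 6, 14))
print(location)  # hello/Sunday
```

A template like this produces a different path each weekday, which groups results neatly but says nothing about whether an existing file should be reused.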

lauralorenz added the enhancement label (An improvement of an existing feature) May 20, 2020

jcrist mentioned this issue May 22, 2020

Cache across task/flow runs in different process with dask #2636

Closed

cicdw added the status:stale label Aug 30, 2022

cicdw closed this as completed Aug 30, 2022
