
Make context Available to Cache Validator Functions #1509

Closed
joeschmid opened this issue Sep 14, 2019 · 3 comments · Fixed by #1510
Labels
feature A new feature

Comments

joeschmid commented Sep 14, 2019

Use Case

Right now, it seems like prefect.context is not available inside of a cache_validator function. (At least import prefect; logger = prefect.context.get("logger") seems to return None for logger.)

I have two use cases in mind, one mundane and one that I think could be very powerful:

  1. It's useful to be able to log messages inside of a cache_validator function for debugging purposes
  2. It would allow for very dynamic cache validation driven off of Flow parameters, the Task being executed, etc.

Solution

For use case 2 above, imagine a data science workflow where the first portion of the flow is a series of feature engineering tasks. It would be very useful and powerful to be able to run the flow in one of three modes:

Mode 1 - generate feature matrices: perform feature engineering and write out a file for each large feature matrix, e.g. with dask.dataframe.to_parquet() to a cloud bucket, using a specific timestamp or label in the bucket/key name along with the name of the task, e.g. generate_vitals_features_201909141352.parquet
Mode 2 - use latest available set of feature matrices: treat those tasks as cached, thus skipping task execution, and instead read from the most recent existing parquet files
Mode 3 - use the set of feature matrices from a specific timestamp/label: again treat those tasks as cached, thus skipping task execution, but read from a specific set of parquet files (not the most recent one)
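The Mode 1 naming scheme above could be sketched roughly as follows. This is a hypothetical helper (feature_matrix_key is not a Prefect API), just to make the bucket/key convention concrete:

```python
from datetime import datetime
from typing import Optional

def feature_matrix_key(task_name: str, label: Optional[str] = None) -> str:
    """Build a bucket key like generate_vitals_features_201909141352.parquet.

    If no explicit label is given, fall back to a UTC timestamp, matching the
    "specific timestamp or label" idea from Mode 1.
    """
    label = label or datetime.utcnow().strftime("%Y%m%d%H%M")
    return f"{task_name}_{label}.parquet"

print(feature_matrix_key("generate_vitals_features", "201909141352"))
# generate_vitals_features_201909141352.parquet
```

Mode 2 would resolve the key with the most recent label for a given task name; Mode 3 would pass the requested label explicitly.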

The selection of mode would be driven by a Parameter (or Parameters) to the Flow. Parameters are already available to cache_validator functions, but a cache_validator might also want to examine the name of the current task, e.g. to query a cloud bucket and check whether a parquet file including that task name already exists. For example, someone might request a run in Mode 2 without ever having run in Mode 1, in which case no file would be available.

The advantage, of course, is that after the first Flow run (in Mode 1), subsequent runs could execute far more quickly. (There are some other potential optimizations like skipping intermediate reading of parquet files and only reading a final, merged large feature matrix, etc.)

Being able to supply our own cache_validator function is already very powerful, but with the added ability to examine context for task names, we could write a common cache validator that returns True or False appropriately.
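A rough sketch of such a common validator, assuming Prefect 0.x's cache_validator call signature of (state, inputs, parameters) and that the task name becomes available via context once this issue is fixed. The get_task_name and file_exists callables and the "mode"/"label" parameter names are hypothetical stand-ins (not Prefect APIs) so the mode logic can be shown and exercised without a cloud bucket:

```python
def make_feature_cache_validator(get_task_name, file_exists):
    """Build a cache validator closed over context/bucket lookups.

    get_task_name: callable returning the current task's name (stand-in for
        reading it from prefect.context).
    file_exists: callable taking a bucket key and returning whether a
        matching parquet file exists (stand-in for a cloud bucket query).
    """
    def validator(state, inputs, parameters):
        params = parameters or {}
        mode = params.get("mode", "generate")
        if mode == "generate":  # Mode 1: always recompute features
            return False
        # Modes 2/3: treat the task as cached, but only if the expected
        # parquet file for this task (and requested label) actually exists.
        label = params.get("label", "latest")
        key = f"{get_task_name()}_{label}.parquet"
        return file_exists(key)
    return validator

# Usage with fakes standing in for context and the bucket lookup:
validator = make_feature_cache_validator(
    get_task_name=lambda: "generate_vitals_features",
    file_exists=lambda key: key.endswith("latest.parquet"),
)
print(validator(None, {}, {"mode": "latest"}))    # True
print(validator(None, {}, {"mode": "generate"}))  # False
```

Because the validator falls back to recomputing whenever the file is missing, the "requested Mode 2 but never ran Mode 1" case degrades safely.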

This would also allow data scientists to focus on just the feature engineering code in their tasks and still get this persistence "for free", just by specifying a common "feature caching" result_handler and cache_validator.

Alternatives

We could place the code to write and read the feature matrices inside the tasks themselves, but that seems like a far less elegant and reusable approach. The existing ResultHandler and cache_validator plumbing is already very close to allowing us to implement this idea, so hopefully this would be a relatively small but powerful change. (BTW, if this is better as a Prefect Improvement Notice, feel free to treat it as one.)

@joeschmid joeschmid added the feature A new feature label Sep 14, 2019
cicdw (Member) commented Sep 14, 2019

Hi @joeschmid! Excellent call out - at this moment I would expect the full task name to be available in context, but you are correct that the logger is not; this is a surprisingly easy thing to fix, so expect a PR shortly! I'll also add tests for the presence of task name, parameters, etc.

Also thank you for this excellent write-up - it's both useful for us to hear how users expect the tool to behave / be customized, and I'm sure there are others who will read this and better understand how to implement their own custom caching layer in Prefect.

joeschmid (Author) commented:
> Hi @joeschmid! Excellent call out - at this moment I would expect the full task name to be available in context, but you are correct that the logger is not; this is a surprisingly easy thing to fix, so expect a PR shortly! I'll also add tests for the presence of task name, parameters, etc.

Fantastic! In that case, sorry for my long-winded write-up. I should have tested accessing other aspects of context.

Thanks for the lightning fast PR! (BTW, don't ever feel like you need to address this stuff on the weekend -- no problem at all to wait for the week.)

cicdw (Member) commented Sep 14, 2019

No worries at all - the write-up is still a very useful resource for others!

zanieb pushed a commit that referenced this issue Apr 13, 2022