Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plugin: cache hook with diskcache #684

Merged
merged 21 commits into from
Feb 16, 2024
Merged

Plugin: cache hook with diskcache #684

merged 21 commits into from
Feb 16, 2024

Conversation

zilto
Copy link
Collaborator

@zilto zilto commented Feb 7, 2024

This hook uses the diskcache to cache node execution on disk. The cache key is a tuple of the function's (source code, input a, ..., input n). It is like memoization but it considers the function's source code in addition to inputs.

diskcache has great features to:

  • set maximum cache size
  • set automated eviction policy once maximum size is reached
  • allow custom Disk implementations to change the serialization protocol (e.g., pickle, JSON)

Changes

  • added a hamilton.plugins.h_diskcache which contains a caching hook and cache management functions
  • added examples/cache_hook to demonstrate how to use h_diskcache
  • modified setup.py to include sf-hamilton[diskcache]
  • added tests and added diskcache to requirements-docs.txt to run tests
  • added a function to hash nodes in graph_utils and corresponding tests.

How I tested this

  • added 4 tests to cover hook behavior
  • added 2 tests for cache management utility functions
  • added 2 tests for hashing function

Notes / TODO

  • will need to write a docs page
  • create an example to use h_diskcache with notebooks for interactive/iterative development
  • the hash function could be improved. The current implementation uses inspect.getsource and will return distinct hashes for changes to docstrings, newlines and comments.

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@zilto zilto added the enhancement New feature or request label Feb 7, 2024
Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, some comments

examples/cache_hook/README.md Outdated Show resolved Hide resolved
@@ -57,3 +58,20 @@ def create_temporary_module(*functions: Callable, module_name: str = None) -> Mo
setattr(module, fn_name, fn)
sys.modules[module_name] = module
return module


def module_from_source(source: str) -> ModuleType:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be in this PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this function, the CacheHook doesn't work in notebooks because it won't be able to use inspect to get the source code and create a hash.

The alternative would be to use the functions's __code__ attribute, but this would dependent on the Python version. It might not be an issue because the cache's pickling is already Python version dependent

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this used? Think I'm missing it... But that makes sense I think. You're talking about temporary modules, right?

With temporary modules you'd have to special-case them, go through every function and use inspect on them.

return hashlib.sha256(source.encode()).hexdigest()


def remove_docs_and_comments(source: str):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might not be worth removing comments TBH -- can you at least add some comments here as to what you're doing? Code is a bit cryptic.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also worth adding a test for this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavior is tested via test_graph_utils:: test_different_hash_docstring and test_different_hash_comment. They check both if "the hashes are unequal when not stripping docs and comments" and "the hashes are equal when stripping docs and comments".

The tests would be improved by not relying on graph_utils.hash_source_code() though

Copy link
Collaborator Author

@zilto zilto Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add comments to the function

I think the value of removing docstring/comments will depend on the workflow for the CI / CLI code diff (which rely on the same hashing).

If we marked as "changed" a node that had a changed docstring or added #comment AND all downstream nodes, the amount of false positive will make the feature pointless IMHO

hamilton/plugins/h_diskcache.py Show resolved Hide resolved
hamilton/plugins/h_diskcache.py Show resolved Hide resolved
@zilto
Copy link
Collaborator Author

zilto commented Feb 16, 2024

I think it would be valuable to add this as a core feature of Hamilton in the future. I could see a driver.Builder().with_cache() and allow to write locally, to S3, or to DAGWorks platform. I think it's especially powerful to enable smoother "script-to-notebook" and "notebook-to-script" devX and save costs for LLM development. It could be a big feature for us to push.

To avoid introducing diskcache as a dependency, we could implement a simple alternative with the stdlib shelve (similar to the Experiment manager). We could eventually integrate experiment management with it.

It could also enable "watermarking" for light orchestration such as retries or backfills. Also probably a good basis for Burr snapshotting

Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, don't want to overthink it, we can always change it up later if we need to and this works.

Just fix the 3.8 tests and merge!

@zilto zilto merged commit dab7fac into main Feb 16, 2024
22 checks passed
@zilto zilto deleted the plugin/diskcache branch February 16, 2024 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants