Plugin: cache hook with `diskcache` #684

zilto · 2024-02-07T15:22:33Z

This hook uses the diskcache to cache node execution on disk. The cache key is a tuple of the function's (source code, input a, ..., input n). It is like memoization but it considers the function's source code in addition to inputs.

diskcache has great features to:

set maximum cache size
set automated eviction policy once maximum size is reached
allow custom Disk implementations to change the serialization protocol (e.g., pickle, JSON)

Changes

added a hamilton.plugins.h_diskcache which contains a caching hook and cache management functions
added examples/cache_hook to demonstrate how to use h_diskcache
modified setup.py to include sf-hamilton[diskcache]
added tests and added diskcache to requirements-docs.txt to run tests
added a function to hash nodes in graph_utils and corresponding tests.

How I tested this

added 4 tests to cover hook behavior
added 2 tests for cache management utility functions
added 2 tests for hashing function

Notes / TODO

will need to write a docs page
create an example to use h_diskcache with notebooks for interactive/iterative development
the hash function could be improved. The current implementation uses inspect.getsource and will return distinct hashes for changes to docstrings, newlines and comments.

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

elijahbenizzy

Looks good, some comments

examples/cache_hook/README.md

elijahbenizzy · 2024-02-16T05:16:46Z

hamilton/ad_hoc_utils.py

@@ -57,3 +58,20 @@ def create_temporary_module(*functions: Callable, module_name: str = None) -> Mo
        setattr(module, fn_name, fn)
    sys.modules[module_name] = module
    return module
+
+
+def module_from_source(source: str) -> ModuleType:


Is this supposed to be in this PR?

Without this function, the CacheHook doesn't work in notebooks because it won't be able to use inspect to get the source code and create a hash.

The alternative would be to use the functions's __code__ attribute, but this would dependent on the Python version. It might not be an issue because the cache's pickling is already Python version dependent

Where is this used? Think I'm missing it... But that makes sense I think. You're talking about temporary modules, right?

With temporary modules you'd have to special-case them, go through every function and use inspect on them.

elijahbenizzy · 2024-02-16T05:18:08Z

hamilton/graph_utils.py

+    return hashlib.sha256(source.encode()).hexdigest()
+
+
+def remove_docs_and_comments(source: str):


Might not be worth removing comments TBH -- can you at least add some comments here as to what you're doing? Code is a bit cryptic.

Also worth adding a test for this

Behavior is tested via test_graph_utils:: test_different_hash_docstring and test_different_hash_comment. They check both if "the hashes are unequal when not stripping docs and comments" and "the hashes are equal when stripping docs and comments".

The tests would be improved by not relying on graph_utils.hash_source_code() though

Will add comments to the function

I think the value of removing docstring/comments will depend on the workflow for the CI / CLI code diff (which rely on the same hashing).

If we marked as "changed" a node that had a changed docstring or added #comment AND all downstream nodes, the amount of false positive will make the feature pointless IMHO

hamilton/plugins/h_diskcache.py

zilto · 2024-02-16T15:44:40Z

I think it would be valuable to add this as a core feature of Hamilton in the future. I could see a driver.Builder().with_cache() and allow to write locally, to S3, or to DAGWorks platform. I think it's especially powerful to enable smoother "script-to-notebook" and "notebook-to-script" devX and save costs for LLM development. It could be a big feature for us to push.

To avoid introducing diskcache as a dependency, we could implement a simple alternative with the stdlib shelve (similar to the Experiment manager). We could eventually integrate experiment management with it.

It could also enable "watermarking" for light orchestration such as retries or backfills. Also probably a good basis for Burr snapshotting

elijahbenizzy

Looks good, don't want to overthink it, we can always change it up later if we need to and this works.

Just fix the 3.8 tests and merge!

zilto added 6 commits February 6, 2024 16:10

diskcache hook added

37c4fb4

tests added

9d15a7a

examples/cache_hook added

51be118

moved hash_callable to graph_utils

961aa5f

sf-hamilton[diskcache] added

ed46976

pre-commit hooks applied

5dfe25d

zilto added the enhancement New feature or request label Feb 7, 2024

zilto requested a review from elijahbenizzy February 7, 2024 15:22

zilto added 9 commits February 7, 2024 10:37

added test requirements

d92e78e

test import fixed

f68ba55

pre-commit fixed

162babe

strips comments and docstring before hashing source code

dc330c3

add dynamic module to linecache; inspect can be used

d9e5205

strip feature for hashing source code

47cf088

strip function indent fixed

e195e1f

pre-commit hook

4040708

fixed type annotations for 3.8

e48f085

elijahbenizzy reviewed Feb 16, 2024

View reviewed changes

zilto added 2 commits February 16, 2024 10:33

docstring remover now uses AST; added tests; added documentation

f1fac0a

compare CacheHook to CachingGraphAdapter

c0c3747

zilto added 2 commits February 16, 2024 10:52

strip whitespaces in attempt to fix 3.8 errors

1604247

pre-commit fix (again)

315c4e7

elijahbenizzy approved these changes Feb 16, 2024

View reviewed changes

zilto added 2 commits February 16, 2024 13:27

skip tests in Python 3.8

894de21

pytest version notation fixed

68ff666

zilto merged commit dab7fac into main Feb 16, 2024
22 checks passed

zilto deleted the plugin/diskcache branch February 16, 2024 18:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Plugin: cache hook with `diskcache` #684

Plugin: cache hook with `diskcache` #684

zilto commented Feb 7, 2024 •

edited

Loading

elijahbenizzy left a comment

elijahbenizzy Feb 16, 2024

zilto Feb 16, 2024

elijahbenizzy Feb 16, 2024

elijahbenizzy Feb 16, 2024

elijahbenizzy Feb 16, 2024

zilto Feb 16, 2024

zilto Feb 16, 2024 •

edited

Loading

zilto commented Feb 16, 2024 •

edited

Loading

elijahbenizzy left a comment

		return hashlib.sha256(source.encode()).hexdigest()


		def remove_docs_and_comments(source: str):

Plugin: cache hook with diskcache #684

Plugin: cache hook with diskcache #684

Conversation

zilto commented Feb 7, 2024 • edited Loading

Changes

How I tested this

Notes / TODO

Checklist

elijahbenizzy left a comment

Choose a reason for hiding this comment

elijahbenizzy Feb 16, 2024

Choose a reason for hiding this comment

zilto Feb 16, 2024

Choose a reason for hiding this comment

elijahbenizzy Feb 16, 2024

Choose a reason for hiding this comment

elijahbenizzy Feb 16, 2024

Choose a reason for hiding this comment

elijahbenizzy Feb 16, 2024

Choose a reason for hiding this comment

zilto Feb 16, 2024

Choose a reason for hiding this comment

zilto Feb 16, 2024 • edited Loading

Choose a reason for hiding this comment

zilto commented Feb 16, 2024 • edited Loading

elijahbenizzy left a comment

Choose a reason for hiding this comment

Plugin: cache hook with `diskcache` #684

Plugin: cache hook with `diskcache` #684

zilto commented Feb 7, 2024 •

edited

Loading

zilto Feb 16, 2024 •

edited

Loading

zilto commented Feb 16, 2024 •

edited

Loading