Cache to file and auto-load from cache #169

Open · hXtreme opened this issue Jan 5, 2022 · 4 comments

hXtreme commented Jan 5, 2022

It would be awesome if there were an easy way to cache results to a file and, when the cache is found, load from it instead of recomputing.

Basically, to support the following use case/API:

# main.py

from functional import seq


def expensive_function(x):
    """An expensive-to-compute function."""
    print(x, end=" ")
    return x**2


# cache_dir and with_caching are the proposed API
s = seq([x for x in range(4)], cache_dir="/path/to/cache/dir")

expensive_result = s.map(expensive_function)\
    .with_caching(name="expensive_result")

expensive_result.for_each(lambda x: print(f"\n{x}", end=""))

First run:

$ python main.py
0 1 2 3
0
1
4
9

Second run (the first run should have created the cache file, which is now used instead of recomputing the sequence expensive_result):

$ python main.py

0
1
4
9

Please let me know if you have any questions or would like some clarifications.


Loving the project so far, thanks for your effort!

EntilZha (Owner) commented

Hi @hXtreme, thanks for the great idea! I'm generally open to PR contributions for good ideas, just don't have much time to implement new things myself.

On your idea, I have a few questions about how this might work. At a high level, the challenge is to (efficiently) detect when a result is cached versus not. You could do this, I think, either at the level of a pipeline stage (e.g., map(expensive_function)) or for each individual element (basically ergonomics on top of memoization). The challenge in the first case is knowing when the entire input has already been cached without scanning it again. In the second case, does this amount to ergonomics on top of memoization that is serialized to disk? The second one could very well be nice to have, but it would be great to get a bit more clarification on how the decision to compute or use the cache would be made.
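
For concreteness, the second interpretation might look something like the following untested sketch (the decorator name disk_memoize and the one-file-per-key layout are illustrative assumptions, not an existing PyFunctional API):

# Untested sketch: per-element memoization serialized to disk.
import hashlib
import pickle
from pathlib import Path


def disk_memoize(fn, cache_dir="~/.cache/pyfunctional"):
    """Wrap fn so each result is persisted on disk, keyed by the pickled argument."""
    cache_path = Path(cache_dir).expanduser()
    cache_path.mkdir(parents=True, exist_ok=True)

    def wrapper(x):
        key = hashlib.sha256(pickle.dumps(x)).hexdigest()
        cache_file = cache_path / key
        if cache_file.exists():
            # Cache hit: deserialize the stored result instead of recomputing.
            return pickle.loads(cache_file.read_bytes())
        result = fn(x)
        cache_file.write_bytes(pickle.dumps(result))
        return result

    return wrapper


# Usage: s.map(disk_memoize(expensive_function))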

hXtreme (Author) commented Jan 10, 2022

This is roughly how I'd imagine it working:

# Untested Code

from pathlib import Path
from typing import TypeVar

import dill
import functional as funct
from functional.pipeline import Sequence

T = TypeVar("T")


def with_caching(seq: Sequence[T], file: str, path: str = "~/.cache") -> Sequence[T]:
    cache_dir = Path(path).expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / file
    if cache_file.exists():
        # Cache hit: load the previously computed results instead of recomputing.
        with cache_file.open("rb") as f:
            cache = dill.load(f)
        return funct.seq(cache)
    else:
        # Cache miss: evaluate the sequence, persist it, and return the results.
        results = seq.to_list()
        with cache_file.open("wb") as f:
            dill.dump(results, f)
        return funct.seq(results)

I wrote this as an external function (I don't know enough about PyFunctional's internals to modify the library itself).

It is used as follows:

expensive_result = with_caching(s.map(expensive_function), file="expensive_result", path="./experiment-cache")

Hope this helps.

bmsan commented Jan 20, 2022

I also like the idea, but I think there are some things to take into account:

  • the input could change:

    • the input provided to the sequence is different
    • somewhere down the chain, an intermediary map is given a function whose implementation has changed
    • the chained sequence itself is modified (operations are added to or removed from the chain)
  • assuming all of the above were taken care of, expensive_function itself could still get an updated implementation

So the hashing should take into account both the input that is processed at the caching point and the code of expensive_function.
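
For example, a cache key could combine a hash of the function's source with a hash of the input. An untested sketch (make_cache_key is a hypothetical helper, and inspect.getsource only covers the function's own body, not anything it calls):

import hashlib
import inspect
import pickle


def make_cache_key(fn, inputs):
    """Hash both the function's source code and the inputs it processes."""
    code = inspect.getsource(fn).encode()
    data = pickle.dumps(list(inputs))
    return hashlib.sha256(code + data).hexdigest()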

I think joblib does exactly what you are after, taking all of the above into account: https://joblib.readthedocs.io/en/latest/memory.html

>>> cachedir = 'your_cache_location_directory'
>>> from joblib import Memory
>>> memory = Memory(cachedir, verbose=0)

>>> @memory.cache
... def f(x):
...     print('Running f(%s)' % x)
...     return x
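
Applied to this issue's example, it might look something like this sketch (memory.cache wraps expensive_function so each call's result is persisted to disk and recomputed only when the arguments or the function's code change):

from functional import seq
from joblib import Memory

memory = Memory("./experiment-cache", verbose=0)


def expensive_function(x):
    print(x, end=" ")
    return x**2


# Each element's result is cached on disk; joblib invalidates the cache
# automatically if expensive_function's source code changes.
expensive_result = seq(range(4)).map(memory.cache(expensive_function))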
