Cache to file and auto-load from cache #169

Open · hXtreme opened this issue Jan 5, 2022 · 4 comments

hXtreme commented Jan 5, 2022

It would be awesome if there were an easy way to cache results to a file and, when the cache is found, load from it instead of recomputing.

Basically, to support the following use case/API:

# main.py

from functional import seq


def expensive_function(x):
    """An expensive-to-compute function."""
    print(x, end=" ")
    return x**2


# cache_dir and with_caching are the proposed API
s = seq([x for x in range(4)], cache_dir="/path/to/cache/dir")

expensive_result = s.map(expensive_function)\
    .with_caching(name="expensive_result")

expensive_result.for_each(lambda x: print(f"\n{x}", end=""))

First run:

$ python main.py
0 1 2 3
0
1
4
9

Second run (the first run should have created the cache file, which is now used instead of recomputing the sequence expensive_result):

$ python main.py

0
1
4
9

Please let me know if you have any questions or would like some clarifications.


Loving the project so far, thanks for your effort!

EntilZha (Owner) commented

Hi @hXtreme, thanks for the great idea! I'm generally open to PR contributions for good ideas, just don't have much time to implement new things myself.

On your idea, I have a few questions about how this might work. At a high level, the challenge is to (efficiently) detect when a result is cached versus not. You could do this, I think, either at the level of a pipeline stage (e.g., map(expensive_function)) or for each individual element (basically ergonomics on top of memoization). The challenge in the first case is knowing when the entire input has already been cached without scanning it again. In the second case, does this amount to ergonomics on top of memoization that is serialized to disk? The second one could very well be nice to have, but it would be great to get a bit more clarification on how the decision to compute or use the cache would be made.
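
For concreteness, the second interpretation might look something like the following untested sketch (the decorator name disk_memoize and the one-file-per-key layout are illustrative assumptions, not an existing PyFunctional API):

# Untested sketch: per-element memoization serialized to disk.
import hashlib
import pickle
from pathlib import Path


def disk_memoize(fn, cache_dir="~/.cache/pyfunctional"):
    """Wrap fn so each result is persisted on disk, keyed by the pickled argument."""
    cache_path = Path(cache_dir).expanduser()
    cache_path.mkdir(parents=True, exist_ok=True)

    def wrapper(x):
        key = hashlib.sha256(pickle.dumps(x)).hexdigest()
        cache_file = cache_path / key
        if cache_file.exists():
            # Cache hit: deserialize the stored result instead of recomputing.
            return pickle.loads(cache_file.read_bytes())
        result = fn(x)
        cache_file.write_bytes(pickle.dumps(result))
        return result

    return wrapper


# Usage: s.map(disk_memoize(expensive_function))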

hXtreme (Author) commented Jan 10, 2022

This is roughly how I'd imagine it working:

# Untested Code

from pathlib import Path
from typing import TypeVar

import dill
import functional as funct
from functional.pipeline import Sequence

T = TypeVar("T")


def with_caching(seq: Sequence[T], file: str, path: str = "~/.cache") -> Sequence[T]:
    cache_dir = Path(path).expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / file
    if cache_file.exists():
        # Cache hit: load the previously computed results instead of recomputing.
        with cache_file.open("rb") as f:
            cache = dill.load(f)
        return funct.seq(cache)
    else:
        # Cache miss: evaluate the sequence, persist it, and return the results.
        results = seq.to_list()
        with cache_file.open("wb") as f:
            dill.dump(results, f)
        return funct.seq(results)

I wrote this as an external function (I don't know enough about PyFunctional's internals to modify the library itself).

It is used as follows:

expensive_result = with_caching(s.map(expensive_function), file="expensive_result", path="./experiment-cache")

Hope this helps.

bmsan commented Jan 20, 2022

I also like the idea, but I think there are some things to take into account:

  • the input could change:

    • the input provided to the sequence is different
    • somewhere down the chain, an intermediary map is given a function whose implementation has changed
    • the chained sequence itself is modified (operations are added to or removed from the chain)
  • assuming all of the above were taken care of, expensive_function itself could still get an updated implementation

So the hashing should take into account both the input that is processed at the caching point and the code of expensive_function.
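
For example, a cache key could combine a hash of the function's source with a hash of the input. An untested sketch (make_cache_key is a hypothetical helper, and inspect.getsource only covers the function's own body, not anything it calls):

import hashlib
import inspect
import pickle


def make_cache_key(fn, inputs):
    """Hash both the function's source code and the inputs it processes."""
    code = inspect.getsource(fn).encode()
    data = pickle.dumps(list(inputs))
    return hashlib.sha256(code + data).hexdigest()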

I think joblib does exactly what you are after, taking all of the above into account: https://joblib.readthedocs.io/en/latest/memory.html

>>> cachedir = 'your_cache_location_directory'
>>> from joblib import Memory
>>> memory = Memory(cachedir, verbose=0)

>>> @memory.cache
... def f(x):
...     print('Running f(%s)' % x)
...     return x
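
Applied to this issue's example, it might look something like this sketch (memory.cache wraps expensive_function so each call's result is persisted to disk and recomputed only when the arguments or the function's code change):

from functional import seq
from joblib import Memory

memory = Memory("./experiment-cache", verbose=0)


def expensive_function(x):
    print(x, end=" ")
    return x**2


# Each element's result is cached on disk; joblib invalidates the cache
# automatically if expensive_function's source code changes.
expensive_result = seq(range(4)).map(memory.cache(expensive_function))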
