Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disabling requests-cache is not thread safe. Discussion of the future of _restclient's usage of requests-cache #14

Closed
aaraney opened this issue Jan 13, 2021 · 6 comments
Labels
bug Something isn't working

Comments

@aaraney
Copy link
Member

aaraney commented Jan 13, 2021

Due to the way requests-cache (used in _restclient) is implemented, if the cache has been "installed" and a downstream package uses requests, the downstream package's requests will be cached. Meaning that requests_cache implicitly changes the behavior of the requests package for all downstream callers.

Per requests-cache's documentation, they do provide a context manager to "disable" the cache. However, as @christophertubbs pointed out their implementation is not thread safe:

 @contextmanager
    def cache_disabled(self):
        """
        Context manager for temporary disabling cache
        ::
            >>> s = CachedSession()
            >>> with s.cache_disabled():
            ...     s.get('http://httpbin.org/ip')
        """
        self._is_cache_disabled = True
        try:
            yield
        finally:
            self._is_cache_disabled = False

In discussing this issue with @hellkite500, @hellkite500 mentioned the (actively developed) project CacheControl. Before active development to fix this behavior, its worth exploring a bit and determining what long term solution to implement.

@aaraney aaraney changed the title Disabling requests-cache is not thread safe. Discussion of the future of _restclient' Disabling requests-cache is not thread safe. Discussion of the future of _restclient's usage of requests-cache Jan 13, 2021
@aaraney aaraney added the bug Something isn't working label Jan 13, 2021
@jarq6c
Copy link
Collaborator

jarq6c commented Jan 14, 2021

With reference to CacheControl, the default cache is a dict. They also have a FileCache that will persist requests using pickle and a RedisCache. I'm wondering about performance using the FileCache and portability using the RedisCache.

@aaraney
Copy link
Member Author

aaraney commented Jan 14, 2021

With reference to CacheControl, the default cache is a dict. They also have a FileCache that will persist requests using pickle and a RedisCache. I'm wondering about performance using the FileCache and portability using the RedisCache.

Thanks for doing a little digging... Having not done that so far myself, the issues that you mentioned are problematic and do not sound like they would support our needs. Specifically a FileCache does not support our needs nor an in-RAM solution like Redis. Additionally, I am not overly keen in bringing in another dependency just to fix this one issue. Plus with the status of requests-cache:

Noticed last commit was 14-Aug-2019. Which is almost 1.5 years ago as of today.

#14 may deem it necessary to develop our own cache sub-package. For reference, this is something @jarq6c and @aaraney have previously discussed, but never formally tabled. At face value it seems that we should be able to learn from the implementations of both CacheControl and requests-cache and sort of merge them. I think having something a little more general, as we've discussed prior, that permits caching things in general via HD5 as well as SQlite would be a requirement for such a feature.

With all of the being said, I don't want this issue to dissolve into another subject, but just speak generally and voice the potential feature addition to address this bug.

@jarq6c
Copy link
Collaborator

jarq6c commented Jan 14, 2021

The disadvantages of a hypothetical evaluation_tools._cache package are that someone has to write and maintain it. We may also end up reinventing the wheel to some extent. On the other hand, perhaps there is a way to design it such that wheel reinventing is minimized and we're really just writing wrappers around existing libraries.

The potential advantages of such a package go beyond its primary use to cache NWIS and NWM data. We could unify caching and storing of data through a singular interface. This might allow data/computational scientists using evaluation_tools to speed up development of long workflows by storing interim data. This interface could be incorporated into an event_detection_service that caches events. If an object store (or cloud bucket) became available we could use this package to automatically push data from any other evaluation_tools subpackage.

I'm imagining a subpackage that takes evaluation_tools canonical pandas.DataFrames or geopandas.GeoDataFrames and automagically caches and/or stores the data in a variety of formats.

@hellkite500
Copy link
Member

@jarq6c its ironic you mention an event cache, I may or may not have had the need to implement:

class EventCache(object):
    """
        Event cache manager
        FIXME current multiprocessing works because of a guarantee of process order,
        but this may need to be better protected from reader/writer issues
    """

I would encourage thinking about the needs of caching and what exactly is being cached. A dataframe cache with clear semantics isn't too far fetched.

@JWCook
Copy link

JWCook commented Apr 17, 2021

FYI, a better option for using requests-cache is using CachedSession directly, instead of patching with install_cache(). It makes it more explicit what you are and aren't caching, and doesn't affect downstream requests calls. It's mostly thread-safe, except for cache_disabled(), as you noted; instead, when you want to make a non-cached request, you can just use a regular requests.Session.

Also, the library is being maintained again (with a fairly large release last week), and issues and PRs are welcome.

@aaraney
Copy link
Member Author

aaraney commented Jun 9, 2021

Because of #78, multiprocessing usage should become more obsolete in the context of retrieving data. With that in mind, I am going to close this and we can reopen in the future to continue discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants