Concurrency cache write guarantee #18

Closed
d-cameron opened this issue Mar 6, 2017 · 2 comments

@d-cameron

d-cameron commented Mar 6, 2017

Would it be possible to guarantee that, for every cache file written, the contents of the file are correct? I'm using parallel and was hoping that I could just assume that if I read a cached value, it would be correct. Unfortunately, saveCache.R appears to write directly to the cache file and is thus inappropriate for parallel usage.

Would you consider updating saveCache() to write to a temporary file in the destination directory then moving the file to the correct cache location once the write is complete? This would allow a common cache to be used by multiple R instances and would have the added benefit that R processes killed whilst writing an R.cache file do not result in cache corruption.
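Roughly, something like the following sketch of the write-then-rename pattern (illustrative only; `saveRDS()` stands in here for whatever serialization `saveCache()` actually uses):

```r
# Serialize to a temporary file in the *same* directory as the final cache
# file, then rename it into place.  On most file systems a same-directory
# rename is atomic, so concurrent readers see either no file or a complete
# file -- never a half-written one.
saveAtomically <- function(object, pathname) {
  tmp <- tempfile(pattern = basename(pathname),
                  tmpdir = dirname(pathname), fileext = ".tmp")
  saveRDS(object, file = tmp)
  if (!file.rename(tmp, pathname)) {
    file.remove(tmp)
    stop("Failed to move temporary file into place: ", pathname)
  }
  invisible(pathname)
}
```

The temporary file has to live in the destination directory rather than `tempdir()`, since a rename across file systems is effectively a copy and no longer atomic.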

@HenrikBengtsson
Owner

Hi. Did you experience this as a problem or did you bring it up from code inspection of R.cache? If from experience, I'm really keen to hear more details about it, because that should then definitely be considered a bug.

What you're suggesting is what I often refer to as atomic writing, emulated by the fact that a file rename / move is as atomic as we get on any given file system. I use this almost everywhere possible in all of my packages (e.g. R.utils::saveObject()) and, if I could, I would love for all write-to-file functions in R to have this by default.

I think I have had this on the to-do list for R.cache as well (though I failed to find the note in a quick search). However, loadCache() was designed to ignore corrupt cache files, which may occur due to concurrent overwrites or for one of many other reasons (process terminations, power failures, ...). R.cache has used this "optimistic" approach for a very long time, and we and lots of others have been using it on compute clusters with shared NFS etc. through lots of computations (as has anyone using the Aroma Project framework, which makes heavy use of R.cache internally). Because of this, I'd argue that you should be fine using it already now.
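In practice the usage pattern looks roughly like this sketch (a hypothetical memoized function, not R.cache internals): loadCache() returns NULL whenever a usable cached value cannot be retrieved, so a corrupt file just degrades to a cache miss.

```r
library(R.cache)

cachedSlowSum <- function(x) {
  key <- list("slowSum", x)             # hypothetical cache key
  value <- loadCache(key = key)
  if (!is.null(value)) return(value)    # cache hit
  value <- { Sys.sleep(5); sum(x) }     # stand-in for an expensive computation
  saveCache(value, key = key)
  value
}
```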

I think I would be fine with adding a layer of "atomic" writing to saveCache(). The reason why I say "think" is that I should find my old notes about this to make sure I didn't make an argument to myself that it's better to just rely on the above "optimistic" approach. Something is telling me there's a reason why it's not there in R.cache but is in all of my other packages (but it could also be that it simply slipped my mind and wasn't really needed).

BTW, at first one might think you'll get better cache hits if one writes atomically, but I'm not sure about that. With atomic writing the file is either there or not, whereas with the current writing it's either there or corrupt; in both setups I think the chance of a cache hit is approximately the same.

Finally, note that on some file systems, such as shared NFS, there might be a delay of up to 30 seconds between one machine writing a file and another machine seeing it. I've seen this many times in many different places. For some example pointers, see #7 (comment).

Say hi to Tony from me.

PS. <shameful self promotion>Off topic, but since you say you're using parallel for parallelization, you might be interested in the future package, which allows you to do the same but scale it up to run on clusters etc. without changing a single line of code.</shameful self promotion>
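For instance, a minimal sketch (hypothetical example): the computation stays the same and only the plan() call decides where it runs.

```r
library(future)
plan(multisession)   # local cores; e.g. plan(cluster, workers = ...) targets a cluster instead

f <- future({
  Sys.sleep(1)       # stand-in for the real per-job computation
  42
})
value(f)             # blocks until the result is available
```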

@d-cameron
Author

> Did you experience this as a problem or did you bring it up from code inspection of R.cache?

Purely through code inspection. I checked for documentation and, failing that, checked the source code for explicit locking or defensive copies.

> BTW, at first one might think you'll get better cache hits if one writes atomically, but I'm not sure about that. With atomic writing the file is either there or not, whereas with the current writing it's either there or corrupt; in both setups I think the chance of a cache hit is approximately the same.

I'm not really that concerned about a cache miss on parallel operations - my concern was with removing the possibility of concurrent writes and corrupted reads, as I didn't want a job to fail due to caching issues.

> R.cache has used this "optimistic" approach for a very long time, and we and lots of others have been using it on compute clusters with shared NFS etc. through lots of computations (as has anyone using the Aroma Project framework, which makes heavy use of R.cache internally). Because of this, I'd argue that you should be fine using it already now.

I have ~ 11,000 jobs all sharing ~1,000 cache objects. I'll kick off a full run and see how it goes.

> you might be interested in the future package, which allows you to do the same but scale it up to run on clusters etc. without changing a single line of code.

I currently have a helper script that does a whole lot of `write("Rscript run.sh <params> | qsub", stdout())`, so that approach does appeal to me - thanks for the heads up.

Thanks for the prompt response :)
