
Features/cache archives #579

Merged: 28 commits merged into spack:develop on Jun 29, 2016

Conversation
Conversation

@scheibelp (Member)

This creates a cache directory for source archives under var/cache/ and avoids re-downloading source archives for package installs that use the same source (normally, e.g., installing the same package with different build options would trigger a re-download). This is likely to be useful primarily to people testing several different build configurations of the same package.

This does not deal with caching for source control repositories (i.e., it does not modify VCSFetchStrategy).

EDIT (3/23): when this PR was started it did not handle caching for VCSFetchStrategy, but it has since been updated to support it.
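Conceptually, the cache-first flow is: check the cache directory for the archive, copy it into the stage on a hit, and populate the cache after a successful download on a miss. A minimal sketch of that flow (the function and argument names here are illustrative, not Spack's actual API):

```python
import os
import shutil
import urllib.request


def fetch_with_cache(url, archive_name, cache_path, dest):
    """Cache-first fetch: reuse a cached archive if one exists,
    otherwise download it and keep a copy for future installs."""
    cached = os.path.join(cache_path, archive_name)
    if os.path.exists(cached):
        shutil.copy(cached, dest)              # cache hit: skip the download
        return
    urllib.request.urlretrieve(url, dest)      # cache miss: fetch upstream
    os.makedirs(cache_path, exist_ok=True)
    shutil.copy(dest, cached)                  # populate the cache
```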

…hStrategy check the cache and skip the download if the source is available there.
…ake sure that an archive without a checksum isn't placed there (this wouldn't cause an error but does waste space and might be confusing)
```diff
@@ -46,6 +46,8 @@
 stage_path = join_path(var_path, "stage")
 repos_path = join_path(var_path, "repos")
 share_path = join_path(spack_root, "share", "spack")
+cache_path = join_path(spack_root, "var", "cache")
```
Member:

Can you change this to var/spack/cache?

Member Author:

done

@scheibelp (Member, Author)

I'll continue along the line suggested in the comments (assuming that is of interest). I won't get a chance to keep going with it until Monday.

Thanks!

@tgamblin (Member)

Thanks! This actually looks simpler than I thought it was going to be. The mirror/stage/fetch logic is kind of convoluted b/c it started out as one class. But thanks for hacking on it!

…as a mirror. this should support both URLFetchStrategy as well as VCSFetchStrategy (the previous strategy supported only the former). this won't work until URLFetchStrategy.archive is updated
…ll causes a failure on the initial download)
@scheibelp (Member, Author)

I'm curious whether it is acceptable to change the expectations for the FetchStrategy.archive method: in particular, that it copy resources rather than move them. The latest commits are an attempt at the above suggestions (they depend on the suggested change, and there is a TODO added there to that effect).

@tgamblin (Member)

@scheibelp: I'm ok with making it copy instead of move. I think that would be fine.

I guess one question would be whether it makes sense to have the URLFetchStrategy fetch straight into the cache and untar from the cache. I think that might be good and would help LC save some space. That might take more work, though, so just copying into the cache would be a good start.
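Under the copy semantics discussed above, archive would leave the staged file in place; a sketch (the attribute names are assumptions, not the merged code):

```python
import shutil


class URLFetchStrategy(object):
    def __init__(self, archive_file):
        self.archive_file = archive_file  # path to the staged tarball

    def archive(self, destination):
        # Copy rather than move: the stage keeps its archive for the
        # in-progress install while the cache/mirror gets a duplicate.
        shutil.copy(self.archive_file, destination)
```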

…tests (just for git tests although the same concept applies to the other unit tests which are failing - namely those for svn and hg)
@scheibelp (Member, Author)

> just copying into the cache would be a good start.

I'll stick with that for now

I need to find a cleaner way to fix the failing unit tests (currently git/hg/svn_fetch, although the last commit handles git_fetch): perhaps move the cleanup I added in git_fetch.teardown to MockPackagesTest? Tests that don't inherit from it but directly involve the fetching/repo-management logic would remain fragile.

…got a better way to avoid test fragility but I'll add this for now)
…ject. tests may assign a mock cache but by default it is None (this will avoid any implicit caching behavior confusing unit tests)
@scheibelp (Member, Author)

OK, a bit of cleanup is needed, but I think the latest updates reduce test fragility: same approach as before, but cache users now refer to it as an object, so the test code can replace the cache with a mock.
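Roughly, treating the cache as a swappable object means production code calls it through a small interface that tests can replace; a sketch under assumed names:

```python
import os


class FetchCache(object):
    """Filesystem cache: stores archives under a root directory."""

    def __init__(self, root):
        self.root = root

    def store(self, fetcher, relative_dest):
        dest = os.path.join(self.root, relative_dest)
        fetcher.archive(dest)  # copy the staged archive into the cache


class MockCache(object):
    """Test double: accepts stores but never yields a hit, so unit
    tests see no implicit caching behavior."""

    def store(self, fetcher, relative_dest):
        pass


# A test's setup would then replace the module-level cache object,
# e.g. (hypothetical global): spack.fetch_cache = MockCache()
```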

@scheibelp (Member, Author)

@tgamblin you asked whether the implementation would support caching for ResourceStage: I added cache_local as a method to Stage and StageComposite (using the composite decorator), so this should be the case.

DIY stage resources are not cached (I think that is possible but didn't think it was needed).

The latest commits improve the robustness of the tests: as of yesterday, tests attempted to load from the filesystem cache, but now the cache provides a fetcher which can be mocked.
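A sketch of how cache_local might forward through the composite (structure assumed; the composite decorator presumably generates this forwarding automatically, so it is spelled out by hand here):

```python
class Stage(object):
    def __init__(self, fetcher, mirror_path, cache):
        self.fetcher = fetcher
        self.mirror_path = mirror_path
        self.cache = cache  # injected cache object (see MockCache above)

    def cache_local(self):
        # Store this stage's fetched source in the shared cache.
        self.cache.store(self.fetcher, self.mirror_path)


class StageComposite(object):
    def __init__(self, stages):
        self.stages = stages

    def cache_local(self):
        # Forward to every component stage, so resource stages
        # (and anything else in the composite) get cached too.
        for stage in self.stages:
            stage.cache_local()
```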

@citibeth (Member)

I think this will be a really nice feature, with very little downside.

HOWEVER... it struck me that we cannot trust that upstream authors won't change their tarballs while keeping the same name (I know, bad practice). Or they might rename the tarball while keeping the file contents the same.

Another problem is, what if two websites serve different tarballs with the same name? Or if one website distributes their tarballs with a really generic name, like 'download.tar.gz'?

For all those reasons... it seems that the "right" way to store things in the cache is by hashcode. I know it's cryptic, so put a user-readable part into the filename as well.

For example... if a website provides mylib-1.2.1.tar.gz, then it would go into the cache under a filename like:

14af3293442fb99-mylib-1.2.1.tar.gz
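For illustration, such a cache name could be derived from the file contents; the digest algorithm and truncation length here are arbitrary choices, not the PR's:

```python
import hashlib


def cache_filename(tarball_path, original_name):
    """Prefix the cached name with a digest of the file contents, so a
    renamed or silently re-rolled upstream tarball cannot collide."""
    with open(tarball_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return "{0}-{1}".format(digest[:15], original_name)
```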

@scheibelp (Member, Author)

If by hash you mean the digest in a package.py version directive, then I think I understand how this would fix things.

@tgamblin I think the appropriate way to implement that is modifying mirror_archive_filename to include the digest - does that sound agreeable?

In more detail:

> it struck me that we cannot trust that upstream authors won't change their tarballs while keeping the same name (I know, bad practice).

This is the case I imagine would be frustrating with the caching logic I've implemented so far: it would successfully download, and then the install would fail, complaining about mismatched checksums.

> Or they might rename the tarball while keeping the file contents the same.

This would potentially cause an unnecessary extra download but wouldn't behave incorrectly. I don't see how your proposal addresses it although IMO it is OK not to handle it.

> Another problem is, what if two websites serve different tarballs with the same name?

The current mirror logic creates a separate directory for each package.

> Or if one website distributes their tarballs with a really generic name, like 'download.tar.gz'?

This would result in the same error as the first case mentioned.

Thanks!

…g. if resource maintainer uses same name for each new version of a package)
@tgamblin (Member)

Oops, wrong button -- accidentally closed. Reopening.

@scheibelp (Member, Author)

> So the intent here is to be able to store multiple versions of an archive in a mirror when the same version has been released multiple times?

This is a side effect: my primary goal is to avoid the cache erroneously succeeding with an old checksum (since that short-circuits the fetching logic). A couple of other possibilities to achieve this:

  • Mirrors could use a subclass of URLFetchStrategy that does checksum verification as part of the download
  • It could instead be expected that the package.py maintainer add new versions to identify the hash changes (in which case the latest commits would not be required).

The first alternative avoids preserving archives that will likely never be used.

> Obviously, the hash should still be verified, though.

This doesn't interfere with checksum verification (moreover, the archive is not cached unless verification succeeds).

> it would be nice if the human-readable part of the archive came first

done

> I think it would also be good if the cipher name were included in the filename

done (added 'digests' property to Package, removed 'file_hash' property from FetchStrategy)
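Putting those two requests together, the cached filename would carry the readable name first, then the cipher name and digest; a hypothetical sketch, not necessarily the exact format the PR settled on:

```python
def mirror_archive_filename(name, version, cipher, digest, extension):
    # Human-readable part first, then cipher name and digest, e.g.
    # "mylib-1.2.1-md5-14af3293442fb99.tar.gz" (digest illustrative).
    return "{0}-{1}-{2}-{3}.{4}".format(name, version, cipher, digest, extension)


print(mirror_archive_filename("mylib", "1.2.1", "md5", "14af3293442fb99", "tar.gz"))
```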

@adamjstewart (Member)

This PR is of interest to me. Currently, I need to install 24 different variants for each version of HDF5, so not having to re-download it would be great.

There are a few points I'm confused about. Does this just create a default mirror within Spack, so that whenever you download something, it first checks the mirror, and if it's not there it downloads it from the URL and stores it in the mirror? If so, that would be amazing.

I'm currently working on a few packages like PGI and CUDA that need to be downloaded by hand. If I could run a one-line Spack command to add them to this default repo, it would save me a lot of time. Of course, one of the problems with these packages is that they have different files for each OS/arch, each with different names and md5s. This may be outside the scope of this PR, but it is something to keep in mind.

@scheibelp (Member, Author)

> Does this just create a default mirror within Spack, so that whenever you download something, it first checks the mirror, and if it's not there it downloads it from the URL and stores it in the mirror?

Yes

> I'm currently working on a few packages like PGI and CUDA that need to be downloaded by hand. If I could run a one-line Spack command to add them to this default repo, it would save me a lot of time.

That would be cool, but I'd prefer to defer it to a separate PR. I'm thinking a command could take a spec and a file and place that file in the default mirror. Actually, come to think of it, all you should have to do is provide the file and the package name (vs. a full spec with version etc.), since Spack could automatically match the file hash to the version. Does that sound reasonable?

> Of course, one of the problems with these packages is that they have different files for each OS/arch, each with different names and md5s.

Perhaps this could be encoded into the version?
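The hash-to-version matching idea could look roughly like this (a hypothetical helper, not part of this PR):

```python
import hashlib


def match_version_by_md5(package_versions, archive_path):
    """Given {'7.4.2': '<md5 hex>', ...} from a package.py, return the
    version whose recorded md5 matches the local file, else None."""
    with open(archive_path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    for version, digest in package_versions.items():
        if digest == md5:
            return version
    return None
```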

@adamjstewart (Member)

@scheibelp That sounds good to me. I'll save these ideas for a later PR.

@citibeth (Member) commented Apr 1, 2016

> I'm currently working on a few packages like PGI and CUDA that need to be downloaded by hand. If I could run a one-line Spack command to add them to this default repo, it would save me a lot of time.
>
> That would be cool, but I'd prefer to defer it to a separate PR. I'm thinking a command could take a spec and a file and place that file in the default mirror. Actually, come to think of it, all you should have to do is provide the file and the package name (vs. a full spec with version etc.), since Spack could automatically match the file hash to the version. Does that sound reasonable?

Sounds good to me. However... we will want to keep Spack-downloaded files separate from these handcrafted-downloaded files. The first we'll be willing to blow away from time to time, and the second we'll want to keep around and backed up.

-- Elizabeth

@scheibelp (Member, Author)

I should provide an update on this: at the 3/31 Spack meeting, I expanded on my above response to Todd:

> So the intent here is to be able to store multiple versions of an archive in a mirror when the same version has been released multiple times?
>
> This is a side effect: my primary goal is to avoid the cache erroneously succeeding with an old checksum (since that short-circuits the fetching logic). A couple of other possibilities to achieve this:

To elaborate on the issue I was attempting to address (by adding hashes to the mirror archive filename), I'll provide a concrete scenario:

  • User installs package X with archive file named Y and contents Z
  • Archive Y changes to contents Z' with a new hash (i.e. no change in version or archive name)
  • User updates package.py for X with new digest (replacing old digest with new)
  • User tries installing package X

Without using the digest to create the mirror archive filename, the last step will retrieve the out-of-date file from the mirror and then fail the installation with a checksum error. Using the hash as part of the filename has other implications, though, like persisting the archives with the old hash; this implies they would be useful at some point in the future, which IMO is not the case (and if it were, they likely ought to be encoded as versions in the package.py file). I proposed switching to an approach I discussed earlier:

> Mirrors could use a subclass of URLFetchStrategy that does checksum verification as part of the download

Todd mentioned he wanted to think on that so I've held off on further development in the meantime. FWIW I think this will work and that there are no other issues in the way of merging this PR.
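The proposed alternative, sketched below: verify the checksum while serving an archive from the cache, and treat a mismatch as a cache miss so stale files are discarded rather than failing the install (the names here are assumptions):

```python
import hashlib
import os


class StaleCacheError(Exception):
    """A cached archive no longer matches its expected checksum."""


def fetch_from_cache(cached_path, expected_md5):
    """Serve an archive from the cache only if its checksum still
    matches; otherwise discard it so the caller falls back to the
    upstream URL instead of failing the install."""
    with open(cached_path, "rb") as f:
        actual = hashlib.md5(f.read()).hexdigest()
    if actual != expected_md5:
        os.remove(cached_path)  # stale: drop it rather than keep copies
        raise StaleCacheError(cached_path)
    return cached_path
```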

@citibeth (Member)

Any progress on this? The server for MUMPS went down today; I don't know for how long. I found another URL for the same tarball and kept going by changing the MUMPS package.py. But this feature, turned on routinely, would have saved me a bit of grief.

@scheibelp (Member, Author)

No, sorry. Let me check in w/ Todd. It slipped my mind to ask him about it the last couple of times we've talked.

@davydden (Member) commented May 9, 2016

And the Trilinos website is down today, so at the moment one needs to hack in another download source and a different hash. So the feature is indeed very welcome 😄 👍

…lculate and compare the checksum. This achieves the original goal of discarding stale cache files without preserving multiple files for the same version.
… [1] to conditionally cache resource: only save it if there is a feature which identifies it uniquely (for example do not cache a repository if it pulls the latest state vs. a particular tag/commit)
…used to manage all mirror URLs - just the cache (the specific behavior that a URL may refer to a stale resource doesn't necessarily apply to mirrors)
@tgamblin merged commit 1dc62e8 into spack:develop on Jun 29, 2016