use faster algo for digest in downloadData #295
Closed
Note that using a faster hashing algorithm means we can hash the entire file rather than simply checking the first |
Also note that changing the algorithm requires updating the checksums in all modules that download data! (But only a single checksum is needed per file, unlike #230.)
UPDATE: added xxhash64 and re-ran the benchmarks above.
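A minimal sketch of why stored checksums must be regenerated when the algorithm changes, and of what a whole-file check looks like with `digest`. The helper `verify_file()` and the throwaway file are hypothetical and only for illustration; they are not part of SpaDES.

```r
library(digest)

# Hypothetical helper (not part of SpaDES): hash the whole file with a fast,
# non-cryptographic algorithm and compare it to a recorded checksum.
verify_file <- function(path, expected, algo = "xxhash64") {
  actual <- digest(path, algo = algo, file = TRUE)  # file = TRUE hashes the file's contents
  identical(actual, expected)
}

# Throwaway file standing in for a downloaded data file
tmp <- tempfile(fileext = ".txt")
writeLines(c("some", "downloaded", "data"), tmp)

# Checksums from different algorithms are not interchangeable, so a
# checksum recorded with md5 cannot validate a file hashed with xxhash64.
md5_sum <- digest(tmp, algo = "md5", file = TRUE)
xxh_sum <- digest(tmp, algo = "xxhash64", file = TRUE)
same_value <- identical(md5_sum, xxh_sum)

# A single recorded checksum per file is enough to detect a bad download.
ok_before <- verify_file(tmp, xxh_sum)
cat("corrupted\n", file = tmp, append = TRUE)  # simulate a damaged file
ok_after <- verify_file(tmp, xxh_sum)
```

Here `ok_before` is `TRUE` and `ok_after` is `FALSE`, while `same_value` is `FALSE` because the two algorithms produce unrelated values of different lengths.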
STEPS TO UPDATE YOUR CHECKSUM FILES:
## 0. load the package providing downloadData(), simInit(), and checksums()
library(SpaDES)  # these functions later moved to SpaDES.core
## 1. specify your module here
moduleName <- "my_module"
## 2. use a temp dir to ensure all modules get fresh copies of the data
tmpdir <- file.path(tempdir(), "SpaDES_modules")
## 3. download your module's data to the temp dir
downloadData(moduleName, tmpdir)
## 4. initialize a dummy simulation to ensure any 'data prep' steps in the .inputObjects section are run
simInit(modules = moduleName)
## 5. recalculate your checksums and overwrite the file
checksums(moduleName, tmpdir, write = TRUE)
## 6. copy the new checksums file to your working module directory (the one not in the temp dir)
file.copy(from = file.path(tmpdir, moduleName, 'data', 'CHECKSUMS.txt'),
          to = file.path('path/to/my/moduleDir', moduleName, 'data', 'CHECKSUMS.txt'),
          overwrite = TRUE)
Can we put this in the help manual for the `checksums` function?
Yes, I will flesh out the docs.
achubaty added a commit that referenced this issue on Sep 12, 2016
**bonus:** Windows and Unixy systems both give same value :)
achubaty added a commit that referenced this issue on Sep 12, 2016
@eliotmcintire I did not implement the change in |
achubaty added a commit to PredictiveEcology/SpaDES.core that referenced this issue on Jun 18, 2018
@eliotmcintire and @YongLuo007: per my discussion with Eliot yesterday, the `digest` package provides crc32, xxhash32, and xxhash64, all of which are much faster than md5 and sha1. Because we only care about error detection in downloaded files rather than detecting malicious file modifications, we probably don't need to worry about using a cryptographic hash.

Some quick benchmarks using files on a hard drive (not SSD):
Obviously the hashing speed is still going to be I/O limited as the files are read from disk, but using a faster algorithm like xxhash should make a big difference in the time taken to check downloaded data files.
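For anyone wanting to repeat this kind of comparison, here is a rough sketch of how the timing could be done with `digest` and `system.time()`. The test file, its size, and the variable names are assumptions for illustration; this is not the benchmark code used for this issue, and timings will depend on the disk and on OS file caching.

```r
library(digest)

# Throwaway test file of ~10 MB of random bytes (size chosen arbitrarily)
tmp <- tempfile()
writeBin(as.raw(sample(0:255, 1e7, replace = TRUE)), tmp)

algos <- c("md5", "sha1", "crc32", "xxhash32", "xxhash64")

# Elapsed seconds to hash the whole file with each algorithm
timings <- vapply(algos, function(a) {
  system.time(digest(tmp, algo = a, file = TRUE))[["elapsed"]]
}, numeric(1))

print(sort(timings))
unlink(tmp)
```

Because the file is read from disk on every call, the first algorithm timed may pay the I/O cost while later ones hit the OS cache, so running each algorithm several times on large real data files gives a fairer picture.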