use faster algo for digest in downloadData #295

Closed
achubaty opened this Issue Aug 10, 2016 · 7 comments

@achubaty
Contributor

achubaty commented Aug 10, 2016

@eliotmcintire and @YongLuo007: per my discussion with Eliot yesterday, the digest package provides crc32, xxhash32, and xxhash64, all of which are much faster than md5 and sha1. Because we only care about error detection in downloaded files rather than detecting malicious file modifications, we probably don't need to worry about using a cryptographic hash.
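For illustration, a minimal sketch of calling `digest::digest()` with the candidate algorithms (requires the `digest` package from CRAN; the temp file here is just a stand-in for a real downloaded data file):

```r
## minimal sketch: compare output of the candidate algorithms on the same file
## (the temp file stands in for a real downloaded data file)
library(digest)

f <- tempfile()
writeBin(as.raw(rep(0:255, 1000)), f)

h_xx64 <- digest(object = f, file = TRUE, algo = "xxhash64")  # 64-bit -> 16 hex chars
h_xx32 <- digest(object = f, file = TRUE, algo = "xxhash32")  # 32-bit ->  8 hex chars
h_md5  <- digest(object = f, file = TRUE, algo = "md5")       # 128-bit -> 32 hex chars
```

The shorter non-cryptographic digests are plenty for detecting corrupted or truncated downloads, which is all we need here.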

Some quick benchmarks using files on a hard drive (not SSD):

## 6.0 GB file
Unit: seconds
                                                       expr       min        lq      mean    median        uq       max neval
 digest::digest(object = f, file = TRUE, algo = "xxhash32")  2.829808  2.853730  2.876551  2.867893  2.899608  2.924287    10
 digest::digest(object = f, file = TRUE, algo = "xxhash64")  2.174798  2.192070  2.224008  2.205877  2.236510  2.375106    10
    digest::digest(object = f, file = TRUE, algo = "crc32")  7.770354  7.789431  8.769517  7.811685  7.847242 17.386918    10
      digest::digest(object = f, file = TRUE, algo = "md5") 17.963256 18.046148 18.156453 18.162494 18.198804 18.535671    10
     digest::digest(object = f, file = TRUE, algo = "sha1") 26.554875 26.586677 26.707898 26.706250 26.739184 26.923447    10

## 3.3 GB file
Unit: seconds
                                                       expr       min        lq      mean    median        uq       max neval
 digest::digest(object = g, file = TRUE, algo = "xxhash32")  1.579314  1.587872  1.806039  1.643914  1.730626  2.533955    10
 digest::digest(object = g, file = TRUE, algo = "xxhash64")  1.224522  1.227078  2.384854  1.258772  2.158726 10.694350    10
    digest::digest(object = g, file = TRUE, algo = "crc32")  4.360705  4.393467  4.560576  4.435819  4.641302  5.277457    10
      digest::digest(object = g, file = TRUE, algo = "md5") 10.058111 10.108590 10.330602 10.152135 10.206027 11.268888    10
     digest::digest(object = g, file = TRUE, algo = "sha1") 14.843534 15.044012 17.022944 15.555446 15.881427 31.102097    10

## 1.8 GB file
Unit: milliseconds
                                                       expr       min        lq      mean    median        uq        max neval
 digest::digest(object = h, file = TRUE, algo = "xxhash32")  878.0688  884.6397  900.5384  892.7122  905.5935   943.5876    10
 digest::digest(object = h, file = TRUE, algo = "xxhash64")  681.1487  686.9948 2124.2758  693.2693  702.9555 15001.0616    10
    digest::digest(object = h, file = TRUE, algo = "crc32") 2412.7605 2419.1621 2451.3399 2428.8509 2452.9748  2588.1020    10
      digest::digest(object = h, file = TRUE, algo = "md5") 5562.2937 5598.7545 5635.6798 5607.4553 5625.8232  5852.3422    10
     digest::digest(object = h, file = TRUE, algo = "sha1") 8241.1719 8271.6608 8353.2186 8319.1521 8417.2380  8619.4000    10

Obviously the hashing speed is still going to be I/O limited as the files are read from disk, but using a faster algorithm like xxhash should make a big difference in the time taken to check downloaded data files.


achubaty commented Aug 10, 2016

Note that using a faster hashing algorithm means we can hash the entire file rather than simply checking the first 3e7 bytes as in https://github.com/PredictiveEcology/SpaDES/blob/development/R/module-repository.R#L434.
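As a sketch (not the actual module-repository.R code), the contrast between the old partial-file approach and a full-file xxhash64 digest looks roughly like this, assuming the `digest` package:

```r
## sketch only: contrast the old partial-file approach with a full-file digest
library(digest)

f <- tempfile()
writeBin(as.raw(sample(0:255, 5e4, replace = TRUE)), f)

## old-style partial digest: hash only the first (up to) 3e7 bytes of the file
con <- file(f, "rb")
bytes <- readBin(con, what = "raw", n = 3e7)
close(con)
partial <- digest(bytes, algo = "md5", serialize = FALSE)

## full-file digest using a fast non-cryptographic algorithm
full <- digest(object = f, file = TRUE, algo = "xxhash64")
```

Hashing the whole file closes the window where corruption past the first 3e7 bytes would go undetected.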

@achubaty achubaty changed the title from use faster algo for digest in downloadData to use faster algo for digest in downloadData and checkpoint Aug 10, 2016


achubaty commented Aug 10, 2016

Also note that changing the algo requires updating the checksums in all modules that download data! (But only a single checksum is needed per file, unlike #230.)


achubaty commented Aug 10, 2016

UPDATE: added xxhash64 and re-ran benchmarks above

@achubaty achubaty modified the milestone: v1.3.0 Sep 7, 2016


achubaty commented Sep 12, 2016

STEPS TO UPDATE YOUR CHECKSUM FILES:

## 1. specify your module here
moduleName <- "my_module"

## 2. use a temp dir to ensure all modules get fresh copies of the data
tmpdir <- file.path(tempdir(), "SpaDES_modules")

## 3. download your module's data to the temp dir
downloadData(moduleName, tmpdir)

## 4. initialize a dummy simulation to ensure any 'data prep' steps in the .inputObjects section are run
simInit(modules = moduleName)

## 5. recalculate your checksums and overwrite the file
checksums(moduleName, tmpdir, write = TRUE)

## 6. copy the new checksums file to your working module directory (the one not in the temp dir)
file.copy(from = file.path(tmpdir, moduleName, "data", "CHECKSUMS.txt"),
          to = file.path("path/to/my/moduleDir", moduleName, "data", "CHECKSUMS.txt"),
          overwrite = TRUE)

eliotmcintire commented Sep 12, 2016

Can we put this in the help manual for the checksums() function?


achubaty commented Sep 12, 2016

yes, I will flesh out the docs

achubaty added a commit that referenced this issue Sep 12, 2016

use 'xxhash64' instead of 'md5' for checksums (close #295)
**bonus:** Windows and Unixy systems both give same value :)

achubaty added a commit that referenced this issue Sep 12, 2016

@achubaty achubaty changed the title from use faster algo for digest in downloadData and checkpoint to use faster algo for digest in downloadData Sep 12, 2016


achubaty commented Sep 12, 2016

@eliotmcintire I did not implement the change in checkpoint, only in downloadData.

@achubaty achubaty closed this Sep 13, 2016

achubaty added a commit to PredictiveEcology/SpaDES.core that referenced this issue Jun 18, 2018
