Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clean up and export crc32c function #22274

Merged
merged 14 commits into from Jun 13, 2017
Merged

clean up and export crc32c function #22274

merged 14 commits into from Jun 13, 2017

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Jun 7, 2017

As discussed in #21154, this exports and documents the crc32c function for CRC-32c checksums, which we already use internally to validate .ji files.

You can do crc32c(data) on an Array{UInt8}, a contiguous subarray thereof. Originally, I also allowed data to be a String, but I removed this. Since the CRC of a string is encoding-dependent, it seemed better to require the caller to explicitly do crc32c(Vector{UInt8}(s)). data can also be a String or an IO stream, with an optimized method for IOBuffer.

Since computing the CRC of a file is a common operation and doing it efficiently is easy to get wrong (the fastest way is to use mmap a sequence ofreadbytes! calls), I included a crc32c(io, numbytes) methods. As is noted in the documentation, you can checksum an entire file efficiently with open(crc32c, filename).

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jun 7, 2017

the fastest way is to use mmap

I would prefer we don't make this the default. It shouldn't be much faster (the bottleneck is typically the hardware fetch speed. If it is much slower, we should probably investigate our IO speed further.). However, mmap can also be much slower and reveal surprising bugs (if you try to try to open something that isn't a local file). And any failure to read the file (networking error, concurrent file modification, etc.) can cause abrupt termination (usually reported as a SEGV) rather than being able to report failures as Julia errors with backtraces.

@stevengj
Copy link
Member Author

stevengj commented Jun 7, 2017

@vtjnash, okay. I just verified that reading the file in 16k chunks is actually slightly faster than mmap anyway.

@stevengj
Copy link
Member Author

stevengj commented Jun 8, 2017

Fixed the documentation build.

@yuyichao
Copy link
Contributor

yuyichao commented Jun 8, 2017

Can we use open(crc32c, file) instead of crc32c(file)? I think crc on string is actually useful...

@stevengj
Copy link
Member Author

stevengj commented Jun 8, 2017

Sure, will do.

…optimized open(crc32c, filename), make IOBuffer checksums consistent with other streams
base/util.jl Outdated
buf = Array{UInt8}(min(nb, 16384))
while !eof(f) && nb > 16384
n = readbytes!(f, buf)
while !eof(io) && nb > 16384
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be 8192 * 3 ? That's the LONG block size used in the sse4.2 version (and also on ARM in one of my up coming change)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I tried 16384 and 32768 and the latter wasn't any faster on my machine, but 8192 * 3 is fine too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be catched by the short version so it won't matter too much but in principle 8192 * 2 and 8192 * 4 are equally bad since neither of them makes full use of the LONG loop.

@stevengj
Copy link
Member Author

stevengj commented Jun 9, 2017

Whoops, looks like I forgot to push my copy-free IOBuffer method; will add that tomorrow. (Done.)

@stevengj
Copy link
Member Author

stevengj commented Jun 9, 2017

Tests look good; there was a Travis failure in testset spawn probably due to #17626, so I restarted that one.

@stevengj
Copy link
Member Author

stevengj commented Jun 12, 2017

Will squash/merge in a day or two if there are no further objections.

@stevengj stevengj merged commit 84300a6 into JuliaLang:master Jun 13, 2017
@stevengj stevengj deleted the crc branch June 13, 2017 14:50
@stevengj stevengj mentioned this pull request Jun 13, 2017
2 tasks
quinnj pushed a commit that referenced this pull request Jun 20, 2017
* clean up and export crc32c function

* added PR to NEWS

* restore crc32 of String, add crc32c(io) to read all of a stream, add optimized open(crc32c, filename), make IOBuffer checksums consistent with other streams

* use crc32c block size of 8192*3, matching the underling C library

* optimized IOBuffer crc32c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants