A small (prerelease) tool for creating and searching Golomb Compressed Sets.
Golomb Compressed Sets are similar to Bloom filters - they're space-efficient data structures that let you test whether a given element is a member of a set.
Like Bloom filters, they have a controllable rate of false positives - they may consider an element a member of the set even if it's never been seen - while having no false negatives. If a GCS hasn't seen it, it's not on the list.
Let's illustrate with a real-world problem: checking against leaked password lists.
Say you have a copy of pwned-passwords-2.0.txt, and you want to check your users' passwords against the list when they sign up. Unfortunately it's nearly 30 GB, and not in a format particularly suited to searching.
You can cut it down quite a bit simply by converting the SHA-1 hashes in it to binary, truncating them in the process, but even being really aggressive and spending just 6 bytes per entry (for a false-positive rate of about 1 in 500,000) only gets you down to around 3 GB.
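To make that concrete, here's a rough sketch of the preprocessing (illustrative, not part of gcstool) that strips each pwned-passwords line down to a 6-byte binary prefix. At 48 bits, a random password collides with one of ~500 million truncated hashes with probability n/2^48, or roughly 1 in 560,000:

```rust
use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = io::BufWriter::new(stdout.lock());

    for line in stdin.lock().lines() {
        let line = line?;
        // Each pwned-passwords line looks like "<40 hex chars>:<count>".
        let hex = line.split(':').next().unwrap_or("");
        // Keep only the first 6 bytes (12 hex digits) of each SHA-1 hash.
        let mut prefix = [0u8; 6];
        for (i, pair) in hex.as_bytes().chunks(2).take(6).enumerate() {
            let pair = std::str::from_utf8(pair).expect("ASCII hex");
            prefix[i] = u8::from_str_radix(pair, 16).expect("valid hex");
        }
        out.write_all(&prefix)?;
    }
    Ok(())
}
```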
Let's see what this tool can do:
```
# Use precomputed hashes directly with -H hex
% gcstool -H hex create -p 500000 pwned-passwords-2.0.txt pwned-passwords-2.0-p500k.gcs
(5 minutes and 3.8 GiB of RAM later)
% du -h pwned-passwords-2.0-p500k.gcs
1.2G    pwned-passwords-2.0-p500k.gcs
```
An equivalent Bloom filter would consume 1.6 GiB. Not too shabby.
But 1 in 500,000 passwords being randomly rejected for no good reason is a bit crap. Your users deserve better than that, surely. You could double-check against the Pwned Passwords API, but then you're leaking a little about your users' passwords to a third party, and that makes me itch. What if we could just push the false-positive rate down until it's negligible?
```
% gcstool -H hex create -p 50000000 pwned-passwords-2.0.txt pwned-passwords-2.0-p50m.gcs
(5 minutes and 3.8 GiB of RAM later)
% du -h pwned-passwords-2.0-p50m.gcs
1.6G    pwned-passwords-2.0-p50m.gcs
```
Contrast this with a 2.15 GiB Bloom filter.
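Both figures line up with the standard estimates: an optimal Bloom filter needs about n * ln(1/p) / ln(2)^2 bits, while a GCS sits near the information-theoretic floor at roughly n * (log2(1/p) + 1.5) bits per element. A quick sanity check (illustrative arithmetic, not gcstool code):

```rust
// Back-of-the-envelope sizes for n items at a 1-in-p false-positive rate,
// using the standard Bloom filter and GCS size formulas.
fn main() {
    let n = 501_636_842f64; // entries in pwned-passwords-2.0.txt
    for p in [500_000f64, 50_000_000f64] {
        let bloom_bits = n * p.ln() / 2f64.ln().powi(2);
        let gcs_bits = n * (p.log2() + 1.5);
        let gib = |bits: f64| bits / 8.0 / 1024f64.powi(3);
        println!(
            "1 in {:>9}: Bloom ~{:.2} GiB, GCS ~{:.2} GiB",
            p,
            gib(bloom_bits),
            gib(gcs_bits)
        );
    }
}
```

That predicts 1.59 GiB vs 1.19 GiB at 1 in 500,000, and 2.15 GiB vs 1.58 GiB at 1 in 50,000,000 - right in line with the files above.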
We can now query the database for known passwords:
```
% gcstool query pwned-passwords-2.0-p50m.gcs
Ready for queries on 501636842 items with a 1 in 50000000 false-positive rate. ^D to exit.
> password
Found in 0.3ms
> not a leaked password
Not found in 1.7ms
> I love dogs
Found in 0.7ms
> I guess it works
Not found in 0.1ms
```
Integrating it into your website is left as an exercise for now. Eventually I'll factor this out into a library and sort out a Rubygem.
Golomb Compressed Sets are surprisingly simple:
Choose a false-positive rate p. For each of your n elements, hash it to an integer mod n*p. Then sort the list of integers, and convert each entry to a diff from the previous one, so you end up with a list of offsets.
This is the set - just a list of offsets. You can prove to yourself that this works because you've distributed your n items across n*p buckets, so any given bucket has only about a 1-in-p chance of being filled by accident over all n items.
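Here's a minimal sketch of that construction, using std's DefaultHasher where gcstool's actual hash function may well differ (a real implementation also has to deal with duplicate values):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash an element to a uniform 64-bit integer; a stand-in for whatever
// hash gcstool really uses.
fn hash_to_u64(item: &str) -> u64 {
    let mut h = DefaultHasher::new();
    item.hash(&mut h);
    h.finish()
}

// Hash every element into [0, n*p), sort, and keep only the gaps
// between consecutive values.
fn build_offsets(items: &[&str], p: u64) -> Vec<u64> {
    let n = items.len() as u64;
    let mut values: Vec<u64> = items.iter().map(|s| hash_to_u64(s) % (n * p)).collect();
    values.sort_unstable();
    let mut prev = 0u64;
    values
        .into_iter()
        .map(|v| {
            let diff = v - prev;
            prev = v;
            diff
        })
        .collect()
}
```

Membership testing is then just re-hashing the candidate and walking through the cumulative sums looking for an exact match.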
Now it's just a matter of storing each offset efficiently. We use a special case of Golomb coding called Rice coding - divmod by p (rounded to a power of two), store the quotient in unary with a terminating zero (0 = 0, 1 = 10, 2 = 110, 3 = 1110, etc), and store the remainder in log2(p) bits.
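A sketch of encoding a single offset this way, collecting bits into a Vec<bool> for clarity where a real encoder would pack them into bytes:

```rust
// Rice-code `value`: quotient in unary (ones, then a terminating zero),
// remainder in exactly `log2_p` binary bits. Assumes p has been rounded
// to a power of two, 2^log2_p.
fn rice_encode(value: u64, log2_p: u32, out: &mut Vec<bool>) {
    let quotient = value >> log2_p;
    let remainder = value & ((1u64 << log2_p) - 1);
    for _ in 0..quotient {
        out.push(true);
    }
    out.push(false);
    // Most significant remainder bit first.
    for i in (0..log2_p).rev() {
        out.push((remainder >> i) & 1 == 1);
    }
}
```

Decoding mirrors it: count ones until the zero terminator to recover the quotient, then read log2(p) bits for the remainder.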
Then we simply store an index indicating what and where every Nth element is, so we can jump into the middle of the set to find an arbitrary item. In this version of the GCS, we store pairs of 64-bit integers indicating the element's value and its bit offset.
More complex index encoding is possible to make it smaller, but since a default index for even 500 million items is just 16MB it barely seems worth the effort.
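For illustration, here's how a lookup might use that index; the structure names are hypothetical, not gcstool's actual types:

```rust
// One checkpoint per N entries: the cumulative value of that element and
// the bit position where its encoding starts.
struct IndexEntry {
    value: u64,
    bit_offset: u64,
}

// Find the last checkpoint at or before `target`, so decoding can start
// there instead of at the beginning of the bit stream.
fn seek(index: &[IndexEntry], target: u64) -> (u64, u64) {
    match index.binary_search_by(|e| e.value.cmp(&target)) {
        Ok(i) => (index[i].value, index[i].bit_offset),
        Err(0) => (0, 0), // target precedes the first checkpoint
        Err(i) => (index[i - 1].value, index[i - 1].bit_offset),
    }
}
```

From the returned checkpoint, a query Rice-decodes offsets and accumulates them until it either hits the target value (found) or passes it (not found).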