# gcstool

A small (prerelease) tool for creating and searching Golomb Compressed Sets.

## What?

Golomb Compressed Sets are similar to Bloom filters - they're space-efficient data structures that let you test whether a given element is a member of a set.

Like Bloom filters, they have a controllable rate of false positives - they may consider an element a member of the set even if it's never been added - while having no false negatives. If a GCS says it hasn't seen an element, that element really isn't on the list.

## Why?

Let's illustrate with a real-world problem: checking against leaked password lists.

Say you have a copy of pwned-passwords-2.0.txt, and you want to check your users' passwords against the list when they sign up. Unfortunately it's nearly 30 GB, and not in a format particularly suited for searching.

You can cut it down quite a bit simply by converting the SHA1 hashes in it to binary, and even truncating them in the process, but even being really aggressive and spending just 6 bytes per entry (for a false-positive rate of 1 in 500,000) only gets you down to 3 GB.
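Where does that 1 in 500,000 come from? A 6-byte prefix has 2^48 possible values, so with roughly 500 million list entries a random unrelated password collides with one of them about once in 560,000 tries. A back-of-the-envelope check (nothing to do with gcstool itself):

```rust
fn main() {
    let entries: f64 = 501_636_842.0; // lines in pwned-passwords-2.0.txt
    let space = (2f64).powi(48);      // distinct 6-byte (48-bit) hash prefixes
    // The chance that a random 48-bit value hits any entry is roughly
    // entries / space, i.e. about 1 in 561,000.
    println!("false-positive rate: 1 in {:.0}", space / entries);
}
```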

Let's see what this tool can do:

```
# Use precomputed hashes directly with -H hex
% gcstool -H hex create -p 500000 pwned-passwords-2.0.txt pwned-passwords-2.0-p500k.gcs
(5 minutes and 3.8 GiB of RAM later)
% du -h pwned-passwords-2.0-p500k.gcs
1.2G    pwned-passwords-2.0-p500k.gcs
```

An equivalent Bloom filter would consume 1.6 GiB. Not too shabby.

But 1 in 500,000 passwords being randomly rejected for no good reason is a bit crap. Your users deserve better than that, surely. You could double-check against the Pwned Passwords API, but then we're leaking a little bit about our users' passwords to a third party, and that makes me itch. What if we could just drop the error rate to something negligible?

```
% gcstool -H hex create -p 50000000 pwned-passwords-2.0.txt pwned-passwords-2.0-p50m.gcs
(5 minutes and 3.8 GiB of RAM later)
% du -h pwned-passwords-2.0-p50m.gcs
1.6G    pwned-passwords-2.0-p50m.gcs
```

Contrast this with a 2.15 GiB Bloom filter.

We can now query the database for known passwords:

```
% gcstool query pwned-passwords-2.0-p50m.gcs
Ready for queries on 501636842 items with a 1 in 50000000 false-positive rate.  ^D to exit.
> password
Found in 0.3ms
> not a leaked password
Not found in 1.7ms
> I love dogs
Found in 0.7ms
> I guess it works
Not found in 0.1ms
```

Yay.

Integrating it into your website is left as an exercise for now. Eventually I'll factor this out into a library and sort out a Rubygem.

## How?

Golomb Compressed Sets are surprisingly simple:

Choose a false-positive rate of 1 in p. For each of your n elements, hash it to an integer mod n*p. Then sort the list of integers and convert each entry into a diff from the previous one, so you end up with a list of offsets.

This is the set - just a list of offsets. You can convince yourself this works because you've distributed your n items across n*p buckets, so an element that was never added lands in an occupied bucket with a probability of roughly 1 in p.
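As a rough sketch in Rust (not gcstool's actual code; `hash64` is a stand-in for whichever hash function you configure):

```rust
// Build the diff list described above. `hash64` stands in for whatever
// 64-bit hash you use (truncated SHA-256, Blake2b, ...).
fn build_offsets(items: &[&str], p: u64, hash64: impl Fn(&str) -> u64) -> Vec<u64> {
    let buckets = items.len() as u64 * p; // n items spread over n*p buckets

    // Hash every item into the range [0, n*p) and sort.
    let mut values: Vec<u64> = items.iter().map(|&s| hash64(s) % buckets).collect();
    values.sort_unstable();
    values.dedup(); // items that collide within the set collapse to one entry

    // Store each value as a diff from the one before it.
    let mut prev = 0;
    values
        .into_iter()
        .map(|v| {
            let d = v - prev;
            prev = v;
            d
        })
        .collect()
}
```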

Now it's just a matter of storing each offset efficiently. We use a special case of Golomb coding called Rice coding - divmod by p, store the quotient in unary (q ones followed by a terminating zero: 0 = 0, 1 = 10, 2 = 110, 3 = 1110, and so on), and store the remainder in log2(p) bits.
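Encoding one offset looks something like this (again a sketch, not gcstool's code; it assumes p has been rounded to a power of two, which is what makes the fixed-width remainder work):

```rust
// Rice-encode one offset into a bit vector. log2_p = log2(p), p a power of two.
fn rice_encode(value: u64, log2_p: u32, out: &mut Vec<bool>) {
    let q = value >> log2_p;                // quotient  = value / p
    let r = value & ((1u64 << log2_p) - 1); // remainder = value % p

    // Quotient in unary: q ones, then a terminating zero.
    for _ in 0..q {
        out.push(true);
    }
    out.push(false);

    // Remainder in exactly log2(p) bits, most significant bit first.
    for i in (0..log2_p).rev() {
        out.push(((r >> i) & 1) == 1);
    }
}
```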

Then we store an index recording what and where every nth element is, so we can jump into the middle of the set instead of decoding from the start. In this version of the GCS the index is simply pairs of 64-bit integers: the element's value and its offset into the encoded data.
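A query then looks roughly like this (a sketch only, using a hypothetical in-memory layout rather than gcstool's on-disk format; `rice_decode` is the mirror of the encoder above):

```rust
// One index entry per nth element: its value and where the *next* encoded
// diff starts in the bit stream. Hypothetical layout, for illustration.
struct IndexEntry {
    value: u64,
    bit_pos: usize,
}

// Mirror of rice_encode: read one offset starting at `pos`.
fn rice_decode(bits: &[bool], pos: &mut usize, log2_p: u32) -> u64 {
    let mut q = 0u64;
    while bits[*pos] {
        q += 1;
        *pos += 1;
    }
    *pos += 1; // skip the terminating zero

    let mut r = 0u64;
    for _ in 0..log2_p {
        r = (r << 1) | (bits[*pos] as u64);
        *pos += 1;
    }
    (q << log2_p) | r
}

// Is `target` (already hashed mod n*p) in the set?
fn query(bits: &[bool], index: &[IndexEntry], log2_p: u32, target: u64) -> bool {
    // Jump to the last indexed element at or before the target...
    let (mut current, mut pos) = match index.binary_search_by_key(&target, |e| e.value) {
        Ok(_) => return true, // the target is itself an indexed element
        Err(0) => (0, 0),     // before the first index entry: decode from the start
        Err(i) => (index[i - 1].value, index[i - 1].bit_pos),
    };

    // ...then walk forward, adding up diffs, until we reach or pass it.
    while pos < bits.len() {
        current += rice_decode(bits, &mut pos, log2_p);
        if current >= target {
            return current == target;
        }
    }
    false
}
```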

A more compact index encoding is possible, but since the default index for even 500 million items is just 16 MB, it barely seems worth the effort.