Skip to content

Freaky/ruby-gcs

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
bin
 
 
lib
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

ruby-gcs

This is a small prerelease Ruby library for generating and querying Golomb Compressed Set databases, as produced by gcstool.

Golomb Compressed Sets are similar to Bloom filters - they're space-efficient data structures that let you test whether a given element is a member of a set.

Like Bloom filters, they have a controllable rate of false-positives - they may consider an element a member of a set even if it's never been seen before - while having no false negatives. If a GCS hasn't seen it, it's not on the list.

Their main benefit over Bloom filters is being a little more compact - particularly with larger lists and better false-positive rates.

Usage

ruby-gcs comes with two small command-line utilities to create and query GCS databases.

bin/create

% wc -l /usr/share/dict/words
  235924 words
% gzip --stdout -9 /usr/share/dict/words >words.gz && du -Ah words.gz
  737K    words.gz
% bin/create 10000000 /usr/share/dict/words words-p10m.gcs
% du -Ah words-p10m.gcs
  741K    words-p10m.gcs

So, about the same size as a gzip -9 file, at the expense of 1 in every 10 million queries for words not in the dictionary being "found".

bin/query

% bin/query words-p10m.gcs
abiogenesis
Found in 0.77ms
llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch
Not found in 1.73ms

Refer to these to see how the API works.

The full 500 million-strong pwned-passwords-2.0.txt imports to 1.5GB with a 1 in 10 million false-positive rate - some improvement on the 9GB compressed hash list, 30GB uncompressed text file, or 1.95GB Bloom filter.

You're advised to use gcstool for generating such large databases, as it's both much faster (~20x) and much more memory efficient (~8x).

TODO

  • Test suite.
  • Less basic tools.
  • On-disk intermediate state for building large files.
  • Better documentation.
  • Plugin for Rodauth.

About

A small Ruby library for creating and searching Golomb Compressed Sets

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published