public
Description: Counting Bloom Filter implemented in Ruby
Homepage: http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/
Clone URL: git://github.com/igrigorik/bloomfilter.git
name age message
file .gitignore Mon Dec 29 09:54:53 -0800 2008 adding a test task and a test [tenderlove]
file README.rdoc Sun Feb 22 19:33:02 -0800 2009 counting bloom filter example [igrigorik]
file Rakefile Mon Dec 29 09:54:53 -0800 2008 adding a test task and a test [tenderlove]
file bloomfilter.gemspec Sat Apr 04 10:17:23 -0700 2009 merging fix by postmodern (thanks!) and version++; [igrigorik]
directory examples/ Sun Feb 22 19:28:17 -0800 2009 added gemspec + updated examples [igrigorik]
directory ext/ Thu Jul 30 20:52:06 -0700 2009 merging 1.9 compat changes by danielsdeleo + R... [igrigorik]
directory lib/ Wed Feb 18 18:10:15 -0800 2009 added delete [joshbuddy]
directory test/ Wed Jun 03 14:56:18 -0700 2009 blindly accept everything and to_s it. [joshbuddy]
README.rdoc

BloomFilter

Counting Bloom Filter implemented in Ruby.

Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. For more detail: en.wikipedia.org/wiki/Bloom_filter

Implementation

Instead of using k different hash functions, this implementation seeds the CRC32 hash with k different initial values (0, 1, …, k-1). This may or may not give you a good distribution, it all depends on the data.

Example

  require 'bloomfilter'

  # M (size of bit array)
  # K (number of hash functions)
  # R (random seed) 100000000, k=4, random seed=1

  #                    M,  K, R
  bf = BloomFilter.new(10, 2, 1)
  bf.insert("test")
  bf.include?("test")
  => true
  bf.include?("test2")
  => false
  bf.delete("test")
  bf.include?("test")
  => false

  # Hash with a bloom filter!
  bf["test2"] = "bar"
  bf["test2"]
  => "bar"
  bf["test3"]
  => nil

  bf.stats
  Number of filter bits (m): 10
  Number of filter elements (n): 2
  Number of filter hashes (k) : 2
  Predicted false positive rate = 10.87%

Configuring Bloom Filter

Performance of the Bloom filter depends on a number of variables:

  • size of the bit array
  • number of hash functions

To figure out the values for these parameters, refer to:

 http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/

Credits

Tatsuya Mori <valdzone@gmail.com> (Original: vald.x0.com/sb/)