BloomJoin: Bloom Filter Based Joins

An R package implementing Bloom filter-based joins for improved performance with large datasets.

Overview

BloomJoin provides an alternative join implementation for R that uses a hash-based approach inspired by Bloom filters to speed up joins between data frames. Traditional joins in R can be inefficient on large datasets, especially when one table is much larger than the other and only a small fraction of its keys have a match in the other table.

Installation

# Install from GitHub
devtools::install_github("gojiplus/bloomjoin")

Usage

library(bloomjoin)

# Basic usage
result <- bloom_join(df1, df2, by = "id", type = "inner")

# With multiple join columns
result <- bloom_join(df1, df2, by = c("id", "date"), type = "left")

# With performance tuning parameters
result <- bloom_join(df1, df2, 
                    by = "id", 
                    type = "inner",
                    bloom_size = 1000000, 
                    false_positive_rate = 0.001,
                    verbose = TRUE)

How It Works

BloomJoin uses a hash-based approach to optimize joins:

1. Create a hash set of all keys from the lookup table (y).
2. Filter the primary table (x) to only include rows with keys that exist in the hash set.
3. Perform a standard join on the filtered dataset.

This pre-filtering step can significantly reduce the size of the join operation when many keys in the primary table don't exist in the lookup table.
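
The three steps above can be sketched in base R. This is a minimal illustration, not the package's implementation: `prefilter_join` and `make_key` are hypothetical names, and the sketch assumes an inner join, where dropping unmatched rows of x before the join cannot change the result.

```r
# Hypothetical helper, not part of the bloomjoin API: pre-filter x by the
# keys present in y, then run a standard inner join with base R's merge().
prefilter_join <- function(x, y, by) {
  # Build one string key per row so that composite keys work too.
  # "\x1f" (unit separator) is assumed not to occur in the key values.
  make_key <- function(df) {
    do.call(paste, c(unname(as.list(df[by])), sep = "\x1f"))
  }

  y_keys <- unique(make_key(y))      # step 1: set of lookup keys
  keep <- make_key(x) %in% y_keys    # step 2: membership test (hash-based in R)

  # Step 3: standard join on the reduced table.
  merge(x[keep, , drop = FALSE], y, by = by)
}

x <- data.frame(id = c(1, 2, 3, 4, 5), value = c("a", "b", "c", "d", "e"))
y <- data.frame(id = c(2, 4, 6), label = c("two", "four", "six"))

result <- prefilter_join(x, y, by = "id")
print(result)
```

Only the rows of x with ids 2 and 4 reach `merge()`. A left join that keeps every row of x cannot pre-filter x this way, because the unmatched rows must survive into the result.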

Performance Benchmarks

See here

Future Work

Implement true Bloom filters for potentially better memory efficiency
Optimize for composite keys and other join types
Parallel processing for hash creation and filtering
Automatic parameter tuning based on input data characteristics
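
For the first item above, the memory cost of a true Bloom filter can be estimated from the textbook sizing formulas: for n keys and a target false positive rate p, the optimal number of bits is m = -n ln(p) / (ln 2)^2 and the optimal number of hash functions is k = (m / n) ln 2. The sketch below applies them; `bloom_params` is a hypothetical helper, and whether the package's `bloom_size` and `false_positive_rate` arguments follow these formulas is not stated in this README.

```r
# Textbook Bloom filter sizing; not taken from the bloomjoin source.
bloom_params <- function(n_keys, fpr) {
  # Optimal bit count: m = -n * ln(p) / (ln 2)^2
  bits <- ceiling(-n_keys * log(fpr) / log(2)^2)
  # Optimal number of hash functions: k = (m / n) * ln 2
  hashes <- max(1, round(bits / n_keys * log(2)))
  list(bits = bits, hashes = hashes, megabytes = bits / 8 / 1e6)
}

p <- bloom_params(n_keys = 1e6, fpr = 0.001)
print(p)
```

For one million keys at a 0.1% false positive rate this comes to roughly 14.4 bits per key (about 1.8 MB) and 10 hash functions, which is the kind of saving over an exact hash set that the item refers to.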

License

MIT

Contributing

Contributions welcome! Please feel free to submit a Pull Request.
