advanced equijoin #9

piccolbo opened this Issue Sep 13, 2011 · 2 comments



The equijoin currently in dev is the basic one. It doesn't scale well when one key is predominant, and it doesn't exploit special cases such as one side having a small number of records. There is a lot of work that could go into a better join feature.
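To make the small-side special case concrete, here is a minimal sketch in plain Python, not tied to this package's API. It assumes the small side fits in memory on every mapper, so it can be loaded into a hash table and the join can run entirely in the map phase with no shuffle. The name `broadcast_join` and the `(key, value)` record layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def broadcast_join(small, large):
    """Map-side equijoin: `small` is assumed to fit in memory.

    Both inputs are iterables of (key, value) pairs (hypothetical layout).
    Yields (key, small_value, large_value) for every matching pair.
    """
    # Build an in-memory hash table from the small side once.
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)

    # Stream the large side; each record is joined without any shuffle.
    for key, value in large:
        for small_value in lookup.get(key, ()):
            yield key, small_value, value


if __name__ == "__main__":
    small = [("a", 1)]
    large = [("a", 10), ("b", 20), ("a", 11)]
    for row in broadcast_join(small, large):
        print(row)
```

Because no records are grouped by key in a reducer, a predominant key on the large side does not pile up on a single reducer in this variant, but it only applies when one side really is small.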


See the paper "Processing Theta-Joins using MapReduce".


One possible technique is to run a preliminary job that builds a Bloom filter from the keys of one or both sides of the join, then perform the join using the Bloom filter as, indeed, a filter in the map phase. This adds jobs but moves work to the map side (reportedly faster in real instances).
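A rough sketch of the idea in plain Python, simulating the phases in a single process rather than using this package's API. Everything here is an assumption made for illustration: the `BloomFilter` class, its default sizes, the SHA-256 based hashing, and the `(key, value)` record layout. The filter is built from the left side only, and false positives are harmless because the reduce step still requires a real match on both sides.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Tiny Bloom filter (hypothetical sizes, not tuned for real data)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # May return True for an absent key, never False for a present one.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def build_filter(records, **kwargs):
    """Preliminary job: collect the keys of one side into a Bloom filter."""
    bloom = BloomFilter(**kwargs)
    for key, _ in records:
        bloom.add(key)
    return bloom


def filtered_join(left, right, bloom):
    """Equijoin where the map phase drops right-side records rejected by `bloom`."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        if key in bloom:  # map-side filter: skip keys definitely not on the left
            groups[key][1].append(value)
    # Reduce phase: pair up left and right values that share a key.
    return [(key, lv, rv)
            for key, (lvs, rvs) in groups.items()
            for lv in lvs for rv in rvs]


if __name__ == "__main__":
    left = [("a", 1), ("b", 2)]
    right = [("a", 10), ("c", 30), ("a", 11)]
    print(filtered_join(left, right, build_filter(left)))
```

In a real job the gain would come from right-side records with no partner never being shuffled to the reducers; whether that outweighs the extra job depends on how selective the join is.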
