Implement sorted merge join #197

Open

raronson opened this issue Feb 22, 2013 · 1 comment

raronson commented Feb 22, 2013

If both data sets are stored sorted on the join key, then it's possible to perform the join on the map side. The general idea is to:

Build up an index of keys to file location/offset of one of the data sets.
Use the other data set as normal input to a map job.
For each key, look up the corresponding file/offset from the index.
Directly read the file, seeking to the offset.
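
The steps above can be sketched roughly as follows. This is only an in-memory illustration in Python, not Scoobi, Pig, or Hive code. It assumes tab-separated `key<TAB>value` records, and the helper names (`build_index`, `lookup`, `map_side_join`) are hypothetical. A `BytesIO` buffer stands in for the sorted file that a real job would read from HDFS.

```python
import io

def build_index(right_file):
    """Map each join key to the byte offset of its first record.

    Assumes the file is sorted on the key, so records sharing a key are contiguous.
    """
    index = {}
    right_file.seek(0)
    while True:
        offset = right_file.tell()
        line = right_file.readline()
        if not line:
            break
        key = line.split(b"\t", 1)[0]
        index.setdefault(key, offset)  # keep only the first offset per key
    return index

def lookup(right_file, index, key):
    """Seek to the key's offset and yield values until the key changes."""
    offset = index.get(key)
    if offset is None:
        return
    right_file.seek(offset)
    while True:
        line = right_file.readline()
        if not line:
            break
        k, value = line.rstrip(b"\n").split(b"\t", 1)
        if k != key:
            break
        yield value.decode()

def map_side_join(left_records, right_file, index):
    """Play the role of the mapper: join each input record against the indexed file."""
    for key, left_value in left_records:
        for right_value in lookup(right_file, index, key.encode()):
            yield key, left_value, right_value

# Hypothetical sample data: the indexed data set, sorted on the join key.
right = io.BytesIO(b"a\t1\nb\t2\nb\t3\nd\t4\n")
index = build_index(right)

# The other data set, fed to the "map job" as normal input.
left = [("a", "x"), ("b", "y"), ("c", "z")]
result = list(map_side_join(left, right, index))
```

In this sketch the index holds every key. A real implementation would more likely keep a sparse index (for example, one entry per block or split) and scan forward from the nearest indexed offset, which is possible only because the file is sorted.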

Both Pig and Hive already have implementations, and this would be a nice addition to Scoobi.

Pig's implementation - http://wiki.apache.org/pig/PigMergeJoin
Hive's implementation - https://issues.apache.org/jira/browse/HIVE-1194

kdarshit999 commented Mar 3, 2014

I need code to implement a sort-merge join. Any suggestions?
