Implement sorted merge join #197

Open
raronson opened this issue Feb 22, 2013 · 1 comment
@raronson
Contributor

If both data sets are stored sorted on the join key, then it's possible to perform the join on the map side. The general idea is to:

  • Build up an index of keys to file location/offset of one of the data sets.
  • Use the other data set as normal input to a map job.
  • For each key, look up the corresponding file/offset from the index.
  • Directly read the file, seeking to the offset.
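
A minimal sketch of the four steps above, written in Python against a local file rather than Scoobi or HDFS. Everything here is an assumption made for illustration: the tab-separated `key\tvalue` record layout, and the names `build_index`, `lookup` and `map_side_join`. It is not how Pig or Hive implement this, only the general shape of the idea.

```python
def build_index(path):
    """Step 1: map each join key to the byte offset of its first record.

    Assumes `path` holds newline-terminated `key\tvalue` records sorted by key.
    """
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.split(b"\t", 1)[0]
            # Keep only the first offset seen for each key.
            index.setdefault(key, offset)
    return index


def lookup(path, index, key):
    """Steps 3 and 4: find the offset for `key`, seek there, read matches."""
    offset = index.get(key)
    if offset is None:
        return  # key is absent from the indexed data set
    with open(path, "rb") as f:
        f.seek(offset)
        for line in f:
            k, v = line.rstrip(b"\n").split(b"\t", 1)
            if k != key:
                break  # input is sorted, so no later record can match
            yield v


def map_side_join(left_records, right_path, index):
    """Step 2: treat `left_records` as the ordinary map input.

    Yields (key, left_value, right_value) for every matching pair.
    """
    for key, left_value in left_records:
        for right_value in lookup(right_path, index, key):
            yield key, left_value, right_value
```

With an indexed file containing `a\t1`, `b\t2`, `b\t3`, `d\t4`, joining against the input `[(b"a", b"x"), (b"b", b"y"), (b"c", b"z")]` yields one row for `a`, two rows for `b`, and nothing for `c`. The sketch reopens the file on every lookup to stay short; a real mapper would presumably hold one reader open and, because the map input is also sorted, mostly seek forward.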

Implementations already exist in both Pig and Hive, and this would be a nice addition to Scoobi.

Pig's implementation - http://wiki.apache.org/pig/PigMergeJoin
Hive's implementation - https://issues.apache.org/jira/browse/HIVE-1194

@kdarshit999

I need code to implement a sort-merge join. Any suggestions?
