secondary keys #129

piccolbo opened this Issue Aug 31, 2012 · 3 comments





use hashing to make it more user-friendly

piccolbo commented Dec 7, 2012

With the current reduce interface this doesn't matter much, because one can sort the values associated with a key in memory. If we revisit the idea of an iterator-style interface for reduce (for when values are big and cannot be held in memory more than a few at a time), then this will go on the fast track.

The reference to hashing above means the following. This binary partitioner is very low level: it only lets you specify keys as a number of bytes to consider or skip, so it would be very hard to support complex keys and provide a user-friendly API directly on top of it. If we take two lists of primary and secondary keys, hash the former, and prepend the hash to the key, we can use this simple binary partitioner even with complex keys.

The next hurdle is ordering: byte ordering, which is what Java would perform, is unlikely to be the correct ordering for the original key domain. An additional hurdle is implementing all of this efficiently.

One wonders why the wise @klbostee, the author of the patch for the above issue, didn't use typedbytes serialization for this case as he did with the multiplefileoutputformat. In that case, the key is an ArrayList whose first element is the filename and whose second is the actual key. The same could have been done for primary and secondary keys, and we should consider submitting our own patch to make it work that way.
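
To make the hashing idea concrete, here is a rough sketch in Python. It is not rmr or Hadoop code: `encode_key`, `partition`, and `HASH_LEN` are hypothetical names, `repr` stands in for a real serialization such as typedbytes, and MD5 is an arbitrary choice of hash. The point is only that a fixed-width hash prefix lets a "look at the first N bytes" partitioner work with arbitrarily complex primary keys, and that plain byte comparison can still order secondary keys wrongly.

```python
import hashlib

HASH_LEN = 8  # assumed fixed width of the hash prefix, in bytes


def encode_key(primary, secondary):
    """Return hash(primary) + serialized primary + separator + serialized secondary.

    repr() is only a stand-in for a real serialization format.
    """
    primary_bytes = repr(tuple(primary)).encode("utf-8")
    secondary_bytes = repr(tuple(secondary)).encode("utf-8")
    prefix = hashlib.md5(primary_bytes).digest()[:HASH_LEN]
    return prefix + primary_bytes + b"\x00" + secondary_bytes


def partition(encoded_key, num_partitions):
    """Mimic a binary partitioner that only considers the first HASH_LEN bytes."""
    prefix = encoded_key[:HASH_LEN]
    return int.from_bytes(prefix, "big") % num_partitions


# Same primary key, different secondary keys.
a = encode_key(["user42"], [2])
b = encode_key(["user42"], [10])

# Both records land in the same partition because they share the hash prefix.
same_partition = partition(a, 4) == partition(b, 4)

# Byte ordering sorts the record with secondary key 10 before the one with 2,
# which is not the numeric order of the original key domain.
byte_sorted = sorted([a, b])
```

Here `byte_sorted` comes out as `[b, a]`, which illustrates the ordering hurdle: the partitioning works, but the secondary sort would need an order-preserving encoding or a custom comparator.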

klbostee commented Jan 8, 2013

No grand reasons for it really, it just seemed sensible to keep it general/low-level. Writing a custom partitioner using typed bytes isn't hard though, e.g.:

@piccolbo piccolbo closed this Mar 5, 2013