equijoin where left has more keys than right #139

Closed
authompson opened this Issue Oct 13, 2012 · 8 comments


Again, I'm looking to understand your function "equijoin." After experimenting with it, it seems as though the function only works when every unique key is represented at least once in both left.input and right.input. If one input has an extra key, the reduce function does not finish.

Is there a way to complete this join?

IDs <- c("A", rep("B", 2), rep("C", 3), rep("D", 4), rep("E", 5), rep("F", 6), rep("G", 7), rep("H", 8), rep("I", 9), rep("J", 10))
cats <- round(runif(55)*10, 0)
data2a <- keyval(IDs, cats)
data2a <- to.dfs(data2a)

IDt <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "L", "N")
tar <- sample(c(0,1), 12, replace=TRUE)
targetfile <- to.dfs(keyval(IDt, tar))

Note that targetfile contains the additional keys L and N, which do not appear in data2a.

ejoint <- equijoin(left.input = data2a,
                   right.input = targetfile,
                   outer = "left",
                   map.left = function(k, v) keyval(k, v),
                   map.right = function(k, v) keyval(k, v)
)

I have spent a significant amount of time going through your source code in the mapreduce.R file, which contains the equijoin function. It appears that, in the reduce statement with the merge, the option I'm looking for is "all=TRUE".
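For reference, here is a minimal sketch of what the "all" options of base R's merge do, run on small made-up data frames rather than on the dfs inputs above. The names left, right, inner, left.outer and full.outer are illustrative only, and this says nothing about how equijoin calls merge internally.

```r
# Small stand-ins for the two inputs in the question (values are made up).
left  <- data.frame(key = c("A", "B", "B"), cat = c(3, 7, 1),
                    stringsAsFactors = FALSE)
right <- data.frame(key = c("A", "B", "L"), tar = c(0, 1, 1),
                    stringsAsFactors = FALSE)

inner      <- merge(left, right, by = "key")                # keys present in both
left.outer <- merge(left, right, by = "key", all.x = TRUE)  # every row of left
full.outer <- merge(left, right, by = "key", all = TRUE)    # every row of either side

# Key "L" exists only on the right, so it survives only in the full join,
# with NA in the column that comes from the left.
print(full.outer)
```

In this sketch all.x=TRUE corresponds to a left outer join and all=TRUE to a full outer join, so keys such as L and N that exist only on the right side are kept only by the latter.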

Please let me know if this sort of join is possible. And if not, do you have a recommendation on how to make it work?

Thanks,
A

Collaborator

piccolbo commented Oct 15, 2012

Hi, I can't reproduce your exact error, but I get a different error that I am working on; fixing that may fix yours as well. I don't think I gave sufficient thought to the default reducer, and rmr2 forces me to be more general, which is hard. For now, can you specify your own reducer and see how far you get? You can specify any function of a key and two sets of values, one from the right side and one from the left side. The type of each set should be the same as the type of the values returned by the map function. Thanks
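A minimal sketch of such a reducer, written as a plain R function so it can be tried without Hadoop. The name full.join.reduce and the argument names k, vl and vr are made up for illustration, and the assumption that a missing side arrives as an empty vector (or NULL) is mine, not something confirmed in this thread. Inside rmr2 the result would presumably be wrapped in keyval and the function passed as the reducer to equijoin.

```r
# Hypothetical reducer: k is the key, vl the values from the left input,
# vr the values from the right input. Either side is assumed to be empty
# when the key is missing from that input.
full.join.reduce <- function(k, vl, vr) {
  if (length(vl) == 0) vl <- NA   # key absent on the left
  if (length(vr) == 0) vr <- NA   # key absent on the right
  # One row per (left value, right value) pair for this key.
  pairs <- expand.grid(left = vl, right = vr, stringsAsFactors = FALSE)
  data.frame(key = k, pairs, stringsAsFactors = FALSE)
}

# Key present on both sides: two left values, one right value.
both <- full.join.reduce("B", c(7, 1), 1)

# Key present only on the right: the left column is filled with NA.
right.only <- full.join.reduce("L", numeric(0), 1)
```

Filling the missing side with NA mimics what a full outer merge does, so keys that appear in only one input still produce a row.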

Collaborator

piccolbo commented Oct 15, 2012

After some inspection I think the problem is the poor fit between the new rmr2 data model and the implementation choices behind equijoin. Attributes don't carry well through c and split of a variety of data structures and we need to come up with a different idea. I can tell you that this is going to be high on my list of things to fix but I can't give you a date. Downgrading to 1.3.1 may be your only option at the moment.
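The point about attributes can be seen in base R alone. The sketch below tags two vectors with a made-up "side" attribute; it only illustrates the general behavior of c and split and does not claim to reproduce how equijoin marked its records.

```r
# Tag two vectors with a hypothetical "side" attribute.
left  <- structure(c(3, 7, 1), side = "left")
right <- structure(c(0, 1),    side = "right")

combined <- c(left, right)                 # c() keeps names but drops other attributes
pieces   <- split(left, c("a", "b", "a"))  # split() returns plain vectors

attr(left, "side")           # the tag is still on the original
attr(combined, "side")       # NULL: lost after c()
attr(pieces[["a"]], "side")  # NULL: lost after split()
```

Any scheme that relies on such tags to tell left records from right records therefore breaks as soon as the values are concatenated or split by key.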

Collaborator

piccolbo commented Oct 16, 2012

Hi, I checked a fixed equijoin into 2.0.1. I fixed the basic mechanism whereby the reduce side can tell which records came from one side and which came from the other, but I skirted the issue of the merge. So what you are going to get in output now is keyval(k, list(values-from-the-left, values-from-the-right)), as you can see from the default. Of course you can provide your own reducer and do a merge in there, but since I can't assume I have data frames in input, I need to think a bit more about what the right approach is. Maybe the current default is going to be it, maybe not.

Collaborator

piccolbo commented Oct 16, 2012

I restored the merge in the default reduce for everything but lists, for which the default behavior of merge is counterintuitive, at least to me. Give it a try.

Thanks again for your prompt attention to this. I took your suggestion and, for now, am using rmr-1.3.1. When I get the opportunity, I will try the 2.0.1 fixed equijoin and let you know how it goes.
Thanks again,
A

Collaborator

piccolbo commented Oct 24, 2012

Since you won't test before release, I have to close this based on my testing (or we won't release). You can always reopen if things go wrong.

piccolbo closed this Oct 24, 2012

Apologies for not getting back to you. I did test it actually and the reduce now works within the context of rmr2. I am going to continue to work with it and will let you know if I have any other questions. Thanks again for your quick responses.

Collaborator

piccolbo commented Oct 24, 2012

wonderful thanks

Antonio

