merge/inner join not symmetric #743

Closed
arcosdium opened this issue Jul 30, 2014 · 5 comments

@arcosdium

I encountered a case where merge is not symmetric, using version 1.9.2 as well as 1.9.3. I have two data.tables: Filter (read from a file with fread) and KOut (constructed by several data.table operations, including merge and setcolorder).

tables() gives

NAME        NROW  MB COLS                       KEY
KOut   3,741,559 143 x1,x2,x3,x4,x5,value,y1,y2
Filter    17,172   1 x1,x2,x3,x4,x5

Merging the tables results in a different number of rows, depending on the order of the arguments:

> nrow(merge(KOut, Filter,by=names(Filter)))
[1] 2936586
> nrow(merge(Filter,KOut,by=names(Filter)))
[1] 2555944

For comparison, the same merges as data.frames:

> nrow(merge(data.frame(KOut),data.frame(Filter),by=names(Filter)))
[1] 2936586
> nrow(merge(data.frame(Filter),data.frame(KOut),by=names(Filter)))
[1] 2936586

I suspected a bug in merge(Filter,KOut,by=names(Filter)), so I followed the code of merge down to the essential statement:

y[xkey,nomatch=ifelse(all.x,NA,0),allow.cartesian=allow.cartesian]   
# same as y[xkey,nomatch=0]

Here tables() gives

NAME      NROW  MB COLS                       KEY
xkey    17,172   1 x1,x2,x3,x4,x5             x1,x2,x3,x4,x5
y    3,741,559 143 x1,x2,x3,x4,x5,value,y1,y2 x1,x2,x3,x4,x5

Some joins are:

y[xkey, nomatch=0]
x1 x2 x3 x4 x5 value y1 y2
1: 1 1 1 1 0 1.20693421 57 1
2: 1 1 1 1 0 -0.36395694 57 2
3: 1 1 1 1 0 -1.91636684 57 3
4: 1 1 1 1 0 -0.38118758 57 4
5: 1 1 1 1 0 0.84860626 57 5
---
2555940: 3 1 1 21 2 0.49530287 11697 4400
2555941: 3 1 1 21 2 -2.03795092 11697 4401
2555942: 3 1 1 21 2 1.28866177 11697 4402
2555943: 3 1 1 21 2 -2.02472550 11697 4403
2555944: 3 1 1 21 2 0.01210244 11697 4404

xkey[y, nomatch=0]
x1 x2 x3 x4 x5 value y1 y2
1: 1 0 1 1 0 -0.693537811 57 70578
2: 1 0 1 1 0 0.585084541 57 70579
3: 1 0 1 1 0 0.384647254 57 70580
4: 1 0 1 1 0 -1.011123900 57 70581
5: 1 0 1 1 0 -0.008338746 57 70582
---
2936582: 3 1 1 21 2 0.495302870 11697 4400
2936583: 3 1 1 21 2 -2.037950918 11697 4401
2936584: 3 1 1 21 2 1.288661770 11697 4402
2936585: 3 1 1 21 2 -2.024725499 11697 4403
2936586: 3 1 1 21 2 0.012102439 11697 4404

y[xkey]
x1 x2 x3 x4 x5 value y1 y2
1: 1 0 1 1 0 NA NA NA
2: 1 0 1 1 10 NA NA NA
3: 1 0 1 1 20 NA NA NA
4: 1 0 1 1 30 NA NA NA
5: 1 0 1 1 40 NA NA NA
---
2573000: 3 1 3 21 56 NA NA NA
2573001: 3 1 3 21 57 NA NA NA
2573002: 3 1 3 21 58 NA NA NA
2573003: 3 1 3 21 59 NA NA NA
2573004: 3 1 3 21 60 NA NA NA

Note the first line of y[xkey]: it says that the key combination (1,0,1,1,0) in xkey has no match in y. Yet the first line of xkey[y, nomatch=0] shows that y does in fact contain such a row!

Any ideas?

@arunsrinivasan
Member

@arcosdium, thanks for the report. It would make our job much easier if we had a minimal example that reproduces this issue, for example, like this one. Could you please edit your post with such an example? Thanks again.

@arcosdium
Author

I managed to track the problem down. During the operations on KOut, the key column x2 changed type from int to num. Using as.integer fixes the problem. A minimal working example for the problem is:

library(data.table)

Filter<-data.table(data.frame(x2=c(as.integer(0))))
Filter2<-data.table(data.frame(x2=c(-(11-11)/10 )))

merge(Filter2, Filter,by=names(Filter) )
##  x2
## 1:  0
merge(Filter, Filter2,by=names(Filter) )
## Empty data.table (0 rows) of 1 col: x2

I assume the 0 in Filter2 is not precisely 0, due to some rounding issues.

By the way, great package. I really need the speed and the memory efficiency.
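The type drift described above (x2 silently changing from integer to numeric) can be checked before merging. The following is a minimal sketch, assuming the same single-column tables as in the example. It is an illustration of the as.integer workaround mentioned in the comment, not code taken from the original tables.

```r
library(data.table)

# Same single-column tables as in the minimal example above
Filter  <- data.table(x2 = 0L)               # integer key column
Filter2 <- data.table(x2 = -(11 - 11) / 10)  # double key column

# Compare the column types of the two tables before merging
sapply(Filter, class)   # x2 is "integer"
sapply(Filter2, class)  # x2 is "numeric"

# Workaround from the comment: coerce the drifted column back to integer
Filter2[, x2 := as.integer(x2)]

# With matching integer keys, both argument orders return one row
nrow(merge(Filter, Filter2, by = "x2"))
nrow(merge(Filter2, Filter, by = "x2"))
```

Coercing with as.integer is only appropriate when the column is known to hold whole numbers, since it truncates any fractional part.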

@arcosdium arcosdium reopened this Aug 2, 2014
@arunsrinivasan
Member

Great! Thanks for the example. My hunch is that it's due to the sign bit:

data.table:::binary(0)
# [1] "0 00000000000 000000000000000000000000000000000000 00000000 00000000"
data.table:::binary(-0)
# [1] "1 00000000000 000000000000000000000000000000000000 00000000 00000000"

If this is indeed the problem, then it should be localised to only -0's and should be fixed. Thanks again.
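The sign-bit difference can also be observed with base R alone, without the internal data.table:::binary helper. This is a small sketch using only base functions. The variable names pos and neg are chosen here for illustration.

```r
pos <- 0
neg <- -(11 - 11) / 10   # negating a double zero yields -0

# The two zeros compare equal, which makes the difference easy to miss
pos == neg                            # TRUE
identical(pos, neg)                   # TRUE (the default num.eq = TRUE compares with ==)
identical(pos, neg, num.eq = FALSE)   # FALSE (bitwise comparison)

# Dividing by each zero reveals the sign
1 / pos                               # Inf
1 / neg                               # -Inf

# Raw bytes in big-endian order: only the leading sign bit differs
writeBin(pos, raw(), endian = "big")  # 00 00 00 00 00 00 00 00
writeBin(neg, raw(), endian = "big")  # 80 00 00 00 00 00 00 00
```

Any code that groups or joins on the raw bit pattern of a double, rather than on ==, would therefore treat 0 and -0 as distinct keys.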

@arunsrinivasan
Member

Yes, it's due to the sign bit. Here's an example that should double-verify it:

dt = data.table(x=c(0,0,0,-0,-0,-0), y=1:6)
dt[, .N, by=x]
#    x N
# 1: 0 3
# 2: 0 3

Will fix. Thanks.
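For versions without the fix, one possible workaround is to replace negative zeros with positive zeros before grouping or joining. The sketch below uses base R only. The helper name normalize_zero is hypothetical and does not come from data.table or from this thread.

```r
# Hypothetical helper: overwrite every zero (of either sign) with +0
normalize_zero <- function(x) {
  x[which(x == 0)] <- 0   # which() drops NA, so missing values are left alone
  x
}

x <- c(0, 0, 0, -0, -0, -0)

1 / x                   # Inf Inf Inf -Inf -Inf -Inf
1 / normalize_zero(x)   # Inf Inf Inf Inf Inf Inf
```

Applying such a helper to a numeric key column before by= or merge should make all zeros share one bit pattern. It becomes unnecessary once the fix planned for the v1.9.4 milestone is in place.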

@arunsrinivasan arunsrinivasan added this to the v1.9.4 milestone Aug 2, 2014
@arunsrinivasan
Member

Now fixed:

library(data.table)
dt = data.table(x=c(0,0,0,-0,-0,-0), y=1:6)
dt[, .N, by=x]
#    x N
# 1: 0 6
dt1 <- data.table(x2 = 0L)
dt2 <- data.table(x2 = -(11-11)/10)

merge(dt2, dt1, by="x2")
#  x2
# 1:  0
merge(dt1, dt2, by="x2")
#    x2
# 1:  0

@arunsrinivasan arunsrinivasan self-assigned this Aug 2, 2014