Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

joins with type coercion: 1.5 == 1 is TRUE #2592

Closed
MarkusBonsch opened this issue Jan 27, 2018 · 10 comments · Fixed by #2734
Closed

joins with type coercion: 1.5 == 1 is TRUE #2592

MarkusBonsch opened this issue Jan 27, 2018 · 10 comments · Fixed by #2734
Assignees
Milestone

Comments

@MarkusBonsch
Copy link
Contributor

@MarkusBonsch MarkusBonsch commented Jan 27, 2018

Maybe this is a known issue, but to me it came as a very bad surprise.
During a join, if i is coerced to integer to match x's column type, 1.5 is joined to 1:

library(data.table)
dt1 <- data.table(x = 1L, y = 1)
dt2 <- data.table(x = 1.5, z = 2)
dt1[dt2, on = "x", verbose = TRUE]
# Calculated ad hoc index in 0 secs
# Coercing double column i.'x' to integer to match type of x.'x'. Please avoid coercion for efficiency.
# Starting bmerge ...done in 0 secs
#     x y z
# 1: 1 1 2
# 1: 1 1 2

merge(dt2, dt1, by = "x")
#    x y z
# 1: 1 1 2

The reason is the blind coercion to integer of dt2$x during bmerge.
We should adopt the base R loic of merge.data.frame where the join columns get coerced to the 'highest' involved type (https://github.com/wch/r-source/blob/e690b0d6998dfbc360f0fa14492eb8648df20949/src/main/unique.c) Lines 902ff:

    /* Coerce to a common type; type == NILSXP is ok here.
     * Note that above we coerce factors and "POSIXlt", only to character.
     * Hence, coerce to character or to `higher' type
     * (given that we have "Vector" or NULL) */
    if(TYPEOF(x) >= STRSXP || TYPEOF(table) >= STRSXP) type = STRSXP;
    else type = TYPEOF(x) < TYPEOF(table) ? TYPEOF(table) : TYPEOF(x);
    PROTECT(x	  = coerceVector(x,	type)); nprot++;
PROTECT(table = coerceVector(table, type)); nprot++;
@mattdowle
Copy link
Member

@mattdowle mattdowle commented Jan 28, 2018

Good spot and agree. With the caveat that it should only coerce to double if isReallyReal(). Otherwise, coerce the not-really-real double to int (which is the more common case in mind). Both with warning rather than just verbose.

@MarkusBonsch
Copy link
Contributor Author

@MarkusBonsch MarkusBonsch commented Feb 2, 2018

I have done a bit more extensive testing on joins with non-matching types. The table below lists the findings and my proposed change of behaviour. Code for examples can be found at the end of this post.

type of x column type of i column behaviour recommended behaviour comment
logical integer throws error don't change, unify error messages
logical numeric throws error don't change, unify error messages
logical character throws error don't change, unify error messages
logical factor throws error don't change, unify error messages
integer logical coercion to integer so that 1 joins to TRUE, 0 to FALSE, but 2 to nothing. definitely add a warning no idea if this is "correct"
integer numeric bug: coercion to integer so that 1.5 joins to 1 If the numeric column in i isReallyReal, coercion to numeric with warning. If the column in i is not reallyReal, coercion to integer with warning see example in code below
integer character throws error don't change, unify style of error messages
integer factor bug: joins so that integer 1 joins to the first level of the factor. Throw error see
numeric logical similar to integer - logical add warning no idea if this is "correct"
numeric integer coercion to numeric don't change, add warning
numeric character throws error don't change, unify error messages
numeric factor throws error don't change, unify error messages
character logical throws error don't change, unify error messages
character integer throws error (different style) don't change, unify error messages
character numeric throws error don't change, unify error messages
character factor coercion to character don't change, add warning
factor logical throws error don't change, unify error messages
factor integer throws error don't change, unify error messages
factor numeric throws error don't change, unify error messages
factor character coercion to factor don't change, add warning
int64 integer coercion to integer64 don't change, add warning
integer int64 coercion to integer (with overflow leading to NA if int64 is too large) don't change, add warning
int64 logical coercion to integer64 don't change, add warning
logical int64 throws error don't change, unify error message
int64 numeric same bug as with integer If the numeric column in i isReallyReal, coercion to numeric with warning. If the column in i is not reallyReal, coercion to integer64 with warning
numeric int64 coercion to numeric don't change, add warning
int64 character throws error don't change, unify error message
character int64 throws error don't change, unify error message
int64 factor throws error don't change, unify error message
factor int64 throws error don't change, unify error message

Here is the code with examples for all cases:

library(data.table)
dt <- data.table(int = 1L:10L, 
                 num = seq(0.5,5, 0.5),
                 char = as.character(1:10),
                 bool = as.logical(0:9),
                 fact = factor(letters[1:10]))

cols <- c("int", "num", "char", "bool", "fact")
cols <- c(paste0("x.", cols), paste0("i.", cols))


## integer - logical
test <- dt[dt, cols, on = "int==bool", nomatch = 0L, with = FALSE]
class(test$i.bool)
# "integer"
test
#    x.int x.num x.char x.bool x.fact i.int i.num i.char i.bool i.fact
# 1:     1   0.5      1  FALSE      a     2   1.0      2      1      b
# 2:     1   0.5      1  FALSE      a     3   1.5      3      1      c
# 3:     1   0.5      1  FALSE      a     4   2.0      4      1      d
# 4:     1   0.5      1  FALSE      a     5   2.5      5      1      e
# 5:     1   0.5      1  FALSE      a     6   3.0      6      1      f
# 6:     1   0.5      1  FALSE      a     7   3.5      7      1      g
# 7:     1   0.5      1  FALSE      a     8   4.0      8      1      h
# 8:     1   0.5      1  FALSE      a     9   4.5      9      1      i
# 9:     1   0.5      1  FALSE      a    10   5.0     10      1      j


## logical - integer
test <- dt[dt, cols, on = "bool==int", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# typeof x.bool (logical) != typeof i.int (integer)


## integer - numeric
test <- dt[dt, cols, on = "int==num", nomatch = 0L, with = FALSE] 
class(test$i.num)
# "integer"
test
#    x.int x.num x.char x.bool x.fact i.int i.num i.char i.bool i.fact
# 1:     1   0.5      1  FALSE      a     2     1      2   TRUE      b
# 2:     1   0.5      1  FALSE      a     3     1      3   TRUE      c
# 3:     2   1.0      2   TRUE      b     4     2      4   TRUE      d
# 4:     2   1.0      2   TRUE      b     5     2      5   TRUE      e
# 5:     3   1.5      3   TRUE      c     6     3      6   TRUE      f
# 6:     3   1.5      3   TRUE      c     7     3      7   TRUE      g
# 7:     4   2.0      4   TRUE      d     8     4      8   TRUE      h
# 8:     4   2.0      4   TRUE      d     9     4      9   TRUE      i
# 9:     5   2.5      5   TRUE      e    10     5     10   TRUE      j
## we see that i.num 1 and 1.5 both join to x.int = 1 as they are coerced to integer. This is a bug


## numeric - integer
test <- dt[dt, cols, on = "num==int", nomatch = 0L, with = FALSE] 
class(test$i.int)
# "integer"
test
#    x.int x.num x.char x.bool x.fact i.int i.num i.char i.bool i.fact
# 1:     2     1      2   TRUE      b     1   0.5      1  FALSE      a
# 2:     4     2      4   TRUE      d     2   1.0      2   TRUE      b
# 3:     6     3      6   TRUE      f     3   1.5      3   TRUE      c
# 4:     8     4      8   TRUE      h     4   2.0      4   TRUE      d
# 5:    10     5     10   TRUE      j     5   2.5      5   TRUE      e


## integer - character
test <- dt[dt, cols, on = "int==char", nomatch = 0L, with = FALSE] 
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# typeof x.int (integer) != typeof i.char (character)

## character - integer
test <- dt[dt, cols, on = "char==int", nomatch = 0L, with = FALSE] 
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# x.'char' is a character column being joined to i.'int' which is type 'integer'. Character columns must join to factor or character columns.)

## integer - factor 
test <- dt[dt, cols, on = "int==fact", nomatch = 0L, with = FALSE] 
class(test$i.fact)
# "factor"
typeof(test$i.fact)
# "integer"
test
#    x.int x.num x.char x.bool x.fact i.int i.num i.char i.bool i.fact
# 1:     1   0.5      1  FALSE      a     1   0.5      1  FALSE      a
# 2:     2   1.0      2   TRUE      b     2   1.0      2   TRUE      b
# 3:     3   1.5      3   TRUE      c     3   1.5      3   TRUE      c
# 4:     4   2.0      4   TRUE      d     4   2.0      4   TRUE      d
# 5:     5   2.5      5   TRUE      e     5   2.5      5   TRUE      e
# 6:     6   3.0      6   TRUE      f     6   3.0      6   TRUE      f
# 7:     7   3.5      7   TRUE      g     7   3.5      7   TRUE      g
# 8:     8   4.0      8   TRUE      h     8   4.0      8   TRUE      h
# 9:     9   4.5      9   TRUE      i     9   4.5      9   TRUE      i
# 10:    10  5.0     10   TRUE      j    10   5.0     10   TRUE      j
## no type coercion, but internally factors are just integers, 
## so that in this case i.fact = a is joined to i.int = 1, which clearly is a bug.


## factor - integer 
test <- dt[dt, cols, on = "fact==int", nomatch = 0L, with = FALSE] 
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# x.'fact' is a factor column being joined to i.'int' which is type 'integer'. Factor columns must join to factor or character columns.


## numeric - logical
test <- dt[dt, cols, on = "num==bool", nomatch = 0L, with = FALSE]
class(test$i.bool)
# "numeric"
test
#    x.int x.num x.char x.bool x.fact i.int i.num i.char i.bool i.fact
# 1:     2     1      2   TRUE      b     2   1.0      2      1      b
# 2:     2     1      2   TRUE      b     3   1.5      3      1      c
# 3:     2     1      2   TRUE      b     4   2.0      4      1      d
# 4:     2     1      2   TRUE      b     5   2.5      5      1      e
# 5:     2     1      2   TRUE      b     6   3.0      6      1      f
# 6:     2     1      2   TRUE      b     7   3.5      7      1      g
# 7:     2     1      2   TRUE      b     8   4.0      8      1      h
# 8:     2     1      2   TRUE      b     9   4.5      9      1      i
# 9:     2     1      2   TRUE      b    10   5.0     10      1      j


## logical - numeric
test <- dt[dt, cols, on = "bool==num", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# typeof x.bool (logical) != typeof i.num (double)


## numeric - character
test <- dt[dt, cols, on = "num==char", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# typeof x.num (double) != typeof i.char (character)


## character - numeric
test <- dt[dt, cols, on = "char==num", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# x.'char' is a character column being joined to i.'num' which is type 'double'. Character columns must join to factor or character columns.


## numeric - factor
test <- dt[dt, cols, on = "num==fact", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# typeof x.num (double) != typeof i.fact (integer)


## factor - numeric
test <- dt[dt, cols, on = "fact==num", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# x.'fact' is a factor column being joined to i.'num' which is type 'double'. Factor columns must join to factor or character columns.


## logical - character
test <- dt[dt, cols, on = "bool==char", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# typeof x.bool (logical) != typeof i.char (character)


## character - logical
test <- dt[dt, cols, on = "char==bool", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# x.'char' is a character column being joined to i.'bool' which is type 'logical'. Character columns must join to factor or character columns.


## logical - factor
test <- dt[dt, cols, on = "bool==fact", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# typeof x.bool (logical) != typeof i.fact (integer)r)


## factor - logical
test <- dt[dt, cols, on = "fact==bool", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# x.'fact' is a factor column being joined to i.'bool' which is type 'logical'. Factor columns must join to factor or character columns.


## character - factor
test <- dt[dt, cols, on = "char==fact", nomatch = 0L, with = FALSE]
class(test$i.fact)
# "character"
test
# Empty data.table (0 rows) of 10 cols: x.int,x.num,x.char,x.bool,x.fact,i.int...


## factor - character
test <- dt[dt, cols, on = "fact==char", nomatch = 0L, with = FALSE]
class(test$i.char)
# "factor"
test
# Empty data.table (0 rows) of 10 cols: x.int,x.num,x.char,x.bool,x.fact,i.int...


library(bit64)
dt[, int64 := as.integer64(c(1:9, 2.5e10))]
cols <- c("int", "num", "char", "bool", "fact", "int64")
cols <- c(paste0("x.", cols), paste0("i.", cols))

## int64 - integer
test <- dt[dt, cols, on = "int64==int", nomatch = 0L, with = FALSE]
class(test$i.int)
# "integer64"
test
#    x.int x.num x.char x.bool x.fact x.int64 i.int i.num i.char i.bool i.fact i.int64
# 1:     1   0.5      1  FALSE      a       1     1   0.5      1  FALSE      a       1
# 2:     2   1.0      2   TRUE      b       2     2   1.0      2   TRUE      b       2
# 3:     3   1.5      3   TRUE      c       3     3   1.5      3   TRUE      c       3
# 4:     4   2.0      4   TRUE      d       4     4   2.0      4   TRUE      d       4
# 5:     5   2.5      5   TRUE      e       5     5   2.5      5   TRUE      e       5
# 6:     6   3.0      6   TRUE      f       6     6   3.0      6   TRUE      f       6
# 7:     7   3.5      7   TRUE      g       7     7   3.5      7   TRUE      g       7
# 8:     8   4.0      8   TRUE      h       8     8   4.0      8   TRUE      h       8
# 9:     9   4.5      9   TRUE      i       9     9   4.5      9   TRUE      i       9


## integer - int64
test <- dt[dt, cols, on = "int==int64", nomatch = 0L, with = FALSE]
class(test$i.int64)
# "integer"
test
#    x.int x.num x.char x.bool x.fact x.int64 i.int i.num i.char i.bool i.fact i.int64
# 1:     1   0.5      1  FALSE      a       1     1   0.5      1  FALSE      a       1
# 2:     2   1.0      2   TRUE      b       2     2   1.0      2   TRUE      b       2
# 3:     3   1.5      3   TRUE      c       3     3   1.5      3   TRUE      c       3
# 4:     4   2.0      4   TRUE      d       4     4   2.0      4   TRUE      d       4
# 5:     5   2.5      5   TRUE      e       5     5   2.5      5   TRUE      e       5
# 6:     6   3.0      6   TRUE      f       6     6   3.0      6   TRUE      f       6
# 7:     7   3.5      7   TRUE      g       7     7   3.5      7   TRUE      g       7
# 8:     8   4.0      8   TRUE      h       8     8   4.0      8   TRUE      h       8
# 9:     9   4.5      9   TRUE      i       9     9   4.5      9   TRUE      i       9    


## int64 - logical
test <- dt[dt, cols, on = "int64==bool", nomatch = 0L, with = FALSE]
class(test$i.bool)
# "integer64"
test
#    x.int x.num x.char x.bool x.fact x.int64 i.int i.num i.char i.bool i.fact     i.int64
# 1:     1   0.5      1  FALSE      a       1     2   1.0      2      1      b           2
# 2:     1   0.5      1  FALSE      a       1     3   1.5      3      1      c           3
# 3:     1   0.5      1  FALSE      a       1     4   2.0      4      1      d           4
# 4:     1   0.5      1  FALSE      a       1     5   2.5      5      1      e           5
# 5:     1   0.5      1  FALSE      a       1     6   3.0      6      1      f           6
# 6:     1   0.5      1  FALSE      a       1     7   3.5      7      1      g           7
# 7:     1   0.5      1  FALSE      a       1     8   4.0      8      1      h           8
# 8:     1   0.5      1  FALSE      a       1     9   4.5      9      1      i           9
# 9:     1   0.5      1  FALSE      a       1    10   5.0     10      1      j 25000000000


## logical - int64
test <- dt[dt, cols, on = "bool==int64", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# typeof x.bool (logical) != typeof i.int64 (double)


## int64 - numeric
test <- dt[dt, cols, on = "int64==num", nomatch = 0L, with = FALSE]
class(test$i.num)
# "integer64"
test
#    x.int x.num x.char x.bool x.fact x.int64 i.int i.num i.char i.bool i.fact     i.int64
# 1:     1   0.5      1  FALSE      a       1     2     1      2   TRUE      b           2
# 2:     1   0.5      1  FALSE      a       1     3     1      3   TRUE      c           3
# 3:     2   1.0      2   TRUE      b       2     4     2      4   TRUE      d           4
# 4:     2   1.0      2   TRUE      b       2     5     2      5   TRUE      e           5
# 5:     3   1.5      3   TRUE      c       3     6     3      6   TRUE      f           6
# 6:     3   1.5      3   TRUE      c       3     7     3      7   TRUE      g           7
# 7:     4   2.0      4   TRUE      d       4     8     4      8   TRUE      h           8
# 8:     4   2.0      4   TRUE      d       4     9     4      9   TRUE      i           9
# 9:     5   2.5      5   TRUE      e       5    10     5     10   TRUE      j 25000000000
## same bug as for integer - numeric: conversion to integer means that 1.5 is joined to 1, etc.


## numeric - int64
test <- dt[dt, cols, on = "num==int64", nomatch = 0L, with = FALSE]
class(test$i.int64)
# "numeric"
test
#    x.int x.num x.char x.bool x.fact     x.int64 i.int i.num i.char i.bool i.fact i.int64
# 1:     2     1      2   TRUE      b           2     1   0.5      1  FALSE      a       1
# 2:     4     2      4   TRUE      d           4     2   1.0      2   TRUE      b       2
# 3:     6     3      6   TRUE      f           6     3   1.5      3   TRUE      c       3
# 4:     8     4      8   TRUE      h           8     4   2.0      4   TRUE      d       4
# 5:    10     5     10   TRUE      j 25000000000     5   2.5      5   TRUE      e       5


## int64 - character
test <- dt[dt, cols, on = "int64==char", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# typeof x.int64 (double) != typeof i.char (character)


## character - int64
test <- dt[dt, cols, on = "char==int64", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# x.'char' is a character column being joined to i.'int64' which is type 'double'. Character columns must join to factor or character columns.


## int64 - factor
test <- dt[dt, cols, on = "int64==fact", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# typeof x.int64 (double) != typeof i.fact (integer)


## factor - int64
test <- dt[dt, cols, on = "fact==int64", nomatch = 0L, with = FALSE]
# Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
# x.'fact' is a factor column being joined to i.'int64' which is type 'double'. Factor columns must join to factor or character columns.

@MarkusBonsch
Copy link
Contributor Author

@MarkusBonsch MarkusBonsch commented Apr 10, 2018

I have updated the table with type conversions in the last post to reflect all combinations.

@mattdowle mattdowle added this to the v1.10.6 milestone Apr 12, 2018
@mattdowle mattdowle removed this from the v1.11.0 milestone Apr 29, 2018
@mattdowle mattdowle added this to the v1.11.2 milestone Apr 29, 2018
@mattdowle mattdowle removed this from the 1.12.0 milestone Jan 11, 2019
@mattdowle mattdowle added this to the 1.12.2 milestone Jan 11, 2019
@jangorecki jangorecki removed this from the 1.12.2 milestone Jan 24, 2019
@jangorecki jangorecki added this to the 1.12.4 milestone Jan 24, 2019
@Fablepongiste
Copy link

@Fablepongiste Fablepongiste commented May 1, 2019

If type is NA in dt2 and let say character in dt1 (or any other type really for dt1), should the merge not failed and still work , assuming NA in dt2 is of type of dt1 ?

Sometimes do not know you are getting NA, and there is no reason that column with type NA is of type logical if not NA ?

@Fablepongiste
Copy link

@Fablepongiste Fablepongiste commented May 1, 2019

Is that supposed to fail ?

merge(dt1, dt2, by = intersect(names(dt1), names(dt2)), all = TRUE, sort = FALSE)
dt1 <- data.table("a" = 1, b = "test")
 dt2 <- data.table("a" = 2, b = NA)
Error in bmerge(i, x, leftcols, rightcols, xo, roll, rollends, nomatch,  : 
  typeof x.b (logical) != typeof i.b (character)

@jangorecki
Copy link
Member

@jangorecki jangorecki commented May 1, 2019

@Fablepongiste Thanks for reporting.
Definitely NA should be coerced to NA_character_ transparently to user. I am not sure but #2734 might already cover that.
Lets add to that also special case of both fields being NAs of different type

dt1 <- data.table("a" = 1, b = NA_character_)
dt2 <- data.table("a" = 2, b = NA_integer_)
merge(dt1, dt2, by = intersect(names(dt1), names(dt2)), all = TRUE, sort = FALSE)

@Fablepongiste
Copy link

@Fablepongiste Fablepongiste commented May 1, 2019

@jangorecki Do you expect this fix to be released soon at least in develop branch ?
Have seen issue #2734 but seems not updated from long time ?

@jangorecki
Copy link
Member

@jangorecki jangorecki commented May 1, 2019

@Fablepongiste cannot promise but we are trying to clear out PR queue, so it will be definitely on our list

@MarkusBonsch
Copy link
Contributor Author

@MarkusBonsch MarkusBonsch commented May 1, 2019

Interesting case @Fablepongiste. We will have to check, whether PR #2734 covers that and amend if not. Unfortunately, the PR is very old, so merging will be painful and I have very limited time at the moment. I will nonetheless try to see after it in the next weeks.

@mattdowle
Copy link
Member

@mattdowle mattdowle commented May 22, 2019

I managed to take a look. #2734 now covers all-NA coercion. Thanks @Fablepongiste for raising.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants