malformed factor resulting from 'by' expression when using melt.data.table in 'j' expression #2199

Closed
vh-d opened this issue Jun 11, 2017 · 6 comments · Fixed by #3906

vh-d commented Jun 11, 2017

I get an unprintable data.table object. Printing it results in the error

Error in as.character.factor(x) : malformed factor

and in some cases (with a large dataset) it also crashes R.

My code looks like this

require(data.table)

# generate demo dataset
ids   <- letters[1:3]
dates <- 1:2

dt <- data.table(CJ(dates, ids, ids))
setnames(dt, c("date", "id1", "id2"))

dt[, value := rnorm(18)]

# IMPORTANT: to reproduce the bug, drop some id
dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]

# demo function
f1 <- function(sdt) {
  dt1 <- dcast.data.table(sdt, id1 ~ id2)
  dt2 <- melt.data.table(dt1, id.vars = "id1")
  print(dt2)
  
  dt2
}

res <- dt[, f1(.SD), by = date]
#id1 variable      value
#1:   b        b -1.3643635
#2:   c        b  0.5305674
#3:   b        c  0.2023935
#4:   c        c  0.1063894
#id1 variable       value
#1:   a        a -0.35193161
#2:   b        a -0.65570503
#3:   c        a  0.01524152
#4:   a        b -0.07234880
#5:   b        b -1.16267653
#6:   c        b -1.41490080
#7:   a        c  0.10225144
#8:   b        c -0.84277336
#9:   c        c -0.23164772

res
#Error in as.character.factor(x) : malformed factor

In my case I had forgotten to pass the usual variable.factor = FALSE to melt.data.table(); adding it makes the code work. But this behavior surprises me. It only appears when the two sets of factor levels differ (the ids for date=1 and date=2 are different sets), so if I skip the line

dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]

it works alright.
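The workaround mentioned above can be sketched as follows. This is a minimal illustration, not the reporter's exact code: the function name f2, the seed, and the explicit value.var argument are assumptions added here. With variable.factor = FALSE the melted "variable" column is a character vector, so there are no per-group factor levels to get out of sync.

```r
library(data.table)

set.seed(1)  # assumed seed, only for reproducibility of this sketch

# Same shape as the demo dataset: every (date, id1, id2) combination
ids <- letters[1:3]
dt <- CJ(date = 1:2, id1 = ids, id2 = ids)
dt[, value := rnorm(.N)]

# Drop id "a" for date 1 so the two groups see different id sets
dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]

# f2 is a hypothetical variant of f1 that keeps 'variable' as character
f2 <- function(sdt) {
  wide <- dcast(sdt, id1 ~ id2, value.var = "value")
  melt(wide, id.vars = "id1", variable.factor = FALSE)
}

res2 <- dt[, f2(.SD), by = date]
class(res2$variable)  # "character"
```

Because the column is plain character, the grouped result prints normally whichever ids each date contains.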

I am on the latest stable version of data.table.

My sessionInfo()

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: KDE neon User Edition 5.10

...

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0
@vh-d vh-d changed the title malformed factor resulting from 'by' expression when using melt.data.table in 'i' expression malformed factor resulting from 'by' expression when using melt.data.table in 'j' expression Jun 11, 2017
Contributor

franknarf1 commented Jun 11, 2017

In my opinion, combining vectors with inconsistent attributes is a bad idea, and it would be reasonable to leave the responsibility for handling that with the user.

Some other examples:

  • data.table(g = 1:2)[, factor(letters[.GRP]), by=g]
  • c(factor("a"), factor("b"))

On the other hand, rbind/rbindlist apparently has special handling for factors: rbind(data.table(x = factor("a")), data.table(x = factor("b")))
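To illustrate the rbindlist behaviour described above, here is a small sketch. The variable names are assumptions for illustration. It builds each group's result as its own data.table and lets rbindlist combine them, which merges the factor levels instead of concatenating raw integer codes.

```r
library(data.table)

# Two tables whose factor columns have different level sets
parts <- list(
  data.table(x = factor(c("a", "b"))),
  data.table(x = factor(c("b", "c")))
)
combined <- rbindlist(parts)
levels(combined$x)        # "a" "b" "c"
as.character(combined$x)  # "a" "b" "b" "c"

# The same idea as a manual alternative to grouping in 'j':
# build one data.table per group, then bind them.
pieces <- lapply(1:2, function(i) data.table(g = i, V1 = factor(letters[i])))
ans <- rbindlist(pieces)
as.character(ans$V1)      # "a" "b"
```

This moves the combining step out of the grouping machinery, at the cost of an explicit loop over groups.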

Author

vh-d commented Jun 12, 2017

I partly agree about the responsibility (I usually avoid factors altogether), but I think an ordinary R user would find this behavior unexpected in two ways:

a) the error message about corrupted factors comes only when printing the data.table.
b) with my original dataset (biggish but not really huge), this code always crashes my R session without error message.

Only after I tried View(res) instead of print(res) did I get the error message, which finally gave me a clue about what was going on.

Author

vh-d commented Jun 12, 2017

So this crashes my R/RStudio session

require(data.table)

set.seed(2)

# generate demo dataset
ids   <- sample(letters, 20)
dates <- 1:40

dt <- data.table(CJ(dates, ids, ids))
setnames(dt, c("date", "id1", "id2"))

dt[, value := rnorm(length(date))]

# IMPORTANT: to reproduce the bug, drop some id
dt <- dt[!(date == 1 & (id1 == "a" | id2 == "a"))]
dt <- dt[!(date == 4 & (id1 == "e" | id2 == "e"))]

# demo function
f1 <- function(sdt) {
  dt1 <- dcast.data.table(sdt, id1 ~ id2)
  dt2 <- melt.data.table(dt1, id.vars = "id1")
  # print(dt2)
  
  dt2
}

res <- dt[, f1(.SD), by = date]

res

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

Matrix products: default

...

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0   


Raoul-Kima commented Dec 25, 2017

I was about to post the same issue. In addition to the arguments above, I think the issue should be addressed because it can silently produce wrong results in a data analysis, without any error or warning message. I found this behaviour because it gave me strange results in a study.

Factor variables are one of the fundamental data types in R and one of the first things new R users encounter, so many people using them won't even know that "attributes" exist. They can't be held responsible for problems caused by them.

As the other poster reported, this issue can also crash R with a fatal error on my computer.

@mattdowle mattdowle added this to the 1.12.2 milestone Mar 14, 2019
@mattdowle mattdowle modified the milestones: 1.12.2, 1.12.4 Mar 20, 2019
Member

mattdowle commented Sep 21, 2019

A further example, in addition to Frank's (which showed that this isn't due to melt or dcast), that also produces the malformed factor error. The problem lies in the internal dogroups.

> DT = data.table(A=1:2)
> g = function(x) { if (x==1L) factor(c("a","b")) else factor(c("b","c")) }
> ans = DT[,g(.GRP),by=A]
> ans
       A     V1
   <int> <fctr>
1:     1      a
2:     1      b
3:     2      a   # wrong silently
4:     2      b   # wrong silently
> g = function(x) { if (x==1L) factor(c("a","b")) else factor(c("a","b","c")) }
> ans = DT[,g(.GRP),by=A]
> ans
Error in as.character.factor(x) : malformed factor
> unclass(ans$V1)
[1] 1 2 1 2 3
attr(,"levels")
[1] "a" "b"
> 
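The unclass() output above shows why printing fails: an integer code (3) points past the end of the levels attribute, which has only two entries. A small check for that condition might look like the sketch below. The helper name is_malformed_factor is hypothetical and not part of data.table or base R, and the "bad" factor is built by hand to mirror the output above rather than produced by grouping.

```r
# Hypothetical helper: TRUE if any integer code falls outside 1..nlevels
is_malformed_factor <- function(f) {
  codes <- as.integer(unclass(f))
  any(codes < 1L | codes > length(levels(f)), na.rm = TRUE)
}

# Hand-built factor with the same codes and levels as the output above
bad <- structure(c(1L, 2L, 1L, 2L, 3L), levels = c("a", "b"), class = "factor")

# A well-formed factor for comparison
good <- factor(c("a", "b", "a", "b", "c"))

is_malformed_factor(bad)   # TRUE
is_malformed_factor(good)  # FALSE
```

Note that such a check cannot catch the first case above (rows 3 and 4), where every code is in range but refers to the wrong level.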

@jangorecki
Member

I faced a similar problem in frollapply, where we call R's C-level eval for every single row (here, for every single group).
Fixing it will require detecting whether factor fields are among the results and synchronising their values and attributes. It is far from trivial.
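At the R level, the synchronisation step described above could be sketched as follows. This is only an illustration of the idea under simplifying assumptions, not the approach taken in dogroups, which is C code. The helper name unify_factors is hypothetical.

```r
# Hypothetical helper: re-encode a list of factors onto one shared level set.
unify_factors <- function(parts) {
  # Union of all levels, in order of first appearance
  all_levels <- unique(unlist(lapply(parts, levels)))
  # Go through the labels so each value is matched by name, not by code
  combined <- unlist(lapply(parts, as.character))
  factor(combined, levels = all_levels)
}

# Per-group results with different level sets, as in the example by group
parts <- list(factor(c("a", "b")), factor(c("b", "c")))
merged <- unify_factors(parts)

levels(merged)      # "a" "b" "c"
as.integer(merged)  # 1 2 2 3
```

Converting through character labels is simple but allocates a string vector per group; a C implementation would more likely remap the integer codes directly.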
