# updating factor column by reference after joining with other DT in i when both contain NA values gives NA as a factor level #1718

Closed
opened this Issue May 24, 2016 · 4 comments

Projects
None yet
3 participants

### daniellemccool commented May 24, 2016

 I'm not totally certain if this is a data.table specific issue since data.table is the only function in R that works this way that I know of, but it's possible it's a function of the way factors are handled in general. Here's my minimal working example: ``````library(data.table) A <- data.table(foo = c(1, 2, 3), bar = c(4, 5, 6)) B <- data.table(foo = c(1, 2, 3, 4, 5, 6), bar = c(NA, NA, NA, 4, 5, 6)) setkey(A, foo) setkey(B, foo) A[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)] B[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)] B[A, bar := i.bar] B[, .N, bar] B[is.na(bar)] `````` As you can see, if you try to update by reference, you end up with (the original) DT A's NA being a proper NA value, but DT B's NA is now another factor level. If only one of the columns contains NA, there's no problem.

### daniellemccool commented May 26, 2016 • edited

 So I've still got no clue where things actually go wrong, but it's worth noting that you can't select it by, for example, either by ``````B[bar == "NA"] # Expected B[bar == levels(bar)[3]] # Kind of figured that would work at least, but it clearly identifies the level as NA instead of NA as a factor level `````` What does work to get around the issue is droplevels() `B[, bar := droplevels(bar)]` Maybe that gives some extra insight into what happens.
Member

### jangorecki commented May 26, 2016 • edited

 I'm not sure if having `NA` values as a factor level is a correct usage of factor. I would expect to have NA factor value, not NA factor level. Now your `B\$bar` has one NA value and another not NA value referring to NA level. Try the following: ``````levels(B\$bar) levels(factor(c(letters[1:3], NA))) `````` It avoids creating NA level.

### daniellemccool commented May 26, 2016 • edited

 Hi jangorecki, The goal here was to illustrate a problem that I'm having in my real dataset, wherein I am merging multiple data sets a column at a time after converting them to factors (the idea being that this reduces the human error component. If I have, for example, a column sex in two different data tables, and in one, it's {0, 1} and in the other, it's {1, 2}, then I can better first turn them both into factors before the join than I can manually convert the numbers. What happens, however, when I have NA values (not as factor levels, but properly NA) in both factors before the merge, they merge into two different NAs somehow, as is demonstrated by my first code chunk. Before the join, A\$bar and B\$bar both contained one NA value. (Not a factor level.) After the join, B\$bar contains one NA value and one NA factor. That's not the intended behavior, and I'm not sure why it happens. Instead there should be, as you say, two values of NA in B\$bar. Does that make sense? Edit: To respond to your suggestion, the problem is that I would rather have the join and assign-by-reference work as it should rather than having to recreate the factor levels again after the fact. I CAN of course do that, but it adds another layer to something that ought to work intrinsically that is -- the expected behaviour from joining the two data frames: ``````library(data.table) A <- data.table(foo = c(1, 2, 3), bar = c(4, 5, 6)) B <- data.table(foo = c(1, 2, 3, 4, 5, 6), bar = c(NA, NA, NA, 4, 5, 6)) setkey(A, foo) setkey(B, foo) A[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)] # A\$bar now looks like c("Boop", "Beep", NA) where NA is a value, not a level. This is correct. is.na(A\$bar) # FALSE, FALSE, TRUE B[, bar := factor(bar, levels = c(4, 5), labels = c("Boop", "Beep"), exclude = 6)] # B\$bar looks like c(NA, NA, NA, "Boop", "Beep", NA), where NA is a value, not a factor level. Also correct. is.na(B\$bar) # TRUE, TRUE, TRUE, FALSE, FALSE, TRUE B[A, bar := i.bar] # This is where things go wrong. # The NA from A becomes a factor LEVEL and not a value. I can't imagine why it does so. is.na(B\$bar) # FALSE, FALSE, FALSE, FALSE, FALSE, TRUE `````` But not only does it become a factor level, it's fundamentally impossible to access with any sort of boolean logic referring to the value it takes on. I can't call it to change it like so: ``````B[bar == "NA", bar := NA] `````` And I can't even access it by referencing what levels() indicates is its factor level. (I don't want it to be a factor level, of course. I want it to have the value NA. I just thought this might provide more insight.) ``````B[bar == levels(bar)[3], bar := NA] `````` I hope that helps with understanding what I'm trying to talk about. The error is that it becomes a factor level after the join and update by reference. It shouldn't. It should stay with the value NA.

Member

### arunsrinivasan commented May 26, 2016

 This particular case happens because levels doesn't contain `NA` in both data.tables. And the internal logic for generating levels in an attempt to add missing levels handles this case wrongly. The root of the problem can be seen by doing: ```as.integer(B\$bar) # [1] 1 2 3 1 2 NA``` after the join+update operation. The 3rd value needs to be `NA`. Will fix.

### mattdowle added a commit that referenced this issue Jun 29, 2017

``` Test 1675.1 updated to cope with a change in R-devel related to and l… ```
`…evels. #1718`
``` 58b4f2d ```

### mattdowle added a commit that referenced this issue Jun 29, 2017

``` Minor news item update. #1718 ```
``` ea1e50e ```