One-hot encoding #3305

Atrebas · 2019-01-22T16:58:30Z

What is your opinion about having a function for one-hot encoding in data.table?
The goal is to convert categorical variables into dummy/indicator variables.

Several options are available, but each has drawbacks:

stats::model.matrix is fast, but it is not easy to customize and handling missing data is awkward
Matrix::sparse.model.matrix is similar, with the benefit of a sparse matrix output, but it is slower
this SO question pointed to the mltools::one_hot function: it uses data.table and has nice options, but it is not as efficient as it could be
there is also a similar function in caret
...
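
To illustrate the missing-data point about stats::model.matrix, here is a minimal sketch. The toy data frame and the variable names are assumptions made for illustration; only base R functions are used. Under the default na.action option (na.omit), rows with a missing factor value are dropped, and keeping them takes an extra step through model.frame.

```r
# Toy data with one missing value; the names are illustrative only.
df <- data.frame(color = factor(c("red", NA, "blue")))

# Default behaviour: the row with the missing value is dropped,
# because the default na.action option is na.omit.
mm_default <- model.matrix(~ color - 1, data = df)

# Keeping every row requires building the model frame with na.pass first.
mf <- model.frame(~ color, data = df, na.action = na.pass)
mm_pass <- model.matrix(~ color - 1, data = mf)

nrow(mm_default)
nrow(mm_pass)
```

The two row counts differ, which is the kind of silent row loss that makes the default awkward when the result has to be bound back to the original table.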

I wrote a small function as a test. It is faster than mltools::one_hot because columns are manipulated by reference but it is slower than stats::model.matrix.
Here is a first version (a second one includes the mltools::one_hot parameters). I can share more code/benchmarks/references if there is some interest. This is just to demonstrate the idea.

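library(data.table)

# Add one 0/1 indicator column per level of every factor column, by reference.
setDummies <- function(dt) {

  fCols <- names(dt)[sapply(dt, is.factor)]

  for (fCol in fCols) {
    levs <- dt[, levels(get(fCol))]
    for (lev in levs) {
      newCol <- paste(fCol, lev, sep = "_")
      dt[, (newCol) := 0L]                    # initialise the indicator to 0
      dt[get(fCol) %in% lev, (newCol) := 1L]  # NA rows stay 0 in every indicator
    }
  }

  invisible(dt)
}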

x <- c("red", NA, "blue")
N <- 10
dt <- data.table(ID = 1:N, color = factor(sample(x, N, replace = TRUE)))

setDummies(dt)
dt[]

## output:
#    ID color color_blue color_red
#  1:  1  blue          1         0
#  2:  2  <NA>          0         0
#  3:  3  <NA>          0         0
#  4:  4   red          0         1
#  5:  5  blue          1         0
#  6:  6  <NA>          0         0
#  7:  7  <NA>          0         0
#  8:  8  <NA>          0         0
#  9:  9   red          0         1
# 10: 10  <NA>          0         0
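
As a possible tweak of the same idea, the indicator columns could be written with data.table::set() instead of `:=` inside `[`, which avoids the overhead of calling `[.data.table` twice per level. This is only a sketch: the function name setDummiesSet is hypothetical, and no benchmark is claimed here.

```r
library(data.table)

# Hypothetical variant of setDummies(); the name is illustrative only.
setDummiesSet <- function(dt) {
  fCols <- names(dt)[vapply(dt, is.factor, logical(1L))]
  for (fCol in fCols) {
    values <- dt[[fCol]]
    for (lev in levels(values)) {
      # %in% returns FALSE for NA, so missing values get 0 in every indicator.
      set(dt, j = paste(fCol, lev, sep = "_"),
          value = as.integer(values %in% lev))
    }
  }
  invisible(dt)
}

dt2 <- data.table(ID = 1:3, color = factor(c("red", NA, "blue")))
setDummiesSet(dt2)
dt2[]
```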

Also, in Python, pandas includes the get_dummies function.
So, I guess it could also be useful for the Python version of data.table.

Thanks.


st-pasha · 2019-01-22T19:45:37Z

In Python datatable there is a .split_into_nhot() function (introduced in h2oai/datatable#1304). It is more general than 1-hot encoding, but using a sep that cannot appear in any string (for example sep='\x00') will achieve 1-hot encoding.

jangorecki · 2019-01-23T04:48:56Z

I would stay with Matrix::sparse.model.matrix. A data.table solution won't scale in memory; we would need to introduce a new sparse data type.
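
For reference, a minimal sketch of the sparse route, assuming the Matrix package is installed (it ships with standard R distributions). The toy data and variable names are illustrative only.

```r
library(Matrix)

# Toy factor column without missing values; the names are illustrative only.
df <- data.frame(color = factor(c("red", "blue", "red", "blue")))

# "- 1" drops the intercept so that every level gets its own indicator column.
sm <- sparse.model.matrix(~ color - 1, data = df)

class(sm)
dim(sm)
```

Only the non-zero entries are stored, which is why this format scales better in memory than one dense integer column per level.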

MichaelChirico · 2019-01-23T06:46:47Z

I would focus on improving the efficiency of the mltools helper function or of sparse.model.matrix.

FWIW I would one-hot using the following:

# x is assumed to be a factor
n = length(x)
out = matrix(0L, nrow = n, ncol = length(levels(x)))
out[cbind(seq_len(n), x)] = 1L

Then call as.data.table as necessary.
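
A self-contained sketch of this matrix-indexing approach follows. The toy factor, the explicit as.integer() call, and the column names taken from the levels are assumptions added for illustration.

```r
library(data.table)

# Toy factor without missing values; the values are illustrative only.
x <- factor(c("red", "blue", "red"))
n <- length(x)

# One column per level, named after the levels (sorted alphabetically).
out <- matrix(0L, nrow = n, ncol = nlevels(x),
              dimnames = list(NULL, levels(x)))

# Each (row, level code) pair addresses exactly one cell to set to 1.
out[cbind(seq_len(n), as.integer(x))] <- 1L

res <- as.data.table(out)
res
```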

DavidArenburg · 2019-01-23T09:17:22Z

dcast(dt, ID ~ color, length) ?
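
A runnable sketch of the dcast idea, assuming one row per ID. The toy table is illustrative, and value.var is spelled out here only to avoid the message dcast prints when it has to guess it.

```r
library(data.table)

# Toy table with a unique ID per row; the values are illustrative only.
dt3 <- data.table(ID = 1:3, color = factor(c("red", "blue", "red")))

# Counting rows per (ID, color) cell gives 0/1 indicators when ID is unique.
wide <- dcast(dt3, ID ~ color, fun.aggregate = length, value.var = "color")
wide
```

Note that this returns a new table rather than modifying dt3 by reference, and the indicators are counts, so they only stay 0/1 if each ID appears once.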

Atrebas · 2019-01-23T19:22:25Z

Thank you all for your feedback. The different solutions have pros and cons. I guess I will further explore some tweaks using a sparse matrix format as output.

Atrebas closed this as completed Jan 23, 2019
