One-hot encoding #3305

Closed
Atrebas opened this issue Jan 22, 2019 · 5 comments

@Atrebas

Atrebas commented Jan 22, 2019

What is your opinion about having a function for one-hot encoding in data.table?
The goal is to convert categorical variables into dummy/indicator variables.

Several options are available, but have some drawbacks:

  • stats::model.matrix is fast, but it is not easy to customize, and handling missing data is awkward
  • Matrix::sparse.model.matrix is similar, with the benefit of a sparse matrix output but slower
  • this SO question pointed to the mltools::one_hot function: it uses data.table and has nice options, but it is not as efficient as it could be
  • there is also a similar function in caret
  • ...
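
For reference, here is a minimal sketch of the stats::model.matrix baseline and how it treats missing values by default. The toy data frame `df` is hypothetical and not taken from any benchmark in this thread; the dropped row results from R's default `na.action` setting (`na.omit`).

```r
# Toy data (hypothetical): one factor column with a missing value
df <- data.frame(color = factor(c("red", NA, "blue")))

# "- 1" removes the intercept so every level gets its own indicator column
mm <- model.matrix(~ color - 1, data = df)

colnames(mm)  # "colorblue" "colorred"
nrow(mm)      # 2: the NA row is dropped under the default na.action
```

Keeping the NA rows requires changing the `na.action`, which is part of what makes this route less convenient.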

I wrote a small function as a test. It is faster than mltools::one_hot because columns are modified by reference, but it is slower than stats::model.matrix.
Here is a first version (a second one includes the mltools::one_hot parameters). I can share more code/benchmarks/references if there is some interest. This is just to demonstrate the idea.

library(data.table)

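setDummies <- function(dt) {
  # Factor columns to expand into indicator columns
  fCols <- names(dt)[sapply(dt, is.factor)]

  for (fCol in fCols) {
    levs <- levels(dt[[fCol]])
    for (lev in levs) {
      newCol <- paste(fCol, lev, sep = "_")
      # Add the indicator column by reference; rows with NA stay at 0
      dt[, (newCol) := 0L]
      dt[get(fCol) %in% lev, (newCol) := 1L]
    }
  }

  # Return dt invisibly so the call can be chained
  invisible(dt)
}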

x <- c("red", NA, "blue")
N <- 10
dt <- data.table(ID = 1:N, color = factor(sample(x, N, replace = TRUE)))

setDummies(dt)
dt[]

## output:
#    ID color color_blue color_red
#  1:  1  blue          1         0
#  2:  2  <NA>          0         0
#  3:  3  <NA>          0         0
#  4:  4   red          0         1
#  5:  5  blue          1         0
#  6:  6  <NA>          0         0
#  7:  7  <NA>          0         0
#  8:  8  <NA>          0         0
#  9:  9   red          0         1
# 10: 10  <NA>          0         0

Also, in Python, pandas includes the get_dummies function.
So, I guess it could also be useful for the Python version of data.table.

Thanks.

@st-pasha
Contributor

In Python datatable there is a .split_into_nhot() function (introduced in h2oai/datatable#1304). It is more general than 1-hot encoding, but using a sep that cannot appear in any string (for example sep='\x00') will achieve 1-hot encoding.

@jangorecki
Member

I would stay with Matrix::sparse.model.matrix. A data.table solution won't scale in memory; we would need to introduce a new sparse data type.
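
A minimal sketch of the sparse route, assuming the Matrix package is installed (it ships with standard R distributions). The toy data frame `df` is hypothetical and contains no missing values, so no rows are dropped.

```r
library(Matrix)

# Toy data (hypothetical); no missing values, so the row count is unchanged
df <- data.frame(color = factor(c("red", "blue", "red", "green")))

# Same formula interface as model.matrix, but the result is stored sparsely
sm <- sparse.model.matrix(~ color - 1, data = df)

dim(sm)       # 4 rows, one column per level
colnames(sm)  # "colorblue" "colorgreen" "colorred"
```

Only the non-zero entries are stored, which is why this scales better in memory than adding one dense column per level.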

@MichaelChirico
Member

I would focus on improving the efficiency of the mltools helper function or sparse.model.matrix.

FWIW I would one-hot using the following:

n = length(x)
out = matrix(0L, nrow = n, ncol = length(levels(x)))
out[cbind(seq_len(n), x)] = 1L

then as.data.table as necessary
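
A self-contained sketch of the matrix-indexing approach above, with a hypothetical factor `x` that has no missing values. The `dimnames` argument and the explicit `as.integer(x)` are additions for readability, not part of the original snippet.

```r
library(data.table)

# Hypothetical input: a factor without missing values
x <- factor(c("red", "blue", "red", "green"))

n <- length(x)
# One integer column per level, named after the levels
out <- matrix(0L, nrow = n, ncol = nlevels(x),
              dimnames = list(NULL, levels(x)))
# Each (row, level code) pair addresses the single cell to set to 1
out[cbind(seq_len(n), as.integer(x))] <- 1L

res <- as.data.table(out)
res
```

The two-column index matrix sets every cell in one vectorized assignment, so there is no loop over levels.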

@DavidArenburg
Member

dcast(dt, ID ~ color, length) ?
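
Spelled out as a sketch on a small hypothetical table (the `dt` below is not the one from the original post). `fun.aggregate` and `value.var` are named explicitly here so that dcast does not have to guess the value column.

```r
library(data.table)

# Hypothetical data: one row per ID, one factor column
dt <- data.table(ID = 1:4, color = factor(c("red", "blue", "red", "blue")))

# Count occurrences of each color per ID; with one row per ID this is 0 or 1
wide <- dcast(dt, ID ~ color, fun.aggregate = length, value.var = "color")
wide
```

This returns a new wide table rather than modifying `dt` by reference, and the original factor column is not carried over.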

@Atrebas
Author

Atrebas commented Jan 23, 2019

Thank you all for your feedback. The different solutions have pros and cons. I guess I will further explore some tweaks using a sparse matrix format as output.

@Atrebas Atrebas closed this as completed Jan 23, 2019