One-hot encoding #3305
Comments
In python |
I would stay with |
I would focus on improving efficiency of
FWIW I would one-hot using the following:
then |
Thank you all for your feedback. The different solutions have pros and cons. I guess I will further explore some tweaks using a sparse matrix format as output.
What is your opinion about having a function for one-hot encoding in data.table?
The goal is to convert categorical variables into dummy/indicator variables.
Several options are available, but each has drawbacks:
- `stats::model.matrix` is fast, but it is not easy to customize and handling missing data is awkward
- `Matrix::sparse.model.matrix` is similar, with the benefit of a sparse matrix output, but it is slower
- `mltools::one_hot` uses data.table and has nice options, but it is not as efficient as it could be

I wrote a small function as a test. It is faster than `mltools::one_hot` because columns are manipulated by reference, but it is slower than `stats::model.matrix`.
Here is a first version (a second one includes the `mltools::one_hot` parameters). I can share more code/benchmarks/references if there is some interest. This is just to demonstrate the idea.
Also, in Python, pandas includes the get_dummies function.
So, I guess it could also be useful for the Python version of data.table.
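To make the idea concrete for Python readers, here is a minimal, stdlib-only sketch of one-hot encoding that adds indicator columns to a column-oriented table in place, loosely mirroring the "by reference" approach described above. It is not the function from this issue, nor the API of data.table, pandas, or mltools: the name `one_hot_inplace`, its parameters `na_column` and `drop_original`, and the `<column>_<level>` naming scheme are all assumptions made for illustration. It also assumes that missing values are represented by `None` and that the levels of a column can be sorted together (for example, all strings).

```python
from typing import Dict, List

# Hypothetical column-oriented table: column name -> list of values.
Table = Dict[str, List[object]]


def one_hot_inplace(table: Table, column: str,
                    na_column: bool = False,
                    drop_original: bool = True) -> Table:
    """Add 0/1 indicator columns for `column`, modifying `table` in place.

    Illustrative sketch only: missing values are assumed to be None.
    """
    values = table[column]
    # Distinct non-missing levels, sorted for a stable column order.
    levels = sorted({v for v in values if v is not None})
    for level in levels:
        # A missing value never equals a level, so it yields 0 everywhere.
        table[f"{column}_{level}"] = [int(v == level) for v in values]
    if na_column:
        # Optional explicit indicator for missing values.
        table[f"{column}_NA"] = [int(v is None) for v in values]
    if drop_original:
        del table[column]
    return table


table = {"id": [1, 2, 3, 4], "color": ["red", "blue", None, "red"]}
one_hot_inplace(table, "color", na_column=True)
print(table)
```

Because the table is mutated rather than copied, only the new indicator columns are allocated. Whether that translates into a real speed-up depends on the implementation, so any comparison against `stats::model.matrix` or `mltools::one_hot` would still need benchmarks.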
Thanks.