Introduce EntityEmbeddings #267
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #267      +/-   ##
==========================================
+ Coverage   92.42%   96.48%   +4.06%
==========================================
  Files          11       14       +3
  Lines         330      512     +182
==========================================
+ Hits          305      494     +189
+ Misses         25       18       -7
The two classifiers and two regressors.
rng::Union{AbstractRNG, Int64}
optimiser_changes_trigger_retraining::Bool
acceleration::AbstractResource # eg, `CPU1()` or `CUDALibs()`
embedding_dims::Dict{Symbol, Real}
@ablaom I handle each value differently depending on whether it is an integer or a float, as described in the docs.
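To make the integer/float distinction concrete, here is a minimal sketch in plain Julia. It assumes the convention that an integer value is taken as the embedding dimension itself, while a float value is taken as a fraction of the number of levels of the feature (rounded up), with a default of `min(c - 1, 10)` for unlisted features; please check the docs for the authoritative rule. The helper names `resolve_embedding_dim`, `default_embedding_dim` and `resolved_dim` are hypothetical and are not taken from this PR.

```julia
# Hypothetical helpers (not MLJFlux's code): resolve one `embedding_dims` entry
# for a categorical feature that has `c` levels.
resolve_embedding_dim(dim::Integer, c::Integer) = dim      # integer: use as given
function resolve_embedding_dim(frac::AbstractFloat, c::Integer)
    0 < frac <= 1 || throw(ArgumentError("a float value must lie in (0, 1]"))
    return ceil(Int, frac * c)                              # float: fraction of levels
end

# Assumed default for features that are not listed in `embedding_dims`.
default_embedding_dim(c::Integer) = min(c - 1, 10)

function resolved_dim(dims::AbstractDict, name::Symbol, c::Integer)
    haskey(dims, name) || return default_embedding_dim(c)
    return resolve_embedding_dim(dims[name], c)
end

# Illustrative data: user-specified dims and the level count of each feature.
embedding_dims = Dict{Symbol, Real}(:color => 3, :city => 0.5)
numlevels = Dict(:color => 5, :city => 8, :size => 4)

resolved = Dict(name => resolved_dim(embedding_dims, name, c) for (name, c) in numlevels)
```

Dispatching on `Integer` versus `AbstractFloat` keeps the two cases separate without any `isa` checks, which is one reason a `Dict{Symbol, Real}` field type is convenient here.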
Co-authored-by: Anthony Blaom, PhD <anthony.blaom@gmail.com>
[targets]
- test = ["CUDA", "cuDNN", "LinearAlgebra", "MLJBase", "Random", "StableRNGs", "StatisticalMeasures", "StatsBase", "Test"]
+ test = ["CUDA", "cuDNN", "LinearAlgebra", "MLJBase", "Random", "StableRNGs", "StatisticalMeasures", "StatsBase", "ScientificTypes", "Test"]
I think MLJBase, which depends on ScientificTypes, re-exports all the public ScientificTypes methods, so you may be able to dump it here.
@ablaom So replace `using ScientificTypes: coerce, Multiclass, OrderedFactor` with `using MLJBase: coerce, Multiclass, OrderedFactor` in the test file? This is minor because it only affects the tests, right? And it shouldn't re-download a package that is already downloaded?
No, I think `using MLJBase` suffices. All those objects are re-exported.
"""
shape(model::MultitargetNeuralNetworkRegressor, X, y) = (ncols(X), ncols(y))
is_embedding_enabled_type(::MultitargetNeuralNetworkRegressor) = true
As you are now acting on instances instead of types, I'd change the name of your trait from `is_embedding_enabled_type` to `is_embedding_enabled`, but this is just a suggestion.
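For illustration, a trait that acts on instances could look like the sketch below. The model types here are hypothetical stand-ins rather than the MLJFlux models, and the `false` fallback is an assumption about how models without embedding support would be handled.

```julia
# Hypothetical stand-in types, used only to show dispatch on instances.
abstract type AbstractDemoModel end
struct DemoEmbeddingModel <: AbstractDemoModel end
struct DemoPlainModel <: AbstractDemoModel end

# Trait defined on instances: anything not explicitly opted in returns `false`.
is_embedding_enabled(::Any) = false
is_embedding_enabled(::DemoEmbeddingModel) = true

enabled = is_embedding_enabled(DemoEmbeddingModel())   # an instance opts in
plain = is_embedding_enabled(DemoPlainModel())         # falls back to `false`
on_type = is_embedding_enabled(DemoEmbeddingModel)     # the type itself also falls back
```

Because the method signature takes an instance, passing the type itself hits the fallback, which is why dropping the `_type` suffix makes the name less misleading.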
hasproperty(transformer, :embedding_dims) || return Xnew
ordinal_mappings, embedding_matrices = fitresult[3:4]
Xnew = ordinal_encoder_transform(Xnew, ordinal_mappings)
Xnew_transf = embedding_transform(Xnew, embedding_matrices)
Also, do we have any tests for `transform(::MLJFluxModel, ...)`?
I think transform is tested in entity_embedding.jl
Okay, I see it's covered in Codecov.
Can I ask that this method (and corresponding test) be moved to "mlj_model_interface.jl", where the other implementations of MLJModelInterface methods (such as fit and predict) live?
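For readers following this thread, the two-step transform in the diff above (ordinal-encode the categorical column, then look up its embedding) can be sketched in plain Julia. The helpers `ordinal_encode` and `embed`, the level-to-code mapping and the embedding matrix are all hypothetical; they are not the PR's `ordinal_encoder_transform` and `embedding_transform`, whose actual signatures and data layout may differ.

```julia
# Hypothetical helper: map each level of a categorical column to its integer code.
function ordinal_encode(column::AbstractVector, mapping::AbstractDict)
    return [mapping[level] for level in column]
end

# Hypothetical helper: `E` has one column per level, so indexing the columns of
# `E` with the codes yields one embedding vector per observation.
function embed(codes::AbstractVector{<:Integer}, E::AbstractMatrix)
    return E[:, codes]
end

# Illustrative learned state: three levels embedded in two dimensions.
mapping = Dict("red" => 1, "green" => 2, "blue" => 3)
E = [0.1 0.2 0.3;
     1.0 2.0 3.0]

column = ["blue", "red", "blue"]
codes = ordinal_encode(column, mapping)
embedded = embed(codes, E)    # 2×3 matrix: embedding dimension × observations
```

A test for `transform` along these lines would compare the transformed table against a direct lookup in the stored embedding matrices.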
ablaom left a comment
Thanks for the speedy response to my review.
A couple of minor points are left. Note in particular the need to annotate the type of the model (`transformer`) in the `MLJModelInterface.transform` overload. We should also have a test for this method for one of the models.
@ablaom I have addressed everything (except removing ScientificTypes from the tests). I noticed that after adding categorical variables to the integration tests, GPU testing fails :(. I will look into that when I get the chance.
@EssamWisam You can try raising the tolerance near test/classifier.jl:114, which seems to be where the failure occurs.
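As a generic illustration of what raising a tolerance means in a Julia test (this is not the actual code at test/classifier.jl:114, and the numbers are made up), a comparison with `isapprox` can be loosened through its `rtol` keyword:

```julia
using Test

# Made-up values standing in for a CPU reference and a slightly different GPU result.
reference = 0.7312
gpu_result = 0.7318

# The same comparison under a strict and a looser relative tolerance.
strict = isapprox(gpu_result, reference; rtol=1e-4)
loose = isapprox(gpu_result, reference; rtol=1e-2)

@test !strict   # the small discrepancy fails under the strict tolerance
@test loose     # and passes once the tolerance is raised
```

Small CPU/GPU discrepancies of this kind are common with floating-point arithmetic, so loosening the tolerance is often reasonable as long as the looser bound is still meaningful for the test.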
Hooray, it worked 🎉 @ablaom
ablaom left a comment
Thanks for the updates.
Just a couple more details, as flagged.
@ablaom let's finalize this!
ablaom left a comment
Thanks again @EssamWisam for this valuable contribution.
Just made a few doc string tweaks after reviewing these again. Ready to go. 🎉
Basic Description
This PR extends the `NeuralNetworkClassifier`, `NeuralNetworkBinaryClassifier`, `NeuralNetworkRegressor` and `MultitargetNeuralNetworkRegressor` models of the `MLJFlux.jl` library such that:
- These models now support tables with categorical columns. If (and only if) any are present, an extra entity embedding layer is introduced after the input, as described in the paper Entity Embeddings of Categorical Variables by Cheng Guo and Felix Berkhahn.
- After training any of these models, it is possible to transform the categorical columns of a new sample, whether or not it was seen in training, in order to encode those columns for a further model or transformer in a pipeline.
See the updated documentation to learn more.
Implementation Plan
The following was my plan for implementing this feature, which is less trivial than it may seem at first glance.
- MLJFlux with more formal tests that compare a mathematical implementation to the actual one
- EntityEmbedder
- input MLJInterface.fit insert EntityEmbedder into the model chain when needed (input has categorical columns)
- MLJInterface.update accordingly
- classifier.jl and regressor.jl
- MLJInterface.fit (refactoring)
- MLJModelsInterface.jl for less redundancy and more organization
- EntityEmbedder types.jl
- EntityEmbedder
- EntityEmbedder (plan to also make a tutorial(s) later)
- entity-embedding.jl, entity-embedding-utils.jl and encoders.jl
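The idea of testing against "a mathematical implementation" can be illustrated with a small self-contained check. In the paper's formulation, embedding a level amounts to multiplying an embedding matrix by the one-hot vector of that level, which should agree with a direct column lookup. The sketch below is only an assumption about what such a test might compare; `onehot`, `W` and `level_codes` are hypothetical names, not code from this PR.

```julia
# Hypothetical helper: one-hot vector for integer code `code` among `nlevels` levels.
onehot(code::Integer, nlevels::Integer) = [i == code ? 1.0 : 0.0 for i in 1:nlevels]

# Illustrative embedding matrix: 2 dimensions × 3 levels.
W = [0.5 -1.0 2.0;
     1.5  0.0 4.0]
level_codes = [2, 3, 1, 3]

# "Mathematical" version: embedding matrix times one-hot vector, per observation.
via_matmul = reduce(hcat, [W * onehot(c, size(W, 2)) for c in level_codes])

# "Actual" version: direct column lookup.
via_lookup = W[:, level_codes]

agree = via_matmul ≈ via_lookup
```

The lookup avoids materializing one-hot vectors, which is the practical reason an embedding layer is implemented by indexing rather than by matrix multiplication.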