Use of ScientificTypes and CategoricalArrays in native model #907
Thanks for your query. There are of course plenty of models that handle, say, a target with scitype `AbstractVector{Finite{3}}`:

```julia
julia> models() do m
           AbstractVector{Finite{3}} <: m.target_scitype
       end
```

There aren't many models that handle a table of `Finite{3}` features:

```julia
julia> models() do m
           Table(Finite{3}) <: m.input_scitype
       end
13-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype), T} where T<:Tuple}:
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = ConstantRegressor, package_name = MLJModels, ... )
 (name = ContinuousEncoder, package_name = MLJModels, ... )
 (name = DecisionTreeClassifier, package_name = BetaML, ... )
 (name = DecisionTreeRegressor, package_name = BetaML, ... )
 (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
 (name = DeterministicConstantRegressor, package_name = MLJModels, ... )
 (name = FeatureSelector, package_name = MLJModels, ... )
 (name = FillImputer, package_name = MLJModels, ... )
 (name = OneHotEncoder, package_name = MLJModels, ... )
 (name = RandomForestClassifier, package_name = BetaML, ... )
 (name = RandomForestRegressor, package_name = BetaML, ... )
 (name = Standardizer, package_name = MLJModels, ... )
```

Under the hood some of these models convert the categorical vectors to integer vectors (and back again), but as a categorical array is essentially an array of integers plus metadata, I don't think there's a big performance cost. (You can reduce the cost further by implementing a data "front-end", but I doubt it's worth it unless your model has an iteration parameter, maybe.)

The DecisionTree.jl models (not listed) support features that are

The advantage of having categorical data arrive as a

Does this adequately address your query?
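As an aside, the "array of integers plus metadata" point is easy to see directly with the public CategoricalArrays API. A minimal sketch (the variable names are illustrative only):

```julia
using CategoricalArrays

v = categorical(["yes", "no", "yes", "maybe"])

# The metadata: the pool of levels (sorted by default).
levels(v)        # ["maybe", "no", "yes"]

# The underlying integers: one code per element.
levelcode.(v)    # [3, 2, 3, 1]

# Round-tripping through integer codes and back is cheap:
codes = levelcode.(v)
reconstructed = levels(v)[codes]   # ["yes", "no", "yes", "maybe"]
```

This is why converting to an integer representation internally and back again costs little: the conversion is essentially an indexing operation.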
Thanks for the comprehensive answer! In the meantime I've learned that I didn't have a complete understanding of how

But there are still a few details about handling types outside and inside of MLJ which I don't understand yet:
```julia
module MyModel

function my_fit(X::AbstractDataFrame, y::AbstractVector)
    ...
end

end
```

... and I want to integrate this model into MLJ (i.e. register it as a new model in MLJ). Am I running into any trouble because the type of
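For context, the usual pattern is for a thin interface layer to bridge exactly this gap. A hedged sketch, assuming the `my_fit` above — `MyClassifier` and the surrounding glue are hypothetical names, not MLJ-prescribed ones:

```julia
# Hypothetical interface glue, wrapping the native `MyModel.my_fit`.
import MLJModelInterface as MMI
using DataFrames

mutable struct MyClassifier <: MMI.Deterministic end

function MMI.fit(model::MyClassifier, verbosity::Int, X, y)
    # MLJ only guarantees that X is a Tables.jl-compatible table whose
    # scitype matches the declared input_scitype. The interface layer
    # materializes the concrete type the native code expects:
    Xdf = DataFrame(X)
    fitresult = MyModel.my_fit(Xdf, y)   # native fit, untouched
    cache, report = nothing, NamedTuple()
    return fitresult, cache, report
end
```

The native signature stays `X::AbstractDataFrame`; only the interface layer needs to know that MLJ hands over a generic table.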
In case it's not clear (it probably is), under the hood you are free to use whatever types you like. What I think we are discussing here is how the data arrives to the user (for training) and the form in which data leaves (e.g., predictions), which should match where appropriate. I suppose it's not strictly mandatory to require the target to come in as a CategoricalArray. You just need to be able to articulate your data requirements using scientific types. So you could declare, say, the target scitype to be

Could you say more about why you might not want to force MLJ users to use CategoricalArrays? And by the way, there's nothing to stop you having a "local" interface which completely dodges the scitype issue, and a separate MLJ interface.
As I say, you can use whatever type you like under the hood. However, if you are using

Perhaps you want to share more detail about the model you have in mind?
Thanks for your explanations! I've just finished an update to the model which I would like to register with MLJ (roland-KA/OneRule). So we now have an example to look at. This model uses categorical data for the features as well as for the target. Its internal fit-function is `get_best_tree`:

```julia
function get_best_tree(X::AbstractDataFrame, y::AbstractVector)
    trees = all_trees(X, y)
    return trees[argmin(trees)]
end
```

But I have defined the data types for use with MLJ as follows:

```julia
MMI.metadata_model(OneRuleClassifier,
    input_scitype = MMI.Table(MMI.Finite),
    target_scitype = AbstractVector{<: MMI.Finite},
    ...
```

So a user of the MLJ interface should pass a

I hope this makes sense? With my questions above, I just wanted to make sure that the first version isn't a complete mess 🤓. The tests in the package look like this:

```julia
using DataFrames
using OneRule
using MLJ
using CategoricalArrays

### create test data
weather = DataFrame(
    outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast", "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    temperature = ["hot", "hot", "hot", "mild", "cool", "cool", "cool", "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    humidity = ["high", "high", "high", "high", "normal", "normal", "normal", "high", "normal", "normal", "normal", "high", "normal", "high"],
    windy = ["false", "true", "false", "false", "false", "true", "true", "false", "false", "false", "true", "true", "false", "true"]
)
play = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"]

# create/adapt test data for use via MLJ interface
coerce!(weather, Textual => Multiclass)
play_cat = categorical(play)

# ML workflow
orc = OneRuleClassifier()
mach = machine(orc, weather, play_cat)
fit!(mach)
yhat_cat = MLJ.predict(mach, weather)
fitted_tree = report(mach).tree
```
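One sanity check worth adding to such tests, sketched here under the assumption that `weather`, `play_cat`, and `orc` are as defined above: verify that the coerced data actually satisfies the declared scitypes before fitting.

```julia
# After coercion, the data should match the metadata_model declarations:
using MLJ   # re-exports scitype and the model trait functions

scitype(weather)    # a Table scitype whose columns are Multiclass
scitype(play_cat)   # AbstractVector{Multiclass{2}}

# Model/data compatibility can then be checked up front:
scitype(play_cat) <: target_scitype(orc)   # should hold if declarations match
```

If the subtype check fails, MLJ's `machine` constructor will issue a warning, so catching the mismatch in the tests gives a clearer error.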
Cool. Hopefully I can take a look next week.
So I'm closing this issue, as the remaining questions are better addressed in the issue you opened on OneRule.jl.
The discussion here and on roland-KA/OneRule.jl#2 helped me to understand several aspects of using learning models outside and inside of MLJ:
@roland-KA Thanks indeed for taking the time to document your experience! I think the synopsis is generally correct. I wouldn't say that

As you correctly explain, when you implement an interface for an MLJ model, the interface must make type adjustments to account for any mismatch between what a model expects and conceptualises as a "multiclass vector", say, and what objects actually have

In this way, scientific types are simply a way to: (i) enforce a uniformity in the data types that MLJ users present to their models, and (ii) allow the user to focus on the purpose (scientific type) of the data rather than on its specific machine representation.
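Both points can be illustrated in a few lines: the same underlying data only changes scientific type through an explicit coercion of its machine representation. A minimal sketch:

```julia
using ScientificTypes, CategoricalArrays

raw = ["yes", "no", "yes"]

# A plain String vector has a Textual element scitype:
scitype(raw)       # AbstractVector{Textual}

# Coercion changes the machine type (to a CategoricalVector) and
# thereby the scientific type:
coerced = coerce(raw, Multiclass)
scitype(coerced)   # AbstractVector{Multiclass{2}}
```

The user declares intent (`Multiclass`) and never needs to care that the uniform machine representation MLJ settles on happens to be a `CategoricalArray`.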
Thanks for your clarifying feedback! Step by step I'm working towards a full understanding of the concepts 🤓.
I'm trying to adapt a model for use with MLJ. The features as well as the target used in this model are categorical data. MLJ uses `ScientificTypes` for all data (and `CategoricalArrays` for categorical data). Therefore I'm thinking about using these constructs already in the native model. But I didn't find any existing native models using these constructs. So I'm wondering if there are any disadvantages associated with this approach. What are the pros and cons of using `ScientificTypes` and `CategoricalArrays` already in a native model (if the model should be integrated with MLJ)?