Use of `ScientificTypes` and `CategoricalArrays` in native model #907

roland-KA · 2022-03-02T18:04:26Z

I'm trying to adapt a model for use with MLJ. The features as well as the target used in this model are categorical data.

MLJ uses ScientificTypes for all data (and CategoricalArrays for categorical data). Therefore I'm thinking about using these constructs already in the native model. But I didn't find any existing native models using these constructs. So I'm wondering if there are any disadvantages associated with this approach. What are the pros and cons of using ScientificTypes and CategoricalArrays already in a native model (if the model should be integrated with MLJ)?


ablaom · 2022-03-06T20:37:28Z

Thanks for your query.

There are of course plenty of models that specify Multiclass, OrderedFactor (or Finite, which is either) for the target, and in those cases you are correct that this means the user is passing a categorical vector (single-target case) or a table of categorical vectors (multi-target case). For example, to see all the models that can handle a single Multiclass target with, say, 3 classes, do:

models() do m
    AbstractVector{Finite{3}} <: m.target_scitype
end

There aren't many models that handle a table of Multiclass features, but there are some:

julia> models() do m
         Table(Finite{3}) <: m.input_scitype
       end
13-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype), T} where T<:Tuple}:
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = ConstantRegressor, package_name = MLJModels, ... )
 (name = ContinuousEncoder, package_name = MLJModels, ... )
 (name = DecisionTreeClassifier, package_name = BetaML, ... )
 (name = DecisionTreeRegressor, package_name = BetaML, ... )
 (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
 (name = DeterministicConstantRegressor, package_name = MLJModels, ... )
 (name = FeatureSelector, package_name = MLJModels, ... )
 (name = FillImputer, package_name = MLJModels, ... )
 (name = OneHotEncoder, package_name = MLJModels, ... )
 (name = RandomForestClassifier, package_name = BetaML, ... )
 (name = RandomForestRegressor, package_name = BetaML, ... )
 (name = Standardizer, package_name = MLJModels, ... )

Under the hood, some of these models convert the categorical vectors to integer vectors (and back again), but as a categorical array is essentially an array of integers plus metadata, I don't think there's a big performance cost. (You can reduce the cost further by implementing a data "front-end", but I doubt it's worth it unless your model has an iteration parameter, maybe.) The DecisionTree.jl models (not listed) support features that are OrderedFactor, and I don't think there is any conversion, because the algorithm only needs the order operator <.
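As a rough illustration of the integer round trip mentioned above, here is a minimal sketch using only CategoricalArrays. It is not taken from any of the listed models; the variable names are made up, and it assumes a CategoricalArrays version that exports `levelcode` and `unwrap`.

```julia
using CategoricalArrays

y = categorical(["no", "yes", "yes", "no", "maybe"])

# Integer representation: each element's index into the level pool.
codes = levelcode.(y)

# Going back: look the labels up in the pool, then re-wrap them with
# the same pool so that no class is lost along the way.
pool = unwrap.(levels(y))
restored = categorical(pool[codes]; levels = pool)
```

An MLJ interface may well use its own helpers for this step; the sketch only shows that the conversion amounts to an index lookup in each direction.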

The advantage of having categorical data arrive as a CategoricalArray is that you always get the complete pool of classes, even if resampling has hidden some of them. If you haven't already, have a look at this section of "Working with categorical data".
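A small sketch of what "complete pool of classes" means in practice, assuming only CategoricalArrays (the data and variable names are illustrative):

```julia
using CategoricalArrays

y = categorical(["a", "b", "c", "a"])

# A resampled subset that happens to contain only the class "a".
train = y[[1, 4]]

observed = unique(train)   # classes actually present in the subset
pool = levels(train)       # the pool still lists every class of y
```

Here `observed` has a single element, while `pool` still contains all three classes, which is the information a classifier needs in order to predict a class it did not see in a particular fold.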

Does this adequately address your query?

roland-KA · 2022-03-09T12:45:55Z

Thanks for the comprehensive answer! In the meantime I've learned that I didn't have a complete understanding of how ScientificTypes work behind the scenes. So part of my question probably didn't make much sense 😬.

But there are still a few details about handling types outside and inside of MLJ which I don't understand yet:

You write that users pass CategoricalArrays if the target is declared to be of type Multiclass or OrderedFactor. Is the use of a CategoricalArray mandatory in this case, or is it just more efficient (since the different classes are stored only once) and more informative (since a CategoricalArray knows all classes)? Or would it also be possible to pass a normal Array?

My second question is about the situation where the features are categorical data and are thus declared in MLJ to be of type Multiclass (or OrderedFactor). Suppose I have a model MyModel which implements its fit function my_fit as follows, with a relatively specific data type like AbstractDataFrame:

module MyModel

function my_fit(X::AbstractDataFrame, y::AbstractVector)
    ...
end

... and I want to integrate this model into MLJ (i.e. register it as a new model in MLJ). Will I run into trouble because the type of X might be too specific, or does this work as long as I'm using a type which conforms to Tables? Many models I've seen so far use an AbstractMatrix here. Does MLJ make any assumptions about this situation?

ablaom · 2022-03-09T22:24:28Z

> Thanks for the comprehensive answer! In the meantime I've learned that I didn't have a complete understanding of how ScientificTypes work behind the scenes. So part of my question probably didn't make much sense 😬.
>
> But there are still a few details about handling types outside and inside of MLJ which I don't understand yet:
>
> You write that users pass CategoricalArrays if the target is declared to be of type Multiclass or OrderedFactor. Is the use of a CategoricalArray mandatory in this case, or is it just more efficient (since the different classes are stored only once) and more informative (since a CategoricalArray knows all classes)? Or would it also be possible to pass a normal Array?

In case it's not clear (it probably is): under the hood you are free to use whatever types you like. What I think we are discussing here is the form in which data arrives from the user (for training) and the form in which it leaves (eg, prediction), which should match where appropriate.

I suppose it's not strictly mandatory to require the target to come in as a CategoricalArray. You just need to be able to articulate your data requirements using scientific types. So you could declare, say, the target scitype to be AbstractVector{Count} (or a union of types to allow more than one kind), which would imply the user passes an AbstractVector{<:Integer}, but that has three problems: (i) MLJ propaganda is that Count is for discrete, typically unbounded "frequency" data, so there is a danger of the user misinterpreting the kind of modelling that is happening; (ii) you need a separate mechanism for conveying information about the complete class pool (eg, a separate training argument to fit); and (iii) user confusion around the fact that all the other MLJ classifiers declare the target scitype to be AbstractVector{<:OrderedFactor} or AbstractVector{<:Multiclass} or similar.

Could you say more about why you might not want to force MLJ users to use CategoricalArrays?

And by the way, there's nothing to stop you having a "local" interface which completely dodges the scitype issue, and a separate MLJ interface.

> My second question is about the situation where the features are categorical data and are thus declared in MLJ to be of type Multiclass (or OrderedFactor). Suppose I have a model MyModel which implements its fit function my_fit as follows, with a relatively specific data type like AbstractDataFrame:
>
> module MyModel
>
> function my_fit(X::AbstractDataFrame, y::AbstractVector)
>     ...
> end
>
> ... and I want to integrate this model into MLJ (i.e. register it as a new model in MLJ). Will I run into trouble because the type of X might be too specific, or does this work as long as I'm using a type which conforms to Tables? Many models I've seen so far use an AbstractMatrix here. Does MLJ make any assumptions about this situation?

As I say, you can use whatever type you like under the hood. However, if you are using AbstractDataFrame, there's a chance your model works for any Tables.jl-compatible table; you just need to drop the type annotation AbstractDataFrame and declare an input_scitype of MLJModelInterface.Table(Finite) if, say, all columns need to be CategoricalArrays.
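To make the "drop the type annotation" suggestion concrete, here is a hedged sketch of a function that accepts any Tables.jl-compatible table instead of an AbstractDataFrame. The function name `column_class_counts` is hypothetical and not part of any package discussed here; the sketch uses only the Tables.jl accessors `istable`, `columns`, `columnnames` and `getcolumn`.

```julia
using Tables

# Hypothetical helper: count the distinct values in each column of
# any Tables.jl-compatible table (DataFrame, NamedTuple of vectors, ...).
function column_class_counts(X)
    Tables.istable(X) || throw(ArgumentError("X must be a Tables.jl-compatible table"))
    cols = Tables.columns(X)
    return Dict(name => length(unique(Tables.getcolumn(cols, name)))
                for name in Tables.columnnames(cols))
end

# A NamedTuple of vectors already counts as a column table.
X = (outlook = ["sunny", "rainy", "sunny"], windy = ["true", "false", "false"])
counts = column_class_counts(X)
```

The same call would work unchanged with a DataFrame, since the function never relies on DataFrames-specific methods.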

Perhaps you want to share more detail about the model you have in mind?

roland-KA · 2022-03-09T23:39:50Z

Thanks for your explanations! I've just finished an update to the model which I would like to register with MLJ (roland-KA/OneRule). So we now have an example to look at.

This model uses categorical data for the features as well as for the target. Its internal fit function (get_best_tree in trees.jl) uses the data types I mentioned above, as follows:

function get_best_tree(X::AbstractDataFrame, y::AbstractVector)
    trees = all_trees(X, y)
    return(trees[argmin(trees)])
end

But I have defined the data types for use with MLJ as (in OneRule_MLJ.jl):

MMI.metadata_model(OneRuleClassifier,
    input_scitype    = MMI.Table(MMI.Finite),
    target_scitype   = AbstractVector{<: MMI.Finite},
...

So a user of the MLJ interface should pass a DataFrame with columns of CategoricalArrays and the target as well as the predictions should be CategoricalArrays too.

I hope this makes sense? With my questions above, I just wanted to make sure that the first version isn't a complete mess 🤓.

The tests in runtests.jl show more or less how the model can be used via the MLJ interface. Essentially it is:

using DataFrames
using OneRule
using MLJ
using CategoricalArrays

### create test data 

weather = DataFrame(
    outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast", "sunny", "sunny", "rainy",  "sunny", "overcast", "overcast", "rainy"],
    temperature = ["hot", "hot", "hot", "mild", "cool", "cool", "cool", "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    humidity = ["high", "high", "high", "high", "normal", "normal", "normal", "high", "normal", "normal", "normal", "high", "normal", "high"],
    windy = ["false", "true", "false", "false", "false", "true", "true", "false", "false", "false", "true", "true", "false", "true"]
)

play = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"]

# create/adapt test data for use via MLJ interface
coerce!(weather, Textual => Multiclass)
play_cat = categorical(play)

# ML workflow
orc = OneRuleClassifier()
mach = machine(orc, weather, play_cat)
fit!(mach)
yhat_cat = MLJ.predict(mach, weather)
fitted_tree = report(mach).tree

ablaom · 2022-03-10T23:22:01Z

Cool. Hopefully, I can take a look next week.

ablaom · 2022-03-15T22:05:09Z

roland-KA/OneRule.jl#2

roland-KA · 2022-03-17T13:22:29Z

So I'm closing this issue, as the remaining questions are better addressed in the issue you opened on OneRule.

roland-KA · 2022-03-21T20:13:25Z

The discussion here and on roland-KA/OneRule.jl#2 helped me understand several aspects of using ScientificTypes and CategoricalArrays inside and outside of MLJ much better. I will therefore summarize here the take-aways for the points where I had difficulties in the beginning. Perhaps it will help other users get started faster with these topics.

- ScientificTypes defines a type system that is conceptually an abstraction layer on top of the Julia type system (though technically the types defined by ScientificTypes are ordinary Julia types) and that specifically addresses the needs of machine learning.
- Once ScientificTypes is loaded, the scientific types are just there, ready to use. There is no need to define or declare your data objects or variables for use with this type system.
  - I.e. you can ask for the scientific type of any object or variable using the function scitype (the same way typeof is used on the Julia level). At this stage, ScientificTypes just tries to infer the scientific type from the Julia type.
  - using MLJ implicitly loads ScientificTypes.
- In many cases this automatic inference is sufficient. For the remaining cases you have to declare the correct scientific type using coerce or coerce!. The latter variant can only be applied to tabular data. For all other data, coerce creates a copy (bearing the added information about the correct scientific type).
- A typical example where automatic inference is not sufficient is categorical data represented by numbers. E.g. in Germany pupils get grades from 1 (= very good) to 6 (= insufficient). So a list of grades consists only of numbers (integers), which ScientificTypes would interpret as being of type Count. Only with the aforementioned domain knowledge does it become clear that the correct scientific type is OrderedFactor (which has to be set explicitly using coerce).
- Coercing to a Finite type (Multiclass or OrderedFactor) means that the data concerned is automatically converted to CategoricalValues (and an array containing such values is converted to a CategoricalArray).
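The grades example from the list above can be sketched as follows, assuming ScientificTypes is loaded directly (the data is made up):

```julia
using ScientificTypes

grades = [1, 3, 2, 6, 1]

# Inferred from the Julia type alone: integers are read as Count.
inferred = scitype(grades)

# Domain knowledge says these are ordered categories, so coerce.
# coerce returns a new (categorical) vector; grades itself is unchanged.
grades_cat = coerce(grades, OrderedFactor)
coerced = scitype(grades_cat)
```

After coercion, `coerced` is a subtype of `AbstractVector{<:OrderedFactor}`, and the elements of `grades_cat` can be compared with `<` according to the level order.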

Learning models outside and inside of MLJ:

- A model that works on categorical data isn't restricted to the use of CategoricalValues or CategoricalArrays when used with its native interface (i.e. outside of MLJ). I.e. it may be implemented so that it can process e.g. arrays of String.
- Registering a model for use within MLJ implies (among other things) that it specifies which scientific type it accepts for the features and for the target. A model working on categorical data will therefore specify that this data has to be a subtype of Finite.
- The use of data of the specified scientific type will be enforced by MLJ. So this isn't just information for the user of the model. This implies in particular that a model that is able to process categorical data outside of MLJ, e.g. in the form of arrays of String (which are of scientific type Textual), won't accept this data when used via its MLJ interface.
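The mismatch described in the last point can be illustrated with a plain subtype comparison. This is only a sketch of the kind of check involved, using ScientificTypes directly; how MLJ itself reports a mismatch is not shown here.

```julia
using ScientificTypes

# Scitype a categorical model would typically declare for a vector.
declared = AbstractVector{<:Finite}

raw = ["sunny", "rainy", "sunny"]
raw_ok = scitype(raw) <: declared          # strings are Textual, so no match

raw_cat = coerce(raw, Multiclass)
cat_ok = scitype(raw_cat) <: declared      # Multiclass is a Finite type
```

So the same labels are rejected as an array of String and accepted once coerced to Multiclass.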

ablaom · 2022-03-31T20:40:38Z

@roland-KA Thanks indeed for taking the time to document your experience!

I think the synopsis is generally correct. I wouldn't say that scitype tries to "guess" the scientific type of data. Rather, it associates a scientific type with each Julia type according to a specific convention that was decided upon mostly by matching common usage, but which will not match usage in all cases. (And a developer can in principle implement a different convention using ScientificTypesBase.jl.)

As you correctly explain, when you implement an interface for an MLJ model, the interface must make type adjustments to account for any mismatch between what a model expects and conceptualises as a "multiclass vector", say, and what objects actually have AbstractVector{<:Multiclass} as scitype under the convention. If the core model expects Vector{<:String}, say (whose instances have scitype AbstractVector{Textual}), then the interface, having declared the expected scitype (eg target_scitype) to be AbstractVector{<:Multiclass}, will need to convert the data received by MLJBase.fit and MLJBase.predict (typically an unordered categorical vector) into the Vector{String} type the internal model requires. (An exception occurs if an implementation overloads an additional "data front end".)
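A minimal sketch of that conversion, under stated assumptions: `core_fit`, `interface_fit` and `interface_predict` are hypothetical names standing in for a string-based core model and its interface layer, and are not the MLJBase methods themselves. Only CategoricalArrays is used (assuming a version that exports `unwrap`).

```julia
using CategoricalArrays

# Hypothetical core routine that only understands plain strings:
# it returns the most frequent label seen in training.
function core_fit(y::Vector{String})
    counts = Dict{String,Int}()
    for label in y
        counts[label] = get(counts, label, 0) + 1
    end
    return findmax(counts)[2]   # key with the largest count
end

# Interface-side sketch: unwrap on the way in ...
function interface_fit(y::CategoricalVector)
    pool = unwrap.(levels(y))          # complete class pool as raw strings
    majority = core_fit(unwrap.(y))    # the core model sees a Vector{String}
    return (majority = majority, pool = pool)
end

# ... and re-wrap on the way out, keeping the full class pool.
interface_predict(fitresult, n::Integer) =
    categorical(fill(fitresult.majority, n); levels = fitresult.pool)

y = categorical(["no", "yes", "yes", "no", "yes"])
fitresult = interface_fit(y)
yhat = interface_predict(fitresult, 3)
```

The predictions come back as a categorical vector whose levels match those of the training target, even though the core routine never saw a categorical value.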

In this way, scientific types are simply a way to: (i) enforce uniformity in the data types that MLJ users present to their models, and (ii) allow the user to focus on the purpose (scientific type) of data rather than on the specific machine representation.

roland-KA · 2022-04-01T11:55:42Z

Thanks for your clarifying feedback! Step by step I'm working towards a full understanding of the concepts 🤓.

roland-KA closed this as completed Mar 17, 2022

roland-KA mentioned this issue Mar 21, 2022

Feedback on MLJ integration roland-KA/OneRule.jl#2

Open
