Proposal for new Model docstrings standard #901

Closed
ablaom opened this issue Feb 11, 2022 · 7 comments

@ablaom
Member

ablaom commented Feb 11, 2022

Further to #898, feedback is invited on a proposed standard for detailed Model document strings. A key element of the standard is a requirement to detail in plain English the scitype requirements for training data, a known stumbling block for beginners.

Tl;dr

At present there is no strict standard and many models do not have a Julia docstring at all - only a brief description in the docstring trait, pointing users to docs in the algorithm-providing package. In the near future this trait will fall back to the Julia docstring, making it available in the searchable Model Registry.

I've opened a PR revising the DecisionTree.jl docstrings, which gives you the main idea. The latest version will be available here:

At the end is a static version which may not be updated after feedback.

A decision about the standard is planned for February 25th (two weeks). Rolling this out will be a substantial commitment of resources, so we are keen to get this right before proceeding.

cc @OkonSamuel @sjvollmer @tlienart @rikhuijzer @ExpandingMan @bkamins @storopoli @DilumAluthge @jbrea @olivierlabayle @darenasc @fkiraly @alanedelman

The overall structure

The main ingredients of the string are the following sections:

  • a header (which can be autogenerated) explaining:
    • what package provides the core algorithm
    • how to import the type from MLJ (recommended, because the user can then inspect the docstring in the Model Registry before loading the code-providing package)
    • how to instantiate with default hyper-parameters or with keywords
  • Training data section: Explains how to bind model to data in a machine with all possible signatures (eg, machine(model, X, y) but also machine(model, X, y, w) if, say, weights are supported) with the role and scitype requirements for each data argument itemized. Also, how to fit the machine.
  • Hyper-parameters section: itemized with defaults given
  • Operations section: each operation (predict, predict_mode, transform, inverse_transform, etc.) itemized and explained. This should include operations with no data arguments, such as training_losses and feature_importance, if implemented.
  • Fitted parameters section: To explain what is returned by fitted_params(mach) (fields itemized)
  • Report section (if report is non-empty): To explain what, if anything, is included in the report(mach) (fields itemized)
  • A closing See also sentence which includes a @ref link to the raw model type (if wrapped)

A static version of the DecisionTreeClassifier docstring follows.

DecisionTreeClassifier

Model type for CART decision tree classifier, based on DecisionTree.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree

Do model = DecisionTreeClassifier() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in DecisionTreeClassifier(max_depth=...).

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

where

  • X: any table of input features (eg, a DataFrame) whose columns each have one of the following element scitypes: Continuous, Count, or <:OrderedFactor.
  • y: the target, which can be any AbstractVector whose element scitype is <:OrderedFactor or <:Multiclass.

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • max_depth=-1: max depth of the decision tree (-1=any)
  • min_samples_leaf=1: min number of samples each leaf needs to have
  • min_samples_split=2: min number of samples needed for a split
  • min_purity_increase=0: min purity needed for a split
  • n_subfeatures=0: number of features to select at random (0 for all, -1 for square root of number of features)
  • post_prune=false: set to true for post-fit pruning
  • merge_purity_threshold=1.0: (post-pruning) merge leaves having combined purity >= merge_purity_threshold
  • display_depth=5: max depth to show when displaying the tree
  • rng=Random.GLOBAL_RNG: random number generator or seed
  • pdf_smoothing=0.0: threshold for smoothing the predicted scores. Raw leaf-based probabilities are smoothed as follows: If n is the number of observed classes, then each class probability is replaced by pdf_smoothing/n, if it falls below that ratio, and the resulting vector of probabilities is renormalized. Smoothing is only applied to classes actually observed in training. Unseen classes retain zero-probability predictions.
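The smoothing rule described for pdf_smoothing above can be sketched as follows. This is a minimal illustration of the described arithmetic only (written in Python for brevity, with a hypothetical helper name); it is not the actual DecisionTree.jl or MLJ implementation:

```python
def smooth_probs(probs, pdf_smoothing):
    """Illustrative sketch: raise each class probability to at least
    pdf_smoothing/n (n = number of observed classes), then renormalize
    so the vector sums to one. Per the docstring, this would apply only
    to classes actually observed in training."""
    n = len(probs)
    floor = pdf_smoothing / n
    clipped = [max(p, floor) for p in probs]
    total = sum(clipped)
    return [p / total for p in clipped]

# With pdf_smoothing=0.3 and three observed classes, the floor is 0.1,
# so a raw prediction of [1.0, 0.0, 0.0] is clipped to [1.0, 0.1, 0.1]
# and then renormalized by the new total of 1.2.
print(smooth_probs([1.0, 0.0, 0.0], 0.3))
```

With pdf_smoothing=0.0 (the default) the floor is zero and predictions pass through unchanged.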

Operations

  • predict(mach, Xnew): return predictions of the target given features Xnew having the same scitype as X above. Predictions are probabilistic, but uncalibrated.
  • predict_mode(mach, Xnew): instead return the mode of each prediction above.

Fitted parameters

The fields of fitted_params(mach) are:

  • tree: the tree or stump object returned by the core DecisionTree.jl algorithm
  • encoding: dictionary of target classes keyed on integers used internally by DecisionTree.jl; needed to interpret pretty printing of tree (obtained by calling fit!(mach, verbosity=2) or from report - see below)

Report

The fields of report(mach) are:

  • classes_seen: list of target classes actually observed in training
  • print_tree: method to print a pretty representation of the fitted tree, whose single argument is the tree depth; interpretation requires the internal integer-class encoding (see "Fitted parameters" above).

Examples

using MLJ
Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree(max_depth=4, min_samples_split=3)

X, y = @load_iris
mach = machine(tree, X, y) |> fit!

Xnew = (sepal_length = [6.4, 7.2, 7.4],
        sepal_width = [2.8, 3.0, 2.8],
        petal_length = [5.6, 5.8, 6.1],
        petal_width = [2.1, 1.6, 1.9],)
yhat = predict(mach, Xnew) # probabilistic predictions
predict_mode(mach, Xnew)   # point predictions
pdf.(yhat, "virginica")    # probabilities for the "virginica" class

fitted_params(mach).tree # raw tree or stump object from DecisionTree.jl

julia> report(mach).print_tree(3)
Feature 4, Threshold 0.8
L-> 1 : 50/50
R-> Feature 4, Threshold 1.75
    L-> Feature 3, Threshold 4.95
        L->
        R->
    R-> Feature 3, Threshold 4.85
        L->
        R-> 3 : 43/43

To interpret the internal class labelling:

julia> fitted_params(mach).encoding
Dict{CategoricalArrays.CategoricalValue{String, UInt32}, UInt32} with 3 entries:
  "virginica"  => 0x00000003
  "setosa"     => 0x00000001
  "versicolor" => 0x00000002

See also DecisionTree.jl and the unwrapped model type MLJDecisionTreeInterface.DecisionTree.DecisionTreeClassifier.

@bkamins

bkamins commented Feb 11, 2022

This is a super useful proposal. My two comments based on the review of the DecisionTree.jl documentation and the proposed docstring:

  • I would propose to explicitly state in the docstring whether missing values are accepted (currently it is not clear - e.g. my understanding is that missing is accepted in the target but not in the features; but maybe the fact that missing is accepted in the target in DecisionTree.jl is a bug)
  • the docstring is inconsistent with the https://github.com/bensadeghi/DecisionTree.jl readme (which gives a different list of accepted feature types for classification and for regression - but this also might be a bug in the DecisionTree.jl documentation)

@ablaom
Member Author

ablaom commented Feb 12, 2022

Feedback from a slack channel: "I love the Training data and Operations sections. That's exactly what I've been looking for many times before."

@jbrea
Contributor

jbrea commented Feb 14, 2022

Great proposal. The two complaints I heard most from my students are, first, "I don't know how to debug this. I don't understand the error message" and, second, "I don't know how to do XY. I didn't find anything useful in the docs." Somebody also told me that he likes Python, "because you can easily go from the idea in your head to some code snippets you find in the docs or online and adapt them to your needs." (see e.g. here for a random example of a pretty detailed docstring with examples.) Improved docstrings would definitely help newcomers.

A suggestion: I would add an example section to the new docstring structure. I added pretty basic examples when I tried to improve the docstrings of MLJLinearModels (see e.g. here). While doing that I started to appreciate DocStringExtensions.$TYPEDFIELDS: I like that I can write the comments in the struct definition and have them automatically pulled into the docstring. I don't know if this is possible across the whole MLJ ecosystem, however.

@ablaom
Member Author

ablaom commented Feb 14, 2022

Great feedback, thanks!

A suggestion: I would add an example section in the new docstring structure.

Sounds good to me.

While doing that I started to appreciate DocStringExtensions.$TYPEDFIELDS:

Will definitely look into this.

@ablaom
Member Author

ablaom commented Feb 14, 2022

One thing I don't like about my proposal is the reference to machines. Strictly speaking, a subtype of Model only has a contract to implement the MLJ model interface (MLJModelInterface.jl). For example, a supervised model typically implements:

MLJModelInterface.fit(model, verbosity, X, y) -> fitresult, cache, report
MLJModelInterface.predict(model, fitresult, Xnew) -> predictions

and a bunch of traits. Machines don't come into this lower-level interface. The general user is expected to interact using the machine interface, so we do want the user to have this information, but it seems it should be separated out somehow.

This issue is particularly conspicuous when the model struct used by MLJ is the original unwrapped model struct defined in the algorithm-providing package (eg, EvoTrees.jl). In that case the package may very reasonably object to the inclusion of MLJ-specific help in the docstring. (I can use EvoTrees models using the EvoTrees API, ie, with no machines.)

I also believe that at least one eco-system (GeoStats) hooks directly into the MLJ model interface (no machines) to provide ML models to their users, and again framing the doc-string in terms of machines is not really appropriate.

@jbrea
Contributor

jbrea commented Feb 15, 2022

One thing I don't like about my proposal is the reference to machines.

Good point. What about replacing the "Training data" section with an examples section? Then there could be examples showing how to use the model without MLJ and, if desired, other examples that start with using MLJ and explain how to use the model with a machine. The examples section doesn't need to be standardized, I think, but could be tailored to each model/package. The description of the fit function would naturally fit into the "Operations" section, where the requirements on X and y could also be explained.

@ablaom
Member Author

ablaom commented Mar 1, 2022

Okay, based on the feedback received and considerable thought, I am settling on more-or-less the original proposal, plus an "Examples" section (optional but strongly recommended).

Documentation of the new standard is here, which will become live when we have a few more exemplars.
