Proposal for new Model docstrings standard #901

Closed
ablaom opened this issue Feb 11, 2022 · 7 comments

@ablaom
Member

ablaom commented Feb 11, 2022

Further to #898, feedback is invited on a proposed standard for detailed Model document strings. A key element of the standard is a requirement to detail in plain English the scitype requirements for training data, a known stumbling block for beginners.

Tl;dr

At present there is no strict standard and many models do not have a Julia docstring at all - only a brief description in the docstring trait, pointing users to docs in the algorithm-providing package. In the near future this trait will fall back to the Julia docstring, making it available in the searchable Model Registry.

I've opened a PR revising the DecisionTree.jl docstrings, which gives you the main idea. The latest version will be available here:

At the end is a static version which may not be updated after feedback.

A decision about the standard is planned for February 25th (two weeks). Rolling this out will be a substantial commitment of resources, so we are keen to get this right before proceeding.

cc @OkonSamuel @sjvollmer @tlienart @rikhuijzer @ExpandingMan @bkamins @storopoli @DilumAluthge @jbrea @olivierlabayle @darenasc @fkiraly @alanedelman

The overall structure

The main ingredients of the string are the following sections:

  • a header (which can be autogenerated) explaining:
    • what package provides the core algorithm
    • how to import the type from MLJ (recommended, because the user can then inspect the docstring in the Model Registry before loading the code-providing package)
    • how to instantiate with default hyper-parameters or with keywords
  • Training data section: Explains how to bind model to data in a machine with all possible signatures (eg, machine(model, X, y) but also machine(model, X, y, w) if, say, weights are supported) with the role and scitype requirements for each data argument itemized. Also, how to fit the machine.
  • Hyper-parameters section: itemized with defaults given
  • Operations section: each operation (predict, predict_mode, transform, inverse_transform, etc.) itemized and explained. This should include operations with no data arguments, such as training_losses and feature_importance, if implemented.
  • Fitted parameters section: To explain what is returned by fitted_params(mach) (fields itemized)
  • Report section (if report is non-empty): To explain what, if anything, is included in the report(mach) (fields itemized)
  • A closing See also sentence which includes a @ref link to the raw model type (if wrapped)

A static version of the DecisionTreeClassifier docstring follows.

DecisionTreeClassifier

Model type for CART decision tree classifier, based on DecisionTree.jl, and implementing the MLJ model interface.

From MLJ, the type can be imported using

DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree

Do model = DecisionTreeClassifier() to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in DecisionTreeClassifier(max_depth=...).

Training data

In MLJ or MLJBase, bind an instance model to data with

mach = machine(model, X, y)

where

  • X: any table of input features (eg, a DataFrame) whose columns each have one of the following element scitypes: Continuous, Count, or <:OrderedFactor.
  • y: the target, which can be any AbstractVector whose element scitype is <:OrderedFactor or <:Multiclass.

Train the machine using fit!(mach, rows=...).

Hyper-parameters

  • max_depth=-1: max depth of the decision tree (-1=any)
  • min_samples_leaf=1: min number of samples each leaf needs to have
  • min_samples_split=2: min number of samples needed for a split
  • min_purity_increase=0: min purity needed for a split
  • n_subfeatures=0: number of features to select at random (0 for all, -1 for square root of number of features)
  • post_prune=false: set to true for post-fit pruning
  • merge_purity_threshold=1.0: (post-pruning) merge leaves having combined purity >= merge_purity_threshold
  • display_depth=5: max depth to show when displaying the tree
  • rng=Random.GLOBAL_RNG: random number generator or seed
  • pdf_smoothing=0.0: threshold for smoothing the predicted scores. Raw leaf-based probabilities are smoothed as follows: If n is the number of observed classes, then each class probability is replaced by pdf_smoothing/n, if it falls below that ratio, and the resulting vector of probabilities is renormalized. Smoothing is only applied to classes actually observed in training. Unseen classes retain zero-probability predictions.
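The smoothing rule described for pdf_smoothing above can be sketched as follows. This is a minimal illustration of the described arithmetic only (written in Python for brevity, with a hypothetical helper name); it is not the actual DecisionTree.jl or MLJ implementation:

```python
def smooth_probs(probs, pdf_smoothing):
    """Illustrative sketch: raise each class probability to at least
    pdf_smoothing/n (n = number of observed classes), then renormalize
    so the vector sums to one. Per the docstring, this would apply only
    to classes actually observed in training."""
    n = len(probs)
    floor = pdf_smoothing / n
    clipped = [max(p, floor) for p in probs]
    total = sum(clipped)
    return [p / total for p in clipped]

# With pdf_smoothing=0.3 and three observed classes, the floor is 0.1,
# so a raw prediction of [1.0, 0.0, 0.0] is clipped to [1.0, 0.1, 0.1]
# and then renormalized by the new total of 1.2.
print(smooth_probs([1.0, 0.0, 0.0], 0.3))
```

With pdf_smoothing=0.0 (the default) the floor is zero and predictions pass through unchanged.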

Operations

  • predict(mach, Xnew): return predictions of the target given features Xnew having the same scitype as X above. Predictions are probabilistic, but uncalibrated.
  • predict_mode(mach, Xnew): instead return the mode of each prediction above.

Fitted parameters

The fields of fitted_params(mach) are:

  • tree: the tree or stump object returned by the core DecisionTree.jl algorithm
  • encoding: dictionary of target classes keyed on integers used internally by DecisionTree.jl; needed to interpret pretty printing of tree (obtained by calling fit!(mach, verbosity=2) or from report - see below)

Report

The fields of report(mach) are:

  • classes_seen: list of target classes actually observed in training
  • print_tree: method to print a pretty representation of the fitted tree, whose single argument is the tree depth; interpretation requires the internal integer-class encoding (see "Fitted parameters" above).

Examples

using MLJ
Tree = @load DecisionTreeClassifier pkg=DecisionTree
tree = Tree(max_depth=4, min_samples_split=3)

X, y = @load_iris
mach = machine(tree, X, y) |> fit!

Xnew = (sepal_length = [6.4, 7.2, 7.4],
        sepal_width = [2.8, 3.0, 2.8],
        petal_length = [5.6, 5.8, 6.1],
        petal_width = [2.1, 1.6, 1.9],)
yhat = predict(mach, Xnew) # probabilistic predictions
predict_mode(mach, Xnew)   # point predictions
pdf.(yhat, "virginica")    # probabilities for the "virginica" class

fitted_params(mach).tree # raw tree or stump object from DecisionTree.jl

julia> report(mach).print_tree(3)
Feature 4, Threshold 0.8
L-> 1 : 50/50
R-> Feature 4, Threshold 1.75
    L-> Feature 3, Threshold 4.95
        L->
        R->
    R-> Feature 3, Threshold 4.85
        L->
        R-> 3 : 43/43

To interpret the internal class labelling:

julia> fitted_params(mach).encoding
Dict{CategoricalArrays.CategoricalValue{String, UInt32}, UInt32} with 3 entries:
  "virginica"  => 0x00000003
  "setosa"     => 0x00000001
  "versicolor" => 0x00000002

See also DecisionTree.jl and the unwrapped model type MLJDecisionTreeInterface.DecisionTree.DecisionTreeClassifier.

@bkamins

bkamins commented Feb 11, 2022

This is a super useful proposal. My two comments based on the review of the DecisionTree.jl documentation and the proposed docstring:

  • I would propose to explicitly state in the docstring whether missing values are accepted (currently it is not clear - e.g. my understanding is that missing is accepted in the target but not in the features; but maybe the fact that missing is accepted in the target in DecisionTree.jl is a bug)
  • the docstring is inconsistent with the https://github.com/bensadeghi/DecisionTree.jl readme (which gives a different list of accepted feature types for classification and for regression - but this also might be a bug in the DecisionTree.jl documentation)

@ablaom
Member Author

ablaom commented Feb 12, 2022

Feedback from a slack channel: "I love the Training data and Operations sections. That's exactly what I've been looking for many times before."

@jbrea
Contributor

jbrea commented Feb 14, 2022

Great proposal. The two complaints I heard most from my students are, first, "I don't know how to debug this. I don't understand the error message" and, second, "I don't know how to do XY. I didn't find anything useful in the docs." Somebody also told me that he likes Python, "because you can easily go from the idea in your head to some code snippets you find in the docs or online and adapt them to your needs." (see e.g. here for a random example of a pretty detailed docstring with examples.) Improved docstrings would definitely help newcomers.

A suggestion: I would add an example section to the new docstring structure. I added pretty basic examples when I tried to improve the docstrings of MLJLinearModels (see e.g. here). While doing that I started to appreciate DocStringExtensions.$TYPEDFIELDS: I like that I can write the comments in the struct definition and have them automatically pulled into the docstring. I don't know if this is possible across the whole MLJ ecosystem, however.

@ablaom
Member Author

ablaom commented Feb 14, 2022

Great feedback, thanks!

A suggestion: I would add an example section in the new docstring structure.

Sounds good to me.

While doing that I started to appreciate DocStringExtensions.$TYPEDFIELDS:

Will definitely look into this.

@ablaom
Member Author

ablaom commented Feb 14, 2022

One thing I don't like about my proposal is the reference to machines. Strictly speaking, a subtype of Model only has a contract to implement the MLJ model interface (MLJModelInterface.jl). For example, a supervised model typically implements:

MLJModelInterface.fit(model, verbosity, X, y) -> fitresult, cache, report
MLJModelInterface.predict(model, fitresult, Xnew) -> predictions

and a bunch of traits. Machines don't come into this lower-level interface. The general user is expected to interact using the machine interface, so we do want the user to have this information, but it seems it should be separated out somehow.

This issue is particularly conspicuous when the model struct used by MLJ is the original unwrapped model struct defined in the algorithm-providing package (eg, EvoTrees.jl). In that case the package may very reasonably object to the inclusion of MLJ-specific help in the docstring. (I can use EvoTrees models using the EvoTrees API, ie, with no machines.)

I also believe that at least one eco-system (GeoStats) hooks directly into the MLJ model interface (no machines) to provide ML models to their users, and again framing the doc-string in terms of machines is not really appropriate.

@jbrea
Contributor

jbrea commented Feb 15, 2022

One thing I don't like about my proposal is the reference to machines.

Good point. What about replacing the "Training data" section with an examples section? Then there could be examples showing how to use the model without MLJ and, if desired, other examples that start with using MLJ and explain how to use the model with a machine. The examples section doesn't need to be standardized, I think, but could be tailored to each model/package. The description of the fit function would naturally fit into the "Operations" section, where the requirements on X and y could also be explained.

@ablaom
Member Author

ablaom commented Mar 1, 2022

Okay, based on the feedback received and considerable thought, I am settling on more-or-less the original proposal, plus an "Examples" section (optional but strongly recommended).

Documentation of the new standard is here, which will become live when we have a few more exemplars.
