Proposal for new `Model` docstrings standard #901

Comments
This is a super useful proposal. My two comments, based on a review of the DecisionTree.jl documentation and the proposed docstring:

Feedback from a Slack channel: "I love the Training data and Operations sections. That's exactly what I've been looking for many times before."
Great proposal. The two complaints I heard most from my students are, first, "I don't know how to debug this. I don't understand the error message," and, second, "I don't know how to do XY. I didn't find anything useful in the docs." Somebody also told me that he likes Python "because you can easily go from the idea in your head to some code snippets you find in the docs or online and adapt them to your needs" (see e.g. here for a random example of a pretty detailed docstring with examples). Improved docstrings would definitely help newcomers. A suggestion: I would add an example section to the new docstring structure. I added pretty basic examples when I tried to improve the docstrings of MLJLinearModels (see e.g. here). While doing that I started to appreciate
Great feedback, thanks!
Sounds good to me.
Will definitely look into this.
One thing I don't like about my proposal is the reference to machines. Strictly speaking, a subtype of `Model` need only implement `MLJModelInterface.fit(model, verbosity, X, y) -> fitresult, cache, report`, `MLJModelInterface.predict(model, fitresult, Xnew) -> predictions`, and a bunch of traits. Machines don't come into this lower-level interface. The general user is expected to interact using the machine interface, so we do want the user to have this information, but it seems it should be separated out somehow. This issue is particularly conspicuous when the model struct used by MLJ is the original unwrapped model struct defined in the algorithm-providing package (eg, EvoTrees.jl). In that case the package may very reasonably object to the inclusion of MLJ-specific help in the doc-string. (I can use EvoTrees models via the EvoTrees API, ie with no machines.) I also believe that at least one ecosystem (GeoStats) hooks directly into the MLJ model interface (no machines) to provide ML models to its users, and again framing the doc-string in terms of machines is not really appropriate.
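For concreteness, a sketch of that lower-level usage, with no machines involved. This assumes MLJModelInterface, MLJDecisionTreeInterface and CategoricalArrays are installed; the toy data, and passing a plain column table directly to `fit`, are illustrative assumptions, not part of the proposal:

```julia
import MLJModelInterface as MMI
using MLJDecisionTreeInterface: DecisionTreeClassifier
using CategoricalArrays: categorical

# Toy column table (a NamedTuple of vectors is a Tables.jl table)
# and a categorical target -- purely illustrative data.
X = (x1 = rand(30), x2 = rand(30))
y = categorical(rand(["a", "b"], 30))

model = DecisionTreeClassifier(max_depth=2)

# The two methods quoted above -- no machines involved:
fitresult, cache, report = MMI.fit(model, 0, X, y)  # verbosity = 0
yhat = MMI.predict(model, fitresult, X)             # probabilistic predictions
```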
Good point. What about replacing the "training data section" with an examples section? Then there could be examples of how to use the model without MLJ and, if desired, other examples that start with
Okay, based on the feedback received and considerable thought, I am settling on more-or-less the original proposal, plus an "Examples" section (optional but strongly recommended). Documentation of the new standard is here, which will become live when we have a few more exemplars.
Further to #898, feedback is invited on a proposed standard for detailed `Model` document strings. A key element of the standard is a requirement to detail in plain English the scitype requirements for training data, which is a known stumbling block for beginners.

Tl;dr
At present there is no strict standard and many models do not have a Julia doc string at all - only a brief description in the `docstring` trait, pointing users to docs in the algorithm-providing package. In the near future this trait will fall back to the Julia docstring, making it available in the searchable Model Registry.

I've opened a PR revising the DecisionTree.jl docstrings, which gives you the main idea. The latest version will be available here:
At the end is a static version which may not be updated after feedback.
A decision about the standard is planned for February 25th (two weeks). Rolling this out will be a substantial commitment of resources, so we are keen to get this right before proceeding.
cc @OkonSamuel @sjvollmer @tlienart @rikhuijzer @ExpandingMan @bkamins @storopoli @DilumAluthge @jbrea @olivierlabayle @darenasc @fkiraly @alanedelman
The overall structure

The main ingredients of the string are the following sections:

- How to bind the model to data in a machine (`machine(model, X, y)`, but also `machine(model, X, y, w)` if, say, weights are supported), with the role and scitype requirements for each data argument itemized. Also, how to fit the machine.
- The operations (`predict`, `predict_mode`, `transform`, `inverse_transform`, etc.) itemized and explained. This should include operations with no data arguments, such as `training_losses` and `feature_importance`, if implemented.
- `fitted_params(mach)` (fields itemized)
- `report(mach)` (fields itemized)
- `@ref` link to the raw model type (if wrapped)

Click below to see static DecisionTreeClassifier docstring
Model type for CART decision tree classifier, based on DecisionTree.jl, and implementing the MLJ model interface.
From MLJ, the type can be imported using
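presumably via MLJ's `@load` macro, along these lines (the exact registry incantation was not captured above, so this is an assumption):

```julia
using MLJ
# assumed registry name and providing package:
DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree
```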
Do `model = DecisionTreeClassifier()` to construct an instance with default hyper-parameters. Provide keyword arguments to override hyper-parameter defaults, as in `DecisionTreeClassifier(max_depth=...)`.

Training data
In MLJ or MLJBase, bind an instance `model` to data with `mach = machine(model, X, y)`, where

- `X`: any table of input features (eg, a `DataFrame`) whose columns each have one of the following element scitypes: `Continuous`, `Count`, or `<:OrderedFactor`
- `y`: the target, which can be any `AbstractVector` whose element scitype is `<:OrderedFactor` or `<:Multiclass`

Train the machine using `fit!(mach, rows=...)`
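As an illustration of this section, a hedged end-to-end sketch (not part of the proposed docstring): it assumes MLJ and MLJDecisionTreeInterface are installed, and uses the iris dataset purely as an example:

```julia
using MLJ

X, y = @load_iris                 # Continuous features, Multiclass target
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
model = Tree(max_depth=3)

mach = machine(model, X, y)
fit!(mach, rows=1:100)            # train on the first 100 rows only

probs = predict(mach, X)          # probabilistic predictions
labels = predict_mode(mach, X)    # point predictions (modes)
```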
.

Hyper-parameters
- `max_depth=-1`: max depth of the decision tree (-1 = any)
- `min_samples_leaf=1`: min number of samples each leaf needs to have
- `min_samples_split=2`: min number of samples needed for a split
- `min_purity_increase=0`: min purity increase needed for a split
- `n_subfeatures=0`: number of features to select at random (0 for all, -1 for square root of number of features)
- `post_prune=false`: set to `true` for post-fit pruning
- `merge_purity_threshold=1.0`: (post-pruning) merge leaves having combined purity `>= merge_purity_threshold`
- `display_depth=5`: max depth to show when displaying the tree
- `rng=Random.GLOBAL_RNG`: random number generator or seed
- `pdf_smoothing=0.0`: threshold for smoothing the predicted scores. Raw leaf-based probabilities are smoothed as follows: if `n` is the number of observed classes, then each class probability is replaced by `pdf_smoothing/n` if it falls below that ratio, and the resulting vector of probabilities is renormalized. Smoothing is only applied to classes actually observed in training; unseen classes retain zero-probability predictions.

Operations
- `predict(mach, Xnew)`: return predictions of the target given features `Xnew` having the same scitype as `X` above. Predictions are probabilistic, but uncalibrated.
- `predict_mode(mach, Xnew)`: instead return the mode of each prediction above.

Fitted parameters
The fields of `fitted_params(mach)` are:

- `tree`: the tree or stump object returned by the core DecisionTree.jl algorithm
- `encoding`: dictionary of target classes keyed on integers used internally by DecisionTree.jl; needed to interpret pretty printing of the tree (obtained by calling `fit!(mach, verbosity=2)`, or from the report - see below)

Report
The fields of `report(mach)` are:

- `classes_seen`: list of target classes actually observed in training
- `print_tree`: method to print a pretty representation of the fitted tree, with single argument the tree depth; interpretation requires the internal integer-class encoding (see "Fitted parameters" above).

Examples
To interpret the internal class labelling:
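A sketch of what such an example might look like, using only the `fitted_params` and `report` fields documented above (the dataset and display depth are illustrative; it assumes MLJ and MLJDecisionTreeInterface are installed):

```julia
using MLJ

X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
mach = machine(Tree(), X, y)
fit!(mach)

# Dict mapping the internal integer codes to target classes,
# needed to read the pretty-printed tree:
fitted_params(mach).encoding

# pretty-print the fitted tree to depth 3 (field name per the docstring above):
report(mach).print_tree(3)
```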
See also DecisionTree.jl and the unwrapped model type `MLJDecisionTreeInterface.DecisionTree.DecisionTreeClassifier`.