To load some demonstration data, add RDatasets to your load path and enter

```julia
using RDatasets
iris = dataset("datasets", "iris"); # a DataFrame
```
and then split the data into input and target parts:

```julia
using MLJ
y, X = unpack(iris, ==(:Species), colname -> true);
first(X, 3) |> pretty
```
To list all models available in MLJ's model registry:

```julia
models()
```
In MLJ a model is a struct storing the hyperparameters of the learning algorithm indicated by the struct name.
Assuming the DecisionTree.jl package is in your load path, we can use `@load` to load the code defining the `DecisionTreeClassifier` model type. This macro also returns an instance, with default hyperparameters. Drop the `verbosity=1` declaration for silent loading:

```julia
tree_model = @load DecisionTreeClassifier verbosity=1
```
Important: DecisionTree.jl and most other packages implementing machine learning algorithms for use in MLJ are not MLJ dependencies. If such a package is not in your load path you will receive an error explaining how to add the package to your current environment.
Once loaded, a model can be evaluated with the `evaluate` method:

```julia
evaluate(tree_model, X, y,
         resampling=CV(shuffle=true), measure=cross_entropy, verbosity=0)
```
Evaluating against multiple performance measures is also possible. See Evaluating Model Performance for details.
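For instance, a sketch of evaluating against two probabilistic measures at once (this assumes `BrierScore` is exported by your MLJ version; measure names have varied between releases):

```julia
# Passing a vector of measures evaluates each one over the same resampling folds:
evaluate(tree_model, X, y,
         resampling=CV(shuffle=true),
         measure=[cross_entropy, BrierScore()],
         verbosity=0)
```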
To illustrate MLJ's fit and predict interface, let's perform the above evaluations by hand.

Wrapping the model in data creates a machine which will store training outcomes:

```julia
tree = machine(tree_model, X, y)
```
Training and testing on a hold-out set:

```julia
train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split
fit!(tree, rows=train);
yhat = predict(tree, X[test,:]);
yhat[3:5]
cross_entropy(yhat, y[test]) |> mean
```
Notice that `yhat` is a vector of `Distribution` objects (because `DecisionTreeClassifier` makes probabilistic predictions). The methods of the Distributions package can be applied to such distributions:

```julia
broadcast(pdf, yhat[3:5], "virginica") # predicted probabilities of virginica
mode.(yhat[3:5])
```
Or, one can explicitly get modes by using `predict_mode` instead of `predict`:

```julia
predict_mode(tree, rows=test[3:5])
```
Unsupervised models have a `transform` method instead of `predict`, and may optionally implement an `inverse_transform` method:

```julia
v = [1, 2, 3, 4]
stand_model = UnivariateStandardizer()
stand = machine(stand_model, v)
fit!(stand)
w = transform(stand, v)
inverse_transform(stand, w)
```
Machines have an internal state which allows them to avoid redundant calculations when retrained, in certain conditions - for example when increasing the number of trees in a random forest, or the number of epochs in a neural network. The machine building syntax also anticipates a more general syntax for composing multiple models, as explained in Composing Models.
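As a sketch of this warm-restart behaviour (assuming `EnsembleModel` with its `atom` keyword, as exported by MLJ at the time of writing):

```julia
# Hypothetical illustration: a homogeneous ensemble of decision trees.
forest_model = EnsembleModel(atom=tree_model, n=10)
forest = machine(forest_model, X, y)
fit!(forest, rows=train)   # trains 10 trees
forest_model.n = 20
fit!(forest, rows=train)   # trains only the 10 additional trees, reusing the rest
```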
There is a version of `evaluate` for machines as well as models. An exclamation point is added to the method name because machines are generally mutated when trained:

```julia
evaluate!(tree, resampling=Holdout(fraction_train=0.5, shuffle=true),
          measure=cross_entropy,
          verbosity=0)
```
Changing a hyperparameter and re-evaluating:

```julia
tree_model.max_depth = 3
evaluate!(tree, resampling=Holdout(fraction_train=0.5, shuffle=true),
          measure=cross_entropy,
          verbosity=0)
```
To learn a little more about what MLJ can do, browse Common MLJ Workflows or MLJ's tutorials, returning to the manual as needed. Read at least the remainder of this page before considering serious use of MLJ.
MLJ assumes some familiarity with the `CategoricalValue` and `CategoricalString` types from CategoricalArrays.jl, used here for representing categorical data. For probabilistic predictors, a basic acquaintance with Distributions.jl is also assumed.
The MLJ user should acquaint themselves with some basic assumptions about the form of data expected by MLJ, as outlined below.
```julia
machine(model::Supervised, X, y)
machine(model::Unsupervised, X)
```
Each supervised model in MLJ declares the permitted scientific type of the inputs `X` and targets `y` that can be bound to it in the first constructor above, rather than specifying specific machine types (such as `Array{Float32, 2}`). Similar remarks apply to the input `X` of an unsupervised model.
Scientific types are Julia types defined in the package ScientificTypes.jl, which also defines the convention used here (and there called mlj) for assigning a specific scientific type (interpretation) to each Julia object (see the `scitype` examples below).
The basic "scalar" scientific types are `Continuous`, `Multiclass{N}`, `OrderedFactor{N}` and `Count`. Be sure you read Container element types below to guarantee your scalar data is interpreted correctly. Tools exist to coerce the data to have the appropriate scientific type; see ScientificTypes.jl or run `?coerce` for details.
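For instance, a minimal sketch of coercion (element types shown in the comments are assumptions about the default behaviour; check `?coerce` in your installed version):

```julia
height = [185, 153, 163]             # Int elements, so scitype is Count
height = coerce(height, Continuous)  # now Float64 elements with Continuous scitype
```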
Additionally, most data containers - such as tuples, vectors, matrices and tables - have a scientific type.
Figure 1. Part of the scientific type hierarchy in ScientificTypes.jl.
```julia
scitype(4.6)
scitype(42)
x1 = categorical(["yes", "no", "yes", "maybe"]);
scitype(x1)
X = (x1=x1, x2=rand(4), x3=rand(4)) # a "column table"
scitype(X)
```
All data containers compatible with the Tables.jl interface (which includes all source formats listed here) have the scientific type `Table{K}`, where `K` depends on the scientific types of the columns, which can be individually inspected using `schema`:

```julia
schema(X)
```
Since an MLJ model only specifies the scientific type of data, if that type is `Table` - which is the case for the majority of MLJ models - then any Tables.jl format is permitted. However, the Tables.jl API excludes matrices. If `Xmatrix` is a matrix, convert it to a column table using `X = MLJ.table(Xmatrix)`.

Specifically, the requirement for an arbitrary model's input is `scitype(X) <: input_scitype(model)`.
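A small sketch of the matrix-to-table conversion just described (the default column names shown are assumptions):

```julia
Xmatrix = rand(10, 3)
X = MLJ.table(Xmatrix)  # a column table with columns :x1, :x2, :x3
scitype(X)              # a Table{...} scitype, acceptable to many models
```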
The target `y` expected by MLJ models is generally an `AbstractVector`. A multivariate target `y` will generally be a table.

Specifically, the type requirement for a model target is `scitype(y) <: target_scitype(model)`.
Given a model instance, one can inspect the admissible scientific types of its input and target by querying the scientific type of the model itself:

```julia
tree = @load DecisionTreeClassifier
```

```julia-repl
julia> tree = DecisionTreeClassifier();

julia> scitype(tree)
(input_scitype = ScientificTypes.Table{#s13} where #s13<:(AbstractArray{#s12,1} where #s12<:Continuous),
 target_scitype = AbstractArray{#s21,1} where #s21<:Finite,
 is_probabilistic = true,)
```

This does not work if relevant model code has not been loaded. In that case one can extract this information from the model type's registry entry, using `info`:

```julia
info("DecisionTreeClassifier")
```
Models in MLJ will always apply the mlj convention described in ScientificTypes.jl to decide how to interpret the elements of your container types. Here are the key aspects of that convention:
- Any `AbstractFloat` is interpreted as `Continuous`.
- Any `Integer` is interpreted as `Count`.
- Any `CategoricalValue` or `CategoricalString`, `x`, is interpreted as `Multiclass` or `OrderedFactor`, depending on the value of `x.pool.ordered`.
- `String`s and `Char`s are not interpreted as `Finite`; they have `Unknown` scitype. Coerce vectors of strings or characters to `CategoricalVector`s if they represent `Multiclass` or `OrderedFactor` data. Do `?coerce` and `?unpack` to learn how.
- In particular, integers (including `Bool`s) cannot be used to represent categorical data.
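For example, a sketch of preparing an integer-coded classification target under this convention:

```julia
yraw = [0, 1, 1, 0]
scitype(yraw)                 # elements are Count, not a Finite scitype
y = coerce(yraw, Multiclass)  # now a CategoricalVector, usable as a classifier target
```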
To designate an intrinsic "true" class for binary data (for purposes of applying MLJ measures, such as `truepositive`), data should be represented by an ordered `CategoricalValue` or `CategoricalString`. This data will have scitype `OrderedFactor{2}` and the "true" class is understood to be the second class in the ordering.
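As a sketch (assuming CategoricalArrays' default behaviour of sorting the levels):

```julia
y = coerce(["no", "yes", "no", "yes"], OrderedFactor)
levels(y)  # ["no", "yes"] - the second level, "yes", is the "true" class
```

If the positive class is not already second in the ordering, `levels!` can be used to reorder the levels.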