Getting Started

import Base.eval # hack because the auto-generated docs put code in a baremodule
import Random.seed! 
using MLJ
MLJ.color_off()
seed!(1234) 

Choosing and evaluating a model

To load some demonstration data, add RDatasets to your load path and enter

using RDatasets
iris = dataset("datasets", "iris"); # a DataFrame

and then split the data into input and target parts:

using MLJ
y, X = unpack(iris, ==(:Species), colname -> true);
first(X, 3) |> pretty

To list all models available in MLJ's model registry:

models()

In MLJ a model is a struct storing the hyperparameters of the learning algorithm indicated by the struct name.

Assuming the DecisionTree.jl package is in your load path, we can use @load to load the code defining the DecisionTreeClassifier model type. This macro also returns an instance, with default hyperparameters.

Drop the verbosity=1 declaration for silent loading:

tree_model = @load DecisionTreeClassifier verbosity=1

Important: DecisionTree.jl and most other packages implementing machine learning algorithms for use in MLJ are not MLJ dependencies. If such a package is not in your load path you will receive an error explaining how to add the package to your current environment.
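Since a model is just a struct, its hyperparameters can be inspected and mutated like ordinary fields. A small sketch (field names follow DecisionTree.jl; the defaults may vary between versions):

```julia
using MLJ

tree_model = @load DecisionTreeClassifier

tree_model.max_depth       # inspect a hyperparameter
tree_model.max_depth = 2   # hyperparameters are ordinary mutable fields
tree_model                 # printing the model shows all current hyperparameters
```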

Once loaded, a model can be evaluated with the evaluate method:

evaluate(tree_model, X, y, 
         resampling=CV(shuffle=true), measure=cross_entropy, verbosity=0)

Evaluating against multiple performance measures is also possible. See Evaluating Model Performance for details.
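For example, a vector of measures can be passed to measure (a sketch; the set of measure names available depends on your MLJ version):

```julia
using MLJ

# assumes tree_model, X, y as defined above
evaluate(tree_model, X, y,
         resampling=CV(shuffle=true),
         measure=[cross_entropy, BrierScore()],  # one result per measure
         verbosity=0)
```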

Fit and predict

To illustrate MLJ's fit and predict interface, let's perform the above evaluations by hand.

Wrapping the model in data creates a machine which will store training outcomes:

tree = machine(tree_model, X, y)

Training and testing on a hold-out set:

train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split
fit!(tree, rows=train);
yhat = predict(tree, X[test,:]);
yhat[3:5]
cross_entropy(yhat, y[test]) |> mean

Notice that yhat is a vector of Distribution objects (because DecisionTreeClassifier makes probabilistic predictions). The methods of the Distributions package can be applied to such distributions:

broadcast(pdf, yhat[3:5], "virginica") # predicted probabilities of virginica
mode.(yhat[3:5])

Or, one can explicitly get modes by using predict_mode instead of predict:

predict_mode(tree, rows=test[3:5])

Unsupervised models have a transform method instead of predict, and may optionally implement an inverse_transform method:

v = [1, 2, 3, 4]
stand_model = UnivariateStandardizer()
stand = machine(stand_model, v)
fit!(stand)
w = transform(stand, v)
inverse_transform(stand, w)

Machines have an internal state which allows them to avoid redundant calculations when retrained, in certain conditions - for example when increasing the number of trees in a random forest, or the number of epochs in a neural network. The machine building syntax also anticipates a more general syntax for composing multiple models, as explained in Composing Models.
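To sketch how this works in practice (behavior is model-dependent; this assumes the tree machine and tree_model defined above):

```julia
fit!(tree)                  # trains on the previously specified rows
fit!(tree)                  # no-op: model and data are unchanged
tree_model.max_depth = 4    # mutate a hyperparameter ...
fit!(tree)                  # ... and only now does the machine retrain
```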

There is a version of evaluate for machines as well as models. An exclamation point is added to the method name because machines are generally mutated when trained:

evaluate!(tree, resampling=Holdout(fraction_train=0.5, shuffle=true),
                measure=cross_entropy,
                verbosity=0)

Changing a hyperparameter and re-evaluating:

tree_model.max_depth = 3
evaluate!(tree, resampling=Holdout(fraction_train=0.5, shuffle=true),
          measure=cross_entropy,
          verbosity=0)

Next steps

To learn a little more about what MLJ can do, browse Common MLJ Workflows or MLJ's tutorials, returning to the manual as needed. Read at least the remainder of this page before considering serious use of MLJ.

Prerequisites

MLJ assumes some familiarity with the CategoricalValue and CategoricalString types from CategoricalArrays.jl, used here for representing categorical data. For probabilistic predictors, a basic acquaintance with Distributions.jl is also assumed.

Data containers and scientific types

The MLJ user should acquaint themselves with some basic assumptions about the form of data expected by MLJ, as outlined below.

machine(model::Supervised, X, y) 
machine(model::Unsupervised, X)

Each supervised model in MLJ declares the permitted scientific types of the inputs X and targets y that can be bound to it in the first constructor above, rather than specific machine types (such as Array{Float32, 2}). Similar remarks apply to the input X of an unsupervised model.

Scientific types are Julia types defined in the package ScientificTypes.jl, which also defines the convention used here (there called mlj) for assigning a specific scientific type (interpretation) to each Julia object (see the scitype examples below).

The basic "scalar" scientific types are Continuous, Multiclass{N}, OrderedFactor{N} and Count. Be sure to read Container element types below to guarantee your scalar data is interpreted correctly. Tools exist to coerce the data to have the appropriate scientific type; see ScientificTypes.jl or run ?coerce for details.

Additionally, most data containers - such as tuples, vectors, matrices and tables - have a scientific type.

Figure 1. Part of the scientific type hierarchy in ScientificTypes.jl.

scitype(4.6)
scitype(42)
x1 = categorical(["yes", "no", "yes", "maybe"]);
scitype(x1)
X = (x1=x1, x2=rand(4), x3=rand(4))  # a "column table"
scitype(X)

Tabular data

All data containers compatible with the Tables.jl interface (which includes all source formats listed here) have the scientific type Table{K}, where K depends on the scientific types of the columns, which can be individually inspected using schema:

schema(X)

Inputs

Since an MLJ model only specifies the scientific type of data, if that type is Table - which is the case for the majority of MLJ models - then any Tables.jl format is permitted. However, the Tables.jl API excludes matrices. If Xmatrix is a matrix, convert it to a column table using X = MLJ.table(Xmatrix).
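For instance (a minimal sketch):

```julia
using MLJ

Xmatrix = rand(4, 2)
X = MLJ.table(Xmatrix)   # wrap the matrix as a column table
scitype(X)               # now a Table{...} scitype, acceptable to such models
```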

Specifically, the requirement for an arbitrary model's input is scitype(X) <: input_scitype(model).

Targets

The target y expected by MLJ models is generally an AbstractVector. A multivariate target y will generally be a table.

Specifically, the type requirement for a model target is scitype(y) <: target_scitype(model).
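Assuming X, y and tree_model as defined earlier, both requirements can be checked directly:

```julia
scitype(X) <: input_scitype(tree_model)   # true if X is acceptable input
scitype(y) <: target_scitype(tree_model)  # true if y is an acceptable target
```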

Querying a model for acceptable data types

Given a model instance, one can inspect the admissible scientific types of its input and target by querying the scientific type of the model itself:

tree = @load DecisionTreeClassifier
scitype(tree)
(input_scitype = ScientificTypes.Table{#s13} where #s13<:(AbstractArray{#s12,1} where #s12<:Continuous),
 target_scitype = AbstractArray{#s21,1} where #s21<:Finite,
 is_probabilistic = true,)

This does not work if the relevant model code has not been loaded. In that case one can extract this information from the model type's registry entry, using info:

info("DecisionTreeClassifier")

Container element types

Models in MLJ will always apply the mlj convention described in ScientificTypes.jl to decide how to interpret the elements of your container types. Here are the key aspects of that convention:

  • Any AbstractFloat is interpreted as Continuous.

  • Any Integer is interpreted as Count.

  • Any CategoricalValue or CategoricalString, x, is interpreted as Multiclass or OrderedFactor, depending on the value of x.pool.ordered.

  • Strings and Chars are not interpreted as Finite; they have Unknown scitype. Coerce vectors of strings or characters to CategoricalVectors if they represent Multiclass or OrderedFactor data. Run ?coerce and ?unpack to learn how.

  • In particular, integers (including Bools) cannot be used to represent categorical data.
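To illustrate the convention, and how to repair data with the wrong interpretation, here is a small sketch using coerce:

```julia
using MLJ

v = ["yes", "no", "yes", "no"]
scitype(v)                  # elements have Unknown scitype: strings are not Finite
w = coerce(v, Multiclass)   # convert to a CategoricalVector
scitype(w)                  # elements now have scitype Multiclass{2}
```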

To designate an intrinsic "true" class for binary data (for purposes of applying MLJ measures, such as truepositive), data should be represented by an ordered CategoricalValue or CategoricalString. This data will have scitype OrderedFactor{2} and the "true" class is understood to be the second class in the ordering.
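A sketch of designating the "true" class this way (levels are ordered alphabetically by default; use levels! from CategoricalArrays.jl to reorder if the positive class is not second):

```julia
using MLJ

y = coerce(["no", "yes", "no", "yes"], OrderedFactor)
levels(y)    # ["no", "yes"]: "yes", being second, is the "true" class
scitype(y)   # elements have scitype OrderedFactor{2}
```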