# Common MLJ Workflows

## Data ingestion

In [1]:
using MLJ
using RDatasets
channing = dataset("boot", "channing");

Inspecting metadata, including column scientific types:

In [2]:
schema(channing)

(names = (:Sex, :Entry, :Exit, :Time, :Cens),
 types = (CategoricalString{UInt8}, Int32, Int32, Int32, Int32),
 scitypes = (Multiclass{2}, Count, Count, Count, Count),
 nrows = 462,)

Unpacking data and correcting for wrong scitypes:

In [3]:
y, X =  unpack(channing,
               ==(:Exit),            # y is the :Exit column
               !=(:Time);            # X is the rest, except :Time
               :Exit=>Continuous,
               :Entry=>Continuous,
               :Cens=>Multiclass)
first(X, 4) |> pretty

┌──────────────────────────┬────────────┬───────────────────────────────┐
│ Sex                      │ Entry      │ Cens                          │
│ CategoricalString{UInt8} │ Float64    │ CategoricalValue{Int32,UInt8} │
│ Multiclass{2}            │ Continuous │ Multiclass{2}                 │
├──────────────────────────┼────────────┼───────────────────────────────┤
│ Male                     │ 782.0      │ 1                             │
│ Male                     │ 1020.0     │ 1                             │
│ Male                     │ 856.0      │ 1                             │
│ Male                     │ 915.0      │ 1                             │
└──────────────────────────┴────────────┴───────────────────────────────┘

In [4]:
y[1:4]

4-element Array{Float64,1}:
  909.0
 1128.0
  969.0
  957.0

Loading a built-in supervised dataset:

In [5]:
X, y = @load_iris;
first(X, 4) |> pretty

┌──────────────┬─────────────┬──────────────┬─────────────┐
│ sepal_length │ sepal_width │ petal_length │ petal_width │
│ Float64      │ Float64     │ Float64      │ Float64     │
│ Continuous   │ Continuous  │ Continuous   │ Continuous  │
├──────────────┼─────────────┼──────────────┼─────────────┤
│ 5.1          │ 3.5         │ 1.4          │ 0.2         │
│ 4.9          │ 3.0         │ 1.4          │ 0.2         │
│ 4.7          │ 3.2         │ 1.3          │ 0.2         │
│ 4.6          │ 3.1         │ 1.5          │ 0.2         │
└──────────────┴─────────────┴──────────────┴─────────────┘

In [6]:
y[1:4]

4-element CategoricalArray{String,1,UInt32}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"

## Model search

Searching for a supervised model:

In [7]:
X, y = @load_boston
models(matching(X, y))

37-element Array{NamedTuple,1}:
 (name = ARDRegressor, package_name = ScikitLearn, ... )                      
 (name = AdaBoostRegressor, package_name = ScikitLearn, ... )                 
 (name = BaggingRegressor, package_name = ScikitLearn, ... )                  
 (name = BayesianRidgeRegressor, package_name = ScikitLearn, ... )            
 (name = ConstantRegressor, package_name = MLJModels, ... )                   
 (name = DecisionTreeRegressor, package_name = DecisionTree, ... )            
 (name = DeterministicConstantRegressor, package_name = MLJModels, ... )      
 (name = ElasticNetCVRegressor, package_name = ScikitLearn, ... )             
 (name = ElasticNetRegressor, package_name = ScikitLearn, ... )               
 (name = EpsilonSVR, package_name = LIBSVM, ... )                             
 (name = GaussianProcessRegressor, package_name = ScikitLearn, ... )          
 (name = GradientBoostingRegressor, package_name = ScikitLearn, ... )         
 (name = HuberRegres

In [8]:
models(matching(X, y))[6]

DecisionTreeRegressor from DecisionTree.jl.
[Documentation](https://github.com/bensadeghi/DecisionTree.jl).
(name = "DecisionTreeRegressor",
 package_name = "DecisionTree",
 is_supervised = true,
 docstring = "DecisionTreeRegressor from DecisionTree.jl.\n[Documentation](https://github.com/bensadeghi/DecisionTree.jl).",
 hyperparameter_types = ["Float64", "Int64", "Int64", "Int64", "Float64", "Int64", "Bool"],
 hyperparameters = Symbol[:pruning_purity_threshold, :max_depth, :min_samples_leaf, :min_samples_split, :min_purity_increase, :n_subfeatures, :post_prune],
 implemented_methods = Symbol[:fit, :predict, :clean!, :fitted_params],
 is_pure_julia = true,
 is_wrapper = false,
 load_path = "MLJModels.DecisionTree_.DecisionTreeRegressor",
 package_license = "unknown",
 package_url = "https://github.com/bensadeghi/DecisionTree.jl",
 package_uuid = "7806a523-6efd-50cb-b5f6-3fa6f1930dbb",
 prediction_type = :deterministic,
 supports_weights = false,
 input_scitype = Scie

More refined searches:

In [9]:
models() do model
    matching(model, X, y) &&
        model.prediction_type == :deterministic &&
        model.is_pure_julia
end

4-element Array{NamedTuple,1}:
 (name = DecisionTreeRegressor, package_name = DecisionTree, ... )      
 (name = DeterministicConstantRegressor, package_name = MLJModels, ... )
 (name = KNNRegressor, package_name = NearestNeighbors, ... )           
 (name = RidgeRegressor, package_name = MultivariateStats, ... )        

Searching for an unsupervised model:

In [10]:
models(matching(X))

9-element Array{NamedTuple,1}:
 (name = FeatureSelector, package_name = MLJModels, ... )  
 (name = ICA, package_name = MultivariateStats, ... )      
 (name = KMeans, package_name = Clustering, ... )          
 (name = KMedoids, package_name = Clustering, ... )        
 (name = KernelPCA, package_name = MultivariateStats, ... )
 (name = OneClassSVM, package_name = LIBSVM, ... )         
 (name = OneHotEncoder, package_name = MLJModels, ... )    
 (name = PCA, package_name = MultivariateStats, ... )      
 (name = Standardizer, package_name = MLJModels, ... )     

Getting the metadata entry for a given model type:

In [11]:
info("PCA")
info("RidgeRegressor", pkg="MultivariateStats") # a model type in multiple packages

RidgeRegressor from MultivariateStats.jl.
[Documentation](https://github.com/JuliaStats/MultivariateStats.jl).
(name = "RidgeRegressor",
 package_name = "MultivariateStats",
 is_supervised = true,
 docstring = "RidgeRegressor from MultivariateStats.jl.\n[Documentation](https://github.com/JuliaStats/MultivariateStats.jl).",
 hyperparameter_types = ["Float64"],
 hyperparameters = Symbol[:lambda],
 implemented_methods = Symbol[:fit, :predict, :clean!, :fitted_params],
 is_pure_julia = true,
 is_wrapper = false,
 load_path = "MLJModels.MultivariateStats_.RidgeRegressor",
 package_license = "unknown",
 package_url = "https://github.com/JuliaStats/MultivariateStats.jl",
 package_uuid = "6f286f6a-111f-5878-ab1e-185364afe411",
 prediction_type = :deterministic,
 supports_weights = false,
 input_scitype = ScientificTypes.Table{_s13} where _s13<:(AbstractArray{_s12,1} where _s12<:Continuous),
 target_scitype = AbstractArray{Continuous,1},)

### *More on model matching*

- `model` is in the list returned by `models(test)` exactly when
  `test(model) == true`. (Here `model` is some model type metadata
  entry, as returned by `info(...)`.)

- `matching(model, X, y) == true` exactly when `model` is supervised
  and admits inputs and targets with the scientific types of `X` and
  `y`, respectively.

- `matching(model, X) == true` exactly when `model` is unsupervised
  and admits inputs with the scientific types of `X`.

- The testing objects `matching(model)`, `matching(X, y)` and `matching(X)`,
  which are callable and `Bool`-valued, are just the curried versions of
  the above. So, for example, `matching(X, y)(model) =
  matching(model, X, y)`.
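
The currying described above can be illustrated in plain Julia, without MLJ. The sketch below is only an analogy: `entries`, `toy_matching` and `toy_models` are hypothetical stand-ins invented for this example, and the "test" is reduced to a single Boolean field rather than a scientific-type check.

```julia
# Hypothetical stand-ins for model metadata entries (not MLJ's registry).
entries = [(name = "A", is_supervised = true,  is_pure_julia = true),
           (name = "B", is_supervised = true,  is_pure_julia = false),
           (name = "C", is_supervised = false, is_pure_julia = true)]

# Two-argument form: does `entry` meet the requirement?
toy_matching(entry, need_supervised::Bool) =
    entry.is_supervised == need_supervised

# Curried form: fix the requirement and return a Bool-valued callable.
toy_matching(need_supervised::Bool) =
    entry -> toy_matching(entry, need_supervised)

# Analogue of `models(test)`: keep entries for which `test(entry)` is true.
toy_models(test) = filter(test, entries)

# Passing the curried test directly ...
plain = toy_models(toy_matching(true))

# ... or using the two-argument form inside a `do` block, which leaves
# room for extra conditions.
refined = toy_models() do entry
    toy_matching(entry, true) && entry.is_pure_julia
end
```

Here `plain` keeps the two supervised entries and `refined` keeps only the one that is also pure Julia, mirroring the difference between `models(matching(X, y))` and the `models() do model ... end` form shown earlier.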

## Instantiating a model

Loading model code:

In [12]:
@load DecisionTreeClassifier

DecisionTreeClassifier(pruning_purity = 1.0,
                       max_depth = -1,
                       min_samples_leaf = 1,
                       min_samples_split = 2,
                       min_purity_increase = 0.0,
                       n_subfeatures = 0,
                       display_depth = 5,
                       post_prune = false,
                       merge_purity_threshold = 0.9,
                       pdf_smoothing = 0.05,) @ 7…72

Instantiating a model:

In [13]:
model = DecisionTreeClassifier(min_samples_split=5, max_depth=4)

DecisionTreeClassifier(pruning_purity = 1.0,
                       max_depth = 4,
                       min_samples_leaf = 1,
                       min_samples_split = 5,
                       min_purity_increase = 0.0,
                       n_subfeatures = 0,
                       display_depth = 5,
                       post_prune = false,
                       merge_purity_threshold = 0.9,
                       pdf_smoothing = 0.05,) @ 7…86

or

In [14]:
model = @load DecisionTreeClassifier
model.min_samples_split = 5
model.max_depth = 4

┌ Info: A model type "DecisionTreeClassifier" is already loaded. 
│ No new code loaded. 
└ @ MLJModels /Users/anthony/Dropbox/Julia7/MLJ/MLJModels/src/loading.jl:41


4

## Evaluating a model

In [15]:
X, y = @load_boston
model = @load KNNRegressor
evaluate(model, X, y, resampling=CV(nfolds=5), measure=[rms, mav])

┌─────────┬───────────────────┐
│ measure │ measurement       │
├─────────┼───────────────────┤
│ rms     │ 8.668102471357711 │
│ mav     │ 6.047643564356435 │
└─────────┴───────────────────┘


(measure = MLJBase.Measure[rms, mav],
 measurement = [8.668102471357711, 6.047643564356435],
 per_fold = Array{Float64,1}[[8.525465870955774, 8.52461967445231, 10.74455588603451, 9.393386761519249, 6.152484163826722], [6.489306930693069, 5.434059405940592, 7.613069306930692, 6.033663366336635, 4.668118811881189]],
 per_observation = Missing[missing, missing],)

## Basic fit/evaluate/predict by hand

In [16]:
using RDatasets
vaso = dataset("robustbase", "vaso"); # a DataFrame
y, X = unpack(vaso, ==(:Y), c -> true; :Y => Multiclass)

tree_model = @load DecisionTreeClassifier
tree_model.max_depth=2

┌ Info: A model type "DecisionTreeClassifier" is already loaded. 
│ No new code loaded. 
└ @ MLJModels /Users/anthony/Dropbox/Julia7/MLJ/MLJModels/src/loading.jl:41


2

Bind the model and data together in a *machine*, which will
additionally store the learned parameters (*fitresults*) when fit:

In [17]:
tree = machine(tree_model, X, y)

Machine{DecisionTreeClassifier} @ 1…17


Split row indices into training and evaluation rows:

In [18]:
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=1234); # 70:30 split
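
As a rough picture of what a shuffled 70:30 split of row indices involves, here is a plain-Julia sketch. `toy_partition` is a hypothetical helper written for this example; it is not MLJ's `partition`, and MLJ's rounding rule and shuffling order may differ, so the indices it returns will not match the ones used in this notebook.

```julia
using Random  # standard library

# Hypothetical helper: shuffle the indices with a seeded generator,
# then cut the shuffled vector at the requested fraction.
function toy_partition(indices, fraction; rng=MersenneTwister(1234))
    shuffled = shuffle(rng, collect(indices))
    ntrain = round(Int, fraction * length(shuffled))  # assumed rounding rule
    return shuffled[1:ntrain], shuffled[ntrain+1:end]
end

toy_train, toy_test = toy_partition(1:10, 0.7)
```

The two returned vectors are disjoint and together cover every index, which is what allows `fit!(tree, rows=train)` and `predict(tree, rows=test)` to work on separate parts of the same data.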

Fit on train and evaluate on test:

In [19]:
fit!(tree, rows=train)
yhat = predict(tree, rows=test);
mean(cross_entropy(yhat, y[test]))

┌ Info: Training Machine{DecisionTreeClassifier} @ 1…17.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/MLJ/src/machines.jl:141


1.135369212298553

Predict on new data:

In [20]:
Xnew = (Volume=3*rand(3), Rate=3*rand(3))
predict(tree, Xnew)      # a vector of distributions

3-element Array{UnivariateFinite{Int64,UInt8,Float64},1}:
 UnivariateFinite(0=>0.2727272727272727, 1=>0.7272727272727273) 
 UnivariateFinite(0=>0.02439024390243903, 1=>0.9756097560975611)
 UnivariateFinite(0=>0.02439024390243903, 1=>0.9756097560975611)

In [21]:
predict_mode(tree, Xnew) # a vector of point-predictions

3-element Array{CategoricalValue{Int64,UInt8},1}:
 1
 1
 1
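
To make the relationship between a predicted distribution, its mode, and the cross-entropy loss concrete, the following plain-Julia sketch uses a `Dict` in place of a `UnivariateFinite` object. The helper names `toy_mode` and `toy_cross_entropy` are hypothetical, and the loss is taken to be the negative log of the probability assigned to the observed class, which is the usual definition and an assumption about what `cross_entropy` computes here.

```julia
# A Dict standing in for one predicted distribution; the probabilities
# are copied from the second prediction shown above.
dist = Dict(0 => 0.02439024390243903, 1 => 0.9756097560975611)

# Point prediction: the label carrying the largest probability.
function toy_mode(d)
    best_label, best_p = nothing, -Inf
    for (label, p) in d
        if p > best_p
            best_label, best_p = label, p
        end
    end
    return best_label
end

# Loss for a single observation: negative log of the probability
# assigned to the class that was actually observed.
toy_cross_entropy(d, observed) = -log(d[observed])

point = toy_mode(dist)                 # the more probable label
loss_hit = toy_cross_entropy(dist, 1)  # small: class 1 was given high probability
loss_miss = toy_cross_entropy(dist, 0) # large: class 0 was given low probability
```

With these probabilities the two losses come out at roughly 0.0247 and 3.71, values that also appear in the `per_observation` output of the evaluation examples in this notebook.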

### *More on machines (implementation detail)*

Under the hood, calling `fit!` on a machine calls either
`MLJBase.fit` or `MLJBase.update`, depending on the machine's
internal state, as recorded in the additional fields `previous_model`
and `rows`. These lower-level methods dispatch on the model and on a
view of the data determined by the optional `rows` keyword argument
of `fit!` (all rows by default). In this way, if a model `update`
method is implemented, calls to `fit!` can avoid redundant
calculations for certain kinds of model mutations (e.g., increasing
the number of epochs in a neural network).

Here is a complete list of the fields of a machine:

- `model` - the struct containing the hyperparameters to be used
in calls to `fit!`

- `fitresult` - the learned parameters in a raw form, initially undefined

- `args` -  a tuple of the data (in the supervised learning example above, `args = (X, y)`)

- `report` - outputs of training not encoded in `fitresult` (eg, feature rankings)

- `previous_model` - a deep copy of the model used in the last call to `fit!`

- `rows` -  a copy of the row indices used in last call to `fit!`

- `cache`
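
The fit-or-update decision described in this section can be sketched with a toy machine in plain Julia. Everything below is hypothetical (`ToyModel`, `ToyMachine`, `toy_fit`, `toy_update`, `toy_fit!`) and is not MLJ's implementation; in particular, the rule used to decide when an update is allowed is an assumption chosen to mirror the "more epochs" example.

```julia
# Toy model: `epochs` can be increased cheaply; changing `rate` cannot.
mutable struct ToyModel
    epochs::Int
    rate::Float64
end

# Toy machine with a subset of the fields listed above.
mutable struct ToyMachine
    model::ToyModel
    fitresult::Union{Nothing,Float64}
    previous_model::Union{Nothing,ToyModel}
    rows::Union{Nothing,Vector{Int}}
end
ToyMachine(model) = ToyMachine(model, nothing, nothing, nothing)

# "Training": each epoch moves the estimate toward the mean of the data.
function run_epochs(estimate, data, n, rate)
    for _ in 1:n
        estimate += rate * (sum(data) / length(data) - estimate)
    end
    return estimate
end

# Full fit from scratch.
toy_fit(model, data) = run_epochs(0.0, data, model.epochs, model.rate)

# Update: run only the epochs added since the previous fit.
toy_update(model, old, fitresult, data) =
    run_epochs(fitresult, data, model.epochs - old.epochs, model.rate)

function toy_fit!(mach, data; rows=collect(eachindex(data)))
    m, old = mach.model, mach.previous_model
    can_update = old !== nothing && mach.rows == rows &&
                 m.rate == old.rate && m.epochs >= old.epochs
    action = can_update ? :update : :fit
    mach.fitresult = can_update ?
        toy_update(m, old, mach.fitresult, data[rows]) :
        toy_fit(m, data[rows])
    mach.previous_model = deepcopy(m)  # snapshot for the next call
    mach.rows = copy(rows)
    return action
end

data = [1.0, 2.0, 3.0, 4.0]
mach = ToyMachine(ToyModel(5, 0.5))
first_call = toy_fit!(mach, data)   # nothing fitted yet: full fit
mach.model.epochs = 8
second_call = toy_fit!(mach, data)  # same rows, more epochs: update
mach.model.rate = 0.1
third_call = toy_fit!(mach, data)   # other hyperparameter changed: full fit
```

The deep copy stored in `previous_model` is what lets the next call detect which hyperparameters were mutated in the meantime.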

## More performance evaluation examples

In [22]:
import LossFunctions.ZeroOneLoss

Evaluating model + data directly:

In [23]:
evaluate(tree_model, X, y,
         resampling=Holdout(fraction_train=0.7, shuffle=true, rng=1234),
         measure=[cross_entropy, ZeroOneLoss()])

┌───────────────┬────────────────────┐
│ measure       │ measurement        │
├───────────────┼────────────────────┤
│ cross_entropy │ 1.135369212298553  │
│ ZeroOneLoss   │ 0.4166666666666667 │
└───────────────┴────────────────────┘


(measure = Any[cross_entropy, ZeroOneLoss()],
 measurement = [1.135369212298553, 0.4166666666666667],
 per_fold = Array{Float64,1}[[1.135369212298553], [0.4166666666666667]],
 per_observation = Array{Array{Float64,1},1}[[[0.10536051565782628, 3.7135720667043075, 0.10536051565782628, 2.3025850929940455, 0.10536051565782628, 0.3184537311185346, 0.02469261259037141, 0.3184537311185346, 0.3184537311185346, 1.2992829841302609, 3.7135720667043075, 1.2992829841302609]], [[0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]],)

If a machine is already defined, as above:

In [24]:
evaluate!(tree,
          resampling=Holdout(fraction_train=0.7, shuffle=true, rng=1234),
          measure=[cross_entropy, ZeroOneLoss()])

┌───────────────┬────────────────────┐
│ measure       │ measurement        │
├───────────────┼────────────────────┤
│ cross_entropy │ 1.135369212298553  │
│ ZeroOneLoss   │ 0.4166666666666667 │
└───────────────┴────────────────────┘


(measure = Any[cross_entropy, ZeroOneLoss()],
 measurement = [1.135369212298553, 0.4166666666666667],
 per_fold = Array{Float64,1}[[1.135369212298553], [0.4166666666666667]],
 per_observation = Array{Array{Float64,1},1}[[[0.10536051565782628, 3.7135720667043075, 0.10536051565782628, 2.3025850929940455, 0.10536051565782628, 0.3184537311185346, 0.02469261259037141, 0.3184537311185346, 0.3184537311185346, 1.2992829841302609, 3.7135720667043075, 1.2992829841302609]], [[0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]],)

Using cross-validation:

In [25]:
evaluate!(tree, resampling=CV(nfolds=5, shuffle=true, rng=1234),
          measure=[cross_entropy, ZeroOneLoss()])

┌───────────────┬────────────────────┐
│ measure       │ measurement        │
├───────────────┼────────────────────┤
│ cross_entropy │ 0.8107153382628913 │
│ ZeroOneLoss   │ 0.4                │
└───────────────┴────────────────────┘


(measure = Any[cross_entropy, ZeroOneLoss()],
 measurement = [0.8107153382628913, 0.4],
 per_fold = Array{Float64,1}[[0.44130929246809064, 1.2635805032959784, 0.6459172309118898, 0.8778906002819279, 0.8248790643565697], [0.5714285714285714, 0.2857142857142857, 0.2857142857142857, 0.5714285714285714, 0.2857142857142857]],
 per_observation = Array{Array{Float64,1},1}[[[0.02469261259037141, 0.02469261259037141, 0.7537718023763802, 0.7537718023763802, 0.7537718023763802, 0.7537718023763802, 0.02469261259037141], [0.3483066942682157, 0.3483066942682157, 0.3483066942682157, 0.3483066942682157, 3.7135720667043075, 3.7135720667043075, 0.02469261259037141], [0.02469261259037141, 0.1823215567939546, 0.1823215567939546, 2.0149030205422647, 1.791759469228055, 0.1823215567939546, 0.1431008436406733], [1.3862943611198906, 1.3862943611198906, 1.3862943611198906, 0.2876820724517809, 0.02469261259037141, 0.2876820724517809, 1.3862943611198906], [0.02469261259037141, 0.02469261259037141, 0.0246926125903

With user-specified train/evaluation pairs of row indices:

In [26]:
f1, f2, f3 = 1:13, 14:26, 27:36
pairs = [(f1, vcat(f2, f3)), (f2, vcat(f3, f1)), (f3, vcat(f1, f2))];
evaluate!(tree,
          resampling=pairs,
          measure=[cross_entropy, ZeroOneLoss()])

┌───────────────┬─────────────────────┐
│ measure       │ measurement         │
├───────────────┼─────────────────────┤
│ cross_entropy │ 0.895254695800462   │
│ ZeroOneLoss   │ 0.24136008918617616 │
└───────────────┴─────────────────────┘


(measure = Any[cross_entropy, ZeroOneLoss()],
 measurement = [0.895254695800462, 0.24136008918617616],
 per_fold = Array{Float64,1}[[0.7538091986662944, 1.1473950551467866, 0.7845598335883047], [0.30434782608695654, 0.30434782608695654, 0.11538461538461539]],
 per_observation = Array{Array{Float64,1},1}[[[0.15415067982725836, 0.15415067982725836, 0.15415067982725836, 0.15415067982725836, 0.15415067982725836, 1.9459101490553135, 0.15415067982725836, 0.02469261259037141, 1.9459101490553135, 1.9459101490553135  …  0.15415067982725836, 1.9459101490553135, 0.15415067982725836, 0.02469261259037141, 3.7135720667043075, 0.02469261259037141, 1.9459101490553135, 0.15415067982725836, 0.15415067982725836, 0.15415067982725836], [0.02469261259037141, 3.7135720667043075, 3.7135720667043075, 0.02469261259037141, 3.7135720667043075, 0.02469261259037141, 3.7135720667043075, 0.02469261259037141, 0.02469261259037141, 0.02469261259037141  …  0.02469261259037141, 0.02469261259037141, 0.02469261259037141, 0.
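
The numbers printed above suggest how the reported `measurement` relates to the `per_fold` values: it appears to be their plain mean. This is an inference from the output, not a statement about MLJ's source, and the variable names below are made up for this check.

```julia
using Statistics  # standard library, for `mean`

# Per-fold values copied from the output above.
per_fold_cross_entropy = [0.7538091986662944, 1.1473950551467866, 0.7845598335883047]
per_fold_zero_one = [0.30434782608695654, 0.30434782608695654, 0.11538461538461539]

# Candidate aggregates: the unweighted mean over folds.
aggregate_cross_entropy = mean(per_fold_cross_entropy)
aggregate_zero_one = mean(per_fold_zero_one)
```

Both means agree with the `measurement` entries in the table to within floating-point rounding. Note also that the first `ZeroOneLoss` fold value equals 7/23, which is consistent with each pair being read as `(train, test)`, so that the first fold is scored on the 23 rows of `vcat(f2, f3)`.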

Changing a hyperparameter and re-evaluating:

In [27]:
tree_model.max_depth = 3
evaluate!(tree,
          resampling=CV(nfolds=5, shuffle=true, rng=1234),
          measure=[cross_entropy, ZeroOneLoss()])

┌───────────────┬─────────────────────┐
│ measure       │ measurement         │
├───────────────┼─────────────────────┤
│ cross_entropy │ 0.7857788118033404  │
│ ZeroOneLoss   │ 0.37142857142857133 │
└───────────────┴─────────────────────┘


(measure = Any[cross_entropy, ZeroOneLoss()],
 measurement = [0.7857788118033404, 0.37142857142857133],
 per_fold = Array{Float64,1}[[0.5192479199123463, 1.1617214839057737, 0.7334426224354447, 0.6982881261612496, 0.816193906601888], [0.42857142857142855, 0.2857142857142857, 0.2857142857142857, 0.5714285714285714, 0.2857142857142857]],
 per_observation = Array{Array{Float64,1},1}[[[0.02469261259037141, 0.02469261259037141, 0.02469261259037141, 1.1786549963416462, 1.1786549963416462, 1.1786549963416462, 0.02469261259037141], [0.6061358035703156, 0.02469261259037141, 0.02469261259037141, 0.02469261259037141, 3.7135720667043075, 3.7135720667043075, 0.02469261259037141], [0.02469261259037141, 0.02469261259037141, 0.02469261259037141, 3.7135720667043075, 1.0986122886681098, 0.02469261259037141, 0.2231435513142097], [0.9808292530117262, 0.9808292530117262, 0.9808292530117262, 0.4700036292457356, 0.02469261259037141, 0.4700036292457356, 0.9808292530117262], [0.02469261259037141, 0.02469261259

## Inspecting training results

Fit an ordinary least-squares model to some synthetic data:

In [28]:
x1 = rand(100)
x2 = rand(100)

X = (x1=x1, x2=x2)
y = x1 - 2x2 + 0.1*rand(100);

ols_model = @load LinearRegressor pkg=GLM
ols =  machine(ols_model, X, y)
fit!(ols)

┌ Info: Training Machine{LinearRegressor} @ 7…80.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/MLJ/src/machines.jl:141

Machine{LinearRegressor} @ 7…80


Get a named tuple representing the learned parameters,
human-readable if appropriate:

In [29]:
fitted_params(ols)

(coef = [0.9985128951528446, -1.9981845372437947],
 intercept = 0.05141139704717806,)

Get other training-related information:

In [30]:
report(ols)

(deviance = 0.08067317714592058,
 dof_residual = 97.0,
 stderror = [0.009386992893038402, 0.00995817861943297, 0.0073417739672351065],
 vcov = [8.811563557395346e-5 -9.558303404671843e-6 -4.056936372724475e-5; -9.558303404671843e-6 9.916532141653193e-5 -4.6982822143496706e-5; -4.056936372724475e-5 -4.6982822143496706e-5 5.390164498597111e-5],)

## Basic fit/transform for unsupervised models

Load data:

In [31]:
X, y = @load_iris
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=123)

([125, 100, 130, 9, 70, 148, 39, 64, 6, 107  …  134, 114, 52, 74, 44, 61, 83, 18, 122, 26], [97, 78, 30, 108, 101, 24, 85, 91, 135, 96  …  112, 144, 140, 72, 109, 41, 106, 147, 47, 5])

Instantiate and fit the model/machine:

In [32]:
@load PCA
pca_model = PCA(maxoutdim=2)
pca = machine(pca_model, X)
fit!(pca, rows=train)

┌ Info: Training Machine{PCA} @ 9…33.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/MLJ/src/machines.jl:141

Machine{PCA} @ 9…33


Transform selected data bound to the machine:

In [33]:
transform(pca, rows=test);

Transform new data:

In [34]:
Xnew = (sepal_length=rand(3), sepal_width=rand(3),
        petal_length=rand(3), petal_width=rand(3));
transform(pca, Xnew)

(x1 = [4.819158264829177, 4.8208386973047, 5.111185670643473],
 x2 = [-4.4441147103696315, -4.4288641941901625, -4.71503950609489],)

## Inverting learned transformations

In [35]:
y = rand(100);
stand_model = UnivariateStandardizer()
stand = machine(stand_model, y)
fit!(stand)
z = transform(stand, y);
@assert inverse_transform(stand, z) ≈ y # true

┌ Info: Training Machine{UnivariateStandardizer} @ 9…82.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/MLJ/src/machines.jl:141


## Nested hyperparameter tuning

Load data:

In [36]:
X, y = @load_iris

(150×4 DataFrame
│ Row │ sepal_length │ sepal_width │ petal_length │ petal_width │
│     │ Float64      │ Float64     │ Float64      │ Float64     │
├─────┼──────────────┼─────────────┼──────────────┼─────────────┤
│ 1   │ 5.1          │ 3.5         │ 1.4          │ 0.2         │
│ 2   │ 4.9          │ 3.0         │ 1.4          │ 0.2         │
│ 3   │ 4.7          │ 3.2         │ 1.3          │ 0.2         │
│ 4   │ 4.6          │ 3.1         │ 1.5          │ 0.2         │
│ 5   │ 5.0          │ 3.6         │ 1.4          │ 0.2         │
│ 6   │ 5.4          │ 3.9         │ 1.7          │ 0.4         │
│ 7   │ 4.6          │ 3.4         │ 1.4          │ 0.3         │
│ 8   │ 5.0          │ 3.4         │ 1.5          │ 0.2         │
│ 9   │ 4.4          │ 2.9         │ 1.4          │ 0.2         │
│ 10  │ 4.9          │ 3.1         │ 1.5          │ 0.1         │
⋮
│ 140 │ 6.9          │ 3.1         │ 5.4          │ 2.1         │
│ 141 │ 6.7      

Define a model with nested hyperparameters:

In [37]:
tree_model = @load DecisionTreeClassifier
forest_model = EnsembleModel(atom=tree_model, n=300)

┌ Info: A model type "DecisionTreeClassifier" is already loaded. 
│ No new code loaded. 
└ @ MLJModels /Users/anthony/Dropbox/Julia7/MLJ/MLJModels/src/loading.jl:41


MLJ.ProbabilisticEnsembleModel(atom = DecisionTreeClassifier(pruning_purity = 1.0,
                                                             max_depth = -1,
                                                             min_samples_leaf = 1,
                                                             min_samples_split = 2,
                                                             min_purity_increase = 0.0,
                                                             n_subfeatures = 0,
                                                             display_depth = 5,
                                                             post_prune = false,
                                                             merge_purity_threshold = 0.9,
                                                             pdf_smoothing = 0.05,),
                               weights = Float64[],
                               bagging_fraction = 0.8,
                               rng = MersenneTwister(UInt32[0

Inspect all hyperparameters, even nested ones (returns nested named tuple):

In [38]:
params(forest_model)

(atom = (pruning_purity = 1.0,
         max_depth = -1,
         min_samples_leaf = 1,
         min_samples_split = 2,
         min_purity_increase = 0.0,
         n_subfeatures = 0,
         display_depth = 5,
         post_prune = false,
         merge_purity_threshold = 0.9,
         pdf_smoothing = 0.05,),
 weights = Float64[],
 bagging_fraction = 0.8,
 rng = MersenneTwister(UInt32[0x71271325, 0x5861ba72, 0x34abacc2, 0x27102d83]),
 n = 300,
 parallel = true,
 out_of_bag_measure = Any[],)

Define ranges for hyperparameters to be tuned:

In [39]:
r1 = range(forest_model, :bagging_fraction, lower=0.5, upper=1.0, scale=:log10)

MLJ.NumericRange(field = :bagging_fraction,
                 lower = 0.5,
                 upper = 1.0,
                 scale = :log10,) @ 1…28

In [40]:
r2 = range(forest_model, :(atom.n_subfeatures), lower=1, upper=4) # nested

MLJ.NumericRange(field = :(atom.n_subfeatures),
                 lower = 1,
                 upper = 4,
                 scale = :linear,) @ 1…75

Wrap the model in a tuning strategy:

In [41]:
tuned_forest = TunedModel(model=forest_model,
                          tuning=Grid(resolution=12),
                          resampling=CV(nfolds=6),
                          ranges=[r1, r2],
                          measure=cross_entropy)

MLJ.ProbabilisticTunedModel(model = MLJ.ProbabilisticEnsembleModel(atom = DecisionTreeClassifier @ 1…80,
                                                                   weights = Float64[],
                                                                   bagging_fraction = 0.8,
                                                                   rng = MersenneTwister(UInt32[0x71271325, 0x5861ba72, 0x34abacc2, 0x27102d83]),
                                                                   n = 300,
                                                                   parallel = true,
                                                                   out_of_bag_measure = Any[],),
                            tuning = Grid(resolution = 12,
                                          parallel = true,),
                            resampling = CV(nfolds = 6,
                                            shuffle = false,
                                            rng = MersenneTwister(

Bind the wrapped model to data:

In [42]:
tuned = machine(tuned_forest, X, y)

Machine{ProbabilisticTunedModel} @ 1…60


Fitting the resultant machine optimizes the hyperparameters specified in
`ranges`, using the specified resampling strategy and performance
measure, and then retrains on all data bound to the machine:

In [43]:
fit!(tuned)

┌ Info: Training Machine{ProbabilisticTunedModel} @ 1…60.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/MLJ/src/machines.jl:141
┌ Info: Mimimizing cross_entropy. 
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/MLJ/src/tuning.jl:160
┌ Info: Training best model on all supplied data.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/MLJ/src/tuning.jl:252

Machine{ProbabilisticTunedModel} @ 1…60


Inspecting the optimal model:

In [44]:
F = fitted_params(tuned)

(best_model = ProbabilisticEnsembleModel{DecisionTreeClassifier} @ 1…63,)

In [45]:
F.best_model

MLJ.ProbabilisticEnsembleModel(atom = DecisionTreeClassifier(pruning_purity = 1.0,
                                                             max_depth = -1,
                                                             min_samples_leaf = 1,
                                                             min_samples_split = 2,
                                                             min_purity_increase = 0.0,
                                                             n_subfeatures = 3,
                                                             display_depth = 5,
                                                             post_prune = false,
                                                             merge_purity_threshold = 0.9,
                                                             pdf_smoothing = 0.05,),
                               weights = Float64[],
                               bagging_fraction = 0.5,
                               rng = MersenneTwister(UInt32[0

Inspecting details of tuning procedure:

In [46]:
report(tuned)

(parameter_names = ["bagging_fraction" "atom.n_subfeatures"],
 parameter_scales = Symbol[:log10 :linear],
 parameter_values = Any[0.5 1; 0.5325205447199813 1; … ; 0.9389309106617063 4; 1.0 4],
 measurements = [0.23836844761285972, 0.24310768116519496, 0.23155959227133427, 0.2358303191590729, 0.23388918367157183, 0.23944002555125055, 0.22931761600908399, 0.22924432030705047, 0.22621287086704908, 0.23123283225576788  …  0.1830737398891659, 0.19017188641338933, 0.2062563314942637, 0.2041996514962502, 0.210012168891926, 0.21305031478959782, 0.22735490003858747, 0.2359797272653158, 0.2584476524785048, 0.32572198859316304],
 best_measurement = 0.17477371176810844,)

To plot the results of a 2D parameter tune, use `using Plots; pyplot();
plot(tuned)`.
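
The `bagging_fraction` values listed in `parameter_values` are consistent with 12 points spaced evenly on a log10 scale between 0.5 and 1.0. The plain-Julia sketch below reproduces that spacing; `log10_grid` is a hypothetical helper written for this check and is not how MLJ's `Grid` strategy is necessarily implemented.

```julia
# Hypothetical helper: n points from `lower` to `upper`, evenly spaced
# in log10, then mapped back to the original scale.
function log10_grid(lower, upper, n)
    exponents = range(log10(lower), stop=log10(upper), length=n)
    return [10.0^x for x in exponents]
end

grid = log10_grid(0.5, 1.0, 12)

second_point = grid[2]           # compare with 0.5325205447199813 in the report
second_to_last = grid[end - 1]   # compare with 0.9389309106617063 in the report
```

On a log scale consecutive grid points share a constant ratio rather than a constant difference, so the points sit closer together near the lower bound than they would on a linear grid.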

Predicting on new data using the optimized model:

In [47]:
predict(tuned, Xnew)

3-element Array{UnivariateFinite{String,UInt32,Float64},1}:
 UnivariateFinite(setosa=>0.9677419354838652, versicolor=>0.01612903225806445, virginica=>0.01612903225806445)
 UnivariateFinite(setosa=>0.9677419354838652, versicolor=>0.01612903225806445, virginica=>0.01612903225806445)
 UnivariateFinite(setosa=>0.9677419354838652, versicolor=>0.01612903225806445, virginica=>0.01612903225806445)

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*