## A tour of MLJ

### Models, machines, basic training and testing

Let's generate some synthetic data and define train and test rows:

In [1]:
using MLJ
using DataFrames

Xraw = rand(300,3)
y = exp.(Xraw[:,1] - Xraw[:,2] - 2Xraw[:,3] + 0.1*rand(300))  # elementwise exp
X = DataFrame(Xraw)

train, test = partition(eachindex(y), 0.70); # 70:30 split

A *model* is a container for hyperparameters:

In [2]:
knn_model = KNNRegressor(K=10)

# KNNRegressor @ 1…66: 
K                       =>   10
metric                  =>   euclidean (generic function with 1 method)
kernel                  =>   reciprocal (generic function with 1 method)



Wrapping the model in data creates a *machine* which will store training outcomes (called *fit-results*):

In [3]:
knn = machine(knn_model, X, y)

# Machine{KNNRegressor} @ 1…92: 
model                   =>   KNNRegressor @ 1…66
fitresult               =>   (undefined)
cache                   =>   (undefined)
args                    =>   (omitted Tuple{DataFrame,Array{Float64,1}} of length 2)
report                  =>   empty Dict{Symbol,Any}
rows                    =>   (undefined)



Training on the training rows and evaluating on the test rows:

In [4]:
fit!(knn, rows=train)
yhat = predict(knn, X[test,:])
rms(y[test], yhat)

┌ Info: Training Machine{KNNRegressor} @ 1…92.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:69


0.13421003086132813

Our machine/model constructs and the associated fit/predict syntax anticipate a powerful extension for building networks of learners, described later. Changing a hyperparameter and re-evaluating:

In [5]:
knn_model.K = 20
fit!(knn)
yhat = predict(knn, X[test,:])
rms(y[test], yhat)

┌ Info: Training Machine{KNNRegressor} @ 1…92.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:69


0.10836269415394763
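
The separation between a model (hyperparameters only) and a machine (model, data and training outcome) can be mimicked in a few lines of plain Julia. The sketch below is an illustration only: `ToyKNN`, `ToyKNNMachine`, `toy_fit!` and `toy_predict` are hypothetical names invented here, the data is one-dimensional, and nothing in it reflects how MLJ is actually implemented.

```julia
using Statistics

# Hypothetical toy types, for illustration only (not MLJ's implementation).
mutable struct ToyKNN                 # the "model": hyperparameters only
    K::Int
end

mutable struct ToyKNNMachine          # the "machine": model + data + fit-result
    model::ToyKNN
    X::Vector{Float64}
    y::Vector{Float64}
    fitresult::Any
end
ToyKNNMachine(model, X, y) = ToyKNNMachine(model, X, y, nothing)

# "Training" a nearest-neighbour regressor just stores the training rows.
function toy_fit!(mach::ToyKNNMachine; rows=eachindex(mach.y))
    mach.fitresult = (mach.X[rows], mach.y[rows])
    return mach
end

# Predict by averaging the targets of the K nearest training points.
function toy_predict(mach::ToyKNNMachine, xnew::Vector{Float64})
    Xtrain, ytrain = mach.fitresult
    K = mach.model.K
    return [mean(ytrain[partialsortperm(abs.(Xtrain .- x), 1:K)]) for x in xnew]
end

model = ToyKNN(1)
mach = ToyKNNMachine(model, [0.0, 1.0, 2.0, 3.0], [0.0, 10.0, 20.0, 30.0])

toy_fit!(mach, rows=1:3)
toy_predict(mach, [0.1])   # uses the single nearest neighbour

model.K = 2                # mutate the hyperparameter in place
toy_fit!(mach, rows=1:3)
toy_predict(mach, [0.1])   # now averages the two nearest neighbours
```

Because the machine holds a reference to the model rather than a copy, mutating `model.K` is enough for the next fit to pick up the new value, which is the pattern used with `knn_model.K = 20` above.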

### Homogeneous ensembles

Here's a bagged ensemble model for 20 K-nearest neighbour regressors:

In [6]:
ensemble_model = EnsembleModel(atom=knn_model, n=20)

# DeterministicEnsembleModel @ 1…58: 
atom                    =>   KNNRegressor @ 1…66
weights                 =>   0-element Array{Float64,1}
bagging_fraction        =>   0.8
rng_seed                =>   0
n                       =>   20
parallel                =>   true



In [7]:
@more

# DeterministicEnsembleModel @ 1…58: 
atom                    =>   KNNRegressor @ 1…66
weights                 =>   0-element Array{Float64,1}
bagging_fraction        =>   0.8
rng_seed                =>   0
n                       =>   20
parallel                =>   true

## KNNRegressor @ 1…66: 
K                       =>   20
metric                  =>   euclidean (generic function with 1 method)
kernel                  =>   reciprocal (generic function with 1 method)



It can be trained and tested the same as any other model:

In [8]:
ensemble = machine(ensemble_model, X, y)
fit!(ensemble, rows=train)
yhat = predict(ensemble, X[test, :])
rms(y[test], yhat)

┌ Info: Training Machine{DeterministicEnsembleMod…} @ 6…03.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:69
Training ensemble:  10%[=====>                                            ]  ETA: 0:00:05






0.1948989734481266

### Systematic tuning

Let's simultaneously tune the ensemble's `bagging_fraction` and the K-nearest neighbour hyperparameter `K`. Since one of these models is a field of the other, we have nested hyperparameters:

In [9]:
params(ensemble_model)

Params(:atom => Params(:K => 20, :metric => MLJ.KNN.euclidean, :kernel => MLJ.KNN.reciprocal), :weights => Float64[], :bagging_fraction => 0.8, :rng_seed => 0, :n => 20, :parallel => true)

To define a tuning grid, we construct ranges for the two parameters and collate these ranges following the same nesting pattern as above (omitting parameters that don't change):

In [10]:
B_range = range(ensemble_model, :bagging_fraction, lower=0.5, upper=1.0, scale=:linear)
K_range = range(knn_model, :K, lower=1, upper=100, scale=:log10)
nested_ranges = Params(:atom => Params(:K => K_range), :bagging_fraction => B_range)

Params(:atom => Params(:K => NumericRange @ 1…09), :bagging_fraction => NumericRange @ 1…77)

Now we choose a tuning strategy:

In [11]:
tuning = Grid(resolution=12)

# Grid @ 2…60: 
resolution              =>   12
parallel                =>   true



And a resampling strategy:

In [12]:
resampling = Holdout(fraction_train=0.8)

# Holdout @ 1…22: 
fraction_train          =>   0.8



And define a new model which wraps these strategies around our ensemble model:

In [13]:
tuned_ensemble_model = TunedModel(model=ensemble_model,
    tuning_strategy=tuning, resampling_strategy=resampling, nested_ranges=nested_ranges)

# TunedModel @ 1…47: 
model                   =>   DeterministicEnsembleModel @ 1…58
tuning_strategy         =>   Grid @ 2…60
resampling_strategy     =>   Holdout @ 1…22
measure                 =>   rms (generic function with 5 methods)
operation               =>   predict (generic function with 19 methods)
nested_ranges           =>   Params(:atom => Params(:K => NumericRange @ 1…09), :bagging_fraction => NumericRange @ 1…77)
report_measurements     =>   true



Fitting the corresponding machine tunes the underlying model (in this case an ensemble) and retrains on all supplied data:

In [14]:
tuned_ensemble = machine(tuned_ensemble_model, X[train,:], y[train])
fit!(tuned_ensemble);

┌ Info: Training Machine{TunedModel{Grid,Determin…} @ 1…95.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:69
┌ Info: Training best model on all supplied data.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/tuning.jl:107


We can inspect the best model by looking at the machine's `report` field (every machine has one):

In [15]:
tuned_ensemble.report

Dict{Symbol,Any} with 4 entries:
  :measurements     => [0.0579081, 0.0582973, 0.070172, 0.0815501, 0.0929576, 0…
  :models           => DeterministicEnsembleModel{Tuple{Array{Float64,2},Array{…
  :best_model       => DeterministicEnsembleModel @ 1…72
  :best_measurement => 0.0522812

In [16]:
best_model = tuned_ensemble.report[:best_model]
@show best_model.bagging_fraction
@show best_model.atom.K

best_model.bagging_fraction = 0.7727272727272727
(best_model.atom).K = 2


2
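
For intuition about where these values come from, a grid of resolution 12 over the two ranges could be generated in plain Julia as sketched below. This is an assumption about the general idea only: how MLJ's `Grid` strategy actually constructs its grid, and how it rounds integer parameters such as `K`, is not shown in this tour and may differ in detail. The names `B_grid` and `K_grid` are introduced here for illustration.

```julia
# A sketch only: MLJ's own grid construction may differ in detail.
resolution = 12

# Linear scale: equally spaced values between the lower and upper bounds.
B_grid = collect(range(0.5, stop=1.0, length=resolution))

# Log10 scale: equally spaced exponents between log10(lower) and log10(upper).
# K is an integer parameter, so the values are rounded (an assumption), and
# duplicates created by the rounding are dropped.
exponents = range(log10(1), stop=log10(100), length=resolution)
K_grid = unique(round.(Int, 10 .^ exponents))

println(B_grid)
println(K_grid)
```

Under these assumptions, the seventh linear grid point is 0.5 + 6 × (0.5/11) ≈ 0.7727, which agrees with the `bagging_fraction` reported for the best model, and `K = 2` lies on the rounded log grid. Note that the log scale places most candidate values of `K` at the low end of the range.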

Evaluating the tuned model:

In [17]:
yhat = predict(tuned_ensemble, X[test,:])
rms(yhat, y[test])

0.09660282367949262

### Learning networks

MLJ has a flexible interface for building networks from multiple machine learning elements, whose complexity extends beyond linear "pipelines", and with a minimum of added abstraction.

In MLJ, a *learning network* is a graph whose nodes apply an operation, such as `predict` or `transform`, using a fixed machine (requiring training) - or which, alternatively, apply a regular (untrained) mathematical operation to their input(s). In practice, a learning network works with *fixed* sources for its training/evaluation data, but can be built and tested in stages. By contrast, an *exported learning network* is a learning network exported as a stand-alone, re-usable `Model` object, to which all the MLJ `Model` meta-algorithms can be applied (ensembling, systematic tuning, etc).

As we shall see, exporting a learning network as a reusable model is very easy.

### Building a simple learning network

![](wrapped_ridge.png)

The diagram above depicts a learning network which standardises the input data, `X`, learns an optimal Box-Cox transformation for the target, `y`, predicts new targets using ridge regression, and then inverse-transforms those predictions (for later comparison with the original test data). The machines are labelled yellow. 

To implement the network, we begin by loading all data needed for training and evaluation into *source nodes*:

In [18]:
Xs = source(X)
ys = source(y)

Source @ 9…69

We label nodes according to their outputs in the diagram. Notice that the nodes `z` and `yhat` use the same machine `box` for different operations. 

To construct the `W` node we first need to define the machine `stand` that it will use to transform inputs. 

In [19]:
stand_model = Standardizer()
stand = machine(stand_model, Xs)

NodalMachine @ 1…66 = machine(Standardizer @ 1…22, 1…12)

Because `Xs` is a node, instead of concrete data, we can call `transform` on the machine without first training it, and the result is the new node `W`, instead of concrete transformed data:

In [20]:
W = transform(stand, Xs)

Node @ 1…34 = transform(1…66, 1…12)

To get actual transformed data we *call* the node appropriately, which requires that we first train the node. Training a node, rather than a machine, triggers training of *all* necessary machines in the network.

In [21]:
fit!(W, rows=train)
W()          # transform all data
W(rows=test) # transform only test data
W(X[3:4,:])  # transform any data, new or old

┌ Info: Training NodalMachine{Standardizer} @ 1…66.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:69


|   | x1 | x2 | x3 |
|---|---|---|---|
|   | Float64 | Float64 | Float64 |
| 1 | -1.61979 | 0.326544 | -0.815113 |
| 2 | 1.75929 | -0.413434 | -1.10023 |


If you like, you can think of `W` (and the other nodes we will define) as "dynamic data": `W` is *data*, in the sense that it can be called ("indexed") on rows, but *dynamic*, in the sense that the result depends on the outcome of training events.
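
The "dynamic data" idea can be illustrated with callable structs in plain Julia. The following is a toy sketch under strong simplifying assumptions (one-dimensional data, a single standardising machine); the names `ToySource`, `ToyStandardizer`, `ToyNode`, `toy_train!` and `standardize` are hypothetical and do not correspond to MLJ's internal types.

```julia
using Statistics

# Hypothetical toy types, for illustration only (not MLJ's implementation).
struct ToySource                      # holds fixed training/evaluation data
    data::Vector{Float64}
end
(s::ToySource)(; rows=Colon()) = s.data[rows]       # "index" the source on rows
(s::ToySource)(xnew::Vector{Float64}) = xnew        # or pass new data through

mutable struct ToyStandardizer        # plays the role of a machine
    mu::Float64
    sigma::Float64
end

struct ToyNode                        # applies a machine to an upstream source
    machine::ToyStandardizer
    input::ToySource
end

# Training the node trains the machine it depends on, using the given rows.
function toy_train!(node::ToyNode; rows=Colon())
    x = node.input(rows=rows)
    node.machine.mu = mean(x)
    node.machine.sigma = std(x)
    return node
end

standardize(m::ToyStandardizer, x) = (x .- m.mu) ./ m.sigma

# Calling the node yields data that depends on the most recent training.
(n::ToyNode)(; rows=Colon()) = standardize(n.machine, n.input(rows=rows))
(n::ToyNode)(xnew::Vector{Float64}) = standardize(n.machine, n.input(xnew))

xs = ToySource([1.0, 2.0, 3.0, 4.0])
w = ToyNode(ToyStandardizer(0.0, 1.0), xs)

toy_train!(w, rows=1:3)
w()            # transform all data
w(rows=4:4)    # transform only the held-out row
w([10.0])      # transform new data
```

Here `w` mirrors the three ways of calling `W` shown earlier: with no arguments, with `rows`, and with new data. Retraining on different rows changes what each call returns, which is the sense in which the node is dynamic.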

The other nodes of our network are defined similarly:

In [22]:
box_model = UnivariateBoxCoxTransformer()  # for making data look normally-distributed
box = machine(box_model, ys)
z = transform(box, ys)

ridge_model = RidgeRegressor(lambda=0.1)
ridge = machine(ridge_model, W, z)
zhat = predict(ridge, W)

yhat = inverse_transform(box, zhat)

Node @ 1…26 = inverse_transform(1…67, predict(2…48, transform(1…66, 1…12)))

We are ready to train and evaluate the completed network. Notice that the standardizer, `stand`, is *not* retrained, as MLJ remembers that it was trained earlier:

In [23]:
fit!(yhat, rows=train)
rms(y[test], yhat(rows=test)) # evaluate

┌ Info: Not retraining NodalMachine{Standardizer} @ 1…66. It is up-to-date.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/networks.jl:201
┌ Info: Training NodalMachine{UnivariateBoxCoxTransfor…} @ 1…67.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:69
┌ Info: Training NodalMachine{RidgeRegressor} @ 2…48.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:69


0.04456622213088941

In [24]:
yhat(X[3:4,:])  # predict on new or old data

2-element Array{Float64,1}:
 0.34118069282270513
 1.205787233700913  

We can change hyperparameters and retrain:

In [25]:
ridge_model.lambda = 0.01
fit!(yhat, rows=train) 
rms(y[test], yhat(rows=test))

┌ Info: Not retraining NodalMachine{UnivariateBoxCoxTransfor…} @ 1…67. It is up-to-date.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/networks.jl:201
┌ Info: Not retraining NodalMachine{Standardizer} @ 1…66. It is up-to-date.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/networks.jl:201
┌ Info: Updating NodalMachine{RidgeRegressor} @ 2…48.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:73


0.044294151331316076

> **Notable feature.** The machine, `ridge::NodalMachine{RidgeRegressor}`, is retrained, because its underlying model has been mutated. However, since the outcome of this training has no effect on the training inputs of the machines `stand` and `box`, these transformers are left untouched. (During construction, each node and machine in a learning network determines and records all machines on which it depends.) This behaviour, which extends to exported learning networks, means we can tune our wrapped regressor without re-computing transformations each time the hyperparameter is changed. 
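
One simple way to picture this bookkeeping is for each machine to remember the hyperparameter value it was last trained with, and to skip training when nothing has changed. The sketch below is a deliberately reduced illustration: `TrackedModel`, `TrackedMachine` and `tracked_fit!` are hypothetical names, and the sketch ignores the second condition a real network must check, namely whether any upstream machine was itself retrained. It is not a description of MLJ's actual mechanism.

```julia
# Hypothetical toy types, for illustration only (not MLJ's implementation).
mutable struct TrackedModel
    hyperparameter::Float64
end

mutable struct TrackedMachine
    name::String
    model::TrackedModel
    trained_with::Union{Nothing,Float64}   # hyperparameter value at the last fit
    fit_count::Int
end
TrackedMachine(name, model) = TrackedMachine(name, model, nothing, 0)

# Retrain only when the model has been mutated since the last fit.
function tracked_fit!(mach::TrackedMachine)
    if mach.trained_with == mach.model.hyperparameter
        println("Not retraining ", mach.name, ". It is up-to-date.")
    else
        mach.trained_with = mach.model.hyperparameter
        mach.fit_count += 1
        println("Training ", mach.name, ".")
    end
    return mach
end

stand = TrackedMachine("stand", TrackedModel(0.0))
ridge = TrackedMachine("ridge", TrackedModel(0.1))
network = [stand, ridge]            # upstream machines first

foreach(tracked_fit!, network)      # first pass: both machines train
ridge.model.hyperparameter = 0.01   # mutate only the downstream model
foreach(tracked_fit!, network)      # second pass: only `ridge` trains again
```

After the second pass `stand` has been trained once and `ridge` twice, which is the saving that matters when tuning: an upstream transformation is not recomputed for every candidate value of a downstream hyperparameter.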

### Exporting a learning network as a composite model

To export a learning network:
- Define a new `mutable struct` model type.
- Wrap the learning network code in a model `fit` method.

All learning networks that make deterministic (or probabilistic) predictions export as models of subtype `Deterministic{Node}` (respectively, `Probabilistic{Node}`):


In [26]:
mutable struct WrappedRidge <: Deterministic{Node}
    ridge_model
end

Now that we are satisfied that our wrapped ridge regression learning network works, we simply cut and paste its defining code into a `fit` method:

In [27]:
function MLJ.fit(model::WrappedRidge, X, y)
    Xs = source(X)
    ys = source(y)

    stand_model = Standardizer()
    stand = machine(stand_model, Xs)
    W = transform(stand, Xs)

    box_model = UnivariateBoxCoxTransformer()  # for making data look normally-distributed
    box = machine(box_model, ys)
    z = transform(box, ys)

    ridge_model = model.ridge_model ###
    ridge = machine(ridge_model, W, z)
    zhat = predict(ridge, W)

    yhat = inverse_transform(box, zhat)
    fit!(yhat, verbosity=0)
    
    return yhat
end

The line marked `###`, where the new exported model's hyperparameter `ridge_model` is spliced into the network, is the only modification.

This completes the export process.

> **What's going on here?** MLJ's machine interface is built atop a more primitive *[model](adding_new_models.md)* interface, implemented for each algorithm. Each supervised model type (e.g., `RidgeRegressor`) requires model `fit` and `predict` methods, which are called by the corresponding machine `fit!` and `predict` methods. We don't need to define a model `predict` method here because MLJ provides a fallback which simply calls the node returned by `fit` on the data supplied: `MLJ.predict(model::Supervised{Node}, Xnew) = yhat(Xnew)`.

Let's now tune our wrapped ridge model on the Boston dataset:

In [28]:
X, y = X_and_y(load_boston())
train, test = partition(eachindex(y), 0.7)
wrapped_model = WrappedRidge(ridge_model)

# WrappedRidge @ 1…35: 
ridge_model             =>   RidgeRegressor @ 1…12



In [29]:
params(wrapped_model)

Params(:ridge_model => Params(:lambda => 0.01))

In [30]:
nested_ranges = Params(:ridge_model => Params(:lambda => range(ridge_model, :lambda, lower=0.1, upper=100.0, scale=:log10)))

Params(:ridge_model => Params(:lambda => NumericRange @ 4…52))

In [31]:
tuned_wrapped_model = TunedModel(model=wrapped_model, tuning_strategy=Grid(resolution=100),
    nested_ranges=nested_ranges)

# TunedModel @ 8…06: 
model                   =>   WrappedRidge @ 1…35
tuning_strategy         =>   Grid @ 9…01
resampling_strategy     =>   Holdout @ 9…82
measure                 =>   rms (generic function with 5 methods)
operation               =>   predict (generic function with 19 methods)
nested_ranges           =>   Params(:ridge_model => Params(:lambda => NumericRange @ 4…52))
report_measurements     =>   true



In [32]:
tuned_wrapped = machine(tuned_wrapped_model, X, y)
fit!(tuned_wrapped, rows=train);

┌ Info: Training Machine{TunedModel{Grid,WrappedR…} @ 1…65.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:69
┌ Info: Training best model on all supplied data.
└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/tuning.jl:107


In [33]:
@show tuned_wrapped.report[:best_model].ridge_model.lambda
@show tuned_wrapped.report[:best_measurement]

((tuned_wrapped.report[:best_model]).ridge_model).lambda = 23.101297000831593
tuned_wrapped.report[:best_measurement] = 2.331822749246328


2.331822749246328