Realize performance improvements for models implementing new data front-end #501
Squashed commits:
- add fit_only! logic that caches reformatted/resampled user-data
- add test
- minor comment change
- "Bump version to 0.16.3"
- resolve #499
- add test
- add tests for reformat front-end
- oops
- add model api tests for reformat front-end
- tidy up
- add test for reformat logic in learning network context
- do not update state of a Machine{<:Static} on fit! unless hyperparams change
- add tests
- allow speedup buy-out with machine(model, args..., cache=false)
- oops
- have KNN models buy into reformat data front-end for better testing
- introduce "back-end" resampling in evaluate!
- implement data front-end on prediction/transformation side
- update machine show
- more tests; add cache=... to evaluate(model, ...)
- more tests
- make `cache` hyperparam of Resampler for passing onto wrapped machines
- more tests
- Composite models do not cache data by default (no benefit)
- correct comment
- bump [compat] MLJModelInterface = "^0.3.7" (essential)
Codecov Report

```diff
@@            Coverage Diff             @@
##              dev     #501      +/-   ##
==========================================
- Coverage   82.72%   82.57%    -0.15%
==========================================
  Files          39       39
  Lines        2935     2962       +27
==========================================
+ Hits         2428     2446       +18
- Misses        507      516        +9
```
LGTM
Continuation of #492 with squashed commits.
Addresses key parts of #309.
Builds on MLJModelInterface PR 76 (now part of MLJModelInterface 0.3.7), which provides a way for an implementer of MLJ's model interface to articulate: (i) how user data (eg, a table) is transformed into some model-specific representation needed by the core algorithm (eg, a matrix) - the implementer overloads `reformat`; and (ii) how one constructs subsamples of the model-specific representation, given a vector of observation indices - the implementer overloads `selectrows`. Detailed responsibilities of this optional "data front-end" implementation are given in this proposed doc PR.

This PR introduces optional caching of the model-specific data representations for machines, and for subsamples of these representations, which can speed up computations at the cost of larger memory allocations. Machines have a new type parameter `C` which determines whether or not they buy into data caching, with `C=true` the default. A user opts out during machine construction, as in `mach = machine(model, X, y, cache=false)`.
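For orientation, here is a minimal sketch of what buying into the front-end looks like, following the pattern documented for MLJModelInterface. Here `SomeSupervised` is a hypothetical model type, and the adjoint-matrix representation is just one possible choice:

```julia
import MLJModelInterface
const MMI = MLJModelInterface

# hypothetical deterministic supervised model:
mutable struct SomeSupervised <: MMI.Deterministic end

# (i) user data -> model-specific representation (table -> adjoint matrix):
MMI.reformat(::SomeSupervised, X, y) = (MMI.matrix(X)', y)
MMI.reformat(::SomeSupervised, X) = (MMI.matrix(X)',)

# (ii) subsampling the model-specific representation, given row indices I:
MMI.selectrows(::SomeSupervised, I, Xmatrix, y) =
    (view(Xmatrix, :, I), view(y, I))
MMI.selectrows(::SomeSupervised, I, Xmatrix) = (view(Xmatrix, :, I),)
```

With these overloadings, `fit` and the operations receive `Xmatrix` (and `y`) in place of the user-supplied data, and machines can cache and subsample that representation directly.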
Some key implementation details to help with a review:
- A new machine type parameter `C` (see above), defaulting to `true`, except for composite models, where caching provides no obvious advantage. (Machines within a learning network defining the composite may still cache with some benefit.)
- New machine fields `data` and `resampled_data`, for caching model-specific representations of what is returned by `[arg() for arg in args()]` (recall `args` is a tuple of `Source` or `Node` objects). We will refer to the latter as user-supplied data.
- `fit_only!` at https://github.com/alan-turing-institute/MLJBase.jl/blob/2eda522877451fdd26838faf279ac74c5a5a837c/src/machines.jl#L417 is adapted to apply `reformat` to the user-supplied data when `fit!` is first called, or when the user-supplied data has changed because of upstream changes in a learning network. This data is cached for future calls to `fit!`; `data` is resampled according to the `rows` specified in the `fit!` call and also cached. Future calls to `fit!` only perform `reformat` and `selectrows` as needed. All this assumes `C=true`; otherwise `reformat` and `selectrows` are called every time and nothing is cached.
- If `predict`, say, is called on a machine with the `rows=...` syntax then, assuming the machine has caching enabled, `selectrows` is applied to the cached `data`, avoiding a call to `reformat`. If new data `Xnew` is supplied instead, as in `predict(mach, Xnew)`, then `reformat` is naturally necessary. The relevant changes are in src/operations.jl. (See the first sketch following this list.)
- In `evaluate!`, instead of calling `predict(mach, Xtest)` as currently, we call `predict(mach, rows=test)`. This avoids one call to `reformat`. However, there is one case distinction: the uncommon case where one of the measures has `is_feature_dependent=true` (for measures with `X` in the signature), in which case `Xtest` must still be created, using the slow `selectrows(X, test)` fallback.
- A new `cache=...` keyword in machine constructors, and in the `Resampler` objects. An interface point for `cache` in `TunedModels` is also needed, and is addressed in this MLJTuning PR, which passes locally (Julia 1.5). (See the second sketch following this list.)
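Below is a usage sketch of the machine-level caching behavior (the first sketch referenced in the list above). The `MMI.fit`/`MMI.predict` methods are minimal hypothetical stand-ins so the example is self-contained; the comments record where `reformat` and `selectrows` are expected to fire:

```julia
using MLJBase, Statistics

# minimal methods so SomeSupervised (defined earlier) is usable; a real
# model does something less trivial than predicting the training mean:
MMI.fit(::SomeSupervised, verbosity, Xmatrix, y) = (mean(y), nothing, nothing)
MMI.predict(::SomeSupervised, fitresult, Xmatrix) =
    fill(fitresult, size(Xmatrix, 2))

X, y = make_regression(200, 3)          # synthetic table and target vector

mach = machine(SomeSupervised(), X, y)  # caching on by default (C=true)

fit!(mach, rows=1:100)       # reformat applied to user-supplied data and
                             # cached; rows 1:100 cached as resampled_data
fit!(mach, rows=1:100)       # neither reformat nor selectrows called again

predict(mach, rows=101:150)  # selectrows on the cached data; no reformat
Xnew = selectrows(X, 151:160)
predict(mach, Xnew)          # new user-data, so reformat is applied

# opting out:
mach2 = machine(SomeSupervised(), X, y, cache=false)  # reformat/selectrows
                                                      # called every time
```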
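And the second sketch referenced in the list: the new `cache` interface points. The exact keyword sets shown for `evaluate` and `Resampler` are assumptions based on this PR's commit messages, not a definitive API:

```julia
# cache=false is passed on to the machines created internally:
evaluate(SomeSupervised(), X, y,
         resampling=CV(nfolds=3),
         measure=rms,
         cache=false)

# the same hyperparameter on the Resampler wrapper:
resampler = MLJBase.Resampler(model=SomeSupervised(),
                              resampling=Holdout(fraction_train=0.7),
                              measure=rms,
                              cache=false)
mach = machine(resampler, X, y)
fit!(mach)
```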