
Realize performance improvements for models implementing new data front-end #501

Merged: 4 commits merged into dev on Jan 26, 2021

Conversation

@ablaom (Member) commented Jan 4, 2021

Continuation of #492 with squashed commits.

Addresses key parts of #309.

Builds on MLJModelInterface PR 76 (now part of MLJModelInterface 0.3.7), which provides a way for an implementer of MLJ's model interface to articulate: (i) how user data (e.g., a table) is transformed into the model-specific representation needed by the core algorithm (e.g., a matrix) - the implementer overloads reformat; and (ii) how to construct samples of the model-specific representation from a vector of observation indices - the implementer overloads selectrows. The detailed responsibilities of this optional "data front-end" implementation are given in this proposed doc PR.
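To make the division of responsibilities concrete, here is a minimal, hypothetical sketch of a data front-end for a model whose core algorithm consumes a matrix rather than a table (the model name and method bodies are illustrative, not from this PR):

```julia
import MLJModelInterface
const MMI = MLJModelInterface

struct MyRegressor <: MMI.Deterministic end

# (i) transform user data (a table X and target y) into the
# model-specific representation (here, a matrix plus y unchanged):
MMI.reformat(::MyRegressor, X, y) = (MMI.matrix(X), y)
MMI.reformat(::MyRegressor, X) = (MMI.matrix(X),)

# (ii) construct samples of the model-specific representation,
# given a vector I of observation indices:
MMI.selectrows(::MyRegressor, I, Xmatrix, y) = (view(Xmatrix, I, :), y[I])
MMI.selectrows(::MyRegressor, I, Xmatrix) = (view(Xmatrix, I, :),)
```

Both methods return tuples mirroring the data arguments, so the output of selectrows can be splatted directly into the core fitting and operation methods.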

This PR introduces optional caching of the model-specific data representations for machines and for subsamples of these representations, which could speed up computations (at the cost of larger memory allocations). Machines have a new type parameter C which determines whether or not they buy into data caching, with C=true the default. A user opts-out during machine construction, as in mach = machine(model, X, y, cache=false).
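Under these assumptions, the opt-in/opt-out behavior looks roughly like this (a sketch only; model, X, y stand for any valid machine arguments):

```julia
mach = machine(model, X, y)     # caching enabled by default (C = true)
fit!(mach, rows=1:100)  # reformat and selectrows called; results cached
fit!(mach, rows=1:100)  # cached representations reused

mach2 = machine(model, X, y, cache=false)  # opt out (C = false)
fit!(mach2, rows=1:100)  # reformat and selectrows called every time
```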

Some key implementation details to help with a review:

  • machines have a new type parameter C (see above) defaulting to true, except for composite models, where caching provides no obvious advantage. (Machines within a learning network defining the composite may still cache, with some benefit.)
  • machines have new fields data and resampled_data for caching model-specific representations of what is returned by [arg() for arg in args] (recall args is a tuple of Source or Node objects). We refer to the latter as the user-supplied data.
  • the logic of fit_only! at https://github.com/alan-turing-institute/MLJBase.jl/blob/2eda522877451fdd26838faf279ac74c5a5a837c/src/machines.jl#L417 is adapted so that reformat is applied to the user-supplied data when fit! is first called, or when the user-supplied data changes because of upstream changes in a learning network. This data is cached for future calls to fit!; the data is also resampled according to the rows specified in the fit! call, and the resampled data is cached as well. Subsequent calls to fit! perform reformat and selectrows only as needed. All this assumes C=true; otherwise reformat and selectrows are called every time and nothing is cached.
  • when an operation, say predict, is called on a machine with the rows=... syntax then, assuming the machine has caching enabled, selectrows is applied to the cached data, avoiding a call to reformat. If new data Xnew is supplied instead, as in predict(mach, Xnew), then a call to reformat is naturally necessary. The relevant changes are in src/operations.jl.
  • in resampling, instead of calling predict(mach, Xtest), as currently, we call predict(mach, rows=test), which avoids one call to reformat. There is one case distinction: in the uncommon case that one of the measures has is_feature_dependent=true (measures with X in the signature), Xtest must still be created, using the slow selectrows(X, test) fallback.
  • add interface points for specifying cache=... in machine constructors and in Resampler objects. An interface point for cache in TunedModel is also needed; this is addressed in this MLJTuning PR, which passes locally (Julia 1.5).
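Putting the operation and resampling points above together, a usage sketch (assuming a fitted machine mach with caching enabled, and a holdout index vector test):

```julia
predict(mach, rows=test)  # selectrows applied to the cached
                          # representation; no call to reformat
predict(mach, Xnew)       # new user data supplied: reformat is unavoidable

# cache=... is also exposed at the evaluation level:
evaluate(model, X, y, resampling=CV(nfolds=6), measure=rms, cache=false)
```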

add fit_only! logic that caches reformatted/resampled user-data

add test

minor comment change

"Bump version to 0.16.3"

resolve #499

add test

add tests for reformat front-end

oops

add model api tests for reformat front-end

tidy up

add test for reformat logic in learning network context

do not update state of a Machine{<:Static} on fit! unless hyperparams change

add tests

allow speedup buy-out with machine(model, args..., cache=false)

oops

have KNN models buy into reformat data front-end for better testing

introduce "back-end" resampling in evaluate!

implement data front-end on prediction/transformation side

update machine show

more tests; add cache=... to evaluate(model, ...)

more tests

make `cache` hyperparam of Resampler for passing onto wrapped machines

more tests

Composite models do not cache data by default (no benefit)

correct comment

bump [compat] MLJModelInterface = "^0.3.7" (essential)
@codecov-io commented Jan 4, 2021

Codecov Report

Merging #501 (2eda522) into dev (6eb7eab) will decrease coverage by 0.14%.
The diff coverage is 77.77%.


@@            Coverage Diff             @@
##              dev     #501      +/-   ##
==========================================
- Coverage   82.72%   82.57%   -0.15%     
==========================================
  Files          39       39              
  Lines        2935     2962      +27     
==========================================
+ Hits         2428     2446      +18     
- Misses        507      516       +9     
Impacted Files Coverage Δ
src/composition/learning_networks/machines.jl 92.24% <ø> (ø)
src/machines.jl 83.22% <75.00%> (-0.80%) ⬇️
src/operations.jl 76.19% <81.81%> (-0.29%) ⬇️
src/resampling.jl 85.91% <81.81%> (-0.11%) ⬇️
src/composition/models/methods.jl 53.48% <100.00%> (+1.10%) ⬆️
src/composition/learning_networks/arrows.jl 26.66% <0.00%> (-20.01%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6eb7eab...2eda522.


@OkonSamuel (Member) left a comment:

LGTM
