Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipelines without macros #639

Closed
wants to merge 33 commits into from
Closed

Pipelines without macros #639

wants to merge 33 commits into from

Conversation

ablaom
Copy link
Member

@ablaom ablaom commented Sep 17, 2021

This PR is a step towards replacement of @pipeline with non-macro alternatives (JuliaAI/MLJ.jl#594). It implements new parametric types for pipelines and a new constructor Pipeline, following the suggestions of @CameronBieganek in the cited issue.

The new pipelines offer all the features of @pipeline with the exception of target transformations, which are to be provided with a separate wrapper. Additional features are:

  • inverse_transform is implemented, where that makes sense (there was an issue raised, which I cannot find)
  • implement predict for unsupervised pipelines ending in a model that implements predict (clustering) (Cannot call predict on a @pipeline when the last component is a transformer MLJClusteringInterface.jl#10)
  • option to specify cache=false to suppress data caching at internal nodal machines
  • Alternative constructor using |>. When two pipelines are concatenated, the result is flat (does not introduce an extra layer of hyper-parameter nesting).

The new doc-string is copied below under "details".

Conflict with existing feature

The overloading of |> conflicts with the "arrow syntax" described here but never actually documented in the manual. It's purpose is to reduce code needed to build learning networks. I have not seen it used anywhere outside of the cited tutorial and would suggest linear pipelines, as a more common use-case, would be a better use. A non-breaking alternative would be to overload the generic associative operator *.

@tlienart may wish to comment.

What do others think?

edit The actual doc-string has minor changes.

Pipeline(component1, component2, ... , componentk; options...)
Pipeline(name1=component1, name2=component2, ..., componentk; options...)
component1 |> component2 |> ... |> componentk

Create an instance of composite model type which sequentially composes
the specified components in order. This means component1 receives
inputs, whose output is passed to component2, and so forth. A
"component" is either a Model instance, a model type (converted
immediately to its default instance) or any callable object.

At most one of the components may be a supervised model, but this
model can appear in any position.

The @pipeline macro accepts key-word options discussed further
below.

Ordinary functions (and other callables) may be inserted in the
pipeline as shown in the following example:

Pipeline(X->coerce(X, :age=>Continuous), OneHotEncoder, ConstantClassifier)

Syntactic sugar

The |> operator is overloaded to construct pipelines out of models,
functions, and existing pipelines:

LinearRegressor = @load LinearRegressor pkg=MLJLinearModels add=true
PCA = @load PCA pkg=MultivariateStats add=true

pipe1 = MLJBase.table |> ContinuousEncoder |> Standardizer
pipe2 = PCA |> LinearRegressor
pipe1 |> pipe2

Special operations

If all the components are invertible unsupervised models
(transformers), then inverse_transform is implemented for the
pipeline. If there are no supervised models, then predict is
nevertheless implemented, assuming the last model (such as KMeans
clustering) also implements it. Similarly, calling transform on a
supervised pipeline calls transform on the supervised component.

Optional key-word arguments

  • prediction_type -
    prediction type of the pipeline; possible values: :deterministic,
    :probabilistic, :interval (default=:deterministic if not inferable)

  • operation - operation applied to the supervised component model,
    when present; possible values: predict, predict_mean,
    predict_median, predict_mode (default=predict)

  • cache - whether the internal machines created for component models
    should cache model-specific representations of data. See machine.

!!! warning "Set cache=false to guarantee data anonymization"

This precaution applies to composite models, and only to those
implemented using learning networks.

To build more complicated non-branching pipelines, refer to the MLJ
manual sections on composing models.

@CameronBieganek Any depth of review here would be greatly appreciated.

@codecov-commenter
Copy link

codecov-commenter commented Sep 17, 2021

Codecov Report

Merging #639 (fb87a71) into dev (239e73f) will increase coverage by 0.24%.
The diff coverage is 94.79%.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #639      +/-   ##
==========================================
+ Coverage   85.42%   85.66%   +0.24%     
==========================================
  Files          39       40       +1     
  Lines        3416     3572     +156     
==========================================
+ Hits         2918     3060     +142     
- Misses        498      512      +14     
Impacted Files Coverage Δ
src/MLJBase.jl 100.00% <ø> (ø)
src/composition/learning_networks/nodes.jl 67.80% <0.00%> (-1.16%) ⬇️
src/composition/models/pipelines2.jl 94.55% <94.55%> (ø)
src/composition/learning_networks/inspection.jl 86.53% <100.00%> (+0.26%) ⬆️
src/composition/models/pipelines.jl 98.98% <100.00%> (+0.40%) ⬆️
src/sources.jl 88.00% <100.00%> (+3.55%) ⬆️
src/utilities.jl 83.82% <100.00%> (+2.15%) ⬆️
src/composition/learning_networks/arrows.jl 0.00% <0.00%> (-73.34%) ⬇️
src/machines.jl 84.45% <0.00%> (+1.03%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 239e73f...fb87a71. Read the comment docs.

@tlienart
Copy link
Collaborator

I'm fine with the arrow syntax being dropped in favour of something that people consider more standard and readable :)

@CameronBieganek
Copy link
Collaborator

Awesome!!! I'm busy this weekend, but I'll try to take a look at this as soon as possible.

@CameronBieganek
Copy link
Collaborator

@ablaom Not sure how urgent this is. I'm pretty swamped right now, so I won't be able to look at this this week or this weekend. But maybe next week...

@ablaom
Copy link
Member Author

ablaom commented Sep 28, 2021

@CameronBieganek Thanks for the update!

I think it's important to have a review and you are still my first choice. If you think it's likely more than a month away, let me know.

@ablaom
Copy link
Member Author

ablaom commented Oct 19, 2021

@CameronBieganek Do you see you getting time for this in the next week and a bit?

@CameronBieganek
Copy link
Collaborator

@ablaom I think I can actually take a look at this tomorrow evening. I'm putting it on my calendar. :)

Copy link
Collaborator

@CameronBieganek CameronBieganek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! Just a few minor comments here and there.

src/composition/models/pipelines2.jl Outdated Show resolved Hide resolved
src/composition/models/pipelines2.jl Outdated Show resolved Hide resolved
src/composition/models/pipelines2.jl Outdated Show resolved Hide resolved
src/composition/models/pipelines2.jl Outdated Show resolved Hide resolved
src/composition/models/pipelines2.jl Show resolved Hide resolved
src/composition/models/pipelines2.jl Outdated Show resolved Hide resolved
src/composition/models/pipelines2.jl Outdated Show resolved Hide resolved
src/composition/models/pipelines2.jl Outdated Show resolved Hide resolved
src/utilities.jl Outdated Show resolved Hide resolved
test/composition/models/pipelines2.jl Outdated Show resolved Hide resolved
ablaom and others added 7 commits October 27, 2021 15:51
Bunch of minor suggestions of reviewer combined into one commit.

Co-authored-by: Cameron Bieganek <8310743+CameronBieganek@users.noreply.github.com>
Add forgotten `@test` before boolean

Co-authored-by: Cameron Bieganek <8310743+CameronBieganek@users.noreply.github.com>
@ablaom
Copy link
Member Author

ablaom commented Oct 29, 2021

Closing in favour of #664

@ablaom ablaom closed this Oct 29, 2021
@ablaom
Copy link
Member Author

ablaom commented Oct 29, 2021

@CameronBieganek Thanks very much indeed for wading through all that code. 🙏🏾

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants