
MLJ Integration #226

Merged: 66 commits merged from mlj-integration into master on Jul 5, 2023

Conversation

@MilesCranmer (Owner) commented Jul 3, 2023

@ericphanson I tried an initial implementation. How is this?

Potentially fixes #225

The things currently missing are:

  1. No way of passing custom data to the model (normally possible via the extra parameter of SymbolicRegression.Dataset(...; extra::NamedTuple=...)). Maybe this just isn't suitable for MLJ though?
  2. No way to pass variable_names or, once Dimensional constraints #220 merges, to pass units. Again, perhaps this just isn't MLJ-compatible, so we don't need to worry about it.

Here's a demo:

julia> using MLJ, SymbolicRegression

julia> model = SRRegressor(binary_operators=[+, *, -, /], unary_operators=[cos], niterations=200, maxsize=20, nested_constraints=[cos=>[cos=>0]], complexity_of_operators=[cos=>2])
...

julia> X = rand(1000, 5) .* 15 .- 7.5;

julia> y = @. cos(X[:, 3] * 2.1 - 0.2) + 0.2 * X[:, 2]^2 * X[:, 1];

julia> mach = machine(model, X, y)
...

julia> fit!(mach)
...

julia> report(mach)
(best_idx = 8,
 equations = DynamicExpressions.EquationModule.Node{Float64}[x1, (x1 * 3.740921188733064), ((3.7285630857630214 * x1) - -0.9321906358198151), (((x2 * x2) * x1) * 0.19990596875363167), ((((x2 * x1) + 0.03426565943145631) * 0.19989890752841208) * x2), ((((x2 * x2) * x1) + (0.01871055082023076 / x3)) * 0.19990596872137176), ((((x2 * x2) * x1) * 0.19991301441424844) + cos(x3 / -0.47083256472261414)), ((((x2 * x2) * x1) * 0.19991301441424844) + cos((x3 / -0.47083256472261414) - -0.1750279536531524))],
 equation_strings = ["x1", "(x1 * 3.740921188733064)", "((3.7285630857630214 * x1) - -0.9321906358198151)", "(((x2 * x2) * x1) * 0.19990596875363167)", "((((x2 * x1) + 0.03426565943145631) * 0.19989890752841208) * x2)", "((((x2 * x2) * x1) + (0.01871055082023076 / x3)) * 0.19990596872137176)", "((((x2 * x2) * x1) * 0.19991301441424844) + cos(x3 / -0.47083256472261414))", "((((x2 * x2) * x1) * 0.19991301441424844) + cos((x3 / -0.47083256472261414) - -0.1750279536531524))"],
 losses = [352.4189784722068, 210.99011897736824, 210.12401468032897, 0.5091640874420742, 0.5082645089530076, 0.5069986513792896, 0.024084496999990993, 0.005418666193545019],
 complexities = [1, 3, 5, 7, 9, 11, 13, 15],
 scores = [-5.864820747527641, 0.25650472223596993, 0.0020567001387135193, 3.01134140171983, 0.0008841688504815699, 0.001246827648481203, 1.523469992992648, 0.7458593298183651],)

julia> report(mach).equation_strings
8-element Vector{String}:
 "x1"
 "(x1 * 3.740921188733064)"
 "((3.7285630857630214 * x1) - -0.9321906358198151)"
 "(((x2 * x2) * x1) * 0.19990596875363167)"
 "((((x2 * x1) + 0.03426565943145631) * 0.19989890752841208) * x2)"
 "((((x2 * x2) * x1) + (0.01871055082023076 / x3)) * 0.19990596872137176)"
 "((((x2 * x2) * x1) * 0.19991301441424844) + cos(x3 / -0.47083256472261414))"
 "((((x2 * x2) * x1) * 0.19991301"  37 bytes  "261414) - -0.1750279536531524))"

julia> report(mach).best_idx
8

julia> ypred = predict(mach, X);

julia> sum(abs2,  ypred .- y) ./ length(y)
0.005418666193545014

The default selection going into .best_idx is set by model.selection_method, which returns the index of the chosen equation.
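For example, to pull out the equation string selected by best_idx from the report above:

julia> r = report(mach);

julia> r.equation_strings[r.best_idx]
"((((x2 * x2) * x1) * 0.19991301441424844) + cos((x3 / -0.47083256472261414) - -0.1750279536531524))"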


@Moelf is this the right approach for MLJ or would it be better to have a separate package? MLJModelInterface.jl looks pretty lightweight.

@MilesCranmer (Owner, Author)

@ablaom I'd be very interested to hear your thoughts on this implementation as well.

@ablaom commented Jul 3, 2023

Thanks for working on an MLJ integration. Very exciting to have a new class of model, and implemented in Julia, right?

Currently on leave but will definitely be supporting this. @OkonSamuel may also be able to look at this.

On a quick scan, I couldn't find any trait definitions (aka metadata) as defined, e.g. using MMI.metadata_model and MMI.metadata_pkg - see these docs. But maybe I missed them...

And it would be a good idea to add MLJTestInterface.jl tests to your test suite. There's an example for XGBoost you can probably adapt.
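For illustration, trait definitions with MLJModelInterface typically look roughly like the block below; the UUID, scitypes, and load path are placeholders here rather than the final values:

import MLJModelInterface as MMI

MMI.metadata_pkg(
    SRRegressor;
    name="SymbolicRegression",
    uuid="<package UUID>",  # placeholder
    url="https://github.com/MilesCranmer/SymbolicRegression.jl",
    julia=true,
    license="Apache-2.0",
    is_wrapper=false,
)

MMI.metadata_model(
    SRRegressor;
    input_scitype=MMI.Table(MMI.Continuous),
    target_scitype=AbstractVector{MMI.Continuous},
    load_path="SymbolicRegression.MLJInterfaceModule.SRRegressor",  # placeholder path
)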

@MilesCranmer (Owner, Author)

Fantastic, thank you. Will add the metadata.

Yes, this package is 100% Julia. The Python frontend PySR is just a lightweight scikit-learn wrapper via PyJulia; the engine is all on the Julia side. The existing Julia interface in SymbolicRegression.jl is not as user-friendly, so I am looking forward to adding this MLJ integration: I think it will improve both usability and extensibility.

@MilesCranmer mentioned this pull request on Jul 3, 2023
@github-actions (bot) commented Jul 3, 2023

Benchmark Results

| Benchmark | master | be60f13... | t[master]/t[be60f13...] |
|---|---|---|---|
| search/multithreading | 0.0222 ± 0.0026 h | 0.0217 ± 0.0011 h | 1.02 |
| search/serial | 0.0216 ± 0.00035 h | 0.0221 ± 0.00041 h | 0.981 |
| time_to_load | 3.67 ± 0.17 s | 3.75 ± 0.11 s | 0.978 |
| utils/best_of_sample | 2.6 ± 0.9 μs | 2.6 ± 0.8 μs | 1 |
| utils/check_constraints_x10 | 16.9 ± 5.4 μs | 17.8 ± 5.4 μs | 0.949 |
| utils/compute_complexity_x10/Float64 | 3.3 ± 0.6 μs | 3.1 ± 0.5 μs | 1.06 |
| utils/compute_complexity_x10/Int64 | 3.6 ± 0.5 μs | 3.5 ± 0.5 μs | 1.03 |
| utils/compute_complexity_x10/nothing | 2.6 ± 0.3 μs | 2.4 ± 0.4 μs | 1.08 |
| utils/optimize_constants_x10 | 0.0721 ± 0.014 s | 0.0733 ± 0.015 s | 0.984 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

@ericphanson

> 2. No way to pass variable_names

For this, I think a natural way in MLJ would be to allow users to pass X (or rather its transpose) as a table and to use the column names, since MLJ generally works with tables where rows are observations and columns are features: https://alan-turing-institute.github.io/MLJ.jl/stable/getting_started/#Two-dimensional-data
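For instance, something along these lines (a hypothetical usage sketch; whether the interface forwards column names as variable_names is exactly what is being proposed here):

X = (; t=randn(100), v=randn(100))   # any Tables.jl-compatible table; columns are features
y = @. 2 * X.t + X.v^2
mach = machine(SRRegressor(binary_operators=[+, *]), X, y)
fit!(mach)   # the column names :t and :v would then become the variable names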

@OkonSamuel (Contributor)

@MilesCranmer I have taken a first pass through the code and given my review above.
Aside from the suggestions made by @ablaom, you also need to add a docstring for the MLJ interface. See here.
I'll go through again after you add the MLJ interface tests.

@Moelf commented Jul 3, 2023

I just want to say that having the integration here or in a separate package are both fine; the latter is not uncommon.

@MilesCranmer (Owner, Author)

Thanks. I wonder if those were created before the lightweight interface package existed? It looks like the time-to-load is negligible ( #226 (comment) ), so we could just make it a hard dependency for simplicity.

@ablaom commented Jul 4, 2023

In case this is relevant to the discussion about the hyperparameter struct (aka "options" or, in MLJ, "model") (I didn't follow the details): the actual struct fields do not matter. The MLJ interface only interacts with the struct through properties, which can be overloaded to be different from the fields. But, yes, the (public) keyword constructor should match the properties, if these are different from the fields.

For example, in MLJ's Pipeline model struct, we have a field models, invisible to the user, which is a vector of the component models, but the user chooses names for the individual models (which become properties) using the kwarg constructor. I say this because the layering mentioned above and copied below looks a bit weird, and I expect it can be avoided by having properties (public API) differ from fields (implementation detail):

SRR(;options = SymbolicRegression.Options(
    binary_operators=[+, *, /, -],
    unary_operators=[cos, exp],
    npopulations=20
))
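As a toy sketch of that fields-vs-properties idea (unrelated to this package's actual code; ToyPipeline is a made-up type):

mutable struct ToyPipeline
    models::Vector{Any}     # hidden implementation field
    names::Vector{Symbol}   # hidden implementation field
end

# Expose the user-chosen names as the public properties:
Base.propertynames(p::ToyPipeline) = Tuple(getfield(p, :names))

function Base.getproperty(p::ToyPipeline, name::Symbol)
    i = findfirst(==(name), getfield(p, :names))
    i === nothing && error("ToyPipeline has no property $name")
    return getfield(p, :models)[i]
end

# Keyword constructor: the kwarg names chosen by the user become the property names.
ToyPipeline(; named_models...) =
    ToyPipeline(collect(Any, values(named_models)), collect(Symbol, keys(named_models)))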

@MilesCranmer (Owner, Author) commented Jul 4, 2023

Okay, I think I found a nice way around this. The implementation is far from elegant, but from the user's point of view it is nice.

It uses some macro tricks to store all the parameters and defaults from the Options constructor in a constant, and then uses those to generate the fields and defaults of SRRegressor. So basically the user no longer needs to declare Options() if they are working with MLJ.

I thought about just using options as a keyword but I felt it made things a bit awkward, and maybe it would cause issues for hyperparameter tuning.

For better or for worse, now you get all hyperparameters directly in the SRRegressor (see expanded snippet):
julia> model = SRRegressor(binary_operators=[+, *, -, /], unary_operators=[cos], maxsize=25, nested_constraints=[cos=>[cos=>0]], niterations=100)
SRRegressor(
  binary_operators = Function[+, *, -, /], 
  unary_operators = [cos], 
  constraints = nothing, 
  elementwise_loss = nothing, 
  loss_function = nothing, 
  tournament_selection_n = 12, 
  tournament_selection_p = 0.86f0, 
  topn = 12, 
  complexity_of_operators = nothing, 
  complexity_of_constants = nothing, 
  complexity_of_variables = nothing, 
  parsimony = 0.0032f0, 
  alpha = 0.1f0, 
  maxsize = 25, 
  maxdepth = nothing, 
  fast_cycle = false, 
  turbo = false, 
  migration = true, 
  hof_migration = true, 
  should_simplify = nothing, 
  should_optimize_constants = true, 
  output_file = nothing, 
  npopulations = 15, 
  perturbation_factor = 0.076f0, 
  annealing = false, 
  batching = false, 
  batch_size = 50, 
  mutation_weights = MutationWeights(0.048, 0.47, 0.79, 5.1, 1.7, 0.002, 0.00023, 0.21, 0.0), 
  crossover_probability = 0.066f0, 
  warmup_maxsize_by = 0.0f0, 
  use_frequency = true, 
  use_frequency_in_tournament = true, 
  adaptive_parsimony_scaling = 20.0, 
  npop = 33, 
  ncycles_per_iteration = 550, 
  fraction_replaced = 0.00036f0, 
  fraction_replaced_hof = 0.035f0, 
  verbosity = 1000000000, 
  save_to_file = true, 
  probability_negate_constant = 0.01f0, 
  seed = nothing, 
  bin_constraints = nothing, 
  una_constraints = nothing, 
  progress = true, 
  terminal_width = nothing, 
  optimizer_algorithm = "BFGS", 
  optimizer_nrestarts = 2, 
  optimizer_probability = 0.14f0, 
  optimizer_iterations = nothing, 
  optimizer_options = nothing, 
  recorder = nothing, 
  recorder_file = "pysr_recorder.json", 
  early_stop_condition = nothing, 
  timeout_in_seconds = nothing, 
  max_evals = nothing, 
  skip_mutation_failures = true, 
  enable_autodiff = false, 
  nested_constraints = [cos => [cos => 0]], 
  deterministic = false, 
  define_helper_functions = true, 
  niterations = 100, 
  parallelism = :multithreading, 
  numprocs = nothing, 
  procs = nothing, 
  addprocs_function = nothing, 
  runtests = true, 
  loss_type = Nothing, 
  selection_method = SymbolicRegression.MLJInterfaceModule.choose_best)

The downside is it has to re-create the Options() struct whenever you call, e.g., fit or predict. I think it should be fine though. We could even store options in the fitresult to avoid invalidating the returned expressions (e.g., if the user tries to change the operators used; you would still want predict() to work with the old result).

  • Nevermind, this is no longer an issue. I just create Options at fit, store it in the fitresult, and never need to re-generate it again.

There are a couple of nested hyperparameters; for example, MutationWeights is a struct storing some probabilities. I'm not sure whether that would pose an issue for hyperparameter tuning.
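A minimal sketch of the mirrored-fields idea described above (this is not the actual macro-generated code in the PR; SRRegressorSketch and options_from are made-up names, and only three keywords are shown; in the PR the fields and defaults are generated from the Options constructor by macros rather than written by hand):

using SymbolicRegression

# Hand-written stand-in: a keyword-constructed struct whose fields and defaults
# mirror keyword arguments of SymbolicRegression.Options().
Base.@kwdef mutable struct SRRegressorSketch
    binary_operators::Vector{Function} = Function[+, -, *, /]
    unary_operators::Vector{Function} = Function[]
    maxsize::Int = 20
end

# At fit time, Options is created once from the mirrored fields and stored in the
# fitresult, so it never needs to be re-generated:
options_from(m::SRRegressorSketch) = SymbolicRegression.Options(;
    binary_operators=m.binary_operators,
    unary_operators=m.unary_operators,
    maxsize=m.maxsize,
)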

(Multiple inline review comments on src/MLJInterface.jl — outdated, resolved)
@MilesCranmer (Owner, Author)

Okay the output format is updated and seems to work.

@MilesCranmer (Owner, Author) commented Jul 4, 2023

Okay I have added the full docstrings. Anything else missing?

I'll probably update the README shortly after merging. I'm really liking MLJ so far, so I think I'll just have that be the default user interface, with equation_search available if someone (mostly me) needs to do low-level things.

@OkonSamuel (Contributor) left a comment

I can confirm that the documentation renders properly on my terminal.
This looks very good and detailed. Thanks @MilesCranmer; we can't wait to have this addition in the MLJ registry.

@MilesCranmer (Owner, Author)

Added!

Once it merges, what should I do to add it to the MLJ registry? Should I submit an issue requesting a registry update on the MLJ repo?

@MilesCranmer (Owner, Author)

@OkonSamuel I also realized I need to fix the behavior of sample weights for the multitarget case. What is the expected format of the sample weights w in multitarget problems? Is w always a 1D array, or is it expected to be a 2D array with columns corresponding to the columns of y?

@MilesCranmer (Owner, Author)

Actually I think I just fixed it. MLJ assumes w is always a 1D vector, even for multitarget problems, correct? (This is what I expect; I just want to make sure it doesn't assume tabular sample weights).
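For reference, a quick sketch of how such weights are supplied on the user side (assuming the interface declares support for sample weights):

w = rand(size(X, 1))            # one weight per observation, even with multiple target columns
mach = machine(model, X, y, w)  # MLJ passes w through to fit as a 1D vector
fit!(mach)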

@OkonSamuel (Contributor) left a comment

LGTM!!!

@MilesCranmer merged commit 87eb325 into master on Jul 5, 2023
@MilesCranmer (Owner, Author)

Thanks so much for your help @OkonSamuel; much appreciated!

MilesCranmer added a commit that referenced this pull request Jul 6, 2023
[Diff since v0.19.1](v0.19.1...v0.20.0)

**Closed issues:**
- [Feature]: MLJ integration (#225)
- [Feature]: ternary operators  (#227)

**Merged pull requests:**
- MLJ Integration (#226) (@MilesCranmer)
@MilesCranmer deleted the mlj-integration branch on January 1, 2024