
MLJ Integration #226
Merged: 66 commits from mlj-integration into master, Jul 5, 2023

Conversation

MilesCranmer (Owner) commented Jul 3, 2023:

@ericphanson I tried an initial implementation. How is this?

Potentially fixes #225

The current things missing are:

  1. No way of passing custom data to the model (possible with the extra parameter via SymbolicRegression.Dataset(...; extra::NamedTuple=...); see the sketch after this list). Maybe this just isn't suitable for MLJ though?
  2. No way to pass variable_names, or, once Dimensional constraints #220 merges, pass units. Again, perhaps this just is not MLJ-compatible so we don't need to worry about it.
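For reference, a minimal sketch of the low-level route mentioned in item 1 (outside MLJ). Only the extra keyword itself is the point here; the feature layout and the temperature field are purely illustrative:

using SymbolicRegression

X = randn(3, 100)                 # low-level API: features × observations
y = @. 2.0 * X[1, :] - X[2, :]^2

# Auxiliary per-row data rides along inside the Dataset via `extra`,
# where e.g. a custom loss_function could read it back out.
dataset = SymbolicRegression.Dataset(X, y; extra=(; temperature=rand(100)))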

Here's a demo:

julia> using MLJ, SymbolicRegression

julia> model = SRRegressor(binary_operators=[+, *, -, /], unary_operators=[cos], niterations=200, maxsize=20, nested_constraints=[cos=>[cos=>0]], complexity_of_operators=[cos=>2])
...

julia> X = rand(1000, 5) .* 15 .- 7.5;

julia> y = @. cos(X[:, 3] * 2.1 - 0.2) + 0.2 * X[:, 2]^2 * X[:, 1];

julia> mach = machine(model, X, y)
...

julia> fit!(mach)
...

julia> report(mach)
(best_idx = 8,
 equations = DynamicExpressions.EquationModule.Node{Float64}[x1, (x1 * 3.740921188733064), ((3.7285630857630214 * x1) - -0.9321906358198151), (((x2 * x2) * x1) * 0.19990596875363167), ((((x2 * x1) + 0.03426565943145631) * 0.19989890752841208) * x2), ((((x2 * x2) * x1) + (0.01871055082023076 / x3)) * 0.19990596872137176), ((((x2 * x2) * x1) * 0.19991301441424844) + cos(x3 / -0.47083256472261414)), ((((x2 * x2) * x1) * 0.19991301441424844) + cos((x3 / -0.47083256472261414) - -0.1750279536531524))],
 equation_strings = ["x1", "(x1 * 3.740921188733064)", "((3.7285630857630214 * x1) - -0.9321906358198151)", "(((x2 * x2) * x1) * 0.19990596875363167)", "((((x2 * x1) + 0.03426565943145631) * 0.19989890752841208) * x2)", "((((x2 * x2) * x1) + (0.01871055082023076 / x3)) * 0.19990596872137176)", "((((x2 * x2) * x1) * 0.19991301441424844) + cos(x3 / -0.47083256472261414))", "((((x2 * x2) * x1) * 0.19991301441424844) + cos((x3 / -0.47083256472261414) - -0.1750279536531524))"],
 losses = [352.4189784722068, 210.99011897736824, 210.12401468032897, 0.5091640874420742, 0.5082645089530076, 0.5069986513792896, 0.024084496999990993, 0.005418666193545019],
 complexities = [1, 3, 5, 7, 9, 11, 13, 15],
 scores = [-5.864820747527641, 0.25650472223596993, 0.0020567001387135193, 3.01134140171983, 0.0008841688504815699, 0.001246827648481203, 1.523469992992648, 0.7458593298183651],)

julia> report(mach).equation_strings
8-element Vector{String}:
 "x1"
 "(x1 * 3.740921188733064)"
 "((3.7285630857630214 * x1) - -0.9321906358198151)"
 "(((x2 * x2) * x1) * 0.19990596875363167)"
 "((((x2 * x1) + 0.03426565943145631) * 0.19989890752841208) * x2)"
 "((((x2 * x2) * x1) + (0.01871055082023076 / x3)) * 0.19990596872137176)"
 "((((x2 * x2) * x1) * 0.19991301441424844) + cos(x3 / -0.47083256472261414))"
 "((((x2 * x2) * x1) * 0.19991301"  37 bytes  "261414) - -0.1750279536531524))"

julia> report(mach).best_idx
8

julia> ypred = predict(mach, X);

julia> sum(abs2,  ypred .- y) ./ length(y)
0.005418666193545014

The default selection going into .best_idx is set by model.selection_method, which returns an index for the chosen equation.
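For illustration, overriding it might look like this (the keyword signature below is an assumption, mirroring what the default choose_best appears to receive; check its docstring):

# Hypothetical selector: always pick the lowest-loss expression, ignoring complexity.
pick_lowest_loss(; trees, losses, scores, complexities) = argmin(losses)

model = SRRegressor(
    binary_operators=[+, *, -, /],
    unary_operators=[cos],
    selection_method=pick_lowest_loss,
)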


@Moelf is this the right approach for MLJ or would it be better to have a separate package? MLJModelInterface.jl looks pretty lightweight.

MilesCranmer (Owner, Author):

@ablaom I'd be very interested to hear your thoughts on this implementation as well.

ablaom commented Jul 3, 2023:

Thanks for working on an MLJ integration. Very exciting to have a new class of model, and implemented in Julia, right?

Currently on leave but will definitely be supporting this. @OkonSamuel may also be able to look at this.

On a quick scan, I couldn't find any trait definitions (aka metadata) as defined, e.g. using MMI.metadata_model and MMI.metadata_pkg - see these docs. But maybe I missed them...

And it would be a good idea to add MLJTestInterface.jl tests to your test suite. There's an example for XGBoost you can probably adapt.

MilesCranmer (Owner, Author):

Fantastic, thank you. Will add the metadata.
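For reference, the declarations would look something along these lines (a rough sketch; the UUID, license, and scitypes are placeholders/assumptions, not necessarily what ends up in the PR):

import MLJModelInterface as MMI

MMI.metadata_pkg(SRRegressor;
    name="SymbolicRegression",
    uuid="...",                     # UUID from Project.toml
    url="https://github.com/MilesCranmer/SymbolicRegression.jl",
    julia=true,
    license="Apache-2.0",
    is_wrapper=false,
)

MMI.metadata_model(SRRegressor;
    input_scitype=MMI.Table(MMI.Continuous),
    target_scitype=AbstractVector{MMI.Continuous},
    supports_weights=true,
    load_path="SymbolicRegression.MLJInterfaceModule.SRRegressor",
)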

Yes, this package is 100% Julia. The Python frontend PySR is just a lightweight scikit-learn wrapper via PyJulia, but the engine is all on the Julia side. The existing Julia interface in SymbolicRegression.jl is not as user-friendly, so I am looking forward to adding this MLJ integration; I think it will improve both usability and extensibility.

github-actions bot commented Jul 3, 2023:

Benchmark Results

| benchmark | master | be60f13... | t[master]/t[be60f13...] |
|---|---|---|---|
| search/multithreading | 0.0222 ± 0.0026 h | 0.0217 ± 0.0011 h | 1.02 |
| search/serial | 0.0216 ± 0.00035 h | 0.0221 ± 0.00041 h | 0.981 |
| time_to_load | 3.67 ± 0.17 s | 3.75 ± 0.11 s | 0.978 |
| utils/best_of_sample | 2.6 ± 0.9 μs | 2.6 ± 0.8 μs | 1 |
| utils/check_constraints_x10 | 16.9 ± 5.4 μs | 17.8 ± 5.4 μs | 0.949 |
| utils/compute_complexity_x10/Float64 | 3.3 ± 0.6 μs | 3.1 ± 0.5 μs | 1.06 |
| utils/compute_complexity_x10/Int64 | 3.6 ± 0.5 μs | 3.5 ± 0.5 μs | 1.03 |
| utils/compute_complexity_x10/nothing | 2.6 ± 0.3 μs | 2.4 ± 0.4 μs | 1.08 |
| utils/optimize_constants_x10 | 0.0721 ± 0.014 s | 0.0733 ± 0.015 s | 0.984 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

ericphanson:

> 2. No way to pass variable_names

For this, I think a natural way in MLJ would be to allow users to pass X (or rather its transpose) as a table and to use the column names, since MLJ generally works with tables where rows are observations and columns are features: https://alan-turing-institute.github.io/MLJ.jl/stable/getting_started/#Two-dimensional-data
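Something like this, as a sketch (assuming the interface reads the column names off a tabular X; the target formula is just for illustration):

using MLJ, DataFrames, SymbolicRegression

# Rows are observations; columns are named features, per MLJ convention.
X = DataFrame(alpha=rand(100), beta=rand(100), gamma=rand(100))
y = @. 2.0 * X.alpha - cos(3.0 * X.gamma)

model = SRRegressor(binary_operators=[+, *, -, /], unary_operators=[cos])
mach = machine(model, X, y)
fit!(mach)

# Equations could then be reported in terms of alpha/beta/gamma instead of x1/x2/x3.
report(mach).equation_strings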

OkonSamuel (Contributor):

@MilesCranmer I have taken a first pass through the code and given my review above.
Aside from the suggestions made by @ablaom, you also need to add a docstring for the MLJ interface. See here.
I'll go through it again after you add the MLJ interface tests.

Moelf commented Jul 3, 2023:

I just want to say that having the integration here or in a separate package would both be fine; the latter is not uncommon:

MilesCranmer (Owner, Author):

Thanks. I wonder if those were created before the lightweight interface package existed? It looks like the time-to-load is negligible ( #226 (comment) ), so we could just make it a hard dependency for simplicity.

ablaom commented Jul 4, 2023:

In case this is relevant to the discussion about the hyperparameter struct (aka "options" or, in MLJ, the "model") (I didn't follow the details): the actual struct fields do not matter. The MLJ interface only interacts with the struct through properties, which can be overloaded to differ from the fields. But, yes, the (public) keyword constructor should match the properties, if these differ from the fields.

For example, in MLJ's Pipeline model struct, we have a field models, invisible to the user, which is a vector of the component models, but the user supplies names for the individual models (which become properties) using the kwarg constructor. I say this because the layering mentioned above and copied below looks a bit weird, and I expect it can be avoided by having properties (public API) differ from fields (implementation detail):

SRR(;options = SymbolicRegression.Options(
    binary_operators=[+, *, /, -],
    unary_operators=[cos, exp],
    npopulations=20
))
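A bare-bones sketch of that pattern (not code from this PR, and simplified to two hyperparameters):

# The single field hides the composite options object (implementation detail):
struct WrappedRegressor
    options::NamedTuple
end

# The public keyword constructor and the properties expose individual hyperparameters:
WrappedRegressor(; binary_operators=[+, *], unary_operators=[cos]) =
    WrappedRegressor((; binary_operators, unary_operators))

Base.propertynames(::WrappedRegressor) = (:binary_operators, :unary_operators)
Base.getproperty(m::WrappedRegressor, name::Symbol) =
    getproperty(getfield(m, :options), name)

# WrappedRegressor(unary_operators=[sin]).unary_operators == [sin]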

MilesCranmer (Owner, Author) commented Jul 4, 2023:

Okay, I think I found a nice way around this. The implementation is far from elegant, but from the user's point of view it is nice.

It uses some macro tricks to store all the parameters and defaults from the Options constructor in a constant, and then uses those to generate the fields and defaults of SRRegressor. So the user no longer needs to construct Options() at all when working with MLJ.

I thought about just using options as a keyword but I felt it made things a bit awkward, and maybe it would cause issues for hyperparameter tuning.
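Roughly the idea behind the macro approach, in miniature (parameter names below are just a sample; the real version works off the actual Options constructor):

# Parameter names and defaults captured once in a constant:
const TOY_DEFAULTS = (maxsize = 20, niterations = 10, parsimony = 0.0032f0)

# One `name::Any = default` field expression per captured parameter:
fields = [:($name::Any = $default) for (name, default) in pairs(TOY_DEFAULTS)]

# Generate a mutable, keyword-constructed struct whose fields mirror those defaults:
@eval Base.@kwdef mutable struct ToyRegressor
    $(fields...)
end

# ToyRegressor(maxsize=25) now works; at fit time the fields get forwarded back
# into Options(; maxsize=model.maxsize, ...).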

For better or for worse, now you get all hyperparameters directly in the SRRegressor (see expanded snippet):
julia> model = SRRegressor(binary_operators=[+, *, -, /], unary_operators=[cos], maxsize=25, nested_constraints=[cos=>[cos=>0]], niterations=100)
SRRegressor(
  binary_operators = Function[+, *, -, /], 
  unary_operators = [cos], 
  constraints = nothing, 
  elementwise_loss = nothing, 
  loss_function = nothing, 
  tournament_selection_n = 12, 
  tournament_selection_p = 0.86f0, 
  topn = 12, 
  complexity_of_operators = nothing, 
  complexity_of_constants = nothing, 
  complexity_of_variables = nothing, 
  parsimony = 0.0032f0, 
  alpha = 0.1f0, 
  maxsize = 25, 
  maxdepth = nothing, 
  fast_cycle = false, 
  turbo = false, 
  migration = true, 
  hof_migration = true, 
  should_simplify = nothing, 
  should_optimize_constants = true, 
  output_file = nothing, 
  npopulations = 15, 
  perturbation_factor = 0.076f0, 
  annealing = false, 
  batching = false, 
  batch_size = 50, 
  mutation_weights = MutationWeights(0.048, 0.47, 0.79, 5.1, 1.7, 0.002, 0.00023, 0.21, 0.0), 
  crossover_probability = 0.066f0, 
  warmup_maxsize_by = 0.0f0, 
  use_frequency = true, 
  use_frequency_in_tournament = true, 
  adaptive_parsimony_scaling = 20.0, 
  npop = 33, 
  ncycles_per_iteration = 550, 
  fraction_replaced = 0.00036f0, 
  fraction_replaced_hof = 0.035f0, 
  verbosity = 1000000000, 
  save_to_file = true, 
  probability_negate_constant = 0.01f0, 
  seed = nothing, 
  bin_constraints = nothing, 
  una_constraints = nothing, 
  progress = true, 
  terminal_width = nothing, 
  optimizer_algorithm = "BFGS", 
  optimizer_nrestarts = 2, 
  optimizer_probability = 0.14f0, 
  optimizer_iterations = nothing, 
  optimizer_options = nothing, 
  recorder = nothing, 
  recorder_file = "pysr_recorder.json", 
  early_stop_condition = nothing, 
  timeout_in_seconds = nothing, 
  max_evals = nothing, 
  skip_mutation_failures = true, 
  enable_autodiff = false, 
  nested_constraints = [cos => [cos => 0]], 
  deterministic = false, 
  define_helper_functions = true, 
  niterations = 100, 
  parallelism = :multithreading, 
  numprocs = nothing, 
  procs = nothing, 
  addprocs_function = nothing, 
  runtests = true, 
  loss_type = Nothing, 
  selection_method = SymbolicRegression.MLJInterfaceModule.choose_best)

The downside is it has to re-create the Options() struct whenever you call, e.g., fit or predict. I think it should be fine though. We could even store options in the fitresult to avoid invalidating the returned expressions (e.g., if the user tries to change the operators used; you would still want predict() to work with the old result).

  • Nevermind, this is no longer an issue. I just create Options at fit, store it in the fitresult, and never need to re-generate it again (sketched below).
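Schematically (a simplified sketch against MLJModelInterface, not the exact code in this PR; only a few hyperparameters are forwarded here):

using SymbolicRegression
import MLJModelInterface as MMI

function MMI.fit(model::SRRegressor, verbosity, X, y)
    # Build Options exactly once, from the model's current hyperparameters ...
    options = Options(;
        binary_operators=model.binary_operators,
        unary_operators=model.unary_operators,
        maxsize=model.maxsize,
    )
    hof = equation_search(MMI.matrix(X; transpose=true), y;
                          options, niterations=model.niterations)
    # ... and keep it in the fitresult, so predict evaluates the returned
    # expressions with the operators they were fitted with, even if the user
    # later mutates the model's fields.
    fitresult = (; hof, options)
    return fitresult, nothing, (;)
end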

There are a couple of nested hyperparameters, for example MutationWeights is a struct storing some probabilities. I'm not sure if that would pose an issue for hyperparam tuning or not?

MilesCranmer (Owner, Author):

Okay the output format is updated and seems to work.

MilesCranmer (Owner, Author) commented Jul 4, 2023:

Okay I have added the full docstrings. Anything else missing?

I'll probably update the README shortly after merging. I'm really liking MLJ so far, so I think I'll just have that be the default user interface, with equation_search available if someone (mostly me) needs to do low-level things.

OkonSamuel (Contributor) left a comment:


I can confirm that the documentation renders properly on my terminal.
This looks very good and detailed. Thanks @MilesCranmer, we can't wait to have this addition in the MLJ registry.

MilesCranmer (Owner, Author):

Added!

Once it merges, what should I do to add it to the MLJ registry? Should I submit an issue requesting a registry update on the MLJ repo?

MilesCranmer (Owner, Author):

@OkonSamuel I also realized I need to fix the behavior of sample weights for the multitarget case. What is the expected format of the sample weights w in multitarget problems? Is w always expected to be a 1D array, or a 2D array with columns corresponding to the columns of y?

MilesCranmer (Owner, Author):

Actually I think I just fixed it. MLJ assumes w is always a 1D vector, even for multitarget problems, correct? (This is what I expect; I just want to make sure it doesn't assume tabular sample weights).
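For the record, a sketch of the user-side call (column names are illustrative, and this assumes the multitarget model is exposed as MultitargetSRRegressor):

using MLJ, SymbolicRegression

# Two targets supplied as a table; one weight per observation, not per target.
X = (a=rand(100), b=rand(100))
y = (t1=rand(100), t2=rand(100))
w = rand(100)

model = MultitargetSRRegressor(binary_operators=[+, *, -])
mach = machine(model, X, y, w)   # MLJ forwards w to fit as a 1-D vector
fit!(mach)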

OkonSamuel (Contributor) left a comment:


LGTM!!!

MilesCranmer merged commit 87eb325 into master on Jul 5, 2023 (10 checks passed).
MilesCranmer (Owner, Author):

Thanks so much for your help @OkonSamuel; much appreciated!

MilesCranmer added a commit that referenced this pull request Jul 6, 2023
[Diff since v0.19.1](v0.19.1...v0.20.0)

**Closed issues:**
- [Feature]: MLJ integration (#225)
- [Feature]: ternary operators  (#227)

**Merged pull requests:**
- MLJ Integration (#226) (@MilesCranmer)
MilesCranmer deleted the mlj-integration branch on January 1, 2024.