Hyperparameter optimization #115
-
I started a search for the problems:

`np.cos(2.3 * X[:, 0]) * np.sin(2.3 * X[:, 0] * X[:, 1] * X[:, 2]) - 10.0`

`(np.exp(X[:, 3]*0.3) + 3)/(np.exp(X[:, 1]*0.2) + np.cos(X[:, 0]) + 1.1)`

I allowed the search 4 cores and a maximum time of 10 minutes (unlimited iterations). I ran a search 3 times for each of the two expressions (a total of 1 hour per evaluation). I set a maxsize of 30. After each search, I took the loss of the most accurate expression found; I then took the median of these losses over the 3 searches, and the average over the two expressions. After 9366 trials, the best hyperparameters found were:

This is quite surprising, and it's far away from the current defaults. The search should continue running for the next week, so I'll update this with any updated parameters.
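For concreteness, here is a minimal sketch of that setup using the modern `PySRRegressor` API; the dataset size, sampling range, and random seed are illustrative assumptions, not details from the original run:

```python
import numpy as np
from pysr import PySRRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(100, 4))

# The two benchmark targets quoted above:
y1 = np.cos(2.3 * X[:, 0]) * np.sin(2.3 * X[:, 0] * X[:, 1] * X[:, 2]) - 10.0
y2 = (np.exp(X[:, 3] * 0.3) + 3) / (np.exp(X[:, 1] * 0.2) + np.cos(X[:, 0]) + 1.1)

model = PySRRegressor(
    binary_operators=["*", "/", "+", "-"],
    unary_operators=["sin", "cos", "exp", "log"],
    maxsize=30,                  # as in the search described above
    niterations=10_000_000,      # effectively unlimited; rely on the timeout
    timeout_in_seconds=10 * 60,  # the 10-minute budget
    procs=4,                     # the 4-core budget
)
model.fit(X, y1)  # repeat for y2; 3 runs each, then aggregate the losses
```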
-
@JayWadekar @patrick-kidger @kazewong potentially of interest to each of you.
-
71,000 trials in. Keep in mind these were only tuned for the above problems, and likely should be re-tuned for specific problem domains. Also keep in mind the search was limited to 10 minutes on 4 cores; more cores and more time would probably justify evolving larger numbers of expressions at once. Here are the single best hyperparameters:

loss 0.278591
alpha 0.048618
annealing False
binary_operators ['*', '/', '+', '-']
crossoverProbability 0.051212
fractionReplaced 0.026796
fractionReplacedHof 0.036723
maxsize 30
model_selection accuracy
ncyclesperiteration 634.0
niterations 10000
npop 38.0
optimize_probability 0.154329
optimizer_iterations 8.0
optimizer_nrestarts 2.0
parsimony 0.003075
perturbationFactor 0.002067
populations 15.0
skip_mutation_failures True
topn 15.0
tournament_selection_p 0.80458
unary_operators ['sin', 'cos', 'exp', 'log']
useFrequency True
warmupMaxsizeBy 0.076342
weightAddNode 0.401272
weightDeleteNode 1.271694
weightDoNothing 0.078377
weightInsertNode 6.211138
weightMutateConstant 0.079556
weightMutateOperator 0.526905
weightRandomize 0.00027
weightSimplify 0.002

Since this might be noisy - perhaps this trial got lucky - here are the median hyperparameters of the top 10 trials:

loss 0.293183
alpha 0.036668
annealing False
crossoverProbability 0.065701
fractionReplaced 0.000364
fractionReplacedHof 0.034940
maxsize 30.000000
ncyclesperiteration 555.500000
niterations 10000.000000
npop 33.000000
optimize_probability 0.137467
optimizer_iterations 8.000000
optimizer_nrestarts 2.000000
parsimony 0.003176
perturbationFactor 0.076306
populations 15.500000
skip_mutation_failures True
topn 12.000000
tournament_selection_p 0.859390
useFrequency True
warmupMaxsizeBy 0.083059
weightAddNode 0.791716
weightDeleteNode 1.734292
weightDoNothing 0.214623
weightInsertNode 5.103537
weightMutateConstant 0.048217
weightMutateOperator 0.474846
weightRandomize 0.000234
weightSimplify 0.002000

We can also look at the spread of these values. Many of them are log-distributed, so let's look at the standard deviation in log10 space. This gives the following:

loss 0.008101
alpha 0.229157
crossoverProbability 0.174091
fractionReplaced 0.633015
fractionReplacedHof 0.163865
maxsize 0.000000
ncyclesperiteration 0.064149
niterations 0.000000
npop 0.098414
optimize_probability 0.167777
optimizer_iterations 0.053846
optimizer_nrestarts 0.155451
parsimony 0.126659
perturbationFactor 0.568420
populations 0.076493
topn 0.035872
tournament_selection_p 0.036660
warmupMaxsizeBy 0.144004
weightAddNode 0.252111
weightDeleteNode 0.152587
weightDoNothing 0.196531
weightInsertNode 0.118023
weightMutateConstant 0.346641
weightMutateOperator 0.185854
weightRandomize 1.023768
weightSimplify 0.000000
dtype: float64

One can roughly interpret a "1" here as meaning the error on the mean (in log space) spans from 10x the value down to 1/10x the value. The most uncertain values are `weightRandomize`, `fractionReplaced`, and `perturbationFactor`.

Edit: here's the copy-pastable version, with only the main hyperparams (niterations, maxsize, warmupMaxsizeBy, and the operators are excluded):

alpha=0.036668,
annealing=False,
crossoverProbability=0.065701,
fractionReplaced=0.000364,
fractionReplacedHof=0.034940,
ncyclesperiteration=555,
npop=33,
optimize_probability=0.137467,
optimizer_iterations=8,
optimizer_nrestarts=2,
parsimony=0.003176,
perturbationFactor=0.076306,
populations=15,
skip_mutation_failures=True,
topn=12,
tournament_selection_p=0.859390,
useFrequency=True,
weightAddNode=0.791716,
weightDeleteNode=1.734292,
weightDoNothing=0.214623,
weightInsertNode=5.103537,
weightMutateConstant=0.048217,
weightMutateOperator=0.474846,
weightRandomize=0.000234,
weightSimplify=0.002000,
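For reference, the aggregation described above can be expressed with pandas; a sketch, where `trials` is a hypothetical stand-in for the real trial log (the synthetic data is only there to make the snippet runnable):

```python
import numpy as np
import pandas as pd

# Hypothetical trial log: one row per hyperparameter-optimization trial,
# a `loss` column, and one (log-distributed) column per hyperparameter.
rng = np.random.default_rng(0)
trials = pd.DataFrame({
    "loss": rng.uniform(0.27, 0.6, size=1000),
    "weightInsertNode": 10 ** rng.uniform(-1, 1, size=1000),
    "weightRandomize": 10 ** rng.uniform(-4, 0, size=1000),
})

top10 = trials.nsmallest(10, "loss")

# Median hyperparameters of the top 10 trials (second table above):
medians = top10.median()

# Spread in log10 space (third table above). A value near 1 means the
# error on the (log-space) mean spans roughly 10x to 1/10x the value.
log_spread = np.log10(top10).std()
```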
-
For posterity, here is the mean of the top 10 trials, computed in log space:

loss 0.292467
alpha 0.036416
crossoverProbability 0.064505
fractionReplaced 0.000479
fractionReplacedHof 0.034607
maxsize 30.000000
ncyclesperiteration 541.969433
niterations 10000.000000
npop 31.599738
optimize_probability 0.127483
optimizer_iterations 8.540416
optimizer_nrestarts 1.515717
parsimony 0.003407
perturbationFactor 0.063978
populations 15.777888
topn 12.360778
tournament_selection_p 0.853237
warmupMaxsizeBy 0.078130
weightAddNode 0.841462
weightDeleteNode 1.753645
weightDoNothing 0.210091
weightInsertNode 5.129917
weightMutateConstant 0.033357
weightMutateOperator 0.494547
weightRandomize 0.000647
weightSimplify 0.002000
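As an aside, a "mean computed in log space" is the geometric mean,

$$\bar{x}_{\mathrm{geo}} = 10^{\frac{1}{n}\sum_{i=1}^{n}\log_{10} x_i},$$

which in the pandas sketch above would be `10 ** np.log10(top10).mean()`.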
-
Hmm. Lots of interesting things here.
-
Yes, very curious indeed. Thanks for this writeup. Quick comments before I run to a meeting:
-
This sounds right: too much migration would reduce diversity, so it's important to either reduce the migration rate, or increase

Regarding the weights, it might be interesting to only optimize those and look at the result. My guess is they vary quite a bit by problem - more complex expressions would probably favor the mutations which add nodes to the tree.
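To make "only optimize those" concrete, here is a minimal sketch of a weights-only search, with Optuna standing in for the repo's hyperparamopt.py setup; the data, parameter ranges, trial counts, and the (recent, snake_case) PySR parameter names are all illustrative assumptions:

```python
import numpy as np
import optuna
from pysr import PySRRegressor

# Toy data standing in for one of the benchmark problems above.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(100, 4))
y = (np.exp(X[:, 3] * 0.3) + 3) / (np.exp(X[:, 1] * 0.2) + np.cos(X[:, 0]) + 1.1)

MUTATION_WEIGHTS = [
    "weight_add_node", "weight_insert_node", "weight_delete_node",
    "weight_do_nothing", "weight_mutate_constant", "weight_mutate_operator",
    "weight_randomize", "weight_simplify",
]

def objective(trial: optuna.Trial) -> float:
    # Vary only the mutation weights; hold everything else at the defaults.
    weights = {
        name: trial.suggest_float(name, 1e-4, 10.0, log=True)
        for name in MUTATION_WEIGHTS
    }
    # Small budget per trial: each fit launches the full search backend.
    model = PySRRegressor(niterations=20, maxsize=30, **weights)
    model.fit(X, y)
    return float(model.get_best()["loss"])

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```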
-
Yep, everything you've said, across both posts, makes sense.
-
One interesting thing I noticed: increasing the floating point precision from float32 to float64 doesn't really hurt evaluation speed: a single evaluation of a 48-token expression over 200 datapoints only changes from 9.791 µs to 10.542 µs.

However, having higher precision can help search speed by avoiding weird combinations of constants. Sometimes the genetic algorithm figures out it can achieve higher precision for a particular constant by stacking multiple constants together. I know this sounds weird, but from the genetic algorithm's perspective, whatever gives it higher accuracy is a good thing! For example, to achieve 2/3 to higher precision (since you can't represent this exactly in floating point numbers), it would simply multiply by 2.0 and divide by 3.0 in a subsequent operation (both of which can be stored exactly), rather than store a single constant approximately. From my very non-rigorous experiments, it seems like using higher precision in the search (you can set this in PySR with the `precision` parameter) does help.
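As a rough, NumPy-based illustration of the timing comparison (the original numbers presumably come from the backend itself, so absolute times will differ; the expression, data shape, and repeat count here are illustrative choices):

```python
import timeit
import numpy as np

def expr(X: np.ndarray) -> np.ndarray:
    # One of the benchmark expressions from earlier in this thread.
    return np.cos(2.3 * X[:, 0]) * np.sin(2.3 * X[:, 0] * X[:, 1] * X[:, 2]) - 10.0

for dtype in (np.float32, np.float64):
    X = np.random.default_rng(0).standard_normal((200, 4)).astype(dtype)
    t = timeit.timeit(lambda: expr(X), number=10_000)
    print(f"{np.dtype(dtype).name}: {t / 10_000 * 1e6:.3f} µs per evaluation")
```

The exact gap depends on hardware, but at this array size it should be small, in line with the numbers quoted above.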
-
Running some new tuning runs in https://github.com/MilesCranmer/pysr_wandb - W&B sweep file here: https://github.com/MilesCranmer/pysr_wandb/blob/master/sweep.yml. According to W&B, these are the most important parameters to tune:

My initial conclusion is that the following change to hyperparameters (from the current defaults) is quite good:

model.set_params(
population_size=75, # default 33
tournament_selection_n=23, # default 10
tournament_selection_p=0.8, # default 0.86
ncyclesperiteration=100, # default 550
parsimony=1e-3, # default 0.0032
fraction_replaced_hof=0.08, # default 0.035
optimizer_iterations=25, # default 8
crossover_probability=0.12, # default 0.066
weight_optimize=0.06, # default 0.0
populations=50, # default 15
adaptive_parsimony_scaling=100.0, # default 20
)
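For what it's worth, `set_params` is the standard scikit-learn estimator API, which `PySRRegressor` implements, so the same change can equivalently be applied at construction time; a sketch using exactly the values above:

```python
from pysr import PySRRegressor

# Equivalent to the model.set_params(...) call above, at construction time.
model = PySRRegressor(
    population_size=75,
    tournament_selection_n=23,
    tournament_selection_p=0.8,
    ncyclesperiteration=100,
    parsimony=1e-3,
    fraction_replaced_hof=0.08,
    optimizer_iterations=25,
    crossover_probability=0.12,
    weight_optimize=0.06,
    populations=50,
    adaptive_parsimony_scaling=100.0,
)
```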
-
The file `hyperparamopt.py` in benchmarks is for doing distributed hyperparameter optimization. It has been useful in the past for tuning the defaults of PySR for a generic set of problems. `print_best_model.py` will print the results.

This discussion thread will hold various hyperparam solutions to different problems, to try to see if there are some better defaults to use, etc. (anybody can feel free to post good parameter sets they found).