-
Hey everyone! Super excited to start learning this package and adding it to my toolkit. I'm trying to learn its buttons and knobs by re-discovering the famous fast inverse square root (FISR) function from Quake. So far the search doesn't seem to be discovering it, or really finding much that resembles the function. The reason I'm tackling this one in particular is that I vaguely remember Miles mentioning during a talk on the package that this is possible, so first and foremost please correct me if I'm hallucinating, but secondly let me know if the code to do this is already out there somewhere! Here's my code so far:
using SymbolicRegression
using MLJ
X = (; input=Float32.(rand(10:100, 100)))
y = 1 ./ sqrt.(X.input)
shiftright(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) >> 1)
shiftleft(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) << 1)
model = SRRegressor(
binary_operators=[-, *, +],
unary_operators=[shiftright, shiftleft],
niterations=1000,
)
mach = machine(model, X, y)
fit!(mach)
using Plots
plotly()
# testing various outputs
c6(x) = ((shiftright(x) * -3.2614e+17) - -0.3356)
c17(input) = (0.39054 - ((input * 0.38849) * ((0.041157 - (input * (((input - 0.38849) * -2.6506e-06) - -0.00055827))) * 0.6586)))
c14(input) = ((shiftright(shiftleft(shiftleft(input * Float32(-1.4716e+17))) + Float32(7.9136e+34)) - (input * Float32(0.0048464))) - Float32(-0.30194))
s = scatter(X.input, y, label="data")
scatter!(X.input, c6.(X.input), label="6")
scatter!(X.input, c14.(X.input), label="14")
scatter!(X.input, c17.(X.input), label="17")
So we're certainly getting reasonable answers! Just not the famous 6-8 operation answer. I wanted to ask you all for ideas and tips on how to better explore the function space here, and whether I may be doing something wrong or missing some critical aspects of the search. Thanks!
-
Oh, and if anyone would like to verify the FISR in Julia:
function fisr(x::Float32)
x2 = x * Float32(0.5);
i = reinterpret(Int32, x)
i = 0x5f3759df - (i >> 1)
y = reinterpret(Float32, i)
y = y * (Float32(1.5) - (x2 * y * y))
end
inputs = Float32.(range(1, 100, 1000))
out_approx = fisr.(inputs)
out_actual = 1 ./ sqrt.(inputs)
s = plot(inputs, out_actual)
plot!(inputs, out_approx)
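As an extra check (not part of my original snippet), the classic single-Newton-step FISR is usually quoted at a worst-case relative error of roughly 0.17%, so something like this should land in that ballpark:
maximum(abs.(out_approx .- out_actual) ./ out_actual)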
-
Hey @Boxylmer,

You indeed are not hallucinating, I did that search maybe 1-2 years ago and occasionally mention it in talks. I'm happy to hear you are interested in this type of thing; it would be cool to have a reproducible example in the docs! I'm not sure if I'll be able to find the original script as it was 2 laptops ago... but I will dig around.

You have the overall right idea with the operators if I recall correctly, but be sure to look at the tuning page: https://astroautomata.com/PySR/tuning/. You'll also probably need to tweak the complexity of each operator (based on the number of FLOPs – I think I measured those with https://github.com/triscale-innov/GFlops.jl), etc., which will encourage it to use the bitshift operator. You might also want to throw more compute at the problem on a multi-node cluster. It might have also been that the Newton iteration steps (the y = y * (1.5f0 - (x2 * y * y)) part) were applied by hand inside the loss function rather than left for the search to discover.

I'm trying to also remember if I tweaked any part of the code related to the constants... (since optimizing the constants for this problem is a bit tricky as it's related to the bits, rather than a continuous value). I don't think I hacked anything, but it could have just been that I disabled constant optimization (a parameter), or switched to NelderMead (another parameter), or something else. Sorry I'm not more helpful here... But yeah I'd be eager to help get this example working again!
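For the FLOP-counting idea, a rough sketch of what I mean (this assumes GFlops.jl's @count_ops macro, and the complexities at the end are purely illustrative, not the values I actually used):
using GFlops
@count_ops 1.5f0 * 2.0f0     # cost of a plain multiply
@count_ops shiftright(1.5f0) # the bit-twiddling ops show up as essentially zero FLOPs
# ...then pass complexities roughly in proportion to those counts, e.g.
# SRRegressor(...; complexity_of_operators=[shiftright => 1, (*) => 1, (+) => 1], ...)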
-
Hey Miles! Thanks for the detailed and fast response.

Oh yeah, I originally gave it way too much freedom and made an arbitrary bit shift, but quickly realized it never used it, and settled on the shift-by-one methods above. The docs on PySR were super helpful! I should have checked over there as well rather than just the Julia docs (which I think you actually mentioned too: that PySR is more thoroughly documented). The complexity trick is super clever! Just assign the average cost of each computation as its complexity. I don't have too much experience with SLURM clusters, but I want to get into it, and I have a few laptops sitting around that I could use to set one up eventually!

I found that the pool of equations stagnated after a bit of time (or at least the hall of fame did). Is this something I should actually just leave running for longer? The data has zero noise, so I figured it was just not finding the magic numbers. On that note, magic numbers for bit shifting are tricky: the number is massive and usually written in hexadecimal. It seems unlikely that Optim would be able to find it unless we search for it in log space instead, but I'm not immediately sure how to set that up, and it also feels like cheating.

I'm also willing to put some more effort into this, as I think finding FISR-style tricks for specific types could prove super useful for some microcontrollers, and I was hoping to make enough headway to write a short communication on it. Thanks for the tips!
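One thing I might try for the log-space idea (just a sketch; magic_add_log is a made-up helper and the clamp bound is arbitrary) is to let the searched constant live in log space and exponentiate it before turning it into bits, so the optimizer walks a ~21-ish value instead of ~1.6e9 (log(0x5f3759df) ≈ 21.2):
magic_add_log(val::Float32, c::Float32) =
    reinterpret(Float32, reinterpret(UInt32, val) + UInt32(round(clamp(exp(Float64(c)), 0.0, 4.0e9))))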
-
@MilesCranmer checking in with some updates! The current script is finding a lot of pretty decent approximations that almost look like piecewise functions, but I'm stuck here, I believe, and am not sure how to get it to go further and find FISR. The key steps I've taken (all visible in the code below) are: assigning reduced complexities to the bit-twiddling operators, adding a magic_add operator whose constant is mapped through a scaled sigmoid onto the UInt32 range, adding a magic_inverse operator, and switching the constant optimizer to NelderMead.
I was genuinely surprised that it wasn't able to find it, even with me messing with a few hyperparameters and letting it run overnight. I'm hoping you might be able to point me in the right direction, as this is a really interesting problem that I'm hoping to use to find other FISR-like tricks for microcontrollers. One hunch I'm getting is that there might be some way to punish non-smooth functions, i.e. those piecewise-looking results that you typically get from bit shifting. Telling it that there should only be N instances of certain operators feels like cheating, so I'm not at that point yet, haha. Here you can see that FISR is near-perfect, but two good functions I found, with complexities of 6 and 8 respectively, look piecewise-like.
using MLJ, SymbolicRegression
X = (; input=Float32.(range(0.1, 100, 10000)))
y = 1 ./ sqrt.(X.input)
shift_right(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) >> 1)
shift_left(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) << 1)
function scaled_sigmoid(x::Float32, k::Float32)
return 1 / (1 + exp(-k * x))
end
map_float32_to_uint32(f::Float32) = floor(UInt32, ((scaled_sigmoid(f, 0.1f0)) * (2^32 - 2^7 - 1)))
magic_add(val::Float32, magic::Float32) = magic == 0f0 ? val : reinterpret(Float32, reinterpret(UInt32, val) + map_float32_to_uint32(magic))
magic_inverse(val::Float32) = reinterpret(Float32, -reinterpret(Int32, val))
function fisr(x::Float32, magicnum::Float32)
x2 = x * Float32(0.5)
i1 = magic_inverse(x)
i2 = shift_right(i1)
y = magic_add(i2, magicnum)
y = y * (Float32(1.5) - (x2 * y * y))
end
# see the mapping function, maybe this should just be a linear transformation?
plot(range(-100, 100, 100), map_float32_to_uint32.(Float32.(range(-1000, 1000, 100))))
model = SRRegressor(
binary_operators=[*, +, magic_add],
unary_operators=[shift_right, magic_inverse, neg],
complexity_of_operators=[shift_right=>0.5, magic_inverse=>0.5, magic_add=>0.5],
niterations=100,
ncycles_per_iteration=5000,
optimizer_nrestarts=4,
optimizer_algorithm="NelderMead",
)
mach = machine(model, X, y)
fit!(mach)
using Plots
plotly()
# testing various outputs
test_x = (; input=Float32.(range(0.05, 10, 1000)))
test_y = 1 ./ sqrt.(test_x.input)
c6(input) = magic_inverse(magic_add(neg(shift_right(input * Float32(-4.1931e-20))), Float32(-2.3726)))
c8(input) = magic_add(magic_inverse(shift_right(input) * Float32(0.88152)), Float32(-0.18826)) * Float32(8.441e-20)
c19(input) = (magic_inverse((shift_right(shift_right(input)) * Float32(1.5938e+29)) * (Float32(1.5938e+29) + magic_add(magic_inverse(shift_right(shift_right(Float32(-0.54733) + input))), Float32(-0.54733)))) * magic_inverse(shift_right(shift_right(input * Float32(5.4562)))))
s = plot()
plot!(s, X.input, y, label="data")
scatter!(s, test_x.input, c8.(test_x.input), label="8")
scatter!(s, test_x.input, c6.(test_x.input), label="6")
scatter!(s, test_x.input, c19.(test_x.input), label="19")
scatter!(s, test_x.input, fisr.(test_x.input, -5.2391f0), label="FISR")
plot!(s, test_x.input, test_y, label="test_answers")
# what's the expected loss if we get the right answer with L2 loss?
inputs = Float32.(range(1, 100, 1000))
out_actual = 1 ./ sqrt.(inputs)
out_approx = fisr.(inputs, -5.2391f0)  # FISR with the fitted magic constant
s = plot(inputs, out_actual)
scatter!(s, inputs, out_approx)
sum((out_approx .- out_actual) .^ 2) / length(out_actual) # "perfect" loss
# 0x5f3759df
map_float32_to_uint32(-5.2391f0) # magic numbers!
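Working that last line through by hand: scaled_sigmoid(-5.2391f0, 0.1f0) = 1 / (1 + exp(0.52391)) ≈ 0.372, and 0.372 * (2^32 - 2^7 - 1) ≈ 1.6e9, which is right around 0x5f3759df = 1_597_463_007. So fisr(x, -5.2391f0) above really is running with (approximately) the original magic constant.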
-
Did you try the tip about doing the Newton approximation by hand? I.e., something like this:
function loss_with_newton_approx(tree, dataset::Dataset{T,L}, options) where {T,L}
if tree.degree != 2  # need a binary node at the root so we can split into tree.l and tree.r
return L(10_000) # Return large loss if not split-able
end
f = tree.l
f_output, f_flag = eval_tree_array(f, dataset.X, options)
if !f_flag
return L(Inf)
end
g = tree.r
g_output, g_flag = eval_tree_array(g, dataset.X, options)
if !g_flag
return L(Inf)
end
# Now, we can compute the Newton approx:
y = f_output
@. y = y * (T(1.5) - (g_output * y * y))
# Second step (or more!) is optional (can remove it from final evaluation on microcontroller)
@. y = y * (T(1.5) - (g_output * y * y))
return sum(i -> (y[i] - dataset.y[i])^2, eachindex(y)) / length(y)
end
model = SRRegressor(; loss_function=loss_with_newton_approx, other_params...)
You can derive that update from Newton's method (applied to f(y) = 1/y^2 - x, one step gives y <- y * (1.5 - (x/2) * y * y), which is the y = y * (T(1.5) - (g_output * y * y)) line above, with g_output playing the role of x/2). Just keep in mind that the formulae it prints out are "unsplit" and you will need to print the equations yourself in pieces (splitting at the root of the tree, i.e. into tree.l and tree.r).
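For example, something like this (a sketch, assuming tree is one of the hall-of-fame expressions and options is the Options object used for the search; string_tree is the usual equation printer):
f_half = tree.l   # the initial guess for y
g_half = tree.r   # the term fed into the Newton update (the x/2-like piece)
println("initial guess: ", string_tree(f_half, options))
println("newton term:   ", string_tree(g_half, options))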
-
Other assorted tips/questions
-
Data generation: I might go with a bit larger dynamic range. Also, you probably don't need that many samples:
julia> X = (; input=Float32.(10 .^ collect(range(-3, 3, 1000))))
julia> y = @. Float32(1 / sqrt(big(X.input)));
(Which will also do the calculation with BigFloat.)
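And if you're curious whether the BigFloat intermediate actually changes anything, a quick spot-check (illustrative only):
julia> count(Float32.(1 ./ sqrt.(big.(X.input))) .!= 1 ./ sqrt.(X.input))  # number of targets whose last bits differ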
-
Lastly, for the actual loss, MSE might not be a great choice, as it might just ignore precision for very small values. You could try MSE in log space instead:
loss(prediction, target) = prediction <= 0 ? eltype(target)(Inf) : (log(prediction) - log(target))^2
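(For intuition: when the prediction is close to the target, log(prediction) - log(target) = log(prediction / target) ≈ (prediction - target) / target, so this acts like a squared relative error and weights the tiny targets at large x just as much as the big ones at small x.)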
-
Wow! Thank you so much for the tips. I feel I'm learning quite a bit as a product of this problem. I should clarify: the earlier runs actually took a few days, since I had turned up the iterations and cycles per iteration and left it running while I worked on other problems; the iteration count I posted here was just my "debug run" length. Sorry about that!

One thing I'm having a bit of a hard time understanding is the adaptive parsimony scaling. This encourages diversity in complexity, but if parsimony=0, will it still do anything? Using the tips you've provided, other than the custom loss function that includes a step of Newton's method (I wasn't able to get my head around the trees in the time I had this weekend), the code I have is below. This approaches reasonable answers much more quickly than before, and the answers seem to be a bit more precise; right now I think this can be attributed to using the loss in the log domain. Overall, the current approximations are great for most applications. But we still aren't able to do this naively; I assume that with the Newton step taken out of the search, it wouldn't have a problem. It's interesting, however, that we're still seeing "jittery" solutions every time. How often do you try to incorporate a derivative term in the loss, i.e., total_loss = loss(value) + loss(derivative)?

Finally, for the sake of learning, what advantage does turning off constant optimization have (tip #5)? Is letting the genetic algorithm handle the constants preferable when the parameter space is highly nonlinear or can't be navigated smoothly?

using MLJ, SymbolicRegression
using Plots
plotly()
X = (; input=Float32.(10 .^ collect(range(-3, 3, 1000))))
y = @. Float32(1 / sqrt(big(X.input)));
shift_right(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) >> 1)
shift_left(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) << 1)
function scaled_sigmoid(x::Float32, k::Float32)
return 1 / (1 + exp(-k * x))
end
map_float32_to_uint32(f::Float32) = floor(UInt32, ((scaled_sigmoid(f, 0.1f0)) * (2^32 - 2^7 - 1)))
magic_add(val::Float32, magic::Float32) = magic == 0f0 ? val : reinterpret(Float32, reinterpret(UInt32, val) + map_float32_to_uint32(magic))
magic_inverse(val::Float32) = reinterpret(Float32, -reinterpret(Int32, val))
function fisr(x::Float32, magicnum::Float32)
x2 = x * Float32(0.5)
i1 = magic_inverse(x)
i2 = shift_right(i1)
y = magic_add(i2, magicnum)
y = y * (Float32(1.5) - (x2 * y * y))
end
# see the mapping function, maybe this should just be a linear transformation?
plot(range(-100, 100, 100), map_float32_to_uint32.(Float32.(range(-1000, 1000, 100))))
loss(prediction, target) = prediction <= 0 ? eltype(target)(Inf) : (log(prediction) - log(target))^2
model = SRRegressor(
binary_operators=[*, +, magic_add],
unary_operators=[shift_right, magic_inverse, neg],
complexity_of_operators=[shift_right=>0.5, magic_inverse=>0.5, magic_add=>0.5],
niterations=100000,
ncycles_per_iteration=1000,
optimizer_nrestarts=4,
optimizer_algorithm="NelderMead",
parsimony=0.0,
maxsize=50,
adaptive_parsimony_scaling=1000, # or 100, as Miles mentioned. I'm still not clear on exactly what this does.
# should_optimize_constants=false, # more of a hail mary than something that will definitively improve things.
elementwise_loss=loss,
)
mach = machine(model, X, y)
fit!(mach)
# r = report(mach)
fitted_model = fitted_params(mach)
selected_equation = fitted_model.equations[9]
# testing various outputs
mach.model.selection_method = Returns(9)
s = plot()
scatter!(s, X.input, predict(mach, X), label="predict")
scatter!(s, X.input, fisr.(X.input, -5.2391f0), label="FISR")
plot!(s, X.input, y, label="data", linewidth=4)
-
I'm actually struggling to implement derivative information in the custom loss_function. Here's how I'm generating the targets and derivatives so far:
using Zygote
X = (; input=Float32.(10 .^ collect(range(-3, 3, 1000))))
y = @. Float32(1 / sqrt(big(X.input)));
dy = Float32.([Zygote.gradient(x -> 1/sqrt(x), val)[1] for val in big.(X.input)])
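(As a cross-check on the Zygote values, since the derivative of 1/sqrt(x) is just -x^(-3/2)/2, the same thing in closed form would be something like the line below; dy_closed is just an illustrative name.)
dy_closed = @. Float32(-(big(X.input)^-1.5) / 2)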
-
Absolutely no need to apologize! I'm enjoying learning the library and figuring out the issue. I really appreciate the time you've taken this far to assist on the problem. Yup! I use Zygote to calculate the derivative of the BigFloat ground truth, and planned on using FiniteDiff.jl on the candidate expressions' outputs. To clarify, I tried passing in a named tuple to get the data in as the weights.
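Concretely, the pattern I was going for looks like this (a sketch; it's the same thing I end up doing in the full script below, where the values come back inside the custom loss as dataset.weights):
mach = machine(model, X, y, dy)  # dy passed as the per-observation "weights"
fit!(mach)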
-
Alright, after a bit of effort I'm actually finding functions that, at first glance, appear to outperform the original FISR function. This ended up happening by including the Newton approximation in the loss function (as you first suggested) and without any derivative information. For the sake of learning, I'm actually struggling to get the derivative terms incorporated into the loss function in a way that doesn't constantly end up with NaNs or Infs everywhere, so I've left everything as-is except for the derivative loss being set to zero (so that by default the code should run correctly). To show the results so far, I made two plots:
As a side note, I was able to observe the very interesting behavior of the loss suddenly "clicking" and going from ~0.1 to 1e-7 instantly for high-complexity equations; then, over time, the algorithm would find ways to keep that order of loss while shaving down the complexity a piece at a time. I understand this is expected and possibly even routine for you, but it's exciting seeing it in action! The hunch I'm beginning to get from this is that FISR has multiple "fast" solutions, and finding the exact FISR magic number is not something we should expect from this type of approach. The FISR solution deviates in a very similar visual way to the other solutions, and about how we would expect for a theoretical solution with a complexity somewhere between 4 and 6. Code so far:
using MLJ, SymbolicRegression
using Plots
using Zygote
plotly()
X = (; input=Float32.(10 .^ collect(range(-3, 3, 1000))))
y = @. Float32(1 / sqrt(big(X.input)));
dy = Float32.([Zygote.gradient(x -> 1/sqrt(x), val)[1] for val in big.(X.input)])
shift_right(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) >> 1)
shift_left(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) << 1)
function scaled_sigmoid(x::Float32, k::Float32)
return 1 / (1 + exp(-k * x))
end
map_float32_to_uint32(f::Float32) = floor(UInt32, ((scaled_sigmoid(f, 0.1f0)) * (2^32 - 2^7 - 1)))
magic_add(val::Float32, magic::Float32) = magic == 0f0 ? val : reinterpret(Float32, reinterpret(UInt32, val) + map_float32_to_uint32(magic))
magic_inverse(val::Float32) = reinterpret(Float32, -reinterpret(Int32, val))
function fisr(x::Float32, magicnum::Float32)
x2 = x * Float32(0.5)
i1 = magic_inverse(x)
i2 = shift_right(i1)
y = magic_add(i2, magicnum)
y = y * (Float32(1.5) - (x2 * y * y))
end
# see the mapping function, maybe this should just be a linear transformation?
plot(range(-100, 100, 100), map_float32_to_uint32.(Float32.(range(-1000, 1000, 100))))
value_loss(prediction, target) = prediction <= 0 ? eltype(target)(Inf) : (log(prediction) - log(target))^2
function deriv_loss(prediction, target)
# scatter_loss = abs(log((abs(prediction)+1e-20) / (abs(target)+1e-20)))
# sign_loss = 10 * (sign(prediction) - sign(target))^2
# return (scatter_loss + sign_loss) / 100
return 0
end
# forward-difference derivative; careful: with Float32 inputs, h = 1e-10 is below eps(x) over most of this range,
# so x .+ h can round back to x and the estimated derivative collapses to zero
function numerical_derivative(f, x::AbstractArray{T}, h=T(1e-10)) where T
return (f(x .+ h) .- f(x)) ./ h
end
# evaluate the candidate tree, then apply `iters` hand-written Newton steps for 1/sqrt(x): y <- y * (3/2 - x*y^2/2)
function eval_with_newton(tree, x, options, iters=1)
pred_y, completed = eval_tree_array(tree, x, options)
for _ in 1:iters
pred_y .= pred_y .* (3/2 .- (pred_y .* pred_y .* @view x[1, :]) ./ 2)
end
return pred_y, completed
end
function loss_function(tree, dataset::Dataset{T,L}, options, idx) where {T,L}
y = dataset.y
dy = dataset.weights
total_loss::L = 0
# pred_y_improved = eval_with_newton(tree, dataset.X, options)
pred_y, completed = eval_with_newton(tree, dataset.X, options)
if !completed
return L(Inf) # grossly assume that if it ran on the dataset, then it will run on perturbations of those points (±ε of each point in the dataset for 1st deriv.)
end
function eval_without_flag(x)
return eval_with_newton(tree, x, options)[1]
end
pred_dy = numerical_derivative(eval_without_flag, dataset.X)
for i in eachindex(y)
vl = value_loss(pred_y[i], y[i])
dl = deriv_loss(pred_dy[i], dy[i])
total_loss += (vl + dl)
end
norm_loss = total_loss / length(y)
return norm_loss
end
model = SRRegressor(
binary_operators=[*, +, magic_add],
unary_operators=[shift_right, magic_inverse, neg],
complexity_of_operators=[shift_right=>1, magic_inverse=>1, magic_add=>1],
niterations=1000,
ncycles_per_iteration=100,
optimizer_nrestarts=4,
optimizer_algorithm="NelderMead",
parsimony=0.0,
maxsize=50,
adaptive_parsimony_scaling=1000, # or 100, as miles mentioned. I feel I have an unclear meaning of what this does.
# should_optimize_constants=false, # more of a hail mary than something that will definitively improve things.
loss_function=loss_function,
progress=true, # ??? when loss functions are provided it doesn't print by default anymore???
)
mach = machine(model, X, y, dy)
fit!(mach)
r = report(mach)
fitted_model = fitted_params(mach)
# testing various outputs
s = plot()
scatter!(s, X.input, log.(fisr.(X.input, -5.2391f0)), label="FISR")
ds = Dataset(reshape(X.input, 1, :))
options = Options(; unary_operators=[shift_right, magic_inverse, neg], binary_operators=[*, +, magic_add])
loss_cutoff = r.losses[end] * 100 # only show equations "on the order of" the best loss
complexity_cutoff = 10 # and equations with less than this many operations
for (i, eq) in enumerate(fitted_model.equations)
if r.losses[i] <= loss_cutoff && r.complexities[i] < complexity_cutoff
complexity = r.complexities[i]
loss = r.losses[i]
x = X.input
pred_y = eval_with_newton(eq, ds.X, options)[1]
scatter!(s, x, log.(pred_y), label=string(complexity) * "→ logloss=" * string(round(log10(loss); digits=2)))
end
end
plot!(s, X.input, log.(y), label="data", linewidth=4)
# deviation curve
s = plot()
plot!(s, X.input, log.(fisr.(X.input, -5.2391f0)) .- log.(y), label="FISR", linewidth=2)
ds = Dataset(reshape(X.input, 1, :))
options = Options(; unary_operators=[shift_right, magic_inverse, neg], binary_operators=[*, +, magic_add])
loss_cutoff = r.losses[end] * 10 # only show equations "on the order of" the best loss
complexity_cutoff = 12 # and equations with less than this many operations
for (i, eq) in enumerate(fitted_model.equations)
if r.losses[i] <= loss_cutoff && r.complexities[i] < complexity_cutoff
complexity = r.complexities[i]
loss = r.losses[i]
x = X.input
ypred = eval_with_newton(eq, ds.X, options)[1]
plot!(s, x, log.(ypred) .- log.(y), label=string(complexity) * "→ logloss=" * string(round(log10(loss); digits=2)))
end
end
display(s)
plot!(s, X.input, zeros(length(y)), label="data", linewidth=4)
# why is the equation of complexity 8 so good? it likely isn't being plotted correctly
function unusual_equation(x)
v1 = magic_add(magic_inverse(shift_right(Float32(x))) * Float32(0.998277), Float32(-0.09932381)) * shift_right(Float32(0.3022318))
v2 = v1 .* (3/2 .- (v1 .* v1 .* x) ./ 2)
return v2
end
x = range(0.1, 10, 100)
y_unusual = unusual_equation.(x)
y_exact = 1 ./ sqrt.(big.(x))
s = plot(y_exact, y_unusual .- y_exact) # yup! random bug.
-
This is so awesome. Nice work!!
The normal eval_tree_array(tree, X, options) call returns both the predictions and a completion flag that you would have to thread through numerical_derivative as well. Luckily, you can use the short-form version tree(X, options), which simply returns NaNs where the evaluation fails. So, here is how I would modify this:
# Inside loss function:
pred_dy = numerical_derivative(x -> tree(x, options), dataset.X)
if !all(isfinite, pred_dy)
return L(Inf)
end
Does this make sense?
Side note: this makes me want to get MilesCranmer/SymbolicRegression.jl#135 actually working (all the pieces are there; just need to get the mutations working appropriately). That way it would basically let you evolve the Newton approximation method as well, because it would be able to "re-use" variables further up the tree.
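To make that concrete, your eval_with_newton could then be written with the short form like this (just a sketch):
# tree(x, options) returns NaNs when an evaluation fails, so an isfinite check afterwards catches bad expressions
function eval_with_newton_short(tree, x, options, iters=1)
    y = tree(x, options)
    xrow = @view x[1, :]
    for _ in 1:iters
        @. y = y * (3f0 / 2 - (y * y * xrow) / 2)
    end
    return y
end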
-
Regarding learning the Newton approximation from scratch (or another graph-like calculation), I've made some progress on SymbolicML/DynamicExpressions.jl#56. Once all of that is done (it will take a bit of work), it should be possible to use that to learn "programs" rather than just tree-like expressions. So it should be useful for your use case.
-
Whoops! I wish I had seen that. I was even using the function in your runtests.jl to generate random trees in an effort to benchmark the loss function's allocations. Thanks! I've rewritten those functions using the new method. I've also uploaded this as a more elegant writeup here, instead of posting the code with every update :)

I'm getting much worse results when incorporating derivative information, even when playing around with the weight that the derivative term has (i.e., C*deriv_loss + value_loss, where I mess with C) as well as with tricks in the derivative loss function itself. We're already getting decent answers without it, though, so maybe it's just unnecessary!

For generalizing the learning to graphs rather than just trees: that would be very exciting! This would be super useful for physical models, where similar terms tend to show up often, especially when units and unit conversions matter.