-
Hey everyone! Super excited to start learning this package and adding it to my toolkit. I'm trying to learn its buttons and knobs by re-discovering the famous fast inverse square root (FISR) function from Quake. So far the search doesn't seem to be discovering it, or really finding much that resembles the function. The reason I'm tackling this one in particular is that I vaguely remember Miles mentioning during a talk on the package that this is possible, so first and foremost please correct me if I'm hallucinating, but secondly let me know if the code to do this is already out there somewhere! Here's my code so far:
using SymbolicRegression
using MLJ
X = (; input=Float32.(rand(10:100, 100)))
y = 1 ./ sqrt.(X.input)
shiftright(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) >> 1)
shiftleft(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) << 1)
model = SRRegressor(
binary_operators=[-, *, +],
unary_operators=[shiftright, shiftleft],
niterations=1000,
)
mach = machine(model, X, y)
fit!(mach)
using Plots
plotly()
# testing various outputs
c6(x) = ((shiftright(x) * -3.2614e+17) - -0.3356)
c17(input) = (0.39054 - ((input * 0.38849) * ((0.041157 - (input * (((input - 0.38849) * -2.6506e-06) - -0.00055827))) * 0.6586)))
c14(input) = ((shiftright(shiftleft(shiftleft(input * Float32(-1.4716e+17))) + Float32(7.9136e+34)) - (input * Float32(0.0048464))) - Float32(-0.30194))
s = scatter(X.input, y, label="data")
scatter!(X.input, c6.(X.input), label="6")
scatter!(X.input, c14.(X.input), label="14")
scatter!(X.input, c17.(X.input), label="17")
So we're certainly getting reasonable answers! Just not the famous 6-8 operation answer. I wanted to ask you all for ideas and tips on how to better explore the function space here, and whether I may be doing something wrong or missing some critical aspects of the search. Thanks!
-
Oh, and if anyone would like to verify the FISR in Julia:
function fisr(x::Float32)
x2 = x * Float32(0.5);
i = reinterpret(Int32, x)
i = 0x5f3759df - (i >> 1)
y = reinterpret(Float32, i)
y = y * (Float32(1.5) - (x2 * y * y))
end
inputs = Float32.(range(1, 100, 1000))
out_approx = fisr.(inputs)
out_actual = 1 ./ sqrt.(inputs)
s = plot(inputs, out_actual)
plot!(inputs, out_approx)
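As an extra check (not part of my original snippet), the classic single-Newton-step FISR is usually quoted at a worst-case relative error of roughly 0.17%, so something like this should land in that ballpark:
maximum(abs.(out_approx .- out_actual) ./ out_actual)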
-
Hey @Boxylmer,

You indeed are not hallucinating, I did that search maybe 1-2 years ago and occasionally mention it in talks. I'm happy to hear you are interested in this type of thing; it would be cool to have a reproducible example in the docs! I'm not sure if I'll be able to find the original script as it was 2 laptops ago... but I will dig around.

You have the overall right idea with the operators if I recall correctly, but be sure to look at the tuning page: https://astroautomata.com/PySR/tuning/. You'll also probably need to tweak the complexity of each operator (based on the number of FLOPs – I think I measured those with https://github.com/triscale-innov/GFlops.jl), etc., which will encourage it to use the bitshift operator. You might also want to throw more compute at the problem on a multi-node cluster. It might have also been that the Newton iteration steps (the y = y * (1.5f0 - (x2 * y * y)) part) were applied by hand inside the loss function rather than left for the search to discover.

I'm trying to also remember if I tweaked any part of the code related to the constants... (since optimizing the constants for this problem is a bit tricky as it's related to the bits, rather than a continuous value). I don't think I hacked anything, but it could have just been that I disabled constant optimization (a parameter), or switched to NelderMead (another parameter), or something else. Sorry I'm not more helpful here... But yeah I'd be eager to help get this example working again!
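For the FLOP-counting idea, a rough sketch of what I mean (this assumes GFlops.jl's @count_ops macro, and the complexities at the end are purely illustrative, not the values I actually used):
using GFlops
@count_ops 1.5f0 * 2.0f0     # cost of a plain multiply
@count_ops shiftright(1.5f0) # the bit-twiddling ops show up as essentially zero FLOPs
# ...then pass complexities roughly in proportion to those counts, e.g.
# SRRegressor(...; complexity_of_operators=[shiftright => 1, (*) => 1, (+) => 1], ...)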
-
Hey Miles! Thanks for the detailed and fast response.

Oh yeah, I originally gave it way too much freedom and made an arbitrary bit shift, but quickly realized it never used it, and settled on the shift-by-one methods above. The docs on PySR were super helpful! I should have checked over there as well rather than just the Julia docs (which I think you actually mentioned too: that PySR is more thoroughly documented). The complexity trick is super clever! Just assign the average cost of each computation as its complexity. I don't have too much experience with SLURM clusters, but I want to get into it, and I have a few laptops sitting around that I could use to set one up eventually!

I found that the pool of equations stagnated after a bit of time (or at least the hall of fame did). Is this something I should actually just leave running for longer? The data has zero noise, so I figured it was just not finding the magic numbers. On that note, magic numbers for bit shifting are tricky: the number is massive and usually written in hexadecimal. It seems unlikely that Optim would be able to find it unless we search for it in log space instead, but I'm not immediately sure how to set that up, and it also feels like cheating.

I'm also willing to put some more effort into this, as I think finding FISR-style tricks for specific types could prove super useful for some microcontrollers, and I was hoping to make enough headway to write a short communication on it. Thanks for the tips!
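One thing I might try for the log-space idea (just a sketch; magic_add_log is a made-up helper and the clamp bound is arbitrary) is to let the searched constant live in log space and exponentiate it before turning it into bits, so the optimizer walks a ~21-ish value instead of ~1.6e9 (log(0x5f3759df) ≈ 21.2):
magic_add_log(val::Float32, c::Float32) =
    reinterpret(Float32, reinterpret(UInt32, val) + UInt32(round(clamp(exp(Float64(c)), 0.0, 4.0e9))))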
-
@MilesCranmer checking in with some updates! The current script is finding a lot of pretty decent approximations that almost look like piecewise functions, but I'm stuck here, I believe, and am not sure how to get it to go further and find FISR. The key steps I've taken (all visible in the code below) are: assigning reduced complexities to the bit-twiddling operators, adding a magic_add operator whose constant is mapped through a scaled sigmoid onto the UInt32 range, adding a magic_inverse operator, and switching the constant optimizer to NelderMead.
I was genuinely surprised that it wasn't able to find it, even with me messing with a few hyperparameters and letting it run overnight. I'm hoping you might be able to point me in the right direction, as this is a really interesting problem that I'm hoping to use to find other FISR-like tricks for microcontrollers. One hunch I'm getting is that there might be some way to punish non-smooth functions, i.e. those piecewise-looking results that you typically get from bit shifting. Telling it that there should only be N instances of certain operators feels like cheating, so I'm not at that point yet, haha. Here you can see that FISR is near-perfect, but two good functions I found, with complexities of 6 and 8 respectively, look piecewise-like.
using MLJ, SymbolicRegression
X = (; input=Float32.(range(0.1, 100, 10000)))
y = 1 ./ sqrt.(X.input)
shift_right(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) >> 1)
shift_left(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) << 1)
function scaled_sigmoid(x::Float32, k::Float32)
return 1 / (1 + exp(-k * x))
end
map_float32_to_uint32(f::Float32) = floor(UInt32, ((scaled_sigmoid(f, 0.1f0)) * (2^32 - 2^7 - 1)))
magic_add(val::Float32, magic::Float32) = magic == 0f0 ? val : reinterpret(Float32, reinterpret(UInt32, val) + map_float32_to_uint32(magic))
magic_inverse(val::Float32) = reinterpret(Float32, -reinterpret(Int32, val))
function fisr(x::Float32, magicnum::Float32)
x2 = x * Float32(0.5)
i1 = magic_inverse(x)
i2 = shift_right(i1)
y = magic_add(i2, magicnum)
y = y * (Float32(1.5) - (x2 * y * y))
end
# see the mapping function, maybe this should just be a linear transformation?
plot(range(-100, 100, 100), map_float32_to_uint32.(Float32.(range(-1000, 1000, 100))))
model = SRRegressor(
binary_operators=[*, +, magic_add],
unary_operators=[shift_right, magic_inverse, neg],
complexity_of_operators=[shift_right=>0.5, magic_inverse=>0.5, magic_add=>0.5],
niterations=100,
ncycles_per_iteration=5000,
optimizer_nrestarts=4,
optimizer_algorithm="NelderMead",
)
mach = machine(model, X, y)
fit!(mach)
using Plots
plotly()
# testing various outputs
test_x = (; input=Float32.(range(0.05, 10, 1000)))
test_y = 1 ./ sqrt.(test_x.input)
c6(input) = magic_inverse(magic_add(neg(shift_right(input * Float32(-4.1931e-20))), Float32(-2.3726)))
c8(input) = magic_add(magic_inverse(shift_right(input) * Float32(0.88152)), Float32(-0.18826)) * Float32(8.441e-20)
c19(input) = (magic_inverse((shift_right(shift_right(input)) * Float32(1.5938e+29)) * (Float32(1.5938e+29) + magic_add(magic_inverse(shift_right(shift_right(Float32(-0.54733) + input))), Float32(-0.54733)))) * magic_inverse(shift_right(shift_right(input * Float32(5.4562)))))
s = plot()
plot!(s, X.input, y, label="data")
scatter!(s, test_x.input, c8.(test_x.input), label="8")
scatter!(s, test_x.input, c6.(test_x.input), label="6")
scatter!(s, test_x.input, c19.(test_x.input), label="19")
scatter!(s, test_x.input, fisr.(test_x.input, -5.2391f0), label="FISR")
plot!(s, test_x.input, test_y, label="test_answers")
# what's the expected loss if we get the right answer with L2 loss?
inputs = Float32.(range(1, 100, 1000))
out_actual = 1 ./ sqrt.(inputs)
out_approx = fisr.(inputs, -5.2391f0)  # FISR with the fitted magic constant
s = plot(inputs, out_actual)
scatter!(s, inputs, out_approx)
sum((out_approx .- out_actual) .^ 2) / length(out_actual) # "perfect" loss
# 0x5f3759df
map_float32_to_uint32(-5.2391f0) # magic numbers!
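Working that last line through by hand: scaled_sigmoid(-5.2391f0, 0.1f0) = 1 / (1 + exp(0.52391)) ≈ 0.372, and 0.372 * (2^32 - 2^7 - 1) ≈ 1.6e9, which is right around 0x5f3759df = 1_597_463_007. So fisr(x, -5.2391f0) above really is running with (approximately) the original magic constant.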
-
Did you try the tip about doing the Newton approximation by hand? I.e., something like this:
function loss_with_newton_approx(tree, dataset::Dataset{T,L}, options) where {T,L}
if tree.degree != 2  # need a binary node at the root so we can split into tree.l and tree.r
return L(10_000) # Return large loss if not split-able
end
f = tree.l
f_output, f_flag = eval_tree_array(f, dataset.X, options)
if !f_flag
return L(Inf)
end
g = tree.r
g_output, g_flag = eval_tree_array(g, dataset.X, options)
if !g_flag
return L(Inf)
end
# Now, we can compute the Newton approx:
y = f_output
@. y = y * (T(1.5) - (g_output * y * y))
# Second step (or more!) is optional (can remove it from final evaluation on microcontroller)
@. y = y * (T(1.5) - (g_output * y * y))
return sum(i -> (y[i] - dataset.y[i])^2, eachindex(y)) / length(y)
end
model = SRRegressor(; loss_function=loss_with_newton_approx, other_params...)
You can derive that update from Newton's method (applied to f(y) = 1/y^2 - x, one step gives y <- y * (1.5 - (x/2) * y * y), which is the y = y * (T(1.5) - (g_output * y * y)) line above, with g_output playing the role of x/2). Just keep in mind that the formulae it prints out are "unsplit" and you will need to print the equations yourself in pieces (splitting at the root of the tree, i.e. into tree.l and tree.r).
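For example, something like this (a sketch, assuming tree is one of the hall-of-fame expressions and options is the Options object used for the search; string_tree is the usual equation printer):
f_half = tree.l   # the initial guess for y
g_half = tree.r   # the term fed into the Newton update (the x/2-like piece)
println("initial guess: ", string_tree(f_half, options))
println("newton term:   ", string_tree(g_half, options))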
-
Other assorted tips/questions
-
Data generation: I might go with a bit larger dynamic range. Also, you probably don't need that many samples:
julia> X = (; input=Float32.(10 .^ collect(range(-3, 3, 1000))))
julia> y = @. Float32(1 / sqrt(big(X.input)));
(Which will also do the calculation with BigFloat.)
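And if you're curious whether the BigFloat intermediate actually changes anything, a quick spot-check (illustrative only):
julia> count(Float32.(1 ./ sqrt.(big.(X.input))) .!= 1 ./ sqrt.(X.input))  # number of targets whose last bits differ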
-
Lastly, for the actual loss, MSE might not be a great choice, as it might just ignore precision for very small values. You could try MSE in log space instead:
loss(prediction, target) = prediction <= 0 ? eltype(target)(Inf) : (log(prediction) - log(target))^2
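(For intuition: when the prediction is close to the target, log(prediction) - log(target) = log(prediction / target) ≈ (prediction - target) / target, so this acts like a squared relative error and weights the tiny targets at large x just as much as the big ones at small x.)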
-
Wow! Thank you so much for the tips. I feel I'm learning quite a bit as a product of this problem. I should clarify: the earlier runs actually took a few days, since I had turned up the iterations and cycles per iteration and left it running while I worked on other problems; the iteration count I posted here was just my "debug run" length. Sorry about that!

One thing I'm having a bit of a hard time understanding is the adaptive parsimony scaling. This encourages diversity in complexity, but if parsimony=0, will it still do anything? Using the tips you've provided, other than the custom loss function that includes a step of Newton's method (I wasn't able to get my head around the trees in the time I had this weekend), the code I have is below. This approaches reasonable answers much more quickly than before, and the answers seem to be a bit more precise; right now I think this can be attributed to using the loss in the log domain. Overall, the current approximations are great for most applications. But we still aren't able to do this naively; I assume that with the Newton step taken out of the search, it wouldn't have a problem. It's interesting, however, that we're still seeing "jittery" solutions every time. How often do you try to incorporate a derivative term in the loss, i.e., total_loss = loss(value) + loss(derivative)?

Finally, for the sake of learning, what advantage does turning off constant optimization have (tip #5)? Is letting the genetic algorithm handle the constants preferable when the parameter space is highly nonlinear or can't be navigated smoothly?

using MLJ, SymbolicRegression
using Plots
plotly()
X = (; input=Float32.(10 .^ collect(range(-3, 3, 1000))))
y = @. Float32(1 / sqrt(big(X.input)));
shift_right(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) >> 1)
shift_left(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) << 1)
function scaled_sigmoid(x::Float32, k::Float32)
return 1 / (1 + exp(-k * x))
end
map_float32_to_uint32(f::Float32) = floor(UInt32, ((scaled_sigmoid(f, 0.1f0)) * (2^32 - 2^7 - 1)))
magic_add(val::Float32, magic::Float32) = magic == 0f0 ? val : reinterpret(Float32, reinterpret(UInt32, val) + map_float32_to_uint32(magic))
magic_inverse(val::Float32) = reinterpret(Float32, -reinterpret(Int32, val))
function fisr(x::Float32, magicnum::Float32)
x2 = x * Float32(0.5)
i1 = magic_inverse(x)
i2 = shift_right(i1)
y = magic_add(i2, magicnum)
y = y * (Float32(1.5) - (x2 * y * y))
end
# see the mapping function, maybe this should just be a linear transformation?
plot(range(-100, 100, 100), map_float32_to_uint32.(Float32.(range(-1000, 1000, 100))))
loss(prediction, target) = prediction <= 0 ? eltype(target)(Inf) : (log(prediction) - log(target))^2
model = SRRegressor(
binary_operators=[*, +, magic_add],
unary_operators=[shift_right, magic_inverse, neg],
complexity_of_operators=[shift_right=>0.5, magic_inverse=>0.5, magic_add=>0.5],
niterations=100000,
ncycles_per_iteration=1000,
optimizer_nrestarts=4,
optimizer_algorithm="NelderMead",
parsimony=0.0,
maxsize=50,
adaptive_parsimony_scaling=1000, # or 100, as Miles mentioned. I'm still not clear on exactly what this does.
# should_optimize_constants=false, # more of a hail mary than something that will definitively improve things.
elementwise_loss=loss,
)
mach = machine(model, X, y)
fit!(mach)
# r = report(mach)
fitted_model = fitted_params(mach)
selected_equation = fitted_model.equations[9]
# testing various outputs
mach.model.selection_method = Returns(9)
s = plot()
scatter!(s, X.input, predict(mach, X), label="predict")
scatter!(s, X.input, fisr.(X.input, -5.2391f0), label="FISR")
plot!(s, X.input, y, label="data", linewidth=4)
-
I'm actually struggling to implement derivative information in the custom loss_function. Here's how I'm generating the targets and derivatives so far:
using Zygote
X = (; input=Float32.(10 .^ collect(range(-3, 3, 1000))))
y = @. Float32(1 / sqrt(big(X.input)));
dy = Float32.([Zygote.gradient(x -> 1/sqrt(x), val)[1] for val in big.(X.input)])
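(As a cross-check on the Zygote values, since the derivative of 1/sqrt(x) is just -x^(-3/2)/2, the same thing in closed form would be something like the line below; dy_closed is just an illustrative name.)
dy_closed = @. Float32(-(big(X.input)^-1.5) / 2)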
-
Absolutely no need to apologize! I'm enjoying learning the library and figuring out the issue. I really appreciate the time you've taken this far to assist on the problem. Yup! I use Zygote to calculate the derivative of the BigFloat ground truth, and planned on using FiniteDiff.jl on the candidate expressions' outputs. To clarify, I tried passing in a named tuple to get the data in as the weights.
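Concretely, the pattern I was going for looks like this (a sketch; it's the same thing I end up doing in the full script below, where the values come back inside the custom loss as dataset.weights):
mach = machine(model, X, y, dy)  # dy passed as the per-observation "weights"
fit!(mach)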
-
Alright, after a bit of effort I'm actually finding functions that, at first glance, appear to outperform the original FISR function. This ended up happening by including the Newton approximation in the loss function (as you first suggested) and without any derivative information. For the sake of learning, I'm actually struggling to get the derivative terms incorporated into the loss function in a way that doesn't constantly end up with NaNs or Infs everywhere, so I've left everything as-is except for the derivative loss being set to zero (so that by default the code should run correctly). To show the results so far, I made two plots:
As a side note, I was able to observe the very interesting behavior of the loss suddenly "clicking" and going from ~0.1 to 1e-7 instantly for high-complexity equations; then, over time, the algorithm would find ways to keep that order of loss while shaving down the complexity a piece at a time. I understand this is expected and possibly even routine for you, but it's exciting seeing it in action! The hunch I'm beginning to get from this is that FISR has multiple "fast" solutions, and finding the exact FISR magic number is not something we should expect from this type of approach. The FISR solution deviates in a very similar visual way to the other solutions, and about how we would expect for a theoretical solution with a complexity somewhere between 4 and 6. Code so far:
using MLJ, SymbolicRegression
using Plots
using Zygote
plotly()
X = (; input=Float32.(10 .^ collect(range(-3, 3, 1000))))
y = @. Float32(1 / sqrt(big(X.input)));
dy = Float32.([Zygote.gradient(x -> 1/sqrt(x), val)[1] for val in big.(X.input)])
shift_right(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) >> 1)
shift_left(a::Float32) = reinterpret(Float32, reinterpret(Int32, a) << 1)
function scaled_sigmoid(x::Float32, k::Float32)
return 1 / (1 + exp(-k * x))
end
map_float32_to_uint32(f::Float32) = floor(UInt32, ((scaled_sigmoid(f, 0.1f0)) * (2^32 - 2^7 - 1)))
magic_add(val::Float32, magic::Float32) = magic == 0f0 ? val : reinterpret(Float32, reinterpret(UInt32, val) + map_float32_to_uint32(magic))
magic_inverse(val::Float32) = reinterpret(Float32, -reinterpret(Int32, val))
function fisr(x::Float32, magicnum::Float32)
x2 = x * Float32(0.5)
i1 = magic_inverse(x)
i2 = shift_right(i1)
y = magic_add(i2, magicnum)
y = y * (Float32(1.5) - (x2 * y * y))
end
# see the mapping function, maybe this should just be a linear transformation?
plot(range(-100, 100, 100), map_float32_to_uint32.(Float32.(range(-1000, 1000, 100))))
value_loss(prediction, target) = prediction <= 0 ? eltype(target)(Inf) : (log(prediction) - log(target))^2
function deriv_loss(prediction, target)
# scatter_loss = abs(log((abs(prediction)+1e-20) / (abs(target)+1e-20)))
# sign_loss = 10 * (sign(prediction) - sign(target))^2
# return (scatter_loss + sign_loss) / 100
return 0
end
# forward-difference derivative; careful: with Float32 inputs, h = 1e-10 is below eps(x) over most of this range,
# so x .+ h can round back to x and the estimated derivative collapses to zero
function numerical_derivative(f, x::AbstractArray{T}, h=T(1e-10)) where T
return (f(x .+ h) .- f(x)) ./ h
end
# evaluate the candidate tree, then apply `iters` hand-written Newton steps for 1/sqrt(x): y <- y * (3/2 - x*y^2/2)
function eval_with_newton(tree, x, options, iters=1)
pred_y, completed = eval_tree_array(tree, x, options)
for _ in 1:iters
pred_y .= pred_y .* (3/2 .- (pred_y .* pred_y .* @view x[1, :]) ./ 2)
end
return pred_y, completed
end
function loss_function(tree, dataset::Dataset{T,L}, options, idx) where {T,L}
y = dataset.y
dy = dataset.weights
total_loss::L = 0
# pred_y_improved = eval_with_newton(tree, dataset.X, options)
pred_y, completed = eval_with_newton(tree, dataset.X, options)
if !completed
return L(Inf) # grossly assume that if it ran on the dataset, then it will run on perturbations of those points (±ε of each point in the dataset for 1st deriv.)
end
function eval_without_flag(x)
return eval_with_newton(tree, x, options)[1]
end
pred_dy = numerical_derivative(eval_without_flag, dataset.X)
for i in eachindex(y)
vl = value_loss(pred_y[i], y[i])
dl = deriv_loss(pred_dy[i], dy[i])
total_loss += (vl + dl)
end
norm_loss = total_loss / length(y)
return norm_loss
end
model = SRRegressor(
binary_operators=[*, +, magic_add],
unary_operators=[shift_right, magic_inverse, neg],
complexity_of_operators=[shift_right=>1, magic_inverse=>1, magic_add=>1],
niterations=1000,
ncycles_per_iteration=100,
optimizer_nrestarts=4,
optimizer_algorithm="NelderMead",
parsimony=0.0,
maxsize=50,
adaptive_parsimony_scaling=1000, # or 100, as miles mentioned. I feel I have an unclear meaning of what this does.
# should_optimize_constants=false, # more of a hail mary than something that will definitively improve things.
loss_function=loss_function,
progress=true, # ??? when loss functions are provided it doesn't print by default anymore???
)
mach = machine(model, X, y, dy)
fit!(mach)
r = report(mach)
fitted_model = fitted_params(mach)
# testing various outputs
s = plot()
scatter!(s, X.input, log.(fisr.(X.input, -5.2391f0)), label="FISR")
ds = Dataset(reshape(X.input, 1, :))
options = Options(; unary_operators=[shift_right, magic_inverse, neg], binary_operators=[*, +, magic_add])
loss_cutoff = r.losses[end] * 100 # only show equations "on the order of" the best loss
complexity_cutoff = 10 # and equations with less than this many operations
for (i, eq) in enumerate(fitted_model.equations)
if r.losses[i] <= loss_cutoff && r.complexities[i] < complexity_cutoff
complexity = r.complexities[i]
loss = r.losses[i]
x = X.input
pred_y = eval_with_newton(eq, ds.X, options)[1]
scatter!(s, x, log.(pred_y), label=string(complexity) * "→ logloss=" * string(round(log10(loss); digits=2)))
end
end
plot!(s, X.input, log.(y), label="data", linewidth=4)
# deviation curve
s = plot()
plot!(s, X.input, log.(fisr.(X.input, -5.2391f0)) .- log.(y), label="FISR", linewidth=2)
ds = Dataset(reshape(X.input, 1, :))
options = Options(; unary_operators=[shift_right, magic_inverse, neg], binary_operators=[*, +, magic_add])
loss_cutoff = r.losses[end] * 10 # only show equations "on the order of" the best loss
complexity_cutoff = 12 # and equations with less than this many operations
for (i, eq) in enumerate(fitted_model.equations)
if r.losses[i] <= loss_cutoff && r.complexities[i] < complexity_cutoff
complexity = r.complexities[i]
loss = r.losses[i]
x = X.input
ypred = eval_with_newton(eq, ds.X, options)[1]
plot!(s, x, log.(ypred) .- log.(y), label=string(complexity) * "→ logloss=" * string(round(log10(loss); digits=2)))
end
end
display(s)
plot!(s, X.input, zeros(length(y)), label="data", linewidth=4)
# why is the equation of complexity 8 so good? it likely isn't being plotted correctly
function unusual_equation(x)
v1 = magic_add(magic_inverse(shift_right(Float32(x))) * Float32(0.998277), Float32(-0.09932381)) * shift_right(Float32(0.3022318))
v2 = v1 .* (3/2 .- (v1 .* v1 .* x) ./ 2)
return v2
end
x = range(0.1, 10, 100)
y_unusual = unusual_equation.(x)
y_exact = 1 ./ sqrt.(big.(x))
s = plot(y_exact, y_unusual .- y_exact) # yup! random bug.
-
This is so awesome. Nice work!!
The normal eval_tree_array(tree, X, options) call returns both the predictions and a completion flag that you would have to thread through numerical_derivative as well. Luckily, you can use the short-form version tree(X, options), which simply returns NaNs where the evaluation fails. So, here is how I would modify this:
# Inside loss function:
pred_dy = numerical_derivative(x -> tree(x, options), dataset.X)
if !all(isfinite, pred_dy)
return L(Inf)
end
Does this make sense?
Side note: this makes me want to get MilesCranmer/SymbolicRegression.jl#135 actually working (all the pieces are there; just need to get the mutations working appropriately). That way it would basically let you evolve the Newton approximation method as well, because it would be able to "re-use" variables further up the tree.
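To make that concrete, your eval_with_newton could then be written with the short form like this (just a sketch):
# tree(x, options) returns NaNs when an evaluation fails, so an isfinite check afterwards catches bad expressions
function eval_with_newton_short(tree, x, options, iters=1)
    y = tree(x, options)
    xrow = @view x[1, :]
    for _ in 1:iters
        @. y = y * (3f0 / 2 - (y * y * xrow) / 2)
    end
    return y
end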
-
Regarding learning the Newton approximation from scratch (or another graph-like calculation), I've made some progress on SymbolicML/DynamicExpressions.jl#56. Once all of that is done (it will take a bit of work), it should be possible to use that to learn "programs" rather than just tree-like expressions. So it should be useful for your use case.
-
Whoops! I wish I had seen that. I was even using the function in your runtests.jl to generate random trees in an effort to benchmark the loss function's allocations. Thanks! I've rewritten those functions using the new method. I've also uploaded this as a more elegant writeup here, instead of posting the code with every update :)

I'm getting much worse results when incorporating derivative information, even when playing around with the weight that the derivative term has (i.e., C*deriv_loss + value_loss, where I mess with C) as well as with tricks in the derivative loss function itself. We're already getting decent answers without it, though, so maybe it's just unnecessary!

For generalizing the learning to graphs rather than just trees: that would be very exciting! This would be super useful for physical models, where similar terms tend to show up often, especially when units and unit conversions matter.