Unexpected sum Performance Delta vs Unitful #55
I was running some testing to see if I could boost performance in a package of mine by converting from Unitful, but ran into some unexpected benchmark results where Unitful seems to perform about an order of magnitude faster when `sum`ming vectors of Quantities. Here's an MWE with benchmark results from my system (`versioninfo` at bottom). I'm basically just instantiating a 1000-length vector of Cartesian length vectors and then `sum`ming it. Any ideas about what's going on here?

I first defined a generic struct for Cartesian vectors. Using the `CoordinateCartesian` struct with Unitful Quantities sets the baseline; using the same struct with DynamicQuantities Quantities is roughly an order of magnitude slower. Next I tried replacing the struct with the built-in `QuantityArray` type, which was much slower still, apparently due to allocations. How about a simple `StaticVector` of Quantities? This gets us back to the neighborhood of the struct with DynamicQuantities, but still slower than the Unitful version of the same. Again, how does this `SVector` version compare to one with Unitful?

System info:

Comments
In my reading, you didn't try out the array-of-arrays approach with Unitful, right? That would be interesting to see, but right now it seems like the direct point of comparison between Unitful and DynamicQuantities here is just the factor of 10 between the first two cases. Edit: you did indeed do this with StaticArrays, I missed that, but I don't think there's a benchmark for Unitful with regular arrays.
@mikeingold rather than an array of `CoordinateCartesian` structs, have you tried a `QuantityArray`? One option is for us to write a custom `sum` for this kind of data.
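(For illustration, a minimal sketch of what such a custom `sum` could look like; `fast_sum` is a hypothetical name, and it only applies when every element shares the same dimensions, which it checks explicitly. It uses `dimension`, `ustrip`, and the `Quantity(value, dims)` constructor from DynamicQuantities.)

```julia
using DynamicQuantities

# Hypothetical sketch: hoist the (shared) dimensions out of the reduction,
# sum the raw Float64 values, then reattach the dimensions once at the end.
function fast_sum(arr::AbstractVector{<:Quantity})
    d = dimension(first(arr))
    # This shortcut is only valid if every element has the same dimensions
    all(q -> dimension(q) == d, arr) || error("elements must share dimensions")
    return Quantity(sum(ustrip, arr), d)
end
```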
I think the issue is that you are calling the `repeat` (i.e., the array construction) outside of the benchmark, so that work isn't being timed. I think a fair comparison would be to include the `repeat` in the benchmark itself.
The original intent here was just to test the performance of some basic operations, e.g. what if I have a collection of vectors that I need to sum? In that sense it didn't feel right to time the `repeat`, since it was "just" being used to set up the test, but I do suppose it's a fair counterpoint that maybe the more fundamental performance traits of DynamicQuantities live in that space, rather than in the basic operations themselves. I'll try to take another look at these tests tonight to see if there's a more direct apples-to-apples test to be run. I still do think it's interesting that the performance delta of `sum` was an order of magnitude, though. I'm not sure how much of that is accounted for by some of the heavy lifting being performed by the compiler itself, or maybe it's just an artifact of how much of the vector can fit into cache.
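(For reference, a minimal sketch of the distinction being discussed — construction excluded vs. included in the timing — using BenchmarkTools' `setup` keyword; the plain `Vector{Float64}` is just a stand-in for the quantity vectors above.)

```julia
using BenchmarkTools

# Construction excluded: `setup` runs outside the timed region,
# so only the reduction itself is measured.
@benchmark sum(arr) setup = (arr = rand(1000))

# Construction included: building the vector is part of the timed region.
@benchmark sum(rand(1000))
```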
I've done some more testing and am still getting similar results. Constructing the vectors of structs inside the benchmark closed the gap somewhat, but there's still about a 3.5x difference between Unitful and DynamicQuantities. At this point, I suspect the reason is that the Unitful types in this problem are already sufficiently constrained that type inference and allocations aren't the big tent-pole to begin with. If that's the case, I definitely hadn't expected such a significant delta in what seems to be the performance floor between the two packages, but I suppose it makes sense (units being handled at compile time vs. compute time).

```julia
using DynamicQuantities
using Unitful
using BenchmarkTools

# Cartesian Coordinate with Quantity types
struct CoordinateCartesian{L}
    x::L
    y::L
    z::L
end

Base.:+(u::CoordinateCartesian, v::CoordinateCartesian) =
    CoordinateCartesian(u.x + v.x, u.y + v.y, u.z + v.z)

# Test: Sum an N-length vector of CoordinateCartesian using default DynamicQuantities
function test_DynamicQuantities(N)
    arr = [CoordinateCartesian(DynamicQuantities.Quantity(rand(), length=1),
                               DynamicQuantities.Quantity(rand(), length=1),
                               DynamicQuantities.Quantity(rand(), length=1)) for i in 1:N]
    sum(arr)
end

# Test: Sum an N-length vector of CoordinateCartesian using compact DynamicQuantities
function test_DynamicQuantities_R8(N)
    R8 = Dimensions{DynamicQuantities.FixedRational{Int8,6}}
    arr = [CoordinateCartesian(DynamicQuantities.Quantity(rand(), R8, length=1),
                               DynamicQuantities.Quantity(rand(), R8, length=1),
                               DynamicQuantities.Quantity(rand(), R8, length=1)) for i in 1:N]
    sum(arr)
end

# Test: Sum an N-length vector of CoordinateCartesian using Unitful
function test_Unitful(N)
    arr = [CoordinateCartesian(Unitful.Quantity(rand(), Unitful.@u_str("m")),
                               Unitful.Quantity(rand(), Unitful.@u_str("m")),
                               Unitful.Quantity(rand(), Unitful.@u_str("m"))) for i in 1:N]
    sum(arr)
end

bench_dq    = @benchmark test_DynamicQuantities($1000) evals=100
bench_dq_r8 = @benchmark test_DynamicQuantities_R8($1000) evals=100
bench_uf    = @benchmark test_Unitful($1000) evals=100
```
Results:

```julia
julia> bench_dq
BenchmarkTools.Trial: 969 samples with 100 evaluations.
 Range (min … max):  24.643 μs … 81.427 μs  ┊ GC (min … max):  0.00% … 35.44%
 Time  (median):     48.954 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   51.681 μs ± 10.649 μs  ┊ GC (mean ± σ):  12.43% ± 16.48%

                    ▄ ▂   ▁▂ ▁█
  ▂▁▁▂▃▃▂▂▁▂▃▁▁▁▂▂▃▃▆█▇▄▇█████▆███▄▃▃▂▃▂▂▁▂▂▃▃▃▃▄▃▄▄▅▄▃▃▅▄▃▃▄ ▃
  24.6 μs        Histogram: frequency by time         76.2 μs <

 Memory estimate: 117.23 KiB, allocs estimate: 2.

julia> bench_dq_r8
BenchmarkTools.Trial: 146 samples with 100 evaluations.
 Range (min … max):  329.198 μs … 382.411 μs  ┊ GC (min … max): 0.00% … 4.22%
 Time  (median):     345.723 μs               ┊ GC (median):    4.60%
 Time  (mean ± σ):   345.099 μs ±   8.225 μs  ┊ GC (mean ± σ):  2.77% ± 2.48%

     ▂    ▅▅▆  ▃█▂▅▅▃▂ ▅ ▃
  ▅▅▁▄▇▁▁█▄▁▁█▅███▇▅▅▄▄▁▁▇▁▅▄███████▁█▇▅▁█▁█▄█▁▇▇▄▅▄▄▄▁▄▁▁▄▄▁▁▄ ▄
  329 μs         Histogram: frequency by time          364 μs <

 Memory estimate: 250.16 KiB, allocs estimate: 7005.

julia> bench_uf
BenchmarkTools.Trial: 3537 samples with 100 evaluations.
 Range (min … max):   7.580 μs … 53.050 μs  ┊ GC (min … max):  0.00% … 75.62%
 Time  (median):     13.517 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   14.124 μs ±  6.536 μs  ┊ GC (mean ± σ):  10.01% ± 15.33%

  ▇▂ ▁ ▃▇▇██▅▂▂▁▂▂                                           ▂
  ██▇██████████████▆▁▁▁▁▁▁▁▁▁▁▃▁▅▆▅▆▆▇▇▆▄▅▆▆▄▃▄▁▄▄▃▄▇▆▇▆▃▆▇█▇ █
  7.58 μs      Histogram: log(frequency) by time      45.8 μs <

 Memory estimate: 23.48 KiB, allocs estimate: 2.
```

Summary: The Unitful version of this test ran about 3.5x faster than the vanilla DynamicQuantities version. Running the DynamicQuantities version with a more compact type was apparently an order of magnitude slower.
Writing it like

```julia
# (assumes something like: import DynamicQuantities as DQ)
function test_dq_naive(N)
    arr = [
        CoordinateCartesian(
            rand() * DQ.@u_str("m"),
            rand() * DQ.@u_str("m"),
            rand() * DQ.@u_str("m")
        )
        for _ in 1:N
    ]
    sum(arr)
end

bench_dq_naive = @benchmark test_dq_naive(1000) evals=100
```

gives me a min time of 17.264 μs compared to Unitful's 7.581 μs, so a 2.3x difference. (This is just because the `@u_str` macro computes the quantity at compile time rather than constructing it at runtime.)

This is obviously still not great. However, this is really tricky, because

```julia
struct CoordinateCartesian{L}
    x::L
    y::L
    z::L
end
```

is optimal for Unitful.jl (the units are baked into the type parameter `L`), whereas for DynamicQuantities the dimensions are stored as runtime data in every field. To make things faster for DynamicQuantities, you need to wrap stuff in a `QuantityArray`.
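(To illustrate that distinction — a sketch; the exact printed type parameters vary by package version.)

```julia
using Unitful, DynamicQuantities

# Unitful: the unit lives in the *type parameters*, so a homogeneous array has
# a concrete element type and the unit bookkeeping resolves at compile time.
typeof(1.0 * Unitful.@u_str("m"))

# DynamicQuantities: the dimensions are an ordinary *field*, so every element
# carries a runtime Dimensions value that `sum` must check and combine.
typeof(DynamicQuantities.Quantity(1.0, length=1))
```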
Interesting. I'd actually guessed the opposite, that calling the constructor function directly would've been less complicated than a macro, but that makes sense.

Is the following a fair implementation of this advice, adapting from the prior pattern? These results are roughly in the neighborhood of what I'm seeing for `test_DynamicQuantities_R8`.

```julia
# Test: Sum an N-length vector of QuantityArrays
function test_QuantityArray(N)
    arr = [DynamicQuantities.QuantityArray([rand(), rand(), rand()], DynamicQuantities.@u_str("m")) for _ in 1:N]
    sum(arr)
end
```

```
BenchmarkTools.Trial: 219 samples with 100 evaluations.
 Range (min … max):  211.921 μs … 279.226 μs  ┊ GC (min … max): 0.00% … 5.85%
 Time  (median):     231.468 μs               ┊ GC (median):    6.02%
 Time  (mean ± σ):   228.822 μs ±   9.042 μs  ┊ GC (mean ± σ):  3.93% ± 3.10%

            ▂▁ ▂ ▂                    █ ▇ ▃ ▄
  ▅▄▁▁▄▁▅▁▄▃▄▄██▇█▇▆▆▇▃▆▅▆▁▁▃▁▃▃▁▁▁▁▁▁▆▆▅▃▆█▁▇▁█▄██▇█▇█▅█▄▆▆▃▃▄ ▄
  212 μs         Histogram: frequency by time          240 μs <

 Memory estimate: 273.41 KiB, allocs estimate: 3001.
```

The more I'm looking at this, the more it seems like I got lucky and intuited my existing Unitful code into a relatively optimal state, where the types are constrained/defined enough to avoid a lot of the big performance pitfalls. The only real remaining "issue" in my mind is just how surprising it was that there seems to be a delta in what you might call the performance floor of each package, i.e. that it is possible for Unitful to be faster in very specific situations. I'm planning to do some more testing with more complicated expressions to see if the type constraints continue to hold the line, or at what point they break down. Are you on the Julia Discourse @MilesCranmer? Maybe it would be better to migrate this topic there vs. it being tracked as a concrete "Issue" here?
Almost, but not quite. Since you are summing coordinates along the sample axis, the `QuantityArray` needs to also wrap the sample axis (in order for Julia to remove the dimensional analysis). So, e.g., all N samples should live inside a single `QuantityArray`, rather than one small `QuantityArray` per sample. Unitful gets around this via a sophisticated set of promotion rules: when you create an array of types parametrized to the same unit, the unit becomes part of the array's concrete element type, so no per-element dimension checks are needed. So to get closer in performance you need to also tell Julia that all the units are the same, which you can do with a `QuantityArray`.
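(Concretely, a minimal sketch of the two layouts, assuming the `QuantityArray(values, quantity)` constructor and the exported `u"..."` macro; exact timings will vary.)

```julia
using DynamicQuantities

N = 1000

# One small QuantityArray per sample: N separate Dimensions objects and
# N allocations, so `sum` still does per-element unit bookkeeping.
per_sample = [QuantityArray([rand(), rand(), rand()], u"m") for _ in 1:N]
sum(per_sample)

# One QuantityArray wrapping the whole sample axis: a single Dimensions
# object for all N×3 values, so the reduction is plain Float64 arithmetic.
whole_axis = QuantityArray(rand(N, 3), u"m")
sum(whole_axis; dims=1)
```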
Sounds good to me, it should be a nice way to get additional performance tips.
@mikeingold I wrote a little example in #49 for how I think you should actually do this (once that PR merges). First, make the coords:

```julia
struct Coords
    x::Float64
    y::Float64
end

# Define arithmetic operations on Coords
Base.:+(a::Coords, b::Coords) = Coords(a.x + b.x, a.y + b.y)
Base.:-(a::Coords, b::Coords) = Coords(a.x - b.x, a.y - b.y)
Base.:*(a::Coords, b::Number) = Coords(a.x * b, a.y * b)
Base.:*(a::Number, b::Coords) = Coords(a * b.x, a * b.y)
Base.:/(a::Coords, b::Number) = Coords(a.x / b, a.y / b)
```

We can then build a `GenericQuantity` out of a `Coords`:

```julia
coord1 = GenericQuantity(Coords(0.3, 0.9), length=1)
coord2 = GenericQuantity(Coords(0.2, -0.1), length=1)
```

and perform operations on these:

```julia
coord1 + coord2 |> uconvert(us"cm")
# (Coords(50.0, 80.0)) cm
```

The nice part about this is it only stores a single `Dimensions` (or `SymbolicDimensions`) object for the whole struct. Then, we can build an array like so:

```julia
function test_QuantityArray(N)
    coord_array = QuantityArray([GenericQuantity(Coords(rand(), rand()), length=1) for i=1:N])
    sum(coord_array)
end
```

This `QuantityArray` likewise stores only a single `Dimensions` object for the entire array.
I'm getting near-identical performance to a regular array now!!

```julia
julia> @btime sum(coord_array) setup=(N=1000; coord_array=QuantityArray([GenericQuantity(Coords(rand(), rand()), length=1) for i=1:N]))
  1.113 μs (0 allocations: 0 bytes)
(Coords(501.7717111461543, 494.36328730797095)) m

julia> @btime sum(array) setup=(N=1000; array=[Coords(rand(), rand()) for i=1:N])
  1.087 μs (0 allocations: 0 bytes)
Coords(505.4496129866645, 507.2903371535713)
```
Sorry for the delay, haven't had as much time lately to work on this. I'm excited to try out the new update when it's available on General! At this point I'd propose using PR #49 as justification to close this Issue.
Cool! Closing with v0.8.0.