Conversation

@LilithHafner (Member)
Instead of using a pre-allocated linear product tree (i.e. `foldl(mul!, arr)`), use a balanced binary product tree with pre-allocated linear leaves of length 16. This is far from optimal, but it is already often an order of magnitude or two faster than the current implementation. It can be sub-optimal by up to a factor of `log2(n)`; that bound is reached when there is one big element and the rest are small and nonzero, in which case this implementation performs `log2(n)` large multiplications while the optimal strategy performs only 1. The current implementation is sub-optimal by a factor of `n/log2(n)` when all the elements are of similar size.

Fixes #59052
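The balanced-tree strategy described above can be sketched generically (a simplified illustration only — the PR's actual code works through GMP's `mpz` API and pre-allocates the leaf results; `tree_prod` is an illustrative name):

```julia
# Simplified sketch of the divide-and-conquer product. Splitting in half keeps
# the two operands of each multiplication at similar bit lengths, which is what
# makes the tree shape much cheaper than a left-to-right fold, where one operand
# keeps growing while the other stays small.
function tree_prod(arr, lo=firstindex(arr), hi=lastindex(arr))
    if hi - lo + 1 <= 16              # small leaf: plain left-to-right fold
        p = one(eltype(arr))
        for i in lo:hi
            p *= arr[i]
        end
        return p
    end
    mid = (lo + hi) ÷ 2
    return tree_prod(arr, lo, mid) * tree_prod(arr, mid + 1, hi)
end

tree_prod(big.(1:100)) == factorial(big(100))  # true
```

Intuition for the speedup: with `n` similarly sized `b`-bit factors and schoolbook multiplication, the linear fold's accumulator grows to `Θ(n*b)` bits, giving `Θ(n^2 * b)` total work, while the balanced tree's multiplications at each of the `log2(n)` levels touch `Θ(n*b)` bits in total.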

Benchmarks:

[Figure: benchmark scatter plot — cycles per bit (log scale) vs. test case, for each of the three implementations]
Methodology and test cases

function prod_default(arr::AbstractArray{BigInt})
    invoke(prod, Tuple{AbstractArray{<:Real}}, arr)
end

"Assumes `x != 0`. Computes `top_set_bit(abs(x))`."
@eval Base.GMP function _top_set_bit(x::BigInt)
    x.size * BITS_PER_LIMB - GC.@preserve x leading_zeros(unsafe_load(x.d, abs(x.size)))
end

@eval Base.GMP function prod_proposed(arr::AbstractArray{BigInt})
    any(iszero, arr) && return zero(BigInt)
    _prod_proposed(arr, firstindex(arr), lastindex(arr))
end
@eval Base.GMP function _prod_proposed(arr::AbstractArray{BigInt}, lo, hi)
    if hi - lo + 1 <= 16
        nbits = BITS_PER_LIMB
        for i in lo:hi
            nbits += _top_set_bit(arr[i])
        end
        init = BigInt(; nbits)
        MPZ.set_si!(init, 1)
        for i in lo:hi
            MPZ.mul!(init, arr[i])
        end
        init
    else
        mid = (lo + hi) ÷ 2
        MPZ.mul!(_prod_proposed(arr, lo, mid), _prod_proposed(arr, mid+1, hi))
    end
end
prod_proposed(x) = Base.GMP.prod_proposed(x)

using Chairmarks, Plots
test_cases = [
    vcat(big(rand(UInt128))^1000, fill(big(3), 10000)),
    vcat(big(rand(UInt128))^1000, fill(big(2), 10000)),
    vcat(big(rand(UInt128))^1000, fill(big(1), 10000)),
    vcat(big(rand(UInt128))^1000, fill(big(0), 10000)),
    vcat(fill(big(3), 10000), big(rand(UInt128))^1000),#5
    vcat(fill(big(2), 10000), big(rand(UInt128))^1000),
    vcat(fill(big(1), 10000), big(rand(UInt128))^1000),
    vcat(fill(big(0), 10000), big(rand(UInt128))^1000),
    [big(rand(UInt128))^8 for _ in 1:5000],
    [big(rand(UInt128))^8 for _ in 1:50],#10
    [big(rand(UInt128))^8 for _ in 1:5],
    [big(rand(UInt128))^4 for _ in 1:10000],
    [big(rand(UInt128))^2 for _ in 1:10000],
    [big(rand(UInt128)) for _ in 1:10000],
    [big(rand(UInt64)) for _ in 1:10000],#15
    [big(rand(UInt32)) for _ in 1:10000],
    [big(rand(UInt16)) for _ in 1:10000],
    [big(rand(1:3)) for _ in 1:100000],
    [big(rand(UInt16)) for _ in 1:1000],
    [big(rand(UInt16)) for _ in 1:10],#20
    collect(big.(1:10_000)),
]
bits() = sum.(Base.top_set_bit, test_cases)

function benchmark()
    [(@b x prod,prod_default,prod_proposed) for x in test_cases]
end

labels() = ["prod", "prod_default", "prod_proposed"]

x = @time benchmark()

M = 3.5e9 * max.(25e-9, transpose(stack([collect(getproperty.(t, :time)) for t in x]))) ./ bits()

yt = [.0005,.0006,.0007,.0008,.0009,.001,.002,.003,.004,.005,.006,.007,.008,.009,.01,.02,.03,.04,.05,.06,.07,.08,.09,.1,.2,.3,.4,.5,.6,.7,.8,.9,1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100,200,300,400,500,600,700,800,900,1000,2000]; 
scatter(M, yaxis=:log, labels=permutedims(labels()), alpha=.8, 
    yticks=(yt, [x / 10.0^floor(Int, log10(x)) > 1 ? "" : isinteger(x) ? string(Int(x)) : string(x) for x in yt]),
    ylabel="Cycles per bit", xlabel="Test case")

@LilithHafner LilithHafner added performance Must go faster bignums BigInt and BigFloat labels Sep 1, 2025
@adienes (Member) commented Sep 2, 2025

will this break on AbstractArrays that don't support linear indexing?

@oscardssmith (Member)

IIUC AbstractArrays support linear indexing (it just might be slower since it has to do the conversion to cartesian). For a case like this, it's probably worth it no matter what since the indexing will be faster than the BigInt multiplication.
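A quick check of this point (illustrative only, unrelated to the PR code): linear indexing works on `IndexCartesian` arrays too, because Julia converts the linear index to a Cartesian one internally.

```julia
# A strided view whose canonical index style is Cartesian, not linear.
A = view(reshape(1:12, 3, 4), 1:2, 1:3)
Base.IndexStyle(A)   # IndexCartesian()

# Linear indexing still works; Julia does the index conversion under the hood.
A[4] == A[2, 2]      # true: linear index 4 maps to Cartesian (2, 2)
```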

LilithHafner and others added 2 commits September 3, 2025 12:29
Co-authored-by: Alex Arslan <ararslan@comcast.net>
@LilithHafner (Member, Author)

@oscardssmith, are you willing to give this a final review?

LilithHafner and others added 4 commits September 20, 2025 07:41
Tests sign handling in the new divide-and-conquer prod implementation to ensure
negative BigInts are processed correctly with the abs(size) fix.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
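The sign behavior those tests cover can be sanity-checked with a minimal example (plain Base `prod` shown here; the new implementation must agree with it):

```julia
# An even number of negative factors yields a positive product,
# an odd number a negative one.
xs = [big(-3), big(5), big(-7), big(2)]
prod(xs) == big(210)                    # true: two negatives cancel
prod(vcat(xs, big(-1))) == big(-210)    # true: third negative flips the sign
```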
@LilithHafner (Member, Author)

Bump :)

@LilithHafner LilithHafner merged commit 191be7c into master Sep 20, 2025
7 checks passed
@LilithHafner LilithHafner deleted the lh/big-prod branch September 20, 2025 17:38
xal-0 pushed a commit to xal-0/julia that referenced this pull request Sep 30, 2025
Closed issue: prod(::AbstractVector{BigInt}) is O(n^2) when it could be O(n log n)