Slowdown when using alloc! #33

Open
sumiya11 opened this issue Jan 7, 2024 · 4 comments

@sumiya11

sumiya11 commented Jan 7, 2024

Hi,

I have the following example where I observe a 2x slowdown with Bumper.alloc!.

Could you please confirm that I am using the package correctly?
Do you have ideas on how to fix this?

Thank you!

using BenchmarkTools, Bumper

function work0(polys; use_custom_allocator=false)
    if use_custom_allocator
        custom_allocator = Bumper.SlabBuffer()
        @no_escape custom_allocator begin
            work1(polys, custom_allocator)
        end
    else
        work1(polys)
    end
end

# Very important work
function work1(polys, custom_allocator=nothing)
    res = 0
    for poly in polys
        new_poly = work2(poly, custom_allocator)
        res += sum(new_poly)
    end
    res
end

function work2(poly::Vector{T}, ::Nothing) where {T}
    new_poly = Vector{T}(undef, length(poly))
    work3!(new_poly)
end

function work2(poly::Vector{T}, custom_allocator) where {T}
    new_poly = Bumper.alloc!(custom_allocator, T, length(poly))
    work3!(new_poly)
end

function work3!(poly::AbstractVector{T}) where {T}
    poly[1] = one(T)
    for i in 2:length(poly)
        poly[i] = convert(T, i)^3 - poly[i - 1]
    end
    poly
end

###

m, n = 1_000, 10_000
polys = [rand(UInt32, rand(1:m)) for _ in 1:n];

@btime work0(polys, use_custom_allocator=false)
#   6.461 ms (10001 allocations: 20.26 MiB)
# 0x0000e2e1c67cdb19

@btime work0(polys, use_custom_allocator=true)
#   14.154 ms (6 allocations: 608 bytes)
# 0x0000e2e1c67cdb19

Running on

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC) 
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
@tjjarvinen

Your problem is that you don't release the allocated arrays, and thus end up allocating quite a bit out of the buffer. If you correct this, you get a small performance increase over the "normal" version:

using BenchmarkTools, Bumper

function work0(polys; use_custom_allocator=false)
    if use_custom_allocator
        custom_allocator = Bumper.SlabBuffer()
        work1(polys, custom_allocator)
    else
        work1(polys)
    end
end

# Very important work
function work1(polys, custom_allocator)
    res = 0
    for poly in polys
        @no_escape custom_allocator begin
            new_poly = work2(poly, custom_allocator)
            res += sum(new_poly)
        end
    end
    res
end

function work1(polys)
    res = 0
    for poly in polys
        new_poly = work2(poly)
        res += sum(new_poly)
    end
    res
end


function work2(poly::Vector{T}) where {T}
    new_poly = Vector{T}(undef, length(poly))
    work3!(new_poly)
end

function work2(poly::Vector{T}, custom_allocator) where {T}
    new_poly = Bumper.alloc!(custom_allocator, T, length(poly))
    work3!(new_poly)
end

function work3!(poly::AbstractVector{T}) where {T}
    poly[1] = one(T)
    for i in 2:length(poly)
        poly[i] = convert(T, i)^3 - poly[i - 1]
    end
    poly
end

###

m, n = 1_000, 10_000
polys = [rand(UInt32, rand(1:m)) for _ in 1:n];

@btime work0(polys, use_custom_allocator=false)
#   3.430 ms (10001 allocations: 20.20 MiB)
#0x0000e20ca5f991fb

@btime work0(polys, use_custom_allocator=true)
#  3.032 ms (4 allocations: 192 bytes)
#0x0000e20ca5f991fb

The code is not that well optimized, so it will give you only a very small performance improvement.
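
One more note on the measurement itself: since the two timings are now close, it may be worth interpolating the non-constant global polys into @btime so that global-variable access is not part of what is measured. A minimal sketch of the same calls (assuming the definitions above are already loaded):

using BenchmarkTools

# Interpolating `polys` with `$` benchmarks the function itself rather than
# access to a non-const global; the relative difference should stay the same.
@btime work0($polys, use_custom_allocator=false)
@btime work0($polys, use_custom_allocator=true)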

@sumiya11
Author

> Your problem is that you don't release the allocated arrays, and thus end up allocating quite a bit out of the buffer.

But that is precisely the point!

@tjjarvinen

tjjarvinen commented Jan 11, 2024

The problem is that, when you don't release arrays that are out of scope, you end up using much more memory than you actually need. In this example your data should fit into the L1 cache (a little over 8 kB). But if you don't release the temporary data, you end up with a buffer that does not fit into cache at all (about 20 MB).

Edit: the default SlabBuffer is about 8 MB, so it should fit in L3 cache.
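
To make the sizes concrete, here is a rough back-of-envelope calculation using the m and n from the example above (an estimate, not a measurement):

# Average poly length is (1 + m)/2 ≈ 500 UInt32 elements ≈ 2 kB per temporary.
m, n = 1_000, 10_000
bytes_per_temp = (1 + m) / 2 * sizeof(UInt32)   # ≈ 2 kB
n * bytes_per_temp / 2^20                       # ≈ 19 MiB live in the buffer if nothing is released
m * sizeof(UInt32) / 2^10                       # ≈ 4 KiB peak temporary when released every iteration

So the released version only ever touches the current input poly (up to ~4 kB) plus one temporary of the same size, which is why it stays cache-friendly.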

@sumiya11
Author

Thanks for your answer; sorry for not being clear, let me explain.

My example is purposefully inefficient.
In my original problem, releasing intermediate arrays cannot be done easily.
The allocation pattern in my example here roughly resembles the one in my original problem.
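
For illustration only, a hypothetical sketch of the kind of pattern I mean (not the actual code): every intermediate array stays reachable after its iteration, so there is no obvious point inside the loop where it could be released.

# Hypothetical sketch: each intermediate is kept for later use, so none of
# them can be reclaimed inside the loop, and the buffer keeps growing.
function work1_keep_all(polys, custom_allocator)
    intermediates = Vector{AbstractVector}()
    for poly in polys
        new_poly = Bumper.alloc!(custom_allocator, eltype(poly), length(poly))
        push!(intermediates, work3!(new_poly))   # still needed after the loop
    end
    sum(sum, intermediates)
end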
