
show(io::IO, int) optimization #41415

Open · green-nsk wants to merge 18 commits into master

Conversation

@green-nsk

This is for #41396

As you'll notice, I couldn't quite figure out the threading/initialization part. Two main problems there:

  • The Threads module isn't loaded yet at the point intfuncs.jl is being loaded
  • If you make code work before Threads is loaded, it's not clear how to switch it to thread-compatible code after Threads is loaded

I could use some help there.

Otherwise, I think the change is quite straightforward. Some benchmarks:

# master:  61.000 ns (2 allocations: 88 bytes)
# after:   30.409 ns (0 allocations: 0 bytes)
@btime show($iobuf, 123456)
# master:  94.616 ns (2 allocations: 104 bytes)
# after:   73.560 ns (0 allocations: 0 bytes)
@btime show($iobuf, UInt(123456))
# master:  502.962 μs (32 allocations: 29.31 KiB)
# after:   505.108 μs (29 allocations: 29.17 KiB)
@btime show($iobuf, Ptr{Nothing}())

# master:  81.733 ns (2 allocations: 88 bytes)
# master, print(io, signed) = show(io, signed): 61.129 ns (2 allocations: 88 bytes)
# after:   30.411 ns (0 allocations: 0 bytes)
@btime print($iobuf, 123456)
# master:  61.317 ns (2 allocations: 88 bytes)
# after:   31.791 ns (0 allocations: 0 bytes)
@btime print($iobuf, UInt(123456))
# master:  501.052 μs (32 allocations: 29.31 KiB)
# after:   505.025 μs (29 allocations: 29.17 KiB)
@btime print($iobuf, Ptr{Nothing}())

The string(::Integer) call benchmarks about the same as master, as expected.

base/intfuncs.jl — review thread (outdated, resolved)
@JeffBezanson (Member)

If you look at the top of pcre.jl, it had a similar problem of needing the thread id before Threads is loaded. It has its own implementation of threadid for bootstrapping purposes that you could copy to Base.jl to use more widely.
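For reference, the bootstrap trick in pcre.jl boils down to a raw ccall into the runtime (a sketch; the _tid name matches the diff hunk further down — the runtime's id is 0-based, so +1 gives Julia's 1-based convention):

# Thread id without the Threads module, usable during bootstrap.
_tid() = Int(ccall(:jl_threadid, Int16, ())) + 1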

@JeffBezanson added the domain:io (Involving the I/O subsystem: libuv, read, write, etc.) and performance (Must go faster) labels on Jun 30, 2021
@JeffBezanson (Member)

Ah right, this has to be task-local, not thread-local, since a task can block when trying to print something. Here's one way we used to do it:

function getbuf()
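The quoted snippet is truncated in this page. A minimal sketch of the task-local-storage pattern it alludes to, with a hypothetical :print_buf key and buffer size, might look like:

# Task-local scratch buffer: each Task gets its own, so blocking mid-print
# can't hand the buffer to another task. :print_buf is a hypothetical key.
function getbuf()
    tls = task_local_storage()
    buf = get(tls, :print_buf, nothing)
    if buf === nothing
        buf = Vector{UInt8}(undef, 256)
        tls[:print_buf] = buf
    end
    return buf::Vector{UInt8}
end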

@green-nsk (Author) commented Jul 1, 2021

Hooking up with TLS really slows it down. How about using alloca() instead?

Update: alloca isn't really compatible with how llvmcall() works: llvmcall() creates a new stack frame, so the allocated storage is released as soon as the call returns. Never mind reviewing that.

@green-nsk (Author)

So this time it works as expected. I also moved string(int) to use stack-allocated buffers, which helps with benchmarks too:

# master:  36.241 ns (2 allocations: 104 bytes)
# after:   24.208 ns (1 allocation: 40 bytes)
@btime string(123456; base = 2)
# master:  36.595 ns (2 allocations: 88 bytes)
# after:   25.597 ns (1 allocation: 24 bytes)
@btime string(123456; base = 8)
# master:  45.193 ns (2 allocations: 88 bytes)
# after:   27.618 ns (1 allocation: 24 bytes)
@btime string(123456)
# master:  35.552 ns (2 allocations: 88 bytes)
# after:   25.723 ns (1 allocation: 24 bytes)
@btime string(123456; base = 16)
# master:  36.212 ns (2 allocations: 104 bytes)
# after:   26.046 ns (1 allocation: 40 bytes)
@btime string(Unsigned(123456); base = 2)
# master:  36.160 ns (2 allocations: 88 bytes)
# after:   25.882 ns (1 allocation: 24 bytes)
@btime string(Unsigned(123456); base = 8)
# master:  45.268 ns (2 allocations: 88 bytes)
# after:   27.702 ns (1 allocation: 24 bytes)
@btime string(Unsigned(123456))
# master:  35.605 ns (2 allocations: 88 bytes)
# after:   24.031 ns (1 allocation: 24 bytes)
@btime string(Unsigned(123456); base = 16)

@vtjnash (Member) commented Jul 2, 2021

alloca is also thread-local, and is therefore prohibited for this use case for the same reason.

@green-nsk (Author) commented Jul 2, 2021

Sorry, it's been a moving target, but I think that's all I wanted to push for this PR. Some benchmarks for floats:

# before:  147.485 ns (2 allocations: 432 bytes)
# after:   120.259 ns (1 allocation: 24 bytes)
# inbounds:  113.345 ns (1 allocation: 24 bytes)
@btime string(123.456)
# before:  184.321 ns (2 allocations: 432 bytes)
# after:   142.257 ns (0 allocations: 0 bytes)
# inbounds:  140.736 ns (0 allocations: 0 bytes)
@btime print($iobuf, 123.456)
# before:  154.228 ns (2 allocations: 432 bytes)
# after:   119.835 ns (0 allocations: 0 bytes)
# inbounds:  112.298 ns (0 allocations: 0 bytes)
@btime show($iobuf, 123.456)

base/ryu/exp.jl — review thread (outdated, resolved)
@StefanKarpinski (Member)

I worry about the Scratch type being useful and therefore leaking out of Base, so it seems like it would be good to pick a clearer name and maybe even consider exporting it. We have a Scratch package that is part of the stdlib, which obviously implements something quite different. Maybe call this type ScratchBuffer?

@green-nsk (Author)

I've renamed Scratch -> ScratchBuffer as well as related names.

If exporting anything, I'd say export with_scratch_buffer(n). But we can always make that call later; it seems orthogonal to the purpose of this PR.

base/scratch_buffer.jl — review thread (outdated, resolved)
@green-nsk (Author)

Is there anything else required?

@StefanKarpinski (Member)

Bump.

@vtjnash (Member) commented Aug 16, 2021

As I mentioned above in #41415 (comment), this still attempts to use stack-allocated buffers in most cases, which is unsafe here and therefore prohibited.

@green-nsk (Author) commented Aug 18, 2021

This is how the buffer was allocated before the change: a = StringVector(n); here's how it's allocated after the change: buf = Ref(ScratchBufferInline()). As far as I know, neither form prescribes a particular way of allocating it.

A naive compiler would allocate both on the heap. Conversely, Julia 2.0 could optimize away the heap operations for small, scope-contained, fixed-size vectors or Strings. The compiler we currently have optimizes away a Ref where it can prove there are no references to it outside the scope, but doesn't do the same for vectors or strings. So it makes sense to prefer the Ref form.

This optimization is exploited widely in the codebase I'm working on, and I'm sure I could find examples where it's already being used in Julia Base if I tried. There's no reason the Base library shouldn't benefit from an optimization the Julia compiler team has delivered.
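To illustrate the pattern under discussion (a sketch with hypothetical names and sizes, not the PR's actual definitions):

# A fixed-size isbits buffer; a Ref to it that provably doesn't escape can be
# kept off the heap by the compiler, unlike a Vector or String of the same size.
struct ScratchBufferInline
    data::NTuple{64,UInt8}
    ScratchBufferInline() = new()  # field left undefined; contents are garbage
end

function demo()
    buf = Ref(ScratchBufferInline())
    GC.@preserve buf begin
        p = Ptr{UInt8}(pointer_from_objref(buf))
        unsafe_store!(p, UInt8('!'), 1)  # fill via raw pointer, digit by digit
        unsafe_load(p, 1)
    end
end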

@vtjnash (Member) commented Aug 18, 2021

Hopefully you don't find any places where we use stack memory for IO, as I have taken great pains to ensure that does not happen.

@green-nsk (Author)

Can you elaborate: what's inherently unsafe about using stack-allocated buffers as opposed to heap-allocated ones?

@vtjnash (Member) commented Aug 18, 2021

stack memory can become inaccessible while the Task is not running and is therefore forbidden from being passed across Tasks, while heap memory is stable and can thus be freely moved

@green-nsk (Author)

Thank you, I think that makes sense. So we could actually use this trick for converting to string, but not for the IO part? And all because some hypothetical IO could decide to do the actual writing from another task?

> stack memory can become inaccessible while the Task is not running

I struggle to think of a real-world situation where this is the case. Is this non-x86?

@green-nsk (Author)

More interestingly, if tomorrow the Julia compiler added an optimization that stack-allocates short-lived, scope-local Strings, would we no longer be able to print() them?

@vtjnash (Member) commented Aug 18, 2021

Most (system) IO is done from another Task. And yes, if we do that optimization, it will cause that problem, so we will need to address that first.

@green-nsk (Author)

I don't really know what to do about this. Performance gains are substantial when using stack buffers, at least in the applications I've tried. It may not be the case for everyone, but for people writing a lot of JSON or logs it can add up quickly. Not sure it all has to be in Base, though; not my call to make.

I see the following options:

  1. We find a way to identify "unsafe" IO, whether by platform or by some other attribute/environment. Then we keep using stack buffers where safe, and everything else uses heap buffers (or converts to a string prior to printing). String conversion can keep using stack buffers everywhere.
  2. Use heap buffers via task-local storage for everything. Not ideal for my use cases, but safe.
  3. Abandon this effort altogether and keep the changes private in the codebase they came from.

I'd prefer option 1, but again it's not my call to make, as I'm still not sure whether it's even possible to identify the platforms/situations where stack memory is truly inaccessible from other tasks.

@green-nsk (Author)

With these changes, ScratchBuffer is always dynamically allocated, but we avoid (relatively) expensive TLS lookups and key on threadid() instead.
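A minimal sketch of that design, assuming the names visible in the diff hunk near the end of this thread (SCRATCH_BUFFERS, _tid, with_scratch_buffer); sizes and details are illustrative, and a real version must also cope with task migration and adopted threads:

# One heap buffer per thread, keyed on a bootstrap thread id; an in-use flag
# makes re-entrant calls fall back to a plain allocation instead of clobbering
# the buffer already handed out on this thread.
const SCRATCH_BUFFERS = Vector{Tuple{Vector{UInt8},Base.RefValue{Bool}}}()

_tid() = Int(ccall(:jl_threadid, Int16, ())) + 1  # usable before Threads loads

function init_scratch_buffers(nthreads::Int)
    resize!(SCRATCH_BUFFERS, nthreads)
    for i in 1:nthreads
        SCRATCH_BUFFERS[i] = (Vector{UInt8}(undef, 256), Ref(false))
    end
end

function with_scratch_buffer(f, n::Int)
    buf, buf_inuse = @inbounds SCRATCH_BUFFERS[_tid()]
    buf_inuse[] && return f(Vector{UInt8}(undef, n))  # re-entrant: allocate
    buf_inuse[] = true
    try
        return f(resize!(buf, n))  # shrinking keeps capacity, so this is cheap
    finally
        # NB: if f can yield, the task may migrate threads and another task on
        # the original thread could grab the same buffer; the PR must guard
        # against exactly this.
        buf_inuse[] = false
    end
end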

@green-nsk (Author)

I appreciate it's been a long while since this PR started, but I think at this point all comments have been addressed. It's also compatible with thread migration.

@green-nsk (Author)

@vtjnash @StefanKarpinski can you have another look at this? Based on previous comments I'm reasonably sure there's nothing else required to merge it.

@KristofferC (Member) commented Jun 15, 2022

Sorry for the slow replies here. I think the reason is mostly that this type of code is kind of "scary": it deals with buffers shared between threads/tasks, which is quite hard to get right, so it might take a while before someone with enough confidence takes a look at it.

Perhaps one useful thing to add here is a more adversarial test that does its best to break this: start many threads that use the scratch buffers while writing integers, and see whether any data races can be detected.
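For example, something along these lines (a sketch of such a test, not code from the PR):

# Adversarial sketch: many tasks hammer show() concurrently and cross-check
# against string(); a race on a shared scratch buffer should eventually
# produce corrupted output.
using Test

@testset "concurrent integer printing" begin
    fails = Threads.Atomic{Int}(0)
    @sync for _ in 1:4 * Threads.nthreads()
        Threads.@spawn begin
            io = IOBuffer()
            for _ in 1:10_000
                x = rand(Int)
                show(io, x)
                String(take!(io)) == string(x) || Threads.atomic_add!(fails, 1)
            end
        end
    end
    @test fails[] == 0
end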

@green-nsk (Author)

Thank you, Kristoffer! I appreciate that this is a sensitive part of the codebase and that reviewing can take time. I'm happy as long as it's on someone's radar.

@mentics commented Feb 17, 2023

I'm not sure if this is the right place to bring this up, but it seemed pertinent to this issue.

I ran into this allocation issue today and found this PR.

If covering only the built-in primitive types (and not numbers of unlimited size), it could use a fixed-size buffer, because there is an absolute maximum string length for an integer of a given size. A fixed-size buffer doesn't have to allocate. I understand this code couldn't depend on StaticArrays, but I would expect that however it accomplishes its magic would be reproducible here. Here's some code that demonstrates the idea (not optimized):

import Base: _dec_d100
using StaticArrays
const SomeInts = Union{UInt128, UInt16, UInt32, UInt64, UInt8}
const MAX_DIGITS = ndigits(typemax(Int128))
function dec_io(io::IO, x::SomeInts, pad::Int, neg::Bool)
    if neg; write(io, 0x2d); end # 0x2d is '-'
    n = ndigits(x)
    npads = max(0, pad - n)
    for _ in 1:npads
        write(io, '0') # TODO: use correct way to get the pad char
    end
    a = MVector{MAX_DIGITS,UInt8}(undef)
    i = n
    @inbounds while i >= 2
        d, r = divrem(x, 0x64)
        d100 = _dec_d100[(r % Int)::Int + 1]
        a[i-1] = d100 % UInt8
        a[i] = (d100 >> 0x8) % UInt8
        x = oftype(x, d)
        i -= 2
    end
    if i > 0
        @inbounds a[1] = 0x30 + (rem(x, 0xa) % UInt8)::UInt8
    end
    GC.@preserve a unsafe_write(io, pointer(a), n) # copied from Base.write(IO, String)
    return
end

Another option is to change the algorithm. A quick and dirty recursive example has no allocations:

function dec_io2(io::IO, x::SomeInts, pad::Int, neg::Bool)
    if neg; write(io, 0x2d); end # 0x2d is '-'
    padchar = '0' # TODO: use correct way to get the pad char
    ndigs = ndigits(x)
    npads = max(0, pad - ndigs)
    for _ in 1:npads
        write(io, padchar)
    end
    help1(io, x)
    return
end
@inline function help1(io::IO, x::SomeInts)
    if x < 10
        write(io, '0' + x)
    else
        help1(io, div(x, 10))
        write(io, '0' + (x % 10))
    end
end

Another option would be to reverse the algorithm. Currently it needs a buffer only because it goes from the least significant digit to the most. With an algorithm that goes from most to least significant, it could write to the stream without any buffer, as sketched below.
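A sketch of that most-to-least variant (mine, for illustration); it needs no buffer, though it writes one character at a time, which the next paragraph flags as its own bottleneck:

# Buffer-free decimal printing: find the highest power of ten first, then emit
# digits from most to least significant.
function dec_io3(io::IO, x::Unsigned)
    p = one(x)
    while x ÷ p >= 10   # p never overflows: the loop runs only while 10p <= x
        p *= 10
    end
    while p > 0
        write(io, '0' + (x ÷ p) % 10)  # one digit per iteration
        p ÷= 10
    end
    return
end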

Actual benchmarks are a little tricky: if you use Base.devnull for the io argument, the above examples are very fast, even the recursive one. However, writing one character at a time to an IOBuffer appears to be a performance issue in itself. There are potential solutions to that on the IOBuffer side, or something in between.

@green-nsk (Author)

@mentics, what you're suggesting is roughly the first iteration of this PR. It wasn't accepted because passing stack data to Julia IO operations isn't allowed. The current version uses heap memory (but avoids allocations in most cases) and works in practice, but it:

  1. needs more stringent testing, as it's a core piece of functionality
  2. needs to take care of the jl_adopt_thread feature added in 1.9.0
  3. has some (presumably minor) conflicts with the current master.

@mentics commented Feb 18, 2023

I take it that stack data is a problem because the IO contract requires stable data? Is that due to the current implementation of some IO code, or is it a deliberate contract? It seems a bit surprising. I would expect an optimization like this PR's to live in an implementation of IO, not in its callers. All callers would benefit from the optimization, so why not just put it in IO? And not all IO implementations would have this problem, would they?

So, could a simple IO implementation that never yields work safely without this extra optimization (neither in itself nor in its callers)? And could one that does yield internally copy the passed-in data to a buffer (ring buffer, non-blocking queue, thread-local, etc., depending on that implementation's published thread-safety) without yielding?

end

function with_scratch_buffer(f, n::Int)
    buf, buf_inuse = @inbounds SCRATCH_BUFFERS[_tid()]
A reviewer (Member) commented on this hunk:

The number of threads can now change, so we should check here instead of using __init__.
