Julia 1.3 order of magnitude slowdown in function #34195

Closed
jpjones76 opened this issue Dec 25, 2019 · 35 comments
Labels
I/O - Involving the I/O subsystem: libuv, read, write, etc.
performance - Must go faster
regression - Regression in behavior compared to a previous version

Comments


jpjones76 commented Dec 25, 2019

I'm noticing an order-of-magnitude slowdown in a custom function to parse "sample list" files (one ASCII value per line) to Float32 arrays. Any ideas what could be causing this?

Offending function here: https://github.com/jpjones76/SeisIO.jl/blob/master/src/Utils/Parsing/streams.jl

Affects my ASCII readers for SLIST and Lennartz SLIST files.

Before 1.3.0, X[i] = stream_float(io, 0x00) was 2-3x faster than v = readline(io); X[i] = parse(Float32, v). Now stream_float is 4x slower than readline + parse. I actually need the old speeds if possible, so any alternate syntax suggestions would be welcomed.

MWE

using Pkg; Pkg.add("BenchmarkTools"); Pkg.add("SeisIO") 
run(`wget https://github.com/jpjones76/SeisIO-TestData/raw/master/SampleFiles/ASCII/0215162000.c00`)
using BenchmarkTools
import SeisIO:stream_float
io = open("0215162000.c00", "r")
nx = 225000
X = zeros(Float32, nx) 
@benchmark (seekstart($io); readline($io); @inbounds for i in 1:$nx; $X[i] = stream_float($io, 0x00); end)

Example outputs follow. I benchmarked this on a Dell laptop running Ubuntu 18.04 (bionic), kernel 5.0.0-37-generic, with an Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz and 16 GB RAM.

Julia 1.0.5

BenchmarkTools.Trial: 
  memory estimate:  272 bytes
  allocs estimate:  2
  --------------
  minimum time:     25.533 ms (0.00% GC)
  median time:      25.701 ms (0.00% GC)
  mean time:        25.867 ms (0.00% GC)
  maximum time:     30.592 ms (0.00% GC)
  --------------
  samples:          194
  evals/sample:     1

Julia 1.2.0:

  memory estimate:  272 bytes
  allocs estimate:  2
  --------------
  minimum time:     24.738 ms (0.00% GC)
  median time:      24.850 ms (0.00% GC)
  mean time:        24.910 ms (0.00% GC)
  maximum time:     27.172 ms (0.00% GC)
  --------------
  samples:          201
  evals/sample:     1

Julia 1.3.0:

BenchmarkTools.Trial: 
  memory estimate:  272 bytes
  allocs estimate:  2
  --------------
  minimum time:     236.738 ms (0.00% GC)
  median time:      237.059 ms (0.00% GC)
  mean time:        237.833 ms (0.00% GC)
  maximum time:     245.997 ms (0.00% GC)
  --------------
  samples:          22
  evals/sample:     1

KristofferC commented Dec 25, 2019

read is now thread-safe, which requires acquiring a lock, so reading things byte by byte has slowed down. You could perhaps try https://github.com/BioJulia/BufferedStreams.jl. I do think it would be good to have a go-to solution for this in Base (perhaps there already is and I just don't know about it).
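A minimal sketch of that suggestion, assuming BufferedStreams' BufferedInputStream(source, bufsize) constructor and that stream_float accepts any IO (as it does in the MWE above):

using BufferedStreams
import SeisIO: stream_float

# Byte reads now hit a 128 KB in-memory buffer instead of the locked IOStream on every call.
io = BufferedInputStream(open("0215162000.c00"), 128 * 1024)
readline(io)                  # skip the header line, as in the MWE
x = stream_float(io, 0x00)    # BufferedInputStream <: IO, so this should work unchanged
close(io)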


KristofferC commented Dec 25, 2019

For reference, here is the cross post on Discourse: https://discourse.julialang.org/t/julia-1-3-order-of-magnitude-slowdown-in-function/32691


jpjones76 commented Dec 25, 2019

Are you sure that's the problem? My other readers show very little slowdown by comparison; even my other ASCII readers are slowed by no more than a factor of two.

Does unsafe_read avoid this slowdown?

Is there any startup flag or other option that can prevent this behavior? Seems like this should be very impactful if it's causing that much slowdown on every file read in every package.

BufferedStreams.jl is not a solution for large files as it doubles their memory use. My colleagues are working with GB size files for things like ambient noise correlations. If I try to buffer the entire file, it eats up all their memory and hangs Linux forcing a hard reboot.


KristofferC commented Dec 25, 2019

No, I am not sure. I just know it has caused slowdowns in code that looks similar. You would have to test it.

BufferedStreams.jl is not a solution for large files as it doubles their memory use. My colleagues are working with GB size files for things like ambient noise correlations. If I try to buffer the entire file, it eats up all their memory and hangs Linux forcing a hard reboot.

OK, but why would you buffer the entire file? The default buffer size in BufferedStreams is 128 KB, I think.


jpjones76 commented Dec 25, 2019

I don't think thread safety alone is responsible and would rather not debate the relative merits of using a third-party package to compensate for perceived slowdown in Base when neither of us is certain that said package is the only workaround.

For the sake of due diligence, I re-ran my file-read benchmarks from the publication (https://github.com/jpjones76/SeisIO_paper) in Julia 1.0.5 and Julia 1.3.0. My GeoCSV file-format reader uses similar low-level ASCII parsing and is slowed by ~2.1x, but the SLIST slowdown is 8.2x. If thread safety were the only problem, they'd be similar; they're off by a factor of four.


KristofferC commented Dec 25, 2019

I don't think thread safety alone is responsible and would rather not debate the relative merits of using a third-party package to compensate for perceived slowdown in Base when neither of us is certain that said package is the only workaround.

The point is to introduce a buffered stream to actually check and see if this is a dup of #33335 or dancasimiro/WAV.jl#52 (comment). If a buffered stream doesn't restore performance then great, one less thing to check.

You can also profile the code (https://docs.julialang.org/en/v1/manual/profile/) in the two versions and see if anything sticks out when comparing the traces.
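For example, a minimal profiling recipe that runs unchanged on 1.2 and 1.3 (Profile is in the standard library; the helper read_all! is just for illustration):

using Profile
import SeisIO: stream_float

io = open("0215162000.c00", "r")
X = zeros(Float32, 225000)

# Same work as the MWE, wrapped in a function so the profile is easy to read.
read_all!(io, X) = (seekstart(io); readline(io); for i in eachindex(X); X[i] = stream_float(io, 0x00); end)

read_all!(io, X)              # warm up / compile
Profile.clear()
@profile read_all!(io, X)
Profile.print(mincount = 10)  # look for lock/unlock frames under eof, skip, read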

KristofferC added the performance and regression labels Dec 25, 2019

KristofferC commented Dec 25, 2019

From what I can see in profiling, everything points to locking and unlocking in eof, skip, read.


jpjones76 commented Dec 26, 2019

I'm still looking into it, but quite frankly, these changes where every low-level read operation is wrapped in a lock macro are ridiculous. They also invalidate the biggest reason for my most recently submitted manuscript, and the three years of work I put into SeisIO.jl before it.

The main point of my publication, and in fact, one of the main points of spending the last three years working as hard as I have on SeisIO.jl, is that Julia allowed me to create the fastest seismic data parsers on the planet. My code forms the basis of ongoing research at half a dozen institutions including Harvard, the US Geological Survey, and CalTech. We can run code in ten minutes that used to take days.

This change utterly trashes my code...and, by extension, our work...and their research.

You've made Julia parsing slower than Python.

You've made Julia parsing slower than R -- slower than native R, let alone those trash wrappers to undocumented C++ that third-year undergrads like to pretend are R packages.

You're even slower than Matlab and Octave, aren't you?

There is no sugarcoating this: the 1.3 approach to thread-safe file I/O may be the worst coding decision I have ever seen.

Is there no adult in the room when development decisions get made? Did it not occur to anyone that there might still exist scientific data formats that need low-level read functions for structures other than long arrays?

Was it that difficult to make the lock checks optional?

I'm pretty sure I know the answer to that last question, because you sure seem to have no trouble wrapping parts of your low-level functions in a lock macro -- readbytes_all! in iostream.jl has a good example. If @lock_nofail can wrap a control loop adequately, it should damn well be able to wrap the entire function. So why not instruct users to do that?

...now you tell me that the solution is to either (a) add a third-party package, to achieve worse performance than Base had six weeks ago, or (b) rewrite every single low-level read operation to use C wrappers?

How is that consistent with the language's design goals, again? Remind me about ease of use, in particular. I'm having trouble remembering your exact wording at this late hour.


KristofferC commented Dec 26, 2019

...now you tell me that the solution is to either (a) add a third-party package, to achieve something that was part of Base six weeks ago, or (b) rewrite every single low-level read operation to use a C wrapper?

Just to be clear, the suggestion about BufferedStreams.jl was merely a way to check if the new thread-safe IO was the source of the regression since it wasn't clear from the original issue (no MWE or profiling). It wasn't meant to be a "solution" to the issue.


timholy commented Dec 26, 2019

@jpjones76, you are in violation of the community standards. Please moderate your anger in future posts.


jpjones76 commented Dec 26, 2019

I think there are good reasons why I'm having a hard time being constructive about this issue. What am I supposed to tell my colleagues?

If low-level read commands are the cause, it's going to take weeks to recode in a way that restores performance to 1.2 levels.

One ASCII reader is not a huge deal in and of itself. However, this change triples the time required to parse SEED, the worldwide seismic data standard (https://ds.iris.edu/ds/nodes/dmc/data/formats/seed/ , http://www.fdsn.org/pdf/SEEDManual_V2.4.pdf). That is huge. Our data sets are TB to tens of TB each experiment. The low-level structure of SEED comprises variable-length packets whose sizes aren't easily guessed, with no byte index. A simple workaround like reading to an intermediary buffer is unlikely to help.


timholy commented Dec 26, 2019

I think there are good reasons why I'm having a hard time being constructive about this issue. What am I supposed to tell my colleagues?

To use Julia 1.2 until folks have had a chance to come back from Christmas day and take a look at this bug report?

We all understand being distressed by a change that affects your code, but that does not give you license to unload a bunch of personal attacks. So yes, I do expect you to be constructive about this issue. You were doing great at the beginning, keep to that tone and we'll all be much happier.


ViralBShah commented Dec 26, 2019

@jpjones76 while your frustration is understandable, the way you have chosen to express it will only make it difficult for people to engage or to help. Perhaps you can help isolate the issue and be patient while people are on vacation.

I request that you revise your comments to be helpful; since you perhaps made them at a late hour, you may not want to leave them as they are for posterity.

I was delighted to read about your application and collaboration but it quickly became unreadable.


JeffBezanson commented Dec 27, 2019

We faced the following difficult decision: either (1) cause data corruption when threads are added to code that does file I/O, or (2) slow down small I/O operations. We chose the latter because reliability is important, and because it is rare to do byte-at-a-time I/O (especially when performance matters).

Particularly at large scale, it is often impossible for one system to meet everybody's needs. That is part of the point of Julia --- you are not stuck with whatever is in the base system. If using BufferedStreams.jl fixes this, what's the problem? In fact I suspect there is room for even further performance gains over what you achieved before.

(b) rewrite every single low-level read operation to use C wrappers?

I certainly wouldn't recommend that, since libc's buffered I/O also uses locking and is still significantly slower than ours. But there are other good options. For example if necessary, you can copy the code from iostream.jl from julia 1.2 into your package. A custom buffered I/O type is also worth exploring; you might get performance gains by dropping some of the generality of existing I/O packages.

I do think it would be good to add an option to turn off locking, though I'm not sure what the API should be. Unfortunately we can't add it until v1.5 though.

JeffBezanson added the I/O label Dec 27, 2019

quinnj commented Dec 27, 2019

@jpjones76, having done quite a bit of low-level IO parsing work myself, I'm happy to talk about strategies here (I'm the primary maintainer of Parsers.jl, CSV.jl, JSON3.jl, etc.).

In particular, I've found that to maximize parsing performance, the absolute fastest approach is to mmap the input file (or Base.read it into a Vector{UInt8}) and read byte-by-byte, maintaining your own position variable and manually checking that position <= length(input) after incrementing the position. Just dropping down to using your own Vector{UInt8} for parsing has better memory cache-line use, and avoids a whole host of extra checks that occur in simple operations for generic IO objects like position(io), read(io, UInt8), and even eof(io).

Those are just a few ideas/tactics I use to achieve anywhere from 2x to 50x performance gains in Parsers.jl vs. Base.

Happy to discuss more in-depth on the julialang slack (I'm pretty active there in the #data channel) or even do a google hangout if that would be helpful to walk through code, ideas, or strategize how to get even better performance for your code.
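For the SLIST case from the original post, a minimal sketch of that pattern, assuming Unix newlines; parse_ascii_floats! is a hypothetical helper, not part of SeisIO or Parsers.jl:

using Mmap

# Track position and length in locals instead of calling position(io)/eof(io) per byte.
function parse_ascii_floats!(X::Vector{Float32}, buf::Vector{UInt8})
    pos, len = 1, length(buf)
    while pos <= len && buf[pos] != UInt8('\n')   # skip the header line
        pos += 1
    end
    pos += 1
    for i in eachindex(X)
        start = pos
        while pos <= len && buf[pos] != UInt8('\n')
            pos += 1
        end
        X[i] = parse(Float32, String(buf[start:pos-1]))  # one sample per line
        pos += 1
    end
    return X
end

buf = Mmap.mmap("0215162000.c00")   # Vector{UInt8} view of the whole file
X = zeros(Float32, 225000)
parse_ascii_floats!(X, buf)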


elextr commented Dec 28, 2019

@JeffBezanson said:

I do think it would be good to add an option to turn off locking, though I'm not sure what the API should be.

I do wonder what the use-case would be?

It seems to me that the only safe situation is where the user controls every line of code they use and knows no threaded IO happens (possible but rare), or the user examines every package and library they use, and reexamines them on upgrades, and ensures that no threaded IO happens (possible but even rarer).

I'm not saying not to add the feature, but do document it with a big red "here be (really sneaky mean) dragons".


timholy commented Dec 28, 2019

What about

struct IOSingleton
    #= other fields needed for I/O =#
    ios::Ptr{Cvoid}    # underlying ios_t handle referenced below (assumed field)
    threadid::Int
end

function write(s::IOSingleton, b::UInt8)
    iswritable(s) || throw(ArgumentError("write failed, IOStream is not writeable"))
    Threads.threadid() == s.threadid || error("can only be called from thread ", s.threadid, ", got ", Threads.threadid())
    ccall(:ios_putc, Cint, (Cint, Ptr{Cvoid}), b, s.ios)
end


JeffBezanson commented Dec 28, 2019

I think I remember trying something like that, and even just checking the thread id had almost as much slowdown (presumably due to the extra load, which is significant when done per-byte).


timholy commented Dec 28, 2019

Seems like inlining + hoisting should solve that for the case where the byte-by-byte IO is being done from an individual function.


jpjones76 commented Dec 31, 2019

@elextr, I'm so glad that you asked about use cases. I do apologize for my tone last week addressing @KristofferC , but I must ask, here: how much scientific data have you worked with, personally? Which disciplines?

In other words, what are the use cases for which thread-safe reads are preferred? Am I correct in thinking that the intended change was primarily to benefit parallel reads of large files?

If so, what percentage of the scientific data that you've worked with uses a file format where this would be a benefit?

I'll discuss the general and specific use cases that the Julia seismology community faces below.

General use case: Seismology

In geophysics, speaking from 22 years of working with seismic data, my colleagues at IRIS can verify that >99% of extant seismic data are divided into short-record file formats, which are faster to read without parallelism.

In all common geophysical file formats, a descriptive header with many scalar variables must be read before data can be used. Parallel read offers no advantage for file headers. Parallel read may be an advantage when reading a long, contiguous time series from file, but how often are time series recorded in long, contiguous records?

I have a rough sense of the answer to that. I'm not sure you want my guess. Even without a discussion of transmission breaks and data gaps, many formats allow maximum record lengths <32 kB. Thus, the case for single-worker reads in geophysics should be absolutely clear.

How common a problem?

Virtually all seismic data uses short-record file formats, totaling hundreds of PB to tens of EB of data archived publicly and privately over 40 years. A few example data formats are described in my SeisIO documentation, but there are others.

Does any seismic data benefit from these changes?

I know of only one seismic data format where (thread-safe) parallel reads offers a performance advantage: ASDF by Krischer et al. (2016). This is not the ASDF in astrophysics. Seismic ASDF is used by only two research groups; AFAIK, neither operates a seismic network in such a way that monitoring records use it for archival. Thus the relative fraction of data whose reads would be improved by 1.3 could be called "0.0f0", i.e., less than 1/10^7.

Specific use case: SEED

SEED is the worldwide seismic data standard. Files are divided into variable-length records (see pages 16-17 of the linked PDF), wherein each record comprises a fixed-length header (48 bytes), a variable number of blockettes (variable length), and a variable amount of data (variable length).

The problem should already be clear, but here are details, as I hope to prevent any argument about the intractability of reading SEED archives in parallel.

SEED read and buffer

One cannot merely pre-buffer an entire record. Instead, a conceptual schematic of an optimized read process looks like this:
[read 48 bytes] --> [extract # of blockettes, N] --> for each blockette: [get blockette size K in bytes (*); buffer blockette] --> [get length of data] --> [buffer data] --> [process]

The problem

At (*) lies the issue. Blockette size is stored at the start of the blockette itself. Some blockette types are variable-length. Thus, 2-3 reads are necessary to buffer (or skip) each blockette: blockette type, blockette length K, and K-7 bytes to buffer the blockette itself. This must be done for each blockette -- up to 256 per record. So each record can require up to 770 read() calls, in addition to buffering: 1 for the 48-byte fixed header, 768 for the blockettes, 1 for the data.
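In code, a rough sketch of that loop; the helpers blockette_count, read_blockette_header, process_blockette!, and data_length are placeholders for illustration, not SeisIO functions, and the read pattern follows the description above rather than the SEED specification:

function read_record!(io::IO)
    hdr  = read(io, 48)                    # fixed-length header: 48 bytes
    nblk = blockette_count(hdr)            # extract number of blockettes, N
    for _ in 1:nblk
        # 2-3 small reads per blockette: type, length K, then the rest of the blockette
        btype, blen, nread = read_blockette_header(io)
        body = read(io, blen - nread)      # buffer (or skip) the remaining bytes
        process_blockette!(btype, body)
    end
    return read(io, data_length(hdr))      # variable-length data section
end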

Workaround results

Switching to BufferedStreams.jl has been benchmarked by our group and shows a 50% slowdown relative to 1.2 read speeds for SEED. This is unacceptable as the data to be read and processed are hundreds of TB. With a slowdown worse than ~30%, it's faster to use the SEED C library, but that also adds more memory overhead than my Julia-language SEED reader. Moreover, we cannot believe that developers would endorse a potential solution that undermines the use of Julia as a language, particularly considering that my reader was 25% faster than the C library until November 26th.

SEED conclusion

In this file format, the order-of-magnitude slowdown introduced by locking and unlocking is not tenable.

Reminder: SEED popularity

This "use case" represents every byte of seismic data available from every FDSN seismic observatory, which, in turn, represents the majority of univariate geophysical data that are available to the public.

I hope this now hints at the magnitude of the problem (no pun intended). In seismology, virtually all data get recorded in formats where parallel read offers no advantage whatsoever.


KristofferC commented Dec 31, 2019

Just as a fun experiment I did an attempt with the Mmap strategy that @quinnj outlined in #34195 (comment) with something like:

using Mmap
mutable struct MmapReader <: IO
    data::Vector{UInt8}
    position::Int
end
MmapReader(file::String) = MmapReader(Mmap.mmap(file), 1)

Base.eof(io::MmapReader) = io.position > length(io.data)
Base.position(io::MmapReader) = io.position
function Base.read(io::MmapReader, ::Type{UInt8})
    # @inbounds below is not safe in general
    v = @inbounds io.data[io.position]
    io.position += 1
    return v
end
Base.seekstart(io::MmapReader) = io.position = 1
Base.skip(io::MmapReader, i::Int) = io.position += i

and then ran the benchmarks as

using BenchmarkTools
io = MmapReader("0215162000.c00")
nx = 225000
X = zeros(Float32, nx) 
@btime (seekstart($io); @inbounds for i in 1:$nx; $X[i] = stream_float($io, 0x00); end)

with the result

  7.522 ms (0 allocations: 0 bytes)

(note that I removed the readline call and the first line of the file because it wasn't relevant for the experiment).

So it might be a strategy worth exploring if performance is really important.


jpjones76 commented Dec 31, 2019

Just as a fun experiment I did an attempt with the Mmap strategy that @quinnj outlined in #34195 (comment) with something like: [...] So it might be a strategy worth exploring if performance is really important.

That's certainly good performance, but I hope it's clear that this wasn't the data format I was talking about in my use case for @elextr at all.

SLIST and its variants are ASCII formats used as safe or fallback options. SEED is the worldwide seismic data standard and is generally not ASCII. They aren't equivalent.

This might be a workaround for the specific format mentioned in the original issue, and I'll happily test it as such, but that doesn't make it a general solution.


timholy commented Dec 31, 2019

@quinnj's and @KristofferC's point is that anything you can do with an IOStream, you can do with a mmapped stream too. With better performance and circumventing the very issue that is worrying you.


elextr commented Dec 31, 2019

@jpjones76 I'm afraid you may have misinterpreted the point of my question; it wasn't to elicit your use-case, but to urge careful consideration of adding a general feature (turning off IO locking) of a kind that my experience has shown to be more often a trap than a use.

Unless @JeffBezanson was suggesting that there is a performant way to turn off locking per-stream then, as I noted, when turning locking off the user needs to ensure that no packages or Julia library code they use does any threaded IO, since the setting would affect all IO, not just the specific stream whose performance they intended to improve.

As a hypothetical example, if the AI package you decide to use to analyse your seismic data is multi-threaded, and does IO, then it is likely to fail in mysterious ways if IO locking is turned off to improve your parser.

As to your specific use-case, without having analysed your specific format: even for a binary format, memory-mapped IO is likely to provide better performance than low-level IO calls, and I have seen better performance even when non-mmappable streams are simply loaded fully into memory first, probably for some of the reasons @quinnj alluded to. So I would recommend not rejecting such approaches without benchmarking them yourself, as @KristofferC did for one of your formats.


jpjones76 commented Dec 31, 2019

@quinnj's and @KristofferC's point is that anything you can do with an IOStream, you can do with a mmapped stream too. With better performance and circumventing the very issue that is worrying you.

I mean no offense but are you certain you're comfortable recommending a solution with this amount of hacking to restore the performance of basic functions? In addition to the posted workaround, I'll need to manually define read methods for other bits types. I realize I haven't benchmarked the two solutions side-by-side yet, but I'm not sure how that's less work (or better performance) than wrappers to Julia C functions.


JeffBezanson commented Dec 31, 2019

Unless @JeffBezanson was suggesting that there is a performant way to turn off locking per-stream

Yes, of course that's what I meant. The idea of turning off all locks globally would never be up for consideration. I've never heard of such a thing.

are you certain you're comfortable recommending a solution with this amount of hacking

Yes. Why is it "hacking"? Do you want fast byte-at-a-time reads or not? As he said, @quinnj does all sorts of tricks like this to max out performance for CSV reading etc., and the code is robust and widely used.

In Base there are fallbacks for reading basically everything in terms of UInt8, so if a stream supports that everything should work. A few more custom methods might be needed for the best performance though of course.
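For instance, a minimal sketch of one such custom method on the MmapReader type from the earlier comment: a single definition covering the other bits types by loading directly from the underlying byte vector (assumes the 1-based position field defined above).

function Base.read(io::MmapReader, ::Type{T}) where {T}
    isbitstype(T) || throw(ArgumentError("$T is not a bits type"))
    n = sizeof(T)
    io.position + n - 1 <= length(io.data) || throw(EOFError())
    data = io.data
    # Unaligned load straight from the mmapped bytes; the UInt8 method defined earlier
    # is more specific and still handles single-byte reads.
    v = GC.@preserve data unsafe_load(Ptr{T}(pointer(data, io.position)))
    io.position += n
    return v
end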


timholy commented Dec 31, 2019

are you certain you're comfortable recommending a solution with this amount of hacking

Absolutely. "System libraries" must be safe for a wide range of applications, many of which are not envisioned by the designer of the library. In contrast, code tuned for a particular purpose does not have to meet the same standard. One of the biggest advantages of Julia is that you, the user & package developer, are not a second-class citizen: you can do state-of-the-art work without having to dive down into C/C++/fortran/assembly. And the composability of packages (another Julia strength compared to the competition) means that you likely aren't on your own. I don't know Parsers.jl, but it looks like there's a bunch of stuff in there that would likely be useful to someone interested in writing fast parsers.

You seem to have an attraction to C wrappers, but folks in this thread are telling you that's probably the wrong approach. Folks have posted benchmarks (backed by real-world experience) suggesting that, if you really value speed, your goal should not simply be to recover the performance you had with Julia 1.2: that your goal should be to get another 3x performance on top of what you've already achieved. Moreover, they've already provided open-source packages, and kind offers of assistance on forums, to help you do it. What's holding you back?


vtjnash commented Dec 31, 2019

performant way to turn off locking per-stream

Should be simply h = lock(open("the file")), but looks like that's currently not implemented (defined, just not yet optimized).

comfortable recommending a solution with this amount of hacking

I'm not entirely sure either why he jumped straight to testing a reimplementation of the existing IOBuffer type (although that is a relatively complete implementation of an IO object). That code should behave about the same as the pre-written one:

using Mmap
MmapReader(file::String) = IOBuffer(Mmap.mmap(file))

(I'm not sure I'd call either one hacking, since their implementations are pretty similar)
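For reference, how that one-liner would slot into the original MWE, assuming (as in the benchmark at the top of the thread) that stream_float accepts any IO:

using Mmap
import SeisIO: stream_float

io = IOBuffer(Mmap.mmap("0215162000.c00"))
readline(io)                   # skip the header line
x = stream_float(io, 0x00)     # byte reads hit the in-memory buffer; no IOStream lock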


StefanKarpinski commented Dec 31, 2019

Summary of the situation. The choice that was faced in Julia 1.3, given that everything in the language was now potentially multithreaded, was between the following two options:

  1. make I/O thread-unsafe by default, making lots of existing code crash without warning, or
  2. make I/O thread-safe by default, making code using byte-at-a-time I/O significantly slower.

The former is not viable: we can't just make existing code broken without warning. The latter option was deemed better since it doesn't involve making existing code crash and byte-at-a-time I/O is not the best way to do really high-performance I/O anyway. It's also possible that some future compiler work could recover much of the performance lost by locking on each I/O operation.

This is the same tradeoff that libc makes: its I/O functions are also locked so as to be thread-safe. People don't generally do byte-oriented I/O in C because the API doesn't encourage it and because it's slow. Julia's byte-oriented I/O in 1.2 and earlier is, as you've noticed, faster than C's, which has perhaps encouraged people like yourself to use it. A couple of things should be noted:

  1. Rewriting your code to call C I/O functions directly will not recover the performance you had in Julia 1.2 since C's I/O functions are slower than Julia's in that version (because of locking).

  2. The choice to have I/O be threadsafe and locking by default is not exactly the reckless, idiotic decision you seem to be making it out to be when you say things like "Is there no adult in the room when development decisions get made?" Unless you think the same thing about the C standards committee. Which maybe you do, in which case perhaps you should get into the PL design game yourself. (Be forewarned: you will get abuse on the Internet from people using your work for free when said free work temporarily inconveniences them.)

We were well aware of the fact that this choice could slow down applications that relied on byte-at-a-time I/O. However, it was unclear if any of those applications would be performance critical, especially since byte-oriented I/O is not the best approach for really high performance. Rather than committing to an API that we might not even need, we decided to wait and see if it was necessary or not. Your issue is the first report we've gotten about someone doing byte-oriented I/O where this slowdown has been a problem. We will consider this issue as a request for allowing each stream to optionally not do locking.

As to "Was it that difficult to make the lock checks optional?" Well, yes, it's a lot of work to make each stream support optional locking in a way that makes unlocked streams as fast as they were on 1.2. Even making seemingly trivial changes to Base Julia is quite a lot of work. Making a change like this correct, fast and general is a significant undertaking. If you feel that it's easy enough that you could do it without too much difficulty, please do make a PR for it. I'm sure someone would be willing to help review your work, so long as you're more civil than you have been in this thread.

Regarding your options, the simplest is: don't upgrade. Just keep using Julia ≤ 1.2 — no work is required, just cap the Julia version in your project file. A slight variation on that is to put a note in the README indicating that Julia 1.3+ is supported but may be slow; if users want better performance, they are advised to use Julia ≤ 1.2. Also no work required. Rewriting your code to call C directly is not really an option, as it will not recover the performance of Julia 1.2, since C has the same issue as Julia 1.3 for the same reasons. Implementing a simple unlocked I/O type that supports just the operations you need would be straightforward (as already demonstrated), and would allow you to support 1.3 without a performance loss and with minimal work. If you're willing to do more work, you can get even better performance than you previously had by changing your code to not do byte-oriented I/O. So you have two zero-work options, a minimal-work option with the same performance, and a significant-work option that would give even better performance.


quinnj commented Dec 31, 2019

I'm not entirely sure either why he jumped straight to testing a reimplementation of the existing IOBuffer type (although that is a relatively complete implementation of an IO object). That code should behave about the same as the pre-written one

To clarify, my work in Parsers.jl wasn't a simple "jump straight to reimplementing a Base structure", but a pretty involved, step-by-step benchmarked journey of maximizing the performance of parsing integers/floats. The issue is that Base.IOBuffer doesn't allow for well-optimized byte-by-byte parsing workflows, for the following reasons:

  • every read(::IOBuffer, UInt8) call includes its own checks (that the buffer is readable and that the read position is in bounds, throwing EOFError otherwise);
  • those extra checks are costly when doing byte-by-byte, since eof needs to be checked and handled appropriately per byte anyway, to select specific parsing behaviors. So when doing byte-by-byte, I'm essentially doing my own eof checks, with no way to avoid them using IOBuffer (unless I write my own readbyte function, which I did for a while);
  • there are also extra costs from all the loads/stores of using IOBuffer in the first place; is it more convenient? Of course. But when the initial input is a mmapped Vector{UInt8} or String anyway, having specific variables for pos and buffer len cuts down on a lot of extra loads/stores; in read(::IOBuffer, UInt8) there are 5 loads/stores alone, and eof has 2, for example.

This is just to clarify that "just using Base.IOBuffer" isn't actually the best solution currently when doing the lowest-level, byte-by-byte parsing workflows that aim to maximize performance. I remember throwing around the idea of having some of those extra checks use @boundscheck perhaps, so you could do @inbounds read(io, UInt8) to avoid them, but never got around to trying anything out, and wasn't sure if I'd be able to avoid the extra loads/stores anyway, so just dropped down to plain byte vectors and manual position/size tracking.


jpjones76 commented Jan 1, 2020

I'm not entirely sure either why he jumped straight to testing a reimplementation of the existing IOBuffer type (although that is a relatively complete implementation of an IO object). That code should behave about the same as the pre-written one:

using Mmap
MmapReader(file::String) = IOBuffer(Mmap.mmap(file))

Oh, I tried that because it was exactly what @KristofferC suggested. You're right that your suggestion is simpler; it might also provide a speed improvement, still testing. If so, I prefer this approach because I need to keep syntax relatively simple; the planned work involves many researchers spanning a wide range of programming experiences.


JeffBezanson commented Jan 2, 2020

Your issue is the first report we've gotten about someone doing byte-oriented I/O where this slowdown has been a problem.

We did get another report about this: #32421 (comment), involving WAV.jl. However, as expected that code was already so sub-optimal that we ended up hugely speeding it up independent of the IOStream locks.


jpjones76 commented Jan 21, 2020

Hi everyone,

I want to review the solutions proposed here and the problems I'm having adopting them.

Mmap: the arguments for using this would be greatly strengthened by explanations of how it handles SIGBUS/SIGSEGV and associated risks. For example, what happens in Julia if there's a connection failure during memory-mapped file I/O? How does Julia handle issues like this? I see that Mmap.mmap() in Linux wraps to C mmap via jl_mmap in sys.c; what signal handlers are present? If none, then what are the risks? The chapter in the Julia manual comprises three docstrings and isn't referenced in any other chapter, so any explanations would be reassuring.

BufferedStreams.jl: with apologies, this isn't a viable solution for us. BufferedInputStream() leads to worse performance than 1.2.0 byte-wise reads with every data format tested; best case is ~50% slower. That's untenable when analyzing GB of data. A small buffer relative to file size increases read time; a large buffer increases memory overhead. The additional code needed to adaptively adjust buffer size is a Herculean labor because each data format has its own trade-off curve, which changes when "optional" sections are present in a file.

C wrappers: To be clear, I don't like C wrappers any more than you do, but they appear to safely restore read speeds to 1.2 levels. I'm pretty surprised that the opinions of wrappers are so overwhelmingly negative; my concern is merely that if I tell a colleague to make read calls with C wrappers for performance, I'm going to get a reply like "if all you're doing is wrapping to C, why use Julia?". I can still make a compelling argument based on data processing performance.

Use Julia ≤v1.2.0: I certainly can tell colleagues to do this, but in your opinion, should I? The only way I can convince people to adopt the Julia language is to demonstrate long-term viability; many colleagues have been coding in Fortran since the early 80s. (This sort of ... sociological inertia, I guess you'd call it ... has been very hard to overcome in the past.) I can make a strong case for switching if I show that Julia's performance isn't version-dependent. But that raises the question: what can I safely tell colleagues about your plans for read() and thread locking?

Finally: are there other solutions that I missed?

PS HDF5 #609 was found during my tests but appears unrelated.

ChrisRackauckas added a commit to ChrisRackauckas/stl-benchmark that referenced this issue Aug 29, 2020
On v1.3, I/O on opened paths was made thread-safe by default, but if you don't need thread safety it can be faster to skip the I/O locks. See JuliaLang/julia#34195 for details. This restores the performance post v1.3:

Before:

```julia
689.599 μs (15 allocations: 347.34 KiB)
```

After:

```julia
243.900 μs (15 allocations: 347.34 KiB)
```

vtjnash commented Jan 28, 2021

Addressed in #35426


StefanKarpinski commented Jan 28, 2021

For what it's worth, I find that the lock = false option is too inconsistently available to be very useful. Yes, it works on files, but it doesn't work on pretty much anything else — it doesn't work on commands or pipelines for example. I could open a new issue about that, but I'd say this is far from fully addressed still.
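For files, where it does work, the usage is simply (Julia 1.5+):

io = open("0215162000.c00"; lock = false)   # IOStream created without the internal lock
b = read(io, UInt8)
close(io)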
