POC/WIP/RFC: faster integer parsing; define `parse` semantics #16981

quinnj · 2016-06-17T07:05:56Z

Now that I've drawn you in with my carrot of a headline, I also want to discuss what I consider a fundamental design flaw in our current parsing functions. Currently, we have Exhibit A:

parse(T, str, base) => T, throw errors on invalid values
tryparse(T, str, base) => Nullable{T}, return Nullable{T}() on invalid values

The problem, as I see it, is we're utilizing Nullable not to represent nullability, but as a form of control flow; basically as a way to avoid having to do a try-catch block as a safeguard against invalid inputs.

This conflicts with a fundamental principle of parsing, which in my book could be stated as:

For a given textual input, parse(input, T) returns a valid instance of T or the absence of a T. Non-valid inputs are always errors.

I think this importance distinction can result in subtle bugs in a user's current code, Exhibit B:

# hey ma, look, I'm parsing integers!
parse(Int, "12345") => Int

# careful son, you should use `tryparse` in case there are missing values
tryparse(Int, "") => Nullable{Int}()

# uh-oh ma, I did a bad thing, I'm unknowingly mis-parsing entire columns of string data
tryparse(Int, "very important string data") => Nullable{Int}()

The new semantics I propose are this, Exhibit C:

parse(input, T) => Nullable{T}, returns Nullable{T}() on missing/absent/empty data, raises errors on invalid inputs

# e.g.
parse("12345", Int) => Nullable{Int}(12345)
parse("", Int) => Nullable{Int}()
parse("very important string data", Int) => ERROR

If you need to sanitize or control flow around invalid inputs, then sanitize or use control flow; nullability should be left to it's job of missingness.

This PR is half-baked right now because I wanted a chance to throw the change in semantics out there and get feedback before plowing ahead to re-organize float parsing similarly (and move it from C to julia).

I'll also note the drastically simpler (i.e. smaller and more legible), yet 4x-10x faster net implementation of integer parsing as we define it to be based on the new semantic and on an IO instead of string.

I'm viewing this as a solid step one to making Base Julia parsing machinery world-class in functionality and performance. Future iterations will involve incorporating more type richness (Dates/DateTimes), greater parsing machinery (utilizing the Options type for specifying delimiters and such), and eventually the excisement of datafmt.jl, which, according to my forecasts, could be re-implemented in about 20 lines of code with the right parsing machinery in place.

Some benchmarks for dessert:

julia> function t1(n)
           strings = Vector{IOBuffer}(n)
           for i = 1:n
               strings[i] = IOBuffer(string(rand(Int8)))
           end
           opts = Base.Options{10}()
           tic()
           for i = 1:n
               parse(strings[i], Int8, opts)
           end
           return toc()
       end
t1 (generic function with 1 method)

julia> function t2(n)
           strings = Vector{String}(n)
           for i = 1:n
               strings[i] = string(rand(Int8))
           end
           tic()
           for i = 1:n
               parse(Int8, strings[i])
           end
           return toc()
       end

Warmed-up results:

julia> t1(100000)
elapsed time: 0.003442966 seconds
0.003442966

julia> t2(100000)
elapsed time: 0.023639765 seconds
0.023639765

KristofferC · 2016-06-17T14:15:19Z

To me, it seems a bit intrusive to always return a Nullable from parse. I think it should be opt in somehow.

StefanKarpinski · 2016-06-17T15:07:51Z

What about parse(Int, "123") for the throwing version and parse(Nullable{Int}, "123") for the version that returns null if parsing fails? This would also be a good place to use a Rust-like Result type that either returns the result of the computation or an error message indicating what went wrong.

StefanKarpinski · 2016-06-17T15:13:28Z

This would also allow fallbacks like this:

function parse{T<:Number}(::Type{T}, str::AbstractString)
    n = parse(Nullable{T}, str, args...)
    return !isnull(n) ? get(n) : throw(ParseError("invalid format for $T: $(repr(str))"))
end

function parse{T<:Number}(::Type{Nullable{T}}, str::AbstractString)
    return try parse(T, str)
    catch err
        return Nullable{T}()
    end
end

Mutual fallbacks seem a bit weird but but this would allow defining either the erroring version or the nullable version. However, I think that making the Result version the code would be the best.

JeffBezanson · 2016-06-17T15:35:00Z

The key to this seems to be the idea that parsing has three cases: valid data, invalid data, or missing data. I think there are only two cases: you either have valid integer data or you don't. The reason is that there is no canonical notion of what a missing integer looks like.

I can also say that Nullable or Maybe types are not necessarily tied to a notion of missing data; they're a general utility for cases where you might not have a value to return.

JeffBezanson · 2016-06-17T15:42:28Z

This speedup is amazing. Looking at our current code, I'm guessing the call to isspace might be the problem. Do you have any more insight into this?

StefanKarpinski · 2016-06-17T15:53:47Z

Maybe distinguishing these various cases makes sense:

parse(Int, str::String) # valid integer syntax, otherwise error
parse(Nullable{Int}, str::String) # valid integer syntax or null syntax, otherwise error
parse(Result{Int}, str::String) # return Result wrapping either an Int or an error
parse(Result{Nullable{Int}}, str::String) # return Result wrapping either an (Int or a null) or an error

Of course that relies on there being some standard way or representing a null, which there isn't.

nalimilan · 2016-06-17T16:12:34Z

@StefanKarpinski See previous discussion at #9487 about parse(Nullable{Int}, x). I still agree it could be a good strategy.

@quinnj In practice, I guess your parsing code has to deal with both "" and "NA"/"NULL" as possible representations of missing data. How do you handle that with your proposal? Does it really improve on the current situation?

quinnj · 2016-06-17T16:13:52Z

I don't think we're that far from a "standard null representation". Currently we have a simple rule that leading/trailing whitespace (which I think we should more strictly define as space or tab) is allowed, that means nulls can only be

"" # empty string or immediate eof(io)
"  " # any length of space/tab characters

In a further iteration, I've had ideas around extending the Options type to include:

immutable Options{B}
   null::String
end

where a user could specify their own null representation to be parsed as such, i.e. Nullable{T}(). The more I've dug into this over the last couple weeks, the more I really don't think we should be using Nullable for control flow (i.e. the try-catch definition). I just think it leads to subtle bugs or sloppy coding; is it really that hard to do a quick isdigit(str) before parsing?

On the other hand, having a notion of "parsing missing values" is much stronger, IMO, interacting with databases, delimited, files, website APIs, any data format really.

@JeffBezanson, I'll admit I too was a bit surprised by just how much of a speedup was available. Here's a shortlist of what I think makes the difference, though I'm unsure on just how much each contributes to the overall speedup:

leveraging unsafe_read(io, UInt8) vs. next(s::String, i) as the core data processor
isspace certainly has a few more checks, but I feel like this was more trivial than other changes
I think we were over-worrying the overflow case; I also ended up basically manually "inlining" the safe_add/safe_mul overflow checks and I think that ends up being a huge speedup

@nalimilan, see my comments above about eventually allowing the user to specify a custom null representation to parse.

vtjnash · 2016-06-17T16:48:48Z

is it really that hard to do a quick isdigit(str) before parsing?

it doesn't generalize very well, because you really need to implement a parser to verify if the number will parse. For example, +32423 is an integer, even though it has several non-isdigit characters. This breaks down even faster when you start talking about parsing in general (floats, complex numbers, full julia syntax, etc)

vtjnash · 2016-06-17T16:54:23Z

base/parse.jl

+    msg::String
+end
+
+immutable Options{base}


this should be an ImmutableDict. this type will cause issues for compilation.

Can you elaborate? I didn't even realize ImmutableDict exists, so it would be interesting and relevant for me to know when I should use it ;)

tkelman · 2016-06-18T00:27:14Z

What motivates changing the argument order from parse(T, input) to parse(input, T) ? That seems confusing and potentially disruptive.

quinnj · 2016-06-18T01:21:37Z

The idea was making it consistent with read(io, T).

timholy · 2016-06-18T15:11:32Z

If anything, I'd change read(io, T). Outputs to the left, inputs to the right.

quinnj · 2016-06-18T16:28:50Z

I like that ^

nalimilan · 2016-06-18T18:12:57Z

+1, but see also discussion at #14412.

timholy · 2016-06-18T18:29:43Z

Personally I think the main motivation to make io first was that most uses are for writing (think the entire show infrastructure), and read just kind of followed along. But yes, there are sure to be dueling conventions.

StefanKarpinski · 2016-06-18T20:33:40Z

We should rank our conventions and then apply that rank consistently in the standard library.

StefanKarpinski · 2017-09-01T16:41:27Z

We should determine if we want to do this, and if so what the minimal breaking change needed to go into 1.0 is – or explicitly decide that we're not doing this.

KristofferC · 2017-11-21T10:58:32Z

As an intermediate step, would there be anyway to extract only the performance improvements while not changing any behaviour?

JeffBezanson · 2018-01-10T20:21:44Z

We have addressed part of this, by returning nothing from tryparse. If parse or tryparse found missing data they would return missing (but this isn't implemented yet AFAIK).

But I'll definitely second the request for the performance improvements :)

quinnj · 2018-01-10T20:24:18Z

Funny, I just started picking this up again the other day. You can catch all the latest improvements over here.

JeffBezanson · 2018-01-10T20:28:34Z

Great! Is the semantic issue resolved sufficiently for 1.0?

quinnj · 2018-01-10T20:41:29Z

Meh, I think it's fine for now. We're basically just providing convenience for

try
  result = parse(T, str)
catch 
  result = nothing
end

which is fine, though doesn't seem like a very common practice to do (i.e. we don't have tryopen or tryrun or tryconnect). Is it just that common that you try to parse something where it might not be? My own feeling is that people often have inputs like "" or null (json) and want to return something else for a few known "sentinel"-like values.

In Parsing.jl, I have ideas around creating various parsing "layers", similar to the approach in TextParse.jl, where you basically have a parsesentinel function that you can pass a Dict{String, Any} that maps sentinel values to other values to return in place. parsesentinel would then attempt to parse a regular T, but upon failure, try to parse one the passed sentinel values instead.

JeffBezanson · 2018-01-10T20:49:01Z

Is it just that common that you try to parse something where it might not be?

I would say so, yes.

I agree something like parsesentinel is the way to go for handling missing, null, etc.

vtjnash · 2018-01-10T21:08:23Z

I agree something like parsesentinel is the way to go for handling missing, null, etc.

I agree too. Maybe we should have some of those others. For a shorter name, we could also perhaps call them or, like parse_or (and find_or, get_or, match_or)?

I think it's also not uncommon to want tryparse never to error due to the data not conforming to the requested format (meaning it can error for totally unrelated reasons like OOM or MethodErrors or UndefVar – cases where the try/catch approach does poorly since it loses this failure information). For an example use case, in Winston, configuration options would be read from the ini with a chain of attempts (roughly tryparse(Int) || tryparse(Float64) || String), similar to how a spreadsheet or table might try to auto-detect column types.

POC implementation of faster integer parsing

3eef01f

pao changed the title ~~POC/WIP/RFC: faster integer parsing~~ POC/WIP/RFC: faster integer parsing; define parse semantics Jun 17, 2016

vtjnash reviewed Jun 17, 2016
View reviewed changes

nalimilan added the domain:missing data Base.missing and related functionality label Sep 6, 2016

TotalVerb mentioned this pull request Sep 10, 2016

Reader generalization JuliaIO/JSON.jl#169

Open

StefanKarpinski added this to the 1.0 milestone Sep 1, 2017

nalimilan mentioned this pull request Nov 9, 2017

Replace Nullable{T} with Union{Some{T}, Void} #23642

Merged

nalimilan mentioned this pull request Dec 16, 2017

Deprecate tryparse(T, s) in favor of parse(Union{T, Void}, s) #25131

Closed

JeffBezanson added the performance Must go faster label Jan 10, 2018

JeffBezanson removed this from the 1.0 milestone Jan 10, 2018

quinnj closed this Apr 3, 2021

DilumAluthge deleted the jq/parsing branch August 24, 2021 05:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

POC/WIP/RFC: faster integer parsing; define `parse` semantics #16981

POC/WIP/RFC: faster integer parsing; define `parse` semantics #16981

quinnj commented Jun 17, 2016

KristofferC commented Jun 17, 2016

StefanKarpinski commented Jun 17, 2016

StefanKarpinski commented Jun 17, 2016

JeffBezanson commented Jun 17, 2016

JeffBezanson commented Jun 17, 2016

StefanKarpinski commented Jun 17, 2016

nalimilan commented Jun 17, 2016

quinnj commented Jun 17, 2016

vtjnash commented Jun 17, 2016

vtjnash Jun 17, 2016

SimonDanisch Nov 21, 2017

tkelman commented Jun 18, 2016

quinnj commented Jun 18, 2016

timholy commented Jun 18, 2016

quinnj commented Jun 18, 2016

nalimilan commented Jun 18, 2016

timholy commented Jun 18, 2016

StefanKarpinski commented Jun 18, 2016

StefanKarpinski commented Sep 1, 2017

KristofferC commented Nov 21, 2017

JeffBezanson commented Jan 10, 2018

quinnj commented Jan 10, 2018

JeffBezanson commented Jan 10, 2018

quinnj commented Jan 10, 2018

JeffBezanson commented Jan 10, 2018

vtjnash commented Jan 10, 2018

POC/WIP/RFC: faster integer parsing; define parse semantics #16981

POC/WIP/RFC: faster integer parsing; define parse semantics #16981

Conversation

quinnj commented Jun 17, 2016

KristofferC commented Jun 17, 2016

StefanKarpinski commented Jun 17, 2016

StefanKarpinski commented Jun 17, 2016

JeffBezanson commented Jun 17, 2016

JeffBezanson commented Jun 17, 2016

StefanKarpinski commented Jun 17, 2016

nalimilan commented Jun 17, 2016

quinnj commented Jun 17, 2016

vtjnash commented Jun 17, 2016

vtjnash Jun 17, 2016

Choose a reason for hiding this comment

SimonDanisch Nov 21, 2017

Choose a reason for hiding this comment

tkelman commented Jun 18, 2016

quinnj commented Jun 18, 2016

timholy commented Jun 18, 2016

quinnj commented Jun 18, 2016

nalimilan commented Jun 18, 2016

timholy commented Jun 18, 2016

StefanKarpinski commented Jun 18, 2016

StefanKarpinski commented Sep 1, 2017

KristofferC commented Nov 21, 2017

JeffBezanson commented Jan 10, 2018

quinnj commented Jan 10, 2018

JeffBezanson commented Jan 10, 2018

quinnj commented Jan 10, 2018

JeffBezanson commented Jan 10, 2018

vtjnash commented Jan 10, 2018

POC/WIP/RFC: faster integer parsing; define `parse` semantics #16981

POC/WIP/RFC: faster integer parsing; define `parse` semantics #16981