Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC/WIP/RFC: faster integer parsing; define parse semantics #16981

Closed
wants to merge 1 commit into from

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Jun 17, 2016

Now that I've drawn you in with my carrot of a headline, I also want to discuss what I consider a fundamental design flaw in our current parsing functions. Currently, we have Exhibit A:

parse(T, str, base) => T, throw errors on invalid values
tryparse(T, str, base) => Nullable{T}, return Nullable{T}() on invalid values

The problem, as I see it, is we're utilizing Nullable not to represent nullability, but as a form of control flow; basically as a way to avoid having to do a try-catch block as a safeguard against invalid inputs.

This conflicts with a fundamental principle of parsing, which in my book could be stated as:

For a given textual input, parse(input, T) returns a valid instance of T or the absence of a T. Non-valid inputs are always errors.

I think this importance distinction can result in subtle bugs in a user's current code, Exhibit B:

# hey ma, look, I'm parsing integers!
parse(Int, "12345") => Int

# careful son, you should use `tryparse` in case there are missing values
tryparse(Int, "") => Nullable{Int}()

# uh-oh ma, I did a bad thing, I'm unknowingly mis-parsing entire columns of string data
tryparse(Int, "very important string data") => Nullable{Int}()

The new semantics I propose are this, Exhibit C:

parse(input, T) => Nullable{T}, returns Nullable{T}() on missing/absent/empty data, raises errors on invalid inputs

# e.g.
parse("12345", Int) => Nullable{Int}(12345)
parse("", Int) => Nullable{Int}()
parse("very important string data", Int) => ERROR

If you need to sanitize or control flow around invalid inputs, then sanitize or use control flow; nullability should be left to it's job of missingness.

This PR is half-baked right now because I wanted a chance to throw the change in semantics out there and get feedback before plowing ahead to re-organize float parsing similarly (and move it from C to julia).

I'll also note the drastically simpler (i.e. smaller and more legible), yet 4x-10x faster net implementation of integer parsing as we define it to be based on the new semantic and on an IO instead of string.

I'm viewing this as a solid step one to making Base Julia parsing machinery world-class in functionality and performance. Future iterations will involve incorporating more type richness (Dates/DateTimes), greater parsing machinery (utilizing the Options type for specifying delimiters and such), and eventually the excisement of datafmt.jl, which, according to my forecasts, could be re-implemented in about 20 lines of code with the right parsing machinery in place.

Some benchmarks for dessert:

julia> function t1(n)
           strings = Vector{IOBuffer}(n)
           for i = 1:n
               strings[i] = IOBuffer(string(rand(Int8)))
           end
           opts = Base.Options{10}()
           tic()
           for i = 1:n
               parse(strings[i], Int8, opts)
           end
           return toc()
       end
t1 (generic function with 1 method)

julia> function t2(n)
           strings = Vector{String}(n)
           for i = 1:n
               strings[i] = string(rand(Int8))
           end
           tic()
           for i = 1:n
               parse(Int8, strings[i])
           end
           return toc()
       end

Warmed-up results:

julia> t1(100000)
elapsed time: 0.003442966 seconds
0.003442966

julia> t2(100000)
elapsed time: 0.023639765 seconds
0.023639765

@KristofferC
Copy link
Sponsor Member

To me, it seems a bit intrusive to always return a Nullable from parse. I think it should be opt in somehow.

@pao pao changed the title POC/WIP/RFC: faster integer parsing POC/WIP/RFC: faster integer parsing; define parse semantics Jun 17, 2016
@StefanKarpinski
Copy link
Sponsor Member

What about parse(Int, "123") for the throwing version and parse(Nullable{Int}, "123") for the version that returns null if parsing fails? This would also be a good place to use a Rust-like Result type that either returns the result of the computation or an error message indicating what went wrong.

@StefanKarpinski
Copy link
Sponsor Member

This would also allow fallbacks like this:

function parse{T<:Number}(::Type{T}, str::AbstractString)
    n = parse(Nullable{T}, str, args...)
    return !isnull(n) ? get(n) : throw(ParseError("invalid format for $T: $(repr(str))"))
end

function parse{T<:Number}(::Type{Nullable{T}}, str::AbstractString)
    return try parse(T, str)
    catch err
        return Nullable{T}()
    end
end

Mutual fallbacks seem a bit weird but but this would allow defining either the erroring version or the nullable version. However, I think that making the Result version the code would be the best.

@JeffBezanson
Copy link
Sponsor Member

The key to this seems to be the idea that parsing has three cases: valid data, invalid data, or missing data. I think there are only two cases: you either have valid integer data or you don't. The reason is that there is no canonical notion of what a missing integer looks like.

I can also say that Nullable or Maybe types are not necessarily tied to a notion of missing data; they're a general utility for cases where you might not have a value to return.

@JeffBezanson
Copy link
Sponsor Member

This speedup is amazing. Looking at our current code, I'm guessing the call to isspace might be the problem. Do you have any more insight into this?

@StefanKarpinski
Copy link
Sponsor Member

Maybe distinguishing these various cases makes sense:

parse(Int, str::String) # valid integer syntax, otherwise error
parse(Nullable{Int}, str::String) # valid integer syntax or null syntax, otherwise error
parse(Result{Int}, str::String) # return Result wrapping either an Int or an error
parse(Result{Nullable{Int}}, str::String) # return Result wrapping either an (Int or a null) or an error

Of course that relies on there being some standard way or representing a null, which there isn't.

@nalimilan
Copy link
Member

@StefanKarpinski See previous discussion at #9487 about parse(Nullable{Int}, x). I still agree it could be a good strategy.

@quinnj In practice, I guess your parsing code has to deal with both "" and "NA"/"NULL" as possible representations of missing data. How do you handle that with your proposal? Does it really improve on the current situation?

@quinnj
Copy link
Member Author

quinnj commented Jun 17, 2016

I don't think we're that far from a "standard null representation". Currently we have a simple rule that leading/trailing whitespace (which I think we should more strictly define as space or tab) is allowed, that means nulls can only be

"" # empty string or immediate eof(io)
"  " # any length of space/tab characters

In a further iteration, I've had ideas around extending the Options type to include:

immutable Options{B}
   null::String
end

where a user could specify their own null representation to be parsed as such, i.e. Nullable{T}(). The more I've dug into this over the last couple weeks, the more I really don't think we should be using Nullable for control flow (i.e. the try-catch definition). I just think it leads to subtle bugs or sloppy coding; is it really that hard to do a quick isdigit(str) before parsing?

On the other hand, having a notion of "parsing missing values" is much stronger, IMO, interacting with databases, delimited, files, website APIs, any data format really.

@JeffBezanson, I'll admit I too was a bit surprised by just how much of a speedup was available. Here's a shortlist of what I think makes the difference, though I'm unsure on just how much each contributes to the overall speedup:

  • leveraging unsafe_read(io, UInt8) vs. next(s::String, i) as the core data processor
  • isspace certainly has a few more checks, but I feel like this was more trivial than other changes
  • I think we were over-worrying the overflow case; I also ended up basically manually "inlining" the safe_add/safe_mul overflow checks and I think that ends up being a huge speedup

@nalimilan, see my comments above about eventually allowing the user to specify a custom null representation to parse.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jun 17, 2016

is it really that hard to do a quick isdigit(str) before parsing?

it doesn't generalize very well, because you really need to implement a parser to verify if the number will parse. For example, +32423 is an integer, even though it has several non-isdigit characters. This breaks down even faster when you start talking about parsing in general (floats, complex numbers, full julia syntax, etc)

msg::String
end

immutable Options{base}
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be an ImmutableDict. this type will cause issues for compilation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate? I didn't even realize ImmutableDict exists, so it would be interesting and relevant for me to know when I should use it ;)

@tkelman
Copy link
Contributor

tkelman commented Jun 18, 2016

What motivates changing the argument order from parse(T, input) to parse(input, T) ? That seems confusing and potentially disruptive.

@quinnj
Copy link
Member Author

quinnj commented Jun 18, 2016

The idea was making it consistent with read(io, T).

@timholy
Copy link
Sponsor Member

timholy commented Jun 18, 2016

If anything, I'd change read(io, T). Outputs to the left, inputs to the right.

@quinnj
Copy link
Member Author

quinnj commented Jun 18, 2016

I like that ^

@nalimilan
Copy link
Member

+1, but see also discussion at #14412.

@timholy
Copy link
Sponsor Member

timholy commented Jun 18, 2016

Personally I think the main motivation to make io first was that most uses are for writing (think the entire show infrastructure), and read just kind of followed along. But yes, there are sure to be dueling conventions.

@StefanKarpinski
Copy link
Sponsor Member

We should rank our conventions and then apply that rank consistently in the standard library.

@nalimilan nalimilan added the domain:missing data Base.missing and related functionality label Sep 6, 2016
@StefanKarpinski StefanKarpinski added this to the 1.0 milestone Sep 1, 2017
@StefanKarpinski
Copy link
Sponsor Member

We should determine if we want to do this, and if so what the minimal breaking change needed to go into 1.0 is – or explicitly decide that we're not doing this.

@KristofferC
Copy link
Sponsor Member

As an intermediate step, would there be anyway to extract only the performance improvements while not changing any behaviour?

@JeffBezanson
Copy link
Sponsor Member

We have addressed part of this, by returning nothing from tryparse. If parse or tryparse found missing data they would return missing (but this isn't implemented yet AFAIK).

But I'll definitely second the request for the performance improvements :)

@quinnj
Copy link
Member Author

quinnj commented Jan 10, 2018

Funny, I just started picking this up again the other day. You can catch all the latest improvements over here.

@JeffBezanson
Copy link
Sponsor Member

Great! Is the semantic issue resolved sufficiently for 1.0?

@quinnj
Copy link
Member Author

quinnj commented Jan 10, 2018

Meh, I think it's fine for now. We're basically just providing convenience for

try
  result = parse(T, str)
catch 
  result = nothing
end

which is fine, though doesn't seem like a very common practice to do (i.e. we don't have tryopen or tryrun or tryconnect). Is it just that common that you try to parse something where it might not be? My own feeling is that people often have inputs like "" or null (json) and want to return something else for a few known "sentinel"-like values.

In Parsing.jl, I have ideas around creating various parsing "layers", similar to the approach in TextParse.jl, where you basically have a parsesentinel function that you can pass a Dict{String, Any} that maps sentinel values to other values to return in place. parsesentinel would then attempt to parse a regular T, but upon failure, try to parse one the passed sentinel values instead.

@JeffBezanson
Copy link
Sponsor Member

Is it just that common that you try to parse something where it might not be?

I would say so, yes.

I agree something like parsesentinel is the way to go for handling missing, null, etc.

@JeffBezanson JeffBezanson added the performance Must go faster label Jan 10, 2018
@JeffBezanson JeffBezanson removed this from the 1.0 milestone Jan 10, 2018
@vtjnash
Copy link
Sponsor Member

vtjnash commented Jan 10, 2018

I agree something like parsesentinel is the way to go for handling missing, null, etc.

I agree too. Maybe we should have some of those others. For a shorter name, we could also perhaps call them or, like parse_or (and find_or, get_or, match_or)?

I think it's also not uncommon to want tryparse never to error due to the data not conforming to the requested format (meaning it can error for totally unrelated reasons like OOM or MethodErrors or UndefVar – cases where the try/catch approach does poorly since it loses this failure information). For an example use case, in Winston, configuration options would be read from the ini with a chain of attempts (roughly tryparse(Int) || tryparse(Float64) || String), similar to how a spreadsheet or table might try to auto-detect column types.

@quinnj quinnj closed this Apr 3, 2021
@DilumAluthge DilumAluthge deleted the jq/parsing branch August 24, 2021 05:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:missing data Base.missing and related functionality performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants