Behaviour of `xparse` when encountering invalid delimiters #93

nickrobinson251 · 2021-10-20T12:10:16Z

I am trying to rely on xparse to correctly parse a value when I know the input contains invalid characters (i.e. an invalid delimiter). I am hoping/expecting to get the correct value and an INVALID_DELIMITER return code.

(I gather we do want it to be possible to rely on xparse in the presence of invalid delimiters, given #78).

But xparse doesn't always return the correct value (using Parsers.jl v2.0.6).
For example, when parsing a Float64 from input that contains a special character like /:

julia> using Parsers

julia> buf = codeunits("1.0 /");

julia> res = Parsers.xparse(Float64, buf, 1, length(buf), Parsers.XOPTIONS)
Parsers.Result{Float64}(-32607, 5, 2.3255508133e-314)

julia> res.val, Parsers.codes(res.code)
(2.3255508133e-314, "INVALID: OK | EOF | INVALID_DELIMITER ")

Here xparse returned the expected code (INVALID_DELIMITER), but not the correct value (expected res.val === 1.0).
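To make the raw code in the `Result` easier to read, here is a minimal, self-contained sketch that decodes it by hand. The bit layout below is an assumption: it is chosen to agree with the `Parsers.codes` strings printed in this issue, not copied from the package source, and `hasflag`, `flags` and `isinvalid` are hypothetical helper names.

```julia
# Assumed bit layout for the Int16 return codes (only the flags that
# appear in this issue are listed).
const OK                = Int16(0x0001)
const NEWLINE           = Int16(0x0010)
const EOF               = Int16(0x0020)
const INVALID           = typemin(Int16)           # sign bit only (0x8000)
const INVALID_DELIMITER = INVALID | Int16(0x0080)  # sign bit plus its own bit

# Hypothetical helpers, not part of Parsers.jl.
hasflag(code::Int16, flag::Int16) = (code & flag) == flag
isinvalid(code::Int16) = code < 0                  # sign bit set

const FLAGS = ["OK" => OK, "NEWLINE" => NEWLINE, "EOF" => EOF,
               "INVALID_DELIMITER" => INVALID_DELIMITER]

# Names of all flags that are set in `code`.
flags(code::Int16) = [name for (name, flag) in FLAGS if hasflag(code, flag)]

code = Int16(-32607)   # the code returned by xparse above
println(isinvalid(code), " ", flags(code))
```

Under this assumed layout, -32607 has the sign bit set together with the OK, EOF and INVALID_DELIMITER bits, which matches the string "INVALID: OK | EOF | INVALID_DELIMITER". In other words the code itself records that a value was parsed, even though `res.val` does not hold it.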

Looking at what might be happening

The internal xparse2 gives the correct value, suggesting the typeparser actually does extract the correct value (and the "incorrect" code is due to simplifications in xparse2 and doesn't matter here)

julia> res = Parsers.xparse2(Float64, buf, 1, length(buf), Parsers.XOPTIONS)
Parsers.Result{Float64}(-32735, 5, 1.0)

julia> res.val, Parsers.codes(res.code)
(1.0, "INVALID: OK | EOF ")

And calling typeparser directly, I see the correct value (as expected):

julia> b, code = buf[1], Parsers.SUCCESS;

julia> Parsers.typeparser(Float64, buf, 1, length(buf), b, code, Parsers.XOPTIONS)
(1.0, 1, 4)

This isn't specific to the / character or to Float64, e.g. parsing Int64s:

julia> buf = codeunits("2 _");

julia> res = Parsers.xparse(Int64, buf, 1, length(buf), Parsers.XOPTIONS)
Parsers.Result{Int64}(-32607, 3, 4738866224)

julia> res.val, Parsers.codes(res.code)
(4738866224, "INVALID: OK | EOF | INVALID_DELIMITER ")

julia> Parsers.typeparser(Int64, buf, 1, length(buf), buf[1], code, Parsers.XOPTIONS)
(2, 1, 2)

julia> buf = codeunits("3 *");

julia> res = Parsers.xparse(Int64, buf, 1, length(buf), Parsers.XOPTIONS)
Parsers.Result{Int64}(-32607, 3, 4738866224)

julia> res.val, Parsers.codes(res.code)
(4738866224, "INVALID: OK | EOF | INVALID_DELIMITER ")

julia> Parsers.typeparser(Int64, buf, 1, length(buf), buf[1], code, Parsers.XOPTIONS)
(3, 1, 2)

So I suspect this has nothing to do with the typeparsers, but rather with the logic for handling invalid cases in xparse.

In particular, I think it's because xparse doesn't populate the value when the code is not ok:

first, typeparser returns the correct value

then xparse correctly sets the code to INVALID_DELIMITER and sends us to donedone

Parsers.jl/src/Parsers.jl

Lines 532 to 540 in 6b560d4

    # didn't find delimiter or newline, so we're invalid, keep parsing until we find delimiter, newline, or len
    code |= INVALID_DELIMITER
    while true
        pos += 1
        incr!(source)
        if eof(source, pos, len)
            code |= EOF
            @goto donedone
        end

but then donedone checks whether ok(code) holds (it doesn't here), and so doesn't pass the value to Result

Parsers.jl/src/Parsers.jl

Lines 659 to 666 in 6b560d4

    @label donedone
        tlen = pos - startpos
        if ok(code)
            y::T = x
            return Result{S}(code, tlen, y)
        else
            return Result{S}(code, tlen)
        end

So we have everything we need... but we're not using it.
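The effect of that branch can be reproduced in isolation. The following is a toy model, not the Parsers.jl implementation: `ToyResult` and `finish` are hypothetical names, the bit values are assumed, and the missing value is represented as `nothing`. The real `Result{S}(code, tlen)` presumably leaves the field unset instead, which would explain the arbitrary numbers printed earlier.

```julia
# Assumed bit values; only what the branch needs.
const OK                = Int16(0x0001)
const INVALID           = typemin(Int16)
const INVALID_DELIMITER = INVALID | Int16(0x0080)

# Same definition as the one quoted from src/utils.jl.
ok(code) = (code & (OK | INVALID)) == OK

# Toy stand-in for Parsers.Result; `nothing` marks "no value stored".
struct ToyResult{T}
    code::Int16
    tlen::Int
    val::Union{T, Nothing}
end

# Mirrors the shape of the `donedone` block: the parsed value `x` is
# only stored when `ok(code)` is true.
function finish(::Type{T}, code, tlen, x) where {T}
    if ok(code)
        return ToyResult{T}(code, tlen, x)
    else
        return ToyResult{T}(code, tlen, nothing)
    end
end

# typeparser produced 1.0, but the delimiter was invalid:
r = finish(Float64, OK | INVALID_DELIMITER, 5, 1.0)
println(r.val)   # the parsed 1.0 has been dropped
```

With a clean code such as plain `OK`, the same function stores the value, so the loss is tied purely to the INVALID bit being set.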

Possible solution?

I think donedone might be doing this to handle the cases where we get sent to donedone before we've even called typeparser (e.g. because we hit "end of file" before hitting non-whitespace characters)

If this diagnosis is correct, I wonder if we should just handle that case explicitly rather than checking ok(code), e.g. via a different goto-label:

+@label earlydone
+    # earlydone means parsing finished before calling `typeparser(T, ...)` to parse a `value::T`
+    tlen = pos - startpos
+    return Result{S}(code, tlen)
+
 @label donedone
     tlen = pos - startpos
-    if ok(code)
-        y::T = x
-        return Result{S}(code, tlen, y)
-    else
-        return Result{S}(code, tlen)
-    end
+    y::T = x
+    return Result{S}(code, tlen, y)


nickrobinson251 · 2021-10-20T16:16:44Z

Alternatively, perhaps the issue is the behaviour of ok?
Is this intended to return false?

julia> Parsers.codes(res.code)
"INVALID: OK | NEWLINE | INVALID_DELIMITER "

julia> Parsers.ok(res.code)
false

i.e. why is this

Parsers.jl/src/utils.jl

Line 66 in 6b560d4

ok(x::ReturnCode) = (x & (OK | INVALID)) == OK

not just x & OK == OK?

Docs say:

Parsers.jl/src/utils.jl

Line 30 in 6b560d4

    
               * `OK`: signals specifically that a valid value of type `T` was parsed (check via `Parsers.ok(code)`)

So on reflection, I feel like the current xparse logic of checking ok is fine... but the ok function returns false for all invalid cases, which I think it probably should not.

As pointed out by @nickrobinson251, if calling `typeparser` succeeds (i.e. `(code & OK) == OK`), then we might as well set the correct value in the parsing `Result` object. Until now, if there was any `INVALID` code involved, `res.val` was undefined. With the change proposed in this PR, we introduce a `Parsers.valueok(code)` function that can be checked; if it returns true, the user knows that a valid value was parsed and can be accessed via `res.val`. Closes #93.
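The difference between the two checks can be seen on a few hand-built codes. This is only a sketch: the bit values are assumed, `ok` follows the definition quoted from src/utils.jl, and `valueok_sketch` is a hypothetical stand-in written from the description above (`(code & OK) == OK`), not the actual `Parsers.valueok` source.

```julia
# Assumed bit values for the flags involved.
const OK                = Int16(0x0001)
const INVALID           = typemin(Int16)
const INVALID_DELIMITER = INVALID | Int16(0x0080)

ok(code)             = (code & (OK | INVALID)) == OK  # value parsed AND nothing invalid
valueok_sketch(code) = (code & OK) == OK              # value parsed, invalid or not

clean     = OK                       # e.g. "1.0" followed by a valid delimiter
bad_delim = OK | INVALID_DELIMITER   # e.g. "1.0 /": value parsed, delimiter invalid
no_value  = INVALID_DELIMITER        # nothing parsed at all

for (name, code) in ["clean" => clean, "bad_delim" => bad_delim, "no_value" => no_value]
    println(rpad(name, 10), " ok=", ok(code), " valueok=", valueok_sketch(code))
end
```

A caller that wants the value even when the delimiter is invalid would, under this reading, branch on the second predicate before touching `res.val` and still inspect the code for INVALID_DELIMITER separately.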

This was referenced Oct 20, 2021

Return the correct value when encoutering an invalid delimiter #94

Closed

Remove EOL options as we can just rely on calling next_line nickrobinson251/PowerFlowData.jl#14

Closed

This was referenced Oct 20, 2021

Fix ok definition so xparse returns a value whenever possible #95

Closed

Remove EOL Options (rely on xparse correctly parsing up to the invalid delim) nickrobinson251/PowerFlowData.jl#28

Merged

quinnj mentioned this issue Oct 21, 2021

Allow setting of parsed value if typeparser succeeded #97

Merged

quinnj closed this as completed in #97 Oct 21, 2021
