Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Behaviour of xparse when encountering invalid delimiters #93

Closed
nickrobinson251 opened this issue Oct 20, 2021 · 1 comment · Fixed by #97
Closed

Behaviour of xparse when encountering invalid delimiters #93

nickrobinson251 opened this issue Oct 20, 2021 · 1 comment · Fixed by #97

Comments

@nickrobinson251
Copy link
Collaborator

nickrobinson251 commented Oct 20, 2021

I am trying to rely on xparse to correctly parse a value when i know the input contains invalid characters (i.e. an invalid delimiter). I am hoping/expecting to get the correct value and a INVALID_DELIMITER return code.

(I gather we do want it to be possible to rely on xparse in the presence of invalid delimiters, given #78).

But xparse doesn't always return the correct value (using Parsers.jl v2.0.6).
For example, when trying to parse a Float64 when there are special characters like /

julia> using Parsers

julia> buf = codeunits("1.0 /");

julia> res = Parsers.xparse(Float64, buf, 1, length(buf), Parsers.XOPTIONS)
Parsers.Result{Float64}(-32607, 5, 2.3255508133e-314)

julia> res.val, Parsers.codes(res.code)
(2.3255508133e-314, "INVALID: OK | EOF | INVALID_DELIMITER ")

here xparse returned the expected code (INVALID_DELIMITER), but not the correct value (expected is res.val === 1.0)


Looking at what might be happening

The internal xparse2 gives the correct value, suggesting the typeparser actually does extract the correct value (and the "incorrect" code is due to simplifications in xparse2 and doesn't matter here)

julia> res = Parsers.xparse2(Float64, str, 1, length(str), Parsers.XOPTIONS)
Parsers.Result{Float64}(-32735, 5, 1.0)

julia> res.val, Parsers.codes(res.code)
(1.0, "INVALID: OK | EOF ")

And calling typeparser directly, I see the correct value (as expected):

julia> b, code = buf[1], Parsers.SUCCESS;

julia> Parsers.typeparser(Float64, buf, 1, length(buf), b, code, Parsers.XOPTIONS)
(1.0, 1, 4)

This isn't specific to the /character or to Float64, e.g. parsing Int64s:

julia> buf = codeunits("2 _");

julia> res = Parsers.xparse(Int64, buf, 1, length(buf), Parsers.XOPTIONS)
Parsers.Result{Int64}(-32607, 3, 4738866224)

julia> res.val, Parsers.codes(res.code)
(4738866224, "INVALID: OK | EOF | INVALID_DELIMITER ")

julia> Parsers.typeparser(Int64, buf, 1, length(buf), buf[1], code, Parsers.XOPTIONS)
(2, 1, 2)
julia> buf = codeunits("3 *");

julia> res = Parsers.xparse(Int64, buf, 1, length(buf), Parsers.XOPTIONS)
Parsers.Result{Int64}(-32607, 3, 4738866224)

julia> res.val, Parsers.codes(res.code)
(4738866224, "INVALID: OK | EOF | INVALID_DELIMITER ")

julia> Parsers.typeparser(Int64, buf, 1, length(buf), buf[1], code, Parsers.XOPTIONS)
(3, 1, 2)

So i suspect, this isn't to do with the typeparsers, but to do with the logic for handling invalid cases in xparse.

In particular, i think it's because xparse doesn't populate the value when the codes is not ok:

  • first typeparser returns the correct value
  • then in xparse correctly sets the code to INVALID_DELIMITER and send us to donedone

    Parsers.jl/src/Parsers.jl

    Lines 532 to 540 in 6b560d4

    # didn't find delimiter or newline, so we're invalid, keep parsing until we find delimiter, newline, or len
    code |= INVALID_DELIMITER
    while true
    pos += 1
    incr!(source)
    if eof(source, pos, len)
    code |= EOF
    @goto donedone
    end
  • but then donedone check's if ok(code) (which is false) and then doesn't pass the value to Result

    Parsers.jl/src/Parsers.jl

    Lines 659 to 666 in 6b560d4

    @label donedone
    tlen = pos - startpos
    if ok(code)
    y::T = x
    return Result{S}(code, tlen, y)
    else
    return Result{S}(code, tlen)
    end

So we have everything we need... but we're not using it.


Possible solution?

I think donedone might be doing this to handle the cases where we get sent to donedone before we've even called typeparser (e.g. because we hit "end of file" before hitting non-whitespace characters)

If this diagnosis is correct, i wonder if we should just handle that explicitly, rather than checking ok(code) e.g.
via a different goto-label, e.g.

+@label earlydone
+    # earlydone means parsing finished before calling `typeparser(T, ...)` to parse a `value::T`
+    tlen = pos - startpos
+    return Result{S}(code, tlen)
+
 @label donedone
     tlen = pos - startpos
-    if ok(code)
-        y::T = x
-        return Result{S}(code, tlen, y)
-    else
-        return Result{S}(code, tlen)
-    end
+    y::T = x
+    return Result{S}(code, tlen, y)
@nickrobinson251
Copy link
Collaborator Author

nickrobinson251 commented Oct 20, 2021

alternatively, perhaps the issue is the behaviour of ok?
is this intended to return false?

julia> Parsers.codes(res.code)
"INVALID: OK | NEWLINE | INVALID_DELIMITER "

julia> Parsers.ok(res.code)
false

i.e. why is this

ok(x::ReturnCode) = (x & (OK | INVALID)) == OK

not just x & OK == OK?

Docs say:

* `OK`: signals specifically that a valid value of type `T` was parsed (check via `Parsers.ok(code)`)

So on reflection, i feel like the current xparse logic of checking ok is fine... but the ok function is returning false for all invalid cases which i think probably it should not be?

quinnj added a commit that referenced this issue Oct 21, 2021
As pointed out by @nickrobinson251, if calling `typeparser` succeeds
(i.e. `(code & OK) == OK`), then we might as well set the correct value
in the parsing `Result` object. Up till now, if there was any `INVALID`
code invovled, `res.val` was undefined. With the proposed change in this
PR, we introduce a `Parsers.valueok(code)` function that can be checked,
and if true, then the user can know that a valid value was parsed and
can be accessed via `res.val`. Closes #93.
quinnj added a commit that referenced this issue Oct 21, 2021
As pointed out by @nickrobinson251, if calling `typeparser` succeeds
(i.e. `(code & OK) == OK`), then we might as well set the correct value
in the parsing `Result` object. Up till now, if there was any `INVALID`
code invovled, `res.val` was undefined. With the proposed change in this
PR, we introduce a `Parsers.valueok(code)` function that can be checked,
and if true, then the user can know that a valid value was parsed and
can be accessed via `res.val`. Closes #93.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment