-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce groupmark
number-parsing option
#128
Conversation
Codecov Report
@@ Coverage Diff @@
## main #128 +/- ##
==========================================
- Coverage 87.67% 86.25% -1.42%
==========================================
Files 9 9
Lines 2312 2350 +38
==========================================
Hits 2027 2027
- Misses 285 323 +38
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
|
||
@test Parsers.xparse(Float32, "100,000,00099e-2"; groupmark=',').val ≈ 100_000_000.99 | ||
@test Parsers.xparse(Float32, "1,0,0,0,0,0,0,0,099e-2"; groupmark=',').val ≈ 100_000_000.99 | ||
@test Parsers.xparse(Float32, "10000000099e-2"; groupmark=',').val ≈ 100_000_000.99 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests look pretty good. Can we add a few more:
@test_throws
when decimal and groupmark are the same?- Error/boundary cases; like
Parsers.xparse(Float64, "100,abc"; groupmark=',').val === 100.0
, and maybe that we didn't consume past the,
as well (add a test fortlen
on the parsers result) - Same error/boundary cases for ints
- I think we also need to throw an error if
delim
andgroupmark
are the same; like ifdelim=','
andgroupmark=','
, thena,b,c,100,000
is ambiguous (columns of ["a", "b", "c", 100, 0] or ["a", "b", "c", 100_000])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestions, thanks!
For now I allow groupmark == delim
iff quoted
is true (but happy to change if you'd prefer). The reason is that a very common grouping mark is the comma, which is also a very common delimiter... and to disambiguate the two, one should use quoting.
This allows for a wider range of valid CSVs to be parsed, if people don't quote their locale-formatted numbers, this might produce weird results.
We seem to allow the following:
julia> Parsers.xparse(Float64, "42,1"; decimal=',', delim=',')
Parsers.Result{Float64}(33, 4, 42.1)
which is also ambiguous and could happen in Czech locale. This is how Czech numbers are formatted:
Parsers.xparse(Float64, "1 042,1"; groupmark=' ', decimal=',')
I don't think delim
is defined by locale, e.g. in Czech, conventionally ;
is the delimiter, but IIUC, this is not formalized (can't find it in ICU explorer) and a comma could be used as well (hopefully together with quoting).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I should have clarified that if we're quoted, then we can allow delim == groupmark. Good call.
src/Parsers.jl
Outdated
@@ -135,8 +139,11 @@ function Options( | |||
push!(refs, comment) | |||
cmt = ptrlen(comment) | |||
end | |||
if !isnothing(groupmark) && ((groupmark == decimal || isnumeric(groupmark)) || (delim == groupmark && !quoted) || (oq == groupmark) || (cq == groupmark)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Oh, |
Inspired by the
grouping_mark
option from the R packagereadr
, this allows skipping certain characters used to format numbers. I avoided the termthousand_separators
as different locals use these marks differently (e.g., indian numbering system would format the number 5000000 as50,00,000
).