-
Notifications
You must be signed in to change notification settings - Fork 16
Introduce groupmark number-parsing option
#128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## main #128 +/- ##
==========================================
- Coverage 87.67% 86.25% -1.42%
==========================================
Files 9 9
Lines 2312 2350 +38
==========================================
Hits 2027 2027
- Misses 285 323 +38
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
|
|
||
| @test Parsers.xparse(Float32, "100,000,00099e-2"; groupmark=',').val ≈ 100_000_000.99 | ||
| @test Parsers.xparse(Float32, "1,0,0,0,0,0,0,0,099e-2"; groupmark=',').val ≈ 100_000_000.99 | ||
| @test Parsers.xparse(Float32, "10000000099e-2"; groupmark=',').val ≈ 100_000_000.99 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests look pretty good. Can we add a few more:
@test_throwswhen decimal and groupmark are the same?- Error/boundary cases; like
Parsers.xparse(Float64, "100,abc"; groupmark=',').val === 100.0, and maybe that we didn't consume past the,as well (add a test fortlenon the parsers result) - Same error/boundary cases for ints
- I think we also need to throw an error if
delimandgroupmarkare the same; like ifdelim=','andgroupmark=',', thena,b,c,100,000is ambiguous (columns of ["a", "b", "c", 100, 0] or ["a", "b", "c", 100_000])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestions, thanks!
For now I allow groupmark == delim iff quoted is true (but happy to change if you'd prefer). The reason is that a very common grouping mark is the comma, which is also a very common delimiter... and to disambiguate the two, one should use quoting.
This allows for a wider range of valid CSVs to be parsed, if people don't quote their locale-formatted numbers, this might produce weird results.
We seem to allow the following:
julia> Parsers.xparse(Float64, "42,1"; decimal=',', delim=',')
Parsers.Result{Float64}(33, 4, 42.1)which is also ambiguous and could happen in Czech locale. This is how Czech numbers are formatted:
Parsers.xparse(Float64, "1 042,1"; groupmark=' ', decimal=',')I don't think delim is defined by locale, e.g. in Czech, conventionally ; is the delimiter, but IIUC, this is not formalized (can't find it in ICU explorer) and a comma could be used as well (hopefully together with quoting).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I should have clarified that if we're quoted, then we can allow delim == groupmark. Good call.
src/Parsers.jl
Outdated
| push!(refs, comment) | ||
| cmt = ptrlen(comment) | ||
| end | ||
| if !isnothing(groupmark) && ((groupmark == decimal || isnumeric(groupmark)) || (delim == groupmark && !quoted) || (oq == groupmark) || (cq == groupmark)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
Oh, |
Inspired by the
grouping_markoption from the R packagereadr, this allows skipping certain characters used to format numbers. I avoided the termthousand_separatorsas different locals use these marks differently (e.g., indian numbering system would format the number 5000000 as50,00,000).