Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce groupmark number-parsing option #128

Merged
merged 7 commits into from
Aug 26, 2022
Merged

Conversation

Drvi
Copy link
Collaborator

@Drvi Drvi commented Aug 25, 2022

Inspired by the grouping_mark option from the R package readr, this allows skipping certain characters used to format numbers. I avoided the term thousand_separators as different locals use these marks differently (e.g., indian numbering system would format the number 5000000 as 50,00,000).

src/Parsers.jl Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Aug 26, 2022

Codecov Report

Merging #128 (ed43e09) into main (aca71eb) will decrease coverage by 1.41%.
The diff coverage is 60.00%.

@@            Coverage Diff             @@
##             main     #128      +/-   ##
==========================================
- Coverage   87.67%   86.25%   -1.42%     
==========================================
  Files           9        9              
  Lines        2312     2350      +38     
==========================================
  Hits         2027     2027              
- Misses        285      323      +38     
Impacted Files Coverage Δ
src/utils.jl 85.11% <18.75%> (-5.34%) ⬇️
src/ints.jl 89.70% <66.66%> (-8.48%) ⬇️
src/Parsers.jl 94.26% <100.00%> (+0.02%) ⬆️
src/floats.jl 92.55% <100.00%> (-0.73%) ⬇️
src/ryu.jl 72.83% <0.00%> (-2.82%) ⬇️
src/dates.jl 75.54% <0.00%> (-0.62%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.


@test Parsers.xparse(Float32, "100,000,00099e-2"; groupmark=',').val ≈ 100_000_000.99
@test Parsers.xparse(Float32, "1,0,0,0,0,0,0,0,099e-2"; groupmark=',').val ≈ 100_000_000.99
@test Parsers.xparse(Float32, "10000000099e-2"; groupmark=',').val ≈ 100_000_000.99
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests look pretty good. Can we add a few more:

  • @test_throws when decimal and groupmark are the same?
  • Error/boundary cases; like Parsers.xparse(Float64, "100,abc"; groupmark=',').val === 100.0, and maybe that we didn't consume past the , as well (add a test for tlen on the parsers result)
  • Same error/boundary cases for ints
  • I think we also need to throw an error if delim and groupmark are the same; like if delim=',' and groupmark=',', then a,b,c,100,000 is ambiguous (columns of ["a", "b", "c", 100, 0] or ["a", "b", "c", 100_000])

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestions, thanks!

For now I allow groupmark == delim iff quoted is true (but happy to change if you'd prefer). The reason is that a very common grouping mark is the comma, which is also a very common delimiter... and to disambiguate the two, one should use quoting.

This allows for a wider range of valid CSVs to be parsed, if people don't quote their locale-formatted numbers, this might produce weird results.

We seem to allow the following:

julia> Parsers.xparse(Float64, "42,1"; decimal=',', delim=',')
Parsers.Result{Float64}(33, 4, 42.1)

which is also ambiguous and could happen in Czech locale. This is how Czech numbers are formatted:

Parsers.xparse(Float64, "1 042,1"; groupmark=' ', decimal=',')

I don't think delim is defined by locale, e.g. in Czech, conventionally ; is the delimiter, but IIUC, this is not formalized (can't find it in ICU explorer) and a comma could be used as well (hopefully together with quoting).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, I should have clarified that if we're quoted, then we can allow delim == groupmark. Good call.

src/Parsers.jl Outdated
@@ -135,8 +139,11 @@ function Options(
push!(refs, comment)
cmt = ptrlen(comment)
end
if !isnothing(groupmark) && ((groupmark == decimal || isnumeric(groupmark)) || (delim == groupmark && !quoted) || (oq == groupmark) || (cq == groupmark))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

test/floats.jl Show resolved Hide resolved
test/floats.jl Show resolved Hide resolved
@Drvi
Copy link
Collaborator Author

Drvi commented Aug 26, 2022

Oh, isnothing is not defined in Julia 1.0

@quinnj quinnj merged commit 88ceff9 into JuliaData:main Aug 26, 2022
@LilithHafner LilithHafner mentioned this pull request Jun 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants