Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Playing with ideas on how to better componentize internal parsing #127

Merged
merged 16 commits into from
Oct 20, 2022

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Jun 24, 2022

I haven't even tried to run this code yet, so very much WIP, but am
pretty much done for today and maybe the next few days, so wanted to
put what I have up in case anyone else wants to take a look.

The main new code is all in the src/components.jl file, where I try
a functional "layer" approach for the different types of parsing we're doing,
delimited, quoted, strip whitespace, etc. Things I like:

  • Having the components separated allows us, in principle, to have more
    custom "stacks" by including/excluding certain components; this would solve
    the current xparse vs xparse2 awkwardness
  • It forced me to do some code cleanup; i.e. there's only the components.jl
    file of code, and it's factored in such a way that strings.jl now uses a
    few pieces as needed and we avoid the tons of duplicated code we had before
  • This kind of approach also allows, in theory, to add more types of layers
    more easily than the old design where everything was a big monolith
  • This would hopefully provide a more compile-time-friendly setup for
    reviving the precompile efforts. I think a big part of the challenge is
    just the sheer size of the xparse function body.

Things I don't like:

  • It feels like there's a tension between having/needing a Parsers.Options
    object to pass around the various layer configurations (delim, oq, cq, etc.),
    and then the layers are separate functions. Like, if I didn't include/want
    a delim the delimiter layer could still be included and vice versa. If
    the delimiter layer wasn't included in a stack, the Parsers.Options still
    has the delim field. At first, I tried making a separate Delimiter struct
    with just the delimiter-related Parsers.Options fields and having that be
    the delimiter function, but that means you can't have a "static stack"
    since building the stack would rely on knowing all the layer options. Keeping
    the configs in Options means we could theoretically have a compiled-once,
    reused-forever parsing stack, but like I said, it feels smelly to need both.
  • The parsing layers feel a little brittle with regards to their order; like,
    if you tried an order other than delimiter(quoted(sentinel(typeparser)))
    I wouldn't be surprised if it just didn't work. It should be fine to exclude
    certain layers, but it just doesn't feel like we quite have the flexibility
    I had imagined we'd have by componentizing.
  • Related to the last point, I'm not sure if, in CSV.jl for example, we'd opt
    to just have a single static stack function that we compiled once and layers
    would just skip themselves if needed (like if stripwhitespace=false, the
    whitespace layer would still be included, but would just check
    opts.stripwhitespace and skip if needed. It's probably better overall for
    avoiding any parsing layer-stack recompilation costs. But it also feels a
    little lame since we're basically not taking advantage of the componentizing
    at all. I had also imagined letting users compose their own parsing "stacks"
    for custom parsing scenarios, but as I said, I'm not sure if the components
    are too brittle to compose very well. Probably worth experimenting more on this.

I haven't looked at performance yet (or even run this code, as mentioned).

@quinnj
Copy link
Member Author

quinnj commented Jun 24, 2022

Maybe to finish a thought on what the design would end up being:

  • Currently xparse takes a user-provided Options; w/ the component design, I think xparse could essentially be defined as const xparse = Result(delimiter(emptysentinel(whitespace(quoted(whitespace(sentinel(typeparser))))))) and xparse2 would just be const xparse2 = Result(whitespace(typeparser)).
  • Or maybe we keep a slightly higher-level xparse that you could pass a custom "stack" to if you wanted to configure the stack yourself.

@quinnj
Copy link
Member Author

quinnj commented Jun 24, 2022

I keep going back and forth in the desire to componentize things in Parsers.jl, but also make sure we have something relatively static for CSV.jl so we can try to achieve a good workflow there where the Parsers.jl code would almost never recompiled.

@quinnj
Copy link
Member Author

quinnj commented Jun 24, 2022

Ok, this code actually parses Ints/floats now. And pretty fast too! Only a 10-20% regression w/o any code inspection/tuning, so that's pretty good IMO.

@codecov
Copy link

codecov bot commented Oct 12, 2022

Codecov Report

Base: 89.52% // Head: 90.98% // Increases project coverage by +1.45% 🎉

Coverage data is based on head (c964e6c) compared to base (dcdfe83).
Patch coverage: 95.34% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #127      +/-   ##
==========================================
+ Coverage   89.52%   90.98%   +1.45%     
==========================================
  Files           9        9              
  Lines        2445     1608     -837     
==========================================
- Hits         2189     1463     -726     
+ Misses        256      145     -111     
Impacted Files Coverage Δ
src/floats.jl 95.29% <60.00%> (+0.27%) ⬆️
src/ints.jl 91.17% <80.00%> (ø)
src/strings.jl 91.52% <90.19%> (-5.40%) ⬇️
src/utils.jl 88.99% <91.17%> (-1.51%) ⬇️
src/Parsers.jl 96.75% <95.45%> (+2.24%) ⬆️
src/components.jl 99.15% <99.15%> (ø)
src/bools.jl 97.61% <100.00%> (-1.59%) ⬇️
src/dates.jl 78.85% <100.00%> (-0.19%) ⬇️
... and 3 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@quinnj
Copy link
Member Author

quinnj commented Oct 20, 2022

Ok @nickrobinson251, this is passing all tests for me locally, and passes CSV.jl tests with this PR.

@quinnj quinnj merged commit 10b9503 into main Oct 20, 2022
@quinnj quinnj deleted the jq/ideas branch October 20, 2022 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants