Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

various refactoring; GHC 9.4 support #29

Closed
wants to merge 11 commits into from
Closed

various refactoring; GHC 9.4 support #29

wants to merge 11 commits into from

Conversation

raehik
Copy link
Collaborator

@raehik raehik commented Oct 20, 2022

This is kind of a playground for things I had been thinking about. It should not be merged without lots of history clean up (I think more likely it would happen piecemeal, in a few chunks). Please throw any criticism or thoughts at me here!

To-dos

  • re-add stateful parser (temporarily removed to make reorganising easier)
  • refactor export names using @AndrasKovacs 's suggestions

Things done

Support GHC 9.4; Remove FlatParse.Internal.UnboxedNumerics

The shims in UnboxedNumerics allowed working with unboxed machine integers across various compiler versions. This was accomplished with lots of CPP macros.

Lots of the explicit unboxing is removed. Where previously we did unary/binary operations on unboxed values, we now wrap them with their respective boxed type constructor (e.g. W8# :: Word8). There is little CPP, and it no longer impacts any types.

Break up main parser module into smaller submodules

Basic is ~1100 lines long. Many primitive parsers and combinators don't interact with each other, so putting them in their own modules isn't particularly hard, and lets us work with smaller modules.

Integer parsers were a reason for this: I've pulled them into a 350 line module with more docs. FlatParse.Basic still exports everything at the end of the day.

Does this potentially impact performance due to differences in how programs are optimized cross-module? Or are we OK because almost everything is INLINEd?

Lay clear naming conventions

  • Parsers are prefixed get* (this used to be split between none, any*, read*). parse is another candidate.
  • CPS parsers are prefixed with*
  • "value checking" parsers like Word8 -> Parser e () are getXOf (idea taken from the CBOR library)
  • combinators are anything (not really touched)
  • *Unsafe suffix means unsafe, *# suffix means some arguments are unboxed

The default "view" taken is bytewise, so take and takeRest now refer to the ByteString functions; takeString, takeRestString do it for Strings (this is swapped from current).

Maybe there should be a text-focused interface exported separately, which would export all the old names e.g. char, string, take using String.

@raehik raehik mentioned this pull request Oct 20, 2022
@raehik raehik changed the title partial work supporting GHC 9.4 various refactoring; GHC 9.4 support Oct 25, 2022
@raehik
Copy link
Collaborator Author

raehik commented Oct 26, 2022

From a very quick benchmark check, it seems we haven't lost performance with the removal of many manually unboxed operations. Though I noticed an unusual pattern of worsening performance on newer GHCs.

@AndrasKovacs
Copy link
Owner

AndrasKovacs commented Oct 29, 2022

I find the get in the naming scheme to be redundant. Instead of getX or getXOf we can have X and XOf. Anyway, currently I'm leaning more towards the old naming scheme because that's closer to *parsec naming. Not to say though that the old naming scheme couldn't be made more consistent, like:

  • X parses a concrete value with type X.
  • anyX parses any value with type X (X could also denote a "subtype" of a type that we don't want to enshrine in newtype-s, like when returning Integer from anyPosInteger)
  • withAnyX for CPS'd versions. This is a bit awkward, because Any is redundant here (because CPS is useless for concrete X value parsers), but perhaps better than withX for consistency reasons.

So readInteger should become anyPosInteger, for instance. It's a good question though whether every X should have all variations, when the variations make at least some sense. Currently we're far from that, but it could be also a bit too much clutter in the library to have that many functions.

In this approach, we would be keeping *parsec conventions but aiming for more consistency than in *parsec libraries.

Alternatively, we could try to just rethink API and naming from the ground up with no regard for *parsecs at all. In this case, we could consider less orthodox things like using classes for primitive parsers and making use of TypeApplications. For example,

class PrimParser a where
 get     :: a -> Parser e a -- concrete
 any     :: Parser e a -- any
 withAny :: (a -> Parser e b) -> Parser e b -- CPS any

Then any @Word8 would parse any Word8.

@raehik
Copy link
Collaborator Author

raehik commented Nov 4, 2022

I decided on getX to better separate primitive parsers from the combinators (distinction is a little arbitrary). It conveniently side steps any possible export overlaps. And I went with parsing any value as the "default" probably because those are the parsers I use personally. But thinking about the whole module as a Parse.X prefix, X and anyX are sensible.

The readX and scanX functions were the big outliers. I redesigned the machine integer parsers to export lots more definitions, including scanX as getXOfUnsafe (with your recommendation, XUnsafe). readInteger was renamed to getAsciiDecimalInteger (or, anyAsciiDecimalInteger). I kind of don't mind having lots of exports, and would shy away from the type class approach -- I've done similar before in my own code for elegance, but for whatever reason it rubs me the wrong way for a low level library.

Thanks for the suggestions! I'll make a note to refactor using that "more consistent *parsec" scheme.

@raehik
Copy link
Collaborator Author

raehik commented Nov 4, 2022

Is the modularization sensible and useful to you? I find it easier to understand the parser internals, and it lets us export internals/unsupported functionality easier, but otherwise it was mostly on a whim. (I am leaning towards explicitly re-exporting everything in the main parser module for a better Haddock experience.)

edit: and do we have to worry about performance changes due to splitting parsers up between modules? I want to say INLINE pragmas solve that, but I don't really know.

@AndrasKovacs
Copy link
Owner

AndrasKovacs commented Nov 4, 2022

I do like the modularization. I also like exporting everything in the parser module.

Indeed INLINE prevents performance issues arising from module splitting. Alternatively, we could use -fexpose-all-unfoldings which AFAIK neutralizes module splitting effects regardless of inline pragmas. I don't think it's needed though because everything that matters is already inline.

In the holidays, around December, I'll try to join the refactoring efforts here, and I'd like to overhaul tutorials, docs & error message infrastructure too.

@AndrasKovacs
Copy link
Owner

Hi, what's the status of this PR? I'd like to do a release sometime next month.

@raehik
Copy link
Collaborator Author

raehik commented Jan 13, 2023

I'm actively looking at this again. It was already quite broken, now moreso and needs to be rewritten for ParserT. Which reminds me, I spoke to @dminuoso the other day about a potential shim for Basic & Stateful - that would be fantastic, since this change would otherwise be coming with lots of repeated definitions (and documentation!).

@raehik raehik mentioned this pull request Jan 13, 2023
5 tasks
@raehik
Copy link
Collaborator Author

raehik commented Jan 18, 2023

The discussion here is useful, but the code isn't -- I'm closing this and referencing it in the new PR #36 .

@raehik raehik closed this Jan 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants