Whither Text? #51

Open
bitemyapp opened this issue Dec 16, 2015 · 5 comments

@bitemyapp
Contributor

Thanks to another IRC user, I was able to get Text parsing with Trifecta via this code:

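{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}
-- Assumes the trifecta API of the time, where stepParser also takes an
-- initial Delta and ByteString (the two mempty arguments below).
import Data.Semigroup.Reducer (Reducer (..))
import Data.Text (Text)
import Data.Text.Encoding (encodeUtf8)
import Text.Trifecta
import Text.Trifecta.Delta (Delta)

-- Text Rope and parsing: each Text chunk is copied into a UTF-8 strand.
instance Reducer Text Rope where
  unit = unit . strand . encodeUtf8
  cons = cons . strand . encodeUtf8
  snoc r = snoc r . strand . encodeUtf8

parseText :: Parser a -> Delta -> Text -> Result a
parseText p d inp =
  starve $ feed inp $ stepParser (release d *> p) mempty mempty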

But the copying makes me unhappy. I asked in IRC, but no one really knew: why does Trifecta only support UTF-8 ByteStrings as a first-class input stream?

@phadej
Collaborator

phadej commented Dec 16, 2015

Supporting Text "natively" would also mean supporting UTF-16 bytestrings. I can see that it may be possible, but I'm not sure how.

@bitemyapp
Contributor Author

@phadej is there something intrinsic to how UTF-16 bytestrings are laid out that would require a large-scale revision of the library, or is it just schlep? Or something in between?

@phadej
Collaborator

phadej commented Dec 17, 2015

@bitemyapp trifecta works on ByteString. I don't see a problem with making a newtype Parser16 that has different CharParsing and DeltaParsing instances while the rest of the machinery stays the same; i.e. we only need to say how to get a Char from a ByteString that is encoded differently.
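
To make "how to get a Char from a ByteString that is encoded differently" concrete, here is a minimal sketch of decoding one Char from a UTF-16 ByteString. It assumes little-endian code units; `unitAt` and `decodeUtf16LE` are hypothetical helper names for illustration and are not part of trifecta.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import Data.Char (chr)
import Data.Word (Word16)

-- | Read one little-endian 16-bit code unit starting at byte offset i.
unitAt :: B.ByteString -> Int -> Maybe Word16
unitAt bs i
  | i >= 0 && i + 1 < B.length bs =
      Just (fromIntegral (B.index bs i)
              .|. (fromIntegral (B.index bs (i + 1)) `shiftL` 8))
  | otherwise = Nothing

-- | Decode the Char starting at byte offset i, returning it together with
-- the number of bytes consumed (2, or 4 for a surrogate pair).
-- Returns Nothing on truncated input or an unpaired surrogate.
decodeUtf16LE :: B.ByteString -> Int -> Maybe (Char, Int)
decodeUtf16LE bs i = do
  w1 <- unitAt bs i
  if w1 >= 0xD800 && w1 <= 0xDBFF
    then do
      -- High surrogate: a low surrogate must follow.
      w2 <- unitAt bs (i + 2)
      if w2 >= 0xDC00 && w2 <= 0xDFFF
        then Just ( chr (0x10000
                          + ((fromIntegral w1 - 0xD800) `shiftL` 10)
                          + (fromIntegral w2 - 0xDC00))
                  , 4 )
        else Nothing
    else if w1 >= 0xDC00 && w1 <= 0xDFFF
      then Nothing                      -- lone low surrogate
      else Just (chr (fromIntegral w1), 2)
```

A Parser16-style CharParsing instance would presumably call something like this where the UTF-8 parser decodes its next character, and advance its byte offset by the returned width.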

The problem is that ByteString is

data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
                     {-# UNPACK #-} !Int                -- offset
                     {-# UNPACK #-} !Int                -- length

but Text is

-- | A space efficient, packed, unboxed Unicode text type.
--
-- Internally, the 'Text' type is represented as an array of 'Word16' UTF-16 code units.
-- The offset and length fields in the constructor are in these units, not units of 'Char'.
data Text = Text
    {-# UNPACK #-} !Array            -- payload (Word16 elements)
    {-# UNPACK #-} !Int              -- offset (units of Word16, not Char)
    {-# UNPACK #-} !Int              -- length (units of Word16, not Char)
    deriving (Typeable)

-- | Immutable array type.
data Array = Array {
      aBA :: ByteArray#
    }

And I'm not sure whether one can convert from ByteArray# to ForeignPtr Word8 without a copy.
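
As a side note on the quoted Text definition: because offset and length are counted in Word16 code units rather than Chars, locating the n-th Char needs a scan. The sketch below only models that UTF-16 layout over a plain String; `utf16Units` and `unitOffset` are hypothetical names, not functions from the text library.

```haskell
import Data.Char (ord)

-- | Number of UTF-16 code units needed for one Char:
-- 2 for code points outside the Basic Multilingual Plane, otherwise 1.
utf16Units :: Char -> Int
utf16Units c
  | ord c >= 0x10000 = 2
  | otherwise        = 1

-- | Offset, in UTF-16 code units, at which the n-th Char (0-based) starts.
-- This is linear in n because every preceding Char must be inspected.
unitOffset :: Int -> String -> Int
unitOffset n = sum . map utf16Units . take n
```

For example, in the three-Char string made of 'a', U+1F600 and 'z', the last Char starts at code-unit offset 3, not 2.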

@bitemyapp
Contributor Author

> And I'm not sure if one can convert from ByteArray# to ForeignPtr Word8 without copy.

This is the essence of my worry: that it would force a larger rewrite to play nice with ByteArray#.

@ekmett
Owner

ekmett commented Aug 10, 2016

The real reason was that massive amounts of code duplication would be required. I'm open to switching everything from ByteString to Text, but that turns out to be a bit heavy as well. At the time, Text didn't support the codepoint-counted cut operations we needed to avoid massive asymptotic slowdowns. I managed to get Bryan to add them, but the few months between asking and receiving robbed the rebuild of any steam.
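
For readers unfamiliar with the term, a "codepoint-counted cut" splits the input after a given number of code points rather than bytes or code units. The sketch below shows the idea on a UTF-8 ByteString and assumes valid UTF-8 input; `isLead` and `splitAtCodePoints` are hypothetical names, and this is not the operation that was added to Text.

```haskell
import qualified Data.ByteString as B
import Data.Bits ((.&.))
import Data.Word (Word8)

-- | True for bytes that start a code point, i.e. anything that is not a
-- 10xxxxxx continuation byte.
isLead :: Word8 -> Bool
isLead w = w .&. 0xC0 /= 0x80

-- | Split a UTF-8 encoded ByteString after its first n code points.
-- The scan is linear in the number of bytes inspected; the split itself
-- shares the underlying buffer and does not copy.
splitAtCodePoints :: Int -> B.ByteString -> (B.ByteString, B.ByteString)
splitAtCodePoints n bs
  | n <= 0 = (B.empty, bs)
  | otherwise =
      -- The n-th lead byte (0-based) is where code point n+1 begins.
      case drop n (B.findIndices isLead bs) of
        (i : _) -> B.splitAt i bs
        []      -> (bs, B.empty)
```

Without a primitive of this kind, a caller has to approximate the cut by repeatedly decoding from the start, which is presumably the sort of asymptotic slowdown the comment refers to.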
