Whither Text? #51

bitemyapp · 2015-12-16T03:14:26Z

Thanks to another IRC user, I was able to get Text parsing with Trifecta via this code:

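{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}
import Data.Semigroup.Reducer (Reducer (..))
import Data.Text (Text)
import Data.Text.Encoding (encodeUtf8)
import Text.Trifecta
import Text.Trifecta.Delta (Delta)
import Text.Trifecta.Rope (Rope, strand)

-- Text Rope and parsing: each Text chunk is re-encoded (copied) as UTF-8.
instance Reducer Text Rope where
  unit = unit . strand . encodeUtf8
  cons = cons . strand . encodeUtf8
  snoc r = snoc r . strand . encodeUtf8

parseText :: Parser a -> Delta -> Text -> Result a
parseText p d inp =
  starve $ feed inp $ stepParser (release d *> p)
    mempty mempty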

But the copying makes me unhappy. I asked in IRC, but no one really knew: why does Trifecta only support UTF-8 ByteStrings as a first-class input stream?

phadej · 2015-12-16T23:40:33Z

Supporting Text "natively" would also mean supporting UTF-16 bytestrings. I can see that it may be possible, but I'm not sure how.

bitemyapp · 2015-12-17T06:20:54Z

@phadej is there something intrinsic to how UTF-16 bytestrings are laid out that would require a large-scale revision of the library, or is it just schlep? Or something in between?

phadej · 2015-12-17T13:06:46Z

@bitemyapp trifecta works on ByteString. I don't see a problem with making a newtype Parser16 that has different CharParsing and DeltaParsing instances while the rest of the machinery stays the same; i.e. we need to say how to get a Char from a ByteString that is encoded differently.
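To make "how to get a Char from a ByteString that is encoded differently" concrete, here is a minimal sketch of decoding one Char from a little-endian UTF-16 ByteString. The names unit16 and unconsUtf16LE are hypothetical and not part of trifecta, and the little-endian byte order is an assumption; a Parser16 would need something of this shape inside its CharParsing instance.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import Data.Char (chr)
import Data.Word (Word16)

-- Hypothetical helper: read one little-endian UTF-16 code unit.
unit16 :: B.ByteString -> Maybe (Word16, B.ByteString)
unit16 bs
  | B.length bs < 2 = Nothing
  | otherwise =
      let lo = fromIntegral (B.index bs 0) :: Word16
          hi = fromIntegral (B.index bs 1) :: Word16
      in Just ((hi `shiftL` 8) .|. lo, B.drop 2 bs)

-- Hypothetical helper: decode one Char and return the remaining input.
-- Yields Nothing on truncated input or an unpaired surrogate.
unconsUtf16LE :: B.ByteString -> Maybe (Char, B.ByteString)
unconsUtf16LE bs = do
  (u, rest) <- unit16 bs
  if u >= 0xD800 && u <= 0xDBFF
    then do
      -- High surrogate: a low surrogate must follow.
      (l, rest') <- unit16 rest
      if l >= 0xDC00 && l <= 0xDFFF
        then Just ( chr (0x10000
                         + ((fromIntegral u - 0xD800) `shiftL` 10)
                         + (fromIntegral l - 0xDC00))
                  , rest' )
        else Nothing
    else if u >= 0xDC00 && u <= 0xDFFF
      then Nothing                       -- lone low surrogate
      else Just (chr (fromIntegral u), rest)
```

The point of the sketch is that each Char consumes either two or four bytes, so the byte offsets tracked by Delta would have to advance accordingly.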

The problem is that ByteString is

data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
                     {-# UNPACK #-} !Int                -- offset
                     {-# UNPACK #-} !Int                -- length

but Text is

-- | A space efficient, packed, unboxed Unicode text type.
--
-- Internally, the 'Text' type is represented as an array of 'Word16' UTF-16 code units.
-- The offset and length fields in the constructor are in these units, not units of 'Char'.
data Text = Text
    {-# UNPACK #-} !Array          -- payload (Word16 elements)
    {-# UNPACK #-} !Int              -- offset (units of Word16, not Char)
    {-# UNPACK #-} !Int              -- length (units of Word16, not Char)
    deriving (Typeable)

-- | Immutable array type.
data Array = Array {
      aBA :: ByteArray#
    }

And I'm not sure if one can convert from ByteArray# to ForeignPtr Word8 without copy.

bitemyapp · 2015-12-17T20:59:13Z

> And I'm not sure if one can convert from ByteArray# to ForeignPtr Word8 without copy.

This is the essence of my worry: that it would force a larger rewrite to play nice with ByteArray#.

ekmett · 2016-08-10T05:51:04Z

The real reason was that massive amounts of code duplication would be required. I'm open to switching everything from ByteString to Text, but that turns out to be a bit heavy as well. At the time, Text didn't support the codepoint-counted cut operations we needed to avoid massive asymptotic slowdowns. I managed to get Bryan to add them, but the few months between asking and receiving robbed the rebuild of any steam.
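As a rough illustration of why the kind of cut matters, the sketch below contrasts cutting by Char count with cutting by byte offset. The thread does not say which operations were added to Text, so this uses only the long-standing Data.Text.splitAt and Data.ByteString.splitAt; cutChars and cutBytes are hypothetical names chosen for the example.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

-- Cut after n Chars. Data.Text.splitAt counts code points and is
-- documented as O(n), so it has to walk the prefix.
cutChars :: Int -> T.Text -> (T.Text, T.Text)
cutChars = T.splitAt

-- Cut after n bytes of the UTF-8 encoding. The split itself is O(1),
-- but n is a byte offset, so it can land inside a multi-byte character.
-- Note that encodeUtf8 builds a fresh ByteString first.
cutBytes :: Int -> T.Text -> (B.ByteString, B.ByteString)
cutBytes n = B.splitAt n . encodeUtf8
```

For the five-character text "naïve", cutChars 3 splits cleanly after the "ï", while cutBytes 3 splits between the two bytes that encode it.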
