Incremental normalization? #60

Open
jgm opened this issue Apr 18, 2021 · 3 comments

Comments

@jgm

jgm commented Apr 18, 2021

According to its documentation, text-icu's collation algorithm uses incremental normalization. This is very helpful in collation: when you're comparing two strings, the decision about how to order them is generally one you can make after the first few characters, so no need to normalize the whole thing.

Could unicode-transforms provide a function that does this? For my purposes, an ideal interface would be

normalizeStreaming :: NormalizationMode -> Text -> [Int]

where the Ints are code points, and the list is produced lazily.
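A minimal sketch of what such a lazy interface could look like, using only `base`. Everything here is an assumption for illustration: the name `normalizeStreamingNFD` is hypothetical, it takes a `String` rather than `Text`, it handles NFD only (no `NormalizationMode` argument), and `decompose` and `combiningClass` are toy tables covering a handful of characters, not the real Unicode data that unicode-transforms ships. The point is only that output can be emitted one combining sequence at a time, so a comparison can stop at the first difference.

```haskell
import Data.Char (ord)
import Data.List (sortOn)

-- Toy canonical decompositions (assumption: NOT the full Unicode table).
decompose :: Char -> String
decompose '\x00E9' = "e\x0301"        -- e with acute -> e + COMBINING ACUTE
decompose '\x00E0' = "a\x0300"        -- a with grave -> a + COMBINING GRAVE
decompose '\x1E69' = "s\x0323\x0307"  -- s + DOT BELOW + DOT ABOVE
decompose c        = [c]

-- Toy canonical combining classes (assumption: only the marks used above).
combiningClass :: Char -> Int
combiningClass '\x0300' = 230
combiningClass '\x0301' = 230
combiningClass '\x0307' = 230
combiningClass '\x0323' = 220
combiningClass _        = 0

-- Hypothetical lazy NFD: decompose each character, then put every run of
-- combining marks into canonical order. Starters are emitted immediately;
-- a run of marks is emitted once the next starter (or end of input) is seen.
normalizeStreamingNFD :: String -> [Int]
normalizeStreamingNFD = map ord . reorder . concatMap decompose
  where
    reorder [] = []
    reorder (c : cs)
      | combiningClass c == 0 = c : reorder cs
      | otherwise =
          let (marks, rest) = span ((/= 0) . combiningClass) (c : cs)
          in sortOn combiningClass marks ++ reorder rest  -- sortOn is stable
```

Because the result is an ordinary lazy list, `compare (normalizeStreamingNFD ('a' : repeat 'z')) (normalizeStreamingNFD ('b' : repeat 'z'))` returns `LT` after inspecting only the first code point of each infinite input, and `normalizeStreamingNFD "\x1E69"` and `normalizeStreamingNFD "s\x0307\x0323"` both yield `[115, 803, 775]`.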

@harendra-kumar
Member

The best way to deal with this would be to use stream-based normalization. Streamly is going to support that with a signature like this:

normalize :: (IsStream t, Monad m) => NormalizationMode -> t m Char -> t m Char

See PR composewell/streamly#698 for a working implementation of the above. We are also going to break streamly into several packages so that it can be a lightweight dependency, and to add a streamly-unicode package for stream-based Unicode algorithms (see composewell/streamly#533).

@harendra-kumar
Member

To work with Text, we can convert it to a stream, normalize the stream, and convert the result back to Text. In fact, this approach works with any streamable type, not just Text.
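The convert, normalize, convert-back pattern can be sketched without Streamly at all. In the sketch below a plain lazy list stands in for the stream type, `toyNormalize` is a hypothetical stand-in for a real stream normalizer (it decomposes only U+00E9), and `viaStream` is an illustrative helper name, not part of any library.

```haskell
import Data.Char (chr, ord)

-- Stand-in for a character stream (assumption: a real implementation
-- would use Streamly's stream type here instead of a list).
type CharStream = [Char]

-- Hypothetical stream normalizer: decomposes U+00E9 only, passes the rest.
toyNormalize :: CharStream -> CharStream
toyNormalize = concatMap step
  where
    step '\x00E9' = "e\x0301"
    step c        = [c]

-- Run a stream transformation over any container, given conversions
-- to and from the stream representation.
viaStream :: (s -> CharStream) -> (CharStream -> s)
          -> (CharStream -> CharStream) -> s -> s
viaStream toStream fromStream f = fromStream . f . toStream

-- The same normalizer reused for two different "streamable" types.
normalizeString :: String -> String
normalizeString = viaStream id id toyNormalize

normalizeCodePoints :: [Int] -> [Int]
normalizeCodePoints = viaStream (map chr) (map ord) toyNormalize
```

The normalizer is written once against the stream type; supporting another container only requires a pair of conversion functions.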

@jgm
Author

jgm commented Apr 20, 2021

That sounds very good!
