GSOC Unicode support
[DONE] Codepoint set via inversion list
Ended up as two data structures: the more compact RleBitSet and the generally faster InversionList.
[DONE] Flexible n-level bit-trie
[NEED NO WORK] Per-encoding trie generation (UTF-8, UTF-16, UTF-32)
Even better is reading the whole UTF sequence in one word, see
http://forum.dlang.org/post/jveaua$2bol$1@digitalmars.com
[DONE] *Universal trie data structure (at least integers, strings and arrays of structs)
Though it could be extended in many ways
[DONE] Correct NFC normalization (UTF-8, UTF-16, UTF-32)
Unexpectedly got blocked, but is coming soon (i.e. formally out of GSOC scope).
In essence, NFC and NFKC are slightly harder than NFD and NFKD respectively, and NFC is the most widely used form in text interchange.
[DONE] Version for NFKD
[DONE] Optimized all of normalization forms.
Normalization takes the Quick Check property and other hacks into account, and high-speed trie structures are used throughout. So it should already have a good baseline performance that may be tweaked in the future.
[DONE] NFD
[DONE] NFKC
[DONE] Simple casefolding comparator (sicmp)
[DONE] Full casefolding comparator (icmp)
It does indeed do more work in general than sicmp.
[TODO] Fixed toUpperCase, toLowerCase etc.
[DONE] Grapheme cluster data-type (small-string optimized array)
[DONE] Update isXXX functions in std.uni
[DONE] An automation script to update to the fresh version of Unicode character database.
[OUT OF SCOPE] Legacy encodings support (std.encoding?), a bunch of which are commonly found in modern web browsers.
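
For readers unfamiliar with the inversion list behind the codepoint set item at the top of this list, here is a minimal sketch of the idea. It is written in Python purely for illustration and is not the std.uni code. The class layout, the `inverted` method, and the half-open interval convention are assumptions made for this example.

```python
from bisect import bisect_right

MAX_CODEPOINT = 0x110000  # one past the last Unicode code point


class InversionList:
    """Set of code points stored as a sorted list of interval boundaries.

    bounds = [a0, b0, a1, b1, ...] means the set is [a0, b0) U [a1, b1) U ...
    This layout is a hypothetical illustration, not the std.uni one.
    """

    def __init__(self, *intervals):
        # Each interval is a half-open (start, end) pair; the pairs are
        # assumed to be sorted and non-overlapping.
        self.bounds = [x for pair in intervals for x in pair]

    def __contains__(self, cp):
        # Count the boundaries <= cp with a binary search.
        # An odd count means cp lies inside one of the intervals.
        return bisect_right(self.bounds, cp) % 2 == 1

    def inverted(self):
        # Complement over [0, MAX_CODEPOINT): toggle the boundary at each end.
        b = list(self.bounds)
        if b and b[0] == 0:
            b.pop(0)
        else:
            b.insert(0, 0)
        if b and b[-1] == MAX_CODEPOINT:
            b.pop()
        else:
            b.append(MAX_CODEPOINT)
        result = InversionList()
        result.bounds = b
        return result


ascii_letters = InversionList((0x41, 0x5B), (0x61, 0x7B))  # A-Z and a-z
not_letters = ascii_letters.inverted()
```

Membership is a binary search over the boundaries, and taking the complement only touches the two ends of the array, which is where the name "inversion list" comes from.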
See also the current slice of documentation: http://blackwhale.github.com/phobos/uni.html#unicode
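
To make the normalization and casefolding items above concrete, here is a small illustration using Python's standard `unicodedata` module rather than std.uni itself. It only shows what the four normalization forms and full case folding do to sample strings. The sample strings are chosen for this example, and `str.lower()` is only a rough stand-in for a one-to-one (simple) case mapping, not an equivalent of sicmp.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# NFC composes and NFD decomposes; both strings render identically.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# The compatibility forms also replace characters such as the "fi" ligature.
nfkc = unicodedata.normalize("NFKC", "\ufb01")

# Checking first lets a caller skip text that is already normalized
# (unicodedata.is_normalized requires Python 3.8 or later).
already_nfc = unicodedata.is_normalized("NFC", "plain ASCII text")
needs_work = not unicodedata.is_normalized("NFC", decomposed)

# Full case folding may change the length of a string ("ß" folds to "ss"),
# so a full comparison matches pairs that a one-to-one mapping cannot.
full_equal = "Straße".casefold() == "STRASSE".casefold()
simple_equal = "Straße".lower() == "STRASSE".lower()
```

The last two lines hint at why a full casefolding comparison has more work to do: one code point can expand into several, so the two strings cannot simply be compared position by position.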