GSOC Unicode support
Let's make complex Unicode stuff a piece of cake in the D programming language.
Project status aka TODO list
[DONE] Data structures for Unicode
[DONE] Codepoint set via inversion list
Ended up as two data structures: the more compact RleBitSet and the generally faster InversionList.
[DONE] Flexible n-level bit-trie
[NEED NO WORK] Per-encoding trie generation (UTF-8, UTF-16, UTF-32)
Not needed: an even better approach is to read the whole UTF sequence in one word, see http://firstname.lastname@example.org
[DONE] Universal trie data structure (at least integers, strings and arrays of structs)
Though it could be extended in many ways
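As a rough illustration of the inversion-list idea behind the codepoint set above, here is a minimal sketch. `ToyInversionList` and its methods are hypothetical names made up for this example; this is not the project's InversionList or RleBitSet, only the general technique: store sorted interval boundaries and decide membership by the parity of a binary-search position.

```d
/// Toy inversion list (hypothetical, for illustration only).
/// `bounds` holds sorted boundaries [a0, b0, a1, b1, ...], where each
/// pair denotes the half-open interval [a, b) of code points in the set.
struct ToyInversionList
{
    uint[] bounds;

    /// Append the half-open interval [lo, hi). Sketch-level restriction:
    /// intervals must be added in increasing order without overlap.
    void addInterval(uint lo, uint hi)
    {
        assert(lo < hi);
        assert(bounds.length == 0 || bounds[$ - 1] <= lo);
        if (bounds.length && bounds[$ - 1] == lo)
            bounds[$ - 1] = hi;   // merge with the adjacent interval
        else
            bounds ~= [lo, hi];
    }

    /// Binary search for the number of boundaries <= cp.
    /// An odd count means cp lies inside some interval.
    bool contains(uint cp) const
    {
        size_t lo = 0, hi = bounds.length;
        while (lo < hi)
        {
            immutable mid = lo + (hi - lo) / 2;
            if (bounds[mid] <= cp)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo % 2 == 1;
    }
}
```

A set built with `addInterval('a', 'z' + 1)` then answers `contains('q')` with a single binary search over two boundaries, which is why this representation stays compact for the large contiguous ranges typical of Unicode properties.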
[DONE] Correct NFC normalization (UTF-8, UTF-16, UTF-32)
Unexpectedly got blocked, but coming soon (i.e. formally out of GSOC scope). In essence, NFC/NFKC are slightly harder than NFD/NFKD respectively, and NFC is the most widely used form in text interchange.
[DONE] Version for NFKD
[DONE] Optimized all of normalization forms.
Normalization takes into account the Quick Check property and other hacks, and high-speed Trie structures are used throughout. So it should already have good baseline performance that may be tweaked in the future.
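To make the normalization forms above concrete, here is a small sketch. It assumes the `normalize`, `NFC` and `NFD` names as they exist in present-day Phobos `std.uni`; the API of the GSOC branch at the time of this page may have differed. `canonicallyEqual` is a hypothetical helper written for this example.

```d
import std.uni : normalize, NFC, NFD;

/// Hypothetical helper: two strings are treated as canonically
/// equivalent if their NFC forms compare equal.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

/// Hypothetical helper: decompose a string into NFD.
string decomposed(string s)
{
    return normalize!NFD(s);
}
```

For instance, `canonicallyEqual("e\u0301", "\u00E9")` should hold, because NFC composes "e" followed by the combining acute accent into the single code point U+00E9.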
[IN PROGRESS] Case conversions and case-agnostic operations
[DONE] Simple casefolding comparator (sicmp)
[DONE] Full casefolding comparator (icmp)
Indeed does more work than sicmp in general.
[TODO] Fixed toUpperCase, toLowerCase etc.
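The difference between the two comparators above can be sketched as follows. This assumes `sicmp` and `icmp` with the signatures found in present-day Phobos `std.uni`, which may not match the branch described here exactly; the two wrapper functions are hypothetical names for this example.

```d
import std.uni : icmp, sicmp;

/// Hypothetical wrapper: equality under simple case folding,
/// which only uses 1:1 code point mappings.
bool equalSimpleFold(string a, string b)
{
    return sicmp(a, b) == 0;
}

/// Hypothetical wrapper: equality under full case folding,
/// which also handles 1:N mappings such as "ß" folding to "ss".
bool equalFullFold(string a, string b)
{
    return icmp(a, b) == 0;
}
```

Under these assumptions, "Rußland" and "Russland" compare equal with full case folding but not with simple case folding, which is the extra work the full comparator has to do.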
[DONE] User perceived Character (Graphemes)
[DONE] Grapheme cluster data-type (small-string optimized array)
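A short sketch of why user-perceived characters differ from code points. It assumes `byGrapheme` as provided by present-day Phobos `std.uni`, which may postdate or differ from the API at the time of this page; the two counting helpers are hypothetical.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: number of code points in a UTF-8 string
/// (walkLength on a string decodes it by code point).
size_t codePointCount(string s)
{
    return s.walkLength;
}

/// Hypothetical helper: number of grapheme clusters, i.e. what a
/// user would perceive as characters.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}
```

With "noe\u0308l" (noël written as "e" plus a combining diaeresis), the code point count is 5 while the grapheme count is 4.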
[IN PROGRESS] Miscellaneous
[DONE] Update isXXX functions in std.uni
[DONE] An automation script to update to the fresh version of Unicode character database.
[OUT OF SCOPE] Legacy encodings support (std.encoding?), a bunch of which are commonly found in modern web browsers.
See also the current slice of documentation: http://blackwhale.github.com/phobos/uni.html#unicode