GSOC Unicode support

blackwhale edited this page Jan 11, 2013 · 18 revisions

Let's make complex Unicode stuff a piece of cake in the D programming language.

Project status aka TODO list

[DONE] Data structures for Unicode

[DONE] Codepoint set via inversion list

  Ended up as two data structures: the more compact RleBitSet and the generally faster InversionList.
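  To make the inversion-list idea concrete, here is a minimal sketch in D. It is not the project's implementation: `TinyInversionList` and `contains` are hypothetical names chosen for illustration. The set is stored as a sorted array of interval boundaries, and membership is decided by the parity of the number of boundaries at or below the code point.

```d
// Hypothetical illustration only; not the project's RleBitSet or InversionList.
struct TinyInversionList
{
    // Sorted boundaries [a0, b0, a1, b1, ...] encode the union of [a0, b0), [a1, b1), ...
    uint[] bounds;

    bool contains(dchar ch) const pure nothrow @safe
    {
        // Binary search for the number of boundaries that are <= ch.
        size_t lo = 0, hi = bounds.length;
        while (lo < hi)
        {
            immutable mid = lo + (hi - lo) / 2;
            if (bounds[mid] <= ch)
                lo = mid + 1;
            else
                hi = mid;
        }
        // An odd count means ch lies inside one of the intervals.
        return (lo & 1) == 1;
    }
}

void main()
{
    // ASCII letters A..Z and a..z, stored as just four numbers.
    auto letters = TinyInversionList([0x41, 0x5B, 0x61, 0x7B]);
    assert(letters.contains('A') && letters.contains('z'));
    assert(!letters.contains('[') && !letters.contains('0'));
}
```

  The appeal of this layout is that a set covering thousands of code points costs only two numbers per contiguous range, while lookup stays logarithmic in the number of ranges.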

[DONE] Flexible n-level bit-trie

[NEED NO WORK] Per-encoding trie generation (UTF-8, UTF-16, UTF-32)

  Even better, via reading the whole UTF sequence in one word, see$2bol$

[DONE] Universal trie data structure (at least integers, strings and arrays of structs)

  Though it could be extended in many ways
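  As a rough illustration of what a multi-level bit-trie buys, the sketch below builds a two-level lookup table for a code point predicate. It is a simplified stand-in, not the project's n-level trie: `TwoLevelBitTrie`, `build`, and the 8-bit page size are all assumptions made for the example. The high bits of a code point select a page, the low bits select a flag inside it, and identical pages are stored only once.

```d
// Hypothetical two-level trie; the real project trie is n-level and more general.
struct TwoLevelBitTrie
{
    enum pageBits = 8;
    enum pageSize = 1 << pageBits;

    ushort[] index;          // one entry per block of 256 code points
    bool[pageSize][] pages;  // deduplicated pages of membership flags

    static TwoLevelBitTrie build(bool function(dchar) pred)
    {
        TwoLevelBitTrie t;
        enum blocks = 0x110000 >> pageBits;
        t.index = new ushort[blocks];
        foreach (b; 0 .. blocks)
        {
            // Evaluate the predicate for every code point in this block.
            bool[pageSize] page;
            foreach (i; 0 .. pageSize)
                page[i] = pred(cast(dchar)((b << pageBits) | i));

            // Reuse an identical page if one was already stored.
            size_t found = t.pages.length;
            foreach (j, ref p; t.pages)
                if (p == page) { found = j; break; }
            if (found == t.pages.length)
                t.pages ~= page;
            t.index[b] = cast(ushort) found;
        }
        return t;
    }

    // Lookup is two array reads; ch must not exceed 0x10FFFF.
    bool opIndex(dchar ch) const
    {
        return pages[index[ch >> pageBits]][ch & (pageSize - 1)];
    }
}

void main()
{
    auto lower = TwoLevelBitTrie.build((dchar c) => c >= 'a' && c <= 'z');
    assert(lower['q'] && !lower['Q']);
    // One page for the block holding a..z, one shared all-false page for the rest.
    assert(lower.pages.length == 2);
}
```

  Because most blocks of the code space share the same contents for any given property, the deduplicated pages stay small while lookup cost is constant.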

[DONE] Normalization

[DONE] Correct NFC normalization (UTF-8, UTF-16, UTF-32)

Unexpectedly got blocked but coming soon (i.e. formally out of GSOC scope).
In essence, NFC and NFKC are slightly harder than NFD and NFKD respectively, and NFC is the most widely used form in text interchange.

[DONE] Version for NFKD

[DONE] Optimized all of normalization forms.

Normalization takes into account the Quick Check property and other hacks, and high-speed Trie structures are used throughout. So it should already have good baseline performance that may be tweaked in the future.
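For readers unfamiliar with the forms, a small usage sketch follows. It assumes the API as it appears in Phobos's `std.uni` (`normalize` with the `NFC` and `NFD` constants); the names in the GSoC branch at the time of this page may have differed.

```d
import std.uni;

void main()
{
    string composed = "\u00E9";     // e with acute as a single code point
    string decomposed = "e\u0301";  // e followed by a combining acute accent

    // The two strings look identical on screen but differ code point by code point.
    assert(composed != decomposed);

    // NFC composes, NFD decomposes.
    assert(normalize!NFC(decomposed) == composed);
    assert(normalize!NFD(composed) == decomposed);

    // Text that is already normalized comes back unchanged.
    assert(normalize!NFC("plain ASCII") == "plain ASCII");
}
```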



[IN PROGRESS] Case conversions and case-agnostic operations

[DONE] Simple casefolding comparator (sicmp)

[DONE] Full casefolding comparator (icmp)

Indeed does more work than sicmp in general.
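The difference between the two comparators shows up when one character folds to several. The sketch below assumes the `sicmp` and `icmp` functions as they appear in Phobos's `std.uni`; the branch described on this page may have used slightly different signatures.

```d
import std.uni;

void main()
{
    // Simple case folding handles one-to-one mappings in any script.
    assert(sicmp("Август", "авгусТ") == 0);

    // Full case folding also handles one-to-many mappings such as ß -> ss.
    assert(icmp("Rußland", "Russland") == 0);

    // Simple folding has no one-to-one mapping for ß, so the strings differ.
    assert(sicmp("Rußland", "Russland") != 0);
}
```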

[TODO] Fixed toUpperCase, toLowerCase etc.

[DONE] User perceived Character (Graphemes)

[DONE] Grapheme cluster data-type (small-string optimized array)
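A short usage sketch of grapheme handling follows. It assumes the names found in Phobos's `std.uni` (`Grapheme`, `decodeGrapheme`, `graphemeStride`, `byGrapheme`), some of which may postdate the state of the project described here.

```d
import std.range : walkLength;
import std.uni;

void main()
{
    // "A" followed by a combining ring above, then "rhus".
    string city = "A\u030Arhus";

    assert(city.walkLength == 6);             // six code points
    assert(city.byGrapheme.walkLength == 5);  // five user-perceived characters

    // The first grapheme spans 3 UTF-8 code units: 1 for A, 2 for U+030A.
    assert(graphemeStride(city, 0) == 3);

    // decodeGrapheme consumes one cluster from the front of its argument.
    auto rest = city;
    Grapheme first = decodeGrapheme(rest);
    assert(first.length == 2);  // two code points in the cluster
    assert(rest == "rhus");
}
```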

[IN PROGRESS] Miscellaneous

[DONE] Update isXXX functions in std.uni

[DONE] An automation script to update to a fresh version of the Unicode Character Database.

[OUT OF SCOPE] Legacy encodings support (std.encoding?), a bunch of which are commonly found in modern web browsers.

See also the current slice of documentation: