Browse files

Merge Full Unicode Collation Algorithm Implementation

This is a full implementation of the Unicode Collation Algorithm. We iterate
by codepoint and put this into a ring buffer. The ring buffers hold the exact
number of codepoints which comprise the longest sequence of codepoints which
map to its own collation keys in the Unicode Collation Algorithm. As of Unicode
9.0 this number was 3. When Generate-Collation-Data.p6 is run, this number will
be updated in case future versions contain a higher number. Once the codepoint
iteration into the ringbuffer finds a non-matching codepoint, whether the two
codepoints are less or more than each other is saved in a variable for later in
case we end up needing to break a tie by codepoint.

We only pull as many keys as needed, so usually we only need to use what is in
the ring buffer. Then we compare by primary levels, all the keys pushed so far.
If all the primaries match then we iterate more codepoints and push the
collation values onto a stack. This stack is malloced and can expand as needed,
but this should practically never be the case.

Using the stack how we do is a modified verrsion of the UCA which lets us not
have to push all primaries from start to end of the string onto a one
dimensional array, and then after that push all the secondary, then all the

Codepoints which have single collation array elements get the data from the MVM
UCD database created with Any codepoints which have more than one
collation array element or if it is a sequence of codepoints, that uses the data
in src/strings/unicode_uca.c

Tangut, Ideographs, Nushu and Unassigned codepoints have collation values which
are generated algorithmically based on their codepoint. Hangul characters
decompose before they are pushed onto the array.

Like before, we support the ability to configure collation so you can reverse or
disable levels as you wish. The trick to this is knowing that for all collation
values: tertiary < secondary < primary

We use `level_eval_settings` to store the settings for each level, which we set
up based on the bitmask of the collation_mode argument to the fuction. If the
two levels are the same we are able to compare them based on the setting. If the
levels are not equal, we do not need to do this, since tertiary < secondary <
primary for all values.

Some info on our collation values. They are all 1 higher than those listed for
DUCET (Default Unicode Collation Element Table). The reason for this is that a 0
counts as 0 while a 1 is skipped and ignorable. This corresponds to things
listed as 0 in DUCET, which our implementation gives a value of 1. We only use 0
for the tertiary value of the level separator to ensure that longer strings win
(though we also have a fallback to ensure this happens in certain cases which
this isn't enough).
  • Loading branch information...
samcv committed Sep 10, 2017
1 parent 1059eed commit 866623d933246563a4ccc7d4001a3e5284cfcd03
@@ -559,9 +559,9 @@ src/main@obj@: src/main.c
$(MSG) compiling $@
$(CMD)$(CC) @ccswitch@ $(CFLAGS) @mainflags@ $(CINCLUDES) @ccout@$@ $*.c
src/strings/unicode.c: src/strings/unicode_db.c src/strings/unicode_ops.c
src/strings/unicode.c: src/strings/unicode_db.c src/strings/unicode_uca.c src/strings/unicode_ops.c
$(MSG) generating $@
$(CMD)$(CAT) src/strings/unicode_db.c src/strings/unicode_ops.c > $@ $(NOERR)
$(CMD) $(CAT) src/strings/unicode_db.c src/strings/unicode_uca.c src/strings/unicode_ops.c > $@ $(NOERR)
$(MINILUA): 3rdparty/dynasm/minilua.c
$(CC) $(CFLAGS) $(LDFLAGS) 3rdparty/dynasm/minilua.c -o $@ $(LDLIBS)

Large diffs are not rendered by default.

Oops, something went wrong.

Large diffs are not rendered by default.

Oops, something went wrong.
@@ -0,0 +1,315 @@
#!/usr/bin/env perl6
use lib <lib tools/lib>;
use Collation-Gram;
use ArrayCompose;
my $my_debug = False;
# Set this to only generate a partial run, for testing purposes
my Int $less-than;
my $out-file = "src/strings/unicode_uca.c";
class p6node {
has Int $.cp;
has @!collation_elements;
has $!last;
has is rw;
method next-cps { %!*.Int).sort }
method has-collation { @!collation_elements.Bool }
method get-collation { @!collation_elements }
method set-collation (Positional:D $list) {
@!collation_elements = |$list;
method set-cp (Int:D $cp) { $!cp = $cp }
sub p6node-find-node (Int:D $cp, p6node $p6node is rw --> p6node) is rw {
die unless ${$cp}.VAR.^name eq 'Scalar';
die "can't find the node for $cp " unless ${$cp}.isa(p6node);
return-rw ${$cp} orelse die "Can't find node";
sub p6node-create-or-find-node (Int:D $cp, p6node:D $p6node is rw) is rw {
my $hash := $;
#say "p6node-create-or-find-node called for cp $cp";
if $hash{$cp}:exists {
return-rw ${$cp};
else {
my $obj = => $cp, last => $hash);
$hash{$cp} = $obj;
return-rw $hash{$cp};
sub print-var ($var) { $var.gist }
my Str $Unicode-Version;
my @implicit-weights;
my $max-cp = 0;
sub int-bitwidth (Int:D $int) {
$int.base(2).chars + 1;
sub uint-bitwidth (Int:D $int) {
my Int:D $codepoint_sequence_no_max = 0;
sub parse-test-data (p6node:D $main-p6node) {
my $data = "UNIDATA/UCA/allkeys.txt".IO;
my $line-no;
for $data.lines -> $line {
last if $less-than and 10_000 < $less-than;
#say $line-no;
next if $line eq '' or $line.starts-with('#');
if $line.starts-with('@version') {
$Unicode-Version = $line.subst('@version ', '');
if $line.starts-with('@implicitweights') {
@implicit-weights.push: $line.subst('@implicitweights ', '');
my $var =$line, :actions(;
die $line unless $var;
# skip them if it's not a sequence (only one codepoint), AND there
# is only one collation element. These are picked up into the main MVM
# UCD database
next if $var<codepoints>.elems == 1 && $var<array>.elems == 1;
my $node = $main-p6node;
say $line, "\n", $var<codepoints> if $my_debug;
$codepoint_sequence_no_max = $var<codepoints>.elems
if $codepoint_sequence_no_max < $var<codepoints>.elems;
for $var<codepoints>.list -> $cp {
$max-cp = $cp if $max-cp < $cp;
$node = p6node-create-or-find-node($cp, $node);
say "Done with parse-test-data";
class sub_node {
has Int $.codepoint;
has Int $.sub_node_elems is rw ;
has Int $.sub_node_link is rw;
has Int $.collation_key_elems is rw = 0;
has Int $.collation_key_link is rw = 0;
has Int $.element is rw;
method build {
# To save space, set to zero if it's -1 (-1 means there's no link)
# we can determine there is no link by checking the collation_key_elems
# or sub_node_elems
# so we don't need to set these to -1
($!sub_node_link == -1 ?? 0 !! $!sub_node_link),
($!collation_key_link == -1 ?? 0 !! $!collation_key_link),
method Str {
"\{{$.codepoint.fmt("0x%X")}, $!sub_node_elems, $!sub_node_link, $!collation_key_elems, $!collation_key_link\}"
#| Adds the initial codepoint nodes to @main-node
sub add-main-node-to-c-data (p6node:D $p6node is rw, @main-node) is rw {
for $*.Int).sort -> $cp {
my $thing := => $cp, element => @main-node.elems);
@main-node.push: $thing;
#say Dump @main-node;
#| Follows the codepoints already in @main-node and adds sub_nodes based on that
sub sub_node-flesh-out-tree-from-main-node-elems
(p6node:D $main-p6node is rw, @main-node, @collation-elements) {
for ^@main-node -> $i {
#say "Processing $sub_node.codepoint()";
p6node-find-node(@main-node[$i].codepoint, $main-p6node),
@main-node, @collation-elements);
sub sub_node-add-to-c-data-from-sub_node
(sub_node:D $sub_node is rw, p6node:D $p6node is rw, @main-node, @collation-elements --> sub_node:D) is rw {
die unless $sub_node.codepoint == $p6node.cp;
if $p6node.has-collation {
my $temp := sub_node-add-collation-elems-from-p6node($sub_node, $p6node, @collation-elements);
die "\$temp !=== \$sub_node" unless $temp === $sub_node;
#if !$sub_node.sub_node_elems {
$sub_node.sub_node_elems = $;
#die "\$sub_node.sub_node_elems !== \$" unless $sub_node.sub_node_elems == $;
my Int ($last-link, $first-link) = -1 xx 2;
for $ -> $cp {
$last-link = sub_node-add-sub_node($cp, @main-node);
sub_node-add-to-c-data-from-sub_node(@main-node[$last-link], p6node-find-node($cp, $p6node), @main-node, @collation-elements);
$first-link = $last-link if $first-link == -1;
$sub_node.sub_node_link = $first-link;
#say Dump $sub_node;
sub sub_node-add-sub_node (Int:D $cp, @main-node --> Int:D) {
my $node := => $cp, element => @main-node.elems);
die "!\$node.element.defined || !\$node.codepoint.defined"
unless $node.element.defined && $node.codepoint.defined;
@main-node.push: $node;
return @main-node.elems - 1;
my Int:D $max-collation-elems = 0;
my Int:D $max-primary = 0;
my Int:D $max-secondary = 0;
my Int:D $max-tertiary = 0;
my Int:D $max-special = 0;
sub sub_node-add-collation-elems-from-p6node (sub_node:D $sub_node is rw, p6node:D $p6node is rw, @collation-elements --> sub_node:D) is rw {
die "!\$p6node.has-collation" unless $p6node.has-collation;
my Int:D $before-elems = @collation-elements.elems;
for $p6node.get-collation <-> $element {
$max-primary = $element[0] if $max-primary < $element[0];
$max-secondary = $element[1] if $max-secondary < $element[1];
$max-tertiary = $element[2] if $max-tertiary < $element[2];
$max-special = $element[3] if $max-special < $element[3];
@collation-elements.push: $element;
$max-collation-elems = $p6node.get-collation.elems if $max-collation-elems < $p6node.get-collation.elems;
my Int:D $after-elems = @collation-elements.elems;
$sub_node.collation_key_link = $before-elems;
$sub_node.collation_key_elems = $after-elems - $before-elems;
my @main-node;
my @collation-elements;
my $main-p6node =;
sub debug-out-nodes {
use JSON::Fast;
spurt 'out_nodes', to-json(*.build));
sub process-block (Str:D $text) {
if $text ~~ / ^ \s* $<fullnam>=( $<start>=(<:AHex>+) ['..' $<end>=(<:AHex>+)]? \s* ';' \s* $<name>=(.*) ) \s* $ / {
my Int:D $start = $<fullnam><start>.Str.parse-base(16);
my Int:D $end = $<fullnam><end> ?? $<fullnam><end>.Str.parse-base(16) !! $start;
my Str:D $name = $<fullnam><name>.Str;
my Str:D $fullname = $<fullnam>.Str;
return $start, $end, $name, $fullname;
else {
sub get-block-data (Str:D $file, @looking, $funcname) {
die unless "UNIDATA/$file".IO.f;
my $myfile = slurp "UNIDATA/$file";
my @out = "/* Data from $file */", "MVM_STATIC_INLINE MVMuint32 $funcname " ~ '(MVMCodepoint cp) {';
@out.push: 'return'.indent: 4;
my Int:D $num = 0;
for $myfile.lines {
next if /^ \s* '#' /;
next if /^ \s* $/;
if ($file eq 'PropList.txt') {
my @split = .split(/[\s+|';']/, :skip-empty);
#say @split.perl;
next unless @split[1].trim eq @looking.any;
if ($file eq 'Blocks.txt') {
my $found = False;
for @looking -> $looking {
$found = True if m/^\s* <:AHex>+ '..' <:AHex>+ \s* ';' \s* $looking \s* $/;
next unless $found;
#100000..10FFFF; Supplementary Private Use Area-B
#(0x3400 <= cp && cp <= 0x4DB5) /* 3400..4DB5 d*/
my ($start, $end, $name, $fullname) = process-block $_;
my Str:D $or = $num++ ?? '||' !! ' ';
#say $num;
my Str:D $conditional = $start == $end
?? "0x%-22X == cp".sprintf($start)
!! "0x%-5X <= cp && cp <= 0x%-5X".sprintf: $start, $end;
@out.push: "%s (%s) /* %4X..%-4X %-34s */".sprintf($or, $conditional, $start, $end, $name).indent: 4;
#say "start $start end $end name: “$name”";
@out.push: ';'.indent: 4;
@out.push: '}';
@out.join("\n") ~ "\n";
my $main-node-elems = add-main-node-to-c-data($main-p6node, @main-node);
sub_node-flesh-out-tree-from-main-node-elems($main-p6node, @main-node, @collation-elements);
say now - INIT now;
sub format-collation-Str ($a) {
my Str $out;
for $a -> $item is copy {
my $thing = $item.pop;
my Str:D $thing-str = $thing ?? '*' !! '.';
$out ~= "[%s%.4X.%.4X.%.4X]".sprintf($thing-str, |$item);
my @composed-arrays = "/* This file generated from tools/Generate-Collation-Data.p6 */";
sub make-struct (@names, @types, @collation-list-for-packing, $struct-name) {
use lib 'lib';
use BitfieldPacking; use bitfield-rows-switch;
my @order = compute-packing(@collation-list-for-packing);
my @out-str = "struct $struct-name \{";
for @order -> $pair {
@out-str.push: ([~] @types[$pair.key], " ", @names[$pair.key], " :", $pair.value, ";").indent(4);
@out-str.push: '};';
@out-str.join("\n"), @order;
my @collation-list-for-packing =
0 => uint-bitwidth($max-primary),
1 => uint-bitwidth($max-secondary),
2 => uint-bitwidth($max-tertiary),
3 => uint-bitwidth($max-special);
my @collation_key_names =
'primary', 'secondary', 'tertiary', 'special';
my ($collation_key_struct, $collation_key_order) = make-struct(
("MVMuint32" xx 4),
@composed-arrays.push: $collation_key_struct;
my @names2 = <codepoint sub_node_elems sub_node_link
collation_key_elems collation_key_link>;
my @sub_node-list-for-packing2 =
0 => uint-bitwidth($max-cp),
1 => uint-bitwidth(@main-node.elems - 1),
2 => uint-bitwidth(@main-node.elems - 1),
3 => uint-bitwidth($max-collation-elems),
4 => uint-bitwidth(@collation-elements.elems - 1);
my ($sub_node_struct, $order2) = make-struct(
('MVMuint32' xx 5),
@composed-arrays.push: $sub_node_struct;
@composed-arrays.push: "typedef struct sub_node sub_node;";
sub transform-array (@array, @order) {> $item {
my @out;
for ^$item.elems -> $i {
@out[$i] = $item[@order[$i].key];
@composed-arrays.push: "#define main_nodes_elems @main-node.elems()";
@composed-arrays.push: "#define starter_main_nodes_elems $main-node-elems";
@composed-arrays.push: "#define codepoint_sequence_no_max $codepoint_sequence_no_max";
@composed-arrays.push: "#define special_collation_keys_elems @collation-elements.elems()";
@composed-arrays.push: get-block-data("PropList.txt", ("Unified_Ideograph",), "is_unified_ideograph");
@composed-arrays.push: get-block-data("Blocks.txt", ("Nushu",), "is_Assigned_Block_Nushu");
@composed-arrays.push: get-block-data("Blocks.txt", ("Tangut","Tangut Components"), "is_Block_Tangut");
@composed-arrays.push: get-block-data("Blocks.txt", ("CJK Unified Ideographs","CJK Compatibility Ideographs"), "is_Block_CJK_Unified_Ideographs_OR_CJK_Compatibility_Ideographs");
@composed-arrays.push: compose-array('sub_node', 'main_nodes', transform-array(@main-node».build, $order2));
@composed-arrays.push: compose-array( 'struct collation_key', 'special_collation_keys', transform-array(@collation-elements, $collation_key_order));
spurt $out-file, @composed-arrays.join("\n");
print qq:to/END/;
Done writing $out-file.
{'=' x 70}
MAKE SURE TO RUN `tools/CollationTest.t` to ensure there are ~74 failures only!
Oops, something went wrong.

0 comments on commit 866623d

Please sign in to comment.