Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimizing tables for (de)composition #67

Closed
benibela opened this issue May 25, 2016 · 4 comments
Closed

Minimizing tables for (de)composition #67

benibela opened this issue May 25, 2016 · 4 comments

Comments

@benibela
Copy link
Contributor

So I was using a Pascal port of utf8proc to convert utf8 strings between the composed and decomposed form.

Then I noticed the Unicode tables are 600kb large (also due to the port not doing bitpacking), which is not good.

So I dropped every field from the struct that was not used for compositing/decomposition, who needs title case or folding anyways?
Then the tables were only slightly above 300kb large. Much better.

But looking closer at the tables, half the sequence array consists of -1. Those values are not needed for anything. So I removed all the -1, moving the element length to an unused byte in the property array, saving 20kb and bringing the tables to under 300kb.

Then the combination array. It is even worse. Almost all of its entries are -1. Someone lost their -1 collection.
A sparse [1st index][2nd index] array, all 2nd arrays joined together, indexed as [1st index*size + 2nd index]. You could cut the -1 prefix/suffix of each 2nd array, making them variable sized and index it as [startindex[1st index] + 2nd index]. But since I care more about space than performance, I replaced the 2nd array with a list of (2nd index, value) pairs. Now you need a linear search starting at startindex[1st index] to find the 2nd index and then get the value, but it is much smaller.
In fact it reduces this table size to 8kb, saving over 90kb.

Now the property struct contains a 1st and 2nd comb index. Yet there are no characters that have both. I replaced it by a single field comb index, saving 5 kb. Pointless

The stage2 array has a curious property. It is full of almost alternating 0 and 2 elements.
Why? 0 is the default property struct used for missing (?) characters, 2 is the property struct for characters without special properties. Both structs are the same.
Dropping struct 2, turned these stage2 blocks into blocks of zeros, which were then automatically merged by the data generator, saving 20kb.

Back to the sequence array. It is a list of 32-bit numbers, but those numbers are small, wasting a lot of zero bits.
It is like an UTF-32 string. I changed it to UTF-16, which saves 10kb.

The decomposition function does not actually use the decomp_type, only checks if it is zero or not.
I do not need that type. Replacing it with a bool, automatically drops 800 property structs, as they become duplicates.

Now, what exactly is the property array? After removing all that stuff, it is actually a map of codepoint -> decompositon index.
Some structs have a composition index, but only like 10% of them. No point in storing that index for the properties that do not have it. So I replaced the property array with a DWORD-array, bit encoding the properties. Fields with comp index take 2 entries, most only need 1 entry.
This saves 20kb.

This brings us to the final table size: 90 kb.

How did it affect performance? Not much, 10% faster when decomposing/composing each character once.

Want to have a patch?

@stevengj
Copy link
Member

stevengj commented May 25, 2016

A patch compressing the tables using the techniques above without losing functionality would be great. But we don't want to lose titlecase and case-folding.

@benibela
Copy link
Contributor Author

Seems with the full table it does not work half as well. Nothing collapse to clear an entire block.

But the big question is how much the property struct is allowed to change? Why is that even in the public header?

Changing the sequence array saves around 35kb, but people that were using the casefold_mapping field and the sequence array directly, need to adapt. Which should not be an issue, because the sequence array is not directly accessible anyways.

However, changing lowercase_mapping, titlecase_mapping, etc. to an index in the sequence array also saves 30kb. But now everyone will need to call utf8proc_tolower instead reading lowercase_mapping, because the field won't have meaning without that secret array.

@stevengj
Copy link
Member

stevengj commented May 31, 2016

The property struct is in the header for historical reasons; in the longer run, I would prefer to make it private and expose all the functionality through functions like utf8proc_tolower. I doubt this will break much existing code.

@benibela benibela mentioned this issue May 31, 2016
@Keno
Copy link
Member

Keno commented Jul 12, 2016

Fixed by #68?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants