
benchmarking/profiling #12

Open
stevengj opened this issue Jul 18, 2014 · 16 comments

@stevengj
Member

It would be good to perform some benchmarking of utf8proc against ICU, and in general to perform some profiling to see if there are any easy targets for optimization.

@jiahao
Collaborator

jiahao commented Jul 18, 2014

@stevengj
Member Author

Sounds like a decent benchmark. (Does PHP use ICU?)

@stevengj
Member Author

See also these Ruby benchmarks. The fastest code in that benchmark (although it didn't seem to look at utf8proc or ICU) was unf, which seems to be a wrapper around this unf_ext C++ code.

@stevengj
Member Author

The eprun library claims to be fast, albeit in pure Ruby.

A hopeful sign is that eprun was designed by a Unicode expert and includes benchmark data extracted from random Wikipedia pages.

stevengj added a commit that referenced this issue Jul 19, 2014
@stevengj
Member Author

I just pushed a little benchmark program based on the eprun data files. I also pushed a corresponding benchmark of ICU (although it gives a slightly unfair advantage to ICU by preallocating a huge buffer for the output, whereas utf8proc figures out the output size dynamically).
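
For reference, a minimal sketch of the utf8proc side of such a benchmark (this shows the general shape only, not the actual program I pushed; the file reading, timing, and per-pass division are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <utf8proc.h>

int main(int argc, char **argv) {
    /* slurp the whole data file into memory */
    FILE *f = fopen(argv[1], "rb");
    fseek(f, 0, SEEK_END);
    long len = ftell(f);
    fseek(f, 0, SEEK_SET);
    utf8proc_uint8_t *buf = malloc(len);
    fread(buf, 1, len, f);
    fclose(f);

    clock_t t0 = clock();
    for (int i = 0; i < 100; ++i) {  /* loop 100x, as in the results below */
        utf8proc_uint8_t *dest;
        /* NFC normalization; utf8proc_map allocates the output itself */
        utf8proc_ssize_t n = utf8proc_map(buf, len, &dest,
                                          UTF8PROC_STABLE | UTF8PROC_COMPOSE);
        if (n < 0) { fprintf(stderr, "%s\n", utf8proc_errmsg(n)); return 1; }
        free(dest);
    }
    printf("%s: %g\n", argv[1], (double)(clock() - t0) / CLOCKS_PER_SEC / 100);
    free(buf);
    return 0;
}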

Results on my machine (times in seconds):

$ cat bench.out 
Deutsch_.txt: 0.00390216
Japanese_.txt: 0.00458185
Korean_.txt: 0.00171503
Vietnamese_.txt: 0.0036274

$ cat icu.out 
Deutsch_.txt: 0.00011213
Japanese_.txt: 0.00026891
Korean_.txt: 9.649e-05
Vietnamese_.txt: 0.00151726

So, unless I am messing something up, ICU is significantly faster, albeit with a far more painful API.

Just to make sure that ICU is not "cheating" by doing some kind of caching that helps it for repeated normalization of the same string (the above benchmark loops 100x), I tried normalizing a single long string formed by concatenating the above files a few dozen times, and got 0.679s for utf8proc and 0.104s for ICU.

(Compiling with -O3 instead of -O2 makes only a slight (< 5%) difference.)

Would be nice to also benchmark against GNU libunistring and perhaps unf_ext from above.

@stevengj
Member Author

Note that eprun, which basically makes clever use of regular expressions to get decent performance in Perl or Ruby, is (unsurprisingly) significantly slower than utf8proc on my machine, though not terribly so. The times in seconds for NFKC with the Perl version of eprun are:

Deutsch_.txt: 0.0102
Japanese_.txt: 0.0127
Korean_.txt: 0.0062
Vietnamese_.txt: 0.0089

I couldn't get the Ruby eprun working on my machine, but I don't really understand Ruby.

@stevengj
Member Author

In order to figure out the correct buffer size, utf8proc_map has to perform the canonical decomposition twice (the two-pass pattern sketched at the end of this comment), so we have a factor-of-2 penalty there compared to the (somewhat artificial) way I am calling ICU. But this does not translate into a factor of 2 overall, because decomposition is only part of the process. If I hack out the doubled decomposition, the benchmark numbers improve slightly to:

Deutsch_.txt: 0.00256355
Japanese_.txt: 0.00316227
Korean_.txt: 0.00118366
Vietnamese_.txt: 0.00255491
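
The doubled decomposition comes from the two-pass pattern inside utf8proc_map, which looks roughly like this (a simplified sketch of the idea; the real code has more error handling and shrinks the buffer with realloc afterwards):

#include <stdlib.h>
#include <utf8proc.h>

static utf8proc_ssize_t map_sketch(const utf8proc_uint8_t *str,
                                   utf8proc_ssize_t strlen,
                                   utf8proc_uint8_t **dstptr,
                                   utf8proc_option_t options) {
    /* pass 1: decompose with no output buffer, just to learn the size */
    utf8proc_ssize_t result = utf8proc_decompose(str, strlen, NULL, 0, options);
    if (result < 0) return result;

    utf8proc_int32_t *buffer = malloc(result * sizeof(utf8proc_int32_t) + 1);
    if (!buffer) return UTF8PROC_ERROR_NOMEM;

    /* pass 2: the same decomposition again, this time writing the result */
    result = utf8proc_decompose(str, strlen, buffer, result, options);

    /* composition (for NFC/NFKC) and UTF-8 re-encoding happen in place */
    if (result >= 0) result = utf8proc_reencode(buffer, result, options);
    if (result < 0) { free(buffer); return result; }

    *dstptr = (utf8proc_uint8_t *)buffer;
    return result;
}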

@stevengj
Member Author

From gprof, it looks like 40% of the time is spent in utf8proc_decompose_char, 32% in utf8proc_iterate, 14% in utf8proc_decompose, and 11% in utf8proc_encode_char.

@stevengj
Member Author

GNU libunistring (benchmark added in a39c1a6), which has a similar API (it operates directly on UTF-8 data and does not require the output buffer to be preallocated), looks very comparable to utf8proc:

Deutsch_.txt: 0.00335499
Japanese_.txt: 0.00381766
Korean_.txt: 0.0018932
Vietnamese_.txt: 0.00256382
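
The corresponding libunistring call looks roughly like this (a sketch; buf and len are assumed to be the file contents as in the utf8proc benchmark, and NFC is assumed to match it):

#include <uninorm.h>   /* GNU libunistring */
#include <stdlib.h>

/* like utf8proc_map, u8_normalize allocates a correctly sized result
   itself when resultbuf is passed as NULL */
size_t outlen;
uint8_t *out = u8_normalize(UNINORM_NFC, buf, len, NULL, &outlen);
if (out) {
    /* ... out[0..outlen-1] holds the normalized UTF-8 ... */
    free(out);
}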

@stevengj
Member Author

stevengj commented Dec 8, 2014

We could make utf8proc_iterate faster if we assumed that the string was valid UTF-8, and hence could remove the checks. This is the case in Julia, because UTF-8 strings are validated when they are created.

One possibility would be to have an additional flag to assume valid input, in which case a codepath with fewer checks is called. (May be somewhat annoying to implement without a bunch of cut-and-paste, though some preprocessor hacks could be used: e.g. have a file with the decompose and iterate functions, and #include it twice with different #defines to get two versions of the functions.)
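
A minimal sketch of that #include-twice pattern, with hypothetical file and function names (my_iterate / my_iterate_unsafe) and deliberately simplified decoding and validation (the real utf8proc_iterate checks more, e.g. overlong forms and surrogates):

/* iterate_tmpl.h -- hypothetical template file, #included twice below */
#ifdef ASSUME_VALID
#  define ITERATE my_iterate_unsafe
#else
#  define ITERATE my_iterate
#endif

static utf8proc_ssize_t ITERATE(const utf8proc_uint8_t *str,
                                utf8proc_ssize_t strlen,
                                utf8proc_int32_t *dst) {
  if (strlen < 1) return UTF8PROC_ERROR_INVALIDUTF8;
  if (str[0] < 0x80) { *dst = str[0]; return 1; }  /* ASCII fast path */
  int n;                       /* number of continuation bytes */
  utf8proc_int32_t cp;
  if (str[0] < 0xE0)      { n = 1; cp = str[0] & 0x1F; }
  else if (str[0] < 0xF0) { n = 2; cp = str[0] & 0x0F; }
  else                    { n = 3; cp = str[0] & 0x07; }
#ifndef ASSUME_VALID
  /* these checks vanish entirely in the "assume valid input" variant */
  if (strlen < n + 1 || str[0] < 0xC2 || str[0] > 0xF4)
    return UTF8PROC_ERROR_INVALIDUTF8;
#endif
  for (int i = 1; i <= n; ++i) {
#ifndef ASSUME_VALID
    if ((str[i] & 0xC0) != 0x80) return UTF8PROC_ERROR_INVALIDUTF8;
#endif
    cp = (cp << 6) | (str[i] & 0x3F);
  }
  *dst = cp;
  return n + 1;
}
#undef ITERATE

/* in the .c file: */
#include "iterate_tmpl.h"   /* defines my_iterate() with validity checks */
#define ASSUME_VALID
#include "iterate_tmpl.h"   /* defines my_iterate_unsafe(), checks removed */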

@ScottPJones
Contributor

> This is the case in Julia, because UTF-8 strings are validated when they are created.

@stevengj I agree that having valid UTF-8 input could be used to make utf8proc_iterate faster; unfortunately, that isn't currently true in Julia (although @JeffBezanson [I think] and I would like to make it so). About libunistring: pretty please avoid anything GPL (except in packages); it makes Julia useless (or even dangerous) for people like me who are trying to use Julia in a commercial project. For really frequent operations, I'll see if I can get this up to ICU speeds...

@tkelman
Contributor

tkelman commented Dec 4, 2015

The author of utf8rewind pings me once in a while on reddit to let me know he's been adding new features, so this would be another thing to compare against eventually: https://bitbucket.org/knight666/utf8rewind/overview

It's MIT-licensed and pretty small, like utf8proc, so we could borrow anything that turned out to be worth using.

@xhochy

xhochy commented Jun 17, 2020

In the Apache Arrow project, we are also evaluating utf8proc and running some benchmarks; see apache/arrow#7449 (comment). Currently unilib seems to perform better for us.

@stevengj
Member Author

One could probably speed up upper/lowercase conversions, e.g. by adding a specialized table just for this, but it's not clear from the issue whether that functionality is actually performance critical or just a test case?
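
A sketch of what such a specialized table might look like (hypothetical, not in utf8proc): precompute direct lookups for a hot range, e.g. Latin-1, and fall back to the generic path elsewhere.

#include <utf8proc.h>

/* hypothetical fast path: a precomputed lowercase table for U+0000..U+00FF,
   falling back to the general utf8proc_tolower() for everything else */
static utf8proc_int32_t lower_latin1[256];

static void init_lower_latin1(void) {
  for (utf8proc_int32_t c = 0; c < 256; ++c)
    lower_latin1[c] = utf8proc_tolower(c);
}

static utf8proc_int32_t fast_tolower(utf8proc_int32_t c) {
  return (c >= 0 && c < 256) ? lower_latin1[c] : utf8proc_tolower(c);
}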

@stevengj
Member Author

This thesis claims to have implemented similar functionality with a considerable speedup: https://bearworks.missouristate.edu/theses/2731/

However, the source code does not seem publicly available. The thesis advisor is tragically deceased, and I'm not sure about the contact information for the author.

@sgllama

sgllama commented Apr 2, 2023

> This thesis claims to have implemented similar functionality with a considerable speedup: https://bearworks.missouristate.edu/theses/2731/
>
> However, the source code does not seem publicly available. The thesis advisor is tragically deceased, and I'm not sure about the contact information for the author.

This may be the thesis author: https://www.linkedin.com/in/jpdurham
