Speed up of str_utf8_normalize() for a frequent special case #3616

Closed
mc-butler opened this issue Mar 15, 2016 · 4 comments
Labels: area: core (issues not related to a specific subsystem); prio: medium (has the potential to affect progress)

@mc-butler
Important

This issue was migrated from Trac:

Origin https://midnight-commander.org/ticket/3616
Reporter devzero (vadim.ush@….com)

When the contents of a large directory are sorted by file name, a significant amount of CPU time is spent in str_utf8_normalize(), which is called from str_utf8_create_key_gen().

For example, /usr/bin/ contains 5437 files on my Arch Linux box. Running mc /usr/bin/ /usr/bin/ takes approximately 75 000 000 CPU instructions to sort the file names, or 25% of the total program run time. Of these 75 000 000 instructions, 42 500 000 are spent in str_utf8_normalize().

str_utf8_normalize() uses g_utf8_normalize() to do the work. g_utf8_normalize() is a heavyweight function that converts UTF-8 into UCS-4, performs the normalization, and then converts UCS-4 back into UTF-8.

Since file names consist only of ASCII characters in most cases, we can speed up str_utf8_normalize() by first checking whether the heavyweight Unicode normalization is actually needed. Normalizing an ASCII-only string is a no-op, so such a string is effectively "normalized" by a plain strdup(). Here is the patch:

diff --git a/lib/strutil/strutilutf8.c b/lib/strutil/strutilutf8.c
index 8ec754d..8c7f909 100644
--- a/lib/strutil/strutilutf8.c
+++ b/lib/strutil/strutilutf8.c
@@ -1080,6 +1080,17 @@ str_utf8_normalize (const char *text)
     const char *start;
     const char *end;
 
+    const char *p = text;
+    while (1)
+    {
+        char c = *p;
+        if (c == 0)
+            return g_strndup(text, p - text);
+        else if ((c & 0x80) != 0)
+            break;
+        p++;
+    }
+
     fixed = g_string_sized_new (4);
 
     start = text;

With this patch, running mc /usr/bin/ /usr/bin/ requires just 37 000 000 instructions to sort the file names (down from 75 000 000), of which 4 500 000 instructions are spent in str_utf8_normalize() (down from 42 500 000).
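
For readers who want to try the idea outside of mc, here is a minimal standalone sketch of the same fast path in plain C. It is not the code that was merged. The names is_ascii_only(), normalize_with_fast_path(), slow_normalize() and slow_path_calls are hypothetical, and slow_normalize() is only a stand-in that copies the string where mc would call g_utf8_normalize().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counter, only here to make slow-path calls observable. */
static int slow_path_calls = 0;

/* Hypothetical stand-in for the expensive path. In mc this would be the
   g_utf8_normalize()-based code; here it merely copies the string. */
static char *
slow_normalize (const char *text)
{
    size_t len = strlen (text);
    char *copy = malloc (len + 1);

    slow_path_calls++;
    if (copy != NULL)
        memcpy (copy, text, len + 1);
    return copy;
}

/* Return 1 and store the length in *len if every byte before the
   terminating NUL has its high bit clear (pure ASCII), else return 0. */
static int
is_ascii_only (const char *text, size_t *len)
{
    const unsigned char *p = (const unsigned char *) text;

    for (; *p != '\0'; p++)
        if ((*p & 0x80) != 0)
            return 0;

    *len = (size_t) (p - (const unsigned char *) text);
    return 1;
}

/* Copy ASCII-only input directly; fall back to the slow path otherwise.
   The caller owns the returned buffer and must free() it. */
static char *
normalize_with_fast_path (const char *text)
{
    size_t len;

    assert (text != NULL);

    if (is_ascii_only (text, &len))
    {
        char *copy = malloc (len + 1);

        if (copy != NULL)
            memcpy (copy, text, len + 1);   /* includes the NUL */
        return copy;
    }

    return slow_normalize (text);
}
```

The scan uses unsigned char so that the high-bit test does not depend on whether plain char is signed on the target platform. Because the scan already finds the end of the string, the copy needs no separate strlen() call, which mirrors the g_strndup (text, p - text) call in the patch.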

@mc-butler

Changed by andrew_b (@aborodin) on Jul 28, 2017 at 10:29 UTC (comment 1)

  • Milestone changed from Future Releases to 4.8.20
  • Status changed from new to accepted
  • Branch state changed from no branch to on review
  • Owner set to andrew_b

Thanks for the patch! I rewrote the code but kept the idea itself.

Branch: 3616_utf8_normalize_speedup
[d6ca63fa303a0e82281c1764e2ef78e9fa65d3f9]

Please review.

@mc-butler

Changed by zaytsev (@zyv) on Jul 28, 2017 at 19:36 UTC (comment 2)

  • Branch state changed from on review to approved
  • Votes set to zaytsev

@mc-butler

Changed by andrew_b (@aborodin) on Jul 29, 2017 at 7:25 UTC (comment 3)

  • Branch state changed from approved to merged
  • Resolution set to fixed
  • Votes changed from zaytsev to committed-master
  • Status changed from accepted to testing

Merged to master: [37013e7].

@mc-butler

Changed by andrew_b (@aborodin) on Jul 29, 2017 at 7:27 UTC (comment 4)

  • Status changed from testing to closed
