Possibly big performance improvement for time_zone::to_sys() #817

Open
zeodtr opened this issue Feb 27, 2024 · 10 comments

@zeodtr

zeodtr commented Feb 27, 2024

Hi,

Currently, time_zone::to_sys() eventually calls time_zone::load_sys_info(), which builds a sys_info and, while doing so, copies the abbrev std::string into it.
One std::string copy per time_zone::to_sys() call can significantly impact performance.

In my case, in a big data processing system, time_zone::to_sys() was called at least 0.6 billion times for one query, and it took about 60 seconds to process the query. When I changed the data type of abbrev in sys_info struct from std::string to std::string_view (thus avoiding the string copy), that query took only about 1 second.
Although changing the data type of a public struct is a breaking change, in this case, the impact might not be severe, given that (as far as I understand) the lifetime of the backing store is infinite.
Or perhaps there may be other solutions that can avoid the string copy.

(By the way, the source code version I use is somewhat outdated, but when I looked at the current code, it still copies the string.)

Thank you.

@HowardHinnant
Owner

Unfortunately the API of date is pretty much set in stone as it has now been standardized in C++20.

That being said, there are ways to optimize the performance within the current API. Let's start with this crude benchmark for time_zone::to_sys:

#include "date/tz.h"
#include <chrono>
#include <iostream>

using D = std::chrono::system_clock::duration;
auto tz = date::current_zone();

extern date::sys_time<D> tp;

date::sys_time<D>
convert(date::local_time<D> tp_loc)
{
    return tz->to_sys(tp_loc);
}

int
main()
{
    using namespace std::chrono;
    auto tp_start = system_clock::now();
    auto tp_end = tp_start + 1s;
    auto tp_loc = tz->to_local(tp_start);
    std::uint64_t count = 0;
    while (system_clock::now() < tp_end)
    {
        tp = convert(tp_loc);
        tp_loc += 1s;
        ++count;
    }
    auto rate = count / ((tp_end - tp_start) / 1.0s);
    std::cout << rate << " conversions / sec\n";
}

date::sys_time<D> tp;

For me, compiled with clang -O3, this yields about 707,000 conversions/sec. This is significantly slower than what you report, but is a good reference point anyway.

If you can make the assumption that every local time has a unique mapping to UTC (i.e. exists and is not ambiguous), then a fairly simple change can be made to convert which will dramatically improve the performance:

date::sys_time<D>
convert(date::local_time<D> tp_loc)
{
    thread_local date::local_info info = tz->get_info(tp_loc);
    date::sys_time<D> tp{tp_loc.time_since_epoch() - info.first.offset};
    if (tp < info.first.begin || tp >= info.first.end)
    {
        info = tz->get_info(tp_loc);
        tp = date::sys_time<D>{tp_loc.time_since_epoch() - info.first.offset};
    }
    return tp;
}

This stores a thread_local local_info initialized by the first local time sent to this function. info.first contains two fields begin and end which are the UTC time points over which this information is known to be accurate. And the information includes the UTC offset. So the conversion is made with the offset, and if the result is not in the range [begin, end), then the local_info is updated, and the conversion is repeated.

For me this increases the performance to about 74.5 million conversions/sec (a speed up of about 100X).

There are a couple of caveats:

  1. If successive calls to convert are not "temporally close" on average, then the performance degrades and will ultimately become worse than the un-optimized convert. This optimization hinges on the guess that the next call to convert will be able to use the same local_info as the last call.
  2. If the local time does not have a unique mapping to UTC (because of a UTC offset change happening in the time zone), then this optimization will silently get the wrong answer. The un-optimized convert would have thrown an exception in this situation. If desired, more complicated logic can be added to protect against this. But more complicated logic will also be slower.

This optimization, if it is acceptable in your application, has the potential to be even faster than the string_view optimization you propose (100X vs 60X).

@zeodtr
Author

zeodtr commented Feb 28, 2024

@HowardHinnant Thank you for your detailed response.
I apologize for not being specific about my test: it used 40 threads, so it could be way faster.
Regarding optimization, since my system is a general-purpose OLAP DBMS, it can't make any assumptions.
Here's another idea: given that to_sys() doesn't necessarily require calling get_info() or load_sys_info(), one could introduce another set of internal functions, perhaps 'casually' named get_info_casual() and load_sys_info_casual(), which do not copy abbrev into sys_info.
In my opinion, abbrev is seldom useful beyond display purposes, and considering its high retrieval cost, it would be beneficial if 'casual' versions were available in the standard library.
Thank you.

@HowardHinnant
Owner

There's something else going on here that I'm not completely understanding. abbrev is typically a 3- or 4-character string, and every implementation of std::string I'm aware of has a short-string buffer much larger than that, which would make copying abbrev very fast. There should be no allocation/deallocation, just a memcpy of 4 to 5 chars.

Would you mind telling me what compiler/version you're using? That might be important.

@zeodtr
Author

zeodtr commented Feb 28, 2024

The compiler version info is as follows:

g++ (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Given the size of the code base for the OLAP DBMS and the complexity of the query execution path, there could potentially be another factor contributing to this symptom. If that's the case, I apologize for any oversight.

@zeodtr
Author

zeodtr commented Feb 28, 2024

The OS info is as follows:

CentOS Linux release 7.9.2009 (Core)

The OS might be too outdated to support the C++11 ABI.
References:
https://stackoverflow.com/questions/46068529/no-small-string-optimization-with-gcc
https://bugzilla.redhat.com/show_bug.cgi?id=1546704

@HowardHinnant
Owner

Ah! Do you know if you're using the old reference-counted string? I'm not sure, but that is probably --with-default-libstdcxx-abi=gcc4-compatible. Still, copying a ref-counted string is fast too, unless there is a lot of multi-thread contention on the ref-count.

Are you using the date lib flag -DUSE_OS_TZDB=1? (https://howardhinnant.github.io/date/tz.html#Installation). This tells the date lib to use the binary IANA database that ships on your OS. This configuration is known to have higher performance than using a downloaded text IANA database (-DUSE_OS_TZDB=0).

@HowardHinnant
Owner

Aside: Once you hit g++ 14 (which hasn't yet shipped), you'll be able to migrate off of this lib and use C++20 chrono. So whatever we do here, staying compatible with C++20 means you've got a path forward in the future.

@zeodtr
Author

zeodtr commented Feb 28, 2024

I use -DUSE_OS_TZDB=1.
I think I'll continue with the std::string_view solution for the time being. It doesn't necessitate altering the source code utilizing sys_info, making it practically compatible with the standard. Furthermore, it avoids the need for changes that could potentially impact other sections of the source code, like compile options. Additionally, I can manage that portion of the source code myself.

By the way, I ran my test on a CentOS 8 system a few minutes ago. It was fast without the std::string_view workaround.

Thank you for your great and kind responses!

@HowardHinnant
Owner

Interesting! So can we conclude that the move from a ref-counted string to one with the short-string optimization is the performance difference?

@zeodtr
Author

zeodtr commented Feb 28, 2024

Yes, I think the 'short string optimization' is the key.
