Support C++20 and std::u8string #104

Open
jwillikers opened this issue Feb 19, 2020 · 8 comments

Comments

@jwillikers
Contributor

When compiling for C++20, the following error occurs:

../tests/test_parse_unicode.cpp:51:23: error: no matching conversion for functional-style cast from 'const char8_t [53]' to 'std::string' (aka 'basic_string<char, char_traits<char>, allocator<char> >')
                      std::string(u8"Ýôú'ℓℓ λáƭè ₥è áƒƭèř ƭλïƨ - #"));

It looks like the introduction of char8_t and std::u8string is the cause: in C++20 a u8"" literal is an array of char8_t, which no longer converts to std::string.

I'm not sure of the best way to handle this. My first thought is to create a type alias that resolves to std::string for C++11, C++14, and C++17, and to std::u8string for C++20 and newer. That raises an important question: should the toml11 API support only std::u8string for C++20 and beyond?
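A minimal sketch of the type-alias idea described above. The namespace `example` and the alias names are hypothetical and are not part of toml11; the only assumption is that the standard library defines the feature-test macro `__cpp_lib_char8_t` when std::u8string is available.

```cpp
#include <string>

namespace example {  // hypothetical namespace, not toml11's

#if defined(__cpp_lib_char8_t)
// C++20 and newer: store strings as std::u8string.
using string = std::u8string;
#else
// C++11/14/17: fall back to std::string.
using string = std::string;
#endif

// The code unit type follows the alias (char8_t or char).
using char_type = string::value_type;

// Either way, one code unit is one byte of UTF-8.
static_assert(sizeof(char_type) == 1, "UTF-8 code units are one byte wide");

}  // namespace example
```

With this approach the same user code sees a different string type depending on the language mode, which is the compatibility concern raised in the reply below.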

@ToruNiina
Owner

Yes, I know that problem... The other day I did the same thing as you and hit the same error. I'm also not sure what the best way to deal with it is. Anyway, thank you for reporting this. Its priority has increased.

There are several options. One is, as you suggested, to add a type alias that switches the implementation of toml::string from std::string to std::u8string. That way users do not need to care about the character type in use, but combining it with existing non-u8string code in C++20 mode could become a bit harder.
Another is to add a template parameter to toml::value and let users choose. The string type can then be chosen in user code, but the templated code would become messy.
The most ad hoc solution is to convert the char8_t literals to std::string byte by byte in the test code, but that does not solve the fundamental problem.
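The byte-by-byte conversion mentioned as the ad hoc option could look roughly like this. `to_std_string` is a hypothetical helper name, not a toml11 function, and the sketch assumes the argument is a string literal (so the array ends with a NUL).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

#if defined(__cpp_char8_t)
// C++20: u8"" literals are arrays of char8_t, so copy each code unit as a byte.
template <std::size_t N>
std::string to_std_string(const char8_t (&lit)[N]) {
    assert(lit[N - 1] == 0);  // expects a NUL-terminated literal
    std::string out;
    out.reserve(N - 1);
    for (std::size_t i = 0; i + 1 < N; ++i) {  // skip the trailing NUL
        out.push_back(static_cast<char>(lit[i]));
    }
    return out;
}
#endif

// Before C++20 (and for plain literals) the array is already char-based.
template <std::size_t N>
std::string to_std_string(const char (&lit)[N]) {
    assert(lit[N - 1] == 0);  // expects a NUL-terminated literal
    return std::string(lit, N - 1);
}
```

A test could then call `to_std_string(u8"...")` in place of `std::string(u8"...")` and compile in both C++17 and C++20 modes.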

Basically, I want to give users flexibility and control, so I prefer the second option from the previous paragraph, the template. But I haven't done anything about this yet because the priority was low. Also, since I noticed the problem only a few days ago, I'm still not confident about the solution. There could be another, better idea; I'm not sure...
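For illustration only, the template option might take a shape like the following. `sketch::basic_value` and its members are hypothetical and deliberately tiny; they are not toml11's actual toml::value, which holds many more types than a string.

```cpp
#include <string>
#include <type_traits>
#include <utility>

namespace sketch {  // hypothetical namespace, not toml11's

// A value type whose string representation is chosen by the user.
template <typename String>
class basic_value {
public:
    using string_type = String;

    explicit basic_value(String s) : str_(std::move(s)) {}

    // Returns a reference to the stored string in the chosen type.
    const String& as_string() const noexcept { return str_; }

private:
    String str_;
};

// Default: the existing std::string-based behaviour.
using value = basic_value<std::string>;

#if defined(__cpp_lib_char8_t)
// Opt-in for C++20 users who want std::u8string.
using u8value = basic_value<std::u8string>;
#endif

static_assert(std::is_same<value::string_type, std::string>::value,
              "the default value keeps std::string");

}  // namespace sketch
```

The cost noted above is visible even here: every function that touches the value has to become a template over `String`.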

@jwillikers
Contributor Author

I dug up some information on this and it looks like nobody is happy about the breaking conversions for std::u8string and char8_t. It looks like several built-in types are missing proper specializations for u8 types in C++20.

  1. {fmt} issue on char8_t support - how do we print u8 literals using fmt? fmtlib/fmt#1405
  2. StackOverflow answer about C++ u8 conversions: https://stackoverflow.com/a/59055485/9835303
  3. Proposal against using std::u8string or char8_t: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1747r0.html
  4. PR to fix std::u8string usage in PyBind: Add C++20 char8_t/u8string support pybind/pybind11#2026

@levicki

levicki commented Jul 10, 2020

Would there be a way to also support reading to wstring instead of string, and serializing from wstring as UTF-8?

@ToruNiina
Owner

Sorry for the late response. We don't plan to serialize into or deserialize from wstring. wchar_t is an implementation-defined character type, and its internal representation is not guaranteed to be Unicode (it could be a locale-specific character encoding). Even if the environment uses Unicode, wchar_t would not hold UTF-8 but UTF-16 (e.g., Windows) or UTF-32 (e.g., Linux). Since the TOML specification says TOML data must be encoded as UTF-8, we can focus on char (the traditional way of handling byte arrays) and char8_t.

You can use a compiler builtin or an OS API to convert between an array of wchar_t and a UTF-8 byte buffer. <codecvt> could be another option, but note that codecvt_utf8 has been deprecated since C++17.
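As a rough, portable alternative to the OS APIs mentioned above, a hand-written conversion is sketched below. It assumes that wchar_t holds UTF-16 when it is 2 bytes wide and UTF-32 otherwise, which, as noted, the standard does not guarantee. `append_utf8` and `wide_to_utf8` are hypothetical helper names, and unpaired surrogates are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out` as UTF-8.
inline void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // precondition: a valid Unicode code point
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Assumption: wchar_t is UTF-16 (2 bytes) or UTF-32 (4 bytes).
inline std::string wide_to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF &&
            i + 1 < in.size()) {
            const std::uint32_t lo = static_cast<std::uint32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The resulting std::string is a plain UTF-8 byte buffer, so it can be handed to any char-based parser.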

@levicki

levicki commented Sep 20, 2020 via email

@ToruNiina
Owner

Nice. Most of the libraries are provided as is and toml11 is no exception. I hope you could solve your problem.
The implementation of new features might take some time, and I don't always have time. But pull requests for new features are always welcome.

@ToruNiina
Owner

Coming back to the original problem, I have added a workaround, and both ""_toml and u8""_toml literals now work in C++20 mode in the current release. CI now includes test cases in C++20 mode with several major compilers. It seems that all the features work in C++20.

And thank you very much, jwillikers, for surveying the situation.
Currently u8string is still not supported, but I will later implement the conversion from std::u8string via get and find and conversion to toml::value. That means that a normal std::string will be used as an internal string representation and we would not be able to get a raw reference to u8string, but I think it is a good compromise in the current situation. Adding many ifdefs makes the code complicated.
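The compromise described above (std::string internally, std::u8string only at the boundary) relies on copying between the two types. A minimal sketch of that copy follows; `from_u8string` and `to_u8string` are hypothetical names, not toml11 functions, and both return new strings rather than references, which matches the limitation mentioned above.

```cpp
#include <cassert>
#include <string>

#if defined(__cpp_lib_char8_t)
// Copy UTF-8 code units into the char-based internal representation.
inline std::string from_u8string(const std::u8string& s) {
    return std::string(s.begin(), s.end());
}

// Copy the internal bytes back out as char8_t code units.
inline std::u8string to_u8string(const std::string& s) {
    return std::u8string(s.begin(), s.end());
}

// Usage example: the round trip preserves every code unit.
inline void u8string_round_trip_example() {
    const std::u8string original = u8"caf\u00E9";
    const std::string internal = from_u8string(original);
    assert(internal.size() == original.size());
    assert(to_u8string(internal) == original);
}
#endif
```

Because each direction allocates a copy, this avoids scattering ifdefs through the internals at the price of one extra allocation per conversion.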

@levicki

levicki commented Sep 20, 2020

> Most of the libraries are provided as is and toml11 is no exception.

I understand that very well. The only reason I asked about std::wstring support is that it is part of the C++ standard library, and using std::wstring and the underlying wchar_t is more or less unavoidable if you want to do any C++ coding on Windows.

I also understand that wchar_t is not the same size on Linux / macOS, and that char there usually means UTF-8, so if you wrote your library with those operating systems in mind it is clear why you would refuse to support wchar_t and std::wstring.

> I hope you could solve your problem.

Yes, I have solved it by switching to toml++.

> The implementation of new features might take some time, and I don't always have time. But pull requests for new features are always welcome.

I understand that as well. However, people sometimes need to get their own work done too. That's usually why they look for a library someone else wrote in the first place: to avoid having to implement things in a domain they aren't familiar with, under the time constraints of their own project or work assignment.

Sorry for going slightly off-topic, and I apologize if I came across as disrespectful in my previous response.
