Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What happened to wstring support in open() APIs? #2637

Open
ribtoks opened this issue May 28, 2023 · 26 comments
Open

What happened to wstring support in open() APIs? #2637

ribtoks opened this issue May 28, 2023 · 26 comments
Labels
bug i18n Internationalisation platforms Cygwin, MinGW, cross-compilation, NetBSD, FreeBSD etc windows Windows Specific issues

Comments

@ribtoks
Copy link

ribtoks commented May 28, 2023

Hi

I upgraded exiv2 to 0.28.0 just to find that std::wstring support has been nuked out in 7933ff4 but I could not find a matching PR, issue or at least a discussion that would explain why and what to do now.

std::string just does not work in Windows for non-latin1 paths for Exiv2::ImageFactory::open() (Visual Studio compiler).

What is the intended way to use it now?

These are all the old issues that reference the problem:

@ribtoks ribtoks added the bug label May 28, 2023
@ribtoks ribtoks changed the title What happend to wstring support in open() APIs? What happened to wstring support in open() APIs? May 28, 2023
@cgilles
Copy link
Collaborator

cgilles commented Jun 2, 2023

I can confirm that digiKam 8.1.0 pre-release based on Exiv2 0.28 is broken under windows to read metadata from images.

https://bugs.kde.org/show_bug.cgi?id=469372
https://bugs.kde.org/show_bug.cgi?id=470566

@neheb
Copy link
Collaborator

neheb commented Jun 6, 2023

The intended way to use this is to set UTF-8 as the code page. See: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

edit: oh I see. digiKam is compiled with MSVCRT instead of UCRT. I don't think that can work.

ping @piponazo

@cgilles
Copy link
Collaborator

cgilles commented Jun 6, 2023

No, digiKam is cross compiled under Windows using MXE (MinGW env - https://github.com/mxe/mxe) since 10 years now. The MSVC build is just another way to build the application in the KDE infra, it's not the official production release at all, because MSVC is unsafe,slow, and under-productive platform to compile huge code as digiKam, but it's another story...

Exiv2 team, please revert this commit to drop the UNICODE support under Windows. It's not at all a "dead" code... or explain well the work around. Note: I'm surprised to not see an unit-tests crying for the UTF16 long file path under Windows...

@neheb
Copy link
Collaborator

neheb commented Jun 6, 2023

No, digiKam is cross compiled under Windows using MXE (MinGW env - https://github.com/mxe/mxe) since 10 years now. The MSVC build is just another way to build the application in the KDE infra, it's not the official production release at all, because MSVC is unsafe,slow, and under-productive platform to compile huge code as digiKam, but it's another story...

Exiv2 team, please revert this commit to drop the UNICODE support under Windows. It's not at all a "dead" code... or explain well the work around. Note: I'm surprised to not see an unit-tests crying for the UTF16 long file path under Windows...

Read what I said again. MSVCRT and UCRT

@maksim-petukhov
Copy link

@neheb @piponazo
What was the reason for deleting perfectly working code in 7933ff4? I'd like to advocate for reverting this deletion. The removed code was functional and served a purpose, so it doesn't seem like dead code. It's been quite useful for me, and I believe it could continue to benefit other users as well.

The intended way to use this is to set UTF-8 as the code page. See: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

I see that the current recommendation is to "set UTF-8 as the code page". It is a bit reckless to ask users of Exiv2 to do that. There are some problems with this approach. First of all, this can perfectly work in tests but there is a reality when you have 150+ external libraries in your project and neither of them mentions that it works flawlessly with relatively new Microsoft invention. There are also loads of your code that relies on assumption "it is not UTF-8 there but something else" you have to check to make sure nothing is broken by forcing UTF-8 code page.

The "UTF-8 as the code page" feature is available only since Windows 1803. For Windows 10 operating systems prior to 1803 (10.0.17134.0), only static linking of UCRT is supported.. Windows 10 version prior 1803 are still supported in LTSC channel and static linking of UCRT is discouraged by Microsoft for security reasons. I don't see a nice way to use 0.28 across all Windows 10 versions.

I'm sharing these insights in the spirit of maintaining the functionality and usability of Exiv2 across a wide range of scenarios. I hope we can discuss this further and consider the potential benefits of restoring the code.

Thank you for your attention and consideration.

@neheb
Copy link
Collaborator

neheb commented Aug 15, 2023

@maksim-petukhov more reading here: #2090

@maksim-petukhov
Copy link

@neheb I read it. It is still doesn't answer why @piponazo deleted code for EXIV2_ENABLE_WIN_UNICODE. exiv2.exe binary works with "set UTF-8 as the code page" - great. But as I'm trying to point out, there are big projects out there relying on exiv2 library that can't just "set UTF-8 as the code page" because this could break a lot of things.

@cgilles
Copy link
Collaborator

cgilles commented Aug 17, 2023

And digiKam also depend on Exiv2 in-deep, and we don't have this kind of problem with UTF8 under Windows when it's cross compiled under Linux MinGW. Please revert this code deletion, as the library is not suitable under Windows when it's cross compiled. I second also maksim-petukhov about the topic : "Why to drop working code tested since a vrey long time ?"

NOTE: recommend me to compile whole digiKam with MSVC under Windows is not a solution for digiKam at all...

Gilles Caulier

@kmilos
Copy link
Collaborator

kmilos commented Aug 18, 2023

NOTE: recommend me to compile whole digiKam with MSVC under Windows is not a solution for digiKam at all...

You have the MSYS2 UCRT64 option (the default now for MSYS2). I think Octave recently switched to UCRT builds also using MXE - @mmuetzel can you please advise the digiKam devs?

@kmilos
Copy link
Collaborator

kmilos commented Aug 18, 2023

AFAICT, it seems Fedora's MinGW cross toolchain can also target UCRT since F37: https://fedoraproject.org/wiki/Changes/F37MingwUCRT (e.g. https://src.fedoraproject.org/rpms/mingw-zlib/pull-request/3)

Please get in touch with any other distro not supporting and ask for UCRT targets (cf. Arch, Debian, don't know about Ubuntu...)

@mmuetzel
Copy link

Just my 2 cents without getting a complete picture yet: Switching Windows to a UTF-8 locale only affects binaries that use the new Universal C Runtime (UCRT) instead of the older MS Visual C Runtime (MSVCRT). But even then, there are still quite a few applications that don't work when switching Windows to a UTF-8 locale. (IIRC, Notepad++ had some issues last time I checked.)

So, the only way to reliably support non-ASCII characters in file names on Windows is to use their wide character API. I guess that is what is meant when people talk about wstring here?

@kmilos
Copy link
Collaborator

kmilos commented Aug 18, 2023

I guess that is what is meant when people talk about wstring here?

Correct. However, we're discussing here about transitioning exiv2 clients to UCRT builds and UTF-8 locale only, as exiv2 in the latest version dropped the W API*. digiKam is a good example, as it uses MXE builds like Octave.

* Why, and why now is a separate discussion, but let's be pragmatic for a moment and focus on the UCRT build transition, as it'll have to be done by people sooner or later anyway...

@mmuetzel
Copy link

I understand that this issue took a turn to become about linking to the UCRT. But to come back to the original issue:
Last time I checked, setting Windows to a UTF-8 locale was still experimental. And I'd really urge you guys too not recommend to your users to activate a setting that will leave them with a half borked Windows installation.

@kmilos
Copy link
Collaborator

kmilos commented Aug 18, 2023

And I'd really urge you guys too not recommend to your users to activate a setting that will leave them with a half borked Windows installation.

How so? The UTF-8 setlocale setting (or the app manifest) only affects the running process, not the system.

Anyhow, as I said, I'm not too keen on getting into why (feel free to join the prior issues and PRs linked above), but trying to offer workarounds.

@kmilos
Copy link
Collaborator

kmilos commented Aug 18, 2023

The "UTF-8 as the code page" feature is available only since Windows 1803. For Windows 10 operating systems prior to 1803 (10.0.17134.0), only static linking of UCRT is supported.. Windows 10 version prior 1803 are still supported in LTSC channel and static linking of UCRT is discouraged by Microsoft for security reasons. I don't see a nice way to use 0.28 across all Windows 10 versions.

Btw, even more than one year ago this was already less than 0.2% of all Windows 10/11 machines (since 2018 >90% out of those 2.4% were already on 1803). (Windows 7 & 8.1 are not counted here as they are no longer supported by MS.) If you say a lot of enterprise machines are probably also not counted here, what could it be realistically then given the 3-5 year enterprise cycles? Still less than 1%, no? What percentage of those will ever use exiv2? The UCRT toolchains are there (both native and cross). Let's move on.

@kmilos kmilos added platforms Cygwin, MinGW, cross-compilation, NetBSD, FreeBSD etc windows Windows Specific issues labels Aug 18, 2023
@mmuetzel
Copy link

If you are sure that you won't restore using the wide character Windows API, that's your decision.

(Cross-)compiling MinGW and GCC that default to linking to the UCRT should only require adding --with-default-msvcrt=ucrt to their respective configure flags.

@neheb
Copy link
Collaborator

neheb commented Aug 21, 2023

wstring API removal was premature looks like. Can probably be added back in terms of std::filesystem.

Debian and hence Ubuntu do not have a UCRT target: https://groups.google.com/g/linux.debian.bugs.dist/c/vU9KxW1RTLg

Fedora has one but it's very barebones.

@neheb
Copy link
Collaborator

neheb commented Aug 21, 2023

Yeah messing around with std::filesystem leads me to believe the wstring API should come back. fs::path is an std::(w)string based on the platform. For Windows it requires calling .string(), which does not work with references as it's a temporary.

@gh-andre
Copy link

gh-andre commented Oct 19, 2023

I encountered this regression as well after upgrading to v0.28.0 and would like to note that Microsoft added UTF-8 code page on top of core Windows functionality, which remains to be UTF-16 (well, UCS-2, actually), so it will end up calling the same W functions in Win32 that _wfopen did, just slower, because of the extra character conversion.

Also, I imagine processes using a specific code page would need to check the code page inherited from the console and report an error if it doesn't match - wouldn't this cause more problems, compared to just using wide characters as 0.26.x did and not worrying about code pages?

It would be great if this breaking change was reverted until a usable alternative is found, whatever it is. You can even make it smaller if you use std::filesystem::path instead of std::string in FileIo::Impl because it already provides functionality to obtain std::string, std::wstring and std::u8string for any given path, so it should be possible to reduce code duplication by just conditionally compiling code where there's difference, like calling _wfopen vs. fopen (or using file I/O traits of sorts).

I will also note that since UTF-8 being the target here, keep in mind that in C++20 char8_t* and char*, as well as std::u8string and std::string are no longer interchangeable, so if you end up calling path functions returning std::u8string, it won't work with the rest of the code expecting std::string.

@neheb
Copy link
Collaborator

neheb commented Oct 19, 2023

Requires small API change. See #2800

@gh-andre
Copy link

While it's a good change, I don't mean it as just an improvement, but rather as a way to reduce the amount of conditionally compiled code, which in 0.26.x included entire methods. Having std::filesystem::path would allow compile conditionally only key function calls, such as _wfopen vs. fopen and call the string method appropriate for the API call.

As a side note, you may want to revisit by-value path parameters - most paths are over the short string optimization limit and will unnecessarily allocate memory. Given that std::filesystem::directory_entry::path() returns std::const filesystem::path&, it would reduce memory allocations for apps that scan a lot of images.

@neheb
Copy link
Collaborator

neheb commented Oct 21, 2023

@gh-andre public API is kept at C++11 and private is C++17. The ref to value change was because it does not work under Windows. The alternative is to reintroduce const std::string& and const std::wstring&.

@gh-andre
Copy link

Not sure what wasn't working on Windows. Maybe it is a C++11 issue, I cannot say. Definitely has no problem in C++17, so I was able to create a patch to handle opening wide-character file paths for Exiv2 v0.28.0. It will only work for files because the implementation isn't cleanly separated for various derived classes of BasicIo and surrounding code.

That is, some things seem like they got worse over time, such as calling fopen and stat outside of the implementation class, and some were like that from the beginning, like calling ::remove and fs::remove within the same block of code, or extracting a part of a URL as if it's a file (e.g. std::make_unique<FileIo>(pathOfFileUrl(path))), which will fail for any URL-encoded sequences, like %20.

Initially, I took a stab at the whole thing, similar to how previous EXIV2_ENABLE_WIN_UNICODE did it, but it was futile because file paths and protocols were all mixed in. The WIN_UNIQUDE patch hacked it in places doing things like assigning wchar_t to char characters, which only works for ASCII, duplicating thumbnail extensions methods, etc. So, I opted for a localized change, just to be able to work with image files. If anyone is interested, here's the patch:

https://github.com/StoneStepsInc/exiv2-nuget/blob/master/patches/01-wide-char-paths.patch

It will apply against the released Exiv2 0.28.0 source in exiv2-0.28.0-Source.tar.gz. Narrow character functions will still work as before, but will not be able to open file paths outside of the current code page. Functions that take const std::filesystem::path& will open Windows paths comprised from valid UTF-16 characters.

@AlynxZhou
Copy link

If I am not wrong, this could be fixed with #2800? Is there anything blocking it from being merged? Thanks!

jim-easterbrook added a commit to jim-easterbrook/python-exiv2 that referenced this issue Dec 21, 2023
This is an alternative approach to using the std::wstring versions of
exiv2 methods that are not available in v0.28.x (see
Exiv2/exiv2#2637). It should also allow non
ascii paths to be used when libexiv2 has been compiled without wstring
methods enabled.

The proper approach is to set the Windows code page to utf-8, but this
probably won't work on old Windows, and it's not clear how to do it for
a Python application. Instead we transcode paths from utf-8 to the
current code page, which could fail.
@cgilles
Copy link
Collaborator

cgilles commented Mar 16, 2024

Note : digiKam is now also compiled with VCPKG and MSVC compiler under Windows and the problem is also present in native build of Exiv2.

https://bugs.kde.org/show_bug.cgi?id=483743

Gilles Caulier

@kmilos
Copy link
Collaborator

kmilos commented Mar 25, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug i18n Internationalisation platforms Cygwin, MinGW, cross-compilation, NetBSD, FreeBSD etc windows Windows Specific issues
Projects
None yet
Development

No branches or pull requests

8 participants