Char encoding on Windows 7 #48

Closed
SpartanJ opened this issue Aug 28, 2013 · 28 comments
Labels
bug Something isn't working major

Comments

@SpartanJ
Owner

Original report by Batte HUCHAI (Bitbucket: bhuchai, ).


Hello,

I have a character-encoding problem on Windows 7. When I use Windows Explorer to create a file whose name contains a special (non-ASCII) character such as "♫", the EFSW FileWatcher passes my application the same filename but with "?" in place of "♫":

For example, when I create the file "test♫.xlsx", I get :

DIR (C:\Users\buchet\Documents\Watched\) FILE (test?.xlsx) has event Added

A filename containing "♫" may seem silly, but it's just an example. The same thing happens with every special character that Windows Explorer accepts: EFSW changes it in the name it reports.

For your information, it seems to work on Mac OS 1.7.5.

Any idea ?

@SpartanJ
Owner Author

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


Hi Batte, yes, the problem is that efsw is converting everything to ANSI on Windows, which is not the best solution. I'll change this to convert everything to UTF-8, but you won't see the ♫ unless you change the command line's default code page, since it uses ANSI by default. You can change the code page to UTF-8 with "chcp 65001", but you will also need to change the default font to one that supports these characters ( like Lucida Console ).
You can also compile with UNICODE support: if you're using Visual Studio, go to the project properties -> Configuration Properties -> General -> Character Set -> Use Unicode Character Set. But since I'll make this change so that it always works with UTF-8, the result will be exactly the same.
Thanks for reporting it!

@SpartanJ
Owner Author

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


Fixed in #a4bca78

@SpartanJ
Owner Author

Original comment by Batte HUCHAI (Bitbucket: bhuchai, ).


Hello,

I understand, but the default encoding of QtCreator's command-line pane (QtCreator is the IDE I use) does not seem to be ANSI, which is why I'm able to see special characters like ♫.

About your new encoding choice (UTF-8): are you sure that's the correct encoding for Windows? I thought it was UTF-16... Maybe you're right, I'm actually not sure!

Thanks,

B.

@SpartanJ
Owner Author

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


Yes, Windows encourages the use of UTF-16 as the default encoding, but it's not a requirement, since it supports any Unicode encoding. I can't use UTF-16 because I'm using std::string to keep it simple, and the other OSes use UTF-8, so the correct approach is to always use the same encoding. Nothing prevents you from converting the strings to any other encoding, and you can use the String class used internally by efsw ( efsw::String::fromUtf8( filename ).toWideString() ).
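
To illustrate the kind of conversion mentioned above without depending on efsw internals, here is a minimal, portable sketch that turns a UTF-8 std::string into UTF-16 code units. The helper name utf8ToUtf16 is hypothetical (it is not part of efsw), and the sketch assumes well-formed UTF-8 input and does no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not an efsw API): decodes well-formed UTF-8
// into UTF-16 code units. Input is assumed valid; nothing is validated.
std::u16string utf8ToUtf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;
        std::size_t len;
        // The lead byte tells how many bytes the sequence has.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Fold in the continuation bytes, 6 payload bits each.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With this sketch, the name "test♫.xlsx" from the original report (where ♫ is the three UTF-8 bytes e2 99 ab) becomes ten UTF-16 code units, the fifth being 0x266B.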
I also use QtCreator; the application output on other OSes is set to UTF-8 and looks fine, but I don't know what it uses on Windows. I tried printing UTF-16 and it doesn't seem to work either. I don't have time to continue testing, and it's not something I care much about, since it's not a problem in efsw.
Sadly I can't satisfy every developer, but if you want to suggest another solution, I'm listening!

Edit:
Hint: Read this https://bugreports.qt-project.org/browse/QTCREATORBUG-316
Try calling this function: http://msdn.microsoft.com/en-us/library/windows/desktop/ms686036(v=vs.85).aspx with the correct code page ( 65001 ).
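
A minimal sketch of that hint, assuming the linked page documents the Win32 function SetConsoleOutputCP: the wrapper below (useUtf8ConsoleOutput is a hypothetical name) asks the console to treat program output as UTF-8. On non-Windows platforms it does nothing and reports success. The console font still has to contain the glyphs, as noted earlier in the thread.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical wrapper: switch the Windows console output code page
// to UTF-8 (65001). No-op elsewhere.
bool useUtf8ConsoleOutput() {
#ifdef _WIN32
    // SetConsoleOutputCP returns nonzero on success.
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}
```

It would be called once at startup, before the first std::cout output of a filename.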

Regards,
Martín

@SpartanJ
Owner Author

SpartanJ commented Sep 2, 2013

Original comment by Batte HUCHAI (Bitbucket: bhuchai, ).


Hello,

All UTF-8 chars seem to be correctly sent by EFSW.

But I have some problems with other files (files whose names originate from Mac OS X: they were created directly on a Mac and are displayed correctly by both Finder and Windows Explorer).

One of them :
https://mega.co.nz/#!FR1UQBDa!TzRLQ210dwpE2KQGUNz__1fWcOsvo19VFLed-2YujLE

On Windows, if you edit the filename, copy it and paste it into a plain-text editor like Notepad, you'll see some unusual characters! And these characters are not correctly sent by EFSW... Do you know why?

I hope that's clear...

Do not hesitate to tell me if it's not.

B.

@SpartanJ
Owner Author

SpartanJ commented Sep 3, 2013

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


Sorry, but I'm not sure what you are trying to say.
Maybe you can explain to me step by step how to reproduce the problem, and try to be a little clearer about what the problem is. Because from what I understand, it doesn't sound like an efsw bug.
Thanks,
Martín

@SpartanJ
Owner Author

SpartanJ commented Sep 3, 2013

Original comment by Batte HUCHAI (Bitbucket: bhuchai, ).


OK, sorry for being unclear.

It's simple. As with "test♫.xlsx", I put the file below into the watched folder:
https://mega.co.nz/#!FR1UQBDa!TzRLQ210dwpE2KQGUNz__1fWcOsvo19VFLed-2YujLE

The result? Just like the "test?.xlsx" reported by EFSW, I received a wrong filename for this new file.

Do you know why ?

For information: after some investigation, I understood that the file came from Mac OS X and had some strange characters in its filename (you can see them by copying the filename into Notepad or another plain-text editor).

B.

@SpartanJ
Owner Author

SpartanJ commented Sep 4, 2013

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


This is what I explained in the previous messages: this is not an efsw problem. You need to use a command line that supports UTF-8 character encoding, with a font that also supports it. Or you can convert the output to an encoding that the command line interprets correctly.
For example, I used the cygwin console to show you that this is already working ( since it uses UTF-8 by default ): example working.
There are some other options; just search for something like "windows unicode command line support" on Google.
Regards,
Martín

@SpartanJ
Owner Author

SpartanJ commented Sep 9, 2013

Original comment by Batte HUCHAI (Bitbucket: bhuchai, ).


Hello,

I also have a character-encoding problem on Mac OS. If I put a folder named "lolélalé" into a watched folder, I get "lole'lale'"...

In case it helps, here is part of my code:

            case efsw::Actions::Add:
                std::cout << "DIR (" << dir << ") FILE (" << filename << ") has event Added" << std::endl;
                _filewatchersignals->emit_addSignal(QString::fromUtf8(dir.c_str()), QString::fromUtf8(filename.c_str()));
                break;
            case efsw::Actions::Delete:
                std::cout << "DIR (" << dir << ") FILE (" << filename << ") has event Delete" << std::endl;
                _filewatchersignals->emit_deleteSignal(QString::fromUtf8(dir.c_str()), QString::fromUtf8(filename.c_str()));
                break;
            case efsw::Actions::Modified:
                std::cout << "DIR (" << dir << ") FILE (" << filename << ") has event Modified" << std::endl;
                _filewatchersignals->emit_modifiedSignal(QString::fromUtf8(dir.c_str()), QString::fromUtf8(filename.c_str()));
                break;
            case efsw::Actions::Moved:
                std::cout << "DIR (" << dir << ") FILE (" << filename << ") has event Moved from (" << oldFilename << ")" << std::endl;
                _filewatchersignals->emit_movedSignal(QString::fromUtf8(dir.c_str()), QString::fromUtf8(filename.c_str()), QString::fromUtf8(oldFilename.c_str()));
                break;
            default:
                std::cout << "Should never happen!" << std::endl;

You can see that I correctly read the output as UTF-8...

Thanks for your help.

B.

@SpartanJ
Owner Author

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


You have the same problem as on Windows: your locale is not correctly set in the Terminal. I've tested with the default terminal locale ( en_US.UTF-8 ) and everything works just fine. It also works in the application output from QtCreator. Your code looks fine, so I don't think there's anything wrong there.
OS X and UTF-8 example

@SpartanJ
Owner Author

Original comment by Batte HUCHAI (Bitbucket: bhuchai, ).


Hello,

I understand what you're saying, but I think my problem is something else.

My Qt project is a client that talks to a web service.
On Windows, when EFSW reports a file "tété.txt" to the Qt client (which sends it on), the web service correctly receives the file (with the same filename, "tété.txt").
On Mac OS, when EFSW reports the same filename, the web service receives a wrong filename.

I looked at the decimal value of each char in the filename sent by EFSW, and they don't seem to be the UTF-8 values of the "t" and "é" characters.

Do you know what I mean ?

@SpartanJ
Owner Author

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


Yes, it's clear what you are describing. I'll compare the string hashes produced on OS X and Windows. If something is different, it means that efsw is doing something wrong; otherwise it should be something in your application.

Let me see and i'll tell you.

Thanks

@SpartanJ
Owner Author

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


Ok, I ran the tests and everything looks fine. The string hashes are the same, and the binary data is exactly the same. I still think that this is not an efsw issue. If you can reproduce it with a simple example that I can test here, I'll take a look at it. But please, nothing with Qt or client/server, since that has nothing to do with the library.

OS X hashes
Windows 7 hashes

Regards

@SpartanJ
Owner Author

Original comment by Batte HUCHAI (Bitbucket: bhuchai, ).


Hello,

I'm sorry, but I still have some problems with EFSW encoding.

I printed in hexadecimal the string that EFSW gives after an event occurred.
The result is:

#!

DIR (/Users/bb/MCF/bb@gmail.com/Privévè/Coffre-fort/aabbcc/) FILE (pépè.png(0x7065ffffffccffffff817065ffffffccffffff802e706e67)) has event Added

As you can see, all characters are encoded in UTF-8...

  • "p" : 0x70
  • "." : 0x2e
  • "n" : 0x6e
  • "g" : 0x67

... EXCEPT "é" and "è" :

  • "é" : 0x65ffffffccffffff81 (which seems to be the UTF-8 code of "e" (0x65) followed by something else (0xffffffccffffff81)).
  • "è" : 0x65ffffffccffffff80 (which seems to be the UTF-8 code of "e" (0x65) followed by something else (0xffffffccffffff80)).

But normally, UTF-8 code of "é" is : 0xc3a9
and UTF-8 of "è" is : 0xc3a8
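
As a cross-check of those expected values, here is a small sketch of a UTF-8 encoder for a single code point (encodeUtf8 is a hypothetical helper, not something from efsw or Qt). It gives c3 a9 for U+00E9 (é) and c3 a8 for U+00E8 (è). It also gives cc 81 for U+0301 (combining acute accent) and cc 80 for U+0300 (combining grave accent), which matches the bytes that follow 0x65 in the output above once the ffffff padding is ignored.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one Unicode code point (up to U+10FFFF)
// as UTF-8 bytes. Surrogates and out-of-range values are not rejected.
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```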

This difference is the reason why my C++ program (in Qt) doesn't correctly understand the word "pépè.png" ...

Do you see the same thing?
Do you have an explanation?

Thanks for your help.

B.

@SpartanJ
Owner Author

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


Sorry, but I tested again and I'm getting the correct UTF-8 codes ( I tested with mingw and vs too ).
I'll need a minimal test with which I can reproduce your problem, if possible without Qt, since I think your problem is there. Have you tested this with the efsw-test that comes with the project?

@SpartanJ
Owner Author

Original comment by Batte HUCHAI (Bitbucket: bhuchai, ).


I'm gonna test with efsw-test.

Can you just give me the hexadecimal output of a file "pépè.png" detected by EFSW ?

Something like this:

#!c++

void print_hex(const char *s)
{
    while(*s)
        printf("%02x", (unsigned int) *s++);
}

[...]

switch (action)
{
case efsw::Actions::Add:
    std::cout << "DIR (" << dir << ") FILE (" << filename << "(";
    print_hex(filename.c_str());
    break;
[...]

@SpartanJ
Owner Author

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


Here it is

@SpartanJ
Owner Author

SpartanJ commented Nov 2, 2013

Original comment by Batte HUCHAI (Bitbucket: bhuchai, ).


OK, thank you.

Do you know why the "é" and "è" chars are encoded on 14 bytes instead of 2 like the others?

@SpartanJ
Owner Author

SpartanJ commented Nov 2, 2013

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


No, that's not the encoding. I printed the data as you asked, converting every char to unsigned int ( printf("%02x", (unsigned int) *s++); ); that's why you see those extra ffffff: char is signed here, so bytes of 0x80 and above are sign-extended when converted.
The first byte of è is c3 and the second byte is a8.
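
A minimal sketch of a hex dump that avoids those extra ffffff: each byte is cast to unsigned char before being widened, so values of 0x80 and above are not sign-extended. The function name toHex is hypothetical, and it returns a string instead of printing so the result can be compared directly.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats each byte of s as two lowercase hex digits.
std::string toHex(const std::string& s) {
    std::string out;
    char buf[3];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Cast to unsigned char first so bytes >= 0x80 are not sign-extended.
        const unsigned int byte = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof(buf), "%02x", byte);
        out += buf;
    }
    return out;
}
```

With this version, a precomposed "è" prints as c3a8, and "e" followed by the bytes cc 81 prints as 65cc81.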

@SpartanJ
Owner Author

SpartanJ commented Nov 2, 2013

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


I think your problem is that you're not correctly converting the UTF-8 std::string to QString: you need to create the string using QString::fromUtf8, and I think you are using QString( str.c_str() ).

@SpartanJ
Owner Author

SpartanJ commented Nov 2, 2013

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


Oh no, now I see from your previous post that you used QString::fromUtf8. So I don't know; still, if you want, make a minimal example of this failing and I'll debug it ( use Qt4 if you want, because I think that's where the problem is ).

@SpartanJ
Owner Author

SpartanJ commented Nov 4, 2013

Original comment by Batte HUCHAI (Bitbucket: bhuchai, ).


Hello,

It's really, really strange.

As you advised, I changed the "test" sources of the EFSW project:

src/test/efsw-test.cpp :

#!c++

[...]
void print_hex(const char *s)
{
      while (*s)
            printf("%02x", (unsigned int) *s++);
}

void handleFileAction( efsw::WatchID watchid, const std::string& dir, const std::string& filename, efsw::Action action, std::string oldFilename = ""  )
      {
      std::cout << "DIR (" << dir + ") FILE (" + ( oldFilename.empty() ? "" : "from file " + oldFilename + " to " ) + filename + " (";
      print_hex(filename.c_str());
      std::cout << ") " << ") has event " << getActionName( action ) << std::endl;
      }
[...]

As you can see, I've just added the "print_hex()" function. Qt is not involved here: I use your makefile to compile the test program.

After compiling and executing, I get :

#!

iMac-de-B:bin bb$ ./efsw-test-release 
Press ^C to exit demo
CurPath: /Users/bb/Documents/efsw_test/efsw_project/bin/
Added WatchID: 1
Added WatchID: 2
DIR (/Users/buchet_b/Documents/efsw_test/efsw_project/bin/test/) FILE (from file pépé copie to pépé (7065ffffffccffffff817065ffffffccffffff81) ) has event Moved

So exactly the same ...

I really need your help. You'll find the EFSW project I use, here :

https://mega.co.nz/#!IQ0EDZZB!BAR8vwK8cnDWo05hpIJ_BhOkXgg0CaFNr0zsEPDMWYU

With these sources, what result do you have ?

Do you have any other idea ?

Thanks a lot in advance,

B.

@SpartanJ
Owner Author

SpartanJ commented Nov 5, 2013

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


Wait... your project file is from OS X, and I was testing on Windows... so your problems are now on OS X?
Give me a few minutes and I'll test on OS X ( but I tested it earlier in this same thread and it was working fine ).

@SpartanJ
Owner Author

SpartanJ commented Nov 5, 2013

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


I'm getting the correct code:

#!c++

DIR (/Users/charly/Downloads/efsw_project/bin/test/) FILE (from file pépé to pèpè (70ffffffc3ffffffa870ffffffc3ffffffa8) ) has event Moved

What I'm thinking is that your OS X file system is using a different encoding for file names.
I've read some articles about that, but I'm not sure how to handle it right now.
What I need you to do is:
run python from the terminal and enter:
import sys
print(sys.getfilesystemencoding())

(if you're on Mavericks and python crashes running this, fix it with the instructions from here: http://stackoverflow.com/questions/19569143/python3-segmentation-fault-on-osx-mavericks ).
Then tell me what you get; I expect something other than utf-8.

It must be something similar to these problems:
https://bugzilla.mozilla.org/show_bug.cgi?id=703161
http://stackoverflow.com/questions/9757843/unicode-encoding-for-filesystem-in-mac-os-x-not-correct-in-python
http://apple.stackexchange.com/questions/10476/how-to-enter-special-characters-so-that-bash-terminal-understands-them

I'm a little too busy to look for a fix right now, so I'll need you to help me with this, or just wait a bit until I get some time to read about it. I don't even own a Mac, so it's not that easy for me to look into this.

Regards,
Martín

@SpartanJ
Owner Author

SpartanJ commented Nov 5, 2013

Original comment by Batte HUCHAI (Bitbucket: bhuchai, ).


#!

iMac-de-B:efsw_test bb$ python
Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; import os; print sys.getfilesystemencoding()
utf-8

:(

Indeed, this link seems to be interesting :

http://stackoverflow.com/questions/9757843/unicode-encoding-for-filesystem-in-mac-os-x-not-correct-in-python

@SpartanJ
Owner Author

SpartanJ commented Nov 6, 2013

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


I think the problem comes from the file system encoding. Please run a test that converts the filename string from NFD to NFC; here's a function that I got from stackoverflow:

#!c++

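#include <CoreFoundation/CoreFoundation.h>
#include <string>
#include <vector>

// Converts a UTF-8 file name from decomposed (NFD) to precomposed (NFC) form.
std::string precomposeFilename(const std::string& name)
{
   CFStringRef cfStringRef = CFStringCreateWithCString(kCFAllocatorDefault, name.c_str(), kCFStringEncodingUTF8);
   if (cfStringRef == NULL)
      return name; // input is not valid UTF-8: return it unchanged

   CFMutableStringRef cfMutable = CFStringCreateMutableCopy(kCFAllocatorDefault, 0, cfStringRef);
   CFRelease(cfStringRef);

   CFStringNormalize(cfMutable, kCFStringNormalizationFormC);

   // Size the buffer from the string instead of assuming 255 bytes.
   CFIndex maxSize = CFStringGetMaximumSizeForEncoding(CFStringGetLength(cfMutable), kCFStringEncodingUTF8) + 1;
   std::vector<char> buffer(maxSize);
   bool ok = CFStringGetCString(cfMutable, &buffer[0], maxSize, kCFStringEncodingUTF8);

   CFRelease(cfMutable);

   // Fall back to the original name if the conversion failed.
   return ok ? std::string(&buffer[0]) : name;
}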

It seems to be a very common problem, but I'm not sure whether this is what we're dealing with or something else.
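
One quick way to check whether a reported name is in decomposed form is to look for combining marks in its bytes. The sketch below is a heuristic under stated assumptions, not a normalization check: it only covers the Combining Diacritical Marks block (U+0300 to U+036F, encoded in UTF-8 as cc 80 to cd af) and assumes well-formed UTF-8. The function name hasCombiningDiacritic is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the UTF-8 string contains a code
// point from U+0300..U+036F (Combining Diacritical Marks), which suggests
// a decomposed (NFD-style) name such as "e" + U+0301 instead of U+00E9.
bool hasCombiningDiacritic(const std::string& utf8) {
    for (std::size_t i = 0; i + 1 < utf8.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        const unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
        // U+0300..U+033F encode as cc 80..cc bf.
        if (lead == 0xCC && next >= 0x80 && next <= 0xBF) return true;
        // U+0340..U+036F encode as cd 80..cd af.
        if (lead == 0xCD && next >= 0x80 && next <= 0xAF) return true;
    }
    return false;
}
```

The byte sequence 70 65 cc 81 70 65 cc 80 2e 70 6e 67 reported earlier in the thread would be flagged, while the precomposed form 70 c3 a9 70 c3 a8 2e 70 6e 67 would not.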

Regards,
Martín

@SpartanJ
Owner Author

Original comment by Batte HUCHAI (Bitbucket: bhuchai, ).


It works. Thanks a lot for this last point.

@SpartanJ
Owner Author

Original comment by Martín Lucas Golini (Bitbucket: SpartanJ, GitHub: SpartanJ).


Excellent! I'm glad it worked... at last!
I made a commit with the corresponding changes, so you won't need to "patch" anything; just update the library.

@SpartanJ SpartanJ added the major label Jan 14, 2020
@SpartanJ SpartanJ added the bug Something isn't working label Jan 14, 2020