UTF-8 Incompatibilities #181

Open
RobertBColton opened this Issue Dec 13, 2014 · 3 comments

2 participants
@RobertBColton
Collaborator

RobertBColton commented Dec 13, 2014

So after enabling Unicode support for ENIGMA's windres.exe, I discovered a bug of some sort in LGM.

If I use ALT+0169 to add the copyright symbol in LGM's game settings dialog, save the file, and load it in GM8.1, the text contains the stray 'A' symbol that we were also seeing in ENIGMA. If I remove that 'A' symbol, save the file with GM8.1, and reopen it with LGM, LGM shows a '?' placeholder for a missing Unicode character.

Codepage Fix Works
Unicode Breaks with GM81
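
Both symptoms are what a UTF-8 / Windows-1252 mismatch would produce. The following is a minimal Java sketch, assuming GM8.1 treats the text as Windows-1252 while LGM writes and reads UTF-8 (an assumption at this point in the thread, not a confirmed fact). The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // ASSUMPTION: the legacy codepage GM8.1 uses on a Western-European Windows install.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Encodes text with one charset, then decodes the same bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String copyright = "\u00A9"; // the copyright sign typed with ALT+0169

        // Written as UTF-8 (bytes C2 A9), read as CP-1252:
        // an extra U+00C2 ('A' with circumflex) shows up before the sign.
        String gmView = reinterpret(copyright, StandardCharsets.UTF_8, CP1252);
        System.out.println(gmView.equals("\u00C2\u00A9")); // true

        // Written as CP-1252 (single byte A9), read as UTF-8:
        // a lone continuation byte is malformed and decodes to U+FFFD.
        String lgmView = reinterpret(copyright, CP1252, StandardCharsets.UTF_8);
        System.out.println(lgmView.equals("\uFFFD")); // true
    }
}
```

Most fonts render U+FFFD as a '?' style placeholder, which would match what LGM displays.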

Collaborator

RobertBColton commented Dec 13, 2014

@IsmAvatar This may not actually be an issue; it happens because GM8.1 did not support Unicode. Studio and LateralGM share GMX files with UTF-8 encoding just fine, and LGM shares the GMK format with Unicode characters just fine.

GM Studio Supports Unicode

Then again, GM8.1 can share Unicode symbols with itself, and when these files are opened in LGM they show '?' placeholders. @IsmAvatar Does this suggest GMK uses a different encoding than UTF-8?

GM81 Unicode Symbols Are Fine
LGM Does Not Recognize Unicode Symbols

Collaborator

RobertBColton commented Dec 13, 2014

@IsmAvatar I found the issue https://github.com/IsmAvatar/LateralGM/blob/master/org/lateralgm/file/GmFileReader.java#L195

From what I can tell, GM8.1 uses a different encoding than UTF-8. This code existed before I got here.

I found the following online when searching for evidence that GM8.1 supported UTF-8 encoding.
http://enigma-dev.org/forums/index.php?topic=810
Did you guys even bother to test? I tested all this on 8.1.65

<RobertWindows> whether gm81 had utf-8 support
<RobertWindows> JoshDreamland, http://enigma-dev.org/forums/index.php?topic=810
<RobertWindows> because it doesn't seem to be the case
<JoshDreamland> it's not about GM8 supporting it
<JoshDreamland> GM8 strings are length-prefixed binary blobs
<JoshDreamland> they are encoding-insensitive
<JoshDreamland> the question is whether Game Maker 8.1 itself supports Unicode
<RobertWindows> it does
<RobertWindows> if i enter the unicode symbols in gm8.1
<RobertWindows> and save and reopen the file they are fine
<RobertWindows> it just doesn't share them with lgm properly
<JoshDreamland> that doesn't mean it uses UTF-8
<JoshDreamland> nor any kind of unicode
<JoshDreamland> CP-1512 supports that shit, too
<RobertWindows> right
<RobertWindows> so why would we utf-8 encode the gmk format then?
<RobertWindows> or even try to read it that way?
<JoshDreamland> er
<JoshDreamland> 1252
<JoshDreamland> because CP-1252 does not support, eg, ♥
<JoshDreamland> it does fine with ¤ and © and ® and whatnot
<JoshDreamland> but throw it a ♪, and it's lost
<RobertWindows> i just copied and pasted what you said
<RobertWindows> and you're correct
<RobertWindows> the heart doesn't show in gm8.1's script editor but the other symbols do
<JoshDreamland> ...?
<JoshDreamland> welcome to Windows encoding
<JoshDreamland> for your reference, this is it: http://upload.wikimedia.org/wikipedia/commons/e/e7/Windows-1252.svg
<RobertWindows> it printed that music symbol as aj
<JoshDreamland> the ENTIRE CP-1252 spectrum
<RobertWindows> lowercase j
<JoshDreamland> as it should
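
The coverage claim in the log can be checked from Java. This is a small sketch with hypothetical class and method names: it asks the JDK's Windows-1252 encoder whether a character is representable, and shows that keeping only the low byte of ♪ (U+266A) yields a lowercase j. The high byte, 0x26, is '&', which may be what the "&j" remark later in the log refers to.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class Cp1252Coverage {
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** True if the single UTF-16 code unit has a byte value in Windows-1252. */
    static boolean representable(char c) {
        CharsetEncoder encoder = CP1252.newEncoder();
        return encoder.canEncode(c);
    }

    /** Keeps only the low 8 bits of the code unit, as a narrowing cast to a byte would. */
    static char lowByte(char c) {
        return (char) (c & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(representable('\u00A4')); // currency sign: true
        System.out.println(representable('\u00A9')); // copyright sign: true
        System.out.println(representable('\u00AE')); // registered sign: true
        System.out.println(representable('\u2665')); // black heart: false
        System.out.println(representable('\u266A')); // eighth note: false
        System.out.println(lowByte('\u266A'));       // j (0x6A)
    }
}
```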

Conversation continues:

<RobertWindows>    if (forceCharset == null)
<RobertWindows>     {
<RobertWindows>     //if (ver >= 810)
<RobertWindows>      //in.setCharset(Charset.forName("UTF-8"));
<RobertWindows>     //else
<RobertWindows>      in.setCharset(Charset.defaultCharset());
<RobertWindows>     }
<RobertWindows>    else
<RobertWindows>     in.setCharset(forceCharset);
<RobertWindows> ok
<RobertWindows> JoshDreamland, so why are we forcing the utf-8 charset when reading this format?
<JoshDreamland> &j
<JoshDreamland> js: "\u266a"
<JoshDreamland> ♪
<JoshDreamland> and that's the Unicode difference™
<RobertWindows> or writing
<JoshDreamland> because Unicode is the correct approach
<JoshDreamland> the year is 2014
<RobertWindows> yes
<JoshDreamland> *nearly* 2015
<RobertWindows> JoshDreamland, and GMX supports UTF-8
<JoshDreamland> if GM8 doesn't support UTF-8 in its editor, that's just sad
<JoshDreamland> we have a policy at work
<RobertWindows> so why do we want to force incompatibilities with gmk which is also as dead as windows encoding?
<JoshDreamland> if pasting unicode symbols makes something clearer, do it
<RobertWindows> JoshDreamland, it doesn't, studio added unicode support
<RobertWindows> *true unicode support
<JoshDreamland> because in this day and age, if it doesn't support UTF-8, that's a bug
<JoshDreamland> well, that's fine
<RobertWindows> ok well gm8.1 was last released for windows 7
<RobertWindows> so yeah
<JoshDreamland> now you know my answer, then
<RobertWindows> it doesn't
<RobertWindows> JoshDreamland, so I should remove the UTF-8 stuff for GMK?
<RobertWindows> so people can properly import GMK's?
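
To make the charset branch pasted in the log easier to reason about, here is a standalone sketch of the same decision, written without the lines commented out, together with a reader for one length-prefixed string. It is not LGM's actual reader: the class and method names are hypothetical, and the 4-byte little-endian length prefix is an assumption, since the thread only says the strings are length-prefixed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GmkStringSketch {
    /** An explicit override wins; otherwise the file version decides. */
    static Charset chooseCharset(int ver, Charset forceCharset) {
        if (forceCharset != null) {
            return forceCharset;
        }
        return ver >= 810 ? StandardCharsets.UTF_8 : Charset.defaultCharset();
    }

    /** Reads one string. ASSUMPTION: a 4-byte little-endian length, then that many raw bytes. */
    static String readString(ByteBuffer in, Charset charset) {
        in.order(ByteOrder.LITTLE_ENDIAN);
        int length = in.getInt();
        byte[] raw = new byte[length];
        in.get(raw);
        // The bytes on disk carry no encoding information; the charset is the reader's choice.
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Length 2, followed by the UTF-8 bytes of the copyright sign.
        byte[] blob = {2, 0, 0, 0, (byte) 0xC2, (byte) 0xA9};
        Charset charset = chooseCharset(810, null);
        System.out.println(readString(ByteBuffer.wrap(blob), charset).equals("\u00A9")); // true
    }
}
```

Passing a non-null second argument to `chooseCharset` corresponds to the `forceCharset` path in the pasted snippet.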

So we are now looking for ways to make this optional.

<RobertWindows> JoshDreamland, yes but those games would not have been gm8.1 games
<RobertWindows> and if they upgrade them to lgm
<RobertWindows> they can use gmx or egm
<JoshDreamland> EGM is broken
<RobertWindows> how so?
<JoshDreamland> GMX is a bit buggy
<RobertWindows> EGM is fine i just managed to corrupt all old egms
<JoshDreamland> EGM seems to be missing lots of features, and it writes lots of shit as binary
<RobertWindows> but if u send them to me i can fix them
<RobertWindows> ive updated most on the site
<RobertWindows> ok
<JoshDreamland> which makes it volatile at best
<RobertWindows> yes
<RobertWindows> on the bright side gmx is adding version numbers
<JoshDreamland> I personally don't use EGM because I know it will be broken in the future
<RobertWindows> JoshDreamland, how about a preference?
<JoshDreamland> to do what?
<RobertWindows> force utf-8 encoding for gmk
<RobertWindows> JoshDreamland, ?
<RobertWindows> then a warning
<JoshDreamland> you *can* ask when someone presses "Save as"
<RobertWindows> "saving this file with utf-8 encoding may cause some corruption when loading into older gm versions"
<JoshDreamland> but then, how will you know which to use when reading?
<RobertWindows> well
<RobertWindows> hmmm
<JoshDreamland> I don't think Java has built-in encoding detection
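
The JDK does not guess encodings, but it can report whether a byte sequence is well-formed UTF-8. One possible heuristic, sketched here with hypothetical names and with Windows-1252 as an assumed fallback, is to decode strictly as UTF-8 and fall back only when that fails.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Sniffer {
    // ASSUMPTION: the legacy codepage to fall back to.
    static final Charset FALLBACK = Charset.forName("windows-1252");

    /** True if the bytes are well-formed UTF-8. Pure ASCII also passes. */
    static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes as UTF-8 when the bytes allow it, otherwise as the fallback codepage. */
    static String decodeGuessing(byte[] raw) {
        return new String(raw, isValidUtf8(raw) ? StandardCharsets.UTF_8 : FALLBACK);
    }

    public static void main(String[] args) {
        byte[] utf8Copyright = {(byte) 0xC2, (byte) 0xA9};
        byte[] cp1252Copyright = {(byte) 0xA9};
        System.out.println(decodeGuessing(utf8Copyright).equals("\u00A9"));   // true
        System.out.println(decodeGuessing(cp1252Copyright).equals("\u00A9")); // true
    }
}
```

This is only a heuristic: some Windows-1252 byte sequences happen to be valid UTF-8 and would be misread, and it still requires choosing a fallback codepage.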

Looking for solutions.

<RobertWindows> if (in.read5() = "UTF-8") {
<RobertWindows>   forceutf8 = true;
<RobertWindows> } else {
<RobertWindows>   in.reset();
<RobertWindows> }
<RobertWindows> JoshDreamland, why can't we just do that?
<JoshDreamland> whence are you reading that?
<RobertWindows> at the very very beginning, before everything
<JoshDreamland> that will break GM8
<RobertWindows> either the first 5 bytes or w/e matches our utf-8 identifier or it doesn't
<RobertWindows> oh
<RobertWindows> right
<JoshDreamland> making the user choose between "break GM8 or break unicode symbols" is a stupid ultimatum if the user isn't using unicode symbols
<RobertWindows> what? no it isnt because they arent using unicode so they can break unicode
<RobertWindows> anyway
<RobertWindows> JoshDreamland, why not just do that for the first string?
<RobertWindows> why do you want to break ♪ ?
<JoshDreamland> a "clean" hack does nothing irreversible to the GMK, and a real solution does nothing at all to the GMK if no unicode is used
<JoshDreamland> you can't do encoding detection on just one string
<JoshDreamland> one string is likely to contain no non-ASCII characters
<JoshDreamland> and ascii characters look identical in all 8-bit encodings
<RobertWindows> ugh
<RobertWindows> this is a pain in the ass
<RobertWindows> JoshDreamland, shall we postpone this debate until ismavatar comments on github?
<JoshDreamland> that's fine
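
The point about ASCII can be made concrete. A short sketch with hypothetical names: an ASCII-only string encodes to exactly the same bytes in UTF-8 and Windows-1252, so inspecting it says nothing about which encoding wrote the file, while a string containing © does differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class AsciiAmbiguity {
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** True if the text encodes to identical bytes in UTF-8 and Windows-1252. */
    static boolean sameBytesInBoth(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.UTF_8), text.getBytes(CP1252));
    }

    public static void main(String[] args) {
        System.out.println(sameBytesInBoth("Game Settings")); // true: nothing to detect
        System.out.println(sameBytesInBoth("\u00A9 2014"));   // false: C2 A9 versus A9
    }
}
```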
Owner

IsmAvatar commented Dec 13, 2014

Did you guys even bother to test?

No. The new version of GM had come out, and utf8 was new to it, so I was rushing to code everything to support it. I did enough testing to ensure the file format worked at all, but overlooked utf8 because I wasn't entirely sure how to test it. I tested some foreign characters and it seemed to work better than it did before. Also, it was a 1-man show at that point.

Feel free to improve it as you see fit. I'm not too concerned about breaking backward compatibility with GM7 or such since it can be hard to tell what encoding a string is in, so frankly people shouldn't have UTF8 strings and expect them to work in versions of GM that don't support UTF8 anyways.
