Might need a special shift-JIS mode? #2

Closed
ExpHP opened this issue Jan 1, 2021 · 3 comments

ExpHP commented Jan 1, 2021

Previously, I thought the code could be entirely agnostic to Shift-JIS vs UTF-8, since non-ASCII bytes can only appear inside comments and strings (where they don't need to be interpreted), and the bytes for ", /, and * never appear as part of another character.

...but no, there are still problems:

  • \ (byte 0x5c) is an important string metacharacter that can appear as the second byte of a Shift-JIS character.
  • When string literals represent paths read by the compiler (e.g. #pragma mapfile "map/all.stdm"), the encoding matters on Windows because Windows path APIs are encoding-aware. (At the moment, I'm writing a method ast::LiteralStr::as_path, which has no real choice but to assume UTF-8 and produce a CompileError on invalid UTF-8 byte sequences.)
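
To make the first problem concrete, here is a minimal Python sketch. Python is used only because its standard library includes a `shift_jis` codec; the helper names `is_sjis_lead_byte` and `find_real_backslashes` are hypothetical and not part of this project. It assumes the usual Shift-JIS lead-byte ranges (0x81-0x9F and 0xE0-0xFC, which also covers common vendor extensions) and shows how a byte-wise scan mistakes a trail byte for a backslash.

```python
def is_sjis_lead_byte(b: int) -> bool:
    """True if b starts a two-byte Shift-JIS character (assumed ranges)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def find_real_backslashes(raw: bytes) -> list:
    """Offsets of 0x5c bytes that are standalone characters, not trail bytes."""
    offsets = []
    i = 0
    while i < len(raw):
        if is_sjis_lead_byte(raw[i]):
            i += 2  # skip the lead byte and its trail byte together
            continue
        if raw[i] == 0x5C:
            offsets.append(i)
        i += 1
    return offsets


juu = "十".encode("shift_jis")
print(juu.hex())  # 8f5c

sample = juu + b"\\n"  # 十 followed by a genuine backslash escape
naive = [i for i, b in enumerate(sample) if b == 0x5C]
print(naive)                          # [1, 2]
print(find_real_backslashes(sample))  # [2]
```

The naive scan reports offset 1, which is the second byte of 十 rather than a string metacharacter.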

ExpHP commented Jan 31, 2021

Indeed, this is a phenomenon present all over the files. For instance the character 十 (seen in TH06 msg1.dat) ends with byte 0x5c.

But I noticed the most bizarre thing. Even though this byte is present in this file (and the parser rejects invalid escape sequences), my current local prototype of thmsg is able to decompile and recompile the file back identically without any errors! It turns out the reason is that (get this) during the decompilation step, it's escaping the 0x5c byte in that character. So, 十 in shift-jis in a binary file gets decompiled to 十\ in shift-jis, which gets compiled back to 十!
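
The accidental round trip can be reproduced with a rough byte-level sketch. The functions `escape_bytes` and `unescape_bytes` are hypothetical stand-ins for the decompile and compile steps, not thmsg's actual code, and they handle only the backslash escape.

```python
def escape_bytes(raw: bytes) -> bytes:
    """Encoding-unaware 'decompile' step: escape every 0x5c byte."""
    return raw.replace(b"\\", b"\\\\")


def unescape_bytes(text: bytes) -> bytes:
    """Encoding-unaware 'compile' step: collapse each escaped backslash."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == 0x5C:
            # Only the escaped-backslash form is accepted in this sketch.
            if text[i + 1:i + 2] != b"\\":
                raise ValueError(f"invalid escape at offset {i}")
            out.append(0x5C)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return bytes(out)


binary = b"\x8f\x5c"           # 十 as stored in the binary file
script = escape_bytes(binary)  # what lands in the text script
print(script.hex())            # 8f5c5c
print(unescape_bytes(script) == binary)  # True
```

Because both steps are equally unaware of the encoding, their mistakes cancel out, which is why no error ever surfaces. Feeding the unescaped bytes `8f 5c` straight to `unescape_bytes` does raise, which is the failure the opt-in plan was counting on.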


I see this as somewhat unfortunate. The plan was to make support for shift-jis "opt-in," with the expectation that thmsg would readily fail on many vanilla files, with an error that directs users to try supplying the shift-jis flag. Unfortunately, since it currently does not fail on these inputs, there's no good way to inform users that the program is most likely doing something mildly incorrect/weird on a fraction of their input files.

Bah. Perhaps we should enforce that text script files are always UTF-8, and fail on decompilation if we don't have UTF-8 strings. :/


ExpHP commented Feb 12, 2021

Okay, I know what I want to do:

  • Require text scripts to be UTF-8.
  • Have binary files be Shift-JIS by default, unless some flag like -u or --encoding=... is given.
  • Strings representing file paths in binary files must be ASCII, and this will be checked. (mostly because, if we transcode them, then the whole \ <-> ¥ thing becomes our problem, whereas if we encode them as ASCII, then the path separator will correctly be encoded as 0x5c)
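
As a rough illustration of these three rules, the sketch below uses hypothetical helpers (`read_text_script`, `decode_binary_string`, `check_binary_path`); none of these names come from the project, and the `encoding` parameter merely stands in for whatever flag ends up being chosen.

```python
def read_text_script(raw: bytes) -> str:
    """Text scripts must be UTF-8; anything else is an error."""
    return raw.decode("utf-8")  # raises UnicodeDecodeError on invalid input


def decode_binary_string(raw: bytes, encoding: str = "shift_jis") -> str:
    """Strings in binary files default to Shift-JIS unless overridden."""
    return raw.decode(encoding)


def check_binary_path(raw: bytes) -> str:
    """Paths in binary files must be pure ASCII, so 0x5c stays a separator."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as e:
        raise ValueError(f"non-ASCII byte in path at offset {e.start}") from e


print(decode_binary_string(b"\x8f\x5c"))     # 十
print(check_binary_path(b"map\\all.stdm"))   # map\all.stdm
```

Since paths are never transcoded, a backslash in a path is always the single byte 0x5c, and the \ versus ¥ question never arises for them.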

Why have binary files be Shift-JIS by default rather than UTF-8 by default? The reason is twofold:

  • Using the game's native encoding is the least assuming choice. We don't have to assume thcrap is present.
  • Better errors. If you need to compile to a Shift-JIS file but the default is UTF-8, nothing will ever tell you that you're doing it wrong, and then you may not understand why the resulting file breaks the game. In contrast, if you need a UTF-8 file but the default is Shift-JIS, you'll get an error when it tries to encode a character that's not present in Shift-JIS (and this error can tell you about the -u flag).

For decompiling, one may wonder, why not auto-detect? Well, (a) perfect discrimination between Shift-JIS and UTF-8 is impossible, due to \ and ~ (bytes 0x5c and 0x7e, which are valid in both encodings but which Shift-JIS assigns to ¥ and ‾). (b) Consistency with compilation is nice. And (c) to be entirely honest, decompiling a non-Shift-JIS binary file is such a rare thing that people ought to be aware of when they're doing it.
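
The "better errors" argument can be sketched as follows. `encode_for_binary` is a hypothetical helper, and the hint in its message refers to the -u flag proposed above, not to an existing option.

```python
def encode_for_binary(text: str, encoding: str = "shift_jis") -> bytes:
    """Encode a string for a binary file, turning failures into a hint."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as e:
        bad = text[e.start:e.end]
        raise ValueError(
            f"cannot encode {bad!r} as {encoding}; "
            "if the target expects UTF-8, try -u"
        ) from e


# Shift-JIS default: text outside Shift-JIS fails loudly, with a hint.
try:
    encode_for_binary("한글")
except ValueError as err:
    print(err)

# UTF-8 default: Japanese text always encodes, so a wrong choice is silent.
print(encode_for_binary("十", encoding="utf-8").hex())  # e58d81

# Auto-detection: ASCII-range bytes are valid under both encodings.
ambiguous = b"a\\b~"
ambiguous.decode("utf-8")
ambiguous.decode("shift_jis")
```

Only the Shift-JIS default ever produces a diagnostic; the UTF-8 default hands back bytes that the unpatched game would simply misread.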


ExpHP commented Feb 15, 2021

This was fixed in a recent push. In fact, what we're missing right now is a UTF-8 mode (for binary file strings)!

ExpHP closed this as completed Feb 15, 2021