Might need a special shift-JIS mode? #2
Comments
Indeed, this phenomenon is present all over the files. For instance, the character 十 (seen in TH06 msg1.dat) ends with byte 0x5c. But I noticed the most bizarre thing: even though this byte is present in the file (and the parser rejects invalid escape sequences), my current local prototype of thmsg is able to decompile and recompile the file back identically without any errors! It turns out the reason is that (get this) during the decompilation step, it's escaping the 0x5c byte in that character.

So, I see this as somewhat unfortunate. The plan was to make Shift-JIS support "opt-in," with the expectation that thmsg would readily fail on many vanilla files, with an error that directs users to try supplying the Shift-JIS flag. Unfortunately, since it currently does not fail on these inputs, there's no good way to inform users that the program is most likely doing something mildly incorrect/weird on a fraction of their input files.

Bah. Perhaps we should enforce that text script files are always UTF-8, and fail on decompilation if we don't have UTF-8 strings. :/
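To make the 0x5c situation concrete, here is a minimal Rust sketch (not thmsg's actual code; the function names are made up for illustration). It assumes the standard Shift-JIS lead-byte ranges 0x81..=0x9F and 0xE0..=0xFC, and compares an encoding-agnostic scan for backslashes with one that skips over two-byte characters. 十 is encoded as 0x8F 0x5C in Shift-JIS, so the naive scan reports a backslash inside it.

```rust
/// Returns true if `b` can start a two-byte Shift-JIS character.
/// (Assumed ranges: 0x81..=0x9F and 0xE0..=0xFC.)
fn is_sjis_lead(b: u8) -> bool {
    matches!(b, 0x81..=0x9F | 0xE0..=0xFC)
}

/// Offsets of every 0x5C byte, the way an encoding-agnostic scanner sees them.
fn naive_backslashes(bytes: &[u8]) -> Vec<usize> {
    bytes
        .iter()
        .enumerate()
        .filter_map(|(i, &b)| if b == 0x5C { Some(i) } else { None })
        .collect()
}

/// Offsets of 0x5C bytes that are standalone characters, i.e. not the
/// trail byte of a two-byte Shift-JIS character.
fn standalone_backslashes(bytes: &[u8]) -> Vec<usize> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        if is_sjis_lead(bytes[i]) && i + 1 < bytes.len() {
            // Skip the lead byte and its trail byte together.
            i += 2;
        } else {
            if bytes[i] == 0x5C {
                out.push(i);
            }
            i += 1;
        }
    }
    out
}
```

For the bytes `[0x8F, 0x5C, 0x5C, 0x6E]` (十 followed by a real `\n` escape), the naive scan reports offsets 1 and 2, while the Shift-JIS-aware scan reports only offset 2.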
Okay, I know what I want to do:
Why have binary files be Shift-JIS by default rather than UTF-8 by default? The reason is twofold:
This was fixed in a recent push. In fact, what we're missing right now is a UTF-8 mode (for binary file strings)!
Previously, I thought the code could be entirely agnostic to Shift-JIS vs UTF-8, since non-ASCII bytes can only appear inside comments and strings (where they don't need to be interpreted), and `"` and `/*` never appear as infixes of another character... but no, there are still problems:

- `\` (byte 0x5c) is an important string metacharacter that can appear as the second byte of a Shift-JIS character.
- For strings that name files (e.g. `#pragma mapfile "map/all.stdm"`), the encoding matters on Windows because Windows path APIs are encoding-aware. (At this moment, I'm writing a method `ast::LiteralStr::as_path` which has no real choice but to assume UTF-8 and produce a CompileError on invalid UTF-8 byte sequences.)