Unicode issues in console #334
Comments
Actually, string.char doesn't deal in Unicode, at least not directly... It's more accurate to think of it and string.byte as attempting to work with 8-bit (binary) ASCII, and everything with the 8th bit set has always been wildly machine dependent. If you're using a development version of Hammerspoon, the latest hs.utf8_53 has codepointToUTF8(codepoint) added, which does take codepoint numbers and convert them to proper UTF8 sequences... If you're sticking with formal releases, here is the relevant function to tide you over until the next release:
And I do agree that crashing, even on bad data, is bad... I'll add it to the list... I have a few other ways to crash the console that I'm looking into...
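For readers following along, a minimal sketch of the codepoint-to-UTF8 conversion discussed above. This is not the function originally posted and not the hs.utf8_53 implementation; the name codepointToUTF8Sketch is hypothetical, and the U+FFFD substitution for surrogates and out-of-range values is an assumption based on the behaviour described later in the thread. It uses only plain arithmetic so it should run on Lua 5.1 through 5.4.

```lua
-- Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
-- Surrogates (U+D800..U+DFFF) and out-of-range values become U+FFFD.
local function codepointToUTF8Sketch(cp)
  if type(cp) ~= "number" or cp < 0 or cp > 0x10FFFF
     or (cp >= 0xD800 and cp <= 0xDFFF) then
    cp = 0xFFFD
  end
  cp = math.floor(cp)
  if cp < 0x80 then
    -- 1 byte: 0xxxxxxx
    return string.char(cp)
  elseif cp < 0x800 then
    -- 2 bytes: 110xxxxx 10xxxxxx
    return string.char(0xC0 + math.floor(cp / 0x40),
                       0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return string.char(0xF0 + math.floor(cp / 0x40000),
                       0x80 + math.floor(cp / 0x1000) % 0x40,
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end
```

Unlike string.char(215), which yields the single byte D7, codepointToUTF8Sketch(215) yields the two bytes C3 97, which is the UTF-8 encoding of "×".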
|
@asmagill were you able to reproduce the crash with current trunk? I was not - it doesn't print anything, but it also doesn't crash. |
With the development build, I get:
11 switch string.char to hs.utf8.codepointToUTF8, and it goes all the way up to 255... It's as if string.char or print is causing the for-loop to "break" on unprintable chars (well, invalid UTF8 ones)... I would be interested in trying this in 5.2, later -- I have a couple of other crashes to pinpoint when the behavior changed (see issue about using hs.inspect on hs.application and hs.window... I forget the number and don't have a browser in front of me right now) and hope to look into it further later today. At any rate, if this is the 5.3 behavior, it seems we may be avoiding a crash, but we'll need to identify if the "break" is being caused by string.char or print... if it's print, I'm ok with it. If it's string.char, then we will have lost at least one way to craft binary data within lua and will have to figure out another at some point... I don't think it's a priority, but I have done it before, so I expect I'll want to do so again sometime...
|
Yes, by 'Unicode-agnostic' I meant Lua strings are just a byte buffer. The 'break' seems to be one example of weird behaviour. I'd bet that the problem is in the print part - or more precisely, in going back to Unicode-world Cocoa in the hs console with non-proper Unicode strings (e.g. a dangling byte of a multi-byte utf8 char) - as if the text area or whatever is going on in the console remains "waiting" for further bytes to complete the code point, instead of respecting string termination and dealing with the garbage. Naked Lua (at least 5.1) in a (Unicode) terminal, as expected, will print the standard 'non-printable character' and carry on. FWIW, this is what I mean with "dangling byte" and problems with string.sub as per OP - string.char just made for a shorter demo:
s = '3×4' -- "×" is two bytes in utf8: C3 97
print(string.sub(s,1,3)) -- prints "3×"
print(string.sub(s,1,1)) -- prints "3"
print(string.sub(s,1,2)) -- explodes
|
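As a user-side workaround for the dangling-byte case above, a rough sketch of a byte-limited prefix that never cuts a character in half. The name utf8SafePrefix is hypothetical, and the sketch assumes the input is valid UTF-8 to begin with; it is not part of Hammerspoon.

```lua
-- Hypothetical helper: like s:sub(1, maxBytes), but never ends mid-character.
-- Assumes s is valid UTF-8 before truncation.
local function utf8SafePrefix(s, maxBytes)
  if maxBytes <= 0 then return "" end
  if maxBytes >= #s then return s end
  local cut = maxBytes
  -- If the byte just after the cut is a continuation byte (10xxxxxx),
  -- the cut splits a character, so back up until it falls on a boundary.
  while cut > 0 do
    local nextByte = s:byte(cut + 1)
    if nextByte < 0x80 or nextByte >= 0xC0 then break end
    cut = cut - 1
  end
  return s:sub(1, cut)
end
```

With s = '3×4', utf8SafePrefix(s, 2) returns "3" rather than "3" followed by a lone C3 byte, so the result can be passed to print without leaving a dangling byte.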
A 'break' out of a loop is odd/problematic behavior, but workable if we can identify it and either work around it or at least be aware of it... a 'crash' to me suggests that Hammerspoon itself stops working altogether... a significantly bigger concern, at least in my mind. As an example of the distinction I mean, I've seen odd output, but not a crash, with hs.inspect(hs.utf8_53) (or hs.inspect(utf8) with the latest development build) because of the "binary" data in charPattern... it's not valid UTF8, so the console just prints '(null)'... This I consider reasonable/bearable, especially as a wrapper function like:
Allows 'asciiOnly(hs.inspect(hs.utf8_53))' to display something more readable. Just so we're on the same page, which behavior were you seeing? wrong output/odd loop/function termination or actual Hammerspoon crashing?
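One way such an asciiOnly wrapper might look, as a sketch only: the wrapper actually posted in this comment is not shown here, so the escaping format (\xXX for every byte outside printable ASCII, leaving tab, newline and carriage return alone) is an assumption.

```lua
-- Hypothetical reconstruction, not the original wrapper from the comment.
-- Replaces each byte outside printable ASCII (tab, newline and carriage
-- return excepted) with a literal \xXX escape, so the result is valid UTF-8.
local function asciiOnly(s)
  -- The extra parentheses drop gsub's second return value (the match count).
  return (s:gsub("[^\t\n\r\32-\126]", function(c)
    return string.format("\\x%02X", c:byte())
  end))
end
```

Because the output contains only ASCII, the console's UTF8 conversion cannot fail on it, and the original bytes can still be read off the escapes.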
|
Interestingly, I get different behavior if I'm attempting to print string.char(226) within the console, or within the hs shell... From the console (I was attempting to see if our replacement print itself was having problems with the non-unicode character, hence the use of pcall):
but from hs in a terminal window:
followed by a crash of Hammerspoon... This tells me that something about the print substitute used in hs.ipc.handler is having a problem... I'll add that to my list and look into it later today/tomorrow.
|
Assuming that the lua core itself is truly agnostic:
And that no changes have actually been made to the pod/fabric/library/whatever HS includes lua as, then the problem is somewhere along the lines where we convert things to UTF8 for passing around in the Hammerspoon application code... considering that it seems like the CStringToUTF8 conversion is likely returning either an error or NULL, I'm actually surprised that all the console does is break out of loops... Obvious fix -- check explicit conversions for success/failure. Less obvious, what to do with failure... treat as raw binary code? convert to \xXX in place if it isn't a valid UTF8 character sequence? throw error? Throwing an error is probably extreme... if standalone lua processes it, we should too in the same way. (BTW, the loop above works fine in standalone lua -- it just displays ? for numbers 128 and up -- no 'break' from the loop.) |
FWIW: in my case (using HS binary 0.9.31, not the repo) the offending module (the one attempting to print a "non-utf8-valid" byte sequence to the console) would cease all its output to the console and, as far as I can tell, would just stop working at that point; meanwhile other modules would carry on as usual, including happily printing along. However there would be soon enough an inevitable HS crash. I'm the world's worst expert when it comes to both Unicode and C/ObjC/Cocoa, but it seems to me that it should be OSX/Cocoa's job to handle |
Ok, with regards to the console, what is happening is that core_logmessage tries to convert the output to UTF8, receives null, and then sends an empty string to the console instead. Changing the encoding type to NSNonLossyASCIIStringEncoding, which one site suggested, didn't help. And since my example above does 'print(i,string.char(i))' the whole output line is junked. Change it to 'print(i); print(string.char(i))' and you can see that the iteration continues. At least this confirms that lua is working with the non-printable bytes properly -- you can build a string of "bad" bytes and check its length, output it to a file, etc. You just can't print it. How big of a deal is this? Should we at least have it print a warning to the console when it can't properly encode the output? I'm still looking at IPC and the shell application... I don't think it's just one place that needs to be checked for valid output-able data. |
Confirmed the crash is gone in the current trunk (was fixed by 3624d71); and I stand corrected: I couldn't find a way of delegating the problem to the OS. From http://stackoverflow.com/a/4985534:
(There is a way to do a roundtrip conversion NSString->NSData->NSString via dataUsingEncoding:allowLossyConversion:, but that requires a NSString to begin with, so chicken and egg.) Ideally HS would use some 3rd party library to deal with this (see https://gist.github.com/cherpake/4709652 and http://www.gnu.org/software/libiconv/), but failing that it should print a warning (or at least the replacement character as a 'hint') when the conversion fails. |
@lowne Do we need to start with an NSString to use allowLossyConversion? Could we not feed the C string directly into NSData and go from there? |
As long as it's only affecting the output to the console, I'm ok with either solution... The actual data internally should stay the same... |
Came across this that might be of interest (http://notebook.kulchenko.com/programming/fixing-malformed-utf8-in-lua):
Now, if I do the following:
Then b will be safely printable and c will contain indexes into a where invalid bytes were found. Only semi-related, I guess I didn't look at the lua 5.3 utf8 library close enough... utf8.char is almost the same as my addition of hs.utf8.codepointToUTF8... mine handles invalid utf8 codepoints (e.g. the surrogates) better by doing the 0xFFFD replacement for you, while the builtin utf8.char still tries to convert it and ends up with something unprintable; but as long as you stick with valid UTF8 codepoints, it works just as well. I may revisit my addition, as well as see if I can implement a simpler version of the above in hs.utf8. |
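For illustration, a sketch of a fixer with the shape described above: it returns a printable copy plus the byte positions where invalid bytes were found. This is not the code from the linked article and not anything in hs.utf8; the name fixUTF8Sketch and the default "?" replacement are assumptions.

```lua
-- Hypothetical helper, not the code from the linked article.
-- Returns a copy of s with each invalid byte replaced by `repl`
-- (default "?"), plus a list of the byte positions that were replaced.
local function fixUTF8Sketch(s, repl)
  repl = repl or "?"
  local out, bad = {}, {}
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len, lo, hi = 0, 0x80, 0xBF      -- lo/hi bound the second byte
    if b < 0x80 then len = 1
    elseif b >= 0xC2 and b <= 0xDF then len = 2
    elseif b >= 0xE0 and b <= 0xEF then
      len = 3
      if b == 0xE0 then lo = 0xA0 end      -- reject overlong forms
      if b == 0xED then hi = 0x9F end      -- reject surrogates
    elseif b >= 0xF0 and b <= 0xF4 then
      len = 4
      if b == 0xF0 then lo = 0x90 end      -- reject overlong forms
      if b == 0xF4 then hi = 0x8F end      -- reject values above U+10FFFF
    end
    -- A sequence is valid only if it fits in the string and every
    -- trailing byte is a continuation byte within the allowed range.
    local ok = len > 0 and i + len - 1 <= n
    for k = 1, (ok and len - 1 or 0) do
      local c = s:byte(i + k)
      local cl, ch = 0x80, 0xBF
      if k == 1 then cl, ch = lo, hi end
      if c < cl or c > ch then ok = false; break end
    end
    if ok then
      out[#out + 1] = s:sub(i, i + len - 1)
      i = i + len
    else
      out[#out + 1] = repl
      bad[#bad + 1] = i
      i = i + 1
    end
  end
  return table.concat(out), bad
end
```

Called as local b, c = fixUTF8Sketch(a), b is safe to print and c lists the indexes into a where bytes were replaced; a itself is left untouched.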
@cmsj I'd like to say "no" as dataUsingEncoding seems to be a method (?) on NSString, then again my understanding of ObjC etc. can be safely rounded to zero, so the truth is I have no idea. As far as I'm concerned, for this issue it would be enough to provide EDIT: I forgot (several times) to add: it seems reasonable to assume that this issue affects Lua string->NSString conversion in general; printing to the console is the case I found (and a 'lazy' solution like above is sufficient there); but I don't know/understand HS ObjC internals sufficiently to tell if it's the only instance - nor I have tested anything else (e.g. |
+1 having a UTF8 fixer-upper in hs.utf8. We may still need to figure out a C version, for the Console window, but equally we could also look at shoving it into evalfn. |
As I feared, On the one hand, it is not entirely unreasonable to imagine a situation with possibly problematic manipulations (such as string.sub of window titles) that feed into hs.settings keys (values seem "fine", in the sense that |
I'll check the spec for application defaults -- since this is the namespace hs.settings.get uses, if it allows a key to be any binary data, then this is a problem. I do vaguely recall reading that base64 is the recommended (only?) way to safely store binary data, but I don't recall anything one way or the other about the keys. The greater problem of it crashing, rather than issuing an error remains either way, but I'll see what fix should be put in place. Yeah, we probably should do an audit at some point -- I know I've assumed UTF8 encodings for most things because that's what is most easily displayed correctly in the console/hs.drawing.text, but we really should be more cognizant of what is "correct" or "acceptable" in each domain rather than adopt a general assumption that limits things more than necessary. |
Maybe the utf8_decode function in utf/internal.m can be adapted for a performance-friendly C-side solution (e.g. in hammerspoon.h, lua_to_NSString):
|
Regarding your earlier example of the failure of hs.settings.get(string.char(226))... is there (or do you actually want) a setting with a key of char(226)? Internally hs.settings is converting this to UTF8, thus null, thus it's actually trying to request the value for the key NULL, which I think is causing the crash. But if you're actually using or expecting to locate a setting for the key string.char(226) then introducing a solution similar to the fix_utf8 one isn't the correct solution for this specific case... the correct solution for this specific case would be to remove the conversion to UTF8... I'm trying to find out if string.char(226) would be a legal key for NSDefaults... haven't been able to find out, but if you can confirm it should be (or I finally find a confirming reference), I'll remove the internal conversion.
|
AFAICT keys ought to be strings (e.g. here), so no, there shouldn't be a setting with such a key; crashing in such cases is understandable (but ideally, hs should throw a meaningful
function save_emojis(em)
local MAX_LEN=21-- did i mention this was a contrived example?
hs.settings.set('favourite_emojis',em:sub(1,MAX_LEN))
end
emojis='💩😱😭😡😞'
save_emojis(emojis) -- all fine so far
-- ...later:
emojis=emojis..'💣' -- ok, i understand there might be no room for this one...
save_emojis(emojis)
print(hs.settings.get('favourite_emojis')) -- ...but where did all my other emojis go? 😱
EDIT2: @asmagill to further clarify (see 3 comments above), I think it's perfectly reasonable to assume utf8-encoded strings everywhere, and I don't think there's any need to extend to other possibilities; I'm only worried about accidentally invalid strings, where the user means to pass along proper utf8, but lua (being agnostic), unbeknownst to her, decides otherwise; and then NSString explodes. |
In my poor attempt to find out if it was legal, what I was really trying to get at was how it should be corrected... crash = bad. But is it because it should be "corrected" or because an error should be thrown? We do a lot of NSString to UTF8 conversions... most of the time it's fine and even makes sense. Where it doesn't (and where it can fail poorly) ... each has to be considered on its own. IIRC, setData does an explicit conversion to BASE64 internally... this is recommended for binary data in NSDefaults, but I'm not sure if it's required... And I suspect if you were to change the set line to hs.settings.setData('favourite_emojis',em:sub(1,MAX_LEN)) then the print would also fail because of the print... what is string.len(hs.settings.get('favourite_emojis')) for your example (i.e. is your example failing because of the print or because the set stored an empty string)? And all that aside -- yes, we need to audit this everywhere and make fixes -- ideally, nothing should crash, and nothing should be changed without some sort of warning or documentation of the effect, and nothing should be changed in the actual data -- just its presentation to whatever can't handle it raw. I'll try and get an issue open later with a list of the places I know need to be looked at, since this is larger than just the console.
|
The "fix it on behalf of the incautious user" vs "throw an error" question is indeed a non-obvious one. For stuff like the console, I'd say lossy conversion should be fine (and generally convenient to the user). Possible pragmatic approach: if the post-fix string has at least N (5?) (utf8) characters, or (probably better) retained at least N% (80%?) bytes in length (these are intended as indicators that the original lua source was not complete garbage, but it was meant to be proper utf8), then move on with the fixed string; otherwise throw an error and let the user deal with it. (tbh I'm not entirely convinced by this idea myself) |
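The "retained at least N% of bytes" idea above could be sketched roughly as follows. Everything here is hypothetical: the name acceptLossyFix, the 0.8 default, and the assumption that some earlier cleaning step has already produced `fixed` by dropping (not replacing) the invalid bytes of `original`.

```lua
-- Hypothetical policy helper. `fixed` is assumed to be `original` with its
-- invalid bytes already removed by some cleaning step (not shown here).
local function acceptLossyFix(original, fixed, minRatio)
  minRatio = minRatio or 0.8
  if #original == 0 then return fixed end
  -- Enough of the input survived: treat it as "meant to be UTF-8" and go on.
  if #fixed / #original >= minRatio then
    return fixed
  end
  -- Too much was lost: the input was probably binary, so let the caller decide.
  error(string.format("invalid UTF-8: only %d of %d bytes survived cleaning",
                      #fixed, #original), 2)
end
```

A string that lost one dangling byte out of twenty-one passes through cleaned, while a string that was mostly non-UTF8 raises an error the user can catch with pcall.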
@asmagill I see the github mail interface doesn't (understandably) reflect comment edits, so I think you missed this bit from earlier: to further clarify on this:
I think it's perfectly reasonable to assume utf8-encoded strings everywhere, and I don't think there's any need to extend to other possibilities; I'm only worried about accidentally invalid strings, where the user means to pass along proper utf8, but lua (being agnostic), unbeknownst to her, decides otherwise; and then NSString explodes. |
I'll try and get on the web later today and fully read there... Short point, though... I'm fine with enforcing that anything an OSX API accepts as a NSString should be proper UTF8... however, we shouldn't disallow things that would take NSData as well or instead, because Lua can handle it -- our code around Lua, however, may need to be more aware of which, though.
|
I fully agree.
|
What are we missing now for this? with #519 having landed, we no longer crash when printing unprintable characters to the console. Can we close out this issue? |
Strictly speaking, yes, although there's still |
Can you give specific examples (or if there is already an issue I've missed, point me to it) wrt to settings and alerts? I wonder if the proposed changes I'm making to LuaSkin might help with the settings issues, but I need examples to test. Alert I've not looked at but add that to the list :-) |
@asmagill I tried with the usual:
|
Having now |
To clarify: I think we should also error out on |
Ok, alert is doing what it's programmed to do -- a nil check places that string into title and then continues... I'd personally rather it error as @lowne suggests or at the very least something more expressive... As for the settings bug, an error needs to be added when the key name returns nil... The changes I'm introducing will catch non-UTF8 in the value and save it as a block of data instead of a string, but the changes don't apply to the keys which have to be valid UTF8 strings for NSUserDefaults anyway, so I'll add that as well.
|
Silent discarding would have occurred because of invalid UTF8. Silent truncation would have occurred because of embedded null characters. Both should work for values in the LuaString WIP (though to be honest I haven't tried the embedded null character one... it should work, but I'll add it to my tests before my next update.)
|
@asmagill yes, (basically) auto-switching internally to Also hs.drawing.text might have lua->NSString conversions (haven't checked the source). |
...although on second thought, the user might be quite surprised to see what he thought was just a string base64 encoded on |
...so maybe it's better to error out in all cases:
|
My leaning is the other way and was considering actually deprecating setData altogether... The base64 encoding happens behind the scenes - we don't specify it - anytime an NSData object is saved, and since the only valid representation of what the user was trying to save was NSData, that's what happened -- what the user asked for. Since get retrieves either, my thought is that set should too... the only reason it didn't before was because I didn't know a better way to code it (I do now :-) ) However, I leave the final decision to the group... just let me know.
|
In general I agree, but I'm still worried about the case where
To reiterate what point 1 is about, in case I didn't mention it already above, what happened to me originally was:
Now, it could be argued that
and all I can say is that in the user shoes I'd rather have HS yell at me immediately and force me to decide how to fix it (again, probably simply via TL;DR: if we don't error out, we should at least mention the (possibility of) silent conversion in the docstring for |
Been there done that... and I did learn from it as well, so it wasn't all bad :-) Like I said, I'll go with the consensus now that we really can do either. And the argument could be made that we already make a special case for dates so that outside queries can get an expected NSDate rather than an integer which could be anything...
|
This also affects
I guess the easiest option at this point would be a function in hammerspoon.h that lets us filter strings? Or is that already a thing that's hiding somewhere? |
Untested, but we might be able to use some nifty NSString options to filter out characters in + (NSCharacterSet *)illegalCharacterSet |
Check out hs.cleanUTF8forConsole which is basically what I use to make things safe for the console (examples of its use from C in MJLua.m for MJLuaRunString and core_logmessage) -- note that this version also changes NULL since that would prevent output to the console. Or look at the LuaSkin WIP at [skin isValidUTF8AtIndex:#] -- technically all this does is tell you if it's valid, just as a nil does, so... maybe not as useful here... Unless you want something fancier...
|
I think we've taken care of this? It's certainly old enough at this point that I'm going to close it and hope for the best :) |
print(string.char(226))
in the console crashes hs. This can be an issue because, Lua being Unicode-agnostic, things like string.sub might leave a 'dangling' byte out by itself, causing all sorts of weird behaviour (and eventually crashes). While it's true that it's the user's responsibility to avoid outputting nonsense, hs should fail gracefully.