Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode issues in console #334

Closed
lowne opened this issue Jun 22, 2015 · 43 comments
Closed

Unicode issues in console #334

lowne opened this issue Jun 22, 2015 · 43 comments

Comments

@lowne
Copy link
Contributor

lowne commented Jun 22, 2015

print(string.char(226)) in the console crashes hs. This can be an issue because, Lua being Unicode-agnostic, things like string.sub might leave a 'dangling' byte out by itself, causing all sorts of weird behaviour (and eventually crashes).
While it's true that it's the user responsibility to avoid outputting nonsense, hs should fail gracefully.

@asmagill
Copy link
Member

Actually, string.char doesn't deal in Unicode, at least not directly... It's more accurate to think of it and string.byte as attempting to work with 8-bit (binary) ascii, and everything with the 8bit set in ascii has always been wildly machine dependent.

If you're using a development version of Hammerspoon, the latest hs.utf8_53 has codepointToUTF8(codepoint) added, which does take codepoint numbers and convert them to proper UTF8 sequences... If you're sticking with formal releases, here is the relevant function to tide you over until the next release:

  codepointToUTF8 = function(codepoint)
      if type(codepoint) == "string" then codepoint = codepoint:gsub("^U%+","0x") end
      codepoint = tonumber(codepoint)

      -- negatives not allowed
      if codepoint < 0 then return generateUTF8Character(0xFFFD) end

      -- the surrogates cause print() to crash -- and they're invalid UTF-8 anyways
      if codepoint >= 0xD800 and codepoint <=0xDFFF then return generateUTF8Character(0xFFFD) end

          if codepoint < 0x80          then
              return  string.char(codepoint)
          elseif codepoint <= 0x7FF    then
              return  string.char(           bit32.rshift(codepoint,  6)        + 0xC0)..
                      string.char(bit32.band(             codepoint, 0x3F)      + 0x80)
          elseif codepoint <= 0xFFFF   then
              return  string.char(           bit32.rshift(codepoint, 12)        + 0xE0)..
                      string.char(bit32.band(bit32.rshift(codepoint,  6), 0x3F) + 0x80)..
                      string.char(bit32.band(             codepoint, 0x3F)      + 0x80)
          elseif codepoint <= 0x10FFFF then
              return  string.char(           bit32.rshift(codepoint, 18)        + 0xF0)..
                      string.char(bit32.band(bit32.rshift(codepoint, 12), 0x3F) + 0x80)..
                      string.char(bit32.band(bit32.rshift(codepoint,  6), 0x3F) + 0x80)..
                      string.char(bit32.band(             codepoint, 0x3F)      + 0x80)
          end

      -- greater than 0x10FFFF is invalid UTF-8
      return generateUTF8Character(0xFFFD)
  end

hs.utf8_53.codepointToUTF8(226)
â

And I do agree that crashing, even on bad data, is bad... I'll add it to the list... I have a few other ways to crash the console that I'm looking into...

On Jun 22, 2015, at 5:13 AM, Mark Lowne notifications@github.com wrote:

print(string.char(226)) in the console crashes hs. This can be an issue because, Lua being Unicode-agnostic, things like string.sub might leave a 'dangling' byte out by itself, causing all sorts of weird behaviour (and eventually crashes).
While it's true that it's the user responsibility to avoid outputting nonsense, hs should fail gracefully.


Reply to this email directly or view it on GitHub #334.

@cmsj
Copy link
Member

cmsj commented Jun 23, 2015

@asmagill were you able to reproduce the crash with current trunk? I was not - it doesn't print anything, but it also doesn't crash.

@asmagill
Copy link
Member

With the development build, I get:

for i = 0,255 do print(i,string.char(i)) end
0
1
2
3
4
5
6
7
8
9
10

11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33 !
34 "
35 #
36 $
37 %
38 &
39 '
40 (
41 )
42 *
43 +
44 ,
45 -
46 .
47 /
48 0
49 1
50 2
51 3
52 4
53 5
54 6
55 7
56 8
57 9
58 :
59 ;
60 <
61 =
62 >
63 ?
64 @
65 A
66 B
67 C
68 D
69 E
70 F
71 G
72 H
73 I
74 J
75 K
76 L
77 M
78 N
79 O
80 P
81 Q
82 R
83 S
84 T
85 U
86 V
87 W
88 X
89 Y
90 Z
91 [
92
93 ]
94 ^
95 _
96 `
97 a
98 b
99 c
100 d
101 e
102 f
103 g
104 h
105 i
106 j
107 k
108 l
109 m
110 n
111 o
112 p
113 q
114 r
115 s
116 t
117 u
118 v
119 w
120 x
121 y
122 z
123 {
124 |
125 }
126 ~
127 �
169

switch string.char to hs.utf8.codepointToUTF8, and it goes all the way up to 255... It's as if string.char or print is causing the for-loop to "break" on unprintable chars (well, invalid UTF8 ones)... I would be interested in trying this in 5.2, later -- I have a couple of other crashes to pinpoint when the behavior changed (see issue about using hs.inspect on hs.appilcation and hs.window... I forget the number and don't have a browser in front of me right now) and hope to look into it further later today.

At any rate, if this is the 5.3 behavior, it seems we may be avoiding a crash, but we'll need to identify if the "break" is being caused by string.char or print... if it's print, I'm ok with it. If it's string.char, then we will have lost at least one way to craft binary data within lua and will have to figure out another at some point... I don't think it's a priority, but I have done it before, so I expect I'll want to do so again sometime...

On Jun 23, 2015, at 12:28 AM, Chris Jones notifications@github.com wrote:

@asmagill https://github.com/asmagill were you able to reproduce the crash with current trunk? I was not - it doesn't print anything, but it also doesn't crash.


Reply to this email directly or view it on GitHub #334 (comment).

@lowne
Copy link
Contributor Author

lowne commented Jun 23, 2015

Yes, by 'Unicode-agnostic' I meant Lua strings are just a byte buffer. The 'break' seems to be one example of weird behaviour.

I'd bet that the problem is in the print part - or more precisely, in going back to Unicode-world Cocoa in the hs console with non-proper Unicode strings (e.g. a dangling byte of a multi-byte utf8 char) - as if the text area or whatever is going on in the console remains "waiting" for further bytes to complete the code point, instead of respecting string termination and dealing with the garbage. Naked Lua (at least 5.1) in a (Unicode) terminal, as expected, will print the standard 'non-printable character' and carry on.

FWIW, this is what I mean with "dangling byte" and problems with string.sub as per OP - string.char just made for a shorter demo:

s = '3×4' -- "×" is two bytes in utf8: C3 97
print(string.sub(s,1,3)) -- prints "3×"
print(string.sub(s,1,1)) -- prints "3"
print(string.sub(s,1,2)) -- explodes

@asmagill
Copy link
Member

A 'break' out of a loop is odd/problematic behavior, but workable if we can identify it and either work around it or at least be aware of it... a 'crash' to me suggests that Hammerspoon itself stops working altogether... a significantly bigger concern, at least in my mind.

As an example of the distinction I mean, I've seen odd output, but not a crash, with hs.inspect(hs.utf8_53) (or hs.inspect(utf8) with the latest development build) because of the "binary" data in charPattern... it's not valid UTF8, so the console just prints '(null)'... This I consider reasonable/bearable, especially as a wrapper function like:

function asciiOnly(theString, all)
    local all = all or false
    if type(theString) == "string" then
        if all then
            return (theString:gsub("[\x00-\x1f\x7f-\xff]",function(a)
                    return string.format("\\x%02X",string.byte(a))
                end))
        else
            return (theString:gsub("[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]",function(a)
                    return string.format("\\x%02X",string.byte(a))
                end))
        end
    else
        error("string expected", 2) ;
    end
end

Allows 'asciiOnly(hs.inspect(hs.utf8_53))' to display something more readable.

Just so we're on the same page, which behavior were you seeing? wrong output/odd loop/function termination or actual Hammerspoon crashing?

On Jun 23, 2015, at 3:20 AM, Mark Lowne notifications@github.com wrote:

Yes, by 'Unicode-agnostic' I meant Lua strings are just a byte buffer. The 'break' seems to be one example of weird behaviour.

I'd bet that the problem is in the print part - or more precisely, in going back to Unicode-world Cocoa in the hs console with non-proper Unicode strings (e.g. a dangling byte of a multi-byte utf8 char) - as if the text area or whatever is going on in the console remains "waiting" for further bytes to complete the code point, instead of respecting string termination and dealing with the garbage. Naked Lua (at least 5.1) in a (Unicode) terminal, as expected, will print the standard 'non-printable character' and carry on.

FWIW, this is what I mean with "dangling byte" and problems with string.sub as per OP - string.char just made for a shorter demo:

s = '3×4' -- "×" is two bytes in utf8: C3 97
print(string.sub(s,1,3)) -- prints "3×"
print(string.sub(s,1,1)) -- prints "3"
print(string.sub(s,1,2)) -- explodes

Reply to this email directly or view it on GitHub #334 (comment).

@asmagill
Copy link
Member

Interestingly, I get different behavior if I'm attempting to print string.char(226) within the console, or within the hs shell...

From the console (I was attempting to see if our replacement print itself was having problems with the non-unicode character, hence the use of pcall):

pcall(print,string.char(65))
A
true

pcall(print,string.char(226))
true

but from hs in a terminal window:

pcall(print,string.char(65))
A
true
pcall(print,string.char(226))
error: receive timeout

followed by a crash of Hammerspoon... This tells me that something about the print substitute used in hs.ipc.handler is having a problem... I'll add that to my list and look into it later today/tomorrow.

On Jun 23, 2015, at 3:20 AM, Mark Lowne notifications@github.com wrote:

Yes, by 'Unicode-agnostic' I meant Lua strings are just a byte buffer. The 'break' seems to be one example of weird behaviour.

I'd bet that the problem is in the print part - or more precisely, in going back to Unicode-world Cocoa in the hs console with non-proper Unicode strings (e.g. a dangling byte of a multi-byte utf8 char) - as if the text area or whatever is going on in the console remains "waiting" for further bytes to complete the code point, instead of respecting string termination and dealing with the garbage. Naked Lua (at least 5.1) in a (Unicode) terminal, as expected, will print the standard 'non-printable character' and carry on.

FWIW, this is what I mean with "dangling byte" and problems with string.sub as per OP - string.char just made for a shorter demo:

s = '3×4' -- "×" is two bytes in utf8: C3 97
print(string.sub(s,1,3)) -- prints "3×"
print(string.sub(s,1,1)) -- prints "3"
print(string.sub(s,1,2)) -- explodes

Reply to this email directly or view it on GitHub #334 (comment).

@asmagill
Copy link
Member

Assuming that that the lua core itself is truly agnostic:

 $ lua-5.3
 Lua 5.3.0  Copyright (C) 1994-2015 Lua.org, PUC-Rio
 > print(string.char(65))
 A
 > print(string.char(226))
 ?
 > 

And that no changes have actually been made to the pod/fabric/library/whatever HS includes lua as, then the problem is somewhere along the lines where we convert things to UTF8 for passing around in the Hammerspoon application code... considering that it seems like the CStringToUTF8 conversion is likely returning either an error or NULL, I'm actually surprised that all the console does is break out of loops...

Obvious fix -- check explicit conversions for success/failure. Less obvious, what to do with failure... treat as raw binary code? convert to \xXX in place if it isn't a valid UTF8 character sequence? throw error?

Throwing an error is probably extreme... if standalone lua processes it, we should too in the same way. (BTW, the loop above works fine in standalone lua -- it just displays ? for numbers 128 and up -- no 'break' from the loop.)

@lowne
Copy link
Contributor Author

lowne commented Jun 24, 2015

FWIW: in my case (using HS binary 0.9.31, not the repo) the offending module (the one attempting to print a "non-utf8-valid" byte sequence to the console) would cease all its output to the console and, as far as I can tell, would just stop working at that point; meanwhile other modules would carry on as usual, including happily printing along. However there would be soon enough an inevitable HS crash.
Also, HS here (again, 0.9.31 release) instacrashes even with pcall/xpcall wrappers (from the console).

I'm the world's worst expert when it comes to both Unicode and C/ObjC/Cocoa, but it seems to me that it should be OSX/Cocoa's job to handle \xC3 (or whatever) and other invalid utf8 strings (usually by replacing with/displaying REPLACEMENT CHARACTER U+FFFD) - like what happens with Lua in a terminal. If so (and that's a big if), then there must be an overzealous conversion somewhere in HS - as opposed to just merrily passing everything along. (with the potentially important caveat of C-style string termination, which would probably break Lua raw-byte-buffer "strings").

@asmagill
Copy link
Member

Ok, with regards to the console, what is happening is that core_logmessage tries to convert the output to UTF8, receives null, and then sends an empty string to the console instead. Changing the encoding type to NSNonLossyASCIIStringEncoding, which one site suggested didn't help. And since my example above does 'print(i,string.char(i))' the whole output line is junked. Change it to 'print(i); print(string.char(i))' and you can see that the iteration continues. At least this confirms that lua is working with the non-printable bytes properly -- you can build a string of "bad" bytes and check it's length, output it to a file, etc. You just can't print it. How big of a deal is this? Should we at least have it print a warning to the console when it can't properly encode the output?

I'm still looking at IPC and the shell application... I don't think it's just one place that needs to be checked for valid output-able data.

@lowne
Copy link
Contributor Author

lowne commented Jun 25, 2015

Confirmed the crash is gone in the current trunk (was fixed by 3624d71); and I stand corrected: I couldn't find a way of delegating the problem to the OS. From http://stackoverflow.com/a/4985534:

The fact is that NSString initWithData:encoding: strict implementation just return nil when a decoding error occurs. (unlike java for instance which use a replacement character)

(There is a way to do a roundtrip conversion NSString->NSData->NSString via dataUsingEncoding:allowLossyConversion:, but that requires a NSString to begin with, so chicken and egg.)

Ideally HS would use some 3rd party library to deal with this (see https://gist.github.com/cherpake/4709652 and http://www.gnu.org/software/libiconv/), but failing that it should print a warning (or at least the replacement character as a 'hint') when the conversion fails.

@cmsj
Copy link
Member

cmsj commented Jun 25, 2015

@lowne Do we need to start with an NSString to use allowLossyConversion? Could we not feed the C string directly into NSData and go from there?

@asmagill
Copy link
Member

As long as it's only affecting the output to the console, I'm ok with either solution... The actual data internally should stay the same...

@asmagill
Copy link
Member

Came across this that might be of interest (http://notebook.kulchenko.com/programming/fixing-malformed-utf8-in-lua):

function fixUTF8(s, replacement)
  local p, len, invalid = 1, #s, {}
  while p <= len do
    if     p == s:find("[%z\1-\127]", p) then p = p + 1
    elseif p == s:find("[\194-\223][\128-\191]", p) then p = p + 2
    elseif p == s:find(       "\224[\160-\191][\128-\191]", p)
        or p == s:find("[\225-\236][\128-\191][\128-\191]", p)
        or p == s:find(       "\237[\128-\159][\128-\191]", p)
        or p == s:find("[\238-\239][\128-\191][\128-\191]", p) then p = p + 3
    elseif p == s:find(       "\240[\144-\191][\128-\191][\128-\191]", p)
        or p == s:find("[\241-\243][\128-\191][\128-\191][\128-\191]", p)
        or p == s:find(       "\244[\128-\143][\128-\191][\128-\191]", p) then p = p + 4
    else
      s = s:sub(1, p-1)..replacement..s:sub(p+1)
      table.insert(invalid, p)
    end
  end
  return s, invalid
end

Now, if I do the following:

a = string.char(65)..string.char(226)..string.char(66)
b,c = fixUTF8(a, utf8.char(0xFFFD))

Then b will be safely printable and c will contain indexes into a where invalid bytes were found.

Only semi-related, I guess I didn't look at the lua 5.3 utf8 library close enough... utf8.char is almost the same as my addition of hs.utf8.codepointToUTF8... mine handles invalid utf8 codepoints (e.g. the surrogates) better by doing the 0xFFFD replacement for you, while the builtin utf8.char still tries to convert it and ends up with something unprintable; but as long as you stick with valid UTF8 codepoints, it works just as well. I may revisit my addition, as well as see if I can implement a simpler version of the above in hs.utf8.

@lowne
Copy link
Contributor Author

lowne commented Jun 25, 2015

@cmsj I'd like to say "no" as dataUsingEncoding seems to be a method (?) on NSString, then again my understanding of ObjC etc. can be safely rounded to zero, so the truth is I have no idea.
@asmagill good find! I can use that function when I know I'm dealing with potentially unsafe strings - such as after a string.sub, which is what caused this whole issue (but it seems to be too much overhead for a blanket solution; ideally any fixing would trigger only when NSString initWithData returns nil).

As far as I'm concerned, for this issue it would be enough to provide fixUTF8 somewhere in the hs. namespace, and change https://github.com/Hammerspoon/hammerspoon/blob/master/Hammerspoon/MJLua.m#L108 to a replacement char/warning of some sort, and let the user deal with it.

EDIT: I forgot (several times) to add: it seems reasonable to assume that this issue affects Lua string->NSString conversion in general; printing to the console is the case I found (and a 'lazy' solution like above is sufficient there); but I don't know/understand HS ObjC internals sufficiently to tell if it's the only instance - nor I have tested anything else (e.g. hs.settings could be a candidate, depending on which side performs the encoding/decoding)

@cmsj
Copy link
Member

cmsj commented Jun 25, 2015

+1 having a UTF8 fixer-upper in hs.utf8. We may still need to figure out a C version, for the Console window, but equally we could also look at shoving it into evalfn.

@lowne
Copy link
Contributor Author

lowne commented Jun 26, 2015

As I feared, hs.settings.get(string.char(226)) crashes HS (also on a somewhat fresh trunk pull; apparently there's no ad-hoc null-check for this); so there seem to be a more general issue about whether to trust incoming strings from Lua.

On the one hand, it is not entirely unreasonable to imagine a situation with possibly problematic manipulations (such as string.sub of window titles) that feed into hs.settings keys (values seem "fine", in the sense that nil is silently substituted and saved - which admittedly might be even worse than a crash as it seems a perfect recipe for hard to trace, rarely occurring bugs); on the other, I don't know if this is a major issue in practice.

@asmagill
Copy link
Member

I'll check the spec for application defaults -- since this is the namespace hs.settings.get uses, if it allows a key to be any binary data, then this is a problem. I do vaguely recall reading that base64 is the recommended (only?) way to safely store binary data, but I don't recall anything one way or the other about the keys. The greater problem of it crashing, rather then issuing an error remains either way, but I'll see what fix should be put in place.

Yeah, we probably should do an audit at some point -- I know I've assumed UTF8 encodings for most things because that's what is most easily displayed correctly in the console/hs.drawing.text, but we really should be more cognizant of what is "correct" or "acceptable" in each domain rather than adopt a general assumption that limits things more then necessary.

@lowne
Copy link
Contributor Author

lowne commented Jul 5, 2015

Maybe the utf8_decode function in utf/internal.m can be adapted for a performance-friendly C-side solution (e.g. in hammerspoon.h, lua_to_NSString):

  • if lua_checklstring returns null, usual bad argument error
  • if NSString stringWithUTF8String returns nil, try the utf8_fix function (i.e. the modified utf8_decode that discards invalid sequences instead of returning null) on the char*
  • if NSString now succeeds, print a warning to the console and move on

@asmagill
Copy link
Member

asmagill commented Jul 5, 2015

Regarding your earlier example of the failure of hs.settings.get(string.char(226))... is there (or do you actually want) a setting with a key of char(226)? Internally hs.settings is converting this to UTF8, thus null, thus it's actually trying to request the value for the key NULL, which I think is causing the crash.

But if you're actually using or expecting to locate a setting for the key string.char(226) then introducing a solution similar to the fix_utf8 one isn't the correct solution for this specific case... the correct solution for this specific case would be to remove the conversion to UTF8... I'm trying to find out if string.char(226) would be a leagle key for NSDefaults... haven't been able to find out, but if you can confirm it should be (or I finally find a confirming reference), I'll remove the internal conversion.

On Jul 5, 2015, at 6:06 AM, Mark Lowne notifications@github.com wrote:

Maybe the utf8_decode function in utf/internal.m can be adapted for a performance-friendly C-side solution (e.g. in hammerspoon.h, lua_to_NSString):

if lua_checklstring returns null, usual bad argument error
if NSString stringWithUTF8String returns nil, try the utf8_fix function (i.e. the modified utf8_decode that discards invalid sequences instead of returning null) on the char*
if NSString now succeeds, print a warning to the console and move on

Reply to this email directly or view it on GitHub #334 (comment).

@lowne
Copy link
Contributor Author

lowne commented Jul 5, 2015

AFAICT keys ought to be strings (e.g. here), so no, there shouldn't be a setting with such a key; crashing in such cases is understandable (but ideally, hs should throw a meaningful error(), as I suspect most people wouldn't readily identify the cause of the crash in possibly invalid utf8).
But values can also be strings (in hs.settings.set, as opposed to hs.settings.setData - so utf8 conversion is still warranted also for values), and as I said, I think that even crashing is preferable to silently "substituting" nil (which actually deletes the setting) when inadvertently passing strings that turn out to be invalid utf8.
In fact, the "settings.set could silently eat your data" problem seems quite worse than missing output in the console; if a fix is deemed appropriate for this, might as well fix the general issue, which is that since we can't trust lua strings (EDIT: coming from the user) to be valid utf8, making NSStrings from them requires special care.
Contrived example/test code:

function save_emojis(em)
  local MAX_LEN=21-- did i mention this was a contrived example?
  hs.settings.set('favourite_emojis',em:sub(1,MAX_LEN))
end
emojis='💩😱😭😡😞'
save_emojis(emojis) -- all fine so far

-- ...later:
emojis=emojis..'💣' -- ok, i understand there might be no room for this one...
save_emojis(emojis)
print(hs.settings.get('favourite_emojis')) -- ...but where did all my other emojis go? 😱

EDIT2: @asmagill to further clarify (see 3 comments above), I think it's perfectly reasonable to assume utf8-encoded strings everywhere, and I don't think there's any need to extend to other possibilities; I'm only worried about accidentally invalid strings, where the user means to pass along proper utf8, but lua (being agnostic), unbeknownst to her, decides otherwise; and then NSString explodes.

@asmagill
Copy link
Member

asmagill commented Jul 5, 2015

In my poor attempt to find out if it was legal, what I was really trying to get at was how it should be corrected... crash = bad. But is it because it should be "corrected" or because an error should be thrown?

We do a lot of NSString to UTF8 conversions... most of the time it's fine and even makes sense. Where it doesn't (and where it can fail poorly) ... each has to be considered on it's own.

IIRC, setData does an explicit conversion to BASE64 internally... this is recommended for binary data in NSDefaults, but I'm not sure if its required...

And I suspect if you were to change the set line to hs.settings.setData('favourite_emojis',em:sub(1,MAX_LEN)) then the print would also fail because of the print... what is string.len(hs.settings.get('favourite_emojis'))) for your example (i.e. is your example failing because of the print or because the set stored an empty string)?

And all that aside -- yes, we need to audit this everywhere and make fixes -- ideally, nothing should crash, and nothing should be changed without some sort of warning or documentation of the effect, and nothing should be changed in the actual data -- just it's presentation to whatever can't handle it raw.

I'll try and get an issue open later with a list of the places I know need to be looked at, since this is larger than just the console.

On Jul 5, 2015, at 3:38 PM, Mark Lowne notifications@github.com wrote:

AFAICT keys ought to be strings (e.g. here https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man5/plist.5.html), so no, there shouldn't be a setting with such a key; crashing in such cases is understandable (but ideally, hs should throw a meaningful error(), as I suspect most people wouldn't readily identify the cause of the crash in possibly invalid utf8).
But values can also be strings (in hs.settings.set, as opposed to hs.settings.setData - so utf8 conversion is still warranted also for values), and as I said, I think that even crashing is preferable to silently "substituting" nil (which actually deletes the setting) when inadvertently passing strings that turn out to be invalid utf8.
In fact, the "settings.set could silently eat your data" problem seems quite worse than missing output in the console; if a fix is deemed appropriate for this, might as well fix the general issue, which is that since we can't trust lua strings to be valid utf8, making NSStrings from them requires special care.
Contrived example/test code:

function save_emojis(em)
local MAX_LEN=21-- did i mention this was a contrived example?
hs.settings.set('favourite_emojis',em:sub(1,MAX_LEN))
end
emojis='💩😱😭😡😞'
save_emojis(emojis) -- all fine so far

-- ...later:
emojis=emojis..'💣' -- ok, i understand there might be no room for this one...
save_emojis(emojis)
print(hs.settings.get('favourite_emojis')) -- ...but where did all my other emojis go? 😱

Reply to this email directly or view it on GitHub #334 (comment).

@lowne
Copy link
Contributor Author

lowne commented Jul 5, 2015

.set fails by internally passing nil, which deletes the saved setting as evidenced by defaults read in a terminal.
.setData, as expected, works, in that the "invalid-as-utf8-string" binary data is saved just fine. The console can't print it anyway, of course - it'll show (null); but if I do e.g. print(get(...):sub(1,4)), thus just extracting the first 4 bytes from the retrieved binary data, it'll print the first emoji just fine.

The "fix it on behalf of the incautious user" vs "throw an error" question is indeed a non-obvious one. For stuff like the console, I'd say lossy conversion should be fine (and generally convenient to the user).
I'm inclined to say it would also be fine in the vast majority of the (already very rare, as best as we can tell) problematic cases with more important stuff like settings; but it's easy enough to imagine scenarios where the well-intended fix causes even worse problems...

Possible pragmatic approach: if the post-fix string has at least N (5?) (utf8) characters, or (probably better) retained at least N% (80%?) bytes in length (these are intended as indicators that the original lua source was not complete garbage, but it was meant to be proper utf8), then move on with the fixed string; otherwise throw an error and let the user deal with it. (tbh I'm not entirely convinced by this idea myself)

@lowne
Copy link
Contributor Author

lowne commented Jul 5, 2015

@asmagill I see the github mail interface doesn't (understandably) reflect comment edits, so I think you missed this bit from earlier:

to further clarify on this:

I know I've assumed UTF8 encodings for most things because that's what is most easily displayed correctly in the console/hs.drawing.text, but we really should be more cognizant of what is "correct" or "acceptable" in each domain rather than adopt a general assumption that limits things more then necessary.

I think it's perfectly reasonable to assume utf8-encoded strings everywhere, and I don't think there's any need to extend to other possibilities; I'm only worried about accidentally invalid strings, where the user means to pass along proper utf8, but lua (being agnostic), unbeknownst to her, decides otherwise; and then NSString explodes.

@asmagill
Copy link
Member

asmagill commented Jul 5, 2015

I'll try and get on the web later today and fully read there... Short point, though... I'm fine with enforcing that anything an OSX API accepts as a NSString should be proper UTF8... however, we shouldn't disallow things that would take NSData as well or instead, because Lua can handle it -- our code around Lua, however, may need to be more aware of which, though.

On Jul 5, 2015, at 5:08 PM, Mark Lowne notifications@github.com wrote:

@asmagill https://github.com/asmagill I see the github mail interface doesn't (understandably) reflect comment edits, so I think you missed this bit from earlier:

to further clarify on this:

I know I've assumed UTF8 encodings for most things because that's what is most easily displayed correctly in the console/hs.drawing.text, but we really should be more cognizant of what is "correct" or "acceptable" in each domain rather than adopt a general assumption that limits things more then necessary.

I think it's perfectly reasonable to assume utf8-encoded strings everywhere, and I don't think there's any need to extend to other possibilities; I'm only worried about accidentally invalid strings, where the user means to pass along proper utf8, but lua (being agnostic), unbeknownst to her, decides otherwise; and then NSString explodes.


Reply to this email directly or view it on GitHub #334 (comment).

@lowne
Copy link
Contributor Author

lowne commented Jul 5, 2015

I fully agree.
Yet another alternative would be to include something like https://github.com/starwing/luautf8, and then either

  • emphasise in the docs to use utf8. instead of string. when dealing with actual strings as opposed to binary data
  • (evilly) rename string to bytearray or something (probably also removing string stuff like .upper), and rename utf8 to string (and move the string metatable accordingly as well) - but this might risk breakage of current setups for users using string. to handle binary data.

@cmsj
Copy link
Member

cmsj commented Sep 2, 2015

What are we missing now for this? with #519 having landed, we no longer crash when printing unprintable characters to the console. Can we close out this issue?

@lowne
Copy link
Contributor Author

lowne commented Sep 2, 2015

Strictly speaking, yes, although there's still alert showing "error, please file a bug" and settings.get/set crashing.

@asmagill
Copy link
Member

asmagill commented Sep 2, 2015

Can you give specific examples (or if there is already an issue I've missed, point me to it) wrt to settings and alerts? I wonder if the proposed changes I'm making to LuaSkin might help with the settings issues, but I need examples to test. Alert I've not looked at but add that to the list :-)

@lowne
Copy link
Contributor Author

lowne commented Sep 2, 2015

@asmagill I tried with the usual:

  • hs.alert'\226'
  • hs.settings.set('\226','yay')

For slightly more exotic tests, the weird utf8 encoding (I don't know where the extra two bytes come from) suggested in Character Viewer for ❤️ is UTF-8: E2 9D A4 EF B8 8F, however NSString seems to disagree since '\226\157\164\239\184' will similarly misbehave. (derp, I had missed the last byte)

@lowne
Copy link
Contributor Author

lowne commented Sep 2, 2015

Having now hs.utf8.fixUTF8, I think we should just throw an error for any remaining cases of NSString initialising to nil due to invalid utf8 data, and let the user decide how to fix it (lazy way being just wrap the data in fixUTF8).

@lowne
Copy link
Contributor Author

lowne commented Sep 2, 2015

To clarify: I think we should also error out on hs.settings.set('validKeyHoldingImportantData','invalidstring\226'), which currently doesn't crash but silently removes whatever was previously saved for 'validKeyHoldingImportantData' (i.e. it saves nil)

@asmagill
Copy link
Member

asmagill commented Sep 2, 2015

Ok, alert is doing what it's programmed to do -- a nil check places that string into title and then continues... I'd personally rather it error as @lowne suggests or at the very least something more expressive...

As for the settings bug, an error needs to be added when the key name returns nil... The changes I'm introducing will catch non-UTF8 in the value and save it as a block of data instead of a string, but the changes don't apply to the keys which have to be valid UTF8 strings for NSUserDefaults anyway, so I'll add that as well.

On Sep 2, 2015, at 4:16 PM, Mark Lowne notifications@github.com wrote:

Having now hs.utf8.fixUTF8, I think we should just throw an error for any remaining cases of NSString initialising to nil due to invalid utf8 data, and let the user decide how to fix it (lazy way being just wrap the data in fixUTF8).


Reply to this email directly or view it on GitHub #334 (comment).

@asmagill
Copy link
Member

asmagill commented Sep 2, 2015

Silent discarding would have occurred because of invalid UTF8. Silent truncation would have occurred because of embedded null characters. Both should work for values in the LuaString WIP (though to be honest I haven't tried the embedded null character one... it should work, but I'll add it to my tests before my next update.)

On Sep 2, 2015, at 4:26 PM, Mark Lowne notifications@github.com wrote:

To clarify: I think we should also error out on hs.settings.set('validKeyHoldingImportantData','invalidstring\226'), which currently doesn't crash but silently removes whatever was previously saved for 'validKeyHoldingImportantData' (i.e. it saves nil)


Reply to this email directly or view it on GitHub #334 (comment).

@lowne
Copy link
Contributor Author

lowne commented Sep 2, 2015

@asmagill yes, (basically) auto-switching internally to .setData for invalid utf8 works well since the original data is still retrievable via .get and no data loss occurs.

Also hs.drawing.text might have lua->NSString conversions (haven't checked the source).

@lowne
Copy link
Contributor Author

lowne commented Sep 2, 2015

...although on second thought, the user might be quite surprised to see what he thought was just a string base64 encoded on defaults read or other "external" (non-HS) usage of the saved data - as opposed to when calling .setData explicitly.

@lowne
Copy link
Contributor Author

lowne commented Sep 2, 2015

...so maybe it's better to error out in all cases:

  • hs.alert('\226') -> message must be a valid UTF-8 string; use fixUTF8
  • hs.settings.set('\226','yay') -> key must be a valid UTF-8 string; use a proper key
  • hs.settings.set('yay','\226') -> value must be a valid UTF-8 string; use .setData

@asmagill
Copy link
Member

asmagill commented Sep 2, 2015

My leaning is the other way and was considering actually deprecating setData altogether... The base64 encoding happens behind the scenes - we don't specify it - anytime an NSData object is saved, and since the only valid representation of what the user was trying to save was NSData, that's what happened -- what the user asked for.

Since get retrieves either, my thought is that set should too... the only reason it didn't before was because I didn't know a better way to code it (I do now :-) )

However, I leave the final decision to the group... just let me know.

On Sep 2, 2015, at 4:44 PM, Mark Lowne notifications@github.com wrote:

...so maybe it's better to error out in all cases:

hs.alert('\226') -> message must be a valid UTF-8 string; use fixUTF8
hs.settings.set('\226','yay') -> key must be a valid UTF-8 string; use a proper key
hs.settings.set('yay','\226') -> value must be a valid UTF-8 string; use .setData

Reply to this email directly or view it on GitHub #334 (comment).

@lowne
Copy link
Contributor Author

lowne commented Sep 2, 2015

In general I agree, but I'm still worried about the case where

  1. the user (wrongly, it turns out) thinks they're dealing with a string, not with arbitrary binary data
  2. at the same time, the user is using settings data from outside HS

To reiterate what point 1 is about, in case I didn't mention it already above, what happened to me originally was:

  • I was printing to console a bunch of window titles for debugging; I thought (correctly) that window titles are valid strings
  • many titles were too long, so I decided to print(title:sub(1,40)) instead - I was still convinced I was dealing with valid strings (simply because I hadn't thought about utf-8 at all (which is admittedly silly, considering browser windows could be titled anything)), but I was wrong: I found out that terminal windows titles, if one looks closely, have — and × characters, so sooner or later the string slice ends up being invalid utf8
  • before realising what was wrong, I was additionally saving these title slices via settings.set, still believing these were regular strings

Now, it could be argued that

  1. it's not our job to teach users that utf8 and lua strings don't necessarily go well together
  2. we could simply point out in the docs for .set "if you unexpectedly see base64 data in defaults read, it's because your string wasn't valid utf8 so it was converted to raw data"

and all I can say is that in the user shoes I'd rather have HS yell at me immediately and force me to decide how to fix it (again, probably simply via fixUTF8) than finding months later that my external script was failing in subtle ways because I hadn't taken into account that sometimes (what to me should just be just plain) "strings" show up base64 encoded. Then again, if at least point 2 is taken care of, hopefully the observant user will RTFM and preempt such problematic failures.

TL;DR: if we don't error out, we should at least mention the (possibility of) silent conversion in the docstring for settings.set

@asmagill
Copy link
Member

asmagill commented Sep 2, 2015

Been there done that... and I did learn from it as well, so it wasn't all bad :-)

Like I said, I'll go with the consensus now that we really can do either. And the argument could be made that we already make a special case for dates so that outside queries can get an expected NSDate rather than an integer which could be anything...

On Sep 2, 2015, at 5:28 PM, Mark Lowne notifications@github.com wrote:

In general I agree, but I'm still worried about the case where

  1. the user (wrongly, it turns out) thinks they're dealing with a string, not with arbitrary binary data
  2. at the same time, the user is using settings data from outside HS

To reiterate what point 1 is about, in case I didn't mention it already above, what happened to me originally was:

I was printing to console a bunch of window titles for debugging; I thought (correctly) that window titles are valid strings
many titles were too long, so I decided to print(title:sub(1,40)) instead - I was still convinced I was dealing with valid strings (simply because I hadn't thought about utf-8 at all (which is admittedly silly, considering browser windows could be titled anything)), but I was wrong: I found out that terminal windows titles, if one looks closely, have — and × characters, so sooner or later the string slice ends up being invalid utf8
before realising what was wrong, I was additionally saving these title slices via settings.set, still believing these were regular strings
Now, it could be argued that

  1. it's not our job to teach users that utf8 and lua strings don't necessarily go well together
  2. we could simply point out in the docs for .set "if you unexpectedly see base64 data in defaults read, it's because your string wasn't valid utf8 so it was converted to raw data"

and all I can say is that in the user shoes I'd rather have HS yell at me immediately and force me to decide how to fix it (again, probably simply via fixUTF8) than finding months later that my external script was failing in subtle ways because I hadn't taken into account that sometimes (what to me should just be just plain) "strings" show up base64 encoded. Then again, if at least point 2 is taken care of, hopefully the observant user will RTFM and preempt such problematic failures.

TL;DR: if we don't error out, we should at least mention the (possibility of) silent conversion in the docstring for settings.set


Reply to this email directly or view it on GitHub #334 (comment).

@cmsj
Copy link
Member

cmsj commented Sep 4, 2015

This also affects hs.drawing.text(). Easy to reproduce crash:

hs.drawing.text(hs.geometry.rect(0, 0, 100, 100), string.char(226))

I guess the easiest option at this point would be a function in hammerspoon.h that lets us filter strings? Or is that already a thing that's hiding somewhere?

@cmsj
Copy link
Member

cmsj commented Sep 4, 2015

Untested, but we might be able to use some nifty NSString options to filter out characters in + (NSCharacterSet *)illegalCharacterSet

@asmagill
Copy link
Member

asmagill commented Sep 4, 2015

Check out hs.cleanUTF8forConsole which is basically what I use to make things safe for the console (examples of its use from C in MJLua.m for MJLuaRunString and core_logmessage) -- note that this version also changes NULL since that would prevent output to the console.

Or look at the LuaSkin WIP at [skin isValidUTF8AtIndex:#] -- technically all this does is tell you if it's valid, just as a nil does, so... maybe not as useful here...

Unless you want something fancier...

On Sep 4, 2015, at 4:04 AM, Chris Jones notifications@github.com wrote:

Untested, but we might be able to use some nifty NSString options to filter out characters in + (NSCharacterSet *)illegalCharacterSet


Reply to this email directly or view it on GitHub.

@cmsj
Copy link
Member

cmsj commented Apr 27, 2017

I think we've taken care of this? It's certainly old enough at this point that I'm going to close it and hope for the best :)

@cmsj cmsj closed this as completed Apr 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants