Encoding issue #95

Closed
JoshCheek opened this issue Dec 14, 2016 · 11 comments

@JoshCheek
Owner

JoshCheek commented Dec 14, 2016

Hopefully related to the binary/utf-8 issue in #92

This code, def π; end, explodes in Atom but works correctly in the shell and TextMate 2. I looked at the env vars: LANG was set in TM but not in Atom.

When I opened Atom's console (cmd+opt+i) and ran process.env.LANG = 'en_US.UTF-8', it then worked correctly.

When I deleted it again (delete process.env.LANG), it broke again. Stacktrace:

/Users/josh/.gem/ruby/2.3.1/gems/parser-2.3.1.4/lib/parser/source/buffer.rb:164:in `source=': invalid byte sequence in US-ASCII (EncodingError)
	from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/code.rb:26:in `initialize'
	from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/remove_annotations.rb:15:in `new'
	from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/remove_annotations.rb:15:in `initialize'
	from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/remove_annotations.rb:8:in `new'
	from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/remove_annotations.rb:8:in `call'
	from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/engine.rb:102:in `normalized_cleaned_body'
	from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/engine.rb:91:in `code'
	from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary/engine.rb:28:in `syntax_error?'
	from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/lib/seeing_is_believing/binary.rb:35:in `call'
	from /Users/josh/.gem/ruby/2.3.1/gems/seeing_is_believing-3.1.0/bin/seeing_is_believing:6:in `<top (required)>'
	from /Users/josh/.gem/ruby/2.3.1/bin/seeing_is_believing:22:in `load'
	from /Users/josh/.gem/ruby/2.3.1/bin/seeing_is_believing:22:in `<main>'
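
The failure in that trace can probably be reproduced without SiB or parser in the picture. A minimal sketch, assuming the editor sends UTF-8 bytes and Ruby tags them as US-ASCII because no locale is set (the variable names are made up for illustration):

```ruby
# The source as it arrives on stdin: UTF-8 bytes. "\u03C0" is π.
source_bytes = "def \u03C0; end".b

# The same bytes under the two tags Ruby might give them.
as_ascii = source_bytes.dup.force_encoding(Encoding::US_ASCII)
as_utf8  = source_bytes.dup.force_encoding(Encoding::UTF_8)

ascii_valid = as_ascii.valid_encoding?
utf8_valid  = as_utf8.valid_encoding?

# Transcoding the mis-tagged string raises, much like the parser gem does.
error = begin
  as_ascii.encode(Encoding::UTF_8)
  nil
rescue EncodingError => e
  e
end

puts "valid as US-ASCII? #{ascii_valid}"
puts "valid as UTF-8?    #{utf8_valid}"
puts "error: #{error.class}"
```

The bytes never change here; only the encoding tag does, which is consistent with the LANG toggle fixing and re-breaking things.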
@JoshCheek
Owner Author

JoshCheek commented Dec 14, 2016

Without the env var set:

$stdin.external_encoding  # => #<Encoding:US-ASCII>

With the env var set:

$stdin.external_encoding  # => #<Encoding:UTF-8>
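
To check this outside of Atom, something like the following might work. It is a sketch that assumes a Unix-like system; stdin_encoding_with is a hypothetical helper, and the answers depend on the platform and on which locales are installed, so no expected output is given:

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: ask a child Ruby which encoding it assigns to stdin
# under the given locale variables. A nil value unsets the variable.
def stdin_encoding_with(env)
  clean = { "LANG" => nil, "LC_ALL" => nil, "LC_CTYPE" => nil }.merge(env)
  out, _status = Open3.capture2(clean, RbConfig.ruby, "-e", "print $stdin.external_encoding")
  out
end

without_lang = stdin_encoding_with({})
with_lang    = stdin_encoding_with("LANG" => "en_US.UTF-8")

puts "LANG unset:       #{without_lang}"
puts "LANG=en_US.UTF-8: #{with_lang}"
```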

@JoshCheek
Owner Author

According to opengroup, which cites IEEE Std 1003.1-2001 (might be this, and is almost certainly this, except you can't read it without signing up or something... w/e)

| Name | Meaning |
| --- | --- |
| LANG | This variable shall determine the locale category for native language, local customs, and coded character set in the absence of the LC_ALL and other LC_* (LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME) environment variables. This can be used by applications to determine the language to use for error messages and instructions, collating sequences, date formats, and so on. |
| LC_ALL | This variable shall determine the values for all locale categories. The value of the LC_ALL environment variable has precedence over any of the other environment variables starting with LC_ (LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME) and the LANG environment variable. |
| LC_COLLATE | This variable shall determine the locale category for character collation. It determines collation information for regular expressions and sorting, including equivalence classes and multi-character collating elements, in various utilities and the strcoll() and strxfrm() functions. Additional semantics of this variable, if any, are implementation-defined. |
| LC_CTYPE | This variable shall determine the locale category for character handling functions, such as tolower(), toupper(), and isalpha(). This environment variable determines the interpretation of sequences of bytes of text data as characters (for example, single as opposed to multi-byte characters), the classification of characters (for example, alpha, digit, graph), and the behavior of character classes. Additional semantics of this variable, if any, are implementation-defined. |

There's a more extensive explanation on that site, including how to parse and make sense of the values; it's just prior to section 8.3.

@JoshCheek
Owner Author

Looks like MRI's hit this, too:

$ ruby < doc/ChangeLog-2.0.0 -e 'puts $stdin.read.split(/^(?=\S)/).select { |paragraph| paragraph["LANG"] }'
Sat Sep 29 19:40:32 2012  Hiroshi Shirosaki  <h.shirosaki@gmail.com>

	* test/ruby/test_unicode_escape.rb (TestUnicodeEscape#test_basic):
	  set script encoding to work with LANG=C. It would work on both
	  Windows and Unix. Refix of r37051.

Sat Sep 29 02:18:57 2012  Hiroshi Shirosaki  <h.shirosaki@gmail.com>

	* test/ruby/test_unicode_escape.rb (TestUnicodeEscape#test_basic):
	  Use ruby only on Windows since the test fails on Unix with LANG=C.
	  [ruby-core:47709] [Bug #7076]

Wed Aug  1 05:50:53 2012  Hiroshi Shirosaki  <h.shirosaki@gmail.com>

	* test/ruby/test_rubyoptions.rb (TestRubyOptions#test_encoding):
	  Fix test_encoding failure on Windows.
	  With chcp 65001, 1252 and 437, test_encoding failed. Test result
	  depends on locale because LANG environment variable doesn't affect
	  locale on Windows.
	  [ruby-core:46872] [Bug #6813]

@JoshCheek JoshCheek mentioned this issue Dec 15, 2016
15 tasks
@JoshCheek
Owner Author

Seems to have stemmed from JoshCheek/atom-seeing-is-believing#24 but I would like to fix it in SiB (or at least guess the most likely answer in the event that the invoking context got it wrong), b/c this may be what is affecting #92, and it confusingly looks like a bug in SiB
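
A rough sketch of what "guess the most likely answer" could look like. guess_source_encoding is a hypothetical helper, not anything that exists in SiB, and trying UTF-8 first is an assumption about which environments are common:

```ruby
# Hypothetical helper: pick an encoding for source text whose tag may be
# wrong because the invoking context did not set a locale.
def guess_source_encoding(source, candidates = [Encoding::UTF_8])
  # Trust the tag whenever the bytes are valid under it.
  return source.encoding if source.valid_encoding?

  # Otherwise take the first candidate the bytes are valid under (nil if none).
  candidates.find { |enc| source.dup.force_encoding(enc).valid_encoding? }
end

mistagged = "def \u03C0; end".b.force_encoding(Encoding::US_ASCII)
plain     = "def pi; end".b.force_encoding(Encoding::US_ASCII)
garbage   = "\xFF".b.force_encoding(Encoding::US_ASCII)

puts guess_source_encoding(mistagged).inspect
puts guess_source_encoding(plain).inspect
puts guess_source_encoding(garbage).inspect
```

When nothing fits, the helper returns nil, which is the point where explaining the problem to the user would take over from guessing.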

@JoshCheek
Owner Author

JoshCheek commented Dec 15, 2016

Hey, @avdi, I read your post, it was great! It all made a lot of sense to me except for the utf-8 issue. I tried to express it in a coherent manner, but instead I think I just destroyed my brain >.< In an abstract sense (ie conclusions instead of explanations) it's:

If the environment set everything correctly, it should work without needing to care about encodings. So if it blew up, then either SiB got out of whack somewhere, or it's an environment issue. If it's an SiB thing (e.g. overriding the defaults), then we should find and fix it. If it's an env thing, we should try to guess a few common environments, and if we can't, we should explain to the user what the problem is.

I guess an example would be that if a user is actually using a different encoding, we want to avoid transcoding it, since that can lose information (I assume there are encodings with info that is not encompassed by utf-8, though I haven't been able to find an example). So we should do everything in the user's encoding: translate our internal strings into their encoding and write the file in their encoding.

To figure out which is going on, I've been trying to recreate your issue. I'm pretty sure I can reproduce each of your examples from the blog, but it would be helpful if you could describe the encoding issue you experienced in #92.

Eg:

  • Did you run SiB from emacs?

  • What encoding do you think the file was? (eg what does Emacs say it is?)

  • How was it erroring? stacktrace / mojibaked / something else? If stacktrace, can you provide it? If mojibaked, can you screenshot the incorrect chars?

  • What was the text that caused the issue? (as screenshot in order to avoid encoding issues here, too :P)

  • Did you have any of the environment variables LANG, LC_ALL, LC_COLLATE, LC_CTYPE set? If so, what were their values?

  • If you're able to invoke Ruby in the same manner that you invoked SiB, can you try running this and let me know what it says?

    require 'pp'
    (r, w), f = IO.pipe, File.open(__FILE__)
    pp [
      [Encoding, Encoding.default_internal, Encoding.default_external, Encoding.locale_charmap,
                 Encoding.find("external"), Encoding.find("internal"), Encoding.find("locale"), Encoding.find("filesystem"),
      ],
      [$stdin,   $stdin.internal_encoding, $stdin.external_encoding],
      [$stdout,  $stdout.internal_encoding, $stdout.external_encoding],
      ["pipe read",  r.internal_encoding, r.external_encoding] ,
      ["pipe write", w.internal_encoding, w.external_encoding],
      ['file',       f.internal_encoding, f.external_encoding],
      ["__ENCODING__",    __ENCODING__],
      [String,            "".encoding],
      ["ENV[LANG]",       ENV["LANG"]],
      ["ENV[LC_ALL]",     ENV["LC_ALL"]],
      ["ENV[LC_COLLATE]", ENV["LC_COLLATE"]],
      ["ENV[LC_CTYPE]",   ENV["LC_CTYPE"]],
    ]

@avdi

avdi commented Dec 15, 2016 via email

@avdi

avdi commented Dec 15, 2016 via email

@JoshCheek
Owner Author

> On Windows, Ruby is going to believe that text data it is receiving from
> files and STDIN is IBM437, and if it is wrong, things will break. And
> unless told otherwise it will always transcode internal utf-8 to IBM437
> when it writes, which WILL lose information.

It should be okay: the only data we insert are the markers and object inspections. Markers use the chars # =>~!., which are all used by Ruby, so they should be transcodable. Inspections were obtained from the user's code, so barring an encoding issue that exists regardless of SiB, we should be okay to embed them into a comment. If not, then elide the result or mojibake it, but definitely don't explode.
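
A sketch of the "don't explode" part, assuming the file's encoding is already known. embeddable is a hypothetical name, and using "?" as the replacement character and "..." as the elision are arbitrary choices:

```ruby
# Hypothetical helper: make an inspection safe to embed in a comment that is
# written in the file's encoding.
def embeddable(inspection, target_encoding)
  # Characters the target cannot represent become "?" instead of raising.
  inspection.encode(target_encoding, invalid: :replace, undef: :replace, replace: "?")
rescue EncodingError
  # For example, no converter exists between the two encodings: elide.
  "..."
end

representable   = embeddable("caf\u00E9", Encoding::IBM437) # é exists in IBM437
unrepresentable = embeddable("\u2603", Encoding::IBM437)    # snowman does not

puts representable.encoding
puts unrepresentable.bytes.inspect
```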

> There's no magic sniffing here - it can't know if, say, a cooperating
> process is actually piping it utf-8 unless that process somehow tells it so
> (or it just assumes by hardcoding an encoding).

For our event stream, we can explicitly set the pipe to UTF-8. No strings provided by the user pass directly through it, only string literals generated within SiB. User data is marshalled and base64'd, so it passes through the pipe as an ASCII subset, and the encoding is preserved on the other side.
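
Roughly, the idea looks like this. It is a sketch rather than SiB's actual event stream code: to_wire, from_wire, and the "result" event name are made up for illustration, and pack("m0") is used as the strict base64 encoder so the example needs no requires:

```ruby
# Hypothetical helpers: turn any marshalable object into an ASCII-only token
# and back again.
def to_wire(obj)
  [Marshal.dump(obj)].pack("m0") # "m0" is base64 without line breaks
end

def from_wire(token)
  Marshal.load(token.unpack("m0").first)
end

reader, writer = IO.pipe
reader.set_encoding(Encoding::UTF_8)
writer.set_encoding(Encoding::UTF_8)

# User data in an encoding other than the pipe's.
user_string = "caf\u00E9".encode(Encoding::ISO_8859_1)
token       = to_wire(user_string)

writer.puts "result #{token}"
writer.close

event_name, received_token = reader.gets.chomp.split(" ", 2)
received = from_wire(received_token)
reader.close

puts event_name
puts received.encoding
```

Marshal records the string's encoding alongside its bytes, so the receiving side gets it back without the pipe's encoding mattering.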

> Oh, and running Bash scripts is probably meaningless because the
> environment inside bash-on-windows tools is likely to skew from baseline
> windows enough to render answers which aren't generally applicable.

Aye, I deleted that from the issue after I realized it.

> I think it's also true that a lot
> of the assumptions most people rely on (or just don't think about) having
> to do with encodings in Ruby are wrong. And operating in a UNIX environment
> just makes those assumptions invisible.

Definitely. Having AppVeyor and users in that env is awesome ^_^

@avdi

avdi commented Dec 15, 2016 via email

@avdi

avdi commented Dec 15, 2016 via email

@avdi

avdi commented Dec 15, 2016

If you want to play with any of this stuff yourself, MS has free VirtualBox images: https://developer.microsoft.com/en-us/microsoft-edge/tools/vms/
