Binary items with umlauts from custom data source are created and deleted right away #837

agross · 2016-03-06T17:50:43Z

I'm currently migrating an old CMS to nanoc. We load most of the CMS content from an XML file. Some items (binary items) are loaded from the file system.

There is one binary file containing an "ß" character. The item rep (default) for that item gets created and deleted right away. All binary items are handled by a passthrough rule.

$ bundle exec nanoc
Loading site… done
Compiling site…
      create  [0.00s]  build/bin/files/media/image/team-alexander-groß.jpg
      delete  build/bin/files/media/image/team-alexander-groß.jpg

Site compiled in 2.55s.

denisdefreyne · 2016-03-06T18:04:49Z

Yikes! A wild guess, but this might be caused by Unicode normalisation being done differently in different places.

Do you have a test case that I can reproduce locally? If not, it’d be helpful if you could do some digging on your side and see whether you can isolate the issue. If my hunch is correct, pruner.rb:43 would show that present_files has the string normalised one way, and compiled_files has the same string normalised a different way.

agross · 2016-03-06T21:54:44Z

present_files contains

[ 18] "build/bin/files/media/image/team-alexander-gro\u00DF.jpg",

compiled_files contains

[115] "build/bin/files/media/image/team-alexander-gro\xE1.jpg"

agross · 2016-03-06T22:00:09Z

I added p present_files.find { |e| e =~ /alex/ }.encoding for both collections; both yield #<Encoding:UTF-8> in the output.

agross · 2016-03-06T22:01:57Z

Scratch that.

p present_files.find { |e| e =~ /team-alex/ }.encoding
# => #<Encoding:UTF-8>
p compiled_files.find { |e| e =~ /team-alex/ }.encoding
# => #<Encoding:IBM437>
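
The mismatch above would explain the create-then-delete symptom. What follows is a minimal sketch, assuming the pruner removes every file that is present on disk but missing from the list of compiled files. The helpers `stale_files` and `utf8_paths` are hypothetical stand-ins for illustration, not nanoc's actual code.

```ruby
# Hypothetical stand-in for the pruner's check: anything on disk that was
# not produced by this compilation run counts as stale.
def stale_files(present_files, compiled_files)
  present_files - compiled_files
end

# Re-encode every path to UTF-8 so both lists use the same byte sequences.
def utf8_paths(paths)
  paths.map { |path| path.encode(Encoding::UTF_8) }
end

# The same file name as each side sees it.
# UTF-8: "ß" is stored as the two bytes C3 9F.
utf8_path = "team-alexander-gro\u00DF.jpg"
# IBM437: "ß" is stored as the single byte E1.
cp437_path = "team-alexander-gro\xE1.jpg".dup.force_encoding(Encoding::IBM437)

p utf8_path == cp437_path                            # compared as-is
p stale_files([utf8_path], [cp437_path])             # difference before re-encoding
p stale_files([utf8_path], utf8_paths([cp437_path])) # difference after re-encoding
```

Compared as-is, the two strings are not equal because their bytes differ, so the first difference still contains the file and it would be treated as stale. Once both sides are UTF-8, the difference is empty.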

agross · 2016-03-06T22:08:24Z

It seems this File.expand_path in my data source yields #<Encoding:IBM437>.

But even after changing the line to new_item(File.expand_path(file).encode(Encoding::UTF_8), ... the pruner still sees compiled_files.find { |e| e =~ /team-alex/ }.encoding as #<Encoding:IBM437>.

agross · 2016-03-06T22:18:05Z

Might be related to this bug: https://bugs.ruby-lang.org/issues/9713

agross · 2016-03-06T22:45:30Z

After copying the code from the issue above into a file in lib/ and into my spec_helper.rb, I see a slight difference between

$ bundle exec nanoc
Loading site… Encoding.find 'filesystem': #<Encoding:Windows-1252>
Encoding.find 'locale': #<Encoding:IBM437>
Encoding.default internal: nil
Encoding.default external: #<Encoding:IBM437>
Encoding.locale_charmap: "CP437"
__FILE__: #<Encoding:UTF-8>
'foobar': #<Encoding:IBM437>

# and

$ bundle exec rspec # or rake
Encoding.find 'filesystem': #<Encoding:Windows-1252>
Encoding.find 'locale': #<Encoding:IBM437>
Encoding.default internal: nil
Encoding.default external: #<Encoding:IBM437>
Encoding.locale_charmap: "CP437"
__FILE__: #<Encoding:UTF-8>
'foobar': #<Encoding:UTF-8>
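
The script copied from the Ruby issue is not reproduced in this thread. The following is a minimal sketch that prints the same kind of information, using only core `Encoding` methods. The method name `encoding_report` is made up for illustration, the `'foobar'` line is left out because the thread does not show which expression produced it, and the `File.expand_path` entry is an addition based on the earlier comment.

```ruby
# Collect the encodings Ruby derives from the environment it runs in.
def encoding_report
  {
    "Encoding.find 'filesystem'" => Encoding.find('filesystem'),
    "Encoding.find 'locale'"     => Encoding.find('locale'),
    'Encoding.default_internal'  => Encoding.default_internal,
    'Encoding.default_external'  => Encoding.default_external,
    'Encoding.locale_charmap'    => Encoding.locale_charmap,
    '__FILE__'                   => __FILE__.encoding,
    # Not part of the output above; added because File.expand_path was
    # the suspect earlier in this thread.
    "File.expand_path('.')"      => File.expand_path('.').encoding
  }
end

encoding_report.each { |label, value| puts "#{label}: #{value.inspect}" }
```

Running it once from a nanoc helper in lib/ and once from spec_helper.rb should show whether the two entry points see different defaults.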

denisdefreyne · 2016-03-08T08:54:36Z

Does the problem disappear when you replace

Find.find(site.config[:output_dir] + '/') do |f|

in pruner.rb with

Find.find(site.config[:output_dir] + '/').map { |f| f.encode('UTF-8') }.each do |f|

? If so, it looks like re-encoding all filenames obtained from Dir.glob to be UTF-8 would be the way to go.

agross · 2016-03-09T11:38:26Z

Thanks for the suggestion! Unfortunately it didn't work as it seems the files returned by Find.find are already UTF-8 encoded.

compiled_files contains the "team-alexander-gross" file name with Encoding:IBM437 encoding, so slapping the map on compiled_files did the trick.

Perhaps it's even better to enforce the encoding at a more central place, like ItemRep.raw_paths.values or wherever the raw_paths.values come from. This works for me:

all_raw_paths = site.compiler.reps.flat_map { |r| r.raw_paths.values.map { |f| f.encode('UTF-8') } }
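
A note on why `encode` is the right call here rather than `force_encoding`, as a small sketch. The helper names `cp437_eszett`, `transcoded` and `relabelled` are hypothetical. `encode` converts the bytes to the target encoding, whereas `force_encoding` only changes the label and leaves the bytes untouched.

```ruby
# "ß" as a single IBM437 byte, as in the mis-encoded path from this thread.
def cp437_eszett
  "\xE1".dup.force_encoding(Encoding::IBM437)
end

# encode transcodes: the bytes change so that they mean "ß" in UTF-8.
def transcoded(str)
  str.encode(Encoding::UTF_8)
end

# force_encoding only relabels: the bytes stay the same, which here leaves
# a lone E1 byte that is not valid UTF-8.
def relabelled(str)
  str.dup.force_encoding(Encoding::UTF_8)
end

p transcoded(cp437_eszett).bytes           # => [195, 159]
p transcoded(cp437_eszett).valid_encoding? # => true
p relabelled(cp437_eszett).bytes           # => [225]
p relabelled(cp437_eszett).valid_encoding? # => false
```

Only the transcoded string matches the UTF-8 name that comes back from the file system.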

denisdefreyne · 2016-04-15T07:29:09Z

Yup, I’d argue that all strings (including filenames) constructed by Nanoc should be in UTF-8. Will fix and ensure that encodings are correct everywhere.

(Hard to fix/test, because the default encoding is sadly part of the global state.)

denisdefreyne · 2016-04-17T10:44:37Z

Fix in #852.

denisdefreyne · 2016-04-17T10:52:49Z

Fixed in #852, and will be part of the 4.1.6 release.

denisdefreyne added the type:bug 🐛 label Mar 6, 2016

denisdefreyne added this to the 4.1.5 milestone Mar 6, 2016

denisdefreyne modified the milestones: 4.1.5, 4.1.6 Mar 24, 2016

denisdefreyne mentioned this issue Apr 17, 2016

Force UTF-8 for item rep paths #852

Merged

denisdefreyne closed this as completed Apr 17, 2016
