UnicodeEncodeError (windows/dos) #40

Closed
jesus2099 opened this Issue Dec 18, 2012 · 11 comments

@jesus2099
Contributor

jesus2099 commented Dec 18, 2012

The release name contains a \u2019 apostrophe:

isrcsubmit 0.5 by JonnyJD for MusicBrainz

using python-musicbrainz2 0.7.4
using Cdrdao 1.2.3
e:
\\.\e:

DiscID:         0wD2_XE_WlugbdbCFq1wM95MW38-
Tracks on Disc: 26
This Disc ID is ambiguous:
0: Miles Davis - Ascenseur pour l'échafaud (Official)
        JP      2004-03-09
1: Miles Davis - Ascenseur pour l'échafaud (Official)
        DE      1989-03-20       042283630529
2: Miles Davis - Traceback (most recent call last):
  File "isrcsubmit.py", line 726, in <module>
    releaseId = disc.release.getId()        # implicitly fetches release
  File "isrcsubmit.py", line 392, in release
    self._release = self.getRelease(self._submit)
  File "isrcsubmit.py", line 416, in getRelease
    print "-", release.getTitle(),
  File "D:\Tristan\PRG\i\Python26\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 16: character maps to <undefined>
@JonnyJD

Owner

JonnyJD commented Dec 18, 2012

Thanks, I can reproduce this on my virtual machine. (linux works fine)

The problem seems to be that on Linux the terminal uses UTF-8, but Windows uses cp850, which is missing a lot of characters.

Not really sure how easy it is to fix.
I possibly have to ignore undisplayable characters.
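
To illustrate the failure outside of isrcsubmit, here is a minimal sketch. It is written for Python 3 (the script in this issue ran on Python 2.6), and `can_encode` is a hypothetical helper, not something from isrcsubmit.py. It checks whether a title survives encoding to a given code page and shows that the offending character sits at index 16, the position named in the traceback.

```python
# Python 3 sketch; can_encode is a hypothetical helper for illustration only.
title = "Ascenseur pour l\u2019\u00e9chafaud"  # U+2019 is the typographic apostrophe


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(title.index("\u2019"))                  # index of the apostrophe in the title
print(can_encode(title, "utf-8"))             # a UTF-8 terminal can show it
print(can_encode(title, "cp850"))             # cp850 has no slot for U+2019
print(can_encode("\u00e9chafaud", "cp850"))   # the accented e alone is fine
```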

@jesus2099

Contributor

jesus2099 commented Dec 18, 2012

Ignore, yes, for instance.
Or display as is (including garbage text); this is what isrcsubmit.exe does, and maybe it’s better for release names consisting only of non-ASCII text, such as CJK.

@JonnyJD

Owner

JonnyJD commented Dec 18, 2012

What do you mean by "as is"? What is the output of isrcsubmit.exe on this release? (can't fake the discId for isrcsubmit)

@ghost ghost assigned JonnyJD Dec 18, 2012

@JonnyJD

Owner

JonnyJD commented Dec 18, 2012

I have these (basic) options:

Ascenseur pour léchafaud (ignore)
Ascenseur pour l?échafaud (replace)
Ascenseur pour l&#8217;échafaud (xmlcharrefreplace)
Ascenseur pour l\u2019échafaud (backslashreplace)

I will use replace.

I won't mess with transliteration or similar and also don't want to hack the output to be a byte string. That would generate many problems later on, I guess.
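
The four options correspond to the standard error handlers of `str.encode`. A small Python 3 sketch follows; `render` is a hypothetical helper, not the function used in isrcsubmit.py. It encodes to cp850 with each handler and decodes again to show what the console would display.

```python
# Python 3 sketch; render is a hypothetical name chosen for this example.
title = "Ascenseur pour l\u2019\u00e9chafaud"

HANDLERS = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")


def render(text, encoding, errors):
    """Encode with the given error handler, then decode to see the visible result."""
    return text.encode(encoding, errors=errors).decode(encoding)


for handler in HANDLERS:
    # U+2019 is decimal 8217, which is what xmlcharrefreplace emits.
    print("%s (%s)" % (render(title, "cp850", handler), handler))
```

With `replace`, every unsupported character becomes the same `?`, so the output stays readable but distinct characters can no longer be told apart.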

@JonnyJD JonnyJD closed this in 3d45aeb Dec 18, 2012

@jesus2099

Contributor

jesus2099 commented Dec 18, 2012

FTR, isrcsubmit.exe displays µ│óÕïò in place of 波動. Your replace option, why not. :)

@JonnyJD

Owner

JonnyJD commented Dec 18, 2012

Yes, that is a byte string with a mismatching encoding (probably UTF-8 on cp850). That is what automatically happens when you print encoded text on the terminal from C.
However, Python actually tries to find and match the encoding automatically.
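
The mismatch can be reproduced in a few lines. This is a Python 3 sketch under the assumption stated above (UTF-8 bytes shown by a cp850 console); `mojibake` is a hypothetical helper name.

```python
# Python 3 sketch; mojibake is a hypothetical helper for illustration.
def mojibake(text, written_as="utf-8", shown_as="cp850"):
    """Show how text looks when bytes in one encoding are read as another."""
    return text.encode(written_as).decode(shown_as, errors="replace")


print(mojibake("\u6ce2\u52d5"))  # the two CJK characters from the comment above
print(mojibake("\u2019"))        # the apostrophe becomes three cp850 characters
```

Each UTF-8 byte is displayed as one cp850 character, which is why a single apostrophe turns into three characters.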

That being said, I can just use any encoding I want, but then I would have to create a new error handler for encode, or test every char, so that I use that encoding only for unsuitable chars.

However, output like:

Ascenseur pour lÔÇÖ+®chafaud

is much more distinguishable (you can see that these are different chars), rather than everything being an indistinguishable "?"

What I really would like to get working, though, is real unicode display in the cmd (should be possible).

JonnyJD added a commit that referenced this issue Dec 19, 2012

use cp65001 ~ UTF-8 on windows, re #40
The Windows cmd has limited/buggy but existing unicode support.

cp65001 is supposed to be UTF-8.
It is buggy and has problems and python decided not to add it as an alias
for utf-8.
Batchfiles are buggy with cp65001, therefore this weird cmd /c change.

Since having unicode output is desirable, we alias cp65001 to utf-8,
and use os.write in printEncoded on Windows.
Using sys.stdout.write gives IOError

We also set cp65001 in isrcsubmit.bat
Please note, that isrcsubmit.bat will not start if cp65001 is set before
starting isrcsubmit.bat

If that change in the batchfile turns out to be buggy,
we should create a separate isrcsubmit-unicode.bat.
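
The approach described in this commit message could look roughly like the following Python 3 sketch. It is not the code from the commit: `_cp65001_as_utf8` and `print_encoded` are hypothetical names, and the `fd` parameter is an assumption added so the function can be tried without a Windows console. Python 3.8 and later already treat cp65001 as an alias for utf-8, so the registration only matters on interpreters that lack it.

```python
import codecs
import os


def _cp65001_as_utf8(name):
    """Codec search function: answer lookups for cp65001 with the utf-8 codec."""
    if name.lower() == "cp65001":
        return codecs.lookup("utf-8")
    return None  # let other search functions handle every other name


codecs.register(_cp65001_as_utf8)


def print_encoded(text, fd=1, encoding="cp65001"):
    """Write encoded bytes straight to a file descriptor (1 is stdout).

    This bypasses sys.stdout.write, which the commit message reports as
    raising IOError under code page 65001.
    """
    os.write(fd, text.encode(encoding, errors="replace"))


if __name__ == "__main__":
    print_encoded("2: Miles Davis - Ascenseur pour l\u2019\u00e9chafaud\n")
```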
@JonnyJD

Owner

JonnyJD commented Dec 19, 2012

isrcsubmit.bat should have unicode support now.

$ chcp
Active code page: 850
$ chcp 65001
$ isrcsubmit.py
$ chcp 850

Should also work.

Always reset the code page to something other than 65001 again afterwards.
Otherwise batchfiles and other cmd output will not work.
This is the same for isrcsubmit.bat. There is a trick in the bat to make it work at all.

Please report if you have issues with this.

@jesus2099

Contributor

jesus2099 commented Dec 19, 2012

Hi Jonny! Thanks a lot, I didn’t know this trick!
I still have an error (see below) but it should be possible to finish the fix (see further down).

DiscID:         TK5efmSk3QXYTIqtVZuCGisoJDg-
Tracks on Disc: 16
Artist:         Traceback (most recent call last):
  File "isrcsubmit.py", line 743, in <module>
    print 'Artist:\t\t', release.getArtist().getName()
LookupError: unknown encoding: cp65001

Over there they attempt to fix this bug by telling Python to make this cp65001 map to utf-8. Maybe that works !

EDIT: Sorry, I didn’t understand that I had to update my isrcsubmit.py! :) It works like isrcsubmit.exe now: L’Indécideur
Thanks very much ! :)

@JonnyJD

Owner

JonnyJD commented Dec 19, 2012

Still not working for you?
Did you set the cmd font to Lucida Console?

On my (virtual) machine it works like it should, meaning it does look like an apostrophe:

2: Miles Davis - Ascenseur pour l’échafaud

Or is that only a problem with pasting the output of your terminal into the GitHub comment?

@JonnyJD

Owner

JonnyJD commented Dec 19, 2012

posted from my virtual windows machine (windows XP):

2: Miles Davis - Ascenseur pour l’échafaud

EDIT:
also works from a physical Windows 7 machine.
With isrcsubmit.bat I always see it as an apostrophe and with isrcsubmit.py I see ? when using standard cp850 and as apostrophe when using cp65001.
With no configuration I see it as garbage characters.

@jesus2099

Contributor

jesus2099 commented Dec 20, 2012

I have Windows XP and tried the Lucida Console font. I have to get used to it now because extended Latin works with it! :)
Thanks, Jonny ! 👍
