
UnicodeEncodeError: 'shift_jis' codec can't encode character '\u90d5' in position 3: illegal multibyte sequence #14

Closed
Arutemu64 opened this issue Jul 8, 2023 · 9 comments


Arutemu64 commented Jul 8, 2023

While trying out QReader in my project, I found that it fails to decode some QR codes, and I'm not sure why.
Here is an example of such a QR code.
It scans just fine with the TeaCapps QR Reader app (the most popular one) on my Android phone.

@nguyen-viet-hung

I faced the same issue and had to change the code page from "shift_jis" to "cp65001". It seems the QRDecode interface needs an update so that an alternative code page can be provided for encoding.

@Eric-Canas
Owner

Eric-Canas commented Jul 12, 2023

Hi!

I have been trying to debug with the QR you sent, but it worked in my environment :(
Which Python version are you using?

I'm testing on Python 3.9, and decoding gives me 'qr код', the same as when I test it with the TeaCapps QR Reader app. If I try to decode it with "cp65001" as proposed by @nguyen-viet-hung, I get 'qr ミコミセミエ'. What is your expected output?

In any case, it doesn't produce the UnicodeEncodeError for me.

Could you give me more details about your environment?

If you have access to the source, does the solution proposed by @nguyen-viet-hung solve your problem (qreader.py, line 70)?

Thanks!

Anyway, I'll include the charset as an input parameter to allow more flexibility.

@nguyen-viet-hung

Hi,

My environment is Python 3.10 with qreader 2.12 on Windows. I face the same issue as @Arutemu64 when my QR contains Vietnamese. I also tried @Arutemu64's QR code with the 'shift_jis' code page, and it gave an error too: "UnicodeEncodeError: 'shift_jis' codec can't encode character '\u1ec5' in position 27: illegal multibyte sequence". So I think it depends on the system it runs on.
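For reference, the error can be reproduced in isolation, without any QR image. The snippet below is a minimal sketch: the helper name can_encode is made up for illustration and is not part of qreader. It checks whether a string can be represented in a given codec, using the two characters from the error messages reported in this thread.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        # Raised when a character has no representation in the codec.
        return False


# Characters taken from the two error messages reported in this issue.
for ch in ("\u90d5", "\u1ec5"):
    print(repr(ch), can_encode(ch, "shift_jis"))

# Cyrillic is part of the Shift JIS character set, so this one encodes.
print("код", can_encode("код", "shift_jis"))
```

So the exception only says that the already-decoded text contains characters outside Shift JIS; it says nothing about whether the text itself is wrong.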

@Eric-Canas
Owner

I have been testing @Arutemu64's QR under Python 3.10, also on Windows, and it gives me the same result ('qr код') as before. Could it be related to some internal locale configuration?

Could you share a Vietnamese QR that you know produces the "UnicodeEncodeError: 'shift_jis' codec can't encode character..." on your side?

@nguyen-viet-hung

Hi,

Here is my sample QR code image. The content of the QR is: "Tôi là Hùng. Đây là bản thử nghiệm để kiểm tra tính năng"
qrcode

@Eric-Canas
Owner

Eric-Canas commented Jul 13, 2023

I have been testing with this QR. It also breaks on my side when running

decodedQR[0].data.decode('utf-8').encode('shift-jis').decode('utf-8')

But it gives me the correct result with just:

decodedQR[0].data.decode('utf-8')

Could you confirm whether it also happens on your side? Have you found any example that needs re-encoding to cp65001?

I did that re-encoding because I found this case
difficult_encoding

Which decoded to: "BCD\n002\n1\nSCT\nRLNWATWW\nテвzte ohne Grenzen\nAT973200000000518548\nSpende\nSpende\nfテシr MSF Nothilfe"

While the original text should be "BCD\n002\n1\nSCT\nRLNWATWW\nÄrzte ohne Grenzen\nAT973200000000518548\nSpende\nSpende für MSF Nothilfe"
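The garbling above is what you get when UTF-8 bytes are interpreted as Shift JIS. Here is a minimal sketch of that round trip using only Python's built-in codecs; it assumes the decoder mis-read the payload as Shift JIS, which is consistent with the output above but is not confirmed here.

```python
original = "Ärzte ohne Grenzen"

# The QR payload is stored as UTF-8 bytes.
raw = original.encode("utf-8")

# Assumption: the decoder interprets those bytes as Shift JIS, which
# turns the two-byte "Ä" (plus the following "r") into unrelated characters.
garbled = raw.decode("shift_jis")

# Re-encoding to Shift JIS restores the original bytes, and decoding
# them as UTF-8 restores the original text.
recovered = garbled.encode("shift_jis").decode("utf-8")

print(garbled)
print(recovered)
```

The recovery only works when the garbled text is itself valid Shift JIS. Text that was decoded correctly in the first place usually cannot survive the same round trip.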

That's why I integrated this section:

try:
    return decodedQR[0].data.decode('utf-8').encode('shift-jis').decode('utf-8')
except UnicodeDecodeError:
    # When double decoding fails, just return the decoded string assuming it could have weird characters
    return decodedQR[0].data.decode('utf-8')

I should have also caught UnicodeEncodeError there, but I had never found a case that provoked it.

I would like to find out whether there are cases where this update decodes incorrectly:

try:
    return decodedQR[0].data.decode('utf-8').encode('shift-jis').decode('utf-8')
except (UnicodeDecodeError, UnicodeEncodeError):
    # When double decoding fails, just return the decoded string assuming it could have weird characters
    return decodedQR[0].data.decode('utf-8')

In any case, I'll expose that 'shift-jis' encoding as an input parameter of the init.

@nguyen-viet-hung

Hi,

When I tested with your QR, it gave the result you mention. For my case, I can use only .decode('utf-8') and it gives the correct result. After checking for a while, I found the mention of the code page here. cp65001 is utf-8, so for some languages (Japanese, Chinese, Arabic ...), the text needs to be re-encoded and decoded as you do in the code.

Until we can find a universal solution, the encoding code page should be a parameter, as you have done.
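A quick way to check the cp65001 claim from Python itself is shown below. This is a small sketch that assumes Python 3.8 or later, where cp65001 is registered as an alias of the UTF-8 codec on every platform; on older versions the codec was only available on Windows.

```python
import codecs

# On Python 3.8+, looking up "cp65001" resolves to the UTF-8 codec.
codec_name = codecs.lookup("cp65001").name
print(codec_name)

sample = "Tôi là Hùng"
# Both names therefore produce identical bytes.
same_bytes = sample.encode("cp65001") == sample.encode("utf-8")
print(same_bytes)
```

This means that re-encoding to cp65001 and decoding as UTF-8 again leaves the text unchanged, which is equivalent to a single .decode('utf-8').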

@Eric-Canas
Owner

Sure, I have included that parameter for now. You can upgrade with pip install --upgrade qreader

And instantiate the QReader object as QReader(reencode_to='cp65001')

By default it re-encodes to shift-jis, just to avoid breaking current implementations, but you can set reencode_to=None and it will just do the one-step 'utf-8' decoding. In any case, if it triggers a UnicodeEncodeError or a UnicodeDecodeError, it will fall back to utf-8, so if utf-8 worked for you, you don't really need to change anything.
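The behaviour described above can be sketched as a standalone function. This is an illustration only, not the actual qreader source: the function name decode_payload is hypothetical, and only the reencode_to parameter name and its 'shift-jis' default come from this thread.

```python
from typing import Optional


def decode_payload(data: bytes, reencode_to: Optional[str] = "shift-jis") -> str:
    """Decode raw QR bytes, optionally undoing a wrong-codec decoding.

    Hypothetical helper mirroring the behaviour described in this thread.
    """
    text = data.decode("utf-8")
    if reencode_to is None:
        # One-step decoding, no re-encoding attempt.
        return text
    try:
        return text.encode(reencode_to).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The double decoding failed, so fall back to the plain UTF-8 text.
        return text


vietnamese = "Tôi là Hùng. Đây là bản thử nghiệm để kiểm tra tính năng"
# Correctly decoded text cannot be encoded as Shift JIS, so the fallback returns it as-is.
print(decode_payload(vietnamese.encode("utf-8")))

# Text that was garbled through Shift JIS is repaired by the default setting.
garbled_bytes = "Ärzte".encode("utf-8").decode("shift_jis").encode("utf-8")
print(decode_payload(garbled_bytes))
```

With reencode_to=None the second call would return the garbled text unchanged, which is the trade-off behind keeping 'shift-jis' as the default.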

Thanks a lot for your help @nguyen-viet-hung .

@Arutemu64 , your problem should be solved with this update.

Thanks!

@nguyen-viet-hung

Hi @Arutemu64

You can use the update that @Eric-Canas has published, or try installing the extra package ebcdic and re-testing with the current version of qreader to see if it works.

pip install ebcdic
