UnicodeEncodeError: 'shift_jis' codec can't encode character '\u90d5' in position 3: illegal multibyte sequence #14

Arutemu64 · 2023-07-08T10:33:06Z

Trying out QReader in my project, I discovered it fails to decode some QR codes and I'm not exactly sure why.
Here is an example of such a QR code.
It scans just fine with TeaCapps QR Reader app (the most popular one) on my Android phone.

nguyen-viet-hung · 2023-07-12T04:09:24Z

I faced the same issue and needed to change the code page from "shift_jis" to "cp65001". It seems the QRDecode interface needs an update so that an alternative code page can be provided for the encoding.

Eric-Canas · 2023-07-12T09:56:49Z

Hi!

I have been trying to debug with the QR you sent, but it worked in my environment :(
Which Python version are you using?

I'm testing on Python 3.9, and decoding gives me 'qr код', the same as when I test it with the TeaCapps QR Reader app. If I try to decode it with "cp65001" as proposed by @nguyen-viet-hung, I get 'qr ﾐｺﾐｾﾐｴ'. Which is your expected output?

In any case, it doesn't produce the UnicodeEncodeError for me.

Could you give me more details about your environment?

If you have access to the source, does the solution proposed by @nguyen-viet-hung solve your problem (qreader.py, line 70)?

Thanks!

Anyway, I'll include the charset as an input parameter, to allow more flexibility.

nguyen-viet-hung · 2023-07-12T10:19:41Z

Hi,

My environment is Python 3.10, qreader 2.12 on Windows. I face the same issue as @Arutemu64 when my QR contains Vietnamese. I tried @Arutemu64's QR code with the 'shift_jis' code page and it also gave an error: "UnicodeEncodeError: 'shift_jis' codec can't encode character '\u1ec5' in position 27: illegal multibyte sequence". So I think it depends on the system it runs on.

@Eric-Canas

Eric-Canas · 2023-07-13T08:23:27Z

I have been testing @Arutemu64's QR under Python 3.10, also on Windows, and it gives me the same result ('qr код') as before. Could it be related to some internal locale configuration?

Could you share a Vietnamese QR that you know produces the "UnicodeEncodeError: 'shift_jis' codec can't encode character..." on your side?

@nguyen-viet-hung

nguyen-viet-hung · 2023-07-13T08:36:57Z

Hi,

Here is my sample QR code image. The content of the QR is: "Tôi là Hùng. Đây là bản thử nghiệm để kiểm tra tính năng"

Eric-Canas · 2023-07-13T15:31:59Z

I have been testing with this QR. It also breaks on my side when running

decodedQR[0].data.decode('utf-8').encode('shift-jis').decode('utf-8')

But it gives me the correct result with just:

decodedQR[0].data.decode('utf-8')

Could you confirm whether it also happens on your side? Have you found any example that needs re-encoding to cp65001?

I did that re-encoding because I found this case

Which decoded to: "BCD\n002\n1\nSCT\nRLNWATWW\nﾃвzte ohne Grenzen\nAT973200000000518548\nSpende\nSpende\nfﾃｼr MSF Nothilfe"

While the original text should be: 'BCD\n002\n1\nSCT\nRLNWATWW\nÄrzte ohne Grenzen\nAT973200000000518548\nSpende\nSpende für MSF Nothilfe'

That's why I integrated this section:

try:
    return decodedQR[0].data.decode('utf-8').encode('shift-jis').decode('utf-8')
except UnicodeDecodeError:
    # When double decoding fails, just return the decoded string assuming it could have weird characters
    return decodedQR[0].data.decode('utf-8')

I should also have included 'UnicodeEncodeError' there, but I never found a case that triggered it.

I would like to find out whether there are cases where this update decodes incorrectly:

try:
    return decodedQR[0].data.decode('utf-8').encode('shift-jis').decode('utf-8')
except (UnicodeDecodeError, UnicodeEncodeError):
    # When double decoding fails, just return the decoded string assuming it could have weird characters
    return decodedQR[0].data.decode('utf-8')

But anyway, I'll add that 'shift-jis' as an input parameter on the init.
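
The garbling and the re-encoding fix discussed above can be reproduced with plain strings, without a QR image. This is a minimal sketch that assumes the garbled text comes from UTF-8 payload bytes being read as Shift-JIS somewhere in the decoding chain; the thread does not confirm where that happens, and the variable names are illustrative only.

```python
# Reproduce the garbling from the thread with plain strings (no QR image needed).
original = "Ärzte ohne Grenzen"

# Assumption: the UTF-8 payload bytes were interpreted as Shift-JIS upstream.
mojibake = original.encode("utf-8").decode("shift_jis")
print(mojibake)

# Undo it: recover the raw bytes via Shift-JIS, then decode them as UTF-8.
repaired = mojibake.encode("shift_jis").decode("utf-8")
print(repaired)

# Vietnamese text contains characters that Shift-JIS cannot represent, so the
# encode step raises UnicodeEncodeError rather than UnicodeDecodeError.
try:
    "thử nghiệm".encode("shift_jis")
    encode_failed = False
except UnicodeEncodeError as exc:
    encode_failed = True
    print(type(exc).__name__)
```

This is consistent with why catching only UnicodeDecodeError was not enough: the failure for correctly decoded Vietnamese text occurs in the encode step.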

@nguyen-viet-hung

nguyen-viet-hung · 2023-07-14T03:51:25Z

Hi,

I tested with your QR and it gave the result you mention. In my case, using only .decode('utf-8') gives the correct result. After checking for a while, I found the code page mentioned here. cp65001 is UTF-8, so for some languages (Japanese, Chinese, Arabic ...) the text needs to be re-encoded and decoded as you do in the code.

Until we find a universal solution, we should make the encoding code page a parameter, as you do.
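
The point that cp65001 is UTF-8 can be checked directly in Python. This is a small sketch assuming Python 3.8 or later, where the standard library treats "cp65001" as an alias of the UTF-8 codec; on older versions the lookup may behave differently.

```python
import codecs

# On Python 3.8 and later, "cp65001" resolves to the UTF-8 codec.
codec_name = codecs.lookup("cp65001").name
print(codec_name)

text = "Tôi là Hùng"

# Round-tripping through cp65001 is then a no-op: encode to UTF-8 bytes,
# decode the same bytes as UTF-8, and get the input string back unchanged.
round_tripped = text.encode("cp65001").decode("utf-8")
print(round_tripped == text)
```

Because the round trip changes nothing, re-encoding to cp65001 never raises for valid text, but it also cannot repair text that was already garbled.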

@Eric-Canas

Eric-Canas · 2023-07-14T08:17:21Z

Sure, I have included that parameter for now. You can get it with pip install --upgrade qreader

And instantiate the QReader object as QReader(reencode_to='cp65001')

By default it re-encodes to shift-jis, just to avoid breaking anything in current implementations, but you can set reencode_to=None and it will do only the one-step 'utf-8' decoding. In any case, if it triggers a UnicodeEncodeError or a UnicodeDecodeError, it will fall back to utf-8, so if utf-8 worked for you, you don't really need to change anything.
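
The behaviour described above can be sketched in a few lines. Note that `decode_payload` is a hypothetical stand-in written for illustration, not the actual qreader code; only the `reencode_to` name and its described semantics come from this thread.

```python
from typing import Optional


def decode_payload(data: bytes, reencode_to: Optional[str] = "shift-jis") -> str:
    """Hypothetical helper mirroring the fallback logic described in the thread."""
    text = data.decode("utf-8")
    if reencode_to is None:
        # One-step decoding only.
        return text
    try:
        # Try to repair garbled text by round-tripping through the code page.
        return text.encode(reencode_to).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip is not applicable; keep the plain UTF-8 result.
        return text


vietnamese = "Tôi là Hùng".encode("utf-8")
garbled = "ﾃвzte ohne Grenzen".encode("utf-8")

print(decode_payload(vietnamese))                 # falls back to plain UTF-8
print(decode_payload(garbled))                    # repaired by the round trip
print(decode_payload(garbled, reencode_to=None))  # left as-is
```

With this shape, text that is already correct passes through untouched whenever the round trip fails, which matches the "you don't really need to change anything" remark.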

Thanks a lot for your help, @nguyen-viet-hung.

@Arutemu64, your problem should be solved with this update.

Thanks!

nguyen-viet-hung · 2023-07-14T08:26:56Z

Hi @Arutemu64

You can use the update that @Eric-Canas has published, or try installing the extra package ebcdic and re-testing with the current version of qreader to see if it works.

pip install ebcdic

Eric-Canas closed this as completed Jul 25, 2023
