decode('utf-8') falls over on alternative encodings #953

Closed
bjeurissen opened this Issue Apr 4, 2017 · 14 comments

bjeurissen commented Apr 4, 2017

I came across DICOM data sets that contain non-UTF-8 meta-information (e.g. acute accents stored as latin1), causing run.command() calls to fall over:

'utf-8' codec can't decode byte 0xc9 in position 85: invalid continuation byte

As an ugly work-around, I have added errors='replace' to all decode('utf-8') calls; however, I am not sure what the best way of properly handling this would be...
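
For reference, the work-around amounts to something like this (just a sketch; raw_bytes stands in for whatever bytes come back from the subprocess):

# Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
text = raw_bytes.decode('utf-8', errors='replace')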

jdtournier commented Apr 4, 2017

What command is this? I assume it's a script, but I struggle to see where the interaction with DICOM happens... Are you talking about the filename, or the actual contents of the DICOM?

bjeurissen commented Apr 4, 2017

None of the currently available scripts are affected, but the problem pops up when one calls dcminfo using the script API, e.g.:

run.command('dcminfo file.dcm -tag 0010 0010')

with the patient name containing non-UTF-8 characters.

Lestropie commented Apr 4, 2017

So it's specifically decoding the stdout / stderr result from subprocess that's falling over, rather than the file name (which would fall over at the Popen() step)?

I was never happy using decode('utf-8') there, but it was the only way to get run.command() to work on both Unix and Windows, and on both Python 2 and 3 for each OS. Happy for the addition, provided that it's tested in all four scenarios.
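
For context, the step in question is roughly of this form (a simplified sketch, not the actual run.command() implementation):

import subprocess

# On Python 3, communicate() returns bytes when the pipes are not opened in
# text mode, so the decode below is where non-UTF-8 output raises
# UnicodeDecodeError.
process = subprocess.Popen(['dcminfo', 'file.dcm', '-tag', '0010', '0010'],
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
text = stdout.decode('utf-8')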

jdtournier commented Apr 6, 2017

Yes, text encoding is a wonderful minefield. I specified utf-8 since I anticipate this is what most systems and terminals will eventually converge to. But when reading DICOM tags, it's going to be tricky. I have a vague feeling that there might be a DICOM tag that specifies the encoding used in text fields. If that's the case, it might be worth trying to catch this in the DICOM handling code, and return properly formatted UTF8. I think C++11 has better facilities for handling Unicode that we might be able to leverage. But otherwise, I'm not sure how we can possibly handle this in any sensible way; errors='replace' might be the most sensible way to go... I'm surprised there wasn't already an errors='ignore' there, that's what the build and configure scripts both rely on...?

Lestropie commented Apr 7, 2017

I'm surprised there wasn't already an errors='ignore' there, that's what the build and configure scripts both rely on...?

Me too. But I definitely recall having gone back and forth between Windows / Unix and Python 2 / 3 (and the corresponding subprocess implementations for each), trying to get that section of code working, and whatever it ended up with is whatever worked. Can revisit it; I just don't want someone to change it and not do all that testing, because it'll inevitably bork at least one use case.

bjeurissen commented Apr 11, 2017

It seems like iso8859 is the default encoding used in DICOM, unless specified otherwise in the (0008,0005) tag. While utf-8 is supported in DICOM, it does not seem to be widely used.

A handy overview of how DICOM encodings should be translated to Python encodings, and how to tackle the problem, can be found at https://github.com/darcymason/pydicom/blob/master/pydicom/charset.py and https://github.com/darcymason/pydicom/blob/master/pydicom/valuerep.py

https://groups.google.com/d/msg/comp.protocols.dicom/O-hOxtT9kTQ/whOf2mx1Nc0J is also useful

I checked a few diverse DICOM data sets at my disposal and all of them had an empty (0008,0005) tag, thus implying iso8859.
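
For illustration, the translation from the (0008,0005) value to a Python codec could look roughly like this (only a few common values are listed, and the names are placeholders; the mapping loosely follows pydicom's charset.py):

# Map a few common DICOM Specific Character Set (0008,0005) values to
# Python codec names. Incomplete; purely illustrative.
DICOM_TO_PYTHON_CODEC = {
    '': 'iso8859-1',            # empty tag: default repertoire (ASCII, a subset of latin1)
    'ISO_IR 100': 'iso8859-1',  # Latin alphabet No. 1
    'ISO_IR 101': 'iso8859-2',  # Latin alphabet No. 2
    'ISO_IR 192': 'utf-8',      # Unicode in UTF-8
}

def decode_dicom_text(raw_bytes, specific_character_set=''):
    codec = DICOM_TO_PYTHON_CODEC.get(specific_character_set.strip(), 'iso8859-1')
    return raw_bytes.decode(codec, errors='replace')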

bjeurissen commented Apr 11, 2017

Would it be an option to first try utf-8 and, if that throws an exception, try iso8859?
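
Something along these lines, perhaps (just a sketch of the idea):

def decode_output(raw_bytes):
    # Try utf-8 first; fall back to iso8859-1 only if utf-8 decoding fails.
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        return raw_bytes.decode('iso8859-1')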

bjeurissen commented Apr 11, 2017

I have a vague feeling that there might be a DICOM tag that specifies the encoding used in text fields. If that's the case, it might be worth trying to catch this in the DICOM handling code, and return properly formatted UTF8

That would of course be the ideal solution! The pydicom link I pasted could probably be of great help for this.

jdtournier commented Apr 11, 2017

Indeed, that pydicom link might come in handy if we're to support this properly. Ideally, I think we should aim to have all text converted to UTF-8 directly within the DICOM import code, so that it is guaranteed to be valid UTF-8 by the time any script makes use of it. The problem is that character encoding support is actually pretty useless in C++11 (contrary to my earlier post). As far as I can tell, the only library that seems to be widely available and suitable for the job is libiconv. But getting this working would require quite a few modifications: testing for the presence of libiconv in configure and checking that we can compile against it, then implementing the actual conversion, first for iso8859, and then optionally for any encoding specified in the (0008,0005) tag, including figuring out the mapping between the encoding specified in the tag and its representation in libiconv... Not something I'll have the time or inclination to look into any time soon, but as far as the C++ code itself is concerned, this would be the place to do the conversion, if you're game?

Lestropie commented Apr 10, 2018

Bumping due to forum thread.

Do we think Ben's suggestion of adding errors='replace' to all decode() calls is a sufficient solution to stop this from occurring? After all, it's only the receiving of stderr text data within run.command() that's affected, which would either be just terminal text output or suppressed entirely, so preservation of accents etc. shouldn't be of huge concern.

jdtournier commented Apr 10, 2018

Yep, like I said before:

I'm surprised there wasn't already an errors='ignore' there, that's what the build and configure scripts both rely on...?

But I don't think that would have fixed the problem on the forum: that seemed to be a filename stored using a non-UTF-8 character encoding (most likely iso8859, as @bjeurissen suggested above). I don't think that replacing or ignoring the characters would have helped, in that it would still have resulted in an error down the track...?

Personally, I don't see an easy solution to this. I do agree that we should use errors=ignore or errors=replace everywhere, since that would at least allow error messages, etc. to be displayed. But there will be instances where we still won't be able to guarantee correct behaviour...

bjeurissen commented Apr 10, 2018

From what I understand, iso8859 is the default in DICOM, so most of the data sets out there are using it. It might be more sane to assume iso8859 in the Python scripts and then fall back to utf-8 when that causes an exception. I know it is ugly and not really a solution, but it might cover most of the cases we will encounter.

Just scrolled up and realised this is what I already suggested above 🤦‍♂️

jdtournier commented Apr 10, 2018

Just scrolled up and realised this is what I already suggested above

🤣

Or we could simply implement this suggestion within the DICOM handling code...?

As to what to do within the Python code, I'm more worried about the character encoding assumed by the OS / filesystem. But I do like what @bjeurissen suggests: we could implement our own decode() function, which might look something like:

def decode(data):
    # Assume iso-8859-1 first; if that raises, fall back to utf-8 and replace
    # any bytes that still can't be decoded.
    try:
        return data.decode('iso-8859-1')
    except UnicodeDecodeError:
        return data.decode('utf-8', errors='replace')

What I'm not sure about is whether that's the right order... Are there any invalid characters in iso-8859-1 that would cause an error to be detected? Maybe we should try utf-8 first...?

Lestropie commented Apr 11, 2018

But I don't think that would have fixed the problem on the forum: that seemed to be a filename stored using a non-UTF-8 character encoding (most likely iso8859, as @bjeurissen suggested above). I don't think that replacing or ignoring the characters would have helped, in that it would still have resulted in an error down the track...?

In the example on the forum, it hit an error here. That indicates that the mrconvert call in fact completed (though with unknown returncode). The issue is reading the contents of stderr written by mrconvert. Now I suspect that the relevant text will in fact be an error message about being unable to open the JSON file that contains the non-standard character (because otherwise there'd be no terminal output emitted by mrconvert containing the name of the JSON file). But that error message alone would have been sufficient for the user to sort themselves out, so I wouldn't call it an "error down the track".

What I'm not sure about is whether that's the right order... Are there any invalid characters in iso-8859-1 that would cause an error to be detected? Maybe we should try utf-8 first...?

The worst that could happen is that an undefined byte is received (ISO8859 doesn't have "non-printable" characters); whether or not this would throw an exception, I don't know, though I suspect so. But it would also succeed in mis-interpreting incoming UTF-8 without exception: looking at the code pages directly, there are UTF-8 continuation bytes that would be successfully (though likely erroneously) decoded by ISO8859. So I don't think ISO8859 followed by UTF-8 decoding would work.

Also: Windows-1252 is a superset of ISO8859 that includes control characters in 0x00-0x1F and has fewer undefined characters in the 0x7F-0x9F range, and so would encounter fewer decoding errors if used as a fallback.

The question is whether a UTF-8 decoding error should be ignored, replaced, or re-decoded as ISO8859 / Windows-1252.

The other question is what to do in the case where the stream is read one byte at a time, i.e. here. I'm thinking maybe each byte should be echoed to stderr and written to the stderr content storage without any decoding, and then the stored content decoded in a single step, so that there is a chance of interpreting multi-byte characters appropriately.
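
Roughly what I have in mind (a Python 3 sketch only; the names are placeholders, not the actual run.command() code):

import os
import sys

def drain_stderr(stream_fd):
    # Read the child's stderr one byte at a time, echo each byte without
    # decoding it, and buffer the raw bytes.
    buffered = bytearray()
    while True:
        byte = os.read(stream_fd, 1)
        if not byte:
            break
        sys.stderr.buffer.write(byte)
        sys.stderr.flush()
        buffered.extend(byte)
    # Decode the whole buffer in one step so multi-byte characters have a
    # chance of being interpreted correctly; 'replace' avoids raising on
    # anything that still isn't valid utf-8.
    return bytes(buffered).decode('utf-8', errors='replace')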

Unfortunately I don't have any DICOM data with weird characters in them to test. I also tried inserting weird characters into my file names, and the Python libraries handled these just fine (both Python 2 and 3, on my Linux machine). So I'm kinda flying blind here in both cases.
