image.Header(): UTF decoding error in JSON library #1347

Open
Lestropie opened this Issue May 22, 2018 · 6 comments

Lestropie commented May 22, 2018

As reported on forum here.

Somehow a non-UTF8 byte is being exported by the mrinfo -json_all option. I'm guessing it's something in a comment field or the like; maybe a MGZ that was converted prior to #970, or something like that. Will know more with sample data.

Lestropie self-assigned this May 22, 2018

civier commented May 22, 2018

I think I'm seeing non-UTF8 characters in dcminfo -all output with data collected at the Florey.
I'm not familiar with the underlying code, but could it be related?

These are the problematic fields:

```
[DCM] 0008 0016 UI       26      104   SOPClassUID                            [ 1.2.840.10008.5.1.4.1.1.4^@ ]
[DCM] 0008 1140 SQ      306      656 > ReferencedImageSequence
[DCM] FFFE E000 UN       94      664   - Item
[DCM] 0008 1150 UI       26      672     ReferencedSOPClassUID                [ 1.2.840.10008.5.1.4.1.1.4^@ ]
[DCM] 0008 1155 UI       52      706     ReferencedSOPInstanceUID             [ 1.3.12.2.1107.5.2.19.45056.201712051133052344508765^@ ]
[DCM] FFFE E000 UN       94      766   - Item
[DCM] 0008 1150 UI       26      774     ReferencedSOPClassUID                [ 1.2.840.10008.5.1.4.1.1.4^@ ]
[DCM] 0008 1155 UI       52      808     ReferencedSOPInstanceUID             [ 1.3.12.2.1107.5.2.19.45056.2017120511331296101608773 ]
[DCM] FFFE E000 UN       94      868   - Item
[DCM] 0008 1150 UI       26      876     ReferencedSOPClassUID                [ 1.2.840.10008.5.1.4.1.1.4^@ ]
[DCM] 0008 1155 UI       52      910     ReferencedSOPInstanceUID             [ 1.3.12.2.1107.5.2.19.45056.2017120511331689997308777 ]
```
Lestropie commented May 22, 2018

dcminfo -all only ever shows the result of handling by the C++ code. Admittedly, DICOM files do sometimes contain fields indicating the use of alternative character encodings, and in those situations our DICOM handling may not be ideal; but it's never become sufficiently problematic for us to address.

The issue with mrinfo -json_all is specific to that command: it's a trick I developed to import header information from any image format supported by MRtrix3 into Python, without having to explicitly code support for every image format in Python. But it relies on the Python json module to parse the resulting file, which is where (at least for Python 2.7) there appears to be a stumbling point. The fields you refer to in that DICOM dump are never stored in the image header, and therefore should never appear in the mrinfo -json_all output. But something is making it in there; my guess is that it's a subject name with funky encoding that our C++ DICOM code is essentially importing byte-wise and then dumping into the JSON file.
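For what it's worth, the failure mode can be sketched in a couple of lines (the JSON content here is fabricated for illustration; the exact exception differs between Python 2.7 and 3, but both choke on a raw non-UTF8 byte):

```python
import json

# Simulate a JSON file produced by `mrinfo -json_all` that contains a raw
# non-UTF8 byte (0xE9, a Latin-1 e-acute) inside a comment field.
raw = b'{"comments": "patient \xe9 name"}'

try:
    # On Python 3, json.loads() given bytes must first decode them as UTF-8;
    # the stray Latin-1 byte raises UnicodeDecodeError, matching the error
    # reported on the forum.
    json.loads(raw)
except UnicodeDecodeError as err:
    print("decode failure:", err.reason)

# One possible workaround on the Python side: decode permissively before
# parsing, replacing undecodable bytes rather than failing outright.
header = json.loads(raw.decode("utf-8", errors="replace"))
print(header["comments"])  # the bad byte becomes U+FFFD
```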

civier commented May 22, 2018

For me, the dcminfo -all problem was quite problematic because I fed its output into grep (which didn't work as expected), but I figured out I can use grep -a to solve it. So as long as it doesn't bother others, I'm OK. I'll put myself a note to clarify in the dcminfo documentation that it won't handle alternative encodings and that non-ASCII output is to be expected sometimes. Worse is better.
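As background on why grep needs -a here: a single byte outside printable ASCII is enough for grep to classify its input as binary and print "Binary file ... matches" instead of the matching lines. A small scan along these lines (names hypothetical) can locate the offending bytes:

```python
def find_non_ascii(data: bytes):
    """Return (offset, byte) pairs for bytes outside printable ASCII.

    Any such byte in a text stream can make grep treat its input as
    binary, hence the need for `grep -a` on dcminfo -all output.
    """
    # Printable ASCII plus tab, newline and carriage return.
    allowed = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    return [(i, b) for i, b in enumerate(data) if b not in allowed]

# Example: a UID field carrying a trailing null byte, as in the
# dcminfo dump above (rendered there as ^@).
sample = b"SOPClassUID [ 1.2.840.10008.5.1.4.1.1.4\x00 ]"
print(find_non_ascii(sample))  # -> [(39, 0)]
```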

Lestropie commented May 22, 2018

If it's causing problems for you, it's worth adding to the issue list; but it should be listed as a separate GitHub issue, since the required modifications are quite separate from the issue originally listed here.

We have discussed DICOM encoding either here or on the forum, but I don't seem to be able to find it right now. But previously when we've discussed this, it's been regarding subject names with e.g. accent characters; whereas the example you've provided there is very much not that. It would be worth double-checking that there isn't anything particularly unusual about the encoding of those specific fields that our code isn't aware of, that there isn't any kind of buffer overrun occurring, and that your data aren't just plain corrupt; whatever it is, we need to confirm the source of the issue before doing anything about it.

jdtournier commented May 22, 2018

@Lestropie, I think the issues you were looking for were #1296 and #953.

I think we have a potential solution for non-ASCII encodings (which is most likely to be the windows cp1252) within the DICOM handling - but that's unlikely to be the issue reported here with dcminfo -all: only the last character looks badly interpreted in those fields, and there's no reason for these UID fields to contain non-ASCII characters. I often find non-ASCII in my dcminfo -all output, and resort to grep -a for the same reason...

The reason for that last mangled character is not entirely clear to me, but it's hard to rule out a slightly buggy DICOM encoder here. This likely relates to how DICOM entries are stored: the number of bytes for each DICOM field is explicitly provided in the stream (for example, the first one has length 26 bytes - 5th column in the output). But the strings stored have odd lengths (I count 25 characters for the first one). The DICOM standard used to state that fields should have even length, presumably to ensure alignment at 16 bit boundaries (although I've had to relax our DICOM checks on that since I've encountered data that doesn't honour that - pretty sure it was Siemens too). So the question is what should be stored in that last byte? If that was a null character, I don't think that would cause any trouble, since that's the end of string delimiter anyway. But it looks like Siemens (used to) store some other value in there. I'll have a look at some of the data I have from the Florey Trio, see if there's a simple fix...
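To make the padding arithmetic above concrete, here is a rough sketch of the even-length rule as I understand it (a hypothetical encoder, not MRtrix3 code; the DICOM standard pads odd-length UI values with a single NULL byte, and text value representations with a space):

```python
def pad_dicom_string(value: str, pad: bytes = b"\x00") -> bytes:
    """Pad a DICOM string value to an even byte length.

    UI (unique identifier) values of odd length receive one trailing
    NULL byte; a reader that keeps that pad byte in the string is what
    produces the ^@ seen in the dcminfo dump above.
    """
    data = value.encode("ascii")
    return data + pad if len(data) % 2 else data

uid = "1.2.840.10008.5.1.4.1.1.4"   # 25 characters, so one pad byte is added
stored = pad_dicom_string(uid)
print(len(stored))   # 26, matching the field length in the dump
print(stored[-1])    # 0: the trailing NULL
```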

This won't solve the json issues though...

jdtournier added a commit that referenced this issue May 22, 2018

strip() function now also strips null characters
This should sort out dangling null characters in DICOM fields - see #1347
jdtournier commented May 22, 2018

OK, turns out those dangling characters were indeed null... I've pushed a fix in #1349, which seems to sort out the issue - no more Binary file (standard input) matches when feeding dcminfo -all output to grep...
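In spirit, the fix amounts to having the strip routine treat NUL like ordinary whitespace. The actual change is in the C++ strip() helper; a loose Python analogue of the described behaviour:

```python
def strip(text: str, chars: str = " \t\n\r\x00") -> str:
    """Strip leading/trailing whitespace *and* null characters.

    Loosely mirrors the change in the commit message: the dangling NULL
    pad byte at the end of a DICOM field is removed along with ordinary
    whitespace, so downstream tools see plain text.
    """
    return text.strip(chars)

print(repr(strip("1.2.840.10008.5.1.4.1.1.4\x00")))
```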

