New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run.command(): More robust text decoding #1296

Merged
merged 4 commits into from Apr 14, 2018

Conversation

Projects
None yet
2 participants
@Lestropie
Copy link
Member

Lestropie commented Apr 13, 2018

Relates to discussion in #953.

Having thought about the situation a bit, I think this is the best solution in terms of the script library. Even if decoding of DICOM text data were improved as described in #953, there's an external tool invoked by run.command() generating data that's genuinely invalid if interpreted as unicode, so the function needs to be robust to that.

This solution works for what I've tested it on thus far; I've got a couple of other cases to test, but would appreciate testing on problematic DICOM data as well.

run.command(): More robust text decoding
When using -debug option, interpret each incoming byte using the Windows-1252 codec, since this involves fixed-width encoding (necessary as one byte as read at a time) and supports conttrol characters.
When not using the -debug option, retain decoding of data as utf-8 but replace problematic bytes with the best possible replacement characters.
Closes #953.

@Lestropie Lestropie self-assigned this Apr 13, 2018

@Lestropie Lestropie added the scripts label Apr 13, 2018

@jdtournier

This comment has been minimized.

Copy link
Member

jdtournier commented Apr 13, 2018

I'm not sure how to handle this... I think the issue here is about the encoding used for filenames (rather than inside a DICOM file)? And that seems rather murky to me (see e.g. this thread). There's the encoding used for the filesystem, and that probably gets translated to the encoding used by the OS and passed to the application (us). Not sure whether this is in fact what happens though... But it seems the encoding assumed by the system is not the same on Windows (cp1252) and Linux (probably utf8) - no idea about macOS. Based on this, I think this should at the very least be dependent on the OS...?

@Lestropie

This comment has been minimized.

Copy link
Member

Lestropie commented Apr 13, 2018

There's three separate cases that are all leading to the same type of error:

  • Special characters contained within filenames, as reported by Hamed on the forum. Importantly, that only causes an error after the command has completed; it's the script library reading and decoding the contents of stderr that results in an exception, which is most likely just mrconvert echoing the name of the file within a progress bar.

  • Special characters stored within subject names inside DICOM data, as reported by Ben in #953. Handling of such could plausibly be improved within the C++ code, but ultimately the problem that was reported was the reading of such a file within a script using the run.command() function, which probably means the subject name being stored under a header comments field and then the header contents being echoed to stderr.

  • Either special characters or bytecode data being detected in stdout / stderr from any arbitrary command that could be executed by a script within run.command(), like reported by Derek on the forum. I was able to reproduce this by convincing the CUDA version of eddy to make a segmentation violation (and that's probably what is happening to Derek as well). In this case there's no way of knowing a priori what might be received on stdout / stderr.

This change should deal with all three cases. If OS-based filename encoding is a problem above and beyond this change, I'd need to see that to justify pursuing it.

@jdtournier jdtournier referenced this pull request Apr 13, 2018

Merged

CI tweaks #1295

Lestropie added some commits Apr 14, 2018

run.command(): Consistent stderr return decoding
When running with -debug option, use the Windows1252 codec to decode byte data for text display on the terminal, but additionally decode the entire stream contents using UTF-8 after the command has completed. This ensures that the returned stderr contents within a script will not vary between use or non-use of the -debug option.
@Lestropie

This comment has been minimized.

Copy link
Member

Lestropie commented Apr 14, 2018

While I'm merging this now, being fairly confident that it'll deal with the reported issues in the context of the script library, will need to keep an eye open for any cases where either exotic file names or subject names within DICOMs still cause issues.

@Lestropie Lestropie merged commit f8f6c92 into dev Apr 14, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@Lestropie Lestropie deleted the run_command_unicode branch Apr 14, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment