Notes contain non-visible characters #137
Thanks for the report. I think there are some issues in the conversion from the original Microsoft SQL Server to Oracle which was done before Tom and I got here. The notes are actually stored in a hexadecimal format and need decoding in order to be readable as plain text. I tried decoding it as UTF-8, but then you get Latin-1 characters sneaking in (in particular the Latin-1 character for a space, which is \x13 I think). I also tried decoding it as Latin-1, but then I received other errors. When we re-extract the notes this is definitely something we will pay attention to. Have you noticed that it only occurs in Metavision notes? That would be consistent with what I've found. The Metavision notes have categories:
They seem to appear in CareVue. If you look at the breakdown of categories, it's primarily Nursing/other:

mimic=# select category, count(1) from mimiciii.noteevents where text ~ '\x7f' group by category;
 category  | count
-----------+-------
 Radiology |     2
(1 row)

mimic=# select category, count(1) from mimiciii.noteevents where text ~ '\x14' group by category;
   category    | count
---------------+-------
 Nursing/other |    47
(1 row)

mimic=# select category, count(1) from mimiciii.noteevents where text ~ '\x13' group by category;
   category    | count
---------------+-------
 Nursing/other |   593
(1 row)
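For anyone working from notes already exported out of the database, the same per-category breakdown can be approximated in Python. This is only a rough sketch: `count_by_category` and the `(category, text)` row shape are hypothetical names chosen for illustration, and the sample rows are made up, not MIMIC data.

```python
from collections import Counter

# The three non-visible characters discussed in this thread.
CONTROL_CHARS = ("\x7f", "\x14", "\x13")


def count_by_category(rows):
    """Count, per character, how many notes in each category contain it.

    `rows` is an iterable of (category, text) pairs (an assumed shape).
    """
    counts = {ch: Counter() for ch in CONTROL_CHARS}
    for category, text in rows:
        for ch in CONTROL_CHARS:
            if ch in text:
                counts[ch][category] += 1
    return counts


# Made-up sample rows for illustration only.
sample_rows = [
    ("Radiology", "CHEST\x7f PA"),
    ("Nursing/other", "pt stable\x13"),
    ("Nursing/other", "no issues"),
]
result = count_by_category(sample_rows)
```

Like the queries above, this counts notes rather than individual character occurrences, so a note containing the same character twice is counted once.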
tnaumann commented Oct 3, 2016
There are three non-visible characters (\x7f, \x14, and \x13) that appear in the notes. These characters cause issues when working with many text processing tools. Removing these characters will make it easier to work with the notes and should not change the underlying meaning.

Currently, when using processing tools (e.g., cTAKES) that process a directory, one can work around this issue by manually removing the offending characters:

find notes -type f -exec sed -i 's/\x7f//g; s/\x14//g; s/\x13//g' {} +

These characters appear in ~700 notes. A truncated character distribution shows:
There does not appear to be overlap among the affected notes.
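
Where `sed -i` is not available or behaves differently (the GNU and BSD versions differ), the same removal could be sketched in Python. The function names `strip_nonvisible` and `clean_directory` are hypothetical. The sketch works on raw bytes so that it makes no assumption about whether a note is UTF-8 or Latin-1; the three byte values are single-byte characters in both encodings.

```python
from pathlib import Path

# Bytes for the three non-visible characters: \x7f, \x14, \x13.
NONVISIBLE = b"\x7f\x14\x13"


def strip_nonvisible(data: bytes) -> bytes:
    """Return `data` with every \\x7f, \\x14 and \\x13 byte removed."""
    return data.translate(None, NONVISIBLE)


def clean_directory(notes_dir: str) -> int:
    """Rewrite files under `notes_dir` in place; return how many changed."""
    changed = 0
    for path in Path(notes_dir).rglob("*"):
        if not path.is_file():
            continue
        original = path.read_bytes()
        cleaned = strip_nonvisible(original)
        if cleaned != original:
            path.write_bytes(cleaned)
            changed += 1
    return changed
```

Calling `clean_directory("notes")` would mirror the `find notes ...` command above. It overwrites files in place, so running it on a copy of the notes first is the safer choice.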