Notes contain non-visible characters #137

Open
tnaumann opened this Issue Oct 3, 2016 · 2 comments

Contributor

tnaumann commented Oct 3, 2016

There are three non-visible characters (\x7f, \x14, and \x13) that appear in the notes. These characters cause issues when working with many text processing tools. Removing these characters will make it easier to work with the notes and should not change the underlying meaning.

Currently, when using tools (e.g., cTAKES) that process a whole directory, one can work around this issue by manually removing the offending characters:

find notes -type f -exec sed -i 's/\x7f//g; s/\x14//g; s/\x13//g' {} +
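For anyone without GNU sed, the same cleanup can be sketched in Python. This is only a rough equivalent of the one-liner above, not a tested replacement for it. The names `strip_offending` and `clean_directory` are made up for illustration, and the files are handled as raw bytes so that no text encoding has to be assumed.

```python
from pathlib import Path

# The three bytes reported in this issue: DEL (0x7f), DC4 (0x14), DC3 (0x13).
OFFENDING = b"\x7f\x14\x13"


def strip_offending(data: bytes) -> bytes:
    """Return data with every occurrence of the three reported bytes removed."""
    # With table=None, bytes.translate only deletes the listed bytes.
    return data.translate(None, OFFENDING)


def clean_directory(root: str) -> int:
    """Rewrite each file under root in place; return how many files changed."""
    changed = 0
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        original = path.read_bytes()
        cleaned = strip_offending(original)
        if cleaned != original:
            path.write_bytes(cleaned)
            changed += 1
    return changed
```

Calling `clean_directory("notes")` would modify the files in place, just like `sed -i`, so it is worth running on a copy first.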

These characters appear in ~700 notes. A truncated character distribution shows:

[ # ...
 ('\x13', 677),
 ('\x14', 49),
 ('\x7f', 2)]
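A distribution like the one above can be produced with a character counter. The sketch below assumes the list holds character occurrences summed over all notes. That is a guess about how the numbers were generated, and `char_distribution` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def char_distribution(texts: Iterable[str]) -> List[Tuple[str, int]]:
    """Count every character across all texts, most frequent first."""
    counts: Counter = Counter()
    for text in texts:
        # Counter.update on a string adds one count per character.
        counts.update(text)
    return counts.most_common()
```

The rare characters end up at the tail of the returned list, which is why only the end of the distribution is interesting here.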

There does not appear to be overlap among the affected notes.

find notes -type f | xargs grep --color='auto' -P -l "[\x7f]" | sort | uniq -c
      1 notes/1062359
      1 notes/862440
find notes -type f | xargs grep --color='auto' -P -l "[\x14]" | sort | uniq -c
      1 notes/1437693
      1 notes/1482241
...
find notes -type f | xargs grep --color='auto' -P -l "[\x13]" | sort | uniq -c
      1 notes/1901399
      1 notes/1902291
...
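The overlap check can also be done programmatically once the three file lists have been collected. This is a minimal sketch: `overlapping_notes` is a hypothetical helper, and its input is assumed to map each character to the set of file names reported by the corresponding `grep -l` run.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def overlapping_notes(
    files_by_char: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Return the files shared by each pair of characters, omitting empty pairs."""
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    for (a, files_a), (b, files_b) in combinations(files_by_char.items(), 2):
        shared = files_a & files_b
        if shared:
            overlaps[(a, b)] = shared
    return overlaps
```

An empty result means that no note contains more than one of the three characters.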
Owner

alistairewj commented Oct 4, 2016

Thanks for the report. I think there are some issues in the conversion from the original Microsoft SQL Server to Oracle, which was done before Tom and I got here. The notes are actually stored in a hexadecimal format and need decoding in order to be readable as plain text. I tried decoding them as UTF-8, but then you get Latin-1 characters sneaking in (in particular the Latin-1 character for a space, which is \x13 I think). I also tried decoding them as Latin-1, but then I received other errors. When we re-extract the notes this is definitely something we will pay attention to.

Have you noticed that it only occurs in Metavision notes? That would be consistent with what I've found. The Metavision notes have categories: Nursing, Rehab Services, Case Management, General, Consult, Nutrition, Social Work, Pharmacy, Physician, Respiratory. The rest are sourced elsewhere (Nursing/other is CareVue, and the others are from the hospital database).
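To make the decoding step concrete, here is a rough Python sketch of turning a hex-encoded note into text and listing any control characters that survive. It is not the actual extraction code. The helper names are hypothetical, and the input is assumed to be a plain string of hex digits. Python's latin-1 codec accepts every byte value, so this sketch will not reproduce the Latin-1 errors mentioned above.

```python
from typing import List


def decode_hex_note(hex_text: str, encoding: str = "utf-8") -> str:
    """Decode a hex string to text, replacing undecodable bytes with U+FFFD."""
    raw = bytes.fromhex(hex_text)
    return raw.decode(encoding, errors="replace")


def control_chars(text: str) -> List[str]:
    """Return the distinct control characters in text, ignoring tab, LF and CR."""
    return sorted(
        {
            ch
            for ch in text
            if (ord(ch) < 0x20 and ch not in "\t\n\r") or ord(ch) == 0x7F
        }
    )
```

Running `control_chars` on the decoded text under each candidate encoding shows which control characters remain in each case.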

Contributor

tnaumann commented Oct 6, 2016

They seem to appear mostly in CareVue notes. If you look at the breakdown by category, it's primarily Nursing/other.

mimic=# select category, count(1) from mimiciii.noteevents where text ~ '\x7f' group by category;
 category  | count
-----------+-------
 Radiology |     2
(1 row)

mimic=# select category, count(1) from mimiciii.noteevents where text ~ '\x14' group by category;
   category    | count
---------------+-------
 Nursing/other |    47
(1 row)

mimic=# select category, count(1) from mimiciii.noteevents where text ~ '\x13' group by category;
   category    | count
---------------+-------
 Nursing/other |   593
(1 row)