# NLP for Healthcare
## Challenges with Processing and De-identifying Clinical Notes

#### Chloe Pou-Prom, Data Scientist, Unity Health Toronto
#### Vaakesan Sundrelingam, Data Scientist, Unity Health Toronto

This is a demonstration of `pyDeid`, a Python-based de-identification tool that identifies and replaces personal health information (PHI) in free-text clinical data.

We also take a look at several deep-learning-based named-entity recognition approaches, as well as some interesting and difficult cases.

### 1 Installing `pyDeid`

`pyDeid` will be released as an open-source Python package from the [GEMINI GitHub Repository](https://github.com/GEMINI-Medicine).

Once the package is available, it can be downloaded (and unzipped) or cloned to a local folder, for example `/path/to/package/`. It can then be installed either with `pip3 install /path/to/package/` or by running `python3 setup.py install` from the top-level directory.

Please follow the package `README.md` for additional details.

To import from this package, you may need to add the install location (found via `pip show pyDeid`) to your `$PYTHONPATH`, or to `sys.path` as shown below, so that Python knows where to look for it.

In [1]:
import sys
sys.path.insert(0,'/path/from/pip/show/pyDeid')
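
Before editing `sys.path`, it can be worth checking whether Python already finds the package. The following is a minimal sketch using only the standard library; the helper name `is_importable` is ours for illustration and is not part of `pyDeid`.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate a top-level module on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


# 'pyDeid' is the import name used in this demo. If this prints False,
# the install location still needs to be added to the search path.
print(is_importable("pyDeid"))
```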

We can then import the relevant functions for this demo:

In [2]:
from pyDeid import pyDeid, deid_string, reid_string, display_deid
import pandas as pd

We will also import the following to perform some diagnostics:

In [3]:
import os
import time

### 2 Getting Started

We can get started with `pyDeid` right away, using a simple example. `pyDeid` provides functions to de-identify a single string, visualize the found PHI, and re-identify the same string.

In [4]:
original_string = 'Elijah Wood (star of the Lord of the Rings) was born on January 28th, 1981.'
phi, new_string = deid_string(original_string)

The function returns the captured PHI for inspection:

In [5]:
phi

[{'phi_start': 0,
  'phi_end': 6,
  'phi': 'Elijah',
  'surrogate_start': 0,
  'surrogate_end': 7,
  'surrogate': 'Natalya',
  'types': ['Male First Name (un)',
   'Last Name (un)',
   'First Name4 (NamePattern1)']},
 {'phi_start': 7,
  'phi_end': 11,
  'phi': 'Wood',
  'surrogate_start': 8,
  'surrogate_end': 16,
  'surrogate': 'Kleekamp',
  'types': ['Last Name (ambig)', 'Last Name (NamePattern1)']},
 {'phi_start': 56,
  'phi_end': 74,
  'phi': Date(date_string='January 28th, 1981', day='28', month='January', year='1981'),
  'surrogate_start': 61,
  'surrogate_end': 79,
  'surrogate': '20th-February-1965',
  'types': ['Month Day Year (2) [Month dd, yy(yy)]']}]
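
The `phi_start` and `phi_end` values read as character offsets into the original string, with an exclusive end as in Python slicing. A small check using the offsets copied from the output above (a sketch, not part of the `pyDeid` API):

```python
original_string = 'Elijah Wood (star of the Lord of the Rings) was born on January 28th, 1981.'

# (phi_start, phi_end) pairs copied from the output of this run.
offsets = [(0, 6), (7, 11), (56, 74)]

# Slicing the original string with each pair recovers the captured PHI text.
found = [original_string[start:end] for start, end in offsets]
print(found)  # ['Elijah', 'Wood', 'January 28th, 1981']
```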

We can visualize the found PHI with the `display_deid` function:

In [6]:
display_deid(original_string, phi)

The `deid_string` function outputs a new string with surrogates in place of the original PHI:

In [7]:
new_string

'Natalya Kleekamp (star of the Lord of the Rings) was born on 20th-February-1965.'

Note that the format of the new date was randomized.

We can use this new string in combination with the found PHI output from `deid_string` to re-identify the string using `reid_string`:

In [8]:
reid_string(new_string, phi)

'Elijah Wood (star of the Lord of the Rings) was born on January 28th, 1981.'
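
To make the role of the surrogate offsets concrete, here is a minimal sketch of how re-identification could work from the recorded `surrogate_start` and `surrogate_end` positions. This is an illustration under our own assumptions, not `pyDeid`'s actual implementation; the function name `reidentify` is hypothetical, and the date is stored as a plain string rather than the `Date` object shown earlier.

```python
def reidentify(deid_text, phi):
    """Put the original PHI back, working right to left so earlier offsets stay valid."""
    text = deid_text
    for item in sorted(phi, key=lambda p: p["surrogate_start"], reverse=True):
        text = text[:item["surrogate_start"]] + item["phi"] + text[item["surrogate_end"]:]
    return text


# Values copied from the run above, keeping only the fields needed here.
phi_simplified = [
    {"phi": "Elijah", "surrogate_start": 0, "surrogate_end": 7},
    {"phi": "Wood", "surrogate_start": 8, "surrogate_end": 16},
    {"phi": "January 28th, 1981", "surrogate_start": 61, "surrogate_end": 79},
]
new_string = 'Natalya Kleekamp (star of the Lord of the Rings) was born on 20th-February-1965.'

restored = reidentify(new_string, phi_simplified)
print(restored)
```

Replacing from the end of the string backwards matters because surrogates and originals differ in length; substituting left to right would shift every later offset.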

The `deid_string` and `reid_string` functions are designed for demonstration and testing. To de-identify free text in bulk, we provide the `pyDeid` function. We can use this function on a test CSV file that looks like the following:

In [9]:
pd.read_csv("test.csv")

| Unnamed: 0 | encounter_id | note_id | note_text |
|---|---|---|---|
| 0 | 1 | Record 1 | Elijah Wood (star of the Lord of the Rings) wa... |
| 1 | 2 | Record 1 | This event is being held at 192 Spadina Ave, T... |


And the entire file can be de-identified as follows:

In [10]:
pyDeid(
    original_file = 'test.csv',
    note_varname = 'note_text', 
    encounter_id_varname = 'encounter_id',
    phi_output_file_type = 'json'
    )

Processing encounter 2: : 2it [00:00,  6.82it/s]


Diagnostics:
                - chars/s = 475.5608467408189
                - s/note = 0.14719462394714355
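
How `pyDeid` computes these diagnostics is not shown in this demo. As a rough sketch of what such figures mean, throughput can be measured by timing a de-identification function over a list of notes. The helper `time_deid` and the stand-in function `str.upper` below are assumptions for illustration only.

```python
import time


def time_deid(notes, deid_fn):
    """Time deid_fn over all notes and report characters per second and seconds per note."""
    start = time.perf_counter()
    for note in notes:
        deid_fn(note)
    elapsed = time.perf_counter() - start
    total_chars = sum(len(note) for note in notes)
    return {
        "chars_per_s": total_chars / elapsed if elapsed > 0 else float("inf"),
        "s_per_note": elapsed / len(notes),
    }


# str.upper is only a placeholder for a real de-identification call.
stats = time_deid(["first example note", "second example note"], str.upper)
print(stats)
```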


`pyDeid` provides many additional options, as can be found in the function docstring:

In [11]:
pyDeid?

Some options to note are the ability to supply custom lists of doctor and patient names, the ability to define custom regular expressions to scrub from the text, and the option to use a slower named-entity-recognition approach (which we will discuss in more detail later).

The simplest of these are the custom regular expressions. There are cases where a particular data source might have some unique pattern that should be replaced. This may be a unique identifier that is specific to a data source, such as in the following example:

In [12]:
example1 = "Your unique identifier for today's event is TMLS123456."

example1_phi, _ = deid_string(example1, ID=r'TMLS\d{6}')

display_deid(example1, example1_phi)

Supplied custom regexes through **kwargs (see custom_regexes in docstring):

- ID : TMLS\d{6}

These custom patterns will be replaced with <PHI>.
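
To see what the pattern itself matches, independent of `pyDeid`, the same regular expression can be tried with Python's built-in `re` module. This is only a sketch of the substitution; `pyDeid` additionally records offsets and PHI types.

```python
import re

# A raw string keeps the backslash in \d from being treated as a string escape.
ID_PATTERN = re.compile(r"TMLS\d{6}")

example = "Your unique identifier for today's event is TMLS123456."
scrubbed = ID_PATTERN.sub("<PHI>", example)
print(scrubbed)  # Your unique identifier for today's event is <PHI>.

# An identifier with only five digits does not match and is left untouched.
print(ID_PATTERN.sub("<PHI>", "TMLS12345"))  # TMLS12345
```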



In a similar vein, for some applications the user may have access to a record of the doctor and patient names associated with the clinical notes. When such lists are available, sensitivity/recall $\big(\frac{TP}{TP+FN}\big)$ can be improved by passing a Python `set` containing these names to `pyDeid`.

We can demonstrate this below with `deid_string`. Without supplying these lists, certain names are difficult to recognize using the plain regex approach:

In [13]:
example2 = "Bilbo Baggins is a hobbit."

example2_phi, _ = deid_string(example2)

display_deid(example2, example2_phi)

However, by providing a custom list of patient or doctor names, we can do better:

In [14]:
example2_phi, _ = deid_string(
    example2, 
    custom_patient_first_names={'Bilbo'},
    custom_patient_last_names={'Baggins'}
)

display_deid(example2, example2_phi)
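
The effect on recall can be illustrated with a toy calculation. The detection results below are assumed for the sake of the example (the regex-only run missing both names, the run with custom lists finding both); they are not measured `pyDeid` output.

```python
def recall(gold, found):
    """Recall = TP / (TP + FN), computed over sets of PHI strings."""
    true_positives = len(gold & found)
    false_negatives = len(gold - found)
    return true_positives / (true_positives + false_negatives)


gold = {"Bilbo", "Baggins"}              # names a human annotator would flag
found_without_lists = set()              # assumption: plain regex misses both
found_with_lists = {"Bilbo", "Baggins"}  # assumption: custom lists catch both

print(recall(gold, found_without_lists))  # 0.0
print(recall(gold, found_with_lists))     # 1.0
```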

If we don't have access to a list of patient or doctor names, we may still be able to do a better job than plain regex by using named entity recognition:

In [34]:
example2_phi, _ = deid_string(
    example2, 
    named_entity_recognition=True
)

display_deid(example2, example2_phi)

### 3 Other Approaches to Named Entity Recognition

Consider the case below, where a name is also the word for an object. If we rely on a purely regex approach, we will miss this PHI:

In [16]:
example3 = "patient has to rely on walker to travel"

example3_phi, _ = deid_string(example3)
display_deid(example3, example3_phi)

The above case is somewhat ambiguous (it could be grammatical shorthand); however, if we capitalize `Walker`, it becomes clearer (to a human) that this potentially refers to a person:

In [17]:
example4 = "patient has to rely on Walker to travel"

example4_phi, _ = deid_string(example4)
display_deid(example4, example4_phi)

Even so, the regex-only approach fails to capture this. However, we can use the named entity recognition approach:

In [18]:
example4_phi, _ = deid_string(example4, named_entity_recognition=True)
display_deid(example4, example4_phi)

The named entity recognition step above uses `spaCy`'s CNN-based NER model. However, there are alternatives, such as the BERT-based NER model from `huggingface`:

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
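# Workaround for SSL certificate errors on some networks: an empty CURL_CA_BUNDLE
# can disable certificate verification for the model download, so use it with care.
os.environ["CURL_CA_BUNDLE"] = ""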

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER-uncased")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER-uncased")

bert_ner = pipeline("ner", model=model, tokenizer=tokenizer)

We define a function below to take the output of the BERT NER pipeline and display the output neatly:

In [20]:
import spacy
from spacy import displacy

def display_ner(text, ner_result, title=None, ignore=None):
    """Visualize Hugging Face NER pipeline output with spaCy's displaCy."""
    ents = []
    for ent in ner_result:
        if ent["entity"] == ignore:
            continue
        # strip the B-/I- prefix so that "B-PER" and "I-PER" share the label "PER"
        label = ent["entity"].split("-")[-1]
        if ents and -1 <= ent["start"] - ents[-1]["end"] <= 1 and label == ents[-1]["label"]:
            # the current token continues the previous entity:
            # extend the previous entity's end position instead of adding a new one
            ents[-1]["end"] = ent["end"]
            continue
        # otherwise start a new entity with its start and end positions
        ents.append({"start": ent["start"], "end": ent["end"], "label": label})
    # construct the data required by displacy.render() in manual mode
    render_data = [
        {
            "text": text,
            "ents": ents,
            "title": title,
        }
    ]
    displacy.render(render_data, style="ent", manual=True, jupyter=True)
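
The token-merging idea can be tried without loading a model. The sketch below is a variant that strips the `B-`/`I-` prefix before comparing labels, applied to hand-written input in the shape returned by the `transformers` NER pipeline (a list of dicts with `entity`, `word`, `start`, and `end`). The sample predictions are made up for illustration and are not real model output; `merge_adjacent` is a hypothetical helper name.

```python
def merge_adjacent(ner_result, max_gap=1):
    """Merge neighbouring token-level predictions that share an entity type."""
    merged = []
    for tok in ner_result:
        label = tok["entity"].split("-")[-1]  # "B-PER" / "I-PER" -> "PER"
        prev = merged[-1] if merged else None
        if prev and prev["label"] == label and tok["start"] - prev["end"] <= max_gap:
            # Same entity type and (almost) contiguous: extend the previous span.
            prev["end"] = tok["end"]
        else:
            merged.append({"start": tok["start"], "end": tok["end"], "label": label})
    return merged


text = "peter jackson directed it"
# Hand-written token predictions, not produced by a model.
sample = [
    {"entity": "B-PER", "word": "peter", "start": 0, "end": 5},
    {"entity": "I-PER", "word": "jackson", "start": 6, "end": 13},
]

spans = merge_adjacent(sample)
print([(text[s["start"]:s["end"]], s["label"]) for s in spans])
# [('peter jackson', 'PER')]
```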

In [21]:
display_ner(example3, bert_ner(example3))

In [22]:
display_ner(example4, bert_ner(example4))

The BERT-based model appears to struggle with names that are not title-cased:

In [23]:
example5 = "The Lord of the Rings was directed by Pter Jacksn"

In [24]:
display_ner(example5, bert_ner(example5))

But `spaCy`'s NER model (the same implementation used in `pyDeid`) appears to handle lowercase names better:

In [None]:
from spacy import displacy
spacy_ner = spacy.load("en_core_web_sm")

In [26]:
displacy.render(spacy_ner(example5), style="ent")

`pyDeid`'s regex-only approach will have difficulty handling lowercase names:

In [27]:
example5_phi, _ = deid_string(example5, named_entity_recognition=False)
display_deid(example5, example5_phi)

There are still some issues with the CNN-based NER model, such as in the distinction between Organizations and Persons:

In [28]:
example6 = "Peregrin Took and Meriadoc Brandybuck are also hobbits"

In [29]:
displacy.render(spacy_ner(example6), style="ent")

And similarly for the BERT-based NER model:

In [30]:
display_ner(example6, bert_ner(example6))

### 4 Cases

Ensure that the same de-identification is applied to the same person:

In [31]:
example7 = "The Lord of the Rings film trilogy was directed by Pete Jackson. The screenplay was also written by Pete."

example7_phi, example7_deid = deid_string(example7)
print(example7_deid)

The Lord of the Rings film trilogy was directed by Palma Kit. The screenplay was also written by Palma.
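
One simple way to get this behaviour is to remember which surrogate was assigned to each name and reuse it on later mentions. The class below is a minimal sketch of that idea under our own assumptions (the class name, the candidate list, and the case-insensitive matching are all illustrative); it is not how `pyDeid` is implemented.

```python
import random


class ConsistentSurrogates:
    """Assign each distinct name one surrogate and reuse it on every later mention."""

    def __init__(self, candidates, seed=0):
        self._rng = random.Random(seed)
        self._candidates = list(candidates)
        self._mapping = {}

    def get(self, name):
        key = name.lower()  # treat "Pete" and "pete" as the same person
        if key not in self._mapping:
            self._mapping[key] = self._rng.choice(self._candidates)
        return self._mapping[key]


surrogates = ConsistentSurrogates(["Palma", "Natalya", "Kit"])
first_mention = surrogates.get("Pete")
second_mention = surrogates.get("Pete")
print(first_mention == second_mention)  # True
```

Note that this sketch does not prevent two different names from drawing the same surrogate; a real tool would need to decide whether that matters.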


Ensure that surgical or other domain-specific terminology does not generate false positives. This is done using a "whitelist" tailored to your specific use case.

In [32]:
example8 = "recommend Jackson-Pratt drain"
example8_phi, example8_deid = deid_string(example8)

print(example8_deid)

recommend Jackson-Pratt drain
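
The whitelist idea can be sketched as a filter over candidate PHI spans: any candidate that falls inside a whitelisted term is dropped. The function name, the candidate spans, and the whitelist contents below are assumptions for illustration, not `pyDeid`'s actual mechanism.

```python
import re


def drop_whitelisted(text, candidates, whitelist):
    """Remove candidate (start, end) spans that fall inside a whitelisted term."""
    protected = []
    for term in whitelist:
        for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            protected.append(match.span())
    return [
        (start, end)
        for start, end in candidates
        if not any(p_start <= start and end <= p_end for p_start, p_end in protected)
    ]


whitelist = {"Jackson-Pratt"}

# "Jackson" (10, 17) and "Pratt" (18, 23) hypothetically flagged as names.
kept = drop_whitelisted("recommend Jackson-Pratt drain", [(10, 17), (18, 23)], whitelist)
print(kept)  # []

# Outside the whitelisted term, the same surname is still treated as PHI.
other = drop_whitelisted("seen by Dr. Jackson", [(12, 19)], whitelist)
print(other)  # [(12, 19)]
```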


Ensure that the time delta between dates is maintained:

In [33]:
example9 = """
The Lord of the Rings trilogy was filmed in New Zealand for 438 days from October 11, 1999 through December 22, 2000.
"""

example9_phi, example9_deid = deid_string(example9)
print(example9_deid)


The Lord of the Rings trilogy was filmed in New Zealand for 438 days from 1972-13-07 through 1973/24/09.
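
We can check that the interval survived de-identification with the standard library. The sketch below assumes the surrogate dates in this run are written as year-day-month (the only reading in which 13 and 24 are valid), with the separators shown in the output.

```python
from datetime import datetime

# Original dates from the note.
original_start = datetime.strptime("October 11, 1999", "%B %d, %Y")
original_end = datetime.strptime("December 22, 2000", "%B %d, %Y")

# Surrogate dates from this run, read as year-day-month.
surrogate_start = datetime.strptime("1972-13-07", "%Y-%d-%m")
surrogate_end = datetime.strptime("1973/24/09", "%Y/%d/%m")

original_delta = (original_end - original_start).days
surrogate_delta = (surrogate_end - surrogate_start).days
print(original_delta, surrogate_delta)  # 438 438
```

Both intervals are 438 days, which also matches the duration stated in the note text.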



### 5 Resources

#### About the speakers

* Unity Health Toronto: <https://unityhealth.to/>
* DSAA: <https://unitynet.unity.local/departments-programs-services/corporate-services/data-science-and-advanced-analytics/>
    * Blog: <https://lks-chart.github.io/blog/>
* GEMINI: <https://www.geminimedicine.ca/>
    * Download `pyDeid`: [github.com/GEMINI-Medicine](https://github.com/GEMINI-Medicine)

#### What can we do with clinical notes?

* Yang, Zhen, Chloé Pou-Prom, Ashley Jones, Michaelia Banning, David Dai, Muhammad Mamdani, Jiwon Oh, and Tony Antoniou. 2022. "Assessment of Natural Language Processing Methods for Ascertaining the Expanded Disability Status Scale Score From the Electronic Health Records of Patients With Multiple Sclerosis: Algorithm Development and Validation Study." *JMIR Medical Informatics* 10 (1): e25157. <https://doi.org/10.2196/25157>.

#### Sharing data in healthcare is important

* Seastedt KP, Schwab P, O'Brien Z, Wakida E, Herrera K, et al. (2022) Global healthcare fairness: We should be sharing more, not less, data. PLOS Digital Health 1(10): e0000102. <https://doi.org/10.1371/journal.pdig.0000102>
* Information & Privacy Commissioner of Ontario. Dispelling the Myths Surrounding De-identification: Anonymization Remains a Strong Tool for Protecting Privacy. <https://www.ipc.on.ca/wp-content/uploads/2016/11/anonymization.pdf>

    * https://www.ipc.on.ca/wp-content/uploads/2016/08/Deidentification-Guidelines-for-Structured-Data.pdf
    * https://www.ipc.on.ca/wp-content/uploads/2019/09/2019-08-09-datashare-web.pdf
    * https://nap.nationalacademies.org/resource/18998/ElEmamandMalin-Paper.pdf
    
#### Open Source De-id Tools

[PhysioNet Deid](https://www.physionet.org/content/deid/1.1/):
* Neamatullah I, Douglass M, Lehman LH, Reisner A, Villarroel M, Long WJ, Szolovits P, Moody GB, Mark RG, Clifford GD. Automated De-Identification of Free-Text Medical Records. BMC Medical Informatics and Decision Making, 2008, 8:32. doi:10.1186/1472-6947-8-32
* Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.

[PHIlter](https://github.com/BCHSI/philter-ucsf)
* Norgeot, B., Muenzen, K., Peterson, T.A. et al. Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes. npj Digit. Med. 3, 57 (2020). https://doi.org/10.1038/s41746-020-0258-y

#### Named Entity Recognition

* [spaCy's NER](https://spacy.io/universe/project/video-spacys-ner-model)
* [BERT-base cased NER](https://huggingface.co/dslim/bert-base-NER)
* Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3-26. <https://nlp.cs.nyu.edu/sekine/papers/li07.pdf>

#### What do you need in a de-identification tool?

* Trienes, J., Trieschnigg, D., Seifert, C., & Hiemstra, D. (2020). Comparing rule-based, feature-based and deep neural methods for de-identification of Dutch medical records. arXiv preprint arXiv:2001.05714. https://doi.org/10.48550/arXiv.2001.05714
* Abdalla M, Abdalla M, Rudzicz F, Hirst G. Using word embeddings to improve the privacy of clinical notes. J Am Med Inform Assoc. 2020 Jun 1;27(6):901-907. doi: 10.1093/jamia/ocaa038. PMID: 32388549; PMCID: PMC7309261. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7309261/
