# Getting Started With `pyDeid`

Follow the installation instructions in the [README.md](https://gitlab.smh.smhroot.net/geminidata/pydeid) before running this notebook.

## Running a basic example

If you are running this notebook locally without a virtual environment, run the code block below. The path to add to the search path is the `Location` value reported by `!pip show pydeid`:

In [1]:
import sys
sys.path.insert(0,'/path/from/pip/show/pydeid')

For this demo, import the following functions from `pyDeid`.

`deid_string`, `reid_string`, and `display_deid` are imported here for demonstration only and are not required in the usual workflow. They are useful for testing and debugging `pyDeid`, and for investigating errors that occur during de-identification.

In [2]:
from pyDeid import pyDeid, deid_string, reid_string, display_deid

Test out the installation using the following example:

In [7]:
original_string = 'Justin Bieber was born in London on March 1st, 1994.'
phi, new_string = deid_string(original_string)

`deid_string` takes a string and returns two values: a `phi` list describing each piece of PHI (protected health information) found, and a `new_string` in which that PHI is replaced with surrogates:

In [8]:
phi

[{'phi_start': 0,
  'phi_end': 6,
  'phi': 'Justin',
  'surrogate_start': 0,
  'surrogate_end': 7,
  'surrogate': 'Lorinda',
  'types': ['Female First Name (ambig)',
   'Male First Name (ambig)',
   'Last Name (ambig)',
   'First Name8 (NamePattern2)']},
 {'phi_start': 7,
  'phi_end': 13,
  'phi': 'Bieber',
  'surrogate_start': 8,
  'surrogate_end': 13,
  'surrogate': 'Pronk',
  'types': ['Last Name (un)']},
 {'phi_start': 36,
  'phi_end': 51,
  'phi': Date(date_string='March 1st, 1994', day='1', month='March', year='1994'),
  'surrogate_start': 36,
  'surrogate_end': 45,
  'surrogate': '2006/13/5',
  'types': ['Month Day Year (2) [Month dd, yy(yy)]']}]

In [9]:
new_string

'Lorinda Pronk was born in London on 2006/13/5.'
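To make the offset fields concrete, here is a minimal sketch of how `phi_start`, `phi_end`, and `surrogate` relate the input to `new_string`. It does not call `pyDeid` and is not its internal implementation; the helper names `found_spans` and `apply_surrogates` are hypothetical, and the `phi` list is a trimmed copy of the recorded output above. The sentence is written out again so the block runs on its own.

```python
# Input sentence whose character offsets match the recorded `phi` output above.
original_string = 'Justin Bieber was born in London on March 1st, 1994.'

# Trimmed copy of the recorded `phi` list: only the fields this sketch needs.
phi = [
    {'phi_start': 0, 'phi_end': 6, 'surrogate': 'Lorinda'},
    {'phi_start': 7, 'phi_end': 13, 'surrogate': 'Pronk'},
    {'phi_start': 36, 'phi_end': 51, 'surrogate': '2006/13/5'},
]


def found_spans(text, phi):
    """Return the substrings of `text` that each entry's offsets point at."""
    return [text[entry['phi_start']:entry['phi_end']] for entry in phi]


def apply_surrogates(text, phi):
    """Rebuild the de-identified string by splicing in each surrogate."""
    pieces, cursor = [], 0
    for entry in sorted(phi, key=lambda e: e['phi_start']):
        pieces.append(text[cursor:entry['phi_start']])  # untouched text
        pieces.append(entry['surrogate'])               # replacement
        cursor = entry['phi_end']
    pieces.append(text[cursor:])                        # trailing text
    return ''.join(pieces)


print(found_spans(original_string, phi))
print(apply_surrogates(original_string, phi))
```

`phi_start` and `phi_end` index into the original string, while `surrogate_start` and `surrogate_end` index into `new_string`; the two sets of offsets drift apart whenever a surrogate differs in length from the text it replaces.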

## `pyDeid` Features

`display_deid` allows for visualization of the de-identification in interactive settings such as in Jupyter notebooks. This can be useful for demonstration and debugging:

In [10]:
display_deid(original_string, phi)

We can also re-identify the string to recover the original:

In [11]:
reid_string(new_string, phi)

'Justin Bieber was born in London on March 1st, 1994.'

Some sites may have custom identifiers. We can also replace these by supplying a custom regular expression as a named argument to either `deid_string` or `pyDeid` as follows:

In [12]:
original_string = 'Niagara has a custom patient identifier in the format NH12345.'

phi, new_string = deid_string(original_string, niagara_patient_id = r'NH\d{5}')

Supplied custom regexes through **kwargs (see custom_regexes in docstring):

- niagara_patient_id : NH\d{5}

These custom patterns will be replaced with <PHI>.



In [13]:
new_string

'Niagara has a custom patient identifier in the format <PHI>.'

Note that the name of the argument that was supplied is used to identify the PHI type:

In [14]:
phi

[{'phi_start': 54,
  'phi_end': 61,
  'phi': 'NH12345',
  'surrogate_start': 54,
  'surrogate_end': 59,
  'surrogate': '<PHI>',
  'types': ['niagara_patient_id']}]

Currently, matches of these custom regular expressions are replaced with `<PHI>` placeholders. In the future, the user will be able to supply a custom replacement string generator function.
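Before passing a custom pattern to `deid_string` or `pyDeid`, it can help to try it with Python's standard `re` module. The sketch below assumes the pattern is applied with ordinary search semantics, which the source does not confirm; the stricter `\b` variant is a suggestion, not something `pyDeid` requires.

```python
import re

text = 'Niagara has a custom patient identifier in the format NH12345.'

# A raw string (r'...') keeps the backslash in \d from being treated
# as a Python string escape.
pattern = re.compile(r'NH\d{5}')

match = pattern.search(text)
print(match.group(), match.span())

# Emulate the placeholder replacement described above.
masked = pattern.sub('<PHI>', text)
print(masked)

# Without word boundaries the pattern also matches inside longer IDs.
strict = re.compile(r'\bNH\d{5}\b')
loose_hit = pattern.search('ref NH1234567') is not None
strict_hit = strict.search('ref NH1234567') is not None
print(loose_hit, strict_hit)
```

The span found by `re` lines up with the `phi_start` and `phi_end` values in the `phi` output above.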

`pyDeid` can also run a CNN-based named entity recognition pass over the string to catch names it would otherwise miss. Here is how a rare name is treated *without* named entity recognition:

In [21]:
original_string = 'Frodo Baggins was born in Middle Earth.'

phi, new_string = deid_string(
    original_string, 
    named_entity_recognition=False
)

display_deid(original_string, phi)

And *with* named entity recognition:

In [27]:
phi, new_string = deid_string(
    original_string, 
    named_entity_recognition=True
)

display_deid(original_string, phi)

However, since we have access to patient and doctor names in our master linking logs (MLLs), we will not need to use named entity recognition in our workflow (more on this in section 4.1).

We can supply a list of:

1. Patient first names (through `custom_patient_first_names`)
2. Patient last names (through `custom_patient_last_names`)
3. Doctor first names (through `custom_dr_first_names`)
4. Doctor last names (through `custom_dr_last_names`)

See the example usage below:

In [28]:
phi, new_string = deid_string(
    original_string, 
    custom_patient_first_names={'Frodo'}, 
    custom_patient_last_names={'Baggins'}
)

display_deid(original_string, phi)

Note that these custom name lists are supplied as Python `set`s for fast lookup.

Details for all of the above options are available in the function docstring:

In [None]:
deid_string?

## Bulk De-identification

In our workflow, we currently de-identify large CSVs of free-text clinical notes and radiology reports. Although users can write a custom loop around `deid_string` to de-identify a large CSV, the `pyDeid` function already does this and adds some useful features.
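For comparison, the custom loop mentioned above might look roughly like the sketch below. It is an illustration only: `deid_rows` is a hypothetical helper, and `stand_in_deid` is a placeholder with the same `(phi, new_string)` return shape as `deid_string`, used so the block runs without `pyDeid` installed. In practice you would pass `deid_string` itself.

```python
def deid_rows(rows, note_field, deidentify):
    """Yield (de-identified row, phi list) for each input row.

    `deidentify` is any callable returning (phi, new_string), such as
    pyDeid's `deid_string`.
    """
    for row in rows:
        phi, new_text = deidentify(row[note_field])
        # Copy the row so the caller's data is not modified in place.
        yield {**row, note_field: new_text}, phi


def stand_in_deid(text):
    # Hypothetical placeholder, NOT pyDeid: masks one hard-coded name.
    return [], text.replace('Frodo', '<PHI>')


rows = [
    {'encounter_id': '1', 'note_id': '1', 'note_text': 'Frodo left the Shire.'},
    {'encounter_id': '1', 'note_id': '2', 'note_text': 'No names here.'},
]

results = list(deid_rows(rows, 'note_text', stand_in_deid))
for new_row, phi in results:
    print(new_row['note_text'])
```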

The most basic usage of the function only requires the user to supply the name of the file to be de-identified (`original_file`), the name of the column containing a unique identifier for the encounter (`encounter_id_varname`; in our applications this will generally be the `genc_id`), and the name of the column containing the note text to be de-identified (`note_varname`).

Note that some sites report multiple notes per encounter, so `encounter_id_varname` alone is not enough to uniquely identify a note. In these cases, please supply a `note_id_varname` in addition to the `encounter_id_varname`.
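One way to decide whether a `note_id_varname` is needed is to check for repeated keys before running the bulk function. This is a standard-library sketch with made-up rows; `duplicated_keys` is a hypothetical helper and not part of `pyDeid`.

```python
from collections import Counter


def duplicated_keys(rows, key_fields):
    """Return the key tuples that occur in more than one row."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical rows: encounter '1' has two notes.
rows = [
    {'encounter_id': '1', 'note_id': '1', 'note_text': 'first note'},
    {'encounter_id': '1', 'note_id': '2', 'note_text': 'second note'},
    {'encounter_id': '2', 'note_id': '1', 'note_text': 'third note'},
]

by_encounter = duplicated_keys(rows, ['encounter_id'])
by_encounter_and_note = duplicated_keys(rows, ['encounter_id', 'note_id'])

print(by_encounter)           # encounter IDs shared by several notes
print(by_encounter_and_note)  # empty when the pair is unique
```

If the first list is non-empty and the second is empty, the pair of columns identifies each note uniquely and both should be supplied.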

Below we will de-identify a file containing dialog from Lord of the Rings:

In [None]:
pyDeid(
    original_file='./tests/test.csv',
    encounter_id_varname='encounter_id',
    note_id_varname='note_id',
    note_varname='note_text'
)

Note that `pyDeid` accepts many other arguments, which can be seen in the function docstring:

In [29]:
pyDeid?

Just as in `deid_string`, custom regular expressions can be supplied as named arguments through `**custom_regexes`, named entity recognition can be enabled through `named_entity_recognition`, and custom patient and doctor names can be supplied through `custom_{dr/patient}_{first/last}_names`.

We can specify a custom output filename for the de-identified file through `new_file`, and for the found PHI details through `phi_output_file`. If these names are not specified, they will default to `{original filename without extension}__DEID.csv` and `{original filename without extension}__PHI.csv` respectively.
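The default naming rule can be sketched with `pathlib`. This only reproduces the file-name pattern stated above; which directory `pyDeid` writes the files to is not covered here, and `default_output_names` is a hypothetical helper.

```python
from pathlib import Path


def default_output_names(original_file):
    """Return the default (de-identified file, PHI file) names."""
    stem = Path(original_file).stem  # file name without its extension
    return f'{stem}__DEID.csv', f'{stem}__PHI.csv'


deid_name, phi_name = default_output_names('./tests/test.csv')
print(deid_name, phi_name)
```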

By default, `phi_output_file` is saved as a `csv`. This is the recommended file format for large files (we can consider all files for our applications to be "large"). `verbose` defaults to `True`, so status updates are printed during the run and diagnostics after it. Please see the function docstring for more details.

## Full Working Example

Below we show how to de-identify a file in a typical workflow.

First, we read the MLLs into Python `set`s.

In [30]:
import pandas as pd

MLL_filepath='./tests/mll.csv'
MLL = pd.read_csv(MLL_filepath)

MLL.head()

| Unnamed: 0 | id | char |
|---|---|---|
| 0 | 0 | DEAGOL |
| 1 | 1 | SMEAGOL |
| 2 | 2 | DEAGOL |
| 3 | 3 | SMEAGOL |
| 4 | 4 | SMEAGOL |


In [33]:
character_first_names = set(MLL.char)
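Real linking logs may contain blank cells or stray whitespace, which would end up in the set as-is. The sketch below shows one way to clean a name column, using only the standard library and made-up in-memory data so that it runs without the test CSV. Whether `pyDeid` normalizes names itself is not stated in this notebook, so treat the cleaning as optional; `names_from_column` is a hypothetical helper.

```python
import csv
import io

# Made-up stand-in for the MLL file, with one blank and one padded name.
mll_csv = io.StringIO(
    'id,char\n'
    '0,DEAGOL\n'
    '1,SMEAGOL\n'
    '2,\n'
    '3, SMEAGOL \n'
)


def names_from_column(lines, column):
    """Return the distinct, non-empty, whitespace-stripped values of a column."""
    reader = csv.DictReader(lines)
    return {
        row[column].strip()
        for row in reader
        if row[column] and row[column].strip()
    }


clean_names = names_from_column(mll_csv, 'char')
print(sorted(clean_names))
```

With pandas, a comparable expression would be `set(MLL['char'].dropna().str.strip())`.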

In [34]:
pyDeid(
    original_file='./tests/test.csv',
    encounter_id_varname='encounter_id',
    note_id_varname='note_id',
    note_varname='note_text',
    custom_patient_first_names=character_first_names
)

Processing encounter 2389: : 2390it [10:31,  3.79it/s]

Diagnostics:
                - chars/s = 230.95986752654892
                - s/note = 0.26413507750842363
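The diagnostics can be cross-checked against the progress bar and used for rough planning. The sketch below assumes that `s/note` is the mean wall-clock time per note, that `chars/s` is measured over the same run, and that run time scales linearly with the number of notes; none of this is confirmed by the output itself. The note count of 2390 is taken from the progress bar.

```python
notes = 2390                              # from the progress bar ("2390it")
seconds_per_note = 0.26413507750842363    # reported s/note
chars_per_second = 230.95986752654892     # reported chars/s

# Total run time implied by the per-note average.
total_seconds = notes * seconds_per_note
minutes, seconds = divmod(total_seconds, 60)
print(f'{int(minutes)} min {seconds:.0f} s')

# Throughput in notes per second, comparable to the "it/s" figure.
notes_per_second = 1 / seconds_per_note
print(f'{notes_per_second:.2f} notes/s')

# Average note length implied by the two diagnostics.
avg_chars_per_note = chars_per_second * seconds_per_note
print(f'{avg_chars_per_note:.0f} chars/note')


def estimated_hours(n_notes, s_per_note=seconds_per_note):
    """Linear extrapolation of run time, in hours."""
    return n_notes * s_per_note / 3600


print(f'{estimated_hours(1_000_000):.1f} h for one million similar notes')
```

These figures agree with the `10:31` elapsed time and `3.79it/s` shown by the progress bar. Notes in this test file are short, so longer clinical notes should be expected to take more time per note.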



