# Evaluating `pyDeid`

This notebook shows how to evaluate `pyDeid` on a custom dataset of Canadian admission notes, as described in the paper:

> pyDeid: An Improved, Fast, Flexible, and Generalizable Rules-based Approach for De-identification of Free-text Medical Records

We also evaluate it on the popular `n2c2` benchmark dataset of American discharge notes, using the [ETUDE engine](https://github.com/MUSC-TBIC/etude-engine).

## Testing CSV output against a gold standard dataset with `CSVEvaluator`

The `CSVEvaluator` class measures the performance of `pyDeid` given two inputs: a test dataset of clinical notes in CSV format, such as `tests/test.csv`, and a "gold standard" dataset in which each note is split into tokens and each token is annotated with the appropriate PHI type.

First we run `pyDeid` on `tests/test.csv`, outputting in `csv` format.

In [1]:
from pyDeid import pyDeid

pyDeid(
    original_file = "../../tests/test.csv",
    note_varname = "note_text",
    encounter_id_varname = "genc_id",
    note_id_varname = "note_id"
)

Processing encounter 3, note Record 3: : 3it [00:01,  2.75it/s]

Diagnostics:
                - chars/s = 149.41586216889525
                - s/note = 0.36363832155863446





Next we need the annotated ground truth dataset.

A ground truth CSV can be created from raw notes in the format of `./tests/test.csv` in three steps:

1. Begin by tokenizing the raw notes using `tokenize_csv()`.
2. Manually annotate the notes for PHI.
3. Using `melt_annotations()`, combine multi-token PHI into a single entry.

In particular, dates and locations usually span multiple tokens, so we merge them explicitly.

In [2]:
from pyDeid.phi_types import tokenize_csv, melt_annotations

tokenize_csv(
    input_file = "../../tests/test.csv",
    output_file = "../../tests/test_tokenized.csv",
    encounter_id_varname = "genc_id",
    note_id_varname = "note_id",
    note_text_varname = "note_text"
)
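To make the tokenized layout concrete, here is a minimal sketch of producing one row per token with an empty `annotation` column. It is not the implementation of `tokenize_csv()`: the whitespace tokenization rule, the helper names, and the `token`, `start`, and `end` columns are assumptions made for illustration.

```python
import csv
import io
import re


def tokenize_note(note_text):
    """Split a note into whitespace-delimited tokens with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", note_text)]


def tokenize_rows(rows, note_id_varname="note_id", note_text_varname="note_text"):
    """Yield one output row per token, leaving `annotation` blank for the annotator."""
    for row in rows:
        for token, start, end in tokenize_note(row[note_text_varname]):
            yield {
                note_id_varname: row[note_id_varname],
                "token": token,    # assumed column name
                "start": start,    # assumed column name
                "end": end,        # assumed column name
                "annotation": "",  # filled in manually during annotation
            }


# Hypothetical two-column input standing in for a file like tests/test.csv.
raw = io.StringIO('note_id,note_text\n1,"Seen by Dr. Smith on Jan 5, 2020."\n')
token_rows = list(tokenize_rows(csv.DictReader(raw)))
for r in token_rows:
    print(r)
```

Keeping character offsets alongside each token makes it possible to map an annotation back to its position in the original note.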

Then we annotate the file `./tests/test_tokenized.csv` by adding the PHI type to the `annotation` column.

Once that is complete, we combine multi-token PHI. In this example file, dates and locations are split across multiple tokens.

In [3]:
melt_annotations(
    input_file = "../../tests/test_tokenized.csv",
    output_file = "../../tests/ground_truth_processed.csv",
    merge_annotations = ["d", "l"]
)
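The idea behind merging multi-token PHI can be sketched as follows. This is an illustration only, not the implementation of `melt_annotations()`: the `(token, annotation)` tuple layout and the helper name `merge_adjacent` are assumptions, and the sample tokens are made up.

```python
from itertools import groupby


def merge_adjacent(tokens, merge_types):
    """Collapse runs of adjacent tokens that share an annotation in merge_types."""
    merged = []
    for annotation, run in groupby(tokens, key=lambda t: t[1]):
        run = list(run)
        if annotation in merge_types:
            # Join the run into a single multi-token entry.
            merged.append((" ".join(tok for tok, _ in run), annotation))
        else:
            # Leave all other tokens as individual entries.
            merged.extend(run)
    return merged


# Made-up annotated tokens: "d" marks dates and "l" marks locations.
tokens = [
    ("Seen", ""), ("on", ""),
    ("Jan", "d"), ("5,", "d"), ("2020", "d"),
    ("in", ""),
    ("Toronto", "l"), ("Ontario", "l"),
]
merged = merge_adjacent(tokens, merge_types={"d", "l"})
print(merged)
```

One limitation of this simple approach is that two distinct PHI entries of the same type that sit directly next to each other would be merged into one.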

Now we have a ground truth dataset that we can use to compare against the output of `pyDeid`.

To do this we use the `CSVEvaluator` class.

In [4]:
from pyDeid.phi_types.CSVEvaluator import CSVEvaluator

evaluator = CSVEvaluator()

precision, recall, f1 = evaluator.add_ground_truth_file("../../tests/ground_truth.csv", note_id_varname = "note_id")\
    .add_result_file("../../tests/test__PHI.csv")\
    .evaluate()

print(
    f"""
    Precision: {precision}
    Recall: {recall}
    F1: {f1}
    """
)


    Precision: 1.0
    Recall: 1.0
    F1: 1.0
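For reference, the three metrics can be computed from sets of annotated entries as sketched below. This uses the standard definitions of precision, recall, and F1; it is not necessarily how `CSVEvaluator` matches entries internally, and the `(note_id, text, type)` tuples and the `"n"` label are made up for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 from two sets of PHI entries."""
    tp = len(gold & predicted)   # entries found in both sets
    fp = len(predicted - gold)   # predicted entries missing from the gold standard
    fn = len(gold - predicted)   # gold entries that were not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up entries of the form (note_id, text, phi_type).
gold = {(1, "Smith", "n"), (1, "Jan 5, 2020", "d"), (2, "Toronto", "l")}
predicted = {(1, "Smith", "n"), (1, "Jan 5, 2020", "d"), (2, "Ontario", "l")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"Precision: {p:.3f}, Recall: {r:.3f}, F1: {f:.3f}")
```

A score of 1.0 on all three metrics, as in the output above, means every gold standard entry was predicted and nothing extra was flagged.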
    


## Using the ETUDE Engine

The ETUDE engine is a well-established, standard tool for analyzing de-identification performance against various benchmark dataset formats, such as the `n2c2` format.

In this section, we use the ETUDE engine to evaluate the performance of `pyDeid` on the `n2c2` dataset. We begin by cloning the repository.

To run the evaluation on `n2c2`, `pyDeid` must be able to read and write XML. This is available through the `pyDeid_n2c2()` function.

Ensure that the `n2c2` dataset is saved to some directory such as `./tests/n2c2`.

In [None]:
from pyDeid.n2c2 import pyDeid_n2c2

pyDeid_n2c2(
    input_dir = "path/to/n2c2_test_data",
    output_dir = "path/to/pydeid_n2c2_output",
)

The output is now almost ready for evaluation. However, there is a significant difference between how `pyDeid` recognizes names and how names are annotated in the `n2c2` ground truth. `pyDeid` treats first and last names as separate entities, so we must split multi-word name annotations in the `n2c2` ground truth using `split_multi_word_tags()`.

In [None]:
from pyDeid import split_multi_word_tags

split_multi_word_tags(
    input_dir = "path/to/n2c2_test_data",
    output_dir = "path/to/n2c2_test_data_preprocessed"
)
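The core of this preprocessing step is splitting one annotated span into per-word spans while preserving character offsets. The sketch below shows that idea on a single made-up span. It is not the implementation of `split_multi_word_tags()`, does not touch the `n2c2` XML files, and the helper name `split_span` is hypothetical.

```python
import re


def split_span(start, text):
    """Split one annotated span into per-word spans with absolute offsets."""
    return [
        (start + m.start(), start + m.end(), m.group())
        for m in re.finditer(r"\S+", text)
    ]


# A made-up name annotation that starts at character 16 of a note.
parts = split_span(16, "Jane Doe")
print(parts)
```

After splitting, each word of a name in the reference matches the way `pyDeid` reports first and last names individually.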

Now we are ready to run the ETUDE engine.

To compare a reference file with a test file, the ETUDE engine uses two configs: one for the reference files and one for the tool's output. We provide both configs under the `./tests/ETUDE_configs` directory.

With these in place, we run the following command from the directory in which we cloned the ETUDE engine repository.

In [6]:
!python H:/repos/GitHub/etude-engine/etude.py \
    --reference-input "path/to/n2c2_test_data_preprocessed" \
    --reference-config ../../tests/n2c2_pydeid.conf \
    --test-input "path/to/pydeid_n2c2_output" \
    --test-config ../../tests/pydeid_n2c2.conf \
    --by-type --score-key "Parent"


| exact | TP | FP | TN | FN |
|---|---|---|---|---|
| micro-average | 8280.0 | 1153.0 | 0.0 | 2816.0 |
| Address | 116.0 | 4.0 | 0.0 | 420.0 |
| Contact Information | 83.0 | 48.0 | 0.0 | 135.0 |
| Identifiers | 81.0 | 33.0 | 0.0 | 536.0 |
| Names | 4031.0 | 845.0 | 0.0 | 714.0 |
| Time | 3969.0 | 223.0 | 0.0 | 1011.0 |
| macro-average by type | 8280.0 | 1153.0 | 0.0 | 2816.0 |

