![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.7.Deidentification_Custom_Pretrained_Pipelines.ipynb)


## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for an introduction to each component.

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [4]:
from johnsnowlabs import nlp, medical
# Automatically load license data and start a session with all jars user has access to

spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_9596 (8).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.5.3, 💊Spark-Healthcare==5.5.3, running on ⚡ PySpark==3.4.0


In [5]:
spark

In [6]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp_jsl.pipeline_tracer import PipelineTracer
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser

from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 200)

import string
import numpy as np

# Pretrained Deidentification Pipeline

The purpose of this notebook is to adjust pretrained pipelines to meet our specific needs. Additionally, it highlights the differences between several de-identification pretrained pipelines. For more information, you can check [here](https://nlp.johnsnowlabs.com/models?task=De-identification&type=pipeline).

# Deidentification Pipelines Stage Comparison


| Deidentification Pipeline Name | Stages |
|:--------------------------------|:-----------------------------|
| [clinical_deidentification](https://nlp.johnsnowlabs.com/2022/09/14/clinical_deidentification_en.html) | 2 NER, 1 Deidentification, 14 Rule-based NER, 1 clinical embedding, 3 chunk merger  |  
| [clinical_deidentification_generic](https://nlp.johnsnowlabs.com/2024/02/21/clinical_deidentification_generic_en.html) | 1 NER, 4 Deidentification, 13 Rule-based NER, 1 clinical embedding, 2 chunk merger  |
| [clinical_deidentification_generic_optimized](https://nlp.johnsnowlabs.com/2024/03/14/clinical_deidentification_generic_optimized_en.html) | 1 NER, 1 Deidentification, 13 Rule-based NER, 1 clinical embedding, 2 chunk merger  |
| [clinical_deidentification_glove](https://nlp.johnsnowlabs.com/2022/03/04/clinical_deidentification_glove_en_3_0.html) | 2 NER, 4 Deidentification, 13 Rule-based NER, 1 clinical embedding, 3 chunk merger  |
| [clinical_deidentification_glove_augmented](https://nlp.johnsnowlabs.com/2022/09/16/clinical_deidentification_glove_augmented_en.html) | 2 NER, 4 Deidentification, 8 Rule-based NER, 1 clinical embedding, 3 chunk merger  |
| [clinical_deidentification_langtest](https://nlp.johnsnowlabs.com/2024/01/10/clinical_deidentification_langtest_en.html) | 2 NER, 4 Deidentification, 12 Rule-based NER, 1 clinical embedding, 3 chunk merger  |
| [clinical_deidentification_multi_mode_output](https://nlp.johnsnowlabs.com/2024/05/31/clinical_deidentification_multi_mode_output_en.html) | 2 NER, 4 Deidentification, 15 Rule-based NER, 1 clinical embedding, 3 chunk merger |
| [clinical_deidentification_obfuscation_medium](https://nlp.johnsnowlabs.com/2024/02/09/clinical_deidentification_obfuscation_medium_en.html) | 2 NER, 1 Deidentification, 2 Rule-based NER, 1 clinical embedding, 1 chunk merger  |
| [clinical_deidentification_obfuscation_small](https://nlp.johnsnowlabs.com/2024/02/09/clinical_deidentification_obfuscation_small_en.html) | 1 NER, 1 Deidentification, 3 Rule-based NER, 1 clinical embedding, 1 chunk merger  |
| [clinical_deidentification_slim](https://nlp.johnsnowlabs.com/2023/06/17/clinical_deidentification_slim_en.html) | 2 NER, 4 Deidentification, 15 Rule-based NER, 1 clinical embedding, 3 chunk merger  |
| [clinical_deidentification_subentity](https://nlp.johnsnowlabs.com/2024/02/21/clinical_deidentification_subentity_en.html) | 1 NER, 4 Deidentification, 13 Rule-based NER, 1 clinical embedding, 2 chunk merger  |
| [clinical_deidentification_subentity_nameAugmented](https://nlp.johnsnowlabs.com/2024/03/14/clinical_deidentification_subentity_nameAugmented_en.html) | 2 NER, 4 Deidentification, 13 Rule-based NER, 1 clinical embedding, 2 chunk merger  |
| [clinical_deidentification_subentity_optimized](https://nlp.johnsnowlabs.com/2024/03/14/clinical_deidentification_subentity_optimized_en.html) | 1 NER, 1 Deidentification, 13 Rule-based NER, 1 clinical embedding, 2 chunk merger  |
| [clinical_deidentification_wip](https://nlp.johnsnowlabs.com/2023/06/17/clinical_deidentification_wip_en.html) | 2 NER, 4 Deidentification, 15 Rule-based NER, 1 clinical embedding, 3 chunk merger  |
| [ner_deid_augmented_pipeline](https://nlp.johnsnowlabs.com/2022/03/21/ner_deid_augmented_pipeline_en_3_0.html) | 1 NER, 1 clinical embedding |
| [ner_deid_biobert_pipeline](https://nlp.johnsnowlabs.com/2022/03/21/ner_deid_biobert_pipeline_en_3_0.html) | 1 NER, 1 clinical embedding  |
| [ner_deid_context_augmented_pipeline](https://nlp.johnsnowlabs.com/2024/05/20/ner_deid_context_augmented_pipeline_en.html) | 2 NER, 14 Rule-based NER, 1 clinical embedding, 3 chunk merger  |
| [ner_deid_context_nameAugmented_pipeline](https://nlp.johnsnowlabs.com/2024/05/21/ner_deid_context_nameAugmented_pipeline_en.html) | 3 NER, 14 Rule-based NER, 1 clinical embedding, 3 chunk merger  |
| [ner_deid_enriched_biobert_pipeline](https://nlp.johnsnowlabs.com/2022/03/21/ner_deid_enriched_biobert_pipeline_en_3_0.html) | 1 NER, 1 clinical embedding  |
| [ner_deid_enriched_pipeline](https://nlp.johnsnowlabs.com/2022/03/21/ner_deid_enriched_pipeline_en_3_0.html) | 1 NER, 1 clinical embedding  |
| [ner_deid_generic_augmented_pipeline](https://nlp.johnsnowlabs.com/2022/03/21/ner_deid_generic_augmented_pipeline_en_3_0.html) | 1 NER, 1 clinical embedding  |
| [ner_deid_generic_context_augmented_pipeline](https://nlp.johnsnowlabs.com/2024/05/20/ner_deid_generic_context_augmented_pipeline_en.html) | 1 NER, 13 Rule-based NER, 1 clinical embedding, 2 chunk merger  |
| [ner_deid_generic_glove_pipeline](https://nlp.johnsnowlabs.com/2023/03/13/ner_deid_generic_glove_pipeline_en.html) | 1 NER, 1 clinical embedding  |
| [ner_deid_large_pipeline](https://nlp.johnsnowlabs.com/2022/03/21/ner_deid_large_pipeline_en_3_0.html) | 1 NER, 1 clinical embedding  |
| [ner_deid_sd_large_pipeline](https://nlp.johnsnowlabs.com/2022/03/21/ner_deid_sd_large_pipeline_en_3_0.html) | 1 NER, 1 clinical embedding  |
| [ner_deid_sd_pipeline](https://nlp.johnsnowlabs.com/2022/03/21/ner_deid_sd_pipeline_en_3_0.html) | 1 NER, 1 clinical embedding  |
| [ner_deid_subentity_augmented_i2b2_pipeline](https://nlp.johnsnowlabs.com/2022/03/21/ner_deid_subentity_augmented_i2b2_pipeline_en_3_0.html) | 1 NER, 1 clinical embedding  |




## clinical_deidentification

This pipeline de-identifies protected health information (PHI) in medical texts. The PHI is masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `LOCATION`, `CONTACT`, `PROFESSION`, `NAME`, `DATE`, `ID`, `AGE`, `MEDICALRECORD`, `ORGANIZATION`, `HEALTHPLAN`, `DOCTOR`, `USERNAME`, `URL`, `DEVICE`, `CITY`, `ZIP`, `STATE`, `PATIENT`, `COUNTRY`, `STREET`, `PHONE`, `HOSPITAL`, `EMAIL`, `IDNUM`, `BIOID`, `FAX`, `SSN`, `ACCOUNT`, `DLN`, `PLATE`, `VIN`, `LICENSE` entities.

This pipeline is an optimized version of the earlier `clinical_deidentification` pipelines and is significantly faster. It returns the obfuscated text as the result and the version masked with entity labels in the metadata.

In [7]:
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = nlp.PretrainedPipeline("clinical_deidentification", "en", "clinical/models")

clinical_deidentification download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [8]:
text= """Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024.
         The patient’s medical record number is 56467890.
         The patient, Emma Wilson, is 50 years old, with a history of chronic kidney disease stage 3 (N18.3).
         Her contact number is 444-456-7890."""

In [9]:
%%time
deid_result = deid_pipeline.fullAnnotate(text)
print(deid_result[0].keys())
print("\nMasked Result")
print("--"*30)
print('\n'.join([i.metadata['masked'] for i in deid_result[0]['obfuscated']]))
print("\nObfuscated Result")
print("--"*30)
print('\n'.join([i.result for i in deid_result[0]['obfuscated']]),"\n")

dict_keys(['obfuscated', 'ner_chunk', 'sentence'])

Masked Result
------------------------------------------------------------
Dr. <DOCTOR>, from <HOSPITAL> in <CITY>, attended to the patient on <DATE>.
The patient’s medical record number is <MEDICALRECORD>.
The patient, <PATIENT>, is <AGE> years old, with a history of chronic kidney disease stage 3 (N18.3).
Her contact number is <PHONE>.

Obfuscated Result
------------------------------------------------------------
Dr. Lacy Pick, from Baltimore VA Medical Center in Healdton, attended to the patient on 06/06/2024.
The patient’s medical record number is 47576981.
The patient, Joyce Nixon, is 40 years old, with a history of chronic kidney disease stage 3 (N18.3).
Her contact number is 222-214-3658. 

CPU times: user 44.5 ms, sys: 19.9 ms, total: 64.4 ms
Wall time: 4.11 s
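The `%%time` magic measures a single run, so one slow call can dominate the reported number. A minimal sketch of timing repeated calls is shown here; `time_calls` and `fake_annotate` are hypothetical helpers introduced only for illustration, and the stand-in workload lets the code run without a Spark session.

```python
import time
from statistics import median

def time_calls(fn, arg, repeats=5):
    """Return (first_call_seconds, median_of_later_calls_seconds)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        timings.append(time.perf_counter() - start)
    first, rest = timings[0], timings[1:]
    # With a single repeat there are no later calls, so reuse the first timing.
    steady = median(rest) if rest else first
    return first, steady

# Stand-in workload (assumption): in the notebook you would pass
# deid_pipeline.fullAnnotate instead of this dummy function.
def fake_annotate(text):
    return text.upper()

first, steady = time_calls(fake_annotate, "sample note", repeats=5)
print(f"first call: {first:.6f}s, median of later calls: {steady:.6f}s")
```

Reporting the first call separately from the later ones makes it easier to compare pipelines on equal terms.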


For this pretrained pipeline, the `%%time` output above reports a wall time of 4.11 s for this text. The entities it can detect are listed below.

In [10]:
from sparknlp_jsl.pipeline_tracer import PipelineTracer
pipeline_tracer_deid = PipelineTracer(deid_pipeline)
pipeline_tracer_deid.getPossibleEntities()

['LOCATION',
 'CONTACT',
 'PROFESSION',
 'NAME',
 'DATE',
 'ID',
 'AGE',
 'MEDICALRECORD',
 'ORGANIZATION',
 'HEALTHPLAN',
 'DOCTOR',
 'USERNAME',
 'URL',
 'DEVICE',
 'CITY',
 'ZIP',
 'STATE',
 'PATIENT',
 'COUNTRY',
 'STREET',
 'PHONE',
 'HOSPITAL',
 'EMAIL',
 'IDNUM',
 'BIOID',
 'FAX',
 'SSN',
 'ACCOUNT',
 'DLN',
 'PLATE',
 'VIN',
 'LICENSE']

## clinical_deidentification_subentity_optimized

This pipeline de-identifies protected health information (PHI) in medical texts. The PHI is obfuscated in the resulting text and also masked with entity labels in the metadata. The pipeline can obfuscate and mask `MEDICALRECORD`, `ORGANIZATION`, `PROFESSION`, `HEALTHPLAN`, `DOCTOR`, `USERNAME`, `URL`, `LOCATION-OTHER`, `DEVICE`, `CITY`, `DATE`, `ZIP`, `STATE`, `PATIENT`, `COUNTRY`, `STREET`, `PHONE`, `HOSPITAL`, `EMAIL`, `IDNUM`, `BIOID`, `FAX`, `AGE`, `SSN`, `ACCOUNT`, `DLN`, `PLATE`, `VIN`, `LICENSE` entities. It is built from the `ner_deid_subentity_augmented` model together with `ContextualParser`, `RegexMatcher`, and `TextMatcher` annotators, and uses a single `Deidentification` stage as an optimization.

In [11]:
deid_subentity_pipeline = nlp.PretrainedPipeline("clinical_deidentification_subentity_optimized", "en", "clinical/models")

clinical_deidentification_subentity_optimized download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [12]:
%%time
deid_subentity_result = deid_subentity_pipeline.fullAnnotate(text)
deid_subentity_result[0].keys()
print("\nMasked Result")
print("--"*30)
print('\n'.join([i.metadata['masked'] for i in deid_subentity_result[0]['obfuscated']]))
print("\nObfuscated Result")
print("--"*30)
print('\n'.join([i.result for i in deid_subentity_result[0]['obfuscated']]),"\n")


Masked Result
------------------------------------------------------------
Dr. <DOCTOR>, from <HOSPITAL> in <CITY>, attended to the patient on <DATE>.
The patient’s medical record number is <IDNUM>.
The patient, <PATIENT>, is <AGE> years old, with a history of chronic kidney disease stage 3 (N18.3).
Her contact number is <PHONE>.

Obfuscated Result
------------------------------------------------------------
Dr. Milburn Aliment, from MARCUS DALY MEMORIAL HOSPITAL in Hostomice pod Brdy, attended to the patient on 13/05/2024.
The patient’s medical record number is 58687092.
The patient, Alvis Jourdain, is 56 years old, with a history of chronic kidney disease stage 3 (N18.3).
Her contact number is 222-214-3658. 

CPU times: user 23.9 ms, sys: 4.89 ms, total: 28.8 ms
Wall time: 503 ms


For this pretrained pipeline, the `%%time` output above reports a wall time of 503 ms for this text. The entities it can detect are listed below.

In [13]:
pipeline_tracer_subentity = PipelineTracer(deid_subentity_pipeline)
pipeline_tracer_subentity.getPossibleEntities()

['MEDICALRECORD',
 'ORGANIZATION',
 'PROFESSION',
 'HEALTHPLAN',
 'DOCTOR',
 'USERNAME',
 'URL',
 'LOCATION-OTHER',
 'DEVICE',
 'CITY',
 'DATE',
 'ZIP',
 'STATE',
 'PATIENT',
 'COUNTRY',
 'STREET',
 'PHONE',
 'HOSPITAL',
 'EMAIL',
 'IDNUM',
 'BIOID',
 'FAX',
 'AGE',
 'SSN',
 'ACCOUNT',
 'DLN',
 'PLATE',
 'VIN',
 'LICENSE']

## clinical_deidentification_generic_optimized

This pipeline de-identifies protected health information (PHI) in medical texts. The PHI is obfuscated in the resulting text and masked with entity labels in the metadata. The pipeline can obfuscate and mask `LOCATION`, `CONTACT`, `PROFESSION`, `NAME`, `DATE`, `ID`, `AGE`, `COUNTRY`, `SSN`, `ACCOUNT`, `DLN`, `PLATE`, `VIN`, `LICENSE`, `PHONE`, `ZIP`, `MEDICALRECORD`, `EMAIL` entities. It is built from the `ner_deid_generic_augmented` model together with `ContextualParser`, `RegexMatcher`, and `TextMatcher` annotators, and uses a single `Deidentification` stage as an optimization.

In [14]:
deid_generic_pipeline = nlp.PretrainedPipeline("clinical_deidentification_generic_optimized", "en", "clinical/models")

clinical_deidentification_generic_optimized download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [15]:
%%time
deid_generic_result = deid_generic_pipeline.fullAnnotate(text)
deid_generic_result[0].keys()
print("\nMasked Result")
print("--"*30)
print('\n'.join([i.metadata['masked'] for i in deid_generic_result[0]['obfuscated']]))
print("\nObfuscated Result")
print("--"*30)
print('\n'.join([i.result for i in deid_generic_result[0]['obfuscated']]),"\n")


Masked Result
------------------------------------------------------------
Dr. <NAME>, from <LOCATION> in <LOCATION>, attended to the patient on <DATE>.
The patient’s medical record number is <ID>.
The patient, <NAME>, is <AGE> years old, with a history of chronic kidney disease stage 3 (N18.3).
Her contact number is <CONTACT>.

Obfuscated Result
------------------------------------------------------------
Dr. Mliss Anderson, from 220 Steuben St in 2828 North National Avenue, attended to the patient on 12/06/2024.
The patient’s medical record number is 47576981.
The patient, Marcus Sewer, is 47 years old, with a history of chronic kidney disease stage 3 (N18.3).
Her contact number is (54) 5291-7595. 

CPU times: user 21.8 ms, sys: 16.2 ms, total: 37.9 ms
Wall time: 539 ms


For this pretrained pipeline, the `%%time` output above reports a wall time of 539 ms for this text. The entities it can detect are listed below.

In [16]:
pipeline_tracer_generic = PipelineTracer(deid_generic_pipeline)
pipeline_tracer_generic.getPossibleEntities()

['LOCATION',
 'CONTACT',
 'PROFESSION',
 'NAME',
 'DATE',
 'ID',
 'AGE',
 'COUNTRY',
 'SSN',
 'ACCOUNT',
 'DLN',
 'PLATE',
 'VIN',
 'LICENSE',
 'PHONE',
 'ZIP',
 'MEDICALRECORD',
 'EMAIL']

## clinical_deidentification_multi_mode_output

This pipeline de-identifies protected health information (PHI) in medical texts. The PHI is masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `LOCATION`, `CONTACT`, `PROFESSION`, `NAME`, `DATE`, `ID`, `AGE`, `MEDICALRECORD`, `ORGANIZATION`, `HEALTHPLAN`, `DOCTOR`, `USERNAME`, `URL`, `DEVICE`, `CITY`, `ZIP`, `STATE`, `PATIENT`, `COUNTRY`, `STREET`, `PHONE`, `HOSPITAL`, `EMAIL`, `IDNUM`, `BIOID`, `FAX`, `SSN`, `ACCOUNT`, `DLN`, `PLATE`, `VIN`, `LICENSE` entities.

This pipeline simultaneously produces four versions of the text: masked with entity labels, masked with same-length characters, masked with fixed-length characters, and obfuscated.
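To make the three masking modes concrete, here is a minimal pure-Python sketch. It is not the library implementation: `mask_sentence` and `span` are hypothetical helpers, and the same-length rule (brackets count toward the length, very short chunks become plain asterisks) is an assumption inferred from the output table in this notebook.

```python
def mask_sentence(sentence, chunks, mode):
    """Replace each (begin, end, label) chunk in `sentence`; `end` is inclusive."""
    out, cursor = [], 0
    for begin, end, label in sorted(chunks):
        original = sentence[begin:end + 1]
        if mode == "entity_labels":
            replacement = f"<{label}>"
        elif mode == "same_length_chars":
            n = len(original)
            # Assumption: brackets are part of the length; chunks too short
            # to hold brackets are replaced by asterisks only.
            replacement = "[" + "*" * (n - 2) + "]" if n > 2 else "*" * n
        elif mode == "fixed_length_chars":
            replacement = "****"
        else:
            raise ValueError(f"unknown mode: {mode}")
        out.append(sentence[cursor:begin])
        out.append(replacement)
        cursor = end + 1
    out.append(sentence[cursor:])
    return "".join(out)

def span(sentence, chunk, label):
    """Build a (begin, end, label) tuple for the first occurrence of `chunk`."""
    begin = sentence.index(chunk)
    return (begin, begin + len(chunk) - 1, label)

sentence = "The patient, Emma Wilson, is 50 years old."
chunks = [span(sentence, "Emma Wilson", "PATIENT"), span(sentence, "50", "AGE")]
for mode in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(mask_sentence(sentence, chunks, mode))
```

Same-length masking preserves character offsets, which can matter when downstream tools rely on the original positions; fixed-length masking hides the length of the original value as well.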

In [17]:
deid_multi_mode_pipeline = nlp.PretrainedPipeline("clinical_deidentification_multi_mode_output", "en", "clinical/models")

clinical_deidentification_multi_mode_output download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [18]:
%%time
deid_multi_mode_result = deid_multi_mode_pipeline.annotate(text)
deid_multi_mode_result.keys()

pd.set_option("display.max_colwidth", 200)
df= pd.DataFrame(list(zip(deid_multi_mode_result["sentence"],
                          deid_multi_mode_result["masked"],
                          deid_multi_mode_result["masked_with_chars"],
                          deid_multi_mode_result["masked_fixed_length_chars"],
                          deid_multi_mode_result["obfuscated"])),

                 columns= ["Sentence",
                           "Masked",
                           "Masked with Chars",
                           "Masked with Fixed Chars",
                           "Obfuscated"])
df

CPU times: user 54.6 ms, sys: 23.9 ms, total: 78.4 ms
Wall time: 1.15 s


Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024.","Dr. <DOCTOR>, from <HOSPITAL> in <CITY>, attended to the patient on <DATE>.","Dr. [******], from [******************] in [*****], attended to the patient on [********].","Dr. ****, from **** in ****, attended to the patient on ****.","Dr. Remonia Carmin, from MERCY HOSPITAL ADA, INC. in Botucatu, attended to the patient on 31/05/2024."
1,The patient’s medical record number is 56467890.,The patient’s medical record number is <MEDICALRECORD>.,The patient’s medical record number is [******].,The patient’s medical record number is ****.,The patient’s medical record number is 47576981.
2,"The patient, Emma Wilson, is 50 years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, <PATIENT>, is <AGE> years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, [*********], is ** years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, ****, is **** years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, Bettylou Brunner, is 56 years old, with a history of chronic kidney disease stage 3 (N18.3)."
3,Her contact number is 444-456-7890.,Her contact number is <PHONE>.,Her contact number is [**********].,Her contact number is ****.,Her contact number is 222-214-3658.


For this pretrained pipeline, the `%%time` output above reports a wall time of 1.15 s for this text. The entities it can detect are listed below.


In [19]:
pipeline_tracer_multi_mode = PipelineTracer(deid_multi_mode_pipeline)
pipeline_tracer_multi_mode.getPossibleEntities()

['LOCATION',
 'CONTACT',
 'PROFESSION',
 'NAME',
 'DATE',
 'ID',
 'AGE',
 'MEDICALRECORD',
 'ORGANIZATION',
 'HEALTHPLAN',
 'DOCTOR',
 'USERNAME',
 'URL',
 'DEVICE',
 'CITY',
 'ZIP',
 'STATE',
 'PATIENT',
 'COUNTRY',
 'STREET',
 'PHONE',
 'HOSPITAL',
 'EMAIL',
 'BIOID',
 'FAX',
 'SSN',
 'ACCOUNT',
 'DLN',
 'PLATE',
 'VIN',
 'LICENSE']

As seen above, each pretrained de-identification pipeline is designed for a specific purpose. Some pipelines return only main entities, while others return both main entities and sub-entities. Processing times also differ, depending on the models and stages each pipeline contains. The output keys differ as well: for example, `clinical_deidentification` returns `obfuscated`, `ner_chunk`, and `sentence`, while `clinical_deidentification_multi_mode_output` also exposes `masked`, `masked_with_chars`, and `masked_fixed_length_chars`. Select the pretrained pipeline that best fits your needs. For more information, you can check [here](https://nlp.johnsnowlabs.com/models?task=De-identification&type=pipeline).
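One quick way to compare two pipelines is to diff their label sets. The sketch below uses plain Python sets, with the labels copied from the `getPossibleEntities()` outputs of the generic and subentity pipelines in this notebook; the variable names are illustrative only.

```python
generic_labels = {
    "LOCATION", "CONTACT", "PROFESSION", "NAME", "DATE", "ID", "AGE",
    "COUNTRY", "SSN", "ACCOUNT", "DLN", "PLATE", "VIN", "LICENSE",
    "PHONE", "ZIP", "MEDICALRECORD", "EMAIL",
}
subentity_labels = {
    "MEDICALRECORD", "ORGANIZATION", "PROFESSION", "HEALTHPLAN", "DOCTOR",
    "USERNAME", "URL", "LOCATION-OTHER", "DEVICE", "CITY", "DATE", "ZIP",
    "STATE", "PATIENT", "COUNTRY", "STREET", "PHONE", "HOSPITAL", "EMAIL",
    "IDNUM", "BIOID", "FAX", "AGE", "SSN", "ACCOUNT", "DLN", "PLATE",
    "VIN", "LICENSE",
}

# Labels that only one of the two pipelines can produce.
only_generic = sorted(generic_labels - subentity_labels)
only_subentity = sorted(subentity_labels - generic_labels)
shared = sorted(generic_labels & subentity_labels)

print("Only in generic:  ", only_generic)
print("Only in subentity:", only_subentity)
print("Shared:           ", shared)
```

The generic-only labels are the coarse categories (`NAME`, `LOCATION`, `CONTACT`, `ID`), while the subentity-only labels are their finer-grained counterparts such as `DOCTOR`, `PATIENT`, `CITY`, and `HOSPITAL`.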

# Pipeline Stage Modification

Now we will examine how to modify a pretrained pipeline to fit our requirements, using `clinical_deidentification` as the example.


In [20]:
# We call transform after every change we make, so we create an empty DataFrame for that purpose.
empty_data = spark.createDataFrame([[""]]).toDF("text")

Here, we are checking the stages of the pretrained pipeline.

In [21]:
deid_pipeline.model.stages

[DocumentAssembler_0293828e42e5,
 SentenceDetectorDLModel_6bafc4746ea5,
 REGEX_TOKENIZER_ede41b4357b5,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_e8178a1262cc,
 NER_CONVERTER_1058f6f116d4,
 MedicalNerModel_9d4a08b1c03d,
 NER_CONVERTER_dc9b41725c7e,
 MERGE_14697a4bf7ea,
 CONTEXTUAL-PARSER_29d8f2e94a43,
 CONTEXTUAL-PARSER_9b30de083926,
 CONTEXTUAL-PARSER_009dd91ad279,
 CONTEXTUAL-PARSER_70bce6260bb4,
 CONTEXTUAL-PARSER_c7c49d4cc377,
 CONTEXTUAL-PARSER_4cdb8328ac10,
 ENTITY_EXTRACTOR_00a0458116f7,
 ENTITY_EXTRACTOR_396241ad6df7,
 CONTEXTUAL-PARSER_24e76bf85739,
 REGEX_MATCHER_e4237b63b8d9,
 CONTEXTUAL-PARSER_2b9eb4befaa6,
 CONTEXTUAL-PARSER_0892cc982b30,
 CONTEXTUAL-PARSER_20cecdf31e95,
 CONTEXTUAL-PARSER_69dda3cbafc9,
 MERGE_359f55073107,
 MERGE_2493d1337efe,
 DE-IDENTIFICATION_030ae1ab1b7a,
 Finisher_e9a5d603229b]

In [22]:
len(deid_pipeline.model.stages)

27
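Hard-coded stage indices are easy to get wrong once stages are added or removed. A minimal sketch of looking stages up by their uid prefix follows; it works on a plain list of the uid strings printed above so that it runs without Spark. `indices_with_prefix` is a hypothetical helper, and building the list from the live pipeline (for example from each stage's `uid` attribute) is an assumption not shown in the notebook.

```python
# uid strings copied from the stage listing of clinical_deidentification above.
stage_uids = [
    "DocumentAssembler_0293828e42e5",
    "SentenceDetectorDLModel_6bafc4746ea5",
    "REGEX_TOKENIZER_ede41b4357b5",
    "WORD_EMBEDDINGS_MODEL_9004b1d00302",
    "MedicalNerModel_e8178a1262cc",
    "NER_CONVERTER_1058f6f116d4",
    "MedicalNerModel_9d4a08b1c03d",
    "NER_CONVERTER_dc9b41725c7e",
    "MERGE_14697a4bf7ea",
    "CONTEXTUAL-PARSER_29d8f2e94a43",
    "CONTEXTUAL-PARSER_9b30de083926",
    "CONTEXTUAL-PARSER_009dd91ad279",
    "CONTEXTUAL-PARSER_70bce6260bb4",
    "CONTEXTUAL-PARSER_c7c49d4cc377",
    "CONTEXTUAL-PARSER_4cdb8328ac10",
    "ENTITY_EXTRACTOR_00a0458116f7",
    "ENTITY_EXTRACTOR_396241ad6df7",
    "CONTEXTUAL-PARSER_24e76bf85739",
    "REGEX_MATCHER_e4237b63b8d9",
    "CONTEXTUAL-PARSER_2b9eb4befaa6",
    "CONTEXTUAL-PARSER_0892cc982b30",
    "CONTEXTUAL-PARSER_20cecdf31e95",
    "CONTEXTUAL-PARSER_69dda3cbafc9",
    "MERGE_359f55073107",
    "MERGE_2493d1337efe",
    "DE-IDENTIFICATION_030ae1ab1b7a",
    "Finisher_e9a5d603229b",
]

def indices_with_prefix(uids, prefix):
    """Return the positions of all stages whose uid starts with `prefix`."""
    return [i for i, uid in enumerate(uids) if uid.startswith(prefix)]

merger_indices = indices_with_prefix(stage_uids, "MERGE_")
deid_indices = indices_with_prefix(stage_uids, "DE-IDENTIFICATION_")
print("Chunk mergers:", merger_indices)
print("De-identification:", deid_indices)
```

Re-running the lookup after each modification keeps later steps pointed at the intended stage even when positions shift.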

You can view each stage of the pipeline by using the `printPipelineSchema` method.

In [23]:
pipeline_tracer_deid.printPipelineSchema()

root
 |-- DocumentAssembler
 |    |-- uid: string (DocumentAssembler_0293828e42e5)
 |    |-- index: int (0)
 |    |-- inputCol: string (text)
 |    |-- outputCol: string (document)
 |    |-- inputAnnotatorType: none (----------)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- SentenceDetectorDLModel
 |    |-- uid: string (SentenceDetectorDLModel_6bafc4746ea5)
 |    |-- index: int (1)
 |    |-- inputCols: array (document)
 |    |-- outputCol: string (sentence)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (DOCUMENT)
 |
 |-- TokenizerModel
 |    |-- uid: string (REGEX_TOKENIZER_ede41b4357b5)
 |    |-- index: int (2)
 |    |-- inputCols: array (sentence)
 |    |-- outputCol: string (token)
 |    |-- inputAnnotatorTypes: array (DOCUMENT)
 |    |-- outputAnnotatorType: string (TOKEN)
 |
 |-- WordEmbeddingsModel
 |    |-- uid: string (WORD_EMBEDDINGS_MODEL_9004b1d00302)
 |    |-- index: int (3)
 |    |-- inputCols: array (sentence, token)
 |  

In [24]:
deid_result[0].keys()

dict_keys(['obfuscated', 'ner_chunk', 'sentence'])

In [25]:
for res in deid_result:
    sentence = [original_text.result for original_text in res["sentence"]]
    masked = [masked_text.metadata["masked"] for masked_text in res["obfuscated"]]
    obfuscated = [obfuscated_text.result for obfuscated_text in res["obfuscated"]]

df = pd.DataFrame({"Sentence": sentence, "Masked": masked, "Obfuscated": obfuscated})

df

Unnamed: 0,Sentence,Masked,Obfuscated
0,"Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024.","Dr. <DOCTOR>, from <HOSPITAL> in <CITY>, attended to the patient on <DATE>.","Dr. Lacy Pick, from Baltimore VA Medical Center in Healdton, attended to the patient on 06/06/2024."
1,The patient’s medical record number is 56467890.,The patient’s medical record number is <MEDICALRECORD>.,The patient’s medical record number is 47576981.
2,"The patient, Emma Wilson, is 50 years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, <PATIENT>, is <AGE> years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, Joyce Nixon, is 40 years old, with a history of chronic kidney disease stage 3 (N18.3)."
3,Her contact number is 444-456-7890.,Her contact number is <PHONE>.,Her contact number is 222-214-3658.
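The loop above rebuilds its lists on every iteration, so it keeps only the last document when several texts are annotated. Here is a minimal sketch that accumulates rows across documents; `FakeAnnotation` is a stand-in (an assumption) exposing only the `result` and `metadata` attributes used in this notebook, and `deid_rows` is a hypothetical helper.

```python
from dataclasses import dataclass, field

@dataclass
class FakeAnnotation:
    """Stand-in with the two attributes read from the pipeline output."""
    result: str
    metadata: dict = field(default_factory=dict)

def deid_rows(results):
    """Flatten fullAnnotate-style results into one row per sentence."""
    rows = []
    for doc_id, res in enumerate(results):
        for sent, obf in zip(res["sentence"], res["obfuscated"]):
            rows.append({
                "doc": doc_id,
                "Sentence": sent.result,
                "Masked": obf.metadata["masked"],
                "Obfuscated": obf.result,
            })
    return rows

# Two toy documents that reuse one sentence from the notebook output.
one_doc = {
    "sentence": [FakeAnnotation("Her contact number is 444-456-7890.")],
    "obfuscated": [FakeAnnotation(
        "Her contact number is 222-214-3658.",
        {"masked": "Her contact number is <PHONE>."},
    )],
}
rows = deid_rows([one_doc, one_doc])
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)`.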


## Remove Label (setBlackList)

Now, let's call **setBlackList** on an existing `ChunkMergeModel` (stage 24, the last merger in this pipeline) so that `DATE` chunks are dropped.

In [26]:
deid_pipeline.model.stages[24] = deid_pipeline.model.stages[24].setBlackList(['DATE'])
deid_pipeline.transform(empty_data)

DataFrame[text: string, finished_sentence: array<string>, finished_ner_chunk: array<string>, finished_obfuscated: array<string>, finished_sentence_metadata: array<struct<_1:string,_2:string>>, finished_ner_chunk_metadata: array<struct<_1:string,_2:string>>, finished_obfuscated_metadata: array<struct<_1:string,_2:string>>]

In [27]:
deid_res= deid_pipeline.fullAnnotate(text)

In [28]:
deid_res[0]["ner_chunk"]

[Annotation(chunk, 4, 11, John Lee, {'entity': 'DOCTOR', 'confidence': '0.9941', 'ner_source': 'ner_chunk_enriched', 'chunk': '0', 'sentence': '0'}, []),
 Annotation(chunk, 19, 38, Royal Medical Clinic, {'entity': 'HOSPITAL', 'confidence': '0.98686665', 'ner_source': 'ner_chunk_enriched', 'chunk': '1', 'sentence': '0'}, []),
 Annotation(chunk, 43, 49, Chicago, {'entity': 'CITY', 'confidence': '0.9773', 'ner_source': 'ner_chunk_enriched', 'chunk': '2', 'sentence': '0'}, []),
 Annotation(chunk, 139, 146, 56467890, {'entity': 'MEDICALRECORD', 'confidence': '0.6796', 'ner_source': 'ner_chunk_enriched', 'chunk': '3', 'sentence': '1'}, []),
 Annotation(chunk, 171, 181, Emma Wilson, {'entity': 'PATIENT', 'confidence': '0.99395', 'ner_source': 'ner_chunk_enriched', 'chunk': '4', 'sentence': '2'}, []),
 Annotation(chunk, 187, 188, 50, {'tokenIndex': '7', 'entity': 'AGE', 'confidence': '0.75', 'field': 'AGE', 'ner_source': 'entity_age', 'chunk': '5', 'normalized': '', 'sentence': '2'}, []),
 Ann

Because we added the **DATE** entity to the blacklist, the pipeline no longer returns **DATE** chunks, and the date stays unchanged in both the masked and the obfuscated text.

In [29]:
for res in deid_res:
    sentence = [original_text.result for original_text in res["sentence"]]
    masked = [masked_text.metadata["masked"] for masked_text in res["obfuscated"]]
    obfuscated = [obfuscated_text.result for obfuscated_text in res["obfuscated"]]

df = pd.DataFrame({"Sentence": sentence, "Masked": masked, "Obfuscated": obfuscated})

df

Unnamed: 0,Sentence,Masked,Obfuscated
0,"Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024.","Dr. <DOCTOR>, from <HOSPITAL> in <CITY>, attended to the patient on 11/05/2024.","Dr. Lacy Pick, from Baltimore VA Medical Center in Healdton, attended to the patient on 11/05/2024."
1,The patient’s medical record number is 56467890.,The patient’s medical record number is <MEDICALRECORD>.,The patient’s medical record number is 47576981.
2,"The patient, Emma Wilson, is 50 years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, <PATIENT>, is <AGE> years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, Joyce Nixon, is 40 years old, with a history of chronic kidney disease stage 3 (N18.3)."
3,Her contact number is 444-456-7890.,Her contact number is <PHONE>.,Her contact number is 222-214-3658.


## Remove Label (setBlackListEntities)

Now, let's call **setBlackListEntities** on the existing `DeIdentificationModel` (stage 25).

In [30]:
deid_pipeline.model.stages[25] = deid_pipeline.model.stages[25].setBlackListEntities(['DOCTOR'])
deid_pipeline.transform(empty_data)

DataFrame[text: string, finished_sentence: array<string>, finished_ner_chunk: array<string>, finished_obfuscated: array<string>, finished_sentence_metadata: array<struct<_1:string,_2:string>>, finished_ner_chunk_metadata: array<struct<_1:string,_2:string>>, finished_obfuscated_metadata: array<struct<_1:string,_2:string>>]

In [31]:
deid_res = deid_pipeline.fullAnnotate(text)

In [32]:
deid_res[0]["obfuscated"]

[Annotation(document, 0, 97, Dr. John Lee, from Baltimore VA Medical Center in Healdton, attended to the patient on 11/05/2024., {'sentence': '0', 'originalIndex': '0', 'masked': 'Dr. John Lee, from <HOSPITAL> in <CITY>, attended to the patient on 11/05/2024.'}, []),
 Annotation(document, 98, 145, The patient’s medical record number is 47576981., {'sentence': '1', 'originalIndex': '100', 'masked': 'The patient’s medical record number is <MEDICALRECORD>.'}, []),
 Annotation(document, 146, 245, The patient, Joyce Nixon, is 40 years old, with a history of chronic kidney disease stage 3 (N18.3)., {'sentence': '2', 'originalIndex': '158', 'masked': 'The patient, <PATIENT>, is <AGE> years old, with a history of chronic kidney disease stage 3 (N18.3).'}, []),
 Annotation(document, 246, 280, Her contact number is 222-214-3658., {'sentence': '3', 'originalIndex': '268', 'masked': 'Her contact number is <PHONE>.'}, [])]

Because we added the **DOCTOR** entity with `setBlackListEntities`, the pipeline neither masks nor obfuscates **DOCTOR** chunks.
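The two blacklist settings act at different points in the pipeline. The sketch below is a conceptual model only, not the library code: it treats the merger blacklist as a filter on the chunk list, and the de-identification blacklist as a filter on which of the remaining chunks get replaced. Both function names are hypothetical.

```python
chunks = [
    {"text": "John Lee", "entity": "DOCTOR"},
    {"text": "Chicago", "entity": "CITY"},
    {"text": "11/05/2024", "entity": "DATE"},
]

def merger_blacklist(chunks, blacklist):
    """Model of the merger step: blacklisted chunks are dropped entirely."""
    return [c for c in chunks if c["entity"] not in blacklist]

def deid_targets(chunks, blacklist_entities):
    """Model of the de-identification step: blacklisted chunks are left as-is."""
    return [c for c in chunks if c["entity"] not in blacklist_entities]

after_merge = merger_blacklist(chunks, {"DATE"})
to_replace = deid_targets(after_merge, {"DOCTOR"})

print("Chunks after merger:", [c["entity"] for c in after_merge])
print("Chunks to mask/obfuscate:", [c["entity"] for c in to_replace])
```

Under this model, a label blacklisted at the merger disappears from the chunk output, while a label blacklisted at the de-identification stage is still detected but left untouched in the text.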

In [33]:
for res in deid_res:
    sentence = [original_text.result for original_text in res["sentence"]]
    masked = [masked_text.metadata["masked"] for masked_text in res["obfuscated"]]
    obfuscated = [obfuscated_text.result for obfuscated_text in res["obfuscated"]]

df = pd.DataFrame({"Sentence": sentence, "Masked": masked, "Obfuscated": obfuscated})

df

Unnamed: 0,Sentence,Masked,Obfuscated
0,"Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024.","Dr. John Lee, from <HOSPITAL> in <CITY>, attended to the patient on 11/05/2024.","Dr. John Lee, from Baltimore VA Medical Center in Healdton, attended to the patient on 11/05/2024."
1,The patient’s medical record number is 56467890.,The patient’s medical record number is <MEDICALRECORD>.,The patient’s medical record number is 47576981.
2,"The patient, Emma Wilson, is 50 years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, <PATIENT>, is <AGE> years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, Joyce Nixon, is 40 years old, with a history of chronic kidney disease stage 3 (N18.3)."
3,Her contact number is 444-456-7890.,Her contact number is <PHONE>.,Her contact number is 222-214-3658.


## Add a New Stage (ICD10_CODE)

Now, we will add an **ICD10CM parser model** to the pretrained pipeline.

In [34]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

icd10cm = {
    "entity": "ICD10_CODE",
    "ruleScope": "sentence",
    "regex": r"^[A-Z]\d{1,2}(\.\d{1,2})?$",  # raw string; the dot is escaped so it matches only a literal "."
    "matchScope": "token",
    "contextLength": 15
}

with open('icd10cm.json', 'w') as f:
    json.dump(icd10cm, f)

icd10cm_parser = ContextualParserApproach() \
      .setInputCols(["sentence", "token"]) \
      .setOutputCol("entity_icd10cm") \
      .setJsonPath("icd10cm.json") \
      .setCaseSensitive(False) \
      .setPrefixAndSuffixMatch(False)\
      .setShortestContextMatch(False)\
      .setOptionalContextRules(False)\
      .setCompleteContextMatch(True)

icd10cm_parser_pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    icd10cm_parser
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
icd10cm_parser_model = icd10cm_parser_pipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]


In [35]:
# icd10cm_parser_model test
txt = """During her pregnancy, the patient was diagnosed with gestational diabetes mellitus (O24.489) and pre-existing type 1 diabetes (O24.11),
and she also experienced persistent nausea (R11.1) and frequent urination (R35)."""

LightPipeline(icd10cm_parser_model).annotate(txt)["entity_icd10cm"]

['O24.11', 'R11.1', 'R35']
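
Note that `O24.489` is missing from the output above: the rule allows at most two digits after the dot. The behaviour can be reproduced outside Spark with Python's `re` module. This is only a sketch under some assumptions: the dot in the rule is meant as a literal dot (it is escaped here), each code reaches the parser as a single token, and `WIDER_PATTERN` is a hypothetical variant that is not part of the original rule.

```python
import re

# Pattern from the rule above, with the dot escaped (assumed to be the intent).
ICD10_PATTERN = re.compile(r"^[A-Z]\d{1,2}(\.\d{1,2})?$")

# Hypothetical wider variant: allows up to three digits after the dot.
WIDER_PATTERN = re.compile(r"^[A-Z]\d{1,2}(\.\d{1,3})?$")

# Stand-in tokens taken from the test sentence.
tokens = ["O24.489", "O24.11", "R11.1", "R35", "diabetes"]

matched = [t for t in tokens if ICD10_PATTERN.match(t)]
matched_wider = [t for t in tokens if WIDER_PATTERN.match(t)]

print(matched)        # ['O24.11', 'R11.1', 'R35']
print(matched_wider)  # ['O24.489', 'O24.11', 'R11.1', 'R35']
```

If longer codes should be de-identified as well, the rule in `icd10cm.json` would need a similarly widened pattern before the parser is fitted.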

In [36]:
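# Save the fitted parser (the last stage), then reload it as a ContextualParserModel.
icd10cm_parser_model.stages[-1].write().overwrite().save("icd10cm_parser")
# The absolute path assumes the Colab working directory (/content).
icd10cm_parser1 = ContextualParserModel.load("/content/icd10cm_parser")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("entity_icd10cm")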

In [37]:
deid_pipeline.model.stages = (
    deid_pipeline.model.stages[:22] +  # Insert the ICD10CM parser model at the appropriate position in the pipeline.
    [icd10cm_parser1] +
    deid_pipeline.model.stages[22:]
)
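
The position matters because the new parser reads the `sentence` and `token` columns and the merger that consumes `entity_icd10cm` must run after it. The slicing used above can be sketched with a plain Python list. The `insert_stage` helper and the stage names below are hypothetical stand-ins, not part of the library.

```python
def insert_stage(stages, new_stage, index):
    """Return a new list with new_stage placed at position index."""
    if not 0 <= index <= len(stages):
        raise IndexError(f"index {index} is outside 0..{len(stages)}")
    # Same slicing as in the cell above: head + [new stage] + tail.
    return stages[:index] + [new_stage] + stages[index:]


# Stand-in stage names; a real pipeline holds transformer objects.
stages = ["DocumentAssembler", "Tokenizer", "ChunkMerge", "DeIdentification"]
updated = insert_stage(stages, "IcdParser", 2)

print(updated)
# ['DocumentAssembler', 'Tokenizer', 'IcdParser', 'ChunkMerge', 'DeIdentification']
```

Because the helper returns a new list, the original list stays untouched until the result is assigned back.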

In [38]:
deid_pipeline.model.stages

[DocumentAssembler_0293828e42e5,
 SentenceDetectorDLModel_6bafc4746ea5,
 REGEX_TOKENIZER_ede41b4357b5,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_e8178a1262cc,
 NER_CONVERTER_1058f6f116d4,
 MedicalNerModel_9d4a08b1c03d,
 NER_CONVERTER_dc9b41725c7e,
 MERGE_14697a4bf7ea,
 CONTEXTUAL-PARSER_29d8f2e94a43,
 CONTEXTUAL-PARSER_9b30de083926,
 CONTEXTUAL-PARSER_009dd91ad279,
 CONTEXTUAL-PARSER_70bce6260bb4,
 CONTEXTUAL-PARSER_c7c49d4cc377,
 CONTEXTUAL-PARSER_4cdb8328ac10,
 ENTITY_EXTRACTOR_00a0458116f7,
 ENTITY_EXTRACTOR_396241ad6df7,
 CONTEXTUAL-PARSER_24e76bf85739,
 REGEX_MATCHER_e4237b63b8d9,
 CONTEXTUAL-PARSER_2b9eb4befaa6,
 CONTEXTUAL-PARSER_0892cc982b30,
 CONTEXTUAL-PARSER_20cecdf31e95,
 CONTEXTUAL-PARSER_0f3936810ce2,
 CONTEXTUAL-PARSER_69dda3cbafc9,
 MERGE_359f55073107,
 MERGE_2493d1337efe,
 DE-IDENTIFICATION_030ae1ab1b7a,
 Finisher_e9a5d603229b]

In [39]:
len(deid_pipeline.model.stages)

28

The number of stages has increased by one, to 28. Next, we add the output column of the newly added **icd10cm_parser** to the inputs of the **ChunkMerge** stage at index 24.

In [40]:
merger_input_cols = deid_pipeline.model.stages[24].getInputCols()
merger_input_cols

['entity_zip',
 'entity_ssn',
 'entity_account',
 'entity_date',
 'entity_phone',
 'entity_age',
 'entity_email',
 'entity_med',
 'entity_dln',
 'entity_license',
 'entity_plate',
 'entity_vin',
 'entity_country',
 'entity_state']

In [41]:
deid_pipeline.model.stages[24] = deid_pipeline.model.stages[24]\
      .setInputCols(["entity_icd10cm"]+merger_input_cols)\
      .setOutputCol("deid_merged_parse1")

In [42]:
deid_pipeline.model.stages[24].getInputCols()

['entity_icd10cm',
 'entity_zip',
 'entity_ssn',
 'entity_account',
 'entity_date',
 'entity_phone',
 'entity_age',
 'entity_email',
 'entity_med',
 'entity_dln',
 'entity_license',
 'entity_plate',
 'entity_vin',
 'entity_country',
 'entity_state']
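
Re-running the two cells above in sequence would prepend `entity_icd10cm` a second time, because the column list is read back from the stage that was already modified. A minimal sketch of a guard against that, using a hypothetical `prepend_input_col` helper on plain lists:

```python
def prepend_input_col(cols, new_col):
    """Return cols with new_col first, without creating a duplicate."""
    return [new_col] + [c for c in cols if c != new_col]


cols = ["entity_zip", "entity_ssn"]
once = prepend_input_col(cols, "entity_icd10cm")
twice = prepend_input_col(once, "entity_icd10cm")

print(once)   # ['entity_icd10cm', 'entity_zip', 'entity_ssn']
print(twice)  # ['entity_icd10cm', 'entity_zip', 'entity_ssn']
```

The result could then be passed to `setInputCols` in place of the manual list concatenation.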

In [43]:
deid_lp = LightPipeline(deid_pipeline.model)

In [44]:
deid_res= deid_lp.fullAnnotate(text)

In [45]:
deid_res[0].keys()

dict_keys(['obfuscated', 'ner_chunk', 'sentence'])

In [46]:
for res in deid_res:
    sentence = [original_text.result for original_text in res["sentence"]]
    masked = [masked_text.metadata["masked"] for masked_text in res["obfuscated"]]
    obfuscated = [obfuscated_text.result for obfuscated_text in res["obfuscated"]]

df = pd.DataFrame({"Sentence": sentence, "Masked": masked, "Obfuscated": obfuscated})

df

Unnamed: 0,Sentence,Masked,Obfuscated
0,"Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024.","Dr. John Lee, from <HOSPITAL> in <CITY>, attended to the patient on 11/05/2024.","Dr. John Lee, from Baltimore VA Medical Center in Healdton, attended to the patient on 11/05/2024."
1,The patient’s medical record number is 56467890.,The patient’s medical record number is <MEDICALRECORD>.,The patient’s medical record number is 47576981.
2,"The patient, Emma Wilson, is 50 years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, <PATIENT>, is <AGE> years old, with a history of chronic kidney disease stage 3 (<ICD10_CODE>).","The patient, Joyce Nixon, is 40 years old, with a history of chronic kidney disease stage 3 (<ICD10_CODE>)."
3,Her contact number is 444-456-7890.,Her contact number is <PHONE>.,Her contact number is 222-214-3658.


The **ICD10_CODE** entity detected by the parser we added is now masked as well.
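
The loop above overwrites `sentence`, `masked`, and `obfuscated` on every iteration, so with more than one input text only the last document would reach the DataFrame. A minimal sketch that accumulates rows across documents is shown below. It assumes the `sentence` and `obfuscated` annotations line up one to one, as the cell above does. The `collect_rows` and `fake_annotation` names are hypothetical, and the stand-in objects only mimic the `result` and `metadata` attributes used above.

```python
from types import SimpleNamespace


def collect_rows(results):
    """Build one row per sentence for every annotated document."""
    rows = []
    for doc_id, res in enumerate(results):
        for original, deid in zip(res["sentence"], res["obfuscated"]):
            rows.append({
                "Document": doc_id,
                "Sentence": original.result,
                "Masked": deid.metadata["masked"],
                "Obfuscated": deid.result,
            })
    return rows


def fake_annotation(result, masked=None):
    """Stand-in for an annotation object (illustration only)."""
    metadata = {"masked": masked} if masked is not None else {}
    return SimpleNamespace(result=result, metadata=metadata)


# Two stand-in documents built from sentences shown in the table above.
fake_results = [
    {"sentence": [fake_annotation("Her contact number is 444-456-7890.")],
     "obfuscated": [fake_annotation("Her contact number is 222-214-3658.",
                                    masked="Her contact number is <PHONE>.")]},
    {"sentence": [fake_annotation("The patient’s medical record number is 56467890.")],
     "obfuscated": [fake_annotation("The patient’s medical record number is 47576981.",
                                    masked="The patient’s medical record number is <MEDICALRECORD>.")]},
]

rows = collect_rows(fake_results)
print(len(rows))  # 2
```

With real output, `pd.DataFrame(collect_rows(deid_res))` should give a table like the one above, with an extra `Document` column.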

Now we will exclude **ICD10_CODE** entities from de-identification by blacklisting them in the last ChunkMerge stage.

In [47]:
deid_pipeline.model.stages[-3].getInputCols()

['deid_merged_parse1', 'deid_merged_chunk']

In [48]:
deid_pipeline.model.stages[-3] = deid_pipeline.model.stages[-3].setBlackList(['ICD10_CODE'])
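
Conceptually, a blacklist drops every chunk whose entity label is listed, so those chunks never reach the de-identification stage. The plain-Python sketch below illustrates the idea only. It is not the ChunkMerge implementation, and the `drop_blacklisted` helper and the dictionary layout are hypothetical.

```python
def drop_blacklisted(chunks, blacklist):
    """Keep only the chunks whose entity label is not blacklisted."""
    blocked = set(blacklist)
    return [c for c in chunks if c["entity"] not in blocked]


# Stand-in chunks based on the example sentence used in this notebook.
chunks = [
    {"entity": "PATIENT", "text": "Emma Wilson"},
    {"entity": "AGE", "text": "50"},
    {"entity": "ICD10_CODE", "text": "N18.3"},
]

kept = drop_blacklisted(chunks, ["ICD10_CODE"])
print([c["entity"] for c in kept])  # ['PATIENT', 'AGE']
```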

In [49]:
deid_lp = LightPipeline(deid_pipeline.model)

In [50]:
deid_res= deid_lp.fullAnnotate(text)

In [51]:
for res in deid_res:
    sentence = [original_text.result for original_text in res["sentence"]]
    masked = [masked_text.metadata["masked"] for masked_text in res["obfuscated"]]
    obfuscated = [obfuscated_text.result for obfuscated_text in res["obfuscated"]]

df = pd.DataFrame({"Sentence": sentence, "Masked": masked, "Obfuscated": obfuscated})

df

Unnamed: 0,Sentence,Masked,Obfuscated
0,"Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024.","Dr. John Lee, from <HOSPITAL> in <CITY>, attended to the patient on <DATE>.","Dr. John Lee, from Baltimore VA Medical Center in Healdton, attended to the patient on 06/06/2024."
1,The patient’s medical record number is 56467890.,The patient’s medical record number is <MEDICALRECORD>.,The patient’s medical record number is 47576981.
2,"The patient, Emma Wilson, is 50 years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, <PATIENT>, is <AGE> years old, with a history of chronic kidney disease stage 3 (N18.3).","The patient, Joyce Nixon, is 40 years old, with a history of chronic kidney disease stage 3 (N18.3)."
3,Her contact number is 444-456-7890.,Her contact number is <PHONE>.,Her contact number is 222-214-3658.


Alternatively, you can save the modified pretrained pipeline to disk and load it later using the `from_disk` method.


```python
deid_pipeline.model.write().overwrite().save("modified_pipeline")

# We are loading the pretrained pipeline using the `from_disk` method.
from sparknlp.pretrained import PretrainedPipeline

new_pipe = PretrainedPipeline.from_disk('modified_pipeline')

deid_res= new_pipe.fullAnnotate(text)
```



## Remove Stage (Finisher)

Finally, let's see how to remove a stage. As an example, we will remove the Finisher stage, which is the last stage of the pipeline.

In [52]:
deid_pipeline.model.stages = deid_pipeline.model.stages[:-1]

In [53]:
deid_pipeline.model.stages

[DocumentAssembler_0293828e42e5,
 SentenceDetectorDLModel_6bafc4746ea5,
 REGEX_TOKENIZER_ede41b4357b5,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_e8178a1262cc,
 NER_CONVERTER_1058f6f116d4,
 MedicalNerModel_9d4a08b1c03d,
 NER_CONVERTER_dc9b41725c7e,
 MERGE_14697a4bf7ea,
 CONTEXTUAL-PARSER_29d8f2e94a43,
 CONTEXTUAL-PARSER_9b30de083926,
 CONTEXTUAL-PARSER_009dd91ad279,
 CONTEXTUAL-PARSER_70bce6260bb4,
 CONTEXTUAL-PARSER_c7c49d4cc377,
 CONTEXTUAL-PARSER_4cdb8328ac10,
 ENTITY_EXTRACTOR_00a0458116f7,
 ENTITY_EXTRACTOR_396241ad6df7,
 CONTEXTUAL-PARSER_24e76bf85739,
 REGEX_MATCHER_e4237b63b8d9,
 CONTEXTUAL-PARSER_2b9eb4befaa6,
 CONTEXTUAL-PARSER_0892cc982b30,
 CONTEXTUAL-PARSER_20cecdf31e95,
 CONTEXTUAL-PARSER_0f3936810ce2,
 CONTEXTUAL-PARSER_69dda3cbafc9,
 MERGE_359f55073107,
 MERGE_2493d1337efe,
 DE-IDENTIFICATION_030ae1ab1b7a]

In [54]:
len(deid_pipeline.model.stages)

27
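
Slicing with `[:-1]` works here because the Finisher happens to be the last stage. A sketch of removing stages by name instead of by position is shown below. The `drop_stages` helper is hypothetical, and the example runs on the identifier strings printed above rather than on real stage objects.

```python
def drop_stages(stages, prefix, name_of=str):
    """Return the stages whose name does not start with prefix."""
    return [s for s in stages if not name_of(s).startswith(prefix)]


# Identifiers copied from the stage list printed earlier.
stage_names = [
    "MERGE_2493d1337efe",
    "DE-IDENTIFICATION_030ae1ab1b7a",
    "Finisher_e9a5d603229b",
]

remaining = drop_stages(stage_names, "Finisher")
print(remaining)  # ['MERGE_2493d1337efe', 'DE-IDENTIFICATION_030ae1ab1b7a']
```

For real stages, one option would be to pass `name_of=lambda s: s.uid`, assuming each stage's `uid` matches the identifier shown in the printed list.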

In [55]:
deid_lp = LightPipeline(deid_pipeline.model)

deid_res = deid_lp.fullAnnotate(text)

deid_res[0].keys()

dict_keys(['entity_ssn', 'ner_enriched', 'obfuscated', 'entity_vin', 'entity_dln', 'entity_country', 'document', 'ner_chunk', 'deid_merged_parse1', 'entity_med', 'ner_chunk_large', 'entity_phone', 'entity_zip', 'entity_state', 'entity_account', 'ner_chunk_enriched', 'entity_email', 'entity_icd10cm', 'token', 'entity_date', 'ner', 'entity_age', 'embeddings', 'deid_merged_chunk', 'entity_license', 'sentence', 'entity_plate'])

Previously we saw only the columns selected by the Finisher stage (`obfuscated`, `ner_chunk`, and `sentence`). Now that the Finisher has been removed, the outputs of all stages are available.
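
To make the difference concrete, the two sets of keys printed in this notebook can be compared with a set difference. This is a small sketch using the key lists copied from the outputs above; the variable names are illustrative.

```python
# Keys returned while the Finisher stage was still in the pipeline.
keys_with_finisher = {"obfuscated", "ner_chunk", "sentence"}

# Keys returned after the Finisher stage was removed.
keys_without_finisher = {
    "entity_ssn", "ner_enriched", "obfuscated", "entity_vin", "entity_dln",
    "entity_country", "document", "ner_chunk", "deid_merged_parse1",
    "entity_med", "ner_chunk_large", "entity_phone", "entity_zip",
    "entity_state", "entity_account", "ner_chunk_enriched", "entity_email",
    "entity_icd10cm", "token", "entity_date", "ner", "entity_age",
    "embeddings", "deid_merged_chunk", "entity_license", "sentence",
    "entity_plate",
}

# Columns that only became visible once the Finisher was gone.
newly_visible = sorted(keys_without_finisher - keys_with_finisher)
print(len(newly_visible))  # 24
```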

To see which JSL models the pretrained pipeline stages use, check the `/root/cache_pretrained` folder.
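
A minimal sketch for listing that folder from Python is shown below. The `list_cached_models` helper is hypothetical, and it simply returns an empty list when the folder does not exist on the current machine.

```python
import os


def list_cached_models(cache_dir="/root/cache_pretrained"):
    """Return the sorted entries of the cache folder, or [] if it is missing."""
    if not os.path.isdir(cache_dir):
        return []
    return sorted(os.listdir(cache_dir))


for name in list_cached_models():
    print(name)
```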