

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT.ipynb)




# **Deidentify free text documents**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do that, or open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs shown under each cell of the notebook.



## 1. Colab Setup

Import license keys

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)
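Before installing anything, it can help to confirm that the uploaded file contains the keys the later cells rely on. The sketch below checks for `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION`, the three names this notebook references. The helper name `missing_license_keys` and the example content are hypothetical, and a real license file may contain additional keys.

```python
import json

# Keys that the later cells of this notebook reference.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def missing_license_keys(keys):
    """Return the required keys that are absent or empty in `keys`."""
    return [name for name in REQUIRED_KEYS if not keys.get(name)]


# Hypothetical file content, used only to illustrate the check.
example = json.loads('{"SECRET": "xxx", "PUBLIC_VERSION": "3.4.0"}')
print(missing_license_keys(example))  # ['JSL_VERSION']
```

An empty list means the install and session cells below should find every value they need.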

Install dependencies

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

Import dependencies into Python

In [3]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from tabulate import tabulate
import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl

Start the Spark session

In [4]:
# manually start session
# params = {"spark.driver.memory" : "16G",
#           "spark.kryoserializer.buffer.max" : "2000M",
#           "spark.driver.maxResultSize" : "2000M"}

# spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)


print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'])

spark

Spark NLP Version : 3.4.0
Spark NLP_JSL Version : 3.4.0


## 2. Select the NER model and construct the pipeline

Select the models:


* NER Deidentification models: **ner_deid_enriched, ner_deid_large, ner_deid_subentity_augmented_i2b2** (the last one is the default set in the cell below)

* Deidentification models: **deidentify_large, deidentify_rb, deidentify_rb_no_regex**





For more details: https://github.com/JohnSnowLabs/spark-nlp-models#pretrained-models---spark-nlp-for-healthcare

In [17]:
# Change these to the models you want to use and re-run the cells below.
# NER deidentification models and deidentification models are listed above.
MODEL_NAME = "ner_deid_subentity_augmented_i2b2"
DEID_MODEL_NAME = "deidentify_large"

Create the pipeline

In [6]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")


sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")


tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

# NER model trained on n2c2 datasets
clinical_ner = MedicalNerModel.pretrained(MODEL_NAME, "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# NerOverwriter forces a label onto specific tokens so they are always deidentified.
# Use it when the NER model does not recognize known entities (here, two names).
neroverwriter = NerOverwriter() \
    .setInputCols(["ner"]) \
    .setOutputCol("ner_overwrited") \
    .setStopWords(['AIQING', 'YBARRA']) \
    .setNewResult("B-NAME")

ner_converter = NerConverterInternal()\
  .setInputCols(["sentence", "token", "ner_overwrited"])\
  .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    neroverwriter,
    ner_converter])

empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)
light_pipeline = LightPipeline(pipeline_model)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented_i2b2 download started this may take some time.
Approximate size to download 14 MB
[OK!]
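To make the role of the `NerOverwriter` stage concrete, here is a minimal plain-Python sketch of the idea: any token found in the stop-word list gets its tag replaced by the new result. This is only a conceptual illustration. The real annotator works on Spark annotation columns, the function name `overwrite_tags` is hypothetical, exact case-sensitive matching is an assumption, and the example tags are made up.

```python
def overwrite_tags(tokens, tags, stop_words, new_result):
    """Replace the tag of every token listed in `stop_words` with `new_result`."""
    stop = set(stop_words)
    return [new_result if tok in stop else tag for tok, tag in zip(tokens, tags)]


tokens = ["Patient", "AIQING", ",", "25"]
tags = ["O", "O", "O", "B-AGE"]  # hypothetical tags where the name was missed

print(overwrite_tags(tokens, tags, ["AIQING", "YBARRA"], "B-NAME"))
# ['O', 'B-NAME', 'O', 'B-AGE']
```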


## 3. Create example inputs

In [7]:
# Enter examples as strings in this list
input_list = [
    """Patient AIQING, 25 month years-old , born in Beijing, was transfered to the The Johns Hopkins Hospital. Phone number: (541) 754-3010. MSW 100009632582 for his colonic polyps. He wants to know the results from them. He is not taking hydrochlorothiazide and is curious about his blood 
pressure. He said he has cut his alcohol back to 6 pack once a week. He 
has cut back his cigarettes to one time per week. P:   Follow up with Dr. Hobbs in 3 months. Gilbert P. Perez, M.D."""]

## 4. Run the pipeline to find Entities

In [8]:
result = pipeline_model.transform(spark.createDataFrame(pd.DataFrame({'text':input_list})))

Visualize

In [9]:
exploded = F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata'))
select_expression_0 = F.expr("cols['0']").alias("chunk")
select_expression_1 = F.expr("cols['1']['entity']").alias("ner_label")
result.select(exploded.alias("cols")) \
    .select(select_expression_0, select_expression_1).show(truncate=False)

+----------------------+---------+
|chunk                 |ner_label|
+----------------------+---------+
|AIQING                |NAME     |
|25                    |AGE      |
|Beijing               |DATE     |
|Johns Hopkins Hospital|HOSPITAL |
|100009632582          |IDNUM    |
|Hobbs                 |DOCTOR   |
|Gilbert P. Perez      |DOCTOR   |
+----------------------+---------+
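The `arrays_zip` and `explode` calls above pair each chunk with its metadata and then emit one row per pair. In plain Python, the same reshaping can be sketched as follows. The function name `zip_and_explode` is hypothetical, and the sample lists are a hand-copied subset of the table above rather than real pipeline output.

```python
def zip_and_explode(results, metadata):
    """Pair each chunk with its metadata and return one (chunk, label) row per pair."""
    return [(chunk, meta["entity"]) for chunk, meta in zip(results, metadata)]


# Per-document arrays, shaped like `ner_chunk.result` and `ner_chunk.metadata`.
results = ["AIQING", "25", "Hobbs"]
metadata = [{"entity": "NAME"}, {"entity": "AGE"}, {"entity": "DOCTOR"}]

for chunk, label in zip_and_explode(results, metadata):
    print(f"{chunk:<10}{label}")
```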



## 5. Deidentify using Obfuscation Method

In [13]:
obfuscation = DeIdentificationModel.pretrained(DEID_MODEL_NAME, "en", "clinical/models") \
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("obfuscated") \
      .setMode("obfuscate")

deid_text = obfuscation.transform(result)

deidentify_large download started this may take some time.
Approximate size to download 188.1 KB
[OK!]


## 6. Visualize Obfuscated Results

In [14]:
deid_text.select(F.explode(F.arrays_zip('sentence.result', 'obfuscated.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("deidentified")).toPandas()

Unnamed: 0,sentence,deidentified
0,"Patient AIQING, 25 month years-old , born in Beijing, was transfered to the The Johns Hopkins Hospital.","Patient Rozetta Simpers, 88 month years-old , born in <DATE>, was transfered to the The RIDGEVIEW INSTITUTE."
1,Phone number: (541) 754-3010.,Phone number: (G9105795) 568 845 498.
2,MSW 100009632582 for his colonic polyps.,MSW OQ:574595 for his colonic polyps.
3,He wants to know the results from them.,He wants to know the results from them.
4,He is not taking hydrochlorothiazide and is curious about his blood \npressure.,He is not taking hydrochlorothiazide and is curious about his blood \npressure.
5,He said he has cut his alcohol back to 6 pack once a week.,He said he has cut his alcohol back to 6 pack once a week.
6,He \nhas cut back his cigarettes to one time per week.,He \nhas cut back his cigarettes to one time per week.
7,P: Follow up with Dr. Hobbs in 3 months.,P: Follow up with Dr. Dr Selena Stamps in 3 months.
8,"Gilbert P. Perez, M.D.","Dr Unk Mix, M.D."
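Conceptually, obfuscation swaps each detected span for a realistic surrogate of the same type. The sketch below illustrates that idea in plain Python. It is not how `DeIdentificationModel` is implemented: the surrogate lists, the function name `obfuscate`, and the `seed` parameter are hypothetical, and span ends are treated as exclusive here. Falling back to a `<LABEL>` tag when no surrogate is available mirrors the `<DATE>` that appears in the output above.

```python
import random

# Hypothetical surrogate lists; the pretrained model ships its own dictionaries.
SURROGATES = {
    "NAME": ["Alex Doe", "Sam Roe"],
    "HOSPITAL": ["General Hospital", "City Clinic"],
}


def obfuscate(text, spans, seed=0):
    """Replace each (begin, end, label) span with a surrogate, or <LABEL> if none exists.

    `end` is exclusive, and spans are assumed not to overlap.
    """
    rng = random.Random(seed)  # fixed seed keeps the replacements reproducible
    out, cursor = [], 0
    for begin, end, label in sorted(spans):
        out.append(text[cursor:begin])
        choices = SURROGATES.get(label)
        out.append(rng.choice(choices) if choices else f"<{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Patient AIQING, born in Beijing."
name_at = text.index("AIQING")
date_at = text.index("Beijing")
spans = [
    (name_at, name_at + len("AIQING"), "NAME"),
    (date_at, date_at + len("Beijing"), "DATE"),  # label taken from the NER table
]
result = obfuscate(text, spans)
print(result)
```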


## 7. Deidentify using Masking Method

In [15]:
masking = DeIdentificationModel.pretrained(DEID_MODEL_NAME, "en", "clinical/models") \
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("masked") \
      .setMode("mask")

deid_text = masking.transform(result)

deidentify_large download started this may take some time.
Approximate size to download 188.1 KB
[OK!]


## 8. Visualize Masked Results

In [16]:
deid_text.select(F.explode(F.arrays_zip('sentence.result', 'masked.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("deidentified")).toPandas()

Unnamed: 0,sentence,deidentified
0,"Patient AIQING, 25 month years-old , born in Beijing, was transfered to the The Johns Hopkins Hospital.","Patient <NAME>, <AGE> month years-old , born in <DATE>, was transfered to the The <HOSPITAL>."
1,Phone number: (541) 754-3010.,Phone number: (<ID>) <CONTACT>.
2,MSW 100009632582 for his colonic polyps.,MSW <IDNUM> for his colonic polyps.
3,He wants to know the results from them.,He wants to know the results from them.
4,He is not taking hydrochlorothiazide and is curious about his blood \npressure.,He is not taking hydrochlorothiazide and is curious about his blood \npressure.
5,He said he has cut his alcohol back to 6 pack once a week.,He said he has cut his alcohol back to 6 pack once a week.
6,He \nhas cut back his cigarettes to one time per week.,He \nhas cut back his cigarettes to one time per week.
7,P: Follow up with Dr. Hobbs in 3 months.,P: Follow up with Dr. <DOCTOR> in 3 months.
8,"Gilbert P. Perez, M.D.","<DOCTOR>, M.D."
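Masking is the simpler of the two modes: each detected span is replaced by its entity label in angle brackets. A minimal plain-Python sketch of that substitution, assuming non-overlapping spans with exclusive ends and using the hypothetical function name `mask`, could look like this.

```python
def mask(text, spans):
    """Replace each (begin, end, label) span in `text` with <label>; `end` is exclusive."""
    out, cursor = [], 0
    for begin, end, label in sorted(spans):
        out.append(text[cursor:begin])  # keep the text before the entity
        out.append(f"<{label}>")        # substitute the entity with its label
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "P: Follow up with Dr. Hobbs in 3 months."
begin = text.index("Hobbs")
print(mask(text, [(begin, begin + len("Hobbs"), "DOCTOR")]))
# P: Follow up with Dr. <DOCTOR> in 3 months.
```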
