![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT.ipynb)

# **Deidentify free text documents**

To run this notebook yourself, you will need to upload your license keys. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload your license file to the folder that opens, naming it `spark_jsl.json` so that the cell below finds it.
Otherwise, you can look at the example outputs at the bottom of the notebook.



## 1. Colab Setup

Import license keys

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
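
The install and session cells that follow read `PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET` from the uploaded file, so a missing key would only surface later as a failed install or a `KeyError`. As a minimal sketch, you could check the loaded dictionary up front. The helper name `missing_license_keys` is hypothetical, and the key list is inferred from how this notebook uses the file, not from an official schema.

```python
import json

# Keys this notebook reads later; inferred from the cells here, not an official schema.
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")

def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the license dict."""
    return [key for key in required if not license_keys.get(key)]

# Placeholder values for illustration only; a real file holds your own credentials.
example = json.loads('{"PUBLIC_VERSION": "4.2.8", "JSL_VERSION": "4.2.8"}')
print(missing_license_keys(example))  # ['SECRET']
```

In the notebook you would pass the `license_keys` dictionary loaded above and stop early if the returned list is not empty.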

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.2.8
Spark NLP_JSL Version : 4.2.8


## 2. Select the NER model and construct the pipeline

Select the models:


* NER Deidentification models: **ner_deid_enriched, ner_deid_large**

* Deidentification models: **deidentify_large, deidentify_rb, deidentify_rb_no_regex**





For more details: https://github.com/JohnSnowLabs/spark-nlp-models#pretrained-models---spark-nlp-for-healthcare

In [4]:
# Change this to the model you want to use and re-run the cells below.
# Deidentification NER models: ner_deid_subentity_augmented_i2b2, ner_deid_generic_augmented

MODEL_NAME = "ner_deid_subentity_augmented_i2b2"
# MODEL_NAME = "ner_deid_generic_augmented" 

Create the pipeline

In [5]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")


sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")


tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

# NER model trained on n2c2 datasets
clinical_ner = MedicalNerModel.pretrained(MODEL_NAME, "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
  .setInputCols(["sentence", "token", "ner"])\
  .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter])

empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)
light_pipeline = LightPipeline(pipeline_model)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented_i2b2 download started this may take some time.
[OK!]


## 3. Create example inputs

In [6]:
# Enter examples as strings in this array
input_list = [
    """A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 719435 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street. Phone 302-786-5227."""]

In [7]:
import pandas as pd

df = spark.createDataFrame(pd.DataFrame({'text':input_list}))

## 4. Run the pipeline to find Entities

In [8]:
result = pipeline_model.transform(df)

Visualize

In [9]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,                                
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------------+-------------+
|chunk                        |ner_label    |
+-----------------------------+-------------+
|2093-01-13                   |DATE         |
|David Hale                   |DOCTOR       |
|Hendrickson , Ora            |PATIENT      |
|719435                       |MEDICALRECORD|
|01/13/93                     |DATE         |
|Oliveira                     |DOCTOR       |
|25                           |AGE          |
|2079-11-09                   |DATE         |
|Cocke County Baptist Hospital|HOSPITAL     |
|0295 Keats Street            |STREET       |
|302-786-5227                 |PHONE        |
+-----------------------------+-------------+
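
The `F.arrays_zip` and `F.explode` calls above pair the i-th chunk text with the i-th metadata entry and then emit one row per pair. A plain-Python sketch of the same reshaping follows, using a few values from the output above. The lists are hand-copied for illustration and the metadata is reduced to the `entity` field; in the notebook both come from the `ner_chunk` column.

```python
chunk_results = ["2093-01-13", "David Hale", "Hendrickson , Ora"]
chunk_metadata = [{"entity": "DATE"}, {"entity": "DOCTOR"}, {"entity": "PATIENT"}]

# zip() plays the role of arrays_zip; the comprehension yields one row per pair, like explode.
rows = [
    {"chunk": chunk, "ner_label": meta["entity"]}
    for chunk, meta in zip(chunk_results, chunk_metadata)
]

for row in rows:
    print(f"{row['chunk']:<20}{row['ner_label']}")
```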



## 5. Deidentify using Obfuscation Method

In [10]:
obfuscation = DeIdentification()\
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("obfuscated") \
      .setMode("obfuscate")\
      .setObfuscateDate(True)

obfuscate_pipeline = Pipeline(stages=[
                documentAssembler, 
                sentenceDetector,
                tokenizer,
                word_embeddings,
                clinical_ner,
                ner_converter,
                obfuscation])

# Fit the pipeline on the example DataFrame and transform it in one step

deid_text = obfuscate_pipeline.fit(df).transform(df)

## 6. Visualize Obfuscated Results

In [11]:
deid_text.select(F.explode(F.arrays_zip(deid_text.sentence.result, 
                                        deid_text.obfuscated.result)).alias("cols")) \
         .select(F.expr("cols['0']").alias("sentence"), 
                 F.expr("cols['1']").alias("deidentified")).toPandas()

| | sentence | deidentified |
|---|---|---|
| 0 | A . | A . |
| 1 | Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-02-07 , Dr Warren Jungling , M.D . |
| 2 | , Name : Hendrickson , Ora MR . | , Name : Moise Poli MR . |
| 3 | # 719435 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . | # Z8172585 Date : 07-19-1980 PCP : Dr Jinx Lulas , 28 years-old , Record date : 2079-12-13 . |
| 4 | Cocke County Baptist Hospital . | VA MEDICAL CENTER - BATTLE CREEK . |
| 5 | 0295 Keats Street. | Ackerweg 32. |
| 6 | Phone 302-786-5227. | Phone (60) 245-071. |
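
With `setObfuscateDate(True)`, the dates in the output are replaced by other dates rather than by a label. One way to picture this is as shifting a date by some number of days. The sketch below illustrates that idea with the standard library only. It is not how `DeIdentification` is implemented; `shift_date` is a hypothetical helper, and the 25-day offset is chosen only because it reproduces the first date in the output above.

```python
from datetime import datetime, timedelta

def shift_date(date_str, days, fmt="%Y-%m-%d"):
    """Shift a date string by a number of days, keeping the same format."""
    shifted = datetime.strptime(date_str, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)

print(shift_date("2093-01-13", 25))  # 2093-02-07
```

Note that the two ISO-formatted dates in the example output differ from their originals by different numbers of days, so a single fixed offset like the one here would not reproduce the whole table.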


## 7. Deidentify using Masking Method

In [12]:
masking = DeIdentification()\
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("masked") \
      .setMode("mask")

masking_pipeline = Pipeline(
    stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      masking
      ])

deid_text = masking_pipeline.fit(df).transform(df)

## 8. Visualize Masked Results

In [13]:
deid_text.select(F.explode(F.arrays_zip(deid_text.sentence.result, 
                                        deid_text.masked.result)).alias("cols")) \
         .select(F.expr("cols['0']").alias("sentence"), 
                 F.expr("cols['1']").alias("deidentified")).toPandas()

| | sentence | deidentified |
|---|---|---|
| 0 | A . | `A .` |
| 1 | Record date : 2093-01-13 , David Hale , M.D . | `Record date : <DATE> , <DOCTOR> , M.D .` |
| 2 | , Name : Hendrickson , Ora MR . | `, Name : <PATIENT> MR .` |
| 3 | # 719435 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . | `# <MEDICALRECORD> Date : <DATE> PCP : <DOCTOR> , <AGE> years-old , Record date : <DATE> .` |
| 4 | Cocke County Baptist Hospital . | `<HOSPITAL> .` |
| 5 | 0295 Keats Street. | `<STREET>.` |
| 6 | Phone 302-786-5227. | `Phone <PHONE>.` |
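
Conceptually, masking replaces each detected chunk with its entity label in angle brackets. Below is a minimal sketch of that substitution on one sentence, assuming the entities are available as `(start, end, label)` character spans with an exclusive end. The `mask_entities` helper and the span format are hypothetical and are not part of the `DeIdentification` API.

```python
def mask_entities(text, entities):
    """Replace each (start, end, label) span in text with <label>."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text

sentence = "Phone 302-786-5227."
start = sentence.index("302-786-5227")
entities = [(start, start + len("302-786-5227"), "PHONE")]
print(mask_entities(sentence, entities))  # Phone <PHONE>.
```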


## 9. Comparison of Original Sentence, Masked, and Obfuscated Modes

In [14]:
pd.set_option("display.max_colwidth", None)

deid_pipeline = Pipeline(
    stages=[
        documentAssembler, 
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        masking,
        obfuscation
        ])

deid_result = deid_pipeline.fit(df).transform(df)


Visualize

In [15]:
deid_result.select(F.explode(F.arrays_zip(deid_result.sentence.result, 
                                          deid_result.masked.result,
                                          deid_result.obfuscated.result)).alias("cols")) \
           .select(F.expr("cols['0']").alias("Original Sentence"), 
                   F.expr("cols['1']").alias("Masked"),
                   F.expr("cols['2']").alias("Obfuscated")).toPandas()

| | Original Sentence | Masked | Obfuscated |
|---|---|---|---|
| 0 | A . | `A .` | A . |
| 1 | Record date : 2093-01-13 , David Hale , M.D . | `Record date : <DATE> , <DOCTOR> , M.D .` | Record date : 2093-02-09 , Dr Warren Jungling , M.D . |
| 2 | , Name : Hendrickson , Ora MR . | `, Name : <PATIENT> MR .` | , Name : Moise Poli MR . |
| 3 | # 719435 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . | `# <MEDICALRECORD> Date : <DATE> PCP : <DOCTOR> , <AGE> years-old , Record date : <DATE> .` | # Z8172585 Date : 07-19-1980 PCP : Dr Jinx Lulas , 28 years-old , Record date : 2079-12-18 . |
| 4 | Cocke County Baptist Hospital . | `<HOSPITAL> .` | VA MEDICAL CENTER - BATTLE CREEK . |
| 5 | 0295 Keats Street. | `<STREET>.` | Ackerweg 32. |
| 6 | Phone 302-786-5227. | `Phone <PHONE>.` | Phone (60) 245-071. |
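
A quick sanity check on de-identified output is to confirm that none of the detected chunks survive verbatim. The sketch below runs a plain substring test over a few of the masked sentences shown above. `leaked_chunks` is a hypothetical helper, and a substring test can raise false alarms on obfuscated text, where a short chunk such as `25` may appear by chance inside a generated value.

```python
def leaked_chunks(chunks, deidentified_sentences):
    """Return the chunks that still appear verbatim in the de-identified text."""
    joined = " ".join(deidentified_sentences)
    return [chunk for chunk in chunks if chunk in joined]

chunks = ["2093-01-13", "David Hale", "719435", "25", "302-786-5227"]
masked = [
    "Record date : <DATE> , <DOCTOR> , M.D .",
    "# <MEDICALRECORD> Date : <DATE> PCP : <DOCTOR> , <AGE> years-old , Record date : <DATE> .",
    "Phone <PHONE>.",
]
print(leaked_chunks(chunks, masked))  # []
```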
