

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_EHR_DATA.ipynb)




# **De-identify Structured Data**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do so. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.



## 1. Colab Setup

Import license keys

In [1]:
import os
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

sparknlp_version = license_keys["PUBLIC_VERSION"]
jsl_version = license_keys["JSL_VERSION"]

print('SparkNLP Version:', sparknlp_version)
print('SparkNLP-JSL Version:', jsl_version)

Saving v3_spark_nlp_for_healthcare.json to v3_spark_nlp_for_healthcare.json
SparkNLP Version: 3.0.1
SparkNLP-JSL Version: 3.0.0
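
This notebook reads `PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET` from the uploaded file, so a quick check up front can catch an incomplete license file before installation starts. A minimal sketch, assuming those three keys are the only ones that matter here; `missing_license_keys` is a hypothetical helper, not part of Spark NLP:

```python
# Keys this notebook reads from the uploaded license file.
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")

def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `keys`."""
    return [k for k in required if not keys.get(k)]

# Hypothetical, incomplete license dictionary for illustration.
example = {"PUBLIC_VERSION": "3.0.1", "JSL_VERSION": "3.0.0"}
print(missing_license_keys(example))  # ['SECRET']
```

In the notebook, calling `missing_license_keys(license_keys)` right after loading the JSON should return an empty list when the file is complete.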


Install dependencies

In [2]:
%%capture
for k,v in license_keys.items(): 
    %set_env $k=$v

!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jsl_colab_setup.sh
!bash jsl_colab_setup.sh

Import dependencies into Python

In [3]:
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from tabulate import tabulate
import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl

Start the Spark session

In [4]:
spark = sparknlp_jsl.start(license_keys['SECRET'])

# Alternatively, start the session manually with custom Spark settings:
# params = {"spark.driver.memory" : "16G",
#           "spark.kryoserializer.buffer.max" : "2000M",
#           "spark.driver.maxResultSize" : "2000M"}

# spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

## 2. Select the NER model and construct the pipeline

Select the models:


* NER Deidentification models: **ner_deid_enriched, ner_deid_large**

* Deidentification models: **deidentify_large, deidentify_rb, deidentify_rb_no_regex**





For more details: https://github.com/JohnSnowLabs/spark-nlp-models#pretrained-models---spark-nlp-for-healthcare

In [5]:
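# Change these to the models you want to use, then re-run the cells below.
# NER models: ner_deid_enriched, ner_deid_large
# De-identification models: deidentify_large, deidentify_rb, deidentify_rb_no_regex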
MODEL_NAME = "ner_deid_large"
DEID_MODEL_NAME = "deidentify_large"

Create the pipeline

In [6]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")


tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

# NER model trained on n2c2 datasets
clinical_ner = MedicalNerModel.pretrained(MODEL_NAME, "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# NER Overwriter to ensure all the entities are deidentified.
# Use this if the NER does not recognize entities.
neroverwriter = NerOverwriter() \
    .setInputCols(["ner"]) \
    .setOutputCol("ner_overwrited") \
    .setStopWords(['AIQING', 'YBARRA']) \
    .setNewResult("B-NAME")

ner_converter = NerConverterInternal()\
  .setInputCols(["sentence", "token", "ner_overwrited"])\
  .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    neroverwriter,
    ner_converter])

empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)
light_pipeline = LightPipeline(pipeline_model)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_large download started this may take some time.
Approximate size to download 14.1 MB
[OK!]
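
The `NerOverwriter` stage in the pipeline above forces a fixed label onto specific tokens that the NER model might miss. Conceptually this is a token-level lookup. The sketch below illustrates the idea in plain Python on hypothetical token and label lists; it is not the library's implementation, and details such as exact, case-sensitive matching are assumptions:

```python
def overwrite_labels(tokens, labels, stop_words, new_label):
    """Replace the label of every token found in stop_words (exact match assumed)."""
    stop = set(stop_words)
    return [new_label if tok in stop else lab for tok, lab in zip(tokens, labels)]

tokens = ["Seen", "by", "YBARRA", "today"]
labels = ["O", "O", "O", "O"]  # hypothetical NER output that missed the name
print(overwrite_labels(tokens, labels, ["AIQING", "YBARRA"], "B-NAME"))
# ['O', 'O', 'B-NAME', 'O']
```

Because the overwritten label feeds the chunk converter, a token relabeled this way is treated like any other detected entity downstream.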


## 3. Create example inputs

In [7]:
# Enter one example value per column; free text goes in 'Summary'
df = pd.DataFrame({
    'Name': ['Dave'],
    'DOB': ['1970-01-01'],
    'Address': ['Kensington Street'],
    'Summary': ['Mr. Dave said he has cut his alcohol back to 6 pack once a week. He has cut back his cigarettes to one time per week. His PCP was M.D William Boss who had suggested some tests.'],
})


## 4. De-identify using Obfuscation Method

Define De-identification Model

In [8]:
deidentification = DeIdentificationModel.pretrained(DEID_MODEL_NAME, "en", "clinical/models") \
                                                .setInputCols(["sentence", "token", "ner_chunk"]) \
                                                .setOutputCol("deidentified") \
                                                .setObfuscateDate(True)\
                                                .setMode('obfuscate')

deidentify_large download started this may take some time.
Approximate size to download 188.1 KB
[OK!]


In [9]:
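# Helper: de-identify the first row of a pandas DataFrame, one column at a time.
# Relies on the globals `spark`, `pipeline_model`, and `deidentification` defined above.
def deid_row(df):
    res_m = {}
    for col in df.columns:
        # Wrap the single cell value in a one-row Spark DataFrame with a 'text' column
        text_df = spark.createDataFrame(pd.DataFrame({'text': [df[col].values[0]]}))
        result = pipeline_model.transform(text_df)

        # Run the de-identification model on the NER output
        res1 = deidentification.transform(result).toPandas()

        # Join the de-identified sentences; index 3 of each annotation is its result text
        res_m[col] = ' '.join(r[3] for r in res1['deidentified'].iloc[0])

    return pd.DataFrame([res_m])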
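
The helper above reads `df[col].values[0]`, so it only de-identifies the first row of the input. For tables with more rows, the same per-cell idea can be applied to every row. A minimal sketch in plain Python, assuming rows are held as a list of dictionaries; `deid_rows` and `fake_deid` are hypothetical names, and `fake_deid` merely stands in for the Spark pipeline call:

```python
def deid_rows(rows, deid_text):
    """Apply deid_text to every cell of every row (rows: list of dicts)."""
    return [{col: deid_text(str(val)) for col, val in row.items()} for row in rows]

# Hypothetical stand-in for the Spark pipeline, for illustration only:
# it masks any non-empty value and leaves empty strings untouched.
def fake_deid(text):
    return "<MASKED>" if text else text

rows = [{"Name": "Dave", "DOB": "1970-01-01"}, {"Name": "Ann", "DOB": ""}]
print(deid_rows(rows, fake_deid))
# [{'Name': '<MASKED>', 'DOB': '<MASKED>'}, {'Name': '<MASKED>', 'DOB': ''}]
```

In the notebook, `deid_text` would wrap the two `transform` calls. Running one Spark job per cell is likely to be slow on larger tables, so batching all values of a column into a single Spark DataFrame is probably the better design there.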

In [10]:
result_obfuscated = deid_row(df)

Visualize

In [15]:
#ORIGINAL
print(tabulate(df, headers=('Name', 'DOB', 'Address', 'Summary')))

    Name    DOB         Address            Summary
--  ------  ----------  -----------------  --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 0  Dave    1970-01-01  Kensington Street  Mr. Dave said he has cut his alcohol back to 6 pack once a week. He has cut back his cigarettes to one time per week. His PCP was M.D William Boss who had suggested some tests.


In [17]:
#OBFUSCATED
print(tabulate(result_obfuscated, headers=('Name', 'DOB', 'Address', 'Summary')))

    Name    DOB         Address         Summary
--  ------  ----------  --------------  --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 0  NARIKO  1970-02-15  667 Port Jacob  Mr. Byram said he has cut his alcohol back to 6 pack once a week. He has cut back his cigarettes to one time per week. His PCP was M.D ROXIE who had suggested some tests.
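
In the obfuscated output above, the DOB `1970-01-01` became `1970-02-15`, which is 45 days later. One common way to obfuscate dates while preserving the intervals between them is to shift every date by a fixed offset. Whether `setObfuscateDate(True)` works this way internally is an assumption here, not something this notebook states; `shift_date` is a hypothetical helper:

```python
from datetime import date, timedelta

def shift_date(iso_date, days):
    """Shift an ISO-formatted date (YYYY-MM-DD) by a number of days."""
    return (date.fromisoformat(iso_date) + timedelta(days=days)).isoformat()

print(shift_date("1970-01-01", 45))  # 1970-02-15
```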


## 5. De-identify using Masking Method

Define De-identification Model

In [12]:
deidentification = DeIdentificationModel.pretrained(DEID_MODEL_NAME, "en", "clinical/models") \
                                                .setInputCols(["sentence", "token", "ner_chunk"]) \
                                                .setOutputCol("deidentified") \
                                                .setObfuscateDate(True)\
                                                .setMode('mask')

deidentify_large download started this may take some time.
Approximate size to download 188.1 KB
[OK!]


In [13]:
result_masked = deid_row(df)

Visualize

In [18]:
#ORIGINAL
print(tabulate(df, headers=('Name', 'DOB', 'Address', 'Summary')))

    Name    DOB         Address            Summary
--  ------  ----------  -----------------  --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 0  Dave    1970-01-01  Kensington Street  Mr. Dave said he has cut his alcohol back to 6 pack once a week. He has cut back his cigarettes to one time per week. His PCP was M.D William Boss who had suggested some tests.


In [19]:
#MASKED
print(tabulate(result_masked, headers=('Name', 'DOB', 'Address', 'Summary')))

    Name    DOB     Address     Summary
--  ------  ------  ----------  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 0  <NAME>  <DATE>  <LOCATION>  Mr. <NAME> said he has cut his alcohol back to 6 pack once a week. He has cut back his cigarettes to one time per week. His PCP was M.D <NAME> who had suggested some tests.
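
The two modes shown in sections 4 and 5 differ only in what replaces each detected chunk: masking inserts the entity label, while obfuscation inserts a surrogate value. The sketch below illustrates that difference in plain Python on hand-written character spans. `replace_chunks` and the fixed `surrogates` dictionary are hypothetical; how the pretrained model chooses its surrogates is not described in this notebook.

```python
def replace_chunks(text, chunks, mode, surrogates=None):
    """Replace entity spans in text.

    chunks: non-overlapping (start, end, label) tuples, end exclusive.
    mode: "mask" inserts <LABEL>; "obfuscate" inserts a surrogate for the label.
    """
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(chunks, reverse=True):
        if mode == "mask":
            repl = f"<{label}>"
        else:
            # Fall back to the mask if no surrogate is known for this label.
            repl = (surrogates or {}).get(label, f"<{label}>")
        out = out[:start] + repl + out[end:]
    return out

text = "Mr. Dave lives on Kensington Street."
chunks = [(4, 8, "NAME"), (18, 35, "LOCATION")]  # hand-written spans
surrogates = {"NAME": "Byram", "LOCATION": "667 Port Jacob"}  # illustrative only

print(replace_chunks(text, chunks, "mask"))
# Mr. <NAME> lives on <LOCATION>.
print(replace_chunks(text, chunks, "obfuscate", surrogates))
# Mr. Byram lives on 667 Port Jacob.
```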
