

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_EHR_DATA.ipynb)




# **De-identify Structured Data**

To run this notebook yourself, you need to upload your license keys. Otherwise, you can read the example outputs included in the notebook. To upload the keys, open the file explorer on the left side of the screen and upload `workshop_license_keys.json` to the folder that opens.

## 1. Colab Setup

Import license keys

In [None]:
import os
import json

with open('/content/workshop_license_keys.json', 'r') as f:
    license_keys = json.load(f)

license_keys.keys()

secret = license_keys['JSL_SECRET']
os.environ['SPARK_NLP_LICENSE'] = license_keys['SPARK_NLP_LICENSE']
os.environ['JSL_OCR_LICENSE'] = license_keys['JSL_OCR_LICENSE']
os.environ['AWS_ACCESS_KEY_ID'] = license_keys['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = license_keys['AWS_SECRET_ACCESS_KEY']

jsl_version = secret.split('-')[0]
jsl_version

'2.5.5'
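
The cell above assumes that every key exists in the JSON file and that the secret starts with the library version. As a minimal sketch (the helper names and the sample secret below are hypothetical and not part of the workshop code), the same two assumptions can be checked before any environment variable is set:

```python
import json

# Keys that the setup cell reads from the license file.
REQUIRED_KEYS = [
    "JSL_SECRET",
    "SPARK_NLP_LICENSE",
    "JSL_OCR_LICENSE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]


def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent from the parsed license file."""
    return [key for key in required if key not in license_keys]


def version_from_secret(secret):
    """Return the text before the first '-', mirroring secret.split('-')[0]."""
    return secret.split("-")[0]


# Made-up license content, used only to exercise the helpers.
example = json.loads('{"JSL_SECRET": "2.5.5-notarealsecret", "SPARK_NLP_LICENSE": "x"}')
print(missing_license_keys(example))
print(version_from_secret(example["JSL_SECRET"]))
```

If the first call returns a non-empty list, the `license_keys[...]` lookups in the setup cell would raise a `KeyError` for those names.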

Install dependencies

In [None]:
# Install Java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
! java -version

# Install pyspark
! pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp
! python -m pip install --upgrade spark-nlp-jsl==$jsl_version --extra-index-url https://pypi.johnsnowlabs.com/$secret

Import dependencies into Python

In [None]:
os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ['PATH'] = os.environ['JAVA_HOME'] + "/bin:" + os.environ['PATH']

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl


Start the Spark session

In [None]:
spark = sparknlp_jsl.start(secret)

## 2. Select the NER model and construct the pipeline

Select the models:


* NER Deidentification models: **ner_deid_enriched, ner_deid_large**

* Deidentification models: **deidentify_large, deidentify_rb, deidentify_rb_no_regex**





For more details: https://github.com/JohnSnowLabs/spark-nlp-models#pretrained-models---spark-nlp-for-healthcare

In [None]:
# Change these to the models you want to use and re-run the cells below.
# NER de-identification models: ner_deid_enriched, ner_deid_large
MODEL_NAME = "ner_deid_large"
DEID_MODEL_NAME = "deidentify_large"

Create the pipeline

In [None]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")


sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")


tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

# NER model trained on n2c2 datasets
clinical_ner = NerDLModel.pretrained(MODEL_NAME, "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

# NerOverwriter assigns the tag B-NAME to the listed tokens so that they are always de-identified.
# Use this for entities that the NER model does not recognize.
neroverwriter = NerOverwriter() \
    .setInputCols(["ner"]) \
    .setOutputCol("ner_overwrited") \
    .setStopWords(['AIQING', 'YBARRA']) \
    .setNewResult("B-NAME")

ner_converter = NerConverterInternal()\
  .setInputCols(["sentence", "token", "ner_overwrited"])\
  .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    neroverwriter,
    ner_converter])

empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)
light_pipeline = LightPipeline(pipeline_model)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_large download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
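
To make the role of the `NerOverwriter` stage concrete, here is a plain-Python sketch of the idea: every token that appears in the stop-word list gets its tag replaced with a new one. This is an illustration only, not the library's implementation. The function name, the sample tokens and tags, and the exact-match comparison are all assumptions of the sketch.

```python
def overwrite_tags(tokens, tags, stop_words, new_tag):
    """Return a copy of tags where each token found in stop_words gets new_tag."""
    stop = set(stop_words)
    return [new_tag if token in stop else tag for token, tag in zip(tokens, tags)]


# Made-up tokens and NER tags: the model missed "AIQING" but caught "YBARRA".
tokens = ["Patient", "AIQING", "was", "seen", "by", "Dr", "YBARRA"]
tags = ["O", "O", "O", "O", "O", "O", "B-NAME"]

print(overwrite_tags(tokens, tags, ["AIQING", "YBARRA"], "B-NAME"))
# ['O', 'B-NAME', 'O', 'O', 'O', 'O', 'B-NAME']
```

Tokens that the model already tagged correctly are unaffected, so the overwrite acts as a safety net for known names.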


## 3. Create example inputs

In [None]:
# Enter one example record; each key becomes a column of the DataFrame
df = pd.DataFrame({'Name': ['Dave'], 'DOB':['1970-01-01'], 'Address': ['Kensington Street'], 
                  'Summary':['Mr. Dave said he has cut his alcohol back to 6 pack once a week. He has cut back his cigarettes to one time per week. His PCP was M.D William Boss who had suggested some tests.']
                              })


## 4. De-identify using Obfuscation Method

Define De-identification Model

In [None]:
deidentification = DeIdentificationModel.pretrained(DEID_MODEL_NAME, "en", "clinical/models") \
                                                .setInputCols(["sentence", "token", "ner_chunk"]) \
                                                .setOutputCol("deidentified") \
                                                .setObfuscateDate(True)\
                                                .setMode('obfuscate')

deidentify_large download started this may take some time.
Approximate size to download 188.1 KB
[OK!]
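
The example output further down shows the date of birth `1970-01-01` obfuscated to `1970-01-21`, which is consistent with shifting the date by a number of days. The sketch below only illustrates that kind of shift with the standard library. How the model chooses its shift is not stated in this notebook, and the function name and the 20-day offset are assumptions of the sketch.

```python
from datetime import date, timedelta


def shift_date(iso_date, days):
    """Shift an ISO-formatted date (YYYY-MM-DD) by a number of days."""
    return (date.fromisoformat(iso_date) + timedelta(days=days)).isoformat()


print(shift_date("1970-01-01", 20))  # 1970-01-21
```

Shifting keeps the value a valid date, so downstream code that parses the column continues to work on de-identified data.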


In [None]:
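# Helper: de-identify every column of the first row of a pandas DataFrame.
# Relies on the globals `pipeline_model` and `deidentification` defined above.
def deid_row(df):
    res_m = {}
    for col in df.columns:
        # Wrap the cell value in a one-row Spark DataFrame with a "text" column
        text_df = spark.createDataFrame(pd.DataFrame({'text': [df[col].values[0]]}))
        result = pipeline_model.transform(text_df)

        deid_text = deidentification.transform(result)
        res1 = deid_text.toPandas()
        # Index 3 of each annotation holds its result string; join the sentences
        res_m[col] = ' '.join(r[3] for r in res1['deidentified'].iloc[0])

    return pd.DataFrame([res_m])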
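
The expression `r[3]` in `deid_row` picks one field out of each annotation in the `deidentified` column. As a sketch of what that indexing does, the stand-in below models an annotation as a named tuple. The field order used here (`annotatorType`, `begin`, `end`, `result`, `metadata`, `embeddings`) is an assumption about the Spark NLP annotation schema, and the sample sentences are made up.

```python
from collections import namedtuple

# Stand-in for one annotation row; the field order is an assumption of this sketch.
Annotation = namedtuple(
    "Annotation", "annotatorType begin end result metadata embeddings"
)


def join_results(annotations):
    """Join the result field (index 3) of each annotation with single spaces."""
    return " ".join(a[3] for a in annotations)


# Two made-up sentence-level annotations, as a de-identifier in mask mode might emit.
rows = [
    Annotation("document", 0, 21, "Mr. <NAME> said hello.", {}, []),
    Annotation("document", 23, 30, "He left.", {}, []),
]
print(join_results(rows))  # Mr. <NAME> said hello. He left.
```

Using `" ".join(...)` avoids the leading space that repeated string concatenation would otherwise produce.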

In [None]:
result_obfuscated = deid_row(df)

Visualize

In [None]:
result_obfuscated

| | Name | DOB | Address | Summary |
|---|---|---|---|---|
| 0 | Deane | 1970-01-21 | Sylvarena | Mr. Michaela said he has cut his alcohol back... |


## 5. De-identify using Masking Method

Define De-identification Model

In [None]:
deidentification = DeIdentificationModel.pretrained(DEID_MODEL_NAME, "en", "clinical/models") \
                                                .setInputCols(["sentence", "token", "ner_chunk"]) \
                                                .setOutputCol("deidentified") \
                                                .setObfuscateDate(True)\
                                                .setMode('mask')

In [None]:
result_masked = deid_row(df)

Visualize

In [None]:
result_masked

| | Name | DOB | Address | Summary |
|---|---|---|---|---|
| 0 | \<NAME\> | \<DATE\> | \<LOCATION\> | Mr. \<NAME\> said he has cut his alcohol back t... |
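
The difference between the two modes can be sketched in plain Python: masking replaces each detected chunk with its entity label in angle brackets, while obfuscation replaces it with a substitute value. The code below is a simplified illustration, not the `DeIdentificationModel` implementation. The function names, the hand-written chunk spans, and the lookup table of substitute values are all assumptions of the sketch.

```python
def span(text, piece, label):
    """Build a (begin, end, label) chunk for the first occurrence of piece."""
    begin = text.index(piece)
    return (begin, begin + len(piece), label)


def replace_chunks(text, chunks, mode, fake_values=None):
    """Replace each chunk with <LABEL> (mask) or a substitute value (obfuscate)."""
    out, cursor = [], 0
    for begin, end, label in sorted(chunks):
        out.append(text[cursor:begin])  # keep the text between chunks untouched
        if mode == "mask":
            out.append("<" + label + ">")
        elif mode == "obfuscate":
            out.append(fake_values[label])
        else:
            raise ValueError("mode must be 'mask' or 'obfuscate'")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "Mr. Dave lives on Kensington Street."
chunks = [span(text, "Dave", "NAME"), span(text, "Kensington Street", "LOCATION")]
fakes = {"NAME": "Deane", "LOCATION": "Sylvarena"}  # hypothetical substitutes

print(replace_chunks(text, chunks, "mask"))              # Mr. <NAME> lives on <LOCATION>.
print(replace_chunks(text, chunks, "obfuscate", fakes))  # Mr. Deane lives on Sylvarena.
```

Masked output makes it obvious which spans were removed, whereas obfuscated output stays readable as natural text.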
