

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_EHR_DATA.ipynb)




# **De-identify Structured Data**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.



## 1. Colab Setup

Import license keys

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)
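
The later cells read `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION` from the uploaded file, so a missing entry only surfaces as a failed `pip install` or session start. A minimal sketch of an early check follows. The helper name `missing_license_keys` and the example dictionary are hypothetical and not part of Spark NLP.

```python
import json

# Keys that the cells in this notebook read later on.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(keys: dict) -> list:
    """Return the required keys that are absent or empty in `keys`."""
    return [name for name in REQUIRED_KEYS if not keys.get(name)]

# Hypothetical file content, for illustration only.
example = json.loads('{"SECRET": "xxx", "PUBLIC_VERSION": "3.3.4"}')
print(missing_license_keys(example))  # ['JSL_VERSION']
```

Calling `missing_license_keys(license_keys)` right after the upload cell and stopping when the result is non-empty gives a clearer error than a failed install.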

Install dependencies

In [2]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

Import dependencies into Python and start the Spark session

In [3]:
import json
import os
import pandas as pd

from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 3.3.4
Spark NLP_JSL Version : 3.3.4


In [4]:
# Optional: build the Spark session manually if you want custom parameters,
# instead of calling sparknlp_jsl.start() as in the cell above
from pyspark.sql import SparkSession

def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")
      
    return builder.getOrCreate()

#spark = start(SECRET)

## 2. Download Structured PHI Data and Create a `DataFrame`

In [5]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/hipaa-table-001.txt

In [6]:
df = spark.read.format("csv") \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("hipaa-table-001.txt")

df = df.withColumnRenamed("PATIENT","NAME")
df.show(truncate=False)

+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|NAME           |DOB       |AGE|ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|Cecilia Chapman|04/02/1935|83 |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|Iris Watson    |03/10/2009|9  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|Bryar Pitts    |11/01/1921|98 |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|Theodore Lowe  |13/02/2002|16 |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|Calista Wise   |20/08/1942|76 |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|Kyla Olsen     |12/05/1973|45 |Ap #651-8679 Sodales Av. Tamunin

## 3. De-identify using Obfuscation Method

In [7]:
from sparknlp_jsl.structured_deidentification import StructuredDeidentification

We will obfuscate the `NAME` column as `PATIENT`, the `AGE` column as `AGE`, and the `TEL` column as `PHONE`.

In [8]:
obfuscator = StructuredDeidentification(spark,{"NAME" : "PATIENT", "AGE" : "AGE", "TEL" : "PHONE"}, obfuscateRefSource='faker')
obfuscator_df = obfuscator.obfuscateColumns(df)
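
The dictionary passed above maps each column name to the entity label that decides which kind of fake value replaces it. The idea can be illustrated in plain Python. This is a conceptual sketch only, not how `StructuredDeidentification` is implemented; `obfuscate_rows`, `fake_pool`, and the sample values are hypothetical.

```python
import random

def obfuscate_rows(rows, column_labels, fake_pool, seed=None):
    """Replace values in mapped columns with random fakes for the column's label.

    rows          : list of dicts, one per record
    column_labels : column name -> entity label, e.g. {"NAME": "PATIENT"}
    fake_pool     : entity label -> list of candidate fake values (hypothetical)
    """
    rng = random.Random(seed)
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for column, label in column_labels.items():
            if column in new_row:
                new_row[column] = rng.choice(fake_pool[label])
        result.append(new_row)
    return result

rows = [{"NAME": "Cecilia Chapman", "AGE": 83, "SBP": 101}]
column_labels = {"NAME": "PATIENT", "AGE": "AGE"}
fake_pool = {"PATIENT": ["Jane Roe", "John Doe"], "AGE": [30, 45, 60]}

masked = obfuscate_rows(rows, column_labels, fake_pool, seed=0)
print(masked[0]["SBP"])  # 101, unmapped columns pass through unchanged
```

Columns that are not in the mapping, such as `SBP` here, are left as they are, which matches the notebook's approach of listing only the PHI columns.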

In [9]:
obfuscator_df.select("NAME", "AGE", "TEL").show(truncate=False)

+-------------------+----+----------------+
|NAME               |AGE |TEL             |
+-------------------+----+----------------+
|[Daphne Justin]    |[63]|[0356 4886796]  |
|[Rojean Coupe]     |[65]|[(029) 7003-275]|
|[Jolie Joy]        |[22]|[201-948-1927]  |
|[Merlene Gory]     |[6] |[793 358 347]   |
|[Ardell Score]     |[94]|[051-716-212]   |
|[Shirleyann Nao]   |[31]|[085 955 7196]  |
|[Arnett Huntington]|[33]|[96 569629]     |
|[Rosette Barthel]  |[67]|[051-897-728]   |
|[Mela Maker]       |[49]|[71 507 332]    |
|[Gaynell Bunde]    |[41]|[02.74.68.06.67]|
|[Hugh Rede]        |[20]|[031-874-377]   |
|[Alfonso Lent]     |[50]|[614 5454]      |
|[Eddie Mooring]    |[65]|[0680 390 92 93]|
|[Emi An]           |[22]|[799 0260]      |
|[Kelsey Shim]      |[63]|[779 975 396]   |
|[Tomie Pauling]    |[67]|[34 33 96]      |
|[Thena Gully]      |[19]|[0480 49 24 35] |
|[Anabel Maya]      |[50]|[032 304 86 43] |
|[Jodie Sanders]    |[25]|[0681 910 00 64]|
|[Cesar Kennel]     |[33]|[0320-

The annotator does not have fake `DATE` chunks by default, so we supply them manually. We create a `faker` dictionary file whose entries carry the `DATE` label, map the `DOB` column to `DATE`, and then obfuscate `DOB` along with the other columns.

In [10]:
obfuscator_unique_ref_test = '''2022-11-1#DATE
2033-10-30#DATE
2011-8-22#DATE
2005-11-1#DATE
2008-10-30#DATE
2044-8-22#DATE
2022-04-1#DATE
2033-05-30#DATE
2011-09-22#DATE
2005-12-1#DATE
2008-02-30#DATE
2044-03-22#DATE
2055-11-1#DATE
2066-10-30#DATE
2077-8-22#DATE
2088-11-1#DATE
2099-10-30#DATE
2100-8-22#DATE
2111-04-1#DATE
2122-05-30#DATE
2133-09-22#DATE
2144-12-1#DATE
2155-02-30#DATE
2166-03-22#DATE'''

with open('obfuscator_unique_ref_test.txt', 'w') as f:
  f.write(obfuscator_unique_ref_test)
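
The reference file above holds one `value#LABEL` entry per line. A small sketch for inspecting such content before handing it to the obfuscator is shown below. The helpers `parse_ref_lines` and `invalid_dates` are hypothetical, and the `%Y-%m-%d` format is an assumption based on the entries in the list. Note that the list contains `2008-02-30` and `2155-02-30`, which are not calendar dates; the notebook does not say whether that matters to the obfuscator, so the check is purely informational.

```python
from collections import defaultdict
from datetime import datetime

def parse_ref_lines(text):
    """Group `value#LABEL` lines into {LABEL: [values]}, skipping blank lines."""
    by_label = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '#' so values containing '#' stay intact.
        value, label = line.rsplit("#", 1)
        by_label[label].append(value)
    return dict(by_label)

def invalid_dates(values, fmt="%Y-%m-%d"):
    """Return the values that are not real calendar dates in the given format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad

sample = "2022-11-1#DATE\n2008-02-30#DATE\n2044-03-22#DATE"
refs = parse_ref_lines(sample)
print(refs["DATE"])                 # ['2022-11-1', '2008-02-30', '2044-03-22']
print(invalid_dates(refs["DATE"]))  # ['2008-02-30']
```

Passing `obfuscator_unique_ref_test` instead of `sample` runs the same check over the full list.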

In [None]:
obfuscator = StructuredDeidentification(spark,{"NAME":"PATIENT","AGE":"AGE", "DOB":"DATE", "TEL":"PHONE"}, obfuscateRefFile="/content/obfuscator_unique_ref_test.txt")
obfuscator_df = obfuscator.obfuscateColumns(df)


In [None]:
obfuscator_df.select("NAME", "DOB", "AGE", "TEL").show(truncate=False)

+------------------+------------+----+----------------+
|NAME              |DOB         |AGE |TEL             |
+------------------+------------+----+----------------+
|[Alaina Pavlov]   |[2099-10-30]|[24]|[0473 75 74 88] |
|[Theron Blend]    |[2111-04-1] |[73]|[814-297-1404]  |
|[Doug Ratel]      |[2099-10-30]|[62]|[699 792 243]   |
|[Baker Ano]       |[2066-10-30]|[63]|[0356 7386222]  |
|[Arlene Pin]      |[2005-11-1] |[55]|[09704 13 48 83]|
|[Steva Infante]   |[2044-03-22]|[35]|[78 577 036]    |
|[Darlin Curly]    |[2005-12-1] |[6] |[444 14 907]    |
|[Ailene Dumas]    |[2122-05-30]|[76]|[30-88-20-94]   |
|[Peder Speck]     |[2133-09-22]|[23]|[448 8003]      |
|[Kenn Rater]      |[2033-10-30]|[39]|[24 102884]     |
|[Gaines Colander] |[2022-04-1] |[56]|[250-293-9941]  |
|[Lillard Likens]  |[2144-12-1] |[59]|[0613-9040212]  |
|[Mallissa Cooler] |[2005-12-1] |[73]|[(02) 6765 9044]|
|[Winthrop Matter] |[2088-11-1] |[62]|[95 698307]     |
|[Jenne Czar]      |[2166-03-22]|[76]|[(022) 375