

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_EHR_DATA.ipynb)




# **De-identify Structured Data**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do so, or open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.



## 1. Colab Setup

Import license keys

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)
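
Outside Colab, `files.upload()` is not available. The following is a minimal sketch of the same step for a local run, assuming `license_keys.json` sits in the working directory. The helper name `load_license_keys` is hypothetical and not part of any library.

```python
import json
import os


def load_license_keys(path="license_keys.json"):
    """Read the license JSON and export every entry as an environment variable.

    Hypothetical helper for running the notebook outside Colab.
    """
    with open(path) as f:
        keys = json.load(f)
    # os.environ only accepts strings, so cast each value defensively
    os.environ.update({k: str(v) for k, v in keys.items()})
    return keys


# Only load when the file is actually present in the working directory
if os.path.exists("license_keys.json"):
    license_keys = load_license_keys()
```

The returned dictionary can be used exactly like `license_keys` in the cell above, for example `license_keys['SECRET']`.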

Install dependencies

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

Import dependencies into Python

In [3]:
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from tabulate import tabulate
import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl

Start the Spark session

In [4]:
# manually start session
# params = {"spark.driver.memory" : "16G",
#           "spark.kryoserializer.buffer.max" : "2000M",
#           "spark.driver.maxResultSize" : "2000M"}

# spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'])

spark

Spark NLP Version : 3.4.0
Spark NLP_JSL Version : 3.4.0


## 2. Download Structured PHI Data and Create a `DataFrame`

In [5]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/hipaa-table-001.txt

In [6]:
df = spark.read.format("csv") \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("hipaa-table-001.txt")

df = df.withColumnRenamed("PATIENT","NAME")
df.show(truncate=False)

+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|NAME           |DOB       |AGE|ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|Cecilia Chapman|04/02/1935|83 |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|Iris Watson    |03/10/2009|9  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|Bryar Pitts    |11/01/1921|98 |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|Theodore Lowe  |13/02/2002|16 |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|Calista Wise   |20/08/1942|76 |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|Kyla Olsen     |12/05/1973|45 |Ap #651-8679 Sodales Av. Tamunin

## 3. De-identify using Obfuscation Method

In [7]:
from sparknlp_jsl.structured_deidentification import StructuredDeidentification

We will obfuscate the `NAME` column as `PATIENT`, the `AGE` column as `AGE`, and the `TEL` column as `PHONE`.

When a column holds dates, the `days` parameter of the structured de-identification shifts each date by the given number of days.
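
To make the idea of a day shift concrete, here is a plain-Python sketch that moves a single date forward by a fixed number of days. It only illustrates the concept and is not how the library implements it. The function name `shift_date` is hypothetical, and the `dd/MM/yyyy` format is an assumption based on values such as `13/02/2002` in the `DOB` column.

```python
from datetime import datetime, timedelta


def shift_date(value, days, fmt="%d/%m/%Y"):
    """Shift a date string by `days` days and return it in the same format.

    Illustrative only: assumes day-first dates like those in the DOB column.
    """
    parsed = datetime.strptime(value, fmt)
    return (parsed + timedelta(days=days)).strftime(fmt)


shifted = shift_date("04/02/1935", 5)
print(shifted)  # 09/02/1935
```

Shifting every date of a record by the same offset keeps the intervals between events intact while hiding the real dates.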

In [8]:
obfuscator = StructuredDeidentification(spark,{"NAME" : "PATIENT", "AGE" : "AGE", "TEL" : "PHONE"}, 
                                        obfuscateRefSource='faker', 
                                        columnsSeed={"NAME": 23, "DOB": 23},
                                        days=5)

obfuscator_df = obfuscator.obfuscateColumns(df)

In [9]:
obfuscator_df.select("NAME", "AGE", "TEL").show(truncate=False)

+--------------------+----+----------------+
|NAME                |AGE |TEL             |
+--------------------+----+----------------+
|[Teofilo Dynes]     |[31]|[(029) 7003-275]|
|[Osa Forget]        |[5] |[046 406 6044]  |
|[Maryland Drown]    |[19]|[493 68 821]    |
|[Kristopher Mattock]|[80]|[439 8853]      |
|[Myron Milliner]    |[29]|[31-70-28-28]   |
|[Sabina Merles]     |[36]|[03.48.72.77.73]|
|[Madalyn Ground]    |[5] |[077 4749 3624] |
|[David Age]         |[35]|[02.74.68.06.67]|
|[Darene Squibb]     |[1] |[0487 23 46 71] |
|[Barrington Coder]  |[18]|[081 850 53 42] |
|[Vonita Reasoner]   |[7] |[(11) 9998-3425]|
|[Collene Chime]     |[21]|[02.94.22.49.05]|
|[Marcelyn Saunas]   |[5] |[22 630958]     |
|[Alvis Bach]        |[19]|[453 2590]      |
|[Hallie Georgis]    |[12]|[78 427 062]    |
|[Marilyne Smiling]  |[35]|[083 985 2904]  |
|[Atlee Shearer]     |[32]|[032 433 92 68] |
|[Darrow Blocker]    |[6] |[06-90066747]   |
|[Janis Ghazi]       |[54]|[082 125 8210]  |
|[Shayne P

The annotator does not provide fake `DATE` values by default, so we supply them manually. We create a custom reference dictionary of fake dates, each labelled `DATE`, and then obfuscate the `DOB` column as `DATE` as well.

In [10]:
obfuscator_unique_ref_test = '''2022-11-1#DATE
2033-10-30#DATE
2011-8-22#DATE
2005-11-1#DATE
2008-10-30#DATE
2044-8-22#DATE
2022-04-1#DATE
2033-05-30#DATE
2011-09-22#DATE
2005-12-1#DATE
2008-02-30#DATE
2044-03-22#DATE
2055-11-1#DATE
2066-10-30#DATE
2077-8-22#DATE
2088-11-1#DATE
2099-10-30#DATE
2100-8-22#DATE
2111-04-1#DATE
2122-05-30#DATE
2133-09-22#DATE
2144-12-1#DATE
2155-02-30#DATE
2166-03-22#DATE'''

with open('obfuscator_unique_ref_test.txt', 'w') as f:
    f.write(obfuscator_unique_ref_test)
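
Each line of the reference file above follows the pattern `value#LABEL`. As an alternative to typing the entries by hand, the sketch below generates them with `datetime`, which also guarantees that every entry is a real calendar date (hand-written values such as `2008-02-30` are not). The helper `build_ref_entries` and its parameters are hypothetical, and the zero-padded ISO output is an assumption about what the reference file accepts, based on entries like `2011-09-22` in the list above.

```python
from datetime import date, timedelta


def build_ref_entries(start, count, step_days, label="DATE"):
    """Return `count` lines of the form 'YYYY-MM-DD#LABEL', `step_days` apart.

    Hypothetical helper; the 'value#LABEL' layout mirrors the file written above.
    """
    entries = []
    for i in range(count):
        d = start + timedelta(days=i * step_days)
        entries.append(f"{d.isoformat()}#{label}")
    return entries


ref_entries = build_ref_entries(date(2022, 11, 1), count=5, step_days=30)
ref_text = "\n".join(ref_entries)
print(ref_entries[0])  # 2022-11-01#DATE
```

`ref_text` can then be written to a file in the same way as `obfuscator_unique_ref_test`.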

In [11]:
obfuscator = StructuredDeidentification(spark,
                                        {"NAME": "PATIENT", "AGE": "AGE", "DOB": "DATE", "TEL": "PHONE"},
                                        obfuscateRefFile="/content/obfuscator_unique_ref_test.txt")
obfuscator_df = obfuscator.obfuscateColumns(df)


In [12]:
obfuscator_df.select("NAME", "DOB", "AGE", "TEL").show(truncate=False)

+------------------+------------+----+----------------+
|NAME              |DOB         |AGE |TEL             |
+------------------+------------+----+----------------+
|[Aneta Hobby]     |[01/03/1935]|[31]|[085 773 3897]  |
|[Sherley Mccreedy]|[01/12/2009]|[72]|[958 08 922]    |
|[Bret Penta]      |[22/02/1921]|[31]|[031-276-790]   |
|[Marian Davenport]|[14/04/2002]|[76]|[0328 0152557]  |
|[Linnette Box]    |[08/09/1942]|[44]|[75 713 369]    |
|[Oliver Chaco]    |[15/05/1973]|[41]|[306-323-0194]  |
|[Lindsey Sandifer]|[22/01/1991]|[69]|[31 62 12]      |
|[Cain Mari]       |[06/12/1937]|[84]|[601 189 903]   |
|[Anselmo Rasher]  |[25/05/1980]|[24]|[(32) 8164-5542]|
|[Eli Economy]     |[15/11/1956]|[21]|[26 590425]     |
|[Denna Ghazi]     |[30/01/1907]|[4] |[081 207 11 16] |
|[Willodean Dunker]|[31/10/1983]|[85]|[(19) 9327-6940]|
|[Ranee Im]        |[20/10/2009]|[72]|[72 350 932]    |
|[Edmond Champagne]|[13/11/1920]|[31]|[04.44.34.41.79]|
|[Monta Rick]      |[28/08/1911]|[24]|[32 56 02]