

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare_jsl/DEID_EHR_DATA.ipynb)




# **De-identify Structured Data**

To run this notebook yourself, you need to provide your license keys. Run the upload cell below, or open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.



## 1. Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print("Please Upload your John Snow Labs License using the button below")
license_keys = files.upload()

In [None]:
from johnsnowlabs import *

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM.
# Restart the notebook afterwards for the changes to take effect.

jsl.install()

## Start Session

In [None]:
from johnsnowlabs import *
# Automatically load the license data and start a session with all the JARs the user has access to
spark = jsl.start()

## 2. Download Structured PHI Data and Create a `DataFrame`

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/hipaa-table-001.txt

In [None]:
df = spark.read.format("csv") \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("hipaa-table-001.txt")

df = df.withColumnRenamed("PATIENT","NAME")
df.show(truncate=False)

+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|NAME           |DOB       |AGE|ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|Cecilia Chapman|04/02/1935|83 |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|Iris Watson    |03/10/2009|9  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|Bryar Pitts    |11/01/1921|98 |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|Theodore Lowe  |13/02/2002|16 |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|Calista Wise   |20/08/1942|76 |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|Kyla Olsen     |12/05/1973|45 |Ap #651-8679 Sodales Av. Tamunin

## 3. De-identify using Obfuscation Method

In [None]:
from sparknlp_jsl.structured_deidentification import StructuredDeidentification

We will obfuscate the `NAME` column as `PATIENT`, the `AGE` column as `AGE`, and the `TEL` column as `PHONE`.

When a column holds dates, the `days` parameter of the structured de-identification shifts each date by the given number of days.
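To make the idea of a day shift concrete, here is a minimal standalone sketch in plain Python. It is not the library's implementation: the helper name `shift_date` is hypothetical, and the `dd/MM/yyyy` format is an assumption based on the `DOB` values shown in the table above.

```python
from datetime import datetime, timedelta

def shift_date(value: str, days: int, fmt: str = "%d/%m/%Y") -> str:
    """Return `value` moved by `days` days, keeping the same text format.

    Hypothetical helper for illustration only; the format is assumed
    from the DOB column of the sample table.
    """
    parsed = datetime.strptime(value, fmt)
    return (parsed + timedelta(days=days)).strftime(fmt)

print(shift_date("04/02/1935", 5))  # 09/02/1935
print(shift_date("28/12/1999", 5))  # 02/01/2000
```

Shifting every date by the same offset hides the real dates while keeping the intervals between them intact.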

In [None]:
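obfuscator = StructuredDeidentification(
    spark,
    {"NAME": "PATIENT", "AGE": "AGE", "TEL": "PHONE"},  # column name -> entity label
    obfuscateRefSource="faker",           # take replacement values from the faker source
    columnsSeed={"NAME": 23, "DOB": 23},  # per-column seeds
    days=5,                               # day shift applied to date columns
)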

obfuscator_df = obfuscator.obfuscateColumns(df)
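The `columnsSeed` argument assigns a seed to each column. How the library uses the seed internally is not shown in this notebook, so the sketch below only illustrates the general idea of seeding with Python's standard `random` module: the same seed always produces the same pick. The `pick_fake` helper and the small name pool are hypothetical.

```python
import random
from typing import Sequence

# Small illustrative pool; the names are taken from the sample output in this notebook.
FAKE_NAMES = ["Hayden Hem", "Samual Burgess", "Amadeo Bowels", "Florence Byars"]

def pick_fake(seed: int, pool: Sequence[str]) -> str:
    """Pick one replacement value using a generator seeded with `seed`."""
    rng = random.Random(seed)  # a private generator, independent of global state
    return rng.choice(pool)

first = pick_fake(23, FAKE_NAMES)
second = pick_fake(23, FAKE_NAMES)
print(first == second)  # True
```

A fixed seed makes the obfuscation repeatable between runs, which helps when results need to be compared or audited.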

In [None]:
obfuscator_df.select("NAME", "AGE", "TEL").show(truncate=False)

+------------------+----+----------------+
|NAME              |AGE |TEL             |
+------------------+----+----------------+
|[Hayden Hem]      |[8] |[(02) 4075 2559]|
|[Samual Burgess]  |[55]|[0340 7664965]  |
|[Amadeo Bowels]   |[10]|[046 862 3507]  |
|[Florence Byars]  |[79]|[419 07 976]    |
|[March Bacon]     |[0] |[051-414-821]   |
|[Alvie Galla]     |[24]|[0377 7151585]  |
|[Novella Calkins] |[89]|[078 6490 4813] |
|[Lorri Gala]      |[20]|[786 6883]      |
|[Diannia Bear]    |[35]|[041 056 7791]  |
|[Keary Cirri]     |[54]|[051-997-142]   |
|[Marin Gee]       |[71]|[781 5991]      |
|[Mee Humphreys]   |[92]|[0914-6873026]  |
|[Katie Gravel]    |[55]|[30-11-62-95]   |
|[Florida Marvel]  |[10]|[042 159 1988]  |
|[Verdene Crandall]|[80]|[97 155350]     |
|[Gwendolynn Sims] |[20]|[67 788 02 83]  |
|[Leopold Garner]  |[33]|[9691 2587]     |
|[Lorenzo Pickler] |[53]|[440-701-6027]  |
|[Kaye East]       |[73]|[(95) 322-829]  |
|[Landy Favor]     |[39]|[34 13 74]      |
+----------

The annotator does not provide fake `DATE` chunks by default, so we supply them manually. We create a custom dictionary file that maps fake values to the `DATE` label, then obfuscate the `DOB` column as well.

In [None]:
obfuscator_unique_ref_test = '''2022-11-1#DATE
2033-10-30#DATE
2011-8-22#DATE
2005-11-1#DATE
2008-10-30#DATE
2044-8-22#DATE
2022-04-1#DATE
2033-05-30#DATE
2011-09-22#DATE
2005-12-1#DATE
2008-02-30#DATE
2044-03-22#DATE
2055-11-1#DATE
2066-10-30#DATE
2077-8-22#DATE
2088-11-1#DATE
2099-10-30#DATE
2100-8-22#DATE
2111-04-1#DATE
2122-05-30#DATE
2133-09-22#DATE
2144-12-1#DATE
2155-02-30#DATE
2166-03-22#DATE'''

with open('obfuscator_unique_ref_test.txt', 'w') as f:
  f.write(obfuscator_unique_ref_test)
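Each line of the reference file has the form `value#LABEL`. Some of the values above, such as `2008-02-30`, are not real calendar dates. Whether the annotator rejects such values is not covered here, so the following is only a hedged sketch of a pre-check in plain Python. The function name `check_ref_lines` is hypothetical, and the `yyyy-MM-dd` format is assumed from the entries above.

```python
from datetime import datetime

def check_ref_lines(text: str, fmt: str = "%Y-%m-%d") -> list:
    """Return the DATE values in `value#LABEL` lines that are not valid calendar dates."""
    invalid = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '#' so the label is always the final field.
        value, _, label = line.rpartition("#")
        if label != "DATE":
            continue
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            invalid.append(value)
    return invalid

sample = "2022-11-1#DATE\n2008-02-30#DATE\n2044-03-22#DATE"
print(check_ref_lines(sample))  # ['2008-02-30']
```

Passing `obfuscator_unique_ref_test` to the same function lists every entry in the dictionary that would fail to parse as a date.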

In [None]:
obfuscator = StructuredDeidentification(
    spark,
    {"NAME": "PATIENT", "AGE": "AGE", "DOB": "DATE", "TEL": "PHONE"},
    obfuscateRefFile="/content/obfuscator_unique_ref_test.txt",
)
obfuscator_df = obfuscator.obfuscateColumns(df)


In [None]:
obfuscator_df.select("NAME", "DOB", "AGE", "TEL").show(truncate=False)

+-------------------+------------+----+----------------+
|NAME               |DOB         |AGE |TEL             |
+-------------------+------------+----+----------------+
|[Mardell Catalina] |[18/02/1935]|[31]|[085 773 3897]  |
|[Michela Cowper]   |[15/11/2009]|[72]|[958 08 922]    |
|[Oleh Ferron]      |[18/02/1921]|[31]|[031-276-790]   |
|[Magdaleno Barrios]|[21/02/2002]|[76]|[0328 0152557]  |
|[Aura Peaches]     |[08/10/1942]|[44]|[75 713 369]    |
|[Kriss Plant]      |[16/05/1973]|[41]|[306-323-0194]  |
|[Reggy Can]        |[07/02/1991]|[69]|[31 62 12]      |
|[Adelia Linear]    |[11/01/1938]|[84]|[601 189 903]   |
|[Patton Davenport] |[31/05/1980]|[24]|[(32) 8164-5542]|
|[Edwardo Court]    |[14/10/1956]|[21]|[26 590425]     |
|[Quillian Custard] |[10/02/1907]|[4] |[081 207 11 16] |
|[Lavera Sandhoff]  |[16/12/1983]|[85]|[(19) 9327-6940]|
|[Jesse Brew]       |[08/10/2009]|[72]|[72 350 932]    |
|[Resa Rowels]      |[27/11/1920]|[31]|[04.44.34.41.79]|
|[Jaycee Dess]      |[21/09/191