

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_EHR_DATA.ipynb)




# **De-identify Structured Data**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do so. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.



## 1. Colab Setup

Import license keys

In [1]:
import os
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

sparknlp_version = license_keys["PUBLIC_VERSION"]
jsl_version = license_keys["JSL_VERSION"]

print('SparkNLP Version:', sparknlp_version)
print('SparkNLP-JSL Version:', jsl_version)

Saving spark_nlp_for_healthcare.json to spark_nlp_for_healthcare.json
SparkNLP Version: 3.1.1
SparkNLP-JSL Version: 3.1.1
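
If `license_keys.json` was uploaded through the file explorer rather than through `files.upload()`, the keys can be read straight from disk. The following is a minimal sketch, assuming the file sits in the working directory. `load_license_keys` and `REQUIRED_KEYS` are hypothetical names introduced here, and the required-key list only reflects the keys this notebook reads (`PUBLIC_VERSION`, `JSL_VERSION`, `SECRET`).

```python
import json

# Keys this notebook reads later on; the real license file may hold more.
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")


def load_license_keys(path="license_keys.json"):
    """Read a license JSON file and check that the keys used below exist."""
    with open(path) as f:
        keys = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in keys]
    if missing:
        raise KeyError(f"license file is missing: {', '.join(missing)}")
    return keys
```

Calling `license_keys = load_license_keys("license_keys.json")` would then stand in for the upload step; failing early on a missing key is a design choice made for this sketch, not something the library requires.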


Install dependencies

In [2]:
%%capture
for k, v in license_keys.items():
    %set_env $k=$v

!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jsl_colab_setup.sh
!bash jsl_colab_setup.sh
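
Outside a notebook the `%set_env` magic is not available. As a rough equivalent, assuming the keys only need to become environment variables of the current process, `os.environ` can be filled directly. `export_license_keys` is a hypothetical helper name, not part of Spark NLP.

```python
import os


def export_license_keys(license_keys):
    """Copy every license entry into the process environment as a string."""
    for name, value in license_keys.items():
        # Environment values must be strings, so other types are converted.
        os.environ[name] = str(value)
```

Processes started afterwards from the same Python session inherit these variables.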

Import dependencies into Python

In [3]:
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from tabulate import tabulate
import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl

Start the Spark session

In [4]:
spark = sparknlp_jsl.start(license_keys['SECRET'])

# Alternatively, start the session manually with custom Spark settings:
# params = {"spark.driver.memory": "16G",
#           "spark.kryoserializer.buffer.max": "2000M",
#           "spark.driver.maxResultSize": "2000M"}
#
# spark = sparknlp_jsl.start(license_keys['SECRET'], params=params)

## 2. Download Structured PHI Data and Create a `DataFrame`

In [6]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/hipaa-table-001.txt

In [7]:
df = spark.read.format("csv") \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("hipaa-table-001.txt")

df = df.withColumnRenamed("PATIENT","NAME")
df.show(truncate=False)

+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|NAME           |DOB       |AGE|ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|Cecilia Chapman|04/02/1935|83 |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|Iris Watson    |03/10/2009|9  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|Bryar Pitts    |11/01/1921|98 |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|Theodore Lowe  |13/02/2002|16 |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|Calista Wise   |20/08/1942|76 |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|Kyla Olsen     |12/05/1973|45 |Ap #651-8679 Sodales Av. Tamunin

## 3. De-identify using Obfuscation Method

In [8]:
from sparknlp_jsl.structured_deidentification import StructuredDeidentification

We will obfuscate the `NAME` column as `PATIENT`, the `AGE` column as `AGE`, and the `TEL` column as `PHONE`.

In [None]:
obfuscator = StructuredDeidentification(spark, {"NAME": "PATIENT", "AGE": "AGE", "TEL": "PHONE"}, obfuscateRefSource='faker')
obfuscator_df = obfuscator.obfuscateColumns(df)
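
To make the column-to-label mapping concrete, here is a plain-Python sketch of the idea: each mapped column is replaced by a fake value drawn for its label, and unmapped columns pass through unchanged. This is not how `StructuredDeidentification` is implemented. `FAKE_VALUES`, `obfuscate_rows`, and the sample fake values are all made up for illustration.

```python
import random

# Hypothetical fake values per entity label (illustration only).
FAKE_VALUES = {
    "PATIENT": ["Alex Doe", "Sam Roe", "Kim Poe"],
    "AGE": ["34", "57", "71"],
    "PHONE": ["555-0100", "555-0101", "555-0102"],
}


def obfuscate_rows(rows, column_labels, seed=None):
    """Return copies of rows with mapped columns replaced by fake values."""
    rng = random.Random(seed)
    result = []
    for row in rows:
        new_row = dict(row)  # unmapped columns are copied unchanged
        for column, label in column_labels.items():
            if column in new_row:
                new_row[column] = rng.choice(FAKE_VALUES[label])
        result.append(new_row)
    return result


rows = [{"NAME": "Cecilia Chapman", "AGE": "83", "TEL": "(257) 563-7401", "SBP": "101"}]
mapping = {"NAME": "PATIENT", "AGE": "AGE", "TEL": "PHONE"}
print(obfuscate_rows(rows, mapping, seed=0))
```

The clinical measurement `SBP` is left untouched because it is not in the mapping, which mirrors how only the listed columns are selected for obfuscation in the cell above.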

In [15]:
obfuscator_df.select("NAME", "AGE", "TEL").show(truncate=False)

+------------------+----+-----------------+
|NAME              |AGE |TEL              |
+------------------+----+-----------------+
|[Kandyce Buckles] |[81]|[736 990 188]    |
|[Felton Awe]      |[40]|[21 258 717 9805]|
|[Helen Maidens]   |[65]|[659 625 303]    |
|[Oren Blowers]    |[3] |[(07) 4550 6193] |
|[Jannelle Havers] |[42]|[488 73 877]     |
|[Ladonna Donath]  |[81]|[078 4260 1147]  |
|[Leavy Stair]     |[69]|[412 5473]       |
|[Cloyde Cowper]   |[75]|[21 226 725 6159]|
|[Ouida Abrahams]  |[53]|[97 895809]      |
|[Laurey Oregon]   |[16]|[22 525335]      |
|[Ellis Raya]      |[66]|[97 895809]      |
|[Loistine Chute]  |[36]|[085 728 5749]   |
|[Heywood Pinon]   |[40]|[(02) 6710 7447] |
|[Suzon Ion]       |[65]|[0359 1465851]   |
|[Arley Lope]      |[76]|[018-8430091]    |
|[Roberta Pancoast]|[75]|[08273 74 50 77] |
|[Mallie Peels]    |[47]|[722 778 902]    |
|[Aurelia Hahn]    |[35]|[734 605 387]    |
|[Higinio Cowden]  |[62]|[052 988 99 51]  |
|[Katie Oliphant]  |[30]|[95 117

The annotator does not provide fake `DATE` values by default, so we supply them manually. We create a custom reference dictionary (one `value#LABEL` entry per line) that lists fake dates under the `DATE` label, then obfuscate the `DOB` column along with the others.

In [23]:
obfuscator_unique_ref_test = '''2022-11-1#DATE
2033-10-30#DATE
2011-8-22#DATE
2005-11-1#DATE
2008-10-30#DATE
2044-8-22#DATE
2022-04-1#DATE
2033-05-30#DATE
2011-09-22#DATE
2005-12-1#DATE
2008-02-30#DATE
2044-03-22#DATE
2055-11-1#DATE
2066-10-30#DATE
2077-8-22#DATE
2088-11-1#DATE
2099-10-30#DATE
2100-8-22#DATE
2111-04-1#DATE
2122-05-30#DATE
2133-09-22#DATE
2144-12-1#DATE
2155-02-30#DATE
2166-03-22#DATE'''

# Save the reference entries to a file the obfuscator can read
with open('obfuscator_unique_ref_test.txt', 'w') as f:
    f.write(obfuscator_unique_ref_test)
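
Before handing the reference file to the obfuscator, it can help to confirm that every `DATE` entry is a real calendar date; the list above, for instance, includes `2008-02-30` and `2155-02-30`. The sketch below assumes the `value#LABEL` line format and the year-month-day layout used in this cell. `find_invalid_dates` is a hypothetical helper and not part of Spark NLP.

```python
from datetime import datetime


def find_invalid_dates(ref_text, label="DATE", fmt="%Y-%m-%d"):
    """Return the values tagged with `label` that do not parse with `fmt`."""
    invalid = []
    for line in ref_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '#': everything before it is the value.
        value, _, line_label = line.rpartition("#")
        if line_label != label:
            continue
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            invalid.append(value)
    return invalid


sample = "2022-11-1#DATE\n2008-02-30#DATE\n2044-03-22#DATE"
print(find_invalid_dates(sample))
```

In the notebook, `find_invalid_dates(obfuscator_unique_ref_test)` would check the full list in one call.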

In [None]:
obfuscator = StructuredDeidentification(spark, {"NAME": "PATIENT", "AGE": "AGE", "DOB": "DATE", "TEL": "PHONE"}, obfuscateRefFile="/content/obfuscator_unique_ref_test.txt")
obfuscator_df = obfuscator.obfuscateColumns(df)


In [26]:
obfuscator_df.select("NAME", "DOB", "AGE", "TEL").show(truncate=False)

+------------------+------------+----+----------------+
|NAME              |DOB         |AGE |TEL             |
+------------------+------------+----+----------------+
|[Alaina Pavlov]   |[2099-10-30]|[24]|[0473 75 74 88] |
|[Theron Blend]    |[2111-04-1] |[73]|[814-297-1404]  |
|[Doug Ratel]      |[2099-10-30]|[62]|[699 792 243]   |
|[Baker Ano]       |[2066-10-30]|[63]|[0356 7386222]  |
|[Arlene Pin]      |[2005-11-1] |[55]|[09704 13 48 83]|
|[Steva Infante]   |[2044-03-22]|[35]|[78 577 036]    |
|[Darlin Curly]    |[2005-12-1] |[6] |[444 14 907]    |
|[Ailene Dumas]    |[2122-05-30]|[76]|[30-88-20-94]   |
|[Peder Speck]     |[2133-09-22]|[23]|[448 8003]      |
|[Kenn Rater]      |[2033-10-30]|[39]|[24 102884]     |
|[Gaines Colander] |[2022-04-1] |[56]|[250-293-9941]  |
|[Lillard Likens]  |[2144-12-1] |[59]|[0613-9040212]  |
|[Mallissa Cooler] |[2005-12-1] |[73]|[(02) 6765 9044]|
|[Winthrop Matter] |[2088-11-1] |[62]|[95 698307]     |
|[Jenne Czar]      |[2166-03-22]|[76]|[(022) 375