![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.8.Clinical_Deidentification_for_Structured_Data.ipynb)


# Clinical Deidentification for Structured Data

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

In [None]:
# Download the MySQL connector JAR so it can be passed to Spark when the session starts
!wget https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/9.0.0/mysql-connector-j-9.0.0.jar

In [5]:
from johnsnowlabs import nlp, medical
import pandas as pd

spark = nlp.start(jar_paths = ["/content/mysql-connector-j-9.0.0.jar"])

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_9792 (3).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.5.2, 💊Spark-Healthcare==5.5.2, running on ⚡ PySpark==3.4.0


In [6]:
spark

## Getting the MYSQL jar connector

# Structured Deidentification for File-Based Data

In [7]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/hipaa-table-001.txt

df = spark.read.format("csv") \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("hipaa-table-001.txt")

df = df.withColumnRenamed("PATIENT","NAME")
df.show(truncate=False)

+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|NAME           |DOB       |AGE|ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|Cecilia Chapman|04/02/1935|83 |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|Iris Watson    |03/10/2009|9  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|Bryar Pitts    |11/01/1921|98 |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|Theodore Lowe  |13/02/2002|16 |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|Calista Wise   |20/08/1942|76 |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|Kyla Olsen     |12/05/1973|45 |Ap #651-8679 Sodales Av. Tamunin

In [8]:
obfuscator = medical.StructuredDeidentification(spark,{"NAME":"PATIENT","AGE":"AGE"}, obfuscateRefSource = "faker")
obfuscator_df = obfuscator.obfuscateColumns(df)
obfuscator_df.show(truncate=False)

+------------------+----------+-----+----------------------------------------------------+-------+--------------+---+---+
|NAME              |DOB       |AGE  |ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+------------------+----------+-----+----------------------------------------------------+-------+--------------+---+---+
|[Kirk Peper]      |04/02/1935|[99] |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|[Adah Hollering]  |03/10/2009|[7]  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|[Shelva Dice]     |11/01/1921|[83] |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|[Servando Danger] |13/02/2002|[15] |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|[Emilia Harbour]  |20/08/1942|[66] |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|[Genette Kent]    |12/0

In [9]:
obfuscator_unique_ref_test = '''Will Perry#PATIENT
John Smith#PATIENT
Marvin MARSHALL#PATIENT
Hubert GROGAN#PATIENT
ALTHEA COLBURN#PATIENT
Kalil AMIN#PATIENT
Inci FOUNTAIN#PATIENT
Jackson WILLE#PATIENT
Jack SANTOS#PATIENT
Mahmood ALBURN#PATIENT
Marnie MELINGTON#PATIENT
Aysha GHAZI#PATIENT
Maryland CODER#PATIENT
Darene GEORGIOUS#PATIENT
Shelly WELLBECK#PATIENT
Min Kun JAE#PATIENT
Thomson THOMAS#PATIENT
Christian SUDDINBURG#PATIENT
Aberdeen#CITY
Louisburg St#STREET
France#LOC
Nick Riviera#DOCTOR
5552312#PHONE
St James Hospital#HOSPITAL
Calle del Libertador#ADDRESS
111#ID
Will#DOCTOR
20#AGE
30#AGE
40#AGE
50#AGE
60#AGE
'''

with open('obfuscator_unique_ref_test.txt', 'w') as f:
  f.write(obfuscator_unique_ref_test)
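
Each line of the reference file pairs a replacement value with an entity label, separated by `#`. Before passing such a file to `obfuscateRefFile`, it can help to check how many candidates exist per label. The following is only an illustrative sketch in plain Python: `parse_ref_lines` is a hypothetical helper, not part of the library, and the sample string is a shortened copy of the file content.

```python
from collections import defaultdict

def parse_ref_lines(text):
    """Group 'value#ENTITY' lines by entity label (hypothetical helper)."""
    pools = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last '#' so values that contain '#' stay intact.
        value, sep, entity = line.rpartition("#")
        if not sep or not value or not entity:
            raise ValueError(f"malformed reference line: {line!r}")
        pools[entity].append(value)
    return dict(pools)

sample = "Will Perry#PATIENT\nJohn Smith#PATIENT\n20#AGE\n30#AGE\nAberdeen#CITY\n"
pools = parse_ref_lines(sample)
print({entity: len(values) for entity, values in pools.items()})
```

A label with very few candidates (for example only five `AGE` values) means many rows will necessarily share the same replacement.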

In [10]:
# Use the custom reference file instead of Faker: obfuscateRefSource = "file"

obfuscator = medical.StructuredDeidentification(spark,{"NAME":"PATIENT","AGE":"AGE"},
                                        obfuscateRefFile = "/content/obfuscator_unique_ref_test.txt",
                                        obfuscateRefSource = "file",
                                        columnsSeed={"NAME": 23, "AGE": 23})
obfuscator_df = obfuscator.obfuscateColumns(df)
obfuscator_df.select("NAME","AGE").show(truncate=False)

+----------------------+----+
|NAME                  |AGE |
+----------------------+----+
|[Christian SUDDINBURG]|[60]|
|[Christian SUDDINBURG]|[30]|
|[Thomson THOMAS]      |[30]|
|[Aysha GHAZI]         |[40]|
|[Jack SANTOS]         |[40]|
|[Mahmood ALBURN]      |[40]|
|[Jackson WILLE]       |[60]|
|[Maryland CODER]      |[60]|
|[Kalil AMIN]          |[60]|
|[Kalil AMIN]          |[20]|
|[Thomson THOMAS]      |[60]|
|[Kalil AMIN]          |[40]|
|[Mahmood ALBURN]      |[30]|
|[Darene GEORGIOUS]    |[30]|
|[Jack SANTOS]         |[30]|
|[Maryland CODER]      |[60]|
|[Darene GEORGIOUS]    |[50]|
|[Maryland CODER]      |[30]|
|[Mahmood ALBURN]      |[20]|
|[Thomson THOMAS]      |[20]|
+----------------------+----+
only showing top 20 rows
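
The output above repeats some replacement names because the reference file offers only a small pool of `PATIENT` values. The notebook does not show how the library uses `columnsSeed` internally, so the sketch below is not its algorithm; it only illustrates, under that assumption, how a seed can make the choice from a fixed pool reproducible. `pick_replacement` is a hypothetical helper.

```python
import hashlib

def pick_replacement(value, pool, seed):
    """Deterministically map `value` to one entry of `pool` for a given seed."""
    # Hash the seed together with the value so the choice is stable across runs.
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]

pool = ["Will Perry", "John Smith", "Kalil AMIN", "Jack SANTOS"]
first = pick_replacement("Cecilia Chapman", pool, seed=23)
second = pick_replacement("Cecilia Chapman", pool, seed=23)
print(first == second)  # True: same input and seed give the same replacement
```

With a pool smaller than the number of distinct inputs, different originals must map to the same replacement, which is consistent with the duplicates seen in the table.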



Date columns can be **shifted by n days** during structured deidentification by setting the `days` parameter.
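
As a rough illustration of what shifting by a fixed number of days means, here is a minimal plain-Python sketch. It assumes day/month/year strings such as `13/02/1977` and that a positive value moves dates forward; the library's actual parsing and output format may differ, and `shift_date` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def shift_date(value, days, fmt="%d/%m/%Y"):
    """Shift a date string by `days` and return it in the same format."""
    shifted = datetime.strptime(value, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)

print(shift_date("13/02/1977", 5))  # 18/02/1977
print(shift_date("23/02/1977", 5))  # 28/02/1977
print(shift_date("11/04/1900", 5))  # 16/04/1900
```

Because every date moves by the same offset, intervals between events for a patient are preserved while the real calendar dates are hidden.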

In [11]:
df = spark.createDataFrame([
            ["Juan García", "13/02/1977", "711 Nulla St.", "140", "673 431234"],
            ["Will Smith", "23/02/1977", "1 Green Avenue.", "140", "+23 (673) 431234"],
            ["Pedro Ximénez", "11/04/1900", "Calle del Libertador, 7", "100", "912 345623"]
        ]).toDF("NAME", "DOB", "ADDRESS", "SBP", "TEL")
df.show(truncate=False)

+-------------+----------+-----------------------+---+----------------+
|NAME         |DOB       |ADDRESS                |SBP|TEL             |
+-------------+----------+-----------------------+---+----------------+
|Juan García  |13/02/1977|711 Nulla St.          |140|673 431234      |
|Will Smith   |23/02/1977|1 Green Avenue.        |140|+23 (673) 431234|
|Pedro Ximénez|11/04/1900|Calle del Libertador, 7|100|912 345623      |
+-------------+----------+-----------------------+---+----------------+



In [12]:
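obfuscator = medical.StructuredDeidentification(spark=spark,
                                        columns={"NAME": "NAME", "DOB": "DATE"},
                                        columnsSeed={"NAME": 23, "DOB": 23},
                                        obfuscateRefSource="faker",
                                        days=5)

# Apply the obfuscation, as in the earlier cells, and inspect the shifted dates
obfuscator_df = obfuscator.obfuscateColumns(df)
obfuscator_df.show(truncate=False)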

# Structured Deidentification for Relational Database

Let's install the MySQL server dependencies.

In [13]:
!apt-get update -y
!apt-get install -y mysql-server
!pip install pyspark pymysql


Now let's start the MySQL service.

In [14]:
!service mysql start

 * Starting MySQL database server mysqld
   ...done.


Define the SQL commands to reset the root password


In [15]:
reset_password_sql = """
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'root';
FLUSH PRIVILEGES;
"""

# Write the SQL commands to a file
with open("reset_password.sql", "w") as file:
    file.write(reset_password_sql)

Restart the MySQL service so that the password change takes effect.

In [16]:
!service mysql stop
# Start MySQL with the init file in the background
import time

!nohup mysqld --init-file=/content/reset_password.sql > mysql_init.log 2>&1 &
time.sleep(10)

 * Stopping MySQL database server mysqld
   ...done.
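
A fixed `time.sleep(10)` may be too short or needlessly long depending on the machine. As an alternative, one could poll the server port until it accepts connections. The sketch below is a hedged, plain-Python example: `wait_for_port` is a hypothetical helper, and port 3306 is assumed because it is the MySQL default used in the JDBC URL later in this notebook.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, else False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not ready yet, retry shortly
    return False

# Example: call wait_for_port("localhost", 3306) before running any mysql commands.
```

An open port only shows that the server is listening; it does not prove that the init file has finished running.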


In [17]:
# Remove the init file
!rm /content/reset_password.sql
# Create the /nonexistent directory (-p avoids an error if the cell is re-run)
!mkdir -p /nonexistent
!service mysql start

 * Starting MySQL database server mysqld
   ...done.


Next, we create a clinical database to deidentify, called `healthcare_db`, with two tables, `patients` and `appointments`, linked by a foreign key on `patient_id`:


In [18]:
# Create the healthcare database
!mysql -u root -proot -e "CREATE DATABASE healthcare_db;"

# Create the patients table
!mysql -u root -proot -e "USE healthcare_db; CREATE TABLE patients ( \
    patient_id INT AUTO_INCREMENT PRIMARY KEY, \
    name VARCHAR(255) NOT NULL, \
    address TEXT, \
    ssn CHAR(11) UNIQUE NOT NULL, \
    email VARCHAR(255) UNIQUE NOT NULL, \
    dob DATE NOT NULL, \
    age INT NOT NULL \
);"

# Create the appointments table with a foreign key reference to patients
!mysql -u root -proot -e "USE healthcare_db; CREATE TABLE appointments ( \
    appointment_id INT AUTO_INCREMENT PRIMARY KEY, \
    patient_id INT NOT NULL, \
    doctor_name VARCHAR(255) NOT NULL, \
    appointment_date DATE NOT NULL, \
    reason TEXT, \
    FOREIGN KEY (patient_id) REFERENCES patients(patient_id) \
    ON DELETE CASCADE ON UPDATE CASCADE \
);"

# Insert fake data into the patients table
!mysql -u root -proot -e "USE healthcare_db; INSERT INTO patients (name, address, ssn, email, dob, age) VALUES \
    ('John Doe', '123 Main St, Springfield', '123-45-6789', 'john.doe@example.com', '1985-04-15', 38), \
    ('Jane Smith', '456 Elm St, Shelbyville', '987-65-4321', 'jane.smith@example.com', '1990-07-20', 33);"

# Insert fake data into the appointments table
!mysql -u root -proot -e "USE healthcare_db; INSERT INTO appointments (patient_id, doctor_name, appointment_date, reason) VALUES \
    (1, 'Dr. Emily Carter', '2024-01-15', 'Annual Checkup'), \
    (2, 'Dr. Sarah Johnson', '2024-02-10', 'Flu Symptoms'), \
    (1, 'Dr. Emily Carter', '2024-02-15', 'Follow-up Visit'), \
    (1, 'Dr. James Wilson', '2024-03-20', 'Routine Blood Test');"



Read the `appointments` table from the relational database into a Spark DataFrame for further processing:

In [19]:
jdbc_url = "jdbc:mysql://localhost:3306/healthcare_db"
df = spark.read.format("jdbc").options(
    url=jdbc_url,
    driver="com.mysql.cj.jdbc.Driver",
    dbtable="appointments",
    user="root",
    password="root"
).load()

df.show()

+--------------+----------+-----------------+----------------+------------------+
|appointment_id|patient_id|      doctor_name|appointment_date|            reason|
+--------------+----------+-----------------+----------------+------------------+
|             1|         1| Dr. Emily Carter|      2024-01-15|    Annual Checkup|
|             2|         2|Dr. Sarah Johnson|      2024-02-10|      Flu Symptoms|
|             3|         1| Dr. Emily Carter|      2024-02-15|   Follow-up Visit|
|             4|         1| Dr. James Wilson|      2024-03-20|Routine Blood Test|
+--------------+----------+-----------------+----------------+------------------+
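
The same connection settings apply to every table in `healthcare_db`, so they can be collected in one place. This is a minimal sketch; `jdbc_options` is a hypothetical helper, and its defaults simply repeat the values used in the cell above.

```python
def jdbc_options(table, database="healthcare_db", host="localhost", port=3306,
                 user="root", password="root"):
    """Build the option dict for spark.read.format("jdbc") (hypothetical helper)."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "driver": "com.mysql.cj.jdbc.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }

print(jdbc_options("patients")["url"])  # jdbc:mysql://localhost:3306/healthcare_db

# With an active session, each table could then be loaded as:
# patients_df = spark.read.format("jdbc").options(**jdbc_options("patients")).load()
```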



Now, let's import the `database_deidentification` utility module and then set up the relational database deidentification options.

In [20]:
from sparknlp_jsl.utils.database_deidentification import *

In [21]:
config = {
    "db_config": {
        "host": "localhost",
        "user": "root",
        "password": "root",
        "database": "healthcare_db"
    },
    "deid_options": {
        "days_to_shift": 10,
        "age_groups": {
            "child": (0, 12),
            "teen": (13, 19),
            "adult": (20, 64),
            "senior": (65, 90)
        },
        "pk_fk_shift_value": 100,
        "use_hipaa": False,
        "output_path": "deidentified_output/"
    },
    "logging": {
        "level": "INFO",
        "file": "deidentification.log"
    }
}
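
To make the options concrete, the sketch below applies the same kinds of transformation in plain Python to one sample appointment row. It is an illustration under assumptions, not the utility's implementation: the helper names are hypothetical, and mapping an age to its group label is only one plausible reading of `age_groups`.

```python
from datetime import date, timedelta

# Values copied from the config above.
DAYS_TO_SHIFT = 10
PK_FK_SHIFT = 100
AGE_GROUPS = {"child": (0, 12), "teen": (13, 19), "adult": (20, 64), "senior": (65, 90)}

def shift_key(key, offset=PK_FK_SHIFT):
    """Add the same offset to primary and foreign keys so joins still line up."""
    return key + offset

def shift_day(day, days=DAYS_TO_SHIFT):
    """Move a date forward by a fixed number of days."""
    return day + timedelta(days=days)

def age_group(age, groups=AGE_GROUPS):
    """Return the label of the inclusive range containing `age`, or None."""
    for label, (low, high) in groups.items():
        if low <= age <= high:
            return label
    return None

appointment = {"appointment_id": 1, "patient_id": 1, "appointment_date": date(2024, 1, 15)}
deidentified = {
    "appointment_id": shift_key(appointment["appointment_id"]),
    "patient_id": shift_key(appointment["patient_id"]),
    "appointment_date": shift_day(appointment["appointment_date"]).isoformat(),
}
print(deidentified)
# {'appointment_id': 101, 'patient_id': 101, 'appointment_date': '2024-01-25'}
print(age_group(38))  # adult
```

Shifting primary and foreign keys by the same constant keeps `appointments.patient_id` pointing at the matching row in `patients` after deidentification.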

Run the structured deidentification on the clinical relational database and show the results:

In [22]:
deidentifier = RelationalDBDeidentification(spark, config)
deidentifier.deidentify()

Original table: appointments
+--------------+----------+-----------------+----------------+------------------+
|appointment_id|patient_id|doctor_name      |appointment_date|reason            |
+--------------+----------+-----------------+----------------+------------------+
|1             |1         |Dr. Emily Carter |2024-01-15      |Annual Checkup    |
|2             |2         |Dr. Sarah Johnson|2024-02-10      |Flu Symptoms      |
|3             |1         |Dr. Emily Carter |2024-02-15      |Follow-up Visit   |
|4             |1         |Dr. James Wilson |2024-03-20      |Routine Blood Test|
+--------------+----------+-----------------+----------------+------------------+

De-identified table: appointments
+--------------+----------+-----------+----------------+------------------+
|appointment_id|patient_id|doctor_name|appointment_date|reason            |
+--------------+----------+-----------+----------------+------------------+
|101           |101       |*****      |2024-01-25   