![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.5.Clinical_Deidentification_Utility_Module.ipynb)

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs==5.1.0

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd
import json
import string
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Automatically load the license data and start a session with all jars the user has access to

spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

# Module

Description of Parameters: <br/>

---

`custom_pipeline` : Spark NLP PipelineModel, optional
custom PipelineModel to be used for deidentification, by default None <br/>
`ner_chunk` : str, optional
name of the final chunk column of the custom pipeline; the chunks in this column are deidentified, by default "ner_chunk" <br/>
`fields` : dict, optional
columns to be deidentified, each mapped to its deidentification mode ("mask" or "obfuscate"), by default {"text": "mask"} <br/>
`sentence` : str, optional
sentence column name of the given custom pipeline, by default "sentence" <br/>
`token` : str, optional
token column name of the given custom pipeline, by default "token" <br/>
`document` : str, optional
document column name of the given custom pipeline, by default "document" <br/>
`masking_policy` : str, optional
masking policy applied in "mask" mode (the examples below use "entity_labels", "same_length_chars", and "fixed_length_chars"), by default "entity_labels" <br/>
`fixed_mask_length` : int, optional
mask length used with the "fixed_length_chars" masking policy, by default 4 <br/>
`obfuscate_date` : bool, optional
whether to obfuscate dates, by default True <br/>
`obfuscate_ref_source` : str, optional
source of the replacement values used for obfuscation ("faker", "file", or "both"), by default "faker" <br/>
`obfuscate_ref_file_path` : str, optional
path of the obfuscation reference file, by default None <br/>
`age_group_obfuscation` : bool, optional
whether to obfuscate ages within their age group, by default False <br/>
`age_ranges` : list, optional
age range boundaries used for age group obfuscation, by default [1, 4, 12, 20, 40, 60, 80] <br/>
`shift_days` : bool, optional
whether to shift dates by a number of days, by default False <br/>
`number_of_days` : int, optional
number of days to shift, by default None <br/>
`documentHashCoder_col_name` : str, optional
document hash coder column name, by default "documentHash" <br/>
`date_tag` : str, optional
date tag, by default "DATE" <br/>
`language` : str, optional
language, by default "en" <br/>
`region` : str, optional
region, by default "us" <br/>
`unnormalized_date` : bool, optional
whether to handle dates written in unnormalized formats, by default False <br/>
`unnormalized_mode` : str, optional
mode applied to unnormalized dates ("mask" or "obfuscate"), by default "mask" <br/>
`id_column_name` : str, optional
ID column name, by default "id" <br/>
`date_shift_column_name` : str, optional
date shift column name, by default "date_shift" <br/>
`separator` : str, optional
separator of the input csv file, by default "\t" <br/>
`input_file_path` : str, optional
input file path, by default None <br/>
`output_file_path` : str, optional
output file path, by default 'deidentified.csv'

Returns


---

Spark DataFrame: Spark DataFrame with deidentified text <br/>
csv/json file: A deidentified file.

In [None]:
#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
'''

df= spark.createDataFrame([[text]]).toDF("text")

In [None]:
df_pd= df.toPandas()
df_pd.to_csv("deid_data.csv", index=False)

# With a custom pipeline

Sample custom pipeline with `ner_deid_generic_augmented` to detect PHI entities.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

deid_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(['NAME', 'PROFESSION', 'ID', 'AGE', 'DATE'])

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      deid_ner,
      ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model= nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]


## Default parameters

In [None]:
from sparknlp_jsl import Deid

In [None]:
# The module needs an active Spark session and its parameters
deid_implementor= Deid(spark,
                       input_file_path="deid_data.csv",
                       output_file_path="deidentified.csv",
                       custom_pipeline=model)

In [None]:
res= deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                                                     text_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0| Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson...|Record date : <DATE> , <NAME> , M.D . , Name : <NAME> , MR # <ID> D...|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+



In [None]:
#checking saved output file
import pandas as pd
res_data= pd.read_csv("deidentified.csv")
res_data.head()

Unnamed: 0,ID,text,text_deidentified
0,0,"Record date : 2093-01-13 , David Hale , M.D ....","Record date : <DATE> , <NAME> , M.D . , Name :..."


## Mask options


### same_length_chars

In [None]:
deid_implementor= Deid(spark,
                      input_file_path="deid_data.csv",
                      output_file_path="deidentified.csv",
                      custom_pipeline=model,
                      fields={"text": "mask"}, masking_policy="same_length_chars" )

In [None]:
res= deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                                                     text_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0| Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson...|Record date : [********] , [********] , M.D . , Name : [***********...|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+



### fixed_length_chars

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_data.csv",
                       output_file_path="deidentified.csv",
                       custom_pipeline=model,
                       fields={"text": "mask"}, masking_policy="fixed_length_chars", fixed_mask_length=2)

In [None]:
res= deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                                                     text_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0| Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson...|Record date : ** , ** , M.D . , Name : ** , MR # ** Date : ** . PCP...|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+



### masking multiple columns

In [None]:
text= """Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR ."""
text_1= """Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09 """
df= spark.createDataFrame([[text, text_1]]).toDF("text", "text_1")

df_pd= df.toPandas()
df_pd.to_csv("deid_multiple_data.csv", index=False)
df_pd.head()

Unnamed: 0,text,text_1
0,"Record date : 2093-01-13 , David Hale , M.D . ...","Date : 01/13/93 PCP : Oliveira , 25 years-old ..."


In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_multiple_data.csv",
                       output_file_path="deidentified.csv",
                       custom_pipeline=model,
                       fields={"text": "mask", "text_1":"obfuscate"}, masking_policy="fixed_length_chars",
                       fixed_mask_length=2, separator=",")

In [None]:
res= deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentification process of the 'text_1' field has begun...
Deidentification process of the 'text_1' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                             text_deidentified|                                                                text_1|                                                   text_1_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0|Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson ...|Record date : ** , ** , M.D . , Name : ** MR .|Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-...|Date : 0

## Obfuscate Options

### obfuscate_ref_source="file"

In [None]:
obs_lines = """Marvin MARSHALL#NAME
Hubert GROGAN#NAME
ALTHEA COLBURN#NAME
Kalil AMIN#NAME
Inci FOUNTAIN#NAME
Ekaterina Rosa#DOCTOR
Rudiger Chao#DOCTOR
COLLETTE KOHLER#DOCTOR
Mufi HIGGS#DOCTOR"""


# Each line of the reference file has the form <replacement value>#<entity label>
with open('obfuscation.txt', 'w') as f:
    f.write(obs_lines)

In [None]:
text= """Name of the patient is Leah Shannon.
Her pps number is 1234567.
"""

df= spark.createDataFrame([[text]]).toDF("text")
df_pd= df.toPandas()
df_pd.to_csv("deid_obfs_data.csv", index=False)

In [None]:
deid_implementor= Deid(spark, input_file_path="deid_obfs_data.csv",
                                output_file_path="deidentified.csv",
                                custom_pipeline=model,
                                fields={"text": "obfuscate"}, obfuscate_ref_source="file",
                                obfuscate_ref_file_path="obfuscation.txt")

In [None]:
res= deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=False)

+---+----------------------------------------------------------------+----------------------------------------------------------------+
|ID |text                                                            |text_deidentified                                               |
+---+----------------------------------------------------------------+----------------------------------------------------------------+
|0  |Name of the patient is Leah Shannon. Her pps number is 1234567. |Name of the patient is Inci FOUNTAIN. Her pps number is 4580775.|
+---+----------------------------------------------------------------+----------------------------------------------------------------+
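
The `obfuscation.txt` file written earlier holds one `value#LABEL` pair per line. As a rough illustration of that layout (not the library's own loader), the sketch below groups the replacement values by entity label with plain Python. `parse_obfuscation_lines` is a hypothetical helper name introduced only for this example.

```python
from collections import defaultdict

def parse_obfuscation_lines(lines, sep="#"):
    """Group replacement values by entity label (hypothetical helper, illustration only)."""
    by_label = defaultdict(list)
    for line in lines.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the last separator so a value containing '#' stays intact
        value, label = line.rsplit(sep, 1)
        by_label[label].append(value)
    return dict(by_label)

sample = """Marvin MARSHALL#NAME
Hubert GROGAN#NAME
Ekaterina Rosa#DOCTOR"""

refs = parse_obfuscation_lines(sample)
print(refs)
```

A quick check like this can help confirm that every label in the file matches an entity label produced by the NER model in the pipeline.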



### obfuscate_ref_source="both"

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_obfs_data.csv",
                       output_file_path="deidentified.csv",
                       custom_pipeline=model,
                       fields={"text": "obfuscate"}, obfuscate_ref_source="both",
                       obfuscate_ref_file_path="obfuscation.txt")

In [None]:
res= deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------+---------------------------------------------------------------+
| ID|                                                            text|                                              text_deidentified|
+---+----------------------------------------------------------------+---------------------------------------------------------------+
|  0|Name of the patient is Leah Shannon. Her pps number is 1234567. |Name of the patient is Lequita Asal. Her pps number is 1921254.|
+---+----------------------------------------------------------------+---------------------------------------------------------------+



### obfuscate_ref_source="faker"

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_data.csv",
                       output_file_path="deidentified.csv",
                       custom_pipeline=model, obfuscate_date=True,
                       fields={"text": "obfuscate"})

In [None]:
res= deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                                                     text_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0| Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson...|Record date : 2093-02-28 , Avie Echevaria , M.D . , Name : Blake Di...|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+



### age groups obfuscation

In [None]:
from pyspark.sql.types import StringType

# Example data: short texts that mention ages
age_texts = [
    '1 year old baby',
    '4 year old kids',
    'A 15 year old female with',
    'Record date: 2093-01-13, Age: 25',
    'Patient is 45 years-old',
    'He is 65 years-old male'
]
df_ages = spark.createDataFrame(age_texts, StringType()).toDF('text')

pd_df_ages = df_ages.toPandas()
pd_df_ages.to_csv("deid_age_group.csv", index=False)

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_age_group.csv",
                       output_file_path="deidentified.csv",
                       custom_pipeline=model,
                       fields={"text": "obfuscate"},
                       age_group_obfuscation=True, age_ranges=[1, 4, 12, 20, 40, 60, 80])

In [None]:
res= deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=False)

+---+--------------------------------+--------------------------------+
|ID |text                            |text_deidentified               |
+---+--------------------------------+--------------------------------+
|0  |1 year old baby                 |2 year old baby                 |
|1  |4 year old kids                 |11 year old kids                |
|2  |A 15 year old female with       |A 18 year old female with       |
|3  |Record date: 2093-01-13, Age: 25|Record date: 2093-02-03, Age: 39|
|4  |Patient is 45 years-old         |Patient is 40 years-old         |
|5  |He is 65 years-old male         |He is 79 years-old male         |
+---+--------------------------------+--------------------------------+
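
In the output above, each obfuscated age stays between the same two `age_ranges` boundaries as the original (for example, 25 becomes 39, and both lie between 20 and 40). The sketch below mimics that behaviour in plain Python. It assumes half-open `[low, high)` buckets and uniform sampling, neither of which is confirmed by the library, and `age_bucket` and `obfuscate_age` are hypothetical names.

```python
import bisect
import random

AGE_RANGES = [1, 4, 12, 20, 40, 60, 80]  # same boundaries as the age_ranges argument above

def age_bucket(age, ranges=AGE_RANGES):
    """Return the (low, high) boundaries around `age`, or None if it lies outside all ranges."""
    idx = bisect.bisect_right(ranges, age)
    if idx == 0 or idx == len(ranges):
        return None  # below the first boundary, or at/above the last one
    return ranges[idx - 1], ranges[idx]

def obfuscate_age(age, ranges=AGE_RANGES, rng=random):
    """Pick a random age from the same bucket (assumed behaviour, illustration only)."""
    bucket = age_bucket(age, ranges)
    if bucket is None:
        return age  # assumption: ages outside every range are left untouched
    low, high = bucket
    return rng.randrange(low, high)  # low inclusive, high exclusive

for age in (1, 4, 15, 25, 45, 65):
    print(age, "->", obfuscate_age(age), "within", age_bucket(age))
```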



### shifting days according to the ID column

In [None]:
import pandas as pd
data = pd.DataFrame(
    {'patientID' : ['A001', 'A001', 'A002', 'A002'],
     'text' : ['Chris Brown was discharged on 10/02/2022',
               'Mark White was discharged on 02/28/2020',
               'John was discharged on 03/15/2022',
               'John Moore was discharged on 12/31/2022'
              ]
    }
)

my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate = False)

+---------+----------------------------------------+
|patientID|text                                    |
+---------+----------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|
|A001     |Mark White was discharged on 02/28/2020 |
|A002     |John was discharged on 03/15/2022       |
|A002     |John Moore was discharged on 12/31/2022 |
+---------+----------------------------------------+



In [None]:
df_pd= my_input_df.toPandas()
df_pd.to_csv("deid_id_data.csv", index=False)

Custom pipeline with `DocumentHashCoder()`.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setRangeDays(100)\
    .setNewDateShift("shift_days")\
    .setSeed(100)

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = medical.NerModel\
    .pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
    .setInputCols(["document2","token", "word_embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")


nlpPipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")

pipeline_model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented download started this may take some time.
[OK!]


In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_id_data.csv",
                       output_file_path="deidentified.csv",
                       custom_pipeline=pipeline_model,
                       fields={"text": "obfuscate"},
                       shift_days=True,
                       obfuscate_date=True,
                       ner_chunk="ner_chunk",
                       token="token",
                       documenthashcoder_col_name="document2",
                       separator=",",
                       unnormalized_date=False)

In [None]:
res= deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=False)

+---+----------------------------------------+-------------------------------------------+
|ID |text                                    |text_deidentified                          |
+---+----------------------------------------+-------------------------------------------+
|0  |Chris Brown was discharged on 10/02/2022|Tina Griffiths was discharged on 09/27/2022|
|1  |Mark White was discharged on 02/28/2020 |Bynum Bellows was discharged on 02/23/2020 |
|2  |John was discharged on 03/15/2022       |Prince Rome was discharged on 04/13/2022   |
|3  |John Moore was discharged on 12/31/2022 |Alecia Lemming was discharged on 01/29/2023|
+---+----------------------------------------+-------------------------------------------+
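
In the result above, both `A001` rows move back by 5 days and both `A002` rows move forward by 29 days, so every document of a patient receives the same shift. One way to obtain that property is to derive the offset from a hash of the patient ID. The sketch below is only an illustration built on `hashlib`. It is not the algorithm `DocumentHashCoder` uses, and `shift_for_patient` with its `seed` and `range_days` arguments is a hypothetical stand-in.

```python
import hashlib

def shift_for_patient(patient_id, seed=100, range_days=100):
    """Map a patient ID to a stable day offset in [-range_days, range_days] (illustration only)."""
    digest = hashlib.sha256(f"{seed}:{patient_id}".encode("utf-8")).digest()
    number = int.from_bytes(digest[:8], "big")
    # Fold the hash into the allowed window of offsets
    return number % (2 * range_days + 1) - range_days

for pid in ["A001", "A001", "A002", "A002"]:
    print(pid, shift_for_patient(pid))
```

Because the offset depends only on the ID and the seed, intervals between the dates of one patient are preserved after shifting.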



### shifting days according to specified values

In [None]:
data = pd.DataFrame(
    {'patientID' : ['A001', 'A002', 'A003', 'A004'],
     'text' : ['Chris Brown was discharged on 10/02/2022',
               'Mark White was discharged on 03/01/2020',
               'John was discharged on 03/15/2022',
               'John Moore was discharged on 12/31/2022'
              ],
     'dateshift' : ['10', '-2', '30', '-8']
    }
)

my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate=False)

+---------+----------------------------------------+---------+
|patientID|text                                    |dateshift|
+---------+----------------------------------------+---------+
|A001     |Chris Brown was discharged on 10/02/2022|10       |
|A002     |Mark White was discharged on 03/01/2020 |-2       |
|A003     |John was discharged on 03/15/2022       |30       |
|A004     |John Moore was discharged on 12/31/2022 |-8       |
+---------+----------------------------------------+---------+



In [None]:
df_pd= my_input_df.toPandas()
df_pd.to_csv("deid_specific_data.csv", index=False)

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = medical.NerModel\
    .pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
    .setInputCols(["document2","token", "word_embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter
])

empty_data = spark.createDataFrame([["", "", ""]]).toDF("patientID","text", "dateshift")

pipeline_col_model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented download started this may take some time.
[OK!]


In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_specific_data.csv",
                       separator=",",
                       output_file_path="deidentified.csv",
                       custom_pipeline=pipeline_col_model,
                       fields={"text": "obfuscate"},
                       shift_days=True,
                       obfuscate_date=True,
                       ner_chunk="ner_chunk",
                       token="token",
                       documenthashcoder_col_name="document2")

In [None]:
res= deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=False)

+---+----------------------------------------+----------------------------------------------+
|ID |text                                    |text_deidentified                             |
+---+----------------------------------------+----------------------------------------------+
|0  |Chris Brown was discharged on 10/02/2022|Sunday Corn was discharged on 10/12/2022      |
|1  |Mark White was discharged on 03/01/2020 |Wynelle Cleveland was discharged on 02/28/2020|
|2  |John was discharged on 03/15/2022       |Houston Siren was discharged on 04/14/2022    |
|3  |John Moore was discharged on 12/31/2022 |Bartolo Darter was discharged on 12/23/2022   |
+---+----------------------------------------+----------------------------------------------+
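
The shifted dates above can be checked by hand: each date moves by the number of days given in its `dateshift` value. A minimal check with the standard `datetime` module follows, assuming the `MM/DD/YYYY` format used in the sample rows. `shift_date` is a hypothetical helper, not part of the library.

```python
from datetime import datetime, timedelta

def shift_date(date_text, days, fmt="%m/%d/%Y"):
    """Shift a date string by `days` (int or numeric string) and keep the same format."""
    shifted = datetime.strptime(date_text, fmt) + timedelta(days=int(days))
    return shifted.strftime(fmt)

# Dates and shift values copied from the sample data above
rows = [("10/02/2022", "10"), ("03/01/2020", "-2"),
        ("03/15/2022", "30"), ("12/31/2022", "-8")]
for date_text, days in rows:
    print(date_text, days, "->", shift_date(date_text, days))
```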



### unnormalized date formats

In [None]:
import pandas as pd

data = pd.DataFrame(
    {'patientID' : ['A001', 'A001', 'A002', 'A002'],
     'text' : ['Chris Brown was discharged on 10/02/2022',
               'Mark White was discharged on 02/28/2020',
               'John was discharged on 03 Apr2022',          # unnormalized date format
               'John Moore was discharged on 12/31/2022'
              ],
     'dateshift' : ['-5', '-2', '10', '20']
    }
)


my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate=False)

+---------+----------------------------------------+---------+
|patientID|text                                    |dateshift|
+---------+----------------------------------------+---------+
|A001     |Chris Brown was discharged on 10/02/2022|-5       |
|A001     |Mark White was discharged on 02/28/2020 |-2       |
|A002     |John was discharged on 03 Apr2022       |10       |
|A002     |John Moore was discharged on 12/31/2022 |20       |
+---------+----------------------------------------+---------+



In [None]:
df_pd= my_input_df.toPandas()
df_pd.to_csv("deid_unnormalized_data.csv", index=False)

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = medical.NerModel\
    .pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
    .setInputCols(["document2","token", "word_embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter
])

empty_data = spark.createDataFrame([["", "", ""]]).toDF("patientID","text", "dateshift")

modelDocHasher = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented download started this may take some time.
[OK!]


In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_unnormalized_data.csv",
                       output_file_path="deidentified.csv",
                       custom_pipeline=modelDocHasher,
                       fields={"text": "obfuscate"},
                       shift_days=True,
                       obfuscate_date=True,
                       ner_chunk="ner_chunk",
                       token="token",
                       documenthashcoder_col_name="document2",
                       separator=",",
                       unnormalized_date=True,
                       unnormalized_mode="mask")

In [None]:
res= deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=False)

+---+----------------------------------------+----------------------------------------------+
|ID |text                                    |text_deidentified                             |
+---+----------------------------------------+----------------------------------------------+
|0  |Chris Brown was discharged on 10/02/2022|Corinna Gab was discharged on 09/27/2022      |
|1  |Mark White was discharged on 02/28/2020 |Dyann Kief was discharged on 02/26/2020       |
|2  |John was discharged on 03 Apr2022       |Theodoro Doing was discharged on <DATE>       |
|3  |John Moore was discharged on 12/31/2022 |Jamesetta Orleans was discharged on 01/20/2023|
+---+----------------------------------------+----------------------------------------------+



**unnormalized_mode="obfuscate"**

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_unnormalized_data.csv",
                       output_file_path="deidentified1.csv",
                       custom_pipeline=modelDocHasher,
                       fields={"text": "obfuscate"},
                       shift_days=True,
                       obfuscate_date=True,
                       ner_chunk="ner_chunk",
                       token="token",
                       documenthashcoder_col_name="document2",
                       separator=",",
                       unnormalized_date=True,
                       unnormalized_mode="obfuscate")

In [None]:
res= deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified1.csv' !


In [None]:
res.show(truncate=False)

+---+----------------------------------------+----------------------------------------------+
|ID |text                                    |text_deidentified                             |
+---+----------------------------------------+----------------------------------------------+
|0  |Chris Brown was discharged on 10/02/2022|Leeann Must was discharged on 09/27/2022      |
|1  |Mark White was discharged on 02/28/2020 |Georgiana Spinner was discharged on 02/26/2020|
|2  |John was discharged on 03 Apr2022       |Doreene Burke was discharged on 08-21-1986    |
|3  |John Moore was discharged on 12/31/2022 |Leigh Aurora was discharged on 01/20/2023     |
+---+----------------------------------------+----------------------------------------------+



## Multi-Mode options


### With one column

`DeIdentification()` supports multi-mode functionality, which applies a different deidentification mode to each entity type.

To enable it, pass the path of a JSON file through the `multi_mode_file_path` parameter. <br/>



An example JSON file looks like this:
```
{
	"obfuscate": ["PHONE"] ,
	"mask_entity_labels": ["ID"],
	"skip": ["DATE"],
	"mask_same_length_chars":["NAME"],
	"mask_fixed_length_chars":["ZIP", "LOCATION"]
}
```

Description of possible modes to enable multi-mode deidentification:

```
   * 'obfuscate': Replace the values with random values.
   * 'mask_same_length_chars': Replace the value with a mask of the same length: asterisks for the length minus two, plus one bracket on each end.
   * 'mask_entity_labels': Replace the values with their entity label.
   * 'mask_fixed_length_chars': Replace the value with a fixed number of asterisks. You can also invoke "setFixedMaskLength()" to set that length.
   * 'skip': Skip the entities (leave them intact).
```
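
The three masking modes above can be illustrated with a small pure-Python sketch. The helper names (`mask_same_length`, `mask_fixed_length`, `mask_entity_label`) are hypothetical and not part of the library; they only mimic the outputs shown in this notebook. How values shorter than three characters are handled is an assumption.

```python
def mask_same_length(value: str) -> str:
    """Mask a value with brackets and asterisks, keeping its original length."""
    if len(value) < 3:
        # Assumption: too short to fit both brackets, so use asterisks only.
        return "*" * len(value)
    return "[" + "*" * (len(value) - 2) + "]"


def mask_fixed_length(value: str, fixed_mask_length: int) -> str:
    """Mask a value with a fixed number of asterisks, whatever its length."""
    return "*" * fixed_mask_length


def mask_entity_label(label: str) -> str:
    """Replace a value with its entity label in angle brackets."""
    return f"<{label}>"


print(mask_same_length("2093-01-13"))      # [********]
print(mask_fixed_length("David Hale", 2))  # **
print(mask_entity_label("DOCTOR"))         # <DOCTOR>
```

The same-length mask keeps the character count of the original text, which can matter when downstream tools rely on character offsets.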

In [None]:
# JSON that maps each deid mode to the entity labels it applies to
sample_json= {
	"obfuscate": ["NAME", "PHONE"] ,
	"mask_entity_labels": ["AGE"],
	"skip": ["SSN"],
	"mask_same_length_chars":["DATE"],
	"mask_fixed_length_chars":["ZIP", "LOCATION"]
}

import json
with open('sample_multi-mode.json', 'w', encoding='utf-8') as f:
    json.dump(sample_json, f, ensure_ascii=False, indent=4)
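
Before handing such a file to `Deid`, it can be worth sanity-checking it. The sketch below is not a library feature: `check_multi_mode` is a hypothetical helper that accepts only the five mode names described above and, as an extra assumption, rejects an entity listed under two modes. It returns an entity-to-mode lookup.

```python
import json

# Mode names taken from the description of multi-mode options above.
ALLOWED_MODES = {
    "obfuscate",
    "mask_entity_labels",
    "mask_same_length_chars",
    "mask_fixed_length_chars",
    "skip",
}


def check_multi_mode(config: dict) -> dict:
    """Return an entity -> mode lookup, raising ValueError on bad input."""
    entity_to_mode = {}
    for mode, entities in config.items():
        if mode not in ALLOWED_MODES:
            raise ValueError(f"Unknown mode: {mode!r}")
        for entity in entities:
            if entity in entity_to_mode:
                raise ValueError(
                    f"{entity!r} is listed under both "
                    f"{entity_to_mode[entity]!r} and {mode!r}"
                )
            entity_to_mode[entity] = mode
    return entity_to_mode


# Same content as sample_multi-mode.json, inlined to keep the sketch self-contained.
sample = json.loads("""
{
    "obfuscate": ["NAME", "PHONE"],
    "mask_entity_labels": ["AGE"],
    "skip": ["SSN"],
    "mask_same_length_chars": ["DATE"],
    "mask_fixed_length_chars": ["ZIP", "LOCATION"]
}
""")

lookup = check_multi_mode(sample)
print(lookup["DATE"])  # mask_same_length_chars
```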

In [None]:
deid_implementor = Deid(spark,
                       input_file_path="deid_data.csv",
                       output_file_path="deidentified.csv",
                       custom_pipeline=model,
                       multi_mode_file_path="sample_multi-mode.json")

In [None]:
res= deid_implementor.deidentify()


Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                                                     text_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0| Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson...|Record date : [********] , Perlie Gold , M.D . , Name : Levora Dred...|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+





### With multiple columns

Let's create a second JSON file that describes the multi-mode settings for the second column.

In [None]:
# JSON that maps each deid mode to the entity labels it applies to (second column)
sample_json_column2= {
	"obfuscate": ["SSN", "AGE"] ,
	"mask_entity_labels": ["DATE"],
	"skip": ["ID"],
	"mask_same_length_chars":["NAME"],
	"mask_fixed_length_chars":["ZIP", "LOCATION"]
}

import json
with open('sample_multi-mode_column2.json', 'w', encoding='utf-8') as f:
    json.dump(sample_json_column2, f, ensure_ascii=False, indent=4)

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_multiple_data.csv",
                       output_file_path="deidentified.csv",
                       custom_pipeline=model,
                       fields={"text": "sample_multi-mode.json", "text_1": "sample_multi-mode_column2.json"},
                       masking_policy="fixed_length_chars",
                       fixed_mask_length=2,
                       separator=",")

In [None]:
res= deid_implementor.deidentify()

You entered an invalid mode option. Please enter 'mask' or 'obfuscate'...
Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentification process of the 'text_1' field has begun...
Deidentification process of the 'text_1' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------+
| ID|                                                                  text|                                                     text_deidentified|                                                                text_1|                                               text_1_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------------------------------+
|  0|Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson ...|Record date : [********] , Loni Muse , M.D . , Name : Lowella Dand

# With no custom pipeline

### Default parameters

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_data.csv",
                       output_file_path="deidentified.csv")

In [None]:
res= deid_implementor.deidentify()

You entered an invalid domain option. You can choose ether 'clinical', 'finance' or 'legal'. 'clinical' is used by default!
ner_deid_subentity_augmented_i2b2_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                                                     text_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0| Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson...|Record date : <DATE> , <DOCTOR> , M.D . , Name : <PATIENT> , MR # <...|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+



### Mask options

#### entity_labels

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_data.csv",
                       output_file_path="deidentified.csv",
                       fields={"text": "mask"},
                       masking_policy="entity_labels")

In [None]:
res= deid_implementor.deidentify()

You entered an invalid domain option. You can choose ether 'clinical', 'finance' or 'legal'. 'clinical' is used by default!
ner_deid_subentity_augmented_i2b2_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                                                     text_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0| Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson...|Record date : <DATE> , <DOCTOR> , M.D . , Name : <PATIENT> , MR # <...|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+



#### fixed_length_chars

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_data.csv",
                       output_file_path="deidentified.csv",
                       fields={"text": "mask"},
                       masking_policy="fixed_length_chars",
                       fixed_mask_length=2)

In [None]:
res= deid_implementor.deidentify()

You entered an invalid domain option. You can choose ether 'clinical', 'finance' or 'legal'. 'clinical' is used by default!
ner_deid_subentity_augmented_i2b2_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                                                     text_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0| Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson...|Record date : ** , ** , M.D . , Name : ** , MR # ** Date : ** . PCP...|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+



#### same_length_chars

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_data.csv",
                       output_file_path="deidentified.csv",
                       fields={"text": "mask"},
                       masking_policy="same_length_chars")

In [None]:
res= deid_implementor.deidentify()

You entered an invalid domain option. You can choose ether 'clinical', 'finance' or 'legal'. 'clinical' is used by default!
ner_deid_subentity_augmented_i2b2_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                                                     text_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0| Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson...|Record date : [********] , [********] , M.D . , Name : [***********...|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+



### Obfuscate option

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_data.csv",
                       output_file_path="deidentified.csv",
                       separator="\t",
                       fields={"text": "obfuscate"},
                       unnormalized_date=False)

In [None]:
res= deid_implementor.deidentify()

You entered an invalid domain option. You can choose ether 'clinical', 'finance' or 'legal'. 'clinical' is used by default!
ner_deid_subentity_augmented_i2b2_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                                                     text_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0| Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson...|Record date : 2093-02-25 , Elberta Leatherwood , M.D . , Name : Hil...|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+



#### age groups obfuscation

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_age_group.csv",
                       output_file_path="deidentified.csv",
                       fields={"text": "obfuscate"},
                       age_group_obfuscation=True,
                       age_ranges=[1, 4, 12, 20, 40, 60, 80])

In [None]:
res= deid_implementor.deidentify()

You entered an invalid domain option. You can choose ether 'clinical', 'finance' or 'legal'. 'clinical' is used by default!
ner_deid_subentity_augmented_i2b2_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+--------------------------------+--------------------------------+
| ID|                            text|               text_deidentified|
+---+--------------------------------+--------------------------------+
|  0|                 1 year old baby|                 1 year old baby|
|  1|                 4 year old kids|                 4 year old kids|
|  2|       A 15 year old female with|       A 17 year old female with|
|  3|Record date: 2093-01-13, Age: 25|Record date: 2093-01-28, Age: 31|
|  4|         Patient is 45 years-old|         Patient is 51 years-old|
|  5|         He is 65 years-old male|         He is 76 years-old male|
+---+--------------------------------+--------------------------------+
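
In the table above, each obfuscated age stays inside the same bracket of `age_ranges` as the original (15 becomes 17, 45 becomes 51). A minimal sketch of that idea follows, assuming half-open brackets `[low, high)` and a uniform random pick. The library's actual sampling is not shown in this notebook, and `age_bracket` and `obfuscate_age` are hypothetical names.

```python
import bisect
import random


def age_bracket(age, age_ranges):
    """Return the (low, high) bracket containing age, or None if outside all brackets."""
    i = bisect.bisect_right(age_ranges, age)
    if i == 0 or i == len(age_ranges):
        return None
    return age_ranges[i - 1], age_ranges[i]


def obfuscate_age(age, age_ranges, rng):
    """Pick a random age from the same bracket; leave out-of-range ages unchanged."""
    bracket = age_bracket(age, age_ranges)
    if bracket is None:
        return age
    low, high = bracket
    return rng.randrange(low, high)


age_ranges = [1, 4, 12, 20, 40, 60, 80]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
for age in (15, 25, 45, 65):
    print(age, "->", obfuscate_age(age, age_ranges, rng), "in", age_bracket(age, age_ranges))
```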



#### shifting days

In [None]:
data = pd.DataFrame(
    {'patientID' : ['A001', 'A002', 'A003', 'A004'],
     'text' : ['Chris Brown was discharged on 10/02/2022',
               'Mark White was discharged on 03/01/2020',
               'John was discharged on 03/15/2022',
               'John Moore was discharged on 12/31/2022'
              ]
    }
)

my_input_df = spark.createDataFrame(data)
df_pd= my_input_df.toPandas()
df_pd.to_csv("shift_days_data.csv", index=False)

my_input_df.show(truncate=False)

+---------+----------------------------------------+
|patientID|text                                    |
+---------+----------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|
|A002     |Mark White was discharged on 03/01/2020 |
|A003     |John was discharged on 03/15/2022       |
|A004     |John Moore was discharged on 12/31/2022 |
+---------+----------------------------------------+



In [None]:
deid_implementor= Deid(spark,
                       input_file_path="shift_days_data.csv",
                       output_file_path="deidentified.csv",
                       fields={"text": "obfuscate"},
                       shift_days=True,
                       obfuscate_date=True,
                       separator=",",
                       number_of_days=2)

In [None]:
res= deid_implementor.deidentify()

You entered an invalid domain option. You can choose ether 'clinical', 'finance' or 'legal'. 'clinical' is used by default!
ner_deid_subentity_augmented_i2b2_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=False)

+---+----------------------------------------+----------------------------------------------+
|ID |text                                    |text_deidentified                             |
+---+----------------------------------------+----------------------------------------------+
|0  |Chris Brown was discharged on 10/02/2022|Luvenia Redden was discharged on 10/04/2022   |
|1  |Mark White was discharged on 03/01/2020 |Dollene Cleveland was discharged on 03/03/2020|
|2  |John was discharged on 03/15/2022       |John was discharged on 03/17/2022             |
|3  |John Moore was discharged on 12/31/2022 |Jewel Baize was discharged on 01/02/2023      |
+---+----------------------------------------+----------------------------------------------+
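
The date arithmetic behind `number_of_days=2` can be reproduced with the standard library. This is only a sketch of the shift itself, assuming the `MM/DD/YYYY` format used in this data; `shift_date` is a hypothetical helper, not a library function.

```python
from datetime import datetime, timedelta


def shift_date(date_text, number_of_days, fmt="%m/%d/%Y"):
    """Shift a date string by a number of days, keeping its format."""
    shifted = datetime.strptime(date_text, fmt) + timedelta(days=number_of_days)
    return shifted.strftime(fmt)


for original in ["10/02/2022", "03/01/2020", "03/15/2022", "12/31/2022"]:
    print(original, "->", shift_date(original, 2))
```

The last date crosses a year boundary, which is why 12/31/2022 appears as 01/02/2023 in the table.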



#### unnormalized date formats

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_unnormalized_data.csv",
                       output_file_path="deidentified.csv",
                       fields={"text": "obfuscate"},
                       shift_days=True,
                       obfuscate_date=True,
                       separator=",",
                       unnormalized_date=True,
                       unnormalized_mode="obfuscate")

In [None]:
res= deid_implementor.deidentify()


You entered an invalid domain option. You can choose ether 'clinical', 'finance' or 'legal'. 'clinical' is used by default!
ner_deid_subentity_augmented_i2b2_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=False)

+---+----------------------------------------+----------------------------------------------+
|ID |text                                    |text_deidentified                             |
+---+----------------------------------------+----------------------------------------------+
|0  |Chris Brown was discharged on 10/02/2022|Standley Brooking was discharged on 11/06/2022|
|1  |Mark White was discharged on 02/28/2020 |Caleen Essex was discharged on 04/03/2020     |
|2  |John was discharged on 03 Apr2022       |Lajoyce Corners was discharged on 07-21-1988  |
|3  |John Moore was discharged on 12/31/2022 |Janalee Dane was discharged on 02/04/2023     |
+---+----------------------------------------+----------------------------------------------+



**unnormalized_mode="mask"**

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_unnormalized_data.csv",
                       output_file_path="deidentified.csv",
                       fields={"text": "obfuscate"},
                       shift_days=True,
                       obfuscate_date=True,
                       separator=",",
                       unnormalized_date=True,
                       unnormalized_mode="mask")

In [None]:
res= deid_implementor.deidentify()


You entered an invalid domain option. You can choose ether 'clinical', 'finance' or 'legal'. 'clinical' is used by default!
ner_deid_subentity_augmented_i2b2_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=False)

+---+----------------------------------------+---------------------------------------------+
|ID |text                                    |text_deidentified                            |
+---+----------------------------------------+---------------------------------------------+
|0  |Chris Brown was discharged on 10/02/2022|Claudie Revering was discharged on 11/07/2022|
|1  |Mark White was discharged on 02/28/2020 |Bedelia Person was discharged on 04/04/2020  |
|2  |John was discharged on 03 Apr2022       |Susanne Borders was discharged on <DATE>     |
|3  |John Moore was discharged on 12/31/2022 |Aram Candela was discharged on 02/05/2023    |
+---+----------------------------------------+---------------------------------------------+
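
The two `unnormalized_mode` settings differ only in what happens to a date that cannot be parsed, such as `03 Apr2022`. The sketch below illustrates that fallback under stated assumptions: the list of recognized formats, the random date range, and the helper name `handle_date` are all hypothetical, and the library's own parsing rules are not shown in this notebook.

```python
import random
from datetime import date, datetime, timedelta

# Assumption: formats treated as "normalized" for the purpose of this sketch.
KNOWN_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")


def handle_date(date_text, number_of_days, unnormalized_mode, rng):
    """Shift a parseable date; mask or obfuscate one that cannot be parsed."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(date_text, fmt)
        except ValueError:
            continue
        return (parsed + timedelta(days=number_of_days)).strftime(fmt)

    # Reaching this point means no known format matched the text.
    if unnormalized_mode == "mask":
        return "<DATE>"
    if unnormalized_mode == "obfuscate":
        # Assumption: any random date in a fixed 20-year window will do.
        random_day = date(1980, 1, 1) + timedelta(days=rng.randrange(365 * 20))
        return random_day.strftime("%m-%d-%Y")
    raise ValueError(f"Unsupported unnormalized_mode: {unnormalized_mode!r}")


rng = random.Random(0)
print(handle_date("10/02/2022", 5, "mask", rng))       # 10/07/2022
print(handle_date("03 Apr2022", 5, "mask", rng))       # <DATE>
print(handle_date("03 Apr2022", 5, "obfuscate", rng))  # a random MM-DD-YYYY date
```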



### Multi-Mode options

In [None]:
deid_implementor= Deid(spark,
                       input_file_path="deid_data.csv",
                       output_file_path="deidentified.csv",
                       multi_mode_file_path="sample_multi-mode.json")

In [None]:
res= deid_implementor.deidentify()

You entered an invalid domain option. You can choose ether 'clinical', 'finance' or 'legal'. 'clinical' is used by default!
ner_deid_subentity_augmented_i2b2_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]
Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show(truncate=70)

+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID|                                                                  text|                                                     text_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
|  0| Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson...|Record date : [********] , <DOCTOR> , M.D . , Name : <PATIENT> , MR...|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+



# Structured Deidentification

In [None]:
#sample data
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/hipaa-table-001.txt

df = spark.read.format("csv") \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("hipaa-table-001.txt")

df = df.withColumnRenamed("PATIENT","NAME")
df.show(truncate=False)

+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|NAME           |DOB       |AGE|ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|Cecilia Chapman|04/02/1935|83 |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|Iris Watson    |03/10/2009|9  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|Bryar Pitts    |11/01/1921|98 |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|Theodore Lowe  |13/02/2002|16 |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|Calista Wise   |20/08/1942|76 |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|Kyla Olsen     |12/05/1973|45 |Ap #651-8679 Sodales Av. Tamunin

In [None]:
df_pd= df.toPandas()
df_pd.to_csv("deid_structured_data.csv", index=False)

## Default parameters

In [None]:
from sparknlp_jsl.utils.deidentification_utils import structured_deidentifier

In [None]:
res= structured_deidentifier(spark, input_file_path="deid_structured_data.csv")

Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show()

+------------------+----------+-----+--------------------+-------+--------------+---+---+
|              NAME|       DOB|  AGE|             ADDRESS|ZIPCODE|           TEL|SBP|DBP|
+------------------+----------+-----+--------------------+-------+--------------+---+---+
|  [Mckinley Jewel]|04/02/1935| [97]|711-2880 Nulla St...|  69200|(257) 563-7401|101| 42|
|    [Jettie Booze]|03/10/2009| [10]|P.O. Box 283 8562...|  20620|(372) 587-2335|159|122|
|     [Raoul Pitch]|11/01/1921| [89]|5543 Aliquet St. ...|  20783|(717) 450-4729|149| 52|
|   [Doran Stabler]|13/02/2002| [15]|Ap #867-859 Sit R...|  39531|(793) 151-6230|134|115|
|      [Orpah Cobb]|20/08/1942| [68]|7292 Dictum Av. S...|  47096|(492) 709-6392|139| 78|
|  [Marzetta Board]|12/05/1973| [56]|Ap #651-8679 Soda...|  10855|(654) 393-5734|120|112|
|[Shelbie Hutching]|11/01/1991| [28]|191-103 Integer R...|   8219|(404) 960-3807|143|126|
|    [Bryson Dames]|18/11/1937| [88]|P.O. Box 887 2508...|  12482|(314) 244-6306|147| 75|
|   [Hardi

## ref_source=File

In [None]:
obfuscator_unique_ref_test = '''Will Perry#PATIENT
John Smith#PATIENT
Marvin MARSHALL#PATIENT
Hubert GROGAN#PATIENT
ALTHEA COLBURN#PATIENT
Kalil AMIN#PATIENT
Inci FOUNTAIN#PATIENT
Jackson WILLE#PATIENT
Jack SANTOS#PATIENT
Mahmood ALBURN#PATIENT
Marnie MELINGTON#PATIENT
Aysha GHAZI#PATIENT
Maryland CODER#PATIENT
Darene GEORGIOUS#PATIENT
Shelly WELLBECK#PATIENT
Min Kun JAE#PATIENT
Thomson THOMAS#PATIENT
Christian SUDDINBURG#PATIENT
Aberdeen#CITY
Louisburg St#STREET
France#LOC
Nick Riviera#DOCTOR
5552312#PHONE
St James Hospital#HOSPITAL
Calle del Libertador#ADDRESS
111#ID
Will#DOCTOR
20#AGE
30#AGE
40#AGE
50#AGE
60#AGE
'''

with open('obfuscator_unique_ref_test.txt', 'w') as f:
  f.write(obfuscator_unique_ref_test)

In [None]:
res= structured_deidentifier(spark,
                             input_file_path= "deid_structured_data.csv",
                             ref_source="file",
                             columns_dict={'NAME': 'PATIENT', 'AGE': 'AGE'},
                             columns_seed={"NAME": 23, "AGE": 23},
                             obfuscateRefFile="obfuscator_unique_ref_test.txt")

Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show()

+--------------------+----------+----+--------------------+-------+--------------+---+---+
|                NAME|       DOB| AGE|             ADDRESS|ZIPCODE|           TEL|SBP|DBP|
+--------------------+----------+----+--------------------+-------+--------------+---+---+
|[Christian SUDDIN...|04/02/1935|[60]|711-2880 Nulla St...|  69200|(257) 563-7401|101| 42|
|[Christian SUDDIN...|03/10/2009|[30]|P.O. Box 283 8562...|  20620|(372) 587-2335|159|122|
|    [Thomson THOMAS]|11/01/1921|[30]|5543 Aliquet St. ...|  20783|(717) 450-4729|149| 52|
|       [Aysha GHAZI]|13/02/2002|[40]|Ap #867-859 Sit R...|  39531|(793) 151-6230|134|115|
|       [Jack SANTOS]|20/08/1942|[40]|7292 Dictum Av. S...|  47096|(492) 709-6392|139| 78|
|    [Mahmood ALBURN]|12/05/1973|[40]|Ap #651-8679 Soda...|  10855|(654) 393-5734|120|112|
|     [Jackson WILLE]|11/01/1991|[60]|191-103 Integer R...|   8219|(404) 960-3807|143|126|
|    [Maryland CODER]|18/11/1937|[60]|P.O. Box 887 2508...|  12482|(314) 244-6306|147| 75|
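
The reference file uses one `value#LABEL` pair per line. The sketch below parses that format into per-label pools and picks a replacement deterministically from a seed. The selection rule is an assumption made for illustration (the library's own seeding logic is not shown here), and `parse_ref_file` and `pick_replacement` are hypothetical names.

```python
import random
from collections import defaultdict


def parse_ref_file(text):
    """Group the values of a 'value#LABEL' reference file by label."""
    pools = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        value, label = line.rsplit("#", 1)
        pools[label].append(value)
    return dict(pools)


def pick_replacement(original, label, seed, pools):
    """Pick a value from the label's pool; same inputs always give the same pick."""
    # Assumption: seed the generator from the column seed and the original value.
    rng = random.Random(f"{seed}:{label}:{original}")
    return rng.choice(pools[label])


ref_text = """Will Perry#PATIENT
John Smith#PATIENT
Nick Riviera#DOCTOR
20#AGE
30#AGE
"""

pools = parse_ref_file(ref_text)
print(pools["AGE"])  # ['20', '30']
print(pick_replacement("Cecilia Chapman", "PATIENT", 23, pools))
```

With a small pool, different originals can receive the same replacement, as with the repeated names and ages in the table above.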

## shift days

In [None]:
# The "shift_days" parameter shifts dates by n days in structured deidentification when the column is mapped to DATE.

df = spark.createDataFrame([
            ["Juan García", "13/02/1977", "711 Nulla St.", "140", "673 431234"],
            ["Will Smith", "23/02/1977", "1 Green Avenue.", "140", "+23 (673) 431234"],
            ["Pedro Ximénez", "11/04/1900", "Calle del Libertador, 7", "100", "912 345623"]
        ]).toDF("NAME", "DOB", "ADDRESS", "SBP", "TEL")

df_pd= df.toPandas()
df_pd.to_csv("deid_dayshift_structured_data.csv", index=False)
df_pd.head()

            NAME         DOB                  ADDRESS  SBP               TEL
0    Juan García  13/02/1977            711 Nulla St.  140        673 431234
1     Will Smith  23/02/1977          1 Green Avenue.  140  +23 (673) 431234
2  Pedro Ximénez  11/04/1900  Calle del Libertador, 7  100        912 345623


In [None]:
res= structured_deidentifier(spark,
                             input_file_path= "deid_dayshift_structured_data.csv",
                             columns_dict= {"NAME": "ID", "DOB": "DATE"},
                             columns_seed= {"NAME": 23, "DOB": 23},
                             ref_source="faker",
                             shift_days= 5)

Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [None]:
res.show()

+---------------+------------+--------------------+---+----------------+
|           NAME|         DOB|             ADDRESS|SBP|             TEL|
+---------------+------------+--------------------+---+----------------+
|  [CZSE QPBDDS]|[18/02/1977]|       711 Nulla St.|140|      673 431234|
|   [ZJXU BWPNO]|[28/02/1977]|     1 Green Avenue.|140|+23 (673) 431234|
|[VYFRH INQLTXI]|[16/04/1900]|Calle del Liberta...|100|      912 345623|
+---------------+------------+--------------------+---+----------------+
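
For structured data, the shift is applied per column rather than per detected entity. A minimal sketch follows, assuming day-first `DD/MM/YYYY` dates as in this table and a `columns_dict`-style mapping from column name to entity. `shift_date_columns` is a hypothetical helper, not the library's implementation.

```python
from datetime import datetime, timedelta


def shift_date_columns(rows, columns_dict, shift_days, fmt="%d/%m/%Y"):
    """Shift every column mapped to DATE by shift_days; copy other columns as-is."""
    date_columns = [col for col, entity in columns_dict.items() if entity == "DATE"]
    shifted_rows = []
    for row in rows:
        new_row = dict(row)
        for col in date_columns:
            parsed = datetime.strptime(row[col], fmt)
            new_row[col] = (parsed + timedelta(days=shift_days)).strftime(fmt)
        shifted_rows.append(new_row)
    return shifted_rows


rows = [
    {"NAME": "Juan García", "DOB": "13/02/1977"},
    {"NAME": "Will Smith", "DOB": "23/02/1977"},
    {"NAME": "Pedro Ximénez", "DOB": "11/04/1900"},
]

shifted = shift_date_columns(rows, {"NAME": "ID", "DOB": "DATE"}, shift_days=5)
for row in shifted:
    print(row["DOB"])
```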

