![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.6.Light_Deidentification.ipynb)

# Light Deidentification

# Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical
# Automatically load license data and start a session with all JARs the user has access to

spark = nlp.start()

In [5]:
spark

In [10]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only
from sparknlp_jsl.pipeline_tracer import PipelineTracer

import pandas as pd
import json
import string
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

# LightDeIdentification

Light DeIdentification is designed to accelerate de-identification by removing regex and token usage, which increases performance significantly.

De-identification takes effect after entity recognition. With the Light DeIdentification annotator, regex rules and similar steps no longer interfere with the NER process. So if the NER results are satisfactory, using the Light DeIdentification annotator is recommended.

With default parameters, Light DeIdentification is approximately 4x faster than DeIdentification.

Light DeIdentification is a light version of `DeIdentification`. It replaces sensitive information in a text with `obfuscated` or `masked` fakers. It is designed to work with healthcare data, and it can be used to de-identify **patient names, dates**, and other sensitive information. It can also be used to **obfuscate** or **mask** any other type of sensitive information, such as *doctor names, hospital names*, and so on. Additionally, it supports millions of embedded `fakers`, and, if desired, `custom external fakers` can be set with the **setCustomFakers** function. It also supports multiple languages such as English, Spanish, French, German, and Arabic, and it supports multi-mode de-identification with the **setSelectiveObfuscationModes** function at the same time.



# Pipeline


# mode: obfuscate, parameters: defaults
By default, sensitive data is obfuscated with fake data from the JSL Faker, except for `DATE` entities.

In [6]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
ner_subentity = medical.NerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_subentity")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_subentity"])\
    .setOutputCol("ner_chunk")

light_deidentification = medical.LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate") \
    .setLanguage("en") \
    .setSeed(10)

nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        ner_subentity,
        ner_converter,
        light_deidentification
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented download started this may take some time.
[OK!]


In [7]:
text ='''
Record date : 2093-01-13 , David Hale , M.D . ,
Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .
PCP : Oliveira , 95 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
'''

In [8]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

In [11]:
print("detected sensitive data by NER model:\n\n")
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

detected sensitive data by NER model:


+-----------------------------+-------------+
|chunk                        |ner_label    |
+-----------------------------+-------------+
|2093-01-13                   |DATE         |
|David Hale                   |DOCTOR       |
|Hendrickson Ora              |PATIENT      |
|7194334                      |MEDICALRECORD|
|01/13/93                     |DATE         |
|Oliveira                     |DOCTOR       |
|2079-11-09                   |DATE         |
|Cocke County Baptist Hospital|HOSPITAL     |
|0295 Keats Street            |STREET       |
|55-555-5555                  |PHONE        |
+-----------------------------+-------------+



In [12]:
result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.obfuscated.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated")).toPandas()

Unnamed: 0,sentence,obfuscated
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : <DATE> , Shery Done , M.D ."
1,",\nName : Hendrickson Ora , MR # 7194334 Date ...",",\nName : Linda Repress , MR # 1610960 Date : ..."
2,"PCP : Oliveira , 95 years-old , Record date : ...","PCP : Clenton Czech , 95 years-old , Record da..."
3,"Cocke County Baptist Hospital , 0295 Keats Str...","BOULDER COMMUNITY HOSPITAL , 3020 West Wheatla..."


With default parameters, in **obfuscate** mode, recognized entities other than `DATE` are obfuscated with values from the **Faker** module.
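The replacement step can be pictured in plain Python. This is only a sketch of the idea, not the annotator's implementation: `replace_chunks`, `span`, and the `fakes` dictionary are hypothetical names, and the `<LABEL>` fallback simply mimics the `<DATE>` seen in the output above.

```python
from typing import Dict, List, Tuple

# Hypothetical chunk representation: (begin, end_inclusive, label).
Chunk = Tuple[int, int, str]

def span(text: str, sub: str, label: str) -> Chunk:
    """Locate `sub` in `text` and return its inclusive character span."""
    start = text.index(sub)
    return (start, start + len(sub) - 1, label)

def replace_chunks(sentence: str, chunks: List[Chunk], fakes: Dict[str, str]) -> str:
    """Replace each chunk span with a fake value, or <LABEL> if none is given."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for begin, end, label in sorted(chunks, reverse=True):
        replacement = fakes.get(label, f"<{label}>")
        sentence = sentence[:begin] + replacement + sentence[end + 1:]
    return sentence

sentence = "Record date : 2093-01-13 , David Hale , M.D ."
chunks = [span(sentence, "2093-01-13", "DATE"), span(sentence, "David Hale", "DOCTOR")]
fakes = {"DOCTOR": "Shery Done"}  # no DATE entry, so DATE falls back to <DATE>
print(replace_chunks(sentence, chunks, fakes))
```

Replacing from the last span to the first avoids recomputing offsets when a fake value is longer or shorter than the original chunk.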

## obfuscate DATEs

In [13]:
light_deidentification = medical.LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate") \
    .setObfuscateDate(True)\
    .setLanguage("en") \
    .setSeed(10)


nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        ner_subentity,
        ner_converter,
        light_deidentification
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [14]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.sentence.result, result.obfuscated.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated")).toPandas()


Unnamed: 0,sentence,obfuscated
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : 2093-01-29 , Shery Done , M.D ."
1,",\nName : Hendrickson Ora , MR # 7194334 Date ...",",\nName : Linda Repress , MR # 1610960 Date : ..."
2,"PCP : Oliveira , 95 years-old , Record date : ...","PCP : Clenton Czech , 95 years-old , Record da..."
3,"Cocke County Baptist Hospital , 0295 Keats Str...","BOULDER COMMUNITY HOSPITAL , 3020 West Wheatla..."


Now the dates that were previously left as `<DATE>` are also *obfuscated* with fake dates from the JSL Faker.

## Obfuscate with custom fake data

Instead of the fake values that come from the JSL Faker module, we can use custom fake data.

Note: when using custom fake data, there is no need to set these entity names: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE.

By default, those entities are obfuscated by the JSL Faker module.
If we add the `.setSameLengthFormattedEntities([])` parameter, the custom fake data takes effect for them as well.

In [15]:
light_deidentification = medical.LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate") \
    .setObfuscateDate(True)\
    .setDateFormats(["MM/dd/yyyy","yyyy-MM-dd" ]) \
    .setDays(7) \
    .setObfuscateRefSource('custom') \
    .setCustomFakers({"Doctor": ["John", "Joe"],
                      "Patient": ["James", "Michael"],
                      "Hospital": ["Medical Center"],
                      "Street" : ["Main Street"],
                      "Age":["1","10", "20", "40","80"],
                      "PHONE":["000-000-0000"]
                      }) \
    .setSameLengthFormattedEntities([])\
    .setAgeRanges([1, 4, 12, 20, 40, 60, 80])\
    .setLanguage("en") \
    .setSeed(42) \
    .setDateEntities(["DATE", "DOB",  "DOD"])



nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        ner_subentity,
        ner_converter,
        light_deidentification
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text ='''
Record date : 2093-01-13 , David Hale , M.D . ,
Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .
PCP : Oliveira , 95 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555.
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.sentence.result, result.obfuscated.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated")).toPandas()

Unnamed: 0,sentence,obfuscated
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : 2093-01-20 , John , M.D ."
1,",\nName : Hendrickson Ora , MR # 7194334 Date ...",",\nName : Michael , MR # <MEDICALRECORD> Date ..."
2,"PCP : Oliveira , 95 years-old , Record date : ...","PCP : Joe , 95 years-old , Record date : 2079-..."
3,"Cocke County Baptist Hospital , 0295 Keats Str...","Medical Center , Main Street , Phone 000-000-0..."
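A small plain-Python sketch of how a custom faker lookup could behave. It is an illustration under assumptions, not the library's code: `pick_fake` is a hypothetical helper, the case-insensitive key match is inferred from `"Doctor"` being applied to `DOCTOR` chunks, and the `<LABEL>` fallback mirrors the `<MEDICALRECORD>` in the output above.

```python
import random

def pick_fake(label, custom_fakers, rng):
    """Pick a custom fake for `label`, ignoring case; else fall back to <LABEL>."""
    lookup = {key.upper(): values for key, values in custom_fakers.items()}
    candidates = lookup.get(label.upper())
    if not candidates:
        # No custom values for this label: fall back to the entity label.
        return f"<{label.upper()}>"
    return rng.choice(candidates)

custom_fakers = {"Doctor": ["John", "Joe"], "Hospital": ["Medical Center"]}
rng = random.Random(42)  # a fixed seed makes the choices reproducible
print(pick_fake("DOCTOR", custom_fakers, rng))
print(pick_fake("HOSPITAL", custom_fakers, rng))
print(pick_fake("MEDICALRECORD", custom_fakers, rng))
```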


NOTE: Ages over 95 are not obfuscated. The date `01/13/93` is recognized, but it is obfuscated by the JSL Faker rather than shifted by `.setDays(7)`, because its format is `MM/dd/yy`, which is not in the list given to `.setDateFormats(["MM/dd/yyyy","yyyy-MM-dd" ])`.
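The effect of the format list can be sketched with the standard library. This is a simplified illustration: `shift_date` and the `JAVA_TO_STRPTIME` mapping are hypothetical, cover only the three patterns used in this notebook, and Python's `strptime` is only a stand-in for whatever parser the annotator uses.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Hand-written mapping from the Java-style patterns used in the notebook
# to Python strptime patterns (assumption: only these three are needed).
JAVA_TO_STRPTIME = {
    "MM/dd/yyyy": "%m/%d/%Y",
    "yyyy-MM-dd": "%Y-%m-%d",
    "MM/dd/yy": "%m/%d/%y",
}

def shift_date(text: str, formats: List[str], days: int) -> Optional[str]:
    """Shift `text` by `days` if it matches one of `formats`, else return None."""
    for java_pattern in formats:
        pattern = JAVA_TO_STRPTIME[java_pattern]
        try:
            parsed = datetime.strptime(text, pattern)
        except ValueError:
            continue  # this format does not match, try the next one
        return (parsed + timedelta(days=days)).strftime(pattern)
    return None  # no listed format matched

print(shift_date("2093-01-13", ["MM/dd/yyyy", "yyyy-MM-dd"], 7))
print(shift_date("01/13/93", ["MM/dd/yyyy", "yyyy-MM-dd"], 7))
print(shift_date("01/13/93", ["MM/dd/yyyy", "yyyy-MM-dd", "MM/dd/yy"], 7))
```

In this sketch a two-digit year only matches once `MM/dd/yy` is added to the list, which is the situation described in the note above.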


In [16]:
light_deidentification = medical.LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate") \
    .setObfuscateDate(True)\
    .setDateFormats(["MM/dd/yyyy","yyyy-MM-dd", "MM/dd/yy"]) \
    .setDays(7) \
    .setObfuscateRefSource('custom') \
    .setCustomFakers({"Doctor": ["John", "Joe"],
                      "Patient": ["James", "Michael"],
                      "Hospital": ["Medical Center"],
                      "Street" : ["Main Street"],
                      "Age":["1","10", "20", "40","80"],"PHONE":["555-555-0000"]
                      }) \
    .setAgeRanges([1, 4, 12, 20, 40, 60, 80])\
    .setLanguage("en") \
    .setSeed(42) \
    .setDateEntities(["DATE", "DOB",  "DOD"])



nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      ner_subentity,
      ner_converter,
      light_deidentification
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text ='''
Record date : 2093-01-13 , David Hale , M.D . ,
Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .
PCP : Oliveira , 95 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555.
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.sentence.result, result.obfuscated.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated")).toPandas()


Unnamed: 0,sentence,obfuscated
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : 2093-01-20 , John , M.D ."
1,",\nName : Hendrickson Ora , MR # 7194334 Date ...",",\nName : Michael , MR # 1610960 Date : 01/20/..."
2,"PCP : Oliveira , 95 years-old , Record date : ...","PCP : Joe , 95 years-old , Record date : 2079-..."
3,"Cocke County Baptist Hospital , 0295 Keats Str...","Medical Center , Main Street , Phone 45-409-8119."


Now `01/13/93` is shifted by 7 days and obfuscated as `01/20/93`.

Alternatively, setting the region with `.setRegion('us')` makes the predefined date formats take effect.

In [17]:
light_deidentification = medical.LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate") \
    .setObfuscateDate(True)\
    .setDays(7) \
    .setObfuscateRefSource('custom') \
    .setCustomFakers({"Doctor": ["John", "Joe"],
                      "Patient": ["James", "Michael"],
                      "Hospital": ["Medical Center"],
                      "Street" : ["Main Street"],"Phone":["555-555-0000"],
                      "Age":["1","10", "20", "40","80"],"PHONE":["555-555-0000"], "SSN":["123-22-9999"]
                      }) \
    .setAgeRanges([1, 4, 12, 20, 40, 60, 80])\
    .setLanguage("en") \
    .setRegion('us') \
    .setSeed(42) \
    .setDateEntities(["DATE", "DOB",  "DOD"]) \
    .setSameLengthFormattedEntities([])



nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      ner_subentity,
      ner_converter,
      light_deidentification
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text ='''
Record date : 2093-01-13 , David Hale , M.D . ,
Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .
PCP : Oliveira , 95 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 SSN: 123-22-4567.
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.sentence.result, result.obfuscated.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated")).toPandas()


Unnamed: 0,sentence,obfuscated
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : 2093-01-20 , John , M.D ."
1,",\nName : Hendrickson Ora , MR # 7194334 Date ...",",\nName : Michael , MR # <MEDICALRECORD> Date ..."
2,"PCP : Oliveira , 95 years-old , Record date : ...","PCP : Joe , 95 years-old , Record date : 2079-..."
3,"Cocke County Baptist Hospital , 0295 Keats Str...","Medical Center , Main Street , Phone 555-555-0..."


# mode: mask, parameters: defaults

In [18]:
light_deidentification = medical.LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("masked") \
    .setMode("mask")



nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      ner_subentity,
      ner_converter,
      light_deidentification
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text ='''
Record date : 2093-01-13 , David Hale , M.D . ,
Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .
PCP : Oliveira , 55 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.sentence.result, result.masked.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated")).toPandas()


Unnamed: 0,sentence,obfuscated
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : <DATE> , <DOCTOR> , M.D ."
1,",\nName : Hendrickson Ora , MR # 7194334 Date ...",",\nName : <PATIENT> , MR # <MEDICALRECORD> Dat..."
2,"PCP : Oliveira , 55 years-old , Record date : ...","PCP : <DOCTOR> , <AGE> years-old , Record date..."
3,"Cocke County Baptist Hospital , 0295 Keats Str...","<HOSPITAL> , <STREET> , Phone <PHONE>"


All sensitive entities are masked with their entity label by default.

There are 3 masking policies:


*   `entity_labels` (default)
*   `fixed_length_chars`
*   `same_length_chars`
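The three policies can be sketched in a few lines of plain Python. This is a hypothetical `mask` helper, not the annotator itself; the bracket handling for `same_length_chars` and the fallback for very short chunks are assumptions drawn from the outputs shown in this notebook, and the length of 5 is taken from the `setFixedMaskLength(5)` example rather than from a documented default.

```python
def mask(chunk: str, label: str, policy: str = "entity_labels", fixed_length: int = 5) -> str:
    """Mask `chunk` according to one of the three masking policies."""
    if policy == "entity_labels":
        return f"<{label}>"
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    if policy == "same_length_chars":
        # Assumption: the brackets count toward the length, and chunks too
        # short to hold brackets are replaced with asterisks only.
        if len(chunk) < 3:
            return "*" * len(chunk)
        return "[" + "*" * (len(chunk) - 2) + "]"
    raise ValueError(f"unknown masking policy: {policy}")

for policy in ("entity_labels", "fixed_length_chars", "same_length_chars"):
    print(policy, "->", mask("David Hale", "DOCTOR", policy))
```

Only `same_length_chars` preserves the length of the original chunk, which keeps character offsets in the masked text aligned with the source.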




## fixed_length_chars

In [19]:
light_deidentification = medical.LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("masked") \
    .setMode("mask") \
    .setMaskingPolicy("fixed_length_chars") \
    .setFixedMaskLength(5)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      ner_subentity,
      ner_converter,
      light_deidentification
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text ='''
Record date : 2093-01-13 , David Hale , M.D . ,
Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .
PCP : Oliveira , 55 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.sentence.result, result.masked.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated")).toPandas()


Unnamed: 0,sentence,obfuscated
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : ***** , ***** , M.D ."
1,",\nName : Hendrickson Ora , MR # 7194334 Date ...",",\nName : ***** , MR # ***** Date : ***** ."
2,"PCP : Oliveira , 55 years-old , Record date : ...","PCP : ***** , ***** years-old , Record date : ..."
3,"Cocke County Baptist Hospital , 0295 Keats Str...","***** , ***** , Phone *****"


## same_length_chars

In [20]:
light_deidentification = medical.LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("masked") \
    .setMode("mask") \
    .setMaskingPolicy("same_length_chars")



nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      ner_subentity,
      ner_converter,
      light_deidentification
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text ='''
Record date : 2093-01-13 , David Hale , M.D . ,
Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .
PCP : Oliveira , 55 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.sentence.result, result.masked.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("masked")).toPandas()


Unnamed: 0,sentence,masked
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : [********] , [********] , M.D ."
1,",\nName : Hendrickson Ora , MR # 7194334 Date ...",",\nName : [*************] , MR # [*****] Date ..."
2,"PCP : Oliveira , 55 years-old , Record date : ...","PCP : [******] , ** years-old , Record date : ..."
3,"Cocke County Baptist Hospital , 0295 Keats Str...","[***************************] , [*************..."


# multimode:
To de-identify some entities by masking and others with fake data, we can use the multi-mode option as follows:

In [21]:
light_deidentification = medical.LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("masked") \
    .setMode("mask") \
    .setObfuscateDate(True) \
    .setDays(7)\
    .setAgeRanges([1,4,10,20,40,60,80,100])\
    .setSelectiveObfuscationModes({"OBFUSCATE": ["Date","Street","Doctor", "Patient","Age"],
                                    "mask_same_length_chars": ["MEDICALRECORD", "Phone"],
                                    "mask_entity_labels": ["HOSPITAL"],
                                    })



nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      ner_subentity,
      ner_converter,
      light_deidentification
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text ='''
Record date : 2093-01-13 , David Hale , M.D . ,
Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .
PCP : Oliveira , 55 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.sentence.result, result.masked.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("masked")).toPandas()


Unnamed: 0,sentence,masked
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : 2093-01-20 , Elinor Guardian , M..."
1,",\nName : Hendrickson Ora , MR # 7194334 Date ...",",\nName : Dolph Friar , MR # [*****] Date : 01..."
2,"PCP : Oliveira , 55 years-old , Record date : ...","PCP : Ruthell Cowboy , 50 years-old , Record d..."
3,"Cocke County Baptist Hospital , 0295 Keats Str...","<HOSPITAL> , 2835 Us Hwy 231 N , Phone [******..."
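The routing configured through `setSelectiveObfuscationModes` can be pictured as a lookup from entity label to mode. The sketch below is illustrative only: `build_label_modes` and `mode_for` are hypothetical helpers, and both the case-insensitive matching and the fallback to the global `setMode` value for unlisted labels are assumptions.

```python
def build_label_modes(selective_modes: dict) -> dict:
    """Invert {mode: [labels]} into {LABEL: mode}, normalizing case."""
    label_modes = {}
    for mode, labels in selective_modes.items():
        for label in labels:
            label_modes[label.upper()] = mode.lower()
    return label_modes

def mode_for(label: str, label_modes: dict, default_mode: str) -> str:
    """Return the mode for `label`, or the global mode if it is not listed."""
    return label_modes.get(label.upper(), default_mode)

selective_modes = {
    "OBFUSCATE": ["Date", "Street", "Doctor", "Patient", "Age"],
    "mask_same_length_chars": ["MEDICALRECORD", "Phone"],
    "mask_entity_labels": ["HOSPITAL"],
}
label_modes = build_label_modes(selective_modes)
for label in ["DOCTOR", "PHONE", "HOSPITAL", "SSN"]:
    print(label, "->", mode_for(label, label_modes, default_mode="mask"))
```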


To obfuscate DATEs, we set these two parameters:
* `.setObfuscateDate(True)`
* `.setDays(7)`

and add `Date` to the `OBFUSCATE` list in the dictionary below.

For the other entities, we pass this dictionary to the `.setSelectiveObfuscationModes()` parameter:


    {"OBFUSCATE": ["Date", "Street","Doctor", "Patient","Age"],
      "mask_same_length_chars": ["MEDICALRECORD", "Phone"],
      "mask_entity_labels": ["HOSPITAL"],
      }

Also, `Age` is in the `OBFUSCATE` list and `.setAgeRanges()` is set, so the fake age stays within the range of the original age: 55 is replaced with a value between 40 and 60.
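A minimal sketch of range-preserving age obfuscation, assuming half-open buckets such as `[40, 60)` and leaving ages outside the configured ranges untouched. Both choices are assumptions of this sketch, and `fake_age` is a hypothetical helper rather than a library function.

```python
import bisect
import random

def fake_age(age: int, age_ranges: list, rng: random.Random) -> int:
    """Return a random age drawn from the same bucket as `age`."""
    # Index of the first boundary strictly greater than `age`.
    upper_index = bisect.bisect_right(age_ranges, age)
    if upper_index == 0 or upper_index == len(age_ranges):
        return age  # outside the configured ranges: left unchanged in this sketch
    low, high = age_ranges[upper_index - 1], age_ranges[upper_index]
    return rng.randint(low, high - 1)

age_ranges = [1, 4, 10, 20, 40, 60, 80, 100]
rng = random.Random(7)
print(fake_age(55, age_ranges, rng))  # some value in the 40-60 bucket
```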

In [22]:
light_deidentification = medical.LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("masked") \
    .setMode("mask") \
    .setObfuscateDate(True) \
    .setDays(7)\
    .setRegion("us")\
    .setUnnormalizedDateMode("skip") \
    .setAgeRanges([1,4,10,20,40,60,80,100])\
    .setSelectiveObfuscationModes({"OBFUSCATE": ["Date","Street","Doctor", "Patient","Age"],
                                    "mask_same_length_chars": ["MEDICALRECORD", "Phone"],
                                    "mask_entity_labels": ["HOSPITAL"],
                                    })



nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      ner_subentity,
      ner_converter,
      light_deidentification
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

text ='''
Record date : 2093-01-13 , David Hale , M.D . ,
Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .
PCP : Oliveira , 55 years-old , Record date : 2079-11-09 . Discharged date : April 9 2024.
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.sentence.result, result.masked.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("masked")).toPandas()


Unnamed: 0,sentence,masked
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : 2093-01-20 , Dara Ear , M.D ."
1,",\nName : Hendrickson Ora , MR # 7194334 Date ...",",\nName : Alden Humphrey , MR # [*****] Date :..."
2,"PCP : Oliveira , 55 years-old , Record date : ...","PCP : Zelpha Hides , 53 years-old , Record dat..."
3,Discharged date : April 9 2024.,Discharged date : April 9 2024.
4,"Cocke County Baptist Hospital , 0295 Keats Str...","<HOSPITAL> , 1201 West Frank Avenue , Phone [*..."


Note: the date format of the `April 9 2024` chunk is not recognized, so the `.setUnnormalizedDateMode("skip")` parameter takes effect and the chunk is left unchanged. Options: `mask`, `obfuscate`, `skip`. Default: `obfuscate`.
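The three options can be read as a fallback that applies only when a date cannot be parsed. The sketch below is a simplified stand-in for that behavior: `deidentify_date`, its single `pattern`, and the `fallback_fake` value are hypothetical, and the real annotator's parsing and fake-date generation will differ.

```python
from datetime import datetime, timedelta

def deidentify_date(text, days, unnormalized_mode="obfuscate",
                    pattern="%Y-%m-%d", fallback_fake="2000-01-01"):
    """Shift a parseable date; otherwise apply the unnormalized-date mode."""
    try:
        parsed = datetime.strptime(text, pattern)
    except ValueError:
        if unnormalized_mode == "skip":
            return text            # leave the chunk untouched
        if unnormalized_mode == "mask":
            return "<DATE>"        # replace with the entity label
        if unnormalized_mode == "obfuscate":
            return fallback_fake   # placeholder for a generated fake date
        raise ValueError(f"unknown mode: {unnormalized_mode}")
    return (parsed + timedelta(days=days)).strftime(pattern)

for mode in ("skip", "mask", "obfuscate"):
    print(mode, "->", deidentify_date("April 9 2024", 7, mode))
```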

In [23]:
data = pd.DataFrame(
    {'patientID' : ['A001', 'A002', 'A003', 'A004'],
     'text' : ['Chris Brown was discharged on 10/02/2022',
               'Mark White was discharged on 03/01/2020',
               'John was discharged on 03/15/2022',
               'John Moore was discharged on 12/31/2022'
              ],
     'dateshift' : ['10', '-2', '30', '-8']
    }
)

my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate=False)

+---------+----------------------------------------+---------+
|patientID|text                                    |dateshift|
+---------+----------------------------------------+---------+
|A001     |Chris Brown was discharged on 10/02/2022|10       |
|A002     |Mark White was discharged on 03/01/2020 |-2       |
|A003     |John was discharged on 03/15/2022       |30       |
|A004     |John Moore was discharged on 12/31/2022 |-8       |
+---------+----------------------------------------+---------+



# shiftdays

We can obfuscate dates not only randomly or by adding/subtracting a constant number of days, but also with a shift that changes per document:

In [24]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document2"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
ner_subentity = medical.NerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_subentity")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_subentity"])\
    .setOutputCol("ner_chunk")

light_deidentification = medical.LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate") \
    .setObfuscateDate(True)\
    .setLanguage("en") \
    .setSeed(10) \
    .setUseShiftDays(True)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      documentHasher,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      ner_subentity,
      ner_converter,
      light_deidentification
])

empty_data = spark.createDataFrame([["",""]]).toDF("text","dateshift")

model = nlpPipeline.fit(empty_data)

result = model.transform(my_input_df)
result.select(F.explode(F.arrays_zip(result.sentence.result, result.obfuscated.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated")).toPandas()

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented download started this may take some time.
[OK!]


Unnamed: 0,sentence,obfuscated
0,Chris Brown was discharged on 10/02/2022,Vanderbilt Gene was discharged on 07/04/2022
1,Mark White was discharged on 03/01/2020,Efren Grapes was discharged on 28/02/2020
2,John was discharged on 03/15/2022,Samantha Cress was discharged on 05/10/2022
3,John Moore was discharged on 12/31/2022,Derry Flock was discharged on 02/25/2023
