![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **Deidentification**

This notebook covers the parameters and usage of the `Deidentification` annotator, which can obfuscate or mask entities that contain personal information.

**📖 Learning Objectives:**

1. Background: Understand the Deidentification module

2. Colab setup

3. Become comfortable with deidentification using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [Deidentification](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#deidentification)

- Python Docs : [Deidentification](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/deid/deIdentification/index.html#sparknlp_jsl.annotator.deid.deIdentification.DeIdentification)

- Scala Docs : [Deidentification](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/deid/DeIdentification)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Healthcare).

## **📜 Background**


Deidentification is a critical technology for enabling the use of structured or unstructured clinical text while protecting patient privacy and confidentiality. John Snow Labs teams have invested great effort in developing methods and corpora for deidentifying clinical text, PDF, image, and DICOM data that contain Protected Health Information (PHI), that is, information about:

*   individual’s past, present, or future physical or mental health or condition.
*   provision of health care to the individual.
*   past, present, or future payment for the health care.

Protected health information includes many common identifiers (e.g., name, address, birth date, Social Security Number) when they can be associated with the health information.

Spark NLP for Healthcare offers several techniques and strategies for deidentification. The principal ones are:


*   **Mask**: replace each entity with a masking sequence, according to one of three policies:

    *   `entity_labels`: Mask with the entity type of that chunk. (default)
    *   `same_length_chars`: Mask the entity with asterisks ( * ) enclosed in brackets ( [ , ] ), keeping the total length equal to the original.
    *   `fixed_length_chars`: Mask the entity with a fixed number of asterisks ( * ). The length is set with the `setFixedMaskLength()` method.


*   **Obfuscation**: replace sensitive entities with random values of the same type.

*   **Faker**: allows the user to use a set of fake entities that are held in memory by spark-nlp-internal.

There is also an advanced option for deidentifying with multiple modes at the same time (multi-mode deidentification).
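
To make the three masking policies concrete, here is a minimal plain-Python sketch of the string transformation each one describes. It is not the annotator's implementation: `mask_entity` and its arguments are hypothetical names chosen for illustration, and the rule for entities shorter than 3 characters follows the `maskingPolicy` description in the parameter list of this notebook.

```python
def mask_entity(text: str, label: str, policy: str = "entity_labels",
                fixed_length: int = 4) -> str:
    """Illustrative version of the three masking policies (not the library code)."""
    if policy == "entity_labels":
        # Replace the chunk with its entity label, e.g. <NAME>
        return f"<{label}>"
    if policy == "same_length_chars":
        # Entities shorter than 3 characters get asterisks without brackets
        if len(text) < 3:
            return "*" * len(text)
        # Otherwise the two brackets count toward the total length
        return "[" + "*" * (len(text) - 2) + "]"
    if policy == "fixed_length_chars":
        # Always the same number of asterisks, whatever the entity length
        return "*" * fixed_length
    raise ValueError(f"Unknown masking policy: {policy}")


for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_entity("David Hale", "NAME", policy))
```

With `same_length_chars`, a 10-character name such as `David Hale` becomes 8 asterisks inside two brackets, so the masked text occupies the same span as the original.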

## **🎬 Colab Setup**

This module is licensed, so you need a valid license json file.



Installing johnsnowlabs:

In [None]:
!pip install -q johnsnowlabs


In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_7566 - Copie.json to spark_nlp_for_healthcare_spark_ocr_7566 - Copie.json


In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7566 - Copie.json
🚨 Outdated OCR Secrets in license file. Version=4.3.3 but should be Version=4.4.0
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.4.1-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.4.1-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.4.1.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.4.1.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7566 - Copie.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.4.1-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==4.4.1 installed! ✅ Heal the planet with NLP! 


Starting spark session:

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # used below as F.explode, F.arrays_zip, F.expr
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp
from array import array


# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7566 - Copie.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.1, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`, `TOKEN`

- Output: `DOCUMENT`

## **🔎 Parameters**


- `ageRanges`: (IntArrayParam)
List of integers specifying limits of the age groups to preserve during obfuscation

- `blackList`: (StringArrayParam)
List of entities that will be ignored in the regex file.

- `consistentObfuscation`: (BooleanParam)
Whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.

- `dateFormats`: (StringArrayParam)
Format of dates to displace

- `dateTag`: (Param[String])
Tag representing the date NER entity (default: DATE).

- `dateToYear`: (BooleanParam)
true if dates must be converted to years, false otherwise

- `days`: (IntParam)
Number of days by which dates are displaced during obfuscation.

- `fixedMaskLength`: (IntParam)
Select the fixed mask length: this is the length of the masking sequence that will be used when the 'fixed_length_chars' masking policy is selected.

- `ignoreRegex`: (BooleanParam)
Whether to ignore the regex file loaded in the model.

- `isRandomDateDisplacement`: (BooleanParam)
Whether to displace dates by a random number of days. If true, the displacement is a random number derived from `seed`; if false, the value of `days` is used (default: false).

- `language`: (Param[String])
The language used to select the regex file and some faker entities: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), or 'ro' (Romanian).

- `mappingsColumn`: (Param[String])
Name of the mapping column that returns the annotation chunks together with their fake entities.

- `maskingPolicy`: (Param[String])
Select the masking policy:
  - `same_length_chars`: Replace the entity with a masking sequence of asterisks surrounded by square brackets, with the total length of the masking sequence equal to the length of the original. Example: Smith -> [***]. If the entity is shorter than 3 characters (like Jo, or 5), asterisks without brackets are returned.
  - `entity_labels`: Replace the values with the corresponding entity labels.
  - `fixed_length_chars`: Replace the entity with a masking sequence composed of a fixed number of asterisks.

- `minYear`: (IntParam)
Minimum year to use when converting date to year

- `mode`: (Param[String])
Mode for the anonymizer: 'mask' or 'obfuscate'.

- `obfuscateDate`: (BooleanParam)
Whether to obfuscate dates when the mode is 'obfuscate'.

- `obfuscateRefFile`: (Param[String])
File with the terms to be used for Obfuscation

- `obfuscateRefSource`: (Param[String])
The source of the values used to obfuscate entities: 'file', 'faker', or 'both'. This setting does not apply to date entities.

- `outputAsDocument`: (BooleanParam)
Whether to return all sentences joined into a single document

- `refFileFormat`: (Param[String])
Format of the reference file for obfuscation (default: "csv").

- `refSep`: (Param[String])
Separator character for the csv reference file for obfuscation (default: "#").

- `regexOverride`: (BooleanParam)
If true, regex entities take priority; if false, NER entities take priority.

- `regexPatternsDictionary`: (ExternalResourceParam)
Dictionary of regular expression patterns that match protected entities. If no dictionary is set, the default regex file is used.

- `region`: (Param[String])
Region: 'us' or 'eu'.

- `returnEntityMappings`: (BooleanParam)
Whether to return the mapping column.

- `sameEntityThreshold`: (DoubleParam)
Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9). This does not apply to date entities.

- `sameLengthFormattedEntities`: (StringArrayParam)
List of formatted entities to generate the same length outputs as original ones during obfuscation.

- `seed`: (IntParam)
Seed used to select entities in obfuscate mode. With a fixed seed, repeated executions produce the same output.

- `selectiveObfuscationModesPath`: (Param[String])
Path to the JSON dictionary that contains the selective obfuscation modes.

- `unnormalizedDateMode`: (Param[String])
The mode to use if the date is not formatted.

- `zipCodeTag`: (Param[String])
Tag representing zip codes in the obfuscate reference file (default: ZIP).
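
The `consistentObfuscation` and `sameEntityThreshold` descriptions above rely on a Levenshtein-based similarity. The following is a minimal sketch of that idea, not the library's code: the normalization (1 minus distance divided by the longer length) and the helper names `levenshtein`, `similarity`, and `same_entity` are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def same_entity(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two mentions as the same entity when similarity reaches the threshold."""
    return similarity(a, b) >= threshold


print(same_entity("Hendrickson Ora", "Hendricksen Ora"))  # one edit in 15 characters
print(same_entity("David Hale", "Oliveira"))
```

Under this assumed normalization, a single-character misspelling of a 15-character name still scores above 0.9, so both mentions would receive the same replacement.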


We will cover the most important parameters of the deidentification annotator below.

### `setMode()`

The `setMode` parameter selects the anonymizer mode: 'mask' or 'obfuscate'.

#### **`Mask`** mode
In mask mode, we can set the policy using `setMaskingPolicy`.

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on the n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

#deid model with "entity_labels"
deid_entity_labels= DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_entity_label")\
    .setMode("mask")\
    .setReturnEntityMappings(True)\
    .setMaskingPolicy("entity_labels")

#deid model with "same_length_chars"
deid_same_length= DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_same_length")\
    .setMode("mask")\
    .setReturnEntityMappings(True)\
    .setMaskingPolicy("same_length_chars")

#deid model with "fixed_length_chars"
deid_fixed_length= DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_fixed_length")\
    .setMode("mask")\
    .setReturnEntityMappings(True)\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)


deidPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      deid_entity_labels,
      deid_same_length,
      deid_fixed_length])


empty_data = spark.createDataFrame([[""]]).toDF("text")
model_deid = deidPipeline.fit(empty_data)


#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
'''

result = model_deid.transform(spark.createDataFrame([[text]]).toDF("text"))

result.show()

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|   deid_entity_label|                 aux|    deid_same_length|   deid_fixed_length|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|
Record date : 20...|[{document, 0, 23...|[{document, 1, 45...|[{token, 1, 6, Re...|[{word_embeddings...|[{named_entity, 

#### **`Obfuscation`** mode

In obfuscation mode, the annotator replaces sensitive entities with random values of the same type.
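
As a rough illustration of "random values of the same type", the sketch below parses reference lines of the form `value#LABEL` (the `#` separator matches the `refSep` default) and picks a replacement for a given label. The function names are hypothetical, and falling back to the `<LABEL>` tag when no term exists is an assumption, chosen because it is consistent with the sample output in this section.

```python
import random
from collections import defaultdict


def load_reference_terms(lines: str, sep: str = "#") -> dict:
    """Group replacement values by entity label from `value<sep>LABEL` lines."""
    terms = defaultdict(list)
    for line in lines.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        value, label = line.rsplit(sep, 1)
        terms[label.strip()].append(value.strip())
    return dict(terms)


def pick_replacement(label: str, terms: dict, rng: random.Random) -> str:
    """Pick a random value of the same type; fall back to the label tag."""
    candidates = terms.get(label)
    return rng.choice(candidates) if candidates else f"<{label}>"


sample = "Marvin MARSHALL#PATIENT\nEkaterina Rosa#DOCTOR\nMufi HIGGS#NAME"
terms = load_reference_terms(sample)
rng = random.Random(42)  # a fixed seed makes the picks repeatable
print(pick_replacement("NAME", terms, rng))      # only one NAME entry to choose from
print(pick_replacement("LOCATION", terms, rng))  # no LOCATION entry in the sample
```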

In [None]:
obs_lines = """Marvin MARSHALL#PATIENT
Hubert GROGAN#PATIENT
ALTHEA COLBURN#PATIENT
Kalil AMIN#PATIENT
Inci FOUNTAIN#PATIENT
Ekaterina Rosa#DOCTOR
Rudiger Chao#DOCTOR
COLLETTE KOHLER#NAME
Mufi HIGGS#NAME"""

with open('obfuscation.txt', 'w') as f:
  f.write(obs_lines)

obfuscation = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscation.txt')\
    .setObfuscateRefSource("file")

deidPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      obfuscation])


obfuscation_model = deidPipeline.fit(empty_data)


result = obfuscation_model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.deidentified.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("deidentified")).toPandas()

Unnamed: 0,sentence,deidentified
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : 2093-03-10 , Mufi HIGGS , M.D ."
1,", Name : Hendrickson Ora , MR # 7194334 Date :...",", Name : Mufi HIGGS , MR # 7896877 Date : 03/1..."
2,"PCP : Oliveira , 25 years-old , Record date : ...","PCP : Mufi HIGGS , <AGE> years-old , Record da..."
3,"Cocke County Baptist Hospital , 0295 Keats Str...","<LOCATION> , <LOCATION> , Phone <CONTACT> ."


#### **`Faker`** mode

The faker module lets the user draw on a set of fake entities that are held in memory by spark-nlp-internal. Enable it with `setObfuscateRefSource('faker')`.

With `setObfuscateRefSource('both')`, replacement entities are chosen at random from both the faker and the entries in the `obfuscateRefFile`.


The entities currently supported are the following:

* Location
* Location-other
* Hospital
* City
* State
* Zip
* Country
* Contact
* Username
* Phone
* Fax
* Url
* Email
* Profession
* Name
* Doctor
* Patient
* Id
* Idnum
* Bioid
* Age
* Organization
* Healthplan
* Medicalrecord
* Ssn
* Passport
* DLN
* NPI
* C_card
* IBAN
* DEA
* Device




In [None]:
faker = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefSource("faker")


deidPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      faker])

faker_model = deidPipeline.fit(empty_data)


result = faker_model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.deidentified.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("deidentified")).toPandas()

Unnamed: 0,sentence,deidentified
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : 2093-02-16 , Lazar Fothergill , ..."
1,", Name : Hendrickson Ora , MR # 7194334 Date :...",", Name : Dwayne Ink , MR # 9792492 Date : 02/1..."
2,"PCP : Oliveira , 25 years-old , Record date : ...","PCP : Inoa Lauber , 24 years-old , Record date..."
3,"Cocke County Baptist Hospital , 0295 Keats Str...","1275 North High Street , 928 Diamond Street , ..."


### `setSelectiveObfuscationModesPath()`

`DeIdentification()` supports multi-mode deidentification.

By passing the path of a JSON file to `setSelectiveObfuscationModesPath("a JSON path")`, we can apply a different mode to each entity type. <br/>



An example JSON file looks like this:
```
{
	"obfuscate": ["PHONE"] ,
	"mask_entity_labels": ["ID"],
	"skip": ["DATE"],
	"mask_same_length_chars":["NAME"],
	"mask_fixed_length_chars":["zip", "location"]
}
```

Description of possible modes to enable multi-mode deidentification:

```
   * 'obfuscate': Replace the values with random values.
   * 'mask_same_length_chars': Replace the value with asterisks enclosed in one bracket on each end, so that the total length matches the original.
   * 'mask_entity_labels': Replace the values with the entity label.
   * 'mask_fixed_length_chars': Replace the value with a fixed number of asterisks. The length can be set with "setFixedMaskLength()".
   * 'skip': Leave the entities intact.
```
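
One way to read the JSON above is as a lookup from entity label to mode. The sketch below inverts the file into that form and rejects unknown mode names. It is illustrative only: `label_to_mode` is a hypothetical helper, and upper-casing the labels is an assumption about case-insensitive matching rather than documented behaviour.

```python
import json

VALID_MODES = {
    "obfuscate",
    "mask_entity_labels",
    "mask_same_length_chars",
    "mask_fixed_length_chars",
    "skip",
}


def label_to_mode(config: dict) -> dict:
    """Invert {mode: [labels]} into {LABEL: mode}, rejecting unknown modes."""
    mapping = {}
    for mode, labels in config.items():
        if mode not in VALID_MODES:
            raise ValueError(f"Unknown mode: {mode}")
        for label in labels:
            key = label.upper()  # assumption: labels are matched case-insensitively
            if key in mapping:
                raise ValueError(f"{key} is assigned to more than one mode")
            mapping[key] = mode
    return mapping


config = json.loads("""
{
    "obfuscate": ["PHONE"],
    "mask_entity_labels": ["ID"],
    "skip": ["DATE"],
    "mask_same_length_chars": ["NAME"],
    "mask_fixed_length_chars": ["zip", "location"]
}
""")
print(label_to_mode(config))
```

Checking a configuration this way before fitting the pipeline can catch a misspelled mode name early.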

In [None]:
#json to choose deid modes
sample_json= {
	"obfuscate": ["PHONE"] ,
	"mask_entity_labels": ["ID"],
	"skip": ["DATE"],
	"mask_same_length_chars":["NAME"],
	"mask_fixed_length_chars":["zip", "location"]
}

import json
with open('sample_multi-mode.json', 'w', encoding='utf-8') as f:
    json.dump(sample_json, f, ensure_ascii=False, indent=4)

deid_doc = DeIdentification() \
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("deidentified") \
      .setMode("mask")\
      .setSelectiveObfuscationModesPath("sample_multi-mode.json")


deidPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      deid_doc])

multi_mode_model = deidPipeline.fit(empty_data)


result = multi_mode_model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.deidentified.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("deidentified")).toPandas()

Unnamed: 0,sentence,deidentified
0,"Record date : 2093-01-13 , David Hale , M.D .","Record date : 2093-01-13 , [********] , M.D ."
1,", Name : Hendrickson Ora , MR # 7194334 Date :...",", Name : [*************] , MR # <ID> Date : 01..."
2,"PCP : Oliveira , 25 years-old , Record date : ...","PCP : [******] , <AGE> years-old , Record date..."
3,"Cocke County Baptist Hospital , 0295 Keats Str...","******* , ******* , Phone <CONTACT> ."


### `setAgeRanges()`

With this param, we can set the age groups to preserve during obfuscation: an age is replaced by another age from the same group.
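
The idea of preserving an age group can be sketched in plain Python as follows. This is not the annotator's algorithm: `obfuscate_age` is a hypothetical helper, and treating each group as a half-open interval between consecutive limits is an assumption.

```python
import bisect
import random


def obfuscate_age(age: int, limits: list, rng: random.Random) -> int:
    """Return a random age from the same group as `age`.

    Assumption: groups are half-open intervals [limits[i], limits[i + 1]).
    Ages outside every group are returned unchanged.
    """
    index = bisect.bisect_right(limits, age) - 1
    if index < 0 or index >= len(limits) - 1:
        return age
    low, high = limits[index], limits[index + 1]
    return rng.randrange(low, high)


limits = [1, 4, 12, 20, 40, 60, 80]
rng = random.Random(0)  # seeded so repeated runs give the same ages
for age in (1, 4, 15, 25, 45, 65):
    print(age, "->", obfuscate_age(age, limits, rng))
```

Whatever value is drawn, a 25-year-old stays within 20 to 39 and a 65-year-old within 60 to 79, so coarse demographic information survives the obfuscation.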

In [None]:
from pyspark.sql.types import StringType

obfuscation = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("obfuscation") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefSource("faker") \
    .setAgeRanges([1, 4, 12, 20, 40, 60, 80])

deidPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      obfuscation])

model_agerange = deidPipeline.fit(empty_data)

# Infant = 0-1 year.
# Toddler = 2-4 yrs.
# Child = 5-12 yrs.
# Teen = 13-19 yrs.
# Adult = 20-39 yrs.
# Middle Age Adult = 40-59 yrs.
# Senior Adult = 60+

dates = [
'1 year old baby',
'4 year old kids',
'A 15 year old female with',
'Record date: 2093-01-13, Age: 25',
'Patient is 45 years-old',
'He is 65 years-old male'
]
df_dates = spark.createDataFrame(dates,StringType()).toDF('text')


result = model_agerange.transform(df_dates)

result_df = result.select("text",F.explode(F.arrays_zip(result.ner_chunk.result,
                                                        result.obfuscation.result)).alias("cols")) \
                  .select("text",F.expr("cols['0']").alias("ner_chunk"),
                                 F.expr("cols['1']").alias("obfuscation"))

result_df.show(truncate=False)

+--------------------------------+----------+--------------------------------+
|text                            |ner_chunk |obfuscation                     |
+--------------------------------+----------+--------------------------------+
|1 year old baby                 |1         |3 year old baby                 |
|4 year old kids                 |4         |6 year old kids                 |
|A 15 year old female with       |15        |A 18 year old female with       |
|Record date: 2093-01-13, Age: 25|2093-01-13|Record date: 2093-02-17, Age: 38|
|Record date: 2093-01-13, Age: 25|25        |null                            |
|Patient is 45 years-old         |45        |Patient is 46 years-old         |
|He is 65 years-old male         |65        |He is 70 years-old male         |
+--------------------------------+----------+--------------------------------+



### `setDays()`

With this param, we can set the number of days by which dates are displaced during obfuscation.
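
Displacement by a fixed number of days amounts to simple date arithmetic. The sketch below uses only the Python standard library; `displace_date` is a hypothetical helper, not part of the annotator.

```python
from datetime import datetime, timedelta


def displace_date(value: str, days: int, fmt: str = "%Y-%m-%d") -> str:
    """Shift a date string by `days` and keep its original format."""
    shifted = datetime.strptime(value, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


print(displace_date("2093-01-13", 5))  # 2093-01-18
print(displace_date("2021-09-22", 5))  # 2021-09-27
```

Because every date in a document moves by the same offset, the intervals between dates are preserved.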

In [None]:
obfuscation = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("obfuscation") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefSource("faker") \
    .setDays(5)

deidPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      obfuscation])

model_days_shift = deidPipeline.fit(empty_data)

dates = [
'Record date 1: 2093-01-13',
'Record date 2: 2021-09-22'

]

df_dates = spark.createDataFrame(dates,StringType()).toDF('text')

result = model_days_shift.transform(df_dates)


result_df = result.select("text",F.explode(F.arrays_zip(result.ner_chunk.result,
                                                        result.obfuscation.result)).alias("cols")) \
                  .select("text",F.expr("cols['0']").alias("ner_chunk"),
                                 F.expr("cols['1']").alias("obfuscation"))

result_df.show(truncate=False)

+-------------------------+----------+-------------------------+
|text                     |ner_chunk |obfuscation              |
+-------------------------+----------+-------------------------+
|Record date 1: 2093-01-13|2093-01-13|Record date 1: 2093-01-18|
|Record date 2: 2021-09-22|2021-09-22|Record date 2: 2021-09-27|
+-------------------------+----------+-------------------------+



### `setUnnormalizedDateMode()`


This parameter sets how DATE entities that cannot be normalized are handled. In the example below, check `21/2021`, which cannot be normalized: it is masked in the first output column and obfuscated in the second.

- `setUnnormalizedDateMode("mask")` masks the DATE entities that cannot be normalized.
- `setUnnormalizedDateMode("obfuscate")` obfuscates the DATE entities that cannot be normalized.
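
The two options can be pictured as a fallback that applies only when normalization fails. The sketch below is a hypothetical illustration: `resolve_date`, the fake-date generator, and its `%m-%d-%Y` output format are all assumptions, not the annotator's behaviour.

```python
import random
from datetime import date, timedelta
from typing import Optional


def resolve_date(normalized: Optional[str], mode: str, rng: random.Random) -> str:
    """Choose the output for a DATE chunk.

    `normalized` is the displaced date, or None when the chunk could not be parsed.
    """
    if normalized is not None:
        return normalized  # normal case: the date was parsed and shifted
    if mode == "mask":
        return "<DATE>"
    if mode == "obfuscate":
        # Hypothetical fake-date generator: a random day offset from 1970-01-01
        fake = date(1970, 1, 1) + timedelta(days=rng.randrange(0, 20000))
        return fake.strftime("%m-%d-%Y")
    raise ValueError(f"Unsupported mode: {mode}")


rng = random.Random(7)
print(resolve_date("10/07/2022", "mask", rng))  # parsed dates pass through
print(resolve_date(None, "mask", rng))          # unparsable date, masked
print(resolve_date(None, "obfuscate", rng))     # unparsable date, replaced by a fake one
```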

In [None]:
data = pd.DataFrame(
    {'patientID' : ['A001', 'A001', 'A002', 'A002'],
     'text' : ['Chris Brown was discharged on 10/02/2022',
               'Mark White was discharged on 02/28/2020',
               'John was discharged on 21/2021 ',          # check this
               'John Moore was discharged on 12/31/2022'
              ],
     'dateshift' : ['-5', '-2', '10', '20']
    }
)

my_input_df = spark.createDataFrame(data)

de_identification_mask = DeIdentification() \
    .setInputCols(["ner_chunk", "token", "document"]) \
    .setOutputCol("deid_text_mask") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')\
    .setUnnormalizedDateMode("mask")

de_identification_obf = DeIdentification() \
    .setInputCols(["ner_chunk", "token", "document"]) \
    .setOutputCol("deid_text_obs") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')\
    .setUnnormalizedDateMode("obfuscate")



deidPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      de_identification_mask,
      de_identification_obf])


output = deidPipeline.fit(my_input_df).transform(my_input_df)

output.select('text', 'dateshift', 'deid_text_mask.result','deid_text_obs.result').show(truncate = False)

+----------------------------------------+---------+---------------------------------------------+----------------------------------------------+
|text                                    |dateshift|result                                       |result                                        |
+----------------------------------------+---------+---------------------------------------------+----------------------------------------------+
|Chris Brown was discharged on 10/02/2022|-5       |[Asriel Roberts was discharged on 11/26/2022]|[Nyheim Captain was discharged on 10/26/2022] |
|Mark White was discharged on 02/28/2020 |-2       |[Mark White was discharged on 04/23/2020]    |[Mark White was discharged on 03/23/2020]     |
|John was discharged on 21/2021          |10       |[Gerhart Raveling was discharged on <DATE> ] |[Efraim Messing was discharged on 12-20-1977 ]|
|John Moore was discharged on 12/31/2022 |20       |[Jadden Deed was discharged on 02/24/2023]   |[Paco Dingwall was dischar

### `setDateFormats()`


With this parameter, we can set the list of date formats that are displaced automatically when a date parses against one of them.
In the example below, we use two date formats: "MM/dd/yy" and "yyyy-MM-dd".
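
The patterns passed to `setDateFormats` use Java-style letters (`MM`, `dd`, `yy`), which differ from Python's `strptime` codes. As a rough sketch of "displace if parsed", the code below tries each format in order and shifts the first one that matches. The translation table covers only the two patterns used here, and `displace_if_parsed` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Hand-written translation for the two patterns used in this section only
JAVA_TO_STRPTIME = {"MM/dd/yy": "%m/%d/%y", "yyyy-MM-dd": "%Y-%m-%d"}


def displace_if_parsed(value, java_formats, days):
    """Return the shifted date in its own format, or None if nothing parses."""
    for java_fmt in java_formats:
        fmt = JAVA_TO_STRPTIME[java_fmt]
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format does not match, try the next one
        return (parsed + timedelta(days=days)).strftime(fmt)
    return None


formats = ["MM/dd/yy", "yyyy-MM-dd"]
for raw in ("01/13/93", "2079-11-09", "03Apr2022"):
    print(raw, "->", displace_if_parsed(raw, formats, 5))
```

A value that matches none of the listed formats comes back as `None`, which is the case the unnormalized-date setting is meant to handle.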

In [None]:
deid = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate")\
    .setDateFormats(["MM/dd/yy","yyyy-MM-dd"]) \
    .setObfuscateDate(True)\
    .setObfuscateRefSource("faker")


deidPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      deid])

data = spark.createDataFrame([["# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."]]).toDF("text")

result = deidPipeline.fit(data).transform(data)
result.select("deidentified.result").show(truncate = False)

+----------------------------------------------------------------------------------------+
|result                                                                                  |
+----------------------------------------------------------------------------------------+
|[# 3555554 Date : 02/09/93 PCP : Kizzy Found , 22 years-old , Record date : 2079-12-06.]|
+----------------------------------------------------------------------------------------+

