![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/NameChunkObfuscator.ipynb)

# **NameChunkObfuscator**

This notebook covers the parameters and usage of `NameChunkObfuscator`. The annotator takes input annotations of type CHUNK and returns obfuscated versions of those chunks. It replaces name entities with consistent fake names and leaves all other chunks unchanged.

**📖 Learning Objectives:**

1. Obfuscation background

2. Colab setup

3. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [NameChunkObfuscator](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#namechunkobfuscator)

- Python Docs : [NameChunkObfuscator](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/deid/name_obfuscator/index.html#sparknlp_jsl.annotator.deid.name_obfuscator.NameChunkObfuscator)

- Scala Docs : [NameChunkObfuscator](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/deid/NameChunkObfuscator.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/04.0.Clinical_DeIdentification.ipynb).

## **📜 Background**


Obfuscation refers to the process of de-identifying or removing sensitive patient information, known as protected health information (PHI), from clinical notes or other healthcare documents. The purpose of PHI obfuscation is to protect patient privacy and comply with regulations such as the Health Insurance Portability and Accountability Act (HIPAA).

It is important to note that the obfuscation should be done carefully to ensure that the de-identified data cannot be re-identified. Organizations must follow best practices and adhere to applicable regulations to protect patient privacy and maintain data security.

## **🎬 Colab Setup**

This module is licensed, so you need a valid license JSON file.

Installing johnsnowlabs:

In [None]:
! pip install -q johnsnowlabs


In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving 5.3.3.spark_nlp_for_healthcare.json to 5.3.3.spark_nlp_for_healthcare.json


In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
#nlp.settings.enforce_versions=False
nlp.install()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=5.3.3 but should be Version=5.3.2
🚨 Outdated OCR Secrets in license file. Version=5.1.2 but should be Version=5.3.2
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs

In [None]:
import pyspark.sql.functions as F
import pandas as pd

# Automatically load license data and start a session with all jars user has access to

spark = nlp.start()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `CHUNK`

- Output: `CHUNK`
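
To make the CHUNK-in, CHUNK-out contract concrete, here is a minimal plain-Python sketch of what such an annotation carries. The field names loosely mirror Spark NLP's annotation structure (type, `begin`, inclusive `end`, `result`, `metadata`), but the `Chunk` class and the `obfuscate_chunk` helper are hypothetical stand-ins, and keeping the original offsets on the output chunk is an assumption made for illustration.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Chunk:
    """Simplified, hypothetical stand-in for an annotation of type CHUNK."""
    begin: int                      # start character offset
    end: int                        # end character offset (inclusive)
    result: str                     # the chunk text
    metadata: dict = field(default_factory=dict)
    annotator_type: str = "chunk"


def obfuscate_chunk(chunk: Chunk, fake_name: str) -> Chunk:
    """Return a new CHUNK whose text is replaced; offsets and metadata are kept (assumption)."""
    return replace(chunk, result=fake_name)


text = "Record date : 2093-01-13 , David Hale , M.D ."
begin = text.index("David Hale")
original = Chunk(begin, begin + len("David Hale") - 1, "David Hale", {"entity": "NAME"})
obfuscated = obfuscate_chunk(original, "Marguerita")

print(original)
print(obfuscated)
```

The point of the sketch is only that the annotator consumes and produces the same annotation type, which is why its output can be passed to a downstream stage that expects chunks.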

## **🔎 Parameters**


- `seed`: (IntParam) The seed used to select names during obfuscation. With a fixed seed, you can replay an execution several times and get the same output.

- `obfuscateRefSource`: (Param[String])
Sets the source of obfuscation names: 'both', 'faker', or 'file'. Default: 'both'.

- `language`: (Param[String])
The language used to select faker names. Supported values are 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), and 'ro' (Romanian). Default: 'en'.

- `sameLength`: (BooleanParam)
Whether to select replacement names with the same length as the original ones during obfuscation. Example: 'John' --> 'Mike'. Default: true.

- `nameEntities`: (List[str])
The entity labels to obfuscate. The supported name entities are NAME, PATIENT, and DOCTOR. Default: 'NAME'

- `genderAwareness`: (BooleanParam)
Whether to use gender-aware names during obfuscation. This parameter affects only names.
Default: False
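
As a rough illustration of what `seed` and "consistent" replacement mean, the sketch below uses Python's standard `random.Random` to pick fake names. It is not the annotator's implementation: the `FAKE_NAMES` pool and the `pick_fakes` helper are hypothetical, and the only claims are that a fixed seed makes the selection repeatable and that a repeated name maps to the same fake name.

```python
import random

# Hypothetical pool of replacement names, for illustration only.
FAKE_NAMES = ["Keenan", "Pablo", "Ara", "Benn", "Selestine", "Maryjean"]


def pick_fakes(names, seed):
    """Map each name to a fake name; repeated names reuse the same replacement."""
    rng = random.Random(seed)       # a fixed seed gives a repeatable sequence
    mapping = {}
    for name in names:
        if name not in mapping:
            mapping[name] = rng.choice(FAKE_NAMES)
    return [mapping[name] for name in names]


names = ["David Hale", "Oliveira", "David Hale"]
run_a = pick_fakes(names, seed=42)
run_b = pick_fakes(names, seed=42)

print(run_a == run_b)        # same seed, same replacements
print(run_a[0] == run_a[2])  # same original name, same fake name
```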



### `setObfuscateRefSource()`

`setObfuscateRefSource` sets the source of obfuscation names: 'both', 'faker', or 'file'. Default: 'both'.
Let's test the 'faker' option in the example below:

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on the n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nameChunkObfuscator = medical.NameChunkObfuscator()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefSource("faker")

replacer_name = medical.Replacer()\
  .setInputCols("replacement","sentence")\
  .setOutputCol("obfuscated_sentence_name")\
  .setUseReplacement(True)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]


In [None]:
#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name")).toPandas()

| | sentence | obfuscated_sentence_name |
|---|---|---|
| 0 | Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-01-13 , Marguerita , M.D . |
| 1 | , Name : Hendrickson Ora , MR # 7194334 Date :... | , Name : Fredrik Freeman , MR # 7194334 Date :... |
| 2 | PCP : Oliveira , 25 years-old , Record date : ... | PCP : Maryjean , 25 years-old , Record date : ... |
| 3 | Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |


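As the output shows, the names "David Hale", "Hendrickson Ora", and "Oliveira" are replaced with the fake names "Marguerita", "Fredrik Freeman", and "Maryjean", while the hospital name is left unchanged. Because `sameLength` defaults to true, each replacement has the same number of characters as the original.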
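
The pipeline above pairs the obfuscator with a `Replacer` stage that writes the replacement chunks back into the sentence. Conceptually, this amounts to substituting text at character offsets. The sketch below is a plain-Python approximation of that idea, not the `Replacer` implementation; `apply_replacements` is a hypothetical helper, and the inclusive end offset follows the Spark NLP annotation convention.

```python
def apply_replacements(sentence, chunks):
    """Substitute replacement text at character offsets.

    chunks: list of (begin, end_inclusive, replacement) tuples.
    Working right to left keeps earlier offsets valid while the text length changes.
    """
    out = sentence
    for begin, end, replacement in sorted(chunks, reverse=True):
        out = out[:begin] + replacement + out[end + 1:]
    return out


sentence = "Record date : 2093-01-13 , David Hale , M.D ."
begin = sentence.index("David Hale")
chunks = [(begin, begin + len("David Hale") - 1, "Marguerita")]

obfuscated = apply_replacements(sentence, chunks)
print(obfuscated)  # Record date : 2093-01-13 , Marguerita , M.D .
```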

### `setSameLength()`

`setSameLength` controls whether replacement names have the same length as the original ones during obfuscation.
Example: 'John' --> 'Mike'.
Default: true
Let's set it to False in the example below:

In [None]:
nameChunkObfuscator = medical.NameChunkObfuscator()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefSource("faker")\
  .setSameLength(False)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)



In [None]:
#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Alex .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name")).toPandas()

| | sentence | obfuscated_sentence_name |
|---|---|---|
| 0 | Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-01-13 , Keenan , M.D . |
| 1 | , Name : Hendrickson Ora , MR # 7194334 Date :... | , Name : Selestine , MR # 7194334 Date : 01/13... |
| 2 | PCP : Oliveira , 25 years-old , Record date : ... | PCP : Pablo , 25 years-old , Record date : 207... |
| 3 | Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| 4 | Analyzed by Dr. Alex . | Analyzed by Dr. Ara . |


As the output shows, the replacement names no longer keep the length of the original ones: "David Hale" becomes "Keenan" and "Hendrickson Ora" becomes "Selestine".
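
One way to picture the effect of `sameLength` is as a filter on the candidate names before one is chosen. The sketch below is an assumption about the general idea, not the annotator's code: `FAKE_NAMES` and `pick_name` are hypothetical, and the fallback to the full pool when no name of matching length exists is a design choice made for this example.

```python
import random

# Hypothetical pool of replacement names, for illustration only.
FAKE_NAMES = ["Mike", "Ara", "Pablo", "Keenan", "Selestine", "Marguerita"]


def pick_name(original, same_length, rng):
    """Pick a fake name, optionally restricted to names as long as the original."""
    candidates = FAKE_NAMES
    if same_length:
        matching = [name for name in FAKE_NAMES if len(name) == len(original)]
        candidates = matching or FAKE_NAMES  # fall back if nothing matches (assumption)
    return rng.choice(candidates)


rng = random.Random(0)
constrained = pick_name("John", same_length=True, rng=rng)
unconstrained = pick_name("John", same_length=False, rng=rng)

print(constrained)    # the only 4-letter name in the pool
print(unconstrained)  # any name from the pool
```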

### `setNameEntities()`

`setNameEntities` selects which entity labels are obfuscated.
The supported name entities are NAME, PATIENT, and DOCTOR.
Default: 'NAME'

In this case, let's use a subentity NER model that detects DOCTOR and PATIENT instead of the NAME entity, and set the corresponding `nameEntities` list:

In [None]:
clinical_ner = medical.NerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

nameChunkObfuscator = medical.NameChunkObfuscator()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefSource("faker")\
  .setNameEntities(["DOCTOR", "PATIENT"])

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

ner_deid_subentity_augmented download started this may take some time.
[OK!]


In [None]:
#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Patient name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Alex .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name")).toPandas()

| | sentence | obfuscated_sentence_name |
|---|---|---|
| 0 | Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-01-13 , Marguerita , M.D . |
| 1 | , Patient name : Hendrickson Ora , MR # 719433... | , Patient name : Fredrik Freeman , MR # 719433... |
| 2 | PCP : Oliveira , 25 years-old , Record date : ... | PCP : Maryjean , 25 years-old , Record date : ... |
| 3 | Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| 4 | Analyzed by Dr. Alex . | Analyzed by Dr. Benn . |


As the output shows, names detected as PATIENT or DOCTOR are replaced, including the patient name "Hendrickson Ora" and the doctor name "Alex".
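
The role of `nameEntities` can be sketched as a label filter applied to the incoming chunks: only chunks whose entity label is in the list are candidates for replacement. The snippet below is a plain-Python illustration under that assumption. The chunk dictionaries and their labels are hand-written for the example (they are not model output), and `select_for_obfuscation` is a hypothetical helper.

```python
# Entity labels to obfuscate, matching the list passed to setNameEntities above.
NAME_ENTITIES = {"DOCTOR", "PATIENT"}

# Hand-labelled chunks for illustration only; a real pipeline gets these from the NER stages.
chunks = [
    {"result": "David Hale", "entity": "DOCTOR"},
    {"result": "Hendrickson Ora", "entity": "PATIENT"},
    {"result": "Cocke County Baptist Hospital", "entity": "HOSPITAL"},
    {"result": "2093-01-13", "entity": "DATE"},
]


def select_for_obfuscation(chunks, name_entities):
    """Keep only the chunks whose entity label is in the configured list."""
    return [chunk for chunk in chunks if chunk["entity"].upper() in name_entities]


selected = select_for_obfuscation(chunks, NAME_ENTITIES)
print([chunk["result"] for chunk in selected])  # ['David Hale', 'Hendrickson Ora']
```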

### `setGenderAwareness()`

`setGenderAwareness` sets whether to use gender-aware names during obfuscation.
This parameter affects only names.
Setting it to true might decrease performance.
Default: False

Let's set it to True in the example below:

In [None]:
nameChunkObfuscator = medical.NameChunkObfuscator()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefSource("faker")\
  .setNameEntities(["DOCTOR", "PATIENT"])\
  .setGenderAwareness(True)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [None]:
#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Patient name : Michael  , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer  .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name")).toPandas()

| | sentence | obfuscated_sentence_name |
|---|---|---|
| 0 | Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-01-13 , Richardson , M.D . |
| 1 | , Patient name : Michael , MR # 7194334 Date ... | , Patient name : Thaxter , MR # 7194334 Date ... |
| 2 | PCP : Oliveira , 25 years-old , Record date : ... | PCP : Adelaida , 25 years-old , Record date : ... |
| 3 | Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| 4 | Analyzed by Dr. Jennifer . | Analyzed by Dr. Morganne . |


As the output shows, the male name "Michael" is replaced with the male name "Thaxter", and the female name "Jennifer" is replaced with the female name "Morganne".
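
A simple way to think about gender awareness is as a lookup followed by a choice from a gender-specific pool. The sketch below illustrates that idea only. The lookup table, the grouping of fake names by gender, and the `pick_gender_aware` helper are all hypothetical, and the annotator may determine gender and choose names quite differently.

```python
import random

# Hypothetical lookup tables, for illustration only.
KNOWN_GENDER = {"Michael": "male", "Jennifer": "female"}
FAKES_BY_GENDER = {
    "male": ["Thaxter", "Keenan", "Pablo"],
    "female": ["Morganne", "Adelaida", "Selestine"],
}


def pick_gender_aware(original, rng):
    """Pick a fake name from the pool matching the original's gender, if it is known."""
    gender = KNOWN_GENDER.get(original)
    if gender is None:
        # Unknown gender: fall back to the combined pool (assumption).
        pool = FAKES_BY_GENDER["male"] + FAKES_BY_GENDER["female"]
    else:
        pool = FAKES_BY_GENDER[gender]
    return rng.choice(pool)


rng = random.Random(7)
fake_for_michael = pick_gender_aware("Michael", rng)
fake_for_jennifer = pick_gender_aware("Jennifer", rng)

print(fake_for_michael in FAKES_BY_GENDER["male"])     # True
print(fake_for_jennifer in FAKES_BY_GENDER["female"])  # True
```

The extra lookup per name is one plausible reason why enabling this option might decrease performance, as noted above.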