![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **NameChunkObfuscator**

This notebook covers the parameters and usage of `NameChunkObfuscatorApproach`, the annotator that contains all the methods for training a `NameChunkObfuscator` model. The resulting model replaces name entities with consistent fake names.

**📖 Learning Objectives:**

1. Obfuscation background

2. Colab setup

3. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Python Docs : [NameChunkObfuscatorApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/deid/name_obfuscator/index.html#sparknlp_jsl.annotator.deid.name_obfuscator.NameChunkObfuscatorApproach)

- Scala Docs : [NameChunkObfuscatorApproach](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/deid/NameChunkObfuscatorApproach.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings).

## **📜 Background**


Obfuscation refers to the process of de-identifying or removing sensitive patient information, also known as protected health information (PHI), from clinical notes or other healthcare documents. The purpose of PHI obfuscation is to protect patient privacy and comply with regulations such as the Health Insurance Portability and Accountability Act (HIPAA).

It is important to note that the obfuscation should be done carefully to ensure that the de-identified data cannot be re-identified. Organizations must follow best practices and adhere to applicable regulations to protect patient privacy and maintain data security.

## **🎬 Colab Setup**

This module is licensed, so you need a valid license JSON file.

Installing johnsnowlabs:

In [2]:
! pip install -q johnsnowlabs



In [3]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_7566 (8).json to spark_nlp_for_healthcare_spark_ocr_7566 (8).json


In [4]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install(force_browser=True)


Downloading license...
Licenses extracted successfully
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.4.1-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.4.3-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.4.1.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.4.3.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7566 (8).json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.4.3-py3-none-any.whl to /usr/bin/python3
Installing nlu to /usr/bin/python3
Installed 2 products:
💊 Spark-Healthcare==4.4.3 installed! ✅ Heal the planet with NLP! 
🤖 nlu==4.2.1 installed! ✅ 1 line of code to conquer nlp! 


In [5]:
from johnsnowlabs import nlp, medical
import pandas as pd
import json
import string
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Automatically load license data and start a session with all jars user has access to

spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7566 (8).json
👌 Launched cpu optimized session with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.3, running on ⚡ PySpark==3.1.2


In [6]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

## **🖨️ Input/Output Annotation Types**

- Input: `CHUNK`

- Output: `CHUNK`

## **🔎 Parameters**


- `seed`: (IntParam) The seed used to select names during obfuscation. With a fixed seed, you can replay an execution several times and get the same output.

- `obfuscateRefSource`: (Param[String])
Sets the source of the replacement names: 'both', 'faker', or 'file'. Default: 'both'.

- `language`: (Param[String])
The language used to select faker names. Supported values: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), or 'ro' (Romanian). Default: 'en'.

- `sameLength`: (BooleanParam)
Whether to select replacement names with the same length as the original ones. Example: 'John' --> 'Mike'. Default: True.

- `nameEntities`: (List[str])
The entity labels to obfuscate. The supported name entities are NAME, PATIENT, and DOCTOR. Default: 'NAME'.

- `genderAwareness`: (BooleanParam)
Whether to use gender-aware names during obfuscation. This parameter affects only names.
Default: False.

- `obfuscateRefFile`: (Param[String])
File with the fake names to be used for obfuscation.

- `refFileFormat`: (Param[String])
Format of the reference file.

- `refSep`: (Param[String])
Separator character in the reference file.
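
To make the role of `seed` concrete, here is a minimal plain-Python sketch. It is not the annotator's implementation; it only illustrates how seeding a random generator makes a name selection repeatable. `FAKE_NAMES` and `pick_names` are hypothetical names introduced for this illustration.

```python
import random

# Hypothetical pool of replacement names, used only for this illustration
FAKE_NAMES = ["Mitchell", "Clifford", "Jeremiah", "Lawrence", "Brittany", "Patricia"]

def pick_names(originals, seed):
    """Map each original name to a pseudo-randomly chosen fake name."""
    rng = random.Random(seed)  # private generator, the global random state is untouched
    return {name: rng.choice(FAKE_NAMES) for name in originals}

first = pick_names(["David Hale", "Hendrickson Ora"], seed=42)
second = pick_names(["David Hale", "Hendrickson Ora"], seed=42)

# Two runs with the same seed produce the same mapping
print(first == second)
```

Fixing the seed in this way is what lets two runs over the same input be compared directly.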

### `setObfuscateRefSource()` 

`setObfuscateRefSource()` sets the source of the replacement names: 'both', 'faker', or 'file'. Default: 'both'.
Let's test the 'faker' option in the example below:

In [7]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on the n2c2 datasets (de-identification and Heart Disease Risk Factors Challenge)
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")


nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefSource("faker")

replacer_name = medical.Replacer()\
  .setInputCols("replacement","sentence")\
  .setOutputCol("obfuscated_sentence_name")\
  .setUseReplacement(True)


nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]


In [7]:
#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name")).toPandas()

| | sentence | obfuscated_sentence_name |
|---|---|---|
| 0 | Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-01-13 , Jacquelene , M.D . |
| 1 | , Name : Hendrickson Ora , MR # 7194334 Date :... | , Name : Liliana Liliane , MR # 7194334 Date :... |
| 2 | PCP : Oliveira , 25 years-old , Record date : ... | PCP : Roderick , 25 years-old , Record date : ... |
| 3 | Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |


As you can see in the example, the names "David Hale" and "Hendrickson Ora" are replaced by "Jacquelene" and "Liliana Liliane", respectively.
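
Since `sameLength` is left at its default here, the replacements should keep the length of the originals. The short check below verifies this with plain Python, using only the name pairs visible in the output table above; `pairs` and `length_report` are names introduced for this check.

```python
# (original, replacement) pairs copied from the output table above
pairs = [
    ("David Hale", "Jacquelene"),
    ("Hendrickson Ora", "Liliana Liliane"),
    ("Oliveira", "Roderick"),
]

# Character counts of each original and its replacement (spaces included)
length_report = {orig: (len(orig), len(repl)) for orig, repl in pairs}
for orig, (n_orig, n_repl) in length_report.items():
    print(f"{orig!r}: {n_orig} -> {n_repl} characters")

all_same_length = all(n_orig == n_repl for n_orig, n_repl in length_report.values())
print("All replacements keep the original length:", all_same_length)
```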

### `setSameLength()` 

`setSameLength()` controls whether the replacement names have the same length as the original ones.
Example: 'John' --> 'Mike'.
Default: True.
Let's set it to False in the example below:

In [8]:
nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefSource("faker")\
  .setSameLength(False)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)



In [9]:
#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555.
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name")).toPandas()

| | sentence | obfuscated_sentence_name |
|---|---|---|
| 0 | Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-01-13 , Jefferey , M.D . |
| 1 | , Name : Hendrickson Ora , MR # 7194334 Date :... | , Name : Dorice , MR # 7194334 Date : 01/13/93 . |
| 2 | PCP : Oliveira , 25 years-old , Record date : ... | PCP : Jerold , 25 years-old , Record date : 20... |
| 3 | Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |


As you can see in the example, the names "David Hale" and "Hendrickson Ora" are replaced by "Jefferey" and "Dorice", respectively, and the replacements no longer have the same length as the original names.
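
One way to picture what a same-length constraint does is the plain-Python sketch below. It is an assumption-laden illustration, not the annotator's algorithm: `POOL`, `pick_replacement`, and the fallback to the whole pool when no candidate matches are all hypothetical choices made for this example.

```python
import random

# Hypothetical pool of replacement names with different lengths
POOL = ["Mitchell", "Jackson", "Bowman", "Villarreal"]

def pick_replacement(original, pool, same_length=True, seed=0):
    """Pick a replacement, optionally restricted to names as long as the original."""
    rng = random.Random(seed)
    if same_length:
        candidates = [name for name in pool if len(name) == len(original)]
    else:
        candidates = list(pool)
    # Assumption of this sketch: fall back to the whole pool if nothing matches
    return rng.choice(candidates or list(pool))

constrained = pick_replacement("Oliveira", POOL, same_length=True)
unconstrained = pick_replacement("Oliveira", POOL, same_length=False)
print(constrained, len(constrained) == len("Oliveira"))
print(unconstrained)
```

With the constraint on, the candidate set shrinks to names of matching length; with it off, any name in the pool can be chosen.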

### `setNameEntities()` 

`setNameEntities()` selects which entity labels are obfuscated.
The supported name entities are NAME, PATIENT, and DOCTOR.
Default: 'NAME'.

In this case, let's use a subentity NER model that detects DOCTOR and PATIENT instead of the NAME entity, and set the corresponding `nameEntities` list:

In [10]:
clinical_ner = medical.NerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefSource("faker")\
  .setNameEntities(["DOCTOR", "PATIENT"])

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

ner_deid_subentity_augmented download started this may take some time.
[OK!]


In [11]:
#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Patient name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Alex .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name")).toPandas()

| | sentence | obfuscated_sentence_name |
|---|---|---|
| 0 | Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-01-13 , Jacquelene , M.D . |
| 1 | , Patient name : Hendrickson Ora , MR # 719433... | , Patient name : Liliana Liliane , MR # 719433... |
| 2 | PCP : Oliveira , 25 years-old , Record date : ... | PCP : Roderick , 25 years-old , Record date : ... |
| 3 | Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| 4 | Analyzed by Dr. Alex . | Analyzed by Dr. Zora . |


As you can see in the example, the patient name "Hendrickson Ora" and the doctor name "Alex" are replaced by "Liliana Liliane" and "Zora", respectively.
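
Conceptually, the `nameEntities` list acts as a filter on the labels of the incoming chunks. The plain-Python sketch below illustrates that idea only; the chunk labels are assumed for illustration (they are not copied from the NER output), and `chunks`, `to_obfuscate`, and `untouched` are hypothetical names.

```python
# Hypothetical (chunk text, entity label) pairs; the labels are assumed for illustration
chunks = [
    ("Hendrickson Ora", "PATIENT"),
    ("Alex", "DOCTOR"),
    ("Cocke County Baptist Hospital", "HOSPITAL"),
]

# Mirrors the list passed to setNameEntities() in this section
name_entities = {"DOCTOR", "PATIENT"}

# Only chunks whose label is listed are candidates for name obfuscation
to_obfuscate = [text for text, label in chunks if label in name_entities]
untouched = [text for text, label in chunks if label not in name_entities]

print("Obfuscate:", to_obfuscate)
print("Leave as-is:", untouched)
```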

### `setGenderAwareness()` 

`setGenderAwareness()` sets whether to use gender-aware names during obfuscation.
This parameter affects only names.
If the value is True, it might decrease performance.
Default: False.

Let's set it to True in the example below:

In [12]:
nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefSource("faker")\
  .setNameEntities(["DOCTOR", "PATIENT"])\
  .setGenderAwareness(True)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [13]:
#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Patient name : Michael  , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer  .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name")).toPandas()

| | sentence | obfuscated_sentence_name |
|---|---|---|
| 0 | Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-01-13 , Chancellor , M.D . |
| 1 | , Patient name : Michael , MR # 7194334 Date ... | , Patient name : Carlton , MR # 7194334 Date ... |
| 2 | PCP : Oliveira , 25 years-old , Record date : ... | PCP : Melaniya , 25 years-old , Record date : ... |
| 3 | Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| 4 | Analyzed by Dr. Jennifer . | Analyzed by Dr. Winifred . |


As you can see in this example, the male name "Michael" is replaced by the male name "Carlton", and the female name "Jennifer" is replaced by the female name "Winifred".
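
The idea of gender-aware selection can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not how the annotator determines gender: `KNOWN_GENDER`, `FAKE_BY_GENDER`, and `gender_aware_pick` are hypothetical, and the lookup tables contain just the names discussed in this section.

```python
import random

# Hypothetical lookup tables built from the names discussed in this section
KNOWN_GENDER = {"Michael": "male", "Jennifer": "female"}
FAKE_BY_GENDER = {
    "male": ["Carlton"],
    "female": ["Winifred"],
}

def gender_aware_pick(name, seed=0):
    """Pick a replacement from the pool matching the name's gender, if it is known."""
    rng = random.Random(seed)
    gender = KNOWN_GENDER.get(name)
    # Assumption of this sketch: unknown names draw from all pools combined
    pool = FAKE_BY_GENDER.get(gender) or [n for names in FAKE_BY_GENDER.values() for n in names]
    return rng.choice(pool)

print(gender_aware_pick("Michael"))
print(gender_aware_pick("Jennifer"))
```

The extra lookup per name is one plausible reason such a mode could cost some performance, although the notebook does not state the cause.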

### `setObfuscateRefFile()` 

`setObfuscateRefFile()` sets the file with the terms to be used for obfuscation.
Let's create a text file with the names to be used for obfuscation.
Then we set the corresponding parameter to use this file.

In [8]:
names = """Mitchell#NAME
Clifford#NAME
Jeremiah#NAME
Lawrence#NAME
Brittany#NAME
Patricia#NAME
Samantha#NAME
Jennifer#NAME
Jackson#NAME
Leonard#NAME
Randall#NAME
Camacho#NAME
Ferrell#NAME
Mueller#NAME
Bowman#NAME
Hansen#NAME
Acosta#NAME
Gillespie#NAME
Zimmerman#NAME
Gillespie#NAME
Chandler#NAME
Bradshaw#NAME
Ferguson#NAME
Jacobson#NAME
Figueroa#NAME
Chandler#NAME
Schaefer#NAME
Matthews#NAME
Ferguson#NAME
Bradshaw#NAME
Figueroa#NAME
Delacruz#NAME
Gallegos#NAME
Villarreal#NAME
Williamson#NAME
Montgomery#NAME
Mclaughlin#NAME
Blankenship#NAME
Fitzpatrick#NAME
"""

with open('names_test.txt', 'w') as file:
    file.write(names)

clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefFile("names_test.txt")\
  .setObfuscateRefSource("file")


nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

ner_deid_generic_augmented download started this may take some time.
[OK!]


In [9]:
#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Patient name : Michael  , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer  .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name")).toPandas()

| | sentence | obfuscated_sentence_name |
|---|---|---|
| 0 | Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-01-13 , Mclaughlin , M.D . |
| 1 | , Patient name : Michael , MR # 7194334 Date ... | , Patient name : Ferrell , MR # 7194334 Date ... |
| 2 | PCP : Oliveira , 25 years-old , Record date : ... | PCP : Bradshaw , 25 years-old , Record date : ... |
| 3 | Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| 4 | Analyzed by Dr. Jennifer . | Analyzed by Dr. Matthews . |


As you can see in the example, by setting `setObfuscateRefFile()` and `setObfuscateRefSource()` with the option "file", all NAME entities are replaced with names from the file.
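
The reference file above lists some names more than once (for example "Gillespie"). Before handing a file to the annotator, it can be useful to inspect it with plain Python. The sketch below parses `name#ENTITY` lines and reports duplicates; `parse_reference` is a hypothetical helper, and the excerpt is a shortened copy of the list above.

```python
from collections import Counter

# Shortened excerpt of the reference list used in this section
reference_text = """Gillespie#NAME
Zimmerman#NAME
Gillespie#NAME
Chandler#NAME
"""

def parse_reference(text, sep="#"):
    """Split each non-empty line into a (name, entity) pair at the last separator."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, entity = line.rsplit(sep, 1)
        rows.append((name.strip(), entity.strip()))
    return rows

rows = parse_reference(reference_text)
counts = Counter(name for name, _ in rows)
duplicates = [name for name, count in counts.items() if count > 1]
unique_rows = list(dict.fromkeys(rows))  # drop repeats, keep first-seen order

print("Duplicated names:", duplicates)
print("Unique entries:", len(unique_rows))
```

Whether duplicates change how often a name is drawn is not stated in this notebook, so removing them is a cautious default rather than a requirement.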

### `setRefFileFormat()` 

`setRefFileFormat()` sets the format of the reference file.

Let's create a reference file with the names to be used for obfuscation and read it in "csv" format:

In [14]:
names = """Mitchell#NAME
Clifford#NAME
Jeremiah#NAME
Lawrence#NAME
Brittany#NAME
Patricia#NAME
Samantha#NAME
Jennifer#NAME
Jackson#NAME
Leonard#NAME
Randall#NAME
Camacho#NAME
Ferrell#NAME
Mueller#NAME
Bowman#NAME
Hansen#NAME
"""

with open('names_test1.txt', 'w') as file:
    file.write(names)


nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefFile("names_test1.txt")\
  .setObfuscateRefSource("file")\
  .setRefFileFormat("csv")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [27]:
#sample data
text ='''
M.D . , Patient name : Michael  , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer  .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name")).toPandas()

| | sentence | obfuscated_sentence_name |
|---|---|---|
| 0 | M.D . | M.D . |
| 1 | , Patient name : Michael , MR # 7194334 Date ... | , Patient name : Ferrell , MR # 7194334 Date ... |
| 2 | PCP : Oliveira , 25 years-old , Record date : ... | PCP : Samantha , 25 years-old , Record date : ... |
| 3 | Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| 4 | Analyzed by Dr. Jennifer . | Analyzed by Dr. Samantha . |


### `setRefSep()` 

`setRefSep()` sets the separator character used in the reference file.

We will set "-" as the separator in the reference file in the example below:

In [32]:
names = """Mitchell-NAME
Clifford-NAME
Jeremiah-NAME
Lawrence-NAME
Brittany-NAME
Patricia-NAME
Jennifer-NAME
Jackson-NAME
Leonard-NAME
Randall-NAME
Camacho-NAME
Ferrell-NAME
Mueller-NAME
Bowman-NAME
Hansen-NAME
"""

with open('names_test2.txt', 'w') as file:
    file.write(names)


nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefFile("names_test2.txt")\
  .setObfuscateRefSource("file")\
  .setRefFileFormat("csv")\
  .setRefSep("-")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [33]:
#sample data
text ='''
M.D . , Patient name : Michael  , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer  .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name")).toPandas()

| | sentence | obfuscated_sentence_name |
|---|---|---|
| 0 | M.D . | M.D . |
| 1 | , Patient name : Michael , MR # 7194334 Date ... | , Patient name : Ferrell , MR # 7194334 Date ... |
| 2 | PCP : Oliveira , 25 years-old , Record date : ... | PCP : Clifford , 25 years-old , Record date : ... |
| 3 | Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| 4 | Analyzed by Dr. Jennifer . | Analyzed by Dr. Jennifer . |
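
When choosing a separator, it helps to check that the character does not also occur inside the names. The sketch below uses Python's standard `csv` module to show what a "-" separator does to a well-formed line and to a hyphenated name. It only demonstrates generic separator-based parsing; how the annotator itself treats such a line is not covered in this notebook, and "Anne-Marie" is a hypothetical entry.

```python
import csv
import io

# Lines in the same layout as the reference file above
reference_text = "Mitchell-NAME\nClifford-NAME\nJennifer-NAME\n"
rows = list(csv.reader(io.StringIO(reference_text), delimiter="-"))
print(rows)

# Hypothetical entry: a hyphenated name is split into three fields instead of two
ambiguous = next(csv.reader(io.StringIO("Anne-Marie-NAME\n"), delimiter="-"))
print(ambiguous, "->", len(ambiguous), "fields")
```

A separator that never appears in the names, such as the "#" used in the earlier examples, avoids this ambiguity.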
