![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **Replacer**


This notebook will cover the `Replacer` annotator. 

`Replacer` replaces entities in the original text with the values produced by the `NameChunkObfuscatorApproach` or `DateNormalizer` annotators.




**📖 Learning Objectives:**

1. Understand how `Replacer` works.

2. Understand how `Replacer` can be used with the `DateNormalizer` annotator and in the deidentification process.

3. Become comfortable using the `setUseReplacement` parameter of the annotator.


**🔗 Helpful Links:**


- Python Docs : [Replacer](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/deid/replacer/index.html#sparknlp_jsl.annotator.deid.replacer.Replacer)

- Scala Docs : [Replacer](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/deid/Replacer.html)

- For extended examples of usage, see the [Clinical Deidentification](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb#scrollTo=9alThnhZeOvn) and [Date Normalizer](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/13.0.Date_Normalizer.ipynb#scrollTo=yX57W_6SLiWz) notebooks.


## **📜 Background**

`Replacer` is most often used in conjunction with the `DateNormalizer` annotator or in deidentification pipelines.

With dates, the `DateNormalizer` annotator converts each date mention to a standardized format, and the `Replacer` annotator then substitutes the original mention in the text with that normalized value.

Obfuscation in healthcare is the act of making healthcare data difficult to understand or use without authorization. This can be done by replacing or removing identifying information, such as names, dates of birth, and Social Security numbers. Obfuscation can also be used to hide the contents of healthcare records, such as diagnoses, medications, and treatment plans.

In the **deidentification** process, the `Replacer` annotator is used to replace certain tokens or patterns in the text with specified values. For example, it can be used to replace all instances of a person's name with a placeholder like "PERSON".
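
The replacement step described above can be illustrated in plain Python. This is only a conceptual sketch, not the library's implementation: the `Chunk` type and the `replace_chunks` function are hypothetical names, and the inclusive `end` offset is an assumption made for this example.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    """A detected entity: its character span plus the text that should replace it."""
    begin: int        # index of the first character of the entity
    end: int          # index of the last character (inclusive, assumed for this sketch)
    replacement: str  # text to write in place of the entity


def replace_chunks(text: str, chunks: List[Chunk]) -> str:
    """Rebuild `text`, swapping each chunk's span for its replacement."""
    pieces = []
    cursor = 0
    for chunk in sorted(chunks, key=lambda c: c.begin):
        pieces.append(text[cursor:chunk.begin])  # untouched text before the entity
        pieces.append(chunk.replacement)
        cursor = chunk.end + 1                   # resume right after the entity
    pieces.append(text[cursor:])                 # text after the last entity
    return "".join(pieces)


text = "John Davies was seen by Dr. Lorand."
chunks = [Chunk(0, 10, "PERSON"), Chunk(28, 33, "PERSON")]
print(replace_chunks(text, chunks))  # PERSON was seen by Dr. PERSON.
```

Rebuilding the string from left to right keeps the original offsets valid even when a replacement is longer or shorter than the entity it replaces.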

The `NameChunkObfuscatorApproach` annotator supplies the replacement values for names: it obfuscates the name entities detected in the text by mapping them to fake names.


## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs 

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/4.4.1.spark_nlp_for_healthcare.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.4.1-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.4.1-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.4.1.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.4.1.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/4.4.1.spark_nlp_for_healthcare.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.4.1-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==4.4.1 installed! ✅ Heal the planet with NLP! 


In [None]:
from johnsnowlabs import nlp, medical, visual
import pandas as pd
import json
import string
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/4.4.1.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.1, running on ⚡ PySpark==3.1.2


In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only
from pyspark.sql.types import StructType, IntegerType, StringType

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`

- Output: `DOCUMENT`

## **🔎 Parameters**

- `setUseReplacement`: (Boolean) Whether to replace the entities in the text with the values coming from the input chunk column. When set to `False`, the text is returned unchanged.



## **💻 Deidentification Pipeline**

`Obfuscation` refers to the process of making data unclear, confusing, or difficult to understand or interpret. The goal of obfuscation is to hide or protect **sensitive information** by altering it in a way that makes it challenging for unauthorized parties to access or comprehend.




The `NameChunkObfuscatorApproach` annotator contains all the methods for training a `NameChunkObfuscator` model. It replaces name entities with consistent fake names.
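
One way to picture "consistent" replacement is a deterministic mapping: the same original name always yields the same fake name. The sketch below is an assumption about what consistency means here, not the annotator's actual algorithm. `fake_name_for` and `FAKE_NAMES` are hypothetical names introduced for illustration.

```python
import hashlib
from typing import Dict, List

# A few candidate fake names, taken from the reference list used in this notebook.
FAKE_NAMES: List[str] = ["Mitchell", "Clifford", "Jeremiah", "Lawrence"]


def fake_name_for(original: str, candidates: List[str]) -> str:
    """Pick a fake name deterministically from a hash of the original name."""
    # Lowercasing makes "Davies" and "davies" map to the same fake name.
    digest = hashlib.sha256(original.lower().encode("utf-8")).hexdigest()
    return candidates[int(digest, 16) % len(candidates)]


mentions = ["Davies", "Lorand", "davies"]
mapping: Dict[str, str] = {m: fake_name_for(m, FAKE_NAMES) for m in mentions}
print(mapping)
```

With a hash-based choice, two different names can land on the same fake name when the candidate list is short, so a larger list reduces collisions.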

In [None]:
names = """Mitchell#NAME
Clifford#NAME
Jeremiah#NAME
Lawrence#NAME
Brittany#NAME
Patricia#NAME
Samantha#NAME
Jennifer#NAME
Jackson#NAME
Leonard#NAME
Randall#NAME
Camacho#NAME
Ferrell#NAME
Mueller#NAME
Bowman#NAME
Hansen#NAME
Acosta#NAME
Gillespie#NAME
Zimmerman#NAME
Gillespie#NAME
Chandler#NAME
Bradshaw#NAME
Ferguson#NAME
Jacobson#NAME
Figueroa#NAME
Chandler#NAME
Schaefer#NAME
Matthews#NAME
Ferguson#NAME
Bradshaw#NAME
Figueroa#NAME
Delacruz#NAME
Gallegos#NAME
Villarreal#NAME
Williamson#NAME
Montgomery#NAME
Mclaughlin#NAME
Blankenship#NAME
Fitzpatrick#NAME
"""

with open('names_test.txt', 'w') as file:
    file.write(names)
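
Each line of the reference file has the form `value#LABEL`, which is why the pipeline later sets the separator to `#`. As a rough illustration of that layout (not the library's own loader), the hypothetical helper `parse_ref_lines` below groups the values by label using the standard `csv` module.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def parse_ref_lines(content: str, sep: str = "#") -> Dict[str, List[str]]:
    """Group the values of a `value<sep>LABEL` listing by their label."""
    by_label: Dict[str, List[str]] = defaultdict(list)
    for row in csv.reader(io.StringIO(content), delimiter=sep):
        if len(row) != 2:  # skip blank or malformed lines
            continue
        value, label = row
        by_label[label].append(value)
    return dict(by_label)


sample = "Mitchell#NAME\nClifford#NAME\nJeremiah#NAME\n"
parsed = parse_ref_lines(sample)
print(parsed)  # {'NAME': ['Mitchell', 'Clifford', 'Jeremiah']}
```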

### `setUseReplacement`

<br/>

This parameter enables or disables the replacement of entities.

Set it to `True` to replace the entities, or to `False` to leave the text unchanged.

In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on the n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter_name = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setRefFileFormat("csv")\
  .setObfuscateRefFile("names_test.txt")\
  .setRefSep("#")

replacer_name = medical.Replacer()\
  .setInputCols("replacement","sentence")\
  .setOutputCol("obfuscated_document_name")\
  .setUseReplacement(True)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter_name,
    nameChunkObfuscator,
    replacer_name,
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

result = model.transform(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]


Let’s use LightPipeline here to extract the entities and make the replacements.

[LightPipeline](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline) is a Spark NLP specific Pipeline class equivalent to the Spark ML Pipeline, which achieves fast results when dealing with small amounts of data.

In [None]:
sample_text = "John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment."

lmodel = nlp.LightPipeline(model)

res = lmodel.fullAnnotate(sample_text)

The original text and the output of the `Replacer` annotator are shown below. Each detected name has been replaced with a fake name.

In [None]:
print("Original text.  : ", res[0]['sentence'][0].result)
print("Obfuscated text : ", res[0]['obfuscated_document_name'][0].result)

Original text.  :  John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.
Obfuscated text :  Joseeduardo is a 62 y.o. patient admitted. Mr. Teigan was seen by attending physician Dr. Mayson and was scheduled for emergency assessment.


This time, set `setUseReplacement` to **False** and compare the output.

In [None]:
replacer_name = medical.Replacer()\
  .setInputCols("replacement","sentence")\
  .setOutputCol("obfuscated_document_name")\
  .setUseReplacement(False)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter_name,
    nameChunkObfuscator,
    replacer_name,
    ])

model = nlpPipeline.fit(empty_data)

result = model.transform(empty_data)

In [None]:
lmodel = nlp.LightPipeline(model)

res = lmodel.fullAnnotate(sample_text)

In [None]:
print("Original text.  : ", res[0]['sentence'][0].result)
print("Obfuscated text : ", res[0]['obfuscated_document_name'][0].result)

Original text.  :  John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.
Obfuscated text :  John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.


As you can see, the names in the text are left unchanged because replacement is disabled.

## **💻 Date Normalizer Pipeline**


The `DateNormalizer` annotator transforms date mentions to a common standard format: YYYY/MM/DD. This is useful when the data comes from different sources, sometimes from different countries, that use different formats to represent dates.
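
To make the target format concrete, here is a small standard-library sketch that converts a few of the date styles used in this notebook to YYYY/MM/DD. It is not how `DateNormalizer` works internally. The pattern list, `DEFAULT_DAY`, and `normalize_date` are hypothetical, and the choice of day 15 for month-only dates simply mirrors the `2018/11/15` value visible in the output table of this notebook.

```python
from datetime import datetime
from typing import Optional

# (input pattern, whether the pattern carries a day of the month)
PATTERNS = [
    ("%m/%d/%Y", True),   # e.g. 08/02/2018, read month-first as in the notebook output
    ("%d.%m.%Y", True),   # e.g. 13.04.1999
    ("%m/%Y", False),     # e.g. 11/2018 (no day given)
]
DEFAULT_DAY = 15  # assumed fill-in day when the mention has no day


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as YYYY/MM/DD, or None if no known pattern matches."""
    for pattern, has_day in PATTERNS:
        try:
            parsed = datetime.strptime(raw.strip(), pattern)
        except ValueError:
            continue  # try the next pattern
        if not has_day:
            parsed = parsed.replace(day=DEFAULT_DAY)
        return parsed.strftime("%Y/%m/%d")
    return None


for mention in ["08/02/2018", "13.04.1999", "11/2018"]:
    print(mention, "->", normalize_date(mention))
```

A fixed pattern list like this cannot resolve genuinely ambiguous inputs such as `08/02/2018`; the order of the patterns decides whether it is read month-first or day-first.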

For relative dates (next year, past month, etc.), you can define an anchor date from which the normalized date is computed by setting the `anchorDateYear`, `anchorDateMonth`, and `anchorDateDay` parameters.
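
The idea of an anchor date can be sketched with the standard library: a relative phrase is turned into an offset that is applied to the anchor. The phrase table and the helper names `shift_months` and `resolve_relative` below are hypothetical, and the annotator may resolve such phrases differently.

```python
import calendar
from datetime import date
from typing import Optional


def shift_months(anchor: date, months: int) -> date:
    """Move `anchor` by whole months, clamping the day to the target month's length."""
    index = anchor.year * 12 + (anchor.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    day = min(anchor.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Hypothetical phrase table; offsets are expressed in months.
RELATIVE_OFFSETS = {
    "next year": 12,
    "past year": -12,
    "next month": 1,
    "past month": -1,
}


def resolve_relative(phrase: str, anchor: date) -> Optional[str]:
    """Resolve a relative phrase against `anchor` and format it as YYYY/MM/DD."""
    offset = RELATIVE_OFFSETS.get(phrase.lower().strip())
    if offset is None:
        return None
    return shift_months(anchor, offset).strftime("%Y/%m/%d")


anchor = date(2020, 1, 31)  # plays the role of the anchor year, month, and day
print(resolve_relative("next month", anchor))  # 2020/02/29
print(resolve_relative("past year", anchor))   # 2019/01/31
```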

The `Replacer` annotator takes the output of the `DateNormalizer` annotator, which holds the standard-format version of each extracted date entity, and produces a new text in which the original date entity is replaced with this value.

In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on the n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("date_chunk")\
    .setWhiteList(["DATE"])

date_normalizer = medical.DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date')

replacer = medical.Replacer()\
    .setInputCols(["normalized_date","document"])\
    .setOutputCol("replaced_document")\
    .setUseReplacement(True)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      date_normalizer,
      replacer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]


Let's define 7 texts, normalize their date entities, and then use the `Replacer` annotator to replace the original dates in each document with the normalized ones.

In [None]:
dates = [
'She has a history of pericarditis and pericardectomy on 08/02/2018 and developed a cough with right-sided chest pain.' ,
'She has been receiving gemcitabine and she receives three cycles of this with the last one being given on 11/2018. ',
'She was last seen in the clinic on 11/01/2018by Dr. Y.',
'Chris Brown was discharged on 12Mar2021',
'Last INR was on Tuesday, Jan 30, 2018, and her INR was 2.3. 2. Amiodarone 100 mg p.o. daily. ',
'We reviewed the pathology obtained from the pericardectomy on 13.04.1999, which was diagnostic of mesothelioma', 
'A review of her CT scan on 3 April2020 prior to her pericardectomy, already shows bilateral plural effusions. ',
]

df_dates = spark.createDataFrame(dates,StringType()).toDF('text')

df_dates.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------+
|She has a history of pericarditis and pericardectomy on 08/02/2018 and developed a cough with right-sided chest pain.|
|She has been receiving gemcitabine and she receives three cycles of this with the last one being given on 11/2018.   |
|She was last seen in the clinic on 11/01/2018by Dr. Y.                                                               |
|Chris Brown was discharged on 12Mar2021                                                                              |
|Last INR was on Tuesday, Jan 30, 2018, and her INR was 2.3. 2. Amiodarone 100 mg p.o. daily.                         |
|We reviewed the pathology obtained from

In [None]:
result = model.transform(df_dates)

result_df = result.select("text",F.explode(F.arrays_zip(result.date_chunk.result, 
                                                        result.normalized_date.result,
                                                        result.replaced_document.result)).alias("cols")) \
                  .select("text",F.expr("cols['1']").alias("normalized_date"),
                                 F.expr("cols['2']").alias("replaced_document"))
                  
result_df.show(truncate=100)

+----------------------------------------------------------------------------------------------------+---------------+----------------------------------------------------------------------------------------------------+
|                                                                                                text|normalized_date|                                                                                   replaced_document|
+----------------------------------------------------------------------------------------------------+---------------+----------------------------------------------------------------------------------------------------+
|She has a history of pericarditis and pericardectomy on 08/02/2018 and developed a cough with rig...|     2018/08/02|She has a history of pericarditis and pericardectomy on 2018/08/02 and developed a cough with rig...|
|She has been receiving gemcitabine and she receives three cycles of this with the last one being ...|     2018/11/15|Sh

The dataframe above shows the original texts, the normalized dates, and the `replaced_document` column, which contains the texts with the normalized dates substituted in.