![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **ReIdentification**

This notebook covers the parameters and usage of `ReIdentification`. This annotator reidentifies entities that were obfuscated by `DeIdentification`, so it requires the deidentification outputs as input: the deidentified document and the mappings column, whose name is set with `DeIdentification.setMappingsColumn`.

**📖 Learning Objectives:**

1. Background: Understand deidentification and how ReIdentification reverses it.

2. Colab setup

3. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Python Docs : [ReIdentification](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/deid/reIdentification/index.html)

- Scala Docs : [ReIdentification](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/deid/ReIdentification.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings).

## **📜 Background**


To understand ReIdentification, first recall what deidentification is: the task of anonymizing medical and health information in order to protect it. The library offers several deidentification techniques, such as masking and obfuscation. ReIdentification reverses this process: it uses the mappings saved by DeIdentification to restore the original entities in the deidentified text.
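The idea of masking text and later restoring it from saved mappings can be illustrated with a toy, pure-Python sketch. It is not how the library implements DeIdentification or ReIdentification; the helper names `mask_entities` and `restore_entities` are hypothetical, and the entities are supplied by hand instead of coming from an NER model.

```python
def mask_entities(text, entities):
    """Replace each (surface, label) entity with <LABEL> and record mappings.

    Entities must be listed in the order they appear in the text.
    Returns the masked text and a list of (start, end, original) tuples,
    where start/end are character offsets of the placeholder in the masked text.
    """
    parts = []
    mappings = []
    cursor = 0    # position in the original text
    out_len = 0   # length of the masked text built so far
    for surface, label in entities:
        begin = text.index(surface, cursor)
        prefix = text[cursor:begin]
        placeholder = f"<{label}>"
        parts.append(prefix)
        out_len += len(prefix)
        mappings.append((out_len, out_len + len(placeholder), surface))
        parts.append(placeholder)
        out_len += len(placeholder)
        cursor = begin + len(surface)
    parts.append(text[cursor:])
    return "".join(parts), mappings


def restore_entities(masked, mappings):
    """Put the original entities back using the saved offsets."""
    restored = masked
    # Work right to left so that earlier offsets stay valid.
    for start, end, original in sorted(mappings, reverse=True):
        restored = restored[:start] + original + restored[end:]
    return restored


text = "Dr. John Green saw the patient on 01/13/93."
masked, mappings = mask_entities(text, [("John Green", "NAME"), ("01/13/93", "DATE")])
restored = restore_entities(masked, mappings)
print(masked)
print(restored == text)
```

The point of the sketch is that the masked text alone is not enough to recover the original: the mappings have to be kept alongside it, which is the role of the mappings column in the pipeline.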

## **🎬 Colab Setup**

This module is licensed, so you need a valid license JSON file.

Installing johnsnowlabs:

In [1]:
!pip install -q johnsnowlabs

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.8/84.8 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m212.4/212.4 MB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m489.8/489.8 kB[0m [31m37.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m641.3/641.3 kB[0m [31m38.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m95.4/95.4 kB[0m [31m9.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m198.6/198.6 kB[0m [31m18.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m66.9/66.9 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m83.5/83.5 kB[0m [31m8.6 MB/s[0

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_7566 - Copie.json to spark_nlp_for_healthcare_spark_ocr_7566 - Copie.json


In [2]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

Downloading license...
Licenses extracted successfully
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.4.4-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.4.4-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.4.4.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.4.4.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.4.4-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==4.4.4 installed! ✅ Heal the planet with NLP! 


Starting spark session:

In [3]:
from johnsnowlabs import nlp, medical
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline  # imported explicitly because Pipeline is used below
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.4, 💊Spark-Healthcare==4.4.4, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT` , `CHUNK`

- Output: `DOCUMENT`

## **🔎 Example**


For details on how entities are deidentified, please refer to the example for the `DeIdentification` class.
ReIdentification needs deidentified input, so let's first create a deidentification pipeline.


In [4]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, splits each document into sentences
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits each sentence into tokens
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on the n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")


deidentification = DeIdentification() \
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("deidentified") \
      .setMode("mask")\
      .setReturnEntityMappings(True)  # return an extra column holding the mappings between masked/obfuscated entities and the original entities
      #.setMappingsColumn("MappingCol")  # rename the mappings column; the default name is 'aux'

deidPipeline = Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      deidentification])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model_deid = deidPipeline.fit(empty_data)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_generic_augmented download started this may take some time.
[OK!]


Let's test the deidentification pipeline to see the deidentified text:

In [7]:
text= """Name : Hendrickson, Ora, Record date: 2093-01-13, Age: 25, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco."""

In [8]:
result = model_deid.transform(spark.createDataFrame([[text]]).toDF("text"))

In [9]:
result.select('deidentified.result').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Name : <NAME>, Record date: <DATE>, Age: <AGE>, # <CONTACT>., Dr. <NAME>, ID: <ID>, IP 203.120.223.13., He is a <AGE> male was admitted to the <LOCATION

As we can see, protected health information entities such as NAME, DATE and AGE are masked.
To get the original text back, we can use the ReIdentification annotator as shown below.

In [10]:
reIdentification = ReIdentification()\
    .setInputCols(["aux","deidentified"])\
    .setOutputCol("original")

In [11]:
reid_result = reIdentification.transform(result)

In [14]:
print("Original text:")
print(text + "\n")

print("Recuperated text using ReIdentification after Deidentification:")

reid_result.select('original.result').show(truncate=False)

Original text:
Name : Hendrickson, Ora, Record date: 2093-01-13, Age: 25, # 719435. Dr. John Green, ID: 1231511863, IP 203.120.223.13. He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93. Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B. Phone (302) 786-5227, 0295 Keats Street, San Francisco.

Recuperated text using ReIdentification after Deidentification:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                       

As we can see, the original text and the text recovered by ReIdentification after deidentification are identical.
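Comparing two long strings by eye is error-prone, so the same check can be done programmatically. The sketch below is a minimal, pure-Python helper; `texts_match` is a hypothetical name, and it assumes the reidentified output is a list of sentence strings that can be joined with spaces and compared after whitespace normalization.

```python
def texts_match(original_text, reidentified_sentences):
    """Return True if the joined sentences equal the original text,
    ignoring differences in whitespace."""

    def normalize(s):
        # Collapse any run of whitespace into a single space.
        return " ".join(s.split())

    return normalize(original_text) == normalize(" ".join(reidentified_sentences))


# Stand-in data for illustration only; not output from the pipeline above.
sample_text = "Dr. John Green saw the patient.  Phone (302) 786-5227."
sample_sentences = ["Dr. John Green saw the patient.", "Phone (302) 786-5227."]
print(texts_match(sample_text, sample_sentences))
```

In this notebook, the sentence list could be collected with something like `reid_result.select('original.result').first()[0]` and passed in together with `text`, assuming the column holds one string per sentence.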