![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/)

# DocumentHashCoder

In this notebook, we will examine the `DocumentHashCoder` annotator.

The `DocumentHashCoder()` annotator determines date shift information for deidentification purposes.

This annotator hashes the specified column and creates a new document column containing the day shift information.


**📖 Learning Objectives:**

1. Understand how to shift days in Deidentification tasks by using `DocumentHashCoder`.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/4.Clinical_DeIdentification.ipynb)

Python Documentation: [DocumentHashCoder](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/deid/doccument_hashcoder/index.html#sparknlp_jsl.annotator.deid.doccument_hashcoder.DocumentHashCoder.seed)

Scala Documentation: [DocumentHashCoder](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/deid/DocumentHashCoder.html)


## **📜 Background**


This annotator can replace dates in a column of `DOCUMENT` type according to the hash code of any other column. It hashes the specified column and creates a new document column containing the day shift information. The `DeIdentification` annotator then deidentifies the document using the shifted date information.

If the specified column contains strings that can be parsed as integers, the annotator uses those numbers to shift the dates accordingly.
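To make the idea concrete, here is a minimal sketch of how a stable hash of an ID could be mapped to a day shift. It is not the library's implementation: the helper name `hash_to_shift`, the choice of SHA-256, and the symmetric range around zero are all assumptions made for illustration.

```python
import hashlib

def hash_to_shift(patient_id: str, range_days: int = 60) -> int:
    """Hypothetical helper: map an ID to a shift in [-range_days, range_days]."""
    # A cryptographic digest is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(patient_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    # Fold the integer into 2 * range_days + 1 buckets, then center them around zero.
    return value % (2 * range_days + 1) - range_days

for pid in ["A001", "A001", "A002"]:
    print(pid, hash_to_shift(pid))
```

The property that matters is that the same ID always yields the same shift, so every document belonging to one patient moves by the same number of days.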

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load the license data and start a session with all JARs the user has access to
spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T

## **🖨️ Input/Output Annotation Types**
- Input: `DOCUMENT`
- Output: `DOCUMENT`

## **🔎 Parameters**


- `setPatientIdColumn` *(String)*: Sets the name of the column containing the patient ID.

- `setDateShiftColumn` *(String)*: Sets column to be used for hash or predefined shift.

- `setNewDateShift` *(String)*: Sets the name of the new column that stores the number of days by which the dates are shifted.

- `setRangeDays` *(int)*: Sets the range of dates to be sampled from.

- `setSeed` *(int)*: Sets the seed for random number generator.

### DocumentHashCoder with Deidentification

We will generate a sample deidentification pipeline with `DocumentHashCoder` to see the capabilities of the annotator.

In [None]:
data = pd.DataFrame(
    {'patientID' : ['A001', 'A001', 'A002', 'A002'],
     'text' : ['Chris Brown was discharged on 10/02/2022',
               'Mark White was discharged on 02/28/2020',
               'John was discharged on 03/15/2022',
               'John Moore was discharged on 12/31/2022'
              ]
    }
)

my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate = False)

+---------+----------------------------------------+
|patientID|text                                    |
+---------+----------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|
|A001     |Mark White was discharged on 02/28/2020 |
|A002     |John was discharged on 03/15/2022       |
|A002     |John Moore was discharged on 12/31/2022 |
+---------+----------------------------------------+



### `setPatientIdColumn`

This parameter specifies the name of the column containing the patient ID.

It is used when we want to shift the days according to the ID column, so that all documents belonging to the same patient receive the same shift.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setNewDateShift("shift_days")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = medical.NerModel\
    .pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
    .setInputCols(["document2","token", "word_embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")

de_identification = medical.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "document2"]) \
    .setOutputCol("deid_text") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter,
    de_identification

])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")

pipeline_model = pipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented download started this may take some time.
[OK!]


Checking the results

In [None]:
output = pipeline_model.transform(my_input_df)

output.select('patientID','text', 'deid_text.result').show(truncate = False)

+---------+----------------------------------------+---------------------------------------------+
|patientID|text                                    |result                                       |
+---------+----------------------------------------+---------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Aldona Bar was discharged on 05/18/2022]    |
|A001     |Mark White was discharged on 02/28/2020 |[Leta Speller was discharged on 10/14/2019]  |
|A002     |John was discharged on 03/15/2022       |[Lonia Blood was discharged on 01/19/2022]   |
|A002     |John Moore was discharged on 12/31/2022 |[Murriel Hopper was discharged on 11/06/2022]|
+---------+----------------------------------------+---------------------------------------------+



As seen above, the dates were shifted based on the patient IDs: both `A001` documents moved by the same number of days, as did both `A002` documents.

### `setNewDateShift`

After transformation, `DocumentHashCoder` creates a new column containing the number of days by which the dates are shifted. The `setNewDateShift` parameter specifies the name of this new column.

In [None]:
documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setNewDateShift("shift_days")

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter,
    de_identification

])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")

pipeline_model = pipeline.fit(empty_data)

Checking the results

In [None]:
output = pipeline_model.transform(my_input_df)

output.select('patientID','text', 'deid_text.result', 'shift_days').show(truncate = False)

+---------+----------------------------------------+---------------------------------------------+----------+
|patientID|text                                    |result                                       |shift_days|
+---------+----------------------------------------+---------------------------------------------+----------+
|A001     |Chris Brown was discharged on 10/02/2022|[Aldona Bar was discharged on 01/06/2023]    |96        |
|A001     |Mark White was discharged on 02/28/2020 |[Leta Speller was discharged on 06/03/2020]  |96        |
|A002     |John was discharged on 03/15/2022       |[Lonia Blood was discharged on 02/16/2022]   |-27       |
|A002     |John Moore was discharged on 12/31/2022 |[Murriel Hopper was discharged on 12/04/2022]|-27       |
+---------+----------------------------------------+---------------------------------------------+----------+



As seen above, under the "shift_days" column, we can see how many days were shifted for the corresponding patient.
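As a sanity check, the reported `shift_days` values can be applied to the original dates with plain Python date arithmetic. The sketch below assumes the `MM/DD/YYYY` format used in the sample texts; `shift_date` is a hypothetical helper and not part of the library.

```python
from datetime import datetime, timedelta

def shift_date(date_str: str, days: int, fmt: str = "%m/%d/%Y") -> str:
    """Shift a date string by a signed number of days, keeping its format."""
    shifted = datetime.strptime(date_str, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)

# (original date, shift_days) pairs taken from the table above
rows = [("10/02/2022", 96), ("02/28/2020", 96), ("03/15/2022", -27), ("12/31/2022", -27)]
for original, days in rows:
    print(original, days, shift_date(original, days))
```

The printed dates match the obfuscated dates in the `result` column, which confirms that `shift_days` is a signed offset in days.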

### `setRangeDays`

This parameter sets the range of days from which the shift is sampled.

Now, we will call `setRangeDays(60)` to limit the range of the shifted days.

In [None]:
documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setRangeDays(60)\
    .setNewDateShift("shift_days")


pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter,
    de_identification

])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")

pipeline_model = pipeline.fit(empty_data)

Checking the results

In [None]:
output = pipeline_model.transform(my_input_df)

output.select('patientID','text', 'deid_text.result', 'shift_days').show(truncate = False)

+---------+----------------------------------------+---------------------------------------------+----------+
|patientID|text                                    |result                                       |shift_days|
+---------+----------------------------------------+---------------------------------------------+----------+
|A001     |Chris Brown was discharged on 10/02/2022|[Aldona Bar was discharged on 10/01/2022]    |-1        |
|A001     |Mark White was discharged on 02/28/2020 |[Leta Speller was discharged on 02/27/2020]  |-1        |
|A002     |John was discharged on 03/15/2022       |[Lonia Blood was discharged on 02/20/2022]   |-23       |
|A002     |John Moore was discharged on 12/31/2022 |[Murriel Hopper was discharged on 12/08/2022]|-23       |
+---------+----------------------------------------+---------------------------------------------+----------+



As seen above, with `setRangeDays(60)` the shifts (-1 and -23 days) stay within the 60-day range.

### `setSeed`

This parameter sets the seed for the random number generator.

Now, we will fit and transform the pipeline with `setSeed(100)` twice in a row to see the effect of the parameter.

In [None]:
documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setRangeDays(100)\
    .setNewDateShift("shift_days")\
    .setSeed(100)


pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter,
    de_identification

])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")

pipeline_model = pipeline.fit(empty_data)

In [None]:
output = pipeline_model.transform(my_input_df)

output.select('patientID','text', 'deid_text.result', 'shift_days').show(truncate = False)

+---------+----------------------------------------+---------------------------------------------+----------+
|patientID|text                                    |result                                       |shift_days|
+---------+----------------------------------------+---------------------------------------------+----------+
|A001     |Chris Brown was discharged on 10/02/2022|[Aldona Bar was discharged on 09/27/2022]    |-5        |
|A001     |Mark White was discharged on 02/28/2020 |[Leta Speller was discharged on 02/23/2020]  |-5        |
|A002     |John was discharged on 03/15/2022       |[Lonia Blood was discharged on 04/13/2022]   |29        |
|A002     |John Moore was discharged on 12/31/2022 |[Murriel Hopper was discharged on 01/29/2023]|29        |
+---------+----------------------------------------+---------------------------------------------+----------+



Now, we will fit/transform the pipeline again and see if the results are consistent.

In [None]:
pipeline_model = pipeline.fit(empty_data)

output = pipeline_model.transform(my_input_df)
output.select('patientID','text', 'deid_text.result', 'shift_days').show(truncate = False)

+---------+----------------------------------------+---------------------------------------------+----------+
|patientID|text                                    |result                                       |shift_days|
+---------+----------------------------------------+---------------------------------------------+----------+
|A001     |Chris Brown was discharged on 10/02/2022|[Aldona Bar was discharged on 09/27/2022]    |-5        |
|A001     |Mark White was discharged on 02/28/2020 |[Leta Speller was discharged on 02/23/2020]  |-5        |
|A002     |John was discharged on 03/15/2022       |[Lonia Blood was discharged on 04/13/2022]   |29        |
|A002     |John Moore was discharged on 12/31/2022 |[Murriel Hopper was discharged on 01/29/2023]|29        |
+---------+----------------------------------------+---------------------------------------------+----------+



As seen above, the days were shifted consistently across the two runs because we set the seed with `setSeed()`.
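The effect of a fixed seed can be illustrated outside Spark with Python's standard `random` module. This is only an analogy under stated assumptions: `seeded_shift` is a hypothetical helper, the way the seed and the ID are combined is invented for the example, and the values it produces will not match the annotator's output.

```python
import random

def seeded_shift(patient_id: str, seed: int, range_days: int = 100) -> int:
    """Hypothetical helper: draw a repeatable shift for one ID under a given seed."""
    # Seeding with a string is deterministic across runs in Python 3.
    rng = random.Random(f"{seed}:{patient_id}")
    return rng.randint(-range_days, range_days)

first_run = [seeded_shift(pid, seed=100) for pid in ["A001", "A002"]]
second_run = [seeded_shift(pid, seed=100) for pid in ["A001", "A002"]]
print(first_run == second_run)  # → True
```

With the same seed and the same IDs, both runs produce identical shifts, which mirrors the two identical tables above.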

### `setDateShiftColumn`

So far, we have shifted days according to the ID column. We can also specify the shift values through another column by using `setDateShiftColumn`.

Generating a sample dataframe with a date shifting column

In [None]:
data = pd.DataFrame(
    {'patientID' : ['A001', 'A001', 'A002', 'A002'],
     'text' : ['Chris Brown was discharged on 10/02/2022',
               'Mark White was discharged on 02/28/2020',
               'John was discharged on 03/15/2022',
               'John Moore was discharged on 12/31/2022'
              ],
     'dateshift' : ['10', '10', '30', '30']
    }
)

my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate=False)

+---------+----------------------------------------+---------+
|patientID|text                                    |dateshift|
+---------+----------------------------------------+---------+
|A001     |Chris Brown was discharged on 10/02/2022|10       |
|A001     |Mark White was discharged on 02/28/2020 |10       |
|A002     |John was discharged on 03/15/2022       |30       |
|A002     |John Moore was discharged on 12/31/2022 |30       |
+---------+----------------------------------------+---------+



Now, we will set `setDateShiftColumn("dateshift")`.

In [None]:
documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter,
    de_identification

])

empty_data = spark.createDataFrame([["", "", ""]]).toDF("patientID","text", "dateshift")

pipeline_col_model = pipeline.fit(empty_data)

Checking results

In [None]:
output = pipeline_col_model.transform(my_input_df)

output.select('text', 'dateshift', 'deid_text.result').show(truncate = False)

+----------------------------------------+---------+---------------------------------------------+
|text                                    |dateshift|result                                       |
+----------------------------------------+---------+---------------------------------------------+
|Chris Brown was discharged on 10/02/2022|10       |[Aldona Bar was discharged on 10/12/2022]    |
|Mark White was discharged on 02/28/2020 |10       |[Leta Speller was discharged on 03/09/2020]  |
|John was discharged on 03/15/2022       |30       |[Lonia Blood was discharged on 04/14/2022]   |
|John Moore was discharged on 12/31/2022 |30       |[Murriel Hopper was discharged on 01/30/2023]|
+----------------------------------------+---------+---------------------------------------------+



As seen above, we shifted days according to the "dateshift" column.
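The rule described in the Background section (use the column value directly when it parses as an integer, otherwise derive the shift from a hash) can be sketched as follows. `resolve_shift` is a hypothetical helper, and the SHA-256 fallback is an assumption chosen for illustration rather than the library's actual hashing scheme.

```python
import hashlib

def resolve_shift(value: str, range_days: int = 60) -> int:
    """Hypothetical helper: use the value as the shift if it parses as an integer,
    otherwise fall back to a hash-derived shift in [-range_days, range_days]."""
    try:
        return int(value.strip())
    except ValueError:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % (2 * range_days + 1) - range_days

print(resolve_shift("10"))    # → 10
print(resolve_shift("30"))    # → 30
print(resolve_shift("A001"))  # a hash-derived value in [-60, 60]
```

In the example above, the `dateshift` values `'10'` and `'30'` are numeric strings, so they are applied as-is: 10/02/2022 becomes 10/12/2022 and 03/15/2022 becomes 04/14/2022.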