![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Deidentification Utils
This notebook showcases how to use the Deidentification module in the `johnsnowlabs` library as a low-code helper for carrying out deidentification tasks.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/11.2.Deidentification_Utility_Module.ipynb)

# Installation

In [None]:
! pip install -q johnsnowlabs

## Automatic Installation
Using my.johnsnowlabs.com SSO

In [2]:
from johnsnowlabs import nlp, finance

# nlp.install(force_browser=True)

## Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license via email, or are using Safari, you may need to upload the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [4]:
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7162 (6).json
🚨 Outdated OCR Secrets in license file. Version=4.4.2 but should be Version=4.4.1
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.4.1-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.4.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.4.1.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.4.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7162 (6).json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.4.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==4.4.2 installed! ✅ Heal the planet with NLP! 


# Starting

In [5]:
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7162 (6).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.2, running on ⚡ PySpark==3.1.2


In [6]:
import pandas as pd
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

# Module

**Description of Parameters:**<br/>

---

- `custom_pipeline` : Spark NLP PipelineModel, optional. Custom PipelineModel to be used for deidentification, by default None
- `ner_chunk` : str, optional. Final chunk column name of the custom pipeline that will be deidentified, by default "ner_chunk"
- `fields` : dict, optional. Fields to be deidentified and their deidentification modes, by default {"text": "mask"}
- `sentence` : str, optional. Sentence column name of the given custom pipeline, by default "sentence"
- `token` : str, optional. Token column name of the given custom pipeline, by default "token"
- `document` : str, optional. Document column name of the given custom pipeline, by default "document"
- `masking_policy` : str, optional. Masking policy, by default "entity_labels"
- `fixed_mask_length` : int, optional. Fixed mask length, by default 4
- `obfuscate_date` : bool, optional. Obfuscate date, by default True
- `obfuscate_ref_source` : str, optional. Obfuscate reference source, by default "faker"
- `obfuscate_ref_file_path` : str, optional. Obfuscate reference file path, by default None
- `age_group_obfuscation` : bool, optional. Age group obfuscation, by default False
- `age_ranges` : list, optional. Age ranges for obfuscation, by default [1, 4, 12, 20, 40, 60, 80]
- `shift_days` : bool, optional. Shift days, by default False
- `number_of_days` : int, optional. Number of days, by default None
- `documentHashCoder_col_name` : str, optional. Document hash coder column name, by default "documentHash"
- `date_tag` : str, optional. Date tag, by default "DATE"
- `language` : str, optional. Language, by default "en"
- `region` : str, optional. Region, by default "us"
- `unnormalized_date` : bool, optional. Unnormalized date, by default False
- `unnormalized_mode` : str, optional. Unnormalized mode, by default "mask"
- `id_column_name` : str, optional. ID column name, by default "id"
- `date_shift_column_name` : str, optional. Date shift column name, by default "date_shift"
- `separator` : str, optional. Separator of the input csv file, by default `"\t"`
- `input_file_path` : str, optional. Input file path, by default None
- `output_file_path` : str, optional. Output file path, by default 'deidentified.csv'

**Returns**

---

Spark DataFrame: Spark DataFrame with deidentified text <br/>
csv/json file: A deidentified file.

In [7]:
text= """
Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:"""

df= spark.createDataFrame([[text]]).toDF("text")

In [8]:
df_pd = df.toPandas()
df_pd.to_csv("deid_data.csv", sep='@', index=False)

# With a custom pipeline

In [9]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained('finner_sec_10k_summary', 'en', 'finance/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_sec_10k_summary download started this may take some time.
[OK!]


In [10]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

In [11]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|
Commission file ...|[{document, 0, 77...|[{document, 1, 10...|[{token, 1, 10, C...|[{word_embeddings...|[{named_entity, 1...|[{chunk, 24, 32, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [12]:
result.select(F.explode('sentence')).show(truncate=50)

+--------------------------------------------------+
|                                               col|
+--------------------------------------------------+
|{document, 1, 102, Commission file number 000-1...|
|{document, 105, 290, (Exact name of registrant ...|
|{document, 292, 317, EmployerIdentification No....|
|{document, 318, 548, )
2655 Seely Avenue, Build...|
|{document, 549, 598, (b) of the Act:
Title of E...|
|{document, 599, 762, (s)
Names of Each Exchange...|
|{document, 763, 777, (g) of the Act:, {sentence...|
+--------------------------------------------------+



In [13]:
from pyspark.sql import functions as F

result_df = result.select(F.explode(F.arrays_zip(result.token.result, 
                                                 result.ner.result)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"))

In [14]:
result_df.show()

+--------------------+---------+
|               token|ner_label|
+--------------------+---------+
|          Commission|        O|
|                file|        O|
|              number|        O|
|           000-15867|    B-CFN|
|_________________...|        O|
|             CADENCE|    B-ORG|
|              DESIGN|    I-ORG|
|             SYSTEMS|    I-ORG|
|                   ,|    I-ORG|
|                 INC|    I-ORG|
|                   .|        O|
|                   (|        O|
|               Exact|        O|
|                name|        O|
|                  of|        O|
|          registrant|        O|
|                  as|        O|
|           specified|        O|
|                  in|        O|
|                 its|        O|
+--------------------+---------+
only showing top 20 rows



In [15]:
result_df.select("token", "ner_label").groupBy('ner_label').count().orderBy('count', ascending=False).show(truncate=False)

+-------------------+-----+
|ner_label          |count|
+-------------------+-----+
|O                  |101  |
|I-ADDRESS          |10   |
|I-ORG              |4    |
|I-PHONE            |4    |
|I-STOCK_EXCHANGE   |3    |
|B-PHONE            |1    |
|B-STATE            |1    |
|B-CFN              |1    |
|B-ADDRESS          |1    |
|B-TICKER           |1    |
|B-TITLE_CLASS_VALUE|1    |
|I-TITLE_CLASS      |1    |
|B-IRS              |1    |
|B-ORG              |1    |
|B-STOCK_EXCHANGE   |1    |
|B-TITLE_CLASS      |1    |
+-------------------+-----+



## Default parameters

In [16]:
deid_implementor = finance.Deid(spark,
                               input_file_path="deid_data.csv",
                               output_file_path="deidentified.csv", 
                               custom_pipeline=model,
                               separator='@')

In [17]:
res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [18]:
res.show(n=50, truncate=False)

+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ID |text                                                                                                                                                                                                                                   |text_deidentified                                                                                                                                                                    |
+---+-----------------------------------------------------------------------------------------------------------------------------------------------------------

In [19]:
#checking saved output file
import pandas as pd
res_data= pd.read_csv("deidentified.csv")
res_data.head()

Unnamed: 0,ID,text,text_deidentified
0,0,Commission file number 000-15867 ____________...,Commission file number <CFN> ________________...
1,0,(Exact name of registrant as specified in its ...,(Exact name of registrant as specified in its ...
2,0,EmployerIdentification No.,EmployerIdentification No.
3,0,") 2655 Seely Avenue, Building 5, San Jose, Cal...",) <ADDRESS> 95134 (Address of Principal Exec...
4,0,(b) of the Act: Title of Each Class Trading Sy...,(b) of the Act: Title of Each Class Trading Sy...


## Mask options 


### same_length_chars

In [20]:
deid_implementor = finance.Deid(spark,
                                input_file_path="deid_data.csv",
                                output_file_path="deidentified.csv",
                                custom_pipeline=model,
                                fields={"text": "mask"}, masking_policy="same_length_chars")

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [21]:
res.show(truncate=120)

+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| ID|                                                                                                                    text|                                                                                                       text_deidentified|
+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|  0|                  Commission file number 000-15867  _____________________________________   CADENCE DESIGN SYSTEMS, INC.|                  Commission file number [*******]  _____________________________________   [*************************].|
|  0|(Ex

### fixed_length_chars

In [22]:
deid_implementor = finance.Deid(spark,
                                input_file_path="deid_data.csv",
                                output_file_path="deidentified.csv",
                                custom_pipeline=model,
                                fields={"text": "mask"}, masking_policy="fixed_length_chars", fixed_mask_length=2)

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [23]:
res.show(truncate=120)

+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| ID|                                                                                                                    text|                                                                                                       text_deidentified|
+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|  0|                  Commission file number 000-15867  _____________________________________   CADENCE DESIGN SYSTEMS, INC.|                                                  Commission file number **  _____________________________________   **.|
|  0|(Ex
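
The three masking policies seen so far (`entity_labels`, `same_length_chars`, `fixed_length_chars`) can be pictured with a small plain-Python sketch. This is not the library's implementation; `mask_chunk` is a hypothetical helper that only reproduces the replacements visible in the outputs above, and the handling of chunks shorter than three characters is an assumption.

```python
def mask_chunk(chunk: str, label: str, policy: str = "entity_labels",
               fixed_mask_length: int = 4) -> str:
    """Return the masked replacement for a single detected entity chunk.

    Hypothetical helper: it mimics the outputs shown in this notebook,
    not the internals of the Deidentification module.
    """
    if policy == "entity_labels":
        # Replace the chunk with its entity label, e.g. <CFN>
        return f"<{label}>"
    if policy == "same_length_chars":
        # Keep the original length; the two brackets count toward it.
        if len(chunk) <= 2:
            # Assumption: very short chunks are masked without brackets.
            return "*" * len(chunk)
        return "[" + "*" * (len(chunk) - 2) + "]"
    if policy == "fixed_length_chars":
        # Always emit the same number of asterisks, whatever the chunk length.
        return "*" * fixed_mask_length
    raise ValueError(f"unknown masking policy: {policy!r}")


for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_chunk("000-15867", "CFN", policy, fixed_mask_length=2))
```

`same_length_chars` keeps the layout of the document intact, while `fixed_length_chars` additionally hides how long the original value was.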

## Obfuscate Options

### obfuscate_ref_source="file"

In [24]:
obs_lines = """John Snow Labs#ORG
Sunset Boulevard, 1#ADDRESS
$0.02#TICKER_CLASS_VALUE
000-111-222#CFN
Common Stock#TITLE_CLASS
$0.02#TITLE_CLASS_VALUE
AMZN#TICKER
999-999-999#IRS
(901)133-44-11#PHONE
California#STATE
NASDAQ#STOCK_EXCHANGE"""


with open ('obfuscation.txt', 'w') as f:
  f.write(obs_lines)
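
Each line of the reference file written above pairs a replacement value with an entity label, separated by `#`. As a rough illustration of that layout, the sketch below groups such lines by label in plain Python. `parse_obfuscation_lines` is a hypothetical helper for inspecting or validating your own file before passing it to the module; it is not how the library reads the file.

```python
from collections import defaultdict


def parse_obfuscation_lines(lines, sep="#"):
    """Group `value#LABEL` lines into a {label: [values]} dictionary."""
    replacements = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        # Split on the LAST separator so values may themselves contain '#'.
        value, _, label = line.rpartition(sep)
        if not value or not label:
            raise ValueError(f"expected 'value{sep}LABEL', got: {line!r}")
        replacements[label].append(value)
    return dict(replacements)


sample = """John Snow Labs#ORG
Sunset Boulevard, 1#ADDRESS
AMZN#TICKER
(901)133-44-11#PHONE"""

table = parse_obfuscation_lines(sample.splitlines())
for label, values in table.items():
    print(label, values)
```

A check like this makes it easy to spot labels in the file that do not match any entity label produced by your NER model.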

In [25]:
df= spark.createDataFrame([[text]]).toDF("text")
df_pd= df.toPandas()
df_pd.to_csv("deid_obfs_data.csv", index=False)

In [26]:
deid_implementor = finance.Deid(spark,
                                input_file_path="deid_obfs_data.csv",
                                output_file_path="deidentified.csv",
                                custom_pipeline=model,
                                fields={"text": "obfuscate"}, obfuscate_ref_source="file",
                                obfuscate_ref_file_path="obfuscation.txt")

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [27]:
res.show(truncate=120)

+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| ID|                                                                                                                    text|                                                                                                       text_deidentified|
+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|  0|                  Commission file number 000-15867  _____________________________________   CADENCE DESIGN SYSTEMS, INC.|                             Commission file number 000-111-222  _____________________________________   John Snow Labs.|
|  0|(Ex

### obfuscate_ref_source="both"
This option draws replacement values from both the internal faker library and the reference file.

In [28]:
deid_implementor = finance.Deid(spark,
                                input_file_path="deid_obfs_data.csv",
                                output_file_path="deidentified.csv",
                                custom_pipeline=model,
                                fields={"text": "obfuscate"}, obfuscate_ref_source="both",
                                obfuscate_ref_file_path="obfuscation.txt")

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [29]:
res.show(truncate=120)

+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| ID|                                                                                                                    text|                                                                                                       text_deidentified|
+---+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|  0|                  Commission file number 000-15867  _____________________________________   CADENCE DESIGN SYSTEMS, INC.|                             Commission file number 000-111-222  _____________________________________   John Snow Labs.|
|  0|(Ex

### obfuscate_ref_source="faker"
You can also rely solely on our internal faker library, which has its own vocabulary. For example, you will see "Florida" as a state instead of "California", different phone numbers, and so on.

However, some entities may not be supported by faker as the number of models in the Financial NLP library increases. In that case, you will just see `<ENTITY>`.

If that happens, fall back to the mixed (`"both"`) or file-only (`"file"`) approach.
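
To make the difference between the three reference sources more concrete, here is a minimal sketch of a lookup with a label fallback. Everything in it is an assumption for illustration: `pick_replacement`, the two small vocabularies, and the file-before-faker order used for `"both"` are hypothetical and do not describe the library's internals.

```python
import random


def pick_replacement(label, file_values, faker_values, source="both", rng=None):
    """Pick a replacement for `label`, or fall back to the `<LABEL>` tag."""
    if source not in ("file", "faker", "both"):
        raise ValueError(f"unknown reference source: {source!r}")
    rng = rng or random.Random(0)
    pools = []
    if source in ("file", "both"):
        pools.append(file_values.get(label, []))   # assumed to be tried first
    if source in ("faker", "both"):
        pools.append(faker_values.get(label, []))
    for pool in pools:
        if pool:
            return rng.choice(pool)
    # No vocabulary covers this label: only the entity tag is left.
    return f"<{label}>"


# Hypothetical stand-ins for the reference file and the faker vocabulary.
file_values = {"ORG": ["John Snow Labs"], "CFN": ["000-111-222"]}
faker_values = {"STATE": ["Florida"]}

print(pick_replacement("STATE", file_values, faker_values, source="faker"))
print(pick_replacement("CFN", file_values, faker_values, source="faker"))
print(pick_replacement("CFN", file_values, faker_values, source="both"))
```

In this sketch the second call finds no faker entry for `CFN` and returns the bare tag, which is the situation where adding the entity to a reference file helps.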

### shifting days according to the ID column

In [30]:
import pandas as pd
data = pd.DataFrame(
    {'clientID' : ['A001', 'A001', 'A002', 'A002'],
     'text' : ['Chris Brown submitted a 10K filing on 10/02/2022', 
               'Mark White received a loan on 02/28/2020', 
               'Jeff Bezos stated he would increase the investment in cloud infrastructure by 1% by 03/15/2022',
               'Satya Nadella reported a decrease in Greenhouse Gas emissions by a 1% on 12/31/2022'
              ]
    }
)

my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate = False)

+--------+----------------------------------------------------------------------------------------------+
|clientID|text                                                                                          |
+--------+----------------------------------------------------------------------------------------------+
|A001    |Chris Brown submitted a 10K filing on 10/02/2022                                              |
|A001    |Mark White received a loan on 02/28/2020                                                      |
|A002    |Jeff Bezos stated he would increase the investment in cloud infrastructure by 1% by 03/15/2022|
|A002    |Satya Nadella reported a decrease in Greenhouse Gas emissions by a 1% on 12/31/2022           |
+--------+----------------------------------------------------------------------------------------------+



In [31]:
df_pd = my_input_df.toPandas()
df_pd.to_csv("deid_id_data.csv", index=False)

Custom pipeline with `DocumentHashCoder()`. 

In [32]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = finance.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("clientID")\
    .setRangeDays(100)\
    .setNewDateShift("shift_days")\
    .setSeed(100)

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document2", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
    .setInputCols(["document2", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["document2","token","ner"])\
    .setOutputCol("ner_chunk")
    
nlpPipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "clientID")

pipeline_model = nlpPipeline.fit(empty_data)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
finner_deid download started this may take some time.
[OK!]


In [33]:
deid_implementor = finance.Deid(spark,
                                input_file_path="deid_id_data.csv",
                                output_file_path="deidentified.csv",
                                custom_pipeline=pipeline_model,
                                fields={"text": "obfuscate"},
                                shift_days=True,
                                obfuscate_date=True, 
                                ner_chunk="ner_chunk",
                                token="token",
                                documenthashcoder_col_name="document2",
                                separator=",",
                                unnormalized_date=False)

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [34]:
res.show(truncate=False)

+---+----------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
|ID |text                                                                                          |text_deidentified                                                                           |
+---+----------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
|0  |Chris Brown submitted a 10K filing on 10/02/2022                                              |<PERSON> submitted a 10K filing on 09/27/2022                                               |
|1  |Mark White received a loan on 02/28/2020                                                      |<PERSON> received a loan on 02/23/2020                                                      |
|2  |Jeff Bezos stated he woul
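
In the output above, both rows belonging to `A001` are moved by the same number of days (five days back), so dates stay consistent within one client. The sketch below shows one way such an ID-based shift could be derived; it is an assumption for illustration and not the algorithm used by `DocumentHashCoder`. `shift_for_id` and `shift_date` are hypothetical helpers, and the symmetric range is a guess.

```python
import hashlib
from datetime import datetime, timedelta


def shift_for_id(record_id: str, range_days: int = 100, seed: int = 100) -> int:
    """Derive a repeatable day offset in [-range_days, range_days] from an ID."""
    digest = hashlib.sha256(f"{seed}:{record_id}".encode("utf-8")).digest()
    number = int.from_bytes(digest[:8], "big")
    return number % (2 * range_days + 1) - range_days


def shift_date(date_text: str, days: int, fmt: str = "%m/%d/%Y") -> str:
    """Shift a date string by `days` and return it in the same format."""
    shifted = datetime.strptime(date_text, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


# The same ID and seed always give the same offset, so every document
# of one client is shifted consistently.
offset = shift_for_id("A001")
print(offset == shift_for_id("A001"))

# Applying a five-day backward shift reproduces the dates shown above.
print(shift_date("10/02/2022", -5))
print(shift_date("02/28/2020", -5))
```

Deriving the offset from the ID rather than drawing it at random per row preserves the intervals between events of the same client.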

### shifting days according to specified values, for numeric (XX/XX/XXXX) or textual (June 10th, 2023) date formats

In [41]:
data = pd.DataFrame(
    {'clientID' : ['A001', 'A001', 'A002', 'A002'],
     'text' : ['Chris Brown submitted a 10K filing on June 10th, 2023', 
               'Mark White received a loan on August 8th, 2008', 
               'Jeff Bezos stated he would increase the investment in cloud infrastructure by 1% by 03/15/2022',
               'Satya Nadella reported a decrease in Greenhouse Gas emissions by a 1% on 12/31/2022'
              ],
     'dateshift' : ['10', '-2', '30', '-8']
    }
)

my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate=False)

+--------+----------------------------------------------------------------------------------------------+---------+
|clientID|text                                                                                          |dateshift|
+--------+----------------------------------------------------------------------------------------------+---------+
|A001    |Chris Brown submitted a 10K filing on June 10th, 2023                                         |10       |
|A001    |Mark White received a loan on August 8th, 2008                                                |-2       |
|A002    |Jeff Bezos stated he would increase the investment in cloud infrastructure by 1% by 03/15/2022|30       |
|A002    |Satya Nadella reported a decrease in Greenhouse Gas emissions by a 1% on 12/31/2022           |-8       |
+--------+----------------------------------------------------------------------------------------------+---------+



In [42]:
df_pd = my_input_df.toPandas()
df_pd.to_csv("deid_specific_data.csv", index=False)

In [43]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = finance.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document2", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
    .setInputCols(["document2", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["document2","token","ner"])\
    .setOutputCol("ner_chunk")
    
nlpPipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

empty_data = spark.createDataFrame([["", "", ""]]).toDF("clientID", "text", "dateshift")

pipeline_col_model = nlpPipeline.fit(empty_data)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
finner_deid download started this may take some time.
[OK!]


In [44]:
deid_implementor = finance.Deid(spark,
                                input_file_path="deid_specific_data.csv",
                                separator=",",
                                output_file_path="deidentified.csv",
                                custom_pipeline=pipeline_col_model,
                                fields={"text": "obfuscate"},
                                shift_days=True,
                                obfuscate_date=True, 
                                ner_chunk="ner_chunk",
                                token="token",
                                documenthashcoder_col_name="document2")

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [45]:
res.show(truncate=False)

+---+----------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
|ID |text                                                                                          |text_deidentified                                                                           |
+---+----------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
|0  |Chris Brown submitted a 10K filing on June 10th, 2023                                         |<PERSON> submitted a 10K filing on June 20th, 2023                                          |
|1  |Mark White received a loan on August 8th, 2008                                                |<PERSON> received a loan on August 6th, 2008                                                |
|2  |Jeff Bezos stated he woul
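
Here the shift comes from the `dateshift` column, and the shifted date keeps the format it had in the text. The plain-Python sketch below mimics that behaviour for the two formats used in this example. `shift_date_text` and `ordinal` are hypothetical helpers written for illustration, not part of the library, and only these two formats are handled.

```python
import re
from datetime import datetime, timedelta

# Matches textual dates such as "June 10th, 2023".
_TEXTUAL = re.compile(r"^([A-Z][a-z]+) (\d{1,2})(?:st|nd|rd|th), (\d{4})$")


def ordinal(day: int) -> str:
    """Return the day with its English ordinal suffix, e.g. 20 -> '20th'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def shift_date_text(date_text: str, days: int) -> str:
    """Shift a date by `days`, writing it back in its original format."""
    match = _TEXTUAL.match(date_text)
    if match:
        month, day, year = match.groups()
        parsed = datetime.strptime(f"{month} {day} {year}", "%B %d %Y")
        shifted = parsed + timedelta(days=days)
        return f"{shifted.strftime('%B')} {ordinal(shifted.day)}, {shifted.year}"
    # Otherwise assume the numeric MM/DD/YYYY format.
    parsed = datetime.strptime(date_text, "%m/%d/%Y")
    return (parsed + timedelta(days=days)).strftime("%m/%d/%Y")


print(shift_date_text("June 10th, 2023", 10))
print(shift_date_text("August 8th, 2008", -2))
```

With the shifts `10` and `-2` from the first two rows, the sketch yields the same dates as the `text_deidentified` column above.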

### unnormalized date formats

In [46]:
import pandas as pd

data = pd.DataFrame(
    {'clientID' : ['A001', 'A001', 'A002', 'A002'],
     'text' : ['Chris Brown submitted a 10K filing on 10/02/2022', 
               'Mark White received a loan on 3Apr2022', 
               'Jeff Bezos stated he would increase the investment in cloud infrastructure by 1% by 03/15/2022',
               'Satya Nadella reported a decrease in Greenhouse Gas emissions by a 1% on 12/31/2022'
              ],
     'dateshift' : ['10', '-2', '30', '-8']
    }
)


my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate=False)

+--------+----------------------------------------------------------------------------------------------+---------+
|clientID|text                                                                                          |dateshift|
+--------+----------------------------------------------------------------------------------------------+---------+
|A001    |Chris Brown submitted a 10K filing on 10/02/2022                                              |10       |
|A001    |Mark White received a loan on 3Apr2022                                                        |-2       |
|A002    |Jeff Bezos stated he would increase the investment in cloud infrastructure by 1% by 03/15/2022|30       |
|A002    |Satya Nadella reported a decrease in Greenhouse Gas emissions by a 1% on 12/31/2022           |-8       |
+--------+----------------------------------------------------------------------------------------------+---------+



In [47]:
# Save the sample data as CSV: finance.Deid reads its input from a file path
df_pd = my_input_df.toPandas()
df_pd.to_csv("deid_unnormalized_data.csv", index=False)

In [48]:
deid_implementor = finance.Deid(spark,
                                input_file_path="deid_unnormalized_data.csv",
                                output_file_path="deidentified.csv",
                                custom_pipeline=pipeline_col_model,
                                fields={"text": "obfuscate"},
                                shift_days=True,
                                obfuscate_date=True, 
                                ner_chunk="ner_chunk",
                                token="token",
                                documenthashcoder_col_name="document2",
                                separator=",",
                                unnormalized_date=True,
                                unnormalized_mode="mask")

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified.csv' !


In [49]:
res.show(truncate=False)

+---+----------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
|ID |text                                                                                          |text_deidentified                                                                           |
+---+----------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
|0  |Chris Brown submitted a 10K filing on 10/02/2022                                              |<PERSON> submitted a 10K filing on 10/12/2022                                               |
|1  |Mark White received a loan on 3Apr2022                                                        |<PERSON> received a loan on <DATE>                                                          |
|2  |Jeff Bezos stated he woul

**unnormalized_mode="obfuscate"**

In [50]:
deid_implementor = finance.Deid(spark,
                                input_file_path="deid_unnormalized_data.csv",
                                output_file_path="deidentified1.csv",
                                custom_pipeline=pipeline_col_model,
                                fields={"text": "obfuscate"},
                                shift_days=True,
                                obfuscate_date=True, 
                                ner_chunk="ner_chunk",
                                token="token",
                                documenthashcoder_col_name="document2",
                                separator=",",
                                unnormalized_date=True,
                                unnormalized_mode="obfuscate")

res = deid_implementor.deidentify()

Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified1.csv' !


In [51]:
res.show(truncate=False)

+---+----------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
|ID |text                                                                                          |text_deidentified                                                                           |
+---+----------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+
|0  |Chris Brown submitted a 10K filing on 10/02/2022                                              |<PERSON> submitted a 10K filing on 10/12/2022                                               |
|1  |Mark White received a loan on 3Apr2022                                                        |<PERSON> received a loan on 03-06-2000                                                      |
|2  |Jeff Bezos stated he woul
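
The two `unnormalized_mode` runs above differ only in how they treat a date that does not match a recognised format (`3Apr2022` here): `"mask"` replaces it with `<DATE>`, while `"obfuscate"` substitutes a fake date. The sketch below reproduces that behaviour in plain Python for illustration. It is not the library's implementation: the helper `shift_or_replace`, the single assumed format `%m/%d/%Y`, and the way the fake date is drawn are all hypothetical.

```python
import random
from datetime import datetime, timedelta

# Hypothetical helper, NOT part of johnsnowlabs: it only illustrates the idea.
KNOWN_FORMAT = "%m/%d/%Y"  # assumption: the only "normalized" format in this sketch


def shift_or_replace(date_text, days, mode="mask", seed=None):
    """Shift a parsable date by `days`; otherwise mask it or swap in a fake date."""
    try:
        parsed = datetime.strptime(date_text, KNOWN_FORMAT)
    except ValueError:
        if mode == "mask":
            return "<DATE>"
        # mode == "obfuscate": draw an arbitrary date (assumed strategy)
        rng = random.Random(seed)
        fake = datetime(2000, 1, 1) + timedelta(days=rng.randrange(365))
        return fake.strftime("%m-%d-%Y")
    return (parsed + timedelta(days=days)).strftime(KNOWN_FORMAT)


shifted = shift_or_replace("10/02/2022", 10)                 # parsable, so shifted
masked = shift_or_replace("3Apr2022", -2, mode="mask")       # not parsable, so masked
faked = shift_or_replace("3Apr2022", -2, mode="obfuscate", seed=23)
print(shifted, masked, faked)
```

The first call returns `10/12/2022`, which matches the first row of both result tables; the second returns `<DATE>`, as in the `"mask"` run. The fake date in the third call depends on the seed and will not match the notebook's `03-06-2000`.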

# Default pipeline for Financial domain
 

In [52]:
deid_implementor = finance.Deid(spark,
                               input_file_path="deid_data.csv",
                               output_file_path="deidentified_custompipe.csv",
                               domain="finance")

res = deid_implementor.deidentify()

finpipe_deid download started this may take some time.
Approx size to download 452.9 MB
[OK!]
Deidentification process of the 'text' field has begun...
Deidentification process of the 'text' field was completed...
Deidentifcation successfully completed and the results saved as 'deidentified_custompipe.csv' !


In [53]:
res.show(truncate=False)

+---+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ID |text                                                                                                                                                                                                                                   |text_deidentified                                                                                                                                                                    |
+---+-----------------------------------------------------------------------------------------------------------------------------------------------------------

# Structured Deidentification

In [54]:
#sample data
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/hipaa-table-001.txt

df = spark.read.format("csv") \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("hipaa-table-001.txt")

df.show(truncate=False)

+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|NAME           |DOB       |AGE|ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|Cecilia Chapman|04/02/1935|83 |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|Iris Watson    |03/10/2009|9  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|Bryar Pitts    |11/01/1921|98 |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|Theodore Lowe  |13/02/2002|16 |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|Calista Wise   |20/08/1942|76 |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|Kyla Olsen     |12/05/1973|45 |Ap #651-8679 Sodales Av. Tamunin

## Default parameters

In [55]:
obfuscator = finance.StructuredDeidentification(spark,{"NAME":"NAME","AGE":"AGE"}, obfuscateRefSource = "faker")
obfuscator_df = obfuscator.obfuscateColumns(df)
obfuscator_df.show(truncate=False)

+--------------------+----------+-----+----------------------------------------------------+-------+--------------+---+---+
|NAME                |DOB       |AGE  |ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+--------------------+----------+-----+----------------------------------------------------+-------+--------------+---+---+
|[Robyne Askew]      |04/02/1935|[99] |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|[Lina Sar]          |03/10/2009|[5]  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|[Virgel Gess]       |11/01/1921|[92] |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|[Clent Ridges]      |13/02/2002|[18] |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|[Raquel Sarna]      |20/08/1942|[70] |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|[Randol

## obfuscateRefSource="file"

In [56]:
obfuscator_unique_ref_test = '''Will Perry#NAME
John Smith#NAME
Marvin MARSHALL#NAME
Hubert GROGAN#NAME
ALTHEA COLBURN#NAME
Kalil AMIN#NAME
Inci FOUNTAIN#NAME
Jackson WILLE#NAME
Jack SANTOS#NAME
Mahmood ALBURN#NAME
Marnie MELINGTON#NAME
Aysha GHAZI#NAME
Maryland CODER#NAME
Darene GEORGIOUS#NAME
Shelly WELLBECK#NAME
Min Kun JAE#NAME
Thomson THOMAS#NAME
Christian SUDDINBURG#NAME
20#AGE
30#AGE
40#AGE
50#AGE
60#AGE
(901)111-2222#TEL
(109)333 1343#TEL
(570) 874-1112#TEL
(901)111-2222#TEL
(109)333 1343#TEL
(570) 874-1112#TEL
28450#ZIPCODE
49144#ZIPCODE
14412#ZIPCODE
10/10/1983#DOB
04/06/1990#DOB
03/11/2001#DOB
'''

with open('obfuscator_unique_ref_test.txt', 'w') as f:
  f.write(obfuscator_unique_ref_test)

In [57]:
obfuscator = finance.StructuredDeidentification(spark,{"NAME":"NAME","AGE":"AGE"}, 
                                        obfuscateRefFile = "/content/obfuscator_unique_ref_test.txt",
                                        obfuscateRefSource = "file",
                                        columnsSeed={"NAME": 23, "AGE": 23})
obfuscator_df = obfuscator.obfuscateColumns(df)

obfuscator_df.select("NAME","AGE").show(truncate=False)

+------------------+----+
|NAME              |AGE |
+------------------+----+
|[Inci FOUNTAIN]   |[60]|
|[Jack SANTOS]     |[30]|
|[Darene GEORGIOUS]|[30]|
|[Shelly WELLBECK] |[40]|
|[Hubert GROGAN]   |[40]|
|[Kalil AMIN]      |[40]|
|[ALTHEA COLBURN]  |[60]|
|[Thomson THOMAS]  |[60]|
|[Jack SANTOS]     |[60]|
|[Will Perry]      |[20]|
|[Jackson WILLE]   |[60]|
|[Shelly WELLBECK] |[40]|
|[Kalil AMIN]      |[30]|
|[Marnie MELINGTON]|[30]|
|[Min Kun JAE]     |[30]|
|[Marvin MARSHALL] |[60]|
|[Marvin MARSHALL] |[50]|
|[Min Kun JAE]     |[30]|
|[Maryland CODER]  |[20]|
|[Marnie MELINGTON]|[20]|
+------------------+----+
only showing top 20 rows
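
The reference file written above holds one `value#LABEL` pair per line, and the obfuscated `NAME` and `AGE` values in the output are all drawn from its entries. As a rough illustration of that layout only, the sketch below groups such lines by label and picks a replacement with a seeded random generator. The helper `load_ref_entries` is hypothetical, and the library's actual selection logic is not shown in this notebook.

```python
import random
from collections import defaultdict


def load_ref_entries(text):
    """Group 'value#LABEL' lines by label, skipping blank lines."""
    by_label = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '#' so that values containing '#' stay intact
        value, label = line.rsplit("#", 1)
        by_label[label].append(value)
    return dict(by_label)


sample = """Will Perry#NAME
John Smith#NAME
20#AGE
30#AGE
(901)111-2222#TEL
"""

entries = load_ref_entries(sample)
print(entries)

# A fixed seed makes the pick repeatable (assumed to mirror the intent of columnsSeed)
fake_name = random.Random(23).choice(entries["NAME"])
print(fake_name)
```

Running a check like this before passing the file to `obfuscateRefFile` is a cheap way to confirm that every label used in the `columns` mapping has at least one candidate value.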



## Shift days

In [58]:
# For date columns, the `days` parameter of structured deidentification shifts every value by that many days.

df = spark.createDataFrame([
            ["Juan García", "13/02/1977", "711 Nulla St.", "140", "673 431234"],
            ["Will Smith", "23/02/1977", "1 Green Avenue.", "140", "+23 (673) 431234"],
            ["Pedro Ximénez", "11/04/1900", "Calle del Libertador, 7", "100", "912 345623"]
        ]).toDF("NAME", "DOB", "ADDRESS", "SBP", "TEL")

# Save a CSV copy of the sample data and preview it
df_pd = df.toPandas()
df_pd.to_csv("deid_dayshift_structured_data.csv", index=False)
df_pd.head()

            NAME         DOB                  ADDRESS  SBP               TEL
0    Juan García  13/02/1977            711 Nulla St.  140        673 431234
1     Will Smith  23/02/1977          1 Green Avenue.  140  +23 (673) 431234
2  Pedro Ximénez  11/04/1900  Calle del Libertador, 7  100        912 345623


In [59]:
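obfuscator = finance.StructuredDeidentification(spark,
                                             columns = {"NAME": "ID", "DOB": "DATE"},  # column name -> entity type
                                             obfuscateRefSource = "faker",
                                             # Note: this DataFrame has no "AGE" column, so only the "NAME" seed matches a column here
                                             columnsSeed={"NAME": 23, "AGE": 23},
                                             days = 5)  # shift date columns by 5 days
obfuscator_df = obfuscator.obfuscateColumns(df)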

In [60]:
obfuscator_df.show(truncate=False)

+----------+------------+-----------------------+---+----------------+
|NAME      |DOB         |ADDRESS                |SBP|TEL             |
+----------+------------+-----------------------+---+----------------+
|[G9296129]|[18/02/1977]|711 Nulla St.          |140|673 431234      |
|[M9239301]|[28/02/1977]|1 Green Avenue.        |140|+23 (673) 431234|
|[H3156881]|[16/04/1900]|Calle del Libertador, 7|100|912 345623      |
+----------+------------+-----------------------+---+----------------+
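
With `days = 5`, every `DOB` value in the output is exactly five days later than its input. The same arithmetic can be checked in plain Python. This is a minimal sketch that assumes the `dd/MM/yyyy` layout of the sample data; `shift_date` is a hypothetical helper, not a library function.

```python
from datetime import datetime, timedelta


def shift_date(value, days, fmt="%d/%m/%Y"):
    """Parse `value` with `fmt`, add `days`, and format it back the same way."""
    return (datetime.strptime(value, fmt) + timedelta(days=days)).strftime(fmt)


dobs = ["13/02/1977", "23/02/1977", "11/04/1900"]
shifted_dobs = [shift_date(dob, 5) for dob in dobs]
print(shifted_dobs)  # ['18/02/1977', '28/02/1977', '16/04/1900']
```

These values match the `DOB` column in the table above, which makes this a convenient sanity check when choosing a shift value.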

