![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Financial Deidentification

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Finance/11.Deidentification.ipynb)

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs 

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Saving latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json to latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json


In [None]:
from johnsnowlabs import * 
# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM.
# Restart your notebook afterwards for the changes to take effect.
jsl.install()

👌 Detected license file /content/latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up John Snow Labs home in /home/ckl/.johnsnowlabs this might take a few minutes.
Downloading 🐍+🚀 Python Library Spark-NLP-4.1.0-wheel-for-spark-3.x.x.whl
Downloading 🐍+💊 Python Library hc
Downloading 🐍+🕶 Python Library Spark-OCR-4.0.1-wheel-for-spark-3.x.x.whl
Downloading 🫘+🚀 Java Library Spark-NLP-4.1.0-cpu-for-spark-3.x.x.jar
Downloading 🫘+💊 Java Library hc
Downloading 🫘+🕶 Java Library Spark-OCR-4.0.1-cpu-for-spark-3.x.x.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Running "/usr/bin/python3 -m pip install https://pypi.johnsnowlabs.com/[LIBRARY_SECRET]spark-ocr/spark_ocr-4.0.1-py3-none-any.whl --force-reinstall"
Running "/usr/bin/python3 -m pip install https://pypi.johnsnowlabs.com/[LIBRARY_SECRET]spark-nlp-internal/spark_nlp_internal-4.1.0-py3-none-any.whl --force-reinst

In [None]:
from johnsnowlabs import * 
# Automatically load the license data and start a session with all jars the user has access to
spark = jsl.start()

👌 Detected license file /content/latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json
📋 Stored new John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_2_for_Spark-Healthcare_Spark-OCR.json
👌 Launched SparkSession with Jars for: 🚀Spark-NLP, 💊Spark-Healthcare, 🕶Spark-OCR


In [5]:
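import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Explicit imports for names used in later cells (SparkSession, F, Pipeline),
# in case the wildcard import above does not expose them in your version.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline

# Optional: start the session manually with custom parameters instead of jsl.start().
# PUBLIC_VERSION and JSL_VERSION are not defined in this notebook; set them from your license file first.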
def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")
      
    return builder.getOrCreate()

#spark = start(SECRET)

# Deidentification Model

Some financial information can be considered sensitive (e.g., document, organization, address, signer).

In [6]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector =  nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
    #.setCustomBounds(["\n\n"])

tokenizer =  nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings =  nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

fin_ner = finance.NerModel.pretrained("finner_deid", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ORG": "PARTY"}) # Relabel "ORG" entities as "PARTY"

nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      fin_ner,
      ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
finner_deid download started this may take some time.
[OK!]


### The pretrained NER model extracts:
- PROFESSION
- URL
- LOCATION-OTHER
- CITY
- DATE
- ZIP
- PERSON
- STATE
- COUNTRY
- STREET
- ORG
- PHONE
- EMAIL
- FAX
- AGE

In [7]:
fin_ner.getClasses()

['O',
 'I-PROFESSION',
 'B-PROFESSION',
 'B-URL',
 'I-LOCATION-OTHER',
 'I-URL',
 'B-CITY',
 'B-DATE',
 'I-ZIP',
 'I-PERSON',
 'B-LOCATION-OTHER',
 'B-STATE',
 'I-STATE',
 'B-PERSON',
 'I-CITY',
 'I-DATE',
 'B-COUNTRY',
 'B-ZIP',
 'I-STREET',
 'B-ORG',
 'I-ORG',
 'B-PHONE',
 'I-PHONE',
 'B-EMAIL',
 'B-STREET',
 'B-FAX',
 'B-AGE',
 'I-FAX',
 'I-AGE',
 'I-COUNTRY']
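
The tags above follow the IOB convention: `B-` marks the first token of an entity, `I-` a continuation token, and `O` a token outside any entity. As a small sketch (the helper name `entity_types` is introduced here for illustration and is not part of the library), the entity types listed earlier can be recovered from the tag list by stripping the prefixes:

```python
# Tags copied from the fin_ner.getClasses() output above.
classes = ['O', 'I-PROFESSION', 'B-PROFESSION', 'B-URL', 'I-LOCATION-OTHER',
           'I-URL', 'B-CITY', 'B-DATE', 'I-ZIP', 'I-PERSON', 'B-LOCATION-OTHER',
           'B-STATE', 'I-STATE', 'B-PERSON', 'I-CITY', 'I-DATE', 'B-COUNTRY',
           'B-ZIP', 'I-STREET', 'B-ORG', 'I-ORG', 'B-PHONE', 'I-PHONE',
           'B-EMAIL', 'B-STREET', 'B-FAX', 'B-AGE', 'I-FAX', 'I-AGE', 'I-COUNTRY']


def entity_types(iob_tags):
    """Strip the B-/I- prefix from each tag and drop the 'O' (outside) tag."""
    # split("-", 1) keeps hyphenated labels such as LOCATION-OTHER intact.
    return sorted({tag.split("-", 1)[1] for tag in iob_tags if tag != "O"})


labels = entity_types(classes)
print(len(labels), labels)
```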

In [8]:
text = """
(State or other jurisdictionof incorporation or organization)
(I.R.S. EmployerIdentification No.)
55 Almaden Boulevard, 6th Floor
San Jose, California 95113
(Address of principal executive offices and Zip Code)
799-9666
(Registrant’s telephone number, including area code) """

In [9]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

In [10]:
result_df = result.select(F.explode(F.arrays_zip(result.token.result, 
                                                 result.ner.result)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"))

In [11]:
result_df.select("token", "ner_label").groupBy('ner_label').count().orderBy('count', ascending=False).show(truncate=False)

+---------+-----+
|ner_label|count|
+---------+-----+
|O        |38   |
|I-STREET |2    |
|I-ZIP    |1    |
|B-STREET |1    |
|B-CITY   |1    |
|B-STATE  |1    |
|I-CITY   |1    |
|B-PHONE  |1    |
+---------+-----+



### Check extracted sensitive entities

In [12]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+--------------------+---------+
|chunk               |ner_label|
+--------------------+---------+
|55 Almaden Boulevard|STREET   |
|San Jose            |CITY     |
|California          |STATE    |
|95113               |ZIP      |
|799-9666            |PHONE    |
+--------------------+---------+
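
To make the step from token-level tags to the chunks above more concrete, here is a simplified sketch of IOB grouping in plain Python. It is not the `NerConverterInternal` implementation: the tokenization of the address lines is assumed for illustration, and tokens are joined with spaces instead of using character offsets. Note that a lone `I-ZIP` tag (as in the label counts above, where no `B-ZIP` appears) still opens a new chunk in this sketch.

```python
def iob_to_chunks(tokens, tags):
    """Group consecutive B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Emit the chunk collected so far, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            current_tokens, current_label = [], None
        elif tag.startswith("B-") or tag[2:] != current_label:
            # A B- tag, or an I- tag whose label differs, starts a new chunk.
            flush()
            current_tokens, current_label = [token], tag[2:]
        else:
            # An I- tag continuing the current chunk.
            current_tokens.append(token)
    flush()
    return chunks


# Assumed tokenization and tags for the two address lines of the sample text.
tokens = ["55", "Almaden", "Boulevard", ",", "6th", "Floor",
          "San", "Jose", ",", "California", "95113"]
tags = ["B-STREET", "I-STREET", "I-STREET", "O", "O", "O",
        "B-CITY", "I-CITY", "O", "B-STATE", "I-ZIP"]

chunks = iob_to_chunks(tokens, tags)
for chunk, label in chunks:
    print(f"{chunk:22} {label}")
```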



## Masking and Obfuscation

### Replace these entities with Tags

In [13]:
ner_converter = finance.NerConverterInternal()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk") 

deidentification = finance.DeIdentification() \
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("deidentified") \
      .setMode("mask")\
      .setReturnEntityMappings(True) # return an extra column that maps the masked/obfuscated entities to the original entities
      #.setMappingsColumn("MappingCol") # rename the mapping column; the default name is 'aux'

deidPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      fin_ner,
      ner_converter,
      deidentification])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model_deid = deidPipeline.fit(empty_data)

In [14]:
result = model_deid.transform(spark.createDataFrame([[text]]).toDF("text"))

In [15]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|        deidentified|                 aux|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|
(State or other ...|[{document, 0, 27...|[{document, 1, 97...|[{token, 1, 1, (,...|[{word_embeddings...|[{named_entity, 1...|[{chunk, 99, 118,...|[{document, 0, 96...|[{chunk, 97, 104,...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [16]:
result.select(F.explode(F.arrays_zip(result.sentence.result, result.deidentified.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("deidentified")).toPandas()

Unnamed: 0,sentence,deidentified
0,(State or other jurisdictionof incorporation o...,(State or other jurisdictionof incorporation o...
1,"55 Almaden Boulevard, 6th Floor","<STREET>, 6th Floor"
2,"San Jose, California 95113","<CITY>, <STATE> <ZIP>"
3,(Address of principal executive offices and Zi...,(Address of principal executive offices and Zi...
4,"(Registrant’s telephone number, including area...","(Registrant’s telephone number, including area..."


The Deidentification annotator offers three masking policies, selected with the `.setMaskingPolicy()` parameter:

**“entity_labels”**: Replace each chunk with its entity label. (default) <br/>
**“same_length_chars”**: Replace each chunk with asterisks ( * ) enclosed in brackets ( [ , ] ), so that the mask, brackets included, has the same length as the original chunk. <br/>
**“fixed_length_chars”**: Replace each chunk with a fixed number of asterisks ( * ). The length is set with the `setFixedMaskLength()` method. <br/>

Let's try each of these and compare the results:

In [17]:
#deid model with "entity_labels"
deid_entity_labels= finance.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_entity_label")\
    .setMode("mask")\
    .setReturnEntityMappings(True)\
    .setMaskingPolicy("entity_labels")

#deid model with "same_length_chars"
deid_same_length=  finance.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_same_length")\
    .setMode("mask")\
    .setReturnEntityMappings(True)\
    .setMaskingPolicy("same_length_chars")

#deid model with "fixed_length_chars"
deid_fixed_length=  finance.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_fixed_length")\
    .setMode("mask")\
    .setReturnEntityMappings(True)\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)


deidPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      fin_ner,
      ner_converter,
      deid_entity_labels,
      deid_same_length,
      deid_fixed_length])


empty_data = spark.createDataFrame([[""]]).toDF("text")
model_deid = deidPipeline.fit(empty_data)

In [18]:
policy_result = model_deid.transform(spark.createDataFrame([[text]]).toDF("text"))

In [19]:
policy_result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|   deid_entity_label|                 aux|    deid_same_length|   deid_fixed_length|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|
(State or other ...|[{document, 0, 27...|[{document, 1, 97...|[{token, 1, 1, (,...|[{word_embeddings...|[{named_entity, 1...|[{chunk, 99, 118,...|[{document, 0, 96...|[{chunk, 97, 100,...|[{document, 0, 96...|[{document, 0, 96...|
+--------------------+--------------------+--------------------+----

In [20]:
policy_result.select(F.explode(F.arrays_zip(policy_result.sentence.result, 
                                            policy_result.deid_entity_label.result, 
                                            policy_result.deid_same_length.result, 
                                            policy_result.deid_fixed_length.result)).alias("cols")) \
             .select(F.expr("cols['0']").alias("sentence"),
                     F.expr("cols['1']").alias("deid_entity_label"),
                     F.expr("cols['2']").alias("deid_same_length"),
                     F.expr("cols['3']").alias("deid_fixed_length")).toPandas()

Unnamed: 0,sentence,deid_entity_label,deid_same_length,deid_fixed_length
0,(State or other jurisdictionof incorporation o...,(State or other jurisdictionof incorporation o...,(State or other jurisdictionof incorporation o...,(State or other jurisdictionof incorporation o...
1,"55 Almaden Boulevard, 6th Floor","<STREET>, 6th Floor","[******************], 6th Floor","****, 6th Floor"
2,"San Jose, California 95113","<CITY>, <STATE> <ZIP>","[******], [********] [***]","****, **** ****"
3,(Address of principal executive offices and Zi...,(Address of principal executive offices and Zi...,(Address of principal executive offices and Zi...,(Address of principal executive offices and Zi...
4,"(Registrant’s telephone number, including area...","(Registrant’s telephone number, including area...","(Registrant’s telephone number, including area...","(Registrant’s telephone number, including area..."
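
The three policies compared above can be mimicked in plain Python to see exactly what each one produces. This is only a sketch: the helpers `mask_chunk` and `mask_text` and the hand-written character offsets are assumptions made for illustration, not the annotator's internals. The bracket handling follows the output above, where brackets count toward the mask length.

```python
def mask_chunk(chunk, label, policy="entity_labels", fixed_length=4):
    """Return the mask for one chunk under the given policy."""
    if policy == "entity_labels":
        return f"<{label}>"
    if policy == "same_length_chars":
        # Brackets are part of the total length, as in the output above.
        return "[" + "*" * max(len(chunk) - 2, 0) + "]"
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    raise ValueError(f"unknown masking policy: {policy}")


def mask_text(text, chunks, policy, fixed_length=4):
    """chunks: list of (begin, inclusive_end, label) offsets into text."""
    out = text
    # Replace from right to left so earlier offsets stay valid.
    for begin, end, label in sorted(chunks, reverse=True):
        mask = mask_chunk(out[begin:end + 1], label, policy, fixed_length)
        out = out[:begin] + mask + out[end + 1:]
    return out


sentence = "San Jose, California 95113"
chunks = [(0, 7, "CITY"), (10, 19, "STATE"), (21, 25, "ZIP")]

for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(f"{policy:20} {mask_text(sentence, chunks, policy)}")
```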


### Mapping Column

In [21]:
result.select("aux").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [22]:
result.select(F.explode(F.arrays_zip(result.aux.metadata, 
                                     result.aux.result, 
                                     result.aux.begin, 
                                     result.aux.end)).alias("cols")) \
      .select(F.expr("cols['0']['originalChunk']").alias("chunk"),
              F.expr("cols['0']['beginOriginalChunk']").alias("beginChunk"),
              F.expr("cols['0']['endOriginalChunk']").alias("endChunk"),
              F.expr("cols['1']").alias("label"),
              F.expr("cols['2']").alias("beginLabel"),
              F.expr("cols['3']").alias("endLabel")).show(truncate=False)

+--------------------+----------+--------+--------+----------+--------+
|chunk               |beginChunk|endChunk|label   |beginLabel|endLabel|
+--------------------+----------+--------+--------+----------+--------+
|55 Almaden Boulevard|99        |118     |<STREET>|97        |104     |
|San Jose            |131       |138     |<CITY>  |116       |121     |
|California          |141       |150     |<STATE> |124       |130     |
|95113               |152       |156     |<ZIP>   |132       |136     |
|799-9666            |212       |219     |<PHONE> |191       |197     |
+--------------------+----------+--------+--------+----------+--------+
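
The mapping stores, for every mask, its position in the deidentified text together with the original chunk, which is enough to undo the masking. The toy sketch below illustrates the idea on a single masked sentence; the function `reidentify`, the dictionary keys, and the sentence-relative offsets are assumptions for illustration and do not reproduce the annotator's document-level offsets. End offsets are treated as inclusive, as in the table above.

```python
def reidentify(masked_text, mappings):
    """Restore original chunks using begin / inclusive-end offsets of each mask."""
    out = masked_text
    # Work from right to left so that earlier offsets remain valid.
    for m in sorted(mappings, key=lambda m: m["begin"], reverse=True):
        out = out[:m["begin"]] + m["original"] + out[m["end"] + 1:]
    return out


masked = "<CITY>, <STATE> <ZIP>"
mappings = [
    {"begin": 0, "end": 5, "original": "San Jose"},
    {"begin": 8, "end": 14, "original": "California"},
    {"begin": 16, "end": 20, "original": "95113"},
]

restored = reidentify(masked, mappings)
print(restored)
```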



## Reidentification

We can use the `ReIdentification` annotator to restore the original sentences from the deidentified text and the mapping column.

In [23]:
reIdentification =  finance.ReIdentification()\
    .setInputCols(["aux","deidentified"])\
    .setOutputCol("original")

In [24]:
reid_result = reIdentification.transform(result)

In [25]:
reid_result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|        deidentified|                 aux|            original|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|
(State or other ...|[{document, 0, 27...|[{document, 1, 97...|[{token, 1, 1, (,...|[{word_embeddings...|[{named_entity, 1...|[{chunk, 99, 118,...|[{document, 0, 96...|[{chunk, 97, 104,...|[{document, 1, 97...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----

In [26]:
print(text)

reid_result.select('original.result').show(truncate=False)


(State or other jurisdictionof incorporation or organization)
(I.R.S. EmployerIdentification No.)
55 Almaden Boulevard, 6th Floor
San Jose, California 95113
(Address of principal executive offices and Zip Code)
799-9666
(Registrant’s telephone number, including area code) 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Using multiple NER models in the same pipeline

In [27]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert_embeddings")

fin_ner = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter =  finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ORG": "PARTY"}) # Relabel "ORG" entities as "PARTY"

ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "bert_embeddings"]) \
    .setOutputCol("ner_finner") 
    #.setLabelCasing("upper")

ner_converter_finner = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_finner"]) \
    .setOutputCol("ner_finner_chunk") \
    .setWhiteList(['ROLE']) # Keep only the "ROLE" entity from this NER model

chunk_merge =  finance.ChunkMergeApproach()\
    .setInputCols("ner_finner_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

deidentification =  finance.DeIdentification() \
    .setInputCols(["sentence", "token", "deid_merged_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setIgnoreRegex(True)


nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge,
      deidentification])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_deid download started this may take some time.
[OK!]
finner_org_per_role_date download started this may take some time.
[OK!]


In [28]:
text = """ Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon  """ 

In [29]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

# fin_ner
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON   |
|Amazon               |PARTY    |
+---------------------+---------+



In [30]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

# ner_finner
result.select(F.explode(F.arrays_zip(result.ner_finner_chunk.result, 
                                     result.ner_finner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-------+---------+
|chunk  |ner_label|
+-------+---------+
|founder|ROLE     |
|CEO    |ROLE     |
+-------+---------+



In [31]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

# merged_chunk
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result, 
                                     result.deid_merged_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON   |
|founder              |ROLE     |
|CEO                  |ROLE     |
|Amazon               |PARTY    |
+---------------------+---------+
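
Conceptually, the merge step combines the chunk lists of both NER models into one list ordered by position. The sketch below reproduces the order of the table above in plain Python. The helper names are made up for illustration, and the overlap rule used here (keep the longer chunk) is an assumption of this sketch, not a statement about how `ChunkMergeApproach` resolves conflicts.

```python
text = " Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon  "


def span(text, chunk, label):
    """Build a (begin, inclusive_end, chunk, label) tuple from the first match."""
    begin = text.index(chunk)
    return (begin, begin + len(chunk) - 1, chunk, label)


ner_chunk = [span(text, "Jeffrey Preston Bezos", "PERSON"),
             span(text, "Amazon", "PARTY")]
ner_finner_chunk = [span(text, "founder", "ROLE"),
                    span(text, "CEO", "ROLE")]


def merge_chunks(*chunk_lists):
    """Merge chunk lists by position; on overlap keep the longer chunk (assumed rule)."""
    candidates = sorted((c for chunks in chunk_lists for c in chunks),
                        key=lambda c: (c[0], -(c[1] - c[0])))
    merged = []
    for chunk in candidates:
        if merged and chunk[0] <= merged[-1][1]:
            # Overlaps the previously kept chunk: keep whichever is longer.
            if chunk[1] - chunk[0] > merged[-1][1] - merged[-1][0]:
                merged[-1] = chunk
            continue
        merged.append(chunk)
    return merged


merged = merge_chunks(ner_finner_chunk, ner_chunk)
for begin, end, chunk, label in merged:
    print(f"{begin:3} {end:3} {chunk:22} {label}")
```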



In [32]:
result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                     result.deidentified.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"),
              F.expr("cols['1']").alias("deidentified")).toPandas()

Unnamed: 0,sentence,deidentified
0,Jeffrey Preston Bezos is an American entrepren...,"<PERSON> is an American entrepreneur, <ROLE> a..."


## Obfuscation mode

In obfuscation mode, **DeIdentificationModel** replaces sensitive entities with random values of the same type.


In [33]:
# This is the obfuscation dict for the new entities
obs_lines = """5417543010#PHONE
(123)123-1234#PHONE
+18087339090#PHONE
(555) 555-1234#PHONE
541-700-3010#PHONE
HenryWatson@world.com#EMAIL
yousef@jacob.com#EMAIL
eric.shannon@geegle.com#EMAIL
mgt@jsl.com#EMAIL
gokhan@company.com#EMAIL
richard@company.it#EMAIL
TURER INC#PARTY
Clarus llc.#PARTY
SESA CO.#PARTY
John Snow Labs Inc#PARTY
MGT Trust Company, LLC.#PARTY
26-06-1990#EFFDATE
03/08/2025#EFFDATE
01/01/2045#EFFDATE
11/7/2016#EFFDATE
12-12-2022#EFFDATE
CEO#ROLE
CTO#ROLE
Director#ROLE
James Turner#PERSON
JUAN RAMIREZ#PERSON
Benjamin Curie#PERSON"""

with open('obfuscate.txt', 'w') as f:
    f.write(obs_lines)
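
Each line of the reference file written above pairs a replacement value with an entity label, separated by `#`. As a quick sanity check before passing such a file to the annotator, the lines can be grouped by label in plain Python. The function `parse_obfuscation_lines` is a helper introduced here for illustration, and the sample is a short excerpt of the lines above.

```python
from collections import defaultdict


def parse_obfuscation_lines(lines, delimiter="#"):
    """Group 'value#LABEL' lines into a {label: [values]} dictionary."""
    by_label = defaultdict(list)
    for line in lines.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last delimiter so the label is always the final field.
        value, label = line.rsplit(delimiter, 1)
        by_label[label].append(value)
    return dict(by_label)


sample = """5417543010#PHONE
HenryWatson@world.com#EMAIL
TURER INC#PARTY
Clarus llc.#PARTY
CEO#ROLE"""

candidates = parse_obfuscation_lines(sample)
for label, values in candidates.items():
    print(label, values)
```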

In [34]:
ner_converter_finner = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_finner"]) \
    .setOutputCol("ner_finner_chunk") \
    .setWhiteList(['ROLE'])

chunk_merge =  finance.ChunkMergeApproach()\
    .setInputCols("ner_finner_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

obfuscation =  finance.DeIdentification()\
    .setInputCols(["sentence", "token", "deid_merged_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate.txt')\
    .setObfuscateRefSource("both") #default: "faker"


nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge,
      obfuscation])

obfuscation_model = nlpPipeline.fit(empty_data)

In [35]:
result = obfuscation_model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result, result.deidentified.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("deidentified")).toPandas()

Unnamed: 0,sentence,deidentified
0,Jeffrey Preston Bezos is an American entrepren...,"Benjamin Curie is an American entrepreneur, CT..."


## Use the full pipeline with LightPipeline

In [36]:
light_model = LightPipeline(model)
annotated_text = light_model.annotate(text)
annotated_text['deidentified']

['<PERSON> is an American entrepreneur, <ROLE> and <ROLE> of <PARTY>']

In [37]:
obf_light_model = LightPipeline(obfuscation_model)
annotated_text = obf_light_model.annotate(text)
annotated_text['deidentified']

['Benjamin Curie is an American entrepreneur, CTO and CTO of Clarus llc.']

# Save the Pipeline and Load It from Local Disk

In [38]:
model.write().overwrite().save('pipeline_deid')

In [39]:
# from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline.from_disk("pipeline_deid")

In [40]:
data = spark.createDataFrame([[text]]).toDF("text")

In [41]:
deid_pipeline.model.stages

[DocumentAssembler_fc28c9a39714,
 SentenceDetector_b4771907a6a7,
 REGEX_TOKENIZER_bd445fd9aff0,
 ROBERTA_EMBEDDINGS_b915dff90901,
 BERT_EMBEDDINGS_29ce72cd673e,
 MedicalNerModel_7b3b98b32784,
 NER_CONVERTER_44c632daba38,
 MedicalNerModel_7711a4bfd1fa,
 NerConverter_e0a0ce091bbb,
 MERGE_a89476b7600b,
 DE-IDENTIFICATION_56f6b7114a36]

In [42]:
deid_pipeline.model.transform(data).show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|     bert_embeddings|                 ner|           ner_chunk|          ner_finner|    ner_finner_chunk|   deid_merged_chunk|        deidentified|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Jeffrey Preston ...|[{document, 0, 78...|[{document, 1, 76...|[{token, 1, 7, Je...|[{word_embeddings...|[{word_embeddings...|[{named_entity, 1...|[{chunk, 1, 21, J...|[{named_entity, 1...|[{chunk, 52, 58, ...|[{chunk, 1, 21, J...|[{docu

# Pretrained Deidentification Pipeline

This pretrained pipeline deidentifies financial information in text, returning both masked and obfuscated versions of the input. It can mask and obfuscate `DOC`, `EFFDATE`, `PARTY`, `ALIAS`, `SIGNING_PERSON`, `SIGNING_TITLE`, `COUNTRY`, `CITY`, `STATE`, `STREET`, `ZIP`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `DATE`, and `PHONE` entities.

In [43]:
# from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline("finpipe_deid", "en", "finance/models")


finpipe_deid download started this may take some time.
Approx size to download 900.4 MB
[OK!]


In [44]:
deid_pipeline.model.stages

[DocumentAssembler_57ba7ce8bff9,
 SentenceDetectorDLModel_8aaebf7e098e,
 REGEX_TOKENIZER_2f265bb3f6b5,
 ROBERTA_EMBEDDINGS_b915dff90901,
 BERT_EMBEDDINGS_29ce72cd673e,
 FinanceNerModel_f714c7246b46,
 NerConverter_5a5bb98a24c7,
 FinanceNerModel_2b2f0f671f99,
 NerConverter_8b80797c7f67,
 FinanceNerModel_7b3b98b32784,
 NER_CONVERTER_fb28b23bc35d,
 FinanceNerModel_419e708135cb,
 NER_CONVERTER_af60235365b4,
 CONTEXTUAL-PARSER_85a13a5ff4bd,
 CONTEXTUAL-PARSER_bf8f02fb6658,
 REGEX_MATCHER_6199c32417bc,
 REGEX_MATCHER_2d694c8416b8,
 MERGE_5b96d578aa9b,
 DE-IDENTIFICATION_3d3dd57f734a,
 DE-IDENTIFICATION_471d94c72cd0,
 DE-IDENTIFICATION_29cac8c6cf56,
 DE-IDENTIFICATION_407b57c7d657,
 Finisher_ed29d709e530]

In [45]:
text= """ REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF
Commvault Systems, Inc.  
(Exact name of registrant as specified in its charter) 
Signed By : Sherly Johnson
(Address of principal executive offices, including zip code) 
(732) 870-4000
(telephone number, including area code) 
Name of each exchange on which registered
CVLT
The NASDAQ Stock Market
"""

In [46]:
deid_res= deid_pipeline.annotate(text)

In [47]:
deid_res.keys()

dict_keys(['obfuscated', 'deidentified', 'masked_fixed_length_chars', 'deid_merged_chunk', 'sentence', 'masked_with_chars'])

In [48]:
pd.set_option("display.max_colwidth", 100)

df= pd.DataFrame(list(zip(deid_res["sentence"], 
                          deid_res["deidentified"],
                          deid_res["masked_with_chars"],
                          deid_res["masked_fixed_length_chars"], 
                          deid_res["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF,REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF,REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF,REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF,REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF
1,"Commvault Systems, Inc. \n(Exact name of registrant as specified in its charter)",<PARTY> \n(Exact name of registrant as specified in its charter),[*********************] \n(Exact name of registrant as specified in its charter),**** \n(Exact name of registrant as specified in its charter),John Snow Labs Inc \n(Exact name of registrant as specified in its charter)
2,Signed By : Sherly Johnson,Signed By : <SIGNING_PERSON>,Signed By : [************],Signed By : ****,Signed By : Dorothy Keen
3,"(Address of principal executive offices, including zip code) \n(732) 870-4000\n(telephone number...","(Address of principal executive offices, including zip code) \n<PHONE>\n(telephone number, inclu...","(Address of principal executive offices, including zip code) \n[************]\n(telephone number...","(Address of principal executive offices, including zip code) \n****\n(telephone number, includin...","(Address of principal executive offices, including zip code) \n031460 3797\n(telephone number, i..."
4,Name of each exchange on which registered,Name of each exchange on which registered,Name of each exchange on which registered,Name of each exchange on which registered,Name of each exchange on which registered
5,CVLT,<PARTY>,[**],****,TURER INC
6,The NASDAQ Stock Market,The <PARTY>,The [*****************],The ****,The TURER INC
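
Since the annotated result is a dictionary of parallel lists, a quick way to review what was actually masked is to keep only the sentences that changed. The sketch below runs on a small hand-copied excerpt of the table above; `deid_res_sample` and `changed_sentences` are names introduced for illustration, not part of the pipeline.

```python
# Excerpt shaped like the deid_res dictionary, with values copied from the table above.
deid_res_sample = {
    "sentence": ["Signed By : Sherly Johnson",
                 "Name of each exchange on which registered",
                 "CVLT"],
    "deidentified": ["Signed By : <SIGNING_PERSON>",
                     "Name of each exchange on which registered",
                     "<PARTY>"],
}


def changed_sentences(res, original_key="sentence", masked_key="deidentified"):
    """Return (original, masked) pairs for sentences the pipeline modified."""
    return [(original, masked)
            for original, masked in zip(res[original_key], res[masked_key])
            if original != masked]


changed = changed_sentences(deid_res_sample)
for original, masked in changed:
    print(f"{original!r} -> {masked!r}")
```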
