# Financial Deidentification

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Finance/11.Deidentification.ipynb)

## Colab Setup

In [3]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
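
Before running the install commands, it can help to confirm that the license file contains the keys this notebook relies on. The sketch below assumes only the three keys referenced in this notebook (`SECRET`, `PUBLIC_VERSION`, `JSL_VERSION`). `missing_license_keys` is a hypothetical helper, not part of any John Snow Labs library, and a real license file may contain additional keys.

```python
# Hypothetical helper: report which of the keys used in this notebook
# are absent from (or empty in) a loaded license dictionary.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    return [key for key in required if not license_keys.get(key)]

# Placeholder values for illustration only; real values come from spark_jsl.json.
example_keys = {"SECRET": "xxxx", "PUBLIC_VERSION": "0.0.0"}
print(missing_license_keys(example_keys))
```

An empty list means every key used by the cells in this notebook is present.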

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 nlu==4.0.1rc2 spark-nlp==$PUBLIC_VERSION 

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [None]:
import json
import os

import nlu
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline,PipelineModel
import pandas as pd
pd.set_option('display.max_colwidth', 0)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

# Deidentification Model

Some financial information can be considered sensitive (e.g., document, organization, address, signer).

In [5]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
    #.setCustomBounds(["\n\n"])

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

fin_ner = FinanceNerModel.pretrained("finner_deid", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ORG": "PARTY"})  # Relabel "ORG" entities as "PARTY"

nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      fin_ner,
      ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
Download done! Loading the resource.
[OK!]
roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
Download done! Loading the resource.
[OK!]


### The pretrained NER model extracts:
- PROFESSION
- URL
- LOCATION-OTHER
- CITY
- DATE
- ZIP
- PERSON
- STATE
- COUNTRY
- STREET
- ORG
- PHONE
- EMAIL
- FAX
- AGE

In [6]:
fin_ner.getClasses()

['O',
 'I-PROFESSION',
 'B-PROFESSION',
 'B-URL',
 'I-LOCATION-OTHER',
 'I-URL',
 'B-CITY',
 'B-DATE',
 'I-ZIP',
 'I-PERSON',
 'B-LOCATION-OTHER',
 'B-STATE',
 'I-STATE',
 'B-PERSON',
 'I-CITY',
 'I-DATE',
 'B-COUNTRY',
 'B-ZIP',
 'I-STREET',
 'B-ORG',
 'I-ORG',
 'B-PHONE',
 'I-PHONE',
 'B-EMAIL',
 'B-STREET',
 'B-FAX',
 'B-AGE',
 'I-FAX',
 'I-AGE',
 'I-COUNTRY']

In [7]:
text = """
(State or other jurisdictionof incorporation or organization)
(I.R.S. EmployerIdentification No.)
55 Almaden Boulevard, 6th Floor
San Jose, California 95113
(Address of principal executive offices and Zip Code)
799-9666
(Registrant’s telephone number, including area code) """

In [8]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

In [9]:
result_df = result.select(F.explode(F.arrays_zip(result.token.result, 
                                                 result.ner.result)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"))

In [10]:
result_df.select("token", "ner_label").groupBy('ner_label').count().orderBy('count', ascending=False).show(truncate=False)



+---------+-----+
|ner_label|count|
+---------+-----+
|O        |38   |
|I-STREET |2    |
|B-CITY   |1    |
|I-CITY   |1    |
|B-PHONE  |1    |
|B-STREET |1    |
|I-ZIP    |1    |
|B-STATE  |1    |
+---------+-----+



                                                                                

### Check extracted sensitive entities

In [11]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+--------------------+---------+
|chunk               |ner_label|
+--------------------+---------+
|55 Almaden Boulevard|STREET   |
|San Jose            |CITY     |
|California          |STATE    |
|95113               |ZIP      |
|799-9666            |PHONE    |
+--------------------+---------+
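
The chunk table above comes from grouping the `B-`/`I-` token tags shown earlier into multi-token entities. The following is a simplified pure-Python sketch of that grouping idea, not the actual `NerConverterInternal` implementation. The function name `iob_to_chunks` is hypothetical, and joining tokens with a single space is an assumption (the real annotator works with character offsets). An `I-` tag whose label differs from the open chunk starts a new chunk, which matches the lone `I-ZIP` tag above still producing a `ZIP` chunk.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, words, label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        # A token continues the open chunk only if it is I-<same label>.
        inside = prefix == "I" and tag_label == label
        if not inside and words:
            chunks.append((" ".join(words), label))
            words, label = [], None
        if tag != "O":
            if not inside:
                label = tag_label
            words.append(token)
    if words:
        chunks.append((" ".join(words), label))
    return chunks

tokens = ["San", "Jose", ",", "California", "95113"]
tags = ["B-CITY", "I-CITY", "O", "B-STATE", "I-ZIP"]
print(iob_to_chunks(tokens, tags))
```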



## Masking and Obfuscation

### Replace these entities with tags

In [12]:
ner_converter = NerConverterInternal()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk") 

deidentification = DeIdentification() \
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("deidentified") \
      .setMode("mask")\
      .setReturnEntityMappings(True)  # Return an extra column mapping each masked/obfuscated entity to its original.
      #.setMappingsColumn("MappingCol")  # Rename the mappings column; the default name is 'aux'.

deidPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      fin_ner,
      ner_converter,
      deidentification])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model_deid = deidPipeline.fit(empty_data)

In [13]:
result = model_deid.transform(spark.createDataFrame([[text]]).toDF("text"))

In [14]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|        deidentified|                 aux|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|
(State or other ...|[[document, 0, 27...|[[document, 1, 97...|[[token, 1, 1, (,...|[[word_embeddings...|[[named_entity, 1...|[[chunk, 99, 118,...|[[document, 0, 96...|[[chunk, 97, 104,...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [15]:
result.select(F.explode(F.arrays_zip(result.sentence.result, result.deidentified.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("deidentified")).toPandas()

|   | sentence | deidentified |
|---|----------|--------------|
| 0 | (State or other jurisdictionof incorporation o... | (State or other jurisdictionof incorporation o... |
| 1 | 55 Almaden Boulevard, 6th Floor | `<STREET>, 6th Floor` |
| 2 | San Jose, California 95113 | `<CITY>, <STATE> <ZIP>` |
| 3 | (Address of principal executive offices and Zi... | (Address of principal executive offices and Zi... |
| 4 | (Registrant’s telephone number, including area... | (Registrant’s telephone number, including area... |


The `DeIdentification` annotator supports three masking policies, which you select with `.setMaskingPolicy()`:

**“entity_labels”**: Mask each chunk with its entity label. (default) <br/>
**“same_length_chars”**: Mask each chunk with asterisks ( * ) enclosed in brackets ( [ , ] ), so that the mask has the same length as the original chunk. <br/>
**“fixed_length_chars”**: Mask each chunk with a fixed number of asterisks ( * ). The length is set with the `setFixedMaskLength()` method. <br/>

Let's try each of these and compare the results:

In [16]:
#deid model with "entity_labels"
deid_entity_labels= DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_entity_label")\
    .setMode("mask")\
    .setReturnEntityMappings(True)\
    .setMaskingPolicy("entity_labels")

#deid model with "same_length_chars"
deid_same_length= DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_same_length")\
    .setMode("mask")\
    .setReturnEntityMappings(True)\
    .setMaskingPolicy("same_length_chars")

#deid model with "fixed_length_chars"
deid_fixed_length= DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_fixed_length")\
    .setMode("mask")\
    .setReturnEntityMappings(True)\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)


deidPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      fin_ner,
      ner_converter,
      deid_entity_labels,
      deid_same_length,
      deid_fixed_length])


empty_data = spark.createDataFrame([[""]]).toDF("text")
model_deid = deidPipeline.fit(empty_data)

In [17]:
policy_result = model_deid.transform(spark.createDataFrame([[text]]).toDF("text"))

In [18]:
policy_result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|   deid_entity_label|                 aux|    deid_same_length|   deid_fixed_length|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|
(State or other ...|[[document, 0, 27...|[[document, 1, 97...|[[token, 1, 1, (,...|[[word_embeddings...|[[named_entity, 1...|[[chunk, 99, 118,...|[[document, 0, 96...|[[chunk, 97, 100,...|[[document, 0, 96...|[[document, 0, 96...|
+--------------------+--------------------+--------------------+----

In [19]:
policy_result.select(F.explode(F.arrays_zip(policy_result.sentence.result, 
                                            policy_result.deid_entity_label.result, 
                                            policy_result.deid_same_length.result, 
                                            policy_result.deid_fixed_length.result)).alias("cols")) \
             .select(F.expr("cols['0']").alias("sentence"),
                     F.expr("cols['1']").alias("deid_entity_label"),
                     F.expr("cols['2']").alias("deid_same_length"),
                     F.expr("cols['3']").alias("deid_fixed_length")).toPandas()

|   | sentence | deid_entity_label | deid_same_length | deid_fixed_length |
|---|----------|-------------------|------------------|-------------------|
| 0 | (State or other jurisdictionof incorporation o... | (State or other jurisdictionof incorporation o... | (State or other jurisdictionof incorporation o... | (State or other jurisdictionof incorporation o... |
| 1 | 55 Almaden Boulevard, 6th Floor | `<STREET>, 6th Floor` | `[******************], 6th Floor` | `****, 6th Floor` |
| 2 | San Jose, California 95113 | `<CITY>, <STATE> <ZIP>` | `[******], [********] [***]` | `****, **** ****` |
| 3 | (Address of principal executive offices and Zi... | (Address of principal executive offices and Zi... | (Address of principal executive offices and Zi... | (Address of principal executive offices and Zi... |
| 4 | (Registrant’s telephone number, including area... | (Registrant’s telephone number, including area... | (Registrant’s telephone number, including area... | (Registrant’s telephone number, including area... |
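
The three output columns above follow simple string rules, which can be sketched in plain Python. This is an illustration of the observed behavior, not the annotator's implementation. `mask_chunk` is a hypothetical helper, and the handling of chunks shorter than two characters under `same_length_chars` is an assumption, since the outputs above do not cover that case.

```python
def mask_chunk(chunk, label, policy="entity_labels", fixed_length=4):
    """Return the mask for one entity chunk under a given masking policy."""
    if policy == "entity_labels":
        return f"<{label}>"
    if policy == "same_length_chars":
        # Brackets count toward the length, so the mask is as long as the chunk.
        # Assumption: chunks shorter than 2 characters collapse to "[]".
        return "[" + "*" * max(len(chunk) - 2, 0) + "]"
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    raise ValueError(f"Unknown masking policy: {policy}")

for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_chunk("San Jose", "CITY", policy))
```

Because `same_length_chars` preserves chunk length, it reveals how long each original entity was; `fixed_length_chars` hides that as well.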


### Mapping Column

In [20]:
result.select("aux").show(truncate=False)


In [21]:
result.select(F.explode(F.arrays_zip(result.aux.metadata, 
                                     result.aux.result, 
                                     result.aux.begin, 
                                     result.aux.end)).alias("cols")) \
      .select(F.expr("cols['0']['originalChunk']").alias("chunk"),
              F.expr("cols['0']['beginOriginalChunk']").alias("beginChunk"),
              F.expr("cols['0']['endOriginalChunk']").alias("endChunk"),
              F.expr("cols['1']").alias("label"),
              F.expr("cols['2']").alias("beginLabel"),
              F.expr("cols['3']").alias("endLabel")).show(truncate=False)

+--------------------+----------+--------+--------+----------+--------+
|chunk               |beginChunk|endChunk|label   |beginLabel|endLabel|
+--------------------+----------+--------+--------+----------+--------+
|55 Almaden Boulevard|99        |118     |<STREET>|97        |104     |
|San Jose            |131       |138     |<CITY>  |116       |121     |
|California          |141       |150     |<STATE> |124       |130     |
|95113               |152       |156     |<ZIP>   |132       |136     |
|799-9666            |212       |219     |<PHONE> |191       |197     |
+--------------------+----------+--------+--------+----------+--------+



## Reidentification

We can use the `ReIdentification` annotator to restore the original sentences from the deidentified text and the mapping column.
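
The idea behind re-identification can be sketched in plain Python: each mapping entry records where a mask sits in the deidentified text and which original chunk it replaced. The sketch below is a simplified illustration, not the `ReIdentification` implementation. `reidentify` is a hypothetical helper, offsets are taken relative to a single sentence, and `end` is treated as inclusive, as in the mapping table above.

```python
def reidentify(masked, mappings):
    """Restore original chunks given (begin, end_inclusive, original) entries."""
    restored = masked
    # Replace from right to left so earlier offsets stay valid.
    for begin, end, original in sorted(mappings, reverse=True):
        restored = restored[:begin] + original + restored[end + 1:]
    return restored

masked_sentence = "<CITY>, <STATE> <ZIP>"
# Offsets are relative to the start of this sentence (illustrative values).
mappings = [(0, 5, "San Jose"), (8, 14, "California"), (16, 20, "95113")]
print(reidentify(masked_sentence, mappings))
```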

In [22]:
reIdentification = ReIdentification()\
    .setInputCols(["aux","deidentified"])\
    .setOutputCol("original")

In [23]:
reid_result = reIdentification.transform(result)

In [24]:
reid_result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|        deidentified|                 aux|            original|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|
(State or other ...|[[document, 0, 27...|[[document, 1, 97...|[[token, 1, 1, (,...|[[word_embeddings...|[[named_entity, 1...|[[chunk, 99, 118,...|[[document, 0, 96...|[[chunk, 97, 104,...|[[document, 1, 97...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----

In [25]:
print(text)

reid_result.select('original.result').show(truncate=False)


(State or other jurisdictionof incorporation or organization)
(I.R.S. EmployerIdentification No.)
55 Almaden Boulevard, 6th Floor
San Jose, California 95113
(Address of principal executive offices and Zip Code)
799-9666
(Registrant’s telephone number, including area code) 
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Using multiple NER in the same pipeline

In [26]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

bert_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert_embeddings")

fin_ner = FinanceNerModel.pretrained("finner_deid", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ORG": "PARTY"})  # Relabel "ORG" entities as "PARTY"

ner_finner = FinanceNerModel.pretrained("finner_org_people_role", "en", "finance/models")\
    .setInputCols(["sentence", "token", "bert_embeddings"]) \
    .setOutputCol("ner_finner") 
    #.setLabelCasing("upper")

ner_converter_finner = NerConverter() \
    .setInputCols(["sentence", "token", "ner_finner"]) \
    .setOutputCol("ner_finner_chunk") \
    .setWhiteList(['ROLE'])  # Keep only the "ROLE" entity from this NER model

chunk_merge = ChunkMergeApproach()\
    .setInputCols("ner_finner_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

deidentification = DeIdentification() \
    .setInputCols(["sentence", "token", "deid_merged_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setIgnoreRegex(True)


nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge,
      deidentification])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
Download done! Loading the resource.
[OK!]


In [27]:
text = """ Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon  """ 

In [28]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

# fin_ner
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)



+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON   |
|Amazon               |PARTY    |
+---------------------+---------+



                                                                                

In [29]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

# ner_finner
result.select(F.explode(F.arrays_zip(result.ner_finner_chunk.result, 
                                     result.ner_finner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

22/09/01 13:34:43 WARN DAGScheduler: Broadcasting large task binary with size 1038.7 KiB
22/09/01 13:34:43 WARN DAGScheduler: Broadcasting large task binary with size 1038.7 KiB
22/09/01 13:34:44 WARN DAGScheduler: Broadcasting large task binary with size 1038.7 KiB


+-------+---------+
|chunk  |ner_label|
+-------+---------+
|founder|ROLE     |
|CEO    |ROLE     |
+-------+---------+



In [30]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

# merged_chunk
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result, 
                                     result.deid_merged_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

22/09/01 13:34:44 WARN DAGScheduler: Broadcasting large task binary with size 1089.9 KiB
22/09/01 13:34:45 WARN DAGScheduler: Broadcasting large task binary with size 1089.9 KiB
22/09/01 13:34:45 WARN DAGScheduler: Broadcasting large task binary with size 1089.9 KiB


+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON   |
|founder              |ROLE     |
|CEO                  |ROLE     |
|Amazon               |PARTY    |
+---------------------+---------+
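
The merged table above is what you would get by pooling the chunks from both NER models and ordering them by position in the text. The sketch below illustrates only that pooling step in plain Python. `locate` and `merge_chunks` are hypothetical helpers, and overlapping chunks are not handled here, whereas `ChunkMergeApproach` applies its own rules for them.

```python
text = " Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon  "

def locate(chunk, label):
    """Build a (begin, end_inclusive, chunk, label) tuple from the sample text."""
    begin = text.find(chunk)
    return (begin, begin + len(chunk) - 1, chunk, label)

ner_chunk = [locate("Jeffrey Preston Bezos", "PERSON"), locate("Amazon", "PARTY")]
ner_finner_chunk = [locate("founder", "ROLE"), locate("CEO", "ROLE")]

def merge_chunks(*chunk_lists):
    """Pool chunks from several sources and sort them by begin offset."""
    pooled = [chunk for chunks in chunk_lists for chunk in chunks]
    return sorted(pooled, key=lambda chunk: chunk[0])

merged = merge_chunks(ner_finner_chunk, ner_chunk)
for begin, end, chunk, label in merged:
    print(begin, end, chunk, label)
```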



In [31]:
result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                     result.deidentified.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"),
              F.expr("cols['1']").alias("deidentified")).toPandas()

22/09/01 13:34:45 WARN DAGScheduler: Broadcasting large task binary with size 1071.3 KiB


|   | sentence | deidentified |
|---|----------|--------------|
| 0 | Jeffrey Preston Bezos is an American entrepren... | `<PERSON> is an American entrepreneur, <ROLE> a...` |


## Obfuscation mode

In obfuscation mode, **DeIdentificationModel** replaces sensitive entities with random values of the same type.


In [32]:
# Reference values for obfuscation, one value#LABEL entry per line
obs_lines = """5417543010#PHONE
(123)123-1234#PHONE
+18087339090#PHONE
(555) 555-1234#PHONE
541-700-3010#PHONE
HenryWatson@world.com#EMAIL
yousef@jacob.com#EMAIL
eric.shannon@geegle.com#EMAIL
mgt@jsl.com#EMAIL
gokhan@company.com#EMAIL
richard@company.it#EMAIL
TURER INC#PARTY
Clarus llc.#PARTY
SESA CO.#PARTY
John Snow Labs Inc#PARTY
MGT Trust Company, LLC.#PARTY
26-06-1990#EFFDATE
03/08/2025#EFFDATE
01/01/2045#EFFDATE
11/7/2016#EFFDATE
12-12-2022#EFFDATE
CEO#ROLE
CTO#ROLE
Director#ROLE
James Turner#PERSON
JUAN RAMIREZ#PERSON
Benjamin Curie#PERSON"""

with open('obfuscate.txt', 'w') as f:
    f.write(obs_lines)
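
Each line of the reference file pairs a replacement value with an entity label, separated by `#`. As a rough illustration of how such a file can drive obfuscation, the sketch below groups values by label and picks one at random. It is not the annotator's implementation: `parse_obfuscation_lines` is a hypothetical helper, and the fixed seed is only there to make the example repeatable.

```python
import random
from collections import defaultdict

def parse_obfuscation_lines(lines):
    """Group 'value#LABEL' lines into a {label: [values]} dictionary."""
    by_label = defaultdict(list)
    for line in lines.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '#' so values containing '#' stay intact.
        value, label = line.rsplit("#", 1)
        by_label[label].append(value)
    return dict(by_label)

sample_lines = """CEO#ROLE
CTO#ROLE
Director#ROLE
James Turner#PERSON"""

candidates = parse_obfuscation_lines(sample_lines)
rng = random.Random(0)  # seeded only for repeatability
replacement = rng.choice(candidates["ROLE"])
print(candidates["ROLE"], "->", replacement)
```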

In [33]:
ner_converter_finner = NerConverter() \
    .setInputCols(["sentence", "token", "ner_finner"]) \
    .setOutputCol("ner_finner_chunk") \
    .setWhiteList(['ROLE'])

chunk_merge = ChunkMergeApproach()\
    .setInputCols("ner_finner_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

obfuscation = DeIdentification()\
    .setInputCols(["sentence", "token", "deid_merged_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate.txt')\
    .setObfuscateRefSource("both") #default: "faker"


nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge,
      obfuscation])

obfuscation_model = nlpPipeline.fit(empty_data)

In [34]:
result = obfuscation_model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result, result.deidentified.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("deidentified")).toPandas()

22/09/01 13:34:47 WARN DAGScheduler: Broadcasting large task binary with size 1071.3 KiB


|   | sentence | deidentified |
|---|----------|--------------|
| 0 | Jeffrey Preston Bezos is an American entrepren... | James Turner is an American entrepreneur, CEO ... |


## Use the full pipeline with LightPipeline

In [35]:
light_model = LightPipeline(model)
annotated_text = light_model.annotate(text)
annotated_text['deidentified']

['<PERSON> is an American entrepreneur, <ROLE> and <ROLE> of <PARTY>']

In [36]:
obf_light_model = LightPipeline(obfuscation_model)
annotated_text = obf_light_model.annotate(text)
annotated_text['deidentified']

['James Turner is an American entrepreneur, CEO and Director of SESA CO.']

# Save the Pipeline and Load It from Local Disk

In [37]:
model.write().overwrite().save('pipeline_deid')

In [38]:
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline.from_disk("pipeline_deid")

In [39]:
data = spark.createDataFrame([[text]]).toDF("text")

In [40]:
deid_pipeline.model.stages

[DocumentAssembler_83583cfa6b45,
 SentenceDetector_8bd70e354fd5,
 REGEX_TOKENIZER_c629adcef744,
 ROBERTA_EMBEDDINGS_b915dff90901,
 BERT_EMBEDDINGS_29ce72cd673e,
 MedicalNerModel_7b3b98b32784,
 NER_CONVERTER_2420acca0391,
 MedicalNerModel_7711a4bfd1fa,
 NerConverter_3957c5f296a8,
 MERGE_70003986c296,
 DE-IDENTIFICATION_455e7fd91506]

In [41]:
deid_pipeline.model.transform(data).show()

22/09/01 13:35:20 WARN DAGScheduler: Broadcasting large task binary with size 1255.9 KiB
22/09/01 13:35:20 WARN DAGScheduler: Broadcasting large task binary with size 1255.9 KiB
22/09/01 13:35:21 WARN DAGScheduler: Broadcasting large task binary with size 1255.9 KiB

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|     bert_embeddings|                 ner|           ner_chunk|          ner_finner|    ner_finner_chunk|   deid_merged_chunk|        deidentified|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Jeffrey Preston ...|[[document, 0, 78...|[[document, 1, 76...|[[token, 1, 7, Je...|[[word_embeddings...|[[word_embeddings...|[[named_entity, 1...|[[chunk, 1, 21, J...|[[named_entity, 1...|[[chunk, 52, 58, ...|[[chunk, 1, 21, J...|[[docu

                                                                                

# Pretrained Deidentification Pipeline

This pretrained pipeline deidentifies financial information in text, returning both masked and obfuscated versions of each sentence. It can mask and obfuscate `DOC`, `EFFDATE`, `PARTY`, `ALIAS`, `SIGNING_PERSON`, `SIGNING_TITLE`, `COUNTRY`, `CITY`, `STATE`, `STREET`, `ZIP`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `DATE`, and `PHONE` entities.

In [42]:
from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline.pretrained("finpipe_deid", "en", "finance/models")

In [43]:
deid_pipeline.model.stages

[DocumentAssembler_57ba7ce8bff9,
 SentenceDetectorDLModel_8aaebf7e098e,
 REGEX_TOKENIZER_2f265bb3f6b5,
 ROBERTA_EMBEDDINGS_b915dff90901,
 BERT_EMBEDDINGS_29ce72cd673e,
 MedicalNerModel_f714c7246b46,
 NerConverter_5a5bb98a24c7,
 MedicalNerModel_2b2f0f671f99,
 NerConverter_8b80797c7f67,
 MedicalNerModel_7b3b98b32784,
 NER_CONVERTER_fb28b23bc35d,
 MedicalNerModel_419e708135cb,
 NER_CONVERTER_af60235365b4,
 CONTEXTUAL-PARSER_85a13a5ff4bd,
 CONTEXTUAL-PARSER_bf8f02fb6658,
 REGEX_MATCHER_6199c32417bc,
 REGEX_MATCHER_2d694c8416b8,
 MERGE_5b96d578aa9b,
 DE-IDENTIFICATION_3d3dd57f734a,
 DE-IDENTIFICATION_471d94c72cd0,
 DE-IDENTIFICATION_29cac8c6cf56,
 DE-IDENTIFICATION_407b57c7d657,
 Finisher_ed29d709e530]

In [44]:
text= """ REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF
Commvault Systems, Inc.  
(Exact name of registrant as specified in its charter) 
Signed By : Sherly Johnson
(Address of principal executive offices, including zip code) 
(732) 870-4000
(telephone number, including area code) 
Name of each exchange on which registered
CVLT
The NASDAQ Stock Market
"""

In [45]:
deid_res= deid_pipeline.annotate(text)

In [46]:
deid_res.keys()

dict_keys(['obfuscated', 'deidentified', 'masked_fixed_length_chars', 'deid_merged_chunk', 'sentence', 'masked_with_chars'])

In [47]:
pd.set_option("display.max_colwidth", 100)

df= pd.DataFrame(list(zip(deid_res["sentence"], 
                          deid_res["deidentified"],
                          deid_res["masked_with_chars"],
                          deid_res["masked_fixed_length_chars"], 
                          deid_res["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df

|   | Sentence | Masked | Masked with Chars | Masked with Fixed Chars | Obfuscated |
|---|----------|--------|-------------------|-------------------------|------------|
| 0 | REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF | REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF | REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF | REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF | REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF |
| 1 | Commvault Systems, Inc. \n(Exact name of registrant as specified in its charter) | `<PARTY> \n(Exact name of registrant as specified in its charter)` | `[*********************] \n(Exact name of registrant as specified in its charter)` | `**** \n(Exact name of registrant as specified in its charter)` | John Snow Labs Inc \n(Exact name of registrant as specified in its charter) |
| 2 | Signed By : Sherly Johnson | `Signed By : <SIGNING_PERSON>` | `Signed By : [************]` | `Signed By : ****` | Signed By : Dorothy Keen |
| 3 | (Address of principal executive offices, including zip code) \n(732) 870-4000\n(telephone number... | `(Address of principal executive offices, including zip code) \n<PHONE>\n(telephone number, inclu...` | `(Address of principal executive offices, including zip code) \n[************]\n(telephone number...` | `(Address of principal executive offices, including zip code) \n****\n(telephone number, includin...` | (Address of principal executive offices, including zip code) \n031460 3797\n(telephone number, i... |
| 4 | Name of each exchange on which registered | Name of each exchange on which registered | Name of each exchange on which registered | Name of each exchange on which registered | Name of each exchange on which registered |
| 5 | CVLT | `<PARTY>` | `[**]` | `****` | TURER INC |
| 6 | The NASDAQ Stock Market | `The <PARTY>` | `The [*****************]` | `The ****` | The TURER INC |
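
To see at a glance which sentences were changed and which labels were applied, the annotated dictionary can be post-processed in plain Python. The sketch below uses a small hand-copied sample shaped like `deid_res` rather than the real pipeline output. `masked_labels` is a hypothetical helper, and the regular expression assumes labels consist of uppercase letters, underscores, and hyphens, as in the entity list above.

```python
import re

# Assumption: labels look like <PARTY>, <SIGNING_PERSON>, or <LOCATION-OTHER>.
LABEL_PATTERN = re.compile(r"<([A-Z_\-]+)>")

# Hand-copied sample in the same shape as the pipeline's annotate() result.
sample_res = {
    "sentence": ["Signed By : Sherly Johnson",
                 "Name of each exchange on which registered",
                 "CVLT"],
    "deidentified": ["Signed By : <SIGNING_PERSON>",
                     "Name of each exchange on which registered",
                     "<PARTY>"],
}

def masked_labels(res):
    """Return (sentence, [labels]) for every sentence that was altered."""
    rows = []
    for sentence, masked in zip(res["sentence"], res["deidentified"]):
        if sentence != masked:
            rows.append((sentence, LABEL_PATTERN.findall(masked)))
    return rows

for sentence, labels in masked_labels(sample_res):
    print(sentence, "->", labels)
```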
