![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/funance-nlp/11.Deidentification.ipynb)

# Financial Deidentification

# Installation

In [1]:
! pip install -q johnsnowlabs

## Automatic Installation
Using my.johnsnowlabs.com SSO

In [3]:
from johnsnowlabs import nlp, finance

# nlp.install(force_browser=True)


Downloading license...
Licenses extracted successfully
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.3.0-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.3.0-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.3.0.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.3.0.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.3.0-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==4.3.0 installed! ✅ Heal the planet with NLP! 


## Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to upload the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# Starting

In [4]:
spark = nlp.start()

👌 Launched cpu optimized session with: 🚀Spark-NLP==4.3.0, 💊Spark-Healthcare==4.3.0, running on ⚡ PySpark==3.1.2


# Deidentification Model

Some legal information can be considered sensitive (e.g., document, organization, address, signer).

In [11]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained('finner_sec_10k_summary', 'en', 'finance/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_sec_10k_summary download started this may take some time.
[OK!]


### Entities extracted by the pretrained NER model

In [12]:
ner_model.getClasses()

['O',
 'I-ADDRESS',
 'B-FISCAL_YEAR',
 'I-FISCAL_YEAR',
 'B-TICKER',
 'I-TITLE_CLASS_VALUE',
 'B-TITLE_CLASS',
 'I-TITLE_CLASS',
 'B-ADDRESS',
 'B-ORG',
 'B-CFN',
 'I-ORG',
 'B-PHONE',
 'I-PHONE',
 'I-STOCK_EXCHANGE',
 'I-CFN',
 'B-IRS',
 'B-STATE',
 'B-TITLE_CLASS_VALUE',
 'B-STOCK_EXCHANGE']

In [13]:
text= """
Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:"""

In [14]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

In [15]:
from pyspark.sql import functions as F

result_df = result.select(F.explode(F.arrays_zip(result.token.result, 
                                                 result.ner.result)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"))

In [16]:
result_df.select("token", "ner_label").groupBy('ner_label').count().orderBy('count', ascending=False).show(truncate=False)

+-------------------+-----+
|ner_label          |count|
+-------------------+-----+
|O                  |101  |
|I-ADDRESS          |10   |
|I-PHONE            |4    |
|I-ORG              |4    |
|I-STOCK_EXCHANGE   |3    |
|B-ORG              |1    |
|B-IRS              |1    |
|B-STATE            |1    |
|B-PHONE            |1    |
|B-TICKER           |1    |
|B-TITLE_CLASS_VALUE|1    |
|I-TITLE_CLASS      |1    |
|B-TITLE_CLASS      |1    |
|B-ADDRESS          |1    |
|B-CFN              |1    |
|B-STOCK_EXCHANGE   |1    |
+-------------------+-----+
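
The token-level labels above follow the usual B-/I- (begin/inside) tagging scheme, and the NER converter stage groups them into entity chunks. As a rough illustration only, here is a plain-Python sketch of that grouping. `bio_to_chunks` is a hypothetical helper, not the Spark NLP implementation, and it joins tokens with spaces instead of using character offsets:

```python
def bio_to_chunks(tokens, labels):
    """Group B-/I- tagged tokens into (chunk_text, entity) pairs."""
    chunks = []
    current_tokens, current_entity = [], None
    for token, label in zip(tokens, labels):
        prefix, _, entity = label.partition("-")
        if prefix == "B" or (prefix == "I" and entity != current_entity):
            # A new entity starts: flush whatever chunk was open.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_entity))
            current_tokens, current_entity = [token], entity
        elif prefix == "I":
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_entity))
            current_tokens, current_entity = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_entity))
    return chunks


tokens = ["Commission", "file", "number", "000-15867",
          "Nasdaq", "Global", "Select", "Market"]
labels = ["O", "O", "O", "B-CFN",
          "B-STOCK_EXCHANGE", "I-STOCK_EXCHANGE",
          "I-STOCK_EXCHANGE", "I-STOCK_EXCHANGE"]

chunks = bio_to_chunks(tokens, labels)
print(chunks)
```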



### Check extracted sensitive entities

In [17]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+---------------------------------------------------+-----------------+
|chunk                                              |ner_label        |
+---------------------------------------------------+-----------------+
|000-15867                                          |CFN              |
|CADENCE DESIGN SYSTEMS, INC                        |ORG              |
|Delaware                                           |STATE            |
|00-0000000                                         |IRS              |
|2655 Seely Avenue, Building 5,
San Jose,
California|ADDRESS          |
|(408)
-943-1234                                    |PHONE            |
|Common Stock                                       |TITLE_CLASS      |
|$0.01                                              |TITLE_CLASS_VALUE|
|CDNS                                               |TICKER           |
|Nasdaq Global Select Market                        |STOCK_EXCHANGE   |
+---------------------------------------------------+-----------------+

## Masking and Obfuscation

### Replace these entities with Tags

In [18]:
deidentification = finance.DeIdentification() \
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("deidentified") \
      .setMode("mask")\
      .setReturnEntityMappings(True) # Return a new column that stores the mappings between the masked/obfuscated entities and the original ones. Required for "ReIdentification".
      #.setMappingsColumn("MappingCol") # change the name of the column, 'aux' is default

deidPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter,
      deidentification])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model_deid = deidPipeline.fit(empty_data)

In [19]:
result = model_deid.transform(spark.createDataFrame([[text]]).toDF("text"))

In [20]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|        deidentified|                 aux|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|
Commission file ...|[{document, 0, 77...|[{document, 1, 10...|[{token, 1, 10, C...|[{word_embeddings...|[{named_entity, 1...|[{chunk, 24, 32, ...|[{document, 0, 75...|[{chunk, 23, 27, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [21]:
reIdentification = finance.ReIdentification()\
    .setInputCols(["aux","deidentified"])\
    .setOutputCol("original")

In [22]:
reid_result = reIdentification.transform(result)

In [23]:
reid_result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|        deidentified|                 aux|            original|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|
Commission file ...|[{document, 0, 77...|[{document, 1, 10...|[{token, 1, 10, C...|[{word_embeddings...|[{named_entity, 1...|[{chunk, 24, 32, ...|[{document, 0, 75...|[{chunk, 23, 27, ...|[{document, 1, 10...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

# ReIdentification

In [24]:
print(text)

reid_result.select('original.result').show(truncate=False)


Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:
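
Re-identification is possible because masking with `setReturnEntityMappings(True)` keeps a record of what was replaced. The idea can be sketched in plain Python. This is only an illustration with hypothetical helpers (`mask_with_mappings`, `reidentify`) and Python-style offsets; it is not how the `DeIdentification` and `ReIdentification` annotators are implemented:

```python
def mask_with_mappings(text, entities):
    """Replace each (start, end, label) span with <LABEL> and record the original.

    Spans use Python-style offsets (end exclusive) and must not overlap.
    """
    masked, mappings, cursor = [], [], 0
    for start, end, label in sorted(entities):
        masked.append(text[cursor:start])
        tag = f"<{label}>"
        # Remember where the tag lands in the masked text and what it replaced.
        position = sum(len(part) for part in masked)
        mappings.append({"begin": position, "tag": tag, "original": text[start:end]})
        masked.append(tag)
        cursor = end
    masked.append(text[cursor:])
    return "".join(masked), mappings


def reidentify(masked_text, mappings):
    """Put the original values back, right to left so earlier offsets stay valid."""
    restored = masked_text
    for m in sorted(mappings, key=lambda m: m["begin"], reverse=True):
        begin, end = m["begin"], m["begin"] + len(m["tag"])
        restored = restored[:begin] + m["original"] + restored[end:]
    return restored


sample = "Commission file number 000-15867 for CDNS"
cfn_start = sample.index("000-15867")
ticker_start = sample.index("CDNS")
entities = [
    (cfn_start, cfn_start + len("000-15867"), "CFN"),
    (ticker_start, ticker_start + len("CDNS"), "TICKER"),
]

masked_text, mappings = mask_with_mappings(sample, entities)
print(masked_text)
print(reidentify(masked_text, mappings))
```

Without the stored mappings the masked text alone cannot be reversed, which is why the mappings column has to be kept alongside the de-identified output.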

In [25]:
result.select(F.explode(F.arrays_zip(result.sentence.result, result.deidentified.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("deidentified")).toPandas()

Unnamed: 0,sentence,deidentified
0,Commission file number 000-15867 \n___________...,Commission file number <CFN> \n_______________...
1,(Exact name of registrant as specified in its ...,(Exact name of registrant as specified in its ...
2,EmployerIdentification No.,EmployerIdentification No.
3,")\n2655 Seely Avenue, Building 5,\nSan Jose,\n...",)\n<ADDRESS>\n \n95134\n(Address of Principal ...
4,(b) of the Act:\nTitle of Each Class\nTrading ...,(b) of the Act:\nTitle of Each Class\nTrading ...
5,(s)\nNames of Each Exchange on which Registere...,(s)\nNames of Each Exchange on which Registere...
6,(g) of the Act:,(g) of the Act:


## Other masking strategies

The DeIdentification annotator offers three masking policies, selected with the `.setMaskingPolicy()` method:

**“entity_labels”**: Mask with the entity type of that chunk (default). <br/>
**“same_length_chars”**: Mask the entity with asterisks ( * ) enclosed in brackets ( [ , ] ), so that the mask has the same length as the original text. <br/>
**“fixed_length_chars”**: Mask the entity with a fixed number of asterisks ( * ). The length is set with the `setFixedMaskLength()` method. <br/>
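
To make the three policies concrete, here is a small plain-Python sketch of the replacement each one produces for a single entity. `mask_entity` is a hypothetical helper written for illustration, not part of the library, and the handling of very short values is an assumption:

```python
def mask_entity(value, label, policy="entity_labels", fixed_length=4):
    """Return the replacement for one detected entity under a masking policy."""
    if policy == "entity_labels":
        return f"<{label}>"
    if policy == "same_length_chars":
        # The brackets count toward the length, so output and input are equally long.
        # Falling back to plain asterisks for values shorter than 3 characters
        # is an assumption of this sketch.
        if len(value) < 3:
            return "*" * len(value)
        return "[" + "*" * (len(value) - 2) + "]"
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    raise ValueError(f"unknown masking policy: {policy}")


for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_entity("000-15867", "CFN", policy))
```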

Let's try each of these and compare the results:

In [26]:
#deid model with "entity_labels"
deid_entity_labels= finance.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_entity_label")\
    .setMode("mask")\
    .setReturnEntityMappings(False)\
    .setMaskingPolicy("entity_labels")

#deid model with "same_length_chars"
deid_same_length= finance.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_same_length")\
    .setMode("mask")\
    .setReturnEntityMappings(False)\
    .setMaskingPolicy("same_length_chars")

#deid model with "fixed_length_chars"
deid_fixed_length= finance.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_fixed_length")\
    .setMode("mask")\
    .setReturnEntityMappings(False)\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)


deidPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter,
      deid_entity_labels,
      deid_same_length,
      deid_fixed_length])


empty_data = spark.createDataFrame([[""]]).toDF("text")
model_deid = deidPipeline.fit(empty_data)

In [27]:
result = model_deid.transform(spark.createDataFrame([[text]]).toDF("text"))

In [28]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|   deid_entity_label|    deid_same_length|   deid_fixed_length|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|
Commission file ...|[{document, 0, 77...|[{document, 1, 10...|[{token, 1, 10, C...|[{word_embeddings...|[{named_entity, 1...|[{chunk, 24, 32, ...|[{document, 0, 75...|[{document, 0, 10...|[{document, 0, 73...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

In [29]:
result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                            result.deid_entity_label.result, 
                                            result.deid_same_length.result, 
                                            result.deid_fixed_length.result)).alias("cols")) \
             .select(F.expr("cols['0']").alias("sentence"),
                     F.expr("cols['1']").alias("deid_entity_label"),
                     F.expr("cols['2']").alias("deid_same_length"),
                     F.expr("cols['3']").alias("deid_fixed_length")).toPandas()

Unnamed: 0,sentence,deid_entity_label,deid_same_length,deid_fixed_length
0,Commission file number 000-15867 \n___________...,Commission file number <CFN> \n_______________...,Commission file number [*******] \n___________...,Commission file number **** \n________________...
1,(Exact name of registrant as specified in its ...,(Exact name of registrant as specified in its ...,(Exact name of registrant as specified in its ...,(Exact name of registrant as specified in its ...
2,EmployerIdentification No.,EmployerIdentification No.,EmployerIdentification No.,EmployerIdentification No.
3,")\n2655 Seely Avenue, Building 5,\nSan Jose,\n...",)\n<ADDRESS>\n \n95134\n(Address of Principal ...,)\n[******************************************...,)\n****\n \n95134\n(Address of Principal Execu...
4,(b) of the Act:\nTitle of Each Class\nTrading ...,(b) of the Act:\nTitle of Each Class\nTrading ...,(b) of the Act:\nTitle of Each Class\nTrading ...,(b) of the Act:\nTitle of Each Class\nTrading ...
5,(s)\nNames of Each Exchange on which Registere...,(s)\nNames of Each Exchange on which Registere...,(s)\nNames of Each Exchange on which Registere...,(s)\nNames of Each Exchange on which Registere...
6,(g) of the Act:,(g) of the Act:,(g) of the Act:,(g) of the Act:


### Mapping Column

In [34]:
result.select("ner_chunk").show(truncate=False)


In [37]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.metadata,
                                     result.ner_chunk.begin, 
                                     result.ner_chunk.end)).alias("cols")) \
      .select(F.expr("cols['0']['entity']").alias("label"),
              F.expr("cols['1']").alias("beginLabel"),
              F.expr("cols['2']").alias("endLabel")).show(truncate=False)

+-----------------+----------+--------+
|label            |beginLabel|endLabel|
+-----------------+----------+--------+
|CFN              |24        |32      |
|ORG              |75        |101     |
|STATE            |198       |205     |
|IRS              |209       |218     |
|ADDRESS          |320       |370     |
|PHONE            |434       |448     |
|TITLE_CLASS      |646       |657     |
|TITLE_CLASS_VALUE|660       |664     |
|TICKER           |686       |689     |
|STOCK_EXCHANGE   |691       |717     |
+-----------------+----------+--------+
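
The `beginLabel` and `endLabel` values appear to be character offsets into the original `text`, with the end offset inclusive. Assuming that convention (it matches the CFN row above, given that the sample text starts with a newline), a chunk can be recovered with a plain slice. `chunk_at` is a hypothetical helper:

```python
# First lines of the sample text used in this notebook (it starts with a newline).
text_start = "\nCommission file number 000-15867 \n"


def chunk_at(text, begin, end):
    """Slice a chunk out of text using inclusive begin/end character offsets."""
    return text[begin:end + 1]


print(chunk_at(text_start, 24, 32))
```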



## Using NER, ContextualParser and ZeroShotNER in the same Deidentification pipeline

In [38]:
# Create JSON rule files for the ContextualParsers (ALIAS, EMAIL, PHONE)
alias = {
  "entity": "ALIAS",
  "ruleScope": "document", 
  "completeMatchRegex": "true",
  "regex":'["“].*?["”]',
  "matchScope": "sub-token",
  "contextLength": 100
}

email = {
  "entity": "EMAIL",
  "ruleScope": "document", 
  "completeMatchRegex": "true",
  "regex": r'[\w-\.]+@([\w-]+\.)+[\w-]{2,4}',
  "matchScope": "sub-token",
  "contextLength": 100
}

phone = {
  "entity": "PHONE",
  "ruleScope": "document", 
  "completeMatchRegex": "true",
  "regex": r'(\+?\d{1,3}[\s-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d+',
  "matchScope": "sub-token",
  "contextLength": 100
}

import json
with open('alias.json', 'w') as f:
    json.dump(alias, f)
    
with open('email.json', 'w') as f:
    json.dump(email, f)
    
with open('phone.json', 'w') as f:
    json.dump(phone, f)
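
Because these rules define no context words, the regular expression alone decides what gets matched. As a quick sanity check, the PHONE pattern can be tried with Python's `re` module. This assumes Python's regex engine treats this particular pattern the same way as the engine used by the parser, which is an assumption; the candidate strings are taken from the sample filing used in this notebook:

```python
import re

# Same pattern as in phone.json, written as a raw string.
PHONE_REGEX = r'(\+?\d{1,3}[\s-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d+'

candidates = ["000-15867", "00-0000000", "943-1234", "CDNS"]

# fullmatch requires the whole candidate to satisfy the pattern.
matches = {c: re.fullmatch(PHONE_REGEX, c) is not None for c in candidates}

for candidate, is_match in matches.items():
    print(f"{candidate!r}: {'match' if is_match else 'no match'}")
```

In this check the commission file number and the I.R.S. number both satisfy the pattern, so a rule this permissive can tag values that are not phone numbers.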

In [54]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained('finner_sec_10k_summary', 'en', 'finance/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setGreedyMode(True)

zero_shot_ner = finance.ZeroShotNerModel.pretrained("finner_roberta_zeroshot", "en", "finance/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setPredictionThreshold(0.1)\
    .setEntityDefinitions(
        {            
            "ADDRESS":["Which address?", "Where is the location?"],
            "PERSON": ["Which person?", "What is the person name?"],
            "ORG": ["Which LLC?", "Which Inc?", "Which PLC?", "Which Corp?"]
        })


zeroshot_ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "zero_shot_ner"])\
    .setOutputCol("zero_ner_chunk")\
    .setGreedyMode(True)

alias_parser = finance.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("alias")\
    .setJsonPath("alias.json") \
    .setPrefixAndSuffixMatch(False)\
    .setOptionalContextRules(True)\
    .setCaseSensitive(False)

email_parser = finance.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("email")\
    .setJsonPath("email.json") \
    .setPrefixAndSuffixMatch(False)\
    .setOptionalContextRules(True)\
    .setCaseSensitive(False)

phone_parser = finance.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("phone")\
    .setJsonPath("phone.json") \
    .setPrefixAndSuffixMatch(False)\
    .setOptionalContextRules(True)\
    .setCaseSensitive(False)

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols("email", "phone", "ner_chunk", "zero_ner_chunk", "alias")\
    .setOutputCol('merged_ner_chunks')

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter,
      zero_shot_ner,
      zeroshot_ner_converter,
      alias_parser,
      email_parser,
      phone_parser,
      chunk_merger])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_sec_10k_summary download started this may take some time.
[OK!]
finner_roberta_zeroshot download started this may take some time.
[OK!]


In [55]:
text= """
Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:"""

In [56]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

# financial_ner (10k summary)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+---------------------------------------------------+-----------------+
|chunk                                              |ner_label        |
+---------------------------------------------------+-----------------+
|000-15867                                          |CFN              |
|CADENCE DESIGN SYSTEMS, INC                        |ORG              |
|Delaware                                           |STATE            |
|00-0000000                                         |IRS              |
|2655 Seely Avenue, Building 5,
San Jose,
California|ADDRESS          |
|(408)
-943-1234                                    |PHONE            |
|Common Stock                                       |TITLE_CLASS      |
|$0.01                                              |TITLE_CLASS_VALUE|
|CDNS                                               |TICKER           |
|Nasdaq Global Select Market                        |STOCK_EXCHANGE   |
+---------------------------------------------------+-----------------+

In [57]:
# zero_shot_ner
result.select(F.explode(F.arrays_zip(result.zero_ner_chunk.result, 
                                     result.zero_ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+----------------------+---------+
|chunk                 |ner_label|
+----------------------+---------+
|CADENCE DESIGN SYSTEMS|ORG      |
|INC                   |ORG      |
+----------------------+---------+



The order of the input columns in the ChunkMerger matters. In this case, the ContextualParser for PHONE numbers is listed before the NER model for 10-K summaries. As a result, `000-15867` is first detected as a PHONE number by the ContextualParser's regular expression (which has no predefined context), and the `CFN` label from the NER model, which comes later, is ignored.
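
The effect of the ordering can be illustrated with a small sketch. It assumes a simple "first source wins on overlap" rule, which is consistent with the behaviour described above but is not the actual `ChunkMergeApproach` implementation. `merge_by_priority` is a hypothetical helper, and the offsets are the ones reported for the sample text:

```python
def merge_by_priority(*sources):
    """Merge chunk lists; on overlap, the chunk from the earlier source is kept.

    Each chunk is a (begin, end, text, label) tuple with inclusive offsets.
    """
    kept = []
    for chunks in sources:
        for chunk in chunks:
            begin, end = chunk[0], chunk[1]
            # Two inclusive ranges overlap when each starts before the other ends.
            overlaps = any(begin <= k[1] and k[0] <= end for k in kept)
            if not overlaps:
                kept.append(chunk)
    return sorted(kept)


phone_chunks = [(24, 32, "000-15867", "PHONE")]
ner_chunks = [(24, 32, "000-15867", "CFN"), (686, 689, "CDNS", "TICKER")]

phone_first = merge_by_priority(phone_chunks, ner_chunks)
ner_first = merge_by_priority(ner_chunks, phone_chunks)
print(phone_first)
print(ner_first)
```

Under this assumption, swapping the order of the inputs is enough to keep the `CFN` label instead of `PHONE`.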

In [58]:
# merged_chunk
result.select(F.explode(F.arrays_zip(result.merged_ner_chunks.result, 
                                     result.merged_ner_chunks.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(n=50, truncate=False)

+---------------------------------------------------+-----------------+
|chunk                                              |ner_label        |
+---------------------------------------------------+-----------------+
|000-15867                                          |PHONE            |
|CADENCE DESIGN SYSTEMS, INC                        |ORG              |
|Delaware                                           |STATE            |
|00-0000000                                         |PHONE            |
|2655 Seely Avenue, Building 5,
San Jose,
California|ADDRESS          |
|(408)
-943-1234                                    |PHONE            |
|Common Stock                                       |TITLE_CLASS      |
|$0.01                                              |TITLE_CLASS_VALUE|
|CDNS                                               |TICKER           |
|Nasdaq Global Select Market                        |STOCK_EXCHANGE   |
+---------------------------------------------------+-----------------+

## Obfuscation mode

In obfuscation mode, **DeIdentificationModel** replaces sensitive entities with random values of the same type.


### Using external [Faker](https://faker.readthedocs.io/en/master/) library

In [52]:
!pip install faker

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faker
  Downloading Faker-17.3.0-py3-none-any.whl (1.7 MB)
Installing collected packages: faker
Successfully installed faker-17.3.0


In [53]:
from faker import Faker
fk = Faker()

In [62]:
# This is the obfuscation dict for the new entities
obs_lines = """NASDAQ#STOCK_EXCHANGE
NYSE#STOCK_EXCHANGE
London Stock Exchange#STOCK_EXCHANGE
Tokyo Stock Exchange#STOCK_EXCHANGE
ABCD#TICKER
EFGH#TICKER
YLJJ#TICKER
Common Stock#TITLE_CLASS
Preferred Stock#TITLE_CLASS
$0.01#TITLE_CLASS_VALUE
USD 0.025#TITLE_CLASS_VALUE
000-00001#CFN
000-00002#CFN
000-00003#CFN"""

for _ in range(25):
    obs_lines += f"\n{fk.name().strip()}#PERSON"
    obs_lines += f"\n{fk.date().strip()}#DATE"
    obs_lines += f"\n{fk.company().strip()}#ORG"
    obs_lines += f"\n{fk.phone_number().strip()}#PHONE"
    obs_lines += f"\n{fk.email().strip()}#EMAIL"
    obs_lines += f"\n{fk.street_address().strip()}#STREET"
    obs_lines += f"\n{fk.city().strip()}#CITY"
    obs_lines += f"\n{fk.state().strip()}#STATE"
    obs_lines += f"\n{fk.country().strip()}#COUNTRY"

with open('obfuscate.txt', 'w') as f:
    f.write(obs_lines)
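
Each line of `obfuscate.txt` pairs a replacement value with an entity label, separated by `#`. To make the format concrete, here is a sketch that parses such lines and draws a replacement of the same type. The helper names are hypothetical and this is not how the annotator reads the file internally:

```python
import random
from collections import defaultdict


def parse_obfuscation_lines(lines):
    """Build {label: [values]} from 'value#LABEL' lines."""
    table = defaultdict(list)
    for line in lines.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '#', so the value itself may contain other symbols.
        value, _, label = line.rpartition("#")
        table[label].append(value)
    return dict(table)


def pick_replacement(table, label, rng=random):
    """Pick a random fake value for a label, or None if the label is unknown."""
    values = table.get(label)
    return rng.choice(values) if values else None


sample_lines = """NASDAQ#STOCK_EXCHANGE
NYSE#STOCK_EXCHANGE
ABCD#TICKER
$0.01#TITLE_CLASS_VALUE"""

table = parse_obfuscation_lines(sample_lines)
print(table)
print(pick_replacement(table, "STOCK_EXCHANGE", random.Random(0)))
```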

In [63]:
# Previous Masking Annotators
#deid model with "entity_labels"
deid_entity_labels= finance.DeIdentification()\
    .setInputCols(["sentence", "token", "merged_ner_chunks"])\
    .setOutputCol("deidentified")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")
    
#deid model with "same_length_chars"
deid_same_length= finance.DeIdentification()\
    .setInputCols(["sentence", "token", "merged_ner_chunks"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

#deid model with "fixed_length_chars"
deid_fixed_length= finance.DeIdentification()\
    .setInputCols(["sentence", "token", "merged_ner_chunks"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)


In [64]:
# Obfuscation with Faker
obfuscation = finance.DeIdentification()\
    .setInputCols(["sentence", "token", "merged_ner_chunks"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate.txt')\
    .setObfuscateRefSource("both")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter,
      zero_shot_ner,
      zeroshot_ner_converter,
      alias_parser,
      email_parser,
      phone_parser,
      chunk_merger,
      deid_entity_labels,
      deid_same_length,
      deid_fixed_length,
      obfuscation])

obfuscation_model = nlpPipeline.fit(empty_data)

In [65]:
text= """
Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:"""

In [66]:
result = obfuscation_model.transform(spark.createDataFrame([[text]]).toDF("text"))
print("\n".join(result.select('obfuscated.result').collect()[0].result))

Commission file number 9575 9921 
_____________________________________
 
Blake, Johnston and Nelson.
(Exact name of registrant as specified in its charter)
____________________________________ 
Vermont
 
03.48.72.77.73
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S.
EmployerIdentification No.
)
<ADDRESS>
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
040 02.46.36.91.50 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12
(b) of the Act:
Title of Each Class
Trading Symbol
(s)
Names of Each Exchange on which Registered
Preferred Stock, $0.01 par value per share
ABCD
London Stock Exchange
Securities registered pursuant to Section 12
(g) of the Act:


## Using Light Pipelines

In [67]:
light_model = nlp.LightPipeline(obfuscation_model)
annotated_text = light_model.annotate(text)
print("\n".join(annotated_text['deidentified']))

Commission file number <PHONE> 
_____________________________________
 
<ORG>.
(Exact name of registrant as specified in its charter)
____________________________________ 
<STATE>
 
<PHONE>
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S.
EmployerIdentification No.
)
<ADDRESS>
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)<PHONE> 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12
(b) of the Act:
Title of Each Class
Trading Symbol
(s)
Names of Each Exchange on which Registered
<TITLE_CLASS>, <TITLE_CLASS_VALUE> par value per share
<TICKER>
<STOCK_EXCHANGE>
Securities registered pursuant to Section 12
(g) of the Act:


In [68]:
print("\n".join(annotated_text['obfuscated']))

Commission file number 9575 9921 
_____________________________________
 
Blake, Johnston and Nelson.
(Exact name of registrant as specified in its charter)
____________________________________ 
Vermont
 
03.48.72.77.73
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S.
EmployerIdentification No.
)
<ADDRESS>
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
040 02.46.36.91.50 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12
(b) of the Act:
Title of Each Class
Trading Symbol
(s)
Names of Each Exchange on which Registered
Preferred Stock, $0.01 par value per share
ABCD
London Stock Exchange
Securities registered pursuant to Section 12
(g) of the Act:


## Shifting Days

Date shifting relies on the `finance.DocumentHashCoder()` annotator, which adds day-shift information to a new document column that the `finance.DeIdentification()` annotator then consumes. The sample data below contains two documents, each with an ID and a date.

In [77]:
import pandas as pd

data = pd.DataFrame(
    {'DocumentID' : ['A001', 'A002'],
     'text' : ['Mark Johansson has bought a stock on 02/28/2020', 
               'John has bought a house on 03/15/2022',
               ]
    }
)

my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate = False)

+----------+-----------------------------------------------+
|DocumentID|text                                           |
+----------+-----------------------------------------------+
|A001      |Mark Johansson has bought a stock on 02/28/2020|
|A002      |John has bought a house on 03/15/2022          |
+----------+-----------------------------------------------+



### Shifting days according to the ID column

We use the `finance.DocumentHashCoder()` annotator to determine how many days to shift. It hashes the specified column and creates a new document column that carries the day-shift information; the `finance.DeIdentification()` annotator then deidentifies this new document. Set the seed parameter so that the hashing is consistent across runs.
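
The idea of a seeded, ID-based shift can be sketched in plain Python. This only illustrates the principle and is not the hashing scheme `DocumentHashCoder` uses internally: the helper `shift_days_for_id`, the choice of SHA-256, and the mapping to the range 1 to `range_days` are all assumptions.

```python
import hashlib


def shift_days_for_id(doc_id: str, seed: int, range_days: int) -> int:
    """Map a document ID to a repeatable shift between 1 and range_days (hypothetical scheme)."""
    # Mixing the seed into the hashed text makes the result reproducible per seed
    digest = hashlib.sha256(f"{seed}:{doc_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % range_days + 1


for doc_id in ("A001", "A002"):
    print(doc_id, shift_days_for_id(doc_id, seed=100, range_days=100))
```

Because the shift depends only on the ID and the seed, every document sharing an ID is shifted by the same number of days, which preserves the intervals between its dates.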

In [84]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = finance.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("DocumentID")\
    .setRangeDays(100)\
    .setNewDateShift("shift_days")\
    .setSeed(100)


# sentenceDetector = nlp.SentenceDetector()\
#     .setInputCols(["document2"])\
#     .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("embeddings")

finance_ner = finance.NerModel.pretrained('finner_deid_sec', "en", "finance/models")\
    .setInputCols(["document2", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")


deid = finance.DeIdentification()\
    .setInputCols(["document2", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')

pipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      documentHasher,
      tokenizer,
      embeddings,
      finance_ner,
      ner_converter,
      deid])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "DocumentID")

pipeline_model = pipeline.fit(empty_data)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_deid_sec download started this may take some time.
[OK!]


In [78]:
output = pipeline_model.transform(my_input_df)

output.select('DocumentID','text', 'deidentified.result').show(truncate = False)

+----------+-----------------------------------------------+-------------------------------------------+
|DocumentID|text                                           |result                                     |
+----------+-----------------------------------------------+-------------------------------------------+
|A001      |Mark Johansson has bought a stock on 02/28/2020|[<PERSON> has bought a stock on 04/26/2020]|
|A002      |John has bought a house on 03/15/2022          |[<PERSON> has bought a house on 03/19/2022]|
+----------+-----------------------------------------------+-------------------------------------------+



### Shifting days according to specified values

Instead of deriving the shift from the ID column, we can supply the shift values directly in another column.

```python
documentHasher = finance.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")
```
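
With explicit shift values, each date is moved forward by the number of days given for its row. The sketch below reproduces that arithmetic with the standard library only. The helper `shift_dates_in_text` and the assumption that dates appear as `MM/dd/yyyy` are for illustration, and names are left untouched here.

```python
import re
from datetime import datetime, timedelta

# Assumed date layout for this sketch: MM/dd/yyyy
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def shift_dates_in_text(text: str, days: int) -> str:
    """Shift every MM/dd/yyyy date found in text by the given number of days."""
    def _shift(match):
        original = datetime.strptime(match.group(0), "%m/%d/%Y")
        return (original + timedelta(days=days)).strftime("%m/%d/%Y")

    return DATE_PATTERN.sub(_shift, text)


rows = [
    ("Mark Johansson has bought a stock on 02/28/2020", "5"),
    ("John has bought a house on 03/15/2022", "10"),
]
for text, dateshift in rows:
    # The shift column holds strings, so convert before doing date arithmetic
    print(shift_dates_in_text(text, int(dateshift)))
```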


In [85]:
data = pd.DataFrame(
    {'DocumentID' : ['A001', 'A002'],
     'text' : ['Mark Johansson has bought a stock on 02/28/2020', 
               'John has bought a house on 03/15/2022',
               ],
     'dateshift' : ['5', '10']
    }
)


my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate = False)

+----------+-----------------------------------------------+---------+
|DocumentID|text                                           |dateshift|
+----------+-----------------------------------------------+---------+
|A001      |Mark Johansson has bought a stock on 02/28/2020|5        |
|A002      |John has bought a house on 03/15/2022          |10       |
+----------+-----------------------------------------------+---------+



In [86]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = finance.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")

# sentenceDetector = nlp.SentenceDetector()\
#     .setInputCols(["document2"])\
#     .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("embeddings")

finance_ner = finance.NerModel.pretrained('finner_deid_sec', "en", "finance/models")\
    .setInputCols(["document2", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")

obfuscation = finance.DeIdentification()\
    .setInputCols(["document2", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')

pipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      documentHasher,
      tokenizer,
      embeddings,
      finance_ner,
      ner_converter,
      obfuscation])

empty_data = spark.createDataFrame([["", "", ""]]).toDF("text", "DocumentID", "dateshift")

pipeline_model = pipeline.fit(empty_data)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_deid_sec download started this may take some time.
[OK!]


In [87]:
output = pipeline_model.transform(my_input_df)

output.select('text', 'dateshift', 'deidentified.result').show(truncate = False)

+-----------------------------------------------+---------+-------------------------------------------+
|text                                           |dateshift|result                                     |
+-----------------------------------------------+---------+-------------------------------------------+
|Mark Johansson has bought a stock on 02/28/2020|5        |[<PERSON> has bought a stock on 03/04/2020]|
|John has bought a house on 03/15/2022          |10       |[<PERSON> has bought a house on 03/25/2022]|
+-----------------------------------------------+---------+-------------------------------------------+



### Masking Unnormalized Date Formats

The `setUnnormalizedDateMode()` parameter controls what happens to DATE entities that cannot be normalized and therefore cannot be shifted. With the value `"mask"`, such dates are masked instead. In the example below, `03Apr2022` cannot be normalized, so it appears as `<DATE>` in the output.
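
The fallback can be pictured as "try to parse, otherwise mask". The following sketch shows that idea in plain Python. The list of accepted formats and the helper `shift_or_mask` are assumptions chosen for the example; they do not reflect the formats the annotator actually recognizes.

```python
from datetime import datetime, timedelta

# Assumed set of formats treated as "normalizable" in this sketch
KNOWN_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")


def shift_or_mask(date_text: str, days: int, date_tag: str = "DATE") -> str:
    """Shift a date if it parses with a known format; otherwise mask it."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(date_text, fmt)
        except ValueError:
            continue  # try the next known format
        return (parsed + timedelta(days=days)).strftime(fmt)
    # No format matched: mask the value instead of leaving the raw date in place
    return f"<{date_tag}>"


print(shift_or_mask("03/15/2022", 10))
print(shift_or_mask("03Apr2022", 10))
```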

In [91]:
data = pd.DataFrame(
    {'DocumentID' : ['A001', 'A002'],
     'text' : ['Mark Johansson has bought a stock on 02/28/2020', 
               'John has bought a house on 03Apr2022'],
     'dateshift' : ['5', '10']
    }
)

my_input_df = spark.createDataFrame(data)


documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = finance.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")


# sentenceDetector = nlp.SentenceDetector()
#     .setInputCols(["document2"])\
#     .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("embeddings")

legal_ner = finance.NerModel.pretrained('finner_deid_sec', "en", "finance/models")\
    .setInputCols(["document2", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")

obfuscation = finance.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')\
    .setUnnormalizedDateMode("mask")

pipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      documentHasher,
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      obfuscation])


output = pipeline.fit(my_input_df).transform(my_input_df)

output.select('text', 'dateshift', 'deidentified.result').show(truncate = False)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_deid_sec download started this may take some time.
[OK!]
+-----------------------------------------------+---------+-------------------------------------------+
|text                                           |dateshift|result                                     |
+-----------------------------------------------+---------+-------------------------------------------+
|Mark Johansson has bought a stock on 02/28/2020|5        |[<PERSON> has bought a stock on 03/24/2020]|
|John has bought a house on 03Apr2022           |10       |[<PERSON> has bought a house on <DATE>]    |
+-----------------------------------------------+---------+-------------------------------------------+



# Structured Deidentification

In [92]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/hipaa-table-001.txt

df = spark.read.format("csv") \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("hipaa-table-001.txt")

df.show(truncate=False)

+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|NAME           |DOB       |AGE|ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|Cecilia Chapman|04/02/1935|83 |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|Iris Watson    |03/10/2009|9  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|Bryar Pitts    |11/01/1921|98 |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|Theodore Lowe  |13/02/2002|16 |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|Calista Wise   |20/08/1942|76 |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|Kyla Olsen     |12/05/1973|45 |Ap #651-8679 Sodales Av. Tamunin

In [93]:
obfuscator = finance.StructuredDeidentification(spark,{"NAME":"PATIENT","AGE":"AGE"}, obfuscateRefSource = "faker")
obfuscator_df = obfuscator.obfuscateColumns(df)
obfuscator_df.show(truncate=False)

+--------------------+----------+----+----------------------------------------------------+-------+--------------+---+---+
|NAME                |DOB       |AGE |ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+--------------------+----------+----+----------------------------------------------------+-------+--------------+---+---+
|[Lorenda Heman]     |04/02/1935|[60]|711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|[Juliette Goodnight]|03/10/2009|[11]|P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|[Rickford Medin]    |11/01/1921|[60]|5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|[Vilma Dirks]       |13/02/2002|[17]|Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|[Estill Shield]     |20/08/1942|[60]|7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|[Yolande Drilli

In [94]:
obfuscator_unique_ref_test = '''Will Perry#CLIENT
John Smith#CLIENT
Marvin MARSHALL#CLIENT
Hubert GROGAN#CLIENT
ALTHEA COLBURN#CLIENT
Kalil AMIN#CLIENT
Inci FOUNTAIN#CLIENT
Jackson WILLE#CLIENT
Jack SANTOS#CLIENT
Mahmood ALBURN#CLIENT
Marnie MELINGTON#CLIENT
Aysha GHAZI#CLIENT
Maryland CODER#CLIENT
Darene GEORGIOUS#CLIENT
Shelly WELLBECK#CLIENT
Min Kun JAE#CLIENT
Thomson THOMAS#CLIENT
Christian SUDDINBURG#CLIENT
Aberdeen#CITY
Louisburg St#STREET
France#LOC
5552312#PHONE
Calle del Libertador#ADDRESS
111#ID
20#AGE
30#AGE
40#AGE
50#AGE
60#AGE
'''

with open('obfuscator_unique_ref_test.txt', 'w') as f:
  f.write(obfuscator_unique_ref_test)

In [95]:
# obfuscateRefSource = "file"

obfuscator = finance.StructuredDeidentification(spark,{"NAME":"CLIENT","AGE":"AGE"}, 
                                        obfuscateRefFile = "/content/obfuscator_unique_ref_test.txt",
                                        obfuscateRefSource = "file",
                                        columnsSeed={"NAME": 23, "AGE": 23})
obfuscator_df = obfuscator.obfuscateColumns(df)
obfuscator_df.select("NAME","AGE").show(truncate=False)

+------------------+----+
|NAME              |AGE |
+------------------+----+
|[Inci FOUNTAIN]   |[60]|
|[Jack SANTOS]     |[30]|
|[Darene GEORGIOUS]|[30]|
|[Shelly WELLBECK] |[40]|
|[Hubert GROGAN]   |[40]|
|[Kalil AMIN]      |[40]|
|[ALTHEA COLBURN]  |[60]|
|[Thomson THOMAS]  |[60]|
|[Jack SANTOS]     |[60]|
|[Will Perry]      |[20]|
|[Jackson WILLE]   |[60]|
|[Shelly WELLBECK] |[40]|
|[Kalil AMIN]      |[30]|
|[Marnie MELINGTON]|[30]|
|[Min Kun JAE]     |[30]|
|[Marvin MARSHALL] |[60]|
|[Marvin MARSHALL] |[50]|
|[Min Kun JAE]     |[30]|
|[Maryland CODER]  |[20]|
|[Marnie MELINGTON]|[20]|
+------------------+----+
only showing top 20 rows
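
The reference file uses one `value#LABEL` pair per line. As a rough picture of what file-based obfuscation with a per-column seed involves, the sketch below parses that format and draws replacements with a seeded random generator. The helpers `parse_ref_file` and `obfuscate_column` are hypothetical, and the library's real selection strategy may differ.

```python
import random
from collections import defaultdict


def parse_ref_file(content: str) -> dict:
    """Group the values of a 'value#LABEL' reference file by label."""
    values_by_label = defaultdict(list)
    for line in content.splitlines():
        if "#" not in line:
            continue  # skip blank or malformed lines
        value, label = line.rsplit("#", 1)
        values_by_label[label.strip()].append(value.strip())
    return dict(values_by_label)


def obfuscate_column(values: list, label: str, refs: dict, seed: int) -> list:
    """Replace each cell with a reference value; the same seed gives the same result."""
    rng = random.Random(seed)
    return [rng.choice(refs[label]) for _ in values]


sample_refs = parse_ref_file("Will Perry#CLIENT\nJohn Smith#CLIENT\n20#AGE\n30#AGE\n")
print(obfuscate_column(["Cecilia Chapman", "Iris Watson"], "CLIENT", sample_refs, seed=23))
```

Since the pool of reference values is small, the same replacement can be drawn for several rows, as the repeated names in the output above show.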



In structured deidentification, the `days` parameter lets us **shift n days** when the column holds dates.
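
For a date column, the same number of days is added to every cell. The arithmetic for the day-first dates used in the next cells can be checked with a small sketch; `shift_column` is a hypothetical helper, and the `dd/MM/yyyy` format is assumed from the sample data.

```python
from datetime import datetime, timedelta


def shift_column(dates, days, fmt="%d/%m/%Y"):
    """Add a fixed number of days to each date string in a column."""
    delta = timedelta(days=days)
    return [(datetime.strptime(d, fmt) + delta).strftime(fmt) for d in dates]


print(shift_column(["13/02/1977", "23/02/1977", "11/04/1900"], days=5))
```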

In [96]:
df = spark.createDataFrame([
            ["Juan García", "13/02/1977", "711 Nulla St.", "140", "673 431234"],
            ["Will Smith", "23/02/1977", "1 Green Avenue.", "140", "+23 (673) 431234"],
            ["Pedro Ximénez", "11/04/1900", "Calle del Libertador, 7", "100", "912 345623"]
        ]).toDF("NAME", "DOB", "ADDRESS", "SBP", "TEL")
df.show(truncate=False)

+-------------+----------+-----------------------+---+----------------+
|NAME         |DOB       |ADDRESS                |SBP|TEL             |
+-------------+----------+-----------------------+---+----------------+
|Juan García  |13/02/1977|711 Nulla St.          |140|673 431234      |
|Will Smith   |23/02/1977|1 Green Avenue.        |140|+23 (673) 431234|
|Pedro Ximénez|11/04/1900|Calle del Libertador, 7|100|912 345623      |
+-------------+----------+-----------------------+---+----------------+



In [97]:
obfuscator = finance.StructuredDeidentification(spark=spark, 
                                        columns={"NAME": "ID", "DOB": "DATE"},
                                        columnsSeed={"NAME": 23, "DOB": 23},
                                        obfuscateRefSource="faker",
                                        days=5
                                         )

In [98]:
result = obfuscator.obfuscateColumns(df)
result.show(truncate=False)

+----------+------------+-----------------------+---+----------------+
|NAME      |DOB         |ADDRESS                |SBP|TEL             |
+----------+------------+-----------------------+---+----------------+
|[N2649912]|[18/02/1977]|711 Nulla St.          |140|673 431234      |
|[W466004] |[28/02/1977]|1 Green Avenue.        |140|+23 (673) 431234|
|[M403810] |[16/04/1900]|Calle del Libertador, 7|100|912 345623      |
+----------+------------+-----------------------+---+----------------+



# Save the Pipeline and Load It Locally

In [99]:
model.write().overwrite().save('pipeline_deid')

In [100]:
# from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = nlp.PretrainedPipeline.from_disk("pipeline_deid")

In [101]:
data = spark.createDataFrame([[text]]).toDF("text")

In [102]:
deid_pipeline.model.stages

[DocumentAssembler_f35a4d5c1f80,
 SentenceDetector_dbda64af6cd6,
 REGEX_TOKENIZER_0b5008ebf788,
 BERT_EMBEDDINGS_29ce72cd673e,
 FinanceNerModel_99ecfbac41c1,
 NER_CONVERTER_e87287aa980f,
 ZeroShotRobertaNer_5d06c0297d21,
 NER_CONVERTER_0a3cd5f79c6e,
 CONTEXTUAL-PARSER_0c0324c7f902,
 CONTEXTUAL-PARSER_10e07b53a046,
 CONTEXTUAL-PARSER_e747a453083c,
 MERGE_3ed13c20664e]

In [103]:
deid_pipeline.model.transform(data).show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+-----+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|       zero_shot_ner|      zero_ner_chunk|alias|email|               phone|   merged_ner_chunks|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+-----+--------------------+--------------------+
|
Commission file ...|[{document, 0, 77...|[{document, 1, 10...|[{token, 1, 10, C...|[{word_embeddings...|[{named_entity, 1...|[{chunk, 24, 32, ...|[{named_entity, 1...|[{chunk, 75, 96, ...|   []|   []|[{chunk, 24, 32, ...|[{chunk, 24, 32, ...|
+-------------------

# Pretrained Deidentification Pipeline

This pretrained pipeline deidentifies financial information in text, producing both masked and obfuscated versions of the input. It can mask and obfuscate `DOC`, `EFFDATE`, `PARTY`, `ALIAS`, `PERSON`, `TITLE`, `COUNTRY`, `CITY`, `STATE`, `STREET`, `ZIP`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `DATE`, `PHONE` and other entities.

In [105]:
# from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = nlp.PretrainedPipeline("finpipe_deid", "en", "finance/models")

finpipe_deid download started this may take some time.
Approx size to download 437.3 MB
[OK!]


In [106]:
deid_pipeline.model.stages

[DocumentAssembler_20aaea0b09c9,
 SentenceDetector_f836f3c49dd7,
 REGEX_TOKENIZER_3d88a1dee1d9,
 BERT_EMBEDDINGS_29ce72cd673e,
 FinanceNerModel_1e04a0ea86dc,
 NER_CONVERTER_053dc2c885dc,
 FinanceNerModel_99ecfbac41c1,
 NER_CONVERTER_c31e7133c116,
 FinanceNerModel_fae1a65403a6,
 NER_CONVERTER_e54c4e5afd15,
 CONTEXTUAL-PARSER_72fff5ea72a3,
 CONTEXTUAL-PARSER_247b3d47153a,
 CONTEXTUAL-PARSER_8804c3848e07,
 CONTEXTUAL-PARSER_138e93ac7638,
 CONTEXTUAL-PARSER_222a1bc3dc39,
 MERGE_72dccb34a947,
 DE-IDENTIFICATION_95319986720c,
 DE-IDENTIFICATION_e98c1ba6424c,
 DE-IDENTIFICATION_b423b4e6a14e,
 DE-IDENTIFICATION_d6ea024c8838]

In [107]:
text= """CARGILL, INCORPORATED

By:     Pirkko Suominen



Name: Pirkko Suominen Title: Director, Bio Technology Development  Center,  Date:   10/19/2011

BIOAMBER, SAS

By:     Jean-François Huc



Name: Jean-François Huc  Title: President Date:   October 15, 2011

email : jeanfran@gmail.com
phone : 18087339090 """

In [108]:
deid_res= deid_pipeline.annotate(text)

In [109]:
deid_res.keys()

dict_keys(['obfuscated', 'ner_10k_chunk', 'email', 'document', 'ner_signers_chunk', 'deidentified', 'alias', 'chiefs', 'masked_fixed_length_chars', 'token', 'ner_signers', 'ner_generic_chunk', 'embeddings', 'merged_ner_chunks', 'ner_10k', 'sentence', 'phone', 'orgs', 'masked_with_chars', 'ner_generic'])

In [110]:
import pandas as pd

pd.set_option("display.max_colwidth", 100)

df= pd.DataFrame(list(zip(deid_res["sentence"], 
                          deid_res["deidentified"],
                          deid_res["masked_with_chars"],
                          deid_res["masked_fixed_length_chars"], 
                          deid_res["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"CARGILL, INCORPORATED\n\nBy: Pirkko Suominen\n\n\n\nName: Pirkko Suominen Title: Director, B...","<ORG>\n\nBy: <PERSON> Suominen\n\n\n\nName: <PERSON> Suominen Title: <PROFESSION>, Bio Techn...","[*******************]\n\nBy: [****] Suominen\n\n\n\nName: [****] Suominen Title: [******], B...","****\n\nBy: **** Suominen\n\n\n\nName: **** Suominen Title: ****, Bio Technology Development...",White PLC\n\nBy: Katrina Wilson Suominen\n\n\n\nName: Katrina Wilson Suominen Title: Enginee...


In [111]:
text= """
Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:"""

In [112]:
deid_res= deid_pipeline.annotate(text)

In [113]:
deid_res.keys()

dict_keys(['obfuscated', 'ner_10k_chunk', 'email', 'document', 'ner_signers_chunk', 'deidentified', 'alias', 'chiefs', 'masked_fixed_length_chars', 'token', 'ner_signers', 'ner_generic_chunk', 'embeddings', 'merged_ner_chunks', 'ner_10k', 'sentence', 'phone', 'orgs', 'masked_with_chars', 'ner_generic'])

In [114]:
import pandas as pd

pd.set_option("display.max_colwidth", 100)

df= pd.DataFrame(list(zip(deid_res["sentence"], 
                          deid_res["deidentified"],
                          deid_res["masked_with_chars"],
                          deid_res["masked_fixed_length_chars"], 
                          deid_res["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,Commission file number 000-15867 \n_____________________________________\n \nCADENCE DESIGN SYST...,Commission file number <PHONE> \n_____________________________________\n \n<ORG>.,Commission file number [*******] \n_____________________________________\n \n[******************...,Commission file number **** \n_____________________________________\n \n****.,"Commission file number (51) 9668-9224 \n_____________________________________\n \nHorton, Parson..."
1,(Exact name of registrant as specified in its charter)\n____________________________________ \nD...,(Exact name of registrant as specified in its charter)\n____________________________________ \n<...,(Exact name of registrant as specified in its charter)\n____________________________________ \n[...,(Exact name of registrant as specified in its charter)\n____________________________________ \n*...,(Exact name of registrant as specified in its charter)\n____________________________________ \nV...
2,EmployerIdentification No.,EmployerIdentification No.,EmployerIdentification No.,EmployerIdentification No.,EmployerIdentification No.
3,")\n2655 Seely Avenue, Building 5,\nSan Jose,\nCalifornia\n \n95134\n(Address of Principal Execut...",)\n<ADDRESS>\n \n<ZIP>\n(Address of Principal Executive Offices)\n \n(Zip Code)<PHONE> \n(Regist...,)\n[*************************************************]\n \n[***]\n(Address of Principal Executiv...,)\n****\n \n****\n(Address of Principal Executive Offices)\n \n(Zip Co**** \n(Registrant’s Telep...,)\n<ADDRESS>\n \n12862\n(Address of Principal Executive Offices)\n \n(Zip Code)\n52-06-66665887 ...
4,(b) of the Act:\nTitle of Each Class\nTrading Symbol,(b) of the Act:\nTitle <ORG> Symbol,(b) of the Act:\nTitle [*******************] Symbol,(b) of the Act:\nTitle **** Symbol,(b) of the Act:\nTitle White PLC Symbol
5,"(s)\nNames of Each Exchange on which Registered\nCommon Stock, $0.01 par value per share\nCDNS\n...","(s)\nNames of Each Exchange on which Registered\n<TITLE_CLASS>, $0.01 par value per share\n<TICK...","(s)\nNames of Each Exchange on which Registered\n[**********], $0.01 par value per share\n[**]\n...","(s)\nNames of Each Exchange on which Registered\n****, $0.01 par value per share\n****\n****\nSe...","(s)\nNames of Each Exchange on which Registered\n<TITLE_CLASS>, $0.01 par value per share\nABCD\..."
6,(g) of the Act:,(g) of the Act:,(g) of the Act:,(g) of the Act:,(g) of the Act:
