![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/11.Deidentification.ipynb)

# Legal Deidentification

# Installation

In [None]:
! pip install -q johnsnowlabs

## Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, legal

# nlp.install(force_browser=True)

## Manual downloading
If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# Starting

In [3]:
spark = nlp.start()

👌 Launched [92mcpu optimized[39m session with with: 🚀Spark-NLP==4.3.0, 💊Spark-Healthcare==4.3.0, running on ⚡ PySpark==3.1.2


# Deidentification Model

Some legal information can be considered sensitive. (e.g.,document, organization, address, signer)

In [53]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
    #.setCustomBounds(["\n\n"])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties_lg", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ALIAS": "PARTY"}) # "ALIAS" are secondary names of companies, so let's extract them also as PARTY

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_contract_doc_parties_lg download started this may take some time.
[OK!]


### Pretrained NER models extracts:
- Document
- Date
- Party (Organization Name)
- Alias

In [5]:
legal_ner.getClasses()

['O',
 'B-ORG',
 'I-DOC',
 'B-EFFDATE',
 'I-ORG',
 'B-ALIAS',
 'I-ALIAS',
 'B-PARTY',
 'I-EFFDATE',
 'B-FORMER_NAME',
 'I-FORMER_NAME',
 'B-DOC',
 'I-PARTY']

In [54]:
text = """THIS STRATEGIC ALLIANCE AGREEMENT ("Agreement") is made and entered into as of December 14, 2016 , by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware"""

In [55]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

In [56]:
from pyspark.sql import functions as F

result_df = result.select(F.explode(F.arrays_zip(result.token.result, 
                                                 result.ner.result)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"))

In [57]:
result_df.select("token", "ner_label").groupBy('ner_label').count().orderBy('count', ascending=False).show(truncate=False)

+---------+-----+
|ner_label|count|
+---------+-----+
|O        |31   |
|I-PARTY  |5    |
|I-EFFDATE|3    |
|I-DOC    |2    |
|B-PARTY  |1    |
|B-DOC    |1    |
|B-EFFDATE|1    |
+---------+-----+



### Check extracted sensitive entities

In [58]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+--------------------------------------+---------+
|chunk                                 |ner_label|
+--------------------------------------+---------+
|STRATEGIC ALLIANCE AGREEMENT          |DOC      |
|December 14, 2016                     |EFFDATE  |
|Hyatt Franchising Latin America, L.L.C|PARTY    |
+--------------------------------------+---------+



## Masking and Obfuscation

### Replace these enitites with Tags

In [123]:
deidentification = legal.DeIdentification() \
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("deidentified") \
      .setMode("mask")\
      .setReturnEntityMappings(True) #  return a new column to save the mappings between the mask/obfuscated entities and original entities. REquired for "ReIdentification"
      #.setMappingsColumn("MappingCol") # change the name of the column, 'aux' is default

deidPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      deidentification])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model_deid = deidPipeline.fit(empty_data)

In [81]:
result = model_deid.transform(spark.createDataFrame([[text]]).toDF("text"))

In [82]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|        deidentified|                 aux|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|THIS STRATEGIC AL...|[{document, 0, 24...|[{document, 0, 24...|[{token, 0, 3, TH...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 5, 32, S...|[{document, 0, 18...|[{chunk, 5, 9, <D...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [84]:
reIdentification = legal.ReIdentification()\
    .setInputCols(["aux","deidentified"])\
    .setOutputCol("original")

In [85]:
reid_result = reIdentification.transform(result)

In [86]:
reid_result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|        deidentified|                 aux|            original|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|THIS STRATEGIC AL...|[{document, 0, 24...|[{document, 0, 24...|[{token, 0, 3, TH...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 5, 32, S...|[{document, 0, 18...|[{chunk, 5, 9, <D...|[{document, 0, 24...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----

# ReIdentification

In [87]:
print(text)

reid_result.select('original.result').show(truncate=False)

THIS STRATEGIC ALLIANCE AGREEMENT ("Agreement") is made and entered into as of December 14, 2016 , by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS S

In [83]:
result.select(F.explode(F.arrays_zip(result.sentence.result, result.deidentified.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("deidentified")).toPandas()

Unnamed: 0,sentence,deidentified
0,"THIS STRATEGIC ALLIANCE AGREEMENT (""Agreement""...","THIS <DOC> (""Agreement"") is made and entered i..."


## Other different masking strategies 

We have three modes to mask the entities in the Deidentification annotator. You can select the modes using the `.setMaskingPolicy()` parameter. The methods are the followings:

**“entity_labels”**: Mask with the entity type of that chunk. (default) <br/>
**“same_length_chars”**: Mask the deid entities with same length of asterix ( * ) with brackets ( [ , ] ) on both end. <br/>
**“fixed_length_chars”**: Mask the deid entities with a fixed length of asterix ( * ). The length is setting up using the `setFixedMaskLength()` method. <br/>

Let's try each of these and compare the results:

In [88]:
#deid model with "entity_labels"
deid_entity_labels= legal.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_entity_label")\
    .setMode("mask")\
    .setReturnEntityMappings(False)\
    .setMaskingPolicy("entity_labels")

#deid model with "same_length_chars"
deid_same_length= legal.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_same_length")\
    .setMode("mask")\
    .setReturnEntityMappings(False)\
    .setMaskingPolicy("same_length_chars")

#deid model with "fixed_length_chars"
deid_fixed_length= legal.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_fixed_length")\
    .setMode("mask")\
    .setReturnEntityMappings(False)\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)


deidPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      deid_entity_labels,
      deid_same_length,
      deid_fixed_length])


empty_data = spark.createDataFrame([[""]]).toDF("text")
model_deid = deidPipeline.fit(empty_data)

In [89]:
result = model_deid.transform(spark.createDataFrame([[text]]).toDF("text"))

In [90]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|   deid_entity_label|                 aux|    deid_same_length|   deid_fixed_length|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|THIS STRATEGIC AL...|[{document, 0, 24...|[{document, 0, 24...|[{token, 0, 3, TH...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 5, 32, S...|[{document, 0, 18...|[{chunk, 5, 8, **...|[{document, 0, 24...|[{document, 0, 17...|
+--------------------+--------------------+--------------------+----

In [91]:
result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                            result.deid_entity_label.result, 
                                            result.deid_same_length.result, 
                                            result.deid_fixed_length.result)).alias("cols")) \
             .select(F.expr("cols['0']").alias("sentence"),
                     F.expr("cols['1']").alias("deid_entity_label"),
                     F.expr("cols['2']").alias("deid_same_length"),
                     F.expr("cols['3']").alias("deid_fixed_length")).toPandas()

Unnamed: 0,sentence,deid_entity_label,deid_same_length,deid_fixed_length
0,"THIS STRATEGIC ALLIANCE AGREEMENT (""Agreement""...","THIS <DOC> (""Agreement"") is made and entered i...","THIS [**************************] (""Agreement""...","THIS **** (""Agreement"") is made and entered in..."


### Mapping Column

In [92]:
result.select("aux").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|aux                                                                                                                                                                                                                                                                                                                                                 

In [93]:
result.select(F.explode(F.arrays_zip(result.aux.metadata,
                                     result.aux.begin, 
                                     result.aux.end)).alias("cols")) \
      .select(F.expr("cols['0']['originalChunk']").alias("chunk"),
              F.expr("cols['0']['beginOriginalChunk']").alias("beginChunk"),
              F.expr("cols['0']['endOriginalChunk']").alias("endChunk"),
              F.expr("cols['0']['entity']").alias("label"),
              F.expr("cols['1']").alias("beginLabel"),
              F.expr("cols['2']").alias("endLabel")).show(truncate=False)

+--------------------------------------+----------+--------+-------+----------+--------+
|chunk                                 |beginChunk|endChunk|label  |beginLabel|endLabel|
+--------------------------------------+----------+--------+-------+----------+--------+
|STRATEGIC ALLIANCE AGREEMENT          |5         |32      |DOC    |5         |8       |
|December 14, 2016                     |79        |95      |EFFDATE|55        |58      |
|Hyatt Franchising Latin America, L.L.C|114       |151     |PARTY  |77        |80      |
+--------------------------------------+----------+--------+-------+----------+--------+



## Using NER, ContextualParser and ZeroShotNER in the same Deideintification pipeline

In [94]:
# Create JSON file for PART
alias = {
  "entity": "ALIAS",
  "ruleScope": "document", 
  "completeMatchRegex": "true",
  "regex":'["“].*?["”]',
  "matchScope": "sub-token",
  "contextLength": 100
}

email = {
  "entity": "EMAIL",
  "ruleScope": "document", 
  "completeMatchRegex": "true",
  "regex":'[\w-\.]+@([\w-]+\.)+[\w-]{2,4}',
  "matchScope": "sub-token",
  "contextLength": 100
}

phone = {
  "entity": "PHONE",
  "ruleScope": "document", 
  "completeMatchRegex": "true",
  "regex":'(\+?\d{1,3}[\s-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d+',
  "matchScope": "sub-token",
  "contextLength": 100
}

import json
with open('alias.json', 'w') as f:
    json.dump(alias, f)
    
with open('email.json', 'w') as f:
    json.dump(email, f)
    
with open('phone.json', 'w') as f:
    json.dump(phone, f)

In [95]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained('legner_contract_doc_parties_lg', 'en', 'legal/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(['EFFDATE', 'PARTY', 'ALIAS'])\
    .setReplaceLabels({'FORMER_NAME': 'PARTY'})\
    .setGreedyMode(True)

zero_shot_ner = legal.ZeroShotNerModel.pretrained("legner_roberta_zeroshot", "en", "legal/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setPredictionThreshold(0.1)\
    .setEntityDefinitions(
        {
            
            "ADDRESS":["Which address?", "Where is the location?"],
            "SIGNING_PERSON": ["Which person?", "What is the person name?"],
            "PARTY": ["Which LLC?", "Which Inc?", "Which PLC?", "Which Corp?"]
        })


zeroshot_ner_converter = legal.NerConverterInternal() \
    .setInputCols(["sentence", "token", "zero_shot_ner"])\
    .setOutputCol("zero_ner_chunk")\

ner_model2 = legal.NerModel.pretrained('legner_signers', 'en', 'legal/models')\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner2")

ner_converter2 = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner2"])\
        .setOutputCol("ner_chunk2")

alias_parser = legal.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("alias")\
    .setJsonPath("alias.json") \
    .setPrefixAndSuffixMatch(False)\
    .setOptionalContextRules(True)\
    .setCaseSensitive(False)

email_parser = legal.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("email")\
    .setJsonPath("email.json") \
    .setPrefixAndSuffixMatch(False)\
    .setOptionalContextRules(True)\
    .setCaseSensitive(False)

phone_parser = legal.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("phone")\
    .setJsonPath("phone.json") \
    .setPrefixAndSuffixMatch(False)\
    .setOptionalContextRules(True)\
    .setCaseSensitive(False)

chunk_merger = legal.ChunkMergeApproach()\
    .setInputCols("email", "phone", "ner_chunk", "ner_chunk2","zero_ner_chunk", "alias")\
    .setOutputCol('merged_ner_chunks')

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter,
      ner_model2,
      ner_converter2,
      zero_shot_ner,
      zeroshot_ner_converter,
      alias_parser,
      email_parser,
      phone_parser,
      chunk_merger])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_contract_doc_parties_lg download started this may take some time.
[OK!]
legner_roberta_zeroshot download started this may take some time.
[OK!]
legner_signers download started this may take some time.
[OK!]


In [96]:
text = """
This Commercial Lease (this “Lease”) dated February 11, 2021, but made effective as of January 1, 2021 (the “Effective Date”), is made by and between 605 NASH, LLC, a California limited liability company (“Landlord”) and NANTKWEST, INC., a Delaware corporation (“Tenant”).

605 NASH, LLC,	 	NANTKWEST, inc.,
a California limited liability company	 	a Delaware corporation
 	 	 	 	 	 	 
 	 	 	 	 	 	 
By:	 	/s/ Charles Kenworthy	 	By:	 	/s/ Richard Adcock
Name: Charles N. Kenworthy	 	Name: Richard Adcock
Title:   Manager	 	Title:   CEO
 	 	 	 	 	 	 
Address:	 	Address:
9922 Jefferson Blvd.	 	3530 Johns Hopkins Court
Culver City, CA 90232	 	San Diego, CA 92121
Attention: Chuck Kenworthy	 	Attention: Chief Financial Officer
Email:
juan@johnsnowlabs.com
Telephone numbers:
304.123.333
304-123-333
+34 304-123-333
0034304123333
"""

In [97]:
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

# legal_ner
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------+---------+
|chunk            |ner_label|
+-----------------+---------+
|February 11, 2021|EFFDATE  |
|January 1, 2021  |EFFDATE  |
|605 NASH, LLC    |PARTY    |
|NANTKWEST, INC   |PARTY    |
|NASH, LLC        |PARTY    |
|NANTKWEST        |PARTY    |
+-----------------+---------+



In [99]:
# ner_signers
result.select(F.explode(F.arrays_zip(result.ner_chunk2.result, 
                                     result.ner_chunk2.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------+--------------+
|chunk                  |ner_label     |
+-----------------------+--------------+
|605                    |PARTY         |
|NASH, LLC,             |PARTY         |
|NANTKWEST, INC         |PARTY         |
|NASH, LLC,             |PARTY         |
|NANTKWEST, inc         |PARTY         |
|Charles Kenworthy      |SIGNING_PERSON|
|Richard Adcock         |SIGNING_PERSON|
|Charles N. Kenworthy   |SIGNING_PERSON|
|Richard Adcock         |SIGNING_PERSON|
|Manager                |SIGNING_TITLE |
|CEO                    |SIGNING_TITLE |
|Chief Financial Officer|SIGNING_TITLE |
+-----------------------+--------------+



In [100]:
# zero_shot_ner
result.select(F.explode(F.arrays_zip(result.zero_ner_chunk.result, 
                                     result.zero_ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-------------------------------------------+--------------+
|chunk                                      |ner_label     |
+-------------------------------------------+--------------+
|LLC                                        |PARTY         |
|NANTKWEST                                  |PARTY         |
|INC                                        |PARTY         |
|Delaware corporation (“Tenant”).           |PARTY         |
|Charles Kenworthy                          |SIGNING_PERSON|
|Richard Adcock                             |SIGNING_PERSON|
|Charles N. Kenworthy                       |SIGNING_PERSON|
|Richard Adcock                             |SIGNING_PERSON|
|CEO                                        |SIGNING_PERSON|
|9922 Jefferson Blvd                        |ADDRESS       |
|3530 Johns Hopkins Court                   |ADDRESS       |
|Culver City, CA 90232	 	San Diego, CA 92121|ADDRESS       |
|Chuck Kenworthy                            |SIGNING_PERSON|
|Chief Financial Officer

In [102]:
# merged_chunk
result.select(F.explode(F.arrays_zip(result.merged_ner_chunks.result, 
                                     result.merged_ner_chunks.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(n=50, truncate=False)

+-------------------------------------------+--------------+
|chunk                                      |ner_label     |
+-------------------------------------------+--------------+
|“Lease”                                    |ALIAS         |
|February 11, 2021                          |EFFDATE       |
|January 1, 2021                            |EFFDATE       |
|“Effective Date”                           |ALIAS         |
|605 NASH, LLC                              |PARTY         |
|“Landlord”                                 |ALIAS         |
|NANTKWEST, INC                             |PARTY         |
|Delaware corporation (“Tenant”).           |PARTY         |
|NASH, LLC,                                 |PARTY         |
|NANTKWEST, inc                             |PARTY         |
|Charles Kenworthy                          |SIGNING_PERSON|
|Richard Adcock                             |SIGNING_PERSON|
|Charles N. Kenworthy                       |SIGNING_PERSON|
|Richard Adcock         

## Obfuscation mode

In the obfuscation mode **DeIdentificationModel** will replace sensitive entities with random values of the same type. 


### Using external [Faker](https://faker.readthedocs.io/en/master/) library

In [103]:
!pip install faker

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faker
  Downloading Faker-17.0.0-py3-none-any.whl (1.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m37.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: faker
Successfully installed faker-17.0.0


In [104]:
from faker import Faker
fk = Faker()

In [105]:
# This is the obfuscation dict for the new entities
obs_lines = """CEO#SIGNING_TITLE
Chief Executive Officer#SIGNING_TITLE
Chief Legal Officer#SIGNING_TITLE
Chief Financial officer#SIGNING_TITLE
Legal Representative#SIGNING_TILE
"Alias"#ALIAS
"Alias"#ALIAS"""

for _ in range(25):
    add = fk.address().strip()
    for ad in add.split('\n'):
        obs_lines += f"\n{ad}#ADDRESS"
    obs_lines += f"\n{fk.name().strip()}#SIGNING_PERSON"
    obs_lines += f"\n{fk.date().strip()}#EFFDATE"
    obs_lines += f"\n{fk.company().strip()}#PARTY"
    obs_lines += f"\n{fk.phone_number().strip()}#PHONE"
    obs_lines += f"\n{fk.email().strip()}#EMAIL"

with open ('obfuscate.txt', 'w') as f:
    f.write(obs_lines)

In [125]:
# Previous Masking Annotators
#deid model with "entity_labels"
deid_entity_labels= legal.DeIdentification()\
    .setInputCols(["sentence", "token", "merged_ner_chunks"])\
    .setOutputCol("deidentified")\
    .setMode("mask")\
    .setMaskingPolicy("entity_labels")
    
#deid model with "same_length_chars"
deid_same_length= legal.DeIdentification()\
    .setInputCols(["sentence", "token", "merged_ner_chunks"])\
    .setOutputCol("masked_with_chars")\
    .setMode("mask")\
    .setMaskingPolicy("same_length_chars")

#deid model with "fixed_length_chars"
deid_fixed_length= legal.DeIdentification()\
    .setInputCols(["sentence", "token", "merged_ner_chunks"])\
    .setOutputCol("masked_fixed_length_chars")\
    .setMode("mask")\
    .setMaskingPolicy("fixed_length_chars")\
    .setFixedMaskLength(4)


In [126]:
# Obfuscation with Faker
obfuscation = legal.DeIdentification()\
    .setInputCols(["sentence", "token", "merged_ner_chunks"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscate.txt')\
    .setObfuscateRefSource("both")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter,
      ner_model2,
      ner_converter2,
      zero_shot_ner,
      zeroshot_ner_converter,
      alias_parser,
      email_parser,
      phone_parser,
      chunk_merger,
      deid_entity_labels,
      deid_same_length,
      deid_fixed_length,
      obfuscation])

obfuscation_model = nlpPipeline.fit(empty_data)

In [127]:
text = """This Commercial Lease (this “Lease”) dated February 11, 2021, but made effective as of January 1, 2021 (the “Effective Date”), is made by and between 605 NASH, LLC, a California limited liability company (“Landlord”) and NANTKWEST, INC., a Delaware corporation (“Tenant”).

605 NASH, LLC,	 	NANTKWEST, inc.,
a California limited liability company	 	a Delaware corporation
 	 	 	 	 	 	 
 	 	 	 	 	 	 
By:	 	/s/ Charles Kenworthy	 	By:	 	/s/ Richard Adcock
Name: Charles N. Kenworthy	 	Name: Richard Adcock
Title:   Manager	 	Title:   CEO
 	 	 	 	 	 	 
Address:	 	Address:
9922 Jefferson Blvd.	 	3530 Johns Hopkins Court
Culver City, CA 90232	 	San Diego, CA 92121
Attention: Chuck Kenworthy	 	Attention: Chief Financial Officer cfo@johnkopkins.com (0031) 913-123"""

In [128]:
result = obfuscation_model.transform(spark.createDataFrame([[text]]).toDF("text"))
print("\n".join(result.select('obfuscated.result').collect()[0].result))

This Commercial Lease (this "Alias") dated 1983-05-13, but made effective as of 1996-11-23 (the "Alias"), is made by and between Chandler and Sons Barker-Spencer a California limited liability company ("Alias") and Cook, Fleming and Scott., a Taylor, Weeks and Ellis corporation ("Alias").
605 Barker-Spencer	 	Hardin Inc.,
a California limited liability company	 	a Delaware corporation
By:	 	/s/ Frances Spencer	 	By:	 	/s/ Tracey Nichols
Name: Paul Ramirez	 	Name: Tracey Nichols
Title:   Chief Executive Officer	 	Title:   Chief Financial officer
 	 	 	 	 	 	
Address:	 	Address:
348 Jason Walks.
	 	East Brendan, NY 29514
Richardsonfort, PA 54312
Attention: Catherine Klein DVM	 	Attention: Chief Financial officer Ilsa@hotmail.com (046 608 8269


## Using Light Pipelines

In [134]:
light_model = nlp.LightPipeline(obfuscation_model)
annotated_text = light_model.annotate(text)
print("\n".join(annotated_text['deidentified']))

This Commercial Lease (this <ALIAS>) dated <EFFDATE>, but made effective as of <EFFDATE> (the <ALIAS>), is made by and between <PARTY> <PARTY> a California limited liability company (<ALIAS>) and <PARTY>., a <PARTY> corporation (<ALIAS>).
605 <PARTY>	 	<PARTY>.,
a California limited liability company	 	a Delaware corporation
By:	 	/s/ <SIGNING_PERSON>	 	By:	 	/s/ <SIGNING_PERSON>
Name: <SIGNING_PERSON>	 	Name: <SIGNING_PERSON>
Title:   <SIGNING_TITLE>	 	Title:   <SIGNING_TITLE>
 	 	 	 	 	 	
Address:	 	Address:
<ADDRESS>.
	 	<ADDRESS>
<ADDRESS>
Attention: <SIGNING_PERSON>	 	Attention: <SIGNING_TITLE> <EMAIL> (<PHONE>


In [135]:
print("\n".join(annotated_text['obfuscated']))

This Commercial Lease (this "Alias") dated 1983-05-13, but made effective as of 1996-11-23 (the "Alias"), is made by and between Chandler and Sons Barker-Spencer a California limited liability company ("Alias") and Cook, Fleming and Scott., a Taylor, Weeks and Ellis corporation ("Alias").
605 Barker-Spencer	 	Hardin Inc.,
a California limited liability company	 	a Delaware corporation
By:	 	/s/ Frances Spencer	 	By:	 	/s/ Tracey Nichols
Name: Paul Ramirez	 	Name: Tracey Nichols
Title:   Chief Executive Officer	 	Title:   Chief Financial officer
 	 	 	 	 	 	
Address:	 	Address:
348 Jason Walks.
	 	East Brendan, NY 29514
Richardsonfort, PA 54312
Attention: Catherine Klein DVM	 	Attention: Chief Financial officer Ilsa@hotmail.com (046 608 8269


## Shifting Days

We use the `medical.DocumentHashCoder()` annotator to determine shifting days. This annotator gets the hash of the specified column and creates a new document column containing day shift information. And then, the `medical.DeIdentification()` annotator deidentifies this new doc. We should set the seed parameter to hash consistently.  

In [136]:
import pandas as pd

data = pd.DataFrame(
    {'DocumentID' : ['A001', 'A001', 'A002', 'A002'],
     'text' : ['Chris Brown was arrested on 10/02/2022', 
               'Mark White has bought a stock on 02/28/2020', 
               'John has bought a house on 03/15/2022',
               'John Moore was discharged on 12/31/2022'
              ]
    }
)

my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate = False)

+----------+-------------------------------------------+
|DocumentID|text                                       |
+----------+-------------------------------------------+
|A001      |Chris Brown was arrested on 10/02/2022     |
|A001      |Mark White has bought a stock on 02/28/2020|
|A002      |John has bought a house on 03/15/2022      |
|A002      |John Moore was discharged on 12/31/2022    |
+----------+-------------------------------------------+



### Shifting days according to the ID column

We use the `legal.DocumentHashCoder()` annotator to determine shifting days. This annotator gets the hash of the specified column and creates a new document column containing day shift information. And then, the `legal.DeIdentification()` annotator deidentifies this new doc. We should set the seed parameter to hash consistently.  

In [137]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = legal.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("DocumentID")\
    .setRangeDays(100)\
    .setNewDateShift("shift_days")\
    .setSeed(100)


# sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
#     .setInputCols(["document2"])\
#     .setOutputCol("sentence")
#     #.setCustomBounds(["\n\n"])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document2", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained('legner_deid', "en", "legal/models")\
    .setInputCols(["document2", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk") # "ALIAS" are secondary names of companies, so let's extract them also as PARTY


deid = legal.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')

pipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      documentHasher,
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      deid])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "DocumentID")

pipeline_model = pipeline.fit(empty_data)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_deid download started this may take some time.
[OK!]


In [138]:
output = pipeline_model.transform(my_input_df)

output.select('DocumentID','text', 'deidentified.result').show(truncate = False)

+----------+-------------------------------------------+-------------------------------------------+
|DocumentID|text                                       |result                                     |
+----------+-------------------------------------------+-------------------------------------------+
|A001      |Chris Brown was arrested on 10/02/2022     |[<PERSON> was arrested on 11/27/2022]      |
|A001      |Mark White has bought a stock on 02/28/2020|[<PERSON> has bought a stock on 03/18/2020]|
|A002      |John has bought a house on 03/15/2022      |[<PERSON> has bought a house on 04/18/2022]|
|A002      |John Moore was discharged on 12/31/2022    |[<PERSON> was discharged on 02/22/2023]    |
+----------+-------------------------------------------+-------------------------------------------+



### Shifting days according to specified values

Instead of shifting days according to ID column, we can specify shifting values with another column.

```python
documentHasher = legal.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")\
```


In [139]:
data = pd.DataFrame(
    {'DocumentID' : ['A001', 'A001', 'A002', 'A002'],
     'text' : ['Chris Brown was arrested on 10/02/2019', 
               'Mark White has bought a stock on 02/28/2020', 
               'John has bought a house on 03/15/2022',
               'John Moore was discharged on 12/31/2022'
                            ],
     'dateshift' : ['5', '5', '10', '10']
    }
)


my_input_df = spark.createDataFrame(data)

my_input_df.show(truncate = False)

+----------+-------------------------------------------+---------+
|DocumentID|text                                       |dateshift|
+----------+-------------------------------------------+---------+
|A001      |Chris Brown was arrested on 10/02/2019     |5        |
|A001      |Mark White has bought a stock on 02/28/2020|5        |
|A002      |John has bought a house on 03/15/2022      |10       |
|A002      |John Moore was discharged on 12/31/2022    |10       |
+----------+-------------------------------------------+---------+



In [140]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = legal.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")\


# sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
#     .setInputCols(["document2"])\
#     .setOutputCol("sentence")
#     #.setCustomBounds(["\n\n"])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document2", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained('legner_deid', "en", "legal/models")\
    .setInputCols(["document2", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk") # "ALIAS" are secondary names of companies, so let's extract them also as PARTY

obfuscation = legal.DeIdentification()\
    .setInputCols(["document2", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')

pipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      documentHasher,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      obfuscation])

empty_data = spark.createDataFrame([["", "", ""]]).toDF("text", "DocumentID", "dateshift")

pipeline_model = pipeline.fit(empty_data)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_deid download started this may take some time.
[OK!]


In [141]:
output = pipeline_model.transform(my_input_df)

output.select('text', 'dateshift', 'deidentified.result').show(truncate = False)

+-------------------------------------------+---------+-------------------------------------------+
|text                                       |dateshift|result                                     |
+-------------------------------------------+---------+-------------------------------------------+
|Chris Brown was arrested on 10/02/2019     |5        |[<PERSON> was arrested on 10/07/2019]      |
|Mark White has bought a stock on 02/28/2020|5        |[<PERSON> has bought a stock on 03/04/2020]|
|John has bought a house on 03/15/2022      |10       |[<PERSON> has bought a house on 03/25/2022]|
|John Moore was discharged on 12/31/2022    |10       |[<PERSON> was discharged on 01/10/2023]    |
+-------------------------------------------+---------+-------------------------------------------+



### Masking Unnormalized Date Formats

`setUnnormalizedDateMode()` parameter is used to mask the DATE entities that can not be normalized. In the example below, please check `03Apr2022` which couldn't be normalized and it is masked in the output.

In [142]:
data = pd.DataFrame(
    {'DocumentID' : ['A001', 'A001', 'A002', 'A002'],
     'text' : ['Chris Brown was arrested on 10/02/2022', 
               'Mark White has bought a stock on 02/28/2020', 
               'John has bought a house on 03Apr2022',
               'John Moore was discharged on 12/31/2022'
                            ],
     'dateshift' : ['5', '5', '10', '10']
    }
)

my_input_df = spark.createDataFrame(data)


documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = legal.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")\


# sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
#     .setInputCols(["document2"])\
#     .setOutputCol("sentence")
#     #.setCustomBounds(["\n\n"])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document2", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained('legner_deid', "en", "legal/models")\
    .setInputCols(["document2", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk") # "ALIAS" are secondary names of companies, so let's extract them also as PARTY

obfuscation = legal.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')\
    .setUnnormalizedDateMode("mask")

pipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      documentHasher,
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      obfuscation])


output = pipeline.fit(my_input_df).transform(my_input_df)

output.select('text', 'dateshift', 'deidentified.result').show(truncate = False)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]
legner_deid download started this may take some time.
[OK!]
+-------------------------------------------+---------+-------------------------------------------+
|text                                       |dateshift|result                                     |
+-------------------------------------------+---------+-------------------------------------------+
|Chris Brown was arrested on 10/02/2022     |5        |[<PERSON> was arrested on 11/01/2022]      |
|Mark White has bought a stock on 02/28/2020|5        |[<PERSON> has bought a stock on 04/18/2020]|
|John has bought a house on 03Apr2022       |10       |[<PERSON> has bought a house on <DATE>]    |
|John Moore was discharged on 12/31/2022    |10       |[<PERSON> was discharged on 01/02/2023]    |
+-------------------------------------------+---------+-------------------------------------------+



# Structured Deidentification

In [143]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/hipaa-table-001.txt

df = spark.read.format("csv") \
    .option("sep", "\t") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load("hipaa-table-001.txt")

df.show(truncate=False)

+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|NAME           |DOB       |AGE|ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+---------------+----------+---+----------------------------------------------------+-------+--------------+---+---+
|Cecilia Chapman|04/02/1935|83 |711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|Iris Watson    |03/10/2009|9  |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|Bryar Pitts    |11/01/1921|98 |5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|Theodore Lowe  |13/02/2002|16 |Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|Calista Wise   |20/08/1942|76 |7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|Kyla Olsen     |12/05/1973|45 |Ap #651-8679 Sodales Av. Tamunin

In [144]:
obfuscator = legal.StructuredDeidentification(spark,{"NAME":"PATIENT","AGE":"AGE"}, obfuscateRefSource = "faker")
obfuscator_df = obfuscator.obfuscateColumns(df)
obfuscator_df.show(truncate=False)

+-------------------+----------+----+----------------------------------------------------+-------+--------------+---+---+
|NAME               |DOB       |AGE |ADDRESS                                             |ZIPCODE|TEL           |SBP|DBP|
+-------------------+----------+----+----------------------------------------------------+-------+--------------+---+---+
|[Cornelio Paula]   |04/02/1935|[60]|711-2880 Nulla St. Mankato Mississippi              |69200  |(257) 563-7401|101|42 |
|[Azalia Badder]    |03/10/2009|[9] |P.O. Box 283 8562 Fusce Rd. Frederick Nebraska      |20620  |(372) 587-2335|159|122|
|[Susanne Reeves]   |11/01/1921|[60]|5543 Aliquet St. Fort Dodge GA                      |20783  |(717) 450-4729|149|52 |
|[Calton Prayer]    |13/02/2002|[12]|Ap #867-859 Sit Rd. Azusa New York                  |39531  |(793) 151-6230|134|115|
|[Janyce Sams]      |20/08/1942|[60]|7292 Dictum Av. San Antonio MI                      |47096  |(492) 709-6392|139|78 |
|[Wynell Acron]     |12/

In [145]:
obfuscator_unique_ref_test = '''Will Perry#CLIENT
John Smith#CLIENT
Marvin MARSHALL#CLIENT
Hubert GROGAN#CLIENT
ALTHEA COLBURN#CLIENT
Kalil AMIN#CLIENT
Inci FOUNTAIN#CLIENT
Jackson WILLE#CLIENT
Jack SANTOS#CLIENT
Mahmood ALBURN#CLIENT
Marnie MELINGTON#CLIENT
Aysha GHAZI#CLIENT
Maryland CODER#CLIENT
Darene GEORGIOUS#CLIENT
Shelly WELLBECK#CLIENT
Min Kun JAE#CLIENT
Thomson THOMAS#CLIENT
Christian SUDDINBURG#CLIENT
Aberdeen#CITY
Louisburg St#STREET
France#LOC
5552312#PHONE
Calle del Libertador#ADDRESS
111#ID
20#AGE
30#AGE
40#AGE
50#AGE
60#AGE
'''

with open('obfuscator_unique_ref_test.txt', 'w') as f:
  f.write(obfuscator_unique_ref_test)

In [146]:
# obfuscateRefSource = "file"

obfuscator = legal.StructuredDeidentification(spark,{"NAME":"CLIENT","AGE":"AGE"}, 
                                        obfuscateRefFile = "/content/obfuscator_unique_ref_test.txt",
                                        obfuscateRefSource = "file",
                                        columnsSeed={"NAME": 23, "AGE": 23})
obfuscator_df = obfuscator.obfuscateColumns(df)
obfuscator_df.select("NAME","AGE").show(truncate=False)

+------------------+----+
|NAME              |AGE |
+------------------+----+
|[Inci FOUNTAIN]   |[60]|
|[Jack SANTOS]     |[30]|
|[Darene GEORGIOUS]|[30]|
|[Shelly WELLBECK] |[40]|
|[Hubert GROGAN]   |[40]|
|[Kalil AMIN]      |[40]|
|[ALTHEA COLBURN]  |[60]|
|[Thomson THOMAS]  |[60]|
|[Jack SANTOS]     |[60]|
|[Will Perry]      |[20]|
|[Jackson WILLE]   |[60]|
|[Shelly WELLBECK] |[40]|
|[Kalil AMIN]      |[30]|
|[Marnie MELINGTON]|[30]|
|[Min Kun JAE]     |[30]|
|[Marvin MARSHALL] |[60]|
|[Marvin MARSHALL] |[50]|
|[Min Kun JAE]     |[30]|
|[Maryland CODER]  |[20]|
|[Marnie MELINGTON]|[20]|
+------------------+----+
only showing top 20 rows



We can **shift n days** in the structured deidentification through "days" parameter when the column is a Date.

In [147]:
df = spark.createDataFrame([
            ["Juan García", "13/02/1977", "711 Nulla St.", "140", "673 431234"],
            ["Will Smith", "23/02/1977", "1 Green Avenue.", "140", "+23 (673) 431234"],
            ["Pedro Ximénez", "11/04/1900", "Calle del Libertador, 7", "100", "912 345623"]
        ]).toDF("NAME", "DOB", "ADDRESS", "SBP", "TEL")
df.show(truncate=False)

+-------------+----------+-----------------------+---+----------------+
|NAME         |DOB       |ADDRESS                |SBP|TEL             |
+-------------+----------+-----------------------+---+----------------+
|Juan García  |13/02/1977|711 Nulla St.          |140|673 431234      |
|Will Smith   |23/02/1977|1 Green Avenue.        |140|+23 (673) 431234|
|Pedro Ximénez|11/04/1900|Calle del Libertador, 7|100|912 345623      |
+-------------+----------+-----------------------+---+----------------+



In [148]:
obfuscator = legal.StructuredDeidentification(spark=spark, 
                                        columns={"NAME": "ID", "DOB": "DATE"},
                                        columnsSeed={"NAME": 23, "DOB": 23},
                                        obfuscateRefSource="faker",
                                        days=5
                                         )

In [149]:
result = obfuscator.obfuscateColumns(df)
result.show(truncate=False)

+----------+------------+-----------------------+---+----------------+
|NAME      |DOB         |ADDRESS                |SBP|TEL             |
+----------+------------+-----------------------+---+----------------+
|[N2649912]|[18/02/1977]|711 Nulla St.          |140|673 431234      |
|[W466004] |[28/02/1977]|1 Green Avenue.        |140|+23 (673) 431234|
|[M403810] |[16/04/1900]|Calle del Libertador, 7|100|912 345623      |
+----------+------------+-----------------------+---+----------------+



# Save the Pipeline and Use it from Your Local

In [150]:
model.write().overwrite().save('pipeline_deid')

In [151]:
# from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = nlp.PretrainedPipeline.from_disk("pipeline_deid")

In [152]:
data = spark.createDataFrame([[text]]).toDF("text")

In [153]:
deid_pipeline.model.stages

[DocumentAssembler_ff926bd4d493,
 SentenceDetectorDLModel_8aaebf7e098e,
 REGEX_TOKENIZER_31d25d1aa149,
 ROBERTA_EMBEDDINGS_b915dff90901,
 LegalNerModel_5eb62585382d,
 NER_CONVERTER_0151907ce8bc,
 MedicalNerModel_2b2f0f671f99,
 NerConverter_dd77843f9f26,
 ZeroShotRobertaNer_5d06c0297d21,
 NER_CONVERTER_f9b2bd114a3a,
 CONTEXTUAL-PARSER_8aafa17f4437,
 CONTEXTUAL-PARSER_f93f50c17347,
 CONTEXTUAL-PARSER_9a2f7c8293db,
 MERGE_60c5e3188ff5]

In [154]:
deid_pipeline.model.transform(data).show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|                ner2|          ner_chunk2|       zero_shot_ner|      zero_ner_chunk|               alias|               email|               phone|   merged_ner_chunks|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|This Commercial L...|[{document, 0, 76...|[{docu

# Pretrained Deidentification Pipeline

We have this pipeline can be used to deidentify legal information from texts.The legal information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate `DOC`, `EFFDATE`, `PARTY`, `ALIAS`, `SIGNING_PERSON`, `SIGNING_TITLE`, `COUNTRY`, `CITY`, `STATE`, `STREET`, `ZIP`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `DATE`,`PHONE` entities.

In [155]:
# from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = nlp.PretrainedPipeline("legpipe_deid", "en", "legal/models")

legpipe_deid download started this may take some time.
Approx size to download 921.1 MB
[OK!]


In [156]:
deid_pipeline.model.stages

[DocumentAssembler_c7b58c78b248,
 SentenceDetectorDLModel_8aaebf7e098e,
 REGEX_TOKENIZER_3e1c0446436a,
 ROBERTA_EMBEDDINGS_b915dff90901,
 LegalNerModel_5eb62585382d,
 NER_CONVERTER_a6dff5e8e458,
 MedicalNerModel_2b2f0f671f99,
 NerConverter_df07d611fba8,
 ZeroShotRobertaNer_5d06c0297d21,
 NER_CONVERTER_970306963c3c,
 CONTEXTUAL-PARSER_171c7e78aa02,
 CONTEXTUAL-PARSER_657771204acf,
 CONTEXTUAL-PARSER_562b3f75cffd,
 MERGE_8da2fd9d23ef,
 DE-IDENTIFICATION_3741f6dfecff,
 DE-IDENTIFICATION_ea822920be64,
 DE-IDENTIFICATION_637063fd264b,
 DE-IDENTIFICATION_7b20a724c6cd]

In [157]:
text= """CARGILL, INCORPORATED

By:     Pirkko Suominen



Name: Pirkko Suominen Title: Director, Bio Technology Development  Center,  Date:   10/19/2011

BIOAMBER, SAS

By:     Jean-François Huc



Name: Jean-François Huc  Title: President Date:   October 15, 2011

email : jeanfran@gmail.com
phone : 18087339090 """

In [158]:
deid_res= deid_pipeline.annotate(text)

In [159]:
deid_res.keys()

dict_keys(['ner_chunk2', 'obfuscated', 'zero_shot_ner', 'email', 'document', 'ner_chunk', 'zero_ner_chunk', 'deidentified', 'alias', 'masked_fixed_length_chars', 'token', 'ner2', 'ner', 'embeddings', 'merged_ner_chunks', 'sentence', 'phone', 'masked_with_chars'])

In [160]:
import pandas as pd

pd.set_option("display.max_colwidth", 100)

df= pd.DataFrame(list(zip(deid_res["sentence"], 
                          deid_res["deidentified"],
                          deid_res["masked_with_chars"],
                          deid_res["masked_fixed_length_chars"], 
                          deid_res["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,"CARGILL, INCORPORATED",<PARTY>,[*******************],****,Cunningham-Hendrix
1,By: Pirkko Suominen,By: <SIGNING_PERSON>,By: [*************],By: ****,By: Stacy Bradley
2,"Name: Pirkko Suominen Title: Director, Bio Technology Development Center, Date: 10/19/2011","Name: <SIGNING_PERSON> Title: <SIGNING_PERSON>, <SIGNING_PERSON>, Date: <DATE>","Name: [*************] Title: [******], [********************************], Date: [********]","Name: **** Title: ****, ****, Date: ****","Name: Stacy Bradley Title: Lori Thompson, Stacy Bradley, Date: <DATE>"
3,"BIOAMBER, SAS","<PARTY>, <PARTY>","[******], [*]","****, ****","Thomas, Miller and Kelly, Flowers-Frazier"
4,By: Jean-François Huc,By: <SIGNING_PERSON>,By: [***************],By: ****,By: John Curry
5,"Name: Jean-François Huc Title: President Date: October 15, 2011\n\nemail : jeanfran@gmail.com...",Name: <SIGNING_PERSON> Title: <SIGNING_TITLE> Date: <EFFDATE> 2011\n\nemail : <EMAIL>\nphone ...,Name: [***************] Title: [*******] Date: [*********] 2011\n\nemail : [****************]...,Name: **** Title: **** Date: **** 2011\n\nemail : ****\nphone : ****,Name: John Curry Title: Chief Legal Officer Date: 08-03-2007 2011\n\nemail : Bertell@hotmail....
