![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/11.1.Pretrained_Deidentification_Pipeline.ipynb)

# Financial Deidentification

# Installation

In [None]:
! pip install -q johnsnowlabs

## Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, finance

# nlp.install(force_browser=True)

## Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to install the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# Starting

In [3]:
spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.3.0, 💊Spark-Healthcare==4.3.0, running on ⚡ PySpark==3.1.2


# Pretrained Deidentification Pipeline

This pretrained pipeline deidentifies financial information in text. Detected entities are either masked (replaced with a label or with mask characters) or obfuscated (replaced with realistic surrogate values) in the resulting text. The pipeline can mask and obfuscate `DOC`, `EFFDATE`, `PARTY`, `ALIAS`, `SIGNING_PERSON`, `SIGNING_TITLE`, `COUNTRY`, `CITY`, `STATE`, `STREET`, `ZIP`, `EMAIL`, `FAX`, `LOCATION-OTHER`, `DATE`, `PHONE` entities.

In [4]:
deid_pipeline = nlp.PretrainedPipeline("finpipe_deid", "en", "finance/models")


finpipe_deid download started this may take some time.
Approx size to download 437.3 MB
[OK!]


In [5]:
deid_pipeline.model.stages

[DocumentAssembler_20aaea0b09c9,
 SentenceDetector_f836f3c49dd7,
 REGEX_TOKENIZER_3d88a1dee1d9,
 BERT_EMBEDDINGS_29ce72cd673e,
 FinanceNerModel_1e04a0ea86dc,
 NER_CONVERTER_053dc2c885dc,
 FinanceNerModel_99ecfbac41c1,
 NER_CONVERTER_c31e7133c116,
 FinanceNerModel_fae1a65403a6,
 NER_CONVERTER_e54c4e5afd15,
 CONTEXTUAL-PARSER_72fff5ea72a3,
 CONTEXTUAL-PARSER_247b3d47153a,
 CONTEXTUAL-PARSER_8804c3848e07,
 CONTEXTUAL-PARSER_138e93ac7638,
 CONTEXTUAL-PARSER_222a1bc3dc39,
 MERGE_72dccb34a947,
 DE-IDENTIFICATION_95319986720c,
 DE-IDENTIFICATION_e98c1ba6424c,
 DE-IDENTIFICATION_b423b4e6a14e,
 DE-IDENTIFICATION_d6ea024c8838]
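
To read the stage list above at a glance, the stage identifiers can be grouped by their prefix. The sketch below is plain Python over the names copied from the output above; it does not call the pipeline, and `stage_kind` is a hypothetical helper introduced here only for illustration.

```python
from collections import Counter

# Stage identifiers copied verbatim from the output above.
stage_ids = [
    "DocumentAssembler_20aaea0b09c9",
    "SentenceDetector_f836f3c49dd7",
    "REGEX_TOKENIZER_3d88a1dee1d9",
    "BERT_EMBEDDINGS_29ce72cd673e",
    "FinanceNerModel_1e04a0ea86dc",
    "NER_CONVERTER_053dc2c885dc",
    "FinanceNerModel_99ecfbac41c1",
    "NER_CONVERTER_c31e7133c116",
    "FinanceNerModel_fae1a65403a6",
    "NER_CONVERTER_e54c4e5afd15",
    "CONTEXTUAL-PARSER_72fff5ea72a3",
    "CONTEXTUAL-PARSER_247b3d47153a",
    "CONTEXTUAL-PARSER_8804c3848e07",
    "CONTEXTUAL-PARSER_138e93ac7638",
    "CONTEXTUAL-PARSER_222a1bc3dc39",
    "MERGE_72dccb34a947",
    "DE-IDENTIFICATION_95319986720c",
    "DE-IDENTIFICATION_d6ea024c8838".replace("d6ea024c8838", "e98c1ba6424c"),
    "DE-IDENTIFICATION_b423b4e6a14e",
    "DE-IDENTIFICATION_d6ea024c8838",
]


def stage_kind(stage_id: str) -> str:
    """Hypothetical helper: drop the trailing unique suffix after the last underscore."""
    return stage_id.rsplit("_", 1)[0]


# Count how many stages share each prefix, preserving first-seen order.
kind_counts = Counter(stage_kind(s) for s in stage_ids)

for kind, count in kind_counts.items():
    print(f"{kind}: {count}")
```

Grouped this way, the list shows three NER models with their converters, five contextual parsers, one merge step, and four de-identification stages. The four de-identification stages plausibly correspond to the four text outputs the pipeline returns (`deidentified`, `masked_with_chars`, `masked_fixed_length_chars`, `obfuscated`), though the stage list alone does not confirm that mapping.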

In [6]:
text= """ REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF
Commvault Systems, Inc.  
(Exact name of registrant as specified in its charter) 
Signed By : Sherly Johnson
(Address of principal executive offices, including zip code) 
(732) 870-4000
(telephone number, including area code) 
Name of each exchange on which registered
CVLT
The NASDAQ Stock Market
"""

In [7]:
deid_res= deid_pipeline.annotate(text)

In [8]:
deid_res.keys()

dict_keys(['obfuscated', 'ner_10k_chunk', 'email', 'document', 'ner_signers_chunk', 'deidentified', 'alias', 'chiefs', 'masked_fixed_length_chars', 'token', 'ner_signers', 'ner_generic_chunk', 'embeddings', 'merged_ner_chunks', 'ner_10k', 'sentence', 'phone', 'orgs', 'masked_with_chars', 'ner_generic'])
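
The DataFrame cell below zips several of these outputs together, and Python's `zip` silently truncates to the shortest list. A minimal sketch of a guard follows, assuming each selected key maps to a list with one entry per sentence (the results further down suggest this, but the notebook does not state it). `rows_from_annotation` is a hypothetical helper, and `sample_res` is a hand-written stand-in for `deid_res`, not real pipeline output.

```python
def rows_from_annotation(result: dict, keys: list) -> list:
    """Return one tuple per sentence, failing loudly on missing or uneven columns."""
    missing = [k for k in keys if k not in result]
    if missing:
        raise KeyError(f"missing keys: {missing}")

    # zip() would silently drop trailing items, so compare lengths first.
    lengths = {k: len(result[k]) for k in keys}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"uneven column lengths: {lengths}")

    return list(zip(*(result[k] for k in keys)))


# Hand-written stand-in with the same shape as deid_res (NOT real output).
sample_res = {
    "sentence": ["Signed By : Jane Doe", "Call (555) 010-0000"],
    "deidentified": ["Signed By : <PERSON>", "Call <PHONE>"],
    "obfuscated": ["Signed By : Alex Roe", "Call (555) 010-9999"],
}

rows = rows_from_annotation(sample_res, ["sentence", "deidentified", "obfuscated"])
for row in rows:
    print(row)
```

The returned rows can be passed to `pd.DataFrame(rows, columns=[...])` in place of the bare `zip`, so a length mismatch surfaces as an error instead of a shorter table.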

In [9]:
import pandas as pd

pd.set_option("display.max_colwidth", 100)

df= pd.DataFrame(list(zip(deid_res["sentence"], 
                          deid_res["deidentified"],
                          deid_res["masked_with_chars"],
                          deid_res["masked_fixed_length_chars"], 
                          deid_res["obfuscated"])),
                 columns= ["Sentence", "Masked", "Masked with Chars", "Masked with Fixed Chars", "Obfuscated"])

df

Unnamed: 0,Sentence,Masked,Masked with Chars,Masked with Fixed Chars,Obfuscated
0,REPORT PURSUANT TO SECTION 13 OR 15,REPORT PURSUANT TO SECTION 13 OR 15,REPORT PURSUANT TO SECTION 13 OR 15,REPORT PURSUANT TO SECTION 13 OR 15,REPORT PURSUANT TO SECTION 13 OR 15
1,"(d) OF THE SECURITIES EXCHANGE ACT OF\nCommvault Systems, Inc.",(d) OF <ORG>.,(d) OF [***************************************************].,(d) OF ****.,(d) OF Gillespie Inc.
2,(Exact name of registrant as specified in its charter) \nSigned By : Sherly Johnson\n(Address of...,(Exact name of registrant as specified in its charter) \nSigned By : <PERSON>\n(Address of princ...,(Exact name of registrant as specified in its charter) \nSigned By : [************]\n(Address of...,(Exact name of registrant as specified in its charter) \nSigned By : ****\n(Address of principal...,(Exact name of registrant as specified in its charter) \nSigned By : Ashley Patrick\n(Address of...
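
The table above shows three masking styles alongside obfuscation. The sketch below reproduces the three masking styles on a single entity span in plain Python, purely to make the difference concrete. It is not the pipeline's implementation: `mask_span` and its `mode` names are hypothetical, and the bracketed same-length format is inferred from the sample output, where the replacement for `Sherly Johnson` appears to keep the length of the original span. Obfuscation is left out because it depends on a source of surrogate values.

```python
def mask_span(text: str, start: int, end: int, label: str, mode: str) -> str:
    """Hypothetical helper: replace text[start:end] using one illustrative masking mode."""
    span = text[start:end]
    if mode == "entity_label":
        # Replace the span with its entity label, e.g. <PERSON>.
        replacement = f"<{label}>"
    elif mode == "same_length_chars":
        # Keep the original length; the brackets count toward that length.
        if len(span) < 2:
            replacement = "*" * len(span)
        else:
            replacement = "[" + "*" * (len(span) - 2) + "]"
    elif mode == "fixed_length_chars":
        # Always four asterisks, hiding the original length as well.
        replacement = "****"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return text[:start] + replacement + text[end:]


sentence = "Signed By : Sherly Johnson"
name = "Sherly Johnson"
start = sentence.index(name)
end = start + len(name)

results = {
    mode: mask_span(sentence, start, end, "PERSON", mode)
    for mode in ("entity_label", "same_length_chars", "fixed_length_chars")
}

for mode, masked in results.items():
    print(f"{mode}: {masked}")
```

The same-length style preserves character offsets in the document, which can matter if downstream annotations are position-based, while the fixed-length style avoids revealing how long the original value was.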
